arXiv provider

The arXiv provider mounts the arXiv API under /arxiv. Each paper is a directory keyed by its arXiv ID. You get the PDF, the LaTeX source archive, structured metadata, and version history, all readable with standard file tools and no SDK.

/omnifs/arxiv/

The provider requires no credentials. All reads go through the arXiv public API.

ls /omnifs/arxiv/papers/{id}
paper.pdf source.tar.gz metadata.json links.json versions/

{id} is the arXiv paper identifier in any of its standard forms: 1706.03762, 2301.00001, or cs.LG/0510009 for older papers.
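
The two identifier families can be told apart by shape alone. The following is a minimal sketch of such a check; `arxiv_id_kind` is a hypothetical helper for scripts, not something omnifs provides, and the patterns are deliberately loose.

```shell
# arxiv_id_kind ID
# Hypothetical helper: classify an arXiv identifier by its shape.
arxiv_id_kind() {
  case "$1" in
    # New-style: YYMM.NNNN (2007 to 2014) or YYMM.NNNNN (2015 onward).
    [0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9]|[0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9])
      echo new ;;
    # Old-style: archive, optional subject class, slash, seven digits.
    */[0-9][0-9][0-9][0-9][0-9][0-9][0-9])
      echo old ;;
    *)
      echo unknown ;;
  esac
}
```

Calling `arxiv_id_kind cs.LG/0510009` prints `old`, and `arxiv_id_kind 1706.03762` prints `new`. A check like this can reject malformed input before a script touches the mount.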

| Path | Description |
| --- | --- |
| paper.pdf | The compiled PDF, current version |
| source.tar.gz | LaTeX source archive, current version |
| metadata.json | Title, authors, abstract, categories, submission date, DOI, and version list |
| links.json | Related links: DOI, journal ref, and HTML abstract page |
| versions/ | Directory with one entry per revision (v1, v2, …) |

ls /omnifs/arxiv/papers/{id}/versions/v{n}

Each version directory exposes paper.pdf, source.tar.gz, and metadata.json, the same leaves as in the top-level paper directory but pinned to that specific revision. This is useful for comparing an early preprint against its published revision.
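
Because revisions are plain directories named v1, v2, and so on, finding the newest one is a matter of comparing the numeric suffixes. The sketch below assumes exactly that naming; `latest_version` is a hypothetical helper, not part of omnifs. It compares numerically, so v10 sorts after v9.

```shell
# latest_version PAPER_DIR
# Hypothetical helper: print the highest-numbered entry under PAPER_DIR/versions.
# The body runs in a subshell so its variables do not leak into the caller.
latest_version() (
  max=0
  for d in "$1"/versions/v*; do
    [ -d "$d" ] || continue
    n=${d##*/v}                                     # strip everything up to the final "/v"
    case "$n" in ''|*[!0-9]*) continue ;; esac      # skip non-numeric suffixes
    if [ "$n" -gt "$max" ]; then max=$n; fi
  done
  if [ "$max" -gt 0 ]; then echo "v$max"; fi
)
```

For example, `latest_version /omnifs/arxiv/papers/1706.03762` would print the name of the newest revision directory for that paper, and prints nothing if no version directories are found.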

ls /omnifs/arxiv/categories/{cat}/new
ls /omnifs/arxiv/categories/{cat}/{YYYY}/{MM}/{DD}

{cat} is any arXiv category identifier: cs.LG, quant-ph, math.CO, and so on. The new directory lists papers from the most recent announcement batch. Date directories list papers announced on that day.

Each entry in a category listing is itself an ID directory with the same per-paper structure above.
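
Scripts that walk a range of days need to turn a date into the matching directory. A minimal sketch follows, assuming the {MM} and {DD} components are zero-padded to two digits (the layout above suggests this but does not state it). `category_day_path` is a hypothetical helper.

```shell
# category_day_path CATEGORY YYYY-MM-DD
# Hypothetical helper: print the category listing path for one announcement day.
# Assumes zero-padded month and day components.
category_day_path() (
  category=$1
  date=$2
  case "$date" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "expected YYYY-MM-DD, got: $date" >&2; exit 1 ;;
  esac
  IFS=-            # split the date on hyphens; local to this subshell
  set -- $date
  printf '/omnifs/arxiv/categories/%s/%s/%s/%s\n' "$category" "$1" "$2" "$3"
)
```

With this in place, `ls "$(category_day_path cs.LG 2024-03-05)"` lists the cs.LG papers announced on that day.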

ls /omnifs/arxiv/search/{query}

{query} is a URL-encoded search string passed to the arXiv search API. Results appear as subdirectories named by arXiv ID. Example: ls /omnifs/arxiv/search/transformer+attention.
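
Queries that contain spaces, quotes, or slashes have to be encoded before they can serve as a single path component. The sketch below is one way to do that in plain shell, following the `+`-for-space convention of the example above. `encode_query` is a hypothetical helper, and it targets ASCII input only; behavior on non-ASCII bytes varies between shells.

```shell
# encode_query STRING
# Hypothetical helper: form-style URL encoding for ASCII queries.
# Spaces become "+", unreserved characters pass through, the rest become %XX.
encode_query() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case "$c" in
      [A-Za-z0-9._~-]) out=$out$c ;;
      ' ')             out=$out+ ;;
      *)               out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)
```

For example, `ls "/omnifs/arxiv/search/$(encode_query 'transformer attention')"` resolves to the same path as the example above.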

The arXiv provider requires no configuration. The config block in the provider manifest is empty: no tokens, no credentials.

Read the title of “Attention Is All You Need” (arXiv 1706.03762):

jq .title /omnifs/arxiv/papers/1706.03762/metadata.json
"Attention Is All You Need"

List all files for that paper:

ls /omnifs/arxiv/papers/1706.03762
paper.pdf source.tar.gz metadata.json links.json versions/

Pull the abstract into a variable:

abstract=$(jq -r .abstract /omnifs/arxiv/papers/1706.03762/metadata.json)

Print the ID and title of each paper in a category feed:

for d in /omnifs/arxiv/categories/cs.LG/new/*/; do
  id=$(basename "$d")
  # Pass the ID in with --arg instead of splicing it into the jq program.
  jq -r --arg id "$id" '$id + ": " + .title' "${d}metadata.json"
done

Compare metadata between v1 and v3 of a paper:

diff \
  <(jq . /omnifs/arxiv/papers/2301.00001/versions/v1/metadata.json) \
  <(jq . /omnifs/arxiv/papers/2301.00001/versions/v3/metadata.json)

Download the LaTeX source of a specific version:

cp /omnifs/arxiv/papers/1706.03762/versions/v5/source.tar.gz ~/downloads/
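
Once the archive is copied, a common next step is to unpack it and locate the main LaTeX file. The sketch below assumes the archive is a gzipped tarball, as the file name suggests, and treats any .tex file containing `\documentclass` as a main file. `unpack_source` is a hypothetical helper, not part of omnifs.

```shell
# unpack_source ARCHIVE DEST
# Hypothetical helper: extract ARCHIVE into DEST, then print the path of every
# .tex file that contains \documentclass (assumed to mark a main file).
unpack_source() (
  archive=$1
  dest=$2
  mkdir -p "$dest" || exit
  tar -xzf "$archive" -C "$dest" || exit
  # -F treats the pattern as a fixed string, so the backslash is literal.
  find "$dest" -type f -name '*.tex' -exec grep -lF '\documentclass' {} +
)
```

For example, `unpack_source ~/downloads/source.tar.gz ~/downloads/attention-src` extracts the copied archive and prints the candidate main files.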

Search and print titles:

ls /omnifs/arxiv/search/attention+mechanism | while IFS= read -r id; do
  jq -r .title "/omnifs/arxiv/papers/$id/metadata.json"
done

The paper.pdf and source.tar.gz leaves can be large. The omnifs host caches fetched content in a capacity-bounded cache invalidated by upstream events, so repeated reads of the same paper version do not re-fetch from arXiv.

Older papers using the pre-2007 identifier format (cs.LG/0510009) work as-is in the path. The slash in those IDs is part of the arXiv standard; omnifs encodes it transparently.

The arXiv API terms of use limit clients to one request every three seconds. The API has no authenticated tier, so the limit applies to every client. The host-level cache absorbs most repeated access patterns within a session.
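
Bulk scripts that touch many uncached papers can pace themselves on the client side. The sketch below is one way to do that; whether omnifs also throttles upstream requests internally is not stated here, so treat this as a conservative assumption. `each_throttled` is a hypothetical helper.

```shell
# each_throttled DELAY CMD ARG...
# Hypothetical helper: run CMD once per ARG, sleeping DELAY seconds between
# consecutive runs. Stops at the first failing command.
each_throttled() (
  delay=$1
  cmd=$2
  shift 2
  first=1
  for arg in "$@"; do
    if [ "$first" -eq 0 ]; then sleep "$delay"; fi
    first=0
    "$cmd" "$arg" || exit
  done
)

# Example: list two papers with a three-second gap between the reads.
# each_throttled 3 ls /omnifs/arxiv/papers/1706.03762 /omnifs/arxiv/papers/2301.00001
```

Reads served from the host cache do not need the delay, so this pacing only matters for the first pass over a set of papers.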