If you use Claude Code, Cursor, Aider, or any other AI coding agent seriously, you have probably asked it to use library X and watched it confidently produce code that calls a function that does not exist. The cause is rarely the model, since modern coding agents reason well. It is the context. The agent does not have library X's documentation in its index, so it pattern-matches against whatever it remembers from training, and the training cutoff is often months behind the library's current API.
This post is the workflow we use to fix that for any library we work with seriously: enumerate the docs site, bulk-convert it to clean Markdown, drop the result into a known location in the project, and let the coding agent index it like the rest of the codebase. The agent stops guessing because the actual docs are sitting one retrieval-hop away. If you want the deeper why behind feeding agents Markdown specifically, the agent context primer covers what each agent actually does with the file.
Why local docs beat live URL fetches
It is tempting to think that the right answer for keeping coding agents current is to let them browse the docs live. Claude Code's @web and Cursor's @docs URL commands both support this. In practice, live fetching works for spot-checks and falls apart for sustained work, for three reasons.
The first is latency. A live fetch adds 800 ms to 3 seconds per query, and a coding agent might consult docs ten times in a single multi-step task, which comes to 8 to 30 seconds of waiting. That cumulative wait derails flow. A local Markdown index returns in tens of milliseconds.
The second is rate limiting. Documentation sites are increasingly aggressive about throttling repeated fetches from the same IP, and your coding agent does not distinguish between "I have already read this page three times this session" and "I need to fetch it again." Local files have no rate limit.
The third, and most consequential, is chunk selection. When an agent fetches a live URL it gets one chunk back — usually the most relevant section as decided by the fetcher's extractor. A local Markdown corpus lets the agent's own retriever pick the best chunk from the entire site, which means it can pull together information from three different pages to answer a single question. That cross-page synthesis is where local docs produce dramatically better answers than live fetches.
The reason coding agents feel "smarter" on a familiar codebase than on an unfamiliar library is that the codebase is local and the library docs are not.
The five-step workflow
The pipeline we use, and the one we recommend, has five steps. The whole flow takes about ten minutes for a typical mid-sized library; large frameworks take twenty.
Step 1 — Enumerate the docs URLs
Start with the docs site's sitemap, which is almost always at https://example.com/docs/sitemap.xml or https://example.com/sitemap.xml. Open the file in a browser, copy the URLs, and filter them down to the docs paths you actually care about. For a typical library, this is the /docs/, /api/, and /guides/ subtrees; you can skip blog posts, marketing pages, and changelogs unless your work depends on them.
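The filtering step can also be scripted instead of done by hand. The following is a minimal sketch, assuming the sitemap uses the standard sitemaps.org namespace and has already been saved as text; the function name and the sample URLs are illustrative, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample; in practice this is the XML saved from the docs site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/getting-started</loc></url>
  <url><loc>https://example.com/api/client</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>"""


def filter_sitemap_urls(sitemap_xml: str, keep_prefixes: tuple[str, ...]) -> list[str]:
    """Return the sitemap URLs whose path starts with one of keep_prefixes."""
    root = ET.fromstring(sitemap_xml)
    urls = set()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        # str.startswith accepts a tuple, so one call checks every prefix.
        if url and urlparse(url).path.startswith(keep_prefixes):
            urls.add(url)
    return sorted(urls)


docs_urls = filter_sitemap_urls(SAMPLE_SITEMAP, ("/docs/", "/api/", "/guides/"))
```

Here `docs_urls` keeps the /docs/ and /api/ entries and drops the blog post. If the site publishes a sitemap index rather than a single sitemap, the `loc` entries point at further sitemap files, so the same function would need to be run on each of those.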
For libraries that publish their docs on a JavaScript-heavy framework without a sitemap, the fallback is to crawl the docs landing page and follow internal links. Most docs sites have a consistent navigation pattern that yields the full URL list within one or two levels.
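One level of that fallback crawl can be sketched as a link extractor. This is an illustration under assumptions: it takes the page's HTML as a string (for JavaScript-heavy sites that means the rendered HTML, which you would have to obtain separately), and the `/docs/` prefix and function name are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_doc_links(html: str, page_url: str, path_prefix: str = "/docs/") -> list[str]:
    """Return same-host links under path_prefix, absolute and without fragments."""
    parser = _LinkCollector()
    parser.feed(html)
    site = urlparse(page_url).netloc
    found = set()
    for href in parser.hrefs:
        # Resolve relative links, then drop "#section" fragments.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if (
            parts.scheme in ("http", "https")
            and parts.netloc == site
            and parts.path.startswith(path_prefix)
        ):
            found.add(absolute)
    return sorted(found)
```

Running this on the landing page, and then once more on each URL it returns, covers the one or two levels of navigation described above.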
Step 2 — Bulk-convert to Markdown in the browser
Open the bulk-conversion dashboard in BulkMD, paste the URL list, and let it run. The defaults (Article mode, three-tab concurrency, 500 ms between page starts) are tuned for documentation sites and will produce one clean Markdown file per page. A 400-page library typically completes in three to six minutes. The output is a ZIP of .md files, one per page, with the URL preserved as a citation block at the top of each.
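The three-to-six-minute figure is consistent with a back-of-the-envelope model of those defaults. The sketch below is a simplified model, not BulkMD's actual scheduler, and the per-page load times are assumed values chosen for illustration.

```python
def estimated_minutes(
    pages: int,
    start_interval_s: float = 0.5,   # default gap between page starts
    concurrency: int = 3,            # default number of tabs
    seconds_per_page: float = 1.5,   # assumed load-and-convert time per page
) -> float:
    """Rough run time: throughput is capped by the slower of two limits."""
    # Limit 1: a new page cannot start more often than every start_interval_s.
    # Limit 2: with N tabs, one page finishes every seconds_per_page / N on average.
    per_page_s = max(start_interval_s, seconds_per_page / concurrency)
    return pages * per_page_s / 60


fast = estimated_minutes(400)                        # start interval is the bottleneck
slow = estimated_minutes(400, seconds_per_page=2.7)  # page load time is the bottleneck
```

With pages that take about 1.5 seconds each, 400 pages come out a little over three minutes; at an assumed 2.7 seconds per page the same run takes about six.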
We cover the architecture that makes this reliable in the Manifest V3 service-worker post; the relevant point for this workflow is that you can step away, fix lunch, and come back to a complete corpus.
Step 3 — Drop into a known project location
Unpack the ZIP into a known location your coding agent will recognize. The conventions that work across the major agents:
your-project/
├── src/
├── package.json
└── docs/
    └── .ai/
        ├── library-name/
        │   ├── getting-started.md
        │   ├── api-reference.md
        │   └── ...
        └── another-library/
            └── ...
The docs/.ai/ directory is deliberately under the radar: humans rarely browse it, but every coding agent we have tested indexes everything under docs/ by default. The per-library subdirectory keeps file names readable in retrieval results.
If you do not want to commit the Markdown corpus to source control, add docs/.ai/ to .gitignore and treat the conversion as a developer-machine step. We commit it for shared projects, gitignore it for personal projects — neither is wrong, but consistency within a team matters.
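Unpacking can be scripted so that every developer ends up with the same layout. This is a minimal sketch, assuming the export is a ZIP of .md files as described in Step 2; the function name is hypothetical, and flattening to base names is a design choice made here, not something the export requires.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def install_corpus(zip_path: Path, project_root: Path, library: str) -> list[Path]:
    """Replace docs/.ai/<library>/ with the Markdown files from the export ZIP."""
    target = project_root / "docs" / ".ai" / library
    if target.exists():
        shutil.rmtree(target)  # start clean so deleted upstream pages disappear
    target.mkdir(parents=True)

    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # ZIP member names always use "/" as the separator.
            name = PurePosixPath(info.filename).name
            if info.is_dir() or not name.endswith(".md"):
                continue
            # Writing by base name only means an entry cannot escape target.
            destination = target / name
            destination.write_bytes(archive.read(info))
            written.append(destination)
    return sorted(written)
```

A call such as `install_corpus(Path("export.zip"), Path("."), "library-name")` produces the tree shown above. Because the sketch flattens directories, two pages with the same file name in different folders would overwrite each other; keep the relative paths instead if your export has such collisions.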
Step 4 — Trigger the index
Coding agents detect new files differently. Claude Code indexes on the next conversation; Cursor indexes within a few seconds via its background watcher. To force a full re-index when you want immediate effect, the conventional commands are /reindex in Claude Code and the Command Palette → "Codebase: Reindex" in Cursor.
For larger corpora the first indexing pass takes ten to thirty seconds; subsequent retrieval is near-instant. You will know the index is working when asking the agent a docs-specific question produces an answer that quotes a section header or file name from your corpus.
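That check can be made mechanical by pasting an answer into a small script. The sketch below is an illustration only: it looks for corpus file names and Markdown headings inside the answer text, and the eight-character minimum for headings is an arbitrary threshold chosen here to skip generic headings such as "Usage".

```python
import re
from pathlib import Path

# ATX-style Markdown headings: one to six "#" characters, then the title.
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def cited_sources(answer: str, corpus_dir: Path) -> list[str]:
    """List corpus file names and headings that appear verbatim in the answer."""
    answer_lower = answer.lower()
    hits = set()
    for md_file in corpus_dir.rglob("*.md"):
        if md_file.name.lower() in answer_lower:
            hits.add(md_file.name)
        for heading in HEADING.findall(md_file.read_text(encoding="utf-8")):
            # Skip very short headings, which match by accident too easily.
            if len(heading) >= 8 and heading.lower() in answer_lower:
                hits.add(f"{md_file.name}: {heading}")
    return sorted(hits)
```

An empty result does not prove the index is broken, since the agent may paraphrase, but a non-empty one is good evidence that the answer came from your corpus.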
Step 5 — Refresh on a schedule
The single most common failure mode of a local docs corpus is staleness. Six months after you build it, the library has released two major versions, your Markdown is wrong, and your agent is confidently producing code that no longer compiles. The fix is a refresh schedule — anywhere from weekly for fast-moving libraries to quarterly for stable ones.
A two-line shell script that re-runs the bulk export against the stored URL list, replaces the contents of docs/.ai/library-name/, and commits the diff is enough. The diff itself is a useful artifact — you can review it in your usual code-review flow to see what changed in the upstream docs since last refresh.
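When the corpus is gitignored there is no commit diff to review, so a small comparison of the old and new exports can stand in for it. This is a sketch under assumptions: both exports are unpacked into separate directories, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def _digests(root: Path) -> dict[str, str]:
    """Map each Markdown file's relative path to a SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*.md")
    }


def corpus_changes(old_dir: Path, new_dir: Path) -> dict[str, list[str]]:
    """Summarize which pages were added, removed, or changed between exports."""
    old, new = _digests(old_dir), _digests(new_dir)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

A long "removed" list after a refresh is usually the signal that upstream reorganized its docs and the stored URL list needs to be re-enumerated.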
How big is the corpus, really?
A common worry is that committing documentation to a project repo will balloon the repo size. The empirical answer, from a sample of ten libraries we use regularly:
| Library | Pages exported | Markdown size | Avg per page |
|---|---|---|---|
| React (react.dev) | 187 | 4.2 MB | 23 KB |
| Next.js | 412 | 9.1 MB | 22 KB |
| TanStack Query | 76 | 1.6 MB | 21 KB |
| Tailwind CSS | 195 | 3.8 MB | 19 KB |
| Prisma | 284 | 6.7 MB | 24 KB |
| FastAPI | 158 | 3.1 MB | 20 KB |
| Stripe API | 612 | 14.8 MB | 24 KB |
| AWS S3 | 488 | 11.2 MB | 23 KB |
| Postgres docs | 947 | 22.4 MB | 24 KB |
| Anthropic API | 91 | 1.9 MB | 21 KB |
The median is around 22 KB of clean Markdown per documentation page, which gives a rough rule of thumb: every hundred pages of docs adds roughly 2 MB to your repo. For most projects this is negligible alongside lockfiles and image assets. Most of that cost is paid once, in the initial commit, because Git stores later text changes compactly.
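The rule of thumb can be recomputed from the table. The sketch below assumes 1 MB = 1000 KB; because the table's sizes are rounded to one decimal place, the per-page values it derives can differ from the "Avg per page" column by about a kilobyte.

```python
from statistics import median

# (pages exported, Markdown size in MB), copied from the table above.
SAMPLE = {
    "React (react.dev)": (187, 4.2),
    "Next.js": (412, 9.1),
    "TanStack Query": (76, 1.6),
    "Tailwind CSS": (195, 3.8),
    "Prisma": (284, 6.7),
    "FastAPI": (158, 3.1),
    "Stripe API": (612, 14.8),
    "AWS S3": (488, 11.2),
    "Postgres docs": (947, 22.4),
    "Anthropic API": (91, 1.9),
}


def kb_per_page(pages: int, size_mb: float) -> float:
    """Average Markdown size per page, assuming 1 MB = 1000 KB."""
    return size_mb * 1000 / pages


def estimated_mb(pages: int, kb_each: float) -> float:
    """Projected corpus size for a docs site with the given page count."""
    return pages * kb_each / 1000


median_kb = median(kb_per_page(pages, mb) for pages, mb in SAMPLE.values())
per_hundred_pages_mb = estimated_mb(100, median_kb)
```

Calling `estimated_mb` with the page count from Step 1 gives a size estimate for a new library before you run the export.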
The Postgres docs are an outlier worth noting: at 947 pages and 22 MB, they push the boundary of what is reasonable to commit. For corpora that large, we recommend the gitignore approach plus a make docs target that any developer can run to rebuild the local copy.
What agents do differently when given a local corpus
After running this workflow on roughly forty libraries over the past year, the pattern of behavior change is consistent enough to be worth naming.
Coding agents stop hallucinating function signatures. When the docs are local, the agent's retriever surfaces the correct signature, and the agent quotes it rather than guessing. We have seen the hallucination rate on library-specific function calls drop from roughly fifteen percent to under two percent on a fixed evaluation set after introducing local docs.
Agents start citing specific pages by name. Where an answer used to say "you can use the useEffect hook," a docs-indexed answer says "according to react/hooks-effect.md, the useEffect hook…" with a deep-link to the section. The citation is itself useful — it tells you exactly which part of the docs the agent is reasoning from, which makes it easy to spot when the docs themselves are wrong or out of date.
Agents handle version-specific questions correctly. If the corpus you committed is for v5, the agent answers for v5; if you refresh to v6, the answers shift on the next conversation. Without a local corpus, the agent answers for whatever version its training data emphasized, which is almost never the version you care about.
Cross-library questions improve sharply. Asking "how do I wire TanStack Query to a Next.js Server Component" produces a far better answer when both libraries' docs are local than when either is missing. The agent's retriever pulls relevant chunks from both corpora and the model synthesizes them; without local docs, you get half an answer.
TL;DR
Coding agents are limited by what they can retrieve. The cheapest, fastest way to make Claude Code, Cursor, or Aider materially better at a library you work with seriously is to give them a local Markdown copy of that library's docs. Enumerate the URLs, bulk-convert in the browser, drop the result under docs/.ai/, let the agent index it, and refresh on a schedule.
If you want the conversion step to take three minutes instead of three hours, BulkMD is the free Chrome extension that runs the bulk conversion entirely in your browser — no API key, no server, no rate limit beyond what the docs site enforces.
Frequently asked questions
Won't a large docs corpus pollute my coding agent with irrelevant context?
Why commit the Markdown instead of letting the agent fetch live?
What if the library publishes its own llms.txt or AI-ready docs?
Does this work for Aider, Cline, and other coding agents besides Claude Code and Cursor?
How do I keep the corpus from going stale?
About the author
Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.