
Building a Claude Code Knowledge Base from Web Docs

A reproducible workflow for turning any documentation site into a local Markdown knowledge base that Claude Code, Cursor, and other coding agents can index.

M. H. Tawfik · 11 min read

If you use Claude Code, Cursor, Aider, or any other AI coding agent seriously, you have probably had the experience of asking it to use library X correctly and watching it confidently produce code that calls a function that does not exist. The cause is rarely the model — modern coding agents reason well — it is the context. The agent does not have library X's documentation in its index, so it pattern-matches against whatever it remembers from training, and the training cutoff is often months behind the library's current API.

This post is the workflow we use to fix that for any library we work with seriously: enumerate the docs site, bulk-convert it to clean Markdown, drop the result into a known location in the project, and let the coding agent index it like the rest of the codebase. The agent stops guessing because the actual docs are sitting one retrieval-hop away. If you want the deeper why behind feeding agents Markdown specifically, the agent context primer covers what each agent actually does with the file.

Why local docs beat live URL fetches

It is tempting to think that the right way to keep coding agents current is to let them browse the docs live. Claude Code's built-in web fetch tool and Cursor's @Docs feature both support this. In practice, live fetching works for spot-checks and falls apart for sustained work, for three reasons.

The first is latency. A live fetch adds 800 ms to 3 seconds per query, and a coding agent might need to consult docs ten times in a single multi-step task. That is 8 to 30 seconds of cumulative waiting per task, enough to derail flow. A local Markdown index returns in tens of milliseconds.

The second is rate limiting. Documentation sites are increasingly aggressive about throttling repeated fetches from the same IP, and your coding agent does not distinguish between "I have already read this page three times this session" and "I need to fetch it again." Local files have no rate limit.

The third, and most consequential, is chunk selection. When an agent fetches a live URL it gets one chunk back — usually the most relevant section as decided by the fetcher's extractor. A local Markdown corpus lets the agent's own retriever pick the best chunk from the entire site, which means it can pull together information from three different pages to answer a single question. That cross-page synthesis is where local docs produce dramatically better answers than live fetches.

The reason coding agents feel "smarter" on a familiar codebase than on an unfamiliar library is that the codebase is local and the library docs are not.

The five-step workflow

The pipeline we use, and the one we recommend, has five steps. The whole flow takes about ten minutes for a typical mid-sized library; large frameworks take twenty.

Step 1 — Enumerate the docs URLs

Start with the docs site's sitemap, which is almost always at https://example.com/docs/sitemap.xml or https://example.com/sitemap.xml. Open the file in a browser, copy the URLs, and filter them down to the docs paths you actually care about. For a typical library, this is the /docs/, /api/, and /guides/ subtrees; you can skip blog posts, marketing pages, and changelogs unless your work depends on them.
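For sites with many pages, copying URLs by hand gets tedious, so the filtering can be scripted. The following is a minimal sketch, assuming a standard sitemaps.org `urlset` file; the function name `docs_urls`, the prefix list, and the inlined sample sitemap are illustrative assumptions, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed docs subtrees; adjust to the site being exported.
KEEP_PREFIXES = ("/docs/", "/api/", "/guides/")


def docs_urls(sitemap_xml: str, keep_prefixes=KEEP_PREFIXES) -> list[str]:
    """Return the sitemap's <loc> URLs whose path starts with a kept prefix."""
    root = ET.fromstring(sitemap_xml)
    prefixes = tuple(keep_prefixes)  # str.startswith needs a tuple, not a list
    urls = set()
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if urlparse(url).path.startswith(prefixes):
            urls.add(url)
    return sorted(urls)


# Hypothetical sitemap content, inlined so the sketch runs offline.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/getting-started</loc></url>
  <url><loc>https://example.com/api/client</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in docs_urls(SAMPLE):
        print(url)
```

Some sites publish a sitemap index that points at several child sitemaps; in that case each `<loc>` is itself a sitemap to fetch and pass through the same function.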

For libraries that publish their docs on a JavaScript-heavy framework without a sitemap, the fallback is to crawl the docs landing page and follow internal links. Most docs sites have a consistent navigation pattern that yields the full URL list within one or two levels.
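The link-following fallback can be sketched with the standard library alone. This example assumes the landing page's HTML has already been downloaded and that its navigation links are present in that HTML (a page that builds its navigation entirely in client-side JavaScript would need a rendered copy instead). `LinkCollector`, `internal_links`, and the sample markup are hypothetical names and data for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url: str, html: str, path_prefix: str = "/docs/") -> list[str]:
    """Resolve links against page_url and keep same-host links under path_prefix."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    found = set()
    for href in parser.hrefs:
        # Drop #fragments so one page is not listed once per anchor.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.netloc == host and parts.path.startswith(path_prefix):
            found.add(absolute)
    return sorted(found)


# Hypothetical landing-page markup, inlined so the sketch runs offline.
SAMPLE_HTML = """
<nav>
  <a href="/docs/install">Install</a>
  <a href="hooks#usage">Hooks</a>
  <a href="https://example.com/docs/install#cli">CLI</a>
  <a href="https://other.example.org/docs/x">External</a>
  <a href="/pricing">Pricing</a>
</nav>
"""

if __name__ == "__main__":
    print(internal_links("https://example.com/docs/", SAMPLE_HTML))
```

Running the same function on each newly discovered page, one or two levels deep, approximates the crawl described above.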

Step 2 — Bulk-convert to Markdown in the browser

Open the bulk-conversion dashboard in BulkMD, paste the URL list, and let it run. The defaults — Article mode, three-tab concurrency, 500ms between page starts — are tuned for documentation sites and will produce one clean Markdown file per page. A 400-page library typically completes in three to six minutes. The output is a ZIP of .md files, one per page, with the URL preserved as a citation block at the top of each.

We cover the architecture that makes this reliable in the Manifest V3 service-worker post; the relevant point for this workflow is that you can step away, fix lunch, and come back to a complete corpus.

Step 3 — Drop into a known project location

Unpack the ZIP into a known location your coding agent will recognize. The layout that works across the major agents:

your-project/
├── src/
├── package.json
└── docs/
    └── .ai/
        ├── library-name/
        │   ├── getting-started.md
        │   ├── api-reference.md
        │   └── ...
        └── another-library/
            └── ...

The docs/.ai/ directory is deliberately under the radar: humans rarely browse it, but every coding agent we have tested indexes everything under docs/ by default. The per-library subdirectory keeps file names readable in retrieval results.
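Unpacking into that layout is easy to script. The sketch below assumes the export is a ZIP of .md files, as described in Step 2; the function name `unpack_corpus` and its arguments are hypothetical. It flattens any folders inside the archive, so two pages with the same file name would overwrite each other.

```python
import zipfile
from pathlib import Path


def unpack_corpus(zip_path: Path, project_root: Path, library: str) -> list[Path]:
    """Extract the .md files from an export ZIP into docs/.ai/<library>/."""
    target = project_root / "docs" / ".ai" / library
    target.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Keep only the file name: this flattens folders inside the ZIP
            # and avoids writing outside the target directory.
            name = Path(info.filename).name
            if info.is_dir() or not name.endswith(".md"):
                continue
            destination = target / name
            destination.write_bytes(archive.read(info))
            written.append(destination)
    return sorted(written)


# Example call (paths are placeholders):
# unpack_corpus(Path("export.zip"), Path("your-project"), "library-name")
```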

If you do not want to commit the Markdown corpus to source control, add docs/.ai/ to .gitignore and treat the conversion as a developer-machine step. Some agents skip gitignored paths when indexing, so confirm the corpus is still retrievable after you ignore it. We commit it for shared projects and gitignore it for personal projects; neither is wrong, but consistency within a team matters.

Step 4 — Trigger the index

Coding agents detect new files differently. Claude Code picks up the files on the next conversation; Cursor indexes within a few seconds via its background watcher. To force a full re-index when you want immediate effect, use the agent's own resync action. The exact command or menu entry differs between agents and versions (in Cursor it lives in the codebase indexing settings), so check the current documentation rather than relying on a remembered command name.

For larger corpora the first indexing pass takes ten to thirty seconds; subsequent retrieval is near-instant. You will know the index is working when asking the agent a docs-specific question produces an answer that quotes a section header or file name from your corpus.

Step 5 — Refresh on a schedule

The single most common failure mode of a local docs corpus is staleness. Six months after you build it, the library has released two major versions, your Markdown is wrong, and your agent is confidently producing code that no longer compiles. The fix is a refresh schedule — anywhere from weekly for fast-moving libraries to quarterly for stable ones.

A two-line shell script that re-runs the bulk export against the stored URL list, replaces the contents of docs/.ai/library-name/, and commits the diff is enough. The diff itself is a useful artifact — you can review it in your usual code-review flow to see what changed in the upstream docs since last refresh.
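Before a refresh overwrites the old corpus, it can help to summarize what changed so the commit message or review has a starting point. This is a minimal sketch, assuming the fresh export has been unpacked into a separate directory first; `fingerprint` and `summarize_refresh` are hypothetical helper names, and the approach (comparing content hashes) is one option alongside reading the Git diff directly.

```python
import hashlib
from pathlib import Path


def fingerprint(directory: Path) -> dict[str, str]:
    """Map each Markdown file's relative path to a SHA-256 of its content."""
    return {
        path.relative_to(directory).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*.md"))
    }


def summarize_refresh(old_dir: Path, new_dir: Path) -> dict[str, list[str]]:
    """List the pages added, removed, or changed between two corpus snapshots."""
    old, new = fingerprint(old_dir), fingerprint(new_dir)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }


# Example call (paths are placeholders):
# summarize_refresh(Path("docs/.ai/library-name"), Path("/tmp/fresh-export"))
```

A long "removed" list is often the first sign that the upstream site restructured its URLs and the stored URL list needs regenerating.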

How big is the corpus, really?

A common worry is that committing documentation to a project repo will balloon the repo size. The empirical answer, from a sample of ten libraries we use regularly:

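| Library | Pages exported | Markdown size | Avg per page |
| --- | --- | --- | --- |
| React (react.dev) | 187 | 4.2 MB | 23 KB |
| Next.js | 412 | 9.1 MB | 22 KB |
| TanStack Query | 76 | 1.6 MB | 21 KB |
| Tailwind CSS | 195 | 3.8 MB | 19 KB |
| Prisma | 284 | 6.7 MB | 24 KB |
| FastAPI | 158 | 3.1 MB | 20 KB |
| Stripe API | 612 | 14.8 MB | 24 KB |
| AWS S3 | 488 | 11.2 MB | 23 KB |
| Postgres docs | 947 | 22.4 MB | 24 KB |
| Anthropic API | 91 | 1.9 MB | 21 KB |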

The median is around 22 KB of clean Markdown per documentation page, which gives a rough rule of thumb: every hundred pages of docs adds roughly 2 MB to your repo. For most projects this is negligible alongside lockfiles and image assets, and the full size is paid only once: later refreshes add small text deltas, which Git compresses well.
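The rule of thumb can be checked directly from the table above. The sketch below copies the "Avg per page" column and assumes 1 MB = 1000 KB; `estimated_mb` is a hypothetical helper for sizing a corpus before exporting it.

```python
from statistics import median

# Average KB per page, copied from the table above.
AVG_KB_PER_PAGE = {
    "React (react.dev)": 23,
    "Next.js": 22,
    "TanStack Query": 21,
    "Tailwind CSS": 19,
    "Prisma": 24,
    "FastAPI": 20,
    "Stripe API": 24,
    "AWS S3": 23,
    "Postgres docs": 24,
    "Anthropic API": 21,
}

median_kb = median(AVG_KB_PER_PAGE.values())


def estimated_mb(pages: int, kb_per_page: float = median_kb) -> float:
    """Estimate corpus size in MB, assuming 1 MB = 1000 KB."""
    return pages * kb_per_page / 1000


if __name__ == "__main__":
    print(f"median: {median_kb} KB per page")
    print(f"100 pages: about {estimated_mb(100):.2f} MB")
```

The median of the ten values is 22.5 KB, so a hundred pages comes to about 2.25 MB, consistent with the "roughly 2 MB" figure.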

The Postgres docs are an outlier worth noting: at 947 pages and 22 MB, they push the boundary of what is reasonable to commit. For corpora that large, we recommend the gitignore approach plus a make docs target that any developer can run to rebuild the local copy.

What agents do differently when given a local corpus

After running this workflow on roughly forty libraries over the past year, the pattern of behavior change is consistent enough to be worth naming.

Coding agents stop hallucinating function signatures. When the docs are local, the agent's retriever surfaces the correct signature, and the agent quotes it rather than guessing. We have seen the hallucination rate on library-specific function calls drop from roughly fifteen percent to under two percent on a fixed evaluation set after introducing local docs.

Agents start citing specific pages by name. Where an answer used to say "you can use the useEffect hook," a docs-indexed answer says "according to react/hooks-effect.md, the useEffect hook…" with a deep-link to the section. The citation is itself useful — it tells you exactly which part of the docs the agent is reasoning from, which makes it easy to spot when the docs themselves are wrong or out of date.

Agents handle version-specific questions correctly. If the corpus you committed is for v5, the agent answers for v5; if you refresh to v6, the answers shift on the next conversation. Without a local corpus, the agent answers for whatever version its training data emphasized, which is almost never the version you care about.

Cross-library questions improve sharply. Asking "how do I wire TanStack Query to a Next.js Server Component" produces a far better answer when both libraries' docs are local than when either is missing. The agent's retriever pulls relevant chunks from both corpora and the model synthesizes them; without local docs, you get half an answer.

TL;DR

Coding agents are limited by what they can retrieve. The cheapest, fastest way to make Claude Code, Cursor, or Aider materially better at a library you work with seriously is to give them a local Markdown copy of that library's docs. Enumerate the URLs, bulk-convert in the browser, drop the result under docs/.ai/, let the agent index it, and refresh on a schedule.

If you want the conversion step to take three minutes instead of three hours, BulkMD is the free Chrome extension that runs the bulk conversion entirely in your browser — no API key, no server, no rate limit beyond what the docs site enforces.

Frequently asked questions

Won't a large docs corpus pollute my coding agent with irrelevant context?

Modern coding agents use retrieval rather than dumping all of `docs/.ai/` into every prompt. The agent embeds the corpus once, then retrieves only the chunks relevant to the current task. Adding more docs improves recall on relevant questions without diluting answers to other questions.
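A toy model makes the point concrete. The sketch below is not how any particular agent works: it swaps embeddings for plain word overlap and splits files at Markdown headings, purely to show that chunks unrelated to the question are never selected. All names (`chunks_by_heading`, `retrieve`) and the sample corpus are hypothetical.

```python
import re
from collections import Counter


def chunks_by_heading(filename: str, markdown: str) -> list[tuple[str, str]]:
    """Split a Markdown file into (label, text) chunks at heading lines."""
    chunks, label, lines = [], filename, []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if lines:
                chunks.append((label, "\n".join(lines)))
            label, lines = f"{filename} > {line.lstrip('# ').strip()}", []
        lines.append(line)
    if lines:
        chunks.append((label, "\n".join(lines)))
    return chunks


def tokens(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[tuple[str, str]], top_k: int = 2) -> list[str]:
    """Return labels of the top_k chunks sharing at least one word with the query."""
    query_tokens = tokens(query)
    scored = [
        (sum((query_tokens & tokens(text)).values()), label) for label, text in chunks
    ]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [label for score, label in ranked if score > 0][:top_k]


# Hypothetical two-library corpus.
CORPUS = chunks_by_heading(
    "react/hooks.md",
    "# useEffect\nRuns side effects after render.\n# useState\nHolds component state.",
) + chunks_by_heading(
    "next/server.md",
    "# Server Components\nFetch data on the server.",
)

if __name__ == "__main__":
    print(retrieve("component state", CORPUS))
```

In this toy corpus, a question about component state selects only the useState chunk; the Next.js page sits in the corpus without ever entering the prompt.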

Why commit the Markdown instead of letting the agent fetch live?

Latency, rate limiting, and chunk selection — covered in detail above. The short version: a committed corpus lets the agent's own retriever pick the best chunk from the entire docs site, which produces better answers than a one-shot live fetch can match.

What if the library publishes its own llms.txt or AI-ready docs?

Use those instead — they will be cleaner than anything you can extract. As of 2026, llms.txt adoption among major libraries is still under thirty percent, so for most libraries you are still on the extract-it-yourself path. Check for `/llms.txt` and `/docs/llms-full.txt` before running the bulk export.
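The pre-flight check can be scripted. This is a minimal sketch using only the standard library: it builds the two candidate URLs named above from any page on the site and probes them with a HEAD request. The function names are hypothetical, and some servers reject HEAD requests, so a negative result is worth confirming in a browser.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

# The two locations named above.
CANDIDATE_PATHS = ("/llms.txt", "/docs/llms-full.txt")


def llms_candidates(site_url: str) -> list[str]:
    """Build the candidate URLs from any URL on the docs site."""
    return [urljoin(site_url, path) for path in CANDIDATE_PATHS]


def exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers a HEAD request with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


if __name__ == "__main__":
    # example.com is a placeholder; substitute the docs site being checked.
    for candidate in llms_candidates("https://example.com/docs/intro"):
        print(candidate, "found" if exists(candidate) else "not found")
```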

Does this work for Aider, Cline, and other coding agents besides Claude Code and Cursor?

Yes. The workflow is agent-agnostic. Any agent that indexes the project directory will pick up `docs/.ai/`; the differences across agents are in retrieval tuning and trigger syntax, not in what content they accept. We have validated the workflow on Claude Code, Cursor, Aider, Cline, and Continue.dev.

How do I keep the corpus from going stale?

Tie the refresh to a cadence you will actually keep: weekly for fast-moving libraries (Next.js, Tailwind), quarterly for stable ones (Postgres, Stripe). Exposing the two-line shell script through `package.json` as `npm run refresh-docs` removes the friction; the diff is a useful artifact for tracking upstream changes.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team by email; replies typically arrive within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.
