You have a folder of pages you clipped over the last month, and now an AI agent needs to read them. The gap between "twenty Markdown files on disk" and "a web corpus an agent can ingest" is mostly metadata and structure, and it is small enough to close in one pass. This post specifies a portable bundle format — a folder of Markdown, a human-readable index, and a manifest.json carrying file, title, URL, and token counts — that drops directly into a RAG pipeline or an agent's retrieval layer. BulkMD produces this shape as an "agent bundle" ZIP, but the format is plain files, so nothing here locks you to one tool.

The post walks through the directory layout, the manifest schema, where to place chunk boundaries, and how all of this relates to MCP and agent retrieval. It also draws a line you should keep clear: an extension exports a bundle; it does not serve one. Knowing which side of that line you are on saves a lot of confusion about why your files are not "an MCP server."

What an agent actually needs from a corpus

An agent retrieving over your documents has to answer three operational questions before it can reason: which documents exist, how big each one is, and where each one came from. A loose folder of .md files answers none of them without a directory scan, a tokenizer pass, and a guess about provenance. The job of packaging is to precompute those answers and ship them alongside the content.

Concretely, a usable corpus needs:

Content in a format the model parses well. Markdown is the right default — it preserves headings, lists, tables, and code fences as structure the model recognizes, at roughly 60–80% fewer tokens than the original HTML. We compared the alternatives in Markdown vs JSON vs plain text for LLM context; for prose-heavy web pages, Markdown holds up well on both fidelity and cost.
An index a human can skim to confirm the corpus contains what they think it does.
A manifest a program can parse without heuristics: stable IDs, titles, source URLs, and token counts.

The token count is the part people skip, and it is the part that matters most for agents. A retrieval layer that knows each document is, say, 1,800 tokens can plan how many it can fit before it hits the budget. Without that number, it either reads everything and overflows, or reads one document at a time and burns round-trips. We go deep on the planning side in context window budgeting for RAG in 2026; the corpus is where that budget data should originate.

The bundle layout

Here is the directory structure. It is deliberately flat and boring — boring is portable.

web-corpus/
├── manifest.json          # machine-readable inventory
├── index.md               # human-readable table of contents
└── docs/
    ├── 0001-rate-limiting-strategies.md
    ├── 0002-postgres-connection-pooling.md
    ├── 0003-idempotency-keys-in-apis.md
    └── ...

Three rules make this layout work as a contract rather than a suggestion.

First, file names are stable IDs. The 0001- prefix gives a deterministic sort and a short handle the manifest and index both reference. Once a file has an ID, never renumber it; if a page is removed, retire the ID rather than reusing it, so any cached citation that points at 0007 still resolves.
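
One way to honor the never-reuse rule is to allocate from a high-water mark instead of the first free slot. This is a minimal sketch, assuming you keep a record of every ID ever issued, retired ones included. The `next_doc_id` name and the four-digit width are illustrative choices, not part of any BulkMD API.

```python
def next_doc_id(issued_ids: set[str], width: int = 4) -> str:
    """Return the next unused ID, one past the highest ID ever issued.

    `issued_ids` must include retired IDs so they are never handed out again.
    """
    highest = max((int(i) for i in issued_ids), default=0)
    return str(highest + 1).zfill(width)
```

Because the function looks only at the maximum, a gap left by a retired 0007 stays a gap.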

Second, content lives only under docs/. The manifest and index sit at the root and point inward. A consumer can ingest docs/ directly, or read manifest.json first and pull files on demand. Both paths work because the content is isolated.

Third, every document carries its own front matter. Even though the manifest duplicates some of it, a single .md file should be self-describing when it travels alone — pasted into a chat, dropped into Obsidian, attached to an issue. We cover the front-matter conventions in Obsidian-friendly front matter for web clippings; the same title, source, and captured keys serve the agent corpus.

A document looks like this:

---
title: "Idempotency keys in HTTP APIs"
source: "https://example.com/blog/idempotency-keys"
captured: "2026-05-28"
tokens: 1840
---

# Idempotency keys in HTTP APIs

Answer-first summary of what idempotency keys solve...

## How the server stores keys

...
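
Reading that front matter back does not require a full YAML parser as long as the keys stay flat. The sketch below assumes one `key: value` pair per line with optional double quotes, as in the example above. `split_front_matter` is a hypothetical helper name, and nested or multi-line YAML values would need a real YAML library.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (front-matter dict, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter present
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter reached
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # unterminated front matter: treat everything as body
```

All values come back as strings, so a consumer that needs `tokens` as a number has to convert it with `int()`.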

The manifest.json schema

The manifest is the load-bearing file. Keep it small, flat, and explicit. Here is a minimal example for a three-document corpus, with only one of the three entries shown:

{
  "schema": "bulkmd.corpus/v1",
  "generated": "2026-06-02T14:30:00Z",
  "tokenizer": "o200k_base",
  "documentCount": 3,
  "totalTokens": 5210,
  "documents": [
    {
      "id": "0003",
      "file": "docs/0003-idempotency-keys-in-apis.md",
      "title": "Idempotency keys in HTTP APIs",
      "url": "https://example.com/blog/idempotency-keys",
      "captured": "2026-05-28",
      "tokens": 1840,
      "headings": [
        "How the server stores keys",
        "Choosing a key namespace",
        "Expiring stale keys"
      ]
    }
  ]
}

Each field earns its place:

schema and generated let a consumer detect format version and staleness without parsing the body.
tokenizer names the encoding that produced the tokens values. This matters: cl100k_base powers GPT-3.5 and GPT-4, while o200k_base powers GPT-4o and the o-series. They are close for English prose but not identical, and Claude uses its own tokenizer entirely. State the basis so the consumer can apply a correction factor instead of trusting a number of unknown origin. We explain the counting itself in estimating LLM token cost in the browser.
file is the relative path under the bundle root — relative, never absolute, so the bundle relocates cleanly.
url is the provenance anchor. It is what makes a citation verifiable and a re-fetch possible when the page changes.
tokens is the per-document count that drives context budgeting.
headings is the optional but valuable list of H2/H3 boundaries, which doubles as a chunking map (next section).

Resist the urge to nest. A flat array of flat objects is trivially parseable in any language and easy to diff in version control. If you later need richer metadata — embeddings, scores, tags — add sibling files (embeddings.parquet) rather than bloating the manifest. The manifest's job is inventory, not storage.
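
Because the manifest is flat, a consumer can sanity-check it in a few lines before trusting it. The sketch below checks a handful of the properties described above: matching counts, unique IDs, and relative paths. The function name and the exact set of checks are illustrative assumptions, not a published bulkmd.corpus/v1 validator.

```python
from pathlib import PurePosixPath


def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    docs = manifest.get("documents", [])
    if manifest.get("documentCount") != len(docs):
        problems.append("documentCount does not match the number of entries")
    if manifest.get("totalTokens") != sum(d.get("tokens", 0) for d in docs):
        problems.append("totalTokens does not match the sum of per-document tokens")
    seen = set()
    for d in docs:
        doc_id = d.get("id")
        if doc_id in seen:
            problems.append(f"duplicate id: {doc_id}")
        seen.add(doc_id)
        path = PurePosixPath(d.get("file", ""))
        # Paths must be relative and must stay inside the bundle root.
        if path.is_absolute() or ".." in path.parts:
            problems.append(f"{doc_id}: file path must be relative to the bundle root")
    return problems
```

An abbreviated manifest like the one above, which declares three documents but lists one, would be flagged by the count check; that is the intended behavior for a real bundle.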

Where to place chunk boundaries

Chunking is the step every RAG tutorial reinvents, usually with a fixed character window that slices mid-sentence. A packaged corpus can do better by marking boundaries once, at export time, on the structure the document already has: its headings.

The heuristic is to treat each H2 section as a candidate chunk, and split further on H3 only when a section exceeds a target size. Because the Markdown is already clean, the heading hierarchy is a reliable semantic skeleton — each ## introduces a topic the author chose to separate. Chunking on those boundaries keeps related sentences together and gives every chunk a natural title (its heading) you can prepend for retrieval context.

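A practical splitter, in roughly twenty lines:

import re
from typing import Callable

def chunk_markdown(
    text: str,
    max_tokens: int = 512,
    count: Callable[[str], int] = lambda s: len(s) // 4,  # rough 4-chars-per-token estimate
) -> list[str]:
    """Split on H2/H3 headings, then pack adjacent sections up to max_tokens."""
    # The capture group makes re.split return [preamble, heading, body, heading, body, ...].
    parts = re.split(r"(?m)^(#{2,3}\s.*)$", text)
    sections, buf = [], parts[0]
    for i in range(1, len(parts), 2):
        heading, body = parts[i], parts[i + 1]
        block = f"{heading}\n{body}".strip()
        if buf.strip() and count(buf) + count(block) > max_tokens:
            # Adding this section would exceed the budget: close the current chunk.
            sections.append(buf.strip())
            buf = block
        else:
            buf = f"{buf}\n\n{block}"
    if buf.strip():
        sections.append(buf.strip())
    # A single section larger than max_tokens is emitted whole, not split further.
    return sections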

The count parameter defaults to the rough four-characters-per-token approximation, but you should pass a real tokenizer in production. The key design point is not the splitter itself — it is where it runs. Two options:

1. Pre-chunk at packaging time and ship a chunks/ directory plus chunk records in the manifest. Best when many consumers share one corpus and you want identical boundaries everywhere.
2. Ship boundary hints (the headings array) and let each consumer chunk on ingest. Best when consumers have different window sizes: an agent with a 200K context chunks coarser than one feeding a 512-token embedding model.

For a general-purpose bundle, option 2 is the better default: the headings array is small, and it lets the consumer chunk to its own budget without re-parsing the body to find boundaries. AI search engines reinforce this granularity — they cite roughly 200–500-token passages, not whole pages, so designing chunks in that range aligns your corpus with how retrieval and citation actually work. The same answer-first, fact-dense structure we recommend in GEO for developer docs makes individual chunks more citable in isolation.
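
To make option 2 concrete, here is a minimal sketch of a consumer using the manifest's headings array as a chunking map. It assumes each listed heading appears verbatim as an H2 or H3 line in the body. `split_on_hints` is a hypothetical helper, not part of any BulkMD tooling.

```python
def split_on_hints(body: str, headings: list[str]) -> list[tuple[str, str]]:
    """Return (title, text) chunks, cutting the body at each hinted heading line."""
    wanted = set(headings)
    chunks: list[tuple[str, str]] = []
    title, buf = "", []  # text before the first hinted heading gets an empty title
    for line in body.splitlines():
        stripped = line.lstrip("#").strip()
        is_hint = line.startswith(("## ", "### ")) and stripped in wanted
        if is_hint:
            # Close the chunk collected so far, unless it is only whitespace.
            if any(ln.strip() for ln in buf):
                chunks.append((title, "\n".join(buf).strip()))
            title, buf = stripped, [line]
        else:
            buf.append(line)
    if any(ln.strip() for ln in buf):
        chunks.append((title, "\n".join(buf).strip()))
    return chunks
```

Each chunk keeps its heading line, and the returned title can be prepended or stored as metadata for retrieval. A consumer with a larger window could merge adjacent chunks afterward using the per-document token count as a guide.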

How this relates to MCP and agent retrieval

This is the section to read slowly, because it is where the mental model usually breaks.

The Model Context Protocol (MCP) is a standard for connecting models to live tools and data sources — an MCP server exposes resources and tools that an agent calls at runtime, over a connection. A corpus bundle is the opposite: a static snapshot of content, sitting in a folder. They are complementary, not competing, and they meet at a clear boundary.

A browser extension exports a bundle. It is not an MCP server. BulkMD runs locally in your browser, converts pages to Markdown, and writes a ZIP. That ZIP is an artifact — inert files. To make it queryable by an agent, something downstream has to ingest it: a RAG pipeline that embeds and indexes the chunks, or an MCP server that reads manifest.json and serves the documents as resources. The extension produces the input to that layer; it does not become that layer. Being explicit about this prevents the common error of expecting a clipping tool to "be" your retrieval backend.

The bundle format is designed to make that downstream step short. An MCP server wrapping this corpus is mostly a thin reader:

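# Sketch: the two handlers an MCP server needs to expose a corpus bundle as resources.
# Wiring them into an actual MCP SDK is omitted here.
import json
import pathlib

root = pathlib.Path("web-corpus")
manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
# Index entries by ID once so lookups do not rescan the list.
by_id = {d["id"]: d for d in manifest["documents"]}

def list_resources() -> list[dict]:
    # Each document becomes an addressable resource the agent can request.
    return [
        {"uri": f"corpus://{d['id']}", "name": d["title"],
         "tokens": d["tokens"], "source": d["url"]}
        for d in manifest["documents"]
    ]

def read_resource(uri: str) -> str:
    doc_id = uri.removeprefix("corpus://")  # str.removeprefix needs Python 3.9+
    entry = by_id.get(doc_id)
    if entry is None:
        raise KeyError(f"unknown resource: {uri}")
    return (root / entry["file"]).read_text(encoding="utf-8")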

Because the manifest already carries titles, token counts, and source URLs, the server advertises everything the agent needs to choose resources intelligently before reading a single byte of content. The same bundle feeds a classic vector RAG just as cleanly — point your indexer at docs/, chunk on the heading hints, embed, and store. The end-to-end version of that path is in building a personal RAG pipeline, which consumes exactly this folder shape.

If you publish the corpus on a website rather than handing it to an agent directly, the parallel artifact is llms.txt — a root-level index of your site's Markdown for agents that crawl. The manifest is the offline-bundle analog of that file; how to write an llms.txt file covers the published variant and which agents fetch it.

A worked sizing example

Numbers make the budgeting payoff concrete. Suppose you clip a 12-page corpus of API documentation and your downstream agent has a 128K-token context window with a working budget of roughly 100K tokens after the system prompt and headroom.

Document	Tokens (o200k_base)	Cumulative	Fits in 100K budget?
0001 Rate limiting	1,640	1,640	yes
0002 Connection pooling	2,310	3,950	yes
0003 Idempotency keys	1,840	5,790	yes
0004 Webhook retries	3,120	8,910	yes
0005 Pagination	1,205	10,115	yes
...	...	...	...
0012 Error taxonomy	4,480	31,000	yes

With the per-document tokens field, the agent computes that the entire 12-page corpus is roughly 31K tokens — well under budget — and can load all of it directly, skipping retrieval entirely. Without the manifest, it would have to either tokenize on the fly or assume the worst and retrieve piecemeal. The packaging decision changes the runtime strategy from "retrieve and hope" to "I know exactly what this costs."
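
The decision the agent makes here can be sketched in a few lines. This assumes the manifest shape shown earlier and a budget expressed in the same tokenizer's units. `plan_load` and its return shape are illustrative, not a BulkMD or MCP API.

```python
def plan_load(manifest: dict, budget: int) -> dict:
    """Load everything if it fits; otherwise pick documents in manifest order."""
    docs = manifest["documents"]
    total = sum(d["tokens"] for d in docs)
    if total <= budget:
        return {"strategy": "load_all", "ids": [d["id"] for d in docs], "tokens": total}
    picked, used = [], 0
    for d in docs:
        if used + d["tokens"] <= budget:  # skip any document that would overflow
            picked.append(d["id"])
            used += d["tokens"]
    return {"strategy": "subset", "ids": picked, "tokens": used}
```

A real agent would rank documents by relevance before filling the subset branch; manifest order is used here only to keep the sketch short.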

The one sentence to carry away: a corpus that ships per-document token counts in its manifest lets an agent decide, before reading anything, whether to load the whole bundle or retrieve a subset. For the common case of a few dozen clipped pages totaling under 50K tokens, the right answer is almost always "load it all," which makes chunking and embedding entirely optional.

Keeping the corpus honest over time

A bundle is a snapshot, and web pages change. Two cheap habits keep a corpus trustworthy. Record captured per document so a consumer can judge staleness — a page captured eight months ago deserves a re-fetch before you cite it as current. And keep the url exact, including any anchor, so re-capture is a one-click operation rather than a hunt. When you do refresh, regenerate the manifest in the same pass; a manifest whose tokens and headings have drifted from the actual files is worse than no manifest, because consumers trust it.
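
The staleness check is easy to automate from the captured field. A minimal sketch, assuming captured is an ISO date (YYYY-MM-DD) as in the examples above; the 180-day default is an arbitrary placeholder, and `stale_documents` is a hypothetical helper name.

```python
from datetime import date


def stale_documents(manifest: dict, today: date, max_age_days: int = 180) -> list[str]:
    """Return IDs of documents whose `captured` date is older than max_age_days."""
    stale = []
    for d in manifest["documents"]:
        captured = date.fromisoformat(d["captured"])  # expects YYYY-MM-DD
        if (today - captured).days > max_age_days:
            stale.append(d["id"])
    return stale
```

Passing `today` explicitly keeps the function deterministic; a caller would typically supply `date.today()`.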

For corpora you rebuild regularly, the export side matters too. BulkMD's bulk mode processes up to 10 tabs in parallel and retains around 500 results per batch, which is sized for the curated-list workflow — you open the pages you care about and export them together, rather than crawling a whole domain. To assemble that list, you can pull the links out of a single page section instead of copying URLs by hand. The architecture tradeoffs behind that choice are in server scrapers vs browser extensions.

TL;DR

A web corpus an AI agent can ingest is a folder of clean Markdown under docs/, a skimmable index.md, and a flat manifest.json recording id, file, title, url, and tokens per document, with optional heading boundaries for chunking. The token counts are what let an agent budget context before reading; the URLs are what make citations and re-fetching possible; the stable file IDs are what keep both valid over time. Remember the boundary: an extension exports this bundle, and a RAG pipeline or MCP server ingests it — they are different jobs done by different tools.

Your next step: take the pages you have already clipped, give each a numbered file name and front matter, and write a fifteen-line script that walks docs/ and emits manifest.json with a token count per file. If you do not have the Markdown yet, BulkMD's agent bundle export produces this exact folder shape — manifest, index, and clean Markdown — in one click, entirely in your browser.
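
As a starting point for that script, here is one possible shape, somewhat longer than fifteen lines because it also reads the front matter. It is a sketch under several assumptions: file names begin with the numeric ID followed by a hyphen, front matter is flat `key: value` pairs, token counts cover the whole file including front matter, and the default counter is the rough four-characters-per-token estimate. The `build_manifest` name and the placeholder tokenizer label are illustrative.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(root: Path, count=lambda s: len(s) // 4,
                   tokenizer: str = "approx-chars-div-4") -> dict:
    """Walk root/docs and build a manifest dict in the shape described above."""
    documents = []
    for path in sorted((root / "docs").glob("*.md")):
        text = path.read_text(encoding="utf-8")
        meta = {}
        lines = text.splitlines()
        if lines and lines[0].strip() == "---":
            for line in lines[1:]:
                if line.strip() == "---":  # end of front matter
                    break
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip()] = value.strip().strip('"')
        documents.append({
            "id": path.name.split("-", 1)[0],
            "file": path.relative_to(root).as_posix(),  # relative, never absolute
            "title": meta.get("title", path.stem),
            "url": meta.get("source", ""),
            "captured": meta.get("captured", ""),
            "tokens": count(text),
        })
    return {
        "schema": "bulkmd.corpus/v1",
        "generated": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tokenizer": tokenizer,
        "documentCount": len(documents),
        "totalTokens": sum(d["tokens"] for d in documents),
        "documents": documents,
    }
```

To write the file, serialize the result with `json.dumps(build_manifest(Path("web-corpus")), indent=2)` and save it as manifest.json at the bundle root. In production, pass a real tokenizer as `count` and name it in `tokenizer`.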

Frequently asked questions

Is a BulkMD agent bundle an MCP server?

No. The bundle is a static ZIP of Markdown files plus a manifest.json file — an artifact, not a running service. To serve it to an agent at runtime you wrap it with an MCP server or feed it into a RAG indexer. The extension exports the input; the retrieval layer consumes it. They are deliberately separate jobs.

Why put token counts in the manifest instead of computing them at ingest?

So the consumer can budget context before reading any content. An agent that knows each document's size up front can decide whether the whole corpus fits in its window or whether it must retrieve a subset. Computing counts at ingest works too, but it forces a tokenizer pass on every consumer and makes the bundle's cost opaque until you open it.

Which tokenizer should the manifest report?

Name whichever one you used in a tokenizer field — o200k_base for GPT-4o and the o-series, cl100k_base for GPT-3.5 and GPT-4. They are approximately comparable for English prose but not identical, and Claude uses its own tokenizer. Stating the basis lets a consumer apply a correction factor rather than trusting an unlabeled number.

Should I pre-chunk the corpus or just mark boundaries?

For a shared, general-purpose bundle, ship boundary hints (a list of H2/H3 headings per document) and let each consumer chunk to its own window size. Pre-chunk into a separate directory only when many consumers must use identical boundaries — for example, when several services share one embedding index and drift would break it.

How is this different from an llms.txt file?

Same idea, different delivery. An llms.txt file is a published, root-level index of your site's Markdown that crawling agents (Perplexity, Claude, IDE agents) fetch over the web. A corpus manifest is the offline-bundle analog — it travels inside a ZIP you hand to a pipeline or MCP server. Note that Google Search does not read llms.txt; it is an agent-and-crawler convention, not a ranking signal.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.


Tagged: LLM context, RAG, Markdown, Bulk export, Tokens

Packaging a Web Corpus for AI Agents to Ingest