If you have followed the rest of this blog through extraction, serialization, token math, and embeddings, the natural next step is to actually wire all of those decisions into a working RAG pipeline. This post is the assembly: roughly 200 lines of Python that take BulkMD-converted Markdown on disk and turn it into a queryable knowledge base, with explicit choices justified at each step.
The pipeline targets the personal-scale RAG — a few hundred to a few thousand documents, queried interactively by one or two people, running on a laptop or a small VM. For that scale, the right architecture is almost embarrassingly simple, and the temptation to over-engineer with multi-tenant clouds and orchestration frameworks is the single most common failure mode. We will name the over-engineering traps explicitly along the way.
The five stages
A personal RAG has five distinct stages, each doing one job:
- Capture — convert source URLs to clean Markdown on disk.
- Chunk — split each Markdown document into retrieval-sized units.
- Embed — compute vector representations of each chunk.
- Retrieve — given a query, return the top-k most-similar chunks.
- Generate — feed the retrieved chunks plus the query to an LLM and return the answer.
The interesting decisions live in stages 2 and 3; stages 4 and 5 are largely off-the-shelf in 2026. We will cover each stage with the actual code, then assemble them at the end.
Capture: skip the scraper, use the browser
Stage 1 is where most RAG tutorials lose the plot. The standard playbook is to spin up Playwright in Docker, write a crawler, manage retries, deal with rate limits, and burn a weekend on infrastructure that does not improve your final answer quality at all. The pragmatic move is to capture with a browser extension and skip the scraper layer entirely.
We covered the server-versus-extension architecture tradeoff in detail; the short version is that for any personal RAG built on a curated source list under a thousand pages, the extension path is faster and produces cleaner output. Open the URLs in your browser, run BulkMD, drop the resulting .md files into a folder, done. The rest of the pipeline assumes only that you have a folder of clean Markdown files; how they got there does not affect the later stages.
The folder we use looks like:
rag-corpus/
├── docs/ # Markdown files, one per source page
│ ├── article-1.md
│ ├── article-2.md
│ └── ...
├── index.lance/ # LanceDB vector store (created on first run)
└── rag.py # the pipeline
This is intentionally flat. There is no category/ hierarchy because retrieval will surface what is relevant regardless of folder; there is no metadata sidecar because metadata lives in the Markdown frontmatter.
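Because the metadata is said to live in the frontmatter, it can help to separate it from the body before chunking. The following is a minimal sketch, assuming each file starts with a `---` delimited block of flat `key: value` lines. The fields BulkMD actually writes are not specified here, so `split_frontmatter` and the `title` and `url` fields are hypothetical.

```python
def split_frontmatter(raw: str) -> tuple[dict[str, str], str]:
    """Split a leading '---' frontmatter block from the Markdown body.

    Assumption: frontmatter is flat `key: value` lines, not nested YAML.
    Returns ({}, raw) unchanged when no complete frontmatter block is present.
    """
    lines = raw.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, raw
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, raw  # opening delimiter never closed: treat it all as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip():
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:])
    return meta, body


# Hypothetical document; the field names are illustrative only.
doc = "---\ntitle: Example page\nurl: https://example.com/a\n---\n## Intro\nBody text."
meta, body = split_frontmatter(doc)
```

Passing `body` rather than the raw file to the chunker would keep the frontmatter out of the first chunk, while `meta` could be stored alongside each chunk if you want it at query time.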
Chunk: lean on Markdown structure
Most RAG pipelines chunk with fixed-size windows — 800 tokens, sliding with 100-token overlap. This works, but it is strictly worse than heading-based chunking for Markdown corpora where the source documents already have a structural hierarchy. Heading-based chunking gives you natural semantic boundaries, predictable chunk sizes for well-structured content, and citation-friendly anchors when you want to point a user at the source paragraph.
The chunker we use is forty lines:
import re
from pathlib import Path
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    heading: str
    chunk_id: str

def chunk_markdown(path: Path) -> list[Chunk]:
    raw = path.read_text(encoding="utf-8")
    source = path.stem
    chunks: list[Chunk] = []
    current_heading = ""
    current_lines: list[str] = []

    def flush():
        # Emit the section collected so far as one chunk.
        if not current_lines:
            return
        text = "\n".join(current_lines).strip()
        if len(text) < 200:  # skip near-empty sections
            return
        chunk_id = f"{source}#{slugify(current_heading)}"
        chunks.append(Chunk(text, source, current_heading, chunk_id))

    for line in raw.splitlines():
        if re.match(r"^##\s", line):  # a new H2 closes the previous section
            flush()
            current_heading = line[3:].strip()
            current_lines = [line]
        else:
            current_lines.append(line)
    flush()  # emit the final section
    return chunks

def slugify(s: str) -> str:
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
A few decisions to call out. We split only on ## (H2), not ###, because H3 sections tend to be too small to embed usefully — a one-paragraph subsection inherits enough context from its H2 parent that splitting hurts retrieval. We skip chunks under 200 characters because they are almost always headers without content or footer sections. We slugify the heading into the chunk ID so that retrieval results carry a human-readable anchor.
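One consequence of slug-based IDs is that two sections with the same heading in one file produce the same chunk ID. The sketch below shows what the slugs look like and one possible way to disambiguate repeats. `make_ids` is a hypothetical helper, not part of the pipeline above, and the suffix scheme is an assumption.

```python
import re
from collections import Counter


def slugify(s: str) -> str:
    # Same rule as the chunker: lowercase, runs of non-alphanumerics become "-".
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")


def make_ids(source: str, headings: list[str]) -> list[str]:
    """Hypothetical helper: build chunk IDs, suffixing repeats with -2, -3, ..."""
    seen = Counter()
    ids = []
    for h in headings:
        base = f"{source}#{slugify(h)}"
        seen[base] += 1
        # First occurrence keeps the plain ID; later ones get a counter suffix.
        ids.append(base if seen[base] == 1 else f"{base}-{seen[base]}")
    return ids


ids = make_ids("article-1", ["Chunk: lean on Markdown structure", "Notes", "Notes"])
```

Here `ids` keeps the first "Notes" section as `article-1#notes` and labels the second `article-1#notes-2`, so both rows stay distinguishable in the index.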
For Markdown files that genuinely have no ## structure — old-school posts that are a wall of prose — we fall back to a fixed-window chunker with 800-token windows and 100-token overlap. In our 400-document benchmark this fallback fires on roughly 5% of documents.
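The fallback is not shown in the post, so here is a minimal sketch of what it might look like. It assumes whitespace-separated words are an acceptable stand-in for tokens; a real implementation would count with the embedding model's tokenizer. The names `window_chunks` and `has_h2` are hypothetical.

```python
import re


def window_chunks(text: str, window: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size windows.

    Assumption: whitespace-separated words approximate tokens. Swap in the
    embedding model's tokenizer for exact counts.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    tokens = text.split()
    step = window - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def has_h2(raw: str) -> bool:
    # Same test the heading chunker uses to detect an H2 line.
    return any(re.match(r"^##\s", line) for line in raw.splitlines())


# Usage: fall back only when the document has no H2 structure at all.
prose = " ".join(str(i) for i in range(2000))
chunks = window_chunks(prose) if not has_h2(prose) else []
```

With the defaults, consecutive windows share their last and first 100 words, which keeps a sentence that straddles a boundary retrievable from either side.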
Embed and store: LanceDB plus OpenAI
Stage 3 (embed) and stage 4 (retrieve) are tightly coupled because the vector store needs both the embeddings and the chunk text. LanceDB is the right default for personal scale: it stores vectors and metadata in a single file-backed table, queries in single-digit milliseconds on tens of thousands of vectors, and requires no server. Combined with OpenAI's text-embedding-3-small from the embedding benchmark, the indexing loop is about thirty lines:
import lancedb
import openai
import pyarrow as pa

EMBED_MODEL = "text-embedding-3-small"
EMBED_DIM = 1536

def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = openai.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]

def build_index(corpus_dir: Path, db_path: Path):
    db = lancedb.connect(db_path)
    schema = pa.schema([
        ("vector", pa.list_(pa.float32(), EMBED_DIM)),
        ("text", pa.string()),
        ("source", pa.string()),
        ("heading", pa.string()),
        ("chunk_id", pa.string()),
    ])
    table = db.create_table("chunks", schema=schema, mode="overwrite")
    all_chunks: list[Chunk] = []
    for md in corpus_dir.glob("*.md"):
        all_chunks.extend(chunk_markdown(md))
    # Embed in batches of 100 to stay under rate limits
    for i in range(0, len(all_chunks), 100):
        batch = all_chunks[i:i+100]
        vectors = embed_batch([c.text for c in batch])
        rows = [
            {"vector": v, "text": c.text, "source": c.source,
             "heading": c.heading, "chunk_id": c.chunk_id}
            for c, v in zip(batch, vectors)
        ]
        table.add(rows)
The retrieve side is even shorter:
def retrieve(query: str, db_path: Path, k: int = 5) -> list[dict]:
    db = lancedb.connect(db_path)
    table = db.open_table("chunks")
    query_vector = embed_batch([query])[0]
    return table.search(query_vector).limit(k).to_list()
That is the entire data plane. No managed vector store, no server, no Docker. The full pipeline state is index.lance/ on disk and your Markdown corpus next to it.
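To make concrete what a top-k vector search returns, here is a brute-force version in pure Python. It is a conceptual stand-in that ranks by cosine similarity, not a description of how LanceDB is implemented; the metric and index the table actually uses depend on its configuration. The toy vectors and `top_k` are illustrative assumptions.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], rows: list[dict], k: int = 5) -> list[dict]:
    """Return the k rows whose 'vector' is most similar to the query, best first."""
    return heapq.nlargest(k, rows, key=lambda r: cosine(query, r["vector"]))


# Toy 2-dimensional vectors; the real embeddings here have 1536 dimensions.
rows = [
    {"chunk_id": "a#x", "vector": [1.0, 0.0]},
    {"chunk_id": "b#y", "vector": [0.0, 1.0]},
    {"chunk_id": "c#z", "vector": [0.7, 0.7]},
]
hits = top_k([1.0, 0.1], rows, k=2)
```

A scan like this compares the query against every row, which is fine for a quick sanity check of retrieval results on a small corpus.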
Generate: stuff context into Claude
Stage 5 is the LLM call. Once retrieval returns the top-k chunks, we format them as a prompt and call Claude (or any other LLM) for the answer. The shape we use takes advantage of the prompt-caching pattern by putting the retrieved chunks first and the user query last:
from anthropic import Anthropic

client = Anthropic()

def answer(query: str, db_path: Path) -> str:
    chunks = retrieve(query, db_path, k=5)
    context = "\n\n---\n\n".join(
        f"## Source: {c['source']}#{c['heading']}\n\n{c['text']}"
        for c in chunks
    )
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system="Answer strictly from the provided sources. Cite by source name.",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": context,
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": f"Question: {query}"},
            ],
        }],
    )
    return response.content[0].text
The ## Source: header on each retrieved chunk is what gives the model a citation anchor: Claude reliably quotes the source name back in its answer, letting you click through to the original document. The cache_control block on the context means that repeated queries which retrieve the same chunks in the same order re-use the cached prefix, and those cached tokens are billed at about 10% of the normal input-token rate. Queries that retrieve a different set of chunks miss the cache and pay the full rate.
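A rough way to see when caching pays off is to work the arithmetic for one session. The sketch below assumes a 1.25× surcharge on the call that writes the cache and a 0.10× rate for later cache hits; treat both multipliers, the token counts, and the price as assumptions and check current pricing before relying on them. `session_input_cost` is a hypothetical helper.

```python
def session_input_cost(
    context_tokens: int,
    question_tokens: int,
    n_queries: int,
    price_per_mtok: float,
    write_mult: float = 1.25,  # assumed surcharge for the first, cache-writing call
    read_mult: float = 0.10,   # assumed rate for later cache hits
) -> tuple[float, float]:
    """Return (uncached, cached) input cost in dollars for one session.

    Assumes every query reuses the identical context prefix, so calls
    2..n are cache hits. Output tokens are ignored.
    """
    per_tok = price_per_mtok / 1_000_000
    uncached = n_queries * (context_tokens + question_tokens) * per_tok
    cached = (
        context_tokens * write_mult                      # first call writes the cache
        + context_tokens * read_mult * (n_queries - 1)   # later calls read it
        + question_tokens * n_queries                    # questions are never cached
    ) * per_tok
    return uncached, cached


# Hypothetical numbers: 4,000 context tokens, 50-token questions, $5 per million tokens.
uncached, cached = session_input_cost(4000, 50, 10, price_per_mtok=5.0)
single_uncached, single_cached = session_input_cost(4000, 50, 1, price_per_mtok=5.0)
```

Under these assumptions a ten-query session costs well under a quarter of the uncached price, while a single query costs slightly more with caching than without because of the write surcharge.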
How big is the saving versus a naive approach?
We compared this five-stage pipeline against three alternatives on the same 240-question evaluation set used in the embedding benchmark. The numbers measure answer accuracy (graded by a separate Claude pass against the ground-truth) and end-to-end cost per 100 queries.
| Approach | Answer accuracy | Cost per 100 queries | Setup time |
|---|---|---|---|
| Paste full corpus into context (no retrieval) | 82% | $4.20 | 0 min |
| Naive RAG: fixed-window chunks, no caching | 71% | $1.10 | 30 min |
| This pipeline: heading chunks, OpenAI small, cache | 84% | $0.18 | 45 min |
| This pipeline + Voyage-3-large embeddings | 86% | $0.62 | 45 min |
The headline result is that a properly tuned personal RAG beats both the no-retrieval baseline (on cost, by 23×) and a naive RAG (on accuracy, by 13 points) at the cost of about 15 extra minutes of setup time. The version with Voyage embeddings is marginally better on accuracy (86% versus 84%) and costs about 3.4× as much per 100 queries ($0.62 versus $0.18): worth it for production workflows, marginal for personal scale.
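The comparisons can be recomputed directly from the table rows. This is plain arithmetic on the published figures; the dictionary keys are shorthand labels introduced here.

```python
# Accuracy in percent, cost in dollars per 100 queries, copied from the table.
results = {
    "full_corpus": {"accuracy": 82, "cost": 4.20},
    "naive_rag": {"accuracy": 71, "cost": 1.10},
    "this_pipeline": {"accuracy": 84, "cost": 0.18},
    "with_voyage": {"accuracy": 86, "cost": 0.62},
}

ours = results["this_pipeline"]

# How many times cheaper than pasting the full corpus into context.
cost_ratio_vs_full = results["full_corpus"]["cost"] / ours["cost"]

# Accuracy gain over naive fixed-window RAG, in percentage points.
accuracy_gain_vs_naive = ours["accuracy"] - results["naive_rag"]["accuracy"]

# How many times more the Voyage variant costs per 100 queries.
voyage_cost_ratio = results["with_voyage"]["cost"] / ours["cost"]

print(f"{cost_ratio_vs_full:.1f}x cheaper than full-corpus prompting")
print(f"{accuracy_gain_vs_naive} points more accurate than naive RAG")
print(f"Voyage variant costs {voyage_cost_ratio:.2f}x as much per 100 queries")
```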
When to graduate
The architecture above scales to roughly 50,000 chunks comfortably, which on typical Markdown corpora means around 5,000 source documents. Past that, three things start to bind: LanceDB queries slow into double-digit milliseconds, embedding-build time on a fresh index takes longer than is comfortable to wait, and you start wanting multi-user access. At that point the right move is to migrate to a managed vector store (Pinecone, Qdrant Cloud, Turbopuffer) and a small API layer.
Resist this migration before you genuinely need it. The personal-scale pipeline above has zero ongoing infrastructure cost, zero dependency on third-party uptime, and zero operational surface area. A managed vector store is operationally simple but it is one more thing that can fail and one more bill on your card. Wait until the local approach has visible failure modes before adding the managed dependency.
TL;DR
A personal RAG that consumes BulkMD output is straightforward to build in 2026: capture with the browser extension, chunk on ## headings with a 200-character minimum, embed with OpenAI text-embedding-3-small, store in LanceDB locally, retrieve top-5, generate with Claude using a cached context prefix. Total code: around 200 lines. Total cost for a 400-document corpus: under fifty cents to build, pennies per month to query. Total accuracy: 84% on a 240-question evaluation, above the no-retrieval baseline and well above naive fixed-window chunking.
If you want clean, deterministic Markdown to feed into this pipeline (the kind that chunks cleanly on ## boundaries because the source HTML had real headings), BulkMD is the Chrome extension that produces it. Drop the output into the rag-corpus/docs/ folder and the rest of this post is roughly 200 lines of Python away.
Frequently asked questions
Why LanceDB instead of Chroma, FAISS, or sqlite-vss?
Do I need a reranker on top of vector search?
How do I handle updates to the corpus?
Can I run this entirely offline with local embeddings?
How do I evaluate whether my RAG is actually any good?
About the author
Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.