If you have followed the rest of this blog through extraction, serialization, token math, and embeddings, the natural next step is to actually wire all of those decisions into a working RAG pipeline. This post is the assembly: roughly 200 lines of Python that take BulkMD-converted Markdown on disk and turn it into a queryable knowledge base, with explicit choices justified at each step.

The pipeline targets the personal-scale RAG — a few hundred to a few thousand documents, queried interactively by one or two people, running on a laptop or a small VM. For that scale, the right architecture is almost embarrassingly simple, and the temptation to over-engineer with multi-tenant clouds and orchestration frameworks is the single most common failure mode. We will name the over-engineering traps explicitly along the way.

The five stages

A personal RAG has five distinct stages, each doing one job:

Capture — convert source URLs to clean Markdown on disk.
Chunk — split each Markdown document into retrieval-sized units.
Embed — compute vector representations of each chunk.
Retrieve — given a query, return the top-k most-similar chunks.
Generate — feed the retrieved chunks plus the query to an LLM and return the answer.

The interesting decisions live in stages 2 and 3; stages 4 and 5 are largely off-the-shelf in 2026. We will cover each stage with the actual code, then assemble them at the end.

Capture: skip the scraper, use the browser

Stage 1 is where most RAG tutorials lose the plot. The standard playbook is to spin up Playwright in Docker, write a crawler, manage retries, deal with rate limits, and burn a weekend on infrastructure that does not improve your final answer quality at all. The pragmatic move is to capture with a browser extension and skip the scraper layer entirely.

We covered the server vs extension architecture tradeoff in detail; the short version is that for any personal RAG built on a curated source list under a thousand pages, the extension path is faster and produces cleaner output. Open the URLs in your browser, run BulkMD, drop the resulting .md files into a folder, done. If your source is a single site, turning a blog archive or category page into a Markdown dataset gives you the whole corpus in one pass. The remaining pipeline assumes only that you have a folder of clean Markdown files; how they got there does not matter to the later stages.

The folder we use looks like:

rag-corpus/
├── docs/                    # Markdown files, one per source page
│   ├── article-1.md
│   ├── article-2.md
│   └── ...
├── index.lance/             # LanceDB vector store (created on first run)
└── rag.py                   # the pipeline

This is intentionally flat. There is no category/ hierarchy because retrieval will surface what is relevant regardless of folder; there is no metadata sidecar because metadata lives in the Markdown frontmatter.
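Because the metadata lives in frontmatter, it can be separated from the body before chunking. The following is a minimal sketch, assuming each file starts with a `---`-delimited block of flat `key: value` lines; the helper name `split_frontmatter` and the field names in the example are hypothetical, and nested YAML would need a real parser.

```python
def split_frontmatter(raw: str) -> tuple[dict[str, str], str]:
    """Split a Markdown string into (metadata, body).

    Only handles flat `key: value` lines between two `---` delimiters.
    """
    lines = raw.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, raw  # no frontmatter present
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, raw  # opening delimiter never closed; treat as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:])
    return meta, body


# Hypothetical field names, for illustration only.
doc = '---\ntitle: "Example"\nsource_url: https://example.com/a\n---\n## Intro\nText.'
meta, body = split_frontmatter(doc)
print(meta["title"], "|", body.splitlines()[0])
```

Passing only the body to the chunker keeps the frontmatter out of the first chunk's embedded text.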

Chunk: lean on Markdown structure

Most RAG pipelines chunk with fixed-size windows, typically 800 tokens sliding with a 100-token overlap. This works, but for Markdown corpora where the source documents already have a structural hierarchy it tends to do worse than heading-based chunking: the naive fixed-window baseline in the comparison later in this post scores 71% answer accuracy against 84% for this pipeline. Heading-based chunking gives you natural semantic boundaries, predictable chunk sizes for well-structured content, and citation-friendly anchors when you want to point a user at the source paragraph.

The chunker we use is forty lines:

import re
from pathlib import Path
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    heading: str
    chunk_id: str

def chunk_markdown(path: Path) -> list[Chunk]:
    raw = path.read_text(encoding="utf-8")  # explicit: the platform default is not always UTF-8
    source = path.stem
    chunks: list[Chunk] = []
    current_heading = ""
    current_lines: list[str] = []

    def flush():
        if not current_lines:
            return
        text = "\n".join(current_lines).strip()
        if len(text) < 200:  # skip near-empty sections
            return
        chunk_id = f"{source}#{slugify(current_heading)}"
        chunks.append(Chunk(text, source, current_heading, chunk_id))

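    in_fence = False  # True while inside a ``` or ~~~ fenced code block
    for line in raw.splitlines():
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
        # A "## " line inside a code fence is a comment, not a heading
        if not in_fence and re.match(r"^##\s", line):
            flush()
            current_heading = line[3:].strip()
            current_lines = [line]
        else:
            current_lines.append(line)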
    flush()
    return chunks

def slugify(s: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")

A few decisions to call out. We split only on ## (H2), not ###, because H3 sections tend to be too small to embed usefully — a one-paragraph subsection inherits enough context from its H2 parent that splitting hurts retrieval. We skip chunks under 200 characters because they are almost always headers without content or footer sections. We slugify the heading into the chunk ID so that retrieval results carry a human-readable anchor.

For Markdown files that genuinely have no ## structure — old-school posts that are a wall of prose — we fall back to a fixed-window chunker with 800-token windows and 100-token overlap. In our 400-document benchmark this fallback fires on roughly 5% of documents.
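The fallback can be sketched as a sliding window. This is a minimal sketch rather than the code behind the benchmark: it approximates tokens with whitespace-separated words, which is an assumption, so swap in a real tokenizer if you need exact 800-token windows. The function name `window_chunks` and its parameters are hypothetical.

```python
def window_chunks(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` tokens, each sharing `overlap`
    tokens with the previous window. Tokens are approximated by words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return windows


# Small numbers make the overlap visible.
print(window_chunks("a b c d e f g h", size=4, overlap=1))
```

Each window would then be wrapped in a `Chunk` with an empty heading and a positional ID, since there is no heading to slugify.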

Embed and store: LanceDB plus OpenAI

Stage 3 (embed) and stage 4 (retrieve) are tightly coupled because the vector store needs both the embeddings and the chunk text. LanceDB is the right default for personal scale: it stores vectors and metadata in a single file-backed table, queries in single-digit milliseconds on tens of thousands of vectors, and requires no server. Combined with OpenAI's text-embedding-3-small from the embedding benchmark, the indexing loop is about thirty lines:

import lancedb
import openai
import pyarrow as pa

EMBED_MODEL = "text-embedding-3-small"
EMBED_DIM = 1536

def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = openai.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]

def build_index(corpus_dir: Path, db_path: Path):
    db = lancedb.connect(db_path)
    schema = pa.schema([
        ("vector", pa.list_(pa.float32(), EMBED_DIM)),
        ("text", pa.string()),
        ("source", pa.string()),
        ("heading", pa.string()),
        ("chunk_id", pa.string()),
    ])
    table = db.create_table("chunks", schema=schema, mode="overwrite")

    all_chunks: list[Chunk] = []
    for md in corpus_dir.glob("*.md"):
        all_chunks.extend(chunk_markdown(md))

    # Embed in batches of 100 to stay under rate limits
    for i in range(0, len(all_chunks), 100):
        batch = all_chunks[i:i+100]
        vectors = embed_batch([c.text for c in batch])
        rows = [
            {"vector": v, "text": c.text, "source": c.source,
             "heading": c.heading, "chunk_id": c.chunk_id}
            for c, v in zip(batch, vectors)
        ]
        table.add(rows)

The retrieve side is even shorter:

def retrieve(query: str, db_path: Path, k: int = 5) -> list[dict]:
    db = lancedb.connect(db_path)
    table = db.open_table("chunks")
    query_vector = embed_batch([query])[0]
    return table.search(query_vector).limit(k).to_list()

That is the entire data plane. No managed vector store, no server, no Docker. The full pipeline state is index.lance/ on disk and your Markdown corpus next to it.
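For intuition about what the search call returns, nearest-neighbour retrieval can be sketched as a brute-force scan over the stored rows. This is illustrative only: it is not how LanceDB is implemented, the choice of cosine similarity is an assumption, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], rows: list[dict], k: int = 5) -> list[dict]:
    """Return the k rows whose 'vector' is most similar to query_vec."""
    scored = sorted(rows, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return scored[:k]


# Toy two-dimensional vectors standing in for real embeddings.
rows = [
    {"chunk_id": "a#intro", "vector": [1.0, 0.0]},
    {"chunk_id": "b#setup", "vector": [0.0, 1.0]},
    {"chunk_id": "c#usage", "vector": [0.7, 0.7]},
]
print([r["chunk_id"] for r in top_k([1.0, 0.1], rows, k=2)])
```

A scan like this is linear in the number of chunks; a vector store exists to avoid paying that cost on every query.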

Generate: stuff context into Claude

Stage 5 is the LLM call. Once retrieval returns the top-k chunks, we format them as a prompt and call Claude (or any other LLM) for the answer. The shape we use takes advantage of the prompt-caching pattern by putting the retrieved chunks first and the user query last:

from anthropic import Anthropic
client = Anthropic()

def answer(query: str, db_path: Path) -> str:
    chunks = retrieve(query, db_path, k=5)
    context = "\n\n---\n\n".join(
        f"## Source: {c['source']}#{c['heading']}\n\n{c['text']}"
        for c in chunks
    )
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system="Answer strictly from the provided sources. Cite by source name.",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": context,
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": f"Question: {query}"},
            ],
        }],
    )
    return response.content[0].text

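The ## Source: header on each retrieved chunk is what gives the model a citation anchor: Claude reliably quotes the source-name slug back in its answer, letting you click through to the original document. The cache_control block on the context means that repeated calls with an identical context block (the same chunks in the same order) read the cached prefix at roughly 10% of the normal input-token price. A query that retrieves a different set of chunks produces a different prefix and gets no cache hit, and contexts below the API's minimum cacheable length are not cached at all.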
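The size of the caching benefit can be estimated with a little arithmetic. The sketch below assumes Anthropic's multipliers for the default five-minute cache (a cache write billed at 1.25× the base input price, a cache read at 0.1×) and that every follow-up call hits the cache; check current pricing before relying on these numbers. It covers only the cached context portion of the input, and the function name is hypothetical.

```python
def relative_input_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of `calls` requests sharing one cached context, relative to
    sending the same context uncached every time (1.0 means no saving)."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    cached = write_mult + (calls - 1) * read_mult  # first call writes, the rest read
    uncached = float(calls)
    return cached / uncached


for n in (1, 2, 5, 10):
    print(n, round(relative_input_cost(n), 3))
```

Under these assumptions a single call costs more than it would uncached, because the write carries a premium, while two or more calls against the same context come out ahead.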

How big is the saving versus a naive approach?

We compared this five-stage pipeline against three alternatives on the same 240-question evaluation set used in the embedding benchmark. The numbers measure answer accuracy (graded by a separate Claude pass against the ground-truth) and end-to-end cost per 100 queries.

Approach	Answer accuracy	Cost per 100 queries	Setup time
Paste full corpus into context (no retrieval)	82%	$4.20	0 min
Naive RAG: fixed-window chunks, no caching	71%	$1.10	30 min
This pipeline: heading chunks, OpenAI small, cache	84%	$0.18	45 min
This pipeline + Voyage-3-large embeddings	86%	$0.62	45 min

The headline result is that a properly tuned personal RAG beats both the no-retrieval baseline (on cost, by 23×, while also scoring 2 points higher on accuracy) and a naive RAG (on accuracy, by 13 points) at the cost of about 15 extra minutes of setup time over the naive version. The version with Voyage embeddings is 2 points better on accuracy at about 3.4× the cost per 100 queries ($0.62 versus $0.18), which is worth it for production workflows and marginal for personal scale.
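The ratios quoted from the table can be recomputed directly. The snippet below only restates the table's figures; the dictionary keys are hypothetical labels for the four rows.

```python
# Accuracy in percent, cost in dollars per 100 queries, copied from the table.
results = {
    "full_corpus": {"accuracy": 82, "cost": 4.20},
    "naive_rag": {"accuracy": 71, "cost": 1.10},
    "this_pipeline": {"accuracy": 84, "cost": 0.18},
    "this_pipeline_voyage": {"accuracy": 86, "cost": 0.62},
}

base = results["this_pipeline"]
cost_ratio_vs_full = results["full_corpus"]["cost"] / base["cost"]
accuracy_gain_vs_naive = base["accuracy"] - results["naive_rag"]["accuracy"]
voyage_cost_ratio = results["this_pipeline_voyage"]["cost"] / base["cost"]

print(f"{cost_ratio_vs_full:.1f}x cheaper than full-corpus context")
print(f"+{accuracy_gain_vs_naive} accuracy points over naive RAG")
print(f"Voyage variant costs {voyage_cost_ratio:.1f}x as much per 100 queries")
```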

When to graduate

The architecture above scales to roughly 50,000 chunks comfortably, which on typical Markdown corpora means around 5,000 source documents. Past that, three things start to bind: LanceDB queries slow into double-digit milliseconds, embedding-build time on a fresh index takes longer than is comfortable to wait, and you start wanting multi-user access. At that point the right move is to migrate to a managed vector store (Pinecone, Qdrant Cloud, Turbopuffer) and a small API layer.

Resist this migration before you genuinely need it. The personal-scale pipeline above has zero ongoing infrastructure cost, zero dependency on third-party uptime, and zero operational surface area. A managed vector store is operationally simple but it is one more thing that can fail and one more bill on your card. Wait until the local approach has visible failure modes before adding the managed dependency.

TL;DR

A personal RAG that consumes BulkMD output is straightforward to build in 2026: capture with the browser extension, chunk on ## headings with a 200-character minimum, embed with OpenAI text-embedding-3-small, store in LanceDB locally, retrieve top-5, generate with Claude using a cached context prefix. Total code: around 200 lines. Total cost for a 400-document corpus: under fifty cents to build, pennies per month to query. Total accuracy: 84% on a 240-question evaluation, above the no-retrieval baseline and well above naive fixed-window chunking.

If you want clean, deterministic Markdown to feed into this pipeline, the kind that chunks cleanly on ## boundaries because the source HTML had real headings, BulkMD is the Chrome extension that produces it. Drop the output into the rag-corpus/docs/ folder and the rest of this post is roughly 200 lines of Python away.

Frequently asked questions

Why LanceDB instead of Chroma, FAISS, or sqlite-vss?

LanceDB is file-backed (single directory artifact, easy to back up or move), schema-aware (you store text and metadata alongside vectors without a separate table), and persistent without a server. Chroma is similar but slower on writes; FAISS is fast but requires hand-rolling persistence; sqlite-vss is fine but lags on query performance past 10K vectors. For personal scale, LanceDB is the best-fit default.

Do I need a reranker on top of vector search?

For most personal-scale RAG, no — the top-5 from a good embedding model is usually right. Rerankers (Cohere's rerank-v3, Voyage's rerank-2) lift accuracy by 2–4 percentage points but add 100–300 ms per query and a per-query cost. We use them in production pipelines where the accuracy delta justifies the latency, but skip them at personal scale.

How do I handle updates to the corpus?

Two options. The simple version: rebuild the index from scratch when the corpus changes, which is fast (under a minute for a 400-doc corpus). The more sophisticated version: LanceDB supports merge-by-id updates, so you can re-embed only the chunks whose source files changed. For personal scale, the full-rebuild is fine; for daily-changing corpora, the incremental update pays off.
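The first half of the incremental path, working out which source files changed, needs only the standard library. This is a minimal sketch under stated assumptions: the manifest file and the name `changed_files` are hypothetical, deleted files are not handled, and the write-back step is left out.

```python
import hashlib
import json
from pathlib import Path


def changed_files(corpus_dir: Path, manifest_path: Path) -> list[Path]:
    """Return Markdown files that are new or modified since the last run,
    then rewrite the manifest of content hashes."""
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        old = {}
    new: dict[str, str] = {}
    changed: list[Path] = []
    for md in sorted(corpus_dir.glob("*.md")):
        digest = hashlib.sha256(md.read_bytes()).hexdigest()
        new[md.name] = digest
        if old.get(md.name) != digest:
            changed.append(md)
    manifest_path.write_text(json.dumps(new, indent=2), encoding="utf-8")
    return changed
```

The chunks of each changed file would then be re-embedded and written back, for example with LanceDB's `merge_insert` keyed on `chunk_id`; verify the exact call against the LanceDB documentation for your version. Merging by ID only works if chunk IDs are unique, and sections deleted from a file leave stale rows behind unless they are removed separately.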

Can I run this entirely offline with local embeddings?

Yes. Replace `embed_batch` with a call to a local model; `gte-Qwen2-1.5B` or `BGE-large` via `sentence-transformers` are both reasonable choices. Expect accuracy 2–4 percentage points below OpenAI text-embedding-3-small; the privacy and cost tradeoff is yours to weigh. Set `EMBED_DIM` to the local model's output dimension (1024 for BGE-large) and rebuild the index; the rest of the pipeline (chunking, LanceDB, retrieval) is unchanged.
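One way the swap could look, as a sketch rather than a tested configuration: the factory name `make_local_embed_batch` is hypothetical, the model ID `BAAI/bge-large-en-v1.5` is one specific BGE-large checkpoint chosen for illustration, and the code requires the `sentence-transformers` package to be installed.

```python
def make_local_embed_batch(model_name: str = "BAAI/bge-large-en-v1.5"):
    """Build a drop-in replacement for embed_batch backed by a local model.

    Requires `pip install sentence-transformers`; the model weights are
    downloaded on first use and cached locally afterwards.
    """
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_name)

    def embed_batch(texts: list[str]) -> list[list[float]]:
        # normalize_embeddings=True returns unit-length vectors
        return model.encode(texts, normalize_embeddings=True).tolist()

    return embed_batch


# Usage (downloads the model, so it is left commented out):
# embed_batch = make_local_embed_batch()
# EMBED_DIM = len(embed_batch(["probe"])[0])  # must match the table schema
```

Vectors from different models are not comparable, so an index built with one model has to be rebuilt from scratch after switching to another.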

How do I evaluate whether my RAG is actually any good?

Hand-write 30–60 questions whose answers are in your corpus, with the ground-truth source chunk noted. Run them against your pipeline and measure two things: did the right chunk appear in the top 5 retrieved (retrieval recall@5), and did the final answer reference that chunk correctly (answer accuracy). Without an eval set you are guessing; with one, you can tell when a config change helps or hurts.
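The retrieval half of that evaluation fits in a few lines. This is a minimal sketch: the eval-set field names (`question`, `expected_chunk_id`) and the function name `recall_at_k` are hypothetical, and the retriever is passed in as a function so the example runs without an index.

```python
from typing import Callable


def recall_at_k(
    eval_set: list[dict],
    retrieve_ids: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose ground-truth chunk_id is in the top k."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for item in eval_set:
        retrieved = retrieve_ids(item["question"])[:k]
        if item["expected_chunk_id"] in retrieved:
            hits += 1
    return hits / len(eval_set)


# Toy example with a stub retriever; in the pipeline above this would wrap
# retrieve() and return [row["chunk_id"] for row in results].
eval_set = [
    {"question": "How do I install it?", "expected_chunk_id": "guide#install"},
    {"question": "What does it cost?", "expected_chunk_id": "pricing#plans"},
]
fake_index = {
    "How do I install it?": ["guide#install", "guide#intro"],
    "What does it cost?": ["faq#billing", "guide#intro"],
}
print(recall_at_k(eval_set, lambda q: fake_index[q]))
```

Answer accuracy needs a grading step on top of this, either by hand or with a separate model pass as in the benchmark above.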

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.


Tagged: RAG, Markdown, LLM context, Cost optimization

Building a Personal RAG with BulkMD Markdown Output