If you are building retrieval-augmented generation in 2026, the question is no longer "does my content fit the model" but "how much of it should I actually send." Context window budgeting is the discipline of allocating a fixed token budget across system instructions, retrieved evidence, the user query, and reserved output space, and the answer is almost always less than the window allows. This post gives you the reasoning, the numbers, and a worksheet you can copy.

The piece covers why bigger windows did not make budgeting obsolete, how chunk size trades relevance against recall, what stuffing actually costs in dollars and accuracy, and a per-request budget table you can adapt. It assumes you already have clean Markdown to retrieve over, the kind BulkMD produces locally from any page; if your sources are still raw HTML, fix that first, because boilerplate is the largest hidden line item in any budget.

Why the context window is a budget, not a capacity

A context window is a budget you allocate, not a capacity you fill. Models from every major lab now ship windows in the hundreds of thousands of tokens, with a few crossing into the millions. The naive reading is that retrieval is solved: throw the whole corpus in, let attention sort it out. That reading is wrong on three independent axes (cost, latency, and accuracy), and budgeting is how you manage all three at once.

Cost is the most obvious axis. Input tokens are billed per million, and a long window filled to capacity on every call is the fastest way to a surprising invoice. A request that sends 200,000 tokens of context costs forty times as much on the input side as one that sends 5,000 tokens of well-retrieved evidence, for the same question and often with a worse answer.
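The forty-fold figure is simply the ratio of the two input sizes, so it holds at any per-token price. A minimal sketch of the arithmetic follows; the price constant is a placeholder assumption, not any provider's real rate, and `input_cost` is a hypothetical helper name.

```python
# Input cost scales linearly with tokens sent, so the ratio between two
# requests depends only on their sizes, not on the price itself.
PRICE_PER_MILLION_INPUT = 3.00  # hypothetical USD rate, for illustration only


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT) -> float:
    """Return the input-side cost in dollars for one request."""
    return tokens / 1_000_000 * price_per_million


stuffed = input_cost(200_000)   # whole-window dump
budgeted = input_cost(5_000)    # well-retrieved evidence only
ratio = stuffed / budgeted
print(f"stuffed=${stuffed:.4f} budgeted=${budgeted:.4f} ratio={ratio:.0f}x")
```

Changing the price constant changes both dollar figures but leaves the ratio untouched, which is why the comparison is safe to make without quoting a specific model's pricing.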

Latency scales with input length too. Time-to-first-token grows with the prompt the model has to read before it can begin generating. A 200K-token prompt has measurably higher prefill latency than a 5K-token prompt, which matters for any interactive workflow where a user is waiting.

The accuracy axis is the least intuitive and the most important. The "lost in the middle" effect, documented by Liu et al. in their 2023 paper of that name and reproduced repeatedly since, shows that models recall information at the start and end of a long context far better than information buried in the middle. Filling the window does not guarantee the model uses what you put there. A tight, relevant 4,000-token context frequently outperforms a 100,000-token dump that happens to contain the same answer somewhere in its murky center.

What budgeting actually means per request

Budgeting means partitioning the window into named slices with hard caps, then fitting retrieval inside the slice reserved for evidence. Every request to a chat or completion endpoint spends its window on four things:

System and instructions — the role, the answer format, the guardrails. Usually fixed and small.
Retrieved evidence — the chunks your retriever returned. This is the flexible slice and the one you budget hardest.
The query — the user's question, plus any conversation history you carry forward.
Reserved output — max_tokens for the response. This is part of the window: the input plus the output must fit.

The discipline is to set caps on each slice up front rather than letting evidence expand to fill whatever is left. A retriever that returns "top-k by similarity" with no token ceiling will happily hand you fifty chunks when three would have answered the question, and your prompt-assembly code will dutifully paste all fifty. Cap the evidence slice in tokens, not in chunk count, because chunks vary wildly in size.

The single most useful number in a RAG system is the evidence-slice cap: the maximum tokens of retrieved content you will ever send for one question. Pick it deliberately, enforce it in code, and tune it against an evaluation set; never let it default to "whatever fit."
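Enforcing the cap takes only a few lines after retrieval. The sketch below is one way to do it, assuming the retriever hands back chunks ranked best first. The names `fit_evidence` and `rough_token_count` are hypothetical, and the four-characters-per-token estimate is only a stand-in for the tokenizer of the model you actually call.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token.

    This is an assumption for the sketch; pass a real tokenizer-backed
    counter when you need accurate numbers.
    """
    return max(1, len(text) // 4)


def fit_evidence(
    ranked_chunks: List[str],
    cap_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep the best-ranked chunks whose combined size stays within the cap.

    `ranked_chunks` must be ordered best first. Chunks are kept whole:
    the first chunk that would exceed the cap ends the selection, so
    lower-ranked chunks are dropped rather than truncated.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > cap_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


# Three chunks of 100, 200, and 100 tokens by the rough count above.
chunks = ["a" * 400, "b" * 800, "c" * 400]
evidence = fit_evidence(chunks, cap_tokens=300)
print(len(evidence))
```

Stopping at the first chunk that does not fit keeps strict rank order. A variant that skips an oversized chunk and keeps looking for smaller ones would use more of the cap, at the price of sending lower-ranked evidence.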

How big should a retrieval chunk be?

Retrieval chunks of roughly 200 to 500 tokens are the practical default for prose, because that range matches how AI engines themselves select citable passages and keeps each chunk to a single coherent idea. The chunk is your unit of both retrieval and citation, and its size controls the central trade-off in RAG: relevance versus recall.

Small chunks favor relevance

A 200-token chunk is roughly one to two paragraphs. It tends to express a single idea, so its embedding is a focused vector and a similarity search returns it only when the query genuinely matches that idea. Precision is high. The cost is recall: an answer that spans two paragraphs may require two separate chunk hits, and if your retriever returns only the top three, you can miss half the answer.

Large chunks favor recall

A 1,000-token chunk captures more surrounding context, so a single hit is more likely to contain the complete answer. The cost is relevance: the embedding now averages several ideas into one vector, which makes it match more queries weakly and fewer queries strongly. You also burn budget faster: five 1,000-token chunks is already 5,000 tokens of evidence for one question.

The middle is usually right

For most prose corpora, chunking on natural boundaries (Markdown ## and ### headings) and capping each chunk at around 300 to 500 tokens lands in the sweet spot. Heading-based chunking keeps semantically coherent units together, which is why we recommend it in the personal RAG pipeline walkthrough. Add a small overlap (roughly 10 to 15%) so a sentence split across a boundary still appears whole in at least one chunk.

# Heading-aware chunking with a token cap, using tiktoken for measurement
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o / o-series tokenizer

def chunk_markdown(md: str, max_tokens: int = 400, overlap_tokens: int = 50):
    # A step of zero or less would make the windowing loop below run forever
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")

    # Start a new section at every ## or ### heading
    sections, current = [], []
    for line in md.splitlines():
        if line.startswith(("## ", "### ")) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for section in sections:
        toks = enc.encode(section)
        if len(toks) <= max_tokens:
            chunks.append(section)
            continue
        # Split oversized sections into overlapping windows
        start = 0
        while start < len(toks):
            window = toks[start : start + max_tokens]
            chunks.append(enc.decode(window))
            if start + max_tokens >= len(toks):
                break  # this window reached the end; skip an overlap-only tail
            start += max_tokens - overlap_tokens
    return chunks

The o200k_base encoding here powers GPT-4o and the o-series; cl100k_base powers GPT-3.5 and GPT-4, and Claude uses its own tokenizer. For English prose the three are approximately comparable, so a 400-token cap measured with any of them is close enough for budgeting. Measure with the tokenizer of the model you actually call when you need precision.

The cost of stuffing versus the cost of missing context

Stuffing context is a real cost you pay on every call; missing context is a risk you pay only when retrieval fails, and the asymmetry is why disciplined budgets beat "send everything." Both failure modes are real, and the budget is where you set the balance between them.

The cost of stuffing is concrete and continuous. Every extra thousand tokens of irrelevant context is billed, adds prefill latency, and pushes useful evidence toward the attention-starved middle of the window. Stuffing also has a quiet correctness cost: more irrelevant text gives the model more opportunities to anchor on the wrong passage, and a confidently wrong answer grounded in a tangential chunk is worse than an honest "not found."

The cost of missing context is discrete and occasional. When the retriever fails to surface the chunk that holds the answer, the model either says it does not know or hallucinates. This is the failure that tempts teams to over-stuff: one missed answer feels worse than a hundred slightly-too-long prompts. But the fix for missed context is better retrieval (better chunking, hybrid keyword-plus-vector search, reranking), not a bigger context dump. Raising the evidence cap to paper over weak retrieval treats the symptom and pays for it on every single call.

This is where trimming pages to their main content earns its place in the budget. A typical article carried as raw HTML is mostly navigation, cookie banners, related-post rails, and footers. That boilerplate can be 60 to 80% of the bytes and contributes nothing to retrieval. Converting to clean Markdown first, the way BulkMD does locally in the browser, removes it before it ever reaches your embedder or your prompt. The same content as Markdown is typically 60 to 80% smaller, and a boilerplate-heavy page can shrink by roughly 93%. That reclaimed budget is the difference between fitting three clean chunks and three chunks plus a cookie banner.
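To check where your own pages fall in that range, measure the reduction directly. The sketch below compares UTF-8 byte sizes of a raw page and its cleaned version. `reduction_pct` is a hypothetical helper, and the two strings are synthetic stand-ins rather than real page data.

```python
def reduction_pct(raw: str, cleaned: str) -> float:
    """Percentage by which `cleaned` is smaller than `raw`, in UTF-8 bytes."""
    raw_bytes = len(raw.encode("utf-8"))
    if raw_bytes == 0:
        raise ValueError("raw input is empty")
    cleaned_bytes = len(cleaned.encode("utf-8"))
    return (raw_bytes - cleaned_bytes) / raw_bytes * 100


# Synthetic example: a 1,000-byte page whose main content is 250 bytes.
raw_page = "x" * 1000
clean_page = "x" * 250
print(f"{reduction_pct(raw_page, clean_page):.1f}% smaller")
```

Bytes are a proxy here. For budgeting, run the same comparison with token counts from your model's tokenizer, since markup and prose do not tokenize at the same rate.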

A context budget worksheet

Here is a worked budget for an interactive RAG assistant answering questions over a documentation corpus, sized for a model with a 200K-token window. The point of the table is not the absolute numbers (adjust them to your model and task) but the discipline of assigning every slice a cap that sums to less than the window, with headroom left unallocated on purpose.

| Slice | Budget (tokens) | % of 200K window | Notes |
| --- | --- | --- | --- |
| System and instructions | 800 | 0.4% | Fixed role, output format, refusal rules |
| Conversation history | 2,000 | 1.0% | Last ~3 turns, summarized beyond that |
| Retrieved evidence | 6,000 | 3.0% | ~15 chunks at 400 tokens, hard cap |
| User query | 200 | 0.1% | The current question |
| Reserved output (`max_tokens`) | 2,000 | 1.0% | Answer plus citations |
| Total committed | 11,000 | 5.5% | |
| Unallocated headroom | 189,000 | 94.5% | Deliberately unused |

The committed budget is 11,000 tokens against a 200,000-token window. The 94.5% you leave on the table is not waste; it is the proof that a large window is a safety margin for outlier questions, not a target to fill. If your evaluation shows answers improving as you raise the evidence cap, raise it deliberately and re-measure cost and latency. If they do not improve, the cap is right where it is.
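The worksheet translates directly into a small config object that fails loudly when the caps stop fitting. This is a sketch under the worksheet's own numbers; the class name `ContextBudget` and its field names are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Per-request token caps, mirroring the worksheet rows."""

    window: int = 200_000
    system: int = 800
    history: int = 2_000
    evidence: int = 6_000
    query: int = 200
    reserved_output: int = 2_000

    @property
    def committed(self) -> int:
        """Sum of every capped slice, including reserved output."""
        return (
            self.system + self.history + self.evidence
            + self.query + self.reserved_output
        )

    @property
    def headroom(self) -> int:
        """Tokens deliberately left unallocated."""
        return self.window - self.committed

    def validate(self) -> None:
        """Raise if the caps, including reserved output, exceed the window."""
        if self.committed > self.window:
            raise ValueError(
                f"budget of {self.committed} tokens exceeds window of {self.window}"
            )


budget = ContextBudget()
budget.validate()
print(budget.committed, budget.headroom, f"{budget.committed / budget.window:.1%}")
```

With the defaults this prints 11000 189000 5.5%, matching the table. Calling `validate()` at startup turns an over-committed budget into a configuration error rather than a truncated answer in production.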

A few rules make the worksheet hold up in production. Cap evidence in tokens and enforce it after retrieval by dropping the lowest-ranked chunks until you fit, rather than truncating mid-chunk. Summarize conversation history past a few turns instead of carrying it verbatim, because raw history is the slice that silently grows unbounded. Reserve output space explicitly, since the input plus max_tokens must fit the window and an under-reserved output gets cut off mid-sentence. And keep the system slice short: long system prompts are pure overhead paid on every call. How you serialize that evidence matters too; Markdown versus JSON versus plain text changes the token count for the same information.
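The history rule can be sketched the same way: keep the newest turns verbatim up to the history cap and hand everything older to a summarizer. The function below only does the split, since the summarization step depends on your model. `split_history` and `rough_token_count` are hypothetical names, and the four-characters-per-token estimate is an assumption standing in for a real tokenizer.

```python
from typing import Callable, List, Tuple


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer (about 4 characters per token)."""
    return max(1, len(text) // 4)


def split_history(
    turns: List[str],
    cap_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Tuple[List[str], List[str]]:
    """Split history into (older, recent) so that `recent` fits the cap.

    `turns` is ordered oldest first. Walking backwards from the newest
    turn, turns are kept verbatim until the next one would exceed the cap;
    everything earlier is returned as `older` for the caller to summarize.
    """
    used = 0
    cut = len(turns)  # index of the first turn kept verbatim
    for i in range(len(turns) - 1, -1, -1):
        cost = count_tokens(turns[i])
        if used + cost > cap_tokens:
            break
        used += cost
        cut = i
    return turns[:cut], turns[cut:]


# Five turns of about 100 tokens each against a 250-token history cap.
history = [f"turn {n}: " + "x" * 392 for n in range(5)]
older, recent = split_history(history, cap_tokens=250)
print(len(older), len(recent))
```

Remember that the summary of `older` also has to fit inside the history slice, so in practice you would reserve part of the cap for it before calling the split.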

When does the budget get bigger?

Raise the evidence cap only when an evaluation set shows accuracy climbing with it, and even then, layer prompt caching before you assume the larger budget is affordable. There are legitimate reasons to spend more of the window. Multi-hop questions that genuinely require synthesizing several documents need more evidence than a lookup question. Long-document tasks (summarizing a contract, reviewing a full specification) are not retrieval problems and reasonably consume large fractions of the window. Agentic loops accumulate tool outputs and intermediate reasoning that grow the budget over a session.

For these cases the lever is not "stuff more" but "spend efficiently." Prompt caching changes the economics of a large, stable context dramatically: a fixed corpus cached across many queries costs a fraction of the uncached price on repeat reads, which we cover in detail in the prompt caching and Markdown context breakdown. If your large budget is stable across calls, cache it and the per-call cost of the bigger window drops toward the read price. If it changes every call, caching cannot help and the full cost applies, which is itself a signal to retrieve more tightly.
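A rough cost model makes the caching trade-off concrete. The sketch below assumes a simple scheme in which the stable prefix is written to the cache once and read on every later call, with the cache never expiring in between. The price and both multipliers are illustrative placeholders, not any provider's actual rates, and `session_input_cost` is a hypothetical helper.

```python
def session_input_cost(
    stable_tokens: int,
    variable_tokens: int,
    calls: int,
    price_per_million: float,
    cache_write_multiplier: float,
    cache_read_multiplier: float,
    cached: bool,
) -> float:
    """Total input cost in dollars for `calls` requests sharing a stable prefix.

    All prices and multipliers are caller-supplied assumptions. With caching,
    the stable prefix is written once and read on every later call; the
    variable part is always billed at the full rate.
    """
    per_token = price_per_million / 1_000_000
    variable_cost = variable_tokens * per_token * calls
    if not cached or calls == 0:
        return stable_tokens * per_token * calls + variable_cost
    write_cost = stable_tokens * per_token * cache_write_multiplier
    read_cost = stable_tokens * per_token * cache_read_multiplier * (calls - 1)
    return write_cost + read_cost + variable_cost


# Placeholder numbers: a 50K-token stable corpus, 1K of per-call content.
common = dict(
    stable_tokens=50_000,
    variable_tokens=1_000,
    calls=100,
    price_per_million=3.00,
    cache_write_multiplier=1.25,
    cache_read_multiplier=0.10,
)
uncached = session_input_cost(cached=False, **common)
with_cache = session_input_cost(cached=True, **common)
print(f"uncached=${uncached:.2f} cached=${with_cache:.2f}")
```

Setting `stable_tokens` to zero models the case where the context changes on every call: both totals collapse to the same number, which is the arithmetic behind the advice to retrieve more tightly when caching cannot help.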

TL;DR

Context window budgeting in 2026 is allocation, not maximization. Partition each request into fixed slices (system, history, retrieved evidence, query, reserved output), cap the evidence slice in tokens rather than chunk count, and tune that cap against an evaluation set instead of letting it default to whatever fit. Chunk prose at roughly 200 to 500 tokens on heading boundaries, fix missed context with better retrieval rather than a bigger dump, and trim pages to their main content before any of this so boilerplate does not eat the budget. The next concrete step: count the tokens your current pipeline sends per question, compare that count to the worksheet above, and if your sources are still raw HTML, install BulkMD from the Chrome Web Store to convert them to clean Markdown locally before they ever reach your embedder.

Frequently asked questions

If my model has a 1M-token context window, why not just send everything?

Three reasons: cost (input is billed per token, so a full window on every call is expensive), latency (prefill time grows with prompt length), and accuracy. The 'lost in the middle' effect means models recall information at the start and end of a long context far better than the middle, so filling the window does not guarantee the model uses what you put there. A tight, relevant context often beats a large dump containing the same answer.

What chunk size should I use for RAG?

For prose, roughly 200 to 500 tokens per chunk is the practical default. Smaller chunks favor relevance (focused embeddings, high precision) but can hurt recall when an answer spans paragraphs. Larger chunks favor recall but dilute the embedding and burn budget faster. Chunk on natural boundaries like Markdown headings, cap each chunk at around 400 tokens, and add a 10 to 15% overlap so sentences split across a boundary still appear whole somewhere.

Should I cap retrieved evidence by chunk count or by tokens?

By tokens. Chunks vary widely in size, so a 'top-5 chunks' cap can mean 1,000 tokens one time and 5,000 the next. Set a hard token cap for the evidence slice and, after retrieval, drop the lowest-ranked chunks until you fit rather than truncating mid-chunk. This keeps cost and latency predictable per request.

How does trimming pages to content reduce my context budget?

A raw HTML page is mostly boilerplate (navigation, cookie banners, related-post rails, footers), which can be 60 to 80% of the bytes and adds nothing to retrieval. Converting to clean Markdown first removes that boilerplate, typically making the content 60 to 80% smaller (roughly 93% on boilerplate-heavy pages). That reclaimed budget goes to actual evidence instead of UI chrome.

When should I raise the evidence budget instead of improving retrieval?

Raise the budget only when an evaluation set shows accuracy climbing as you add evidence; that points to multi-hop questions that genuinely need more documents. If accuracy is flat or worse, the problem is retrieval quality (chunking, hybrid search, reranking), and a bigger dump just adds cost and dilutes attention. Fix retrieval first; spend more budget only when the data says it helps.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.



Tagged: RAG, LLM context, Tokens, Cost optimization, Markdown

Context Window Budgeting for RAG in 2026