BulkMD

Claude Model Routing: Haiku vs Sonnet vs Opus for RAG

When to call Haiku, Sonnet, or Opus in a RAG pipeline — a measured comparison of cost, latency, and answer quality across the Claude 4.x lineup in 2026.

M. H. Tawfik11 min read

If you have built a RAG pipeline by following the personal-RAG walk-through and now stare at your monthly Anthropic bill wondering whether you really need Opus on every query, the answer is almost certainly no. Most RAG queries — the routine ones, the ones with one obvious passage to cite — are handled equally well by Haiku 4.5 at roughly one-tenth the cost. The interesting question is how to identify which queries belong on which model, and how to wire that routing in without ad-hoc heuristics that break the moment your corpus changes.

This post is the measured analysis of that routing decision across the Claude 4.x family in 2026, with numbers from our own RAG evaluation set — the same one used in the RAG pipeline walk-through — and a concrete routing implementation. The corpus underlying the eval set is a real BulkMD-converted set of pages; if you want to reproduce it with your own corpus, BulkMD is the conversion tool that produces the clean Markdown the evaluation runs against. For the broader context on caching and per-format token costs, the prompt caching post and the token-math breakdown cover the upstream cost levers.

The pricing reality in 2026

Anthropic's price ladder across the Claude 4.x line in 2026, per million input/output tokens:

ModelInput priceOutput priceCache writeCache read
Claude Haiku 4.5$0.80 / $4$4$1 / $5$0.08 / $0.40
Claude Sonnet 4.6$3 / $15$15$3.75 / $18.75$0.30 / $1.50
Claude Opus 4.7$15 / $75$75$18.75 / $93.75$1.50 / $7.50

The price spread is roughly 1× : 4× : 20×. A RAG query that costs you $0.001 on Haiku costs $0.020 on Opus — twenty times the bill for what is often an identical answer because the retrieved chunk made the question trivial. Routing matters not because Haiku is cheap but because the marginal Opus query is expensive.

The cache prices are worth flagging separately. The cache-read price drops to roughly 10% of input on every model, which means the post-cache cost gap narrows to 1× : 3× : 19× — caching benefits the cheapest model proportionally less because Haiku's base price was already low. This matters when planning a routing strategy: you cannot cache your way out of Opus, but you can cache your way into making Sonnet feasible.

Quality across the lineup on RAG tasks

We ran the same 240-question evaluation set from the RAG pipeline post against each model, holding everything else constant (same chunks retrieved, same prompt template, same temperature). Answer accuracy was graded by a separate Claude pass against the ground-truth answer, and we hand-spot-checked 60 to confirm the auto-grader was calibrated.

ModelAnswer accuracyMedian latencyCost per 100 queries
Haiku 4.579%600 ms$0.04
Sonnet 4.687%1.2 s$0.15
Opus 4.789%2.4 s$0.75
Routed (Haiku → Opus on low confidence)88%750 ms (median)$0.11

Three things to read here. Haiku is dramatically cheaper but lands two percentage points below Sonnet and ten below Opus on raw accuracy. The Sonnet-to-Opus jump is only two points — much smaller than the price jump, and large enough that Opus is almost never the right default for ordinary RAG. The routed configuration (use Haiku, escalate to Opus on low confidence) lands one point below all-Opus on accuracy at 14% of the cost. That is the headline result.

The latency column matters as much as the cost column for interactive workflows. A user sitting at a chat interface notices a 2.4-second wait far more than a 0.6-second wait, even though the dollar cost difference per query is trivial at low volume. Haiku-by-default produces a feel of responsiveness that Opus-by-default fundamentally cannot match at current latencies.

What "low confidence" actually means

The routing trick is identifying which Haiku answers to escalate. Two signals work in practice; both are cheap to compute on the same response.

The first signal is the model's own stated confidence. Ask Haiku to begin its answer with a confidence score from 1 to 10 ("Confidence: 7"). For RAG queries with strong retrieval, Haiku scores itself 8 or above the vast majority of the time. For genuinely ambiguous queries — where the retrieved chunks are weakly relevant or contradict each other — it self-reports 5 or 6, and these are the queries that benefit most from Opus.

def haiku_with_confidence(query: str, context: str) -> tuple[str, int]:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system="""Answer strictly from the provided sources.
Begin your answer with a confidence score from 1 to 10 on a line of its own,
in the form: 'Confidence: N'. Then provide your answer.""",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": context,
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": f"Question: {query}"},
            ],
        }],
    )
    text = response.content[0].text
    match = re.search(r"Confidence:\s*(\d+)", text)
    confidence = int(match.group(1)) if match else 5
    return text, confidence

The second signal is retrieval-side: how similar are the top retrieved chunks to the query, and how spread are their scores. A query with one obvious top chunk (similarity 0.85, second chunk 0.62) is almost always handled well by Haiku. A query where the top five chunks all have similarity 0.55–0.60 is the ambiguous case, and benefits from Opus's stronger synthesis.

Combining both signals — "escalate if stated confidence < 7 OR retrieval top-1 similarity < 0.65" — produced the routed-pipeline numbers in the table above. On our 240-question set this escalates about 18% of queries to Opus and lands within one accuracy point of all-Opus at one-seventh the cost.

When Sonnet is the right default

The table above frames the choice as Haiku-vs-Opus with Sonnet as a middle option, but for some workloads Sonnet is the right primary model rather than a fallback.

Sonnet's specific advantage is multi-step reasoning over retrieved context. When the answer requires synthesizing two or three retrieved chunks — comparing API versions, reconciling conflicting documentation, deriving a conclusion from a benchmark plus a methodology — Sonnet outperforms Haiku by a margin that Opus does not extend further. On synthesis-shaped queries in our eval set, Haiku scored 64%, Sonnet 84%, and Opus 87%. The Sonnet-vs-Opus gap is three points; the Haiku-vs-Sonnet gap is twenty.

If your RAG workload skews toward synthesis questions — research assistants, comparative analysis tools, anything that asks "how do these sources agree or disagree" — making Sonnet the primary model and escalating to Opus on confidence is a better routing shape than starting from Haiku. The all-Sonnet baseline costs roughly 4× Haiku but produces meaningfully better answers on the part of the workload that matters.

The cost math in production

Putting numbers on a realistic workload: a personal RAG that runs 100 queries a day against a 400-document corpus. Costs per month for each strategy:

StrategyMonthly cost
All Opus$22.50
All Sonnet$4.50
All Haiku$1.20
Routed: Haiku → Opus on low confidence$3.30
Routed: Sonnet → Opus on low confidence$5.40

For most personal use cases the Haiku-default routed pipeline is the right answer — it lands within one accuracy point of all-Opus at one-seventh the cost. For workloads that skew toward synthesis, the Sonnet-default routed pipeline is the right answer — slightly more expensive but materially better on hard queries.

A workload where all-Opus is justified looks like: low volume (under 50 queries a day), high stakes per query (each answer feeds a downstream decision), and queries that are predominantly synthesis-shaped. For everything else, routing dominates.

The batch API as a second cost lever

Anthropic's Batch API runs the same models at 50% of the per-token cost in exchange for asynchronous delivery (results within 24 hours rather than seconds). For workflows that are not interactive — offline RAG indexing passes, scheduled summarization, archival research — the batch API is a free 2× discount on top of whatever routing strategy you choose.

The catch is the asynchronicity. Batch-API queries return when they return; you cannot show a user a "thinking" spinner for 24 hours. The pattern that works is to run the bulk of a workload through the batch API at cron-driven intervals (overnight, weekly) and reserve real-time API calls for queries that actually need real-time results. We use this split in our own knowledge-base refresh workflow — nightly batch summarization plus on-demand real-time Q&A.

TL;DR

Most RAG queries are not Opus-shaped. Claude Haiku 4.5 handles 70-80% of queries at the answer quality of Opus for one-tenth the cost; a confidence-routed pipeline that escalates only the genuinely hard queries to Opus lands within one accuracy point of all-Opus at one-seventh the total cost. Sonnet is the right default for synthesis-heavy workloads. The batch API doubles the savings on workflows that can wait 24 hours for results.

If you want a clean, well-shaped Markdown corpus to actually run any of these models against — the kind of corpus where chunks carry their own structural anchors and the model can cite by source name — BulkMD is the free Chrome extension that produces it. Pair the clean output with a Haiku-default routed pipeline and a personal RAG that answers as well as Opus costs you under five dollars a month.

Frequently asked questions

Can I use OpenAI models instead of Claude in the same routed pattern?

Yes. The same shape — cheap-by-default, escalate-on-uncertainty — works with GPT-4o-mini → GPT-4o → o1. The pricing ratios and quality gaps are slightly different (the gap between mini and 4o is smaller than the gap between Haiku and Opus), but the routing logic transfers directly.

How do I measure whether routing is actually helping me?

Run your evaluation set against each model individually and against the routed pipeline. Compare accuracy and cost across the four configurations. If the routed pipeline lands within a percentage point of your top-quality model at materially lower cost, routing is working. If accuracy drops by more than 2 percentage points, your confidence threshold is too aggressive.

Should I use Haiku for the confidence-scoring step, or a smaller model?

Use Haiku. Smaller models (open-source 7B-parameter classifiers, OpenAI's text-moderation) are cheaper but produce noisier confidence signals on retrieval-style queries. The price difference between Haiku-with-confidence and a separate classifier is negligible at personal scale, and the confidence signal you get back is meaningfully better-calibrated.

Does prompt caching work with model routing?

Yes, and they compound. Cache your retrieved context once with Haiku; if you escalate to Opus, cache the same context separately under Opus's cache. The cache write cost for the small fraction of escalated queries is dwarfed by the savings on the queries that stayed on Haiku. Both layers stack.

Is the batch API worth setting up for personal scale?

Only for non-interactive workloads. The setup overhead is modest (the batch API uses the same SDK with a different endpoint), but the value depends on having queries that can wait 24 hours. For interactive personal Q&A, the real-time API at routed-model pricing is already cheap enough that the batch discount does not justify the latency.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.

ShareXinHN
TaggedClaudeCost optimizationRAGTokens