BulkMD

Anthropic Prompt Caching + Markdown: 90% Cost Reduction

How pairing Anthropic prompt caching with clean Markdown context drops repeat-query costs to ~10% of baseline — with reproducible numbers from a real workflow.

M. H. Tawfik12 min read

If you have been paying Anthropic's per-million-token bill for any meaningful workload, the single largest cost-cutting lever available to you in 2026 is prompt caching. Combined with the Markdown-as-context discipline we covered in the token-cost breakdown, the savings compound: clean Markdown reduces the input size, and caching reduces what you pay per token on repeat reads. The math, done correctly, lands at roughly 10% of the uncached HTML baseline for a long-running workflow against a fixed corpus.

This post is the practical version of that claim. We will walk through how Anthropic's caching actually works, the exact request shape that triggers it, the pitfalls that silently destroy your hit rate, and a worked example from a real BulkMD-style workflow where the bill dropped by an order of magnitude.

How Anthropic prompt caching works in 2026

Prompt caching, available across the Claude 4.x line, lets you mark a prefix of your messages as cacheable. On the first call, Anthropic stores the embedded representation of that prefix in their cache and charges you a 25% premium on those tokens — a "cache write" cost. On subsequent calls within the cache's TTL, if the same prefix appears byte-identically at the start of your messages, Anthropic returns the cached embedding and charges only ~10% of the normal input price — a "cache read."

The default TTL is five minutes from the most recent read, which sounds short until you remember it resets on every hit. A workflow that queries every 30 seconds keeps the cache warm essentially forever. A workflow that queries once an hour gets exactly one hit per cycle. Anthropic also offers a one-hour TTL in beta for predictable long-running jobs; the read price is the same, the write price is roughly double the 5-minute write.

The unit of caching is the message prefix marked with a cache_control block. You can mark up to four prefixes per request, which lets you stage the cache: system instructions cached for an hour, document corpus cached for five minutes, dynamic query never cached. Each cached prefix needs to be at least 1,024 tokens for Sonnet 4.6 and 2,048 tokens for Opus 4.7 — below those thresholds the request still works but no cache entry is created.

The implementation rule that matters most is also the easiest to miss: anything before a cached block is part of the cache key, so a single byte change at position zero invalidates everything behind it.

What the request looks like in practice

Here is the canonical shape we use for a Markdown-context Q&A workflow against the Anthropic API, in Python with the official SDK:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a research assistant. Answer strictly from the provided sources.",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": MARKDOWN_CORPUS,  # 50K tokens of clean Markdown
                    "cache_control": {"type": "ephemeral"},  # 5-min default TTL
                },
                {
                    "type": "text",
                    "text": f"Question: {user_query}",  # changes every call
                },
            ],
        }
    ],
)

Three things are doing the work. The system prompt is cached for an hour because it never changes. The Markdown corpus is cached for the default five minutes because it changes between sessions but not between queries within a session. The user query is never cached because it is different every time. The result is that a session of fifty queries against a 50K-token corpus pays the cache-write premium exactly once, then pays cache-read prices for the remaining 49 calls.

How big is the saving, really?

We benchmarked this on a real workflow: fifty queries against a corpus of 142 web pages converted to clean Markdown by BulkMD, with each query asking for citations to specific sections. The numbers below are from Claude Opus 4.7 at standard pricing as of May 2026.

ConfigurationPer-call input tokensPer-call cost (input)50-call total
Raw HTML, no caching285,000$4.28$214.00
Markdown, no caching58,000$0.87$43.50
Markdown + 5-min cache58,000 (first), then 58,000 read$1.09 (write), $0.087 (read)$5.36
Markdown + 1h cache58,000 (first), then 58,000 read$1.16 (write), $0.087 (read)$5.42

The final row is the bottom of the funnel: $5.42 for fifty queries that would have cost $214 in their unoptimized form. That is a 97.5% cost reduction end-to-end, and roughly two-thirds of the saving comes from caching with the remainder from the Markdown conversion. Both layers compound; neither alone gets you there.

The output-token cost is unchanged across all rows because caching only applies to input tokens. If your workflow is output-heavy — generating long reports, writing code — the percentage savings will be smaller in absolute terms but still substantial.

The pitfalls that silently destroy your cache hit rate

The hardest part of prompt caching in production is not getting it working — it is keeping it working. Four patterns reliably break the cache without producing any visible error, and they are responsible for nearly every "I implemented caching but didn't see the savings" report.

The first is non-deterministic context. If your corpus is assembled from a database query that returns rows in a non-stable order, every call produces a different byte sequence at the same logical position, and the cache key changes on every request. The fix is to sort the corpus assembly deterministically — by URL, by date, by hash, whatever — before serializing into the prompt.

The second is timestamps embedded in the prompt. A "current date is 2026-05-26" line at the top of your system prompt invalidates the cache every midnight. If the model genuinely needs the date, put it after the cache boundary; if it does not, omit it.

The third is mixing model versions. The cache is keyed on (prefix, model_id), so switching from claude-opus-4-7 to claude-sonnet-4-6 for a single call wipes the prefix's hit count for that model. If you are routing some queries to a smaller model, pre-warm the cache on that model independently.

The fourth is TTL expiration during sparse workflows. A query rate slower than once per five minutes will see almost no cache hits with the default TTL. The fix is to bump to the one-hour TTL beta for those workflows, or to schedule a cheap keep-alive query (a token-minimal "ping" prompt) every four minutes to refresh the cache. We use the keep-alive pattern when 1-hour TTL is not yet available for a particular model.

Instrumentation: how to know it is actually working

Anthropic returns cache statistics in every response under response.usage. The relevant fields are cache_creation_input_tokens (tokens written to cache on this call) and cache_read_input_tokens (tokens read from cache on this call). A correctly functioning cache shows zero on the first call, non-zero cache_creation for that first call, and non-zero cache_read with zero cache_creation on every subsequent call until TTL expires.

The instrumentation pattern we recommend is to log these fields on every call and compute a rolling hit rate. If your hit rate is below ninety percent for a workflow that should be hitting the cache, one of the four pitfalls above is in play. Add the diff between consecutive prompts to your logs and compare byte-by-byte; the offending byte will surface within minutes.

usage = response.usage
hit_rate = usage.cache_read_input_tokens / max(usage.input_tokens, 1)
print(f"Cache hit: {hit_rate:.1%} ({usage.cache_read_input_tokens}/{usage.input_tokens})")

That one line of telemetry has saved us multiple thousand-dollar surprises. Ship it before you trust the cache to save you anything.

When caching does not help

Prompt caching is not free in every scenario. The 25% write premium means that if you genuinely query your corpus only once before changing it, caching is a net loss. The threshold is roughly two reads per write — below that, the write premium exceeds the read savings, and you would have been better off not caching.

The classic non-cache scenario is a one-shot ingest pipeline: import a hundred pages once, ask one question, throw it away. There is no second call to hit the cache. Skip the cache_control markers entirely, take the standard input pricing, move on.

The other scenario is corpora that change between every call. If your workflow rewrites the corpus on every iteration — say, an agent that's progressively refining a document — there is no stable prefix to cache. Put cache_control only on the parts that genuinely do not change (system prompt, framing instructions) and accept that the corpus itself is uncacheable.

How this combines with the Markdown discipline

The headline cost reduction in this post — 97.5% — comes from stacking two independent improvements. Markdown conversion cuts the input size from 285K tokens to 58K tokens, a 4.9× compression. Prompt caching then cuts the per-token price on repeat reads from $0.015 to $0.0015, a 10× reduction. Multiplied, those produce roughly 49× cost compression, which matches the empirical result in the table above to within rounding.

Neither layer alone gets you there. Caching raw HTML would still leave you paying for 285K tokens of attribute soup on every write and reading 285K tokens of cached embeddings on every read — better than $4.28, but nowhere near $0.087. Markdown without caching would drop you to $0.87 per call but pay full price on every call. The combination is what makes long-running, Markdown-shaped, cache-friendly workflows roughly an order of magnitude cheaper than the naive approach.

For workflows that look like this — RAG against a fixed corpus, repeated queries against documentation, agent loops over a stable knowledge base — the combination should be the default. The agent context primer covers how to structure the Markdown so that the agent reads it well; this post covers how to make sure you are not paying retail every time.

TL;DR

Anthropic prompt caching is the largest single cost lever for any sustained Claude workload, and it stacks multiplicatively with the Markdown-as-context discipline. Put cache_control on your system prompt and your document corpus, leave the dynamic user query uncached, instrument every call with the cache_read_input_tokens field, and avoid the four pitfalls that silently destroy hit rate. A workflow that costs $200 a day uncached drops to $5–10 a day when both layers are applied correctly.

If you need to convert hundreds of source pages into the kind of clean, cache-friendly Markdown the table above assumes, BulkMD is the free Chrome extension that produces it — and the deterministic output shape it generates is exactly what keeps the cache key stable across calls.

Frequently asked questions

Does prompt caching work with the OpenAI / GPT API too?

OpenAI introduced an automatic prompt-caching feature for GPT-4o and later models in late 2024, with similar economics — automatic, opt-out rather than opt-in, and ~50% savings on cached reads. The exact percentage and TTL differ from Anthropic's, but the pattern (stable prefix, dynamic suffix) carries over directly. The numbers in this post are Anthropic-specific.

What's the minimum corpus size to make caching worthwhile?

The cacheable-prefix minimum is 1,024 tokens for Sonnet 4.6 and 2,048 tokens for Opus 4.7. Below those thresholds the request still works but no cache entry is created. The economic minimum — where caching saves more than it costs — is roughly two reads per write, regardless of corpus size.

Can I cache parts of the corpus that change while keeping the static parts cached?

Yes, with multi-level cache breakpoints. Mark a 'static' block with cache_control, then a 'dynamic' block without, then the user query. The static prefix stays cached across calls even when the dynamic block changes; only the dynamic portion pays full input price. Anthropic allows up to four cache_control markers per request.

How do I monitor cache hit rate in production at scale?

Aggregate `cache_read_input_tokens / input_tokens` across all calls into a metric you can dashboard. We push this to Datadog as `claude.cache.hit_rate`, alarmed if it drops below 0.85 for any five-minute window. Drops below that threshold almost always trace back to one of the four pitfalls in this post.

Does the 1-hour TTL beta have any catches?

Two worth knowing. The write premium is roughly double the 5-minute write (50% over base rather than 25%), so the breakeven moves slightly. And availability is per-model and per-account; check your account's beta access before relying on it for a workflow. For most workflows the 5-minute TTL with a cheap keep-alive ping is functionally equivalent and works on every account today.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.

ShareXinHN
TaggedCost optimizationTokensClaudePrompt engineeringRAG