If you have ever stared at a 200K-token Anthropic bill and wondered why the same article costs different amounts to send depending on whether it has code or tables in it, you have hit one of the under-appreciated facts of LLM cost: different content types tokenize at very different densities. A 1,000-byte block of English prose is roughly 280 tokens. A 1,000-byte block of TypeScript is roughly 450 tokens. A 1,000-byte JSON object can be anywhere from 300 to 700 tokens depending on how it is formatted. Knowing these ratios before you build a workflow is the difference between predicting your bill and being surprised by it.

This post is the per-content-type token math, measured on the cl100k_base tokenizer that powered GPT-3.5 and GPT-4. Newer OpenAI models (GPT-4o and the o-series) moved to the larger o200k_base tokenizer, and Claude uses its own — but the relative Markdown-vs-HTML savings transfer across all of them within a few percentage points, because they are all byte-pair encoders trained on similar web text. The data comes from BulkMD's own corpus of converted pages, where we have tokenized millions of bytes of real web content and seen the per-format breakdown across every common shape. If you have not read the broader token-cost breakdown — which covers the macro story of why Markdown is the right shape — start there; this post is the granular companion.

How we measured the numbers

For each content type, we took ten representative samples — real, not synthetic — and ran them through OpenAI's tiktoken library against the cl100k_base encoding. The numbers below are medians of the ten samples per type, reported as characters per token (higher = denser = cheaper). The samples spanned a deliberate range: prose was English-language blog posts, code was TypeScript and Python, tables were API reference rows, lists were package.json scripts, JSON was arbitrary API responses.
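The procedure above can be sketched in a few lines. This is a minimal illustration, not the exact script behind the table: `chars_per_token`, `median_density`, and `cl100k_encode` are hypothetical helper names, and the encoder is passed in as a plain function so the arithmetic can be checked without downloading a tokenizer.

```python
from statistics import median
from typing import Callable, Iterable, List

Encoder = Callable[[str], List]


def chars_per_token(text: str, encode: Encoder) -> float:
    """Characters divided by tokens for one sample (higher = denser)."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("sample produced no tokens")
    return len(text) / len(tokens)


def median_density(samples: Iterable[str], encode: Encoder) -> float:
    """Median chars/token across the samples of one content type."""
    return median(chars_per_token(s, encode) for s in samples)


def cl100k_encode(text: str) -> List[int]:
    """Encode with tiktoken's cl100k_base (third-party: pip install tiktoken)."""
    import tiktoken  # imported lazily so the helpers above work without it

    return tiktoken.get_encoding("cl100k_base").encode(text)
```

With tiktoken installed, `median_density(prose_samples, cl100k_encode)` would give one cell of the chars/token column, where `prose_samples` is your own list of strings.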

The cl100k tokenizer is a byte-pair-encoding scheme trained on a web crawl. It has favorable splits for common English words, common code keywords, and pure ASCII whitespace patterns — and unfavorable splits for unusual punctuation, mixed-case identifiers, and embedded HTML attributes.

The headline table

Content type	Format	Chars / token	Tokens per 1KB
English prose	Markdown body text	3.6	280
English prose	Same text in `<p>` tags	3.1	325
Bullet list	Markdown `-` list	3.7	270
Bullet list	`<ul><li>` HTML	2.9	350
Numerical table	Markdown pipe table	3.2	315
Numerical table	HTML `<table>`	1.6	625
Numerical table	JSON array of objects	2.3	435
Numerical table	CSV	3.4	295
Code (TypeScript)	Fenced ```ts block	2.2	460
Code (TypeScript)	Same code in `<pre>` HTML	2.1	475
Code (Python)	Fenced ```py block	2.4	420
JSON	Pretty-printed (2-space indent)	2.5	400
JSON	Minified, no whitespace	3.0	335
YAML	Standard 2-space indent	3.0	335

A few patterns jump out. Markdown beats HTML on every comparable content type, by a margin that ranges from a few percent on code to roughly 50% on tables. The widest gap is on tables: a Markdown pipe table is roughly twice as dense as the same data in <table> HTML, because every cell in the HTML version pays for <td> and </td> tokens that the pipe character collapses to almost nothing.
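The two numeric columns are two views of one measurement, and the Markdown-vs-HTML gaps follow from them directly. Here is a small sketch of that arithmetic, assuming 1KB means 1,000 characters of mostly ASCII text (which is what the rounded table values imply). The function names are illustrative.

```python
def tokens_per_kb(chars_per_token: float, kb_chars: int = 1000) -> float:
    """Convert a density in chars/token into tokens per kilobyte."""
    return kb_chars / chars_per_token


def token_saving(dense_cpt: float, sparse_cpt: float) -> float:
    """Fraction of tokens saved by moving from the sparse format to the dense one."""
    return 1 - sparse_cpt / dense_cpt


# Markdown vs HTML pairs from the table: (Markdown chars/token, HTML chars/token)
pairs = {
    "prose": (3.6, 3.1),
    "list": (3.7, 2.9),
    "table": (3.2, 1.6),
    "code": (2.2, 2.1),
}

for name, (md, html) in pairs.items():
    print(f"{name:6s} {tokens_per_kb(md):4.0f} vs {tokens_per_kb(html):4.0f} tokens/KB, "
          f"saving {token_saving(md, html):.0%}")
```

Because the table values are rounded medians, percentages computed from the chars/token column can differ by a point or two from those computed from the tokens-per-1KB column.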

Code is the worst-tokenizing content type at the byte level, and the format does not help much. Whether you wrap your TypeScript in a fenced Markdown block or in HTML <pre>, the cost is similar (2.2 vs 2.1 chars/token) because the dominant cost is the code's own punctuation and identifier soup, not the surrounding fence. The Markdown fence is still strictly better because it carries the language hint and renders cleaner everywhere, but the byte-level token savings here are small.

Why prose tokenizes so well

English prose sits at the dense end of the table (only Markdown bullet lists edge it out, 3.7 to 3.6) for a reason that traces back to how cl100k_base was trained. The tokenizer was fit on a web corpus that is overwhelmingly English text, so the byte-pair merges that the algorithm learned favor common English words and word fragments. The word "configuration" becomes one or two tokens; the word "configfile_settings_override" becomes five or six.

This is also why Markdown prose tokenizes slightly better than the same prose in HTML. The HTML cost is not in the words themselves — those tokenize identically — but in the surrounding tag soup. Every <p> and </p> is a few tokens; every <a href="..."> is several more. The prose-in-Markdown row in the table above already strips those, which is why 3.6 chars/token beats 3.1.

The headline finding is unintuitive: once markup overhead is stripped away, code is the most expensive content type you can send to an LLM per byte, even though it feels structured and "machine-readable."

Why tables are where Markdown wins hardest

The most dramatic format-driven saving in the table above is for numerical tables: 3.2 chars/token for Markdown pipe tables versus 1.6 chars/token for HTML, a 2× compression. The reason is that <table>, <thead>, <tbody>, <tr>, <th>, <td> and their closing tags each cost real tokens, and the row and cell tags repeat on every row. A 20-row × 5-column HTML table carries roughly 250 tags of pure overhead, each costing at least one token, on top of the data itself. The Markdown equivalent uses pipes, a single character per cell boundary, and a header separator line of dashes.
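To make the overhead concrete, the sketch below builds the same 20 × 5 table both ways and counts the structural markers. The table contents and the helper names `to_html` and `to_markdown` are made up for illustration. The script counts tags and characters rather than tokens, so treat it as a lower bound on the HTML penalty.

```python
import re

ROWS, COLS = 20, 5  # one header row plus 19 data rows
header = [f"col{c}" for c in range(COLS)]
body = [[str(r * COLS + c) for c in range(COLS)] for r in range(ROWS - 1)]


def to_html(header, body):
    """Render the table as compact HTML with no attributes or whitespace."""
    head = "<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    rows = "".join(
        "<tr>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>" for row in body
    )
    return f"<table><thead>{head}</thead><tbody>{rows}</tbody></table>"


def to_markdown(header, body):
    """Render the same table as a Markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


html = to_html(header, body)
md = to_markdown(header, body)

tag_count = len(re.findall(r"</?[a-z]+>", html))  # opening and closing tags
pipe_count = md.count("|")
print(tag_count, "HTML tags;", pipe_count, "Markdown pipes")
print(len(html), "chars of HTML;", len(md), "chars of Markdown")
```

For this layout the HTML version carries 246 tags against 126 pipes. Real-world HTML usually adds attributes and indentation on top of that.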

This is the single biggest concrete reason to convert HTML to Markdown for any context window where you are paying per token. Articles with one or two HTML tables routinely shrink by 30–40% on conversion to Markdown without losing any information, because the tables themselves shrink so dramatically.

The CSV row in the table is also worth noting: at 3.4 chars/token, CSV is denser than Markdown tables for the same data. If you are bulk-feeding numerical data and do not need the model to read the columns alongside other content, raw CSV can be even cheaper than Markdown tables. The tradeoff is that the model handles Markdown tables more accurately when answering questions that require column awareness; CSV gets reasoned over more like a stream of comma-separated tokens.

Why JSON has a wide range

The JSON rows in the table show a meaningful spread between pretty-printed and minified versions. Pretty-printed JSON pays for every space character and every line break — small individually, but they compound across a large object. Minifying the same JSON saves 15–20% of tokens on typical API responses without changing the information.

This matters because many RAG pipelines store and ship JSON. If your pipeline serializes records as pretty-printed JSON and sends them to the model, switching to minified JSON is one of the easiest token wins available — a single argument change on json.dumps saves a measurable percentage of your bill.
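As a rough sketch of that change, with a made-up `record` standing in for whatever your pipeline serializes: dropping `indent` removes the line breaks and indentation, and passing `separators` additionally removes the space that `json.dumps` puts after each comma and colon by default.

```python
import json

# Hypothetical record; substitute a real object from your own pipeline.
record = {
    "id": 1042,
    "title": "Token math by content type",
    "tags": ["tokens", "markdown"],
    "stats": {"chars": 1000, "tokens": 280},
}

pretty = json.dumps(record, indent=2)
# No indent means no line breaks; separators=(",", ":") removes the
# default spaces after commas and colons as well.
minified = json.dumps(record, separators=(",", ":"))

assert json.loads(pretty) == json.loads(minified)  # same information
print(len(pretty), "chars pretty-printed;", len(minified), "chars minified")
```

Character counts are only a proxy here. Run both strings through your tokenizer to see the actual token difference on your data.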

That said, JSON is not the right format for human-readable context. Models can parse it, but they reason over it less reliably than Markdown for document-shaped content. JSON is right when the model needs structured fields by name; Markdown is right when the model needs prose with structure. Use both, in the right places.

Where the byte-level numbers mislead

The tokens-per-kilobyte view captures the format cost, but it misses two things that matter in practice.

The first is that real content is a mix of types. A typical technical blog post is mostly prose with one or two code blocks and maybe a table. The blended rate ends up around 3.2–3.4 chars/token: better than pure code, worse than pure prose. When estimating costs for a workflow, multiply your raw size in characters by 0.31 (roughly 1/3.2) for a reasonable mid-range token estimate; pure-prose workflows will run cheaper, code-heavy workflows more expensive.
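A blended rate can be worked out from the table if you know roughly how your content splits by type. The sketch below assumes a hypothetical post that is 80% prose, 15% TypeScript, and 5% Markdown table by character count; the function names are illustrative. Token counts add across sections, so the blend is a weighted harmonic mean of the densities rather than a plain average.

```python
from typing import Mapping, Tuple


def blended_chars_per_token(mix: Mapping[str, Tuple[float, float]]) -> float:
    """mix maps a label to (share of characters, chars/token for that type)."""
    total_share = sum(share for share, _ in mix.values())
    tokens_per_char = sum(share / cpt for share, cpt in mix.values())
    return total_share / tokens_per_char


def estimate_tokens(n_chars: int, chars_per_token: float = 3.2) -> int:
    """Rough token estimate for n_chars of mixed content."""
    return round(n_chars / chars_per_token)


# Hypothetical post: 80% prose, 15% TypeScript, 5% Markdown table (by characters).
post_mix = {"prose": (0.80, 3.6), "code": (0.15, 2.2), "table": (0.05, 3.2)}
rate = blended_chars_per_token(post_mix)
print(f"blended rate: {rate:.2f} chars/token")
print("50,000-char corpus:", estimate_tokens(50_000), "tokens at the default rate")
```

Swap in your own shares to see how quickly a code-heavy mix pulls the rate down.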

The second is that the model's downstream cost depends on more than the tokenizer. Tables that the model has to reason over cell-by-cell consume "thinking" effort the tokenizer cannot measure. Code blocks invoke the model's syntax-aware reasoning paths. For dense-prose Q&A, the per-token math above predicts cost accurately; for code-generation or table-reasoning workflows, the model's compute cost (output tokens, reasoning steps) is a meaningful additional factor.

How to apply these numbers

For most readers of this post, the actionable takeaway is to stop guessing and start measuring. Two concrete steps tip the math in your favor in any workflow.

First, run your real workflow's context through tiktoken once and look at the distribution. If 60% of your tokens are tables in HTML form, converting those tables to Markdown is your highest-leverage cost reduction. If 30% are code, you cannot compress much further but you can at least language-tag the fences so the model handles them well. The benchmark above tells you which optimization to chase, but only your own corpus tells you which one matters.
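One way to get that distribution for content that is already Markdown is a crude line classifier like the one below. It is a heuristic sketch, not a parser: `token_share_by_type` is a hypothetical helper, it treats any line starting with a pipe as a table row, and HTML input would need a real HTML parser instead. The `encode` argument is any function that returns a token list, for example the `encode` method of a tiktoken encoding.

```python
from typing import Callable, Dict, List

FENCE = "`" * 3  # the Markdown code-fence marker


def token_share_by_type(markdown: str, encode: Callable[[str], List]) -> Dict[str, float]:
    """Split a Markdown document into code, table, and prose lines and
    return each category's share of the total token count."""
    buckets: Dict[str, List[str]] = {"code": [], "table": [], "prose": []}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            buckets["code"].append(line)  # count the fence with its block
        elif in_fence:
            buckets["code"].append(line)
        elif line.lstrip().startswith("|"):
            buckets["table"].append(line)
        else:
            buckets["prose"].append(line)
    counts = {k: len(encode("\n".join(v))) for k, v in buckets.items()}
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return {k: n / total for k, n in counts.items()}
```

Running this over a sample of your real context shows which content type dominates the token count, and therefore which conversion is worth chasing first.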

Second, when authoring or generating content destined for LLM context, prefer the formats with the highest characters-per-token ratio. Markdown prose, Markdown lists, fenced code blocks, and Markdown tables are the densest or near-densest formats for their respective content types (only CSV edges out Markdown tables on raw density), and they are also the formats that LLMs read most accurately, as we covered in the agent context primer. Density and readability mostly move together, which is the rare case of an optimization with almost no trade-off.

TL;DR

Different content types tokenize at very different densities on the cl100k_base tokenizer (and, within a few percentage points, on o200k_base and Claude's tokenizer). Prose is the cheapest per byte; code is the most expensive. HTML is consistently more expensive than its Markdown equivalent, with the largest gaps on tables (2× compression) and lists (~25% compression). Knowing these ratios lets you predict your workflow's token bill before you run it and target the format conversions that produce the biggest savings.

If you need to convert HTML web content into the most token-efficient format for your LLM workflow, BulkMD is the free Chrome extension that produces clean, well-shaped Markdown with code-block language hints and proper table formatting — exactly the output shape that produces the cost reductions in the table above.

Frequently asked questions

Do these numbers apply to Claude's tokenizer too?

Within a few percentage points, yes. Claude uses its own tokenizer that is closely related to cl100k for Latin-script text — the same byte-pair merges work on common English words, code keywords, and Markdown syntax. The largest divergence we have measured is on HTML attribute syntax, where Claude's tokenizer is slightly less efficient than cl100k. For the Markdown-vs-HTML comparison the direction is unchanged.

What about non-English prose?

Tokenization density drops sharply for languages whose script wasn't well-represented in the tokenizer's training data. Japanese, Korean, and Arabic prose can run at 1.5–2 characters per token (vs 3.6 for English), and per-character costs go up correspondingly. These scripts also take two to three UTF-8 bytes per character, so the chars/token and tokens-per-1KB figures no longer line up the way they do for ASCII text. The Markdown-vs-HTML structural savings still apply, but the absolute token counts will be higher than the English-prose numbers above.

Should I minify JSON before sending it as context?

For RAG context that the model will reason over by field name, yes — minified JSON costs 15–20% fewer tokens than pretty-printed JSON for the same information. The model parses both equally well. For development debugging where humans read the JSON, keep it pretty-printed; for production LLM context, minify.

Does converting Markdown back to plain text save more tokens?

Slightly, but not enough to be worth the loss. Stripping all Markdown structure saves maybe 5% of tokens versus Markdown prose, but the model loses the structural signals that help it cite specific sections accurately. The citation-accuracy gain from keeping headings and lists is much larger than the token savings from stripping them.

How do I count tokens in my own content?

Use OpenAI's `tiktoken` Python library: `import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(text)))`. It is free, runs locally, and lets you measure your real corpus in seconds. For Claude-specific counts, Anthropic's API offers a token-counting endpoint (exposed in the Python SDK as `client.messages.count_tokens`), which requires an API request rather than running locally. Either is far better than relying on rules of thumb.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.


Token Math by Content Type: Code, Tables, Lists in 2026