If you have ever stared at a 200K-token Anthropic bill and wondered why the same article costs different amounts to send depending on whether it has code or tables in it, you have hit one of the under-appreciated facts of LLM cost: different content types tokenize at very different densities. A 1,000-byte block of English prose is roughly 280 tokens. A 1,000-byte block of TypeScript is roughly 450 tokens. A 1,000-byte JSON object can be anywhere from 300 to 700 tokens depending on how it is formatted. Knowing these ratios before you build a workflow is the difference between predicting your bill and being surprised by it.
This post is the per-content-type token math, measured on the cl100k_base tokenizer used by GPT-4 and GPT-3.5 Turbo (GPT-4o uses the newer o200k_base encoding). It is close enough to Claude's tokenizer that the numbers should transfer within a few percentage points. The data comes from BulkMD's own corpus of converted pages, where we have tokenized millions of bytes of real web content and seen the per-format breakdown across every common shape. If you have not read the broader token-cost breakdown, which covers the macro story of why Markdown is the right shape, start there; this post is the granular companion.
How we measured the numbers
For each content type, we took ten representative samples — real, not synthetic — and ran them through OpenAI's tiktoken library against the cl100k_base encoding. The numbers below are medians of the ten samples per type, reported as characters per token (higher = denser = cheaper). The samples spanned a deliberate range: prose was English-language blog posts, code was TypeScript and Python, tables were API reference rows, lists were package.json scripts, JSON was arbitrary API responses.
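The measurement loop described above can be sketched roughly as follows. This is a minimal illustration, not BulkMD's actual harness: the names `chars_per_token`, `median_density`, and `cl100k_encoder` are hypothetical, and the encoder is passed in as a callable so the density math can be checked without downloading the tiktoken vocabulary.

```python
from statistics import median
from typing import Callable, Iterable, Sequence

# Any callable that maps text to a sequence of token ids.
Encoder = Callable[[str], Sequence[int]]


def chars_per_token(text: str, encode: Encoder) -> float:
    """Characters divided by token count (higher means denser, so cheaper)."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("sample produced no tokens")
    return len(text) / len(tokens)


def median_density(samples: Iterable[str], encode: Encoder) -> float:
    """Median chars/token across the samples of one content type."""
    return median(chars_per_token(s, encode) for s in samples)


def cl100k_encoder() -> Encoder:
    """Build an encoder backed by tiktoken's cl100k_base (needs `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the helpers above work without it

    return tiktoken.get_encoding("cl100k_base").encode
```

With tiktoken installed, `median_density(samples, cl100k_encoder())` would give one row's chars/token figure, where `samples` is your own list of strings for that content type.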
The cl100k tokenizer is a byte-pair-encoding scheme trained on a web crawl. It has favorable splits for common English words, common code keywords, and pure ASCII whitespace patterns — and unfavorable splits for unusual punctuation, mixed-case identifiers, and embedded HTML attributes.
The headline table
| Content type | Format | Chars / token | Tokens per 1KB |
|---|---|---|---|
| English prose | Markdown body text | 3.6 | 280 |
| English prose | Same text in <p> tags | 3.1 | 325 |
| Bullet list | Markdown - list | 3.7 | 270 |
| Bullet list | <ul><li> HTML | 2.9 | 350 |
| Numerical table | Markdown pipe table | 3.2 | 315 |
| Numerical table | HTML <table> | 1.6 | 625 |
| Numerical table | JSON array of objects | 2.3 | 435 |
| Numerical table | CSV | 3.4 | 295 |
| Code (TypeScript) | Fenced ```ts block | 2.2 | 460 |
| Code (TypeScript) | Same code in <pre> HTML | 2.1 | 475 |
| Code (Python) | Fenced ```py block | 2.4 | 420 |
| JSON | Pretty-printed (2-space indent) | 2.5 | 400 |
| JSON | Minified, no whitespace | 3.0 | 335 |
| YAML | Standard 2-space indent | 3.0 | 335 |
A few patterns jump out. Markdown beats HTML on every comparable content type, by a margin that runs from a few percent on code to roughly 50% on tables. The widest gap is on tables: a Markdown pipe table is roughly twice as dense as the same data in <table> HTML, because every cell in the HTML version pays for <td> and </td> tokens that the pipe character collapses to almost nothing.
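To make the relationship between the two numeric columns and the size of the Markdown-versus-HTML gap explicit, here is a small worked calculation using values copied from the table. The helper names are illustrative, and 1KB is treated as 1,000 characters, which is what the table's rounding implies.

```python
def tokens_per_kb(chars_per_token: float) -> float:
    """Tokens needed for 1,000 characters at a given density."""
    return 1000 / chars_per_token


def token_saving(markdown_cpt: float, html_cpt: float) -> float:
    """Fraction of tokens saved by sending Markdown instead of HTML."""
    return 1 - tokens_per_kb(markdown_cpt) / tokens_per_kb(html_cpt)


# (Markdown chars/token, HTML chars/token) pairs from the table above.
PAIRS = {
    "prose": (3.6, 3.1),
    "bullet list": (3.7, 2.9),
    "numerical table": (3.2, 1.6),
    "TypeScript": (2.2, 2.1),
}

for name, (md, html) in PAIRS.items():
    print(f"{name}: {token_saving(md, html):.0%} fewer tokens in Markdown")
```

On the table's figures this works out to about 14% for prose, 22% for lists, 50% for tables, and 5% for TypeScript.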
Among content types whose cost comes from the content itself rather than from markup, code tokenizes worst at the byte level, and the format does not help much. Whether you wrap your TypeScript in a fenced Markdown block or in HTML <pre>, the cost is similar (2.2 vs 2.1 chars/token) because the dominant cost is the code's own punctuation and identifier soup, not the surrounding fence. The Markdown fence is still strictly better because it carries the language hint and renders cleaner everywhere, but the byte-level token savings here are small.
Why prose tokenizes so well
English prose is one of the densest content types in the table (only Markdown bullet lists edge it out) for a reason that traces back to how cl100k_base was trained. The tokenizer was fit on a web corpus that is overwhelmingly English text, so the byte-pair merges that the algorithm learned favor common English words and word fragments. The word "configuration" becomes one or two tokens; the word "configfile_settings_override" becomes five or six.
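One way to see this effect is to print the pieces a word is split into. The sketch below assumes the tiktoken package; `token_pieces` and `compare_words` are hypothetical helpers that rely only on the `encode` and `decode_single_token_bytes` methods of a tiktoken `Encoding` object. No expected output is shown because the exact splits depend on the vocabulary.

```python
def token_pieces(text: str, enc) -> list[str]:
    """Return the text fragment behind each token, in order.

    `enc` is expected to be a tiktoken Encoding, e.g.
    tiktoken.get_encoding("cl100k_base").
    """
    return [
        enc.decode_single_token_bytes(token).decode("utf-8", errors="replace")
        for token in enc.encode(text)
    ]


def compare_words(words: list[str], enc) -> dict[str, int]:
    """Token count per word, to contrast common words with identifiers."""
    return {word: len(enc.encode(word)) for word in words}
```

Calling `compare_words(["configuration", "configfile_settings_override"], tiktoken.get_encoding("cl100k_base"))` lets you check the contrast on your own machine.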
This is also why Markdown prose tokenizes slightly better than the same prose in HTML. The HTML cost is not in the words themselves — those tokenize identically — but in the surrounding tag soup. Every <p> and </p> is a few tokens; every <a href="..."> is several more. The prose-in-Markdown row in the table above already strips those, which is why 3.6 chars/token beats 3.1.
The headline finding is unintuitive: once markup overhead is stripped away, code is the most expensive thing you can send to an LLM per byte, even though it feels structured and "machine-readable."
Why tables are where Markdown wins hardest
The most dramatic format-driven saving in the table above is for numerical tables: 3.2 chars/token for Markdown pipe tables versus 1.6 chars/token for HTML, a 2× compression. The reason is that <table>, <thead>, <tbody>, <tr>, <th>, <td> and their closing tags each cost real tokens, and the row and cell tags repeat on every row. A 20-row × 5-column HTML table carries roughly 250 opening and closing tags of pure overhead on top of the data itself. The Markdown equivalent uses pipes, a single character per cell boundary, plus one header separator line of dashes.
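The tag count is easy to reproduce. The sketch below builds the same made-up 20 × 5 grid of numbers as an attribute-free HTML table and as a Markdown pipe table, then counts the tags. The helper names and the sample data are illustrative; real-world HTML tables usually carry attributes and whitespace that add further overhead.

```python
import re


def html_table(header: list[str], rows: list[list[str]]) -> str:
    """Render a plain HTML table with thead/tbody and no attributes."""
    head = "".join(f"<th>{h}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


def markdown_table(header: list[str], rows: list[list[str]]) -> str:
    """Render the same data as a Markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def count_tags(html: str) -> int:
    """Count opening and closing tags such as <td> and </td>."""
    return len(re.findall(r"</?[a-z]+>", html))


header = [f"col{i}" for i in range(5)]
rows = [[str(r * 5 + c) for c in range(5)] for r in range(20)]

html = html_table(header, rows)
md = markdown_table(header, rows)
print(count_tags(html), "tags;", len(html), "HTML chars vs", len(md), "Markdown chars")
```

For 20 data rows and 5 columns this counts 258 tags, in line with the rough figure of 250; how many tokens each tag costs depends on the tokenizer.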
This is the single biggest concrete reason to convert HTML to Markdown for any context window where you are paying per token. Articles with one or two HTML tables routinely shrink by 30–40% on conversion to Markdown without losing any information, because the tables themselves shrink so dramatically.
The CSV row in the table is also worth noting: at 3.4 chars/token, CSV is denser than Markdown tables for the same data. If you are bulk-feeding numerical data and do not need the model to read the columns alongside other content, raw CSV can be even cheaper than Markdown tables. The tradeoff is that the model handles Markdown tables more accurately when answering questions that require column awareness; CSV gets reasoned over more like a stream of comma-separated tokens.
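If you want to compare the two serializations on your own rows, a sketch along these lines is enough. The endpoint data is invented for illustration and the helper names are hypothetical. Character counts are only a proxy, so run both strings through your tokenizer for real numbers.

```python
import csv
import io


def to_csv(header: list[str], rows: list[list[str]]) -> str:
    """Serialize rows as CSV using the standard library writer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def to_pipe_table(header: list[str], rows: list[list[str]]) -> str:
    """Serialize the same rows as a Markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines) + "\n"


# Made-up sample rows, not measurements.
header = ["endpoint", "p50_ms", "p99_ms"]
rows = [["/users", "12", "85"], ["/orders", "20", "140"]]

csv_text = to_csv(header, rows)
md_text = to_pipe_table(header, rows)
print(len(csv_text), "CSV chars vs", len(md_text), "Markdown chars")
```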
Why JSON has a wide range
The JSON rows in the table show a meaningful spread between pretty-printed and minified versions. Pretty-printed JSON pays for every space character and every line break — small individually, but they compound across a large object. Minifying the same JSON saves 15–20% of tokens on typical API responses without changing the information.
This matters because many RAG pipelines store and ship JSON. If your pipeline serializes records as pretty-printed JSON and sends them to the model, switching to minified JSON is one of the easiest token wins available — a single argument change on json.dumps saves a measurable percentage of your bill.
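In Python's standard library the change looks like this. The `record` value is a made-up example. Note that calling `json.dumps` with no arguments still emits a space after each comma and colon, so the `separators` argument is what produces fully minified output.

```python
import json

# Made-up record standing in for an API response.
record = {"id": 42, "tags": ["a", "b"], "meta": {"ok": True, "score": 0.5}}

# Pretty-printed: every nesting level adds newlines and indentation.
pretty = json.dumps(record, indent=2)

# Minified: the separators argument drops the spaces that json.dumps
# inserts after "," and ":" by default.
minified = json.dumps(record, separators=(",", ":"))

print(len(pretty), "chars pretty vs", len(minified), "chars minified")
```

Both strings parse back to the same object, so the saving changes only the whitespace, not the information the model sees.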
That said, JSON is not the right format for human-readable context. Models can parse it, but they reason over it less reliably than Markdown for document-shaped content. JSON is right when the model needs structured fields by name; Markdown is right when the model needs prose with structure. Use both, in the right places.
Where the byte-level numbers mislead
The tokens-per-kilobyte view captures the format cost, but it misses two things that matter in practice.
The first is that real content is a mix of types. A typical technical blog post is mostly prose with one or two code blocks and maybe a table. The blended rate ends up around 3.2–3.4 chars/token, which is better than pure code and worse than pure prose. When estimating costs for a workflow, multiply your raw size in characters by 0.31 (roughly 1/3.2) for a reasonable mid-range token estimate; pure-prose workflows will run cheaper, code-heavy workflows more expensive.
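A blended rate is total characters divided by total tokens, so it has to be computed by summing the token counts of each part rather than by averaging the chars/token figures directly. The sketch below shows that arithmetic for a hypothetical 10,000-character post. The mix of sizes is an assumption, the densities come from the headline table, and the function names are illustrative.

```python
def estimate_tokens(size_chars: int, chars_per_token: float = 3.2) -> int:
    """Rough token estimate for content of a given size."""
    return round(size_chars / chars_per_token)


def blended_density(mix: dict[str, tuple[int, float]]) -> float:
    """Overall chars/token for a document made of several content types.

    `mix` maps a label to (size in characters, chars/token for that type).
    Token counts add up, so the blend is total characters over total tokens.
    """
    total_chars = sum(size for size, _ in mix.values())
    total_tokens = sum(size / cpt for size, cpt in mix.values())
    return total_chars / total_tokens


# Hypothetical 10,000-character post: mostly prose, some TypeScript, one table.
post = {
    "prose": (7500, 3.6),
    "typescript": (1500, 2.2),
    "markdown table": (1000, 3.2),
}
print(round(blended_density(post), 2), estimate_tokens(10_000))
```

For this mix the blend comes out at roughly 3.25 chars/token, and the flat 3.2 rule of thumb gives 3,125 tokens for the whole post.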
The second is that the model's downstream cost depends on more than the tokenizer. Tables that the model has to reason over cell-by-cell consume "thinking" effort the tokenizer cannot measure. Code blocks invoke the model's syntax-aware reasoning paths. For dense-prose Q&A, the per-token math above predicts cost accurately; for code-generation or table-reasoning workflows, the model's compute cost (output tokens, reasoning steps) is a meaningful additional factor.
How to apply these numbers
For most readers of this post, the actionable takeaway is to stop guessing and start measuring. Two concrete steps tip the math in your favor in any workflow.
First, run your real workflow's context through tiktoken once and look at the distribution. If 60% of your tokens are tables in HTML form, converting those tables to Markdown is your highest-leverage cost reduction. If 30% are code, you cannot compress much further but you can at least language-tag the fences so the model handles them well. The benchmark above tells you which optimization to chase, but only your own corpus tells you which one matters.
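A distribution check of this kind might look like the sketch below. It assumes you have already split your context into labeled segments, which is the hard part and is not shown. `token_share` is a hypothetical helper, and `encode` can be any tokenizer callable, for example `tiktoken.get_encoding("cl100k_base").encode`.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence


def token_share(
    segments: Iterable[tuple[str, str]],
    encode: Callable[[str], Sequence[int]],
) -> dict[str, float]:
    """Fraction of total tokens contributed by each content type.

    `segments` yields (content_type, text) pairs; how you split and label
    your context is up to you and is not shown here.
    """
    counts = Counter()
    for content_type, text in segments:
        counts[content_type] += len(encode(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from largest share to smallest.
    return {name: n / total for name, n in counts.most_common()}
```

The first key in the returned dictionary is the content type that dominates your token spend, which is the one to optimize first.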
Second, when authoring or generating content destined for LLM context, prefer the formats with the highest characters-per-token ratio. Markdown prose, Markdown lists, fenced code blocks, and Markdown tables are at or near the top of the density range for their respective content types (only raw CSV edges out the pipe table), and they are also the formats that LLMs read most accurately, as we covered in the agent context primer. Density and readability mostly move together, which is the rare case of an optimization with almost no trade-off.
TL;DR
Different content types tokenize at very different densities on the cl100k_base tokenizer measured here. Prose is the cheapest content per byte; code is the most expensive once markup overhead is stripped. HTML is consistently more expensive than its Markdown equivalent, with the largest gaps on tables (2× compression) and lists (roughly 23% fewer tokens). Knowing these ratios lets you predict your workflow's token bill before you run it and target the format conversions that produce the biggest savings.
If you need to convert HTML web content into the most token-efficient format for your LLM workflow, BulkMD is the free Chrome extension that produces clean, well-shaped Markdown with code-block language hints and proper table formatting — exactly the output shape that produces the cost reductions in the table above.
About the author
Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.