BulkMD

Token Math by Content Type: Code, Tables, Lists in 2026

How prose, code, tables, lists, and JSON tokenize differently in 2026 — the per-byte token cost of each content type, and where Markdown compresses best.

M. H. Tawfik · 11 min read

If you have ever stared at a 200K-token Anthropic bill and wondered why the same article costs different amounts to send depending on whether it has code or tables in it, you have hit one of the under-appreciated facts of LLM cost: different content types tokenize at very different densities. A 1,000-byte block of English prose is roughly 280 tokens. A 1,000-byte block of TypeScript is roughly 460 tokens. A 1,000-byte JSON object can be anywhere from 300 to 700 tokens depending on how it is formatted. Knowing these ratios before you build a workflow is the difference between predicting your bill and being surprised by it.

This post is the per-content-type token math, measured on the cl100k_base tokenizer used by GPT-4 and GPT-3.5 Turbo (GPT-4o moved to the newer o200k_base encoding). In our measurements it is close enough to Claude's tokenizer that the numbers transfer within a few percentage points. The data comes from BulkMD's own corpus of converted pages, where we have tokenized millions of bytes of real web content and seen the per-format breakdown across every common shape. If you have not read the broader token-cost breakdown, which covers the macro story of why Markdown is the right shape, start there; this post is the granular companion.

How we measured the numbers

For each content type, we took ten representative samples — real, not synthetic — and ran them through OpenAI's tiktoken library against the cl100k_base encoding. The numbers below are medians of the ten samples per type, reported as characters per token (higher = denser = cheaper). The samples spanned a deliberate range: prose was English-language blog posts, code was TypeScript and Python, tables were API reference rows, lists were package.json scripts, JSON was arbitrary API responses.
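The measurement described above can be sketched in a few lines. This is a minimal illustration, not the exact script behind the table: the helper names (`chars_per_token`, `median_density`, `tokens_per_kb`) are hypothetical, and the tokenizer is passed in as a function so the sketch runs without any third-party package. The whitespace counter below is only a stand-in for a real tokenizer.

```python
from statistics import median
from typing import Callable, Iterable


def chars_per_token(text: str, count_tokens: Callable[[str], int]) -> float:
    """Characters per token for one sample (higher means denser)."""
    tokens = count_tokens(text)
    if tokens == 0:
        raise ValueError("sample produced no tokens")
    return len(text) / tokens


def median_density(samples: Iterable[str], count_tokens: Callable[[str], int]) -> float:
    """Median chars-per-token across the samples of one content type."""
    return median(chars_per_token(s, count_tokens) for s in samples)


def tokens_per_kb(density: float) -> float:
    """Convert chars-per-token into tokens per 1,000 characters."""
    return 1000 / density


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
# With tiktoken installed, the real counter would be:
#   enc = tiktoken.get_encoding("cl100k_base")
#   count_tokens = lambda s: len(enc.encode(s))
fake_count = lambda s: len(s.split())

samples = ["aa bb", "cccc dddd", "e f g"]
density = median_density(samples, fake_count)
print(density, tokens_per_kb(density))
```

Passing the counter in as an argument keeps the arithmetic separate from the choice of tokenizer, so the same functions work for any encoding you want to compare.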

The cl100k tokenizer is a byte-pair-encoding scheme trained on a web crawl. It has favorable splits for common English words, common code keywords, and pure ASCII whitespace patterns — and unfavorable splits for unusual punctuation, mixed-case identifiers, and embedded HTML attributes.

The headline table

| Content type | Format | Chars / token | Tokens per 1KB |
| --- | --- | --- | --- |
| English prose | Markdown body text | 3.6 | 280 |
| English prose | Same text in `<p>` tags | 3.1 | 325 |
| Bullet list | Markdown `-` list | 3.7 | 270 |
| Bullet list | `<ul><li>` HTML | 2.9 | 350 |
| Numerical table | Markdown pipe table | 3.2 | 315 |
| Numerical table | HTML `<table>` | 1.6 | 625 |
| Numerical table | JSON array of objects | 2.3 | 435 |
| Numerical table | CSV | 3.4 | 295 |
| Code (TypeScript) | Fenced `ts` block | 2.2 | 460 |
| Code (TypeScript) | Same code in `<pre>` HTML | 2.1 | 475 |
| Code (Python) | Fenced `py` block | 2.4 | 420 |
| JSON | Pretty-printed (2-space indent) | 2.5 | 400 |
| JSON | Minified, no whitespace | 3.0 | 335 |
| YAML | Standard 2-space indent | 3.0 | 335 |

A few patterns jump out. Markdown beats HTML on every comparable content type, by a margin that runs from about 3% on code to about 50% on tables. The widest gap is on tables: a Markdown pipe table is roughly twice as dense as the same data in <table> HTML, because every cell in the HTML version pays for <td> and </td> tokens that the pipe character collapses to almost nothing.

Code is the worst-tokenizing content type at the byte level, and the format does not help much. Whether you wrap your TypeScript in a fenced Markdown block or in HTML <pre>, the cost is similar (2.2 vs 2.1 chars/token) because the dominant cost is the code's own punctuation and identifier soup, not the surrounding fence. The Markdown fence is still strictly better because it carries the language hint and renders cleaner everywhere, but the byte-level token savings here are small.
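The Markdown-versus-HTML margins can be recomputed directly from the tokens-per-1KB column of the headline table. The sketch below only does that arithmetic; the dictionary layout and the `saving_pct` helper are illustrative names, not part of any library.

```python
# Tokens per 1KB, copied from the headline table (Markdown vs HTML rows only).
TOKENS_PER_KB = {
    "prose": {"markdown": 280, "html": 325},
    "bullet list": {"markdown": 270, "html": 350},
    "numerical table": {"markdown": 315, "html": 625},
    "typescript code": {"markdown": 460, "html": 475},
}


def saving_pct(markdown_tokens: int, html_tokens: int) -> float:
    """Percentage of tokens saved by sending Markdown instead of HTML."""
    return round(100 * (html_tokens - markdown_tokens) / html_tokens, 1)


savings = {
    kind: saving_pct(row["markdown"], row["html"])
    for kind, row in TOKENS_PER_KB.items()
}

for kind, pct in savings.items():
    print(f"{kind}: {pct}% fewer tokens in Markdown")
```

On these numbers the saving is about 14% for prose, 23% for lists, 50% for tables, and only 3% for TypeScript, which matches the observation that the fence format barely matters for code.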

Why prose tokenizes so well

English prose is among the densest content types in the table (only Markdown bullet lists, which are mostly short prose, edge it out) for a reason that traces back to how cl100k_base was trained. The tokenizer was fit on a web corpus that is overwhelmingly English text, so the byte-pair merges that the algorithm learned favor common English words and word fragments. The word "configuration" becomes one or two tokens; the word "configfile_settings_override" becomes five or six.

This is also why Markdown prose tokenizes slightly better than the same prose in HTML. The HTML cost is not in the words themselves — those tokenize identically — but in the surrounding tag soup. Every <p> and </p> is a few tokens; every <a href="..."> is several more. The prose-in-Markdown row in the table above already strips those, which is why 3.6 chars/token beats 3.1.

The headline finding is unintuitive: code is the most expensive content type you can send to an LLM per byte, even though it feels structured and "machine-readable."

Why tables are where Markdown wins hardest

The most dramatic format-driven saving in the table above is for numerical tables: 3.2 chars/token for Markdown pipe tables versus 1.6 chars/token for HTML, a 2× compression. The reason is that every <table>, <thead>, <tbody>, <tr>, <th>, <td> and their closing tags each cost real tokens, and they repeat on every row. A 20-row × 5-column HTML table carries roughly 250 tags of pure overhead (200 cell tags, 40 row tags, and a handful of wrappers), and each tag costs at least one token. The Markdown equivalent uses pipes, a single character per cell boundary, and a header separator line of dashes.
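The overhead arithmetic above is easy to check by generating the same table in both formats and counting the markup. This is a rough sketch with made-up cell values; it counts tags and pipe characters rather than tokens, so it illustrates the structural overhead, not the exact token bill.

```python
import re


def html_table(header, rows):
    """Render a table as HTML with thead/tbody wrappers."""
    head = "<thead><tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr></thead>"
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>" for row in rows
    )
    return f"<table>{head}<tbody>{body}</tbody></table>"


def markdown_table(header, rows):
    """Render the same table as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def count_tags(html):
    """Count opening and closing tags (no attributes in this sketch)."""
    return len(re.findall(r"</?[a-z]+>", html))


# 1 header row + 19 body rows = 20 rows, 5 columns; placeholder data.
header = [f"col{i}" for i in range(5)]
rows = [[str(r * 5 + c) for c in range(5)] for r in range(19)]

html = html_table(header, rows)
md = markdown_table(header, rows)

print("HTML tags:", count_tags(html))   # 246 tags
print("Markdown pipes:", md.count("|"))  # 126 pipe characters
print("bytes:", len(html), "vs", len(md))
```

Each of the 246 tags is several characters long, while each of the 126 pipes is a single character, which is where the density gap comes from.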

This is the single biggest concrete reason to convert HTML to Markdown for any context window where you are paying per token. Articles with one or two HTML tables routinely shrink by 30–40% on conversion to Markdown without losing any information, because the tables themselves shrink so dramatically.

The CSV row in the table is also worth noting: at 3.4 chars/token, CSV is denser than Markdown tables for the same data. If you are bulk-feeding numerical data and do not need the model to read the columns alongside other content, raw CSV can be even cheaper than Markdown tables. The tradeoff, in our experience, is that the model handles Markdown tables more accurately when answering questions that require column awareness; CSV gets reasoned over more like a stream of comma-separated tokens.
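For a sense of why CSV comes out denser, the same rows can be serialized both ways with the standard library. This is an illustrative sketch with invented sample data; it compares character counts only, and the actual token difference depends on the tokenizer.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize rows as CSV using the standard csv module."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(header, rows):
    """Serialize the same rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration.
header = ["region", "units", "revenue"]
rows = [["north", "120", "3400.50"], ["south", "95", "2710.00"]]

csv_text = to_csv(header, rows)
md_text = to_markdown(header, rows)
print(len(csv_text), "chars as CSV vs", len(md_text), "chars as Markdown")
```

CSV drops the padding spaces, the outer pipes, and the separator row, so the saving grows with the number of columns.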

Why JSON has a wide range

The JSON rows in the table show a meaningful spread between pretty-printed and minified versions. Pretty-printed JSON pays for every space character and every line break — small individually, but they compound across a large object. Minifying the same JSON saves 15–20% of tokens on typical API responses without changing the information.

This matters because many RAG pipelines store and ship JSON. If your pipeline serializes records as pretty-printed JSON and sends them to the model, switching to minified JSON is one of the easiest token wins available: in Python, dropping `indent` and passing `separators=(",", ":")` to `json.dumps` saves a measurable percentage of your bill.
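A minimal sketch of that switch, using Python's standard `json` module. The record is invented for illustration; the point is that both strings decode to the same object while the compact one carries no indentation, line breaks, or separator spaces.

```python
import json

# Hypothetical record standing in for an API response.
record = {
    "id": 42,
    "name": "widget",
    "tags": ["a", "b"],
    "price": {"amount": 9.99, "currency": "USD"},
}

# Pretty-printed: 2-space indent, one key per line.
pretty = json.dumps(record, indent=2)

# Minified: no indent, and separators without the default trailing spaces.
compact = json.dumps(record, separators=(",", ":"))

print(len(pretty), "chars pretty vs", len(compact), "chars minified")
print(compact)
```

Note that omitting `indent` alone still leaves a space after every comma and colon; the `separators` argument is what removes those.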

That said, JSON is not the right format for human-readable context. Models can parse it, but they reason over it less reliably than Markdown for document-shaped content. JSON is right when the model needs structured fields by name; Markdown is right when the model needs prose with structure. Use both, in the right places.

Where the byte-level numbers mislead

The tokens-per-kilobyte view captures the format cost, but it misses two things that matter in practice.

The first is that real content is a mix of types. A typical technical blog post is mostly prose with one or two code blocks and maybe a table. The blended rate ends up around 3.2–3.4 chars/token, which is better than pure code and worse than pure prose. When estimating costs for a workflow, multiply your raw size in bytes by 0.31 (1/3.2) to get a reasonable mid-range token estimate; pure-prose workflows will run cheaper, code-heavy workflows more expensive.
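The blended rate can be worked out from the per-type densities in the headline table. The sketch below assumes ASCII content (so bytes and characters are interchangeable) and a hypothetical mix of 80% prose, 15% TypeScript, and 5% Markdown table; the function names are illustrative.

```python
# Chars per token, copied from the headline table (Markdown formats).
CHARS_PER_TOKEN = {
    "prose": 3.6,
    "list": 3.7,
    "table": 3.2,
    "code_ts": 2.2,
    "json_pretty": 2.5,
}


def estimate_tokens(bytes_by_type: dict) -> float:
    """Estimated tokens for a document described as bytes per content type."""
    return sum(n / CHARS_PER_TOKEN[kind] for kind, n in bytes_by_type.items())


def blended_chars_per_token(bytes_by_type: dict) -> float:
    """Overall chars-per-token for the mix (total size over total tokens)."""
    return sum(bytes_by_type.values()) / estimate_tokens(bytes_by_type)


# Assumed mix for a 1,000-byte technical post.
mix = {"prose": 800, "code_ts": 150, "table": 50}

print(round(estimate_tokens(mix)))              # about 306 tokens
print(round(blended_chars_per_token(mix), 2))   # about 3.27 chars/token
```

The blend is a weighted harmonic mean, not a simple average of the densities, because the token counts are what add up across sections.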

The second is that the model's downstream cost depends on more than the tokenizer. Tables that the model has to reason over cell-by-cell consume "thinking" effort the tokenizer cannot measure. Code blocks invoke the model's syntax-aware reasoning paths. For dense-prose Q&A, the per-token math above predicts cost accurately; for code-generation or table-reasoning workflows, the model's compute cost (output tokens, reasoning steps) is a meaningful additional factor.

How to apply these numbers

For most readers of this post, the actionable takeaway is to stop guessing and start measuring. Two concrete steps tip the math in your favor in any workflow.

First, run your real workflow's context through tiktoken once and look at the distribution. If 60% of your tokens are tables in HTML form, converting those tables to Markdown is your highest-leverage cost reduction. If 30% are code, you cannot compress much further but you can at least language-tag the fences so the model handles them well. The benchmark above tells you which optimization to chase, but only your own corpus tells you which one matters.
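One way to look at that distribution is to classify each line of a Markdown document and total the tokens per category. This is a rough sketch under simple assumptions: the line-level heuristics are hypothetical and will misclassify edge cases, and the tokenizer is injected so the example runs with a whitespace stand-in instead of a real encoder.

```python
from collections import Counter
from typing import Callable, Dict, Iterator, Tuple


def classify_lines(markdown: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, line) pairs using simple line-level heuristics."""
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence  # fence lines count as code
            yield "code", line
        elif in_fence:
            yield "code", line
        elif stripped.startswith("|"):
            yield "table", line
        elif stripped.startswith(("- ", "* ", "+ ")):
            yield "list", line
        elif stripped:
            yield "prose", line  # blank lines are skipped


def token_share(markdown: str, count_tokens: Callable[[str], int]) -> Dict[str, float]:
    """Fraction of total tokens spent on each content type."""
    totals = Counter()
    for kind, line in classify_lines(markdown):
        totals[kind] += count_tokens(line)
    grand = sum(totals.values())
    return {kind: n / grand for kind, n in totals.items()} if grand else {}


doc = "\n".join([
    "Some intro prose here.",
    "",
    "- item one",
    "- item two",
    "| a | b |",
    "| 1 | 2 |",
    "```py",
    "print(1)",
    "```",
])

# Stand-in tokenizer (assumption); swap in a real encoder's count for real numbers.
shares = token_share(doc, lambda s: len(s.split()))
for kind, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{kind}: {share:.0%}")
```

Counting per line ignores the newline tokens between lines, so treat the shares as approximate; they are meant to show which category dominates, not to reproduce a bill.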

Second, when authoring or generating content destined for LLM context, prefer the formats with the highest characters-per-token ratio. Markdown prose, Markdown lists, fenced code blocks, and Markdown tables are the densest or near-densest formats for their respective content types (CSV edges out Markdown tables on raw density), and they are also the formats that LLMs read most accurately, as we covered in the agent context primer. Density and readability mostly move together, which makes this a rare optimization with almost no trade-off.

TL;DR

Different content types tokenize at very different densities on the cl100k_base tokenizer. Prose is the cheapest per byte; code is the most expensive. HTML is consistently more expensive than its Markdown equivalent, with the largest gaps on tables (2× compression) and lists (~23% compression). Knowing these ratios lets you predict your workflow's token bill before you run it and target the format conversions that produce the biggest savings.

If you need to convert HTML web content into the most token-efficient format for your LLM workflow, BulkMD is the free Chrome extension that produces clean, well-shaped Markdown with code-block language hints and proper table formatting — exactly the output shape that produces the cost reductions in the table above.

Frequently asked questions

Do these numbers apply to Claude's tokenizer too?

Within a few percentage points, yes, in our measurements. Claude uses its own tokenizer rather than cl100k_base, but on Latin-script text it behaves similarly for common English words, code keywords, and Markdown syntax. The largest divergence we have measured is on HTML attribute syntax, where Claude's tokenizer is slightly less efficient than cl100k_base. For the Markdown-vs-HTML comparison the direction is unchanged.

What about non-English prose?

Tokenization density drops sharply for languages whose script wasn't well-represented in the tokenizer's training data. Japanese, Korean, and Arabic prose can run at 1.5–2 characters per token (vs 3.6 for English), and per-byte costs go up correspondingly. The Markdown-vs-HTML structural savings still apply, but the absolute token counts will be higher than the English-prose numbers above.

Should I minify JSON before sending it as context?

For RAG context that the model will reason over by field name, yes — minified JSON costs 15–20% fewer tokens than pretty-printed JSON for the same information. The model parses both equally well. For development debugging where humans read the JSON, keep it pretty-printed; for production LLM context, minify.

Does converting Markdown back to plain text save more tokens?

Slightly, but not enough to be worth the loss. Stripping all Markdown structure saves maybe 5% of tokens versus Markdown prose, but the model loses the structural signals that help it cite specific sections accurately. The citation-accuracy gain from keeping headings and lists is much larger than the token savings from stripping them.

How do I count tokens in my own content?

Use OpenAI's `tiktoken` Python library: `import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(text)))`. It is free and runs locally. For Claude-specific counts, Anthropic's API offers a token-counting endpoint (`messages.count_tokens` in the Python SDK), which needs an API key and a network call rather than running locally. Either way, you can measure your real corpus in seconds, which is far better than relying on rules of thumb.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.


Tagged: Tokens, Cost optimization, Markdown, Prompt engineering