You have a folder of web pages, API responses, and notes, and you are about to paste them into Claude or ChatGPT or a RAG index. The unglamorous question that decides your token bill and your answer quality is this: what format should the context be in? The honest answer to Markdown vs JSON vs plain text for LLM context is that there is no single winner. There is a right format per content type, and getting it wrong costs you tokens, reasoning accuracy, or both. This post gives you the decision rule and the reasoning behind it.

We will compare the three formats on two axes that actually matter, token density (how much you pay to send a given piece of information) and reasoning reliability (how accurately the model uses what you sent), then map each format to the content type it serves best. Throughout, the lens is practical: which format you should pick before you paste, and how BulkMD lets you produce the right one for documents in one click.

Why format is a real decision, not a style choice

Format changes both what you pay and what you get back. The same information serialized three ways produces three different token counts and three different reasoning outcomes, because a language model does not see your data. It sees a token stream, and the shape of that stream steers how the model attends to it.

Token density is the first lever. A tokenizer like OpenAI's cl100k_base (which powered GPT-3.5 and GPT-4; GPT-4o and the o-series use the larger o200k_base, and Claude has its own, all roughly comparable for English prose) assigns favorable splits to common English words and Markdown punctuation, and unfavorable splits to repeated structural syntax. JSON's quotes, braces, and repeated key names fall on token boundaries that the encoder cannot merge away. Plain text strips structure but keeps the prose, so it costs nearly what Markdown does while carrying less signal.

Reasoning reliability is the second lever, and it cuts the other way for structured data. When a task is "return the status field for each record," JSON's explicit key names give the model an unambiguous address for every value. Markdown's looser structure forces the model to infer which value is which. The format that is cheaper to send is not always the format the model reasons over best, which is exactly why this is a per-case decision.

How the three formats tokenize

The table below shows illustrative per-format token economics for the same content, on the order of what you see when you run OpenAI's tiktoken against cl100k_base. The "Article body" rows assume a representative English article with one table; the "Tabular data" and "Key-value record" rows assume a small dataset with five fields per record. HTML and CSV appear as reference points alongside the three formats under comparison. Treat the figures as roughly indicative ranges, not precise measurements: your exact numbers depend on the specific text, but the ranking between formats is stable. Chars per token is a density measure (higher is denser); tokens per kilobyte is a cost proxy (lower is cheaper). For a deeper breakdown by content type, see our content-type token math guide.

Content	Format	Chars / token (approx)	Tokens / 1KB (approx)
Article body	Markdown	~3.6	~280
Article body	Plain text	~3.7	~270
Article body	HTML	~3.1	~325
Tabular data	Markdown pipe table	~3.2	~315
Tabular data	JSON array (pretty)	~2.3	~435
Tabular data	JSON array (minified)	~3.0	~335
Tabular data	CSV	~3.4	~295
Key-value record	Markdown list	~3.5	~290
Key-value record	JSON (pretty)	~2.5	~400
Key-value record	JSON (minified)	~3.0	~335
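The two numeric columns are two views of one measurement, and the rounded figures line up with treating 1 KB as 1,000 characters (for example, 1000 / 2.5 = 400). The sketch below shows that arithmetic so you can reproduce it on your own text. The function names are hypothetical, and the token counter is passed in as a parameter; `tiktoken_counter` assumes the third-party tiktoken package is installed and imports it only when called.

```python
from typing import Callable, Tuple


def density(text: str, count_tokens: Callable[[str], int]) -> Tuple[float, float]:
    """Return (chars_per_token, tokens_per_kb) for `text`.

    Assumption: 1 KB is treated as 1,000 characters, which matches the
    rounding in the table for mostly-ASCII English content.
    """
    tokens = count_tokens(text)
    if tokens <= 0:
        raise ValueError("token count must be positive")
    chars_per_token = len(text) / tokens
    tokens_per_kb = 1000 / chars_per_token
    return chars_per_token, tokens_per_kb


def tiktoken_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Build a token counter backed by tiktoken (third-party, imported lazily)."""
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding(encoding_name)
    return lambda text: len(enc.encode(text))
```

With tiktoken available, `density(my_text, tiktoken_counter())` gives both columns for one sample. Run it on the same content in each format to get a comparison for your own data.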

Three patterns matter. First, plain text is marginally denser than Markdown (roughly 3.7 versus 3.6 chars per token) because it drops the #, -, and | characters, but the difference is on the order of a few percent, and you pay for it by losing every structural signal. Second, pretty-printed JSON is the most expensive way to ship structured data: indentation and newlines add tokens on every line, and the repeated key names compound across rows. Minifying removes the whitespace cost, cutting roughly 15 to 20 percent of tokens for identical fields. Third, for purely tabular numeric data, a Markdown pipe table or CSV beats pretty-printed JSON on density by a wide margin and still edges out minified JSON. Because a JSON array also repeats every key name in every row, it needs more characters to carry the same rows, so the real gap is wider than the per-kilobyte figures suggest.

The takeaway from the table alone: never ship pretty-printed JSON to a model that does not need to read it, and never reach for plain text expecting a meaningful token saving. The saving is noise, and the cost is structure.
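To check the ranking on your own records, you can serialize the same rows four ways and compare sizes. This is a minimal sketch using only the standard library. The three sample records are made up for illustration, and it compares character counts, which are only a proxy for tokens; pass each variant through a real tokenizer if you need token figures.

```python
import csv
import io
import json

# Hypothetical sample rows, invented for this illustration.
records = [
    {"sku": "AWP-100", "price_usd": 49, "in_stock": True},
    {"sku": "AWP-200", "price_usd": 79, "in_stock": False},
    {"sku": "AWP-300", "price_usd": 99, "in_stock": True},
]


def to_csv(rows):
    """Serialize a list of flat dicts as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_pipe_table(rows):
    """Serialize a list of flat dicts as a Markdown pipe table."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines) + "\n"


variants = {
    "json_pretty": json.dumps(records, indent=2),
    "json_minified": json.dumps(records, separators=(",", ":")),
    "pipe_table": to_pipe_table(records),
    "csv": to_csv(records),
}

# Character counts per variant; the key names are repeated per row only in JSON.
sizes = {name: len(text) for name, text in variants.items()}
for name, size in sizes.items():
    print(f"{name:14} {size:4} chars")
```

The JSON variants carry every key name once per row, while the table formats carry it once in the header, so the gap grows with the number of rows.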

When Markdown is the right format

Markdown is the right format for documents, meaning anything a human would read as prose with structure. Articles, documentation, READMEs, knowledge-base entries, transcripts, and notes all belong in Markdown when they become LLM context.

The case rests on three properties. Markdown is within a few percent of plain text, the densest option for natural-language prose, so you pay close to the fewest tokens possible. Its headings give a retriever clean chunk boundaries, so a single concept does not get split across two embedded chunks. And models cite Markdown structure more reliably than the alternatives: a ## Heading is an addressable landmark the model can point back to when it answers, which is why structured Markdown context tends to produce better section-level citations than the same text pasted raw. We cover the citation mechanics in depth in the agent context primer.
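The chunk-boundary point can be made concrete with a small splitter. The sketch below is one possible approach, not how any particular retriever works: it cuts a Markdown document at ATX headings (lines starting with one to six # characters) and skips lines inside fenced code blocks, so a shell comment is not mistaken for a heading. The function name and the shape of the returned dicts are assumptions for illustration.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # a fenced code block opens and closes with three backticks


def split_by_heading(markdown: str) -> List[Dict]:
    """Split Markdown into chunks, one per heading, keeping the heading text."""
    chunks = []
    current = {"heading": None, "level": 0, "lines": []}
    in_fence = False

    def flush(chunk):
        # Keep a chunk if it has a heading or any non-blank body text.
        if chunk["heading"] is not None or any(l.strip() for l in chunk["lines"]):
            chunks.append(chunk)

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    flush(current)

    return [
        {"heading": c["heading"], "level": c["level"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

Each chunk carries its own heading, which is what lets an answer point back to a named section. Plain text offers no equivalent cut points, so a splitter has to fall back on length or blank lines.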

The practical failure mode this avoids is the "wall of text" paste. When you copy an article out of a browser, you usually get either HTML soup (expensive tokens, navigation boilerplate) or plain text (no headings, no table structure, no link targets). Converting to clean Markdown first strips the boilerplate and preserves the structure, which is the entire job BulkMD does locally in the browser, emitting Markdown with proper heading levels, language-tagged code fences, and pipe tables. The token saving versus the original HTML is typically 60 to 80 percent, and a boilerplate-heavy page can reach roughly 93 percent. Most of that saving comes from sending fewer bytes once navigation and markup are gone, not from Markdown's modest per-kilobyte edge over HTML.

When JSON is the right format

JSON is the right format when the model must address data by exact field name: field extraction, tool inputs and outputs, and any record where the schema is the point. If your prompt is "for each product, return sku, price, and in_stock as a JSON array," the input should be JSON too, because the key names are the contract.

The reason is reliability, not density. JSON costs more tokens than Markdown for the same data, but it removes ambiguity. When a model reads "status": "shipped", the field name and value are bound together explicitly; the model does not have to guess that the third column of a table means status. For agentic tool calls, where a function expects a typed object and the model must produce one, JSON is the only sane choice, because the downstream consumer is a real JSON parser, not a human.
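An analogy in ordinary code shows why named keys remove ambiguity. This is not a claim about model internals, only an illustration of the same binding problem. The order records below are hypothetical, and the second one deliberately lists its fields in a different order.

```python
import json

# Hypothetical payload: the second record orders its fields differently.
payload = (
    '[{"order_id":"A1","status":"shipped","carrier":"DHL"},'
    '{"order_id":"A2","carrier":"UPS","status":"pending"}]'
)

# Key-addressed lookup: field order is irrelevant, the name is the address.
records = json.loads(payload)
statuses = {r["order_id"]: r["status"] for r in records}

# The same values as positional rows: correct only while every row keeps
# the column order the reader assumed.
rows = [["A1", "shipped", "DHL"], ["A2", "UPS", "pending"]]
by_position = {row[0]: row[1] for row in rows}

print(statuses)     # status found by name in both records
print(by_position)  # second row yields the carrier, not the status
```

A reader of a table, human or model, has to carry the header-to-column mapping in its head. A reader of JSON gets the mapping restated next to every value, which is the reliability the extra tokens pay for.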

Two rules make JSON-as-context affordable. Minify it: drop the indentation and newlines and you save roughly 15 to 20 percent of tokens with no information loss, since the model parses minified and pretty JSON identically. And do not use it for documents: a paragraph of prose stuffed into a JSON string field tokenizes worse than the same paragraph as Markdown and reads worse to the model, because the prose now carries escaping overhead and the model has to traverse JSON structure to reach text that had no business being structured.
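Minifying is a one-liner with Python's standard json module: `separators=(",", ":")` drops the padding after commas and colons, and omitting `indent` drops the newlines. The helper name below is hypothetical. Two caveats worth knowing: the round trip through `json.loads` can normalize number formatting and collapses duplicate keys, and the printed saving is in characters, not tokens.

```python
import json


def minify_json(text: str) -> str:
    """Re-serialize JSON text without indentation or separator padding."""
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


pretty = json.dumps(
    {"name": "Acme Widget Pro", "price_usd": 49, "in_stock": True, "sku": "AWP-100"},
    indent=2,
)
compact = minify_json(pretty)

# Same data either way: both strings parse to an equal object.
same_data = json.loads(compact) == json.loads(pretty)
saved = 1 - len(compact) / len(pretty)
print(f"{len(pretty)} -> {len(compact)} chars ({saved:.0%} smaller), same data: {same_data}")
```

`ensure_ascii=False` keeps non-ASCII text as-is instead of expanding it into `\uXXXX` escapes, which would otherwise add characters to any non-English values.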

The schema-fidelity rule

A short heuristic settles most cases. If losing a single field name would break the task, use JSON. If the task is "understand this and answer questions," use Markdown. The dividing line is whether the model needs to reason over named values or over meaning. Extraction, validation, and tool I/O sit on the named-values side; summarization, question answering, and analysis sit on the meaning side.

When plain text is the right format

Plain text is rarely the optimal format, but it has two honest uses: content that has no structure to preserve, and pipelines where downstream tooling cannot tolerate any markup characters.

A single short paragraph, a search query, a log line, or a user utterance has no headings, tables, or fields. There is nothing for Markdown or JSON to add, so plain text is the correct, minimal choice. The mistake is using plain text for content that does have structure. Pasting a documentation page as plain text saves only a few percent of tokens versus Markdown while discarding every heading and table boundary the model would have used to navigate and cite. That trade is almost always bad: you give up disproportionate reasoning quality for a token saving that rounds to nothing.

The second legitimate use is defensive. Some retrieval and embedding pipelines pre-process input in ways that choke on stray markup, or you may be feeding a system whose chunker treats # characters as noise. In those narrow cases, normalizing to plain text is a pragmatic compatibility decision, but treat it as a constraint you are working around, not a format you chose for its merits.

The decision rule in one line

For LLM context in 2026, use Markdown for documents, JSON for field extraction and tool I/O, and plain text only for content with no structure to preserve, because Markdown is the most citable format for prose and nearly as dense as plain text (typically 60 to 80 percent fewer tokens than the source HTML, up to roughly 93 percent on boilerplate-heavy pages), JSON is the only format of the three that binds values to named keys reliably, and plain text saves a negligible few percent over Markdown while discarding every structural signal a model uses to navigate and cite.

That single rule resolves the vast majority of real decisions. The edge cases, such as purely numeric tables where CSV can edge out both, or mixed documents with embedded data where you nest a JSON or table block inside Markdown, are refinements on top of it, not exceptions to it.
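The rule is simple enough to write down as a function. The sketch below is a deliberate simplification: reducing the decision to two yes/no questions is an assumption made for illustration, and the names are hypothetical. It does not model the refinements mentioned above, such as CSV for purely numeric tables or a data block nested inside a Markdown document.

```python
from enum import Enum


class Format(Enum):
    MARKDOWN = "markdown"
    JSON = "json"
    PLAIN_TEXT = "plain_text"


def choose_format(needs_field_names: bool, has_structure: bool) -> Format:
    """Pick a context format from two questions about the task and content.

    needs_field_names: would losing a single field name break the task
        (extraction, validation, tool input or output)?
    has_structure: does the content have headings, lists, or tables
        worth preserving?
    """
    if needs_field_names:
        return Format.JSON
    if has_structure:
        return Format.MARKDOWN
    return Format.PLAIN_TEXT


print(choose_format(needs_field_names=False, has_structure=True).value)
```

The order of the checks encodes the priority: the named-values question comes first because it is about what the task requires, while structure is only about what the content happens to contain.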

A worked example: the same data, three ways

Consider a product spec you want a model to answer questions about. Here is the same record in all three formats, with approximate token costs on cl100k_base. The counts are close because the record is small and field-shaped; treat them as illustrative.

# Plain text (approx 38 tokens) — no structure, no addressability
Acme Widget Pro. Price 49 dollars. In stock yes. Ships in 2 days.
Weight 1.2 kg. Warranty 24 months. SKU AWP-100.

// Minified JSON (approx 41 tokens) — every field addressable by name
{"name":"Acme Widget Pro","price_usd":49,"in_stock":true,"ships_days":2,"weight_kg":1.2,"warranty_months":24,"sku":"AWP-100"}

<!-- Markdown (approx 44 tokens) — human-readable, citable, values not key-addressable -->
## Acme Widget Pro
- Price: $49
- In stock: yes
- Ships in: 2 days
- Weight: 1.2 kg
- Warranty: 24 months
- SKU: AWP-100

The decision here is not driven by tokens, since the three counts are nearly equal. It is driven by the task. Ask "is the Acme Widget Pro in stock and what does it weigh?" and all three answer well. Ask "return every product as a JSON object with sku and in_stock" across 200 such records, and JSON input wins, because the key names are the contract and minified JSON keeps the cost in check. Ask the model to summarize a 2,000-word review of the widget, and Markdown wins decisively, because that is a document, not a record. Pick the format by asking what the model has to do, not which looks tidiest.

How this maps to BulkMD's output

BulkMD's job is the document case, the most common one. It converts web pages to clean Markdown locally in the browser (no account, no telemetry; the only network call is the optional, opt-in AI summarize and clean feature that uses your own API key), preserving heading hierarchy, language-tagged code fences, and pipe tables. That is precisely the format you want when the content is an article, a doc page, or a transcript headed for an LLM or a notes tool like Obsidian or Notion.

When your data is genuinely record-shaped, such as an API response, a config object, or a dataset you will query by field, keep it as JSON and minify it before sending. BulkMD does not turn documents into JSON, and it should not: forcing prose into fields makes it cost more and read worse. Use the right tool for each shape. For sizing how much of any format fits in a given model, pair this with the context-window budgeting guide, which covers how to allocate a fixed token budget across retrieved chunks.

TL;DR

Format is a per-content-type decision. Use Markdown for documents: it is the densest structured format for prose and the most reliably citable, typically cutting 60 to 80 percent of tokens versus source HTML. Use JSON for field extraction and tool I/O, where named keys are the contract, and minify it to recover roughly 15 to 20 percent of tokens. Use plain text only for content with no structure worth keeping; its few-percent token edge over Markdown is not worth the lost signal. The next step: stop pasting raw HTML or wall-of-text into your prompts, and convert documents to clean Markdown first. BulkMD is the free Chrome extension that does it locally in one click.

Frequently asked questions

Is JSON or Markdown better for sending data to an LLM?

It depends on the data. Use JSON when the model must address values by exact field name, such as extraction tasks and tool calls, because the key names are an unambiguous contract. Use Markdown for documents and prose, where it costs fewer tokens than JSON and produces more reliable section-level citations. They solve different problems.

Does plain text save tokens compared to Markdown?

Barely. Stripping Markdown syntax saves only a few percent of tokens on typical prose, because the prose itself is unchanged and only the structural characters are removed. You pay for that small saving by losing the headings and table boundaries the model uses to navigate and cite, so it is rarely a good trade.

Should I minify JSON before putting it in a prompt?

Yes, for context the model only needs to parse. Minified JSON costs roughly 15 to 20 percent fewer tokens than pretty-printed JSON for the same fields, and models parse both identically. Keep JSON pretty-printed only when a human will read it during debugging.

What format is best for a mixed document that contains tables?

Markdown with embedded pipe tables. A Markdown pipe table is markedly denser than the same data in HTML and noticeably denser than a JSON array of objects, while staying readable inline with the surrounding prose. Reserve JSON for cases where the model must query those rows by field name.

Do these density estimates apply to Claude as well as GPT models?

Within a few percentage points, yes. The figures here reflect cl100k_base; GPT-4o and the o-series use o200k_base, and Claude uses its own tokenizer. All are byte-pair encoders trained on similar web text, so the relative ranking (Markdown dense, plain text similar, pretty JSON expensive) holds across all three for English content.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.


Tagged: LLM context, Markdown, Tokens, Prompt engineering, RAG

Markdown vs JSON vs Text for LLM Context