When you paste a raw web page into ChatGPT or Claude, you are paying — in tokens, in latency, and in answer quality — for navigation bars, cookie banners, ad slots, and three nested <div> wrappers around every paragraph. The same article expressed as clean Markdown is typically 60–80% smaller and produces noticeably sharper answers.
This guide walks through why Markdown is the right format for LLM context and the fastest way to produce it from any page in your browser. If you care primarily about the dollars-and-cents math behind the savings, the token-cost breakdown post goes deeper on the numbers.
Why Markdown is the right shape for LLM context
LLMs were trained on a web that looks a lot like Markdown — README.md files, Stack Overflow answers, GitHub issues, technical blog posts. Markdown gives the model exactly three things it needs:
- Structure without noise. A # heading, a - list, a > quote. No <div class="prose prose-invert sm:prose-lg"> wrappers swallowing 200 tokens before the model sees a single word.
- Stable semantics. Bold means emphasis. A link is [text](url). There is no ambiguity about whether <span style="font-weight: 700"> is a heading or just a bold word.
- Predictable tokenization. Markdown tokenizes densely. Five lines of cleanly nested list items take roughly the same token budget as one paragraph of HTML soup.
One of the cheapest improvements you can make to RAG pipelines and one-shot prompts is to stop feeding the model the source HTML.
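To make the difference concrete, here is a minimal sketch that compares one sentence as wrapper-heavy HTML and as Markdown. The sample strings are invented for illustration, and the estimateTokens helper uses a rough four-characters-per-token heuristic rather than a real tokenizer, so treat the numbers as directional only.

```typescript
// Rough heuristic (an assumption, not a real tokenizer): ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// The same sentence, first as typical wrapper-heavy HTML, then as Markdown.
const html =
  '<div class="prose prose-invert sm:prose-lg"><p><span style="font-weight: 700">Note:</span> ' +
  'see <a href="https://example.com/docs" class="link link-primary">the docs</a>.</p></div>';
const markdown = "**Note:** see [the docs](https://example.com/docs).";

const htmlTokens = estimateTokens(html);
const markdownTokens = estimateTokens(markdown);
const savedPct = Math.round((1 - markdownTokens / htmlTokens) * 100);

console.log({ htmlTokens, markdownTokens, savedPct });
```

For real measurements, swap the heuristic for the tokenizer your target model actually uses.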
What a "clean" Markdown export should keep — and drop
The hard part of HTML-to-Markdown is not the conversion. Every tool can turn <h1> into #. The hard part is deciding what to throw away.
A good extractor will drop:
- Site chrome (header, footer, nav, sidebar, sticky cookie banner)
- Ad iframes and tracking pixels
- Social share buttons and "related posts" widgets
- aria-hidden decorations and screen-reader-only labels that became visual junk
- Empty links, image-less <img> tags, anchor-only headings
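The drop list above can be sketched as a simple predicate. This is a crude illustration, not how Readability works internally: the NodeInfo shape, the tag set, and the class/id hints are all hypothetical and would need tuning per site.

```typescript
// Simplified stand-in for a DOM element; a real extractor would inspect live nodes.
interface NodeInfo {
  tag: string;
  attrs: Record<string, string | undefined>;
  text: string;
}

// Tags that usually hold site chrome or embedded ads (an assumption, not a standard).
const CHROME_TAGS = new Set(["header", "footer", "nav", "aside", "iframe"]);
// Hypothetical class/id hints for banners and widgets.
const JUNK_HINT = /cookie|share|related|advert|sidebar/i;

function shouldDrop(node: NodeInfo): boolean {
  const tag = node.tag.toLowerCase();
  if (CHROME_TAGS.has(tag)) return true;
  if (node.attrs["aria-hidden"] === "true") return true;
  if (JUNK_HINT.test(`${node.attrs["class"] ?? ""} ${node.attrs["id"] ?? ""}`)) return true;
  if (tag === "a" && node.text.trim() === "") return true; // empty link
  if (tag === "img" && !node.attrs["src"]) return true; // <img> with no source
  return false;
}

console.log(shouldDrop({ tag: "nav", attrs: {}, text: "Home" }));
console.log(shouldDrop({ tag: "p", attrs: { class: "prose" }, text: "Hello" }));
```

A tag-based rule like this will occasionally drop a legitimate header inside an article, which is one reason a scoring approach such as Readability's tends to do better in practice.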
And it must keep:
- The actual article body, in reading order
- Code blocks with their language hint (so the model knows it's TypeScript, not Python)
- Image captions paired with their <figure>/<figcaption> source
- Blockquotes, definition lists, tables: formats LLMs reason over well
- The page title, source URL, and date as a citation block at the top
Mozilla's Readability library, the same engine that powers Firefox Reader View, handles the first step (deciding what to drop) well. A solid Markdown converter (we use Turndown plus the GFM plugin and a stack of markdownlint-compliant rules) handles the second step: serializing what is kept.
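The citation block from the keep list could be assembled along these lines. This is a minimal sketch: the PageMeta shape and the withCitationBlock name are hypothetical, not BulkMD's actual API, and the list-style header is just one reasonable layout.

```typescript
// Hypothetical metadata shape; a real extractor would fill it from the page.
interface PageMeta {
  title: string;
  url: string;
  date?: string; // e.g. "2025-01-15"; optional because many pages omit it
}

// Prepend a short citation list so the model can attribute what follows.
function withCitationBlock(meta: PageMeta, bodyMarkdown: string): string {
  const header = [`- Title: ${meta.title}`, `- Source: ${meta.url}`];
  if (meta.date) header.push(`- Date: ${meta.date}`);
  return `${header.join("\n")}\n\n${bodyMarkdown.trim()}\n`;
}

const page = withCitationBlock(
  { title: "Example article", url: "https://example.com/post", date: "2025-01-15" },
  "# Example article\n\nBody text.\n",
);
console.log(page);
```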
Three ways to do the conversion
There are three families of tools, in rough order of friction:
Server-side conversion APIs
You POST a URL, the service fetches the page, runs Readability + a converter, and returns Markdown. Reliable for archival batch jobs, but:
- Every request leaves your machine, so logged-in / paywalled pages stay out of reach.
- Auth-walled sources (Notion, internal wikis, Slack threads) are entirely off-limits.
- You pay per call and inherit the vendor's privacy posture.
Good for: server-to-server pipelines where you control the URL list and the content is fully public.
Pandoc on the command line
pandoc input.html -o output.md is the long-standing answer. It works, but you still need to get the HTML to disk somehow — usually curl, which means you'll re-implement the Readability pass yourself or accept the boilerplate.
Good for: one-off scripted conversions of static HTML files.
A local browser extension
The page is already rendered in your tab, with your cookies, your subscriptions, your authenticated state. A Manifest V3 extension can run Readability + Turndown inside the page and put the Markdown directly on your clipboard — no upload, no API key, no server round-trip.
This is the approach BulkMD takes. The conversion happens inside the page's content script, so the only thing that leaves the browser is the Markdown you paste into ChatGPT.
A fast workflow for collecting LLM context
The repeatable flow we recommend for research, prompt engineering, and RAG ingestion:
- Open every source tab. Search results, documentation, GitHub issues, internal pages. Whatever the model needs to reason from.
- Convert in bulk. Paste the list of URLs into BulkMD's bulk dashboard; it walks the queue tab-by-tab, runs Readability + Turndown, and produces one .md per page (or a single concatenated file). The bulk export post walks through the queue engineering in detail.
- Skim and curate. Markdown is human-readable. A two-minute pass deletes the obvious dross before it reaches the model.
- Bundle the prompt. Prefix each section with ## Source: <url> so the model can cite specifically. LLMs are markedly better at citing structured Markdown than they are at quoting raw HTML.
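The bundling step can be sketched in a few lines. The SourceDoc shape, the bundlePrompt name, and the horizontal-rule separator are assumptions made for illustration; only the ## Source: <url> prefix comes from the workflow above.

```typescript
interface SourceDoc {
  url: string;
  markdown: string;
}

// Prefix each converted page with "## Source: <url>", then append the question.
function bundlePrompt(docs: SourceDoc[], question: string): string {
  const sections = docs.map((doc) => `## Source: ${doc.url}\n\n${doc.markdown.trim()}`);
  sections.push(`## Question\n\n${question.trim()}`);
  // Blank lines around "---" keep it a horizontal rule rather than a setext heading.
  return sections.join("\n\n---\n\n") + "\n";
}

const bundled = bundlePrompt(
  [
    { url: "https://example.com/a", markdown: "Alpha notes.\n" },
    { url: "https://example.com/b", markdown: "Beta notes.\n" },
  ],
  "Which source mentions beta?",
);
console.log(bundled);
```

If the source pages contain their own level-two headings, it may be worth demoting them one level first so the Source headings stay the clear section boundaries.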
A reading list of fifteen articles that started at 1.4 MB of raw HTML usually lands around 280–350 KB of clean Markdown, comfortably inside a 200K-token context window with room left for the actual question.
How big is the token saving, really?
We benchmarked twenty pages spanning long-form blog posts, technical docs, news articles, and product landing pages. The median:
| Source format | Median size | Median tokens (cl100k) |
|---|---|---|
| Raw page HTML | 142 KB | ~38,000 |
| Readability-only HTML | 71 KB | ~19,000 |
| Clean Markdown (BulkMD) | 18 KB | ~7,800 |
That is a ~79% reduction in tokens vs. raw HTML and ~59% vs. Readability output alone. The Markdown step matters because Turndown collapses heavy attribute soup (class=, data-*, inline styles) that survives Readability.
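The percentages follow directly from the median token counts in the table. A quick check, with the figures copied from the table and a hypothetical reductionPct helper:

```typescript
// Median figures copied from the table above.
const median = {
  rawHtml: { kb: 142, tokens: 38000 },
  readabilityHtml: { kb: 71, tokens: 19000 },
  cleanMarkdown: { kb: 18, tokens: 7800 },
};

// Percentage saved when going from `before` to `after`, rounded to a whole number.
function reductionPct(before: number, after: number): number {
  return Math.round((1 - after / before) * 100);
}

const tokensVsRaw = reductionPct(median.rawHtml.tokens, median.cleanMarkdown.tokens);
const tokensVsReadability = reductionPct(median.readabilityHtml.tokens, median.cleanMarkdown.tokens);
const bytesVsRaw = reductionPct(median.rawHtml.kb, median.cleanMarkdown.kb);

console.log({ tokensVsRaw, tokensVsReadability, bytesVsRaw });
// tokensVsRaw: 79, tokensVsReadability: 59, bytesVsRaw: 87
```

Note that the byte-size reduction (about 87%) is larger than the token reduction, so quoting file sizes alone would overstate the savings you see on a token bill.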
Common gotchas
A few patterns that bite people the first time they industrialize this:
- Lazy-loaded images. Many sites set src to a placeholder and put the real URL in data-src or srcset. A naive converter emits broken image links. Resolve before serializing.
- SPA-rendered content. Pages that hydrate client-side need the converter to wait until the DOM is stable. A content script that runs at document_idle (and waits one tick after) avoids almost all of these.
- Code block languages. If your converter doesn't read class="language-ts" and emit ```ts, the model loses the language hint. Detect the language; don't drop it.
- Heading demotion. Two <h1>s on a page is common. Demote subsequent <h1>s to <h2> so your Markdown still has a single top-level heading (markdownlint rule MD025).
TL;DR
If you spend any meaningful time pasting web content into ChatGPT, Claude, or your own RAG pipeline, the highest-leverage change you can make this week is to stop pasting HTML. Convert to clean Markdown in the browser, keep a citation block at the top of each page, and feed the model exactly what it was trained to read.
That is the entire premise behind BulkMD — install it free from the Chrome Web Store and run your next prompt on Markdown instead.
Frequently asked questions
Does converting to Markdown lose information that the LLM needs?
Why not just summarize the page server-side and feed the summary?
Does this work with paywalled or logged-in pages?
How is this different from copy-pasting the page into ChatGPT?
What tokenizer is the 60–80% number measured against?
About the author
Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.