BulkMD

How to Convert Any Web Page to Markdown for ChatGPT & Claude (2026 Guide)

A practical guide to turning live web pages into clean Markdown for LLMs — why it cuts tokens 60–80%, how Readability strips boilerplate, and the fastest browser-side workflow.

M. H. Tawfik · 8 min read

When you paste a raw web page into ChatGPT or Claude, you are paying — in tokens, in latency, and in answer quality — for navigation bars, cookie banners, ad slots, and three nested <div> wrappers around every paragraph. The same article expressed as clean Markdown is typically 60–80% smaller and produces noticeably sharper answers.

This guide walks through why Markdown is the right format for LLM context and the fastest way to produce it from any page in your browser. If you care primarily about the dollars-and-cents math behind the savings, the token-cost breakdown post goes deeper on the numbers.

Why Markdown is the right shape for LLM context

LLMs were trained on a web that looks a lot like Markdown — README.md files, Stack Overflow answers, GitHub issues, technical blog posts. Markdown gives the model exactly three things it needs:

  1. Structure without noise. A # heading, a - list, a > quote. No <div class="prose prose-invert sm:prose-lg"> wrappers swallowing 200 tokens before the model sees a single word.
  2. Stable semantics. Bold means emphasis. A link is [text](url). There is no ambiguity about whether <span style="font-weight: 700"> is a heading or just a bold word.
  3. Predictable tokenization. Markdown tokenizes densely. Five lines of cleanly nested list items take roughly the same token budget as one paragraph of HTML soup.

One of the cheapest improvements you can make to RAG pipelines and one-shot prompts is to stop feeding the model the source HTML.

What a "clean" Markdown export should keep — and drop

The hard part of HTML-to-Markdown is not the conversion. Every tool can turn <h1> into #. The hard part is deciding what to throw away.

A good extractor will drop:

  • Site chrome (header, footer, nav, sidebar, sticky cookie banner)
  • Ad iframes and tracking pixels
  • Social share buttons and "related posts" widgets
  • aria-hidden decorations and screen-reader-only labels that show up as stray text once styling is gone
  • Empty links, <img> tags with no usable source, anchor-only headings

And it must keep:

  • The actual article body, in reading order
  • Code blocks with their language hint (so the model knows it's TypeScript, not Python)
  • Image captions paired with their <figure>/<figcaption> source
  • Blockquotes, definition lists, tables — formats LLMs reason over well
  • The page title, source URL, and date as a citation block at the top
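
The citation block in the last bullet could be assembled along these lines. This is a minimal sketch, not BulkMD's actual output format: the PageMeta shape, the citationBlock and withCitation names, and the blockquote layout are all assumptions made for illustration.

```typescript
// Hypothetical metadata shape; a real extractor may expose different fields.
interface PageMeta {
  title: string;
  url: string;
  // ISO 8601 date string, if the page exposes one.
  published?: string;
}

// Build a short citation block: title as the top heading, then source and date.
function citationBlock(meta: PageMeta): string {
  const lines = [`# ${meta.title.trim()}`, "", `> Source: ${meta.url}`];
  if (meta.published) {
    lines.push(`> Published: ${meta.published}`);
  }
  return lines.join("\n") + "\n";
}

// Prepend the citation block to an already converted Markdown body.
function withCitation(meta: PageMeta, bodyMarkdown: string): string {
  return `${citationBlock(meta)}\n${bodyMarkdown.trim()}\n`;
}
```

Keeping the date optional avoids inventing one for pages that do not publish it.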

Mozilla's Readability library, the same engine that powers Firefox Reader View, handles the extraction pass (deciding what to drop) well. A solid Markdown converter (we use Turndown plus the GFM plugin and a stack of markdownlint-compliant rules) handles the conversion pass.

Three ways to do the conversion

There are three families of tools, in rough order of friction:

Server-side conversion APIs

You POST a URL, the service fetches the page, runs Readability + a converter, and returns Markdown. Reliable for archival batch jobs, but:

  • Every request leaves your machine, so logged-in / paywalled pages stay out of reach.
  • Auth-walled sources (Notion, internal wikis, Slack threads) are entirely off-limits.
  • You pay per call and inherit the vendor's privacy posture.

Good for: server-to-server pipelines where you control the URL list and the content is fully public.

Pandoc on the command line

pandoc input.html -o output.md is the long-standing answer. It works, but you still need to get the HTML to disk somehow — usually curl, which means you'll re-implement the Readability pass yourself or accept the boilerplate.

Good for: one-off scripted conversions of static HTML files.

A local browser extension

The page is already rendered in your tab, with your cookies, your subscriptions, your authenticated state. A Manifest V3 extension can run Readability + Turndown inside the page and put the Markdown directly on your clipboard — no upload, no API key, no server round-trip.

This is the approach BulkMD takes. The conversion happens inside the page's content script, so the only thing that leaves the browser is the Markdown you paste into ChatGPT.

A fast workflow for collecting LLM context

The repeatable flow we recommend for research, prompt engineering, and RAG ingestion:

  1. Open every source tab. Search results, documentation, GitHub issues, internal pages. Whatever the model needs to reason from.
  2. Convert in bulk. Paste the list of URLs into BulkMD's bulk dashboard — it walks the queue tab-by-tab, runs Readability + Turndown, and produces one .md per page (or a single concatenated file). The bulk export post walks through the queue engineering in detail.
  3. Skim and curate. Markdown is human-readable. A two-minute pass deletes the obvious dross before it reaches the model.
  4. Bundle the prompt. Prefix each section with ## Source: <url> so the model can cite specifically. LLMs are markedly better at citing structured Markdown than they are at quoting raw HTML.
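
Step 4 can be sketched as a small pure function. The Source shape, the bundlePrompt name, the "---" separator, and the trailing "## Question" section are assumptions for illustration; only the "## Source: <url>" prefix comes from the workflow above.

```typescript
// Hypothetical shape for one converted page.
interface Source {
  url: string;
  markdown: string;
}

// Prefix each page with "## Source: <url>" and append the question last,
// so the model can cite a specific section by its URL.
function bundlePrompt(sources: Source[], question: string): string {
  const sections = sources.map(
    (s) => `## Source: ${s.url}\n\n${s.markdown.trim()}`
  );
  const parts = sections.concat(`## Question\n\n${question.trim()}`);
  return parts.join("\n\n---\n\n") + "\n";
}
```

Putting the question after the sources is a design choice, not a requirement; the point is that every section carries its own URL.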

A reading list of fifteen articles that started at 1.4 MB of raw HTML usually lands around 280–350 KB of clean Markdown, comfortably inside a 200K-token context window with room left for the actual question.
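
As a rough sanity check on that claim, the sketch below turns a Markdown size into an estimated token count. It assumes the tokens-per-KB ratio implied by the Clean Markdown row of the benchmark table later in this post (18 KB, about 7,800 tokens) holds across a whole reading list, which real pages will only approximate. The function names are illustrative.

```typescript
// Median figures from the Clean Markdown row of the benchmark table.
const MEDIAN_MARKDOWN_KB = 18;
const MEDIAN_MARKDOWN_TOKENS = 7800;

// Assumption: this ratio is representative; individual pages will vary.
const TOKENS_PER_KB = MEDIAN_MARKDOWN_TOKENS / MEDIAN_MARKDOWN_KB;

// Estimate the token count for a given amount of Markdown.
function estimateTokens(sizeKB: number): number {
  return Math.round(sizeKB * TOKENS_PER_KB);
}

// Tokens left for the question and the answer after loading the sources.
function remainingBudget(sizeKB: number, contextWindow: number): number {
  return contextWindow - estimateTokens(sizeKB);
}
```

Under that assumption, remainingBudget(350, 200000) leaves roughly 48,000 tokens at the top of the 280–350 KB range.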

How big is the token saving, really?

We benchmarked twenty pages spanning long-form blog posts, technical docs, news articles, and product landing pages. The median:

| Source format | Median size | Median tokens (cl100k) |
|---|---|---|
| Raw page HTML | 142 KB | ~38,000 |
| Readability-only HTML | 71 KB | ~19,000 |
| Clean Markdown (BulkMD) | 18 KB | ~7,800 |

That is a ~79% reduction in tokens vs. raw HTML and ~59% vs. Readability output alone. The Markdown step matters because Turndown collapses heavy attribute soup (class=, data-*, inline styles) that survives Readability.
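
Both percentages follow directly from the median token counts in the table. A minimal sketch of the arithmetic, with percentReduction as an illustrative helper name:

```typescript
// Percentage by which `after` is smaller than `before`.
function percentReduction(before: number, after: number): number {
  if (before <= 0) {
    throw new RangeError("before must be positive");
  }
  return (1 - after / before) * 100;
}

// Median token counts from the table above.
const vsRawHtml = percentReduction(38000, 7800); // about 79.5
const vsReadability = percentReduction(19000, 7800); // about 58.9
```

The same helper applied to the size column gives larger figures (142 KB to 18 KB is about 87%), so it is worth stating which column a quoted percentage refers to.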

Common gotchas

A few patterns that bite people the first time they industrialize this:

  • Lazy-loaded images. Many sites set src to a placeholder and put the real URL in data-src or srcset. A naive converter emits broken image links. Resolve before serializing.
  • SPA-rendered content. Pages that hydrate client-side need the converter to wait until the DOM is stable. A content script that runs at document_idle (and waits one tick after) avoids almost all of these.
  • Code block languages. If your converter doesn't read class="language-ts" and emit ```ts, the model loses the hint that tells it which language it is reading. Detect the language; don't drop it.
  • Heading demotion. Two <h1>s on a page is common. Demote subsequent <h1>s to <h2> so your Markdown still has a single top-level heading — markdownlint MD025.
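
Three of these gotchas reduce to small pure functions. The sketch below is illustrative rather than BulkMD's implementation: the helper names are hypothetical, the Attrs map stands in for a DOM element's attributes, and the srcset handling is a heuristic that takes the first candidate and ignores URLs containing commas.

```typescript
// Plain attribute map standing in for a DOM element's attributes.
type Attrs = Record<string, string | undefined>;

// Lazy-loaded images: prefer data-src, then the first srcset candidate, then src.
function resolveImageSrc(attrs: Attrs): string | undefined {
  const fromSrcset = attrs["srcset"]?.split(",")[0]?.trim().split(/\s+/)[0];
  return attrs["data-src"] || fromSrcset || attrs["src"] || undefined;
}

// Code block languages: read "language-ts" or "lang-ts" from a class attribute.
// Returns an empty string when no language class is present.
function detectLanguage(className: string): string {
  const match = /(?:^|\s)(?:language|lang)-([\w+#-]+)/.exec(className);
  return match ? match[1].toLowerCase() : "";
}

// Matches an opening or closing code fence (three backticks or three tildes).
const FENCE = /^\s*(`{3}|~{3})/;

// Heading demotion: keep the first "# " heading and turn later ones into "## ".
// Lines inside fenced code blocks are left alone so "# comment" lines survive.
function demoteExtraH1s(markdown: string): string {
  let inFence = false;
  let seenH1 = false;
  return markdown
    .split("\n")
    .map((line) => {
      if (FENCE.test(line)) {
        inFence = !inFence;
        return line;
      }
      if (inFence || !/^# /.test(line)) {
        return line;
      }
      if (!seenH1) {
        seenH1 = true;
        return line;
      }
      return "#" + line;
    })
    .join("\n");
}
```

The fence tracking is deliberately simple: it toggles on any fence line and does not handle fences nested inside longer fences.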

TL;DR

If you spend any meaningful time pasting web content into ChatGPT, Claude, or your own RAG pipeline, the highest-leverage change you can make this week is to stop pasting HTML. Convert to clean Markdown in the browser, keep a citation block at the top of each page, and feed the model exactly what it was trained to read.

That is the entire premise behind BulkMD — install it free from the Chrome Web Store and run your next prompt on Markdown instead.

Frequently asked questions

Does converting to Markdown lose information that the LLM needs?

For prose, almost never. Markdown preserves headings, lists, code blocks, links, blockquotes, and tables — the structural signals models actually attend to. The things you lose (CSS classes, data-* attributes, inline styles) are noise the model was already ignoring. The exception is pages where layout encodes meaning (heavy financial tables, dashboards) — for those, keep HTML or use a screenshot with a vision model.

Why not just summarize the page server-side and feed the summary?

Summaries lose the verbatim text the model needs for citations, and they introduce a second hop where errors compound. Clean Markdown is small enough that you don't need a summary step — you fit the whole article inside the context window and let the answering model decide what's relevant. Summarize only when you genuinely cannot fit the corpus.

Does this work with paywalled or logged-in pages?

Only if the conversion runs in the browser. Server-side conversion APIs fetch the page anonymously and hit the paywall. A local extension like BulkMD runs against the already-rendered DOM in your authenticated tab, so anything you can read, it can convert.

How is this different from copy-pasting the page into ChatGPT?

Copy-paste captures visible text but no structure — every heading becomes a normal paragraph, code blocks lose their language hint, and tables are flattened to whitespace. Markdown preserves the structure that lets the model cite specific sections back to you. It's also about 30–40% fewer tokens than a raw copy-paste for the same article.

What tokenizer is the 60–80% number measured against?

cl100k_base, the tokenizer used by GPT-4 and GPT-3.5 Turbo (GPT-4o uses the newer o200k_base). The percentage holds within a few points for Claude's tokenizer because both have similar byte-pair vocabularies for Latin-script text. The savings are slightly larger for Claude because its tokenizer is marginally less efficient on HTML attribute syntax.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.

Tagged: LLM context, Markdown, ChatGPT, Claude, Readability