Choosing an HTML content extractor in 2026 is one of those decisions that feels small until you ship the wrong one and discover at scale that 12% of your pages are silently losing their main content. The three extractors that dominate practical pipelines — Mozilla Readability, Trafilatura, and jsdom-readability — each shine on a different corpus and break in a different way, and the GitHub-star ordering does not predict which will work for your use case.
This post is the benchmark we wish we had when we built BulkMD and had to pick between them. We measured all three on the same 50 pages, scored their output against a hand-curated ground truth, and tracked their runtime, memory footprint, and edge-case failure modes. If you want the higher-level case for why you want clean Markdown at all, the LLM context primer is the prequel; this post is for engineers who already know they want clean output and need to pick the tool that produces it.
What each extractor actually is
The three projects look superficially similar — give them HTML, get back the main content — but they took very different paths to get there. Understanding the design lineage explains every benchmark result that follows.
Mozilla Readability is the JavaScript library that powers Firefox's Reader View. It is a single 1,800-line file that ports the original Arc90 Readability algorithm forward through fifteen years of web evolution. It runs in the browser, in Node via jsdom, in Deno, and in any environment that can parse a DOM. Its job description is narrow: take a Document, return the most likely "article" subtree as a cleaned-up node, plus metadata. It does not output Markdown — you compose it with a serializer like Turndown for that.
Trafilatura is a Python library originating from the natural-language-processing community. It treats extraction as a text-mining problem: a cascade of rule-based heuristics over the parsed tree, with fallbacks based on readability-lxml and jusText when its own pass comes up short. It outputs cleaned plain text, XML/TEI, or Markdown directly, and ships with comment extraction, language detection, and date extraction out of the box. It runs as a Python process and is the slowest of the three but the most thorough on edge cases.
jsdom-readability is a thin wrapper that pairs Mozilla Readability with jsdom, a pure-JavaScript DOM implementation for Node. It is what most server-side Node extraction pipelines actually use when they say "Readability." Its performance characteristics are entirely shaped by jsdom: cold start dominates, but once jsdom is warm, parsing is fast.
How we benchmarked them
We assembled a corpus of 50 pages spread evenly across five categories: long-form blog posts, technical documentation, news articles, product/landing pages, and forum threads. For each page we manually annotated the "true" article body — the text a reader would actually want — and stored it as plain text. We then ran each extractor in its default configuration, serialized the output to Markdown (or plain text for Trafilatura's default), and computed the F1 score of extracted tokens against the ground truth.
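The scoring step can be sketched as a token-level F1 over multisets. This is a minimal illustration, not the exact scoring script: the tokenizer (lowercased runs of word characters) and the function names `tokenize` and `token_f1` are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into runs of word characters.

    Assumed tokenizer for illustration; a real benchmark may normalize differently.
    """
    return re.findall(r"\w+", text.lower())


def token_f1(extracted: str, truth: str) -> float:
    """Token-level F1 between extractor output and the hand-annotated body.

    Tokens are compared as multisets, so a word repeated three times in the
    ground truth must also appear three times in the output to count fully.
    """
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(truth))
    overlap = sum((ext & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())  # share of output that is real body
    recall = overlap / sum(ref.values())     # share of body that was recovered
    return 2 * precision * recall / (precision + recall)
```

Precision drops when an extractor lets navigation or footer text through; recall drops when it loses body text. Keeping both values per page, and not only the combined F1, makes it easier to tell those two failure types apart.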
Runtime measurements ran on a 2024-era M3 Pro laptop, single-threaded, with each page warmed once before the timed run. Trafilatura ran in Python 3.12; Readability and jsdom-readability ran in Node 22. We disabled all network access during extraction so that no extractor was advantaged or penalized by fetching auxiliary resources.
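The warm-once-then-time procedure is roughly the following. This is a sketch under assumptions: `extract` stands in for whichever extractor is being measured, and `median_runtime_ms` is a hypothetical helper name, not part of any of the three libraries.

```python
import statistics
import time
from typing import Callable, Iterable


def median_runtime_ms(extract: Callable[[str], object], pages: Iterable[str]) -> float:
    """Median per-page runtime in milliseconds, one warm-up call per page.

    `extract` is any callable that takes an HTML string; its return value
    is ignored here because only the elapsed time matters.
    """
    timings = []
    for html in pages:
        extract(html)  # warm-up run, not timed
        start = time.perf_counter()
        extract(html)  # timed run
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Cold-start cost has to be measured separately, in a fresh process, because the warm-up call in this loop deliberately hides it.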
The corpus, the ground-truth annotations, and the run scripts are not yet published, but the methodology is simple enough to reproduce on your own pages. The numbers below are medians across all 50 pages unless noted otherwise.
How big is the difference, really?
| Metric | Mozilla Readability | jsdom-readability | Trafilatura |
|---|---|---|---|
| Median F1 score (vs ground truth) | 0.94 | 0.94 | 0.91 |
| Pages within 5% of ground truth | 88% | 88% | 78% |
| Pages losing >20% of body | 4% | 4% | 6% |
| Median runtime per page | 18 ms (browser) | 14 ms (Node, warm) | 85 ms |
| Cold-start runtime | 22 ms | 310 ms (jsdom init) | 1,100 ms (import + init) |
| Bundle / install size | ~90 KB (with Turndown) | ~7 MB (jsdom) | ~120 MB (Python + deps) |
| Multilingual handling | English-tuned | English-tuned | All major languages |
| Comments extraction | No | No | Yes |
| Date extraction | Partial (via metadata) | Partial | Yes (htmldate) |
The headline result is that Readability and Trafilatura are closer than the discourse around them suggests. Readability leads on extraction fidelity by 0.03 median F1 (0.94 versus 0.91) and dominates on runtime (18 ms versus 85 ms per page). Trafilatura wins on coverage breadth (multilingual pages, comments, dates) and on one specific corpus category, forum threads, where its heuristics handle nested quoted content better than Readability's article-shaped assumptions.
The 4–6% of pages where each extractor loses significant body content are not random. They cluster on specific patterns we cover below.
Where each extractor fails
Every extractor has a failure mode, and the failure mode is more diagnostic than the average-case score. If your corpus is heavy on the pattern that breaks one tool, the averages will lie to you.
Where Readability stumbles
Readability's algorithm is article-shaped. It looks for a single <article> or <main>-like subtree, scores candidate nodes, and returns the highest-scoring one. This works beautifully on long-form prose. It fails on pages that have no single dominant article — comparison tables, dashboard-style technical docs, product landing pages with multiple equal-weight sections. On a Stripe-style documentation page with three side-by-side code samples, Readability sometimes returns only the column that scored highest, dropping the other two entirely.
It also struggles with infinite-scroll content. If a page only renders its first chunk on initial HTML and lazy-loads the rest, Readability has nothing to score on the late content. This is not a bug — the DOM genuinely doesn't have the content yet — but it means a Readability-based pipeline must either wait for hydration or accept partial extraction.
The third pattern is paywalled previews. Many news sites render the first two paragraphs to anonymous fetchers, then gate the rest. Readability extracts the preview faithfully and returns it without any signal that you got 200 words out of a 1,500-word article. A wrapper that compares extracted length to declared wordCount in application/ld+json is the cheapest defense.
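A wrapper of that kind might look like the sketch below. It is written in Python with the standard library only, so it sits beside the extractor and does not depend on it. The names `declared_word_count` and `looks_truncated` and the 0.5 ratio threshold are illustrative assumptions. Many pages do not declare `wordCount` at all, in which case the check reports nothing.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buf))
            self._in_jsonld = False


def _find_word_count(node) -> Optional[int]:
    """Depth-first search for a schema.org `wordCount` value in parsed JSON-LD."""
    if isinstance(node, dict):
        if "wordCount" in node:
            try:
                return int(node["wordCount"])
            except (TypeError, ValueError):
                pass  # malformed value: keep searching nested nodes
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = _find_word_count(child)
        if found is not None:
            return found
    return None


def declared_word_count(html: str) -> Optional[int]:
    """Return the first wordCount declared in the page's JSON-LD, if any."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        found = _find_word_count(data)
        if found is not None:
            return found
    return None


def looks_truncated(html: str, extracted_text: str, min_ratio: float = 0.5) -> bool:
    """Flag extractions much shorter than the page says the article is.

    min_ratio is an assumed threshold; tune it against your own corpus.
    """
    declared = declared_word_count(html)
    if not declared:
        return False  # no declared length, so no signal either way
    return len(extracted_text.split()) < min_ratio * declared
```

With the 200-of-1,500-word preview described above, the ratio is about 0.13, well under the assumed threshold, so the page would be flagged for review and not silently indexed.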
Where Trafilatura stumbles
Trafilatura's strength is breadth, and its weakness is decisiveness. On clean article pages where Readability returns exactly the right subtree, Trafilatura sometimes includes one or two adjacent navigation items it could not confidently rule out. The output is still readable, but the F1 score takes a small hit because the extra tokens are not in the ground truth.
The other Trafilatura failure mode is performance on rich, deeply nested DOMs. On a page with 8,000 nodes, Trafilatura's per-node scoring dominates the runtime, and 85 ms can grow to 400 ms on the worst cases. For batch pipelines this is fine; for interactive use it is not.
Where jsdom-readability stumbles
jsdom-readability inherits every failure mode of Readability, plus one of its own: jsdom is not a real browser. It does not run JavaScript by default, does not honor CSS, does not lazy-load images, and does not hydrate client-side frameworks. When the content is already present in the server-rendered HTML, none of this matters. On any page that depends on client rendering to populate its content, jsdom-readability sees the empty shell and returns the loading skeleton as if it were the article.
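One cheap way to spot such shells before extraction is to count how much visible text the raw HTML actually carries. The sketch below is a heuristic under stated assumptions, not a feature of jsdom or Readability: the class name, the function name, and the 50-word threshold are all hypothetical, and a short but legitimate page with scripts will be a false positive.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts words outside script/style/noscript/template and counts <script> tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words += len(data.split())


def likely_client_rendered(html: str, min_words: int = 50) -> bool:
    """True when the page ships scripts but almost no visible text.

    min_words is an assumed cutoff; calibrate it on pages you know are shells.
    """
    counter = VisibleTextCounter()
    counter.feed(html)
    return counter.scripts > 0 and counter.words < min_words
```

A check like this lets a pipeline route only the flagged pages through a headless browser, so the rest of the corpus keeps the fast path.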
The fix is to render the page first (Playwright, Puppeteer) and pass the post-hydration HTML to jsdom-readability. This works but doubles the cost — both in money and latency — and largely erases the speed advantage that made you pick jsdom-readability in the first place.
Which one belongs in a browser extension?
For a Manifest V3 Chrome extension like BulkMD, the choice is effectively forced. Trafilatura is Python; it cannot run in a browser. jsdom-readability is Node-flavored and ships a 7 MB DOM implementation that would balloon the extension's bundle and duplicate work the browser is already doing. That leaves Mozilla Readability proper, which was designed for exactly this case: it expects a real Document, runs in the content script's frame, and produces a clean article node in roughly 20 ms on the page the user just opened.
This is why BulkMD pairs Readability with Turndown plus the GFM plugin. The two libraries together are about 90 KB minified, run entirely in the content script, and produce clean Markdown without ever leaving the user's tab. We cover the resulting browser-side workflow in more depth in the bulk export walk-through.
The same logic applies in reverse for server-side batch pipelines. If you are running an overnight job that ingests 100,000 pages into a RAG index, the 90-KB-vs-120-MB question is irrelevant; what matters is multilingual coverage and date extraction, and Trafilatura wins. For tools sitting between those two extremes (a CLI that runs on a developer's machine, a Node service that handles a steady stream of requests), jsdom-readability is the pragmatic middle ground, provided you have a plan for client-rendered pages.
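To put the batch case in numbers, here is a back-of-the-envelope calculation using the cold-start and median per-page figures from the table. It assumes a single process, one cold start for the whole run, and that the median holds for every page; real runs will vary, especially on the deeply nested pages discussed earlier.

```python
def batch_seconds(pages: int, cold_start_ms: float, per_page_ms: float) -> float:
    """Total wall-clock seconds: one cold start plus a fixed cost per page."""
    return (cold_start_ms + pages * per_page_ms) / 1000.0


# (cold-start ms, median per-page ms) taken from the benchmark table
EXTRACTORS = {
    "jsdom-readability": (310, 14),
    "Trafilatura": (1100, 85),
}

for name, (cold, per_page) in EXTRACTORS.items():
    total = batch_seconds(100_000, cold, per_page)
    print(f"{name}: {total / 60:.1f} min")
```

Under those assumptions a 100,000-page run takes about 23 minutes with jsdom-readability and about 2.4 hours with Trafilatura, single-threaded. Both fit in an overnight window, and the cold-start term is negligible at this scale.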
A decision tree for picking one
If you are choosing today, the decision tree is short.
You ship a browser extension or any code that runs in a real Document — pick Mozilla Readability and pair it with Turndown for serialization. The runtime advantage and the bundle-size advantage are decisive, and you inherit the page's authenticated state for free.
You run server-side extraction in Python, especially across multiple languages — pick Trafilatura. Its breadth is genuinely without equal in the Python ecosystem, the slower runtime is acceptable in batch contexts, and the comment / date / language modules will save you from reimplementing them yourself.
You run server-side extraction in Node and you control the input HTML (so client-rendering is not in play) — pick jsdom-readability. The cold-start cost is amortized across the run, the algorithm is the same proven Readability, and you stay inside one runtime.
You face mixed conditions — heavy client-side rendering, multilingual pages, an interactive surface — combine a headless browser for rendering with Mozilla Readability for extraction. The combination is more moving parts but produces the most consistent results across diverse corpora.
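The four branches above can be written down as a small function. This is one reading of the tree, not a canonical one: the `Pipeline` fields, the function name `pick_extractor`, and the order of the checks are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    """Hypothetical description of where and how extraction will run."""
    runs_in_real_document: bool           # browser extension or other live DOM
    runtime: str = "node"                 # "python" or "node" for server-side work
    needs_client_rendering: bool = False  # pages only populate after hydration


def pick_extractor(p: Pipeline) -> str:
    """Map a pipeline description onto the recommendations in this section."""
    if p.runs_in_real_document:
        return "Mozilla Readability + Turndown"
    if p.needs_client_rendering:
        # The mixed-conditions branch: render first, then extract.
        return "headless browser + Mozilla Readability"
    if p.runtime == "python":
        return "Trafilatura"
    if p.runtime == "node":
        return "jsdom-readability"
    raise ValueError(f"unsupported runtime: {p.runtime!r}")


print(pick_extractor(Pipeline(runs_in_real_document=False, runtime="python")))
```

The client-rendering check comes before the runtime check because a server-side extractor that receives an empty shell fails regardless of language.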
TL;DR
There is no single "best" HTML extractor — there is a best extractor for your corpus and your runtime. Mozilla Readability wins on browser-side fidelity and bundle size. Trafilatura wins on multilingual coverage and rich metadata. jsdom-readability is the pragmatic Node-server choice when you control the HTML.
If you are doing extraction inside a browser tab and want a battle-tested pipeline that just works, BulkMD ships Readability plus a markdownlint-compliant Turndown pipeline as a free Chrome extension — install it and you get the recommended browser-side stack with zero setup.
Frequently asked questions
Is there a single extractor that handles every corpus well?
What about newer LLM-based extractors? Don't they outperform heuristic libraries?
Can I run Mozilla Readability without jsdom on the server?
Does Trafilatura support Markdown output directly?
What's the right way to test extraction quality on my own corpus?
About the author
Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.