If you have ever built a workflow that converts web pages into clean Markdown for LLM context, you have hit the choice that this post is about: do you run a server-side scraping API (Firecrawl, Jina Reader, Browserless, your own headless-Chrome cluster), or do you run a browser extension on your own machine? Both work. Both produce Markdown. Both have committed evangelists. The right answer depends on four metrics that no marketing page surfaces clearly, and on a corpus that has at least one quality — authentication, freshness, or volume — that pushes the trade in one direction.

This post is the comparison we wish existed when we built BulkMD and had to articulate what we were and were not optimizing for. The benchmark data comes from running both approaches across the same 200-page mixed corpus (public docs, paywalled news, logged-in admin pages, ordinary blogs) and measuring what actually came out. For the extractor and serializer choices that sit underneath either approach, see the Readability vs Trafilatura and Turndown vs Pandoc comparisons.

What each approach actually is

A server-side scraping API is a backend service you POST URLs to. The service runs headless Chrome (or its own HTML parser), fetches the URL, runs Readability-like content extraction, and returns Markdown or cleaned text. Firecrawl, Jina Reader, Browserless, and rolling your own Playwright cluster all fall in this bucket. The architecture is uniform: a worker pool, a job queue, and a JSON response per URL.
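
The worker-pool, job-queue, JSON-response shape described above can be sketched in a few lines. This is a minimal illustration, not how any particular vendor implements it: `convert_page` and `run_batch` are hypothetical names, and `convert_page` returns a canned result instead of touching the network so the sketch runs offline.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def convert_page(url: str) -> dict:
    """Hypothetical stand-in for the fetch + extract + serialize step.

    A real service would render the page and run content extraction here;
    this stub only echoes the URL back.
    """
    return {"url": url, "status": "ok", "markdown": f"# Placeholder for {url}\n"}


def run_batch(urls: list[str], workers: int = 4) -> list[str]:
    """Fan a URL list out over a worker pool; return one JSON document per URL."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `urls`.
        results = pool.map(convert_page, urls)
        return [json.dumps(result) for result in results]


if __name__ == "__main__":
    for line in run_batch(["https://example.com/a", "https://example.com/b"]):
        print(line)
```

The executor's internal queue plays the role of the job queue here; a production service would typically put a durable queue in front of the workers so jobs survive a restart.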

A browser extension runs inside your already-rendered tab. It reads the DOM your browser has already built — with your cookies, your subscriptions, your authentication state — and extracts content from that. BulkMD is an example; the official Obsidian Web Clipper is another. The architecture is uniform here too: a content script, a popup or dashboard for orchestration, and chrome.downloads for output.

The shapes look superficially similar — URL in, Markdown out — but they sit on opposite sides of a fundamental privilege boundary. The server has no access to your authenticated state; the extension has no horizontal scale. Every meaningful tradeoff below is a downstream consequence of that boundary.

The four metrics that actually matter

We ran each approach against the same 200 URLs and measured four things: end-to-end latency per page, percentage of pages with usable content extracted, marginal cost per thousand pages, and rate-limit incidents per hundred pages.

| Metric | Browser extension (BulkMD) | Server-side API (Firecrawl-class) |
| --- | --- | --- |
| Median latency per page | 0.8 s | 1.4 s |
| Successful extractions (200-page mix) | 91% | 64% |
| Cost per 1,000 pages | $0 (local) | $4–$12 (typical pricing) |
| Rate-limit incidents per 100 pages | 0–1 | 6–18 |
| Setup time | Install extension (1 min) | API key + queue logic (15–60 min) |
| Concurrent capacity | 1–3 tabs typical | Hundreds to thousands |
| Auth-walled page support | Yes (already logged in) | No (or fragile) |
| Unattended overnight run | Limited | Native |

Two of these numbers want explanation. The 91% vs 64% success rate is dominated by auth-walled and dynamically rendered pages: server-side scrapers fetch the public, anonymous version of those pages, which is often a login wall or a skeleton. The extension reads them as the user already sees them, content included. The server side's 36% failure rate is almost entirely concentrated in those page categories.

The 6–18 rate-limit incidents per 100 pages on the server side reflect the reality that a server scraper hitting site X with twenty parallel workers looks, to site X, like a coordinated attack from a single IP. Many sites have aggressive WAFs that 403 such traffic. The extension distributes naturally because each user is a separate browser, separate IP, and separate session.

When server-side wins decisively

For corpora that share three properties (public-facing, large, and not time-sensitive), server-side wins outright. A nightly job that ingests ten thousand pages of competitor documentation into your RAG index is the canonical server-side use case. You write the URL list once, the worker pool grinds through overnight, the cost lands at roughly forty to one hundred twenty dollars at the $4–$12 per thousand pages in the table above, and you wake up to a fresh corpus. If the URL list itself is the only crawler you want to maintain, see how to build an LLM corpus without writing a crawler.
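
The per-page arithmetic behind a nightly job like this is easy to check directly. A minimal sketch, assuming the $4–$12 per 1,000 pages range from the table above; the function name and defaults are illustrative, and real vendor pricing varies by plan.

```python
def cost_range_usd(
    pages: int,
    low_per_1k: float = 4.0,
    high_per_1k: float = 12.0,
) -> tuple[float, float]:
    """Return the (low, high) marginal cost in dollars for `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    thousands = pages / 1000
    return (thousands * low_per_1k, thousands * high_per_1k)


if __name__ == "__main__":
    low, high = cost_range_usd(10_000)
    print(f"10,000 pages: ${low:.0f} to ${high:.0f}")
```

For a ten-thousand-page job this gives $40 at the low end of the range and $120 at the high end, before any retries.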

The other server-side strength is when the workflow needs to run without you in the loop. A scheduled refresh of a hundred docs sites every Monday morning cannot depend on someone manually opening their browser. Server-side jobs run on cron and recover from failure without human attention; an extension-based workflow needs a human-in-the-loop or at minimum a machine with the extension's browser running.

A third server-side win is when you genuinely need horizontal scale. An extension caps at the user's machine: the table above puts typical capacity at one to three concurrent tabs, and resource contention dominates soon after that. A server cluster scales to whatever you are willing to pay for. If your job is genuinely thousands of pages per hour rather than hundreds, the extension architecture cannot keep up.

When browser extensions win decisively

For corpora that share any one of three properties — auth-walled, fresh/personalized, or low-volume — extensions win outright. Anything behind a paywall (the New York Times, the Financial Times, most academic journals), anything behind a login (your team's Notion, your company's wiki, internal documentation), or anything that returns different content based on cookies (recommendation feeds, A/B-tested marketing pages) is invisible to a server-side scraper but trivial for an extension to capture.

The other extension strength is small-batch interactive work. If you are reading research papers and want to drop the next twenty into your knowledge base, the round trip of "set up an API key, write a small script, paste URLs, wait for the batch" is far longer than "click the extension, paste URLs, click run." For batches under a hundred pages, which covers most personal workflows, the extension wins on wall-clock time even if its per-page latency is comparable.

A third extension strength is privacy posture. Server-side scrapers see every URL you submit, which means your reading list is sitting in some vendor's logs. Extensions that run conversion locally — like BulkMD — see only your own browser; nothing leaves the machine. For research workflows where what you are reading is itself sensitive (legal, M&A, investigative journalism), this is dispositive.

The crossover at 50–100 pages

The interesting space is the middle: workflows of 50 to a few hundred pages, mostly public, somewhat time-sensitive. Here the two approaches are within a factor of two on every metric, and the choice depends on which side you can absorb the friction on more easily.

If you already have an API key for a server-side scraper, fifty pages takes about a minute of script-running and three to seven minutes of waiting. If you do not, the same fifty pages takes forty-five minutes to one hour of setup before the first page resolves, plus the waiting. The extension equivalent is "install once, then any time, paste URLs and click run." For occasional batches, the lifetime amortized friction of the extension is lower; for daily batches, the server-side amortizes more favorably once the integration exists.

The other thing the middle case rewards is the ability to switch. We use both approaches at BulkMD — extension for the bulk of our own daily workflows, server-side for scheduled refreshes of competitive intelligence corpora. There is no architectural reason to commit to one for everything, and the marginal cost of running both is low once each is set up.

A decision tree

Strip away the analysis and the choice reduces to a short tree.

You need the content from pages behind a login, a paywall, or session state: extension. There is no robust server-side path, because the server does not see those pages as you do; cookie injection can work but is fragile.

You need to ingest more pages than a person can babysit, on a recurring schedule, unattended and overnight: server-side. The extension architecture cannot do unattended scale.

Your corpus is under a hundred pages, public, and interactive (you would read most of these yourself): extension. The setup tax on a server-side API is not worth amortizing.

Your corpus is over a thousand pages, public, and you need it refreshed on a schedule: server-side. The extension cannot keep up at that volume.

Your corpus is mixed: run both. There is no rule against using a server-side scraper for the public 80% of your URLs and an extension for the auth-walled 20%. We do this regularly; the two outputs land in the same docs/.ai/ folder and the downstream consumer does not care which produced which.
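
The tree above can be written down as a small function. This is one reading of it rather than a definitive rule: the function name, the parameter names, and the "either" result for the band between a hundred and a thousand public pages are assumptions added for illustration.

```python
def choose_approach(
    total_pages: int,
    auth_walled_pages: int = 0,
    scheduled: bool = False,
) -> str:
    """Return "extension", "server-side", "both", or "either"."""
    if not 0 <= auth_walled_pages <= total_pages:
        raise ValueError("auth_walled_pages must be between 0 and total_pages")

    public_pages = total_pages - auth_walled_pages

    # Branch 1: everything sits behind a login, paywall, or session state.
    if auth_walled_pages and public_pages == 0:
        return "extension"

    # Branch 5: a mixed corpus is split between the two tools.
    if auth_walled_pages:
        return "both"

    # From here on the corpus is entirely public.
    # Branches 2 and 4: unattended, recurring, or high-volume work.
    if scheduled or total_pages > 1000:
        return "server-side"

    # Branch 3: small interactive batches.
    if total_pages < 100:
        return "extension"

    # Assumption: the crossover band has no clear winner.
    return "either"


if __name__ == "__main__":
    print(choose_approach(60))
    print(choose_approach(5000, scheduled=True))
    print(choose_approach(1000, auth_walled_pages=200))
```

Checking authentication first mirrors the order of the tree: whether the pages can be reached at all matters before volume or scheduling does.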

The cost worth naming explicitly

The four-to-twelve-dollars-per-thousand-pages server-side cost in the table above is real but is often not the largest cost in the comparison. Two costs that dominate in practice are integration time (the first time you set up Firecrawl or roll your own Playwright cluster, plan for a working day, not an hour) and rate-limit recovery (when a 403 storm hits a high-traffic source, you spend more on retries and backoff than on the original requests). Neither shows up on a vendor pricing page.
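
Rate-limit recovery usually means some form of retry with backoff. The sketch below shows one common variant, exponential backoff with full jitter, under stated assumptions: `RateLimited` and the `fetch` callable are hypothetical stand-ins for whatever client you use, and the delay parameters are illustrative defaults rather than recommendations for any particular site.

```python
import random
import time
from typing import Callable


class RateLimited(Exception):
    """Raised by the (hypothetical) fetcher when a site answers 403 or 429."""


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying rate-limited attempts with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            # The ceiling doubles each attempt: base, 2*base, 4*base, ...
            # capped at max_delay. The actual wait is random below it.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` in as a parameter keeps the function testable without real waiting. Each retry is another request, so a burst of 403s can multiply the request count for a single source.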

For the extension, the dominant cost is the user's wall-clock attention. You cannot start a 500-page extension job and walk away for eight hours; you need the browser open. For batches under a few hundred pages this is invisible; past that, the user-time cost becomes the binding constraint and pushes you toward server-side.

TL;DR

Server-side scrapers and browser extensions are not substitutes — they sit on opposite sides of an authentication boundary and have inverted strengths. Server-side wins on scale, unattended scheduling, and concurrent capacity. Extensions win on auth-walled content, small-batch interactivity, and privacy. For mixed corpora the right answer is to run both, with each handling the URLs it can actually reach. Neither is universally better; the question is always "which constraints bind your specific workflow?"

If most of your corpus is the kind of content a browser extension reaches more reliably — auth-walled docs, paywalled research, logged-in admin pages — BulkMD is the free Chrome extension that runs conversion entirely in your tab, with no API key and no server round-trip.

Frequently asked questions

Can a server-side scraper handle auth-walled pages by storing my cookies?

Some can, fragilely. Browserless and Playwright clusters support cookie injection, but you have to extract cookies from your browser and ship them to the server, which is a meaningful privacy and security trade. Cookies expire frequently, so the workflow needs constant maintenance. For occasional auth-walled access this is workable; as a default architecture it produces more breakage than it saves time.

Does Firecrawl (or similar) work well for documentation sites specifically?

Yes, on public documentation sites that have a clean sitemap and stable URL structure. Most modern docs frameworks (Docusaurus, Mintlify, Nextra) produce server-rendered HTML that server scrapers extract cleanly. The friction starts when docs sites use heavy client-side rendering or require auth (internal wikis), where the extension path is more reliable.

What about running headless Chrome on my own machine?

It's a middle path: you get extension-like auth access (you can re-use your browser profile) and server-like unattended scheduling (you can cron the script). The downside is that it consumes your local resources while running and you become responsible for the Playwright/Puppeteer lifecycle. We use this pattern for our own competitor-monitoring jobs where we need both auth and schedule.

Are there legal concerns with either approach?

Both approaches can violate a site's terms of service if you ignore robots.txt or hit pages you do not have authorization to access. The legal exposure is roughly proportional to the volume and recurrence of your traffic: one-off browser captures are very rarely actionable, while scheduled scraping of a thousand pages a day is a different conversation. Read the source's terms; respect robots.txt; obtain authorization for anything behind a login that is not yours.

How do I decide if a particular site allows scraping?

Read `/robots.txt` and the site's terms of service. If robots.txt allows access and the terms do not prohibit automated retrieval, you are typically in bounds for reasonable-volume non-commercial use. For commercial use, especially for competitive-intelligence purposes, talk to a lawyer rather than rely on a blog. The technical question (can I get the content?) and the legal question (am I allowed to?) are independent.
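
The robots.txt half of that check can be automated with Python's standard library. A minimal sketch, assuming you have already downloaded the robots.txt text; `allowed_by_robots` is an illustrative helper name, and the sample rules are made up. It says nothing about the terms of service, which still need reading.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check `url` against already-downloaded robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up rules for illustration only.
    sample = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(sample, "https://example.com/docs/intro"))
    print(allowed_by_robots(sample, "https://example.com/private/report"))
```

With the sample rules, the first URL is allowed and the second is disallowed.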

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.


Server Scrapers vs Browser Extensions: 2026 Tradeoffs