Single-page Markdown conversion is a solved problem. The interesting engineering — and the workflow most people are actually missing — is bulk. You have a list of fifty URLs. You want fifty clean .md files (or one concatenated file) on disk, with retries, with the right rate-limiting, and without uploading anyone's authenticated pages to a third-party service.
This post walks through how we built that pipeline in BulkMD and the patterns worth lifting if you're writing your own. If you want the upstream "why Markdown at all" argument before the engineering, the LLM context guide and the token-cost breakdown cover that ground.
The shape of a good bulk pipeline
A working bulk converter needs to handle, in order:
- Parse the URL list. Comma, newline, or paste-from-Notion-mess. Validate each one is http(s) and not a duplicate.
- Drive the queue. Open tabs (or fetch in-worker), run the extractor, collect the result.
- Respect concurrency limits. Two or three parallel tabs is a sane default for most sites; some need one.
- Apply a per-page timeout. If a tab never reaches document_idle, you can't block the queue waiting on it forever.
- Survive a worker restart. Manifest V3 service workers go to sleep. The queue must be resumable.
- Surface progress. A user staring at a thirty-minute job needs running counts and the ability to abort.
- Export the result. Individual files, a single concatenated file with ## Source: <url> separators, or a ZIP.
Skip any one of those and the workflow degrades from "useful tool" to "kicks off a run, comes back, sees errors, gives up."
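The first step in that list, parsing and validating the pasted URLs, might look like the sketch below. The names parseUrlList and ParsedList are hypothetical and not taken from BulkMD. It assumes that two URLs differing only in their #fragment count as duplicates, and that no URL in the list contains a literal comma.

```typescript
// Hypothetical helper: split pasted text into unique, valid http(s) URLs.
interface ParsedList {
  urls: string[];     // valid, de-duplicated, in first-seen order
  rejected: string[]; // entries that were not parseable http(s) URLs
}

function parseUrlList(raw: string): ParsedList {
  const seen = new Set<string>();
  const urls: string[] = [];
  const rejected: string[] = [];

  // Split on commas and any whitespace (covers newlines and messy pastes).
  for (const token of raw.split(/[\s,]+/)) {
    if (token === "") continue;
    let parsed: URL;
    try {
      parsed = new URL(token);
    } catch {
      rejected.push(token);
      continue;
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      rejected.push(token);
      continue;
    }
    parsed.hash = ""; // assumption: fragments do not change the fetched page
    const key = parsed.href; // normalized form doubles as the dedupe key
    if (!seen.has(key)) {
      seen.add(key);
      urls.push(key);
    }
  }
  return { urls, rejected };
}
```

Returning the rejected entries separately lets the UI show the user what was dropped instead of silently shrinking the list.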
Concurrency: pick a small number and stick to it
The temptation is to crank parallelism up. Don't. Real-world sites push back in three ways:
- Cloudflare-style rate limiters issue a JS challenge that breaks the extractor.
- News sites serve a meter wall after the third or fourth concurrent hit from one IP.
- Single-page apps with shared global state (a search-results page that mutates window.history) can interfere with each other if you load two siblings in the same browsing context.
Two concurrent tabs is the safe default. Some users dial it down to one for high-friction sources (LinkedIn, X, anything with an aggressive bot wall) and up to four for stable docs sites. Make it configurable; don't pick a magic number.
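A configurable limit can be enforced with a small worker pool. The following is a minimal sketch and not BulkMD's actual queue driver: runWithConcurrency is a hypothetical name, and task stands in for whatever opens a tab and resolves with the extracted Markdown.

```typescript
// Hypothetical worker pool: at most `limit` tasks run at the same time.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let nextIndex = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (nextIndex < items.length) {
      const i = nextIndex++;
      try {
        results[i] = { status: "fulfilled", value: await task(items[i]) };
      } catch (reason) {
        // A failed page is recorded, not rethrown, so the lane keeps going.
        results[i] = { status: "rejected", reason };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

Because a lane only claims a new URL after its previous one settles, the limit holds without any explicit semaphore, and results come back in input order regardless of which page finished first.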
State that survives MV3 service-worker shutdowns
This is the gotcha that surprises every Chrome extension author. Manifest V3 replaced background pages with service workers, and Chrome aggressively terminates them after ~30 seconds of inactivity. If your queue lives in a module-level let queue = [...], the first time the worker sleeps mid-run, you lose everything.
Two patterns that actually work:
Persist every state mutation to chrome.storage.session
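chrome.storage.session (not local) is purpose-built for this. It is held in memory, survives worker restarts within a browser session, and is cleared on browser exit. Its quota (about 10 MB in current Chrome) is comfortable for queue state, though large Markdown bodies count against it, so keep an eye on payload size.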
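const STATE_KEY = "bulkmd_state";
// State and initialState are assumed to be defined elsewhere in the extension.
// Calls must not overlap: each await below is a point where another caller can interleave.
async function updateState(mutate: (s: State) => void): Promise<State> {
  const { [STATE_KEY]: current } = await chrome.storage.session.get(STATE_KEY);
  const next = structuredClone(current ?? initialState);
  mutate(next);
  await chrome.storage.session.set({ [STATE_KEY]: next });
  return next;
}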
Atomic read-mutate-write is critical. A tab-update listener and the queue driver can both fire in the same tick; without serialization you'll lose results.
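One way to get that serialization is to chain every update onto the previous one, so a read never starts until the prior write has finished. The sketch below is an illustration under assumptions, not BulkMD's code: createStateUpdater, KeyValueStore, memoryStore, and the State shape are all hypothetical. The in-memory store stands in for chrome.storage.session so the example runs outside an extension; in a real worker you would pass a thin wrapper around the storage API instead.

```typescript
// Minimal shape shared by a storage wrapper and the in-memory stand-in below.
interface KeyValueStore {
  get(key: string): Promise<Record<string, unknown>>;
  set(items: Record<string, unknown>): Promise<void>;
}

// Hypothetical state shape for illustration.
interface State {
  done: string[];
}
const initialState: State = { done: [] };
const STATE_KEY = "bulkmd_state";

// Returns an updateState() whose calls run strictly one after another.
function createStateUpdater(store: KeyValueStore) {
  let tail: Promise<unknown> = Promise.resolve();

  return function updateState(mutate: (s: State) => void): Promise<State> {
    const run = tail.then(async () => {
      const { [STATE_KEY]: current } = await store.get(STATE_KEY);
      const next = structuredClone((current as State | undefined) ?? initialState);
      mutate(next);
      await store.set({ [STATE_KEY]: next });
      return next;
    });
    // Keep the chain usable even if this update throws.
    tail = run.catch(() => undefined);
    return run;
  };
}

// In-memory stand-in so the sketch runs outside an extension.
function memoryStore(): KeyValueStore {
  const data = new Map<string, unknown>();
  return {
    async get(key) {
      return data.has(key) ? { [key]: structuredClone(data.get(key)) } : {};
    },
    async set(items) {
      for (const [k, v] of Object.entries(items)) data.set(k, structuredClone(v));
    },
  };
}
```

The chain itself lives in worker memory and is lost on a restart, but so are any in-flight updates, so nothing is left half-applied: the next call after a restart simply starts from the last value written to storage.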
Keep the worker awake while a UI is open
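If a long-running dashboard tab is visible, you can keep the worker awake through a chrome.runtime.connect() port. The dashboard opens the port on mount and sends a small message on it every 20 seconds or so. In current Chrome (114 and later, per the extension service worker lifecycle documentation), an open port no longer resets the idle timer by itself, but each message sent over it does.
// dashboard: open a port and ping on it
const port = chrome.runtime.connect({ name: "dashboard-keepalive" });
setInterval(() => port.postMessage({ type: "ping" }), 20_000);
// service worker: accept the port; the pings need no reply
chrome.runtime.onConnect.addListener((p) => p.onMessage.addListener(() => {}));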
The worker still gets killed when no UI is open, which is correct — there's nothing to keep alive for.
Per-page watchdog timers
Some pages never settle. A misconfigured analytics script keeps the load event pending; a slow third-party iframe blocks document_idle. Without a watchdog you'll wait the full extractor timeout — by default that's "until you close the tab."
A 45-second watchdog is a good upper bound for almost any article:
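function armTabTimer(tabId: number): ReturnType<typeof setTimeout> {
  const timer = setTimeout(() => {
    // The tab may already be gone; ignore the rejection in that case.
    chrome.tabs.remove(tabId).catch(() => {});
    markPageAsTimedOut(tabId);
  }, 45_000);
  // The caller clears this with clearTimeout() when extraction completes
  // (signalled via chrome.runtime.onMessage).
  return timer;
}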
Tabs that time out get marked as errors and the queue rolls forward. The user sees them in the results table and can re-run only the failures.
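A timer that has to be cleared by hand is easy to leak. As an alternative shape for the same watchdog, the timeout can wrap the extraction promise so that clearing happens automatically on either outcome. This is a sketch under assumptions: withTimeout and TimeoutError are hypothetical names, and onTimeout is where a tab-closing call would go.

```typescript
// Hypothetical error type so callers can tell timeouts from extractor failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  onTimeout?: () => void,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      onTimeout?.(); // e.g. close the stuck tab
      reject(new TimeoutError(ms));
    }, ms);
    // Whichever way the work settles, the timer is cleared first.
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

With this shape, the queue driver can treat a rejection whose name is "TimeoutError" as a retryable failure and anything else as an extractor error.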
Result buffering
Bulk runs can collect hundreds of pages. If you keep every Markdown blob in memory and then build the export string in one pass, you'll OOM a long run.
Two mitigations:
- Cap the in-memory result buffer (BulkMD caps it at 500). Older results spill to disk via chrome.downloads progressively.
- Stream the export. Build the concatenated file by writing chunks to a Blob and calling chrome.downloads.download({ url: URL.createObjectURL(blob) }), rather than building a single huge string. Do this from the dashboard page: URL.createObjectURL is not available inside a service worker.
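The chunked export can be sketched as a function that returns the pieces of the concatenated file rather than the joined string. The names exportParts and PageResult are hypothetical; the ## Source: <url> separator follows the format described earlier in this post.

```typescript
// Hypothetical result shape: one entry per converted page.
interface PageResult {
  url: string;
  markdown: string;
}

// Returns the concatenated export as a list of parts, never as one joined string.
function exportParts(results: readonly PageResult[]): string[] {
  const parts: string[] = [];
  for (const { url, markdown } of results) {
    if (parts.length > 0) parts.push("\n\n"); // blank line between pages
    parts.push(`## Source: ${url}\n\n`, markdown.trimEnd());
  }
  if (parts.length > 0) parts.push("\n"); // single trailing newline
  return parts;
}
```

In the dashboard page, the parts can be handed straight to new Blob(parts, { type: "text/markdown" }), and the resulting object URL passed to chrome.downloads.download. Remember to call URL.revokeObjectURL once the download has started so the Blob can be released.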
What "good" looks like from the user's seat
The whole point of bulk export is that the user does the human-judgement work — picking the URLs — and the machine does everything else. The interaction goal is:
- Paste fifty URLs, click Start.
- Glance at progress occasionally; abort if needed.
- When it's done, get one file (or fifty) that's ready to drop into Obsidian, Notion, a RAG ingest folder, or directly into an LLM prompt.
If your pipeline needs the user to babysit it — restart on a worker crash, re-paste failed URLs, manually concatenate the outputs — it's not really bulk export. It's "single conversion in a loop with extra steps."
Try it
BulkMD's bulk dashboard ships every pattern above: persistent queue, configurable concurrency, 45-second watchdogs, retry-on-failure, and exports as either one-file-per-URL or a single concatenated .md. The extension is free on the Chrome Web Store and processes everything locally — your URL list never leaves the browser.
Frequently asked questions
Why not just script Pandoc / wget for this?
How many tabs in parallel is safe?
What happens to the queue if Chrome quits during a run?
Can the bulk dashboard handle pages behind a login?
What's the largest run you've tested?
About the author
Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.