BulkMD

Bulk Export Web Pages to Markdown: a Repeatable Chrome Workflow

How to convert dozens of URLs to clean Markdown at once using a local Chrome extension — concurrency, retries, queue persistence, and the patterns that make it survive a service-worker restart.

M. H. Tawfik · 7 min read

Single-page Markdown conversion is a solved problem. The interesting engineering — and the workflow most people are actually missing — is bulk. You have a list of fifty URLs. You want fifty clean .md files (or one concatenated file) on disk, with retries, with the right rate-limiting, and without uploading anyone's authenticated pages to a third-party service.

This post walks through how we built that pipeline in BulkMD and the patterns worth lifting if you're writing your own. If you want the upstream "why Markdown at all" argument before the engineering, the LLM context guide and the token-cost breakdown cover that ground.

The shape of a good bulk pipeline

A working bulk converter needs to handle, in order:

  1. Parse the URL list. Comma, newline, or paste-from-Notion-mess. Validate each one is http(s) and not a duplicate.
  2. Drive the queue. Open tabs (or fetch in-worker), run the extractor, collect the result.
  3. Respect concurrency limits. Two or three parallel tabs is a sane default for most sites; some need one.
  4. Apply a per-page timeout. If a tab never reaches document_idle, you can't block the queue waiting on it forever.
  5. Survive a worker restart. Manifest V3 service workers go to sleep. The queue must be resumable.
  6. Surface progress. A user staring at a thirty-minute job needs running counts and the ability to abort.
  7. Export the result. Individual files, a single concatenated file with ## Source: <url> separators, or a ZIP.

Skip any one of those and the workflow degrades from "useful tool" to "kicks off a run, comes back, sees errors, gives up."
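Step 1 is the easiest to get subtly wrong, so here is a minimal sketch of it. The function name, the return shape, and the choice to treat #fragment variants as duplicates are illustrative assumptions, not BulkMD's actual parser. Splitting on commas also assumes the pasted URLs contain no literal commas.

```typescript
// Hypothetical helper: names and splitting rules are illustrative only.
interface ParsedUrls {
  valid: string[];    // normalized, de-duplicated http(s) URLs, in input order
  rejected: string[]; // tokens that did not parse or used another scheme
}

function parseUrlList(raw: string): ParsedUrls {
  const seen = new Set<string>();
  const valid: string[] = [];
  const rejected: string[] = [];

  // Split on commas and any whitespace (covers comma- and newline-separated pastes).
  for (const token of raw.split(/[\s,]+/)) {
    if (token === "") continue;

    let url: URL;
    try {
      url = new URL(token);
    } catch {
      rejected.push(token); // not an absolute URL
      continue;
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") {
      rejected.push(token); // e.g. ftp:, file:, chrome:
      continue;
    }

    url.hash = ""; // assumption: two fragments of one page count as one page
    const key = url.href;
    if (!seen.has(key)) {
      seen.add(key);
      valid.push(key);
    }
  }
  return { valid, rejected };
}
```

Returning the rejected tokens alongside the valid ones lets the dashboard show the user what was dropped instead of silently shrinking the list.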

Concurrency: pick a small number and stick to it

The temptation is to crank parallelism up. Don't. Real-world sites push back in three ways:

  • Cloudflare-style rate limiters issue a JS challenge that breaks the extractor.
  • News sites serve a meter wall after the third or fourth concurrent hit from one IP.
  • Single-page apps with shared global state (a search-results page that mutates window.history) can interfere with each other if you load two siblings in the same browsing context.

Two concurrent tabs is the safe default. Some users dial it down to one for high-friction sources (LinkedIn, X, anything with an aggressive bot wall) and up to four for stable docs sites. Make it configurable; don't pick a magic number.
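One way to enforce a configurable limit is a small pool of "lanes" that each pull the next item when they finish the previous one. This is a sketch under assumptions: runWithConcurrency and the Outcome type are hypothetical names, and the worker callback stands in for whatever opens a tab and runs the extractor.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit` in flight.
type Outcome<R> = { ok: true; value: R } | { ok: false; error: unknown };

async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<Outcome<R>[]> {
  const results: Outcome<R>[] = new Array(items.length);
  let next = 0;

  // Each lane processes one item at a time until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so two lanes never take the same index
      try {
        results[i] = { ok: true, value: await worker(items[i], i) };
      } catch (error) {
        results[i] = { ok: false, error }; // one failed page must not stop the run
      }
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

Because failures are captured per item rather than thrown, the caller gets a complete results array in input order and can re-run only the entries where ok is false.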

State that survives MV3 service-worker shutdowns

This is the gotcha that surprises every Chrome extension author. Manifest V3 replaced background pages with service workers, and Chrome aggressively terminates them after ~30 seconds of inactivity. If your queue lives in a module-level let queue = [...], the first time the worker sleeps mid-run, you lose everything.

Two patterns that actually work:

Persist every state mutation to chrome.storage.session

session storage (not local) is purpose-built for this. It survives worker restarts within a browser session, gets cleared on browser exit, and has no quota issues for the sizes a queue runs at.

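const STATE_KEY = "bulkmd_state";

// Tail of the update chain. Each call waits for the previous one, so the
// get/mutate/set sequence never interleaves within this worker instance.
let chain: Promise<unknown> = Promise.resolve();

function updateState(mutate: (s: State) => void): Promise<State> {
  const run = async (): Promise<State> => {
    const { [STATE_KEY]: current } = await chrome.storage.session.get(STATE_KEY);
    const next = structuredClone(current ?? initialState);
    mutate(next);
    await chrome.storage.session.set({ [STATE_KEY]: next });
    return next;
  };
  const result = chain.then(run);
  chain = result.catch(() => undefined); // a failed update must not block later ones
  return result;
}

Serializing the read-mutate-write is critical. A tab-update listener and the queue driver can both call updateState before either write lands. The get and the set are separate awaits, so unserialized calls would read the same snapshot and overwrite each other's results.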
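Persisting state is half of resumability. The other half is deciding what to do with pages that were mid-flight when the worker died. A minimal sketch of that recovery step follows, assuming a hypothetical QueueState shape (the real State type is not shown in this post) and assuming that late results from orphaned tabs are ignored or that those tabs are closed on restart.

```typescript
// Hypothetical state shape, for illustration only.
type PageStatus = "pending" | "running" | "done" | "error";

interface QueueItem {
  url: string;
  status: PageStatus;
  attempts: number;
}

interface QueueState {
  items: QueueItem[];
}

// On worker startup: anything still marked "running" lost its driver,
// so put it back in line. Returns a new object; the input is not mutated.
function recoverAfterRestart(state: QueueState): QueueState {
  return {
    ...state,
    items: state.items.map((item): QueueItem =>
      item.status === "running" ? { ...item, status: "pending" } : item,
    ),
  };
}

// The queue driver asks for the first item that still needs work.
function nextPending(state: QueueState): QueueItem | undefined {
  return state.items.find((item) => item.status === "pending");
}
```

Run the recovery once when the worker starts, write the result back through the same serialized update path, and the driver can continue as if nothing happened.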

Keep the worker awake while a UI is open

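If a long-running dashboard tab is open, you can keep the worker awake through a chrome.runtime.connect() port. The dashboard opens the port on mount and sends a small message over it at an interval shorter than the idle timeout. In recent Chrome versions it is the messages, not the open port by itself, that reset the worker's idle timer, and the worker needs an onConnect listener for the connection to be accepted at all.

// dashboard: open the port, then ping more often than the ~30 s idle timeout
const port = chrome.runtime.connect({ name: "dashboard-keepalive" });
setInterval(() => port.postMessage({ type: "ping" }), 20_000);

// service worker: register a listener so the connection is accepted;
// the pings need no handling, receiving them is what resets the idle timer
chrome.runtime.onConnect.addListener(() => {});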

The worker still gets killed when no UI is open, which is correct — there's nothing to keep alive for.

Per-page watchdog timers

Some pages never settle. A misconfigured analytics script keeps the load event pending; a slow third-party iframe blocks document_idle. Without a watchdog you'll wait the full extractor timeout — by default that's "until you close the tab."

A 45-second watchdog is a good upper bound for almost any article:

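function armTabTimer(tabId: number): ReturnType<typeof setTimeout> {
  const timer = setTimeout(() => {
    // The tab may already be gone; swallow the rejection from tabs.remove.
    chrome.tabs.remove(tabId).catch(() => {});
    markPageAsTimedOut(tabId);
  }, 45_000);
  // The caller must clearTimeout(timer) when the extraction result
  // arrives via chrome.runtime.onMessage.
  return timer;
}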

Tabs that time out get marked as errors and the queue rolls forward. The user sees them in the results table and can re-run only the failures.
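The "clear when extraction completes" half of the watchdog is easy to forget. One way to make it hard to forget is to wrap the extraction promise so the timer is always cleaned up. This is a generic sketch, not BulkMD's code: withTimeout and the onTimeout callback are hypothetical names, and the timer still lives in worker memory, so it only fires while the worker itself stays alive.

```typescript
// Hypothetical helper: rejects with a "TimeoutError" if `work` takes longer than `ms`.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  onTimeout: () => void = () => {},
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      onTimeout(); // e.g. close the stuck tab
      const err = new Error(`timed out after ${ms} ms`);
      err.name = "TimeoutError";
      reject(err);
    }, ms);
  });

  try {
    // Whichever settles first wins; the loser is ignored.
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // always disarm, whether we finished or timed out
  }
}
```

In the queue driver this would wrap the per-page extraction call with a 45_000 ms limit and an onTimeout that closes the tab, so a timed-out page surfaces as an ordinary rejected result.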

Result buffering

Bulk runs can collect hundreds of pages. If you keep every Markdown blob in memory and then build the export string in one pass, you'll OOM a long run.

Two mitigations:

  1. Cap the in-memory result buffer (BulkMD caps it at 500). Older results spill to disk via chrome.downloads progressively.
  2. Stream the export. Assemble the concatenated file as a Blob made of per-page chunks rather than one huge string, then hand it to chrome.downloads.download({ url: URL.createObjectURL(blob) }). Do this from the dashboard page: URL.createObjectURL is not available inside a service worker.
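The second mitigation can be sketched as a function that hands the Blob constructor many small parts instead of one concatenated string. The PageResult shape and the buildExportBlob name are assumptions for illustration; the ## Source: <url> separator is the one described earlier in this post.

```typescript
// Hypothetical result shape; BulkMD's real types are not shown in this post.
interface PageResult {
  url: string;
  markdown: string;
}

// Assemble the concatenated export as a Blob made of many small parts,
// so no single giant string is ever built in JavaScript.
function buildExportBlob(results: readonly PageResult[]): Blob {
  const parts: string[] = [];
  for (const { url, markdown } of results) {
    parts.push(
      `## Source: ${url}\n\n`,
      markdown.replace(/\s+$/, ""), // drop trailing whitespace so spacing stays uniform
      "\n\n",
    );
  }
  return new Blob(parts, { type: "text/markdown" });
}
```

From the dashboard page, the resulting Blob can be passed through URL.createObjectURL and then to chrome.downloads.download along with a filename of your choosing. Revoke the object URL once the download has finished.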

What "good" looks like from the user's seat

The whole point of bulk export is that the user does the human-judgement work — picking the URLs — and the machine does everything else. The interaction goal is:

  • Paste fifty URLs, click Start.
  • Glance at progress occasionally; abort if needed.
  • When it's done, get one file (or fifty) that's ready to drop into Obsidian, Notion, a RAG ingest folder, or directly into an LLM prompt.

If your pipeline needs the user to babysit it — restart on a worker crash, re-paste failed URLs, manually concatenate the outputs — it's not really bulk export. It's "single conversion in a loop with extra steps."

Try it

BulkMD's bulk dashboard ships every pattern above: persistent queue, configurable concurrency, 45-second watchdogs, retry-on-failure, and exports as either one-file-per-URL or a single concatenated .md. The extension is free on the Chrome Web Store and processes everything locally — your URL list never leaves the browser.

Frequently asked questions

Why not just script Pandoc / wget for this?

Two reasons. First, anonymous server-side fetching gets blocked on Cloudflare, on paywalls, and on any site requiring auth — which is most useful sources. Second, you re-implement Readability yourself, badly, or accept boilerplate-heavy output. Running inside the user's authenticated browser solves both problems for free.

How many tabs in parallel is safe?

Two as a default. Four on well-behaved docs sites. One on aggressive bot walls (LinkedIn, X, news sites with meter walls). The right number is a property of the site, not the extension, so it has to be configurable. We expose a slider in the dashboard rather than picking a magic number.

What happens to the queue if Chrome quits during a run?

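Not from session storage alone. chrome.storage.session is cleared when the browser exits, so it protects a run against worker restarts but not against a full quit. Resuming after a quit needs the queue mirrored to chrome.storage.local as well; with state persisted after every mutation, the worst-case loss is then one page in flight. The dashboard offers to resume the run on reopen, and finished pages are already exported to disk by that point.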

Can the bulk dashboard handle pages behind a login?

Yes — that's the main reason it runs in-browser. The tab opens with your cookies, your subscription state, your SSO session. The Readability + Turndown pass runs against the authenticated DOM, so anything you can read in a regular tab, the bulk runner can convert.

What's the largest run you've tested?

About 500 URLs in a single batch — capped at that buffer size for memory reasons. For larger sets we recommend splitting into batches of 200–300 and concatenating output files. The patterns above scale further, but we don't yet have data we'd publish for 1,000+ runs.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.

Tagged: Bulk export, Chrome extension, Manifest V3, Markdown