BulkMD

Extract Links From a Web Page, Then Bulk-Convert Them

How to extract every link on a web page, or just the links in one section: auto-detect them, review the list, remove the noise, and bulk-convert the result to clean Markdown for LLM context.

M. H. Tawfik · 12 min read

You found the page that lists everything you need — a documentation index, a release-notes archive, a category page of articles — and you want all of it as clean Markdown for an LLM. The slow way is to open each link, convert it, and repeat forty times. The fast-but-wrong way is to grab every link on the page and discover your corpus is half navigation. This post is about the middle path: how to extract links from a web page precisely, scoped to the section you actually care about, then bulk-convert that exact set into Markdown an agent can read.

We will cover why a whole-page link dump is the wrong default, how section-scoped detection works (point, auto-detect, review, remove), the URL-normalization rules that decide which links survive, and how the surviving set flows into a parallel bulk converter. The tool we use to make this a few clicks rather than a script is BulkMD, but the reasoning applies to any link-harvesting workflow.

Run "get all links on a page" against a real documentation site and count what comes back. A typical docs page carries a global navigation bar, a version switcher, a left sidebar listing every other page in the docs, a right-hand "on this page" anchor list, a footer with company and legal links, and a "related articles" rail — and only then the handful of links inside the actual content. The whole-page set is dominated by chrome that repeats on every page in the site.

That matters because the next step is conversion, and conversion costs tokens and time. If you feed a bulk converter 400 URLs when 40 are real, you open 360 unnecessary tabs, convert 360 pages of duplicated navigation, and then have to dedupe a corpus that is mostly the same sidebar forty times over. The noise also degrades retrieval later: a RAG index full of near-identical navigation chunks ranks them against your queries, crowding out the content you indexed it for. We unpack that ingestion problem in packaging a web corpus for AI agents; the cheapest fix is upstream, by never collecting the noise in the first place.

The goal, then, is not "every link" — it is "every link in this part of the page." That is a selection problem, and the right place to solve it is visually, on the rendered page, where you can see exactly which region holds the links you want.

Section-scoped detection turns link harvesting into four steps: point at a region, let the tool auto-detect the links inside it, review the list, and remove anything you don't want before converting.

In BulkMD this runs as an in-page overlay rather than a popup, because a browser popup closes the moment you click the page, and pointing at a section means clicking the page. Clicking Detect links injects a heads-up display that asks a single question: scan the whole page, or pick a specific section? Choosing "pick a section" enters a picker where hovering highlights the element under your cursor and shows a live count of the links inside it. Pressing the up arrow widens the selection to the parent element, which is useful when the links you want fill a container one level above the element you are hovering, and a click locks the choice. Escape backs out.
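The picker behaviour described above can be modelled without a browser. What follows is a minimal sketch, not BulkMD's code: `node`, `countLinks`, and `createPicker` are hypothetical names, and the nodes are plain objects standing in for DOM elements. In a real page the parent would be `parentElement` and the count would come from `querySelectorAll("a[href]")`.

```javascript
// Hypothetical stand-in for a DOM element: a name, its own links, and children.
function node(name, ownLinks = 0, children = []) {
  const n = { name, ownLinks, children, parent: null };
  for (const c of children) c.parent = n;
  return n;
}

// Count links inside a node, including all of its descendants.
function countLinks(n) {
  return n.children.reduce((sum, c) => sum + countLinks(c), n.ownLinks);
}

// Picker state: hover sets the candidate, widen climbs to the parent,
// lock freezes the choice, cancel (Escape) clears it.
function createPicker() {
  let current = null;
  let locked = false;
  return {
    hover(n) { if (!locked) current = n; return this.status(); },
    widen() {
      if (!locked && current && current.parent) current = current.parent;
      return this.status();
    },
    lock() { if (current) locked = true; return this.status(); },
    cancel() { current = null; locked = false; return this.status(); },
    status() {
      return {
        target: current ? current.name : null,
        links: current ? countLinks(current) : 0,
        locked,
      };
    },
  };
}

// Usage: hover a single list item, widen to the list, then lock.
const item = node("li", 1);
const list = node("ul", 0, [item, node("li", 1), node("li", 1)]);
const sidebar = node("nav", 1, [list]);
const picker = createPicker();
const hovered = picker.hover(item);
const widened = picker.widen();
const chosen = picker.lock();
```

Once locked, further hover or widen calls leave the target unchanged in this sketch, which mirrors the "click locks the choice" behaviour.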

Once a region is chosen, the overlay lists every link it found, with a running "N of M selected" count, a checkbox per row, a select-all/none control, and a per-row remove button. This review step is the point of the whole exercise: auto-detection gets you 95% of the way, and the last 5% — dropping the one "edit this page on GitHub" link, or the three external references that snuck into a sidebar — is a few clicks instead of a regex. When the list is right, one button hands the selected URLs to the bulk converter.
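The review list is a small piece of state: a row per URL, a selected flag, and a count derived from both. The sketch below is an illustration under assumptions, not the extension's implementation. `createReview` and its methods are hypothetical names, and it assumes every detected link starts out selected.

```javascript
// Hypothetical review model: each row is a URL plus a "selected" flag.
function createReview(urls) {
  let rows = urls.map((url) => ({ url, selected: true }));
  return {
    // Flip the checkbox on one row.
    toggle(url) {
      rows = rows.map((r) => (r.url === url ? { ...r, selected: !r.selected } : r));
    },
    // Drop a row from the list entirely.
    remove(url) {
      rows = rows.filter((r) => r.url !== url);
    },
    // Select-all (true) or select-none (false).
    selectAll(flag) {
      rows = rows.map((r) => ({ ...r, selected: flag }));
    },
    // The running "N of M selected" label.
    summary() {
      const n = rows.filter((r) => r.selected).length;
      return `${n} of ${rows.length} selected`;
    },
    // What would be handed to the bulk converter.
    selectedUrls() {
      return rows.filter((r) => r.selected).map((r) => r.url);
    },
  };
}

// Usage: drop the "edit on GitHub" link, then uncheck one page.
const review = createReview([
  "https://example.com/docs/install",
  "https://example.com/docs/usage",
  "https://github.com/example/repo/edit/main/docs/index.md",
]);
review.remove("https://github.com/example/repo/edit/main/docs/index.md");
review.toggle("https://example.com/docs/usage");
const label = review.summary(); // "1 of 2 selected"
const toConvert = review.selectedUrls();
```

Keeping "removed" and "unchecked" as separate operations matches the two controls described above: a removed row no longer counts toward M, while an unchecked row only lowers N.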

Auto-detection is only useful if it is predictable, so it helps to know exactly which links a section yields. Every anchor in the chosen region is run through the same normalization, and the rules are deliberately strict about what counts as a navigable page.

| Link pattern | Kept? | Why |
| --- | --- | --- |
| https://site/docs/guide | Yes | Absolute http(s) URL; a real page |
| /docs/guide (relative) | Yes | Resolved against the page origin to an absolute URL |
| https://other-site.com/x | Only if same-site filter is off | External links are dropped by default to keep the corpus on-topic |
| #section or empty href | No | In-page anchor or no target; not a page |
| javascript:, mailto:, tel: | No | Not navigable content |
| Duplicate of one already found | No | Deduplicated by normalized URL |
| The page you're currently on | No | Self-link, nothing new to convert |

The same-site filter is the highest-leverage of these, and it is on by default. With it on, pointing at a documentation sidebar gives you the documentation pages and nothing else — no Twitter link, no status-page link, no "powered by" partner URL. Toggle it off only when you genuinely want to follow external references, for example collecting every source a research page cites.

Here is the core of the normalization, lightly trimmed from the actual implementation. It resolves relative URLs, drops non-navigable schemes, deduplicates, and respects the same-site toggle:

```javascript
function collectLinks(root, sameSiteOnly) {
  const origin = location.origin;
  const current = location.href.split("#")[0];
  const seen = new Set();
  const out = [];
  for (const a of root.querySelectorAll("a[href]")) {
    const raw = a.getAttribute("href");
    if (!raw || /^\s*(javascript:|mailto:|tel:|#)/i.test(raw)) continue;
    let abs;
    try {
      abs = new URL(raw, location.href).href; // resolve relative -> absolute
    } catch (_) {
      continue;
    }
    if (!/^https?:\/\//i.test(abs)) continue;
    const norm = abs.split("#")[0];           // strip the fragment
    if (norm === current) continue;           // skip the current page
    if (sameSiteOnly && new URL(norm).origin !== origin) continue;
    if (seen.has(norm)) continue;             // dedupe
    seen.add(norm);
    out.push(norm);
  }
  return out;
}
```

Two details earn their place. Stripping the #fragment before deduplication means /guide#install and /guide#usage collapse to a single /guide — you want the page once, not once per anchor. And resolving every href against location.href means a sidebar full of relative links like ./reference becomes a list of absolute, convertible URLs without any manual prefixing.
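To see both details on concrete input, here is a DOM-free variant of the same rules that runs outside a browser. It is a sketch for illustration: `normalizeHrefs` is a hypothetical helper that takes raw href strings and the page URL as arguments instead of reading `location`, and the `example.com` URLs are made up. As in the listing above, "same site" is tested by comparing origins.

```javascript
// DOM-free variant: raw href strings in, normalized absolute URLs out.
function normalizeHrefs(hrefs, pageUrl, sameSiteOnly = true) {
  const page = new URL(pageUrl);
  const current = page.href.split("#")[0];
  const seen = new Set();
  const out = [];
  for (const raw of hrefs) {
    // Skip empty hrefs, in-page anchors, and non-navigable schemes.
    if (!raw || /^\s*(javascript:|mailto:|tel:|#)/i.test(raw)) continue;
    let abs;
    try {
      abs = new URL(raw, page.href); // resolve relative -> absolute
    } catch (_) {
      continue;
    }
    if (abs.protocol !== "http:" && abs.protocol !== "https:") continue;
    const norm = abs.href.split("#")[0]; // strip the fragment
    if (norm === current) continue; // skip the current page
    if (sameSiteOnly && abs.origin !== page.origin) continue;
    if (seen.has(norm)) continue; // dedupe
    seen.add(norm);
    out.push(norm);
  }
  return out;
}

const hrefs = [
  "/guide#install",
  "/guide#usage",
  "./reference",
  "#top",
  "mailto:team@example.com",
  "https://other-site.example/x",
  "https://docs.example.com/docs/index",
];
const page = "https://docs.example.com/docs/index";

const kept = normalizeHrefs(hrefs, page);
// kept: ["https://docs.example.com/guide", "https://docs.example.com/docs/reference"]

const withExternal = normalizeHrefs(hrefs, page, false);
// withExternal additionally ends with "https://other-site.example/x"
```

Seven raw hrefs reduce to two pages with the filter on: the two `/guide` anchors collapse into one URL, `./reference` resolves relative to the page path, and the anchor, mailto, external, and self links are dropped.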

What section-scoping actually buys you

The win from section-scoping is a clean corpus instead of a noisy one — the difference between feeding an agent forty documentation pages and feeding it the same sidebar forty times.

Concretely: collect the in-content links from a docs section, and you get one URL per real page. Convert those to Markdown and a typical documentation page lands around 600–900 tokens of clean content — so a forty-page section is roughly 25,000–35,000 tokens of corpus, sized to drop into a long-context window or a RAG index. Run the whole-page link set instead and you would convert hundreds of pages, most of them duplicated navigation, then spend the savings deduplicating. The token math behind those per-page figures — and why Markdown lands 60–80% below the source HTML — is worked through in cut LLM token costs with clean Markdown.
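The corpus-size figure is simple multiplication, and it is worth being able to redo for your own section. A back-of-envelope sketch, where `estimateCorpusTokens` is a hypothetical helper and the per-page range is the one quoted above:

```javascript
// Rough corpus size: page count times the per-page token range.
function estimateCorpusTokens(pageCount, minPerPage, maxPerPage) {
  return { low: pageCount * minPerPage, high: pageCount * maxPerPage };
}

const section = estimateCorpusTokens(40, 600, 900);
// section.low is 24000 and section.high is 36000
```

Forty pages at 600 to 900 tokens each gives 24,000 to 36,000 tokens, which brackets the rough figure quoted above.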

Section-scoped link detection is the difference between handing an agent a clean forty-page documentation set and handing it four hundred links of navigation, footers, and "related posts" — the same conversion either way, but one corpus is signal and the other is mostly chrome.

The other quiet benefit is that detection runs on the rendered page. A site that builds its link list with JavaScript — an infinite-scroll archive, a framework-driven docs sidebar — has those links in the live DOM by the time you point at them, even though a plain server-side fetch of the URL would see an empty shell. That is the same rendering advantage we make the full case for in server scrapers versus browser extensions: harvesting links where the page is already rendered avoids a class of failures a crawler hits head-on.

A reviewed list of URLs is only half the job; the point is the Markdown. The collected links flow straight into the bulk converter, which is built to process a list politely and in parallel.

The bulk engine opens up to 10 tabs at a time — a hard cap, for the host site's sake as much as your machine's — waits a configurable delay between pages, converts each rendered page to Markdown locally, and retains around 500 results per batch. Every conversion runs the same Readability-plus-Turndown pipeline the single-page action uses, so a forty-page section comes out as forty clean Markdown documents, not forty walls of HTML. From there you can download them as a ZIP, or as an agent bundle — one file per page plus an index.md and a manifest.json that lists every page, its title, source URL, and token count — which is the shape an AI agent or RAG loader ingests without extra glue. That packaging step, and why agents prefer it, is the subject of building a personal RAG pipeline.
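The "bounded parallelism plus a delay" pattern can be sketched in a few lines. This is an assumption-laden illustration rather than BulkMD's engine: `runBatch`, `MAX_TABS`, and `fakeConvert` are hypothetical names, the converter is stubbed instead of opening tabs, the delay is applied per worker after each page (one possible reading of "between pages"), and the roughly 500-result retention limit is not modelled.

```javascript
// Hypothetical batch runner: at most `concurrency` conversions in flight,
// capped at MAX_TABS, with a pause after each page before the next one.
const MAX_TABS = 10; // the hard cap quoted in the text

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runBatch(urls, convert, { concurrency = MAX_TABS, delayMs = 0 } = {}) {
  const limit = Math.max(1, Math.min(concurrency, MAX_TABS));
  const results = new Array(urls.length);
  let next = 0;

  async function worker() {
    while (next < urls.length) {
      const i = next++; // claim an index; no await between check and claim
      try {
        results[i] = { url: urls[i], ok: true, markdown: await convert(urls[i]) };
      } catch (err) {
        // One failed page should not abort the rest of the batch.
        results[i] = { url: urls[i], ok: false, error: String(err) };
      }
      if (delayMs > 0) await sleep(delayMs);
    }
  }

  const workers = Array.from({ length: Math.min(limit, urls.length) }, () => worker());
  await Promise.all(workers);
  return results; // same order as the input list
}

// Usage with a stub converter that records peak parallelism.
let inFlight = 0;
let peak = 0;
async function fakeConvert(url) {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await sleep(5);
  inFlight--;
  return `# ${url}`;
}

const urls = Array.from({ length: 25 }, (_, i) => `https://example.com/docs/page-${i}`);
const done = runBatch(urls, fakeConvert, { concurrency: 50, delayMs: 1 });
```

Even though the caller asks for 50 workers here, the runner clamps to the cap, so the stub never sees more than ten conversions in flight. Results are written by index, so the output order matches the reviewed list regardless of which page finishes first.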

The whole path, end to end, is: point at the section, let detection collect the links, uncheck the few you don't want, send them to bulk, and download the Markdown bundle. No script, no regex, no opening forty tabs by hand — and because every step runs locally in the browser, no page content leaves your machine during conversion.

TL;DR

To extract links from a web page for an LLM corpus, don't grab every link — scope to the section that holds them. Point at the region, let detection auto-collect the links inside it (relative URLs resolved, fragments stripped, duplicates and external links dropped by default), review the list and remove the stragglers, then hand the clean set to a bulk converter. The result is one Markdown document per real page instead of the same navigation repeated hundreds of times, sized at roughly 600–900 tokens per docs page and ready for a long-context window or a RAG index. The actionable next step is to stop collecting whole-page link dumps and start scoping detection to the content region — then convert the reviewed set in bulk.

To do it without writing a crawler, install BulkMD free from the Chrome Web Store, open the page that lists what you need, and click Detect links.

Frequently asked questions

How do I extract only the links inside one part of a page?

Use a section-scoped picker instead of a whole-page scan. In BulkMD, click Detect links, choose 'pick a specific section', hover the region you want (press the up arrow to widen to the parent container), and click to lock it. The tool collects only the links inside that element, which you can then review and convert.

Will detection include navigation, footer, and external links?

Scoping to a content section already excludes the global nav and footer, since those live outside the region you pick. External links are dropped by default via a same-site filter you can toggle off, and in-page anchors, mailto/tel links, duplicates, and the current page are always skipped during normalization.

Does it work on sites that load their links with JavaScript?

Yes. Detection reads the rendered DOM in your browser, so links added by JavaScript — an infinite-scroll archive, a framework-driven sidebar — are present by the time you point at them. A plain server-side fetch of the same URL would often see an empty shell and miss them.

How many links can I convert at once?

The collected set feeds a bulk converter that opens up to 10 tabs in parallel, waits a configurable delay between pages to stay polite, and retains around 500 results per batch. Each page is converted to clean Markdown locally and can be downloaded as a ZIP or as an agent bundle with an index and manifest.

Can I remove specific links before converting?

Yes — that's the review step. After detection, every link appears in a list with a per-row checkbox and remove button plus a select-all/none control and a live count. Uncheck or remove anything you don't want, and only the remaining selected URLs are sent to the converter.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.

Tagged: Bulk export, Web scraping, Markdown, Chrome extension, RAG