You found the page that lists everything you need — a documentation index, a release-notes archive, a category page of articles — and you want all of it as clean Markdown for an LLM. The slow way is to open each link, convert it, and repeat forty times. The fast-but-wrong way is to grab every link on the page and discover your corpus is half navigation. This post is about the middle path: how to extract links from a web page precisely, scoped to the section you actually care about, then bulk-convert that exact set into Markdown an agent can read.
We will cover why a whole-page link dump is the wrong default, how section-scoped detection works (point, auto-detect, review, remove), the URL-normalization rules that decide which links survive, and how the surviving set flows into a parallel bulk converter. The tool we use to make this a few clicks rather than a script is BulkMD, but the reasoning applies to any link-harvesting workflow.
Why "all links on the page" is the wrong default
Run "get all links on a page" against a real documentation site and count what comes back. A typical docs page carries a global navigation bar, a version switcher, a left sidebar listing every other page in the docs, a right-hand "on this page" anchor list, a footer with company and legal links, and a "related articles" rail — and only then the handful of links inside the actual content. The whole-page set is dominated by chrome that repeats on every page in the site.
That matters because the next step is conversion, and conversion costs tokens and time. If you feed a bulk converter 400 URLs when 40 are real, you open 360 unnecessary tabs, convert 360 pages of duplicated navigation, and then have to dedupe a corpus that is mostly the same sidebar repeated over and over. The noise also degrades retrieval later: a RAG index full of near-identical navigation chunks ranks them against your queries, crowding out the content you built the index for. We unpack that ingestion problem in packaging a web corpus for AI agents; the cheapest fix is upstream, by never collecting the noise in the first place.
The goal, then, is not "every link" — it is "every link in this part of the page." That is a selection problem, and the right place to solve it is visually, on the rendered page, where you can see exactly which region holds the links you want.
How section-scoped link detection works
Section-scoped detection turns link harvesting into four steps: point at a region, let the tool auto-detect the links inside it, review the list, and remove anything you don't want before converting.
In BulkMD this runs as an in-page overlay rather than a popup, because a browser popup closes the moment you click the page, and pointing at a section means clicking the page. Clicking Detect links injects a heads-up display that asks a single question: scan the whole page, or pick a specific section? Choosing pick a section enters a picker where hovering highlights the element under your cursor and shows a live count of the links inside it. Pressing the up arrow widens the selection to the parent element, which is useful when the links you want sit in a container one level above the element you are hovering, and a click locks the choice. Escape backs out.
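The picker interaction described above can be modelled as a small state machine. The following is a minimal sketch, not BulkMD's actual code: the names `pickerStep`, `linkCount`, and `initialPickerState` are hypothetical, and an "element" is assumed to be anything with `parentElement` and `querySelectorAll`, such as a DOM Element.

```javascript
// Hypothetical picker model; names and event shapes are illustrative only.

// Live count of links inside the highlighted element.
function linkCount(el) {
  return el ? el.querySelectorAll("a[href]").length : 0;
}

const initialPickerState = { target: null, locked: false, cancelled: false };

// Pure transition function: (state, event) -> next state.
function pickerStep(state, event) {
  if (state.locked) return state; // a locked choice ignores further picker input
  switch (event.type) {
    case "hover":
      // Highlight whatever element is under the cursor.
      return { ...state, target: event.element };
    case "ArrowUp": {
      // Widen to the parent; stay put when there is no parent.
      const parent = state.target && state.target.parentElement;
      return parent ? { ...state, target: parent } : state;
    }
    case "click":
      // Lock the current highlight, if there is one.
      return state.target ? { ...state, locked: true } : state;
    case "Escape":
      // Back out of the picker without choosing anything.
      return { target: null, locked: false, cancelled: true };
    default:
      return state;
  }
}
```

In a browser, an overlay would feed this function `mouseover`, `keydown`, and `click` events and redraw the highlight and the count from the returned state. Keeping the transitions pure makes the widening behaviour easy to test without a DOM.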
Once a region is chosen, the overlay lists every link it found, with a running "N of M selected" count, a checkbox per row, a select-all/none control, and a per-row remove button. This review step is the point of the whole exercise: auto-detection gets you 95% of the way, and the last 5% — dropping the one "edit this page on GitHub" link, or the three external references that snuck into a sidebar — is a few clicks instead of a regex. When the list is right, one button hands the selected URLs to the bulk converter.
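The review list is simple enough to sketch as plain data plus a few pure functions. This is an illustrative model under assumed names (`createReviewList`, `toggleRow`, `setAllRows`, `removeRow`, `selectionSummary`, `selectedUrls`), not the extension's implementation, and the URLs are made up.

```javascript
// Hypothetical review-list model; every name here is illustrative.

// Every detected link starts out selected.
function createReviewList(urls) {
  return urls.map((url) => ({ url, selected: true }));
}

// Checkbox per row.
function toggleRow(rows, url) {
  return rows.map((r) => (r.url === url ? { ...r, selected: !r.selected } : r));
}

// Select-all / select-none control.
function setAllRows(rows, selected) {
  return rows.map((r) => ({ ...r, selected }));
}

// Per-row remove button: the row leaves the list entirely.
function removeRow(rows, url) {
  return rows.filter((r) => r.url !== url);
}

// The running "N of M selected" label.
function selectionSummary(rows) {
  const n = rows.filter((r) => r.selected).length;
  return `${n} of ${rows.length} selected`;
}

// What gets handed to the bulk converter.
function selectedUrls(rows) {
  return rows.filter((r) => r.selected).map((r) => r.url);
}

let rows = createReviewList([
  "https://site.example/docs/install",
  "https://site.example/docs/usage",
  "https://github.com/example/repo/edit/main/docs/usage.md",
]);
rows = removeRow(rows, "https://github.com/example/repo/edit/main/docs/usage.md");
rows = toggleRow(rows, "https://site.example/docs/usage");
console.log(selectionSummary(rows)); // "1 of 2 selected"
```

Removing a row and unchecking a row are kept distinct here: a removed row no longer counts toward M, while an unchecked row still does.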
The rules that decide which links survive
Auto-detection is only useful if it is predictable, so it helps to know exactly which links a section yields. Every anchor in the chosen region is run through the same normalization, and the rules are deliberately strict about what counts as a navigable page.
| Link pattern | Kept? | Why |
|---|---|---|
| https://site/docs/guide | Yes | Absolute http(s) URL: a real page |
| /docs/guide (relative) | Yes | Resolved against the current page URL to an absolute URL |
| https://other-site.com/x | Only if same-site filter is off | External links are dropped by default to keep the corpus on-topic |
| #section or empty href | No | In-page anchor or no target, so not a page |
| javascript:, mailto:, tel: | No | Not navigable content |
| Duplicate of one already found | No | Deduplicated by normalized URL |
| The page you're currently on | No | Self-link, nothing new to convert |
The same-site filter is the highest-leverage of these, and it is on by default. With it on, pointing at a documentation sidebar gives you the documentation pages and nothing else — no Twitter link, no status-page link, no "powered by" partner URL. Toggle it off only when you genuinely want to follow external references, for example collecting every source a research page cites.
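To make the toggle concrete, here is a minimal sketch of a same-site filter. It assumes "same site" is decided by comparing URL origins (scheme, host, and port); the helper names `isSameOrigin` and `filterLinks` and the example URLs are hypothetical.

```javascript
// Assumption: "same site" means "same origin" (scheme + host + port).
// A real filter could use a looser rule, such as matching the registrable domain.

function isSameOrigin(candidateUrl, pageUrl) {
  return new URL(candidateUrl).origin === new URL(pageUrl).origin;
}

// With the filter on (the default), only same-origin URLs survive.
function filterLinks(urls, pageUrl, sameSiteOnly = true) {
  return sameSiteOnly ? urls.filter((u) => isSameOrigin(u, pageUrl)) : [...urls];
}

const pageUrl = "https://site.example/docs/";
const sidebarLinks = [
  "https://site.example/docs/guide",
  "https://status.site.example/",
  "https://twitter.com/example",
];

console.log(filterLinks(sidebarLinks, pageUrl));        // only the /docs/guide URL
console.log(filterLinks(sidebarLinks, pageUrl, false)); // all three URLs
```

One consequence of an origin comparison is worth knowing: a subdomain such as status.site.example counts as external, and so does the same host over a different scheme or port. If a docs set is split across subdomains, that is a case for turning the filter off and pruning by hand in the review step.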
Here is the core of the normalization, lightly trimmed from the actual implementation. It resolves relative URLs, drops non-navigable schemes, deduplicates, and respects the same-site toggle:

    function collectLinks(root, sameSiteOnly) {
      const origin = location.origin;
      const current = location.href.split("#")[0];
      const seen = new Set();
      const out = [];
      for (const a of root.querySelectorAll("a[href]")) {
        const raw = a.getAttribute("href");
        if (!raw || /^\s*(javascript:|mailto:|tel:|#)/i.test(raw)) continue;
        let abs;
        try {
          abs = new URL(raw, location.href).href; // resolve relative -> absolute
        } catch (_) {
          continue;
        }
        if (!/^https?:\/\//i.test(abs)) continue;
        const norm = abs.split("#")[0]; // strip the fragment
        if (norm === current) continue; // skip the current page
        if (sameSiteOnly && new URL(norm).origin !== origin) continue;
        if (seen.has(norm)) continue; // dedupe
        seen.add(norm);
        out.push(norm);
      }
      return out;
    }

Two details earn their place. Stripping the #fragment before deduplication means /guide#install and /guide#usage collapse to a single /guide — you want the page once, not once per anchor. And resolving every href against location.href means a sidebar full of relative links like ./reference becomes a list of absolute, convertible URLs without any manual prefixing.
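Both behaviours are easy to check outside a browser, since the `URL` constructor is available in Node as well. The sketch below uses a made-up base URL and made-up hrefs; `normalizeHref` is a hypothetical helper name.

```javascript
// Base URL and hrefs are invented for illustration.
const baseUrl = "https://site.example/docs/guide";

function normalizeHref(href, base) {
  const absolute = new URL(href, base).href; // resolve relative -> absolute
  return absolute.split("#")[0];             // strip the fragment
}

const hrefs = ["/guide#install", "/guide#usage", "./reference", "../api/index"];

// A Set keeps insertion order and drops repeats, so the two /guide anchors collapse.
const uniqueUrls = [...new Set(hrefs.map((h) => normalizeHref(h, baseUrl)))];

console.log(uniqueUrls);
// [
//   "https://site.example/guide",
//   "https://site.example/docs/reference",
//   "https://site.example/api/index"
// ]
```

Note that ./reference resolves relative to the directory of the base path (/docs/), not to the base page itself, which is why it lands at /docs/reference.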
What section-scoping actually buys you
The win from section-scoping is a clean corpus instead of a noisy one — the difference between feeding an agent forty documentation pages and feeding it the same sidebar forty times.
Concretely: collect the in-content links from a docs section, and you get one URL per real page. Convert those to Markdown and a typical documentation page lands around 600–900 tokens of clean content, so a forty-page section is roughly 24,000–36,000 tokens of corpus, sized to drop into a long-context window or a RAG index. Run the whole-page link set instead and you would convert hundreds of pages, most of them duplicated navigation, then spend the savings deduplicating. The token math behind those per-page figures, and why Markdown lands 60–80% below the source HTML, is worked through in cut LLM token costs with clean Markdown.
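The corpus size is plain multiplication, which a few lines make explicit. This is a back-of-envelope sketch using the per-page range quoted above; `estimateCorpusTokens` is a hypothetical helper, and real pages will vary.

```javascript
// Rough estimate only: pages x tokens-per-page, for the low and high ends of the range.
function estimateCorpusTokens(pageCount, tokensPerPage) {
  return {
    low: pageCount * tokensPerPage.low,
    high: pageCount * tokensPerPage.high,
  };
}

const estimate = estimateCorpusTokens(40, { low: 600, high: 900 });
console.log(estimate); // { low: 24000, high: 36000 }
```

The same function gives a quick sanity check before a bulk run: multiply the number of links left in the review list by a per-page figure for the site in question and compare the result with the context window or index budget you are targeting.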
Section-scoped link detection is the difference between handing an agent a clean forty-page documentation set and handing it four hundred links of navigation, footers, and "related posts" — the same conversion either way, but one corpus is signal and the other is mostly chrome.
The other quiet benefit is that detection runs on the rendered page. A site that builds its link list with JavaScript — an infinite-scroll archive, a framework-driven docs sidebar — has those links in the live DOM by the time you point at them, even though a plain server-side fetch of the URL would see an empty shell. That is the same rendering advantage we make the full case for in server scrapers versus browser extensions: harvesting links where the page is already rendered avoids a class of failures a crawler hits head-on.
From a link list to a Markdown corpus
A reviewed list of URLs is only half the job; the point is the Markdown. The collected links flow straight into the bulk converter, which is built to process a list politely and in parallel.
The bulk engine opens up to 10 tabs at a time — a hard cap, for the host site's sake as much as your machine's — waits a configurable delay between pages, converts each rendered page to Markdown locally, and retains around 500 results per batch. Every conversion runs the same Readability-plus-Turndown pipeline the single-page action uses, so a forty-page section comes out as forty clean Markdown documents, not forty walls of HTML. From there you can download them as a ZIP, or as an agent bundle — one file per page plus an index.md and a manifest.json that lists every page, its title, source URL, and token count — which is the shape an AI agent or RAG loader ingests without extra glue. That packaging step, and why agents prefer it, is the subject of building a personal RAG pipeline.
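The "up to 10 at a time, with a delay between pages" behaviour is a standard bounded worker pool. The sketch below shows that general pattern and is not BulkMD's engine: `convertInBatches` and its options are hypothetical names, and `convert` stands in for whatever opens a tab and returns Markdown.

```javascript
// Hypothetical sketch of a concurrency-capped converter loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function convertInBatches(urls, convert, options = {}) {
  const { concurrency = 10, delayMs = 0 } = options;
  const results = new Array(urls.length);
  let next = 0; // index of the next unclaimed URL, shared by all workers

  async function worker() {
    while (next < urls.length) {
      // Claim an index synchronously, so no two workers take the same URL.
      const index = next++;
      const url = urls[index];
      try {
        results[index] = { url, ok: true, markdown: await convert(url) };
      } catch (error) {
        // One failed page should not abort the rest of the batch.
        results[index] = { url, ok: false, error: String(error) };
      }
      if (delayMs > 0) await sleep(delayMs); // pause before this worker takes another page
    }
  }

  // Never start more workers than there are URLs.
  const workerCount = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as the input list
}
```

Each worker pulls the next URL only after finishing its current one, so at most `concurrency` conversions are in flight at any moment, and the per-worker delay spaces out requests to the host. Results are written by index, which keeps the output aligned with the reviewed list regardless of which page finishes first.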
The whole path, end to end, is: point at the section, let detection collect the links, uncheck the few you don't want, send them to bulk, and download the Markdown bundle. No script, no regex, no opening forty tabs by hand — and because every step runs locally in the browser, no page content leaves your machine during conversion.
TL;DR
To extract links from a web page for an LLM corpus, don't grab every link — scope to the section that holds them. Point at the region, let detection auto-collect the links inside it (relative URLs resolved, fragments stripped, duplicates and external links dropped by default), review the list and remove the stragglers, then hand the clean set to a bulk converter. The result is one Markdown document per real page instead of the same navigation repeated hundreds of times, sized at roughly 600–900 tokens per docs page and ready for a long-context window or a RAG index. The actionable next step is to stop collecting whole-page link dumps and start scoping detection to the content region — then convert the reviewed set in bulk.
To do it without writing a crawler, install BulkMD free from the Chrome Web Store, open the page that lists what you need, and click Detect links.
About the author
Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.