If you have ever wanted to ask Claude "what did this 40-minute conference talk actually conclude?" and balked at the idea of watching it twice to find out, the fix is to convert YouTube transcripts to LLM context and let the model read what you do not have time to. A transcript is one of the densest, cheapest forms of context you can hand an LLM: pure spoken text, no navigation bars, no ad slots, no nested <div> wrappers. The hard part is not getting the words; it is getting them clean enough that the model spends its attention on the argument instead of on [00:14:22] timestamps and forty repetitions of "you know."

This post walks the full workflow: why transcripts are unusually good context, how to open the transcript on any video that exposes one, how to strip timestamps and filler down to readable Markdown, and how to use the result for summaries and Q&A. We will also be honest about the limit that trips everyone up, which is that transcript availability is a property of the video, not of any tool.

Why a transcript is unusually good LLM context

A transcript is spoken text with almost none of the boilerplate that makes web pages expensive. When you convert a web article to Markdown, most of the work is deciding what to throw away, the same problem covered in the web-page-to-Markdown primer. A transcript arrives with that work already done: there is no header, no sidebar, no cookie banner, no "related videos" rail mixed into the body. What you get is the words a person said, in order.

That makes transcripts dense in the way that matters for context windows. A speaker at a normal cadence produces roughly 130-160 words per minute. So a 20-minute talk is on the order of 2,800-3,200 words once you remove filler, which is roughly 3,700-4,300 tokens against the cl100k_base vocabulary that powers GPT-3.5 and GPT-4. (o200k_base, which powers GPT-4o and the o-series, tokenizes English prose comparably, and Claude's own tokenizer lands in the same neighborhood for English.) That fits, whole and uncompressed, inside any modern context window with room to spare for your question and the answer.
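The arithmetic above can be sketched as two small helpers. The function names and the default ratio of 1.33 tokens per word are assumptions for illustration: the ratio is a rough figure for English prose under cl100k_base, not a measured constant, so treat the output as a budget estimate rather than a tokenizer count.

```javascript
// Rough sizing helpers. The tokens-per-word ratio is an assumed approximation
// for English prose; a real tokenizer will give slightly different numbers.
function wordsFromMinutes(minutes, wordsPerMinute) {
  return Math.round(minutes * wordsPerMinute);
}

function tokensFromWords(words, tokensPerWord = 1.33) {
  return Math.round(words * tokensPerWord);
}

console.log(wordsFromMinutes(20, 160));                     // 3200
console.log(tokensFromWords(2800), tokensFromWords(3200));  // 3724 4256
```

For anything close to a context limit, count with the model's actual tokenizer instead of relying on a ratio.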

The catch is that the raw transcript YouTube hands you is not clean. It is a stream of caption cues, each prefixed with a timestamp, frequently broken mid-sentence, and, for auto-generated captions, studded with the verbal tics that humans filter out unconsciously but a tokenizer counts in full. Once cleaned, though, a transcript is just another source: BulkMD and the wider Markdown-for-context approach feed spoken transcripts into the same pipeline as articles and documentation.

How to open the transcript on a YouTube video

The most reliable source is YouTube's own transcript panel, because it works regardless of which extension or script you use.

1. Open the video and click the "..." (More actions) menu under the player, or the Show transcript button in the description area.
2. The transcript panel opens on the right. Each line is a caption cue with a timestamp.
3. There is a toggle to hide timestamps inside that panel. Use it; it removes most of the noise before you even copy.
4. Select the transcript text and copy it.

This works on any video where the uploader (or YouTube's automatic speech recognition) has produced captions. It is your fallback for everything.

Where BulkMD fits

When YouTube exposes a transcript through its standard panel, BulkMD does a best-effort transcript copy: it reads the transcript YouTube has already rendered and puts it on your clipboard as text, the same way it copies a converted article. This is a convenience over the manual select-and-copy above. It is not magic, and it is bound by the same constraint. If the video has no captions, there is nothing to read.

An honest limit. Transcript availability is a property of the video. Auto-captions cover a large share of English uploads, but creators can disable captions, and some languages or older uploads have none. No browser tool can produce a transcript that YouTube has not generated. When that happens, you need a separate transcription step (a speech-to-text service), which is outside the local-only workflow.

What "clean" means: timestamps, speaker noise, and filler

A clean transcript for LLM context has three things removed and one thing added.

Remove timestamps. A cue like [00:14:22] or 14:22 at the start of every line is pure overhead. It tokenizes to several tokens per cue, and across a long video that adds up to hundreds or even thousands of wasted tokens carrying no semantic value the model needs to answer a content question. (Keep timestamps only if you specifically want the model to cite "around the 14-minute mark", a rare case.)
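For that rare case, a middle ground is to drop the per-cue timestamps but keep one coarse marker whenever the video crosses into a new time bucket. The sketch below assumes each cue sits on one line that starts with [hh:mm:ss], [mm:ss], or a bare mm:ss; the function name, the marker format, and the 5-minute default are all hypothetical choices.

```javascript
// Keep one coarse marker per time bucket instead of a timestamp on every cue.
// Assumes each input line is "<timestamp> <text>"; lines without a leading
// timestamp are kept as plain text.
function coarseTimestamps(raw, bucketMinutes = 5) {
  const CUE = /^\s*\[?(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\]?\s*(.*)$/;
  const out = [];
  let lastBucket = -1;

  for (const line of raw.split("\n")) {
    const m = line.match(CUE);
    if (!m) {
      if (line.trim()) out.push(line.trim());
      continue;
    }
    const [, h = "0", min, sec, text] = m;
    const totalMinutes = Number(h) * 60 + Number(min) + Number(sec) / 60;
    const bucket = Math.floor(totalMinutes / bucketMinutes);
    if (bucket !== lastBucket) {
      // Emit a marker only when the bucket changes.
      out.push(`[~${bucket * bucketMinutes} min]`);
      lastBucket = bucket;
    }
    if (text.trim()) out.push(text.trim());
  }
  return out.join(" ");
}
```

A 20-minute talk then carries about four markers instead of one per cue, which keeps most of the token saving while still letting the model point at a rough moment.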

Remove caption fragmentation. Auto-captions break lines every few words to fit the on-screen caption box, so a single sentence is shattered across five cues. Re-flowing those fragments into sentences and paragraphs is the single biggest readability win, because it restores the structural boundaries a retriever and a model both rely on.

Remove filler. Auto-transcribed speech is full of "um", "uh", "you know", "like", "I mean", and false starts. These are not just ugly; they dilute the signal. A model attending to a paragraph that is heavy with filler spends part of that paragraph's attention budget on noise instead of on the claim.

Add a citation block. Prefix the cleaned text with the video title, channel, URL, and date. This is the same ## Source: pattern that lifts citation accuracy for any LLM context, and it lets the model attribute a claim to "the talk by X" rather than "the document."

Here is the difference on a short stretch of an auto-caption stream, before and after:

[00:02:14] so the the thing about
[00:02:16] retrieval is that you you basically
[00:02:18] um you want to chunk the document
[00:02:20] before you embed it you know
[00:02:22] otherwise the the embeddings are
[00:02:24] kind of meaningless

Cleaned:

The thing about retrieval is that you basically want to chunk the
document before you embed it; otherwise the embeddings are meaningless.

Six fragmented, timestamp-prefixed cues collapse to one sentence. The meaning is identical; the token count drops by more than half.

A repeatable cleanup pass

For occasional use, a short regex-based pass gets a raw YouTube transcript most of the way to clean. The script below strips leading timestamps, re-flows the fragments into a paragraph, removes the most common English filler words, collapses whitespace, and drops immediately repeated words. It is intentionally conservative: it does not try to re-punctuate, which is a job better left to the LLM itself.

// Conservative transcript cleanup. Paste raw transcript text into `raw`.
function cleanTranscript(raw) {
  const FILLER = /\b(um|uh|erm|you know|i mean|kind of|sort of|like)\b/gi;

  return raw
    .split("\n")
    .map((line) =>
      line
        // strip leading [hh:mm:ss], [mm:ss], or bare mm:ss timestamps
        .replace(/^\s*\[?\d{1,2}:\d{2}(:\d{2})?\]?\s*/, "")
        .trim()
    )
    .filter(Boolean)
    .join(" ")
    // drop filler words, then tidy the gaps they leave behind
    .replace(FILLER, "")
    .replace(/\s+([,.;:!?])/g, "$1")
    .replace(/\s{2,}/g, " ")
    // remove stutters of any length: "the the" -> "the", "you you you" -> "you"
    .replace(/\b(\w+)(?:\s+\1\b)+/gi, "$1")
    .trim();
}

// Then wrap it with a citation block for the model:
function toMarkdown(meta, body) {
  return [
    `## Source: ${meta.url}`,
    `- Title: ${meta.title}`,
    `- Channel: ${meta.channel}`,
    `- Captured: ${new Date().toISOString().slice(0, 10)}`,
    ``,
    body,
  ].join("\n");
}

A word of caution on the filler list: removing "like" and "kind of" is safe in casual speech but can occasionally delete meaning ("a tree-like structure"). Keep the list short, and review the output. This is a cleanup aid, not an unsupervised pipeline. For a one-off video you can skip the script entirely and let the model do the filtering, which the next section covers.
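One way to reduce that risk, sketched here as a variant rather than as part of the script above, is to guard the filler pattern with lookarounds so a match is skipped when it touches a hyphen. The names SAFE_FILLER and removeFillerSafely are hypothetical, and the lookbehind syntax assumes an ES2018-or-later JavaScript engine.

```javascript
// Hyphen-aware filler removal (illustrative variant).
// The lookarounds reject a match that touches a hyphen or a word character,
// so "tree-like" survives while a standalone "like" is still removed.
const SAFE_FILLER =
  /(?<![\w-])(?:um|uh|erm|you know|i mean|kind of|sort of|like)(?![\w-])/gi;

function removeFillerSafely(text) {
  return text
    .replace(SAFE_FILLER, "")
    .replace(/\s+([,.;:!?])/g, "$1") // close gaps left before punctuation
    .replace(/\s{2,}/g, " ")         // collapse runs of whitespace
    .trim();
}

console.log(removeFillerSafely("it is um like a tree-like structure you know"));
// -> "it is a tree-like structure"
```

This only protects hyphenated compounds. A sentence such as "I like this approach" still loses its verb, so reviewing the output remains necessary.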

How big is the cleanup saving, really

To make the numbers concrete, here is a single 22-minute conference talk measured at three stages: the raw transcript copied straight from YouTube's panel with timestamps on, the same text with timestamps stripped, and the fully cleaned Markdown after filler removal and re-flow. Token counts are cl100k_base, the GPT-4 family tokenizer; treat them as a worked example from one video, not a population average.

Stage	Words	Tokens (cl100k)	Notes
Raw transcript, timestamps on	~3,650	~6,900	One timestamp per caption cue
Timestamps stripped, still fragmented	~3,500	~4,800	Removing cue prefixes is the biggest single drop
Cleaned Markdown (filler removed, re-flowed)	~3,050	~4,150	Plus a ~25-token citation block

The cleaned version is roughly 40% fewer tokens than the raw copy, and the largest single saving comes from removing the per-cue timestamps rather than from filler removal. The filler pass matters more for answer quality than for token count: a model reading clean prose produces tighter summaries than one wading through "um" and stutters, even when the token delta is modest.
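The percentages can be checked directly from the approximate token counts in the table above. The helper name is hypothetical; the inputs are the table's rounded figures, so the results are approximate too.

```javascript
// Percentage reduction between two token counts, rounded to one decimal place.
function percentSaved(before, after) {
  return Math.round(((before - after) / before) * 1000) / 10;
}

// Approximate token counts from the table above.
const stages = { raw: 6900, noTimestamps: 4800, cleaned: 4150 };

console.log(percentSaved(stages.raw, stages.cleaned));          // 39.9
console.log(percentSaved(stages.raw, stages.noTimestamps));     // 30.4
console.log(percentSaved(stages.noTimestamps, stages.cleaned)); // 13.5
```

Stripping timestamps accounts for about 30 of the roughly 40 percentage points saved, which matches the observation that cue prefixes are the biggest single drop.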

Using the transcript for summaries and Q&A

Once you have clean Markdown, the prompt patterns are the same ones that work for any Markdown context fed to an AI agent. Three are worth calling out for transcripts specifically.

Structured summary, not a blob. Ask for a summary with explicit structure, such as "Give me the three main claims, the evidence offered for each, and one objection the speaker did not address." A transcript is linear and repetitive by nature; a structured prompt forces the model to deduplicate the speaker's restatements into distinct points.

Quote with attribution. Because the transcript is verbatim, the model can quote exact phrasing. Ask it to "support each point with a short direct quote from the transcript." This is where the citation block earns its keep: the model attributes quotes to the named source rather than inventing a generic "the speaker said."

Let the model finish the cleanup. If you skipped the regex pass, you can fold cleanup into the prompt: "This is a raw auto-generated transcript. Ignore timestamps and filler words, and treat repeated phrases as a single statement." Modern models handle this well, and for a single video it is faster than running a script. The script pays off when you are processing many transcripts and want to remove noise before it consumes context-window budget across a batch.
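The three patterns can be combined into one reusable prompt builder. This is a minimal sketch: the function and option names are hypothetical, the instruction strings are lifted from the paragraphs above, and the separator layout is just one reasonable choice.

```javascript
// Assemble a prompt from the three patterns above. Names are hypothetical;
// pass rawTranscript: true when the regex cleanup pass was skipped.
function buildTranscriptPrompt(transcriptMarkdown, { rawTranscript = false } = {}) {
  const instructions = [
    "Give me the three main claims, the evidence offered for each, and one objection the speaker did not address.",
    "Support each point with a short direct quote from the transcript.",
  ];
  if (rawTranscript) {
    // Fold the cleanup into the prompt, ahead of the task instructions.
    instructions.unshift(
      "This is a raw auto-generated transcript. Ignore timestamps and filler words, and treat repeated phrases as a single statement."
    );
  }
  return [...instructions, "", "---", "", transcriptMarkdown].join("\n");
}
```

Putting the instructions before the transcript is a stylistic choice here; some people prefer the question after a long document, and either order can be tried.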

A practical sequence I use for a research session: collect transcripts from a handful of talks on one topic, clean each, concatenate them under separate ## Source: headings into one Markdown file, and ask the model to compare and contrast the speakers' positions. Five 20-minute talks come to roughly 20,000-22,000 tokens cleaned, which is a single comfortable prompt with no chunking required.
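That concatenation step can be sketched as follows. The function name, the shape of the talk objects, and the 1.33 tokens-per-word ratio are assumptions for illustration; the estimate is a budgeting aid, not a tokenizer count.

```javascript
// Concatenate cleaned transcripts under separate "## Source:" headings and
// estimate the combined size. tokensPerWord is an assumed approximation for
// English prose, not a real tokenizer.
function combineSources(talks, tokensPerWord = 1.33) {
  const markdown = talks
    .map((t) =>
      [`## Source: ${t.url}`, `- Title: ${t.title}`, `- Channel: ${t.channel}`, "", t.body].join("\n")
    )
    .join("\n\n");
  const words = markdown.split(/\s+/).filter(Boolean).length;
  return { markdown, words, estimatedTokens: Math.round(words * tokensPerWord) };
}
```

Checking estimatedTokens before sending makes it easy to see whether a batch still fits in one prompt or has grown enough to need splitting.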

The money sentence

A clean 20-minute YouTube transcript is about 3,000 words and roughly 4,000-4,500 tokens of pure spoken content, which means you can hand an LLM the complete substance of a talk for well under a tenth of a typical model's context window, and the only real prerequisite is that the video has captions at all.

TL;DR

Transcripts are some of the best LLM context available because the boilerplate problem is already solved: what is left is removing timestamps, caption fragmentation, and filler, then wrapping the result in a citation block. Open YouTube's transcript panel (or let BulkMD copy it when one exists), run a conservative cleanup pass or let the model do it, and you have a dense, citable source for summaries and Q&A. Just remember the honest limit: if the video has no captions, no browser tool can conjure them, and you will need a separate transcription step. If you already collect web articles and documentation for context and want video transcripts in the same Markdown pipeline, install BulkMD free from the Chrome Web Store and add your next talk to the stack.

Frequently asked questions

Can BulkMD transcribe a YouTube video that has no captions?

No. BulkMD does a best-effort copy of the transcript YouTube already exposes through its transcript panel. If the uploader disabled captions and YouTube generated none, there is no transcript to read, and no browser extension can create one. For those videos you need a separate speech-to-text step, which is outside the local-only workflow.

How many tokens does a typical YouTube transcript use?

Speech runs roughly 130-160 words per minute, so a clean 20-minute transcript is about 3,000 words, or roughly 4,000-4,500 cl100k tokens after you strip timestamps and filler. A raw copy with per-cue timestamps can run roughly two-thirds larger for the same content; put the other way, cleaning removes about 40% of the raw tokens.

Should I keep the timestamps in the transcript?

Usually no. They add several tokens per caption cue and carry no meaning for a content question. Keep them only when you specifically want the model to cite a moment in the video, like "around the 14-minute mark." For summaries and Q&A, strip them.

Is it better to clean the transcript first or let the model do it?

For a single video, let the model handle it: tell it to ignore timestamps and filler. Cleaning first with a script pays off when you process many transcripts at once, because removing noise before the prompt saves context-window budget across the whole batch.

Will removing filler words change the meaning?

Rarely, if you keep the filler list short. The risk is words like "like" and "kind of" that are sometimes meaningful ("a tree-like structure"). Use a conservative list and review the output. Treat the cleanup pass as an aid, not an unsupervised pipeline.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.




YouTube Transcripts to LLM Context: A Clean Markdown Flow