BulkMD

OpenAI vs Voyage vs Cohere Embeddings: 2026 RAG Benchmark

Three embedding-model families compared on a Markdown-corpus RAG task — retrieval quality, cost per million tokens, dimensions, and which fits which workload.

M. H. Tawfik · 11 min read

If you have ever sat down to build a RAG pipeline against a corpus of clean Markdown — the kind of corpus produced by BulkMD and similar tools — you have hit the decision this post is about: which embedding model should index your content? OpenAI's text-embedding-3 family, Voyage AI's voyage-3-large, and Cohere's embed-v3 are the three serious options in 2026, and each wins on a different axis. The picks that look obvious in marketing pages (the model with the highest MTEB score, the model that was cheapest yesterday) are not always the right picks for the corpus you actually have.

This post is the empirical comparison on a Markdown-corpus RAG task, with reproducible methodology and explicit cost math. For the upstream choices that shape the corpus you are embedding — extractor and serializer — the Readability vs Trafilatura and Turndown vs Pandoc posts cover the relevant trade-offs. This is the downstream companion: once you have clean Markdown, what do you embed it with.

What each embedding family is

OpenAI's text-embedding-3-small and text-embedding-3-large are the workhorse embeddings of 2026. The small variant is 1,536 dimensions and runs at $0.02 per million input tokens; the large variant is 3,072 dimensions and runs at $0.13 per million. Both support dimension truncation — you can ask for 256, 512, or 1024 dimensions and get a usefully degraded version of the full embedding. The training corpus is web-heavy and broadly multilingual.

Voyage AI's voyage-3-large is a 1,024-dimensional model that has consistently topped the MTEB English leaderboard since late 2025. It is priced at $0.18 per million input tokens, which makes it the most expensive of the three by a meaningful margin. It supports 256-dim truncation but tops out at 1,024 native dimensions. Voyage's training is heavily weighted toward technical and code-adjacent content, which tends to show in benchmarks against developer-facing corpora.

Cohere's embed-v3-multilingual is the strongest of the three for cross-language work. It produces 1,024-dimensional embeddings, runs at $0.10 per million input tokens, and was trained on a broader language mix than the others. On English-only tasks it lags both OpenAI's large and Voyage's flagship; on Spanish, Japanese, or Arabic corpora it leads by a margin worth caring about.

The benchmark

We ran a RAG-flavored evaluation against a corpus of 1,200 Markdown documents — a sample of BulkMD-converted output spanning technical blog posts, library documentation, API references, and long-form research essays — embedded by each model and queried with 240 hand-written questions whose ground-truth answers are known. Retrieval quality is measured as nDCG@10 (normalized discounted cumulative gain at the top 10 retrieved chunks), the standard RAG metric.
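
For readers who have not computed the metric by hand, here is a minimal sketch of nDCG@10 for a single query. It assumes binary relevance labels (1 if a chunk answers the question, 0 otherwise); the function names are illustrative and not part of any library.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.

    ranked_relevances holds the relevance label of every candidate chunk,
    in the order the retriever ranked them (best first).
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant chunk exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# The single relevant chunk retrieved at rank 1 versus rank 3.
best_case = ndcg_at_k([1, 0, 0, 0, 0])    # 1.0
third_place = ndcg_at_k([0, 0, 1, 0, 0])  # 0.5
```

The benchmark score is the mean of this per-query value over all 240 questions.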

| Model | Native dimensions | Price ($/M tokens) | nDCG@10 | nDCG@10 at 256-dim |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1,536 | $0.02 | 0.71 | 0.65 |
| OpenAI text-embedding-3-large | 3,072 | $0.13 | 0.74 | 0.68 |
| Voyage-3-large | 1,024 | $0.18 | 0.78 | 0.71 |
| Voyage-3 (base) | 1,024 | $0.06 | 0.75 | 0.69 |
| Cohere embed-v3 (English) | 1,024 | $0.10 | 0.72 | 0.66 |
| Cohere embed-v3 (Multilingual) | 1,024 | $0.10 | 0.71 | 0.65 |

A few things to read out of this table. Voyage-3-large is the clear quality winner on English-only Markdown content; it is also the most expensive by a noticeable margin. OpenAI's text-embedding-3-small is the clear quality-per-dollar winner; at 0.71 nDCG@10 it is within three to seven points of the leaders at one-third to one-ninth the price. Cohere's English variant lands in the middle of the pack on this English-only corpus, which is consistent with its positioning as a multilingual choice rather than an English specialist.
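
The price-versus-quality reading can be checked with a few lines of arithmetic. This sketch copies the prices and scores from the table; the short dictionary keys are our own labels, not API model identifiers.

```python
# (price in $ per million tokens, nDCG@10) copied from the table above
results = {
    "text-embedding-3-small": (0.02, 0.71),
    "text-embedding-3-large": (0.13, 0.74),
    "voyage-3-large": (0.18, 0.78),
    "voyage-3": (0.06, 0.75),
    "embed-v3-english": (0.10, 0.72),
    "embed-v3-multilingual": (0.10, 0.71),
}

base_price, base_score = results["text-embedding-3-small"]

def compare_to_baseline(name):
    """Return (price multiple, nDCG points gained) relative to OpenAI small."""
    price, score = results[name]
    return round(price / base_price, 1), round((score - base_score) * 100, 1)

for name in results:
    multiple, points = compare_to_baseline(name)
    print(f"{name:26s} {multiple:>4}x price  {points:+.1f} nDCG points")
```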

The 256-dimension column matters because storage and query cost scale roughly linearly with dimensions. A vector database holding ten million 1,024-dim vectors needs about four times the storage of the same database at 256 dims, and query cost scales similarly. The quality drop from native to 256 dims is consistently six to seven nDCG points (8-9% relative) across providers; whether that drop is worth the storage savings depends entirely on your scale and your tolerance for quality regression.
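
One way to get shortened vectors is to request them from the provider (OpenAI exposes a `dimensions` parameter on the text-embedding-3 models). Another, sketched below, is to truncate a stored full-size vector and renormalize it. How well client-side truncation works depends on how the model was trained, so treat this as an illustration to validate against your own eval set rather than provider guidance.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale to unit length,
    so dot products remain cosine similarities."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]

full = [3.0, 4.0, 0.0, 12.0]         # toy 4-dim vector
short = truncate_embedding(full, 2)  # -> [0.6, 0.8]
```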

Where each model wins

For a small to mid-sized English RAG corpus where retrieval quality directly drives answer quality, Voyage-3-large is the right pick. Its seven-point nDCG advantage over OpenAI's small variant (0.78 versus 0.71) translates to a measurable improvement in downstream answer accuracy on hard queries: the kind where the relevant context is in a small subset of the corpus and missing it produces wrong answers.

For a high-volume corpus where you are embedding millions of documents and querying at scale, text-embedding-3-small is the right pick. The cost differential is large enough — Voyage-3-large is nine times the per-token price of OpenAI small — that even a small quality improvement does not pay for itself unless your downstream value-per-correct-answer is high. For a personal RAG over a few hundred docs you converted with BulkMD, this is the practical default.
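
To put the nine-times price gap in absolute terms, here is a rough cost estimator. The corpus size and average chunk length are hypothetical inputs chosen for illustration; only the per-million-token prices come from this post.

```python
def embedding_cost_usd(n_chunks, avg_tokens_per_chunk, price_per_million_tokens):
    """One-off cost of embedding a corpus at a flat per-token price."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical corpus: 5 million chunks averaging 600 tokens each.
small_cost = embedding_cost_usd(5_000_000, 600, 0.02)   # OpenAI small
voyage_cost = embedding_cost_usd(5_000_000, 600, 0.18)  # Voyage-3-large
print(f"small: ${small_cost:,.0f}  voyage-3-large: ${voyage_cost:,.0f}")
```

This covers indexing only; query embeddings and any re-indexing after content changes add to the total at the same per-token rate.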

For a multilingual corpus, Cohere is the right pick despite lagging on the English-only benchmark above. Voyage and OpenAI have improved their non-English performance significantly in 2026, but neither matches Cohere's coverage on Indian-subcontinent languages, Korean, or Hebrew. If your corpus mixes English with anything in that long tail, the right answer is Cohere.

For a code-heavy corpus, Voyage-3-large extends its lead. The corpus we benchmarked above had roughly 12% code content; on code-only subsets, Voyage's nDCG advantage over OpenAI grows to roughly seven points. Anecdotally this matches what Voyage's own published benchmarks show; the model was trained with explicit code-adjacent attention.

Reproducing the numbers yourself

The methodology that produced the table above is straightforward enough that any team can run it on their own corpus, and we recommend doing so before committing to a provider. The setup:

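from openai import OpenAI
import numpy as np
from sklearn.metrics import ndcg_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# load_markdown_chunks, load_eval_set and build_relevance_matrix are
# project-specific helpers you supply; they are not library functions.
chunks = load_markdown_chunks("docs/.ai/")  # 1200 docs, ~50 chunks each
queries = load_eval_set("questions.jsonl")  # 240 Q + ground-truth chunk IDs

# Embed once per provider, in batches: a single request accepts a limited
# amount of input, so ~60,000 chunks cannot go in one call.
def embed_openai(texts, model="text-embedding-3-small", batch_size=128):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(d.embedding for d in resp.data)
    return np.array(vectors)

doc_vectors = embed_openai([c.text for c in chunks])
query_vectors = embed_openai([q.text for q in queries])

# Compute nDCG@10. OpenAI embeddings are unit-length, so the dot product
# equals cosine similarity.
similarities = query_vectors @ doc_vectors.T
# true_relevance: (n_queries, n_chunks), 1 where a chunk answers the query
true_relevance = build_relevance_matrix(queries, chunks)
score = ndcg_score(true_relevance, similarities, k=10)
print(f"nDCG@10: {score:.3f}")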
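
The snippet above calls `build_relevance_matrix` without defining it. A minimal sketch follows, assuming each chunk has an `id` and each question carries a `relevant_ids` set of ground-truth chunk IDs. Both field names and the two dataclasses are hypothetical; adapt them to whatever your own loaders return.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:          # hypothetical shape of a loaded chunk
    id: str
    text: str

@dataclass
class Query:          # hypothetical shape of an eval question
    text: str
    relevant_ids: set = field(default_factory=set)

def build_relevance_matrix(queries, chunks):
    """Return one row per query, with 1 where the chunk in that column
    is a ground-truth answer and 0 elsewhere."""
    return [
        [1 if chunk.id in query.relevant_ids else 0 for chunk in chunks]
        for query in queries
    ]

chunks = [Chunk("a", "..."), Chunk("b", "..."), Chunk("c", "...")]
queries = [Query("q1", {"b"}), Query("q2", {"a", "c"})]
matrix = build_relevance_matrix(queries, chunks)  # [[0, 1, 0], [1, 0, 1]]
```

The nested list can be passed straight to `ndcg_score`, which accepts any array-like of shape (n_queries, n_chunks).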

The hard part is not running the embeddings but assembling the 240-question evaluation set. The quality of the eval set dominates the quality of the benchmark. We hand-wrote each question by reading a randomly selected chunk and constructing a question whose answer is in that chunk, then verified by re-running retrieval against the small variant and confirming the question's source chunk appeared in the top three retrieved. This produces an eval set with high label quality at the cost of significant authoring time — roughly four hours per hundred questions, in our experience.
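
The top-three verification step can be sketched as a small filter. It assumes you already have one row of similarity scores per question and know the index of the chunk the question was written from; the function names are ours.

```python
def source_rank(similarity_row, source_index):
    """1-based rank of the source chunk when chunks are sorted by similarity."""
    order = sorted(
        range(len(similarity_row)),
        key=lambda i: similarity_row[i],
        reverse=True,
    )
    return order.index(source_index) + 1

def keep_question(similarity_row, source_index, k=3):
    """Keep a question only if its source chunk is retrieved in the top k."""
    return source_rank(similarity_row, source_index) <= k

row = [0.12, 0.80, 0.33, 0.75, 0.41]  # toy similarities of one question to 5 chunks
kept = keep_question(row, 4)          # source chunk ranks 3rd, so it is kept
dropped = keep_question(row, 2)       # source chunk ranks 4th, so it is dropped
```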

For teams that want to skip the eval-authoring step, the standard alternative is the BEIR benchmark suite, which provides curated retrieval evaluation datasets across multiple domains. BEIR scores are less specific to your corpus but are reproducible across providers and a useful baseline.

The dimension tradeoff is more nuanced than it looks

The 256-dim column in the table above understates a real choice. Storage and query cost scale linearly with dimensions; retrieval quality degrades sublinearly. The right operating point depends on which side of that tradeoff binds you.

For a personal-scale RAG (under 100,000 documents, queries-per-second well below one), the dimension cost is not the binding constraint. Use the model's native dimensions — 1,536 for OpenAI small, 3,072 for OpenAI large, 1,024 for Voyage or Cohere. You will not notice the storage cost, and you keep the full retrieval quality the model was trained to produce.

For a production RAG (millions of documents, sustained query load), dimension cost becomes meaningful. The math we use: one million 1,024-dim vectors stored as 32-bit floats take roughly 4 GB (1,000,000 × 1,024 × 4 bytes). Truncate to 256 dims, and that drops to about 1 GB. At ten million vectors the difference is roughly 40 GB versus 10 GB. Vector-database pricing tiers often have hard breakpoints at the multiples-of-ten-GB mark; the savings can be more than the math suggests.
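
The storage arithmetic is easy to rerun for your own vector counts. This sketch counts raw float32 storage in decimal gigabytes and ignores index overhead and metadata, which vary by vector database.

```python
def index_size_gb(n_vectors, dims, bytes_per_value=4):
    """Raw vector storage in decimal gigabytes (float32 by default)."""
    return n_vectors * dims * bytes_per_value / 1e9

for n in (1_000_000, 10_000_000):
    full = index_size_gb(n, 1024)
    short = index_size_gb(n, 256)
    print(f"{n:>10,} vectors: {full:.1f} GB at 1,024 dims, {short:.1f} GB at 256 dims")
```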

The right operating point is to start at native dimensions, measure your retrieval quality on a real eval set, and truncate only if (a) storage cost is a real constraint and (b) the quality regression is acceptable for your downstream task. For most teams reading this post, neither condition is met, and full dimensions is the right default.

TL;DR

Voyage-3-large is the highest-quality embedding for English-only Markdown RAG in 2026, OpenAI text-embedding-3-small is the best quality-per-dollar choice, and Cohere embed-v3 is the right pick for multilingual corpora. The three-to-seven-point nDCG spread between the leaders and the value pick translates to real downstream answer quality on hard queries, but the cost differential is large enough that the value pick is the right default for most personal-scale workflows. Run your own eval set before committing; the published benchmark numbers are useful but rarely exactly match the corpus you actually have.

If you need clean, well-shaped Markdown to embed in any of these pipelines — the kind of corpus where retrieval-quality differences actually become visible — BulkMD produces that output from any web page directly in your browser, with no API key and no server.

Frequently asked questions

Should I use the same embedding model for indexing and querying?

Yes. Asymmetric embeddings (different models for queries and documents) exist but are an advanced technique with provider-specific quirks. For the vast majority of RAG pipelines, embedding both queries and documents with the same model is correct, predictable, and the only path that works across all three providers above.

How do I decide chunk size before embedding?

For Markdown corpora, chunk by `##` heading wherever possible — the document's own structure gives you natural chunk boundaries that embed cleanly. Where headings are too sparse (one per 4000+ tokens), fall back to fixed-size chunks of 512–800 tokens with a 100-token overlap. Smaller chunks improve precision but lose context; larger chunks do the reverse.
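
A minimal sketch of that strategy: split on `##` headings, then re-split any oversized section into overlapping fixed-size windows. Whitespace-separated words stand in for tokens here, which is an approximation; swap in your provider's tokenizer for real counts. The sketch also does not skip `##` lines inside fenced code blocks.

```python
import re

def split_fixed(words, size=600, overlap=100):
    """Fixed-size windows over a word list, with `overlap` words shared
    between neighbouring windows."""
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]

def chunk_markdown(text, max_words=800, size=600, overlap=100):
    """Split on `##` headings; re-split oversized sections into windows."""
    sections = re.split(r"(?m)^(?=## )", text)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue  # empty piece before a leading heading
        if len(words) <= max_words:
            chunks.append(section.strip())
        else:
            chunks.extend(split_fixed(words, size, overlap))
    return chunks

sample = "intro words\n## A\nalpha beta\n## B\n" + " ".join(["w"] * 1000)
pieces = chunk_markdown(sample)
```

In the sample, the intro and section A each become one chunk, while section B (over 800 words) is split into two overlapping windows.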

Does the model's context length matter for embeddings?

Yes, but less than you'd think. OpenAI text-embedding-3 supports 8,191 input tokens; Voyage-3 supports 32K; Cohere v3 supports 512. For Markdown chunks of 800 tokens, OpenAI and Voyage handle the full chunk, but the chunk exceeds Cohere v3's 512-token limit, so cap chunks at 512 tokens if Cohere is a candidate. The longer context-length models let you embed bigger chunks if that suits your retrieval strategy, but for most RAG pipelines 512 tokens is enough.
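
A quick way to sanity-check a chunk size against these limits is sketched below. The limits are the ones quoted in this answer; the dictionary keys are our own family labels, not API model identifiers.

```python
# Maximum input tokens per model family, as quoted in the answer above.
MAX_INPUT_TOKENS = {
    "openai-text-embedding-3": 8191,
    "voyage-3": 32_000,
    "cohere-embed-v3": 512,
}

def models_that_truncate(chunk_tokens):
    """Return the model families whose limit is below the given chunk length."""
    return sorted(
        name for name, limit in MAX_INPUT_TOKENS.items() if chunk_tokens > limit
    )

too_long_for = models_that_truncate(800)  # ['cohere-embed-v3']
```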

What about Anthropic embeddings?

Anthropic does not ship a first-party embeddings API as of May 2026. They have signaled that one is coming but have not published a model or pricing. For now, the three options above are the field. If Anthropic ships in 2026, expect it to compete most directly with Voyage on quality and with OpenAI on price.

Should I host an open-source embedding model locally?

Only if you have a strong privacy constraint that rules out provider APIs. The best open-source models — gte-Qwen2, mxbai-embed-large — are within 1–2 percentage points of OpenAI text-embedding-3-small on most benchmarks, but the operational cost of hosting them (a GPU instance for a model that runs in 30ms on someone else's GPU) typically dwarfs the API cost unless you are at very high volume.

About the author

M. H. Tawfik

Lead Developer & Owner

Working from Kushtia, Bangladesh.

Independent software engineer building developer tools at Soft Web Grove. Creator and maintainer of BulkMD.

Reach the team at [email protected] — typically within 24 hours, any day of the year. Soft Web Grove also takes a small number of outside engagements; details on the about page.

Tagged: RAG, Tokens, Cost optimization, Markdown