The Context Window Race: 1 Million, 10 Million, and What Actually Works

In 2023, 4,096 tokens were the standard. In 2024, 128K became the new minimum for serious models. In 2026, several models announce 1 million tokens and two — Gemini 3.1 Pro and Llama 4 Scout — reach 10 million. Is this a marketing race, or is there practical utility in contexts of that magnitude?

The answer is: it depends on where the information sits in the context.

The Context Window Map — May 2026

Gemini 3.1 Pro: 10 million tokens (Closed)
Llama 4 Scout: 10 million tokens (Open)
GPT-5.5: 1 million tokens (Closed)
Claude Opus 4.7: 1 million tokens (Closed)
DeepSeek V4 Pro: 1 million tokens (Open)
Qwen 3.5-397B: 1 million tokens (Open)
Mistral Medium 3.5: 256K tokens (Open)
Gemma 4-31B: 256K tokens (Open)

Gemini 3.1 Pro and Llama 4 Scout lead by a factor of 10. For most of the other models listed, 1 million tokens is the new frontier standard.

The "Lost in the Middle" Problem

The announced context window is not the amount of context the model uses reliably. Research from 2026 shows a consistent pattern: models have high accuracy for information at the beginning and end of the context, and significant degradation for information in the middle.

The degradation is measurable: for very long contexts, retrieval accuracy for information in the middle drops 10-25% compared to the beginning or end. In short contexts (up to 128K), the effect is manageable. In 1M+ contexts, the "middle" is enormous, and it potentially includes most of the relevant information.

The effective capacity of a model that announces 200K tokens tends to be 130K to 140K reliably. For 1-million-token models, the effective capacity for tasks requiring precise retrieval of information distributed across the context may be in the 400-600K range.
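The arithmetic behind those figures can be sketched as follows. The ratios are simply back-calculated from the numbers quoted above (130K-140K out of 200K, 400K-600K out of 1M); the function name, the dictionary, and the 1M cutoff between the two regimes are assumptions made for illustration, not measured properties of any particular model.

```python
# Illustrative ratios back-calculated from the figures in the text.
# They are rough rules of thumb, not benchmarks of a specific model.
EFFECTIVE_RATIOS = {
    "short": (0.65, 0.70),  # ~200K announced -> 130K-140K reliable
    "long": (0.40, 0.60),   # ~1M announced -> 400K-600K reliable
}


def effective_window(announced_tokens: int) -> tuple[int, int]:
    """Return a (low, high) estimate of reliably usable tokens."""
    # Assumption: windows of 1M tokens or more use the "long" ratios.
    regime = "long" if announced_tokens >= 1_000_000 else "short"
    low, high = EFFECTIVE_RATIOS[regime]
    return round(announced_tokens * low), round(announced_tokens * high)


if __name__ == "__main__":
    for announced in (200_000, 1_000_000):
        low, high = effective_window(announced)
        print(f"{announced:>9,} announced -> {low:,} to {high:,} reliable")
```

Treat the output as a planning budget rather than a guarantee: the point is to size prompts against the lower bound, not the headline number.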

That does not make long context useless — but it changes how it should be used.

When Long Context Works Well

Long document analysis with questions about the beginning or end. Processing an 800-page annual financial report and asking questions about the executive summary (beginning) or footnotes (end) works well. Asking questions about specific clauses scattered throughout the document is riskier.

Generation with reference to a corpus. When the model needs to generate text while maintaining consistency with a style or set of facts provided in the context, the exact position of the information matters less — the model uses context as a diffuse reference, not a precise database.

Codebase ingestion. Providing an entire repository in context and asking questions about structure, dependencies, or overall flow works better than retrieving specific lines from files in the middle of the context. For architecture review and high-level analysis, it works.

When Long Context Fails

Precise retrieval of distributed information. If you need the model to find every mention of a specific clause spread across 500 pages of a contract, the long-context model will miss some of them — especially the ones in the middle. For this case, RAG (Retrieval Augmented Generation) with a search index is still more reliable.
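To make the contrast concrete, here is a minimal sketch of the deterministic side of that comparison: an exhaustive scan that reports every page where a pattern occurs. A plain regular-expression scan stands in for a real search index here, and the function name and the page-list input format are assumptions for illustration.

```python
import re


def find_mentions(pages: list[str], pattern: str) -> list[tuple[int, str]]:
    """Return (page_number, line) for every line matching the pattern.

    Unlike attention over a long context, this scan visits every line,
    so a mention on page 250 is found as reliably as one on page 1.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for page_number, page in enumerate(pages, start=1):
        for line in page.splitlines():
            if regex.search(line):
                hits.append((page_number, line.strip()))
    return hits


if __name__ == "__main__":
    contract = [
        "1. Definitions\nThe indemnification clause applies to both parties.",
        "2. Payment\nInvoices are due within 45 days.",
        "3. Liability\nSubject to the Indemnification Clause in section 1.",
    ]
    for page, line in find_mentions(contract, r"indemnification clause"):
        print(page, line)
```

The matched lines, with their page numbers, can then be passed to the model as a short context for interpretation.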

Reasoning across multiple long sources. Comparing two long documents where relevant information is distributed across both requires the model to maintain attention at multiple distant points in the context simultaneously. The effective working memory of models does not scale linearly with context window size.

Full production codebases. Anthropic noted that workflows relying on "put everything in context" hit a practical limit: most enterprise production codebases contain more code than 1-2 million tokens can accommodate. And even within the limit, degradation in the middle compromises analyses that depend on files at the center of the context.
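A quick way to check whether a repository is even a candidate for the "put everything in context" approach is to estimate its token count. The sketch below assumes roughly four characters per token, which is a common rule of thumb and not an exact figure for any tokenizer; the function names and the default file extensions are also illustrative assumptions.

```python
from pathlib import Path

# Assumption: ~4 characters per token. Real tokenizers vary by
# language and content, so treat the result as an order of magnitude.
CHARS_PER_TOKEN = 4


def estimate_repo_tokens(root: str, suffixes: tuple[str, ...] = (".py", ".ts", ".go")) -> int:
    """Estimate the token count of all source files under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


def fits_in_window(token_estimate: int, window_tokens: int) -> bool:
    """True when the estimate does not exceed the given window."""
    return token_estimate <= window_tokens


if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens; fits in 1M window: {fits_in_window(tokens, 1_000_000)}")
```

Comparing the estimate against an effective window rather than the announced one gives a more conservative answer, in line with the degradation described above.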

The Cost of Long Context

There is an economic detail that long-context announcements frequently omit: surcharges for context above certain thresholds.

Anthropic and Google apply surcharges when requests exceed 200K tokens. The surcharge applies to the total request, not just the tokens above the threshold. For a call with 500K tokens of context that would normally cost $2.50, the actual cost can be 2-3x higher depending on current pricing policies.
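The difference between surcharging the whole request and surcharging only the excess is easy to underestimate, so a small worked calculation may help. The $5 per million tokens base rate is implied by the example above ($2.50 for 500K tokens); the 2.0 multiplier is a placeholder within the 2-3x range mentioned, not a published price, and the function names are illustrative.

```python
def flat_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars with no surcharge."""
    return tokens / 1_000_000 * price_per_million


def cost_with_surcharge(tokens: int, price_per_million: float,
                        threshold: int = 200_000, multiplier: float = 2.0) -> float:
    """Cost when crossing the threshold reprices the WHOLE request."""
    base = flat_cost(tokens, price_per_million)
    return base * multiplier if tokens > threshold else base


def cost_if_only_excess_surcharged(tokens: int, price_per_million: float,
                                   threshold: int = 200_000, multiplier: float = 2.0) -> float:
    """Hypothetical comparison: surcharge only the tokens above the threshold."""
    excess = max(0, tokens - threshold)
    return (flat_cost(tokens - excess, price_per_million)
            + flat_cost(excess, price_per_million) * multiplier)


if __name__ == "__main__":
    tokens, price = 500_000, 5.0  # price implied by the $2.50 example
    print(f"no surcharge:        ${flat_cost(tokens, price):.2f}")
    print(f"whole-request 2x:    ${cost_with_surcharge(tokens, price):.2f}")
    print(f"excess-only 2x:      ${cost_if_only_excess_surcharged(tokens, price):.2f}")
```

Under these assumptions the 500K-token call goes from $2.50 to $5.00, whereas it would cost $4.00 if only the 300K tokens above the threshold were surcharged.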

For occasional use, this is not a concern. For production pipelines making hundreds of calls per hour with long contexts, cost can be the difference between product viability and non-viability.

The Alternative: RAG Is Still Relevant

Given the actual behavior of models with long contexts, RAG continues to be relevant in 2026. The reason is not that long context does not work, but that for specific cases RAG works better.

RAG indexes documents externally, retrieves the most relevant passages, and provides only those passages in context. The model receives 2-10K tokens of highly relevant context instead of 500K tokens where the relevant information is diluted. For precise, deterministic information retrieval, the combination of search index + short context outperforms long context + attention-based search.
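The index, retrieve, and prompt steps described above can be sketched in a few lines. This is a toy version under stated assumptions: simple term overlap stands in for a real search index (production systems typically use BM25 or embedding similarity), and all function names and the passage size are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_passages(document: str, max_words: int = 120) -> list[str]:
    """Split a document into fixed-size word windows."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """Return up to k passages ranked by term overlap with the query."""
    query_terms = Counter(tokenize(query))
    scored = []
    for index, passage in enumerate(passages):
        passage_terms = Counter(tokenize(passage))
        score = sum(min(count, passage_terms[term]) for term, count in query_terms.items())
        if score > 0:
            scored.append((score, index, passage))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble a short prompt containing only the retrieved passages."""
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    corpus = [
        "The termination clause requires 30 days notice.",
        "Payment is due within 45 days of invoice.",
        "Either party may invoke the termination clause in writing.",
    ]
    question = "What does the termination clause require?"
    print(build_prompt(question, retrieve(question, corpus, k=2)))
```

The model then sees a prompt of a few hundred tokens in which nearly every sentence is relevant, instead of searching for the same sentences inside a 500K-token context.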

The optimal 2026 approach is not long context OR RAG — it is knowing which to use for which task. Long context for holistic analysis and reference-based generation. RAG for precise retrieval of distributed information.
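That decision rule can be written down as a small routing function. The task categories mirror the ones named above; the enum, the function name, and the extra rule that falls back to RAG when the corpus exceeds the model's effective window are assumptions added for illustration.

```python
from enum import Enum


class Task(Enum):
    HOLISTIC_ANALYSIS = "holistic_analysis"
    REFERENCE_GENERATION = "reference_generation"
    PRECISE_RETRIEVAL = "precise_retrieval"


def choose_strategy(task: Task, corpus_tokens: int, effective_window: int) -> str:
    """Pick "rag" or "long_context" for a task and corpus size."""
    # Precise retrieval of distributed information goes to RAG.
    if task is Task.PRECISE_RETRIEVAL:
        return "rag"
    # Assumption: a corpus larger than the effective window cannot be
    # handled reliably in one pass, so fall back to retrieval.
    if corpus_tokens > effective_window:
        return "rag"
    return "long_context"


if __name__ == "__main__":
    print(choose_strategy(Task.HOLISTIC_ANALYSIS, 300_000, 400_000))
    print(choose_strategy(Task.PRECISE_RETRIEVAL, 300_000, 400_000))
```

In practice the two strategies can also be combined, for example by retrieving passages first and then giving the model a moderately long context built from them.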

The Number That Actually Matters

Instead of focusing on the announced maximum window, the relevant question is: what is the model's effective window for the specific task you need to execute?

A model with 128K of reliably usable context can outperform a model that announces 1M but is reliable only up to about 300K, depending on the use case, for instance when the task fits within 128K and depends on information in the middle. The context window benchmarks that measure retrieval from the middle, not just from the beginning and end, are the ones that reveal actual capability.
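A benchmark of this kind can be approximated in-house with a small harness that plants a known fact at different depths and checks whether the model returns it. The sketch below is model-agnostic: `ask_model` is a hypothetical callable you would wrap around whatever API you use, and the function names, depth values, and substring-based scoring are all assumptions.

```python
from typing import Callable, Sequence


def build_haystack(filler: Sequence[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth: 0.0 start, 0.5 middle, 1.0 end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(len(filler) * depth)
    lines = list(filler[:position]) + [needle] + list(filler[position:])
    return "\n".join(lines)


def accuracy_by_depth(
    ask_model: Callable[[str], str],  # hypothetical wrapper around a model API
    filler: Sequence[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
    trials: int = 1,
) -> dict[float, float]:
    """Return the fraction of correct answers for each needle depth."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, depth) + "\n\n" + question
        # An answer counts as correct if it contains the expected string.
        correct = sum(expected in ask_model(prompt) for _ in range(trials))
        results[depth] = correct / trials
    return results
```

Running this with filler sized to the context lengths you actually use, and plotting accuracy against depth, shows where the effective window ends for your task.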

The context window race continues. But in 2026, the metric that matters is needle-in-a-haystack performance at the center of the context — not the number in the press release headline.