
1M Token Context Windows Are Hiding Your Architecture Problems

There’s a drug being handed out for free in the AI tooling space, and it’s called the million-token context window. Claude, Gemini, and the latest frontier models all support it. Vendors advertise it like headroom is the only metric that matters. Teams are using it to avoid doing the hard thing — designing systems that know what information they actually need.

The result is production AI systems that are slower, more expensive, and less accurate than they should be. And the teams building them don’t know it yet because the demo worked.

The Pitch Is Seductive

The implicit promise of a 1M token context window is: stop worrying about retrieval, just throw everything in. Your entire codebase. Every conversation. All the docs. The model will figure it out.

And sometimes it does. In demos. In synthetic evals. In happy-path testing where the answer is somewhere near the top of your dump.

In production, it falls apart.

What Actually Happens at Scale

Context rot is real and measurable. Research across multiple model families consistently shows quality degradation as input size grows. The most famous failure mode is “lost in the middle” — models attend strongly to the beginning and end of a long prompt and become unreliable at recalling content buried in the middle. You can build a benchmark that puts the right answer at position 600k and watch your model fail it.
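A benchmark like that is simple to sketch. The snippet below builds a needle-in-a-haystack probe: a known fact buried at a chosen depth inside filler text, with a check on whether the model's answer recovers it. `call_model` would be your own LLM client, and the word-based filler is a rough stand-in for real token counting.

```python
# Minimal needle-in-a-haystack probe. Bury a known fact ("the needle")
# at a chosen depth inside filler, then check whether the model can
# recall it. Word counts stand in for token counts here; a real
# harness would use the model's actual tokenizer.

def build_probe(needle: str, depth_words: int, total_words: int) -> str:
    """Return a prompt with `needle` buried `depth_words` in."""
    before = "filler " * depth_words
    after = "filler " * max(0, total_words - depth_words)
    return before + needle + " " + after

def recalled(model_answer: str, expected: str) -> bool:
    """Did the model's answer surface the buried fact?"""
    return expected.lower() in model_answer.lower()

# Usage sketch (call_model is your client, not defined here):
#   prompt = build_probe("The code word is mango.", 600_000, 1_000_000)
#   print(recalled(call_model(prompt + "\nWhat is the code word?"), "mango"))
```

Sweep the depth from the start of the prompt to the end and you get the characteristic U-shaped recall curve: strong at the edges, weak in the middle.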

But even before you hit that specific failure, you’ve already accepted a cost multiplier that compounds hard in agentic systems.

A 1M token context in a single-shot query is expensive. In an agent loop that re-sends context with every tool call — 10, 20, 50 times per session — it becomes economically indefensible at any meaningful scale. You’re paying to process information the model doesn’t need, over and over, and getting degraded output in return.
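The arithmetic is worth doing explicitly. Using an illustrative price of $3 per million input tokens (not any vendor's actual rate):

```python
# Back-of-envelope input cost when the full context is re-sent on
# every tool call in an agent loop. The price is an illustrative
# assumption, not a real vendor rate.

def session_input_cost(context_tokens: int, tool_calls: int,
                       price_per_mtok: float) -> float:
    """Total input cost in dollars for one agent session."""
    return context_tokens * tool_calls * price_per_mtok / 1_000_000

# A 1M-token context re-sent across a 20-call session:
full_dump = session_input_cost(1_000_000, 20, 3.0)  # $60.00 per session

# The same session with a focused 10k-token retrieved context:
focused = session_input_cost(10_000, 20, 3.0)       # $0.60 per session
```

Two orders of magnitude per session, before you account for the quality degradation the larger context also brings.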

The Architecture You’re Avoiding

When you lean on a giant context window, you’re usually avoiding one of these:

A proper retrieval layer. RAG gets dismissed as “old school” now that context windows are huge, but focused retrieval consistently outperforms context stuffing for factual recall tasks. Selecting the right 10k tokens beats throwing in 500k every time the information has any density at all. The retrieve-then-solve pattern has been shown to more than double accuracy on tasks where naive context stuffing struggles. That’s not a marginal win. That’s the difference between a feature that ships and one that quietly gets rolled back after users complain.
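The retrieve-then-solve pattern fits in a few lines. This sketch scores chunks with naive term overlap purely to stay self-contained; a real system would use BM25 or embeddings, and `call_model` is a placeholder for your LLM client.

```python
# Retrieve-then-solve sketch: score candidate chunks against the
# query, send only the top-k to the model. Term-overlap scoring is a
# deliberate simplification; swap in BM25 or embeddings in practice.

def score(query: str, chunk: str) -> float:
    """Fraction of query terms that appear in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / (len(q_terms) or 1)

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Top-k chunks by relevance score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def answer(query: str, chunks: list[str], call_model) -> str:
    """Build a focused prompt from retrieved chunks and solve."""
    context = "\n---\n".join(retrieve(query, chunks))
    return call_model(f"Context:\n{context}\n\nQuestion: {query}")
```

The point isn't the scoring function; it's that selection happens at all, so the model sees 10k relevant tokens instead of 500k mixed ones.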

A memory architecture. Long conversations don’t need every prior message verbatim. They need a compressed, structured representation of state — decisions made, facts established, open questions. Building that forces you to think about what matters. Context stuffing lets you pretend nothing needs to be thrown away. It does.
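One way to make that concrete is a small structured state record rendered into a compact block, sent in place of the raw transcript. The field names here are illustrative, not a prescribed schema.

```python
# Structured session memory instead of a verbatim transcript:
# decisions made, facts established, open questions. The render()
# output stays a few hundred tokens no matter how long the
# conversation runs.

from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    decisions: list[str] = field(default_factory=list)
    facts: dict[str, str] = field(default_factory=dict)
    open_questions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Compressed state block to prepend to the next prompt."""
        lines = ["## Decisions", *self.decisions,
                 "## Facts", *(f"{k}: {v}" for k, v in self.facts.items()),
                 "## Open questions", *self.open_questions]
        return "\n".join(lines)

mem = SessionMemory()
mem.decisions.append("Use Postgres for the event store")
mem.facts["deploy_target"] = "us-east-1"
mem.open_questions.append("Retention policy for raw events?")
```

Deciding what goes into `decisions` versus what gets dropped is exactly the design work context stuffing lets you skip.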

A tool design problem. Agents that need to read 200k tokens of documentation on every call are compensating for not having the right tools. If your agent retrieves a full API spec every time it needs to make an HTTP call, the fix isn’t a bigger context window — it’s an API lookup tool that returns exactly the endpoint signature it needs.
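A lookup tool of that shape can be very small. The spec dictionary below is a hypothetical stand-in for a parsed OpenAPI document; in a real agent this function would be registered as a callable tool with your framework.

```python
# Sketch of an endpoint-lookup tool: the agent asks for one endpoint
# by name and gets back a few dozen tokens, instead of having the
# entire API spec injected into context. API_SPEC is a hypothetical
# stand-in for a parsed OpenAPI document.

API_SPEC = {
    "create_user": "POST /v1/users {name: str, email: str} -> 201 User",
    "get_user":    "GET /v1/users/{id} -> 200 User | 404",
}

def lookup_endpoint(name: str) -> str:
    """Return a single endpoint signature, or a miss message."""
    return API_SPEC.get(name, f"unknown endpoint: {name}")
```

The agent's context now grows by one signature per HTTP call it plans, not by the 200k-token spec per loop iteration.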

Why This Matters More Than It Used To

Eighteen months ago, context window size was a genuine constraint. At 8k or 32k tokens, you had to be surgical about what went in. That constraint forced good habits. Teams built retrieval pipelines. They thought about information architecture. The constraint was load-bearing.

Now the constraint is gone, and so are the habits it enforced.

The teams shipping the most reliable production AI systems today are not the ones with the biggest context windows. They’re the ones treating context as a scarce resource even when it isn’t — explicitly filtering, ranking, and pruning before anything hits the model. They’re building systems where the answer to “what does the model need to know?” is a deliberate design decision, not a default dump.
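The filter-rank-prune step described above can be as simple as a greedy budget pass: rank candidate snippets by relevance, then admit them until a token budget runs out. The four-characters-per-token estimate is a rough heuristic, not a tokenizer; substitute your model's real one.

```python
# Treating context as a budget: admit the highest-scored snippets
# greedily until the token budget is spent. Token estimation here is
# a crude chars/4 heuristic; use the model's actual tokenizer in
# production.

def estimate_tokens(text: str) -> int:
    """Rough token count: ~4 characters per token."""
    return max(1, len(text) // 4)

def prune_to_budget(snippets: list[tuple[float, str]],
                    budget_tokens: int) -> list[str]:
    """Keep best-scored snippets that fit inside the budget.

    `snippets` is a list of (relevance_score, text) pairs.
    """
    kept, used = [], 0
    for _score, text in sorted(snippets, reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept
```

Whatever the ranking signal, the decision of what the model sees is now explicit and testable, rather than "everything we had lying around."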

How to Think About It

Context windows are a capability, not a strategy. Use them when you genuinely don’t know what subset of information is relevant and the task is one-shot. Use them when latency and cost aren’t constraints. Use them when you’re prototyping and want to defer the retrieval problem.

Don’t use them as a substitute for knowing what your system needs. Don’t use them in agentic loops where cost compounds. Don’t use them and then wonder why your agent hallucinates details that were technically in the context somewhere around token 400k.

The practical test: if you can name every piece of information your system reliably needs, retrieve it. If you genuinely can't narrow it down, dig into why. Nine times out of ten, you can narrow it down — you just haven't tried.

The Real Problem

Giant context windows are doing the same thing to AI system design that “just add another microservice” did to backend architecture in 2018. They’re a solution that looks scalable until suddenly it doesn’t, and by then you’ve built your entire system around the assumption that it would.

The teams that will have the most maintainable, cost-efficient, and accurate AI systems in two years are the ones building the right retrieval and memory primitives now. Not because context windows won’t get bigger — they will — but because systems that know what they need will always outperform systems that just grab everything and hope.

A million tokens isn’t infinite. It just feels that way until you’re in production.