Context Engineering: The Discipline Nobody Is Taking Seriously
There’s a war being fought in AI circles right now between the RAG camp and the long-context camp. The RAG people say retrieval is fast, cheap, and targeted. The long-context people say just throw everything at the model and stop overthinking it. Both camps are wrong in the same way: they’re treating a strategy problem like a tool selection problem.
The real question isn’t “RAG or long context?” It’s: do you have a coherent strategy for what information you’re giving the model, when, in what form, and at what cost?
Most teams don’t. That’s the actual problem.
This discipline has a name now — context engineering — and it’s about to become the skill that separates teams shipping reliable LLM applications from teams perpetually debugging hallucinations and wondering why their agent keeps confidently doing the wrong thing.
What Context Engineering Actually Means
Context engineering is not a new framework. It’s not a library you install. It’s the set of decisions you make about how to populate the prompt — and crucially, how to make those decisions systematically rather than through vibes and trial and error.
Every LLM call you make has a context window. That window is finite, has a cost per token, and has a structure. What goes in that window — and what doesn’t — determines whether your application works or produces expensive, confident garbage.
The decisions are:
- What information does this call actually need? Not what might be useful. What is necessary.
- Where does that information come from? Static? Retrieved? Computed? Remembered?
- When is the information fresh enough? Can you cache it? Does it need to be fetched per-call?
- What is the cost profile? Are you paying for a million tokens of context on every request, or are you paying for one precise retrieval?
- Where in the context does it land? Because this matters more than you think.
If you can’t answer all five of these for every major LLM call in your application, you’re not doing context engineering — you’re just hoping.
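As a sketch, those five decisions can be captured as a per-call record that you fill out for every major LLM call — the class and field names below are illustrative, not a real library:

```python
from dataclasses import dataclass, fields

@dataclass
class ContextPlan:
    """One record per major LLM call: the five context-engineering decisions."""
    necessary_info: str   # what this call actually needs (not "might be useful")
    source: str           # static / retrieved / computed / remembered
    freshness: str        # cacheable? per-call fetch? TTL?
    cost_profile: str     # tokens per request, retrieval cost
    placement: str        # where in the context it lands

def is_fully_specified(plan: ContextPlan) -> bool:
    """True only if every one of the five decisions has a non-empty answer."""
    return all(getattr(plan, f.name).strip() for f in fields(plan))
```

If `is_fully_specified` returns `False` for a call, that call is running on hope, not engineering.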
The Cost Reality Is Brutal
Let’s talk numbers, because the gap is genuinely alarming.
A RAG-powered query — embedding the question, retrieving a relevant chunk, inserting it into a short prompt — costs roughly $0.00008 per query. A long-context call stuffing a 200k-token document into the context window costs around $0.10 per query. That’s a 1,250x cost difference.
At 10,000 queries a day, that’s $0.80 vs $1,000. Per day.
The long-context advocates will tell you the precision is worth it. Sometimes it is. But often they haven’t done the math, or they built their prototype at low traffic and never modelled what happens at scale. Then the AWS bill shows up and suddenly everyone is “reconsidering the architecture.”
Speed is just as stark. A RAG retrieval takes roughly 1 second end-to-end. A 200k-token context call averages around 45 seconds. You can’t hide 45 seconds of latency. You can’t build a responsive application on it. Streaming helps the perception of latency, not the actual time to first useful token.
This doesn’t mean long context is wrong. It means it’s a tool for specific scenarios — not a default strategy.
What Stanford Found That Nobody Wants to Talk About
The “Lost in the Middle” paper from Stanford is required reading for anyone building LLM applications. The finding: model performance degrades significantly when the relevant information is positioned in the middle of a long context window.
The models are good at using information near the beginning and the end of the context. Information buried in the middle? They start to miss it. The performance drop is over 30% in some configurations.
Think about what this means in practice. You’ve got a 128k-token context. You throw in a dense codebase, some documentation, a system prompt, conversation history, and the user’s question. The answer lives somewhere in the middle of that mass. The model might just… not find it. Or find it less reliably than if you’d retrieved it directly and put it at the top of a 4k-token prompt.
Bigger context windows are a technical achievement. They are not a substitute for thinking about where you put things within them.
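One way to act on the positioning finding is a deliberate assembly order — a minimal sketch, assuming your chunks arrive pre-ranked by relevance:

```python
def assemble_context(system: str, ranked_chunks: list[str], question: str) -> str:
    """Place the highest-ranked chunk right after the system prompt and the
    question at the very end, keeping the critical pieces at the edges of the
    window where models attend most reliably. Lower-ranked chunks fill the
    middle, where a miss is least damaging."""
    best, rest = ranked_chunks[0], ranked_chunks[1:]
    return "\n\n".join([system, best, *rest, question])
```

The design choice is simple: whatever you can least afford the model to miss goes at an edge, and the middle becomes the graveyard for nice-to-have material.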
The Four Primitives
Context engineering uses four primitives, and the skill is knowing when to reach for each one.
Retrieval (RAG). Fetch semantically relevant information at query time. Good for: large knowledge bases, freshness requirements, cost-sensitive applications. Bad for: when the query doesn’t map cleanly to a retrievable chunk (reasoning questions, multi-hop tasks).
Long context. Pass a large, structured document or codebase directly. Good for: complex reasoning over a single coherent artifact, code review, document analysis. Bad for: anything you’ll do at scale, anything requiring sub-second response times, anything where the budget matters.
Structured memory. Persistent, queryable state maintained across turns or sessions. Not just “I’ll stuff the conversation history in.” Actual summaries, entity extractions, preference records that grow intelligently rather than ballooning. Good for: agents with persistent state, personalisation, long-running tasks. Bad for: stateless APIs where you’d pay for memory you never use.
Tool calls. Real-time data retrieval and action execution, scoped precisely to what’s needed. Good for: live data (stock prices, weather, current code state), side effects. Bad for: anything that should be pre-computed or cached, anything where you can’t tolerate latency.
The mistake most teams make: they pick one of these and use it everywhere. RAG teams over-retrieve and under-reason. Long-context teams over-stuff and under-filter. Agent teams throw every tool at every call regardless of whether it’s needed.
A mature context strategy uses all four, in deliberate combination, based on the characteristics of each call.
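As an illustration only, the routing decision can be sketched as a function of a call's characteristics — real systems combine primitives per call rather than picking exactly one:

```python
from enum import Enum, auto

class Primitive(Enum):
    RAG = auto()
    LONG_CONTEXT = auto()
    MEMORY = auto()
    TOOL_CALL = auto()

def choose_primitive(*, needs_live_data: bool,
                     whole_artifact_reasoning: bool,
                     cross_session_state: bool) -> Primitive:
    """Illustrative default routing for a single information need."""
    if needs_live_data:
        return Primitive.TOOL_CALL       # stock prices, git state, weather
    if whole_artifact_reasoning:
        return Primitive.LONG_CONTEXT    # one coherent document or codebase
    if cross_session_state:
        return Primitive.MEMORY          # preferences, session summaries
    return Primitive.RAG                 # the cost-efficient default
```

Note the ordering encodes a bias: reach for the expensive primitive only when a cheaper one can't do the job.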
What a Real Context Strategy Looks Like
Here’s a concrete example. You’re building a developer assistant that answers questions about a company’s internal codebase.
Naive approach: Load the entire codebase into context on every query. 500k tokens per call. Slow. Expensive. Fails silently when the answer lands in the middle of the context window.
Naive approach: Load the entire codebase into context on every query. 500k tokens per call. Slow. Expensive. Fails silently when the answer lands in the middle of the context.
RAG-only approach: Embed every file, retrieve the top-5 chunks. Fast. Cheap. But if the question requires understanding how three different modules interact, the retrieved chunks are disconnected fragments and the model reasons badly.
Context-engineered approach:
- Use a short system prompt with the repo structure (static, cached).
- Retrieve the 3-5 most relevant files or functions via semantic search.
- For the specific file the user is asking about, inject the full file directly.
- Use a persistent memory layer to remember what the user has already asked and what files have been relevant in this session.
- Tool calls for any live state: current git status, recent commits, running tests.
Same question. Much better answer. Orders of magnitude cheaper. Predictable latency.
This isn’t complicated. But it requires thinking about it rather than defaulting to whichever primitive is most convenient.
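The layered approach above can be sketched as a single assembly function. The section headers and parameter names here are hypothetical, but the ordering is the point: cached static material first, retrieved and live material in the middle layers, the question last:

```python
def build_context(repo_tree: str, retrieved_files: list[str], focus_file: str,
                  session_memory: str, tool_results: str, question: str) -> str:
    """Assemble the developer-assistant context in a fixed, deliberate order."""
    sections = [
        "## Repository structure\n" + repo_tree,            # static, cached
        "## Relevant files\n" + "\n".join(retrieved_files), # semantic search hits
        "## Focus file (full contents)\n" + focus_file,     # file user asked about
        "## Session memory\n" + session_memory,             # prior questions/files
        "## Live state\n" + tool_results,                   # git status, tests
        "## Question\n" + question,                         # at the edge, not buried
    ]
    return "\n\n".join(sections)
```

Each layer has a different cost profile — the repo tree is computed once and cached, retrieval runs per query, tool calls run only when live state is actually needed.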
The MCP Angle
Model Context Protocol is relevant here — not because it solves context engineering, but because it makes the retrieval and tool-call primitives composable. MCP gives you a standardised interface for pulling in external data sources and capabilities without hard-coding each integration.
But here’s the trap: MCP can make over-retrieval easier. If you’ve got 20 MCP servers registered and your agent calls all of them on every turn to “be thorough,” you’ve just automated a context-stuffing disaster. The protocol doesn’t enforce discipline. You still need to decide which tools to call and when.
MCP is infrastructure. Context engineering is strategy. You need both.
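What that discipline might look like in practice: a tool-selection gate that scores registered servers against the query and calls only the best matches. This keyword-overlap version is a stand-in for the embedding- or planner-based scoring a real agent would use:

```python
def select_tools(query: str, tool_keywords: dict[str, set[str]],
                 max_tools: int = 3) -> list[str]:
    """Score each registered tool by keyword overlap with the query and call
    only the top matches with a nonzero score — never every server, every turn."""
    words = set(query.lower().split())
    scored = [(len(kw & words), name) for name, kw in tool_keywords.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:max_tools] if score > 0]
```

Even a crude gate like this prevents the "call all 20 servers to be thorough" failure mode; the sophistication of the scorer matters less than the existence of the gate.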
Why This Is Hard
The reason context engineering hasn’t been taken seriously is that it’s invisible when it’s working. Good context decisions feel like “the model understood the question.” Bad context decisions feel like “the model is dumb” — even when the model is fine and the prompt is just poorly constructed.
Engineers debug model outputs, not context strategies, because the model is the observable thing. The context is invisible until you start measuring.
Once you start measuring — cost per query, latency distributions, retrieval precision, answer quality by context length — the picture changes fast. You start seeing exactly which calls are over-contextualised and which are under-contextualised. You start seeing the correlation between context structure and output quality. You start making deliberate tradeoffs instead of accidental ones.
The teams that will win with LLM applications aren’t the ones with the best models. They’re the ones who instrument their context pipelines, measure what actually works, and iterate on information architecture with the same rigour they apply to everything else.
The Practical Starting Point
You don’t need to redesign everything. Start here:
- Audit your three most expensive LLM calls. For each one: what’s in the context? How big is it? Does all of it actually need to be there?
- Measure retrieval precision. If you’re using RAG, what percentage of retrieved chunks are actually relevant to the answer? If it’s below 60%, your embedding strategy or chunking strategy is broken.
- Profile latency by context length. Plot your p50 and p95 response times against input token count. You’ll immediately see where your bottlenecks are.
- Add position discipline. The most important information goes at the top or bottom of the context. System instructions, the user’s question, the most relevant retrieved chunk — these should never be buried in the middle of a massive context block.
- Build a memory layer before you need one. Don’t wait until your conversation history is 50k tokens before thinking about summarisation. Design the summarisation strategy into the architecture from the start.
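The retrieval-precision check above takes only a few lines, assuming you have relevance labels from human review or an LLM judge:

```python
def retrieval_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that actually supported the answer.
    Relevance labels are assumed to come from human review or an LLM judge."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)
```

Run it over a sample of real queries; a number below 0.6 is the signal that your chunking or embedding strategy needs work before anything else does.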
None of this requires new models, new frameworks, or new infrastructure. It requires treating context as a first-class engineering concern — not an afterthought you’ll “optimise later” that never gets optimised.
“Later” is here. Your competitors are either figuring this out now or they’re about to pay for not figuring it out. Either way, context engineering is no longer optional.