Minimal. Intelligent. Agent.
Building with code & caffeine.

The Model Isn't the Problem. Your Evals Are.

Your AI feature worked in the demo. It worked in staging. Then it shipped to production, and within a week you're drowning in support tickets because it gives confidently wrong answers at random. You swap the model. Same outcome. You tune the prompt. Better, but still broken in ways you can't predict.

The model isn’t the problem.

You don’t have evals.


What Evals Actually Are (And Aren’t)

An eval is not a unit test. It’s not assert output === "Paris". LLMs are probabilistic — asking the same question twice can yield different answers, and both can be correct. Traditional deterministic testing breaks down immediately.

An eval is a systematic measurement of quality over a distribution of inputs.

It answers: “Across 200 realistic queries, how often does this system behave the way we intend?” Not once, not in one cherry-picked demo, but consistently, at scale, across the messy surface area of real use.

There are three flavors:

1. Deterministic evals — for outputs with a ground truth. Classification, entity extraction, structured data parsing. If you’re extracting names from text, you can check precision and recall against a labeled dataset. This is the easy case. Most teams only do this one.
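For the entity-extraction case, a deterministic scorer really is just set arithmetic. A minimal sketch, assuming gold labels are stored as sets of names:

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Per-example precision and recall; two empty sets score 1.0 by convention."""
    if not predicted and not gold:
        return 1.0, 1.0
    tp = len(predicted & gold)  # true positives: predictions that match gold exactly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# One labeled example from a hypothetical golden set.
p, r = precision_recall({"Ada Lovelace", "Turing"}, {"Ada Lovelace", "Alan Turing"})
# p == 0.5 (one of two predictions correct), r == 0.5 (one of two gold names found)
```

Note that "Turing" and "Alan Turing" don't match under exact comparison; whether you want fuzzy matching is itself a decision the labeling process forces you to make.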

2. Heuristic evals — for outputs with soft constraints. “The response must not exceed 200 tokens.” “The answer must not mention competitors.” “The code must be syntactically valid.” These are fast, cheap, and catch a surprising proportion of regressions. Write more of them than you think you need.
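Each of those three constraints is a few lines of code. A sketch, where the token budget uses word count as a rough proxy and the banned-names list is hypothetical:

```python
import ast

MAX_WORDS = 200           # word count as a cheap stand-in for a token budget
BANNED = {"acmecorp"}     # hypothetical competitor names, lowercased

def within_budget(text: str) -> bool:
    """Soft length constraint: response must not exceed the budget."""
    return len(text.split()) <= MAX_WORDS

def no_competitors(text: str) -> bool:
    """Response must not mention any banned name."""
    return not any(name in text.lower() for name in BANNED)

def valid_python(code: str) -> bool:
    """Generated code must at least parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

None of these need a model call, so they run in milliseconds on every change.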

3. LLM-as-judge evals — for outputs with no clear ground truth. “Is this response helpful? Accurate? Appropriately toned?” You use a second LLM to score the output of your first LLM. Yes, it sounds circular — but calibrated judge prompts against human-labeled reference data achieve 80–90% agreement with human raters. That’s good enough to catch regressions, even if it’s not perfect.
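The judge pattern is a prompt template plus defensive parsing of the reply. A sketch in which `call_llm` is a hypothetical stand-in for your provider's completion call:

```python
JUDGE_PROMPT = """Rate the answer below for helpfulness on a 1-5 scale.
Reply with a single digit and nothing else.

Question: {question}
Answer: {answer}"""

def judge_score(question: str, answer: str, call_llm) -> int:
    """Score an output with a second LLM. `call_llm` takes a prompt string and
    returns the model's raw text reply (hypothetical interface)."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0  # fail closed on unparseable replies
```

Calibration means running this judge over your human-labeled reference set and adjusting the rubric until its scores track the human ones.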

Most teams build none of these. They ship, observe, and react. That’s not engineering — that’s managed chaos.


Why Teams Skip Evals (And What It Actually Costs)

The usual excuses:

“We don’t have labeled data.” “It takes too long to build.” “The model is changing anyway so evals will be stale.”

These are rationalizations, not reasons.

You don’t need a perfect labeled dataset to start. You need 50 representative inputs, a few expected behaviors, and a scoring function that captures what you care about. A weekend of work. The cost of skipping it is weeks of production firefighting, blind model upgrades, and prompt changes that fix one thing and silently break three others.

The “model is changing” argument is backwards. Evals are precisely how you know whether a model upgrade is safe. Without them, every upgrade is a prayer.

The real reason teams skip evals is that they’re treated as a nice-to-have for after launch. This is the same mistake teams made with unit tests in 2005. It took a decade for TDD to go mainstream because the up-front cost felt high. The industry learned. AI is repeating the cycle at speed.


A Practical Eval Stack That Doesn’t Suck

You don’t need a $200k MLOps platform. Here’s what actually works:

Step 1: Build a golden dataset. Start with 50–100 real production inputs (or synthetic ones that mirror the real distribution). Label the expected outputs, behaviors, or quality dimensions. This is the hardest part — do it anyway. The act of labeling forces you to define what “good” means, which is valuable even before the first eval runs.
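A golden set doesn't need infrastructure; a JSONL file in the repo works fine to start. A sketch with two hypothetical entries showing the shape:

```python
import json

# Hypothetical golden-set entries: an input, the expectation, and which
# scorers apply. The field names are illustrative, not a standard schema.
cases = [
    {"input": "What is the capital of France?",
     "expected": "Paris",
     "checks": ["correctness", "format"]},
    {"input": "Summarize our refund policy.",
     "expected_behavior": "cites the policy doc, stays under 150 tokens",
     "checks": ["quality"]},
]

with open("golden.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```

One line per case keeps diffs reviewable, which matters once the set lives in version control next to your prompts.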

Step 2: Write your scorers. For each quality dimension:

  • Correctness → exact match or fuzzy match against ground truth
  • Format compliance → regex, schema validation, token length check
  • Quality → LLM judge with a calibrated rubric prompt

Keep scorers in version control alongside your prompts and application code. They’re part of the system.

Step 3: Run evals on every prompt change. Not once a week. Not before a release. On every meaningful change to your prompts, retrieval logic, or model config. This is where CI comes in — eval runs should be as automatic as linting.
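The CI hook can be as simple as a pass-rate threshold turned into an exit code. A minimal sketch; the 90% threshold is an arbitrary assumption, and `scorer` is whatever check applies to each case:

```python
THRESHOLD = 0.90  # minimum pass rate before a change may merge (assumption)

def run_suite(cases: list, scorer) -> float:
    """Run every golden case through a scorer; return the pass rate."""
    passed = sum(1 for case in cases if scorer(case))
    return passed / len(cases)

def ci_gate(pass_rate: float) -> int:
    """Exit code for CI: 0 lets the pipeline continue, nonzero fails the build."""
    return 0 if pass_rate >= THRESHOLD else 1
```

Wire `ci_gate`'s return value into `sys.exit` in a script your pipeline runs, and a quality regression blocks a merge exactly the way a lint error does.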

Step 4: Track metrics over time. You need a dashboard showing score trends across versions. A regression from 94% to 87% on “answer relevance” should stop a deploy the same way a failing test stops a build. Tools like LangSmith, Braintrust, and PromptFoo handle this — or you can write eval results to a Postgres table and build a 10-line dashboard. Either way, you need the number.
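The roll-your-own version is one table and one query. A sketch using SQLite as a stand-in for the Postgres table mentioned above:

```python
import sqlite3

# In-memory SQLite stands in for a real Postgres table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE eval_runs (
    version TEXT,
    metric  TEXT,
    score   REAL)""")

def record(version: str, metric: str, score: float) -> None:
    """Log one eval result for one system version."""
    conn.execute("INSERT INTO eval_runs VALUES (?, ?, ?)",
                 (version, metric, score))

def trend(metric: str) -> list[tuple[str, float]]:
    """Score per version, in insertion order -- the '10-line dashboard'."""
    return conn.execute(
        "SELECT version, score FROM eval_runs WHERE metric = ? ORDER BY rowid",
        (metric,)).fetchall()

record("v41", "answer_relevance", 0.94)
record("v42", "answer_relevance", 0.87)  # the regression this query makes visible
```

The point isn't the storage engine; it's that the 94% → 87% drop exists as a queryable number instead of a vibe.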

Step 5: Expand the dataset continuously. Every production failure is a new eval case. When users complain, add their query to the golden set. Your eval suite should grow as your system learns what production looks like.


The Uncomfortable Truth About LLM Quality

Here’s what most teams don’t want to hear: there is no “fixed” prompt.

Prompts decay. The same prompt that scored 92% in January scores 81% in March because the underlying model was silently updated, your users shifted their input patterns, or adjacent features changed the retrieval context. LLMs in production are not static artifacts — they’re living systems subject to distributional shift.

The only teams that catch this drift early are the ones with evals running continuously. Everyone else finds out from Slack.

This is also why “just upgrade to the new model” advice is dangerous without evals. Newer models are often better in aggregate benchmarks and worse on your specific task. o3 is brilliant at mathematical reasoning and noticeably worse than GPT-4o at certain instruction-following tasks in constrained domains. Claude 3.7’s extended thinking is extraordinary for complex coding problems and frustratingly verbose for simple Q&A. Model upgrades need to be validated against your eval suite, not assumed safe because the benchmark number went up.


Evals Are a Forcing Function for Clarity

There’s a second-order benefit nobody talks about: writing evals forces you to understand what your system is supposed to do.

Vague product requirements (“make it helpful and accurate”) cannot be eval’d. To write a scorer, you must define what “helpful” means in context. Is it answering in under 150 tokens? Citing sources? Staying in scope? The act of turning fuzzy requirements into measurable criteria is clarifying in a way that no amount of prompt iteration achieves.
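Those three candidate definitions can be pinned down as one composite scorer. A sketch where the criteria and the `[source:` citation convention are illustrative assumptions, not a standard:

```python
def is_helpful(answer: str, allowed_topics: set[str]) -> bool:
    """'Helpful', made measurable: concise, sourced, and in scope.
    All three criteria are illustrative choices, not a standard."""
    concise = len(answer.split()) <= 150        # word count as a rough token proxy
    cited = "[source:" in answer                 # hypothetical citation marker
    in_scope = any(t in answer.lower() for t in allowed_topics)
    return concise and cited and in_scope
```

Whether these are the right three criteria is a product decision, and that's exactly the point: the scorer makes the decision explicit instead of leaving it buried in a vague requirement.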

Teams that invest in evals early tend to write better prompts — not because evals teach you prompting, but because they force you to think precisely about the goal. Precision in the spec produces precision in the output.


Conclusion

The benchmark arms race between labs is impressive. o3, R1, Claude 3.7, Gemini 2.0 — they’re all getting better at a pace that would have seemed absurd three years ago. But benchmark performance at the lab doesn’t transfer directly to production quality on your task. The gap between “the model can do this” and “your system reliably does this” is not filled by a better model. It’s filled by evals.

You wouldn’t ship backend code without tests and call it engineering. Shipping AI features without evals isn’t engineering either — it’s optimistic deployment and reactive support. The industry will learn this the hard way, same as it always does.

You don’t have to wait for the hard way.


P.S. If you’re starting from zero: build 50 test cases this week. Write one deterministic scorer and one heuristic scorer. Run them on your current system and record the baseline. That’s it. The habit matters more than the tooling. The tooling can catch up.