Your Model Is Quietly Changing Underneath You

At some point in the last six months, something changed in your AI-powered feature. Outputs that used to be consistent started varying. A prompt that reliably produced structured JSON started occasionally producing prose. A classification task that was working fine started getting edge cases wrong. You checked your code — nothing changed. You checked your infrastructure — healthy. You filed it under “LLM behaviour” and moved on.

What actually happened: your model provider updated the model you were calling, silently, under an alias you didn’t know was mutable.

This is not a hypothetical. It’s the most underappreciated reliability problem in production AI systems today, and almost no one is handling it correctly.

The Alias Problem

When you call gpt-4o or claude-3-5-sonnet-20241022 or gemini-1.5-pro, you’re often not calling a fixed artifact. You’re calling an alias that resolves to whatever version the provider has decided is current. Providers update these aliases — for bug fixes, safety improvements, capability enhancements — without guaranteeing identical behaviour on existing prompts.

Some providers offer dated version identifiers that are more stable. Some offer explicit deprecation windows with advance notice. Some don’t. The contract you have with the model is weaker than the contract you have with almost any other infrastructure dependency in your stack.

You version your database schema. You version your APIs. You pin your Docker base images. You lock your npm dependencies. Then you call an unversioned model endpoint and wonder why behaviour changes.
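The same discipline can be applied to model identifiers. A minimal sketch, assuming a hypothetical `PINNED_MODELS` config of your own making — the model names are illustrative, and the set of mutable aliases is whatever your providers actually publish:

```python
# Sketch: treat the model identifier like any other pinned dependency.
# Model names are illustrative; check your provider's docs for the
# exact dated identifiers they publish.

PINNED_MODELS = {
    "summarize": "gpt-4o-2024-08-06",        # dated snapshot, not the bare alias
    "classify": "claude-3-5-sonnet-20241022",
}

# Bare aliases that silently track "latest" — reject these in review.
MUTABLE_ALIASES = {"gpt-4o", "gpt-4", "claude-3-5-sonnet", "gemini-1.5-pro"}

def resolve_model(task: str) -> str:
    """Return the pinned model id for a task, refusing mutable aliases."""
    model = PINNED_MODELS[task]
    if model in MUTABLE_ALIASES:
        raise ValueError(f"{task!r} is pinned to a mutable alias: {model}")
    return model
```

A check like this can also run in CI, so a bare alias never makes it past review.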

Why It’s Hard to Detect

The insidious part is that model behaviour changes don’t look like outages. Your error rates don’t spike. Your latency stays flat. Your monitoring shows green. The change shows up in output quality — slightly different phrasing, subtly different formatting, a classifier that tips differently on borderline inputs — and quality is hard to monitor at runtime.

This is the observability gap that distinguishes AI systems from traditional software. A bug in your authentication service either causes a login to fail or it doesn’t. A model change might cause 3% of your summarisations to be slightly worse, or cause your structured output parser to break on 0.1% of responses, or cause your tone classifier to shift 8 percentage points on a specific category. None of these look like incidents. They look like noise.

By the time you notice the pattern, the changed model has been running for weeks. The data you’d need to diagnose it — pre-change outputs — is gone. You’re left trying to reproduce behaviour with a model that no longer behaves the same way it did when the problem started.

The Evaluation Staleness Problem

The standard answer here is “run evals.” But evals have a lifecycle problem that teams consistently underestimate.

Your eval suite is written against observed model behaviour. It encodes what good responses look like given the model you were using when you wrote the evals. When the model updates, one of three things happens. The model improves and your evals still pass because the outputs are good: fine. The model regresses and your evals catch it: great, that’s the point. Or the model changes in a way that’s neutral to task performance but different from what your evals expect: your evals fail spuriously, you investigate, find nothing wrong, and either fix the evals or stop trusting them.

The third case is the one that destroys eval cultures. Teams that see eval failures that don’t correspond to real quality problems stop treating eval failures as urgent. The signal degrades. Eventually evals become a checkbox rather than a signal.

The root cause: evals that test for specific output patterns rather than task outcomes. “The response contains exactly three bullet points” will fail when the model changes formatting conventions. “The response correctly answers the user’s question” — evaluated by another model or by a human rubric — is robust to formatting changes. The second type of eval is harder to write and maintain. Most teams write the first type.
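The difference between the two eval types can be made concrete. A minimal sketch — the `judge` here is a stand-in for a rubric grader (a human or a second model), stubbed with a keyword check purely for illustration:

```python
import re

def brittle_eval(response: str) -> bool:
    """Pattern eval: asserts a specific format (exactly three bullets).
    Fails whenever the model changes formatting conventions."""
    return len(re.findall(r"^- ", response, flags=re.M)) == 3

def robust_eval(response: str, judge) -> bool:
    """Outcome eval: delegates 'did this answer the question?' to a
    rubric judge, so a formatting change alone can't flip the result."""
    return judge(response) >= 0.8

# The same content in two formats.
bullets = "- fast\n- cheap\n- reliable"
prose = "It is fast, cheap, and reliable."

# Toy judge: real versions are a human rubric or a grader model.
judge = lambda r: 1.0 if all(w in r for w in ("fast", "cheap", "reliable")) else 0.0
```

The pattern eval disagrees with itself across the two formats; the outcome eval doesn’t.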

What Pinning Actually Buys You

The obvious fix is to use the most specific version identifier your provider offers. Dated snapshots, immutable model hashes, explicit version strings — whatever gets you closest to a fixed artifact.

Pinning buys you reproducibility and change control. You opt into model updates deliberately, with a chance to run your eval suite against the new version before cutting over. You can do a proper A/B comparison. You can roll back if something breaks. You treat a model update the way you’d treat any other dependency upgrade: with intention and verification.

What pinning doesn’t buy you is stability forever. Providers deprecate old versions. A pinned model’s training data has a fixed knowledge cutoff that falls further behind every month. Your pinned model will eventually be unavailable, and the new model you upgrade to will not be a drop-in replacement.

This means model upgrades are not optional maintenance tasks you can defer indefinitely. They’re a recurring engineering cost that needs to be planned for. Teams that haven’t built the infrastructure for safe model migration will pay for that eventually.

The Prompt Drift Nobody Talks About

There’s a second form of drift that doesn’t involve the provider changing anything: your own prompts drifting over time in ways that interact badly with model updates.

Production prompt engineering is not a one-time activity. Prompts accumulate patches. Someone adds a clarification to handle an edge case. Someone else adjusts the output format. Someone adds a few-shot example for a type of request that was failing. The prompt grows. The examples become inconsistent. The instructions contradict each other in subtle ways that the model was papering over.

When the model updates, the cracks in your prompt surface. The new model is slightly more literal about contradictory instructions. The few-shot examples that used to guide format now fight with updated instruction following. The output schema that used to work reliably starts breaking on 2% of requests.

You’re now debugging a problem that’s the intersection of an undocumented prompt evolution and an undocumented model change. Good luck.

The fix is prompt versioning — treating prompts as code, with change history, review processes, and regression testing. This is not how most teams manage prompts. Most teams manage prompts as configuration, edited directly in a database or a CMS with no history beyond “last_updated_at”. That’s fine for a blog post. It’s not fine for production AI infrastructure.
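Even a thin layer helps here. One sketch, assuming prompts live as files in the repo — the hashing scheme is a choice of this example, not a standard:

```python
import hashlib
from pathlib import Path

def load_prompt(path: str) -> tuple[str, str]:
    """Load a prompt from a file tracked in git, returning the text plus
    a short content hash. Logging the hash with each request ties every
    output back to the exact prompt text in use at the time."""
    text = Path(path).read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, version
```

With the file in git, the hash also maps back to a commit, which gets you review history and regression testing for free.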

What Good Model Version Management Looks Like

A few practices that actually hold up:

Explicit version identifiers, always. Use the most specific model identifier your provider offers. Never call gpt-4 when you can call gpt-4-0125-preview. Document which version you’re on. Review version changes deliberately.

Shadow evaluation on model updates. Before cutting over to a new model version, run your evaluation suite against both the old and new versions in parallel. Compare outputs statistically, not just by pass/fail. Look for distribution shifts in output length, format adherence, and task accuracy.
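The statistical comparison can be quite simple to start. A sketch, assuming you’ve already collected outputs from both versions on the same eval inputs and have a `parses` predicate for your structured-output format:

```python
import statistics

def compare_versions(old_outputs: list[str], new_outputs: list[str],
                     parses) -> dict:
    """Compare two model versions on the same eval inputs, statistically
    rather than pass/fail: mean length shift and parse-success rates."""
    old_lens = [len(o) for o in old_outputs]
    new_lens = [len(o) for o in new_outputs]
    return {
        "mean_len_shift": statistics.mean(new_lens) - statistics.mean(old_lens),
        "parse_rate_old": sum(map(parses, old_outputs)) / len(old_outputs),
        "parse_rate_new": sum(map(parses, new_outputs)) / len(new_outputs),
    }
```

Real suites would add task-accuracy deltas and significance testing, but even this much surfaces the shifts that pass/fail checks hide.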

Prompt version control. Prompts live in git. They get reviewed like code. Changes to prompts get evaluated before deployment. The history of what a prompt looked like when a given incident occurred is preserved.

Behavioral monitoring, not just operational monitoring. Track output characteristics at runtime: response length distributions, format adherence rates, structured output parse success rates, downstream feature usage patterns that proxy for quality. Set up anomaly detection on these signals. You won’t catch everything, but you’ll catch the large shifts before they compound.
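One way to sketch the anomaly-detection piece — a rolling z-score against a baseline window; the metric fed in could be response length, parse success as 0/1, or anything else from the list above:

```python
from collections import deque
import statistics

class DriftMonitor:
    """Rolling check for shifts in one output metric. Alerts when the
    recent window's mean moves more than `threshold` standard deviations
    away from the baseline's mean."""

    def __init__(self, baseline: list[float], window: int = 100,
                 threshold: float = 3.0):
        self.mean = statistics.mean(baseline)
        self.stdev = statistics.stdev(baseline) or 1e-9  # guard flat baselines
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Record one observation; return True once drift is detected."""
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return False  # wait for a full window before alerting
        z = abs(statistics.mean(self.recent) - self.mean) / self.stdev
        return z > self.threshold
```

A per-metric monitor like this won’t catch subtle quality regressions, but it will flag the large, sustained shifts that a silent model update tends to produce.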

Incident archaeology support. When something goes wrong, you need to be able to reconstruct the exact model version and prompt that was in use at the time of the incident. If your logging doesn’t capture that, you can’t do root-cause analysis. Log the model identifier and prompt version with every request.

The Deeper Problem

All of this is harder than it should be because the tooling is immature. The observability ecosystem for AI systems is years behind the observability ecosystem for traditional services. Distributed tracing, semantic conventions, alert templates — these exist for web services. For AI systems, most teams are building their own monitoring from scratch.

This is the infrastructure gap of the current AI production moment. Everyone has figured out how to get a model to produce useful outputs in development. Far fewer have figured out how to know, in production, that the model is still producing useful outputs three months later when the model has updated and the prompts have drifted and the data distribution has shifted.

The teams that solve this first will have a real operational advantage. Not because they’ll have better models — everyone has access to the same frontier models — but because they’ll be able to move faster and break things less when the environment changes, which it will, constantly, because you don’t control the model and the model is not static.

That’s the infrastructure reality of building on top of third-party AI. The foundation shifts. Build for it.