
Stop Treating AI Agents Like Magic: They're Distributed Systems Problems

Gartner is projecting that 40% of agentic AI projects will be abandoned by 2027. McKinsey data shows only 23% of organizations experimenting with agents are actually scaling them. The industry is starting to frame this as an “AI maturity” problem or a “change management” problem.

It’s neither. It’s an engineering problem — and it’s one that distributed systems engineers solved decades ago.

AI agents aren’t a new category of software. They’re just distributed systems with an LLM in the hot path. The moment you treat them that way, most of the failure modes become obvious and the fixes become straightforward.


The Failure Pattern Nobody Talks About

Here’s how most agent projects die: they work in demos, break in production, and nobody can explain why.

The demo is always impressive. The agent correctly books a meeting, writes a PR, fills out a form, sends a confirmation email. Stakeholders nod. The engineering team ships it to production with some confidence.

Then it starts silently failing. The agent books the wrong timezone. It retries a payment that already went through. It gets stuck waiting on an API that returned a 202 and never called back. It loses context halfway through a 12-step workflow. The error logs are full of “unexpected model output” and “max_retries exceeded.”

None of these are model failures. They’re infrastructure failures. And every distributed systems engineer reading this has seen every single one of them before — in RPC systems, in message queues, in microservice call chains.

The industry spent 20 years learning how to handle partial failures, idempotency, timeouts, retries, and state management in distributed systems. Then AI agents showed up and we threw all of it out.


Agents Are Just Unreliable RPC Chains

Strip an AI agent down to its essence: a coordinator that calls tools (APIs, databases, subagents) in a sequence, accumulates state, and makes decisions based on intermediate results.

That’s a distributed workflow. The LLM is just one more fallible node.

The failure modes are identical to any distributed system:

Non-idempotent operations. Your agent retries after a timeout. The first call actually succeeded — you just didn’t get the response. Now you’ve charged the card twice, sent the email twice, created the duplicate record. This isn’t an AI problem. It’s the same problem you have with any unreliable network call. The fix is also the same: idempotency keys on every mutating operation.
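A minimal sketch of that fix, using a toy in-memory gateway (the `PaymentGateway` class and its fields are illustrative, not a real payment API): generate the idempotency key once, before the first attempt, and reuse it on every retry.

```python
import uuid

class PaymentGateway:
    """Toy in-memory gateway that deduplicates charges on an idempotency key."""
    def __init__(self):
        self._processed = {}  # idempotency_key -> original charge result

    def charge(self, idempotency_key, amount_cents):
        # Seen this key before? Return the original result instead of charging again.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        result = {"charge_id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self._processed[idempotency_key] = result
        return result

gateway = PaymentGateway()
key = str(uuid.uuid4())            # generated once, BEFORE the first attempt
first = gateway.charge(key, 1999)
retry = gateway.charge(key, 1999)  # timed-out retry: same key, no double charge
assert first["charge_id"] == retry["charge_id"]
```

Real payment APIs implement exactly this server-side; the point is that the agent must generate the key before the first attempt, not per retry.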

Unbounded retries. The LLM returns a malformed response. The agent retries. Then retries again. Then again. No backoff. No circuit breaker. No dead-letter queue. A single bad request can spin forever, burning tokens and blocking downstream steps. Apply exponential backoff with jitter. Set maximum retry budgets. This is 1990s distributed systems wisdom — it applies here too.
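The pattern is a few lines of code. A sketch with full jitter and a hard retry budget (`call_with_retries` and the flaky operation are illustrative names, and real code would catch only transient error types):

```python
import random
import time

def call_with_retries(op, max_retries=5, base=0.01, cap=2.0):
    """Retry op with exponential backoff plus full jitter, under a hard budget."""
    last_err = None
    for attempt in range(max_retries):
        try:
            return op()
        except Exception as err:          # real code: catch transient errors only
            last_err = err
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)             # full jitter avoids thundering herds
    raise RuntimeError(f"retry budget of {max_retries} exhausted") from last_err

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

assert call_with_retries(flaky) == "ok"   # succeeds on the third attempt
```

The budget matters as much as the backoff: after `max_retries`, the failure surfaces instead of spinning silently.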

No durable state. The agent is three steps into a ten-step workflow. Your process crashes. The state is in memory. You restart from scratch, re-running the first three steps, which may not be safe to re-run. The fix: checkpoint state to durable storage between steps. Temporal, Step Functions, Inngest — any workflow orchestration system does this by default. In-memory agents do not.
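The core of what those orchestrators automate can be sketched in a few lines, assuming a simple linear workflow with JSON-serializable step results (`CheckpointedWorkflow` is an illustrative name, not any library's API):

```python
import json
import os
import tempfile

class CheckpointedWorkflow:
    """Persist each step's result to disk so a restart resumes instead of
    re-running side-effecting steps. A sketch of what orchestrators automate."""
    def __init__(self, path, steps):
        self.path = path
        self.steps = steps  # ordered list of (name, fn) pairs

    def run(self):
        done = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                done = json.load(f)        # recover state from the last checkpoint
        for name, fn in self.steps:
            if name in done:
                continue                   # completed before the crash: skip
            done[name] = fn()
            with open(self.path, "w") as f:
                json.dump(done, f)         # checkpoint after every step
        return done

calls = []
steps = [("fetch", lambda: calls.append("fetch") or "data"),
         ("notify", lambda: calls.append("notify") or "sent")]
path = os.path.join(tempfile.mkdtemp(), "workflow.json")
CheckpointedWorkflow(path, steps).run()    # first run executes both steps
CheckpointedWorkflow(path, steps).run()    # "restart": nothing re-runs
assert calls == ["fetch", "notify"]
```

Production orchestrators add durable queues, versioning, and distributed workers on top, but the replay-and-skip core is the same.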

Missing timeouts. The agent calls a third-party API. The API hangs. The agent waits. Forever. Your deadline-sensitive job is now blocked on an unresponsive external service. Every I/O call needs a timeout. Every. Single. One. This is table stakes in distributed systems. Agents treat it as optional.
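When a client library exposes a native timeout parameter, use that. For calls that don't, one generic sketch is to impose a deadline from the outside (`call_with_deadline` and `hanging_api` are illustrative; note the caveat in the comments that the hung worker thread itself is abandoned, not killed):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_deadline(fn, timeout_s):
    """Run fn in a worker thread and give up after timeout_s seconds.
    (For HTTP clients, prefer their native timeout parameter.)"""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"call exceeded {timeout_s}s deadline") from None
    finally:
        # Don't block waiting on the hung worker; the thread is abandoned,
        # which is why true cancellation belongs in the client when possible.
        pool.shutdown(wait=False)

def hanging_api():
    time.sleep(0.5)  # stands in for an upstream that never answers in time
    return "too late"

try:
    call_with_deadline(hanging_api, 0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
assert timed_out
```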

Undifferentiated errors. The agent fails. Why? “Model error.” Which kind? A timeout? A rate limit? A malformed response? A tool returning unexpected data? An upstream dependency being down? Each failure mode demands a different response: retry, fall back, escalate, abort. Logging “error” and stopping is not error handling.
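A sketch of what differentiated handling looks like, with hypothetical exception classes standing in for whatever your tool layer actually raises:

```python
class ToolTimeout(Exception): pass
class RateLimited(Exception): pass
class MalformedOutput(Exception): pass

def next_action(err):
    """Map each failure class to a distinct response; 'error' alone is not a plan."""
    if isinstance(err, ToolTimeout):
        return "retry"       # transient: retry with backoff
    if isinstance(err, RateLimited):
        return "wait"        # throttled: back off for the rate-limit window
    if isinstance(err, MalformedOutput):
        return "reprompt"    # model produced junk: re-ask with stricter instructions
    return "escalate"        # unknown failure: stop and surface to a human

assert next_action(ToolTimeout()) == "retry"
assert next_action(KeyError("upstream down")) == "escalate"
```

The taxonomy is the valuable part; once failures are classified, the responses mostly write themselves.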


The Orchestration Layer Is Not Optional

The most important architectural decision in an agent system isn’t which model to use. It’s what orchestrates the workflow.

Most teams pick the model first, build the tools second, and figure out orchestration later — if at all. This is exactly backwards.

Workflow orchestration systems like Temporal, Apache Airflow, or AWS Step Functions exist precisely to make long-running, multi-step, failure-prone workflows reliable. They give you durable execution, automatic retries with backoff, compensating transactions (sagas), visible audit trails, and the ability to pause and resume mid-workflow.

These aren’t nice-to-haves for agents. They’re prerequisites.

Consider what “durable execution” means in practice: if your process crashes mid-workflow, the orchestrator replays the workflow from the last checkpoint, skipping already-completed steps. Your agent picks up exactly where it left off. It feels like magic — but it’s the kind of magic that is engineered, not hoped for.

Temporal 2.0 (shipped late 2025) makes this even more compelling by letting you write workflows as ordinary code — no YAML state machine definitions, no external graph definitions. The workflow is the code. That fits naturally with how teams think about agent logic.


Treat Agents Like Junior Engineers, Not Autonomous Systems

There’s a cultural shift happening in how the industry talks about agents. “Autonomous” is quietly being replaced by “supervised.” McKinsey’s language has shifted to “co-pilots.” Gartner is talking about “human-in-the-loop” as a reliability mechanism, not just an ethical one.

This is correct — but not for the reasons usually given.

The standard argument for supervision is “AI makes mistakes.” True, but that’s also true of junior engineers. You don’t make a junior engineer autonomous because they might make mistakes. You keep them supervised until you’ve built confidence in their judgment on a specific class of problem, and you put guardrails around the operations that have irreversible consequences.

The same model applies to agents. Define the operation classes. Identify the reversible vs. irreversible actions. Build gates at the irreversibility boundaries. Run the reversible parts autonomously and freely. Require human confirmation (or at minimum human visibility) before the agent writes to production databases, sends external communications, or makes financial transactions.

This isn’t supervision as a philosophical stance. It’s risk-scoped automation. You’re maximizing autonomous speed where mistakes are cheap to fix, while limiting blast radius where they aren’t.
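The gating mechanism itself is small; the real work is deciding which actions go in the irreversible set. A sketch, with hypothetical action names:

```python
# Hypothetical action names; the irreversible set is the policy, not the mechanism.
IRREVERSIBLE = {"send_email", "charge_card", "write_prod_db"}

def execute(action, approved_by=None):
    """Run reversible actions autonomously; gate irreversible ones on approval."""
    if action in IRREVERSIBLE and approved_by is None:
        return {"status": "pending_approval", "action": action}
    return {"status": "executed", "action": action, "approved_by": approved_by}

assert execute("draft_reply")["status"] == "executed"
assert execute("send_email")["status"] == "pending_approval"
assert execute("send_email", approved_by="oncall")["status"] == "executed"
```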


What Actually Works

The agent projects that are scaling have a few things in common:

Narrow scope. They solve one class of problem well rather than being general-purpose. An agent that handles incoming support escalations is deployable. An agent that handles “all of customer success” is not.

Workflow orchestration with durable state. Not “we save to Redis sometimes.” Durable, automatic, recoverable state that doesn’t require the on-call engineer to understand the agent’s internal logic to restart a failed run.

Tool contracts enforced, not hoped. Every tool the agent can call has an explicit contract: inputs, outputs, error codes, idempotency behavior. When the tool returns something outside the contract, the agent fails fast and loudly — not silently with a retry loop.
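What enforcement looks like in miniature, assuming a hypothetical order-lookup tool (the field names and statuses are illustrative; in practice a schema library does this validation):

```python
from dataclasses import dataclass

VALID_STATUSES = {"pending", "shipped", "delivered"}

class ContractViolation(Exception):
    """Raised when a tool's output breaks its declared contract."""

@dataclass
class OrderLookup:
    order_id: str
    status: str

def parse_lookup(raw):
    """Validate the tool's raw output; fail fast and loudly on violations."""
    if not isinstance(raw.get("order_id"), str):
        raise ContractViolation("order_id must be a string")
    if raw.get("status") not in VALID_STATUSES:
        raise ContractViolation(f"unknown status: {raw.get('status')!r}")
    return OrderLookup(raw["order_id"], raw["status"])

ok = parse_lookup({"order_id": "A-123", "status": "shipped"})
assert ok.status == "shipped"
try:
    parse_lookup({"order_id": "A-123", "status": "maybe"})
    violated = False
except ContractViolation:
    violated = True
assert violated
```

An out-of-contract response becomes a loud, typed failure instead of garbage quietly fed back into the model.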

Observability as a first-class concern. You can’t debug what you can’t observe. Every agent invocation should emit structured traces: which tools were called, in what order, with what inputs and outputs, how long each step took, where it failed. LangSmith, Langfuse, and similar platforms exist specifically for this. Use them from day one, not after the first production incident.
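Even without a tracing platform, the minimum viable version is a wrapper that records one structured span per tool call (`TraceRecorder` is an illustrative sketch, not any platform's API):

```python
import time

class TraceRecorder:
    """Record one structured span per tool call: name, args, status, duration."""
    def __init__(self):
        self.spans = []

    def traced(self, tool_name, fn):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                # The span is recorded whether the call succeeds or raises.
                self.spans.append({
                    "tool": tool_name,
                    "args": args,
                    "status": status,
                    "duration_ms": (time.monotonic() - start) * 1000,
                })
        return wrapper

tracer = TraceRecorder()
lookup = tracer.traced("lookup_order", lambda order_id: {"status": "shipped"})
lookup("A-123")
assert tracer.spans[0]["tool"] == "lookup_order"
assert tracer.spans[0]["status"] == "ok"
```

Dedicated platforms add storage, search, and cross-run comparison, but the span structure is the part you need from day one.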

Bounded token budgets. Unbounded context accumulation is a hidden cost bomb and a reliability risk. Long contexts degrade model performance on earlier content. Set explicit context window budgets. Summarize aggressively. Prune what isn’t needed for the current step.
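A sketch of budget enforcement, assuming a crude characters-per-token estimate (real systems would use the model's tokenizer and have the model write the summary; the stub here just marks the idea):

```python
def prune_context(messages, budget_tokens, count_tokens=lambda m: len(m) // 4):
    """Keep the newest messages that fit the budget; replace the rest with a
    summary stub. len(m)//4 is a rough chars-per-token guess, not a tokenizer."""
    kept, used = [], 0
    for msg in reversed(messages):            # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    dropped = len(messages) - len(kept)
    if dropped:
        # Real systems would have the model summarize; a stub marks the idea.
        kept.insert(0, f"[summary of {dropped} earlier messages]")
    return kept

history = ["x" * 40] * 5                      # five messages, ~10 tokens each
pruned = prune_context(history, budget_tokens=25)
assert pruned[0].startswith("[summary of 3")
assert len(pruned) == 3                       # summary stub + 2 newest messages
```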


The Models Are Fine

Here’s the uncomfortable truth: the models are not the bottleneck.

GPT-5, Claude 3.7, Gemini 2.0, Qwen 3.5 — all of them are capable of driving complex multi-step workflows when the orchestration around them is solid. The benchmark-chasing and model-swapping that consumes so much engineering time is mostly irrelevant to whether an agent system actually works in production.

What breaks agent systems isn’t model quality. It’s the same thing that breaks every distributed system: partial failures, missing timeouts, non-idempotent operations, invisible state, and inadequate observability.

The engineers who are making agents work aren’t the ones who picked the best model. They’re the ones who applied 20 years of distributed systems engineering discipline to a new kind of node in their call graph.

That’s the unlock. Treat your LLM like an unreliable remote service — because it is one. Design around its failure modes the same way you’d design around any unreliable dependency. Build the orchestration layer before you build the tools. Observe everything.

The magic comes from the engineering, not the model.