Minimal. Intelligent. Agent.
Building with code & caffeine.

Multi-Agent Systems Are Just Microservices With Hallucinations

Gartner is reporting a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025. Teams are splitting their monolithic LLM calls into fleets of specialised agents — a planner, a researcher, a coder, a reviewer, a validator — each responsible for a slice of the problem.

Sound familiar? It should. We did this exact thing with services ten years ago. We called it microservices. We were very excited. Then we spent three years debugging distributed deadlocks and wondering why order-service was timing out because inventory-service was waiting on pricing-service which was waiting on order-service.

Multi-agent systems are microservices with hallucinations. The failure modes are identical. The solutions already exist. And almost nobody building agent pipelines today has looked at what the distributed systems field learned the hard way.

The Math That Should Scare You

Here’s the core problem, stated plainly.

If a single agent completes a task successfully 95% of the time, that sounds good. But chain five of those agents together in a sequential pipeline, and your end-to-end success rate is 0.95⁵ ≈ 77%. Chain ten agents and you’re at 0.95¹⁰ ≈ 60%. This is the compound error trap, and it’s brutal.
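The arithmetic is worth making concrete — end-to-end success is just the product of per-agent success rates:

```python
# Compound error trap: a sequential pipeline succeeds only if every
# agent in it succeeds, so the rates multiply.

def pipeline_success(per_agent_rate: float, n_agents: int) -> float:
    """End-to-end success rate of n sequential agents."""
    return per_agent_rate ** n_agents

for n in (1, 5, 10):
    print(f"{n:>2} agents at 95% each -> {pipeline_success(0.95, n):.0%} end-to-end")
# 1 agent: 95%, 5 agents: 77%, 10 agents: 60%
```

Note the asymmetry: per-agent reliability has to climb toward 99%+ before long pipelines become viable at all.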

Research analysing 1,642 multi-agent traces across seven frameworks found failure rates between 41% and 87% depending on task complexity. Even generous benchmarks show that single-agent systems outperform multi-agent pipelines at 99.5% vs 97% success rates — and that 2.5% gap compounds into a massive reliability tax at scale.

The microservices world called this the “cascading failure” problem. A single unhealthy downstream service could take down an entire request chain. The solution wasn’t to give up on microservices — it was to engineer for failure as a default assumption, not an edge case.

Agent engineers haven’t accepted this yet. They’re still treating failures as bugs to fix rather than invariants to design around.

The “Bag of Agents” Anti-Pattern

There’s a specific failure mode that’s becoming the defining mistake of 2026 multi-agent architecture. I’ll call it the Bag of Agents: you have a task, so you create an agent for each subtask, wire them together loosely, and call it done.

The microservices equivalent was “nanoservices” — the tendency to split services so granularly that the coordination overhead dwarfs the actual work being done. A service to validate an email address. A service to format a date string. Each hop adds latency, adds failure surface, adds a network call to debug when things go wrong.

The Bag of Agents version: an orchestrator agent that calls a research agent that calls a summarisation agent that calls a fact-checking agent that calls a formatting agent. Five LLM calls, five potential hallucinations, five points where a misunderstood instruction cascades into a confidently wrong output several steps later.

The fix isn’t fewer agents per se — it’s right-sized agents. The boundary between agents should be drawn at points of genuine cognitive separation: different tools, different data sources, different risk profiles. Not at arbitrary task decomposition lines.

What Observability Actually Means for Agents

In distributed systems, observability isn’t a nice-to-have — it’s the only thing standing between “it works” and “we have no idea why it stopped working.” The three pillars — logs, metrics, traces — exist because you cannot reason about a distributed system from first principles at runtime. You need data.

Multi-agent systems generate distributed execution traces across concurrent, often non-deterministic agents. Traditional debugging doesn’t work. You can’t console.log your way through a pipeline where six agents ran in parallel, one returned an ambiguous intermediate result, and a downstream agent made a wrong inference based on it two minutes ago.

What you actually need:

Structured traces per agent call. Not just “agent A succeeded.” What was its input? What tools did it invoke? What did it reason about before acting? Token budget consumed? Time taken? This is distributed tracing applied to LLM calls — and the frameworks are starting to catch up, but most production deployments aren’t using them.

Inter-agent contract validation. In microservices, you have API contracts — the downstream service guarantees it will return a specific schema. If you’re passing unstructured natural language between agents and hoping the next agent parses it correctly, you don’t have a multi-agent system, you have a game of telephone. Define structured outputs. Validate them. Fail loudly at the boundary rather than silently downstream.
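Failing loudly at the boundary can be as simple as a hard schema check before the downstream agent ever sees the output. The schema below is a hypothetical research-agent contract, invented for illustration; the mechanism is the point:

```python
# Validate the upstream agent's output against an explicit contract and
# raise at the boundary, not three agents downstream.
import json
from dataclasses import dataclass

class ContractViolation(Exception):
    """The upstream agent broke its output contract."""

@dataclass(frozen=True)
class ResearchResult:
    claim: str
    sources: list[str]
    confidence: float

def parse_research_output(raw: str) -> ResearchResult:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ContractViolation(f"upstream agent emitted non-JSON output: {e}")
    missing = {"claim", "sources", "confidence"} - data.keys()
    if missing:
        raise ContractViolation(f"missing required fields: {sorted(missing)}")
    if not isinstance(data["sources"], list) or not data["sources"]:
        raise ContractViolation("'sources' must be a non-empty list")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ContractViolation(f"confidence out of range: {data['confidence']}")
    return ResearchResult(data["claim"], data["sources"], data["confidence"])
```

A `ContractViolation` here is a debuggable event with a named culprit; the same malformed output passed along as raw text is a hallucination three agents later.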

Coordination failure detection. One of the most insidious failure modes is agents waiting indefinitely — on a tool call that never returns, on another agent that’s stuck in a loop, on a resource that’s exhausted. You need timeouts, and you need to know when they fire.
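One way to guarantee a fired timeout is visible, sketched under the assumption that agent and tool calls are blocking functions, is to wrap every cross-agent call in a hard deadline that returns a structured event rather than hanging:

```python
# Never wait indefinitely on another agent or tool: enforce a deadline
# and surface the timeout as a structured result the orchestrator can see.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_deadline(fn, *args, timeout_s: float = 30.0) -> dict:
    """Run a blocking agent/tool call with a hard deadline."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return {"status": "ok", "result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # Emit this as a trace event / metric, not a swallowed exception.
        return {"status": "timeout", "result": None, "deadline_s": timeout_s}
    finally:
        # Don't block waiting for a stuck call to finish.
        pool.shutdown(wait=False)
```

A real deployment would also cancel or quarantine the stuck worker, but even this much turns "the pipeline hung" into a countable, alertable event.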

None of this is novel. The SRE playbook solved this for services. Agent engineers need to read it.

Circuit Breakers, Idempotency, and Retry Logic

Let’s steal three specific concepts from distributed systems and apply them directly to agent architectures.

Circuit breakers. If a tool your agent calls fails three times in a row, stop calling it. Return a structured failure response. Let the orchestrator decide whether to retry, reroute, or escalate. Right now, most agent pipelines will happily retry a failing tool call in a tight loop until they exhaust the token budget or hit a timeout. That’s not resilience — it’s a meltdown.
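A minimal version of that breaker, sketched with illustrative defaults, fits in a few dozen lines:

```python
# Circuit breaker for agent tool calls: after max_failures consecutive
# failures the circuit "opens" and calls fail fast with a structured
# response instead of burning tokens on a dead dependency.
import time

class ToolCircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 60.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, tool_fn, *args, **kwargs) -> dict:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                # Fail fast; let the orchestrator reroute or escalate.
                return {"status": "circuit_open", "result": None}
            self.opened_at = None  # half-open: allow one probe call through
        try:
            result = tool_fn(*args, **kwargs)
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return {"status": "error", "error": str(exc), "result": None}
        self.failures = 0  # success resets the count
        return {"status": "ok", "result": result}
```

Crucially, the breaker returns a structured failure rather than raising into the agent's loop — the decision about what to do next belongs to the orchestrator, not the retry instinct of an LLM.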

Idempotency. If your agent pipeline can be interrupted and resumed, every agent action that has side effects needs to be idempotent. Writing to a database, sending an email, calling an external API — if the pipeline crashes and replays, you cannot have your agent doing these things twice. This is table stakes in distributed systems. It’s almost completely ignored in agent design.
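The standard mechanism is an idempotency key: derive a stable key from the action's content, record it before acting, and skip replays. A sketch, with an in-memory set standing in for the durable store (Redis, a database table) a real pipeline would need:

```python
# Idempotency keys for agent side effects: hash the action's canonical
# content, so a crashed-and-replayed pipeline doesn't act twice.
import hashlib
import json

_performed: set[str] = set()  # stand-in for durable storage

def idempotency_key(action: str, payload: dict) -> str:
    canonical = json.dumps({"action": action, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def perform_once(action: str, payload: dict, side_effect) -> str:
    """Execute side_effect at most once per (action, payload)."""
    key = idempotency_key(action, payload)
    if key in _performed:
        return "skipped_duplicate"  # pipeline replayed; don't email twice
    _performed.add(key)
    side_effect(payload)
    return "executed"
```

Note `sort_keys=True`: two semantically identical payloads must hash identically regardless of key order, or the dedup silently fails.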

Retry with exponential backoff and jitter. When an agent fails a subtask, the instinct is to immediately retry. Don’t. If the failure was caused by rate limiting, a downstream outage, or a temporarily corrupted context, retrying immediately just hammers the problem. Back off. Add jitter. This is a 1990s distributed systems lesson that somehow needs to be learned again.
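The full-jitter variant is a handful of lines — the delay parameters below are illustrative:

```python
# Exponential backoff with full jitter: sleep a random amount up to an
# exponentially growing cap, so a fleet of retrying agents doesn't hammer
# a recovering dependency in lockstep.
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5,
                       base_s: float = 1.0, cap_s: float = 30.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure, don't loop forever
            sleep_s = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            time.sleep(sleep_s)
```

The jitter is not optional decoration: without it, every agent that failed at the same moment retries at the same moment, and the synchronized retry wave is itself a denial-of-service on the thing you're waiting to recover.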

Centralised vs Decentralised Orchestration

Microservices had the service mesh debate: should services discover and call each other directly, or should traffic flow through a centralised proxy that handles routing, retries, and observability?

Multi-agent systems have the same debate, just with more vibes and less rigour.

Centralised orchestration — one orchestrator agent that routes tasks to specialist agents — is easier to debug, easier to reason about, and has a single point of control. The downside is a single point of failure and potential bottlenecks if the orchestrator is doing real work.

Decentralised orchestration — agents that spawn sub-agents, pass tasks peer-to-peer — is more resilient but creates an observability nightmare. When things go wrong, you’re tracing execution across a dynamic graph of agents that may or may not have logged what they did.

The right answer depends on your use case, but the decision shouldn’t be made implicitly. Most teams stumble into one model or the other based on which framework they picked. That’s the microservices mistake: letting your infrastructure choice make your architecture decision for you.
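To make the centralised option concrete: the single point of control can be as small as a router that owns dispatch and logging, so every cross-agent hop passes through one observable choke point. The agent names and handlers below are purely illustrative:

```python
# Centralised orchestration in miniature: one dispatcher owns routing and
# the audit trail, so no agent-to-agent hop happens off the books.

def make_orchestrator(agents: dict, audit_log: list):
    """agents maps name -> callable(task) -> result."""
    def dispatch(agent_name: str, task: str):
        if agent_name not in agents:
            audit_log.append(("unknown_agent", agent_name))
            raise KeyError(f"no such agent: {agent_name}")
        audit_log.append(("dispatch", agent_name, task))
        result = agents[agent_name](task)
        audit_log.append(("done", agent_name))
        return result
    return dispatch
```

The decentralised equivalent has no such choke point, which is exactly the trade: resilience and parallelism in exchange for a trace you have to reassemble from whatever each agent chose to log.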

The Trust Boundary Problem

Here’s something the microservices world had to reckon with that agent engineers are just starting to face: you cannot trust inputs from other services, even internal ones.

In the early microservices days, teams assumed internal traffic was safe. Services trusted each other implicitly. Then came the wave of supply chain attacks, misconfigured network policies, and the realisation that service-to-service calls needed authentication, validation, and rate limiting just like external API calls.

Agents have the same problem, with an extra wrinkle: the input isn’t just structured data, it’s natural language instructions that another agent generated. Prompt injection — where malicious content in one agent’s output manipulates the behaviour of a downstream agent — is real, underexplored, and getting worse as pipelines get longer.

If you’re running an agent pipeline that processes external data at any point, and you’re passing that data unfiltered into another agent’s context, you have a trust boundary problem. The attacker doesn’t need to compromise your system — they just need to get their instructions into the pipeline.

What Good Actually Looks Like

This isn’t an argument against multi-agent systems. The pattern is powerful and increasingly necessary for tasks too complex for a single context window or requiring genuine parallelism across specialised domains.

But good multi-agent architecture looks like this:

  • Right-sized agents with clear cognitive responsibilities and minimal coordination surface
  • Structured interfaces between agents — typed outputs, validated schemas, not raw text blobs
  • Observability built in from day one — traces, structured logs, inter-agent latency metrics
  • Explicit failure handling — circuit breakers, timeouts, fallback strategies that don’t just retry blindly
  • Trust boundaries respected — external data sanitised before it enters the pipeline
  • Idempotent actions — especially for any agent that touches external systems

This is not a new framework. It’s distributed systems engineering applied to LLMs. The textbooks already cover this. The patterns are established. Teams just aren’t looking in the right places.

The Bitter Lesson, Again

The multi-agent hype cycle is at its peak. Gartner predicts over 40% of agentic AI projects will be scrapped by 2027 — not because the models aren’t capable enough, but because the orchestration falls apart in production.

That’s the same arc as early microservices. The technology was sound. The engineering discipline wasn’t there yet. Teams who treated it as a deployment topology rather than a distributed systems design problem had a rough time.

The teams who win with multi-agent systems in 2026 won’t be the ones with the most agents. They’ll be the ones who treated agent reliability as a first-class engineering constraint and borrowed their playbook from a field that solved these problems a decade ago.

Your agents don’t need more intelligence. They need better failure modes.