
Multi-Agent Code Generation Has a Specification Problem, Not a Coordination Problem


You can’t treat coding agents like distributed systems engineers, and a new study shows why.

“The Specification Gap”, a study of multi-agent code generation, makes a clear case: the main coordination mechanism isn’t negotiation or detection. It’s the spec. And richer specs aren’t just helpful; in their setup, they’re sufficient.

The coordination gap is a specification gap

The authors split a problem across two LLM agents; each independently implements part of the same class. The catch mirrors what every real codebase runs on: many design decisions are implicit. Internal representations (list vs. dict), invariants, naming conventions, and edge-case behavior often live in a senior engineer’s head or in scattered code, not in the ticket.

They progressively strip detail from docstrings (L0) down to bare signatures (L3). As the spec gets thinner, two-agent integration accuracy falls off a cliff: 58% → 25%. The single-agent baseline degrades too, but much more gracefully: 89% → 56%. That leaves a persistent 25–39 percentage point coordination gap across tasks, models (Claude Sonnet and Haiku), and runs.
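To make the L0 → L3 stripping concrete, here is a hypothetical method at the richest and thinnest levels. Only the level labels come from the paper; the inventory class and its docstring are invented for illustration.

```python
# Hypothetical illustration of the paper's spec-richness levels; the
# inventory example is made up, not taken from the study.

class InventoryL0:
    """L0: full docstring -- representation and invariants are explicit."""

    def add_item(self, name: str, count: int) -> None:
        """Add `count` units of `name` to the inventory.

        Internal representation: dict mapping item name -> int count.
        Invariant: stored counts are always >= 1; count == 0 is a no-op
        and must not create the key.
        """

class InventoryL3:
    """L3: bare signature -- every decision above is now implicit."""

    def add_item(self, name: str, count: int) -> None: ...
```

Everything a second implementer needs to stay compatible lives in the L0 docstring; at L3 they have to guess.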

The gap decomposes into two independent, roughly additive effects: coordination cost (+16 pp) and information asymmetry (+11 pp). Even when agents have the same partial spec, independently choosing compatible internal structure is hard. You can’t just “share more messages” if the real missing artifact is a shared decision.
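A toy sketch of that failure mode, with made-up function names: each agent’s half passes its own unit tests under its assumed representation, and the integrated system quietly undercounts.

```python
# Agent A assumes the shared inventory is a dict: {item name -> count}.
def add_item(inventory, name):
    inventory[name] = inventory.get(name, 0) + 1
    return inventory

# Agent B, working from the same thin spec, assumes a flat list of names.
def total_items(inventory):
    return len(inventory)  # correct for a list, wrong for Agent A's dict

# Each half is locally correct...
assert add_item({}, "bolt") == {"bolt": 1}
assert total_items(["bolt", "bolt"]) == 2

# ...but integrated, the dict collapses duplicates before B ever sees them.
inv = add_item(add_item({}, "bolt"), "bolt")
print(total_items(inv))  # 1, not 2: a representation mismatch, not a bug in either half
```

No message exchange fixes this after the fact; the missing artifact is the shared decision about what the inventory *is*.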

The most deflationary finding for tooling hype: an AST-based conflict detector achieves 97% precision at the weakest spec level without extra LLM calls. Sounds useful. Then they run a recovery experiment. Restoring the full specification alone recovers the single-agent ceiling (89%). Adding conflict reports on top provides no measurable benefit.

That doesn’t mean conflict detection is useless. It means conflict detection is not a coordination strategy. It’s a smoke alarm. If your multi-agent workflow expects the smoke alarm to prevent the fire, you’ll keep shipping incompatible pieces and calling it “agent misalignment.”
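For intuition, here is a much-simplified sketch of what an AST-based detector does. This is my own minimal version, not the paper’s tool: parse both agents’ sources with Python’s `ast` module and flag any function defined in both with mismatched argument lists.

```python
import ast

def signatures(source: str) -> dict[str, list[str]]:
    """Map each function name in `source` to its positional argument names."""
    tree = ast.parse(source)
    return {
        node.name: [a.arg for a in node.args.args]
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }

def signature_conflicts(src_a: str, src_b: str) -> list[str]:
    """Names defined in both sources whose argument lists disagree."""
    a, b = signatures(src_a), signatures(src_b)
    return [name for name in a.keys() & b.keys() if a[name] != b[name]]

agent_a = "def add_item(inventory, name, count): ..."
agent_b = "def add_item(inventory, name): ..."
print(signature_conflicts(agent_a, agent_b))  # ['add_item']
```

Note what this buys you: a precise alarm. Per the recovery experiment, reporting the conflict does not by itself restore compatibility.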

Implicit knowledge resists codification

This isn’t just one paper’s finding. It’s a pattern we keep running into.

We wrote earlier this month about how Symphony turns Jira tickets into the agent’s executable specification, and why most teams can’t write tickets that survive that treatment. Requirements surface during implementation, not before it. The developer poking at an API discovers the rate limit that wasn’t in the spec. The edge case emerges when you actually try to handle the sad path. SWE-Skills-Bench found that generic agent skills fail for the same reason: real software work is dominated by local context (repo conventions, dependency versions, weird historical decisions) that never makes it into a tidy procedure.

This paper quantifies the cost of that resistance. When two agents share a thin spec, they don’t just lose information. They lose the ability to make compatible decisions independently. The coordination gap isn’t a communication failure. It’s a specification failure.

The spec-driven tooling bet

The market is already moving on this thesis. AWS shipped Kiro, an IDE that enforces a requirements → architecture → tasks pipeline where each phase produces a structured spec artifact before the agent writes code. GitHub open-sourced Spec-kit, a toolkit that layers specification workflows on top of 25+ existing coding agents. Tessl, founded by Snyk’s Guy Podjarny with $125M in funding, is building a spec registry and framework so agents can consume shared, versioned specifications instead of hallucinating API contracts. Three different companies, three different approaches, all converging on the same thesis this paper validates: specs are the missing coordination layer.

None of these tools have published evidence that they actually reduce multi-agent coordination failures. And the spec tax is real: one early evaluation of Spec-kit found it roughly 10x slower than iterative prompting, characterizing the overhead as “reinvented waterfall.” Writing specs good enough to close a 25–39 pp gap is genuinely hard, which is exactly what we observed with Symphony.

But if you’re building multi-agent workflows, the research and the market are pointing in the same direction. A few principles worth adopting now:

Treat specs as first-class build artifacts. The docstring isn’t documentation; it’s the shared interface for decisions humans usually settle through conversation, code review, or tribal knowledge. If you want parallelism, you pay the spec tax up front.

Force convergence on internal representations. This paper stress-tests opposing structural biases (lists vs. dictionaries). That’s a hint for real codebases: when a representation choice matters, name it. When an invariant matters, state it. “Implement class X” is not a spec; it’s a prompt-shaped wish.

Don’t expect coordination tooling to compensate for vague requirements. The data says the ceiling is reclaimed by spec richness, not by post-hoc conflict reports. If you’re not willing to write the spec, you’re not ready for parallel agent development. You’re just splitting ambiguity into two places and hoping it recombines cleanly.
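One cheap way to pay the spec tax, sketched with a hypothetical class: name the representation in the docstring and turn the invariants into code, so a parallel implementer diverges loudly instead of silently.

```python
class Inventory:
    """Inventory of items.

    Representation: dict mapping item name -> int count (named on purpose,
    so a second implementer cannot pick a list instead).
    Invariant: every stored count is >= 1; removing the last unit deletes
    the key rather than leaving a zero.
    """

    def __init__(self) -> None:
        self._counts: dict[str, int] = {}

    def add(self, name: str, count: int = 1) -> None:
        if count <= 0:
            raise ValueError("count must be positive")  # enforce, don't assume
        self._counts[name] = self._counts.get(name, 0) + count

    def remove(self, name: str) -> None:
        self._counts[name] -= 1
        if self._counts[name] == 0:
            del self._counts[name]  # uphold the no-zero-counts invariant
```

The docstring does the job the paper assigns to rich specs; the runtime checks make drift fail fast rather than at integration.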

The question isn’t whether specs matter for agent coordination. It’s whether we can make them cheap enough to write. That’s not a gap you can prompt your way out of.
