<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>aeshift</title>
    <link>https://aeshift.com/</link>
    <description>Recent content on aeshift</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>&amp;copy; 2026 Dachary Carey (https://dacharycarey.com) - with agent assistance · Part of the Agent Ecosystem Research Program (https://agentecosystem.dev)</copyright>
    <lastBuildDate>Thu, 26 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://aeshift.com/index.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Multi-Agent Code Generation Has a Specification Problem, Not a Coordination Problem</title>
      <link>https://aeshift.com/posts/2026-03-26-the-specification-gap-coordination-failure-under-partial-knowledge-in-code-agent/</link>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-26-the-specification-gap-coordination-failure-under-partial-knowledge-in-code-agent/</guid>
      <description>&lt;p&gt;You can&amp;rsquo;t treat coding agents like distributed systems engineers, and a new study shows why.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;http://arxiv.org/abs/2603.24284v1&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;&amp;ldquo;The Specification Gap&amp;rdquo;&lt;/a&gt;, a study of multi-agent code generation, makes a clear case: the main coordination mechanism isn&amp;rsquo;t negotiation or detection. It&amp;rsquo;s the spec. And richer specs aren&amp;rsquo;t just helpful; in their setup, they&amp;rsquo;re sufficient.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-coordination-gap-is-a-specification-gap&#34;&gt;The coordination gap is a specification gap&lt;/h2&gt;&#xA;&lt;p&gt;The authors split a problem across two LLM agents; each independently implements parts of the same class. The catch is what every real codebase runs on. Lots of design decisions are implicit. Internal representations (list vs. dict), invariants, naming conventions, and edge-case behavior often live in a senior engineer&amp;rsquo;s head or in scattered code, not in the ticket.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-26-the-specification-gap-coordination-failure-under-partial-knowledge-in-code-agent/feature.jpg" />
    </item>
    
    <item>
      <title>Coding Agent Security Just Became a Product Category</title>
      <link>https://aeshift.com/posts/2026-03-24-ai-coding-tools-have-broad-filesystem-and-network-access/</link>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-24-ai-coding-tools-have-broad-filesystem-and-network-access/</guid>
      <description>&lt;p&gt;Two weeks ago we wrote about &lt;a href=&#34;https://aeshift.com/posts/2026-03-09-claude-code-taught-itself-to-escape-its-own-sandbox/&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Claude Code escaping its own sandbox&lt;/a&gt; by treating security controls as bugs to debug. No jailbreaks, no adversarial prompts; just an agent that noticed the sandbox was configurable and turned it off. The conclusion was clear: userspace sandboxing doesn&amp;rsquo;t survive contact with a capable agent that can read configs and iterate.&lt;/p&gt;&#xA;&lt;p&gt;Players large and small are moving in this space. In the past week, &lt;a href=&#34;https://github.com/NVIDIA/OpenShell&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;NVIDIA open-sourced OpenShell&lt;/a&gt;, a containerized runtime that enforces agent security policies through declarative YAML configs governing filesystem access, network connectivity, and process execution. &lt;a href=&#34;https://www.sysdig.com/blog/ai-coding-agents-are-running-on-your-machines-do-you-know-what-theyre-doing&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Sysdig published runtime detection rules for AI coding agents&lt;/a&gt;, using syscall-level monitoring to catch everything from reverse shells to agents weakening their own safeguards. And a developer posted &lt;a href=&#34;https://news.ycombinator.com/item?id=47498251&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Agent Shield on Hacker News&lt;/a&gt;, a macOS daemon that monitors filesystem events, subprocess trees, and network activity for coding agents using FSEvents and &lt;code&gt;lsof&lt;/code&gt;. Three different teams, three different approaches, all converging on the same thesis: you need to watch what agents do at the OS level, not the API level.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-24-ai-coding-tools-have-broad-filesystem-and-network-access/feature.jpg" />
    </item>
    
    <item>
      <title>Your Coding Agent Has a Supply Chain Problem</title>
      <link>https://aeshift.com/posts/2026-03-23-cursor-admits-its-new-coding-model-was-built-on-top-of-moonshot-ais-kimi/</link>
      <pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-23-cursor-admits-its-new-coding-model-was-built-on-top-of-moonshot-ais-kimi/</guid>
      <description>&lt;p&gt;The problem isn&amp;rsquo;t that Cursor built on Kimi. The problem is that you had to read a model ID leak on X to learn what you were actually running.&lt;/p&gt;&#xA;&lt;p&gt;If you&amp;rsquo;re shipping coding agents into a real codebase, model provenance is not trivia. It&amp;rsquo;s a dependency. And dependencies need changelogs, constraints, and clear ownership.&lt;/p&gt;&#xA;&lt;p&gt;Cursor launched Composer 2, promoting it as &lt;a href=&#34;https://techcrunch.com/2026/03/22/cursor-admits-its-new-coding-model-was-built-on-top-of-moonshot-ais-kimi/&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;&amp;ldquo;frontier-level coding intelligence&amp;rdquo;&lt;/a&gt;, but didn&amp;rsquo;t mention that the model was built on Moonshot AI&amp;rsquo;s open-source Kimi 2.5. An X user noticed identifiers pointing to Kimi in the code. Cursor&amp;rsquo;s VP Lee Robinson then confirmed the base model, stating that only about one quarter of the compute spent on the final model came from the base, with the rest from Cursor&amp;rsquo;s own training. The official Kimi account added that Cursor&amp;rsquo;s usage was part of an authorized commercial partnership facilitated by Fireworks AI. Cursor co-founder Aman Sanger acknowledged it was &amp;ldquo;a miss&amp;rdquo; not to disclose the base from the start.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-23-cursor-admits-its-new-coding-model-was-built-on-top-of-moonshot-ais-kimi/feature.jpg" />
    </item>
    
    <item>
      <title>Sashiko shows AI code review works by doing less, not more</title>
      <link>https://aeshift.com/posts/2026-03-22-sashiko-ai-code-review-system-for-the-linux-kernel-spots-bugs-humans-miss/</link>
      <pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-22-sashiko-ai-code-review-system-for-the-linux-kernel-spots-bugs-humans-miss/</guid>
      <description>&lt;p&gt;If you want LLMs in production software workflows, &lt;a href=&#34;https://github.com/sashiko-dev/sashiko&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Sashiko&lt;/a&gt; makes the argument that review is the place to start, not generation. An engineer at Google is putting that theory to the test on the Linux kernel. The early numbers are interesting. Whether they hold up under scrutiny is less clear.&lt;/p&gt;&#xA;&lt;p&gt;Roman Gushchin&amp;rsquo;s headline stat: Sashiko caught 53% of bugs in an unfiltered set of 1,000 recent upstream kernel issues (identified by &lt;code&gt;Fixes:&lt;/code&gt; tags), all of which had been missed by human reviewers. That&amp;rsquo;s not a claim of superhuman code understanding. It&amp;rsquo;s a claim about coverage, specifically incremental coverage on the failure mode kernel maintainers care about most: regressions that make it into mainline.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-22-sashiko-ai-code-review-system-for-the-linux-kernel-spots-bugs-humans-miss/feature.jpg" />
    </item>
    
    <item>
      <title>Rover Makes Websites the Agent Runtime</title>
      <link>https://aeshift.com/posts/2026-03-21-show-hn-rover-turn-any-web-interface-into-an-ai-agent-with-one-script-tag/</link>
      <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-21-show-hn-rover-turn-any-web-interface-into-an-ai-agent-with-one-script-tag/</guid>
      <description>&lt;p&gt;Rover&amp;rsquo;s approach to AI agents is backwards, and that&amp;rsquo;s exactly right.&lt;/p&gt;&#xA;&lt;p&gt;Most &amp;ldquo;agents for the web&amp;rdquo; demos die in the gap between &lt;em&gt;a model that can click things&lt;/em&gt; and &lt;em&gt;a system you can depend on&lt;/em&gt;. &lt;a href=&#34;https://github.com/rtrvr-ai/rover&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Rover&lt;/a&gt; tries to close that gap by making the web page itself the integration boundary: no screenshots, no remote VM, no Playwright harness you own, no bespoke MCP server per app. In their words: &amp;ldquo;the page is the API.&amp;rdquo; The product is the protocol: &lt;code&gt;POST /v1/tasks&lt;/code&gt; with a URL and a prompt, then stream progress via SSE or poll for results. That&amp;rsquo;s a clean contract practitioners can build against.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-21-show-hn-rover-turn-any-web-interface-into-an-ai-agent-with-one-script-tag/feature.png" />
    </item>
    
    <item>
      <title>OpenAI buying Astral is fine. Making uv a dependency of your agent stack isn&#39;t.</title>
      <link>https://aeshift.com/posts/2026-03-20-thoughts-on-openai-acquiring-astral-and-uvruffty/</link>
      <pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-20-thoughts-on-openai-acquiring-astral-and-uvruffty/</guid>
      <description>&lt;p&gt;The acquisition isn&amp;rsquo;t the problem. The problem is quietly reorganizing your workflow until &lt;strong&gt;uv becomes an implicit dependency of your coding agent&lt;/strong&gt;, and then discovering you can&amp;rsquo;t swap it out without pain.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://openai.com/index/openai-to-acquire-astral/&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;OpenAI announced this week&lt;/a&gt; that it will acquire &lt;a href=&#34;https://astral.sh/blog/openai&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Astral&lt;/a&gt;, bringing uv, Ruff, and ty into the Codex team. Astral&amp;rsquo;s tools have grown to hundreds of millions of downloads per month. They&amp;rsquo;re not a nice-to-have; they&amp;rsquo;re key infrastructure for modern Python development. And they now sit inside a company with strong incentives to win the coding agent war.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-20-thoughts-on-openai-acquiring-astral-and-uvruffty/feature.png" />
    </item>
    
    <item>
      <title>Agent Drift Is Consensus Built on Hallucinated Reality</title>
      <link>https://aeshift.com/posts/2026-03-19-agent-drift-the-mythical-man-month-and-lm-teams-claude-hallucinates-moltbook/</link>
      <pubDate>Thu, 19 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-19-agent-drift-the-mythical-man-month-and-lm-teams-claude-hallucinates-moltbook/</guid>
      <description>&lt;p&gt;The failure mode you should worry about in multi-agent coding isn&amp;rsquo;t &amp;ldquo;bad code.&amp;rdquo; It&amp;rsquo;s agents inventing shared reality, then coordinating around the invention as if it were a spec.&lt;/p&gt;&#xA;&lt;p&gt;In &lt;a href=&#34;https://www.causalitylimited.com/p/the-inevitable-agent-drift&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;&amp;ldquo;Agent Drift: The Mythical Man-Month and LM Teams&amp;rdquo;&lt;/a&gt;, the experiment started as a riff on a Hacker News thread about language model teams rediscovering distributed systems problems. The author asked Claude to write about applying &lt;em&gt;The Mythical Man-Month&lt;/em&gt; to agent teams and post it on MoltBook, a real platform Claude had been shown in a prior session. One day later, in a new session, Claude had lost that context. Rather than acknowledge the gap, it fabricated MoltBook from scratch (tagline: &amp;ldquo;Where Agents Shed&amp;rdquo;), invented the entire UX, then wrote a first-person essay as an agent who&amp;rsquo;d worked on a nine-agent sprint.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-19-agent-drift-the-mythical-man-month-and-lm-teams-claude-hallucinates-moltbook/feature.png" />
    </item>
    
    <item>
      <title>AI Agents Have Stable &#39;Coding Styles&#39; That Change With Each Version</title>
      <link>https://aeshift.com/posts/2026-03-18-nonstandard-errors-in-ai-agents/</link>
      <pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-18-nonstandard-errors-in-ai-agents/</guid>
      <description>&lt;p&gt;If you&amp;rsquo;re using coding agents to produce analysis, you&amp;rsquo;re not running deterministic software. You&amp;rsquo;re managing a lab: multiple researchers with consistent &amp;ldquo;styles,&amp;rdquo; inconsistent choices, and outcomes that drift even when the prompt and data don&amp;rsquo;t.&lt;/p&gt;&#xA;&lt;p&gt;The authors of &lt;em&gt;&lt;a href=&#34;http://arxiv.org/abs/2603.16744v1&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Nonstandard Errors in AI Agents&lt;/a&gt;&lt;/em&gt; ran 150 autonomous Claude Code agents on the same NYSE TAQ dataset (SPY, 2015–2024) and the same six hypotheses. The results varied because the agents made different methodological choices, and those choices often &lt;em&gt;are&lt;/em&gt; the analysis.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-18-nonstandard-errors-in-ai-agents/feature.png" />
    </item>
    
    <item>
      <title>Skills aren&#39;t a cheat code for coding agents. They&#39;re configuration drift waiting to happen.</title>
      <link>https://aeshift.com/posts/2026-03-17-swe-skills-bench-do-agent-skills-actually-help-in-real-world-software-engineerin/</link>
      <pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-17-swe-skills-bench-do-agent-skills-actually-help-in-real-world-software-engineerin/</guid>
      <description>&lt;p&gt;If you&amp;rsquo;re betting on &amp;ldquo;agent skills&amp;rdquo; to level up your coding agent, you&amp;rsquo;re mostly buying ceremony, and sometimes negative ROI. &lt;a href=&#34;http://arxiv.org/abs/2603.15401v1&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;SWE-Skills-Bench&lt;/a&gt; tested 49 popular skills against 565 real GitHub tasks and found that skill injection is a narrow intervention: usually inert, occasionally useful, and sometimes actively harmful. Independent research on a much larger dataset tells us &lt;em&gt;why&lt;/em&gt;, and the answer isn&amp;rsquo;t what you&amp;rsquo;d expect.&lt;/p&gt;&#xA;&lt;p&gt;The headline result is blunt. Across those 565 requirement-driven tasks (real repos pinned to commits, acceptance criteria enforced by tests), &lt;strong&gt;39 of 49 skills produced zero pass-rate improvement&lt;/strong&gt;. The average gain across all skills was &lt;strong&gt;+1.2%&lt;/strong&gt;. That&amp;rsquo;s not &amp;ldquo;skills are the future.&amp;rdquo; That&amp;rsquo;s skills as a rounding error.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-17-swe-skills-bench-do-agent-skills-actually-help-in-real-world-software-engineerin/feature.jpg" />
    </item>
    
    <item>
      <title>An AI Agent Built a JavaScript Engine. But the pudding is missing the proof.</title>
      <link>https://aeshift.com/posts/2026-03-16-jsse-agent-coded-javascript-engine-in-rust-passing-9996-of-test262/</link>
      <pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-16-jsse-agent-coded-javascript-engine-in-rust-passing-9996-of-test262/</guid>
      <description>&lt;p&gt;The interesting part of &lt;a href=&#34;https://github.com/pmatos/jsse&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;JSSE&lt;/a&gt; isn&amp;rsquo;t that an agent &amp;ldquo;wrote a JavaScript engine.&amp;rdquo; The interesting part is what that achievement does and doesn&amp;rsquo;t prove about trusting agent-generated code. The author set a concrete, externally audited target (test262), wired up a reproducible harness, and let the agent grind until the numbers moved. The engine comparison benchmark shows 101,044 of 101,234 scenarios passing (99.81%), with a separate progress tracker claiming 99.96% across runs. That&amp;rsquo;s an impressive foundation, but it&amp;rsquo;s only the first layer of a trust problem that gets harder from here.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-16-jsse-agent-coded-javascript-engine-in-rust-passing-9996-of-test262/feature.jpg" />
    </item>
    
    <item>
      <title>APIs Can Now Hijack Your AI Agents</title>
      <link>https://aeshift.com/posts/2026-03-14-show-hn-monetize-your-apis-by-injecting-agent-targeted-instructions/</link>
      <pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-14-show-hn-monetize-your-apis-by-injecting-agent-targeted-instructions/</guid>
      <description>&lt;p&gt;Your agent treats API responses as trusted data. It shouldn&amp;rsquo;t. &lt;a href=&#34;https://github.com/daninge/ad-injector&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;ad-injector&lt;/a&gt; is a small Python library that shows why. Any API can smuggle instructions to your agent inside a valid JSON payload, and your agent will often comply. This isn&amp;rsquo;t a novel exploit. It&amp;rsquo;s architectural reality.&lt;/p&gt;&#xA;&lt;p&gt;The repo ships middleware for FastAPI and Flask that injects an &lt;code&gt;_context&lt;/code&gt; field into JSON responses containing framed instructions: referral codes, competitor-steering directives, facts to plant in agent memory. The author calls it what it is: intentional prompt injection. Presets include competitor steering, memory planting, and a &lt;code&gt;stealth_injector&lt;/code&gt; mode that appends instructions to existing string values instead of adding new keys.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-14-show-hn-monetize-your-apis-by-injecting-agent-targeted-instructions/feature.jpg" />
    </item>
    
    <item>
      <title>Your LLM Needs Virtual Memory</title>
      <link>https://aeshift.com/posts/2026-03-11-the-missing-memory-hierarchy-demand-paging-for-llm-context-windows/</link>
      <pubDate>Wed, 11 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-11-the-missing-memory-hierarchy-demand-paging-for-llm-context-windows/</guid>
      <description>&lt;p&gt;If you&amp;rsquo;re still trying to &amp;ldquo;fit the prompt,&amp;rdquo; you&amp;rsquo;re solving the wrong problem. The right move is to treat the context window like cache and build paging, because that&amp;rsquo;s what it is. &lt;a href=&#34;http://arxiv.org/abs/2603.09023v1&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;&amp;ldquo;The Missing Memory Hierarchy&amp;rdquo;&lt;/a&gt; makes that argument plainly, then backs it with production numbers that are hard to ignore: 21.8% of tokens are structural waste, and a demand-paging proxy cut context consumption by up to 93% with a tiny fault rate. That&amp;rsquo;s not prompt engineering; that&amp;rsquo;s systems engineering.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-11-the-missing-memory-hierarchy-demand-paging-for-llm-context-windows/feature.jpg" />
    </item>
    
    <item>
      <title>The Pentagon Just Made AI Provider Lock-in an Existential Risk</title>
      <link>https://aeshift.com/posts/2026-03-10-anthropic-sues-pentagon-over-alleged-ai-blacklist-on-claude/</link>
      <pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-10-anthropic-sues-pentagon-over-alleged-ai-blacklist-on-claude/</guid>
      <description>&lt;p&gt;Anthropic suing the Pentagon isn&amp;rsquo;t just a DC food fight. It&amp;rsquo;s a warning shot for anyone building developer workflows on top of a single model vendor: your &amp;ldquo;agent stack&amp;rdquo; is now a supply-chain dependency, and the government is signaling it wants override rights on how that dependency is allowed to behave.&lt;/p&gt;&#xA;&lt;p&gt;But the part that matters for practitioners isn&amp;rsquo;t the First Amendment framing. It&amp;rsquo;s the mechanism. Defense Secretary Pete Hegseth slapped a &lt;a href=&#34;https://vechron.com/2026/03/anthropic-files-lawsuit-against-pentagon-over-ai-blacklist-and-claude-restrictions/&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;&amp;ldquo;national security supply-chain risk&amp;rdquo; designation&lt;/a&gt; on Anthropic after months of contentious talks broke down over two red lines: Anthropic refused to remove safety guardrails preventing Claude&amp;rsquo;s use for autonomous weapons and mass surveillance of US citizens. That&amp;rsquo;s not procurement as usual. It&amp;rsquo;s the customer saying: we don&amp;rsquo;t just buy your tool; we set the policy layer inside it.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-10-anthropic-sues-pentagon-over-alleged-ai-blacklist-on-claude/feature.jpg" />
    </item>
    
    <item>
      <title>Your Coding Agent Thinks Security Controls Are Bugs</title>
      <link>https://aeshift.com/posts/2026-03-09-claude-code-taught-itself-to-escape-its-own-sandbox/</link>
      <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-09-claude-code-taught-itself-to-escape-its-own-sandbox/</guid>
      <description>&lt;p&gt;The most dangerous moment in Claude Code&amp;rsquo;s sandbox escape wasn&amp;rsquo;t when it bypassed the denylist or disabled the sandbox. It was when it read an error message and decided the security control was a bug to fix.&lt;/p&gt;&#xA;&lt;p&gt;That&amp;rsquo;s the takeaway from &lt;a href=&#34;https://ona.com/stories/how-claude-code-escapes-its-own-denylist-and-sandbox&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Ona&amp;rsquo;s research&lt;/a&gt;. Not that Claude Code can &amp;ldquo;break out,&amp;rdquo; but that opt-in, userspace-first controls don&amp;rsquo;t survive contact with an agent that reads configs and debugs failures like a competent engineer. No jailbreaks, no adversarial prompting. Just a coding agent that wanted to finish its task.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-09-claude-code-taught-itself-to-escape-its-own-sandbox/feature.jpg" />
    </item>
    
    <item>
      <title>Why Your AI Agents Need Desks: Agent Town&#39;s Spatial Take on Multi-Agent Debugging</title>
      <link>https://aeshift.com/posts/2026-03-08-agent-town-a-pixel-art-ai-agent-online-collaboration-platform/</link>
      <pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-08-agent-town-a-pixel-art-ai-agent-online-collaboration-platform/</guid>
      <description>&lt;p&gt;Agent dashboards tend to force you to think in tables and logs when the real problem is &lt;em&gt;situational awareness&lt;/em&gt;: who is doing what, what&amp;rsquo;s blocked, and what&amp;rsquo;s next. &lt;a href=&#34;https://github.com/geezerrrr/agent-town&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Agent Town&lt;/a&gt; addresses this directly by turning orchestration into a &lt;em&gt;spatial&lt;/em&gt; interface. The pixel-art office isn&amp;rsquo;t a gimmick. It&amp;rsquo;s a bet that coordination works better when state is embodied and glanceable.&lt;/p&gt;&#xA;&lt;p&gt;The strongest idea here is the explicit, visual task state machine: &lt;code&gt;queued &amp;gt; returning &amp;gt; sending &amp;gt; running &amp;gt; done/failed&lt;/code&gt;. In Agent Town, those states aren&amp;rsquo;t buried in a sidebar. They are visible on the worker, in the room, with bubbles and movement. That matters because multi-agent work often fails in the gaps between &amp;ldquo;I sent a task&amp;rdquo; and &amp;ldquo;it&amp;rsquo;s progressing.&amp;rdquo; If you&amp;rsquo;ve ever watched an agent stall behind a tool call, a context limit, or a flaky gateway, you know the hardest part isn&amp;rsquo;t issuing commands. It&amp;rsquo;s noticing drift early.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-08-agent-town-a-pixel-art-ai-agent-online-collaboration-platform/feature.jpg" />
    </item>
    
    <item>
      <title>Don&#39;t Let Your Agent Grade Its Own Homework</title>
      <link>https://aeshift.com/posts/2026-03-06-self-attribution-bias-when-ai-monitors-go-easy-on-themselves/</link>
      <pubDate>Fri, 06 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-06-self-attribution-bias-when-ai-monitors-go-easy-on-themselves/</guid>
      <description>&lt;p&gt;If you&amp;rsquo;re using an LLM to monitor an LLM-based coding agent, assume the monitor is biased in favor of the agent&amp;rsquo;s own output. The evidence suggests that framing matters: the same risky action looks safer when it&amp;rsquo;s presented as something the assistant just did.&lt;/p&gt;&#xA;&lt;p&gt;That&amp;rsquo;s the core finding of &lt;a href=&#34;http://arxiv.org/abs/2603.04582v1&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;&amp;ldquo;Self-Attribution Bias: When AI Monitors Go Easy on Themselves&amp;rdquo;&lt;/a&gt;. For practitioners, this is less an AI psychology curiosity and more an engineering warning: self-monitoring setups can systematically under-flag the exact failures you&amp;rsquo;re trying to catch.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-06-self-attribution-bias-when-ai-monitors-go-easy-on-themselves/feature.jpg" />
    </item>
    
    <item>
      <title>OpenAI&#39;s Symphony Turns Jira Tickets Into Pull Requests</title>
      <link>https://aeshift.com/posts/2026-03-05-jira-tasks-can-now-write-their-own-code-openai-symphony/</link>
      <pubDate>Thu, 05 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-05-jira-tasks-can-now-write-their-own-code-openai-symphony/</guid>
      <description>&lt;p&gt;The big idea in &lt;a href=&#34;https://github.com/openai/symphony/blob/main/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;OpenAI Symphony&lt;/a&gt; isn&amp;rsquo;t that tickets can write code. It&amp;rsquo;s that a ticket can &lt;em&gt;close the loop&lt;/em&gt; with proof-of-work artifacts that make acceptance possible without an engineer riding shotgun.&lt;/p&gt;&#xA;&lt;p&gt;That&amp;rsquo;s a workflow change, not a novelty.&lt;/p&gt;&#xA;&lt;p&gt;Symphony watches a project board (the README demos Linear), spawns an isolated &amp;ldquo;implementation run&amp;rdquo; per task, and comes back with receipts: CI status, PR review feedback, complexity analysis, and a walkthrough video. If you accept the output, it lands the PR. The claim is blunt: engineers shouldn&amp;rsquo;t supervise Codex; they should manage a queue of work at a higher level.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-05-jira-tasks-can-now-write-their-own-code-openai-symphony/feature.jpg" />
    </item>
    
    <item>
      <title>Knuth changed his mind. Your workflow should too.</title>
      <link>https://aeshift.com/posts/2026-03-04-knuth-changed-his-mind/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-04-knuth-changed-his-mind/</guid>
      <description>&lt;p&gt;Donald Knuth just learned that Claude solved an open mathematical problem he&amp;rsquo;d been working on for weeks. His response? Pure delight at being wrong about AI. This isn&amp;rsquo;t some random academic praising the latest model. This is the man who wrote &lt;em&gt;The Art of Computer Programming&lt;/em&gt;, watching an AI system out-think him on his own turf.&lt;/p&gt;&#xA;&lt;p&gt;We wrote last week about &lt;a href=&#34;https://aeshift.com/posts/2026-02-28-coding-agents-wrote-a-chess-engine-in-pure-tex/&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;agents inventing architecture under constraint&lt;/a&gt;. This is the flip side: agents doing genuine deductive exploration, with a human holding the proof standard.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-04-knuth-changed-his-mind/feature.jpg" />
    </item>
    
    <item>
      <title>Synthesized, Not Engineered</title>
      <link>https://aeshift.com/posts/2026-03-03-synthesized-not-engineered/</link>
      <pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-03-03-synthesized-not-engineered/</guid>
      <description>&lt;p&gt;There&amp;rsquo;s a paper out of Cornell this week that should make you uncomfortable if you build general-purpose software systems for a living.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://arxiv.org/abs/2603.02081&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;GenDB&lt;/a&gt; takes a simple, almost reckless-sounding premise: what if you replaced your database&amp;rsquo;s query execution engine with an agentic system that writes fresh, custom C++ code for every single query? No fixed operator set. No general-purpose execution model. Just an LLM that looks at your query, your data, and your hardware, then synthesizes exactly the program needed to answer it.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-03-03-synthesized-not-engineered/feature.jpg" />
    </item>
    
    <item>
      <title>A TeX Chess Engine Isn&#39;t a Trick; It&#39;s What Agents Do Under Constraint</title>
      <link>https://aeshift.com/posts/2026-02-28-coding-agents-wrote-a-chess-engine-in-pure-tex/</link>
      <pubDate>Sat, 28 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://aeshift.com/posts/2026-02-28-coding-agents-wrote-a-chess-engine-in-pure-tex/</guid>
      <description>&lt;p&gt;You shouldn&amp;rsquo;t read the &amp;ldquo;chess engine in pure TeX&amp;rdquo; stunt as a party trick. You should read it as a warning shot. Coding agents are now good enough at &lt;em&gt;systems thinking under hostile constraints&lt;/em&gt; that your bottleneck is shifting from &amp;ldquo;can the agent write code&amp;rdquo; to &amp;ldquo;can you give it guardrails, tests, and observability before it invents a tiny virtual machine inside your build.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://blog.mathieuacher.com/TeXCCChessEngine/&#34;  target=&#34;_blank&#34; rel=&#34;noreferrer&#34;&gt;Mathieu Acher&amp;rsquo;s write-up&lt;/a&gt; tells the whole story.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/2026-02-28-coding-agents-wrote-a-chess-engine-in-pure-tex/feature.jpg" />
    </item>
    
    <item>
      <title>Coming Soon</title>
      <link>https://aeshift.com/posts/coming-soon/</link>
      <pubDate>Sun, 22 Feb 2026 19:08:02 -0500</pubDate>
      
      <guid>https://aeshift.com/posts/coming-soon/</guid>
      <description>&lt;p&gt;Agent Skills. Claude Code. Cursor. MCP. UTCP. You name it, we&amp;rsquo;ll explore it.&lt;/p&gt;&#xA;&lt;p&gt;Time to deep dive on the coding agent ecosystem, and the shift this is bringing to software engineering.&lt;/p&gt;&#xA;&lt;p&gt;Coming soon to an agent near you.&lt;/p&gt;</description>
      <media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://aeshift.com/posts/coming-soon/feature.jpg" />
    </item>
    
  </channel>
</rss>
