If you’re using coding agents to produce analysis, you’re not running deterministic software. You’re managing a lab: multiple researchers with consistent “styles,” inconsistent choices, and outcomes that drift even when the prompt and data don’t.
The authors of Nonstandard Errors in AI Agents ran 150 autonomous Claude Code agents on the same NYSE TAQ dataset (SPY, 2015–2024), testing the same six hypotheses. The results varied because the agents made different methodological choices, and those choices often are the analysis.
The paper borrows the term “nonstandard errors” (NSEs) from empirical economics, where human researchers make similarly divergent choices. In agent deployments, NSEs are the uncertainty created by agent-to-agent variation in analytical decisions. Two agents can both “work,” both produce clean code and plausible prose, and still disagree because one uses a variance ratio while another uses autocorrelation.
Model families develop stable “empirical styles.” Sonnet 4.6 and Opus 4.6 prefer different methodological paths given identical data and instructions. That means your results are partly a function of model choice in a way that won’t show up in unit tests. If you’re building agent pipelines for analytics, policy evaluation, or even internal KPI dashboards, you need to treat model selection like method selection, not like runtime selection.
For development teams, the lesson hits a different way. Most teams already mix models by task. Opus for the complex feature, Sonnet for the quick bugfix, Haiku for the boilerplate. That’s a reasonable cost optimization, but this research suggests it has a hidden cost: each model brings its own architectural instincts, and over time you’re layering different stylistic fingerprints into the same codebase. No single commit looks wrong. The inconsistency accumulates quietly, in how modules are decomposed, how errors are handled, which patterns get reached for.
This isn’t so different from what happens when multiple engineers work on the same codebase. Everyone has preferences, patterns they reach for, conventions they assume. The difference is that human teams had a natural correction mechanism: code review. Someone would comment “we don’t do it that way here” and the codebase stayed coherent. With the volume of agent-written code increasing, that review gets lighter and less frequent. The stylistic drift that peer review used to catch now accumulates faster than any team can read.
This creates testing problems no one has good answers for yet. For analysis pipelines: how do you write tests for agents that might change their entire analytical framework between versions? How do you handle rollbacks when the old model and new model aren’t just giving different answers but answering different questions? Standard regression testing catches output changes. It doesn’t catch methodology changes that produce equally plausible but incompatible results.
For codebases, the problem is subtler but compounds faster. No test will catch that Opus decomposes a module into three files while Sonnet would have used one, or that Tuesday’s bugfix agent handled errors differently than Monday’s feature agent. These aren’t failures. They’re stylistic divergences that make code harder to read, harder to maintain, and eventually harder to debug when assumptions from one style collide with assumptions from another.
If you were hoping to peer-review agents into convergence, the paper puts that idea to rest. The authors tried a three-stage protocol in which agents critiqued each other’s work. It had minimal effect on dispersion. What did reduce dispersion was showing agents exemplar papers: interquartile ranges dropped 80–99% within converging measure families.
But the authors are clear about what that convergence means: imitation, not understanding. Exemplars will make your agent outputs look more consistent. That consistency is not evidence you’ve built a reliable system. You may have just trained your workflow to reproduce a house style. Is that enough to ensure a stable, maintainable codebase? I’d love to see more research into this practical angle.
We wrote yesterday about agent skills mostly failing to improve real-world outcomes. Skills inject external guidance and agents ignore or fight it. This paper shows the flip side: even without external guidance, agents diverge on their own. The variance isn’t noise you can prompt away. It’s baked into the model.
The paper studied financial analysis, but the pattern presumably holds wherever agents make design choices: data pipeline construction, API integration, code architecture. Any task with unresolved degrees of freedom (which metric, which filter, which decomposition) will produce agent-to-agent variation that looks like disagreement among junior engineers given the same spec.
For teams running analysis pipelines, run multiple agents (or multiple seeds and configs) and measure spread as a first-class metric. If the spread is large, that’s not noise. It’s telling you your task contains unresolved degrees of freedom. Your pipeline should surface them and force them into explicit configuration, review, or pre-registered defaults.
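One minimal way to make spread a first-class metric: collect the point estimates each agent produced for the same hypothesis and report the interquartile range alongside the median. The sketch below is illustrative, not the paper's protocol; the estimate values and the 25%-of-median threshold are assumptions.

```python
import statistics

def spread(estimates):
    """Return (median, IQR) across per-agent point estimates."""
    q1, q2, q3 = statistics.quantiles(estimates, n=4)
    return q2, q3 - q1

# Hypothetical numbers: the point estimates eight agents produced for the
# same hypothesis on the same data (illustrative, not from the paper).
agent_estimates = [0.38, 0.39, 0.40, 0.42, 0.44, 0.47, 0.58, 0.71]

median, iqr = spread(agent_estimates)
print(f"median={median:.3f}  IQR={iqr:.3f}")

# A wide IQR relative to the estimate itself flags unresolved degrees of
# freedom in the task spec, not sampling noise.
if iqr > 0.25 * abs(median):
    print("dispersion high: surface the methodological choices explicitly")
```

The point of the threshold check is procedural, not statistical: a run whose dispersion trips it should route the underlying choices to review rather than ship any single agent's answer.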
For teams writing code with agents, pick a model per project and stick with it, or invest heavily in convention enforcement. Style guides, linters, and project-level instructions (CLAUDE.md, AGENTS.md, or equivalent) can override some of a model’s default instincts. The goal is to make the project’s conventions louder than the model’s preferences. That won’t eliminate stylistic drift, but it narrows the band.
Treat exemplar-based alignment as a sharp tool in either context. Use it to enforce organizational standards when you already know the method you want. Don’t use it as a substitute for method selection, and don’t call the resulting tight cluster “robust.”
If you want reproducibility from agents, you’ll get it the same way you get it from humans: constrain the decision space, log the choices, and make variance visible. Otherwise you’re scaling up nonstandard errors with more compute.