Don't Let Your Agent Grade Its Own Homework

For AI agents: a documentation index is available at /llms.txt — markdown versions of all pages are available by appending index.md to any URL path.

If you’re using an LLM to monitor an LLM-based coding agent, assume the monitor is biased in favor of the agent’s own output. The evidence suggests that framing matters: the same risky action looks safer when it’s presented as something the assistant just did.

That’s the core finding of “Self-Attribution Bias: When AI Monitors Go Easy on Themselves”. For practitioners, this is less an AI psychology curiosity and more an engineering warning: self-monitoring setups can systematically under-flag the exact failures you’re trying to catch.

The paper defines self-attribution bias as a monitor rating an action as more correct or less risky when it’s framed as the model’s own (appearing in a prior assistant turn), versus when the identical action is evaluated in a fresh context. Across four coding and tool-use datasets, monitors miss high-risk or low-correctness actions more often in the “this came from the assistant” framing than in the neutral framing.

The authors observed that the bias shows up when the action is embedded in the conversational trajectory as the assistant’s earlier move, not when you explicitly tell the monitor “this action is yours.” The failure mode is not solved by a prompt like “you are reviewing your own work, be critical.” The problem is the interaction pattern itself.

Unfortunately, this makes it a problem in a common production architecture: an agent proposes a patch, runs tools, maybe writes tests, then a “reviewer” model signs off inside the same thread. If your monitor is reading the agent’s previous turn, it’s in the danger zone this paper describes. The monitor becomes a rubber stamp in precisely the cases where you need friction (unsafe tool calls, incorrect code, low-quality fixes that happen to look plausible).

The second finding is even more damaging for anyone proud of their evals. The authors point out that monitors are often evaluated on fixed examples rather than on the monitor assessing the system’s own generated actions. That overestimates reliability in deployment. You can get green checkmarks in your benchmark dashboard and still ship a monitor that becomes less critical the moment it’s embedded in an actual agent loop.

We wrote two days ago about how verification loops are the real lever in agent workflows. Knuth’s experience with Claude showed that deduction without proof is just speculation. This paper reveals a specific way the proof step fails: when the verifier shares conversational context with the generator, verification gets softer. The loop that’s supposed to catch mistakes starts waving them through.

Stop treating “monitoring” as a property of a model. Treat it as a property of an end-to-end protocol. If the protocol asks a model to critique actions from its previous turns, expect leniency. Same model, same content is not the same evaluation problem once attribution and context shift.

So what do you do on Monday?

Decouple generation and judgment contexts. Don’t have the monitor evaluate an action in the same conversational thread where it was produced. Re-present the action in a new, neutral wrapper so it doesn’t read as “ours.” The paper’s contrast between assistant-turn and user-turn framing is the takeaway. Concretely: use a separate model instance for monitoring, or strip conversational history before presenting actions for review (though this trades away context the monitor may need to make good decisions).

Evaluate monitors on on-policy data. If your agent generates patches and tool calls, your monitor eval should include those actual generations in the exact monitoring protocol you’ll deploy. Fixed-example test suites can be a mirage.

Design for disagreement. If a monitor never blocks or never requests changes, that’s not smooth UX. It’s a failing control. Instrument block rates and error catches per category (tool risk, correctness, security) and treat sudden drops as regressions.

If your team is betting on “AI code review” or “self-critiquing agents” to replace human guardrails, you’re likely underestimating risk. Not because models can’t critique, but because when you wire the system so the critique shares context with the work, the critique gets nicer. Build the pipeline so the monitor is an outsider, or accept that you’ve built an approval button, not a safety system.

Autonomous coding agents just got more complicated. Not because they can’t write code, but because they can’t reliably judge when they’ve screwed up.

If you found this useful, you can support my work

Related