
Sashiko shows AI code review works by doing less, not more

·4 mins

If you want LLMs in production software workflows, Sashiko makes the argument that review is the place to start, not generation. An engineer at Google is putting that theory to the test on the Linux kernel. The early numbers are interesting. Whether they hold up under scrutiny is less clear.

Roman Gushchin’s headline stat: Sashiko caught 53% of bugs in an unfiltered set of 1,000 recent upstream kernel issues (identified by Fixes: tags), all of which had been missed by human reviewers. That’s not a claim of superhuman code understanding. It’s a claim about coverage, specifically incremental coverage on the failure mode kernel maintainers care about most: regressions that make it into mainline.

Kernel review is already a game of incomplete attention. Humans are great at “does this smell right?” and “does this violate subsystem norms?” but terrible at exhaustive mental execution across edge cases, weird call paths, and implicit contracts. An LLM-based reviewer doesn’t need to be perfect. It needs to look in different places than the humans did. 53% on bugs that slipped through sounds like additive quality, not replacement quality.

The argument for reviewer-side AI over author-side is intuitive. When an LLM writes code, you’re trusting it to understand context, requirements, and edge cases. When it reviews code, you’re trusting it to spot patterns that correlate with bugs. That’s a narrower ask. But narrower doesn’t automatically mean better in practice; it depends on the cost of the tool’s mistakes.

That’s where the case gets weaker. Sashiko’s false-positive rate is, by the authors’ own admission, “harder to measure.” Based on limited manual reviews, Gushchin claims it’s within 20 percent, with the majority falling into a “gray zone.” That gray zone is the expensive part. A clear false positive is quick to dismiss. A clear bug is quick to act on. An ambiguous flag forces a maintainer to dig in, reason about context, and make the judgment call the tool was supposed to help with.

The 53% recall stat was measured against 1,000 known-buggy issues. In production, Sashiko reviews every patch, and most patches are fine. The source doesn’t say how many flags a typical review cycle generates across all patches, buggy and clean. That denominator determines whether this feels like useful signal or a new category of noise. Open source projects have been pushing back on AI-generated PRs precisely because they shift cognitive burden onto already-overloaded maintainers. A review tool that generates ambiguous flags risks the same dynamic, but at a different point in the workflow.
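The base-rate arithmetic behind that worry is easy to sketch. Treat the 53% as recall, then pick two numbers the source does not report: a per-patch false-flag rate on clean patches (using the "within 20 percent" figure as a stand-in, though it was measured differently) and a hypothetical fraction of incoming patches that are actually buggy. Both assumptions are for illustration only; the point is how hard the clean-patch denominator drags precision down.

```python
def expected_precision(recall: float, false_flag_rate: float, base_rate: float) -> float:
    """Fraction of flags that point at real bugs, given:
    recall          -- share of buggy patches the tool flags
    false_flag_rate -- share of clean patches the tool (wrongly) flags
    base_rate       -- share of incoming patches that are actually buggy
    """
    true_flags = recall * base_rate             # buggy patches correctly flagged
    false_flags = false_flag_rate * (1 - base_rate)  # clean patches wrongly flagged
    return true_flags / (true_flags + false_flags)

# Hypothetical scenario: 53% recall (reported), 20% false-flag rate on clean
# patches (assumed), and 5% of patches carrying a real bug (assumed).
precision = expected_precision(recall=0.53, false_flag_rate=0.20, base_rate=0.05)
print(f"{precision:.0%} of flags would point at a real bug")  # roughly 12%
```

Under those assumed numbers, only about one flag in eight is a real bug; the rest is the "gray zone" maintainers have to adjudicate. The exercise is sensitive to the base rate, which is exactly why the missing denominator in the source matters.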

Whether Sashiko clears that bar depends partly on integration. It ingests patches from the mailing list and reports back to maintainers, plugging into the kernel’s actual workflow (email-first, social hierarchy, distributed ownership) without asking contributors to change how they submit code. That’s the right instinct. But “advisory signal” only works if the signal-to-noise ratio is high enough that maintainers don’t learn to ignore it. The source’s own hedging (“harder to measure,” “limited manual reviews,” “gray zone”) suggests that question is still open.

Two practical caveats. Sashiko “sends data and code to whatever LLM provider it has been configured for.” Fine for LKML, where patches are public, but a hard constraint for corporate codebases. And Google is footing the LLM bill for now, a reminder that “AI review for everything” isn’t free. If you want this pattern in your org, budget for it like CI: always on, spiky usage, and politically painful to turn off once people rely on it.

We wrote earlier this week about AI agents developing stable “coding styles” that change with each version. That research showed agents introduce consistent biases into codebases, and the traditional correction mechanism (code review) is getting lighter as agent-written code increases. Sashiko is testing whether AI can backfill the review capacity that AI-generated volume is eroding. The theory is sound. The evidence is early, self-reported, and incomplete on the question that matters most: what does the maintainer experience actually look like when the flags start arriving?

Related

The Pentagon Just Made AI Provider Lock-in an Existential Risk

·4 mins
Anthropic suing the Pentagon isn’t just a DC food fight. It’s a warning shot for anyone building developer workflows on top of a single model vendor: your “agent stack” is now a supply-chain dependency, and the government is signaling it wants override rights on how that dependency is allowed to behave. But the part that matters for practitioners isn’t the First Amendment framing. It’s the mechanism. Defense Secretary Pete Hegseth slapped a “national security supply-chain risk” designation on Anthropic after months of contentious talks broke down over two red lines: Anthropic refused to remove safety guardrails preventing Claude’s use for autonomous weapons and mass surveillance of US citizens. That’s not procurement as usual. It’s the customer saying: we don’t just buy your tool; we set the policy layer inside it.

Don't Let Your Agent Grade Its Own Homework

·4 mins
If you’re using an LLM to monitor an LLM-based coding agent, assume the monitor is biased in favor of the agent’s own output. The evidence suggests that framing matters: the same risky action looks safer when it’s presented as something the assistant just did. That’s the core finding of “Self-Attribution Bias: When AI Monitors Go Easy on Themselves”. For practitioners, this is less an AI psychology curiosity and more an engineering warning: self-monitoring setups can systematically under-flag the exact failures you’re trying to catch.

Rover Makes Websites the Agent Runtime

·3 mins
Rover’s approach to AI agents is backwards, and that’s exactly right. Most “agents for the web” demos die in the gap between a model that can click things and a system you can depend on. Rover tries to close that gap by making the web page itself the integration boundary: no screenshots, no remote VM, no Playwright harness you own, no bespoke MCP server per app. In their words: “the page is the API.” The product is the protocol: POST /v1/tasks with a URL and a prompt, then stream progress via SSE or poll for results. That’s a clean contract practitioners can build against.