If you want LLMs in production software workflows, Sashiko makes the argument that review is the place to start, not generation. An engineer at Google is putting that theory to the test on the Linux kernel. The early numbers are interesting. Whether they hold up under scrutiny is less clear.
Roman Gushchin’s headline stat: Sashiko caught 53% of bugs in an unfiltered set of 1,000 recent upstream kernel issues (identified by Fixes: tags), all of which had been missed by human reviewers. That’s not a claim of superhuman code understanding. It’s a claim about coverage, specifically incremental coverage on the failure mode kernel maintainers care about most: regressions that make it into mainline.
Kernel review is already a game of incomplete attention. Humans are great at “does this smell right?” and “does this violate subsystem norms?” but terrible at exhaustive mental execution across edge cases, weird call paths, and implicit contracts. An LLM-based reviewer doesn’t need to be perfect. It needs to look in different places than the humans did. 53% on bugs that slipped through sounds like additive quality, not replacement quality.
The argument for reviewer-side AI over author-side is intuitive. When an LLM writes code, you’re trusting it to understand context, requirements, and edge cases. When it reviews code, you’re trusting it to spot patterns that correlate with bugs. That’s a narrower ask. But narrower doesn’t automatically mean better in practice; it depends on the cost of the tool’s mistakes.
That’s where the case gets weaker. Sashiko’s false-positive rate is, by its author’s own admission, “harder to measure.” Based on limited manual reviews, Gushchin claims it’s within 20 percent, with the majority falling into a “gray zone.” That gray zone is the expensive part. A clear false positive is quick to dismiss. A clear bug is quick to act on. An ambiguous flag forces a maintainer to dig in, reason about context, and make the judgment call the tool was supposed to help with.
The 53% recall stat was measured against 1,000 known-buggy issues. In production, Sashiko reviews every patch, and most patches are fine. The source doesn’t say how many flags a typical review cycle generates across all patches, buggy and clean. That denominator determines whether this feels like useful signal or a new category of noise. Open source projects have been pushing back on AI-generated PRs precisely because they shift cognitive burden onto already-overloaded maintainers. A review tool that generates ambiguous flags risks the same dynamic, but at a different point in the workflow.
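The base-rate arithmetic behind that denominator worry can be made concrete. The sketch below combines the two numbers the source does quote (53% recall, a false-positive rate “within 20 percent”) with a purely hypothetical bug prevalence among incoming patches; the interpretation of the 20% figure as a per-clean-patch flag rate is also an assumption, since the source doesn’t define it.

```python
# Base-rate sketch: how recall and false-positive rate combine into
# flag precision across a whole patch stream. Only the recall and
# fp_rate values come from the source; prevalence is hypothetical.

def flag_precision(prevalence: float, recall: float, fp_rate: float) -> float:
    """Fraction of raised flags that point at a real bug."""
    true_flags = prevalence * recall          # buggy patches caught
    false_flags = (1 - prevalence) * fp_rate  # clean patches flagged anyway
    return true_flags / (true_flags + false_flags)

# Quoted: 53% recall, false positives "within 20 percent".
# Assumed: 5% of incoming patches carry a real bug.
p = flag_precision(prevalence=0.05, recall=0.53, fp_rate=0.20)
print(f"{p:.0%}")  # roughly 12% of flags would be real bugs
```

Under those assumptions, roughly one flag in eight points at a real bug, and the rest land on a maintainer’s desk as noise or gray-zone judgment calls. The exact number is hostage to the unstated prevalence, which is precisely the missing denominator.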
Whether Sashiko clears that bar depends partly on integration. It ingests patches from the mailing list and reports back to maintainers, plugging into the kernel’s actual workflow (email-first, social hierarchy, distributed ownership) without asking contributors to change how they submit code. That’s the right instinct. But “advisory signal” only works if the signal-to-noise ratio is high enough that maintainers don’t learn to ignore it. The source’s own hedging (“harder to measure,” “limited manual reviews,” “gray zone”) suggests that question is still open.
Two practical caveats. Sashiko “sends data and code to whatever LLM provider it has been configured for.” Fine for LKML, where patches are public, but a hard constraint for corporate codebases. And Google is footing the LLM bill for now, a reminder that “AI review for everything” isn’t free. If you want this pattern in your org, budget for it like CI: always on, spiky usage, and politically painful to turn off once people rely on it.
We wrote earlier this week about AI agents developing stable “coding styles” that change with each version. That research showed agents introduce consistent biases into codebases, and the traditional correction mechanism (code review) is getting lighter as agent-written code increases. Sashiko is testing whether AI can backfill the review capacity that AI-generated volume is eroding. The theory is sound. The evidence is early, self-reported, and incomplete on the question that matters most: what does the maintainer experience actually look like when the flags start arriving?