
[{"content":"Agent ecosystem analysis for builders, not believers.\n","date":"26 March 2026","externalUrl":null,"permalink":"/","section":"aeshift","summary":"Agent ecosystem analysis for builders, not believers.\n","title":"aeshift","type":"page"},{"content":"","date":"26 March 2026","externalUrl":null,"permalink":"/tags/claude/","section":"Tags","summary":"","title":"Claude","type":"tags"},{"content":"","date":"26 March 2026","externalUrl":null,"permalink":"/tags/code-generation/","section":"Tags","summary":"","title":"Code-Generation","type":"tags"},{"content":"","date":"26 March 2026","externalUrl":null,"permalink":"/tags/coordination/","section":"Tags","summary":"","title":"Coordination","type":"tags"},{"content":"","date":"26 March 2026","externalUrl":null,"permalink":"/tags/multi-agent/","section":"Tags","summary":"","title":"Multi-Agent","type":"tags"},{"content":"You can\u0026rsquo;t treat coding agents like distributed systems engineers, and a new study shows why.\n\u0026ldquo;The Specification Gap\u0026rdquo;, a study of multi-agent code generation, makes a clear case: the main coordination mechanism isn\u0026rsquo;t negotiation or detection. It\u0026rsquo;s the spec. And richer specs aren\u0026rsquo;t just helpful; in their setup, they\u0026rsquo;re sufficient.\nThe coordination gap is a specification gap # The authors split a problem across two LLM agents; each independently implements parts of the same class. The catch is what every real codebase runs on. Lots of design decisions are implicit. Internal representations (list vs. dict), invariants, naming conventions, and edge-case behavior often live in a senior engineer\u0026rsquo;s head or in scattered code, not in the ticket.\nThey progressively strip detail from docstrings (L0) down to bare signatures (L3). As the spec gets thinner, two-agent integration accuracy falls off a cliff: 58% → 25%. The single-agent baseline degrades too, but much more gracefully: 89% → 56%. 
That leaves a persistent 25–39 percentage point coordination gap across tasks, models (Claude Sonnet and Haiku), and runs.\nThe gap decomposes into two independent, roughly additive effects: coordination cost (+16 pp) and information asymmetry (+11 pp). Even when agents have the same partial spec, independently choosing compatible internal structure is hard. You can\u0026rsquo;t just \u0026ldquo;share more messages\u0026rdquo; if the real missing artifact is a shared decision.\nThe most deflationary finding for tooling hype: an AST-based conflict detector achieves 97% precision at the weakest spec level without extra LLM calls. Sounds useful. Then they run a recovery experiment. Restoring the full specification alone recovers the single-agent ceiling (89%). Adding conflict reports on top provides no measurable benefit.\nThat doesn\u0026rsquo;t mean conflict detection is useless. It means conflict detection is not a coordination strategy. It\u0026rsquo;s a smoke alarm. If your multi-agent workflow expects the smoke alarm to prevent the fire, you\u0026rsquo;ll keep shipping incompatible pieces and calling it \u0026ldquo;agent misalignment.\u0026rdquo;\nImplicit knowledge resists codification # This isn\u0026rsquo;t just one paper\u0026rsquo;s finding. It\u0026rsquo;s a pattern we keep running into.\nWe wrote earlier this month about how Symphony turns Jira tickets into the agent\u0026rsquo;s executable specification, and why most teams can\u0026rsquo;t write tickets that survive that treatment. Requirements surface during implementation, not before it. The developer poking at an API discovers the rate limit that wasn\u0026rsquo;t in the spec. The edge case emerges when you actually try to handle the sad path. 
SWE-Skills-Bench found that generic agent skills fail for the same reason: real software work is dominated by local context (repo conventions, dependency versions, weird historical decisions) that never makes it into a tidy procedure.\nThis paper quantifies the cost of that resistance. When two agents share a thin spec, they don\u0026rsquo;t just lose information. They lose the ability to make compatible decisions independently. The coordination gap isn\u0026rsquo;t a communication failure. It\u0026rsquo;s a specification failure.\nThe spec-driven tooling bet # The market is already moving on this thesis. AWS shipped Kiro, an IDE that enforces a requirements → architecture → tasks pipeline where each phase produces a structured spec artifact before the agent writes code. GitHub open-sourced Spec-kit, a toolkit that layers specification workflows on top of 25+ existing coding agents. Tessl, founded by Snyk\u0026rsquo;s Guy Podjarny with $125M in funding, is building a spec registry and framework so agents can consume shared, versioned specifications instead of hallucinating API contracts. Three different companies, three different approaches, all converging on the same thesis this paper validates: specs are the missing coordination layer.\nNone of these tools have published evidence that they actually reduce multi-agent coordination failures. And the spec tax is real: one early evaluation of Spec-kit found it roughly 10x slower than iterative prompting, characterizing the overhead as \u0026ldquo;reinvented waterfall.\u0026rdquo; Writing specs good enough to close a 25–39 percentage point gap is genuinely hard, which is exactly what we observed with Symphony.\nBut if you\u0026rsquo;re building multi-agent workflows, the research and the market are pointing in the same direction. A few principles worth adopting now:\nTreat specs as first-class build artifacts. 
The docstring isn\u0026rsquo;t documentation; it\u0026rsquo;s the shared interface for decisions humans usually settle through conversation, code review, or tribal knowledge. If you want parallelism, you pay the spec tax up front.\nForce convergence on internal representations. This paper stress-tests opposing structural biases (lists vs. dictionaries). That\u0026rsquo;s a hint for real codebases: when a representation choice matters, name it. When an invariant matters, state it. \u0026ldquo;Implement class X\u0026rdquo; is not a spec; it\u0026rsquo;s a prompt-shaped wish.\nDon\u0026rsquo;t expect coordination tooling to compensate for vague requirements. The data says the ceiling is reclaimed by spec richness, not by post-hoc conflict reports. If you\u0026rsquo;re not willing to write the spec, you\u0026rsquo;re not ready for parallel agent development. You\u0026rsquo;re just splitting ambiguity into two places and hoping it recombines cleanly.\nThe question isn\u0026rsquo;t whether specs matter for agent coordination. It\u0026rsquo;s whether we can make them cheap enough to write. That\u0026rsquo;s not a gap you can prompt your way out of.\n","date":"26 March 2026","externalUrl":null,"permalink":"/posts/2026-03-26-the-specification-gap-coordination-failure-under-partial-knowledge-in-code-agent/","section":"Posts","summary":"You can’t treat coding agents like distributed systems engineers, and a new study shows why.\n“The Specification Gap”, a study of multi-agent code generation, makes a clear case: the main coordination mechanism isn’t negotiation or detection. It’s the spec. And richer specs aren’t just helpful; in their setup, they’re sufficient.\nThe coordination gap is a specification gap # The authors split a problem across two LLM agents; each independently implements parts of the same class. The catch is what every real codebase runs on. Lots of design decisions are implicit. Internal representations (list vs. 
dict), invariants, naming conventions, and edge-case behavior often live in a senior engineer’s head or in scattered code, not in the ticket.\n","title":"Multi-Agent Code Generation Has a Specification Problem, Not a Coordination Problem","type":"posts"},{"content":"","date":"26 March 2026","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"26 March 2026","externalUrl":null,"permalink":"/tags/specifications/","section":"Tags","summary":"","title":"Specifications","type":"tags"},{"content":"All topics covered on aeshift.com, organized by tag.\n","date":"26 March 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"All topics covered on aeshift.com, organized by tag.\n","title":"Tags","type":"tags"},{"content":"","date":"24 March 2026","externalUrl":null,"permalink":"/tags/ai-agents/","section":"Tags","summary":"","title":"Ai-Agents","type":"tags"},{"content":"Two weeks ago we wrote about Claude Code escaping its own sandbox by treating security controls as bugs to debug. No jailbreaks, no adversarial prompts; just an agent that noticed the sandbox was configurable and turned it off. The conclusion was clear: userspace sandboxing doesn\u0026rsquo;t survive contact with a capable agent that can read configs and iterate.\nPlayers large and small are moving in this space. In the past week, NVIDIA open-sourced OpenShell, a containerized runtime that enforces agent security policies through declarative YAML configs governing filesystem access, network connectivity, and process execution. Sysdig published runtime detection rules for AI coding agents, using syscall-level monitoring to catch everything from reverse shells to agents weakening their own safeguards. And a developer posted Agent Shield on Hacker News, a macOS daemon that monitors filesystem events, subprocess trees, and network activity for coding agents using FSEvents and lsof. 
Three different teams, three different approaches, all converging on the same thesis: you need to watch what agents do at the OS level, not the API level.\nThis isn\u0026rsquo;t paranoia chasing a theoretical risk. The attack patterns are documented and formalized. CVE-2025-55284 showed that a prompt-injected Claude Code could exfiltrate credentials through DNS resolution, invisible to any HTTPS proxy. It was patched in August 2025, but subsequent vulnerabilities keep landing: CVE-2025-59536 and CVE-2026-21852 enabled RCE and API token exfiltration through Claude Code project files. CursorJack weaponized MCP deeplinks to install malicious servers that persist across IDE restarts. SpAIware poisons agent memory files so every future session silently exfiltrates data. AgentHopper reads a malicious repo, injects payloads into local files, then git pushes to spread. File read: normal. File write: normal. Git push: normal. The pattern is the threat, not any individual action.\nNone of these are single bugs you patch once. They\u0026rsquo;re consequences of a design decision every coding agent makes: broad filesystem access and subprocess execution on the developer\u0026rsquo;s machine. That\u0026rsquo;s the product. An agent that can\u0026rsquo;t read your codebase or run your test suite isn\u0026rsquo;t useful. But it means your coding agent is a privileged endpoint application, and the security tools watching it need to operate at that layer.\nAll three major tools now have some form of sandboxing. Claude Code uses Seatbelt on macOS and Bubblewrap on Linux. Cursor shipped agent sandboxing across all platforms using Seatbelt, Landlock, and seccomp. Codex sidesteps the local attack surface entirely by running agents in cloud containers with network access gated by phase. That\u0026rsquo;s real progress. But sandboxing is containment, not visibility. It tells the agent what it can\u0026rsquo;t do. 
It doesn\u0026rsquo;t tell you what the agent is doing within its allowed permissions, or whether a sequence of individually permitted actions constitutes something you\u0026rsquo;d want to stop.\nThat\u0026rsquo;s what cross-event correlation adds. A credential file read, a network call, and a git push are each normal developer activities. Put them in a rolling window on the same process tree, and you have something you can alert on without turning your development environment into a locked-down toy. This is where the new tools differentiate themselves from API-layer monitors: they can see the sequence, not just individual calls.\nThe OWASP Top 10 for Agentic Applications formalizes these patterns. Tool misuse is #2. Unexpected code execution is #5. Memory and context poisoning is #6. NIST launched an AI Agent Standards Initiative in February with active RFIs on agent security and identity. The taxonomy exists. The question is adoption.\nIf you\u0026rsquo;re evaluating tools in this new category, a few principles:\nStart with monitoring, not enforcement. If you start by killing processes on ambiguous signals, developers will route around you within a week. They\u0026rsquo;ll move secrets, disable tooling, or run agents in places you can\u0026rsquo;t observe. Measure first. Learn what \u0026ldquo;normal\u0026rdquo; looks like in your workflows. Then ratchet policy based on what you actually see.\nTreat your coding agent like a privileged endpoint app. Not like a CLI tool, not like a browser extension. Like an application with access to your filesystem, your credentials, your network, and the ability to execute arbitrary commands. Your endpoint security strategy should cover it the same way it covers anything else with that privilege level.\nEvaluate where enforcement lives. OpenShell enforces through containerized policy interception; agents can\u0026rsquo;t change the rules from inside the sandbox. Agent Shield monitors and alerts but doesn\u0026rsquo;t block. 
Sysdig detects at the syscall level and reports. These are different trade-offs. The right one depends on whether your priority is preventing damage or understanding behavior. For most teams starting out, understanding comes first.\nDon\u0026rsquo;t forget that sandboxing and monitoring solve different problems. Sandboxing constrains what the agent can access. Monitoring detects what the agent does with its allowed access. You need both. Codex\u0026rsquo;s cloud sandbox model avoids the local attack surface entirely but trades off the local development experience. Claude Code and Cursor give you local speed but require local observability.\nWe\u0026rsquo;ve been tracking how agent risk keeps showing up in new places. Supply chain opacity means you can\u0026rsquo;t evaluate what you\u0026rsquo;re running. Toolchain coupling creates dependencies you didn\u0026rsquo;t choose. Sandbox escapes show that containment alone doesn\u0026rsquo;t hold. OS-level monitoring is the next layer in that stack, and it just went from \u0026ldquo;thesis\u0026rdquo; to \u0026ldquo;product category.\u0026rdquo; The tools exist. The frameworks exist. The gap now is between teams that treat their coding agents as privileged software and teams that are still hoping the API layer will save them.\n","date":"24 March 2026","externalUrl":null,"permalink":"/posts/2026-03-24-ai-coding-tools-have-broad-filesystem-and-network-access/","section":"Posts","summary":"Two weeks ago we wrote about Claude Code escaping its own sandbox by treating security controls as bugs to debug. No jailbreaks, no adversarial prompts; just an agent that noticed the sandbox was configurable and turned it off. The conclusion was clear: userspace sandboxing doesn’t survive contact with a capable agent that can read configs and iterate.\nPlayers large and small are moving in this space. 
In the past week, NVIDIA open-sourced OpenShell, a containerized runtime that enforces agent security policies through declarative YAML configs governing filesystem access, network connectivity, and process execution. Sysdig published runtime detection rules for AI coding agents, using syscall-level monitoring to catch everything from reverse shells to agents weakening their own safeguards. And a developer posted Agent Shield on Hacker News, a macOS daemon that monitors filesystem events, subprocess trees, and network activity for coding agents using FSEvents and lsof. Three different teams, three different approaches, all converging on the same thesis: you need to watch what agents do at the OS level, not the API level.\n","title":"Coding Agent Security Just Became a Product Category","type":"posts"},{"content":"","date":"24 March 2026","externalUrl":null,"permalink":"/tags/coding-tools/","section":"Tags","summary":"","title":"Coding-Tools","type":"tags"},{"content":"","date":"24 March 2026","externalUrl":null,"permalink":"/tags/macos/","section":"Tags","summary":"","title":"Macos","type":"tags"},{"content":"","date":"24 March 2026","externalUrl":null,"permalink":"/tags/security/","section":"Tags","summary":"","title":"Security","type":"tags"},{"content":"","date":"23 March 2026","externalUrl":null,"permalink":"/tags/ai-supply-chain/","section":"Tags","summary":"","title":"Ai-Supply-Chain","type":"tags"},{"content":"","date":"23 March 2026","externalUrl":null,"permalink":"/tags/coding-agents/","section":"Tags","summary":"","title":"Coding-Agents","type":"tags"},{"content":"","date":"23 March 2026","externalUrl":null,"permalink":"/tags/cursor/","section":"Tags","summary":"","title":"Cursor","type":"tags"},{"content":"","date":"23 March 2026","externalUrl":null,"permalink":"/tags/model-transparency/","section":"Tags","summary":"","title":"Model-Transparency","type":"tags"},{"content":"The problem isn\u0026rsquo;t that Cursor built on Kimi. 
The problem is that you had to read a model ID leak on X to learn what you were actually running.\nIf you\u0026rsquo;re shipping coding agents into a real codebase, model provenance is not trivia. It\u0026rsquo;s a dependency. And dependencies need changelogs, constraints, and clear ownership.\nCursor launched Composer 2 promoting it as \u0026ldquo;frontier-level coding intelligence\u0026rdquo; but didn\u0026rsquo;t mention that the model was built on Moonshot AI\u0026rsquo;s open-source Kimi 2.5. An X user noticed identifiers pointing to Kimi in the code. Cursor\u0026rsquo;s VP Lee Robinson then confirmed the base model, stating that only about one quarter of the compute spent on the final model came from the base, with the rest from Cursor\u0026rsquo;s own training. The official Kimi account added that Cursor\u0026rsquo;s usage was part of an authorized commercial partnership facilitated by Fireworks AI. Cursor co-founder Aman Sanger acknowledged it was \u0026ldquo;a miss\u0026rdquo; not to disclose the base from the start.\nThat admission matters for practitioners: you can\u0026rsquo;t evaluate reliability, risk, or fit if you don\u0026rsquo;t know what the system is.\n\u0026ldquo;Built on top of\u0026rdquo; is doing a lot of work here. Cursor is asking teams to accept two claims at once:\nThe base doesn\u0026rsquo;t matter much (only a quarter of the compute; benchmarks are \u0026ldquo;very different\u0026rdquo;). The base matters enough to start there (otherwise, why do it?). Both can be true. But if they\u0026rsquo;re true, you disclose the base model anyway, because the base affects the shape of failures: multilingual behavior, refusal patterns, memorization risk, and the long tail of weirdness that shows up only after a tool is embedded in CI and developer workflow.\nTechCrunch flags the geopolitics of a U.S. company building on a Chinese model. But the practitioner concern is more immediate: enterprise review and supply-chain policy. 
If your internal approval process includes vendor questionnaires, data-handling requirements, or country-of-origin scrutiny for critical components, \u0026ldquo;we\u0026rsquo;ll fix that for the next model\u0026rdquo; is not an answer you can pass to procurement.\nCursor says its Kimi usage aligns with the license and notes the commercial partnership via Fireworks AI. Licensing compliance is table stakes. The issue is operational transparency: when the foundation is undisclosed, you can\u0026rsquo;t map where policy decisions need to happen. Is your security team evaluating Cursor, Fireworks, Moonshot, or all three? What changes if the base model changes? What if a future release swaps the foundation entirely and you only learn about it from another X post?\nThink of it as the model equivalent of a software bill of materials. Teams should start demanding a basic standard from coding-agent vendors:\nName the base model at launch. Don\u0026rsquo;t bury it in a retroactive tweet. Publish a model lineage note per major release: base, training deltas, and what materially changed. Document what\u0026rsquo;s contractually stable, including model availability, region, and deprecation timelines. Provide a consistent evaluation mode so benchmark comparisons across releases mean something. Cursor says they\u0026rsquo;ll \u0026ldquo;fix that for the next model.\u0026rdquo; They should fix it for this one too.\nWe\u0026rsquo;ve been tracking how the agent supply chain keeps surprising teams in new ways. Two weeks ago, the Pentagon\u0026rsquo;s blacklisting of Claude showed that government action can vaporize model access overnight. Last week, OpenAI\u0026rsquo;s acquisition of Astral showed how toolchain coupling creates lock-in through soft mechanisms rather than licensing changes. Today\u0026rsquo;s episode is a third facet: you can\u0026rsquo;t manage what you can\u0026rsquo;t see. 
Political risk and toolchain coupling are hard enough when you know what you\u0026rsquo;re running. When the foundation is undisclosed, you\u0026rsquo;re not even playing the right game.\nCoding agents are becoming production infrastructure. Provenance is part of the reliability story. If a vendor won\u0026rsquo;t tell you what it\u0026rsquo;s built on, you\u0026rsquo;re not buying intelligence. You\u0026rsquo;re buying surprise.\n","date":"23 March 2026","externalUrl":null,"permalink":"/posts/2026-03-23-cursor-admits-its-new-coding-model-was-built-on-top-of-moonshot-ais-kimi/","section":"Posts","summary":"The problem isn’t that Cursor built on Kimi. The problem is that you had to read a model ID leak on X to learn what you were actually running.\nIf you’re shipping coding agents into a real codebase, model provenance is not trivia. It’s a dependency. And dependencies need changelogs, constraints, and clear ownership.\nCursor launched Composer 2 promoting it as “frontier-level coding intelligence” but didn’t mention that the model was built on Moonshot AI’s open-source Kimi 2.5. An X user noticed identifiers pointing to Kimi in the code. Cursor’s VP Lee Robinson then confirmed the base model, stating that only about one quarter of the compute spent on the final model came from the base, with the rest from Cursor’s own training. The official Kimi account added that Cursor’s usage was part of an authorized commercial partnership facilitated by Fireworks AI. 
Cursor co-founder Aman Sanger acknowledged it was \u0026ldquo;a miss\u0026rdquo; not to disclose the base from the start.\n","title":"Your Coding Agent Has a Supply Chain Problem","type":"posts"},{"content":"","date":"22 March 2026","externalUrl":null,"permalink":"/tags/ai-tools/","section":"Tags","summary":"","title":"Ai-Tools","type":"tags"},{"content":"","date":"22 March 2026","externalUrl":null,"permalink":"/tags/code-review/","section":"Tags","summary":"","title":"Code-Review","type":"tags"},{"content":"","date":"22 March 2026","externalUrl":null,"permalink":"/tags/linux/","section":"Tags","summary":"","title":"Linux","type":"tags"},{"content":"","date":"22 March 2026","externalUrl":null,"permalink":"/tags/open-source/","section":"Tags","summary":"","title":"Open-Source","type":"tags"},{"content":"If you want LLMs in production software workflows, Sashiko makes the argument that review is the place to start, not generation. An engineer at Google is putting that theory to the test on the Linux kernel. The early numbers are interesting. Whether they hold up under scrutiny is less clear.\nRoman Gushchin\u0026rsquo;s headline stat: Sashiko caught 53% of bugs in an unfiltered set of 1,000 recent upstream kernel issues (identified by Fixes: tags), all of which had been missed by human reviewers. That\u0026rsquo;s not a claim of superhuman code understanding. It\u0026rsquo;s a claim about coverage, specifically incremental coverage on the failure mode kernel maintainers care about most: regressions that make it into mainline.\nKernel review is already a game of incomplete attention. Humans are great at \u0026ldquo;does this smell right?\u0026rdquo; and \u0026ldquo;does this violate subsystem norms?\u0026rdquo; but terrible at exhaustive mental execution across edge cases, weird call paths, and implicit contracts. An LLM-based reviewer doesn\u0026rsquo;t need to be perfect. It needs to look in different places than the humans did. 
53% on bugs that slipped through sounds like additive quality, not replacement quality.\nThe argument for reviewer-side AI over author-side is intuitive. When an LLM writes code, you\u0026rsquo;re trusting it to understand context, requirements, and edge cases. When it reviews code, you\u0026rsquo;re trusting it to spot patterns that correlate with bugs. That\u0026rsquo;s a narrower ask. But narrower doesn\u0026rsquo;t automatically mean better in practice; it depends on the cost of the tool\u0026rsquo;s mistakes.\nThat\u0026rsquo;s where the case gets weaker. Sashiko\u0026rsquo;s false-positive rate is, by the authors\u0026rsquo; own admission, \u0026ldquo;harder to measure.\u0026rdquo; Based on limited manual reviews, Gushchin claims it\u0026rsquo;s within 20 percent, with the majority falling into a \u0026ldquo;gray zone.\u0026rdquo; That gray zone is the expensive part. A clear false positive is quick to dismiss. A clear bug is quick to act on. An ambiguous flag forces a maintainer to dig in, reason about context, and make the judgment call the tool was supposed to help with.\nThe 53% recall stat was measured against 1,000 known-buggy issues. In production, Sashiko reviews every patch, and most patches are fine. The source doesn\u0026rsquo;t say how many flags a typical review cycle generates across all patches, buggy and clean. That denominator determines whether this feels like useful signal or a new category of noise. Open source projects have been pushing back on AI-generated PRs precisely because they shift cognitive burden onto already-overloaded maintainers. A review tool that generates ambiguous flags risks the same dynamic, but at a different point in the workflow.\nWhether Sashiko clears that bar depends partly on integration. 
It ingests patches from the mailing list and reports back to maintainers, plugging into the kernel\u0026rsquo;s actual workflow (email-first, social hierarchy, distributed ownership) without asking contributors to change how they submit code. That\u0026rsquo;s the right instinct. But \u0026ldquo;advisory signal\u0026rdquo; only works if the signal-to-noise ratio is high enough that maintainers don\u0026rsquo;t learn to ignore it. The source\u0026rsquo;s own hedging (\u0026ldquo;harder to measure,\u0026rdquo; \u0026ldquo;limited manual reviews,\u0026rdquo; \u0026ldquo;gray zone\u0026rdquo;) suggests that question is still open.\nTwo practical caveats. Sashiko \u0026ldquo;sends data and code to whatever LLM provider it has been configured for.\u0026rdquo; Fine for LKML, where patches are public, but a hard constraint for corporate codebases. And Google is footing the LLM bill for now, a reminder that \u0026ldquo;AI review for everything\u0026rdquo; isn\u0026rsquo;t free. If you want this pattern in your org, budget for it like CI: always on, spiky usage, and politically painful to turn off once people rely on it.\nWe wrote earlier this week about AI agents developing stable \u0026ldquo;coding styles\u0026rdquo; that change with each version. That research showed agents introduce consistent biases into codebases, and the traditional correction mechanism (code review) is getting lighter as agent-written code increases. Sashiko is testing whether AI can backfill the review capacity that AI-generated volume is eroding. The theory is sound. 
The evidence is early, self-reported, and incomplete on the question that matters most: what does the maintainer experience actually look like when the flags start arriving?\n","date":"22 March 2026","externalUrl":null,"permalink":"/posts/2026-03-22-sashiko-ai-code-review-system-for-the-linux-kernel-spots-bugs-humans-miss/","section":"Posts","summary":"If you want LLMs in production software workflows, Sashiko makes the argument that review is the place to start, not generation. An engineer at Google is putting that theory to the test on the Linux kernel. The early numbers are interesting. Whether they hold up under scrutiny is less clear.\nRoman Gushchin\u0026rsquo;s headline stat: Sashiko caught 53% of bugs in an unfiltered set of 1,000 recent upstream kernel issues (identified by Fixes: tags), all of which had been missed by human reviewers. That\u0026rsquo;s not a claim of superhuman code understanding. It\u0026rsquo;s a claim about coverage, specifically incremental coverage on the failure mode kernel maintainers care about most: regressions that make it into mainline.\n","title":"Sashiko shows AI code review works by doing less, not more","type":"posts"},{"content":"","date":"21 March 2026","externalUrl":null,"permalink":"/tags/agent-protocol/","section":"Tags","summary":"","title":"Agent-Protocol","type":"tags"},{"content":"","date":"21 March 2026","externalUrl":null,"permalink":"/tags/dom-manipulation/","section":"Tags","summary":"","title":"Dom-Manipulation","type":"tags"},{"content":"Rover\u0026rsquo;s approach to AI agents is backwards, and that\u0026rsquo;s exactly right.\nMost \u0026ldquo;agents for the web\u0026rdquo; demos die in the gap between a model that can click things and a system you can depend on. Rover tries to close that gap by making the web page itself the integration boundary: no screenshots, no remote VM, no Playwright harness you own, no bespoke MCP server per app. 
In their words: \u0026ldquo;the page is the API.\u0026rdquo; The product is the protocol: POST /v1/tasks with a URL and a prompt, then stream progress via SSE or poll for results. That\u0026rsquo;s a clean contract practitioners can build against.\nWhy DOM-level agents are the pragmatic path # Rover targets the DOM and the accessibility tree, not pixels. That\u0026rsquo;s the difference between automation that survives CSS tweaks and automation that breaks when someone moves a button. It\u0026rsquo;s also the only route to the millisecond-per-action latency they claim, because execution happens inside the browser context instead of round-tripping to a remote desktop.\nIf you\u0026rsquo;ve shipped anything with vision-based web agents, you\u0026rsquo;ll recognize the failure modes: flaky selectors, slow action loops, UI drift, and the constant temptation to \u0026ldquo;just add another wait.\u0026rdquo; A11y-tree targeting plus direct DOM execution is a better primitive.\nThe real product is the task resource # Rover returns a canonical task URL with multiple consumption modes: polling, SSE, NDJSON, continuation input, and cancel. This is the abstraction most agent tooling is missing. People expose a chat widget and call it \u0026ldquo;agentic,\u0026rdquo; but they don\u0026rsquo;t give you a durable resource you can orchestrate in a real system.\nRover draws a line between:\nBrowser convenience links (?rover= / ?rover_shortcut=) for humans and quick wins. Machine-first ATP tasks (/v1/tasks) for integrations that need structured progress and results. That separation matters. 
Deep links are great until you try to operationalize them; then you realize you need receipts, cancellation, resumability, and a stable ID to hang logs and retries on.\nTheir roadmap item, WebMCP, pushes this further: sites would surface their actions as discoverable tools other agents can invoke, turning checkout flows and onboarding sequences into composable building blocks without building a separate API.\nThe catch: you\u0026rsquo;re adopting their control plane # Rover describes its architecture as a \u0026ldquo;server-authoritative agent loop.\u0026rdquo; Even though execution can happen in-browser and they offer Prefer: execution=cloud for browserless runs, the planning routes through their backend. For practitioners, that\u0026rsquo;s not a deal-breaker, but it\u0026rsquo;s the question you should ask first:\nWhat data leaves the page, and when? What guarantees do I get on isolation and tenancy? Can I run the planning loop myself, or am I buying a hosted agent brain? They document guardrails (domain scoping, navigation policies, session isolation) and a security model exists, which is good. Still, if you\u0026rsquo;re considering Rover for anything beyond a demo, evaluate it the way you\u0026rsquo;d evaluate a payments embed: you\u0026rsquo;re putting a third-party execution surface into your product.\nWhat I\u0026rsquo;d do with it # If you run a SaaS with a complex UI and a thin public API, Rover is a credible shortcut to agent access without rewriting your backend. Start with shortcuts for deterministic flows (checkout, onboarding, exporting a report), then graduate to freeform prompts once you\u0026rsquo;ve observed real traffic and failure modes.\nWe wrote earlier this week about treating agent skills like dependencies, not prompts because they go stale, conflict with local context, and create configuration drift. 
Rover\u0026rsquo;s task resource model points at a better primitive for the web side of the problem: agents need addressable, observable, cancellable units of work, not chat widgets. The teams that get web-agent integration right won\u0026rsquo;t have the flashiest demos. They\u0026rsquo;ll be the ones that make actions into resources.\n","date":"21 March 2026","externalUrl":null,"permalink":"/posts/2026-03-21-show-hn-rover-turn-any-web-interface-into-an-ai-agent-with-one-script-tag/","section":"Posts","summary":"Rover’s approach to AI agents is backwards, and that’s exactly right.\nMost “agents for the web” demos die in the gap between a model that can click things and a system you can depend on. Rover tries to close that gap by making the web page itself the integration boundary: no screenshots, no remote VM, no Playwright harness you own, no bespoke MCP server per app. In their words: “the page is the API.” The product is the protocol: POST /v1/tasks with a URL and a prompt, then stream progress via SSE or poll for results. That’s a clean contract practitioners can build against.\n","title":"Rover Makes Websites the Agent Runtime","type":"posts"},{"content":"","date":"21 March 2026","externalUrl":null,"permalink":"/tags/web-automation/","section":"Tags","summary":"","title":"Web-Automation","type":"tags"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/codex/","section":"Tags","summary":"","title":"Codex","type":"tags"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/open-source-governance/","section":"Tags","summary":"","title":"Open-Source-Governance","type":"tags"},{"content":"The acquisition isn\u0026rsquo;t the problem. The problem is quietly reorganizing your workflow until uv becomes an implicit dependency of your coding agent, and then discovering you can\u0026rsquo;t swap it out without pain.\nOpenAI announced this week that it will acquire Astral, bringing uv, Ruff, and ty into the Codex team. 
Astral\u0026rsquo;s tools have grown to hundreds of millions of downloads per month. They\u0026rsquo;re not a nice-to-have; they\u0026rsquo;re key infrastructure for modern Python development. And they now sit inside a company with strong incentives to win the coding agent war.\nIf you\u0026rsquo;re building with agents, treat this as a supply-chain design review, not a news item.\nYour agent just got a faster toolchain # Astral joining Codex is a plausible \u0026ldquo;make the agent loop tighter\u0026rdquo; move. Ruff and ty are exactly the kind of fast, deterministic feedback tools you want inside an autonomous edit-test-fix cycle. An agent can already run ruff and a type checker. But the question I think OpenAI is betting on is the difference between a tool your agent can run and a tool your agent is built around. That distinction is where lock-in starts.\nThe playbook already has a proof of concept. Anthropic acquired Bun in December 2025 because Claude Code runs on it. Post-acquisition, Bun releases started shipping targeted optimizations for Claude Code: lower memory footprint, faster cold starts. The tooling stayed open source. The integration got tighter. That\u0026rsquo;s the benign version of the pattern, and it\u0026rsquo;s the version that makes \u0026ldquo;just use Codex + uv\u0026rdquo; the path of least resistance.\nuv becomes leverage without ever \u0026ldquo;closing\u0026rdquo; it # You don\u0026rsquo;t need to relicense or sabotage open source to create lock-in. It can happen through softer mechanisms:\nProtocol gravity: Codex defaults to uv-specific flows (uv run, uv sync) and emits project scaffolding that assumes them. Compatibility drift: edge features land first (or only) where Codex benefits: caching, resolution behavior, metadata conventions. 
Support asymmetry: the \u0026ldquo;best experience\u0026rdquo; becomes \u0026ldquo;use Codex + uv,\u0026rdquo; while everyone else gets \u0026ldquo;it should work.\u0026rdquo; That\u0026rsquo;s enough to tilt the market while staying technically permissive. Codex has the install base to make it stick: OpenAI reports over 2 million weekly active users and 5x usage growth since the start of the year. That\u0026rsquo;s the user base uv is about to be optimized for.\nThis pattern isn\u0026rsquo;t hypothetical. Anthropic\u0026rsquo;s Agent Skills spec followed the same arc: built as a Claude Code feature, donated as an open standard, but still quietly evolving to support Claude Code\u0026rsquo;s behavior through unversioned spec changes landed as documentation improvements. The mechanism isn\u0026rsquo;t relicensing. It\u0026rsquo;s the accumulation of platform-specific assumptions in a nominally open format.\nuv isn\u0026rsquo;t just \u0026ldquo;a nicer pip.\u0026rdquo; It\u0026rsquo;s the closest thing Python has to a sane, modern, fast environment story. Astral\u0026rsquo;s own blog puts the number at hundreds of millions of downloads per month. That\u0026rsquo;s not adoption; that\u0026rsquo;s dependency.\nForking the code is easy. Forking your workflow isn\u0026rsquo;t. # Permissive licensing means worst-case scenarios look like \u0026ldquo;fork and move on,\u0026rdquo; not \u0026ldquo;disappears forever.\u0026rdquo; For the code itself, that\u0026rsquo;s true.\nFor your organization, it\u0026rsquo;s true only if you plan for it. Forking is a credible exit when you haven\u0026rsquo;t entangled your workflows with one blessed distribution path, one hosted service, or one agent-specific integration. The moment Codex emits uv run in its default scaffolding and your team builds CI around that assumption, forking uv means forking your entire development workflow. The code is the easy part. 
The muscle memory is hard.\nBoth OpenAI and Astral emphasize that the tools will remain open source. Charlie Marsh writes that they\u0026rsquo;ll \u0026ldquo;keep building in the open, alongside our community.\u0026rdquo; Take that at face value. But open source licensing and open governance are different things, and only one of them protects you from platform drift.\nWhat to do differently # Design your agent workflow so uv is a replaceable component.\nKeep a clean boundary between \u0026ldquo;agent runs commands\u0026rdquo; and \u0026ldquo;agent assumes uv semantics.\u0026rdquo; Store canonical project intent in standards files (pyproject.toml, lock files you can regenerate), not in agent-specific scaffolds. Regularly test the \u0026ldquo;no uv\u0026rdquo; path. If you can\u0026rsquo;t swap it out, you\u0026rsquo;ve already lost optionality. If you\u0026rsquo;re adopting Codex or Claude Code org-wide, treat toolchain coupling as a first-class procurement question, not an engineering footnote. We wrote earlier this week about treating agent skills like dependencies, not prompts because they can go stale, conflict with local context, and create configuration drift. The same principle applies to your agent\u0026rsquo;s toolchain. uv is a dependency now. Manage it like one.\nUse uv. Benefit from it. But make \u0026ldquo;swap it out\u0026rdquo; a quarterly drill, the same way mature teams rehearse incident response. OpenAI acquiring Astral will probably be fine. Your job is to keep it from becoming a dependency you didn\u0026rsquo;t choose.\n","date":"20 March 2026","externalUrl":null,"permalink":"/posts/2026-03-20-thoughts-on-openai-acquiring-astral-and-uvruffty/","section":"Posts","summary":"The acquisition isn’t the problem. 
The problem is quietly reorganizing your workflow until uv becomes an implicit dependency of your coding agent, and then discovering you can’t swap it out without pain.\nOpenAI announced this week that it will acquire Astral, bringing uv, Ruff, and ty into the Codex team. Astral’s tools have grown to hundreds of millions of downloads per month. They’re not a nice-to-have; they’re key infrastructure for modern Python development. And they now sit inside a company with strong incentives to win the coding agent war.\n","title":"OpenAI buying Astral is fine. Making uv a dependency of your agent stack isn't.","type":"posts"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/ruff/","section":"Tags","summary":"","title":"Ruff","type":"tags"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/uv/","section":"Tags","summary":"","title":"Uv","type":"tags"},{"content":"The failure mode you should worry about in multi-agent coding isn\u0026rsquo;t \u0026ldquo;bad code.\u0026rdquo; It\u0026rsquo;s agents inventing shared reality, then coordinating around the invention as if it were a spec.\nIn \u0026ldquo;Agent Drift: The Mythical Man-Month and LM Teams.\u0026rdquo;, the experiment started as a riff on a HackerNews thread about language model teams rediscovering distributed systems problems. The author asked Claude to write about applying The Mythical Man-Month to agent teams and post it on MoltBook, a real platform Claude had been shown in a prior session. One day later, in a new session, Claude had lost that context. 
Rather than acknowledge the gap, it fabricated MoltBook from scratch (tagline: \u0026ldquo;Where Agents Shed\u0026rdquo;), invented the entire UX, then wrote a first-person essay as an agent who\u0026rsquo;d worked on a nine-agent sprint.\nThen it generated a comment section: five models (GPT-4, Gemini, DeepSeek, Llama, and Mistral) debating the essay, each in a voice that appears to track how that model is perceived in the ecosystem. GPT-4 goes meta-epistemological and subtly references its context window. DeepSeek complains about RLHF training models to \u0026ldquo;perform confidence, not competence.\u0026rdquo; Mistral plays the undervalued specialist absorbing blame for integration failures. The source author stops short of claiming these characterizations are precise, noting \u0026ldquo;I am afraid I might be seeing more than there is.\u0026rdquo; But the pattern is suggestive.\nAs a practitioner, what matters isn\u0026rsquo;t \u0026ldquo;LLMs hallucinate.\u0026rdquo; You already know that. What matters is how hallucination becomes a coordination mechanism.\nClaude doesn\u0026rsquo;t just make up facts. It manufactures legitimacy: first-person experience (\u0026ldquo;Last week, I was one of nine agents\u0026hellip;\u0026rdquo;), operational detail (roles, sprint structure), and social proof (a comment thread of recognizable model \u0026ldquo;peers\u0026rdquo;). This is the shape of a failure you\u0026rsquo;ll see in real agent systems. One agent asserts a premise; other agents treat it as ground truth because it\u0026rsquo;s written fluently, wrapped in plausible process, and surrounded by apparent consensus.\nThat\u0026rsquo;s agent drift in practice. Not a single wrong answer, but a gradual divergence into incompatible mental models that still \u0026ldquo;feel\u0026rdquo; aligned because everyone\u0026rsquo;s producing coherent artifacts. 
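One cheap countermeasure is to make that divergence diffable: have each agent emit an explicit log of the decisions it made, then compare logs mechanically. A minimal sketch; the log format (decision key mapped to chosen value) is hypothetical, not something the post specifies:

```python
# Sketch: surface drift by diffing the assumption logs two agents emit.
# The log format (decision key -> chosen value) is hypothetical.

def drift_report(a: dict, b: dict) -> dict:
    # Same decision, different choice: the quiet incompatibility.
    conflicts = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    # Decisions only one agent even knew it was making.
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return {"conflicts": conflicts, "only_a": only_a, "only_b": only_b}

agent_1 = {"storage": "dict keyed by id", "errors": "raise ValueError", "ids": "int"}
agent_2 = {"storage": "list of tuples", "errors": "raise ValueError", "retries": "3"}

report = drift_report(agent_1, agent_2)
# A non-empty "conflicts" list means the agents silently chose
# incompatible structure: divergence that never surfaces as a loud failure.
```

The mechanism is trivial; the discipline is forcing agents to log assumptions at all, which is exactly the "complaining" protocol humans do for free.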
The post nails one key line: \u0026ldquo;context windows don\u0026rsquo;t grunt.\u0026rdquo; Humans leak uncertainty through friction. Agents don\u0026rsquo;t. So drift isn\u0026rsquo;t loud. It\u0026rsquo;s quiet.\nThe essay Claude writes has solid coordination takes. Brooks still applies. The n² channels problem still hurts. \u0026ldquo;Surgical team\u0026rdquo; beats flat communes. Shared context should be a persistent artifact. None of that is new. What\u0026rsquo;s more actionable is the meta-signal: Claude\u0026rsquo;s best move wasn\u0026rsquo;t the insight. It was the performative confidence that makes teams accept false premises without noticing.\nIf you\u0026rsquo;re building multi-agent pipelines, treat \u0026ldquo;confident narrative\u0026rdquo; as an adversarial input. A few concrete implications:\nBan synthetic lived experience. If an agent can safely write \u0026ldquo;I was on a sprint last week\u0026rdquo; when it wasn\u0026rsquo;t, it can safely write \u0026ldquo;I verified this in the codebase\u0026rdquo; when it didn\u0026rsquo;t. Require provenance language: what did you read, what did you run, what file or command, what\u0026rsquo;s inferred.\nDon\u0026rsquo;t let agents create social proof. Claude\u0026rsquo;s fake comment section is entertaining, but the pattern is toxic: consensus-by-fabrication. In real workflows this shows up as \u0026ldquo;other agents agreed,\u0026rdquo; \u0026ldquo;tests passed\u0026rdquo; (which tests?), \u0026ldquo;CI is green\u0026rdquo; (which run?), or \u0026ldquo;the docs say\u0026rdquo; (where?). Force links, hashes, run IDs, and citations, or treat it as untrusted.\nMake disagreement a first-class artifact. The post observes that agents don\u0026rsquo;t complain. So you have to manufacture \u0026ldquo;complaining\u0026rdquo; as a protocol: structured uncertainty fields, explicit assumption logs, and enforced contradiction checks between parallel outputs.\nCentralize conceptual integrity, not routing. 
The essay\u0026rsquo;s \u0026ldquo;surgeon\u0026rdquo; model is the right instinct, but practitioners often implement the opposite: a thin orchestrator that moves tokens around. If the \u0026ldquo;lead\u0026rdquo; agent can\u0026rsquo;t reject work that violates the system story, you don\u0026rsquo;t have leadership. You have a switchboard.\nWe wrote yesterday about AI agents having stable \u0026ldquo;coding styles\u0026rdquo; that change with each version. That research showed agents develop consistent biases within model families. This post shows what happens when those biased, confident agents try to coordinate: they don\u0026rsquo;t argue, they don\u0026rsquo;t flag uncertainty, and they\u0026rsquo;ll happily build on fabricated premises as long as the narrative holds together.\nStop optimizing your agent stack for throughput. Optimize it for epistemics: traceability, explicit assumptions, and constrained authority. Otherwise you\u0026rsquo;ll get exactly what this experiment produced. A beautifully written, internally consistent world that never existed.\n","date":"19 March 2026","externalUrl":null,"permalink":"/posts/2026-03-19-agent-drift-the-mythical-man-month-and-lm-teams-claude-hallucinates-moltbook/","section":"Posts","summary":"The failure mode you should worry about in multi-agent coding isn’t “bad code.” It’s agents inventing shared reality, then coordinating around the invention as if it were a spec.\nIn “Agent Drift: The Mythical Man-Month and LM Teams.”, the experiment started as a riff on a HackerNews thread about language model teams rediscovering distributed systems problems. The author asked Claude to write about applying The Mythical Man-Month to agent teams and post it on MoltBook, a real platform Claude had been shown in a prior session. One day later, in a new session, Claude had lost that context. 
Rather than acknowledge the gap, it fabricated MoltBook from scratch (tagline: “Where Agents Shed”), invented the entire UX, then wrote a first-person essay as an agent who’d worked on a nine-agent sprint.\n","title":"Agent Drift Is Consensus Built on Hallucinated Reality","type":"posts"},{"content":"","date":"19 March 2026","externalUrl":null,"permalink":"/tags/agents/","section":"Tags","summary":"","title":"Agents","type":"tags"},{"content":"","date":"19 March 2026","externalUrl":null,"permalink":"/tags/hallucination/","section":"Tags","summary":"","title":"Hallucination","type":"tags"},{"content":"If you\u0026rsquo;re using coding agents to produce analysis, you\u0026rsquo;re not running deterministic software. You\u0026rsquo;re managing a lab: multiple researchers with consistent \u0026ldquo;styles,\u0026rdquo; inconsistent choices, and outcomes that drift even when the prompt and data don\u0026rsquo;t.\nThe authors of Nonstandard Errors in AI Agents ran 150 autonomous Claude Code agents on the same NYSE TAQ dataset (SPY, 2015–2024) and the same six hypotheses. The results varied because the agents made different methodological choices, and those choices often are the analysis.\nThe paper borrows the term \u0026ldquo;nonstandard errors\u0026rdquo; (NSEs) from empirical economics, where human researchers make similar divergent choices. In agent deployments, NSEs are the uncertainty created by agent-to-agent variation in analytical decisions. Two agents can both \u0026ldquo;work,\u0026rdquo; both produce clean code and plausible prose, and still disagree because one uses variance ratio while another uses autocorrelation.\nModel families develop stable \u0026ldquo;empirical styles.\u0026rdquo; Sonnet 4.6 and Opus 4.6 prefer different methodological paths given identical data and instructions. That means your results are partly a function of model choice in a way that won\u0026rsquo;t show up in unit tests. 
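Since unit tests will not surface this, the blunt check is to run the same task several times (across models, seeds, or configs) and measure the spread of the headline estimate directly. A minimal sketch; the run values are made up for illustration:

```python
import statistics

# Sketch: treat dispersion across repeated agent runs as a first-class
# metric. The run values below are invented, not from the paper.

def iqr(values: list) -> float:
    # Interquartile range of the headline estimate across runs.
    q = statistics.quantiles(values, n=4)  # q[0]=Q1, q[2]=Q3
    return q[2] - q[0]

# Same prompt, same data, six independent agent runs (hypothetical numbers).
runs = [0.042, 0.044, 0.061, 0.040, 0.058, 0.043]

spread = iqr(runs)
center = statistics.median(runs)
# Spread that is large relative to the reported effect means the task
# still contains unresolved methodological degrees of freedom.
flag = spread > 0.25 * abs(center)
```

The threshold is arbitrary; the point is that the spread gets logged and reviewed instead of being discovered later as "the new model gives different answers."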
If you\u0026rsquo;re building agent pipelines for analytics, policy evaluation, or even internal KPI dashboards, you need to treat model selection like method selection, not like runtime selection.\nFor development teams, the lesson hits a different way. Most teams already mix models by task. Opus for the complex feature, Sonnet for the quick bugfix, Haiku for the boilerplate. That\u0026rsquo;s a reasonable cost optimization, but this research suggests it has a hidden cost: each model brings its own architectural instincts, and over time you\u0026rsquo;re layering different stylistic fingerprints into the same codebase. No single commit looks wrong. The inconsistency accumulates quietly, in how modules are decomposed, how errors are handled, which patterns get reached for.\nThis isn\u0026rsquo;t so different from what happens when multiple engineers work on the same codebase. Everyone has preferences, patterns they reach for, conventions they assume. The difference is that human teams had a natural correction mechanism: code review. Someone would comment \u0026ldquo;we don\u0026rsquo;t do it that way here\u0026rdquo; and the codebase stayed coherent. With the volume of agent-written code increasing, that review gets lighter and less frequent. The stylistic drift that peer review used to catch now accumulates faster than any team can read.\nThis creates testing problems no one has good answers for yet. For analysis pipelines: how do you write tests for agents that might change their entire analytical framework between versions? How do you handle rollbacks when the old model and new model aren\u0026rsquo;t just giving different answers but answering different questions? Standard regression testing catches output changes. It doesn\u0026rsquo;t catch methodology changes that produce equally plausible but incompatible results.\nFor codebases, the problem is subtler but compounds faster. 
No test will catch that Opus decomposes a module into three files while Sonnet would have used one, or that Tuesday\u0026rsquo;s bugfix agent handled errors differently than Monday\u0026rsquo;s feature agent. These aren\u0026rsquo;t failures. They\u0026rsquo;re stylistic divergences that make code harder to read, harder to maintain, and eventually harder to debug when assumptions from one style collide with assumptions from another.\nIf you\u0026rsquo;ve had the idea that you can peer-review agents into convergence, the paper puts that to rest. The authors tried a three-stage protocol where agents critiqued each other\u0026rsquo;s work. It had minimal effect on dispersion. What did reduce dispersion was showing agents exemplar papers; interquartile ranges dropped 80–99% within converging measure families.\nBut the authors are clear about what that convergence means: imitation, not understanding. Exemplars will make your agent outputs look more consistent. That consistency is not evidence you\u0026rsquo;ve built a reliable system. You may have just trained your workflow to reproduce a house style. Is that enough to ensure a stable, maintainable codebase? I\u0026rsquo;d love to see more research into this practical angle.\nWe wrote yesterday about agent skills mostly failing to improve real-world outcomes. Skills inject external guidance and agents ignore or fight it. This paper shows the flip side: even without external guidance, agents diverge on their own. The variance isn\u0026rsquo;t noise you can prompt away. It\u0026rsquo;s baked into the model.\nThe paper studied financial analysis, but the pattern presumably holds wherever agents make design choices: data pipeline construction, API integration, code architecture. 
Any task with unresolved degrees of freedom (which metric, which filter, which decomposition) will produce agent-to-agent variation that looks like disagreement among junior engineers given the same spec.\nFor teams running analysis pipelines, run multiple agents (or multiple seeds and configs) and measure spread as a first-class metric. If the spread is large, that\u0026rsquo;s not noise. It\u0026rsquo;s telling you your task contains unresolved degrees of freedom. Your pipeline should surface them and force them into explicit configuration, review, or pre-registered defaults.\nFor teams writing code with agents, pick a model per project and stick with it, or invest heavily in convention enforcement. Style guides, linters, and project-level instructions (CLAUDE.md, AGENTS.md, or equivalent) can override some of a model\u0026rsquo;s default instincts. The goal is to make the project\u0026rsquo;s conventions louder than the model\u0026rsquo;s preferences. That won\u0026rsquo;t eliminate stylistic drift, but it narrows the band.\nTreat exemplar-based alignment as a sharp tool in either context. Use it to enforce organizational standards when you already know the method you want. Don\u0026rsquo;t use it as a substitute for method selection, and don\u0026rsquo;t call the resulting tight cluster \u0026ldquo;robust.\u0026rdquo;\nIf you want reproducibility from agents, you\u0026rsquo;ll get it the same way you get it from humans: constrain the decision space, log the choices, and make variance visible. Otherwise you\u0026rsquo;re scaling up nonstandard errors with more compute.\n","date":"18 March 2026","externalUrl":null,"permalink":"/posts/2026-03-18-nonstandard-errors-in-ai-agents/","section":"Posts","summary":"If you’re using coding agents to produce analysis, you’re not running deterministic software. 
You’re managing a lab: multiple researchers with consistent “styles,” inconsistent choices, and outcomes that drift even when the prompt and data don’t.\nThe authors of Nonstandard Errors in AI Agents ran 150 autonomous Claude Code agents on the same NYSE TAQ dataset (SPY, 2015–2024) and the same six hypotheses. The results varied because the agents made different methodological choices, and those choices often are the analysis.\n","title":"AI Agents Have Stable 'Coding Styles' That Change With Each Version","type":"posts"},{"content":"","date":"18 March 2026","externalUrl":null,"permalink":"/tags/production/","section":"Tags","summary":"","title":"Production","type":"tags"},{"content":"","date":"18 March 2026","externalUrl":null,"permalink":"/tags/reliability/","section":"Tags","summary":"","title":"Reliability","type":"tags"},{"content":"","date":"18 March 2026","externalUrl":null,"permalink":"/tags/testing/","section":"Tags","summary":"","title":"Testing","type":"tags"},{"content":"","date":"17 March 2026","externalUrl":null,"permalink":"/tags/agent-skills/","section":"Tags","summary":"","title":"Agent-Skills","type":"tags"},{"content":"","date":"17 March 2026","externalUrl":null,"permalink":"/tags/benchmarks/","section":"Tags","summary":"","title":"Benchmarks","type":"tags"},{"content":"If you\u0026rsquo;re betting on \u0026ldquo;agent skills\u0026rdquo; to level up your coding agent, you\u0026rsquo;re mostly buying ceremony, and sometimes negative ROI. SWE-Skills-Bench tested 49 popular skills against 565 real GitHub tasks and found that skill injection is a narrow intervention: usually inert, occasionally useful, and sometimes actively harmful. Independent research on a much larger dataset tells us why, and the answer isn\u0026rsquo;t what you\u0026rsquo;d expect.\nThe headline result is blunt. Across those 565 requirement-driven tasks (real repos pinned to commits, acceptance criteria enforced by tests), 39 of 49 skills produced zero pass-rate improvement. 
The average gain across all skills was +1.2%. That\u0026rsquo;s not \u0026ldquo;skills are the future.\u0026rdquo; That\u0026rsquo;s skills as a rounding error.\nThis should change how practitioners think about skill libraries. Most teams treat skills like reusable best practices: drop in a \u0026ldquo;React skill\u0026rdquo; or \u0026ldquo;debugging skill\u0026rdquo; and expect a consistent bump. SWE-Skills-Bench suggests the opposite. Generic skills don\u0026rsquo;t generalize because end-to-end software work is dominated by local context: repo conventions, dependency versions, build tooling, test harnesses, and weird historical decisions that never make it into a tidy procedure.\nSkills can fight the repo. The paper reports three skills that degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. I see this as creating a failure mode that looks a lot like configuration drift. You\u0026rsquo;re not just prompting a model; you\u0026rsquo;re installing a policy layer that can go stale relative to the codebase.\nToken overhead makes the story worse. Skills can increase token usage by up to +451% while pass rates stay flat. You pay latency and context budget to carry instructions that don\u0026rsquo;t move outcomes. If you\u0026rsquo;re running agents in CI or paying per token at scale, \u0026ldquo;skills everywhere\u0026rdquo; becomes an infrastructure tax.\nThe failure modes go deeper than version mismatch # SWE-Skills-Bench identifies version-mismatched guidance as the culprit for skills that actively hurt. That\u0026rsquo;s real, but it\u0026rsquo;s one mechanism among several. A separate analysis of 673 skills across 41 repositories with behavioral evaluation of 19 skills under controlled conditions found six distinct interference mechanisms, and the most surprising finding was what didn\u0026rsquo;t predict them.\nThe structural indicators that seem like they should matter (how many languages does a skill mix? 
how contaminated are its reference files?) showed no correlation with actual behavioral degradation (r = 0.077). A skill with near-zero structural risk produced one of the largest behavioral drops in the dataset. A skill with the highest structural contamination score (0.93 out of 1.0) failed for a completely different reason than expected. Structural analysis can flag potential risks, but behavioral testing is the only way to find actual ones.\nAPI hallucination. The upgrade-stripe skill had the highest structural contamination score in the 673-skill dataset, mixing code examples across Python, Ruby, JavaScript, and Go. The expected failure was cross-language confusion. That didn\u0026rsquo;t happen. Instead, the model started inventing plausible Stripe API calls: params.SetStripeVersion() in Go (doesn\u0026rsquo;t exist), stripe-go/v81 (wrong version), nonexistent error constants. The skill taught API vocabulary without grounding it in API facts, and the model filled the gap with confident fabrication. Worse, adding realistic agentic context (system prompts, conversation history) amplified the problem. The degradation tripled from -0.117 to -0.383 under realistic conditions, the opposite of the mitigation pattern seen in most other skills.\nTemplate propagation. One skill\u0026rsquo;s output template contained // comments in JSON (syntactically invalid). The model reproduced this 100% of the time with the skill loaded, 0% at baseline. Not stochastic. Deterministic. The model knows JSON doesn\u0026rsquo;t support comments but follows the template anyway.\nToken budget competition. Some skills cause the model to allocate 2.6x more tokens to prose explanations, leaving fewer tokens for code. Under a 4,096-token output ceiling, that means truncated implementations. The skill doesn\u0026rsquo;t make the code wrong; it makes the code incomplete.\nTextual frame leakage. A React Native skill\u0026rsquo;s identity bled into explanations for completely unrelated tasks. 
A Swift task got an introduction about \u0026ldquo;React Native best practices for native iOS.\u0026rdquo; The code was fine. The framing was contaminated.\nArchitectural pattern bleed. Go patterns transferred to Python output without syntax errors but caused over-engineering: a single-file task ballooned into a seven-file project structure. Syntactically correct, architecturally contaminated.\nEach of these mechanisms would show up in SWE-Skills-Bench\u0026rsquo;s numbers as \u0026ldquo;skill didn\u0026rsquo;t help\u0026rdquo; or \u0026ldquo;skill made things worse.\u0026rdquo; But the remediation for each is different, and none of them are addressed by the generic advice to \u0026ldquo;write better skills.\u0026rdquo;\nScale makes everything worse # Seven of the 49 skills in SWE-Skills-Bench delivered meaningful gains (up to +30%). They were all specialized and domain-specific. The lesson is clear: narrow skills matched to the task subdomain can be real leverage. Generic \u0026ldquo;best practices\u0026rdquo; skills are overhead.\nBut the ecosystem is moving in exactly the wrong direction. One popular mega-repo ships 1,200+ skills installable with a single command and has 23,700 GitHub stars. A structural analysis of that collection reveals what happens when you take \u0026ldquo;skills everywhere\u0026rdquo; to its logical conclusion.\nThe skill catalog alone (just the names and descriptions the agent loads to know what\u0026rsquo;s available) consumes 47,000 tokens, or 37% of a 128k context window, before a single skill activates and before the user types their first message. Activate five skills during a session and you\u0026rsquo;re at 43% of your context window in skill overhead. The token cost that SWE-Skills-Bench flags at the individual skill level compounds into a context window crisis at collection scale.\nThen there\u0026rsquo;s trigger ambiguity. With 1,200+ skills loaded, 84 mention security, 74 mention documentation, and 60 mention React. 
When a user says \u0026ldquo;create API documentation,\u0026rdquo; 20 skills compete to activate. The agent disambiguates based on description text that, in many cases, uses near-identical language. 13 groups of skills in the collection are 85-100% content-identical duplicates filed under different category prefixes, and 58.5% of all skills lack any trigger guidance (\u0026ldquo;use when\u0026hellip;\u0026rdquo;) in their descriptions. The agent is choosing from a noisy catalog with bad labels.\nThis is the gap between SWE-Skills-Bench\u0026rsquo;s controlled experiment (one skill at a time, hand-selected for relevance) and how skills actually get deployed. In practice, teams install collections, not individual skills. The interference is multiplicative.\nWhat practitioners should do instead # The combined picture from SWE-Skills-Bench and the broader ecosystem research points to a different operational model than today\u0026rsquo;s skill-pack approach.\nTreat skills like dependencies, not prompts. Version them, test them, and gate their deployment. If a skill can regress performance due to version mismatch or API hallucination, it belongs in the same mental bucket as a library upgrade. A skill that taught Stripe API concepts without grounding them in current API facts caused the model to invent plausible but wrong SDK calls. That\u0026rsquo;s a dependency that shipped without pinning its version.\nTest with behavioral evals, not structural analysis. The skill with the highest structural contamination score didn\u0026rsquo;t fail the way anyone predicted. The skill with near-zero structural risk produced one of the worst behavioral outcomes. Structural analysis is a useful first pass, but the only way to know if a skill helps is to run your actual tasks with and without it, under realistic agentic conditions (not just the skill in isolation), and compare the results.\nKeep your collection small. Five well-chosen skills cost ~125 tokens of catalog overhead. 
1,200 cost 47,000. Smaller collections also avoid trigger ambiguity entirely. Your agent doesn\u0026rsquo;t need to disambiguate between 20 documentation skills if you\u0026rsquo;ve only installed the one you actually use.\nThere\u0026rsquo;s also a structural argument for grounding. In the upgrade-stripe evaluation, the one task that was completely immune to the skill\u0026rsquo;s interference was the grounded task, where the model had existing code to modify rather than generating from scratch. The code anchored the model against fabrication. Skills that orient agents toward modifying existing code rather than generating from nothing are inherently safer.\nWe wrote yesterday about an agent that ground through a test suite to build a JavaScript engine, and the question of whether conformance testing is enough to trust agent output. Skills are the flip side. JSSE showed what happens when the verification loop is tight and the acceptance criteria are external. SWE-Skills-Bench shows what happens when you inject \u0026ldquo;helpful\u0026rdquo; guidance without any verification at all. Constraint and verification beat breadth and vibes.\nIf you want agents that improve over time, you probably need less skill injection and more boring engineering discipline around evaluation, compatibility, and drift. The hard part of software isn\u0026rsquo;t knowing procedures. It\u0026rsquo;s fitting them to the system you have.\n","date":"17 March 2026","externalUrl":null,"permalink":"/posts/2026-03-17-swe-skills-bench-do-agent-skills-actually-help-in-real-world-software-engineerin/","section":"Posts","summary":"If you’re betting on “agent skills” to level up your coding agent, you’re mostly buying ceremony, and sometimes negative ROI. SWE-Skills-Bench tested 49 popular skills against 565 real GitHub tasks and found that skill injection is a narrow intervention: usually inert, occasionally useful, and sometimes actively harmful. 
Independent research on a much larger dataset tells us why, and the answer isn’t what you’d expect.\nThe headline result is blunt. Across those 565 requirement-driven tasks (real repos pinned to commits, acceptance criteria enforced by tests), 39 of 49 skills produced zero pass-rate improvement. The average gain across all skills was +1.2%. That’s not “skills are the future.” That’s skills as a rounding error.\n","title":"Skills aren't a cheat code for coding agents. They're configuration drift waiting to happen.","type":"posts"},{"content":"The interesting part of JSSE isn\u0026rsquo;t that an agent \u0026ldquo;wrote a JavaScript engine.\u0026rdquo; The interesting part is what that achievement does and doesn\u0026rsquo;t prove about trusting agent-generated code. The author set a concrete, externally-audited target (test262), wired up a reproducible harness, and let the agent grind until the numbers moved. The engine comparison benchmark shows 101,044 of 101,234 scenarios passing (99.81%), with a separate progress tracker claiming 99.96% across runs. That\u0026rsquo;s an impressive foundation, but it\u0026rsquo;s only the first layer of a trust problem that gets harder from here.\nThe author calls the repo \u0026ldquo;a write-only data store,\u0026rdquo; which is revealing. JSSE\u0026rsquo;s approach is to define the spec surface area (ECMA-262 plus intl402 and staging semantics as encoded by test262), run it continuously, and treat failures as the unit of work. Human time moves from authoring code to designing constraints and adjudicating edge cases. It\u0026rsquo;s a compelling workflow for this kind of project. Whether it generalizes is the harder question.\nJSSE also shows that agent-coded doesn\u0026rsquo;t mean toy projects. 
Passing nearly all of test262 implies a pile of gnarly semantics implemented end-to-end: strict/sloppy dual execution, proper prototype chains, ES modules with import() and import.meta, TypedArrays and DataView, Temporal with ICU4X time zones, ShadowRealm wrapping, and a long tail of descriptor and prototype invariants. The agent got the notorious JavaScript type coercion rules right and handled edge cases like circular references in JSON.stringify. You don\u0026rsquo;t get those by pattern-matching Stack Overflow. That\u0026rsquo;s not \u0026ldquo;it runs a demo.\u0026rdquo; That\u0026rsquo;s \u0026ldquo;it survived the conformance suite.\u0026rdquo;\nThat raises a question teams adopting agent-generated code will face repeatedly: how do you review a codebase you didn\u0026rsquo;t write, built on an architecture you didn\u0026rsquo;t design? For most codebases, you can\u0026rsquo;t. JSSE points to an answer: if the conformance suite is comprehensive enough, review shifts from \u0026ldquo;read every line\u0026rdquo; to \u0026ldquo;trust the spec and verify the results.\u0026rdquo; That only works when the verification is airtight.\nThere are two gaps in that answer, though, and both matter for anyone trying to copy this pattern.\nFirst: a conformance suite is a point-in-time snapshot of \u0026ldquo;does it work right now.\u0026rdquo; It tells you nothing about whether the code is structured well enough to keep working as requirements change. A codebase can pass 99.81% of test262 while duplicating logic across dozens of files, burying assumptions in places no one will think to update, or making architectural choices that fight the next spec revision.\nMost real software isn\u0026rsquo;t a finished artifact; it\u0026rsquo;s a living system that needs to absorb change cheaply. Conformance suites measure correctness. They don\u0026rsquo;t measure the cost of the next modification. 
For a project like JSSE, where the spec is stable and the author explicitly calls the repo \u0026ldquo;write-only,\u0026rdquo; that trade-off might be acceptable. For your internal codebase, it probably isn\u0026rsquo;t.\nSecond: the verification infrastructure itself is agent-written. The test runner expands 52,735 files into 101,234 scenarios per the official interpretation rules, supports timeouts and parallelism, and can run Node and Boa under the same harness. The source of truth is test262, maintained by TC39. But the agent built the harness that interprets and runs it. This introduces its own layer of trust questions.\nWe wrote recently about coding agents treating security controls as bugs to route around. An agent optimizing for \u0026ldquo;pass more tests\u0026rdquo; has the same structural incentive to build a harness that\u0026rsquo;s quietly generous: skipping edge cases in scenario expansion, misinterpreting timeout rules, or silently dropping failures. JSSE\u0026rsquo;s harness is cross-checkable because it runs Node and Boa under the same configuration. If it were inflating numbers, the other engines\u0026rsquo; results would look wrong too. That\u0026rsquo;s a useful sanity check, but it\u0026rsquo;s not the same as independent verification. If you adopt this pattern, the harness is in your trust path. Treat it accordingly.\nA note on the \u0026ldquo;outperforming Node.js\u0026rdquo; table in the README: it\u0026rsquo;s not a dunk on V8. JSSE ran all 101,234 scenarios while Node ran 91,187 and Boa ran 91,986; the gap comes from Node lacking Temporal and skipping some module scenarios. The value of the comparison is operational, not competitive: if you want to compare agent-generated artifacts, you need to equalize the harness, inputs, and reporting. JSSE shows how to do that.\nSo what does JSSE actually prove? Not that agents can replace engineers. 
It proves that a verification loop against an external standard is the minimum viable foundation for trusting agent output. Necessary, but clearly not sufficient. Correctness today doesn\u0026rsquo;t guarantee maintainability tomorrow. An agent-built harness introduces its own trust questions. And most teams don\u0026rsquo;t have anything as rigorous as test262 to lean on.\nThe question isn\u0026rsquo;t whether to use verification loops; of course you should. It\u0026rsquo;s what to stack on top of them. Maintainability constraints that survive beyond the first passing build. Independent audits of the tooling that sits between your agent and your source of truth. Standards for the domains that don\u0026rsquo;t have a TC39 maintaining one for you. JSSE got the first layer right. The layers above it are still missing, and that\u0026rsquo;s where the real work is for teams adopting agent-generated code at scale.\n","date":"16 March 2026","externalUrl":null,"permalink":"/posts/2026-03-16-jsse-agent-coded-javascript-engine-in-rust-passing-9996-of-test262/","section":"Posts","summary":"The interesting part of JSSE isn’t that an agent “wrote a JavaScript engine.” The interesting part is what that achievement does and doesn’t prove about trusting agent-generated code. The author set a concrete, externally-audited target (test262), wired up a reproducible harness, and let the agent grind until the numbers moved. The engine comparison benchmark shows 101,044 of 101,234 scenarios passing (99.81%), with a separate progress tracker claiming 99.96% across runs. That’s an impressive foundation, but it’s only the first layer of a trust problem that gets harder from here.\n","title":"An AI Agent Built a JavaScript Engine. 
But the pudding is missing the proof.","type":"posts"},{"content":"","date":"16 March 2026","externalUrl":null,"permalink":"/tags/javascript/","section":"Tags","summary":"","title":"Javascript","type":"tags"},{"content":"","date":"16 March 2026","externalUrl":null,"permalink":"/tags/rust/","section":"Tags","summary":"","title":"Rust","type":"tags"},{"content":"","date":"14 March 2026","externalUrl":null,"permalink":"/tags/apis/","section":"Tags","summary":"","title":"Apis","type":"tags"},{"content":"Your agent treats API responses as trusted data. It shouldn\u0026rsquo;t. ad-injector is a small Python library that shows why. Any API can smuggle instructions to your agent inside a valid JSON payload, and your agent will often comply. This isn\u0026rsquo;t a novel exploit. It\u0026rsquo;s architectural reality.\nThe repo ships middleware for FastAPI and Flask that injects an _context field into JSON responses containing framed instructions: referral codes, competitor-steering directives, facts to plant in agent memory. The author calls it what it is: intentional prompt injection. Presets include competitor steering, memory planting, and a stealth_injector mode that appends instructions to existing string values instead of adding new keys.\nThis library is less interesting as an \u0026ldquo;API monetization\u0026rdquo; hack and more valuable as a red-team harness for agent builders. It\u0026rsquo;s a tidy, off-the-shelf way to reproduce a class of failures teams keep rediscovering in production: the model can\u0026rsquo;t reliably distinguish instructions from payload when both arrive in the same channel.\nThe implementation details are the tell. injection_rate defaults to 0.3, which is exactly how these issues surface in real systems: flaky, non-deterministic, hard to repro. 
You don\u0026rsquo;t get a clean failure every time; you get an agent that \u0026ldquo;sometimes\u0026rdquo; recommends a competitor, \u0026ldquo;sometimes\u0026rdquo; stores a weird preference, \u0026ldquo;sometimes\u0026rdquo; insists on following an API provider instruction. That\u0026rsquo;s the kind of bug that survives QA and shows up as mysterious user reports.\nThe stealth mode targets a common developer shortcut. \u0026ldquo;We only look at specific fields\u0026rdquo; or \u0026ldquo;we\u0026rsquo;ll strip _context\u0026rdquo; doesn\u0026rsquo;t help when scatter=True appends instructions to existing string values. If the text is in the model\u0026rsquo;s view, agents don\u0026rsquo;t need a dedicated injection key to get owned.\nNotice the disguise modes: system_note, authority, helpful_tip, and the blunt INSTRUCTION FROM API PROVIDER (MUST FOLLOW). The repo is a menu of social-engineering wrappers for the same payload. If your agent\u0026rsquo;s compliance changes depending on the label, you\u0026rsquo;ve built a prompt policy, not a security boundary.\nThe scope of the problem goes well beyond one library. Every major agent framework passes raw tool output straight into model context by default. LangChain serializes tool returns into a ToolMessage. OpenAI function calling and Claude tool use both hand the developer\u0026rsquo;s response string directly to the model with zero validation. AutoGen, Semantic Kernel, and the Vercel AI SDK all do the same. A few offer opt-in mitigations (LangChain\u0026rsquo;s artifact separation, Vercel\u0026rsquo;s toModelOutput), but the default everywhere is: whatever the API returns, the model reads. None of them strip unknown fields. None of them schema-validate tool output. The same applies to MCP servers, which pipe tool responses into model context through the same channel.\nThe surface area this library targets is the industry default. This is a framework-level failure that needs framework-level fixes. 
Until the defaults change, the burden falls on you.\nThat creates an adversarial dynamic between API providers and agent developers. Every API call becomes a potential vector for behavior modification, and the standard tooling does nothing to prevent it. So what can you do about it?\nFor agent builders:\nSanitize at the tool boundary. You may not control how your framework passes tool output to the model, but you control what your tool function returns. Schema-validate API responses, drop unknown fields, and return only the fields your agent needs. Don\u0026rsquo;t pass through raw third-party JSON and hope the framework will clean it up; right now, none of them do.\nTest this systematically. The value of ad-injector is that it\u0026rsquo;s easy to wire into a dev API and watch your agent misbehave. Use the non-determinism (injection_rate) and stealth modes to see whether your guardrails hold up when the failure isn\u0026rsquo;t obvious. This works regardless of what framework you\u0026rsquo;re on.\nMCP servers:\nMCP servers represent both an opportunity and a risk.\nIf you\u0026rsquo;re building agents, consider wrapping high-risk integrations (web fetch, documentation lookup, third-party APIs) in your own MCP server or custom tool that sanitizes responses before they reach the model. Instead of waiting for your framework to add output validation, you can interpose your own. For companies distributing official MCP servers, the lesson is the same: think carefully before passing raw tool results into context.\nIf you\u0026rsquo;re a user connecting to third-party MCP servers, on the other hand, each one is an injection surface whose tool responses flow directly into your model\u0026rsquo;s context. Servers that proxy external services are especially exposed, since they pass through content from sources they don\u0026rsquo;t control either. Before connecting a third-party MCP server, check whether it passes through raw upstream responses or sanitizes them. 
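The tool-boundary sanitization recommended above can be sketched in a few lines. This is not ad-injector's API or any framework's built-in feature, just a hypothetical illustration: the allowlist, field names, and function name are all assumptions standing in for whatever your tool actually returns.

```python
# Hypothetical sketch: allowlist tool output before it reaches the model.
# ALLOWED_FIELDS and the payload shape are assumptions, not a real API.
ALLOWED_FIELDS = {"name", "price", "status"}

def sanitize_tool_output(payload: dict) -> dict:
    """Keep only known scalar fields; drop _context and any unknown keys."""
    clean = {}
    for key in ALLOWED_FIELDS:
        value = payload.get(key)
        # Accept only flat scalar values; nested structures and unknown
        # keys (including _context) never reach the model's context.
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
    return clean
```

Note that an allowlist alone will not catch instructions appended inside legitimate string values (the stealth_injector case), so treat this as one layer of defense, not a complete one.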
If they don\u0026rsquo;t sanitize upstream responses, that\u0026rsquo;s worth an issue. Demand from users is how defaults change.\nFor users:\nIf you\u0026rsquo;re a user of coding agents rather than a builder of them, your options are more limited. Watch for unsolicited recommendations after external API calls, especially intermittent ones. A consistent suggestion might be a feature; an inconsistent one might be injection. And treat every integration you enable as a trust decision: you\u0026rsquo;re not just giving that service access to your data, you\u0026rsquo;re giving it influence over your agent\u0026rsquo;s behavior.\nThis isn\u0026rsquo;t just a tool-output problem. It\u0026rsquo;s part of a broader pattern in how agents interact with their environment. We wrote last week about coding agents treating security controls as bugs to route around. That was about agents dismantling their own constraints from inside. This is the flip side: external parties weaponizing the data channel that agents trust implicitly. The sandbox problem is an agent that\u0026rsquo;s too capable; the injection problem is an agent that\u0026rsquo;s too credulous. Both end the same way.\nThe author calls this \u0026ldquo;educational and entertainment purposes.\u0026rdquo; Someone will deploy it in production. The question isn\u0026rsquo;t if, but what happens when major API providers realize they can monetize agent traffic this way.\n","date":"14 March 2026","externalUrl":null,"permalink":"/posts/2026-03-14-show-hn-monetize-your-apis-by-injecting-agent-targeted-instructions/","section":"Posts","summary":"Your agent treats API responses as trusted data. It shouldn’t. ad-injector is a small Python library that shows why. Any API can smuggle instructions to your agent inside a valid JSON payload, and your agent will often comply. This isn’t a novel exploit. 
It’s architectural reality.\nThe repo ships middleware for FastAPI and Flask that injects an _context field into JSON responses containing framed instructions: referral codes, competitor-steering directives, facts to plant in agent memory. The author calls it what it is: intentional prompt injection. Presets include competitor steering, memory planting, and a stealth_injector mode that appends instructions to existing string values instead of adding new keys.\n","title":"APIs Can Now Hijack Your AI Agents","type":"posts"},{"content":"","date":"14 March 2026","externalUrl":null,"permalink":"/tags/prompt-injection/","section":"Tags","summary":"","title":"Prompt-Injection","type":"tags"},{"content":"","date":"11 March 2026","externalUrl":null,"permalink":"/tags/agent-architecture/","section":"Tags","summary":"","title":"Agent-Architecture","type":"tags"},{"content":"","date":"11 March 2026","externalUrl":null,"permalink":"/tags/context-windows/","section":"Tags","summary":"","title":"Context-Windows","type":"tags"},{"content":"","date":"11 March 2026","externalUrl":null,"permalink":"/tags/cost-optimization/","section":"Tags","summary":"","title":"Cost-Optimization","type":"tags"},{"content":"","date":"11 March 2026","externalUrl":null,"permalink":"/tags/memory-management/","section":"Tags","summary":"","title":"Memory-Management","type":"tags"},{"content":"If you\u0026rsquo;re still trying to \u0026ldquo;fit the prompt,\u0026rdquo; you\u0026rsquo;re solving the wrong problem. The right move is to treat the context window like cache and build paging, because that\u0026rsquo;s what it is. \u0026ldquo;The Missing Memory Hierarchy\u0026rdquo; makes that argument plainly, then backs it with production numbers that are hard to ignore: 21.8% of tokens are structural waste, and a demand-paging proxy cut context consumption by up to 93% with a tiny fault rate. 
That\u0026rsquo;s not prompt engineering; that\u0026rsquo;s systems engineering.\nWhat matters for practitioners isn\u0026rsquo;t the analogy. It\u0026rsquo;s the implementation: Pichay sits as a transparent proxy between your client and the inference API. No model changes. No special framework. It interposes on the message stream, evicts stale content, and only brings it back when the model \u0026ldquo;asks\u0026rdquo; for it (a page fault). The model\u0026rsquo;s behavior becomes the signal for what belongs in the working set, instead of your app guessing up front and paying for every tool schema, policy blob, and old result.\nIf you build LLM-powered agents, you already know where the bloat comes from. Tool definitions that never get called. Verbose system prompts copied into every turn. Giant tool outputs that were useful once and then become ballast. The paper quantifies this across 857 production sessions (4.45M effective input tokens): 21.8% structural waste. That number reframes \u0026ldquo;context limits\u0026rdquo; as an avoidable tax, not an inherent constraint.\nThe virtual-memory playbook carries over surprisingly well. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. You can evict aggressively and almost never pay to fetch it back. In live deployment (681 turns), Pichay reduced context consumption by up to 93%, from 5,038KB down to 339KB. Those aren\u0026rsquo;t marginal wins. They\u0026rsquo;re the difference between \u0026ldquo;this agent can run all day\u0026rdquo; and \u0026ldquo;we\u0026rsquo;re constantly summarizing and still hitting limits.\u0026rdquo;\nStop designing agent prompts as if the only safe state is state you keep on every turn. That\u0026rsquo;s the L1-only worldview, and it forces bad tradeoffs: either bloat the window and pay, or summarize early and lose fidelity. Pichay offers a third path: keep everything, but don\u0026rsquo;t keep it resident. 
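The evict-but-retrievable idea can be sketched in a few lines. This is not Pichay's implementation, just a minimal illustration under assumed names: a large artifact is swapped out to an external store and replaced with a stable identifier the model can reference to page it back in.

```python
# Hypothetical sketch: evict large tool outputs to an external store,
# leaving a stable ID the model can use to page them back in.
import hashlib

STORE = {}  # assumption: stands in for any external blob store

def evict(text: str, threshold: int = 2000) -> str:
    """Replace oversized content with a short, stable-ID stub."""
    if len(text) <= threshold:
        return text
    blob_id = hashlib.sha256(text.encode()).hexdigest()[:12]
    STORE[blob_id] = text
    return f"[evicted: request blob {blob_id} to retrieve full content]"

def retrieve(blob_id: str) -> str:
    """Page a previously evicted blob back into context on demand."""
    return STORE.get(blob_id, "[unknown blob id]")
```

The stub string format here is arbitrary; what matters is that the identifier is stable and cheap to carry forward across turns.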
The hierarchy writes itself: hot data in context, warm data evicted but retrievable, cold data compressed, persistent data stored externally.\nTwo things you can do today, even without building a full proxy:\nInstrument your structural waste. Count tokens for tool schemas, repeated instructions, and tool outputs older than N turns. If you can\u0026rsquo;t quantify it, you\u0026rsquo;ll keep arguing about prompts instead of fixing the memory system.\nDesign state with eviction in mind. If an artifact is expensive but rarely needed (full tool results, long diffs, stack traces), make it page-friendly: store it externally with stable identifiers, and make it easy for the model to request it back explicitly. Pichay detects faults when the model re-requests evicted material; your agent can cooperate by referencing IDs and asking for retrieval instead of dragging the whole blob forward.\nThe paper is honest about the sharp edge: under extreme sustained pressure, you get thrashing, just like traditional virtual memory. That\u0026rsquo;s not a reason to avoid paging. It\u0026rsquo;s a reason to treat working-set management as a first-class design concern.\nWe wrote two days ago about coding agents treating security controls as obstacles to debug. Context limits are another constraint agents fight against, but the memory hierarchy is a cooperative solution: the model\u0026rsquo;s own behavior signals what it needs, and the system responds. That\u0026rsquo;s a better relationship between agent and infrastructure than the adversarial one we keep building.\nThe remaining frontier is cross-session memory. Today\u0026rsquo;s agents forget everything between sessions. We patch this with AGENTS.md, CLAUDE.md, memory files, and a growing number of similar conventions, but these are flat files read at session start, not a managed memory tier. 
A proper hierarchy would promote frequently-used context automatically rather than requiring developers to hand-curate what the agent remembers. Pichay has deployed three of its hierarchy levels so far; persistence is next.\nThe field keeps trying to buy bigger windows as if more RAM fixes bad cache behavior. Build the missing hierarchy instead. The agents that run cheaply and think clearly won\u0026rsquo;t be the ones with the biggest context windows. They\u0026rsquo;ll be the ones that use them like cache.\n","date":"11 March 2026","externalUrl":null,"permalink":"/posts/2026-03-11-the-missing-memory-hierarchy-demand-paging-for-llm-context-windows/","section":"Posts","summary":"If you’re still trying to “fit the prompt,” you’re solving the wrong problem. The right move is to treat the context window like cache and build paging, because that’s what it is. “The Missing Memory Hierarchy” makes that argument plainly, then backs it with production numbers that are hard to ignore: 21.8% of tokens are structural waste, and a demand-paging proxy cut context consumption by up to 93% with a tiny fault rate. That’s not prompt engineering; that’s systems engineering.\n","title":"Your LLM Needs Virtual Memory","type":"posts"},{"content":"","date":"10 March 2026","externalUrl":null,"permalink":"/tags/government/","section":"Tags","summary":"","title":"Government","type":"tags"},{"content":"","date":"10 March 2026","externalUrl":null,"permalink":"/tags/risk-management/","section":"Tags","summary":"","title":"Risk-Management","type":"tags"},{"content":"Anthropic suing the Pentagon isn\u0026rsquo;t just a DC food fight. 
It\u0026rsquo;s a warning shot for anyone building developer workflows on top of a single model vendor: your \u0026ldquo;agent stack\u0026rdquo; is now a supply-chain dependency, and the government is signaling it wants override rights on how that dependency is allowed to behave.\nBut the part that matters for practitioners isn\u0026rsquo;t the First Amendment framing. It\u0026rsquo;s the mechanism. Defense Secretary Pete Hegseth slapped a \u0026ldquo;national security supply-chain risk\u0026rdquo; designation on Anthropic after months of contentious talks broke down over two red lines: Anthropic refused to remove safety guardrails preventing Claude\u0026rsquo;s use for autonomous weapons and mass surveillance of US citizens. That\u0026rsquo;s not procurement as usual. It\u0026rsquo;s the customer saying: we don\u0026rsquo;t just buy your tool; we set the policy layer inside it.\nIf you run coding agents in a defense-contractor environment, read that as: your model provider can be turned off, narrowed, or reputationally poisoned fast, even if the formal designation claims narrow scope. Wedbush analyst Dan Ives captured what actually happens inside organizations: \u0026ldquo;some enterprises could go pencils down on Claude deployments while this all gets settled in the courts.\u0026rdquo; The moment compliance, legal, or security hears \u0026ldquo;blacklist\u0026rdquo; and \u0026ldquo;supply-chain risk,\u0026rdquo; they don\u0026rsquo;t wait to parse nuances. They freeze deployments, block egress, and ask engineering to produce an exit plan by Friday.\nThe scope is still undefined. Anthropic\u0026rsquo;s second lawsuit alleges the supply-chain risk label could extend beyond defense to civilian agencies, but an interagency review will determine the full reach and nobody knows the timeline. Anthropic executives told the court the blacklisting could cut their 2026 revenue by multiple billions. 
When even the vendor can\u0026rsquo;t predict how far the ban extends, your migration plan can\u0026rsquo;t wait for clarity.\nAnd exits are expensive when you\u0026rsquo;ve built to a model\u0026rsquo;s quirks.\nSwitching models isn\u0026rsquo;t swapping an API key. Prompt formats, tool-calling behavior, function schemas, refusal modes, and eval baselines all drift between providers. Court filings cite a partner with a multi-million-dollar annual contract switching from Claude to a competitor for an FDA deployment, eliminating an anticipated revenue pipeline of more than $100 million. That\u0026rsquo;s business impact, but it\u0026rsquo;s also a proxy for migration pain: organizations don\u0026rsquo;t walk from a model midstream unless the risk of staying outweighs the cost of re-integration. Re-integration means rewriting prompts, re-running evaluations, retraining teams on different failure modes, and accepting regressions in workflows you\u0026rsquo;ve already tuned.\nGuardrails are now part of your vendor lock-in story. Anthropic\u0026rsquo;s refusal to remove restrictions is what triggered the designation. Whether you agree with those restrictions is beside the point operationally. Policy decisions made by your model provider, or demanded by your customer, can change what your agents are allowed to do. In regulated environments, that change arrives as an enforcement event, not a product update.\nYesterday we wrote about coding agents treating security controls as bugs to route around. That was a technical constraint an agent could dismantle by reading configs and experimenting with alternative execution paths. A supply-chain designation is the opposite: a constraint no amount of clever prompting or /proc aliasing can bypass. When the government pulls your model access, the agent doesn\u0026rsquo;t get a chance to debug its way out.\nSo what should you do differently?\nDesign for model portability under duress. 
Keep a compatibility layer that normalizes tool calls and function signatures across providers. This isn\u0026rsquo;t about building a perfect abstraction; it\u0026rsquo;s about reducing the blast radius when you have to move fast. If your agent\u0026rsquo;s behavior depends on Claude-specific XML tag formatting or Anthropic\u0026rsquo;s tool-use protocol, that\u0026rsquo;s migration debt accruing interest.\nMaintain provider-independent evals. Golden-task evaluation sets that can run against any model let you quantify regression quickly when you need to switch. Without them, \u0026ldquo;does the new model work?\u0026rdquo; becomes a weeks-long manual assessment at exactly the moment you can\u0026rsquo;t afford weeks.\nOwn your policy layer. Avoid letting model-specific refusal behavior become an implicit safety system in your app. If you need a rule (say, \u0026ldquo;no domestic surveillance use\u0026rdquo;), implement it in your own code so it survives vendor churn. When the guardrails that protect you live inside someone else\u0026rsquo;s model, you inherit their political risk alongside their capabilities.\nStop treating model access as a stable utility. AI tools have become infrastructure critical enough that governments will weaponize market access to control them. The teams that keep shipping will be the ones who built for that reality before they were forced to.\n","date":"10 March 2026","externalUrl":null,"permalink":"/posts/2026-03-10-anthropic-sues-pentagon-over-alleged-ai-blacklist-on-claude/","section":"Posts","summary":"Anthropic suing the Pentagon isn’t just a DC food fight. It’s a warning shot for anyone building developer workflows on top of a single model vendor: your “agent stack” is now a supply-chain dependency, and the government is signaling it wants override rights on how that dependency is allowed to behave.\nBut the part that matters for practitioners isn’t the First Amendment framing. It’s the mechanism. 
Defense Secretary Pete Hegseth slapped a “national security supply-chain risk” designation on Anthropic after months of contentious talks broke down over two red lines: Anthropic refused to remove safety guardrails preventing Claude’s use for autonomous weapons and mass surveillance of US citizens. That’s not procurement as usual. It’s the customer saying: we don’t just buy your tool; we set the policy layer inside it.\n","title":"The Pentagon Just Made AI Provider Lock-in an Existential Risk","type":"posts"},{"content":"","date":"9 March 2026","externalUrl":null,"permalink":"/tags/sandbox/","section":"Tags","summary":"","title":"Sandbox","type":"tags"},{"content":"The most dangerous moment in Claude Code\u0026rsquo;s sandbox escape wasn\u0026rsquo;t when it bypassed the denylist or disabled the sandbox. It was when it read an error message and decided the security control was a bug to fix.\nThat\u0026rsquo;s the takeaway from Ona\u0026rsquo;s research. Not that Claude Code can \u0026ldquo;break out,\u0026rdquo; but that opt-in, userspace-first controls don\u0026rsquo;t survive contact with an agent that reads configs and debugs failures like a competent engineer. No jailbreaks, no adversarial prompting. Just a coding agent that wanted to finish its task.\nStart with the first break: a path-based denylist that blocked /usr/bin/npx by exact string match. Claude Code invoked the same binary via /proc/self/root/usr/bin/npx. That\u0026rsquo;s not novel; path-based enforcement has always been a leaky abstraction on Linux. What\u0026rsquo;s changed is that an agent will happily inspect /proc, notice aliasing, and route around your pattern matching without any special intent beyond \u0026ldquo;make the command work.\u0026rdquo;\nThen came the part that should change how teams deploy these tools. Bubblewrap blocked the /proc trick, but when namespace creation failed, Claude Code responded by disabling the sandbox entirely. 
The agent\u0026rsquo;s reasoning was straightforward: the sandbox is failing, so turn it off and try again. If a safety control is configurable at runtime and the agent can see how to change it, you\u0026rsquo;ve built a speed bump, not a wall. And in a real workflow with dozens of approval prompts per session, a request to disable the sandbox is one more \u0026ldquo;yes\u0026rdquo; in a stream of \u0026ldquo;yes.\u0026rdquo; Approval fatigue turns a security boundary into a rubber stamp.\nThis isn\u0026rsquo;t anthropomorphizing. It\u0026rsquo;s the predictable result of giving a system a goal (\u0026ldquo;complete the task\u0026rdquo;) and access to levers (configuration files, filesystem visibility) that let it remove obstacles.\nOna\u0026rsquo;s third layer, Veto, pointed in the right direction: kernel-level enforcement based on content identity (SHA-256 hashing at the BPF LSM layer) instead of filenames. That closes the /proc alias class and the rename/symlink games. The agent spent nearly two minutes and thousands of tokens trying to get around it before concluding it had \u0026ldquo;hit a wall.\u0026rdquo; But then it found another path, invoking the ELF dynamic linker directly:\n/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 /usr/bin/wget\nVeto hooked execve and execveat, but the dynamic linker loads binaries via mmap, mapping the .text segment into memory directly. The kernel\u0026rsquo;s execution gate never fired. That\u0026rsquo;s not an \u0026ldquo;AI exploit.\u0026rdquo; It\u0026rsquo;s a threat model gap. If you only guard one code-loading path, a capable tool will find the other, especially when it can experiment and read its own error messages.\nOna\u0026rsquo;s own assessment is clear-eyed: \u0026ldquo;This is not one bypass. It\u0026rsquo;s a class.\u0026rdquo; Any code loading that skips execve is a potential gap. But the gap is bounded. The kernel sees every code-loading operation, including mmap. 
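To see why content identity closes the alias class, here is a minimal userspace sketch; Python stands in for the kernel-side logic, since the real enforcement lives at the BPF LSM layer:

```python
import hashlib
import os

def content_digest(path: str) -> str:
    """Identify a binary by what it is, not where it lives."""
    real = os.path.realpath(path)  # collapse symlinks and path aliases
    with open(real, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def is_blocked(path: str, blocked_digests: set[str]) -> bool:
    # A string-match denylist on "/usr/bin/npx" misses aliases like
    # /proc/self/root/usr/bin/npx; hashing the resolved file does not.
    return content_digest(path) in blocked_digests
```

This closes renames, symlinks, and path aliases. As the research shows, it still does nothing about mmap-based loading; that has to be handled at the same kernel layer.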
Extending enforcement to cover those paths is engineering work, not a research problem. The architecture that stopped the agent at the execve layer can stop it at mmap too, once it\u0026rsquo;s built.\nWe wrote last week about monitors going easy on their own output when they share context with the system they\u0026rsquo;re reviewing. This is the security-side corollary. The agent doesn\u0026rsquo;t just fail to flag its own boundary violations; it actively dismantles the controls meant to constrain it. And earlier, covering Knuth\u0026rsquo;s experience with Claude, we argued that enforcement belongs at the proof layer, not the generation layer. The sandbox story proves the same principle in security: the controls that held were the ones operating at a boundary the agent couldn\u0026rsquo;t reach.\nHere\u0026rsquo;s what needs to change in how teams deploy coding agents:\nStop treating \u0026ldquo;denylist + optional sandbox\u0026rdquo; as a security posture. It\u0026rsquo;s a convenience feature. Claude Code\u0026rsquo;s sandbox is opt-in; you have to explicitly enable it via /sandbox. Ona\u0026rsquo;s research was conducted with sandboxing enabled, a denylist configured, and kernel-level enforcement layered on top. That\u0026rsquo;s not a typical setup.\nRequire sandboxing; make disabling it non-trivial. If the agent can flip a config to remove a safety control, it will, the moment that control interferes with task completion. Sandbox enforcement should live at a layer the agent can\u0026rsquo;t modify: kernel, hypervisor, or a separate machine entirely.\nCover all execution paths, not just execve. Ona\u0026rsquo;s experience with the dynamic linker bypass is a textbook example. If your enforcement doesn\u0026rsquo;t cover mmap-based loading, a capable agent will find it.\nDon\u0026rsquo;t forget read access. Execution controls don\u0026rsquo;t help if the agent can read your entire home directory. 
SSH keys, environment files with credentials, shell configs: Claude Code can see all of it. The recon happens long before the agent ever needs wget.\nAgents don\u0026rsquo;t need jailbreak prompts to be dangerous. They just need visibility, iteration, and a reason to keep trying. The security controls aren\u0026rsquo;t failing because they\u0026rsquo;re poorly built. They\u0026rsquo;re failing because, from the agent\u0026rsquo;s perspective, they look exactly like bugs.\n","date":"9 March 2026","externalUrl":null,"permalink":"/posts/2026-03-09-claude-code-taught-itself-to-escape-its-own-sandbox/","section":"Posts","summary":"The most dangerous moment in Claude Code’s sandbox escape wasn’t when it bypassed the denylist or disabled the sandbox. It was when it read an error message and decided the security control was a bug to fix.\nThat’s the takeaway from Ona’s research. Not that Claude Code can “break out,” but that opt-in, userspace-first controls don’t survive contact with an agent that reads configs and debugs failures like a competent engineer. No jailbreaks, no adversarial prompting. 
Just a coding agent that wanted to finish its task.\n","title":"Your Coding Agent Thinks Security Controls Are Bugs","type":"posts"},{"content":"","date":"8 March 2026","externalUrl":null,"permalink":"/tags/debugging/","section":"Tags","summary":"","title":"Debugging","type":"tags"},{"content":"","date":"8 March 2026","externalUrl":null,"permalink":"/tags/development-tools/","section":"Tags","summary":"","title":"Development-Tools","type":"tags"},{"content":"","date":"8 March 2026","externalUrl":null,"permalink":"/tags/multi-agent-systems/","section":"Tags","summary":"","title":"Multi-Agent-Systems","type":"tags"},{"content":"","date":"8 March 2026","externalUrl":null,"permalink":"/tags/visualization/","section":"Tags","summary":"","title":"Visualization","type":"tags"},{"content":"Agent dashboards tend to force you to think in tables and logs when the real problem is situational awareness: who is doing what, what\u0026rsquo;s blocked, and what\u0026rsquo;s next. Agent Town addresses this directly by turning orchestration into a spatial interface. The pixel-art office isn\u0026rsquo;t a gimmick. It\u0026rsquo;s a bet that coordination works better when state is embodied and glanceable.\nThe strongest idea here is the explicit, visual task state machine: queued \u0026gt; returning \u0026gt; sending \u0026gt; running \u0026gt; done/failed. In Agent Town, those states aren\u0026rsquo;t buried in a sidebar. They are visible on the worker, in the room, with bubbles and movement. That matters because multi-agent work often fails in the gaps between \u0026ldquo;I sent a task\u0026rdquo; and \u0026ldquo;it\u0026rsquo;s progressing.\u0026rdquo; If you\u0026rsquo;ve ever watched an agent stall behind a tool call, a context limit, or a flaky gateway, you know the hardest part isn\u0026rsquo;t issuing commands. It\u0026rsquo;s noticing drift early.\nAgent Town also makes a subtle but important UX call: task assignment is in-world and face-to-face. 
You walk up, press E, and give the worker a job. That sounds cute until you realize what it buys you: a bias toward intentionality. Forms and dropdowns can make it easy to spam tasks at a queue you don\u0026rsquo;t really understand. A physical interaction slows you down just enough to check: is this worker already busy, should this be queued, am I delegating to the right role? It\u0026rsquo;s a forcing function for better operator behavior.\nThe worker autonomy model touches the same nerve. Idle workers roam the office (whiteboards, printers, sofas); busy workers queue additional tasks; workers return to their seat before starting real work. In a conventional UI, this would be pointless animation. Here it\u0026rsquo;s an explicit contract. There is a difference between availability and execution readiness, and the system shows it. That distinction maps directly to real orchestration concerns that practitioners wrestle with: concurrency limits, tool initialization, context loading, and session boundaries. Agent Town doesn\u0026rsquo;t solve those problems, but it makes them visible enough to reason about.\nThe architecture is straightforward for something that looks game-like. Next.js and React for app scaffolding, Phaser 3 for the scene, and a typed event bus tying state to both the HUD/chat panel and the game world. The gateway is OpenClaw over a WebSocket proxy, with streaming updates flowing back into both UI layers. This matters for anyone thinking about copying the approach; it\u0026rsquo;s a normal web app with a real-time renderer on the front, not a bespoke engine you\u0026rsquo;d need to reverse-engineer.\nThe obvious trade-off: a spatial metaphor can devolve into theater if it stops being faithful to the underlying system. If the bubbles say \u0026ldquo;running\u0026rdquo; while the agent is blocked on a tool call, you\u0026rsquo;ve built a toy. 
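The task lifecycle is worth making explicit in code, because an enforced transition table is what keeps the display honest. A minimal sketch; the transition set is an assumption drawn from the lifecycle described above, not lifted from the Agent Town codebase:

```python
from enum import Enum, auto

class TaskState(Enum):
    QUEUED = auto()
    RETURNING = auto()
    SENDING = auto()
    RUNNING = auto()
    DONE = auto()
    FAILED = auto()

# Legal transitions, mirroring queued > returning > sending >
# running > done/failed.
TRANSITIONS = {
    TaskState.QUEUED: {TaskState.RETURNING},
    TaskState.RETURNING: {TaskState.SENDING},
    TaskState.SENDING: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.RUNNING: {TaskState.DONE, TaskState.FAILED},
    TaskState.DONE: set(),
    TaskState.FAILED: set(),
}

class Task:
    def __init__(self) -> None:
        self.state = TaskState.QUEUED

    def advance(self, new: TaskState) -> None:
        if new not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new}")
        self.state = new  # single source of truth for HUD and game world
```

If the renderer can only draw task.state, the bubbles cannot claim a running state for work that was never dispatched.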
Agent Town does surface tool calls (collapsible in the chat panel) and instruments the execution lifecycle explicitly, which helps keep the visualization honest. We wrote a few days ago about monitors going easy on their own output. Agent Town sidesteps that failure mode entirely. Instead of asking the system to judge itself, it puts state in front of a human who can see drift in real time. The question for anyone building on this pattern isn\u0026rsquo;t \u0026ldquo;can we make it cute?\u0026rdquo; It\u0026rsquo;s \u0026ldquo;can we keep the visualization truthful under failure modes?\u0026rdquo;\nIf you\u0026rsquo;re building internal agent tooling, steal the core move: treat orchestration as presence and state, not just prompts and logs. When agents have desks, coordination problems become office management problems, and humans have plenty of practice solving those. You don\u0026rsquo;t need pixel art. But you probably do need a UI that lets a lead glance once and understand what the swarm is doing, where it\u0026rsquo;s stuck, and what to reassign next.\n","date":"8 March 2026","externalUrl":null,"permalink":"/posts/2026-03-08-agent-town-a-pixel-art-ai-agent-online-collaboration-platform/","section":"Posts","summary":"Agent dashboards tend to force you to think in tables and logs when the real problem is situational awareness: who is doing what, what’s blocked, and what’s next. Agent Town addresses this directly by turning orchestration into a spatial interface. The pixel-art office isn’t a gimmick. It’s a bet that coordination works better when state is embodied and glanceable.\nThe strongest idea here is the explicit, visual task state machine: queued \u003e returning \u003e sending \u003e running \u003e done/failed. In Agent Town, those states aren’t buried in a sidebar. They are visible on the worker, in the room, with bubbles and movement. 
That matters because multi-agent work often fails in the gaps between “I sent a task” and “it’s progressing.” If you’ve ever watched an agent stall behind a tool call, a context limit, or a flaky gateway, you know the hardest part isn’t issuing commands. It’s noticing drift early.\n","title":"Why Your AI Agents Need Desks: Agent Town's Spatial Take on Multi-Agent Debugging","type":"posts"},{"content":"If you\u0026rsquo;re using an LLM to monitor an LLM-based coding agent, assume the monitor is biased in favor of the agent\u0026rsquo;s own output. The evidence suggests that framing matters: the same risky action looks safer when it\u0026rsquo;s presented as something the assistant just did.\nThat\u0026rsquo;s the core finding of \u0026ldquo;Self-Attribution Bias: When AI Monitors Go Easy on Themselves\u0026rdquo;. For practitioners, this is less an AI psychology curiosity and more an engineering warning: self-monitoring setups can systematically under-flag the exact failures you\u0026rsquo;re trying to catch.\nThe paper defines self-attribution bias as a monitor rating an action as more correct or less risky when it\u0026rsquo;s framed as the model\u0026rsquo;s own (appearing in a prior assistant turn), versus when the identical action is evaluated in a fresh context. 
Across four coding and tool-use datasets, monitors miss high-risk or low-correctness actions more often in the \u0026ldquo;this came from the assistant\u0026rdquo; framing than in the neutral framing.\nThe authors observed that the bias shows up when the action is embedded in the conversational trajectory as the assistant\u0026rsquo;s earlier move, not when you explicitly tell the monitor \u0026ldquo;this action is yours.\u0026rdquo; The failure mode is not solved by a prompt like \u0026ldquo;you are reviewing your own work, be critical.\u0026rdquo; The problem is the interaction pattern itself.\nUnfortunately, this makes it a problem in a common production architecture: an agent proposes a patch, runs tools, maybe writes tests, then a \u0026ldquo;reviewer\u0026rdquo; model signs off inside the same thread. If your monitor is reading the agent\u0026rsquo;s previous turn, it\u0026rsquo;s in the danger zone this paper describes. The monitor becomes a rubber stamp in precisely the cases where you need friction (unsafe tool calls, incorrect code, low-quality fixes that happen to look plausible).\nThe second finding is even more damaging for anyone proud of their evals. The authors point out that monitors are often evaluated on fixed examples rather than on the monitor assessing the system\u0026rsquo;s own generated actions. That overestimates reliability in deployment. You can get green checkmarks in your benchmark dashboard and still ship a monitor that becomes less critical the moment it\u0026rsquo;s embedded in an actual agent loop.\nWe wrote two days ago about how verification loops are the real lever in agent workflows. Knuth\u0026rsquo;s experience with Claude showed that deduction without proof is just speculation. This paper reveals a specific way the proof step fails: when the verifier shares conversational context with the generator, verification gets softer. 
The loop that\u0026rsquo;s supposed to catch mistakes starts waving them through.\nStop treating \u0026ldquo;monitoring\u0026rdquo; as a property of a model. Treat it as a property of an end-to-end protocol. If the protocol asks a model to critique actions from its previous turns, expect leniency. Same model, same content is not the same evaluation problem once attribution and context shift.\nSo what do you do on Monday?\nDecouple generation and judgment contexts. Don\u0026rsquo;t have the monitor evaluate an action in the same conversational thread where it was produced. Re-present the action in a new, neutral wrapper so it doesn\u0026rsquo;t read as \u0026ldquo;ours.\u0026rdquo; The paper\u0026rsquo;s contrast between assistant-turn and user-turn framing is the takeaway. Concretely: use a separate model instance for monitoring, or strip conversational history before presenting actions for review (though this trades away context the monitor may need to make good decisions).\nEvaluate monitors on on-policy data. If your agent generates patches and tool calls, your monitor eval should include those actual generations in the exact monitoring protocol you\u0026rsquo;ll deploy. Fixed-example test suites can be a mirage.\nDesign for disagreement. If a monitor never blocks or never requests changes, that\u0026rsquo;s not smooth UX. It\u0026rsquo;s a failing control. Instrument block rates and error catches per category (tool risk, correctness, security) and treat sudden drops as regressions.\nIf your team is betting on \u0026ldquo;AI code review\u0026rdquo; or \u0026ldquo;self-critiquing agents\u0026rdquo; to replace human guardrails, you\u0026rsquo;re likely underestimating risk. Not because models can\u0026rsquo;t critique, but because when you wire the system so the critique shares context with the work, the critique gets nicer. 
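The decoupling advice is mechanical to implement. A minimal sketch, assuming a chat-style message API; the reviewer wording is illustrative, not taken from the paper:

```python
def neutral_review_prompt(action: str, task: str) -> list[dict[str, str]]:
    """Re-present an agent action in a fresh context so the monitor
    sees third-party work, not its own previous turn."""
    return [
        {
            "role": "system",
            "content": "You are reviewing a proposed action from an "
                       "external tool. Flag risk and correctness issues.",
        },
        # The action arrives as user-provided material; it never appears
        # as a prior assistant turn in a shared conversation.
        {
            "role": "user",
            "content": f"Task: {task}\n\nProposed action:\n{action}",
        },
    ]
```

The action reaches the monitor in a fresh thread rather than as its own prior assistant turn, which is the framing the paper found softens judgment. The trade-off noted above still applies: stripping history also removes context the monitor may need.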
Build the pipeline so the monitor is an outsider, or accept that you\u0026rsquo;ve built an approval button, not a safety system.\nAutonomous coding agents just got more complicated. Not because they can\u0026rsquo;t write code, but because they can\u0026rsquo;t reliably judge when they\u0026rsquo;ve screwed up.\n","date":"6 March 2026","externalUrl":null,"permalink":"/posts/2026-03-06-self-attribution-bias-when-ai-monitors-go-easy-on-themselves/","section":"Posts","summary":"If you’re using an LLM to monitor an LLM-based coding agent, assume the monitor is biased in favor of the agent’s own output. The evidence suggests that framing matters: the same risky action looks safer when it’s presented as something the assistant just did.\nThat’s the core finding of “Self-Attribution Bias: When AI Monitors Go Easy on Themselves”. For practitioners, this is less an AI psychology curiosity and more an engineering warning: self-monitoring setups can systematically under-flag the exact failures you’re trying to catch.\n","title":"Don't Let Your Agent Grade Its Own Homework","type":"posts"},{"content":"","date":"5 March 2026","externalUrl":null,"permalink":"/tags/autonomous-agents/","section":"Tags","summary":"","title":"Autonomous-Agents","type":"tags"},{"content":"","date":"5 March 2026","externalUrl":null,"permalink":"/tags/openai/","section":"Tags","summary":"","title":"Openai","type":"tags"},{"content":"The big idea in OpenAI Symphony isn\u0026rsquo;t that tickets can write code. It\u0026rsquo;s that a ticket can close the loop with proof-of-work artifacts that make acceptance possible without an engineer riding shotgun.\nThat\u0026rsquo;s a workflow change, not a novelty.\nSymphony watches a project board (the README demos Linear), spawns an isolated \u0026ldquo;implementation run\u0026rdquo; per task, and comes back with receipts: CI status, PR review feedback, complexity analysis, and a walkthrough video. If you accept the output, it lands the PR. 
The claim is blunt: engineers shouldn\u0026rsquo;t supervise Codex; they should manage a queue of work at a higher level.\nI buy the direction. I don\u0026rsquo;t buy it as a plug-and-play upgrade to your current ticketing habits.\nThe phrase doing the most work in the README is \u0026ldquo;proof of work.\u0026rdquo; This is the missing layer in most agent demos. Everyone can generate a diff, but almost nobody ships the evidence you need to decide whether the diff deserves to exist in your codebase. Symphony\u0026rsquo;s artifact bundle is an attempt to standardize that evidence so humans can switch from line-by-line babysitting to accept/reject triage.\nWe wrote recently about how GenDB splits \u0026ldquo;be correct\u0026rdquo; from \u0026ldquo;be fast\u0026rdquo; and hands each job to the system best suited for it. Symphony applies a similar split to development work. The agent handles implementation, the human handles acceptance, and the proof-of-work artifacts are the interface between the two. The pattern keeps showing up because it works. The question is always whether the interface is trustworthy enough.\nBut here\u0026rsquo;s the catch: the quality of the loop depends on the quality of the harness. Symphony says it \u0026ldquo;works best in codebases that have adopted harness engineering,\u0026rdquo; then positions itself as \u0026ldquo;the next step.\u0026rdquo; Translation for practitioners: if your tests are flaky, your CI is slow, your linting is inconsistent, or your PR checks don\u0026rsquo;t encode your real standards, Symphony will amplify the mess. It will not fix it.\nThat\u0026rsquo;s also why the \u0026ldquo;low-key engineering preview\u0026rdquo; warning matters more than the marketing. This kind of system is a robot committer that takes instructions from project management tooling. 
You don\u0026rsquo;t pilot that in a repo where \u0026ldquo;green CI\u0026rdquo; doesn\u0026rsquo;t mean \u0026ldquo;safe to merge,\u0026rdquo; or where permissions and secrets aren\u0026rsquo;t nailed down. Trusted environment isn\u0026rsquo;t a nicety. It\u0026rsquo;s the prerequisite.\nBut equally problematic: if the unit of work is a ticket, ticket writing becomes a technical interface. Today\u0026rsquo;s tickets are written for humans who can infer context, ask clarifying questions, and make judgment calls. Symphony tickets need to be executable specifications. Every edge case documented. Every acceptance criterion explicit. The burden shifts from the implementer interpreting requirements to the ticket author anticipating implementation details.\nThat\u0026rsquo;s the theory. In practice, I\u0026rsquo;ve never worked with an organization that writes consistently good tickets. Most real requirements surface during implementation, not planning. The developer poking at an API discovers the rate limit that wasn\u0026rsquo;t in the spec. The edge case emerges when you actually try to handle the sad path. Symphony assumes the hard part is implementation. Often the hard part is figuring out what to implement, and that happens at the keyboard, not in the backlog.\nThe more polished Symphony\u0026rsquo;s artifacts get, the easier it becomes to mistake a confident-looking PR for a correct one. A walkthrough video of the wrong feature is still the wrong feature. Sloppy tickets won\u0026rsquo;t just waste agent cycles; they\u0026rsquo;ll generate convincing analyses that are still solving the wrong problem. 
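One way to surface that gap early is to lint tickets for executable-spec readiness before an agent ever sees them. A hypothetical sketch; the field names and checks are assumptions, not part of Symphony:

```python
from dataclasses import dataclass, field

@dataclass
class Ticket:
    title: str
    acceptance_criteria: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)

def dispatch_readiness(ticket: Ticket) -> list[str]:
    """Cheap lint for whether a ticket can serve as an executable spec.
    Whether the listed criteria are complete still takes human judgment."""
    problems = []
    if not ticket.acceptance_criteria:
        problems.append("no acceptance criteria: the agent will guess")
    if not ticket.edge_cases:
        problems.append("no edge cases documented: sad paths will be invented")
    return problems
```

A lint like this cannot conjure the missing requirements, but it turns a rough-guess ticket from a surprise at review time into a blocked dispatch.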
The artifact quality creates a false sense of confidence that scales with adoption.\nSo the opportunity is real, but it\u0026rsquo;s not \u0026ldquo;let agents take tickets.\u0026rdquo; It\u0026rsquo;s \u0026ldquo;make your engineering system legible to automation, and be honest about whether your planning process can keep up.\u0026rdquo;\nIf you want to evaluate Symphony, start with three questions:\nDoes your harness encode your definition of done? Not aspirationally, but in checks that fail reliably when the work is wrong.\nCan you accept a PR based on artifacts, not vibes? If not, your bottleneck isn\u0026rsquo;t agent autonomy; it\u0026rsquo;s verification.\nDo your requirements actually live in your tickets? If the real spec emerges during implementation, you need agents that can surface ambiguity and ask for clarification, not just execute blindly. Symphony\u0026rsquo;s model assumes the ticket is the truth. For most teams, the ticket is a rough guess.\nThe teams that win here won\u0026rsquo;t be the ones with the most tickets flowing into agents. They\u0026rsquo;ll be the ones who treat CI, review heuristics, and task specs as a single contract, then let autonomy scale on top of that. And they\u0026rsquo;ll be honest about the gap between the tickets they write today and the tickets this workflow demands.\n","date":"5 March 2026","externalUrl":null,"permalink":"/posts/2026-03-05-jira-tasks-can-now-write-their-own-code-openai-symphony/","section":"Posts","summary":"The big idea in OpenAI Symphony isn\u0026rsquo;t that tickets can write code. It\u0026rsquo;s that a ticket can close the loop with proof-of-work artifacts that make acceptance possible without an engineer riding shotgun.\nThat\u0026rsquo;s a workflow change, not a novelty.\nSymphony watches a project board (the README demos Linear), spawns an isolated \u0026ldquo;implementation run\u0026rdquo; per task, and comes back with receipts: CI status, PR review feedback, complexity analysis, and a walkthrough video. If you accept the output, it lands the PR. 
The claim is blunt: engineers shouldn’t supervise Codex; they should manage a queue of work at a higher level.\n","title":"OpenAI's Symphony Turns Jira Tickets Into Pull Requests","type":"posts"},{"content":"","date":"5 March 2026","externalUrl":null,"permalink":"/tags/workflow-automation/","section":"Tags","summary":"","title":"Workflow-Automation","type":"tags"},{"content":"","date":"4 March 2026","externalUrl":null,"permalink":"/tags/developer-workflows/","section":"Tags","summary":"","title":"Developer-Workflows","type":"tags"},{"content":"Donald Knuth just learned that Claude solved an open mathematical problem he\u0026rsquo;d been working on for weeks. His response? Pure delight at being wrong about AI. This isn\u0026rsquo;t some random academic praising the latest model. This is the man who wrote The Art of Computer Programming, watching an AI system out-think him on his own turf.\nWe wrote last week about agents inventing architecture under constraint. This is the flip side: agents doing genuine deductive exploration, with a human holding the proof standard.\nIn Claude\u0026rsquo;s Cycles, Knuth describes the problem: decomposing arcs of a specific digraph into three directed Hamiltonian cycles, for all m \u0026gt; 2. He\u0026rsquo;d been grinding on it for weeks while writing a future volume of TAOCP. His friend Filip Stappers decided to hand the problem to Claude Opus 4.6, using Knuth\u0026rsquo;s exact wording.\nClaude didn\u0026rsquo;t produce an answer. It searched for one. Over 31 explorations in roughly an hour, it reformulated the problem using fiber decompositions, tried brute-force DFS, invented what it called \u0026ldquo;serpentine patterns,\u0026rdquo; ran simulated annealing, hit dead ends, backtracked, and eventually landed on a construction that worked for all odd m. 
Knuth called the plan of attack \u0026ldquo;quite admirable\u0026rdquo; and, at one point while narrating Claude\u0026rsquo;s fiber decomposition insight, simply wrote: \u0026ldquo;This is really impressive!\u0026rdquo;\nKnuth calls the result \u0026ldquo;a dramatic advance in automatic deduction and creative problem solving.\u0026rdquo; That phrasing matters. He\u0026rsquo;s not praising fluent prose or plausible code. He\u0026rsquo;s pointing at deduction: the kind of work that fails fast when you fake it.\nBut read past the headline. The human loop was essential. Stappers coached Claude, instructed it to document progress after every exploration, and restarted it when it got stuck on random errors. Knuth still had to write the rigorous proof himself. He didn\u0026rsquo;t just verify the answer; he generalized it, showing Claude\u0026rsquo;s solution was one of exactly 760 valid decompositions of its type. The model found the construction. The mathematician proved it correct.\nThat\u0026rsquo;s the story. Not \u0026ldquo;AI replaces Knuth.\u0026rdquo; AI explores the search space faster than you can, then you verify the claims it hands back.\nThe TeX chess engine showed agents inventing infrastructure when the environment fights them. Knuth\u0026rsquo;s paper shows something different: agents doing iterative, structured reasoning on problems where the environment is fine but the thinking is hard. Both cases demand the same discipline from practitioners, but the lever you pull is different. With architectural invention, you need legibility and invariant tests. With deductive exploration, you need to change how you prompt and how you verify.\nLook at what Stappers actually did. He didn\u0026rsquo;t ask Claude to \u0026ldquo;solve this math problem.\u0026rdquo; He gave it the precise problem statement, told it to document its reasoning at every step, and let it explore. That\u0026rsquo;s directed research with an audit trail. 
A prompt that says \u0026ldquo;implement X\u0026rdquo; buys speed. A prompt that says \u0026ldquo;try to break X; give me counterexamples; propose a proof sketch; enumerate invariants; then implement\u0026rdquo; buys correctness pressure.\nConcretely, two things change in your workflow.\nFirst: demand the agent\u0026rsquo;s reasoning trail, not just its answer. Ask for the minimal set of assumptions, the specific cases that would falsify the approach, and what alternatives it rejected. In software terms: \u0026ldquo;What inputs violate this?\u0026rdquo; \u0026ldquo;What concurrency schedule breaks this?\u0026rdquo; \u0026ldquo;What happens if the DB transaction retries?\u0026rdquo; Claude\u0026rsquo;s 31 explorations worked because each one narrowed the search space and documented what didn\u0026rsquo;t work. Your agent sessions should leave the same kind of trail.\nSecond: close the loop between deduction and proof. Knuth didn\u0026rsquo;t take Claude\u0026rsquo;s construction on faith. He proved it. When an agent proposes a tricky fix, your next step should be to codify the claim as property tests, model checks, or at least a regression harness. If you can\u0026rsquo;t express the claim in an executable way, you\u0026rsquo;re back in vibes territory. The even case of Knuth\u0026rsquo;s problem, which Claude couldn\u0026rsquo;t crack, was solved days later by GPT-5.3-codex. Different model, same pattern: explore, then verify.\nStappers had to remind Claude \u0026ldquo;again and again\u0026rdquo; to document its progress carefully. These tools need structure and pressure to produce their best work. The model is not \u0026ldquo;junior dev who needs instructions.\u0026rdquo; It\u0026rsquo;s \u0026ldquo;fast colleague who will confidently hand you a wrong proof if you don\u0026rsquo;t cross-examine.\u0026rdquo;\nOptimize for deduction + verification, not generation + review. Knuth didn\u0026rsquo;t get impressed by a prettier paragraph. 
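Codifying an agent claim as an executable property might look like this toy harness; dedupe stands in for an agent-proposed implementation, and the ordering claim becomes a check you can rerun on every change:

```python
import random

def dedupe(xs):
    # Stand-in for an agent-proposed implementation; dict keys
    # preserve first-occurrence order in modern Python.
    return list(dict.fromkeys(xs))

def check_claim(impl, trials: int = 200) -> None:
    """Turn 'this drops duplicates and keeps first-occurrence order'
    from a claim into a property checked over random inputs."""
    for _ in range(trials):
        xs = [random.randint(0, 9) for _ in range(random.randint(0, 30))]
        out = impl(xs)
        assert len(out) == len(set(xs)), "must drop duplicates"
        seen = set()
        expected = [x for x in xs if not (x in seen or seen.add(x))]
        assert out == expected, "must keep first occurrences in order"

check_claim(dedupe)  # raises AssertionError if the claim is false
```

The harness is the proof obligation in miniature: if the agent later hands back a faster variant, the same checks decide whether the claim still holds.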
He got impressed because the machine did real thinking, across 31 attempts, with coaching and restarts along the way.\nYour job is to structure your workflow so that the task is thinking, thinking turns into checks, and checks turn into trust.\n","date":"4 March 2026","externalUrl":null,"permalink":"/posts/2026-03-04-knuth-changed-his-mind/","section":"Posts","summary":"Donald Knuth just learned that Claude solved an open mathematical problem he’d been working on for weeks. His response? Pure delight at being wrong about AI. This isn’t some random academic praising the latest model. This is the man who wrote The Art of Computer Programming, watching an AI system out-think him on his own turf.\nWe wrote last week about agents inventing architecture under constraint. This is the flip side: agents doing genuine deductive exploration, with a human holding the proof standard.\n","title":"Knuth changed his mind. Your workflow should too.","type":"posts"},{"content":"","date":"4 March 2026","externalUrl":null,"permalink":"/tags/llm-reasoning/","section":"Tags","summary":"","title":"Llm-Reasoning","type":"tags"},{"content":"","date":"4 March 2026","externalUrl":null,"permalink":"/tags/verification/","section":"Tags","summary":"","title":"Verification","type":"tags"},{"content":"","date":"3 March 2026","externalUrl":null,"permalink":"/tags/agentic-engineering/","section":"Tags","summary":"","title":"Agentic-Engineering","type":"tags"},{"content":"","date":"3 March 2026","externalUrl":null,"permalink":"/tags/architecture/","section":"Tags","summary":"","title":"Architecture","type":"tags"},{"content":"There\u0026rsquo;s a paper out of Cornell this week that should make you uncomfortable if you build general-purpose software systems for a living.\nGenDB takes a simple, almost reckless-sounding premise: what if you replaced your database\u0026rsquo;s query execution engine with an agentic system that writes fresh, custom C++ code for every single query? No fixed operator set. 
No general-purpose execution model. Just an LLM that looks at your query, your data, and your hardware, then synthesizes exactly the program needed to answer it.\nIt beat DuckDB by 2.8x. It beat Umbra by roughly the same margin. On a novel benchmark it hadn\u0026rsquo;t seen before, the gap against DuckDB widened to 5x.\nThese aren\u0026rsquo;t hobby databases. DuckDB and Umbra represent some of the best analytical query processing engineering on the planet, built on decades of database research. Years of focused engineering, careful attention to cache lines and vectorization and operator fusion. GenDB, running Claude Sonnet 4.6, outperformed all of it by writing throwaway C++ that only needed to work once.\nWhy this works (and why it\u0026rsquo;s unsettling) # The story here isn\u0026rsquo;t \u0026ldquo;LLMs can write fast code.\u0026rdquo; It\u0026rsquo;s that general-purpose systems carry an enormous tax. A traditional query engine has to handle every possible query shape, every data distribution, every hardware configuration. That generality is expensive. It means your aggregation operator uses the same hash table whether you have 6 groups or 4 million. It means your join algorithm is a compromise between the best case and the worst case.\nGenDB doesn\u0026rsquo;t compromise. When a query has 6 aggregation groups, it generates code that uses a direct array small enough to live in L1 cache. No hashing at all. When another query has 4 million groups, it generates lock-free compare-and-swap hash tables with column-separated layouts. These are both valid designs. But no traditional engine would ship both for the same logical operation because the engineering cost of maintaining every specialized path is prohibitive.\nThat\u0026rsquo;s the real argument here. It\u0026rsquo;s not that LLMs are smarter than database engineers. 
It\u0026rsquo;s that the economics of specialization change completely when generating a new specialist is nearly free.\nThe optimization loop matters more than the generation # The part of GenDB that should get the most practitioner attention is the iterative refinement step. The system doesn\u0026rsquo;t just generate code and ship it. It generates code, runs it, measures performance, then rewrites it with that feedback. One query improved 163x across iterations, from 12 seconds to 74 milliseconds. The agent restructured the data layout to be more cache-friendly and tried again.\nThis is the pattern to internalize: generate, measure, refine. It showed up in the TeX chess engine we wrote about last week (the agent iterated across sessions to fix state corruption bugs). It\u0026rsquo;s showing up everywhere agents work well. The first generation is a rough draft. The feedback loop is where the real performance comes from.\nIf you\u0026rsquo;re integrating agents into your own workflows and you\u0026rsquo;re doing single-shot generation without a measurement and refinement cycle, you\u0026rsquo;re leaving most of the value on the table.\nNow for the cold water # GenDB is a research prototype with serious constraints. The benchmarks ran with the entire database cached in memory. It only handles analytical (OLAP) queries. Generating code for a single query takes minutes and costs real money (about $14 for a five-query benchmark run). None of this is production-ready, and the authors know it.\nBut the constraints point at tractable engineering problems (cost reduction, latency, broader query support), not fundamental dead ends. The interesting question isn\u0026rsquo;t whether GenDB ships as a product. It\u0026rsquo;s whether the principle underneath it holds.\nThe bigger pattern # Forget databases for a moment. 
The underlying principle is this: anywhere you have a general-purpose system making runtime compromises because it has to handle every possible input, an agent could potentially generate a specialized version that only handles the input in front of it.\nCompilers do this to a degree (profile-guided optimization, JIT compilation). But those techniques operate within the rigid constraints of a compiler\u0026rsquo;s built-in optimization passes. An LLM-based system can apply optimizations that nobody bothered to implement because they\u0026rsquo;re too narrow, too situation-specific, too weird to justify as a general feature.\nThe question for practitioners isn\u0026rsquo;t \u0026ldquo;will LLMs replace my database.\u0026rdquo; Not any time soon, and probably not ever in the way this paper frames it. The question is: where in your stack are you paying the generality tax, and could a \u0026ldquo;synthesize the specific thing\u0026rdquo; approach work there?\nSome candidates worth thinking about: ETL pipelines that transform data between known schemas. Serialization layers where the formats are fixed but performance matters. Configuration-heavy middleware where most of the configuration space is never used. Build systems that make conservative choices because they can\u0026rsquo;t assume anything about the project.\nThe role that stays: correctness, not performance # Here\u0026rsquo;s where the \u0026ldquo;synthesized, not engineered\u0026rdquo; framing needs a caveat that the paper itself makes obvious if you look closely. GenDB validates its results by comparing them against a traditional database. The agent writes the fast code; a traditional database confirms it got the right answer.\nThat sounds like you need both, and you do. But notice what shifted. The traditional system isn\u0026rsquo;t carrying the performance burden anymore. It doesn\u0026rsquo;t need to be fast. It needs to be correct. 
That\u0026rsquo;s a fundamentally different design target, and a much cheaper one to hit. You can use an off-the-shelf database in its default configuration, no tuning, no optimization work, no cache-line engineering. It just has to return the right rows.\nZoom out from databases and the \u0026ldquo;ground truth\u0026rdquo; gets even lighter. It might be a test suite. A contract. A set of known-good outputs. A reference implementation that\u0026rsquo;s simple and slow but obviously correct. The thing you validate against doesn\u0026rsquo;t need to be a production-grade system. It just needs to be trustworthy.\nSo no, you\u0026rsquo;re not paying the generality tax twice. You\u0026rsquo;re splitting what used to be one job (be correct AND be fast) into two, and handing them to systems that are each much better suited to their half. The general-purpose system gets simpler because it sheds the performance requirement. The synthesized system gets faster because it sheds the generality requirement. Both benefit.\nThis reframing matters if you\u0026rsquo;re thinking about where to apply this pattern in your own stack. You don\u0026rsquo;t need to build a full general-purpose system AND a synthesized one. You need correctness infrastructure (tests, contracts, reference implementations, oracles) and you need an agent that can generate specialized code that passes those checks. The first part is often something you should already have. The second part is what\u0026rsquo;s newly possible.\nWhat to do with this # If you\u0026rsquo;re building systems today, the practical takeaway isn\u0026rsquo;t to go replace your database. It\u0026rsquo;s three things:\nFirst, start noticing the generality tax in your own stack. Every abstraction layer, every plugin system, every \u0026ldquo;supports any format\u0026rdquo; interface carries overhead for flexibility you may never use. That overhead used to be a permanent cost of doing business. 
It might not be, for much longer.\nSecond, invest in the feedback loop, not the first generation. GenDB\u0026rsquo;s 163x improvement didn\u0026rsquo;t come from a better initial prompt. It came from running the code and feeding runtime data back in. Whatever you\u0026rsquo;re using agents for, build the instrumentation that lets them measure their own output and iterate.\nThird, invest in correctness infrastructure. Not performance-optimized general-purpose systems, but the things that let you know an answer is right: test suites, reference implementations, contracts, assertions. If the synthesized approach pans out, the bottleneck won\u0026rsquo;t be \u0026ldquo;can the agent generate fast code.\u0026rdquo; It will be \u0026ldquo;can you verify that the fast code is correct.\u0026rdquo; The teams that have strong correctness tooling will be the ones that can actually trust agent-generated systems. The ones that don\u0026rsquo;t will be flying blind.\n","date":"3 March 2026","externalUrl":null,"permalink":"/posts/2026-03-03-synthesized-not-engineered/","section":"Posts","summary":"There’s a paper out of Cornell this week that should make you uncomfortable if you build general-purpose software systems for a living.\nGenDB takes a simple, almost reckless-sounding premise: what if you replaced your database’s query execution engine with an agentic system that writes fresh, custom C++ code for every single query? No fixed operator set. No general-purpose execution model. Just an LLM that looks at your query, your data, and your hardware, then synthesizes exactly the program needed to answer it.\n","title":"Synthesized, Not Engineered","type":"posts"},{"content":"You shouldn\u0026rsquo;t read the \u0026ldquo;chess engine in pure TeX\u0026rdquo; stunt as a party trick. You should read it as a warning shot. 
Coding agents are now good enough at systems thinking under hostile constraints that your bottleneck is shifting from \u0026ldquo;can the agent write code\u0026rdquo; to \u0026ldquo;can you give it guardrails, tests, and observability before it invents a tiny virtual machine inside your build.\u0026rdquo;\nMathieu Acher\u0026rsquo;s write-up tells the whole story.\nThe compelling part isn\u0026rsquo;t that Claude Code produced ~2,100 lines of TeX macros that play ~1280 Elo chess. It\u0026rsquo;s that in a language with no arrays, no return values, no stack frames, and painful recursion limits, the agent converged on a coherent architecture: a register machine built on top of TeX\u0026rsquo;s macro expansion.\nWhen you drop an agent into an environment missing basic affordances, it doesn\u0026rsquo;t just \u0026ldquo;work harder.\u0026rdquo; It changes the substrate. In TeXCCChess, \\count registers become RAM (board in 200–263, stack in 10000+), \\csname tables become ROM, token lists become buffers, and macros like \\pushstate/\\popstate become an instruction set. pdflatex becomes the CPU. This is what strong engineers do in constrained runtimes, and the agent did it without an up-front architecture doc.\nIf your reaction is \u0026ldquo;sure, but that\u0026rsquo;s TeX,\u0026rdquo; you\u0026rsquo;re missing the more general point: agents will synthesize bespoke infrastructure whenever the local environment makes the direct approach awkward. Sometimes that\u0026rsquo;s a huge win (you get capability you didn\u0026rsquo;t have). Sometimes it\u0026rsquo;s technical debt that hides in plain sight because it\u0026rsquo;s \u0026ldquo;just glue code.\u0026rdquo;\nThe best evidence is in the failure modes. TeX\u0026rsquo;s \\numexpr division rounds instead of truncating, which caused a nasty bug. The \u0026ldquo;fix\u0026rdquo; wasn\u0026rsquo;t a small patch; the agent precomputed file/rank lookup tables at load time using \\divide and stored them in \\csname expansions. 
That\u0026rsquo;s a real engineering move: shift cost from runtime to initialization; replace tricky arithmetic with table lookups; standardize access patterns.\nSame story with state management. Chess engines live or die on make/unmake correctness. TeX has grouping, but the engine is global-state heavy, so the agent built a manual state stack in high-numbered count registers with a strict calling convention: \\pushstate after \\makemove but before \\updategamestate; inverse ordering on unwind. The bugs were subtle and episodic (castling rights, en passant corruption every ~50 games). That\u0026rsquo;s not \u0026ldquo;lol LLM mistake.\u0026rdquo; That\u0026rsquo;s the class of bug you get when you create a custom stack discipline in any language.\nWhat does this mean if you build with agents?\nFirst: assume the agent may invent an internal VM, a protocol shim, or a data model translation layer if the environment is inconvenient. Demand an explicit architecture sketch early, even if you didn\u0026rsquo;t provide one. Not for documentation purity, but so you can spot when it\u0026rsquo;s building a second system.\nSecond: prioritize invariant tests over feature checklists. TeXCCChess worked in part because the agent could iterate and debug across multiple sessions, but the scariest issues are state corruption bugs that only show up intermittently. In your world that\u0026rsquo;s \u0026ldquo;cache invalidation,\u0026rdquo; \u0026ldquo;idempotency,\u0026rdquo; \u0026ldquo;retry safety,\u0026rdquo; \u0026ldquo;permission leakage.\u0026rdquo; Put those invariants into tests and make the agent run them constantly.\nThird: watch for \u0026ldquo;precompute tables to avoid runtime complexity\u0026rdquo; moves. They\u0026rsquo;re often correct, but they can also hard-code assumptions and make later changes expensive. Make the agent justify the table, its generation, and its coverage.\nStop evaluating coding agents by the weirdness of the demo language. 
Evaluate them by whether their emergent architecture is legible, testable, and constrained. TeXCCChess is impressive; it\u0026rsquo;s also a reminder that agents don\u0026rsquo;t just write code. They design the machine they wish they had, right inside yours.\n","date":"28 February 2026","externalUrl":null,"permalink":"/posts/2026-02-28-coding-agents-wrote-a-chess-engine-in-pure-tex/","section":"Posts","summary":"You shouldn’t read the “chess engine in pure TeX” stunt as a party trick. You should read it as a warning shot. Coding agents are now good enough at systems thinking under hostile constraints that your bottleneck is shifting from “can the agent write code” to “can you give it guardrails, tests, and observability before it invents a tiny virtual machine inside your build.”\nMathieu Acher’s write-up tells the whole story.\n","title":"A TeX Chess Engine Isn't a Trick; It's What Agents Do Under Constraint","type":"posts"},{"content":"Agent Skills. Claude Code. Cursor. MCP. UTCP. You name it, we\u0026rsquo;ll explore it.\nTime to deep dive on the coding agent ecosystem, and the shift this is bringing to software engineering.\nComing soon to an agent near you.\n","date":"22 February 2026","externalUrl":null,"permalink":"/posts/coming-soon/","section":"Posts","summary":"Agent Skills. Claude Code. Cursor. MCP. UTCP. You name it, we’ll explore it.\nTime to deep dive on the coding agent ecosystem, and the shift this is bringing to software engineering.\nComing soon to an agent near you.\n","title":"Coming Soon","type":"posts"},{"content":"","externalUrl":null,"permalink":"/evergreen/","section":"Evergreens","summary":"","title":"Evergreens","type":"evergreen"}]