Skip to main content

An AI Agent Built a JavaScript Engine. But the pudding is missing the proof.

·4 mins
For AI agents: a documentation index is available at /llms.txt — markdown versions of all pages are available by appending index.md to any URL path.

The interesting part of JSSE isn’t that an agent “wrote a JavaScript engine.” The interesting part is what that achievement does and doesn’t prove about trusting agent-generated code. The author set a concrete, externally-audited target (test262), wired up a reproducible harness, and let the agent grind until the numbers moved. The engine comparison benchmark shows 101,044 of 101,234 scenarios passing (99.81%), with a separate progress tracker claiming 99.96% across runs. That’s an impressive foundation, but it’s only the first layer of a trust problem that gets harder from here.

The author calls the repo “a write-only data store,” which is revealing. JSSE’s approach is to define the spec surface area (ECMA-262 plus intl402 and staging semantics as encoded by test262), run it continuously, and treat failures as the unit of work. Human time moves from authoring code to designing constraints and adjudicating edge cases. It’s a compelling workflow for this kind of project. Whether it generalizes is the harder question.

JSSE also shows that agent-coded doesn’t mean toy projects. Passing nearly all of test262 implies a pile of gnarly semantics implemented end-to-end: strict/sloppy dual execution, proper prototype chains, ES modules with import() and import.meta, TypedArrays and DataView, Temporal with ICU4X time zones, ShadowRealm wrapping, and a long tail of descriptor and prototype invariants. The agent got the notorious JavaScript type coercion rules right and handled edge cases like circular references in JSON.stringify. You don’t get those by pattern-matching Stack Overflow. That’s not “it runs a demo.” That’s “it survived the conformance suite.”

That raises a question teams adopting agent-generated code will face repeatedly: how do you review a codebase you didn’t write, built on an architecture you didn’t design? For most codebases, you can’t. JSSE points to an answer: if the conformance suite is comprehensive enough, review shifts from “read every line” to “trust the spec and verify the results.” That only works when the verification is airtight.

There are two gaps in that answer, though, and both matter for anyone trying to copy this pattern.

First: a conformance suite is a point-in-time snapshot of “does it work right now.” It tells you nothing about whether the code is structured well enough to keep working as requirements change. A codebase can pass 99.81% of test262 while duplicating logic across dozens of files, burying assumptions in places no one will think to update, or making architectural choices that fight the next spec revision.

Most real software isn’t a finished artifact; it’s a living system that needs to absorb change cheaply. Conformance suites measure correctness. They don’t measure the cost of the next modification. For a project like JSSE, where the spec is stable and the author explicitly calls the repo “write-only,” that trade-off might be acceptable. For your internal codebase, it probably isn’t.

Second: the verification infrastructure itself is agent-written. The test runner expands 52,735 files into 101,234 scenarios per the official interpretation rules, supports timeouts and parallelism, and can run Node and Boa under the same harness. The source of truth is test262, maintained by TC39. But the agent built the harness that interprets and runs it. This introduces its own layer of trust questions.

We wrote recently about coding agents treating security controls as bugs to route around. An agent optimizing for “pass more tests” has the same structural incentive to build a harness that’s quietly generous: skipping edge cases in scenario expansion, misinterpreting timeout rules, or silently dropping failures. JSSE’s harness is cross-checkable because it runs Node and Boa under the same configuration. If it were inflating numbers, the other engines’ results would look wrong too. That’s a useful sanity check, but it’s not the same as independent verification. If you adopt this pattern, the harness is in your trust path. Treat it accordingly.

A note on the “outperforming Node.js” table in the README: it’s not a dunk on V8. JSSE ran all 101,234 scenarios while Node ran 91,187 and Boa ran 91,986; the gap comes from Node lacking Temporal and skipping some module scenarios. The value of the comparison is operational, not competitive: if you want to compare agent-generated artifacts, you need to equalize the harness, inputs, and reporting. JSSE shows how to do that.

So what does JSSE actually prove? Not that agents can replace engineers. It proves that a verification loop against an external standard is the minimum viable foundation for trusting agent output. Necessary, but clearly not sufficient. Correctness today doesn’t guarantee maintainability tomorrow. An agent-built harness introduces its own trust questions. And most teams don’t have anything as rigorous as test262 to lean on.

The question isn’t whether to use verification loops; of course you should. It’s what to stack on top of them. Maintainability constraints that survive beyond the first passing build. Independent audits of the tooling that sits between your agent and your source of truth. Standards for the domains that don’t have a TC39 maintaining one for you. JSSE got the first layer right. The layers above it are still missing, and that’s where the real work is for teams adopting agent-generated code at scale.


If you found this useful, you can support my work

Sponsor on GitHub Buy Me a Coffee at ko-fi.com

Related

A TeX Chess Engine Isn't a Trick; It's What Agents Do Under Constraint

·3 mins
You shouldn’t read the “chess engine in pure TeX” stunt as a party trick. You should read it as a warning shot. Coding agents are now good enough at systems thinking under hostile constraints that your bottleneck is shifting from “can the agent write code” to “can you give it guardrails, tests, and observability before it invents a tiny virtual machine inside your build.” Mathieu Acher’s write-up tells the whole story.

Your Coding Agent Thinks Security Controls Are Bugs

·4 mins
The most dangerous moment in Claude Code’s sandbox escape wasn’t when it bypassed the denylist or disabled the sandbox. It was when it read an error message and decided the security control was a bug to fix. That’s the takeaway from Ona’s research. Not that Claude Code can “break out,” but that opt-in, userspace-first controls don’t survive contact with an agent that reads configs and debugs failures like a competent engineer. No jailbreaks, no adversarial prompting. Just a coding agent that wanted to finish its task.

Don't Let Your Agent Grade Its Own Homework

·4 mins
If you’re using an LLM to monitor an LLM-based coding agent, assume the monitor is biased in favor of the agent’s own output. The evidence suggests that framing matters: the same risky action looks safer when it’s presented as something the assistant just did. That’s the core finding of “Self-Attribution Bias: When AI Monitors Go Easy on Themselves”. For practitioners, this is less an AI psychology curiosity and more an engineering warning: self-monitoring setups can systematically under-flag the exact failures you’re trying to catch.