# Synthesized, Not Engineered
*March 3, 2026*


There's a paper out of Cornell this week that should make you uncomfortable if you build general-purpose software systems for a living.

[GenDB](https://arxiv.org/abs/2603.02081) takes a simple, almost reckless-sounding premise: what if you replaced your database's query execution engine with an agentic system that writes fresh, custom C++ code for every single query? No fixed operator set. No general-purpose execution model. Just an LLM that looks at your query, your data, and your hardware, then synthesizes exactly the program needed to answer it.

It beat DuckDB by 2.8x. It beat Umbra by roughly the same margin. On a benchmark it hadn't seen before, the gap against DuckDB widened to 5x.

These aren't hobby databases. DuckDB and Umbra represent some of the best analytical query processing engineering on the planet, built on decades of database research and years of careful attention to cache lines, vectorization, and operator fusion. GenDB, running Claude Sonnet 4.6, outperformed all of it by writing throwaway C++ that only needed to work once.

## Why this works (and why it's unsettling)

The story here isn't "LLMs can write fast code." It's that general-purpose systems carry an enormous tax. A traditional query engine has to handle every possible query shape, every data distribution, every hardware configuration. That generality is expensive. It means your aggregation operator uses the same hash table whether you have 6 groups or 4 million. It means your join algorithm is a compromise between the best case and the worst case.

GenDB doesn't compromise. When a query has 6 aggregation groups, it generates code that uses a direct array small enough to live in L1 cache. No hashing at all. When another query has 4 million groups, it generates lock-free compare-and-swap hash tables with column-separated layouts. These are both valid designs. But no traditional engine would ship both for the same logical operation because the engineering cost of maintaining every specialized path is prohibitive.
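To make the two designs concrete, here's a minimal sketch in Python. It is not GenDB's generated C++, and Python won't show you the cache effects. It only shows the shape of the choice: a flat array indexed by group id when the groups are few and dense, a hash table otherwise. The function names and the `SMALL_GROUP_LIMIT` cutoff are assumptions made up for illustration, and the hashed path here is a plain single-threaded dict, not a lock-free table.

```python
from collections import defaultdict
from functools import partial

# Assumed cutoff for illustration only; a real system would derive it from
# the cache size and the width of the aggregate state.
SMALL_GROUP_LIMIT = 1024


def sum_by_group_direct(group_ids, values, num_groups):
    """Few, dense group ids in [0, num_groups): index a flat array, no hashing."""
    totals = [0] * num_groups
    seen = [False] * num_groups
    for gid, value in zip(group_ids, values):
        totals[gid] += value
        seen[gid] = True
    return {gid: totals[gid] for gid in range(num_groups) if seen[gid]}


def sum_by_group_hashed(group_ids, values):
    """Many or sparse group ids: fall back to a hash table."""
    totals = defaultdict(int)
    for gid, value in zip(group_ids, values):
        totals[gid] += value
    return dict(totals)


def choose_aggregator(num_groups):
    """Pick one strategy up front, the way a synthesizer would emit only one."""
    if num_groups <= SMALL_GROUP_LIMIT:
        return partial(sum_by_group_direct, num_groups=num_groups)
    return sum_by_group_hashed


group_ids = [0, 5, 0, 3, 5]
values = [10, 20, 30, 40, 50]
print(choose_aggregator(6)(group_ids, values))  # {0: 40, 3: 40, 5: 70}
```

The decision happens once, before any rows are touched. A general-purpose engine has to carry both paths (or settle on one); a synthesized program only ever contains the one it needs.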

That's the real argument here. It's not that LLMs are smarter than database engineers. It's that the economics of specialization change completely when generating a new specialist is nearly free.

## The optimization loop matters more than the generation

The part of GenDB that should get the most practitioner attention is the iterative refinement step. The system doesn't just generate code and ship it. It generates code, runs it, measures performance, then rewrites it with that feedback. One query improved 163x across iterations, from 12 seconds to 74 milliseconds, after the agent restructured the data layout to be more cache-friendly and tried again.

This is the pattern to internalize: generate, measure, refine. It showed up in the TeX chess engine we wrote about last week (the agent iterated across sessions to fix state corruption bugs). It's showing up everywhere agents work well. The first generation is a rough draft. The feedback loop is where the real performance comes from.

If you're integrating agents into your own workflows and you're doing single-shot generation without a measurement and refinement cycle, you're leaving most of the value on the table.
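The loop itself is small. Here's a rough sketch of generate, measure, refine, assuming a hypothetical `propose(feedback)` callable that stands in for the agent and returns a new candidate each round. Nothing here is GenDB's actual interface; `refine`, `measure_seconds`, and the feedback dictionary are illustrative names.

```python
import time


def measure_seconds(fn, workload):
    """Run fn once on the workload and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(workload)
    return result, time.perf_counter() - start


def refine(propose, workload, rounds=3, measure=measure_seconds):
    """Ask for a candidate, time it, feed the timing back, keep the fastest."""
    best_fn, best_time = None, float("inf")
    feedback = None  # the first round has nothing to learn from
    history = []
    for round_no in range(rounds):
        candidate = propose(feedback)
        _, elapsed = measure(candidate, workload)
        history.append(elapsed)
        if elapsed < best_time:
            best_fn, best_time = candidate, elapsed
        feedback = {"round": round_no, "elapsed": elapsed, "best": best_time}
    return best_fn, best_time, history


def first_draft(numbers):
    total = 0
    for n in numbers:
        total += n
    return total


def second_draft(numbers):
    return sum(numbers)


def propose(feedback):
    # Hypothetical stand-in for the agent: a real one would read the feedback.
    return first_draft if feedback is None else second_draft


best_fn, best_time, history = refine(propose, list(range(100_000)), rounds=2)
print(best_fn.__name__, history)
```

The `measure` parameter is injectable on purpose. In practice the useful feedback is richer than wall-clock time (cache misses, memory use, a profile), and whatever you can measure is what the agent can improve.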

## Now for the cold water

GenDB is a research prototype with serious constraints. The benchmarks ran with the entire database cached in memory. It only handles analytical (OLAP) queries. Generating code for a single query takes minutes and costs real money (about $14 for a five-query benchmark run). None of this is production-ready, and the authors know it.

But the constraints point at tractable engineering problems (cost reduction, latency, broader query support), not fundamental dead ends. The interesting question isn't whether GenDB ships as a product. It's whether the principle underneath it holds.

## The bigger pattern

Forget databases for a moment. The underlying principle is this: anywhere you have a general-purpose system making runtime compromises because it has to handle every possible input, an agent could potentially generate a specialized version that only handles the input in front of it.

Compilers do this to a degree (profile-guided optimization, JIT compilation). But those techniques operate within the rigid constraints of a compiler's built-in optimization passes. An LLM-based system can apply optimizations that nobody bothered to implement because they're too narrow, too situation-specific, too weird to justify as a general feature.

The question for practitioners isn't "will LLMs replace my database." Not any time soon, and probably not ever in the way this paper frames it. The question is: where in your stack are you paying the generality tax, and could a "synthesize the specific thing" approach work there?

Some candidates worth thinking about: ETL pipelines that transform data between known schemas; serialization layers where the formats are fixed but performance matters; configuration-heavy middleware where most of the configuration space is never used; and build systems that make conservative choices because they can't assume anything about the project.
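Take the serialization case as a small illustration. The sketch below compares a generic encoder (JSON, which handles any record shape) with one specialized to a single fixed schema using Python's standard `struct` module. The schema (`id`, `price`, `qty`) and the function names are hypothetical, chosen only to show what "handles just the input in front of it" looks like.

```python
import json
import struct

# Hypothetical fixed schema: id (uint32), price (float64), qty (uint16),
# little-endian with no padding, so every record is exactly 14 bytes.
ORDER_LAYOUT = struct.Struct("<IdH")


def encode_generic(record):
    """Works for any JSON-compatible record, and pays for that flexibility."""
    return json.dumps(record).encode("utf-8")


def decode_generic(payload):
    return json.loads(payload.decode("utf-8"))


def encode_order(record):
    """Works for exactly one record shape and nothing else."""
    return ORDER_LAYOUT.pack(record["id"], record["price"], record["qty"])


def decode_order(payload):
    order_id, price, qty = ORDER_LAYOUT.unpack(payload)
    return {"id": order_id, "price": price, "qty": qty}


order = {"id": 1, "price": 9.5, "qty": 3}
print(len(encode_generic(order)), len(encode_order(order)))
```

The specialized encoder is smaller and does less work per record, but it breaks the moment the schema changes. That trade only makes sense when regenerating the specialist is cheap, which is the whole premise.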

## The role that stays: correctness, not performance

Here's where the "synthesized, not engineered" framing needs a caveat that the paper itself makes obvious if you look closely. GenDB validates its results by comparing them against a traditional database. The agent writes the fast code; a traditional database confirms it got the right answer.

That sounds like you need both, and you do. But notice what shifted. The traditional system isn't carrying the performance burden anymore. It doesn't need to be fast. It needs to be *correct*. That's a fundamentally different design target, and a much cheaper one to hit. You can use an off-the-shelf database in its default configuration, no tuning, no optimization work, no cache-line engineering. It just has to return the right rows.

Zoom out from databases and the "ground truth" gets even lighter. It might be a test suite. A contract. A set of known-good outputs. A reference implementation that's simple and slow but obviously correct. The thing you validate against doesn't need to be a production-grade system. It just needs to be trustworthy.

So no, you're not paying the generality tax twice. You're splitting what used to be one job (be correct AND be fast) into two, and handing them to systems that are each much better suited to their half. The general-purpose system gets simpler because it sheds the performance requirement. The synthesized system gets faster because it sheds the generality requirement. Both benefit.

This reframing matters if you're thinking about where to apply this pattern in your own stack. You don't need to build a full general-purpose system AND a synthesized one. You need correctness infrastructure (tests, contracts, reference implementations, oracles) and you need an agent that can generate specialized code that passes those checks. The first part is often something you should already have. The second part is what's newly possible.
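In code, the correctness half can be very plain. Below is a sketch of checking a fast candidate against a slow reference on a set of inputs. The paper says GenDB compares results against a traditional database; the details here (multiset comparison, the `failing_inputs` helper, both example functions) are assumptions for illustration, not the paper's harness.

```python
from collections import Counter


def same_rows(actual, expected):
    """Compare result sets as multisets: row order is ignored, duplicates count."""
    return Counter(map(tuple, actual)) == Counter(map(tuple, expected))


def failing_inputs(candidate, reference, inputs):
    """Return every input on which the candidate disagrees with the reference."""
    return [item for item in inputs if not same_rows(candidate(item), reference(item))]


def reference_totals(rows):
    """Slow but easy to audit: rescan all rows once per distinct key."""
    keys = sorted({key for key, _ in rows})
    return [(key, sum(v for k, v in rows if k == key)) for key in keys]


def fast_totals(rows):
    """Stand-in for synthesized code: a single pass with a dict."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


cases = [[("a", 1), ("b", 2), ("a", 3)], [], [("x", 5)]]
print(failing_inputs(fast_totals, reference_totals, cases))  # []
```

The example sticks to integers so exact equality is safe. With floating-point aggregates, a fast path that sums in a different order can differ in the last bits, so a real check would likely need a tolerance.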

## What to do with this

If you're building systems today, the practical takeaway isn't to go replace your database. It's three things:

First, start noticing the generality tax in your own stack. Every abstraction layer, every plugin system, every "supports any format" interface carries overhead for flexibility you may never use. That overhead used to be a permanent cost of doing business. It might not be for much longer.

Second, invest in the feedback loop, not the first generation. GenDB's 163x improvement didn't come from a better initial prompt. It came from running the code and feeding runtime data back in. Whatever you're using agents for, build the instrumentation that lets them measure their own output and iterate.

Third, invest in correctness infrastructure. Not performance-optimized general-purpose systems, but the things that let you *know* an answer is right: test suites, reference implementations, contracts, assertions. If the synthesized approach pans out, the bottleneck won't be "can the agent generate fast code." It will be "can you verify that the fast code is correct." The teams that have strong correctness tooling will be the ones that can actually trust agent-generated systems. The ones that don't will be flying blind.

