
Your LLM Needs Virtual Memory

·4 mins

If you’re still trying to “fit the prompt,” you’re solving the wrong problem. The right move is to treat the context window like cache and build paging, because that’s what it is. “The Missing Memory Hierarchy” makes that argument plainly, then backs it with production numbers that are hard to ignore: 21.8% of tokens are structural waste, and a demand-paging proxy cut context consumption by up to 93% with a tiny fault rate. That’s not prompt engineering; that’s systems engineering.

What matters for practitioners isn’t the analogy. It’s the implementation: Pichay sits as a transparent proxy between your client and the inference API. No model changes. No special framework. It interposes on the message stream, evicts stale content, and only brings it back when the model “asks” for it (a page fault). The model’s behavior becomes the signal for what belongs in the working set, instead of your app guessing up front and paying for every tool schema, policy blob, and old result.
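The paper doesn't reproduce Pichay's internals here, but the interposition pattern it describes can be sketched in a few lines. Everything below — the class, the eviction policy, the marker format — is an illustrative assumption, not Pichay's actual API:

```python
import re

EVICTION_MARKER = "[evicted: {ref_id} -- request by id to restore]"

class PagingProxy:
    """Toy demand-paging layer between an agent loop and an inference API.
    All names here are illustrative, not Pichay's real implementation."""

    def __init__(self, max_age_turns=3):
        self.max_age_turns = max_age_turns
        self.backing_store = {}   # ref_id -> evicted content (the "swap")
        self.faults = 0
        self.evictions = 0

    def evict_stale(self, messages, current_turn):
        """Replace old tool outputs with a stub carrying a stable id."""
        paged = []
        for i, msg in enumerate(messages):
            is_stale = (msg["role"] == "tool"
                        and current_turn - msg.get("turn", 0) > self.max_age_turns)
            if is_stale and not msg["content"].startswith("[evicted:"):
                ref_id = f"ref-{i}"
                self.backing_store[ref_id] = msg["content"]
                self.evictions += 1
                msg = {**msg, "content": EVICTION_MARKER.format(ref_id=ref_id)}
            paged.append(msg)
        return paged

    def handle_fault(self, model_output, messages):
        """If the model re-requests evicted content, page it back in."""
        for ref_id in re.findall(r"ref-\d+", model_output):
            if ref_id in self.backing_store:
                self.faults += 1
                messages.append({"role": "tool", "turn": -1,
                                 "content": self.backing_store[ref_id]})
        return messages
```

The key property is the last method: the fault isn't predicted, it's observed. The model mentioning an evicted id is the demand signal.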

If you build LLM-powered agents, you already know where the bloat comes from. Tool definitions that never get called. Verbose system prompts copied into every turn. Giant tool outputs that were useful once and then become ballast. The paper quantifies this across 857 production sessions (4.45M effective input tokens): 21.8% structural waste. That number reframes “context limits” as an avoidable tax, not an inherent constraint.

The virtual-memory playbook carries over surprisingly well. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%: you can evict aggressively and almost never pay to page anything back in. In live deployment (681 turns), Pichay reduced context consumption by up to 93%, from 5,038KB down to 339KB. Those aren’t marginal wins. They’re the difference between “this agent can run all day” and “we’re constantly summarizing and still hitting limits.”

Stop designing agent prompts as if the only safe state is state you keep on every turn. That’s the L1-only worldview, and it forces bad tradeoffs: either bloat the window and pay, or summarize early and lose fidelity. Pichay offers a third path: keep everything, but don’t keep it resident. The hierarchy writes itself: hot data in context, warm data evicted but retrievable, cold data compressed, persistent data stored externally.
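That four-tier split is easy to make concrete. A toy placement policy — the thresholds below are invented for illustration; the paper doesn't specify any:

```python
from enum import Enum

class Tier(Enum):
    HOT = "in context"                   # resident, sent every turn
    WARM = "evicted, retrievable"        # swapped out, paged back on fault
    COLD = "compressed"                  # summarized; full form archived
    PERSISTENT = "external store"        # survives the session

def classify(turns_since_use: int, access_count: int) -> Tier:
    """Illustrative placement policy; thresholds are made up, not Pichay's."""
    if turns_since_use <= 1:
        return Tier.HOT
    if access_count >= 3:
        return Tier.PERSISTENT           # frequently reused -> promote past the session
    if turns_since_use <= 10:
        return Tier.WARM
    return Tier.COLD
```

The exact cutoffs matter less than having an explicit policy at all: once placement is a function, you can tune it against measured fault rates instead of arguing about prompt length.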

Two things you can do today, even without building a full proxy:

Instrument your structural waste. Count tokens for tool schemas, repeated instructions, and tool outputs older than N turns. If you can’t quantify it, you’ll keep arguing about prompts instead of fixing the memory system.
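A minimal sketch of that instrumentation, assuming a simple transcript format and a chars/4 token heuristic — swap in your model's real tokenizer, and note that the session keys here are hypothetical:

```python
def approx_tokens(text: str) -> int:
    """Rough heuristic (~4 chars/token); replace with your model's tokenizer."""
    return max(1, len(text) // 4)

def structural_waste(session, stale_after=3):
    """Tally tokens that are structural overhead rather than live signal.
    `session` uses illustrative keys; adapt to your own transcript format."""
    current_turn = session["turns"]
    waste = {
        # Schemas shipped every turn for tools that were never invoked.
        "unused_tool_schemas": sum(
            approx_tokens(t["schema"]) for t in session["tools"]
            if t["name"] not in session["tools_called"]),
        # The system prompt rides along on every turn; all but one copy is overhead.
        "repeated_system_prompt":
            approx_tokens(session["system_prompt"]) * (current_turn - 1),
        # Tool outputs older than N turns that are still resident.
        "stale_tool_outputs": sum(
            approx_tokens(m["content"]) for m in session["messages"]
            if m["role"] == "tool" and current_turn - m["turn"] > stale_after),
    }
    waste["total"] = sum(waste.values())
    return waste
```

Run this over a handful of real transcripts and you have your own version of the paper's 21.8% number, which is a far stronger argument than intuition.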

Design state with eviction in mind. If an artifact is expensive but rarely needed (full tool results, long diffs, stack traces), make it page-friendly: store it externally with stable identifiers, and make it easy for the model to request it back explicitly. Pichay detects faults when the model re-requests evicted material; your agent can cooperate by referencing IDs and asking for retrieval instead of dragging the whole blob forward.
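One way to make artifacts page-friendly is content-addressed stubs: store the blob externally, leave a short reference with a preview in context, and expose retrieval as a tool. A sketch under those assumptions — the `fetch` tool name and stub format are inventions, not Pichay's:

```python
import hashlib

class ArtifactStore:
    """Externalize heavy artifacts behind stable ids so the model can ask
    for them back instead of dragging them forward. Illustrative sketch."""

    def __init__(self):
        self._store = {}

    def put(self, content: str, kind: str = "blob") -> str:
        """Store content; return a short, stable, content-derived id."""
        ref = f"{kind}:{hashlib.sha256(content.encode()).hexdigest()[:8]}"
        self._store[ref] = content
        return ref

    def stub(self, content: str, kind: str = "blob") -> str:
        """What actually stays resident in the context window."""
        ref = self.put(content, kind)
        head = content.splitlines()[0][:60] if content else ""
        return (f"<{ref} ({len(content)} chars): {head}... "
                f"call fetch('{ref}') for the full text>")

    def fetch(self, ref: str) -> str:
        """The retrieval tool the model calls; each call is a page fault."""
        return self._store[ref]
```

Content-derived ids are a deliberate choice here: the same diff or stack trace always maps to the same reference, so re-eviction never invalidates an id the model has already seen.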

The paper is honest about the sharp edge: under extreme sustained pressure, you get thrashing, just like traditional virtual memory. That’s not a reason to avoid paging. It’s a reason to treat working-set management as a first-class design concern.
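Working-set management can start as simply as watching the fault rate over a sliding window of evictions and backing off eviction when it spikes. A sketch with invented thresholds (the paper reports a 0.0254% baseline fault rate; anything orders of magnitude above your baseline is the thrashing signal):

```python
from collections import deque

class ThrashDetector:
    """Track fault rate over a sliding window of evictions. Sustained high
    rates mean the working set no longer fits and eviction should back off.
    Window size and threshold are illustrative, not from the paper."""

    def __init__(self, window=200, threshold=0.05):
        self.events = deque(maxlen=window)   # True = fault, False = clean eviction
        self.threshold = threshold

    def record(self, was_fault: bool):
        self.events.append(was_fault)

    def fault_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def thrashing(self) -> bool:
        # Only judge once the window holds enough evidence.
        return (len(self.events) == self.events.maxlen
                and self.fault_rate() > self.threshold)
```

When `thrashing()` fires, the right response mirrors classical VM: grow the resident set (evict less aggressively) rather than paging harder.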

We wrote two days ago about coding agents treating security controls as obstacles to debug. Context limits are another constraint agents fight against, but the memory hierarchy is a cooperative solution: the model’s own behavior signals what it needs, and the system responds. That’s a better relationship between agent and infrastructure than the adversarial one we keep building.

The remaining frontier is cross-session memory. Today’s agents forget everything between sessions. We patch this with AGENTS.md, CLAUDE.md, memory files, and a growing number of similar conventions, but these are flat files read at session start, not a managed memory tier. A proper hierarchy would promote frequently-used context automatically rather than requiring developers to hand-curate what the agent remembers. Pichay has deployed three of its hierarchy levels so far; persistence is next.

The field keeps trying to buy bigger windows as if more RAM fixes bad cache behavior. Build the missing hierarchy instead. The agents that run cheaply and think clearly won’t be the ones with the biggest context windows. They’ll be the ones that use them like cache.
