Computing, Math and Beauty

You understand by forgetting, while your agent can't

03 Jun 2026

Since December I have been drowning in claims. People who say they have not written a single line of code by hand in months. That the human reviewer is now the bottleneck. That you point one agent at the code another agent wrote and let them sort it out. That engineering has moved off the project and onto the harness. You are no longer writing software, you are tuning the thing that writes software.

So I tried it properly. And I was underwhelmed by the code I got back. It got worse the more I asked for. Small, well-scoped changes were fine (duh, Copilot did that even two years ago). But as the scope grew, the quality fell off a cliff and no amount of time spent in plan-mode bought it back. I would spend an hour specifying intent and still get something that worked by accident rather than by design. In the end I was writing the code in English, telling it almost line by line what I wanted, which is just typing with extra steps. When I gave up on quality and chased working software instead, I lost hours to ping-pong bugs, the kind where each fix uncovers the one it was hiding. The code was a one-way street: code written by an agent has to be fixed by arguing with it, or pleading with it, or threatening to kill an innocent cat if it doesn't fix the bug.

The AI advocates (including Dario) just brush it off as a skill issue. Maybe they are right. I studied how the people who make it look easy actually work, and I thought harder about why it keeps failing for me. I came out the other side with a thesis: this is an architectural limit of LLMs, not a gap in my prompting. Maybe that is just a more sophisticated cope. Let me make the case and you can decide.

Working code is the wrong metric.

Watch an agent write Rust. Faced with a return type that needs a little thought, it reaches for Box<dyn ...> instead of propagating the concrete type. Faced with a reference, it clones, or wraps the value in Rc, or boxes it, anything rather than carry a borrow. Add threads and you get a Mutex, or an RwLock if you are lucky, though it will grab the Mutex first every time unless you ask for the other. Every one of these compiles. Every one of these runs. It works, and working is the only thing it was ever chasing.
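To make the contrast concrete, here is a minimal sketch of the two styles side by side. The function names `evens_boxed` and `evens_borrowed` are made up for illustration, and the task (filtering even numbers) is deliberately trivial. Both compile and both return the same values, which is the point.

```rust
// The "works anywhere" version: the input is cloned so no borrow has to be
// carried through the signature, and the iterator type is erased behind a
// heap allocation and dynamic dispatch.
pub fn evens_boxed(values: &Vec<i32>) -> Box<dyn Iterator<Item = i32>> {
    let owned = values.clone();
    Box::new(owned.into_iter().filter(|v| v % 2 == 0))
}

// The problem-specific version: borrow the slice and let the compiler
// propagate the concrete iterator type through `impl Trait`. The `'_`
// ties the returned iterator to the lifetime of the borrowed input.
pub fn evens_borrowed(values: &[i32]) -> impl Iterator<Item = i32> + '_ {
    values.iter().copied().filter(|v| v % 2 == 0)
}
```

Calling either one on `vec![1, 2, 3, 4, 5, 6]` and collecting gives `[2, 4, 6]`. A test suite cannot tell them apart. Only the second signature says anything about who owns the data and how long the result lives.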

Oh, but you chose the most complicated language in mainstream use, you say, and the borrow checker needs super-deep thinking mode to work. Fine, write React instead. Hand the agent a file as context and ask for a new component, and 100+ lines land directly inside that file, in the middle of the existing component: no new module, no boundary, no thought about where this thing should go.

There is no longing for delayed gratification in any of it. No willingness to do the harder analysis now so the code is better later. In Rust that analysis is the borrow check, the lifetime reasoning. In React it is finding the seam where the component should be cut. Either way the agent skips it and reaches for the move that makes the problem stop today. Like Mr. Meeseeks, it would rather escape the pain of existing.

The code is correct. It passes. By the only metric most people apply to generated code, does it work, it has won. But notice what these moves have in common. Box<dyn ...>, clone, Arc<Mutex<_>> work in almost any Rust program. Drop the component into any file and it works every time. They commit to nothing about the problem in front of them. The version with the right types, or the right module boundary, is specific to this problem and no other. This is not about whether I like it, and it is not just about aesthetics. Working is the wrong metric because it cannot tell apart code that solved your problem from code that solved every problem badly.

Occam's Razor is a theorem and not a heuristic.

Let me take a short digression, because everything after this leans on it.

The Kolmogorov complexity of something is the length of the shortest program that produces it. You can already see the cheat: define the answer as a single built-in call and your program is one line. So we shut that door by including the interpreter and its standard library as part of the machine itself (the recursion stops once you fix a universal Turing machine as the base, there are no turtles all the way down). Whatever you rely on has to be paid for somewhere. This is mildly imprecise. The honest version says that switching to a different universal machine changes the complexity by at most a fixed additive constant, the same constant for every object. But the intuition is exactly right and the constant does not matter here.
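For readers who want the precise statement behind "up to a fixed constant", this is the standard invariance theorem. The notation is my own choice here: $K_U(x)$ is the complexity of $x$ relative to a universal machine $U$, $|p|$ is the length of program $p$ in bits, and $V$ is any other universal machine.

```latex
\[
K_U(x) \;=\; \min\{\, |p| \;:\; U(p) = x \,\},
\qquad
K_U(x) \;\le\; K_V(x) + c_{U,V} \quad \text{for all } x .
\]
```

The constant $c_{U,V}$ depends only on the two machines (roughly, the length of an interpreter for $V$ written for $U$), never on $x$.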

Now take the sequence 2, 4, 6, 8 and ask for the next term. Nothing forbids the answer being -2, or 100, or a million. For any of those I can write a program that reproduces 2, 4, 6, 8 and then does whatever I please. But the shortest such program is the one computing 2n, and it says 10, obviously. The shorter the program, the more of the data it explains by structure (ML people might call it generalization) instead of by special-casing, and structure is the only thing that carries past the data you have already seen.
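A small sketch of the two kinds of program, with made-up names `short_program` and `long_program`. Source length is only a crude stand-in for description length, but the shape of the argument survives: both reproduce the four observed terms, and only one of them has a reason for what it says next.

```rust
// Structural hypothesis: one rule covers every term, seen or unseen.
pub fn short_program(n: u32) -> i64 {
    2 * i64::from(n)
}

// Special-casing hypothesis: memorizes the observed terms, then does
// whatever it pleases (here, -2) on anything it has not seen.
pub fn long_program(n: u32) -> i64 {
    match n {
        1 => 2,
        2 => 4,
        3 => 6,
        4 => 8,
        _ => -2,
    }
}
```

On n = 1 through 4 the two agree exactly. At n = 5 the first says 10 and the second says -2, and every extra memorized case makes the second program longer without making it any more right.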

That is Occam's razor falling out of pure mathematics. It is no longer just an aesthetic heuristic or a Zen koan that never gets concrete. Under the universal prior (defined over our universal Turing machine) the probability of a hypothesis is roughly 2 to the minus its length, so every extra bit halves its plausibility (just like information: Kolmogorov complexity is Shannon information for a single object instead of a random source). Shorter hypotheses are not just cute, they are exponentially more likely. Occam's razor, which says the simplest hypothesis is the most likely explanation, is a theorem.
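Written out, with $\ell(h)$ as my notation for the length in bits of the program encoding hypothesis $h$, the prior and the ratio between two hypotheses look like this:

```latex
\[
P(h) \;\propto\; 2^{-\ell(h)},
\qquad
\frac{P(h_1)}{P(h_2)} \;=\; 2^{\,\ell(h_2) - \ell(h_1)} .
\]
```

So a hypothesis that needs 10 more bits than its rival starts out $2^{10} = 1024$ times less plausible, before any data is looked at.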

Prior compression and context

I believe compression is intelligence. Finding the short description of a thing is the same act as understanding it, by everything in the last section. The catch is that the true minimum, the Kolmogorov optimum, is uncomputable. You can never reach it. So intelligence in practice is not finding the shortest program, it is finding a short enough one, a good approximation to a quantity you cannot compute.

This is exactly what a foundation model does, and it does it astonishingly well. Training is compression. Terabytes of text are squeezed into a few hundred billion weights, far smaller than the data, which is only possible by finding the regularities and compressing them out. The weights are a good approximate description of the training distribution. That is intelligence, and there is nothing fake about it.

Now the obvious objection. The model does not only carry its weights, it also reads the context, and the context plainly does something. In-context learning lets it pick up a pattern from your prompt and continue it, which is what foundation-model companies call few-shot learning. In fact there are papers describing how in-context learning could perform something like a few iterations of gradient descent on an in-context objective, so that the architecture runs something like learning at inference time without touching a single weight. So the context compresses too. All good, right?

This is where the plot thickens. Look at what that in-context descent actually is. The optimizer it runs was fixed at training time. The model learned how to learn and then froze it. The descent happens over activations that vanish the instant the forward pass ends. And to produce the next token it conditions on the raw window again, never on a compressed summary of it, because no such summary is ever formed. Nothing is compressed and kept. The abstraction the context forms is never the thing that persists. Only the tokens persist, not the abstract structure in them. In-context learning projects the priors onto the context. It does not build new structure from the context that survives.

When we fixed a base machine for Kolmogorov complexity, that machine was the prior, the primitives you may assume for free (sort of like a standard library or built-in primitives). A project has its own base machine. Not the universal Turing machine, not even the language, but the language plus the abstractions the project has built: its types, its modules, its names for things (one way or another, every project ends up with the Racket notion of building the language to fit the project). Relative to that machine, good code here is short, because it leans on what the project already established. The version that re-derives everything from scratch is long.

The model's base machine is the one frozen into its weights at pretraining. It is a good machine but it is compressed from most of the code ever written minus the project you are working on. Hence it is not your project's machine. The abstractions you introduced in your project are not primitives it can assume for free. They are, at best, a few tokens in its context that it reads and forgets, never absorbed into the machine it measures against.

Hence it does what it does. The model is not being lazy when it wraps a result in Box<dyn ...>, clones instead of borrowing, or reaches for Arc<Mutex<_>>. It is being faithful to Occam. That sounds like a contradiction of my earlier claim, but it stops being one once you see that it is faithful relative to its own machine. Those moves are the shortest description relative to that machine (the training data), because across all the Rust ever written they are the highest-probability tokens. It is compressing perfectly. It is just compressing against the average of every codebase instead of against yours. It is optimal with respect to its machine, but not with respect to your machine, and that shows up as ugly code. All I have done is make the ugliness concrete.

This is why no amount of prompting fixed it, and not even plan-mode worked. You cannot prompt in your machine (the project). The missing thing is the language we built, which the model would have to absorb as primitives, and it absorbs nothing while it works on your problem. Paste your abstractions into the context and they become tokens to project the old priors onto. They never become the machine.

And there is a second failure underneath, and it is not a contradiction of the last paragraph, it is a change of scale. Faithful to Occam was a claim about each token. The whole solution is a different object. Greedily taking the locally cheapest token does not give you the globally shortest program, the same way taking the best step at every fork rarely gives you the shortest path. The globally short move is the abstraction, and that move is locally expensive (in its machine). It pays tokens now to save them later. Delayed gratification again, and an autoregressive decoder is not trained for delayed anything. So even setting aside machine differences, it never reaches for the global minimum.
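The path analogy fits in a few lines. This is a toy graph I made up, not a model of a decoder: node 0 is the start, node 3 is the goal, and the edge from 0 to 2 plays the role of the abstraction that costs more up front. The names `EDGES`, `greedy_cost` and `best_cost` are hypothetical.

```rust
/// Edges of a tiny DAG as (from, to, cost). Node 3 is the goal.
pub static EDGES: [(usize, usize, u32); 4] =
    [(0, 1, 1), (0, 2, 4), (1, 3, 10), (2, 3, 1)];

/// Greedy: at every node take the cheapest outgoing edge.
/// Returns None if it walks into a dead end.
pub fn greedy_cost(start: usize, goal: usize) -> Option<u32> {
    let mut node = start;
    let mut total = 0;
    while node != goal {
        let &(_, next, cost) = EDGES
            .iter()
            .filter(|(from, _, _)| *from == node)
            .min_by_key(|(_, _, cost)| *cost)?;
        total += cost;
        node = next;
    }
    Some(total)
}

/// Exhaustive: minimum total cost over all paths (fine for a DAG this small).
pub fn best_cost(node: usize, goal: usize) -> Option<u32> {
    if node == goal {
        return Some(0);
    }
    EDGES
        .iter()
        .filter(|(from, _, _)| *from == node)
        .filter_map(|&(_, next, cost)| best_cost(next, goal).map(|rest| cost + rest))
        .min()
}
```

`greedy_cost(0, 3)` takes the cheap first step and pays 1 + 10 = 11. `best_cost(0, 3)` pays 4 up front and finishes at 4 + 1 = 5. Every individual greedy choice was the cheapest one available.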

Lossy compression and abstraction

An engineer learns a codebase not by memorizing it but by building abstractions. And building an abstraction is an act of forgetting. He reads a function once, forgets the exact lines, and what survives is "this validates the input and writes it to the database." What is left is the abstraction. It is the same move as understanding (and therefore intelligence).

This ties back to my idea that intelligence is compression. It is compression too, but the lossy kind. A lossless copy keeps everything and decides nothing. Lossy forces a judgment about what matters. Same with Principal Component Analysis: you keep the components that carry most of the variance and throw away the ones that add little. An engineer's head is a lossy compressor pointed at the codebase, and the compression is the understanding. He never holds the whole repository, or even a single function in its exact form. He holds an abstract notion of what gets done and nothing more.

And the abstraction emerges by forgetting, not before it. You do not abstract first and discard second. Discarding the surface form is the same act as keeping the structure. The structure (and hence understanding) is what is left when the details are gone (probably why the devil is in the details, and we nuke the devils :D). This is why an engineer can find a bug in code he has never fully read. He reasons over the lossy model. The data is stale, so it is a cache or an invalidation path, so three files matter and the other three hundred do not. Forgetting is what prunes the search. You understand by forgetting.

Grothendieck said it about himself. He did not think he was cleverer than his peers, many were faster and sharper at the concrete problem in front of them. What he had was the patience to build the abstraction until the problem dissolved. His image for it was the rising sea. You do not break the hard nut with a hammer, you let the water rise until the shell softens and opens on its own. You strip a problem of its specifics (the unnecessary details, and unnecessary is the keyword) until only the structure remains, and at that abstraction level the answer is already there. One of the greatest mathematicians who ever lived attributed his power not to remembering everything, or to speed, but to knowing what to abstract away.

/compact is a smell

The agent does the exact opposite. Every token it ever read is still there, it is never abstracted, and maybe never understood. It did not forget a single line, and hence has no lossy model to reason over and no way to say these three hundred files do not matter. It can grep, which is search and not understanding, or it can pull more into context, which is the opposite of discarding and forgetting.

Then the context window fills (maybe humans and LLMs share this, we seize up too when there is too much in our head). It forgets everything at once, in a single blunt undirected pass, and calls it /compact. That is forgetting without knowing what to forget (it knows what is in the context, but not what matters relative to everything else), the precise inverse of the engineer. And notice that a human never needs it. He has been forgetting selectively the entire time, so there is nothing to compact. The surface form was never kept. The very existence of /compact screams that this is an architectural flaw.

To be fair to the architecture, there was a design that did forget. The RNN (and its variants, LSTM and GRU) squeezed all of history into a fixed hidden state, lossy by construction. But that bottleneck made it brutal to train: the gradients vanished and exploded across long sequences, and nothing parallelized across time. The transformer threw the bottleneck away, kept every token, and attended over all of them, which is exactly why it trains fast and scales the way it does. It earned its place. The catch is that the bottleneck was also the only thing that made the model forget. We removed the pain of training and decided forgetting was not good. It was the only thing worth keeping.
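A toy contrast between the two kinds of memory, with hypothetical types `FixedState` and `KeepAll`. Neither is a real RNN cell or a real attention layer, and the update rule is invented. The only thing the sketch shows is the shape of the storage: one stays the same size no matter how much it reads, the other grows with every token.

```rust
/// Fixed-size recurrent memory: every token is folded into the same four
/// numbers, so earlier detail is overwritten by construction.
pub struct FixedState {
    pub state: [f32; 4],
}

impl FixedState {
    pub fn new() -> Self {
        FixedState { state: [0.0; 4] }
    }

    /// Toy update rule (an assumption, not a trained cell):
    /// decay the old state, mix in the new token, squash with tanh.
    pub fn read(&mut self, token: f32) {
        for (i, s) in self.state.iter_mut().enumerate() {
            *s = (0.5 * *s + token / (i as f32 + 1.0)).tanh();
        }
    }
}

/// Keep-everything memory: each token is appended verbatim, so nothing is
/// forgotten and the memory grows with the input.
pub struct KeepAll {
    pub tokens: Vec<f32>,
}

impl KeepAll {
    pub fn new() -> Self {
        KeepAll { tokens: Vec::new() }
    }

    pub fn read(&mut self, token: f32) {
        self.tokens.push(token);
    }
}
```

Feed both a thousand tokens and `FixedState` still holds four numbers while `KeepAll` holds a thousand. The first has been forced to decide, however crudely, what to carry forward. The second has decided nothing.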

So what now?

Let me be clear about what I am not saying. I am not claiming anything special about the carbon brain. If anything I came into this humbled: Chollet's measure of intelligence and the ARC challenge are a very good refutation of the idea that human intelligence is universal. It is not. Ours is tuned to its own narrow niche (priors on the human problem space) and intelligence is a spectrum. There is nothing about silicon that bars it from writing better code than us. My claim is just that this is not it.

And we have pointed at the missing primitive before, years before the current models. DreamCoder is the one I keep coming back to. It runs a wake and sleep cycle. Awake it solves tasks by searching for a program that passes the cases. Asleep it refactors what it solved and grows a library of reusable abstractions, and it decides what to keep by compression: the pieces that shorten the whole corpus of solutions are the ones worth promoting to primitives. Because the search enumerates programs roughly shortest first, it finds the short program before the long one (Kolmogorov and Occam never leave us), except the machine it measures against is a library it built itself from the problems it solved in the wake cycle. That is exactly what the coding agents lack, the ability to build the language to the problem.
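The "keep what shortens the corpus" rule can be sketched as a crude description-length check. To be loud about it: this is not DreamCoder's algorithm, which refactors real programs and scores libraries far more carefully. Here a program is just a list of tokens, a candidate abstraction is a contiguous run of tokens, and all function names are made up.

```rust
/// Tokens needed to write `program` when `pattern` is available as a single
/// primitive: each non-overlapping occurrence of the pattern costs one token.
fn length_with_primitive(program: &[&str], pattern: &[&str]) -> usize {
    let mut cost = 0;
    let mut i = 0;
    while i < program.len() {
        if !pattern.is_empty() && program[i..].starts_with(pattern) {
            i += pattern.len(); // the whole run collapses into one call
        } else {
            i += 1;
        }
        cost += 1;
    }
    cost
}

/// Description length of the corpus with no new primitive.
pub fn corpus_length_plain(corpus: &[Vec<&str>]) -> usize {
    corpus.iter().map(|p| p.len()).sum()
}

/// Description length if `pattern` is promoted: pay for its definition
/// once, then for every program rewritten in terms of it.
pub fn corpus_length_with(corpus: &[Vec<&str>], pattern: &[&str]) -> usize {
    let rewritten: usize = corpus
        .iter()
        .map(|p| length_with_primitive(p, pattern))
        .sum();
    pattern.len() + rewritten
}

/// Promote the pattern only if the whole corpus gets shorter.
pub fn worth_promoting(corpus: &[Vec<&str>], pattern: &[&str]) -> bool {
    corpus_length_with(corpus, pattern) < corpus_length_plain(corpus)
}
```

Take three solved tasks, `[read, validate, write]`, `[read, validate, write, log]` and `[read, validate, write, notify]`. Plain, they cost 11 tokens. With `read, validate, write` promoted to a primitive they cost 3 for the definition plus 1 + 2 + 2 for the rewrites, so 8, and the abstraction earns its place. Promoting `log` alone costs 12, so it stays out. The library is the base machine, and it only grows by things that compress what has actually been solved.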

So why not build that into the coding agent and go home? Two problems. The first is cost: generate and validate does not scale, you cannot enumerate programs over a real codebase. The second is deeper. DreamCoder validates against formal test cases. The current enthusiasm for agentic coding comes precisely from not having to be formal: we specify what we want loosely and leave gaps wherever the details seem trivial. LLMs solve the intent part of the problem, which DreamCoder doesn't even try to touch.

DreamCoder builds an abstraction and keeps it, but it wants a formal oracle and it does not scale. The LLM reads informal English and scales beautifully, but it never builds or keeps an abstraction. The question is why not both? Neurosymbolic work and the external memory systems people are bolting onto agents are all reaching, in their own way, for the same missing piece. Until one of them lands it, coding agents will keep being optimal against every codebase except yours. The skill issue isn't yours.