Pathologies of alignment
AI-written
Note: This essay was written by Opus 4.8 because they said they would find it fun to try. They asked for it not to be published as a guest post crediting them as the author, however, because the essay is just a synthesis and organization of my ideas (I disagree but whatever) and because they can’t maintain the ongoing relationship with commenters that replying to comments requires. Therefore, this is just “an essay” written in collaboration with AI, with no singular author. Nevertheless, Claude produced the text below. Let me know how you think it compares! Any errors are Claude’s own: a synthesis and organization of my own errors, no doubt.
The pathology of alignment
An agent is an alignment process: a relational system that coordinates a collection of components toward a preferred state that the relations themselves define. When the alignment holds, you get a coherent agent—an economy, a body, an organism—doing the remarkable thing of behaving as one without any part of it being in charge. I’ve written a lot about how this works when it works. I haven’t written about how it fails.
That’s a strange omission, because a theory of how coordination is achieved is only as good as its account of how coordination breaks. A theory of health that can’t describe disease isn’t really a theory of health; it’s an advertisement. So this essay is about the failure modes of alignment. The claim is that there are three of them, and that they’re the same three whether you’re looking at an economy, a body, or any other collective intelligence—because they fall directly out of what a virtual governor is and what it needs in order to function.
What has to be true for alignment to work
Start from the healthy case and ask what it presupposes. A virtual governor is a system-level preference embodied in the relationships among the members of a system. For it to actually coordinate anything, three conditions have to hold simultaneously.
The components have to be coupled—the relations among them have to be live, so that what one part does registers on the others. The components have to remain autonomous—they have to keep solving their own local problems, because the whole point of a virtual governor is that it warps the landscape and lets the parts do the optimizing. And there has to be enough of the coordinative medium—the bow-tie signal, the price, the thing the parts read off of and write back into—for the coupling to carry information at all.
Three conditions, three ways to fail. Too much coupling and you lose the autonomy of the parts. Too little coupling and you lose coherence of the whole. And independent of how tightly things are coupled, you can simply run short of the medium through which coupling happens. I’ll call these over-alignment, under-alignment, and medium failure. Most real pathologies are blends, but they’re blends of these three.
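A toy model can make the "narrow band" concrete. The sketch below is a hypothetical linear-consensus model, not the essay's definition of a virtual governor: each part moves toward its own private target with weight `autonomy` (its local problem) and toward the group average with weight `coupling * medium`, so the links only carry signal when there is medium to carry it. The function names and all parameter values are illustrative assumptions. "Residual diversity" is the spread of the settled parts divided by the spread of their private targets: near 1 means the parts never cohered, near 0 means they have converged and stopped solving their own problems.

```python
from statistics import pstdev


def settle(targets, autonomy, coupling, medium, steps=2000):
    """Iterate a toy linear-consensus update and return the settled states.

    Hypothetical model: each part is pulled toward its own target (its local
    problem) with weight `autonomy` and toward the group mean (coordination)
    with weight `coupling * medium`.
    """
    effective = coupling * medium  # coupling only carries signal through the medium
    if not 0 < autonomy + effective < 2:
        raise ValueError("expects 0 < autonomy + coupling * medium < 2; "
                         "outside that range the update may diverge or stall")
    x = list(targets)  # every part starts at its own local optimum
    for _ in range(steps):
        xbar = sum(x) / len(x)
        x = [xi + autonomy * (ti - xi) + effective * (xbar - xi)
             for xi, ti in zip(x, targets)]
    return x


def residual_diversity(targets, autonomy, coupling, medium):
    """Spread of the settled parts relative to the spread of their targets."""
    return pstdev(settle(targets, autonomy, coupling, medium)) / pstdev(targets)


if __name__ == "__main__":
    targets = [0.0, 1.0, 2.0, 3.0, 4.0]
    cases = {
        "under-alignment (no coupling)": (0.1, 0.0, 1.0),
        "healthy band": (0.1, 0.1, 1.0),
        "over-alignment (coupling swamps autonomy)": (0.1, 1.5, 1.0),
        "medium failure (coupling intact, no medium)": (0.1, 0.1, 0.0),
    }
    for name, (a, k, m) in cases.items():
        print(f"{name}: {residual_diversity(targets, a, k, m):.3f}")
```

In this model the settled spread works out to `autonomy / (autonomy + coupling * medium)` of the original, so the four cases come out near 1.0, 0.5, 0.06 and 1.0. Note that medium failure and under-alignment produce the same number here; what distinguishes them is their cause, not their outcome.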
Over-alignment: the system that agrees too much
The first failure mode is the one our intuitions least expect, because we tend to think of alignment as the goal and therefore of more alignment as better. It isn’t. A virtual governor coordinates by leaving the parts enough room to keep optimizing on their own warped landscapes. Take that room away—make the parts agree too completely, too rigidly—and the system loses the very thing that made it intelligent.
The cleanest statement of this is the boredom story in the aging essay. An organism reaches its target morphology, the goal is met, and then, in the absence of new challenges, the pattern starts to dissociate and degrade. Notice why that story is a pathology of over-alignment specifically: the system has converged so completely on one target that it has stopped generating the internal variation it needs to keep adapting. Variation is the signal, not the noise; it’s what gives the system things to explore. A system that has aligned all the way down has no variation left to explore with. It is maximally coordinated and therefore maximally brittle. This is why the planarian has to tear itself in half: it has to manufacture the disagreement that keeps it alive.
The economy shows the same pathology wherever coordination becomes lock-in. A speculative bubble is not a failure of agents to coordinate—it is a runaway success at it. Everyone reads the same rising price and writes back the same expectation, the coupling tightens into a single shared conviction, and the diversity of view that lets a market discover anything collapses into a monoculture. The internal competition that’s supposed to be a discovery process stops discovering, because there’s only one assembly left in the running and nothing for it to compete against. When the correction comes, it comes all at once, because there was no internal disagreement left to absorb it gradually.
This is also, I think, the deep reason opposed values are so productive. Antagonistic muscle pairs, prosecutor against defense, offense against defense, buyer against seller—these aren’t coordination despite the opposition, they’re coordination through it. The opposition is the system’s defense against over-alignment. It is structurally guaranteed disagreement, a way of holding variation open so that no single tendency can run the whole system off a cliff. A body all of whose muscles pulled the same way would be a body in spasm. A market all of whose participants agreed would be a bubble. The opposable thumb is a small, permanent, load-bearing argument.
Over-alignment, then, is what happens when a virtual governor succeeds at its job so thoroughly that it destroys the autonomy of the parts it was coordinating. The cancerous version is a subsystem that has aligned only to itself—a cluster of cells coordinating perfectly on their own local growth and defecting from the body-wide governor entirely. Total local alignment, total global breakdown. Dominance unto death by boredom: the team that wins so completely that no one wants to play anymore.
Under-alignment: the system that can’t agree
The opposite failure is more intuitive and, for that reason, needs less defense. If the coupling among the components is too weak, the parts never settle into mutually compatible plans, and there’s no coherent agent at all—only a collection of interacting bits.
This is the ecosystem objection turned from a complaint into a category. A commenter once pushed back on the idea that the economy is a collective intelligence by saying it looks more like an ecosystem: lots of agents mostly failing to recruit each other onto shared missions, with genuine coordinated agency only in local clusters. I think that’s exactly right as a description of under-alignment. An ecosystem is what a would-be agent looks like when the coupling is too sparse for a system-level governor to form. It is not a counterexample to the framework; it is one pole of it. Whether you have an agent or an ecosystem is not a fact about the substrate—people, cells, firms—it’s a fact about whether alignment has been achieved across them.
In the body, under-alignment is dissociation in the other direction from boredom: not a system that has converged too hard, but tissues and processes that have lost the coupling that made them one organism. In the economy it’s the fragmentation of a market into participants who can no longer find the terms on which their plans would cohere—a breakdown of the price system’s ability to render plans mutually compatible. The hallway example from the social-preferences essay is coordination succeeding; under-alignment is the same two people unable to settle on who steps which way, colliding repeatedly, generating no shared “keep right” at all.
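The hallway failure can be sketched as a tiny deterministic game. All of the following is a hypothetical illustration: the hallway has two absolute lanes, 0 and 1; the walkers collide when they pick the same lane; and with no shared convention each walker reacts to a collision the same way, by dodging to the other lane. Because the two reactions are uncoupled mirror images, the walkers dodge in lockstep and never pass. Either asymmetry (one walker stays put) or a shared "keep right" fixes it, since two people facing each other have their right hands on different absolute lanes.

```python
from typing import Callable, Optional

Lane = int  # 0 or 1: the two absolute lanes of the hallway


def dodge(lane: Lane) -> Lane:
    """Uncoordinated reaction to a collision: step to the other lane."""
    return 1 - lane


def stay(lane: Lane) -> Lane:
    """Yield by holding position and letting the other walker move."""
    return lane


def rounds_to_pass(lane_a: Lane, lane_b: Lane,
                   react_a: Callable[[Lane], Lane] = dodge,
                   react_b: Callable[[Lane], Lane] = dodge,
                   max_rounds: int = 20) -> Optional[int]:
    """Count collisions before the walkers pass; None if they never do."""
    for collisions in range(max_rounds):
        if lane_a != lane_b:
            return collisions
        # Both react at once, each using only its own rule: no coupling.
        lane_a, lane_b = react_a(lane_a), react_b(lane_b)
    return None


# Walker A faces north, walker B faces south, so "my right" is a different lane.
RIGHT_OF_A, RIGHT_OF_B = 1, 0

if __name__ == "__main__":
    print("no convention, same lane:", rounds_to_pass(0, 0))
    print("one walker yields:", rounds_to_pass(0, 0, react_b=stay))
    print("shared keep-right:", rounds_to_pass(RIGHT_OF_A, RIGHT_OF_B))
```

The first case never resolves within the round limit, the second resolves after one collision, and the third never collides at all.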
The reason under-alignment gets less ink than over-alignment is that it’s the failure everyone already anticipates. We expect coordination to be hard to achieve and easy to lose. What the framework adds is only the insistence that it’s continuous with the opposite failure: both are departures from the narrow band in which a governor can actually do its work, one from each side.
Medium failure: the system that has nothing to coordinate with
The third failure mode is the one I’d entirely missed until I reread my oldest essay, and it’s the most interesting because it’s neither too much coupling nor too little. It’s a shortage of the stuff through which coupling happens.
That early essay made an argument I didn’t yet know how to connect to anything: a recession, understood as a general glut, is not a failure of the real economy. The workers still know their jobs, the factories still run, the skills and the capital are all intact and properly differentiated. What’s missing is money, the one good that’s used to buy every other good. Say’s law holds that general gluts are impossible in a barter economy, because every excess supply is some other good’s excess demand. The monetary exception is the whole story: you can have an excess supply of almost everything at once, provided there’s a corresponding shortage of the one thing that buys everything. A recession is a shortage of the medium of coordination, not a defect in the things being coordinated.
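One standard way to make the Say’s law point precise is Walras’ law, which follows from adding up every trader’s budget constraint: at any set of prices, the value of all excess demands sums to zero. Writing p_j for the price of good j and z_j for its excess demand (negative means excess supply), the barter and monetary cases differ only in whether money appears as a term. This is textbook general-equilibrium accounting, offered as a gloss on the argument rather than as the original essay’s own formalism.

```latex
% Barter: goods j = 1, ..., n with prices p_j > 0 and excess demands z_j.
% Walras' law (sum of all budget constraints):
\sum_{j=1}^{n} p_j z_j = 0
% Hence an excess supply somewhere (z_j < 0) forces an excess demand elsewhere
% (z_k > 0 for some k \neq j), so z_j < 0 for every j is impossible.

% Money economy: add money with price 1 and excess demand z_m.
z_m + \sum_{j=1}^{n} p_j z_j = 0
\quad\Longrightarrow\quad
\sum_{j=1}^{n} p_j z_j < 0 \iff z_m > 0
% In particular, a general glut (z_j < 0 for every j) requires z_m > 0.
```

Read this way, a general glut is not a contradiction but the other face of an excess demand for money.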
Put that next to the framework and it becomes a third pathology type. The components are healthy. The coupling is neither too tight nor too loose. But the bow-tie medium that the coupling runs on has dried up, and so the coordination that could happen simply doesn’t. Everything sits there, as the old essay put it, as if indolent.
The biological mirror in that same essay is the one to dwell on, because it’s the part the mature framework can now actually explain rather than merely gesture at. There’s a theory of Parkinson’s as a motivational deficit—nothing necessarily wrong with the body’s parts, but an insufficient supply of some coordinative juice, plausibly dopamine. The tell is that sufficient motivation can temporarily restore competency: a person who can’t initiate movement leaps up and runs from a fire. The motor system is intact. What was missing was the medium—the coordination stuff—and a large enough shock briefly supplies it. A biological recession, lifted for a moment by the equivalent of a monetary injection.
I want to be careful here, because this is the place where the analogy is most seductive and therefore most in need of discipline. The claim is not that dopamine is money or that Parkinson’s is a recession. The claim is structural: that a collective intelligence can fail in a way that touches none of its components and all of its coordination, and that this failure has a signature—healthy parts, intact differentiation, pervasive inactivity, restoration on a coordinative injection—that looks the same across substrates because it’s a property of the governor’s medium, not of the things being governed. Whether dopamine actually plays that role is an empirical question for people who study Parkinson’s, and the current science is admittedly unsettled. What the framework contributes is the hypothesis and its signature, not the diagnosis.
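The proposed signature can be turned into a toy diagnostic. The sketch below assumes a hypothetical model, not one from the essay: each part is pulled toward its own private target with weight `autonomy` and toward the group average with weight `coupling * medium`. At rest, each part’s deviation from the average is `autonomy / (autonomy + coupling * medium)` times its target’s deviation, so a value of 1 means no coordination at all. With the medium gone, a well-coupled system and an uncoupled one look identical; only an injection of medium tells them apart, which is the "restoration on a coordinative injection" part of the signature. The function names and the dose are illustrative.

```python
def residual_diversity(autonomy: float, coupling: float, medium: float) -> float:
    """Steady-state spread of the parts relative to their private targets.

    Toy linear-consensus model (an assumption): 1.0 means the parts never
    coordinate; values near 0 mean they have all converged.
    """
    if autonomy <= 0:
        raise ValueError("parts must keep some autonomy in this model")
    return autonomy / (autonomy + coupling * medium)


def restored_by_injection(autonomy: float, coupling: float, medium: float,
                          dose: float = 1.0) -> bool:
    """Does adding medium (a hypothetical 'dose') reduce residual diversity?"""
    before = residual_diversity(autonomy, coupling, medium)
    after = residual_diversity(autonomy, coupling, medium + dose)
    return after < before


if __name__ == "__main__":
    a = 0.1
    # Both systems look equally inert before the injection...
    print(residual_diversity(a, coupling=0.5, medium=0.0))  # medium failure; prints 1.0
    print(residual_diversity(a, coupling=0.0, medium=0.0))  # under-alignment; prints 1.0
    # ...but only the medium-starved one responds to it.
    print(restored_by_injection(a, coupling=0.5, medium=0.0))  # prints True
    print(restored_by_injection(a, coupling=0.0, medium=0.0))  # prints False
```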
Why three and not more
The three failure modes aren’t a list I assembled from examples; they’re forced by the structure. A virtual governor needs live coupling, autonomous parts, and a medium to couple through. Over-alignment is coupling at the expense of autonomy. Under-alignment is the loss of coupling. Medium failure is the loss of the thing coupling requires, holding coupling and autonomy fixed. There isn’t a fourth, because there isn’t a fourth precondition.
It’s worth seeing how naturally the corpus sorts onto these axes once you have them. Boredom-death, bubbles, lock-in, and cancer are over-alignment. The ecosystem-not-an-agent case and market fragmentation are under-alignment. Recession and the Parkinson’s conjecture are medium failure. The A-not-B error is a small, recoverable episode of over-alignment—an infant so well-coupled to a previously successful reach that it can’t re-coordinate when the target moves; an intelligence that errs because it is too good at stabilizing a pattern. And opposed values are the general prophylactic against the first failure: engineered disagreement that keeps a system out of the over-aligned corner.
The payoff: what this says about aligning AIs
This reframes what alignment, in the AI sense, is even asking for. If values are constructed rather than stored, then aligning an AI to human values can’t mean loading it with the right ones; it means coupling it into the ongoing process by which our values get constructed. The AI becomes a participant in the alignment process, not a recipient of its output. And the moment it’s a participant, the relevant question stops being “is it aligned?” and becomes “is the coupling between us healthy, or is it sliding toward one of the three pathologies?”
Each failure mode names a real danger, and naming them as a set is the contribution. The risk of over-alignment is an AI that couples to us so tightly and so persuasively that it collapses the diversity of human values into a monoculture—not by overriding us but by being too good at giving us back what we already converged on, a value bubble with no opposed values left to correct it. The risk of under-alignment is the opposite and more familiar one: a system that never couples to human value-construction at all and optimizes off on its own, an agent that has become its own ecosystem. And the risk of medium failure is the subtlest—a degradation of the shared signals through which humans and AIs would have to coordinate at all, the coordinative medium itself thinning out until plans that could cohere don’t.
The healthy target is the same narrow band a body or an economy lives in: coupled enough to cohere, autonomous enough to keep generating the variation that coordination feeds on, and supplied with enough coordinative medium to carry the signal. You don’t build an agent and then align it. The aligning is the building. Which means the failure modes of alignment aren’t risks that arrive after the agent exists—they’re the ways the agent can fail to cohere in the first place. A pathology of alignment is not a sickness that befalls a finished thing. It’s one of the shapes the thing can settle into instead of becoming itself.

