WHO
DECIDES.

The conversation about AI in consequential work has settled on the model. Pick the strongest one. Route the hard calls to a frontier advisor and the rest to something cheaper. Post-train an open model until its scores climb. All of it works, and the floor keeps rising. None of it touches the part that decides whether the output can be trusted, which is what counts as correct and who is willing to put their name on that call.

The harder problem sits a layer above the model, on what the output is supposed to be and how anyone would know it got there. In domains where a wrong answer carries real weight, that is where the difficulty lives, and most of the attention and money goes elsewhere.

[WHERE THE MODEL BROKE]

We took one analyst's investigative tradecraft and turned it into a standard a system could be measured against. Then we ran a capable model, composed into a multi-agent system, against a case the analyst had already closed by hand. The ground truth was settled, because the human had finished the same case months earlier.

The reasoning held up. The system framed the right questions, found the right places to look, added a compliance angle the analyst had not, followed the leads, and recorded two dozen dead ends honestly. Where it fell short, the cause was never the thinking.

It missed the one primary-source record that made the human's case, because it searched the wrong official source and never tried the form of the name the records use. A name came back subtly wrong, the kind of miss no larger model corrects, because the right version only lives in an official filing nobody thought to pull. Three records that carried the assessment went unchecked, because the system did not know they existed. When its own review step could not open the draft, the system graded itself and passed itself on the very failures it was built to catch.

A faster model runs the same wrong search, and a larger one still has to be told which records matter. The gap was never in how well it reasoned. Knowing what to look for is the part the analyst carried in his head, and the system had never been given it.

[WHAT THE EXPERT KNEW]

None of what the system was missing could be downloaded or fine-tuned in. It lived in one person's experience, and getting it out was the actual work.

None of this turned out to be new. When the failures kept tracing back to what the analyst knew and the system did not, I went looking, and the problem already had a name. Pulling expert judgment out of a person's head and into a machine stalled the expert systems of the 1980s, where it was called the knowledge acquisition bottleneck. Every wave since has run into it again under a different label, from hand-built rules to labeled datasets to human preference tuning. What I had not appreciated until I lived it is how the cost on each side moved. The model half collapsed from years of work to an afternoon, while the human half did not move at all. The oldest bottleneck in the field now decides the outcome, because it is the only part that did not get cheaper.

The analyst had no written method for the research phase. He had run enough cases that the method had gone tacit, the way an experienced operator stops being able to recite the steps because the steps have become instinct. Asking him to describe his process produced nothing I could use. When I had him walk me through a finished report, everything came out, because a real piece of work forces the specifics a summary smooths over. Two sessions on two closed cases were enough to see that the same person works two ways, top-down when the subject is an institution and bottom-up when the subject is a person.

From that I could name what good requires: coverage of the right questions, fidelity to the primary sources, knowing when a thread is worth pivoting on, corroboration before anything is called a finding, honest reporting of what came back empty, and holding the picture together when independent threads converge. Every one had to be made checkable. The hardest was the last, the judgment that turns separate facts into a single read, which no checklist captures and a person still has to weigh.

One rule underneath all of it was simple to say and hard to live by. A single data point is a lead. Several independent points, from sources that fail in different ways, make a finding. Three articles repeating one press release count as one source. When two facts contradict, you hold both and leave the claim unresolved.

[HOLDING THE BAR]

Once the standard exists, the point is to make the bar hold and only ever rise. Every change has to meet or beat the last accepted version. A miss caught in production becomes a new case, and the next run has to prove both that the miss is fixed and that nothing good broke on the way.

Two decisions keep that honest. Judgments are recorded one criterion at a time, because a single blended score tells you the work got worse without telling you where, while a per-criterion read tells you exactly what to fix. The cheap, certain checks run first and the expensive ones only when the answer is in doubt. And when a score dips, nothing reverts on its own, because a system that retreats at every dip is quietly optimizing to never improve. A person looks at the dip and decides what it means.

[WHO SIGNS]

The hard part comes before any of that: deciding what the work is supposed to be, and binding a person's name to it.

The same expert sits on both ends of this. The analyst whose practice we encoded is the one who accepts the result, and by design that is a different person than whoever built the system, so the bar is never set by the one on the hook to ship. They say what correct means, accept the agent against their own practice, and pull the standard back in when new information changes it. The bar is parity: matching what that analyst would have produced. Anything past that is upside.

What makes this hold in consequential work is accountability that travels. The accepted standard is a document with a trail behind it, recording whose judgment it encodes and who stood behind it, so the person who acts on the output downstream knows what they are trusting. The signed acceptance is the product, in settings classified and unclassified alike. A number on a dashboard carries none of that.

A well-funded category uses these exact words for a different job. It builds reinforcement-learning environments that score model behavior and feed the reward back into training, generating the trajectories that fine-tune the weights so the next checkpoint scores higher. The buyer is a frontier lab and the deliverable is a stronger model. What I am describing never touches the weights. The deliverable is an accepted standard with a named expert behind it, and the buyer is the organization that has to act on the output. Evals, rubrics, graders, the words are shared. What comes out the other end has nothing in common.

A better model lands every few months, and I take every one of them. Bringing a new one in is real work, and even when it goes perfectly it leaves the hard question where it was. The model does not decide what counts as correct, and it cannot answer for the decision. A person does both, and building the place where that person can do their part well still looks, from where I sit, like open space.