STEP ZERO.
I started working with LLMs in January 2024 on the early frontier-model APIs and the prompt engineering that came with them. By the summer of 2024 the work had become real. LLM evaluations and the agentic-workflow evaluations layered on top of them were part of the daily toolkit. Acceptance criteria, scoring functions, judge models, ground-truth sets, eval harnesses, the whole machinery. In product roles I treated them the way I treated unit tests and CI gates: the bar a release had to clear before it could go out the door.
July 2025 is when I joined a fractional team rebuilding an evaluation platform for high-stakes work, the kind where wrong answers carry real consequences. The first day looked familiar. Take subject-matter experts in a regulated field, watch them do the work, and turn that expertise into agents that can run at scale. I had built versions of this in product roles. The new version came with Claude Code in the loop and a frontier model behind the agent, which compressed parts of the work I used to spend weeks on.
[WHAT STAYED HARD]
Most of the code I used to write by hand, the model writes now. Schema design, route plumbing, repository layers, the test scaffolding around all of it. The model gets it right most of the time, and prompt engineering covers the rest of the time I would have spent on integration. The work upstream of that did not compress at all, because none of it lives in the model and no amount of fine-tuning can synthesize what was never in the training data.
What lives upstream is what an analyst, an engineer, or a clinician knows after 10 years of doing the job. Which mistakes matter, which sources hold up under scrutiny, which parts of doctrine bend under pressure. A fraction of it sits in standard operating procedures; the bulk stays undocumented. Interviewing 10 experts is doable; interviewing 10,000 is not. The practice is locked inside their heads, and getting it out of there was the hard part of the work.
[WHO OWNS ACCEPTANCE]
I spent the fall of 2025 watching this play out on a real team. The platform had clean rubric primitives, a working orchestrator, scoring functions, all the components any evaluation system is expected to have. The acceptance bar still came from the engineer who wrote the prompt, because the engineer was the only one in the room when it was time to gate a release. I have lowered bars to ship more than once, and so has every engineer I know who runs deploy clocks. The fix was to put acceptance with someone other than the person on the hook to ship.
We rebuilt the platform around three roles, with a named human owning each gate. An analyst codifies the standard from interviews and source documents. An applied AI engineer configures and runs evaluations against the codified version. Acceptance lives with a program manager, who routes signals when a run falls below the bar. The same person can hold multiple roles, but the surfaces are separated by function so the applied AI engineer is not also the judge of whether the work is good enough.
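As a sketch of what that separation can look like in configuration (the role names, gate names, and the check below are my illustration, not the platform's actual schema):

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    ANALYST = "analyst"                  # codifies the standard from interviews and sources
    AI_ENGINEER = "ai_engineer"          # configures and runs evaluations
    PROGRAM_MANAGER = "program_manager"  # owns acceptance, routes below-bar signals

@dataclass(frozen=True)
class Gate:
    name: str
    owner_role: Role
    owner_name: str  # a named human, not just a role

# Hypothetical wiring: the same person may hold more than one role,
# but the runner of the evaluation never also owns acceptance.
GATES = [
    Gate("codification", Role.ANALYST, "j.doe"),
    Gate("evaluation_run", Role.AI_ENGINEER, "a.chen"),
    Gate("acceptance", Role.PROGRAM_MANAGER, "m.okafor"),
]

def check_separation(gates: list[Gate]) -> None:
    """Reject a configuration where the evaluation runner also judges acceptance."""
    runner = next(g for g in gates if g.name == "evaluation_run")
    acceptor = next(g for g in gates if g.name == "acceptance")
    if runner.owner_name == acceptor.owner_name:
        raise ValueError("the engineer who runs the eval cannot also own acceptance")
```

The check is the point: the structure refuses a configuration where one person holds both the run and the verdict.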
[HYPOTHESIS, NOT VELOCITY]
The hardest question in evals lives upstream of scoring: what does success look like, and who decides. Today's tooling makes it cheap to fire off experiments, which is fine when the bar is settled and dangerous when it is not. Running dozens of experiments a week without a hypothesis underneath confuses speed for direction. The experiments that earn their keep are the ones I can defend before they ship: a clear position, a definition of success written down up front, and a named recipient for the signal at the other end. From there the next move is often outside experimentation entirely. A customer conversation, a prototype review, or the call the operator makes without another round of data.
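One way to keep that discipline honest is to write the card before the run. A minimal sketch; the field names are my shorthand, not anything a real workbench enforces:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentCard:
    """Written before the experiment runs, defensible before it ships."""
    hypothesis: str          # the clear position being tested
    success_definition: str  # what passing looks like, fixed up front
    signal_recipient: str    # the named human who acts on the result

# Illustrative values only.
card = ExperimentCard(
    hypothesis="Citing source excerpts in the judge prompt reduces false passes",
    success_definition="False-pass rate under 5% on the held-out gold set",
    signal_recipient="m.okafor",  # the program manager who owns acceptance
)
```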
Evals sit downstream of that judgment. The platform holds the line once a team draws it; the line itself comes from outside the platform. Once a team treats the rubric as the work product itself, eval engineering is product strategy.
[TWO PRIMITIVES]
Citations come first. Every criterion the platform asserts is anchored to a specific location in the source document that produced it: page, paragraph, character offsets, and the excerpt text. The chain records whether a person authored the criterion or a model generated it. When a regeneration produced it, the rejection feedback that triggered the regeneration is recorded on the chain too. A reviewer following any criterion can land on the exact passage in the source document and read it.
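A minimal sketch of that chain as a data model. The anchoring mirrors what the paragraph describes; the field names and types are illustrative, not the platform's schema:

```python
from dataclasses import dataclass
from enum import Enum

class Author(Enum):
    HUMAN = "human"
    MODEL = "model"

@dataclass(frozen=True)
class SourceAnchor:
    document_id: str
    page: int
    paragraph: int
    char_start: int
    char_end: int
    excerpt: str  # verbatim text at the offsets, so a reviewer lands on the passage

@dataclass(frozen=True)
class Criterion:
    text: str
    anchor: SourceAnchor
    authored_by: Author
    # Populated only when a regeneration produced this criterion: the
    # rejection feedback that triggered the regeneration rides on the chain.
    rejection_feedback: str | None = None
```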
In the platforms I have used, the criterion lives as a string with no link back to the source. Making the trace navigable was the part reviewers told us earned their trust. Without it, every criterion has to be cross-checked against the source by hand, which is the work that consumes program-manager time and ages a rubric out of usefulness.
Feedback loops are the second primitive. End-user signals from production flow into the same store that holds the ground-truth examples. Reviewer signals sit alongside the gold-standard answers: pairwise preferences from structured comparisons, pointwise thumbs-up and thumbs-down on real outputs, all of it feeding the next benchmark run. The criterion gets refined by what production actually returned, not by what the engineer assumed at authoring time. The coverage-gap failure mode, the case the eval suite missed because no one had thought to write it down, becomes a test case the moment a reviewer flags it.
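Here is one hedged shape for that store. The two signal kinds and the promotion rule come straight from the paragraph above; everything else, names included, is assumption:

```python
from dataclasses import dataclass
from enum import Enum

class SignalKind(Enum):
    PAIRWISE = "pairwise"    # structured preference between two outputs
    POINTWISE = "pointwise"  # thumbs-up or thumbs-down on a single real output

@dataclass(frozen=True)
class ReviewerSignal:
    kind: SignalKind
    output_id: str
    preferred_over: str | None = None  # set on pairwise comparisons
    positive: bool | None = None       # set on pointwise votes
    flags_coverage_gap: bool = False   # the case the eval suite missed

def ingest(signal: ReviewerSignal, store: list[object]) -> None:
    """Reviewer signals land in the same store as the gold-standard examples;
    a flagged coverage gap is promoted to a test case for the next benchmark run."""
    store.append(signal)
    if signal.flags_coverage_gap:
        store.append(("test_case", signal.output_id))  # placeholder promotion
```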
[STILL UNRESOLVED]
A part of this I still do not have a clean answer for. If the analyst on the other end of the codification session does not want to be there, the content does not arrive. We hit this often enough that I stopped treating it as a workflow problem. Experts supply the content; the platform supplies the structure that holds it and the surface where they record it. We mitigated with interview workflows, agents that read source documents and proposed first drafts the analyst could correct, and a product surface designed around expert time as the scarcest resource on the team. Those mitigations help. Acquisition is the part of the approach I do not know how to scale.
Two structural decisions in the schema are also still open. The fields that define a standard operating procedure live on the rubric table, a leftover from the pre-rewrite primitive, and migrating them off requires every service that reads them today to switch to the new shape. Acceptance thresholds still sit inside the engineering configuration rather than on the program-manager surface that owns them. Both are in the backlog. Neither is blocking, but I do not want to build the next surface on top of them.
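For concreteness, the target shape I have in mind, sketched with hypothetical names (the real schema differs, and the migration cost is the point of the paragraph above):

```python
from dataclasses import dataclass

@dataclass
class StandardOperatingProcedure:
    sop_id: str
    title: str
    source_document_ids: list[str]  # today these fields live on the rubric table

@dataclass
class Rubric:
    rubric_id: str
    sop_id: str  # a foreign key replaces the inlined SOP fields

@dataclass
class AcceptanceConfig:
    rubric_id: str
    threshold: float  # moves from engineering config to the program-manager surface
    owner_name: str   # the named human accountable for the gate
```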
Most of the 2026 conversation about evals skips this layer. Today's platforms handle scoring, prompt engineering, and the reinforcement-learning feedback loops on top of them. Translating a stated criterion into an evaluation suite is a wrapped frontier-model call available in any workbench. What stays unsolved is the content the model cannot produce, because it sits outside any training data. That content lives in the heads of the experts who hold the practice. Most of my weeks went into building the surface where they could record that practice, and the structure that protects what they recorded.
