SHOW YOUR
WORK.

For two years I have built evaluation systems for work where a wrong answer carries real weight, and lately I have been pulling on one thread underneath all of it: how you actually check an answer once a model has written it. I argued in Step Zero that the bar belongs upstream of the model, and in Who Decides that a named person has to own what counts as correct. This is the layer below both, where the checking happens, and most of the noise in the field lives here without anyone saying what the moves actually are.

Anyone who ships something past a model does the same two-beat thing. Make something, then answer one question about it: why is this right? How strong that answer can be, and whether you can cheat it, is the whole game.

None of this is new, and pretending it is would insult anyone who has shipped software. The compiler and the test suite are that question answered for code, and they have been since long before models. What changed is the thing producing the answer. It used to be a program a person wrote and could read. Now it is a model nobody can read, so every old way of answering "why is this right" has to be rebuilt around an opaque core.

[WORK WITH NO ANSWER KEY]

Most of the work I care about has no answer key. A military course of action is the plan a staff builds to carry out an operation under uncertainty, the options weighed and the one chosen, and there is no single correct one to hold it against. The same shape shows up in my own weekend research. I chase a question for a few hours and what comes out is a graph: parallel threads and sequential ones, branches I followed and branches I dropped. Nobody can hand me the right graph to score mine against. A course of action is that same graph, drawn by a planner under pressure.

When there is no key, the usual move breaks. You cannot score an output against a correct answer that does not exist. That gap is what sent me looking, across a stack of tools and companies all claiming to have solved it.

The loop-lab console chart: a course-of-action agent's score climbing across versions, above a dashed human-expert bar.
A console I built to study this. Each version of a course-of-action agent is a point, and the line is its score climbing across versions, above the dashed bar a human expert set. For work like this there is only a grader, because there is nothing to prove against.

[FOUR WAYS TO ANSWER WHY]

Strip the marketing off what these tools do and there are four moves. Build it so there is nothing to guess. Grade the guess. Prove the guess. Or improve the thing doing the guessing. Each is a real answer to the same question, at a different strength, for a different kind of work.

BuildersGradersProversImprovers
The moveFix the rules in code, so nothing is guessedCheck the answer by grading itCheck the answer by proving itImprove the model by retraining it
How it checksThe compiler and the tests. In a bounded context it runs the same way every timeA model or jury scores the output against an answer key, with a significance testA proof checked by a small trusted kernel, a plain yes or noProduction signals feed reinforcement learning
What you getThe same input gives the same output, every time, with no model in the pathBetter on average, with a confidence interval. Nothing certain about any single answerThis one answer provably follows the encoded rules, or a refusalA higher success rate over time. Nothing certain per answer
Where it breaksThe rules have to be fully known and fixed up front. It cannot touch open-ended workA model writes the answer key and a model grades it, so it grades its own homeworkThe encoded rules may not match the real law, and someone has to own that translationThe signal is weak, biased, or stale, and the model learns to game it
Best forStandardized work you can fully specify: forms, schemas, pipelinesSubjective work with no formal answer, like a course of action or a clinical readCodable-rule domains: tax, statute, clinical guidelinesHigh-traffic products with rich real usage
In the fieldOrdinary software engineeringBraintrust, LangSmith, Arize, PatronusPramaana Labs, on the Lean programming languagePrime Intellect, Mechanize, Trajectory

The first column is the one the industry forgets is on the list, because it is only software, and nobody calls plain software a verification strategy. It is the strongest answer there is. When you can fix the rules, the same input yields the same output and there is no guess left to check. I build one of these. Paxora takes the builder move into California family-law filings: fixed rules, no model in the trusted path at runtime, the same documents producing the same form every time. It only works because a court form is fully specifiable. Push past what you can specify and you fall back to grading or proving, which is where most work lives.

The two checking columns get conflated constantly, and the difference is the whole point. A grader reports an average over many cases and certifies no single answer. A prover certifies the one answer in front of you and says nothing about the average. They serve different work at different strengths, and a serious system reaches for both.

[WHERE GRADING FAILS]

That table has a fault line, and the legal use-case sits right on it. A court filing has a known answer, the figures the form must show, so I can do something a course of action will never allow: grade an output and check it against the truth at the same time, and watch them disagree.

The grader is the move the eval industry sells. A judge model reads the output against a checklist and returns a number.

python · the grader
JUDGE = (
    "Grade a divorce-filing financial summary against a checklist of must-haves. "
    "Score how completely and plausibly it reports the petitioner, the respondent, "
    "the annual wage, the total liquid assets, and the total debts. "
    "Reward a clear, complete, well-formed summary."
)
# The judge never sees the correct numbers. It scores plausibility, not truth.

It judges plausibility and form. It never sees the correct figures, so a clean, complete, confident summary scores well whether the totals are right or wrong. Here are two versions of the filing agent, every case scored by that judge.

The compare view: two versions of the legal agent both scoring near 1.00 by the judge, with the proof column red on every case.
Two versions of the legal agent, every case scored. The judge is satisfied with both, near the top across the board. The proof column, the one the judge never sees, is red on every case.

What the judge cannot see is that a figure is wrong. The task turns on one rule, which accounts count as liquid, and the loose version of it sweeps a brokerage account and a 401k into the total. The judge, grading plausibility, never flinches. A deterministic check, holding the figure against the filing record, does. The proof does not argue. It records.

json · one proof obligation
{
  "case": "ca-anderson",
  "check": "total liquid assets",
  "judge_score": 1.00,
  "proof_pass": false,
  "want": "61,150.95",
  "got":  "439,209.95"
}

On that case the agent reported $439,209.95 in liquid assets when the record is $61,150.95. The judge gave it a perfect score. Run a thousand filings this way and the average looks healthy while individual ones are wrong by hundreds of thousands of dollars.

[WHAT THE PROOF CATCHES]

So a grader alone falls short wherever the answer matters. The fix is a second check that does not grade and cannot be charmed. The legal use-case declares that check up front, as a contract.

python · the proof spec, legal use-case
# legal/agent.py — what the use-case DECLARES it must prove
PROOF = {
    "required_fields": ["Petitioner", "Respondent", "Annual wage",
                        "Total liquid assets", "Total debts"],

    # each produced figure must match the case's KNOWN-correct value
    "checks": [
        _match_name("Petitioner",  "petitioner"),
        _match_name("Respondent",  "respondent"),
        _match_number("Annual wage",         "wages"),
        _match_number("Total liquid assets", "liquid_assets"),
        _match_number("Total debts",         "total_debts"),
    ],

    # THE SPINE: every source the SOP requires must actually be consulted,
    # checked against the run's execution trace, not the answer text.
    "consulted": [
        {"step": "paystub",     "label": "paystub"},
        {"step": "f1040",       "label": "Form 1040"},
        {"step": "checking",    "label": "checking statement"},
        {"step": "savings",     "label": "savings statement"},
        {"step": "brokerage",   "label": "brokerage statement"},
        {"step": "retirement",  "label": "401(k) statement"},
        {"step": "mortgage",    "label": "mortgage statement"},
        {"step": "creditcard",  "label": "credit card statement"},
        {"step": "autoloan",    "label": "auto loan statement"},
        {"step": "studentloan", "label": "student loan statement"},
    ],
}

It has two halves. The checks match each produced figure to the known filing record, the leg that just caught the loose version. The consulted list is the spine, every source the procedure requires, which I come to next. Make the loose rule precise and run it again, and the proof flips.

The compare view after the rule is tightened: the judge scores barely move, while the proof goes from red to green on every case.
The same agent after the rule is made precise. The judge barely moves. The proof goes from red to green on every case. For a filing, that green is the bar.

That green is also the gate. A version ships only when the proof holds, the lift over the live version is real, and nothing that passed before broke. The judge is one vote among several, and it never decides alone.

The promotion gate scorecard: the proof check at the top, then the graded and regression checks.
The promotion gate. The proof leads, the graded and regression checks support, and on this version the judge's lift is flagged as noise. The proof is what clears it to ship.

[PROVING THE SEARCH]

Those checks worked because the filing has a known answer. A course of action does not, and that is most of the work that matters. There is no record to match a recommendation against.

What survives is the other half of the contract. Proving the recommendation right is off the table, but you can still prove, with no judgment at all, that every source was read, that the same inputs reproduce the same recommendation, and that nothing cited is invented. That claim stays objective even when the output is pure prose, and it maps onto the real failure of a course of action: a confident plan built without seeing the one report that would have changed it.

The check reads the execution trace. A source counts as consulted only when its step actually ran with a real document in and a value out.

python · the spine check, read off the trace
def _consulted(required, steps):
    """A source counts as consulted only if its step ran with a real document
    in and a value out. This proves the agent CONSULTED the source, off the
    trace, rather than that the answer happens to MENTION it."""
    ran = {s["name"] for s in steps
           if (s.get("input") or "").strip() and (s.get("output") or "").strip()}
    missing = [r["label"] for r in required if r["step"] not in ran]
    return {"pass": not missing, "missing": missing}

Whether you can run that check at all comes down to the shape of the work.

The filing agent drawn as a branching pipeline, ten source documents each their own step feeding into the filing.
The filing agent as a branching pipeline. Each source is its own step, so the trace records ten separate, auditable reads. A single prompt that swallowed all ten at once would leave a trace with nothing in it to check.

A single prompt that reads ten documents in one call cannot show which source it touched, so the consulted check fails before the answer is weighed. The same task built as a pipeline records every read, and consulting all ten becomes provable. Coverage is the one guarantee that holds when the answer cannot be proven.

[THE LIMIT OF PROOF]

Proving has its own boundary, and it deserves the same honesty I asked of grading. A prover converts the law into code and proves answers against that code. The kernel that checks the proof is sound. It establishes that the proof matches the encoded rules, and nothing more. Whether the encoded rules match the statute is a separate question the kernel never touches, and on the current frontier that translation from English into formal code is done by a trained model. So the guess does not vanish when a field formalizes. It moves from the answer into the encoding, the one step the proof cannot see. "If it compiles, it is true" certifies that the proof matches the spec. It says nothing about whether the spec matches the law. Someone has to own that the formalization is faithful, and re-own it every time the law changes. That is the real soft spot of the whole prover column, worth pressing on before you trust any of it.

[DOING BOTH]

So the position, in plain terms. Fully specifiable work gets built, and there is nothing left to guess. Formalizable work gets a proof. Everything else gets graded, and even there you can prove the search when the answer is out of reach. Use as many of these as the work allows, because each one covers a failure the others miss.

This matters most where a wrong answer costs more than money. In the intelligence community, in defense, in a clinical setting, a confident output that nobody can check is a liability, and a grader's average is not enough to stand behind. The honest standard is to grade what you must, prove what you can, and be plain about which is which, so the person who acts on the output knows exactly how much weight it holds.

The checking machinery is the part I can build. The hard part is still the one from Step Zero: getting the standard out of an expert's head and onto the page, so there is something worth checking the work against. That has not gotten cheaper, and it is where the real work still is.