How to Evaluate AI Agents: What Works in 2026

Self-evaluation skews optimistic - Anthropic measured it. Generator-evaluator loops, trace debugging, eval sets from failures, and LLM-as-judge pitfalls.

Shirley · June 12, 2026 · 5 min read

Course outline · Build AI Agents (4.21)

Ask your agent "did you complete the task correctly?" and the answer is yes. It is always yes. Anthropic measured this while building long-running coding agents: models evaluating their own output skew systematically optimistic, and the fuzzier the criterion - design quality, completeness, "is this actually good" - the worse the skew. The agent isn't lying. It's grading homework it just wrote, with the same brain that wrote it.

Evaluation is the harness component builders skip most, and it's the exact line between "demos well" and "deployed." Here's the toolkit that works: separated evaluators, trace debugging, eval sets grown from real failures, and the judge pitfalls that silently corrupt your numbers.

Why Self-Evaluation Fails (It's Not the Model's Fault)

Three structural reasons, none fixable by prompting harder:

Confirmation bias, mechanized. The reasoning that produced the answer is sitting right there in context. Asked to verify, the model re-walks its own steps and finds them - surprise - reasonable. Same failure as a developer reviewing their own PR ten seconds after writing it.
Shared blind spots. Whatever misunderstanding produced the bug also evaluates the bug. If the agent misread the spec, its self-check applies the same misreading and passes.
"Looks done" vs "is done." Generation optimizes for plausible-looking output. Code that reads correct and code that runs correct are different claims - and only one of them can be checked by reading.

The fix follows from the diagnosis - and it's a rule older than AI: production and acceptance must be different parties.

Pattern 1: The Generator-Evaluator Split

The architecture Anthropic landed on for autonomous coding runs:

Generator - builds the thing.
Evaluator - a separate agent with fresh context - checks it. It never sees the generator's reasoning, so it can't inherit the generator's blind spots.
Failures route back to the generator as concrete feedback; the loop continues until the evaluator passes the output.

The detail that makes it work: the evaluator operates the output instead of reading it. It opens the app in a browser, clicks the flow, runs the test suite, checks the console - behavioral verification, not code review. "Looks done" can't survive contact with a clicked button that doesn't work.
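
A minimal sketch of that loop, to make the separation concrete. Everything here is illustrative and not Anthropic's implementation: `run_loop`, `Verdict`, `fake_generate`, and `behavioral_evaluate` are hypothetical names, the generator is a stub standing in for a model call, and the `max_rounds` cap is an added assumption so a stuck loop ends instead of spinning forever.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str = ""


def run_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Alternate generator and evaluator until the evaluator passes the artifact."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        artifact = generate(task, feedback)
        # The evaluator receives only the task and the artifact, never the
        # generator's reasoning, so it cannot inherit the same blind spots.
        verdict = evaluate(task, artifact)
        if verdict.passed:
            return artifact, round_no
        feedback = verdict.feedback
    raise RuntimeError(f"no passing artifact after {max_rounds} rounds: {feedback}")


def fake_generate(task: str, feedback: Optional[str]) -> str:
    # Stand-in for a model call: the first attempt is buggy, the retry is fixed.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def behavioral_evaluate(task: str, artifact: str) -> Verdict:
    # Operate the output instead of reading it: execute the code and call it.
    namespace: dict = {}
    exec(artifact, namespace)
    result = namespace["add"](2, 3)
    if result == 5:
        return Verdict(passed=True)
    return Verdict(passed=False, feedback=f"add(2, 3) returned {result}, expected 5")


if __name__ == "__main__":
    artifact, rounds = run_loop("write add(a, b)", fake_generate, behavioral_evaluate)
    print(f"passed after {rounds} round(s)")
```

The buggy first draft reads plausibly, and only running it exposes the failure. In a real harness the `exec` call would be replaced by a sandboxed test run or a browser session, since executing generated code in-process is unsafe.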

You can run this pattern today at three budget levels: a sub-agent with a review prompt and read-only tools (cheapest), an Agent Teams reviewer teammate with browser access (the multi-perspective version), or a Stop hook that runs lint+tests before the agent is allowed to finish (deterministic floor - the agent literally can't hand over broken code).

TIP

The Stop hook costs five minutes to set up and changes the default from "claims done" to "verified done." If you do exactly one thing from this article, do that.
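
A sketch of what such a gate script could look like. The commands in `GATES` (`ruff check .` and `pytest -q`) are example choices, not something this article prescribes, and the exit-code convention is an assumption to verify against your harness's hook documentation: here a return code of 2 is treated as "block the stop and feed stderr back to the agent."

```python
import subprocess
import sys

# Example gate commands (assumptions): swap in your project's own lint and tests.
GATES = [
    ["ruff", "check", "."],
    ["pytest", "-q"],
]


def run_gates(commands):
    """Run each command; return (all_passed, report describing the failures)."""
    failures = []
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            failures.append(f"$ {' '.join(cmd)}\ncommand not found")
            continue
        if proc.returncode != 0:
            # Keep only the tail of the output so the feedback stays short.
            tail = (proc.stdout + proc.stderr)[-2000:]
            failures.append(f"$ {' '.join(cmd)}\n{tail}")
    return (not failures, "\n\n".join(failures))


def main():
    ok, report = run_gates(GATES)
    if ok:
        return 0
    print(report, file=sys.stderr)
    return 2  # assumed "blocking" exit code; confirm for your harness


if __name__ == "__main__":
    sys.exit(main())
```

The point of the design is that the check is deterministic: the agent's opinion of its own work never enters the decision, only the exit codes of the tools do.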

Pattern 2: Traces, or Debugging the Middle

A 15-step agent run that produced a wrong answer is not one failure - it's one failure somewhere in 15 steps, and without traces you're guessing which. Outcome-only evaluation tells you that it failed; traces tell you where: the retrieval that fetched the wrong doc at step 3, the tool error silently swallowed at step 7, the goal drift after the context filled at step 12.

Minimum viable tracing: log every step's input, tool calls with arguments, results, and decision - structured (JSONL), replayable, and greppable. The audit-log hook gives you this for Claude Code in one config entry. For your own agents it's an afternoon of plumbing that pays back the first time a run goes sideways. Two metrics worth computing from traces beyond pass/fail: steps-to-completion (rising step counts = degrading efficiency, even while pass rate holds) and cost-per-success (the number that decides if the agent ships).
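
For a hand-rolled agent, the plumbing can be as small as the sketch below. `TraceLogger`, `summarize`, and the record fields are hypothetical names chosen for illustration. The metric definitions are assumptions too: steps-to-completion is averaged over successful runs, and cost-per-success is total spend across all runs (failed ones included) divided by the number of successes.

```python
import json
import tempfile
import time
from pathlib import Path


class TraceLogger:
    """Appends one JSON object per line so runs stay greppable and replayable."""

    def __init__(self, path, run_id):
        self.path = Path(path)
        self.run_id = run_id
        self.step = 0

    def _write(self, record):
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def log_step(self, step_input, tool_calls, results, decision, cost_usd=0.0):
        self.step += 1
        self._write({
            "type": "step", "run_id": self.run_id, "step": self.step,
            "ts": time.time(), "input": step_input, "tool_calls": tool_calls,
            "results": results, "decision": decision, "cost_usd": cost_usd,
        })

    def log_outcome(self, success):
        self._write({"type": "outcome", "run_id": self.run_id, "success": success})


def summarize(path):
    """Compute the two trace metrics from a JSONL trace file."""
    steps, successes, total_cost = {}, set(), 0.0
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        rec = json.loads(line)
        if rec["type"] == "step":
            steps[rec["run_id"]] = steps.get(rec["run_id"], 0) + 1
            total_cost += rec["cost_usd"]
        elif rec["type"] == "outcome" and rec["success"]:
            successes.add(rec["run_id"])
    n = len(successes)
    return {
        "mean_steps_to_completion": sum(steps.get(r, 0) for r in successes) / n if n else None,
        "cost_per_success": total_cost / n if n else None,
    }


def demo():
    # Two toy runs: one succeeds in 2 steps, one fails after 3 pricier steps.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "trace.jsonl"
        good = TraceLogger(path, "run-a")
        for _ in range(2):
            good.log_step("user task", [{"tool": "search", "args": {"q": "docs"}}],
                          ["hit"], "continue", cost_usd=0.25)
        good.log_outcome(True)
        bad = TraceLogger(path, "run-b")
        for _ in range(3):
            bad.log_step("user task", [], [], "retry", cost_usd=0.5)
        bad.log_outcome(False)
        return summarize(path)
```

Charging failed runs to cost-per-success is deliberate in this sketch: an agent that burns money on failures is more expensive per shipped result, even if each success looks cheap in isolation.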

Pattern 3: Eval Sets Grown From Failures

Public benchmarks (SWE-bench, GAIA, τ²-bench) tell you which model to pick. They tell you nothing about whether your agent handles your tasks - your data shapes, your edge cases, your users' weird phrasings. For that you need your own eval set, and the cheapest way to build one is to never waste a failure:

Start embarrassingly small - 20 real tasks with verifiable expected outcomes
Every production failure becomes a new eval case (the agent equivalent of regression tests)
Run the set on every prompt change, model swap, or harness tweak
Track pass rate, steps, cost over time

This is eval-driven development, and it converts agent work from vibes ("feels better after the prompt change?") into engineering ("pass rate went 71% → 84%, cost per success down 12%"). LangChain's Terminal Bench climb - top 30 to top 5 without touching the model - was exactly this loop applied relentlessly to the harness.
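
The loop above fits in a few dozen lines. This is a sketch under stated assumptions, not a framework: `EvalCase`, `run_eval`, and `regressions` are hypothetical names, the "agent" is any callable from task string to output string, and the two toy agents stand in for a before and after harness change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    case_id: str
    task: str
    check: Callable[[str], bool]  # verifiable expected outcome
    origin: str = "seed"          # "seed" or "production-failure"


def run_eval(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict:
    """Run every case against the agent and report pass rate plus failed ids."""
    results = {}
    for case in cases:
        try:
            results[case.case_id] = bool(case.check(agent(case.task)))
        except Exception:
            results[case.case_id] = False  # a crash counts as a failure
    passed = sum(results.values())
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failed": sorted(cid for cid, ok in results.items() if not ok),
    }


def regressions(before: Dict, after: Dict) -> List[str]:
    """Cases failing in the later report that were not failing in the earlier one."""
    return sorted(set(after["failed"]) - set(before["failed"]))


CASES = [
    EvalCase("upper-basic", "hello", lambda out: out == "HELLO"),
    # Added after a (hypothetical) production failure on padded input.
    EvalCase("upper-padded", "  hi ", lambda out: out == "HI", origin="production-failure"),
]


def agent_v1(task: str) -> str:
    return task.upper()


def agent_v2(task: str) -> str:
    return task.strip().upper()


if __name__ == "__main__":
    before, after = run_eval(agent_v1, CASES), run_eval(agent_v2, CASES)
    print(before["pass_rate"], "->", after["pass_rate"], "regressions:", regressions(before, after))
```

Tracking `origin` costs nothing and shows, over time, how much of the set came from real failures rather than guesses about what might break.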

Pattern 4: LLM-as-Judge, With Its Three Corruptions

Many criteria can't be asserted in code - "is this summary faithful," "is the tone right." An LLM judging outputs against a rubric scales where human review doesn't. It works, if you dodge three documented biases:

Bias	What happens	Mitigation
Position	Pairwise comparisons favor the first option	Judge both orders, average
Verbosity	Longer outputs score higher at equal quality	Length-cap or explicitly instruct against it
Self-preference	Models rate their own family's prose higher	Judge from a different provider than the generator

Two rules regardless: rubrics must be specific ("does it cite a source for every numeric claim - yes/no per claim", not "rate quality 1-10"), and a 5-10% human spot-check sample stays forever. The judge is a scaling tool, not an unsupervised authority - calibrate it against humans before trusting it, and re-calibrate when you change the rubric.
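
The position-bias mitigation from the table ("judge both orders, average") is mechanical enough to sketch. The judge interface is an assumption: any callable that takes a rubric and two outputs and returns a preference for the first one in [0, 1]. `make_biased_judge` is a fake judge built only to demonstrate the effect, not a model of how real judges behave.

```python
from typing import Callable, Dict

# (rubric, first_output, second_output) -> preference for the FIRST output, in [0, 1]
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, rubric: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_shown_first = judge(rubric, a, b)  # preference for a when a is listed first
    b_shown_first = judge(rubric, b, a)  # preference for b when b is listed first
    return (a_shown_first + (1.0 - b_shown_first)) / 2


def make_biased_judge(quality: Dict[str, float], position_bonus: float = 0.25) -> Judge:
    """Fake judge for demonstration: true quality gap plus a first-position bonus."""
    def judge(rubric: str, first: str, second: str) -> float:
        return 0.5 + position_bonus + 0.125 * (quality[first] - quality[second])
    return judge


if __name__ == "__main__":
    judge = make_biased_judge({"A": 1.0, "C": 1.0})
    print("single order:", judge("rubric", "A", "C"))
    print("both orders: ", debiased_preference(judge, "rubric", "A", "C"))
```

With two equally good outputs, the single-order call favors whichever came first, while the two-order average lands on an even split. Averaging cancels a constant position bonus like this one; it does nothing for verbosity or self-preference, which need the other mitigations in the table.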

The Maturity Ladder

Where most builders are vs where production agents need to be:

Level 0 - vibes. Run it, eyeball it, ship it. Every demo you've seen.
Level 1 - deterministic gates. Tests + lint via hooks. An afternoon. Catches the embarrassing class.
Level 2 - separated evaluator. Fresh-context agent operates the output. Catches "looks done."
Level 3 - eval set + traces. Regression suite from real failures, every run debuggable. Changes become measurable.
Level 4 - continuous. Production sampling scored by calibrated judges, alerts on drift, cost-per-success on a dashboard.

Each level catches what the previous one structurally cannot. Most builders sit at 0; level 1 is one hook away; level 3 is where "we think it works" becomes "we know what changed." And the cultural shift underneath is the same one harness engineering keeps teaching: stop asking the model to be more trustworthy, and build the system that doesn't need to trust it. It's also the same lesson behind loop engineering - in any agent loop, the evaluator (the verifier) is the bottleneck, not the model.

Sources & Verification

This guide is written from hands-on testing, then cross-checked against primary sources - official documentation and first-party announcements. Field results and opinions are labeled as such. See our editorial standards.