How to Evaluate AI Agents: What Works in 2026
Self-evaluation skews optimistic - Anthropic measured it. Generator-evaluator loops, trace debugging, eval sets from failures, and LLM-as-judge pitfalls.
Ask your agent "did you complete the task correctly?" and the answer is yes. It is always yes. Anthropic measured this while building long-running coding agents: models evaluating their own output skew systematically optimistic, and the fuzzier the criterion - design quality, completeness, "is this actually good" - the worse the skew. The agent isn't lying. It's grading homework it just wrote, with the same brain that wrote it.
Evaluation is the harness component builders skip most, and it's the exact line between "demos well" and "deployed." Here's the toolkit that works: separated evaluators, trace debugging, eval sets grown from real failures, and the judge pitfalls that silently corrupt your numbers.
Why Self-Evaluation Fails (It's Not the Model's Fault)
Three structural reasons, none fixable by prompting harder:
- Confirmation bias, mechanized. The reasoning that produced the answer is sitting right there in context. Asked to verify, the model re-walks its own steps and finds them - surprise - reasonable. Same failure as a developer reviewing their own PR ten seconds after writing it.
- Shared blind spots. Whatever misunderstanding produced the bug also evaluates the bug. If the agent misread the spec, its self-check applies the same misreading and passes.
- "Looks done" vs "is done." Generation optimizes for plausible-looking output. Code that reads as correct and code that runs correctly are different claims, and only the first can be checked by reading.
The fix follows from the diagnosis - and it's a rule older than AI: production and acceptance must be different parties.

Pattern 1: The Generator-Evaluator Split
The architecture Anthropic landed on for autonomous coding runs:
- Generator builds the thing
- Evaluator - a separate agent with fresh context - checks it. It never sees the generator's reasoning, so it can't inherit the blind spots
- Failures route back as concrete feedback; the loop continues until the evaluator passes it
The detail that makes it work: the evaluator operates the output instead of reading it. It opens the app in a browser, clicks the flow, runs the test suite, checks the console - behavioral verification, not code review. "Looks done" can't survive contact with a clicked button that doesn't work.
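A minimal sketch of that loop, to make the separation concrete. Everything here is hypothetical scaffolding rather than Anthropic's implementation: `Verdict`, `build_until_accepted`, and the toy generator and evaluator are illustrative names, and the `max_rounds` cap is an added assumption so a stuck loop ends instead of spinning. The evaluator receives only the task and the artifact, and it runs the artifact rather than reading it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str  # concrete failure description routed back to the generator


def build_until_accepted(
    generate: Callable[[str, str], str],      # (task, feedback) -> artifact
    evaluate: Callable[[str, str], Verdict],  # (task, artifact) -> verdict
    task: str,
    max_rounds: int = 5,                      # assumption: cap the loop
) -> tuple[str, list[Verdict]]:
    """Alternate generator and evaluator until the evaluator passes the artifact."""
    feedback = ""
    history: list[Verdict] = []
    for _ in range(max_rounds):
        artifact = generate(task, feedback)
        # The evaluator gets the task and the artifact only. It never sees the
        # generator's reasoning, so it cannot inherit the same blind spots.
        verdict = evaluate(task, artifact)
        history.append(verdict)
        if verdict.passed:
            return artifact, history
        feedback = verdict.feedback
    raise RuntimeError(f"not accepted after {max_rounds} rounds: {feedback}")


def toy_generate(task: str, feedback: str) -> str:
    # Stand-in for a model call: the first attempt has a bug, and the fix
    # only appears once evaluator feedback arrives.
    if not feedback:
        return "def add(a, b): return a - b"
    return "def add(a, b): return a + b"


def toy_evaluate(task: str, artifact: str) -> Verdict:
    namespace: dict = {}
    exec(artifact, namespace)       # operate the output instead of reading it
    got = namespace["add"](2, 3)
    if got == 5:
        return Verdict(True, "")
    return Verdict(False, f"add(2, 3) returned {got}, expected 5")


artifact, history = build_until_accepted(toy_generate, toy_evaluate, "write add(a, b)")
print(len(history), "rounds;", "first feedback:", history[0].feedback)
```

In a real harness `generate` and `evaluate` would be two separate agent invocations with separate contexts, and the evaluator's "run it" step would be a browser session or a test suite instead of `exec`.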
You can run this pattern today at three budget levels: a sub-agent with a review prompt and read-only tools (cheapest), an Agent Teams reviewer teammate with browser access (the multi-perspective version), or a Stop hook that runs lint+tests before the agent is allowed to finish (deterministic floor - the agent literally can't hand over broken code).
TIP
The Stop hook costs five minutes to set up and changes the default from "claims done" to "verified done." If you do exactly one thing from this article, do that.
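As a rough sketch of what such a hook script could look like: the commands in `GATES` (`ruff`, `pytest`) are assumed placeholders for whatever your project uses, and the exit-code convention (0 allows the stop, 2 blocks it and returns stderr to the agent) is an assumption about the hook contract that you should confirm against your tool's hook documentation before relying on it.

```python
import subprocess
import sys

# Assumed project commands; swap in your own linter and test runner.
GATES = [
    ["ruff", "check", "."],
    ["pytest", "-q"],
]


def run_gates(gates):
    """Run each command and return (all_passed, report_of_failures)."""
    failures = []
    for cmd in gates:
        shown = " ".join(cmd)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            failures.append(f"$ {shown}\ncommand not found")
            continue
        if proc.returncode != 0:
            # Keep only the tail so the feedback stays short enough to act on.
            tail = (proc.stdout + proc.stderr)[-2000:]
            failures.append(f"$ {shown}\n{tail}")
    return (not failures, "\n\n".join(failures))


def main(gates=GATES):
    """Return the process exit code for the hook."""
    ok, report = run_gates(gates)
    if ok:
        return 0  # gates pass: the agent may finish
    print(report, file=sys.stderr)
    return 2      # assumed contract: 2 blocks the stop, stderr goes to the agent

# When installed as the hook command, end the script with: sys.exit(main())
```

The point of the design is that the gate is deterministic: the agent's opinion of its own work never enters the decision, only the exit codes of the lint and test commands.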
Pattern 2: Traces, or Debugging the Middle
A 15-step agent run that produced a wrong answer is not one failure - it's one failure somewhere in 15 steps, and without traces you're guessing which. Outcome-only evaluation tells you that it failed; traces tell you where: the retrieval that fetched the wrong doc at step 3, the tool error silently swallowed at step 7, the goal drift after the context filled at step 12.
Minimum viable tracing: log every step's input, tool calls with arguments, results, and decision - structured (JSONL), replayable, and greppable. The audit-log hook gives you this for Claude Code in one config entry. For your own agents it's an afternoon of plumbing that pays back the first time a run goes sideways. Two metrics worth computing from traces beyond pass/fail: steps-to-completion (rising step counts = degrading efficiency, even while pass rate holds) and cost-per-success (the number that decides if the agent ships).
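For your own agents, the plumbing can be as small as the sketch below. The record fields, `TraceLogger`, `load_runs`, and `trace_metrics` are all hypothetical names, and the metric definitions are one reasonable reading rather than a standard: steps-to-completion is taken as mean steps over successful runs, and cost-per-success as total spend (failed runs included) divided by the number of successes.

```python
import json
import time
from pathlib import Path


class TraceLogger:
    """Append one JSON object per agent step to a JSONL file."""

    def __init__(self, path, run_id):
        self.path = Path(path)
        self.run_id = run_id
        self.step = 0

    def log(self, *, step_input, tool, args, result, decision, cost_usd=0.0):
        self.step += 1
        record = {
            "run_id": self.run_id, "step": self.step, "ts": time.time(),
            "input": step_input, "tool": tool, "args": args,
            "result": result, "decision": decision, "cost_usd": cost_usd,
        }
        # One line per step keeps the file greppable and replayable.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_runs(path):
    """Group trace records by run_id, preserving step order."""
    runs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                runs.setdefault(rec["run_id"], []).append(rec)
    return runs


def trace_metrics(runs, succeeded):
    """Compute the two metrics; `succeeded` is the set of run_ids that passed."""
    wins = [run_id for run_id in runs if run_id in succeeded]
    if not wins:
        return {"steps_to_completion": None, "cost_per_success": None}
    total_cost = sum(rec["cost_usd"] for steps in runs.values() for rec in steps)
    return {
        "steps_to_completion": sum(len(runs[r]) for r in wins) / len(wins),
        "cost_per_success": total_cost / len(wins),
    }
```

Usage is one `log(...)` call per step inside the agent loop; afterwards `grep '"tool": "search"' trace.jsonl` or `trace_metrics(load_runs("trace.jsonl"), succeeded={...})` answers most "where did it go wrong" questions.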
Pattern 3: Eval Sets Grown From Failures
Public benchmarks (SWE-bench, GAIA, τ²-bench) tell you which model to pick. They tell you nothing about whether your agent handles your tasks - your data shapes, your edge cases, your users' weird phrasings. For that you need your own eval set, and the cheapest way to build one is to never waste a failure:
- Start embarrassingly small - 20 real tasks with verifiable expected outcomes
- Every production failure becomes a new eval case (the agent equivalent of regression tests)
- Run the set on every prompt change, model swap, or harness tweak
- Track pass rate, steps, cost over time
This is eval-driven development, and it converts agent work from vibes ("feels better after the prompt change?") into engineering ("pass rate went 71% → 84%, cost per success down 12%"). LangChain's Terminal Bench climb - top 30 to top 5 without touching the model - was exactly this loop applied relentlessly to the harness.
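A sketch of the smallest harness that supports this loop, under stated assumptions: `EvalCase`, `RunResult`, and `run_eval` are hypothetical names, the agent is any callable that takes a task string and reports its output, step count, and cost, and each case carries a programmatic `check` on the outcome.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    task: str
    check: Callable[[str], bool]  # verifies the outcome, not the wording
    source: str = "seed"          # "seed" or "production-failure"


@dataclass
class RunResult:
    output: str
    steps: int
    cost_usd: float


def run_eval(cases, agent):
    """Run every case through the agent and summarize pass rate, steps, cost."""
    if not cases:
        raise ValueError("eval set is empty")
    rows = []
    for case in cases:
        result = agent(case.task)
        rows.append({
            "case_id": case.case_id,
            "passed": bool(case.check(result.output)),
            "steps": result.steps,
            "cost_usd": result.cost_usd,
        })
    passed = sum(r["passed"] for r in rows)
    total_cost = sum(r["cost_usd"] for r in rows)
    return {
        "pass_rate": passed / len(rows),
        "mean_steps": sum(r["steps"] for r in rows) / len(rows),
        # Failed runs still cost money, so they count against each success.
        "cost_per_success": total_cost / passed if passed else None,
        "failed": [r["case_id"] for r in rows if not r["passed"]],
    }


# Toy agent standing in for a real one: a lookup with one wrong answer.
_ANSWERS = {"2+2": "4", "10/4": "2"}

def toy_agent(task):
    return RunResult(output=_ANSWERS[task], steps=3, cost_usd=0.01)

cases = [
    EvalCase("c1", "2+2", lambda out: out == "4"),
    # Added after a hypothetical production failure: integer division crept in.
    EvalCase("c2", "10/4", lambda out: out == "2.5", source="production-failure"),
]
print(run_eval(cases, toy_agent))
```

Store the summary alongside the commit or prompt version that produced it, and the "did the change help" question becomes a diff between two rows.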
Pattern 4: LLM-as-Judge, With Its Three Corruptions
Many criteria can't be asserted in code - "is this summary faithful," "is the tone right." An LLM judging outputs against a rubric scales where human review doesn't. It works, if you dodge three documented biases:
| Bias | What happens | Mitigation |
|---|---|---|
| Position | Pairwise comparisons favor an option because of its slot, often the first | Judge both orders, average |
| Verbosity | Longer outputs score higher at equal quality | Length-cap or explicitly instruct against it |
| Self-preference | Models rate their own family's prose higher | Judge from a different provider than the generator |
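The position mitigation in the table is mechanical enough to sketch. Assumptions: `judge` is a hypothetical callable wrapping your model call that returns a score in [0, 1] for "the first-listed answer is better", and `biased_judge` is a toy stand-in with a deliberate first-slot bonus so the effect is visible.

```python
from typing import Callable

# judge(prompt, first, second) -> score in [0, 1] that the FIRST-listed
# answer is better. In practice this wraps a model call with a rubric.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_shown_first = judge(prompt, a, b)   # score that a wins when a is first
    b_shown_first = judge(prompt, b, a)   # score that b wins when b is first
    return (a_shown_first + (1.0 - b_shown_first)) / 2


# Toy judge: knows the true quality but adds a flat bonus to the first slot.
_QUALITY = {"good answer": 1.0, "bad answer": 0.0, "same A": 0.5, "same B": 0.5}

def biased_judge(prompt: str, first: str, second: str) -> float:
    score = 0.5 + 0.3 * (_QUALITY[first] - _QUALITY[second]) + 0.2
    return min(1.0, max(0.0, score))


# Equal-quality answers: a single-order call is skewed toward the first slot,
# while the two-order average cancels the bonus.
single = biased_judge("q", "same A", "same B")
both = debiased_preference(biased_judge, "q", "same A", "same B")
print(single, both)
```

Averaging only cancels a bias that is symmetric across the two orders. When the two single-order verdicts disagree outright, treating the pair as a tie or sending it to human review is a reasonable extra rule.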
Two rules regardless: rubrics must be specific ("does it cite a source for every numeric claim - yes/no per claim", not "rate quality 1-10"), and a 5-10% human spot-check sample stays forever. The judge is a scaling tool, not an unsupervised authority - calibrate it against humans before trusting it, and re-calibrate when you change the rubric.
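One way to keep the spot-check honest is to make it boring: a reproducible random sample for humans to label, and a plain agreement rate between judge and human labels. The function names and the 5% default below are illustrative assumptions, and raw agreement is the simplest possible calibration number, not the only one worth tracking.

```python
import random


def spot_check_sample(item_ids, rate=0.05, seed=0):
    """Pick a reproducible random subset of judged items for human review."""
    items = list(item_ids)
    if not items:
        return []
    rng = random.Random(seed)                 # fixed seed: same sample on rerun
    k = min(len(items), max(1, round(len(items) * rate)))
    return sorted(rng.sample(items, k))


def agreement(judge_labels, human_labels):
    """Fraction of items labeled by both where judge and human agree."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(judge_labels[i] == human_labels[i] for i in shared)
    return matches / len(shared)


to_review = spot_check_sample(range(200), rate=0.05)
print(len(to_review), "items go to human review")
```

Recompute `agreement` whenever the rubric changes; a drop is the signal that the judge needs re-calibrating before its scores are trusted again.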
The Maturity Ladder
Where most builders are vs where production agents need to be:
- Level 0 - vibes. Run it, eyeball it, ship it. Every demo you've seen.
- Level 1 - deterministic gates. Tests + lint via hooks. An afternoon. Catches the embarrassing class.
- Level 2 - separated evaluator. Fresh-context agent operates the output. Catches "looks done."
- Level 3 - eval set + traces. Regression suite from real failures, every run debuggable. Changes become measurable.
- Level 4 - continuous. Production sampling scored by calibrated judges, alerts on drift, cost-per-success on a dashboard.
Each level catches what the previous one structurally cannot. Most builders sit at 0; level 1 is one hook away; level 3 is where "we think it works" becomes "we know what changed." And the cultural shift underneath is the same one harness engineering keeps teaching: stop asking the model to be more trustworthy, and build the system that doesn't need to trust it.