
Evals 102: Let Your Agent Do the Hard Parts


I gave a follow-up talk at the LangChain meetup on more advanced eval patterns. This post covers the main ideas.

A few months ago I gave a talk on the basics of agent evaluation. The gist: stop copy-pasting test prompts like a caveman. Build datasets. Run evaluators. Measure things. That talk covered tests versus evals, black-boxing your agent's output, and a pattern I call "stress testing": find an issue, build a small dataset around it, iterate until it's fixed.
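The stress-testing pattern can be sketched in a few lines. This is a minimal, hypothetical sketch: `run_agent` and `grade` are stand-ins for your actual agent call and evaluator, and the dataset format is an assumption, not a prescribed schema.

```python
def run_agent(prompt: str) -> str:
    # Placeholder: a real agent call (API client, chain, etc.) goes here.
    # This toy agent just uppercases its input.
    return prompt.upper()

def grade(output: str, expected: str) -> bool:
    # Placeholder evaluator: exact match.
    # In practice, swap in an LLM judge or a rubric-based check.
    return output == expected

def stress_test(dataset: list[dict]) -> float:
    """Run the agent over a small, issue-focused dataset; return the pass rate."""
    passed = sum(
        grade(run_agent(case["prompt"]), case["expected"])
        for case in dataset
    )
    return passed / len(dataset)

# A tiny dataset built around one observed failure mode.
dataset = [
    {"prompt": "hello", "expected": "HELLO"},
    {"prompt": "world", "expected": "WORLD"},
]
print(stress_test(dataset))
```

The loop is: find a failure, capture variations of it as dataset cases, and rerun `stress_test` after each prompt or tool change until the pass rate holds at 1.0.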

This is the sequel. I'm assuming you're using coding agents (Claude Code, Cursor, Codex, whatever). If you're not, the ideas still apply, but the real unlock is what happens when your coding agent becomes part of the eval process itself.

Golden Datasets Are Dead


There's an instinct when you start building agent evals to replicate what the big benchmarks do. You see TerminalBench or SWE-bench or whatever, and there's this nice hill to climb. Model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.

It doesn't work. Here's why.

The Meta-Evaluator: Your Coding Agent as an Eval Layer


I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that completely changed how I think about the evaluation layer.

What if your coding agent is the evaluation layer?

Let me explain.