Evals 102: Let Your Agent Do the Hard Parts

I gave a follow-up talk at the LangChain meetup on more advanced eval patterns. This post covers the main ideas.

A few months ago I gave a talk on the basics of agent evaluation. The gist: stop copy-pasting test prompts like a caveman. Build datasets. Run evaluators. Measure things. That talk covered tests versus evals, black-boxing your agent's output, and a pattern I call "stress testing": find an issue, build a small dataset around it, and iterate until it's fixed.
This is the sequel. I'm assuming you're using coding agents (Claude Code, Cursor, Codex, whatever). If you're not, the ideas still apply, but the real unlock is what happens when your coding agent becomes part of the eval process itself.
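To ground the recap, here's a minimal sketch of the stress-testing loop from the first talk: collect failing cases into a small dataset, score each one with an evaluator, and iterate on the agent until the pass rate recovers. Everything here is hypothetical (`run_agent`, `evaluate`, the `CASES` data) — a stand-in for your real agent and checks, not any particular framework's API.

```python
# Hypothetical stress-test dataset: one entry per failure you found in the wild.
CASES = [
    {"prompt": "Summarize the refund policy", "must_include": "30 days"},
    {"prompt": "Which plans support SSO?", "must_include": "Enterprise"},
]

def run_agent(prompt: str) -> str:
    # Stand-in for your real agent call (LLM, tools, the works).
    canned = {
        "Summarize the refund policy": "Refunds are accepted within 30 days.",
        "Which plans support SSO?": "SSO is available on the Enterprise plan.",
    }
    return canned[prompt]

def evaluate(cases: list[dict]) -> float:
    # Simplest possible evaluator: does the output contain the expected fact?
    passed = sum(1 for c in cases if c["must_include"] in run_agent(c["prompt"]))
    return passed / len(cases)

print(f"pass rate: {evaluate(CASES):.0%}")
```

In practice the evaluator is usually richer than a substring check (an LLM judge, a structured rubric), but the shape stays the same: dataset in, score out, fix the agent, re-run.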

