

Golden Datasets Are Dead


There's an instinct, when you start building agent evals, to replicate what the big benchmarks do. You see Terminal-Bench or SWE-bench, and there's this nice hill to climb: model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.

It doesn't work. Here's why.

The Meta-Evaluator: Your Coding Agent as an Eval Layer


I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that changed how I think about the evaluation layer entirely.

What if your coding agent is the evaluation layer?

Let me explain.