Skip to content

2026

Golden Datasets Are Dead

Golden Dataset Header

There's an instinct when you start building agent evals to replicate what the big benchmarks do. You see TerminalBench or SWE-bench or whatever, and there's this nice hill to climb. Model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.

It doesn't work. Here's why.