
Golden Datasets Are Dead


There's an instinct when you start building agent evals to replicate what the big benchmarks do. You see TerminalBench or SWE-bench or whatever, and there's this nice hill to climb. Model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.

It doesn't work. Here's why.

Ground truths don't stay true

Enterprise data moves. If you're in a technology-forward, regulation-forward, or marketing-forward org, your knowledge base is a living thing. Questions that were valid in January become irrelevant by June. Answers that were correct get updated when someone learns something new. Policies change. Products change. The world changes.

And even if nothing changes, getting ground truths in the first place is brutal. Ask one SME and you get one answer. Ask another and you get a different one. Loop in your boss for review and now you have three opinions. Building a dataset good enough to last a year is not just hard. It's probably impossible.

You'll saturate in weeks, not months

Here's the thing about model benchmarks: the benchmark is static, and the only variable is the model. That's why progress is slow and measurable.

Agent evals aren't like that. You can change the model. You can change the harness. You can change the prompt. You can change all three at once. And you should. If your agent is scoring 10% and you're not hitting 80% within a few months, you basically didn't start. The iteration speed is just too fast now. You throw a better model at it, you fix obvious bugs, you tighten the prompt, and suddenly most of the solvable cases are solved.

Saturation isn't a failure. It's the expected outcome of competent iteration. The problem is what comes next.

The exponential grind

Say you saturate V1 of your benchmark. Time to build V2. Your QA team spends two months scouring traces, cataloging failures, assembling a hundred fresh edge cases. You iterate on those. Two more months, you saturate again.

Now build V3.

This is where it gets painful. Your agent is performing well. The obvious failures are gone. Your QA team is now looking through data with a 95% pass rate trying to find the 5% that breaks. That's not a two-month task anymore. That's an exponentially longer grind. And the cruel part? Once they finally scrape together enough cases, you'll saturate V3 in a fraction of the time it took them to build it. Models improved while they were searching. Your harness got better. The dataset is half-stale before it ships.

Think about how accustomed we've become to current frontier models. I use AI constantly, and I genuinely cannot think of a single query I'm confident would stump them. Your QA team is fighting the same battle. They're going to get fatigued. They're going to struggle. And the datasets will keep getting smaller and harder to assemble.

The adversarial trap

When you can't find real failures, the temptation is to manufacture hard ones. Make up weird edge cases. Combine unrelated constraints. "Compile all our policies into Chinese but replace the first letter of each answer with Z." Something that actually breaks the agent.

Don't do this.

If you solve for queries your customers aren't asking, you're optimizing for a distribution that doesn't exist. You'll blow out latency, add complexity, and probably make the agent worse at the things people actually need. Adversarial datasets feel productive but they're a trap. You're building for imaginary users.

What to do instead

The better approach is feature-level evals with regression sweeps.

flowchart TD
    A[New Feature] --> B[Build Dataset]
    B --> C[Iterate to Saturation]
    C --> D[Ship]
    D --> E[Change Agent]
    E --> F[Regression Sweep]
    F -->|Expect ~100%| E

Before you build a feature, design the eval. Want your agent to handle shore excursions? Write a bunch of shore excursion queries, define what good looks like, and build the feature. Expect to hit saturation before you ship. If you're any good, you will.
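
Concretely, a per-feature eval can be as small as a scoped list of cases plus a pass-rate function. Here's a minimal sketch in Python; the queries, rubrics, and the agent/grader callables are all placeholders standing in for your own harness and judge, not anything prescribed above.

    from typing import Callable

    # Illustrative cases for one feature (shore excursions). In practice these
    # come from real customer queries and traces, not invented ones.
    CASES = [
        {"query": "What shore excursions are available in Cozumel?",
         "rubric": "Lists currently offered Cozumel excursions"},
        {"query": "Can I cancel an excursion the day before we dock?",
         "rubric": "States the cancellation cutoff from the live policy"},
        # ...a few dozen more
    ]

    def run_feature_eval(agent: Callable[[str], str],
                         grader: Callable[[str, str], bool]) -> float:
        """Return the agent's pass rate over this feature's cases."""
        passed = sum(grader(agent(c["query"]), c["rubric"]) for c in CASES)
        return passed / len(CASES)

The plumbing doesn't matter much. What matters is that the dataset is scoped to one feature and expected to hit saturation before the feature ships.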

Once that feature is live, don't keep making harder and harder shore excursion datasets. Just treat it as a regression sweep. You're not climbing a hill anymore. You're maintaining a quality bar. If you swap models or change the architecture, you expect the same score (or better), not progress toward some distant target.
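
The regression side can reuse those same feature evals. A sketch, assuming each feature exposes a function like run_feature_eval above and you recorded the score it held at saturation; the feature names, baseline numbers, and the small tolerance for grader noise are all illustrative choices, not recommendations.

    from typing import Callable, Dict

    def regression_sweep(agent: Callable[[str], str],
                         grader: Callable[[str, str], bool],
                         evals: Dict[str, Callable],
                         baselines: Dict[str, float],
                         tolerance: float = 0.02) -> Dict[str, float]:
        """Rerun every feature eval after an agent change; return any that regressed."""
        failures = {}
        for feature, eval_fn in evals.items():
            score = eval_fn(agent, grader)  # e.g. run_feature_eval
            if score < baselines[feature] - tolerance:
                failures[feature] = score
        return failures  # empty dict means the quality bar held

    # Example: after swapping models,
    # regression_sweep(new_agent, judge,
    #                  evals={"shore_excursions": run_feature_eval},
    #                  baselines={"shore_excursions": 0.97})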

This reframes the whole relationship with evals. They're not annual benchmarks to climb. They're per-feature contracts to maintain. Build, saturate, ship, regress. Repeat for the next feature.

The era of golden datasets is over. The models move too fast, the data decays too quickly, and the edge cases get exponentially harder to find. Stop building monuments. Build checkpoints.
