The Meta-Evaluator: Your Coding Agent as an Eval Layer

I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that completely changed how I think about the evaluation layer.
What if your coding agent is the evaluation layer?
Let me explain.
The Traditional Stack
If you're building an AI product, your stack probably looks something like this:
```mermaid
flowchart TD
    A[You - the developer] --> B[Evals Layer]
    B --> C[Production Agent]
    B -.- D["datasets, judges, metrics"]
```
You write evals. You run them. You look at the results. You tweak the prompt. You run evals again. Repeat until satisfied. This is the right way to do it, and I've written about it before.
But there's a layer missing from this diagram. What's actually doing the work of changing prompts and re-running evals?
You are. Manually.
The Missing Layer
Here's what the stack actually looks like for most of us:
```mermaid
flowchart TD
    A[You - the developer] --> B[Coding Agent]
    B --> C[Evals Layer]
    C --> D[Production Agent]
    B -.- E["Claude Code, Cursor, Codex, etc."]
```
You're already using a coding agent to write your code. You're probably using it to write your evals too. But you're treating it as a dumb tool - you tell it what to change, it changes it, you run evals, you interpret results, you tell it what to change next.
What if you collapsed that loop?
What if your coding agent could run your production agent, evaluate the outputs, change the prompts, and iterate - all autonomously?
That's the meta-evaluator pattern.
A Real Example: The Emoji Problem
Let me walk through a real scenario. Say you have an AI product in production, and users are complaining that the agent uses too many emojis. Classic problem.
How do you solve this with traditional evals?
Option 1: Programmatic eval
Count emojis per response. But this is too blunt - if a user sends you a bunch of emojis, maybe your agent should mirror that energy. A hard threshold doesn't capture "uses emojis inappropriately."
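Option 1 in code might be nothing more than this. It's a minimal sketch: the regex ranges are a rough approximation of "emoji" (the full Unicode definition is much messier), and the threshold of 2 is arbitrary - exactly the bluntness being criticized.

```python
import re

# Rough emoji ranges - illustrative, not an exhaustive Unicode emoji definition
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def count_emojis(text: str) -> int:
    """Count characters that fall in the (approximate) emoji ranges."""
    return len(EMOJI_RE.findall(text))

def too_many_emojis(response: str, threshold: int = 2) -> bool:
    # A hard cutoff with zero context: if the user's own message was full
    # of emojis, mirroring them might be exactly right - this can't tell.
    return count_emojis(response) > threshold
```

The check runs in microseconds and never flakes, but it answers "how many emojis?" when the real question is "were these emojis appropriate?".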
Option 2: LLM-as-a-judge
Write a judge that scores emoji appropriateness. But now you need to:
- Create a dataset of queries that trigger emoji usage
- Write the judge prompt
- Align the judge (make sure it scores things the way you would)
- Run the eval suite
- Interpret the results
- Change the prompt
- Run the suite again
- Repeat until satisfied
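The judge half of that checklist might look like the sketch below. `call_llm` is a stand-in for whatever client you actually use (OpenAI, Anthropic, etc.), and the prompt is deliberately skeletal - a real judge prompt needs examples and alignment work, which is the point.

```python
# Hypothetical LLM-as-a-judge sketch. `call_llm` is an assumed stand-in
# for your LLM client, not a real library API.
JUDGE_PROMPT = """You are judging emoji usage in an assistant reply.
Score 1-5 for how appropriate the emoji usage is given the user's message.
User message: {query}
Assistant reply: {response}
Answer with only the number."""

def judge_emoji_usage(call_llm, query: str, response: str) -> int:
    """Score one (query, response) pair with the judge prompt."""
    raw = call_llm(JUDGE_PROMPT.format(query=query, response=response))
    return int(raw.strip())

def run_suite(call_llm, dataset) -> float:
    """Average judge score over a hand-curated list of (query, response) pairs."""
    scores = [judge_emoji_usage(call_llm, q, r) for q, r in dataset]
    return sum(scores) / len(scores)
```

Every piece here - the dataset, the prompt, the alignment between your taste and the judge's scores - is infrastructure you have to build and maintain before you've fixed a single emoji.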
That's a lot of work for what started as "too many emojis."
Option 3: The meta-evaluator
Here's what I do now. I tell my coding agent:
"The production agent is using too many emojis. Here are 5 queries where I didn't like the emoji usage: [queries]. Fix it."
That's it. Here's what happens next:
- The agent runs those queries against the production agent and sees the outputs
- It evaluates the outputs itself - it can see the emojis, it understands the context, it knows what "too many" means in this situation
- It changes the prompt - maybe adds "Use emojis sparingly" or restructures the tone section
- It runs the same queries again and compares the results
- It iterates until the outputs look right
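The loop above, sketched as plain Python. In practice the coding agent *is* `looks_fixed` and `revise_prompt` - these stubs just make the control flow explicit, and every name here is illustrative:

```python
def meta_evaluate(prompt, queries, run_agent, looks_fixed, revise_prompt,
                  max_rounds: int = 5) -> str:
    """Iterate on a prompt until the observed outputs satisfy the judge.

    run_agent(prompt, query) -> output  : invokes the production agent
    looks_fixed(query, output) -> bool  : the coding agent's own judgment
    revise_prompt(prompt, queries, outputs) -> str : the next prompt attempt
    """
    for _ in range(max_rounds):
        outputs = [run_agent(prompt, q) for q in queries]
        # Comparative judgment: is the problem solved for every query?
        if all(looks_fixed(q, out) for q, out in zip(queries, outputs)):
            return prompt
        # Otherwise revise the prompt using what was just observed
        prompt = revise_prompt(prompt, queries, outputs)
    return prompt  # best effort after max_rounds
```

The interesting part is what's *not* in the code: there's no dataset file, no judge prompt, no metrics dashboard. The judgment lives in the agent running the loop.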
The coding agent is the judge. It's not scoring in a vacuum - it's comparing before and after, understanding the context of the task, and deciding if the problem is actually solved.
Why This Works
Traditional LLM-as-a-judge evaluates outputs in isolation. The meta-evaluator evaluates in context - it knows what the original problem was, what it tried, and whether the fix actually worked. It's comparative evaluation without the formal pairwise setup.
Beyond Ad-Hoc Fixes
The emoji example is simple, but this pattern scales to complex problems. Here's another real scenario:
The Problem: Our agent was sometimes including external links when it should only provide internal documentation links.
The Challenge: I didn't know which queries triggered external links. I just knew it happened sometimes in production.
The Meta-Evaluator Approach:
I gave my coding agent the task: "The agent sometimes provides external links when it should only provide internal documentation. Fix it."
Here's what happened:
- Discovery: The agent queried the knowledge base directly, looking for content that referenced external URLs
- Hypothesis generation: Based on what it found, it generated queries that should trigger external link responses
- Validation: It ran those queries against the production agent until it actually reproduced the issue
- Dataset creation: It saved the queries that triggered external links as a test dataset
- Evaluator creation: It wrote a quick check for external URLs in responses
- Iteration: It modified the prompt, re-ran the dataset, and iterated until external links stopped appearing
- Shipping: Once fixed, it committed the changes
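The "evaluator creation" step can be as small as the check below - assuming internal docs live under a single domain (`docs.example.com` here is made up for illustration):

```python
import re

# Hypothetical internal docs domain - substitute your own
INTERNAL_DOMAIN = "docs.example.com"

# Capture the host portion of any http(s) URL in a response
URL_RE = re.compile(r"https?://([^/\s)]+)")

def has_external_links(response: str) -> bool:
    """Flag any URL whose host is not the internal documentation domain."""
    return any(host != INTERNAL_DOMAIN for host in URL_RE.findall(response))
```

A throwaway check like this would never justify a formal eval suite on its own, but as a tool the agent writes for itself mid-task, it's exactly enough.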
The agent discovered the edge cases, built its own dataset, wrote its own evaluator, and solved the problem. End to end.
Why This Is Different
Let me be clear about what makes this different from just "using a coding agent":
1. Comparative Evaluation
Traditional evals judge outputs in isolation. Is this response good? Score: 7/10.
The meta-evaluator judges comparatively. Did this change make things better? Is the problem actually solved? This is closer to how humans evaluate - we compare before and after, not just rate things in a vacuum.
2. Context-Aware Judging
When you write an LLM-as-a-judge, it only knows what you put in the prompt. It's in a vacuum.
The meta-evaluator has full context. It knows the original complaint, the queries that triggered it, what changes were tried, and how the outputs evolved. It can make nuanced judgments that would be impossible to capture in a judge prompt.
3. Ad-Hoc by Default
You don't need to formalize everything upfront. No dataset creation, no judge prompt writing, no eval framework setup. Just describe the problem and let the agent figure out how to test and fix it.
This is huge for the long tail of issues that aren't worth building formal evals for. Not every bug deserves a 100-row dataset.
4. Judge Alignment Built In
One of the hardest parts of LLM-as-a-judge is alignment - making sure the judge scores things the way you would.
With the meta-evaluator, alignment happens naturally. The agent is aligned to your task because it understands the full context. If you later want to formalize an eval, the agent can generate the dataset and check if a standalone judge agrees with its own assessments. If the judge disagrees, it can refine the judge prompt until they align.
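That "check if a standalone judge agrees" step reduces to a simple agreement rate over the same dataset. A sketch, with illustrative names:

```python
def agreement_rate(agent_labels, judge_labels) -> float:
    """Fraction of examples where a standalone judge matches the
    meta-evaluator's own pass/fail labels over the same dataset."""
    if len(agent_labels) != len(judge_labels):
        raise ValueError("label lists must cover the same dataset")
    matches = sum(a == j for a, j in zip(agent_labels, judge_labels))
    return matches / len(agent_labels)
```

If the rate is low, the agent refines the judge prompt and re-checks - the same alignment loop you'd do by hand, just automated.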
5. Prompt Automation
This is where it connects to something I'm very bullish on: prompt automation.
The meta-evaluator is doing automated prompt optimization. It tries a change, evaluates results, tries another change, evaluates again. It's the same loop that tools like DSPy and GEPA do, but with a context-aware judge and the ability to discover test cases on its own.
The Architecture
In the new stack, the coding agent sits above the evals layer. It can use formal evals when they exist, but it can also create ad-hoc evaluations on the fly. It's the orchestration layer that was always there - we just weren't using it.
Making It Work
To use this pattern, your coding agent needs a few capabilities:
1. Ability to invoke the production agent
Your agent needs to be able to run queries against your production agent and see the outputs. This might mean:
- A CLI command that runs a query
- An API endpoint it can hit
- Direct access to run the agent code
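The CLI option can be a few lines. Here `answer_query` is a stand-in for your production agent's real entry point:

```python
import sys

def answer_query(query: str) -> str:
    # Placeholder for your production agent's real entry point -
    # whatever function actually produces a response to a user query.
    return f"(agent reply to: {query})"

def main() -> None:
    # Usage: python run_agent.py "your query here"
    query = " ".join(sys.argv[1:])
    print(answer_query(query))

if __name__ == "__main__":
    main()
```

The exact mechanism doesn't matter; what matters is that the coding agent can run a query and read the output without you in the loop.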
2. Ability to modify prompts
The agent needs to be able to change system prompts. If your prompts live in the codebase, this works out of the box - no special setup needed.
3. A clear problem statement
The more specific you are about the problem, the better. "Too many emojis" is good. "Here are 5 queries where I didn't like the emoji usage" is better. But even vague problems work - the agent will explore until it finds concrete failure cases.
When to Use This
The meta-evaluator pattern shines for:
- Subjective issues: Things that are hard to capture in a programmatic eval
- Ad-hoc fixes: Problems that aren't worth building a formal eval suite for
- Discovery: When you know something's wrong but don't know exactly what triggers it
- Rapid iteration: When you want to try many prompt variations quickly
It's not a replacement for formal evals. If you have a critical behavior that needs to be tracked over time, build a proper eval. But for the day-to-day work of improving your agent, the meta-evaluator pattern is faster and often more effective.
The Bigger Picture
Here's what excites me about this: we're collapsing layers.
First, coding agents collapsed the gap between "I want this code" and "the code exists." You describe what you want, the agent writes it.
Now, with the meta-evaluator pattern, we're collapsing the gap between "I have this problem" and "the problem is solved." You describe the issue, the agent discovers the failure cases, iterates on fixes, and ships the solution.
The developer's job shifts from "do the work" to "describe the work." And describing work is something humans are pretty good at.
Related posts:
- My Bull Case for Prompt Automation - Why I'm bullish on automated prompt optimization
- AI Agent Testing: Stop Caveman Testing and Use Evals - The case for formal evals (which still matter!)
- Complex AI Agents - On the temporary nature of complex architectures