

The Meta-Evaluator: Your Coding Agent as an Eval Layer


I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that completely changed how I think about the evaluation layer.

What if your coding agent is the evaluation layer?

Let me explain.

The Traditional Stack

If you're building an AI product, your stack probably looks something like this:

```mermaid
flowchart TD
    A[You - the developer] --> B[Evals Layer]
    B --> C[Production Agent]

    B -.- D["datasets, judges, metrics"]
    D -.- B
```

You write evals. You run them. You look at the results. You tweak the prompt. You run evals again. Repeat until satisfied. This is the right way to do it, and I've written about it before.

But there's a layer missing from this diagram. What's actually doing the work of changing prompts and re-running evals?

You are. Manually.

The Missing Layer

Here's what the stack actually looks like for most of us:

```mermaid
flowchart TD
    A[You - the developer] --> B[Coding Agent]
    B --> C[Evals Layer]
    C --> D[Production Agent]

    B -.- E["Claude Code, Cursor, Codex, etc."]
    E -.- B
```

You're already using a coding agent to write your code. You're probably using it to write your evals too. But you're treating it as a dumb tool - you tell it what to change, it changes it, you run evals, you interpret results, you tell it what to change next.

What if you collapsed that loop?

What if your coding agent could run your production agent, evaluate the outputs, change the prompts, and iterate - all autonomously?

That's the meta-evaluator pattern.

A Real Example: The Emoji Problem

Let me walk through a real scenario. Say you have an AI product in production, and users are complaining that the agent uses too many emojis. Classic problem.

How do you solve this with traditional evals?

Option 1: Programmatic eval

Count emojis per response. But this is too blunt - if a user sends you a bunch of emojis, maybe your agent should mirror that energy. A hard threshold doesn't capture "uses emojis inappropriately."
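To make the bluntness concrete, here's roughly what that programmatic eval looks like. The regex and the threshold are illustrative, not exhaustive:

```python
import re

# Rough emoji matcher over the main emoji code-point ranges
# (illustrative, not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def too_many_emojis(response: str, threshold: int = 3) -> bool:
    """The blunt version: fail any response over a hard emoji count."""
    return len(EMOJI_RE.findall(response)) > threshold
```

The problem is baked into the signature: a fixed `threshold` can't know that four emojis are fine in a playful thread and one is too many in an incident report.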

Option 2: LLM-as-a-judge

Write a judge that scores emoji appropriateness. But now you need to:

  1. Create a dataset of queries that trigger emoji usage
  2. Write the judge prompt
  3. Align the judge (make sure it scores things the way you would)
  4. Run the eval suite
  5. Interpret the results
  6. Change the prompt
  7. Run the suite again
  8. Repeat until satisfied

That's a lot of work for what started as "too many emojis."
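For concreteness, steps 2 and 4 of that list might look like this sketch, where `call_llm` is a stand-in for whatever model client you use, not a real API:

```python
# Hypothetical judge prompt (steps 2 and 4 of the list above).
# `call_llm` is a placeholder for your model client, not a real API.
JUDGE_PROMPT = """You are grading a chat response for emoji appropriateness.

User query:
{query}

Agent response:
{response}

Given the tone of the query, does the response use emojis appropriately?
Answer with exactly one word: PASS or FAIL."""

def judge(query: str, response: str, call_llm) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(query=query, response=response))
    return verdict.strip().upper() == "PASS"
```

And this is only steps 2 and 4; the dataset, the alignment pass, and the iteration loop are all still on you.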

Option 3: The meta-evaluator

Here's what I do now. I tell my coding agent:

"The production agent is using too many emojis. Here are 5 queries where I didn't like the emoji usage: [queries]. Fix it."

That's it. Here's what happens next:

  1. The agent runs those queries against the production agent and sees the outputs
  2. It evaluates the outputs itself - it can see the emojis, it understands the context, it knows what "too many" means in this situation
  3. It changes the prompt - maybe adds "Use emojis sparingly" or restructures the tone section
  4. It runs the same queries again and compares the results
  5. It iterates until the outputs look right

The coding agent is the judge. It's not scoring in a vacuum - it's comparing before and after, understanding the context of the task, and deciding if the problem is actually solved.
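The whole loop fits in a few lines if you squint. Every callable below is a hypothetical hook standing in for something the coding agent does itself:

```python
def meta_eval_loop(queries, run_agent, fix_is_good, edit_prompt, max_iters=5):
    """Sketch of the meta-evaluator loop. `run_agent`, `fix_is_good`, and
    `edit_prompt` are hypothetical hooks: in practice the coding agent
    plays all three roles itself."""
    baseline = {q: run_agent(q) for q in queries}      # step 1: see the outputs
    for _ in range(max_iters):
        edit_prompt()                                  # step 3: change the prompt
        outputs = {q: run_agent(q) for q in queries}   # step 4: re-run
        # steps 2 and 5: comparative judgment - better than before, and solved?
        if all(fix_is_good(q, baseline[q], outputs[q]) for q in queries):
            return outputs
    return None  # gave up after max_iters
```

The key detail is that `fix_is_good` sees both the baseline and the new output, which is exactly the before/after comparison described above.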

Why This Works

Traditional LLM-as-a-judge evaluates outputs in isolation. The meta-evaluator evaluates in context - it knows what the original problem was, what it tried, and whether the fix actually worked. It's comparative evaluation without the formal pairwise setup.

Beyond Ad-Hoc Fixes

The emoji example is simple, but this pattern scales to complex problems. Here's another real scenario:

The Problem: Our agent was sometimes including external links when it should only provide internal documentation links.

The Challenge: I didn't know which queries triggered external links. I just knew it happened sometimes in production.

The Meta-Evaluator Approach:

I gave my coding agent the task: "The agent sometimes provides external links when it should only provide internal documentation. Fix it."

Here's what happened:

  1. Discovery: The agent queried the knowledge base directly, looking for content that referenced external URLs
  2. Hypothesis generation: Based on what it found, it generated queries that should trigger external link responses
  3. Validation: It ran those queries against the production agent until it actually reproduced the issue
  4. Dataset creation: It saved the queries that triggered external links as a test dataset
  5. Evaluator creation: It wrote a quick check for external URLs in responses
  6. Iteration: It modified the prompt, re-ran the dataset, and iterated until external links stopped appearing
  7. Shipping: Once fixed, it committed the changes

The agent discovered the edge cases, built its own dataset, wrote its own evaluator, and solved the problem. End to end.
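The evaluator from step 5 can be genuinely tiny. Something like this, where the internal domain is of course a stand-in:

```python
import re
from urllib.parse import urlparse

INTERNAL_DOMAINS = {"docs.example.com"}  # hypothetical internal docs host
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def external_links(response: str) -> list:
    """Step 5's quick check: flag any URL outside the internal docs domain."""
    return [
        url for url in URL_RE.findall(response)
        if urlparse(url).hostname not in INTERNAL_DOMAINS
    ]
```

The hard part was never this check; it was steps 1-4, finding the queries that trigger the behavior in the first place.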

Why This Is Different

Let me be clear about what makes this different from just "using a coding agent":

1. Comparative Evaluation

Traditional evals judge outputs in isolation. Is this response good? Score: 7/10.

The meta-evaluator judges comparatively. Did this change make things better? Is the problem actually solved? This is closer to how humans evaluate - we compare before and after, not just rate things in a vacuum.

2. Context-Aware Judging

When you write an LLM-as-a-judge, it only knows what you put in the prompt. It's in a vacuum.

The meta-evaluator has full context. It knows the original complaint, the queries that triggered it, what changes were tried, and how the outputs evolved. It can make nuanced judgments that would be impossible to capture in a judge prompt.

3. Ad-Hoc by Default

You don't need to formalize everything upfront. No dataset creation, no judge prompt writing, no eval framework setup. Just describe the problem and let the agent figure out how to test and fix it.

This is huge for the long tail of issues that aren't worth building formal evals for. Not every bug deserves a 100-row dataset.

4. Judge Alignment Built In

One of the hardest parts of LLM-as-a-judge is alignment - making sure the judge scores things the way you would.

With the meta-evaluator, alignment happens naturally. The agent is aligned to your task because it understands the full context. If you later want to formalize an eval, the agent can generate the dataset and check if a standalone judge agrees with its own assessments. If the judge disagrees, it can refine the judge prompt until they align.
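That agreement check is itself simple to formalize: replay the agent's own verdicts against the standalone judge and measure how often they match. All names here are illustrative:

```python
def judge_agreement(examples, meta_verdicts, standalone_judge) -> float:
    """Fraction of examples where a standalone judge agrees with the
    meta-evaluator's own verdicts. If this is well below 1.0, refine
    the judge prompt and re-measure."""
    matches = sum(
        standalone_judge(ex) == verdict
        for ex, verdict in zip(examples, meta_verdicts)
    )
    return matches / len(examples)
```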

5. Prompt Automation

This is where it connects to something I'm very bullish on: prompt automation.

The meta-evaluator is doing automated prompt optimization. It tries a change, evaluates results, tries another change, evaluates again. It's the same loop that tools like DSPy and GEPA do, but with a context-aware judge and the ability to discover test cases on its own.

The Architecture

Here's what the new stack looks like:

![Meta-Evaluator Architecture](../../img/meta-evaluator-diagram.png){ width="400" }

The coding agent sits above the evals layer. It can use formal evals when they exist, but it can also create ad-hoc evaluations on the fly. It's the orchestration layer that was always there - we just weren't using it.

Making It Work

To use this pattern, your coding agent needs a few capabilities:

1. Ability to invoke the production agent

Your agent needs to be able to run queries against your production agent and see the outputs. This might mean:

  • A CLI command that runs a query
  • An API endpoint it can hit
  • Direct access to run the agent code
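For the CLI case, a thin wrapper is usually enough. Here `my-agent` is a hypothetical command; substitute whatever entry point your production agent actually has:

```python
import subprocess

# `my-agent` is a hypothetical command; point this at your real entry point.
AGENT_CMD = ["my-agent", "query"]

def run_production_agent(query: str, cmd=None) -> str:
    """Run one query against the production agent and return its raw
    output, so the coding agent can inspect it."""
    result = subprocess.run(
        (cmd or AGENT_CMD) + [query],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```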

2. Ability to modify prompts

The agent needs to be able to change system prompts, which it can already do since prompts live in your codebase. No special setup needed.

3. A clear problem statement

The more specific you are about the problem, the better. "Too many emojis" is good. "Here are 5 queries where I didn't like the emoji usage" is better. But even vague problems work - the agent will explore until it finds concrete failure cases.

When to Use This

The meta-evaluator pattern shines for:

  • Subjective issues: Things that are hard to capture in a programmatic eval
  • Ad-hoc fixes: Problems that aren't worth building a formal eval suite for
  • Discovery: When you know something's wrong but don't know exactly what triggers it
  • Rapid iteration: When you want to try many prompt variations quickly

It's not a replacement for formal evals. If you have a critical behavior that needs to be tracked over time, build a proper eval. But for the day-to-day work of improving your agent, the meta-evaluator pattern is faster and often more effective.

The Bigger Picture

Here's what excites me about this: we're collapsing layers.

First, coding agents collapsed the gap between "I want this code" and "the code exists." You describe what you want, the agent writes it.

Now, with the meta-evaluator pattern, we're collapsing the gap between "I have this problem" and "the problem is solved." You describe the issue, the agent discovers the failure cases, iterates on fixes, and ships the solution.

The developer's job shifts from "do the work" to "describe the work." And describing work is something humans are pretty good at.


Related posts:

My Bull Case for Prompt Automation

Recently, Andrej Karpathy did the Dwarkesh Patel podcast, and one of the stories he told stuck out to me.

He said they were doing an experiment where they had an LLM-as-a-judge scoring a student LLM. All of a sudden, he says, the loss went straight to zero, meaning the student LLM was getting 100% out of nowhere. So either the student LLM had achieved perfection, or something had gone wrong.

They dug into the outputs, and it turns out the student LLM was just outputting the word "the" a bunch of times: "the the the the the the the." For some reason, that tricked the LLM-as-a-judge into giving a passing score. It was just an anomalous input that gave them an anomalous output, and it broke the judge.

It's an interesting story in itself, just on the flakiness of LLMs, but we knew that already. The revelation for me is that if outputting the word "the" a bunch of times is enough to make an LLM behave in ways you wouldn't expect, then how random is the process of prompting? Are there scenarios where putting "the the the the the" in the system prompt fixes a behavior, or creates a behavior you were trying to get to?

We treat prompting like we're speaking to an entity, and that if we can get really clear instructions in the system prompt, we can steer these LLMs as if they're just humans that are a little less smart. But that doesn't seem to be the case, because even a dumb human wouldn't interpret the word "the" a bunch of times as some kind of successful response. These things are more enigmatic than we treat them. It's not too far removed from random at this point.

Which means we can automate this.

And that makes me bullish on things like DSPy and GEPA that use LLMs to generate prompts for you and use measurement criteria to validate that the prompt changes were effective. That automates the whole process and kinda gives you a handle on that randomness. Because if it is random (even partially) then having a human iterate until they find the right combination seems like an inefficient, Bitter Lesson way to solve these problems.

So yeah: I'm bullish on prompt automation, and bearish on prompt engineering as a skill.