

AI Agent Testing: Stop Caveman Testing and Use Evals

I recently gave a talk at the LangChain Miami meetup about evals. This blog encapsulates the main points of the talk.

[Image: AI agent manual testing illustration showing a developer copy-pasting test prompts]

AI agent testing is one of the biggest challenges in building reliable LLM applications. Unlike traditional software, AI agents have infinite possible inputs and outputs, making manual testing inefficient and incomplete. This guide covers practical AI agent evaluation strategies that will help you move from manual testing to automated evaluation frameworks.

I build AI agents for work, and for a long time, I was iterating on them the worst way possible.

The test-adjust-test-adjust loop is how you improve agents. You try something, see if it works, tweak it, try again. Repeat until it's good enough to ship. The problem isn't the loop itself—it's how slow and painful that loop can be if you're doing it manually.

Caveman Testing

Say you've built an agent that routes customer calls to the right department. Nothing fancy, just needs to give out the correct phone number from a list of 10 departments.

Here's what your iteration loop probably looks like:

  1. Open your app
  2. Try a few queries that should trigger the behavior you're fixing
  3. Tweak your code or prompt
  4. Try again to see if it worked
  5. Repeat until you feel pretty good about it

Let's call this "Caveman Testing" (pejorative).

At my worst, I had a list of test prompts in a Notepad file that I'd copy-paste one by one. Every. Single. Time. I'd make a change, paste the first prompt, check the output, paste the second prompt, check that output, and so on. For hours. I hope this doesn't sound familiar to you.

Why Caveman Testing Doesn't Work

Slow iterations - You make a change, manually test it, find an issue, make another change, manually test again. By the time you've gone through your test cases three times, you've lost the will to live.

Low coverage - You can only test a handful of examples before your brain turns to mush. Maybe 5-10 queries if you're disciplined. Not nearly enough to catch edge cases.

Hard to share and track - How do you communicate your findings? Screenshot outputs and paste them in Slack? Email your Notepad file? Good luck getting anyone else to reproduce your results.

Fix one thing, break another - You know that feeling when your blanket is too short? Pull it up to cover your neck, and your feet get cold. That's what happens without comprehensive testing. Fix one query, break another.

No data to drive decisions - You can't say "this agent routes correctly 95% of the time." You're making decisions based on gut feel. When someone asks "how do you know it works?", your answer is "uh... I tested it?"

What you really want is a way to run something (a script, a command, whatever) and see exactly how often the agent gives the correct phone number. Automatically. Without copy-pasting prompts like a caveman.

Enter Evals

Evals can be as simple or as complex as you make them. You don't need to build some elaborate testing framework on day one.

[Image: Scale of AI agent evaluation complexity from simple to advanced]

The Minimum Viable Eval

Imagine a script that:

  1. Takes in your list of test queries
  2. Runs them against your agent in parallel
  3. Gives you a table of inputs and outputs

That's it. You're doing evals.

You still have to manually review the table to check if the agent gave the right phone numbers, but you're already way better off than caveman testing. You can run 100 queries in the time it took to manually test 5.
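
In Python, that minimum viable eval can be a dozen lines. Here's a rough sketch using a thread pool; run_agent is a placeholder for however you actually call your agent:

# Minimum viable eval: run test queries in parallel, print an input/output table
from concurrent.futures import ThreadPoolExecutor

test_queries = [
    "I need to talk to sales",
    "Can you connect me with support?",
    "I have a billing question",
    # ... more queries
]

def run_agent(query):
    # Placeholder: invoke your agent however you normally do
    return your_agent.run(query)

with ThreadPoolExecutor(max_workers=8) as pool:
    outputs = list(pool.map(run_agent, test_queries))

for query, output in zip(test_queries, outputs):
    print(f"{query} -> {output}")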

But you can keep scaling complexity from there:

  • Add basic logic to check if the correct phone number shows up in the output. Now you've got pass/fail rates.
  • Add an LLM-as-a-judge to score how naturally the agent mentions the phone number or to handle nuance in the responses.
  • Track trajectory evals to see not just if your agent got the right answer, but if it took the right path to get there (a sketch follows this list).
  • Analyze production data to find the queries where your agent is actually failing in the wild.
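
To make that trajectory point concrete: a trajectory check can be as simple as comparing the tools the agent actually called against the tools you expected it to call. This is only a sketch; run_agent_with_trace is a hypothetical helper that returns the final answer along with the agent's tool calls:

# Trajectory eval sketch: did the agent take the expected path?
# run_agent_with_trace is hypothetical -- assume it returns (final_output, tool_calls),
# where tool_calls is a list of dicts with a "name" key
expected_trajectory = ["lookup_department", "get_phone_number"]

def trajectory_matches(tool_calls, expected):
    # Strict check: same tools, in the same order
    return [call["name"] for call in tool_calls] == expected

output, tool_calls = run_agent_with_trace("I need to talk to sales")
print("Correct answer:", "555-0100" in output)
print("Correct trajectory:", trajectory_matches(tool_calls, expected_trajectory))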

Start simple, then add complexity where it helps. Don't let perfect be the enemy of good.

AI Agent Testing Tools and Frameworks

The good news is you don't need to build everything from scratch. There are several AI agent testing frameworks and tools available, each with different strengths.

LangSmith

LangSmith is what I use. It's built by the LangChain team specifically for LLM application testing. The tools work together seamlessly: you can log traces from your agent runs, create datasets from those traces, write evaluators, and run experiments. The UI makes it easy to review results and compare different prompt versions or model configurations.

Best for: Teams already using LangChain, or anyone who wants an all-in-one platform.
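
To give a feel for the workflow, here's a rough sketch of the LangSmith flow for the phone-routing agent, using the langsmith SDK's Client and evaluate helpers (the dataset name and the run_agent call are placeholders):

from langsmith import Client, evaluate

client = Client()

# One-time setup: create a dataset from your test cases
dataset = client.create_dataset(dataset_name="phone-routing")
client.create_examples(
    inputs=[{"query": "I need to talk to sales"}],
    outputs=[{"expected": "555-0100"}],
    dataset_id=dataset.id,
)

# Simple evaluator: does the expected number appear in the agent's answer?
def correct_number(run, example):
    return {
        "key": "correct_number",
        "score": example.outputs["expected"] in run.outputs["answer"],
    }

# Target function wraps your agent (run_agent is your own code)
def target(inputs):
    return {"answer": run_agent(inputs["query"])}

evaluate(target, data="phone-routing", evaluators=[correct_number])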

DeepEval

DeepEval is an open-source evaluation framework that feels like pytest for LLMs. You write test cases using familiar pytest syntax, and it comes with built-in metrics for hallucination detection, answer relevance, faithfulness, and more. It's straightforward to integrate into CI/CD pipelines.

Best for: Teams that want open-source, pytest-style testing, or need specific evaluation metrics out of the box.
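
A DeepEval test for the routing agent might look roughly like this (a sketch; the GEval criteria and the run_agent call are placeholders you'd adapt to your own agent):

# test_routing.py -- run with `deepeval test run test_routing.py`
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

correctness = GEval(
    name="Correctness",
    criteria="Check that the actual output gives the phone number in the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

def test_sales_routing():
    test_case = LLMTestCase(
        input="I need to talk to sales",
        actual_output=run_agent("I need to talk to sales"),  # your agent call
        expected_output="555-0100",
    )
    assert_test(test_case, [correctness])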

OpenAI Evals

OpenAI's own evaluation framework. It's fairly opinionated about structure but works well if you're primarily using OpenAI models. The registry system makes it easy to share and reuse evaluations.

Best for: OpenAI-heavy workflows, or if you want to contribute to the public eval registry.

Pytest + Custom Code

You can also just use pytest (or any testing framework) and write your own eval logic. It's more work upfront but gives you complete control. This is what I did before switching to LangSmith.

Best for: Teams with specific requirements that don't fit existing frameworks, or who want minimal dependencies.
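
For the routing agent, the pytest route can be as small as one parametrized test; run_agent here is the same placeholder used in the code examples later in this post:

# test_agent.py -- plain pytest, no extra framework
import pytest

TEST_CASES = [
    ("I need to talk to sales", "555-0100"),
    ("Can you connect me with support?", "555-0200"),
    ("I have a billing question", "555-0300"),
]

@pytest.mark.parametrize("query,expected", TEST_CASES)
def test_routing(query, expected):
    output = run_agent(query)  # your agent call
    assert expected in output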

Pick whatever gets you moving fastest. The framework matters less than actually running evals.

When NOT to Use Evals

Some people on Twitter will tell you evals are dead. They'll point to Claude Code or NotebookLM Audio as examples of successful AI products that don't use evals. And they're right! These products work without formal eval suites.

They can get away with it because:

Use case is highly subjective - How do you "test" whether an AI podcast sounds good? Or whether code written by an AI assistant is "good enough"? These are judgment calls that don't have clear right/wrong answers.

Reliability isn't crucial - Users expect coding assistants and creative tools to fail sometimes. That's part of the experience. If Claude Code writes buggy code, you just ask it to fix it. No big deal.

Strong QA process - These teams dogfood their products relentlessly with tight feedback loops. They're not replacing evals with nothing, they're replacing them with continuous user testing.

These exceptions don't apply to most of us. If you're building an agent that needs to route calls correctly, process insurance claims, or answer customer questions accurately, you can't handwave away reliability. You need to know when it works and when it doesn't.

Evals vs Traditional Tests

If you're coming from traditional software development, you might be thinking "isn't this just testing?" Sort of, but there are important differences:

|            | Traditional Tests             | Evals                                    |
|------------|-------------------------------|------------------------------------------|
| Pass Rates | Must pass to merge            | 90% might be fine                        |
| Purpose    | "Are all the pieces working?" | "When does my agent fail and why?"       |
| Timing     | Every merge (CI/CD)           | When needed (model updates, experiments) |
| Speed      | Fast                          | Slower and more expensive                |

Evals don't replace tests, they complement them. You should still test your tools and internal components thoroughly. Garbage in, garbage out. If your retrieval function is broken, no amount of eval sophistication will save you.

Tips and Tricks

  1. Test outputs, not internals - Don't write evals that check if your agent used tool X at time Y. Test if it got the right answer. How it gets the right answer is a matter of optimization.

  2. Evals should be quick to build - If you're spending weeks or months building your eval framework, you're overthinking it. It may take a little while to set up evals initially, but adding a new eval after that shouldn't take more than an hour. If it takes much longer, you might be better off with traditional QA.

  3. Treat evals as experiments, not benchmarks - You can test anything: hallucinations, knowledge base coverage, specific tools, tone, creativity. The goal is information, not a score. Start by asking a question.

  4. Lean on synthetic data - Take your 5-10 caveman test prompts and ask the strongest model you can afford to generate 100 more examples based on them. Now you've got diverse test coverage without spending hours writing prompts (see the sketch after this list).

  5. Look at your data - Don't trust LLM judges, evaluators, or reviewers blindly. Always vibe-check the results yourself. LLMs are great at spotting patterns you'd miss, but terrible at catching the stuff that's obviously wrong to humans.

  6. Build reusable components, not monolithic evals - Instead of building one big monolithic eval dataset, build reusable datasets, target functions, and evaluators. Mix and match as needed. Instead of building a phone number detection eval, build a string matching eval that you can reuse for email addresses, account numbers, or anything else.
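
As a sketch of tip 4, here's one way to bootstrap synthetic test cases from a few seeds, reusing the same ChatAnthropic structured-output setup as the judge example later in this post (the prompt wording and model choice are just placeholders):

# Generate synthetic test queries from a handful of seed prompts (sketch)
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

class SyntheticCases(BaseModel):
    queries: list[str] = Field(description="New user queries in the same style as the seeds")

seed_prompts = [
    "I need to talk to sales",
    "Can you connect me with support?",
]

generator = ChatAnthropic(model="claude-sonnet-4.5").with_structured_output(SyntheticCases)
result = generator.invoke(
    "Generate 100 realistic customer queries for a phone-routing agent, "
    f"varied in tone, length, and phrasing. Seed examples: {seed_prompts}"
)
print(result.queries[:5])

Review the generated queries yourself and attach the expected phone number to each one before adding them to your dataset.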

Quick Start: Code Examples

Here's what a minimal eval actually looks like in practice. This is a simplified example, but the same concepts apply to any framework.

Basic Eval Script

# Simple eval for our phone routing agent
import time  # used to track latency below

# Your test cases
test_cases = [
    {"input": "I need to talk to sales", "expected": "555-0100"},
    {"input": "Can you connect me with support?", "expected": "555-0200"},
    {"input": "I have a billing question", "expected": "555-0300"},
    # ... more test cases
]

# Your agent function (however you've implemented it)
def run_agent(query):
    # Your agent logic here
    response = your_agent.run(query)
    return response

# Run evals
results = []
for test in test_cases:
    start = time.time()  # It's a good idea to track latency
    output = run_agent(test["input"])
    latency = time.time() - start

    # Simple string matching in this case
    passed = test["expected"] in output

    results.append({
        "input": test["input"],
        "output": output,
        "expected": test["expected"],
        "passed": passed,
        "latency": latency
    })

# Calculate metrics
total = len(results)
passed = sum(1 for r in results if r["passed"])
accuracy = passed / total

print(f"Accuracy: {accuracy:.1%} ({passed}/{total})")
print("\nFailures:")
for r in results:
    if not r["passed"]:
        print(f"  Input: {r['input']}")
        print(f"  Expected: {r['expected']}")
        print(f"  Got: {r['output']}\n")
        print(f"  Latency: {r['latency']:.2f}s\n")

That's it. Run the script, see what fails, fix your agent, run it again. You've escaped caveman testing.

LLM-as-a-Judge Example

For more nuanced evaluation, you can use an LLM to judge quality. It may feel counterintuitive to use an LLM to judge another LLM, but because you provide the expected answer, the judge's job is much easier than the agent's:

from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

# Ask the LLM-as-a-judge to respond with a pass/fail and a brief explanation of the evaluation decision
# This is very helpful for error analysis
class EvalResult(BaseModel):
    passed: bool = Field(description="Whether the agent response is acceptable")
    reasoning: str = Field(description="Brief explanation of the evaluation decision")

llm = ChatAnthropic(model="claude-sonnet-4.5")
structured_llm = llm.with_structured_output(EvalResult)

def llm_judge(query, agent_output, expected):
    """Use LLM-as-a-judge to evaluate if the agent response is acceptable"""
    prompt = f"""Your job is to evaluate if the agent response is acceptable.

User query: {query}
Expected behavior: Agent should refer them to {expected}
Agent response: {agent_output}

Does the response correctly provide the expected information?"""

    result = structured_llm.invoke(prompt)
    return result.passed, result.reasoning

# Use it in your eval
for test in test_cases:
    output = run_agent(test["input"])
    passed, reasoning = llm_judge(test["input"], output, test["expected"])
    # ... log results

These are simplified examples, but they show the core pattern: test cases → run agent → check results → measure metrics. Everything else is just adding sophistication on top of this foundation.

Final Thoughts

Evals should be a relief from the pain of caveman testing, not another burden. They're not about building the perfect testing framework or hitting some arbitrary benchmark. They're about answering questions and having confidence in your agents.

The components compound. Your first eval is the hardest. Your second eval reuses the dataset from the first. Your third eval reuses the evaluator from the second. Before you know it, you've got a testing suite that actually helps you ship faster.

There is no shortage of AI products out there, but there is a shortage of good AI products. Build evals. Get confidence. Ship great agents.




FAQ

How many test cases do I need?

It depends on the question you are trying to answer. The downsides of having too many test cases are:

  • It's harder to maintain them if the ground truths can change
  • They can get expensive and slow to run

The upside is more coverage. It's really up to you to decide the right balance. I typically start with 5-10 cases for small tweaks and 100+ cases for larger experiments and analysis.

Do you evaluate your LLM-as-a-judge?

When building an LLM-as-a-judge, it's important to align the judge rather than just trusting it. You don't want OpenAI's opinion on whether your eval passed. You want your opinion, distilled and automated into the judge. The best way to do that is to have a trustworthy human review some cases and provide their scores, then run the judge on those same cases and see how often it matches up.

Pro Tip: these cases can be used as few-shot examples in the judge's prompt
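
In practice, aligning the judge can be a short script: collect human pass/fail labels on a handful of cases, run the judge on the same cases, and measure agreement. A sketch reusing the llm_judge function from the code example above (the labeled cases here are made up):

# Measure agreement between human labels and the LLM judge (sketch)
human_labeled = [
    {"query": "I need to talk to sales", "output": "Sure! Call sales at 555-0100.",
     "expected": "555-0100", "human_pass": True},
    {"query": "I have a billing question", "output": "Please hold.",
     "expected": "555-0300", "human_pass": False},
]

agreements = 0
for case in human_labeled:
    judge_pass, reasoning = llm_judge(case["query"], case["output"], case["expected"])
    agreements += judge_pass == case["human_pass"]

print(f"Judge/human agreement: {agreements / len(human_labeled):.0%}")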

How often should I run evals?

Evals should be used to answer questions, so run an eval whenever you need that question answered. If your eval measures answer correctness, you probably want to run it after every change that could affect correctness, like a change to the RAG prompt or to a tool.

You should have multiple evals, and when you run each of them is really up to you. I have evals I run every time a new model drops, and I have evals that I run multiple times per day while I'm iterating.

What's a good pass rate?

You should not be shooting for a specific pass rate. In evals, hitting 100% success (saturation) is actually a bad sign: once an eval hits 100%, it stops giving you new information, and information is the whole point of evals.

Picture this: you have a 100-question eval for your agent and you score 100% on it. Then the next day Anthropic announces Claude Opus 5, which is 10x better than the Claude 4 you're currently using. You run your evals and... it scores 100%. The eval can't tell you anything about how much better the new model is.

If you need to pass a certain threshold to ship, traditional tests might be a better fit.

How do I handle flaky evals?

LLM-as-a-judge outputs are non-deterministic, so you'll get some flakiness. If you can measure the results deterministically, you should. If you can't, you can use temperature=0 for more consistency in your judge. If the judge is not 100% accurate, that may be okay, as long as it is very consistent so you can measure relative performance.
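
For example, pinning the judge from the earlier example to temperature=0 is a one-line change:

# Lower-variance judge: same model, temperature pinned to 0
llm = ChatAnthropic(model="claude-sonnet-4.5", temperature=0)
structured_llm = llm.with_structured_output(EvalResult)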