Blog

Some light reading on AI. If you are looking for more technical content, check out the experiments section.

November 25, 2025
in AI Development, Software Engineering, Best Practices
4 min read

Floor vs Ceiling: Different Models for Different Jobs

I talk a lot about the floor versus the ceiling when it comes to LLMs and agents. The ceiling is the maximum capability when you push these models to the edge of what they can do: complex architectures, novel scientific problems, anything that requires real reasoning. The floor is the everyday stuff, the entry-level human tasks that just need to get done reliably.

For customer service, you want floor models. Cheap, fast, stable. For cutting-edge research or gnarly architectural decisions, you want ceiling models. Expensive, slow, but actually smart.

What I've realized lately is that coding agent workflows should be using both. And most of them aren't.

The TDD Sweet Spot

For me, the best approach with current agents has been strong test-driven development. I architect the ticket, I design the tests, and then I let the agent implement the code. Each of these steps has different requirements, and they probably shouldn't all be using the same model.

Architecting the ticket: This is ceiling territory. You want a model that can think hard about the problem space, understand the existing codebase, and put together a coherent plan. Something like Codex that can reason through tradeoffs and edge cases.

Writing tests: Also ceiling territory. Tests define the acceptance criteria. They're the contract. If the tests are wrong or incomplete, everything downstream is garbage. You want a smart model here too.

Implementing the code: This is where it flips. Once you have a solid plan and good tests, implementation becomes a floor task. You're not asking the model to invent anything novel. You're asking it to write code that passes the tests. Junior to mid-level execution. The requirements are: don't make dumb mistakes, don't add slop to the codebase.

At this point, you could use Haiku. You could use Composer-1. You could use whatever is fastest and cheapest. The hard thinking already happened. Now you just need reliable execution.

Why This Matters

![A neoclassical oil painting reimagined for a far-future setting: at the top, a singular wise figure wearing a laurel crown made of glowing circuitry pours luminous data from a translucent golden vessel into an ornate fountain that blends marble with chrome and holographic elements. Below, the fountain overflows with light and feeds a procession of identical android workers in classical tunics marching outward, each carrying futuristic tools. Renaissance composition with dramatic lighting, rich burgundy and gold tones mixed with cyan technological accents, columns of polished metal and projected light in background.](../../img/floor-ceiling-funnel.png){ width="450" }

If you can get this workflow automated, you can actually churn through tickets. Get the ticket, plan it out sharply with a smart model, agree on acceptance criteria with well-written tests, then hand it off to a fast model that just executes.

The problem is that most coding agents treat everything the same. Planning mode exists in Cursor and Claude Code, but it's kind of a second-class citizen. A nice-to-have. In reality, if you're working with agents, spec-driven and test-driven development isn't optional. It's the only development you should be doing.

What I'd Build

If I could design a coding agent from scratch, here's what I'd do:

1. Make planning interactive and include testing.

The planning phase needs to be highly interactive. You're not just writing a markdown plan, you're also writing the tests. These are tightly coupled. The plan describes what you're building and why. The tests describe what success looks like. Both should be editable. Both should require human sign-off before execution starts.

So the agent would write tests in a test file and the plan in a plan file, and you'd iterate on both until you're happy. Only then does execution begin.
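As a rough sketch of that gate, assuming a simple in-memory object: execution stays blocked until a human has approved both the plan and the tests, and editing either one resets its approval. The `PlanningSession` class, its methods, and the file names are all hypothetical, not part of any existing agent.

```python
from dataclasses import dataclass, field


@dataclass
class PlanningSession:
    """Hypothetical container for the two artifacts produced during planning."""
    plan_path: str = "PLAN.md"          # assumed file name
    tests_path: str = "test_ticket.py"  # assumed file name
    approved: set = field(default_factory=set)

    def approve(self, artifact: str) -> None:
        # A human signs off on one artifact at a time.
        if artifact not in ("plan", "tests"):
            raise ValueError(f"unknown artifact: {artifact}")
        self.approved.add(artifact)

    def edit(self, artifact: str) -> None:
        # Any edit invalidates the previous sign-off for that artifact.
        self.approved.discard(artifact)

    def ready_for_execution(self) -> bool:
        # Execution starts only once BOTH the plan and the tests are approved.
        return self.approved == {"plan", "tests"}


session = PlanningSession()
session.approve("plan")
print(session.ready_for_execution())  # False: tests not approved yet
session.approve("tests")
print(session.ready_for_execution())  # True
```

Resetting approval on edit is a design choice here: it keeps the signed-off tests identical to the tests the execution model is asked to pass.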

2. Separate prompts for planning and execution.

The planning agent and the execution agent need different prompts. The planning agent's job is to understand the codebase and design a good solution. It needs context about architecture, patterns, and constraints.

The execution agent's job is different. These models are already heavily tuned to write code. You don't need to tell them to write code. What you need is guidance on how to write code without adding slop. Don't introduce tech debt. Don't add unnecessary abstractions. Don't break existing patterns. Keep it clean.

That's a fundamentally different prompt than "understand this complex system and figure out what to build."

3. Different models for each phase.

Use Codex or Opus or whatever's smartest for planning and tests. Use Sonnet or Haiku or whatever's fastest for execution. Match the model to the task.
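In code, the routing could be as small as a lookup table. This is only a sketch: the tier names `ceiling-model` and `floor-model` are placeholders, and `pick_model` is a hypothetical helper, so swap in whatever models you actually use.

```python
# Hypothetical tier names; replace with real model identifiers.
MODEL_FOR_PHASE = {
    "plan": "ceiling-model",       # hard thinking about the problem space
    "tests": "ceiling-model",      # tests are the contract, so stay smart here
    "implement": "floor-model",    # fast, cheap, reliable execution
}


def pick_model(phase: str) -> str:
    """Return the model tier assigned to a workflow phase."""
    try:
        return MODEL_FOR_PHASE[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}") from None


for phase in ("plan", "tests", "implement"):
    print(f"{phase:<10} -> {pick_model(phase)}")
```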

The Planning Gap

Right now, the planning phase in most agents is underdeveloped. It's something you can turn on, but it's not the default workflow. It's not deeply integrated with testing. It's not designed for iteration.

But this is where the leverage is. If you nail the plan and the tests, execution becomes almost trivial. If you skip planning or phone it in, you're asking a floor model to do ceiling work, and you'll get floor results.

The whole point of TDD is that once you've agreed on the tests and acceptance criteria, the implementation is just... implementation. "However you solve this is up to you, I don't care. Just make it pass." That's a fundamentally different kind of task than "figure out what to build."

Most agent workflows don't acknowledge this distinction. They should.

Related posts:

AI Agent Testing: Stop Caveman Testing and Use Evals - Why testing matters for AI workflows
The Meta-Evaluator: Your Coding Agent as an Eval Layer - Using agents to evaluate and iterate

November 25, 2025
in AI Development, Evaluation, Prompt Engineering
7 min read

The Meta-Evaluator: Your Coding Agent as an Eval Layer


I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that completely changed how I think about the evaluation layer.

What if your coding agent is the evaluation layer?

Let me explain.

The Traditional Stack

If you're building an AI product, your stack probably looks something like this:

flowchart TD
    A[You - the developer] --> B[Evals Layer]
    B --> C[Production Agent]

    B -.- D["datasets, judges, metrics"]
    D -.- B

You write evals. You run them. You look at the results. You tweak the prompt. You run evals again. Repeat until satisfied. This is the right way to do it, and I've written about it before.

But there's a layer missing from this diagram. What's actually doing the work of changing prompts and re-running evals?

You are. Manually.

The Missing Layer

Here's what the stack actually looks like for most of us:

flowchart TD
    A[You - the developer] --> B[Coding Agent]
    B --> C[Evals Layer]
    C --> D[Production Agent]

    B -.- E["Claude Code, Cursor, Codex, etc."]
    E -.- B

You're already using a coding agent to write your code. You're probably using it to write your evals too. But you're treating it as a dumb tool - you tell it what to change, it changes it, you run evals, you interpret results, you tell it what to change next.

What if you collapsed that loop?

What if your coding agent could run your production agent, evaluate the outputs, change the prompts, and iterate - all autonomously?

That's the meta-evaluator pattern.

A Real Example: The Emoji Problem

Let me walk through a real scenario. Say you have an AI product in production, and users are complaining that the agent uses too many emojis. Classic problem.

How do you solve this with traditional evals?

Option 1: Programmatic eval

Count emojis per response. But this is too blunt - if a user sends you a bunch of emojis, maybe your agent should mirror that energy. A hard threshold doesn't capture "uses emojis inappropriately."
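To make the bluntness concrete, here is a minimal sketch of such a check. The regex is a rough approximation that covers the main emoji blocks rather than every emoji sequence, and `count_emojis`, `too_many_emojis`, and the threshold of 2 are assumptions for illustration.

```python
import re

# Rough approximation: main emoji blocks plus older symbols and dingbats.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def count_emojis(text: str) -> int:
    """Count characters that fall in the approximate emoji ranges."""
    return len(EMOJI_PATTERN.findall(text))


def too_many_emojis(response: str, max_emojis: int = 2) -> bool:
    """Hard threshold: fails any response with more than `max_emojis`."""
    return count_emojis(response) > max_emojis


# The user was emoji-heavy first, so mirroring them is arguably fine...
user_message = "omg thank you 🎉🎉🎉"
reply = "You're welcome! 🎉🎉🎉"

# ...but the threshold never sees the user message, so it flags the reply anyway.
print(too_many_emojis(reply))  # True
```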

Option 2: LLM-as-a-judge

Write a judge that scores emoji appropriateness. But now you need to:

Create a dataset of queries that trigger emoji usage
Write the judge prompt
Align the judge (make sure it scores things the way you would)
Run the eval suite
Interpret the results
Change the prompt
Run the suite again
Repeat until satisfied

That's a lot of work for what started as "too many emojis."

Option 3: The meta-evaluator

Here's what I do now. I tell my coding agent:

"The production agent is using too many emojis. Here are 5 queries where I didn't like the emoji usage: [queries]. Fix it."

That's it. Here's what happens next:

The agent runs those queries against the production agent and sees the outputs
It evaluates the outputs itself - it can see the emojis, it understands the context, it knows what "too many" means in this situation
It changes the prompt - maybe adds "Use emojis sparingly" or restructures the tone section
It runs the same queries again and compares the results
It iterates until the outputs look right

The coding agent is the judge. It's not scoring in a vacuum - it's comparing before and after, understanding the context of the task, and deciding if the problem is actually solved.
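In practice the coding agent does all of this itself, but the shape of the loop can be sketched in plain Python. Everything below is illustrative: `iterate_on_prompt` and its three callables are hypothetical stand-ins for "run the production agent", "judge before versus after", and "edit the prompt", and the fake implementations exist only so the sketch runs.

```python
from typing import Callable


def iterate_on_prompt(
    prompt: str,
    queries: list[str],
    run_agent: Callable[[str, str], str],
    is_acceptable: Callable[[str, str, str], bool],
    revise_prompt: Callable[[str, list[str]], str],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Revise `prompt` until every query passes or the rounds run out.

    Returns the final prompt and the number of revisions made.
    """
    # Baseline outputs, kept so each judgment can compare before vs. after.
    baseline = {q: run_agent(prompt, q) for q in queries}
    for revision in range(max_rounds):
        outputs = {q: run_agent(prompt, q) for q in queries}
        failing = [q for q in queries if not is_acceptable(q, baseline[q], outputs[q])]
        if not failing:
            return prompt, revision
        prompt = revise_prompt(prompt, failing)
    # Rounds exhausted; the last revision has not been re-checked.
    return prompt, max_rounds


# Toy stand-ins so the loop can be run end to end.
def fake_agent(prompt: str, query: str) -> str:
    return "Done." if "sparingly" in prompt else "Done! 🎉🎉🎉"


def fake_judge(query: str, before: str, after: str) -> bool:
    return "🎉" not in after


def fake_revise(prompt: str, failing: list[str]) -> str:
    return prompt + " Use emojis sparingly."


final_prompt, revisions = iterate_on_prompt(
    "You are a helpful assistant.", ["thanks!"], fake_agent, fake_judge, fake_revise
)
print(revisions)     # 1
print(final_prompt)  # You are a helpful assistant. Use emojis sparingly.
```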

Why This Works

Traditional LLM-as-a-judge evaluates outputs in isolation. The meta-evaluator evaluates in context - it knows what the original problem was, what it tried, and whether the fix actually worked. It's comparative evaluation without the formal pairwise setup.

Beyond Ad-Hoc Fixes

The emoji example is simple, but this pattern scales to complex problems. Here's another real scenario:

The Problem: Our agent was sometimes including external links when it should only provide internal documentation links.

The Challenge: I didn't know which queries triggered external links. I just knew it happened sometimes in production.

The Meta-Evaluator Approach:

I gave my coding agent the task: "The agent sometimes provides external links when it should only provide internal documentation. Fix it."

Here's what happened:

Discovery: The agent queried the knowledge base directly, looking for content that referenced external URLs
Hypothesis generation: Based on what it found, it generated queries that should trigger external link responses
Validation: It ran those queries against the production agent until it actually reproduced the issue
Dataset creation: It saved the queries that triggered external links as a test dataset
Evaluator creation: It wrote a quick check for external URLs in responses
Iteration: It modified the prompt, re-ran the dataset, and iterated until external links stopped appearing
Shipping: Once fixed, it committed the changes

The agent discovered the edge cases, built its own dataset, wrote its own evaluator, and solved the problem. End to end.
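The "quick check for external URLs" step might look something like the sketch below. The post doesn't show the evaluator the agent actually wrote, so treat this as one plausible version: `external_links` is a hypothetical helper and `docs.example.com` is an assumed internal domain.

```python
import re
from urllib.parse import urlparse

# Simple URL matcher; good enough for a quick check, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")

# Assumed internal domain, for illustration only.
INTERNAL_DOMAINS = {"docs.example.com"}


def external_links(response: str, internal_domains=INTERNAL_DOMAINS) -> list[str]:
    """Return every URL in `response` whose host is not an internal domain."""
    found = []
    for raw in URL_PATTERN.findall(response):
        url = raw.rstrip(".,;:!?")  # drop sentence punctuation stuck to the URL
        host = (urlparse(url).hostname or "").lower()
        if host not in internal_domains:
            found.append(url)
    return found


print(external_links("See https://docs.example.com/setup for details."))
# []
print(external_links("Try https://stackoverflow.com/q/1 instead."))
# ['https://stackoverflow.com/q/1']
```

A response passes when the returned list is empty, which gives a deterministic pass/fail for every query in the saved dataset.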

Why This Is Different

Let me be clear about what makes this different from just "using a coding agent":

1. Comparative Evaluation

Traditional evals judge outputs in isolation. Is this response good? Score: 7/10.

The meta-evaluator judges comparatively. Did this change make things better? Is the problem actually solved? This is closer to how humans evaluate - we compare before and after, not just rate things in a vacuum.

2. Context-Aware Judging

When you write an LLM-as-a-judge, it only knows what you put in the prompt. It's in a vacuum.

The meta-evaluator has full context. It knows the original complaint, the queries that triggered it, what changes were tried, and how the outputs evolved. It can make nuanced judgments that would be impossible to capture in a judge prompt.

3. Ad-Hoc by Default

You don't need to formalize everything upfront. No dataset creation, no judge prompt writing, no eval framework setup. Just describe the problem and let the agent figure out how to test and fix it.

This is huge for the long tail of issues that aren't worth building formal evals for. Not every bug deserves a 100-row dataset.

4. Judge Alignment Built In

One of the hardest parts of LLM-as-a-judge is alignment - making sure the judge scores things the way you would.

With the meta-evaluator, alignment happens naturally. The agent is aligned to your task because it understands the full context. If you later want to formalize an eval, the agent can generate the dataset and check if a standalone judge agrees with its own assessments. If the judge disagrees, it can refine the judge prompt until they align.

5. Prompt Automation

This is where it connects to something I'm very bullish on: prompt automation.

The meta-evaluator is doing automated prompt optimization. It tries a change, evaluates results, tries another change, evaluates again. It's the same loop that tools like DSPy and GEPA do, but with a context-aware judge and the ability to discover test cases on its own.

The Architecture

Here's what the new stack looks like:

![Meta-Evaluator Architecture](../../img/meta-evaluator-diagram.png){ width="400" }

The coding agent sits above the evals layer. It can use formal evals when they exist, but it can also create ad-hoc evaluations on the fly. It's the orchestration layer that was always there - we just weren't using it.

Making It Work

To use this pattern, your coding agent needs a few capabilities:

1. Ability to invoke the production agent

Your agent needs to be able to run queries against your production agent and see the outputs. This might mean:

A CLI command that runs a query
An API endpoint it can hit
Direct access to run the agent code
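For the CLI option, a thin wrapper is usually enough. This is a minimal sketch under assumptions: `run_agent` is a placeholder for however your production agent is actually invoked, and the single positional `query` argument is just one possible interface.

```python
import argparse


def run_agent(query: str) -> str:
    """Placeholder for your production agent; replace with the real call."""
    return f"(agent response to: {query})"


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Run one query against the agent.")
    parser.add_argument("query", help="the user message to send")
    args = parser.parse_args(argv)
    print(run_agent(args.query))
    return 0


# Called with an explicit argument list here so the sketch runs as-is.
# In a real script you would call main() with no arguments from the
# usual __main__ guard so it reads the command line.
main(["I need to talk to sales"])
```

Printing the raw response to stdout keeps it easy for the coding agent to read the output and diff it across prompt changes.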

2. Ability to modify prompts

The agent needs to be able to change system prompts, which it can already do since prompts live in your codebase. No special setup needed.

3. A clear problem statement

The more specific you are about the problem, the better. "Too many emojis" is good. "Here are 5 queries where I didn't like the emoji usage" is better. But even vague problems work - the agent will explore until it finds concrete failure cases.

When to Use This

The meta-evaluator pattern shines for:

Subjective issues: Things that are hard to capture in a programmatic eval
Ad-hoc fixes: Problems that aren't worth building a formal eval suite for
Discovery: When you know something's wrong but don't know exactly what triggers it
Rapid iteration: When you want to try many prompt variations quickly

It's not a replacement for formal evals. If you have a critical behavior that needs to be tracked over time, build a proper eval. But for the day-to-day work of improving your agent, the meta-evaluator pattern is faster and often more effective.

The Bigger Picture

Here's what excites me about this: we're collapsing layers.

First, coding agents collapsed the gap between "I want this code" and "the code exists." You describe what you want, the agent writes it.

Now, with the meta-evaluator pattern, we're collapsing the gap between "I have this problem" and "the problem is solved." You describe the issue, the agent discovers the failure cases, iterates on fixes, and ships the solution.

The developer's job shifts from "do the work" to "describe the work." And describing work is something humans are pretty good at.

Related posts:

My Bull Case for Prompt Automation - Why I'm bullish on automated prompt optimization
AI Agent Testing: Stop Caveman Testing and Use Evals - The case for formal evals (which still matter!)
Complex AI Agents - On the temporary nature of complex architectures

November 10, 2025
in AI Development, Automation, Prompt Engineering
2 min read

My Bull Case for Prompt Automation

Recently, Andrej Karpathy did the Dwarkesh Patel podcast, and one of the stories he told stuck out to me.

He said they were running an experiment where an LLM-as-a-judge scored a student LLM. All of a sudden, he says, the loss went straight to zero, meaning the student LLM was getting 100% out of nowhere. So either the student LLM achieved perfection, or something went wrong.

They dug into the outputs, and it turns out the student LLM was just outputting the word "the" a bunch of times: "the the the the the the the." For some reason, that tricked the LLM-as-a-judge into giving a passing score. It was just an anomalous input that gave them an anomalous output, and it broke the judge.

It's an interesting story in itself, just on the flakiness of LLMs, but we knew that already. I think the revelation for me here is that if outputting the word "the" a bunch of times is enough to get an LLM to perform in ways you wouldn't expect, then how random is the process of prompting? Are there scenarios where if you put "the the the the the" a bunch of times in the system prompt, maybe it solves a behavior, or creates a behavior you were trying to get to?

We treat prompting like we're speaking to an entity, and we assume that if we can get really clear instructions into the system prompt, we can steer these LLMs as if they're just humans that are a little less smart. But that doesn't seem to be the case, because even a dumb human wouldn't interpret the word "the" a bunch of times as some kind of successful response. These things are more enigmatic than we treat them. It's not too far removed from random at this point.

Which means we can automate this.

And that makes me bullish on things like DSPy and GEPA that use LLMs to generate prompts for you and use measurement criteria to validate that the prompt changes were effective. That automates the whole process and kinda gives you a handle on that randomness. Because if it is random (even partially) then having a human iterate until they find the right combination seems like an inefficient, Bitter Lesson way to solve these problems.

So yeah: I'm bullish on prompt automation, and bearish on prompt engineering as a skill.

September 29, 2025
in AI Development, Testing, Best Practices
11 min read

AI Agent Testing: Stop Caveman Testing and Use Evals

I recently gave a talk at the LangChain Miami meetup about evals. This post summarizes the main points of the talk.

AI agent manual testing illustration showing developer copy-pasting test prompts

AI agent testing is one of the biggest challenges in building reliable LLM applications. Unlike traditional software, AI agents have infinite possible inputs and outputs, making manual testing inefficient and incomplete. This guide covers practical AI agent evaluation strategies that will help you move from manual testing to automated evaluation frameworks.

I build AI agents for work, and for a long time, I was iterating on them the worst way possible.

The test-adjust-test-adjust loop is how you improve agents. You try something, see if it works, tweak it, try again. Repeat until it's good enough to ship. The problem isn't the loop itself—it's how slow and painful that loop can be if you're doing it manually.

Caveman Testing

Say you've built an agent that routes customer calls to the right department. Nothing fancy, just needs to give out the correct phone number from a list of 10 departments.

Here's what your iteration loop probably looks like:

Open your app
Try a few queries that should trigger the behavior you're fixing
Tweak your code or prompt
Try again to see if it worked
Repeat until you feel pretty good about it

#### Let's call this "Caveman Testing" (pejorative)

At my worst, I had a list of test prompts in a Notepad file that I'd copy-paste one by one. Every. Single. Time. I'd make a change, paste the first prompt, check the output, paste the second prompt, check that output, and so on. For hours. I hope this doesn't sound familiar to you.

Why Caveman Testing Doesn't Work

Slow iterations - You make a change, manually test it, find an issue, make another change, manually test again. By the time you've gone through your test cases three times, you've lost the will to live.

Low coverage - You can only test a handful of examples before your brain turns to mush. Maybe 5-10 queries if you're disciplined. Not nearly enough to catch edge cases.

Hard to share and track - How do you communicate your findings? Screenshot outputs and paste them in Slack? Email your Notepad file? Good luck getting anyone else to reproduce your results.

Fix one thing, break another - You know that feeling when your blanket is too short? Pull it up to cover your neck, and your feet get cold. That's what happens without comprehensive testing. Fix one query, break another.

No data to drive decisions - You can't say "this agent routes correctly 95% of the time." You're making decisions based on gut feel. When someone asks "how do you know it works?", your answer is "uh... I tested it?"

What you really want is a way to run something (a script, a command, whatever) and see exactly how often the agent gives the correct phone number. Automatically. Without copy-pasting prompts like a caveman.

Enter Evals

Evals can be as simple or as complex as you make them. You don't need to build some elaborate testing framework on day one.

Scale of AI agent evaluation complexity from simple to advanced

The Minimum Viable Eval

Imagine a script that:

Takes in your list of test queries
Runs them against your agent in parallel
Gives you a table of inputs and outputs

That's it. You're doing evals.

You still have to manually review the table to check if the agent gave the right phone numbers, but you're already way better off than caveman testing. You can run 100 queries in the time it took to manually test 5.
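A minimal sketch of that script, using only the standard library. `run_agent` here is a stand-in that returns a canned string, and `run_batch` is a hypothetical helper; threads are a reasonable fit because agent calls mostly wait on the network.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(query: str) -> str:
    """Stand-in for your real agent call."""
    return f"Please call 555-0100 about: {query}"


def run_batch(queries: list[str], max_workers: int = 8) -> list[tuple[str, str]]:
    """Run every query concurrently and return (input, output) rows in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so rows line up with queries.
        outputs = list(pool.map(run_agent, queries))
    return list(zip(queries, outputs))


rows = run_batch(["I need to talk to sales", "I have a billing question"])
for query, output in rows:
    print(f"{query!r:40} | {output}")
```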

But you can keep scaling complexity from there:

Add basic logic to check if the correct phone number shows up in the output. Now you've got pass/fail rates.
Add an LLM-as-a-judge to score how naturally the agent mentions the phone number or how well it handles nuance in its responses.
Track trajectory evals to see not just if your agent got the right answer, but if it took the right path to get there.
Analyze production data to find the queries where your agent is actually failing in the wild.

Start simple, then add complexity where it helps. Don't let perfect be the enemy of good.

AI Agent Testing Tools and Frameworks

The good news is you don't need to build everything from scratch. There are several AI agent testing frameworks and tools available, each with different strengths.

LangSmith

LangSmith is what I use. It's built by the LangChain team specifically for LLM application testing. The tools work together seamlessly: you can log traces from your agent runs, create datasets from those traces, write evaluators, and run experiments. The UI makes it easy to review results and compare different prompt versions or model configurations.

Best for: Teams already using LangChain, or anyone who wants an all-in-one platform.

DeepEval

DeepEval is an open-source evaluation framework that feels like pytest for LLMs. You write test cases using familiar pytest syntax, and it comes with built-in metrics for hallucination detection, answer relevance, faithfulness, and more. It's straightforward to integrate into CI/CD pipelines.

Best for: Teams that want open-source, pytest-style testing, or need specific evaluation metrics out of the box.

OpenAI Evals

OpenAI's own evaluation framework. It's fairly opinionated about structure but works well if you're primarily using OpenAI models. The registry system makes it easy to share and reuse evaluations.

Best for: OpenAI-heavy workflows, or if you want to contribute to the public eval registry.

Pytest + Custom Code

You can also just use pytest (or any testing framework) and write your own eval logic. It's more work upfront but gives you complete control. This is what I did before switching to LangSmith.

Best for: Teams with specific requirements that don't fit existing frameworks, or who want minimal dependencies.

Pick whatever gets you moving fastest. The framework matters less than actually running evals.

When NOT to Use Evals

Some people on Twitter will tell you evals are dead. They'll point to Claude Code or NotebookLM Audio as examples of successful AI products that don't use evals. And they're right! These products work without formal eval suites.

They can get away with it because:

Use case is highly subjective - How do you "test" whether an AI podcast sounds good? Or whether code written by an AI assistant is "good enough"? These are judgment calls that don't have clear right/wrong answers.

Reliability isn't crucial - Users expect coding assistants and creative tools to fail sometimes. That's part of the experience. If Claude Code writes buggy code, you just ask it to fix it. No big deal.

Strong QA process - These teams dogfood their products relentlessly with tight feedback loops. They're not replacing evals with nothing, they're replacing them with continuous user testing.

These exceptions don't apply to most of us. If you're building an agent that needs to route calls correctly, process insurance claims, or answer customer questions accurately, you can't handwave away reliability. You need to know when it works and when it doesn't.

Evals vs Traditional Tests

If you're coming from traditional software development, you might be thinking "isn't this just testing?" Sort of, but there are important differences:

| | Traditional Tests | Evals |
|---|---|---|
| Pass Rates | Must pass to merge | 90% might be fine |
| Purpose | "Are all the pieces working?" | "When does my agent fail and why?" |
| Timing | Every merge (CI/CD) | When needed (model updates, experiments) |
| Speed | Fast | Slower and more expensive |

Evals don't replace tests, they complement them. You should still test your tools and internal components thoroughly. Garbage in, garbage out. If your retrieval function is broken, no amount of eval sophistication will save you.

Tips and Tricks

Test outputs, not internals - Don't write evals that check if your agent used tool X at time Y. Test if it got the right answer. How it gets the right answer is a matter of optimization.
Evals should be quick to build - If you're spending weeks or months building your eval framework, you're overthinking it. The initial setup may take a little while, but adding a new eval after that shouldn't take more than an hour. If it takes too long, you might be better off with traditional QA.
Treat evals as experiments, not benchmarks - You can test anything: hallucinations, knowledge base coverage, specific tools, tone, creativity. The goal is information, not a score. Start by asking a question.
Lean on synthetic data - Take your 5-10 caveman test prompts and ask the strongest model you can afford to generate 100 more examples based on them. Now you've got diverse test coverage without spending hours writing prompts.
Look at your data - Don't trust LLM judges, evaluators, or reviewers blindly. Always vibe-check the results yourself. LLMs are great at spotting patterns you'd miss, but terrible at catching the stuff that's obviously wrong to humans.
Build reusable components, not monolithic evals - Instead of building one big monolithic eval dataset, build reusable datasets, target functions, and evaluators. Mix and match as needed. Instead of building a phone number detection eval, build a string matching eval that you can reuse for email addresses, account numbers, or anything else.
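As a sketch of what a reusable evaluator from the last tip could look like: one string-matching factory, configured per use case instead of rewritten. `make_contains_evaluator` and `digits_only` are hypothetical names, not from any framework.

```python
from typing import Callable


def make_contains_evaluator(
    normalize: Callable[[str], str] = str.lower,
) -> Callable[[str, str], bool]:
    """Build an evaluator that passes when `expected` appears in `output`."""
    def evaluate(output: str, expected: str) -> bool:
        return normalize(expected) in normalize(output)
    return evaluate


def digits_only(text: str) -> str:
    # Useful for phone or account numbers, where formatting varies.
    return "".join(ch for ch in text if ch.isdigit())


contains_text = make_contains_evaluator()               # case-insensitive match
contains_number = make_contains_evaluator(digits_only)  # formatting-insensitive

print(contains_text("Email us at Help@Example.com", "help@example.com"))  # True
print(contains_number("Call (555) 0100 today", "555-0100"))               # True
```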

Quick Start: Code Examples

Here's what a minimal eval actually looks like in practice. This is a simplified example, but the same concepts apply to any framework.

Basic Eval Script

# Simple eval for our phone routing agent
import time  # needed for the latency timing below

# Your test cases
test_cases = [
    {"input": "I need to talk to sales", "expected": "555-0100"},
    {"input": "Can you connect me with support?", "expected": "555-0200"},
    {"input": "I have a billing question", "expected": "555-0300"},
    # ... more test cases
]

# Your agent function (however you've implemented it)
def run_agent(query):
    # Your agent logic here
    response = your_agent.run(query)
    return response

# Run evals
results = []
for test in test_cases:
    start = time.time()  # It's a good idea to track latency
    output = run_agent(test["input"])
    latency = time.time() - start

    # Simple string matching in this case
    passed = test["expected"] in output

    results.append({
        "input": test["input"],
        "output": output,
        "expected": test["expected"],
        "passed": passed,
        "latency": latency
    })

# Calculate metrics
total = len(results)
passed = sum(1 for r in results if r["passed"])
accuracy = passed / total

print(f"Accuracy: {accuracy:.1%} ({passed}/{total})")
print("\nFailures:")
for r in results:
    if not r["passed"]:
        print(f"  Input: {r['input']}")
        print(f"  Expected: {r['expected']}")
        print(f"  Got: {r['output']}")
        print(f"  Latency: {r['latency']:.2f}s\n")

That's it. Run the script, see what fails, fix your agent, run it again. You've escaped caveman testing.

LLM-as-a-Judge Example

For more nuanced evaluation, you can use an LLM to judge quality. It may seem counterintuitive to use an LLM to judge another LLM, but when you give the judge the right answer, its job is easy:

from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

# Ask the LLM-as-a-judge to respond with a pass/fail and a brief explanation of the evaluation decision
# This is very helpful for error analysis
class EvalResult(BaseModel):
    passed: bool = Field(description="Whether the agent response is acceptable")
    reasoning: str = Field(description="Brief explanation of the evaluation decision")

llm = ChatAnthropic(model="claude-sonnet-4-5")
structured_llm = llm.with_structured_output(EvalResult)

def llm_judge(query, agent_output, expected):
    """Use LLM-as-a-judge to evaluate if the agent response is acceptable"""
    prompt = f"""Your job is to evaluate if the agent response is acceptable.

User query: {query}
Expected behavior: Agent should refer them to {expected}
Agent response: {agent_output}

Does the response correctly provide the expected information?"""

    result = structured_llm.invoke(prompt)
    return result.passed, result.reasoning

# Use it in your eval
for test in test_cases:
    output = run_agent(test["input"])
    passed, reasoning = llm_judge(test["input"], output, test["expected"])
    # ... log results

These are simplified examples, but they show the core pattern: test cases → run agent → check results → measure metrics. Everything else is just adding sophistication on top of this foundation.

Final Thoughts

Evals should be a relief from the pain of caveman testing, not another burden. They're not about building the perfect testing framework or hitting some arbitrary benchmark. They're about answering questions and having confidence in your agents.

The components compound. Your first eval is the hardest. Your second eval reuses the dataset from the first. Your third eval reuses the evaluator from the second. Before you know it, you've got a testing suite that actually helps you ship faster.

There is no shortage of AI products out there, but there is a shortage of good AI products. Build evals. Get confidence. Ship great agents.

Related posts:

Complex AI Agents - Why multi-agent architectures might be temporary
Someone Needs to Solve AI Slideshows - Building feedback loops into AI workflows

FAQ

How many test cases do I need?

It depends on the question you are trying to answer. The downsides of having too many test cases are:

It's harder to maintain them if the ground truths can change
They can get expensive and long running

The upside is more coverage. It's really up to you to decide what the right balance is. I typically start with 5-10 cases for small tweaks and 100+ cases for larger experiments and analysis.

Do you evaluate your LLM-as-a-judge?

When building an LLM-as-a-judge, it's important to align the judge rather than just trust it. You don't want OpenAI's opinion on whether your eval passed. You want your opinion, distilled and automated into the judge. The best way to do that is to have a trustworthy human review some cases and provide their scores, and then run the judge on those same cases to see if they match up.

Pro Tip: these cases can be used as few-shot examples in the judge's prompt
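Here is a minimal sketch of that alignment check. It assumes pass/fail verdicts stored as booleans; the function names and the sample data are made up for illustration, not taken from any eval library.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of cases where the judge's verdict matches the human's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(cases, human_labels, judge_labels):
    """Return the cases where judge and human disagree, for manual review."""
    return [
        case
        for case, h, j in zip(cases, human_labels, judge_labels)
        if h != j
    ]


# Hypothetical data: True means "pass", False means "fail".
cases = ["case-1", "case-2", "case-3", "case-4"]
human = [True, False, True, True]
judge = [True, True, True, True]

rate = agreement_rate(human, judge)
to_review = disagreements(cases, human, judge)
print(f"agreement: {rate:.0%}, disagreements: {to_review}")
```

The disagreements are the interesting part: each one is either a judge prompt problem or a candidate few-shot example.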

How often should I run evals?

Evals should be used to answer questions, so you should run an eval whenever you need its question answered. If your eval is about answer correctness, you may want to run it after every change that could affect correctness, like a change in the RAG prompt or a tool.

You should have multiple evals, and when you run each of them is really up to you. I have evals I run every time a new model drops, and I have evals that I run multiple times per day while I'm iterating.

What's a good pass rate?

You should not be shooting for a specific pass rate. In evals, hitting 100% success (saturation) is actually a bad sign. When evals hit 100% you stop getting new information from your tests, which is the whole point of evals.

Picture this: you have a 100-question eval that tests your agent, and you score 100% on it. Then imagine the next day Anthropic announces Claude Opus 5, which is 10x better than the Claude 4 model you're currently using. You run the evals and... it scores 100%. Your eval can no longer tell the two models apart.

If you need to pass a certain threshold to ship, traditional tests might be a better fit.

How do I handle flaky evals?

LLM-as-a-judge outputs are non-deterministic, so you'll get some flakiness. If you can measure the results deterministically, you should. If you can't, you can use temperature=0 for more consistency in your judge. If the judge is not 100% accurate, that may be okay, as long as it is very consistent so you can measure relative performance.
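One way to put a number on judge flakiness is to run the judge several times on the same case and see how often it agrees with its own majority verdict. This is only a sketch: `consistency` is a hypothetical helper, and `fake_judge` is a seeded coin flip standing in for a real LLM call.

```python
import random
from collections import Counter


def consistency(judge, case, runs=5):
    """Run the judge several times on one case.

    Returns (majority_verdict, share_of_runs_matching_the_majority).
    """
    verdicts = [judge(case) for _ in range(runs)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / runs


# Stand-in for a real LLM judge: passes roughly 80% of the time.
rng = random.Random(0)


def fake_judge(case):
    return rng.random() < 0.8


verdict, share = consistency(fake_judge, {"input": "example"}, runs=10)
print(verdict, share)
```

A share close to 1.0 on most cases suggests the judge is stable enough for relative comparisons, even if it is not perfectly accurate.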

September 23, 2025
in AI Development, Experiments
8 min read

I Hate Making Slideshows

I hate making slideshows. It may or may not have something to do with how bad I am at making them.

Unfortunately, AI has not made this any easier.

So I decided to try my hand at building something better.

What's Out There

ChatGPT spits out plain text wrapped in .pptx files. Claude's new native slideshow maker produces boring HTML with cookie-cutter colors and zero personality. Both are technically PowerPoints, sure, but they aren't getting us 80% of the way.

The core problem is that PowerPoint generation requires tons of boilerplate. By the time the model sets up the file structure, it's out of tokens and creative capacity.

My first thought was to build a workflow where the agent creates a detailed presentation plan, then builds slides based on that plan. This isn't a terrible idea, but it kinda kicks the can down the road to the user to properly design a slideshow using only text. It also doesn't solve the core problem, which is the ability to abstract slide designs into something text based so that an LLM could design them in the first place.

Next, I considered making my own JSON-based slide description schema. I could design a structured output schema that maps to certain slide components and design elements and then try to get an LLM to adhere to that schema. JSON would be tough, though, because it's pretty limited, and I would ultimately be building a new programming language for slide design on top of JSON. Which triggered the next thought: is there already a programming language for designing beautiful slides?

There is! There are actually a few of them, but the one I landed on is called Slidev.

Slidev is a markdown-based syntax for creating presentations. You write markdown, it generates beautiful interactive slideshows. It's open source, and has components and themes from the community. It digests Vue components, HTML, CSS, Mermaid diagrams, click transitions, and PDF/PPT/PNG exports.

I tested ChatGPT, Claude, and Grok with prompts requesting slideshows in Slidev syntax. ChatGPT made boring but functional slides again. Claude was more ambitious but had small, fixable syntax errors. Grok had pretty bad syntax errors I didn't spend the time fixing. But the models could kind of handle the syntax since it's similar to markdown—they just weren't great at Slidev-specific features.

I installed Slidev locally and set up a quick-start template with Claude Code to try it out. On the first shot, I got similar results to the web Claude attempt: syntax errors and a boring slideshow.

Setting Up the Feedback Loop

I am a huge fan of providing coding agents with a feedback loop. I figured if the AI could write slides, export them, see the results, and iterate, it would catch its own mistakes.

The plan was:

Write slideshow in Slidev syntax
Export to PDF
Review the PDF
Fix issues and re-export
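The plan above is just a loop. Here is a rough sketch with every step passed in as a function; `write`, `export`, `review`, and `fix` are hypothetical hooks, not Slidev or Claude Code APIs, and the toy stand-ins at the bottom exist only so the loop can run on its own.

```python
def feedback_loop(write, export, review, fix, max_rounds=5):
    """Write once, then export, review, and fix until the review is clean.

    Hypothetical hooks:
      write()             -> initial slide source (str)
      export(source)      -> exported artifact (path, bytes, anything)
      review(artifact)    -> list of issue descriptions (empty means done)
      fix(source, issues) -> revised slide source
    Returns (final_source, rounds_used).
    """
    source = write()
    for round_number in range(1, max_rounds + 1):
        artifact = export(source)
        issues = review(artifact)
        if not issues:
            return source, round_number
        source = fix(source, issues)
    raise RuntimeError(f"still failing review after {max_rounds} rounds")


# Toy stand-ins: the "review" flags a placeholder left in the source.
final, rounds = feedback_loop(
    write=lambda: "# Title\nPLACEHOLDER",
    export=lambda src: src,
    review=lambda art: ["placeholder left in"] if "PLACEHOLDER" in art else [],
    fix=lambda src, issues: src.replace("PLACEHOLDER", "Real content"),
)
print(rounds)
```

The `max_rounds` cap matters: without it, a reviewer that never comes back clean would keep the agent iterating forever.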

The export build command would fail due to syntax errors. Claude would fix them, re-export, find more errors, and keep iterating until it worked. But when reviewing the PDF, it claimed everything looked fine even when there was raw HTML rendering instead of proper components.

Claude was just congratulating itself on a job well done, when there was a lot more work to be done.

It was missing obvious issues that were not obvious in the code. Empty lines between divs caused HTML not to render, so there was raw HTML in the slideshow. Some slides were blank. A lot of content overflowed and was cut off. Claude also had the tendency to use white text on a white/pastel background, which was not readable. But Claude was not seeing any of these issues, and when I pointed them out, I got a swift "You're absolutely right!".

-_-

I attempted to give Codex CLI a shot, but it was not able to read PDFs natively and attempted to extract the text, which is not helpful.

So the next hill to climb was how to solve the slide review problem. I made the guess that Claude's PDF abilities probably just treat the PDF as a single long vertical image. I bet it would be more conducive to review if Claude Code could review each slide individually. So I tried it out. I took a screenshot of a slide, popped it into the Claude and ChatGPT web apps, and asked them to give some design feedback. They nailed it! They called out the raw HTML and the white text on a white/pastel background, and also noticed some formatting issues I hadn't noticed.

We found a new path forward!

Switching to Images

The first step was to switch to exporting each slide as a separate image. Luckily, Slidev has export command args that allow this out-of-the-box. It generates a folder with a PNG for each slide labeled as {slide-number}.png.
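If you want to hand those images to a reviewer in slide order, a small helper goes a long way. This is a sketch that assumes the files are named by slide number as described above; `slide_images` is a hypothetical helper and the temp folder only simulates an export.

```python
import tempfile
from pathlib import Path


def slide_images(export_dir):
    """Return exported slide PNGs in slide order.

    Sorting by the integer value of the file stem keeps 10.png after 2.png,
    which a plain alphabetical sort would not. Files whose stem is not a
    number are ignored.
    """
    pngs = [p for p in Path(export_dir).glob("*.png") if p.stem.isdecimal()]
    return sorted(pngs, key=lambda p: int(p.stem))


# Simulate an export folder so the example runs without Slidev installed.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["1.png", "10.png", "2.png", "notes.png"]:
        (Path(tmp) / name).touch()
    ordered = [p.name for p in slide_images(tmp)]

print(ordered)  # ['1.png', '2.png', '10.png']
```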

I tested Claude Code with the new images and it was working as expected, but this also allowed me to try Codex CLI again, since we were now dealing with images instead of a PDF.

This worked better. It could spot white text on pastel backgrounds and broken layouts. But Claude was ...lazy. When I asked it to review all 11 slides systematically, it would check slides 1, 2, then skip to 6, 8, 10. It tried to trick me and take a shortcut but luckily I caught it because I don't trust Claude. Sneaky little bastard.
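Skipped slides are easy to catch mechanically if the reviewer reports which slide numbers it actually looked at. A tiny sketch, with `skipped_slides` as a hypothetical helper:

```python
def skipped_slides(total_slides, reviewed):
    """Return slide numbers in 1..total_slides missing from `reviewed`."""
    return sorted(set(range(1, total_slides + 1)) - set(reviewed))


# The pattern described above: 11 slides, only some of them checked.
missing = skipped_slides(11, [1, 2, 6, 8, 10])
print(missing)  # [3, 4, 5, 7, 9, 11]
```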

I tried Codex, which supposedly handles longer tasks better. It ran for 30 minutes before I stopped it. After 25 minutes of "processing" with no progress updates, it finally made edits that were worse than the original and also contained errors. It wanted to fix the errors, export and review again but I just killed it. Not waiting another hour for it to finally finish.

So if we have to stick with Claude Code, what are our options? I was thinking about setting up a custom script that passes the image to a multi-modal model and produces a review of the slide that Claude Code could use. Then Claude would just have to run the script and fix the issues it reports. But I kinda just don't want to build all of that for this project. I don't want to add API keys and other dependencies. I'm hoping that anyone would be able to jump into this repo, start up Claude Code, and start building slideshows.

We just needed to solve the laziness issue from Claude. Ideally, it wouldn't go 1 by 1, but in parallel. I also don't like the idea of the agent that built the slides being the one to review them, because it introduces bias and conflicts of interest.

Enter subagents.

Subagents

Claude Code's subagents seemed to be the perfect fit. A subagent:

Uses its own system prompt
Isolated context prevents conflicts of interest
Can be run in parallel
Can be easily delegated to and reviewed by the main agent

I used the /agents CLI command to spin up an Image Review Subagent. Claude Code actually made this step really easy. I just described the challenge and the goal and it wrote the system prompt and everything for me. I had to do some final tweaks but it ended up looking great.

So now, instead of having the main Claude agent review its own work, it would spin up an independent review agent for each slide in parallel. These had fresh context and no attachment to the original design. They'd critique things like white text on a white/pastel background, broken layouts, and more. They caught even more issues I hadn't noticed.

I did a little bit more tweaking of the CLAUDE.md and the subagent prompt before it got to a place that I felt comfortable.

I iterated more on the original Evals presentation I was using as an example before starting from scratch.

It's Alive!

I popped in my blog post about Complex vs Simple Agent Architectures and asked it to build a beautiful slideshow about it. It did pretty great! It even included a mermaid diagram and image placeholders for me!

I published the unedited slideshow results in case you want to see the results for yourself. It's not perfect but it's much better than what ChatGPT or Claude's native tools produce. With some iteration I'm sure I could get it closer to 90% of the way there!

Check Out the Slideshow Here

Takeaways

1. Find the right abstraction: The problem with AI-generated content isn't always model capability. Sometimes it's finding the right harness/abstraction. Slidev gave me a syntax that was LLM-able. No need for MCPs or tools or workflows or any of that headache.

2. Feedback loops are essential: Your agent is in "spray-and-pray" mode if you don't give it a way to review its own work.

3. Subjectivity matters: Vision models can see your design, but do they look for what you look for? Do they have good taste?

4. Sometimes multi-agent works: The irony of the solution being multi-agent, after writing a blog trashing multi-agent systems, is not lost on me. I am not totally against multi-agent systems, but I think the right tool for the right job is more important here.

Next Steps

I plan to revisit this repo in the future as Claude Code, Codex, and Slidev improve and new models are released. I have an AGENTS.md in place just in case Codex wants to start being good.

I'd also like to make it more multi-tenant so you can build multiple slideshows per repo.

I'd also like to see how far I can really push the design skills of these models. No more purple gradients!

Want to Make Your Own Slideshows?

The repo is public here. It's a template repo, so feel free to fork or copy it and start building your own slideshows.

https://github.com/camronh/slidev-agent

Quick Start Guide

Fork/Clone the repo
Install the dependencies with npm install
Boot up Claude Code
Prompt it to build a slideshow with whatever content you want

Good luck out there! Reach out if you have any questions or need help!

March 18, 2025
in AI Development, Architecture, Technology Trends
5 min read

Complex AI Agents

Model Mafia

In the world of AI dev, there’s a lot of excitement around multi-agent frameworks—swarms, supervisors, crews, committees, and all the buzzwords that come with them. These systems promise to break down complex tasks into manageable pieces, delegating work to specialized agents that plan, execute, and summarize on your behalf. Picture this: you hand a task to a “supervisor” agent, it spins up a team of smaller agents to tackle subtasks, and then another agent compiles the results into a neat little package. It’s a beautiful vision, almost like a corporate hierarchy with you at the helm. And right now, these architectures and their frameworks are undeniably cool. They’re also solving real problems as benchmarks show that iterative, multi-step workflows can significantly boost performance over single-model approaches.

But these frameworks are a temporary fix, a clever workaround for the limitations of today’s AI models. As models get smarter, faster, and more capable, the need for this intricate scaffolding will fade. We’re building hammers and hunting for nails, when the truth is that the nail (the problem itself) might not even exist in a year. Let me explain why.

Where Are All the Swarms?

Complex agent architectures are brittle. Every step in the process—every agent, every handoff—introduces a potential failure point. Unlike traditional software, where errors can often be isolated and debugged, AI workflows compound mistakes exponentially. If one agent misinterprets a task or hallucinates a detail, the downstream results may not be trustworthy. The more nodes in your graph, the higher the odds of something going wrong. That’s why, despite all the hype, we rarely see swarm-based products thriving in production. They’re high-latency, fragile, and tough to maintain.
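To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification (real failures are correlated and some are recoverable), and the 95% figure is made up for illustration.

```python
def chain_success(per_step_success, steps):
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step_success ** steps


for steps in (1, 3, 6, 10):
    print(steps, round(chain_success(0.95, steps), 3))
```

With a 95% per-step success rate, a six-step pipeline finishes cleanly only about 74% of the time, and a ten-step one about 60%.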

Let's use software development as an example, since it is what I am most familiar with. Today’s agent workflows often look like this: a search/re-ranking agent scours your code repo for relevant files to include in the context window, a smart planning agent comes up with the approach and breaks it into tasks, a (or multiple) coding agent writes the code, a testing agent writes the tests, and a PR agent submits the pull request (maybe with a PR review agent thrown in for good measure). It’s a slick assembly line, but every step exists because current models can’t handle the whole job alone.

Search and re-ranking: This is only necessary because context windows are too small and it is too expensive to ingest an entire repo. This is also the step that is most susceptible to failures, because the model that is smart enough to plan the task should also be the one deciding which files are relevant. A context window increase and a price decrease will make this step obsolete.
Planning and task breakdown: The main value of this step is that you can have your smartest model give direction to the smaller, less capable, but cheaper and faster models. There's no need for a formalized plan when models can perform all planning inside of their own reasoning process. The only other reason I can think of to have subtasks here would be that a model won't be able to output enough tokens to solve the entire problem in one go. An output token limit increase and price decrease will make this step obsolete.
Testing and PRs: Why separate these? A model that's capable of planning is capable of writing the code to test that plan as long as it fits inside of the output token limit. This step would be replaced by simply returning the test results to the single agent so that it could make decisions based on the results. This is feasible today! But it could be pretty expensive to have an agent loop with the entire codebase as context.

The root issue isn’t the workflow, and in most cases, it's not even the model intelligence. Limited context windows, high-priced top-tier models, and token output caps force us to chop tasks into bite-sized pieces. But what happens when those limits start to fade? Imagine even a modest 3x-5x improvement in context window size, price, and output token limits. Suddenly, you don’t need all of your tools, frameworks, and subagents.

Tech Debt

And those constraints are eroding fast. Last year, OpenAI’s Assistants API launched with built-in RAG, web search, and conversation memory. It didn't gain a ton of traction for RAG, mostly because RAG is not really a one-size-fits-all solution and devs needed control over their pipelines. Back then, RAG was an exact science: tiny context windows, dumb and expensive models, and high hallucination risks meant you had to fine-tune your RAG pipeline obsessively to get good results. Nowadays that stuff is much less of an issue. Chunking strategy? Throw in a whole document, and let the model sort it out. Top K? F*#% it, make it 20 since prices dropped last month. Bigger context windows, lower prices, caching, and better models have made simplicity king again. Problems I’ve wrestled with in my own agents sometimes vanish overnight with a model update. That’s not an edge case; it’s a pattern.

The Shelf Life of Agent Architectures

Complex agent architectures don’t last. If you build a six-step swarm today, a single model update could obsolete 3 of those steps by year’s end, then what? AI isn’t like traditional software, where architectures endure for decades. Six months in AI is an eternity—updates hit fast, and they hit hard. Why sink time perfecting fickle but beautiful multi-agent masterpieces when the next AI lab release might collapse it into a single prompt? LangChain, Crew, Swarm—all these tools are racing against a convergence point where raw model power outstrips their utility.

I’m not saying agent architectures are useless now—they’re critical for squeezing the most out of today’s tech. But they’re not evergreen. Simplicity is the smarter bet. Lean on the optimism that models will improve (they will), and design systems that don’t overcommit to brittle complexity. In my experience, the best architecture is the one that solves the problem with the fewest moving parts—especially when the parts you’re replacing get smarter every day.

March 11, 2025
in AI Impact, Future of Work, Technology Trends
4 min read

The "Idea Guy" Delusion: Why No One Is Safe from AI

Knowledge Workers

As AI continues to evolve, many professionals (especially software developers like myself) are coming to terms with the reality that their jobs will eventually be automated. Maybe in two years, maybe in five. But it’s happening.

Yet, amidst this shift, a certain group seems oddly confident in their immunity to AI-driven disruption: the idea guys.

These are the people who believe that once AI automates programming and other forms of technical labor, the true value will shift to those who can generate great ideas. But I don’t buy it. Sure, there’s a timeline where this could be true. But in most cases, the idea guy is just as doomed as the software developer, if not more so.

AI Won't Struggle with Ideas

There's a misconception that while AI might be able to code, it won’t be able to come up with good ideas. But this doesn't hold up under scrutiny. Idea generation isn’t some mystical human trait, it’s just a research problem.

If I wanted to generate 15 startup ideas right now, I wouldn’t meditate in a cabin and wait for inspiration. I’d scroll Reddit for 20 minutes and see what people are complaining about. AI can do that faster, better, and across a wider range of sources.

And filtering good ideas? That’s not some sacred human skill either. A good idea guy isn’t someone who magically comes up with better ideas; it’s someone who avoids bad ideas. But AI doesn’t need a filter, since it can pursue every idea in parallel. If it launches 10 projects and one succeeds, is it a genius idea guy?

AI as CEO

AI isn’t just stopping at coding. Software development isn’t just writing code! It's provisioning environments, debugging, testing, scaling, deploying, architecting, and integrating systems. AI is already creeping into these domains, and eventually, it will handle them in ways that don’t require human oversight.

At that point, what’s stopping AI from also iterating on product-market fit? If it can build a full-stack application, why wouldn’t it also build in user feedback loops, run A/B tests, and continuously optimize the product itself? If it can automate deployment, it can automate iteration. If it can iterate, it can validate its own ideas.

Eventually, users themselves will be the ones proposing ideas by leaving feedback, which the AI will then solve for. At that point, what exactly does the human “idea guy” contribute?

But What About Sales and Marketing?

There’s another flawed assumption that AI can build, but it won’t be able to sell. That’s just false. The same AI that can launch products can also launch A/B-tested marketing campaigns, generate optimized ad copy, and personalize sales pitches at a scale humans can’t compete with. Marketers are already prompting AI to generate content, optimize ads, and personalize sales pitches. How far away are we from automating the prompting?

And it’s not just about generative AI—classic machine learning is already better than humans at optimizing recommendations, ads, and conversion rates. These models will only improve. When that happens, an AI-driven product won’t just sell itself—it will continuously optimize its sales approach better than any human could.

Who Actually Survives?

If anyone has a shot at surviving, it’s not the idea guy. Potentially, it’s the entrepreneur who becomes an intern for the AI.

Someone will still be needed to rig up AI systems, configure automations, and handle anything in the physical world—incorporating businesses, making legal decisions, or doing things that require human interaction. But beyond that? Their role will be minimal.

If we ever reach the point where AI can handle full unsupervised software development, then no job is safe. Not developers, not marketers, not CEOs. Not even scientists, doctors, or lawyers. Because an AI that can reason through the entire software lifecycle without human intervention is smart enough to disrupt every knowledge-based profession. Mathematicians are not safe even though LLMs are bad at math, because code lets the models carry out extremely difficult calculations. The same will be true for every knowledge-based profession.

Final Thoughts: No One Is Safe

I don’t feel secure in my role as a software developer. But I don’t think idea guys should feel secure, either. If we ever reach the point where AI is developing software without supervision, it will be smart enough to do much more than just code.

At that point, every knowledge worker is at risk—lawyers, scientists, doctors, and executives included. If AI is smart enough to replace programmers, it’s smart enough to replace idea guys, too. And if you’re betting on the latter being the safer role, you’re in for a rude awakening.

February 23, 2025
in Personal Development, Productivity, Behavioral Change
2 min read

Do First, Optimize Later: Breaking the Cycle of Over-Optimization

I've come to a realization: I spend too much time planning and optimizing rather than actually doing. AI and automation have fueled my obsession with optimization, making me believe that if I refine a system enough, I’ll be more productive. But the truth is, optimization is only valuable when applied to something that already exists.

The problem is, I often optimize before I start. I think, “I need to make a to-do list,” but instead of actually making one and using it, I get lost in finding the best way to structure a to-do list, the best app, or the best workflow. Even right now, instead of writing down what I need to do, I’m writing a blog post about how I should be writing things down. This is the exact loop I need to escape.

Optimization feels like progress. It gives me the illusion that I’m working towards something, but in reality, I’m just postponing action. The efficiency of a to-do list doesn’t matter if I’m not using one. The best UX for adding tasks doesn’t matter if I never add tasks. The friction in a system isn’t relevant if I’m not engaging with the system at all.

The real issue isn’t inefficiency—it’s a lack of discipline. I tell myself I’m not doing things because the process isn’t optimized enough, but the truth is simpler: I just haven’t done them. My focus should be on building the habit of doing, not perfecting the process before I even begin.

The New Rule: Action Before Optimization

Going forward, I want to adopt a new mindset—do first, optimize later. If I find that something is difficult or inefficient while actively doing it, then I can optimize. But I won’t let optimization be the barrier to starting in the first place.

I’ll collect real data from actually engaging in the tasks I want to improve. If my to-do list system feels clunky after I’ve been using it consistently, then I’ll refine it. If I struggle to keep up with a workflow, then I’ll tweak it. But I won’t waste time optimizing something that isn’t even in effect yet.

Optimization should be a tool for improvement, not an excuse for inaction. The first step is always to start. Only then does optimization become valuable.

November 10, 2024
in AI Applications, Personal Finance, Technology Experiments
9 min read

Handing Over My Wallet to AI: Which Model Gave the Best Financial Advice?

AI Financial Advisors

Ever looked at your bank account and thought "I should probably talk to a personal financial advisor" — but then remembered that good advisors charge anywhere from $150 to $300 per hour? For most of us, professional financial advice feels like a luxury we can't justify. But we shouldn't have to wait until we're rich to get good financial advice!

That's where AI might change everything. Instead of paying hundreds per hour for financial advice, what if you could get personalized insights for the cost of a ChatGPT Plus subscription? To test this possibility, I connected RocketMoney to all my accounts—checking, credit cards, investments, the works—and exported 90 days of transaction data. Then I fed this financial snapshot to three AI heavyweights: ChatGPT, Claude, and Gemini.

But this isn't just about which AI is "smarter." Each platform brings different tools and features that the model can use to analyze the data. I asked each to analyze my spending and create comprehensive financial plans and reports, just like a human advisor would.

I kept it simple. Each AI received the same prompt:

You are an expert personal finance manager and wealth advisor. I have included my last 90 days of transactions. I need you to do an analysis of my current financial situation and give me a report and wealth plan. Keep in mind this csv is a consolidation for all of my accounts and includes transfers and credit card payments provided by RocketMoney.

The results? Let's just say one AI saved me more money in potential insights than a year's worth of its subscription costs—while another couldn't even handle the basics. Here's what happened when I turned my finances over to the machines.

ChatGPT: A Mixed Bag of Analysis

Everybody knows ChatGPT already, so I'll skip the introduction. ChatGPT allows you to upload files, but their strongest model (GPT-o1) doesn't support file uploads yet so I opted to use GPT-4o as my model of choice.

The analysis began promisingly. ChatGPT used Python to quickly parse my CSV file and showed me its understanding of the data structure by listing the columns.

It then produced a "Financial Summary" with my total income, expenses, and net balance. The numbers were significantly off base—likely because it counted both credit card charges and their payments as separate expenses.

Next came a breakdown of spending by category, presented in an interactive spreadsheet. The spreadsheet had an Expand feature for a full-screen view that even let you select and reference specific cells:

Expanded view of spending categories

I do love the way ChatGPT presents the data in a spreadsheet-like format. It's easy to see the data and it's easy to interact with. The data itself wasn't particularly useful, though, since I already have RocketMoney for category breakdowns, and it included credit card payments and transfers. ChatGPT wrapped up with some suggestions for next steps:

I've provided you with a summary of your financial situation over the last 90 days, including total income, expenses, net balance, and a detailed breakdown of spending by category.

Here's what I'll do next for your wealth plan:

Spending Analysis:

Dive deeper into your largest expense categories and provide insights on where you could cut back to save more.

Savings and Investment Strategy:

Based on your net balance, I'll suggest a strategy for increasing savings or investments to help grow your wealth.

Debt Management:

I'll assess any debt-related payments, such as credit card expenses, to determine the best approach to reduce interest payments and manage liabilities effectively.

Would you like me to focus on any specific area first, or do you have particular goals (e.g., saving for a big purchase, paying down debt, growing investments) in mind for your wealth plan?

Wanting to dig deeper, I prompted:

"Where should I start to reduce my spending? I spend too much!"

Without doing any further Python analysis, ChatGPT jumped straight into suggestions, and this is where things got interesting, and not in a good way.

First, it identified my biggest expense category as "Credit Card Payments," which shows a fundamental misunderstanding—credit card payments aren't really an expense category, they're just transfers to pay down credit balances. It was essentially double-counting expenses: once when I used the credit card, and again when I paid the bill.
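The double-counting is easy to see in a toy example. This sketch sums a tiny, made-up export twice: once naively and once with money-moving rows excluded. The column names, category labels, and sign convention (positive means money out) are all assumptions for illustration, not RocketMoney's documented export format.

```python
import csv
import io

# Hypothetical export: columns, categories, and amounts are invented.
SAMPLE = """Name,Amount,Category
Uber Eats,42.50,Dining & Drinks
Electric Co,120.00,Bills & Utilities
Card Payment,162.50,Credit Card Payment
To Savings,500.00,Internal Transfers
"""

# Rows in these categories only move money between accounts.
NOT_SPENDING = frozenset({"Credit Card Payment", "Internal Transfers"})


def total_spending(csv_text, excluded=NOT_SPENDING):
    """Sum amounts, skipping rows whose category is in `excluded`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sum(
        float(row["Amount"]) for row in rows if row["Category"] not in excluded
    )


naive = total_spending(SAMPLE, excluded=frozenset())
filtered = total_spending(SAMPLE)
print(naive, filtered)  # 825.0 162.5
```

In this toy data the card payment is exactly the two card purchases being paid off, so the naive total counts them twice and then adds a savings transfer on top.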

Next, it focused on bills and utilities as my second-highest expense category. While technically true, the advice was comically generic—suggesting I switch to LED lightbulbs and turn off appliances when leaving rooms. Not exactly the sophisticated financial analysis I was hoping for!

Then came the food spending advice:

3. Dining & Drinks - $XXXX

What's happening: This is one of the most common areas where expenses accumulate because of convenience.

What to do:

Cut Back on Eating Out: Allocate a fixed budget for eating out each week. Consider cooking at home or meal prepping more often, especially given your goal of saving by meal prepping.

Plan Meals: Develop simple meal plans with affordable ingredients, focusing on dishes that offer variety yet are easy to prepare. This can help avoid impulse restaurant spending.

Limit High-Cost Visits: Target reducing high-cost visits (e.g., fine dining) and focus on affordable alternatives like picnics or home-prepared meals.

The advice I received was surprisingly generic. ChatGPT seems to deliberately limit the amount of data it processes (maybe to conserve tokens?) which leads to surface-level analysis and cookie-cutter suggestions like:

Focusing on "Credit Card Payments" as an expense category (which isn't really an expense at all)
Suggesting I switch to LED lightbulbs and turn off appliances to reduce my bills spending
Making generic food spending suggestions without any reference to my actual dining patterns

ChatGPT didn't dig deep enough to notice that my real issue was excessive UberEats spending and fancy restaurant visits. Instead, it gave blanket advice about cooking at home more—true, but not particularly insightful given the data available.

Gemini: A Comedy of Errors

I had high hopes for Gemini Advanced, given Google's claims about its enhanced analytical capabilities and extremely large context window. Those hopes were quickly dashed.

Gemini's apologetic response

My first attempt resulted in Gemini apologizing and refusing to help—a strange response given that financial analysis is hardly controversial. On my second try, it analyzed the columns without any error but went off on a bizarre tangent about tax-deductible expenses (which I hadn't asked about) and attempted to analyze a non-existent data column named "Tax Deductible", which obviously failed.

Gemini's confused tax analysis

This experience was very disappointing. I am still bullish on Gemini's potential, but it seems completely unusable for this kind of task in its current state.

Claude: The Financial Advisor We Need

Where the others stumbled, Claude shined. Using Claude 3.5 Sonnet in a dedicated project workspace, I got what felt like genuinely professional financial advice.

🚨 Be sure to enable the "Analysis Tool" in the beta features if you want to try this out yourself.

What set Claude apart was its comprehensive approach. It started with a thorough financial report that broke down both total income and expenses. While it did make some mistakes (like double-counting my income due to a recent job change), it showed remarkable attention to detail in its analysis.

Next, it provided a detailed spending breakdown, avoiding the common pitfall of counting transfers and credit card payments as expenses. It provided clear percentage breakdowns—for instance, noting that bills made up 28% of my expenses and food was 13%, further split between groceries and dining out.
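To make the transfer pitfall concrete, here is a minimal Python sketch of that kind of breakdown. The category names, the `EXCLUDED` set, and the sample amounts are all made up for illustration; they are not my real data, and this is not the code Claude actually ran.

```python
from collections import defaultdict

# Hypothetical categories that move money around but are not real spending
EXCLUDED = {"Transfer", "Credit Card Payment"}

def spending_breakdown(transactions):
    """Return each category's share of true expenses, as a percentage."""
    totals = defaultdict(float)
    for category, amount in transactions:
        if category in EXCLUDED:
            continue  # skip non-expenses so they don't inflate the total
        totals[category] += amount
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cat: round(100 * amt / grand_total, 1) for cat, amt in totals.items()}

# Made-up sample data, purely illustrative
sample = [
    ("Bills", 700.0),
    ("Groceries", 200.0),
    ("Dining Out", 100.0),
    ("Credit Card Payment", 1500.0),
    ("Transfer", 500.0),
]
print(spending_breakdown(sample))
```

If the two excluded categories were left in, the same sample would make "Credit Card Payment" look like the single largest expense, which is exactly the mistake ChatGPT made.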

Here's when Claude started to impress me—in the details:

Spending Patterns:

Essential Spending: $XXXX (58%)

Discretionary Spending: $XXXX (42%)

Heavy reliance on food delivery services (frequent Uber Eats charges)

Multiple subscription services totaling approximately $XXX/month

Claude understood what was essential and what was discretionary, which shows a level of understanding that goes beyond the surface-level categories. It immediately called out my frequent Uber Eats charges, and it left credit card payments and transfers out of the spending breakdown.

It even called out specific "Red Flags":

RED FLAGS:

High Food Costs: Spending $XXXX/month on food, with about 67% on dining out/delivery

Numerous Subscriptions: Multiple overlapping services (Apple, Claude, OpenAI, etc.)

High discretionary spending ratio

It is not wrong here: I do sign up for every AI product that I hear about. Busted!

Claude then provided a detailed action plan:

RECOMMENDATIONS:

Immediate Actions:

Audit Subscriptions:

Consolidate AI services (currently using Claude, OpenAI, and others)

Review multiple Apple subscriptions

Evaluate necessity of multiple cloud services

Food Budget Optimization:

Reduce Uber Eats usage (currently spending heavily here)

Increase grocery shopping vs. delivery

Target reducing food expenses to $XXXX/month

Medium-term Financial Goals:

Build Emergency Fund:

Target: 6 months of essential expenses ($XXXX)

Current high income provides good opportunity for savings

Implement 50/30/20 Budget:

50% needs ($XXXX)

30% wants ($XXXX)

20% savings ($XXXX)

Investment Strategy:

Maximize Tax-Advantaged Accounts:

401(k) if available through employer

Roth IRA ($XXXX/year)

HSA if eligible

Consider diversifying investments:

Currently see some crypto investments

Recommend broader diversification into index funds

Consider real estate investment given high income
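The 50/30/20 split and the emergency fund target in the plan above are simple arithmetic. Here is a small Python sketch, assuming the percentages apply to after-tax monthly income; the function names and the $5,000 and $2,500 figures are placeholders of mine, not the redacted numbers.

```python
def budget_50_30_20(monthly_income: float) -> dict:
    """Split after-tax monthly income into needs, wants, and savings."""
    return {
        "needs": round(monthly_income * 0.50, 2),
        "wants": round(monthly_income * 0.30, 2),
        "savings": round(monthly_income * 0.20, 2),
    }

def emergency_fund_target(monthly_essentials: float, months: int = 6) -> float:
    """Target balance covering `months` of essential expenses."""
    return round(monthly_essentials * months, 2)

# Placeholder numbers, not my real income or expenses
print(budget_50_30_20(5000))
print(emergency_fund_target(2500))
```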

I then asked for visualizations, and Claude delivered with beautiful charts showing my spending patterns:

Claude's food spending analysis visualizations

I then challenged Claude to show me potential savings from cutting out Uber Eats and limiting dining out. It responded with a detailed plan and an interactive visualization that you can check out here!

I followed up with a few more requests:

Identifying other areas for spending cuts
Configuring automatic savings and investments
Analyzing the rent vs. buy decision

Each response was thoughtful, data-driven, and genuinely helpful! By the time I hit my messaging limits, I had a comprehensive financial plan and a clear path forward, which I saved to the Claude project so I can reference it whenever I want.

What This Means for the Future of AI Financial Advice

This experiment revealed both the potential and current limitations of AI in personal finance:

Data Processing Matters: The ability to correctly interpret financial data—including understanding what to ignore—is crucial. This is where Claude excelled and others faltered.
Context is King: Generic financial advice is everywhere. What's valuable is AI that can understand your specific situation and habits.
Visualization + Analysis: The most helpful insights came when AI could both crunch the numbers AND present them in an intuitive way.

The Bottom Line

While all three AI models showed promise, only Claude provided the kind of comprehensive, personalized financial analysis that could genuinely help someone improve their financial health. It's a glimpse of what's possible when AI moves beyond simple number-crunching to deliver truly personalized financial guidance.

Does this mean AI is ready to replace human financial advisors? Not quite. But it suggests that AI could become an invaluable tool for both individuals managing their finances and professionals looking to provide better service to their clients.

For now, if you're considering using AI for financial analysis, my experiment suggests Claude is your best bet—just don't expect it to make the hard decisions for you. After all, choosing between that fancy dinner and your savings goal is still up to you.

August 30, 2024
in LLMs, Evals, Gemini, OpenAI, LangChain
7 min read

Should You Even Trust Gemini’s Million-Token Context Window?

Haystack Made with GPT-4o

📖 Read On Medium

Imagine you’re tasked with analyzing your company’s entire database — millions of customer interactions, years of financial data, and countless product reviews — to extract meaningful insights. You turn to AI for help. You shove all of the data into Google Gemini 1.5, with its new 1 million token context length and start making requests, which it seems to be solving. But a nagging question persists: Can you trust the AI to accurately process and understand all of this information? How confident can you be in its analysis when it’s dealing with such a vast amount of data? Are you going to have to dig through a million tokens worth of data to validate each answer?

Traditional AI tests, like the well-known “needle-in-a-haystack” tests, fall short in truly assessing an AI’s ability to reason across large, cohesive bodies of information. These tests often involve hiding unrelated information (needles) in an otherwise homogeneous context (haystack). The problem is that this puts the focus on information retrieval and anomaly detection rather than comprehensive understanding and synthesis. Our goal wasn’t just to see if Gemini could find a needle in a haystack, but to evaluate whether it could understand the entire haystack itself.

Using a real-world dataset of App Store information, we systematically tested Gemini 1.5 Flash across increasing context lengths. We asked it to compare app prices, recall specific privacy policy details, and evaluate app ratings — tasks that required both information retrieval and reasoning capabilities. For our evaluation platform, we used LangSmith by LangChain, which proved to be an invaluable tool in this experiment.

The results were nothing short of amazing! Let’s dive in.

Setting Up the Experiment

You can follow along with our full experiment in this Jupyter notebook.

Choosing Our Datasets

We need 3 datasets for our experiment:

App Data Dataset: We used the Apple App Store Data Set, a real-world collection of information about 10,000 apps. This dataset provided rich, varied information for us to analyze.
Golden Dataset: We arbitrarily selected 5 apps to use when crafting our Evaluation Dataset questions and ground truth answers. These 5 apps need to be included in the context at every step of the experiment.
Evaluation Dataset: We crafted a set of three questions and answers based on the Golden Dataset. These are the questions we will ask Gemini Flash, and we will evaluate its answer against the ground truth answer we have written.

examples = [
    {
        "question": "Do the 'Sago Mini Superhero' and 'Disney Channel  Watch Full Episodes Movies  TV' apps require internet connection?",
        "answer": "You can play Sago Mini Superhero without wi-fi or internet. Internet is required for Disney Channel  Watch Full Episodes Movies  TV"
    },
    {
        "question": "Where can I find the privacy policy for the 'Disney Channel  Watch Full Episodes Movies  TV' app?",
        "answer": "http://disneyprivacycenter.com/"
    },
    {
        "question": "Which one costs less? The 'KQ MiniSynth' app or the 'Sago Mini Superhero' app?",
        "answer": "The 'KQ MiniSynth' app costs $5.99, the 'Sago Mini Superhero' app costs $2.99. So 'Sago Mini Superhero' is cheaper"
    }
]
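For readers following along, one way to get these examples into LangSmith is to split them into inputs and outputs and push them with the LangSmith client. This is a sketch, not necessarily what the notebook does: the helper names and the dataset name are placeholders of ours, and it assumes the `create_dataset` and `create_examples` methods of `langsmith.Client` with an API key configured in the environment.

```python
def split_examples(examples):
    """Turn question/answer dicts into the parallel lists LangSmith expects."""
    inputs = [{"question": ex["question"]} for ex in examples]
    outputs = [{"answer": ex["answer"]} for ex in examples]
    return inputs, outputs

def upload_examples(client, dataset_name, examples):
    """Create a dataset and attach the examples to it.

    `client` is assumed to be a `langsmith.Client()` instance.
    """
    inputs, outputs = split_examples(examples)
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
    return dataset

# Usage (requires the langsmith package and an API key):
# from langsmith import Client
# upload_examples(Client(), "app-store-eval", examples)
```

Keeping the keys as `question` and `answer` matters, because the evaluator later reads `example.inputs["question"]` and `example.outputs["answer"]`.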

Leveraging Gemini 1.5 Flash

For our AI model, we utilized Google’s Gemini 1.5 Flash. This model allows up to 1 million tokens in its context window, which is roughly 700,000 words! At the time of writing, Gemini 1.5 Flash costs ~$0.70/million input tokens, and that’s without caching. That is comparable to GPT-3.5 or Claude Haiku pricing.
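To get a feel for what that price means, here is a back-of-the-envelope estimate in Python. It assumes a flat $0.70 per million input tokens (the figure quoted above), ignores output tokens and the GPT-4o judge, and uses the sweep described later in this post: three questions at every 50,000-token step up to one million tokens. The function names are ours.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 0.70  # USD, the figure quoted in this post

def query_cost(input_tokens: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost in USD for a single request."""
    return input_tokens / 1_000_000 * price_per_million

def sweep_cost(max_tokens: int = 1_000_000, step: int = 50_000, questions: int = 3,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost in USD for asking every question at every context size."""
    total_tokens = sum(questions * tokens
                       for tokens in range(step, max_tokens + 1, step))
    return query_cost(total_tokens, price_per_million)

print(f"One full-context query: ${query_cost(1_000_000):.2f}")
print(f"Full sweep: ${sweep_cost():.2f}")
```

Under these assumptions the whole sweep sends about 31.5 million input tokens, which works out to roughly $22.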

LangSmith: Our Evaluation Platform

For managing our experiment and evaluating results, we turned to LangSmith by LangChain. LangSmith gives us access to a few features that are perfect for this kind of experiment:

When we upload our evaluation dataset to LangSmith, we can version control, split, and even generate synthetic data, all from the LangSmith console.
All of our experimentation results and traces are tracked in LangSmith for every dataset. The dashboard allows us to visualize the performance across different context lengths.

When running our experiment, automatic evaluation with an LLM as a judge is not required, but it is quite convenient. In our case that means for each question in our dataset:

Gemini Flash takes a whack at answering the question
We have GPT-4o score whether Flash’s answer is correct based on the ground truth answer we have written in the dataset.

This functionality happens through LangSmith Custom Evaluators, which are simple Python functions that return a score for the evaluation:

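# Imports this snippet relies on (assuming recent langchain-core,
# langchain-openai, langsmith, and pydantic releases)
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langsmith.schemas import Example, Run
from pydantic import BaseModel, Field

# We define the scoring schema for the LLM to respond in
# using Pydantic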
class EvaluationSchema(BaseModel):
    """An evaluation schema for assessing the correctness of an answer"""
    reasoning: str = Field(
        description="Detailed reasoning for the evaluation score")
    correct: bool = Field(
        description="Whether the user's answer is correct or not")

# Our evaluation function
def qa_eval(root_run: Run, example: Example):
    """Evaluate the correctness of an answer to a given question"""
    # The question from the dataset example
    question = example.inputs["question"]

    # Gemini's answer
    flash_answer = root_run.outputs["output"]

    # Ground truth answer from the dataset
    correct_answer = example.outputs["answer"]

    # Force GPT-4o to respond in the scoring schema
    llm = ChatOpenAI(model="gpt-4o", temperature=0.4).with_structured_output(EvaluationSchema)

    system_prompt = f"""You are a judge tasked with evaluating a user's answer to a given question.
You will be provided with the question, the correct answer, and the user's thought process and answer.

Question:
{question}

Correct Answer:
{correct_answer}

Your job is to assess the user's answer and provide:
1. Detailed reasoning for your evaluation, comparing the user's answer to the correct answer
2. A boolean judgment on whether the user's answer is correct or not

Be thorough in your reasoning and accurate in your judgment. Consider partial correctness and any nuances in the answers."""

    # Invoke the model with all of the context
    evaluation: EvaluationSchema = llm.invoke(
        [SystemMessage(content=system_prompt),
         HumanMessage(content=flash_answer)]
    )

    score = 1 if evaluation.correct else 0

    return {
        "score": score,
        "key": "correctness",
        "comment": evaluation.reasoning
    }

If you’re not familiar with LangChain or Python, we are simply writing a function (qa_eval) that takes the question from the dataset, Flash’s answer, and the correct answer, and puts them all into a prompt for GPT-4o. We use .with_structured_output to ensure that the LLM responds in a specific schema, which we then convert into the score format that LangSmith expects.

Running the Experiment

We gradually increased the context length up to the full million-token capacity in 50,000 token increments. To generate these varying context lengths, we wrote a function get_context that would:

Start with our “golden dataset” of 5 apps
Add additional app data until we reached the desired token count
Randomize the order of apps in the context to avoid any positional bias
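The three steps above can be sketched roughly as follows. This is an illustration rather than the notebook's actual `get_context`: the function name, its extra parameters, and the 4-characters-per-token estimate are assumptions of ours, and a real run would plug in a proper tokenizer.

```python
import random
from typing import Callable, List, Optional

def approx_token_count(text: str) -> int:
    """Very rough estimate (about 4 characters per token). Assumption only."""
    return max(1, len(text) // 4)

def get_context_sketch(
    target_tokens: int,
    golden_apps: List[str],
    other_apps: List[str],
    count_tokens: Callable[[str], int] = approx_token_count,
    seed: Optional[int] = None,
) -> str:
    """Build a context of roughly `target_tokens` that always holds the golden apps."""
    rng = random.Random(seed)

    # 1. Start with the golden apps, which must be present at every step
    selected = list(golden_apps)
    used = sum(count_tokens(app) for app in selected)

    # 2. Add filler apps until the next one would exceed the token budget
    for app in other_apps:
        cost = count_tokens(app)
        if used + cost > target_tokens:
            break
        selected.append(app)
        used += cost

    # 3. Shuffle so the golden apps don't always sit at the start of the context
    rng.shuffle(selected)
    return "\n\n".join(selected)
```

Passing a `seed` makes a given context reproducible, which is handy when a single step needs to be re-run.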

Target Function

We use this get_context function in our “target function”. The target function is the function that produces the output to be evaluated. In our case the target function:

Fills up the context window with app data up to the number of tokens we are testing in that step
Puts the context into a prompt for Gemini Flash
Queries Gemini Flash with the question from the dataset and returns the model’s response

Here’s a simplified version of our target function:

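def predict(inputs: dict):
    # Token budget for this step: an equal slice of the max context per step
    tokens = int((max_context_limit / total_steps) * current_step)
    context = get_context(tokens)

    # The context has to be embedded in the prompt for Flash to see it
    # (the exact tag layout below is an assumption)
    system_prompt = f"""You are tasked with answering user questions based on the App Store data inside <APP STORE DATA>.
    Use ONLY the information provided in the context. Be as accurate as possible.

    <APP STORE DATA>
    {context}
    </APP STORE DATA>"""

    response = llm.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=inputs["question"])
    ])

    return {"output": response.content}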

We wrap all of this up into a custom Python class to make it easy to keep track of the steps and control each experiment, and we are ready to run!

result = evaluate(
    self.predict, # Our predict function
    data=client.list_examples(dataset_name=dataset_name), # Our evaluation dataset
    evaluators=[qa_eval], # Our custom evaluator
    experiment_prefix=f"{self.model}-{tokens}" # Prefixes the experiments in LangSmith for readability
)
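The wrapper class itself is not shown here, so the following is only a guess at its shape: a small helper that tracks the current step and calls some function once per context size. Every name in it (`ContextSweep`, `tokens_for_step`, `run`) is hypothetical; in a real run, `run_step` would wrap the `evaluate(...)` call above.

```python
from typing import Callable, List

class ContextSweep:
    """Hypothetical helper that walks through context sizes step by step."""

    def __init__(self, max_context_limit: int = 1_000_000, total_steps: int = 20):
        self.max_context_limit = max_context_limit
        self.total_steps = total_steps
        self.current_step = 1

    def tokens_for_step(self, step: int) -> int:
        """Token budget for a given step (50,000 per step with the defaults)."""
        return (self.max_context_limit // self.total_steps) * step

    def run(self, run_step: Callable[[int], object]) -> List[object]:
        """Call `run_step(tokens)` once per step and collect the results."""
        results = []
        for step in range(1, self.total_steps + 1):
            self.current_step = step
            results.append(run_step(self.tokens_for_step(step)))
        return results

sweep = ContextSweep()
sizes = sweep.run(lambda tokens: tokens)  # stand-in for a real evaluate(...) call
print(sizes[0], sizes[-1])
```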

Results

The results of our experiment with Gemini 1.5 Flash were nothing short of amazing! Across all context lengths, from 50,000 tokens all the way up to the full million-token capacity, Gemini 1.5 Flash achieved 100% accuracy in answering our test questions!

Experiment Results in LangSmith

You can view the full test results on LangSmith here.

Let’s break down what this means:

🔬 Perfect Accuracy: Regardless of whether we gave Gemini 1.5 Flash 50,000 tokens of context or 1,000,000 tokens, it consistently provided correct answers to our questions. This level of consistency is impressive and suggests that the model can effectively process and utilize information from extremely large contexts.
🧠 Information Synthesis: Our questions required more than just information retrieval. They involved comparing data points from different parts of the context. Gemini 1.5 Flash’s perfect score indicates a strong ability to understand information across a huge context, not just locate specific strings.

To put this in perspective, at the maximum context length, Gemini 1.5 Flash was accurately answering questions while processing roughly 700,000 words of text, several long books’ worth, in a single query. This is a significant leap beyond traditional document analysis capabilities.

However, it’s important to note some limitations of our experiment:

Question Complexity: Our questions, while requiring synthesis, are relatively straightforward. We deliberately avoided questions requiring complex numerical reasoning or identifying trends across the entire dataset.
Limited Question Set: We used a small set of questions for each evaluation. A larger, more diverse set of questions could provide even more robust insights into the model’s capabilities.
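To put a rough number on the second limitation, we can ask how low the true accuracy could be while still producing an unbroken run of correct answers. The sketch below solves p^n = 0.05 for p, which is the exact form of the statistical "rule of three". It assumes independent trials, which is generous here: the 60 runs (3 questions at 20 context sizes) reuse the same three questions, so the function name and the choice of n are ours and the result should be read as an optimistic bound.

```python
def min_accuracy_bound(n_correct_in_a_row: int, confidence: float = 0.95) -> float:
    """Lower bound on true accuracy after n successes and zero failures.

    Solves p**n = 1 - confidence for p, assuming independent trials.
    """
    return (1 - confidence) ** (1 / n_correct_in_a_row)

print(f"{min_accuracy_bound(60):.3f}")  # treating all 60 runs as independent
print(f"{min_accuracy_bound(3):.3f}")   # treating only the 3 questions as independent
```

With 60 independent runs, the 95% lower bound on accuracy is about 95%. If only the three distinct questions count as independent evidence, it drops to about 37%, which is why a larger question set would make the result far more convincing.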

Despite these limitations, the results are extremely promising. They suggest that Gemini 1.5 Flash can maintain high accuracy and information synthesis capabilities at very low cost, even when dealing with massive amounts of context. This opens up exciting possibilities for applications in various fields, from business analytics to scientific research.