There's an instinct when you start building agent evals to replicate what the big benchmarks do. You see TerminalBench or SWE-bench or whatever, and there's this nice hill to climb. Model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.
2025 was supposed to be "the year of the agents". We saw real agent use cases pushed to production at enterprises and startups, and they were actually useful. These are usually very simple tool-loop agents that devs plug into APIs, letting the LLM call tools to fetch info (RAG) or take actions. A ton of agents popped up in 2025, but not a ton of great ones. You would think that comes down to model capabilities, but what Claude Code taught us is that the harness, or the architecture of the agent, is just as important as the model, if not more.
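To make "tool-loop agent" concrete, here's a minimal sketch of the pattern, assuming a hypothetical `call_model` function and a toy tool registry (neither is any particular vendor's API): the model either asks for a tool call, which the harness executes and feeds back, or it answers and the loop ends.

```python
# Minimal tool-loop sketch. `call_model` and the tool registry are
# hypothetical stand-ins, not a specific vendor's API.
def search_docs(query: str) -> str:
    """Toy retrieval tool (the 'fetch info' / RAG case)."""
    return f"Top result for: {query}"

TOOLS = {"search_docs": search_docs}

def run_agent(task: str, call_model, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages, tools=list(TOOLS))
        if reply["type"] == "tool_call":
            # Execute the requested tool and feed the result back to the model.
            result = TOOLS[reply["name"]](**reply["arguments"])
            messages.append({"role": "tool", "name": reply["name"], "content": result})
        else:  # plain answer: the loop is done
            return reply["content"]
    return "Stopped after max_steps."
```

Everything around this loop, which tools exist, how results get fed back, when to stop, is the harness, and that's the part that turns out to matter as much as the model.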
If you haven't been using Claude Code, I highly recommend you give it a try, even if you're not a programmer. It's magical.
The core pitch of LangChain was interchangeability. Plug-and-play components. Swap ChatGPT for Anthropic for Gemini for whatever. Replace your vector database, swap out your tools, mix and match to your heart's content. Build agents from standardized Lego bricks. It sounded great.
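As an illustration of that pitch (a sketch only; package names and model strings shift between LangChain versions, and you'd need API keys to actually run it), swapping providers was supposed to be a one-line change:

```python
# Illustrative LangChain-style swap; exact packages and model names vary by version.
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")

llm = ChatOpenAI(model="gpt-4o-mini")                    # swap this line...
# llm = ChatAnthropic(model="claude-3-5-sonnet-latest")  # ...for this one

chain = prompt | llm  # same chain, different provider
print(chain.invoke({"text": "Plug-and-play components, standardized Lego bricks."}).content)
```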
I think there's still a place for LangGraph for orchestration. But the rest of it? I don't think LangChain makes sense anymore. Here's why.
I talk a lot about the floor versus the ceiling when it comes to LLMs and agents. The ceiling is the maximum capability when you push these models to the edge of what they can do: complex architectures, novel scientific problems, anything that requires real reasoning. The floor is the everyday stuff, the entry-level human tasks that just need to get done reliably.
For customer service, you want floor models. Cheap, fast, stable. For cutting-edge research or gnarly architectural decisions, you want ceiling models. Expensive, slow, but actually smart.
What I've realized lately is that coding agent workflows should be using both. And most of them aren't.
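A sketch of what "using both" could look like; the model names and the keyword heuristic here are made up for illustration, not a recommendation:

```python
# Hypothetical router: cheap "floor" model for routine work,
# expensive "ceiling" model for tasks that need real reasoning.
FLOOR_MODEL = "cheap-fast-model"         # placeholder name
CEILING_MODEL = "expensive-smart-model"  # placeholder name

HARD_SIGNALS = ("architecture", "design", "refactor across", "race condition")

def pick_model(task: str) -> str:
    needs_ceiling = any(signal in task.lower() for signal in HARD_SIGNALS)
    return CEILING_MODEL if needs_ceiling else FLOOR_MODEL

print(pick_model("rename this variable everywhere"))  # -> floor model
print(pick_model("design the caching architecture"))  # -> ceiling model
```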
I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that completely changed how I think about the evaluation layer.
What if your coding agent is the evaluation layer?
Recently, Andrej Karpathy did the Dwarkesh Patel podcast, and one of the stories he told stuck out to me.
He said they were running an experiment where an LLM-as-a-judge scored a student LLM. All of a sudden, he says, the loss went straight to zero, meaning the student LLM was getting 100% out of nowhere. So either the student LLM had achieved perfection, or something had gone wrong.
They dug into the outputs, and it turns out the student LLM was just outputting the word "the" a bunch of times: "the the the the the the the." For some reason, that tricked the LLM-as-a-judge into giving a passing score. It was just an anomalous input that gave them an anomalous output, and it broke the judge.
It's an interesting story in itself, just on the flakiness of LLMs, but we knew that already. I think the revelation for me here is that if outputting the word "the" a bunch of times is enough to get an LLM to behave in ways you wouldn't expect, then how random is the process of prompting? Are there scenarios where putting "the the the the the" in the system prompt a bunch of times fixes a problem behavior, or creates a behavior you were trying to get to?
We treat prompting like we're speaking to an entity: if we can get really clear instructions into the system prompt, we can steer these LLMs as if they're just humans who are a little less smart. But that doesn't seem to be the case, because even a dumb human wouldn't interpret the word "the" repeated a bunch of times as some kind of successful response. These things are more enigmatic than we treat them as being. It's not too far removed from random at this point.
I recently gave a talk at the LangChain Miami meetup about evals. This blog encapsulates the main points of the talk.
AI agent testing is one of the biggest challenges in building reliable LLM applications. Unlike traditional software, AI agents face a practically unbounded space of inputs and outputs, which makes manual testing slow and incomplete. This guide covers practical AI agent evaluation strategies that will help you move from manual testing to automated evaluation frameworks.
I build AI agents for work, and for a long time, I was iterating on them in the worst way possible.
The test-adjust-test-adjust loop is how you improve agents. You try something, see if it works, tweak it, try again. Repeat until it's good enough to ship. The problem isn't the loop itself—it's how slow and painful that loop can be if you're doing it manually.
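Here's a minimal sketch of what automating that loop can look like, assuming a placeholder `run_agent` function and two toy test cases: a list of inputs, a check per input, and a score you can rerun after every tweak.

```python
# Hypothetical eval harness: rerun after every prompt/tool change
# instead of eyeballing transcripts by hand.
CASES = [
    {"input": "Cancel order #123", "check": lambda out: "cancel" in out.lower()},
    {"input": "What's your refund policy?", "check": lambda out: "refund" in out.lower()},
]

def evaluate(run_agent) -> float:
    passed = 0
    for case in CASES:
        output = run_agent(case["input"])      # run_agent is your agent under test
        passed += bool(case["check"](output))  # count passing cases
    return passed / len(CASES)

# score = evaluate(my_agent)  # e.g. 0.5 -> tweak the prompt -> rerun
```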
In the world of AI dev, there’s a lot of excitement around multi-agent frameworks—swarms, supervisors, crews, committees, and all the buzzwords that come with them. These systems promise to break down complex tasks into manageable pieces, delegating work to specialized agents that plan, execute, and summarize on your behalf. Picture this: you hand a task to a “supervisor” agent, it spins up a team of smaller agents to tackle subtasks, and then another agent compiles the results into a neat little package. It’s a beautiful vision, almost like a corporate hierarchy with you at the helm. And right now, these architectures and their frameworks are undeniably cool. They’re also solving real problems as benchmarks show that iterative, multi-step workflows can significantly boost performance over single-model approaches.
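For concreteness, the supervisor pattern these frameworks sell boils down to something like this sketch (the `call_model` function and the prompts are hypothetical placeholders):

```python
# Hypothetical supervisor/worker sketch: split, delegate, compile.
def supervise(task: str, call_model) -> str:
    plan = call_model(f"Break this task into 3 subtasks, one per line:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    results = [call_model(f"Complete this subtask: {sub}") for sub in subtasks]
    return call_model("Combine these results into one answer:\n" + "\n".join(results))
```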
But these frameworks are a temporary fix, a clever workaround for the limitations of today's AI models. As models get smarter, faster, and more capable, the need for this intricate scaffolding will fade. We're building hammers and hunting for nails, when the truth is that the nail (the problem itself) might not even exist in a year. Let me explain why.