
Best Practices

Golden Datasets Are Dead


There's an instinct when you start building agent evals to replicate what the big benchmarks do. You see TerminalBench or SWE-bench or whatever, and there's this nice hill to climb. Model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.

It doesn't work. Here's why.

Stop using LLM frameworks

Build directly

The core pitch of LangChain was interchangeability. Plug-and-play components. Swap ChatGPT for Anthropic for Gemini for whatever. Replace your vector database, swap out your tools, mix and match to your heart's content. Build agents from standardized Lego bricks. It sounded great.

I think there's still a place for LangGraph for orchestration. But the rest of it? I don't think LangChain makes sense anymore. Here's why.
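As a rough illustration of what "building directly" can look like, here is a minimal sketch that calls the provider SDK with one plain function instead of going through a framework layer. It assumes the official openai Python package; the model name and prompt are placeholders, not a prescription from the post.

```python
# A thin, direct call to the provider SDK -- no framework layer in between.
# Assumes the official `openai` package; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, system: str = "You are a helpful assistant.") -> str:
    """One plain function instead of a chain of framework abstractions."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Summarize the tradeoffs of using an LLM framework."))
```

Swapping providers here means changing one function body, which is roughly the interchangeability the frameworks promised in the first place.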

Floor vs Ceiling: Different Models for Different Jobs

Header image: a neoclassical temple scene reimagined in a far-future setting of chrome, holograms, and android workers.

I talk a lot about the floor versus the ceiling when it comes to LLMs and agents. The ceiling is the maximum capability when you push these models to the edge of what they can do: complex architectures, novel scientific problems, anything that requires real reasoning. The floor is the everyday stuff, the entry-level human tasks that just need to get done reliably.

For customer service, you want floor models. Cheap, fast, stable. For cutting-edge research or gnarly architectural decisions, you want ceiling models. Expensive, slow, but actually smart.

What I've realized lately is that coding agent workflows should be using both. And most of them aren't.
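To make the "use both" idea concrete, here is a hypothetical router that sends routine work to a floor model and reasoning-heavy work to a ceiling model. The model names and the routing rule are placeholder assumptions, not a specific recommendation from the post; it assumes the official openai Python package.

```python
# A minimal sketch of routing between a cheap "floor" model and an
# expensive "ceiling" model. Model names and the routing rule are placeholders.
from openai import OpenAI

client = OpenAI()

FLOOR_MODEL = "gpt-4o-mini"   # cheap, fast, stable (placeholder)
CEILING_MODEL = "o1"          # slow, expensive, better at hard reasoning (placeholder)

def route(task: str, needs_reasoning: bool) -> str:
    """Send everyday tasks to the floor model, hard problems to the ceiling model."""
    model = CEILING_MODEL if needs_reasoning else FLOOR_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Everyday edit: floor model. Architectural decision: ceiling model.
    print(route("Rename this variable across the file.", needs_reasoning=False))
    print(route("Propose a migration plan from a monolith to services.", needs_reasoning=True))
```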

AI Agent Testing: Stop Caveman Testing and Use Evals

I recently gave a talk at the LangChain Miami meetup about evals. This post covers the main points of that talk.

Illustration of manual AI agent testing: a developer copy-pasting test prompts by hand.

AI agent testing is one of the biggest challenges in building reliable LLM applications. Unlike traditional software, AI agents have infinite possible inputs and outputs, making manual testing inefficient and incomplete. This guide covers practical AI agent evaluation strategies that will help you move from manual testing to automated evaluation frameworks.

I build AI agents for work, and for a long time, I was iterating on them in the worst way possible.

The test-adjust-test-adjust loop is how you improve agents. You try something, see if it works, tweak it, try again. Repeat until it's good enough to ship. The problem isn't the loop itself—it's how slow and painful that loop can be if you're doing it manually.
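To make the contrast with manual copy-paste testing concrete, here is a minimal sketch of an automated eval loop: a handful of cases, the agent under test, and a simple pass/fail check. The run_agent stub and the substring pass criterion are hypothetical stand-ins for whatever your agent and success metric actually are.

```python
# A minimal sketch of turning the manual test-adjust loop into an automated eval.
# `run_agent` and the pass criterion are placeholder stand-ins.

TEST_CASES = [
    {"input": "Cancel my subscription", "must_contain": "cancel"},
    {"input": "What's your refund policy?", "must_contain": "refund"},
]

def run_agent(user_input: str) -> str:
    """Stub standing in for the agent under test; replace with your real agent call."""
    return f"Echo: {user_input}"

def run_evals() -> float:
    """Run every case, print failures, and report an overall pass rate."""
    passed = 0
    for case in TEST_CASES:
        output = run_agent(case["input"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case['input']!r} -> {output[:80]!r}")
    score = passed / len(TEST_CASES)
    print(f"{passed}/{len(TEST_CASES)} passed ({score:.0%})")
    return score

if __name__ == "__main__":
    run_evals()
```

Once the loop is a script, every tweak to the agent gets the same battery of checks in seconds instead of another round of copy-pasting prompts by hand.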