AI Development

March 6, 2026
in AI Development, Architecture, Best Practices
4 min read

Ship SDKs, not MCPs

Ship SDKs header

If you're running a SaaS, you're probably thinking about how AI agents will interact with your product. The current hype cycle says build an MCP server. If you're a little more tapped in, you've probably moved on to shipping a CLI, like Google's Workspace agent tools. Both are fine. But I think the real play is the SDK.

March 5, 2026
in AI Development, Evals, Experiments
4 min read

Advanced tool calling cooks

PokeBench header

ReAct vs Programmatic tool calling

ReAct agents call one tool at a time. Think, call, observe, repeat. Every step is a full round trip to the LLM. It works, but it's slow, and each loop is another chance for the model to lose the thread.

Anthropic's programmatic tool calling takes a different approach: Claude writes a code block that orchestrates multiple tool calls at once. Instead of five round trips for a five-tool task, one generation handles the whole thing.

I wanted to know how much this actually matters on a real eval. So I built one.

February 26, 2026
in AI Development, Product Design, AI Impact
4 min read

Don't give up on UI

Don't Give Up on UI

There's a growing crowd on Twitter that thinks frontend is dying. That the future of AI is pure chat. OpenClaw runs through Telegram. Poke operates through texting. Manus is adding WhatsApp. No app, no UI, just messages.

And I'm over here so fatigued by chat that I barely read the responses anymore.

February 19, 2026
in AI Development, Evaluation, Best Practices
8 min read

Evals 102: Let Your Agent Do the Hard Parts

Evals 102 Header

I gave a follow-up talk at the LangChain meetup on more advanced eval patterns. This blog covers the main ideas.

A few months ago I gave a talk on the basics of agent evaluation. The gist: stop copy-pasting test prompts like a caveman. Build datasets. Run evaluators. Measure things. That talk covered tests versus evals, black-boxing your agent's output, and a pattern I call "stress testing": find an issue, build a small dataset around it, iterate until it's fixed.

This is the sequel. I'm assuming you're using coding agents (Claude Code, Cursor, Codex, whatever). If you're not, the ideas still apply, but the real unlock is what happens when your coding agent becomes part of the eval process itself.

February 15, 2026
in AI Development, Experiments
5 min read

What Claude SSH Actually Unlocks

SSH Agent Server

I wrote about the AI PC concept back in December. The idea: a personal server that you own, running a codebase that your AI agents can connect to, edit, and serve from. An app that persists and grows with you as you use it.

Claude Code just shipped native SSH support in Desktop. So I spun up an EC2 instance and tested it. I wanted to see how close we are to making the AI PC real. The answer: closer than I expected.

The experiment

The plan was simple. Compare two approaches side by side:

Traditional SSH: Use Claude Code CLI locally and have it SSH into the server
Native Desktop SSH: Use Claude Desktop's new built-in SSH feature, which runs Claude Code directly on the remote machine

I spun up a tiny EC2 instance, grabbed the SSH key, and started running through a checklist: CRUD operations on files, permissions, config files, file transfers, and eventually building and serving a web app.

Traditional SSH

The first thing I noticed when I told CLI Claude to SSH into the server is that it doesn't open an interactive session. It sends one-off commands. Every interaction looks like:

ssh -i key.pem ec2-user@3.84.197.88 "mkdir -p /home/ec2-user/docs && cat > /home/ec2-user/docs/01-attention-mechanism.md << 'DOCEOF'
# The Attention Mechanism
..."

Every file operation goes through the Bash tool. Claude can't use its native Read, Write, or Edit tools because those target the local filesystem. So it falls back to cat with heredocs for writes and reads. No structured diffs, no clean UI, just raw shell commands piped through SSH.

The permissions UX is also terrible. You're approving raw Bash commands, but you can't tell whether a cat is a read or a write. Claude suggested SSHFS as a workaround (mount the remote filesystem locally), but that doesn't give you a remote terminal, you have to set it up every time, and it'll never work on mobile. Dead end.

Traditional SSH confines the agent to shell scripting.

Native Desktop SSH

Then I connected through Claude Desktop's native SSH feature. Same permissioning prompts as local. Same tool suite:

Read: native file reading, no cat over SSH
Edit: structured edits with diffs, no heredocs
Write: clean file creation, same UI as local

It felt like working on my own machine, except the filesystem was on a server. One quirk: Desktop doesn't support ! bang commands like the CLI does, so everything has to go through the agent. Minor annoyance.

Config files work (and they're fully isolated)

I added a CLAUDE.md on the server with instructions to talk like a pirate and set bypassPermissions in ~/.claude/settings.json. Fresh session: both worked. Pirate speak, zero permission prompts. Full yolo mode on a remote server.

But then I put my phone number in my local global ~/.claude/CLAUDE.md and asked SSH Claude if it knew it. Nope. Fully isolated. It doesn't see:

Your local global CLAUDE.md
Your local skills
Any local config whatsoever

Double-edged sword. You get full separation per server (different personality, tools, permissions for each), but if you've built up a library of custom skills locally, you'd need to recreate them on each server. No inheritance, no merging.

The trust boundary

A disposable server is the perfect sandbox for an agent. Worst case, roll back to a snapshot. This is actually safer than giving an agent full permissions on your personal laptop.

Conservative: Don't allow sudo. The agent can go wild with everything else.
Full yolo: Allow sudo too, but snapshot first.

The boundary gets interesting when the agent needs cloud access. Yolo on a server is fine, the blast radius is one disposable box. Yolo with AWS credentials means the agent could spin up resources, delete things, rack up bills. Totally different risk profile.

So I split it: SSH Claude builds and serves, local Claude handles infra. Two agents, one trust boundary.

Trust boundary model

Proof of concept

I told SSH Claude to set up a web app. When it needed port 80 opened, it told me. Local Claude handled that. Done.

Hello World served from EC2

"Hello, World!" served live from the EC2 instance. No deploy pipeline. No git push. No CI/CD. The agent just... built it and served it.

To demo it, I opened the url on my phone, asked claude to the change colors and the message, and refreshed the page on my phone to see the changes instantly... The implications of this are huge.

Where this is going

This unlocks a type of app that's never existed before. An app where the developer and the user are the same person, and the app is always running while it's being built.

Think about what just happened with the demo. I was using the app on my phone. I thought of a change. I told the agent. The change was live in the app I was already looking at. No redeploy, no waiting, no switching contexts. The app and the development environment are the same thing.

And because it's your server, you can bring it anywhere. Claude connects to it today, Codex connects to it tomorrow. Switch providers whenever you want, the app and all your data comes with you.

Now scale that up. You upload your schedule, your resume, your contacts, all the stuff you currently track across five different apps and a notes folder. The agent turns it into a personal app, served from your server. Then your life changes. New job, new address, kid starts school. You tell Claude, and the app morphs a little bit more to what you need. It's a living thing that adapts because using it is the same as building it.

Building toward that next.

January 14, 2026
in AI Development, Evaluation, Best Practices
4 min read

Golden Datasets Are Dead

Golden Dataset Header

There's an instinct when you start building agent evals to replicate what the big benchmarks do. You see TerminalBench or SWE-bench or whatever, and there's this nice hill to climb. Model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.

It doesn't work. Here's why.

December 29, 2025
in AI Development
9 min read

The Year of the AI PC

AI PC

2025 was supposed to be "the year of the agents". We saw real agent use cases being pushed to production at enterprises and startups and actually being useful. These are usually very simple tool-loop agents that devs plug into APIs, allowing LLMs to use tools to fetch info (RAG) or to take actions. A ton of agents popped up in 2025, but not a ton of great ones. You would think this was due to model capabilities, but what Claude Code taught us is that the harness, or the architecture of the agent, is just as important as the model, if not more.

If you haven't been using Claude Code, I highly recommend you give it a try, even if you're not a programmer. It's magical.

December 17, 2025
in AI Development, Architecture, Best Practices
4 min read

Stop using LLM frameworks

Build direct

The core pitch of LangChain was interchangeability. Plug-and-play components. Swap ChatGPT for Anthropic for Gemini for whatever. Replace your vector database, swap out your tools, mix and match to your heart's content. Build agents from standardized Lego bricks. It sounded great.

I think there's still a place for LangGraph for orchestration. But the rest of it? I don't think LangChain makes sense anymore. Here's why.

November 25, 2025
in AI Development, Software Engineering, Best Practices
4 min read

Floor vs Ceiling: Different Models for Different Jobs

I talk a lot about the floor versus the ceiling when it comes to LLMs and agents. The ceiling is the maximum capability when you push these models to the edge of what they can do: complex architectures, novel scientific problems, anything that requires real reasoning. The floor is the everyday stuff, the entry-level human tasks that just need to get done reliably.

For customer service, you want floor models. Cheap, fast, stable. For cutting-edge research or gnarly architectural decisions, you want ceiling models. Expensive, slow, but actually smart.

What I've realized lately is that coding agent workflows should be using both. And most of them aren't.

November 25, 2025
in AI Development, Evaluation, Prompt Engineering
7 min read

The Meta-Evaluator: Your Coding Agent as an Eval Layer

Meta-Evaluator Header

I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that completely changed how I think about the evaluation layer.

What if your coding agent is the evaluation layer?

Let me explain.