Nick Winder

Software & AI Developer | 13 years of building stuff

Trace-Driven Development: How I Use LangSmith and Claude Code to Fix Bugs I Didn't Know I Had

Automate LLM bug fixing with LangSmith traces and Claude Code. Trace-driven development creates a feedback loop for autonomous investigation, debugging, and fixes.

I've been building Miles, an AI-powered running coach that connects to Strava and helps athletes plan their training. It's a pretty standard LLM application — user chats with an AI agent that has access to their workout data through tool calls. Nothing groundbreaking on the surface, but the way I've been finding and fixing bugs has completely changed how I think about application development.

Here's the thing that blew my mind: I've automated the entire loop. LangSmith traces get scanned automatically, problematic conversations get flagged, Claude Code investigates the root cause, designs a fix, and presents me with a plan. The entire pipeline runs with exactly one human decision point. One. That's the moment where I look at the proposed plan and say "yes, do it" or "nah, let's rethink this." Everything else — scanning traces, reading the conversation, understanding the problem, exploring the codebase, designing the solution, writing the code — happens without me.

This trace-driven development workflow might be the tightest feedback loop I've ever worked with.

What Trace-Driven Development Actually Looks Like

Trace-driven development is a debugging and improvement methodology where structured observability traces from LLM applications automatically trigger investigation and code fix proposals, with human approval as the only manual step before implementation.

Traditional debugging goes something like this: a user complains, you try to reproduce the issue, you dig through logs, you form a hypothesis, you investigate the code, you fix it, you test it. Each step involves context-switching and manual effort. It works, but it's slow and it depends heavily on how well the user described their problem.

Trace-driven development flips this on its head. With LangSmith tracing on your LLM application, every single conversation is recorded as a structured trace — the user's messages, the AI's responses, every tool call, every API response, and every failure. When something goes wrong, you don't need the user to describe the problem. You have the entire execution path sitting right there, ready to be analyzed.

| Stage | Traditional Debugging | Trace-Driven Development |
| --- | --- | --- |
| Discovery | User reports issue | Automated trace scanning |
| Investigation | Manual log reading | AI reads structured traces |
| Root cause | Developer-led analysis | Autonomous codebase exploration |
| Solution design | Manual | AI-proposed, human-approved |
| Implementation | Manual coding | Automated with single checkpoint |
| Time to fix | Days to weeks | Minutes to hours |

The workflow I've landed on looks like this:

  1. LangSmith flags a problematic trace — automated rules catch bad experiences: tool calls returning empty results, the AI apologizing for missing data, users having to repeat themselves
  2. Claude Code picks up the flagged trace — it reads the trace via the LangSmith MCP server, sees exactly what went wrong
  3. Claude Code investigates — it explores the codebase, traces the data flow, identifies the root cause
  4. Claude Code proposes a plan — a detailed design with specific code changes, reasoning, and verification steps
  5. I review and approve (or reject) — the one human decision point
  6. Claude Code implements — writes the code, runs the tests, creates the commit

That's the whole thing. Traces in, working code out, with a single human checkpoint in the middle.
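
To make the hand-off between steps 1 and 2 concrete, here's a minimal sketch of what a scan-and-dispatch script could look like in TypeScript. It assumes a LangSmith project called "miles" and an automation rule that tags problem traces with "needs-investigation"; the project name, tag, filter string, and prompt are all illustrative rather than the exact pipeline I run.

```typescript
// scan-traces.ts: sketch of the scan-and-dispatch step (names are illustrative).
import { Client } from "langsmith";
import { execFileSync } from "node:child_process";

const client = new Client(); // picks up LANGSMITH_API_KEY from the environment

async function dispatchFlaggedTraces() {
  // Pull recent traces that the automation rule has tagged as problematic.
  const runs = client.listRuns({
    projectName: "miles",
    filter: 'has(tags, "needs-investigation")', // LangSmith run-filter syntax
  });

  for await (const run of runs) {
    // Hand each flagged trace to Claude Code in headless mode. The prompt asks
    // it to pull the full trace through the LangSmith MCP server and stop at a plan.
    execFileSync(
      "claude",
      [
        "-p",
        `Investigate LangSmith trace ${run.id}: read it via the LangSmith MCP server, ` +
          `find the root cause in this codebase, and write a fix plan for my review.`,
      ],
      { stdio: "inherit" },
    );
  }
}

dispatchFlaggedTraces().catch(console.error);
```

In practice you'd also want to remember which trace IDs have already been dispatched, so the same bad conversation doesn't spawn a second investigation.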

A Real Example: The Missing Workout Data

Let me walk you through a real session to show how this plays out. The automated scan flagged a trace where a user was asking their AI coach about the details of a recent workout. The conversation went something like this: the user asked about their splits, the AI tried to answer, but it could only give vague responses because it didn't actually have access to the detailed workout data.

The frustrating part? The data existed. It was right there in Strava's API. The app just wasn't fetching it at the right level of detail.

The trace got picked up, Claude Code opened a session, and the investigation kicked off automatically.

How Claude Code Investigated

What happened next was genuinely impressive. Claude Code didn't just read the trace and guess. It launched a multi-step investigation that would have taken me an hour of manual digging.

First, it used the LangSmith MCP server to fetch the actual trace data. It could see every message in the conversation, every tool call the AI coach made, and critically, what data each tool call returned. This immediately revealed the gap — the coaching tool was using a summary API endpoint that returned basic activity stats but none of the detail the user was asking about.

Then, in parallel, it spawned an exploration agent to investigate the codebase. This sub-agent read the database schema, the Strava API client library, the agent tool definitions, and the data flow architecture. It came back with a comprehensive report that mapped exactly what data was being fetched, what was being stored, what was available but unused, and what the gaps were.

The key finding? The Strava client library already had functions for fetching detailed activity data — heart rate, lap splits, pace streams, the whole lot. These functions existed and worked fine. They just weren't wired up to the coaching agent's tools. The coaching tools were only calling the summary endpoint, which doesn't include heart rate or detailed breakdowns.
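
To make that gap concrete, here's a hypothetical sketch of the shapes involved. The type and function names are mine for illustration; they mirror Strava's summary-versus-detail split rather than the actual Miles code.

```typescript
// Hypothetical shapes illustrating the gap; not the actual Miles codebase.
interface SummaryActivity {
  id: number;
  name: string;
  distance: number;     // metres
  moving_time: number;  // seconds
  // no heart rate, no lap splits: exactly the data the coach was missing
}

interface DetailedActivity extends SummaryActivity {
  average_heartrate?: number;
  laps: { lap_index: number; distance: number; moving_time: number }[];
}

// Minimal stand-in for the Strava API client library.
interface StravaClient {
  listAthleteActivities(athleteId: number): Promise<SummaryActivity[]>; // summary endpoint
  getActivityById(activityId: number): Promise<DetailedActivity>;       // detail endpoint
}

// The coaching agent's tool was only ever reaching the summary endpoint...
async function getRecentActivities(strava: StravaClient, athleteId: number) {
  return strava.listAthleteActivities(athleteId);
}

// ...while the detail function existed in the client but wasn't exposed as a tool.
async function getActivityDetail(strava: StravaClient, activityId: number) {
  return strava.getActivityById(activityId);
}
```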

Here's what makes this different from traditional debugging: Claude Code didn't just identify what was wrong. By reading the LangSmith trace, it understood why it mattered from the user's perspective. It could see the user trying to discuss their workout splits, the AI admitting it didn't have that data, and the user getting frustrated. That context shaped the solution.

The Plan

Claude Code's proposed fix was elegant in its practicality. Rather than doing one big refactor, it designed a two-tier approach:

Tier 1: Enrich the existing activity listing. When the coaching agent fetches recent activities, make parallel API calls to get the detailed version of each activity. This adds descriptions, heart rate data, and basic metrics to every activity the AI can see. Small cost in API calls, massive improvement in coaching quality.

Tier 2: Add a dedicated detail tool. Create a new agent tool specifically for drilling into a single activity's lap-by-lap data. This is the expensive call (streams, splits, segment efforts), so it only gets triggered when the AI needs to analyze specific workout structure — not on every conversation.

The plan even included rate limit analysis. The app has a budget of 100 Strava API requests per 15-minute window. Enriching 10 activities with detail calls would use 11 requests total (1 for the listing, 10 for the details). Plenty of headroom.
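
For a sense of what Tier 1 could look like, here's a rough sketch that reuses the hypothetical StravaClient interface from the earlier snippet. It's illustrative only, not the code Claude Code actually wrote, but it captures the 1 + 10 = 11 request budget.

```typescript
// Sketch of Tier 1: enrich the activity listing with parallel detail calls.
// Reuses the hypothetical StravaClient and DetailedActivity shapes above.
async function getEnrichedRecentActivities(
  strava: StravaClient,
  athleteId: number,
  limit = 10,
): Promise<DetailedActivity[]> {
  // 1 request for the listing...
  const summaries = (await strava.listAthleteActivities(athleteId)).slice(0, limit);

  // ...plus one detail request per activity, fired in parallel.
  // For limit = 10 that's 11 requests total, comfortably inside a
  // 100-requests-per-15-minute budget.
  return Promise.all(summaries.map((activity) => strava.getActivityById(activity.id)));
}
```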

It documented all of this in a design plan file, specified exactly which files needed changing, described the code modifications, and laid out verification steps. Then it asked me to approve.

The Human in the Loop

This is where I come in, and honestly, it's the most interesting part of the whole workflow. My job at this point isn't to debug or design — it's to make a judgment call. Does this plan make sense? Is the approach right? Are there any concerns about rate limits, token costs, or user experience that the AI might have missed?

In this particular case, I actually rejected the first plan. Not because it was wrong, but because I wanted to think more about the approach before committing to code changes. That's the beauty of the plan-approve-execute workflow — rejecting a plan costs nothing. No code was written, no tests were broken, no branches were polluted with half-baked changes.

The key insight here is that the human decision point isn't about how to fix something. It's about whether to fix it and whether the proposed approach aligns with the broader product direction. That's a much more valuable use of human judgment than manually tracing through API responses to figure out why a field is null.

Why This Works So Well

There are a few things that make this workflow genuinely powerful, and they all come down to tooling convergence.

LangSmith gives you structured observability. Every LLM conversation is a trace with typed inputs and outputs on every step. When something goes wrong, you don't get a vague error log — you get the complete execution path with full context. And because LangSmith has an MCP server, Claude Code can read these traces programmatically, not just look at screenshots.

Claude Code gives you autonomous debugging. The ability to spawn sub-agents that read your actual codebase, not just a description of it, means the investigation is grounded in reality. When Claude Code says "the getStravaActivities tool returns SummaryActivity which doesn't include heart rate data," it's not hallucinating — it read the type definitions and the API client code.

The plan-and-execute pattern gives you safety. Nothing gets implemented without human approval. The AI can investigate as deeply as it wants, propose whatever solution it thinks is best, but the human always gets the final say before any code changes happen. This means you can be aggressive about giving Claude Code latitude to explore without worrying about unwanted changes.

The Feedback Loop

What really makes this approach sing is the feedback loop it creates. Here's the cycle:

Users interact with your AI app
        ↓
LangSmith records every conversation as structured traces
        ↓
Automated rules flag traces where something went wrong
        ↓
Claude Code reads the flagged trace + investigates the codebase
        ↓
Claude Code proposes a fix
        ↓
You approve (or refine)
        ↓
Code ships → Users get a better experience
        ↓
(Repeat)

The tightness of this loop is what matters. In traditional development, the gap between "user has a bad experience" and "fix ships" is measured in days or weeks. With trace-driven development and proper LLM observability, it can be measured in minutes. And because the investigation is automated, you catch issues you'd never have noticed manually — subtle data gaps, tool calls returning incomplete results, conversations where the AI had to apologize for not having information that was technically available.

Setting This Up

If you want to try this yourself, here's what you need:

LangSmith tracing on your LLM application. If you're using LangChain or LangGraph, this is a few lines of configuration. Even if you're not, LangSmith's SDK supports tracing arbitrary LLM calls. The important thing is that your tool calls and their results are captured.
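
If you're outside LangChain, a minimal sketch with the LangSmith TypeScript SDK looks something like this. It assumes LANGSMITH_TRACING=true and LANGSMITH_API_KEY are set in the environment, and the tool body is a stand-in for a real Strava call.

```typescript
// Sketch of tracing a single tool call with the LangSmith SDK, outside of
// LangChain/LangGraph. The wrapped function's inputs, outputs, and errors are
// recorded as a run in the trace, which is what the automated scan and
// Claude Code later read.
import { traceable } from "langsmith/traceable";

const getRecentActivities = traceable(
  async (athleteId: number) => {
    // ...call the Strava API here; the return value is captured in the trace
    return [{ id: 1, name: "Morning Run", distance: 8046 }];
  },
  { name: "getRecentActivities", run_type: "tool" },
);

getRecentActivities(42).then((activities) => console.log(activities.length));
```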

LangSmith rules and alerts. This is what automates the scanning. Set up rules that flag traces matching failure patterns — tool calls returning empty results, conversations where the AI couldn't answer a question, users asking follow-up questions that suggest dissatisfaction. LangSmith's rule system lets you filter on trace metadata, output content, and latency, so you can get surprisingly specific about what counts as a problem worth investigating.

The LangSmith MCP server connected to Claude Code. This is what lets Claude Code actually read your traces programmatically. When a flagged trace triggers the pipeline, Claude Code pulls the full trace data through the MCP server — every message, every tool call, every response.

A plan-and-execute workflow. You want Claude Code to investigate and design before it starts writing code. This is where tools like Claude Code plugins come in — you can set up skills that enforce a brainstorm-plan-approve-implement workflow so the AI doesn't just start hacking away at your codebase.

Codebase context. Claude Code needs to be able to read your actual source files, understand your architecture, and trace data flows. This works best when your project has a good CLAUDE.md file that describes the structure, and when the codebase is in a state where Claude Code can navigate it effectively.

What's Next

I've been running this workflow for a few weeks now, and it's changed how I think about monitoring my application. I used to look at LangSmith traces to check that things were working. Now I barely look at them at all — the automated pipeline surfaces the problems and proposes fixes before I even know there's an issue. Every trace where a user didn't get exactly what they needed becomes a proposed improvement sitting in my queue, waiting for a thumbs up.

The natural evolution is to get smarter about what gets flagged. Right now the rules are relatively simple — empty tool results, AI apologies, repeated user questions. But you could train a classifier on traces that led to accepted fixes versus traces that were noise, and progressively tighten the signal. The pipeline gets better at finding real problems, and your application improves faster with less human filtering.

The other direction is removing the human checkpoint entirely for low-risk fixes. If Claude Code proposes a change that only adds data to an existing tool response — no schema changes, no new endpoints, no permission changes — maybe that just ships automatically with a notification. Save the human review for architectural decisions and anything that touches auth or data models.

For now, the single-checkpoint version already feels like a superpower. Reviewing a well-structured plan is a lot more fun than stepping through API responses trying to figure out why average_heartrate is always null.

The tools exist. The workflow works. The only question is how much of the loop you want to close.