
Building LLM Agents That Don't Break in Production


Rena Thomas

Most LLM agent demos look clean. You prompt the model, it calls a tool, returns a result. The happy path works beautifully. Production is a different story.

I've spent the last couple of years building AI agent systems at scale — a YC-backed receptionist platform, multi-provider voice orchestration, healthcare AI with live EHR data. Here's what I've learned about keeping agents alive in production.

The core problem: non-determinism meets real infrastructure

When you chain LLM calls, tool invocations, and external APIs, you're stacking non-determinism on top of fallible infrastructure. Any individual component might fail gracefully. The system fails catastrophically when failures compound.

The first thing I do on any agent project is map the failure modes. Not just "what happens if the LLM returns garbage" but: what happens if the tool call times out halfway through a state mutation? What if the model hallucinates a tool name that doesn't exist? What if a user session disconnects mid-handoff between agents?

These aren't edge cases. They happen constantly.

Design for idempotency first

If your agent modifies state — writes to a database, sends a message, triggers a webhook — that action needs to be idempotent or recoverable. The LLM will sometimes call the same tool twice. Retries will happen. Your system should not care.

On one project, we had an agent that would schedule calendar events. Early versions would occasionally schedule the same event twice because the confirmation step got retried after a timeout. The fix wasn't prompt engineering — it was building an idempotency layer into the scheduling service itself.

async function scheduleEvent(params: EventParams, idempotencyKey: string) {
  // If we've already processed this key, return the stored result
  // instead of creating a duplicate event.
  const existing = await db.events.findByIdempotencyKey(idempotencyKey)
  if (existing) return existing

  // Create the event, then record the key so any retry becomes a no-op.
  const event = await calendarApi.create(params)
  await db.events.create({ ...event, idempotencyKey })
  return event
}

The model doesn't need to know about idempotency. The infrastructure handles it.

Make tool schemas adversarially robust

Your LLM will do unexpected things with tool schemas. It will pass null where you expected a string. It will omit required fields. It will pass fields in the wrong format even when the schema says otherwise.

The solution is to treat every tool invocation like untrusted user input. Validate at the boundary. Coerce what you can. Reject clearly with a useful error message the model can understand and retry.

I've had good results with Zod schemas at every tool boundary — the parse errors are readable enough that the model can self-correct on the next attempt.

Observability is not optional

This is the biggest mistake I see teams make. They deploy agents with no visibility into what the model actually did, what tools were called, what the inputs and outputs were, and how long things took.

You need structured logs for every agent step:

  • Model invocation: prompt, response, token count, latency
  • Tool calls: name, input, output, latency, success/failure
  • State transitions: what changed and why
  • Session metadata: user ID, session duration, handoff events
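One way to sketch this, assuming JSON-lines logs and illustrative field names (swap `console.log` for whatever ships to your log pipeline):

```typescript
// Illustrative event shape covering the four categories above.
type AgentLogEvent = {
  kind: "model_invocation" | "tool_call" | "state_transition" | "session_meta"
  sessionId: string
  timestamp: string // ISO 8601
  latencyMs?: number
  detail: Record<string, unknown> // prompt/response, tool input/output, etc.
}

// Emit one JSON object per line so events are trivially
// queryable and replayable later.
function logAgentEvent(event: AgentLogEvent): string {
  const line = JSON.stringify(event)
  console.log(line)
  return line
}
```

Structured events like this are what make session replay possible: filter by `sessionId`, sort by `timestamp`, and you can reconstruct exactly what the agent did.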

Without this, debugging a production failure is archaeology. With it, you can replay sessions, spot failure patterns, and actually improve the system.

The handoff problem

Multi-agent systems introduce a new category of failure: the handoff. When one agent delegates to another, you need to transfer context cleanly. The receiving agent needs enough information to continue without asking the user to repeat themselves.

This is harder than it sounds because LLM context windows have limits, handoffs have latency, and the two agents may have different system prompts, tool sets, and operational modes.

My approach is to maintain a structured "session state" object alongside the raw conversation history — a distilled representation of what's been established, what the user wants, and what's been done. This travels with every handoff and can be rendered into any agent's context without bloating the full history.
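A sketch of what that distilled object might look like; the field names are illustrative, not a fixed schema:

```typescript
// Distilled session state that travels with every handoff,
// separate from the raw conversation history.
interface SessionState {
  userGoal: string
  established: Record<string, string> // facts confirmed so far
  completedActions: string[]
  pendingActions: string[]
}

// Render the state into a compact context block for the receiving
// agent, instead of replaying the full conversation history.
function renderForHandoff(state: SessionState): string {
  const known = Object.entries(state.established)
    .map(([key, value]) => `${key}=${value}`)
    .join(", ")
  return [
    `Goal: ${state.userGoal}`,
    `Known: ${known || "nothing yet"}`,
    `Done: ${state.completedActions.join(", ") || "none"}`,
    `Pending: ${state.pendingActions.join(", ") || "none"}`,
  ].join("\n")
}
```

Because the rendered block is bounded in size, it fits into any agent's context regardless of how long the underlying conversation has run.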

Start with chaos engineering

The most valuable thing I've done on recent agent projects is intentional failure injection early in development. Break tools randomly. Add random latency. Simulate bad model outputs. See what happens.
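Failure injection can be as simple as a wrapper around each tool. A minimal sketch, where the rates and the injectable `rng` hook are assumptions for illustration:

```typescript
// Wrap any async tool so it randomly fails or adds latency.
// `rng` is injectable so tests can force either outcome.
function withChaos<T extends unknown[], R>(
  tool: (...args: T) => Promise<R>,
  opts: { failRate: number; maxDelayMs: number; rng?: () => number },
): (...args: T) => Promise<R> {
  const rng = opts.rng ?? Math.random
  return async (...args: T) => {
    // Random latency before the tool runs.
    await new Promise((resolve) => setTimeout(resolve, rng() * opts.maxDelayMs))
    // Random injected failure, so retry/recovery paths get exercised.
    if (rng() < opts.failRate) throw new Error("chaos: injected tool failure")
    return tool(...args)
  }
}
```

Wrapping every tool this way in a development environment surfaces the retry, timeout, and state-corruption bugs long before real infrastructure does.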

Your agent will either fail gracefully or fail in ways that corrupt state, confuse users, or produce phantom actions. Better to find out in development.

The systems that make it to production reliably are the ones that were designed with failure as the default assumption — not the exception.


This is the first in a series on production AI systems. Next up: building the abstraction layer for multi-provider voice AI.