Over the past ten days we've built agents that work. Today we make them reliable. There's a significant gap between a prototype that impresses in a demo and a service that handles real traffic, recovers from failures, and gives you visibility when something goes wrong.
This post is about closing that gap.
A production agentic pipeline has five layers:
[User / Trigger]
↓
[Input validation + guardrails]
↓
[Orchestration layer] ←——→ [Tool layer (APIs, DBs, search)]
↓
[Output validation]
↓
[Logging + tracing]

Nothing here is optional in production. Let's go through each.
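The first layer, input validation and guardrails, rarely needs more than a few lines. A minimal sketch; the length limit and blocked patterns are illustrative placeholders you would tune for your own domain:

import re

MAX_INPUT_CHARS = 4_000
BLOCKED_PATTERNS = [re.compile(r"(?i)ignore (all )?previous instructions")]

def validate_input(raw: str) -> str:
    # Reject oversized or obviously adversarial input before it costs any tokens
    text = raw.strip()
    if not text:
        raise ValueError("Empty input")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"Input exceeds {MAX_INPUT_CHARS} characters")
    if any(p.search(text) for p in BLOCKED_PATTERNS):
        raise ValueError("Input rejected by guardrail")
    return text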
The single biggest source of agent bugs is an LLM returning JSON in a slightly different shape than you expected. Structured output enforces a schema at the boundary: a response that doesn't match is caught (and retried) instead of propagating downstream.
Python: Pydantic + Instructor
import anthropic
import instructor
from pydantic import BaseModel, Field
class WeatherResult(BaseModel):
city: str
temperature_celsius: float
condition: str = Field(description="e.g. 'sunny', 'cloudy', 'rain'")
confidence: float = Field(ge=0.0, le=1.0)
# Patch the Anthropic client with instructor
client = instructor.from_anthropic(anthropic.Anthropic())
result = client.messages.create(
model="claude-3-7-sonnet-20250219",
max_tokens=256,
response_model=WeatherResult,
messages=[{"role": "user", "content": "What's the weather like in Berlin today?"}],
)
# result is a validated WeatherResult instance — guaranteed
print(result.temperature_celsius)  # float, not a string

If the model returns something that doesn't match WeatherResult, instructor automatically retries with the validation error as feedback.
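Because the validation error is fed back to the model on retry, you can push domain rules into the schema itself. A minimal sketch extending WeatherResult with an arbitrary plausibility check (the bounds here are illustrative):

from pydantic import field_validator

class WeatherResult(BaseModel):
    city: str
    temperature_celsius: float
    condition: str = Field(description="e.g. 'sunny', 'cloudy', 'rain'")
    confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("temperature_celsius")
    @classmethod
    def check_plausible(cls, value: float) -> float:
        # If this raises, instructor retries and includes the error text
        # in the next request so the model can correct itself
        if not -90.0 <= value <= 60.0:
            raise ValueError("temperature_celsius looks implausible")
        return value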
TypeScript: Zod
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
const WeatherSchema = z.object({
city: z.string(),
temperatureCelsius: z.number(),
condition: z.enum(["sunny", "cloudy", "rain", "snow"]),
confidence: z.number().min(0).max(1),
});
// Use Claude's tool use to enforce the schema
const tools = [
{
name: "return_weather",
description: "Return structured weather data",
input_schema: {
type: "object" as const,
properties: {
city: { type: "string" },
temperatureCelsius: { type: "number" },
condition: {
type: "string",
enum: ["sunny", "cloudy", "rain", "snow"],
},
confidence: { type: "number", minimum: 0, maximum: 1 },
},
required: ["city", "temperatureCelsius", "condition", "confidence"],
},
},
];
// Force the model to call the tool, then parse and validate its input with Zod
const anthropic = new Anthropic();
const response = await anthropic.messages.create({
  model: "claude-3-7-sonnet-20250219",
  max_tokens: 256,
  tools,
  tool_choice: { type: "tool", name: "return_weather" },
  messages: [{ role: "user", content: "What's the weather like in Berlin today?" }],
});

const toolUseBlock = response.content.find((block) => block.type === "tool_use");
if (!toolUseBlock || toolUseBlock.type !== "tool_use") {
  throw new Error("Model did not return a tool call");
}
const parsed = WeatherSchema.parse(toolUseBlock.input);

Use Pydantic in Python; use Zod in TypeScript. Both give you runtime validation, with static type inference as a bonus.
LLMs fail. APIs return 529s. Tools time out. Your pipeline needs to handle all of this gracefully.
Exponential backoff for transient errors:
import time
import random
from anthropic import RateLimitError, APIStatusError
def call_with_retry(fn, max_retries: int = 3):
for attempt in range(max_retries):
try:
return fn()
except RateLimitError:
wait = (2 ** attempt) + random.random()
print(f"Rate limited. Retrying in {wait:.1f}s...")
time.sleep(wait)
except APIStatusError as e:
if e.status_code >= 500:
time.sleep(2 ** attempt)
else:
raise # 4xx errors are caller bugs — don't retry
raise RuntimeError(f"Failed after {max_retries} retries")Model fallback: if your primary model is unavailable, fall back to a cheaper/faster one:
MODELS = ["claude-3-7-sonnet-20250219", "claude-3-5-haiku-20241022"]
def call_with_fallback(messages: list, tools: list):
for model in MODELS:
try:
return client.messages.create(
model=model, max_tokens=1024,
tools=tools, messages=messages
)
except Exception as e:
print(f"Model {model} failed: {e}. Trying next...")
raise RuntimeError("All models failed")Tool timeouts: always wrap external API calls:
import signal
from contextlib import contextmanager
@contextmanager
def timeout(seconds: int):
    # Note: signal.alarm is Unix-only and must run in the main thread;
    # threaded or Windows deployments need a different timeout mechanism
def handler(signum, frame):
raise TimeoutError(f"Tool timed out after {seconds}s")
signal.signal(signal.SIGALRM, handler)
signal.alarm(seconds)
try:
yield
finally:
signal.alarm(0)
def safe_search(query: str) -> str:
with timeout(5):
        return external_search_api(query)

You cannot debug what you cannot see. Every agent run should emit a structured trace: what the model was asked, which tools were called, what they returned, and how long each step took.
Minimal structured logging:
import json
import time
from datetime import datetime, timezone
def log_event(event_type: str, data: dict):
print(json.dumps({
"timestamp": datetime.utcnow().isoformat(),
"event": event_type,
**data,
}))
# Wrap your tool calls
def traced_tool_call(tool_name: str, inputs: dict, fn):
start = time.time()
log_event("tool_start", {"tool": tool_name, "inputs": inputs})
try:
result = fn()
log_event("tool_end", {"tool": tool_name, "duration_ms": int((time.time()-start)*1000)})
return result
except Exception as e:
log_event("tool_error", {"tool": tool_name, "error": str(e)})
        raise

For LangGraph pipelines, LangSmith is the easiest drop-in:
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls-..."
# That's it: all LangGraph runs are traced automatically

LangSmith gives you a timeline view of every node, input/output at each step, and token counts per call.
| Option | Best for | Cold start | Cost |
|---|---|---|---|
| Serverless function (Lambda, Cloud Run) | Low-traffic, event-triggered agents | ~1–3s | Pay per invocation |
| Container (Docker + ECS/GKE) | Consistent latency, persistent connections | None | Pay always |
| Long-running process (VM) | Agents with in-memory state or local vector DBs | None | Pay always |
| LangGraph Platform | LangGraph pipelines with built-in persistence | None | Managed service |
For most agentic APIs, Cloud Run is the sweet spot: scale-to-zero when idle, no cold-start pain for agentic workflows (latency is dominated by LLM calls, not startup time), and simple deployment.
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]gcloud run deploy my-agent --source . --region europe-west1 --allow-unauthenticatedBefore going live, estimate your per-request cost:
Run a load test with realistic inputs before launch. LLM token costs are easy to underestimate when tool results are large.
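A minimal concurrency sketch for such a test, assuming the agent is exposed as a local HTTP endpoint; the URL and payload are placeholders for your own service:

import asyncio
import time
import httpx

URL = "http://localhost:8080/agent"   # placeholder endpoint for your deployed agent
PAYLOAD = {"query": "What's the weather like in Berlin today?"}

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.time()
    resp = await client.post(URL, json=PAYLOAD, timeout=60)
    resp.raise_for_status()
    return time.time() - start

async def main(n: int = 20) -> None:
    # Fire n concurrent requests and report rough latency percentiles
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(*(one_request(client) for _ in range(n)))
    latencies.sort()
    print(f"p50={latencies[n // 2]:.1f}s  p95={latencies[int(n * 0.95)]:.1f}s")

asyncio.run(main())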
Tomorrow: how do you know your agent is actually working correctly? We cover unit testing tools, trace-based evaluation, LLM-as-judge scoring, and regression testing with golden datasets.