From Prototype to Production: Deploying a Reliable Agentic Pipeline

A prototype that works on your laptop is not a production service. This post covers the plumbing decisions that separate the two: structured output with Zod and Pydantic, retry and fallback strategies, trace logging, and how to choose between serverless and long-running deployment models.


Over the past ten days we've built agents that work. Today we make them reliable. There's a significant gap between a prototype that impresses in a demo and a service that handles real traffic, recovers from failures, and gives you visibility when something goes wrong.

This post is about closing that gap.

Reference Architecture

A production agentic pipeline has five layers:

[User / Trigger]
        ↓
[Input validation + guardrails]
        ↓
[Orchestration layer]  ←——→  [Tool layer (APIs, DBs, search)]
        ↓
[Output validation]
        ↓
[Logging + tracing]

Nothing here is optional in production. Let's go through each.
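Before going layer by layer, here is the whole skeleton in code (every name below is a hypothetical stand-in, not a prescribed API):

from pydantic import BaseModel

class AgentOutput(BaseModel):                  # layer 4: output schema
    answer: str
    sources: list[str] = []

def validate_input(text: str) -> str:          # layer 2: guardrails
    if len(text) > 10_000:
        raise ValueError("Input too long")
    return text.strip()

def run_agent(text: str) -> dict:              # layer 3: orchestration + tools (stubbed here)
    return {"answer": f"You asked: {text}", "sources": []}

def handle_request(user_input: str) -> dict:
    cleaned = validate_input(user_input)             # layer 2
    raw = run_agent(cleaned)                         # layer 3
    validated = AgentOutput.model_validate(raw)      # layer 4
    print(f"run_complete: {len(cleaned)} chars in")  # layer 5, see Trace Logging below
    return validated.model_dump()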

Structured Output: Make Tool Calls Predictable

The single biggest source of agent bugs is an LLM returning JSON in a slightly different shape than you expected. Structured output enforces a schema on every response: anything that doesn't validate is rejected (and retried) before it reaches your business logic.

Python: Pydantic + Instructor

import anthropic
import instructor
from pydantic import BaseModel, Field
 
class WeatherResult(BaseModel):
    city: str
    temperature_celsius: float
    condition: str = Field(description="e.g. 'sunny', 'cloudy', 'rain'")
    confidence: float = Field(ge=0.0, le=1.0)
 
# Patch the Anthropic client with instructor
client = instructor.from_anthropic(anthropic.Anthropic())
 
result = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=256,
    response_model=WeatherResult,
    messages=[{"role": "user", "content": "What's the weather like in Berlin today?"}],
)
 
# result is a validated WeatherResult instance — guaranteed
print(result.temperature_celsius)  # float, not a string

If the model returns something that doesn't match WeatherResult, instructor feeds the validation error back to the model and retries, up to the limit you set with max_retries.
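For example, to allow up to three correction attempts (assuming instructor's max_retries argument on the patched create call):

result = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=256,
    response_model=WeatherResult,
    max_retries=3,  # re-prompt with the validation error up to three times
    messages=[{"role": "user", "content": "What's the weather like in Berlin today?"}],
)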

TypeScript: Zod

import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

const WeatherSchema = z.object({
  city: z.string(),
  temperatureCelsius: z.number(),
  condition: z.enum(["sunny", "cloudy", "rain", "snow"]),
  confidence: z.number().min(0).max(1),
});

// Use Claude's tool use to enforce the schema at the API level
const tools = [
  {
    name: "return_weather",
    description: "Return structured weather data",
    input_schema: {
      type: "object" as const,
      properties: {
        city: { type: "string" },
        temperatureCelsius: { type: "number" },
        condition: {
          type: "string",
          enum: ["sunny", "cloudy", "rain", "snow"],
        },
        confidence: { type: "number", minimum: 0, maximum: 1 },
      },
      required: ["city", "temperatureCelsius", "condition", "confidence"],
    },
  },
];

const client = new Anthropic();

// Force the tool call so the response always contains a tool_use block
const response = await client.messages.create({
  model: "claude-3-7-sonnet-20250219",
  max_tokens: 256,
  tools,
  tool_choice: { type: "tool", name: "return_weather" },
  messages: [{ role: "user", content: "What's the weather like in Berlin today?" }],
});

// Parse and validate the tool call input with Zod
const toolUseBlock = response.content.find((block) => block.type === "tool_use");
if (!toolUseBlock || toolUseBlock.type !== "tool_use") {
  throw new Error("Model did not return a tool call");
}
const parsed = WeatherSchema.parse(toolUseBlock.input); // typed, validated object

Use Pydantic in Python; use Zod in TypeScript. Both give you runtime validation with TypeScript/Python type inference as a bonus.

Retry and Fallback Strategies

LLMs fail. APIs return 529s. Tools time out. Your pipeline needs to handle all of this gracefully.

Exponential backoff for transient errors:

import time
import random
from anthropic import RateLimitError, APIStatusError
 
def call_with_retry(fn, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            wait = (2 ** attempt) + random.random()
            print(f"Rate limited. Retrying in {wait:.1f}s...")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500:
                time.sleep(2 ** attempt)
            else:
                raise  # 4xx errors are caller bugs — don't retry
    raise RuntimeError(f"Failed after {max_retries} retries")
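Usage is just a matter of wrapping the call in a lambda (sketch; assumes a plain anthropic.Anthropic() client named client):

response = call_with_retry(
    lambda: client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Summarize the open incidents."}],
    )
)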

Model fallback: if your primary model is unavailable, fall back to a cheaper/faster one:

MODELS = ["claude-3-7-sonnet-20250219", "claude-3-5-haiku-20241022"]
 
def call_with_fallback(messages: list, tools: list):
    for model in MODELS:
        try:
            return client.messages.create(
                model=model, max_tokens=1024,
                tools=tools, messages=messages
            )
        except Exception as e:
            print(f"Model {model} failed: {e}. Trying next...")
    raise RuntimeError("All models failed")

Tool timeouts: always wrap external API calls. A signal-based approach (Unix, main thread only):

import signal
from contextlib import contextmanager
 
@contextmanager
def timeout(seconds: int):
    def handler(signum, frame):
        raise TimeoutError(f"Tool timed out after {seconds}s")
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
 
def safe_search(query: str) -> str:
    with timeout(5):
        return external_search_api(query)
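Where SIGALRM isn't available (Windows, or any thread other than the main one), a thread-pool timeout is a portable, if imperfect, alternative. A sketch reusing the external_search_api stub from above:

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

_tool_pool = ThreadPoolExecutor(max_workers=4)  # shared pool for tool calls

def safe_search_portable(query: str, seconds: int = 5) -> str:
    future = _tool_pool.submit(external_search_api, query)
    try:
        return future.result(timeout=seconds)
    except FuturesTimeout:
        # The underlying call keeps running in its thread; we just stop waiting for it
        raise TimeoutError(f"Tool timed out after {seconds}s")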

Trace Logging

You cannot debug what you cannot see. Every agent run should emit a structured trace: what the model was asked, which tools were called, what they returned, and how long each step took.

Minimal structured logging:

import json
import time
from datetime import datetime, timezone
 
def log_event(event_type: str, data: dict):
    print(json.dumps({
        "timestamp": datetime.utcnow().isoformat(),
        "event": event_type,
        **data,
    }))
 
# Wrap your tool calls
def traced_tool_call(tool_name: str, inputs: dict, fn):
    start = time.time()
    log_event("tool_start", {"tool": tool_name, "inputs": inputs})
    try:
        result = fn()
        log_event("tool_end", {"tool": tool_name, "duration_ms": int((time.time()-start)*1000)})
        return result
    except Exception as e:
        log_event("tool_error", {"tool": tool_name, "error": str(e)})
        raise
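For example, wrapping the search tool from earlier:

result = traced_tool_call(
    "web_search",
    {"query": "weather in Berlin"},
    lambda: safe_search("weather in Berlin"),
)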

For LangGraph pipelines, LangSmith is the easiest drop-in:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls-..."
# That's it — all LangGraph runs are traced automatically

LangSmith gives you a timeline view of every node, input/output at each step, and token counts per call.

Deployment Options

Option                                   | Best for                                         | Cold start | Cost
-----------------------------------------|--------------------------------------------------|------------|-------------------
Serverless function (Lambda, Cloud Run)  | Low-traffic, event-triggered agents              | ~1–3 s     | Pay per invocation
Container (Docker + ECS/GKE)             | Consistent latency, persistent connections       | None       | Pay always
Long-running process (VM)                | Agents with in-memory state or local vector DBs  | None       | Pay always
LangGraph Platform                       | LangGraph pipelines with built-in persistence    | None       | Managed service

For most agentic APIs, Cloud Run is the sweet spot: scale-to-zero when idle, no cold-start pain for agentic workflows (latency is dominated by LLM calls, not startup time), and simple deployment. A minimal Dockerfile:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
Deploy it with a single command:
gcloud run deploy my-agent --source . --region europe-west1 --allow-unauthenticated

Cost Estimation Checklist

Before going live, estimate your per-request cost:

  • Average input tokens per run (include system prompt, history, tool results)
  • Average output tokens per run
  • Average number of tool calls per run × tool API cost
  • Expected requests per day
  • p99 latency target — does it fit within your timeout budget?
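Putting rough numbers on those bullets is the fastest sanity check. A back-of-envelope sketch (all volumes and prices below are placeholder assumptions; check current model pricing):

INPUT_TOKENS = 6_000          # system prompt + history + tool results, per run
OUTPUT_TOKENS = 800
TOOL_CALLS = 3
TOOL_COST_PER_CALL = 0.001    # e.g. a paid search API, in USD
INPUT_PRICE_PER_MTOK = 3.00   # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00 # USD per million output tokens
REQUESTS_PER_DAY = 5_000

cost_per_request = (
    INPUT_TOKENS / 1_000_000 * INPUT_PRICE_PER_MTOK
    + OUTPUT_TOKENS / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    + TOOL_CALLS * TOOL_COST_PER_CALL
)
print(f"~${cost_per_request:.4f} per request, ~${cost_per_request * REQUESTS_PER_DAY:.0f} per day")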

Run a load test with realistic inputs before launch. LLM token costs are easy to underestimate when tool results are large.

What's Next

Tomorrow: how do you know your agent is actually working correctly? We cover unit testing tools, trace-based evaluation, LLM-as-judge scoring, and regression testing with golden datasets.