Testing and Evaluating AI Agents: What Good Looks Like

Traditional unit tests break down for agents — the output is non-deterministic and the logic is distributed across LLM calls and tool invocations. This final post in the series covers four practical evaluation strategies: unit-testing tools in isolation, trace-based evaluation, LLM-as-judge scoring, and golden-dataset regression testing.


This is the final post in our 12-day AI agentic development series. We've built agents with Claude, OpenAI, local models, RAG, LangGraph, and production infrastructure. The question we haven't answered: how do you know any of it actually works?

Traditional testing is hard to apply to agents. You can't write assertEqual(expected_answer, agent.run("What's the capital of France?")) because the phrasing of the answer varies on every run. You need different strategies for different layers of the system.

Four Evaluation Strategies

| Strategy | What it tests | Deterministic? |
| --- | --- | --- |
| Unit test tools | Tool logic in isolation, mocking the LLM | Yes |
| Trace-based evaluation | Did the agent call the right tools in the right order? | Yes |
| LLM-as-judge | Is the final answer correct / high quality? | No (use thresholds) |
| Golden dataset + regression | Does quality hold across versions? | Statistical |

You need all four. They catch different failure modes.


Strategy 1: Unit Test Your Tools

Tools are just functions. Test them like functions — mock the LLM and focus on the tool logic itself.

import pytest
import requests
 
# The tool we want to test
def get_weather(city: str) -> dict:
    """Call a real weather API and return structured data."""
    response = requests.get(f"https://api.weather.example.com/{city}")
    response.raise_for_status()
    data = response.json()
    return {"city": city, "temp_c": data["main"]["temp"], "condition": data["weather"][0]["main"]}
 
# Unit test — mock the HTTP call, test our parsing logic
def test_get_weather_parses_response(requests_mock):
    requests_mock.get(
        "https://api.weather.example.com/Tokyo",
        json={"main": {"temp": 22.5}, "weather": [{"main": "Clear"}]},
    )
    result = get_weather("Tokyo")
    assert result["city"] == "Tokyo"
    assert result["temp_c"] == 22.5
    assert result["condition"] == "Clear"
 
def test_get_weather_raises_on_http_error(requests_mock):
    requests_mock.get("https://api.weather.example.com/Unknown", status_code=404)
    with pytest.raises(requests.HTTPError):
        get_weather("Unknown")

This gives you fast, deterministic tests for every tool in your agent's arsenal. A failing tool test is a clear signal and easy to fix — no LLM involved.


Strategy 2: Trace-Based Evaluation

Even if the LLM says the right thing, it may have reached the answer via the wrong path — skipping a tool call, calling the wrong tool, or calling tools in the wrong order. Trace-based evaluation checks the trajectory, not just the final output.

import pytest
from agents import Runner
 
def tool_calls_in(result) -> list[str]:
    """Names of the function tools the agent invoked during the run."""
    return [
        item.raw_item.name
        for item in result.new_items
        if item.type == "tool_call_item"
    ]
 
@pytest.mark.asyncio
async def test_calls_weather_tool_for_city_question():
    result = await Runner.run(weather_agent, input="What's the weather in Paris?")
    # Assert the trajectory, not the wording of the answer
    assert "get_weather" in tool_calls_in(result)
 
@pytest.mark.asyncio
async def test_skips_weather_tool_for_non_weather_question():
    result = await Runner.run(weather_agent, input="What is the capital of France?")
    assert "get_weather" not in tool_calls_in(result)

For LangGraph pipelines, check which nodes executed:

async def test_editor_approves_on_first_try():
    state = await graph.ainvoke({"topic": "Python type hints", ...})
    assert state["revision_count"] == 1
    assert state["approved"] is True

Trace assertions are deterministic when the tool mock is deterministic — they don't depend on the model's phrasing.
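If you want deterministic tool behaviour without any framework support, one option is to wrap each tool in a small recorder and register the wrapper in place of the real tool. A minimal sketch — `RecordingTool` and `fake_weather` are illustrative names, not part of any SDK:

```python
from typing import Any, Callable

class RecordingTool:
    """Wraps a tool function and records every invocation for trace assertions."""

    def __init__(self, name: str, fn: Callable[..., Any]):
        self.name = name
        self.fn = fn
        self.calls: list[dict[str, Any]] = []

    def __call__(self, **kwargs: Any) -> Any:
        self.calls.append(kwargs)
        return self.fn(**kwargs)

# Deterministic stub standing in for the real weather API
def fake_weather(city: str) -> dict:
    return {"city": city, "temp_c": 20.0, "condition": "Clear"}

get_weather = RecordingTool("get_weather", fake_weather)

# In a real test the agent would invoke the tool; here we simulate one call
result = get_weather(city="Paris")
```

After the agent run, assert on `get_weather.calls` — the recorded arguments are stable across runs even when the model's wording is not.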


Strategy 3: LLM-as-Judge

For final answer quality, use a separate LLM as a grader. This is the standard approach for evaluating open-ended responses.

import anthropic
from pydantic import BaseModel, Field
import instructor
 
class Evaluation(BaseModel):
    score: int = Field(ge=1, le=5, description="1=poor, 5=excellent")
    reasoning: str
    passed: bool  # score >= 3
 
judge_client = instructor.from_anthropic(anthropic.Anthropic())
 
def evaluate_answer(question: str, answer: str, reference: str | None = None) -> Evaluation:
    context = f"\nReference answer: {reference}" if reference else ""
    return judge_client.messages.create(
        model="claude-3-5-haiku-20241022",  # cheap model is fine for judging
        max_tokens=256,
        response_model=Evaluation,
        messages=[{
            "role": "user",
            "content": (
                f"Evaluate this answer to the question.\n\n"
                f"Question: {question}{context}\n\n"
                f"Answer: {answer}\n\n"
                f"Score 1–5 on: factual accuracy, completeness, and clarity."
            ),
        }],
    )
 
# Usage
eval_result = evaluate_answer(
    question="What is RAG and why is it useful for AI agents?",
    answer=agent_output,
    reference="RAG is Retrieval-Augmented Generation...",
)
print(f"Score: {eval_result.score}/5 — {eval_result.reasoning}")

Use claude-3-5-haiku or gpt-4o-mini as the judge — fast and cheap. Save the heavy model for the agent being evaluated.
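One gotcha: the judge can return a passed flag that contradicts its own score. If you're on Pydantic v2, one option is to derive passed from the score with a model validator — a sketch reusing the Evaluation fields above:

```python
from pydantic import BaseModel, Field, model_validator

class Evaluation(BaseModel):
    score: int = Field(ge=1, le=5, description="1=poor, 5=excellent")
    reasoning: str
    passed: bool = False

    @model_validator(mode="after")
    def derive_passed(self) -> "Evaluation":
        # Derive pass/fail from the score so the judge can't contradict itself
        self.passed = self.score >= 3
        return self
```

Now even if the judge emits passed=True alongside a score of 2, the validated object reports a failure.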


Strategy 4: Golden Dataset + Regression Testing

Track quality across versions by running a fixed set of test cases and comparing aggregate scores.

# golden_dataset.py
GOLDEN_CASES = [
    {
        "id": "weather_001",
        "input": "What's the weather in Tokyo?",
        "expected_tools": ["get_weather"],
        "quality_reference": "A response about Tokyo's current weather conditions.",
    },
    {
        "id": "memory_001",
        "input": "My name is Hitesh. Remember that.",
        "expected_tools": ["remember"],
        "quality_reference": "Confirmation that the name has been stored.",
    },
    # ... 20-50 cases covering key user journeys
]
 
async def run_regression_suite(agent, score_threshold: float = 3.5):
    scores = []
    failures = []
 
    for case in GOLDEN_CASES:
        result = await Runner.run(agent, input=case["input"])
 
        # Trace check
        actual_tools = [
            item.raw_item.name
            for item in result.new_items
            if item.type == "tool_call_item"
        ]
        trajectory_ok = all(t in actual_tools for t in case["expected_tools"])
 
        # Quality check
        eval_result = evaluate_answer(
            question=case["input"],
            answer=result.final_output,
            reference=case["quality_reference"],
        )
        scores.append(eval_result.score)
 
        if not trajectory_ok or not eval_result.passed:
            failures.append({
                "id": case["id"],
                "trajectory_ok": trajectory_ok,
                "score": eval_result.score,
            })
 
    avg_score = sum(scores) / len(scores)
    print(f"Average quality score: {avg_score:.2f} / 5.0")
    print(f"Failures: {len(failures)} / {len(GOLDEN_CASES)}")
 
    assert avg_score >= score_threshold, f"Quality regression: {avg_score:.2f} < {score_threshold}"
    return avg_score

Run this in CI before every deployment. If the average score drops below your threshold, the deployment is blocked.
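A fixed threshold works, but you can also gate on the previous release's score so the bar rises with your agent. A minimal sketch — the file name and tolerance are illustrative:

```python
import json
import tempfile
from pathlib import Path

def check_against_baseline(avg_score: float, baseline_path: Path,
                           tolerance: float = 0.25) -> bool:
    """Return True if avg_score is within `tolerance` of the recorded baseline.

    The first run writes the baseline; later runs compare against it.
    """
    if not baseline_path.exists():
        baseline_path.write_text(json.dumps({"avg_score": avg_score}))
        return True
    baseline = json.loads(baseline_path.read_text())["avg_score"]
    return avg_score >= baseline - tolerance

# Demo with a throwaway file
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "eval_baseline.json"
    first = check_against_baseline(4.0, path)      # establishes the baseline
    within = check_against_baseline(3.8, path)     # small dip, allowed
    regressed = check_against_baseline(3.0, path)  # big drop, blocked
```

Commit the baseline file (or store it as a CI artifact) and update it deliberately when a genuine quality improvement lands.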

Evaluation Tooling

You don't have to build all of this from scratch:

| Tool | Best for |
| --- | --- |
| LangSmith | Tracing, LangGraph eval datasets, online feedback capture |
| Braintrust | LLM-as-judge at scale, dataset management, CI integration |
| OpenAI Evals | Evaluating OpenAI models with custom graders |
| RAGAS | Specific to RAG pipelines — measures faithfulness, answer relevance, context recall |
| pytest + custom fixtures | Unit and trace tests (no third-party dependency) |

Start with pytest for tools and traces. Add LangSmith when you need visibility into production runs. Add Braintrust when you need a structured eval dataset with version history.


Wrapping Up the Series

Over the past 12 days we've covered:

  1. What is an agent — the perceive–reason–act loop
  2. Claude SDK — first tool-using agent
  3. OpenAI Agents SDK — handoffs and guardrails
  4. Function calling comparison — Claude vs. OpenAI side by side
  5. Memory patterns — short-term, long-term, episodic
  6. NVIDIA Nemotron 3 — local open-weight agents
  7. MCP — standardised tool connectivity
  8. Assistants API — hosted agent runtime
  9. RAG — retrieval-augmented generation with ChromaDB
  10. LangGraph — multi-agent orchestration
  11. Production pipeline — structured output, retries, logging, deployment
  12. Testing and evaluation — this post

The backlog continues — check back for posts on Claude extended thinking, voice agents, browser-use, and security hardening for production agents.