This is the final post in our 12-day AI agentic development series. We've built agents with Claude, OpenAI, local models, RAG, LangGraph, and production infrastructure. The question we haven't answered: how do you know any of it actually works?
Traditional testing is hard to apply to agents. You can't write `assertEqual("Paris", agent.run("What's the capital of France?"))` because the phrasing of the answer varies on every run. You need different strategies for different layers of the system.
| Strategy | What it tests | Deterministic? |
|---|---|---|
| Unit test tools | Tool logic in isolation, mocking the LLM | Yes |
| Trace-based evaluation | Did the agent call the right tools in the right order? | Yes |
| LLM-as-judge | Is the final answer correct / high quality? | No (use thresholds) |
| Golden dataset + regression | Does quality hold across versions? | Statistical |
You need all four. They catch different failure modes.
Tools are just functions. Test them like functions — keep the LLM out of the loop entirely and mock the external services the tool calls.
```python
import pytest
import requests

# The tool we want to test
def get_weather(city: str) -> dict:
    """Call a real weather API and return structured data."""
    response = requests.get(f"https://api.weather.example.com/{city}")
    response.raise_for_status()
    data = response.json()
    return {"city": city, "temp_c": data["main"]["temp"], "condition": data["weather"][0]["main"]}

# Unit test — mock the HTTP call, test our parsing logic
def test_get_weather_parses_response(requests_mock):
    requests_mock.get(
        "https://api.weather.example.com/Tokyo",
        json={"main": {"temp": 22.5}, "weather": [{"main": "Clear"}]},
    )
    result = get_weather("Tokyo")
    assert result["city"] == "Tokyo"
    assert result["temp_c"] == 22.5
    assert result["condition"] == "Clear"

def test_get_weather_raises_on_http_error(requests_mock):
    requests_mock.get("https://api.weather.example.com/Unknown", status_code=404)
    with pytest.raises(requests.HTTPError):
        get_weather("Unknown")
```

This gives you fast, deterministic tests for every tool in your agent's arsenal. A failing tool test is a clear signal and easy to fix — no LLM involved.
Even if the LLM says the right thing, it may have reached the answer via the wrong path — skipping a tool call, calling the wrong tool, or calling tools in the wrong order. Trace-based evaluation checks the trajectory, not just the final output.
```python
import unittest

from agents import Runner, Agent

class TestWeatherAgent(unittest.IsolatedAsyncioTestCase):  # stdlib runner for async tests
    async def test_calls_weather_tool_for_city_question(self):
        result = await Runner.run(
            weather_agent,
            input="What's the weather in Paris?",
        )
        # Assert the trajectory: which function tools did the agent call?
        tool_calls = [
            item.raw_item.name
            for item in result.new_items
            if item.type == "tool_call_item"
        ]
        self.assertIn("get_weather", tool_calls)

    async def test_does_not_call_weather_tool_for_non_weather_question(self):
        result = await Runner.run(
            weather_agent,
            input="What is the capital of France?",
        )
        tool_calls = [
            item.raw_item.name
            for item in result.new_items
            if item.type == "tool_call_item"
        ]
        self.assertNotIn("get_weather", tool_calls)
```
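These tests still call a real model, and the real `get_weather` would hit a live API. To keep the tool side of the trace deterministic and offline, you can hand the agent under test a canned stub. A sketch using the SDK's `function_tool` decorator (the agent wiring here is illustrative):

```python
from agents import Agent, function_tool

@function_tool
def get_weather(city: str) -> dict:
    """Stub tool: fixed payload, so traces never vary and never touch the network."""
    return {"city": city, "temp_c": 22.5, "condition": "Clear"}

weather_agent = Agent(
    name="weather_agent",
    instructions="Answer weather questions using the get_weather tool.",
    tools=[get_weather],
)
```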
For LangGraph pipelines, assert on the state the run produced and, when the exact path matters, on which nodes executed (sketched below):

```python
async def test_editor_approves_on_first_try():
    state = await graph.ainvoke({"topic": "Python type hints", ...})
    assert state["revision_count"] == 1
    assert state["approved"] is True
```

Trace assertions are deterministic when the tool mock is deterministic — they don't depend on the model's phrasing.
For final answer quality, use a separate LLM as a grader. This is the standard approach for evaluating open-ended responses.
```python
import anthropic
import instructor
from pydantic import BaseModel, Field

class Evaluation(BaseModel):
    score: int = Field(ge=1, le=5, description="1=poor, 5=excellent")
    reasoning: str
    passed: bool  # score >= 3

judge_client = instructor.from_anthropic(anthropic.Anthropic())

def evaluate_answer(question: str, answer: str, reference: str | None = None) -> Evaluation:
    context = f"\nReference answer: {reference}" if reference else ""
    return judge_client.messages.create(
        model="claude-3-5-haiku-20241022",  # cheap model is fine for judging
        max_tokens=256,
        response_model=Evaluation,
        messages=[{
            "role": "user",
            "content": (
                f"Evaluate this answer to the question.\n\n"
                f"Question: {question}{context}\n\n"
                f"Answer: {answer}\n\n"
                f"Score 1–5 on: factual accuracy, completeness, and clarity."
            ),
        }],
    )

# Usage
eval_result = evaluate_answer(
    question="What is RAG and why is it useful for AI agents?",
    answer=agent_output,
    reference="RAG is Retrieval-Augmented Generation...",
)
print(f"Score: {eval_result.score}/5 — {eval_result.reasoning}")
```

Use claude-3-5-haiku or gpt-4o-mini as the judge — fast and cheap. Save the heavy model for the agent being evaluated.
Track quality across versions by running a fixed set of test cases and comparing aggregate scores.
```python
# golden_dataset.py
GOLDEN_CASES = [
    {
        "id": "weather_001",
        "input": "What's the weather in Tokyo?",
        "expected_tools": ["get_weather"],
        "quality_reference": "A response about Tokyo's current weather conditions.",
    },
    {
        "id": "memory_001",
        "input": "My name is Hitesh. Remember that.",
        "expected_tools": ["remember"],
        "quality_reference": "Confirmation that the name has been stored.",
    },
    # ... 20-50 cases covering key user journeys
]

async def run_regression_suite(agent, score_threshold: float = 3.5):
    scores = []
    failures = []
    for case in GOLDEN_CASES:
        result = await Runner.run(agent, input=case["input"])
        # Trace check
        actual_tools = [
            item.raw_item.name
            for item in result.new_items
            if item.type == "tool_call_item"
        ]
        trajectory_ok = all(t in actual_tools for t in case["expected_tools"])
        # Quality check
        eval_result = evaluate_answer(
            question=case["input"],
            answer=result.final_output,
            reference=case["quality_reference"],
        )
        scores.append(eval_result.score)
        if not trajectory_ok or not eval_result.passed:
            failures.append({
                "id": case["id"],
                "trajectory_ok": trajectory_ok,
                "score": eval_result.score,
            })
    avg_score = sum(scores) / len(scores)
    print(f"Average quality score: {avg_score:.2f} / 5.0")
    print(f"Failures: {len(failures)} / {len(GOLDEN_CASES)}")
    assert avg_score >= score_threshold, f"Quality regression: {avg_score:.2f} < {score_threshold}"
    return avg_score
```

Run this in CI before every deployment. If the average score drops below your threshold, the deployment is blocked.
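In CI that can be as small as a single pytest test that awaits the suite. A sketch, assuming the `pytest-asyncio` plugin and the `weather_agent` from earlier in the series:

```python
import pytest

@pytest.mark.asyncio
async def test_golden_regression_suite():
    # The assert inside run_regression_suite fails the build on a quality drop,
    # which is what blocks the deployment.
    await run_regression_suite(weather_agent, score_threshold=3.5)
```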
You don't have to build all of this from scratch:
| Tool | Best for |
|---|---|
| LangSmith | Tracing, LangGraph eval datasets, online feedback capture |
| Braintrust | LLM-as-judge at scale, dataset management, CI integration |
| OpenAI Evals | Evaluating OpenAI models with custom graders |
| RAGAS | Specific to RAG pipelines — measures faithfulness, answer relevance, context recall |
| pytest + custom fixtures | Unit and trace tests (no third-party dependency) |
Start with pytest for tools and traces. Add LangSmith when you need visibility into production runs. Add Braintrust when you need a structured eval dataset with version history.
Over the past 12 days we've covered building agents with Claude, OpenAI, and local models, grounding them with RAG, orchestrating them with LangGraph, and running them on production infrastructure. Today's post closes the loop with testing and evaluation.
The backlog continues — check back for posts on Claude extended thinking, voice agents, browser-use, and security hardening for production agents.