Sunday, April 5

Most agent failures in production don’t look like crashes — they look like subtle quality drops that nobody notices until a customer complains. Your summarisation agent starts truncating important details. Your support bot begins confidently hallucinating policy information. Your data extraction pipeline misses edge cases it used to handle correctly. Without systematic agent benchmarking testing, you’re flying blind, shipping changes and hoping nothing breaks. This article shows you how to build a testing framework that actually catches these problems before users do.

Why Standard Unit Tests Aren’t Enough for Agents

You can’t just assert that an agent’s output equals an expected string. LLM outputs are non-deterministic, context-dependent, and often evaluated on dimensions like coherence and completeness that don’t reduce to a boolean. A response can be technically correct but miss the intent of the task. Traditional testing catches syntax errors; agent testing needs to catch reasoning failures.

The other problem is regression. You swap Claude 3 Haiku for Sonnet to improve quality on complex queries, and now your simple classification tasks are slower and cost 5x more. Or you tweak a system prompt to improve tone and accidentally break structured output parsing. Without a benchmark suite, you won’t know which change caused the regression or when it happened.

What you actually need is a framework that covers three things: functional correctness (does the agent do the right thing?), performance characteristics (how fast, how expensive?), and regression detection (did something get worse after the last change?).

Designing Your Benchmark Suite

Choosing Test Cases That Actually Matter

Don’t generate synthetic test cases from the same model you’re testing — that’s circular and will give you false confidence. Use real examples from production logs, edge cases your team has hit manually, and deliberately adversarial inputs.

A useful benchmark suite has at least three categories:

  • Golden path cases: Clean, representative inputs that should always produce correct outputs. These are your baseline smoke tests.
  • Edge cases: Ambiguous inputs, unusual formats, boundary conditions. This is where prompt changes tend to cause silent regressions.
  • Adversarial cases: Inputs designed to trigger failure modes — jailbreak-adjacent prompts, contradictory instructions, inputs that previously caused hallucinations.

Aim for 30–50 test cases per major agent capability to start. More isn’t always better if the cases are redundant. Diversity matters more than volume.

Defining Evaluation Metrics

For each test case you need an evaluation function. This is where most teams underinvest. Common options:

  • Exact match: Fine for classification, entity extraction, structured JSON output
  • Substring / regex match: Useful when the format matters but the exact phrasing doesn’t
  • LLM-as-judge: Use a separate model call to score relevance, completeness, or correctness. Works well for free-form outputs but adds cost and latency
  • Schema validation: For agents that must return structured data — validate against a Pydantic model or JSON Schema

For most production agents I’d recommend a tiered approach: fast, deterministic checks first (schema, regex), then LLM-as-judge only for cases that pass the structural checks. This keeps evaluation costs manageable.

Building the Testing Framework in Python

Here’s a minimal but production-viable framework. It’s structured around three components: a test case dataclass, a runner that collects metrics, and an evaluator you can swap out per use case.

import anthropic
import time
import json
from dataclasses import dataclass, field
from typing import Callable, Optional
from datetime import datetime

@dataclass
class AgentTestCase:
    id: str
    category: str  # "golden", "edge", "adversarial"
    input_messages: list[dict]
    system_prompt: str
    evaluator: Callable[[str], dict]  # returns {"passed": bool, "score": float, "reason": str}
    expected_output: Optional[str] = None
    tags: list[str] = field(default_factory=list)

@dataclass
class BenchmarkResult:
    test_id: str
    passed: bool
    score: float
    reason: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: str
    model: str

def run_benchmark(
    test_cases: list[AgentTestCase],
    model: str = "claude-haiku-4-5",
    max_tokens: int = 1024
) -> list[BenchmarkResult]:
    client = anthropic.Anthropic()
    results = []

    # Approximate pricing per 1M tokens (verify current rates)
    pricing = {
        "claude-haiku-4-5": {"input": 0.80, "output": 4.00},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    }

    for tc in test_cases:
        start = time.time()

        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            system=tc.system_prompt,
            messages=tc.input_messages
        )

        latency_ms = (time.time() - start) * 1000
        output_text = response.content[0].text
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens

        # Calculate cost
        p = pricing.get(model, {"input": 0, "output": 0})
        cost = (input_tokens / 1_000_000 * p["input"]) + \
               (output_tokens / 1_000_000 * p["output"])

        # Run the evaluator for this test case
        eval_result = tc.evaluator(output_text)

        results.append(BenchmarkResult(
            test_id=tc.id,
            passed=eval_result["passed"],
            score=eval_result["score"],
            reason=eval_result["reason"],
            latency_ms=round(latency_ms, 1),
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=round(cost, 6),
            timestamp=datetime.utcnow().isoformat(),
            model=model
        ))

    return results

The evaluator is a plain function — this means you can write tight, fast evaluators for structured outputs and only reach for LLM-as-judge when you actually need it.

Writing Evaluators

import re

def make_json_schema_evaluator(required_keys: list[str]):
    """Evaluator for agents that must return valid JSON with specific fields."""
    def evaluate(output: str) -> dict:
        try:
            # Strip markdown code fences if present
            clean = re.sub(r"```(?:json)?|```", "", output).strip()
            parsed = json.loads(clean)
            missing = [k for k in required_keys if k not in parsed]
            if missing:
                return {"passed": False, "score": 0.0,
                        "reason": f"Missing keys: {missing}"}
            return {"passed": True, "score": 1.0, "reason": "Schema valid"}
        except json.JSONDecodeError as e:
            return {"passed": False, "score": 0.0,
                    "reason": f"Invalid JSON: {e}"}
    return evaluate

def make_llm_judge_evaluator(criterion: str, threshold: float = 0.7):
    """Use Claude itself to score free-form outputs."""
    client = anthropic.Anthropic()

    def evaluate(output: str) -> dict:
        prompt = f"""Score the following response on this criterion: {criterion}

Response: {output}

Return JSON: {{"score": 0.0-1.0, "reason": "brief explanation"}}"""

        # Use the cheapest model for evaluation to keep costs down
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )
        try:
            result = json.loads(resp.content[0].text)
            score = float(result["score"])
            return {
                "passed": score >= threshold,
                "score": score,
                "reason": result.get("reason", "")
            }
        except Exception:
            return {"passed": False, "score": 0.0, "reason": "Judge parse error"}
    return evaluate

Regression Detection and Reporting

Running benchmarks is only useful if you track results over time and alert when things get worse. Store results in a SQLite database (fine for most teams) or Postgres, and compare each run against a baseline.

import sqlite3

def save_results(results: list[BenchmarkResult], run_id: str, db_path: str = "benchmarks.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS benchmark_runs (
            run_id TEXT, test_id TEXT, passed INTEGER, score REAL,
            reason TEXT, latency_ms REAL, cost_usd REAL,
            model TEXT, timestamp TEXT
        )
    """)
    rows = [(run_id, r.test_id, int(r.passed), r.score, r.reason,
             r.latency_ms, r.cost_usd, r.model, r.timestamp)
            for r in results]
    conn.executemany("INSERT INTO benchmark_runs VALUES (?,?,?,?,?,?,?,?,?)", rows)
    conn.commit()
    conn.close()

def detect_regressions(current_run_id: str, baseline_run_id: str,
                       db_path: str = "benchmarks.db") -> list[dict]:
    conn = sqlite3.connect(db_path)
    query = """
        SELECT c.test_id,
               b.score AS baseline_score,
               c.score AS current_score,
               (c.score - b.score) AS delta
        FROM benchmark_runs c
        JOIN benchmark_runs b ON c.test_id = b.test_id
        WHERE c.run_id = ? AND b.run_id = ?
          AND (c.score - b.score) < -0.1  -- flag drops > 10 points
        ORDER BY delta ASC
    """
    rows = conn.execute(query, (current_run_id, baseline_run_id)).fetchall()
    conn.close()
    return [{"test_id": r[0], "baseline": r[1],
             "current": r[2], "delta": r[3]} for r in rows]

Wire this into your CI pipeline so every PR that touches prompts or model config runs the full benchmark suite. A 10% score drop on any category should block the merge — the threshold is your call, but having one at all is what matters.

A/B Testing Model and Prompt Changes

Once your benchmark suite is stable, you can use it for structured A/B testing. The pattern is simple: run the same test cases against two configurations and compare the aggregate results.

def run_ab_test(
    test_cases: list[AgentTestCase],
    config_a: dict,  # {"model": "...", "system_prompt": "..."}
    config_b: dict,
) -> dict:
    """Returns summary stats for both configurations."""

    def summarise(results: list[BenchmarkResult]) -> dict:
        passed = sum(1 for r in results if r.passed)
        return {
            "pass_rate": passed / len(results),
            "avg_score": sum(r.score for r in results) / len(results),
            "avg_latency_ms": sum(r.latency_ms for r in results) / len(results),
            "total_cost_usd": round(sum(r.cost_usd for r in results), 4),
        }

    # Override system prompt per config if provided
    def apply_config(tc: AgentTestCase, config: dict) -> AgentTestCase:
        if "system_prompt" in config:
            tc = AgentTestCase(**{**tc.__dict__, "system_prompt": config["system_prompt"]})
        return tc

    cases_a = [apply_config(tc, config_a) for tc in test_cases]
    cases_b = [apply_config(tc, config_b) for tc in test_cases]

    results_a = run_benchmark(cases_a, model=config_a["model"])
    results_b = run_benchmark(cases_b, model=config_b["model"])

    return {
        "config_a": summarise(results_a),
        "config_b": summarise(results_b),
    }

A realistic A/B test between Haiku and Sonnet on a 50-case suite costs roughly $0.04 for Haiku and $0.18 for Sonnet at current pricing — cheap enough to run on every meaningful change. The results will tell you whether the quality improvement justifies the cost increase for your specific task distribution.

What Breaks in Practice

LLM-as-judge evaluators introduce their own variance — the same output can score differently on back-to-back runs. Run judge evaluations three times and average, or use temperature=0 and accept that you’re trading diversity for consistency. Neither is perfect.

Latency measurements from a single run are noisy. For production benchmarking, run each test case at least three times and take the median. This adds cost but gives you numbers you can actually act on.

Watch out for prompt injection in your test inputs. If your adversarial test cases contain instructions to “ignore previous instructions”, make sure your evaluator doesn’t blindly pass them to an LLM judge without sanitisation — you’ll get weird false positives.

Finally, benchmark drift: as you add features, test cases accumulate and the suite gets slow. Set a time budget (e.g., 60 seconds in CI), and run the full suite only pre-merge or on a schedule. Fast smoke tests on every commit, full regression suite nightly.

When to Use This and Who Should Build It

Solo founders: Start with 20 golden-path test cases and exact-match or schema evaluators. Get CI running first, even if imperfectly. Skip LLM-as-judge until you have production traffic to calibrate against.

Small teams shipping multiple agents: The SQLite-backed regression detection is worth setting up early. The pain of a silent regression found by a user versus found by a nightly benchmark run is enormous. Invest 2–3 hours here and you’ll save much more than that.

Larger teams with model evaluation budgets: Look at purpose-built tools like Braintrust or LangSmith for the dashboard and dataset management layers. But the evaluation logic shown here transfers directly — those platforms let you bring your own evaluators.

Agent benchmarking testing isn’t glamorous work, but it’s the difference between deploying confidently and deploying hopefully. The framework above gives you the core primitives: structured test cases, composable evaluators, cost-tracked runs, and regression detection. Everything else is scaling and polish.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

Share.
Leave A Reply