Evaluators

Evaluators decide whether an agent's reply passes or fails a turn. All evaluators implement an async evaluate(ctx: TurnContext) -> EvalResult method.
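
Evaluators are attached to a turn through Expect, which takes a list, so several checks can presumably be combined on a single turn. A minimal sketch using two of the built-ins described below (the user message, substrings, and tool name are illustrative):

from pytest_agent_eval import Turn, Expect, ContainsEvaluator, ToolCallEvaluator

Turn(
    user="Book me a slot for tomorrow at 10am.",
    expect=Expect(
        evaluators=[
            ContainsEvaluator(any_of=["confirmed", "booked"]),
            ToolCallEvaluator(must_include=["book_slot"]),
        ],
    ),
)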

ContainsEvaluator

Checks that the reply contains expected substrings (case-insensitive).

from pytest_agent_eval import ContainsEvaluator

# Pass if reply contains at least one of these
ContainsEvaluator(any_of=["confirmed", "booked"])

# Pass if reply contains ALL of these
ContainsEvaluator(all_of=["booking", "reference number"])

# Both checks at once
ContainsEvaluator(
    any_of=["confirmed", "booked"],
    all_of=["tomorrow"],
)

Parameters:

any_of (list[str]): Reply must contain at least one of these (case-insensitive)
all_of (list[str]): Reply must contain every one of these (case-insensitive)

ToolCallEvaluator

Validates that specific tools were (or were not) called during a turn.

from pytest_agent_eval import ToolCallEvaluator

# Require a tool and forbid another
ToolCallEvaluator(
    must_include=["book_slot"],
    must_exclude=["cancel_slot"],
)

# Enforce call order
ToolCallEvaluator(
    must_include=["authenticate", "fetch_availability", "create_booking"],
    ordered=True,
)

Parameters:

must_include (list[str]): Tool names that must appear in the turn's tool calls
must_exclude (list[str]): Tool names that must NOT appear in the turn's tool calls
ordered (bool): If True, the must_include tools must appear in the specified order

JudgeEvaluator

Uses an LLM (via pydantic-ai) to evaluate the reply against a natural-language rubric. Good for open-ended quality checks that are hard to express as string patterns.

from pytest_agent_eval import JudgeEvaluator

JudgeEvaluator(
    rubric=(
        "The reply must confirm the booking, include a reference number, "
        "mention the date and time, and have a friendly professional tone."
    ),
    model="openai:gpt-4o",      # optional — falls back to [tool.agent_eval] model
    retries=2,                   # retry API failures
    timeout=30.0,                # per-call timeout in seconds
)

Parameters:

rubric (str, required): Natural-language description of what a passing reply looks like
model (str | None, default None): pydantic-ai model ID; falls back to the [tool.agent_eval] model
retries (int, default 2): Number of retries on API failure before returning a FAIL verdict
timeout (float, default 30.0): Seconds before the judge call times out
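
The [tool.agent_eval] fallback is read from pyproject.toml. A minimal sketch of that table, assuming model is the only key relevant here (any other keys it accepts are not covered on this page):

# pyproject.toml
[tool.agent_eval]
model = "openai:gpt-4o"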

Writing a custom evaluator

Implement the Evaluator protocol: an object with an async evaluate method.

from pytest_agent_eval.models import TurnContext, EvalResult

class SentimentEvaluator:
    """Fail if the reply has negative sentiment."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    async def evaluate(self, ctx: TurnContext) -> EvalResult:
        # ctx.reply  — the agent's string reply
        # ctx.user   — the user message for this turn
        # ctx.tool_calls — list of tool names called
        # ctx.history    — OpenAI-format message history

        score = await compute_sentiment(ctx.reply)   # your own logic
        passed = score >= self.threshold
        return EvalResult(
            passed=passed,
            reasoning=f"Sentiment score {score:.2f} vs threshold {self.threshold:.2f}",
        )

Then use it like any built-in evaluator:

from pytest_agent_eval import Turn, Expect

Turn(
    user="How was your experience?",
    expect=Expect(evaluators=[SentimentEvaluator(threshold=0.6)]),
)

The protocol requires only that evaluate is async and returns an EvalResult. There is no base class to inherit from.
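
If you want something concrete to type-check against, the protocol amounts to roughly the following (an illustrative sketch; the actual definition inside pytest_agent_eval may be named or structured differently):

from typing import Protocol

from pytest_agent_eval.models import TurnContext, EvalResult

class Evaluator(Protocol):
    async def evaluate(self, ctx: TurnContext) -> EvalResult:
        ...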