Evaluators

Evaluators decide whether an agent's reply passes or fails a turn. All evaluators implement an async evaluate(ctx: TurnContext) -> EvalResult method.
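
Evaluators are attached to a turn through Expect, which takes a list, so several checks can presumably be combined on a single turn. A minimal sketch using two of the built-ins described below (the user message, substrings, and tool name are illustrative):

from pytest_agent_eval import Turn, Expect, ContainsEvaluator, ToolCallEvaluator

Turn(
    user="Book me a slot for tomorrow at 10am.",
    expect=Expect(
        evaluators=[
            ContainsEvaluator(any_of=["confirmed", "booked"]),
            ToolCallEvaluator(must_include=["book_slot"]),
        ],
    ),
)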

ContainsEvaluator

Checks that the reply contains expected substrings (case-insensitive).

from pytest_agent_eval import ContainsEvaluator

# Pass if reply contains at least one of these
ContainsEvaluator(any_of=["confirmed", "booked"])

# Pass if reply contains ALL of these
ContainsEvaluator(all_of=["booking", "reference number"])

# Both checks at once
ContainsEvaluator(
    any_of=["confirmed", "booked"],
    all_of=["tomorrow"],
)

Parameters:

any_of (list[str]): Reply must contain at least one of these (case-insensitive)
all_of (list[str]): Reply must contain every one of these (case-insensitive)

ToolCallEvaluator

Validates that specific tools were (or were not) called during a turn.

from pytest_agent_eval import ToolCallEvaluator

# Require a tool and forbid another
ToolCallEvaluator(
    must_include=["book_slot"],
    must_exclude=["cancel_slot"],
)

# Enforce call order
ToolCallEvaluator(
    must_include=["authenticate", "fetch_availability", "create_booking"],
    ordered=True,
)

Parameters:

must_include (list[str]): Tool names that must appear in the turn's tool calls
must_exclude (list[str]): Tool names that must NOT appear in the turn's tool calls
ordered (bool): If True, the must_include tools must appear in the specified order

JudgeEvaluator

Uses an LLM (via pydantic-ai) to evaluate the reply against a natural-language rubric. Good for open-ended quality checks that are hard to express as string patterns.

from pytest_agent_eval import JudgeEvaluator

JudgeEvaluator(
    rubric=(
        "The reply must confirm the booking, include a reference number, "
        "mention the date and time, and have a friendly professional tone."
    ),
    model="openai:gpt-4o",      # optional — falls back to [tool.agent_eval] model
    retries=2,                   # retry API failures
    timeout=30.0,                # per-call timeout in seconds
)

Parameters:

rubric (str, required): Natural-language description of what a passing reply looks like
model (str | None, default None): pydantic-ai model ID; falls back to the [tool.agent_eval] model
retries (int, default 2): Number of retries on API failure before returning a FAIL verdict
timeout (float, default 30.0): Seconds before the judge call times out
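
The [tool.agent_eval] fallback is read from pyproject.toml. A minimal sketch of that table, assuming model is the only key relevant here (any other keys it accepts are not covered on this page):

# pyproject.toml
[tool.agent_eval]
model = "openai:gpt-4o"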

Writing a custom evaluator

Implement the Evaluator protocol: an object with an async evaluate method.

from pytest_agent_eval.models import TurnContext, EvalResult

class SentimentEvaluator:
    """Fail if the reply has negative sentiment."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    async def evaluate(self, ctx: TurnContext) -> EvalResult:
        # ctx.reply  — the agent's string reply
        # ctx.user   — the user message for this turn
        # ctx.tool_calls — list of tool names called
        # ctx.history    — OpenAI-format message history

        score = await compute_sentiment(ctx.reply)   # your own logic
        passed = score >= self.threshold
        return EvalResult(
            passed=passed,
            reasoning=f"Sentiment score {score:.2f} vs threshold {self.threshold:.2f}",
        )

Then use it like any built-in evaluator:

from pytest_agent_eval import Turn, Expect

Turn(
    user="How was your experience?",
    expect=Expect(evaluators=[SentimentEvaluator(threshold=0.6)]),
)

The protocol requires only that evaluate is async and returns an EvalResult. There is no base class to inherit from.
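
If you want something concrete to type-check against, the protocol amounts to roughly the following (an illustrative sketch; the actual definition inside pytest_agent_eval may be named or structured differently):

from typing import Protocol

from pytest_agent_eval.models import TurnContext, EvalResult

class Evaluator(Protocol):
    async def evaluate(self, ctx: TurnContext) -> EvalResult:
        ...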