Models API Reference

Transcript

A multi-turn evaluation transcript.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| id | str | Unique identifier used as the pytest test name. | required |
| turns | list[Turn] | Ordered list of turns. | required |
| threshold | float | Fraction of runs that must pass (0.0-1.0). | 0.8 |
| runs | int | Number of times to execute this transcript. | 1 |
| tags | list[str] | Optional quality-gate tags (e.g. ["gate:booking"]). | list() |
Source code in src/pytest_agent_eval/models.py
@dataclass
class Transcript:
    """A multi-turn evaluation transcript.

    Args:
        id: Unique identifier used as the pytest test name.
        turns: Ordered list of turns.
        threshold: Fraction of runs that must pass (0.0-1.0).
        runs: Number of times to execute this transcript.
        tags: Optional quality-gate tags (e.g. ["gate:booking"]).
    """

    id: str
    turns: list[Turn]
    threshold: float = 0.8
    runs: int = 1
    tags: list[str] = field(default_factory=list)
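
A minimal construction sketch, assuming the dataclasses are imported from pytest_agent_eval.models (the module shown above). The id, tag, and message values are hypothetical; Turn and Expect are documented in the sections that follow.

from pytest_agent_eval.models import Expect, Transcript, Turn

transcript = Transcript(
    id="greeting_smoke",      # becomes the pytest test name
    turns=[
        Turn(
            user="Hi there!",
            expect=Expect(reply_contains_any=["hello", "hi"]),
        ),
    ],
    threshold=1.0,            # every run must pass
    runs=3,                   # execute the transcript three times
    tags=["gate:smoke"],      # hypothetical quality-gate tag
)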

Turn

A single turn in a transcript.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| user | str | The user message (also used as the transcript when audio is set). | required |
| audio | _PathLike \| None | Optional path to a WAV file for voice adapters. Resolved relative to the YAML file's directory when loaded from YAML. | None |
| expect | Expect | Expectations for the agent's reply. | Expect() |
Source code in src/pytest_agent_eval/models.py
@dataclass
class Turn:
    """A single turn in a transcript.

    Args:
        user: The user message (also used as the transcript when ``audio`` is set).
        audio: Optional path to a WAV file for voice adapters. Resolved relative to
            the YAML file's directory when loaded from YAML.
        expect: Expectations for the agent's reply.
    """

    user: str
    audio: _PathLike | None = None
    expect: Expect = field(default_factory=Expect)
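
Two illustrative turns, one text-only and one voice, assuming the same module import. The WAV path is hypothetical and is shown as a plain string; when the turn is loaded from YAML, that path would be resolved relative to the YAML file's directory.

from pytest_agent_eval.models import Turn

# Text-only turn with the default (empty) expectations.
text_turn = Turn(user="What time do you open on Sundays?")

# Voice turn: the WAV file is handed to the voice adapter and `user`
# doubles as the transcript of the audio.
voice_turn = Turn(user="Book a table for two at 7pm", audio="audio/book_table.wav")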

Expect

Expectations for a single transcript turn.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| evaluators | list[Any] | Programmatic evaluators (Python API). | list() |
| judge | JudgeConfig \| None | YAML-defined judge config. | None |
| tool_calls_include | list[str] | Tool names that must appear in tool_calls. | list() |
| tool_calls_exclude | list[str] | Tool names that must NOT appear in tool_calls. | list() |
| reply_contains_any | list[str] | Reply must contain at least one of these strings. | list() |
| reply_contains_all | list[str] | Reply must contain all of these strings. | list() |
Source code in src/pytest_agent_eval/models.py
@dataclass
class Expect:
    """Expectations for a single transcript turn.

    Args:
        evaluators: Programmatic evaluators (Python API).
        judge: YAML-defined judge config.
        tool_calls_include: Tool names that must appear in tool_calls.
        tool_calls_exclude: Tool names that must NOT appear in tool_calls.
        reply_contains_any: Reply must contain at least one of these strings.
        reply_contains_all: Reply must contain all of these strings.
    """

    evaluators: list[Any] = field(default_factory=list)
    judge: JudgeConfig | None = None
    tool_calls_include: list[str] = field(default_factory=list)
    tool_calls_exclude: list[str] = field(default_factory=list)
    reply_contains_any: list[str] = field(default_factory=list)
    reply_contains_all: list[str] = field(default_factory=list)
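
A sketch of the declarative checks, assuming the same module import; the tool and phrase names are hypothetical.

from pytest_agent_eval.models import Expect

expect = Expect(
    tool_calls_include=["book_table"],          # must appear in tool_calls
    tool_calls_exclude=["cancel_booking"],      # must not appear in tool_calls
    reply_contains_any=["booked", "reserved"],  # at least one of these strings
    reply_contains_all=["7pm"],                 # every one of these strings
)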

TurnContext

Context passed to every evaluator for a turn.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| user | str | The user message for this turn. | required |
| reply | str | The agent's reply. | required |
| tool_calls | list[str] | Names of tools called during the turn. | required |
| history | list[dict[str, Any]] | Full conversation history in OpenAI message format, up to but not including the assistant reply for this turn. | required |
Source code in src/pytest_agent_eval/models.py
@dataclass
class TurnContext:
    """Context passed to every evaluator for a turn.

    Args:
        user: The user message for this turn.
        reply: The agent's reply.
        tool_calls: Names of tools called during the turn.
        history: Full conversation history in OpenAI message format, up to but not including
            the assistant reply for this turn.
    """

    user: str
    reply: str
    tool_calls: list[str]
    history: list[dict[str, Any]]
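
The history field follows the OpenAI chat message format. An illustrative shape for the second user turn of a conversation (only role and content keys are shown; the assistant reply being evaluated is not yet present):

history = [
    {"role": "user", "content": "Do you have a table for two tonight?"},
    {"role": "assistant", "content": "Yes, what time works for you?"},
    {"role": "user", "content": "7pm, please."},
]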

EvalResult

Result from a single evaluator on a single turn.

Source code in src/pytest_agent_eval/models.py
@dataclass
class EvalResult:
    """Result from a single evaluator on a single turn."""

    passed: bool
    reasoning: str = ""
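
Programmatic evaluators (the evaluators list on Expect) receive a TurnContext and report an EvalResult. The exact calling convention is not documented in this section, so the sketch below assumes a plain callable; the tool and phrase checks are hypothetical.

from pytest_agent_eval.models import EvalResult, TurnContext

def confirms_booking(ctx: TurnContext) -> EvalResult:
    # Hypothetical check: the booking tool must run and the reply must confirm it.
    if "book_table" not in ctx.tool_calls:
        return EvalResult(passed=False, reasoning="book_table was never called")
    if "confirm" not in ctx.reply.lower():
        return EvalResult(passed=False, reasoning="reply does not confirm the booking")
    return EvalResult(passed=True, reasoning="tool called and booking confirmed")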

TranscriptResult

Aggregated result across all runs of a transcript.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| passed | bool | True if score >= threshold. | required |
| score | float | Fraction of runs that passed (0.0-1.0). | required |
| threshold | float | Required pass fraction. | required |
| runs | list[RunResult] | Individual run results. | required |
Source code in src/pytest_agent_eval/models.py
@dataclass
class TranscriptResult:
    """Aggregated result across all runs of a transcript.

    Args:
        passed: True if score >= threshold.
        score: Fraction of runs that passed (0.0-1.0).
        threshold: Required pass fraction.
        runs: Individual run results.
    """

    passed: bool
    score: float
    threshold: float
    runs: list[RunResult]

    @property
    def passed_run_count(self) -> int:
        """Number of runs that passed."""
        return sum(r.passed for r in self.runs)

    def assert_threshold(self) -> None:
        """Raise AssertionError if score is below threshold."""
        if not self.passed:
            raise AssertionError(
                f"LLM eval failed: score={self.score:.2f} < threshold={self.threshold:.2f} "
                f"({self.passed_run_count}/{len(self.runs)} runs passed)"
            )

passed_run_count: int property

Number of runs that passed.

assert_threshold() -> None

Raise AssertionError if score is below threshold.

Source code in src/pytest_agent_eval/models.py
def assert_threshold(self) -> None:
    """Raise AssertionError if score is below threshold."""
    if not self.passed:
        raise AssertionError(
            f"LLM eval failed: score={self.score:.2f} < threshold={self.threshold:.2f} "
            f"({self.passed_run_count}/{len(self.runs)} runs passed)"
        )
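
Worked example of the threshold check: if 2 of 3 runs pass, score = 2/3 (about 0.67), which is below the default threshold of 0.8, so assert_threshold raises. The sketch below omits the RunResult entries for brevity (they are documented elsewhere), so passed_run_count reports 0 here.

import pytest

from pytest_agent_eval.models import TranscriptResult

result = TranscriptResult(passed=False, score=2 / 3, threshold=0.8, runs=[])

with pytest.raises(AssertionError):
    result.assert_threshold()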