YAML API¶
YAML transcripts let you define evaluation tests without writing Python. They are loaded automatically from any directory listed in `yaml_dirs`.
Directory setup¶
Any `*.yaml` file inside `tests/evals/` (searched recursively) becomes a test.
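A minimal configuration sketch, assuming the settings live in a `[tool.agent_eval]` table in `pyproject.toml` (that table and the `yaml_dirs` key are referenced on this page; the `threshold` and `runs` defaults shown here are illustrative values):

```toml
# pyproject.toml (sketch — key names follow this page's references)
[tool.agent_eval]
yaml_dirs = ["tests/evals"]  # directories scanned recursively for *.yaml tests
threshold = 0.8              # default pass fraction per test
runs = 3                     # default number of executions per test
```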
Full annotated transcript¶
```yaml
# tests/evals/booking.yaml

# Unique ID — used as the pytest test name
id: booking_confirmation

# Fraction of runs that must pass (0.0 – 1.0)
threshold: 0.8

# Number of times to run the full transcript
runs: 3

# Optional tags for quality-gate filtering
tags:
  - gate:booking
  - smoke

turns:
  - user: "Book me a table for 2 tomorrow at 10am."
    expect:
      # Reply must contain at least one of these strings (case-insensitive)
      reply_contains_any:
        - "confirmed"
        - "booked"
      # Reply must contain ALL of these strings (case-insensitive)
      reply_contains_all:
        - "tomorrow"
        - "10"
      # Tool names that must appear in this turn's tool calls
      tool_calls_include:
        - create_booking
      # Tool names that must NOT appear in this turn's tool calls
      tool_calls_exclude:
        - cancel_booking
      # LLM-as-judge rubric (requires a model in [tool.agent_eval])
      judge:
        rubric: >
          The reply must confirm a booking with a date, time, and
          reference number. The tone should be friendly and professional.
        model: "openai:gpt-4o"  # optional — overrides [tool.agent_eval] model

  - user: "Can you email me the confirmation?"
    expect:
      reply_contains_any:
        - "email"
        - "sent"
```
Field reference¶
Top-level fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | `str` | yes | — | Unique test identifier, used as the test name |
| `threshold` | `float` | no | `[tool.agent_eval]` `threshold` | Pass fraction required |
| `runs` | `int` | no | `[tool.agent_eval]` `runs` | Number of executions |
| `tags` | `list[str]` | no | `[]` | Quality-gate tags for filtering |
| `turns` | `list[Turn]` | yes | — | Ordered list of turns |
turns[].user¶
The user message string for this turn. Required for every turn. It also serves as the transcript when an `audio:` fixture is generated for this turn.
turns[].audio¶
Optional path to a WAV file used by voice adapters (e.g. LiveKitAdapter). Resolved relative to the YAML file's directory unless absolute. Text adapters ignore this field — turns can mix audio and non-audio freely.
```yaml
turns:
  - user: "Book me a slot tomorrow at 10am."
    audio: booking_t1.wav  # → tests/evals/booking_t1.wav
    expect:
      tool_calls_include: [create_booking]
```
The CLI can generate the WAV from the `user:` text. It hashes `turn.user` into a `<wav>.hash` sidecar and only re-synthesises when the transcript changes. See the LiveKit adapter docs for the full pipeline.
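The sidecar caching scheme can be sketched like this (an illustrative reading of the behaviour described above, not the actual CLI implementation — the function name and digest choice are assumptions):

```python
# Sketch: skip audio synthesis when the transcript text is unchanged.
# A "<wav>.hash" sidecar stores a digest of the turn's user text.
import hashlib
from pathlib import Path


def needs_resynthesis(wav_path: Path, transcript: str) -> bool:
    """Return True if the WAV should be (re)generated for this transcript."""
    digest = hashlib.sha256(transcript.encode("utf-8")).hexdigest()
    sidecar = wav_path.with_suffix(wav_path.suffix + ".hash")  # booking_t1.wav.hash
    if wav_path.exists() and sidecar.exists() and sidecar.read_text() == digest:
        return False  # transcript unchanged — reuse the cached audio
    sidecar.write_text(digest)  # record the digest for the upcoming synthesis
    return True
```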
turns[].expect¶
All expect fields are optional. Omit expect entirely for turns where you only care about the agent not crashing.
| Field | Type | Description |
|---|---|---|
| `reply_contains_any` | `list[str]` | At least one string must appear in the reply |
| `reply_contains_all` | `list[str]` | All strings must appear in the reply |
| `tool_calls_include` | `list[str]` | These tool names must be present in the turn's calls |
| `tool_calls_exclude` | `list[str]` | These tool names must be absent from the turn's calls |
| `judge` | `JudgeConfig` | LLM-as-judge rubric evaluation |
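The reply checks above reduce to case-insensitive substring matching; a minimal sketch of that semantics (my reading of the docs, not the library's actual code):

```python
# Sketch of reply_contains_any / reply_contains_all semantics:
# case-insensitive substring search over the agent's reply.
def check_reply(reply, contains_any=None, contains_all=None):
    low = reply.lower()
    if contains_any and not any(s.lower() in low for s in contains_any):
        return False  # none of the "any" strings matched
    if contains_all and not all(s.lower() in low for s in contains_all):
        return False  # at least one "all" string is missing
    return True
```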
turns[].expect.judge¶
| Field | Type | Description |
|---|---|---|
| `rubric` | `str` | Natural-language rubric sent to the judge model |
| `model` | `str \| null` | pydantic-ai model ID override; falls back to global config |
Agent fixture¶
YAML-loaded tests require a pytest fixture named llm_eval_agent that returns your agent callable:
```python
# tests/conftest.py
import pytest


@pytest.fixture
def llm_eval_agent():
    async def my_agent(messages):
        # messages is a list of OpenAI-style {"role": ..., "content": ...} dicts
        return "Booking confirmed! Reference BK-1234."

    return my_agent
```
The fixture is resolved at collection time, so you can parametrize it or switch agents per test directory.
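One way to switch agents is standard pytest fixture parametrization. A hypothetical sketch (the fixture name and message shape come from this page; the two agent variants are invented for illustration):

```python
# tests/conftest.py — run every YAML transcript against two agent variants.
import pytest


async def polite_agent(messages):
    # messages: list of OpenAI-style {"role": ..., "content": ...} dicts
    return "Booking confirmed! Reference BK-1234."


async def terse_agent(messages):
    return "Confirmed. BK-1234."


@pytest.fixture(params=[polite_agent, terse_agent], ids=["polite", "terse"])
def llm_eval_agent(request):
    # Each YAML test is collected once per param; request.param is the agent.
    return request.param
```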