pytest-agent-eval¶
LLM evaluation tests that actually mean something.
pytest-agent-eval is a pytest plugin for testing LLM agents and applications with threshold-based pass/fail scoring, multi-turn YAML transcripts, and an LLM-as-judge rubric system — without blowing up your CI bill.
Highlights¶
- 🎯 Threshold-based pass/fail — run each test N times, pass when ≥ threshold% succeed
- 📝 YAML or Python transcripts — pick the authoring style your team prefers
- 🔍 YAML auto-discovery — drop `*.yaml` files in any configured directory and they become pytest tests automatically
- 🎙 Voice agents (LiveKit) — drive a real `AgentSession` with a WAV per turn; same evaluator surface as text agents
- 🛡 CI-safe by default — eval tests skip unless `--agent-eval-live` or `EVAL_LIVE=1`
- ⚡ Parallel-ready — `pytest -n auto` (via `pytest-xdist`) just works
- 📄 Markdown reports — full per-run trace with `--agent-eval-report=eval.md` (combined example after this list)
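A typical live invocation combining these flags (the `tests/evals` path is illustrative):

```bash
# Opt in to live runs, parallelize via pytest-xdist, write a Markdown report.
pytest tests/evals --agent-eval-live -n auto --agent-eval-report=eval.md
```

Setting `EVAL_LIVE=1` in the environment is equivalent to passing `--agent-eval-live`, which is handy in CI pipelines.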
Install¶
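Assuming the package is published on PyPI under its project name, the base install is:

```bash
pip install pytest-agent-eval
```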
For framework-specific adapters, install one of the optional extras shown in the Frameworks section.
What you can test¶
Three layers of checks, freely composable
Each evaluator runs against every turn and contributes to the threshold score. Mix the strict, deterministic checks with the judgment-based ones; no layer takes priority over another.
- **Text assertions** (`ContainsEvaluator`): substring / pattern assertions over the agent's reply. Cheap, fast, deterministic.
- **Tool-call checks** (`ToolCallEvaluator`): assert that the agent invoked the right tools, optionally in a specific order, and that disallowed tools were never called (see the sketch after this list).
- **LLM-as-judge** (`JudgeEvaluator`): open-ended quality checks. The judge (a separate model) returns a verdict plus reasoning against your rubric.
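A sketch of an ordered, disallowed-tool check in the Python API. `must_include` appears in the quick start below; `in_order` and `must_not_include` are illustrative assumptions, not confirmed parameter names:

```python
from pytest_agent_eval import ToolCallEvaluator

# Sketch only: in_order and must_not_include are assumed names;
# must_include is the documented parameter. Tool names are illustrative.
evaluator = ToolCallEvaluator(
    must_include=["lookup_availability", "create_booking"],
    in_order=True,
    must_not_include=["cancel_booking"],
)
```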
Supported frameworks¶
pytest-agent-eval ships first-class adapters for the major Python agent frameworks. Each is an optional extra so you only install what you use.
No extra needed — pydantic-ai support ships with the base install.
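A minimal pydantic-ai agent to hand to the shared eval fixture (a sketch; the model string and prompt are illustrative choices):

```python
# Sketch: model name and instructions are illustrative, not prescribed by the plugin.
from pydantic_ai import Agent

booking_agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="You are a booking assistant.",
)
```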
For LiveKit voice agents, each turn declares an `audio: turn.wav` path; the adapter streams the audio through a fresh `AgentSession` and captures tool calls and the transcript on the same evaluator surface as text agents.
```python
import pytest
from livekit.agents.voice import Agent, AgentSession
from livekit.plugins import openai
from pytest_agent_eval.adapters.livekit import LiveKitAdapter


def make_session():
    # Build a fresh session + agent pair for every eval run.
    session = AgentSession(llm=openai.realtime.RealtimeModel())
    agent = Agent(instructions="You are a booking assistant.", tools=[...])
    return session, agent


@pytest.fixture
def llm_eval_agent():
    return LiveKitAdapter(make_session)
```
Generate WAV fixtures from your YAML transcripts (hash-cached, idempotent); see the LiveKit adapter docs for the command and full options.
YAML auto-discovery¶
Zero-boilerplate evals
Point pytest-agent-eval at any directory of `*.yaml` files and every transcript becomes a pytest test — no Python wrapper required. Add files, run pytest, see results.
```yaml
# tests/evals/booking.yaml
id: booking_confirmation
threshold: 0.8
runs: 3
turns:
  - user: "Book me a slot tomorrow at 10am"
    expect:
      reply_contains_any: ["confirmed", "booked"]
      tool_calls_include: ["create_booking"]
      judge:
        rubric: "Reply must include a reference number and be polite."
```
Provide one shared `llm_eval_agent` fixture (in `conftest.py`) and the loader handles the rest. With `runs: 3` and `threshold: 0.8`, at least 80% of runs must succeed; 2 of 3 is only ~67%, so all three runs have to pass. See the YAML API reference for every field.
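A minimal sketch of that shared fixture, assuming the agent lives in a hypothetical `my_app.agent` module:

```python
# tests/evals/conftest.py
# Sketch only: my_app.agent is a hypothetical import path.
import pytest

from my_app.agent import booking_agent


@pytest.fixture
def llm_eval_agent():
    # One shared fixture; the YAML loader wires it into every generated test.
    return booking_agent
```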
Quick start (Python API)¶
```python
import pytest
from pytest_agent_eval import Turn, Expect, ContainsEvaluator, ToolCallEvaluator, JudgeEvaluator


@pytest.mark.agent_eval(threshold=0.8, runs=3)
async def test_booking(agent_eval):
    result = await agent_eval.run(
        agent=my_agent,
        turns=[
            Turn(
                user="Book me a slot tomorrow at 10am",
                expect=Expect(evaluators=[
                    ContainsEvaluator(any_of=["confirmed", "booked"]),
                    ToolCallEvaluator(must_include=["create_booking"]),
                    JudgeEvaluator(rubric="Reply must include a reference number."),
                ]),
            )
        ],
    )
    result.assert_threshold()
```
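Eval tests skip by default, so opt in when you want this to hit a real model (the file path is illustrative):

```bash
pytest tests/test_booking.py --agent-eval-live
```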
See Getting Started for a full walkthrough.