pytest-agent-eval¶
LLM evaluation tests that actually mean something.
pytest-agent-eval is a pytest plugin for testing LLM agents and applications with threshold-based pass/fail scoring, multi-turn YAML transcripts, and an LLM-as-judge rubric system — without blowing up your CI bill.
Highlights¶
- 🎯 Threshold-based pass/fail — run each test N times, pass when ≥ threshold% succeed
- 📝 YAML or Python transcripts — pick the authoring style your team prefers
- 🔍 YAML auto-discovery — drop `*.yaml` files in any configured directory and they become pytest tests automatically
- 🛡 CI-safe by default — eval tests skip unless `--agent-eval-live` or `EVAL_LIVE=1`
- ⚡ Parallel-ready — `pytest -n auto` (via `pytest-xdist`) just works
- 📄 Markdown reports — full per-run trace with `--agent-eval-report=eval.md`
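Put together, the CI-safety, parallelism, and reporting flags above combine into invocations like these (a sketch assuming the plugin is installed; flag names are the ones listed in the highlights):

```shell
# Eval tests are skipped by default; opt in explicitly for a live run:
pytest --agent-eval-live

# Or opt in via the environment, with parallelism and a Markdown report:
EVAL_LIVE=1 pytest -n auto --agent-eval-report=eval.md
```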
Install¶
For framework-specific adapters, install one of the optional extras shown in the Frameworks section.
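Installation presumably follows the usual pattern (assuming the PyPI distribution name matches the project name; `<framework>` is a placeholder for an extra listed in the Frameworks section, not a literal extra name):

```shell
# Base install (pydantic-ai support is included):
pip install pytest-agent-eval

# Optional framework adapter, one extra per framework:
pip install "pytest-agent-eval[<framework>]"
```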
What you can test¶
Three layers of checks, freely composable
Each evaluator runs against every turn and contributes to the threshold score. Mix strict, deterministic checks with open-ended judged ones freely — no evaluator takes precedence over another.
- **Assertions** — substring/pattern checks over the agent's reply. Cheap, fast, deterministic.
- **Tool calls** — assert that the agent invoked the right tools, optionally in a specific order and optionally with a list of disallowed tools.
- **LLM-as-judge** — open-ended quality checks. The judge (a separate model) returns a verdict + reasoning against your rubric.
Supported frameworks¶
pytest-agent-eval ships first-class adapters for the major Python agent frameworks. Each is an optional extra so you only install what you use.
No extra needed — pydantic-ai support ships with the base install.
YAML auto-discovery¶
Zero-boilerplate evals
Point pytest-agent-eval at any directory of `*.yaml` files and every transcript becomes a pytest test — no Python wrapper required. Add files, run pytest, see results.
```yaml
# tests/evals/booking.yaml
id: booking_confirmation
threshold: 0.8
runs: 3
turns:
  - user: "Book me a slot tomorrow at 10am"
    expect:
      reply_contains_any: ["confirmed", "booked"]
      tool_calls_include: ["create_booking"]
    judge:
      rubric: "Reply must include a reference number and be polite."
```
Provide one shared `llm_eval_agent` fixture (in `conftest.py`) and the loader handles the rest. See the YAML API reference for every field.
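The shared fixture might look like this — a minimal sketch, assuming your agent comes from a hypothetical `build_agent()` factory in your own code (`my_app.agents` is a stand-in, not part of the plugin):

```python
# conftest.py -- one shared fixture that the YAML loader uses
# to obtain the agent under test.
import pytest


@pytest.fixture
def llm_eval_agent():
    # Hypothetical factory; replace with your own agent construction.
    from my_app.agents import build_agent
    return build_agent()
```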
Quick start (Python API)¶
```python
import pytest
from pytest_agent_eval import (
    Turn,
    Expect,
    ContainsEvaluator,
    ToolCallEvaluator,
    JudgeEvaluator,
)


@pytest.mark.agent_eval(threshold=0.8, runs=3)
async def test_booking(agent_eval):
    # my_agent is your agent under test, defined elsewhere.
    result = await agent_eval.run(
        agent=my_agent,
        turns=[
            Turn(
                user="Book me a slot tomorrow at 10am",
                expect=Expect(evaluators=[
                    ContainsEvaluator(any_of=["confirmed", "booked"]),
                    ToolCallEvaluator(must_include=["create_booking"]),
                    JudgeEvaluator(rubric="Reply must include a reference number."),
                ]),
            )
        ],
    )
    result.assert_threshold()
See Getting Started for a full walkthrough.