# Getting Started
This guide walks you through installing pytest-agent-eval and writing your first passing evaluation test.
## Installation
Install the base package first; for framework-specific adapters, add the matching optional extra (the extra name below is a placeholder for whichever adapter you need):
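```bash
pip install pytest-agent-eval

# Optional framework adapter; "<adapter>" is a placeholder for the published extra name.
pip install "pytest-agent-eval[<adapter>]"
```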
## Configure pyproject.toml
Add a [tool.agent_eval] section to your pyproject.toml:
```toml
[tool.agent_eval]
model = "openai:gpt-4o"        # default judge + agent-fallback model
threshold = 0.8
runs = 3
yaml_dirs = ["tests/evals"]    # enables YAML auto-discovery
```
> **Use a separate judge model.** Set `judge_model = "openai:gpt-4o"` independently from the agent-under-test model; you typically want a stronger model judging a cheaper agent.
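For example (the cheaper agent-side model id below is only an illustration):

```toml
[tool.agent_eval]
model = "openai:gpt-4o-mini"   # agent fallback; an illustrative cheaper model
judge_model = "openai:gpt-4o"  # stronger model used for judging
```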
## Write your first test: Python API
Create tests/test_my_agent.py:
```python
import pytest

from pytest_agent_eval import Turn, Expect, ContainsEvaluator


async def my_agent(messages):
    """Your agent callable: receives OpenAI-style messages, returns (reply, tool_calls)."""
    return "Your booking is confirmed for tomorrow at 10am.", []


@pytest.mark.agent_eval(threshold=0.8, runs=3)
async def test_booking_confirmation(agent_eval):
    result = await agent_eval.run(
        agent=my_agent,
        turns=[
            Turn(
                user="Book me a slot for tomorrow at 10am.",
                expect=Expect(
                    evaluators=[
                        ContainsEvaluator(any_of=["confirmed", "booked"]),
                        ContainsEvaluator(all_of=["tomorrow", "10"]),
                    ]
                ),
            )
        ],
    )
    result.assert_threshold()
```
## Write your first test: YAML style
> **YAML auto-discovery.** Any `*.yaml` file inside a directory listed in `yaml_dirs` becomes a pytest test automatically, with no Python wrapper and no decorator. Drop a file, run pytest, see the result.
Create tests/evals/booking.yaml:
```yaml
id: booking_confirmation
threshold: 0.8
runs: 3
turns:
  - user: "Book me a slot for tomorrow at 10am."
    expect:
      reply_contains_any:
        - "confirmed"
        - "booked"
      reply_contains_all:
        - "tomorrow"
        - "10"
```
Then register the YAML directory in pyproject.toml:
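```toml
[tool.agent_eval]
yaml_dirs = ["tests/evals"]  # same yaml_dirs entry as in the configuration section above
```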
You must also provide an llm_eval_agent fixture so the loader knows what to call:
```python
# tests/conftest.py
import pytest


@pytest.fixture
def llm_eval_agent():
    async def my_agent(messages):
        return "Your booking is confirmed for tomorrow at 10am.", []

    return my_agent
```
## Run the tests
By default, eval tests are skipped in CI to avoid unexpected API calls, so enable them explicitly with the plugin's opt-in option (the flag below is a placeholder; check `pytest --help` for the option the plugin actually registers):
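```bash
# Placeholder flag: substitute the plugin's real opt-in option or environment variable.
pytest tests/ --agent-eval
```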
A passing run shows the score alongside the test name; use `-vv` for full turn-by-turn details, including evaluator reasoning.
## Run in parallel
For large eval suites, install the xdist extra and pass -n to run tests across worker processes — results from every worker are aggregated into the report:
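```bash
pip install "pytest-agent-eval[xdist]"   # the xdist extra mentioned above
pytest -n auto                           # one worker per CPU core; add your eval opt-in from the previous section
```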