Getting Started

This guide walks you through installing pytest-agent-eval and writing your first passing evaluation test.

Installation

pip install pytest-agent-eval
# or, with uv:
uv add pytest-agent-eval

For framework-specific adapters, install the matching optional extra:

pip install "pytest-agent-eval[langchain]"   # LangChain / LangGraph support
pip install "pytest-agent-eval[openai]"      # OpenAI SDK support
pip install "pytest-agent-eval[xdist]"       # parallel test execution
uv add "pytest-agent-eval[langchain]"
uv add "pytest-agent-eval[openai]"
uv add "pytest-agent-eval[xdist]"

Configure pyproject.toml

Add a [tool.agent_eval] section to your pyproject.toml:

[tool.agent_eval]
model       = "openai:gpt-4o"   # default judge + agent-fallback model
threshold   = 0.8               # minimum score a test must reach to pass
runs        = 3                 # number of runs per eval test
yaml_dirs   = ["tests/evals"]   # enables YAML auto-discovery

Use a separate judge model

Set judge_model independently of the agent-under-test model; you typically want a stronger model judging a cheaper agent.
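
For example, a configuration along these lines (the specific model choices here are illustrative, not defaults):

[tool.agent_eval]
model       = "openai:gpt-4o-mini"   # cheaper model used as the agent fallback
judge_model = "openai:gpt-4o"        # stronger model used only for judging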

Write your first test — Python API

Create tests/test_my_agent.py:

import pytest
from pytest_agent_eval import Turn, Expect, ContainsEvaluator

async def my_agent(messages):
    """Your agent callable — receives OpenAI-style messages, returns (reply, tool_calls)."""
    return "Your booking is confirmed for tomorrow at 10am.", []

@pytest.mark.agent_eval(threshold=0.8, runs=3)
async def test_booking_confirmation(agent_eval):
    result = await agent_eval.run(
        agent=my_agent,
        turns=[
            Turn(
                user="Book me a slot for tomorrow at 10am.",
                expect=Expect(
                    evaluators=[
                        ContainsEvaluator(any_of=["confirmed", "booked"]),
                        ContainsEvaluator(all_of=["tomorrow", "10"]),
                    ]
                ),
            )
        ],
    )
    result.assert_threshold()
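
The stub above returns a canned reply so the example runs without any network calls. If you want the agent under test to call a real model, one possible shape, assuming the official OpenAI Python SDK (covered by the [openai] extra) and a hypothetical my_real_agent wrapper, is:

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def my_real_agent(messages):
    """Forward OpenAI-style messages to a real model, return (reply, tool_calls)."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=messages,
    )
    return response.choices[0].message.content, []

Pass my_real_agent as the agent= argument to agent_eval.run exactly as above; the agent callable only needs to keep the (reply, tool_calls) return shape.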

Write your first test — YAML style

YAML auto-discovery

Any *.yaml file inside a directory listed in yaml_dirs becomes a pytest test automatically — no Python wrapper, no decorator. Drop a file, run pytest, see the result.

Create tests/evals/booking.yaml:

id: booking_confirmation
threshold: 0.8
runs: 3
turns:
  - user: "Book me a slot for tomorrow at 10am."
    expect:
      reply_contains_any:
        - "confirmed"
        - "booked"
      reply_contains_all:
        - "tomorrow"

Then make sure the YAML directory is registered in pyproject.toml (the configuration shown earlier already includes this):

[tool.agent_eval]
yaml_dirs = ["tests/evals"]

You must also provide an llm_eval_agent fixture so the YAML loader knows which agent to call:

# tests/conftest.py
import pytest

@pytest.fixture
def llm_eval_agent():
    async def my_agent(messages):
        return "Your booking is confirmed for tomorrow at 10am.", []
    return my_agent

Run the tests

By default, eval tests are skipped in CI to avoid unexpected API calls. Enable them explicitly with the CLI flag or the environment variable:

pytest --agent-eval-live
# or via environment variable
EVAL_LIVE=1 pytest

A passing run shows the score alongside the test name:

tests/test_my_agent.py::test_booking_confirmation PASSED [score=1.00 threshold=0.80 runs=3/3]

Use -vv for full turn-by-turn details including evaluator reasoning.
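
For example, to re-run the test from this guide with full output:

pytest --agent-eval-live -vv tests/test_my_agent.py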

Run in parallel

For large eval suites, install the xdist extra and pass -n to run tests across worker processes — results from every worker are aggregated into the report:

pip install "pytest-agent-eval[xdist]"
pytest --agent-eval-live -n auto --agent-eval-report=eval.md
uv add "pytest-agent-eval[xdist]"
uv run pytest --agent-eval-live -n auto --agent-eval-report=eval.md