Skip to content

Configuration

All configuration lives under [tool.agent_eval] in pyproject.toml.

Complete example

[tool.agent_eval]
model     = "openai:gpt-4o"
threshold = 0.8
runs      = 3
retries   = 2
timeout   = 30
yaml_dirs = ["tests/evals"]
live      = false

Fields

model

Type: str Default: "openai:gpt-4o"

The pydantic-ai model ID used by JudgeEvaluator and for YAML-defined judge rubrics. Format is provider:model-name, e.g.:

  • "openai:gpt-4o"
  • "openai:gpt-4o-mini"
  • "anthropic:claude-3-5-sonnet-latest"

threshold

Type: float Default: 0.8

The default pass fraction across runs. A test passes when passed_runs / total_runs >= threshold.

Individual tests override this via @pytest.mark.agent_eval(threshold=0.9) or the YAML threshold field.

runs

Type: int Default: 3

The default number of times each transcript is executed. Higher values reduce the impact of nondeterminism but increase cost.

Individual tests override this via @pytest.mark.agent_eval(runs=5) or the YAML runs field.

retries

Type: int Default: 2

Number of times to retry a single run if the agent raises an exception (network error, rate limit, etc.).

timeout

Type: int Default: 30

Per-turn timeout in seconds. If the agent callable does not return within this window, the turn is marked as failed.

yaml_dirs

Type: list[str] Default: []

Directories to search recursively for *.yaml evaluation transcripts. Paths are relative to the project root (where pyproject.toml lives).

live

Type: bool Default: false

When true, eval tests run without needing --agent-eval-live or EVAL_LIVE=1. Useful for local development but should remain false in shared/CI config.

Precedence

Command-line flag --agent-eval-live > EVAL_LIVE=1 env var > live = true in config > default (skip).

Per-test threshold/runs in the mark or YAML always override the global config values.