# Evaluators API Reference
## `contains`

Check that the reply contains expected substrings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `any_of` | `list[str]` | Reply must contain at least one of these strings (case-insensitive). | `list()` |
| `all_of` | `list[str]` | Reply must contain every one of these strings (case-insensitive). | `list()` |
Source: `src/pytest_agent_eval/evaluators/contains.py`

Example
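A minimal configuration sketch. The reference does not name the evaluator class, so the class name `Contains` and its import path are assumptions; only the `any_of` and `all_of` parameters are documented above.

```python
# Hypothetical sketch: the class name `Contains` and this import path
# are assumptions; `any_of`/`all_of` are the documented parameters.
from pytest_agent_eval.evaluators.contains import Contains

# Passes only if the reply mentions "refund" or "replacement" (any_of)
# AND contains both "order" and "apolog" (all_of); matching is
# case-insensitive, and "apolog" catches apology/apologize/apologise.
evaluator = Contains(
    any_of=["refund", "replacement"],
    all_of=["order", "apolog"],
)
```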
### `evaluate(ctx: TurnContext) -> EvalResult` *(async)*
Evaluate substring presence in the reply.
## `tool_call`

Validate that specific tools were (or were not) called.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `must_include` | `list[str]` | Tool names that must appear in `tool_calls`. | `list()` |
| `must_exclude` | `list[str]` | Tool names that must *not* appear in `tool_calls`. | `list()` |
| `ordered` | `bool` | If `True`, the `must_include` tools must appear in the given order. | `False` |
Source: `src/pytest_agent_eval/evaluators/tool_call.py`

Example
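A minimal configuration sketch. As above, the class name `ToolCall` and its import path are assumptions; the three constructor arguments are the documented parameters.

```python
# Hypothetical sketch: the class name `ToolCall` and this import path
# are assumptions; the parameters are documented above.
from pytest_agent_eval.evaluators.tool_call import ToolCall

# Require search_flights to be called before book_flight, and fail
# the turn if cancel_booking is ever called.
evaluator = ToolCall(
    must_include=["search_flights", "book_flight"],
    must_exclude=["cancel_booking"],
    ordered=True,  # enforce the order given in must_include
)
```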
### `evaluate(ctx: TurnContext) -> EvalResult` *(async)*
Evaluate tool call presence and ordering.
## `judge`

Use an LLM to evaluate the reply against a rubric. Uses pydantic-ai under the hood; supports any pydantic-ai compatible model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `rubric` | `str` | Natural-language rubric describing what a passing reply looks like. | *required* |
| `model` | `str \| None` | pydantic-ai model string (e.g. in `provider:model-name` form). | `None` |
| `retries` | `int` | Number of retry attempts on API failure before returning a FAIL verdict. | `2` |
| `timeout` | `float` | Seconds before the judge call times out. | `30.0` |
Source: `src/pytest_agent_eval/evaluators/judge.py`

Example
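A minimal configuration sketch. The class name `Judge` and its import path are assumptions; the constructor arguments mirror the documented parameters, with the defaults written out explicitly.

```python
# Hypothetical sketch: the class name `Judge` and this import path
# are assumptions; the parameters are documented above.
from pytest_agent_eval.evaluators.judge import Judge

evaluator = Judge(
    rubric=(
        "The reply acknowledges the shipping delay, gives a concrete "
        "new delivery estimate, and offers some form of compensation."
    ),
    model=None,    # default; what None falls back to is not specified here
    retries=2,     # retry API failures twice before returning FAIL
    timeout=30.0,  # seconds before the judge call times out
)
```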
### `evaluate(ctx: TurnContext) -> EvalResult` *(async)*
Run the LLM judge against the turn and return its verdict.
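All three evaluators share this async entry point, so they can be driven uniformly once you have a `TurnContext`. The sketch below assumes only the documented `evaluate(ctx: TurnContext) -> EvalResult` signature; how a `TurnContext` is constructed is not covered by this reference.

```python
import asyncio

# Run several evaluators concurrently against the same turn. Only the
# `evaluate(ctx) -> EvalResult` signature is taken from this reference;
# obtaining `ctx` (a TurnContext) is assumed to happen elsewhere.
async def run_checks(evaluators, ctx):
    return await asyncio.gather(*(e.evaluate(ctx) for e in evaluators))
```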