Smolagents Adapter — Design¶

Date: 2026-04-29 Status: Approved

Goal¶

Add a first-class adapter for smolagents so users testing a smolagents-based agent can drop it into pytest-llm-eval the same way they would a pydantic-ai, LangChain, or OpenAI agent.

Background¶

pytest-llm-eval already exposes adapters for three frameworks (PydanticAIAdapter, LangChainAdapter, OpenAIAdapter). Each adapter wraps a framework-native agent so it conforms to the plugin's callable contract:

async def agent(history: list[dict[str, Any]]) -> tuple[str, list[str]]

history is OpenAI-style messages. The return tuple is (reply, tool_call_names). The runner calls this once per turn, with the accumulated conversation history.

Smolagents differs from the existing frameworks in three ways:

It owns its memory. Conversations continue via agent.run(msg, reset=False); passing reset=True (default) clears memory before running.
agent.run() is synchronous. Our adapter contract is async.
Tool calls are recorded in agent.memory.steps. ToolCallingAgent records structured tool_calls; CodeAgent invokes tools via Python code so its recorded calls are dominated by the smolagents-internal python_interpreter. Both agent types also record a final final_answer "tool call" to terminate.

Architecture¶

A single new module src/pytest_llm_eval/adapters/smolagents.py exporting one class:

class SmolagentsAdapter:
    def __init__(self, agent: Any, *, include_internal_tools: bool = False) -> None: ...
    async def __call__(self, history: list[dict[str, Any]]) -> tuple[str, list[str]]: ...

The adapter is duck-typed. It depends on:

agent.run(task: str, reset: bool) -> Any
agent.memory.steps — an iterable; each item may have a tool_calls attribute that is itself an iterable of objects with a .name string.

It does not import smolagents at runtime. This keeps the adapter usable with mocks/fakes in tests and forward-compatible with new smolagents agent classes.

Per-call flow¶

Extract the latest user message: user_msg = history[-1]["content"].
Decide reset behaviour: reset = (len(history) == 1).
First turn of a transcript (only one user message in history) ⇒ fresh memory.
Subsequent turns ⇒ continue the existing conversation.
Snapshot the existing step count: prev = len(agent.memory.steps).
Run the agent off the event loop: result = await asyncio.to_thread(agent.run, user_msg, reset=reset).
Walk the new step slice (agent.memory.steps[prev:]) and collect step.tool_calls[*].name from steps that have a tool_calls attribute.
If include_internal_tools is False, drop names equal to "python_interpreter" or "final_answer".
Return (str(result), tool_call_names).

Why `asyncio.to_thread`¶

Smolagents agents are synchronous. Other adapters in this project wrap async-native frameworks. To keep the contract consistent and avoid blocking the event loop (which matters under pytest-xdist and concurrent runs via asyncio.gather), the adapter runs the agent in a worker thread.

Why `len(history) == 1` for reset detection¶

Our runner builds up history turn-by-turn within a single transcript run. Across runs of the same transcript (when runs > 1), it restarts at turn 1 with a single-message history. So len(history) == 1 is the precise marker for "first turn of a fresh conversation". Subsequent runs are isolated because each first turn resets memory.

Configuration¶

A new optional dependency in pyproject.toml:

[project.optional-dependencies]
smolagents = ["smolagents>=1.0"]

Users install via pip install "pytest-llm-eval[smolagents]" or uv add "pytest-llm-eval[smolagents]".

Testing¶

All tests are unit tests against a hand-rolled fake agent. Smolagents itself is not added to dev dependencies — this matches the project's pattern (langchain/openai/pydantic-ai aren't in the dev group either).

Tests live in a new tests/adapters/test_smolagents.py:

Test	What it asserts
`test_first_turn_passes_reset_true`	history of length 1 ⇒ fake records `reset=True`
`test_subsequent_turn_passes_reset_false`	history of length > 1 ⇒ fake records `reset=False`
`test_returns_reply_string`	The agent's return value is stringified into the first tuple element
`test_extracts_new_tool_calls_only`	Pre-existing steps in `memory.steps` are ignored; only new steps' tool calls are returned
`test_filters_python_interpreter_and_final_answer_by_default`	Internal pseudo-tools are filtered
`test_include_internal_tools_returns_them`	With `include_internal_tools=True`, internals appear in the result
`test_handles_steps_without_tool_calls`	Steps lacking a `tool_calls` attribute are skipped without raising

The fake agent uses types.SimpleNamespace to mirror the duck-typed surface (memory.steps, tool_calls[*].name).

Documentation¶

docs/adapters.md — add a SmolagentsAdapter section with the install snippet (pip + uv tabs), a fixture example, and the constructor parameter table including include_internal_tools.
docs/index.md — add a Smolagents entry to the "Supported frameworks" tabbed block.
README.md — add a row to the framework table and an [smolagents] install line.

Error handling¶

No defensive guards in the adapter:

If agent lacks .run or .memory.steps, an AttributeError propagates. That is the correct signal — the user passed in a wrong object.
If agent.run raises, the exception propagates to the runner, where the existing retry layer (configured via [tool.llm_eval] retries) handles transient failures.

This matches the posture of the existing adapters.

Out of scope¶

Parsing CodeAgent's executed Python to extract per-tool calls. Smolagents records only the python_interpreter step for CodeAgent; getting at the inner tools would require AST-walking the executed code. Document this limitation; users wanting fine-grained tool-call assertions should use ToolCallingAgent.
An async agent factory pattern. Sharing a single agent across parallel tests is a known limitation that applies to every adapter in the project — out of scope for this design.
Streaming or step-by-step inspection. The adapter consumes only the final result of agent.run.