Smolagents Adapter — Design¶
Date: 2026-04-29 Status: Approved
Goal¶
Add a first-class adapter for smolagents so users testing a smolagents-based agent can drop it into pytest-llm-eval the same way they would a pydantic-ai, LangChain, or OpenAI agent.
Background¶
pytest-llm-eval already exposes adapters for three frameworks (PydanticAIAdapter, LangChainAdapter, OpenAIAdapter). Each adapter wraps a framework-native agent so it conforms to the plugin's callable contract:
history is OpenAI-style messages. The return tuple is (reply, tool_call_names). The runner calls this once per turn, with the accumulated conversation history.
Smolagents differs from the existing frameworks in three ways:
- It owns its memory. Conversations continue via
agent.run(msg, reset=False); passingreset=True(default) clears memory before running. agent.run()is synchronous. Our adapter contract is async.- Tool calls are recorded in
agent.memory.steps.ToolCallingAgentrecords structuredtool_calls;CodeAgentinvokes tools via Python code so its recorded calls are dominated by the smolagents-internalpython_interpreter. Both agent types also record a finalfinal_answer"tool call" to terminate.
Architecture¶
A single new module src/pytest_llm_eval/adapters/smolagents.py exporting one class:
class SmolagentsAdapter:
def __init__(self, agent: Any, *, include_internal_tools: bool = False) -> None: ...
async def __call__(self, history: list[dict[str, Any]]) -> tuple[str, list[str]]: ...
The adapter is duck-typed. It depends on:
agent.run(task: str, reset: bool) -> Anyagent.memory.steps— an iterable; each item may have atool_callsattribute that is itself an iterable of objects with a.namestring.
It does not import smolagents at runtime. This keeps the adapter usable with mocks/fakes in tests and forward-compatible with new smolagents agent classes.
Per-call flow¶
- Extract the latest user message:
user_msg = history[-1]["content"]. - Decide reset behaviour:
reset = (len(history) == 1). - First turn of a transcript (only one user message in history) ⇒ fresh memory.
- Subsequent turns ⇒ continue the existing conversation.
- Snapshot the existing step count:
prev = len(agent.memory.steps). - Run the agent off the event loop:
result = await asyncio.to_thread(agent.run, user_msg, reset=reset). - Walk the new step slice (
agent.memory.steps[prev:]) and collectstep.tool_calls[*].namefrom steps that have atool_callsattribute. - If
include_internal_tools is False, drop names equal to"python_interpreter"or"final_answer". - Return
(str(result), tool_call_names).
Why asyncio.to_thread¶
Smolagents agents are synchronous. Other adapters in this project wrap async-native frameworks. To keep the contract consistent and avoid blocking the event loop (which matters under pytest-xdist and concurrent runs via asyncio.gather), the adapter runs the agent in a worker thread.
Why len(history) == 1 for reset detection¶
Our runner builds up history turn-by-turn within a single transcript run. Across runs of the same transcript (when runs > 1), it restarts at turn 1 with a single-message history. So len(history) == 1 is the precise marker for "first turn of a fresh conversation". Subsequent runs are isolated because each first turn resets memory.
Configuration¶
A new optional dependency in pyproject.toml:
Users install via pip install "pytest-llm-eval[smolagents]" or uv add "pytest-llm-eval[smolagents]".
Testing¶
All tests are unit tests against a hand-rolled fake agent. Smolagents itself is not added to dev dependencies — this matches the project's pattern (langchain/openai/pydantic-ai aren't in the dev group either).
Tests live in a new tests/adapters/test_smolagents.py:
| Test | What it asserts |
|---|---|
test_first_turn_passes_reset_true |
history of length 1 ⇒ fake records reset=True |
test_subsequent_turn_passes_reset_false |
history of length > 1 ⇒ fake records reset=False |
test_returns_reply_string |
The agent's return value is stringified into the first tuple element |
test_extracts_new_tool_calls_only |
Pre-existing steps in memory.steps are ignored; only new steps' tool calls are returned |
test_filters_python_interpreter_and_final_answer_by_default |
Internal pseudo-tools are filtered |
test_include_internal_tools_returns_them |
With include_internal_tools=True, internals appear in the result |
test_handles_steps_without_tool_calls |
Steps lacking a tool_calls attribute are skipped without raising |
The fake agent uses types.SimpleNamespace to mirror the duck-typed surface (memory.steps, tool_calls[*].name).
Documentation¶
docs/adapters.md— add aSmolagentsAdaptersection with the install snippet (pip + uv tabs), a fixture example, and the constructor parameter table includinginclude_internal_tools.docs/index.md— add aSmolagentsentry to the "Supported frameworks" tabbed block.README.md— add a row to the framework table and an[smolagents]install line.
Error handling¶
No defensive guards in the adapter:
- If
agentlacks.runor.memory.steps, anAttributeErrorpropagates. That is the correct signal — the user passed in a wrong object. - If
agent.runraises, the exception propagates to the runner, where the existing retry layer (configured via[tool.llm_eval] retries) handles transient failures.
This matches the posture of the existing adapters.
Out of scope¶
- Parsing CodeAgent's executed Python to extract per-tool calls. Smolagents records only the
python_interpreterstep for CodeAgent; getting at the inner tools would require AST-walking the executed code. Document this limitation; users wanting fine-grained tool-call assertions should useToolCallingAgent. - An async agent factory pattern. Sharing a single agent across parallel tests is a known limitation that applies to every adapter in the project — out of scope for this design.
- Streaming or step-by-step inspection. The adapter consumes only the final
resultofagent.run.