Skip to content

Smolagents Adapter — Design

Date: 2026-04-29 Status: Approved

Goal

Add a first-class adapter for smolagents so users testing a smolagents-based agent can drop it into pytest-llm-eval the same way they would a pydantic-ai, LangChain, or OpenAI agent.

Background

pytest-llm-eval already exposes adapters for three frameworks (PydanticAIAdapter, LangChainAdapter, OpenAIAdapter). Each adapter wraps a framework-native agent so it conforms to the plugin's callable contract:

async def agent(history: list[dict[str, Any]]) -> tuple[str, list[str]]

history is OpenAI-style messages. The return tuple is (reply, tool_call_names). The runner calls this once per turn, with the accumulated conversation history.

Smolagents differs from the existing frameworks in three ways:

  1. It owns its memory. Conversations continue via agent.run(msg, reset=False); passing reset=True (default) clears memory before running.
  2. agent.run() is synchronous. Our adapter contract is async.
  3. Tool calls are recorded in agent.memory.steps. ToolCallingAgent records structured tool_calls; CodeAgent invokes tools via Python code so its recorded calls are dominated by the smolagents-internal python_interpreter. Both agent types also record a final final_answer "tool call" to terminate.

Architecture

A single new module src/pytest_llm_eval/adapters/smolagents.py exporting one class:

class SmolagentsAdapter:
    def __init__(self, agent: Any, *, include_internal_tools: bool = False) -> None: ...
    async def __call__(self, history: list[dict[str, Any]]) -> tuple[str, list[str]]: ...

The adapter is duck-typed. It depends on:

  • agent.run(task: str, reset: bool) -> Any
  • agent.memory.steps — an iterable; each item may have a tool_calls attribute that is itself an iterable of objects with a .name string.

It does not import smolagents at runtime. This keeps the adapter usable with mocks/fakes in tests and forward-compatible with new smolagents agent classes.

Per-call flow

  1. Extract the latest user message: user_msg = history[-1]["content"].
  2. Decide reset behaviour: reset = (len(history) == 1).
  3. First turn of a transcript (only one user message in history) ⇒ fresh memory.
  4. Subsequent turns ⇒ continue the existing conversation.
  5. Snapshot the existing step count: prev = len(agent.memory.steps).
  6. Run the agent off the event loop: result = await asyncio.to_thread(agent.run, user_msg, reset=reset).
  7. Walk the new step slice (agent.memory.steps[prev:]) and collect step.tool_calls[*].name from steps that have a tool_calls attribute.
  8. If include_internal_tools is False, drop names equal to "python_interpreter" or "final_answer".
  9. Return (str(result), tool_call_names).

Why asyncio.to_thread

Smolagents agents are synchronous. Other adapters in this project wrap async-native frameworks. To keep the contract consistent and avoid blocking the event loop (which matters under pytest-xdist and concurrent runs via asyncio.gather), the adapter runs the agent in a worker thread.

Why len(history) == 1 for reset detection

Our runner builds up history turn-by-turn within a single transcript run. Across runs of the same transcript (when runs > 1), it restarts at turn 1 with a single-message history. So len(history) == 1 is the precise marker for "first turn of a fresh conversation". Subsequent runs are isolated because each first turn resets memory.

Configuration

A new optional dependency in pyproject.toml:

[project.optional-dependencies]
smolagents = ["smolagents>=1.0"]

Users install via pip install "pytest-llm-eval[smolagents]" or uv add "pytest-llm-eval[smolagents]".

Testing

All tests are unit tests against a hand-rolled fake agent. Smolagents itself is not added to dev dependencies — this matches the project's pattern (langchain/openai/pydantic-ai aren't in the dev group either).

Tests live in a new tests/adapters/test_smolagents.py:

Test What it asserts
test_first_turn_passes_reset_true history of length 1 ⇒ fake records reset=True
test_subsequent_turn_passes_reset_false history of length > 1 ⇒ fake records reset=False
test_returns_reply_string The agent's return value is stringified into the first tuple element
test_extracts_new_tool_calls_only Pre-existing steps in memory.steps are ignored; only new steps' tool calls are returned
test_filters_python_interpreter_and_final_answer_by_default Internal pseudo-tools are filtered
test_include_internal_tools_returns_them With include_internal_tools=True, internals appear in the result
test_handles_steps_without_tool_calls Steps lacking a tool_calls attribute are skipped without raising

The fake agent uses types.SimpleNamespace to mirror the duck-typed surface (memory.steps, tool_calls[*].name).

Documentation

  • docs/adapters.md — add a SmolagentsAdapter section with the install snippet (pip + uv tabs), a fixture example, and the constructor parameter table including include_internal_tools.
  • docs/index.md — add a Smolagents entry to the "Supported frameworks" tabbed block.
  • README.md — add a row to the framework table and an [smolagents] install line.

Error handling

No defensive guards in the adapter:

  • If agent lacks .run or .memory.steps, an AttributeError propagates. That is the correct signal — the user passed in a wrong object.
  • If agent.run raises, the exception propagates to the runner, where the existing retry layer (configured via [tool.llm_eval] retries) handles transient failures.

This matches the posture of the existing adapters.

Out of scope

  • Parsing CodeAgent's executed Python to extract per-tool calls. Smolagents records only the python_interpreter step for CodeAgent; getting at the inner tools would require AST-walking the executed code. Document this limitation; users wanting fine-grained tool-call assertions should use ToolCallingAgent.
  • An async agent factory pattern. Sharing a single agent across parallel tests is a known limitation that applies to every adapter in the project — out of scope for this design.
  • Streaming or step-by-step inspection. The adapter consumes only the final result of agent.run.