Tool calling lets your voice agent execute Python functions during a conversation (querying data, controlling hardware, or triggering external services) and optionally speak the results back to the user. Speech-to-Speech supports two distinct tool-calling paths that share the same wire protocol for clients but differ internally depending on the LLM backend.
Two tool-calling paths
- Local LLM (transformers / mlx-lm)
- OpenAI API (responses-api)
When --llm_backend transformers or --llm_backend mlx-lm is active, there is no native function-calling protocol. Instead, the pipeline uses prompt engineering: tools are rendered as Python-style function stubs and injected into the system prompt. The model is instructed to wrap any tool invocations inside <code>...</code> delimiters.
How it works:
- Tools defined via session.update are converted to FunctionTool objects (which extend RealtimeFunctionTool from the openai library).
- FunctionTool.to_code_prompt() renders each tool as a def name(arg: type) -> ...: """docstring""" stub, using signature_from_schema() to convert JSON Schema types to Python type annotations.
- The stubs are injected into the system prompt via a Jinja2 template (tool_prompt.py; see build_tool_system_prompt), which tells the model to output tool calls as <code>function_name(arg='value')</code>.
- After generation, the pipeline extracts <code> blocks with a regex, parses each name(kwargs) expression using Python’s ast and tokenize modules, and validates arguments against the registered tool schema.
- Valid parsed calls become ResponseFunctionToolCall objects with auto-generated call_ids.
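The extraction and validation steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: parse_tool_calls, the regex, the call_id format, and the set_led tool are assumptions made for the example. It uses only ast (the pipeline is described as also using tokenize) and returns plain dicts rather than ResponseFunctionToolCall objects.

```python
import ast
import json
import re
import uuid

# Hypothetical registry: tool name -> JSON Schema for its parameters.
TOOLS = {
    "set_led": {
        "type": "object",
        "properties": {
            "color": {"type": "string"},
            "brightness": {"type": "integer"},
        },
        "required": ["color"],
    }
}

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def parse_tool_calls(text, tools=TOOLS):
    """Return {call_id, name, arguments} dicts for each valid call in text."""
    calls = []
    for block in CODE_BLOCK.findall(text):
        try:
            node = ast.parse(block.strip(), mode="eval").body
        except (SyntaxError, ValueError):
            continue  # not a single Python expression
        # Accept only plain `name(key=value, ...)` calls to registered tools.
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        if node.args or node.func.id not in tools:
            continue
        try:
            # literal_eval rejects anything that is not a plain literal.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        except (ValueError, TypeError):
            continue
        schema = tools[node.func.id]
        known = set(schema.get("properties", {}))
        required = set(schema.get("required", []))
        # Reject calls with missing required or unknown arguments.
        if not required <= set(kwargs) or not set(kwargs) <= known:
            continue
        calls.append({
            "call_id": f"call_{uuid.uuid4().hex[:12]}",
            "name": node.func.id,
            "arguments": json.dumps(kwargs),
        })
    return calls


reply = "Sure, turning it on. <code>set_led(color='red', brightness=3)</code>"
parsed = parse_tool_calls(reply)
print(parsed[0]["name"], parsed[0]["arguments"])
```

Parsing with ast rather than eval means the model's output is inspected as syntax and never executed, which is the main reason to prefer this approach for untrusted text.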
Defining tools via session.update
Tools are registered through the session.update event, which accepts a JSON Schema tools array in the same format as the OpenAI Realtime API. Both local-LLM and API paths read from the same session config.
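As a rough illustration, a session.update event carrying one tool might look like the following. The get_weather tool, its description, and its parameters are placeholders invented for this example. The tool entry follows the OpenAI Realtime function-tool shape (type, name, description, parameters). Whether the server expects other session fields is not covered here.

```python
import json

# Hypothetical tool definition; name, description, and parameters are placeholders.
get_weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'."},
        },
        "required": ["city"],
    },
}

# session.update event with a JSON Schema tools array.
session_update = {
    "type": "session.update",
    "session": {"tools": [get_weather_tool]},
}

payload = json.dumps(session_update)  # what would be sent over the wire
print(payload)
```

With the openai Python client, the same session dict would typically be passed as conn.session.update(session=session_update["session"]) instead of being serialized by hand.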
The tool call cycle
When the model decides to call a tool, the server emits a response.function_call_arguments.done event containing the call_id, name, and JSON-encoded arguments. Your client executes the function, then sends the result back.
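import json

# conn: an already-open connection to the server (setup not shown)
for event in conn:
    if event.type == "response.function_call_arguments.done":
        call_id = event.call_id
        name = event.name
        arguments = json.loads(event.arguments)  # decode the JSON-encoded arguments
        print(f"Tool called: {name}({arguments})")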
Send the tool output back to the server. This injects the result into the LLM context but does not trigger a new generation automatically:
conn.conversation.item.create(
    item={
        "type": "function_call_output",
        "call_id": call_id,
        "output": result_str,  # the tool's result, serialized as a string
    }
)
If the tool result should be spoken to the user, send response.create to kick off a new generation pass. The LLM will see the function output in context and synthesize a natural spoken answer.
Complete client example
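A minimal sketch of the full cycle, assuming conn is an already-open Realtime connection from the openai Python client (connection setup is omitted because it depends on your deployment). The handle_tool_call helper, the TOOLS registry, the get_time tool, and the speak_result flag are hypothetical names used for illustration. A small recording stub, FakeConnection, stands in for the real connection so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_time(city):
    """Hypothetical tool; a real one would query a clock or an API."""
    return f"It is 12:00 in {city}."


TOOLS = {"get_time": get_time}  # tool name -> local Python function


def handle_tool_call(conn, event, speak_result=True):
    """Run the requested tool, send its output, optionally request speech."""
    arguments = json.loads(event.arguments)
    result = TOOLS[event.name](**arguments)
    # Inject the result into the LLM context (does not start a generation).
    conn.conversation.item.create(
        item={
            "type": "function_call_output",
            "call_id": event.call_id,
            "output": str(result),
        }
    )
    if speak_result:
        conn.response.create()  # ask the model to speak the result


def run(conn):
    """Dispatch every completed tool call seen on the connection."""
    for event in conn:
        if event.type == "response.function_call_arguments.done":
            handle_tool_call(conn, event)


# --- Stand-in connection so the sketch can run offline (assumption) ---
class FakeConnection:
    def __init__(self, events):
        self._events = events
        self.sent = []  # records what the client would have sent
        self.conversation = SimpleNamespace(
            item=SimpleNamespace(create=lambda item: self.sent.append(("item", item)))
        )
        self.response = SimpleNamespace(
            create=lambda: self.sent.append(("response", None))
        )

    def __iter__(self):
        return iter(self._events)


fake_event = SimpleNamespace(
    type="response.function_call_arguments.done",
    call_id="call_1",
    name="get_time",
    arguments='{"city": "Paris"}',
)
conn = FakeConnection([fake_event])
run(conn)
print(conn.sent)
```

For tools whose result does not need to be spoken, calling handle_tool_call(conn, event, speak_result=False) sends the output without requesting a new generation.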
Fire-and-forget vs. spoken results
Not all tool calls need a spoken follow-up. The pattern depends on whether the action is purely mechanical or returns information the user should hear.
- Fire-and-forget (robot / UI actions)
- Data / search tools (spoken result)
For actions like triggering an LED, moving a robot joint, or playing a sound effect, the model already speaks a natural lead-in before invoking the tool (e.g. "Sure, here's my best happy expression."). After conversation.item.create, you do not need to call response.create.
FunctionTool and to_code_prompt()
FunctionTool extends RealtimeFunctionTool from the openai library and adds a single extra method for the local-LLM prompt-engineering path: to_code_prompt().
include_args_doc and token impact
The include_args_doc parameter controls whether per-argument descriptions are included in the rendered docstring. This has a large effect on prompt size:
| include_args_doc | Approx. tokens (Reachy Mini tool profile) |
|---|---|
| False | ~906 tokens |
| True | ~3,434 tokens |
Use include_args_doc=True when working with smaller models that benefit from the extra context, or when argument names alone are not self-explanatory. Disable it (False) to reduce token usage and latency, especially for models with a limited context window or in latency-sensitive deployments.
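To make the trade-off concrete, the sketch below renders one tool schema as a Python stub with and without per-argument descriptions. It is an illustrative approximation, not the library's FunctionTool.to_code_prompt() implementation: render_stub, the type mapping, the docstring layout, and the move_head tool are all assumptions made for the example.

```python
# JSON Schema type -> Python annotation (illustrative subset, an assumption).
TYPE_MAP = {
    "string": "str",
    "integer": "int",
    "number": "float",
    "boolean": "bool",
    "array": "list",
    "object": "dict",
}


def render_stub(tool, include_args_doc=True):
    """Render a function-tool schema as a `def name(arg: type) -> ...:` stub."""
    props = tool["parameters"].get("properties", {})
    args = ", ".join(
        f"{name}: {TYPE_MAP.get(spec.get('type'), 'Any')}"
        for name, spec in props.items()
    )
    doc_lines = [tool.get("description", "")]
    if include_args_doc and props:
        # Per-argument descriptions are what inflate the prompt.
        doc_lines.append("Args:")
        for name, spec in props.items():
            doc_lines.append(f"    {name}: {spec.get('description', '')}")
    doc = "\n    ".join(doc_lines)
    return f'def {tool["name"]}({args}) -> ...:\n    """{doc}"""'


# Hypothetical tool used only for this example.
move_head = {
    "type": "function",
    "name": "move_head",
    "description": "Turn the robot's head.",
    "parameters": {
        "type": "object",
        "properties": {
            "yaw": {"type": "number", "description": "Left/right angle in degrees."},
            "duration": {"type": "number", "description": "Movement time in seconds."},
        },
        "required": ["yaw"],
    },
}

short = render_stub(move_head, include_args_doc=False)
full = render_stub(move_head, include_args_doc=True)
print(short)
print(full)
print(len(short), "vs", len(full), "characters")
```

Character counts are only a proxy for tokens, but the same pattern explains the gap in the table: every argument description is repeated in the system prompt for every registered tool.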