The Language Model stage is the most compute-intensive component in the pipeline. A single forward pass through a large model can dominate end-to-end latency, so choosing the right backend for your hardware and latency budget matters. Select a backend with --llm_backend (default: responses-api) and pair it with --model_name.
Backend selection
| --llm_backend value | Handler class | Best for |
|---|---|---|
| transformers | LanguageModelHandler | Local CUDA/CPU inference via Hugging Face Transformers |
| mlx-lm | LanguageModelHandler | Local Apple Silicon inference via mlx-lm |
| responses-api | ResponsesApiModelHandler | Any OpenAI-compatible /v1/responses provider (default) |
| chat-completions | ChatCompletionsApiModelHandler | OpenAI-compatible /v1/chat/completions; prefer when Responses API streaming tool-calls are unreliable |
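The mapping in the table can be pictured as a simple lookup. The sketch below is illustrative only: the handler names and the default come from the table, while `BACKEND_HANDLERS`, `DEFAULT_BACKEND`, and `handler_for` are hypothetical names, not part of the project's code.

```python
# Hypothetical lookup mirroring the backend table; not the project's own code.
BACKEND_HANDLERS = {
    "transformers": "LanguageModelHandler",
    "mlx-lm": "LanguageModelHandler",
    "responses-api": "ResponsesApiModelHandler",
    "chat-completions": "ChatCompletionsApiModelHandler",
}

DEFAULT_BACKEND = "responses-api"  # default stated in the documentation


def handler_for(backend=None):
    """Return the handler class name for a --llm_backend value."""
    name = backend or DEFAULT_BACKEND
    try:
        return BACKEND_HANDLERS[name]
    except KeyError:
        valid = ", ".join(sorted(BACKEND_HANDLERS))
        raise ValueError(f"unknown backend {name!r}; expected one of: {valid}") from None
```

Note that both local backends resolve to the same handler class, so the backend value, not the class, decides which runtime loads the model.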
Shared arguments
These flags apply across all four backends:
- Model name (--model_name): The model to load or the model ID to send to the API. For local backends this is a Hugging Face Hub model ID or a local path; for API backends it is the model string the provider expects (e.g. gpt-4o-mini, gpt-5.4-mini). The responses-api and chat-completions backends override the default to gpt-5.4-mini.
- Chat size (chat_size): Number of assistant–user turn pairs to keep in the rolling context window. When the chat exceeds this size, older turns are compacted (summarised) in the background by compact_history.
- Initial chat role: The role assigned to the initial chat message (system prompt). Default is system.
- Initial chat prompt: The system prompt injected at the start of every conversation. Override this to change the assistant’s persona or behaviour.
- User role: The role label assigned to user turns in the chat history. Default is user.
- Language reply instruction: When True, appends a "Please reply to my message in <language>" instruction after each user turn when the detected language is known. Helps smaller models stay in the correct language when --language auto is used. Large models typically infer the language from context without this flag.
- Sentence batch size: Number of complete sentences to accumulate before yielding a batch to the TTS stage. Set to 1 for sentence-by-sentence streaming; higher values reduce TTS cold-start overhead at the cost of slightly higher latency to first audio.
- History compaction (compact_history): When True, older turns are summarised in the background by an extra LLM call once chat_size is exceeded, instead of being evicted synchronously. Keeps the context coherent over long conversations.
transformers — local CUDA/CPU inference
LanguageModelHandler loads the model with AutoModelForCausalLM and streams tokens via a TextIteratorStreamer running in a background thread.
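The streaming pattern can be sketched without loading a model. In the example below a standard-library queue and a worker thread stand in for TextIteratorStreamer and the generate call, and the consumer groups the incoming text into batches of complete sentences, as the shared sentence-batching argument describes. Everything here is an assumption for illustration: `fake_generate`, `stream_sentence_batches`, and the naive sentence-boundary regex are hypothetical and are not the handler's actual implementation.

```python
import queue
import re
import threading

# Naive sentence boundary (an assumption): text ends in ., ! or ?
_SENTENCE_END = re.compile(r"[.!?]\s*$")


def fake_generate(chunks, out_queue):
    """Stand-in for the model's generate call: push text chunks, then a sentinel."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(None)  # signals end of generation


def stream_sentence_batches(chunks, batch_size=1):
    """Read chunks produced by a background thread and yield sentence batches."""
    q = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, q))
    worker.start()
    current, batch = "", []
    while True:
        chunk = q.get()
        if chunk is None:
            break
        current += chunk
        if _SENTENCE_END.search(current):
            batch.append(current.strip())
            current = ""
            if len(batch) >= batch_size:
                yield " ".join(batch)
                batch = []
    worker.join()
    # Flush any trailing text once generation has finished.
    if current.strip():
        batch.append(current.strip())
    if batch:
        yield " ".join(batch)
```

With a batch size of 1 each sentence is handed on as soon as it completes; a larger batch size delays the first yield until that many sentences have accumulated.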
Backend-specific arguments (prefix --llm_):
| Argument | Default | Description |
|---|---|---|
--llm_device | cuda | Device: cuda, cpu, mps |
--llm_torch_dtype | float16 | Precision: float16, bfloat16, float32 |
--llm_gen_max_new_tokens | 1024 | Maximum tokens per response |
--llm_gen_min_new_tokens | 0 | Minimum tokens per response |
--llm_gen_temperature | 0.0 | Sampling temperature; 0 = deterministic |
--llm_gen_do_sample | False | Enable sampling; False = greedy |
--llm_is_vlm | False | Set True for vision-language models (loads AutoModelForImageTextToText) |
mlx-lm — Apple Silicon
Same LanguageModelHandler, but the backend is set to mlx internally. The model is loaded with mlx_lm.load() and generation uses mlx_lm.stream_generate(). Requires pip install "speech-to-speech[mlx-lm]".
Uses the same --llm_* argument prefix as transformers.
responses-api — OpenAI-compatible Responses API
ResponsesApiModelHandler calls the /v1/responses endpoint of any provider that implements the OpenAI Responses API protocol. This is the default backend and requires no local GPU.
Supported providers
| Provider | --responses_api_base_url | --responses_api_api_key |
|---|---|---|
| OpenAI | (omit — uses OpenAI default) | $OPENAI_API_KEY |
| HF Inference Providers | https://router.huggingface.co/v1 | $HF_TOKEN |
| OpenRouter | https://openrouter.ai/api/v1 | $OPENROUTER_API_KEY |
| vLLM (local) | http://localhost:8000/v1 | (omit or any string) |
| llama.cpp (local) | http://localhost:8080/v1 | (omit or any string) |
responses-api arguments
- API key (--responses_api_api_key): API key for the provider. Falls back to the OPENAI_API_KEY environment variable when unset. For local servers (vLLM, llama.cpp), pass any non-empty string.
- Base URL (--responses_api_base_url): Base URL of the OpenAI-compatible endpoint. Omit to use the OpenAI default (https://api.openai.com/v1).
- Stream: Stream tokens as they are generated. Strongly recommended for low-latency voice; disabling it blocks until the full response is ready.
- Disable thinking (--responses_api_disable_thinking): Sends chat_template_kwargs.enable_thinking=false on the Responses API request to suppress chain-of-thought reasoning tokens for providers that support it (e.g. Together + Qwen3.5 models). Disabling thinking reduces latency significantly for voice use cases.
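The fallback rules for the connection arguments, and the effect of disabling thinking, can be sketched as plain data handling. This is a minimal sketch under stated assumptions: `resolve_connection` and `build_responses_request` are hypothetical helpers, and placing `chat_template_kwargs` at the top level of the request body is an assumption about how the field is transmitted, not a confirmed detail of the handler.

```python
import os

OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"  # default named in the docs


def resolve_connection(base_url=None, api_key=None, env=None):
    """Apply the documented fallbacks for base URL and API key."""
    env = os.environ if env is None else env
    return {
        "base_url": base_url or OPENAI_DEFAULT_BASE_URL,
        "api_key": api_key or env.get("OPENAI_API_KEY"),
    }


def build_responses_request(model, user_text, stream=True, disable_thinking=False):
    """Build a request body for a /v1/responses call (illustrative only)."""
    payload = {"model": model, "input": user_text, "stream": stream}
    if disable_thinking:
        # Assumption: the field travels at the top level of the request body.
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return payload
```

For a local server such as vLLM, the base URL from the provider table and any non-empty key would be passed explicitly, so neither fallback applies.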
chat-completions — OpenAI Chat Completions API
ChatCompletionsApiModelHandler targets /v1/chat/completions instead of /v1/responses. It reuses all --responses_api_* connection flags (base URL, API key, stream, disable_thinking) and adds one extra argument.
Prefer chat-completions over responses-api when:
- The provider ignores chat_template_kwargs.enable_thinking on the Responses path and needs a reasoning_effort knob to suppress reasoning.
- The server’s Responses-API streaming tool-call path is unreliable (e.g. some vLLM builds), while its Chat Completions tool-call streaming works correctly.
The extra argument sets a provider-specific reasoning level, sent as extra_body={"reasoning_effort": <value>} on the Chat Completions request. Use values like "none" or "low" to disable or reduce reasoning on providers where chat_template_kwargs.enable_thinking has no effect. When unset, the --responses_api_disable_thinking behaviour applies.
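The precedence between the two reasoning controls can be sketched as a small function. This is a hedged illustration: `chat_completions_extra_body` is a hypothetical helper, and carrying `chat_template_kwargs` inside `extra_body` on the Chat Completions path is an assumption; only the `reasoning_effort` form is stated in the documentation.

```python
def chat_completions_extra_body(reasoning_effort=None, disable_thinking=False):
    """Choose the extra request fields that control reasoning.

    An explicit reasoning_effort wins; otherwise the disable-thinking
    behaviour applies; otherwise nothing extra is sent.
    """
    if reasoning_effort is not None:
        return {"reasoning_effort": reasoning_effort}
    if disable_thinking:
        # Assumption: the thinking switch is also sent via extra_body here.
        return {"chat_template_kwargs": {"enable_thinking": False}}
    return {}
```

A provider that ignores the thinking switch would be called with a reasoning effort such as "none", in which case the switch is not sent at all.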
Generation parameter overrides
Any generation parameter can be set with the --llm_gen_<param> prefix for local backends, for example --llm_gen_max_new_tokens or --llm_gen_temperature.
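One way to picture the prefix convention is to strip the prefix from parsed arguments and collect the remainder as generation keyword arguments. The sketch below assumes this is roughly how the prefix is interpreted; `GEN_PREFIX` and `generation_kwargs` are hypothetical names, and the parser declares only a few of the documented flags for illustration.

```python
import argparse

GEN_PREFIX = "llm_gen_"  # prefix described in the documentation


def generation_kwargs(namespace):
    """Collect llm_gen_* arguments into a dict keyed by the bare parameter name."""
    return {
        key[len(GEN_PREFIX):]: value
        for key, value in vars(namespace).items()
        if key.startswith(GEN_PREFIX) and value is not None
    }


# Illustrative parser using flags and defaults from the transformers table.
parser = argparse.ArgumentParser()
parser.add_argument("--llm_gen_max_new_tokens", type=int, default=1024)
parser.add_argument("--llm_gen_temperature", type=float, default=0.0)
parser.add_argument("--llm_device", default="cuda")

args = parser.parse_args(["--llm_gen_max_new_tokens", "256"])
gen_kwargs = generation_kwargs(args)  # --llm_device is excluded: no gen prefix
```

Arguments that share the --llm_ prefix but not the gen_ part, such as the device, are left out of the generation parameters.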