Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/huggingface/speech-to-speech/llms.txt

Use this file to discover all available pages before exploring further.

The LLM stage is the most compute-intensive and highest-latency component of the pipeline. Four backends are supported, selected via --llm_backend. All backends share a set of base arguments for model identity and conversation state; backend-specific tuning flags are namespaced by prefix.
# Local Transformers backend
speech-to-speech --llm_backend transformers --model_name Qwen/Qwen3-4B-Instruct-2507

# MLX (Apple Silicon)
speech-to-speech --llm_backend mlx-lm --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16

# OpenAI Responses API
speech-to-speech --llm_backend responses-api --model_name gpt-4o-mini

# Chat Completions
speech-to-speech --llm_backend chat-completions --model_name Qwen/Qwen3-4B-Instruct-2507 \
    --responses_api_base_url http://localhost:8000/v1
The following fields come from LanguageModelBaseArguments and are available regardless of which --llm_backend is selected. They carry no prefix.
model_name
string
default:"Qwen/Qwen3-4B-Instruct-2507"
The model to load or call. For local backends this is a Hugging Face Hub model ID or local path. For API backends it is the model string sent in the request body (e.g. gpt-4o-mini, deepseek-chat).
speech-to-speech --model_name Qwen/Qwen3-4B-Instruct-2507
user_role
string
default:"user"
The role label assigned to user turns in the chat template. Change this only if the model’s chat template uses a non-standard role name.
speech-to-speech --user_role user
init_chat_role
string
default:"system"
Role label used for the initial system/context message injected at the start of every conversation.
speech-to-speech --init_chat_role system
init_chat_prompt
string
System prompt injected as the first message in every new conversation. Tune this to change the assistant’s persona, response style, or domain focus.
speech-to-speech --init_chat_prompt "You are a concise voice assistant. Keep answers under 15 words."
chat_size
integer
default:"30"
Number of assistant–user turn pairs to retain in the rolling conversation window. Older turns are either evicted or summarized depending on --compact_history.
speech-to-speech --chat_size 20
stream_batch_sentences
integer
default:"3"
Number of sentences to accumulate before yielding a batch to the TTS handler during streaming. Set to 1 for sentence-by-sentence streaming, which reduces first-audio latency at the cost of potentially more TTS artifacts from very short inputs.
speech-to-speech --stream_batch_sentences 1
enable_lang_prompt
boolean
default:"false"
When true, appends an explicit instruction to reply in the detected or specified language (e.g. Please reply to my message in French.) after each user turn. Useful for smaller models that do not reliably pick up the target language from context alone.
speech-to-speech --enable_lang_prompt
compact_history
boolean
default:"true"
When true, the pipeline summarizes older turns in the background once the conversation exceeds --chat_size, instead of synchronously evicting them. This preserves long-range context at the cost of one extra LLM call per compaction cycle.
speech-to-speech --compact_history false

The gen_kwargs pattern for LLM

Any generation parameter not listed above can be passed at runtime with the llm_gen_ prefix:
speech-to-speech --llm_gen_temperature 0.7 --llm_gen_top_p 0.9
These are forwarded as kwargs to the underlying generate() call and accept any parameter supported by the model’s generation config.

Build docs developers (and LLMs) love