
Prerequisites

Start h2oGPT with audio support enabled and models pre-loaded:
python generate.py \
  --enable_stt=True \
  --enable_tts=True \
  --pre_load_image_audio_models=True

Speech to text

POST /v1/audio/transcriptions

Transcribes an audio file to text using Whisper. The request is a multipart/form-data upload.

Request parameters

file
file
required
Audio file to transcribe. Accepted formats include WAV, MP3, and other formats supported by the underlying Whisper installation.
model
string
required
Pass "whisper-1" for compatibility. The server uses its loaded Whisper model regardless of this value.
response_format
string
default:"text"
Output format. Use "text" to receive a plain string.
stream
boolean
default:"true"
When true, partial transcription results are returned as server-sent events. The OpenAI Python client does not expose streaming for transcriptions natively; use httpx directly to receive a streamed response.
chunk
string
default:"interval"
Controls how audio is segmented for streaming. Options: "silence" or "interval". Has no effect when stream=false.

Response

{
  "text": "Good morning. The sun is shining today."
}

Examples

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

with open("speech.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(transcription.text)
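Because the OpenAI Python client does not expose streaming transcriptions, you can call the endpoint with httpx and parse the event stream yourself. The helper below is a sketch: the `data: {"text": ...}` payload shape and the `[DONE]` sentinel are assumptions modeled on OpenAI-style SSE conventions, not confirmed behavior of this server.

```python
import json

def parse_transcription_events(lines):
    """Collect partial transcription text from server-sent event lines.

    Assumes each event looks like 'data: {"text": "..."}' and the stream
    ends with 'data: [DONE]' (OpenAI-style SSE; the shape is an assumption).
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        parts.append(json.loads(payload).get("text", ""))
    return parts

# Feed it lines from a streamed response, e.g.:
#   with httpx.Client(timeout=None) as http_client:
#       with http_client.stream(
#           "POST", "http://localhost:5000/v1/audio/transcriptions",
#           files={"file": open("speech.wav", "rb")},
#           data={"model": "whisper-1", "stream": "true"},
#       ) as r:
#           print("".join(parse_transcription_events(r.iter_lines())))
```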

Text to speech

POST /v1/audio/speech

Converts input text to audio using Coqui TTS or Microsoft TTS.

Request parameters

model
string
default:""
Pass "tts-1" for compatibility. The server uses its loaded TTS model.
input
string
required
The text to synthesize.
voice
string
default:""
If set, overrides both chatbot_role and speaker. Native OpenAI voice names are mapped to the server's defaults. Leave empty to rely on chatbot_role and speaker.
response_format
string
default:"wav"
Audio format of the response. Options: "wav", "mp3", "opus", "aac", "flac", "pcm".
stream
boolean
default:"true"
When true, audio is returned as a stream, one chunk (sentence) at a time. When false, the entire file is generated before returning.
stream_strip
boolean
default:"true"
When true and stream=true, WAV headers are stripped from all chunks after the first so the stream is a contiguous audio byte sequence. When false, each chunk is a valid standalone WAV file.
chatbot_role
string
default:"Female AI Assistant"
TTS role for Coqui TTS.
speaker
string
default:"SLT (female)"
Speaker for Microsoft TTS.

Response

Binary audio data in the requested format. The Content-Type header is set to audio/<response_format>. For streaming WAV, the server deliberately inflates the duration reported in the WAV header so that players keep reading until the stream actually ends.
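If you save a stream_strip=true stream to a single file, the first chunk's header still carries that inflated size. The sketch below rewrites the size fields to match the real payload; it assumes the canonical 44-byte PCM WAV header (fmt chunk immediately followed by the data chunk), which may not hold for every encoder.

```python
import struct

def patch_wav_sizes(wav: bytes) -> bytes:
    """Rewrite the RIFF and data chunk sizes of a canonical 44-byte-header
    WAV buffer so they match the actual payload length.

    Assumes the standard PCM layout: 'RIFF' size 'WAVE', a 16-byte 'fmt '
    chunk, then the 'data' chunk at offset 36 (its size field at offset 40).
    """
    assert wav[:4] == b"RIFF" and wav[8:12] == b"WAVE"
    buf = bytearray(wav)
    struct.pack_into("<I", buf, 4, len(buf) - 8)    # RIFF chunk size
    struct.pack_into("<I", buf, 40, len(buf) - 44)  # data chunk size
    return bytes(buf)
```

Concatenate the streamed chunks, run the result through this once, and the file's reported duration matches the audio that was actually delivered.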

Examples

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="",
    extra_body=dict(
        stream=True,
        chatbot_role="Female AI Assistant",
        speaker="SLT (female)",
        stream_strip=True,
    ),
    response_format="wav",
    input="Good morning! The sun is shining brilliantly today.",
) as response:
    response.stream_to_file("speech.wav")

Real-time playback with httpx and pygame

For applications that need to play audio while it is still being generated:
import io

import httpx
import pygame
from pydub import AudioSegment

pygame.mixer.init(frequency=16000, size=-16, channels=1)
sound_queue = []

def play_audio(audio_bytes: bytes):
    # With stream_strip=false each chunk is a standalone WAV file, so
    # parse the WAV header instead of treating the bytes as raw PCM.
    segment = AudioSegment.from_file(io.BytesIO(audio_bytes), format="wav")
    sound = pygame.mixer.Sound(io.BytesIO(segment.raw_data))
    sound_queue.append(sound)
    sound.play()
    pygame.time.wait(int(sound.get_length() * 1000))

headers = {
    "Authorization": "Bearer EMPTY",
    "Content-Type": "application/json",
}
data = {
    "model": "tts-1",
    "voice": "SLT (female)",
    "input": "Good morning! The sun is shining brilliantly today.",
    "stream": True,
    "stream_strip": False,
}

with httpx.Client(timeout=None) as http_client:
    with http_client.stream(
        "POST",
        "http://localhost:5000/v1/audio/speech",
        headers=headers,
        json=data,
    ) as response:
        chunk_riff = b""
        for chunk in response.iter_bytes():
            if chunk.startswith(b"RIFF"):
                if chunk_riff:
                    play_audio(chunk_riff)
                chunk_riff = chunk
            else:
                chunk_riff += chunk
        if chunk_riff:
            play_audio(chunk_riff)

for sound in sound_queue:
    sound.stop()
pygame.quit()
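The chunk-reassembly loop above can be factored into a reusable generator. Each yielded item is one complete WAV file (valid standalone audio when stream_strip=false), grouped by the b"RIFF" marker that starts every WAV header:

```python
def split_wav_chunks(byte_chunks):
    """Group raw HTTP byte chunks into complete standalone WAV files.

    Mirrors the RIFF-boundary logic in the example above: a chunk that
    begins with b"RIFF" starts a new file; any other chunk extends the
    current one.
    """
    current = b""
    for chunk in byte_chunks:
        if chunk.startswith(b"RIFF"):
            if current:
                yield current
            current = chunk
        else:
            current += chunk
    if current:
        yield current
```

With this, the playback loop reduces to `for wav in split_wav_chunks(response.iter_bytes()): play_audio(wav)`.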
