Overview
Streaming allows you to receive chat completion responses incrementally as they are generated, rather than waiting for the complete response. This is useful for providing real-time feedback to users and reducing perceived latency.

Enable streaming
To enable streaming, set stream=True when calling client.chat.completions.create():
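A minimal sketch using the official openai Python SDK (the model name is a placeholder, and an OPENAI_API_KEY environment variable is assumed):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)

# Each iteration yields a ChatCompletionChunk as soon as it arrives.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```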
Response structure
When streaming is enabled, the API returns a Stream[ChatCompletionChunk] instead of a ChatCompletion object.
ChatCompletionChunk fields
id: A unique identifier for the chat completion. Each chunk has the same ID.
object: The object type, which is always chat.completion.chunk.
created: The Unix timestamp (in seconds) of when the chat completion was created. Each chunk has the same timestamp.
model: The model used to generate the completion.
choices: A list of chat completion choices. Can contain more than one element if n is greater than 1. Can also be empty for the last chunk if you set stream_options: {"include_usage": true}. Each choice contains:
- index: The index of this choice
- delta: The incremental message content, with role and/or content
- finish_reason: "stop" | "length" | "tool_calls" | "content_filter" | null
- logprobs: Log probability information (if requested)
usage: Usage statistics for the completion request. Only present in the final chunk when stream_options: {"include_usage": true} is set. Fields:
- prompt_tokens: Number of tokens in the prompt
- completion_tokens: Number of tokens in the completion
- total_tokens: Total tokens used
- completion_tokens_details: Breakdown of completion tokens
- prompt_tokens_details: Breakdown of prompt tokens
system_fingerprint: This fingerprint represents the backend configuration that the model runs with. Can be used in conjunction with the seed request parameter to understand when backend changes have been made that might impact determinism.
service_tier: The processing type used for serving the request.
Stream options
You can configure streaming behavior using the stream_options parameter:
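For example (a sketch using the openai Python SDK; the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    elif chunk.usage:
        # Final chunk: choices is empty and usage is populated.
        print(f"\n{chunk.usage.total_tokens} tokens used")
```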
include_usage: When set to true, the final chunk will include usage statistics in the usage field.

Async streaming
The async client provides the same streaming interface using async for:
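A sketch with AsyncOpenAI (same placeholder model name and API-key assumption as above):

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def main() -> None:
    stream = await client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,
    )
    # AsyncStream supports async iteration.
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()


asyncio.run(main())
```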
Handling deltas
Each chunk contains a delta object instead of a complete message. The delta represents the incremental changes from the previous chunk:
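Typically the first delta carries only the role, later deltas carry content fragments, and the final delta is empty with finish_reason set. A mock-driven sketch (the SimpleNamespace objects below stand in for real delta objects, for illustration only):

```python
from types import SimpleNamespace as NS

# Stand-ins for the delta objects of successive chunks.
deltas = [
    NS(role="assistant", content=None),  # first chunk: role, no content
    NS(role=None, content="Hello"),      # middle chunks: content fragments
    NS(role=None, content=" world"),
    NS(role=None, content=None),         # last chunk: empty delta
]

# Collect only the non-None content fragments, in arrival order.
pieces = [d.content for d in deltas if d.content is not None]
print("".join(pieces))  # Hello world
```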
Complete example with accumulation
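A sketch of an accumulator that folds any iterable of chunks into a message-like dict. It works on a real Stream as well; the SimpleNamespace chunks below are stand-ins used only to illustrate the wire order:

```python
from types import SimpleNamespace as NS


def accumulate(stream):
    """Fold streamed chunks into a single message-like dict."""
    message = {"role": None, "content": "", "finish_reason": None}
    for chunk in stream:
        if not chunk.choices:  # usage-only final chunk may have no choices
            continue
        choice = chunk.choices[0]
        if choice.delta.role is not None:
            message["role"] = choice.delta.role
        if choice.delta.content is not None:
            message["content"] += choice.delta.content
        if choice.finish_reason is not None:
            message["finish_reason"] = choice.finish_reason
    return message


# Stand-in chunks: role first, then content pieces, then the finish reason.
demo = [
    NS(choices=[NS(delta=NS(role="assistant", content=None), finish_reason=None)]),
    NS(choices=[NS(delta=NS(role=None, content="Stream"), finish_reason=None)]),
    NS(choices=[NS(delta=NS(role=None, content="ing works."), finish_reason=None)]),
    NS(choices=[NS(delta=NS(role=None, content=None), finish_reason="stop")]),
    NS(choices=[]),  # final usage chunk when include_usage is set
]

print(accumulate(demo))
# {'role': 'assistant', 'content': 'Streaming works.', 'finish_reason': 'stop'}
```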
Accumulating the streamed deltas reconstructs the same message that a non-streaming call would have returned in one piece.

Streaming with function calls
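A sketch of accumulating streamed tool-call fragments (mock objects for illustration; real tool-call deltas expose the same index, id, function.name, and function.arguments fields). The arguments string is concatenated per index and parsed only once the stream ends:

```python
import json
from types import SimpleNamespace as NS

# Stand-ins for delta.tool_calls across successive chunks: the call id and
# function name arrive once, then the JSON arguments arrive as fragments.
tool_call_deltas = [
    [NS(index=0, id="call_abc", function=NS(name="get_weather", arguments=""))],
    [NS(index=0, id=None, function=NS(name=None, arguments='{"city": '))],
    [NS(index=0, id=None, function=NS(name=None, arguments='"Paris"}'))],
]

calls = {}  # index -> accumulated call
for fragments in tool_call_deltas:
    for tc in fragments:
        call = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
        if tc.id is not None:
            call["id"] = tc.id
        if tc.function.name is not None:
            call["name"] = tc.function.name
        call["arguments"] += tc.function.arguments or ""

# Parse only after the stream is complete; until then the JSON is partial.
args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args)  # get_weather {'city': 'Paris'}
```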
When using tools/functions, the tool calls are also streamed incrementally: each delta can carry tool_calls fragments whose JSON arguments arrive as partial strings.

Error handling
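A sketch using the SDK's exception types (openai.APIConnectionError and openai.APIError are the v1 Python SDK's exception classes; the model name is a placeholder):

```python
import openai
from openai import OpenAI

client = OpenAI()

try:
    stream = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
except openai.APIConnectionError as exc:
    # The connection dropped, possibly mid-stream after partial output.
    print(f"\nConnection error: {exc}")
except openai.APIError as exc:
    # Any other API-level failure (rate limits, server errors, ...).
    print(f"\nAPI error: {exc}")
```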
Handle errors during streaming with try-except; network interruptions and API errors can occur mid-iteration, after some chunks have already been delivered.

Parameters
All parameters from the completions endpoint are supported when streaming. The only difference is setting stream=True.
stream: Set to true to enable streaming responses via Server-Sent Events.

Best practices
- Flush output: Use flush=True with print() or flush output streams regularly to ensure content appears immediately
- Handle partial JSON: When streaming tool calls, arguments may arrive in multiple chunks; accumulate them before parsing
- Check finish_reason: Always check the finish_reason to understand why the stream ended
- Include usage stats: Set stream_options: {"include_usage": true} if you need token usage information
- Error handling: Wrap streaming code in try-except blocks to handle network errors gracefully