Overview
LiteLLM provides full support for OpenAI’s models including GPT-4o, O1, O3-mini, and more. You can use all OpenAI features including streaming, function calling, vision, audio, and batch processing.
Quick Start
Set API Key
export OPENAI_API_KEY="sk-..."
Make Your First Call
from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
Supported Models
GPT-4o
O-Series (Reasoning)
GPT-4 Turbo
GPT-3.5
Latest and most capable GPT-4 models with optimized performance.
# GPT-4o - Best overall model
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Explain quantum computing" }]
)
# GPT-4o-mini - Fast and cost-effective
response = completion(
model = "openai/gpt-4o-mini" ,
messages = [{ "role" : "user" , "content" : "Summarize this text" }]
)
# GPT-4o with vision
response = completion(
model = "openai/gpt-4o" ,
messages = [{
"role" : "user" ,
"content" : [
{ "type" : "text" , "text" : "What's in this image?" },
{ "type" : "image_url" , "image_url" : { "url" : "https://..." }}
]
}]
)
Advanced reasoning models for complex problem-solving.
# O1 - Advanced reasoning
response = completion(
model = "openai/o1" ,
messages = [{ "role" : "user" , "content" : "Solve this complex math problem..." }]
)
# O1-mini - Efficient reasoning
response = completion(
model = "openai/o1-mini" ,
messages = [{ "role" : "user" , "content" : "Analyze this code..." }]
)
# O3-mini - Latest reasoning model
response = completion(
model = "openai/o3-mini" ,
messages = [{ "role" : "user" , "content" : "Debug this algorithm..." }]
)
# Control reasoning effort
response = completion(
model = "openai/o1" ,
messages = [{ "role" : "user" , "content" : "Complex problem..." }],
reasoning_effort = "high" # low, medium, high
)
Previous generation GPT-4 models.
# GPT-4 Turbo
response = completion(
model = "openai/gpt-4-turbo" ,
messages = [{ "role" : "user" , "content" : "Write an essay" }]
)
# GPT-4 Turbo with vision
response = completion(
model = "openai/gpt-4-turbo-2024-04-09" ,
messages = [{ "role" : "user" , "content" : "Analyze this chart" }]
)
Fast and cost-effective for simpler tasks.
response = completion(
model = "openai/gpt-3.5-turbo" ,
messages = [{ "role" : "user" , "content" : "Quick question..." }]
)
Authentication
Environment Variable
Direct Parameter
Custom Base URL
Organization ID
Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="sk-..."
from litellm import completion
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Hello!" }]
)
Pass the API key directly:
from litellm import completion
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Hello!" }],
api_key = "sk-..."
)
Use a custom OpenAI-compatible endpoint:
from litellm import completion
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Hello!" }],
api_base = "https://custom-openai-endpoint.com/v1" ,
api_key = "sk-..."
)
Specify an organization for billing:
from litellm import completion
import os
os.environ[ "OPENAI_ORGANIZATION" ] = "org-..."
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Hello!" }]
)
Streaming
Get real-time responses as they’re generated:
from litellm import completion
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Write a long story" }],
stream = True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Async Streaming
from litellm import acompletion
import asyncio
async def stream_response():
    response = await acompletion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Write a story"}],
        stream=True,
    )
    # Print each text delta as soon as it arrives
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(stream_response())
Function Calling
OpenAI models support sophisticated function/tool calling:
Basic Function Call
Parallel Function Calls
Force Tool Usage
from litellm import completion
tools = [{
"type" : "function" ,
"function" : {
"name" : "get_current_weather" ,
"description" : "Get the current weather in a location" ,
"parameters" : {
"type" : "object" ,
"properties" : {
"location" : {
"type" : "string" ,
"description" : "City and state, e.g. San Francisco, CA"
},
"unit" : {
"type" : "string" ,
"enum" : [ "celsius" , "fahrenheit" ]
}
},
"required" : [ "location" ]
}
}
}]
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "What's the weather in Boston?" }],
tools = tools
)
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
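The model only names a function and supplies its arguments as a JSON string; running the function is left to the application. One possible way to dispatch a tool call is sketched below. `AVAILABLE_TOOLS`, `run_tool_call`, the local `get_current_weather` body, and its fixed return value are all hypothetical, and a `SimpleNamespace` stands in for the real tool call object so the sketch runs offline.

```python
import json
from types import SimpleNamespace


def get_current_weather(location, unit="fahrenheit"):
    # Hypothetical local implementation; a real one would query a weather service.
    return {"location": location, "temperature": 72, "unit": unit}


# Map the tool names advertised to the model onto local Python callables.
AVAILABLE_TOOLS = {"get_current_weather": get_current_weather}


def run_tool_call(tool_call):
    """Decode the JSON argument string and invoke the matching local function."""
    func = AVAILABLE_TOOLS.get(tool_call.function.name)
    if func is None:
        raise ValueError(f"Unknown tool: {tool_call.function.name}")
    kwargs = json.loads(tool_call.function.arguments)
    return func(**kwargs)


# Stand-in for response.choices[0].message.tool_calls[0].
fake_call = SimpleNamespace(
    function=SimpleNamespace(
        name="get_current_weather",
        arguments='{"location": "Boston, MA", "unit": "celsius"}',
    )
)
result = run_tool_call(fake_call)
print(result)
```

Looking the name up in an explicit dictionary, rather than calling whatever name the model returns, keeps the set of callable functions under the application's control.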
Vision (Multimodal)
GPT-4o and GPT-4 Turbo support image inputs:
Image URL
Base64 Image
Multiple Images
Image Detail Level
response = completion(
model = "openai/gpt-4o" ,
messages = [{
"role" : "user" ,
"content" : [
{ "type" : "text" , "text" : "What's in this image?" },
{
"type" : "image_url" ,
"image_url" : {
"url" : "https://example.com/image.jpg"
}
}
]
}]
)
import base64
from litellm import completion
# Read the image and encode it as base64 text
with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")
response = completion(
model = "openai/gpt-4o" ,
messages = [{
"role" : "user" ,
"content" : [
{ "type" : "text" , "text" : "Describe this image" },
{
"type" : "image_url" ,
"image_url" : {
"url" : f"data:image/jpeg;base64,{base64_image}"
}
}
]
}]
)
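The Base64 example above hardcodes `image/jpeg` in the data URL. When images may arrive in several formats, the MIME type can be derived from the file name instead. This is a small sketch using only the standard library; `to_data_url` is a hypothetical helper, not part of LiteLLM.

```python
import base64
import mimetypes


def to_data_url(filename, raw_bytes):
    """Build a data URL, guessing the MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image type for {filename!r}")
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    # No spaces are allowed around the comma that separates header and payload.
    return f"data:{mime_type};base64,{encoded}"


url = to_data_url("chart.png", b"\x89PNG")
print(url)
```

The returned string can be used as the `"url"` value of an `image_url` content part.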
response = completion(
model = "openai/gpt-4o" ,
messages = [{
"role" : "user" ,
"content" : [
{ "type" : "text" , "text" : "Compare these images" },
{ "type" : "image_url" , "image_url" : { "url" : "https://..." }},
{ "type" : "image_url" , "image_url" : { "url" : "https://..." }}
]
}]
)
response = completion(
model = "openai/gpt-4o" ,
messages = [{
"role" : "user" ,
"content" : [
{ "type" : "text" , "text" : "Analyze this in detail" },
{
"type" : "image_url" ,
"image_url" : {
"url" : "https://..." ,
"detail" : "high" # low, high, or auto
}
}
]
}]
)
JSON Mode
Force models to return valid JSON:
JSON Mode
Structured Output
response = completion(
model = "openai/gpt-4o" ,
messages = [{
"role" : "user" ,
"content" : "Extract info as JSON: John is 30 years old and lives in NYC"
}],
response_format = { "type" : "json_object" }
)
import json
data = json.loads(response.choices[ 0 ].message.content)
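Parsing can still fail in practice, for example when a reply is cut off by a token limit, so it may be worth guarding the `json.loads` call. A minimal sketch follows; `parse_json_reply` is a hypothetical helper and the sample strings stand in for `response.choices[0].message.content`.

```python
import json


def parse_json_reply(content):
    """Return the decoded object, or None if the reply is not a JSON object."""
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed or truncated text; TypeError: content is None.
        return None
    return data if isinstance(data, dict) else None


good = parse_json_reply('{"name": "John", "age": 30, "city": "NYC"}')
truncated = parse_json_reply('{"name": "John", "age": 3')
print(good, truncated)
```

Returning `None` keeps the caller's control flow simple; an application that needs the reason for the failure could re-raise instead.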
Advanced Features
Seed for Reproducibility
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Tell me a joke" }],
seed = 123 , # Same seed + inputs = similar outputs
temperature = 0.7
)
Logprobs
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Say 'hello'" }],
logprobs = True ,
top_logprobs = 3 # Return top 3 token probabilities
)
for token in response.choices[0].logprobs.content:
    print(f"Token: {token.token}, Logprob: {token.logprob}")
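A logprob is the natural logarithm of a token's probability, so exponentiating it recovers the probability itself. The helper below is a hypothetical convenience for display purposes and is not part of LiteLLM.

```python
import math


def logprob_to_percent(logprob):
    """Convert a natural-log probability into a percentage rounded to 2 places."""
    return round(math.exp(logprob) * 100, 2)


# A logprob of 0.0 means the model assigned the token probability 1.
certain = logprob_to_percent(0.0)
coin_flip = logprob_to_percent(math.log(0.5))
print(certain, coin_flip)
```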
Max Tokens and Stop Sequences
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Write a story" }],
max_tokens = 500 , # Limit output length
stop = ["\n\n", "The End"] # Stop at these sequences
)
Temperature and Top P
# More creative (temperature)
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Write a poem" }],
temperature = 1.5 # 0 = deterministic, 2 = very random
)
# Nucleus sampling (top_p)
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Generate text" }],
top_p = 0.9 # Consider tokens in top 90% probability mass
)
Embeddings
Generate text embeddings for semantic search and clustering:
from litellm import embedding
# Single text
response = embedding(
model = "openai/text-embedding-3-large" ,
input = "Hello world"
)
print (response.data[ 0 ].embedding) # List of floats
# Multiple texts
response = embedding(
model = "openai/text-embedding-3-small" ,
input = [ "Text 1" , "Text 2" , "Text 3" ]
)
for item in response.data:
    print(f"Index {item.index}: {len(item.embedding)} dimensions")
# Specify dimensions (3-large and 3-small support this)
response = embedding(
model = "openai/text-embedding-3-large" ,
input = "Hello world" ,
dimensions = 256 # Reduce from default 3072
)
Available Embedding Models
Model                     Dimensions       Use Case
text-embedding-3-large    3072 (default)   Best performance
text-embedding-3-small    1536 (default)   Good balance
text-embedding-ada-002    1536             Legacy model
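For semantic search, embeddings are commonly compared with cosine similarity. The sketch below ranks documents against a query using only the standard library. `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the toy three-dimensional vectors stand in for the `response.data[i].embedding` lists returned above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, docs)
print(ranking)
```

For large collections a vector index or a NumPy-based implementation would likely be faster; the pure-Python version is only meant to show the arithmetic.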
Batch Processing
Process large volumes of requests asynchronously:
from litellm import create_batch, retrieve_batch
# Create a batch job
batch = create_batch(
custom_llm_provider = "openai" ,
input_file_id = "file-abc123" , # Upload file first
endpoint = "/v1/chat/completions" ,
completion_window = "24h"
)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")
# Check batch status
batch_status = retrieve_batch(
custom_llm_provider = "openai" ,
batch_id = batch.id
)
print(f"Completed: {batch_status.request_counts.completed}")
print(f"Failed: {batch_status.request_counts.failed}")
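The `input_file_id` above refers to a previously uploaded JSONL file in which each line describes one request. The sketch below shows how such a file's contents might be assembled, following OpenAI's Batch API input format. `build_batch_lines` and the `request-N` ID scheme are hypothetical, and the unprefixed model name `gpt-4o` assumes the file is forwarded to OpenAI unchanged.

```python
import json


def build_batch_lines(prompts, model="gpt-4o"):
    """Return one JSON line per prompt in the Batch API input format."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # hypothetical scheme; must be unique per line
            "method": "POST",
            "url": "/v1/chat/completions",  # same endpoint as passed to create_batch
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return lines


batch_lines = build_batch_lines(["Hello!", "Summarize this text"])
jsonl_text = "\n".join(batch_lines)
print(jsonl_text)
```

The `custom_id` is what lets each result in the batch output be matched back to its request, since results are not guaranteed to come back in input order.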
Error Handling
from litellm import completion
from litellm.exceptions import (
AuthenticationError,
RateLimitError,
ContextWindowExceededError,
APIError
)
try:
    response = completion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded - retry later")
except ContextWindowExceededError:
    print("Message too long - reduce input size")
except APIError as e:
    print(f"API error: {e}")
Cost Tracking
from litellm import completion, completion_cost
response = completion(
model = "openai/gpt-4o" ,
messages = [{ "role" : "user" , "content" : "Hello!" }]
)
# Calculate cost
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
# Response includes token usage
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
Best Practices
Use GPT-4o-mini First: Start with gpt-4o-mini for testing; it’s fast and cost-effective. Upgrade to gpt-4o when you need maximum quality.
Set Max Tokens: Always set max_tokens to prevent unexpectedly long (and expensive) responses.
Use Streaming: Enable streaming for a better user experience in interactive applications.
Handle Rate Limits: Implement exponential backoff when handling RateLimitError exceptions.
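The exponential backoff recommended for rate limits can be sketched as a small retry wrapper. `retry_with_backoff` and its parameters are hypothetical names rather than a LiteLLM API, and `FakeRateLimitError` stands in for `litellm.exceptions.RateLimitError` so the demo runs without network access.

```python
import random
import time


def retry_with_backoff(call, retry_on, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retry_on exception wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; jitter spreads out simultaneous clients.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


class FakeRateLimitError(Exception):
    """Stand-in for litellm.exceptions.RateLimitError."""


attempts = {"count": 0}


def flaky_call():
    # Fails twice, then succeeds, to exercise the retry path.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_call, FakeRateLimitError, sleep=delays.append)
print(result, len(delays))
```

In an application, `call` would typically be a lambda wrapping `completion(...)` and `retry_on` would be `RateLimitError`; the `sleep` parameter exists only so the waiting can be swapped out in tests.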
Streaming: Learn more about streaming responses
Function Calling: Deep dive into function calling
Vision: Working with images and vision models
Embeddings: Guide to embeddings and semantic search