OpenAI

Overview

LiteLLM supports OpenAI’s models, including GPT-4o, O1, and O3-mini. Supported features include streaming, function calling, vision, audio, and batch processing.

Quick Start

Install LiteLLM

pip install litellm

Set API Key

export OPENAI_API_KEY="sk-..."

Make Your First Call

from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.choices[0].message.content)

Supported Models

GPT-4o
O-Series (Reasoning)
GPT-4 Turbo
GPT-3.5

GPT-4o

Latest and most capable GPT-4 models with optimized performance.

# GPT-4o - Best overall model
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# GPT-4o-mini - Fast and cost-effective
response = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this text"}]
)

# GPT-4o with vision
response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

O-Series (Reasoning)

Advanced reasoning models for complex problem-solving.

# O1 - Advanced reasoning
response = completion(
    model="openai/o1",
    messages=[{"role": "user", "content": "Solve this complex math problem..."}]
)

# O1-mini - Efficient reasoning
response = completion(
    model="openai/o1-mini",
    messages=[{"role": "user", "content": "Analyze this code..."}]
)

# O3-mini - Latest reasoning model
response = completion(
    model="openai/o3-mini",
    messages=[{"role": "user", "content": "Debug this algorithm..."}]
)

# Control reasoning effort
response = completion(
    model="openai/o1",
    messages=[{"role": "user", "content": "Complex problem..."}],
    reasoning_effort="high"  # low, medium, high
)

GPT-4 Turbo

Previous generation GPT-4 models.

# GPT-4 Turbo
response = completion(
    model="openai/gpt-4-turbo",
    messages=[{"role": "user", "content": "Write an essay"}]
)

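# GPT-4 Turbo with vision (the image is passed as a content part)
response = completion(
    model="openai/gpt-4-turbo-2024-04-09",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Analyze this chart"},
        {"type": "image_url", "image_url": {"url": "https://..."}}
    ]}]
)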

GPT-3.5

Fast and cost-effective for simpler tasks.

response = completion(
    model="openai/gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Quick question..."}]
)

Authentication

Environment Variable
Direct Parameter
Custom Base URL
Organization ID

Set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="sk-..."

from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Pass the API key directly:

from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="sk-..."
)

Use a custom OpenAI-compatible endpoint:

from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://custom-openai-endpoint.com/v1",
    api_key="sk-..."
)

Specify an organization for billing:

from litellm import completion
import os

os.environ["OPENAI_ORGANIZATION"] = "org-..."

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Streaming

Get real-time responses as they’re generated:

from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a long story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
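
When the complete text is also needed after streaming (for logging or storage, say), the deltas can be collected as they arrive. The following is a minimal sketch that uses stand-in chunk objects instead of a live response so that it runs offline; `collect_stream` and `fake_chunk` are illustrative names, not LiteLLM APIs.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Join the text deltas of a streamed response into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry no text
            parts.append(delta)
            if on_text is not None:
                on_text(delta)  # e.g. print the piece as it arrives
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in with the same attribute shape as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [fake_chunk("Once "), fake_chunk("upon a time"), fake_chunk(None)]
full_text = collect_stream(chunks)
print(full_text)
```

With a real call, the object returned by `completion(..., stream=True)` would be passed in place of `chunks`.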

Async Streaming

from litellm import acompletion
import asyncio

async def stream_response():
    response = await acompletion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Write a story"}],
        stream=True
    )
    
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(stream_response())

Function Calling

OpenAI models can request calls to functions (tools) that you define:

from litellm import completion

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
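
The model only names the function and supplies its arguments; running it is left to the caller. A minimal sketch of that step follows, assuming a hypothetical local `get_current_weather` implementation and a stand-in object in place of a real `tool_call`. The `AVAILABLE_TOOLS` registry and `run_tool_call` helper are illustrative names.

```python
import json
from types import SimpleNamespace


def get_current_weather(location, unit="fahrenheit"):
    """Hypothetical local implementation; returns canned data."""
    return {"location": location, "temperature": 72, "unit": unit}


# Map tool names (as declared in `tools`) to local callables.
AVAILABLE_TOOLS = {"get_current_weather": get_current_weather}


def run_tool_call(tool_call):
    """Execute one tool call and build the follow-up `tool` message."""
    func = AVAILABLE_TOOLS[tool_call.function.name]
    # Arguments arrive as a JSON string, not a dict.
    args = json.loads(tool_call.function.arguments)
    result = func(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


# Stand-in for response.choices[0].message.tool_calls[0]
example_call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(
        name="get_current_weather",
        arguments='{"location": "Boston, MA"}',
    ),
)
tool_message = run_tool_call(example_call)
print(tool_message)
```

To let the model use the result, the assistant message containing the tool call and this `tool` message are typically appended to `messages` before calling `completion` again.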

Vision (Multimodal)

GPT-4o and GPT-4 Turbo support image inputs:

Image URL
Base64 Image
Multiple Images
Image Detail Level

# Image URL: reference a publicly reachable image
response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg"
                }
            }
        ]
    }]
)

# Base64 image: embed a local file as a data URL
import base64

with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }]
)
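
The Base64 example hardcodes `image/jpeg` in the data URL. For mixed file types, the MIME type could be derived from the file name instead. This is a sketch using only the standard library; `to_data_url` is a hypothetical helper, not part of LiteLLM.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL, guessing the MIME type from the name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {filename}")
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes; a real call would pass the contents of the image file.
url = to_data_url(b"\x89PNG\r\n", "chart.png")
print(url.split(",")[0])
```

The returned string can be used as the `url` value of an `image_url` content part.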

# Multiple images in a single message
response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these images"},
            {"type": "image_url", "image_url": {"url": "https://..."}},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

# Image detail level: higher detail uses more tokens
response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this in detail"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://...",
                    "detail": "high"  # low, high, or auto
                }
            }
        ]
    }]
)

JSON Mode

Force the model to return syntactically valid JSON. The messages must mention JSON explicitly, otherwise the OpenAI API rejects the request:

import json

response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": "Extract the name, age, and city as JSON: John is 30 years old and lives in NYC"
    }],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
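
JSON mode does not guarantee a complete document: if generation stops at the token limit, the output can be cut off mid-object. A defensive parse might look like the sketch below; `parse_json_reply` is a hypothetical helper, and the strings stand in for `response.choices[0].message.content` and `response.choices[0].finish_reason`.

```python
import json


def parse_json_reply(content, finish_reason):
    """Return the parsed object, or None if the reply is unusable."""
    if finish_reason == "length":
        # Output hit the token limit, so the JSON is probably cut off.
        return None
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # Malformed text, or no content at all (None).
        return None


print(parse_json_reply('{"name": "John", "age": 30, "city": "NYC"}', "stop"))
print(parse_json_reply('{"name": "Jo', "length"))
```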

Advanced Features

Seed for Reproducibility

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    seed=123,  # Same seed + inputs = similar outputs
    temperature=0.7
)

Logprobs

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Say 'hello'"}],
    logprobs=True,
    top_logprobs=3  # Return top 3 token probabilities
)

for token in response.choices[0].logprobs.content:
    print(f"Token: {token.token}, Logprob: {token.logprob}")
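
Log probabilities are natural logarithms, so exponentiating one recovers the probability. A small sketch of the conversion, with made-up logprob values rather than real model output:

```python
import math


def logprob_to_percent(logprob: float) -> float:
    """Convert a natural-log probability to a percentage."""
    return math.exp(logprob) * 100


# Illustrative values only; real ones come from token.logprob.
for lp in (0.0, -0.1, -2.3):
    print(f"logprob {lp:>5} -> {logprob_to_percent(lp):.1f}%")
```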

Max Tokens and Stop Sequences

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a story"}],
    max_tokens=500,  # Limit output length
    stop=["\n\n", "The End"]  # Stop at these sequences
)

Temperature and Top P

# More creative (temperature)
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=1.5  # 0 = near-deterministic, 2 = very random
)

# Nucleus sampling (top_p)
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Generate text"}],
    top_p=0.9  # Consider tokens in top 90% probability mass
)
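
To build intuition for these two parameters, the sketch below applies temperature scaling and a nucleus cutoff to a made-up set of token scores. It is a textbook illustration of the two ideas, not OpenAI's sampling implementation. The function names and logits are assumptions for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0 here
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest set of tokens whose probabilities sum to >= top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], nucleus(probs, 0.9))
```

Lower temperatures concentrate probability on the top token, while a smaller `top_p` shrinks the set of tokens eligible for sampling.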

Embeddings

Generate text embeddings for semantic search and clustering:

from litellm import embedding

# Single text
response = embedding(
    model="openai/text-embedding-3-large",
    input="Hello world"
)
print(response.data[0].embedding)  # List of floats

# Multiple texts
response = embedding(
    model="openai/text-embedding-3-small",
    input=["Text 1", "Text 2", "Text 3"]
)

for item in response.data:
    print(f"Index {item.index}: {len(item.embedding)} dimensions")

# Specify dimensions (3-large and 3-small support this)
response = embedding(
    model="openai/text-embedding-3-large",
    input="Hello world",
    dimensions=256  # Reduce from default 3072
)
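
For semantic search, embeddings are usually compared with cosine similarity. The sketch below ranks documents against a query using toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `rank_documents` are illustrative helpers, not LiteLLM functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for response.data[i].embedding values.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query, docs))
```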

Available Embedding Models

Model	Dimensions	Use Case
`text-embedding-3-large`	3072 (default)	Best performance
`text-embedding-3-small`	1536 (default)	Good balance
`text-embedding-ada-002`	1536	Legacy model

Batch Processing

Process large volumes of requests asynchronously:

from litellm import create_batch, retrieve_batch

# Create a batch job
batch = create_batch(
    custom_llm_provider="openai",
    input_file_id="file-abc123",  # Upload file first
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")

# Check batch status
batch_status = retrieve_batch(
    custom_llm_provider="openai",
    batch_id=batch.id
)

print(f"Completed: {batch_status.request_counts.completed}")
print(f"Failed: {batch_status.request_counts.failed}")
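
The `input_file_id` refers to a previously uploaded JSONL file in which each line is one request. A sketch of building such a file in OpenAI's batch input format follows. The prompts, the `custom_id` scheme, and the file name `batch_input.jsonl` are assumptions for illustration.

```python
import json

prompts = ["Summarize the plot of Hamlet", "Translate 'hello' to French"]

lines = []
for i, prompt in enumerate(prompts):
    request = {
        "custom_id": f"request-{i}",  # your own ID, used to match results later
        "method": "POST",
        "url": "/v1/chat/completions",  # must match the batch endpoint
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    lines.append(json.dumps(request))

# One JSON object per line, newline-terminated.
jsonl_text = "\n".join(lines) + "\n"

with open("batch_input.jsonl", "w", encoding="utf-8") as f:
    f.write(jsonl_text)
```

The file is then uploaded with purpose `batch`, and the returned file ID is passed as `input_file_id`. LiteLLM appears to expose a `create_file` helper for the upload step, but check the current signature before relying on it.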

Error Handling

from litellm import completion
from litellm.exceptions import (
    AuthenticationError,
    RateLimitError,
    ContextWindowExceededError,
    APIError
)

try:
    response = completion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded - retry later")
except ContextWindowExceededError:
    print("Message too long - reduce input size")
except APIError as e:
    print(f"API error: {e}")

Cost Tracking

from litellm import completion, completion_cost

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Calculate cost
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

# Response includes token usage
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
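
As a sanity check on a reported cost, the same figure can be estimated by hand from the token counts. The sketch below uses hypothetical per-million-token prices, not actual OpenAI rates; `estimate_cost` is an illustrative helper.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_1m, output_price_per_1m):
    """Cost in dollars given per-million-token prices."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_1m
    output_cost = completion_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost


# Hypothetical prices, NOT actual OpenAI rates.
cost = estimate_cost(1200, 300, input_price_per_1m=2.00, output_price_per_1m=8.00)
print(f"Estimated cost: ${cost:.6f}")
```

In practice the token counts would come from `response.usage.prompt_tokens` and `response.usage.completion_tokens`.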

Best Practices

Use GPT-4o-mini First

Start with gpt-4o-mini for testing: it’s fast and cost-effective. Upgrade to gpt-4o when you need higher quality.

Set Max Tokens

Always set max_tokens to prevent unexpectedly long (and expensive) responses.

Use Streaming

Enable streaming for better user experience in interactive applications.

Handle Rate Limits

Implement exponential backoff when handling RateLimitError exceptions.
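
One way to implement exponential backoff is a small retry wrapper that doubles the wait after each failure and adds a little jitter. This is a minimal sketch under stated assumptions: `call_with_backoff` is a hypothetical helper, and `FakeRateLimitError` stands in for `litellm.exceptions.RateLimitError` so the demo runs offline.

```python
import random
import time


def call_with_backoff(func, retry_on, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            delay = base_delay * (2 ** attempt)
            delay += random.uniform(0, delay * 0.1)  # jitter avoids synchronized retries
            sleep(delay)


class FakeRateLimitError(Exception):
    """Stand-in for litellm.exceptions.RateLimitError."""


attempts = {"count": 0}


def flaky_call():
    """Fails twice with a rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("429")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_call, FakeRateLimitError, sleep=delays.append)
print(result, attempts["count"], len(delays))
```

With a real call, `func` would be something like `lambda: completion(model="openai/gpt-4o", messages=messages)` and `retry_on` would be `RateLimitError`.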

Related Documentation

Streaming

Learn more about streaming responses

Function Calling

Deep dive into function calling

Vision

Working with images and vision models

Embeddings

Guide to embeddings and semantic search
