Overview
The Router provides intelligent load balancing, fallbacks, and retries across multiple LLM deployments. This guide covers router-specific configuration.
Basic Router Setup
Python Configuration
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "azure/gpt-4",
                "api_key": "your-key",
                "api_base": "https://your-endpoint.openai.azure.com/"
            },
            "tpm": 100000,
            "rpm": 1000
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "gpt-4",
                "api_key": "your-openai-key"
            },
            "tpm": 90000,
            "rpm": 900
        }
    ],
    routing_strategy="usage-based-routing"
)

# Use the router
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
YAML Configuration (for Proxy)
model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key: os.environ/AZURE_API_KEY
      api_base: https://your-endpoint.openai.azure.com/
    tpm: 100000
    rpm: 1000
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
    tpm: 90000
    rpm: 900
  # Fallback target referenced under router_settings
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: usage-based-routing
  num_retries: 3
  timeout: 30
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo"]
Routing Strategies
simple-shuffle (Default)
Randomly selects from the available deployments. If deployments set rpm or tpm, the pick is weighted by those values.
router = Router(
    model_list=[...],
    routing_strategy="simple-shuffle"
)
Best for: Simple load distribution without specific requirements.
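As a rough illustration of shuffle-style selection, the sketch below picks a deployment at random and, as an assumption made for this example, weights the pick by an optional rpm field when every deployment sets one. The function pick_deployment and the deployment ids are hypothetical names; this is not LiteLLM's implementation.

```python
import random

def pick_deployment(deployments, rng=random):
    """Pick one deployment at random, weighting by "rpm" when every entry sets it."""
    weights = [d.get("rpm") for d in deployments]
    if all(w is not None and w > 0 for w in weights):
        return rng.choices(deployments, weights=weights, k=1)[0]
    # No usable weights: fall back to a uniform pick.
    return rng.choice(deployments)

deployments = [
    {"id": "azure-east", "rpm": 900},  # hypothetical deployment ids
    {"id": "openai", "rpm": 100},
]

rng = random.Random(0)  # seeded so repeated runs give the same counts
counts = {"azure-east": 0, "openai": 0}
for _ in range(10_000):
    counts[pick_deployment(deployments, rng)["id"]] += 1
print(counts)
```

With these weights roughly nine in ten picks should land on azure-east, which is the behavior to expect from any rpm-weighted shuffle.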
usage-based-routing
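Routes each request to the deployment with the lowest token usage in the current minute, and skips deployments that have reached their TPM (tokens per minute) or RPM (requests per minute) limit.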
router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "key1"},
            "tpm": 100000,  # 100K tokens per minute
            "rpm": 1000     # 1K requests per minute
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "azure/gpt-4", "api_key": "key2"},
            "tpm": 200000,
            "rpm": 2000
        }
    ],
    routing_strategy="usage-based-routing"
)
Best for: Respecting provider rate limits and quotas.
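The selection rule can be sketched in a few lines. This is a simplified model under stated assumptions, not LiteLLM's code: usage is a plain dict of per-minute counters, and pick_lowest_usage is a hypothetical helper that drops deployments without headroom and then takes the one with the least TPM used.

```python
def pick_lowest_usage(deployments, usage, request_tokens):
    """Return the id of the deployment with the lowest TPM used that still has room."""
    candidates = []
    for d in deployments:
        used = usage.get(d["id"], {"tpm": 0, "rpm": 0})
        fits_tpm = used["tpm"] + request_tokens <= d["tpm"]
        fits_rpm = used["rpm"] + 1 <= d["rpm"]
        if fits_tpm and fits_rpm:
            candidates.append((used["tpm"], d["id"]))
    if not candidates:
        return None  # every deployment is at its limit
    return min(candidates)[1]

deployments = [
    {"id": "openai", "tpm": 100_000, "rpm": 1000},
    {"id": "azure", "tpm": 200_000, "rpm": 2000},
]
# Hypothetical counters for the current minute.
usage = {
    "openai": {"tpm": 99_500, "rpm": 10},
    "azure": {"tpm": 120_000, "rpm": 300},
}

choice = pick_lowest_usage(deployments, usage, request_tokens=1_000)
print(choice)  # openai has less usage but no room for 1,000 more tokens
```

A 500-token request would still fit on the openai deployment, so the same call with request_tokens=500 selects it instead.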
latency-based-routing
Routes to the deployment with the lowest recently measured latency.
router = Router(
    model_list=[...],
    routing_strategy="latency-based-routing",
    routing_strategy_args={
        "ttl": 60  # Cache latency measurements for 60 seconds
    }
)
Best for: Optimizing response time across geographic regions.
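To make the role of ttl concrete, here is a minimal sketch of a latency tracker that ignores samples older than the TTL. LatencyTracker, its methods, and the choice to try unmeasured deployments first are all assumptions made for illustration; LiteLLM's own bookkeeping may differ.

```python
import time

class LatencyTracker:
    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.samples = {}  # deployment id -> list of (timestamp, seconds)

    def record(self, deployment_id, seconds):
        self.samples.setdefault(deployment_id, []).append((self.clock(), seconds))

    def average(self, deployment_id):
        """Mean latency over samples newer than the TTL, or None if there are none."""
        cutoff = self.clock() - self.ttl
        fresh = [s for t, s in self.samples.get(deployment_id, []) if t >= cutoff]
        return sum(fresh) / len(fresh) if fresh else None

    def pick(self, deployment_ids):
        # Deployments without fresh samples sort first so they get measured.
        def key(deployment_id):
            avg = self.average(deployment_id)
            return (avg is not None, avg or 0.0)
        return min(deployment_ids, key=key)

now = [0.0]  # fake clock so the example runs instantly
tracker = LatencyTracker(ttl=60, clock=lambda: now[0])
tracker.record("eastus", 0.8)
tracker.record("westus", 0.3)
first = tracker.pick(["eastus", "westus"])

now[0] = 61.0  # the earlier samples are now older than the TTL
tracker.record("eastus", 0.5)
tracker.record("westus", 0.9)
second = tracker.pick(["eastus", "westus"])
print(first, second)
```

A short TTL lets routing react quickly to a region slowing down, at the cost of noisier averages.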
least-busy
Routes to the deployment with the fewest ongoing requests.
router = Router(
    model_list=[...],
    routing_strategy="least-busy"
)
Best for: Even load distribution in high-concurrency scenarios.
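One way to picture this strategy is a counter of in-flight requests per deployment. The sketch below is illustrative only; InFlightCounter and its methods are hypothetical names, and a real router would also need to be safe under concurrency.

```python
from contextlib import contextmanager

class InFlightCounter:
    def __init__(self, deployment_ids):
        self.in_flight = {d: 0 for d in deployment_ids}

    def pick(self):
        # Ties resolve to the first deployment in insertion order.
        return min(self.in_flight, key=self.in_flight.get)

    @contextmanager
    def track(self, deployment_id):
        """Count a request as in flight for the duration of the with-block."""
        self.in_flight[deployment_id] += 1
        try:
            yield
        finally:
            self.in_flight[deployment_id] -= 1

counter = InFlightCounter(["a", "b"])
with counter.track("a"):
    during = counter.pick()  # "a" is busy, so "b" is chosen
after = counter.pick()       # both idle again, tie goes to "a"
print(during, after)
```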
cost-based-routing
Routes to the cheapest deployment.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "key"},
            "model_info": {
                "input_cost_per_token": 0.00003,
                "output_cost_per_token": 0.00006
            }
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "azure/gpt-4", "api_key": "key"},
            "model_info": {
                "input_cost_per_token": 0.000025,
                "output_cost_per_token": 0.00005
            }
        }
    ],
    routing_strategy="cost-based-routing"
)
Best for: Cost optimization.
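The arithmetic behind "cheapest" is worth spelling out. The sketch below prices a hypothetical request of 1,000 prompt tokens and 500 completion tokens using the per-token costs from the example above; request_cost is an illustrative helper, not a LiteLLM function.

```python
def request_cost(model_info, prompt_tokens, completion_tokens):
    """Cost of one request given per-token input and output prices."""
    return (prompt_tokens * model_info["input_cost_per_token"]
            + completion_tokens * model_info["output_cost_per_token"])

deployments = {
    "openai": {"input_cost_per_token": 0.00003, "output_cost_per_token": 0.00006},
    "azure": {"input_cost_per_token": 0.000025, "output_cost_per_token": 0.00005},
}

costs = {name: request_cost(info, 1000, 500) for name, info in deployments.items()}
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ${cost:.4f}")
print("cheapest:", cheapest)
```

For this request the OpenAI deployment costs about $0.06 and the Azure deployment about $0.05, so the Azure deployment is selected.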
Fallback Configuration
Basic Fallbacks
router = Router(
    model_list=[
        {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4"}},
        {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "gpt-3.5-turbo"}},
        {"model_name": "claude-2", "litellm_params": {"model": "claude-2"}}
    ],
    fallbacks=[
        {"gpt-4": ["gpt-3.5-turbo", "claude-2"]}
    ]
)
YAML:
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-2
    litellm_params:
      model: claude-2
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo", "claude-2"]
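The order in which fallbacks are tried can be sketched as a simple chain: the requested group first, then each fallback in the order listed. The helper complete_with_fallbacks and the fake_call stub are hypothetical and only model the control flow, assuming fallbacks are attempted once the primary group has failed.

```python
def complete_with_fallbacks(model, call, fallbacks):
    """Try `model` first, then each fallback group in the order listed."""
    chain = [model]
    for entry in fallbacks:
        chain.extend(entry.get(model, []))
    failed = []
    for name in chain:
        try:
            return name, call(name)
        except Exception:
            failed.append(name)
    raise RuntimeError(f"all model groups failed: {failed}")

def fake_call(name):
    # Hypothetical stand-in for a provider call: only claude-2 is healthy here.
    if name != "claude-2":
        raise ConnectionError(f"{name} unavailable")
    return f"response from {name}"

fallbacks = [{"gpt-4": ["gpt-3.5-turbo", "claude-2"]}]
used, response = complete_with_fallbacks("gpt-4", fake_call, fallbacks)
print(used, "->", response)
```

Because the list is ordered, put the closest substitute first and the most different model last.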
Context Window Fallbacks
router = Router(
    model_list=[
        {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "gpt-3.5-turbo"}},
        {"model_name": "gpt-3.5-turbo-16k", "litellm_params": {"model": "gpt-3.5-turbo-16k"}},
        {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4"}},
        {"model_name": "gpt-4-32k", "litellm_params": {"model": "gpt-4-32k"}}
    ],
    # Each key and each target must be a model_name defined in model_list
    context_window_fallbacks=[
        {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]},
        {"gpt-4": ["gpt-4-32k"]}
    ]
)
YAML:
litellm_settings:
  context_window_fallbacks:
    - gpt-3.5-turbo: ["gpt-3.5-turbo-16k"]
    - gpt-4: ["gpt-4-32k"]
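Context window fallbacks differ from ordinary fallbacks only in which error triggers them. A minimal sketch of that dispatch follows, assuming a context-length failure surfaces as a distinct exception type. ContextWindowExceeded, fallback_candidates, and the claude-2 entry in fallbacks are hypothetical and exist only for this example.

```python
class ContextWindowExceeded(Exception):
    """Hypothetical stand-in for a provider's context-length error."""

def fallback_candidates(model, error, fallbacks, context_window_fallbacks):
    """Choose the fallback table by error type, then look up `model` in it."""
    if isinstance(error, ContextWindowExceeded):
        table = context_window_fallbacks
    else:
        table = fallbacks
    for entry in table:
        if model in entry:
            return list(entry[model])
    return []

fallbacks = [{"gpt-3.5-turbo": ["claude-2"]}]
context_window_fallbacks = [
    {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]},
    {"gpt-4": ["gpt-4-32k"]},
]

too_long = fallback_candidates(
    "gpt-3.5-turbo", ContextWindowExceeded(), fallbacks, context_window_fallbacks
)
outage = fallback_candidates(
    "gpt-3.5-turbo", ConnectionError(), fallbacks, context_window_fallbacks
)
print(too_long, outage)
```

Keeping the two tables separate means an oversized prompt goes to a larger-context model rather than to a general-purpose backup that would reject it too.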
Retry Configuration
Basic Retries
router = Router(
    model_list=[...],
    num_retries=3,   # Retry a failed request up to 3 times
    timeout=30,      # Per-request timeout in seconds
    retry_after=5    # Wait at least 5 seconds before retrying
)
Per-Error Retry Policy
from litellm.router import RetryPolicy

router = Router(
    model_list=[...],
    # One retry count per exception type
    retry_policy=RetryPolicy(
        RateLimitErrorRetries=5,
        TimeoutErrorRetries=2,
        InternalServerErrorRetries=3
    )
)
YAML:
router_settings:
  retry_policy:
    RateLimitErrorRetries: 5
    TimeoutErrorRetries: 2
    InternalServerErrorRetries: 3
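The idea of a per-error retry budget can be sketched without any library: each exception type gets its own counter, and a request is abandoned once the budget for the error it just hit is used up. The exception classes, call_with_policy, and the policy dict keyed by class are hypothetical stand-ins for illustration, not LiteLLM types.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a provider rate-limit error."""

class Timeout(Exception):
    """Hypothetical stand-in for a request timeout."""

def call_with_policy(fn, policy, default_retries=0, retry_after=0, sleep=time.sleep):
    """Call fn(), retrying each exception type up to its own budget."""
    spent = {}  # exception type -> retries already used
    while True:
        try:
            return fn()
        except Exception as exc:
            kind = type(exc)
            if spent.get(kind, 0) >= policy.get(kind, default_retries):
                raise
            spent[kind] = spent.get(kind, 0) + 1
            sleep(retry_after)  # fixed wait before the next attempt

policy = {RateLimitError: 5, Timeout: 2}
outcomes = iter([RateLimitError(), Timeout(), "ok"])
waits = []

def flaky():
    # Fails twice with different errors, then succeeds.
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result

# Record the waits instead of sleeping so the example runs instantly.
result = call_with_policy(flaky, policy, retry_after=5, sleep=waits.append)
print(result, waits)
```

Generous budgets suit errors that usually clear on their own, such as rate limits; errors that rarely recover are better given small budgets so fallbacks start sooner.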
Per-Model-Group Retry Policy
from litellm.router import RetryPolicy

router = Router(
    model_list=[...],
    # Keys are model group names (the model_name values in model_list)
    model_group_retry_policy={
        "gpt-4": RetryPolicy(RateLimitErrorRetries=10),
        "claude-2": RetryPolicy(RateLimitErrorRetries=3)
    }
)
Cooldown Configuration
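router = Router(
    model_list=[...],
    allowed_fails=3,          # Failures tolerated per minute before a deployment is cooled down
    cooldown_time=120,        # Keep a cooled-down deployment out of rotation for 120 seconds
    disable_cooldowns=False   # Set to True to turn cooldowns off
)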
YAML:
router_settings:
  allowed_fails: 3
  cooldown_time: 120
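How allowed_fails and cooldown_time interact can be sketched with a small tracker. This is a simplified model, assuming that a deployment is cooled down once it exceeds allowed_fails failures within one minute. CooldownTracker and its methods are hypothetical names.

```python
import time

class CooldownTracker:
    def __init__(self, allowed_fails=3, cooldown_time=120, clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.clock = clock
        self.failures = {}       # deployment id -> failure timestamps in the last minute
        self.cooling_until = {}  # deployment id -> time the cooldown ends

    def record_failure(self, deployment_id):
        now = self.clock()
        recent = [t for t in self.failures.get(deployment_id, []) if now - t < 60]
        recent.append(now)
        self.failures[deployment_id] = recent
        if len(recent) > self.allowed_fails:
            # Too many failures this minute: take the deployment out of rotation.
            self.cooling_until[deployment_id] = now + self.cooldown_time
            self.failures[deployment_id] = []

    def is_available(self, deployment_id):
        until = self.cooling_until.get(deployment_id)
        return until is None or self.clock() >= until

now = [0.0]  # fake clock so the example runs instantly
tracker = CooldownTracker(allowed_fails=3, cooldown_time=120, clock=lambda: now[0])
for second in range(4):  # four failures within four seconds
    now[0] = float(second)
    tracker.record_failure("azure-east")
cooling = not tracker.is_available("azure-east")

now[0] = 3.0 + 120  # cooldown_time after the fourth failure
recovered = tracker.is_available("azure-east")
print(cooling, recovered)
```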
Caching Configuration
Redis Caching
router = Router(
    model_list=[...],
    cache_responses=True,
    redis_host="localhost",
    redis_port=6379,
    redis_password="your-password",
    default_cache_time_seconds=3600  # 1 hour
)
YAML:
router_settings:
  redis_host: localhost
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD
  cache_responses: true

litellm_settings:
  cache: true
  cache_params:
    type: redis
    ttl: 3600
In-Memory Caching
router = Router(
    model_list=[...],
    cache_responses=True,
    cache_kwargs={
        "type": "local"
    }
)
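Conceptually, response caching maps a request to a stored response for a limited time. The sketch below shows one way this can work, assuming the cache key is a hash of the model name and messages. TTLCache and make_key are hypothetical, and the way LiteLLM actually builds its keys may differ.

```python
import hashlib
import json
import time

class TTLCache:
    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # key -> (expiry time, response)

    @staticmethod
    def make_key(model, messages):
        # Identical model and messages always produce the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model, messages):
        key = self.make_key(model, messages)
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self.store[key]  # expired: drop it and report a miss
            return None
        return response

    def set(self, model, messages, response):
        self.store[self.make_key(model, messages)] = (self.clock() + self.ttl, response)

now = [0.0]  # fake clock so the example runs instantly
cache = TTLCache(ttl=3600, clock=lambda: now[0])
messages = [{"role": "user", "content": "Hello"}]
cache.set("gpt-4", messages, "Hi there")

hit = cache.get("gpt-4", messages)
miss_other_model = cache.get("gpt-3.5-turbo", messages)
now[0] = 3600.0  # one hour later the entry has expired
expired = cache.get("gpt-4", messages)
print(hit, miss_other_model, expired)
```

An in-memory cache like this is private to one process; a shared store such as Redis is what lets several router instances reuse each other's responses.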
Model Aliases
router = Router(
    model_list=[
        {
            "model_name": "prod-gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "key"}
        }
    ],
    model_group_alias={
        "gpt-4": "prod-gpt-4",
        "gpt4": "prod-gpt-4"
    }
)

# All of these work:
response = router.completion(model="gpt-4", messages=[...])
response = router.completion(model="gpt4", messages=[...])
response = router.completion(model="prod-gpt-4", messages=[...])
YAML:
router_settings:
  model_group_alias:
    gpt-4: prod-gpt-4
    gpt4: prod-gpt-4
Complete Production Example
model_list:
  # GPT-4 with multiple deployments
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_base: https://eastus.openai.azure.com/
      api_key: os.environ/AZURE_KEY_EAST
      api_version: "2024-02-01"
    tpm: 100000
    rpm: 1000
    model_info:
      input_cost_per_token: 0.00003
      output_cost_per_token: 0.00006
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_base: https://westus.openai.azure.com/
      api_key: os.environ/AZURE_KEY_WEST
      api_version: "2024-02-01"
    tpm: 150000
    rpm: 1500
    model_info:
      input_cost_per_token: 0.00003
      output_cost_per_token: 0.00006
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
    tpm: 90000
    rpm: 900
    model_info:
      input_cost_per_token: 0.00003
      output_cost_per_token: 0.00006
  # Fallback models
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
    tpm: 1000000
    rpm: 10000
  # Target of the context window fallback below
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: gpt-3.5-turbo-16k
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-2
    litellm_params:
      model: claude-2
      api_key: os.environ/ANTHROPIC_API_KEY
    tpm: 100000
    rpm: 1000

router_settings:
  # Routing
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60
  # Model aliases (each alias must point to a model_name defined above)
  model_group_alias:
    gpt4: gpt-4
  # Retries
  num_retries: 3
  timeout: 30
  retry_policy:
    RateLimitErrorRetries: 5
    TimeoutErrorRetries: 2
  # Cooldowns
  allowed_fails: 3
  cooldown_time: 120
  # Caching
  redis_host: localhost
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD
  cache_responses: true

litellm_settings:
  # Fallbacks
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo", "claude-2"]
  context_window_fallbacks:
    - gpt-3.5-turbo: ["gpt-3.5-turbo-16k"]
  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["sentry"]
  # Settings
  set_verbose: false
  drop_params: true
  request_timeout: 300
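A configuration of this size is easy to get subtly wrong, for example by naming a fallback or alias target that has no entry in model_list. The sketch below is a small pre-deployment check. It assumes the YAML has already been parsed into a dict by whatever parser you use, and find_config_problems is a hypothetical helper rather than part of LiteLLM. For simplicity it treats every fallback source as a model group name and does not resolve aliases.

```python
def find_config_problems(config):
    """Return human-readable problems with fallback and alias targets."""
    groups = {entry["model_name"] for entry in config.get("model_list", [])}
    router = config.get("router_settings", {})
    general = config.get("litellm_settings", {})
    problems = []

    for alias, target in router.get("model_group_alias", {}).items():
        if target not in groups:
            problems.append(f"alias {alias!r} points to unknown group {target!r}")

    for key in ("fallbacks", "context_window_fallbacks"):
        # This guide places fallbacks under either settings block, so check both.
        for entry in router.get(key, []) + general.get(key, []):
            for source, targets in entry.items():
                for name in [source, *targets]:
                    if name not in groups:
                        problems.append(f"{key}: unknown group {name!r}")
    return problems

# The dict a YAML parser would produce for a small, deliberately flawed config.
config = {
    "model_list": [
        {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4"}},
        {"model_name": "claude-2", "litellm_params": {"model": "claude-2"}},
    ],
    "router_settings": {"model_group_alias": {"gpt4": "gpt-4"}},
    "litellm_settings": {"fallbacks": [{"gpt-4": ["gpt-3.5-turbo", "claude-2"]}]},
}

problems = find_config_problems(config)
for problem in problems:
    print(problem)
```

Here the check reports that gpt-3.5-turbo is used as a fallback without being defined, which is the kind of mistake that otherwise only shows up when the primary model fails.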
Best Practices
- Use TPM/RPM limits: Always set limits to respect provider quotas
- Configure fallbacks: Have backup models for reliability
- Enable caching: Reduce costs and latency
- Route on latency: Use latency-based routing when response time is the priority
- Set appropriate timeouts: Balance responsiveness and success rate
- Use cooldowns: Prevent cascading failures
- Test retry policies: Ensure they match your use case
- Use model aliases: Abstract model names for easier updates