Overview
Ollama lets you run large language models locally. LiteLLM provides seamless integration with Ollama, supporting chat, embeddings, function calling, and reasoning models.
Quick Start
Install Ollama
Download and install Ollama from ollama.ai.
# Pull a model
ollama pull llama3.3
Make Your First Call
from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434",
)
print(response.choices[0].message.content)
Popular Models
Llama
Meta’s Llama models.
# Pull models
ollama pull llama3.3
ollama pull llama3.1

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Explain AI"}],
    api_base="http://localhost:11434",
)

Mistral
Mistral AI models.
ollama pull mistral
ollama pull mixtral

from litellm import completion

response = completion(
    model="ollama/mistral",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434",
)

Phi
Microsoft’s Phi models.

from litellm import completion

response = completion(
    model="ollama/phi3",
    messages=[{"role": "user", "content": "Quick task"}],
    api_base="http://localhost:11434",
)

Code Models
Code-specialized models.
ollama pull codellama
ollama pull deepseek-coder

from litellm import completion

response = completion(
    model="ollama/deepseek-coder",
    messages=[{"role": "user", "content": "Write a Python function"}],
    api_base="http://localhost:11434",
)
Configuration
Default Localhost
from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    # api_base defaults to http://localhost:11434
)

Custom Host
from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://192.168.1.100:11434",
)

Environment Variable
export OLLAMA_API_BASE="http://localhost:11434"

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
)
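The three options above amount to a lookup order for the server address. The sketch below makes one such order explicit. `resolve_api_base` is a hypothetical helper written for this page, not part of LiteLLM, and the precedence it uses (explicit argument, then `OLLAMA_API_BASE`, then the default) is an assumption about sensible behavior rather than a description of LiteLLM's internals.

```python
import os

DEFAULT_OLLAMA_BASE = "http://localhost:11434"  # default address stated on this page


def resolve_api_base(explicit=None, env=None):
    """Pick a base URL: explicit argument, then OLLAMA_API_BASE, then the default.

    Hypothetical helper; the precedence is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    base = explicit or env.get("OLLAMA_API_BASE") or DEFAULT_OLLAMA_BASE
    return base.rstrip("/")  # drop a trailing slash so appended paths stay clean


print(resolve_api_base("http://192.168.1.100:11434/", env={}))
print(resolve_api_base(env={"OLLAMA_API_BASE": "http://localhost:11434"}))
print(resolve_api_base(env={}))
```

The result can be passed as `api_base` to `completion`, which keeps the address decision in one place when the same code runs on several machines.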
Streaming
from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Write a story"}],
    api_base="http://localhost:11434",
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
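When the full text is needed after streaming finishes, the deltas can be collected rather than only printed. The following is a minimal sketch: `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only imitate the `choices[0].delta.content` attribute path used above so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed response into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # skip chunks that carry no text
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Test stand-in with the same attribute path as a streamed chunk."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


story = collect_stream([fake_chunk("Once "), fake_chunk("upon a time"), fake_chunk(None)])
print(story)
```

With a real call, `collect_stream(response)` would take the place of the `for` loop shown above.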
Function Calling
Ollama 0.4+ supports native function calling.
from litellm import completion

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    }
                },
                "required": ["location"],
            },
        },
    }
]

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools,
    api_base="http://localhost:11434",
)
if response.choices[0].message.tool_calls:
    print("Tool calls:", response.choices[0].message.tool_calls)
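Printing the tool calls is only the first half of the loop; the application still has to run the requested function and send the result back. The sketch below shows one way to do that dispatch. It assumes the OpenAI-style shape in which `tool_call.function.arguments` is a JSON string, and `get_weather`, `TOOL_REGISTRY` and `run_tool_call` are hypothetical names with a stubbed weather result.

```python
import json
from types import SimpleNamespace


def get_weather(location):
    """Hypothetical local implementation of the get_weather tool (stubbed data)."""
    return {"location": location, "forecast": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Run one tool call and build a 'tool' role message for the follow-up request."""
    func = TOOL_REGISTRY[tool_call.function.name]
    # Assumption: arguments arrive as a JSON-encoded string, as in the OpenAI format.
    args = json.loads(tool_call.function.arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(func(**args)),
    }


# Stand-in for an entry of response.choices[0].message.tool_calls
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "SF"}'),
)
tool_message = run_tool_call(fake_call)
print(tool_message)
```

The returned message would be appended to `messages`, after the assistant message that requested the call, before calling `completion` again.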
Reasoning Models
Use reasoning capabilities with compatible models.
GPT-OSS
from litellm import completion

response = completion(
    model="ollama/gpt-oss-120b",
    messages=[{"role": "user", "content": "Solve this problem..."}],
    reasoning_effort="medium",  # low, medium, high
    api_base="http://localhost:11434",
)
if response.choices[0].message.reasoning_content:
    print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

Other Models
from litellm import completion

# Enable thinking for other models
response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Complex problem..."}],
    reasoning_effort="high",  # Enables thinking mode
    api_base="http://localhost:11434",
)
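Not every model returns a reasoning trace, so code that reads `reasoning_content` is safer if it tolerates the field being empty or absent. A minimal sketch, using a hypothetical `split_reasoning` helper and stand-in message objects in place of a real response:

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when no trace is present."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning or None, message.content


# Stand-ins for response.choices[0].message
with_trace = SimpleNamespace(reasoning_content="2 + 2 is 4", content="4")
without_trace = SimpleNamespace(content="4")

print(split_reasoning(with_trace))
print(split_reasoning(without_trace))
```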
JSON Mode
import json

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "List 3 colors in JSON"}],
    response_format={"type": "json_object"},
    api_base="http://localhost:11434",
)
data = json.loads(response.choices[0].message.content)

from litellm import completion

schema = {
    "type": "object",
    "properties": {
        "colors": {
            "type": "array",
            "items": {"type": "string"},
        }
    },
    "required": ["colors"],
}

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"schema": schema},
    },
    api_base="http://localhost:11434",
)
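Even with a schema in the request, it is prudent to check the reply before using it, since local models can still return malformed or off-schema JSON. The sketch below hand-checks the `colors` schema from the example above using only the standard library. `parse_colors` is a hypothetical helper; a schema validation library would be the more general choice.

```python
import json


def parse_colors(content):
    """Parse a JSON reply and confirm it is an object with a 'colors' list of strings."""
    data = json.loads(content)  # raises json.JSONDecodeError (a ValueError) if invalid
    colors = data.get("colors") if isinstance(data, dict) else None
    if not isinstance(colors, list) or not all(isinstance(c, str) for c in colors):
        raise ValueError("expected an object with a 'colors' array of strings")
    return colors


print(parse_colors('{"colors": ["red", "green", "blue"]}'))
```

With a real call, the argument would be `response.choices[0].message.content`.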
Vision Models
Use vision-capable models with images.
from litellm import completion

response = completion(
    model="ollama/llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}},
        ],
    }],
    api_base="http://localhost:11434",
)
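For images on local disk, a common approach with OpenAI-style messages is to embed the bytes as a base64 data URL in the `image_url` field. The sketch below builds such a message. It assumes your LiteLLM version accepts data URLs for Ollama vision models, which is worth confirming before relying on it, and `to_data_url` and `image_message` are hypothetical helpers.

```python
import base64


def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(prompt, image_bytes, mime_type="image/png"):
    """Build a user message with a text part and an inline image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, mime_type)}},
        ],
    }


# Placeholder bytes; in practice read them with open(path, "rb").read()
message = image_message("What's in this image?", b"abc")
print(message["content"][1]["image_url"]["url"])
```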
Embeddings
from litellm import embedding

response = embedding(
    model="ollama/nomic-embed-text",
    input=["Text to embed", "Another text"],
    api_base="http://localhost:11434",
)
# Each entry in response.data is a dict with an "embedding" key
embeddings = [item["embedding"] for item in response.data]
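Embedding vectors are usually compared with cosine similarity. A minimal, dependency-free sketch with toy vectors standing in for real embeddings (`cosine_similarity` is a hypothetical helper, not a LiteLLM function):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors; with real data these would be two entries of `embeddings`
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

For large collections a vectorized library would be faster; the plain version is enough to sanity-check that related texts score higher than unrelated ones.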
Advanced Configuration
from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434",
    # OpenAI params
    temperature=0.8,
    max_tokens=500,
    top_p=0.9,
    frequency_penalty=0.5,
    seed=42,
    # Ollama-specific params
    num_ctx=4096,        # Context window size
    num_predict=200,     # Max tokens to generate
    repeat_penalty=1.1,  # Penalize repetition
    top_k=40,            # Top-k sampling
    mirostat=0,          # Mirostat sampling (0=off, 1=v1, 2=v2)
    keep_alive="5m",     # Keep model loaded
)
Supported Parameters
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Randomness (0-1) |
| max_tokens | int | Max output tokens |
| max_completion_tokens | int | Alternative to max_tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Maps to repeat_penalty |
| stop | list | Stop sequences |
| seed | int | Reproducibility |
| num_ctx | int | Context window size |
| num_predict | int | Max tokens to generate |
| repeat_penalty | float | Penalize repetition |
| top_k | int | Top-k sampling |
| mirostat | int | Mirostat mode (0/1/2) |
| keep_alive | str | Keep model loaded duration |
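Because some OpenAI-style names are translated to Ollama names, passing both forms of the same setting in one call can be ambiguous. The sketch below illustrates the idea of such a translation. It is not LiteLLM's implementation: the `frequency_penalty` mapping comes from the table, the `max_tokens` and `max_completion_tokens` mappings to `num_predict` are assumptions, and letting the native Ollama name win on conflict is a policy chosen only for this example.

```python
# Illustrative mapping; only the frequency_penalty row is stated in the table above.
OPENAI_TO_OLLAMA = {
    "max_tokens": "num_predict",             # assumption
    "max_completion_tokens": "num_predict",  # assumption
    "frequency_penalty": "repeat_penalty",   # from the table
}


def to_ollama_options(params):
    """Translate OpenAI-style names to Ollama names; native names win on conflict."""
    options = {}
    for name, value in params.items():
        if name in OPENAI_TO_OLLAMA:
            options.setdefault(OPENAI_TO_OLLAMA[name], value)
    for name, value in params.items():
        if name not in OPENAI_TO_OLLAMA:
            options[name] = value  # native name overrides any translated value
    return options


print(to_ollama_options({"max_tokens": 500, "frequency_penalty": 0.5, "top_k": 40}))
print(to_ollama_options({"max_tokens": 500, "num_predict": 200}))
```

In practice the simplest way to avoid surprises is to pass only one spelling of each setting per call.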
Error Handling
from litellm import completion
from litellm.exceptions import APIError

try:
    response = completion(
        model="ollama/llama3.3",
        messages=[{"role": "user", "content": "Hello!"}],
        api_base="http://localhost:11434",
    )
except APIError as e:
    print(f"Error: {e.status_code} - {e.message}")
    # Check if Ollama is running
    # Check if model is pulled
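The "check if Ollama is running" step can be automated with a quick probe before making a request. The sketch below uses only the standard library and queries Ollama's `/api/tags` endpoint, which lists locally available models. `ollama_is_reachable` is a hypothetical helper, and the two-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(api_base="http://localhost:11434", timeout=2.0):
    """Return True if the server answers GET /api/tags with HTTP 200, else False."""
    url = f"{api_base.rstrip('/')}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL
        return False
```

Calling `ollama_is_reachable()` before `completion` lets the application show a clear "start Ollama" message instead of surfacing a lower-level connection error.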
LiteLLM Proxy
model_list:
  - model_name: llama3.3
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434
  - model_name: codellama
    litellm_params:
      model: ollama/codellama
      api_base: http://192.168.1.100:11434

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000",
)
response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
)
Best Practices
Pull models before use: ollama pull model-name
Use keep_alive to keep frequently-used models loaded
Monitor system resources (RAM, GPU memory)
Function calling requires Ollama 0.4+
Not all models support function calling equally
Test with your specific model before production
Troubleshooting
# Check Ollama is running
ollama list
# Start Ollama if needed
ollama serve
# Pull the model first
ollama pull llama3.3
# List available models
ollama list
Use smaller models or quantized versions
Reduce num_ctx to lower memory usage
Close other applications