Overview

Ollama lets you run large language models locally. LiteLLM routes any model name that starts with ollama/ to your Ollama server and supports chat, embeddings, function calling, and reasoning models.

Quick Start

1. Install Ollama

Download and install Ollama from ollama.ai, then pull a model:
# Pull a model
ollama pull llama3.3

2. Install LiteLLM

pip install litellm

3. Make Your First Call

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434"
)
print(response.choices[0].message.content)

Meta's Llama models are available through Ollama. Pull the versions you need, then call them with the ollama/ prefix:
# Pull models
ollama pull llama3.3
ollama pull llama3.1

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Explain AI"}],
    api_base="http://localhost:11434"
)

Configuration

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}]
    # Defaults to http://localhost:11434
)
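
When Ollama runs on another host, it can help to resolve the base URL in one place instead of repeating it in every call. The following is a minimal sketch, not part of LiteLLM: resolve_api_base and ask are hypothetical helpers, and the environment variable name OLLAMA_API_BASE is an assumption of this sketch (the helper reads it itself and passes the result explicitly).

```python
import os

DEFAULT_OLLAMA_BASE = "http://localhost:11434"  # the default noted above


def resolve_api_base(explicit=None):
    """Return the Ollama base URL: explicit argument, then env var, then default."""
    # OLLAMA_API_BASE is the variable name assumed by this sketch.
    base = explicit or os.environ.get("OLLAMA_API_BASE") or DEFAULT_OLLAMA_BASE
    return base.rstrip("/")  # drop a trailing slash so the URL is stable


def ask(prompt, model="ollama/llama3.3", api_base=None):
    """Send one user message through LiteLLM and return the reply text."""
    # Imported lazily so resolve_api_base works without litellm installed.
    from litellm import completion

    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        api_base=resolve_api_base(api_base),
    )
    return response.choices[0].message.content
```

With this in place, ask("Hello!") targets the default local server, and setting the variable or passing api_base redirects the same code to a remote machine.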

Streaming

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Write a story"}],
    api_base="http://localhost:11434",
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
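
If you need the full text after streaming (for logging or post-processing), the loop above can be wrapped in a small collector. This is a sketch under the assumption that every chunk exposes choices[0].delta.content as in the example above; collect_stream and fake_chunk are hypothetical names, and fake_chunk exists only so the demo runs without an Ollama server.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Consume streamed chunks, optionally echo each piece, and return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:  # content can be None or empty on some chunks
            parts.append(text)
            if on_text is not None:
                on_text(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (demo only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk("Once "), fake_chunk(None), fake_chunk("upon a time")]
    story = collect_stream(demo, on_text=lambda t: print(t, end=""))
```

In real use, pass the object returned by completion(..., stream=True) as chunks.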

Function Calling

Ollama 0.4+ supports native function calling.
from litellm import completion

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools,
    api_base="http://localhost:11434"
)

if response.choices[0].message.tool_calls:
    print("Tool calls:", response.choices[0].message.tool_calls)
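
The example above only prints the tool calls. A next step is usually to run the requested function and send the result back to the model. The sketch below assumes tool calls follow the OpenAI shape (id, function.name, and function.arguments as a JSON string); get_weather, TOOL_REGISTRY, and run_tool_calls are hypothetical names, and the weather data is a placeholder.

```python
import json
from types import SimpleNamespace


def get_weather(location):
    """Hypothetical stand-in; a real version would query a weather service."""
    return {"location": location, "forecast": "unknown"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and return 'tool' role messages for a follow-up call."""
    results = []
    for call in tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            output = {"error": f"unknown tool: {call.function.name}"}
        else:
            args = call.function.arguments
            if isinstance(args, str):  # arguments usually arrive as a JSON string
                args = json.loads(args or "{}")
            output = func(**args)
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results


if __name__ == "__main__":
    demo_call = SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "SF"}'),
    )
    print(run_tool_calls([demo_call]))
```

To continue the conversation, append the assistant message and these tool messages to messages and call completion again.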

Reasoning Models

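Pass reasoning_effort to models that support reasoning. When the model returns its reasoning, it is available as reasoning_content alongside the final answer.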
from litellm import completion

response = completion(
    model="ollama/gpt-oss:120b",
    messages=[{"role": "user", "content": "Solve this problem..."}],
    reasoning_effort="medium",  # low, medium, high
    api_base="http://localhost:11434"
)

if response.choices[0].message.reasoning_content:
    print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)
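
Models that do not emit reasoning may leave reasoning_content unset, so reading it defensively can avoid an attribute error. The helper below is a sketch under that assumption; split_reasoning is a hypothetical name, and the demo uses a stand-in message object rather than a real response.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the model gave none."""
    reasoning = getattr(message, "reasoning_content", None)
    answer = getattr(message, "content", None) or ""
    return reasoning, answer


if __name__ == "__main__":
    # Stand-in for response.choices[0].message (demo only).
    demo = SimpleNamespace(content="42", reasoning_content="6 * 7 = 42")
    reasoning, answer = split_reasoning(demo)
    if reasoning:
        print("Reasoning:", reasoning)
    print("Answer:", answer)
```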

JSON Mode

import json

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "List 3 colors in JSON"}],
    response_format={"type": "json_object"},
    api_base="http://localhost:11434"
)

# The reply is a JSON string; parse it into Python objects
data = json.loads(response.choices[0].message.content)
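
Local models occasionally wrap JSON in a Markdown code fence or return invalid JSON even when JSON mode is requested, so a tolerant parser can be useful. This is a sketch, not LiteLLM behavior; parse_json_reply is a hypothetical helper that returns None instead of raising.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating a surrounding code fence."""
    cleaned = (text or "").strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence if present.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # caller decides whether to retry or fail


if __name__ == "__main__":
    print(parse_json_reply('{"colors": ["red", "green", "blue"]}'))
```

A None result is a reasonable signal to retry the request or fall back to a default.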

Vision Models

Pass images to vision-capable models such as llava as image_url parts of the message content.
from litellm import completion

response = completion(
    model="ollama/llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    api_base="http://localhost:11434"
)
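
For images stored on disk, one common approach is to embed the file as a base64 data URL in the same image_url field. The sketch below assumes the Ollama route accepts data URLs the way OpenAI-style APIs do, which you should verify with your model; image_to_data_url and image_message are hypothetical helpers.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Read a local image and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    with open(path, "rb") as fh:
        encoded = base64.b64encode(fh.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt, path):
    """Build a user message that pairs a text prompt with a local image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
        ],
    }
```

The resulting dictionary can be passed as the single entry of messages in the call above.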

Embeddings

from litellm import embedding

response = embedding(
    model="ollama/nomic-embed-text",
    input=["Text to embed", "Another text"],
    api_base="http://localhost:11434"
)

embeddings = [data.embedding for data in response.data]
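
Embedding vectors are typically compared with cosine similarity. The following sketch works on plain lists of floats such as the embeddings list above; cosine_similarity and rank_by_similarity are hypothetical helper names, and for large collections a numeric library would be a better fit.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For example, rank_by_similarity(embeddings[0], embeddings[1:]) orders the remaining texts by closeness to the first one.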

Advanced Configuration

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434",
    # OpenAI params
    temperature=0.8,
    max_tokens=500,
    top_p=0.9,
    frequency_penalty=0.5,
    seed=42,
    # Ollama-specific params
    num_ctx=4096,  # Context window size
    num_predict=200,  # Max tokens to generate
    repeat_penalty=1.1,  # Penalize repetition
    top_k=40,  # Top-k sampling
    mirostat=0,  # Mirostat sampling (0=off, 1=v1, 2=v2)
    keep_alive="5m"  # Keep model loaded
)

Supported Parameters

Parameter              Type   Description
temperature            float  Randomness (0-1)
max_tokens             int    Max output tokens
max_completion_tokens  int    Alternative to max_tokens
top_p                  float  Nucleus sampling
frequency_penalty      float  Maps to repeat_penalty
stop                   list   Stop sequences
seed                   int    Reproducibility
num_ctx                int    Context window size
num_predict            int    Max tokens to generate
repeat_penalty         float  Penalize repetition
top_k                  int    Top-k sampling
mirostat               int    Mirostat mode (0/1/2)
keep_alive             str    Keep model loaded duration

Error Handling

from litellm import completion
from litellm.exceptions import APIError

try:
    response = completion(
        model="ollama/llama3.3",
        messages=[{"role": "user", "content": "Hello!"}],
        api_base="http://localhost:11434"
    )
except APIError as e:
    print(f"Error: {e.status_code} - {e.message}")
    # Check if Ollama is running
    # Check if model is pulled
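
A local server may still be starting or loading a model when the first request arrives, so a short retry loop with exponential backoff can smooth over transient failures. This is a generic sketch, not a LiteLLM feature; call_with_retries and its parameters are hypothetical, and the sleep argument exists only so the delay can be replaced in tests.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=0.5, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
```

With the example above, this could be used as call_with_retries(lambda: completion(...), retry_on=(APIError,)), which narrows retries to the exception type already being caught.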

LiteLLM Proxy

model_list:
  - model_name: llama3.3
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434
  
  - model_name: codellama
    litellm_params:
      model: ollama/codellama
      api_base: http://192.168.1.100:11434
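
Once the proxy is running with this config, any OpenAI-compatible client can call the models by their model_name:
import openai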

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

  • Pull models before use: ollama pull model-name
  • Use keep_alive to keep frequently-used models loaded
  • Monitor system resources (RAM, GPU memory)
  • Use GPU acceleration when available
  • Adjust num_ctx based on your needs
  • Use smaller models (7B/8B) for speed and larger ones (70B+) for quality
  • Function calling requires Ollama 0.4+
  • Not all models support function calling equally
  • Test with your specific model before production

Troubleshooting

If LiteLLM cannot connect to Ollama:
# Check Ollama is running
ollama list

# Start Ollama if needed
ollama serve

If the model is not found:
# Pull the model first
ollama pull llama3.3

# List available models
ollama list

If the system runs out of memory:
  • Use smaller models or quantized versions
  • Reduce num_ctx to lower memory usage
  • Close other applications
