Ollama Integration

Ollama enables running open-source LLMs locally for privacy-focused applications, offline deployment, and cost-free inference.

Installation

Ollama support is included in the base installation:

pip install graphiti-core

Prerequisites

Install Ollama

Download and install Ollama:

macOS/Linux: ollama.ai/download
Windows: Follow Ollama Windows instructions

Start Ollama:

ollama serve

Pull Models

Download the models you’ll use:

# Pull LLM model
ollama pull deepseek-r1:7b

# Pull embedding model
ollama pull nomic-embed-text

Configuration

Environment Variables

.env

OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_LLM_MODEL=deepseek-r1:7b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text

Basic Setup

Initialize Graphiti with Ollama:

import os
from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig
from graphiti_core.cross_encoder.openai_reranker_client import OpenAIRerankerClient

# Configure Ollama LLM client
llm_config = LLMConfig(
    api_key="ollama",  # Placeholder (required but not used)
    model=os.getenv("OLLAMA_LLM_MODEL", "deepseek-r1:7b"),
    small_model=os.getenv("OLLAMA_LLM_MODEL", "deepseek-r1:7b"),
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
)

llm_client = OpenAIGenericClient(config=llm_config)

# Configure Ollama embedder
embedder = OpenAIEmbedder(
    config=OpenAIEmbedderConfig(
        api_key="ollama",  # Placeholder
        embedding_model=os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
        embedding_dim=768,  # nomic-embed-text dimension
        base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
    )
)

# Configure cross-encoder (reranker)
cross_encoder = OpenAIRerankerClient(
    client=llm_client,
    config=llm_config
)

# Initialize Graphiti
graphiti = Graphiti(
    "bolt://localhost:7687",
    "neo4j",
    "password",
    llm_client=llm_client,
    embedder=embedder,
    cross_encoder=cross_encoder
)

Important Notes

Use OpenAIGenericClient

Always use OpenAIGenericClient for Ollama, not OpenAIClient:

# ✓ Correct
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
llm_client = OpenAIGenericClient(config=llm_config)

# ✗ Wrong
from graphiti_core.llm_client.openai_client import OpenAIClient
llm_client = OpenAIClient(config=llm_config)  # May have issues with local models

Why OpenAIGenericClient?

Higher default max tokens (16K vs 8K)
Better compatibility with local models
Full structured output support
Optimized for OpenAI-compatible APIs

Ollama API Endpoint

Ollama provides an OpenAI-compatible API at:

http://localhost:11434/v1

This endpoint implements the OpenAI API format, enabling compatibility with OpenAI client libraries.

Recommended Models

Language Models

deepseek-r1:7b (recommended): Fast reasoning model, 7B parameters
qwen2.5:7b: Strong general-purpose model
llama3.3:70b: High quality, requires more resources
gemma2:9b: Efficient Google model
mistral:7b: Fast and capable

Embedding Models

nomic-embed-text (recommended): 768 dimensions, excellent quality
mxbai-embed-large: 1024 dimensions, high quality
all-minilm: 384 dimensions, lightweight

Model Selection Guide

Model	Size	RAM Needed	Speed	Quality
deepseek-r1:7b	4.7GB	8GB	Fast	Good
qwen2.5:7b	4.7GB	8GB	Fast	Good
llama3.3:70b	40GB	64GB	Slow	Excellent
gemma2:9b	5.5GB	10GB	Medium	Good

Configuration Options

LLM Client

Parameter	Type	Default	Description
`api_key`	str	`"ollama"`	Placeholder (required but unused)
`model`	str	Required	Ollama model name
`small_model`	str	Same as model	Model for simpler tasks
`base_url`	str	`"http://localhost:11434/v1"`	Ollama API endpoint
`temperature`	float	`0.7`	Sampling temperature
`max_tokens`	int	`16384`	Maximum output tokens

Embedder

Parameter	Type	Default	Description
`api_key`	str	`"ollama"`	Placeholder (required but unused)
`embedding_model`	str	Required	Ollama embedding model
`embedding_dim`	int	Model-specific	Output dimensions
`base_url`	str	`"http://localhost:11434/v1"`	Ollama API endpoint

Complete Example

import asyncio
import os
from datetime import datetime, timezone
from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig
from graphiti_core.cross_encoder.openai_reranker_client import OpenAIRerankerClient
from graphiti_core.nodes import EpisodeType

async def main():
    # Configure Ollama LLM
    llm_config = LLMConfig(
        api_key="ollama",
        model="deepseek-r1:7b",
        small_model="deepseek-r1:7b",
        base_url="http://localhost:11434/v1"
    )
    
    llm_client = OpenAIGenericClient(config=llm_config)
    
    # Configure Ollama embedder
    embedder = OpenAIEmbedder(
        config=OpenAIEmbedderConfig(
            api_key="ollama",
            embedding_model="nomic-embed-text",
            embedding_dim=768,
            base_url="http://localhost:11434/v1"
        )
    )
    
    # Configure cross-encoder
    cross_encoder = OpenAIRerankerClient(
        client=llm_client,
        config=llm_config
    )
    
    # Initialize Graphiti
    graphiti = Graphiti(
        "bolt://localhost:7687",
        "neo4j",
        "password",
        llm_client=llm_client,
        embedder=embedder,
        cross_encoder=cross_encoder
    )
    
    try:
        # Add an episode
        await graphiti.add_episode(
            name="Local AI Test",
            episode_body="Ollama enables running LLMs locally for privacy and offline use.",
            source=EpisodeType.text,
            reference_time=datetime.now(timezone.utc)
        )
        print("Added episode using local Ollama model")
        
        # Search the graph
        results = await graphiti.search("What are the benefits of local LLMs?")
        for result in results:
            print(f"Fact: {result.fact}")
    
    finally:
        await graphiti.close()

if __name__ == "__main__":
    asyncio.run(main())

Structured Output Limitations

Local models may have challenges with structured outputs: Best Practices:

Use larger models (7B+) for better structured output adherence
Enable JSON mode in Ollama modelfile if available
Monitor extraction quality and adjust prompts if needed
Consider using quantized versions for faster inference

Performance Optimization

Hardware Acceleration

GPU Support:

# Ollama automatically detects and uses GPU if available
# NVIDIA GPU: CUDA support
# Apple Silicon: Metal support
# AMD GPU: ROCm support

Concurrency Control

Local models are slower than cloud APIs. Reduce concurrency:

.env

SEMAPHORE_LIMIT=2  # Lower concurrency for local models

Model Optimization

Use Quantized Models: Faster inference, lower memory
Tune Context Length: Balance quality vs speed
Batch Requests: Process multiple items together

When to Use Ollama

Choose Ollama if you:

Need complete data privacy (no external API calls)
Want offline operation
Prefer zero API costs
Have capable local hardware (GPU recommended)
Need air-gapped deployment

Choose Cloud APIs if you:

Need the highest quality outputs
Want faster response times
Don’t have powerful local hardware
Need enterprise support and SLAs

Troubleshooting

Ollama Not Running

# Start Ollama server
ollama serve

# Check if models are available
ollama list

Model Not Found

# Pull the model
ollama pull deepseek-r1:7b

Slow Performance

Use GPU: Ensure GPU acceleration is enabled
Reduce Concurrency: Set SEMAPHORE_LIMIT=1
Use Smaller Models: Try 7B instead of 70B
Quantization: Use quantized model variants

Out of Memory

Use Smaller Models: Switch to 7B or smaller
Increase Swap: Configure system swap space
Reduce Context: Lower max_tokens parameter

Get Started

Core Concepts

Guides

Integrations

Advanced

Ollama Integration

Installation

Prerequisites

Install Ollama

Pull Models

Configuration

Environment Variables

Basic Setup

Important Notes

Use OpenAIGenericClient

Ollama API Endpoint

Recommended Models

Language Models

Embedding Models

Model Selection Guide

Configuration Options

LLM Client

Embedder

Complete Example

Structured Output Limitations

Performance Optimization

Hardware Acceleration

Concurrency Control

Model Optimization

When to Use Ollama

Troubleshooting

Ollama Not Running

Model Not Found

Slow Performance

Out of Memory

Build docs developers (and LLMs) love

Get Started

Core Concepts

Guides

Integrations

Advanced

Documentation Index

​Installation

​Prerequisites

​Install Ollama

​Pull Models

​Configuration

​Environment Variables

​Basic Setup

​Important Notes

​Use OpenAIGenericClient

​Ollama API Endpoint

​Recommended Models

​Language Models

​Embedding Models

​Model Selection Guide

​Configuration Options

​LLM Client

​Embedder

​Complete Example

​Structured Output Limitations

​Performance Optimization

​Hardware Acceleration

​Concurrency Control

​Model Optimization

​When to Use Ollama

​Troubleshooting

​Ollama Not Running

​Model Not Found

​Slow Performance

​Out of Memory

​Related Resources

Build docs developers (and LLMs) love

Installation

Prerequisites

Install Ollama

Pull Models

Configuration

Environment Variables

Basic Setup

Important Notes

Use OpenAIGenericClient

Ollama API Endpoint

Recommended Models

Language Models

Embedding Models

Model Selection Guide

Configuration Options

LLM Client

Embedder

Complete Example

Structured Output Limitations

Performance Optimization

Hardware Acceleration

Concurrency Control

Model Optimization

When to Use Ollama

Troubleshooting

Ollama Not Running

Model Not Found

Slow Performance

Out of Memory

Related Resources