Documentation Index
Fetch the complete documentation index at: https://mintlify.com/getzep/graphiti/llms.txt
Use this file to discover all available pages before exploring further.
Ollama enables running open-source LLMs locally for privacy-focused applications, offline deployment, and cost-free inference.
Installation
Ollama support is included in the base installation:
pip install graphiti-core
Prerequisites
Install Ollama
Download and install Ollama:
Start Ollama:
Pull Models
Download the models you’ll use:
# Pull LLM model
ollama pull deepseek-r1:7b
# Pull embedding model
ollama pull nomic-embed-text
Configuration
Environment Variables
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_LLM_MODEL=deepseek-r1:7b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
Basic Setup
Initialize Graphiti with Ollama:
import os
from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig
from graphiti_core.cross_encoder.openai_reranker_client import OpenAIRerankerClient
# Configure Ollama LLM client
llm_config = LLMConfig(
api_key="ollama", # Placeholder (required but not used)
model=os.getenv("OLLAMA_LLM_MODEL", "deepseek-r1:7b"),
small_model=os.getenv("OLLAMA_LLM_MODEL", "deepseek-r1:7b"),
base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
)
llm_client = OpenAIGenericClient(config=llm_config)
# Configure Ollama embedder
embedder = OpenAIEmbedder(
config=OpenAIEmbedderConfig(
api_key="ollama", # Placeholder
embedding_model=os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
embedding_dim=768, # nomic-embed-text dimension
base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
)
)
# Configure cross-encoder (reranker)
cross_encoder = OpenAIRerankerClient(
client=llm_client,
config=llm_config
)
# Initialize Graphiti
graphiti = Graphiti(
"bolt://localhost:7687",
"neo4j",
"password",
llm_client=llm_client,
embedder=embedder,
cross_encoder=cross_encoder
)
Important Notes
Use OpenAIGenericClient
Always use OpenAIGenericClient for Ollama, not OpenAIClient:
# ✓ Correct
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
llm_client = OpenAIGenericClient(config=llm_config)
# ✗ Wrong
from graphiti_core.llm_client.openai_client import OpenAIClient
llm_client = OpenAIClient(config=llm_config) # May have issues with local models
Why OpenAIGenericClient?
- Higher default max tokens (16K vs 8K)
- Better compatibility with local models
- Full structured output support
- Optimized for OpenAI-compatible APIs
Ollama API Endpoint
Ollama provides an OpenAI-compatible API at:
http://localhost:11434/v1
This endpoint implements the OpenAI API format, enabling compatibility with OpenAI client libraries.
Recommended Models
Language Models
- deepseek-r1:7b (recommended): Fast reasoning model, 7B parameters
- qwen2.5:7b: Strong general-purpose model
- llama3.3:70b: High quality, requires more resources
- gemma2:9b: Efficient Google model
- mistral:7b: Fast and capable
Embedding Models
- nomic-embed-text (recommended): 768 dimensions, excellent quality
- mxbai-embed-large: 1024 dimensions, high quality
- all-minilm: 384 dimensions, lightweight
Model Selection Guide
| Model | Size | RAM Needed | Speed | Quality |
|---|
| deepseek-r1:7b | 4.7GB | 8GB | Fast | Good |
| qwen2.5:7b | 4.7GB | 8GB | Fast | Good |
| llama3.3:70b | 40GB | 64GB | Slow | Excellent |
| gemma2:9b | 5.5GB | 10GB | Medium | Good |
Configuration Options
LLM Client
| Parameter | Type | Default | Description |
|---|
api_key | str | "ollama" | Placeholder (required but unused) |
model | str | Required | Ollama model name |
small_model | str | Same as model | Model for simpler tasks |
base_url | str | "http://localhost:11434/v1" | Ollama API endpoint |
temperature | float | 0.7 | Sampling temperature |
max_tokens | int | 16384 | Maximum output tokens |
Embedder
| Parameter | Type | Default | Description |
|---|
api_key | str | "ollama" | Placeholder (required but unused) |
embedding_model | str | Required | Ollama embedding model |
embedding_dim | int | Model-specific | Output dimensions |
base_url | str | "http://localhost:11434/v1" | Ollama API endpoint |
Complete Example
import asyncio
import os
from datetime import datetime, timezone
from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig
from graphiti_core.cross_encoder.openai_reranker_client import OpenAIRerankerClient
from graphiti_core.nodes import EpisodeType
async def main():
# Configure Ollama LLM
llm_config = LLMConfig(
api_key="ollama",
model="deepseek-r1:7b",
small_model="deepseek-r1:7b",
base_url="http://localhost:11434/v1"
)
llm_client = OpenAIGenericClient(config=llm_config)
# Configure Ollama embedder
embedder = OpenAIEmbedder(
config=OpenAIEmbedderConfig(
api_key="ollama",
embedding_model="nomic-embed-text",
embedding_dim=768,
base_url="http://localhost:11434/v1"
)
)
# Configure cross-encoder
cross_encoder = OpenAIRerankerClient(
client=llm_client,
config=llm_config
)
# Initialize Graphiti
graphiti = Graphiti(
"bolt://localhost:7687",
"neo4j",
"password",
llm_client=llm_client,
embedder=embedder,
cross_encoder=cross_encoder
)
try:
# Add an episode
await graphiti.add_episode(
name="Local AI Test",
episode_body="Ollama enables running LLMs locally for privacy and offline use.",
source=EpisodeType.text,
reference_time=datetime.now(timezone.utc)
)
print("Added episode using local Ollama model")
# Search the graph
results = await graphiti.search("What are the benefits of local LLMs?")
for result in results:
print(f"Fact: {result.fact}")
finally:
await graphiti.close()
if __name__ == "__main__":
asyncio.run(main())
Structured Output Limitations
Local models may have challenges with structured outputs:
Best Practices:
- Use larger models (7B+) for better structured output adherence
- Enable JSON mode in Ollama modelfile if available
- Monitor extraction quality and adjust prompts if needed
- Consider using quantized versions for faster inference
Hardware Acceleration
GPU Support:
# Ollama automatically detects and uses GPU if available
# NVIDIA GPU: CUDA support
# Apple Silicon: Metal support
# AMD GPU: ROCm support
Concurrency Control
Local models are slower than cloud APIs. Reduce concurrency:
SEMAPHORE_LIMIT=2 # Lower concurrency for local models
Model Optimization
- Use Quantized Models: Faster inference, lower memory
- Tune Context Length: Balance quality vs speed
- Batch Requests: Process multiple items together
When to Use Ollama
Choose Ollama if you:
- Need complete data privacy (no external API calls)
- Want offline operation
- Prefer zero API costs
- Have capable local hardware (GPU recommended)
- Need air-gapped deployment
Choose Cloud APIs if you:
- Need the highest quality outputs
- Want faster response times
- Don’t have powerful local hardware
- Need enterprise support and SLAs
Troubleshooting
Ollama Not Running
# Start Ollama server
ollama serve
# Check if models are available
ollama list
Model Not Found
# Pull the model
ollama pull deepseek-r1:7b
- Use GPU: Ensure GPU acceleration is enabled
- Reduce Concurrency: Set
SEMAPHORE_LIMIT=1
- Use Smaller Models: Try 7B instead of 70B
- Quantization: Use quantized model variants
Out of Memory
- Use Smaller Models: Switch to 7B or smaller
- Increase Swap: Configure system swap space
- Reduce Context: Lower max_tokens parameter