Ollama Overview
EduMate uses Ollama to run local language models for:
Text Embeddings: converting document chunks to vectors using qwen3-embedding:0.6b
Chat/Inference: generating questions (an alternative to the Gemini API) using llama3.2:1b
Ollama provides a simple API on port 11434 that’s compatible with OpenAI’s API format.
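Because of that compatibility, any OpenAI client can talk to Ollama by pointing base_url at the local server. A minimal sketch, assuming the openai Python package is installed and llama3.2:1b has been pulled (covered below):
# Minimal sketch: use the OpenAI client against Ollama's local endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)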
Installation
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Verify Installation
ollama --version
# Expected output: ollama version 0.x.x
Start Ollama Service
Start Ollama Server
Ollama runs as a background service on port 11434.
Linux (systemd)
# Ollama should start automatically after installation
# Check status
sudo systemctl status ollama
# Start if not running
sudo systemctl start ollama
# Enable auto-start on boot
sudo systemctl enable ollama
macOS
# The Ollama desktop app starts the server automatically while it is running.
Manual Start
# On any platform, run the server in the foreground:
ollama serve
Verify Service
Check if Ollama is accessible:
curl http://localhost:11434
Expected response:
Ollama is running
Ollama binds to localhost:11434 by default. The EduMate backend connects to http://localhost:11434.
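The same check works as a startup guard in code. A sketch (not part of the current codebase), assuming the requests package; Ollama answers GET / with the plain text "Ollama is running":
# Sketch: a health check the backend could run before accepting work.
import requests

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    try:
        resp = requests.get(base_url, timeout=2)
        return resp.ok and "Ollama is running" in resp.text
    except requests.ConnectionError:
        return False

if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())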
Download Required Models
EduMate requires specific models for embeddings and text generation.
Embedding Model: qwen3-embedding:0.6b
Pull qwen3-embedding Model
This model generates 384-dimensional embeddings for document chunks:
ollama pull qwen3-embedding:0.6b
Model size: ~600MB
Download time: 2-5 minutes (depending on connection)
Test Embedding Model
Verify the model works with the embeddings API:
curl http://localhost:11434/api/embeddings -d '{
"model": "qwen3-embedding:0.6b",
"prompt": "The quick brown fox jumps over the lazy dog"
}'
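To see the embeddings doing useful work, compare a few of them: related sentences should score noticeably higher cosine similarity than unrelated ones. A small sketch, assuming the requests package (the example sentences are arbitrary):
# Sketch: compare embeddings from /api/embeddings via cosine similarity.
import math
import requests

def embed(text: str) -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "qwen3-embedding:0.6b", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

fox = embed("The quick brown fox jumps over the lazy dog")
dog = embed("A fast auburn fox leaps above a sleepy dog")
tax = embed("Quarterly tax filings are due next month")
print(f"related:   {cosine(fox, dog):.3f}")   # expected: higher
print(f"unrelated: {cosine(fox, tax):.3f}")   # expected: lower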
Chat Model: llama3.2:1b
Pull llama3.2 Model
This model can be used for question generation (an alternative to Gemini):
ollama pull llama3.2:1b
Model size: ~1.3GB
RAM required: ~2GB during inference
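For a quick smoke test from Python rather than the CLI, a minimal streaming call against the /api/generate endpoint looks like this (a sketch, assuming the requests package):
# Sketch: stream tokens from llama3.2:1b; each streamed line is a JSON
# object carrying a "response" fragment.
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:1b", "prompt": "What is the capital of France?", "stream": True},
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("response", ""), end="", flush=True)
print()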
Optional: Gemini Integration
EduMate primarily uses Gemini API (gemini-2.5-flash-lite) for question generation because it produces better structured outputs. However, the code supports Ollama models as an alternative:
# Current configuration (Gemini API)
from openai import OpenAI

open_ai_client = OpenAI(
    api_key=GEMINI_API_KEY,  # loaded elsewhere in chat.py
    base_url="https://generativelanguage.googleapis.com/v1beta/openai",
)

# Alternative: Use Ollama (commented out)
# open_ai_client = OpenAI(
#     base_url="http://localhost:11434/v1",
#     api_key="ollama",
# )
To use Ollama for question generation instead of Gemini, uncomment the Ollama client configuration and comment out the Gemini client in backend/queue/chat.py.
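If you switch backends often, a configuration-driven selection avoids editing code each time. The sketch below is a hypothetical refactor; the USE_OLLAMA variable and MODEL_NAME constant are not in the current codebase:
# Hypothetical: pick the client from an environment variable.
import os
from openai import OpenAI

if os.getenv("USE_OLLAMA", "").lower() in ("1", "true", "yes"):
    open_ai_client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    )
    MODEL_NAME = "llama3.2:1b"
else:
    open_ai_client = OpenAI(
        api_key=os.environ["GEMINI_API_KEY"],
        base_url="https://generativelanguage.googleapis.com/v1beta/openai",
    )
    MODEL_NAME = "gemini-2.5-flash-lite"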
Model Configuration in EduMate
Embedding Configuration
The embedding model is used in both document chunking and retrieval:
backend/queue/doc_chunking.py
from langchain_ollama import OllamaEmbeddings

# Module-level instance used while chunking documents
embedding_model = OllamaEmbeddings(
    model='qwen3-embedding:0.6b',
    base_url='http://localhost:11434'
)

# Factory used at retrieval time; it must return the same model as above
def _embedding_model():
    return OllamaEmbeddings(
        model='qwen3-embedding:0.6b',
        base_url='http://localhost:11434',
    )
Critical: the same embedding model (qwen3-embedding:0.6b) must be used for both indexing and retrieval. Changing models will break semantic search!
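The sketch below illustrates why: chunks are embedded at indexing time with embed_documents() and queries at retrieval time with embed_query(), and the resulting vectors are only comparable if both calls go through the same model. The example texts are arbitrary:
# Sketch: indexing and retrieval must share one embedding space.
from langchain_ollama import OllamaEmbeddings

model = OllamaEmbeddings(model='qwen3-embedding:0.6b', base_url='http://localhost:11434')

# Indexing time: embed document chunks in bulk
chunk_vectors = model.embed_documents([
    "Photosynthesis converts light energy into chemical energy.",
    "The mitochondrion is the powerhouse of the cell.",
])

# Retrieval time: embed the user's query with the SAME model
query_vector = model.embed_query("How do plants make energy?")
assert len(query_vector) == len(chunk_vectors[0])  # same dimensionality, same space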
Chat Model Configuration
The chat model is used for question generation (when not using Gemini):
from ollama import Client
ollama_client = Client(
    host='http://localhost:11434'
)
# Example usage (currently commented out)
# response = ollama_client.chat(
# model='llama3.2:1b',
# messages=[
# {'role': 'system', 'content': SYSTEM_PROMPT},
# {'role': 'user', 'content': user_query}
# ]
# )
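For reference, a runnable version of that call might look like the following; the system prompt and options values here are illustrative, not EduMate's actual settings:
from ollama import Client

ollama_client = Client(host='http://localhost:11434')

response = ollama_client.chat(
    model='llama3.2:1b',
    messages=[
        {'role': 'system', 'content': 'You write one quiz question per request.'},
        {'role': 'user', 'content': 'Topic: photosynthesis'},
    ],
    options={'temperature': 0.7, 'num_ctx': 4096},  # per-request runtime options
)
print(response['message']['content'])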
List Installed Models
View all downloaded models:
ollama list
Expected output:
NAME ID SIZE MODIFIED
qwen3-embedding:0.6b abc123def456 600 MB 2 hours ago
llama3.2:1b def789ghi012 1.3 GB 2 hours ago
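The same information is available programmatically from the /api/tags endpoint, which is useful for automated checks. A sketch, assuming the requests package:
# Sketch: list installed models via the HTTP API instead of the CLI.
import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
for model in tags["models"]:
    print(f"{model['name']:30} {model['size'] / 1e9:.1f} GB")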
Model Management
Delete a Model
ollama rm <model-name>
Update a Model
ollama pull qwen3-embedding:0.6b
Show Model Details
ollama show qwen3-embedding:0.6b
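Putting these together, a setup script could verify that both required models are present and pull any that are missing. A sketch (not part of EduMate), assuming the requests and ollama Python packages:
# Sketch: ensure EduMate's models are installed, pulling any that are missing.
import requests
from ollama import Client

REQUIRED = ["qwen3-embedding:0.6b", "llama3.2:1b"]

installed = {
    m["name"]
    for m in requests.get("http://localhost:11434/api/tags", timeout=5).json()["models"]
}
client = Client(host="http://localhost:11434")
for name in REQUIRED:
    if name not in installed:
        print(f"Pulling {name} ...")
        client.pull(name)  # blocks until the download completes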
GPU Acceleration
If you have an NVIDIA GPU, Ollama automatically uses it for faster inference:
# Check if GPU is detected
nvidia-smi
# Ollama will automatically use CUDA if available;
# the service logs will show the detected GPU
CPU Optimization
For CPU-only systems:
# Set thread count (adjust based on your CPU cores)
export OLLAMA_NUM_THREADS=8
# Restart Ollama
sudo systemctl restart ollama
Memory Configuration
# Set context size (default: 2048)
export OLLAMA_NUM_CTX=4096
# Set batch size (default: 512)
export OLLAMA_NUM_BATCH=256
For the embedding model qwen3-embedding:0.6b, default settings are sufficient. It’s lightweight and fast even on CPU.
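These limits can also be set per request rather than server-wide: the API accepts num_ctx, num_thread, and num_batch in a request's options field. A sketch using the requests package:
# Sketch: override runtime limits for a single generation request.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "prompt": "Summarize photosynthesis in one sentence.",
        "stream": False,
        "options": {
            "num_ctx": 2048,   # context window for this request
            "num_thread": 8,   # CPU threads
            "num_batch": 256,  # batch size
        },
    },
    timeout=120,
)
print(resp.json()["response"])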
Testing Ollama Integration
Test Embedding API
curl http://localhost:11434/api/embeddings -d '{
"model": "qwen3-embedding:0.6b",
"prompt": "Test document chunk for embedding"
}'
Expected: JSON response with 384-dimensional vector
Test with Python
Create a test script:
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(
    model='qwen3-embedding:0.6b',
    base_url='http://localhost:11434'
)

vector = embeddings.embed_query("Test query")
print(f"Vector dimension: {len(vector)}")
print(f"First 5 values: {vector[:5]}")
Run the script. Expected output:
Vector dimension: 384
First 5 values: [0.123, -0.456, 0.789, ...]
Test Chat Model (Optional)
ollama run llama3.2:1b "What is the capital of France?"
Expected: Coherent response from the model
Troubleshooting
Port 11434 Already in Use
# Check what's using the port
sudo lsof -i :11434
# Stop existing Ollama
sudo systemctl stop ollama
# Or kill process
sudo pkill ollama
Model Not Found
# List installed models
ollama list
# Pull missing model
ollama pull qwen3-embedding:0.6b
Connection Refused
# Check if Ollama is running
sudo systemctl status ollama
# View logs
journalctl -u ollama -n 50 -f
# Restart Ollama
sudo systemctl restart ollama
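If the backend starts faster than the Ollama service at boot, a small retry loop avoids spurious connection-refused failures. A sketch of such a guard (not part of the current codebase), assuming the requests package:
# Sketch: wait for Ollama with exponential backoff before giving up.
import time
import requests

def wait_for_ollama(url: str = "http://localhost:11434", attempts: int = 5) -> None:
    for i in range(attempts):
        try:
            requests.get(url, timeout=2).raise_for_status()
            return
        except requests.RequestException:
            time.sleep(2 ** i)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Ollama not reachable at {url} after {attempts} attempts")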
Out of Memory
If you get OOM errors:
# Use smaller context window
export OLLAMA_NUM_CTX = 2048
# Reduce batch size
export OLLAMA_NUM_BATCH = 128
# Restart Ollama
sudo systemctl restart ollama
Slow Inference
# Check system resources
htop
# Monitor Ollama process
top -p $(pgrep ollama)
# For CPU systems, increase threads
export OLLAMA_NUM_THREADS=$(nproc)
Environment Variables
Common Ollama environment variables:
# Server configuration
export OLLAMA_HOST=0.0.0.0:11434   # Listen on all interfaces
export OLLAMA_ORIGINS="*"          # Allow CORS from all origins

# Performance tuning
export OLLAMA_NUM_THREADS=8        # CPU threads
export OLLAMA_NUM_CTX=4096         # Context window
export OLLAMA_NUM_BATCH=512        # Batch size

# GPU settings (if available)
export OLLAMA_GPU_LAYERS=33        # Layers to offload to GPU
Next Steps
With Ollama configured and models downloaded, proceed to deploy the backend.
Remember to obtain a Gemini API key from Google AI Studio for question generation, or modify the code to use Ollama’s llama3.2:1b instead.