Ollama Overview
EduMate uses Ollama to run local language models for:
Text Embeddings: converting document chunks to vectors using qwen3-embedding:0.6b
Chat/Inference: generating questions (an alternative to the Gemini API) using llama3.2:1b
Ollama provides a simple API on port 11434 that’s compatible with OpenAI’s API format.
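Because of that compatibility, any OpenAI client can talk to Ollama by pointing base_url at the local server. A minimal sketch, assuming the openai Python package is installed and llama3.2:1b has been pulled (covered below):
# Minimal sketch: use the OpenAI client against Ollama's local endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)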
Installation
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Verify Installation
ollama --version
# Expected output: ollama version 0.x.x
Start Ollama Service
Start Ollama Server
Ollama runs as a background service on port 11434.
Linux (systemd)
# Ollama should start automatically after installation
# Check status
sudo systemctl status ollama
# Start if not running
sudo systemctl start ollama
# Enable auto-start on boot
sudo systemctl enable ollama
macOS
# The Ollama desktop app starts the server automatically while it is running.
Manual Start
# On any platform, run the server in the foreground:
ollama serve
Verify Service
Check if Ollama is accessible:
curl http://localhost:11434
Expected response:
Ollama is running
Ollama binds to localhost:11434 by default. The EduMate backend connects to http://localhost:11434.
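The same check works as a startup guard in code. A sketch (not part of the current codebase), assuming the requests package; Ollama answers GET / with the plain text "Ollama is running":
# Sketch: a health check the backend could run before accepting work.
import requests

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    try:
        resp = requests.get(base_url, timeout=2)
        return resp.ok and "Ollama is running" in resp.text
    except requests.ConnectionError:
        return False

if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())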
Download Required Models
EduMate requires specific models for embeddings and text generation.
Embedding Model: qwen3-embedding:0.6b
Pull qwen3-embedding Model
This model generates 384-dimensional embeddings for document chunks:
ollama pull qwen3-embedding:0.6b
Model size: ~600MB
Download time: 2-5 minutes (depending on connection)
Test Embedding Model
Verify the model works with the embeddings API:
curl http://localhost:11434/api/embeddings -d '{
"model": "qwen3-embedding:0.6b",
"prompt": "The quick brown fox jumps over the lazy dog"
}'
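To see the embeddings doing useful work, compare a few of them: related sentences should score noticeably higher cosine similarity than unrelated ones. A small sketch, assuming the requests package (the example sentences are arbitrary):
# Sketch: compare embeddings from /api/embeddings via cosine similarity.
import math
import requests

def embed(text: str) -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "qwen3-embedding:0.6b", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

fox = embed("The quick brown fox jumps over the lazy dog")
dog = embed("A fast auburn fox leaps above a sleepy dog")
tax = embed("Quarterly tax filings are due next month")
print(f"related:   {cosine(fox, dog):.3f}")   # expected: higher
print(f"unrelated: {cosine(fox, tax):.3f}")   # expected: lower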
Chat Model: llama3.2:1b
Pull llama3.2 Model
This model can be used for question generation (an alternative to Gemini):
ollama pull llama3.2:1b
Model size: ~1.3GB
RAM required: ~2GB during inference
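For a quick smoke test from Python rather than the CLI, a minimal streaming call against the /api/generate endpoint looks like this (a sketch, assuming the requests package):
# Sketch: stream tokens from llama3.2:1b; each streamed line is a JSON
# object carrying a "response" fragment.
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:1b", "prompt": "What is the capital of France?", "stream": True},
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("response", ""), end="", flush=True)
print()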
Optional: Gemini Integration
EduMate primarily uses Gemini API (gemini-2.5-flash-lite) for question generation because it produces better structured outputs. However, the code supports Ollama models as an alternative:
# Current configuration (Gemini API)
from openai import OpenAI

open_ai_client = OpenAI(
    api_key=GEMINI_API_KEY,  # loaded elsewhere in chat.py
    base_url="https://generativelanguage.googleapis.com/v1beta/openai",
)

# Alternative: Use Ollama (commented out)
# open_ai_client = OpenAI(
#     base_url="http://localhost:11434/v1",
#     api_key="ollama",
# )
To use Ollama for question generation instead of Gemini, uncomment the Ollama client configuration and comment out the Gemini client in backend/queue/chat.py.
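If you switch backends often, a configuration-driven selection avoids editing code each time. The sketch below is a hypothetical refactor; the USE_OLLAMA variable and MODEL_NAME constant are not in the current codebase:
# Hypothetical: pick the client from an environment variable.
import os
from openai import OpenAI

if os.getenv("USE_OLLAMA", "").lower() in ("1", "true", "yes"):
    open_ai_client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    )
    MODEL_NAME = "llama3.2:1b"
else:
    open_ai_client = OpenAI(
        api_key=os.environ["GEMINI_API_KEY"],
        base_url="https://generativelanguage.googleapis.com/v1beta/openai",
    )
    MODEL_NAME = "gemini-2.5-flash-lite"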
Model Configuration in EduMate
Embedding Configuration
The embedding model is used in both document chunking and retrieval:
backend/queue/doc_chunking.py
from langchain_ollama import OllamaEmbeddings

# Module-level instance used while chunking documents
embedding_model = OllamaEmbeddings(
    model='qwen3-embedding:0.6b',
    base_url='http://localhost:11434'
)

# Factory used at retrieval time; it must return the same model as above
def _embedding_model():
    return OllamaEmbeddings(
        model='qwen3-embedding:0.6b',
        base_url='http://localhost:11434',
    )
Critical: the same embedding model (qwen3-embedding:0.6b) must be used for both indexing and retrieval. Changing models will break semantic search!
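The sketch below illustrates why: chunks are embedded at indexing time with embed_documents() and queries at retrieval time with embed_query(), and the resulting vectors are only comparable if both calls go through the same model. The example texts are arbitrary:
# Sketch: indexing and retrieval must share one embedding space.
from langchain_ollama import OllamaEmbeddings

model = OllamaEmbeddings(model='qwen3-embedding:0.6b', base_url='http://localhost:11434')

# Indexing time: embed document chunks in bulk
chunk_vectors = model.embed_documents([
    "Photosynthesis converts light energy into chemical energy.",
    "The mitochondrion is the powerhouse of the cell.",
])

# Retrieval time: embed the user's query with the SAME model
query_vector = model.embed_query("How do plants make energy?")
assert len(query_vector) == len(chunk_vectors[0])  # same dimensionality, same space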
Chat Model Configuration
The chat model is used for question generation (when not using Gemini):
from ollama import Client
ollama_client = Client(
    host='http://localhost:11434'
)
# Example usage (currently commented out)
# response = ollama_client.chat(
# model='llama3.2:1b',
# messages=[
# {'role': 'system', 'content': SYSTEM_PROMPT},
# {'role': 'user', 'content': user_query}
# ]
# )
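For reference, a runnable version of that call might look like the following; the system prompt and options values here are illustrative, not EduMate's actual settings:
from ollama import Client

ollama_client = Client(host='http://localhost:11434')

response = ollama_client.chat(
    model='llama3.2:1b',
    messages=[
        {'role': 'system', 'content': 'You write one quiz question per request.'},
        {'role': 'user', 'content': 'Topic: photosynthesis'},
    ],
    options={'temperature': 0.7, 'num_ctx': 4096},  # per-request runtime options
)
print(response['message']['content'])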
List Installed Models
View all downloaded models:
ollama list
Expected output:
NAME ID SIZE MODIFIED
qwen3-embedding:0.6b abc123def456 600 MB 2 hours ago
llama3.2:1b def789ghi012 1.3 GB 2 hours ago
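The same information is available programmatically from the /api/tags endpoint, which is useful for automated checks. A sketch, assuming the requests package:
# Sketch: list installed models via the HTTP API instead of the CLI.
import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
for model in tags["models"]:
    print(f"{model['name']:30} {model['size'] / 1e9:.1f} GB")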
Model Management
Delete a Model
ollama rm <model-name>
Update a Model
ollama pull qwen3-embedding:0.6b
Show Model Details
ollama show qwen3-embedding:0.6b
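Putting these together, a setup script could verify that both required models are present and pull any that are missing. A sketch (not part of EduMate), assuming the requests and ollama Python packages:
# Sketch: ensure EduMate's models are installed, pulling any that are missing.
import requests
from ollama import Client

REQUIRED = ["qwen3-embedding:0.6b", "llama3.2:1b"]

installed = {
    m["name"]
    for m in requests.get("http://localhost:11434/api/tags", timeout=5).json()["models"]
}
client = Client(host="http://localhost:11434")
for name in REQUIRED:
    if name not in installed:
        print(f"Pulling {name} ...")
        client.pull(name)  # blocks until the download completes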
GPU Acceleration
If you have an NVIDIA GPU, Ollama automatically uses it for faster inference:
# Check if GPU is detected
nvidia-smi
# Ollama will automatically use CUDA if available;
# the service logs will show the detected GPU
CPU Optimization
For CPU-only systems:
# Set thread count (adjust based on your CPU cores)
export OLLAMA_NUM_THREADS=8
# Restart Ollama
sudo systemctl restart ollama
Memory Configuration
# Set context size (default: 2048)
export OLLAMA_NUM_CTX=4096
# Set batch size (default: 512)
export OLLAMA_NUM_BATCH=256
For the embedding model qwen3-embedding:0.6b, default settings are sufficient. It’s lightweight and fast even on CPU.
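These limits can also be set per request rather than server-wide: the API accepts num_ctx, num_thread, and num_batch in a request's options field. A sketch using the requests package:
# Sketch: override runtime limits for a single generation request.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "prompt": "Summarize photosynthesis in one sentence.",
        "stream": False,
        "options": {
            "num_ctx": 2048,   # context window for this request
            "num_thread": 8,   # CPU threads
            "num_batch": 256,  # batch size
        },
    },
    timeout=120,
)
print(resp.json()["response"])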
Testing Ollama Integration
Test Embedding API
curl http://localhost:11434/api/embeddings -d '{
"model": "qwen3-embedding:0.6b",
"prompt": "Test document chunk for embedding"
}'
Expected: JSON response with 384-dimensional vector
Test with Python
Create a test script:
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(
    model='qwen3-embedding:0.6b',
    base_url='http://localhost:11434'
)

vector = embeddings.embed_query("Test query")
print(f"Vector dimension: {len(vector)}")
print(f"First 5 values: {vector[:5]}")
Run the script. Expected output:
Vector dimension: 384
First 5 values: [0.123, -0.456, 0.789, ...]
Test Chat Model (Optional)
ollama run llama3.2:1b "What is the capital of France?"
Expected: Coherent response from the model
Troubleshooting
Port 11434 Already in Use
# Check what's using the port
sudo lsof -i :11434
# Stop existing Ollama
sudo systemctl stop ollama
# Or kill process
sudo pkill ollama
Model Not Found
# List installed models
ollama list
# Pull missing model
ollama pull qwen3-embedding:0.6b
Connection Refused
# Check if Ollama is running
sudo systemctl status ollama
# View logs
journalctl -u ollama -n 50 -f
# Restart Ollama
sudo systemctl restart ollama
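If the backend starts faster than the Ollama service at boot, a small retry loop avoids spurious connection-refused failures. A sketch of such a guard (not part of the current codebase), assuming the requests package:
# Sketch: wait for Ollama with exponential backoff before giving up.
import time
import requests

def wait_for_ollama(url: str = "http://localhost:11434", attempts: int = 5) -> None:
    for i in range(attempts):
        try:
            requests.get(url, timeout=2).raise_for_status()
            return
        except requests.RequestException:
            time.sleep(2 ** i)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Ollama not reachable at {url} after {attempts} attempts")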
Out of Memory
If you get OOM errors:
# Use smaller context window
export OLLAMA_NUM_CTX = 2048
# Reduce batch size
export OLLAMA_NUM_BATCH = 128
# Restart Ollama
sudo systemctl restart ollama
Slow Inference
# Check system resources
htop
# Monitor Ollama process
top -p $(pgrep ollama)
# For CPU systems, increase threads
export OLLAMA_NUM_THREADS=$(nproc)
Environment Variables
Common Ollama environment variables:
# Server configuration
export OLLAMA_HOST=0.0.0.0:11434   # Listen on all interfaces
export OLLAMA_ORIGINS="*"          # Allow CORS from all origins

# Performance tuning
export OLLAMA_NUM_THREADS=8        # CPU threads
export OLLAMA_NUM_CTX=4096         # Context window
export OLLAMA_NUM_BATCH=512        # Batch size

# GPU settings (if available)
export OLLAMA_GPU_LAYERS=33        # Layers to offload to GPU
Next Steps
With Ollama configured and models downloaded, proceed to deploy the backend.
Remember to obtain a Gemini API key from Google AI Studio for question generation, or modify the code to use Ollama’s llama3.2:1b instead.