Overview
SmartEat AI uses Ollama to run large language models (LLMs) locally, providing private, fast AI inference without external API dependencies. The system is optimized for environments with limited GPU memory (8GB VRAM).
Why Ollama?
- Privacy: All data stays on your infrastructure
- Cost: No per-token API fees
- Speed: Low-latency inference for real-time chat
- Flexibility: Easy model switching and customization
Docker Configuration
Ollama runs as a containerized service with GPU support.
docker-compose.yml
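The compose file itself is not reproduced here; the fragment below is a minimal sketch matching the settings explained in the next section. The image tag, service name, and volume name are assumptions, not the project's actual values:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"                 # Ollama API endpoint
    volumes:
      - ollama_models:/root/.ollama   # persist downloaded models across restarts
    environment:
      - OLLAMA_CONTEXT_LENGTH=32768   # maximum context supported by the service
      - OLLAMA_NUM_PARALLEL=1         # prevents OOM with limited VRAM
      - OLLAMA_MAX_LOADED_MODELS=1    # keep one model resident, avoid reloading
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]     # NVIDIA GPU acceleration

volumes:
  ollama_models:
```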
Configuration Explanation
- Port 11434: Ollama API endpoint
- Volume: Persists downloaded models across container restarts
- OLLAMA_CONTEXT_LENGTH: Maximum tokens for context (32K)
- GPU Reservation: Enables NVIDIA GPU acceleration
- NUM_PARALLEL=1: Prevents OOM with limited VRAM
Model Setup
Step 1: Start the Ollama Service
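The exact commands are not reproduced here; a typical sequence, assuming the service is named `ollama` in docker-compose.yml:

```bash
docker compose up -d ollama
docker compose ps ollama   # confirm the container is running
```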
Step 2: Download the Model
Access the Ollama container and pull the model. Available models:
- Llama 3.1 (Recommended)
- Llama 3
- Phi-3
Llama 3.1 (8B) details:
- Size: ~4.7GB
- Parameters: 8B
- Context: Up to 128K tokens
- Best for: Balanced performance and accuracy
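With the service up, a typical pull command (the container name and model tag are assumptions; substitute the model you chose above):

```bash
docker exec -it ollama ollama pull llama3.1:8b
```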
Step 3: Verify Installation
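Two quick checks, assuming the default port and container name:

```bash
docker exec -it ollama ollama list     # the pulled model should appear here
curl http://localhost:11434/api/tags   # API-level check: lists installed models
```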
Backend Integration
Environment Variables
Add to your .env file (in the project root):
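The original variable list is missing here; the names below are assumptions consistent with the settings described elsewhere in this document, so align them with what backend/app/core/config_ollama.py actually reads:

```env
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_NUM_CTX=16384
MAX_CONTEXT_TOKENS=10000
```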
LangChain Configuration
File: backend/app/core/config_ollama.py
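The file's contents are not reproduced here. As a dependency-free sketch of what such a configuration module might look like (all names and defaults below are assumptions, not the actual file contents):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OllamaSettings:
    """Hypothetical settings holder mirroring the .env variables above."""
    base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    model: str = os.getenv("OLLAMA_MODEL", "llama3.1:8b")
    num_ctx: int = int(os.getenv("OLLAMA_NUM_CTX", "16384"))  # model-level context
    temperature: float = float(os.getenv("OLLAMA_TEMPERATURE", "0.7"))


settings = OllamaSettings()
```

In the real file these values would typically be handed to LangChain's ChatOllama (e.g. `ChatOllama(base_url=..., model=..., num_ctx=...)`); that call is assumed rather than shown to keep the sketch free of third-party imports.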
Vector Database: ChromaDB integration exists but is not active. Direct PostgreSQL queries are used instead due to filtering limitations.
GPU Support
NVIDIA GPU Setup
Install NVIDIA Drivers
Ensure you have NVIDIA drivers installed on your host; running nvidia-smi should display GPU information.
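Typical checks (the CUDA image tag is only an example):

```bash
nvidia-smi   # should display GPU information
# Verify the NVIDIA Container Toolkit can expose the GPU to containers:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```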
CPU-Only Mode
If you don’t have a GPU, modify docker-compose.yml:
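The change amounts to dropping the GPU reservation; a sketch (image tag assumed, the rest of the service definition stays as-is):

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    # deploy.resources.reservations.devices removed — Ollama falls back to CPU
```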
Performance Tuning
Context Window Optimization
The system uses a multi-layered approach to manage context:
- Ollama level: OLLAMA_CONTEXT_LENGTH=32768 (maximum supported by the service)
- Model level: num_ctx=16384 (actual context used per request)
- Agent level: MAX_CONTEXT_TOKENS=10000 (conversation history limit)
Memory Management
8GB VRAM Configuration (recommended): OLLAMA_CONTEXT_LENGTH=32768, OLLAMA_NUM_PARALLEL=1, OLLAMA_MAX_LOADED_MODELS=1.
Troubleshooting
Error: Connection refused (port 11434)
Cause: The Ollama service is not running.
Solution: start (or restart) the Ollama service and confirm the API responds.
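Typical recovery steps, assuming the service is named `ollama`:

```bash
docker compose ps ollama                  # is the container up?
docker compose up -d ollama               # start it if not
curl http://localhost:11434/api/version   # the API should now answer
```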
Error: Out of memory (OOM)
Cause: Model is too large for available VRAM.
Solutions:
- Use a smaller model (Phi-3)
- Reduce num_ctx in config
- Set OLLAMA_NUM_PARALLEL=1
- Close other GPU applications
Error: Model not found
Cause: Model not downloaded, or the model name in .env is wrong.
Solution: pull the model and verify that its name matches the one in .env.
Slow inference (>5 seconds per response)
Possible causes:
- Running on CPU instead of GPU
- Context window too large
- Model loading overhead
Fixes:
- Verify GPU is being used
- Reduce num_ctx to 8192 or 4096
- Keep OLLAMA_MAX_LOADED_MODELS=1 to avoid reloading
Usage in Agent
The agent uses Ollama for two purposes:
1. Conversational AI (Primary)
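The original code for this path is missing here. In the real backend the call goes through LangChain; as a standard-library-only sketch, the conversational path reduces to a request against Ollama's /api/chat endpoint (the payload helper is split out so it can be inspected without a running server):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local endpoint


def build_chat_payload(model: str, messages: list, num_ctx: int = 16384) -> dict:
    """Assemble the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": messages,   # already trimmed to MAX_CONTEXT_TOKENS upstream
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def ollama_chat(payload: dict, base_url: str = OLLAMA_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]


if __name__ == "__main__":
    body = build_chat_payload(
        "llama3.1:8b",
        [{"role": "user", "content": "Suggest a healthy snack"}],
    )
    print(ollama_chat(body))  # requires a running Ollama service
```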
2. Dietary Validation (Secondary)
Used in generate_weekly_plan.py to validate recipes against restrictions:
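The actual validation code is not reproduced here; as a sketch of the shape such a check might take (the prompt wording and YES/NO protocol are assumptions, not the real template in generate_weekly_plan.py):

```python
def build_validation_prompt(recipe: str, restrictions: list) -> str:
    """Hypothetical prompt asking the model for a strict YES/NO verdict."""
    return (
        "You are a dietary validator. Answer only YES or NO.\n"
        f"Restrictions: {', '.join(restrictions)}\n"
        f"Does this recipe comply with every restriction?\n{recipe}"
    )


def parse_validation_reply(reply: str) -> bool:
    """Treat anything other than a clear YES as a failed validation (fail closed)."""
    return reply.strip().upper().startswith("YES")
```

Failing closed matters here: if the model returns something unexpected, the recipe is rejected rather than served to a user with dietary restrictions.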
Monitoring
Real-time Metrics
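The metric commands are missing from this copy; typical ways to watch the service live (container name assumed):

```bash
docker stats ollama   # container CPU/RAM usage
nvidia-smi -l 1       # GPU utilization, refreshed every second
docker logs -f ollama # request logs and timings
```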
Performance Benchmarks
Typical performance on an 8GB GPU:
| Operation | Latency |
|---|---|
| Simple query | 200-500ms |
| Tool call | 500-1000ms |
| Complex reasoning | 1-2s |
| Model loading | 3-5s (first request) |
Advanced Configuration
Custom Model Parameters
Create a Modelfile for fine-tuned behavior:
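The original Modelfile is missing from this copy; a minimal example (the base model follows this document's recommendation, while the parameter values and system prompt are illustrative assumptions):

```
FROM llama3.1:8b
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
SYSTEM You are Smarty, SmartEat's nutrition assistant. Keep answers concise.
```

Build and use it with `ollama create smarty -f Modelfile`, then reference the new name (here `smarty`, a hypothetical choice) in your .env model setting.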
Multiple Model Support
Run different models for different tasks:
Production Considerations
Load Balancing
Run multiple Ollama instances behind a load balancer for high traffic
Model Caching
Keep models loaded in memory with OLLAMA_MAX_LOADED_MODELS=1
Monitoring
Track inference latency, GPU utilization, and error rates
Fallback
Implement graceful degradation if Ollama is unavailable
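One way to sketch that degradation, assuming any static fallback message your product deems acceptable (the wrapper and message below are illustrations, not the project's actual code):

```python
import logging

logger = logging.getLogger("smarteat.ollama")

# Assumption: a canned reply shown whenever inference is unavailable.
FALLBACK_REPLY = "The AI assistant is temporarily unavailable. Please try again shortly."


def chat_with_fallback(call, *args, **kwargs) -> str:
    """Run an Ollama call; on any failure, log it and return the canned reply."""
    try:
        return call(*args, **kwargs)
    except Exception:
        logger.exception("Ollama unavailable; serving fallback reply")
        return FALLBACK_REPLY
```

Wrapping every inference call this way means a down Ollama container degrades the chat feature instead of surfacing 500s to users.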
Resources
Ollama Documentation
Official Ollama setup and API reference
LangChain Ollama
LangChain integration guide
Model Library
Browse available models
NVIDIA Toolkit
GPU container setup
Related Documentation
- AI Agent - How Smarty uses Ollama
- Environment Variables - Backend configuration options
