Overview
SmartEat AI uses Ollama to run large language models (LLMs) locally, providing private, fast AI inference without external API dependencies. The system is optimized for environments with limited GPU memory (8GB VRAM).
Why Ollama?
- Privacy: All data stays on your infrastructure
- Cost: No per-token API fees
- Speed: Low-latency inference for real-time chat
- Flexibility: Easy model switching and customization
Docker Configuration
Ollama runs as a containerized service with GPU support.
docker-compose.yml
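The compose file itself is not reproduced here; the fragment below is a minimal sketch matching the settings explained in the next section. The image tag, service name, and volume name are assumptions, not the project's actual values:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"                 # Ollama API endpoint
    volumes:
      - ollama_models:/root/.ollama   # persist downloaded models across restarts
    environment:
      - OLLAMA_CONTEXT_LENGTH=32768   # maximum context supported by the service
      - OLLAMA_NUM_PARALLEL=1         # prevents OOM with limited VRAM
      - OLLAMA_MAX_LOADED_MODELS=1    # keep one model resident, avoid reloading
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]     # NVIDIA GPU acceleration

volumes:
  ollama_models:
```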
Configuration Explanation
- Port 11434: Ollama API endpoint
- Volume: Persists downloaded models across container restarts
- OLLAMA_CONTEXT_LENGTH: Maximum tokens for context (32K)
- GPU Reservation: Enables NVIDIA GPU acceleration
- NUM_PARALLEL=1: Prevents OOM with limited VRAM
Model Setup
Step 1: Start the Ollama Service
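The exact commands are not reproduced here; a typical sequence, assuming the service is named `ollama` in docker-compose.yml:

```bash
docker compose up -d ollama
docker compose ps ollama   # confirm the container is running
```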
Step 2: Download the Model
Access the Ollama container and pull the model. Available models:
- Llama 3.1 (Recommended)
- Llama 3
- Phi-3
Llama 3.1 (8B) details:
- Size: ~4.7GB
- Parameters: 8B
- Context: Up to 128K tokens
- Best for: Balanced performance and accuracy
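With the service up, a typical pull command (the container name and model tag are assumptions; substitute the model you chose above):

```bash
docker exec -it ollama ollama pull llama3.1:8b
```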
Step 3: Verify Installation
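Two quick checks, assuming the default port and container name:

```bash
docker exec -it ollama ollama list     # the pulled model should appear here
curl http://localhost:11434/api/tags   # API-level check: lists installed models
```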
Backend Integration
Environment Variables
Add to your .env file (in the project root):
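The original variable list is missing here; the names below are assumptions consistent with the settings described elsewhere in this document, so align them with what backend/app/core/config_ollama.py actually reads:

```env
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_NUM_CTX=16384
MAX_CONTEXT_TOKENS=10000
```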
LangChain Configuration
File: backend/app/core/config_ollama.py
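The file's contents are not reproduced here. As a dependency-free sketch of what such a configuration module might look like (all names and defaults below are assumptions, not the actual file contents):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OllamaSettings:
    """Hypothetical settings holder mirroring the .env variables above."""
    base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    model: str = os.getenv("OLLAMA_MODEL", "llama3.1:8b")
    num_ctx: int = int(os.getenv("OLLAMA_NUM_CTX", "16384"))  # model-level context
    temperature: float = float(os.getenv("OLLAMA_TEMPERATURE", "0.7"))


settings = OllamaSettings()
```

In the real file these values would typically be handed to LangChain's ChatOllama (e.g. `ChatOllama(base_url=..., model=..., num_ctx=...)`); that call is assumed rather than shown to keep the sketch free of third-party imports.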
Vector Database: ChromaDB integration exists but is not active. Direct PostgreSQL queries are used instead due to filtering limitations.
GPU Support
NVIDIA GPU Setup
Install NVIDIA Drivers
Ensure you have NVIDIA drivers installed on your host; running nvidia-smi should display GPU information.
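Typical checks (the CUDA image tag is only an example):

```bash
nvidia-smi   # should display GPU information
# Verify the NVIDIA Container Toolkit can expose the GPU to containers:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```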
CPU-Only Mode
If you don’t have a GPU, modify docker-compose.yml:
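The change amounts to dropping the GPU reservation; a sketch (image tag assumed, the rest of the service definition stays as-is):

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    # deploy.resources.reservations.devices removed — Ollama falls back to CPU
```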
Performance Tuning
Context Window Optimization
The system uses a multi-layered approach to manage context:
- Ollama level: OLLAMA_CONTEXT_LENGTH=32768 (maximum supported by the service)
- Model level: num_ctx=16384 (actual context used per request)
- Agent level: MAX_CONTEXT_TOKENS=10000 (conversation history limit)
Memory Management
8GB VRAM Configuration (recommended): OLLAMA_CONTEXT_LENGTH=32768, OLLAMA_NUM_PARALLEL=1, OLLAMA_MAX_LOADED_MODELS=1.
Troubleshooting
Error: Connection refused (port 11434)
Cause: The Ollama service is not running.
Solution: start (or restart) the Ollama service and confirm the API responds.
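Typical recovery steps, assuming the service is named `ollama`:

```bash
docker compose ps ollama                  # is the container up?
docker compose up -d ollama               # start it if not
curl http://localhost:11434/api/version   # the API should now answer
```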
Error: Out of memory (OOM)
Cause: Model is too large for available VRAM.
Solutions:
- Use a smaller model (Phi-3)
- Reduce num_ctx in config
- Set OLLAMA_NUM_PARALLEL=1
- Close other GPU applications
Error: Model not found
Cause: Model not downloaded, or the model name in .env is wrong.
Solution: pull the model and verify that its name matches the one in .env.
Slow inference (>5 seconds per response)
Possible causes:
- Running on CPU instead of GPU
- Context window too large
- Model loading overhead
Fixes:
- Verify GPU is being used
- Reduce num_ctx to 8192 or 4096
- Keep OLLAMA_MAX_LOADED_MODELS=1 to avoid reloading
Usage in Agent
The agent uses Ollama for two purposes:
1. Conversational AI (Primary)
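The original code for this path is missing here. In the real backend the call goes through LangChain; as a standard-library-only sketch, the conversational path reduces to a request against Ollama's /api/chat endpoint (the payload helper is split out so it can be inspected without a running server):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local endpoint


def build_chat_payload(model: str, messages: list, num_ctx: int = 16384) -> dict:
    """Assemble the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": messages,   # already trimmed to MAX_CONTEXT_TOKENS upstream
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def ollama_chat(payload: dict, base_url: str = OLLAMA_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]


if __name__ == "__main__":
    body = build_chat_payload(
        "llama3.1:8b",
        [{"role": "user", "content": "Suggest a healthy snack"}],
    )
    print(ollama_chat(body))  # requires a running Ollama service
```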
2. Dietary Validation (Secondary)
Used in generate_weekly_plan.py to validate recipes against restrictions:
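The actual validation code is not reproduced here; as a sketch of the shape such a check might take (the prompt wording and YES/NO protocol are assumptions, not the real template in generate_weekly_plan.py):

```python
def build_validation_prompt(recipe: str, restrictions: list) -> str:
    """Hypothetical prompt asking the model for a strict YES/NO verdict."""
    return (
        "You are a dietary validator. Answer only YES or NO.\n"
        f"Restrictions: {', '.join(restrictions)}\n"
        f"Does this recipe comply with every restriction?\n{recipe}"
    )


def parse_validation_reply(reply: str) -> bool:
    """Treat anything other than a clear YES as a failed validation (fail closed)."""
    return reply.strip().upper().startswith("YES")
```

Failing closed matters here: if the model returns something unexpected, the recipe is rejected rather than served to a user with dietary restrictions.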
Monitoring
Real-time Metrics
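The metric commands are missing from this copy; typical ways to watch the service live (container name assumed):

```bash
docker stats ollama   # container CPU/RAM usage
nvidia-smi -l 1       # GPU utilization, refreshed every second
docker logs -f ollama # request logs and timings
```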
Performance Benchmarks
Typical performance on an 8GB GPU:
| Operation | Latency |
|---|---|
| Simple query | 200-500ms |
| Tool call | 500-1000ms |
| Complex reasoning | 1-2s |
| Model loading | 3-5s (first request) |
Advanced Configuration
Custom Model Parameters
Create a Modelfile for fine-tuned behavior:
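The original Modelfile is missing from this copy; a minimal example (the base model follows this document's recommendation, while the parameter values and system prompt are illustrative assumptions):

```
FROM llama3.1:8b
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
SYSTEM You are Smarty, SmartEat's nutrition assistant. Keep answers concise.
```

Build and use it with `ollama create smarty -f Modelfile`, then reference the new name (here `smarty`, a hypothetical choice) in your .env model setting.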
Multiple Model Support
Run different models for different tasks:
Production Considerations
Load Balancing
Run multiple Ollama instances behind a load balancer for high traffic
Model Caching
Keep models loaded in memory with OLLAMA_MAX_LOADED_MODELS=1
Monitoring
Track inference latency, GPU utilization, and error rates
Fallback
Implement graceful degradation if Ollama is unavailable
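One way to sketch that degradation, assuming any static fallback message your product deems acceptable (the wrapper and message below are illustrations, not the project's actual code):

```python
import logging

logger = logging.getLogger("smarteat.ollama")

# Assumption: a canned reply shown whenever inference is unavailable.
FALLBACK_REPLY = "The AI assistant is temporarily unavailable. Please try again shortly."


def chat_with_fallback(call, *args, **kwargs) -> str:
    """Run an Ollama call; on any failure, log it and return the canned reply."""
    try:
        return call(*args, **kwargs)
    except Exception:
        logger.exception("Ollama unavailable; serving fallback reply")
        return FALLBACK_REPLY
```

Wrapping every inference call this way means a down Ollama container degrades the chat feature instead of surfacing 500s to users.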
Resources
Ollama Documentation
Official Ollama setup and API reference
LangChain Ollama
LangChain integration guide
Model Library
Browse available models
NVIDIA Toolkit
GPU container setup
Related Documentation
- AI Agent - How Smarty uses Ollama
- Environment Variables - Backend configuration options
