By default, the `sa_ollama` container runs inference on the CPU. Enabling GPU acceleration dramatically reduces response times, typically from 30–120 seconds per response (CPU) to 2–8 seconds (GPU).
NVIDIA GPU (CUDA)
The `docker-compose.yml` file includes an NVIDIA GPU reservation block that is active by default:
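A typical reservation block follows the standard Compose GPU syntax; a sketch (service and image names assumed, check your actual `infrastructure/docker-compose.yml`):

```yaml
services:
  sa_ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all          # or an integer to reserve specific GPUs
              capabilities: [gpu]
```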
Install the NVIDIA Container Toolkit
Follow the official installation guide for your operating system.
Verify GPU is visible inside the container
After starting the stack, confirm the GPU is accessible to Ollama. You should see your GPU listed with its memory and utilisation stats.
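One way to check (assuming the container is named `sa_ollama` and the NVIDIA Container Toolkit is installed on the host) is to run `nvidia-smi` inside the container:

```shell
# Run nvidia-smi inside the running Ollama container;
# it prints the GPU model, VRAM usage and utilisation.
docker exec sa_ollama nvidia-smi
```

If the command fails with "nvidia-smi: not found" or reports no devices, the container does not have GPU access and the toolkit installation should be revisited.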
Apple Silicon (Metal)
Ollama has native Metal support for Apple M-series chips, so an Ollama build running directly on an Apple Silicon host uses the GPU automatically, with no extra configuration.

Docker Desktop on macOS, however, does not expose the GPU to Linux containers the way the NVIDIA toolkit does on Linux, so the `sa_ollama` container falls back to CPU inference there. For the best Apple Silicon performance, run Ollama natively on the host and point `OLLAMA_BASE_URL` in your `.env` to `http://host.docker.internal:11434`.

CPU-only fallback
If you have no GPU or do not want to use one, comment out the `deploy` block in `infrastructure/docker-compose.yml`:
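A sketch of the result, assuming the reservation block shown earlier (exact contents depend on your compose file):

```yaml
services:
  sa_ollama:
    image: ollama/ollama
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
```

With the block commented out, Compose no longer requests a GPU device and Ollama runs on the CPU.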
Memory and CPU configuration
The following `.env` variables control resource allocation for the stack's containers:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_MEMORY_LIMIT` | 2GB | Maximum RAM the container may use. Increase for larger models. |
| `OLLAMA_CPU_SHARES` | 1024 | Relative CPU weight. Higher values give Ollama more CPU time. |
| `CHROMADB_MEMORY_LIMIT` | 512MB | Memory limit for the ChromaDB container. |
| `API_MEMORY_LIMIT` | 512MB | Memory limit for the FastAPI container. |
Memory limit recommendations by hardware class
| Hardware class | OLLAMA_MEMORY_LIMIT | Suitable models |
|---|---|---|
| 8 GB RAM (laptop) | 2GB | llama3.2, phi4-mini |
| 16 GB RAM (workstation) | 4GB | llama3.2, qwen2.5-coder:7b |
| 32 GB RAM | 8GB | Any 7B model, some 13B models |
| 64 GB RAM+ | 16GB | 13B–34B models |
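For example, on a 16 GB workstation the relevant `.env` lines might read (values taken from the tables above; variable names as defined earlier):

```shell
# Resource limits for a 16 GB RAM workstation
OLLAMA_MEMORY_LIMIT=4GB
OLLAMA_CPU_SHARES=1024
CHROMADB_MEMORY_LIMIT=512MB
API_MEMORY_LIMIT=512MB
```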
Performance comparison
| Inference mode | Response time (7B model) | Memory usage | Setup complexity |
|---|---|---|---|
| NVIDIA GPU (CUDA) | 2–8 s | GPU VRAM | Medium (requires toolkit) |
| Apple Silicon (Metal, native) | 3–10 s | Unified memory | Low (automatic) |
| CPU only (modern desktop) | 30–90 s | RAM | None |
| CPU only (low-end laptop) | 90–180 s | RAM | None |