Setting Environment Variables
Restart the Ollama server after changing environment variables for the changes to take effect.
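How you set a variable depends on how Ollama is run; a minimal sketch for a manually started server (the values shown are examples, not defaults):

```shell
# For a manually started server, export the variables, then start it.
export OLLAMA_HOST="0.0.0.0:11434"     # example bind address
export OLLAMA_KEEP_ALIVE="10m"         # example keep-alive duration

# Confirm the variables are visible to child processes like `ollama serve`:
env | grep '^OLLAMA_'

# ollama serve   # restart the server so the new values take effect

# If Ollama runs as a systemd service (Linux), set variables in the unit instead:
#   sudo systemctl edit ollama.service   -> add Environment="OLLAMA_HOST=0.0.0.0:11434"
#   sudo systemctl restart ollama
# On macOS, use `launchctl setenv OLLAMA_HOST "0.0.0.0:11434"` and restart the app.
```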
Server Configuration
OLLAMA_HOST
The IP address and port the Ollama server binds to and listens on.
OLLAMA_ORIGINS
Comma-separated list of origins allowed to make cross-origin (CORS) requests.
OLLAMA_MODELS
Path to the directory where models are stored.
OLLAMA_KEEP_ALIVE
How long models stay loaded in memory after the last request (e.g., “5m”, “1h”, “300s”). Use “0” to unload immediately or “-1” to keep a model loaded indefinitely.
OLLAMA_NUM_PARALLEL
Maximum number of parallel requests processed simultaneously.
OLLAMA_MAX_LOADED_MODELS
Maximum number of models loaded in memory simultaneously, per GPU (0 = unlimited).
OLLAMA_MAX_QUEUE
Maximum number of requests queued when the server is busy.
OLLAMA_LOAD_TIMEOUT
Timeout for model loading operations (e.g., “5m”, “300s”).
GPU Configuration
OLLAMA_GPU_OVERHEAD
Amount of VRAM to reserve per GPU, in bytes, to prevent memory exhaustion.
OLLAMA_SCHED_SPREAD
Schedule model layers across all available GPUs instead of filling one GPU at a time.
CUDA_VISIBLE_DEVICES
Comma-separated IDs or UUIDs of the NVIDIA GPUs visible to Ollama (Linux/Windows only).
ROCR_VISIBLE_DEVICES
Comma-separated IDs of the AMD GPUs visible to Ollama (Linux/Windows only).
HSA_OVERRIDE_GFX_VERSION
Override the AMD GPU architecture version to force a compatible LLVM target for otherwise unsupported GPUs (Linux only).
GGML_VK_VISIBLE_DEVICES
IDs of the Vulkan GPUs visible to Ollama (requires OLLAMA_VULKAN=1).
OLLAMA_VULKAN
Enable the experimental Vulkan GPU backend (Linux/Windows only).
Model Behavior
OLLAMA_CONTEXT_LENGTH
Default context length for models (default: 4k, 32k, or 256k depending on available VRAM).
OLLAMA_FLASH_ATTENTION
Enable the experimental flash attention optimization.
OLLAMA_KV_CACHE_TYPE
Quantization type for the key-value cache (e.g., “f16”, “q8_0”, “q4_0”).
OLLAMA_MULTIUSER_CACHE
Optimize prompt caching for multi-user scenarios.
Advanced Configuration
OLLAMA_DEBUG
Enable debug logging.
OLLAMA_LLM_LIBRARY
Path to a custom LLM library, bypassing auto-detection.
OLLAMA_NOPRUNE
Disable automatic pruning of unused model blobs on startup.
OLLAMA_NOHISTORY
Disable readline command history in the CLI.
OLLAMA_EDITOR
Path to the editor executable used for interactive prompt editing (Ctrl+G in the CLI).
OLLAMA_REMOTES
Comma-separated list of hosts allowed for remote model pulling.
OLLAMA_NO_CLOUD
Disable Ollama cloud features (remote inference and web search).
OLLAMA_NEW_ENGINE
Enable the new experimental Ollama engine.
Proxy Configuration
HTTP_PROXY / HTTPS_PROXY
Proxy servers used for HTTP/HTTPS model downloads.
NO_PROXY
Hosts to exclude from the proxy.
Examples
Production Server Configuration
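A minimal sketch of a production-style configuration; every value below is an illustrative assumption to tune for your hardware and traffic, not a recommended default:

```shell
# Illustrative production settings (all values are assumptions).
export OLLAMA_HOST="0.0.0.0:11434"               # listen on all interfaces
export OLLAMA_ORIGINS="https://app.example.com"  # restrict CORS to the frontend
export OLLAMA_KEEP_ALIVE="1h"                    # keep hot models loaded for an hour
export OLLAMA_NUM_PARALLEL="4"                   # up to 4 requests in parallel
export OLLAMA_MAX_LOADED_MODELS="2"              # at most 2 resident models per GPU
export OLLAMA_MAX_QUEUE="512"                    # queue depth before rejecting requests

# ollama serve   # start the server with these settings
```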
Development Configuration
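A sketch of a local development setup, assuming you want verbose logs, a loopback-only server, and short keep-alive times; the values are illustrative:

```shell
# Illustrative development settings (all values are assumptions).
export OLLAMA_DEBUG="1"                # verbose logs while debugging
export OLLAMA_HOST="127.0.0.1:11434"   # local-only access
export OLLAMA_KEEP_ALIVE="5m"          # free memory quickly between tests
export OLLAMA_NOPRUNE="1"              # keep partial blobs between restarts

# ollama serve
```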
Multi-GPU Setup
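A sketch for a machine with two NVIDIA GPUs, combining the GPU variables above; the GPU IDs and the 512 MiB overhead figure are assumptions:

```shell
# Illustrative two-GPU NVIDIA setup (IDs and overhead value are assumptions).
export CUDA_VISIBLE_DEVICES="0,1"       # expose GPUs 0 and 1 to Ollama
export OLLAMA_SCHED_SPREAD="1"          # spread model layers across both GPUs
export OLLAMA_GPU_OVERHEAD="536870912"  # reserve 512 MiB of VRAM per GPU

# ollama serve
```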
CPU-Only Mode
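One way to force CPU-only inference is to hide every GPU from Ollama by assigning an invalid device ID; a sketch, assuming NVIDIA and AMD hardware:

```shell
# Hide all GPUs so inference falls back to the CPU.
export CUDA_VISIBLE_DEVICES="-1"   # NVIDIA: "-1" is not a valid GPU ID
export ROCR_VISIBLE_DEVICES="-1"   # AMD: same trick for ROCm

# ollama serve
```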
Related
GPU Configuration
Detailed GPU setup and troubleshooting
Model Quantization
Optimize memory usage with quantization