Ollama’s behavior can be customized using environment variables. Set these before starting the Ollama server.

Setting Environment Variables

# Set for current session
export OLLAMA_HOST=0.0.0.0:11434

# Set permanently (add to ~/.bashrc or ~/.zshrc)
echo 'export OLLAMA_HOST=0.0.0.0:11434' >> ~/.bashrc

# Set for systemd service
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the Ollama server after changing environment variables for the changes to take effect.
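On systemd-based Linux installs, a typical way to apply the override above is to reload units and restart the service (the commands below assume the standard `ollama` service name):

```shell
# Pick up the edited override and restart the server
sudo systemctl daemon-reload
sudo systemctl restart ollama
```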

Server Configuration

OLLAMA_HOST

The IP address and port the Ollama server listens on.
Type: string. Default: "127.0.0.1:11434".
# Listen on all interfaces
export OLLAMA_HOST=0.0.0.0:11434

# Custom port
export OLLAMA_HOST=127.0.0.1:8080

# HTTPS with custom port
export OLLAMA_HOST=https://localhost:443
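One quick way to confirm the server is reachable on the configured address is the version endpoint, shown here against the default bind address:

```shell
# Should return a JSON object such as {"version":"..."}
curl http://127.0.0.1:11434/api/version
```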

OLLAMA_ORIGINS

Comma-separated list of allowed origins for CORS.
Type: string. Default: "localhost,127.0.0.1,0.0.0.0".
export OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com"
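To sanity-check a CORS setting, one option is to simulate a browser preflight request; the exact response headers vary by Ollama version, but an allowed origin should be echoed back in `Access-Control-Allow-Origin`:

```shell
# Simulated preflight request from an allowed origin
curl -i -X OPTIONS http://127.0.0.1:11434/api/generate \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST"
```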

OLLAMA_MODELS

Directory where models are stored.
Type: string. Default: "~/.ollama/models".
export OLLAMA_MODELS=/mnt/storage/ollama-models
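When relocating an existing model store, a rough sketch (assuming a systemd install and the example path above) is to stop the server, move the directory, and point OLLAMA_MODELS at the new location:

```shell
sudo systemctl stop ollama
mv ~/.ollama/models /mnt/storage/ollama-models
export OLLAMA_MODELS=/mnt/storage/ollama-models
sudo systemctl start ollama
```

Note that a shell export does not reach a systemd service; for a service install, set OLLAMA_MODELS via an Environment= line in the unit override instead.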

OLLAMA_KEEP_ALIVE

Duration models stay loaded in memory after the last request.
Type: string. Default: "5m". Accepts durations such as "5m", "1h", or "300s"; use "0" to unload immediately or "-1" to keep models loaded indefinitely.
# Keep loaded for 10 minutes
export OLLAMA_KEEP_ALIVE=10m

# Unload immediately after use
export OLLAMA_KEEP_ALIVE=0

# Keep loaded indefinitely
export OLLAMA_KEEP_ALIVE=-1
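The server-wide default can also be overridden per request: the /api/generate and /api/chat endpoints accept a keep_alive field. The model name below is a placeholder:

```shell
# Per-request override of the keep-alive duration
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": "10m"
}'
```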

OLLAMA_NUM_PARALLEL

Maximum number of parallel requests processed simultaneously.
Type: integer. Default: "1".
# Process up to 4 requests in parallel
export OLLAMA_NUM_PARALLEL=4

OLLAMA_MAX_LOADED_MODELS

Maximum number of models loaded in memory simultaneously.
Type: integer. Default: "0" (unlimited). Applies per GPU.
# Keep at most 3 models loaded
export OLLAMA_MAX_LOADED_MODELS=3

OLLAMA_MAX_QUEUE

Maximum number of requests queued when the server is busy.
Type: integer. Default: "512".
export OLLAMA_MAX_QUEUE=1024

OLLAMA_LOAD_TIMEOUT

Timeout for model loading operations.
Type: string. Default: "5m". Accepts durations such as "5m" or "300s".
# 10-minute timeout for large models
export OLLAMA_LOAD_TIMEOUT=10m

GPU Configuration

OLLAMA_GPU_OVERHEAD

Reserve a portion of VRAM per GPU to prevent memory exhaustion.
Type: integer. Default: "0". Reserved VRAM per GPU, in bytes.
# Reserve 2GB per GPU
export OLLAMA_GPU_OVERHEAD=2147483648
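Because the value is in bytes, shell arithmetic avoids hand-computed constants; the snippet below is equivalent to the 2 GB example above (2 GiB = 2 * 1024 * 1024 * 1024 bytes):

```shell
# 2 GiB expressed in bytes
export OLLAMA_GPU_OVERHEAD=$((2 * 1024 * 1024 * 1024))
echo "$OLLAMA_GPU_OVERHEAD"   # 2147483648
```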

OLLAMA_SCHED_SPREAD

Schedule model layers across all available GPUs.
Type: boolean. Default: "false".
export OLLAMA_SCHED_SPREAD=true

CUDA_VISIBLE_DEVICES

Select specific NVIDIA GPUs (comma-separated IDs or UUIDs).
Type: string. Linux and Windows only.
# Use GPUs 0 and 1
export CUDA_VISIBLE_DEVICES=0,1

# Use specific UUIDs
export CUDA_VISIBLE_DEVICES=GPU-abc123,GPU-def456

# Force CPU only
export CUDA_VISIBLE_DEVICES=-1
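To find the indices and UUIDs to plug into CUDA_VISIBLE_DEVICES, nvidia-smi can enumerate the installed GPUs:

```shell
# Lists each GPU with its index, name, and UUID
nvidia-smi -L
```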

ROCR_VISIBLE_DEVICES

Select specific AMD GPUs.
Type: string. Linux and Windows only.
export ROCR_VISIBLE_DEVICES=0,1

HSA_OVERRIDE_GFX_VERSION

Override AMD GPU architecture version for unsupported GPUs.
Type: string. Forces the GPU to use a compatible LLVM target. Linux only.
# Force RX 5400 to use gfx1030 target
export HSA_OVERRIDE_GFX_VERSION="10.3.0"

# Different versions for multiple GPUs
export HSA_OVERRIDE_GFX_VERSION_0=10.3.0
export HSA_OVERRIDE_GFX_VERSION_1=11.0.0

GGML_VK_VISIBLE_DEVICES

Select specific Vulkan GPUs.
Type: string. Requires OLLAMA_VULKAN=1.
export GGML_VK_VISIBLE_DEVICES=0,1

# Disable Vulkan
export GGML_VK_VISIBLE_DEVICES=-1

OLLAMA_VULKAN

Enable experimental Vulkan GPU support.
Type: boolean. Default: "false". Linux and Windows only.
export OLLAMA_VULKAN=1

Model Behavior

OLLAMA_CONTEXT_LENGTH

Default context length for models.
Type: integer. Default: auto (4k, 32k, or 256k depending on available VRAM).
# Set 8k context window
export OLLAMA_CONTEXT_LENGTH=8192

OLLAMA_FLASH_ATTENTION

Enable flash attention optimization.
Type: boolean. Default: "false". Experimental.
export OLLAMA_FLASH_ATTENTION=1

OLLAMA_KV_CACHE_TYPE

Quantization type for the key-value cache.
Type: string. Default: "f16". Accepts "f16", "q8_0", or "q4_0".
# Use 8-bit quantization for KV cache
export OLLAMA_KV_CACHE_TYPE=q8_0

OLLAMA_MULTIUSER_CACHE

Optimize prompt caching for multi-user scenarios.
Type: boolean. Default: "false".
export OLLAMA_MULTIUSER_CACHE=1

Advanced Configuration

OLLAMA_DEBUG

Enable debug logging.
Type: boolean. Default: "false".
# Enable debug logging
export OLLAMA_DEBUG=1

# Enable trace logging (more verbose)
export OLLAMA_DEBUG=2
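Where the debug output lands depends on how Ollama was started; on a systemd install the journal is the usual place to look, and on macOS the server log is conventionally written under ~/.ollama:

```shell
# Linux (systemd): follow the server log
journalctl -u ollama -f

# macOS: tail the server log file
tail -f ~/.ollama/logs/server.log
```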

OLLAMA_LLM_LIBRARY

Override LLM library path (bypasses auto-detection).
Type: string.
export OLLAMA_LLM_LIBRARY=/path/to/custom/llm/library.so

OLLAMA_NOPRUNE

Disable automatic pruning of unused model blobs on startup.
Type: boolean. Default: "false".
export OLLAMA_NOPRUNE=1

OLLAMA_NOHISTORY

Disable readline history in the CLI.
Type: boolean. Default: "false".
export OLLAMA_NOHISTORY=1

OLLAMA_EDITOR

Set the editor for interactive prompt editing (Ctrl+G in CLI).
Type: string. Editor command or path.
export OLLAMA_EDITOR=vim

OLLAMA_REMOTES

Allowed hosts for remote model pulling.
Type: string. Default: "ollama.com". Comma-separated.
export OLLAMA_REMOTES="ollama.com,myserver.com"

OLLAMA_NO_CLOUD

Disable Ollama cloud features (remote inference and web search).
Type: boolean. Default: "false".
export OLLAMA_NO_CLOUD=1

OLLAMA_NEW_ENGINE

Enable the new experimental Ollama engine.
Type: boolean. Default: "false". Experimental.
export OLLAMA_NEW_ENGINE=1

Proxy Configuration

HTTP_PROXY / HTTPS_PROXY

Configure HTTP/HTTPS proxy for model downloads.
export HTTP_PROXY=http://proxy.company.com:8080
export HTTPS_PROXY=https://proxy.company.com:8443

NO_PROXY

Hosts to exclude from proxy.
export NO_PROXY=localhost,127.0.0.1,.internal.com
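For a systemd-managed server, proxy variables must be set in the service environment (the same pattern as the OLLAMA_HOST override earlier), since shell exports do not reach the service:

```shell
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="HTTPS_PROXY=https://proxy.company.com:8443"
# Environment="NO_PROXY=localhost,127.0.0.1"
```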

Examples

Production Server Configuration

# Production server with GPU optimization
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_ORIGINS="https://myapp.com"
export OLLAMA_MODELS=/data/ollama-models
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_LOADED_MODELS=3
export OLLAMA_KEEP_ALIVE=30m
export OLLAMA_GPU_OVERHEAD=2147483648
export OLLAMA_SCHED_SPREAD=true

Development Configuration

# Development with debugging
export OLLAMA_HOST=127.0.0.1:11434
export OLLAMA_DEBUG=1
export OLLAMA_KEEP_ALIVE=0
export OLLAMA_NUM_PARALLEL=2

Multi-GPU Setup

# Use all NVIDIA GPUs with load spreading
export OLLAMA_SCHED_SPREAD=true
export OLLAMA_GPU_OVERHEAD=2147483648
export OLLAMA_NUM_PARALLEL=4

CPU-Only Mode

# Force CPU-only inference
export CUDA_VISIBLE_DEVICES=-1
export ROCR_VISIBLE_DEVICES=-1
export GGML_VK_VISIBLE_DEVICES=-1
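After forcing CPU mode, a loaded model's placement can be checked with `ollama ps`; its PROCESSOR column should report 100% CPU:

```shell
# Shows loaded models and where each one runs (CPU vs GPU)
ollama ps
```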

Related Pages

GPU Configuration: detailed GPU setup and troubleshooting
Model Quantization: optimize memory usage with quantization
