Ollama’s behavior can be customized using environment variables. Set these before starting the Ollama server.

Setting Environment Variables

# Set for current session
export OLLAMA_HOST=0.0.0.0:11434

# Set permanently (add to ~/.bashrc or ~/.zshrc)
echo 'export OLLAMA_HOST=0.0.0.0:11434' >> ~/.bashrc

# Set for systemd service
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the Ollama server after changing environment variables for the changes to take effect.
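On systemd-based Linux installs, a typical way to apply the override above is to reload units and restart the service (the commands below assume the standard `ollama` service name):

```shell
# Pick up the edited override and restart the server
sudo systemctl daemon-reload
sudo systemctl restart ollama
```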

Server Configuration

OLLAMA_HOST

The IP address and port the Ollama server listens on.
Type: string. Default: "127.0.0.1:11434".
# Listen on all interfaces
export OLLAMA_HOST=0.0.0.0:11434

# Custom port
export OLLAMA_HOST=127.0.0.1:8080

# HTTPS with custom port
export OLLAMA_HOST=https://localhost:443
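One quick way to confirm the server is reachable on the configured address is the version endpoint, shown here against the default bind address:

```shell
# Should return a JSON object such as {"version":"..."}
curl http://127.0.0.1:11434/api/version
```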

OLLAMA_ORIGINS

Comma-separated list of allowed origins for CORS.
Type: string. Default: "localhost,127.0.0.1,0.0.0.0".
export OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com"
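To sanity-check a CORS setting, one option is to simulate a browser preflight request; the exact response headers vary by Ollama version, but an allowed origin should be echoed back in `Access-Control-Allow-Origin`:

```shell
# Simulated preflight request from an allowed origin
curl -i -X OPTIONS http://127.0.0.1:11434/api/generate \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST"
```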

OLLAMA_MODELS

Directory where models are stored.
Type: string. Default: "~/.ollama/models".
export OLLAMA_MODELS=/mnt/storage/ollama-models
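When relocating an existing model store, a rough sketch (assuming a systemd install and the example path above) is to stop the server, move the directory, and point OLLAMA_MODELS at the new location:

```shell
sudo systemctl stop ollama
mv ~/.ollama/models /mnt/storage/ollama-models
export OLLAMA_MODELS=/mnt/storage/ollama-models
sudo systemctl start ollama
```

Note that a shell export does not reach a systemd service; for a service install, set OLLAMA_MODELS via an Environment= line in the unit override instead.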

OLLAMA_KEEP_ALIVE

Duration models stay loaded in memory after the last request.
Type: string. Default: "5m". Accepts durations such as "5m", "1h", or "300s"; use "0" to unload immediately or "-1" to keep models loaded indefinitely.
# Keep loaded for 10 minutes
export OLLAMA_KEEP_ALIVE=10m

# Unload immediately after use
export OLLAMA_KEEP_ALIVE=0

# Keep loaded indefinitely
export OLLAMA_KEEP_ALIVE=-1
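The server-wide default can also be overridden per request: the /api/generate and /api/chat endpoints accept a keep_alive field. The model name below is a placeholder:

```shell
# Per-request override of the keep-alive duration
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": "10m"
}'
```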

OLLAMA_NUM_PARALLEL

Maximum number of parallel requests processed simultaneously.
Type: integer. Default: "1".
# Process up to 4 requests in parallel
export OLLAMA_NUM_PARALLEL=4

OLLAMA_MAX_LOADED_MODELS

Maximum number of models loaded in memory simultaneously.
Type: integer. Default: "0" (unlimited). Applies per GPU.
# Keep at most 3 models loaded
export OLLAMA_MAX_LOADED_MODELS=3

OLLAMA_MAX_QUEUE

Maximum number of requests queued when the server is busy.
Type: integer. Default: "512".
export OLLAMA_MAX_QUEUE=1024

OLLAMA_LOAD_TIMEOUT

Timeout for model loading operations.
Type: string. Default: "5m". Accepts durations such as "5m" or "300s".
# 10-minute timeout for large models
export OLLAMA_LOAD_TIMEOUT=10m

GPU Configuration

OLLAMA_GPU_OVERHEAD

Reserve a portion of VRAM per GPU to prevent memory exhaustion.
Type: integer. Default: "0". Reserved VRAM per GPU, in bytes.
# Reserve 2GB per GPU
export OLLAMA_GPU_OVERHEAD=2147483648
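Because the value is in bytes, shell arithmetic avoids hand-computed constants; the snippet below is equivalent to the 2 GB example above (2 GiB = 2 * 1024 * 1024 * 1024 bytes):

```shell
# 2 GiB expressed in bytes
export OLLAMA_GPU_OVERHEAD=$((2 * 1024 * 1024 * 1024))
echo "$OLLAMA_GPU_OVERHEAD"   # 2147483648
```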

OLLAMA_SCHED_SPREAD

Schedule model layers across all available GPUs.
Type: boolean. Default: "false".
export OLLAMA_SCHED_SPREAD=true

CUDA_VISIBLE_DEVICES

Select specific NVIDIA GPUs (comma-separated IDs or UUIDs).
Type: string. Linux and Windows only.
# Use GPUs 0 and 1
export CUDA_VISIBLE_DEVICES=0,1

# Use specific UUIDs
export CUDA_VISIBLE_DEVICES=GPU-abc123,GPU-def456

# Force CPU only
export CUDA_VISIBLE_DEVICES=-1
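To find the indices and UUIDs to plug into CUDA_VISIBLE_DEVICES, nvidia-smi can enumerate the installed GPUs:

```shell
# Lists each GPU with its index, name, and UUID
nvidia-smi -L
```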

ROCR_VISIBLE_DEVICES

Select specific AMD GPUs.
Type: string. Linux and Windows only.
export ROCR_VISIBLE_DEVICES=0,1

HSA_OVERRIDE_GFX_VERSION

Override AMD GPU architecture version for unsupported GPUs.
Type: string. Forces the GPU to use a compatible LLVM target. Linux only.
# Force RX 5400 to use gfx1030 target
export HSA_OVERRIDE_GFX_VERSION="10.3.0"

# Different versions for multiple GPUs
export HSA_OVERRIDE_GFX_VERSION_0=10.3.0
export HSA_OVERRIDE_GFX_VERSION_1=11.0.0

GGML_VK_VISIBLE_DEVICES

Select specific Vulkan GPUs.
Type: string. Requires OLLAMA_VULKAN=1.
export GGML_VK_VISIBLE_DEVICES=0,1

# Disable Vulkan
export GGML_VK_VISIBLE_DEVICES=-1

OLLAMA_VULKAN

Enable experimental Vulkan GPU support.
Type: boolean. Default: "false". Linux and Windows only.
export OLLAMA_VULKAN=1

Model Behavior

OLLAMA_CONTEXT_LENGTH

Default context length for models.
Type: integer. Default: auto (4k, 32k, or 256k depending on available VRAM).
# Set 8k context window
export OLLAMA_CONTEXT_LENGTH=8192

OLLAMA_FLASH_ATTENTION

Enable flash attention optimization.
Type: boolean. Default: "false". Experimental.
export OLLAMA_FLASH_ATTENTION=1

OLLAMA_KV_CACHE_TYPE

Quantization type for the key-value cache.
Type: string. Default: "f16". Accepts "f16", "q8_0", or "q4_0".
# Use 8-bit quantization for KV cache
export OLLAMA_KV_CACHE_TYPE=q8_0

OLLAMA_MULTIUSER_CACHE

Optimize prompt caching for multi-user scenarios.
Type: boolean. Default: "false".
export OLLAMA_MULTIUSER_CACHE=1

Advanced Configuration

OLLAMA_DEBUG

Enable debug logging.
Type: boolean. Default: "false".
# Enable debug logging
export OLLAMA_DEBUG=1

# Enable trace logging (more verbose)
export OLLAMA_DEBUG=2
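Where the debug output lands depends on how Ollama was started; on a systemd install the journal is the usual place to look, and on macOS the server log is conventionally written under ~/.ollama:

```shell
# Linux (systemd): follow the server log
journalctl -u ollama -f

# macOS: tail the server log file
tail -f ~/.ollama/logs/server.log
```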

OLLAMA_LLM_LIBRARY

Override LLM library path (bypasses auto-detection).
Type: string.
export OLLAMA_LLM_LIBRARY=/path/to/custom/llm/library.so

OLLAMA_NOPRUNE

Disable automatic pruning of unused model blobs on startup.
Type: boolean. Default: "false".
export OLLAMA_NOPRUNE=1

OLLAMA_NOHISTORY

Disable readline history in the CLI.
Type: boolean. Default: "false".
export OLLAMA_NOHISTORY=1

OLLAMA_EDITOR

Set the editor for interactive prompt editing (Ctrl+G in CLI).
Type: string. Editor command or path.
export OLLAMA_EDITOR=vim

OLLAMA_REMOTES

Allowed hosts for remote model pulling.
Type: string. Default: "ollama.com". Comma-separated.
export OLLAMA_REMOTES="ollama.com,myserver.com"

OLLAMA_NO_CLOUD

Disable Ollama cloud features (remote inference and web search).
Type: boolean. Default: "false".
export OLLAMA_NO_CLOUD=1

OLLAMA_NEW_ENGINE

Enable the new experimental Ollama engine.
Type: boolean. Default: "false". Experimental.
export OLLAMA_NEW_ENGINE=1

Proxy Configuration

HTTP_PROXY / HTTPS_PROXY

Configure HTTP/HTTPS proxy for model downloads.
export HTTP_PROXY=http://proxy.company.com:8080
export HTTPS_PROXY=https://proxy.company.com:8443

NO_PROXY

Hosts to exclude from proxy.
export NO_PROXY=localhost,127.0.0.1,.internal.com
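For a systemd-managed server, proxy variables must be set in the service environment (the same pattern as the OLLAMA_HOST override earlier), since shell exports do not reach the service:

```shell
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="HTTPS_PROXY=https://proxy.company.com:8443"
# Environment="NO_PROXY=localhost,127.0.0.1"
```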

Examples

Production Server Configuration

# Production server with GPU optimization
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_ORIGINS="https://myapp.com"
export OLLAMA_MODELS=/data/ollama-models
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_LOADED_MODELS=3
export OLLAMA_KEEP_ALIVE=30m
export OLLAMA_GPU_OVERHEAD=2147483648
export OLLAMA_SCHED_SPREAD=true

Development Configuration

# Development with debugging
export OLLAMA_HOST=127.0.0.1:11434
export OLLAMA_DEBUG=1
export OLLAMA_KEEP_ALIVE=0
export OLLAMA_NUM_PARALLEL=2

Multi-GPU Setup

# Use all NVIDIA GPUs with load spreading
export OLLAMA_SCHED_SPREAD=true
export OLLAMA_GPU_OVERHEAD=2147483648
export OLLAMA_NUM_PARALLEL=4

CPU-Only Mode

# Force CPU-only inference
export CUDA_VISIBLE_DEVICES=-1
export ROCR_VISIBLE_DEVICES=-1
export GGML_VK_VISIBLE_DEVICES=-1
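After forcing CPU mode, a loaded model's placement can be checked with `ollama ps`; its PROCESSOR column should report 100% CPU:

```shell
# Shows loaded models and where each one runs (CPU vs GPU)
ollama ps
```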

Related Pages

GPU Configuration: detailed GPU setup and troubleshooting
Model Quantization: optimize memory usage with quantization
