
Performance Tuning

Optimize llama.cpp inference performance across CPU, GPU, and hybrid configurations.

Quick Wins

Use GPU

Offload layers to GPU with --n-gpu-layers

Optimize Threads

Set --threads to physical CPU cores

Choose Quantization

Use Q4_K_M or Q5_K_M for best speed/quality

Adjust Context

Reduce --ctx-size to minimum needed

GPU Acceleration

CUDA (NVIDIA)

Offload layers to GPU:
llama-cli -m model.gguf --n-gpu-layers 32 -p "Hello"
Set --n-gpu-layers to a number larger than the model's layer count (e.g., 999) to offload all layers.
Verify GPU usage in the startup logs:
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
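If you capture the startup log to a file, the offload count can be checked in a script. A minimal sketch, parsing sample log lines in the format shown above (the sample text is embedded here for illustration; real logs come from llama-cli's stderr):

```shell
# Parse a captured startup log to confirm how many layers were offloaded
log='llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB'
layers=$(printf '%s\n' "$log" | sed -n 's/.*offloading \([0-9]*\) layers.*/\1/p')
echo "offloaded layers: $layers"   # → offloaded layers: 60
```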

Metal (Apple Silicon)

Metal is enabled by default on macOS:
llama-cli -m model.gguf --n-gpu-layers 999
Monitor GPU utilization:
sudo powermetrics --samplers gpu_power -i 1000

ROCm (AMD)

llama-cli -m model.gguf --n-gpu-layers 32
Check GPU usage:
rocm-smi

Thread Configuration

Incorrect thread settings are the #1 cause of slow inference!

Finding Optimal Thread Count

Start conservative:
# Start with 1 thread
llama-cli -m model.gguf --threads 1 -p "Test"

# Double until performance stops improving
llama-cli -m model.gguf --threads 2 -p "Test"
llama-cli -m model.gguf --threads 4 -p "Test"
llama-cli -m model.gguf --threads 8 -p "Test"
Recommended values:
  • CPU-only: Physical CPU cores (not logical/hyperthreaded)
  • With GPU: 4-8 threads regardless of core count
  • Server (parallel requests): 2-4 threads per request
Check physical core count on Linux:
lscpu | grep "Core(s) per socket"
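The physical-core lookup can also be scripted so --threads is set automatically. A sketch assuming Linux lscpu (with a logical-core fallback via getconf); MODEL is a placeholder path, so the llama-cli line is left commented:

```shell
# Count physical cores (hyperthreads excluded): unique (Core,Socket) pairs
phys_cores=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
# Fall back to logical core count if lscpu is unavailable
[ "$phys_cores" -gt 0 ] 2>/dev/null || phys_cores=$(getconf _NPROCESSORS_ONLN)
echo "Using --threads $phys_cores"
# llama-cli -m "$MODEL" --threads "$phys_cores" -p "Test"   # MODEL is a placeholder
```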

Batch Thread Configuration

Separate threads for prompt processing:
llama-cli -m model.gguf \
  --threads 4 \
  --threads-batch 8

Context Size Optimization

Context size directly impacts:
  • Memory usage (RAM/VRAM)
  • Inference speed
  • Maximum conversation length
llama-cli -m model.gguf --ctx-size 512
Only use large context (>4096) when absolutely necessary. Most tasks work well with 2048.
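The context-size memory cost is dominated by the KV cache. A rough back-of-the-envelope estimate for an f16 cache, using example model dimensions (32 layers, 32 KV heads, head dimension 128; the real values come from the model's GGUF metadata):

```shell
# f16 KV cache ≈ 2 (K and V) * layers * ctx * kv_heads * head_dim * 2 bytes
n_layers=32 n_kv_heads=32 head_dim=128 ctx=4096
kv_bytes=$(( 2 * n_layers * ctx * n_kv_heads * head_dim * 2 ))
echo "KV cache @ ctx=$ctx: $(( kv_bytes / 1048576 )) MiB"   # → KV cache @ ctx=4096: 2048 MiB
```

Halving --ctx-size halves this figure, which is why trimming context is one of the quickest memory wins.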

Batch Size Tuning

Logical batch size (prompt processing parallelism):
llama-cli -m model.gguf --batch-size 512
Physical batch size (hardware limit):
llama-cli -m model.gguf --ubatch-size 256
Guidelines:
  • Larger batch = faster prompt processing, more memory
  • CPU: 512-2048
  • GPU: 512-2048 (depends on VRAM)
  • Server: 2048+ for parallel requests

Flash Attention

Enables more efficient attention computation:
llama-cli -m model.gguf --flash-attn on
Flash Attention defaults to auto and is used whenever the backend supports it. Force it with --flash-attn on or disable it with --flash-attn off.

Quantization Selection

Quantization | Speed     | Quality   | Use Case
Q2_K         | Fastest   | Lowest    | Experimentation
Q3_K_M       | Very Fast | Low       | Resource-constrained
Q4_K_M       | Fast      | Good      | Recommended default
Q5_K_M       | Moderate  | Very Good | Quality-focused
Q6_K         | Slower    | Excellent | Near-original quality
Q8_0         | Slowest   | Highest   | Reference/evaluation

Benchmark Example

Real-world benchmark on NVIDIA A6000 (48GB VRAM), 7-core CPU, 30B Q4_0 model:
Configuration            | Tokens/sec
GPU only, wrong threads  | <0.1
CPU only (-t 7)          | 1.7
GPU + 1 thread           | 5.5
GPU + 7 threads          | 8.7
GPU + 4 threads          | 9.1
Note how too many threads (7) actually decreased performance compared to 4 threads!

Hybrid CPU+GPU Inference

For models larger than VRAM:
# Model requires 32GB, GPU has 24GB
llama-cli -m model.gguf \
  --n-gpu-layers 40 \
  --threads 4
llama.cpp automatically splits:
  • 40 layers on GPU
  • Remaining layers on CPU
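To pick a starting --n-gpu-layers value, you can estimate how many layers fit in free VRAM from the model's file size and layer count. A rough sketch with illustrative, assumed numbers (not measurements; leave headroom for the KV cache and compute buffers, then tune by watching VRAM):

```shell
# Estimate layers that fit in VRAM (all numbers illustrative)
model_mb=32768   # model file size in MB (assumed)
n_layers=60      # repeating layers in the model (assumed)
vram_mb=24576    # total VRAM in MB
reserve_mb=2048  # headroom for KV cache and compute buffers
per_layer_mb=$(( model_mb / n_layers ))
gpu_layers=$(( (vram_mb - reserve_mb) / per_layer_mb ))
echo "--n-gpu-layers $gpu_layers"   # → --n-gpu-layers 41
```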

Memory Optimization

Memory Mapping

Memory mapping is enabled by default (recommended): the model file is mapped into memory, so startup is fast and pages are loaded from disk on demand.
Disable mmap to load the entire model into RAM up front (slower startup and higher resident RAM, but no page faults during inference):
llama-cli -m model.gguf --no-mmap

Memory Locking

Prevent swapping (requires sufficient RAM):
llama-cli -m model.gguf --mlock
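--mlock only helps if the process's memlock limit allows pinning the whole model, and llama.cpp typically warns at startup when locking fails. A quick pre-flight check (a sketch; /etc/security/limits.conf is the usual Linux location for raising the limit):

```shell
# ulimit -l reports the max locked memory in KiB (or "unlimited")
limit=$(ulimit -l)
echo "memlock limit: $limit"
if [ "$limit" != "unlimited" ]; then
  echo "If this is smaller than the model size in KiB, raise it, e.g. in /etc/security/limits.conf"
fi
```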

Server Performance

Parallel Request Handling

llama-server -m model.gguf \
  --ctx-size 4096 \
  --parallel 4 \
  --threads 4 \
  --batch-size 2048
Configuration guide:
  • --parallel (-np): Number of simultaneous request slots (2-8)
  • --threads: Generation threads, shared across all slots
  • --ctx-size: Total context, split evenly among slots (here each slot gets 4096 / 4 = 1024 tokens)
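Because llama-server divides the context among slots, it is easiest to size --ctx-size from the per-request context you want. A quick sanity check (example numbers only):

```shell
# Size --ctx-size so each parallel slot gets the context a request needs
per_request_ctx=4096
n_parallel=4
total_ctx=$(( per_request_ctx * n_parallel ))
echo "--ctx-size $total_ctx --parallel $n_parallel"   # → --ctx-size 16384 --parallel 4
```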

Continuous Batching

Enabled by default, improves throughput:
llama-server -m model.gguf \
  --cont-batching \
  --parallel 8

Platform-Specific Tips

Optimal single-GPU configuration:
llama-cli -m model.gguf \
  --n-gpu-layers 999 \
  --threads 4 \
  --batch-size 512 \
  --ubatch-size 256 \
  --flash-attn on
Multi-GPU:
# Split evenly across 2 GPUs
llama-cli -m model.gguf \
  --tensor-split 1,1 \
  --n-gpu-layers 999

Profiling and Monitoring

Built-in Performance Stats

llama-cli prints timing statistics by default at the end of each run (suppress them with --no-perf):
llama-cli -m model.gguf -p "Test prompt"
Outputs:
  • Prompt evaluation time
  • Token generation time
  • Tokens per second

Server Metrics

Query server metrics endpoint:
curl http://localhost:8080/metrics
Returns:
  • Request counts
  • Processing times
  • KV cache usage
  • Queue statistics

Benchmark Tool

Systematic performance testing:
llama-bench -m model.gguf \
  --n-prompt 512 \
  --n-gen 128 \
  -ngl 32 \
  -t 4,8,16
Learn more about benchmarking →

Common Performance Issues

Slow Inference

Likely causes:
  • Too many threads (oversaturation)
  • No GPU acceleration
  • Context size too large
Solutions:
  • Set --threads 1 and gradually increase
  • Enable GPU layers: --n-gpu-layers 32
  • Reduce context: --ctx-size 2048
Out of Memory

Solutions:
  • Use smaller quantization (Q4_K_M instead of Q8_0)
  • Reduce context size: --ctx-size 1024
  • Reduce batch size: --batch-size 256
  • Offload fewer layers: --n-gpu-layers 20
  • Keep mmap enabled (the default; avoid --no-mmap)
Low GPU Utilization

Check:
  • Are layers offloaded? (check startup logs)
  • Is batch size large enough? Try 512 or 1024
  • Are you using optimal quantization? (Q4_K_M recommended)
Optimize:
llama-cli -m model.gguf \
  --n-gpu-layers 999 \
  --batch-size 1024 \
  --ubatch-size 512
Poor Server Throughput

Solutions:
  • Increase parallel slots: --parallel 8
  • Size --ctx-size to cover all slots (each slot gets ctx-size / parallel)
  • Lower --threads if the CPU is oversaturated
  • Keep continuous batching enabled (default: --cont-batching)

Advanced Optimizations

CPU Affinity

Bind threads to specific cores:
llama-cli -m model.gguf \
  --cpu-mask 0xFF \
  --cpu-strict 1

Process Priority

Increase process priority:
llama-cli -m model.gguf --prio 2
Levels: -1 (low), 0 (normal), 1 (medium), 2 (high), 3 (realtime)

Polling Level

Reduce latency with busy-waiting:
llama-cli -m model.gguf --poll 100
Range: 0-100 (0=no polling, 100=full busy-wait)

Next Steps

Quantization Guide

Learn about quantization types and tradeoffs

Backend Configuration

Configure GPU backends for your hardware

Benchmarking

Measure and compare performance

Server Tuning

Optimize server for production
