Deep-Live-Cam ships with sensible defaults that work on most machines, but getting the best throughput requires matching the execution provider, thread count, and memory limit to your actual hardware. This guide explains exactly how those defaults are derived, when to override them, and what platform-specific constraints to watch for.

Execution providers

The execution provider controls which hardware ONNX Runtime uses to run the face models. Deep-Live-Cam picks a provider automatically at startup using suggest_default_execution_provider in modules/core.py, which probes the list returned by onnxruntime.get_available_providers() and selects the first match in priority order.
Auto-detection order: cuda → rocm → coreml → dml → cpu
The CUDA provider requires onnxruntime-gpu and CUDA Toolkit 12.8.0 with cuDNN v8.9.7.
python run.py -s source.jpg -t video.mp4 \
  --execution-provider cuda \
  --execution-threads 2
On CUDA, the in-memory FFmpeg pipeline also promotes libx264 to h264_nvenc and libx265 to hevc_nvenc automatically, offloading encoding to the NVENC hardware unit.
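As a rough illustration of the auto-detection described earlier in this section, the sketch below picks the first provider in the documented priority order from a list of ONNX Runtime provider names. The names PRIORITY, short_name, and pick_provider are hypothetical, and the mapping from 'CUDAExecutionProvider' to 'cuda' is an assumed naming convention; this is not the code of suggest_default_execution_provider itself.

```python
from typing import Sequence

# Priority order as documented in this guide (assumption: short CLI names).
PRIORITY = ("cuda", "rocm", "coreml", "dml", "cpu")


def short_name(provider: str) -> str:
    """Map e.g. 'CUDAExecutionProvider' to 'cuda' (assumed convention)."""
    return provider.replace("ExecutionProvider", "").lower()


def pick_provider(available: Sequence[str]) -> str:
    """Return the first short name in PRIORITY that appears in `available`."""
    names = {short_name(p) for p in available}
    for candidate in PRIORITY:
        if candidate in names:
            return candidate
    # Hypothetical fallback for an empty or unrecognized list.
    return "cpu"


# With ONNX Runtime installed, the input would come from:
#   import onnxruntime
#   pick_provider(onnxruntime.get_available_providers())
print(pick_provider(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

Passing the provider list in as an argument keeps the selection logic easy to try without a GPU or ONNX Runtime installed.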
When any --execution-provider value is given on the command line, OMP_NUM_THREADS is set to 6 before PyTorch is imported. This is done at module-load time in modules/core.py because OpenMP threads must be pinned before the first import torch.
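A minimal sketch of that ordering constraint follows. The helper pin_openmp_threads is hypothetical, and the check on the raw argument list is an assumption about how the flag is detected; the point is only that the environment variable has to be written before PyTorch is first imported.

```python
import os
import sys
from typing import MutableMapping, Sequence


def pin_openmp_threads(
    argv: Sequence[str],
    environ: MutableMapping[str, str],
    threads: int = 6,
) -> bool:
    """Set OMP_NUM_THREADS if an --execution-provider flag is present.

    Returns True when the variable was set.
    """
    if any(arg.startswith("--execution-provider") for arg in argv):
        environ["OMP_NUM_THREADS"] = str(threads)
        return True
    return False


# Run at module-load time, before any `import torch` elsewhere in the program.
pin_openmp_threads(sys.argv, os.environ)
```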

Thread count

The --execution-threads flag controls the ThreadPoolExecutor worker count in multi_process_frame. The default is chosen by suggest_execution_threads based on the active provider:
| Provider | Default threads | Rationale |
| --- | --- | --- |
| DML | 1 | DirectML serializes ONNX sessions; extra threads stall on the DML lock. |
| ROCm | 1 | Same serialization behavior as DML. |
| CUDA | 2 | Two threads keep the GPU fed while I/O occurs on the other thread. |
| CPU | max(4, min(cpu_count − 2, 16)) | Uses most available cores, reserving 2 for the OS and FFmpeg. |
# From modules/core.py — suggest_execution_threads()
if 'DmlExecutionProvider' in modules.globals.execution_providers:
    return 1
if 'ROCMExecutionProvider' in modules.globals.execution_providers:
    return 1
if 'CUDAExecutionProvider' in modules.globals.execution_providers:
    return 2
return max(4, min(cpu_count - 2, 16))
For CUDA users, 2 threads is the empirically determined sweet spot; more threads cause CUDA context contention that reduces throughput. For CPU users on a 12-core machine, the formula yields max(4, min(10, 16)) = 10.
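To try the rule with other core counts, here is a standalone restatement that takes the provider list and core count as arguments. The function name suggest_threads and its parameters are hypothetical; the branches mirror the excerpt above.

```python
from typing import Sequence


def suggest_threads(providers: Sequence[str], cpu_count: int) -> int:
    """Default worker count for the given providers and core count."""
    if "DmlExecutionProvider" in providers:
        return 1
    if "ROCMExecutionProvider" in providers:
        return 1
    if "CUDAExecutionProvider" in providers:
        return 2
    # CPU: leave two cores free, never fewer than 4 or more than 16 workers.
    return max(4, min(cpu_count - 2, 16))


for cores in (2, 4, 12, 32):
    print(cores, suggest_threads(["CPUExecutionProvider"], cores))
# prints: 2 4, 4 4, 12 10, 32 16 (one pair per line)
```

Note the floor of 4: on a machine with 4 or fewer cores the formula still returns 4 workers, so the pool can exceed the physical core count there.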

Memory limits

--max-memory sets a hard RAM ceiling enforced by limit_resources in modules/core.py after argument parsing completes. Platform defaults:
  • macOS: 4 GB — conservative default to avoid pressure on unified memory shared with the GPU.
  • All other platforms: 16 GB.
The limit is applied using the OS memory API:
# From modules/core.py — limit_resources()
if platform.system().lower() == 'windows':
    kernel32.SetProcessWorkingSetSize(-1, memory, memory)
else:
    resource.setrlimit(resource.RLIMIT_DATA, (memory, memory))
Setting --max-memory too low can cause allocations to fail and the run to abort partway through. The inswapper_128_fp16.onnx model alone occupies roughly 500 MB; budget at least 2 GB for swapper-only mode and 4 GB when chaining enhancers.
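The defaults and budget guidance above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: all function names are hypothetical, the flag value is assumed to be in GiB (1024 ** 3 bytes), and detecting an enhancer by the "face_enhancer" name prefix is an illustrative shortcut rather than the project's own logic.

```python
from typing import Sequence

GIB = 1024 ** 3  # assumption: --max-memory is expressed in GiB


def default_max_memory_gb(system: str) -> int:
    """Platform default: 4 GB on macOS ('Darwin'), 16 GB elsewhere."""
    return 4 if system.lower() == "darwin" else 16


def memory_limit_bytes(max_memory_gb: int) -> int:
    """Convert the flag value to the byte count handed to the OS API."""
    return max_memory_gb * GIB


def minimum_budget_gb(frame_processors: Sequence[str]) -> int:
    """2 GB for swapper-only runs, 4 GB when an enhancer is chained."""
    has_enhancer = any(p.startswith("face_enhancer") for p in frame_processors)
    return 4 if has_enhancer else 2


def check_budget(max_memory_gb: int, frame_processors: Sequence[str]) -> bool:
    """True when the requested limit meets the suggested minimum."""
    return max_memory_gb >= minimum_budget_gb(frame_processors)


# platform.system() returns 'Darwin' on macOS.
print(default_max_memory_gb("Darwin"), memory_limit_bytes(4))
print(check_budget(2, ["face_swapper", "face_enhancer"]))
```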

TensorFlow GPU memory growth

If TensorFlow is installed, limit_resources also enables memory growth for every visible GPU device:
# Prevents TensorFlow from pre-allocating all available VRAM
for gpu in tensorflow.config.experimental.list_physical_devices('GPU'):
    tensorflow.config.experimental.set_memory_growth(gpu, True)
This prevents TensorFlow from pre-allocating all available VRAM at startup, which would otherwise starve ONNX Runtime and the encoder.

CUDA cache management

After each frame processor finishes its pass, release_resources in modules/core.py clears the PyTorch CUDA allocator cache:
# From modules/core.py — release_resources()
if 'CUDAExecutionProvider' in modules.globals.execution_providers and HAS_TORCH:
    torch.cuda.empty_cache()
This runs between every processor in the pipeline (e.g., between face_swapper and face_enhancer), keeping peak VRAM usage predictable across a long video.
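The call pattern can be pictured roughly as follows. The run_pipeline loop and its (name, callable) processor pairs are hypothetical stand-ins for the project's frame processor modules; only the release step mirrors the excerpt above.

```python
from typing import Callable, List, Sequence, Tuple

try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False


def release_resources(execution_providers: Sequence[str]) -> None:
    """Clear the PyTorch CUDA allocator cache when CUDA is the active provider."""
    if "CUDAExecutionProvider" in execution_providers and HAS_TORCH:
        torch.cuda.empty_cache()


def run_pipeline(
    processors: Sequence[Tuple[str, Callable[[], None]]],
    execution_providers: Sequence[str],
    release: Callable[[Sequence[str]], None] = release_resources,
) -> List[str]:
    """Run each processor in order, releasing cached memory after each pass."""
    finished = []
    for name, process in processors:
        process()
        release(execution_providers)
        finished.append(name)
    return finished
```

Releasing after every pass means cached allocations left by one processor are handed back before the next processor starts allocating.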

In-memory vs. disk-based pipeline

By default, Deep-Live-Cam avoids writing per-frame PNG files to disk. Instead it pipes raw BGR24 frames directly from an FFmpeg decoder into Python, processes them, and pushes them into a second FFmpeg encoder process. This eliminates the largest I/O bottleneck in earlier versions. The disk-based fallback is used automatically when:
  • --map-faces is active (multi-face mapping requires random frame access).
  • The FFmpeg pipe pipeline fails (e.g., the hardware encoder is unavailable).
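To make the data flow concrete, the sketch below builds the two FFmpeg command lines such a pipe-based setup might use and computes how many bytes one BGR24 frame occupies on the pipe. All function names are hypothetical and the argument lists are an assumption, not the project's actual commands; the encoder promotion table restates the CUDA behavior described under Execution providers.

```python
from typing import List, Sequence

# On CUDA, software encoders are promoted to their NVENC counterparts.
NVENC_PROMOTION = {"libx264": "h264_nvenc", "libx265": "hevc_nvenc"}


def pick_encoder(requested: str, execution_providers: Sequence[str]) -> str:
    """Return the NVENC variant when CUDA is active, else the requested encoder."""
    if "CUDAExecutionProvider" in execution_providers:
        return NVENC_PROMOTION.get(requested, requested)
    return requested


def frame_size_bytes(width: int, height: int) -> int:
    """BGR24 stores 3 bytes per pixel, so one frame is width * height * 3 bytes."""
    return width * height * 3


def decoder_command(path: str) -> List[str]:
    """Decode a video to raw BGR24 frames on stdout."""
    return ["ffmpeg", "-i", path, "-f", "rawvideo", "-pix_fmt", "bgr24", "-"]


def encoder_command(width: int, height: int, fps: float,
                    encoder: str, output: str) -> List[str]:
    """Encode raw BGR24 frames read from stdin."""
    return ["ffmpeg", "-y", "-f", "rawvideo", "-pix_fmt", "bgr24",
            "-s", f"{width}x{height}", "-r", str(fps), "-i", "-",
            "-c:v", encoder, output]


print(frame_size_bytes(1920, 1080))  # 6220800 bytes per 1080p frame
```

In a full implementation each command would be started with subprocess.Popen, and the Python side would read exactly frame_size_bytes(width, height) bytes per frame from the decoder's stdout before writing the processed frame to the encoder's stdin.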

Tuning recommendations by hardware tier

1. High-end NVIDIA GPU (RTX 3080+, 10 GB+ VRAM)

python run.py -s source.jpg -t video.mp4 \
  --frame-processor face_swapper face_enhancer \
  --execution-provider cuda \
  --execution-threads 2 \
  --max-memory 16 \
  --keep-fps
Enable face_enhancer for maximum quality. The NVENC hardware encoder handles output encoding in parallel with ONNX inference.

2. Mid-range NVIDIA GPU (RTX 3060, 6–8 GB VRAM)

python run.py -s source.jpg -t video.mp4 \
  --frame-processor face_swapper face_enhancer_gpen256 \
  --execution-provider cuda \
  --execution-threads 2 \
  --max-memory 8
Use face_enhancer_gpen256 instead of GFPGAN to stay within VRAM budget while still improving output quality.

3. Apple Silicon (M1/M2/M3)

python run.py -s source.jpg -t video.mp4 \
  --frame-processor face_swapper \
  --execution-provider coreml \
  --max-memory 4
CoreML can route inference to the Apple Neural Engine (ANE). Avoid adding enhancers for live use; the ANE has limited concurrency for multiple large models.

4. CPU only

python run.py -s source.jpg -t video.mp4 \
  --frame-processor face_swapper \
  --execution-provider cpu \
  --execution-threads 8 \
  --max-memory 16
Skip all enhancers. Maximize --execution-threads up to cpu_count - 2. Do not use for real-time/live mode.