nanoGPT uses PyTorch Distributed Data Parallel (DDP) to scale training across multiple GPUs and nodes. This guide explains how DDP is implemented and how to configure it for your setup.
How DDP works in nanoGPT
The training script automatically detects and configures DDP based on environment variables set by torchrun.
DDP detection and initialization
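From train.py:82-100: a run is treated as a DDP run when the RANK environment variable is set. The script then initializes the process group and reads RANK, LOCAL_RANK, and WORLD_SIZE to configure each process.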
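The detection logic can be sketched as follows. This is a simplified illustration rather than a verbatim copy of train.py: the `DDPConfig` dataclass and the `detect_ddp` helper are hypothetical names introduced here, and the calls that need a GPU (`init_process_group`, `torch.cuda.set_device`) are only noted in a comment so the sketch runs without PyTorch.

```python
import os
from dataclasses import dataclass


@dataclass
class DDPConfig:
    ddp: bool
    rank: int
    local_rank: int
    world_size: int
    device: str
    master_process: bool
    seed: int


def detect_ddp(env=None, base_seed=1337):
    """Derive per-process settings from torchrun-style environment variables."""
    env = os.environ if env is None else env
    # torchrun exports RANK for every worker; if it is missing, this is a plain single-process run.
    ddp = int(env.get("RANK", -1)) != -1
    if not ddp:
        return DDPConfig(ddp=False, rank=0, local_rank=0, world_size=1,
                         device="cuda", master_process=True, seed=base_seed)
    rank = int(env["RANK"])
    local_rank = int(env["LOCAL_RANK"])
    world_size = int(env["WORLD_SIZE"])
    # The real script would call init_process_group(backend=...) and
    # torch.cuda.set_device(device) at this point.
    return DDPConfig(ddp=True, rank=rank, local_rank=local_rank, world_size=world_size,
                     device=f"cuda:{local_rank}", master_process=(rank == 0),
                     seed=base_seed + rank)


print(detect_ddp({"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"}))
```

Passing an explicit dictionary instead of `os.environ` makes it easy to check what a given rank would see without launching torchrun.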
Backend configuration
The default backend is NCCL, which is optimized for NVIDIA GPUs. NCCL is recommended for NVIDIA GPU clusters with high-speed interconnects such as Infiniband. For CPU training or mixed CPU/GPU setups, use the Gloo backend.
Single-node, multi-GPU training
Launch with torchrun
Launch train.py through torchrun to train on all available GPUs of a single node, using the parameters below.
Parameters explained
| Parameter | Description |
|---|---|
| --standalone | Single-node training (auto-configures master address) |
| --nproc_per_node=8 | Number of processes (GPUs) to use |
| train.py | Training script |
| config/train_gpt2.py | Configuration file |
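The parameters in the table combine into a single launch command. The snippet below is a minimal sketch that assembles it in Python so it can be logged or handed to `subprocess.run`; the helper name `single_node_command` is hypothetical and not part of nanoGPT.

```python
import shlex


def single_node_command(nproc_per_node, script="train.py", config="config/train_gpt2.py"):
    """Build the argument list for a single-node torchrun launch."""
    return [
        "torchrun",
        "--standalone",                        # single node, no master address needed
        f"--nproc_per_node={nproc_per_node}",  # one process per GPU
        script,
        config,
    ]


cmd = single_node_command(8)
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems if you later run it with `subprocess.run(cmd, check=True)`.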
Example: 4 GPUs
torchrun sets environment variables
For each process:
- RANK: Global rank (0-3)
- LOCAL_RANK: Local rank on this node (0-3)
- WORLD_SIZE: Total number of processes (4)
Each process initializes
- Loads the same model
- Sets a different CUDA device based on LOCAL_RANK
- Uses a different random seed (1337 + RANK)
Multi-node training
Two-node example
For training across 2 nodes, each with 8 GPUs, run torchrun once on each node with the parameters below:
- Master node (--node_rank=0)
- Worker node (--node_rank=1)
Multi-node parameters
| Parameter | Description |
|---|---|
| --nproc_per_node=8 | GPUs per node |
| --nnodes=2 | Total number of nodes |
| --node_rank=0 | Rank of this node (0 for master, 1+ for workers) |
| --master_addr | IP address of master node |
| --master_port | Port for communication (default: 29500) |
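As an illustration of how these parameters fit together, the sketch below generates one torchrun command per node. The helper names are hypothetical, and the address 192.0.2.10 is a placeholder from the documentation-only IP range, so substitute your master node's real address. The `global_rank` helper assumes ranks are assigned contiguously per node, which is the usual layout when `--node_rank` is given explicitly.

```python
import shlex


def multi_node_commands(master_addr, master_port=29500, nnodes=2, nproc_per_node=8,
                        script="train.py", config="config/train_gpt2.py"):
    """Return one torchrun command line per node; index 0 is the master node."""
    commands = []
    for node_rank in range(nnodes):
        argv = [
            "torchrun",
            f"--nproc_per_node={nproc_per_node}",
            f"--nnodes={nnodes}",
            f"--node_rank={node_rank}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            script,
            config,
        ]
        commands.append(shlex.join(argv))
    return commands


def global_rank(node_rank, local_rank, nproc_per_node=8):
    """Global rank under the assumed contiguous per-node assignment."""
    return node_rank * nproc_per_node + local_rank


for line in multi_node_commands("192.0.2.10"):  # placeholder address
    print(line)
```

Every node must use the same `--master_addr` and `--master_port`; only `--node_rank` differs between the commands.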
Infiniband configuration
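With Infiniband, NCCL uses it automatically and no extra configuration is needed. Without Infiniband, set NCCL_IB_DISABLE=1 when launching.
Benchmark your interconnect
Use iperf3 to test network bandwidth between nodes. Typical figures: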
- Infiniband: 100+ Gbps
- 10GbE: ~10 Gbps
- 1GbE: ~1 Gbps (will be very slow for multi-node training)
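To see why link speed matters so much, here is a back-of-the-envelope estimate of how long one gradient synchronization takes. It is an idealized sketch under stated assumptions: a ring all-reduce between two ranks joined by a single link, 4-byte (fp32) gradients, a 124M parameter model, and no overlap with computation or latency overhead. Real timings will differ.

```python
def allreduce_seconds(num_params, world_size, link_gbps, bytes_per_grad=4):
    """Idealized ring all-reduce time.

    Each rank sends (and receives) 2 * (N - 1) / N times the gradient size.
    """
    grad_bytes = num_params * bytes_per_grad
    traffic_bytes = 2 * (world_size - 1) / world_size * grad_bytes
    return traffic_bytes * 8 / (link_gbps * 1e9)


for name, gbps in [("Infiniband", 100), ("10GbE", 10), ("1GbE", 1)]:
    t = allreduce_seconds(124e6, world_size=2, link_gbps=gbps)
    print(f"{name:>10}: {t:.2f} s per gradient sync")
```

Under these assumptions a 1 Gbps link spends several seconds on communication per optimizer step, which can easily dominate the compute time.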
Gradient accumulation with DDP
Automatic scaling
Gradient accumulation steps are automatically divided by the world size to maintain the same effective batch size. The configured value must therefore be divisible by the number of processes.
Example calculation
With config/train_gpt2.py, the effective batch size remains constant regardless of the number of GPUs; each GPU simply runs fewer gradient accumulation steps.
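The arithmetic can be worked through in a few lines. The values below are assumed to match config/train_gpt2.py (batch size 12, block size 1024, 40 accumulation steps), so check them against your copy; `per_gpu_schedule` is a hypothetical helper for illustration.

```python
# Assumed values from config/train_gpt2.py; verify against your copy.
batch_size = 12                    # sequences per micro-batch
block_size = 1024                  # tokens per sequence
gradient_accumulation_steps = 40   # micro-steps per iteration, summed over all GPUs


def per_gpu_schedule(world_size):
    """Return (micro-steps per GPU, tokens processed per iteration)."""
    # The split only works when the step count divides evenly across processes.
    assert gradient_accumulation_steps % world_size == 0
    steps_per_gpu = gradient_accumulation_steps // world_size
    tokens_per_iter = steps_per_gpu * world_size * batch_size * block_size
    return steps_per_gpu, tokens_per_iter


for ws in (1, 2, 4, 8):
    steps, tokens = per_gpu_schedule(ws)
    print(f"{ws} GPU(s): {steps} micro-steps each, {tokens:,} tokens per iteration")
```

With these values every configuration processes 491,520 tokens per iteration; only the number of micro-steps per GPU changes.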
Gradient synchronization
Efficient sync strategy
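From train.py:292-298, gradients are synchronized only on the last micro-step of each gradient accumulation cycle, so earlier micro-steps accumulate gradients locally without any communication.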
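The pattern is to toggle DDP's `require_backward_grad_sync` flag inside the micro-step loop. The sketch below illustrates the control flow only: `FakeDDPModel` is a stand-in invented here so the example runs without PyTorch or multiple processes, and it simply counts how many times an all-reduce would have been triggered.

```python
class FakeDDPModel:
    """Stand-in for a DDP-wrapped model; counts would-be gradient all-reduces."""

    def __init__(self):
        self.require_backward_grad_sync = True
        self.sync_count = 0

    def backward(self):
        # Real DDP performs the all-reduce during loss.backward() when the flag is set.
        if self.require_backward_grad_sync:
            self.sync_count += 1


def run_iteration(model, gradient_accumulation_steps):
    for micro_step in range(gradient_accumulation_steps):
        # Only the final micro-step communicates; earlier ones accumulate locally.
        model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        model.backward()


model = FakeDDPModel()
run_iteration(model, gradient_accumulation_steps=5)
print(model.sync_count)
```

Five micro-steps result in a single synchronization instead of five, which is where the communication savings come from.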
Checkpointing and logging
Master process only
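Only the master process (rank 0) performs I/O operations such as logging and checkpointing.
Unwrap DDP for checkpointing
Before saving, unwrap the DDP container (model.module) so that the checkpoint holds the state of the underlying model.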
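Both ideas can be shown together in a small sketch. The classes `Wrapped` and `TinyModel` and the function `build_checkpoint` are hypothetical stand-ins so the example runs without PyTorch; in a real run the wrapper would be `DistributedDataParallel` and the result would be passed to `torch.save`.

```python
class TinyModel:
    """Placeholder model exposing a state_dict() like a torch module."""

    def state_dict(self):
        return {"weight": [1.0, 2.0]}


class Wrapped:
    """Stand-in for DistributedDataParallel, which keeps the original model in .module."""

    def __init__(self, module):
        self.module = module


def build_checkpoint(model, ddp, master_process, iter_num):
    # Non-master ranks skip I/O entirely so only one file gets written.
    if not master_process:
        return None
    # Unwrap so the saved keys do not carry the wrapper's "module." prefix.
    raw_model = model.module if ddp else model
    return {"model": raw_model.state_dict(), "iter_num": iter_num}


ckpt = build_checkpoint(Wrapped(TinyModel()), ddp=True, master_process=True, iter_num=10)
print(ckpt)
```

Saving the unwrapped state means the checkpoint can later be loaded into a plain, non-DDP model without renaming keys.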
Cleanup
Always destroy the process group when training completes by calling destroy_process_group() if the run used DDP.
Advanced DDP configurations
Custom backend
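For CPU training or debugging, switch the backend from NCCL to Gloo, for example by passing --backend=gloo to train.py.
NCCL environment variables
Optimize NCCL performance by setting NCCL environment variables before launching. To pin NCCL to a specific network interface, first find the interfaces available on each node.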
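One possible way to handle both steps from Python is sketched below. `NCCL_DEBUG`, `NCCL_IB_DISABLE`, and `NCCL_SOCKET_IFNAME` are NCCL environment variables; the helper `nccl_env` and the choice of which interface to use are illustrative assumptions. The variables must be in the environment before the process group is created.

```python
import os
import socket


def nccl_env(ifname=None, disable_ib=False, debug=False):
    """Collect NCCL settings as a dict of environment variables."""
    env = {}
    if debug:
        env["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging
    if disable_ib:
        env["NCCL_IB_DISABLE"] = "1"        # fall back to sockets when Infiniband is absent
    if ifname is not None:
        env["NCCL_SOCKET_IFNAME"] = ifname  # restrict NCCL to one network interface
    return env


# List the interfaces visible on this machine; names differ from host to host.
interfaces = [name for _, name in socket.if_nameindex()]
print("interfaces:", interfaces)

# Apply the settings before init_process_group is called.
os.environ.update(nccl_env(disable_ib=True, debug=True))
```

The same variables can instead be set in the shell that invokes torchrun, which has the advantage of applying to every worker process on that node.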
Performance considerations
Scaling efficiency
Due to gradient synchronization overhead, scaling efficiency decreases as you add more GPUs. As a rough guide, expect 80-90% efficiency on 8 GPUs and 60-70% on 64 GPUs; actual figures depend on the interconnect and the model size.
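Scaling efficiency here means measured multi-GPU throughput divided by the ideal of N times single-GPU throughput. A minimal sketch for computing it from your own measurements follows; the throughput numbers are made up for illustration and are not benchmark results.

```python
def scaling_efficiency(tokens_per_sec_1gpu, tokens_per_sec_ngpu, n_gpus):
    """Fraction of ideal linear speedup achieved on n_gpus."""
    return tokens_per_sec_ngpu / (n_gpus * tokens_per_sec_1gpu)


# Illustrative, made-up throughput measurements.
eff = scaling_efficiency(tokens_per_sec_1gpu=100_000, tokens_per_sec_ngpu=680_000, n_gpus=8)
print(f"{eff:.0%}")
```

Measuring this for your own cluster is more reliable than any rule of thumb, since the result depends heavily on the network between GPUs.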
Batch size tuning
Increase batch_size or gradient_accumulation_steps to:
- Reduce gradient sync overhead
- Improve GPU utilization
- Maintain stable training
Memory optimization
If you run out of memory:
- Decrease batch_size
- Decrease block_size (context length)
- Enable gradient checkpointing (requires code modification)
- Use a smaller model (n_layer, n_head, n_embd)
Troubleshooting
Common issues
NCCL timeout or hang
- Check network connectivity between nodes
- Verify the firewall allows traffic on the master port
- Set NCCL_DEBUG=INFO to see detailed logs
- Increase the timeout by passing timeout=timedelta(seconds=7200) to init_process_group (requires code modification)
Out of memory on some GPUs
- Ensure all GPUs have the same memory
- Check for memory leaks in data loading
- Reduce batch_size or block_size
Slow multi-node training
- Benchmark the interconnect with iperf3
- Disable Infiniband if not available: NCCL_IB_DISABLE=1
- Check for network congestion
Different results on different GPUs
- Expect per-process differences: each rank uses its own seed (1337 + RANK), and deterministic operations are disabled by default
- Check that torch.manual_seed is set correctly
- Verify all processes load the same initial checkpoint
Next steps
- Reproducing GPT-2: Train a 124M parameter model with DDP
- Finetuning: Finetune pretrained models on custom data