Overview
The training module provides a complete implementation for training GPT models with support for:
- Single GPU and distributed data parallel (DDP) training
- Mixed precision training (float16/bfloat16)
- Gradient accumulation
- Learning rate scheduling with warmup and cosine decay
- Checkpointing and resumption
- WandB integration for experiment tracking
Training modes
You can run the training script on a single GPU or with DDP across multiple GPUs and nodes. If your cluster does not have Infiniband, prepend NCCL_IB_DISABLE=1 to the launch commands.
Key functions
get_batch
Data split to load from: 'train' or 'val'
x: Input token sequences of shape (batch_size, block_size)
y: Target token sequences of shape (batch_size, block_size), shifted by one position
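The relationship between x and y can be illustrated with a dependency-free sketch. It uses plain Python lists in place of tensors and takes the token stream as an explicit argument; the signature and the `rng` parameter are assumptions made for illustration, not the script's actual interface.

```python
import random


def get_batch(tokens, batch_size, block_size, rng=random):
    """Sample `batch_size` random windows of length `block_size` from `tokens`.

    Returns (x, y) as nested lists of shape (batch_size, block_size),
    where y is x shifted by one position in the token stream.
    """
    # The last valid start index must leave room for block_size inputs plus one target.
    max_start = len(tokens) - block_size - 1
    if max_start < 0:
        raise ValueError("token stream is shorter than block_size + 1")
    starts = [rng.randint(0, max_start) for _ in range(batch_size)]
    x = [tokens[i : i + block_size] for i in starts]
    y = [tokens[i + 1 : i + 1 + block_size] for i in starts]
    return x, y


# Usage: a toy "dataset" of 100 consecutive token ids.
tokens = list(range(100))
x, y = get_batch(tokens, batch_size=4, block_size=8, rng=random.Random(0))
```

Each target y[b][t] is the token that follows x[b][t], so the model is trained to predict the next token at every position of the window.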
estimate_loss
Returns a dictionary with keys 'train' and 'val', each containing the mean loss
Number of iterations to average over (configured globally)
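The averaging described above can be sketched as follows. This is a minimal illustration that passes the loss function, batch loader and iteration count as arguments rather than reading globals; `toy_get_batch` and `toy_loss` are hypothetical stand-ins for the real data loader and model.

```python
def estimate_loss(loss_fn, get_batch, eval_iters):
    """Return {'train': mean_loss, 'val': mean_loss}, each averaged over eval_iters batches."""
    out = {}
    for split in ("train", "val"):
        total = 0.0
        for _ in range(eval_iters):
            x, y = get_batch(split)
            total += loss_fn(x, y)
        out[split] = total / eval_iters
    return out


def toy_get_batch(split):
    # Hypothetical stand-in: 'val' targets are offset so the two splits differ.
    offset = 0.0 if split == "train" else 1.0
    return [1.0, 2.0], [1.0 + offset, 2.0 + offset]


def toy_loss(x, y):
    # Mean squared error between inputs and targets.
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


losses = estimate_loss(toy_loss, toy_get_batch, eval_iters=5)
```

Averaging over several batches gives a less noisy estimate than a single-batch loss, at the cost of extra forward passes per evaluation.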
get_lr
Current iteration number
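A warmup-plus-cosine-decay schedule of the kind named in the overview could look like the sketch below. The parameter names other than learning_rate are assumptions, and the schedule values are passed explicitly here instead of being read from global configuration.

```python
import math


def get_lr(it, learning_rate, min_lr, warmup_iters, lr_decay_iters):
    """Learning rate at iteration `it`; assumes lr_decay_iters > warmup_iters."""
    # 1) Linear warmup from near zero up to learning_rate.
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) Past the decay horizon, hold at the minimum learning rate.
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, cosine-decay from learning_rate down to min_lr.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)


# Usage with illustrative values: peak 6e-4, floor 6e-5.
schedule = [get_lr(it, 6e-4, 6e-5, 100, 1000) for it in (0, 100, 550, 1000, 5000)]
```

The rate peaks at the end of warmup, passes through the midpoint between learning_rate and min_lr halfway through the decay, and stays at min_lr afterwards.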
Training loop structure
The main training loop performs the following steps:
1. Learning rate scheduling
2. Periodic evaluation
3. Forward and backward pass, with gradient accumulation to simulate larger batch sizes
4. Gradient clipping and optimizer step
5. Logging
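The five steps can be walked through on a toy problem. The sketch below fits a one-parameter model y = w * x with hand-written gradients, a constant learning rate, and plain SGD as a stand-in for AdamW; all function names and default values are illustrative assumptions, not the script's code.

```python
def micro_batch_grad(w, batch):
    """Gradient of the mean squared error of y = w * x on one micro-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def mean_loss(w, batches):
    """Mean squared error over every example in `batches`."""
    examples = [ex for batch in batches for ex in batch]
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)


def train(micro_batches, max_iters=200, eval_interval=50, grad_clip=1.0):
    w = 0.0
    log = []
    accum_steps = len(micro_batches)
    for it in range(max_iters):
        # 1. Learning rate scheduling (constant here; a schedule would depend on `it`).
        lr = 0.05
        # 2. Periodic evaluation.
        if it % eval_interval == 0:
            log.append((it, mean_loss(w, micro_batches)))
        # 3. Forward and backward pass with gradient accumulation:
        #    each micro-batch gradient is scaled by 1/accum_steps and summed,
        #    which matches the gradient of one large batch.
        grad = 0.0
        for batch in micro_batches:
            grad += micro_batch_grad(w, batch) / accum_steps
        # 4. Gradient clipping (0.0 disables it), then the optimizer step.
        if grad_clip != 0.0 and abs(grad) > grad_clip:
            grad *= grad_clip / abs(grad)
        w -= lr * grad  # plain SGD as a stand-in for AdamW
        # 5. Logging: in this sketch the evaluation results collected in `log` serve as the log.
    return w, log


# Usage: two micro-batches drawn from y = 3 * x.
micro_batches = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w, log = train(micro_batches)
```

Scaling each micro-batch gradient by 1/accum_steps is what makes the accumulated gradient equal to the gradient over the combined batch, so memory use stays at micro-batch size while the update behaves like a larger batch.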
Configuration parameters
I/O settings
Directory for saving checkpoints
How often to evaluate on val set and save checkpoints
How often to log training metrics
Number of iterations for loss estimation
If True, exit after first evaluation (useful for testing)
If True, save checkpoint after each eval even if val loss didn’t improve
Initialization mode: 'scratch', 'resume', or a GPT-2 variant ('gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl')
Data settings
Name of dataset (must have a corresponding directory under data/)
Accumulate gradients over this many steps to simulate larger batches
Micro-batch size (per GPU if using DDP)
Context length for training sequences
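These three settings together determine how many tokens contribute to one optimizer step. A small sketch of the arithmetic follows; the name gradient_accumulation_steps and the world_size argument are assumptions, and the sketch assumes every process runs the full number of accumulation steps.

```python
def tokens_per_iter(gradient_accumulation_steps, batch_size, block_size, world_size=1):
    """Tokens processed per optimizer step across all micro-steps and processes."""
    return gradient_accumulation_steps * world_size * batch_size * block_size


# Illustrative values: 40 accumulation steps, micro-batches of 12 sequences, 1024 tokens each.
print(tokens_per_iter(40, 12, 1024))  # 491520
```

Doubling the accumulation steps while halving the micro-batch size leaves this number unchanged, which is how a large effective batch fits on a memory-constrained GPU.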
Model architecture
Number of transformer layers
Number of attention heads
Embedding dimension
Dropout rate (0.0 for pretraining, 0.1+ for finetuning)
Use bias in Linear and LayerNorm layers
Optimizer settings
Maximum learning rate
Total number of training iterations
Weight decay coefficient
AdamW beta1 parameter
AdamW beta2 parameter
Gradient clipping threshold (0.0 to disable)
Learning rate decay
Enable learning rate decay
Number of warmup iterations
Iterations for learning rate decay (should be ~= max_iters)
Minimum learning rate (should be ~= learning_rate/10)
System settings
Device to train on: 'cpu', 'cuda', 'cuda:0', 'cuda:1', 'mps', etc.
Data type for training: 'float32', 'bfloat16', or 'float16'. Automatically selects bfloat16 if supported.
Use PyTorch 2.0 compilation for faster training
DDP settings
DDP backend: 'nccl' (recommended for CUDA) or 'gloo'
Checkpointing
Checkpoints are saved to {out_dir}/ckpt.pt and contain the state needed to resume a run. To resume training, set init_from='resume' and ensure the checkpoint exists in out_dir.
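The interaction between validation loss and the always-save option listed under I/O settings can be sketched as a small decision function. The function name, the argument names, and the rule of skipping iteration 0 are assumptions of this sketch; it does not perform the actual serialization.

```python
def checkpoint_decision(val_loss, best_val_loss, always_save_checkpoint, iter_num):
    """Return (save, new_best) for one evaluation.

    save     -- whether a checkpoint should be written now
    new_best -- the best validation loss seen so far, after this evaluation
    """
    improved = val_loss < best_val_loss
    new_best = val_loss if improved else best_val_loss
    # Skipping iteration 0 avoids writing a checkpoint of untrained weights;
    # this guard is an assumption of the sketch.
    save = (improved or always_save_checkpoint) and iter_num > 0
    return save, new_best


# Usage: validation loss got worse, but always-save is on.
save, best = checkpoint_decision(3.0, 2.0, always_save_checkpoint=True, iter_num=100)
```

With the always-save option off, only evaluations that improve on the best validation loss produce a checkpoint, so ckpt.pt always holds the best model seen so far.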
WandB integration
Enable Weights & Biases logging
WandB project name
WandB run name
When WandB logging is enabled, the following metrics are logged:
- Training loss
- Validation loss
- Learning rate
- Model FLOPs Utilization (MFU)
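MFU is the ratio of achieved FLOPs per second to the hardware's peak. A rough sketch follows, using the per-token estimate 6N + 12·L·H·Q·T from Appendix B of the PaLM paper (N parameters, L layers, H heads, Q head dimension, T context length). The function name and arguments are illustrative, and the default peak of 312 TFLOPS assumes an A100 GPU in bfloat16.

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 fwdbwd_per_iter, dt, peak_flops=312e12):
    """Estimate model FLOPs utilization as a fraction of `peak_flops`.

    fwdbwd_per_iter -- forward/backward passes per iteration (micro-steps * batch size)
    dt              -- wall-clock seconds taken by one iteration
    """
    head_dim = n_embd // n_head
    # FLOPs per token: 6N for the weight matmuls plus an attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_fwdbwd = flops_per_token * block_size
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt  # FLOPs per second
    return flops_achieved / peak_flops


# Usage with illustrative GPT-2-small-like numbers and a 1-second iteration.
mfu = estimate_mfu(124e6, 12, 12, 768, 1024, fwdbwd_per_iter=480, dt=1.0)
```

Because the estimate divides by wall-clock time, halving the iteration time doubles the reported MFU; the value is only meaningful relative to the peak figure chosen for the hardware in use.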