Gradient accumulation is a technique that allows you to train with effectively larger batch sizes than your GPU memory would normally allow. This is crucial for CLIP training, where larger batch sizes typically lead to better performance.
Overview
Instead of updating model weights after every batch, gradient accumulation:
- Computes gradients for multiple small batches
- Accumulates (sums) these gradients
- Updates the model weights once after processing all accumulated batches

The resulting effective batch size is batch_size × accum_freq × num_gpus.
Basic Usage
Use the --accum-freq flag to specify how many batches to accumulate. For example, with --accum-freq 4:
- Per-GPU batch size: 128
- Accumulation frequency: 4
- Effective batch size per GPU: 128 × 4 = 512
- With 8 GPUs: Total effective batch size = 512 × 8 = 4,096
How It Works
Gradient accumulation modifies the training loop:
Without Gradient Accumulation (accum-freq = 1)
With Gradient Accumulation (accum-freq = 4)
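The difference between the two loops can be sketched with a generic, framework-free example. This is not open_clip's training code: the toy model y = w * x, the `mse_grad` and `train` helpers, and the sample data are all assumptions chosen so the arithmetic is easy to follow. It only illustrates the idea that the weights change once per accum_freq batches.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(batches, accum_freq, lr=0.1, w=0.0):
    """Plain SGD that applies one weight update every accum_freq batches."""
    accumulated, updates = 0.0, 0
    for i, batch in enumerate(batches):
        # Divide by accum_freq so the sum equals the mean over the large batch.
        accumulated += mse_grad(w, batch) / accum_freq
        if (i + 1) % accum_freq == 0:
            w -= lr * accumulated  # the only place the weights change
            accumulated = 0.0      # start a fresh accumulation window
            updates += 1
    return w, updates


# Hypothetical data: 8 (x, y) pairs generated from y = 3x.
samples = [(x / 10, 3 * x / 10) for x in range(1, 9)]
small_batches = [samples[i:i + 2] for i in range(0, 8, 2)]  # 4 batches of 2

w_accum, n_accum = train(small_batches, accum_freq=4)  # one update from 4 small batches
w_large, n_large = train([samples], accum_freq=1)      # one update from one batch of 8
w_plain, n_plain = train(small_batches, accum_freq=1)  # four separate updates
```

With equally sized small batches, `w_accum` and `w_large` agree up to floating-point rounding, while `w_plain` differs because the weights move after every small batch. A contrastive loss needs more care than this per-sample loss, because every example is compared against all others in the batch.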
Effective Batch Size Calculation
The effective batch size is batch_size × accum_freq × num_gpus:

| Per-GPU Batch | Accum Freq | GPUs | Effective Batch Size |
|---|---|---|---|
| 128 | 1 | 8 | 1,024 |
| 128 | 2 | 8 | 2,048 |
| 128 | 4 | 8 | 4,096 |
| 64 | 8 | 8 | 4,096 |
| 256 | 1 | 4 | 1,024 |
| 256 | 4 | 4 | 4,096 |
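The rows in the table can be reproduced with a few lines of Python. The helper name `effective_batch_size` is an illustrative assumption and not part of open_clip.

```python
def effective_batch_size(batch_size, accum_freq, num_gpus):
    """Number of samples that contribute to a single weight update."""
    return batch_size * accum_freq * num_gpus


# (per-GPU batch, accum freq, GPUs) for each row of the table
rows = [(128, 1, 8), (128, 2, 8), (128, 4, 8), (64, 8, 8), (256, 1, 4), (256, 4, 4)]
sizes = [effective_batch_size(*row) for row in rows]

for row, size in zip(rows, sizes):
    print(row, "->", size)
```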
Memory vs Speed Tradeoffs
Memory Considerations
Advantages:
- Reduces per-step memory usage for model activations
- Enables training larger models on limited hardware
- Allows simulation of large batch sizes

Disadvantages:
- Features from all accumulated batches are stored in memory
- Additional memory needed for intermediate loss computations
- Each batch's features are cached until the update step
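To get a feel for the size of the cached features, a back-of-the-envelope estimate can help. The sketch below counts only the final image and text embeddings held per GPU. The embedding width of 512 and 4-byte floats are assumptions for illustration, and the helper `cached_feature_bytes` is hypothetical. Activations, gradients, and any other cached data are not included.

```python
def cached_feature_bytes(batch_size, accum_freq, embed_dim=512, bytes_per_value=4):
    """Rough per-GPU size of the cached image and text embeddings.

    embed_dim=512 and bytes_per_value=4 (fp32) are illustrative assumptions.
    """
    towers = 2  # one embedding per image plus one per text
    return batch_size * accum_freq * embed_dim * towers * bytes_per_value


feature_mib = cached_feature_bytes(batch_size=128, accum_freq=4) / 2**20
print(f"cached features: {feature_mib:.1f} MiB")
```

Under these assumptions the estimate grows linearly with accum_freq, so doubling the accumulation frequency doubles the cache.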
Speed Considerations
Impact on Training Speed:
- ~2× forward passes per example (one with gradients, one without)
- Samples per second remains approximately constant
- Time per update step increases proportionally with accum_freq
- Overall throughput (samples/second) stays similar
When to Use Gradient Accumulation
Use Gradient Accumulation When:
- GPU Memory is Limited
  - Cannot fit desired batch size in memory
  - Training large models (ViT-L, ViT-H, ViT-g)
  - Using high-resolution images
- Constrained GPU Resources
  - Limited number of GPUs available
  - Need to match batch sizes from papers
  - Simulating larger-scale training
- After Trying Other Techniques
  - Already using --grad-checkpointing
  - Already using --local-loss --gather-with-grad
  - Already optimized per-GPU batch size
Avoid When:
- Memory is Sufficient: If you can fit larger batches, do so directly
- Using Distillation: Distillation requires --accum-freq 1
- Training is Already Slow: Gradient accumulation adds overhead
Recommended Workflow
Follow this sequence to optimize batch size:
Examples
Single GPU Training
Simulate a large batch size on a single GPU:
Multi-GPU Training
Scale to very large batch sizes:
Large Model Training
Train huge models with gradient accumulation:
High Resolution Images
Train with larger image sizes:
Learning Rate Adjustment
When using gradient accumulation, the effective batch size grows by a factor of accum_freq, and the number of weight updates per epoch shrinks by the same factor. Generally: no learning rate adjustment is needed when only changing --accum-freq.
However, if you’re matching a specific training recipe that used a different batch size:
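One common heuristic for this situation is linear scaling: multiply the recipe's learning rate by the ratio of your effective batch size to the recipe's. This page does not prescribe that rule, so treat the sketch below as an assumption to validate on your own runs. The helper `linearly_scaled_lr` and the numbers are hypothetical.

```python
def linearly_scaled_lr(reference_lr, reference_batch_size,
                       batch_size, accum_freq, num_gpus):
    """Scale a recipe's learning rate in proportion to the effective batch size.

    Linear scaling is a heuristic, not a guarantee; other rules exist.
    """
    effective = batch_size * accum_freq * num_gpus
    return reference_lr * effective / reference_batch_size


# Hypothetical recipe: lr 1e-3 at batch size 1,024, rerun at 128 x 4 x 8 = 4,096.
new_lr = linearly_scaled_lr(1e-3, 1024, batch_size=128, accum_freq=4, num_gpus=8)
print(new_lr)
```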
Implementation Details
Forward Passes
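With gradient accumulation, there are two forward passes per sample:
- First pass (with torch.no_grad()): Computes and caches the features of every accumulated batch
- Second pass (with gradients): Recomputes each batch's features and combines them with the cached features of the other batches to compute the contrastive loss and gradients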
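A two-pass scheme of this kind can be sketched in plain Python: features are first cached without gradients, then each small batch is recomputed with gradients while the other batches come from the cache. Everything here is an assumption made for illustration: the two-parameter toy encoder, the unnormalized features, the fixed logit scale, and the finite-difference `grad` helper standing in for autograd. It is not the library's implementation.

```python
import math


def encode(params, batch):
    """Toy 'model': scales 2-D image and text inputs by learnable weights."""
    a, b = params
    imgs = [(a * x0, x1) for (x0, x1), _ in batch]
    txts = [(t0, b * t1) for _, (t0, t1) in batch]
    return imgs, txts


def clip_loss(imgs, txts, scale=2.0):
    """Symmetric contrastive loss averaged over all image-text pairs."""
    n = len(imgs)
    logits = [[scale * (i[0] * t[0] + i[1] * t[1]) for t in txts] for i in imgs]
    total = 0.0
    for k in range(n):
        row = logits[k]
        col = [logits[r][k] for r in range(n)]
        total += math.log(sum(math.exp(v) for v in row)) - row[k]
        total += math.log(sum(math.exp(v) for v in col)) - col[k]
    return total / (2 * n)


def grad(fn, params, eps=1e-5):
    """Central finite-difference gradient of fn at params (autograd stand-in)."""
    result = []
    for k in range(len(params)):
        hi, lo = list(params), list(params)
        hi[k] += eps
        lo[k] -= eps
        result.append((fn(hi) - fn(lo)) / (2 * eps))
    return result


def full_batch_grad(params, data):
    """Reference: gradient of the loss over the whole batch at once."""
    return grad(lambda p: clip_loss(*encode(p, data)), params)


def accumulated_grad(params, data, accum_freq):
    """Two-pass accumulation; len(data) must be divisible by accum_freq."""
    size = len(data) // accum_freq
    micro = [data[j * size:(j + 1) * size] for j in range(accum_freq)]
    # Pass 1 (no gradients): cache the features of every small batch.
    cache = [encode(params, mb) for mb in micro]
    total = [0.0] * len(params)
    for j, mb in enumerate(micro):
        # Pass 2 (with gradients): recompute batch j, reuse the cache elsewhere.
        def loss_j(p, j=j, mb=mb):
            imgs, txts = [], []
            for k in range(accum_freq):
                fi, ft = encode(p, mb) if k == j else cache[k]
                imgs += fi
                txts += ft
            return clip_loss(imgs, txts)
        total = [t + g for t, g in zip(total, grad(loss_j, params))]
    return total


# Hypothetical (image, text) input pairs and starting weights.
pairs = [((1.0, 0.2), (0.9, 0.1)), ((0.1, 1.0), (0.2, 0.8)),
         ((-0.5, 0.4), (-0.6, 0.3)), ((0.7, -0.9), (0.5, -1.0))]
params = [0.8, 1.2]
g_full = full_batch_grad(params, pairs)
g_accum = accumulated_grad(params, pairs, accum_freq=2)
```

In this toy setting the summed per-batch gradients match the full-batch gradient up to numerical error, because each example's features receive gradients exactly once while still being compared against every other example.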
Loss Computation
The loss is computed accum_freq times before each weight update:
- Each accumulated batch computes its own loss
- Gradients are accumulated across all batches
- Final gradient is the sum (effectively the mean due to normalization)
Memory Usage
Memory is used for:
- Model weights and optimizer states
- Gradients (accumulated across batches)
- Features from all accum_freq batches
- Current batch activations
Monitoring Training
Key metrics when using gradient accumulation:
Compatibility
Works With:
- Mixed precision training (--precision amp)
- Gradient checkpointing (--grad-checkpointing)
- Local loss (--local-loss)
- Gather with gradients (--gather-with-grad)
- Distributed training (multi-GPU)
- All model architectures
Does Not Work With:
- Model distillation (--distill-model): requires --accum-freq 1
Best Practices
- Start Small: Test with --accum-freq 2 before using larger values
- Power of 2: Use powers of 2 for accum_freq (2, 4, 8) for better memory alignment
- Balance: Find the sweet spot between batch_size and accum_freq
- Memory First: Maximize batch_size before increasing accum_freq
- Monitor: Watch memory usage and training speed to find optimal settings
- Document: Record your effective batch size for reproducibility
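The "Memory First" and "Document" practices can be combined in a small planning helper. This is a sketch under stated assumptions: `plan_batching` is a hypothetical function, and `max_batch_per_gpu` is the largest per-GPU batch size you have found to fit in memory by experiment.

```python
import math


def plan_batching(target_batch_size, max_batch_per_gpu, num_gpus):
    """Pick the smallest accum_freq that reaches the target batch size.

    The per-GPU batch size is maximized first; accumulation covers the rest.
    """
    per_step = max_batch_per_gpu * num_gpus
    accum_freq = max(1, math.ceil(target_batch_size / per_step))
    batch_size = math.ceil(target_batch_size / (accum_freq * num_gpus))
    return batch_size, accum_freq


batch_size, accum_freq = plan_batching(4096, max_batch_per_gpu=128, num_gpus=8)
effective = batch_size * accum_freq * 8
print(f"--batch-size {batch_size} --accum-freq {accum_freq}  (effective: {effective})")
```

When the target does not divide evenly, the helper rounds the per-GPU batch size up, so the effective batch size can slightly exceed the target.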
Troubleshooting
Still Running Out of Memory
Training is Too Slow
Unstable Training
