OpenCLIP has beta support for int8 training and inference using the bitsandbytes library. This enables faster training with lower memory usage while maintaining accuracy, which is particularly beneficial for large models like ViT-Huge.
Overview
Int8 training replaces standard linear layers with 8-bit quantized versions that:
- Reduce memory usage for weights and activations
- Accelerate matrix multiplications
- Maintain numerical stability through specialized quantization schemes
- Preserve accuracy with minimal degradation
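To make the idea of an 8-bit quantized linear operation concrete, here is a minimal sketch of absmax quantization applied to a single dot product. It is a generic illustration in pure Python, not the bitsandbytes kernels; the function names `quantize_absmax` and `int8_dot` are hypothetical and exist only for this example.

```python
def quantize_absmax(values):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0] * len(values), 1.0
    scale = absmax / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(x, w):
    """Approximate dot(x, w): multiply integer codes, then rescale once."""
    xq, sx = quantize_absmax(x)
    wq, sw = quantize_absmax(w)
    acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
    return acc * sx * sw


x = [0.5, -1.0, 0.25]   # illustrative activations
w = [2.0, 0.1, -0.4]    # illustrative weights
exact = sum(a * b for a, b in zip(x, w))
print(f"exact={exact:.4f} int8 approx={int8_dot(x, w):.4f}")
```

The approximation error comes from rounding each value to one of 255 levels; real implementations choose the granularity of the scale (per row, per tensor) to keep that error small.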
Requirements
Install the bitsandbytes library:
Basic Usage
Enable int8 training by passing a layer type to the --use-bnb-linear flag:
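As a minimal sketch of what such an invocation can look like, the snippet below assembles a training command and prints it. It assumes bitsandbytes is already installed (typically via pip), that training is launched through the `open_clip_train.main` module (older checkouts may use a different module name), and the data path is a placeholder. The helper `int8_train_args` is hypothetical.

```python
import shlex


def int8_train_args(model, layer="SwitchBackLinearGlobal"):
    """Build an argument list that enables int8 linear layers."""
    return [
        "python", "-m", "open_clip_train.main",   # assumed entry point
        "--model", model,
        "--train-data", "/path/to/train_shards",  # placeholder path
        "--use-bnb-linear", layer,
    ]


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(int8_train_args("ViT-B-32")))
```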
Available Linear Layer Types
OpenCLIP supports two int8 linear layer implementations from bitsandbytes:
SwitchBackLinearGlobal
Standard 8-bit linear layer with switchback optimization:
- Good balance of speed and memory efficiency
- Recommended for most use cases
- Stable gradient computation
- Works well with all model sizes
SwitchBackLinearGlobalMemEfficient
Memory-optimized 8-bit linear layer:
- Further reduces memory usage
- Slightly slower than standard version
- Best for very large models or limited memory
- Useful when training huge models (ViT-H, ViT-g)
Performance Benefits
Training Speed
ViT-Huge Model:
- Standard training: baseline
- Int8 training: ~10% faster
- Expected improvement: 1.1x speedup
Memory Usage
- Reduced weight storage (8-bit vs 16/32-bit)
- Lower activation memory
- Enables larger batch sizes
- Can train larger models on same hardware
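The "8-bit vs 16/32-bit" comparison can be turned into a back-of-envelope estimate. The sketch below counts only the raw bytes of a weight tensor at each width; what is actually resident during training (higher-precision copies, gradients, optimizer state) depends on the implementation. The parameter count is a hypothetical figure, not a measured model size.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_storage_gib(num_params, dtype):
    """Raw storage in GiB for num_params values of the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


hypothetical_linear_params = 600_000_000  # assumed, for illustration only
for dtype in BYTES_PER_PARAM:
    gib = weight_storage_gib(hypothetical_linear_params, dtype)
    print(f"{dtype}: {gib:.2f} GiB")
```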
Accuracy
Int8 training maintains accuracy:
- No significant accuracy degradation observed
- Contrastive learning is robust to quantization
- Zero-shot performance remains comparable
- Fine-tuning results are preserved
Examples
Training ViT-B-32 with Int8
Training ViT-L-14 with Int8
Training ViT-H-14 with Memory-Efficient Int8
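The three examples named above differ mainly in the model and in which int8 layer type is passed. A minimal sketch, assuming the `open_clip_train.main` entry point and a placeholder data path, with the model-to-layer mapping taken from the headings in this section:

```python
import shlex

# Layer type per model, following the example titles in this section.
INT8_LAYER_FOR_MODEL = {
    "ViT-B-32": "SwitchBackLinearGlobal",
    "ViT-L-14": "SwitchBackLinearGlobal",
    "ViT-H-14": "SwitchBackLinearGlobalMemEfficient",
}


def example_command(model):
    """Return a printable training command for one of the example models."""
    args = [
        "python", "-m", "open_clip_train.main",   # assumed entry point
        "--model", model,
        "--train-data", "/path/to/train_shards",  # placeholder path
        "--use-bnb-linear", INT8_LAYER_FOR_MODEL[model],
    ]
    return shlex.join(args)


for name in INT8_LAYER_FOR_MODEL:
    print(example_command(name))
```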
Combining with Other Optimizations
Int8 training works well with other memory and speed optimizations:
With Mixed Precision
With Gradient Checkpointing
With Gradient Accumulation
With Distributed Training
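The combinations named in this section can be sketched as argument lists layered on top of a common int8 base command. This is an illustration under assumptions: the entry point `open_clip_train.main`, the model choice, the accumulation frequency, and the process count are placeholders, and the helper names are hypothetical.

```python
import shlex

BASE = [
    "-m", "open_clip_train.main",   # assumed entry point
    "--model", "ViT-L-14",
    "--use-bnb-linear", "SwitchBackLinearGlobal",
]

EXTRAS = {
    "mixed precision": ["--precision", "amp"],
    "gradient checkpointing": ["--grad-checkpointing"],
    "gradient accumulation": ["--accum-freq", "4"],  # value is illustrative
}


def single_gpu(extra):
    """Int8 base command plus one extra optimization."""
    return ["python"] + BASE + extra


def distributed(nproc):
    """Same base command launched with torchrun on nproc local GPUs."""
    return ["torchrun", "--nproc_per_node", str(nproc)] + BASE


for label, extra in EXTRAS.items():
    print(f"# {label}\n{shlex.join(single_gpu(extra))}")
print(f"# distributed\n{shlex.join(distributed(4))}")
```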
Int8 Inference
You can also load and use int8 models for inference:
Tutorial Notebook
For a detailed walkthrough of int8 training and inference, see the tutorial notebook, which covers:
- Setting up int8 training
- Comparing performance with standard training
- Memory usage analysis
- Accuracy evaluation
- Inference optimization
- Best practices
Current Limitations
Attention Layers
Currently, only linear layers are replaced with int8 versions. Attention layers still use standard precision. Future improvements will include:
- Int8 attention layers (coming soon)
- Further speedups when attention is refactored
- Full model quantization
Platform Support
- Supported: NVIDIA GPUs with CUDA
- Not Supported: CPU, AMD GPUs, Apple Silicon
- Requires CUDA-compatible bitsandbytes installation
Optimizer State
Optimizer states (Adam, AdamW) still use higher precision:
- Int8 only applies to model weights
- Gradients are computed in higher precision
- Optimizer momentum and variance use fp32
When to Use Int8
Recommended For:
- Large Models
  - ViT-Huge and larger
  - Models that are close to memory limits
  - When you want to increase batch size
- Limited GPU Memory
  - Training on consumer GPUs (RTX 3090, 4090)
  - Maximizing model size on available hardware
  - Enabling larger experiments
- Speed-Critical Training
  - When 10% speedup matters
  - Large-scale training runs
  - Cost-sensitive training
Not Necessary For:
- Small Models (ViT-B-32, ResNet-50)
  - Limited benefit for smaller models
  - Standard training is already fast enough
- Abundant Memory
  - If memory is not a constraint
  - When using small batch sizes
- Maximum Precision Needed
  - Research requiring exact reproducibility
  - When numerical precision is critical
Best Practices
- Start with SwitchBackLinearGlobal
  - Good default choice for most use cases
  - Balance of speed and memory
- Use with Mixed Precision
  - Combine --use-bnb-linear with --precision amp
  - Maximizes speed benefits
- Monitor Accuracy
  - Run regular zero-shot evaluations
  - Compare with baseline runs
  - Check final model performance
- Test Before Large Runs
  - Validate int8 training on a small dataset first
  - Ensure stability and convergence
  - Measure actual speedup on your hardware
- Enable for Large Models
  - Most beneficial for ViT-L and larger
  - Use SwitchBackLinearGlobalMemEfficient for ViT-H/ViT-g
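For the advice to measure the actual speedup on your hardware, a minimal timing sketch follows. The helpers `mean_step_seconds`, `speedup`, and `time_saved_fraction` are hypothetical; `step_fn` stands for whatever callable runs one training step, and on a GPU it must block until the device work has finished or the timings will be misleading.

```python
import time


def mean_step_seconds(step_fn, steps=20, warmup=3):
    """Average wall-clock seconds per call, after a few untimed warmup calls."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return (time.perf_counter() - start) / steps


def speedup(baseline_seconds, int8_seconds):
    """Ratio above 1.0 means the int8 run is faster."""
    return baseline_seconds / int8_seconds


def time_saved_fraction(ratio):
    """Fraction of wall-clock time saved for a given speedup ratio."""
    return 1.0 - 1.0 / ratio


# Illustrative numbers only: a 1.1x speedup saves about 9% of wall-clock time.
print(f"{time_saved_fraction(speedup(1.1, 1.0)):.3f}")
```

Comparing the two runs with the same batch size, precision setting, and data pipeline keeps the ratio attributable to the int8 layers rather than to other differences.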
Troubleshooting
Import Error
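One common cause of an import error is that bitsandbytes is not installed in the Python environment actually used for training. A quick check, as a minimal sketch using only the standard library (the helper name `is_importable` is hypothetical):

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("bitsandbytes"):
    print("bitsandbytes found in this environment")
else:
    print("bitsandbytes not found; install it in the environment used for training")
```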
CUDA Error
Slower Than Expected
- Ensure CUDA is properly installed
- Check GPU utilization (should be high)
- Verify mixed precision is enabled (--precision amp)
- Some models benefit more than others
Numerical Issues
- Increase warmup: --warmup 5000
- Reduce learning rate: --lr 5e-4
- Enable gradient clipping: --grad-clip-norm 1.0
- Try SwitchBackLinearGlobal instead of MemEfficient version
