Enabling Deterministic Training
To enable fully deterministic training, you need to configure both SGLang (for rollout) and Megatron (for training) to use deterministic operations.

Prerequisites
Configuration Flags
Add these flags to your training configuration.

Environment Variables

Set these environment variables to ensure deterministic operations throughout the stack:

| Variable | Purpose | Value |
|---|---|---|
| `NCCL_ALGO` | Force NCCL to use the Ring algorithm | `Ring` |
| `NVTE_ALLOW_NONDETERMINISTIC_ALGO` | Disable non-deterministic ops in TransformerEngine | `0` |
| `CUBLAS_WORKSPACE_CONFIG` | Enable deterministic cuBLAS operations | `:4096:8` |
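Translated into a launch environment, the table corresponds to exports like the following (a sketch; set these in the shell that starts both the rollout and training processes):

```shell
# Deterministic-mode environment, set before launching rollout and training.
export NCCL_ALGO=Ring                      # force the Ring collective algorithm
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0  # TransformerEngine: deterministic ops only
export CUBLAS_WORKSPACE_CONFIG=:4096:8     # deterministic cuBLAS workspaces
```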
The `CUBLAS_WORKSPACE_CONFIG` format is `:size:count`, with the size given in KiB. The value `:4096:8` allocates 8 workspaces of 4096 KiB each (32 MiB total) for deterministic algorithms. This may need adjustment for very large models.

Complete Example: GSM8K with Qwen2.5-0.5B
We provide a fully deterministic training example using Qwen2.5-0.5B on GSM8K. This example demonstrates bitwise-reproducible training.

Setup
Training Script
Here’s the complete training script with all deterministic settings: `run-qwen2.5-0.5B-reproducibility.sh`
Verification
The reproducibility example includes WandB logging to verify bitwise determinism. Running the same script multiple times should produce identical:

- Loss curves
- Reward curves
- Model outputs
- Checkpoint weights
For true bitwise reproducibility, you must also fix:
- Random seeds (automatically handled by slime)
- Dataset order (use `--rollout-shuffle` consistently)
- Number of workers and GPUs
- CUDA version and driver version
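The checkpoint-weights check can be scripted with standard checksums. A minimal sketch of the comparison logic, using stand-in files (in practice, point `sha256sum` at the checkpoints written by two separate runs of the same script):

```shell
# Stand-in files play the role of checkpoints from two runs of the
# same deterministic script; identical runs yield identical digests.
mkdir -p /tmp/ckpt_demo
printf 'weights' > /tmp/ckpt_demo/run1.pt
printf 'weights' > /tmp/ckpt_demo/run2.pt

a=$(sha256sum /tmp/ckpt_demo/run1.pt | cut -d' ' -f1)
b=$(sha256sum /tmp/ckpt_demo/run2.pt | cut -d' ' -f1)
if [ "$a" = "$b" ]; then echo "bitwise identical"; else echo "runs diverged"; fi
# prints "bitwise identical"
```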
Performance Impact
Deterministic mode has some performance trade-offs.

Throughput Reduction
Expect 10-30% reduction in throughput compared to non-deterministic mode, primarily due to:
- FlashInfer vs Flash Attention 3
- Ring NCCL algorithm
- Deterministic cuBLAS operations
Memory Overhead
Deterministic operations require additional workspace memory:
- cuBLAS workspaces: ~32 MiB (configured via `CUBLAS_WORKSPACE_CONFIG`)
- Minimal impact on large models
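Given the `:size:count` format of `CUBLAS_WORKSPACE_CONFIG` (size in KiB), the footprint of the `:4096:8` setting works out as:

```shell
# Total cuBLAS workspace implied by CUBLAS_WORKSPACE_CONFIG=:4096:8
# (size is in KiB, so 8 workspaces x 4096 KiB each).
size_kib=4096
count=8
total_kib=$((size_kib * count))
echo "${total_kib} KiB = $((total_kib / 1024)) MiB"  # prints "32768 KiB = 32 MiB"
```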
What Is Deterministic?
In deterministic mode, the following are guaranteed to be bitwise identical across runs:

Guaranteed Deterministic
- Matrix multiplications (GEMM)
- Attention operations
- Activation functions
- Optimizer updates
- Gradient computations
- NCCL collectives
- Random number generation (with fixed seed)
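The seeded-RNG guarantee can be illustrated with the stdlib generator as a stand-in (the real guarantee covers the framework's CPU/CUDA generators, which slime seeds for you):

```shell
# Two independently seeded draws are bit-identical; this is the property
# slime relies on when it fixes seeds across runs. (Stdlib RNG used here
# as a stand-in for the CPU/CUDA generators.)
a=$(python3 -c "import random; random.seed(0); print(random.random())")
b=$(python3 -c "import random; random.seed(0); print(random.random())")
[ "$a" = "$b" ] && echo "identical"  # prints "identical"
```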
Not Deterministic
Advanced Configuration
Custom cuBLAS Workspace Size
For very large models, you may need to increase the cuBLAS workspace size, for example with a larger `CUBLAS_WORKSPACE_CONFIG` setting such as `:8192:8`.

Deterministic Data Loading

Ensure data loading is deterministic: keep the dataset order, seeds, and number of data-loading workers fixed across runs.

Multi-Node Determinism
For multi-node training, ensure:

- All nodes use the same CUDA version
- All nodes use the same NCCL version
- Network topology is consistent
- Nodes are synchronized (NTP)
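One way to sanity-check the version requirements is to record a toolchain fingerprint on every node and diff the results. A sketch (how you collect the files depends on your cluster; the paths are placeholders):

```shell
# Write a per-node fingerprint of the toolchain; collect these files from
# all nodes and diff them -- any difference is a determinism risk.
out="/tmp/env.$(hostname).txt"
{
  echo "cuda=$(nvcc --version 2>/dev/null | tail -n 1)"
  echo "torch=$(python3 -c 'import torch; print(torch.__version__)' 2>/dev/null)"
} > "$out"
cat "$out"
# Then, after copying the files to one place:
#   diff /tmp/env.nodeA.txt /tmp/env.nodeB.txt
```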
Debugging Non-Determinism
If you encounter non-determinism despite enabling deterministic mode:

1. Check Flash Attention
2. Verify Environment Variables
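A quick way to do this is to print the three variables from the environment that actually launches training (unset variables show up as `<unset>`):

```shell
# Print the deterministic-mode variables as the launch environment sees them.
for var in NCCL_ALGO NVTE_ALLOW_NONDETERMINISTIC_ALGO CUBLAS_WORKSPACE_CONFIG; do
  printf '%s=%s\n' "$var" "$(printenv "$var" || echo '<unset>')"
done
```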
3. Check PyTorch Determinism
`torch.are_deterministic_algorithms_enabled()`, `torch.backends.cudnn.deterministic`, and `torch.backends.cudnn.benchmark` should report `True`, `True`, `False` respectively in deterministic mode.
4. Compare Checksums
After training, compare checkpoint checksums across runs (e.g. with `sha256sum`).

Best Practices
Use Determinism for Critical Experiments
Enable deterministic mode for:
- Paper results that need exact reproduction
- Debugging training instabilities
- Ablation studies requiring precise comparison
Document Your Environment
When publishing results, document:
- CUDA version: `nvcc --version`
- PyTorch version: `python -c 'import torch; print(torch.__version__)'`
- Slime commit hash: `git rev-parse HEAD`
- GPU type: `nvidia-smi --query-gpu=name --format=csv,noheader`
Test Reproducibility Early
Run a short deterministic training test early in your project to confirm your stack produces bitwise-identical results before committing to long runs.
Balance Performance and Reproducibility
Consider a hybrid approach:
- Use deterministic mode for final benchmark runs
- Use non-deterministic mode for development and hyperparameter tuning
- Re-run important experiments in deterministic mode to verify
Future Improvements
Planned enhancements to reproducibility in slime:

- Automatic detection of non-deterministic operations with warnings
- Per-component determinism flags (e.g., deterministic rollout but non-deterministic training)
- Better performance optimization for deterministic mode
- Reproducibility verification tests in CI/CD