FP8 Rollout with BF16 Training
You can run FP8 rollout while keeping training in BF16 by using a blockwise-quantized FP8 checkpoint for inference.
Converting Models to FP8
Convert your BF16 model to FP8 format using the conversion tool. Ensure config.json contains the correct quantization_config so that slime can automatically apply FP8 quantization during weight updates.
The FP8 checkpoint is only used for rollout inference. Training weights remain in BF16 format.
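Blockwise quantization assigns one scale factor per fixed-size block of weights, so outliers in one block don't degrade the precision of others. The sketch below (numpy, block size 128, and the E4M3 maximum of 448 are illustrative assumptions; the actual conversion tool casts to a real FP8 dtype) shows the core idea:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def blockwise_quantize(w: np.ndarray, block: int = 128):
    """Sketch of blockwise FP8 quantization: one scale per `block` weights.

    A real converter rounds to actual FP8 bit patterns; here we only
    compute per-block scales and clip the scaled values to the E4M3 range.
    """
    flat = w.reshape(-1, block)
    # one scale per block so the block's max magnitude maps to FP8_E4M3_MAX
    scales = np.abs(flat).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    q = np.clip(flat / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales

def blockwise_dequantize(q, scales, shape):
    return (q * scales).reshape(shape)

w = np.random.randn(4, 256).astype(np.float32)
q, s = blockwise_quantize(w)
w_hat = blockwise_dequantize(q, s, w.shape)
# without rounding to real FP8 bits, the round trip is exact up to float error
assert np.allclose(w, w_hat, atol=1e-5)
```

The per-block scales are what the quantization_config in config.json describes to the inference engine, so it can dequantize each block correctly at load time.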
FP8 Training and Inference
For maximum efficiency and training stability, you can use FP8 for both training and inference. This achieves:
- Higher inference throughput
- Lower training-inference mismatch
- More stable training dynamics
Quick Start
Configure Training Flags
Add these flags to your training script:
Enable the required environment variable:
Implementation Details
Here’s how FP8 training works in slime:
- Initialization: When the FP8 recipe is enabled, layers are built in an FP8 context
- Training: Weights and activations are quantized online to FP8 format, and cuBLAS FP8 GEMMs are used for the matrix multiplications in the forward and backward passes
- Weight Updates: During RL weight updates, Megatron dequantizes FP8 weights to BF16, then slime quantizes them back to FP8 before sending them to SGLang
- Checkpoint Saving: Checkpoints are dequantized to BF16 and saved in torch_dist format
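The dequantize/requantize cycle in the weight-update step can be sketched as follows. This is a simplified per-tensor version (the scale handling and FP8 rounding are illustrative assumptions; real code operates on blockwise-scaled FP8 tensors inside Megatron and TransformerEngine). It shows why the round trip through BF16 does not change the FP8 values the rollout engine receives:

```python
import numpy as np

FP8_E4M3_MAX = 448.0

def quantize(w, scale):
    # simplified: scale and clip; real code also rounds to FP8 bit patterns
    return np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

def dequantize(q, scale):
    return q * scale

# FP8 training weights held by Megatron (values + scale)
scale = 0.01
q_train = quantize(np.random.randn(8).astype(np.float32), scale)

# RL weight update: dequantize to BF16, hand off, requantize for SGLang
w_bf16 = dequantize(q_train, scale)   # Megatron: FP8 -> BF16
q_rollout = quantize(w_bf16, scale)   # slime: BF16 -> FP8 for SGLang

# with the same scale, the round trip reproduces the FP8 training values,
# keeping training-inference mismatch low
assert np.allclose(q_train, q_rollout)
```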
Only Linear and GroupLinear layers in TransformerEngine use FP8 format. The embedding and lm_head layers remain in their original precision. If --fp8-param-gather is not enabled, weights in TransformerEngine remain in BF16 format and are only cast to FP8 during GEMM operations.
Known Limitations
INT4 QAT Training
INT4 quantization-aware training (QAT) uses a Straight-Through Estimator (STE) to enable training with INT4 inference, which significantly improves throughput during the rollout generation phase.
Quick Start
Convert HuggingFace Weights to INT4
Use the direct conversion script. Ensure config.json contains the correct quantization_config for automatic INT4 quantization during weight updates.
Configure Environment Variables
Set the required environment variables for quantization:
- OPEN_TRAINING_INT4_FAKE_QAT_FLAG: Enables fake quantization operations for INT4 training
- OPEN_TRAINING_INT4_GROUP_SIZE: Specifies the block size (group size) for model quantization:
  - Set to 128 for moonlight-16B-A3B, qwen3-30B-A3B, and qwen3-235B-A22B-int4
  - Set to 32 for kimi-k2-Thinking-int4
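The fake quantization controlled by these variables can be sketched as below (a numpy illustration under assumed symmetric INT4 quantization; slime's actual kernels and scale handling differ). In the forward pass each group of OPEN_TRAINING_INT4_GROUP_SIZE weights is rounded to integers in [-8, 7] and dequantized, so training sees the values INT4 inference will see; in the backward pass the STE treats the rounding as identity so gradients flow through unchanged:

```python
import numpy as np

def int4_fake_quant(w: np.ndarray, group_size: int = 128):
    """Group-wise symmetric INT4 fake quantization (sketch).

    Forward: quantize each group of `group_size` weights to INT4 and
    dequantize immediately. Backward (STE, not shown -- this sketch has
    no autograd): gradients pass through the rounding step unchanged.
    """
    groups = w.reshape(-1, group_size)
    # symmetric scale: map each group's max magnitude to the INT4 max (7)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7)
    return (q * scales).reshape(w.shape)

w = np.random.randn(2, 256).astype(np.float32)
w_fq = int4_fake_quant(w)
# values are coarsely rounded but stay within one quantization step
assert w_fq.shape == w.shape
assert np.abs(w - w_fq).max() <= np.abs(w).max() / 7.0
```

A smaller group size (e.g. 32 for kimi-k2-Thinking-int4) means more scales and finer-grained quantization at the cost of extra metadata.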
For multi-node environments, start the Ray service according to your cluster configuration before launching training.
INT4 Rollout Only
If you only want INT4 inference during rollout without QAT training, simply set --hf-checkpoint to the converted INT4 checkpoint. No additional environment variables are needed.
Example Configuration
Here’s a complete example configuration for FP8 training with Qwen3-4B: run-qwen3-4b-fp8.sh
An INT4 QAT example for Qwen3-30B-A3B: run-qwen3-30B-A3B-int4.sh
Best Practices
Choose the Right Precision
- Use FP8 rollout + BF16 training for a simple efficiency boost
- Use FP8 training + FP8 inference for maximum throughput and stability
- Use INT4 QAT for the largest models when memory is constrained
Monitor Training Stability
- Watch for divergence when switching to lower precision
- FP8 typically provides better stability than INT4
- Adjust learning rate if needed when changing precision