This guide walks you through setting up slime and running your first RL training job with a complete working example.
Environment Setup
Pull the Docker image
We strongly recommend using the official Docker image, which comes pre-configured with all dependencies:
The Docker image includes temporary patches for SGLang and Megatron to avoid configuration issues.
Download Model and Data
Convert Model Weights
Load model configuration
Source the configuration file for your target model:
The scripts/models/ directory contains configurations for commonly used models, including GLM4-9B, Qwen3-4B, Qwen3-30B-A3B, and more.
Run Your First Training
Launch the training script
Start training with the provided example script:
This script will:
- Initialize Ray for distributed training
- Set up SGLang inference servers
- Load the Megatron training backend
- Begin the rollout-training loop
Monitor training progress
The training process follows this loop:
- Rollout Phase: Generate responses using the current policy
- Reward Calculation: Evaluate generated responses
- Training Phase: Update model weights based on rewards
- Weight Sync: Synchronize updated weights to inference engines
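The four phases above can be sketched as a toy loop. This is an illustration of the control flow only, not slime's implementation: `ToyPolicy`, `ToyInferenceEngine`, `run_loop`, and `reward_fn` are hypothetical names invented for this sketch, and the "training" step is a placeholder that only bumps a version counter.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for the trained model; `version` counts updates."""
    version: int = 0


@dataclass
class ToyInferenceEngine:
    """Hypothetical stand-in for an inference server holding a weight copy."""
    weights_version: int = 0

    def generate(self, prompts: List[str]) -> List[str]:
        # A real engine would sample from the model; here we only tag the
        # response with the weight version that produced it.
        return [f"{p} -> response(v{self.weights_version})" for p in prompts]


def run_loop(
    num_rollout: int,
    prompts: List[str],
    reward_fn: Callable[[str], float],
) -> Tuple[ToyPolicy, ToyInferenceEngine, List[float]]:
    policy = ToyPolicy()
    engine = ToyInferenceEngine()
    mean_rewards: List[float] = []

    for _ in range(num_rollout):
        # 1. Rollout phase: generate responses with the current policy.
        responses = engine.generate(prompts)
        # 2. Reward calculation: score every generated response.
        rewards = [reward_fn(r) for r in responses]
        # 3. Training phase: placeholder for a real gradient update.
        policy.version += 1
        # 4. Weight sync: push the updated weights to the inference engine.
        engine.weights_version = policy.version
        mean_rewards.append(sum(rewards) / len(rewards))

    return policy, engine, mean_rewards


policy, engine, mean_rewards = run_loop(
    num_rollout=3,
    prompts=["prompt-a", "prompt-b"],
    reward_fn=lambda response: float(len(response)),
)
```

After the loop, the inference engine holds the same weight version as the policy, which is the invariant the weight sync step is meant to maintain.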
With default settings (--num-rollout 3000), the script will run 3000 iterations of this loop.
Understanding Key Parameters
The training script configures several important parameter groups:
Model Configuration
Checkpoint Paths
Rollout Configuration
Controls the relationship between data generation and training:
Important constraint:
(rollout-batch-size × n-samples-per-prompt) = (global-batch-size × num-steps-per-rollout)
In this example: (16 × 8) = (128 × 1) ✓
Performance Settings
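The rollout batch-size constraint stated under Rollout Configuration can be checked before launching a job. The helper below is a minimal sketch: `check_rollout_batch_constraint` is a hypothetical function written for this guide, not part of slime, and its argument names simply mirror the command-line flags.

```python
def check_rollout_batch_constraint(
    rollout_batch_size: int,
    n_samples_per_prompt: int,
    global_batch_size: int,
    num_steps_per_rollout: int,
) -> int:
    """Return the number of samples per rollout if the constraint holds."""
    # Samples produced by one rollout: prompts x responses per prompt.
    generated = rollout_batch_size * n_samples_per_prompt
    # Samples consumed by training during one rollout.
    consumed = global_batch_size * num_steps_per_rollout
    if generated != consumed:
        raise ValueError(
            f"rollout produces {generated} samples but training consumes "
            f"{consumed}; adjust the batch sizes so they match"
        )
    return generated


# Values from the example in this guide: (16 x 8) = (128 x 1).
samples_per_rollout = check_rollout_batch_constraint(
    rollout_batch_size=16,
    n_samples_per_prompt=8,
    global_batch_size=128,
    num_steps_per_rollout=1,
)
```

A mismatch, for example a global batch size of 64 with the other values unchanged, raises `ValueError` instead of letting the job start with inconsistent settings.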
GRPO Algorithm
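For orientation, GRPO scores each response relative to the other responses sampled for the same prompt, which is why several samples per prompt are generated. The sketch below shows that group-relative normalization in its commonly described form. It is not taken from slime's code: `group_relative_advantages` is a hypothetical helper, and the use of the population standard deviation and the `eps` value are assumptions made for this illustration.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one prompt's sample group to zero mean."""
    mean = statistics.fmean(rewards)
    # Assumption: population standard deviation; implementations differ.
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt with 8 sampled responses, scored 1.0 (correct) or 0.0 (incorrect).
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above the group mean receive positive advantages and the rest receive negative ones. A group in which every response gets the same reward yields all-zero advantages and therefore contributes no learning signal.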
Next Steps
Installation Options
Explore conda installation and multi-node setup
Usage Guide
Learn about all available parameters and features
Custom Functions
Write custom generation and reward functions
Multi-Turn Training
Train agents with tool calling and multi-turn interaction