This guide will walk you through setting up slime and running your first RL training job with a complete working example.
Environment Setup
Pull the Docker image
We strongly recommend using the official Docker image which comes pre-configured with all dependencies: docker pull slimerl/slime:latest
The Docker image includes temporary patches for SGLang and Megatron to avoid configuration issues.
Start the container
Launch an interactive container with GPU access: docker run --rm --gpus all --ipc=host --shm-size=16g \
--ulimit memlock=-1 --ulimit stack=67108864 \
-it slimerl/slime:latest /bin/bash
The container supports both H-series (H100/H200) and B-series (B200) NVIDIA GPUs without additional configuration.
Update slime to latest version
slime is pre-installed in the Docker image. Update to the latest version: cd /root/slime
git pull
pip install -e . --no-deps
Download Model and Data
Download the model weights
Download the GLM-Z1-9B model using huggingface_hub: hf download zai-org/GLM-Z1-9B-0414 --local-dir /root/GLM-Z1-9B-0414
Download training dataset
Download the DAPO math training dataset: hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
Download evaluation dataset (optional)
Download the AIME 2024 evaluation dataset: hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
Convert Model Weights
Megatron cannot directly read Hugging Face checkpoints. You must convert weights to Megatron’s torch_dist format.
Load model configuration
Source the configuration file for your target model: cd /root/slime
source scripts/models/glm4-9B.sh
The scripts/models/ directory contains configurations for commonly used models including GLM4-9B, Qwen3-4B, Qwen3-30B-A3B, and more.
Run the conversion script
Convert Hugging Face weights to Megatron torch_dist format: PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/GLM-Z1-9B-0414 \
--save /root/GLM-Z1-9B-0414_torch_dist
For larger models, use torchrun to convert with multiple GPUs or nodes for faster conversion.
Run Your First Training
Launch the training script
Start training with the provided example script: cd /root/slime
bash scripts/run-glm4-9B.sh
This script will:
Initialize Ray for distributed training
Set up SGLang inference servers
Load the Megatron training backend
Begin the rollout-training loop
Monitor training progress
The training process follows this loop:
Rollout Phase: Generate responses using the current policy
Reward Calculation: Evaluate generated responses
Training Phase: Update model weights based on rewards
Weight Sync: Synchronize updated weights to inference engines
With default settings (--num-rollout 3000), the script will run 3000 iterations of this loop.
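The four phases above can be sketched end to end. This is a toy stand-in with hypothetical helper names (`sample_prompts`, `generate`, `score`); slime's real implementation runs rollouts on SGLang servers and updates weights through Megatron.

```python
import random

# Toy stand-ins for the real components (hypothetical names; the actual
# rollout uses SGLang inference servers and Megatron for the update).
def sample_prompts(n):
    return [f"prompt-{i}" for i in range(n)]

def generate(prompt, n_samples):
    return [f"{prompt}/response-{j}" for j in range(n_samples)]

def score(response):
    return random.random()  # reward model or rule-based verifier

def train(num_rollout, rollout_batch_size, n_samples_per_prompt):
    """One iteration = rollout -> reward -> train -> weight sync."""
    samples_seen = 0
    for _ in range(num_rollout):
        prompts = sample_prompts(rollout_batch_size)           # Rollout phase
        groups = [generate(p, n_samples_per_prompt) for p in prompts]
        rewards = [[score(r) for r in g] for g in groups]      # Reward calculation
        # Training phase and weight sync to inference engines happen here.
        samples_seen += sum(len(g) for g in groups)
    return samples_seen

# With the example's settings, each iteration produces 16 x 8 = 128 samples.
print(train(num_rollout=3, rollout_batch_size=16, n_samples_per_prompt=8))
```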
View results
Training checkpoints are saved to the path specified by --save: /root/GLM-Z1-9B-0414_slime/
├── latest_checkpointed_iteration.txt
├── iter_0000020/
├── iter_0000040/
└── ...
Convert a checkpoint back to Hugging Face format: PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
--input-dir /root/GLM-Z1-9B-0414_slime/iter_0000100/ \
--output-dir /root/GLM-Z1-9B-0414-iter_100 \
--origin-hf-dir /root/GLM-Z1-9B-0414
Understanding Key Parameters
The training script configures several important parameter groups:
Model Configuration
source "${SCRIPT_DIR}/models/glm4-9B.sh"
Loads model architecture parameters required by Megatron (layers, hidden size, attention heads, etc.).
Always verify that configuration parameters match your model version. Different versions may use different values for parameters like --rotary-base.
Checkpoint Paths
CKPT_ARGS=(
--hf-checkpoint /root/GLM-Z1-9B-0414 # For tokenizer and metadata
--ref-load /root/GLM-Z1-9B-0414_torch_dist # Reference model weights
--load /root/GLM-Z1-9B-0414_slime/ # Actor checkpoint (resume)
--save /root/GLM-Z1-9B-0414_slime/ # Save path
--save-interval 20 # Save every 20 steps
)
Rollout Configuration
Controls the relationship between data generation and training:
ROLLOUT_ARGS=(
--num-rollout 3000 # Total training iterations
--rollout-batch-size 16 # Prompts per rollout
--n-samples-per-prompt 8 # Responses per prompt
--num-steps-per-rollout 1 # Training steps per rollout
--global-batch-size 128 # Samples per optimizer step
)
Important constraint: (rollout-batch-size × n-samples-per-prompt) = (global-batch-size × num-steps-per-rollout). In this example: (16 × 8) = (128 × 1) ✓
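The constraint says the rollout phase must produce exactly as many samples as the training phase consumes. A quick sanity check (hypothetical helper; slime validates its own configuration at startup):

```python
# Check that rollout production matches training consumption
# (hypothetical helper; not part of slime itself).
def check_batch_config(rollout_batch_size, n_samples_per_prompt,
                       global_batch_size, num_steps_per_rollout):
    produced = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    assert produced == consumed, (
        f"rollout produces {produced} samples per iteration "
        f"but training consumes {consumed}")
    return produced

print(check_batch_config(16, 8, 128, 1))  # the example config: 128 samples
```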
Performance Configuration
PERF_ARGS=(
--tensor-model-parallel-size 2 # Tensor parallelism
--sequence-parallel # Enable with TP
--context-parallel-size 2 # Context/sequence parallelism
--use-dynamic-batch-size # Intelligent batch packing
--max-tokens-per-gpu 4608 # Tokens per GPU in dynamic batching
)
Dynamic batching is strongly recommended. It improves training efficiency without affecting loss calculation.
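The idea behind token-based dynamic batching can be sketched as a greedy packing pass: instead of a fixed number of sequences per micro-batch, variable-length samples are packed until the per-GPU token budget (`--max-tokens-per-gpu`) is reached. This is a simplified sketch; slime's actual packing logic is more sophisticated.

```python
# Greedy token-based packing (simplified sketch of dynamic batching).
# A sequence longer than the budget still gets its own micro-batch.
def pack_by_tokens(seq_lengths, max_tokens_per_gpu):
    batches, current, used = [], [], 0
    for length in seq_lengths:
        if current and used + length > max_tokens_per_gpu:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches

# Mixed-length rollouts pack into fewer micro-batches than
# one-sequence-per-slot batching would need.
print(pack_by_tokens([4000, 500, 300, 4600, 100, 2000], max_tokens_per_gpu=4608))
```

Because packing only regroups samples (each sequence's loss is still computed over its own tokens), it speeds up training without changing the loss.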
GRPO Algorithm
GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
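GRPO's advantage estimator is group-relative: the n responses sampled for each prompt (here, --n-samples-per-prompt 8) form a group, and each response's advantage is its reward normalized against the group's mean and standard deviation. A standalone sketch (slime computes this inside the training backend):

```python
import statistics

# Group-relative advantage as in GRPO: normalize each response's reward
# within its prompt group. eps avoids division by zero when all rewards
# in a group are identical.
def grpo_advantages(group_rewards, eps=1e-6):
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four responses to one prompt: above-mean rewards get positive
# advantages, below-mean get negative ones.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

The `--eps-clip` / `--eps-clip-high` pair then applies an asymmetric PPO-style clipping range ([1 − 0.2, 1 + 0.28]) to the importance ratio when these advantages are used in the policy loss.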
Next Steps
Installation Options Explore conda installation and multi-node setup
Usage Guide Learn about all available parameters and features
Custom Functions Write custom generation and reward functions
Multi-Turn Training Train agents with tool calling and multi-turn interaction