This guide will walk you through setting up slime and running your first RL training job with a complete working example.

Environment Setup

1. Pull the Docker image

We strongly recommend using the official Docker image, which comes pre-configured with all dependencies:
docker pull slimerl/slime:latest
The Docker image includes temporary patches for SGLang and Megatron to avoid configuration issues.
2. Start the container

Launch an interactive container with GPU access:
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it slimerl/slime:latest /bin/bash
The container supports both H-series (H100/H200) and B-series (B200) NVIDIA GPUs without additional configuration.
3. Update slime to the latest version

slime is pre-installed in the Docker image. Update to the latest version:
cd /root/slime
git pull
pip install -e . --no-deps

Download Model and Data

1. Download the model weights

Download the GLM-Z1-9B model using huggingface_hub:
hf download zai-org/GLM-Z1-9B-0414 --local-dir /root/GLM-Z1-9B-0414
2. Download the training dataset

Download the DAPO math training dataset:
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k
3. Download the evaluation dataset (optional)

Download the AIME 2024 evaluation dataset:
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

Convert Model Weights

Megatron cannot directly read Hugging Face checkpoints. You must convert weights to Megatron’s torch_dist format.
1. Load the model configuration

Source the configuration file for your target model:
cd /root/slime
source scripts/models/glm4-9B.sh
The scripts/models/ directory contains configurations for commonly used models including GLM4-9B, Qwen3-4B, Qwen3-30B-A3B, and more.
2. Run the conversion script

Convert Hugging Face weights to Megatron torch_dist format:
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/GLM-Z1-9B-0414 \
    --save /root/GLM-Z1-9B-0414_torch_dist
For larger models, run the conversion with torchrun across multiple GPUs or nodes to speed it up.

Run Your First Training

1. Launch the training script

Start training with the provided example script:
cd /root/slime
bash scripts/run-glm4-9B.sh
This script will:
  • Initialize Ray for distributed training
  • Set up SGLang inference servers
  • Load the Megatron training backend
  • Begin the rollout-training loop
2. Monitor training progress

The training process follows this loop:
  1. Rollout Phase: Generate responses using the current policy
  2. Reward Calculation: Evaluate generated responses
  3. Training Phase: Update model weights based on rewards
  4. Weight Sync: Synchronize updated weights to inference engines
With default settings (--num-rollout 3000), the script will run 3000 iterations of this loop.
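The loop above can be sketched in a few lines of Python. Everything here (the `DummyPolicy` class, `run_loop`, and the toy length-based reward) is illustrative only and not part of slime's actual API:

```python
# Illustrative sketch of the rollout-training loop; names are hypothetical,
# not slime's real interfaces.

class DummyPolicy:
    """Stands in for the actor model plus its inference engine."""
    def __init__(self):
        self.version = 0

    def generate(self, prompts):
        # 1. Rollout phase: sample responses with the current policy
        return [f"response to {p}" for p in prompts]

    def update(self, responses, rewards):
        # 3. Training phase: one optimizer update based on rewards
        self.version += 1

    def sync_weights_to_inference(self):
        # 4. Weight sync: in slime this pushes updated weights to SGLang
        pass

def run_loop(policy, num_rollout, prompts):
    for _ in range(num_rollout):
        responses = policy.generate(prompts)       # 1. rollout
        rewards = [len(r) for r in responses]      # 2. reward (toy scorer)
        policy.update(responses, rewards)          # 3. training
        policy.sync_weights_to_inference()         # 4. weight sync
    return policy.version

print(run_loop(DummyPolicy(), 3, ["p1", "p2"]))  # → 3
```

In the real script, `--num-rollout` sets the iteration count of this outer loop.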
3. View results

Training checkpoints are saved to the path specified by --save:
/root/GLM-Z1-9B-0414_slime/
├── latest_checkpointed_iteration.txt
├── iter_0000020/
├── iter_0000040/
└── ...
Convert a checkpoint back to Hugging Face format:
PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
  --input-dir /root/GLM-Z1-9B-0414_slime/iter_0000100/ \
  --output-dir /root/GLM-Z1-9B-0414-iter_100 \
  --origin-hf-dir /root/GLM-Z1-9B-0414

Understanding Key Parameters

The training script configures several important parameter groups:

Model Configuration

source "${SCRIPT_DIR}/models/glm4-9B.sh"
Loads model architecture parameters required by Megatron (layers, hidden size, attention heads, etc.).
Always verify that configuration parameters match your model version. Different versions may use different values for parameters like --rotary-base.

Checkpoint Paths

CKPT_ARGS=(
   --hf-checkpoint /root/GLM-Z1-9B-0414       # For tokenizer and metadata
   --ref-load /root/GLM-Z1-9B-0414_torch_dist # Reference model weights
   --load /root/GLM-Z1-9B-0414_slime/         # Actor checkpoint (resume)
   --save /root/GLM-Z1-9B-0414_slime/         # Save path
   --save-interval 20                          # Save every 20 steps
)

Rollout Configuration

Controls the relationship between data generation and training:
ROLLOUT_ARGS=(
   --num-rollout 3000              # Total training iterations
   --rollout-batch-size 16         # Prompts per rollout
   --n-samples-per-prompt 8        # Responses per prompt
   --num-steps-per-rollout 1       # Training steps per rollout
   --global-batch-size 128         # Samples per optimizer step
)
Important constraint: rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout.
In this example: 16 × 8 = 128 × 1.
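You can sanity-check this constraint with simple arithmetic before launching a run:

```python
# Verify the rollout/batch-size constraint using the script's default values.
rollout_batch_size = 16
n_samples_per_prompt = 8
global_batch_size = 128
num_steps_per_rollout = 1

samples_per_rollout = rollout_batch_size * n_samples_per_prompt   # 128 generated
samples_consumed = global_batch_size * num_steps_per_rollout      # 128 trained on
assert samples_per_rollout == samples_consumed, "batch sizes are inconsistent"
```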

Performance Settings

PERF_ARGS=(
   --tensor-model-parallel-size 2    # Tensor parallelism
   --sequence-parallel               # Enable with TP
   --context-parallel-size 2         # Context/sequence parallelism
   --use-dynamic-batch-size          # Intelligent batch packing
   --max-tokens-per-gpu 4608         # Tokens per GPU in dynamic batching
)
Dynamic batching is strongly recommended. It improves training efficiency without affecting loss calculation.
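The idea behind dynamic batching can be illustrated with a toy greedy packer (a standalone sketch, not slime's implementation): instead of a fixed number of samples per micro-batch, samples are accumulated until adding the next one would exceed the per-GPU token budget:

```python
# Toy sketch of dynamic batch packing: fill each micro-batch up to a
# per-GPU token budget rather than using a fixed sample count.
def pack_by_tokens(sample_lengths, max_tokens_per_gpu):
    batches, current, used = [], [], 0
    for length in sample_lengths:
        # Flush the current batch if this sample would overflow the budget.
        if current and used + length > max_tokens_per_gpu:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches

print(pack_by_tokens([3000, 1500, 1200, 4000, 500], 4608))
# → [[3000, 1500], [1200], [4000, 500]]
```

Packing by tokens keeps GPU utilization even when response lengths vary widely, which is common in RL rollouts.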

GRPO Algorithm

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
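As a rough illustration of the group-relative advantage idea behind GRPO (a sketch, not slime's implementation): the `n-samples-per-prompt` responses to each prompt form one group, and each response's advantage is its reward standardized within that group:

```python
# Sketch of a GRPO-style group-relative advantage: standardize each reward
# against its own group's mean and standard deviation. Illustrative only.
def grpo_advantages(group_rewards, eps=1e-6):
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards in a group are equal.
    return [(r - mean) / (std + eps) for r in group_rewards]

print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed relative to the group, GRPO needs no learned value model; the asymmetric `--eps-clip` / `--eps-clip-high` bounds then clip the policy ratio during the update.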

Next Steps

Installation Options

Explore conda installation and multi-node setup

Usage Guide

Learn about all available parameters and features

Custom Functions

Write custom generation and reward functions

Multi-Turn Training

Train agents with tool calling and multi-turn interaction
