This guide will walk you through setting up slime and running your first RL training job with a complete working example.
Environment Setup
Pull the Docker image
We strongly recommend using the official Docker image which comes pre-configured with all dependencies: docker pull slimerl/slime:latest
The Docker image includes temporary patches for SGLang and Megatron to avoid configuration issues.
Start the container
Launch an interactive container with GPU access: docker run --rm --gpus all --ipc=host --shm-size=16g \
--ulimit memlock=-1 --ulimit stack=67108864 \
-it slimerl/slime:latest /bin/bash
The container supports both H-series (H100/H200) and B-series (B200) NVIDIA GPUs without additional configuration.
Update slime to latest version
slime is pre-installed in the Docker image. Update to the latest version: cd /root/slime
git pull
pip install -e . --no-deps
Download Model and Data
Download the model weights
Download the GLM-Z1-9B model using huggingface_hub: hf download zai-org/GLM-Z1-9B-0414 --local-dir /root/GLM-Z1-9B-0414
Download training dataset
Download the DAPO math training dataset: hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
Download evaluation dataset (optional)
Download the AIME 2024 evaluation dataset: hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
Convert Model Weights
Megatron cannot directly read Hugging Face checkpoints. You must convert weights to Megatron’s torch_dist format.
Load model configuration
Source the configuration file for your target model: cd /root/slime
source scripts/models/glm4-9B.sh
The scripts/models/ directory contains configurations for commonly used models including GLM4-9B, Qwen3-4B, Qwen3-30B-A3B, and more.
Run the conversion script
Convert Hugging Face weights to Megatron torch_dist format: PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/GLM-Z1-9B-0414 \
--save /root/GLM-Z1-9B-0414_torch_dist
For larger models, use torchrun to convert with multiple GPUs or nodes for faster conversion.
Run Your First Training
Launch the training script
Start training with the provided example script: cd /root/slime
bash scripts/run-glm4-9B.sh
This script will:
Initialize Ray for distributed training
Set up SGLang inference servers
Load the Megatron training backend
Begin the rollout-training loop
Monitor training progress
The training process follows this loop:
Rollout Phase: Generate responses using the current policy
Reward Calculation: Evaluate generated responses
Training Phase: Update model weights based on rewards
Weight Sync: Synchronize updated weights to inference engines
With default settings (--num-rollout 3000), the script will run 3000 iterations of this loop.
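The four phases above can be sketched end to end. This is a toy stand-in with hypothetical helper names (`sample_prompts`, `generate`, `score`); slime's real implementation runs rollouts on SGLang servers and updates weights through Megatron.

```python
import random

# Toy stand-ins for the real components (hypothetical names; the actual
# rollout uses SGLang inference servers and Megatron for the update).
def sample_prompts(n):
    return [f"prompt-{i}" for i in range(n)]

def generate(prompt, n_samples):
    return [f"{prompt}/response-{j}" for j in range(n_samples)]

def score(response):
    return random.random()  # reward model or rule-based verifier

def train(num_rollout, rollout_batch_size, n_samples_per_prompt):
    """One iteration = rollout -> reward -> train -> weight sync."""
    samples_seen = 0
    for _ in range(num_rollout):
        prompts = sample_prompts(rollout_batch_size)           # Rollout phase
        groups = [generate(p, n_samples_per_prompt) for p in prompts]
        rewards = [[score(r) for r in g] for g in groups]      # Reward calculation
        # Training phase and weight sync to inference engines happen here.
        samples_seen += sum(len(g) for g in groups)
    return samples_seen

# With the example's settings, each iteration produces 16 x 8 = 128 samples.
print(train(num_rollout=3, rollout_batch_size=16, n_samples_per_prompt=8))
```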
View results
Training checkpoints are saved to the path specified by --save: /root/GLM-Z1-9B-0414_slime/
├── latest_checkpointed_iteration.txt
├── iter_0000020/
├── iter_0000040/
└── ...
Convert a checkpoint back to Hugging Face format: PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
--input-dir /root/GLM-Z1-9B-0414_slime/iter_0000100/ \
--output-dir /root/GLM-Z1-9B-0414-iter_100 \
--origin-hf-dir /root/GLM-Z1-9B-0414
Understanding Key Parameters
The training script configures several important parameter groups:
Model Configuration
source "${SCRIPT_DIR}/models/glm4-9B.sh"
Loads model architecture parameters required by Megatron (layers, hidden size, attention heads, etc.).
Always verify that configuration parameters match your model version. Different versions may use different values for parameters like --rotary-base.
Checkpoint Paths
CKPT_ARGS=(
--hf-checkpoint /root/GLM-Z1-9B-0414 # For tokenizer and metadata
--ref-load /root/GLM-Z1-9B-0414_torch_dist # Reference model weights
--load /root/GLM-Z1-9B-0414_slime/ # Actor checkpoint (resume)
--save /root/GLM-Z1-9B-0414_slime/ # Save path
--save-interval 20 # Save every 20 steps
)
Rollout Configuration
Controls the relationship between data generation and training:
ROLLOUT_ARGS=(
--num-rollout 3000 # Total training iterations
--rollout-batch-size 16 # Prompts per rollout
--n-samples-per-prompt 8 # Responses per prompt
--num-steps-per-rollout 1 # Training steps per rollout
--global-batch-size 128 # Samples per optimizer step
)
Important constraint: (rollout-batch-size × n-samples-per-prompt) = (global-batch-size × num-steps-per-rollout). In this example: (16 × 8) = (128 × 1) ✓
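The constraint says the rollout phase must produce exactly as many samples as the training phase consumes. A quick sanity check (hypothetical helper; slime validates its own configuration at startup):

```python
# Check that rollout production matches training consumption
# (hypothetical helper; not part of slime itself).
def check_batch_config(rollout_batch_size, n_samples_per_prompt,
                       global_batch_size, num_steps_per_rollout):
    produced = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    assert produced == consumed, (
        f"rollout produces {produced} samples per iteration "
        f"but training consumes {consumed}")
    return produced

print(check_batch_config(16, 8, 128, 1))  # the example config: 128 samples
```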
Performance Configuration
PERF_ARGS=(
--tensor-model-parallel-size 2 # Tensor parallelism
--sequence-parallel # Enable with TP
--context-parallel-size 2 # Context/sequence parallelism
--use-dynamic-batch-size # Intelligent batch packing
--max-tokens-per-gpu 4608 # Tokens per GPU in dynamic batching
)
Dynamic batching is strongly recommended. It improves training efficiency without affecting loss calculation.
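The idea behind token-based dynamic batching can be sketched as a greedy packing pass: instead of a fixed number of sequences per micro-batch, variable-length samples are packed until the per-GPU token budget (`--max-tokens-per-gpu`) is reached. This is a simplified sketch; slime's actual packing logic is more sophisticated.

```python
# Greedy token-based packing (simplified sketch of dynamic batching).
# A sequence longer than the budget still gets its own micro-batch.
def pack_by_tokens(seq_lengths, max_tokens_per_gpu):
    batches, current, used = [], [], 0
    for length in seq_lengths:
        if current and used + length > max_tokens_per_gpu:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches

# Mixed-length rollouts pack into fewer micro-batches than
# one-sequence-per-slot batching would need.
print(pack_by_tokens([4000, 500, 300, 4600, 100, 2000], max_tokens_per_gpu=4608))
```

Because packing only regroups samples (each sequence's loss is still computed over its own tokens), it speeds up training without changing the loss.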
GRPO Algorithm
GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
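GRPO's advantage estimator is group-relative: the n responses sampled for each prompt (here, --n-samples-per-prompt 8) form a group, and each response's advantage is its reward normalized against the group's mean and standard deviation. A standalone sketch (slime computes this inside the training backend):

```python
import statistics

# Group-relative advantage as in GRPO: normalize each response's reward
# within its prompt group. eps avoids division by zero when all rewards
# in a group are identical.
def grpo_advantages(group_rewards, eps=1e-6):
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four responses to one prompt: above-mean rewards get positive
# advantages, below-mean get negative ones.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

The `--eps-clip` / `--eps-clip-high` pair then applies an asymmetric PPO-style clipping range ([1 − 0.2, 1 + 0.28]) to the importance ratio when these advantages are used in the policy loss.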
Next Steps
Installation Options Explore conda installation and multi-node setup
Usage Guide Learn about all available parameters and features
Custom Functions Write custom generation and reward functions
Multi-Turn Training Train agents with tool calling and multi-turn interaction