This guide provides a comprehensive overview of all major configuration parameters in slime. Understanding these parameters is essential for setting up training jobs effectively.

Overview

slime training scripts are organized into several argument categories:
  • MODEL_ARGS: Model architecture and hyperparameters
  • CKPT_ARGS: Checkpoint loading and saving
  • ROLLOUT_ARGS: Data generation and sampling
  • EVAL_ARGS: Evaluation configuration
  • PERF_ARGS: Performance and parallelism
  • GRPO_ARGS: RL algorithm parameters
  • OPTIMIZER_ARGS: Optimizer settings
  • SGLANG_ARGS: Inference engine configuration

MODEL_ARGS: Model Configuration

Model arguments define the architecture and are loaded from model configuration files in scripts/models/.
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/glm4-9B.sh"
These configurations contain Megatron hyperparameters that must match your model exactly.
Always verify that parameters like --rotary-base match your specific model version. Different versions may use different values.

Overriding Model Parameters

You can override parameters after sourcing the model configuration:
source "${SCRIPT_DIR}/models/glm4-9B.sh"
MODEL_ARGS+=(--rotary-base 10000)
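Before overriding, it helps to check what the checkpoint itself declares. A small standalone sketch: in HuggingFace configs the value behind --rotary-base is usually stored as rope_theta in config.json (the demo directory and value here are illustrative, not the real GLM4-9B config):

```shell
# Verify --rotary-base against the HF checkpoint's config before training.
HF_CKPT=/tmp/hf_ckpt_demo            # stand-in for your HF checkpoint directory
mkdir -p "$HF_CKPT"
printf '{"rope_theta": 10000}\n' > "$HF_CKPT/config.json"

ROPE_THETA=$(python3 -c "import json, sys; print(json.load(open(sys.argv[1]))['rope_theta'])" "$HF_CKPT/config.json")
echo "rope_theta in config: $ROPE_THETA"   # should match the --rotary-base you pass
```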

CKPT_ARGS: Checkpoint Configuration

Checkpoint arguments control model loading and saving.
--hf-checkpoint (string, required): Path to the HuggingFace checkpoint, used for loading the tokenizer and model metadata. Model weights are not loaded from here during training.
--ref-load (string): Path to the reference model's Megatron-format checkpoint. Used as the initial checkpoint if --load is empty.
--load (string): Actor model loading path. Should typically match --save for checkpoint resumption. If empty or invalid, slime loads from --ref-load instead.
--save (string): Directory where model checkpoints are saved during training.
--save-interval (number, default: None): Number of rollout steps between checkpoint saves.

Example Configuration

CKPT_ARGS=(
   --hf-checkpoint /root/GLM-Z1-9B-0414/
   --ref-load /root/GLM-Z1-9B-0414_torch_dist
   --load /root/GLM-Z1-9B-0414_slime/
   --save /root/GLM-Z1-9B-0414_slime/
   --save-interval 20
)
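The fallback between --load and --ref-load can be sketched as follows (this mirrors the documented behavior, not slime's actual code; the empty temp directory stands in for --save/--load on a first run):

```shell
# Load-path fallback sketch: resume from --load if it has content,
# otherwise initialize from the reference checkpoint in --ref-load.
LOAD_DIR=$(mktemp -d)                           # stand-in for --load (empty on a first run)
REF_LOAD_DIR=/root/GLM-Z1-9B-0414_torch_dist    # stand-in for --ref-load

if [ -n "$(ls -A "$LOAD_DIR" 2>/dev/null)" ]; then
  RESUME_FROM=$LOAD_DIR
else
  RESUME_FROM=$REF_LOAD_DIR
fi
echo "initial checkpoint: $RESUME_FROM"
```

Because --load points at the same directory as --save, later runs pick up the newest saved checkpoint automatically.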

ROLLOUT_ARGS: Data Generation Parameters

Rollout arguments control the data sampling phase of training.

Training Loop Structure

The training process follows a "Data Sampling → Weight Update" closed loop:

1. Data Sampling (Rollout): generate responses from prompts using the inference engine.
  • --rollout-batch-size: Number of prompts per rollout
  • --n-samples-per-prompt: Responses generated per prompt

2. Weight Update (Training): train the model on the generated samples.
  • --global-batch-size: Samples per optimizer step
  • --num-steps-per-rollout: Optimizer steps per rollout

Critical Constraint

The following equation must always hold:
(rollout-batch-size × n-samples-per-prompt) = (global-batch-size × num-steps-per-rollout)
slime automatically validates this constraint and can auto-set --global-batch-size if not specified.
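The same check is easy to run by hand before launching a job (a standalone sketch of the validation slime performs internally, using the values from the example configuration below):

```shell
# Sanity-check the rollout/optimizer batch constraint.
ROLLOUT_BATCH_SIZE=16
N_SAMPLES_PER_PROMPT=8
GLOBAL_BATCH_SIZE=128
NUM_STEPS_PER_ROLLOUT=1

lhs=$((ROLLOUT_BATCH_SIZE * N_SAMPLES_PER_PROMPT))
rhs=$((GLOBAL_BATCH_SIZE * NUM_STEPS_PER_ROLLOUT))
if [ "$lhs" -ne "$rhs" ]; then
  echo "Constraint violated: ${lhs} samples generated != ${rhs} samples consumed" >&2
  exit 1
fi
echo "OK: ${lhs} samples per rollout"
```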

Data Configuration

--prompt-data (string, required): Path to the prompt dataset in JSONL format. Each line should contain the keys specified by --input-key and --label-key.
--input-key (string, default: "input"): JSON key for input prompts in the dataset.
--label-key (string): JSON key for ground-truth labels used in reward calculation.
--apply-chat-template (boolean, default: false): Apply the chat template to prompts. When enabled, input should be in OpenAI message format.
--rollout-shuffle (boolean, default: false): Whether to shuffle prompts during rollout.
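Putting those keys together, one dataset line might look like the following (using --input-key prompt and --label-key label as in the example configuration below, with --apply-chat-template enabled so the input is an OpenAI-style message list; the question and answer are made up):

```shell
# Write a one-line demo dataset and confirm it parses with the expected keys.
cat > /tmp/prompts_demo.jsonl <<'EOF'
{"prompt": [{"role": "user", "content": "What is 2 + 3?"}], "label": "5"}
EOF
KEYS=$(python3 -c "import json; print(','.join(sorted(json.load(open('/tmp/prompts_demo.jsonl')))))")
echo "keys: $KEYS"
```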

Batch Size Parameters

--rollout-batch-size (number, required): Number of prompts sampled in each rollout step.
--n-samples-per-prompt (number, default: 1): Number of responses generated per prompt (used in GRPO-like algorithms).
--global-batch-size (number): Total number of samples consumed by one optimizer step. Auto-calculated if not set.
--num-steps-per-rollout (number, default: 1): Number of optimizer steps per rollout. The default of 1 keeps training on-policy.
--num-rollout (number): Total number of rollout iterations; controls the length of training.

Sampling Parameters

--rollout-max-response-len (number): Maximum length of generated responses (equivalent to max_tokens in SGLang).
--rollout-temperature (number, default: 1.0): Sampling temperature during rollout.
--rollout-top-p (number, default: 1.0): Top-p (nucleus) sampling parameter.
--rollout-top-k (number, default: -1): Top-k sampling parameter; set to -1 to disable.

Reward Model

--rm-type (string): Built-in reward model type. Options include: deepscaler, math, dapo, f1, gpqa, ifbench, remote_rm.
--custom-rm-path (string): Path to a custom reward function. Overrides --rm-type.

Example Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle

   --rm-type deepscaler

   --num-rollout 3000
   --rollout-batch-size 16
   --n-samples-per-prompt 8
   --num-steps-per-rollout 1
   --global-batch-size 128

   --rollout-max-response-len 8192
   --rollout-temperature 1

   --balance-data
)

EVAL_ARGS: Evaluation Parameters

Evaluation inherits most rollout parameters but can be overridden.
--eval-interval (number): Number of rollouts between evaluations.
--eval-prompt-data (string[]): Evaluation dataset in the format dataset_name /path/to/data.jsonl. Multiple datasets can be specified.
--n-samples-per-eval-prompt (number, default: 1): Number of responses generated per evaluation prompt.
--eval-max-response-len (number): Maximum response length during evaluation.
--eval-top-p (number): Top-p sampling for evaluation.

Example Configuration

EVAL_ARGS=(
   --eval-interval 5
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)

PERF_ARGS: Performance and Parallelism

Performance arguments control Megatron’s parallelism strategies.

Parallelism Parameters

--tensor-model-parallel-size (number, default: 1): Tensor parallelism degree. Splits individual layers across GPUs.
--pipeline-model-parallel-size (number, default: 1): Pipeline parallelism degree. Splits the model into stages.
--context-parallel-size (number, default: 1): Context parallelism for handling long sequences.
--expert-model-parallel-size (number, default: 1): Expert parallelism for MoE models.
--sequence-parallel (boolean, default: false): Enable sequence parallelism to reduce activation memory.
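These degrees multiply: in Megatron, one model replica spans TP × PP × CP GPUs, and the remaining GPUs form the data-parallel dimension. A quick sanity check, using the degrees from the example configuration below:

```shell
# Compute the GPU footprint of a single model replica.
TP=2   # --tensor-model-parallel-size
PP=1   # --pipeline-model-parallel-size
CP=2   # --context-parallel-size
GPUS_PER_REPLICA=$((TP * PP * CP))
echo "GPUs per model replica: $GPUS_PER_REPLICA"
```

With 8 training GPUs this would leave a data-parallel size of 2.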

Dynamic Batching

--use-dynamic-batch-size (boolean, default: false): Enable dynamic batching to maximize GPU utilization with variable-length sequences.
--max-tokens-per-gpu (number): Maximum tokens per GPU when dynamic batching is enabled. The system packs samples to approach this limit.
slime trains models using data packing and ensures correct per-sample or per-token loss calculation. Enabling dynamic batch size does not affect loss accuracy and is strongly recommended.
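A back-of-envelope way to think about --max-tokens-per-gpu (the average sequence length here is purely an assumption; real lengths vary per batch):

```shell
# Rough packing estimate: how many samples fit per micro-batch per GPU.
MAX_TOKENS_PER_GPU=4608
AVG_SEQ_LEN=1536   # assumed average prompt+response length
echo "~$((MAX_TOKENS_PER_GPU / AVG_SEQ_LEN)) samples per micro-batch per GPU"
```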

Recomputation

--recompute-granularity (string): Activation recomputation granularity: full or selective.
--recompute-method (string): Recomputation method: uniform or block.
--recompute-num-layers (number): Number of layers to recompute.

Example Configuration

PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 2
   --expert-model-parallel-size 1
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 4608
)

GRPO_ARGS: RL Algorithm Parameters

GRPO and other RL algorithm configurations.
--advantage-estimator (string, default: "grpo"): Advantage estimation method. Options: grpo, gspo, reinforce_plus_plus, reinforce_plus_plus_baseline, ppo.
--use-kl-loss (boolean, default: false): Enable KL divergence calculation against the reference model. KL is computed but only affects the loss when --kl-loss-coef > 0.
--kl-loss-coef (number, default: 0.0): Coefficient of the KL penalty in the loss. Set to 0 to monitor KL without affecting training.
--kl-loss-type (string, default: "k1"): KL loss estimator: k1, k2, k3, low_var_kl.
--entropy-coef (number, default: 0.0): Coefficient of the entropy bonus in the loss.
--eps-clip (number, default: 0.2): Lower bound of the PPO clipping range.
--eps-clip-high (number): Upper bound of the PPO clipping range. If not set, clipping is symmetric around --eps-clip.
--calculate-per-token-loss (boolean, default: false): Compute the loss per token instead of per sample.
--use-tis (boolean, default: false): Enable Truncated Importance Sampling for off-policy correction.
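Together, --eps-clip and --eps-clip-high define an asymmetric clipped surrogate objective (the "clip-higher" trick popularized by DAPO). Schematically, with r_t(θ) the importance ratio and Â_t the advantage:

```latex
L^{\mathrm{clip}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\varepsilon_{\mathrm{low}},\, 1+\varepsilon_{\mathrm{high}}\right)\hat{A}_t
      \right)
    \right]
```

where ε_low comes from --eps-clip and ε_high from --eps-clip-high (ε_high = ε_low when unset). A larger ε_high lets low-probability tokens increase in probability more before being clipped.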

Example Configuration

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)

OPTIMIZER_ARGS: Optimizer Configuration

--optimizer (string, default: "adam"): Optimizer type: adam or sgd.
--lr (number, default: 1e-6): Learning rate.
--lr-decay-style (string, default: "constant"): Learning rate schedule: constant, linear, cosine.
--weight-decay (number, default: 0.1): Weight decay coefficient.
--adam-beta1 (number, default: 0.9): Adam beta1 parameter.
--adam-beta2 (number, default: 0.98): Adam beta2 parameter.
--clip-grad (number, default: 1.0): Gradient clipping threshold.

Example Configuration

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
)

SGLANG_ARGS: Inference Engine Configuration

SGLang service parameters for rollout inference.
--rollout-num-gpus-per-engine (number, default: 1): Number of GPUs per inference engine (equivalent to SGLang's tp_size).
--rollout-num-gpus (number): Total number of GPUs for inference. Ignored when --colocate is used.
slime uses sgl-router to schedule multiple SGLang servers. Without DP Attention, dp_size is calculated as rollout-num-gpus / rollout-num-gpus-per-engine.
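Applying that formula with illustrative GPU counts:

```shell
# dp_size = number of SGLang servers scheduled behind sgl-router.
ROLLOUT_NUM_GPUS=8       # --rollout-num-gpus
GPUS_PER_ENGINE=2        # --rollout-num-gpus-per-engine (tp_size)
DP_SIZE=$((ROLLOUT_NUM_GPUS / GPUS_PER_ENGINE))
echo "dp_size: $DP_SIZE"
```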

SGLang Parameter Forwarding

You can pass SGLang parameters by adding the --sglang- prefix:
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --sglang-log-level INFO
   --sglang-mem-fraction-static 0.8
)

Example Configuration

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
)

Advanced Features

Colocated Training and Inference

Deploy training and inference on the same GPUs to save resources:
ray job submit ... \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   --sglang-mem-fraction-static 0.8 \
   ...
In colocated mode, Megatron occupies GPU memory before offloading. Reduce SGLang’s memory usage with --sglang-mem-fraction-static 0.8 to avoid OOM errors.
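To make the flag concrete: --sglang-mem-fraction-static is the fraction of each GPU's memory SGLang pre-allocates for its static pool (weights plus KV cache). A rough sizing sketch, assuming 80 GB GPUs (illustrative; adjust for your hardware):

```shell
# Static pool implied by --sglang-mem-fraction-static 0.8 on an 80 GB GPU.
GPU_MEM_GB=80
POOL_GB=$(awk -v m="$GPU_MEM_GB" 'BEGIN { printf "%.0f", m * 0.8 }')
echo "SGLang static pool: ${POOL_GB} GB per GPU"
```

The remaining memory must cover Megatron's footprint before it offloads, which is why colocated setups lower this fraction.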

Dynamic Sampling

Implement DAPO-style dynamic sampling to filter low-quality data:
--over-sampling-batch-size 64 \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--dynamic-sampling-filter-path \
  slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
This over-samples 64 prompts at a time (8 responses each) and discards groups whose rewards are all identical (zero standard deviation, hence no learning signal in GRPO), keeping only informative groups to fill the rollout batch of 32 prompts.
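The keep/drop rule behind check_reward_nonzero_std can be sketched as follows (a simplified illustration, assuming integer rewards: a group is kept only when its rewards are not all identical, i.e. their standard deviation is nonzero):

```shell
# Returns success (kept) when the group's rewards are not all identical.
group_passes() {
  local min=$1 max=$1 r
  for r in "$@"; do
    if [ "$r" -lt "$min" ]; then min=$r; fi
    if [ "$r" -gt "$max" ]; then max=$r; fi
  done
  [ "$min" -ne "$max" ]
}

group_passes 1 1 1 1 && A=kept || A=dropped   # all rewards identical -> no signal
group_passes 1 0 1 1 && B=kept || B=dropped   # mixed rewards -> informative
echo "$A $B"
```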

Partial Rollout

Cache incomplete generations during dynamic sampling:
--partial-rollout \
--buffer-filter-path slime.rollout.filter_hub.buffer_filters.pop_first
Partially generated samples are resumed in the next rollout instead of being discarded, reducing wasted computation.

Next Steps

Customization

Learn how to customize generation functions, reward models, and filters

Distributed Training

Set up Ray clusters and multi-node training
