Overview
slime training scripts are organized into several argument categories:

- MODEL_ARGS: Model architecture and hyperparameters
- CKPT_ARGS: Checkpoint loading and saving
- ROLLOUT_ARGS: Data generation and sampling
- EVAL_ARGS: Evaluation configuration
- PERF_ARGS: Performance and parallelism
- GRPO_ARGS: RL algorithm parameters
- OPTIMIZER_ARGS: Optimizer settings
- SGLANG_ARGS: Inference engine configuration
MODEL_ARGS: Model Configuration
Model arguments define the architecture and are loaded from model configuration files in scripts/models/.
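As a sketch, a training script typically loads one of these files before launching (the file name below is illustrative; pick the file matching your checkpoint):

```shell
# Load MODEL_ARGS for the chosen architecture.
# The exact file name depends on which model you train.
source scripts/models/qwen2.5-7B.sh
```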
Overriding Model Parameters
You can override parameters after sourcing the model configuration file.

CKPT_ARGS: Checkpoint Configuration
Checkpoint arguments control model loading and saving.

- --hf-checkpoint: Path to the HuggingFace checkpoint for loading the tokenizer and model metadata. Model weights are not loaded from here during training.
- --ref-load: Path to the reference model's Megatron-format checkpoint. Used as the initial checkpoint if --load is empty.
- --load: Actor model loading path. Should typically match --save for checkpoint resumption. If empty or invalid, loads from --ref-load instead.
- --save: Directory where model checkpoints are saved during training.
- --save-interval: Number of rollout steps between checkpoint saves.
Example Configuration
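The paths and interval below are placeholders; this is a sketch of how slime scripts typically group these flags into a bash array:

```shell
CKPT_ARGS=(
   --hf-checkpoint /path/to/hf_model
   --ref-load /path/to/megatron_ref_ckpt
   --load /path/to/save_dir
   --save /path/to/save_dir
   --save-interval 20
)
```

Pointing --load at the same directory as --save lets a restarted run resume from its latest checkpoint.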
ROLLOUT_ARGS: Data Generation Parameters
Rollout arguments control the data sampling phase of training.

Training Loop Structure
The training process follows a “Data Sampling → Weight Update” closed loop:

Data Sampling (Rollout)
Generate responses from prompts using the inference engine:
- --rollout-batch-size: Number of prompts per rollout
- --n-samples-per-prompt: Responses generated per prompt
Critical Constraint
The following equation must always hold:

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

slime auto-calculates --global-batch-size if not specified.
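As a sanity check, the constraint can be verified with plain arithmetic (variable names mirror the CLI flags; the numbers are placeholders):

```python
# Placeholder values for one rollout step.
rollout_batch_size = 32      # prompts per rollout
n_samples_per_prompt = 8     # responses per prompt
global_batch_size = 256      # samples per optimizer step

# The samples produced by one rollout must split evenly
# into optimizer steps of size global_batch_size.
total_samples = rollout_batch_size * n_samples_per_prompt
assert total_samples % global_batch_size == 0, "batch-size constraint violated"

num_steps_per_rollout = total_samples // global_batch_size
print(num_steps_per_rollout)  # → 1
```

With these values the rollout produces exactly one on-policy optimizer step, matching the default --num-steps-per-rollout of 1.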
Data Configuration
- --prompt-data: Path to the prompt dataset in JSONL format. Each line should contain the keys specified by --input-key and --label-key.
- --input-key: JSON key for input prompts in the dataset.
- --label-key: JSON key for ground-truth labels used in reward calculation.
- --apply-chat-template: Apply the chat template to prompts. When enabled, input should be in OpenAI message format.
- --rollout-shuffle: Whether to shuffle prompts during rollout.
Batch Size Parameters
- --rollout-batch-size: Number of prompts sampled in each rollout step.
- --n-samples-per-prompt: Number of responses generated for each prompt (used in GRPO-like algorithms).
- --global-batch-size: Total number of samples required for one optimizer step. Auto-calculated if not set.
- --num-steps-per-rollout: Number of optimizer steps per rollout. Default is 1 for on-policy training.
- --num-rollout: Total number of rollout iterations. Controls the length of training.
Sampling Parameters
- --rollout-max-response-len: Maximum length of generated responses (equivalent to max_tokens in SGLang).
- --rollout-temperature: Temperature for sampling during rollout.
- --rollout-top-p: Top-p (nucleus) sampling parameter.
- --rollout-top-k: Top-k sampling parameter. Set to -1 to disable.
Reward Model
- --rm-type: Built-in reward model type. Options include: deepscaler, math, dapo, f1, gpqa, ifbench, remote_rm.
- --custom-rm-path: Path to a custom reward function. Overrides --rm-type.

Example Configuration
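A sketch of a rollout argument group (all values are placeholders; note that 32 prompts × 8 samples = 256 matches the global batch size, giving one optimizer step per rollout):

```shell
ROLLOUT_ARGS=(
   --prompt-data /path/to/prompts.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --global-batch-size 256
   --num-rollout 3000
   --rollout-max-response-len 8192
   --rollout-temperature 0.8
   --rm-type deepscaler
)
```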
EVAL_ARGS: Evaluation Parameters
Evaluation inherits most rollout parameters but can be overridden.

- --eval-interval: Number of rollouts between evaluations.
- --eval-prompt-data: Evaluation dataset in the format dataset_name /path/to/data.jsonl. Multiple datasets can be specified.
- --n-samples-per-eval-prompt: Number of responses to generate per evaluation prompt.
- --eval-max-response-len: Maximum response length during evaluation.
- --eval-top-p: Top-p sampling for evaluation.
Example Configuration
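A sketch of an evaluation argument group (the dataset name and path are placeholders):

```shell
EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /path/to/aime.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 0.7
)
```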
PERF_ARGS: Performance and Parallelism
Performance arguments control Megatron’s parallelism strategies.

Parallelism Parameters
- --tensor-model-parallel-size: Tensor parallelism degree. Splits model layers across GPUs.
- --pipeline-model-parallel-size: Pipeline parallelism degree. Splits the model into stages.
- --context-parallel-size: Context parallelism for handling long sequences.
- --expert-model-parallel-size: Expert parallelism for MoE models.
- --sequence-parallel: Enable sequence parallelism to reduce activation memory.
Dynamic Batching
- --use-dynamic-batch-size: Enable dynamic batching to maximize GPU utilization with variable-length sequences.
- --max-tokens-per-gpu: Maximum tokens per GPU when using dynamic batching. The system packs samples to approach this limit.
Recomputation
- --recompute-granularity: Granularity for activation recomputation: full, selective.
- --recompute-method: Recomputation method: uniform, block.
- --recompute-num-layers: Number of layers to recompute.
Example Configuration
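A sketch of a performance argument group (the degrees below are placeholders and must match your GPU count and model size):

```shell
PERF_ARGS=(
   --tensor-model-parallel-size 2
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --sequence-parallel
   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1
)
```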
GRPO_ARGS: RL Algorithm Parameters
GRPO and other RL algorithm configurations.

- --advantage-estimator: Advantage estimation method. Options: grpo, gspo, reinforce_plus_plus, reinforce_plus_plus_baseline, ppo.
- --use-kl-loss: Enable KL divergence calculation against the reference model. KL is computed but only affects the loss if --kl-loss-coef > 0.
- --kl-loss-coef: Coefficient for the KL penalty in the loss function. Set to 0 to only monitor KL without affecting training.
- --kl-loss-type: Type of KL loss calculation: k1, k2, k3, low_var_kl.
- --entropy-coef: Coefficient for the entropy bonus in the loss.
- --eps-clip: PPO clipping range lower bound.
- --eps-clip-high: PPO clipping range upper bound. If not set, symmetric clipping is used.
- --calculate-per-token-loss: Calculate the loss on a per-token basis instead of per-sample.
- --use-tis: Enable Truncated Importance Sampling (TIS) for off-policy correction.
Example Configuration
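A sketch of a GRPO argument group. The values are placeholders: a KL coefficient of 0 monitors KL without penalizing it, and the asymmetric clip bounds (0.2 / 0.28) follow the DAPO-style clip-higher pattern:

```shell
GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
```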
OPTIMIZER_ARGS: Optimizer Configuration
- --optimizer: Optimizer type: adam, sgd.
- --lr: Learning rate.
- --lr-decay-style: Learning-rate schedule: constant, linear, cosine.
- --weight-decay: Weight decay coefficient.
- --adam-beta1: Adam beta1 parameter.
- --adam-beta2: Adam beta2 parameter.
- --clip-grad: Gradient clipping threshold.
Example Configuration
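A sketch of an optimizer argument group (the values are placeholders, not recommendations):

```shell
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
   --clip-grad 1.0
)
```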
SGLANG_ARGS: Inference Engine Configuration
SGLang service parameters for rollout inference.

- --rollout-num-gpus-per-engine: Number of GPUs per inference engine (equivalent to SGLang’s tp_size).
- --rollout-num-gpus: Total number of GPUs for inference. Ignored when using --colocate.

SGLang Parameter Forwarding
You can pass SGLang parameters by adding the --sglang- prefix.
Example Configuration
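A sketch of an SGLang argument group. The values are placeholders; mem-fraction-static is a real SGLang server option, forwarded here via the --sglang- prefix:

```shell
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   # Forwarded to SGLang as mem-fraction-static:
   --sglang-mem-fraction-static 0.7
)
```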
Advanced Features
Colocated Training and Inference
Deploy training and inference on the same GPUs to save resources by passing --colocate.

Dynamic Sampling
Implement DAPO-style dynamic sampling to filter low-quality data during rollout.

Partial Rollout
Cache incomplete generations during dynamic sampling so they can be resumed in a later rollout step.

Next Steps
- Customization: Learn how to customize generation functions, reward models, and filters.
- Distributed Training: Set up Ray clusters and multi-node training.