Overview
slime training scripts are organized into several argument categories:

- MODEL_ARGS: Model architecture and hyperparameters
- CKPT_ARGS: Checkpoint loading and saving
- ROLLOUT_ARGS: Data generation and sampling
- EVAL_ARGS: Evaluation configuration
- PERF_ARGS: Performance and parallelism
- GRPO_ARGS: RL algorithm parameters
- OPTIMIZER_ARGS: Optimizer settings
- SGLANG_ARGS: Inference engine configuration
MODEL_ARGS: Model Configuration
Model arguments define the architecture and are loaded from model configuration files in scripts/models/.
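As a sketch, a training script typically loads one of these files before launching (the file name below is illustrative; pick the file matching your checkpoint):

```shell
# Load MODEL_ARGS for the chosen architecture.
# The exact file name depends on which model you train.
source scripts/models/qwen2.5-7B.sh
```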
Overriding Model Parameters
You can override parameters after sourcing the model configuration file.

CKPT_ARGS: Checkpoint Configuration
Checkpoint arguments control model loading and saving.

- --hf-checkpoint: Path to the HuggingFace checkpoint for loading the tokenizer and model metadata. Model weights are not loaded from here during training.
- --ref-load: Path to the reference model's Megatron-format checkpoint. Used as the initial checkpoint if --load is empty.
- --load: Actor model loading path. Should typically match --save for checkpoint resumption. If empty or invalid, loads from --ref-load instead.
- --save: Directory where model checkpoints are saved during training.
- --save-interval: Number of rollout steps between checkpoint saves.
Example Configuration
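The paths and interval below are placeholders; this is a sketch of how slime scripts typically group these flags into a bash array:

```shell
CKPT_ARGS=(
   --hf-checkpoint /path/to/hf_model
   --ref-load /path/to/megatron_ref_ckpt
   --load /path/to/save_dir
   --save /path/to/save_dir
   --save-interval 20
)
```

Pointing --load at the same directory as --save lets a restarted run resume from its latest checkpoint.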
ROLLOUT_ARGS: Data Generation Parameters
Rollout arguments control the data sampling phase of training.

Training Loop Structure
The training process follows a “Data Sampling → Weight Update” closed loop:

Data Sampling (Rollout)
Generate responses from prompts using the inference engine:
- --rollout-batch-size: Number of prompts per rollout
- --n-samples-per-prompt: Responses generated per prompt
Critical Constraint
The following equation must always hold:

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

slime auto-calculates --global-batch-size if not specified.
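As a sanity check, the constraint can be verified with plain arithmetic (variable names mirror the CLI flags; the numbers are placeholders):

```python
# Placeholder values for one rollout step.
rollout_batch_size = 32      # prompts per rollout
n_samples_per_prompt = 8     # responses per prompt
global_batch_size = 256      # samples per optimizer step

# The samples produced by one rollout must split evenly
# into optimizer steps of size global_batch_size.
total_samples = rollout_batch_size * n_samples_per_prompt
assert total_samples % global_batch_size == 0, "batch-size constraint violated"

num_steps_per_rollout = total_samples // global_batch_size
print(num_steps_per_rollout)  # → 1
```

With these values the rollout produces exactly one on-policy optimizer step, matching the default --num-steps-per-rollout of 1.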
Data Configuration
- --prompt-data: Path to the prompt dataset in JSONL format. Each line should contain the keys specified by --input-key and --label-key.
- --input-key: JSON key for input prompts in the dataset.
- --label-key: JSON key for ground-truth labels used in reward calculation.
- --apply-chat-template: Apply the chat template to prompts. When enabled, input should be in OpenAI message format.
- --rollout-shuffle: Whether to shuffle prompts during rollout.
Batch Size Parameters
- --rollout-batch-size: Number of prompts sampled in each rollout step.
- --n-samples-per-prompt: Number of responses generated for each prompt (used in GRPO-like algorithms).
- --global-batch-size: Total number of samples required for one optimizer step. Auto-calculated if not set.
- --num-steps-per-rollout: Number of optimizer steps per rollout. Default is 1 for on-policy training.
- --num-rollout: Total number of rollout iterations. Controls the length of training.
Sampling Parameters
- --rollout-max-response-len: Maximum length of generated responses (equivalent to max_tokens in SGLang).
- --rollout-temperature: Temperature for sampling during rollout.
- --rollout-top-p: Top-p (nucleus) sampling parameter.
- --rollout-top-k: Top-k sampling parameter. Set to -1 to disable.
Reward Model
- --rm-type: Built-in reward model type. Options include: deepscaler, math, dapo, f1, gpqa, ifbench, remote_rm.
- --custom-rm-path: Path to a custom reward function. Overrides --rm-type.

Example Configuration
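A sketch of a rollout argument group (all values are placeholders; note that 32 prompts × 8 samples = 256 matches the global batch size, giving one optimizer step per rollout):

```shell
ROLLOUT_ARGS=(
   --prompt-data /path/to/prompts.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --global-batch-size 256
   --num-rollout 3000
   --rollout-max-response-len 8192
   --rollout-temperature 0.8
   --rm-type deepscaler
)
```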
EVAL_ARGS: Evaluation Parameters
Evaluation inherits most rollout parameters but can be overridden.

- --eval-interval: Number of rollouts between evaluations.
- --eval-prompt-data: Evaluation dataset in the format dataset_name /path/to/data.jsonl. Multiple datasets can be specified.
- --n-samples-per-eval-prompt: Number of responses to generate per evaluation prompt.
- --eval-max-response-len: Maximum response length during evaluation.
- --eval-top-p: Top-p sampling for evaluation.
Example Configuration
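A sketch of an evaluation argument group (the dataset name and path are placeholders):

```shell
EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /path/to/aime.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 0.7
)
```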
PERF_ARGS: Performance and Parallelism
Performance arguments control Megatron’s parallelism strategies.

Parallelism Parameters
- --tensor-model-parallel-size: Tensor parallelism degree. Splits model layers across GPUs.
- --pipeline-model-parallel-size: Pipeline parallelism degree. Splits the model into stages.
- --context-parallel-size: Context parallelism for handling long sequences.
- --expert-model-parallel-size: Expert parallelism for MoE models.
- --sequence-parallel: Enable sequence parallelism to reduce activation memory.
Dynamic Batching
- --use-dynamic-batch-size: Enable dynamic batching to maximize GPU utilization with variable-length sequences.
- --max-tokens-per-gpu: Maximum tokens per GPU when using dynamic batching. The system packs samples to approach this limit.
Recomputation
- --recompute-granularity: Granularity for activation recomputation: full, selective.
- --recompute-method: Recomputation method: uniform, block.
- --recompute-num-layers: Number of layers to recompute.
Example Configuration
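A sketch of a performance argument group (the degrees below are placeholders and must match your GPU count and model size):

```shell
PERF_ARGS=(
   --tensor-model-parallel-size 2
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --sequence-parallel
   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1
)
```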
GRPO_ARGS: RL Algorithm Parameters
GRPO and other RL algorithm configurations.

- --advantage-estimator: Advantage estimation method. Options: grpo, gspo, reinforce_plus_plus, reinforce_plus_plus_baseline, ppo.
- --use-kl-loss: Enable KL divergence calculation against the reference model. KL is computed but only affects the loss if --kl-loss-coef > 0.
- --kl-loss-coef: Coefficient for the KL penalty in the loss function. Set to 0 to only monitor KL without affecting training.
- --kl-loss-type: Type of KL loss calculation: k1, k2, k3, low_var_kl.
- --entropy-coef: Coefficient for the entropy bonus in the loss.
- --eps-clip: PPO clipping range lower bound.
- --eps-clip-high: PPO clipping range upper bound. If not set, symmetric clipping is used.
- --calculate-per-token-loss: Calculate the loss on a per-token basis instead of per-sample.
- --use-tis: Enable Truncated Importance Sampling (TIS) for off-policy correction.
Example Configuration
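A sketch of a GRPO argument group. The values are placeholders: a KL coefficient of 0 monitors KL without penalizing it, and the asymmetric clip bounds (0.2 / 0.28) follow the DAPO-style clip-higher pattern:

```shell
GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
```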
OPTIMIZER_ARGS: Optimizer Configuration
- --optimizer: Optimizer type: adam, sgd.
- --lr: Learning rate.
- --lr-decay-style: Learning-rate schedule: constant, linear, cosine.
- --weight-decay: Weight decay coefficient.
- --adam-beta1: Adam beta1 parameter.
- --adam-beta2: Adam beta2 parameter.
- --clip-grad: Gradient clipping threshold.
Example Configuration
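A sketch of an optimizer argument group (the values are placeholders, not recommendations):

```shell
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
   --clip-grad 1.0
)
```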
SGLANG_ARGS: Inference Engine Configuration
SGLang service parameters for rollout inference.

- --rollout-num-gpus-per-engine: Number of GPUs per inference engine (equivalent to SGLang’s tp_size).
- --rollout-num-gpus: Total number of GPUs for inference. Ignored when using --colocate.

SGLang Parameter Forwarding
You can pass SGLang parameters by adding the --sglang- prefix.
Example Configuration
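A sketch of an SGLang argument group. The values are placeholders; mem-fraction-static is a real SGLang server option, forwarded here via the --sglang- prefix:

```shell
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   # Forwarded to SGLang as mem-fraction-static:
   --sglang-mem-fraction-static 0.7
)
```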
Advanced Features
Colocated Training and Inference
Deploy training and inference on the same GPUs to save resources by passing --colocate.

Dynamic Sampling
Implement DAPO-style dynamic sampling to filter low-quality data during rollout.

Partial Rollout
Cache incomplete generations during dynamic sampling so they can be resumed in a later rollout step.

Next Steps
- Customization: Learn how to customize generation functions, reward models, and filters.
- Distributed Training: Set up Ray clusters and multi-node training.