verl uses Hydra for configuration management. All training runs are driven by a single top-level YAML file (typically ppo_trainer.yaml), with sub-configs composed from the verl/trainer/config/ directory. You can override any field from the command line with key=value syntax; no code changes are needed. This page documents every major section of the config, its fields, and their defaults.
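As a toy illustration of what a dotted key=value override does to a nested config, the sketch below applies overrides to a plain dictionary. It is not Hydra's parser: the function name apply_overrides is hypothetical, values are kept as strings, and missing keys are created silently, which real Hydra does not necessarily allow.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply dotted key=value overrides to a nested dict (toy model, in place)."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Walk down the nesting, creating intermediate dicts if needed.
            node = node.setdefault(part, {})
        # Values stay as strings here; a real config system also parses types.
        node[leaf] = raw_value
    return config


config = {"data": {"train_batch_size": "1024"}}
apply_overrides(config, ["data.train_batch_size=256", "trainer.total_epochs=3"])
```

After the call, config["data"]["train_batch_size"] holds the overridden value and a trainer block has been added next to data.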
Data Section
The data block controls dataset loading, tokenization, and batching. Paths can point to local files or HDFS paths; verl downloads HDFS paths to DRAM automatically.
Training set parquet file path(s). Accepts a single file or a list of files. The entire dataset is loaded into DRAM, so keep the total size below ~100 GB. Supports local paths and HDFS paths.
Validation set parquet file path(s). Same format as train_files.
Maximum samples to draw from the training set. Set to -1 to use the full dataset.
Maximum samples to draw from the validation set. Set to -1 to use the full set.
Column name in the parquet file that contains the prompt text.
Maximum prompt token length. All prompts are left-padded to this length. An error is raised if a prompt exceeds this value unless truncation is set to left, right, or middle.
Maximum response token length. The rollout engine generates up to this many tokens per prompt.
Global batch size (number of prompts) sampled per training iteration. This is the algorithmic batch size seen from the single-controller perspective; it is normalized across workers internally.
Return the original input_ids without applying the chat template. Set to True when the reward model uses a different tokenizer or chat template than the policy, since the tokens must then be decoded and re-encoded for the RM.
How to handle prompts that exceed max_prompt_length. Options:
- error: raise an exception (default; forces you to set an appropriate limit)
- left: truncate from the left
- right: truncate from the right
- middle: keep the head and tail, drop the middle portion
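The four truncation modes listed above can be sketched on a plain list of token ids. This is a minimal illustration, not verl's implementation: the function name truncate_prompt is hypothetical, and the even head/tail split used for middle is an assumption.

```python
from typing import List


def truncate_prompt(token_ids: List[int], max_prompt_length: int,
                    truncation: str = "error") -> List[int]:
    """Apply one truncation mode to a prompt given as a list of token ids."""
    if max_prompt_length <= 0:
        raise ValueError("max_prompt_length must be positive")
    if len(token_ids) <= max_prompt_length:
        return list(token_ids)
    if truncation == "left":
        # Drop tokens from the left, keeping the end of the prompt.
        return token_ids[-max_prompt_length:]
    if truncation == "right":
        # Drop tokens from the right, keeping the start of the prompt.
        return token_ids[:max_prompt_length]
    if truncation == "middle":
        # Keep the head and the tail; the split point is an assumption.
        head = max_prompt_length // 2
        tail = max_prompt_length - head
        return token_ids[:head] + token_ids[-tail:]
    if truncation == "error":
        raise ValueError(
            f"prompt has {len(token_ids)} tokens, limit is {max_prompt_length}"
        )
    raise ValueError(f"unknown truncation mode: {truncation!r}")
```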
When True, prompts that exceed max_prompt_length are silently dropped rather than raising an error. Use filter_overlong_prompts_workers to parallelize this step on large datasets.
Random seed for data shuffling. Set to null for non-deterministic ordering across runs.
Custom Dataset Class
Path to a Python file containing a custom dataset class. If null, verl’s built-in dataset implementation is used.
Class name within the file pointed to by data.custom_cls.path.
Actor / Rollout / Reference Policy Section
All three model roles (the actor being trained, the rollout inference engine, and the reference policy) share a single actor_rollout_ref config block. They share the same base model weights but have separate runtime configurations.
HuggingFace model identifier or local/HDFS path. This single path is shared by the actor, rollout, and reference model. HDFS paths are downloaded to DRAM automatically.
Override the attention implementation. Options: flash_attention_2, eager, sdpa. Use eager for debugging or when Flash Attention 2 is unavailable.
Enable gradient checkpointing for the actor (FSDP only). Reduces GPU memory at the cost of additional forward passes during backward. For Megatron, use the override_transformer_config recompute options instead.
Offload activations to CPU during the forward pass (FSDP only). Works alongside gradient checkpointing to further reduce peak GPU memory.
Enable sequence packing (remove padding tokens). Improves throughput significantly for variable-length sequences. Supported for Llama, Mistral, Gemma, and Qwen-based models.
Actor Training
Distributed training backend for the actor. Options:
- fsdp: PyTorch FSDP (default, FSDP1)
- fsdp2: PyTorch FSDP2, recommended for newer workloads (7% lower memory, 1.5% higher throughput vs FSDP1)
- megatron: NVIDIA Megatron-LM for very large models
Global mini-batch size for PPO actor updates. The train_batch_size is split into sub-batches of this size for multiple gradient steps per iteration.
Per-GPU micro-batch size for actor forward/backward passes (gradient accumulation). Smaller values trade throughput for lower GPU memory. Use this field; ppo_micro_batch_size (global) is deprecated.
Number of PPO update epochs over the same batch of rollout data. Higher values extract more signal per rollout but risk over-fitting to stale data.
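The relationship between these batch-size fields can be sketched with simple arithmetic. The function names and the ppo_epochs parameter name are illustrative assumptions. The sketch also assumes the batch divides evenly into mini-batches, and it takes the per-GPU share of a mini-batch as an input rather than modelling how verl normalizes it across workers.

```python
import math


def actor_updates_per_iteration(train_batch_size: int, ppo_mini_batch_size: int,
                                ppo_epochs: int) -> int:
    """Optimizer steps per training iteration, assuming an even split."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size must be a multiple of ppo_mini_batch_size")
    mini_batches = train_batch_size // ppo_mini_batch_size
    # Every epoch walks over all mini-batches once.
    return mini_batches * ppo_epochs


def grad_accum_steps(per_gpu_mini_batch: int, ppo_micro_batch_size_per_gpu: int) -> int:
    """Forward/backward passes accumulated before one optimizer step on a GPU."""
    return math.ceil(per_gpu_mini_batch / ppo_micro_batch_size_per_gpu)
```

For example, a batch of 1024 prompts with mini-batches of 256 and 2 epochs gives 8 optimizer steps per iteration under these assumptions.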
PPO clip range (ε). The policy ratio π/π_old is clipped to [1-ε, 1+ε] to prevent excessively large updates.
Gradient norm clipping threshold. Helps stabilize training and prevents gradient explosions.
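The PPO clip range described above can be illustrated with the standard clipped surrogate loss for a single token. This is a textbook sketch rather than verl's loss code; the name clip_ratio and the value 0.2 are illustrative assumptions.

```python
import math


def clipped_policy_loss(logp_new: float, logp_old: float, advantage: float,
                        clip_ratio: float = 0.2) -> float:
    """Standard PPO clipped surrogate loss for one token (lower is better)."""
    # Policy ratio pi / pi_old, computed from log-probabilities.
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    # PPO maximizes min(ratio * A, clipped * A); the loss is its negation.
    return max(-advantage * ratio, -advantage * clipped)
```

With a positive advantage, a ratio above 1+ε stops contributing extra gain, so the gradient no longer pushes the policy further in that direction.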
Add a KL divergence penalty term directly to the actor loss (used in GRPO). When True, the KL is applied in the loss rather than in the reward function. The reference model is automatically enabled when this is set.
Coefficient weighting the KL loss term when use_kl_loss=True.
KL divergence estimator. Options: kl / k1, abs, mse / k2, low_var_kl / k3, full. Appending + (e.g. k1+, k3+) uses straight-through estimation for unbiased gradients. See this blog post for analysis.
Weight of the entropy bonus in the PPO loss. Encourages exploration. Default changed to 0.0 from v0.3.x onward.
Enable dynamic batching (sequence packing) for actor updates. When True, use ppo_max_token_len_per_gpu instead of ppo_micro_batch_size_per_gpu to control memory. Significantly improves throughput on variable-length data.
Maximum tokens per GPU per forward/backward pass when use_dynamic_bsz=True. A good starting point is 2 × (max_prompt_length + max_response_length).
Actor learning rate. For RL fine-tuning, typical values are 1e-7 to 1e-6.
LR scheduler type. Options: constant, cosine. For cosine, also configure min_lr_ratio and num_cycles.
Offload model parameters to CPU when not in use (FSDP). Trades speed for GPU memory. Recommended for reference models of 7B parameters or more.
Offload optimizer states to CPU (FSDP). Frees significant GPU memory when optimizer states are large.
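The KL estimator options named earlier in this section correspond to well-known per-token estimators computed from the policy and reference log-probabilities of a sampled token. The sketch below shows the k1, abs, k2, and k3 forms; the function name kl_estimate is hypothetical, the full option and the + variants are omitted, and any clamping a real implementation might apply is left out.

```python
import math


def kl_estimate(logp: float, ref_logp: float, kind: str = "k1") -> float:
    """Per-token KL estimate between the policy and the reference policy."""
    log_ratio = logp - ref_logp  # log(pi / pi_ref) for the sampled token
    if kind in ("kl", "k1"):
        return log_ratio
    if kind == "abs":
        return abs(log_ratio)
    if kind in ("mse", "k2"):
        return 0.5 * log_ratio ** 2
    if kind in ("low_var_kl", "k3"):
        # k3 = r - 1 - log(r) with r = pi_ref / pi; always non-negative.
        return math.exp(-log_ratio) - 1.0 + log_ratio
    raise ValueError(f"unsupported estimator in this sketch: {kind!r}")
```

k1 is unbiased but can be negative for individual tokens, while k3 is non-negative and usually has lower variance, which is why it is labelled low_var_kl.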
Rollout Engine
Rollout inference backend. Options: vllm, sglang, hf. vLLM and SGLang are recommended for production; HF is useful for debugging.
Tensor parallel degree for the rollout engine. A smaller TP size spawns more inference replicas (data parallelism), which typically yields higher throughput at the cost of more KV cache memory.
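The trade-off between TP size and replica count is simple arithmetic. The sketch below assumes that every GPU in the cluster is used for rollout and that tensor parallelism is the only model-parallel dimension; the function and parameter names are illustrative.

```python
def rollout_replicas(nnodes: int, n_gpus_per_node: int, tp_size: int) -> int:
    """Number of independent inference replicas for a given TP degree."""
    world_size = nnodes * n_gpus_per_node
    if world_size % tp_size != 0:
        raise ValueError("total GPU count must be divisible by the TP size")
    # Each replica occupies tp_size GPUs; the rest is data parallelism.
    return world_size // tp_size
```

On a single 8-GPU node, TP=2 gives 4 replicas while TP=8 gives only 1.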
Fraction of GPU memory allocated to the rollout engine. Its meaning depends on the backend:
- vLLM ≥ 0.7.0: fraction of total GPU memory
- SGLang: fraction of free GPU memory for static memory (model weights + KV cache)
Values between 0.5 and 0.7 balance throughput and OOM risk when actor parameters and optimizer states are not offloaded.
Number of responses to sample per prompt. Set to a value greater than 1 for GRPO and RLOO, which require multiple samples per prompt to estimate advantages.
Sampling temperature during training rollout. Use 0 for greedy decoding (also set in val_kwargs for deterministic evaluation).
Offload the KV cache after the rollout generation stage to free GPU memory for actor/critic training.
Disable CUDA graphs in the vLLM engine. Set to True when free_cache_engine=True with vLLM 0.5.4 / 0.6.3, or for debugging. Default False for best performance.
Enable multi-turn agentic rollout with tool calling. Requires rollout.name=sglang. Configure tool definitions via tool_config_path or function_tool_path.
Compute log probabilities during rollout. Required for Rollout Correction (truncated importance sampling). Also enables training/rollout_probs_diff_mean diagnostics.
Extra keyword arguments passed directly to the vLLM engine constructor. Refer to the vLLM documentation for available options.
Extra keyword arguments for the SGLang engine. Refer to the SGLang documentation for available options.
Reference Model
The reference model is activated automatically when actor.use_kl_loss=True or algorithm.use_kl_in_reward=True.
Offload reference model parameters to CPU. Strongly recommended for models 7B or larger to avoid GPU OOM during concurrent actor training.
Critic Section
The critic model (value function) is only needed for PPO. Its configuration mirrors the actor model.
Critic model path. Typically set to the same base model as the actor. The critic adds a scalar value head on top of the transformer.
Global mini-batch size for critic gradient updates. Can often be larger than the actor’s mini-batch size since the critic has no large vocabulary output head.
Number of update epochs over the rollout batch for critic training.
Critic learning rate. Often set higher than the actor learning rate.
Reward Section
Path to a Python file containing your custom reward function. If null, verl’s built-in reward functions are used (e.g., for GSM8K and MATH).
Name of the reward function inside the file at custom_reward_function.path. The function receives (data_source, solution_str, ground_truth, extra_info) and must return a float.
Reward computation strategy. naive runs verifications sequentially; prime parallelizes them across workers when all verification functions are multiprocessing-safe.
Enable a model-based reward model. When False, only the custom reward function is used. When True, the reward model is deployed as an inference server alongside the rollout engine.
Custom Reward Function
Implement compute_score in a Python file and point the config to it.
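A minimal sketch of such a file is shown below. It follows the (data_source, solution_str, ground_truth, extra_info) signature described above; the scoring rule itself (exact numeric match on the last number in the response) is only an illustrative assumption, not one of verl's built-in reward functions.

```python
import re
from typing import Any, Dict, Optional


def compute_score(data_source: str, solution_str: str, ground_truth: Any,
                  extra_info: Optional[Dict[str, Any]] = None) -> float:
    """Return 1.0 if the last number in the response equals the ground truth."""
    # Strip thousands separators, then collect every number in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_str.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        target = float(str(ground_truth).replace(",", ""))
    except ValueError:
        # Non-numeric ground truth cannot be matched by this toy rule.
        return 0.0
    return 1.0 if float(numbers[-1]) == target else 0.0
```

To use it, set custom_reward_function.path to the location of this file and set the function-name field to compute_score.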
Algorithm Section
Discount factor for future rewards. 1.0 means no discounting (appropriate for episodic tasks whose reward arrives only at the end). Reduce for long-horizon tasks with intermediate rewards.
GAE (Generalized Advantage Estimation) λ parameter. Controls the bias-variance tradeoff: 0 = one-step TD (low variance, high bias), 1 = Monte Carlo returns (high variance, low bias).
Advantage estimation method. Options:
- gae: standard PPO with Generalized Advantage Estimation (requires critic)
- grpo: Group Relative Policy Optimization (no critic needed)
- reinforce_plus_plus: REINFORCE++ with improved baseline
- reinforce_plus_plus_baseline: REINFORCE++ with explicit baseline
- rloo / rloo_vectorized: REINFORCE Leave-One-Out
- grpo_vectorized: vectorized GRPO implementation
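To make the difference between the critic-based and critic-free estimators concrete, the sketch below implements textbook GAE for one trajectory and a group-normalized advantage in the style of GRPO. Both functions are simplified illustrations with hypothetical names. Using the population standard deviation and a small eps is an assumption, and verl's own implementations operate on batched tensors.

```python
from statistics import mean, pstdev
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over one trajectory (needs a critic)."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the final step is taken to be zero
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially weighted accumulation.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for the responses sampled from one prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

The group-relative form needs several responses per prompt to compute a meaningful mean and spread, which is why the critic-free estimators depend on sampling more than one response.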
Add a KL penalty term to the reward signal at each token. Distinct from actor.use_kl_loss, which adds KL to the loss. When True, the reference model is enabled automatically.
KL controller type. fixed keeps kl_coef constant; adaptive adjusts it dynamically based on target_kl over a horizon window.
KL penalty coefficient for in-reward KL (use_kl_in_reward=True). Serves as the initial coefficient when using the adaptive controller.
Trainer Section
Number of full passes through the training dataset.
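As a rough guide to how many training iterations a given number of epochs produces, the sketch below multiplies the batches per epoch by the epoch count. It assumes an incomplete final batch is dropped, which may not match every data loader configuration; the function name is hypothetical.

```python
def derived_total_steps(num_train_samples: int, train_batch_size: int,
                        total_epochs: int) -> int:
    """Training iterations implied by the epoch count and batch size."""
    # Assumption: a trailing partial batch is dropped, hence floor division.
    steps_per_epoch = num_train_samples // train_batch_size
    return steps_per_epoch * total_epochs
```

For instance, 7,473 training prompts with a batch size of 1,024 give 7 iterations per epoch, so 15 epochs correspond to 105 iterations under this assumption.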
Set an explicit step limit instead of using total_epochs. When null, the step count is derived from total_epochs and train_batch_size.
Project name for experiment tracking (wandb, SwanLab, MLflow).
Run/experiment name, used for tracking and as a component of the checkpoint directory path.
Active logging backends. Supported values: "console", "wandb", "swanlab", "mlflow", "tensorboard", "trackio". Provide as a list to enable multiple simultaneously.
Number of validation generations to log to the experiment tracker at each validation step. Set to 0 to disable (reduces overhead). Previously named val_generations_to_log_to_wandb.
Number of nodes in the Ray cluster.
Number of GPUs per node.
Checkpoint save frequency in training iterations. -1 disables periodic saving (only saves at end of training).
Validation frequency in training iterations. -1 disables periodic validation.
Run a validation pass before the first training step to establish a baseline reward score.
Number of iterations to train the critic alone before starting policy updates. Useful when the critic needs to stabilize its value estimates first.
Checkpoint resume strategy:
- auto: resume from the latest checkpoint in default_local_dir if one exists
- disable: always start from scratch
- resume_path: resume from the path specified in resume_from_path
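The decision logic behind the three resume modes can be sketched as follows. This is not verl's code: the function name is hypothetical, and treating the highest-numbered global_step_<N> subdirectory as the "latest" checkpoint is an assumption about the directory layout.

```python
import os
import re
from typing import Optional


def resolve_resume_path(resume_mode: str, default_local_dir: str,
                        resume_from_path: Optional[str] = None) -> Optional[str]:
    """Return the checkpoint directory to load, or None to start from scratch."""
    if resume_mode == "disable":
        return None
    if resume_mode == "resume_path":
        if not resume_from_path:
            raise ValueError("resume_from_path is required for resume_path mode")
        return resume_from_path
    if resume_mode == "auto":
        if not os.path.isdir(default_local_dir):
            return None
        best_step, best_path = -1, None
        for name in os.listdir(default_local_dir):
            # Assumed layout: one subdirectory per saved step.
            match = re.fullmatch(r"global_step_(\d+)", name)
            path = os.path.join(default_local_dir, name)
            if match and os.path.isdir(path) and int(match.group(1)) > best_step:
                best_step, best_path = int(match.group(1)), path
        return best_path
    raise ValueError(f"unknown resume_mode: {resume_mode!r}")
```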
Explicit checkpoint directory to resume from. Only used when resume_mode=resume_path.
Root directory for local checkpoint storage. Defaults to checkpoints/{project_name}/{experiment_name}.
Maximum number of actor checkpoints to retain on disk. Older checkpoints are deleted. null keeps all.
Balance batch sizes across distributed workers to avoid stragglers when sequence lengths vary.
Checkpoint Section
Checkpoint settings are nested under each model role (actor, critic). The same save_contents / load_contents pattern applies to all roles.
Contents to include in saved checkpoints. Valid values:
- model: framework-native sharded weights (FSDP per-rank shards or Megatron dist checkpoint / HF via mbridge)
- optimizer: sharded optimizer state
- extra: LR scheduler, RNG states, and (for Megatron) the serialized TransformerConfig
- hf_model: full HuggingFace format weights (suitable for inference)
Contents to load when resuming. Defaults to the same as save_contents. You can specify a subset to, for example, load only model weights without optimizer state.