verl uses Hydra for configuration management. All training runs are driven by a single top-level YAML file (typically ppo_trainer.yaml), with sub-configs composed from the verl/trainer/config/ directory. You override any field from the command line with key=value syntax — no code changes needed. This page documents every major section of the config, its fields, and their defaults.
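To make the key=value override syntax concrete, the sketch below flattens a nested dictionary into dotted override strings that could be appended to a launch command. The helper name to_overrides is hypothetical and is not part of verl or Hydra; the config keys are taken from this page.

```python
def to_overrides(cfg, prefix=""):
    """Flatten a nested dict into Hydra-style 'a.b.c=value' strings."""
    out = []
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-sections, extending the dotted path.
            out.extend(to_overrides(value, dotted))
        else:
            out.append(f"{dotted}={value}")
    return out


overrides = to_overrides({
    "data": {"train_batch_size": 256, "max_response_length": 1024},
    "actor_rollout_ref": {"rollout": {"n": 4}},
})
print(" ".join(overrides))
```

Each resulting string overrides exactly one field, so the YAML defaults for every other field stay in effect.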

Data Section

The data block controls dataset loading, tokenization, and batching. Paths can be local or on HDFS; verl downloads HDFS paths to DRAM automatically.
data:
  tokenizer: null
  train_files: ~/data/rlhf/gsm8k/train.parquet
  val_files: ~/data/rlhf/gsm8k/test.parquet
  train_max_samples: -1     # -1 = use full dataset
  val_max_samples: -1       # -1 = use full dataset
  prompt_key: prompt
  max_prompt_length: 512
  max_response_length: 512
  train_batch_size: 1024    # global batch size (number of prompts)
  return_raw_input_ids: False
  return_raw_chat: False
  return_full_prompt: False
  shuffle: True
  seed: 42
  filter_overlong_prompts: False
  filter_overlong_prompts_workers: 1
  truncation: error         # error | left | right | middle
  image_key: images
  trust_remote_code: True
  custom_cls:
    path: null
    name: null
data.train_files
string | list
Training set parquet file path(s). Accepts a single file or a list of files. The entire dataset is loaded into DRAM, so keep the total size below roughly 100 GB. Supports local paths and HDFS paths.
data.val_files
string | list
Validation set parquet file path(s). Same format as train_files.
data.train_max_samples
int
default:"-1"
Maximum samples to draw from the training set. Set to -1 to use the full dataset.
data.val_max_samples
int
default:"-1"
Maximum samples from the validation set. Set to -1 for the full set.
data.prompt_key
string
default:"prompt"
Column name in the parquet file that contains the prompt text.
data.max_prompt_length
int
default:"512"
Maximum prompt length in tokens. All prompts are left-padded to this length. A prompt that exceeds it raises an error unless truncation is set to left, right, or middle, or filter_overlong_prompts is enabled.
data.max_response_length
int
default:"512"
Maximum response token length. The rollout engine generates up to this many tokens per prompt.
data.train_batch_size
int
default:"1024"
Global batch size (number of prompts) sampled per training iteration. This is the algorithmic batch size seen from the single-controller perspective; it is normalized across workers internally.
data.return_raw_input_ids
boolean
default:"False"
Return the original input_ids without applying the chat template. Set to True when the reward model uses a different tokenizer or chat template than the policy — the tokens must be decoded and re-encoded for the RM.
data.truncation
string
default:"error"
How to handle prompts that exceed max_prompt_length. Options:
  • error — raise an exception (default; forces you to set an appropriate limit)
  • left — truncate from the left
  • right — truncate from the right
  • middle — keep the head and tail, drop the middle portion
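The four truncation modes can be illustrated on a plain list of token IDs. This is a minimal sketch, not verl's implementation: it assumes left means "drop tokens from the left and keep the end", right means the opposite, and middle splits the budget roughly evenly between head and tail. The function name truncate is hypothetical.

```python
def truncate(tokens, max_len, mode="error"):
    """Apply one of the truncation modes to a list of token IDs."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    if len(tokens) <= max_len:
        return list(tokens)
    if mode == "error":
        raise ValueError(f"prompt length {len(tokens)} exceeds {max_len}")
    if mode == "left":
        return list(tokens[-max_len:])  # drop from the left, keep the end
    if mode == "right":
        return list(tokens[:max_len])   # drop from the right, keep the start
    if mode == "middle":
        head = max_len // 2             # assumed split: half head, rest tail
        tail = max_len - head
        return list(tokens[:head]) + list(tokens[-tail:])
    raise ValueError(f"unknown truncation mode: {mode}")


tokens = list(range(10))
print(truncate(tokens, 4, "left"))    # [6, 7, 8, 9]
print(truncate(tokens, 4, "right"))   # [0, 1, 2, 3]
print(truncate(tokens, 4, "middle"))  # [0, 1, 8, 9]
```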
data.filter_overlong_prompts
boolean
default:"False"
When True, prompts that exceed max_prompt_length are silently dropped instead of triggering an error. Use filter_overlong_prompts_workers to parallelize this step on large datasets.
data.seed
int
default:"42"
Random seed for data shuffling. Set to null for non-deterministic ordering across runs.

Custom Dataset Class

data:
  custom_cls:
    path: null   # path to Python file with your dataset class
    name: null   # class name inside that file
data.custom_cls.path
string
Path to a Python file containing a custom dataset class. If null, verl’s built-in dataset implementation is used.
data.custom_cls.name
string
Class name within the file pointed to by data.custom_cls.path.

Actor / Rollout / Reference Policy Section

All three model roles — actor (policy being trained), rollout (inference engine), and reference policy — share a single actor_rollout_ref config block. They share the same base model weights but have separate runtime configurations.
actor_rollout_ref:
  hybrid_engine: True
  model:
    path: ~/models/deepseek-llm-7b-chat
    external_lib: null
    override_config:
      attn_implementation: flash_attention_2
    enable_gradient_checkpointing: False
    enable_activation_offload: False
    trust_remote_code: False
    use_remove_padding: False
actor_rollout_ref.model.path
string
required
HuggingFace model identifier or local/HDFS path. This single path is shared by the actor, rollout, and reference model. HDFS paths are downloaded to DRAM automatically.
actor_rollout_ref.model.override_config.attn_implementation
string
default:"flash_attention_2"
Override the attention implementation. Options: flash_attention_2, eager, sdpa. Use eager for debugging or when Flash Attention 2 is unavailable.
actor_rollout_ref.model.enable_gradient_checkpointing
boolean
default:"False"
Enable gradient checkpointing for the actor (FSDP only). Reduces GPU memory at the cost of additional forward passes during backward. For Megatron, use override_transformer_config recompute options instead.
actor_rollout_ref.model.enable_activation_offload
boolean
default:"False"
Offload activations to CPU during the forward pass (FSDP only). Works alongside gradient checkpointing to further reduce peak GPU memory.
actor_rollout_ref.model.use_remove_padding
boolean
default:"False"
Enable sequence packing (remove padding tokens). Improves throughput significantly for variable-length sequences. Supported for Llama, Mistral, Gemma, and Qwen-based models.

Actor Training

actor_rollout_ref:
  actor:
    strategy: fsdp         # fsdp | fsdp2 | megatron
    ppo_mini_batch_size: 256
    ppo_micro_batch_size_per_gpu: 8
    use_dynamic_bsz: False
    ppo_max_token_len_per_gpu: 16384
    grad_clip: 1.0
    clip_ratio: 0.2
    entropy_coeff: 0.0
    use_kl_loss: False
    kl_loss_coef: 0.001
    kl_loss_type: low_var_kl
    ppo_epochs: 1
    shuffle: False
    use_torch_compile: True
    loss_agg_mode: token-mean
    optim:
      lr: 1e-6
      lr_warmup_steps: -1
      lr_warmup_steps_ratio: 0.0
      min_lr_ratio: 0.0
      lr_scheduler_type: constant  # constant | cosine
    fsdp_config:
      param_offload: False
      optimizer_offload: False
      fsdp_size: -1
actor_rollout_ref.actor.strategy
string
default:"fsdp"
Distributed training backend for the actor. Options:
  • fsdp — PyTorch FSDP (default, FSDP1)
  • fsdp2 — PyTorch FSDP2, recommended for newer workloads (7% lower memory, 1.5% higher throughput vs FSDP1)
  • megatron — NVIDIA Megatron-LM for very large models
actor_rollout_ref.actor.ppo_mini_batch_size
int
default:"256"
Global mini-batch size for PPO actor updates. The train_batch_size is split into sub-batches of this size for multiple gradient steps per iteration.
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu
int
default:"8"
Per-GPU micro-batch size for actor forward/backward passes (gradient accumulation). Smaller values trade throughput for lower GPU memory. Use this field; ppo_micro_batch_size (global) is deprecated.
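The relationship between the three batch sizes is easiest to see as arithmetic. The sketch below uses the defaults from this page and assumes one response per prompt (rollout.n=1) and pure data parallelism across 8 GPUs; other parallelism settings would change the per-GPU numbers.

```python
train_batch_size = 1024            # data.train_batch_size
ppo_mini_batch_size = 256          # actor.ppo_mini_batch_size (global)
ppo_micro_batch_size_per_gpu = 8   # actor.ppo_micro_batch_size_per_gpu
n_gpus = 8                         # assumed: nnodes * n_gpus_per_node

# Optimizer steps per PPO epoch over one rollout batch.
updates_per_epoch = train_batch_size // ppo_mini_batch_size            # 4

# Share of each mini-batch that lands on a single GPU.
mini_batch_per_gpu = ppo_mini_batch_size // n_gpus                     # 32

# Forward/backward passes accumulated before each optimizer step.
grad_accum_steps = mini_batch_per_gpu // ppo_micro_batch_size_per_gpu  # 4

print(updates_per_epoch, mini_batch_per_gpu, grad_accum_steps)
```

Lowering ppo_micro_batch_size_per_gpu raises grad_accum_steps without changing the number of optimizer steps, which is why it trades throughput for memory.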
actor_rollout_ref.actor.ppo_epochs
int
default:"1"
Number of PPO update epochs over the same batch of rollout data. Higher values extract more signal per rollout but risk over-fitting to stale data.
actor_rollout_ref.actor.clip_ratio
float
default:"0.2"
PPO clip range (ε). The policy ratio π/π_old is clipped to [1-ε, 1+ε] to prevent excessively large updates.
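As an illustration of the clipping rule, the following sketch computes the standard PPO clipped surrogate loss for a single token. It is a textbook formulation under the stated ε, not a copy of verl's loss code; the function name ppo_clip_loss is hypothetical.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantage, clip_ratio=0.2):
    """Clipped surrogate loss for one token (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)  # pi / pi_old
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    # Take the more pessimistic of the clipped and unclipped objectives.
    return -min(ratio * advantage, clipped * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at 1 + eps.
print(ppo_clip_loss(math.log(1.5), 0.0, advantage=1.0))
# Ratio 1.5 with a negative advantage: the penalty is not capped.
print(ppo_clip_loss(math.log(1.5), 0.0, advantage=-1.0))
```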
actor_rollout_ref.actor.grad_clip
float
default:"1.0"
Gradient norm clipping threshold. Helps stabilize training and prevents gradient explosions.
actor_rollout_ref.actor.use_kl_loss
boolean
default:"False"
Add a KL divergence penalty term directly to the actor loss (used in GRPO). When True, the KL is applied in the loss rather than in the reward function. The reference model is automatically enabled when this is set.
actor_rollout_ref.actor.kl_loss_coef
float
default:"0.001"
Coefficient weighting the KL loss term when use_kl_loss=True.
actor_rollout_ref.actor.kl_loss_type
string
default:"low_var_kl"
KL divergence estimator. Options: kl / k1, abs, mse / k2, low_var_kl / k3, full. Appending + (e.g. k1+, k3+) uses straight-through estimation for unbiased gradients. See this blog post for analysis.
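The per-token estimators can be written out as follows. This is a sketch of the commonly used k1/k2/k3 definitions, assuming the sample was drawn from the current policy; it omits the full option (which needs whole distributions rather than one sampled token), the + straight-through variants, and any clamping a real implementation might apply. The function name kl_estimate is hypothetical.

```python
import math


def kl_estimate(logp, ref_logp, kind="low_var_kl"):
    """Per-token estimate of KL(policy || reference) from one sample."""
    d = logp - ref_logp  # log(pi / pi_ref) at the sampled token
    if kind in ("kl", "k1"):
        return d                     # unbiased, can be negative
    if kind == "abs":
        return abs(d)
    if kind in ("mse", "k2"):
        return 0.5 * d * d
    if kind in ("low_var_kl", "k3"):
        # (r - 1) - log r with r = pi_ref / pi; always non-negative.
        return math.exp(-d) + d - 1.0
    raise ValueError(f"unsupported estimator: {kind}")


for kind in ("k1", "abs", "k2", "k3"):
    print(kind, kl_estimate(-1.0, -1.5, kind))
```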
actor_rollout_ref.actor.entropy_coeff
float
default:"0.0"
Weight of the entropy bonus in the PPO loss. Encourages exploration. Default changed to 0.0 from v0.3.x onward.
actor_rollout_ref.actor.use_dynamic_bsz
boolean
default:"False"
Enable dynamic batching (sequence packing) for actor updates. When True, use ppo_max_token_len_per_gpu instead of ppo_micro_batch_size_per_gpu to control memory. Significantly improves throughput on variable-length data.
actor_rollout_ref.actor.ppo_max_token_len_per_gpu
int
default:"16384"
Maximum tokens per GPU per forward/backward pass when use_dynamic_bsz=True. A good starting point is 2 × (max_prompt_length + max_response_length).
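To show what a per-GPU token budget means in practice, the sketch below greedily groups sequences into micro-batches whose total token count stays under the budget. It is an illustrative packing strategy, not verl's partitioning algorithm; the function name pack_by_token_budget and the sequence lengths are hypothetical.

```python
def pack_by_token_budget(seq_lens, max_tokens):
    """Greedily group sequence indices so each group fits the budget."""
    batches, current, used = [], [], 0
    for idx, length in enumerate(seq_lens):
        if length > max_tokens:
            raise ValueError(f"sequence {idx} alone exceeds the budget")
        if current and used + length > max_tokens:
            batches.append(current)  # close the full micro-batch
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        batches.append(current)
    return batches


# Budget from the rule of thumb above with the default lengths.
budget = 2 * (512 + 512)
print(pack_by_token_budget([900, 300, 1024, 200, 700], budget))
# [[0, 1], [2, 3, 4]]
```

Short sequences share a micro-batch while long ones get more room, so the number of sequences per pass varies but memory use stays bounded.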
actor_rollout_ref.actor.optim.lr
float
default:"1e-6"
Actor learning rate. For RL fine-tuning, typical values are 1e-7 to 1e-6.
actor_rollout_ref.actor.optim.lr_scheduler_type
string
default:"constant"
LR scheduler type. Options: constant, cosine. For cosine, also configure min_lr_ratio and num_cycles.
actor_rollout_ref.actor.fsdp_config.param_offload
boolean
default:"False"
Offload model parameters to CPU when not in use (FSDP). Trades speed for GPU memory. For the reference model, the equivalent ref.fsdp_config.param_offload setting is recommended at 7B parameters and above.
actor_rollout_ref.actor.fsdp_config.optimizer_offload
boolean
default:"False"
Offload optimizer states to CPU (FSDP). Frees significant GPU memory when optimizer states are large.

Rollout Engine

actor_rollout_ref:
  rollout:
    name: vllm              # vllm | sglang | hf
    temperature: 1.0
    top_k: -1
    top_p: 1.0
    dtype: bfloat16
    gpu_memory_utilization: 0.5
    ignore_eos: False
    enforce_eager: False
    free_cache_engine: True
    load_format: dummy
    tensor_model_parallel_size: 2
    max_num_batched_tokens: 8192
    max_num_seqs: 1024
    n: 1
    calculate_log_probs: True
    val_kwargs:
      temperature: 0
      top_k: -1
      top_p: 1.0
      n: 1
      do_sample: False
    multi_turn:
      enable: False
      max_assistant_turns: null
      tool_config_path: null
    engine_kwargs:
      vllm: {}
      sglang: {}
actor_rollout_ref.rollout.name
string
required
Rollout inference backend. Options: vllm, sglang, hf. vLLM and SGLang are recommended for production; HF is useful for debugging.
actor_rollout_ref.rollout.tensor_model_parallel_size
int
default:"2"
Tensor parallel degree for the rollout engine. A smaller TP size spawns more inference replicas (data parallelism), which typically yields higher throughput at the cost of more KV cache memory.
actor_rollout_ref.rollout.gpu_memory_utilization
float
default:"0.5"
Fraction of GPU memory allocated to the rollout engine.
  • vLLM ≥ 0.7.0: fraction of total GPU memory
  • SGLang: fraction of free GPU memory for static memory (model weights + KV cache)
Values between 0.5 and 0.7 balance throughput and OOM risk when actor parameters and optimizer states are not offloaded.
actor_rollout_ref.rollout.n
int
default:"1"
Number of responses to sample per prompt. Set to values greater than 1 for GRPO and RLOO, which require multiple samples per prompt to estimate advantages.
actor_rollout_ref.rollout.temperature
float
default:"1.0"
Sampling temperature during training rollout. Use 0 for greedy decoding (also set in val_kwargs for deterministic evaluation).
actor_rollout_ref.rollout.free_cache_engine
boolean
default:"True"
Offload the KV cache after the rollout generation stage to free GPU memory for actor/critic training.
actor_rollout_ref.rollout.enforce_eager
boolean
default:"False"
Disable CUDA graphs in the vLLM engine. Set to True when free_cache_engine=True with vLLM 0.5.4 / 0.6.3, or for debugging. Default False for best performance.
actor_rollout_ref.rollout.multi_turn.enable
boolean
default:"False"
Enable multi-turn agentic rollout with tool calling. Requires rollout.name=sglang. Configure tool definitions via tool_config_path or function_tool_path.
actor_rollout_ref.rollout.calculate_log_probs
boolean
default:"True"
Compute log probabilities during rollout. Required for Rollout Correction (truncated importance sampling). Also enables training/rollout_probs_diff_mean diagnostics.
actor_rollout_ref.rollout.engine_kwargs.vllm
object
default:"{}"
Extra keyword arguments passed directly to the vLLM engine constructor. Refer to the vLLM documentation for available options.
actor_rollout_ref.rollout.engine_kwargs.sglang
object
default:"{}"
Extra keyword arguments for the SGLang engine. Refer to the SGLang documentation for available options.

Reference Model

The reference model is activated automatically when actor.use_kl_loss=True or algorithm.use_kl_in_reward=True.
actor_rollout_ref:
  ref:
    fsdp_config:
      param_offload: False   # recommended True for 7B+ models
    log_prob_micro_batch_size_per_gpu: 16
actor_rollout_ref.ref.fsdp_config.param_offload
boolean
default:"False"
Offload reference model parameters to CPU. Strongly recommended for models 7B or larger to avoid GPU OOM during concurrent actor training.

Critic Section

The critic model (value function) is only needed for PPO. Its configuration mirrors the actor model.
critic:
  strategy: fsdp            # fsdp | fsdp2 | megatron
  ppo_mini_batch_size: 256
  ppo_micro_batch_size_per_gpu: 8
  ppo_epochs: 1
  forward_micro_batch_size_per_gpu: 16
  model:
    path: ~/models/deepseek-llm-7b-chat
    enable_gradient_checkpointing: False
  optim:
    lr: 1e-5
    lr_scheduler_type: constant
  fsdp_config:
    param_offload: False
    optimizer_offload: False
  checkpoint:
    save_contents: ['model', 'optimizer', 'extra']
    load_contents: ['model', 'optimizer', 'extra']
critic.model.path
string
Critic model path. Typically set to the same base model as the actor. The critic adds a scalar value head on top of the transformer.
critic.ppo_mini_batch_size
int
Global mini-batch size for critic gradient updates. Can often be larger than the actor’s mini-batch size since the critic has no large vocabulary output head.
critic.ppo_epochs
int
default:"1"
Number of update epochs over the rollout batch for critic training.
critic.optim.lr
float
default:"1e-5"
Critic learning rate. Often set higher than the actor learning rate.

Reward Section

reward:
  num_workers: 8
  custom_reward_function:
    path: null
    name: compute_score
  reward_manager:
    name: naive              # naive | prime
  reward_model:
    enable: False
    model_path: null
    rollout:
      name: ???              # ??? marks a mandatory value; set it when enabling the reward model
      tensor_model_parallel_size: 2
      gpu_memory_utilization: 0.5
reward.custom_reward_function.path
string
Path to a Python file containing your custom reward function. If null, verl’s built-in reward functions are used (e.g., for GSM8K and MATH).
reward.custom_reward_function.name
string
default:"compute_score"
Name of the reward function inside the file at custom_reward_function.path. The function receives (data_source, solution_str, ground_truth, extra_info) and must return a float.
reward.reward_manager.name
string
default:"naive"
Reward computation strategy. naive runs verifications sequentially; prime parallelizes them across workers when all verification functions are multiprocessing-safe.
reward.reward_model.enable
boolean
default:"False"
Enable a model-based reward model. When False, only the custom reward function is used. When True, the reward model is deployed as an inference server alongside the rollout engine.

Custom Reward Function

Implement compute_score in a Python file and point the config to it:
# my_reward.py
def compute_score(data_source, solution_str, ground_truth, extra_info=None) -> float:
    """
    Args:
        data_source: dataset name/identifier
        solution_str: the model's generated response (string)
        ground_truth: the expected answer
        extra_info: optional dict with additional metadata
    Returns:
        float reward score
    """
    if solution_str.strip() == ground_truth.strip():
        return 1.0
    return 0.0
reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: compute_score
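The path and name pair can be thought of as "load this file, then look up this attribute". The sketch below shows one way to do that with the standard library; it is an assumption about the general mechanism, not verl's actual loader, and the helper name load_function is hypothetical. The demo writes a small reward file to a temporary directory so the block runs on its own.

```python
import importlib.util
import os
import tempfile


def load_function(path, name):
    """Import a Python file by path and return the named attribute."""
    spec = importlib.util.spec_from_file_location("custom_reward_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, name)


# Stand-in for my_reward.py, using the signature documented above.
source = (
    "def compute_score(data_source, solution_str, ground_truth, extra_info=None):\n"
    "    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0\n"
)

with tempfile.TemporaryDirectory() as tmp:
    reward_path = os.path.join(tmp, "my_reward.py")
    with open(reward_path, "w") as handle:
        handle.write(source)
    compute_score = load_function(reward_path, "compute_score")

score = compute_score("gsm8k", " 42 ", "42")
print(score)
```

Running a check like this before launching a job catches a wrong path or a misspelled function name early.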

Algorithm Section

algorithm:
  gamma: 1.0
  lam: 1.0
  adv_estimator: gae        # gae | grpo | reinforce_plus_plus | reinforce_plus_plus_baseline | rloo | rloo_vectorized | grpo_vectorized
  use_kl_in_reward: False
  kl_penalty: kl            # kl | abs | mse | low_var_kl | full
  norm_adv_by_std_in_grpo: True
  kl_ctrl:
    type: fixed             # fixed | adaptive
    kl_coef: 0.001
    horizon: 10000
    target_kl: 0.1
algorithm.gamma
float
default:"1.0"
Discount factor for future rewards. 1.0 means no discounting, which suits episodic tasks where a single sparse reward arrives at the end of the response. Reduce it for long-horizon tasks with intermediate rewards.
algorithm.lam
float
default:"1.0"
GAE (Generalized Advantage Estimation) λ parameter. Controls the bias-variance tradeoff: 0 = one-step TD (low variance, high bias), 1 = Monte Carlo returns (high variance, low bias).
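The effect of γ and λ can be checked with a small reference implementation of GAE over one response. This is the textbook recursion under the assumption that the value after the final token is zero; it is a sketch, not verl's batched implementation, and the function name gae is hypothetical.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for a single trajectory."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Value after the last step is assumed to be 0 (episode ends).
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0.0, 0.0, 1.0]   # single terminal reward
values = [0.5, 0.5, 0.5]    # critic predictions per token

print(gae(rewards, values, lam=1.0))  # [0.5, 0.5, 0.5]
print(gae(rewards, values, lam=0.0))  # [0.0, 0.0, 0.5]
```

With λ=1 every token receives the Monte Carlo return minus its value estimate; with λ=0 only the one-step TD error at each position is used.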
algorithm.adv_estimator
string
default:"gae"
Advantage estimation method. Options:
  • gae — standard PPO with Generalized Advantage Estimation (requires critic)
  • grpo — Group Relative Policy Optimization (no critic needed)
  • reinforce_plus_plus — REINFORCE++ with improved baseline
  • reinforce_plus_plus_baseline — REINFORCE++ with explicit baseline
  • rloo / rloo_vectorized — REINFORCE Leave-One-Out
  • grpo_vectorized — vectorized GRPO implementation
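For the critic-free estimators, the baseline comes from the other responses sampled for the same prompt. The sketch below shows group-relative (GRPO-style) and leave-one-out (RLOO-style) advantages for one prompt's group of rewards. It is illustrative only: the use of the population standard deviation and the eps value are assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, norm_by_std=True, eps=1e-6):
    """Center each reward on the group mean, optionally scale by std."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not norm_by_std:
        return centered
    sigma = pstdev(rewards)  # assumed: population std of the group
    return [c / (sigma + eps) for c in centered]


def rloo_advantages(rewards):
    """Baseline for each sample is the mean of the *other* samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


group = [1.0, 0.0, 0.0, 1.0]  # rewards for rollout.n = 4 responses
print(grpo_advantages(group, norm_by_std=False))  # [0.5, -0.5, -0.5, 0.5]
print(rloo_advantages(group))
```

Both need more than one response per prompt, which is why rollout.n must be greater than 1 for these estimators.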
algorithm.use_kl_in_reward
boolean
default:"False"
Add a KL penalty term to the reward signal at each token. Distinct from actor.use_kl_loss which adds KL to the loss. When True, the reference model is enabled automatically.
algorithm.kl_ctrl.type
string
default:"fixed"
KL controller type. fixed keeps kl_coef constant; adaptive adjusts it dynamically based on target_kl over a horizon window.
algorithm.kl_ctrl.kl_coef
float
default:"0.001"
KL penalty coefficient for in-reward KL (use_kl_in_reward=True). With the adaptive controller, this is the initial coefficient.
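One common form of an adaptive controller is the proportional update below, which nudges the coefficient up when the measured KL is above target_kl and down when it is below. This is a sketch of that general scheme using the default values from this page; the clipping bound of 0.2 and the class name are assumptions, and verl's own controller may differ in detail.

```python
class AdaptiveKLController:
    """Proportional controller that steers the KL coefficient."""

    def __init__(self, init_kl_coef, target_kl, horizon):
        self.value = init_kl_coef
        self.target = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing the coef.
        error = max(-0.2, min(current_kl / self.target - 1.0, 0.2))
        self.value *= 1.0 + error * n_steps / self.horizon


ctrl = AdaptiveKLController(init_kl_coef=0.001, target_kl=0.1, horizon=10000)
ctrl.update(current_kl=0.2, n_steps=1000)  # KL above target: coef grows
print(ctrl.value)
```

A larger horizon makes each update smaller, so the coefficient adapts more slowly.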

Trainer Section

trainer:
  total_epochs: 30
  total_training_steps: null
  project_name: verl_examples
  experiment_name: gsm8k
  logger: ["console", "wandb"]
  log_val_generations: 0
  nnodes: 1
  n_gpus_per_node: 8
  save_freq: -1
  val_before_train: True
  test_freq: -1
  critic_warmup: 0
  default_hdfs_dir: null
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
  resume_mode: auto          # auto | disable | resume_path
  resume_from_path: null
  remove_previous_ckpt_in_save: False
  del_local_ckpt_after_load: False
  max_actor_ckpt_to_keep: null
  max_critic_ckpt_to_keep: null
  ray_wait_register_center_timeout: 300
  balance_batch: True
trainer.total_epochs
int
default:"30"
Number of full passes through the training dataset.
trainer.total_training_steps
int
Set an explicit step limit instead of using total_epochs. When null, the step count is derived from total_epochs, the dataset size, and train_batch_size.
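The derived step count can be estimated ahead of time. The sketch below assumes an incomplete final batch in each epoch is dropped, which may not match the actual dataloader settings; the dataset size of 8,000 rows and the function name are hypothetical.

```python
def estimate_total_steps(dataset_size, train_batch_size, total_epochs,
                         total_training_steps=None):
    """Return the explicit step limit if set, else derive it from epochs."""
    if total_training_steps is not None:
        return total_training_steps
    steps_per_epoch = dataset_size // train_batch_size  # assumed: drop last
    return steps_per_epoch * total_epochs


# Hypothetical 8,000-prompt dataset with the defaults from this page.
print(estimate_total_steps(8000, 1024, 30))        # 210
print(estimate_total_steps(8000, 1024, 30, 100))   # 100
```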
trainer.project_name
string
default:"verl_examples"
Project name for experiment tracking (wandb, SwanLab, MLflow).
trainer.experiment_name
string
default:"gsm8k"
Run/experiment name for tracking and as a component of the checkpoint directory path.
trainer.logger
list
default:"[\"console\", \"wandb\"]"
Active logging backends. Supported values: "console", "wandb", "swanlab", "mlflow", "tensorboard", "trackio". Provide as a list to enable multiple simultaneously.
trainer.log_val_generations
int
default:"0"
Number of validation generations to log to the experiment tracker at each validation step. Set to 0 to disable (reduces overhead). Previously named val_generations_to_log_to_wandb.
trainer.nnodes
int
default:"1"
Number of nodes in the Ray cluster.
trainer.n_gpus_per_node
int
default:"8"
Number of GPUs per node.
trainer.save_freq
int
default:"-1"
Checkpoint save frequency in training iterations. -1 disables periodic saving (only saves at end of training).
trainer.test_freq
int
default:"-1"
Validation frequency in training iterations. -1 disables periodic validation.
trainer.val_before_train
boolean
default:"True"
Run a validation pass before the first training step to establish a baseline reward score.
trainer.critic_warmup
int
default:"0"
Number of iterations to train the critic alone before starting policy updates. Useful when the critic needs to stabilize its value estimates first.
trainer.resume_mode
string
default:"auto"
Checkpoint resume strategy:
  • auto — resume from the latest checkpoint in default_local_dir if one exists
  • disable — always start from scratch
  • resume_path — resume from the path specified in resume_from_path
trainer.resume_from_path
string
Explicit checkpoint directory to resume from. Only used when resume_mode=resume_path.
trainer.default_local_dir
string
Root directory for local checkpoint storage. Defaults to checkpoints/{project_name}/{experiment_name}.
trainer.max_actor_ckpt_to_keep
int
Maximum number of actor checkpoints to retain on disk. Older checkpoints are deleted. null keeps all.
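The retention rule amounts to "keep the newest N, delete the rest". The sketch below works on a list of saved step numbers rather than real directories, since the on-disk layout is not described here; the function name is hypothetical.

```python
def checkpoints_to_delete(saved_steps, max_to_keep):
    """Return the step numbers that fall outside the retention window."""
    if max_to_keep is None:
        return []  # null keeps every checkpoint
    if max_to_keep < 1:
        raise ValueError("max_to_keep must be at least 1 or None")
    ordered = sorted(saved_steps)
    if len(ordered) <= max_to_keep:
        return []
    return ordered[:-max_to_keep]  # everything except the newest N


print(checkpoints_to_delete([50, 100, 150, 200], max_to_keep=2))  # [50, 100]
print(checkpoints_to_delete([50, 100, 150, 200], max_to_keep=None))  # []
```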
trainer.balance_batch
boolean
default:"True"
Balance batch sizes across distributed workers to avoid stragglers when sequence lengths vary.

Checkpoint Section

Checkpoint settings are nested under each model role (actor, critic). The same save_contents / load_contents pattern applies to all roles.
actor_rollout_ref:
  actor:
    checkpoint:
      save_contents: ['model', 'optimizer', 'extra']
      load_contents: ['model', 'optimizer', 'extra']
actor_rollout_ref.actor.checkpoint.save_contents
list
default:"['model', 'optimizer', 'extra']"
Contents to include in saved checkpoints. Valid values:
  • model — framework-native sharded weights (FSDP per-rank shards or Megatron dist checkpoint / HF via mbridge)
  • optimizer — sharded optimizer state
  • extra — LR scheduler, RNG states, and (for Megatron) the serialized TransformerConfig
  • hf_model — full HuggingFace format weights (suitable for inference)
actor_rollout_ref.actor.checkpoint.load_contents
list
Contents to load when resuming. Defaults to the same as save_contents. You can specify a subset to, for example, load only model weights without optimizer state.

Complete Minimal Example

The following is a minimal config for a PPO run on GSM8K with a 7B model on a single 8-GPU node:
# Override these fields from ppo_trainer.yaml defaults
data:
  train_files: ~/data/rlhf/gsm8k/train.parquet
  val_files: ~/data/rlhf/gsm8k/test.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 1024

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 4
    optim:
      lr: 1e-6
    fsdp_config:
      param_offload: False
  rollout:
    name: vllm
    tensor_model_parallel_size: 2
    gpu_memory_utilization: 0.6
    n: 1
  ref:
    fsdp_config:
      param_offload: True  # recommended for 7B+

critic:
  optim:
    lr: 1e-5

algorithm:
  adv_estimator: gae
  use_kl_in_reward: False

reward:
  custom_reward_function:
    path: examples/data_preprocess/gsm8k_reward.py
    name: compute_score

trainer:
  total_epochs: 15
  project_name: my-rl-project
  experiment_name: ppo-qwen25-gsm8k
  logger: ["wandb", "console"]
  save_freq: 50
  test_freq: 10
  n_gpus_per_node: 8
  nnodes: 1
