verl Configuration Reference: All YAML Fields Explained

verl uses Hydra for configuration management. Every training run is driven by a single top-level YAML file (typically ppo_trainer.yaml), with sub-configs composed from the verl/trainer/config/ directory. You can override any field from the command line with key=value syntax, using the dotted path of the field (for example data.train_batch_size=256), so no code changes are needed. This page documents every major section of the config, its fields, and their defaults.
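
To make the dotted key=value convention concrete, the sketch below applies such overrides to a nested dictionary. It is only an illustration of how a dotted path maps onto the YAML hierarchy: it is not Hydra's parser, and the helper name apply_override is hypothetical.

```python
import ast
from typing import Any


def apply_override(config: dict, override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict, in place."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the hierarchy, creating missing sections on the way.
        node = node.setdefault(name, {})
    try:
        # Interpret numbers, lists, None, True/False as Python literals.
        value: Any = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything else (e.g. vllm) is kept as a plain string.
        value = raw_value
    node[leaf] = value


config = {"data": {"train_batch_size": 1024}, "trainer": {"total_epochs": 30}}
for item in ["data.train_batch_size=256", "actor_rollout_ref.rollout.name=vllm"]:
    apply_override(config, item)

print(config["data"]["train_batch_size"])  # 256
print(config["actor_rollout_ref"]["rollout"]["name"])  # vllm
```

Hydra's real override grammar is richer than this (it also supports adding and deleting keys), so treat the sketch purely as a mental model.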

Data Section

The data block controls dataset loading, tokenization, and batching. Paths can point to local files or HDFS paths; verl will download HDFS paths to DRAM automatically.

data:
  tokenizer: null
  train_files: ~/data/rlhf/gsm8k/train.parquet
  val_files: ~/data/rlhf/gsm8k/test.parquet
  train_max_samples: -1     # -1 = use full dataset
  val_max_samples: -1       # -1 = use full dataset
  prompt_key: prompt
  max_prompt_length: 512
  max_response_length: 512
  train_batch_size: 1024    # global batch size (number of prompts)
  return_raw_input_ids: False
  return_raw_chat: False
  return_full_prompt: False
  shuffle: True
  seed: 42
  filter_overlong_prompts: False
  filter_overlong_prompts_workers: 1
  truncation: error         # error | left | right | middle
  image_key: images
  trust_remote_code: True
  custom_cls:
    path: null
    name: null

data.train_files

string | list

Training set parquet file path(s). Accepts a single file or a list of files. The entire dataset is loaded into DRAM so keep the total size below ~100 GB. Supports local paths and HDFS paths.

data.val_files

string | list

Validation set parquet file path(s). Same format as train_files.

data.train_max_samples

int

default:"-1"

Maximum samples to draw from the training set. Set to -1 to use the full dataset.

data.val_max_samples

int

default:"-1"

Maximum samples from the validation set. Set to -1 for the full set.

data.prompt_key

string

default:"prompt"

Column name in the parquet file that contains the prompt text.

data.max_prompt_length

int

default:"512"

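Maximum prompt token length. All prompts are left-padded to this length. With the default truncation: error, a prompt longer than this value raises an error; set truncation to left, right, or middle, or enable filter_overlong_prompts, to handle long prompts instead.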

data.max_response_length

int

default:"512"

Maximum response token length. The rollout engine generates up to this many tokens per prompt.

data.train_batch_size

int

default:"1024"

Global batch size (number of prompts) sampled per training iteration. This is the algorithmic batch size seen from the single-controller perspective; it is normalized across workers internally.

data.return_raw_input_ids

boolean

default:"False"

Return the original input_ids without applying the chat template. Set to True when the reward model uses a different tokenizer or chat template than the policy, because the tokens must then be decoded and re-encoded for the RM.

data.truncation

string

default:"error"

How to handle prompts that exceed max_prompt_length. Options:

error — raise an exception (default; forces you to set an appropriate limit)
left — truncate from the left
right — truncate from the right
middle — keep the head and tail, drop the middle portion
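
The four modes listed above can be sketched on a plain list of token ids. This is a minimal illustration, assuming that middle splits the budget evenly between head and tail; the function name truncate_prompt is hypothetical and the code is not verl's implementation.

```python
def truncate_prompt(token_ids: list[int], max_length: int, mode: str = "error") -> list[int]:
    """Shorten token_ids to at most max_length tokens according to mode."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    if len(token_ids) <= max_length:
        return token_ids  # nothing to do
    if mode == "left":
        # Truncate from the left: keep the last max_length tokens.
        return token_ids[-max_length:]
    if mode == "right":
        # Truncate from the right: keep the first max_length tokens.
        return token_ids[:max_length]
    if mode == "middle":
        # Keep the head and the tail, drop the tokens in between.
        head = max_length // 2
        tail = max_length - head
        return token_ids[:head] + token_ids[-tail:]
    if mode == "error":
        raise ValueError(
            f"prompt has {len(token_ids)} tokens, limit is {max_length}"
        )
    raise ValueError(f"unknown truncation mode: {mode}")


tokens = list(range(10))
print(truncate_prompt(tokens, 4, "left"))    # [6, 7, 8, 9]
print(truncate_prompt(tokens, 4, "right"))   # [0, 1, 2, 3]
print(truncate_prompt(tokens, 4, "middle"))  # [0, 1, 8, 9]
```

For chat-style prompts, left keeps the most recent turns, while middle preserves both the system prompt at the start and the question at the end.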

data.filter_overlong_prompts

boolean

default:"False"

When True, prompts that exceed max_prompt_length are silently dropped rather than raising an error. Use filter_overlong_prompts_workers to parallelize this step on large datasets.

data.seed

int

default:"42"

Random seed for data shuffling. Set to null for non-deterministic ordering across runs.

Custom Dataset Class

data:
  custom_cls:
    path: null   # path to Python file with your dataset class
    name: null   # class name inside that file

data.custom_cls.path

string

Path to a Python file containing a custom dataset class. If null, verl’s built-in dataset implementation is used.

data.custom_cls.name

string

Class name within the file pointed to by data.custom_cls.path.

Actor / Rollout / Reference Policy Section

All three model roles — actor (policy being trained), rollout (inference engine), and reference policy — share a single actor_rollout_ref config block. They share the same base model weights but have separate runtime configurations.

actor_rollout_ref:
  hybrid_engine: True
  model:
    path: ~/models/deepseek-llm-7b-chat
    external_lib: null
    override_config:
      attn_implementation: flash_attention_2
    enable_gradient_checkpointing: False
    enable_activation_offload: False
    trust_remote_code: False
    use_remove_padding: False

actor_rollout_ref.model.path

string

required

HuggingFace model identifier or local/HDFS path. This single path is shared by the actor, rollout, and reference model. HDFS paths are downloaded to DRAM automatically.

actor_rollout_ref.model.override_config.attn_implementation

string

default:"flash_attention_2"

Override the attention implementation. Options: flash_attention_2, eager, sdpa. Use eager for debugging or when Flash Attention 2 is unavailable.

actor_rollout_ref.model.enable_gradient_checkpointing

boolean

default:"False"

Enable gradient checkpointing for the actor (FSDP only). Reduces GPU memory by discarding intermediate activations and recomputing them during the backward pass, at the cost of extra compute. For Megatron, use override_transformer_config recompute options instead.

actor_rollout_ref.model.enable_activation_offload

boolean

default:"False"

Offload activations to CPU during the forward pass (FSDP only). Works alongside gradient checkpointing to further reduce peak GPU memory.

actor_rollout_ref.model.use_remove_padding

boolean

default:"False"

Enable sequence packing (remove padding tokens). Improves throughput significantly for variable-length sequences. Supported for Llama, Mistral, Gemma, and Qwen-based models.

Actor Training

actor_rollout_ref:
  actor:
    strategy: fsdp         # fsdp | fsdp2 | megatron
    ppo_mini_batch_size: 256
    ppo_micro_batch_size_per_gpu: 8
    use_dynamic_bsz: False
    ppo_max_token_len_per_gpu: 16384
    grad_clip: 1.0
    clip_ratio: 0.2
    entropy_coeff: 0.0
    use_kl_loss: False
    kl_loss_coef: 0.001
    kl_loss_type: low_var_kl
    ppo_epochs: 1
    shuffle: False
    use_torch_compile: True
    loss_agg_mode: token-mean
    optim:
      lr: 1e-6
      lr_warmup_steps: -1
      lr_warmup_steps_ratio: 0.0
      min_lr_ratio: 0.0
      lr_scheduler_type: constant  # constant | cosine
    fsdp_config:
      param_offload: False
      optimizer_offload: False
      fsdp_size: -1

actor_rollout_ref.actor.strategy

string

default:"fsdp"

Distributed training backend for the actor. Options:

fsdp — PyTorch FSDP (default, FSDP1)
fsdp2 — PyTorch FSDP2, recommended for newer workloads (7% lower memory, 1.5% higher throughput vs FSDP1)
megatron — NVIDIA Megatron-LM for very large models

actor_rollout_ref.actor.ppo_mini_batch_size

int

default:"256"

Global mini-batch size for PPO actor updates. The train_batch_size is split into sub-batches of this size for multiple gradient steps per iteration.

actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu

int

default:"8"

Per-GPU micro-batch size for actor forward/backward passes (gradient accumulation). Smaller values trade throughput for lower GPU memory. Use this field; ppo_micro_batch_size (global) is deprecated.
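
The relationship between train_batch_size, ppo_mini_batch_size, and ppo_micro_batch_size_per_gpu is easiest to see as arithmetic. The sketch below assumes rollout.n = 1, pure data parallelism, and a mini-batch that is split evenly across GPUs; the helper name actor_update_plan is hypothetical, and the exact normalization inside verl may differ.

```python
def actor_update_plan(
    train_batch_size: int,
    ppo_mini_batch_size: int,
    micro_batch_size_per_gpu: int,
    n_gpus: int,
    ppo_epochs: int = 1,
) -> dict:
    """Derive optimizer steps and gradient-accumulation steps per iteration."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size must be divisible by ppo_mini_batch_size")
    if ppo_mini_batch_size % n_gpus != 0:
        raise ValueError("ppo_mini_batch_size must be divisible by n_gpus")
    mini_batches_per_epoch = train_batch_size // ppo_mini_batch_size
    # Assumption: each GPU receives an equal share of every mini-batch.
    samples_per_gpu = ppo_mini_batch_size // n_gpus
    if samples_per_gpu % micro_batch_size_per_gpu != 0:
        raise ValueError("per-GPU share must be divisible by the micro-batch size")
    return {
        "optimizer_steps": mini_batches_per_epoch * ppo_epochs,
        "samples_per_gpu_per_mini_batch": samples_per_gpu,
        "grad_accumulation_steps": samples_per_gpu // micro_batch_size_per_gpu,
    }


# Defaults from this page on a single 8-GPU node.
print(actor_update_plan(1024, 256, 8, n_gpus=8))
# {'optimizer_steps': 4, 'samples_per_gpu_per_mini_batch': 32, 'grad_accumulation_steps': 4}
```

Lowering ppo_micro_batch_size_per_gpu only increases the number of accumulation steps; the number of optimizer steps, and therefore the algorithm's behavior, is set by the mini-batch size.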

actor_rollout_ref.actor.ppo_epochs

int

default:"1"

Number of PPO update epochs over the same batch of rollout data. Higher values extract more signal per rollout but risk over-fitting to stale data.

actor_rollout_ref.actor.clip_ratio

float

default:"0.2"

PPO clip range (ε). The policy ratio π/π_old is clipped to [1-ε, 1+ε] to prevent excessively large updates.
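
As a reference point, the clipped surrogate for a single token can be written in a few lines. This is a minimal sketch of the standard PPO objective expressed as a loss to minimize, not verl's batched implementation; the function name is hypothetical.

```python
import math


def clipped_policy_loss(
    log_prob: float,
    old_log_prob: float,
    advantage: float,
    clip_ratio: float = 0.2,
) -> float:
    """Per-token PPO loss: the negative of the clipped surrogate objective."""
    ratio = math.exp(log_prob - old_log_prob)  # pi / pi_old
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    # Taking the larger loss is the same as taking the smaller (pessimistic) objective.
    return max(-advantage * ratio, -advantage * clipped)


# The ratio 1.5 exceeds 1 + eps = 1.2, so a positive advantage earns no extra credit.
print(clipped_policy_loss(math.log(1.5), 0.0, advantage=1.0))
# With a negative advantage the unclipped term dominates and the penalty is kept.
print(clipped_policy_loss(math.log(1.5), 0.0, advantage=-1.0))
```

Once the ratio leaves the [1-ε, 1+ε] band in the direction the advantage favors, the loss no longer depends on the ratio, so the gradient for that token is zero.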

actor_rollout_ref.actor.grad_clip

float

default:"1.0"

Gradient norm clipping threshold. Helps stabilize training and prevents gradient explosions.

actor_rollout_ref.actor.use_kl_loss

boolean

default:"False"

Add a KL divergence penalty term directly to the actor loss (used in GRPO). When True, the KL is applied in the loss rather than in the reward function. The reference model is automatically enabled when this is set.

actor_rollout_ref.actor.kl_loss_coef

float

default:"0.001"

Coefficient weighting the KL loss term when use_kl_loss=True.

actor_rollout_ref.actor.kl_loss_type

string

default:"low_var_kl"

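KL divergence estimator. Options: kl / k1, abs, mse / k2, low_var_kl / k3, full. Appending + (e.g. k1+, k3+) uses straight-through estimation for unbiased gradients. The k1 / k2 / k3 names follow John Schulman's blog post "Approximating KL Divergence", which analyzes the bias and variance of these estimators.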
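
The per-token estimators can be sketched from the log-probabilities of the sampled token under the actor and the reference policy. This is a hedged illustration using the usual k1 / k2 / k3 definitions; the function name is hypothetical, the full option (which needs the whole vocabulary distribution) and the + variants are omitted, and verl's implementation may add details such as clamping.

```python
import math


def kl_penalty(log_prob: float, ref_log_prob: float, kind: str = "low_var_kl") -> float:
    """Per-token KL estimate between the actor and the reference policy."""
    diff = log_prob - ref_log_prob  # log(pi / pi_ref) for the sampled token
    if kind in ("kl", "k1"):
        return diff  # unbiased, but can be negative and has high variance
    if kind == "abs":
        return abs(diff)
    if kind in ("mse", "k2"):
        return 0.5 * diff ** 2
    if kind in ("low_var_kl", "k3"):
        log_ratio = -diff  # log(pi_ref / pi)
        # exp(x) - x - 1 is never negative and has low variance.
        return math.exp(log_ratio) - log_ratio - 1.0
    raise ValueError(f"unsupported kl_loss_type: {kind}")


for kind in ("k1", "abs", "k2", "k3"):
    print(kind, round(kl_penalty(-1.0, -1.5, kind), 4))
```

When the two policies agree on a token (equal log-probabilities), every estimator returns 0.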

actor_rollout_ref.actor.entropy_coeff

float

default:"0.0"

Weight of the entropy bonus in the PPO loss. Encourages exploration. Default changed to 0.0 from v0.3.x onward.

actor_rollout_ref.actor.use_dynamic_bsz

boolean

default:"False"

Enable dynamic batching (sequence packing) for actor updates. When True, use ppo_max_token_len_per_gpu instead of ppo_micro_batch_size_per_gpu to control memory. Significantly improves throughput on variable-length data.

actor_rollout_ref.actor.ppo_max_token_len_per_gpu

int

default:"16384"

Maximum tokens per GPU per forward/backward pass when use_dynamic_bsz=True. A good starting point is 2 × (max_prompt_length + max_response_length).

actor_rollout_ref.actor.optim.lr

float

default:"1e-6"

Actor learning rate. For RL fine-tuning, typical values are 1e-7 to 1e-6.

actor_rollout_ref.actor.optim.lr_scheduler_type

string

default:"constant"

LR scheduler type. Options: constant, cosine. For cosine, also configure min_lr_ratio and num_cycles.

actor_rollout_ref.actor.fsdp_config.param_offload

boolean

default:"False"

Offload actor model parameters to CPU when not in use (FSDP). Trades speed for GPU memory. The equivalent setting for the reference model, ref.fsdp_config.param_offload, is the one recommended for models of 7B parameters or more.

actor_rollout_ref.actor.fsdp_config.optimizer_offload

boolean

default:"False"

Offload optimizer states to CPU (FSDP). Frees significant GPU memory when optimizer states are large.

Rollout Engine

actor_rollout_ref:
  rollout:
    name: vllm              # vllm | sglang | hf
    temperature: 1.0
    top_k: -1
    top_p: 1.0
    dtype: bfloat16
    gpu_memory_utilization: 0.5
    ignore_eos: False
    enforce_eager: False
    free_cache_engine: True
    load_format: dummy
    tensor_model_parallel_size: 2
    max_num_batched_tokens: 8192
    max_num_seqs: 1024
    n: 1
    calculate_log_probs: True
    val_kwargs:
      temperature: 0
      top_k: -1
      top_p: 1.0
      n: 1
      do_sample: False
    multi_turn:
      enable: False
      max_assistant_turns: null
      tool_config_path: null
    engine_kwargs:
      vllm: {}
      sglang: {}

actor_rollout_ref.rollout.name

string

required

Rollout inference backend. Options: vllm, sglang, hf. vLLM and SGLang are recommended for production; HF is useful for debugging.

actor_rollout_ref.rollout.tensor_model_parallel_size

int

default:"2"

Tensor parallel degree for the rollout engine. A smaller TP size spawns more inference replicas (data parallelism), which typically yields higher throughput. The trade-off is that each GPU then holds a larger share of the model weights, leaving less memory for the KV cache.
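
The replica count follows directly from the GPU count and the TP size. The sketch below assumes the rollout engine uses every GPU in the job (as with the colocated hybrid engine); the helper name is hypothetical.

```python
def rollout_replicas(nnodes: int, n_gpus_per_node: int, tensor_model_parallel_size: int) -> int:
    """Number of independent inference replicas for a given TP size."""
    world_size = nnodes * n_gpus_per_node
    if world_size % tensor_model_parallel_size != 0:
        raise ValueError("total GPU count must be divisible by the TP size")
    return world_size // tensor_model_parallel_size


# One 8-GPU node: halving TP doubles the number of replicas.
for tp in (1, 2, 4, 8):
    print(f"TP={tp}: {rollout_replicas(1, 8, tp)} replicas")
```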

actor_rollout_ref.rollout.gpu_memory_utilization

float

default:"0.5"

Fraction of GPU memory allocated to the rollout engine.

vLLM ≥ 0.7.0: fraction of total GPU memory
SGLang: fraction of free GPU memory for static memory (model weights + KV cache)

Values between 0.5 and 0.7 balance throughput and OOM risk when actor parameters and optimizer states are not offloaded.

actor_rollout_ref.rollout.n

int

default:"1"

Number of responses to sample per prompt. Set to values greater than 1 for GRPO and RLOO, which require multiple samples per prompt to estimate advantages.

actor_rollout_ref.rollout.temperature

float

default:"1.0"

Sampling temperature during training rollout. Use 0 for greedy decoding (also set in val_kwargs for deterministic evaluation).

actor_rollout_ref.rollout.free_cache_engine

boolean

default:"True"

Offload the KV cache after the rollout generation stage to free GPU memory for actor/critic training.

actor_rollout_ref.rollout.enforce_eager

boolean

default:"False"

Disable CUDA graphs in the vLLM engine. Set to True when free_cache_engine=True with vLLM 0.5.4 / 0.6.3, or for debugging. Default False for best performance.

actor_rollout_ref.rollout.multi_turn.enable

boolean

default:"False"

Enable multi-turn agentic rollout with tool calling. Requires rollout.name=sglang. Configure tool definitions via tool_config_path or function_tool_path.

actor_rollout_ref.rollout.calculate_log_probs

boolean

default:"True"

Compute log probabilities during rollout. Required for Rollout Correction (truncated importance sampling). Also enables training/rollout_probs_diff_mean diagnostics.

actor_rollout_ref.rollout.engine_kwargs.vllm

object

default:"{}"

Extra keyword arguments passed directly to the vLLM engine constructor. Refer to the vLLM documentation for available options.

actor_rollout_ref.rollout.engine_kwargs.sglang

object

default:"{}"

Extra keyword arguments for the SGLang engine. Refer to the SGLang documentation for available options.

Reference Model

The reference model is activated automatically when actor.use_kl_loss=True or algorithm.use_kl_in_reward=True.

actor_rollout_ref:
  ref:
    fsdp_config:
      param_offload: False   # recommended True for 7B+ models
    log_prob_micro_batch_size_per_gpu: 16

actor_rollout_ref.ref.fsdp_config.param_offload

boolean

default:"False"

Offload reference model parameters to CPU. Strongly recommended for models 7B or larger to avoid GPU OOM during concurrent actor training.

Critic Section

The critic model (value function) is only needed when algorithm.adv_estimator is gae, i.e. standard PPO; critic-free estimators such as grpo do not use it. Its configuration mirrors the actor model.

critic:
  strategy: fsdp            # fsdp | fsdp2 | megatron
  ppo_mini_batch_size: 256
  ppo_micro_batch_size_per_gpu: 8
  ppo_epochs: 1
  forward_micro_batch_size_per_gpu: 16
  model:
    path: ~/models/deepseek-llm-7b-chat
    enable_gradient_checkpointing: False
  optim:
    lr: 1e-5
    lr_scheduler_type: constant
  fsdp_config:
    param_offload: False
    optimizer_offload: False
  checkpoint:
    save_contents: ['model', 'optimizer', 'extra']
    load_contents: ['model', 'optimizer', 'extra']

critic.model.path

string

Critic model path. Typically set to the same base model as the actor. The critic adds a scalar value head on top of the transformer.

critic.ppo_mini_batch_size

int

Global mini-batch size for critic gradient updates. Can often be larger than the actor’s mini-batch size since the critic has no large vocabulary output head.

critic.ppo_epochs

int

default:"1"

Number of update epochs over the rollout batch for critic training.

critic.optim.lr

float

default:"1e-5"

Critic learning rate. Often set higher than the actor learning rate.

Reward Section

reward:
  num_workers: 8
  custom_reward_function:
    path: null
    name: compute_score
  reward_manager:
    name: naive              # naive | prime
  reward_model:
    enable: False
    model_path: null
    rollout:
      name: ???
      tensor_model_parallel_size: 2
      gpu_memory_utilization: 0.5

reward.custom_reward_function.path

string

Path to a Python file containing your custom reward function. If null, verl’s built-in reward functions are used (e.g., for GSM8K and MATH).

reward.custom_reward_function.name

string

default:"compute_score"

Name of the reward function inside the file at custom_reward_function.path. The function receives (data_source, solution_str, ground_truth, extra_info) and must return a float.

reward.reward_manager.name

string

default:"naive"

Reward computation strategy. naive runs verifications sequentially; prime parallelizes them across workers when all verification functions are multiprocessing-safe.

reward.reward_model.enable

boolean

default:"False"

Enable a model-based reward model. When False, only the custom reward function is used. When True, the reward model is deployed as an inference server alongside the rollout engine.

Custom Reward Function

Implement compute_score in a Python file and point the config to it:

# my_reward.py
def compute_score(data_source, solution_str, ground_truth, extra_info=None) -> float:
    """
    Args:
        data_source: dataset name/identifier
        solution_str: the model's generated response (string)
        ground_truth: the expected answer
        extra_info: optional dict with additional metadata
    Returns:
        float reward score
    """
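    # Exact match after trimming whitespace; str() guards against
    # non-string ground truths such as integers.
    if solution_str.strip() == str(ground_truth).strip():
        return 1.0
    return 0.0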

reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: compute_score

Algorithm Section

algorithm:
  gamma: 1.0
  lam: 1.0
  adv_estimator: gae        # gae | grpo | reinforce_plus_plus | reinforce_plus_plus_baseline | rloo | rloo_vectorized | grpo_vectorized
  use_kl_in_reward: False
  kl_penalty: kl            # kl | abs | mse | low_var_kl | full
  norm_adv_by_std_in_grpo: True
  kl_ctrl:
    type: fixed             # fixed | adaptive
    kl_coef: 0.001
    horizon: 10000
    target_kl: 0.1

algorithm.gamma

float

default:"1.0"

Discount factor for future rewards. 1.0 means no discounting, which is appropriate for episodic tasks with a single sparse reward at the end of the response. Reduce it for long-horizon tasks with intermediate rewards.

algorithm.lam

float

default:"1.0"

GAE (Generalized Advantage Estimation) λ parameter. Controls the bias-variance tradeoff: 0 = one-step TD (low variance, high bias), 1 = Monte Carlo returns (high variance, low bias).
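
The roles of gamma and lam are clearest in the GAE recursion itself. The following is a minimal single-sequence sketch of the textbook recursion, assuming the value after the last token is 0; it is not verl's batched, masked implementation, and the function name is hypothetical.

```python
def gae_advantages(
    rewards: list[float],
    values: list[float],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> list[float]:
    """Generalized Advantage Estimation for one response."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # assumption: the value after the final token is 0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at position t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [0.0, 0.0, 1.0]  # single terminal reward
values = [0.5, 0.5, 0.5]   # critic predictions
print(gae_advantages(rewards, values, lam=1.0))  # [0.5, 0.5, 0.5]
print(gae_advantages(rewards, values, lam=0.0))  # [0.0, 0.0, 0.5]
```

With lam=1 every token receives the Monte Carlo return minus its value estimate; with lam=0 only the one-step TD error remains, so the terminal reward reaches earlier tokens only through the critic.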

algorithm.adv_estimator

string

default:"gae"

Advantage estimation method. Options:

gae — standard PPO with Generalized Advantage Estimation (requires critic)
grpo — Group Relative Policy Optimization (no critic needed)
reinforce_plus_plus — REINFORCE++ with improved baseline
reinforce_plus_plus_baseline — REINFORCE++ with explicit baseline
rloo / rloo_vectorized — REINFORCE Leave-One-Out
grpo_vectorized — vectorized GRPO implementation
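
To show why grpo needs no critic, the sketch below computes group-relative advantages for the rollout.n responses sampled from one prompt: each response's reward is compared with the group mean and, when norm_adv_by_std_in_grpo is enabled, divided by the group standard deviation. The function name, the eps value, and the use of the sample standard deviation are assumptions of this illustration.

```python
import statistics


def grpo_advantages(
    rewards: list[float],
    norm_by_std: bool = True,
    eps: float = 1e-6,
) -> list[float]:
    """Group-relative advantages for the responses sampled from one prompt."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not norm_by_std:
        return centered
    # Assumption: sample standard deviation; a single response has no spread.
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [c / (std + eps) for c in centered]


# Four responses to the same prompt, two of them correct.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
print(grpo_advantages([1.0, 0.0, 0.0, 1.0], norm_by_std=False))  # [0.5, -0.5, -0.5, 0.5]
```

If every response in a group receives the same reward, all advantages are 0 and the prompt contributes no policy gradient, which is one reason rollout.n must be greater than 1 for this estimator.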

algorithm.use_kl_in_reward

boolean

default:"False"

Add a KL penalty term to the reward signal at each token. Distinct from actor.use_kl_loss which adds KL to the loss. When True, the reference model is enabled automatically.

algorithm.kl_ctrl.type

string

default:"fixed"

KL controller type. fixed keeps kl_coef constant; adaptive adjusts it dynamically based on target_kl over a horizon window.

algorithm.kl_ctrl.kl_coef

float

default:"0.001"

KL penalty coefficient for in-reward KL (use_kl_in_reward=True). The initial coefficient when using the adaptive controller.
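
The adaptive controller can be sketched as the proportional update commonly used in RLHF codebases: the coefficient grows when the measured KL is above target_kl and shrinks when it is below. The class name and the ±0.2 clip on the error are assumptions of this sketch; check the verl source for the exact rule.

```python
class AdaptiveKLCoef:
    """Proportional controller that steers the measured KL toward target_kl."""

    def __init__(self, init_kl_coef: float, target_kl: float, horizon: int) -> None:
        self.value = init_kl_coef  # kl_ctrl.kl_coef is the starting point
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> None:
        # Relative error, clipped so one noisy batch cannot swing the coefficient.
        error = min(max(current_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.value *= 1.0 + error * n_steps / self.horizon


ctrl = AdaptiveKLCoef(init_kl_coef=0.001, target_kl=0.1, horizon=10000)
ctrl.update(current_kl=0.2, n_steps=1000)   # KL too high: coefficient increases
print(ctrl.value)
ctrl.update(current_kl=0.05, n_steps=1000)  # KL too low: coefficient decreases
print(ctrl.value)
```

A larger horizon makes the coefficient react more slowly. With type: fixed, the coefficient simply stays at kl_coef.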

Trainer Section

trainer:
  total_epochs: 30
  total_training_steps: null
  project_name: verl_examples
  experiment_name: gsm8k
  logger: ["console", "wandb"]
  log_val_generations: 0
  nnodes: 1
  n_gpus_per_node: 8
  save_freq: -1
  val_before_train: True
  test_freq: -1
  critic_warmup: 0
  default_hdfs_dir: null
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
  resume_mode: auto          # auto | disable | resume_path
  resume_from_path: null
  remove_previous_ckpt_in_save: False
  del_local_ckpt_after_load: False
  max_actor_ckpt_to_keep: null
  max_critic_ckpt_to_keep: null
  ray_wait_register_center_timeout: 300
  balance_batch: True

trainer.total_epochs

int

default:"30"

Number of full passes through the training dataset.

trainer.total_training_steps

int

Set an explicit step limit instead of using total_epochs. When null, the step count is derived from total_epochs and the number of batches per epoch, i.e. the dataset size divided by data.train_batch_size.
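
The step count can be estimated before launching a run. This sketch assumes the final incomplete batch of each epoch is dropped and uses a hypothetical dataset size; the helper name is also hypothetical.

```python
from typing import Optional


def total_training_steps(
    num_prompts: int,
    train_batch_size: int,
    total_epochs: int,
    explicit_steps: Optional[int] = None,
) -> int:
    """Number of training iterations for a run."""
    if explicit_steps is not None:
        return explicit_steps  # trainer.total_training_steps takes precedence
    # Assumption: an incomplete final batch is dropped each epoch.
    steps_per_epoch = num_prompts // train_batch_size
    return steps_per_epoch * total_epochs


# Hypothetical dataset of 7,000 prompts with the defaults from this page.
print(total_training_steps(7000, 1024, 30))  # 180
```

Knowing this number helps when choosing save_freq and test_freq, since both are expressed in training iterations.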

trainer.project_name

string

default:"verl_examples"

Project name for experiment tracking (wandb, SwanLab, MLflow).

trainer.experiment_name

string

default:"gsm8k"

Run/experiment name for tracking and as a component of the checkpoint directory path.

trainer.logger

list

default:"[\"console\", \"wandb\"]"

Active logging backends. Supported values: "console", "wandb", "swanlab", "mlflow", "tensorboard", "trackio". Provide as a list to enable multiple simultaneously.

trainer.log_val_generations

int

default:"0"

Number of validation generations to log to the experiment tracker at each validation step. Set to 0 to disable (reduces overhead). Previously named val_generations_to_log_to_wandb.

trainer.nnodes

int

default:"1"

Number of nodes in the Ray cluster.

trainer.n_gpus_per_node

int

default:"8"

Number of GPUs per node.

trainer.save_freq

int

default:"-1"

Checkpoint save frequency in training iterations. -1 disables periodic saving (only saves at end of training).

trainer.test_freq

int

default:"-1"

Validation frequency in training iterations. -1 disables periodic validation.

trainer.val_before_train

boolean

default:"True"

Run a validation pass before the first training step to establish a baseline reward score.

trainer.critic_warmup

int

default:"0"

Number of iterations to train the critic alone before starting policy updates. Useful when the critic needs to stabilize its value estimates first.

trainer.resume_mode

string

default:"auto"

Checkpoint resume strategy:

auto — resume from the latest checkpoint in default_local_dir if one exists
disable — always start from scratch
resume_path — resume from the path specified in resume_from_path

trainer.resume_from_path

string

Explicit checkpoint directory to resume from. Only used when resume_mode=resume_path.

trainer.default_local_dir

string

Root directory for local checkpoint storage. Defaults to checkpoints/{project_name}/{experiment_name}.

trainer.max_actor_ckpt_to_keep

int

Maximum number of actor checkpoints to retain on disk. Older checkpoints are deleted. null keeps all.

trainer.balance_batch

boolean

default:"True"

Balance batch sizes across distributed workers to avoid stragglers when sequence lengths vary.

Checkpoint Section

Checkpoint settings are nested under each model role (actor, critic). The same save_contents / load_contents pattern applies to all roles.

actor_rollout_ref:
  actor:
    checkpoint:
      save_contents: ['model', 'optimizer', 'extra']
      load_contents: ['model', 'optimizer', 'extra']

actor_rollout_ref.actor.checkpoint.save_contents

list

default:"['model', 'optimizer', 'extra']"

Contents to include in saved checkpoints. Valid values:

model — framework-native sharded weights (FSDP per-rank shards or Megatron dist checkpoint / HF via mbridge)
optimizer — sharded optimizer state
extra — LR scheduler, RNG states, and (for Megatron) the serialized TransformerConfig
hf_model — full HuggingFace format weights (suitable for inference)

actor_rollout_ref.actor.checkpoint.load_contents

list

Contents to load when resuming. Defaults to the same as save_contents. You can specify a subset to, for example, load only model weights without optimizer state.

Complete Minimal Example

The following is a minimal config for a PPO run on GSM8K with a 7B model on a single 8-GPU node:

# Override these fields from ppo_trainer.yaml defaults
data:
  train_files: ~/data/rlhf/gsm8k/train.parquet
  val_files: ~/data/rlhf/gsm8k/test.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 1024

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 4
    optim:
      lr: 1e-6
    fsdp_config:
      param_offload: False
  rollout:
    name: vllm
    tensor_model_parallel_size: 2
    gpu_memory_utilization: 0.6
    n: 1
  ref:
    fsdp_config:
      param_offload: True  # recommended for 7B+

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # same base model as the actor
  optim:
    lr: 1e-5

algorithm:
  adv_estimator: gae
  use_kl_in_reward: False

reward:
  custom_reward_function:
    path: examples/data_preprocess/gsm8k_reward.py
    name: compute_score

trainer:
  total_epochs: 15
  project_name: my-rl-project
  experiment_name: ppo-qwen25-gsm8k
  logger: ["wandb", "console"]
  save_freq: 50
  test_freq: 10
  n_gpus_per_node: 8
  nnodes: 1
