Before launching a verl training job, your dataset must be converted to Parquet format with a specific set of fields that the RL trainer knows how to consume. The preprocessing step is intentionally separate from training: you run a script once to produce .parquet files, then point your training config at those files. This separation keeps the training loop fast and lets you preprocess on CPU without holding GPU resources.

Required Parquet Schema

Every row in your training (and evaluation) Parquet file must contain the following fields:

| Field | Type | Description |
| --- | --- | --- |
| data_source | string | Dataset name used to route the sample to the correct reward function in RewardManager |
| prompt | list[dict] | The conversation prompt in HuggingFace chat-template format: a list of {"role": ..., "content": ...} dicts |
| ability | string | Task category (e.g., "math", "code", "qa"), used for logging and filtering |
| reward_model | dict | Contains at minimum {"style": "rule", "ground_truth": "<answer>"} for rule-based rewards |
| extra_info | dict | Arbitrary metadata (split name, index, original question text, etc.) |

verl applies the model’s chat template automatically at training time, so the prompt field should be a list of role/content dicts, not a pre-formatted string. RLHFDataset calls the tokenizer’s apply_chat_template during data loading, which means the model always sees the correct format regardless of which model family you use.
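Before writing Parquet, it can help to check a few rows against the schema. The sketch below is a minimal illustration: validate_row and REQUIRED_FIELDS are hypothetical helpers, not part of verl, and they only check field presence, top-level types, and the role/content shape of each prompt turn.

```python
# Hypothetical helper, not part of verl: checks one row against the schema above.
REQUIRED_FIELDS = {
    "data_source": str,
    "prompt": list,
    "ability": str,
    "reward_model": dict,
    "extra_info": dict,
}


def validate_row(row):
    """Return a list of problems; an empty list means the row looks valid."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")

    # Each prompt turn must be a dict carrying both "role" and "content".
    prompt = row.get("prompt")
    if isinstance(prompt, list):
        for i, turn in enumerate(prompt):
            if not isinstance(turn, dict) or not {"role", "content"} <= set(turn):
                problems.append(f"prompt[{i}] needs 'role' and 'content' keys")
    return problems


row = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": "What is 6 * 7?"}],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "42"},
    "extra_info": {"split": "train", "index": 0},
}
print(validate_row(row))
```

Run a check like this on the in-memory rows produced by your map function; after a round trip through Parquet, list-valued fields may come back as array types rather than Python lists, depending on the reader.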

The reward_model Field

The reward_model dict carries the information needed by the reward function:
{
  "style": "rule",
  "ground_truth": "42"
}
The ground_truth value is extracted from this field at reward-computation time and passed directly to your compute_score function. Its format must exactly match what your reward function expects — for GSM8K, for example, it is a plain numeric string with commas removed.
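To make the format requirement concrete, here is a minimal sketch of a rule-based scorer that consumes ground_truth. The signature (data_source, solution_str, ground_truth, extra_info) is an assumption modeled on how custom reward functions are commonly written for verl; check the reward function interface of your verl version. This is an illustration, not the scorer bundled with verl.

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 when the final '#### <number>' in the response equals ground_truth."""
    # Use the last marker in case the model repeats it while reasoning.
    matches = re.findall(r"#### (-?[0-9.,]+)", solution_str)
    if not matches:
        return 0.0
    # Normalize exactly as the preprocessing step does: drop thousands separators.
    predicted = matches[-1].replace(",", "")
    return 1.0 if predicted == ground_truth else 0.0


response = "There are 40 + 2 = 42 apples.\n#### 42"
print(compute_score("openai/gsm8k", response, "42"))
```

Because the comparison is an exact string match, a ground_truth stored as "1,000" would never match a prediction normalized to "1000". This is why the preprocessing script and the reward function must agree on normalization.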

Writing a Preprocessing Script

Preprocessing scripts follow a consistent structure: load the raw dataset with the HuggingFace datasets library, apply the function returned by make_map_fn to each row, then write Parquet files to disk. The four steps below walk through the GSM8K script.

1. Load the Raw Dataset

Use datasets.load_dataset to fetch your dataset from HuggingFace Hub or from a local path.
import datasets

data_source = "openai/gsm8k"
dataset = datasets.load_dataset(data_source, "main")

train_dataset = dataset["train"]
test_dataset = dataset["test"]

2. Implement extract_solution

Write a helper that extracts the clean answer string from the raw label. For GSM8K, the answer appears after ####:
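import re

def extract_solution(solution_str):
    # GSM8K labels end with "#### <number>"; capture the number, which may be
    # negative and may contain thousands separators or a decimal point.
    solution = re.search(r"#### (-?[0-9.,]+)", solution_str)
    assert solution is not None, "no '#### <number>' marker found in label"
    # Strip thousands separators so "1,250" becomes "1250".
    return solution.group(1).replace(",", "")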
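Because extract_solution asserts that the marker exists, a single malformed label aborts the whole map call. A quick pre-check can report such rows up front. The sketch below uses a hypothetical find_unparseable helper and made-up labels in GSM8K style; it is not part of the shipped script.

```python
import re

# Same marker that extract_solution looks for.
ANSWER_PATTERN = re.compile(r"#### (-?[0-9.,]+)")


def find_unparseable(labels):
    """Return the indices of labels that lack a '#### <number>' marker."""
    return [i for i, label in enumerate(labels) if ANSWER_PATTERN.search(label) is None]


labels = [
    "3 + 4 = 7\n#### 7",
    "The total is 1,250 dollars.\n#### 1,250",
    "The answer is twelve.",  # no marker: extract_solution would fail here
]
print(find_unparseable(labels))  # [2]
```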

3. Implement make_map_fn

Return a process_fn that transforms each raw example into the verl schema:
instruction_following = 'Let\'s think step by step and output the final answer after "####".'

def make_map_fn(split):
    def process_fn(example, idx):
        question_raw = example.pop("question")
        question = question_raw + " " + instruction_following

        answer_raw = example.pop("answer")
        solution = extract_solution(answer_raw)

        data = {
            "data_source": data_source,
            "prompt": [
                {
                    "role": "user",
                    "content": question,
                }
            ],
            "ability": "math",
            "reward_model": {
                "style": "rule",
                "ground_truth": solution,
            },
            "extra_info": {
                "split": split,
                "index": idx,
                "answer": answer_raw,
                "question": question_raw,
            },
        }
        return data

    return process_fn

4. Apply the Map and Write Parquet

Apply make_map_fn to every row and save the result:
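import argparse
import os

from verl.utils.hdfs_io import copy, makedirs

parser = argparse.ArgumentParser()
parser.add_argument("--local_save_dir", default="~/data/gsm8k")
parser.add_argument("--hdfs_dir", default=None)
args = parser.parse_args()

# with_indices=True passes the row index as the second argument to process_fn.
train_dataset = train_dataset.map(
    function=make_map_fn("train"), with_indices=True
)
test_dataset = test_dataset.map(
    function=make_map_fn("test"), with_indices=True
)

# Expand "~" so the default path resolves to the home directory.
local_save_dir = os.path.expanduser(args.local_save_dir)
os.makedirs(local_save_dir, exist_ok=True)
train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))

# Optionally copy to HDFS
if args.hdfs_dir is not None:
    makedirs(args.hdfs_dir)
    copy(src=local_save_dir, dst=args.hdfs_dir)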
Parquet is a columnar binary format. Compared to JSONL, it typically loads faster during training because values are stored in typed, compressed columns instead of being parsed from text, and readers can skip columns they don’t need. For large datasets (millions of samples), the difference is significant.

Complete GSM8K Example

The full preprocessing script for GSM8K ships at examples/data_preprocess/gsm8k.py. Run it as:
python examples/data_preprocess/gsm8k.py \
    --local_save_dir ~/data/gsm8k \
    --hdfs_dir hdfs://user/data/gsm8k  # optional
Command-line arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --local_save_dir | ~/data/gsm8k | Directory where .parquet files are written |
| --local_dataset_path | None | Path to a locally cached raw dataset (skips HuggingFace download) |
| --hdfs_dir | None | HDFS destination; if set, files are copied there after local save |

Pre-Built Preprocessing Scripts

verl ships preprocessing scripts for several common datasets under examples/data_preprocess/:

| Dataset | Script | Task |
| --- | --- | --- |
| GSM8K | gsm8k.py | Grade-school math |
| MATH | math_dataset.py | Competition math |
| HellaSwag | hellaswag.py | Common-sense reasoning |
| Full HH-RLHF | full_hh_rlhf.py | Dialogue helpfulness/harmlessness |

To prepare a dataset not covered by these scripts, copy the structure of the closest existing script and adjust the extract_solution and make_map_fn implementations for your label format.
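As an illustration of that adaptation, the sketch below rewrites the two functions for a multiple-choice dataset. Everything specific here is hypothetical: the dataset name my_org/my_mcq, the raw field names (question, choices, answer_index), and the choice to store the answer as a letter. A new data_source also needs a reward function that knows how to score it.

```python
DATA_SOURCE = "my_org/my_mcq"  # hypothetical dataset name


def extract_solution(example):
    """Map the integer label to a letter such as 'B' (label format is an assumption)."""
    return "ABCD"[example["answer_index"]]


def make_map_fn(split):
    def process_fn(example, idx):
        # Render the options as "A. ...", "B. ..." lines under the question.
        options = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", example["choices"])
        )
        question = f"{example['question']}\n{options}\nAnswer with a single letter."
        return {
            "data_source": DATA_SOURCE,
            "prompt": [{"role": "user", "content": question}],
            "ability": "qa",
            "reward_model": {
                "style": "rule",
                "ground_truth": extract_solution(example),
            },
            "extra_info": {"split": split, "index": idx},
        }

    return process_fn


raw_example = {
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Earth", "Mars"],
    "answer_index": 1,
}
row = make_map_fn("train")(raw_example, 0)
print(row["reward_model"])
```

Calling process_fn on one hand-written example like this is a cheap way to inspect the output row before running the full map.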

HDFS Support

Both local and HDFS paths are supported throughout verl. The verl.utils.hdfs_io module provides copy and makedirs helpers that work transparently with hdfs:// prefixed URIs. Set your training config’s data.train_files and data.val_files to either a local glob pattern or an HDFS path:
data:
  train_files: ~/data/gsm8k/train.parquet
  val_files: ~/data/gsm8k/test.parquet

Filtering Overlong Prompts

Very long prompts can cause out-of-memory errors at the start of training. verl can pre-filter them during data loading:
data:
  filter_overlong_prompts: true
  filter_overlong_prompts_workers: 4   # parallel CPU workers for filtering
When enabled, any prompt whose tokenized length exceeds data.max_prompt_length is discarded before it reaches the GPU workers.
Filtering is applied at load time, so the effective dataset size may differ from the number of rows in your Parquet files. Log the post-filter dataset size to avoid surprises when estimating training duration.
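The effect of the filter can be sketched in a few lines. This is an illustration of the idea only, not verl's implementation: filter_overlong is a hypothetical helper, and whitespace_token_count stands in for the real tokenizer, which would count tokens after the chat template is applied.

```python
def filter_overlong(rows, count_tokens, max_length):
    """Keep rows whose prompt is at most max_length tokens long."""
    kept = [row for row in rows if count_tokens(row["prompt"]) <= max_length]
    # Log the post-filter size so training-duration estimates use the real count.
    print(f"kept {len(kept)} of {len(rows)} rows")
    return kept


def whitespace_token_count(prompt):
    # Stand-in for a tokenizer: counts whitespace-separated words across all turns.
    return sum(len(turn["content"].split()) for turn in prompt)


rows = [
    {"prompt": [{"role": "user", "content": "What is 2 + 2?"}]},
    {"prompt": [{"role": "user", "content": "word " * 50}]},
]
kept = filter_overlong(rows, whitespace_token_count, max_length=16)
```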

Multi-Turn and Image Inputs

For multi-turn conversations, extend the prompt list with additional {"role": "assistant", ...} and {"role": "user", ...} turns; the chat template handles the rest. For vision-language models, store the image data under an images key in each row, alongside prompt; the RLHFDataset class will pass it through to the tokenizer/processor.
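A multi-turn prompt is just a longer list in the same format. The sketch below shows one and adds a small sanity check; ends_with_user_turn is a hypothetical helper that assumes no system turn and assumes you want the model to generate the next assistant reply, so the last turn should come from the user.

```python
prompt = [
    {"role": "user", "content": "What is 12 * 3?"},
    {"role": "assistant", "content": "12 * 3 = 36."},
    {"role": "user", "content": "Now add 4 to that result."},
]


def ends_with_user_turn(prompt):
    """True when roles alternate user/assistant and the last turn is the user's."""
    expected = ["user", "assistant"]
    roles = [turn["role"] for turn in prompt]
    alternates = all(role == expected[i % 2] for i, role in enumerate(roles))
    return alternates and bool(roles) and roles[-1] == "user"


print(ends_with_user_turn(prompt))
```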
