Every trainer in the TRL ecosystem expects its dataset in a specific column schema.
SFTTrainer wants a single text column; GRPOTrainer wants a prompt list and an answer string; PPOTrainer needs pre-tokenized input_ids. Handling these schema differences in ad-hoc scripts leads to duplication and subtle bugs. BaseDatasetLoader solves this by giving every dataset a single load() entry point and a format_example() hook where the schema is enforced. All built-in loaders in this project subclass BaseDatasetLoader, and you can do the same for any new dataset you add.
DatasetConfig
DatasetConfig is a frozen dataclass that bundles everything needed to download a HuggingFace dataset split.
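As a rough sketch of that shape (the field names below are illustrative assumptions, not necessarily the attribute names used in this project):

```python
# Hypothetical sketch of DatasetConfig; real field names may differ.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass(frozen=True)
class DatasetConfig:
    dataset_id: str                          # e.g. "openai/gsm8k"
    subset: Optional[str] = None             # config name, e.g. "main"
    cache_dir: Optional[str] = None          # None -> default HF cache
    remove_columns: bool = True              # drop source columns after formatting
    num_proc: int = 1                        # processes for Dataset.map
    load_kwargs: dict[str, Any] = field(default_factory=dict)  # extra load_dataset kwargs

cfg = DatasetConfig("openai/gsm8k", subset="main")
```

Because the dataclass is frozen, a config instance cannot be mutated after construction, which makes it safe to share between a loader and a train script.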
- The HuggingFace Hub dataset identifier, e.g. "allenai/ai2_arc" or "openai/gsm8k". Passed directly to datasets.load_dataset.
- The dataset configuration name (second positional argument to load_dataset), e.g. "ARC-Challenge" or "rc". Set to None for datasets with a single configuration.
- Local directory where the downloaded dataset files are cached. Defaults to the HuggingFace cache (~/.cache/huggingface/datasets).
- When True, all original dataset columns are removed after format_example runs, leaving only the columns produced by the formatter. Set to False to keep source columns alongside the formatted ones.
- Number of processes used by datasets.Dataset.map when applying format_example. Increase for large datasets on multi-core machines.
- Additional keyword arguments forwarded verbatim to load_dataset. Use this to pass flags like trust_remote_code=True or download_mode="force_redownload".
BaseDatasetLoader
Output column schemas by trainer type
Each trainer requires a specific set of columns. Your format_example implementation must return a dict that matches the schema for the trainer you are targeting.
| Trainer | Required columns | Types |
|---|---|---|
| SFTTrainer | text | str |
| GRPOTrainer | prompt, answer | list[dict], str |
| DPOTrainer / ORPOTrainer | prompt, chosen, rejected | str, str, str |
| PPOTrainer | input_ids (after tokenization) | list[int] |
| KTOTrainer | prompt, completion, label | str, str, bool |
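To make the schemas above concrete, here are illustrative rows for each (the values are made up; only the keys and types follow the table):

```python
# One example row per schema; values are invented for illustration.
sft_row  = {"text": "Q: 2+2?\nA: 4"}
grpo_row = {"prompt": [{"role": "user", "content": "2+2?"}], "answer": "4"}
dpo_row  = {"prompt": "2+2?", "chosen": "4", "rejected": "5"}
kto_row  = {"prompt": "2+2?", "completion": "4", "label": True}
```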
PPO loaders first produce {"prompt": str} from format_example, then call tokenize() to convert the prompt strings to input_ids. See the tokenize() section below.
The load() method
load() is the primary entry point. Call it with a split name to get a formatted Dataset ready for training:
load() calls datasets.load_dataset with the parameters from DatasetConfig, then pipes the result through _format_dataset, which maps format_example over every row.
The format_example() method
format_example is the only abstract method you must implement in a subclass. It receives one raw dataset row and returns a dict whose keys match the target trainer’s schema.
It is applied by _format_dataset via Dataset.map, so it must be stateless: it cannot accumulate state across rows.
The tokenize() method
tokenize() is used exclusively for PPO, where PPOTrainer expects input_ids rather than raw strings:
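A minimal sketch of the call shape, with a toy tokenizer and a list of dicts standing in for the real HuggingFace tokenizer and Dataset:

```python
# Hypothetical sketch of tokenize(): encode one column into input_ids.
# The stub tokenizer below stands in for a real HuggingFace tokenizer.
def tokenize(rows, tokenizer, column="prompt"):
    return [{"input_ids": tokenizer(row[column])} for row in rows]

fake_tokenizer = lambda text: [ord(c) for c in text]  # toy stand-in
ppo_rows = tokenize([{"prompt": "hi"}], fake_tokenizer)
```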
- A formatted Dataset containing a text column (typically the output of load()).
- The tokenizer used to encode the text. Must match the model being trained.
- Name of the column to tokenize. Defaults to "prompt".
Overriding the dataset via config.yaml
Train scripts read the dataset ID and subset from config.yaml and pass them to DatasetConfig, falling back to the loader's class-level CONFIG constant when the keys are absent. This means you can point any loader at a different dataset by adding two lines to config.yaml — no code change required.
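For example, an override might look like the following (the key names here are hypothetical; check the train script for the exact keys it reads):

```yaml
# Hypothetical config.yaml override; actual key names may differ.
dataset:
  id: openai/gsm8k
  subset: main
```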
Implementing a custom loader subclass
The following example creates an SFT loader for a new dataset. The pattern mirrors the built-in loaders in supervised_finetuning/loaders.py.
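A self-contained sketch of the pattern, with a minimal stub standing in for this project's BaseDatasetLoader so it runs on its own (the subclass name and raw column names are illustrative):

```python
# MinimalBase stands in for BaseDatasetLoader; only the format_example
# hook mirrors the real interface described above.
class MinimalBase:
    def format_example(self, example: dict) -> dict:
        raise NotImplementedError

class MyDatasetLoader(MinimalBase):
    def format_example(self, example: dict) -> dict:
        # Map the raw row into the single "text" column SFTTrainer expects.
        return {"text": f"Question: {example['question']}\nAnswer: {example['answer']}"}

row = MyDatasetLoader().format_example({"question": "2+2?", "answer": "4"})
```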
For a GRPO loader, have format_example return {"prompt": list[dict], "answer": str} instead of {"text": str}. Build the prompt message list directly in format_example rather than using PromptTemplate.render_user(), since GRPO loaders work with raw message dicts.