This repo trains across 16 HuggingFace datasets organised into 5 categories. Each dataset has a loader class with class-level defaults for the HF ID, config/subset, and split. Those defaults are used when the corresponding keys are absent from `config.yaml`. You can override any of them at runtime without changing the loader code:
- `dataset_id` — overrides the HuggingFace dataset ID
- `dataset_subset` — overrides the dataset config/subset
- `split` (`supervised_finetuning` only) or `dataset_split` (all other modules) — selects the HF split
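For example, the override keys can be combined in `config.yaml` like this (a sketch — only the three key names come from this page; the values shown are illustrative, not repo defaults):

```yaml
# Illustrative config.yaml overrides; omitted keys fall back to loader defaults
dataset_id: mandarjoshi/trivia_qa   # overrides the loader's HF dataset ID
dataset_subset: rc                  # overrides the config/subset
split: train                        # SFT pipelines use `split`
# dataset_split: train              # all other modules use `dataset_split` instead
```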
Summary
| # | Dataset | HF ID | Config | Split | Size | Module |
|---|---|---|---|---|---|---|
| 1 | ARC | allenai/ai2_arc | ARC-Challenge | train | ~1.1k | SFT |
| 2 | Earnings Calls | lamini/earnings-calls-qa | — | train | ~3.7k | SFT |
| 3 | FactScore | awinml/factscore_unlabelled_alpaca_13b_retrieval | — | train | 500 | SFT |
| 4 | PopQA | akariasai/PopQA | — | test | ~14k | SFT |
| 5 | TriviaQA | mandarjoshi/trivia_qa | rc | train | ~88k | SFT |
| 6 | HotpotQA | hotpotqa/hotpot_qa | distractor | train | ~90k | Multi-Hop QA |
| 7 | FreshQA | vtllms/sealqa | longseal | test | 264 | Multi-Hop QA |
| 8 | MuSiQue | dgslibisey/MuSiQue | — | train | ~19.9k | Multi-Hop QA |
| 9 | OpenR1-Math-220k | open-r1/OpenR1-Math-220k | default | train | ~220k | Math (Stage 1 SFT) |
| 10 | GSM8K | openai/gsm8k | main | train | 7.47k | Math (Stage 2 GRPO) |
| 11 | UltraFeedback | HuggingFaceH4/ultrafeedback_binarized | — | train_prefs | ~60k | Pref. Alignment |
| 12 | KTO Mix | trl-lib/kto-mix-14k | — | train | 14k | Pref. Alignment |
| 13 | WebGPT | openai/webgpt_comparisons | — | train | ~19k | Pref. Alignment |
| 14 | MedQA | bigbio/med_qa | med_qa_en_4options_bigbio_qa | train | ~10.2k | Medical QA |
| 15 | BioASQ | enelpol/rag-mini-bioasq | question-answer-passages | train | 4,010 | Medical QA |
| 16 | PubMedQA | qiaojin/PubMedQA | pqa_artificial | train | 211k | Medical QA |
Datasets by category
Supervised Fine-Tuning (5 datasets)
All five SFT datasets are loaded by `SFTDatasetLoader` subclasses in `supervised_finetuning/loaders.py`. The loader applies `apply_chat_template` and outputs a single `"text"` column consumed by `SFTTrainer(dataset_text_field="text")`. Override the split with the `split` key in `config.yaml` (not `dataset_split`).
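A dependency-free sketch of this flow (the real loaders call the tokenizer's `apply_chat_template`; the `<|role|>` markup below is a stand-in so the example runs without loading a model):

```python
# Sketch: how an SFT loader row becomes the single "text" column.
# The repo uses tokenizer.apply_chat_template; this stand-in template
# only illustrates the messages -> flat string step.
def to_text(example: dict) -> dict:
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # Stand-in for apply_chat_template(messages, tokenize=False)
    text = "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)
    return {"text": text}

row = {"question": "What causes tides?", "answer": "The Moon's gravity."}
print(to_text(row)["text"])
```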
ARC (AI2 Reasoning Challenge)
| Key | Value |
|---|---|
| HF ID | allenai/ai2_arc |
| Config | ARC-Challenge |
| Split | train |
| Size | ~1.1k |
Raw fields: `question` (str), `choices` (dict: `label` list[str], `text` list[str]), `answerKey` (str)

Preprocessing: `format_choices(labels, texts)` flattens the nested `choices` dict into `"A. text\nB. text\n..."` before template substitution.
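The flattening step can be sketched as follows (a minimal reimplementation of the `format_choices` behaviour described above, not the repo's code):

```python
# Minimal sketch of format_choices(labels, texts): zip the parallel
# label/text lists from ARC's nested choices dict into one string.
def format_choices(labels: list[str], texts: list[str]) -> str:
    return "\n".join(f"{label}. {text}" for label, text in zip(labels, texts))

# ARC stores choices as parallel lists inside a dict
choices = {"label": ["A", "B", "C"], "text": ["granite", "sandstone", "basalt"]}
print(format_choices(choices["label"], choices["text"]))
# A. granite
# B. sandstone
# C. basalt
```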
Earnings Calls
| Key | Value |
|---|---|
| HF ID | lamini/earnings-calls-qa |
| Config | — |
| Split | train |
| Size | ~3.7k |
Raw fields: `question` (str), `answer` (str), `transcript` (str), `ticker` (str), `company` (str), `date` (str), `q` (str)

Preprocessing: The `transcript` field is renamed to `context` to match the `{context}` placeholder in the user template.
FactScore (Biography Generation)
| Key | Value |
|---|---|
| HF ID | awinml/factscore_unlabelled_alpaca_13b_retrieval |
| Config | — |
| Split | train |
| Size | 500 |
Raw fields: `input` (str — bio prompt), `output` (str — biography), `topic` (str), `ctxs` (list[dict] — 25 Wikipedia passages)

Preprocessing: No `preprocess()` override needed. `input` maps to the user message; `output` maps to the assistant response. The `ctxs` passages are not injected into the prompt.
PopQA
| Key | Value |
|---|---|
| HF ID | akariasai/PopQA |
| Config | — |
| Split | test |
| Size | ~14k |
Raw fields: `question` (str), `possible_answers` (list[str]), `subj` (str), `prop` (str), `obj` (str)

Preprocessing: No `preprocess()` override needed. `possible_answers` is a list; `str()` conversion is applied when the value is used as a response target.
TriviaQA
| Key | Value |
|---|---|
| HF ID | mandarjoshi/trivia_qa |
| Config | rc |
| Split | train |
| Size | ~88k |
Raw fields: `question` (str), `answer` (dict: `value` str, `aliases` list[str]), `search_results` (dict: `search_context` list, …)

Preprocessing: `answer["value"]` is extracted as a flat string. `context` is built from `search_results["search_context"][0]` (first search result).
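The TriviaQA flattening described above can be sketched on a plain dict (illustrative, not the repo's loader code):

```python
# Sketch of the TriviaQA preprocessing: pull a flat answer string out of
# the nested answer dict, and take the first search result as context.
def flatten_trivia(example: dict) -> dict:
    return {
        "question": example["question"],
        "answer": example["answer"]["value"],                        # flat string target
        "context": example["search_results"]["search_context"][0],  # first search result
    }

row = {
    "question": "Which planet is known as the Red Planet?",
    "answer": {"value": "Mars", "aliases": ["Planet Mars"]},
    "search_results": {"search_context": ["Mars is the fourth planet from the Sun."]},
}
print(flatten_trivia(row))
```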
Multi-Hop QA (3 datasets)
Multi-Hop QA pipelines use GRPO + QLoRA training. The loader outputs a `"prompt"` column (list of message dicts) and an `"answer"` column (str). `GRPOTrainer` passes all dataset columns as `**kwargs` to reward functions. Override the split with `dataset_split` in `config.yaml`.
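Because the extra dataset columns arrive as keyword arguments, a reward function can read the `answer` column directly. A hedged sketch, assuming plain-text completions (the signature convention follows TRL's reward-function interface; the matching logic is illustrative):

```python
# Sketch of a GRPO reward function: TRL forwards dataset columns
# (here, "answer") as keyword arguments alongside the completions.
def exact_match_reward(completions, answer, **kwargs):
    # completions: list of generated texts; answer: list of gold answers
    return [1.0 if gold.lower() in text.lower() else 0.0
            for text, gold in zip(completions, answer)]

print(exact_match_reward(["The answer is Paris."], answer=["Paris"]))  # [1.0]
```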
HotpotQA
| Key | Value |
|---|---|
| HF ID | hotpotqa/hotpot_qa |
| Config | distractor |
| Split | train |
| Size | ~90k |
Raw fields: `question` (str), `answer` (str), `context` (dict), `supporting_facts` (dict), `type` (str), `level` (str)

Preprocessing: `question` maps to the prompt message list; `answer` maps to the answer column. Context is not injected into the prompt.

Uses the `distractor` config, not `fullwiki`.
FreshQA (via SealQA)
| Key | Value |
|---|---|
| HF ID | vtllms/sealqa |
| Config | longseal |
| Split | test |
| Size | 264 |
Raw fields: `question` (str), `answer` (str), `golds` (list[dict]), `12_docs`/`20_docs`/`30_docs`, `freshness`, `question_types`, `topic`

Preprocessing: `question` maps to the prompt message list; `answer` maps to the answer column. Document retrieval fields are not used.
MuSiQue
| Key | Value |
|---|---|
| HF ID | dgslibisey/MuSiQue |
| Config | — |
| Split | train |
| Size | ~19.9k (filtered) |
Raw fields: `id` (str), `question` (str), `answer` (str), `answer_aliases` (list[str]), `answerable` (bool), `paragraphs` (list[dict]), `question_decomposition` (list[dict])

Preprocessing: Filtered to rows where `answerable == True`. `question` maps to the prompt message list; `answer` maps to the answer column.
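With HF datasets this filter is typically expressed as `dataset.filter(lambda r: r["answerable"])`; a dependency-free sketch of the same predicate (example rows are invented):

```python
# Sketch of the MuSiQue filtering step: keep only answerable rows.
rows = [
    {"question": "q1", "answer": "a1", "answerable": True},
    {"question": "q2", "answer": "a2", "answerable": False},
]
filtered = [r for r in rows if r["answerable"]]
print(len(filtered))  # 1
```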
Math Reasoning (2 datasets)
The math reasoning pipeline is two-stage. Stage 1 uses SFT on OpenR1-Math-220k to build chain-of-thought capability. Stage 2 uses GRPO on GSM8K to refine numeric answer accuracy. Override splits with `dataset_split` in both cases (the Stage 1 SFT pipeline uses `dataset_split`, not `split`).
OpenR1-Math-220k (Stage 1 SFT)
| Key | Value |
|---|---|
| HF ID | open-r1/OpenR1-Math-220k |
| Config | default |
| Split | train |
| Size | ~220k |
Raw fields: `problem` (str), `solution` (str — chain-of-thought pre-formatted with `<reasoning>` and `<answer>` tags)

Preprocessing: None required. `problem` maps to the user message; `solution` maps to the assistant response via chat template. Output: single `"text"` column consumed by `SFTTrainer`.

Used by: `math_reasoning/sft/openr1_math/`. The trained checkpoint is consumed by Stage 2 (`math_reasoning/grpo/gsm8k/` with `config_qwen3.yaml`).
GSM8K (Stage 2 GRPO)
| Key | Value |
|---|---|
| HF ID | openai/gsm8k |
| Config | main |
| Split | train |
| Size | 7.47k |
Raw fields: `question` (str), `answer` (str — chain-of-thought ending with `#### {number}`)

Preprocessing: Numeric answer extracted via `re.search(r"####\s*(.+)", text)`. `question` maps to the prompt message list.
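The `####` marker extraction can be sketched as follows (the regex is the one quoted above; the wrapper function and fallback behaviour are illustrative):

```python
import re

# Pull the final numeric answer off a GSM8K chain-of-thought string,
# which ends with a "#### {number}" marker line.
def extract_gsm8k_answer(answer_text: str) -> str:
    match = re.search(r"####\s*(.+)", answer_text)
    return match.group(1).strip() if match else ""

cot = "Natalia sold 48 clips in April and half as many in May. 48 + 24 = 72\n#### 72"
print(extract_gsm8k_answer(cot))  # 72
```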
Preference Alignment (3 datasets)
Preference alignment pipelines (DPO, ORPO, KTO, PPO) consume pairwise or pointwise preference data. Override splits with `dataset_split` in `config.yaml`.
UltraFeedback Binarized
| Key | Value |
|---|---|
| HF ID | HuggingFaceH4/ultrafeedback_binarized |
| Config | — |
| Split | train_prefs |
| Size | ~60k |
Raw fields: `chosen` (list[dict] — role/content pairs), `rejected` (list[dict])

Preprocessing (DPO/ORPO): Extracts the user message from `chosen[0]`; applies `apply_chat_template` on chosen/rejected to produce `prompt`/`chosen`/`rejected` columns.

Preprocessing (PPO): Loaded raw; a `PointwiseRewardModel` (`OpenAssistant/reward-model-deberta-v3-large-v2`) scores completions at inference time.
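The DPO/ORPO step can be sketched on raw message lists (illustrative: the repo applies a chat template to the responses, while this sketch keeps raw contents):

```python
# Sketch of the UltraFeedback -> DPO preprocessing: the shared user prompt
# comes from chosen[0]; the last assistant turns become chosen/rejected.
def to_dpo_triple(example: dict) -> dict:
    return {
        "prompt": example["chosen"][0]["content"],    # user turn shared by both sides
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

row = {
    "chosen": [{"role": "user", "content": "Define entropy."},
               {"role": "assistant", "content": "A measure of disorder in a system."}],
    "rejected": [{"role": "user", "content": "Define entropy."},
                 {"role": "assistant", "content": "It's just heat."}],
}
print(to_dpo_triple(row))
```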
KTO Mix
| Key | Value |
|---|---|
| HF ID | trl-lib/kto-mix-14k |
| Config | — |
| Split | train |
| Size | 14k |
Raw fields: `prompt` (str), `completion` (str), `label` (bool)

Preprocessing: None required — the dataset is already in KTO format and is passed directly to `KTOTrainer`.
WebGPT Comparisons
| Key | Value |
|---|---|
| HF ID | openai/webgpt_comparisons |
| Config | — |
| Split | train |
| Size | ~19k |
Raw fields: `question` (dict: `full_text` str), `answer_0` (dict: `text` str), `answer_1` (dict: `text` str), `preference` (float)

Preprocessing: `preference > 0` → `answer_0` is chosen; `preference < 0` → `answer_1` is chosen. Rows where `preference == 0` are dropped.
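The sign-based mapping above can be sketched as (illustrative helper, following exactly the rule stated on this page):

```python
# Sketch of the WebGPT mapping: the sign of `preference` selects the
# chosen answer; ties (preference == 0) are dropped.
def to_pair(example: dict):
    if example["preference"] == 0:
        return None  # tie -> row dropped
    a0 = example["answer_0"]["text"]
    a1 = example["answer_1"]["text"]
    chosen, rejected = (a0, a1) if example["preference"] > 0 else (a1, a0)
    return {"prompt": example["question"]["full_text"],
            "chosen": chosen, "rejected": rejected}

row = {"question": {"full_text": "How do vaccines work?"},
       "answer_0": {"text": "They train the immune system to recognise a pathogen."},
       "answer_1": {"text": "They contain the live virus."},
       "preference": 1.0}
print(to_pair(row)["chosen"])
```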
Medical QA (3 datasets)
Medical QA pipelines use GRPO + QLoRA training. Output format is identical to Multi-Hop QA: a `"prompt"` column (list of message dicts) and an `"answer"` column (str). Override splits with `dataset_split` in `config.yaml`.
MedQA
| Key | Value |
|---|---|
| HF ID | bigbio/med_qa |
| Config | med_qa_en_4options_bigbio_qa |
| Split | train |
| Size | ~10.2k |
Raw fields: `question` (str), `choices` (dict: `key` list[str], `value` list[str]), `answer` (str)

Preprocessing: Choices are formatted as `"A. text\nB. text\n..."` and appended to the question. `answer` maps to the answer column.
BioASQ
| Key | Value |
|---|---|
| HF ID | enelpol/rag-mini-bioasq |
| Config | question-answer-passages |
| Split | train |
| Size | 4,010 |
Raw fields: `question` (str), `answer` (str), `id` (int), `relevant_passage_ids` (list[int])

Preprocessing: `question` maps to the prompt message list; `answer` maps to the answer column. `relevant_passage_ids` is not used.

`bigbio/bioasq_task_b` requires manual download from participants-area.bioasq.org. `enelpol/rag-mini-bioasq` is the publicly available substitute used here.
PubMedQA
| Key | Value |
|---|---|
| HF ID | qiaojin/PubMedQA |
| Config | pqa_artificial |
| Split | train |
| Size | 211k |
Raw fields: `pubid` (int), `question` (str), `context` (dict: `contexts` list[str], `labels` list[str], `meshes` list[str]), `long_answer` (str), `final_decision` (str)

Preprocessing: `question` maps to the prompt message list; `long_answer` maps to the answer column. `final_decision` and `context` passages are not used.

Overriding dataset defaults
Add any combination of these keys to `config.yaml`. Omitted keys fall back to the loader default.
- SFT pipelines
- GRPO pipelines
- Preference alignment
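The fallback rule can be sketched as follows (the class name and constructor are hypothetical, not the repo's actual loader API; the defaults shown come from the ARC row in the table above):

```python
# Sketch of the fallback: config keys override class-level loader defaults.
class ArcLoader:
    dataset_id = "allenai/ai2_arc"      # class-level defaults (from the summary table)
    dataset_subset = "ARC-Challenge"
    split = "train"

    def __init__(self, config: dict):
        # Each key is taken from config if present, else the class default.
        self.dataset_id = config.get("dataset_id", self.dataset_id)
        self.dataset_subset = config.get("dataset_subset", self.dataset_subset)
        self.split = config.get("split", self.split)

loader = ArcLoader({"split": "validation"})   # override only the split
print(loader.dataset_id, loader.split)        # allenai/ai2_arc validation
```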