This repo trains across 16 HuggingFace datasets organised into 5 categories. Each dataset has a loader class with class-level defaults for the HF ID, config/subset, and split. Those defaults are used when the corresponding keys are absent from `config.yaml`. You can override any of them at runtime without changing the loader code:
- `dataset_id` — overrides the HuggingFace dataset ID
- `dataset_subset` — overrides the dataset config/subset
- `split` (`supervised_finetuning` only) or `dataset_split` (all other modules) — selects the HF split
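For example, the override keys can be combined in `config.yaml` like this (a sketch — only the three key names come from this page; the values shown are illustrative, not repo defaults):

```yaml
# Illustrative config.yaml overrides; omitted keys fall back to loader defaults
dataset_id: mandarjoshi/trivia_qa   # overrides the loader's HF dataset ID
dataset_subset: rc                  # overrides the config/subset
split: train                        # SFT pipelines use `split`
# dataset_split: train              # all other modules use `dataset_split` instead
```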
Summary
| # | Dataset | HF ID | Config | Split | Size | Module |
|---|---|---|---|---|---|---|
| 1 | ARC | allenai/ai2_arc | ARC-Challenge | train | ~1.1k | SFT |
| 2 | Earnings Calls | lamini/earnings-calls-qa | — | train | ~3.7k | SFT |
| 3 | FactScore | awinml/factscore_unlabelled_alpaca_13b_retrieval | — | train | 500 | SFT |
| 4 | PopQA | akariasai/PopQA | — | test | ~14k | SFT |
| 5 | TriviaQA | mandarjoshi/trivia_qa | rc | train | ~88k | SFT |
| 6 | HotpotQA | hotpotqa/hotpot_qa | distractor | train | ~90k | Multi-Hop QA |
| 7 | FreshQA | vtllms/sealqa | longseal | test | 264 | Multi-Hop QA |
| 8 | MuSiQue | dgslibisey/MuSiQue | — | train | ~19.9k | Multi-Hop QA |
| 9 | OpenR1-Math-220k | open-r1/OpenR1-Math-220k | default | train | ~220k | Math (Stage 1 SFT) |
| 10 | GSM8K | openai/gsm8k | main | train | 7.47k | Math (Stage 2 GRPO) |
| 11 | UltraFeedback | HuggingFaceH4/ultrafeedback_binarized | — | train_prefs | ~60k | Pref. Alignment |
| 12 | KTO Mix | trl-lib/kto-mix-14k | — | train | 14k | Pref. Alignment |
| 13 | WebGPT | openai/webgpt_comparisons | — | train | ~19k | Pref. Alignment |
| 14 | MedQA | bigbio/med_qa | med_qa_en_4options_bigbio_qa | train | ~10.2k | Medical QA |
| 15 | BioASQ | enelpol/rag-mini-bioasq | question-answer-passages | train | 4,010 | Medical QA |
| 16 | PubMedQA | qiaojin/PubMedQA | pqa_artificial | train | 211k | Medical QA |
Datasets by category
Supervised Fine-Tuning (5 datasets)
All five SFT datasets are loaded by `SFTDatasetLoader` subclasses in `supervised_finetuning/loaders.py`. The loader applies `apply_chat_template` and outputs a single `"text"` column consumed by `SFTTrainer(dataset_text_field="text")`. Override the split with the `split` key in `config.yaml` (not `dataset_split`).
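A dependency-free sketch of this flow (the real loaders call the tokenizer's `apply_chat_template`; the `<|role|>` markup below is a stand-in so the example runs without loading a model):

```python
# Sketch: how an SFT loader row becomes the single "text" column.
# The repo uses tokenizer.apply_chat_template; this stand-in template
# only illustrates the messages -> flat string step.
def to_text(example: dict) -> dict:
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # Stand-in for apply_chat_template(messages, tokenize=False)
    text = "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)
    return {"text": text}

row = {"question": "What causes tides?", "answer": "The Moon's gravity."}
print(to_text(row)["text"])
```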
ARC (AI2 Reasoning Challenge)
| Key | Value |
|---|---|
| HF ID | allenai/ai2_arc |
| Config | ARC-Challenge |
| Split | train |
| Size | ~1.1k |
Raw fields: `question` (str), `choices` (dict: `label` list[str], `text` list[str]), `answerKey` (str)

Preprocessing: `format_choices(labels, texts)` flattens the nested `choices` dict into `"A. text\nB. text\n..."` before template substitution.
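The flattening step can be sketched as follows (a minimal reimplementation of the `format_choices` behaviour described above, not the repo's code):

```python
# Minimal sketch of format_choices(labels, texts): zip the parallel
# label/text lists from ARC's nested choices dict into one string.
def format_choices(labels: list[str], texts: list[str]) -> str:
    return "\n".join(f"{label}. {text}" for label, text in zip(labels, texts))

# ARC stores choices as parallel lists inside a dict
choices = {"label": ["A", "B", "C"], "text": ["granite", "sandstone", "basalt"]}
print(format_choices(choices["label"], choices["text"]))
# A. granite
# B. sandstone
# C. basalt
```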
Earnings Calls
| Key | Value |
|---|---|
| HF ID | lamini/earnings-calls-qa |
| Config | — |
| Split | train |
| Size | ~3.7k |
Raw fields: `question` (str), `answer` (str), `transcript` (str), `ticker` (str), `company` (str), `date` (str), `q` (str)

Preprocessing: The `transcript` field is renamed to `context` to match the `{context}` placeholder in the user template.
FactScore (Biography Generation)
| Key | Value |
|---|---|
| HF ID | awinml/factscore_unlabelled_alpaca_13b_retrieval |
| Config | — |
| Split | train |
| Size | 500 |
Raw fields: `input` (str — bio prompt), `output` (str — biography), `topic` (str), `ctxs` (list[dict] — 25 Wikipedia passages)

Preprocessing: No `preprocess()` override needed. `input` maps to the user message; `output` maps to the assistant response. The `ctxs` passages are not injected into the prompt.
PopQA
| Key | Value |
|---|---|
| HF ID | akariasai/PopQA |
| Config | — |
| Split | test |
| Size | ~14k |
Raw fields: `question` (str), `possible_answers` (list[str]), `subj` (str), `prop` (str), `obj` (str)

Preprocessing: No `preprocess()` override needed. `possible_answers` is a list; `str()` conversion is applied when the value is used as a response target.
TriviaQA
| Key | Value |
|---|---|
| HF ID | mandarjoshi/trivia_qa |
| Config | rc |
| Split | train |
| Size | ~88k |
Raw fields: `question` (str), `answer` (dict: `value` str, `aliases` list[str]), `search_results` (dict: `search_context` list, …)

Preprocessing: `answer["value"]` is extracted as a flat string. `context` is built from `search_results["search_context"][0]` (first search result).
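The TriviaQA flattening described above can be sketched on a plain dict (illustrative, not the repo's loader code):

```python
# Sketch of the TriviaQA preprocessing: pull a flat answer string out of
# the nested answer dict, and take the first search result as context.
def flatten_trivia(example: dict) -> dict:
    return {
        "question": example["question"],
        "answer": example["answer"]["value"],                        # flat string target
        "context": example["search_results"]["search_context"][0],  # first search result
    }

row = {
    "question": "Which planet is known as the Red Planet?",
    "answer": {"value": "Mars", "aliases": ["Planet Mars"]},
    "search_results": {"search_context": ["Mars is the fourth planet from the Sun."]},
}
print(flatten_trivia(row))
```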
Multi-Hop QA (3 datasets)
Multi-Hop QA pipelines use GRPO + QLoRA training. The loader outputs a `"prompt"` column (list of message dicts) and an `"answer"` column (str). `GRPOTrainer` passes all dataset columns as `**kwargs` to reward functions. Override the split with `dataset_split` in `config.yaml`.
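Because the extra dataset columns arrive as keyword arguments, a reward function can read the `answer` column directly. A hedged sketch, assuming plain-text completions (the signature convention follows TRL's reward-function interface; the matching logic is illustrative):

```python
# Sketch of a GRPO reward function: TRL forwards dataset columns
# (here, "answer") as keyword arguments alongside the completions.
def exact_match_reward(completions, answer, **kwargs):
    # completions: list of generated texts; answer: list of gold answers
    return [1.0 if gold.lower() in text.lower() else 0.0
            for text, gold in zip(completions, answer)]

print(exact_match_reward(["The answer is Paris."], answer=["Paris"]))  # [1.0]
```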
HotpotQA
| Key | Value |
|---|---|
| HF ID | hotpotqa/hotpot_qa |
| Config | distractor |
| Split | train |
| Size | ~90k |
Raw fields: `question` (str), `answer` (str), `context` (dict), `supporting_facts` (dict), `type` (str), `level` (str)

Preprocessing: `question` maps to the prompt message list; `answer` maps to the answer column. Context is not injected into the prompt.

Uses the `distractor` config, not `fullwiki`.
FreshQA (via SealQA)
| Key | Value |
|---|---|
| HF ID | vtllms/sealqa |
| Config | longseal |
| Split | test |
| Size | 264 |
Raw fields: `question` (str), `answer` (str), `golds` (list[dict]), `12_docs`/`20_docs`/`30_docs`, `freshness`, `question_types`, `topic`

Preprocessing: `question` maps to the prompt message list; `answer` maps to the answer column. Document retrieval fields are not used.
MuSiQue
| Key | Value |
|---|---|
| HF ID | dgslibisey/MuSiQue |
| Config | — |
| Split | train |
| Size | ~19.9k (filtered) |
Raw fields: `id` (str), `question` (str), `answer` (str), `answer_aliases` (list[str]), `answerable` (bool), `paragraphs` (list[dict]), `question_decomposition` (list[dict])

Preprocessing: Filtered to rows where `answerable == True`. `question` maps to the prompt message list; `answer` maps to the answer column.
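With HF datasets this filter is typically expressed as `dataset.filter(lambda r: r["answerable"])`; a dependency-free sketch of the same predicate (example rows are invented):

```python
# Sketch of the MuSiQue filtering step: keep only answerable rows.
rows = [
    {"question": "q1", "answer": "a1", "answerable": True},
    {"question": "q2", "answer": "a2", "answerable": False},
]
filtered = [r for r in rows if r["answerable"]]
print(len(filtered))  # 1
```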
Math Reasoning (2 datasets)
The math reasoning pipeline is two-stage. Stage 1 uses SFT on OpenR1-Math-220k to build chain-of-thought capability. Stage 2 uses GRPO on GSM8K to refine numeric answer accuracy. Override splits with `dataset_split` in both cases (the Stage 1 SFT pipeline uses `dataset_split`, not `split`).
OpenR1-Math-220k (Stage 1 SFT)
| Key | Value |
|---|---|
| HF ID | open-r1/OpenR1-Math-220k |
| Config | default |
| Split | train |
| Size | ~220k |
Raw fields: `problem` (str), `solution` (str — chain-of-thought pre-formatted with `<reasoning>` and `<answer>` tags)

Preprocessing: None required. `problem` maps to the user message; `solution` maps to the assistant response via chat template. Output: single `"text"` column consumed by `SFTTrainer`.

Used by: `math_reasoning/sft/openr1_math/`. The trained checkpoint is consumed by Stage 2 (`math_reasoning/grpo/gsm8k/` with `config_qwen3.yaml`).
GSM8K (Stage 2 GRPO)
| Key | Value |
|---|---|
| HF ID | openai/gsm8k |
| Config | main |
| Split | train |
| Size | 7.47k |
Raw fields: `question` (str), `answer` (str — chain-of-thought ending with `#### {number}`)

Preprocessing: Numeric answer extracted via `re.search(r"####\s*(.+)", text)`. `question` maps to the prompt message list.
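The `####` marker extraction can be sketched as follows (the regex is the one quoted above; the wrapper function and fallback behaviour are illustrative):

```python
import re

# Pull the final numeric answer off a GSM8K chain-of-thought string,
# which ends with a "#### {number}" marker line.
def extract_gsm8k_answer(answer_text: str) -> str:
    match = re.search(r"####\s*(.+)", answer_text)
    return match.group(1).strip() if match else ""

cot = "Natalia sold 48 clips in April and half as many in May. 48 + 24 = 72\n#### 72"
print(extract_gsm8k_answer(cot))  # 72
```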
Preference Alignment (3 datasets)
Preference alignment pipelines (DPO, ORPO, KTO, PPO) consume pairwise or pointwise preference data. Override splits with `dataset_split` in `config.yaml`.
UltraFeedback Binarized
| Key | Value |
|---|---|
| HF ID | HuggingFaceH4/ultrafeedback_binarized |
| Config | — |
| Split | train_prefs |
| Size | ~60k |
Raw fields: `chosen` (list[dict] — role/content pairs), `rejected` (list[dict])

Preprocessing (DPO/ORPO): Extracts the user message from `chosen[0]`; applies `apply_chat_template` on chosen/rejected to produce `prompt`/`chosen`/`rejected` columns.

Preprocessing (PPO): Loaded raw; a `PointwiseRewardModel` (`OpenAssistant/reward-model-deberta-v3-large-v2`) scores completions at inference time.
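The DPO/ORPO step can be sketched on raw message lists (illustrative: the repo applies a chat template to the responses, while this sketch keeps raw contents):

```python
# Sketch of the UltraFeedback -> DPO preprocessing: the shared user prompt
# comes from chosen[0]; the last assistant turns become chosen/rejected.
def to_dpo_triple(example: dict) -> dict:
    return {
        "prompt": example["chosen"][0]["content"],    # user turn shared by both sides
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

row = {
    "chosen": [{"role": "user", "content": "Define entropy."},
               {"role": "assistant", "content": "A measure of disorder in a system."}],
    "rejected": [{"role": "user", "content": "Define entropy."},
                 {"role": "assistant", "content": "It's just heat."}],
}
print(to_dpo_triple(row))
```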
KTO Mix
| Key | Value |
|---|---|
| HF ID | trl-lib/kto-mix-14k |
| Config | — |
| Split | train |
| Size | 14k |
Raw fields: `prompt` (str), `completion` (str), `label` (bool)

Preprocessing: None required — the dataset is already in KTO format and is passed directly to `KTOTrainer`.
WebGPT Comparisons
| Key | Value |
|---|---|
| HF ID | openai/webgpt_comparisons |
| Config | — |
| Split | train |
| Size | ~19k |
Raw fields: `question` (dict: `full_text` str), `answer_0` (dict: `text` str), `answer_1` (dict: `text` str), `preference` (float)

Preprocessing: `preference > 0` → `answer_0` is chosen; `preference < 0` → `answer_1` is chosen. Rows where `preference == 0` are dropped.
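The sign-based mapping above can be sketched as (illustrative helper, following exactly the rule stated on this page):

```python
# Sketch of the WebGPT mapping: the sign of `preference` selects the
# chosen answer; ties (preference == 0) are dropped.
def to_pair(example: dict):
    if example["preference"] == 0:
        return None  # tie -> row dropped
    a0 = example["answer_0"]["text"]
    a1 = example["answer_1"]["text"]
    chosen, rejected = (a0, a1) if example["preference"] > 0 else (a1, a0)
    return {"prompt": example["question"]["full_text"],
            "chosen": chosen, "rejected": rejected}

row = {"question": {"full_text": "How do vaccines work?"},
       "answer_0": {"text": "They train the immune system to recognise a pathogen."},
       "answer_1": {"text": "They contain the live virus."},
       "preference": 1.0}
print(to_pair(row)["chosen"])
```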
Medical QA (3 datasets)
Medical QA pipelines use GRPO + QLoRA training. Output format is identical to Multi-Hop QA: a `"prompt"` column (list of message dicts) and an `"answer"` column (str). Override splits with `dataset_split` in `config.yaml`.
MedQA
| Key | Value |
|---|---|
| HF ID | bigbio/med_qa |
| Config | med_qa_en_4options_bigbio_qa |
| Split | train |
| Size | ~10.2k |
Raw fields: `question` (str), `choices` (dict: `key` list[str], `value` list[str]), `answer` (str)

Preprocessing: Choices are formatted as `"A. text\nB. text\n..."` and appended to the question. `answer` maps to the answer column.
BioASQ
| Key | Value |
|---|---|
| HF ID | enelpol/rag-mini-bioasq |
| Config | question-answer-passages |
| Split | train |
| Size | 4,010 |
Raw fields: `question` (str), `answer` (str), `id` (int), `relevant_passage_ids` (list[int])

Preprocessing: `question` maps to the prompt message list; `answer` maps to the answer column. `relevant_passage_ids` is not used.

`bigbio/bioasq_task_b` requires manual download from participants-area.bioasq.org. `enelpol/rag-mini-bioasq` is the publicly available substitute used here.
PubMedQA
| Key | Value |
|---|---|
| HF ID | qiaojin/PubMedQA |
| Config | pqa_artificial |
| Split | train |
| Size | 211k |
Raw fields: `pubid` (int), `question` (str), `context` (dict: `contexts` list[str], `labels` list[str], `meshes` list[str]), `long_answer` (str), `final_decision` (str)

Preprocessing: `question` maps to the prompt message list; `long_answer` maps to the answer column. `final_decision` and `context` passages are not used.

Overriding dataset defaults
Add any combination of these keys to `config.yaml`. Omitted keys fall back to the loader default.
- SFT pipelines
- GRPO pipelines
- Preference alignment
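The fallback rule can be sketched as follows (the class name and constructor are hypothetical, not the repo's actual loader API; the defaults shown come from the ARC row in the table above):

```python
# Sketch of the fallback: config keys override class-level loader defaults.
class ArcLoader:
    dataset_id = "allenai/ai2_arc"      # class-level defaults (from the summary table)
    dataset_subset = "ARC-Challenge"
    split = "train"

    def __init__(self, config: dict):
        # Each key is taken from config if present, else the class default.
        self.dataset_id = config.get("dataset_id", self.dataset_id)
        self.dataset_subset = config.get("dataset_subset", self.dataset_subset)
        self.split = config.get("split", self.split)

loader = ArcLoader({"split": "validation"})   # override only the split
print(loader.dataset_id, loader.split)        # allenai/ai2_arc validation
```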