
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/avnlp/llm-finetuning/llms.txt

Use this file to discover all available pages before exploring further.

This repo fine-tunes models on 16 HuggingFace datasets organised into 5 categories. Each dataset has a loader class with class-level defaults for the HF ID, config/subset, and split. Those defaults are used when the corresponding keys are absent from config.yaml. You can override any of them at runtime without changing the loader code:
  • dataset_id — overrides the HuggingFace dataset ID
  • dataset_subset — overrides the dataset config/subset
  • split (supervised_finetuning only) or dataset_split (all other modules) — selects the HF split
Omitting any key preserves the loader’s default.
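Under those rules, a loader's fallback logic might look like the following sketch (the class and attribute names here are illustrative, not the repo's actual identifiers):

```python
# Sketch of how a loader could resolve config.yaml overrides against
# class-level defaults. ArcLoader / DATASET_ID etc. are hypothetical names.

class ArcLoader:
    # Class-level defaults, used when config.yaml omits the key
    DATASET_ID = "allenai/ai2_arc"
    DATASET_SUBSET = "ARC-Challenge"
    SPLIT = "train"

    def __init__(self, config: dict):
        # Each key falls back to the loader default when absent
        self.dataset_id = config.get("dataset_id", self.DATASET_ID)
        self.dataset_subset = config.get("dataset_subset", self.DATASET_SUBSET)
        self.split = config.get("split", self.SPLIT)

# Overriding only the subset keeps the other defaults
loader = ArcLoader({"dataset_subset": "ARC-Easy"})
print(loader.dataset_id, loader.dataset_subset, loader.split)
```

Overriding only `dataset_subset` leaves the ID and split at their class-level values.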

Summary

| # | Dataset | HF ID | Config | Split | Size | Module |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | ARC | allenai/ai2_arc | ARC-Challenge | train | ~1.1k | SFT |
| 2 | Earnings Calls | lamini/earnings-calls-qa | (none) | train | ~3.7k | SFT |
| 3 | FactScore | awinml/factscore_unlabelled_alpaca_13b_retrieval | (none) | train | 500 | SFT |
| 4 | PopQA | akariasai/PopQA | (none) | test | ~14k | SFT |
| 5 | TriviaQA | mandarjoshi/trivia_qa | rc | train | ~88k | SFT |
| 6 | HotpotQA | hotpotqa/hotpot_qa | distractor | train | ~90k | Multi-Hop QA |
| 7 | FreshQA | vtllms/sealqa | longseal | test | 264 | Multi-Hop QA |
| 8 | MuSiQue | dgslibisey/MuSiQue | (none) | train | ~19.9k | Multi-Hop QA |
| 9 | OpenR1-Math-220k | open-r1/OpenR1-Math-220k | default | train | ~220k | Math (Stage 1 SFT) |
| 10 | GSM8K | openai/gsm8k | main | train | 7.47k | Math (Stage 2 GRPO) |
| 11 | UltraFeedback | HuggingFaceH4/ultrafeedback_binarized | (none) | train_prefs | ~60k | Pref. Alignment |
| 12 | KTO Mix | trl-lib/kto-mix-14k | (none) | train | 14k | Pref. Alignment |
| 13 | WebGPT | openai/webgpt_comparisons | (none) | train | ~19k | Pref. Alignment |
| 14 | MedQA | bigbio/med_qa | med_qa_en_4options_bigbio_qa | train | ~10.2k | Medical QA |
| 15 | BioASQ | enelpol/rag-mini-bioasq | question-answer-passages | train | 4,010 | Medical QA |
| 16 | PubMedQA | qiaojin/PubMedQA | pqa_artificial | train | 211k | Medical QA |

Datasets by category

Supervised Fine-Tuning (SFT)

All five SFT datasets are loaded by SFTDatasetLoader subclasses in supervised_finetuning/loaders.py. The loader applies apply_chat_template and outputs a single "text" column consumed by SFTTrainer(dataset_text_field="text"). Override the split with the split key in config.yaml (not dataset_split).
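The shape of that output can be sketched as follows. A real loader would call the tokenizer's apply_chat_template; the minimal render_chat template below is a stand-in so the example is self-contained:

```python
# Illustrative sketch of the SFT loader output: each row becomes one
# "text" string rendered from chat messages. render_chat is a stand-in
# for tokenizer.apply_chat_template(messages, tokenize=False).

def render_chat(messages: list) -> str:
    return "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)

row = {"question": "What is 2 + 2?", "answer": "4"}
messages = [
    {"role": "user", "content": row["question"]},
    {"role": "assistant", "content": row["answer"]},
]
example = {"text": render_chat(messages)}
# SFTTrainer(dataset_text_field="text") then consumes this column directly.
```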
ARC

| Key | Value |
| --- | --- |
| HF ID | allenai/ai2_arc |
| Config | ARC-Challenge |
| Split | train |
| Size | ~1.1k |

Raw fields: question (str), choices (dict: label list[str], text list[str]), answerKey (str)

Preprocessing: format_choices(labels, texts) flattens the nested choices dict into "A. text\nB. text\n..." before template substitution.
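A hypothetical re-implementation of that flattening step (the function body is an assumption; only the name and behaviour are described above):

```python
# Flattens ARC's parallel label/text lists into one lettered choice block,
# as described for format_choices. Body is illustrative, not the repo's code.

def format_choices(labels: list, texts: list) -> str:
    return "\n".join(f"{label}. {text}" for label, text in zip(labels, texts))

choices = {"label": ["A", "B", "C"], "text": ["red", "green", "blue"]}
formatted = format_choices(choices["label"], choices["text"])
# formatted == "A. red\nB. green\nC. blue"
```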
Earnings Calls

| Key | Value |
| --- | --- |
| HF ID | lamini/earnings-calls-qa |
| Config | (none) |
| Split | train |
| Size | ~3.7k |

Raw fields: question (str), answer (str), transcript (str), ticker (str), company (str), date (str), q (str)

Preprocessing: The transcript field is renamed to context to match the {context} placeholder in the user template.
FactScore

| Key | Value |
| --- | --- |
| HF ID | awinml/factscore_unlabelled_alpaca_13b_retrieval |
| Config | (none) |
| Split | train |
| Size | 500 |

Raw fields: input (str — bio prompt), output (str — biography), topic (str), ctxs (list[dict] — 25 Wikipedia passages)

Preprocessing: No preprocess() override needed. input maps to the user message; output maps to the assistant response. The ctxs passages are not injected into the prompt.
PopQA

| Key | Value |
| --- | --- |
| HF ID | akariasai/PopQA |
| Config | (none) |
| Split | test |
| Size | ~14k |

Raw fields: question (str), possible_answers (list[str]), subj (str), prop (str), obj (str)

Preprocessing: No preprocess() override needed. possible_answers is a list; str() conversion is applied when the value is used as a response target.
TriviaQA

| Key | Value |
| --- | --- |
| HF ID | mandarjoshi/trivia_qa |
| Config | rc |
| Split | train |
| Size | ~88k |

Raw fields: question (str), answer (dict: value str, aliases list[str]), search_results (dict: search_context list, …)

Preprocessing: answer["value"] is extracted as a flat string. context is built from search_results["search_context"][0] (first search result).
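That flattening step, sketched on a toy row (the row contents are made up for illustration):

```python
# Sketch of the TriviaQA flattening: answer["value"] becomes the flat
# target string, and the first search result becomes the context.

row = {
    "question": "Who wrote Dracula?",
    "answer": {"value": "Bram Stoker", "aliases": ["Abraham Stoker"]},
    "search_results": {"search_context": ["Dracula is an 1897 novel..."]},
}

flat = {
    "question": row["question"],
    "answer": row["answer"]["value"],                        # flat string target
    "context": row["search_results"]["search_context"][0],   # first result only
}
```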
Multi-Hop QA

Multi-Hop QA pipelines use GRPO + QLoRA training. The loader outputs a "prompt" column (list of message dicts) and an "answer" column (str). GRPOTrainer passes all dataset columns as **kwargs to reward functions. Override the split with dataset_split in config.yaml.
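Because extra dataset columns arrive as keyword arguments, a reward function can consume the "answer" column directly. A minimal sketch, assuming an exact-match reward (the reward logic is illustrative, not the repo's):

```python
# Sketch of a GRPO reward function. GRPOTrainer forwards extra dataset
# columns (here "answer") to reward functions as keyword arguments.

def exact_match_reward(completions, answer, **kwargs):
    # completions: one list of message dicts per sampled completion
    # answer: the "answer" column value for each prompt in the batch
    rewards = []
    for completion, gold in zip(completions, answer):
        text = completion[0]["content"]
        rewards.append(1.0 if gold.lower() in text.lower() else 0.0)
    return rewards

completions = [[{"role": "assistant", "content": "The answer is Paris."}]]
print(exact_match_reward(completions, answer=["Paris"]))  # [1.0]
```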
HotpotQA

| Key | Value |
| --- | --- |
| HF ID | hotpotqa/hotpot_qa |
| Config | distractor |
| Split | train |
| Size | ~90k |

Raw fields: question (str), answer (str), context (dict), supporting_facts (dict), type (str), level (str)

Preprocessing: question maps to the prompt message list; answer maps to the answer column. Context is not injected into the prompt.

Uses the distractor config, not fullwiki.
FreshQA

| Key | Value |
| --- | --- |
| HF ID | vtllms/sealqa |
| Config | longseal |
| Split | test |
| Size | 264 |

Raw fields: question (str), answer (str), golds (list[dict]), 12_docs/20_docs/30_docs, freshness, question_types, topic

Preprocessing: question maps to the prompt message list; answer maps to the answer column. Document retrieval fields are not used.

Only a test split exists — all 264 rows are used for training.
MuSiQue

| Key | Value |
| --- | --- |
| HF ID | dgslibisey/MuSiQue |
| Config | (none) |
| Split | train |
| Size | ~19.9k (filtered) |

Raw fields: id (str), question (str), answer (str), answer_aliases (list[str]), answerable (bool), paragraphs (list[dict]), question_decomposition (list[dict])

Preprocessing: Filtered to rows where answerable == True. question maps to the prompt message list; answer maps to the answer column.
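The answerable filter, shown over toy rows in plain Python (the repo presumably applies the same predicate via datasets.Dataset.filter on the real split):

```python
# Sketch of the MuSiQue answerable filter: unanswerable rows are dropped
# before training. Toy rows, not real dataset content.

rows = [
    {"question": "q1", "answer": "a1", "answerable": True},
    {"question": "q2", "answer": "", "answerable": False},
]
kept = [r for r in rows if r["answerable"]]
# Only rows with answerable == True survive.
```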
Math Reasoning

The math reasoning pipeline is two-stage. Stage 1 uses SFT on OpenR1-Math-220k to build chain-of-thought capability. Stage 2 uses GRPO on GSM8K to refine numeric answer accuracy. Override splits with dataset_split in both cases (the Stage 1 SFT pipeline uses dataset_split, not split).
OpenR1-Math-220k

| Key | Value |
| --- | --- |
| HF ID | open-r1/OpenR1-Math-220k |
| Config | default |
| Split | train |
| Size | ~220k |

Raw fields: problem (str), solution (str — chain-of-thought pre-formatted with <reasoning> and <answer> tags)

Preprocessing: None required. problem maps to the user message; solution maps to the assistant response via chat template. Output: single "text" column consumed by SFTTrainer.

Used by: math_reasoning/sft/openr1_math/. The trained checkpoint is consumed by Stage 2 (math_reasoning/grpo/gsm8k/ with config_qwen3.yaml).
GSM8K

| Key | Value |
| --- | --- |
| HF ID | openai/gsm8k |
| Config | main |
| Split | train |
| Size | 7.47k |

Raw fields: question (str), answer (str — chain-of-thought ending with #### {number})

Preprocessing: Numeric answer extracted via re.search(r"####\s*(.+)", text). question maps to the prompt message list.
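The extraction regex in use, wrapped in a small helper (the helper name is illustrative; the pattern is the one quoted above):

```python
import re

# GSM8K gold answers end with "#### <number>"; the numeric target is
# whatever follows the marker.

def extract_final_answer(answer):
    match = re.search(r"####\s*(.+)", answer)
    return match.group(1).strip() if match else None

cot = "48 / 2 = 24 clips in May.\n48 + 24 = 72\n#### 72"
print(extract_final_answer(cot))  # 72
```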
Preference Alignment

Preference alignment pipelines (DPO, ORPO, KTO, PPO) consume pairwise or pointwise preference data. Override splits with dataset_split in config.yaml.
UltraFeedback

| Key | Value |
| --- | --- |
| HF ID | HuggingFaceH4/ultrafeedback_binarized |
| Config | (none) |
| Split | train_prefs |
| Size | ~60k |

Raw fields: chosen (list[dict] — role/content pairs), rejected (list[dict])

Preprocessing (DPO/ORPO): Extracts the user message from chosen[0]; applies apply_chat_template on chosen/rejected to produce prompt/chosen/rejected columns.

Preprocessing (PPO): Loaded raw; a PointwiseRewardModel (OpenAssistant/reward-model-deberta-v3-large-v2) scores completions at inference time.
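The DPO/ORPO column mapping, sketched on a toy row. The row contents are made up, and plain string extraction stands in for the apply_chat_template rendering the real loader performs:

```python
# Sketch of the DPO/ORPO preprocessing: the shared user turn comes from
# chosen[0], and the last assistant turns become the chosen/rejected targets.

row = {
    "chosen": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7 is prime."},
    ],
    "rejected": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "9 is prime."},
    ],
}

triplet = {
    "prompt": row["chosen"][0]["content"],
    "chosen": row["chosen"][-1]["content"],
    "rejected": row["rejected"][-1]["content"],
}
```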
KTO Mix

| Key | Value |
| --- | --- |
| HF ID | trl-lib/kto-mix-14k |
| Config | (none) |
| Split | train |
| Size | 14k |

Raw fields: prompt (str), completion (str), label (bool)

Preprocessing: None required — the dataset is already in KTO format and is passed directly to KTOTrainer.
WebGPT

| Key | Value |
| --- | --- |
| HF ID | openai/webgpt_comparisons |
| Config | (none) |
| Split | train |
| Size | ~19k |

Raw fields: question (dict: full_text str), answer_0 (dict: text str), answer_1 (dict: text str), preference (float)

Preprocessing: preference > 0 → answer_0 is chosen; preference < 0 → answer_1 is chosen. Rows where preference == 0 are dropped.
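That sign-based mapping, as a small sketch (to_pair and the row contents are illustrative):

```python
# Sketch of the WebGPT preference mapping: the sign of `preference`
# decides which answer is chosen; ties (preference == 0) are dropped.

def to_pair(row):
    if row["preference"] == 0:
        return None  # tie: row is dropped
    chosen, rejected = (
        (row["answer_0"], row["answer_1"])
        if row["preference"] > 0
        else (row["answer_1"], row["answer_0"])
    )
    return {
        "prompt": row["question"]["full_text"],
        "chosen": chosen["text"],
        "rejected": rejected["text"],
    }

row = {
    "question": {"full_text": "How do vaccines work?"},
    "answer_0": {"text": "They train the immune system."},
    "answer_1": {"text": "They cure infections directly."},
    "preference": 1.0,
}
pair = to_pair(row)
```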
Medical QA

Medical QA pipelines use GRPO + QLoRA training. Output format is identical to Multi-Hop QA: "prompt" column (list of message dicts) and "answer" column (str). Override splits with dataset_split in config.yaml.
MedQA

| Key | Value |
| --- | --- |
| HF ID | bigbio/med_qa |
| Config | med_qa_en_4options_bigbio_qa |
| Split | train |
| Size | ~10.2k |

Raw fields: question (str), choices (dict: key list[str], value list[str]), answer (str)

Preprocessing: Choices are formatted as "A. text\nB. text\n..." and appended to the question. answer maps to the answer column.
BioASQ

| Key | Value |
| --- | --- |
| HF ID | enelpol/rag-mini-bioasq |
| Config | question-answer-passages |
| Split | train |
| Size | 4,010 |

Raw fields: question (str), answer (str), id (int), relevant_passage_ids (list[int])

Preprocessing: question maps to the prompt message list; answer maps to the answer column. relevant_passage_ids is not used.

bigbio/bioasq_task_b requires manual download from participants-area.bioasq.org. enelpol/rag-mini-bioasq is the publicly available substitute used here.
PubMedQA

| Key | Value |
| --- | --- |
| HF ID | qiaojin/PubMedQA |
| Config | pqa_artificial |
| Split | train |
| Size | 211k |

Raw fields: pubid (int), question (str), context (dict: contexts list[str], labels list[str], meshes list[str]), long_answer (str), final_decision (str)

Preprocessing: question maps to the prompt message list; long_answer maps to the answer column. final_decision and context passages are not used.

Overriding dataset defaults

Add any combination of these keys to config.yaml. Omitted keys fall back to the loader default.
dataset_id: "allenai/ai2_arc"
dataset_subset: "ARC-Easy"
split: "train"
