LLM Fine-tuning is a collection of 39 self-contained training pipelines covering three paradigms: Supervised Fine-Tuning (SFT) with adapter methods, Reinforcement Learning via Group Relative Policy Optimization (GRPO), and Preference Alignment. The pipelines span 16 datasets across math reasoning, multi-hop question answering, medical question answering, and general QA domains. Each pipeline is built on HuggingFace TRL, PEFT, and Unsloth, with reward functions powered by DeepEval and Evidently AI.

Documentation Index
Fetch the complete documentation index at: https://mintlify.com/avnlp/llm-finetuning/llms.txt
Use this file to discover all available pages before exploring further.
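For example, a minimal way to pull the index programmatically (a sketch using the requests library; any HTTP client works):

```python
import requests

# Download the documentation index; each line points to a docs page.
resp = requests.get("https://mintlify.com/avnlp/llm-finetuning/llms.txt", timeout=30)
resp.raise_for_status()
print(resp.text)
```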
Supervised fine-tuning
Five adapter techniques (LoRA, QLoRA, DoRA, P-Tuning, Prefix-Tuning) across five QA datasets.
GRPO math reasoning
Group Relative Policy Optimization on GSM8K with correctness and format reward functions.
Multi-hop QA
GRPO fine-tuning on HotpotQA, FreshQA, and MuSiQue with eight reward functions.
Medical QA
GRPO fine-tuning on MedQA, BioASQ, and PubMedQA with LLM-as-a-Judge evaluation.
Preference alignment
DPO, ORPO, KTO, and PPO alignment algorithms using QLoRA via Unsloth and TRL.
Project structure
All training code lives under src/llm_finetuning/, organized by paradigm and then by technique and dataset.
Pipeline anatomy
Every pipeline directory shares the same three-file layout. train.py reads all hyperparameters from the co-located config.yaml, so you can swap models, datasets, or training settings without touching source code.
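As a sketch of that pattern (the config keys shown here are assumptions, not the repo's actual schema), a train.py might look like:

```python
import yaml
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Every hyperparameter comes from the co-located config file.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

dataset = load_dataset(cfg["dataset_name"], split="train")

trainer = SFTTrainer(
    model=cfg["model_name"],  # e.g. "meta-llama/Llama-3.2-3B"
    train_dataset=dataset,
    args=SFTConfig(
        output_dir=cfg["output_dir"],
        learning_rate=cfg["learning_rate"],
        per_device_train_batch_size=cfg["batch_size"],
    ),
)
trainer.train()
```

Swapping a model or dataset then means editing config.yaml only.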
Training paradigms
Supervised fine-tuning
Supervised Fine-Tuning trains Llama-3.2-3B on five QA datasets using adapter-based methods. The base model weights remain frozen; only the adapter parameters are updated. All pipelines use SFTTrainer from TRL. A configuration sketch follows the two tables below.
| Technique | Description | Key parameter |
|---|---|---|
| LoRA | Low-rank weight updates applied to attention and feed-forward projection matrices | Rank 8, alpha 32 |
| QLoRA | LoRA with 4-bit NF4 quantization of the base model via BitsAndBytes | Rank 8, alpha 32 |
| DoRA | Weight-Decomposed LoRA — decomposes weights into magnitude and direction, applies LoRA to the directional component | use_dora=True |
| P-Tuning | Trains a small encoder network to produce continuous prompt embeddings prepended to the input | Virtual tokens configurable |
| Prefix-Tuning | Prepends trainable prefix vectors to the key and value tensors of every attention layer | Virtual tokens configurable |
| Dataset | Domain | Description |
|---|---|---|
| ARC | Science QA | AI2 Reasoning Challenge — grade-school multiple-choice science questions |
| TriviaQA | Open-domain QA | Trivia questions with evidence documents from Wikipedia and the web |
| FactScore | Factual QA | Atomic fact verification dataset for hallucination detection |
| PopQA | Entity QA | Factoid questions about popular entities from Wikipedia |
| Earnings Calls | Financial QA | Question answering over earnings call transcripts from 2,800+ companies |
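For reference, a hedged PEFT setup matching the key parameters in the technique table (rank 8, alpha 32; the target module names are assumptions for Llama-style architectures):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA on attention and feed-forward projections, rank 8 / alpha 32.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# DoRA reuses the same config with weight decomposition enabled.
dora_config = LoraConfig(r=8, lora_alpha=32, use_dora=True, task_type="CAUSAL_LM")

# QLoRA adds 4-bit NF4 quantization of the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```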
GRPO math reasoning
GRPO trains language models to generate step-by-step reasoning for grade-school math word problems. Five models are trained on GSM8K (7,473 problems) using one correctness reward and four format rewards. An optional two-stage pipeline primes Qwen3-4B-Base on OpenR1-Math-220k before applying GRPO.

Models: Phi-4, Mistral-7B, Llama-3.2-3B, Llama-3.1-8B, Gemma3-1B

A sketch of the correctness reward appears after the table.

| Category | Reward function | Description |
|---|---|---|
| Correctness | AnswerCorrectnessReward | Extracts numeric value from <answer> tags and compares to ground truth |
| Format | ReasoningTagsReward | Validates presence and proper nesting of <reasoning> and <answer> tags |
| Format | StepFormatReward | Rewards numbered or bulleted step-by-step structure (minimum 3 steps) |
| Format | MultilineComplianceReward | Rewards multi-line responses with sufficient depth (minimum 5 lines) |
| Format | ResponseStructureReward | Validates that both reasoning and answer blocks are present |
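A minimal sketch of that correctness reward, assuming TRL's reward-function convention (completions in, one float per sample out), plain-string completions, and a ground-truth dataset column named answer (all assumptions, not the repo's exact implementation):

```python
import re

def answer_correctness_reward(completions, answer, **kwargs):
    """Reward 1.0 when the number inside <answer> tags matches the ground truth."""
    rewards = []
    for completion, gold in zip(completions, answer):
        match = re.search(r"<answer>\s*([-+]?[\d,]*\.?\d+)\s*</answer>", completion)
        if match is None:
            rewards.append(0.0)  # no parseable answer tag
            continue
        predicted = match.group(1).replace(",", "")
        rewards.append(1.0 if predicted == str(gold).strip() else 0.0)
    return rewards
```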
Multi-hop question answering
Fine-tunes Llama-3.2-3B using QLoRA and GRPO on three multi-hop reasoning datasets. Eight reward functions enforce both answer quality and structured reasoning format.

Medical question answering
Fine-tunes Llama-3.2-3B using QLoRA and GRPO on three biomedical QA datasets. It reuses the same eight reward functions as multi-hop QA, with LLM-as-a-Judge evaluation tailored for biomedical reasoning.

Preference alignment
Fine-tunes language models using human preference signals. All methods use QLoRA (4-bit quantization) via Unsloth and TRL trainers. A DPO sketch follows the table.

| Algorithm | Description | Datasets |
|---|---|---|
| DPO | Directly optimizes on {prompt, chosen, rejected} triplets without a separate reward model | UltraFeedback, WebGPT Comparisons |
| ORPO | Combines SFT and preference alignment in a single training step via odds ratio penalty | UltraFeedback |
| KTO | Trains on binary desirability labels using prospect theory-inspired loss, without paired data | KTO-Mix-14k |
| PPO | Uses a reward model to score rollouts with policy gradient updates and a KL penalty | UltraFeedback, WebGPT Comparisons |
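A condensed DPO sketch under those assumptions; the Unsloth checkpoint name, dataset slug, and hyperparameters below are illustrative, not the repo's exact values:

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from unsloth import FastLanguageModel

# Load a 4-bit (QLoRA) base model and attach LoRA adapters via Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/llama-3-8b-bnb-4bit", max_seq_length=2048, load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(
    model, r=8, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# UltraFeedback-style rows supply {prompt, chosen, rejected} triplets.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", per_device_train_batch_size=1),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Because the policy is a PEFT adapter, no separate frozen reference model is loaded; TRL recovers the reference by disabling the adapters.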
Core dependencies
| Library | Role |
|---|---|
| HuggingFace TRL | SFTTrainer, GRPOTrainer, DPOTrainer, PPOTrainer, and config classes for all paradigms |
| PEFT | LoRA, QLoRA, DoRA, P-Tuning, and Prefix-Tuning adapter implementations |
| Unsloth | Memory-efficient model loading and GRPO/alignment training via FastLanguageModel |
| DeepEval | LLM-as-a-Judge reward functions: GEval RAG, Summarization, Answer Relevancy |
| Evidently AI | CorrectnessLLMEval reward metric for multi-hop and medical QA pipelines |
GPU memory requirements
Reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps if you run out of VRAM; a config sketch follows the table.

| Technique | Typical VRAM |
|---|---|
| SFT LoRA (3B model) | 8–12 GB |
| SFT QLoRA (3B model) | 6–8 GB |
| GRPO QLoRA (3B model) | 12–16 GB |
| DPO QLoRA (7B model) | 16–24 GB |
| ORPO QLoRA (7B model) | 16–24 GB |
| KTO QLoRA (1.5B model) | 8–12 GB |
| PPO (8B model) | 40+ GB |
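Expressed as a sketch (illustrative values; the SFT case is shown, but the same two TrainingArguments-style fields exist on the other TRL configs), this keeps the effective batch size constant while cutting per-step activation memory:

```python
from trl import SFTConfig

# Effective batch size stays at 1 x 8 = 8, but only one sample's
# activations are resident on the GPU at a time.
args = SFTConfig(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```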