This quickstart walks you through installing the project with uv, setting up HuggingFace credentials, and running your first training pipeline. By the end you will have a working fine-tuned adapter saved to ./outputs/. All 39 pipelines follow the same pattern, so once you have run one, you know how to run the rest.
Step 1: Install uv and clone the repository

The project uses uv for fast, reproducible dependency management. Install it with pip if you do not already have it, then clone the repository.
pip install uv
git clone https://github.com/avnlp/llm-finetuning
cd llm-finetuning
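To confirm the install before continuing:
uv --version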
Step 2: Install dependencies

uv sync creates a virtual environment and installs all runtime dependencies — transformers, trl, peft, unsloth, deepeval, evidently, and more — in one step.
uv sync
source .venv/bin/activate
For linting and type-checking tools (Ruff, MyPy, Bandit), include the dev group:
uv sync --dev
source .venv/bin/activate
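With the dev group installed, the checks run through uv. Exact flags depend on the project's pyproject.toml configuration, but typical invocations look like:
# Lint, type-check, and scan for security issues
uv run ruff check .
uv run mypy src
uv run bandit -r src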
Step 3: Set up credentials

All pipelines download model weights from HuggingFace. Gated models (Llama, Mistral, Gemma) require you to be logged in.
# Required for gated HuggingFace models (Llama, Mistral, Gemma, etc.)
huggingface-cli login
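On a headless machine or in CI, the interactive login can be replaced by the standard HF_TOKEN environment variable, which huggingface_hub picks up automatically:
# Non-interactive alternative to huggingface-cli login
export HF_TOKEN="your-hf-token"
# Confirm the token is valid
huggingface-cli whoami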
An OPENAI_API_KEY is required only for multi_hop_question_answering and medical_question_answering pipelines, which use DeepEval and Evidently LLM-as-a-Judge reward functions. SFT, GRPO math reasoning, and preference alignment pipelines do not need it.
For pipelines that do require it, export the key before running:
# Required only for multi_hop_question_answering and medical_question_answering
export OPENAI_API_KEY="your-key"
Step 4: Run your first pipeline

Each pipeline is a single Python script that reads its config.yaml, downloads the dataset, trains the model, and writes adapter weights to ./outputs/.
The recommended starting point is the SFT LoRA pipeline, which fine-tunes Llama-3.2-3B with LoRA (rank 8, alpha 32) on the ARC-Challenge science QA dataset and requires 8–12 GB of VRAM.
python src/llm_finetuning/supervised_finetuning/lora/arc/train.py
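If you would rather not activate the virtual environment, the same script runs through uv:
uv run python src/llm_finetuning/supervised_finetuning/lora/arc/train.py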
The script reads src/llm_finetuning/supervised_finetuning/lora/arc/config.yaml:
model_id: "meta-llama/Llama-3.2-3B"
dataset_name: "allenai/ai2_arc"
dataset_config: "ARC-Challenge"
split: "train"
output_dir: "./outputs/supervised_finetuning/lora/arc"

num_train_epochs: 3
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2.0e-4
save_strategy: "epoch"
logging_steps: 10

lora_r: 8
lora_alpha: 32
lora_dropout: 0.05
target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
Adapter weights are saved to ./outputs/supervised_finetuning/lora/arc after training.
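Once training finishes, you can sanity-check the adapter by attaching it to the base model with PEFT. A minimal sketch, assuming the default output path from the config above (the test prompt is illustrative):
# Load the base model and attach the trained LoRA adapter
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_dir = "./outputs/supervised_finetuning/lora/arc"
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
model = PeftModel.from_pretrained(base, adapter_dir)
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

# Quick generation check with an ARC-style science question
inputs = tokenizer("Which gas do plants absorb during photosynthesis?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))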

Overriding config.yaml settings

You can change the dataset, model, or any hyperparameter by editing the pipeline’s config.yaml — no code changes required.

Changing the dataset

For supervised_finetuning/ pipelines, which use the split key:
dataset_id: "allenai/ai2_arc"
dataset_subset: "ARC-Challenge"
split: "train"
For GRPO, math reasoning, medical QA, and preference alignment pipelines, which use the dataset_split key:
dataset_id: "openai/gsm8k"
dataset_subset: "main"
dataset_split: "train"
All three keys are optional; if omitted, each loader falls back to its built-in defaults.

Changing the base model

Open the pipeline’s config.yaml and update model_id:
# Before
model_id: "meta-llama/Llama-3.2-3B"

# After
model_id: "meta-llama/Llama-3.1-8B"
For GRPO and preference alignment pipelines, use Unsloth-quantized variants to reduce VRAM:
model_id: "unsloth/Llama-3.1-8B-Instruct"

Output location

All pipelines write adapter weights and the tokenizer to ./outputs/<module>/<method>/<dataset>/ by default. For example:
  • SFT LoRA on ARC → ./outputs/supervised_finetuning/lora/arc/
  • GRPO on GSM8K → ./outputs/math_reasoning/grpo/gsm8k/
  • DPO on UltraFeedback → ./outputs/preference_alignment/dpo/ultrafeedback/
Override the location by setting output_dir in config.yaml:
output_dir: "./my_custom_output_dir"
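After a run completes, the output directory holds the PEFT adapter plus tokenizer files. The exact listing varies with library versions, but it typically looks like:
ls ./outputs/supervised_finetuning/lora/arc
# adapter_config.json   adapter_model.safetensors
# tokenizer.json        tokenizer_config.json   special_tokens_map.json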

GPU memory guidance

Choose a pipeline that fits your available VRAM. Lower-memory techniques like QLoRA are available for most training paradigms.
| Technique | Typical VRAM |
| --- | --- |
| SFT LoRA (3B model) | 8–12 GB |
| SFT QLoRA (3B model) | 6–8 GB |
| GRPO QLoRA (3B model) | 12–16 GB |
| DPO QLoRA (7B model) | 16–24 GB |
| ORPO QLoRA (7B model) | 16–24 GB |
| KTO QLoRA (1.5B model) | 8–12 GB |
| PPO (8B model) | 40+ GB |
If you run out of memory, set per_device_train_batch_size: 1 and increase gradient_accumulation_steps to maintain effective batch size.
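For example, a config that runs out of memory at per_device_train_batch_size: 4 with gradient_accumulation_steps: 2 (effective batch size 4 × 2 = 8) keeps the same effective batch size with:
# Same effective batch size (1 × 8 = 8), lower peak VRAM
per_device_train_batch_size: 1
gradient_accumulation_steps: 8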

Pipelines that require OPENAI_API_KEY

The following pipeline groups use DeepEval or Evidently AI LLM-as-a-Judge reward functions, which call the OpenAI API during training. Set OPENAI_API_KEY before running any of these.
Running multi-hop QA or medical QA pipelines without OPENAI_API_KEY will raise an authentication error at the first reward function call.
Multi-hop question answering:
export OPENAI_API_KEY="your-key"
python src/llm_finetuning/multi_hop_question_answering/grpo/hotpotqa/train.py
python src/llm_finetuning/multi_hop_question_answering/grpo/freshqa/train.py
python src/llm_finetuning/multi_hop_question_answering/grpo/musique/train.py
Medical question answering:
export OPENAI_API_KEY="your-key"
python src/llm_finetuning/medical_question_answering/medqa/train.py
python src/llm_finetuning/medical_question_answering/bioasq/train.py
python src/llm_finetuning/medical_question_answering/pubmedqa/train.py
All SFT, GRPO math reasoning, and preference alignment pipelines run without an OpenAI key.
