
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/avnlp/llm-finetuning/llms.txt

Use this file to discover all available pages before exploring further.

Preference alignment teaches a model to prefer human-approved responses over rejected ones. Unlike standard SFT, these methods encode a preference signal directly into the loss: paired comparisons (DPO, ORPO), binary desirability labels (KTO), or live reward model scoring (PPO). This module provides six pipelines across four algorithms. DPO, ORPO, and KTO use QLoRA via Unsloth’s FastLanguageModel; PPO uses AutoModelForCausalLMWithValueHead with a 4-bit base and a manual rollout loop scored by the OpenAssistant DeBERTa-v3 reward model.
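
The three signal types imply different training-record shapes. As a rough illustration only (these dicts sketch the general formats expected by TRL-style trainers, not this repo's exact dataset schemas):

```python
# Illustrative record shapes for each preference signal (sketch, not repo schemas).
dpo_or_orpo_row = {"prompt": "...", "chosen": "...", "rejected": "..."}  # paired comparison
kto_row = {"prompt": "...", "completion": "...", "label": True}          # binary desirability
ppo_step = {"prompt": "...", "response": "...", "reward": 0.73}          # reward-model-scored rollout
```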

All six pipelines

| Algorithm | Dataset | Model | Command |
| --- | --- | --- | --- |
| DPO | UltraFeedback (~60k) | unsloth/zephyr-sft-bnb-4bit | python src/llm_finetuning/preference_alignment/dpo/ultrafeedback/train.py |
| DPO | WebGPT Comparisons (~19k) | unsloth/zephyr-sft-bnb-4bit | python src/llm_finetuning/preference_alignment/dpo/webgpt/train.py |
| ORPO | UltraFeedback (~60k) | unsloth/llama-3-8b-bnb-4bit | python src/llm_finetuning/preference_alignment/orpo/ultrafeedback/train.py |
| KTO | KTO-Mix-14k (14k) | unsloth/Qwen2.5-1.5B-Instruct | python src/llm_finetuning/preference_alignment/kto/kto_mix/train.py |
| PPO | UltraFeedback (~60k) | unsloth/llama-3-8b-bnb-4bit | python src/llm_finetuning/preference_alignment/ppo/ultrafeedback/train.py |
| PPO | WebGPT Comparisons (~19k) | unsloth/llama-3-8b-bnb-4bit | python src/llm_finetuning/preference_alignment/ppo/webgpt/train.py |

Algorithms

Direct Preference Optimization (DPO) reparameterizes the RLHF objective so the policy is optimized directly on {prompt, chosen, rejected} triplets with a contrastive cross-entropy loss; no separate reward model or value network is required.

Data format: prompt, chosen, rejected, each serialized with apply_chat_template (a formatting sketch follows the run commands below).

The beta parameter controls the strength of the KL penalty between the trained policy and the reference model. Higher values constrain the policy more tightly to the original model's behavior.

Datasets:
| Dataset | HuggingFace ID | Split |
| --- | --- | --- |
| UltraFeedback Binarized | HuggingFaceH4/ultrafeedback_binarized | train_prefs |
| WebGPT Comparisons | openai/webgpt_comparisons | train |
```bash
python src/llm_finetuning/preference_alignment/dpo/ultrafeedback/train.py
python src/llm_finetuning/preference_alignment/dpo/webgpt/train.py
```
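
As a rough illustration of the DPO data format, the sketch below maps one UltraFeedback Binarized row to prompt / chosen / rejected strings. The exact preprocessing lives in the train scripts; the column layout assumed here follows the public HuggingFaceH4/ultrafeedback_binarized schema, and format_row is a hypothetical helper, not code from this repo.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/zephyr-sft-bnb-4bit")
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

def format_row(row):
    # "chosen"/"rejected" hold full chat transcripts; the final message is the
    # assistant turn being compared.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": row["prompt"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    return {
        "prompt": prompt,
        "chosen": row["chosen"][-1]["content"],
        "rejected": row["rejected"][-1]["content"],
    }

dataset = dataset.map(format_row)
```
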
Key config parameters (dpo/ultrafeedback/config.yaml):

```yaml
model_id: "unsloth/zephyr-sft-bnb-4bit"
max_seq_length: 4096
dpo_beta: 0.1
lora_r: 64
lora_alpha: 64
```
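
For context, dpo_beta is the β in the standard (published) DPO objective; the formula below is stated for reference and is not copied from this repo's code:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where y_w is the chosen response, y_l the rejected one, and π_ref the frozen reference model. A larger β penalizes divergence from the reference more heavily, matching the description above.
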
PatchDPOTrainer() from Unsloth must be called at module level before importing DPOTrainer. This is already present in all DPO train scripts — do not remove it.
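
A minimal sketch of the required import ordering (this mirrors what the existing train scripts already do):

```python
# Module-level patch: must run before DPOTrainer is imported.
from unsloth import FastLanguageModel, PatchDPOTrainer

PatchDPOTrainer()

from trl import DPOTrainer  # safe to import only after the patch is applied
```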

PPO rollout loop

Unlike DPO, ORPO, and KTO — which call trainer.train() — PPO drives training through a manual loop. The script generates completions, scores them with the reward model, and calls trainer.step():
```python
import torch
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# `config`, `tokenizer`, and `trainer` (a PPOTrainer) are constructed elsewhere
# in the train script and are omitted here.

# PPO requires a value head — not FastLanguageModel
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config["model_id"],
    load_in_4bit=True,
    device_map="auto",
)

reward_model = PointwiseRewardModel()  # OpenAssistant DeBERTa-v3

generation_kwargs = {
    "max_new_tokens": config.get("max_new_tokens", 128),
    "do_sample": True,
    "temperature": 0.7,
    "pad_token_id": tokenizer.eos_token_id,
}

for batch in trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate completions
    response_tensors = trainer.generate(query_tensors, **generation_kwargs)
    batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Score with the pointwise reward model
    prompts = tokenizer.batch_decode(query_tensors, skip_special_tokens=True)
    rewards = [
        torch.tensor(reward_model.score(p, r))
        for p, r in zip(prompts, batch["response"])
    ]

    # PPO policy gradient update
    trainer.step(list(query_tensors), list(response_tensors), rewards)
```
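
PointwiseRewardModel is a wrapper in this module; the loop above only relies on its score(prompt, response) method returning a scalar. A minimal sketch of such a scorer is shown below, assuming the OpenAssistant/reward-model-deberta-v3-large-v2 checkpoint; the exact checkpoint and wrapper internals are assumptions, not confirmed details of the repo.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class PointwiseRewardModel:
    """Scores a single (prompt, response) pair; higher means more preferred."""

    def __init__(self, model_id: str = "OpenAssistant/reward-model-deberta-v3-large-v2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_id)
        self.model.eval()

    @torch.no_grad()
    def score(self, prompt: str, response: str) -> float:
        inputs = self.tokenizer(prompt, response, return_tensors="pt", truncation=True)
        # The classifier head emits a single logit, used directly as the reward.
        return self.model(**inputs).logits[0].item()
```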

GPU memory guidance

| Algorithm | Model size | Typical VRAM |
| --- | --- | --- |
| DPO QLoRA | 7B | 16–24 GB |
| ORPO QLoRA | 8B | 16–24 GB |
| KTO QLoRA | 1.5B | 8–12 GB |
| PPO | 8B | 40+ GB |

Output

All pipelines save trained weights and the tokenizer to ./outputs/preference_alignment/<algorithm>/<dataset>/ by default. Change output_dir in config.yaml to save elsewhere.
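
Since the QLoRA pipelines save a PEFT (LoRA) adapter alongside the tokenizer, a saved run can typically be loaded for inference as sketched below; the path is an example instantiation of the template above, and the exact save layout may differ per pipeline.

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

output_dir = "./outputs/preference_alignment/dpo/ultrafeedback"  # example path
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain DPO in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```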
