TRL provides a full stack of trainers for post-training language models. Methods are organized into four categories:
- Online methods
- Offline methods
- Reward modeling
- Knowledge distillation
Online methods
Online methods generate completions at training time and use those completions, along with a reward signal, to update the policy. They generally require more compute per step than offline methods but can achieve better alignment because they train on the model’s own distribution.
GRPOTrainer — Group Relative Policy Optimization
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — introduces GRPO.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — multi-stage pipeline using GRPO for reasoning.
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale — overlong filtering, clip-higher, soft overlong punishment, and token-level loss.
- Dr. GRPO: Understanding R1-Zero-Like Training: A Critical Perspective — length-debiased GRPO variant.
- It Takes Two: Your GRPO Is Secretly DPO — formal connection between GRPO and DPO; 2-GRPO with num_generations=2.
- Group Sequence Policy Optimization — sequence-level importance sampling.
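
The "group relative" part of GRPO can be illustrated without any training code. The sketch below is a minimal, standalone illustration and not TRL's implementation: the function name, the `scale_by_std` flag, the `eps` value, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, scale_by_std=True, eps=1e-4):
    """Advantages for the completions sampled from ONE prompt.

    Each completion is scored relative to the other completions in its
    group, so no learned value function (critic) is needed.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        # Variant that only centers the rewards (hypothetical flag).
        return centered
    spread = pstdev(rewards)  # assumption: population std; real code may differ
    return [c / (spread + eps) for c in centered]


# Four completions for the same prompt, scored by a rule-based reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
unscaled = group_relative_advantages(rewards, scale_by_std=False)
```

Completions that beat the group average get a positive advantage and the rest get a negative one. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.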
RLOOTrainer — REINFORCE Leave-One-Out
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs — introduces RLOO.
- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models — global advantage normalization for training stability.
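
As a rough illustration of the leave-one-out idea, the sketch below computes a baseline for each completion from the rewards of the other completions sampled for the same prompt. The function name is hypothetical and the code is a simplified reading of the technique, not the trainer's implementation.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k completions of a single prompt.

    The baseline for completion i is the mean reward of the other k - 1
    completions, so a sample never serves as its own baseline.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Like the group-relative scheme in GRPO, this needs several completions per prompt but no critic network.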
OnlineDPOTrainer — Online Direct Preference Optimization
- Direct Language Model Alignment from Online AI Feedback — introduces online DPO with real-time AI feedback.
NashMDTrainer — Nash Mirror Descent
- Nash Learning from Human Feedback — introduces Nash-MD.
PPOTrainer — Proximal Policy Optimization
- Proximal Policy Optimization Algorithms — introduces PPO.
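
The core of PPO is its clipped surrogate objective. The sketch below evaluates that objective for a single action, assuming the probability ratio and the advantage have already been computed; the function name and the default `clip_range` are choices made for this example, and a full actor-critic setup adds a value loss and advantage estimation on top.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for one action (to be maximized).

    ratio     -- pi_new(a | s) / pi_old(a | s)
    advantage -- estimated advantage of the action
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - clip_range, 1 + clip_range] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 1.0)       # ratio too high, positive advantage
penalty_kept = clipped_surrogate(1.5, -1.0)     # ratio too high, negative advantage
```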
XPOTrainer — Exploratory Preference Optimization
- Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF — introduces XPO.
Offline methods
Offline methods train on a fixed, pre-collected dataset. They are computationally lighter than online methods and simpler to set up, but may suffer from distribution shift between the training data and the model’s own generation distribution.
SFTTrainer — Supervised Fine-Tuning
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — introduces sequence packing for efficient SFT.
- Fewer Truncations Improve Language Modeling — Best Fit Decreasing packing strategy to minimize truncation.
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification — Dynamic Fine-Tuning (DFT) with gradient rescaling.
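
To make the Best Fit Decreasing packing strategy mentioned above concrete, here is a minimal sketch that operates on sequence lengths only. The function name and the choice to reject over-length sequences are assumptions for this example; a real packing implementation also has to carry the tokens and handle sequences longer than the pack size.

```python
def best_fit_decreasing(lengths, capacity):
    """Group sequence lengths into packs of at most `capacity` tokens.

    Sequences are placed longest first, each into the pack that would
    be left with the least free space after the placement.
    """
    packs = []  # lengths assigned to each pack
    free = []   # remaining capacity of each pack
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError("sequence longer than pack capacity")
        best = None
        for i, room in enumerate(free):
            if length <= room and (best is None or room < free[best]):
                best = i
        if best is None:
            # No existing pack has room: open a new one.
            packs.append([length])
            free.append(capacity - length)
        else:
            packs[best].append(length)
            free[best] -= length
    return packs


packs = best_fit_decreasing([5, 3, 4, 2, 2], capacity=8)
```

In this example the five sequences fit into two completely full packs, so nothing is truncated and no padding is needed.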
DPOTrainer — Direct Preference Optimization
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model — introduces DPO.
- A General Theoretical Paradigm to Understand Learning from Human Preferences — IPO loss variant to avoid preference overfitting.
- ORPO: Monolithic Preference Optimization without Reference Model — reference-free monolithic variant (see ORPOTrainer).
- Learn Your Reference Model for Real Good Alignment — Trust Region DPO with periodic reference model updates.
- Anchored Preference Optimization and Contrastive Revisions — APO objective for more contrastive preference pairs.
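
The difference between the standard DPO loss and the IPO variant is easiest to see on a single preference pair. The sketch below assumes the sequence log-probabilities under the policy and the reference model are already available; all function names are hypothetical, and details such as how log-probabilities are aggregated over tokens are left out.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
    """How much more the policy prefers the chosen answer than the reference does."""
    return (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)


def dpo_loss(margin, beta=0.1):
    # Sigmoid loss: keeps rewarding larger margins, with diminishing returns.
    return -log_sigmoid(beta * margin)


def ipo_loss(margin, beta=0.1):
    # Squared loss: pulls the margin toward the fixed target 1 / (2 * beta).
    return (margin - 1.0 / (2.0 * beta)) ** 2


margin = preference_margin(-10.0, -14.0, -11.0, -12.0)  # illustrative log-probs
```

Because the IPO loss grows again once the margin exceeds its target, it discourages the unbounded margin growth associated with overfitting to the preference data.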
BCOTrainer — Binary Classifier Optimization
- Binary Classifier Optimization for Large Language Model Alignment — introduces BCO.
CPOTrainer — Contrastive Preference Optimization
- Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation — introduces CPO.
- SimPO: Simple Preference Optimization with a Reference-Free Reward — reference-free CPO variant with target reward margin.
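
A reference-free reward with a target margin, as described for SimPO above, might be sketched as follows. This is a simplified reading of the objective for one preference pair, not the trainer's code; the function names and the default `beta` and `gamma` values are illustrative assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len,
               beta=2.0, gamma=1.0):
    """Reference-free preference loss for one pair.

    The implicit reward is the length-normalized log-probability under
    the policy itself; gamma is the target reward margin.
    """
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)


# Summed token log-probs and token counts are made up for illustration.
loss = simpo_loss(chosen_logp=-12.0, chosen_len=10,
                  rejected_logp=-30.0, rejected_len=15)
```

No reference model appears anywhere in the computation, which is what makes this family of objectives cheaper in memory than DPO.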
KTOTrainer — Kahneman–Tversky Optimization
- KTO: Model Alignment as Prospect Theoretic Optimization — introduces KTO.
ORPOTrainer — Odds Ratio Preference Optimization
- ORPO: Monolithic Preference Optimization without Reference Model — introduces ORPO.
Reward modeling
Reward models score model outputs and provide the feedback signal used by online training methods.
RewardTrainer — Outcome reward modeling
- Llama 2: Open Foundation and Fine-Tuned Chat Models — margin-based reward loss for multi-level preference ratings.
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking — auxiliary centering loss to reduce underdetermination.
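
The two ideas listed above, a preference margin and a centering term, can be combined in one pairwise loss. The sketch below is an illustration for a single pair of scalar rewards; the function and parameter names are hypothetical and do not correspond to specific config options.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_pair_loss(chosen_reward, rejected_reward, margin=0.0, center_coeff=0.0):
    """Pairwise reward-model loss for one preference pair.

    margin       -- how far apart the two scores should be (larger for
                    stronger preferences)
    center_coeff -- weight of an auxiliary term that keeps scores
                    centered around zero
    """
    ranking = -log_sigmoid(chosen_reward - rejected_reward - margin)
    centering = center_coeff * (chosen_reward + rejected_reward) ** 2
    return ranking + centering


plain = reward_pair_loss(3.0, 1.0)
centered = reward_pair_loss(3.0, 1.0, center_coeff=0.01)
```

The ranking term only depends on the difference between the two scores, so adding a constant to both leaves it unchanged. The centering term removes that freedom by penalizing scores that drift away from zero together.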
PRMTrainer — Process reward modeling
- Solving math word problems with process- and outcome-based feedback — compares process-based vs outcome-based supervision; demonstrates the value of PRMs for reducing reasoning errors.
Knowledge distillation
Knowledge distillation methods train a smaller student model to mimic the output distribution of a larger teacher model, rather than training on hard labels.
GKDTrainer — Generalized Knowledge Distillation
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes — introduces GKD with flexible divergence losses and on-policy student sampling.
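
One of the divergences used in this line of work is a generalized Jensen-Shannon divergence that interpolates between the teacher and student distributions. The sketch below computes it for a single next-token distribution. The function names are hypothetical, and the orientation of `beta` follows one common definition; the trainer's own parameterization may differ.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd(teacher, student, beta=0.5):
    """beta * KL(teacher || mix) + (1 - beta) * KL(student || mix),

    where mix = beta * teacher + (1 - beta) * student and 0 < beta < 1.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    mix = [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl_divergence(teacher, mix) + (1.0 - beta) * kl_divergence(student, mix)


teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
divergence = generalized_jsd(teacher, student, beta=0.5)
```

In an on-policy setup, this divergence would be evaluated on sequences sampled from the student, so that the student is corrected on the mistakes it actually makes.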
MiniLLMTrainer — Sequence-level reverse KL distillation
- Knowledge Distillation of Large Language Models — introduces MiniLLM.
Choosing a method
Use the table below as a starting point. The right choice depends on your data, compute budget, and alignment goals.

| Scenario | Recommended method |
|---|---|
| Instruction-following from demonstrations | SFTTrainer |
| Preference alignment with paired data (offline) | DPOTrainer |
| Preference alignment without a reference model | ORPOTrainer or CPOTrainer |
| Binary feedback (liked/disliked), no pairs | KTOTrainer or BCOTrainer |
| Online RL with rule-based rewards (e.g., math) | GRPOTrainer |
| Online RL with a reward model, critic-free | RLOOTrainer |
| Online RL with a full actor-critic setup | PPOTrainer |
| Scoring full completions | RewardTrainer |
| Scoring reasoning steps | PRMTrainer |
| Compressing a large model into a smaller one | GKDTrainer or MiniLLMTrainer |

For online trainers that support vLLM-backed generation, set use_vllm=True in the corresponding config.