

Adapter-based fine-tuning keeps the base model weights frozen and trains only a small number of additional parameters, making it possible to fine-tune large language models on a single GPU. This project supports five adapter methods from the PEFT library: LoRA, QLoRA, DoRA, P-Tuning, and Prefix-Tuning. All methods use SFTTrainer from TRL and save only the adapter weights at the end of training, not a full model copy.
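A minimal sketch of that shared training setup is shown below, using the Llama-3.2-3B base model referenced later in this page; the dataset, output directory, and step count are illustrative placeholders rather than values from this project's configs:

from datasets import load_dataset
from peft import LoraConfig, TaskType
from trl import SFTConfig, SFTTrainer

# Placeholder dataset with a plain "text" column; the project's own datasets differ.
dataset = load_dataset("stanfordnlp/imdb", split="train")

peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B",             # base weights stay frozen
    args=SFTConfig(output_dir="outputs/lora", max_steps=100),
    train_dataset=dataset,
    peft_config=peft_config,                     # SFTTrainer wraps the model with the adapter
)
trainer.train()
trainer.save_model()  # writes only the adapter weights, not a full model copy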

How LoRA works

LoRA (Low-Rank Adaptation) inserts trainable low-rank decomposition matrices alongside the frozen weight matrices of the attention and feed-forward projection layers. For a weight matrix W, LoRA learns two low-rank matrices A and B such that the effective weight becomes W + (α/r)·BA. Only A and B are updated during training; W remains frozen. The rank r controls the size of the update: a lower rank means fewer trainable parameters but less expressive updates. The scaling factor α (lora_alpha) scales the magnitude of the update relative to the rank.
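The parameter saving is easy to see in a toy example. The 4096×4096 projection size below is illustrative, and the initialization mirrors PEFT's default (small random A, zero B):

import torch

d_out, d_in, r, alpha = 4096, 4096, 8, 32
W = torch.randn(d_out, d_in)        # frozen base weight
A = torch.randn(r, d_in) * 0.01     # trainable; PEFT gives A a small random init
B = torch.zeros(d_out, r)           # trainable; starts at zero, so the initial update is zero

W_eff = W + (alpha / r) * (B @ A)   # effective weight used in the forward pass

print(W.numel())                    # 16,777,216 parameters if W were trained directly
print(A.numel() + B.numel())        # 65,536 trainable parameters (~0.4% of the above)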

Configuration

# config.yaml (lora/arc/config.yaml)
lora_r: 8
lora_alpha: 32
lora_dropout: 0.05
target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

from peft import LoraConfig, TaskType, get_peft_model

# Build the adapter configuration from the values in config.yaml above,
# then wrap the base model so that only the LoRA matrices are trainable.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=config["lora_r"],
    lora_alpha=config["lora_alpha"],
    lora_dropout=config["lora_dropout"],
    bias="none",
    target_modules=config["target_modules"],
)
model = get_peft_model(model, peft_config)
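After wrapping, a quick sanity check confirms that only the adapter parameters are trainable. The counts in the comment below are approximate for a 3B model with the default target modules:

model.print_trainable_parameters()
# e.g. trainable params: ~12,000,000 || all params: ~3,200,000,000 || trainable%: ~0.4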

VRAM usage

Approximately 8–12 GB for Llama-3.2-3B with default settings.

When to use

LoRA is the default choice for most fine-tuning tasks. It offers a strong balance between adapter quality and training efficiency, works with standard bfloat16 precision, and is compatible with all five trainer types. Start with LoRA unless VRAM is the primary constraint.

Comparison table

| Technique | Trainable params | Quantized | VRAM (3B) | Best for |
|---|---|---|---|---|
| LoRA | Low-rank A, B matrices per target module | No | 8–12 GB | General purpose, best quality/efficiency tradeoff |
| QLoRA | Same as LoRA | Yes (4-bit NF4) | 6–8 GB | Limited VRAM, larger models |
| DoRA | Magnitude scalars + LoRA direction matrices | No | 8–12 GB | More expressive updates at same rank as LoRA |
| P-Tuning | Prompt encoder MLP | No | Very low | Minimal param overhead, prompt-style adaptation |
| Prefix-Tuning | Prefix K/V vectors per attention layer | No | Low | Sequence-to-sequence, attention-level adaptation |
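As a sketch of how the QLoRA row differs from plain LoRA in practice: the same LoraConfig is reused, but the base model is first loaded in 4-bit NF4. The model name matches the 3B checkpoint used elsewhere in this page, and the exact loading flags may differ from this project's scripts:

import torch
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 over the 4-bit weights
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)  # re-enable gradients/casting where needed
model = get_peft_model(model, peft_config)      # same LoraConfig as the LoRA example above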

Target modules

LoRA, QLoRA, and DoRA all use the same target_modules list, which selects which linear projection layers receive LoRA adapters. The default list covers all attention and feed-forward projections in Llama-style models:
target_modules:
  - q_proj    # query projection (attention)
  - k_proj    # key projection (attention)
  - v_proj    # value projection (attention)
  - o_proj    # output projection (attention)
  - gate_proj # gate projection (SwiGLU feed-forward)
  - up_proj   # up projection (SwiGLU feed-forward)
  - down_proj # down projection (SwiGLU feed-forward)

Applying LoRA to all seven modules (the default) gives the most expressive adapter. To reduce trainable parameters, restrict to attention-only modules (q_proj, k_proj, v_proj, o_proj). The optimal set is task-dependent: start with all seven and ablate if needed.

P-Tuning and Prefix-Tuning do not use target_modules. They operate on the input sequence or the attention K/V tensors respectively, and are configured entirely through num_virtual_tokens (and encoder_hidden_size for P-Tuning).
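For reference, a minimal sketch of how those two methods are configured in PEFT; the token counts and encoder size below are illustrative values, not this project's defaults:

from peft import PrefixTuningConfig, PromptEncoderConfig, TaskType, get_peft_model

# P-Tuning: a small MLP (prompt encoder) produces learned virtual tokens
# that are prepended to the input embeddings.
p_tuning_config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    encoder_hidden_size=128,
)

# Prefix-Tuning: learned key/value prefixes are prepended inside every attention layer.
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
)

# Apply one or the other; no target_modules list is involved.
# model = get_peft_model(model, p_tuning_config)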
