LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that enables adapting large language models to specific tasks without modifying the original model weights. Instead of fine-tuning all parameters, LoRA introduces small trainable rank decomposition matrices that are added to existing weights during inference.
What is LoRA?
LoRA decomposes weight updates into low-rank matrices, so the adapted weight is `W' = W + BA`, where:
- `W` is the original pre-trained weight matrix
- `B` and `A` are low-rank matrices (rank `r << min(d_in, d_out)`)
- Only `B` and `A` are trained and stored (massive parameter reduction)
For a 7B parameter model, a LoRA adapter with rank 8 typically adds only ~8-16M parameters (0.1-0.2% of original model size).
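The estimate above can be checked with a few lines of arithmetic. The sketch below assumes LLaMA-7B-like shapes (hidden size 4096, 32 layers, four adapted 4096 x 4096 attention projections per layer); these shapes are illustrative assumptions, not values taken from this page.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed LLaMA-7B-like shapes (illustrative only).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
ADAPTED_PROJECTIONS = 4  # q, k, v, and output projections per layer
RANK = 8

per_matrix = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_matrix * ADAPTED_PROJECTIONS * NUM_LAYERS
fraction = total / 7e9  # share of a 7B-parameter base model

print(f"per projection: {per_matrix:,}")
print(f"total adapter:  {total:,} ({fraction:.2%} of 7B)")
```

Under these assumptions the adapter holds about 8.4M parameters, at the low end of the range quoted above. Adapting the FFN projections as well would push the count higher.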
Quick Start
Single LoRA Adapter
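A minimal sketch of single-adapter generation with the Python LLM API follows. The `LoraConfig` fields are the ones documented on this page; the import paths, the `LoRARequest(name, id, path)` argument order, and every file path are assumptions that may differ between TensorRT-LLM versions.

```python
# Keyword arguments mirroring the LoraConfig parameters documented on this page.
# The adapter path is a placeholder.
ADAPTER_DIR = "/path/to/lora/adapter"
LORA_KWARGS = {
    "lora_dir": [ADAPTER_DIR],
    "max_lora_rank": 8,
    "max_loras": 1,
    "max_cpu_loras": 1,
}


def generate_with_adapter(model_dir: str, prompts: list[str]):
    """Run the prompts through the base model with one LoRA adapter applied."""
    # Imports are deferred so this module loads without TensorRT-LLM installed.
    # Assumption: these import paths match the installed version.
    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.executor.request import LoRARequest
    from tensorrt_llm.lora_manager import LoraConfig

    llm = LLM(model=model_dir, lora_config=LoraConfig(**LORA_KWARGS))
    # Assumed positional order: adapter name, integer ID, adapter path.
    request = LoRARequest("my-lora-task", 0, ADAPTER_DIR)
    return llm.generate(
        prompts,
        SamplingParams(max_tokens=50),
        lora_request=[request] * len(prompts),  # one request per prompt
    )
```

Calling `generate_with_adapter("/path/to/base/model", ["Hello, how are you?"])` would require a GPU, an installed TensorRT-LLM, and real model and adapter directories.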
Multi-LoRA Support
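One way to organize per-request adapter selection is sketched below. The adapter names and paths are hypothetical, and `to_lora_requests` assumes the `LoRARequest(name, id, path)` constructor from TensorRT-LLM's executor API; the routing helpers themselves are plain Python.

```python
# Hypothetical adapter registry: name -> adapter directory (placeholder paths).
ADAPTERS = {
    "summarize": "/path/to/lora/summarize",
    "translate": "/path/to/lora/translate",
}


def adapter_specs(adapters: dict) -> list:
    """Assign a stable integer ID to each adapter: (name, id, path)."""
    return [(name, idx, path) for idx, (name, path) in enumerate(sorted(adapters.items()))]


def route(prompts_with_adapter: list, specs: list) -> list:
    """Pair each prompt with the spec of the adapter it asked for."""
    by_name = {name: (name, idx, path) for name, idx, path in specs}
    return [(prompt, by_name[name]) for prompt, name in prompts_with_adapter]


def to_lora_requests(routed: list) -> list:
    """Convert routed specs into request objects (needs TensorRT-LLM installed)."""
    # Assumption: import path and constructor signature match the installed version.
    from tensorrt_llm.executor.request import LoRARequest
    return [LoRARequest(*spec) for _, spec in routed]


specs = adapter_specs(ADAPTERS)
routed = route(
    [("Summarize: ...", "summarize"), ("Translate: ...", "translate")],
    specs,
)
```

The resulting request list would be passed alongside the prompts at generation time, with every adapter directory listed in `lora_dir` and `max_loras` set to at least the number of adapters expected to be active at once.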
Serve multiple LoRA adapters simultaneously by listing every adapter directory in `lora_dir` and selecting an adapter per request.
Configuration Options
LoraConfig Parameters
| Parameter | Type | Description |
|---|---|---|
| `lora_dir` | `List[str]` | Paths to LoRA adapter directories |
| `lora_target_modules` | `List[str]` | Which modules to apply LoRA to (e.g., `['attn_q', 'attn_k', 'attn_v']`) |
| `max_lora_rank` | `int` | Maximum rank of LoRA adapters |
| `max_loras` | `int` | Maximum number of LoRAs active on GPU simultaneously |
| `max_cpu_loras` | `int` | Maximum number of LoRAs cached in CPU memory |
| `lora_ckpt_source` | `str` | Format of LoRA checkpoint: `"hf"` (HuggingFace) or `"nemo"` (NeMo) |
| `trtllm_modules_to_hf_modules` | `Dict` | Mapping from TRT-LLM module names to HuggingFace names |
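The parameters in the table can be sanity-checked before they reach the library. The sketch below uses a hypothetical plain-Python mirror of the table (`LoraSettings` is not a TensorRT-LLM class, and its default values are illustrative choices, not the library's defaults). It enforces the two constraints this page states: the checkpoint source is `"hf"` or `"nemo"`, and the CPU cache holds at least as many adapters as the GPU.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class LoraSettings:
    """Hypothetical mirror of the LoraConfig fields in the table above."""
    lora_dir: list = field(default_factory=list)
    lora_target_modules: list = field(
        default_factory=lambda: ["attn_q", "attn_k", "attn_v"]
    )
    max_lora_rank: int = 8       # illustrative default
    max_loras: int = 4           # illustrative default
    max_cpu_loras: int = 8       # illustrative default
    lora_ckpt_source: str = "hf"

    def validate(self) -> None:
        """Raise ValueError if the settings contradict the documented rules."""
        if self.lora_ckpt_source not in ("hf", "nemo"):
            raise ValueError("lora_ckpt_source must be 'hf' or 'nemo'")
        if self.max_lora_rank <= 0:
            raise ValueError("max_lora_rank must be positive")
        if self.max_cpu_loras < self.max_loras:
            raise ValueError("max_cpu_loras should be >= max_loras")

    def as_kwargs(self) -> dict:
        """Return the fields as keyword arguments for a config constructor."""
        return asdict(self)


settings = LoraSettings(lora_dir=["/path/to/lora/adapter"])
settings.validate()
```

After validation, `settings.as_kwargs()` yields a dictionary that could be unpacked into `LoraConfig(**kwargs)`, assuming the installed version accepts exactly these field names.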
`max_cpu_loras` should be >= `max_loras`. The system maintains a cache in CPU memory and swaps LoRAs to GPU as needed.
Advanced Usage
LoRA with Quantization
LoRA works with quantized base models. Adapters are applied in full precision (FP16/BF16) even when the base model is quantized, which preserves adapter quality while keeping the memory savings from quantization.
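To illustrate the idea, the toy example below keeps the base weight in a low-precision integer format while the LoRA path stays in floating point, then sums the two outputs. It is a conceptual sketch in pure Python, not TensorRT-LLM's kernels. Symmetric per-tensor INT8 is used only because it is easy to show; the matrices are made-up values.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns (integer weights, scale)."""
    scale = max(abs(x) for row in weights for x in row) / 127
    return [[round(x / scale) for x in row] for row in weights], scale


W = [[0.50, -0.25], [0.10, 0.80]]  # base weight, shape (d_out x d_in)
A = [[1.0, 1.0]]                   # LoRA A, shape (r x d_in) with r = 1
B = [[0.01], [-0.02]]              # LoRA B, shape (d_out x r)
x = [1.0, 2.0]                     # input activation

Wq, scale = quantize_int8(W)
base_out = [scale * v for v in matvec(Wq, x)]  # quantized base path
lora_out = matvec(B, matvec(A, x))             # adapter path, never quantized
y = [b + l for b, l in zip(base_out, lora_out)]

print("base (quantized):", base_out)
print("LoRA (float):    ", lora_out)
print("combined:        ", y)
```

The quantization error is confined to the base path. The small adapter update is added at full precision, so it is not rounded away by the coarser base format.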
NeMo LoRA Format
NeMo-format LoRA checkpoints are supported: set `lora_ckpt_source` to `"nemo"`.
Cache Management
Fine-tune LoRA cache sizes for optimal performance:
host_cache_size
Controls CPU memory allocated for caching inactive LoRA adapters. Larger values allow more adapters to be cached, reducing load time when switching between adapters.
device_cache_percent
Percentage of free GPU memory dedicated to the LoRA adapter cache. Higher values allow more adapters to be active simultaneously but reduce memory available for KV cache.
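A rough back-of-the-envelope sizing for both caches is sketched below. All numbers are assumed examples rather than measurements, `device_cache_percent` is assumed to be expressed as a fraction between 0 and 1, and real caches may fit fewer adapters because of allocation granularity and per-adapter overhead.

```python
GIB = 1024 ** 3


def adapter_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Storage for one adapter; 2 bytes per parameter assumes FP16/BF16."""
    return num_params * bytes_per_param


def adapters_that_fit(cache_bytes: int, per_adapter_bytes: int) -> int:
    """Upper bound on whole adapters that fit in a cache of the given size."""
    return cache_bytes // per_adapter_bytes


# Assumed example inputs.
params_per_adapter = 8_388_608   # roughly a rank-8 adapter on a 7B model
host_cache_size = 1 * GIB        # bytes of CPU memory for inactive adapters
free_gpu_bytes = 20 * GIB        # free GPU memory after loading the model
device_cache_percent = 0.05      # assumed to be a fraction of free GPU memory

per_adapter = adapter_bytes(params_per_adapter)
device_cache_bytes = int(free_gpu_bytes * device_cache_percent)
host_fit = adapters_that_fit(host_cache_size, per_adapter)
device_fit = adapters_that_fit(device_cache_bytes, per_adapter)
kv_left = free_gpu_bytes - device_cache_bytes  # what remains for the KV cache

print(f"adapter size:        {per_adapter / 2**20:.0f} MiB")
print(f"fit in host cache:   {host_fit}")
print(f"fit in device cache: {device_fit}")
print(f"left for KV cache:   {kv_left / GIB:.1f} GiB")
```

Raising `device_cache_percent` increases `device_fit` and shrinks `kv_left` by the same number of bytes, which is the trade-off described above.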
Serving with trtllm-serve
YAML Configuration
Create a `config.yaml` file containing the LoRA settings.
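A sketch of what such a file might contain is shown below as a Python string so it can be written to disk programmatically. The field names come from the LoraConfig table on this page; the top-level `lora_config` key and the specific values are assumptions.

```python
from pathlib import Path

# YAML text mirroring the LoraConfig fields documented on this page.
# Assumption: the server reads these under a top-level `lora_config` key.
CONFIG_YAML = """\
lora_config:
  lora_target_modules: ['attn_q', 'attn_k', 'attn_v']
  max_lora_rank: 8
  max_loras: 4
  max_cpu_loras: 8
"""


def write_config(path: str = "config.yaml") -> Path:
    """Write the example configuration to disk and return its path."""
    target = Path(path)
    target.write_text(CONFIG_YAML, encoding="utf-8")
    return target
```

The file would then typically be handed to the server at startup, for example through an extra-options flag such as `--extra_llm_api_options config.yaml`; check `trtllm-serve --help` for the exact flag in your version.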
Starting the Server
Client Usage
Send requests with LoRA adapters using either of the following clients:
- Python (OpenAI SDK)
- cURL
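Both clients send the same JSON body to the server's OpenAI-compatible completions endpoint. The sketch below builds that body with the standard library only. The `lora_request` field and its three keys are assumptions about the server's extension to the OpenAI schema, and the model path, adapter name, and adapter path are placeholders.

```python
import json


def build_completion_payload(model, prompt, lora_name, lora_int_id, lora_path,
                             max_tokens=20):
    """Request body for an OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Assumed extension field selecting the adapter for this request.
        "lora_request": {
            "lora_name": lora_name,
            "lora_int_id": lora_int_id,
            "lora_path": lora_path,
        },
    }


payload = build_completion_payload(
    model="/path/to/base/model",
    prompt="What is the capital city of France?",
    lora_name="lora-example-0",
    lora_int_id=0,
    lora_path="/path/to/lora/adapter",
)
body = json.dumps(payload, indent=2)
print(body)
```

With the OpenAI Python SDK, the `lora_request` dictionary would be passed through the `extra_body` argument of `client.completions.create`. With cURL, `body` would be sent as the POST data with a `Content-Type: application/json` header.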
Benchmarking with trtllm-bench
YAML Configuration
Run Benchmark
Target Modules
Commonly used LoRA target modules:
- Attention Only
- Attention + Output
- Full (Attention + FFN)
Attention Only:
- Lightweight adaptation
- Good for task-specific tuning
- Minimal memory overhead
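The three groupings above can be written down as presets. In the sketch below, the attention-only list matches the example used elsewhere on this page; the remaining module names and the HuggingFace mapping are assumptions based on LLaMA-style models and may not hold for other architectures.

```python
# Presets for lora_target_modules. Only the attention-only list appears on this
# page; the other module names are assumptions for LLaMA-style models.
TARGET_MODULE_PRESETS = {
    "attention_only": ["attn_q", "attn_k", "attn_v"],
    "attention_output": ["attn_q", "attn_k", "attn_v", "attn_dense"],
    "full": [
        "attn_q", "attn_k", "attn_v", "attn_dense",
        "mlp_h_to_4h", "mlp_4h_to_h", "mlp_gate",
    ],
}

# Assumed TRT-LLM -> HuggingFace name mapping for LLaMA-style checkpoints.
TRTLLM_TO_HF = {
    "attn_q": "q_proj",
    "attn_k": "k_proj",
    "attn_v": "v_proj",
    "attn_dense": "o_proj",
    "mlp_h_to_4h": "gate_proj",
    "mlp_4h_to_h": "down_proj",
    "mlp_gate": "up_proj",
}


def hf_names(preset: str) -> list:
    """Translate a preset's TRT-LLM module names into HuggingFace names."""
    return [TRTLLM_TO_HF[name] for name in TARGET_MODULE_PRESETS[preset]]


print(hf_names("attention_only"))
```

Comparing `hf_names(...)` against the `target_modules` recorded in an adapter's training configuration is a quick way to confirm that a preset covers every module the adapter actually modifies.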
Module names may vary by model architecture. Use `trtllm_modules_to_hf_modules` to map TRT-LLM names to HuggingFace names if needed.
Performance Considerations
Rank Selection
Lower ranks (4-8):
- Faster inference
- Smaller adapter files
- Sufficient for most tasks
Higher ranks:
- Better adaptation capacity
- Slower inference
- Use for complex domain adaptation
Number of Active LoRAs
- `max_loras` controls GPU memory usage
- More active LoRAs → less memory for KV cache
- Start with 4-8 and tune based on workload
- Use `max_cpu_loras` for larger adapter pools
Cache Tuning
- Increase `host_cache_size` if frequently switching adapters
- Increase `device_cache_percent` if many adapters are used concurrently
- Monitor adapter swap times and adjust accordingly
Best Practices
Start with attention-only modules
Begin with `['attn_q', 'attn_k', 'attn_v']` for most tasks. This provides good adaptation with minimal overhead.
Use rank 8 as baseline
Rank 8 offers a good balance between adaptation capacity and inference speed for most use cases.
Configure cache sizes appropriately
Set `max_cpu_loras` to 2-4x your `max_loras` to allow efficient adapter swapping.
Combine with quantization
Use FP8 or INT4 quantization for the base model to maximize memory savings while maintaining LoRA quality.
Limitations
Additional Resources
- LoRA Paper: the original "LoRA: Low-Rank Adaptation of Large Language Models"
- HuggingFace PEFT: training LoRA adapters with the PEFT library
- LoRA Adapters Hub: browse pre-trained LoRA adapters