TensorRT-LLM supports three execution backends, each optimized for different use cases and deployment scenarios. This guide helps you choose the right backend for your needs.
Backend Comparison
PyTorch
Default - Recommended for most users. Best balance of performance and flexibility.
TensorRT
Legacy - Maintained for compatibility. Compiled TensorRT engines.
AutoDeploy
Beta - Experimental. Automatic model optimization.
Backend Overview Table
| Feature | PyTorch | TensorRT | AutoDeploy |
|---|---|---|---|
| Status | Default ✅ | Legacy | Beta (experimental) |
| Entry Point | LLM(backend="pytorch") | LLM(backend="tensorrt") | LLM(backend="_autodeploy") |
| Key Path | _torch/pyexecutor/ → PyExecutor | builder.py → trtllm.Executor | _torch/auto_deploy/ → ADExecutor |
| Performance | Excellent | Maximum | Good (improving) |
| Flexibility | High | Low | Very High |
| Build Time | None (dynamic) | Long (compilation) | Medium (graph transforms) |
| New Model Support | Requires implementation | Requires implementation | Day-0 support |
| Recommended For | Production, development | Legacy workloads | Prototyping, new models |
PyTorch Backend (Default)
The PyTorch backend is the default and recommended backend for TensorRT-LLM. It combines excellent performance with high flexibility.
Architecture
Key Features
Dynamic Execution
- No compilation step required
- Immediate model loading and inference
- Easy debugging with standard PyTorch tools
- Supports torch.compile for additional optimization
Custom Attention Kernels
The PyTorch backend uses highly optimized custom attention implementations:
- TrtllmAttention (default): Hand-tuned CUDA kernels for maximum performance
- FlashInferAttention: Alternative backend with FP8 quantization support
- VanillaAttention: Reference implementation for testing
Select the attention backend with LLM(attn_backend="trtllm") or LLM(attn_backend="flashinfer").
Full Feature Support
- In-flight batching (continuous batching)
- Paged KV cache with cross-request reuse
- Speculative decoding (EAGLE, Medusa, n-gram, etc.)
- LoRA adapters with dynamic switching
- Multi-modal models (vision-language)
- Quantization (FP8, INT8, INT4)
- CUDA Graphs
- Overlap scheduler
Distributed Inference
- Tensor parallelism
- Pipeline parallelism
- Multiple communication backends (MPI, Ray, RPC)
- Disaggregated serving (separate prefill and decode)
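As a rough sketch of how tensor and pipeline parallelism combine, the helper below checks that the two degrees multiply to the number of GPUs before building constructor arguments. The function name parallel_kwargs is hypothetical, and the assumption that the LLM constructor accepts tensor_parallel_size and pipeline_parallel_size should be checked against the release you use.

```python
from typing import Any, Dict


def parallel_kwargs(num_gpus: int,
                    tensor_parallel_size: int = 1,
                    pipeline_parallel_size: int = 1) -> Dict[str, Any]:
    """Hypothetical helper: validate a TP x PP layout against the GPU count."""
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive")
    # Each pipeline stage holds one tensor-parallel group, so the total
    # number of ranks is the product of the two degrees.
    world_size = tensor_parallel_size * pipeline_parallel_size
    if world_size != num_gpus:
        raise ValueError(
            f"TP={tensor_parallel_size} x PP={pipeline_parallel_size} "
            f"needs {world_size} GPUs, but {num_gpus} were given")
    # Assumed keyword names; pass the result as LLM(model=..., **kwargs).
    return {
        "tensor_parallel_size": tensor_parallel_size,
        "pipeline_parallel_size": pipeline_parallel_size,
    }
```

For example, parallel_kwargs(8, tensor_parallel_size=4, pipeline_parallel_size=2) describes two pipeline stages of four GPUs each.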
When to Use PyTorch Backend
Use the PyTorch backend when:
- Starting a new project (it’s the default)
- You need rapid iteration and development
- You want the latest features and optimizations
- You need to debug model behavior
- You’re deploying to production (recommended)
Example Usage
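A minimal sketch of the PyTorch backend in use, assuming the LLM(backend="pytorch") entry point from the overview table and the LLM and SamplingParams classes exported by the tensorrt_llm package. The helper names are hypothetical, and constructor arguments may differ between releases.

```python
from typing import Any, Dict, List


def pytorch_backend_kwargs(model: str, attn_backend: str = "trtllm") -> Dict[str, Any]:
    """Hypothetical helper: collect constructor arguments for the PyTorch backend."""
    # Only the two attn_backend values named in this guide are accepted here.
    if attn_backend not in ("trtllm", "flashinfer"):
        raise ValueError(f"unsupported attention backend: {attn_backend}")
    return {"model": model, "backend": "pytorch", "attn_backend": attn_backend}


def generate_with_pytorch(model: str, prompts: List[str]) -> List[str]:
    """Load a model with the PyTorch backend and return one completion per prompt."""
    # Imported lazily so the helper above works without TensorRT-LLM installed.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(**pytorch_backend_kwargs(model))  # no compilation step
    params = SamplingParams(max_tokens=32, temperature=0.8, top_p=0.95)
    return [result.outputs[0].text for result in llm.generate(prompts, params)]
```

On a machine with TensorRT-LLM and a supported GPU, calling generate_with_pytorch with a Hugging Face model name or local checkpoint path and a list of prompts should return the generated strings.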
Source Location: All PyTorch backend code is in tensorrt_llm/_torch/
Key files:
- _torch/pyexecutor/py_executor.py - Main executor
- _torch/pyexecutor/model_engine.py - Model execution
- _torch/attention_backend/ - Attention implementations
TensorRT Backend (Legacy)
The TensorRT backend uses compiled TensorRT engines for inference. This backend is considered legacy and is maintained primarily for backward compatibility.
Architecture
Key Characteristics
Advantages:
- Maximum theoretical performance through aggressive optimization
- Highly optimized kernel fusion
- Efficient memory usage
Limitations:
- Long build times (30+ minutes for large models)
- Hardware-specific engines (cannot transfer between GPU types)
- Limited flexibility (cannot modify model after compilation)
- Slower to adopt new features
- Difficult to debug
When to Use TensorRT Backend
Use the TensorRT backend only when:
- You have an existing deployment using TensorRT engines
- You need to maintain backward compatibility
- You have very specific performance requirements that the PyTorch backend doesn’t meet
Example Usage
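A minimal sketch of selecting the legacy backend, assuming the LLM(backend="tensorrt") entry point from the overview table. The function name is hypothetical. Constructing the LLM is where engine compilation would be expected to happen, so the first call can take a long time.

```python
from typing import List


def generate_with_tensorrt(model: str, prompts: List[str], max_tokens: int = 32) -> List[str]:
    """Run prompts through the legacy TensorRT backend (sketch, not verified on every release)."""
    # Imported lazily: requires a TensorRT-LLM install and a supported GPU.
    from tensorrt_llm import LLM, SamplingParams

    # Assumed entry point, taken from the overview table in this guide.
    llm = LLM(model=model, backend="tensorrt")
    params = SamplingParams(max_tokens=max_tokens)
    return [result.outputs[0].text for result in llm.generate(prompts, params)]
```

Because engines are hardware-specific, a deployment like this needs to build on the same GPU type it serves from.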
AutoDeploy Backend (Beta)
AutoDeploy is an experimental backend that automatically optimizes PyTorch/HuggingFace models for inference through automated graph transformations. It requires no manual model implementation.
Status: Beta - Under active development. The API may change in future releases.
Architecture
Key Features
Zero Code Changes
Works with unmodified PyTorch/HuggingFace models. No manual kernel implementation required.
Day-0 Model Support
Support new model architectures immediately. Great for prototyping and experimentation.
Automated Optimization
Automatic graph transformations:
- Sharding for multi-GPU
- KV cache integration
- Attention fusion
- Quantization
- CUDA Graph optimization
Single Source of Truth
Maintain your original PyTorch model. No need for separate inference implementations.
Workflow
Graph Transformation
Applies automated transformations:
- Graph sharding for tensor parallelism
- KV cache block insertion
- GEMM fusion
- Custom attention operator replacement
When to Use AutoDeploy Backend
Use the AutoDeploy backend when:
- You’re working with a new model architecture not yet supported in TensorRT-LLM
- You need rapid prototyping and experimentation
- You want to deploy a custom PyTorch model without manual optimization
- You’re evaluating whether to invest in a full TensorRT-LLM implementation
Example Usage
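A minimal sketch of trying AutoDeploy on an unmodified Hugging Face checkpoint, assuming the LLM(backend="_autodeploy") entry point from the overview table. The function name is hypothetical and, since this backend is in beta, the call may need to change in future releases.

```python
from typing import List


def generate_with_autodeploy(model: str, prompts: List[str], max_tokens: int = 32) -> List[str]:
    """Run prompts through the AutoDeploy backend (sketch under the assumptions above)."""
    # Imported lazily: requires a TensorRT-LLM install and a supported GPU.
    from tensorrt_llm import LLM, SamplingParams

    # Assumed entry point from the overview table; graph transforms run at load time.
    llm = LLM(model=model, backend="_autodeploy")
    params = SamplingParams(max_tokens=max_tokens)
    return [result.outputs[0].text for result in llm.generate(prompts, params)]
```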
Example: Custom Model
Source Location: AutoDeploy code is in tensorrt_llm/_torch/auto_deploy/
Roadmap:
- Vision-Language Models (VLMs)
- State Space Models (SSMs)
- LoRA support
- Speculative decoding
Choosing the Right Backend
Decision Flow
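The "when to use" criteria from the sections above can be condensed into a small function. This is one possible reading of those criteria, not an official decision procedure; the function and parameter names are hypothetical, and the order of the checks is an assumption.

```python
def choose_backend(has_existing_trt_engines: bool = False,
                   model_supported: bool = True) -> str:
    """Return the backend string suggested by this guide's criteria."""
    # Existing TensorRT engine deployments stay on the legacy backend
    # for backward compatibility.
    if has_existing_trt_engines:
        return "tensorrt"
    # Architectures without a TensorRT-LLM implementation can be
    # prototyped with AutoDeploy (beta).
    if not model_supported:
        return "_autodeploy"
    # Everything else: the default, recommended backend.
    return "pytorch"


print(choose_backend())                       # new project, supported model
print(choose_backend(model_supported=False))  # brand-new architecture
```

The returned string matches the value passed as backend in the entry points listed in the overview table.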
Performance Comparison
In most cases, the PyTorch backend provides performance within 5-10% of the TensorRT backend, without any compilation overhead. For many workloads, especially with CUDA Graphs enabled, the PyTorch backend matches or exceeds TensorRT backend performance.
Benchmark Example (Llama-2-7B on H100)
| Backend | Throughput (tokens/s) | Build Time | Flexibility |
|---|---|---|---|
| PyTorch | 12,500 | None | High |
| TensorRT | 13,000 | 45 min | Low |
| AutoDeploy | 10,000 | 5 min | Very High |
Actual performance depends on many factors: model architecture, batch size, sequence length, hardware, and configuration parameters. Always benchmark with your specific workload.
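To benchmark your own workload, a small timing harness is often enough to compare backends. The sketch below is backend-agnostic: it takes any callable that maps a list of prompts to a list of generated token-id sequences. The names measure_throughput and generate are hypothetical, and the harness measures wall-clock output tokens per second only.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(generate: Callable[[List[str]], Sequence[Sequence[int]]],
                       prompts: List[str],
                       warmup: int = 1,
                       runs: int = 3) -> float:
    """Return generated tokens per second averaged over `runs` timed calls."""
    # Warm-up calls are excluded so one-time costs (loading, graph capture)
    # do not distort the measurement.
    for _ in range(warmup):
        generate(prompts)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        outputs = generate(prompts)
        total_tokens += sum(len(token_ids) for token_ids in outputs)
    # Guard against a zero interval when the callable returns instantly.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_tokens / elapsed
```

With the LLM API, generate could wrap llm.generate and return the token ids of each result, assuming each completion exposes them. Run the same prompts, batch size, and sequence lengths against each backend you are comparing.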
Shared Features Across All Backends
All three backends benefit from the Shared C++ Core components:
- Scheduler: In-flight batching and request scheduling
- KV Cache Manager: Paged memory management with cross-request reuse
- Batch Manager: Dynamic batching optimization
- Decoder: Token generation orchestration
- Sampler: Sampling strategies (greedy, top-k, top-p, beam search)
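To illustrate the sampling strategies named above, here is a small pure-Python sketch of top-k and top-p (nucleus) filtering over a list of logits. It is an illustration of the general technique only and does not reflect how the C++ sampler is implemented; the function name is hypothetical.

```python
import math
import random
from typing import List, Optional


def sample_token(logits: List[float], top_k: int = 0, top_p: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token index after top-k and top-p filtering (top_k=1 is greedy)."""
    rng = rng or random.Random()
    # Numerically stable softmax.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    ranked = sorted(((i, w / total) for i, w in enumerate(weights)),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # random.choices renormalizes the surviving weights.
    indices = [index for index, _ in kept]
    probs = [prob for _, prob in kept]
    return rng.choices(indices, weights=probs, k=1)[0]
```

Setting top_k=1 reduces this to greedy decoding, since only the most likely token survives the filter.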
System Architecture
Learn about the overall system design
Optimization Techniques
Explore advanced performance optimizations