TensorRT-LLM supports a variety of multimodal models, enabling efficient inference with inputs beyond just text. These models combine specialized encoders for images, video, and audio with powerful LLM decoders.
Architecture Overview
Multimodal LLMs typically handle non-text inputs by combining a multimodal encoder with an LLM decoder:
Multimodal Input Processor
Preprocesses raw multimodal input (images, audio) into a format suitable for the encoder, such as pixel values or spectrograms.
Multimodal Encoder
Encodes the processed input into embeddings aligned with the LLM’s embedding space (e.g., vision transformers for images).
Integration with LLM Decoder
Fuses multimodal embeddings with text embeddings as input to the LLM decoder for downstream inference.
Image/Audio → Preprocessor → Encoder → Embeddings ──┐
                                                    ├→ LLM Decoder → Output
Text Prompt ────────────────────────────────────────┘
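The fusion step in the diagram can be pictured as replacing placeholder positions in the text embedding sequence with encoder outputs. The following is a minimal, library-free sketch of that idea. The names `fuse_embeddings` and `IMAGE_TOKEN_ID` and the toy embedding table are hypothetical and are not taken from TensorRT-LLM internals.

```python
from typing import Dict, List

IMAGE_TOKEN_ID = -1  # hypothetical placeholder id marking an image slot


def fuse_embeddings(
    token_ids: List[int],
    text_table: Dict[int, List[float]],
    image_embeddings: List[List[float]],
) -> List[List[float]]:
    """Replace each image placeholder with the next encoder embedding."""
    fused: List[List[float]] = []
    image_iter = iter(image_embeddings)
    for tok in token_ids:
        if tok == IMAGE_TOKEN_ID:
            try:
                fused.append(next(image_iter))
            except StopIteration:
                raise ValueError("more image placeholders than embeddings") from None
        else:
            fused.append(text_table[tok])
    if next(image_iter, None) is not None:
        raise ValueError("more embeddings than image placeholders")
    return fused


# Toy data: two text tokens surrounding two image slots.
text_table = {0: [0.0, 0.0], 1: [1.0, 1.0]}
image_embeddings = [[9.0, 9.0], [8.0, 8.0]]
fused = fuse_embeddings(
    [0, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 1], text_table, image_embeddings
)
print(fused)
```

In a real model the encoder typically emits many embeddings per image, so one image occupies a run of placeholder positions rather than a single slot.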
Supported Models
TensorRT-LLM supports a wide range of multimodal architectures:
Vision-Language Models
LLaVA (LLaMA + Vision)
VILA (Visual Language Assistant)
Qwen2-VL (Qwen with Vision)
NVILA (NVIDIA Vision-Language)
BLIP2 (Bootstrapped Language-Image Pre-training)
Nougat (Neural OCR for documents)
Audio Models
Whisper (Speech recognition)
Audio-language models (coming soon)
Optimizations
TensorRT-LLM incorporates key optimizations to enhance multimodal inference performance:
Batches multimodal requests within the GPU executor to improve GPU utilization and throughput. Context-phase (image encoding) and generation-phase requests are batched together.
Asynchronously overlaps data preprocessing on the CPU with image encoding on the GPU, reducing end-to-end latency.
Leverages image hashes and token chunk information to improve KV cache reuse and minimize collisions. Identical images across requests share cached encoder outputs.
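To illustrate the last point, here is a toy content-addressed cache in which identical image bytes map to the same key, so the encoder runs only once per distinct image. It is a sketch of the general idea only: `EncoderCache` and `fake_encode` are hypothetical names, and the choice of SHA-256 is an assumption, not a description of the hashing TensorRT-LLM uses.

```python
import hashlib
from typing import Callable, Dict, List


class EncoderCache:
    """Toy cache keyed by a SHA-256 digest of the raw image bytes."""

    def __init__(self, encode: Callable[[bytes], List[float]]) -> None:
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(image_bytes)
        return self._store[key]


def fake_encode(image_bytes: bytes) -> List[float]:
    # Stand-in for a real vision encoder.
    return [float(len(image_bytes))]


cache = EncoderCache(fake_encode)
cache.get(b"image-a")
cache.get(b"image-a")  # identical bytes: served from the cache
cache.get(b"image-b")
print(f"hits={cache.hits} misses={cache.misses}")
```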
Quick Start
Basic Usage
Run a vision-language model with a single image:
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

# Load image
image = Image.open("/path/to/image.jpg")

# Create multimodal prompt
prompt = TextPrompt(
    prompt="Describe this image in detail.",
    multi_modal_data={"image": [image]},
)

# Initialize model
llm = LLM(model="Efficient-Large-Model/NVILA-8B")

# Generate; a single prompt returns a single request output
output = llm.generate(prompt)
print(output.outputs[0].text)
Multiple Images
Process multiple images in a single prompt:
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")

prompt = TextPrompt(
    prompt="What are the differences between these two images?",
    multi_modal_data={"image": [image1, image2]},
)

# Reuses the llm instance created in Basic Usage
outputs = llm.generate(prompt)
KV Cache Reuse with UUIDs
For better cache management across sessions, provide custom UUIDs:
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")

prompt = TextPrompt(
    prompt="Describe these images.",
    multi_modal_data={"image": [image1, image2]},
    # One UUID per image, in the same order as the images
    multi_modal_uuids={"image": ["image-001", "image-002"]},
)

# Reuses the llm instance created in Basic Usage
outputs = llm.generate(prompt)
Why use UUIDs? Custom UUIDs enable deterministic cache management. The same UUID + content combination always produces the same cache key, allowing you to:
Track cache entries externally
Implement per-user cache isolation
Pre-warm cache with known images
Manage cache lifecycle across sessions
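One way to obtain UUIDs that are both stable and isolated per user is to combine a user identifier with a digest of the image bytes. The helper below is a hypothetical sketch of that approach. The name `make_image_uuid`, the `user:digest` layout, and the 16-character truncation are all assumptions made for illustration. The resulting strings would be passed in `multi_modal_uuids`.

```python
import hashlib


def make_image_uuid(user_id: str, image_bytes: bytes, digest_chars: int = 16) -> str:
    """Build a stable identifier from a user id and the image content."""
    digest = hashlib.sha256(image_bytes).hexdigest()[:digest_chars]
    return f"{user_id}:{digest}"


logo = b"fake image bytes for illustration"

alice_uuid = make_image_uuid("alice", logo)
bob_uuid = make_image_uuid("bob", logo)

# Same user and content give the same id; different users give different ids.
print(alice_uuid == make_image_uuid("alice", logo))
print(alice_uuid != bob_uuid)
```

Dropping the user prefix would let all users share cache entries for the same image, which trades isolation for a higher hit rate.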
Serving Multimodal Models
Start OpenAI-Compatible Server
Launch a server with multimodal support:
trtllm-serve Qwen/Qwen2-VL-7B-Instruct --backend pytorch
Send Requests with Images
import base64

import openai

# Encode image to base64
with open("/path/to/image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    # Data URL: no spaces between the prefix and the payload
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
Save this as request.json:
{
  "model": "Qwen/Qwen2-VL-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg"
          }
        }
      ]
    }
  ]
}
Send the request:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @request.json
See the multimodal client script for a complete example.
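Both request styles above share the same message shape, differing only in whether the image is a remote URL or an inline base64 data URL. The sketch below builds that payload with the standard library only. The helper names `image_part` and `user_message` are hypothetical, and guessing the MIME type from the file name is an assumption made for illustration.

```python
import base64
import json
import mimetypes
from typing import Any, Dict, List, Optional


def image_part(name_or_url: str, raw: Optional[bytes] = None) -> Dict[str, Any]:
    """Build an image_url content part from a remote URL or from raw bytes."""
    if raw is None:
        url = name_or_url  # remote URL, passed through unchanged
    else:
        mime = mimetypes.guess_type(name_or_url)[0] or "application/octet-stream"
        encoded = base64.b64encode(raw).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}


def user_message(text: str, images: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine a text part and image parts into one user message."""
    return {"role": "user", "content": [{"type": "text", "text": text}, *images]}


message = user_message(
    "Compare these images.",
    [
        image_part("https://example.com/image.jpg"),
        image_part("local.png", b"\x89PNG"),  # placeholder bytes, not a real image
    ],
)
print(json.dumps(message, indent=2))
```

The resulting dictionary can be passed as an element of `messages` to `client.chat.completions.create`, or serialized into the `messages` array of `request.json`.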
Benchmarking
Evaluate multimodal inference performance:
trtllm-bench \
    --model Qwen/Qwen2-VL-7B-Instruct \
    throughput \
    --dataset /path/to/multimodal_dataset.json \
    --num_requests 100
Configuration Options
Disable KV Cache Reuse
For testing or when cache reuse is not beneficial:
python quickstart_multimodal.py \
    --model Efficient-Large-Model/NVILA-8B \
    --modality image \
    --disable_kv_cache_reuse
Or in Python:
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,
)

llm = LLM(
    model="Efficient-Large-Model/NVILA-8B",
    kv_cache_config=kv_cache_config,
)
Multimodal-Specific Cache Settings
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,       # Enable cross-request reuse
    free_gpu_memory_fraction=0.9,  # Allocate 90% of free GPU memory
    dtype="fp8",                   # FP8 KV cache: half the size of an FP16 cache
)

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    kv_cache_config=kv_cache_config,
)
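The memory saving from an FP8 cache follows directly from the element size. As a rough worked example, the per-token KV cache footprint is 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The model dimensions and the 10 GiB budget below are illustrative assumptions, and the estimate ignores quantization scales and block allocation overhead.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_element: int
) -> int:
    """Bytes needed to cache one token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def tokens_that_fit(budget_bytes: int, bytes_per_token: int) -> int:
    """Whole tokens that fit in the given memory budget."""
    return budget_bytes // bytes_per_token


# Illustrative dimensions only; read the real values from your model's config.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 28, 4, 128
BUDGET = 10 * 1024**3  # assume 10 GiB is left for the KV cache

fp16_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)

print("FP16:", fp16_per_token, "bytes/token,", tokens_that_fit(BUDGET, fp16_per_token), "tokens")
print("FP8: ", fp8_per_token, "bytes/token,", tokens_that_fit(BUDGET, fp8_per_token), "tokens")
```

Because image inputs can expand into many tokens each, the extra capacity matters more for multimodal workloads than for short text prompts.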
Model-Specific Examples
LLaVA
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image = Image.open("/path/to/image.jpg")

prompt = TextPrompt(
    prompt="USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    multi_modal_data={"image": [image]},
)

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
outputs = llm.generate(prompt)
NVILA
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image = Image.open("/path/to/image.jpg")

prompt = TextPrompt(
    prompt="Describe this image in detail.",
    multi_modal_data={"image": [image]},
)

llm = LLM(model="Efficient-Large-Model/NVILA-8B")
outputs = llm.generate(prompt)
Qwen2-VL
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image = Image.open("/path/to/image.jpg")

prompt = TextPrompt(
    prompt="<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n",
    multi_modal_data={"image": [image]},
)

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")
outputs = llm.generate(prompt)
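Because each model family expects its own prompt layout, it can help to keep the templates in small helper functions. The sketch below mirrors the LLaVA and Qwen2-VL templates used in the examples above. The function names are hypothetical, and repeating the image placeholder once per image for multi-image prompts is an assumption that should be checked against each model's documentation.

```python
def llava_prompt(question: str, num_images: int = 1) -> str:
    """LLaVA-1.5 style prompt with one <image> placeholder per image."""
    placeholders = "\n".join(["<image>"] * num_images)
    return f"USER: {placeholders}\n{question}\nASSISTANT:"


def qwen2_vl_prompt(question: str, num_images: int = 1) -> str:
    """Qwen2-VL chat prompt with one vision block per image."""
    vision = "<|vision_start|><|image_pad|><|vision_end|>" * num_images
    return (
        f"<|im_start|>user\n{vision}{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(repr(llava_prompt("What is shown in this image?")))
print(repr(qwen2_vl_prompt("Describe this image.")))
```

Either string can be passed as the `prompt` field of a `TextPrompt`, with the matching number of images in `multi_modal_data`.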
Best Practices
Resize images to the model's expected resolution before inference
Use appropriate image format (JPEG, PNG) based on content
Normalize pixel values according to model requirements
Batch multiple images when possible for better throughput
Set enable_block_reuse=True for scenarios with repeated images
Use custom multi_modal_uuids for deterministic cache keys
Allocate sufficient GPU memory for KV cache (90%+ of free memory)
Consider an FP8 KV cache, which needs half the memory of an FP16 cache
Follow model-specific prompt templates (LLaVA uses USER:/ASSISTANT:, Qwen uses special tokens)
Place image tokens where the model expects them
Be explicit about what you want the model to analyze
For multiple images, clearly reference which image you’re asking about
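For the resizing recommendation, the target dimensions can be computed before touching any pixels. The helper below is a minimal sketch that shrinks an image so its longer side does not exceed a limit while preserving the aspect ratio. The name `fit_within` and the 1024-pixel limit are hypothetical; use the resolution your model actually expects.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4032, 3024, 1024))  # large photo is scaled down
print(fit_within(640, 480, 1024))    # small image is left unchanged
```

The returned tuple can be passed to Pillow's `Image.resize` on the loaded image before building the prompt.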
Limitations
Vision components use FP16 by default (cannot be quantized independently)
Some models have specific image resolution requirements
Multi-image support varies by model architecture
Video inputs are supported only for specific models (check support matrix)
Complete Example
The following example combines the practices above: block reuse, an FP8 KV cache, stable UUIDs, and explicit sampling parameters.
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from tensorrt_llm.llmapi import KvCacheConfig
from tensorrt_llm.sampling_params import SamplingParams
from PIL import Image

# Load and prepare images
image1 = Image.open("/path/to/product1.jpg")
image2 = Image.open("/path/to/product2.jpg")

# Stable UUIDs taken from external IDs (a hash of the image content also works)
image1_uuid = "product-image-12345"
image2_uuid = "product-image-67890"

# Configure KV cache with reuse
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.9,
    dtype="fp8",
    host_cache_size=2 * 1024**3,  # 2 GiB host cache for overflow
)

# Initialize model
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    kv_cache_config=kv_cache_config,
)

# Create prompts with UUIDs for cache management
prompts = [
    TextPrompt(
        prompt="Describe the product in this image.",
        multi_modal_data={"image": [image1]},
        multi_modal_uuids={"image": [image1_uuid]},
    ),
    TextPrompt(
        prompt="Describe the product in this image.",
        multi_modal_data={"image": [image2]},
        multi_modal_uuids={"image": [image2_uuid]},
    ),
    TextPrompt(
        prompt="What are the differences between these products?",
        multi_modal_data={"image": [image1, image2]},
        multi_modal_uuids={"image": [image1_uuid, image2_uuid]},
    ),
]

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=200,
    temperature=0.7,
)

# Generate (third prompt reuses cached encodings from first two)
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
    print("-" * 80)
Additional Resources
Multimodal Examples: Complete quickstart example for multimodal models
Supported Models: Full multimodal model support matrix
Serving Script: Example serving client for multimodal requests
Benchmarking Guide: Measure multimodal inference performance