Overview
Vertex AI provides multiple options for serving open-source models with optimized inference performance. Choose the right serving solution based on your latency, throughput, and cost requirements.
Serving Options
vLLM: high-throughput serving with PagedAttention
Best for: High-volume production workloads
Features: Continuous batching, KV cache optimization
Throughput: Up to 24x higher than Hugging Face Transformers, per the vLLM project's benchmarks
Models: Most LLMs (Llama, Gemma, Mistral, etc.)
Text Generation Inference (TGI): Hugging Face's production-ready inference server
Best for: Hugging Face Hub models
Features: Flash Attention, tensor parallelism
Latency: Optimized for low-latency inference
Models: All HF-compatible transformers
Ollama: lightweight local model serving
Best for: Development and small-scale deployments
Features: Easy setup, local execution
Deployment: Cloud Run, GKE, or local
Models: Curated model library
Custom PyTorch handlers: PyTorch inference with custom logic
Best for: Specialized preprocessing/postprocessing
Features: Full control over inference pipeline
Flexibility: Support for any PyTorch model
Use cases: Vision models, multimodal, custom architectures
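The selection criteria above can be written down as a small decision helper. This is only a sketch: the function name, its flags, and the priority order between them are assumptions made for illustration, not part of any Vertex AI API.

```python
def choose_serving_option(high_volume=False, hf_hub_model=False,
                          custom_pipeline=False, small_scale=False):
    """Map workload traits to one of the serving options described above.

    The priority order (custom logic first, then throughput, then model
    source, then scale) is an assumption for illustration.
    """
    if custom_pipeline:
        return "custom-pytorch-handler"
    if high_volume:
        return "vllm"
    if hf_hub_model:
        return "tgi"
    if small_scale:
        return "ollama"
    # Fall back to vLLM, which this guide recommends for LLM serving.
    return "vllm"


print(choose_serving_option(hf_hub_model=True))
```

A real decision would also weigh model size, accelerator availability, and budget, which this helper ignores.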
vLLM Deployment
vLLM is the recommended option for high-performance LLM serving.
Basic vLLM Deployment
Install Dependencies
pip install --upgrade google-cloud-aiplatform huggingface_hub
Initialize Vertex AI
import vertexai
from vertexai import model_garden
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)
Deploy with vLLM
# Models deployed through Model Garden SDK automatically use vLLM
model = model_garden.OpenModel("meta/llama3_1@llama-3.1-8b-instruct")
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    accept_eula=True,
)
Test Inference
response = endpoint.predict(
    instances=[{
        "prompt": "Explain machine learning",
        "max_tokens": 200,
        "temperature": 0.7,
    }]
)
print(response.predictions[0])
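When several prompts share the same sampling settings, the request payload can be assembled by a small helper. The sketch below reuses the instance fields from the example above (`prompt`, `max_tokens`, `temperature`). The helper name `build_instances` and its validation rules are hypothetical.

```python
def build_instances(prompts, max_tokens=200, temperature=0.7):
    """Build a list of prediction instances in the shape used above."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return [
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
        for prompt in prompts
    ]


instances = build_instances(["Explain machine learning", "Explain overfitting"])
print(len(instances))
```

The resulting list can be passed as `instances=` to `endpoint.predict`.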
vLLM with Multiple LoRA Adapters
Serve one base model with multiple task-specific adapters:
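The workflow has four steps: download the adapters, build a custom container, write an entrypoint script, and deploy. The adapter download step is shown below.
Download Adapters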
from huggingface_hub import snapshot_download
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# Download LoRA adapters
sql_adapter_path = snapshot_download(
    repo_id="google-cloud-partnership/gemma-2-2b-it-lora-sql",
    local_dir="./adapters/sql",
)
code_adapter_path = snapshot_download(
    repo_id="google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
    local_dir="./adapters/code",
)
# Upload to GCS
BUCKET_URI = "gs://your-bucket"
! gcloud storage cp -r ./adapters/* {BUCKET_URI}/lora-adapters/
Using Multiple Adapters
import openai
# Assumes auth_token holds a valid access token and that the endpoint
# exposes an OpenAI-compatible /v1 route
client = openai.OpenAI(
    base_url=f"https://{endpoint.resource_name}/v1",
    api_key=auth_token,
)
# Use SQL adapter
sql_response = client.chat.completions.create(
    model="sql",  # Specify adapter name
    messages=[{
        "role": "user",
        "content": "Write a SQL query to find top 10 customers by revenue",
    }],
)
# Use code adapter
code_response = client.chat.completions.create(
    model="code",  # Different adapter
    messages=[{
        "role": "user",
        "content": "Write a Python function to merge two sorted arrays",
    }],
)
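Because the adapter is selected through the `model` field, routing a request to the right adapter is a dictionary lookup. The sketch below is an illustration, not an SDK feature: the task-to-adapter mapping and the `build_chat_request` helper are hypothetical. The adapter names `sql` and `code` are the ones used above.

```python
# Hypothetical mapping from a task label to the adapter name served by vLLM.
ADAPTERS = {"sql": "sql", "code": "code"}


def build_chat_request(task, prompt, default_model=None):
    """Return keyword arguments for client.chat.completions.create()."""
    model = ADAPTERS.get(task, default_model)
    if model is None:
        raise ValueError(f"No adapter registered for task {task!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_chat_request("sql", "List the ten most recent orders")
print(request["model"])
```

The returned dictionary can be unpacked into the client call, for example `client.chat.completions.create(**request)`.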
Text Generation Inference (TGI)
Deploy Hugging Face models with TGI for optimized performance.
TGI Deployment
Authenticate with Hugging Face
from huggingface_hub import interpreter_login, get_token
# Login to Hugging Face
interpreter_login()
# Get token
hf_token = get_token()
Create Model Registry Entry
from google.cloud import aiplatform
# Upload model with TGI container
model = aiplatform.Model.upload(
    display_name="gemma-tgi",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-hf-tgi-serve:20240220_0936_RC01",
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b-it",
        "HUGGING_FACE_HUB_TOKEN": hf_token,
        "DEPLOY_SOURCE": "notebook",
    },
    serving_container_ports=[7080],
)
Deploy to Endpoint
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    traffic_split={"0": 100},
    deploy_request_timeout=1800,
)
Make Predictions
prediction = endpoint.predict(
    instances=[{
        "inputs": "Explain quantum computing",
        "parameters": {
            "max_new_tokens": 200,
            "temperature": 0.7,
            "top_p": 0.9,
        },
    }]
)
print(prediction.predictions[0])
TGI with Multiple LoRA Adapters
# Environment variables for TGI with LoRA
env_vars = {
    "MODEL_ID": "google/gemma-2-9b-it",
    "HUGGING_FACE_HUB_TOKEN": hf_token,
    "NUM_SHARD": "1",
    "MAX_INPUT_LENGTH": "4096",
    "MAX_TOTAL_TOKENS": "8192",
    "LORA_ADAPTERS": "sql,code",  # Comma-separated adapter IDs
    "LORA_ADAPTER_sql": "google-cloud-partnership/gemma-2-9b-it-lora-sql",
    "LORA_ADAPTER_code": "google-cloud-partnership/gemma-2-9b-it-lora-magicoder",
}
# TGI_IMAGE_URI must be set to the TGI serving container image URI,
# for example the image used in the TGI Deployment section
model = aiplatform.Model.upload(
    display_name="gemma-tgi-multi-lora",
    serving_container_image_uri=TGI_IMAGE_URI,
    serving_container_environment_variables=env_vars,
    serving_container_ports=[7080],
)
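The adapter variables above follow a regular pattern, so they can be generated from a mapping of adapter name to repository. The helper below is a sketch that reproduces the naming scheme shown in this section (`LORA_ADAPTERS` plus one `LORA_ADAPTER_<name>` entry per adapter). The function name is hypothetical, and the sketch does not verify that a given container image honors these variables.

```python
def build_tgi_lora_env(model_id, adapters, hf_token=None):
    """Build TGI environment variables for a base model plus LoRA adapters.

    `adapters` maps an adapter name to its Hugging Face repository ID.
    """
    for name in adapters:
        if "," in name or not name:
            raise ValueError(f"Invalid adapter name: {name!r}")
    env = {"MODEL_ID": model_id}
    if hf_token:
        env["HUGGING_FACE_HUB_TOKEN"] = hf_token
    # Adapter IDs are listed once, comma-separated, in insertion order.
    env["LORA_ADAPTERS"] = ",".join(adapters)
    for name, repo_id in adapters.items():
        env[f"LORA_ADAPTER_{name}"] = repo_id
    return env


env_vars = build_tgi_lora_env(
    "google/gemma-2-9b-it",
    {
        "sql": "google-cloud-partnership/gemma-2-9b-it-lora-sql",
        "code": "google-cloud-partnership/gemma-2-9b-it-lora-magicoder",
    },
)
print(env_vars["LORA_ADAPTERS"])
```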
Ollama on Cloud Run
Deploy models with Ollama for lightweight serving:
The setup consists of a Dockerfile, a Modelfile, a Cloud Run deployment, and an endpoint test. The Dockerfile is shown below.
Dockerfile
FROM ollama/ollama:latest
# Copy model
COPY Modelfile /Modelfile
# Pull and create model
RUN ollama serve & \
sleep 5 && \
ollama pull gemma2:2b && \
ollama create mymodel -f /Modelfile
EXPOSE 11434
CMD [ "serve" ]
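Once the container is running, the model can be queried over Ollama's REST API. The sketch below posts to the `/api/generate` route with streaming disabled, using only the standard library. The base URL is a placeholder, `mymodel` is the model name created in the Dockerfile above, and the sketch assumes the service accepts unauthenticated requests. A Cloud Run service that requires authentication would also need an identity token in an `Authorization` header.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt):
    """Build a POST request for Ollama's /api/generate route."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model, prompt, timeout=60):
    """Send the request and return the generated text."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example (requires a running server):
# print(generate("http://localhost:11434", "mymodel", "Why is the sky blue?"))
```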
Custom PyTorch Handlers
Deploy models with custom preprocessing/postprocessing:
The handler is defined in handler.py and then deployed as a custom handler. The contents of handler.py are shown below.
handler.py
from ts.torch_handler.base_handler import BaseHandler
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class CustomLLMHandler(BaseHandler):
    def initialize(self, context):
        self.manifest = context.manifest
        properties = context.system_properties
        model_id = properties.get("model_id", "google/gemma-2b")
        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        self.initialized = True

    def preprocess(self, data):
        """Custom preprocessing"""
        prompts = [item.get("data") or item.get("body") for item in data]
        # Request bodies may arrive as bytes; decode them before tokenizing
        prompts = [
            p.decode("utf-8") if isinstance(p, (bytes, bytearray)) else p
            for p in prompts
        ]
        # Tokenize the batch of prompts
        inputs = self.tokenizer(
            prompts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=2048,
        ).to(self.model.device)
        return inputs

    def inference(self, inputs):
        """Model inference"""
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
            )
        return outputs

    def postprocess(self, outputs):
        """Custom postprocessing"""
        responses = self.tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True,
        )
        return responses
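The preprocessing step has to cope with request bodies arriving as bytes, plain strings, or already-parsed JSON. A standalone helper makes that logic easy to test outside TorchServe. This is a sketch: the helper name and the `prompt` key inside JSON bodies are assumptions, not something the handler above defines.

```python
import json


def extract_prompts(data, key="prompt"):
    """Normalize a TorchServe-style request batch into a list of strings.

    Each item is a dict whose payload sits under "data" or "body".
    """
    prompts = []
    for item in data:
        payload = item.get("data") or item.get("body")
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        if isinstance(payload, str):
            # Try JSON first; fall back to treating the text as the prompt.
            try:
                payload = json.loads(payload)
            except json.JSONDecodeError:
                pass
        if isinstance(payload, dict):
            payload = payload[key]
        prompts.append(str(payload))
    return prompts


print(extract_prompts([{"body": b'{"prompt": "Explain LoRA"}'}]))
```

Inside `preprocess`, the returned list could be passed directly to the tokenizer.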
Batching Strategies
vLLM automatically batches requests:
# No configuration needed - vLLM handles batching
# Achieves up to 24x higher throughput than Hugging Face Transformers, per the vLLM project's benchmarks
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
Configure batch size for TGI:
env_vars = {
    "MAX_BATCH_SIZE": "32",
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "MAX_WAITING_TOKENS": "20",
}
Use Triton Inference Server (max_batch_size is a top-level model setting, not part of dynamic_batching):
{
    "max_batch_size": 32,
    "dynamic_batching": {
        "max_queue_delay_microseconds": 100,
        "preferred_batch_size": [8, 16]
    }
}
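To see how a maximum batch size and a queue delay interact, the following sketch groups request arrival times into batches. It is a deliberately simplified model written for illustration: it is not Triton's scheduler and does not reproduce vLLM's continuous batching, and the function name is hypothetical.

```python
def form_batches(arrival_times_us, max_batch_size, max_queue_delay_us):
    """Group ascending arrival times (microseconds) into batches.

    A batch is dispatched when it is full, or when the next request
    arrives after the oldest queued request has already waited longer
    than max_queue_delay_us.
    """
    batches, current = [], []
    for arrival in arrival_times_us:
        if current and arrival - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(arrival)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Three requests arrive close together, two more arrive much later.
print(form_batches([0, 10, 20, 500, 510], max_batch_size=32, max_queue_delay_us=100))
```

A longer queue delay produces larger batches and higher throughput, and it adds waiting time to the first request in each batch.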
Memory Optimization
Options include quantization (vLLM), KV cache optimization, and tensor parallelism. Quantization is shown below.
Quantization (vLLM)
# vLLM supports automatic quantization
endpoint = model.deploy(
    serving_container_environment_variables={
        "QUANTIZATION": "awq",  # or "gptq", "squeezellm"
        "DTYPE": "float16",
    }
)
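A quick estimate shows why quantization matters: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes AWQ and GPTQ checkpoints use 4-bit weights (a common configuration, not a guarantee). It ignores quantization metadata, activations, and the KV cache, so real usage will be higher.

```python
# Bits per parameter for each format. The 4-bit values for AWQ and GPTQ
# are an assumption; both methods also support other bit widths.
BITS_PER_PARAM = {
    "float32": 32,
    "float16": 16,
    "bfloat16": 16,
    "int8": 8,
    "awq": 4,
    "gptq": 4,
}


def weight_memory_gb(num_params, fmt):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


for fmt in ("float16", "int8", "awq"):
    print(fmt, weight_memory_gb(7e9, fmt))
```

For a 7-billion-parameter model this gives about 14 GB in float16 and about 3.5 GB with 4-bit weights, which is the difference between needing most of an NVIDIA L4's 24 GB and leaving ample room for the KV cache.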
Autoscaling Configuration
from google.cloud import aiplatform
# Deploy with autoscaling
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=10,
    # Scale based on CPU utilization
    autoscaling_target_cpu_utilization=70,
    # Or scale based on accelerator duty cycle
    autoscaling_target_accelerator_duty_cycle=80,
)
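Target-based autoscaling can be reasoned about with a proportional rule: scale the replica count by the ratio of observed to target utilization, then clamp to the configured bounds. The sketch below uses that rule as a mental model only. It is the formula common to many autoscalers, and it is an assumption that Vertex AI behaves similarly, not a description of its actual algorithm.

```python
import math


def desired_replicas(current_replicas, observed_utilization,
                     target_utilization, min_replicas, max_replicas):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * observed_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# Two replicas running at 90% against a 70% target suggests a third replica.
print(desired_replicas(2, 90, 70, min_replicas=1, max_replicas=10))
```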
Monitoring and Observability
Cloud Monitoring Integration
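import time
from google.cloud import monitoring_v3
# Create a Cloud Monitoring client
client = monitoring_v3.MetricServiceClient()
# The API requires a time interval; query the last hour
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)
# Query endpoint metrics (the filter must name a metric type; this one is an example)
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'resource.type="aiplatform.googleapis.com/Endpoint" AND '
            'metric.type="aiplatform.googleapis.com/prediction/online/prediction_count"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for result in results:
    print(f"Metric: {result.metric.type}")
    print(f"Value: {result.points[0].value.int64_value}")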
Logging
import logging
import time
from google.cloud import logging as cloud_logging
# Set up Cloud Logging
client = cloud_logging.Client()
client.setup_logging()
logger = logging.getLogger(__name__)
# Log predictions with latency measured on the client side
start = time.perf_counter()
response = endpoint.predict(instances=[{"prompt": "test"}])
latency_ms = (time.perf_counter() - start) * 1000
logger.info(f"Prediction latency: {latency_ms:.1f} ms")
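Individual latency log lines are easier to act on once they are summarized into percentiles. The sketch below computes nearest-rank percentiles over a list of latencies in milliseconds. The function names are hypothetical, and the latencies are assumed to have been collected by the caller.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for 0 < pct <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize_latencies(latencies_ms):
    """Return the median, 95th percentile, and maximum latency."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "max": max(latencies_ms),
    }


print(summarize_latencies([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]))
```

Tail percentiles such as p95 usually reveal queueing and cold-start effects that the median hides.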
Best Practices
Choose the Right Serving Engine: Use vLLM for high throughput, TGI for HF models, custom handlers for specialized needs
Enable Autoscaling: Configure min/max replicas to handle traffic spikes efficiently
Optimize GPU Usage: Use tensor parallelism for large models, quantization for memory constraints
Monitor Performance: Track latency, throughput, and GPU utilization metrics
Use LoRA for Multi-Task: Serve multiple specialized models with shared base weights
Test Before Production: Load test endpoints to validate performance under expected traffic
Cost Optimization
Right-Size Compute
Start with smaller machine types and scale up based on metrics
Use Spot VMs
Enable spot VMs for up to 80% cost savings on fault-tolerant workloads
Scale to Zero
Use Cloud Run for infrequent workloads that can scale to zero
Batch Requests
Send multiple predictions in a single request to reduce overhead
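Batching on the client side amounts to splitting the pending instances into fixed-size groups and sending one request per group. The helper below is a minimal sketch of that idea. The name `chunked` and the batch size are illustrative, and the right size depends on the endpoint's request limits.

```python
def chunked(instances, batch_size):
    """Yield consecutive slices of `instances` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


pending = [{"prompt": f"Question {i}"} for i in range(5)]
for batch in chunked(pending, 2):
    # In a real client this would be: endpoint.predict(instances=batch)
    print(len(batch))
```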
Next Steps
Model Garden: Explore models available for deployment
Fine-Tuning: Customize models before deployment
Example Notebooks: View serving examples on GitHub
Performance Guide: Learn more about optimization techniques