Model Serving and Inference Optimization

Overview

Vertex AI provides multiple options for serving open-source models with optimized inference performance. Choose the right serving solution based on your latency, throughput, and cost requirements.

Serving Options

Four serving options are covered: vLLM, Text Generation Inference (TGI), Ollama, and custom handlers.

vLLM provides high-throughput serving with PagedAttention:

Best for: High-volume production workloads
Features: Continuous batching, KV cache optimization
Throughput: Up to 24x higher than Hugging Face Transformers, according to the vLLM project's published benchmarks
Models: Most LLMs (Llama, Gemma, Mistral, etc.)
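
To make the KV cache optimization concrete, here is a toy sketch of the idea behind PagedAttention: the cache is split into fixed-size blocks, and each sequence holds a table of block IDs instead of one contiguous region. The class name, method names, and block size are illustrative assumptions, not vLLM's actual implementation.

```python
import math


class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_tokens(self, seq_id: str, num_tokens: int) -> list[int]:
        """Reserve space for new tokens, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + num_tokens
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV cache blocks")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = new_len
        return table

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
cache.append_tokens("a", 20)  # 20 tokens need 2 blocks
cache.append_tokens("b", 5)   # 5 tokens need 1 block
cache.append_tokens("a", 10)  # 30 tokens still fit in the same 2 blocks
print(len(cache.block_tables["a"]), len(cache.free_blocks))
```

Because blocks are handed out on demand and returned as soon as a sequence finishes, memory is not reserved up front for the maximum sequence length, which is what leaves room for larger batches.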

vLLM Deployment

vLLM is the recommended option for high-performance LLM serving.

Basic vLLM Deployment

Install Dependencies

pip install --upgrade google-cloud-aiplatform huggingface_hub

Initialize Vertex AI

import vertexai
from vertexai import model_garden

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

vertexai.init(project=PROJECT_ID, location=LOCATION)

Deploy with vLLM

# Models deployed through Model Garden SDK automatically use vLLM
model = model_garden.OpenModel("meta/llama3_1@llama-3.1-8b-instruct")

endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    accept_eula=True
)

Test Inference

response = endpoint.predict(
    instances=[{
        "prompt": "Explain machine learning",
        "max_tokens": 200,
        "temperature": 0.7
    }]
)

print(response.predictions[0])

vLLM with Multiple LoRA Adapters

Serve one base model with multiple task-specific adapters:

from huggingface_hub import snapshot_download
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Download LoRA adapters
sql_adapter_path = snapshot_download(
    repo_id="google-cloud-partnership/gemma-2-2b-it-lora-sql",
    local_dir="./adapters/sql"
)

code_adapter_path = snapshot_download(
    repo_id="google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
    local_dir="./adapters/code"
)

# Upload to GCS (the "!" prefix runs a shell command from a notebook cell)
BUCKET_URI = "gs://your-bucket"
!gcloud storage cp -r ./adapters/* {BUCKET_URI}/lora-adapters/

Using Multiple Adapters

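import google.auth
import google.auth.transport.requests
import openai

# Obtain a short-lived access token from Application Default Credentials
creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(google.auth.transport.requests.Request())
auth_token = creds.token

# Assumed URL shape for the endpoint's OpenAI-compatible API; verify it for your deployment
client = openai.OpenAI(
    base_url=f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/{endpoint.resource_name}",
    api_key=auth_token
)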

# Use SQL adapter
sql_response = client.chat.completions.create(
    model="sql",  # Specify adapter name
    messages=[{
        "role": "user",
        "content": "Write a SQL query to find top 10 customers by revenue"
    }]
)

# Use code adapter
code_response = client.chat.completions.create(
    model="code",  # Different adapter
    messages=[{
        "role": "user",
        "content": "Write a Python function to merge two sorted arrays"
    }]
)

Text Generation Inference (TGI)

Deploy Hugging Face models with TGI for optimized performance.

TGI Deployment

Authenticate with Hugging Face

from huggingface_hub import interpreter_login, get_token

# Login to Hugging Face
interpreter_login()

# Get token
hf_token = get_token()

Create Model Registry Entry

from google.cloud import aiplatform

# Upload model with TGI container
model = aiplatform.Model.upload(
    display_name="gemma-tgi",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-hf-tgi-serve:20240220_0936_RC01",
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b-it",
        "HUGGING_FACE_HUB_TOKEN": hf_token,
        "DEPLOY_SOURCE": "notebook"
    },
    serving_container_ports=[7080]
)

Deploy to Endpoint

endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    traffic_split={"0": 100},
    deploy_request_timeout=1800
)

Make Predictions

prediction = endpoint.predict(
    instances=[{
        "inputs": "Explain quantum computing",
        "parameters": {
            "max_new_tokens": 200,
            "temperature": 0.7,
            "top_p": 0.9
        }
    }]
)

print(prediction.predictions[0])

TGI with Multiple LoRA Adapters

# Environment variables for TGI with LoRA
env_vars = {
    "MODEL_ID": "google/gemma-2-9b-it",
    "HUGGING_FACE_HUB_TOKEN": hf_token,
    "NUM_SHARD": "1",
    "MAX_INPUT_LENGTH": "4096",
    "MAX_TOTAL_TOKENS": "8192",
    "LORA_ADAPTERS": "sql,code",  # Comma-separated adapter IDs
    "LORA_ADAPTER_sql": "google-cloud-partnership/gemma-2-9b-it-lora-sql",
    "LORA_ADAPTER_code": "google-cloud-partnership/gemma-2-9b-it-lora-magicoder"
}

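# TGI_IMAGE_URI is not defined in this guide; set it to a TGI serving image that supports multi-LoRA
model = aiplatform.Model.upload(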
    display_name="gemma-tgi-multi-lora",
    serving_container_image_uri=TGI_IMAGE_URI,
    serving_container_environment_variables=env_vars,
    serving_container_ports=[7080]
)
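
Before uploading, it can help to sanity-check the environment variables. The sketch below is a hypothetical helper (`check_tgi_env` is not part of any SDK) that checks two things: every value is a string, as container environment variables are, and `MAX_INPUT_LENGTH` is smaller than `MAX_TOTAL_TOKENS`, so that some token budget remains for generation.

```python
def check_tgi_env(env: dict) -> list[str]:
    """Return a list of human-readable problems found in a TGI env config."""
    problems = []

    # Container environment variables are passed as strings
    for key, value in env.items():
        if not isinstance(value, str):
            problems.append(f"{key} should be a string, got {type(value).__name__}")

    # The prompt limit must leave room for generated tokens
    if "MAX_INPUT_LENGTH" in env and "MAX_TOTAL_TOKENS" in env:
        max_input = int(env["MAX_INPUT_LENGTH"])
        max_total = int(env["MAX_TOTAL_TOKENS"])
        if max_input >= max_total:
            problems.append(
                "MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS "
                f"(got {max_input} and {max_total})"
            )

    return problems


example_env = {"NUM_SHARD": "1", "MAX_INPUT_LENGTH": "4096", "MAX_TOTAL_TOKENS": "8192"}
print(check_tgi_env(example_env))
```

With the values used above, up to 8192 - 4096 = 4096 tokens remain available for the generated output.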

Ollama on Cloud Run

Deploy models with Ollama for lightweight serving:

FROM ollama/ollama:latest

# Copy model
COPY Modelfile /Modelfile

# Pull and create model
RUN ollama serve & \
    sleep 5 && \
    ollama pull gemma2:2b && \
    ollama create mymodel -f /Modelfile

EXPOSE 11434

CMD ["serve"]

Custom PyTorch Handlers

Deploy models with custom preprocessing/postprocessing:

from ts.torch_handler.base_handler import BaseHandler
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class CustomLLMHandler(BaseHandler):
    def initialize(self, context):
        self.manifest = context.manifest
        properties = context.system_properties
        
        model_id = properties.get("model_id", "google/gemma-2b")
        
        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        
        self.initialized = True
    
    def preprocess(self, data):
        """Custom preprocessing"""
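        prompts = [item.get("data") or item.get("body") for item in data]
        # TorchServe may deliver payloads as bytes; decode them to text
        prompts = [p.decode("utf-8") if isinstance(p, (bytes, bytearray)) else p for p in prompts]
        
        # Tokenize the batch of prompts (no chat template is applied here)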
        inputs = self.tokenizer(
            prompts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=2048
        ).to(self.model.device)
        
        return inputs
    
    def inference(self, inputs):
        """Model inference"""
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.7,
                top_p=0.9,
                do_sample=True
            )
        return outputs
    
    def postprocess(self, outputs):
        """Custom postprocessing"""
        responses = self.tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True
        )
        return responses
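
One detail worth handling in postprocessing: for decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so decoding the full output repeats the prompt in the response. A minimal sketch of trimming it, shown on plain lists so it runs without a model; `strip_prompt_tokens` is a hypothetical helper, and with tensors the same idea is `outputs[:, prompt_length:]`.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Drop the first prompt_length token IDs from every sequence in a batch.

    prompt_length is the padded width of the tokenized input batch,
    i.e. inputs["input_ids"].shape[1] in the handler above.
    """
    return [row[prompt_length:] for row in output_ids]


# Two sequences whose first three IDs are the (padded) prompt
batch = [
    [101, 102, 103, 7, 8, 9],
    [0, 201, 202, 5, 6, 4],
]
print(strip_prompt_tokens(batch, prompt_length=3))
```

This assumes every prompt in the batch occupies the same padded width, which holds when the tokenizer is called with `padding=True`.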

Performance Optimization

Batching Strategies

Three batching strategies are covered here: continuous batching (vLLM), static batching, and dynamic batching.

vLLM batches requests automatically through continuous batching:

# No configuration needed - vLLM handles batching
# Achieves up to 24x higher throughput
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1
)

Configure batch size for TGI:

env_vars = {
    "MAX_BATCH_SIZE": "32",
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "MAX_WAITING_TOKENS": "20"
}

Use Triton Inference Server's dynamic batcher, configured in the model's config.pbtxt:

max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [8, 16]
  max_queue_delay_microseconds: 100
}
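
To illustrate how a queue delay and a maximum batch size interact, here is a simplified simulation of dynamic batching. It is a sketch under stated assumptions, not Triton's scheduler: it ignores preferred batch sizes and simply flushes a batch when it is full or when the oldest queued request has waited longer than the allowed delay. The function name and parameters are hypothetical.

```python
def form_batches(arrivals_us, max_batch_size=32, max_queue_delay_us=100):
    """Group sorted request arrival times (in microseconds) into batches."""
    batches = []
    current = []
    for t in arrivals_us:
        # Flush if the oldest queued request has exceeded the delay budget
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # Flush as soon as the batch is full
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 10, 20, 250, 260, 900]
print([len(b) for b in form_batches(arrivals, max_batch_size=4)])
```

A longer delay yields larger batches and better accelerator utilization at the cost of added latency for the first request in each batch.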

Memory Optimization

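# vLLM can load quantized checkpoints; the model weights must already be
# quantized in the chosen format. Confirm that your serving image reads
# these variable names.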
endpoint = model.deploy(
    serving_container_environment_variables={
        "QUANTIZATION": "awq",  # or "gptq", "squeezellm"
        "DTYPE": "float16"
    }
)
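
A rough way to judge whether quantization is needed is to estimate the memory taken by the weights alone: parameter count times bytes per parameter. The sketch below covers weights only and ignores the KV cache, activations, and the small overhead of quantization scales; the helper name and the 8-billion-parameter example are illustrative.

```python
# Approximate storage cost per parameter for common formats
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,  # roughly what 4-bit AWQ or GPTQ weights occupy
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float16", "int8", "int4"):
    print(dtype, round(weight_memory_gb(8e9, dtype), 1))
```

For an 8B-parameter model this gives about 16 GB in float16 and about 4 GB at 4 bits, which is the difference between nearly filling a 24 GB NVIDIA L4 and leaving most of it free for the KV cache.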

Autoscaling Configuration

from google.cloud import aiplatform

# Deploy with autoscaling
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=10,
    # Scale based on CPU utilization
    autoscaling_target_cpu_utilization=70,
    # Or scale based on accelerator utilization
    autoscaling_target_accelerator_utilization=80
)
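
Target-based autoscaling can be reasoned about with a simple formula: scale the current replica count by the ratio of observed to target utilization, round up, and clamp to the configured bounds. The sketch below assumes that formula; the production autoscaler may differ in details such as averaging windows and cooldowns, and the function name is hypothetical.

```python
import math


def target_replicas(current_replicas, current_utilization, target_utilization,
                    min_replicas, max_replicas):
    """Compute a replica count that would bring utilization back to target."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))


# 2 replicas at 95% CPU with a 70% target
print(target_replicas(2, 95, 70, min_replicas=1, max_replicas=10))
```

With these numbers, 2 × 95 / 70 ≈ 2.71, which rounds up to 3 replicas.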

Monitoring and Observability

Cloud Monitoring Integration

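import time
from google.cloud import monitoring_v3

# Client for reading time series from Cloud Monitoring
client = monitoring_v3.MetricServiceClient()

# Query the last hour of online prediction counts for Vertex AI endpoints
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        # A metric type is required in the filter; this one is an example
        "filter": (
            'metric.type="aiplatform.googleapis.com/prediction/online/prediction_count" '
            'AND resource.type="aiplatform.googleapis.com/Endpoint"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for result in results:
    print(f"Metric: {result.metric.type}")
    if result.points:
        print(f"Value: {result.points[0].value.int64_value}")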

Logging

import logging
import time
from google.cloud import logging as cloud_logging

# Setup Cloud Logging
client = cloud_logging.Client()
client.setup_logging()

logger = logging.getLogger(__name__)

# Measure and log client-side prediction latency
start = time.perf_counter()
response = endpoint.predict(instances=[{"prompt": "test"}])
latency_ms = (time.perf_counter() - start) * 1000
logger.info(f"Prediction latency: {latency_ms:.1f}ms")
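
Individual latency log lines are easier to act on once summarized as percentiles. A minimal sketch using only the standard library; `latency_summary` is a hypothetical helper and the sample data is synthetic.

```python
import statistics


def latency_summary(latencies_ms):
    """Return p50, p95, and p99 for a list of latency samples (at least two)."""
    # quantiles(n=100) returns the 99 cut points between percentiles
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


samples = list(range(1, 102))  # synthetic latencies: 1 ms through 101 ms
print(latency_summary(samples))
```

Tail percentiles such as p95 and p99 usually reveal queuing and cold-start effects that the mean hides.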

Best Practices

Choose the Right Serving Engine

Use vLLM for high throughput, TGI for Hugging Face models, and custom handlers for specialized needs

Enable Autoscaling

Configure min/max replicas to handle traffic spikes efficiently

Optimize GPU Usage

Use tensor parallelism for large models, quantization for memory constraints

Monitor Performance

Track latency, throughput, and GPU utilization metrics

Use LoRA for Multi-Task

Serve multiple specialized models with shared base weights

Test Before Production

Load test endpoints to validate performance under expected traffic
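
A basic load test can be sketched with a thread pool that sends requests concurrently and records per-request latency. Everything here is illustrative: `load_test` and `fake_endpoint` are hypothetical names, and the stand-in function should be replaced with a real call such as `endpoint.predict(...)`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_test(send_request, num_requests=50, concurrency=8):
    """Call send_request(i) concurrently and summarize latency and throughput."""

    def timed_call(i):
        start = time.perf_counter()
        send_request(i)
        return (time.perf_counter() - start) * 1000  # milliseconds

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(num_requests)))
    elapsed = time.perf_counter() - started

    return {
        "requests": num_requests,
        "throughput_rps": num_requests / elapsed,
        "mean_latency_ms": sum(latencies) / len(latencies),
        "max_latency_ms": max(latencies),
    }


def fake_endpoint(i):
    """Stand-in for a real prediction call; sleeps to simulate work."""
    time.sleep(0.01)


print(load_test(fake_endpoint, num_requests=20, concurrency=4))
```

Running the test at several concurrency levels shows where latency starts to climb, which is a useful input when choosing replica counts.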

Cost Optimization

Right-Size Compute

Start with smaller machine types and scale up based on metrics

Use Spot VMs

Enable spot VMs for up to 80% cost savings on fault-tolerant workloads

Scale to Zero

Use Cloud Run for infrequent workloads that can scale to zero

Batch Requests

Send multiple predictions in a single request to reduce overhead
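
As a minimal sketch of request batching, the helper below splits a list of instances into fixed-size chunks so that each chunk can be sent in one call. `chunk_instances` is a hypothetical helper, and the instance fields mirror the vLLM example earlier in this guide.

```python
def chunk_instances(instances, batch_size):
    """Yield consecutive slices of at most batch_size instances."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


prompts = ["Explain machine learning", "Explain quantum computing", "Explain LoRA"]
instances = [{"prompt": p, "max_tokens": 200} for p in prompts]

for batch in chunk_instances(instances, batch_size=2):
    # With a deployed endpoint: endpoint.predict(instances=batch)
    print(len(batch))
```

Keep the batch size within the request size limits of your endpoint.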

Next Steps

Model Garden

Explore models available for deployment

Fine-Tuning

Customize models before deployment

Example Notebooks

View serving examples on GitHub

Performance Guide

Learn more about optimization techniques
