Overview
Google Vertex AI provides access to Gemini models, PaLM, and other Google AI models through Google Cloud Platform with enterprise features and SLAs.
Quick Start
Set Google Cloud Credentials
export VERTEX_PROJECT="your-project-id"
export VERTEX_LOCATION="us-central1"
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
Make Your First Call
from litellm import completion
response = completion(
    model="vertex_ai/gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "Hello Gemini!"}]
)
print(response.choices[0].message.content)
Supported Models
Gemini 2.0
Latest Gemini models with multimodal capabilities:
# Gemini 2.0 Flash (Experimental)
response = completion(
    model="vertex_ai/gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "Analyze this data..."}]
)
# With thinking mode
response = completion(
    model="vertex_ai/gemini-2.0-flash-thinking-exp-01-21",
    messages=[{"role": "user", "content": "Complex problem..."}]
)
Gemini 1.5
Production Gemini models:
# Gemini 1.5 Pro - Most capable
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Complex analysis..."}]
)
# Gemini 1.5 Flash - Fast and efficient
response = completion(
    model="vertex_ai/gemini-1.5-flash",
    messages=[{"role": "user", "content": "Quick task..."}]
)
# Gemini 1.5 Flash-8B - Ultra fast
response = completion(
    model="vertex_ai/gemini-1.5-flash-8b",
    messages=[{"role": "user", "content": "Simple query..."}]
)
Gemini 1.0
Earlier Gemini models:
# Gemini 1.0 Pro
response = completion(
    model="vertex_ai/gemini-1.0-pro",
    messages=[{"role": "user", "content": "Task..."}]
)
Other Models
# PaLM 2
response = completion(
    model="vertex_ai/text-bison",
    messages=[{"role": "user", "content": "Generate text..."}]
)
# Codey (Code generation)
response = completion(
    model="vertex_ai/code-bison",
    messages=[{"role": "user", "content": "Write Python code..."}]
)
Authentication
Using a service account key file:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export VERTEX_PROJECT="your-project-id"
export VERTEX_LOCATION="us-central1"
from litellm import completion
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}]
)
Using gcloud application-default credentials:
# Authenticate using gcloud
gcloud auth application-default login
export VERTEX_PROJECT="your-project-id"
export VERTEX_LOCATION="us-central1"
from litellm import completion
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}]
)
Passing the project, location, and credentials directly in the call:
from litellm import completion
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}],
    vertex_project="your-project-id",
    vertex_location="us-central1",
    vertex_credentials="/path/to/credentials.json"
)
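The three options above differ only in where the project, location, and credentials come from. The following is a minimal sketch of one way to merge them, assuming explicit arguments should take precedence over environment variables. `vertex_auth_kwargs` is a hypothetical helper written for this page, not a LiteLLM function.

```python
import os


def vertex_auth_kwargs(env=None, project=None, location=None, credentials=None):
    """Build keyword arguments for completion(); explicit values override env vars.

    Hypothetical helper for illustration only, not part of LiteLLM.
    """
    env = os.environ if env is None else env
    candidates = {
        "vertex_project": project or env.get("VERTEX_PROJECT"),
        "vertex_location": location or env.get("VERTEX_LOCATION"),
        "vertex_credentials": credentials or env.get("GOOGLE_APPLICATION_CREDENTIALS"),
    }
    # Drop unset values so the library's own defaults still apply.
    return {name: value for name, value in candidates.items() if value}


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {"VERTEX_PROJECT": "your-project-id", "VERTEX_LOCATION": "us-central1"}
kwargs = vertex_auth_kwargs(example_env, location="europe-west1")
print(kwargs)
```

The resulting dictionary could then be unpacked into a call such as `completion(model=..., messages=..., **kwargs)`.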
Available Locations
Vertex AI is available in multiple regions:
| Location | Code | Description |
|---|---|---|
| US | us-central1 | Iowa (recommended) |
| Europe | europe-west1 | Belgium |
| Europe | europe-west4 | Netherlands |
| Asia | asia-southeast1 | Singapore |
| Asia | asia-northeast1 | Tokyo |
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}],
    vertex_location="europe-west1"
)
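Because the location is just a per-call argument, one possible pattern is to try a list of regions in order and use the first that responds. The sketch below assumes that approach; `complete_with_fallback` and `fake_call` are hypothetical names, and `fake_call` stands in for `completion` so the example runs without network access.

```python
def complete_with_fallback(call, locations):
    """Try call(vertex_location=...) for each location; return the first success."""
    errors = []
    for location in locations:
        try:
            return location, call(vertex_location=location)
        except Exception as exc:  # broad on purpose for the sketch; narrow it in real code
            errors.append((location, exc))
    raise RuntimeError(f"all locations failed: {errors}")


def fake_call(vertex_location):
    # Stand-in for completion(); pretends the first region is unavailable.
    if vertex_location == "us-central1":
        raise ConnectionError("region unavailable")
    return f"ok from {vertex_location}"


used, result = complete_with_fallback(fake_call, ["us-central1", "europe-west1"])
print(used, result)
```

With the real client, `call` could be built with `functools.partial(completion, model=..., messages=...)` so that only `vertex_location` varies between attempts.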
Multimodal (Vision)
Gemini models support images, videos, and audio:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)
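For images that live on disk rather than at a public URL, a common convention in OpenAI-style message formats is to embed the bytes as a base64 data URL. The sketch below builds such a message; it assumes the data URL is accepted in the same `image_url` field as an HTTPS URL, which should be verified for the model in use. `image_message` is a hypothetical helper.

```python
import base64


def image_message(prompt, image_bytes, mime_type="image/jpeg"):
    """Build a user message carrying a text prompt and an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes; in practice these would come from open(path, "rb").read().
message = image_message("What's in this image?", b"\x89PNG placeholder", "image/png")
print(message["content"][1]["image_url"]["url"][:22])
```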
Function Calling
Gemini supports function calling:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
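Printing the tool call is only the first half of the loop; the application still has to run the function. Below is a minimal sketch of that dispatch step, assuming `tool_call.function.arguments` arrives as a JSON string (the OpenAI-style convention). The registry, the placeholder `get_weather` body, and the `SimpleNamespace` stand-in for a real tool call are all illustrative.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    # Placeholder implementation; a real version would query a weather service.
    return {"location": location, "unit": unit, "temperature": 21}


# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Look up the requested function and call it with the decoded arguments."""
    func = TOOL_REGISTRY[tool_call.function.name]
    arguments = json.loads(tool_call.function.arguments)
    return func(**arguments)


# Stand-in object shaped like response.choices[0].message.tool_calls[0].
fake_tool_call = SimpleNamespace(
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Tokyo"}')
)
result = run_tool_call(fake_tool_call)
print(result)
```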
Streaming
from litellm import completion
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
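When the full text is needed after streaming finishes, for logging or post-processing, the chunks can be accumulated as they arrive. This sketch assumes chunks have the same `choices[0].delta.content` shape used in the loop above; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks let it run offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a chunk iterator, skipping empty deltas."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    # Mimics the attribute layout of a streamed chunk.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


full_text = collect_stream([fake_chunk("Once "), fake_chunk(None), fake_chunk("upon a time")])
print(full_text)
```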
Context Caching
Cache large contexts to reduce costs:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert in... " * 1000,  # Long prompt
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {"role": "user", "content": "Question 1"}
    ]
)
# Subsequent requests reuse cached context
response2 = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {"role": "system", "content": [{...}]},  # Same cached content
        {"role": "user", "content": "Question 2"}
    ]
)
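Since the follow-up request has to resend the same cached content, it can help to build the system message once and reuse the object. A minimal sketch under that assumption; `cached_system_message` and `messages_for` are hypothetical helpers, and no request is sent.

```python
def cached_system_message(text):
    """Wrap a long prompt in a system message marked for caching."""
    return {
        "role": "system",
        "content": [
            {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        ],
    }


def messages_for(question, system_message):
    """Pair the shared system message with a new user question."""
    return [system_message, {"role": "user", "content": question}]


LONG_CONTEXT = "You are an expert in... " * 1000  # same long prompt as above
system_message = cached_system_message(LONG_CONTEXT)
first = messages_for("Question 1", system_message)
second = messages_for("Question 2", system_message)
print(first[0] == second[0])
```

Each list could be passed as `messages=` to `completion`, which keeps the cached block byte-for-byte identical across requests.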
JSON Mode
Force JSON output:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": "Extract: John is 30 years old, lives in NYC"
    }],
    response_format={"type": "json_object"}
)
import json
data = json.loads(response.choices[0].message.content)
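JSON mode constrains the output format but not which fields appear, so a defensive parse is often worthwhile. The sketch below validates a reply before use; the sample reply string and the key names are invented for illustration, and `parse_json_reply` is a hypothetical helper.

```python
import json


def parse_json_reply(text, required_keys=()):
    """Decode a model reply and check that it is an object with the expected keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Invented reply; a real one would come from response.choices[0].message.content.
sample_reply = '{"name": "John", "age": 30, "city": "NYC"}'
person = parse_json_reply(sample_reply, required_keys=("name", "age"))
print(person["age"])
```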
Grounding (Search)
Ground responses in Google Search or Vertex AI Search:
# Google Search grounding
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "What are the latest AI developments?"}],
    tools=[{"googleSearchRetrieval": {}}]
)
# Vertex AI Search grounding
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Search our docs"}],
    tools=[{
        "retrieval": {
            "vertexAiSearch": {
                "datastore": "projects/PROJECT/locations/LOCATION/collections/default_collection/dataStores/DATASTORE_ID"
            }
        }
    }]
)
Safety Settings
Configure content safety filters:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Generate content"}],
    safety_settings=[
        {
            "category": "HARM_CATEGORY_HARASSMENT",
            "threshold": "BLOCK_MEDIUM_AND_ABOVE"
        },
        {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "threshold": "BLOCK_MEDIUM_AND_ABOVE"
        }
    ]
)
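When the same threshold applies to most categories, the list can be generated rather than written out by hand. A small sketch, assuming the category and threshold strings shown above; `build_safety_settings` is a hypothetical helper, and the `BLOCK_ONLY_HIGH` override is an assumed threshold value that should be checked against the Vertex AI reference.

```python
# Categories taken from the example above; extend as needed.
DEFAULT_CATEGORIES = ("HARM_CATEGORY_HARASSMENT", "HARM_CATEGORY_HATE_SPEECH")


def build_safety_settings(threshold="BLOCK_MEDIUM_AND_ABOVE",
                          categories=DEFAULT_CATEGORIES,
                          overrides=None):
    """Return a safety_settings list with one threshold per category."""
    overrides = overrides or {}
    return [
        {"category": category, "threshold": overrides.get(category, threshold)}
        for category in categories
    ]


settings = build_safety_settings(
    overrides={"HARM_CATEGORY_HATE_SPEECH": "BLOCK_ONLY_HIGH"}  # assumed threshold name
)
print(settings)
```

The result has the same shape as the `safety_settings` argument in the call above.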
Embeddings
Generate embeddings:
from litellm import embedding
# Text embeddings
response = embedding(
    model="vertex_ai/text-embedding-005",
    input="Hello world"
)
print(len(response.data[0].embedding))  # 768 dimensions
# Multimodal embeddings (text + image)
response = embedding(
    model="vertex_ai/multimodalembedding",
    input={
        "text": "A cat",
        "image": {"url": "https://example.com/cat.jpg"}
    }
)
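Embeddings are usually compared rather than inspected directly. As a sketch of the most common comparison, here is cosine similarity in plain Python; the short vectors are toy inputs, whereas real ones would be the lists found at `response.data[i].embedding`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
print(orthogonal, parallel)
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.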
Advanced Parameters
Temperature and Sampling
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Be creative"}],
    temperature=0.9,
    top_p=0.95,
    top_k=40,
    max_tokens=2048
)
System Instructions
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
Stop Sequences
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Count to 10"}],
    stop=["5", "\n\n"]
)
Batch Prediction
Process large batches asynchronously:
from litellm import create_batch, retrieve_batch
batch = create_batch(
    custom_llm_provider="vertex_ai",
    input_file_id="gs://bucket/input.jsonl",
    output_uri_prefix="gs://bucket/output/",
    endpoint="/generateContent"
)
print(f"Batch ID: {batch.id}")
Error Handling
from litellm import completion
from litellm.exceptions import (
    AuthenticationError,
    RateLimitError,
    APIError
)
try:
    response = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except AuthenticationError:
    print("Invalid Google Cloud credentials")
except RateLimitError:
    print("Quota exceeded")
except APIError as e:
    print(f"Vertex AI error: {e}")
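Quota errors are often transient, so instead of only printing a message an application might retry with exponential backoff. The following is a generic sketch of that pattern, not a LiteLLM feature: `call_with_retry` is hypothetical, and `FakeRateLimitError` stands in for `RateLimitError` so the example runs without the library or network access.

```python
import time


class FakeRateLimitError(Exception):
    """Stand-in for litellm.exceptions.RateLimitError so the sketch runs offline."""


def call_with_retry(call, retry_on, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `call()`; on `retry_on`, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


attempts = {"count": 0}


def flaky_call():
    # Fails twice with a quota error, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("Quota exceeded")
    return "Hello!"


delays = []  # records the waits instead of actually sleeping
result = call_with_retry(flaky_call, FakeRateLimitError, sleep=delays.append)
print(result, delays)
```

With the real client, `retry_on` would be `RateLimitError` and `call` a zero-argument function wrapping `completion(...)`; authentication errors are deliberately not retried because they will not resolve on their own.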
Cost Tracking
from litellm import completion, completion_cost
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}]
)
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
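Per-call figures become more useful when summed over a session. A minimal sketch of a running total, assuming each response exposes `usage.prompt_tokens` and `usage.completion_tokens` as above and that the cost is computed separately (for example with `completion_cost`). `UsageTracker` is a hypothetical class, and the token counts and costs below are made-up numbers.

```python
from types import SimpleNamespace


class UsageTracker:
    """Accumulates token counts and cost across several responses."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.cost = 0.0
        self.calls = 0

    def record(self, response, cost):
        self.prompt_tokens += response.usage.prompt_tokens
        self.completion_tokens += response.usage.completion_tokens
        self.cost += cost
        self.calls += 1


def fake_response(prompt_tokens, completion_tokens):
    # Mimics only the `usage` attribute of a completion response.
    return SimpleNamespace(
        usage=SimpleNamespace(prompt_tokens=prompt_tokens, completion_tokens=completion_tokens)
    )


tracker = UsageTracker()
tracker.record(fake_response(12, 30), cost=0.0005)   # made-up numbers
tracker.record(fake_response(8, 20), cost=0.00025)   # made-up numbers
print(tracker.calls, tracker.prompt_tokens, tracker.completion_tokens, f"${tracker.cost:.6f}")
```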
Model Garden
Use models from Vertex AI Model Garden:
response = completion(
    model="vertex_ai_model_garden/meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    vertex_project="your-project",
    vertex_location="us-central1"
)
Best Practices
Use Service Accounts
Use service accounts with minimal required permissions for production.
Enable Caching
Use context caching for large prompts to reduce costs.
Choose the Right Model
Use Flash for speed, Pro for quality, Flash-8B for high throughput.
Set Safety Filters
Configure appropriate safety settings for your use case.
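The model guidance above can be captured as a small lookup so the choice is made in one place. This is only a sketch of that idea: the priority labels and `pick_model` are hypothetical, and the model names are the Gemini 1.5 identifiers used elsewhere on this page.

```python
# Mapping from the guidance above: Flash for speed, Pro for quality,
# Flash-8B for high throughput. The keys are illustrative labels.
MODEL_BY_PRIORITY = {
    "speed": "vertex_ai/gemini-1.5-flash",
    "quality": "vertex_ai/gemini-1.5-pro",
    "throughput": "vertex_ai/gemini-1.5-flash-8b",
}


def pick_model(priority):
    """Return the model name for a priority, or raise on an unknown label."""
    try:
        return MODEL_BY_PRIORITY[priority]
    except KeyError:
        valid = ", ".join(sorted(MODEL_BY_PRIORITY))
        raise ValueError(f"unknown priority {priority!r}; expected one of: {valid}") from None


chosen = pick_model("quality")
print(chosen)
```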