Prompt Caching

Prompt caching allows LLM providers to cache parts of your prompt that don’t change between requests, significantly reducing costs and latency for repeated queries.

How Prompt Caching Works

When you send a request to an LLM, providers like Anthropic can cache specific message segments. On subsequent requests with the same cached content, you:

Pay reduced rates for cached tokens (up to about 90% cheaper, depending on the provider)
Experience faster response times (cached content is pre-processed)
Reduce computational overhead
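
As a rough mental model only (not any provider's actual implementation), caching can be pictured as a lookup keyed on the unchanged leading part of the prompt. The sketch below is a toy illustration; `ToyPrefixCache` and `lookup` are hypothetical names invented for this example.

```python
import hashlib


class ToyPrefixCache:
    """Toy model of prefix caching; real providers work differently internally."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def lookup(self, static_prefix: str) -> bool:
        """Return True on a cache hit; record the prefix and return False on a miss."""
        key = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


cache = ToyPrefixCache()
book = "It is a truth universally acknowledged..."
first = cache.lookup(book)        # miss: the prefix is stored
second = cache.lookup(book)       # hit: identical prefix
third = cache.lookup(book + " ")  # miss: any change to the prefix breaks the match
```

The takeaway is that the cached portion has to be byte-for-byte identical between requests; only the content after it should vary.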

Providers supporting prompt caching:

Anthropic: Claude models with cache_control metadata
OpenAI: GPT models with automatic caching
Google AI: Gemini models with caching support
Vertex AI: Google Cloud models with caching

AWS Bedrock does not currently report cached token usage, though some underlying models may use caching internally.

Enabling Prompt Caching in BAML

Caching is configured using role metadata in your BAML prompts. Here’s how to enable it:

Step 1: Configure Client to Allow Role Metadata

Add allowed_role_metadata to your client configuration:

main.baml

client<llm> AnthropicClient {
  provider "anthropic"
  options {
    model "claude-sonnet-4-5-20250929"
    allowed_role_metadata ["cache_control"]
  }
}

Step 2: Add Cache Control to Messages

Use the _.role() function to add cache control metadata:

main.baml

function AnalyzeBook(book: string, question: string) -> string {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    {{ book }}
    
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    {{ question }}
  "#
}

This creates two user messages:

First message: Contains the book content (will be cached)
Second message: Contains the question (marked with cache control)

The cache control metadata tells the provider that everything up to this point should be cached. Place it after the content you want to cache.

Real-World Examples

Analyzing Large Documents

Cache document content while varying questions:

main.baml

class Analysis {
  themes string[]
  characters string[]
  summary string
}

function AnalyzeLiterature(document: string, focus: string) -> Analysis {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    Here is the complete text to analyze:
    
    {{ document }}
    
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    Focus your analysis on: {{ focus }}
    
    {{ ctx.output_format }}
  "#
}

Usage:

from baml_client import b

async def analyze_book_twice(pride_and_prejudice_text: str) -> None:
    # First call - caches document
    result1 = await b.AnalyzeLiterature(
        document=pride_and_prejudice_text,
        focus="character development"
    )

    # Second call - reuses cached document
    result2 = await b.AnalyzeLiterature(
        document=pride_and_prejudice_text,  # Same document
        focus="social commentary"  # Different focus
    )

Multi-Turn Conversations with Context

Cache conversation history and system context:

main.baml

class Message {
  role string
  content string
}

class Response {
  answer string
  confidence float
}

function ChatWithContext(
  system_context: string,
  conversation_history: Message[],
  user_message: string
) -> Response {
  client AnthropicClient
  prompt #"
    {{ _.role("system") }}
    {{ system_context }}
    
    {{ _.role("user") }}
    Previous conversation:
    {% for msg in conversation_history %}
    {{ msg.role }}: {{ msg.content }}
    {% endfor %}
    
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    {{ user_message }}
    
    {{ ctx.output_format }}
  "#
}

RAG with Cached Context

Cache retrieved documents while varying queries:

main.baml

class DocumentChunk {
  title string
  content string
}

class Answer {
  response string
  sources int[]  // Indices of relevant chunks
  confidence float
}

function AnswerFromDocs(
  docs: DocumentChunk[],
  query: string
) -> Answer {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    Reference Documents:
    {% for doc in docs %}
    [{{ loop.index }}] {{ doc.title }}
    {{ doc.content }}
    
    {% endfor %}
    
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    Query: {{ query }}
    
    Provide an answer based on the reference documents above.
    
    {{ ctx.output_format }}
  "#
}
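
Because a cache hit depends on the cached portion of the prompt being identical between requests, it can help to pass the same chunks in the same order every time. The following is a minimal sketch under that assumption; `Chunk` is a hypothetical stand-in for `DocumentChunk`, and `stable_order` is an illustrative helper, not part of BAML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the DocumentChunk class above."""
    title: str
    content: str


def stable_order(chunks: list[Chunk]) -> list[Chunk]:
    """Deduplicate and sort chunks so the same set always renders identically."""
    unique = set(chunks)  # frozen dataclasses are hashable
    return sorted(unique, key=lambda c: (c.title, c.content))


intro = Chunk("Intro", "alpha")
usage = Chunk("Usage", "beta")
run1 = stable_order([usage, intro])
run2 = stable_order([intro, usage, intro])  # different order, one duplicate
```

Sorting discards the retriever's relevance ranking, so this only makes sense when chunk order does not matter for answer quality. Note also that the indices returned in `sources` would then refer to positions in the sorted list.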

Monitoring Cache Performance

Use Collectors to track cache hits and savings:

from baml_client import b
from baml_py import Collector

async def analyze_with_caching(large_doc: str) -> None:
    collector = Collector(name="cache-monitor")
    
    # First call
    await b.AnalyzeLiterature(
        document=large_doc,
        focus="themes",
        baml_options={"collector": collector}
    )
    
    # Second call with same doc
    await b.AnalyzeLiterature(
        document=large_doc,
        focus="characters",
        baml_options={"collector": collector}
    )
    
    # Analyze cache performance
    for i, log in enumerate(collector.logs):
        print(f"\nCall {i + 1}:")
        print(f"  Input tokens: {log.usage.input_tokens}")
        print(f"  Cached tokens: {log.usage.cached_input_tokens or 0}")
        print(f"  Output tokens: {log.usage.output_tokens}")
        
        # Share of input tokens served from the cache
        if log.usage.cached_input_tokens and log.usage.input_tokens:
            cache_pct = (log.usage.cached_input_tokens / log.usage.input_tokens) * 100
            print(f"  Cache hit rate: {cache_pct:.1f}%")
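
The percentage calculation can also be pulled into a small standalone helper that tolerates missing or zero values. This is a sketch only: `cached_share_pct` is a hypothetical name, and it assumes the reported input token count includes the cached tokens, which may vary by provider.

```python
from typing import Optional


def cached_share_pct(
    input_tokens: Optional[int],
    cached_input_tokens: Optional[int],
) -> float:
    """Percentage of input tokens reported as cached; 0.0 when data is missing."""
    if not input_tokens or not cached_input_tokens:
        return 0.0
    return cached_input_tokens / input_tokens * 100


share = cached_share_pct(50_000, 45_000)
no_data = cached_share_pct(None, 45_000)
print(f"Cached share: {share:.1f}%")
```

Inside the loop above, it would be called with `log.usage.input_tokens` and `log.usage.cached_input_tokens`.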

Verifying Cache Requests

Use the VSCode Playground to verify your cache configuration:

Open your BAML function in VSCode
Run it in the Playground
Switch from “Prompt Review” to “Raw cURL” view
Verify the cache_control metadata is present in the request

Example output:

{
  "model": "claude-sonnet-4-5-20250929",
  "messages": [
    {
      "role": "user",
      "content": "<document content>"
    },
    {
      "role": "user",
      "content": "<question>",
      "cache_control": { "type": "ephemeral" }
    }
  ]
}
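
If you copy the JSON body out of the Raw cURL view, the same check can be scripted. The sketch below assumes the body is shaped like the example above; `messages_with_cache_control` is a hypothetical helper, and the content-block check is an added assumption to cover payloads where the marker is nested inside a list-valued `content`.

```python
import json


def messages_with_cache_control(request_body: str) -> list[int]:
    """Return indices of messages carrying cache_control (message or content-block level)."""
    payload = json.loads(request_body)
    hits = []
    for i, message in enumerate(payload.get("messages", [])):
        if "cache_control" in message:
            hits.append(i)
            continue
        content = message.get("content")
        if isinstance(content, list) and any(
            isinstance(block, dict) and "cache_control" in block for block in content
        ):
            hits.append(i)
    return hits


example = """
{
  "model": "claude-sonnet-4-5-20250929",
  "messages": [
    {"role": "user", "content": "<document content>"},
    {"role": "user", "content": "<question>", "cache_control": {"type": "ephemeral"}}
  ]
}
"""
marked = messages_with_cache_control(example)
```

An empty result suggests the metadata was dropped, for example because `allowed_role_metadata` is missing from the client configuration.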

Best Practices

Cache Large, Static Content: Cache documentation, large documents, or system prompts that don’t change
Place Cache Markers Strategically: Put cache_control after the content you want cached, not before
Monitor Cache Effectiveness: Use Collectors to track cache hit rates and cost savings
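Consider TTL: Anthropic’s ephemeral cache expires after ~5 minutes of inactivity, so group requests that share a prefix close together in time
Balance Cache Size: Caching pays off for substantial content (1000+ tokens); Anthropic does not cache prompts below a model-specific minimum length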
Provider Differences: Different providers have different caching mechanisms - test thoroughly
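
To make the TTL advice concrete, here is a small heuristic sketch for deciding whether a follow-up request is likely to land while the cache is still warm. `cache_likely_warm` and `EPHEMERAL_TTL_SECONDS` are hypothetical names, and the 5-minute figure is the approximate value quoted above, not a guarantee.

```python
EPHEMERAL_TTL_SECONDS = 5 * 60  # approximate figure from the text, not an exact guarantee


def cache_likely_warm(
    last_call_ts: float,
    now_ts: float,
    ttl_seconds: float = EPHEMERAL_TTL_SECONDS,
) -> bool:
    """Heuristic: True if the previous call was recent enough that the cache may still be live."""
    return 0 <= now_ts - last_call_ts < ttl_seconds


warm = cache_likely_warm(last_call_ts=1_000.0, now_ts=1_120.0)  # 2 minutes later
cold = cache_likely_warm(last_call_ts=1_000.0, now_ts=1_400.0)  # more than 5 minutes later
```

A scheduler could use a check like this to batch questions about the same document rather than spreading them over a long period. The Collector output remains the authoritative signal for whether a hit actually occurred.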

Cost Savings Example

For Anthropic Claude Sonnet 4.5:

Regular input tokens: $3.00 per million
Cached input tokens: $0.30 per million (90% cheaper)

If analyzing a 50,000 token document with 10 different questions:

Without caching: 500,000 tokens × $3.00/M = $1.50
With caching: 50,000 × $3.00/M + 450,000 × $0.30/M = $0.15 + $0.135 = $0.285
Savings: $1.215 (81% reduction)
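
The same arithmetic can be reproduced in a few lines of Python, which makes it easy to plug in other document sizes or prices. The function names here are hypothetical. Providers may bill cache writes at a premium (Anthropic does), which the simplified arithmetic above leaves out; the illustrative `write_multiplier` parameter lets you model that, and defaults to 1.0 so the output matches the figures above.

```python
def cost_without_cache(doc_tokens: int, calls: int, input_price_per_m: float) -> float:
    """Every call pays the full input price for the document."""
    return doc_tokens * calls * input_price_per_m / 1_000_000


def cost_with_cache(
    doc_tokens: int,
    calls: int,
    input_price_per_m: float,
    cached_price_per_m: float,
    write_multiplier: float = 1.0,  # assumption: set above 1.0 to model a cache-write premium
) -> float:
    """First call pays the write price; later calls pay the cached price."""
    first_call = doc_tokens * input_price_per_m * write_multiplier / 1_000_000
    later_calls = doc_tokens * (calls - 1) * cached_price_per_m / 1_000_000
    return first_call + later_calls


baseline = cost_without_cache(50_000, 10, 3.00)
cached = cost_with_cache(50_000, 10, 3.00, 0.30)
savings_pct = (baseline - cached) / baseline * 100
print(f"${baseline:.3f} vs ${cached:.3f} ({savings_pct:.0f}% reduction)")
```

This sketch counts only the document tokens, as the example above does; the per-question tokens and output tokens are billed at normal rates on every call.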

Limitations

Cache Duration: Ephemeral caches expire after ~5 minutes of inactivity
Provider-Specific: Implementation varies by provider
Metadata Requirements: Must configure allowed_role_metadata for each client
No Cross-Request Guarantees: Cache hits aren’t guaranteed across different sessions

Related Topics

Collectors - Track cache performance
Optimization Techniques - Broader optimization strategies
Anthropic Caching Documentation
