Prompt caching allows LLM providers to cache parts of your prompt that don’t change between requests, significantly reducing costs and latency for repeated queries.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/BoundaryML/baml/llms.txt
Use this file to discover all available pages before exploring further.
How Prompt Caching Works
When you send a request to an LLM, providers like Anthropic can cache specific message segments. On subsequent requests with the same cached content, you:- Pay reduced rates for cached tokens (often 90% cheaper)
- Experience faster response times (cached content is pre-processed)
- Reduce computational overhead
- Anthropic: Claude models with
cache_controlmetadata - OpenAI: GPT models with automatic caching
- Google AI: Gemini models with caching support
- Vertex AI: Google Cloud models with caching
AWS Bedrock does not currently report cached token usage, though some underlying models may use caching internally.
Enabling Prompt Caching in BAML
Caching is configured using role metadata in your BAML prompts. Here’s how to enable it:Step 1: Configure Client to Allow Role Metadata
Addallowed_role_metadata to your client configuration:
main.baml
Step 2: Add Cache Control to Messages
Use the_.role() function to add cache control metadata:
main.baml
- First message: Contains the book content (will be cached)
- Second message: Contains the question (marked with cache control)
The cache control metadata tells the provider that everything up to this point should be cached. Place it after the content you want to cache.
Real-World Examples
Analyzing Large Documents
Cache document content while varying questions:main.baml
Multi-Turn Conversations with Context
Cache conversation history and system context:main.baml
RAG with Cached Context
Cache retrieved documents while varying queries:main.baml
Monitoring Cache Performance
Use Collectors to track cache hits and savings:Verifying Cache Requests
Use the VSCode Playground to verify your cache configuration:- Open your BAML function in VSCode
- Run it in the Playground
- Switch from “Prompt Review” to “Raw cURL” view
- Verify the
cache_controlmetadata is present in the request
Best Practices
- Cache Large, Static Content: Cache documentation, large documents, or system prompts that don’t change
- Place Cache Markers Strategically: Put
cache_controlafter the content you want cached, not before - Monitor Cache Effectiveness: Use Collectors to track cache hit rates and cost savings
- Consider TTL: Anthropic’s ephemeral cache lasts ~5 minutes; plan request timing accordingly
- Balance Cache Size: Caching works best with substantial content (1000+ tokens)
- Provider Differences: Different providers have different caching mechanisms - test thoroughly
Cost Savings Example
For Anthropic Claude Sonnet 4.5:- Regular input tokens: $3.00 per million
- Cached input tokens: $0.30 per million (90% cheaper)
- Without caching: 500,000 tokens × 1.50
- With caching: 50,000 × 0.30/M = 0.135 = $0.285
- Savings: $1.215 (81% reduction)
Limitations
- Cache Duration: Ephemeral caches expire after ~5 minutes of inactivity
- Provider-Specific: Implementation varies by provider
- Metadata Requirements: Must configure
allowed_role_metadatafor each client - No Cross-Request Guarantees: Cache hits aren’t guaranteed across different sessions
Related Topics
- Collectors - Track cache performance
- Optimization Techniques - Broader optimization strategies
- Anthropic Caching Documentation