Prompt caching reduces repeated prefill work when multiple model calls share the same prefix. In long-running agents, this can materially reduce input-token cost and time-to-first-token latency. Treat prompt caching as part of harness architecture — not as a provider afterthought. The context builder, tool registry, instruction manager, compactor, and telemetry layer all affect cache hit rate, and a few common mistakes can silently destroy it.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/DenisSergeevitch/agents-best-practices/llms.txt
Use this file to discover all available pages before exploring further.
How Prompt Caching Works
Most prompt-cache systems cache the stable leading portion of a request (the prefix) and skip re-processing it on subsequent calls that share an identical prefix. Only the volatile trailing portion (the suffix) is processed fresh each time. The result is that requests with a large stable prefix pay much less for input tokens on the second and subsequent calls. The key constraint is exactness: the cached prefix must match byte-for-byte (or token-for-token, depending on the provider). Any change to the prefix — including whitespace, key ordering, or injected metadata — invalidates the cache entry.Cache-Aware Context Ordering
Design requests so stable content appears first and volatile content appears last. This gives the provider the longest possible prefix to cache. Recommended ordering — stable to volatile:The compaction vs. cache tradeoff is real: compaction is sometimes necessary to stay within context limits, but it resets or shifts the stable prefix, triggering a cold-cache turn. Compact only when the context window requires it. After one cold turn, the compacted summary can itself become the new stable prefix and caching resumes. Frequent compaction churn — for example, re-summarizing every turn — prevents this recovery and keeps cache hit rates low.
Concrete Ordering Examples
Compaction vs. Cache Tradeoff
Compaction is often necessary, but it resets or changes the reusable prefix. Apply these rules:Compact only when useful
Do not compact speculatively. Compact when the context window is genuinely constrained, not as a routine housekeeping step.
Make compaction boundaries explicit
Record when and why compaction occurred. This makes cache-hit-rate drops explainable in telemetry.
Stabilize the summary
Once a compacted summary is written, do not rewrite it on subsequent turns. Let it become the new stable prefix.
Preserve recent high-value messages
When possible, keep the most recent high-signal messages verbatim rather than summarizing them. Pruning oversized tool outputs is preferable to rewriting all of history.
Cost Telemetry
Log cache diagnostics on every model call. The following fields cover the full cost picture:Provider-Specific Notes
- OpenAI
- Anthropic
- OpenAI-compatible
OpenAI prompt caching is automatic on supported API requests. The response includes a
cached_tokens field under usage.prompt_tokens_details.Cost Control Strategies
Session budgets
Set a maximum token spend per session and per tool call. Track cumulative cost against the budget and stop early when it is exceeded rather than letting runaway loops consume unlimited tokens.
Progressive tool disclosure
Prefer skill and tool progressive disclosure over loading large inventories upfront. Fewer tools in the prompt means a smaller stable prefix to maintain and lower input-token cost.
Deterministic serialization
Sort tools and schema keys deterministically. Use versioned prompt and tool bundles. Avoid middleware that injects trace IDs, timestamps, or randomized content into the stable prefix.
Long retention only when justified
Use extended cache retention only when the expected inter-request gap justifies the retention cost. For low-frequency sessions, standard TTL is usually sufficient.
Anti-Patterns
Avoid these cache-killing patterns:current date/time or a request_id at the top of the system prompt. This makes every request a cache miss regardless of how stable the rest of the prompt is.