The AI Gateway can cache LLM responses and serve them without making a new provider request. This reduces both cost and latency for repeated or similar queries.

Cache modes

Simple caching performs exact-match lookups. The cache key is a SHA-256 hash of the serialized request body combined with the target URL. If an identical request has been made before, the cached response is returned immediately.
{
  "cache": {
    "mode": "simple",
    "max_age": 3600
  }
}
This mode is available in both the open-source gateway and the hosted Portkey service.

Configuration

cache.mode
string
required
Cache mode. "simple" for exact-match caching, "semantic" for embedding-based similarity matching.
cache.max_age
number
Cache TTL in seconds. After this duration, the cached entry expires and the next matching request hits the provider. Defaults to 24 hours (86400 seconds) if not set.

Usage

from portkey_ai import Portkey

client = Portkey(
    provider="openai",
    Authorization="sk-...",  # provider API key
    config={
      "cache": {
        "mode": "simple",
        "max_age": 3600
      }
    }
)

# First call — hits the provider
response1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)

# Second identical call — served from cache
response2 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)

Cache status headers

Every response includes an x-portkey-cache-status header indicating whether the response came from cache:
Value          Meaning
HIT            Response served from simple cache
SEMANTIC HIT   Response served from semantic cache
MISS           No cache entry found; response from provider
SEMANTIC MISS  Semantic search found no match; response from provider
REFRESH        Cache was bypassed due to force-refresh header
DISABLED       Caching is not enabled for this request
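When handling responses programmatically, you often want to treat both hit variants the same way. A minimal sketch (the helper name and the headers dict are illustrative; read the header from whatever HTTP client or SDK response object you use):

```python
# Status values that mean the response was served from cache,
# per the table above.
CACHE_HIT_STATUSES = {"HIT", "SEMANTIC HIT"}

def served_from_cache(headers: dict) -> bool:
    """Return True if x-portkey-cache-status indicates a cache hit."""
    status = headers.get("x-portkey-cache-status", "DISABLED")
    return status in CACHE_HIT_STATUSES
```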

Force-refreshing the cache

To bypass the cache and fetch a fresh response from the provider, include the x-portkey-cache-force-refresh header:
curl http://localhost:8787/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-cache-force-refresh: true" \
  -H "x-portkey-config: <config>" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'

Persistent caching with Redis

By default, the gateway uses an in-memory cache that is lost when the process restarts. To persist the cache across restarts and share it across multiple gateway instances, configure a Redis connection:
REDIS_CONNECTION_STRING=redis://localhost:6379
Set this environment variable before starting the gateway. When Redis is available, all putInCache and getFromCache operations use Redis instead of the in-process store.
The in-memory cache does not persist across gateway restarts. Use Redis for production deployments where cache durability matters.
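For example, assuming Docker is available for running Redis and the gateway is launched via npx:

```shell
# Start a local Redis instance (assumes Docker is installed)
docker run -d --name gateway-cache -p 6379:6379 redis:7

# Point the gateway at it before launching
export REDIS_CONNECTION_STRING=redis://localhost:6379
npx @portkey-ai/gateway
```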

Caching limitations

  • Streaming responses are not cached. If stream: true is set in the request body, caching is skipped for that request.
  • Cache keys include the full request body and the provider URL. Any change to the request — model, messages, parameters — produces a different cache key.
  • The simple cache key is a SHA-256 hash of JSON.stringify(requestBody) + targetURL.
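The key derivation above can be sketched in Python. Note that the gateway serializes with JavaScript's JSON.stringify, whose key ordering and spacing differ from Python's json.dumps, so these hashes are illustrative rather than byte-identical to the gateway's:

```python
import hashlib
import json

def simple_cache_key(request_body: dict, target_url: str) -> str:
    """Sketch of the simple cache key: SHA-256 over the serialized
    request body concatenated with the target URL."""
    serialized = json.dumps(request_body, separators=(",", ":")) + target_url
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

Any change to the body or URL yields a different digest, which is why even a one-character change to a message misses the simple cache.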
