oMLX’s KV cache is modeled after vLLM’s block pool architecture, adapted for Apple Silicon’s unified memory system. Context is stored in fixed-size blocks that can live in two tiers simultaneously: a hot in-memory tier for fast reuse and a cold SSD tier (written in safetensors format) for durable, restart-proof persistence. When a new request arrives whose token prefix matches blocks already on disk, those blocks are registered back into the metadata index on demand and served without any recomputation. This is especially valuable for long-context coding sessions where system prompts and file contents are reused across many turns.
## Architecture Overview
`CacheBlock` holds a fixed number of tokens (`--initial-cache-blocks` controls the starting pool; the pool grows dynamically up to a configurable maximum). Blocks are identified by a chain hash: each block’s hash is derived from its token content and the hash of the preceding block, creating a verifiable prefix chain identical in spirit to vLLM’s approach. The model name is included in the hash to isolate caches between different models sharing the same SSD directory.
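The chain-hash idea can be sketched as follows. This is an illustrative Python sketch, not oMLX's actual implementation: the function name `block_hash`, the use of SHA-256, and the exact fields hashed are assumptions.

```python
import hashlib

def block_hash(prev_hash: bytes, token_ids: list[int], model_name: str) -> bytes:
    """Chain hash for one cache block: covers the model name (to isolate
    caches per model), the previous block's hash (to link the chain), and
    this block's token content. Illustrative sketch only."""
    h = hashlib.sha256()
    h.update(model_name.encode("utf-8"))
    h.update(prev_hash)
    h.update(b" ".join(str(t).encode("utf-8") for t in token_ids))
    return h.digest()

# Two prompts sharing a prefix produce identical hashes for the shared
# blocks, so their KV data can be deduplicated.
a = block_hash(b"", [1, 2, 3, 4], "model-a")
b = block_hash(b"", [1, 2, 3, 4], "model-a")   # same tokens, same model
c = block_hash(b"", [1, 2, 3, 4], "model-b")   # same tokens, other model
d = block_hash(a, [5, 6, 7, 8], "model-a")     # next block in the chain
```

Because each hash folds in the previous one, a match on block *n* implies the entire prefix up to *n* matches as well.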
- Hot Cache (RAM)
- Cold SSD Cache
The hot tier is managed entirely by `PagedCacheManager` using a doubly-linked free queue that provides O(1) LRU eviction. Blocks at the front of the queue are the least recently used and are evicted first when space is needed.

Key properties:

- Blocks are reference-counted. A block with `ref_count > 1` is shared across requests via prefix deduplication; it is Copy-on-Write (CoW) safe.
- When a request begins generation on a shared block, `_cow_copy_block()` allocates a fresh block and transfers ownership, leaving the original available for other requests.
- The hot cache size is bounded by `--hot-cache-max-size`. Set this to a percentage of RAM (e.g. `20%`) to reserve headroom for model weights and KV generation.
- When `--hot-cache-max-size` is omitted, the hot tier is unbounded and grows with demand up to available unified memory.
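The free-queue behavior can be sketched with `OrderedDict` standing in for the doubly-linked list, which gives the same O(1) touch and evict operations. All names here are illustrative; the real `PagedCacheManager` tracks reference counts and would only queue unreferenced blocks for eviction.

```python
from collections import OrderedDict

class HotCacheSketch:
    """Minimal LRU free-queue sketch (hypothetical names). The oldest
    entry in the queue is evicted first when the pool is full."""

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self.blocks = {}                 # block hash -> block data
        self.free_queue = OrderedDict()  # LRU order; front = oldest

    def touch(self, h) -> None:
        # Reusing a block moves it to the back (most recently used).
        if h in self.free_queue:
            self.free_queue.move_to_end(h)

    def add(self, h, data) -> None:
        if len(self.blocks) >= self.max_blocks:
            # O(1) eviction of the least recently used block.
            victim, _ = self.free_queue.popitem(last=False)
            del self.blocks[victim]
        self.blocks[h] = data
        self.free_queue[h] = None

cache = HotCacheSketch(max_blocks=2)
cache.add("a", "kv-a")
cache.add("b", "kv-b")
cache.touch("a")        # "a" becomes most recent
cache.add("c", "kv-c")  # pool full: evicts "b", the LRU block
```

A real doubly-linked list gives the same asymptotics without `OrderedDict`'s hashing overhead per node, which matters when eviction sits on the request hot path.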
## Prefix Sharing and Copy-on-Write
Two requests that begin with the same system prompt will share all blocks covering that prefix. Shared blocks have `ref_count > 1`. When one request diverges (e.g., a different user message follows), `fork_block_table()` increments the reference count and marks the last block for CoW. On the next generation step, `_cow_copy_block()` transparently duplicates only the diverged block, keeping the shared prefix blocks untouched and available for future requests.
This means a 32k-token system prompt loaded once stays cached for every concurrent request that uses it, at the cost of a single block allocation per diverging session.
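The fork-and-copy flow can be sketched as below. This is a hypothetical minimal version: the real `fork_block_table()` and `_cow_copy_block()` operate on KV tensors inside the block pool, not on plain token lists.

```python
class Block:
    """Toy stand-in for a KV cache block (illustrative)."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.ref_count = 1

def fork_block_table(table):
    """Share a prefix with a new request: bump every block's ref count
    and mark the last block as the CoW candidate."""
    for blk in table:
        blk.ref_count += 1
    return list(table), len(table) - 1  # (new table, index of CoW block)

def cow_copy_block(table, idx):
    """On the first write to a shared block, duplicate only that block;
    all earlier (prefix) blocks stay shared."""
    shared = table[idx]
    if shared.ref_count > 1:
        shared.ref_count -= 1
        table[idx] = Block(list(shared.tokens))  # fresh private copy
    return table[idx]

base = [Block([1, 2]), Block([3, 4])]
forked, cow_idx = fork_block_table(base)   # both requests share both blocks
private = cow_copy_block(forked, cow_idx)  # divergence copies only block 1
```

Only the diverged tail block is duplicated; the prefix block is still shared by both tables, which is what makes long shared system prompts nearly free per session.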
## Configuration Reference
| Flag | Default | Description |
|---|---|---|
| `--paged-ssd-cache-dir` | (disabled) | Path for SSD cold-tier storage. Enables tiered caching when set. |
| `--paged-ssd-cache-max-size` | `100GB` | Maximum disk space consumed by the cold tier. |
| `--hot-cache-max-size` | `0` (no cap) | Maximum RAM for the hot in-memory tier. Accepts bytes or a percentage (e.g. `20%`). |
| `--no-cache` | `false` | Disable all KV caching entirely. |
| `--initial-cache-blocks` | `256` | Number of blocks pre-allocated at startup. The pool grows dynamically. |
## Example: Full Tiered Cache Setup
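A hypothetical invocation combining the flags above. The `omlx serve` command name, model argument, and cache directory path are assumptions; the cache flags themselves are the ones documented in this page.

```shell
# Hot tier capped at 20% of RAM; cold tier on SSD, capped at 100 GB.
# Binary name, subcommand, and paths are illustrative placeholders.
omlx serve \
  --paged-ssd-cache-dir /var/cache/omlx-kv \
  --paged-ssd-cache-max-size 100GB \
  --hot-cache-max-size 20% \
  --initial-cache-blocks 256
```

With this configuration, evicted hot blocks spill to the SSD tier and survive both server restarts and model unloads.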
## How It Survives Restarts
On each request, `get_computed_blocks()` walks the token sequence block by block, computing the expected chain hash. If a block is not in RAM but `PagedSSDCacheManager.has_block(hash)` returns `True`, the block is re-registered in the in-memory index with metadata only; its tensor data remains on disk. The `BatchGenerator` pulls the data from disk during prefill injection. This means cache state built up over hours of usage is never lost when the server restarts or crashes.
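The recovery walk can be sketched as below. This is hypothetical Python: block hashes are assumed to be precomputed per block, and the function signature and index shapes are assumptions, not oMLX's actual API.

```python
def get_computed_blocks(block_hashes, hot_index, ssd_has_block):
    """Walk a request's chain hashes in order, reusing blocks from RAM
    first, then from SSD; stop at the first miss (illustrative sketch).
    `ssd_has_block` stands in for PagedSSDCacheManager.has_block."""
    hits = []
    for h in block_hashes:
        if h in hot_index:
            hits.append((h, "ram"))
        elif ssd_has_block(h):
            # Re-register metadata only; the tensor bytes stay on disk
            # until prefill injection pulls them in.
            hot_index[h] = {"resident": False}
            hits.append((h, "ssd"))
        else:
            break  # chain broken: everything after this must be recomputed
        # (chain hashing guarantees a hit here implies the whole prefix matches)
    return hits

hot = {"h1": {"resident": True}}   # one block already in RAM
ssd = {"h2"}                       # one more block persisted on disk
hits = get_computed_blocks(["h1", "h2", "h3"], hot, ssd.__contains__)
```

Note that the walk stops at the first miss: because each hash chains to the previous one, a missing block invalidates everything after it, so partial suffix matches are never served.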
The SSD cache survives not just restarts but also model unloads. If a model is evicted by LRU to free RAM and later reloaded, its entire cached prefix history is immediately available again.