An eviction policy decides which tokens to keep when a context must be compressed to a fixed budget. In SuperCompress every policy implements the EvictionPolicy abstract base class from policies.py, which exposes a single method: select(records, budget) → List[int]. The method receives the full list of TokenRecord objects produced by build_inference_records and the integer token budget, and returns a sorted list of token position indices to retain. Policies range from stateless rules (FIFO, Truncation) to attention-informed heuristics (H2O, SnapKV) to the trained neural policy (LearnedPolicy) used by SuperCompress by default.
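The contract described above can be sketched in a few lines. This is a minimal illustration and not the code in policies.py: the TokenRecord dataclass is a hypothetical stand-in with only two fields, and KeepEveryOther is a toy policy invented for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class TokenRecord:
    # Hypothetical stand-in: the real TokenRecord carries more fields.
    position: int
    attention_mass: float = 0.0


class EvictionPolicy(ABC):
    # Stand-in for the base class described above, not the library source.
    @abstractmethod
    def select(self, records: List[TokenRecord], budget: int) -> List[int]:
        """Return the sorted positions of the tokens to keep."""


class KeepEveryOther(EvictionPolicy):
    # Toy policy for illustration only: keep even positions, up to `budget`.
    def select(self, records: List[TokenRecord], budget: int) -> List[int]:
        kept = [r.position for r in records if r.position % 2 == 0][:budget]
        return sorted(kept)


records = [TokenRecord(position=i) for i in range(10)]
kept = KeepEveryOther().select(records, budget=3)
print(kept)
```

Any policy that honors this shape (records in, sorted positions out, at most budget entries) can be swapped for another without changing the caller.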
Policy Comparison
| Policy | Class | Strategy |
|---|---|---|
| FIFO | FIFO | Drop oldest tokens; keep the most recent budget tokens |
| LRU | LRU | Keep tokens with highest recency score |
| Sliding Window | SlidingWindow | Recent half + first 5% attention sinks |
| Truncation | TruncationPolicy | Attention sinks + most recent tokens (head+tail) |
| Summarization | SummarizationPolicy | Extractive: keep lines with highest entity overlap to the question |
| H2O | H2OPolicy | Attention sinks + recent window + top cumulative-attention tokens |
| SuperCompress (Learned) | LearnedPolicy | Top-k tokens by EvictionPolicyNetwork keep score |
| Attention Heuristic | AttentionHeuristicPolicy | Non-learned: keep tokens with highest attention_mass |
| SnapKV | SnapKVPolicy | Score prefix tokens by attention from an observation window at sequence end |
| Oracle | OraclePolicy | Upper bound: keep all oracle-important tokens, then fill with recent |
FIFO
What it does: FIFO discards the oldest tokens first, retaining only the most recent budget tokens. It is the simplest possible eviction policy and requires no per-token scoring.
When to use it: Useful as a fast lower-bound baseline in benchmarks, or when your context is append-only and older content is genuinely irrelevant. Not suitable when answers are buried in the middle of long sessions.
LRU
What it does: LRU keeps the budget tokens with the highest position value (i.e. the most recently seen), so it behaves like recency-ranked cache eviction. It scores every token by its position field and retains the top-ranked entries.
When to use it: A lightweight alternative to FIFO when you want a position-sorted recency bias. Both policies share the same weakness: answer-bearing tokens buried in the middle of a long context are evicted regardless.
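To make the shared weakness concrete, here is a small sketch of both rules as described above. It assumes the position field doubles as the recency score; the TokenRecord dataclass and both function names are hypothetical and not the library's classes.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class TokenRecord:
    # Hypothetical stand-in holding only the field this sketch needs.
    position: int


def fifo_select(records: List[TokenRecord], budget: int) -> List[int]:
    # Drop the oldest tokens: keep the last `budget` positions.
    ordered = sorted(r.position for r in records)
    return ordered[-budget:] if budget > 0 else []


def lru_select(records: List[TokenRecord], budget: int) -> List[int]:
    # Rank by position (used here as the recency score), keep the top `budget`.
    top = heapq.nlargest(budget, records, key=lambda r: r.position)
    return sorted(r.position for r in top)


records = [TokenRecord(position=i) for i in range(12)]
fifo_kept = fifo_select(records, budget=4)
lru_kept = lru_select(records, budget=4)
answer_position = 5  # a token in the middle of the context
print(fifo_kept, lru_kept, answer_position in fifo_kept)
```

Under these assumptions the two rules keep the same positions, and a token in the middle of the context is evicted by both.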
Sliding Window
What it does: SlidingWindow always retains the first 5% of tokens as attention sinks, then fills the remaining budget with the most recent tokens. It is a fixed-window strategy that never considers the semantic content of middle tokens.
When to use it: A step up from pure FIFO for contexts where early structural context (imports, schema definitions) is worth preserving alongside the most recent content. Still unsuitable for long agentic sessions where answers are scattered throughout.
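The budget split can be worked through on position indices alone. The sketch below assumes the 5% is taken from the sequence length (not from the budget) and rounds down; the function name and the sink_percent parameter are hypothetical.

```python
from typing import List


def sliding_window_select(n_tokens: int, budget: int, sink_percent: int = 5) -> List[int]:
    # Assumption: sinks are the first 5% of the sequence, capped by the budget.
    budget = min(budget, n_tokens)
    n_sinks = min(budget, (n_tokens * sink_percent) // 100)
    n_recent = budget - n_sinks
    sinks = list(range(n_sinks))
    recent = list(range(n_tokens - n_recent, n_tokens))
    return sinks + recent  # already sorted: sinks precede the recent window


kept = sliding_window_select(n_tokens=100, budget=20)
print(kept[:5], kept[5], len(kept))
```

With 100 tokens and a budget of 20, five slots go to sinks and fifteen to the tail, so positions 5 through 84 are never candidates.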
Truncation
What it does: TruncationPolicy is a head+tail strategy. It keeps a small number of attention sink tokens from the start of the sequence (sink_tokens=4 by default) and fills the remaining budget with the most recent tokens. Content in the middle of the context is always dropped.
When to use it: A reasonable default for chat applications where older turns are genuinely stale. Fails badly when critical information appears in the middle of the context (oracle recall ~25%).
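A head+tail selection with a fixed sink count can be sketched as follows. Only the sink_tokens=4 default comes from the description above; the function name and the handling of very small budgets are assumptions.

```python
from typing import List


def truncation_select(positions: List[int], budget: int, sink_tokens: int = 4) -> List[int]:
    # Keep `sink_tokens` from the head, then fill the rest from the tail.
    ordered = sorted(positions)
    budget = min(budget, len(ordered))
    n_head = min(sink_tokens, budget)
    n_tail = budget - n_head
    head = ordered[:n_head]
    tail = ordered[len(ordered) - n_tail:] if n_tail else []
    return head + tail


kept = truncation_select(list(range(50)), budget=10)
print(kept)
print(25 in kept)  # a mid-context position
```

Nothing between the head and the tail can survive, which is why this rule struggles when the answer sits mid-context.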
Summarization
What it does: SummarizationPolicy is an extractive summarization baseline. It scores each source line by how many question entities it contains, adds a small attention_mass signal, and keeps the highest-scoring whole lines until the budget is reached. Lines containing oracle-important tokens get a +10 score boost. Code lines get a +0.5 bonus.
When to use it: Good when the query is keyword-rich and the context is prose. Less effective on code-heavy contexts where entity overlap is weaker.
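The line-level scoring can be illustrated with a simplified sketch. Several pieces are assumptions: the SourceLine container, the 0.1 weight on attention_mass (the description only says "small"), passing question entities in as a ready-made set, and a greedy fill that skips lines which no longer fit. The +10 oracle boost is left out, and the sketch returns line indices instead of token positions.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SourceLine:
    # Hypothetical container for one line of the context.
    text: str
    n_tokens: int
    attention_mass: float = 0.0
    is_code: bool = False


def score_line(line: SourceLine, entities: Set[str], attention_weight: float = 0.1) -> float:
    # Entity overlap plus a small attention signal; code lines get +0.5.
    words = set(re.findall(r"\w+", line.text.lower()))
    score = len(words & entities) + attention_weight * line.attention_mass
    if line.is_code:
        score += 0.5
    return score


def select_lines(lines: List[SourceLine], entities: Set[str], budget: int) -> List[int]:
    # Visit lines from best to worst score and keep each whole line that fits.
    ranked = sorted(range(len(lines)),
                    key=lambda i: score_line(lines[i], entities),
                    reverse=True)
    kept, used = [], 0
    for i in ranked:
        if used + lines[i].n_tokens <= budget:
            kept.append(i)
            used += lines[i].n_tokens
    return sorted(kept)


lines = [
    SourceLine("The cache was introduced in version two.", n_tokens=8),
    SourceLine("timeout = 30  # seconds", n_tokens=6, is_code=True),
    SourceLine("Unrelated remarks about the weather.", n_tokens=6),
    SourceLine("Set the timeout before opening the cache.", n_tokens=8),
]
kept_lines = select_lines(lines, entities={"cache", "timeout"}, budget=14)
print(kept_lines)
```

In this toy input the code line outranks a prose line with the same overlap only because of the +0.5 bonus.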
H2O (Heavy Hitter Oracle)
What it does: H2OPolicy implements the Heavy Hitter Oracle strategy from Zhang et al. (2023). It partitions the budget into three pools: attention sink tokens (first sink_tokens), a recent window (recent_ratio=0.2 of the budget), and “heavy hitter” slots filled by the tokens with the highest cumulative h2o_score (or layer_attention_mean as a fallback). This gives it near-oracle recall (98%) without any training.
When to use it: The best non-learned baseline. SuperCompress automatically falls back to H2OPolicy if no trained checkpoint is found, so it is always safe to use as a production default.
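The three-pool split can be sketched on a plain list of per-token scores. This is an illustration under assumptions, not H2OPolicy itself: the function name is hypothetical, sink_tokens=4 is borrowed from the Truncation default, and the recent pool size is rounded to the nearest integer.

```python
from typing import List


def h2o_select(scores: List[float], budget: int,
               sink_tokens: int = 4, recent_ratio: float = 0.2) -> List[int]:
    n = len(scores)
    budget = min(budget, n)
    # Pool 1: attention sinks at the start of the sequence.
    kept = set(range(min(sink_tokens, budget)))
    # Pool 2: a recent window sized as a fraction of the budget.
    n_recent = min(round(budget * recent_ratio), budget - len(kept))
    kept.update(range(n - n_recent, n))
    # Pool 3: heavy hitters, the best-scoring tokens not already kept.
    remaining = budget - len(kept)
    candidates = [i for i in range(n) if i not in kept]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept.update(candidates[:remaining])
    return sorted(kept)


scores = [0.0] * 20
for pos, value in {7: 0.9, 11: 0.8, 12: 0.7, 15: 0.6, 9: 0.5}.items():
    scores[pos] = value  # toy cumulative-attention scores

kept = h2o_select(scores, budget=10)
print(kept)
```

With a budget of 10, four slots go to sinks, two to the recent window, and the remaining four to the highest-scoring middle tokens, so position 9 (the fifth-best score) is evicted.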
Attention Heuristic
What it does: AttentionHeuristicPolicy is a non-learned baseline that keeps the budget tokens with the highest attention_mass values. Unlike H2O it does not carve out explicit sink or recency pools; it ranks all tokens purely by their synthesized attention weight and takes the top-k.
When to use it: Useful for ablation studies to isolate how much of H2O’s or LearnedPolicy’s gain comes from attention-mass scoring alone, before adding recency and entity-match signals.
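A pure top-k ranking is short enough to show in full. The function name and the toy attention_mass values are hypothetical.

```python
import heapq
from typing import List


def attention_topk(attention_mass: List[float], budget: int) -> List[int]:
    # Rank every position by attention mass alone and keep the top `budget`.
    top = heapq.nlargest(budget, range(len(attention_mass)),
                         key=lambda i: attention_mass[i])
    return sorted(top)


mass = [0.1, 0.05, 0.9, 0.3, 0.7, 0.2]
kept = attention_topk(mass, budget=3)
print(kept)
```

Because no pool is reserved, position 0 is dropped here even though a sink-aware policy would have kept it.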
SnapKV
What it does: SnapKVPolicy implements a SnapKV-style strategy from Li et al. It scores prefix tokens by their attention from an observation window near the end of the sequence, retaining the sink_tokens most important prefix positions plus the top-scoring remaining tokens by snapkv_score (falling back to attention_mass).
When to use it: When you want a prefix-focused heuristic that mirrors how recent query tokens attend back to earlier context — a useful baseline for retrieval-heavy prompts.
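The observation-window idea can be sketched on a small attention matrix. This is a simplified illustration and an assumption about how such a score could be computed: it sums the attention that the last window query rows pay to each prefix key and keeps the top-scoring prefix positions. It does not reproduce how SnapKVPolicy derives snapkv_score, and it omits sink handling; both function names are hypothetical.

```python
from typing import List


def observation_scores(attn: List[List[float]], window: int) -> List[float]:
    # attn[q][k] is the attention query position q pays to key position k.
    n = len(attn)
    prefix_len = n - window
    scores = [0.0] * prefix_len
    for q in range(prefix_len, n):      # queries inside the observation window
        for k in range(prefix_len):     # keys in the prefix
            scores[k] += attn[q][k]
    return scores


def select_prefix(attn: List[List[float]], window: int, budget: int) -> List[int]:
    scores = observation_scores(attn, window)
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:budget])


# Six tokens; only the last two rows (the observation window) matter here.
attn = [[0.0] * 6 for _ in range(4)] + [
    [0.1, 0.5, 0.1, 0.2, 0.1, 0.0],
    [0.2, 0.3, 0.0, 0.3, 0.1, 0.1],
]
kept_prefix = select_prefix(attn, window=2, budget=2)
print(kept_prefix)
```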
SuperCompress (Learned Policy)
What it does: LearnedPolicy runs the EvictionPolicyNetwork, a ~5K-parameter MLP with the architecture Linear(9→64) → LayerNorm → GELU → Linear(64→64) → GELU → Linear(64→1). For each token it computes a sigmoid keep score from the 9-dimensional feature vector produced by build_feature_tensor. The budget highest-scoring tokens are retained. The model runs on CPU in well under 1 ms.
When to use it: The recommended default for all production workloads. Achieves 100% oracle recall at 35% token budget — the only policy to match the theoretical Oracle upper bound in benchmarks.
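The ~5K figure can be checked by hand from the layer shapes, and the final top-k step is easy to sketch. The count below assumes every Linear layer has a bias and the LayerNorm has a learnable scale and shift (the usual defaults in common frameworks). The keep_topk helper and the logits are hypothetical; the network itself is not reimplemented here.

```python
import math
from typing import List


def linear_params(n_in: int, n_out: int) -> int:
    # Weight matrix plus one bias per output unit (assumed).
    return n_in * n_out + n_out


def layernorm_params(n: int) -> int:
    # One scale and one shift per feature (assumed affine LayerNorm).
    return 2 * n


total_params = (
    linear_params(9, 64)      # 640
    + layernorm_params(64)    # 128
    + linear_params(64, 64)   # 4160
    + linear_params(64, 1)    # 65
)


def keep_topk(logits: List[float], budget: int) -> List[int]:
    # Sigmoid turns each logit into a keep score; retain the top `budget`.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


kept = keep_topk([2.0, -1.0, 0.5, 3.0, -2.5], budget=2)
print(total_params, kept)
```

Under those assumptions the total comes to 4,993 parameters, consistent with the ~5K description.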
Oracle
What it does: OraclePolicy is the theoretical upper bound. It keeps all tokens flagged as is_oracle_important (ground-truth labels derived from the source text) and then fills the remaining budget with the most recent tokens. It cannot be used at inference time because oracle labels are unknown before generation.
When to use it: Benchmarking only — to establish the ceiling that all learned and heuristic policies are measured against.
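The fill order can be sketched on position indices. The function name is hypothetical, and cutting off the important set when it exceeds the budget is an assumption; the description above does not cover that case.

```python
from typing import List, Set


def oracle_select(n_tokens: int, important: Set[int], budget: int) -> List[int]:
    # Start from the ground-truth important positions (capped at the budget).
    kept = set(sorted(important)[:budget])
    # Fill whatever budget is left with the most recent positions.
    for pos in range(n_tokens - 1, -1, -1):
        if len(kept) >= budget:
            break
        kept.add(pos)
    return sorted(kept)


kept = oracle_select(n_tokens=12, important={2, 5}, budget=5)
print(kept)
```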
OraclePolicy is reserved for benchmarking only: it uses ground-truth is_oracle_important labels that are unavailable at inference time. Never use it in production.
Comparing All Policies at Once
Use compare_policies() to run every policy on the same context and question and get a side-by-side result dict.
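The signature and return shape of compare_policies() are not spelled out here, so the following is only a self-contained sketch of the idea: run several selection rules on the same input and collect the results in a dict keyed by policy name. Every name in it (run_all, fifo, head_tail, the recall field) is hypothetical and not part of SuperCompress.

```python
from typing import Callable, Dict, List, Set


def fifo(n_tokens: int, budget: int) -> List[int]:
    # Keep only the most recent positions.
    return list(range(max(0, n_tokens - budget), n_tokens))


def head_tail(n_tokens: int, budget: int, sinks: int = 2) -> List[int]:
    # Keep a few leading positions, then fill from the end.
    head = list(range(min(sinks, budget, n_tokens)))
    n_tail = min(budget, n_tokens) - len(head)
    return head + list(range(n_tokens - n_tail, n_tokens))


def run_all(policies: Dict[str, Callable[[int, int], List[int]]],
            n_tokens: int, budget: int, important: Set[int]) -> Dict[str, dict]:
    # Same input for every policy; recall is the share of important
    # positions that survive (a toy metric for this sketch).
    results = {}
    for name, policy in policies.items():
        kept = policy(n_tokens, budget)
        hits = len(set(kept) & important)
        results[name] = {"kept": kept, "recall": hits / len(important)}
    return results


results = run_all({"fifo": fifo, "head_tail": head_tail},
                  n_tokens=20, budget=6, important={0, 10, 19})
for name, row in results.items():
    print(name, row["kept"], round(row["recall"], 2))
```

Holding the context, budget, and labels fixed across policies is what makes the side-by-side numbers comparable.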
See also run_policy_benchmarks() from supercompress.benchmarks.runner; the Benchmarks page has details.