SuperCompress: Learned Context Compression for LLMs

SuperCompress is a Python library for learned context compression. Before each LLM inference call, a small (~5K-parameter) policy running on CPU decides which lines of your agent’s context to keep under a fixed token budget and which can be safely dropped. The result is a shorter prompt that fits in less KV cache, costs less to process, and still contains the information your model needs to answer correctly.

The problem with naive truncation

The standard approach to fitting a long context into a token budget is to keep the head and tail and throw away whatever is in the middle. This is fast, but it has a fundamental flaw: the answer to the current question is often in the middle. A retrieval log, a previous tool response, or a computed value buried between padding lines is silently removed, and the model then hallucinates or refuses rather than answering.

SuperCompress solves this by treating eviction as a learned ranking problem. Given the current user query, it scores every line of context for relevance and retains the highest-scoring lines that fit within the budget. A critical line such as `CRITICAL_ANSWER = "404 when row is missing"` buried 180 lines deep survives compression, while irrelevant filler is discarded.
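The contrast between the two strategies can be sketched in a few lines of Python. This is an illustration only, not the SuperCompress implementation: `overlap_score` is a hypothetical keyword-overlap scorer standing in for the learned policy, `count_tokens` is a crude whitespace count rather than a real tokenizer, and every function name here is invented for the example.

```python
import re

def count_tokens(line: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(line.split())

def head_tail_truncate(lines, budget):
    """Keep lines alternately from the head and the tail until the budget is spent."""
    kept, used = [], 0
    i, j = 0, len(lines) - 1
    take_head = True
    while i <= j:
        idx = i if take_head else j
        cost = count_tokens(lines[idx])
        if used + cost > budget:
            break
        used += cost
        kept.append(idx)
        if take_head:
            i += 1
        else:
            j -= 1
        take_head = not take_head
    return [lines[k] for k in sorted(kept)]

def overlap_score(query: str, line: str) -> int:
    # Hypothetical stand-in for the learned policy: count shared words.
    q = set(re.findall(r"\w+", query.lower()))
    words = set(re.findall(r"\w+", line.lower()))
    return len(q & words)

def rank_and_keep(lines, query, budget):
    """Keep the highest-scoring lines that fit, then restore document order."""
    order = sorted(range(len(lines)),
                   key=lambda k: overlap_score(query, lines[k]),
                   reverse=True)
    kept, used = [], 0
    for k in order:
        cost = count_tokens(lines[k])
        if used + cost <= budget:
            kept.append(k)
            used += cost
    return [lines[k] for k in sorted(kept)]

# Synthetic context: the answer sits 180 lines deep, surrounded by filler.
answer = 'CRITICAL_ANSWER = "404 when row is missing"'
lines = ([f"padding line {n}" for n in range(180)]
         + [answer]
         + [f"padding line {n}" for n in range(180, 360)])
query = "What status is returned when a row is missing?"
budget = 60

truncated = head_tail_truncate(lines, budget)
ranked = rank_and_keep(lines, query, budget)
print(answer in truncated)  # head-and-tail drops the middle line
print(answer in ranked)     # ranking keeps it within the same budget
```

Both functions spend the same budget; the only difference is the order in which lines are considered. Swapping the overlap scorer for a trained model changes the scores but not the selection loop.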

Benchmark results

The table below compares SuperCompress against standard baselines at a 35% token budget (8 seeds):

Policy	Oracle recall	Entity recall	KV savings	Policy size
Truncation / FIFO	25%	73%	~65%	rule-based
Summarization	61%	65%	~65%	rule-based
H2O	98%	73%	~65%	rule-based
SuperCompress	100%	73%	~65%	~5K params

SuperCompress reaches 100% oracle recall in this benchmark, meaning the answer line survived compression in every run, while matching the ~65% KV savings of the other policies. Its entity recall is 73%, the same as Truncation / FIFO and H2O. The policy runs entirely on CPU before inference, adding roughly 60 ms of overhead. The estimated impact at 1 million compressions is ~800M tokens avoided, 29 kWh saved, and 12 kg CO₂ saved. See docs/ENVIRONMENT.md for the full methodology.
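For a sense of scale, the headline estimate can be broken down into per-unit figures. The snippet below is back-of-envelope arithmetic on the three numbers quoted above and nothing more; the derived ratios are implied by those numbers, not taken from docs/ENVIRONMENT.md, which may compute them differently.

```python
# Figures quoted in the estimate for 1 million compressions.
compressions = 1_000_000
tokens_avoided = 800_000_000
energy_kwh = 29.0
co2_kg = 12.0

# Implied per-unit values (derived here, not stated in the source).
tokens_per_compression = tokens_avoided / compressions
wh_per_compression = energy_kwh * 1000 / compressions
kwh_per_million_tokens = energy_kwh / (tokens_avoided / 1_000_000)
kg_co2_per_kwh = co2_kg / energy_kwh

print(f"{tokens_per_compression:.0f} tokens avoided per compression")
print(f"{wh_per_compression:.3f} Wh saved per compression")
print(f"{kwh_per_million_tokens:.5f} kWh per million tokens avoided")
print(f"{kg_co2_per_kwh:.3f} kg CO2 per kWh (implied grid intensity)")
```

In other words, the estimate corresponds to about 800 tokens avoided per call and an implied carbon intensity of roughly 0.41 kg CO₂ per kWh.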

Public API

SuperCompress exposes seven public symbols from `supercompress`:

Symbol	Kind	Purpose
`compress_context`	function	Compress one text blob for a given question and budget
`compress_for_turn`	function	Merge multiple context blocks, then compress before a turn
`compress_detailed`	function	Compress with per-line `LineAnnotation` keep/drop reasoning
`compare_policies`	function	Run FIFO, Truncation, Summarization, H2O, and SuperCompress side by side
`middle_truncation_failure_case`	function	Build the canonical synthetic context that defeats head-and-tail truncation
`CompressResult`	dataclass	Return type of every compress function — holds the trimmed text, token counts, and savings metrics
`LineAnnotation`	dataclass	Per-line annotation returned by `compress_detailed` — holds `kept`, `reason`, and `line_index`

Installing the package also registers two CLI entry points: `supercompress` (run compression from the shell) and `supercompress-train` (train or fine-tune the eviction policy). Both require Python 3.10 or newer.

Where to go next

Quickstart

Install SuperCompress and run your first compression in under five minutes.

How It Works

Understand the learned eviction policy, token budgets, and the sink-and-recent heuristic.

Eviction Policies

Compare FIFO, Truncation, Summarization, H2O, and the SuperCompress learned policy.

Integrations

Drop SuperCompress into OpenAI, LangChain, and LlamaIndex pipelines.

SuperCompress is released under the MIT License. You are free to use, modify, and distribute it in personal and commercial projects. See the LICENSE file in the repository root for the full text.
