SuperCompress: Learned Context Compression for LLMs

SuperCompress is a Python library that compresses long agent context before every LLM call. Instead of blindly truncating from the head or tail, SuperCompress uses a lightweight ~5K-parameter eviction policy to retain the tokens most relevant to your current query — including answer-bearing lines in the middle of long documents that naive truncation drops entirely.

Quickstart

Get from install to your first compressed context in under five minutes.

How It Works

Understand the learned eviction pipeline and why it beats truncation.

Python API Reference

Full signatures for compress_context, compress_for_turn, and every public export.

Integrations

Wire SuperCompress into OpenAI messages, LangChain agents, or any HTTP client.

Why SuperCompress?

Long agent context is expensive. Every token in the KV cache costs GPU prefill time. Truncation keeps the head and tail but silently drops answers buried in the middle. SuperCompress learns which lines to keep for the current question — under a fixed token budget.

Metric	SuperCompress	Truncation / FIFO
KV savings @ 35% budget	~65%	~65%
Oracle recall	100%	~25%
Policy size	~5K params	rule-based
Runs on	CPU (pre-inference)	CPU

Install

pip install git+https://github.com/arjunkshah/supercompress.git

Compress your context

from supercompress import compress_context

result = compress_context(
    "long context text…",
    "What does fetch return when the row is missing?",
    budget_ratio=0.35,
)
print(result.compressed_text)
print(f"{result.kv_savings_pct:.1f}% KV saved · {result.kept_tokens}/{result.original_tokens} tokens")

Pass the result to your LLM

Use result.compressed_text wherever you’d pass your original context — it’s a plain string.

Explore the docs

Eviction Policies

FIFO, Truncation, H2O, Summarization, and the learned SuperCompress policy explained.

Benchmarks

Reproducible benchmark results across 8 seeds — oracle recall, entity recall, latency.

Environmental Impact

How tokens saved translates to GPU-seconds, Wh, and CO₂ with documented assumptions.

API Dashboard

Firebase auth, API key management, and per-key usage tracking for the hosted API.

Local Server

Run the FastAPI server locally for development and integration testing.

HTTP API

REST endpoints for the hosted compress service with API key authentication.

Get Started

Core Concepts

Guides

Development

SuperCompress: Learned Context Compression for LLMs

Quickstart

How It Works

Python API Reference

Integrations

Why SuperCompress?

Explore the docs

Eviction Policies

Benchmarks

Environmental Impact

API Dashboard

Local Server

HTTP API

Build docs developers (and LLMs) love

Get Started

Core Concepts

Guides

Development

Documentation Index

Quickstart

How It Works

Python API Reference

Integrations

​Why SuperCompress?

​Explore the docs

Eviction Policies

Benchmarks

Environmental Impact

API Dashboard

Local Server

HTTP API

Build docs developers (and LLMs) love

Why SuperCompress?

Explore the docs