SuperCompress Quickstart: Compress Your First LLM Context

SuperCompress is designed to slot into any Python LLM workflow with a single function call. This guide walks you from a fresh install to a working compression pipeline, then shows why naive truncation fails on real agent context and how SuperCompress handles it.

Install SuperCompress

Install directly from GitHub using pip. No PyPI release is required.

pip install git+https://github.com/arjunkshah/supercompress.git

This pulls in the core library along with its two dependencies (torch>=2.0.0 and numpy>=1.24.0) and includes the pretrained checkpoint at checkpoints/default.pt. See the Installation guide for optional extras like the dev tools and local HTTP server.

Compress your context

Import compress_context and pass it your context string, the current user query, and a budget_ratio representing the fraction of tokens to keep.

from supercompress import compress_context

result = compress_context(
    "long context text…",
    "What does fetch return when the row is missing?",
    budget_ratio=0.35,
)

SuperCompress loads the bundled checkpoint automatically. If the checkpoint is not found, it falls back to the H2OPolicy baseline, so a missing checkpoint does not cause the call to raise.

Read the result

compress_context returns a CompressResult dataclass. The fields you’ll use most often are:

# The trimmed context — pass this to your LLM
print(result.compressed_text)

# How much KV cache you saved (e.g. 65.0)
print(f"{result.kv_savings_pct:.1f}% KV saved")

# Token counts before and after
print(f"{result.kept_tokens}/{result.original_tokens} tokens kept")

The full CompressResult also exposes compression_ratio, kept_line_ratio, policy_name, budget_ratio, and question (the original query used for scoring).
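The token counts and the savings percentage are related by simple arithmetic. The sketch below assumes that kv_savings_pct is the percentage of tokens removed; that definition is an assumption for illustration rather than something taken from the library source, and savings_pct is a hypothetical helper name.

```python
def savings_pct(kept_tokens: int, original_tokens: int) -> float:
    """Percentage of tokens removed (assumed meaning of kv_savings_pct)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (original_tokens - kept_tokens) / original_tokens


# 84 of 240 tokens kept, i.e. a budget_ratio of 0.35
print(f"{savings_pct(84, 240):.1f}% KV saved")  # prints "65.0% KV saved"
```

Under that assumption, a budget_ratio of 0.35 corresponds to the 65.0 shown in the example comment above.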

Pass to your LLM

Swap result.compressed_text in wherever you would have passed the original context. The format is identical, just shorter.

import openai

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": result.compressed_text},
        {"role": "user", "content": result.question},
    ],
)

See why truncation fails

The middle_truncation_failure_case() helper builds a synthetic context where the critical answer sits 180 lines deep, in the middle region that head-and-tail truncation drops. Running compare_policies() on it shows the difference at a glance.

from supercompress import middle_truncation_failure_case, compare_policies

context, question = middle_truncation_failure_case()
# context contains 180 filler lines, then:
#   CRITICAL_ANSWER = "404 when row is missing — User.fetch returns None"
# then 40 bridge lines and 15 tail lines.

results = compare_policies(context, question, budget_ratio=0.35)

for name, r in results.items():
    answer_kept = "CRITICAL_ANSWER" in r.compressed_text
    print(f"{name:>16}  kept={r.kept_tokens:>4}/{r.original_tokens}  answer={'✓' if answer_kept else '✗'}")

Expected output (results will vary slightly by seed):

            FIFO  kept= 84/240  answer=✗
      Truncation  kept= 84/240  answer=✗
   Summarization  kept= 84/240  answer=✓
             H2O  kept= 84/240  answer=✓
   SuperCompress  kept= 84/240  answer=✓

compare_policies() runs all five policies (FIFO, Truncation, Summarization, H2O, and SuperCompress) over the same context and returns a dict[str, CompressResult], so you can benchmark them side by side on your own data too.
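To see the failure mode without the library, here is a minimal, self-contained sketch of line-level head-and-tail truncation on a context with the same layout (180 filler lines, the answer, 40 bridge lines, 15 tail lines). Both build_lines and head_tail_truncate are hypothetical helpers written for this illustration; they are not SuperCompress's Truncation policy, which may budget by tokens rather than lines.

```python
def build_lines(filler: int = 180, bridge: int = 40, tail: int = 15) -> list[str]:
    """Synthetic context with the answer buried after the filler lines."""
    lines = [f"filler line {i}" for i in range(filler)]
    lines.append('CRITICAL_ANSWER = "User.fetch returns None"')
    lines += [f"bridge line {i}" for i in range(bridge)]
    lines += [f"tail line {i}" for i in range(tail)]
    return lines


def head_tail_truncate(lines: list[str], budget_ratio: float) -> list[str]:
    """Keep the first and last lines, splitting the line budget evenly."""
    budget = int(len(lines) * budget_ratio)
    head = budget // 2
    tail = budget - head
    return lines[:head] + (lines[-tail:] if tail else [])


lines = build_lines()
kept = head_tail_truncate(lines, budget_ratio=0.35)
answer_kept = any(line.startswith("CRITICAL_ANSWER") for line in kept)
print(f"kept={len(kept)}/{len(lines)}  answer_kept={answer_kept}")
# prints "kept=82/236  answer_kept=False"
```

The answer sits at line index 180, while the kept ranges are 0 to 40 and 195 to 235, so no even split of a 35% budget between head and tail can reach it.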

Other exported functions

SuperCompress exports two additional functions for common agent patterns. compress_for_turn accepts a list of context blocks (tool responses, memory chunks, prior turns) and merges them before compressing. Use it when your agent builds context from multiple sources:

from supercompress import compress_for_turn

compressed_text, result = compress_for_turn(
    context_blocks=["tool response A…", "memory chunk B…", "prior turn C…"],
    user_query="What did the tool return?",
    budget_ratio=0.35,
)
# compressed_text is ready to pass to your LLM
print(f"{result.kv_savings_pct:.1f}% KV saved")
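How the context_blocks list gets assembled is up to your agent. The sketch below shows one possible way to flatten tool outputs, memory chunks, and prior turns into a list of strings; build_context_blocks and its bracketed labels are hypothetical conventions for this example and are not part of SuperCompress.

```python
def build_context_blocks(
    tool_results: dict[str, str],
    memories: list[str],
    prior_turns: list[str],
) -> list[str]:
    """Flatten heterogeneous agent state into labeled text blocks.

    Empty entries are skipped so they do not consume budget.
    """
    blocks: list[str] = []
    for name, output in tool_results.items():
        if output.strip():
            blocks.append(f"[tool:{name}]\n{output}")
    blocks += [f"[memory]\n{m}" for m in memories if m.strip()]
    blocks += [f"[turn]\n{t}" for t in prior_turns if t.strip()]
    return blocks


blocks = build_context_blocks(
    tool_results={"db_lookup": "row not found", "noop": ""},
    memories=["User prefers short answers."],
    prior_turns=["What does fetch return when the row is missing?"],
)
print(len(blocks))  # prints 3: the empty "noop" output is skipped
```

The resulting list can then be passed as the context_blocks argument.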

compress_detailed returns the same CompressResult plus a list of LineAnnotation objects, one per input line, each carrying line_index, text, kept (bool), and reason (e.g. "learned retention score", "attention sink (always kept)"). Use it when you need to inspect or visualize exactly which lines were evicted and why:

from supercompress import compress_detailed

result, annotations = compress_detailed(
    "long context text…",
    "What does fetch return when the row is missing?",
    budget_ratio=0.35,
)

for ann in annotations:
    status = "KEEP" if ann.kept else "DROP"
    print(f"[{status}] line {ann.line_index}: {ann.reason}")
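For long contexts, a per-reason tally is often easier to read than a per-line dump. The sketch below uses a stand-in LineAnnotation dataclass with the four fields described above so that it runs without the library; the sample annotations are made up, and summarize is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LineAnnotation:
    """Stand-in with the four documented fields (not the library class)."""
    line_index: int
    text: str
    kept: bool
    reason: str


def summarize(annotations):
    """Count kept and dropped lines per reason."""
    kept = Counter(a.reason for a in annotations if a.kept)
    dropped = Counter(a.reason for a in annotations if not a.kept)
    return kept, dropped


# Made-up sample data for illustration only
sample = [
    LineAnnotation(0, "def fetch(id):", True, "attention sink (always kept)"),
    LineAnnotation(1, "# filler", False, "learned retention score"),
    LineAnnotation(2, "return None", True, "learned retention score"),
]
kept, dropped = summarize(sample)
for reason, count in kept.items():
    print(f"KEEP x{count}: {reason}")
for reason, count in dropped.items():
    print(f"DROP x{count}: {reason}")
```

The same summarize function should work on the real annotations list, assuming the objects expose kept and reason as described.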

The default budget_ratio of 0.35 keeps 35% of the original tokens. This is the value used in all published benchmarks. Lower it for more aggressive compression, or raise it toward 1.0 to preserve more context. The value must be in the range (0, 1]; otherwise compress_context raises a ValueError.
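If you think in terms of a token limit rather than a fraction, you can derive the ratio yourself. The helper below is a hypothetical convenience, not a SuperCompress function; it clamps the result to 1.0 so the value stays inside (0, 1].

```python
def ratio_for_token_limit(original_tokens: int, token_limit: int) -> float:
    """Return a budget_ratio in (0, 1] that fits original_tokens into token_limit."""
    if original_tokens <= 0 or token_limit <= 0:
        raise ValueError("token counts must be positive")
    # Clamp to 1.0: a context already under the limit needs no compression.
    return min(1.0, token_limit / original_tokens)


print(ratio_for_token_limit(240, 84))   # prints 0.35
print(ratio_for_token_limit(100, 500))  # prints 1.0
```

The returned value can be passed as budget_ratio. How closely the kept token count matches the limit depends on how the library rounds its budget.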
