SuperCompress reduces the number of tokens that reach your LLM’s KV cache prefill step. Fewer tokens means less GPU work, less energy consumed, and lower CO₂ emissions for the exact same workflow. The learned eviction policy runs on CPU in sub-millisecond time before any GPU inference begins, so the overhead of compression is negligible compared to the prefill cost it avoids on long contexts.
What we measure
Every compression call produces an `original_tokens` and a `kept_tokens` value. The sustainability module converts the difference into an energy and emissions estimate using a simple linear model:
| Metric | Definition |
|---|---|
| Tokens saved | original_tokens − kept_tokens per compression call |
| KV savings % | (1 − kept / original) × 100 |
| GPU-seconds avoided | Effective tokens saved ÷ throughput (tokens/sec) |
| Wh saved | GPU-seconds avoided × GPU watts ÷ 3,600 |
| CO₂ avoided | Wh saved × grid intensity (kg/kWh) ÷ 1,000 |
Effective tokens saved is the tokens saved scaled by the KV share of prefill (`kv_share_of_prefill`). This avoids over-claiming: embedding lookup, attention over new tokens, and other prefill work are excluded.
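The metric definitions above can be sketched as a small standalone function. This is an illustration of the linear model as this page describes it, not the library's implementation: the name `estimate_savings`, its keyword arguments, and the returned keys are hypothetical, and it assumes that "effective tokens saved" means tokens saved multiplied by `kv_share_of_prefill`. The default values are the ones documented on this page.

```python
def estimate_savings(
    original_tokens: int,
    kept_tokens: int,
    *,
    tokens_per_gpu_second: float = 2500.0,
    gpu_watts: float = 150.0,
    kv_share_of_prefill: float = 0.55,
    grid_kg_co2_per_kwh: float = 0.417,
) -> dict:
    """Hypothetical helper mirroring the metric table; not part of SuperCompress."""
    if original_tokens <= 0 or not 0 <= kept_tokens <= original_tokens:
        raise ValueError("need original_tokens > 0 and 0 <= kept_tokens <= original_tokens")

    tokens_saved = original_tokens - kept_tokens
    kv_savings_pct = (1 - kept_tokens / original_tokens) * 100

    # Assumption: only the KV share of prefill is credited as avoided work.
    effective_tokens = tokens_saved * kv_share_of_prefill
    gpu_seconds_avoided = effective_tokens / tokens_per_gpu_second

    # watts x seconds = joules; 3,600 joules = 1 Wh.
    wh_saved = gpu_seconds_avoided * gpu_watts / 3600

    # Grid intensity is per kWh, so divide Wh by 1,000.
    co2_kg_avoided = wh_saved * grid_kg_co2_per_kwh / 1000

    return {
        "tokens_saved": tokens_saved,
        "kv_savings_pct": kv_savings_pct,
        "gpu_seconds_avoided": gpu_seconds_avoided,
        "wh_saved": wh_saved,
        "co2_kg_avoided": co2_kg_avoided,
    }


result = estimate_savings(original_tokens=4000, kept_tokens=1000)
print(result)
```

For a call that trims 4,000 tokens to 1,000, this sketch reports 3,000 tokens saved, 75% KV savings, about 0.66 GPU-seconds avoided, and about 0.0275 Wh saved.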
Default assumptions
The defaults are defined in `supercompress/benchmarks/metrics.py` as a frozen dataclass, so every estimate is reproducible and traceable:
| Parameter | Default | Rationale |
|---|---|---|
| `tokens_per_gpu_second` | 2,500 | 7B-class prefill on a consumer GPU (e.g. RTX 3090) |
| `gpu_watts` | 150 W | Typical single-GPU sustained draw during inference |
| `kv_share_of_prefill` | 55% | Only the context/KV portion is attributed to savings |
| `grid_kg_co2_per_kwh` | 0.417 | US average grid intensity (EIA 2023) |
Override any default by creating a custom `SustainabilityAssumptions` object and passing it to `sustainability_from_tokens_saved()`.
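To show how a frozen dataclass keeps a set of assumptions traceable, here is a minimal sketch. `AssumptionsSketch` is a hypothetical stand-in written for this example; its field names follow the table above, but the real `SustainabilityAssumptions` constructor may differ. The overridden grid intensity is an illustrative value, not a sourced figure.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class AssumptionsSketch:
    """Hypothetical stand-in for SustainabilityAssumptions (fields from the table)."""

    tokens_per_gpu_second: float = 2500.0
    gpu_watts: float = 150.0
    kv_share_of_prefill: float = 0.55
    grid_kg_co2_per_kwh: float = 0.417


defaults = AssumptionsSketch()

# replace() builds a new instance and leaves the defaults untouched.
# 0.25 kg/kWh is an illustrative value, not a measured grid intensity.
low_carbon = replace(defaults, grid_kg_co2_per_kwh=0.25)

# A frozen dataclass rejects in-place mutation, so a reported estimate
# cannot silently drift away from the assumptions it was computed with.
try:
    defaults.gpu_watts = 300.0
    mutated = True
except FrozenInstanceError:
    mutated = False

print(defaults)
print(low_carbon)
print("mutated:", mutated)
```

Because each variant is a separate immutable object, the exact assumptions behind an estimate can be logged or quoted next to the result.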
Python API
Use `sustainability_from_tokens_saved()` to compute an estimate for any number of tokens saved.
Scale example
At 1 million compressions with approximately 800 tokens saved per run:
- 800 million tokens avoided from GPU prefill
- ~29 kWh of GPU energy saved (default assumptions)
- ~12 kg CO₂ avoided (US grid average)
Use the Projection calculator on the SuperCompress website (#impact) to adjust compression volume, tokens-per-run, and grid intensity interactively without writing any code.
Honesty guidance for submissions and reports
When citing SuperCompress sustainability metrics in papers, demos, or hackathon submissions, follow these principles to avoid misleading claims:
- State assumptions clearly: quote the `SustainabilityAssumptions` values used; do not present estimates as live metering.
- Report quality alongside savings: token reduction without answer quality data is not a fair comparison. Use `answer_quality_score()` or an equivalent evaluation.
- Scope the claim correctly: SuperCompress targets edge-CPU policy inference and measurable KV cache reduction, not datacenter-wide carbon accounting.
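One way to follow the first two principles is to generate the claim text from the numbers rather than write it by hand. The sketch below is a hypothetical formatter, not part of SuperCompress: `format_claim` and its parameters are invented for illustration, the quality score is assumed to be a plain float from whatever evaluation you use, and every value in the usage call is a placeholder rather than a measured result.

```python
def format_claim(
    tokens_saved: int,
    wh_saved: float,
    co2_kg_avoided: float,
    quality_score: float,
    assumptions: dict,
) -> str:
    """Hypothetical helper: estimate, quality, and assumptions in one statement."""
    assumption_text = ", ".join(f"{k}={v}" for k, v in sorted(assumptions.items()))
    return "\n".join(
        [
            # Label the figures as modelled so they are not read as live metering.
            f"Estimated (not metered): {tokens_saved:,} tokens saved, "
            f"{wh_saved:.4f} Wh, {co2_kg_avoided:.2e} kg CO2",
            # Report quality next to savings, as the guidance asks.
            f"Answer quality score: {quality_score:.2f}",
            f"Assumptions: {assumption_text}",
        ]
    )


# Placeholder inputs for illustration only.
claim = format_claim(
    tokens_saved=3000,
    wh_saved=0.0275,
    co2_kg_avoided=1.14675e-5,
    quality_score=0.93,
    assumptions={
        "tokens_per_gpu_second": 2500,
        "gpu_watts": 150,
        "kv_share_of_prefill": 0.55,
        "grid_kg_co2_per_kwh": 0.417,
    },
)
print(claim)
```

Keeping the assumptions in the same string as the estimate makes it harder to quote one without the other.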