SuperCompress ships with a reproducible benchmark suite that measures eviction policy quality across three independent axes: oracle recall (did we keep the tokens that actually answer the question?), entity recall (did we keep the named entities from the query?), and latency. All reported numbers were produced by running scripts/benchmark_web.py over 8 random seeds at a 35% token budget, using synthetically generated long contexts from supercompress.simulator. The suite does not require an external LLM: quality is measured by comparing the compressed output against ground-truth oracle labels derived from the source text.
Benchmark Results
Policy comparison at budget_ratio=0.35, averaged over 8 seeds:
| Policy | Oracle recall | Entity recall | Latency |
|---|---|---|---|
| FIFO / Truncation | 25% | 73% | ~57 ms |
| Summarization | 61% | 65% | ~63 ms |
| H2O | 98% | 73% | ~56 ms |
| SuperCompress | 100% | 73% | ~60 ms |
KV savings are held constant across all policies at ~65% (the budget is the same for all runs). The table focuses on quality differences that emerge from which tokens each policy chooses to keep.
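The relationship between the budget and the ~65% savings figure is simple arithmetic. The sketch below is illustrative only: the function name kv_savings and the rounding to whole tokens are assumptions, not part of the SuperCompress API.

```python
def kv_savings(total_tokens: int, budget_ratio: float) -> float:
    """Fraction of KV entries dropped when only budget_ratio of tokens is kept."""
    # Assumption: the number of kept tokens is rounded to the nearest whole token.
    kept = round(total_tokens * budget_ratio)
    return 1.0 - kept / total_tokens


# A 2,000-token context at budget_ratio=0.35 keeps 700 tokens.
print(f"{kv_savings(2000, 0.35):.0%}")  # 65%
```

Because every policy receives the same budget, this value is identical across rows of the table; only the choice of which tokens to keep differs.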
Metric Definitions
All metrics are implemented in supercompress/benchmarks/metrics.py.
oracle_recall
Marks the tokens needed to answer the question using mark_oracle_important (a heuristic based on entity matching and structural code patterns), then measures the fraction of those tokens that appear in the set of kept positions. A score of 1.0 means every answer-bearing token was retained.
Range: 0.0 – 1.0. Higher is better. This is the primary quality metric.
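As a rough illustration of the definition above, oracle recall can be sketched as a set intersection over token positions. The function name oracle_recall_sketch, its signature, and the handling of an empty oracle set are assumptions; the real implementation in metrics.py may differ.

```python
def oracle_recall_sketch(oracle_positions: set[int], kept_positions: set[int]) -> float:
    """Fraction of oracle-important token positions that survived eviction."""
    if not oracle_positions:
        # Assumption: with nothing to recall, treat the score as perfect.
        return 1.0
    return len(oracle_positions & kept_positions) / len(oracle_positions)


# Four answer-bearing positions, three of which were kept.
print(oracle_recall_sketch({3, 4, 10, 11}, {0, 1, 3, 4, 10}))  # 0.75
```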
entity_recall
Extracts entities from the question with extract_question_entities, then checks how many of those entities appear (as substrings) in the compressed text. A score of 1.0 means all question entities are present in the output.
Range: 0.0 – 1.0. Higher is better.
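The substring check described above might look like the following sketch. The name entity_recall_sketch, the case-sensitive matching, and the score for an empty entity list are assumptions made for illustration.

```python
def entity_recall_sketch(entities: list[str], compressed_text: str) -> float:
    """Fraction of question entities found as substrings of the compressed text."""
    if not entities:
        # Assumption: a question with no entities scores 1.0.
        return 1.0
    # Assumption: matching is case-sensitive.
    found = sum(1 for entity in entities if entity in compressed_text)
    return found / len(entities)


text = "def parse_config(path):\n    return load(path)"
print(entity_recall_sketch(["parse_config", "retry_limit"], text))  # 0.5
```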
answer_quality_score
A weighted combination of two terms:
- 65% weight on entity_recall
- 35% weight on a pattern score: presence of def <entity>, <entity> =, and bare entity substring matches in the compressed text
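The weighting can be sketched as below. The 65/35 split comes from the text; how the three pattern checks are combined into a single pattern score is not specified here, so averaging them per entity is purely an assumption, as are all function names.

```python
def entity_recall_sketch(entities: list[str], text: str) -> float:
    """Fraction of entities present as substrings (1.0 if there are none)."""
    if not entities:
        return 1.0
    return sum(e in text for e in entities) / len(entities)


def pattern_score_sketch(entities: list[str], text: str) -> float:
    """Assumed combination: per entity, the share of the three patterns that match."""
    if not entities:
        return 1.0
    total = 0.0
    for e in entities:
        checks = [f"def {e}" in text, f"{e} =" in text, e in text]
        total += sum(checks) / len(checks)
    return total / len(entities)


def answer_quality_sketch(entities: list[str], text: str) -> float:
    """65% entity recall plus 35% pattern score."""
    return 0.65 * entity_recall_sketch(entities, text) + 0.35 * pattern_score_sketch(entities, text)


text = "def parse_config(path):\n    retry_limit = 3"
print(f"{answer_quality_sketch(['parse_config', 'retry_limit'], text):.3f}")
```

In this example both entities are present, and each matches two of the three patterns, so the pattern score is 2/3 under the assumed averaging.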
sustainability_from_tokens_saved
Converts tokens saved into estimated CO₂ savings using the constants in SustainabilityAssumptions. Pass a custom SustainabilityAssumptions instance to override any constant for your hardware or grid region.
| Constant | Value |
|---|---|
| tokens_per_gpu_second | 2,500 |
| gpu_watts | 150 W |
| grid_kg_co2_per_kwh | 0.417 kg CO₂/kWh (US grid average) |
| kv_share_of_prefill | 55% |
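To show how constants like these could combine into a CO₂ figure, here is a minimal sketch. The field names and default values mirror the table, but the class AssumptionsSketch, the function co2_saved_kg_sketch, and the formula itself are assumptions; the actual sustainability_from_tokens_saved may compute its estimate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssumptionsSketch:
    """Hypothetical stand-in for SustainabilityAssumptions (same field names)."""
    tokens_per_gpu_second: float = 2500.0
    gpu_watts: float = 150.0
    grid_kg_co2_per_kwh: float = 0.417
    kv_share_of_prefill: float = 0.55


def co2_saved_kg_sketch(tokens_saved: int, a: AssumptionsSketch = AssumptionsSketch()) -> float:
    """Assumed model: GPU time avoided -> energy avoided -> grid emissions avoided."""
    # Only the KV-related share of prefill time is counted as saved.
    gpu_seconds = tokens_saved / a.tokens_per_gpu_second * a.kv_share_of_prefill
    kwh = gpu_seconds * a.gpu_watts / 3_600_000  # watt-seconds to kWh
    return kwh * a.grid_kg_co2_per_kwh


print(f"{co2_saved_kg_sketch(1_000_000):.4f} kg CO2")  # 0.0038 kg CO2

# Overriding one constant for a cleaner grid region (value is illustrative).
print(co2_saved_kg_sketch(1_000_000, AssumptionsSketch(grid_kg_co2_per_kwh=0.1)))
```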
These are illustrative estimates, not per-deployment measurements. Adjust SustainabilityAssumptions fields for your own hardware and grid region.
Running Benchmarks Yourself
All benchmark scripts are included in the repository and require only the [dev] extras.
Charts are written to web/assets/img/chart-kv-savings.svg, chart-oracle-recall.svg, and chart-impact.svg.
Running a Custom Policy Comparison
Use compare_policies() for a quick single-seed comparison, or run_policy_benchmarks() for the full multi-seed statistical run:
The run_policy_benchmarks return value includes:
- summary: per-policy averages for all metrics
- runs: raw per-seed rows with full metric breakdowns
- headline: top-level KV savings and quality numbers
- assumptions: the SustainabilityAssumptions used for CO₂ estimates
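The relationship between per-seed rows and per-policy averages can be sketched as follows. The row layout (dictionaries with "policy" and "seed" keys plus numeric metrics), the function name summarize_runs_sketch, and the sample values are all hypothetical; they are not the library's actual return schema or real benchmark results.

```python
from collections import defaultdict
from statistics import mean


def summarize_runs_sketch(runs: list[dict]) -> dict[str, dict[str, float]]:
    """Average every numeric metric per policy across seeds."""
    by_policy: dict[str, list[dict]] = defaultdict(list)
    for row in runs:
        by_policy[row["policy"]].append(row)

    summary = {}
    for policy, rows in by_policy.items():
        # Every key other than the identifiers is treated as a metric.
        metrics = [k for k in rows[0] if k not in ("policy", "seed")]
        summary[policy] = {m: mean(r[m] for r in rows) for m in metrics}
    return summary


# Made-up rows for illustration only.
runs = [
    {"policy": "fifo", "seed": 0, "oracle_recall": 0.2, "latency_ms": 55.0},
    {"policy": "fifo", "seed": 1, "oracle_recall": 0.3, "latency_ms": 59.0},
    {"policy": "h2o", "seed": 0, "oracle_recall": 1.0, "latency_ms": 56.0},
]
print(summarize_runs_sketch(runs))
```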
What the Benchmarks Claim and Don’t Claim
What we claim
- Learned CPU eviction beats truncation on oracle recall at similar KV savings
- Policy size is ~5K parameters — fits comfortably on any CPU
- Benchmarks and tests are fully reproducible from the public repository
- Environmental estimates use documented, adjustable assumptions
What we don't claim
- Live datacenter energy metering — CO₂ numbers use modeled assumptions
- That every workload matches the synthetic benchmark seeds
- Superiority on tasks outside the benchmark distribution
- Per-deployment measurement without adjusting SustainabilityAssumptions