Headroom’s core promise is to compress LLM context without losing accuracy. This page covers compression benchmarks, accuracy evaluations, latency overhead, and production telemetry from real deployments. All results are reproducible from the open-source repository.

Compression Performance

Tested on Apple M-series (CPU) with Headroom v0.5.18. Each test runs compress() on tool outputs representative of real agent workloads.

| Content type | Original tokens | Compressed tokens | Tokens saved | Savings ratio | Latency |
|---|---|---|---|---|---|
| JSON array (100 items) | 3,163 | 297 | 2,866 | 90.6% | 1 ms |
| JSON array (500 items) | 9,526 | 1,614 | 7,912 | 83.1% | 2 ms |
| Shell output (200 lines) | 3,238 | 469 | 2,769 | 85.5% | 1 ms |
| Build log (200 lines) | 2,412 | 148 | 2,264 | 93.9% | 1 ms |
| grep results (150 hits) | 2,624 | 2,624 | 0 | 0.0% | <1 ms |
| Python source (~480 lines) | 2,958 | 2,958 | 0 | 0.0% | <1 ms |
| Total | 23,921 | 8,110 | 15,811 | 66.1% | 5 ms |

Zero compression is intentional. grep results and Python source show 0% compression. grep results are already a compact structured format. Source code passes through unchanged to preserve correctness: the CodeCompressor is gated by safety protections (recent-code protection, analysis-intent detection) that prevent it from firing in the scenarios where it would matter most. See Limitations for details.
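The saved-token and ratio columns follow directly from the token counts. As a quick check, the sketch below recomputes them from the original and compressed counts in the table above. The helper name `savings` is illustrative and is not part of Headroom.

```python
# Token counts copied from the compression table: (original, compressed).
ROWS = {
    "JSON array (100 items)": (3163, 297),
    "JSON array (500 items)": (9526, 1614),
    "Shell output (200 lines)": (3238, 469),
    "Build log (200 lines)": (2412, 148),
    "grep results (150 hits)": (2624, 2624),
    "Python source (~480 lines)": (2958, 2958),
}


def savings(original: int, compressed: int) -> tuple[int, float]:
    """Return (tokens saved, percent of original tokens saved)."""
    saved = original - compressed
    return saved, 100.0 * saved / original


for name, (orig, comp) in ROWS.items():
    saved, pct = savings(orig, comp)
    print(f"{name:28s} saved={saved:>6,d}  ratio={pct:5.1f}%")

# The total is token-weighted: total saved divided by total original tokens.
total_orig = sum(o for o, _ in ROWS.values())
total_comp = sum(c for _, c in ROWS.values())
total_saved, total_pct = savings(total_orig, total_comp)
print(f"{'Total':28s} saved={total_saved:>6,d}  ratio={total_pct:5.1f}%")
```

The total row (15,811 tokens, 66.1%) divides total saved tokens by total original tokens rather than averaging the per-row percentages, so large inputs carry more weight than small ones.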

Real Workload Savings

Savings measured on complete agent sessions, not synthetic benchmarks:

| Workload | Before (tokens) | After (tokens) | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |

The range reflects content mix: JSON-heavy workloads (search results, API responses, build logs) compress most aggressively. Sessions dominated by file reads and code changes compress less because code passes through.

Accuracy Benchmarks

Standard NLP and Reasoning Benchmarks

Accuracy is measured with and without Headroom compression applied. A delta of ±0.000 means no measurable accuracy loss at the compression ratios tested.

| Benchmark | Category | N | Baseline | Headroom | Delta |
|---|---|---|---|---|---|
| GSM8K | Math | 100 | 0.870 | 0.870 | ±0.000 |
| TruthfulQA | Factual | 100 | 0.530 | 0.560 | +0.030 |
| SQuAD v2 | QA | 100 | | 97% | at 19% compression |
| BFCL | Tools | 100 | | 97% | at 32% compression |

TruthfulQA scores slightly higher with Headroom (+0.030, or three more correct answers out of 100). A likely explanation is that removing HTML noise and verbose wrapper text helps LLMs focus on relevant content rather than boilerplate, reducing hallucination triggers.
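With N = 100 questions per benchmark, each score moves in steps of 0.01, so small deltas can sit inside sampling noise. The sketch below is a rough back-of-the-envelope check and is not part of the Headroom eval harness. It treats each score as a binomial proportion and ignores the fact that both runs use the same questions (a paired test would be tighter).

```python
import math


def standard_error(score: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(score * (1.0 - score) / n)


N = 100
# (baseline, with Headroom) scores copied from the table above.
results = {"GSM8K": (0.870, 0.870), "TruthfulQA": (0.530, 0.560)}

for name, (baseline, headroom) in results.items():
    delta = headroom - baseline
    se = standard_error(baseline, N)
    print(f"{name}: delta={delta:+.3f}, baseline standard error={se:.3f}")
```

Under these assumptions the TruthfulQA delta of +0.030 is smaller than one baseline standard error (about 0.050), so it is safest to read the result as "no accuracy loss" rather than as a confirmed improvement.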

HTML Extraction Accuracy

Dataset: Scrapinghub Article Extraction Benchmark (181 HTML pages with ground-truth article text).

| Metric | Value |
|---|---|
| F1 Score | 0.919 |
| Precision | 0.879 |
| Recall | 0.982 |
| Compression ratio | 94.9% |

For LLM applications, recall is the critical metric: 98.2% means nearly all article content is preserved. The small precision drop (some extra content included) does not hurt LLM answer quality.
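To make the precision/recall trade-off concrete, the sketch below scores an extraction against ground truth using bag-of-words token overlap. This is a simplified stand-in for illustration: the benchmark's actual scoring method is not described on this page, and the function name and sample strings are hypothetical.

```python
from collections import Counter


def overlap_scores(extracted: str, truth: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from whitespace-token multiset overlap."""
    ext = Counter(extracted.split())
    ref = Counter(truth.split())
    common = sum((ext & ref).values())  # tokens present in both, with multiplicity
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / sum(ext.values())  # share of extracted tokens that are correct
    recall = common / sum(ref.values())     # share of article tokens that survived
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


truth = "the quick brown fox jumps over the lazy dog"
# The extraction keeps the whole article but also leaks two navigation words.
extracted = "home menu the quick brown fox jumps over the lazy dog"

p, r, f1 = overlap_scores(extracted, truth)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here recall is 1.0 because nothing from the article was lost, while precision drops because of the two extra words. That is the same pattern as the 0.982 recall and 0.879 precision in the table: extra content costs a few tokens, but missing content would cost answers.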

JSON Compression (SmartCrusher)

Test: 100 production log entries with a critical error buried at position 67. Task: find the error, error code, resolution, and affected count.

| Metric | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
| Compression | | 87.6% |

SmartCrusher preserves first N items (schema discovery), last N items (recency), all anomalies (errors, warnings), and items relevant to the query context, so the critical error at position 67 is never dropped.
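The retention rules above can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not SmartCrusher's actual implementation: the function name `select_items`, the `level` field, the anomaly labels, the substring relevance check, and the sample log data (including the error code) are all hypothetical.

```python
from typing import Any

ANOMALY_LEVELS = {"error", "warning", "critical"}  # assumed anomaly labels


def select_items(
    items: list[dict[str, Any]], query: str, keep_n: int = 3
) -> list[dict[str, Any]]:
    """Keep head, tail, anomalous, and query-relevant items in original order."""
    query_terms = {term.lower() for term in query.split()}
    keep: set[int] = set()
    for i, item in enumerate(items):
        is_edge = i < keep_n or i >= len(items) - keep_n  # first N / last N
        is_anomaly = str(item.get("level", "")).lower() in ANOMALY_LEVELS
        text = " ".join(str(value) for value in item.values()).lower()
        is_relevant = any(term in text for term in query_terms)
        if is_edge or is_anomaly or is_relevant:
            keep.add(i)
    return [items[i] for i in sorted(keep)]


# 100 routine entries with one error at position 67 (1-based).
logs = [{"id": i + 1, "level": "info", "msg": "request ok"} for i in range(100)]
logs[66] = {"id": 67, "level": "error", "msg": "disk full", "code": "E507"}

kept = select_items(logs, query="E507 resolution")
print([entry["id"] for entry in kept])  # prints [1, 2, 3, 67, 98, 99, 100]
```

Seven of the 100 entries survive, and the error at position 67 is kept by the anomaly rule alone, independent of the query. A real implementation would need a more robust notion of anomaly and relevance than a field lookup and substring match.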

QA Accuracy on Extracted Content

| Metric | Original HTML | Headroom extracted | Delta |
|---|---|---|---|
| F1 Score | 0.85 | 0.87 | +0.02 |
| Exact Match | 60% | 62% | +2% |

Latency Overhead

Total Pipeline Latency

On the tool outputs in the compression-performance table above, the compression pipeline adds about 5 ms in total. On production traffic, where inputs are larger and more varied, the per-step breakdown is:

| Step | Median | P90 | Description |
|---|---|---|---|
| pipeline_total | 16.9 ms | 289 ms | Full compression pipeline |
| content_router | 11.7 ms | 259 ms | Content detection + routing |
| smart_crusher | 50.1 ms | 50 ms | JSON array compression |
| text_compressor | 32.0 ms | 576 ms | Text compression (Kompress ONNX) |
| initial_token_count | 2.9 ms | 16 ms | Token counting (tiktoken) |

ContentRouter accounts for 91–98% of pipeline cost on average. CacheAligner is sub-millisecond.

SDK Compression Latency by Input Size

Measured per-scenario on Apple M-series (CPU):

| Scenario | Tokens in | Tokens out | Tokens saved | p50 (ms) | p95 (ms) |
|---|---|---|---|---|---|
| JSON: Search results (100 items) | 10.2K | 1.5K | 8.7K | 189 | 231 |
| JSON: Search results (500 items) | 50.2K | 1.5K | 48.7K | 943 | 955 |
| JSON: Search results (1K items) | 100.5K | 1.5K | 99.0K | 2,012 | 2,198 |
| JSON: API responses (500 items) | 38.9K | 1.1K | 37.8K | 743 | 776 |
| JSON: Database rows (1K rows) | 43.7K | 605 | 43.1K | 961 | 1,104 |
| JSON: String array (100 strings) | 1.1K | 231 | 820 | 15 | 15 |
| JSON: String array (500 strings) | 4.9K | 233 | 4.6K | 72 | 80 |
| JSON: Number array (200 numbers) | 1.2K | 192 | 1.1K | 31 | 62 |
| JSON: Mixed array (250 items) | 2.3K | 368 | 1.9K | 38 | 40 |

Cost-Benefit Analysis

In nearly all tested scenarios, compression overhead is smaller than the LLM time saved by sending fewer tokens. At Claude Sonnet pricing ($3.00/MTok input):

| Scenario | Compress (ms) | LLM saved (ms) | Net benefit | Savings per 1K requests |
|---|---|---|---|---|
| JSON: Search results (100 items) | 189 | 261 | +72 ms | $26 |
| JSON: Search results (500 items) | 943 | 1,461 | +518 ms | $146 |
| JSON: Search results (1K items) | 2,012 | 2,969 | +957 ms | $297 |
| JSON: API responses (500 items) | 743 | 1,134 | +391 ms | $113 |
| JSON: Database rows (1K rows) | 961 | 1,292 | +331 ms | $129 |

Compression pays for itself in latency for 11 of 12 tested scenarios against Claude Sonnet. Slower, more expensive models (Opus) benefit even more, because each input token removed saves more time and more money.
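The net-benefit and dollar-savings figures can be reproduced with simple arithmetic. The sketch below is illustrative only: the per-token LLM latency of 0.03 ms is an assumption back-derived from the table (261 ms saved for about 8.7K tokens), not a figure stated by Headroom, and the token counts are the rounded values from the latency table, so results match the table only to within rounding.

```python
PRICE_PER_MTOK_USD = 3.0    # Claude Sonnet input price quoted above
MS_PER_INPUT_TOKEN = 0.03   # ASSUMPTION: inferred from the table, not an official figure


def cost_benefit(
    tokens_saved: int, compress_ms: float, requests: int = 1000
) -> tuple[float, float]:
    """Return (net latency benefit in ms per request, USD saved over `requests`)."""
    llm_saved_ms = tokens_saved * MS_PER_INPUT_TOKEN
    net_ms = llm_saved_ms - compress_ms
    dollars = tokens_saved * requests * PRICE_PER_MTOK_USD / 1_000_000
    return net_ms, dollars


# (tokens saved, compression p50 in ms), taken from the tables above.
SCENARIOS = {
    "Search results (100 items)": (8_700, 189),
    "Search results (500 items)": (48_700, 943),
    "Search results (1K items)": (99_000, 2_012),
    "API responses (500 items)": (37_800, 743),
    "Database rows (1K rows)": (43_100, 961),
}

for name, (saved, compress_ms) in SCENARIOS.items():
    net_ms, dollars = cost_benefit(saved, compress_ms)
    print(f"{name:28s} net={net_ms:+7.0f} ms  savings=${dollars:,.0f} per 1K requests")
```

Under this model, compression breaks even on latency when the compression time equals tokens saved multiplied by the per-token latency. Small inputs that save few tokens are the ones most likely to fall below that line.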

Production Telemetry

Real-world data from 50,000+ proxy sessions across 250+ unique instances (March–April 2026). Collected via anonymous opt-in telemetry (HEADROOM_TELEMETRY=on; telemetry is off by default).

Proxy Overhead Distribution

| Percentile | Latency |
|---|---|
| Median (P50) | 52 ms |
| P90 | 309 ms |
| P99 | 4,172 ms |
| Mean | 161 ms |

The median 52 ms overhead is negligible compared to LLM inference time (typically 2–10 seconds for streaming and 5–30 seconds for long context).
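The gap between the median and the mean in the table is what a long-tailed distribution looks like: a small number of slow requests pull the mean far above the typical case. The sketch below shows the effect with nearest-rank percentiles on synthetic latencies. The sample values are made up for illustration and are not telemetry data, and the `percentile` helper is not part of Headroom.

```python
import math
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: mostly fast, some slow, a couple of outliers.
samples = [50.0] * 85 + [300.0] * 13 + [4000.0] * 2

print("P50 :", percentile(samples, 50))
print("P90 :", percentile(samples, 90))
print("P99 :", percentile(samples, 99))
print("Mean:", statistics.mean(samples))
```

In this synthetic set, 2% of requests account for roughly half of the total latency, which is why the median is a better description of typical overhead and P99 is the number to watch for worst-case behavior.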

Compression Rate Distribution

| Percentile | Compression |
|---|---|
| P25 | 4.8% |
| Median | 4.8% |
| P75 | 6.9% |
| Mean | 11.3% |

Median compression is modest because many requests are short conversational turns. Heavy tool-use sessions (file reads, shell output, API responses) see 40–80% compression.

Fleet Summary

| Metric | Value |
|---|---|
| Healthy instances | 249 |
| Total tokens saved | 1.4 billion |
| Total cost avoided | ~$4,000 |
| OS distribution | Linux 57%, macOS 38%, Windows 5% |

Running the Evaluations

Reproduce all accuracy benchmarks with:

    # Tier 1: fast benchmarks (GSM8K, TruthfulQA, SQuAD v2, BFCL)
    python -m headroom.evals suite --tier 1

    # Full suite including HTML extraction
    git clone https://github.com/chopratejas/headroom.git && cd headroom
    pip install -e ".[evals,html]"
    pytest tests/test_evals/ -v -s

The [evals] extra is required for the eval harness. HTML benchmarks additionally need the [html] extra. Both are excluded from [all] to keep the default install lightweight.