What is Calibration?

Calibration measures whether your forecast probabilities match reality. A perfectly calibrated forecaster’s predictions come true at the rate they predict. Examples:
  • If you forecast 70% on 100 questions, roughly 70 should resolve “Yes”
  • If you forecast 30% on 100 questions, roughly 30 should resolve “Yes”
  • If you’re well-calibrated, your confidence matches your accuracy
Calibration is different from resolution! You can be well-calibrated with 70% predictions that come true 70% of the time, even though 30% “fail.”

Calibration Curves

A calibration curve visualizes how well your forecast probabilities align with actual outcomes.

Reading a Calibration Curve

Axes:
  • X-axis: Your predicted probability (0-100%)
  • Y-axis: Actual frequency of “Yes” resolutions (0-100%)
Perfect Calibration:
  • Points fall on the diagonal line (y = x)
  • 30% predictions resolve Yes 30% of the time
  • 80% predictions resolve Yes 80% of the time
Overconfidence:
  • Points fall below the diagonal
  • You predict 80%, but only 60% resolve Yes
  • Your confidence exceeds your accuracy
Underconfidence:
  • Points fall above the diagonal
  • You predict 40%, but 60% resolve Yes
  • You’re more accurate than you think
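To make this concrete, here is a minimal plotting sketch (not the site's own charting code) that draws a calibration curve with matplotlib; the bin values are made-up illustrative numbers showing mild overconfidence at the high end.

```python
import matplotlib.pyplot as plt

# Illustrative, made-up data: mean predicted probability per bin (x)
# and observed "Yes" frequency per bin (y).
predicted = [0.10, 0.30, 0.50, 0.70, 0.90]
observed  = [0.15, 0.28, 0.45, 0.60, 0.75]  # below the diagonal at the top -> overconfident

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], color="gray", linestyle="--", label="Perfect calibration (y = x)")
ax.scatter(predicted, observed, color="goldenrod", marker="D", label="Your bins")
ax.set_xlabel("Predicted probability")
ax.set_ylabel("Actual frequency of Yes")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.legend()
plt.show()
```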

Calibration Curve Components

Gold Diamonds:
  • Your actual calibration data points
  • Each represents a probability bin (e.g., 10-20%, 20-30%)
  • Vertical position = actual resolution frequency
Gray Confidence Interval:
  • Shows 90% confidence interval for perfect calibration
  • Accounts for statistical uncertainty
  • Points within this band are consistent with good calibration
Dark Gray Line:
  • The median of perfect calibration
  • Your target for each bin
  • The closer your points are to this line, the better
Bin Structure: Calibration uses variable bin sizes:
  • Smaller bins near extremes (0-4%, 96-100%)
  • Larger bins in middle (e.g., 12.5-17.5%)
  • Accounts for fewer predictions at extremes

How Calibration is Calculated

Metaculus calculates calibration curves using a coverage-weighted binning procedure:
Data Collection:
  1. Filter Questions: Only resolved binary questions from past 5 years
  2. Extract Forecasts: Get all your forecasts on these questions
  3. Weight by Coverage: Each forecast weighted by time active
Coverage Weight Formula:
w = (t_end - t_start) / T_question
Where:
  • t_start = When your forecast became active
  • t_end = When your forecast ended or question closed
  • T_question = Total question duration (open to close)
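As a rough sketch (our own helper, not Metaculus's code), this weight can be computed from the forecast and question timestamps:

```python
from datetime import datetime

def coverage_weight(forecast_start, forecast_end, question_open, question_close):
    """Fraction of the question's lifetime during which this forecast was active.

    Hypothetical helper mirroring w = (t_end - t_start) / T_question above.
    """
    t_start = max(forecast_start, question_open)
    t_end = min(forecast_end, question_close)
    total = (question_close - question_open).total_seconds()
    active = max((t_end - t_start).total_seconds(), 0.0)
    return active / total

# Example: a forecast active for the last 25 days of a 100-day question gets w = 0.25.
print(coverage_weight(
    datetime(2024, 3, 16), datetime(2024, 4, 10),
    datetime(2024, 1, 1), datetime(2024, 4, 10),
))
```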
Binning: Forecasts are grouped into probability bins:
| Bin Range | Example Forecasts |
| --- | --- |
| 0.0 - 1.4% | Extreme “No” |
| 1.4 - 2.8% | Very confident “No” |
| 2.8 - 4.2% | Confident “No” |
| 12.5 - 17.5% | Low probability |
| 47.5 - 52.5% | Uncertain |
| 82.5 - 87.5% | High probability |
| 95.8 - 97.2% | Confident “Yes” |
| 97.2 - 98.6% | Very confident “Yes” |
| 98.6 - 100% | Extreme “Yes” |
Weighted Average: For each bin:
Calibration = (Σ (w_i × r_i)) / (Σ w_i)
Where:
  • w_i = Coverage weight for forecast i
  • r_i = Resolution (1 for Yes, 0 for No)
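A minimal sketch of that per-bin weighted average, using simplified equal-width bins for illustration rather than the exact variable-width edges listed above:

```python
from bisect import bisect_right

# Simplified equal-width bins; the real curve uses the variable-width bins
# listed in the table above.
BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def binned_calibration(forecasts):
    """forecasts: iterable of (probability, coverage_weight, resolution),
    with resolution 1 for Yes and 0 for No."""
    sums = {}
    for p, w, r in forecasts:
        # Index of the bin containing p (clamp p = 1.0 into the last bin).
        b = min(bisect_right(BIN_EDGES, p) - 1, len(BIN_EDGES) - 2)
        num, den = sums.get(b, (0.0, 0.0))
        sums[b] = (num + w * r, den + w)
    # Per-bin calibration = sum(w_i * r_i) / sum(w_i)
    return {b: num / den for b, (num, den) in sums.items() if den > 0}

# Example: three weighted forecasts in the 60-80% bin; the weighted Yes rate is 0.6.
print(binned_calibration([(0.70, 1.0, 1), (0.65, 0.5, 1), (0.75, 1.0, 0)]))
```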
Confidence Intervals: Using a binomial distribution:
  • Lower CI: 5th percentile with p = bin_min
  • Upper CI: 95th percentile with p = bin_max
  • Accounts for sample size in each bin
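A sketch of the idea behind that band, using scipy's binomial distribution (our reconstruction, not the exact implementation): with n forecasts in a bin, the band shows how far the observed frequency could plausibly wander even under perfect calibration.

```python
from scipy.stats import binom

def perfect_calibration_band(n, p_low, p_high):
    """90% band for a bin with n forecasts, assuming true probabilities
    between p_low and p_high (the bin edges), as described above."""
    lower = binom.ppf(0.05, n, p_low) / n    # 5th percentile at the bin's lower edge
    upper = binom.ppf(0.95, n, p_high) / n   # 95th percentile at the bin's upper edge
    return lower, upper

# Example: with only 20 forecasts in the 47.5-52.5% bin, the band is wide.
print(perfect_calibration_band(20, 0.475, 0.525))   # roughly (0.30, 0.70)
```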

Viewing Your Calibration

Personal Track Record

  1. Go to your profile
  2. Click Track Record tab
  3. Scroll to Calibration Curve section
  4. View your calibration across all resolved binary questions

Community Calibration

  • View aggregate calibration for different aggregation methods
  • Compare Recency Weighted vs Unweighted vs Pros
  • See how collective intelligence calibrates

Common Calibration Patterns

Pattern 1: General Overconfidence

Symptoms:
  • Most points below diagonal
  • Especially at extremes (0-20%, 80-100%)
  • 90% predictions only resolve 75% of the time
Causes:
  • Not accounting for unknown unknowns
  • Confirmation bias
  • Insufficient research
  • Overweighting recent information
Fix:
  • Move predictions toward 50%
  • Add uncertainty buffer
  • Consider base rates more heavily
  • Challenge your assumptions

Pattern 2: General Underconfidence

Symptoms:
  • Most points above diagonal
  • Especially in middle ranges
  • 60% predictions resolve 75% of the time
Causes:
  • Overcorrecting for overconfidence
  • Discounting your research
  • Overvaluing contrarian views
  • Excessive hedging
Fix:
  • Trust your analysis more
  • Move predictions away from 50%
  • Don’t add unnecessary uncertainty
  • Confidence is justified when earned

Pattern 3: Extreme Avoidance

Symptoms:
  • Few forecasts below 20% or above 80%
  • Clustering around 30-70%
  • Missing obvious high-confidence situations
Causes:
  • Fear of being wrong with high confidence
  • Not recognizing slam dunks
  • Overthinking simple questions
Fix:
  • Make extreme predictions when warranted
  • Don’t artificially hedge
  • Some questions deserve 5% or 95%

Pattern 4: Central Overconfidence

Symptoms:
  • Good calibration at extremes
  • Poor calibration near 50%
  • 50% predictions resolve Yes only 30% (or as often as 70%) of the time
Causes:
  • Using 50% as “don’t know”
  • Insufficient research on unclear questions
  • Lazy forecasting in uncertain cases
Fix:
  • Never use 50% as default
  • Research harder on uncertain questions
  • Find any small edge (51% vs 49%)

Improving Your Calibration

1. Regular Practice

Calibration Training:
  • Forecast on 50+ resolved questions per year
  • Track your calibration monthly
  • Focus on binary questions initially
  • Review results systematically

2. Deliberate Probability Selection

Framework:
| Confidence Level | Probability | When to Use |
| --- | --- | --- |
| Extreme certainty | 95-99% | Overwhelming evidence, slam dunks |
| High confidence | 75-90% | Strong evidence, few unknowns |
| Moderate confidence | 60-75% | Decent evidence, some uncertainty |
| Slight edge | 51-59% | Weak evidence, high uncertainty |
| True toss-up | 50% | No evidence either direction |
| Slight doubt | 41-49% | Weak counter-evidence |
| Moderate doubt | 25-40% | Decent counter-evidence |
| High doubt | 10-25% | Strong counter-evidence |
| Extreme doubt | 1-10% | Overwhelming counter-evidence |

3. Update Based on Calibration

Use your calibration curve to adjust.
If overconfident overall:
  • Before submitting X%, ask “Am I sure enough?”
  • Add 10-20% uncertainty buffer
  • Move 90% → 75%, 80% → 65%, etc.
If underconfident overall:
  • Before submitting X%, ask “Do I have more evidence?”
  • Reduce uncertainty buffer
  • Move 60% → 70%, 70% → 80%, etc.
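One mechanical way to apply such a correction (purely illustrative, not an official recommendation) is to shrink forecasts toward 50% by a factor chosen from your own curve, or push them away from 50% if you run underconfident:

```python
def adjust_probability(p, shrink=0.5):
    """Shrink a probability toward 0.5 (positive shrink) to counter overconfidence,
    or push it away from 0.5 (negative shrink) to counter underconfidence.
    Purely illustrative; pick the factor by inspecting your own calibration curve."""
    adjusted = 0.5 + (p - 0.5) * (1 - shrink)
    return min(max(adjusted, 0.01), 0.99)   # keep away from 0% and 100%

print(adjust_probability(0.90, shrink=0.5))   # 0.70, close to the 90% -> 75% move above
print(adjust_probability(0.60, shrink=-1.0))  # 0.70, the 60% -> 70% move above
```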

4. Base Rate Focus

What are Base Rates? The historical frequency of similar events.
Example Question: “Will Candidate X win the election?”
  1. Find base rate: Incumbents win 75% of the time
  2. Start there: Begin forecast at 75%
  3. Adjust for specifics: Candidate’s polling, economy, etc.
  4. Don’t stray too far: Rarely justified to go below 50% or above 90%
Benefits:
  • Prevents extreme predictions
  • Anchors on reality
  • Improves calibration naturally

5. Post-Resolution Analysis

After questions resolve:
  1. Review your forecasts: What probability did you assign?
  2. Check the outcome: Did it match your forecast?
  3. Identify patterns: Are you consistently over/under in certain domains?
  4. Adjust strategy: Change your approach for similar questions
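To spot the domain-level patterns in step 3, here is a small sketch with hypothetical records that compares your average forecast to the actual Yes rate per category; a large positive gap suggests overconfidence in that domain.

```python
from collections import defaultdict

# Hypothetical records: (domain, forecast probability, resolution 1 for Yes / 0 for No)
records = [
    ("politics", 0.80, 1), ("politics", 0.70, 0), ("politics", 0.90, 0),
    ("sports",   0.40, 1), ("sports",   0.55, 1),
]

by_domain = defaultdict(list)
for domain, p, r in records:
    by_domain[domain].append((p, r))

for domain, rows in by_domain.items():
    avg_forecast = sum(p for p, _ in rows) / len(rows)
    yes_rate = sum(r for _, r in rows) / len(rows)
    gap = avg_forecast - yes_rate   # positive -> overconfident in this domain
    print(f"{domain}: forecast {avg_forecast:.0%}, outcome {yes_rate:.0%}, gap {gap:+.0%}")
```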

Calibration vs. Other Metrics

Calibration vs. Brier Score

Calibration:
  • Measures probability accuracy
  • “Do 70% forecasts come true 70% of the time?”
  • Pure probability assessment
Brier Score:
  • Measures overall prediction error
  • Rewards both calibration AND resolution
  • Being right is better than being calibrated
Example:
  • Forecasting 70% on everything that resolves Yes 70% of the time = perfectly calibrated but mediocre Brier
  • Forecasting 95% on Yes outcomes and 5% on No outcomes = may have worse calibration but better Brier
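A quick worked comparison of those two forecasters, where the Brier score is the mean of (forecast - outcome)² and lower is better:

```python
def brier(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

outcomes = [1] * 7 + [0] * 3            # 10 questions, 7 resolve Yes

calibrated = [0.7] * 10                 # always 70%: perfectly calibrated here
sharp      = [0.95] * 7 + [0.05] * 3    # confidently right on each question

print(brier(calibrated, outcomes))      # 0.21
print(brier(sharp, outcomes))           # 0.0025
```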

Calibration vs. Peer Score

Calibration:
  • You vs. reality
  • Absolute accuracy measure
  • Independent of other forecasters
Peer Score:
  • You vs. community
  • Relative accuracy measure
  • Depends on others’ forecasts
Can diverge:
  • Poor calibration but good Peer Score: You and community are both overconfident, but you’re less wrong
  • Good calibration but poor Peer Score: You’re well-calibrated but the (poorly-calibrated) community outperforms you

Technical Implementation

Calibration calculation is implemented in users/services/profile_stats.py:
  • get_calibration_curve_data() - Main calibration function
  • Coverage-weighted by forecast duration
  • Variable bin sizes for statistical robustness
  • Binomial confidence intervals
  • 5-year lookback window

Advanced: Calibration for Continuous Questions

While calibration curves focus on binary questions, continuous questions have analogous metrics.
Prediction Intervals:
  • Do your 80% intervals contain the outcome 80% of the time?
  • Do your 50% intervals contain it 50% of the time?
Calibration Principle: Same core idea - your stated confidence should match reality.
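A sketch of that check with hypothetical intervals and outcomes: record each stated interval and the resolved value, then count how often the value lands inside.

```python
# Hypothetical 80% prediction intervals (low, high) and the values the questions resolved to.
intervals = [(10, 25), (0.5, 2.0), (100, 180), (3, 9), (40, 70)]
resolved  = [22, 2.4, 150, 5, 90]

hits = sum(low <= value <= high for (low, high), value in zip(intervals, resolved))
coverage = hits / len(intervals)
print(f"80% intervals contained the outcome {coverage:.0%} of the time")  # 60% here
```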
