What is Calibration?
Calibration measures whether your forecast probabilities match reality. A perfectly calibrated forecaster’s predictions come true at the rate they predict. Examples:
- If you forecast 70% on 100 questions, roughly 70 should resolve “Yes”
- If you forecast 30% on 100 questions, roughly 30 should resolve “Yes”
- If you’re well-calibrated, your confidence matches your accuracy
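To make this concrete, here is a minimal sketch (hypothetical data, not part of the Metaculus codebase) that groups forecasts by predicted probability and compares them to actual resolution rates:

```python
from collections import defaultdict

# Hypothetical (predicted probability, resolution) pairs; resolution is 1 for Yes, 0 for No.
forecasts = [(0.7, 1), (0.7, 1), (0.7, 0), (0.3, 0), (0.3, 1), (0.3, 0)]

totals = defaultdict(lambda: [0, 0])  # probability -> [yes_count, total_count]
for prob, resolved_yes in forecasts:
    totals[prob][0] += resolved_yes
    totals[prob][1] += 1

for prob, (yes, n) in sorted(totals.items()):
    print(f"Predicted {prob:.0%}: resolved Yes {yes / n:.0%} of the time (n={n})")
```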
Calibration Curves
A calibration curve visualizes how well your forecast probabilities align with actual outcomes.
Reading a Calibration Curve
Axes:
- X-axis: Your predicted probability (0-100%)
- Y-axis: Actual frequency of “Yes” resolutions (0-100%)
Perfect calibration:
- Points fall on the diagonal line (y = x)
- 30% predictions resolve Yes 30% of the time
- 80% predictions resolve Yes 80% of the time
Overconfidence:
- Points fall below the diagonal
- You predict 80%, but only 60% resolve Yes
- Your confidence exceeds your accuracy
Underconfidence:
- Points fall above the diagonal
- You predict 40%, but 60% resolve Yes
- You’re more accurate than you think
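To see these patterns for yourself, here is a rough plotting sketch using matplotlib and hypothetical binned data (not the actual Metaculus chart); points pulled toward 50% relative to the diagonal indicate overconfidence:

```python
import matplotlib.pyplot as plt

# Hypothetical binned calibration data: bin midpoints vs. observed Yes frequency.
predicted = [0.05, 0.15, 0.30, 0.50, 0.70, 0.85, 0.95]
observed = [0.10, 0.22, 0.38, 0.50, 0.62, 0.78, 0.90]  # slightly overconfident forecaster

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration (y = x)")
plt.plot(predicted, observed, "D", color="goldenrod", label="Your calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Actual frequency of Yes")
plt.legend()
plt.show()
```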
Calibration Curve Components
Understanding the Chart Elements
Gold Diamonds:
- Your actual calibration data points
- Each represents a probability bin (e.g., 10-20%, 20-30%)
- Vertical position = actual resolution frequency
Shaded Confidence Band:
- Shows 90% confidence interval for perfect calibration
- Accounts for statistical uncertainty
- Points within this band are consistent with good calibration
Perfect Calibration Line:
- The median of perfect calibration
- Your target for each bin
- The closer your points are to this line, the better
Variable Bin Sizes:
- Smaller bins near extremes (0-4%, 96-100%)
- Larger bins in middle (e.g., 12.5-17.5%)
- Accounts for fewer predictions at extremes
How Calibration is Calculated
Metaculus calculates calibration curves using a sophisticated algorithm:
Calibration Algorithm Details
Data Collection:
- Filter Questions: Only resolved binary questions from the past 5 years
- Extract Forecasts: Get all your forecasts on these questions
- Weight by Coverage: Each forecast is weighted by the time it was active:

coverage = (t_end - t_start) / T_question

Where:
- t_start = When your forecast became active
- t_end = When your forecast ended or the question closed
- T_question = Total question duration (open to close)

Binning: Forecasts are grouped into variable-width probability bins:
| Bin Range | Example Forecasts |
|---|---|
| 0.0 - 1.4% | Extreme “No” |
| 1.4 - 2.8% | Very confident “No” |
| 2.8 - 4.2% | Confident “No” |
| 12.5 - 17.5% | Low probability |
| 47.5 - 52.5% | Uncertain |
| 82.5 - 87.5% | High probability |
| 95.8 - 97.2% | Confident “Yes” |
| 97.2 - 98.6% | Very confident “Yes” |
| 98.6 - 100% | Extreme “Yes” |
Weighted Average: For each bin:

calibration = Σ(w_i × r_i) / Σ(w_i)

Where:
- w_i = Coverage weight for forecast i
- r_i = Resolution (1 for Yes, 0 for No)

Confidence Intervals: Binomial confidence intervals around perfect calibration:
- Lower CI: 5th percentile with p = bin_min
- Upper CI: 95th percentile with p = bin_max
- Accounts for sample size in each bin
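The following is a simplified, illustrative sketch of this pipeline in Python. It is not the actual Metaculus code; the function names, bin handling, and exact confidence-interval construction here are assumptions based on the description above:

```python
from scipy import stats


def coverage_weight(t_start, t_end, question_open, question_close):
    """Fraction of the question's lifetime during which a forecast was active."""
    return (t_end - t_start) / (question_close - question_open)


def calibration_curve(forecasts, bins):
    """
    forecasts: list of (probability, resolution, weight) tuples,
               where resolution is 1 for Yes and 0 for No.
    bins: list of (bin_min, bin_max) probability ranges.
    Returns one point per non-empty bin plus a 90% band for perfect calibration.
    """
    curve = []
    for bin_min, bin_max in bins:
        in_bin = [(p, r, w) for p, r, w in forecasts if bin_min <= p < bin_max]
        if not in_bin:
            continue
        total_w = sum(w for _, _, w in in_bin)
        # Coverage-weighted average resolution in this bin.
        observed = sum(r * w for _, r, w in in_bin) / total_w
        n = len(in_bin)
        # Band that a perfectly calibrated forecaster would stay inside 90% of the time:
        # 5th percentile at p = bin_min, 95th percentile at p = bin_max.
        lower = stats.binom.ppf(0.05, n, bin_min) / n
        upper = stats.binom.ppf(0.95, n, bin_max) / n
        curve.append({"bin": (bin_min, bin_max), "observed": observed,
                      "n": n, "band": (lower, upper)})
    return curve
```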
Viewing Your Calibration
Personal Track Record
- Go to your profile
- Click Track Record tab
- Scroll to Calibration Curve section
- View your calibration across all resolved binary questions
Community Calibration
- View aggregate calibration for different aggregation methods
- Compare Recency Weighted vs Unweighted vs Pros
- See how collective intelligence calibrates
Common Calibration Patterns
Pattern 1: General Overconfidence
Symptoms:
- Most points below diagonal
- Especially at extremes (0-20%, 80-100%)
- 90% predictions resolve Yes only 75% of the time
Causes:
- Not accounting for unknown unknowns
- Confirmation bias
- Insufficient research
- Overweighting recent information
Fixes:
- Move predictions toward 50%
- Add uncertainty buffer
- Consider base rates more heavily
- Challenge your assumptions
Pattern 2: General Underconfidence
Symptoms:
- Most points above diagonal
- Especially in middle ranges
- 60% predictions resolve Yes 75% of the time
Causes:
- Overcorrecting for overconfidence
- Discounting your research
- Overvaluing contrarian views
- Excessive hedging
Fixes:
- Trust your analysis more
- Move predictions away from 50%
- Don’t add unnecessary uncertainty
- Confidence is justified when earned
Pattern 3: Extreme Avoidance
Symptoms:
- Few forecasts below 20% or above 80%
- Clustering around 30-70%
- Missing obvious high-confidence situations
Causes:
- Fear of being wrong with high confidence
- Not recognizing slam dunks
- Overthinking simple questions
Fixes:
- Make extreme predictions when warranted
- Don’t artificially hedge
- Some questions deserve 5% or 95%
Pattern 4: Central Overconfidence
Symptoms:
- Good calibration at extremes
- Poor calibration near 50%
- 50% predictions resolve Yes only 30% or 70% of the time
Causes:
- Using 50% as “don’t know”
- Insufficient research on unclear questions
- Lazy forecasting in uncertain cases
Fixes:
- Never use 50% as default
- Research harder on uncertain questions
- Find any small edge (51% vs 49%)
Improving Your Calibration
1. Regular Practice
2. Deliberate Probability Selection
Framework:
| Confidence Level | Probability | When to Use |
|---|---|---|
| Extreme certainty | 95-99% | Overwhelming evidence, slam dunks |
| High confidence | 75-90% | Strong evidence, few unknowns |
| Moderate confidence | 60-75% | Decent evidence, some uncertainty |
| Slight edge | 51-59% | Weak evidence, high uncertainty |
| True toss-up | 50% | No evidence either direction |
| Slight doubt | 41-49% | Weak counter-evidence |
| Moderate doubt | 25-40% | Decent counter-evidence |
| High doubt | 10-25% | Strong counter-evidence |
| Extreme doubt | 1-10% | Overwhelming counter-evidence |
3. Update Based on Calibration
Use your calibration curve to adjust:
If overconfident overall:
- Before submitting X%, ask “Am I sure enough?”
- Add 10-20% uncertainty buffer
- Move 90% → 75%, 80% → 65%, etc.
If underconfident overall:
- Before submitting X%, ask “Do I have more evidence?”
- Reduce uncertainty buffer
- Move 60% → 70%, 70% → 80%, etc.
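One mechanical way to apply such a buffer (a hypothetical helper, not a Metaculus feature) is to shrink or stretch your probability relative to 50%:

```python
def adjust_forecast(p, factor):
    """
    Shrink (factor < 1) or stretch (factor > 1) a probability relative to 50%.
    factor ~0.7 moves 90% to ~78%; factor ~1.3 moves 60% to ~63%.
    """
    adjusted = 0.5 + (p - 0.5) * factor
    return min(max(adjusted, 0.01), 0.99)  # keep away from 0% and 100%


print(adjust_forecast(0.90, 0.7))  # overconfident: pull toward 50% -> 0.78
print(adjust_forecast(0.60, 1.3))  # underconfident: push away from 50% -> 0.63
```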
4. Base Rate Focus
Using Base Rates
What are Base Rates? The historical frequency of similar events.
Example question: “Will Candidate X win the election?”
- Find base rate: Incumbents win 75% of the time
- Start there: Begin forecast at 75%
- Adjust for specifics: Candidate’s polling, economy, etc.
- Don’t stray too far: Rarely justified to go below 50% or above 90%
Why base rates help:
- Prevents extreme predictions
- Anchors on reality
- Improves calibration naturally
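A common way to formalize “start at the base rate, then adjust” is to work in log-odds, as in this illustrative sketch with hypothetical numbers:

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


# Start from the base rate, then adjust in log-odds space for question-specific evidence.
base_rate = 0.75        # e.g., incumbents win 75% of the time (hypothetical figure)
evidence_shift = -0.4   # weak polling: a modest negative log-odds adjustment (hypothetical)
forecast = inv_logit(logit(base_rate) + evidence_shift)
print(f"{forecast:.0%}")  # ~67%, still anchored near the base rate
```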
5. Post-Resolution Analysis
After questions resolve:
- Review your forecasts: What probability did you assign?
- Check the outcome: Did it match your forecast?
- Identify patterns: Are you consistently over/under in certain domains?
- Adjust strategy: Change your approach for similar questions
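If you keep a record of your resolved forecasts, a quick analysis along these lines (hypothetical data and column names) can surface domain-specific patterns:

```python
import pandas as pd

# Hypothetical export of your resolved forecasts.
records = [
    {"domain": "politics", "probability": 0.8, "resolved_yes": 1},
    {"domain": "politics", "probability": 0.7, "resolved_yes": 0},
    {"domain": "science", "probability": 0.3, "resolved_yes": 0},
    {"domain": "science", "probability": 0.4, "resolved_yes": 1},
]
df = pd.DataFrame(records)

# Positive gap = overconfident in that domain, negative = underconfident.
summary = df.groupby("domain").agg(
    avg_forecast=("probability", "mean"),
    yes_rate=("resolved_yes", "mean"),
    n=("probability", "size"),
)
summary["gap"] = summary["avg_forecast"] - summary["yes_rate"]
print(summary)
```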
Calibration vs. Other Metrics
Calibration vs. Brier Score
Calibration:
- Measures probability accuracy
- “Do 70% forecasts come true 70% of the time?”
- Pure probability assessment
Brier Score:
- Measures overall prediction error
- Rewards both calibration AND resolution
- Being right is better than being calibrated
Key difference:
- Forecasting 70% on everything that resolves Yes 70% of the time = perfectly calibrated but mediocre Brier
- Forecasting 95% on Yes outcomes and 5% on No outcomes = may have worse calibration but better Brier
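The two scenarios above can be checked with a few lines of Python (hypothetical data; the standard Brier score formula):

```python
def brier(forecasts):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - r) ** 2 for p, r in forecasts) / len(forecasts)


# Scenario A: 70% on ten questions, seven of which resolve Yes (perfectly calibrated).
scenario_a = [(0.7, 1)] * 7 + [(0.7, 0)] * 3
# Scenario B: confident and correct on the same outcomes.
scenario_b = [(0.95, 1)] * 7 + [(0.05, 0)] * 3

print(brier(scenario_a))  # 0.21   -> mediocre
print(brier(scenario_b))  # 0.0025 -> much better
```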
Calibration vs. Peer Score
Calibration:
- You vs. reality
- Absolute accuracy measure
- Independent of other forecasters
Peer Score:
- You vs. community
- Relative accuracy measure
- Depends on others’ forecasts
This means you can have:
- Poor calibration but good Peer Score: You and the community are both overconfident, but you’re less wrong
- Good calibration but poor Peer Score: You’re well-calibrated but the (poorly-calibrated) community outperforms you
Technical Implementation
Calibration calculation is implemented in users/services/profile_stats.py:
get_calibration_curve_data() - Main calibration function
- Coverage-weighted by forecast duration
- Variable bin sizes for statistical robustness
- Binomial confidence intervals
- 5-year lookback window
Advanced: Calibration for Continuous Questions
While calibration curves focus on binary questions, continuous questions have analogous metrics:
Prediction Intervals:
- Do your 80% intervals contain the outcome 80% of the time?
- Do your 50% intervals contain it 50% of the time?
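Checking interval coverage is straightforward if you log your intervals and outcomes; for example (hypothetical data):

```python
# Hypothetical 80% intervals: (interval_low, interval_high, actual_outcome).
intervals_80 = [(10, 20, 14), (3, 7, 8), (100, 150, 120), (0, 5, 2), (40, 60, 55)]

hits = sum(low <= outcome <= high for low, high, outcome in intervals_80)
coverage = hits / len(intervals_80)
print(f"80% intervals contained the outcome {coverage:.0%} of the time")
```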
