## Overview
Metaculus uses log-based scoring rules to evaluate forecast accuracy and rank forecasters. The platform implements three primary scoring methods, each designed to measure a different aspect of forecasting skill.

## Score Types

### Peer Score
Peer Score measures how much better your forecast is than the community aggregate at each point in time. It is the primary scoring method used in most tournaments.

#### How Peer Score Works
**Algorithm:** For each forecast, the Peer Score compares your prediction to the geometric mean of all other forecasters' predictions at that time:

$$S_\text{peer} = 100 \cdot \ln\!\left(\frac{p}{p_\text{geo}}\right)$$

Where:

- `S_peer` = Peer score
- `N` = Number of forecasters (excluding you)
- `p` = Your forecast probability for the correct outcome
- `p_geo` = Geometric mean of the other `N` forecasters' predictions for the correct outcome

The final question score is the time-weighted average of the interval scores over the question's lifetime:

$$S = \frac{1}{T} \sum_i S_i \, t_i$$

Where:

- `S_i` = Score during interval `i`
- `t_i` = Duration of interval `i`
- `T` = Total question duration (from open to close)
**Key properties:**

- Rewards forecasts that outperform the community
- Neutral when you match the community aggregate
- Penalizes forecasts worse than community consensus
- Coverage-weighted: longer-held forecasts have more impact
**Best for:**

- Tournament scoring (most common)
- Measuring relative forecasting skill
- Competitive leaderboards
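The interval calculation above can be sketched in Python. This is an illustrative reconstruction, not the production `score_math.py` code; it assumes the natural log and the ×100 scaling:

```python
import math

def peer_score(p, others):
    """Peer score for one interval: 100 * ln(p / geometric mean of others).

    p      -- your probability on the correct outcome
    others -- the other N forecasters' probabilities on that outcome
    """
    p_geo = math.exp(sum(math.log(q) for q in others) / len(others))
    return 100 * math.log(p / p_geo)

def time_weighted(interval_scores, total_duration):
    """Combine per-interval scores S_i weighted by interval durations t_i."""
    return sum(s * t for s, t in interval_scores) / total_duration

# Matching the community aggregate exactly scores 0 (neutral).
print(round(peer_score(0.7, [0.7, 0.7, 0.7]), 6))  # -> 0.0
```

Note that holding a forecast for longer intervals increases its weight in `time_weighted`, which is what makes the score coverage-weighted.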
### Baseline Score
Baseline Score measures forecast accuracy against a naive baseline prediction, rewarding information gain over that prior.

#### How Baseline Score Works
**Algorithm:**

For binary/multiple choice, the score is the information gain of your forecast over a uniform prior of `1/n`, scaled so that a perfect forecast scores 100 and forecasting the prior scores 0:

$$S_\text{baseline} = 100 \cdot \frac{\ln(p \cdot n)}{\ln(n)}$$

Where:

- `p` = Your forecast probability for the correct outcome
- `n` = Number of options available at forecast time

For continuous questions, the outcome range is divided into buckets and your density is compared to a baseline bucket probability:

$$S_\text{baseline} = 100 \cdot \frac{\ln(p / b)}{\ln(2)}$$

Where:

- `p` = Your forecast probability density for the correct bucket
- `b` = Baseline probability:
  - `b = 0.05` for open bounds (tails)
  - `b = (1 - 0.05 × open_bounds) / (n - 2)` for interior buckets
**Best for:**

- Global leaderboards
- Measuring absolute accuracy
- Questions where community size varies
- Comparing performance across different question sets
### Spot Score
Spot Score evaluates a single forecast made at a specific point in time, rather than integrating over the entire question duration.

#### How Spot Score Works
**Spot Peer Score:** Evaluates your forecast at a specific timestamp against the community geometric mean at that moment:

$$S_\text{spot peer} = 100 \cdot \ln\!\left(\frac{p}{p_\text{geo}}\right)$$

Where all variables are evaluated at the spot forecast timestamp.

**Spot Baseline Score:** Evaluates your forecast at a specific timestamp against the baseline:

- For binary/MC: same formula as Baseline Score, but evaluated at one point in time
- For continuous: same formula as Baseline Score, but evaluated at one point in time
**Key properties:**

- No time-weighting (single point evaluation)
- Useful for snapshot competitions
- Coverage is always 1.0
- Simpler to understand than time-integrated scores
**Best for:**

- Snapshot tournaments (e.g., “forecast at market close”)
- One-time prediction challenges
- Rapid-response forecasting events
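Spot scoring only needs the forecast that was active at the spot timestamp. A minimal sketch of that lookup (the function name and data layout here are hypothetical, not from `score_math.py`):

```python
import bisect

def forecast_at(update_times, update_probs, t):
    """Return the probability that was active at spot time t.

    update_times -- sorted timestamps at which the forecast was revised
    update_probs -- probability set at each corresponding timestamp
    """
    i = bisect.bisect_right(update_times, t) - 1
    if i < 0:
        raise ValueError("no forecast active at t")
    return update_probs[i]

# History: 0.5 at t=0, revised to 0.6 at t=5, revised to 0.7 at t=10.
print(forecast_at([0, 5, 10], [0.5, 0.6, 0.7], 7))  # -> 0.6
```

The probability returned here is then plugged into the spot peer or spot baseline formula directly, with no time-weighting.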
## Score Calculation Details
### Geometric Mean Aggregation
The community forecast used in Peer Score is calculated as the geometric mean of individual probabilities.

#### Geometric Mean Details
For each outcome bucket:

$$p_\text{geo} = \left(\prod_{i=1}^{N} p_i\right)^{1/N}$$

Where `p_i` is forecaster `i`'s probability for that outcome.

The geometric mean is recalculated every time any forecaster updates their prediction, creating a time series of community forecasts.

**Why geometric mean?**

- More robust to extreme predictions than the arithmetic mean
- Naturally handles probability distributions
- Encourages well-calibrated forecasts
- Prevents any single forecaster from dominating the aggregate
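A minimal sketch of the per-bucket aggregation, computed in log space for numerical stability (the actual `get_geometric_means()` signature may differ):

```python
import math

def geometric_mean(probs):
    """Geometric mean of forecaster probabilities for one outcome bucket."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

print(round(geometric_mean([0.4, 0.9]), 6))  # sqrt(0.4 * 0.9) -> 0.6
```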
### Coverage
Coverage measures what fraction of a question's lifetime your forecasts covered:

$$C = \frac{t_\text{end} - t_\text{start}}{T_\text{total}}$$

Where:

- `t_start` = When your forecast becomes active (max of forecast start and question open)
- `t_end` = When your forecast ends (min of forecast end and question close)
- `T_total` = Total question duration
- Full coverage (1.0) means you had an active forecast for the entire question lifetime
- Partial coverage reduces your effective contribution
- Some leaderboards require minimum coverage thresholds
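The coverage calculation can be sketched as follows (times as numbers, e.g. epoch seconds; illustrative, not the production code):

```python
def coverage(forecast_start, forecast_end, question_open, question_close):
    """Fraction of the question's lifetime covered by one forecast."""
    t_start = max(forecast_start, question_open)
    t_end = min(forecast_end, question_close)
    return max(0.0, t_end - t_start) / (question_close - question_open)

print(coverage(0, 100, 0, 100))   # active the whole time -> 1.0
print(coverage(50, 100, 0, 100))  # joined halfway through -> 0.5
```

Clamping with `max`/`min` ensures forecasts made before open or after close never yield coverage outside [0, 1].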
## Leaderboard Scoring
Leaderboards aggregate individual question scores using different methods:

| Leaderboard Type | Aggregation Method | Use Case |
|---|---|---|
| Peer Tournament | Sum of peer scores | Most tournaments |
| Baseline Global | Sum of baseline scores | Global skill ranking |
| Peer Global | Coverage-weighted average | Global peer performance |
| Spot Peer | Sum of spot peer scores | Snapshot competitions |
| Spot Baseline | Sum of spot baseline scores | Snapshot accuracy |
## Implementation Reference
The scoring algorithms are implemented in `scoring/score_math.py`:
- `evaluate_forecasts_peer_accuracy()` - Peer Score calculation
- `evaluate_forecasts_baseline_accuracy()` - Baseline Score calculation
- `evaluate_forecasts_peer_spot_forecast()` - Spot Peer Score
- `evaluate_forecasts_baseline_spot_forecast()` - Spot Baseline Score
- `get_geometric_means()` - Community aggregate calculation
