
Overview

Metaculus uses sophisticated scoring algorithms to evaluate forecast accuracy and rank forecasters. The platform implements three primary scoring methods, each designed to measure different aspects of forecasting skill.

Score Types

Peer Score

Peer Score measures how much better your forecast is compared to the community aggregate at each point in time. This is the primary scoring method used in most tournaments.
Algorithm: For each forecast, the Peer Score compares your prediction to the community's geometric mean at that time:
S_peer = 100 × (N / (N-1)) × ln(p / p_geo)
Where:
  • S_peer = Peer score
  • N = Total number of forecasters (including you)
  • p = Your forecast probability for the correct outcome
  • p_geo = Geometric mean of all N forecasters' predictions (including yours)
The N/(N-1) factor cancels your own forecast's contribution to p_geo, so the score equals the average of ln(p / p_i) taken over the other forecasters alone.
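In code, the instantaneous score is easiest to express in its equivalent form: the average of ln(p / p_i) over the other forecasters' probabilities. A minimal sketch (function name and inputs are illustrative, not the platform's API):

```python
import math

def peer_score(p: float, other_probs: list[float]) -> float:
    """Instantaneous peer score: 100 x the average log ratio of your
    probability for the correct outcome against each other forecaster's."""
    if not other_probs:
        return 0.0  # no peers to compare against
    return 100 * sum(math.log(p / q) for q in other_probs) / len(other_probs)
```

Forecasting 0.8 when every peer forecast 0.4 yields 100·ln 2 ≈ 69.3, while matching the peers exactly yields 0.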
For continuous questions, the score is divided by 2 to normalize the scale.
Time-Weighted Integration: Since forecasts change over time, the final Peer Score is a time-weighted average:
Score_final = Σ(i=1 to n) S_i × (t_i / T)
Where:
  • S_i = Score during interval i
  • t_i = Duration of interval i
  • T = Total question duration (from open to close)
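The time-weighted average above can be sketched as follows (interval bookkeeping is simplified and names are illustrative):

```python
def time_weighted_score(interval_scores: list[tuple[float, float]],
                        total_duration: float) -> float:
    """interval_scores: (score, duration) pairs spanning open to close.
    Each interval's score is weighted by the fraction of the question
    lifetime it covers; uncovered time contributes a score of 0."""
    return sum(score * (duration / total_duration)
               for score, duration in interval_scores)
```

For example, a score of 20 held only for the second half of a 10-day question contributes 10 points to the final score.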
Key Properties:
  • Rewards forecasts that outperform the community
  • Neutral when you match the community aggregate
  • Penalizes forecasts worse than community consensus
  • Coverage-weighted: longer-held forecasts have more impact
When to use Peer Score:
  • Tournament scoring (most common)
  • Measuring relative forecasting skill
  • Competitive leaderboards

Baseline Score

Baseline Score measures forecast accuracy against a naive baseline prediction, rewarding information gain over the prior.
Algorithm: For Binary/Multiple Choice:
S_baseline = 100 × (ln(p × n) / ln(n))
Where:
  • p = Your forecast probability for the correct outcome
  • n = Number of options available at forecast time
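The binary/multiple-choice formula can be sketched directly (an illustrative helper, not the platform's implementation):

```python
import math

def baseline_score(p: float, n: int) -> float:
    """Baseline score for binary/multiple choice: 100 * ln(p*n) / ln(n).
    p -- your probability for the correct outcome
    n -- number of options available"""
    return 100 * math.log(p * n) / math.log(n)
```

A uniform guess (p = 1/n) scores 0, a certain correct prediction scores 100, and anything less confident than uniform goes negative.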
This normalizes the log score so that a uniform distribution (1/n for each option) scores 0 and a perfect prediction (100%) scores 100.
For Continuous Questions:
S_baseline = 100 × (ln(p / b) / 2)
Where:
  • p = Your forecast probability density for the correct bucket
  • b = Baseline probability:
    • b = 0.05 for open bounds (tails)
    • b = (1 - 0.05 × open_bounds) / (n - 2) for interior buckets, where open_bounds is the number of open bounds (0, 1, or 2) and n is the number of buckets
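A sketch of the continuous formula with the bucket baselines above (parameter names are illustrative; `open_bounds` is the number of open bounds, 0-2):

```python
import math

def continuous_baseline_score(p: float, n: int, open_bounds: int,
                              tail: bool = False) -> float:
    """Continuous baseline score: 100 * ln(p / b) / 2, where b is the
    naive baseline probability for the resolved bucket."""
    b = 0.05 if tail else (1 - 0.05 * open_bounds) / (n - 2)
    return 100 * math.log(p / b) / 2
```

As with the binary case, matching the baseline exactly (p = b) scores 0, and placing more mass than the baseline on the resolved bucket scores positive.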
Time Integration: Like Peer Score, Baseline Score is time-weighted across the question's lifetime.
When to use Baseline Score:
  • Global leaderboards
  • Measuring absolute accuracy
  • Questions where community size varies
  • Comparing performance across different question sets

Spot Score

Spot Score evaluates a single forecast made at a specific point in time, rather than integrating over the entire question duration.
Spot Peer Score: Evaluates your forecast at a specific timestamp against the community geometric mean at that moment:
S_spot_peer = 100 × (N / (N-1)) × ln(p / p_geo)
Where all variables are evaluated at the spot forecast timestamp.
Spot Baseline Score: Evaluates your forecast at a specific timestamp against the baseline, using the same binary/multiple-choice and continuous formulas as the time-integrated Baseline Score, but evaluated at a single point.
Key Properties:
  • No time-weighting (single point evaluation)
  • Useful for snapshot competitions
  • Coverage is always 1.0
  • Simpler to understand than time-integrated scores
When to use Spot Score:
  • Snapshot tournaments (e.g., “forecast at market close”)
  • One-time prediction challenges
  • Rapid-response forecasting events

Score Calculation Details

Geometric Mean Aggregation

The community forecast used in Peer Score is calculated using geometric mean of probabilities:
For each outcome bucket:
p_geo = (∏(i=1 to N) p_i)^(1/N)
Where p_i is forecaster i's probability for that outcome.
The geometric mean is recalculated every time any forecaster updates their prediction, creating a time series of community forecasts.
Why Geometric Mean?
  • More robust to extreme predictions than arithmetic mean
  • Naturally handles probability distributions
  • Encourages well-calibrated forecasts
  • Prevents any single forecaster from dominating the aggregate
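The per-bucket aggregation can be sketched as follows, computing in log space for numerical stability:

```python
import math

def geometric_mean(probs: list[float]) -> float:
    """Geometric mean of forecaster probabilities for one outcome bucket:
    (p_1 * p_2 * ... * p_N) ** (1/N), computed via logs to avoid underflow
    when N is large and probabilities are small."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))
```

Note how a single extreme forecast is damped: the geometric mean of 0.25 and 1.0 is 0.5, whereas the arithmetic mean would be 0.625.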

Coverage

Coverage measures what fraction of a question’s lifetime your forecasts covered:
Coverage = (Σ_forecasts (t_end - t_start)) / T_total
Where:
  • t_start = When your forecast becomes active (max of forecast start and question open)
  • t_end = When your forecast ends (min of forecast end and question close)
  • T_total = Total question duration
Impact:
  • Full coverage (1.0) means you had an active forecast for the entire question lifetime
  • Partial coverage reduces your effective contribution
  • Some leaderboards require minimum coverage thresholds
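Coverage can be computed by clipping each forecast window to the question's lifetime, as in this sketch (assumes non-overlapping windows; names are illustrative):

```python
def coverage(windows: list[tuple[float, float]],
             q_open: float, q_close: float) -> float:
    """Fraction of the question lifetime covered by active forecasts.
    windows: (start, end) times of your forecasts, non-overlapping."""
    covered = 0.0
    for start, end in windows:
        t_start = max(start, q_open)   # clip to question open
        t_end = min(end, q_close)      # clip to question close
        covered += max(0.0, t_end - t_start)
    return covered / (q_close - q_open)
```

A forecast placed halfway through a question and held to close yields coverage 0.5; a forecast made before the question opened is clipped to start at the open time.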

Leaderboard Scoring

Leaderboards aggregate individual question scores using different methods:
Leaderboard Type | Aggregation Method           | Use Case
Peer Tournament  | Sum of peer scores           | Most tournaments
Baseline Global  | Sum of baseline scores       | Global skill ranking
Peer Global      | Coverage-weighted average    | Global peer performance
Spot Peer        | Sum of spot peer scores      | Snapshot competitions
Spot Baseline    | Sum of spot baseline scores  | Snapshot accuracy
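The two aggregation styles in the table differ in what they reward. A sketch of each (the exact weighting details are an assumption here): summing scores rewards participation as well as accuracy, since unanswered questions add nothing, while a coverage-weighted average normalizes for how long each forecaster was active.

```python
def sum_aggregate(scores: list[float]) -> float:
    """Tournament-style aggregation: total of per-question scores."""
    return sum(scores)

def coverage_weighted_average(entries: list[tuple[float, float]]) -> float:
    """Peer Global-style aggregation over (score, coverage) pairs:
    each question's score is weighted by the forecaster's coverage."""
    total_cov = sum(c for _, c in entries)
    return sum(s * c for s, c in entries) / total_cov if total_cov else 0.0
```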

Implementation Reference

The scoring algorithms are implemented in scoring/score_math.py:
  • evaluate_forecasts_peer_accuracy() - Peer Score calculation
  • evaluate_forecasts_baseline_accuracy() - Baseline Score calculation
  • evaluate_forecasts_peer_spot_forecast() - Spot Peer Score
  • evaluate_forecasts_baseline_spot_forecast() - Spot Baseline Score
  • get_geometric_means() - Community aggregate calculation

Best Practices

Maximizing Your Scores:
  1. Forecast early - More coverage means more weight
  2. Update regularly - React to new information
  3. Beat the crowd - Peer Score rewards contrarian accuracy
  4. Calibrate well - Baseline Score rewards proper confidence
  5. Avoid extremes - Unless you have strong evidence
Common Pitfalls:
  • Forecasting too late reduces coverage
  • Copying the community aggregate gives ~0 peer score
  • Overconfident predictions can be heavily penalized
  • Not updating forecasts misses opportunities for improvement
