
Overview

Metaculus uses sophisticated scoring algorithms to evaluate forecast accuracy and rank forecasters. The platform implements three primary scoring methods, each designed to measure different aspects of forecasting skill.

Score Types

Peer Score

Peer Score measures how much better your forecast is compared to the community aggregate at each point in time. This is the primary scoring method used in most tournaments.
Algorithm: For each forecast, the Peer Score compares your prediction to the community's geometric mean at that time:
S_peer = 100 × (N / (N-1)) × ln(p / p_geo)
Where:
  • S_peer = Peer score
  • N = Total number of forecasters (including you)
  • p = Your forecast probability for the correct outcome
  • p_geo = Geometric mean of all N forecasters' predictions (including yours)
The N/(N-1) factor cancels your own forecast's contribution to p_geo, so the score equals the average of ln(p / p_i) taken over the other forecasters alone.
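In code, the instantaneous score is easiest to express in its equivalent form: the average of ln(p / p_i) over the other forecasters' probabilities. A minimal sketch (function name and inputs are illustrative, not the platform's API):

```python
import math

def peer_score(p: float, other_probs: list[float]) -> float:
    """Instantaneous peer score: 100 x the average log ratio of your
    probability for the correct outcome against each other forecaster's."""
    if not other_probs:
        return 0.0  # no peers to compare against
    return 100 * sum(math.log(p / q) for q in other_probs) / len(other_probs)
```

Forecasting 0.8 when every peer forecast 0.4 yields 100·ln 2 ≈ 69.3, while matching the peers exactly yields 0.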
For continuous questions, the score is divided by 2 to normalize the scale.
Time-Weighted Integration: Since forecasts change over time, the final Peer Score is a time-weighted average:
Score_final = Σ(i=1 to n) S_i × (t_i / T)
Where:
  • S_i = Score during interval i
  • t_i = Duration of interval i
  • T = Total question duration (from open to close)
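The time-weighted average above can be sketched as follows (interval bookkeeping is simplified and names are illustrative):

```python
def time_weighted_score(interval_scores: list[tuple[float, float]],
                        total_duration: float) -> float:
    """interval_scores: (score, duration) pairs spanning open to close.
    Each interval's score is weighted by the fraction of the question
    lifetime it covers; uncovered time contributes a score of 0."""
    return sum(score * (duration / total_duration)
               for score, duration in interval_scores)
```

For example, a score of 20 held only for the second half of a 10-day question contributes 10 points to the final score.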
Key Properties:
  • Rewards forecasts that outperform the community
  • Neutral when you match the community aggregate
  • Penalizes forecasts worse than community consensus
  • Coverage-weighted: longer-held forecasts have more impact
When to use Peer Score:
  • Tournament scoring (most common)
  • Measuring relative forecasting skill
  • Competitive leaderboards

Baseline Score

Baseline Score measures forecast accuracy against a naive baseline prediction, rewarding information gain over the prior.
Algorithm: For Binary/Multiple Choice:
S_baseline = 100 × (ln(p × n) / ln(n))
Where:
  • p = Your forecast probability for the correct outcome
  • n = Number of options available at forecast time
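The binary/multiple-choice formula can be sketched directly (an illustrative helper, not the platform's implementation):

```python
import math

def baseline_score(p: float, n: int) -> float:
    """Baseline score for binary/multiple choice: 100 * ln(p*n) / ln(n).
    p -- your probability for the correct outcome
    n -- number of options available"""
    return 100 * math.log(p * n) / math.log(n)
```

A uniform guess (p = 1/n) scores 0, a certain correct prediction scores 100, and anything less confident than uniform goes negative.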
This normalizes the log score so that a uniform distribution (1/n for each option) scores 0 and a perfect prediction (100%) scores 100.
For Continuous Questions:
S_baseline = 100 × (ln(p / b) / 2)
Where:
  • p = Your forecast probability density for the correct bucket
  • b = Baseline probability:
    • b = 0.05 for open bounds (tails)
    • b = (1 - 0.05 × open_bounds) / (n - 2) for interior buckets, where open_bounds is the number of open bounds (0, 1, or 2) and n is the number of buckets
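A sketch of the continuous formula with the bucket baselines above (parameter names are illustrative; `open_bounds` is the number of open bounds, 0-2):

```python
import math

def continuous_baseline_score(p: float, n: int, open_bounds: int,
                              tail: bool = False) -> float:
    """Continuous baseline score: 100 * ln(p / b) / 2, where b is the
    naive baseline probability for the resolved bucket."""
    b = 0.05 if tail else (1 - 0.05 * open_bounds) / (n - 2)
    return 100 * math.log(p / b) / 2
```

As with the binary case, matching the baseline exactly (p = b) scores 0, and placing more mass than the baseline on the resolved bucket scores positive.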
Time Integration: Like Peer Score, Baseline Score is time-weighted across the question's lifetime.
When to use Baseline Score:
  • Global leaderboards
  • Measuring absolute accuracy
  • Questions where community size varies
  • Comparing performance across different question sets

Spot Score

Spot Score evaluates a single forecast made at a specific point in time, rather than integrating over the entire question duration.
Spot Peer Score: Evaluates your forecast at a specific timestamp against the community geometric mean at that moment:
S_spot_peer = 100 × (N / (N-1)) × ln(p / p_geo)
Where all variables are evaluated at the spot forecast timestamp.
Spot Baseline Score: Evaluates your forecast at a specific timestamp against the baseline, using the same binary/multiple-choice and continuous formulas as the time-integrated Baseline Score, but evaluated at a single point.
Key Properties:
  • No time-weighting (single point evaluation)
  • Useful for snapshot competitions
  • Coverage is always 1.0
  • Simpler to understand than time-integrated scores
When to use Spot Score:
  • Snapshot tournaments (e.g., “forecast at market close”)
  • One-time prediction challenges
  • Rapid-response forecasting events

Score Calculation Details

Geometric Mean Aggregation

The community forecast used in Peer Score is calculated using geometric mean of probabilities:
For each outcome bucket:
p_geo = (∏(i=1 to N) p_i)^(1/N)
Where p_i is forecaster i's probability for that outcome.
The geometric mean is recalculated every time any forecaster updates their prediction, creating a time series of community forecasts.
Why Geometric Mean?
  • More robust to extreme predictions than arithmetic mean
  • Naturally handles probability distributions
  • Encourages well-calibrated forecasts
  • Prevents any single forecaster from dominating the aggregate
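The per-bucket aggregation can be sketched as follows, computing in log space for numerical stability:

```python
import math

def geometric_mean(probs: list[float]) -> float:
    """Geometric mean of forecaster probabilities for one outcome bucket:
    (p_1 * p_2 * ... * p_N) ** (1/N), computed via logs to avoid underflow
    when N is large and probabilities are small."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))
```

Note how a single extreme forecast is damped: the geometric mean of 0.25 and 1.0 is 0.5, whereas the arithmetic mean would be 0.625.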

Coverage

Coverage measures what fraction of a question’s lifetime your forecasts covered:
Coverage = (Σ_forecasts (t_end - t_start)) / T_total
Where:
  • t_start = When your forecast becomes active (max of forecast start and question open)
  • t_end = When your forecast ends (min of forecast end and question close)
  • T_total = Total question duration
Impact:
  • Full coverage (1.0) means you had an active forecast for the entire question lifetime
  • Partial coverage reduces your effective contribution
  • Some leaderboards require minimum coverage thresholds
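Coverage can be computed by clipping each forecast window to the question's lifetime, as in this sketch (assumes non-overlapping windows; names are illustrative):

```python
def coverage(windows: list[tuple[float, float]],
             q_open: float, q_close: float) -> float:
    """Fraction of the question lifetime covered by active forecasts.
    windows: (start, end) times of your forecasts, non-overlapping."""
    covered = 0.0
    for start, end in windows:
        t_start = max(start, q_open)   # clip to question open
        t_end = min(end, q_close)      # clip to question close
        covered += max(0.0, t_end - t_start)
    return covered / (q_close - q_open)
```

A forecast placed halfway through a question and held to close yields coverage 0.5; a forecast made before the question opened is clipped to start at the open time.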

Leaderboard Scoring

Leaderboards aggregate individual question scores using different methods:
Leaderboard Type | Aggregation Method           | Use Case
Peer Tournament  | Sum of peer scores           | Most tournaments
Baseline Global  | Sum of baseline scores       | Global skill ranking
Peer Global      | Coverage-weighted average    | Global peer performance
Spot Peer        | Sum of spot peer scores      | Snapshot competitions
Spot Baseline    | Sum of spot baseline scores  | Snapshot accuracy
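The two aggregation styles in the table differ in what they reward. A sketch of each (the exact weighting details are an assumption here): summing scores rewards participation as well as accuracy, since unanswered questions add nothing, while a coverage-weighted average normalizes for how long each forecaster was active.

```python
def sum_aggregate(scores: list[float]) -> float:
    """Tournament-style aggregation: total of per-question scores."""
    return sum(scores)

def coverage_weighted_average(entries: list[tuple[float, float]]) -> float:
    """Peer Global-style aggregation over (score, coverage) pairs:
    each question's score is weighted by the forecaster's coverage."""
    total_cov = sum(c for _, c in entries)
    return sum(s * c for s, c in entries) / total_cov if total_cov else 0.0
```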

Implementation Reference

The scoring algorithms are implemented in scoring/score_math.py:
  • evaluate_forecasts_peer_accuracy() - Peer Score calculation
  • evaluate_forecasts_baseline_accuracy() - Baseline Score calculation
  • evaluate_forecasts_peer_spot_forecast() - Spot Peer Score
  • evaluate_forecasts_baseline_spot_forecast() - Spot Baseline Score
  • get_geometric_means() - Community aggregate calculation

Best Practices

Maximizing Your Scores:
  1. Forecast early - More coverage means more weight
  2. Update regularly - React to new information
  3. Beat the crowd - Peer Score rewards contrarian accuracy
  4. Calibrate well - Baseline Score rewards proper confidence
  5. Avoid extremes - Unless you have strong evidence
Common Pitfalls:
  • Forecasting too late reduces coverage
  • Copying the community aggregate gives ~0 peer score
  • Overconfident predictions can be heavily penalized
  • Not updating forecasts misses opportunities for improvement
