Metaculus uses sophisticated scoring algorithms to evaluate forecast accuracy. Scoring incentivizes truthful probability estimates and rewards forecasters who make bold, accurate predictions.
Metaculus provides multiple scoring methods, each measuring a different aspect of forecasting skill. From scoring/constants.py:4-10, the available score types include (among others) peer, spot peer, baseline, and spot baseline scores.
From scoring/score_math.py:152-200, peer scores use logarithmic scoring:
```python
def evaluate_forecasts_peer_accuracy(
    forecasts: Sequence[Forecast | AggregateForecast],
    base_forecasts: list[Forecast | AggregateForecast] | None,
    resolution_bucket: int,
    forecast_horizon_start: float,
    actual_close_time: float,
    forecast_horizon_end: float,
    question_type: str,
    geometric_means: list[AggregationEntry] | None = None,
) -> list[ForecastScore]:
    # Calculate the geometric mean of all forecasters
    geometric_mean_forecasts = geometric_means or get_geometric_means(base_forecasts)

    # Score each forecast against the community
    for forecast in forecasts:
        forecast_start = max(forecast.start_time.timestamp(), forecast_horizon_start)
        forecast_end = (
            actual_close_time
            if forecast.end_time is None
            else min(forecast.end_time.timestamp(), actual_close_time)
        )
        pmf = forecast.get_pmf()
        p = pmf[resolution_bucket]  # Your forecast probability

        # Score against the community at each timestep
        for gm in geometric_mean_forecasts:
            if forecast_start <= gm.timestamp < forecast_end:
                gmp = gm.pmf[resolution_bucket]  # Community forecast probability
                # Peer score formula
                interval_score = (
                    100
                    * (gm.num_forecasters / (gm.num_forecasters - 1))
                    * np.log(p / gmp)
                )
                if question_type in QUESTION_CONTINUOUS_TYPES:
                    interval_score /= 2  # Continuous questions are harder
```
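The peer score formula can be checked with a small self-contained sketch. The numbers and the `peer_score` helper are hypothetical illustrations, not Metaculus API:

```python
import math

def peer_score(p: float, all_probs: list[float], continuous: bool = False) -> float:
    # Peer score of probability p on the resolved outcome, where all_probs
    # holds every forecaster's probability on that outcome (including p).
    n = len(all_probs)
    gm = math.exp(sum(math.log(q) for q in all_probs) / n)  # geometric mean
    score = 100 * (n / (n - 1)) * math.log(p / gm)
    return score / 2 if continuous else score  # continuous scores are halved

# Hypothetical question: you forecast 0.8, four others forecast 0.5, resolves Yes
print(round(peer_score(0.8, [0.8, 0.5, 0.5, 0.5, 0.5]), 1))  # → 47.0
```

The `n / (n - 1)` factor cancels your own contribution to the geometric mean, so the score equals the average log ratio of your probability to each *other* forecaster's probability: here 100 × ln(0.8/0.5) ≈ 47.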
Baseline scores compare each forecast against a naive baseline distribution:

- Baseline: Uniform distribution over all available options. For a binary question, the baseline is 50/50; for a 4-option multiple choice, each option gets 25%.
- Maximum score: 100 points (achieved with 100% confidence in the correct outcome)
- Minimum score: Negative infinity (as the probability assigned to the correct outcome approaches 0)
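These properties suggest a log score normalized against the uniform baseline. The sketch below is an assumption consistent with the bullets above, not the exact Metaculus implementation:

```python
import math

def baseline_score(p: float, num_options: int) -> float:
    # Hypothetical normalization: matching the uniform baseline
    # (p = 1/num_options) scores 0, full confidence in the correct
    # outcome scores 100, and the score diverges to -inf as p -> 0.
    return 100 * math.log(p * num_options) / math.log(num_options)

print(baseline_score(1.0, 2))   # → 100.0 (certain and correct, binary)
print(baseline_score(0.5, 2))   # → 0.0 (matches the 50/50 baseline)
print(baseline_score(0.25, 4))  # → 0.0 (matches the uniform 4-option baseline)
```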
For continuous questions, the baseline differs:

- Baseline: 5% probability in each tail (if the bounds are open), uniform over the middle range. This baseline represents a "reasonable prior" in which extreme values are possible but unlikely.
- Score is halved: Continuous question scores are divided by 2 to account for their increased difficulty.
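That baseline distribution can be sketched as follows. The helper and its bin layout are hypothetical; only the 5%-per-open-tail and uniform-middle structure come from the description above:

```python
def continuous_baseline(num_bins: int, open_bounds_count: int):
    # Each open bound gets a 5% tail; the remaining mass is spread
    # uniformly over the inner bins of the question's range.
    tails = [0.05] * open_bounds_count  # 0, 1, or 2 open tails
    inner_mass = 1.0 - 0.05 * open_bounds_count
    inner = [inner_mass / num_bins] * num_bins
    return tails, inner

tails, inner = continuous_baseline(num_bins=200, open_bounds_count=2)
print(sum(tails) + sum(inner))  # total probability mass sums to 1
```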
```python
def evaluate_forecasts_baseline_spot_forecast(
    forecasts: Sequence[Forecast | AggregateForecast],
    resolution_bucket: int,
    spot_forecast_timestamp: float,
    question_type: str,
    open_bounds_count: int,
) -> list[ForecastScore]:
    forecast_scores: list[ForecastScore] = []
    for forecast in forecasts:
        start = forecast.start_time.timestamp()
        end = (
            float("inf")
            if forecast.end_time is None
            else forecast.end_time.timestamp()
        )
        # Only score forecasts active at the spot time
        if start <= spot_forecast_timestamp < end:
            pmf = forecast.get_pmf()
            # Same scoring formula as baseline, but coverage = 1.0
            forecast_scores.append(ForecastScore(forecast_score, 1.0))
        else:
            forecast_scores.append(ForecastScore(0))
    return forecast_scores
```
If you don’t have an active forecast at the spot scoring time, you receive 0 points for that question.
Each question designates a default score type (from questions/models.py:91-102):
```python
default_score_type = models.CharField(
    max_length=20,
    choices=ScoreTypes.choices,
    default=ScoreTypes.PEER,
    help_text="""Default score type for this question.
    Generally, this should be either 'peer' or 'spot_peer'.
    Determines which score will be most prominently displayed in the UI.
    Also, for Leaderboards that have a 'score type' of 'default',
    this question's default score type will be the one
    that contributes to the leaderboard.
    """,
)
```