Overview

Metaculus evaluates forecast accuracy with proper scoring rules built on the logarithmic score. Scoring incentivizes truthful probability estimates and rewards forecasters who make bold, accurate predictions.

Score Types

Metaculus provides multiple scoring methods, each measuring different aspects of forecasting skill. From scoring/constants.py:4-10, the available score types are:
class ScoreTypes(models.TextChoices):
    RELATIVE_LEGACY = "relative_legacy"
    PEER = "peer"
    BASELINE = "baseline"
    SPOT_PEER = "spot_peer"
    SPOT_BASELINE = "spot_baseline"
    MANUAL = "manual"

Peer Score

Measures how much better (or worse) your forecasts are than the community's

Baseline Score

Measures raw forecasting accuracy against a uniform baseline

Spot Peer

Peer score evaluated at a specific point in time

Spot Baseline

Baseline score evaluated at a specific point in time

Peer Score

Peer scores measure how much better (or worse) your forecasts are compared to the community prediction.

Algorithm

From scoring/score_math.py:152-200, peer scores use logarithmic scoring:
def evaluate_forecasts_peer_accuracy(
    forecasts: Sequence[Forecast | AggregateForecast],
    base_forecasts: list[Forecast | AggregateForecast] | None,
    resolution_bucket: int,
    forecast_horizon_start: float,
    actual_close_time: float,
    forecast_horizon_end: float,
    question_type: str,
    geometric_means: list[AggregationEntry] | None = None,
) -> list[ForecastScore]:
    # Calculate geometric mean of all forecasters
    geometric_mean_forecasts = geometric_means or get_geometric_means(base_forecasts)
    
    # Score each forecast against the community
    for forecast in forecasts:
        forecast_start = max(forecast.start_time.timestamp(), forecast_horizon_start)
        forecast_end = (
            actual_close_time if forecast.end_time is None
            else min(forecast.end_time.timestamp(), actual_close_time)
        )
        
        pmf = forecast.get_pmf()
        p = pmf[resolution_bucket]  # Your forecast probability
        
        # Score against community at each timestep
        for gm in geometric_mean_forecasts:
            if forecast_start <= gm.timestamp < forecast_end:
                gmp = gm.pmf[resolution_bucket]  # Community forecast probability
                
                # Peer score formula
                interval_score = (
                    100 * (gm.num_forecasters / (gm.num_forecasters - 1)) 
                    * np.log(p / gmp)
                )
                
                if question_type in QUESTION_CONTINUOUS_TYPES:
                    interval_score /= 2  # continuous log scores are halved to keep scales comparable

Key Properties

Peer scores are relative: A positive peer score means you beat the community, while a negative score means the community was more accurate.
  • Scale: Typically ranges from -50 to +50 points
  • Zero sum: Peer scores across all forecasters sum to approximately zero
  • Requires multiple forecasters: Need at least 2 forecasters (formula includes num_forecasters / (num_forecasters - 1))
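As a sanity check, the interval formula above can be reproduced by hand for a binary question. This is an illustrative sketch, not the production code path; all numbers are made up:

```python
import math

def peer_interval_score(p: float, gmp: float, num_forecasters: int,
                        continuous: bool = False) -> float:
    """Peer score for one interval: 100 * n/(n-1) * ln(p / gmp),
    halved for continuous questions, mirroring the excerpt above."""
    score = 100 * (num_forecasters / (num_forecasters - 1)) * math.log(p / gmp)
    return score / 2 if continuous else score

# You forecast 80%, the community geometric mean was 60%, 10 forecasters:
score = peer_interval_score(p=0.8, gmp=0.6, num_forecasters=10)
print(round(score, 1))  # ~32.0 — positive, because you beat the community
```

Note that matching the community exactly (p == gmp) gives a score of zero, which is what makes peer scores approximately zero-sum.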

Time-Weighted

Peer scores are weighted by coverage - the fraction of the forecasting period your prediction was active:
forecast_duration = forecast_end - forecast_start
forecast_coverage = forecast_duration / total_duration
final_score = interval_score * forecast_coverage
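With illustrative numbers, the coverage weighting works out as follows (a hypothetical 10-day question where your forecast was active for the final 5 days):

```python
# Hypothetical durations, in seconds
total_duration = 10 * 86400       # full forecasting window: 10 days
forecast_duration = 5 * 86400     # your forecast was active for 5 days
interval_score = 40.0             # hypothetical unweighted interval score

forecast_coverage = forecast_duration / total_duration  # 0.5
final_score = interval_score * forecast_coverage
print(final_score)  # 20.0 — half the window, half the credit
```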

Baseline Score

Baseline scores measure raw forecasting accuracy against a naive baseline prediction.

Algorithm

From scoring/score_math.py:62-108, baseline scoring:
def evaluate_forecasts_baseline_accuracy(
    forecasts: Sequence[Forecast | AggregateForecast],
    resolution_bucket: int,
    forecast_horizon_start: float,
    actual_close_time: float,
    forecast_horizon_end: float,
    question_type: str,
    open_bounds_count: int,
) -> list[ForecastScore]:
    total_duration = forecast_horizon_end - forecast_horizon_start
    
    for forecast in forecasts:
        # Duration the forecast was active within the question window
        forecast_start = max(forecast.start_time.timestamp(), forecast_horizon_start)
        forecast_end = (
            actual_close_time if forecast.end_time is None
            else min(forecast.end_time.timestamp(), actual_close_time)
        )
        forecast_duration = forecast_end - forecast_start
        forecast_coverage = forecast_duration / total_duration
        pmf = forecast.get_pmf()
        
        if question_type in ["binary", "multiple_choice"]:
            # Baseline: uniform distribution over options
            options_at_time = sum(~np.isnan(pmf))
            p = pmf[resolution_bucket]
            
            # Logarithmic score vs. uniform
            forecast_score = (
                100 * np.log(p * options_at_time) / np.log(options_at_time)
            )
        else:
            # Baseline: 5% probability in each tail, uniform over middle
            if resolution_bucket in [0, len(pmf) - 1]:
                baseline = 0.05
            else:
                baseline = (1 - 0.05 * open_bounds_count) / (len(pmf) - 2)
            
            # Logarithmic score vs. baseline
            forecast_score = 100 * np.log(pmf[resolution_bucket] / baseline) / 2
        
        # Weight by coverage
        final_score = forecast_score * forecast_coverage

Baseline Assumptions

Baseline: uniform distribution over all available options. For a binary question, the baseline is 50/50; for a 4-option multiple choice, each option gets 25%.
  • Maximum score: 100 points (achieved with 100% confidence in the correct outcome)
  • Minimum score: negative infinity (as the probability assigned to the correct outcome approaches 0)
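For a binary question, the log-score-vs-uniform formula above reduces to a simple function of your probability on the correct outcome. A minimal sketch with illustrative values:

```python
import math

def binary_baseline_score(p: float, options: int = 2) -> float:
    """Log score against a uniform baseline, as in the excerpt above:
    100 * ln(p * options) / ln(options)."""
    return 100 * math.log(p * options) / math.log(options)

print(round(binary_baseline_score(0.5), 1))  # 0.0   — matching the 50/50 baseline scores zero
print(round(binary_baseline_score(1.0), 1))  # 100.0 — the maximum score
print(round(binary_baseline_score(0.8), 1))  # ~67.8 — confident and correct
```

Probabilities below the baseline go negative, and the score dives toward negative infinity as p approaches 0, which is why overconfidence on the wrong outcome is so costly.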

Key Properties

  • Scale: 0 to 100+ points (higher is better)
  • Absolute: Scores don’t depend on other forecasters
  • Always calculable: Can be computed even with only one forecaster
  • Incentivizes boldness: Confident, accurate predictions earn more points

Spot Scores

Spot scores evaluate forecasts at a single point in time rather than across the entire forecasting period.

Spot Scoring Time

From questions/models.py:428-441, the spot scoring time is determined by:
def get_spot_scoring_time(self) -> datetime | None:
    if self.spot_scoring_time:
        return self.spot_scoring_time
    elif self.cp_reveal_time and self.open_time and self.cp_reveal_time > self.open_time:
        return self.cp_reveal_time
    elif self.actual_close_time:
        return self.actual_close_time
    elif self.scheduled_close_time:
        return self.scheduled_close_time
    return None
Priority order:
  1. Explicit spot_scoring_time if set
  2. cp_reveal_time (when community prediction is revealed)
  3. actual_close_time
  4. scheduled_close_time

Spot Baseline Score

From scoring/score_math.py:111-149:
def evaluate_forecasts_baseline_spot_forecast(
    forecasts: Sequence[Forecast | AggregateForecast],
    resolution_bucket: int,
    spot_forecast_timestamp: float,
    question_type: str,
    open_bounds_count: int,
) -> list[ForecastScore]:
    for forecast in forecasts:
        start = forecast.start_time.timestamp()
        end = float("inf") if forecast.end_time is None else forecast.end_time.timestamp()
        
        # Only score forecasts active at the spot time
        if start <= spot_forecast_timestamp < end:
            pmf = forecast.get_pmf()
            # Same scoring formula as baseline, but coverage = 1.0
            forecast_scores.append(ForecastScore(forecast_score, 1.0))
        else:
            forecast_scores.append(ForecastScore(0))
If you don’t have an active forecast at the spot scoring time, you receive 0 points for that question.

Use Cases

Spot scores are ideal for:
  • Questions where the resolution can affect forecasting (CP hidden until reveal)
  • Live forecasting events with synchronized scoring times
  • Questions that resolve quickly after closing
  • Preventing score manipulation by rapid forecast updates

Score Model

Scores are stored per user, per question, per score type (from scoring/models.py:17-57):
class Score(TimeStampedModel):
    user = models.ForeignKey(User, null=True, on_delete=models.CASCADE)
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    
    # Score data
    score = models.FloatField()
    coverage = models.FloatField(default=0)  # Fraction of forecast period covered
    
    # Metadata
    score_type = models.CharField(max_length=200, choices=ScoreTypes.choices)
    aggregation_method = models.CharField(
        max_length=200,
        choices=AggregationMethod.choices,
        db_index=True
    )

Coverage

The coverage field tracks what fraction of the forecasting period you participated in:
  • Coverage = 1.0: You had an active forecast for the entire question period
  • Coverage = 0.5: You forecasted for half the question period
  • Coverage = 0.0: You never made a forecast
Early forecasting is rewarded! Making predictions early and maintaining them gives you higher coverage and more scoring opportunities.

Archived Scores

Historical scores that can’t be recalculated are archived (from scoring/models.py:59-94):
class ArchivedScore(TimeStampedModel):
    user = models.ForeignKey(User, null=True, on_delete=models.CASCADE)
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    score = models.FloatField()
    coverage = models.FloatField(default=0)
    score_type = models.CharField(max_length=200, choices=ArchivedScoreTypes.choices)
    aggregation_method = models.CharField(max_length=200, choices=AggregationMethod.choices)
Archived score types (from scoring/constants.py:13-14):
class ArchivedScoreTypes(models.TextChoices):
    RELATIVE_LEGACY = "relative_legacy"  # Old scoring system

Default Score Type

Each question designates a default score type (from questions/models.py:91-102):
default_score_type = models.CharField(
    max_length=20,
    choices=ScoreTypes.choices,
    default=ScoreTypes.PEER,
    help_text="""Default score type for this question.
    Generally, this should be either 'peer' or 'spot_peer'.
    Determines which score will be most prominently displayed in the UI.
    Also, for Leaderboards that have a 'score type' of 'default',
    this question's default score type will be the one that contributes
    to the leaderboard.
    """,
)

Question Weighting

Questions can have different weights when aggregating scores (from questions/models.py:90):
question_weight = models.FloatField(default=1.0)
  • Weight = 1.0: Standard question (default)
  • Weight > 1.0: More important question (higher contribution to leaderboards)
  • Weight < 1.0: Less important question
  • Weight = 0.0: Question excluded from scoring entirely
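One way to picture the effect of weights on an aggregate: multiply each question's score by its weight before summing. This is an illustrative sketch with made-up scores, not the leaderboard implementation:

```python
def weighted_total(scores_and_weights: list[tuple[float, float]]) -> float:
    """Sum of score * question_weight; a weight of 0.0 drops the question."""
    return sum(score * weight for score, weight in scores_and_weights)

# (peer score, question_weight) for three hypothetical questions;
# the third has weight 0.0 and contributes nothing.
results = [(30.0, 1.0), (10.0, 2.0), (-5.0, 0.0)]
print(weighted_total(results))  # 50.0
```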

Unsuccessful Resolutions

Some questions cannot be resolved normally:
  • Ambiguous: Resolution criteria cannot be clearly applied
  • Annulled: Question was flawed or should not have been asked
From questions/constants.py:4-6:
class UnsuccessfulResolutionType(models.TextChoices):
    AMBIGUOUS = "ambiguous"
    ANNULLED = "annulled"
Questions resolved as ambiguous or annulled are typically excluded from scoring and leaderboards.

Scoring Best Practices

  • Forecast early and update regularly to maximize your coverage. Even a simple initial forecast boosts your score potential.
  • Track your calibration: over many forecasts, events you predict at 70% should happen about 70% of the time.
  • The scoring system rewards confident, accurate predictions. If you have strong evidence, don't be afraid to make extreme forecasts.
  • Scores are time-weighted, so updating your forecast when you learn new information helps your score.
  • To beat the baseline, you need to outperform uniform/naive probabilities. Consider what an "uninformed" forecaster would predict.
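The calibration tip above can be checked with a simple bucket count over your resolved forecasts. A minimal sketch with a hypothetical forecast history:

```python
from collections import defaultdict

def calibration_table(forecasts: list[tuple[float, bool]],
                      bucket_width: float = 0.1) -> dict[float, float]:
    """Group (probability, resolved_yes) pairs into probability buckets and
    compare each bucket's mean forecast to its empirical resolution rate."""
    buckets: dict[int, list[tuple[float, bool]]] = defaultdict(list)
    for p, outcome in forecasts:
        buckets[int(p / bucket_width)].append((p, outcome))
    table = {}
    for _, items in sorted(buckets.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(o for _, o in items) / len(items)
        table[round(mean_p, 2)] = round(hit_rate, 2)
    return table

# Hypothetical history: four 70% forecasts, three of which resolved Yes
history = [(0.7, True), (0.7, True), (0.7, True), (0.7, False)]
print(calibration_table(history))  # {0.7: 0.75} — close to calibrated
```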

API Reference

Scores API

Explore the full Scores API documentation

Questions

Understand question types and structure

Forecasting

Learn how to make predictions

Leaderboards

See how scores aggregate into rankings

Tournaments

Compete for prizes using these scoring rules