Overview

Metaculus evaluates forecast accuracy with proper scoring rules built on the logarithmic score. Scoring incentivizes truthful probability estimates and rewards forecasters who make bold, accurate predictions.

Score Types

Metaculus provides multiple scoring methods, each measuring different aspects of forecasting skill. From scoring/constants.py:4-10, the available score types are:
class ScoreTypes(models.TextChoices):
    RELATIVE_LEGACY = "relative_legacy"
    PEER = "peer"
    BASELINE = "baseline"
    SPOT_PEER = "spot_peer"
    SPOT_BASELINE = "spot_baseline"
    MANUAL = "manual"

Peer Score

Measures how much better (or worse) your forecasts are than the community's

Baseline Score

Measures raw forecasting accuracy against a uniform baseline

Spot Peer

Peer score evaluated at a specific point in time

Spot Baseline

Baseline score evaluated at a specific point in time

Peer Score

Peer scores measure how much better (or worse) your forecasts are compared to the community prediction.

Algorithm

From scoring/score_math.py:152-200, peer scores use logarithmic scoring:
def evaluate_forecasts_peer_accuracy(
    forecasts: Sequence[Forecast | AggregateForecast],
    base_forecasts: list[Forecast | AggregateForecast] | None,
    resolution_bucket: int,
    forecast_horizon_start: float,
    actual_close_time: float,
    forecast_horizon_end: float,
    question_type: str,
    geometric_means: list[AggregationEntry] | None = None,
) -> list[ForecastScore]:
    # Calculate geometric mean of all forecasters
    geometric_mean_forecasts = geometric_means or get_geometric_means(base_forecasts)
    
    # Score each forecast against the community
    for forecast in forecasts:
        forecast_start = max(forecast.start_time.timestamp(), forecast_horizon_start)
        forecast_end = (
            actual_close_time if forecast.end_time is None
            else min(forecast.end_time.timestamp(), actual_close_time)
        )
        
        pmf = forecast.get_pmf()
        p = pmf[resolution_bucket]  # Your forecast probability
        
        # Score against community at each timestep
        for gm in geometric_mean_forecasts:
            if forecast_start <= gm.timestamp < forecast_end:
                gmp = gm.pmf[resolution_bucket]  # Community forecast probability
                
                # Peer score formula
                interval_score = (
                    100 * (gm.num_forecasters / (gm.num_forecasters - 1)) 
                    * np.log(p / gmp)
                )
                
                if question_type in QUESTION_CONTINUOUS_TYPES:
                    interval_score /= 2  # continuous log scores are halved to keep scales comparable

Key Properties

Peer scores are relative: A positive peer score means you beat the community, while a negative score means the community was more accurate.
  • Scale: Typically ranges from -50 to +50 points
  • Zero sum: Peer scores across all forecasters sum to approximately zero
  • Requires multiple forecasters: Need at least 2 forecasters (formula includes num_forecasters / (num_forecasters - 1))
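As a sanity check, the interval formula above can be reproduced by hand for a binary question. This is an illustrative sketch, not the production code path; all numbers are made up:

```python
import math

def peer_interval_score(p: float, gmp: float, num_forecasters: int,
                        continuous: bool = False) -> float:
    """Peer score for one interval: 100 * n/(n-1) * ln(p / gmp),
    halved for continuous questions, mirroring the excerpt above."""
    score = 100 * (num_forecasters / (num_forecasters - 1)) * math.log(p / gmp)
    return score / 2 if continuous else score

# You forecast 80%, the community geometric mean was 60%, 10 forecasters:
score = peer_interval_score(p=0.8, gmp=0.6, num_forecasters=10)
print(round(score, 1))  # ~32.0 — positive, because you beat the community
```

Note that matching the community exactly (p == gmp) gives a score of zero, which is what makes peer scores approximately zero-sum.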

Time-Weighted

Peer scores are weighted by coverage - the fraction of the forecasting period your prediction was active:
forecast_duration = forecast_end - forecast_start
forecast_coverage = forecast_duration / total_duration
final_score = interval_score * forecast_coverage
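With illustrative numbers, the coverage weighting works out as follows (a hypothetical 10-day question where your forecast was active for the final 5 days):

```python
# Hypothetical durations, in seconds
total_duration = 10 * 86400       # full forecasting window: 10 days
forecast_duration = 5 * 86400     # your forecast was active for 5 days
interval_score = 40.0             # hypothetical unweighted interval score

forecast_coverage = forecast_duration / total_duration  # 0.5
final_score = interval_score * forecast_coverage
print(final_score)  # 20.0 — half the window, half the credit
```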

Baseline Score

Baseline scores measure raw forecasting accuracy against a naive baseline prediction.

Algorithm

From scoring/score_math.py:62-108, baseline scoring:
def evaluate_forecasts_baseline_accuracy(
    forecasts: Sequence[Forecast | AggregateForecast],
    resolution_bucket: int,
    forecast_horizon_start: float,
    actual_close_time: float,
    forecast_horizon_end: float,
    question_type: str,
    open_bounds_count: int,
) -> list[ForecastScore]:
    total_duration = forecast_horizon_end - forecast_horizon_start
    
    for forecast in forecasts:
        # Duration the forecast was active within the question window
        forecast_start = max(forecast.start_time.timestamp(), forecast_horizon_start)
        forecast_end = (
            actual_close_time if forecast.end_time is None
            else min(forecast.end_time.timestamp(), actual_close_time)
        )
        forecast_duration = forecast_end - forecast_start
        forecast_coverage = forecast_duration / total_duration
        pmf = forecast.get_pmf()
        
        if question_type in ["binary", "multiple_choice"]:
            # Baseline: uniform distribution over options
            options_at_time = sum(~np.isnan(pmf))
            p = pmf[resolution_bucket]
            
            # Logarithmic score vs. uniform
            forecast_score = (
                100 * np.log(p * options_at_time) / np.log(options_at_time)
            )
        else:
            # Baseline: 5% probability in each tail, uniform over middle
            if resolution_bucket in [0, len(pmf) - 1]:
                baseline = 0.05
            else:
                baseline = (1 - 0.05 * open_bounds_count) / (len(pmf) - 2)
            
            # Logarithmic score vs. baseline
            forecast_score = 100 * np.log(pmf[resolution_bucket] / baseline) / 2
        
        # Weight by coverage
        final_score = forecast_score * forecast_coverage

Baseline Assumptions

Baseline: uniform distribution over all available options. For a binary question, the baseline is 50/50; for a 4-option multiple choice, each option gets 25%.
  • Maximum score: 100 points (achieved with 100% confidence in the correct outcome)
  • Minimum score: negative infinity (as the probability assigned to the correct outcome approaches 0)
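For a binary question, the log-score-vs-uniform formula above reduces to a simple function of your probability on the correct outcome. A minimal sketch with illustrative values:

```python
import math

def binary_baseline_score(p: float, options: int = 2) -> float:
    """Log score against a uniform baseline, as in the excerpt above:
    100 * ln(p * options) / ln(options)."""
    return 100 * math.log(p * options) / math.log(options)

print(round(binary_baseline_score(0.5), 1))  # 0.0   — matching the 50/50 baseline scores zero
print(round(binary_baseline_score(1.0), 1))  # 100.0 — the maximum score
print(round(binary_baseline_score(0.8), 1))  # ~67.8 — confident and correct
```

Probabilities below the baseline go negative, and the score dives toward negative infinity as p approaches 0, which is why overconfidence on the wrong outcome is so costly.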

Key Properties

  • Scale: 0 to 100+ points (higher is better)
  • Absolute: Scores don’t depend on other forecasters
  • Always calculable: Can be computed even with only one forecaster
  • Incentivizes boldness: Confident, accurate predictions earn more points

Spot Scores

Spot scores evaluate forecasts at a single point in time rather than across the entire forecasting period.

Spot Scoring Time

From questions/models.py:428-441, the spot scoring time is determined by:
def get_spot_scoring_time(self) -> datetime | None:
    if self.spot_scoring_time:
        return self.spot_scoring_time
    elif self.cp_reveal_time and self.open_time and self.cp_reveal_time > self.open_time:
        return self.cp_reveal_time
    elif self.actual_close_time:
        return self.actual_close_time
    elif self.scheduled_close_time:
        return self.scheduled_close_time
    return None
Priority order:
  1. Explicit spot_scoring_time if set
  2. cp_reveal_time (when community prediction is revealed)
  3. actual_close_time
  4. scheduled_close_time

Spot Baseline Score

From scoring/score_math.py:111-149:
def evaluate_forecasts_baseline_spot_forecast(
    forecasts: Sequence[Forecast | AggregateForecast],
    resolution_bucket: int,
    spot_forecast_timestamp: float,
    question_type: str,
    open_bounds_count: int,
) -> list[ForecastScore]:
    for forecast in forecasts:
        start = forecast.start_time.timestamp()
        end = float("inf") if forecast.end_time is None else forecast.end_time.timestamp()
        
        # Only score forecasts active at the spot time
        if start <= spot_forecast_timestamp < end:
            pmf = forecast.get_pmf()
            # Same scoring formula as baseline, but coverage = 1.0
            forecast_scores.append(ForecastScore(forecast_score, 1.0))
        else:
            forecast_scores.append(ForecastScore(0))
If you don’t have an active forecast at the spot scoring time, you receive 0 points for that question.

Use Cases

Spot scores are ideal for:
  • Questions where the resolution can affect forecasting (CP hidden until reveal)
  • Live forecasting events with synchronized scoring times
  • Questions that resolve quickly after closing
  • Preventing score manipulation by rapid forecast updates

Score Model

Scores are stored per user, per question, per score type (from scoring/models.py:17-57):
class Score(TimeStampedModel):
    user = models.ForeignKey(User, null=True, on_delete=models.CASCADE)
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    
    # Score data
    score = models.FloatField()
    coverage = models.FloatField(default=0)  # Fraction of forecast period covered
    
    # Metadata
    score_type = models.CharField(max_length=200, choices=ScoreTypes.choices)
    aggregation_method = models.CharField(
        max_length=200,
        choices=AggregationMethod.choices,
        db_index=True
    )

Coverage

The coverage field tracks what fraction of the forecasting period you participated in:
  • Coverage = 1.0: You had an active forecast for the entire question period
  • Coverage = 0.5: You forecasted for half the question period
  • Coverage = 0.0: You never made a forecast
Early forecasting is rewarded! Making predictions early and maintaining them gives you higher coverage and more scoring opportunities.

Archived Scores

Historical scores that can’t be recalculated are archived (from scoring/models.py:59-94):
class ArchivedScore(TimeStampedModel):
    user = models.ForeignKey(User, null=True, on_delete=models.CASCADE)
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    score = models.FloatField()
    coverage = models.FloatField(default=0)
    score_type = models.CharField(max_length=200, choices=ArchivedScoreTypes.choices)
    aggregation_method = models.CharField(max_length=200, choices=AggregationMethod.choices)
Archived score types (from scoring/constants.py:13-14):
class ArchivedScoreTypes(models.TextChoices):
    RELATIVE_LEGACY = "relative_legacy"  # Old scoring system

Default Score Type

Each question designates a default score type (from questions/models.py:91-102):
default_score_type = models.CharField(
    max_length=20,
    choices=ScoreTypes.choices,
    default=ScoreTypes.PEER,
    help_text="""Default score type for this question.
    Generally, this should be either 'peer' or 'spot_peer'.
    Determines which score will be most prominently displayed in the UI.
    Also, for Leaderboards that have a 'score type' of 'default',
    this question's default score type will be the one that contributes
    to the leaderboard.
    """,
)

Question Weighting

Questions can have different weights when aggregating scores (from questions/models.py:90):
question_weight = models.FloatField(default=1.0)
  • Weight = 1.0: Standard question (default)
  • Weight > 1.0: More important question (higher contribution to leaderboards)
  • Weight < 1.0: Less important question
  • Weight = 0.0: Question excluded from scoring entirely
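One way to picture the effect of weights on an aggregate: multiply each question's score by its weight before summing. This is an illustrative sketch with made-up scores, not the leaderboard implementation:

```python
def weighted_total(scores_and_weights: list[tuple[float, float]]) -> float:
    """Sum of score * question_weight; a weight of 0.0 drops the question."""
    return sum(score * weight for score, weight in scores_and_weights)

# (peer score, question_weight) for three hypothetical questions;
# the third has weight 0.0 and contributes nothing.
results = [(30.0, 1.0), (10.0, 2.0), (-5.0, 0.0)]
print(weighted_total(results))  # 50.0
```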

Unsuccessful Resolutions

Some questions cannot be resolved normally:
  • Ambiguous: Resolution criteria cannot be clearly applied
  • Annulled: Question was flawed or should not have been asked
From questions/constants.py:4-6:
class UnsuccessfulResolutionType(models.TextChoices):
    AMBIGUOUS = "ambiguous"
    ANNULLED = "annulled"
Questions resolved as ambiguous or annulled are typically excluded from scoring and leaderboards.

Scoring Best Practices

  • Forecast early and update regularly to maximize your coverage. Even a simple initial forecast boosts your score potential.
  • Track your calibration: over many forecasts, events you predict at 70% should happen about 70% of the time.
  • The scoring system rewards confident, accurate predictions. If you have strong evidence, don't be afraid to make extreme forecasts.
  • Scores are time-weighted, so updating your forecast when you learn new information helps your score.
  • To beat the baseline, you need to outperform uniform/naive probabilities. Consider what an "uninformed" forecaster would predict.
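The calibration tip above can be checked with a simple bucket count over your resolved forecasts. A minimal sketch with a hypothetical forecast history:

```python
from collections import defaultdict

def calibration_table(forecasts: list[tuple[float, bool]],
                      bucket_width: float = 0.1) -> dict[float, float]:
    """Group (probability, resolved_yes) pairs into probability buckets and
    compare each bucket's mean forecast to its empirical resolution rate."""
    buckets: dict[int, list[tuple[float, bool]]] = defaultdict(list)
    for p, outcome in forecasts:
        buckets[int(p / bucket_width)].append((p, outcome))
    table = {}
    for _, items in sorted(buckets.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(o for _, o in items) / len(items)
        table[round(mean_p, 2)] = round(hit_rate, 2)
    return table

# Hypothetical history: four 70% forecasts, three of which resolved Yes
history = [(0.7, True), (0.7, True), (0.7, True), (0.7, False)]
print(calibration_table(history))  # {0.7: 0.75} — close to calibrated
```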

API Reference

Scores API

Explore the full Scores API documentation

Questions

Understand question types and structure

Forecasting

Learn how to make predictions

Leaderboards

See how scores aggregate into rankings

Tournaments

Compete for prizes using these scoring rules