
Overview

The Scores endpoints provide detailed information about how forecasts are scored on Metaculus. Scores measure forecast accuracy using proper scoring rules and are the basis for leaderboards and performance tracking.
Scores are typically accessed through leaderboard endpoints and post data downloads. These endpoints provide raw scoring data for advanced analysis.

Understanding Metaculus Scoring

Score Types

Metaculus uses several scoring methods to evaluate forecasts:
  • peer (Peer Score): Measures performance relative to the community aggregate prediction. Rewards forecasters who beat the crowd. Calculated as: your log score minus the community log score.
  • baseline (Baseline Score): Measures performance relative to a simple baseline prior (e.g., a uniform distribution for continuous questions, 50% for binary). More generous than the peer score; useful for beginners.
  • spot_peer (Spot Peer Score): Peer score evaluated at a specific time (CP reveal time) rather than continuously weighted. Used for tournament scoring to prevent gaming through frequent updates.
  • spot_baseline (Spot Baseline Score): Baseline score evaluated at CP reveal time.
  • relative_legacy (Legacy Relative Score): Historical scoring method from old Metaculus. Deprecated.
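The relationship between the peer and baseline scores can be sketched for a binary question. This is illustrative only: the actual Metaculus scores apply normalization and multipliers not shown here, and the formulas below only reflect the relationships stated above.

```python
import math

def log_score(p: float, outcome: bool) -> float:
    """Binary log score in bits: log2(p) if the question resolves Yes, log2(1 - p) if No."""
    return math.log2(p if outcome else 1 - p)

# A forecaster at 80% vs a community aggregate at 60%; the question resolves Yes.
user, community = 0.80, 0.60
peer = log_score(user, True) - log_score(community, True)  # beat the crowd -> positive
baseline = log_score(user, True) - log_score(0.50, True)   # vs the 50% prior

print(f"peer: {peer:.3f}, baseline: {baseline:.3f}")
```

Note that the baseline comparison against a fixed 50% prior is easier to beat than the community aggregate, which is why the baseline score is described as more generous.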

Scoring Mechanics

How Scores Are Calculated
  1. Log Score: Your forecast is scored using logarithmic scoring rules. For binary questions with outcome O and prediction p:
    • Score = log₂(p) if O = Yes
    • Score = log₂(1-p) if O = No
  2. Continuous Questions: CDF is converted to PMF, then log score is calculated based on the probability mass assigned to the actual outcome.
  3. Coverage: Each question score is weighted by the fraction of that question's open time during which you had an active forecast (its coverage). Higher coverage = more reliable score.
  4. Aggregation: Scores across questions are averaged with question weights applied.
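Steps 3 and 4 can be sketched as a weighted average. This assumes a simple multiplicative weighting of score by coverage and question weight; the exact Metaculus aggregation formula may differ.

```python
# Illustrative per-question scores with coverage and question weights (made-up values).
scores = [
    {"score": 0.42, "coverage": 1.0, "weight": 1.0},   # forecast active open-to-close
    {"score": -0.10, "coverage": 0.5, "weight": 1.0},  # forecast active half the time
    {"score": 0.30, "coverage": 1.0, "weight": 0.5},   # half-weight question
]

# Coverage discounts partial participation; question weight scales each contribution.
total_weight = sum(s["weight"] for s in scores)
weighted_avg = sum(s["score"] * s["coverage"] * s["weight"] for s in scores) / total_weight

print(f"weighted average: {weighted_avg:.3f}")
```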

Download Score Data

Score data is primarily accessed through the data download endpoints:

Post-Level Scores

curl -X GET "https://www.metaculus.com/api/posts/3530/download-data/?include_scores=true" \
  -H "Authorization: Token YOUR_TOKEN" \
  --output question_data.zip
See the Posts endpoint documentation for full details on the download-data endpoint.
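The same request can be issued from Python with requests. The sketch below only prepares the request so the final URL is visible without hitting the network; sending it (and writing `response.content` to a .zip file) works exactly as in the curl example.

```python
import requests

def build_download_request(post_id: int, token: str) -> requests.PreparedRequest:
    """Build (but do not send) the download-data request for a post."""
    req = requests.Request(
        "GET",
        f"https://www.metaculus.com/api/posts/{post_id}/download-data/",
        headers={"Authorization": f"Token {token}"},
        params={"include_scores": "true"},
    )
    return req.prepare()

prepared = build_download_request(3530, "YOUR_TOKEN")
print(prepared.url)
# To actually download: response = requests.Session().send(prepared)
# then write response.content to question_data.zip
```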

Project-Level Scores

curl -X GET "https://www.metaculus.com/api/projects/144/download-data/?include_scores=true" \
  -H "Authorization: Token YOUR_TOKEN" \
  --output project_data.zip
See the Projects endpoint documentation for details.

Score Data Format

When you download score data, you receive a CSV with the following schema:

Score Data CSV Schema

  • Question ID (integer): The question ID this score is for
  • User ID (integer): The user ID who earned this score
  • User Username (string): The username of the scorer
  • Score Type (string): Type of score: peer, baseline, spot_peer, spot_baseline, relative_legacy, or manual
  • Score (number): The score value. Higher is better; can be negative.
  • Coverage (number): The coverage value (0-1), representing what fraction of the question's open time the user had an active forecast
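As a quick sanity check, the schema can be exercised with a few synthetic rows (the data below is made up; only the column names follow the schema above):

```python
import io
import pandas as pd

# Synthetic rows matching the documented CSV schema.
csv_text = """Question ID,User ID,User Username,Score Type,Score,Coverage
3530,12345,alice,peer,12.4,0.95
3530,12345,alice,baseline,30.1,0.95
3530,67890,bob,peer,-4.2,0.40
"""

scores_df = pd.read_csv(io.StringIO(csv_text))

# Coverage is a 0-1 fraction; Score can be negative.
peer = scores_df[scores_df["Score Type"] == "peer"]
print(peer[["User Username", "Score", "Coverage"]])
```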

Accessing Scores in Aggregations

Scores for community aggregations are included in question data when using with_cp=true:
import requests

response = requests.get(
    "https://www.metaculus.com/api/posts/3530/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"with_cp": True}
)

post = response.json()
question = post["question"]
aggregations = question["aggregations"]

# Get recency-weighted aggregation scores
rw_scores = aggregations["recency_weighted"]["score_data"]

print("Recency Weighted Aggregation Scores:")
print(f"  Peer Score: {rw_scores.get('peer_score', 'N/A')}")
print(f"  Baseline Score: {rw_scores.get('baseline_score', 'N/A')}")
print(f"  Coverage: {rw_scores.get('coverage', 'N/A')}")

Example: Analyze User Performance

import requests
import pandas as pd
import zipfile
import io

# Download scores for a project
response = requests.get(
    "https://www.metaculus.com/api/projects/3876/download-data/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"include_scores": True}
)

with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open('score_data.csv') as f:
        scores_df = pd.read_csv(f)
    with zf.open('question_data.csv') as f:
        questions_df = pd.read_csv(f)

# Filter to peer scores only
peer_scores = scores_df[scores_df['Score Type'] == 'peer']

# Calculate statistics per user
user_stats = peer_scores.groupby('User Username').agg({
    'Score': ['mean', 'sum', 'count'],
    'Coverage': 'mean'
}).round(2)

user_stats.columns = ['Avg Score', 'Total Score', 'Questions', 'Avg Coverage']
user_stats = user_stats.sort_values('Total Score', ascending=False)

print("Top 10 Forecasters by Total Peer Score:")
print(user_stats.head(10))

# Analyze by question type
question_types = questions_df.set_index('Question ID')['Question Type']
scores_with_type = peer_scores.copy()
scores_with_type['Question Type'] = scores_with_type['Question ID'].map(question_types)

type_stats = scores_with_type.groupby('Question Type')['Score'].agg(['mean', 'count'])
print("\nAverage Score by Question Type:")
print(type_stats)

Example: Coverage Analysis

import requests
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
import io

response = requests.get(
    "https://www.metaculus.com/api/posts/3530/download-data/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"include_scores": True}
)

with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open('score_data.csv') as f:
        scores_df = pd.read_csv(f)

# Filter to peer scores
peer_scores = scores_df[scores_df['Score Type'] == 'peer'].copy()

# Create coverage bins
peer_scores['Coverage Bin'] = pd.cut(
    peer_scores['Coverage'],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['0-25%', '25-50%', '50-75%', '75-100%']
)

# Analyze score by coverage
coverage_stats = peer_scores.groupby('Coverage Bin')['Score'].agg(['mean', 'std', 'count'])

print("Score Statistics by Coverage:")
print(coverage_stats)

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(len(coverage_stats)), coverage_stats['mean'], 
        yerr=coverage_stats['std'], capsize=5)
plt.xlabel('Coverage Bin')
plt.ylabel('Average Peer Score')
plt.title('Forecast Accuracy vs Coverage')
plt.xticks(range(len(coverage_stats)), coverage_stats.index)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('coverage_analysis.png')

Example: Historical Score Tracking

import requests
import pandas as pd
import zipfile
import io

# Download question and forecast data
response = requests.get(
    "https://www.metaculus.com/api/posts/3530/download-data/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"include_scores": True}
)

with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open('forecast_data.csv') as f:
        forecasts_df = pd.read_csv(f)
    with zf.open('score_data.csv') as f:
        scores_df = pd.read_csv(f)
    with zf.open('question_data.csv') as f:
        questions_df = pd.read_csv(f)

# Focus on a specific user
my_user_id = 12345

my_forecasts = forecasts_df[
    forecasts_df['Forecaster ID'] == my_user_id  # equality also excludes aggregation rows (NaN IDs)
].copy()

my_scores = scores_df[
    (scores_df['User ID'] == my_user_id) &
    (scores_df['Score Type'] == 'peer')
].copy()

# Merge to get forecast timestamps
my_forecasts['Start Time'] = pd.to_datetime(my_forecasts['Start Time'])
my_forecasts['Month'] = my_forecasts['Start Time'].dt.to_period('M')

# Calculate rolling performance
monthly_questions = my_forecasts.groupby('Month')['Question ID'].nunique()
monthly_forecasts = my_forecasts.groupby('Month').size()

print("Your Forecasting Activity:")
print(f"  Questions forecasted: {my_forecasts['Question ID'].nunique()}")
print(f"  Total forecasts: {len(my_forecasts)}")
print(f"  Average peer score: {my_scores['Score'].mean():.2f}")
print(f"  Score std dev: {my_scores['Score'].std():.2f}")
print("\nMonthly Activity:")
for month, count in monthly_questions.tail(6).items():
    forecasts = monthly_forecasts[month]
    print(f"  {month}: {count} questions, {forecasts} forecasts")

Important Notes

Score Data Access

Individual user scores are only available:
  • To the user themselves
  • To site administrators
  • In aggregate form (leaderboards)
You cannot download other users' detailed score data for privacy reasons.

Score Timing

Scores are calculated:
  • When questions resolve
  • When leaderboards are updated (typically daily)
  • When explicitly recalculated by admins
There may be a delay between question resolution and score appearance.

Coverage Matters

High coverage (keeping an active forecast for most of a question's open time, across many questions) makes scores more reliable and statistically meaningful. Users with low coverage may have high variance in their scores.
