Overview
The Scores endpoints provide detailed information about how forecasts are scored on Metaculus. Scores measure forecast accuracy using proper scoring rules and are the basis for leaderboards and performance tracking.
Scores are typically accessed through leaderboard endpoints and post data downloads. These endpoints provide raw scoring data for advanced analysis.
Score Types
Metaculus uses several scoring methods to evaluate forecasts:
- Peer Score: Measures performance relative to the community aggregate prediction. Rewards forecasters who beat the crowd. Calculated as: your log score - community log score.
- Baseline Score: Measures performance relative to a simple baseline prior (e.g., a uniform distribution for continuous questions, 50% for binary). More generous than the peer score, and useful for beginners.
- Spot Peer Score: The peer score evaluated at a specific time (CP reveal time) rather than continuously weighted. Used for tournament scoring to prevent gaming through frequent updates.
- Spot Baseline Score: The baseline score evaluated at CP reveal time.
- Legacy Relative Score: A historical scoring method from old Metaculus. Deprecated.
Scoring Mechanics
How Scores Are Calculated
- Log Score: Your forecast is scored with a logarithmic scoring rule. For a binary question with outcome O and prediction p:
  - Score = log₂(p) if O = Yes
  - Score = log₂(1 - p) if O = No
- Continuous Questions: The forecast CDF is converted to a PMF, and the log score is then calculated from the probability mass assigned to the actual outcome.
- Coverage: Your score is weighted by the fraction of time you had an active forecast on the question. Higher coverage means a more reliable score.
- Aggregation: Scores across questions are averaged, with question weights applied.
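The mechanics above can be sketched in a few lines. This is an illustrative simplification, not the exact site implementation (in particular, it ignores coverage weighting and question weights); the helper names are invented for this example:

```python
import math

def binary_log_score(p: float, resolved_yes: bool) -> float:
    """Base-2 log score for a binary forecast, where p = P(Yes)."""
    return math.log2(p) if resolved_yes else math.log2(1 - p)

def peer_score(my_log_score: float, community_log_score: float) -> float:
    """Peer score as described above: your log score minus the community's."""
    return my_log_score - community_log_score

# A 50% forecast always scores -1 bit; sharper correct forecasts score higher.
print(binary_log_score(0.5, True))              # -1.0
print(binary_log_score(0.9, True))              # ~ -0.15
print(peer_score(binary_log_score(0.9, True),
                 binary_log_score(0.7, True)))  # positive: beat the crowd
```

Note that a positive peer score requires beating the community, not merely being right: a correct 70% forecast against a correct 90% community aggregate still yields a negative peer score.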
Download Score Data
Score data is primarily accessed through the data download endpoints:
Post-Level Scores
curl -X GET "https://www.metaculus.com/api/posts/3530/download-data/?include_scores=true" \
-H "Authorization: Token YOUR_TOKEN" \
--output question_data.zip
See the Posts endpoint documentation for full details on the download-data endpoint.
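The same download can be done from Python rather than curl. A sketch, wrapped in a hypothetical helper (`download_post_scores` is not part of any client library; `YOUR_TOKEN` is a placeholder as in the curl example):

```python
import requests

def download_post_scores(post_id: int, token: str, out_path: str) -> None:
    """Download a post's data, scores included, as a zip archive."""
    response = requests.get(
        f"https://www.metaculus.com/api/posts/{post_id}/download-data/",
        headers={"Authorization": f"Token {token}"},
        params={"include_scores": True},
        timeout=60,
    )
    response.raise_for_status()  # fail loudly on auth or rate-limit errors
    with open(out_path, "wb") as f:
        f.write(response.content)

# Usage (requires a valid API token):
# download_post_scores(3530, "YOUR_TOKEN", "question_data.zip")
```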
Project-Level Scores
curl -X GET "https://www.metaculus.com/api/projects/144/download-data/?include_scores=true" \
-H "Authorization: Token YOUR_TOKEN" \
--output project_data.zip
See the Projects endpoint documentation for details.
When you download score data, you receive a CSV with the following schema:
Score Data CSV Schema
- Question ID: The question ID this score is for
- User ID: The ID of the user who earned this score
- User Username: The username of the scorer
- Score Type: The type of score: peer, baseline, spot_peer, spot_baseline, relative_legacy, or manual
- Score: The score value. Higher is better; scores can be negative.
- Coverage: The coverage value (0-1), representing the fraction of time the user had an active forecast
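A hypothetical row parsed against this schema, to make the column names and types concrete (the values are invented for illustration):

```python
import csv
import io

# Invented sample CSV matching the schema above (values are illustrative only)
sample = (
    "Question ID,User ID,User Username,Score Type,Score,Coverage\n"
    "3530,12345,example_user,peer,4.21,0.87\n"
)

row = next(csv.DictReader(io.StringIO(sample)))
print(row["Score Type"])       # peer
print(float(row["Score"]))     # 4.21
print(float(row["Coverage"]))  # 0.87
```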
Accessing Scores in Aggregations
Scores for community aggregations are included in question data when using with_cp=true:
import requests

response = requests.get(
    "https://www.metaculus.com/api/posts/3530/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"with_cp": True},
)
post = response.json()
question = post["question"]
aggregations = question["aggregations"]

# Get recency-weighted aggregation scores
rw_scores = aggregations["recency_weighted"]["score_data"]
print("Recency Weighted Aggregation Scores:")
print(f"  Peer Score: {rw_scores.get('peer_score', 'N/A')}")
print(f"  Baseline Score: {rw_scores.get('baseline_score', 'N/A')}")
print(f"  Coverage: {rw_scores.get('coverage', 'N/A')}")
Example: Project Score Analysis

import requests
import pandas as pd
import zipfile
import io

# Download scores for a project
response = requests.get(
    "https://www.metaculus.com/api/projects/3876/download-data/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"include_scores": True},
)

with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open('score_data.csv') as f:
        scores_df = pd.read_csv(f)
    with zf.open('question_data.csv') as f:
        questions_df = pd.read_csv(f)

# Filter to peer scores only
peer_scores = scores_df[scores_df['Score Type'] == 'peer']

# Calculate statistics per user
user_stats = peer_scores.groupby('User Username').agg({
    'Score': ['mean', 'sum', 'count'],
    'Coverage': 'mean'
}).round(2)
user_stats.columns = ['Avg Score', 'Total Score', 'Questions', 'Avg Coverage']
user_stats = user_stats.sort_values('Total Score', ascending=False)

print("Top 10 Forecasters by Total Peer Score:")
print(user_stats.head(10))

# Analyze by question type
question_types = questions_df.set_index('Question ID')['Question Type']
scores_with_type = peer_scores.copy()
scores_with_type['Question Type'] = scores_with_type['Question ID'].map(question_types)
type_stats = scores_with_type.groupby('Question Type')['Score'].agg(['mean', 'count'])

print("\nAverage Score by Question Type:")
print(type_stats)
Example: Coverage Analysis
import requests
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
import io

response = requests.get(
    "https://www.metaculus.com/api/posts/3530/download-data/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"include_scores": True},
)

with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open('score_data.csv') as f:
        scores_df = pd.read_csv(f)

# Filter to peer scores
peer_scores = scores_df[scores_df['Score Type'] == 'peer'].copy()

# Create coverage bins
peer_scores['Coverage Bin'] = pd.cut(
    peer_scores['Coverage'],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['0-25%', '25-50%', '50-75%', '75-100%']
)

# Analyze score by coverage
coverage_stats = peer_scores.groupby('Coverage Bin')['Score'].agg(['mean', 'std', 'count'])
print("Score Statistics by Coverage:")
print(coverage_stats)

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(len(coverage_stats)), coverage_stats['mean'],
        yerr=coverage_stats['std'], capsize=5)
plt.xlabel('Coverage Bin')
plt.ylabel('Average Peer Score')
plt.title('Forecast Accuracy vs Coverage')
plt.xticks(range(len(coverage_stats)), coverage_stats.index)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('coverage_analysis.png')
Example: Historical Score Tracking
import requests
import pandas as pd
import zipfile
import io

# Download question and forecast data
response = requests.get(
    "https://www.metaculus.com/api/posts/3530/download-data/",
    headers={"Authorization": "Token YOUR_TOKEN"},
    params={"include_scores": True},
)

with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open('forecast_data.csv') as f:
        forecasts_df = pd.read_csv(f)
    with zf.open('score_data.csv') as f:
        scores_df = pd.read_csv(f)
    with zf.open('question_data.csv') as f:
        questions_df = pd.read_csv(f)

# Focus on a specific user
my_user_id = 12345
my_forecasts = forecasts_df[
    (forecasts_df['Forecaster ID'] == my_user_id) &
    (forecasts_df['Forecaster ID'].notna())  # Exclude aggregations
].copy()

my_scores = scores_df[
    (scores_df['User ID'] == my_user_id) &
    (scores_df['Score Type'] == 'peer')
].copy()

# Parse forecast timestamps and bucket them by month
my_forecasts['Start Time'] = pd.to_datetime(my_forecasts['Start Time'])
my_forecasts['Month'] = my_forecasts['Start Time'].dt.to_period('M')

# Calculate monthly activity
monthly_questions = my_forecasts.groupby('Month')['Question ID'].nunique()
monthly_forecasts = my_forecasts.groupby('Month').size()

print("Your Forecasting Activity:")
print(f"  Questions forecasted: {my_forecasts['Question ID'].nunique()}")
print(f"  Total forecasts: {len(my_forecasts)}")
print(f"  Average peer score: {my_scores['Score'].mean():.2f}")
print(f"  Score std dev: {my_scores['Score'].std():.2f}")

print("\nMonthly Activity:")
for month, count in monthly_questions.tail(6).items():
    forecasts = monthly_forecasts[month]
    print(f"  {month}: {count} questions, {forecasts} forecasts")
Important Notes
Score Data Access
Individual user scores are only available:
- To the user themselves
- To site administrators
- In aggregate form (leaderboards)
You cannot download other users’ detailed score data for privacy reasons.
Score Timing
Scores are calculated:
- When questions resolve
- When leaderboards are updated (typically daily)
- When explicitly recalculated by admins
There may be a delay between question resolution and score appearance.
Coverage Matters
High coverage (maintaining active forecasts for a large fraction of each question's open time, across many questions) makes scores more reliable and statistically meaningful. Users with low coverage may have high variance in their scores.