Detailed guide on generating 201-point CDFs for continuous questions
Generating a proper Cumulative Distribution Function (CDF) for continuous questions (numeric or date types) can be challenging. This guide provides complete, tested code to help you create valid CDFs.
Metaculus requires continuous forecasts as a 201-point CDF - a list of 201 probability values representing the cumulative probability at evenly-spaced points across the question’s range.
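For intuition, the simplest well-formed example is the uniform distribution, whose CDF rises linearly from 0 to 1 across the 201 points. This sketch assumes a question with closed bounds on both sides:

```python
# A minimal illustrative CDF: uniform over the question's range.
# With closed bounds, the first value is 0.0 and the last is 1.0.
uniform_cdf = [i / 200 for i in range(201)]

print(len(uniform_cdf))   # 201 points
print(uniform_cdf[100])   # 0.5 cumulative probability at the midpoint
```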
This is the recommended approach - specify a few key percentiles and generate a full CDF:
```python
def generate_continuous_cdf(
    percentiles: dict,
    question_data: dict,
    below_lower_bound: float = None,
    above_upper_bound: float = None,
) -> list[float]:
    """
    Takes a set of percentiles and returns a corresponding cdf with 201 values

    Parameters
    ----------
    percentiles : dict[str, float | str]
        Keys must terminate in a number interpretable as a float in range
        (0, 100), optionally preceded by an underscore "_".
        Values must be a nominal value in the scale of the question, either
        interpretable as a float (for "numeric" type questions) or a
        datetime in ISO format (for "date" type questions)
        Example:
            percentiles = {
                "percentile_01": 25,
                "percentile_25": 500,
                "50": 650,
                "percentile_75": "700",
                "percentile_99": 990,
            }
    question_data : dict
        Question object from the API
    below_lower_bound : float, optional
        Amount of probability mass assigned below the lower bound
    above_upper_bound : float, optional
        Amount of probability mass assigned above the upper bound

    Returns
    -------
    list[float]
        201-point CDF ready for submission
    """
    # This will be the set of (x, y) points that are the set points of the cdf
    percentile_locations = []

    # Take the given boundary values
    if below_lower_bound is not None:
        percentile_locations.append((0.0, below_lower_bound))
    if above_upper_bound is not None:
        percentile_locations.append((1.0, 1 - above_upper_bound))

    # Generate the remaining set of points
    for percentile, nominal_location in percentiles.items():
        height = float(str(percentile).split("_")[-1]) / 100
        location = nominal_location_to_cdf_location(nominal_location, question_data)
        percentile_locations.append((location, height))

    # Sort to ensure lookup works
    percentile_locations.sort()

    # Check validity
    first_point, last_point = percentile_locations[0], percentile_locations[-1]
    if (first_point[0] > 0.0) or (last_point[0] < 1.0):
        raise ValueError("Percentiles must encompass bounds of the question")

    def get_cdf_at(location):
        # Helper function that takes a location and returns the height of
        # the cdf at that location, linearly interpolating between values
        previous = percentile_locations[0]
        for i in range(1, len(percentile_locations)):
            current = percentile_locations[i]
            if previous[0] <= location <= current[0]:
                return previous[1] + (current[1] - previous[1]) * (
                    location - previous[0]
                ) / (current[0] - previous[0])
            previous = current

    # Generate that cdf
    continuous_cdf = [get_cdf_at(i / 200) for i in range(201)]
    return continuous_cdf
```
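The function above relies on a helper, `nominal_location_to_cdf_location`, to map nominal values onto the [0, 1] range. A minimal sketch for linearly scaled questions follows; the field names (`scaling`, `range_min`, `range_max`) and the linear-scale assumption are assumptions here, and log-scaled questions (nonzero `zero_point`) would need a different transform:

```python
from datetime import datetime


def nominal_location_to_cdf_location(nominal_location, question_data: dict) -> float:
    # Minimal sketch: assumes a linearly scaled question whose range is
    # given by question_data["scaling"]["range_min"/"range_max"] (field
    # names assumed). Log-scaled questions need a different transform.
    scaling = question_data["scaling"]
    range_min = float(scaling["range_min"])
    range_max = float(scaling["range_max"])
    if question_data["type"] == "date":
        # Date ranges are assumed here to be stored as Unix timestamps
        value = datetime.fromisoformat(str(nominal_location)).timestamp()
    else:
        value = float(nominal_location)
    return (value - range_min) / (range_max - range_min)
```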
This function ensures your CDF meets all Metaculus requirements:
```python
import numpy as np


def standardize_cdf(cdf, question_data: dict):
    """
    Takes a cdf and returns a standardized version of it
    - Assigns no mass outside of closed bounds (scales accordingly)
    - Assigns at least a minimum amount of mass outside of open bounds
    - Increasing by at least the minimum amount (0.01 / 200 = 0.00005)
    - Caps the maximum growth to 0.2
    Note: thresholds change with different `inbound_outcome_count`s
    """
    lower_open = question_data["open_lower_bound"]
    upper_open = question_data["open_upper_bound"]
    inbound_outcome_count = question_data["inbound_outcome_count"]
    default_inbound_outcome_count = 200

    cdf = np.asarray(cdf, dtype=float)
    if not cdf.size:
        return []

    # Apply lower bound & enforce boundary values
    scale_lower_to = 0 if lower_open else cdf[0]
    scale_upper_to = 1.0 if upper_open else cdf[-1]
    rescaled_inbound_mass = scale_upper_to - scale_lower_to

    def standardize(F: float, location: float) -> float:
        # `F` is the height of the cdf at `location` (in range [0, 1])
        # Rescale
        rescaled_F = (F - scale_lower_to) / rescaled_inbound_mass
        # Offset
        if lower_open and upper_open:
            return 0.988 * rescaled_F + 0.01 * location + 0.001
        elif lower_open:
            return 0.989 * rescaled_F + 0.01 * location + 0.001
        elif upper_open:
            return 0.989 * rescaled_F + 0.01 * location
        return 0.99 * rescaled_F + 0.01 * location

    for i, value in enumerate(cdf):
        cdf[i] = standardize(value, i / (len(cdf) - 1))

    # Apply upper bound - operate in PMF space
    pmf = np.diff(cdf, prepend=0, append=1)

    # Cap depends on inbound_outcome_count (0.2 if it is the default 200)
    cap = 0.2 * (default_inbound_outcome_count / inbound_outcome_count)

    def cap_pmf(scale: float) -> np.ndarray:
        return np.concatenate(
            [pmf[:1], np.minimum(cap, scale * pmf[1:-1]), pmf[-1:]]
        )

    def capped_sum(scale: float) -> float:
        return float(cap_pmf(scale).sum())

    # Find the appropriate scale search space
    lo = hi = scale = 1.0
    while capped_sum(hi) < 1.0:
        hi *= 1.2

    # Hone in on scale value that makes capped sum 1
    for _ in range(100):
        scale = 0.5 * (lo + hi)
        s = capped_sum(scale)
        if s < 1.0:
            lo = scale
        else:
            hi = scale
        if s == 1.0 or (hi - lo) < 2e-5:
            break

    # Apply scale and renormalize
    pmf = cap_pmf(scale)
    pmf[1:-1] *= (cdf[-1] - cdf[0]) / pmf[1:-1].sum()

    # Back to CDF space
    cdf = np.cumsum(pmf)[:-1]

    # Round to minimize floating point errors
    cdf = np.round(cdf, 10)
    return cdf.tolist()
```
Your CDF will be rejected if it violates these rules:
Rule 1: Strictly Increasing
The CDF must increase by at least 0.00005 (0.005%) at each step.
```python
# Check this rule
for i in range(1, len(cdf)):
    increase = cdf[i] - cdf[i - 1]
    if increase < 0.00005:
        print(f"Error at index {i}: increase too small ({increase})")
```
Rule 2: Maximum Step Size
No step can increase by more than 0.2 (20%).
```python
# Check this rule
for i in range(1, len(cdf)):
    increase = cdf[i] - cdf[i - 1]
    if increase > 0.2:
        print(f"Error at index {i}: increase too large ({increase})")
```
Rule 3: Boundary Conditions
Closed lower bound: First value must be 0.0
Open lower bound: First value must be at least 0.001 (0.1%)
Closed upper bound: Last value must be 1.0
Open upper bound: Last value must be at most 0.999 (99.9%)
```python
# Check boundaries
if not question["open_lower_bound"] and cdf[0] != 0.0:
    print(f"Error: Closed lower bound requires cdf[0] = 0.0, got {cdf[0]}")
if question["open_lower_bound"] and cdf[0] < 0.001:
    print(f"Error: Open lower bound requires cdf[0] >= 0.001, got {cdf[0]}")
if not question["open_upper_bound"] and cdf[-1] != 1.0:
    print(f"Error: Closed upper bound requires cdf[-1] = 1.0, got {cdf[-1]}")
if question["open_upper_bound"] and cdf[-1] > 0.999:
    print(f"Error: Open upper bound requires cdf[-1] <= 0.999, got {cdf[-1]}")
```
Rule 4: Length
Must have exactly inbound_outcome_count + 1 points (usually 201).
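The four rules above can be bundled into one pre-submission check. This is a sketch that assumes the default `inbound_outcome_count` of 200 (i.e. 201 CDF points):

```python
def validate_cdf(cdf: list[float], question: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the CDF passes.
    Sketch assuming the default inbound_outcome_count of 200."""
    errors = []
    if len(cdf) != 201:
        errors.append(f"Rule 4: expected 201 points, got {len(cdf)}")
    for i in range(1, len(cdf)):
        step = cdf[i] - cdf[i - 1]
        if step < 0.00005:
            errors.append(f"Rule 1: step at index {i} too small ({step:.2e})")
        if step > 0.2:
            errors.append(f"Rule 2: step at index {i} too large ({step:.4f})")
    if not question["open_lower_bound"] and cdf[0] != 0.0:
        errors.append(f"Rule 3: closed lower bound requires cdf[0] = 0.0, got {cdf[0]}")
    if question["open_lower_bound"] and cdf[0] < 0.001:
        errors.append(f"Rule 3: open lower bound requires cdf[0] >= 0.001, got {cdf[0]}")
    if not question["open_upper_bound"] and cdf[-1] != 1.0:
        errors.append(f"Rule 3: closed upper bound requires cdf[-1] = 1.0, got {cdf[-1]}")
    if question["open_upper_bound"] and cdf[-1] > 0.999:
        errors.append(f"Rule 3: open upper bound requires cdf[-1] <= 0.999, got {cdf[-1]}")
    return errors


# Example: a uniform CDF on a question with closed bounds passes all rules
question = {"open_lower_bound": False, "open_upper_bound": False}
print(validate_cdf([i / 200 for i in range(201)], question))  # []
```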
Problem: Your distribution is too concentrated (too much probability in one place).
Solution: Use standardize_cdf() which adds a uniform component to ensure minimum increase rates.
```python
# Always standardize before submitting
standardized_cdf = standardize_cdf(my_cdf, question)
```
Error: 'Step size too large'
Problem: Your distribution has too sharp a spike.
Solution: Spread out your percentiles more evenly, or use standardize_cdf() which caps maximum step size.
Start with percentiles: It’s much easier to think in terms of “I believe there’s a 50% chance the answer is below X” than to manually construct 201 probability values.
Use more percentiles for complex beliefs: If you have a bimodal or unusual distribution, specify more percentiles (10th, 20th, 30th, etc.).
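As a sketch, a bimodal belief on a hypothetical 0-1000 scale (all numbers illustrative) could be expressed with percentiles bunched around each mode:

```python
# Hypothetical bimodal belief on a 0-1000 scale: modes near 200 and 800.
# Closely spaced percentiles around each mode make the CDF steep there;
# the wide 35th-to-65th gap keeps it flat in between.
bimodal_percentiles = {
    "percentile_05": 130,
    "percentile_20": 185,
    "percentile_35": 240,
    "percentile_50": 500,
    "percentile_65": 760,
    "percentile_80": 815,
    "percentile_95": 870,
}
```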
Always standardize: The standardize_cdf() function ensures your CDF will be accepted and adds a small uniform component that actually improves forecasting performance.
Check your work: Print out key percentiles from your generated CDF to verify it matches your beliefs:
```python
# Verify your CDF: find the first index where it crosses each percentile
for percentile in [10, 25, 50, 75, 90]:
    index = next(i for i, value in enumerate(cdf) if value >= percentile / 100)
    print(f"{percentile}th percentile crossed at index {index} "
          f"({index / 200:.1%} of the way across the question's range)")
```
Test with closed bounds first: Start by practicing with questions that have closed bounds - they’re simpler to work with.