Overview
The metrics module implements proper scoring rules and statistical tests to evaluate probabilistic predictions. These metrics quantify model calibration, discrimination ability, and serial independence.
Proper Scoring Rules: Both Brier Score and Log Loss are "proper": their expected value is minimized when the forecaster reports their true beliefs, so a model cannot improve its score by shading its probabilities. This property is critical for honest calibration.
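The properness claim can be checked numerically. The sketch below is a standalone illustration, not part of the metrics module; the helper names (`expectedBrier`, `expectedLogLoss`, `bestReport`) and the true probability of 0.7 are assumptions made up for the demonstration. It scans candidate reports on a 0.01 grid and shows that both scores are lowest, in expectation, at the true belief.

```javascript
// Expected Brier penalty for reporting q when the event truly occurs with probability p.
function expectedBrier(p, q) {
  return p * (q - 1) ** 2 + (1 - p) * q ** 2
}

// Expected log loss for reporting q under the same true probability p.
function expectedLogLoss(p, q) {
  return -(p * Math.log(q) + (1 - p) * Math.log(1 - q))
}

// Scan reports q = 0.01 ... 0.99 and return the one with the lowest expected score.
function bestReport(scoreFn, p) {
  let best = { q: NaN, score: Infinity }
  for (let i = 1; i < 100; i++) {
    const q = i / 100
    const score = scoreFn(p, q)
    if (score < best.score) best = { q, score }
  }
  return best.q
}

const trueBelief = 0.7 // hypothetical true probability
console.log(bestReport(expectedBrier, trueBelief))   // 0.7
console.log(bestReport(expectedLogLoss, trueBelief)) // 0.7
```

Any report other than 0.7 raises the expected penalty under both rules, which is what "proper" means in practice.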
Brier Score
Measures: Overall prediction error (calibration + discrimination)
Range: 0 (perfect) to 1 (worst)
Baseline: 0.25 (random coin flip, always predict 50%)
Brier Score = (1/N) × Σ(p_i - o_i)²
where:
p_i = predicted probability
o_i = actual outcome (0 or 1)
N = number of predictions
Implementation
src/engine/metrics.js (lines 8-21)
/**
* Brier Score: (1/N) * sum((p_i - o_i)^2)
* Perfect = 0, Random (always 0.5) = 0.25, Worst = 1.0
* @param {Array<{predicted: number, outcome: 0|1}>} data
* @returns {number}
*/
export function brierScore(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    sum += (predicted - outcome) ** 2
  }
  return sum / data.length
}
Interpretation
| Brier Score | Quality | Interpretation |
|---|---|---|
| 0.00 - 0.10 | Excellent | Very well calibrated |
| 0.10 - 0.20 | Good | Useful signal |
| 0.20 - 0.25 | Fair | Barely better than chance |
| > 0.25 | Poor | Worse than random |
Example
import { brierScore } from './engine/metrics.js'

const predictions = [
  { predicted: 0.70, outcome: 1 }, // Correct, confident
  { predicted: 0.45, outcome: 0 }, // Correct, uncertain
  { predicted: 0.80, outcome: 0 }, // Wrong, confident (costly)
  { predicted: 0.60, outcome: 1 }, // Correct, moderate
]

const bs = brierScore(predictions)
console.log(`Brier Score: ${bs.toFixed(4)}`)
// Output: Brier Score: 0.2731
// (0.09 + 0.2025 + 0.64 + 0.16) / 4; the single confident miss pushes the score above the 0.25 baseline
Brier Skill Score (BSS)
Measures: Improvement over a baseline model
Range: -∞ to 1
Interpretation: BSS > 0 means better than baseline; BSS = 1 is perfect
BSS = 1 - (BS_model / BS_baseline)
Baseline (random 50% guess):
BS_baseline = 0.25
Implementation
src/engine/metrics.js (lines 39-49)
/**
* Brier Skill Score: 1 - (BS_model / BS_baseline)
* BSS > 0 means better than baseline. BSS = 1 is perfect.
* @param {number} bs Model's Brier Score
* @param {number} [baseline = 0.25] Baseline Brier Score (0.25 = random 50%)
* @returns {number}
*/
export function brierSkillScore(bs, baseline = 0.25) {
  if (baseline === 0) return NaN
  return 1 - (bs / baseline)
}
Example
import { brierSkillScore } from './engine/metrics.js'

const bs = 0.1525 // example Brier Score of the model being evaluated
const bss = brierSkillScore(bs, 0.25)
console.log(`BSS: ${bss.toFixed(2)}`)
// Output: BSS: 0.39 (39% improvement over random)
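The default baseline of 0.25 corresponds to always predicting 50%. When outcomes are imbalanced, a constant forecast of the observed base rate scores oBar × (1 - oBar), which is lower than 0.25 and therefore a stricter baseline. The sketch below is a standalone illustration under that assumption: the sample data are hypothetical, and the two helpers are local copies of the formulas above so the block runs without the module.

```javascript
// Local copies so the sketch runs on its own; they mirror the formulas above.
const brierScore = (data) =>
  data.reduce((s, d) => s + (d.predicted - d.outcome) ** 2, 0) / data.length
const brierSkillScore = (bs, baseline = 0.25) => (baseline === 0 ? NaN : 1 - bs / baseline)

// Hypothetical sample: 4 of 5 outcomes are 1, so the base rate is 0.8.
const sample = [
  { predicted: 0.8, outcome: 1 },
  { predicted: 0.7, outcome: 1 },
  { predicted: 0.9, outcome: 1 },
  { predicted: 0.6, outcome: 0 },
  { predicted: 0.8, outcome: 1 },
]

const baseRate = sample.reduce((s, d) => s + d.outcome, 0) / sample.length
const climatology = baseRate * (1 - baseRate) // Brier score of always predicting the base rate

const sampleBs = brierScore(sample)
console.log(brierSkillScore(sampleBs).toFixed(3))              // 0.568 (vs. coin flip)
console.log(brierSkillScore(sampleBs, climatology).toFixed(3)) // 0.325 (vs. base-rate forecast)
```

The same model looks noticeably less skilful against the base-rate baseline, so it is worth stating which baseline a reported BSS uses.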
Log Loss (Binary Cross-Entropy)
Measures: Prediction confidence penalty (heavily punishes confident mistakes)
Range: 0 (perfect) to ∞
Baseline: 0.693 (random 50% guess)
Log Loss = -(1/N) × Σ[o×log(p) + (1-o)×log(1-p)]
where:
p = predicted probability (clamped to [ε, 1-ε] to avoid log(0))
o = actual outcome (0 or 1)
ε = 1e-15 (epsilon for numerical stability)
Implementation
const EPSILON = 1e-15
src/engine/metrics.js (lines 23-37)
/**
* Log Loss (Binary Cross-Entropy): -(1/N) * sum[o*log(p) + (1-o)*log(1-p)]
* Perfect = 0, Random (always 0.5) = 0.693, Worse > 0.693
* @param {Array<{predicted: number, outcome: 0|1}>} data
* @returns {number}
*/
export function logLoss(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    // Clamp to [EPSILON, 1 - EPSILON] so Math.log never receives 0
    const p = Math.max(EPSILON, Math.min(1 - EPSILON, predicted))
    sum += outcome * Math.log(p) + (1 - outcome) * Math.log(1 - p)
  }
  return -sum / data.length
}
Interpretation
| Log Loss | Quality | Interpretation |
|---|---|---|
| 0.00 - 0.30 | Excellent | Very confident and accurate |
| 0.30 - 0.60 | Good | Solid predictions |
| 0.60 - 0.693 | Fair | Barely better than random |
| > 0.693 | Poor | Worse than random |
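When to use Log Loss vs Brier?
Log Loss: Use when confident mistakes are very costly (e.g., risk management). The penalty grows without bound as a wrong prediction approaches 0 or 1.
Brier Score: Use when the penalty for any single miss should stay bounded. Each prediction contributes at most 1 to the sum.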
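To make the difference concrete, the sketch below compares the per-prediction penalty of the two rules for a prediction that turns out wrong (outcome = 0). It is a standalone illustration; the helper names `brierPenalty` and `logPenalty` are made up for this example and are not part of the module.

```javascript
// Per-prediction penalties when the event does NOT occur (outcome = 0).
const brierPenalty = (p) => p ** 2
const logPenalty = (p) => -Math.log(1 - p)

for (const p of [0.6, 0.9, 0.99]) {
  console.log(`p=${p} Brier=${brierPenalty(p).toFixed(3)} LogLoss=${logPenalty(p).toFixed(3)}`)
}
// p=0.6 Brier=0.360 LogLoss=0.916
// p=0.9 Brier=0.810 LogLoss=2.303
// p=0.99 Brier=0.980 LogLoss=4.605
```

Moving from 90% to 99% confidence on a miss barely changes the Brier penalty but doubles the log loss penalty.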
Example
import { logLoss } from './engine/metrics.js'

const predictions = [
  { predicted: 0.90, outcome: 1 }, // Very confident, correct
  { predicted: 0.90, outcome: 0 }, // Very confident, WRONG (heavy penalty)
  { predicted: 0.55, outcome: 1 }, // Weak signal, correct
]

const ll = logLoss(predictions)
console.log(`Log Loss: ${ll.toFixed(4)}`)
// Output: Log Loss: 1.0019 (worse than random due to the confident mistake)
Murphy Decomposition
Measures: Breaks Brier Score into three interpretable components:
Formula: BS = Reliability - Resolution + Uncertainty
Components
| Component | Meaning | Goal |
|---|---|---|
| Reliability | How well probabilities match observed frequencies | Minimize (0 = perfect) |
| Resolution | Ability to discriminate between outcomes | Maximize (higher = better) |
| Uncertainty | Inherent randomness in outcomes | Constant (oBar × (1-oBar)) |
Algorithm
1. Bin predictions into K equal-width bins [0, 1/K), [1/K, 2/K), …, [(K-1)/K, 1]
2. For each bin k containing n_k predictions, compute:
   Average predicted probability: p̄_k = (1/n_k) × Σp_i
   Average actual outcome: ō_k = (1/n_k) × Σo_i
3. Compute the components, where oBar is the overall base rate (the mean outcome across all N predictions):
   Reliability = Σ(n_k/N) × (p̄_k - ō_k)²
   Resolution = Σ(n_k/N) × (ō_k - oBar)²
   Uncertainty = oBar × (1 - oBar)
Implementation
src/engine/metrics.js (lines 51-94)
/**
* Murphy (1973) decomposition: BS = Reliability - Resolution + Uncertainty
*
* Bins predictions into equal-width bins [0, 1/K), [1/K, 2/K), ..., [(K-1)/K, 1]
* and computes the three components.
*
* @param {Array<{predicted: number, outcome: 0|1}>} data
* @param {number} [numBins = 10]
* @returns {{ reliability: number, resolution: number, uncertainty: number }}
*/
export function murphyDecomposition(data, numBins = 10) {
  if (data.length === 0) return { reliability: NaN, resolution: NaN, uncertainty: NaN }
  const N = data.length

  // Overall base rate
  const oBar = data.reduce((s, d) => s + d.outcome, 0) / N
  const uncertainty = oBar * (1 - oBar)

  // Bin the data
  const bins = Array.from({ length: numBins }, () => ({ sumP: 0, sumO: 0, count: 0 }))
  for (const { predicted, outcome } of data) {
    let binIdx = Math.floor(predicted * numBins)
    if (binIdx >= numBins) binIdx = numBins - 1 // predicted = 1 goes into the last bin
    if (binIdx < 0) binIdx = 0
    bins[binIdx].sumP += predicted
    bins[binIdx].sumO += outcome
    bins[binIdx].count += 1
  }

  let reliability = 0
  let resolution = 0
  for (const bin of bins) {
    if (bin.count === 0) continue
    const avgP = bin.sumP / bin.count
    const avgO = bin.sumO / bin.count
    reliability += (bin.count / N) * (avgP - avgO) ** 2
    resolution += (bin.count / N) * (avgO - oBar) ** 2
  }
  return { reliability, resolution, uncertainty }
}
Example
import { murphyDecomposition } from './engine/metrics.js'

// predictions: an array of { predicted, outcome } objects, as in the examples above
const { reliability, resolution, uncertainty } = murphyDecomposition(predictions)
console.log(`Reliability: ${reliability.toFixed(4)} (lower is better)`)
console.log(`Resolution: ${resolution.toFixed(4)} (higher is better)`)
console.log(`Uncertainty: ${uncertainty.toFixed(4)} (constant)`)
// Example output:
// Reliability: 0.0123 (well calibrated)
// Resolution: 0.0845 (strong discrimination)
// Uncertainty: 0.2499 (base rate ≈ 50%)

// Check: BS ≈ Reliability - Resolution + Uncertainty
// (exact only when every prediction inside a bin has the same value)
const bs = reliability - resolution + uncertainty
console.log(`Brier Score: ${bs.toFixed(4)}`)
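How closely the three components add back up to the Brier Score depends on the binning. The sketch below is a standalone check with hypothetical data: `murphy` is a compact re-implementation of the binned algorithm described above (not the module's own function), and `gap` is a helper invented for this example. It compares a dataset with one forecast value per bin against one where two different forecasts share a bin.

```javascript
function brierScore(data) {
  return data.reduce((s, d) => s + (d.predicted - d.outcome) ** 2, 0) / data.length
}

// Compact re-implementation of the binned decomposition described above.
function murphy(data, numBins = 10) {
  const N = data.length
  const oBar = data.reduce((s, d) => s + d.outcome, 0) / N
  const bins = Array.from({ length: numBins }, () => ({ sumP: 0, sumO: 0, count: 0 }))
  for (const { predicted, outcome } of data) {
    const idx = Math.min(numBins - 1, Math.max(0, Math.floor(predicted * numBins)))
    bins[idx].sumP += predicted
    bins[idx].sumO += outcome
    bins[idx].count += 1
  }
  let reliability = 0
  let resolution = 0
  for (const b of bins) {
    if (b.count === 0) continue
    reliability += (b.count / N) * (b.sumP / b.count - b.sumO / b.count) ** 2
    resolution += (b.count / N) * (b.sumO / b.count - oBar) ** 2
  }
  return { reliability, resolution, uncertainty: oBar * (1 - oBar) }
}

// Absolute difference between the Brier Score and the recombined components.
const gap = (data) => {
  const { reliability, resolution, uncertainty } = murphy(data)
  return Math.abs(brierScore(data) - (reliability - resolution + uncertainty))
}

// Hypothetical data: one forecast value per bin, so the identity is exact.
const exact = [
  { predicted: 0.25, outcome: 0 }, { predicted: 0.25, outcome: 1 },
  { predicted: 0.75, outcome: 1 }, { predicted: 0.75, outcome: 1 },
]
// Hypothetical data: 0.71 and 0.79 share the [0.7, 0.8) bin.
const mixed = [
  { predicted: 0.71, outcome: 0 }, { predicted: 0.79, outcome: 1 },
  { predicted: 0.25, outcome: 0 }, { predicted: 0.25, outcome: 1 },
]

console.log(gap(exact)) // 0 up to floating-point noise
console.log(gap(mixed)) // clearly non-zero: within-bin spread is not captured
```

With continuous model probabilities the second case is the norm, so a small mismatch between `brierScore` and the recombined components is expected rather than a bug.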
Runs Test (Wald-Wolfowitz)
Measures: Serial independence in a binary sequence
Purpose: Detect patterns/streaks that violate randomness assumption
Concept
A “run” is a maximal sequence of consecutive identical values:
Sequence: 1 1 1 0 0 1 0 1 1
Runs: [1 1 1] [0 0] [1] [0] [1 1]
Count: 5 runs
Under independence: The number of runs follows an approximately normal distribution with known mean and variance.
Expected runs: μ = (2×n₁×n₀)/n + 1
Variance: σ² = (2×n₁×n₀×(2×n₁×n₀ - n)) / (n²×(n-1))
Z-score: z = (R - μ) / σ
where:
R = observed number of runs
n₁ = count of 1's
n₀ = count of 0's
n = total length
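As a worked example, the formulas can be applied to the nine-element sequence from the Concept section. The sketch below is standalone and only restates the arithmetic above; the variable names are chosen for this illustration.

```javascript
const sequence = [1, 1, 1, 0, 0, 1, 0, 1, 1]

// A new run starts wherever a value differs from its predecessor.
let runs = 1
for (let i = 1; i < sequence.length; i++) {
  if (sequence[i] !== sequence[i - 1]) runs++
}

const n = sequence.length
const n1 = sequence.filter((v) => v === 1).length
const n0 = n - n1

const mu = (2 * n1 * n0) / n + 1
const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1))
const z = (runs - mu) / Math.sqrt(variance)

console.log({ runs, n1, n0, mu, variance, z })
// { runs: 5, n1: 6, n0: 3, mu: 5, variance: 1.5, z: 0 }
```

The observed 5 runs match the expected 5 exactly, so z = 0 and this sequence shows no evidence of clustering or oscillation.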
Implementation
src/engine/metrics.js (lines 96-133)
/**
* Wald-Wolfowitz runs test for serial independence in a binary sequence.
*
* A "run" is a maximal sequence of consecutive identical values.
* Under independence, the number of runs follows an approximately normal
* distribution with known mean and variance.
*
* @param {Array<0|1>} outcomes Binary sequence
* @returns {{ runs: number, expected: number, zScore: number, pValue: number }}
*/
export function runsTest(outcomes) {
  if (outcomes.length < 2) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }
  const n = outcomes.length
  const n1 = outcomes.filter(o => o === 1).length
  const n0 = n - n1
  // The test is undefined when the sequence contains only one kind of value
  if (n1 === 0 || n0 === 0) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }

  // Count runs
  let runs = 1
  for (let i = 1; i < n; i++) {
    if (outcomes[i] !== outcomes[i - 1]) runs++
  }

  // Expected runs and variance under independence
  const expected = (2 * n1 * n0) / n + 1
  const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1))
  if (variance <= 0) return { runs, expected, zScore: NaN, pValue: NaN }
  const zScore = (runs - expected) / Math.sqrt(variance)

  // Two-tailed p-value from standard normal (using the error function approximation)
  const pValue = 2 * (1 - normalCDF(Math.abs(zScore)))
  return { runs, expected: +expected.toFixed(4), zScore: +zScore.toFixed(4), pValue: +pValue.toFixed(4) }
}
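`runsTest` calls a `normalCDF` helper that is not shown on this page. The sketch below is one plausible way to write it, assuming the "error function approximation" mentioned in the code comment is the common Abramowitz and Stegun formula 7.1.26; the module's actual helper may differ in form and precision.

```javascript
// Error function via the Abramowitz & Stegun 7.1.26 polynomial (max abs error about 1.5e-7).
// Assumption: the real module may use a different approximation.
function erf(x) {
  const sign = x < 0 ? -1 : 1
  const ax = Math.abs(x)
  const t = 1 / (1 + 0.3275911 * ax)
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t
  return sign * (1 - poly * Math.exp(-ax * ax))
}

// Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
function normalCDF(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2))
}

console.log(normalCDF(0).toFixed(4))    // 0.5000
console.log(normalCDF(1.96).toFixed(4)) // 0.9750
```

Because the approximation error is far below the four decimals that `runsTest` keeps, the choice of approximation should rarely change the reported p-value.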
Interpretation
| Z-Score | P-Value | Interpretation |
|---|---|---|
| -2 to +2 | > 0.05 | Pass: Sequence appears random |
| < -2 | < 0.05 | Too few runs (clustering/streaks) |
| > +2 | < 0.05 | Too many runs (oscillation) |
Example
import { runsTest } from './engine/metrics.js'

// Random-looking sequence (should pass)
const random = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
const result1 = runsTest(random)
console.log(`Runs: ${result1.runs}, Expected: ${result1.expected}, Z: ${result1.zScore}, p: ${result1.pValue}`)
// Output: Runs: 8, Expected: 6, Z: 1.3416, p ≈ 0.18 (PASS: appears random)

// Streaky sequence (should fail)
const streaky = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
const result2 = runsTest(streaky)
console.log(`Runs: ${result2.runs}, Expected: ${result2.expected}, Z: ${result2.zScore}, p: ${result2.pValue}`)
// Output: Runs: 2, Expected: 6, Z: -2.6833, p ≈ 0.007 (FAIL: too few runs, clustering detected)
Cold Streak Detection: If the runs test returns Z < -2, the sequence is clustering into streaks (for example, several consecutive misses) instead of behaving independently. This is a red flag for risk management.
Band Analysis
Classifies predictions into 5 confidence bands and computes per-band accuracy, Brier score, and mean probability.
Band Definitions
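Here d = abs(p - 0.5) is the confidence distance. Labels are the Spanish strings returned by the code, with English glosses in parentheses.

| Band | Label | Range | Confidence Distance |
|---|---|---|---|
| 1 | Ruido (noise) | 45-55% | d < 0.05 |
| 2 | Senal debil (weak signal) | 55-65% | 0.05 ≤ d < 0.15 |
| 3 | Senal moderada (moderate signal) | 65-75% | 0.15 ≤ d < 0.25 |
| 4 | Senal fuerte (strong signal) | 75-85% | 0.25 ≤ d < 0.35 |
| 5 | Senal muy fuerte (very strong signal) | 85%+ | d ≥ 0.35 |

Ranges are stated for the upper side only. Because bands depend on the distance from 0.5, a probability of 35-45% also falls in band 2, and so on.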
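The band lookup amounts to comparing the distance from 0.5 against four edges. The sketch below is a standalone illustration; `BAND_EDGES` and `bandFor` are hypothetical names that do not exist in the module, which performs the same comparison inline with `min` and `max` fields.

```javascript
// Band edges as distances from 0.5, mirroring the band definitions above.
const BAND_EDGES = [0.05, 0.15, 0.25, 0.35]

// Returns the band number (1-5) for a raw probability in [0, 1].
function bandFor(probability) {
  const confidence = Math.abs(probability - 0.5)
  const idx = BAND_EDGES.findIndex((edge) => confidence < edge)
  return idx === -1 ? 5 : idx + 1
}

console.log(bandFor(0.52)) // 1
console.log(bandFor(0.62)) // 2
console.log(bandFor(0.30)) // 3 (a 30% probability is as far from 0.5 as a 70% one)
console.log(bandFor(0.92)) // 5
```

Values that sit exactly on an edge, such as 0.55, are sensitive to floating-point rounding of the subtraction, so edge cases may land on either side.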
Implementation
export function bandAnalysis(records) {
  const bands = [
    { band: 1, label: 'Ruido', range: '45-55%', min: 0.00, max: 0.05, items: [] },
    { band: 2, label: 'Senal debil', range: '55-65%', min: 0.05, max: 0.15, items: [] },
    { band: 3, label: 'Senal moderada', range: '65-75%', min: 0.15, max: 0.25, items: [] },
    { band: 4, label: 'Senal fuerte', range: '75-85%', min: 0.25, max: 0.35, items: [] },
    { band: 5, label: 'Senal muy fuerte', range: '85%+', min: 0.35, max: Infinity, items: [] },
  ]

  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue
    const confidence = Math.abs(ep.probability - 0.5)
    const correct = record.earlyPredictionCorrect
    if (correct == null) continue

    // Find the right band
    for (const b of bands) {
      if (confidence >= b.min && confidence < b.max) {
        b.items.push({ confidence, correct, record })
        break
      }
    }
  }

  // Build scoring data per band for partial Brier calculation
  return bands.map(b => {
    const count = b.items.length
    if (count === 0) {
      return {
        band: b.band, label: b.label, range: b.range,
        count: 0, accuracy: '--', brier: '--', meanProb: '--'
      }
    }

    const correctCount = b.items.filter(i => i.correct).length
    const accuracy = ((correctCount / count) * 100).toFixed(1)

    // Compute partial Brier for this band
    const scoringItems = []
    for (const item of b.items) {
      const ep = item.record.earlyPrediction
      if (ep.direction === 'UP') {
        scoringItems.push({ predicted: ep.probability, outcome: item.record.result === 'UP' ? 1 : 0 })
      } else if (ep.direction === 'DOWN') {
        scoringItems.push({ predicted: 1 - ep.probability, outcome: item.record.result === 'DOWN' ? 1 : 0 })
      }
    }
    const brier = scoringItems.length > 0 ? brierScore(scoringItems).toFixed(4) : '--'

    // Mean effective confidence
    const meanConf = b.items.reduce((s, i) => s + i.confidence, 0) / count
    const meanProb = (0.5 + meanConf).toFixed(2)

    return {
      band: b.band, label: b.label, range: b.range,
      count, accuracy, brier, meanProb
    }
  })
}
src/engine/metrics.js (lines 192-270)
/**
* 5-band confidence analysis.
*
* Classifies early predictions into 5 bands based on confidence distance
* from 0.50, then computes per-band count, accuracy, partial Brier, and
* mean confidence.
*
* Band boundaries (mapped from raw probability distance from 0.5):
* Band 1: 45-55% (Ruido) — |p-0.5| < 0.05 → effective 0.50-0.55
* Band 2: 55-65% (Senal debil) — |p-0.5| 0.05-0.15 → effective 0.55-0.65
* Band 3: 65-75% (Senal moderada) — |p-0.5| 0.15-0.25 → effective 0.65-0.75
* Band 4: 75-85% (Senal fuerte) — |p-0.5| 0.25-0.35 → effective 0.75-0.85
* Band 5: 85%+ (Senal muy fuerte) — |p-0.5| >= 0.35 → effective 0.85+
*
* @param {Array<Object>} records IntervalRecord objects
* @returns {Array<{band: number, label: string, range: string, count: number, accuracy: string, brier: string, meanProb: string}>}
*/
export function bandAnalysis(records) {
  // [implementation shown above]
}
Example Output
import { bandAnalysis } from './engine/metrics.js'
import { HistoryStore } from './tracker/history.js'

const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()
const bands = bandAnalysis(records)
console.table(bands)
| Band | Label | Range | Count | Accuracy | Brier | Mean Prob |
|---|---|---|---|---|---|---|
| 1 | Ruido | 45-55% | 23 | 52.2% | 0.2489 | 0.52 |
| 2 | Senal debil | 55-65% | 45 | 58.9% | 0.2301 | 0.60 |
| 3 | Senal moderada | 65-75% | 38 | 68.4% | 0.1876 | 0.70 |
| 4 | Senal fuerte | 75-85% | 12 | 75.0% | 0.1123 | 0.80 |
| 5 | Senal muy fuerte | 85%+ | 3 | 100.0% | 0.0289 | 0.91 |
Calibration Check: If accuracy closely matches mean probability in each band (for example, 68.4% accuracy against a mean probability of 0.70), the model is well-calibrated. Large discrepancies indicate miscalibration.
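The calibration check can be automated by comparing the two columns row by row. The sketch below is a standalone illustration: the rows reuse the example values above in the string format that `bandAnalysis` returns, while `calibrationGaps` and the 0.08 threshold are hypothetical choices made for this example, not part of the module.

```javascript
// Rows shaped like bandAnalysis() output; values copied from the example output above.
const exampleBands = [
  { band: 1, count: 23, accuracy: '52.2', meanProb: '0.52' },
  { band: 2, count: 45, accuracy: '58.9', meanProb: '0.60' },
  { band: 3, count: 38, accuracy: '68.4', meanProb: '0.70' },
  { band: 4, count: 12, accuracy: '75.0', meanProb: '0.80' },
  { band: 5, count: 3, accuracy: '100.0', meanProb: '0.91' },
]

// Hypothetical threshold: flag bands whose accuracy is more than 8 points off.
const MAX_GAP = 0.08

function calibrationGaps(rows, maxGap = MAX_GAP) {
  return rows
    .filter((r) => r.count > 0) // empty bands carry '--' placeholders
    .map((r) => {
      const gap = Number(r.accuracy) / 100 - Number(r.meanProb)
      return { band: r.band, count: r.count, gap: +gap.toFixed(3), flagged: Math.abs(gap) > maxGap }
    })
}

console.table(calibrationGaps(exampleBands))
```

Only band 5 is flagged here, and with just 3 predictions that gap says little. Reading the flag together with `count` avoids over-interpreting sparse bands.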
Data Conversion
Convert IntervalRecord objects into scoring data format:
src/engine/metrics.js (lines 155-190)
/**
* Convert closed IntervalRecord objects into scoring data format.
*
* Uses earlyPrediction.probability as the predicted value.
* If direction='UP': predicted = probability, outcome = 1 when result='UP'.
* If direction='DOWN': predicted = 1 - probability, outcome = 1 when result='DOWN'.
* Skips records where earlyPrediction is null or has abstained flag.
*
* @param {Array<Object>} records IntervalRecord objects from history.json
* @returns {Array<{predicted: number, outcome: 0|1}>}
*/
export function intervalsToScoringData(records) {
  const data = []
  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue
    const direction = ep.direction
    const probability = ep.probability
    if (direction === 'UP') {
      data.push({
        predicted: probability,
        outcome: record.result === 'UP' ? 1 : 0
      })
    } else if (direction === 'DOWN') {
      data.push({
        predicted: 1 - probability,
        outcome: record.result === 'DOWN' ? 1 : 0
      })
    }
  }
  return data
}
Full Analysis Pipeline
import { HistoryStore } from './tracker/history.js'
import {
  intervalsToScoringData,
  brierScore,
  brierSkillScore,
  logLoss,
  murphyDecomposition,
  runsTest,
  bandAnalysis
} from './engine/metrics.js'

// Load interval history
const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()

// Convert to scoring format
const data = intervalsToScoringData(records)

// Compute all metrics
const bs = brierScore(data)
const bss = brierSkillScore(bs)
const ll = logLoss(data)
const murphy = murphyDecomposition(data)

// Extract outcomes for runs test
const outcomes = data.map(d => d.outcome)
const runs = runsTest(outcomes)

// Band analysis
const bands = bandAnalysis(records)

console.log('=== OVERALL METRICS ===')
console.log(`Brier Score: ${bs.toFixed(4)}`)
console.log(`Brier Skill Score: ${bss.toFixed(2)}`)
console.log(`Log Loss: ${ll.toFixed(4)}`)
console.log()
console.log('=== MURPHY DECOMPOSITION ===')
console.log(`Reliability: ${murphy.reliability.toFixed(4)}`)
console.log(`Resolution: ${murphy.resolution.toFixed(4)}`)
console.log(`Uncertainty: ${murphy.uncertainty.toFixed(4)}`)
console.log()
console.log('=== RUNS TEST ===')
console.log(`Observed: ${runs.runs} runs`)
console.log(`Expected: ${runs.expected} runs`)
console.log(`Z-Score: ${runs.zScore}`)
console.log(`P-Value: ${runs.pValue} ${runs.pValue < 0.05 ? '(FAIL: not random)' : '(PASS: appears random)'}`)
console.log()
console.log('=== BAND ANALYSIS ===')
console.table(bands)