The stats module provides statistical functions for automated valuation modeling: ratio study statistics (COD, PRD, PRB), bootstrap confidence intervals, outlier detection, and variable selection utilities.
Ratio Study Statistics
calc_cod()
Calculate the Coefficient of Dispersion (COD) for an array of values.
Array of numeric values
The COD percentage. Returns nan if the array is empty, 0.0 if all values are zero, and inf if the median is zero but not all values are zero.
calc_ratio_stats_bootstrap()
Calculate ratio study statistics (median ratio, mean ratio, COD, PRD) with bootstrap percentile confidence intervals, following IAAO definitions.
Array of predicted values
Array of corresponding ground truth (e.g., sale price) values
The confidence level of the interval (e.g., 0.95 for a 95% confidence interval)
The number of bootstrap iterations to perform
Random seed for reproducibility
Dictionary containing:
- median_ratio: ConfidenceStat object with point estimate and confidence bounds
- mean_ratio: ConfidenceStat object
- cod: ConfidenceStat object (COD = 100 * mean(|r_i - median(r)|) / median(r))
- prd: ConfidenceStat object (PRD = mean(r) / weighted_mean(r))
Returns None if no valid observations remain after filtering.
calc_prd()
Calculate the Price Related Differential (PRD).
Array of predicted values
Array of ground truth values
The PRD value, computed as the ratio of the mean ratio to the weighted mean ratio
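The COD and PRD formulas quoted for calc_ratio_stats_bootstrap() can be sketched in NumPy as follows. This is an illustration, not the module's implementation: the function names `cod` and `prd` are hypothetical, the edge-case handling mirrors the behaviour described for calc_cod(), and the weighted mean ratio is assumed to be the sum of predictions divided by the sum of ground truth values.

```python
import numpy as np


def cod(ratios):
    """COD = 100 * mean(|r_i - median(r)|) / median(r)."""
    r = np.asarray(ratios, dtype=float)
    if r.size == 0:
        return float("nan")   # empty input
    if np.all(r == 0):
        return 0.0            # every value is zero
    med = np.median(r)
    if med == 0:
        return float("inf")   # zero median, but not all values are zero
    return float(100.0 * np.mean(np.abs(r - med)) / med)


def prd(predictions, ground_truth):
    """PRD = mean(r) / weighted_mean(r), with r = prediction / ground truth."""
    p = np.asarray(predictions, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    mean_ratio = np.mean(p / g)
    # Assumed weighting: each ratio is weighted by its ground truth value,
    # which reduces to sum(predictions) / sum(ground truth).
    weighted_mean_ratio = np.sum(p) / np.sum(g)
    return float(mean_ratio / weighted_mean_ratio)
```

A PRD above 1 under this definition means low-value properties carry higher ratios than high-value ones; a PRD below 1 means the opposite.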
calc_prb()
Calculate the Price Related Bias (PRB) metric using a regression-based approach.
Array of predicted values
Array of ground truth values
Desired confidence level for the interval
Tuple containing:
- PRB value
- Lower bound of the confidence interval
- Upper bound of the confidence interval
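The text does not spell out the regression, so the sketch below assumes the formulation commonly attributed to the IAAO ratio study standard: regress each ratio's percentage deviation from the median ratio on the base-2 logarithm of a value proxy. The function name `prb_sketch`, its parameter names, and the normal approximation used for the interval are all assumptions made for illustration; the module may differ on each point.

```python
import numpy as np
from statistics import NormalDist


def prb_sketch(predictions, ground_truth, confidence_interval=0.95):
    p = np.asarray(predictions, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    ratios = p / g
    med = np.median(ratios)

    # Dependent variable: percentage deviation of each ratio from the median.
    y = (ratios - med) / med
    # Independent variable: log2 of a value proxy that averages the ground
    # truth and the median-adjusted prediction, so the slope reads as
    # "change in ratio per doubling of value".
    x = np.log(0.5 * g + 0.5 * p / med) / np.log(2.0)

    # Ordinary least squares for y = a + b * x; the slope b is the PRB.
    x_c = x - x.mean()
    sxx = np.sum(x_c ** 2)
    slope = np.sum(x_c * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()

    # Standard error of the slope from the residual variance.
    resid = y - (intercept + slope * x)
    se = np.sqrt(np.sum(resid ** 2) / (len(x) - 2) / sxx)

    # Normal approximation to the interval (assumption; a t quantile is
    # more accurate for small samples).
    z = NormalDist().inv_cdf(0.5 + confidence_interval / 2.0)
    return float(slope), float(slope - z * se), float(slope + z * se)
```

A negative slope indicates regressivity (ratios fall as value rises), and a positive slope indicates progressivity.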
Bootstrap Functions
calc_cod_bootstrap()
Calculate COD using bootstrapping to generate confidence intervals.
Array of numeric values
The desired confidence level
Number of bootstrap iterations
Random seed for reproducibility
Tuple containing the median COD, lower bound, and upper bound of the confidence interval
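A percentile bootstrap of the COD can be sketched as below: resample the values with replacement, recompute the COD for each resample, and read the interval off the percentiles of the resulting distribution. The function name `cod_bootstrap_sketch`, the parameter names, and the default values are hypothetical, and the sketch assumes every resample has a positive median.

```python
import numpy as np


def cod_bootstrap_sketch(values, confidence_interval=0.95, iterations=1000, seed=777):
    v = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)  # seeded for reproducibility

    # Each row is one bootstrap resample of the same size as the input.
    samples = rng.choice(v, size=(iterations, v.size), replace=True)

    # COD of every resample: 100 * mean(|x - median|) / median.
    medians = np.median(samples, axis=1, keepdims=True)
    cods = 100.0 * np.mean(np.abs(samples - medians), axis=1) / medians[:, 0]

    # Percentile interval: cut alpha / 2 from each tail.
    alpha = 1.0 - confidence_interval
    low, high = np.percentile(cods, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.median(cods)), float(low), float(high)
```

Calling the function twice with the same seed returns the same tuple, which is the purpose of the seed argument.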
calc_prd_bootstrap()
Calculate PRD with bootstrapping.
Array of predicted values
Array of ground truth values
The desired confidence level
Number of bootstrap iterations
Random seed for reproducibility
Tuple containing median PRD, the lower bound, and upper bound of the confidence interval
Outlier Detection
trim_outliers()
Trim outliers using IQR fences per IAAO guidance, with a maximum trim cap.
1D numeric array with no NaNs allowed
Maximum fraction to remove (e.g., 0.10 = 10%)
IQR multiplier for the fences: 1.5 for standard outliers, 3.0 for extreme outliers
Trimmed array according to IQR rules or symmetric quantile cut if IQR-based trimming exceeds the cap
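The fence-then-cap logic described here can be sketched with a boolean mask. The function and parameter names (`trim_mask_sketch`, `max_percent`, `iqr_multiplier`) are hypothetical, and splitting the cap evenly between the two tails is an assumption about what "symmetric quantile cut" means.

```python
import numpy as np


def trim_mask_sketch(values, max_percent=0.10, iqr_multiplier=1.5):
    v = np.asarray(values, dtype=float)

    # IQR fences: keep values within [Q1 - k * IQR, Q3 + k * IQR].
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    keep = (v >= q1 - iqr_multiplier * iqr) & (v <= q3 + iqr_multiplier * iqr)

    # If the fences would remove more than the cap, fall back to a symmetric
    # quantile cut that takes max_percent / 2 from each tail.
    if (~keep).mean() > max_percent:
        low, high = np.quantile(v, [max_percent / 2, 1 - max_percent / 2])
        keep = (v >= low) & (v <= high)
    return keep
```

Indexing the input with the mask (`v[mask]`) yields the trimmed array, so a single routine can serve both the array-returning and the mask-returning variant.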
trim_outliers_mask()
Same as trim_outliers(), but returns a boolean mask instead of the trimmed values.
1D numeric array with no NaNs allowed
Maximum fraction to remove
IQR multiplier for fence calculation
Boolean array where True indicates values within the quantile bounds
Variable Selection
calc_correlations()
Calculate correlations and iteratively drop variables with low combined scores.
Input DataFrame containing the variables to evaluate
Minimum acceptable combined score for variables. Variables with a score below this value will be dropped.
If True, plot the initial and final correlation heatmaps
Dictionary with keys:
- initial: pandas.Series of combined scores from the first iteration
- final: pandas.Series of combined scores after dropping low-scoring variables
- bad_vars: list of variables that should be dropped
calc_vif()
Calculate the Variance Inflation Factor (VIF) for each variable in a DataFrame.
Input features DataFrame
DataFrame with columns:
- variable: Name of each feature in X
- vif: Variance Inflation Factor value for that feature
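For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. A minimal NumPy sketch follows; the function name `vif_sketch` is hypothetical, and it takes a plain 2-D array and returns a list rather than the DataFrame described above.

```python
import numpy as np


def vif_sketch(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        # A perfect fit means the column is an exact linear combination.
        vifs.append(float("inf") if r2 >= 1.0 else float(1.0 / (1.0 - r2)))
    return vifs
```

A VIF near 1 means the feature is nearly uncorrelated with the rest; large values signal multicollinearity.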
calc_vif_recursive_drop()
Recursively drop variables with a Variance Inflation Factor (VIF) exceeding the threshold.
Input features DataFrame
Maximum acceptable VIF. Variables with VIF above this threshold will be removed.
Settings dictionary containing field classifications, if needed for VIF computation
Dictionary with keys:
- initial: pandas.DataFrame of VIF values before dropping variables
- final: pandas.DataFrame of VIF values after recursively dropping high-VIF variables
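One way to picture the drop procedure is a loop that removes the single worst variable, recomputes every VIF, and repeats until all remaining values are at or below the threshold. The sketch below is written under assumptions: the names `vif_from_corr` and `drop_high_vif` are hypothetical, dropping the highest-VIF variable first is a guess at the module's rule, and the settings dictionary is ignored. It uses the identity that VIFs are the diagonal of the inverse correlation matrix, which fails on exactly collinear columns because that matrix is then singular.

```python
import numpy as np


def vif_from_corr(X):
    # VIFs equal the diagonal of the inverse of the correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def drop_high_vif(X, names, threshold=10.0):
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = vif_from_corr(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break  # every remaining variable is acceptable
        # Remove only the worst offender, then recompute all VIFs, since
        # dropping one variable changes the VIFs of the others.
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names
```

Dropping one variable at a time matters: two nearly identical features both show a huge VIF, but removing either one usually brings the other back below the threshold.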
Model Performance
calc_mse()
Calculate the Mean Squared Error (MSE) between predictions and ground truth.
Array of predicted values
Array of true values
The MSE value
calc_mse_r2_adj_r2()
Calculate the Mean Squared Error (MSE), R-squared, and adjusted R-squared.
Array of predicted values
Array of true values
Number of independent variables used to produce the predictions
Tuple containing:
- The MSE value
- The R-squared value
- The adjusted R-squared value
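The three quantities are related through the residual and total sums of squares, as the sketch below shows. The function and parameter names are hypothetical, and the adjusted R-squared uses the textbook form 1 - (1 - R²)(n - 1) / (n - k - 1), with n observations and k independent variables; this is assumed to match the module.

```python
import numpy as np


def mse_r2_adj_r2_sketch(predictions, ground_truth, num_vars):
    p = np.asarray(predictions, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    n = g.size

    ss_res = np.sum((g - p) ** 2)          # residual sum of squares
    ss_tot = np.sum((g - g.mean()) ** 2)   # total sum of squares

    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    # Penalise R-squared for the number of independent variables used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_vars - 1)
    return float(mse), float(r2), float(adj_r2)
```

The adjusted value is undefined when n equals num_vars + 1, so a caller would need to guard against very small samples.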
calc_cross_validation_score()
Calculate cross-validation score using negative mean squared error.
Input features for modeling
Target variable
The mean cross-validated mean squared error (positive value)
Helper Classes
ConfidenceStat
A class representing any statistic along with its confidence interval bounds.
The base value of the statistic
The confidence level as a fraction (e.g., 0.95 for a 95% confidence interval)
The lower bound of the confidence interval
The upper bound of the confidence interval