The stats module provides comprehensive statistical functions for automated valuation modeling, including ratio study statistics (COD, PRD, PRB), bootstrap confidence intervals, outlier detection, and variable selection utilities.

Ratio Study Statistics

calc_cod()

Calculate the Coefficient of Dispersion (COD) for an array of values.
from openavmkit.utilities.stats import calc_cod
import numpy as np

values = np.array([100, 105, 95, 110, 90])
cod = calc_cod(values)
values
numpy.ndarray
required
Array of numeric values
cod
float
The COD percentage. Returns nan if array is empty, 0.0 if all values are zero, and inf if median is zero but not all values are zero.
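The COD measures appraisal uniformity as the average absolute deviation of values from their median, expressed as a percentage of the median. A minimal numpy sketch of that definition (not the library's implementation, which also handles the empty-array and zero-median edge cases noted above):

```python
import numpy as np

def cod_sketch(values):
    # COD = 100 * mean(|v - median(v)|) / median(v), per the IAAO definition
    med = np.median(values)
    return 100.0 * np.mean(np.abs(values - med)) / med

values = np.array([100, 105, 95, 110, 90])
print(cod_sketch(values))  # 6.0
```

For the sample array above, the median is 100 and the mean absolute deviation is 6, giving a COD of 6.0%.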

calc_ratio_stats_bootstrap()

Calculate ratio study statistics (Median ratio, Mean ratio, COD, PRD) with bootstrap percentile confidence intervals, following IAAO definitions.
from openavmkit.utilities.stats import calc_ratio_stats_bootstrap
import numpy as np

predictions = np.array([95000, 102000, 98000, 105000])
ground_truth = np.array([100000, 100000, 100000, 100000])

stats = calc_ratio_stats_bootstrap(
    predictions, 
    ground_truth,
    confidence_interval=0.95,
    iterations=10000,
    seed=777
)
predictions
numpy.ndarray
required
Array of predicted values
ground_truth
numpy.ndarray
required
Array of corresponding ground truth (e.g., sale price) values
confidence_interval
float
default:"0.95"
The size of the confidence interval (e.g., 0.95 = 95% confidence)
iterations
int
default:"10000"
The number of bootstrap iterations to perform
seed
int
default:"777"
Random seed for reproducibility
return
dict
Dictionary containing:
  • median_ratio: ConfidenceStat object with point estimate and confidence bounds
  • mean_ratio: ConfidenceStat object
  • cod: ConfidenceStat object (COD = 100 * mean(|ri - median(r)|) / median(r))
  • prd: ConfidenceStat object (PRD = mean(r) / weighted_mean(r))
Returns None if no valid observations remain after filtering.
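The confidence bounds come from a percentile bootstrap: resample the ratios with replacement, recompute the statistic each time, and take the tail percentiles. An illustrative sketch for the median ratio (the library applies the same idea to the mean ratio, COD, and PRD, and may filter invalid observations first, so treat this as the idea rather than the exact implementation):

```python
import numpy as np

rng = np.random.default_rng(777)
predictions = np.array([95000, 102000, 98000, 105000])
ground_truth = np.array([100000, 100000, 100000, 100000])
ratios = predictions / ground_truth

# Resample ratios with replacement and recompute the median each time
boot = np.array([
    np.median(rng.choice(ratios, size=ratios.size, replace=True))
    for _ in range(10_000)
])
# 95% percentile interval: the 2.5th and 97.5th percentiles
low, high = np.percentile(boot, [2.5, 97.5])
print(np.median(ratios), low, high)
```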

calc_prd()

Calculate the Price Related Differential (PRD).
from openavmkit.utilities.stats import calc_prd

prd = calc_prd(predictions, ground_truth)
predictions
numpy.ndarray
required
Array of predicted values
ground_truth
numpy.ndarray
required
Array of ground truth values
prd
float
The PRD value, computed as the ratio of the mean ratio to the weighted mean ratio
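A PRD above 1 indicates regressivity (higher-value properties are under-assessed relative to lower-value ones). A hand-rolled sketch of the definition, where the weighted mean ratio is sum(predictions) / sum(ground_truth):

```python
import numpy as np

def prd_sketch(predictions, ground_truth):
    # PRD = mean ratio / weighted mean ratio, where the weighted
    # mean ratio is sum(predictions) / sum(ground_truth)
    ratios = predictions / ground_truth
    return np.mean(ratios) / (predictions.sum() / ground_truth.sum())

predictions = np.array([95000.0, 102000.0, 98000.0, 105000.0])
ground_truth = np.array([100000.0, 100000.0, 100000.0, 100000.0])
print(prd_sketch(predictions, ground_truth))  # 1.0
```

With equal ground-truth values the mean ratio and weighted mean ratio coincide, so the PRD here is exactly 1.0.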

calc_prb()

Calculate the Price Related Bias (PRB) metric using a regression-based approach.
from openavmkit.utilities.stats import calc_prb

prb, lower, upper = calc_prb(predictions, ground_truth, confidence_interval=0.95)
predictions
numpy.ndarray
required
Array of predicted values
ground_truth
numpy.ndarray
required
Array of ground truth values
confidence_interval
float
default:"0.95"
Desired confidence interval
return
tuple[float, float, float]
Tuple containing:
  • PRB value
  • Lower bound of the confidence interval
  • Upper bound of the confidence interval
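One common IAAO-style formulation of PRB regresses proportional deviations from the median ratio on a log2-scaled value proxy; the regression slope is the PRB point estimate. The sketch below shows that point estimate only and is an assumption about the exact regression setup; calc_prb additionally derives the confidence bounds shown above:

```python
import numpy as np

def prb_sketch(predictions, ground_truth):
    # Dependent: proportional deviation of each ratio from the median ratio.
    # Independent: log2 of a value proxy blending prediction and ground truth.
    # The fitted slope is the PRB point estimate.
    ratios = predictions / ground_truth
    med = np.median(ratios)
    dep = (ratios - med) / med
    value_proxy = 0.5 * (predictions / med + ground_truth)
    slope, _intercept = np.polyfit(np.log2(value_proxy), dep, 1)
    return slope

# Ratios that rise with value suggest progressivity (positive PRB)
ground_truth = np.array([100.0, 200.0, 300.0, 400.0])
predictions = np.array([98.0, 196.0, 310.0, 430.0])
print(prb_sketch(predictions, ground_truth))
```

When predictions are a constant multiple of ground truth, every ratio equals the median and the slope (PRB) is zero.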

Bootstrap Functions

calc_cod_bootstrap()

Calculate COD using bootstrapping to generate confidence intervals.
from openavmkit.utilities.stats import calc_cod_bootstrap

median_cod, lower_bound, upper_bound = calc_cod_bootstrap(
    values,
    confidence_interval=0.95,
    iterations=10000,
    seed=777
)
values
numpy.ndarray
required
Array of numeric values
confidence_interval
float
default:"0.95"
The desired confidence level
iterations
int
default:"10000"
Number of bootstrap iterations
seed
int
default:"777"
Random seed for reproducibility
return
tuple[float, float, float]
Tuple containing the median COD, lower bound, and upper bound of the confidence interval

calc_prd_bootstrap()

Calculate PRD with bootstrapping.
from openavmkit.utilities.stats import calc_prd_bootstrap

median_prd, lower_bound, upper_bound = calc_prd_bootstrap(
    predictions,
    ground_truth,
    confidence_interval=0.95,
    iterations=10000,
    seed=777
)
predictions
numpy.ndarray
required
Array of predicted values
ground_truth
numpy.ndarray
required
Array of ground truth values
confidence_interval
float
default:"0.95"
The desired confidence level
iterations
int
default:"10000"
Number of bootstrap iterations
seed
int
default:"777"
Random seed for reproducibility
return
tuple[float, float, float]
Tuple containing the median PRD, lower bound, and upper bound of the confidence interval

Outlier Detection

trim_outliers()

Trim outliers using IQR fences per IAAO guidance, with a maximum trim cap.
from openavmkit.utilities.stats import trim_outliers

trimmed = trim_outliers(values, max_percent=0.10, iqr_factor=1.5)
values
numpy.ndarray
required
1D numeric array; must not contain NaNs
max_percent
float
default:"0.10"
Maximum fraction to remove (e.g., 0.10 = 10%)
iqr_factor
float
default:"1.5"
1.5 for standard outliers, 3.0 for extreme outliers
return
numpy.ndarray
Trimmed array according to IQR rules or symmetric quantile cut if IQR-based trimming exceeds the cap
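The fences are the usual Tukey rule: keep values within [Q1 − k·IQR, Q3 + k·IQR]. A sketch of the fence logic alone, without the max_percent cap that trim_outliers additionally enforces via a symmetric quantile cut:

```python
import numpy as np

def iqr_mask_sketch(values, iqr_factor=1.5):
    # Keep values within [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    return (values >= lo) & (values <= hi)

values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
print(values[iqr_mask_sketch(values)])  # [10. 11. 12. 13. 14.]
```

Here Q1 = 11.25 and Q3 = 13.75, so the fences are [7.5, 17.5] and the 100.0 outlier is dropped.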

trim_outliers_mask()

Same as trim_outliers() but returns a boolean mask instead of trimmed values.
from openavmkit.utilities.stats import trim_outliers_mask

mask = trim_outliers_mask(values, max_percent=0.10, iqr_factor=1.5)
trimmed = values[mask]
values
numpy.ndarray
required
1D numeric array; must not contain NaNs
max_percent
float
default:"0.10"
Maximum fraction to remove
iqr_factor
float
default:"1.5"
IQR multiplier for fence calculation
return
numpy.ndarray
Boolean mask where True indicates values within the computed bounds (i.e., values to keep)

Variable Selection

calc_correlations()

Calculate correlations and iteratively drop variables with low combined scores.
from openavmkit.utilities.stats import calc_correlations

result = calc_correlations(X, threshold=0.1, do_plots=False)
initial_scores = result['initial']
final_scores = result['final']
bad_vars = result['bad_vars']
X
pandas.DataFrame
required
Input DataFrame containing the variables to evaluate
threshold
float
default:"0.1"
Minimum acceptable combined score for variables. Variables with a score below this value will be dropped.
do_plots
bool
default:"False"
If True, plot the initial and final correlation heatmaps
return
dict
Dictionary with keys:
  • initial: pandas.Series of combined scores from the first iteration
  • final: pandas.Series of combined scores after dropping low-scoring variables
  • bad_vars: list of variables that should be dropped

calc_vif()

Calculate the Variance Inflation Factor (VIF) for each variable in a DataFrame.
from openavmkit.utilities.stats import calc_vif

vif_data = calc_vif(X)
X
pandas.DataFrame
required
Input features DataFrame
return
pandas.DataFrame
DataFrame with columns:
  • variable: Name of each feature in X
  • vif: Variance Inflation Factor value for that feature
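A variable's VIF is 1 / (1 − R²), where R² comes from regressing that variable on all the others; values above roughly 10 are a common rule of thumb for problematic collinearity. A self-contained numpy/pandas sketch of that definition (the library's intercept handling and output details may differ):

```python
import numpy as np
import pandas as pd

def vif_sketch(X: pd.DataFrame) -> pd.DataFrame:
    # VIF_i = 1 / (1 - R^2) from regressing column i on the other columns
    rows = []
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
        rows.append({"variable": col, "vif": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# "ab" is nearly collinear with "a", so both get large VIFs
X = pd.DataFrame({"a": a, "b": b, "ab": a + 0.1 * rng.normal(size=200)})
print(vif_sketch(X))
```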

calc_vif_recursive_drop()

Recursively drop variables with a Variance Inflation Factor (VIF) exceeding the threshold.
from openavmkit.utilities.stats import calc_vif_recursive_drop

result = calc_vif_recursive_drop(X, threshold=10.0, settings=None)
initial_vif = result['initial']
final_vif = result['final']
X
pandas.DataFrame
required
Input features DataFrame
threshold
float
default:"10.0"
Maximum acceptable VIF. Variables with VIF above this threshold will be removed.
settings
dict
default:"None"
Settings dictionary containing field classifications, if needed for VIF computation
return
dict
Dictionary with keys:
  • initial: pandas.DataFrame of VIF values before dropping variables
  • final: pandas.DataFrame of VIF values after recursively dropping high-VIF variables

Model Performance

calc_mse()

Calculate the Mean Squared Error (MSE) between predictions and ground truth.
from openavmkit.utilities.stats import calc_mse

mse = calc_mse(predictions, ground_truth)
predictions
numpy.ndarray
required
Array of predicted values
ground_truth
numpy.ndarray
required
Array of true values
mse
float
The MSE value
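MSE is simply the mean of the squared residuals, so the call above should be equivalent to:

```python
import numpy as np

predictions = np.array([95000.0, 102000.0])
ground_truth = np.array([100000.0, 100000.0])
# MSE = mean((predictions - ground_truth)^2)
mse = np.mean((predictions - ground_truth) ** 2)
print(mse)  # 14500000.0
```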

calc_mse_r2_adj_r2()

Calculate the Mean Squared Error (MSE), R-squared, and adjusted R-squared.
from openavmkit.utilities.stats import calc_mse_r2_adj_r2

mse, r2, adj_r2 = calc_mse_r2_adj_r2(predictions, ground_truth, num_vars=5)
predictions
numpy.ndarray
required
Array of predicted values
ground_truth
numpy.ndarray
required
Array of true values
num_vars
int
required
Number of independent variables used to produce the predictions
return
tuple[float, float, float]
Tuple containing:
  • The MSE value
  • The R-squared value
  • The adjusted R-squared value
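R² compares residual variance to total variance, and adjusted R² penalizes it for the number of predictors: adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A sketch of the three formulas, which calc_mse_r2_adj_r2 should match up to edge-case handling:

```python
import numpy as np

def mse_r2_adj_r2_sketch(predictions, ground_truth, num_vars):
    resid = ground_truth - predictions
    mse = np.mean(resid**2)
    ss_res = np.sum(resid**2)
    ss_tot = np.sum((ground_truth - ground_truth.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n = len(ground_truth)
    # Penalize R^2 for the number of predictors (k = num_vars)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_vars - 1)
    return mse, r2, adj_r2

ground_truth = np.linspace(1.0, 8.0, 8)
predictions = ground_truth + 0.1  # small constant bias
print(mse_r2_adj_r2_sketch(predictions, ground_truth, num_vars=2))
```

Note that adjusted R² is always below R² when num_vars > 0 and R² < 1.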

calc_cross_validation_score()

Calculate cross-validation score using negative mean squared error.
from openavmkit.utilities.stats import calc_cross_validation_score

mse = calc_cross_validation_score(X, y)
X
pandas.DataFrame or numpy.ndarray
required
Input features for modeling
y
pandas.Series or numpy.ndarray
required
Target variable
mse
float
The mean cross-validated mean squared error (positive value)
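Cross-validation fits a model on k − 1 folds and scores the held-out fold, averaging the per-fold MSE (negating a negative-MSE score yields the same positive number). A self-contained sketch with a plain least-squares model; the library's actual model and fold count are not specified here, so both are assumptions:

```python
import numpy as np

def cv_mse_sketch(X, y, k=5, seed=777):
    # Shuffle indices, split into k folds, fit least squares on k-1
    # folds, and average the held-out MSE across folds.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A_tr = np.column_stack([np.ones(len(train)), X[train]])
        A_te = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
        mses.append(np.mean((y[test] - A_te @ coef) ** 2))
    return float(np.mean(mses))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(cv_mse_sketch(X, y))
```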

Helper Classes

ConfidenceStat

A class representing any statistic along with its confidence interval bounds.
from openavmkit.utilities.stats import ConfidenceStat

stat = ConfidenceStat(
    value=1.05,
    confidence_interval=0.95,
    low=1.02,
    high=1.08
)

print(f"Value: {stat.value}")
print(f"95% CI: [{stat.low}, {stat.high}]")
value
float
required
The base value of the statistic
confidence_interval
float
required
The confidence level (e.g., 0.95 for a 95% confidence interval)
low
float
required
The lower bound of the confidence interval
high
float
required
The upper bound of the confidence interval
