stats

The stats module provides statistical functions for automated valuation modeling: ratio study statistics (COD, PRD, PRB), bootstrap confidence intervals, outlier detection, variable selection utilities, and model performance metrics.

Ratio Study Statistics

calc_cod()

Calculate the Coefficient of Dispersion (COD) for an array of values.

from openavmkit.utilities.stats import calc_cod
import numpy as np

values = np.array([100, 105, 95, 110, 90])
cod = calc_cod(values)

values

numpy.ndarray

required

Array of numeric values

cod

float

The COD percentage. Returns nan if array is empty, 0.0 if all values are zero, and inf if median is zero but not all values are zero.
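The edge cases above follow from the COD formula, 100 * mean(|v - median(v)|) / median(v). The sketch below is a plain-Python illustration of that formula and of the documented return values; `cod_sketch` is a hypothetical name and this is not the library's implementation.

```python
from statistics import median


def cod_sketch(values):
    """Illustrative COD: 100 * mean absolute deviation from the median / median."""
    values = list(values)
    if not values:
        return float("nan")  # empty input
    med = median(values)
    if med == 0:
        # All zeros -> 0.0; otherwise the division by a zero median is unbounded.
        return 0.0 if all(v == 0 for v in values) else float("inf")
    mean_abs_dev = sum(abs(v - med) for v in values) / len(values)
    return 100.0 * mean_abs_dev / med


print(cod_sketch([100, 105, 95, 110, 90]))  # 6.0
```

For the sample array, the median is 100 and the mean absolute deviation is 6, which gives a COD of 6.0.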

calc_ratio_stats_bootstrap()

Calculate ratio study statistics (Median ratio, Mean ratio, COD, PRD) with bootstrap percentile confidence intervals, following IAAO definitions.

from openavmkit.utilities.stats import calc_ratio_stats_bootstrap
import numpy as np

predictions = np.array([95000, 102000, 98000, 105000])
ground_truth = np.array([100000, 100000, 100000, 100000])

stats = calc_ratio_stats_bootstrap(
    predictions, 
    ground_truth,
    confidence_interval=0.95,
    iterations=10000,
    seed=777
)

predictions

numpy.ndarray

required

Array of predicted values

ground_truth

numpy.ndarray

required

Array of corresponding ground truth (e.g., sale price) values

confidence_interval

float

default:"0.95"

The size of the confidence interval (e.g., 0.95 = 95% confidence)

iterations

int

default:"10000"

The number of bootstrap iterations to perform

seed

int

default:"777"

Random seed for reproducibility

return

dict

Dictionary containing:

median_ratio: ConfidenceStat object with point estimate and confidence bounds
mean_ratio: ConfidenceStat object
cod: ConfidenceStat object (COD = 100 * mean(|r_i - median(r)|) / median(r))
prd: ConfidenceStat object (PRD = mean(r) / weighted_mean(r))

Returns None if no valid observations remain after filtering.
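To make "bootstrap percentile confidence intervals" concrete, here is a minimal sketch of the general technique: resample the ratios with replacement, recompute the statistic each time, and read the bounds off the sorted estimates. The function name `bootstrap_percentile_ci` is hypothetical, the nearest-rank percentile lookup is a simplification, and the library's internals may differ.

```python
import random
from statistics import median


def bootstrap_percentile_ci(ratios, stat_fn, confidence_interval=0.95,
                            iterations=2000, seed=777):
    """Return (point estimate, lower bound, upper bound) for stat_fn(ratios)."""
    rng = random.Random(seed)  # fixed seed makes the result reproducible
    n = len(ratios)
    # Each iteration draws n ratios with replacement and recomputes the statistic.
    estimates = sorted(
        stat_fn([ratios[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    )
    alpha = (1.0 - confidence_interval) / 2.0
    low = estimates[int(alpha * (iterations - 1))]
    high = estimates[int((1.0 - alpha) * (iterations - 1))]
    return stat_fn(ratios), low, high


predictions = [95000, 102000, 98000, 105000]
ground_truth = [100000, 100000, 100000, 100000]
ratios = [p / g for p, g in zip(predictions, ground_truth)]

point, low, high = bootstrap_percentile_ci(ratios, median)
print(point, low, high)
```

The same helper works for any statistic that takes a list of ratios, for example the mean ratio or a COD function.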

calc_prd()

Calculate the Price Related Differential (PRD).

from openavmkit.utilities.stats import calc_prd

prd = calc_prd(predictions, ground_truth)

predictions

numpy.ndarray

required

Array of predicted values

ground_truth

numpy.ndarray

required

Array of ground truth values

prd

float

The PRD value, computed as the ratio of the mean ratio to the weighted mean ratio
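As a rough illustration of the formula above, the sketch below computes the mean ratio and divides it by the weighted mean ratio. It assumes the usual ratio-study definition of the weighted mean ratio, sum(predictions) / sum(ground_truth); `prd_sketch` is a hypothetical name, not the library function.

```python
def prd_sketch(predictions, ground_truth):
    """Illustrative PRD: mean ratio divided by the value-weighted mean ratio."""
    ratios = [p / g for p, g in zip(predictions, ground_truth)]
    mean_ratio = sum(ratios) / len(ratios)
    # Assumption: weighted mean ratio = total predicted value / total true value.
    weighted_mean_ratio = sum(predictions) / sum(ground_truth)
    return mean_ratio / weighted_mean_ratio


# Low-value property over-predicted, high-value property under-predicted.
prd = prd_sketch([110000, 200000, 450000], [100000, 200000, 500000])
print(prd)
```

In this example the mean ratio is 1.0 and the weighted mean ratio is 0.95, so the PRD is above 1, which is the pattern conventionally read as regressive.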

calc_prb()

Calculate the Price Related Bias (PRB) metric using a regression-based approach.

from openavmkit.utilities.stats import calc_prb

prb, lower, upper = calc_prb(predictions, ground_truth, confidence_interval=0.95)

predictions

numpy.ndarray

required

Array of predicted values

ground_truth

numpy.ndarray

required

Array of ground truth values

confidence_interval

float

default:"0.95"

Desired confidence interval

return

tuple[float, float, float]

Tuple containing:

PRB value
Lower bound of the confidence interval
Upper bound of the confidence interval
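The page does not spell out the regression behind PRB. The sketch below assumes the formulation commonly given in the IAAO Standard on Ratio Studies: regress each ratio's percentage deviation from the median ratio on the base-2 log of a value proxy, and take the slope as the PRB. `prb_sketch` is a hypothetical name, it omits the confidence interval, and the library's implementation may differ.

```python
import math
from statistics import median


def prb_sketch(predictions, ground_truth):
    """Illustrative PRB point estimate via a one-variable least-squares slope."""
    ratios = [p / g for p, g in zip(predictions, ground_truth)]
    med = median(ratios)
    # Dependent variable: relative deviation of each ratio from the median ratio.
    y = [(r - med) / med for r in ratios]
    # Independent variable: log2 of a proxy that averages the true value and
    # the prediction rescaled by the median ratio (assumed IAAO formulation).
    x = [math.log2(0.5 * g + 0.5 * p / med)
         for p, g in zip(predictions, ground_truth)]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx  # slope: change in relative ratio per doubling of value


prb = prb_sketch([110000, 200000, 450000], [100000, 200000, 500000])
print(prb)
```

A negative slope, as in this example, means ratios fall as value rises; a slope near zero means ratios do not vary systematically with value.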

Bootstrap Functions

calc_cod_bootstrap()

Calculate COD using bootstrapping to generate confidence intervals.

from openavmkit.utilities.stats import calc_cod_bootstrap

median_cod, lower_bound, upper_bound = calc_cod_bootstrap(
    values,
    confidence_interval=0.95,
    iterations=10000,
    seed=777
)

values

numpy.ndarray

required

Array of numeric values

confidence_interval

float

default:"0.95"

The desired confidence level

iterations

int

default:"10000"

Number of bootstrap iterations

seed

int

default:"777"

Random seed for reproducibility

return

tuple[float, float, float]

Tuple containing the median COD, lower bound, and upper bound of the confidence interval

calc_prd_bootstrap()

Calculate PRD using bootstrapping to generate confidence intervals.

from openavmkit.utilities.stats import calc_prd_bootstrap

median_prd, lower_bound, upper_bound = calc_prd_bootstrap(
    predictions,
    ground_truth,
    confidence_interval=0.95,
    iterations=10000,
    seed=777
)

predictions

numpy.ndarray

required

Array of predicted values

ground_truth

numpy.ndarray

required

Array of ground truth values

confidence_interval

float

default:"0.95"

The desired confidence level

iterations

int

default:"10000"

Number of bootstrap iterations

seed

int

default:"777"

Random seed for reproducibility

return

tuple[float, float, float]

Tuple containing the median PRD, lower bound, and upper bound of the confidence interval
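Because PRD depends on both arrays, a bootstrap for it has to resample predictions and ground truth together, as pairs. The sketch below shows one way to do that and to return the median of the bootstrap estimates with percentile bounds. All names are hypothetical, the weighted mean ratio is assumed to be sum(predictions) / sum(ground_truth), and the library's resampling details are not shown on this page.

```python
import random
from statistics import median


def prd(predictions, ground_truth):
    ratios = [p / g for p, g in zip(predictions, ground_truth)]
    return (sum(ratios) / len(ratios)) / (sum(predictions) / sum(ground_truth))


def prd_bootstrap_sketch(predictions, ground_truth, confidence_interval=0.95,
                         iterations=2000, seed=777):
    """Return (median PRD, lower bound, upper bound) from paired resampling."""
    rng = random.Random(seed)
    n = len(predictions)
    estimates = []
    for _ in range(iterations):
        # Draw row indices so each prediction stays matched to its true value.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(prd([predictions[i] for i in idx],
                             [ground_truth[i] for i in idx]))
    estimates.sort()
    alpha = (1.0 - confidence_interval) / 2.0
    lower = estimates[int(alpha * (iterations - 1))]
    upper = estimates[int((1.0 - alpha) * (iterations - 1))]
    return median(estimates), lower, upper


preds = [110000, 200000, 450000, 310000, 95000]
truth = [100000, 200000, 500000, 300000, 100000]
median_prd, lower_bound, upper_bound = prd_bootstrap_sketch(preds, truth)
print(median_prd, lower_bound, upper_bound)
```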

Outlier Detection

trim_outliers()

Trim outliers using IQR fences per IAAO guidance, with a maximum trim cap.

from openavmkit.utilities.stats import trim_outliers

trimmed = trim_outliers(values, max_percent=0.10, iqr_factor=1.5)

values

numpy.ndarray

required

1D numeric array with no NaNs allowed

max_percent

float

default:"0.10"

Maximum fraction to remove (e.g., 0.10 = 10%)

iqr_factor

float

default:"1.5"

1.5 for standard outliers, 3.0 for extreme outliers

return

numpy.ndarray

Trimmed array according to IQR rules or symmetric quantile cut if IQR-based trimming exceeds the cap
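The trimming rule can be sketched as follows: compute IQR fences, and if they would remove more than max_percent of the data, fall back to a symmetric quantile cut. This is an illustration under assumptions: the linear-interpolation quantile and the choice to cut max_percent / 2 from each tail are guesses at reasonable behavior, and `trim_outliers_sketch` is a hypothetical name.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def trim_outliers_sketch(values, max_percent=0.10, iqr_factor=1.5):
    ordered = sorted(values)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    iqr = q3 - q1
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    kept = [v for v in values if low <= v <= high]
    if len(values) - len(kept) > max_percent * len(values):
        # Fences would trim too much: cut half the cap from each tail instead
        # (assumed meaning of "symmetric quantile cut").
        low = quantile(ordered, max_percent / 2.0)
        high = quantile(ordered, 1.0 - max_percent / 2.0)
        kept = [v for v in values if low <= v <= high]
    return kept


ratios = [0.95, 0.98, 1.00, 1.02, 1.05, 0.97, 1.03, 0.99, 1.01, 1.00, 3.50]
trimmed = trim_outliers_sketch(ratios)
print(trimmed)
```

Here only the 3.50 ratio falls outside the fences, which is under the 10% cap, so the IQR rule applies and the fallback is not used.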

trim_outliers_mask()

Same as trim_outliers() but returns a boolean mask instead of trimmed values.

from openavmkit.utilities.stats import trim_outliers_mask

mask = trim_outliers_mask(values, max_percent=0.10, iqr_factor=1.5)
trimmed = values[mask]

values

numpy.ndarray

required

1D numeric array with no NaNs allowed

max_percent

float

default:"0.10"

Maximum fraction to remove

iqr_factor

float

default:"1.5"

IQR multiplier for fence calculation

return

numpy.ndarray

Boolean array where True marks values to keep (those inside the trimming bounds), so values[mask] yields the trimmed array

Variable Selection

calc_correlations()

Calculate correlations and iteratively drop variables with low combined scores.

from openavmkit.utilities.stats import calc_correlations

result = calc_correlations(X, threshold=0.1, do_plots=False)
initial_scores = result['initial']
final_scores = result['final']
bad_vars = result['bad_vars']

X

pandas.DataFrame

required

Input DataFrame containing the variables to evaluate

threshold

float

default:"0.1"

Minimum acceptable combined score for variables. Variables with a score below this value will be dropped.

do_plots

bool

default:"False"

If True, plot the initial and final correlation heatmaps

return

dict

Dictionary with keys:

initial: pandas.Series of combined scores from the first iteration
final: pandas.Series of combined scores after dropping low-scoring variables
bad_vars: list of variables that should be dropped

calc_vif()

Calculate the Variance Inflation Factor (VIF) for each variable in a DataFrame.

from openavmkit.utilities.stats import calc_vif

vif_data = calc_vif(X)

X

pandas.DataFrame

required

Input features DataFrame

return

pandas.DataFrame

DataFrame with columns:

variable: Name of each feature in X
vif: Variance Inflation Factor value for that feature
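For intuition about what the vif column measures: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the other features. With exactly two features, R² is the squared Pearson correlation between them, which keeps the arithmetic short. The sketch below covers only that two-feature special case, with hypothetical names and made-up sample data; it is not the general computation the library performs on a DataFrame.

```python
def pearson_r(a, b):
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def vif_two_features(a, b):
    """VIF shared by both features when they are the only two predictors."""
    r_squared = pearson_r(a, b) ** 2
    return 1.0 / (1.0 - r_squared)


# Hypothetical features: floor area and room count move together.
sqft = [1000, 1500, 2000, 2500, 3000]
rooms = [3, 4, 6, 7, 9]
vif = vif_two_features(sqft, rooms)
print(vif)
```

Two strongly correlated features such as these produce a VIF well above 10, while uncorrelated features produce a VIF of 1.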

calc_vif_recursive_drop()

Recursively drop variables with a Variance Inflation Factor (VIF) exceeding the threshold.

from openavmkit.utilities.stats import calc_vif_recursive_drop

result = calc_vif_recursive_drop(X, threshold=10.0, settings=None)
initial_vif = result['initial']
final_vif = result['final']

X

pandas.DataFrame

required

Input features DataFrame

threshold

float

default:"10.0"

Maximum acceptable VIF. Variables with VIF above this threshold will be removed.

settings

dict

default:"None"

Settings dictionary containing field classifications, if needed for VIF computation

return

dict

Dictionary with keys:

initial: pandas.DataFrame of VIF values before dropping variables
final: pandas.DataFrame of VIF values after recursively dropping high-VIF variables

Model Performance

calc_mse()

Calculate the Mean Squared Error (MSE) between predictions and ground truth.

from openavmkit.utilities.stats import calc_mse

mse = calc_mse(predictions, ground_truth)

prediction

numpy.ndarray

required

Array of predicted values

ground_truth

numpy.ndarray

required

Array of true values

mse

float

The MSE value

calc_mse_r2_adj_r2()

Calculate the Mean Squared Error (MSE), R-squared, and adjusted R-squared.

from openavmkit.utilities.stats import calc_mse_r2_adj_r2

mse, r2, adj_r2 = calc_mse_r2_adj_r2(predictions, ground_truth, num_vars=5)

predictions

numpy.ndarray

required

Array of predicted values

ground_truth

numpy.ndarray

required

Array of true values

num_vars

int

required

Number of independent variables used to produce the predictions

return

tuple[float, float, float]

Tuple containing:

The MSE value
The R-squared value
The adjusted R-squared value
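The three values are related through the standard formulas MSE = SS_res / n, R² = 1 - SS_res / SS_tot, and adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where k is num_vars. The sketch below applies those textbook formulas in plain Python; `mse_r2_adj_r2_sketch` is a hypothetical name and the library may handle edge cases differently.

```python
def mse_r2_adj_r2_sketch(predictions, ground_truth, num_vars):
    n = len(ground_truth)
    ss_res = sum((p - t) ** 2 for p, t in zip(predictions, ground_truth))
    mean_truth = sum(ground_truth) / n
    ss_tot = sum((t - mean_truth) ** 2 for t in ground_truth)
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    # Penalize R-squared for the number of independent variables used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_vars - 1)
    return mse, r2, adj_r2


truth = [100, 200, 300, 400, 500, 600, 700]
preds = [110, 190, 310, 390, 510, 590, 710]  # every prediction is off by 10
mse, r2, adj_r2 = mse_r2_adj_r2_sketch(preds, truth, num_vars=2)
print(mse, r2, adj_r2)
```

With seven observations and two variables, the adjustment multiplies the unexplained share (1 - R²) by 6/4, so adjusted R² comes out slightly below R².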

calc_cross_validation_score()

Calculate a cross-validation score based on negative mean squared error scoring. The result is returned as a positive mean squared error.

from openavmkit.utilities.stats import calc_cross_validation_score

mse = calc_cross_validation_score(X, y)

X

pandas.DataFrame or numpy.ndarray

required

Input features for modeling

y

pandas.Series or numpy.ndarray

required

Target variable

mse

float

The mean cross-validated mean squared error (positive value)
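The idea of a mean cross-validated MSE can be sketched with a hand-rolled k-fold loop: hold out one fold, fit on the rest, score the held-out fold, then average the fold scores. The page does not say which model or how many folds the library uses, so the single-feature least-squares line and the five folds here are assumptions chosen to keep the example self-contained, and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def cross_validated_mse(xs, ys, folds=5):
    n = len(xs)
    fold_mses = []
    for k in range(folds):
        test_idx = set(range(k, n, folds))  # every folds-th row is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        slope, intercept = fit_line(train_x, train_y)
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / len(fold_mses)  # positive mean MSE across folds


xs = list(range(10))
ys_noisy = [2 * x + 1 + (0.5 if x % 3 == 0 else -0.25) for x in xs]
print(cross_validated_mse(xs, ys_noisy))
```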

Helper Classes

ConfidenceStat

A class representing any statistic along with its confidence interval bounds.

from openavmkit.utilities.stats import ConfidenceStat

stat = ConfidenceStat(
    value=1.05,
    confidence_interval=0.95,
    low=1.02,
    high=1.08
)

print(f"Value: {stat.value}")
print(f"95% CI: [{stat.low}, {stat.high}]")

value

float

required

The base value of the statistic

confidence_interval

float

required

The confidence level as a fraction (e.g., 0.95 for a 95% confidence interval)

low

float

required

The lower bound of the confidence interval

high

float

required

The upper bound of the confidence interval
