The modeling module provides functions for training and evaluating various predictive models including MRA, XGBoost, LightGBM, CatBoost, GWR, and more.

Core Classes

DataSplit

Encapsulates the splitting of data into training, test, and other subsets.

from openavmkit.modeling import DataSplit

ds = DataSplit(
    name="residential",
    df_sales=df_sales,
    df_universe=df_universe,
    model_group="residential",
    settings=settings,
    dep_var="sale_price",
    dep_var_test="sale_price",
    ind_vars=independent_vars,
    categorical_vars=categorical_vars,
    interactions={},
    test_keys=test_keys,
    train_keys=train_keys
)

Attributes

  • df_sales - Sales data after processing
  • df_universe - Universe (parcel) data after processing
  • df_train - Training subset of sales data
  • df_test - Test subset of sales data
  • X_train - Feature matrix for the training data
  • X_test - Feature matrix for the test data
  • X_univ - Feature matrix for the universe data
  • y_train - Target array for training
  • y_test - Target array for testing
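
As a rough illustration of what splitting by explicit keys means, the sketch below partitions a small list of sales records into training and test subsets. The `key` field name and the `split_by_keys` helper are hypothetical stand-ins; this is not how DataSplit is implemented, only a picture of the idea behind `train_keys` and `test_keys`.

```python
def split_by_keys(sales, train_keys, test_keys, key_field="key"):
    """Partition sales records into (train, test) lists by key membership.

    Hypothetical helper for illustration; not part of openavmkit.
    """
    train_set, test_set = set(train_keys), set(test_keys)
    overlap = train_set & test_set
    if overlap:
        # A record in both subsets would leak test data into training.
        raise ValueError(f"keys appear in both splits: {sorted(overlap)}")
    train = [row for row in sales if row[key_field] in train_set]
    test = [row for row in sales if row[key_field] in test_set]
    return train, test


sales = [
    {"key": "A1", "sale_price": 210_000},
    {"key": "B2", "sale_price": 185_000},
    {"key": "C3", "sale_price": 320_000},
]
train, test = split_by_keys(sales, train_keys=["A1", "C3"], test_keys=["B2"])
```

Passing the keys explicitly, rather than drawing a random split each time, keeps the same sales in the test set across every model being compared.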

SingleModelResults

Container for results from a single prediction model.

Attributes

  • ds - The DataSplit object used
  • df_universe - Universe DataFrame with predictions
  • df_test - Test DataFrame with predictions
  • df_sales - Sales DataFrame with predictions
  • model_name - Model name (unique identifier)
  • model_engine - Model engine (“xgboost”, “mra”, etc.)
  • model - The fitted model object
  • pred_test - PredictionResults for the test set
  • pred_train - PredictionResults for the training set
  • pred_sales - PredictionResults for the sales set
  • chd - Calculated CHD (coefficient of horizontal disparity) value
  • utility_test - Composite utility score for the test set
  • utility_train - Composite utility score for the training set

PredictionResults

Container for prediction results and associated performance metrics.

Attributes

  • dep_var - The dependent variable being predicted
  • ind_vars - List of independent variables
  • y - Ground truth values
  • y_pred - Predicted values
  • mse - Mean squared error
  • rmse - Root mean squared error
  • mape - Mean absolute percentage error
  • r2 - R-squared
  • adj_r2 - Adjusted R-squared
  • ratio_study - RatioStudy object
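
For readers who want the metric definitions spelled out, here is a minimal sketch using the textbook formulas. The `regression_metrics` helper is hypothetical, and the library may differ in details (for example, whether MAPE is reported as a fraction or a percentage).

```python
import math


def regression_metrics(y, y_pred, n_ind_vars):
    """Return mse, rmse, mape, r2 and adj_r2 for observed/predicted pairs."""
    n = len(y)
    residuals = [obs - pred for obs, pred in zip(y, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    # MAPE as a fraction (multiply by 100 for a percentage); assumes no zeros in y.
    mape = sum(abs(r / obs) for r, obs in zip(residuals, y)) / n
    y_mean = sum(y) / n
    ss_tot = sum((obs - y_mean) ** 2 for obs in y)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalizes the number of independent variables.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_ind_vars - 1)
    return {"mse": mse, "rmse": rmse, "mape": mape, "r2": r2, "adj_r2": adj_r2}


metrics = regression_metrics(
    y=[100.0, 200.0, 300.0, 400.0],
    y_pred=[110.0, 190.0, 310.0, 390.0],
    n_ind_vars=1,
)
```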

Multiple Regression Analysis (MRA)

run_mra()

Train an MRA model and return its prediction results.

from openavmkit.modeling import run_mra

results = run_mra(
    ds,
    intercept=True,
    verbose=True
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • intercept (bool, default: True) - Whether to include an intercept in the model
  • verbose (bool, default: False) - Whether to print verbose output
  • model (MRAModel | None, default: None) - Optional pre-trained MRAModel

Returns

  • results (SingleModelResults) - Prediction results from the MRA model

run_multi_mra()

Train a hierarchical Multi-MRA model and return its prediction results.

from openavmkit.modeling import run_multi_mra

results = run_multi_mra(
    ds,
    outpath="out/models/residential",
    location_fields=["neighborhood", "city"],
    optimize_vars=False,
    intercept=True,
    verbose=True,
    min_sample_size=15
)

Parameters

  • ds (DataSplit, required) - DataSplit object (sales, universe, and splits should already be set up)
  • outpath (str, required) - Path to write parameters to
  • location_fields (list[str], required) - Ordered list of location field names, from most specific to least specific
  • optimize_vars (bool, default: False) - Whether to automatically trim the variable selection to an optimal subset
  • intercept (bool, default: True) - Whether to include an intercept column in the regression
  • verbose (bool, default: False) - If True, print verbose output
  • min_sample_size (int, default: 15) - Minimum number of observations required to fit a local OLS model

Returns

  • results (SingleModelResults) - Prediction results from the Multi-MRA model
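
The ordering of location_fields and the min_sample_size threshold suggest a fallback scheme: use the most specific location that has enough sales, otherwise move to a broader one. The sketch below illustrates that reading only. The `pick_location_level` helper is hypothetical, and the exact fallback rule used by run_multi_mra is an assumption inferred from the parameter descriptions.

```python
from collections import Counter


def pick_location_level(parcel, sales, location_fields, min_sample_size=15):
    """Return the first (most specific) location field whose matching group
    of sales has at least min_sample_size rows, or None to signal that a
    global model would be needed.
    """
    for field in location_fields:  # ordered most specific -> least specific
        counts = Counter(row[field] for row in sales)
        if counts[parcel[field]] >= min_sample_size:
            return field
    return None


# 20 sales in neighborhood N1 and 5 in N2, all in the same city.
sales = (
    [{"neighborhood": "N1", "city": "Springfield"}] * 20
    + [{"neighborhood": "N2", "city": "Springfield"}] * 5
)
fields = ["neighborhood", "city"]
level_n1 = pick_location_level({"neighborhood": "N1", "city": "Springfield"}, sales, fields)
level_n2 = pick_location_level({"neighborhood": "N2", "city": "Springfield"}, sales, fields)
```

Here N1 has enough sales for a neighborhood-level fit, while N2 has only 5 and falls back to the city level, where 25 sales are available.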

Tree-Based Models

run_xgboost()

Train an XGBoost model and return its prediction results.

from openavmkit.modeling import run_xgboost

results = run_xgboost(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True,
    params={
        "max_depth": 6,
        "learning_rate": 0.1,
        "n_estimators": 100,
        "objective": "reg:squarederror"
    }
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • outpath (str, required) - Path to save/load parameters
  • use_saved_params (bool, default: True) - Whether to use saved parameters if available
  • save_params (bool, default: True) - Whether to save parameters after training
  • tune_params (bool, default: False) - Whether to perform hyperparameter tuning
  • verbose (bool, default: False) - Whether to print verbose output
  • params (dict, default: None) - Optional dictionary of hyperparameters

Returns

  • results (SingleModelResults) - Prediction results from the XGBoost model
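
One plausible reading of how use_saved_params, save_params, and tune_params interact is sketched below: reuse cached parameters when allowed, otherwise tune or fall back to defaults, then optionally persist the result. The `resolve_params` helper, the `params.json` filename, and the precedence order are all assumptions made for illustration. The source does not specify them, and run_xgboost may behave differently.

```python
import json
import os
import tempfile


def resolve_params(outpath, use_saved_params, save_params, tune_params,
                   tune_fn, defaults=None, filename="params.json"):
    """Hypothetical helper: decide which hyperparameters a training run uses."""
    path = os.path.join(outpath, filename)  # filename is an assumption
    if use_saved_params and os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # reuse cached parameters and skip tuning
    params = tune_fn() if tune_params else dict(defaults or {})
    if save_params:
        os.makedirs(outpath, exist_ok=True)
        with open(path, "w") as f:
            json.dump(params, f)
    return params


with tempfile.TemporaryDirectory() as tmp:
    # First run: nothing cached yet, so tuning runs and the result is saved.
    first = resolve_params(tmp, True, True, True, tune_fn=lambda: {"max_depth": 6})
    # Second run: the cached file wins, so the new tuner is never called.
    second = resolve_params(tmp, True, False, True, tune_fn=lambda: {"max_depth": 9})
```

Under this reading, deleting the saved file or passing use_saved_params=False is what forces a fresh tuning pass.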

run_lightgbm()

Train a LightGBM model and return its prediction results.

from openavmkit.modeling import run_lightgbm

results = run_lightgbm(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True,
    params={
        "num_leaves": 31,
        "learning_rate": 0.1,
        "n_estimators": 100,
        "objective": "regression"
    }
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • outpath (str, required) - Path to save/load parameters
  • use_saved_params (bool, default: True) - Whether to use saved parameters if available
  • save_params (bool, default: True) - Whether to save parameters after training
  • tune_params (bool, default: False) - Whether to perform hyperparameter tuning
  • verbose (bool, default: False) - Whether to print verbose output
  • params (dict, default: None) - Optional dictionary of hyperparameters

Returns

  • results (SingleModelResults) - Prediction results from the LightGBM model

run_catboost()

Train a CatBoost model and return its prediction results.

from openavmkit.modeling import run_catboost

results = run_catboost(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True,
    params={
        "depth": 6,
        "learning_rate": 0.1,
        "iterations": 100,
        "loss_function": "RMSE"
    }
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • outpath (str, required) - Path to save/load parameters
  • use_saved_params (bool, default: True) - Whether to use saved parameters if available
  • save_params (bool, default: True) - Whether to save parameters after training
  • tune_params (bool, default: False) - Whether to perform hyperparameter tuning
  • verbose (bool, default: False) - Whether to print verbose output
  • params (dict, default: None) - Optional dictionary of hyperparameters

Returns

  • results (SingleModelResults) - Prediction results from the CatBoost model

Geographically Weighted Regression (GWR)

run_gwr()

Train a GWR model and return its prediction results.

from openavmkit.modeling import run_gwr

results = run_gwr(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • outpath (str, required) - Path to save/load parameters
  • use_saved_params (bool, default: True) - Whether to use saved bandwidth if available
  • save_params (bool, default: True) - Whether to save bandwidth after training
  • tune_params (bool, default: False) - Whether to perform bandwidth selection
  • verbose (bool, default: False) - Whether to print verbose output

Returns

  • results (SingleModelResults) - Prediction results from the GWR model

Spatial Models

run_spatial_lag()

Train a spatial lag model and return its prediction results.

from openavmkit.modeling import run_spatial_lag

results = run_spatial_lag(
    ds,
    verbose=True
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • verbose (bool, default: False) - Whether to print verbose output

Returns

  • results (SingleModelResults) - Prediction results from the spatial lag model

Baseline Models

run_average()

Train an average model (baseline) and return its prediction results.

from openavmkit.modeling import run_average

results = run_average(
    ds,
    verbose=True
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • verbose (bool, default: False) - Whether to print verbose output

Returns

  • results (SingleModelResults) - Prediction results from the average model
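
The idea behind an average baseline can be sketched in a few lines: every parcel receives the same value, taken from the training sales. Whether the library uses the mean or the median is not stated in the source, so the hypothetical helpers below expose both.

```python
from statistics import mean, median


def fit_average(train_prices, use_median=False):
    """Return the single value an average baseline predicts for every parcel."""
    return median(train_prices) if use_median else mean(train_prices)


def predict_average(value, n_parcels):
    """Repeat the fitted value once per parcel."""
    return [value] * n_parcels


train_prices = [100_000, 200_000, 600_000]
mean_value = fit_average(train_prices)
median_value = fit_average(train_prices, use_median=True)
predictions = predict_average(mean_value, n_parcels=3)
```

A baseline like this sets the floor: any real model should beat it comfortably on the test set.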

run_naive_area()

Train a naive area model (baseline using simple $/sqft) and return its prediction results.

from openavmkit.modeling import run_naive_area

results = run_naive_area(
    ds,
    verbose=True
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • verbose (bool, default: False) - Whether to print verbose output

Returns

  • results (SingleModelResults) - Prediction results from the naive area model
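
A simple $/sqft baseline amounts to one rate applied to every parcel's area. The sketch below assumes the rate is the median price per square foot of the training sales and that the fields are named `sale_price` and `square_feet`. Both are assumptions for illustration, as are the helper names.

```python
from statistics import median


def fit_naive_area(sales, price_field="sale_price", area_field="square_feet"):
    """Return one jurisdiction-wide price per square foot (median of sales)."""
    return median(row[price_field] / row[area_field] for row in sales)


def predict_naive_area(rate, areas):
    """Value each parcel as rate * area."""
    return [rate * area for area in areas]


sales = [
    {"sale_price": 200_000, "square_feet": 1_000},  # 200 $/sqft
    {"sale_price": 300_000, "square_feet": 2_000},  # 150 $/sqft
    {"sale_price": 450_000, "square_feet": 1_800},  # 250 $/sqft
]
rate = fit_naive_area(sales)
predictions = predict_naive_area(rate, [1_500, 2_200])
```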

run_local_area()

Train a local area model (location-based $/sqft) and return its prediction results.

from openavmkit.modeling import run_local_area

results = run_local_area(
    ds,
    location_field="neighborhood",
    verbose=True
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • location_field (str, required) - Field name to use for location grouping
  • verbose (bool, default: False) - Whether to print verbose output

Returns

  • results (SingleModelResults) - Prediction results from the local area model
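
A location-based $/sqft model can be pictured as the naive area baseline computed once per location group. In the sketch below, the median rate per group, the overall-median fallback for locations with no sales, and the helper names are all assumptions for illustration rather than the library's documented behavior.

```python
from collections import defaultdict
from statistics import median


def fit_local_area(sales, location_field,
                   price_field="sale_price", area_field="square_feet"):
    """Return ({location: median $/sqft}, overall median $/sqft)."""
    by_location = defaultdict(list)
    for row in sales:
        by_location[row[location_field]].append(row[price_field] / row[area_field])
    local_rates = {loc: median(rates) for loc, rates in by_location.items()}
    overall_rate = median(r for rates in by_location.values() for r in rates)
    return local_rates, overall_rate


def predict_local_area(parcel, local_rates, overall_rate, location_field,
                       area_field="square_feet"):
    """Use the parcel's local rate, or the overall rate for unseen locations."""
    rate = local_rates.get(parcel[location_field], overall_rate)
    return rate * parcel[area_field]


sales = [
    {"neighborhood": "north", "sale_price": 300_000, "square_feet": 1_500},  # 200
    {"neighborhood": "north", "sale_price": 440_000, "square_feet": 2_000},  # 220
    {"neighborhood": "south", "sale_price": 120_000, "square_feet": 1_200},  # 100
]
local_rates, overall_rate = fit_local_area(sales, "neighborhood")
north_value = predict_local_area(
    {"neighborhood": "north", "square_feet": 1_000}, local_rates, overall_rate, "neighborhood"
)
east_value = predict_local_area(
    {"neighborhood": "east", "square_feet": 1_000}, local_rates, overall_rate, "neighborhood"
)
```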

run_pass_through()

Generate predictions using a pass-through model (e.g., assessor values).

from openavmkit.modeling import run_pass_through

results = run_pass_through(
    ds,
    model_engine="assessor",
    verbose=True
)

Parameters

  • ds (DataSplit, required) - DataSplit object
  • model_engine (str, required) - Model engine identifier (e.g., “assessor”)
  • verbose (bool, default: False) - Whether to print verbose output

Returns

  • results (SingleModelResults) - Prediction results from the pass-through model

Utility Functions

model_utility_score()

Compute a utility score for a model based on error, median ratio, COD (coefficient of dispersion), and CHD.

from openavmkit.modeling import model_utility_score

score = model_utility_score(model_results, test_set=False)

Parameters

  • model_results (SingleModelResults, required) - SingleModelResults object
  • test_set (bool, default: False) - If True, compute the score using the test set results

Returns

  • score (float) - Computed utility score (lower is better)
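
Two of the ingredients named above, the median ratio and the COD, have standard ratio-study definitions that are easy to show directly. The sketch below computes only those two quantities. It does not reproduce how model_utility_score weights or combines them, which the source does not describe, and the `ratio_stats` helper is hypothetical.

```python
from statistics import median


def ratio_stats(predicted_values, sale_prices):
    """Return (median ratio, COD) for predicted-value / sale-price ratios.

    COD is the average absolute deviation of the ratios from their median,
    expressed as a percentage of that median.
    """
    ratios = [p / s for p, s in zip(predicted_values, sale_prices)]
    med = median(ratios)
    avg_abs_dev = sum(abs(r - med) for r in ratios) / len(ratios)
    cod = 100.0 * avg_abs_dev / med
    return med, cod


# Ratios are 0.9, 1.0, 1.1 and 1.2.
median_ratio, cod = ratio_stats(
    predicted_values=[90_000, 100_000, 110_000, 120_000],
    sale_prices=[100_000, 100_000, 100_000, 100_000],
)
```

A median ratio near 1.0 indicates predictions are centered on market value, and a lower COD indicates more uniform predictions.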

simple_ols()

Perform simple OLS regression with one independent variable.

from openavmkit.modeling import simple_ols

results = simple_ols(
    df,
    ind_var="square_feet",
    dep_var="sale_price",
    intercept=True
)

Parameters

  • df (pd.DataFrame, required) - DataFrame containing the data
  • ind_var (str, required) - Independent variable name
  • dep_var (str, required) - Dependent variable name
  • intercept (bool, default: True) - Whether to include an intercept

Returns

  • results (dict) - Dictionary containing regression results including slope, r2, and other statistics
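
For reference, the closed-form arithmetic behind a one-variable OLS fit looks like the sketch below. The `simple_ols_sketch` helper and the keys of its returned dictionary are hypothetical. They illustrate the math and are not the exact contents of the dictionary that simple_ols returns.

```python
def simple_ols_sketch(x, y, intercept=True):
    """Fit y = b0 + slope * x by ordinary least squares (closed form)."""
    n = len(x)
    y_mean = sum(y) / n
    if intercept:
        x_mean = sum(x) / n
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        slope = sxy / sxx
        b0 = y_mean - slope * x_mean
    else:
        # Regression through the origin: no centering.
        slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
        b0 = 0.0
    ss_res = sum((yi - (b0 + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Centered total sum of squares; some packages use the uncentered form
    # when there is no intercept.
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return {"slope": slope, "intercept": b0, "r2": 1.0 - ss_res / ss_tot}


fit = simple_ols_sketch(
    x=[1_000, 1_500, 2_000],
    y=[150_000, 200_000, 250_000],
)
```

In this toy data every extra square foot adds exactly 100 to the price, so the fit recovers a slope of 100 and an intercept of 50,000.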

simple_mra()

Perform multiple regression analysis with multiple independent variables.

from openavmkit.modeling import simple_mra

results = simple_mra(
    df,
    ind_vars=["square_feet", "bedrooms", "bathrooms"],
    dep_var="sale_price"
)

Parameters

  • df (pd.DataFrame, required) - DataFrame containing the data
  • ind_vars (list[str], required) - List of independent variable names
  • dep_var (str, required) - Dependent variable name

Returns

  • results (dict) - Dictionary containing regression results
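
With several independent variables, the least-squares coefficients solve the normal equations (X'X) b = X'y. The dependency-free sketch below shows that calculation on a tiny dataset. Both helpers are hypothetical, and production code would normally rely on a numerically sturdier solver than hand-written elimination.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear variables?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def multiple_regression(rows, y):
    """Return [b0, b1, ...] for y = b0 + b1*x1 + ... via the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Data generated exactly from y = 10 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [18, 17, 28, 27, 35]
coefs = multiple_regression(rows, y)
```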
