Modeling

The modeling module provides functions for training and evaluating various predictive models including MRA, XGBoost, LightGBM, CatBoost, GWR, and more.

Core Classes

DataSplit

Encapsulates the splitting of data into training, test, and other subsets.

from openavmkit.modeling import DataSplit

ds = DataSplit(
    name="residential",
    df_sales=df_sales,
    df_universe=df_universe,
    model_group="residential",
    settings=settings,
    dep_var="sale_price",
    dep_var_test="sale_price",
    ind_vars=independent_vars,
    categorical_vars=categorical_vars,
    interactions={},
    test_keys=test_keys,
    train_keys=train_keys
)

Attributes

df_sales - Sales data after processing
df_universe - Universe (parcel) data after processing
df_train - Training subset of sales data
df_test - Test subset of sales data
X_train - Feature matrix for the training data
X_test - Feature matrix for the test data
X_univ - Feature matrix for the universe data
y_train - Target array for training
y_test - Target array for testing
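
The relationship between the sales data and the training and test subsets can be pictured with a small, library-independent sketch. It assumes each sale row carries a unique identifier (called `key` here, a hypothetical field name) and that `train_keys` and `test_keys` are lists of those identifiers. The helper `split_by_keys` is illustrative only; the actual DataSplit processing may filter and transform rows differently.

```python
def split_by_keys(rows, train_keys, test_keys, key_field="key"):
    """Partition rows into (train, test) lists by membership of their key.

    `key_field` is a hypothetical column name used for illustration.
    """
    train_set, test_set = set(train_keys), set(test_keys)
    overlap = train_set & test_set
    if overlap:
        # A sale in both subsets would leak test information into training.
        raise ValueError(f"keys appear in both train and test: {sorted(overlap)}")
    train = [r for r in rows if r[key_field] in train_set]
    test = [r for r in rows if r[key_field] in test_set]
    return train, test


sales = [
    {"key": "A1", "sale_price": 200_000},
    {"key": "B2", "sale_price": 310_000},
    {"key": "C3", "sale_price": 150_000},
]
train_rows, test_rows = split_by_keys(sales, ["A1", "C3"], ["B2"])
```

Rejecting overlapping keys up front is a design choice of this sketch, made so that test metrics are not computed on rows the model was trained on.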

SingleModelResults

Container for results from a single prediction model.

Attributes

ds - The DataSplit object used
df_universe - Universe DataFrame with predictions
df_test - Test DataFrame with predictions
df_sales - Sales DataFrame with predictions
model_name - Model name (unique identifier)
model_engine - Model engine (“xgboost”, “mra”, etc.)
model - The fitted model object
pred_test - PredictionResults for the test set
pred_train - PredictionResults for the training set
pred_sales - PredictionResults for the sales set
chd - Calculated CHD (coefficient of horizontal dispersion) value
utility_test - Composite utility score for the test set
utility_train - Composite utility score for the training set

PredictionResults

Container for prediction results and associated performance metrics.

Attributes

dep_var - The dependent variable being predicted
ind_vars - List of independent variables
y - Ground truth values
y_pred - Predicted values
mse - Mean squared error
rmse - Root mean squared error
mape - Mean absolute percent error
r2 - R-squared
adj_r2 - Adjusted R-squared
ratio_study - RatioStudy object
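
For readers who want to see what the error metrics above measure, here is a minimal sketch using the standard textbook definitions. It is not taken from the library's source: the function name `regression_metrics` is hypothetical, MAPE is returned as a fraction rather than a percentage, and the library's own calculations may differ in such details.

```python
import math


def regression_metrics(y, y_pred, n_features):
    """Textbook MSE, RMSE, MAPE, R-squared and adjusted R-squared.

    Assumes no true value is zero (MAPE divides by it) and that
    len(y) > n_features + 1 (adjusted R-squared divides by that margin).
    """
    n = len(y)
    errors = [p - t for t, p in zip(y, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mape = sum(abs(e / t) for e, t in zip(errors, y)) / n
    mean_y = sum(y) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": rmse, "mape": mape, "r2": r2, "adj_r2": adj_r2}


metrics = regression_metrics(
    y=[100.0, 200.0, 300.0, 400.0],
    y_pred=[110.0, 190.0, 310.0, 390.0],
    n_features=1,
)
```

Adjusted R-squared penalizes R-squared for the number of independent variables, which is why it needs `n_features` in addition to the two value arrays.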

Multiple Regression Analysis (MRA)

run_mra()

Train an MRA model and return its prediction results.

from openavmkit.modeling import run_mra

results = run_mra(
    ds,
    intercept=True,
    verbose=True
)

Parameters

ds (DataSplit, required) - DataSplit object
intercept (bool, default: True) - Whether to include an intercept in the model
verbose (bool, default: False) - Whether to print verbose output
model (MRAModel | None, default: None) - Optional pre-trained MRAModel

Returns

results (SingleModelResults) - Prediction results from the MRA model

run_multi_mra()

Train a hierarchical Multi-MRA model and return its prediction results.

from openavmkit.modeling import run_multi_mra

results = run_multi_mra(
    ds,
    outpath="out/models/residential",
    location_fields=["neighborhood", "city"],
    optimize_vars=False,
    intercept=True,
    verbose=True,
    min_sample_size=15
)

Parameters

ds (DataSplit, required) - DataSplit object (sales/universe/splits should already be set up)
outpath (str, required) - Path to write parameters out to
location_fields (list[str], required) - Ordered list of location field names, most specific to least specific
optimize_vars (bool, default: False) - Whether to automatically trim the variable selection to the best-performing subset
intercept (bool, default: True) - Whether to include an intercept column in the regression
verbose (bool, default: False) - If True, print verbose output
min_sample_size (int, default: 15) - Minimum number of observations required to fit a local OLS model

Returns

results (SingleModelResults) - Prediction results from the Multi-MRA model
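
The parameter descriptions suggest that local models are fitted per location, with a fallback from more specific to less specific locations when too few sales are available. The sketch below illustrates one plausible reading of that rule. It is an assumption, not the library's implementation: the helpers `count_sales_by_location` and `pick_location_level` are hypothetical, and the sample place names are invented.

```python
from collections import Counter


def count_sales_by_location(sales, location_fields):
    """Count training sales per value of each location field."""
    return {
        field: Counter(s[field] for s in sales if s.get(field) is not None)
        for field in location_fields
    }


def pick_location_level(row, location_fields, counts, min_sample_size=15):
    """Return (field, value) for the most specific location with enough sales.

    `location_fields` is ordered most specific to least specific. Returns
    None when no level qualifies, leaving the caller to use a global model.
    """
    for field in location_fields:
        value = row.get(field)
        if value is not None and counts[field][value] >= min_sample_size:
            return field, value
    return None


sales = (
    [{"neighborhood": "Oak Hill", "city": "Springfield"}] * 20
    + [{"neighborhood": "Elm Park", "city": "Springfield"}] * 5
)
fields = ["neighborhood", "city"]
counts = count_sales_by_location(sales, fields)

level_dense = pick_location_level(
    {"neighborhood": "Oak Hill", "city": "Springfield"}, fields, counts
)
level_sparse = pick_location_level(
    {"neighborhood": "Elm Park", "city": "Springfield"}, fields, counts
)
level_unknown = pick_location_level(
    {"neighborhood": "Nowhere", "city": "Shelbyville"}, fields, counts
)
```

In this example a parcel in the well-sampled neighborhood uses its neighborhood model, a parcel in the sparse neighborhood falls back to the city level, and a parcel in an unseen location gets no local model at all.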

Tree-Based Models

run_xgboost()

Train an XGBoost model and return its prediction results.

from openavmkit.modeling import run_xgboost

results = run_xgboost(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True,
    params={
        "max_depth": 6,
        "learning_rate": 0.1,
        "n_estimators": 100,
        "objective": "reg:squarederror"
    }
)

Parameters

ds (DataSplit, required) - DataSplit object
outpath (str, required) - Path to save/load parameters
use_saved_params (bool, default: True) - Whether to use saved parameters if available
save_params (bool, default: True) - Whether to save parameters after training
tune_params (bool, default: False) - Whether to perform hyperparameter tuning
verbose (bool, default: False) - Whether to print verbose output
params (dict, default: None) - Optional dictionary of hyperparameters

Returns

results (SingleModelResults) - Prediction results from the XGBoost model

run_lightgbm()

Train a LightGBM model and return its prediction results.

from openavmkit.modeling import run_lightgbm

results = run_lightgbm(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True,
    params={
        "num_leaves": 31,
        "learning_rate": 0.1,
        "n_estimators": 100,
        "objective": "regression"
    }
)

Parameters

ds (DataSplit, required) - DataSplit object
outpath (str, required) - Path to save/load parameters
use_saved_params (bool, default: True) - Whether to use saved parameters if available
save_params (bool, default: True) - Whether to save parameters after training
tune_params (bool, default: False) - Whether to perform hyperparameter tuning
verbose (bool, default: False) - Whether to print verbose output
params (dict, default: None) - Optional dictionary of hyperparameters

Returns

results (SingleModelResults) - Prediction results from the LightGBM model

run_catboost()

Train a CatBoost model and return its prediction results.

from openavmkit.modeling import run_catboost

results = run_catboost(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True,
    params={
        "depth": 6,
        "learning_rate": 0.1,
        "iterations": 100,
        "loss_function": "RMSE"
    }
)

Parameters

ds (DataSplit, required) - DataSplit object
outpath (str, required) - Path to save/load parameters
use_saved_params (bool, default: True) - Whether to use saved parameters if available
save_params (bool, default: True) - Whether to save parameters after training
tune_params (bool, default: False) - Whether to perform hyperparameter tuning
verbose (bool, default: False) - Whether to print verbose output
params (dict, default: None) - Optional dictionary of hyperparameters

Returns

results (SingleModelResults) - Prediction results from the CatBoost model

Geographically Weighted Regression (GWR)

run_gwr()

Train a GWR model and return its prediction results.

from openavmkit.modeling import run_gwr

results = run_gwr(
    ds,
    outpath="out/models/residential",
    use_saved_params=True,
    save_params=True,
    tune_params=False,
    verbose=True
)

Parameters

ds (DataSplit, required) - DataSplit object
outpath (str, required) - Path to save/load parameters
use_saved_params (bool, default: True) - Whether to use saved bandwidth if available
save_params (bool, default: True) - Whether to save bandwidth after training
tune_params (bool, default: False) - Whether to perform bandwidth selection
verbose (bool, default: False) - Whether to print verbose output

Returns

results (SingleModelResults) - Prediction results from the GWR model
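
The bandwidth mentioned above controls how quickly a sale's influence fades with distance from the parcel being valued. As a rough illustration, the sketch below computes weights with a Gaussian kernel, one common choice in the GWR literature. The kernel and bandwidth type actually used by run_gwr() are not stated on this page, so treat both the kernel and the hypothetical function `gaussian_kernel_weights` as assumptions.

```python
import math


def gaussian_kernel_weights(target, points, bandwidth):
    """Weight each (x, y) point by a Gaussian of its distance to target.

    A larger bandwidth gives distant points more influence, so the local
    regression behaves more like a single global regression.
    """
    weights = []
    for x, y in points:
        distance = math.hypot(x - target[0], y - target[1])
        weights.append(math.exp(-0.5 * (distance / bandwidth) ** 2))
    return weights


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
narrow = gaussian_kernel_weights((0.0, 0.0), points, bandwidth=1.0)
wide = gaussian_kernel_weights((0.0, 0.0), points, bandwidth=10.0)
```

Comparing `narrow` and `wide` shows the trade-off that bandwidth selection has to resolve: a narrow bandwidth nearly ignores the point five units away, while a wide one still gives it substantial weight.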

Spatial Models

run_spatial_lag()

Train a spatial lag model and return its prediction results.

from openavmkit.modeling import run_spatial_lag

results = run_spatial_lag(
    ds,
    verbose=True
)

Parameters

ds (DataSplit, required) - DataSplit object
verbose (bool, default: False) - Whether to print verbose output

Returns

results (SingleModelResults) - Prediction results from the spatial lag model

Baseline Models

run_average()

Train an average model (baseline) and return its prediction results.

from openavmkit.modeling import run_average

results = run_average(
    ds,
    verbose=True
)

Parameters

ds (DataSplit, required) - DataSplit object
verbose (bool, default: False) - Whether to print verbose output

Returns

results (SingleModelResults) - Prediction results from the average model

run_naive_area()

Train a naive area model (baseline using simple $/sqft) and return its prediction results.

from openavmkit.modeling import run_naive_area

results = run_naive_area(
    ds,
    verbose=True
)

Parameters

ds (DataSplit, required) - DataSplit object
verbose (bool, default: False) - Whether to print verbose output

Returns

results (SingleModelResults) - Prediction results from the naive area model
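
To make the idea of a simple $/sqft baseline concrete, here is a minimal sketch that learns one rate from training sales and multiplies it by each parcel's area. Using the median rate is an assumption of this sketch; the page does not say which statistic run_naive_area() uses. The function names and the `square_feet` field are illustrative.

```python
from statistics import median


def fit_naive_area(train_rows, price_field="sale_price", area_field="square_feet"):
    """Return a single price-per-area rate: the median over training sales."""
    rates = [
        r[price_field] / r[area_field]
        for r in train_rows
        if r[area_field] > 0          # skip rows where the rate is undefined
    ]
    return median(rates)


def predict_naive_area(rows, rate, area_field="square_feet"):
    """Predict value as rate times area for every row."""
    return [rate * r[area_field] for r in rows]


train = [
    {"sale_price": 200_000, "square_feet": 1_000},   # 200 $/sqft
    {"sale_price": 300_000, "square_feet": 2_000},   # 150 $/sqft
    {"sale_price": 450_000, "square_feet": 1_500},   # 300 $/sqft
]
rate = fit_naive_area(train)
predictions = predict_naive_area([{"square_feet": 1_200}], rate)
```

A baseline like this ignores every characteristic except size, which is what makes it a useful floor for judging more elaborate models.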

run_local_area()

Train a local area model (location-based $/sqft) and return its prediction results.

from openavmkit.modeling import run_local_area

results = run_local_area(
    ds,
    location_field="neighborhood",
    verbose=True
)

Parameters

ds (DataSplit, required) - DataSplit object
location_field (str, required) - Field name to use for location grouping
verbose (bool, default: False) - Whether to print verbose output

Returns

results (SingleModelResults) - Prediction results from the local area model
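
A location-based $/sqft model can be sketched as one rate per location value, with an overall rate for locations that have no training sales. The median statistic, the fallback rule, and the helper names below are all assumptions made for illustration; run_local_area() may handle these cases differently.

```python
from collections import defaultdict
from statistics import median


def fit_local_area(train_rows, location_field,
                   price_field="sale_price", area_field="square_feet"):
    """Return ({location: median rate}, overall median rate)."""
    rates_by_location = defaultdict(list)
    all_rates = []
    for r in train_rows:
        if r[area_field] <= 0:
            continue                   # rate is undefined without a positive area
        rate = r[price_field] / r[area_field]
        rates_by_location[r[location_field]].append(rate)
        all_rates.append(rate)
    local_rates = {loc: median(v) for loc, v in rates_by_location.items()}
    return local_rates, median(all_rates)


def predict_local_area(rows, local_rates, overall_rate, location_field,
                       area_field="square_feet"):
    """Use the location's rate when known, otherwise the overall rate."""
    return [
        local_rates.get(r[location_field], overall_rate) * r[area_field]
        for r in rows
    ]


train = [
    {"neighborhood": "Oak Hill", "sale_price": 400_000, "square_feet": 2_000},
    {"neighborhood": "Oak Hill", "sale_price": 300_000, "square_feet": 1_000},
    {"neighborhood": "Elm Park", "sale_price": 100_000, "square_feet": 1_000},
]
local_rates, overall_rate = fit_local_area(train, "neighborhood")
preds = predict_local_area(
    [
        {"neighborhood": "Oak Hill", "square_feet": 1_000},
        {"neighborhood": "Unknown", "square_feet": 1_000},
    ],
    local_rates,
    overall_rate,
    "neighborhood",
)
```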

run_pass_through()

Generate predictions using a pass-through model (e.g., assessor values).

from openavmkit.modeling import run_pass_through

results = run_pass_through(
    ds,
    model_engine="assessor",
    verbose=True
)

Parameters

ds (DataSplit, required) - DataSplit object
model_engine (str, required) - Model engine identifier (e.g., "assessor")
verbose (bool, default: False) - Whether to print verbose output

Returns

results (SingleModelResults) - Prediction results from the pass-through model

Utility Functions

model_utility_score()

Compute a utility score for a model based on error, median ratio, COD, and CHD.

from openavmkit.modeling import model_utility_score

score = model_utility_score(model_results, test_set=False)

Parameters

model_results (SingleModelResults, required) - SingleModelResults object
test_set (bool, default: False) - If True, compute the score using the test set results

Returns

score (float) - Computed utility score (lower is better)
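
Two of the inputs named above, the median ratio and the COD (coefficient of dispersion), follow standard ratio-study definitions and can be sketched in a few lines. This example shows only those two inputs. How model_utility_score() weights and combines them with error and CHD is not described on this page, and the function `ratio_stats` is hypothetical.

```python
from statistics import median


def ratio_stats(predictions, sale_prices):
    """Return (median ratio, COD) for predicted values against sale prices.

    COD is the average absolute deviation of the ratios from their median,
    expressed as a percentage of that median.
    """
    ratios = [p / s for p, s in zip(predictions, sale_prices)]
    median_ratio = median(ratios)
    mean_abs_dev = sum(abs(r - median_ratio) for r in ratios) / len(ratios)
    cod = 100.0 * mean_abs_dev / median_ratio
    return median_ratio, cod


median_ratio, cod = ratio_stats(
    predictions=[90_000, 100_000, 110_000],
    sale_prices=[100_000, 100_000, 100_000],
)
```

A median ratio near 1.0 indicates predictions are centered on sale prices, while a lower COD indicates the ratios are more uniform across properties.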

simple_ols()

Perform simple OLS regression with one independent variable.

from openavmkit.modeling import simple_ols

results = simple_ols(
    df,
    ind_var="square_feet",
    dep_var="sale_price",
    intercept=True
)

Parameters

df (pd.DataFrame, required) - DataFrame containing the data
ind_var (str, required) - Independent variable name
dep_var (str, required) - Dependent variable name
intercept (bool, default: True) - Whether to include an intercept

Returns

results (dict) - Dictionary containing regression results including slope, r2, and other statistics
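
For reference, one-variable OLS has a closed-form solution, sketched below without any dependencies. This is the textbook calculation, not the library's code: the function `ols_one_variable` is hypothetical, and the `intercept` key in the returned dictionary is an assumption (the page only names slope and r2).

```python
def ols_one_variable(x, y, intercept=True):
    """Fit y = b0 + slope * x by least squares and report R-squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    if intercept:
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        b0 = mean_y - slope * mean_x
    else:
        # Regression through the origin: no centering of x or y.
        slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
        b0 = 0.0
    ss_res = sum((yi - (b0 + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Centered R-squared is used in both cases, a simplification of this sketch.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {"slope": slope, "intercept": b0, "r2": 1.0 - ss_res / ss_tot}


fit = ols_one_variable(
    x=[1_000, 1_500, 2_000],             # square feet
    y=[210_000, 310_000, 410_000],       # sale price
)
```

The sample data lie exactly on the line price = 10,000 + 200 x area, so the fit recovers that slope and intercept with an R-squared of 1.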

simple_mra()

Perform multiple regression analysis with multiple independent variables.

from openavmkit.modeling import simple_mra

results = simple_mra(
    df,
    ind_vars=["square_feet", "bedrooms", "bathrooms"],
    dep_var="sale_price"
)

Parameters

df (pd.DataFrame, required) - DataFrame containing the data
ind_vars (list[str], required) - List of independent variable names
dep_var (str, required) - Dependent variable name

Returns

results (dict) - Dictionary containing regression results
