Modeling

OpenAVM Kit provides a comprehensive modeling framework that supports multiple algorithms, automated hyperparameter tuning, and ensemble methods. This guide covers the complete modeling workflow from experimentation to finalization.

Modeling Workflow

The modeling process consists of three phases:

Experimentation: Test variables and models quickly without saving results.

Outlier review: Identify outliers and adjust model configuration.

Finalization: Train final models and generate predictions for all parcels.

Supported Algorithms

OpenAVM Kit supports these model types:

XGBoost

Gradient boosting with tree-based learners. Excellent for tabular data with complex interactions.

LightGBM

Fast gradient boosting optimized for speed and memory efficiency.

CatBoost

Gradient boosting with native categorical feature support.

GWR

Geographically Weighted Regression for spatial variation modeling.
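
To make the idea behind GWR concrete, here is a minimal sketch in plain Python. It is not OpenAVM Kit's implementation: it fits a separate weighted least-squares line at a chosen location, using a Gaussian kernel so that nearby sales count more than distant ones. The helper name `local_fit`, the single predictor, the toy data, and the bandwidth value are all assumptions made for illustration.

```python
import math

def local_fit(target, coords, x, y, bandwidth):
    """Fit y = a + b*x at `target`, weighting each sale by distance.

    Hypothetical helper for illustration only; a full GWR handles
    many predictors and a calibrated bandwidth.
    """
    # Gaussian kernel: weight decays with distance from the target point
    w = [math.exp(-0.5 * (math.dist(target, c) / bandwidth) ** 2) for c in coords]
    total = sum(w)
    # Weighted means of predictor and response
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted least-squares slope and intercept
    cov = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    var = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = cov / var
    return y_bar - slope * x_bar, slope

# Toy data: price per square foot is 100 in the west, 200 in the east
coords = [(0, 0), (0, 1), (1, 0), (100, 0), (100, 1), (101, 0)]
sqft = [1000, 1500, 2000, 1000, 1500, 2000]
price = [100 * s for s in sqft[:3]] + [200 * s for s in sqft[3:]]

_, west_slope = local_fit((0, 0), coords, sqft, price, bandwidth=5)
_, east_slope = local_fit((100, 0), coords, sqft, price, bandwidth=5)
```

With a small bandwidth, each location recovers its own price-per-square-foot relationship. With a very large bandwidth, every sale gets nearly equal weight and both locations receive essentially the same pooled line.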

Trying Models

Use try_models() for rapid experimentation:

03-model.ipynb:244-256

try_models(
    sup=sales_univ_pair,
    settings=load_settings(),
    save_params=True,
    verbose=verbose,
    run_main=True,
    run_vacant=False,
    run_hedonic=False,
    run_ensemble=True,
    do_shaps=False,
    do_plots=True
)

Parameters Explained

sup (SalesUniversePair, required): Your cleaned data from the previous notebook.

save_params (bool, default: True): Save hyperparameters for later use. Enables faster re-runs.

use_saved_params (bool, default: True): Load previously saved hyperparameters instead of re-tuning.

run_main (bool, default: True): Run models predicting full market value.

run_vacant (bool, default: True): Run separate models for vacant land using only vacant sales.

run_hedonic (bool, default: True): Run hedonic models that predict land and improvement values separately.

run_ensemble (bool, default: True): Combine multiple models into a weighted ensemble.

do_shaps (bool, default: False): Generate SHAP (SHapley Additive exPlanations) values for model interpretability.

do_plots (bool, default: False): Create scatter plots comparing predictions to actual sales.

Model Types

Main Models

Predict full market value for all property types:

modeling.py

def try_models(sup, settings, run_main=True, **kwargs):  # other flags omitted
    if run_main:
        # Trains on all sales (improved + vacant)
        # Predicts total property value
        ...

Use case: General property valuation

Vacant Models

Predict land value using only vacant land sales:

modeling.py

def try_models(sup, settings, run_vacant=True, **kwargs):  # other flags omitted
    if run_vacant:
        # Trains only on vacant sales
        # Predicts land value
        ...

Use case: Land-only valuations, agricultural properties

Hedonic Models

Separate land value from improvement value:

modeling.py

def try_models(sup, settings, run_hedonic=True, **kwargs):  # other flags omitted
    if run_hedonic:
        # Removes improvement characteristics
        # Predicts land value as if vacant
        # Derives improvement value by subtraction
        ...

Use case: Tax allocation, partial valuations

Hedonic models help allocate value between land and improvements, which is required in many tax jurisdictions.
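
The subtraction step described above can be sketched in a few lines. This is an illustrative helper, not code from OpenAVM Kit: the name `allocate_value` is hypothetical, and clamping the land value to the range between zero and the total is an assumption added so that neither component goes negative.

```python
def allocate_value(total_value: float, land_value_as_vacant: float) -> tuple[float, float]:
    """Split a total value into (land, improvement) by subtraction.

    Hypothetical helper; the clamp is an assumption so that neither
    component is negative or exceeds the total.
    """
    land = min(max(land_value_as_vacant, 0.0), total_value)
    improvement = total_value - land
    return land, improvement

# A parcel valued at 300,000 in total, with land predicted at 90,000 as if vacant
land, improvement = allocate_value(300_000.0, 90_000.0)
```

Here the land receives 90,000 and the improvement receives the remaining 210,000. If the as-if-vacant prediction exceeded the total, this sketch would assign the whole value to land and zero to the improvement.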

Configuration

Configure models in your settings.json:

settings.json

{
  "modeling": {
    "model_groups": {
      "single_family": {
        "dep_var": "sale_price",
        "models": {
          "xgboost": {
            "enabled": true,
            "params": {
              "max_depth": 6,
              "learning_rate": 0.1,
              "n_estimators": 100
            }
          },
          "lightgbm": {
            "enabled": true,
            "params": {
              "num_leaves": 31,
              "learning_rate": 0.1
            }
          },
          "catboost": {
            "enabled": true,
            "params": {
              "depth": 6,
              "learning_rate": 0.1
            }
          }
        },
        "ensemble": {
          "enabled": true,
          "method": "weighted_average"
        }
      }
    }
  }
}
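
As a quick way to check a configuration shaped like the one above, the following sketch parses a trimmed copy with the standard `json` module and lists the models that are switched on for a model group. The helper `enabled_models` is hypothetical, and OpenAVM Kit's own `load_settings()` may resolve settings differently.

```python
import json

# Trimmed copy of the structure shown above, with one model disabled
settings_text = """
{
  "modeling": {
    "model_groups": {
      "single_family": {
        "dep_var": "sale_price",
        "models": {
          "xgboost": {"enabled": true, "params": {"max_depth": 6}},
          "lightgbm": {"enabled": false, "params": {"num_leaves": 31}}
        }
      }
    }
  }
}
"""

def enabled_models(settings: dict, model_group: str) -> list[str]:
    """Return the names of models with "enabled": true for one model group."""
    group = settings["modeling"]["model_groups"][model_group]
    return [name for name, cfg in group["models"].items() if cfg.get("enabled", False)]

settings = json.loads(settings_text)
names = enabled_models(settings, "single_family")
```

In this trimmed example only `xgboost` is returned, because `lightgbm` has `"enabled": false`.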

Hyperparameter Tuning

OpenAVM Kit automatically tunes hyperparameters using cross-validation:

tuning.py

def _tune_xgboost(ds: DataSplit, settings: dict):
    param_grid = {
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3],
        'n_estimators': [50, 100, 200],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }
    # Grid search with cross-validation

tuning.py

def _tune_lightgbm(ds: DataSplit, settings: dict):
    param_grid = {
        'num_leaves': [15, 31, 63],
        'learning_rate': [0.01, 0.1],
        'n_estimators': [50, 100, 200],
        'min_child_samples': [20, 50, 100]
    }

tuning.py

def _tune_catboost(ds: DataSplit, settings: dict):
    param_grid = {
        'depth': [4, 6, 8],
        'learning_rate': [0.01, 0.1],
        'iterations': [100, 200, 500],
        'l2_leaf_reg': [1, 3, 5]
    }
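
To see why tuning can be slow, it helps to count how many candidates a grid contains. The sketch below assumes an exhaustive grid search, as the comment in `_tune_xgboost` suggests, and expands the XGBoost grid shown above with `itertools.product`. The helper `expand_grid` and the fold count are assumptions for illustration.

```python
from itertools import product

xgboost_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [50, 100, 200],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

def expand_grid(param_grid: dict) -> list[dict]:
    """Enumerate every combination of values in a parameter grid."""
    keys = list(param_grid)
    return [dict(zip(keys, values)) for values in product(*param_grid.values())]

candidates = expand_grid(xgboost_grid)
n_folds = 5  # assumed fold count, for illustration only
n_fits = len(candidates) * n_folds
```

The grid has 3 × 3 × 3 × 2 × 2 = 108 combinations, so with five folds an exhaustive search would train 540 models for XGBoost alone.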

Set save_params=True and use_saved_params=True to avoid re-tuning on every run. On large datasets this can save hours during development.
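
The effect of these two flags can be pictured as a simple load-or-tune cache. The sketch below is an assumption about the general pattern, not OpenAVM Kit's actual storage format or file layout: `get_params`, the JSON file, and `fake_tune` are all hypothetical names.

```python
import json
import tempfile
from pathlib import Path

def get_params(path: Path, tune, save_params=True, use_saved_params=True) -> dict:
    """Return hyperparameters, reusing a saved JSON file when allowed.

    `tune` is any zero-argument callable that runs the expensive search.
    """
    if use_saved_params and path.exists():
        return json.loads(path.read_text())
    params = tune()
    if save_params:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(params))
    return params

calls = []

def fake_tune():
    # Stand-in for a slow search; records how often it runs
    calls.append(1)
    return {"max_depth": 5, "learning_rate": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "params" / "xgboost.json"
    first = get_params(cache, fake_tune)   # runs the search and saves the result
    second = get_params(cache, fake_tune)  # loads the saved result instead
```

The second call returns the same parameters without invoking the search again, which is the behavior the two flags are meant to provide.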

Ensemble Models

Combine multiple models for better accuracy:

modeling.py

def create_ensemble(models: list, X, weights: list = None):
    """Weighted average of model predictions on the feature matrix X.

    Default weights are based on test set performance.
    """
    if weights is None:
        # Weight by inverse RMSE, so more accurate models count more
        weights = [1 / model.rmse for model in models]

    # Weighted sum of each model's predictions, normalized by total weight
    predictions = sum(w * m.predict(X) for w, m in zip(weights, models))
    return predictions / sum(weights)

Benefits:

Reduces overfitting
Smooths individual model quirks
Often outperforms single best model
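
A small worked example shows how inverse-RMSE weighting behaves. The RMSE values and predictions below are made up, and the helper names `inverse_rmse_weights` and `blend` are hypothetical; the arithmetic simply follows the weighting rule described above.

```python
def inverse_rmse_weights(rmses: list[float]) -> list[float]:
    """Normalized weights proportional to 1 / RMSE."""
    raw = [1.0 / r for r in rmses]
    total = sum(raw)
    return [w / total for w in raw]

def blend(predictions: list[float], weights: list[float]) -> float:
    """Weighted average of one parcel's predictions across models."""
    return sum(w * p for w, p in zip(weights, predictions))

rmses = [20_000.0, 25_000.0, 40_000.0]      # made-up test-set RMSE per model
preds = [300_000.0, 310_000.0, 280_000.0]   # made-up predictions for one parcel

weights = inverse_rmse_weights(rmses)
ensemble_value = blend(preds, weights)
```

The weights come out to roughly 0.435, 0.348, and 0.217, so the most accurate model has the largest say, and the blended value is about 299,130.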

Identifying Outliers

After trying models, analyze prediction errors:

03-model.ipynb:278-281

identify_outliers(
    sup=sales_univ_pair,
    settings=load_settings()
)

This generates CSV files in out/models/{model_group}/{model_type}/{model_name}/ with:

outliers.csv: Sales with prediction ratios outside the 0.75-1.25 range
pred_sales.csv: All predictions on sales
pred_universe.csv: Predictions for all parcels

Review outliers to:

Identify invalid sales that slipped through scrutiny
Discover missing variables
Understand model limitations
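
The ratio filter behind outliers.csv can be sketched as follows. This assumes the prediction ratio is the predicted value divided by the sale price; the field names `key`, `sale_price`, and `prediction` and the helper `flag_outliers` are hypothetical and may not match the actual CSV columns.

```python
def flag_outliers(sales: list[dict], low: float = 0.75, high: float = 1.25) -> list[dict]:
    """Return sales whose prediction-to-price ratio is outside [low, high]."""
    flagged = []
    for sale in sales:
        ratio = sale["prediction"] / sale["sale_price"]
        if ratio < low or ratio > high:
            flagged.append({**sale, "ratio": round(ratio, 3)})
    return flagged

sales = [
    {"key": "A", "sale_price": 200_000, "prediction": 210_000},  # ratio 1.05
    {"key": "B", "sale_price": 100_000, "prediction": 140_000},  # ratio 1.40
    {"key": "C", "sale_price": 400_000, "prediction": 280_000},  # ratio 0.70
]
outliers = flag_outliers(sales)
```

Sales B and C fall outside the range and would be the ones to review by hand; sale A is within tolerance and is not flagged.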

Finalizing Models

Once satisfied with model performance, finalize to generate production predictions:

03-model.ipynb:301-309

results = from_checkpoint("3-model-02-finalize-models", finalize_models,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "save_params": True,
        "use_saved_params": True,
        "verbose": verbose
    }
)
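
The call shape above, a checkpoint name followed by a function and its arguments, suggests a run-once-then-reload pattern. The sketch below illustrates that general pattern only. It is an assumption, not the implementation of `from_checkpoint`: the helper `run_with_checkpoint`, the pickle file naming, and the `folder` argument are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def run_with_checkpoint(name: str, func, params: dict, folder: Path):
    """Run func(**params) once and reuse the pickled result afterwards."""
    path = folder / f"{name}.pickle"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = func(**params)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result

runs = []

def slow_step(x: int) -> int:
    runs.append(x)  # track how many times the real work happens
    return x * 2

with tempfile.TemporaryDirectory() as tmp:
    a = run_with_checkpoint("demo", slow_step, {"x": 21}, Path(tmp))
    b = run_with_checkpoint("demo", slow_step, {"x": 21}, Path(tmp))
```

Under this pattern, re-running a notebook cell returns the stored result instead of repeating an expensive step, so a stale checkpoint has to be deleted when inputs change.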

What `finalize_models()` Does

pipeline.py

def finalize_models(sup, settings, save_params=True, use_saved_params=True, verbose=False):
    """Train final models and save all results.
    
    - Trains on full training set
    - Generates predictions for all parcels
    - Saves models, predictions, and performance metrics
    - Creates assessment quality reports
    """

Outputs:

Trained model objects (.pkl files)
Predictions for universe and sales
Performance statistics (COD, PRD, PRB)
SHAP values (if enabled)
Scatter plots and visualizations

Model Evaluation Metrics

OpenAVM Kit calculates multiple performance metrics:

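COD (float): Coefficient of Dispersion. Measures horizontal equity, that is, whether similar properties are valued similarly. Lower is better. Target: < 15% for residential, < 20% for other.

PRD (float): Price-Related Differential. Detects vertical inequity, that is, systematic over- or under-valuation of high-value properties relative to low-value ones. Values above the target range indicate regressivity; values below it indicate progressivity. Target: 0.98 - 1.03.

PRB (float): Price-Related Bias. Regression-based vertical equity measure. Target: -0.05 to 0.05.

R² (float): R-squared. Proportion of variance explained. Higher is better (0.0 - 1.0).

RMSE (float): Root Mean Square Error. Square root of the mean squared prediction error, expressed in the same units as the dependent variable; large errors are penalized more heavily than small ones. Lower is better.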
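
For readers who want to check the ratio statistics by hand, here is a minimal sketch of COD and PRD using their standard definitions and only the standard library. The function names and the sample values are illustrative and are not taken from OpenAVM Kit's code; PRB is omitted because it requires a regression.

```python
from statistics import mean, median

def cod(predictions: list[float], prices: list[float]) -> float:
    """Coefficient of Dispersion, in percent."""
    ratios = [p / s for p, s in zip(predictions, prices)]
    med = median(ratios)
    # Average absolute deviation from the median ratio, relative to the median
    return 100.0 * mean([abs(r - med) for r in ratios]) / med

def prd(predictions: list[float], prices: list[float]) -> float:
    """Price-Related Differential: mean ratio over sale-weighted mean ratio."""
    ratios = [p / s for p, s in zip(predictions, prices)]
    return mean(ratios) / (sum(predictions) / sum(prices))

prices = [100_000.0, 200_000.0, 300_000.0, 400_000.0]   # made-up sale prices
preds = [110_000.0, 190_000.0, 300_000.0, 380_000.0]    # made-up predictions

cod_value = cod(preds, prices)
prd_value = prd(preds, prices)
```

For this sample the COD is about 5.1 and the PRD is about 1.02, both inside the target ranges listed above.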

Output Files

Finalized models produce:

out/models/
└── {model_group}/
    ├── main/
    │   ├── xgboost/
    │   │   ├── model.pkl
    │   │   ├── pred_sales.csv
    │   │   ├── pred_universe.csv
    │   │   ├── metrics.json
    │   │   └── shap_values.csv
    │   ├── lightgbm/
    │   └── ensemble/
    ├── vacant/
    └── hedonic_land/

Best Practices

Start simple

Begin with a single model type on one model group before expanding to multiple algorithms and property types.

Monitor training time

Hyperparameter tuning can take hours for large datasets. Use save_params=True to avoid repeated tuning.

Check for overfitting

If training metrics are much better than test metrics, your model is overfitting. Reduce model complexity or add regularization.

Validate spatially

Load predictions in GIS software to check for spatial patterns in errors. Systematic geographic bias indicates missing location variables.

Use ensembles

Ensemble models typically outperform individual models and are more robust to data quirks.

Next Steps

Jupyter Notebooks

Configuration Reference
