OpenAVM Kit provides a comprehensive modeling framework that supports multiple algorithms, automated hyperparameter tuning, and ensemble methods. This guide covers the complete modeling workflow from experimentation to finalization.

Modeling Workflow

The modeling process consists of three phases:

1. Experimentation: Test variables and models quickly without saving results
2. Refinement: Identify outliers and adjust model configuration
3. Finalization: Train final models and generate predictions for all parcels

Supported Algorithms

OpenAVM Kit supports these model types:

XGBoost

Gradient boosting with tree-based learners. Excellent for tabular data with complex interactions.

LightGBM

Fast gradient boosting optimized for speed and memory efficiency.

CatBoost

Gradient boosting with native categorical feature support.

GWR

Geographically Weighted Regression for spatial variation modeling.

Trying Models

Use try_models() for rapid experimentation:
03-model.ipynb:244-256
try_models(
    sup=sales_univ_pair,
    settings=load_settings(),
    save_params=True,
    verbose=verbose,
    run_main=True,
    run_vacant=False,
    run_hedonic=False,
    run_ensemble=True,
    do_shaps=False,
    do_plots=True
)

Parameters Explained

  • sup (SalesUniversePair, required): Your cleaned data from the previous notebook
  • save_params (bool, default True): Save hyperparameters for later use. Enables faster re-runs.
  • use_saved_params (bool, default True): Load previously saved hyperparameters instead of re-tuning
  • run_main (bool, default True): Run models predicting full market value
  • run_vacant (bool, default True): Run separate models for vacant land using only vacant sales
  • run_hedonic (bool, default True): Run hedonic models that predict land and improvement values separately
  • run_ensemble (bool, default True): Combine multiple models into a weighted ensemble
  • do_shaps (bool, default False): Generate SHAP (SHapley Additive exPlanations) values for model interpretability
  • do_plots (bool, default False): Create scatter plots comparing predictions to actual sales
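
The caching behavior behind save_params and use_saved_params can be illustrated with a plain JSON file. This is a sketch only; get_params and the file layout are hypothetical stand-ins, since the kit's actual parameter storage is internal:

```python
import json
import os
import tempfile

def get_params(path, tune_fn, use_saved=True):
    """Load cached hyperparameters when available; otherwise tune and save."""
    if use_saved and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    params = tune_fn()  # the expensive tuning step
    with open(path, "w") as f:
        json.dump(params, f)
    return params

# First call tunes and caches; the second call reads the cache instead.
path = os.path.join(tempfile.mkdtemp(), "params.json")
calls = []
def tune():
    calls.append(1)
    return {"max_depth": 6}

first = get_params(path, tune)
second = get_params(path, tune)
print(first == second, len(calls))  # → True 1
```

This is why enabling both flags makes re-runs dramatically faster: tuning happens once, then every subsequent run is a cheap file read.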

Model Types

Main Models

Predict full market value for all property types:
modeling.py
def try_models(sup, settings, run_main=True, ...):
    if run_main:
        # Trains on all sales (improved + vacant)
        # Predicts total property value
Use case: General property valuation

Vacant Models

Predict land value using only vacant land sales:
modeling.py
def try_models(sup, settings, run_vacant=True, ...):
    if run_vacant:
        # Trains only on vacant sales
        # Predicts land value
Use case: Land-only valuations, agricultural properties

Hedonic Models

Separate land value from improvement value:
modeling.py
def try_models(sup, settings, run_hedonic=True, ...):
    if run_hedonic:
        # Removes improvement characteristics
        # Predicts land value as if vacant
        # Derives improvement value by subtraction
Use case: Tax allocation, partial valuations
Hedonic models help allocate value between land and improvements, which is required in many tax jurisdictions.
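
The subtraction step is simple arithmetic; with illustrative numbers (not taken from the kit):

```python
# Illustrative hedonic split: improvement value is derived by subtraction.
total_value = 300_000  # full-market prediction from the main model
land_value = 80_000    # land-as-if-vacant prediction from the hedonic model
improvement_value = total_value - land_value
print(improvement_value)  # → 220000
```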

Configuration

Configure models in your settings.json:
settings.json
{
  "modeling": {
    "model_groups": {
      "single_family": {
        "dep_var": "sale_price",
        "models": {
          "xgboost": {
            "enabled": true,
            "params": {
              "max_depth": 6,
              "learning_rate": 0.1,
              "n_estimators": 100
            }
          },
          "lightgbm": {
            "enabled": true,
            "params": {
              "num_leaves": 31,
              "learning_rate": 0.1
            }
          },
          "catboost": {
            "enabled": true,
            "params": {
              "depth": 6,
              "learning_rate": 0.1
            }
          }
        },
        "ensemble": {
          "enabled": true,
          "method": "weighted_average"
        }
      }
    }
  }
}
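
A quick way to sanity-check which models a group will actually run is to read the settings yourself. The structure mirrors the settings.json above; the accessor paths here are an assumption about hand-rolled inspection, not a kit API:

```python
import json

# Same shape as the settings.json example above, with catboost disabled.
settings = json.loads("""{
  "modeling": {"model_groups": {"single_family": {
    "models": {
      "xgboost":  {"enabled": true,  "params": {"max_depth": 6}},
      "lightgbm": {"enabled": true,  "params": {"num_leaves": 31}},
      "catboost": {"enabled": false, "params": {"depth": 6}}
    }}}}
}""")

models = settings["modeling"]["model_groups"]["single_family"]["models"]
enabled = [name for name, cfg in models.items() if cfg["enabled"]]
print(enabled)  # → ['xgboost', 'lightgbm']
```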

Hyperparameter Tuning

OpenAVM Kit automatically tunes hyperparameters using cross-validation:
tuning.py
def _tune_xgboost(ds: DataSplit, settings: dict):
    param_grid = {
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3],
        'n_estimators': [50, 100, 200],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }
    # Grid search with cross-validation
Set save_params=True and use_saved_params=True to avoid re-tuning on every run. This saves hours during development.
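
The essence of the grid search above can be sketched in a few lines of plain Python. Here cv_score is a stand-in for the kit's cross-validation scoring (lower is better); the real tuner evaluates each combination against held-out folds:

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}

def cv_score(params):
    # Stand-in for k-fold cross-validation; returns an error to minimize.
    return abs(params["max_depth"] - 5) + abs(params["learning_rate"] - 0.1)

# Evaluate every combination in the grid and keep the lowest-error one.
keys = list(param_grid)
best = min(
    (dict(zip(keys, combo)) for combo in product(*param_grid.values())),
    key=cv_score,
)
print(best)  # → {'max_depth': 5, 'learning_rate': 0.1}
```

Note the combinatorics: even this small grid is 3 × 3 = 9 fits per fold, which is why caching tuned parameters matters.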

Ensemble Models

Combine multiple models for better accuracy:
modeling.py
def create_ensemble(models: list, X, weights: list = None):
    """Weighted average of model predictions.

    Default weights based on test set performance.
    """
    if weights is None:
        # Weight by inverse RMSE so lower-error models count more
        weights = [1 / model.rmse for model in models]

    predictions = sum(w * m.predict(X) for w, m in zip(weights, models))
    return predictions / sum(weights)
Benefits:
  • Reduces overfitting
  • Smooths individual model quirks
  • Often outperforms single best model
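
Inverse-RMSE weighting has a simple intuition: a model with half the error gets twice the weight. With illustrative numbers:

```python
# Illustrative inverse-RMSE weighting for a two-model ensemble.
rmses = {"xgboost": 10.0, "lightgbm": 20.0}
raw = {name: 1.0 / rmse for name, rmse in rmses.items()}  # 0.1 and 0.05
total = sum(raw.values())
weights = {name: w / total for name, w in raw.items()}
print(weights)  # xgboost ≈ 0.667, lightgbm ≈ 0.333
```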

Identifying Outliers

After trying models, analyze prediction errors:
03-model.ipynb:278-281
identify_outliers(
    sup=sales_univ_pair,
    settings=load_settings()
)
This generates CSV files in out/models/{model_group}/{model_type}/{model_name}/ with:
  • outliers.csv: Sales with prediction ratios outside 0.75-1.25 range
  • pred_sales.csv: All predictions on sales
  • pred_universe.csv: Predictions for all parcels
Review outliers to:
  • Identify invalid sales that slipped through scrutiny
  • Discover missing variables
  • Understand model limitations
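
The flagging rule in outliers.csv is a ratio test. A minimal sketch of the same logic (parcel IDs and prices are made up; the kit works on its own DataFrames):

```python
# Prediction ratio = predicted value / sale price; flag outside 0.75-1.25.
sales = [
    ("P1", 100_000, 104_000),  # ratio 1.04 — kept
    ("P2", 200_000, 130_000),  # ratio 0.65 — outlier (under-prediction)
    ("P3", 50_000, 70_000),    # ratio 1.40 — outlier (over-prediction)
]
outliers = [
    pid for pid, price, pred in sales
    if not 0.75 <= pred / price <= 1.25
]
print(outliers)  # → ['P2', 'P3']
```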

Finalizing Models

Once satisfied with model performance, finalize to generate production predictions:
03-model.ipynb:301-309
results = from_checkpoint("3-model-02-finalize-models", finalize_models,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "save_params": True,
        "use_saved_params": True,
        "verbose": verbose
    }
)

What finalize_models() Does

pipeline.py
def finalize_models(sup, settings, save_params=True, use_saved_params=True, verbose=False):
    """Train final models and save all results.
    
    - Trains on full training set
    - Generates predictions for all parcels
    - Saves models, predictions, and performance metrics
    - Creates assessment quality reports
    """
Outputs:
  • Trained model objects (.pkl files)
  • Predictions for universe and sales
  • Performance statistics (COD, PRD, PRB)
  • SHAP values (if enabled)
  • Scatter plots and visualizations

Model Evaluation Metrics

OpenAVM Kit calculates multiple performance metrics:
  • COD (Coefficient of Dispersion): Measures horizontal equity (similar properties valued similarly). Target: < 15% for residential, < 20% for other
  • PRD (Price-Related Differential): Detects bias toward high or low values. Target: 0.98 - 1.03
  • PRB (Price-Related Bias): Regression-based vertical equity measure. Target: -0.05 to 0.05
  • R² (R-squared): Proportion of variance explained. Higher is better (0.0 - 1.0)
  • RMSE (Root Mean Square Error): Average prediction error magnitude. Lower is better
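
COD and PRD are both computed from assessment ratios (predicted value ÷ sale price). A sketch of the standard definitions, with illustrative numbers; the kit's own implementation may differ in details such as trimming:

```python
from statistics import mean, median

# Illustrative predictions and sale prices for four sales.
preds = [95_000, 210_000, 68_000, 155_000]
prices = [100_000, 200_000, 70_000, 150_000]
ratios = [p / s for p, s in zip(preds, prices)]

# COD: average absolute deviation from the median ratio, as a percentage.
med = median(ratios)
cod = 100 * mean(abs(r - med) for r in ratios) / med

# PRD: mean ratio divided by the sale-weighted mean ratio.
# Values above ~1.03 suggest high-value properties are under-assessed.
prd = mean(ratios) / (sum(preds) / sum(prices))

print(round(cod, 1), round(prd, 3))  # → 4.0 0.986
```

Here a COD of 4.0 would be well inside the residential target of 15%, and a PRD of 0.986 sits inside the 0.98 - 1.03 band.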

Output Files

Finalized models produce:
out/models/
└── {model_group}/
    ├── main/
    │   ├── xgboost/
    │   │   ├── model.pkl
    │   │   ├── pred_sales.csv
    │   │   ├── pred_universe.csv
    │   │   ├── metrics.json
    │   │   └── shap_values.csv
    │   ├── lightgbm/
    │   └── ensemble/
    ├── vacant/
    └── hedonic_land/

Best Practices

  • Begin with a single model type on one model group before expanding to multiple algorithms and property types.
  • Hyperparameter tuning can take hours for large datasets. Use save_params=True to avoid repeated tuning.
  • If training metrics are much better than test metrics, your model is overfitting. Reduce model complexity or add regularization.
  • Load predictions in GIS software to check for spatial patterns in errors. Systematic geographic bias indicates missing location variables.
  • Ensemble models typically outperform individual models and are more robust to data quirks.

Next Steps

  • Jupyter Notebooks: Learn the complete notebook workflow
  • Configuration Reference: Explore all modeling configuration options
