Modeling Workflow
The modeling process consists of three phases: trying models, identifying outliers, and finalizing models.
Supported Algorithms
OpenAVM Kit supports these model types:
XGBoost
Gradient boosting with tree-based learners. Excellent for tabular data with complex interactions.
LightGBM
Fast gradient boosting optimized for speed and memory efficiency.
CatBoost
Gradient boosting with native categorical feature support.
GWR
Geographically Weighted Regression for spatial variation modeling.
Trying Models
Use try_models() for rapid experimentation:
03-model.ipynb:244-256
Parameters Explained
- Your cleaned data from the previous notebook.
- Save hyperparameters for later use. Enables faster re-runs.
- Load previously saved hyperparameters instead of re-tuning.
- Run models predicting full market value.
- Run separate models for vacant land using only vacant sales.
- Run hedonic models that predict land and improvement values separately.
- Combine multiple models into a weighted ensemble.
- Generate SHAP (SHapley Additive exPlanations) values for model interpretability.
- Create scatter plots comparing predictions to actual sales.
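The save-and-reload behaviour described above can be illustrated with a minimal sketch. It is not OpenAVM Kit's implementation: the helper get_params, the tune_fn callback, the reuse_saved flag, and the JSON file layout are all hypothetical. Only save_params is a name taken from this page.

```python
import json
import tempfile
from pathlib import Path


def get_params(path, tune_fn, save_params=True, reuse_saved=True):
    """Return hyperparameters, reusing a saved copy when allowed.

    `get_params`, `tune_fn`, and `reuse_saved` are hypothetical names
    used for illustration; they are not part of OpenAVM Kit.
    """
    path = Path(path)
    if reuse_saved and path.exists():
        # Skip tuning entirely and reuse the earlier result.
        return json.loads(path.read_text())
    params = tune_fn()  # the expensive tuning step
    if save_params:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(params, indent=2))
    return params


calls = []


def fake_tune():
    """Stand-in for a real tuning run; records how often it is called."""
    calls.append(1)
    return {"learning_rate": 0.05, "max_depth": 6}


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "params" / "xgboost.json"
    first = get_params(target, fake_tune)   # tunes and saves
    second = get_params(target, fake_tune)  # loads from disk

tune_calls = len(calls)
```

The second call returns the same parameters without invoking the tuning function again, which is where the time saving on re-runs comes from.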
Model Types
Main Models
Predict full market value for all property types:
modeling.py
Vacant Models
Predict land value using only vacant land sales:
modeling.py
Hedonic Models
Separate land value from improvement value:
modeling.py
Hedonic models help allocate value between land and improvements, which is required in many tax jurisdictions.
Configuration
Configure models in your settings.json:
settings.json
Hyperparameter Tuning
OpenAVM Kit automatically tunes hyperparameters using cross-validation:
- XGBoost
- LightGBM
- CatBoost
tuning.py
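To make the idea of cross-validated tuning concrete, here is a minimal sketch in plain Python. It swaps in a deliberately simple model (one-feature ridge regression with a closed-form fit) and a plain grid of candidates in place of the gradient-boosting models and whatever search strategy OpenAVM Kit actually uses. All function names are illustrative.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous validation folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form slope for y ~ w * x with an L2 penalty alpha."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def cv_score(xs, ys, alpha, k=3):
    """Mean validation MSE across k folds for one candidate alpha."""
    n = len(xs)
    scores = []
    for val_idx in k_fold_indices(n, k):
        held_out = set(val_idx)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, alpha)
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in val_idx) / len(val_idx)
        scores.append(mse)
    return sum(scores) / len(scores)


def tune(xs, ys, candidates, k=3):
    """Return the candidate with the lowest cross-validated error."""
    return min(candidates, key=lambda alpha: cv_score(xs, ys, alpha, k))


# Noise-free data where y = 2x, so no regularization should win.
xs = [float(i) for i in range(1, 10)]
ys = [2.0 * x for x in xs]
best_alpha = tune(xs, ys, [0.0, 1.0, 10.0])
```

Each candidate is scored only on data held out from fitting, so the chosen value reflects out-of-sample error rather than fit to the training rows.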
Ensemble Models
Combine multiple models for better accuracy:
modeling.py
- Reduces overfitting
- Smooths individual model quirks
- Often outperforms single best model
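The combination step can be sketched as a weighted average of per-parcel predictions. This is an illustration only: how OpenAVM Kit selects member models and weights is not stated here, and the function name and weights below are assumptions.

```python
def ensemble_predict(predictions, weights=None):
    """Weighted average of model predictions.

    predictions: dict mapping model name -> list of predicted values,
                 all lists in the same parcel order.
    weights:     optional dict mapping model name -> non-negative weight;
                 defaults to equal weights.
    """
    names = list(predictions)
    if weights is None:
        weights = {name: 1.0 for name in names}
    n = len(predictions[names[0]])
    if any(len(predictions[name]) != n for name in names):
        raise ValueError("all models must predict the same parcels")
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [
        sum(weights[name] * predictions[name][i] for name in names) / total
        for i in range(n)
    ]


preds = {
    "xgboost": [100.0, 200.0],
    "lightgbm": [110.0, 190.0],
    "catboost": [90.0, 210.0],
}
equal = ensemble_predict(preds)
weighted = ensemble_predict(
    preds, {"xgboost": 3.0, "lightgbm": 1.0, "catboost": 0.0}
)
```

Averaging pulls each parcel's value toward the consensus of the member models, which is why a single model's quirks carry less weight in the result.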
Identifying Outliers
After trying models, analyze prediction errors:
03-model.ipynb:278-281
Results are written to out/models/{model_group}/{model_type}/{model_name}/ with:
- outliers.csv: Sales with prediction ratios outside the 0.75-1.25 range
- pred_sales.csv: All predictions on sales
- pred_universe.csv: Predictions for all parcels
Review the outliers to:
- Identify invalid sales that slipped through scrutiny
- Discover missing variables
- Understand model limitations
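The ratio filter behind outliers.csv can be sketched as follows. The 0.75-1.25 band comes from the description above. The field names key, prediction, and sale_price are assumptions for illustration and may not match the actual CSV columns.

```python
def flag_outliers(sales, low=0.75, high=1.25):
    """Return sales whose prediction / sale price ratio falls outside [low, high]."""
    flagged = []
    for sale in sales:
        ratio = sale["prediction"] / sale["sale_price"]
        if ratio < low or ratio > high:
            # Keep the original record and attach the ratio for review.
            flagged.append({**sale, "ratio": ratio})
    return flagged


sales = [
    {"key": "A", "prediction": 100_000, "sale_price": 100_000},  # ratio 1.00
    {"key": "B", "prediction": 60_000, "sale_price": 100_000},   # ratio 0.60
    {"key": "C", "prediction": 130_000, "sale_price": 100_000},  # ratio 1.30
    {"key": "D", "prediction": 125_000, "sale_price": 100_000},  # ratio 1.25
]
outliers = flag_outliers(sales)
outlier_keys = [row["key"] for row in outliers]
```

In this sketch a ratio exactly on the boundary is treated as inside the band. Whether OpenAVM Kit includes or excludes the boundary is not specified here.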
Finalizing Models
Once satisfied with model performance, finalize to generate production predictions:
03-model.ipynb:301-309
What finalize_models() Does
pipeline.py
- Trained model objects (.pkl files)
- Predictions for universe and sales
- Performance statistics (COD, PRD, PRB)
- SHAP values (if enabled)
- Scatter plots and visualizations
Model Evaluation Metrics
OpenAVM Kit calculates multiple performance metrics:
Coefficient of Dispersion (COD): Measures horizontal equity (similar properties valued similarly)
- Target: < 15% for residential, < 20% for other
Price-Related Differential (PRD): Detects bias toward high or low values
- Target: 0.98 - 1.03
Price-Related Bias (PRB): Regression-based vertical equity measure
- Target: -0.05 to 0.05
R-squared: Proportion of variance explained
- Higher is better (0.0 - 1.0)
Root Mean Square Error (RMSE): Typical prediction error magnitude, in the units of the target
- Lower is better
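For reference, the three ratio-study metrics can be computed as in the sketch below, which follows the commonly used ratio-study definitions. It is an illustration rather than OpenAVM Kit's own code, and the function name ratio_stats is hypothetical.

```python
import math
from statistics import mean, median


def ratio_stats(predictions, sale_prices):
    """Return COD (percent), PRD, and PRB for paired predictions and sale prices."""
    ratios = [p / s for p, s in zip(predictions, sale_prices)]
    med = median(ratios)

    # COD: average absolute deviation from the median ratio, as a percent of it.
    cod = 100.0 * mean(abs(r - med) for r in ratios) / med

    # PRD: mean ratio divided by the sales-weighted mean ratio.
    prd = mean(ratios) / (sum(predictions) / sum(sale_prices))

    # PRB: slope from regressing relative ratio deviations on log2 of a
    # value proxy that averages the sale price and the ratio-adjusted prediction.
    xs = [math.log2(0.5 * (p / med + s)) for p, s in zip(predictions, sale_prices)]
    ys = [(r - med) / med for r in ratios]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    prb = sxy / sxx if sxx else 0.0

    return {"cod": cod, "prd": prd, "prb": prb}


# Predictions that match every sale exactly: no dispersion, no bias.
perfect = ratio_stats(
    [100_000.0, 200_000.0, 300_000.0],
    [100_000.0, 200_000.0, 300_000.0],
)

# Same sale price, predictions spread 10% either side of it.
spread = ratio_stats(
    [90_000.0, 100_000.0, 110_000.0],
    [100_000.0, 100_000.0, 100_000.0],
)
```

A PRD above 1 or a negative PRB both point to regressivity, meaning higher-value properties are predicted at lower ratios than lower-value ones.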
Output Files
Finalized models produce the outputs listed under "What finalize_models() Does" above.
Best Practices
Start simple
Begin with a single model type on one model group before expanding to multiple algorithms and property types.
Monitor training time
Hyperparameter tuning can take hours for large datasets. Use save_params=True to avoid repeated tuning.
Check for overfitting
If training metrics are much better than test metrics, your model is overfitting. Reduce model complexity or add regularization.
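One rough way to apply this check is to compare the same error metric on training and held-out data. The sketch below uses RMSE and flags a model when test error exceeds training error by more than a chosen factor. The 1.5 threshold and the function names are arbitrary assumptions, not OpenAVM Kit defaults.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between paired actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def looks_overfit(train_rmse, test_rmse, tolerance=1.5):
    """Flag when test error is more than `tolerance` times the training error."""
    if train_rmse == 0:
        # A perfect training fit with any test error is suspicious.
        return test_rmse > 0
    return test_rmse / train_rmse > tolerance


train_error = rmse([100.0, 200.0, 300.0], [101.0, 199.0, 301.0])
test_error = rmse([150.0, 250.0], [160.0, 240.0])
overfit = looks_overfit(train_error, test_error)
```

Here the test error is ten times the training error, so the model is flagged. A suitable tolerance depends on the dataset and metric.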
Validate spatially
Load predictions in GIS software to check for spatial patterns in errors. Systematic geographic bias indicates missing location variables.
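Before opening a GIS tool, a quick tabular check can hint at geographic bias by comparing median prediction ratios across areas. The sketch below assumes each sale record carries a neighborhood label. That field and the other field names are hypothetical and may not match your data.

```python
from collections import defaultdict
from statistics import median


def median_ratio_by_area(rows, area_key="neighborhood"):
    """Median prediction / sale price ratio for each area."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[area_key]].append(row["prediction"] / row["sale_price"])
    return {area: median(ratios) for area, ratios in groups.items()}


rows = [
    {"neighborhood": "north", "prediction": 80_000, "sale_price": 100_000},
    {"neighborhood": "north", "prediction": 85_000, "sale_price": 100_000},
    {"neighborhood": "north", "prediction": 90_000, "sale_price": 100_000},
    {"neighborhood": "south", "prediction": 100_000, "sale_price": 100_000},
    {"neighborhood": "south", "prediction": 100_000, "sale_price": 100_000},
    {"neighborhood": "south", "prediction": 110_000, "sale_price": 100_000},
]
by_area = median_ratio_by_area(rows)
```

An area whose median ratio sits well away from 1.0, like "north" in this toy data, is a candidate for a closer look on the map and may point to a missing location variable.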
Use ensembles
Ensemble models typically outperform individual models and are more robust to data quirks.
Next Steps
Jupyter Notebooks
Learn the complete notebook workflow
Configuration Reference
Explore all modeling configuration options