Overview
OpenAVM Kit follows a structured, multi-phase workflow for mass appraisal. Each phase builds upon the previous one, transforming raw data into accurate property valuations. The workflow is implemented through a series of Jupyter notebooks in the notebooks/pipeline/ directory, each corresponding to a phase of the process.
The four-phase workflow
Phase 1: Assemble
Goal: Load and merge raw data into a unified structure
What happens in the assemble phase
This phase takes your raw data files and combines them into the core data structure used throughout OpenAVM Kit:
- Load dataframes from various sources (parcels, sales, characteristics)
- Merge data according to instructions in settings.json
- Create SalesUniversePair - the fundamental data structure containing:
- Universe: all parcels in your jurisdiction
- Sales: transactions with known prices
- Enrich data with calculated fields, spatial joins, and external data sources
- Tag model groups to classify parcels by type (residential, commercial, etc.)
Key output: 1-assemble-sup.pickle - the assembled and enriched data
Key functions:
- load_dataframes(settings) - Load raw data
- process_dataframes(dataframes, settings) - Merge and enrich
- tag_model_groups_sup(sup, settings) - Classify parcels
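As a rough illustration of the merge step, the sketch below joins a toy sales table to a parcel universe on a shared key and keeps the two tables together as a pair. The column names and the dict-based pair are hypothetical stand-ins for illustration, not the actual SalesUniversePair API.

```python
import pandas as pd

# Hypothetical raw inputs: a parcel universe and a sales file sharing a parcel key.
universe = pd.DataFrame({
    "parcel_id": ["A1", "A2", "A3", "A4"],
    "land_area_sqft": [5000, 6200, 4500, 8000],
    "bldg_area_sqft": [1400, 0, 1800, 2100],
})
sales = pd.DataFrame({
    "parcel_id": ["A1", "A3"],
    "sale_price": [250_000, 310_000],
    "sale_date": pd.to_datetime(["2023-04-01", "2023-09-15"]),
})

# Attach parcel characteristics to each sale via a left join on the parcel key,
# then keep the two tables together: the universe (all parcels) and the sales
# (transactions with known prices, now enriched with characteristics).
sales_enriched = sales.merge(universe, on="parcel_id", how="left")
sup = {"universe": universe, "sales": sales_enriched}

print(len(sup["universe"]), len(sup["sales"]))  # 4 parcels, 2 sales
```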
Phase 2: Clean
Goal: Validate and clean data for modeling
What happens in the clean phase
This phase ensures data quality and removes invalid or problematic records:
- Process sales data:
- Filter invalid sales (bad dates, prices, etc.)
- Apply time adjustments to normalize sale prices
- Validate sales within acceptable date ranges
- Sales scrutiny analysis:
- Run heuristics to identify suspicious sales
- Perform cluster-based outlier detection
- Remove manually flagged exclusions
- Data enrichment:
- Add spatial lag features (neighborhood effects)
- Calculate street network metrics (frontage, depth)
- Fill unknown values with defaults
- Split data:
- Divide sales into training (80%) and test (20%) sets
- Ensure splits are stratified by model group
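To illustrate the spatial lag idea from the enrichment step, here is a minimal sketch (toy coordinates and prices, not the kit's actual implementation) that averages the sale prices of a parcel's nearest neighbors:

```python
import statistics

# Toy parcels with coordinates and known sale prices (illustrative data).
parcels = {
    "A1": {"xy": (0, 0), "price": 250_000},
    "A2": {"xy": (1, 0), "price": 260_000},
    "A3": {"xy": (0, 1), "price": 240_000},
    "A4": {"xy": (5, 5), "price": 400_000},
}

def spatial_lag(target_id, k=2):
    """Average sale price of the k nearest neighbors - a simple 'spatial lag'
    feature capturing neighborhood effects."""
    tx, ty = parcels[target_id]["xy"]
    neighbors = []
    for pid, p in parcels.items():
        if pid == target_id:
            continue
        x, y = p["xy"]
        dist = ((x - tx) ** 2 + (y - ty) ** 2) ** 0.5
        neighbors.append((dist, p["price"]))
    neighbors.sort(key=lambda d: d[0])
    return statistics.mean(price for _, price in neighbors[:k])

print(spatial_lag("A1"))  # mean of the two nearest neighbors' prices: 250000
```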
Key output: 2-clean-sup.pickle - validated and ready-to-model data
Key functions:
- process_sales(sup, settings) - Clean and validate sales
- run_sales_scrutiny(sup, settings) - Identify outliers
- enrich_sup_spatial_lag(sup, settings) - Add neighborhood effects
- fill_unknown_values_sup(sup, settings) - Handle missing data
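The stratified 80/20 split can be sketched in plain Python as follows; the record layout and field names are made up for illustration, not taken from the kit's internals.

```python
import random
from collections import defaultdict

def stratified_split(records, group_key, test_frac=0.2, seed=42):
    """Split records into train/test sets, stratified by a grouping field,
    so each model group keeps roughly the same train/test proportions."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[group_key]].append(rec)
    train, test = [], []
    for group_records in by_group.values():
        rng.shuffle(group_records)
        n_test = round(len(group_records) * test_frac)
        test.extend(group_records[:n_test])
        train.extend(group_records[n_test:])
    return train, test

# Toy sales tagged with a model group: 10 residential, 5 commercial.
sales = [{"id": i, "model_group": "residential" if i < 10 else "commercial"}
         for i in range(15)]
train, test = stratified_split(sales, "model_group")
print(len(train), len(test))  # 12 3
```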
Phase 3: Model
Goal: Train predictive models and generate valuations
What happens in the model phase
This is where machine learning models are trained to predict property values:
- Variable testing:
- Test which characteristics are most predictive
- Evaluate individual feature importance
- Model training:
- Main models - predict full market value for improved parcels
- Vacant models - predict land value using vacant land sales
- Hedonic models - predict land value by simulating vacancy
- Ensemble models - combine multiple model predictions
- Model types available:
- XGBoost (gradient boosting)
- LightGBM (gradient boosting)
- CatBoost (gradient boosting)
- GWR (geographically weighted regression)
- MRA (multiple regression analysis)
- And more…
- Generate predictions:
- Train on training set
- Validate on test set
- Predict values for entire universe
Key output: trained models and predictions saved under out/models/
Key functions:
- try_variables(sup, settings) - Test feature importance
- try_models(sup, settings) - Experiment with models quickly
- run_models(sup, settings) - Final model training and prediction
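As a toy illustration of the ensemble idea, the sketch below simply averages per-parcel predictions from three hypothetical models; real ensembles in the kit may weight or stack models differently.

```python
import statistics

# Hypothetical per-parcel value predictions from three trained models.
predictions = {
    "xgboost":  {"A1": 250_000, "A2": 180_000},
    "lightgbm": {"A1": 260_000, "A2": 175_000},
    "catboost": {"A1": 255_000, "A2": 185_000},
}

# A minimal ensemble: the mean of each model's prediction, per parcel.
parcel_ids = predictions["xgboost"].keys()
ensemble = {
    pid: statistics.mean(model[pid] for model in predictions.values())
    for pid in parcel_ids
}
print(ensemble)
```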
Phase 4: Evaluate
Goal: Assess model performance and generate reports
What happens in the evaluate phase
This phase measures how well your models performed:
- Ratio studies:
- Calculate assessment ratios (predicted / actual)
- Compute standard metrics (COD, PRD, PRB)
- Break down by location, value ranges, model groups
- Equity studies:
- Horizontal equity - similar properties assessed similarly
- Vertical equity - proportional assessment across value ranges
- Generate reports:
- PDF reports with statistics and visualizations
- Excel exports with detailed breakdowns
- Scatter plots and diagnostic charts
- Quality control:
- Identify problematic predictions
- Flag potential outliers
- Review model assumptions
Key output: evaluation reports and exports saved under out/
Key classes:
- RatioStudy(df, settings) - Assessment ratio analysis
- HorizontalEquityStudy(df, settings) - Horizontal equity analysis
- VerticalEquityStudy(df, settings) - Vertical equity analysis
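The core ratio study metrics can be computed by hand. The sketch below implements the standard COD and PRD formulas (PRB, which is regression-based, is omitted); it is an independent illustration, not the RatioStudy implementation.

```python
import statistics

def ratio_stats(predicted, actual):
    """Compute standard ratio study metrics from parallel lists of predicted
    (assessed) values and actual sale prices.
    COD: coefficient of dispersion around the median ratio (lower is better).
    PRD: price-related differential, the mean ratio divided by the
    value-weighted mean ratio; values well above 1 suggest regressivity."""
    ratios = [p / a for p, a in zip(predicted, actual)]
    median_ratio = statistics.median(ratios)
    cod = 100 * statistics.mean(abs(r - median_ratio) for r in ratios) / median_ratio
    prd = statistics.mean(ratios) / (sum(predicted) / sum(actual))
    return median_ratio, cod, prd

# Three toy sales (illustrative values).
predicted = [95_000, 210_000, 480_000]
actual = [100_000, 200_000, 500_000]
median_ratio, cod, prd = ratio_stats(predicted, actual)
print(round(median_ratio, 3), round(cod, 1), round(prd, 3))
```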
Data flow diagram
Iterative refinement
The workflow is designed to be iterative. You’ll often cycle back through phases as you:
- Discover data quality issues requiring cleanup
- Test different modeling approaches
- Refine settings based on evaluation results
- Add new data sources or features
Checkpointing and caching
OpenAVM Kit uses checkpointing to save progress between phases:
- Each phase saves its output as a .pickle file
- Intermediate results can be cached to speed up re-runs
- You can jump into any phase by loading the previous phase’s checkpoint
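A minimal sketch of the checkpoint pattern with pickle (hypothetical paths and data, not the kit's internal format):

```python
import pickle
import tempfile
from pathlib import Path

def save_checkpoint(obj, path):
    """Serialize a phase's output so a later phase (or a re-run) can resume from it."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)

def load_checkpoint(path):
    """Load a previously saved phase output."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

# Illustrative round trip: a later phase resumes from the assemble checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    sup = {"universe": ["A1", "A2", "A3"], "sales": ["A1"]}
    ckpt = Path(tmp) / "1-assemble-sup.pickle"
    save_checkpoint(sup, ckpt)
    assert load_checkpoint(ckpt) == sup
```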
Best practices
Start small
Begin with a subset of your data to validate the workflow before processing your entire jurisdiction.
Document settings
Keep detailed notes about what settings and parameters work best for your locality.
Validate early and often
Run evaluation metrics during the model phase, not just at the end. This helps catch issues earlier.
Use version control
Track changes to your settings.json and document why you made specific configuration choices.