Overview
This tutorial walks you through creating a simple automated valuation model using OpenAVM Kit’s synthetic data generator. You’ll learn the core workflow: data preparation, model training, and evaluation.
This tutorial uses synthetic data to demonstrate the workflow. For production use, you’ll work with real assessment and sales data from your jurisdiction.
Prerequisites
Before starting, ensure you have:
Python 3.11 or later installed
OpenAVM Kit installed (see the installation guide)
Basic familiarity with Python and pandas
Generate synthetic data
Let’s create a synthetic dataset to work with:
Import the synthetic data module
from openavmkit.synthetic.basic import (
generate_inflation_curve,
generate_depreciation_curve,
SyntheticData
)
import pandas as pd
import numpy as np
The synthetic.basic module provides tools for generating realistic property data with known ground truth values.
Create time-based curves
# Generate land value inflation over time
time_land_mult = generate_inflation_curve(
    start_year=2020,
    end_year=2024,
    annual_inflation_rate=0.05,
    annual_inflation_rate_stdev=0.01,
    seasonality_amplitude=0.20,
    monthly_noise=0.05,
    daily_noise=0.01
)
# Generate building depreciation curve
time_bldg_mult = generate_depreciation_curve(
    lifetime=60,
    weight_linear=0.2,
    weight_logistic=0.8,
    steepness=0.3,
    inflection_point=20
)
These curves simulate real-world market dynamics:
Inflation curve : Land values trend upward with seasonal variation
Depreciation curve : Building values decline with age using a blended linear/logistic model
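To build intuition for the blended model, here is a minimal sketch of how a linear decline and a logistic decline can be combined into a single depreciation curve. This is an illustration of the concept only, not the library's implementation, and the function name is hypothetical:

```python
import numpy as np

def blended_depreciation(lifetime, weight_linear, weight_logistic,
                         steepness, inflection_point):
    """Blend a straight-line decline with a logistic decline over a lifetime."""
    age = np.arange(lifetime + 1)
    linear = 1.0 - age / lifetime  # straight-line decline to zero at end of life
    logistic = 1.0 / (1.0 + np.exp(steepness * (age - inflection_point)))
    return weight_linear * linear + weight_logistic * logistic

# Same parameters as the tutorial's depreciation curve
curve = blended_depreciation(60, 0.2, 0.8, 0.3, 20)
# A new building retains nearly full value; a 60-year-old one is near zero
```

The logistic term makes depreciation slow at first, accelerate around the inflection point, and level off late in life, while the linear term keeps the curve moving steadily downward in between.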
Examine the curves
import matplotlib.pyplot as plt
# Plot land inflation
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(time_land_mult)
plt.title('Land Value Inflation (2020-2024)')
plt.xlabel('Days')
plt.ylabel('Multiplier')
# Plot building depreciation
plt.subplot(1, 2, 2)
plt.plot(time_bldg_mult)
plt.title('Building Depreciation (60 years)')
plt.xlabel('Age (years)')
plt.ylabel('Remaining Value')
plt.tight_layout()
plt.show()
This visualization helps verify the curves match realistic market behavior.
Build a simple ratio study
Now let’s evaluate assessment quality using OpenAVM Kit’s ratio study tools:
from openavmkit.ratio_study import RatioStudy
import numpy as np
# Simulate assessed values and sale prices
np.random.seed(42)
n_properties = 1000
# Ground truth sale prices
sale_prices = np.random.lognormal(mean=12.5, sigma=0.5, size=n_properties)
# Assessed values with some error
assessed_values = sale_prices * np.random.normal(loc=1.0, scale=0.15, size=n_properties)
# Create ratio study
rs = RatioStudy(
    predictions=assessed_values,
    ground_truth=sale_prices,
    max_trim=0.25  # Allow trimming up to 25% of outliers
)
# Display key metrics
print(f"Sample size: {rs.count}")
print(f"Median ratio: {rs.median_ratio:.3f}")
print(f"COD (Coefficient of Dispersion): {rs.cod:.3f}")
print(f"PRD (Price-Related Differential): {rs.prd:.3f}")
print(f"PRB (Price-Related Bias): {rs.prb:.3f}")
Understanding ratio study metrics
Median ratio : Center of the assessment ratio distribution (target: 1.00)
COD : Measures uniformity of assessments (lower is better, IAAO target: <15 for residential)
PRD : Detects assessment bias related to price levels (target: 0.98-1.03)
PRB : More sensitive measure of vertical equity (target: -0.05 to 0.05)
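To see exactly what these metrics measure, the standard IAAO formulas for the median ratio, COD, and PRD can be computed directly from assessed values and sale prices (PRB requires a regression and is omitted here; use the library's RatioStudy for the full set):

```python
import numpy as np

def iaao_metrics(assessed, prices):
    """Median ratio, COD, and PRD per the standard IAAO definitions."""
    ratios = np.asarray(assessed) / np.asarray(prices)
    median = np.median(ratios)
    # COD: average absolute deviation from the median ratio, as a percentage
    cod = 100.0 * np.mean(np.abs(ratios - median)) / median
    # PRD: mean ratio divided by the sale-price-weighted mean ratio;
    # values above ~1.03 suggest higher-priced properties are under-assessed
    prd = np.mean(ratios) / (np.sum(assessed) / np.sum(prices))
    return median, cod, prd

median, cod, prd = iaao_metrics([90_000, 100_000, 110_000],
                                [100_000, 100_000, 100_000])
```

With the toy inputs above, the ratios are 0.9, 1.0, and 1.1, so the median ratio is exactly 1.0 and the assessments show dispersion but no price-related bias.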
Working with real data
For production workflows, OpenAVM Kit uses a structured approach:
Initialize a notebook session
from openavmkit.pipeline import init_notebook, load_settings
# Set up the environment for a specific locality
init_notebook(locality="us-nc-guilford")
# Load configuration
settings = load_settings("in/settings.json")
The init_notebook() function:
Sets up the working directory structure
Loads environment variables from .env
Configures logging and warnings
Creates a NotebookState object for session management
Load and process data
from openavmkit.data import load_dataframe, process_data, SalesUniversePair
# Load raw data
df_parcels = load_dataframe("in/parcels.parquet")
df_sales = load_dataframe("in/sales.parquet")
# Create a SalesUniversePair object
sup = SalesUniversePair(
    universe=df_parcels,
    sales=df_sales
)
# Process and enrich the data
sup = process_data(
    sup=sup,
    settings=settings,
    enrich_census=True,
    enrich_osm=False
)
The SalesUniversePair (SUP) is OpenAVM Kit’s core data structure:
universe : All parcels in the jurisdiction
sales : Sales transactions for model training/testing
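Conceptually, a SUP is just a pairing of two tables that share a parcel key. The minimal analogue below (not the library's actual class; the "key" column name is an assumption for illustration) shows why the pairing is useful, e.g. joining parcel characteristics onto sales:

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class SimpleSUP:
    """Illustrative analogue of a SalesUniversePair."""
    universe: pd.DataFrame  # one row per parcel in the jurisdiction
    sales: pd.DataFrame     # one row per sale transaction

    def sales_with_characteristics(self) -> pd.DataFrame:
        """Join parcel characteristics onto sales by parcel key."""
        return self.sales.merge(self.universe, on="key", how="left")

universe = pd.DataFrame({"key": ["A", "B"], "land_area_sf": [5000, 7500]})
sales = pd.DataFrame({"key": ["A"], "sale_price": [250_000]})
sup_demo = SimpleSUP(universe, sales)
joined = sup_demo.sales_with_characteristics()
```

Keeping both tables in one object ensures that any enrichment applied to the universe stays consistent with the sales used for training and testing.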
Train a model
from openavmkit.modeling import fit_model
from openavmkit.utilities.modeling import XGBoostModel
# Define model configuration
model_config = {
    "type": "xgboost",
    "features": [
        "land_area_sf",
        "building_area_sf",
        "year_built",
        "bedrooms",
        "bathrooms"
    ],
    "target": "sale_price"
}
# Fit the model
model = fit_model(
    sup=sup,
    settings=settings,
    model_type=XGBoostModel,
    features=model_config["features"]
)
# Generate predictions
sup.sales["predicted_value"] = model.predict(sup.sales)
from openavmkit.ratio_study import RatioStudy
# Create ratio study from predictions
rs = RatioStudy(
    predictions=sup.sales["predicted_value"].values,
    ground_truth=sup.sales["sale_price"].values,
    max_trim=0.25
)
# Display metrics
print("Model Performance:")
print(f"  Median Ratio: {rs.median_ratio:.3f}")
print(f"  COD: {rs.cod:.2f}")
print(f"  PRD: {rs.prd:.3f}")
print(f"  PRB: {rs.prb:.3f}")
# Check IAAO standards
if rs.cod < 15.0 and 0.98 <= rs.prd <= 1.03:
    print("✓ Model meets IAAO standards for residential assessments")
else:
    print("✗ Model needs improvement to meet IAAO standards")
Download example data
OpenAVM Kit includes a public dataset for learning. Here’s how to download it:
Create locality structure
# Navigate to your notebooks directory
cd notebooks/pipeline/data
# Create folder for the example locality
mkdir us-nc-guilford
cd us-nc-guilford
Configure cloud access
Create cloud.json pointing at the public dataset container:
{
  "type": "azure",
  "azure_storage_container_url": "https://landeconomics.blob.core.windows.net/localities-public"
}
This is a public container, so no authentication is required.
Sync data
In a Jupyter notebook:
from openavmkit.pipeline import init_notebook
from openavmkit.cloud.cloud import cloud_sync
# Initialize for the example locality
init_notebook(locality="us-nc-guilford")
# Download data from cloud
cloud_sync()
This creates two folders:
in/ - Input files including settings.json
out/ - Output files from your analysis
Run the pipeline
Open notebooks/pipeline/01-assemble.ipynb and run the cells to:
Load and validate the data
Perform initial data quality checks
Prepare features for modeling
Continue with subsequent notebooks:
02-clean.ipynb - Data cleaning and filtering
03-model.ipynb - Model training and evaluation
assessment_quality.ipynb - Comprehensive quality analysis
Key concepts
SalesUniversePair Core data structure containing both the parcel universe and sales observations, enabling consistent operations across the entire workflow.
Settings dictionary JSON configuration defining field mappings, feature specifications, modeling parameters, and locality-specific settings.
Ratio studies IAAO-standard statistical analysis measuring assessment quality through metrics like COD, PRD, and PRB.
Pipeline functions High-level functions in openavmkit.pipeline that orchestrate common workflows for data processing, modeling, and reporting.
Common workflows
Time adjustment for sales
from openavmkit.time_adjustment import enrich_time_adjustment
# Adjust sale prices to a common date
sup = enrich_time_adjustment(
    sup=sup,
    settings=settings,
    valuation_date="2024-01-01"
)
Time adjustment normalizes sale prices to a valuation date, accounting for market appreciation.
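The underlying idea can be shown with a toy monthly price index. This sketch is illustrative only; the library derives its adjustment from the data rather than from a hand-built index:

```python
# Toy monthly market index: the market rises 1% per month (month 0 = January)
index = {month: 1.01 ** month for month in range(13)}

def time_adjust(price, sale_month, valuation_month, index):
    """Restate a sale price in valuation-date dollars via an index ratio."""
    return price * index[valuation_month] / index[sale_month]

# A $200,000 sale in March (month 2) adjusted to December (month 11):
# the price is scaled up by nine months of 1% appreciation
adjusted = time_adjust(200_000, 2, 11, index)
```

Without this normalization, early-year sales would look systematically cheap relative to late-year sales, and the model would learn the market trend instead of property characteristics.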
Sales scrutiny and filtering
from openavmkit.sales_scrutiny_study import run_sales_scrutiny_per_model_group
from openavmkit.cleaning import filter_invalid_sales
# Identify problematic sales
sup = run_sales_scrutiny_per_model_group(
    sup=sup,
    settings=settings
)
# Remove invalid transactions
sup = filter_invalid_sales(sup, settings)
Sales scrutiny detects outliers, non-arm’s-length transactions, and data errors.
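As a rough illustration of the kind of screening involved (this is not the library's scrutiny algorithm), sales with extreme assessment ratios can be flagged with a simple interquartile-range rule:

```python
import numpy as np

def flag_ratio_outliers(assessed, prices, k=1.5):
    """Flag sales whose assessment ratio falls outside the IQR fences."""
    ratios = np.asarray(assessed) / np.asarray(prices)
    q1, q3 = np.percentile(ratios, [25, 75])
    iqr = q3 - q1
    return (ratios < q1 - k * iqr) | (ratios > q3 + k * iqr)

# The fourth sale's ratio of 3.0 is far outside the fences and gets flagged
flags = flag_ratio_outliers([95, 100, 105, 300], [100, 100, 100, 100])
```

Production scrutiny goes further, e.g. checking deed types, related-party transfers, and parcel characteristics, but the goal is the same: keep non-representative transactions out of the training set.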
Compare multiple models
from openavmkit.utilities.modeling import XGBoostModel, LightGBMModel, CatBoostModel
models = {
    "XGBoost": XGBoostModel,
    "LightGBM": LightGBMModel,
    "CatBoost": CatBoostModel
}
results = {}
for name, model_class in models.items():
    model = fit_model(sup, settings, model_class, features)
    predictions = model.predict(sup.sales)
    rs = RatioStudy(predictions, sup.sales["sale_price"], max_trim=0.25)
    results[name] = {"cod": rs.cod, "prd": rs.prd}
# Compare models
import pandas as pd
df_comparison = pd.DataFrame(results).T
print(df_comparison)
Test multiple algorithms to find the best fit for your data.
Add spatial features
from openavmkit.utilities.data import calc_spatial_lag
# Calculate spatial lag of sale prices
sup.sales["spatial_lag_price"] = calc_spatial_lag(
    gdf=sup.sales,
    values=sup.sales["sale_price"],
    k_neighbors=10,
    bandwidth=1000  # meters
)
# Use as a feature
features.append("spatial_lag_price")
Spatial lag captures neighborhood effects by averaging nearby property values.
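The mechanics of a k-nearest-neighbor spatial lag can be sketched with scipy's KD-tree. This is an unweighted illustration of the idea; calc_spatial_lag handles geometry and distance weighting for you:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_spatial_lag(coords, values, k=3):
    """For each point, average the values of its k nearest neighbors."""
    tree = cKDTree(coords)
    # Query k+1 neighbors because each point's nearest neighbor is itself,
    # then drop the self-match in the first column
    _, idx = tree.query(coords, k=k + 1)
    return values[idx[:, 1:]].mean(axis=1)

# Three clustered points and one distant point
coords = np.array([[0, 0], [0, 1], [1, 0], [10, 10]])
values = np.array([100.0, 200.0, 300.0, 400.0])
lag = knn_spatial_lag(coords, values, k=2)
```

Note that the distant point still receives a lag from its two nearest neighbors, however far away they are; a bandwidth cutoff (as in calc_spatial_lag) prevents such remote values from leaking into the feature.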
Next steps
Explore the notebooks
Work through the complete pipeline notebooks in notebooks/pipeline/ to see the full workflow in action.
Review the API documentation
Browse the API reference to understand available functions and their parameters.
Customize for your data
Create a new locality folder with your jurisdiction’s data and adapt the settings file to match your schema.
Build production models
Use the checkpoint system and cloud sync features to build robust, reproducible modeling pipelines.
Need help? Join the discussion on GitHub for questions and support.