Overview
This tutorial walks you through creating a simple automated valuation model using OpenAVM Kit’s synthetic data generator. You’ll learn the core workflow: data preparation, model training, and evaluation.
This tutorial uses synthetic data to demonstrate the workflow. For production use, you’ll work with real assessment and sales data from your jurisdiction.
Prerequisites
Before starting, ensure you have:
Python 3.11 or later installed
OpenAVM Kit installed (see the installation guide)
Basic familiarity with Python and pandas
Generate synthetic data
Let’s create a synthetic dataset to work with:
Import the synthetic data module
from openavmkit.synthetic.basic import (
generate_inflation_curve,
generate_depreciation_curve,
SyntheticData
)
import pandas as pd
import numpy as np
The synthetic.basic module provides tools for generating realistic property data with known ground truth values.
Create time-based curves
# Generate land value inflation over time
time_land_mult = generate_inflation_curve(
    start_year=2020,
    end_year=2024,
    annual_inflation_rate=0.05,
    annual_inflation_rate_stdev=0.01,
    seasonality_amplitude=0.20,
    monthly_noise=0.05,
    daily_noise=0.01
)
# Generate building depreciation curve
time_bldg_mult = generate_depreciation_curve(
    lifetime=60,
    weight_linear=0.2,
    weight_logistic=0.8,
    steepness=0.3,
    inflection_point=20
)
These curves simulate real-world market dynamics:
Inflation curve : Land values trend upward with seasonal variation
Depreciation curve : Building values decline with age using a blended linear/logistic model
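To build intuition for the blended model, here is a minimal sketch of how a linear decline and a logistic decline can be combined into a single depreciation curve. This is an illustration of the concept only, not the library's implementation, and the function name is hypothetical:

```python
import numpy as np

def blended_depreciation(lifetime, weight_linear, weight_logistic,
                         steepness, inflection_point):
    """Blend a straight-line decline with a logistic decline over a lifetime."""
    age = np.arange(lifetime + 1)
    linear = 1.0 - age / lifetime  # straight-line decline to zero at end of life
    logistic = 1.0 / (1.0 + np.exp(steepness * (age - inflection_point)))
    return weight_linear * linear + weight_logistic * logistic

# Same parameters as the tutorial's depreciation curve
curve = blended_depreciation(60, 0.2, 0.8, 0.3, 20)
# A new building retains nearly full value; a 60-year-old one is near zero
```

The logistic term makes depreciation slow at first, accelerate around the inflection point, and level off late in life, while the linear term keeps the curve moving steadily downward in between.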
Examine the curves
import matplotlib.pyplot as plt
# Plot land inflation
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(time_land_mult)
plt.title('Land Value Inflation (2020-2024)')
plt.xlabel('Days')
plt.ylabel('Multiplier')
# Plot building depreciation
plt.subplot(1, 2, 2)
plt.plot(time_bldg_mult)
plt.title('Building Depreciation (60 years)')
plt.xlabel('Age (years)')
plt.ylabel('Remaining Value')
plt.tight_layout()
plt.show()
This visualization helps verify the curves match realistic market behavior.
Build a simple ratio study
Now let’s evaluate assessment quality using OpenAVM Kit’s ratio study tools:
from openavmkit.ratio_study import RatioStudy
import numpy as np
# Simulate assessed values and sale prices
np.random.seed(42)
n_properties = 1000
# Ground truth sale prices
sale_prices = np.random.lognormal(mean=12.5, sigma=0.5, size=n_properties)
# Assessed values with some error
assessed_values = sale_prices * np.random.normal(loc=1.0, scale=0.15, size=n_properties)
# Create ratio study
rs = RatioStudy(
    predictions=assessed_values,
    ground_truth=sale_prices,
    max_trim=0.25  # Allow trimming up to 25% of outliers
)
# Display key metrics
print(f"Sample size: {rs.count}")
print(f"Median ratio: {rs.median_ratio:.3f}")
print(f"COD (Coefficient of Dispersion): {rs.cod:.3f}")
print(f"PRD (Price-Related Differential): {rs.prd:.3f}")
print(f"PRB (Price-Related Bias): {rs.prb:.3f}")
Understanding ratio study metrics
Median ratio : Center of the assessment ratio distribution (target: 1.00)
COD : Measures uniformity of assessments (lower is better, IAAO target: <15 for residential)
PRD : Detects assessment bias related to price levels (target: 0.98-1.03)
PRB : More sensitive measure of vertical equity (target: -0.05 to 0.05)
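To see exactly what these metrics measure, the standard IAAO formulas for the median ratio, COD, and PRD can be computed directly from assessed values and sale prices (PRB requires a regression and is omitted here; use the library's RatioStudy for the full set):

```python
import numpy as np

def iaao_metrics(assessed, prices):
    """Median ratio, COD, and PRD per the standard IAAO definitions."""
    ratios = np.asarray(assessed) / np.asarray(prices)
    median = np.median(ratios)
    # COD: average absolute deviation from the median ratio, as a percentage
    cod = 100.0 * np.mean(np.abs(ratios - median)) / median
    # PRD: mean ratio divided by the sale-price-weighted mean ratio;
    # values above ~1.03 suggest higher-priced properties are under-assessed
    prd = np.mean(ratios) / (np.sum(assessed) / np.sum(prices))
    return median, cod, prd

median, cod, prd = iaao_metrics([90_000, 100_000, 110_000],
                                [100_000, 100_000, 100_000])
```

With the toy inputs above, the ratios are 0.9, 1.0, and 1.1, so the median ratio is exactly 1.0 and the assessments show dispersion but no price-related bias.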
Working with real data
For production workflows, OpenAVM Kit uses a structured approach:
Initialize a notebook session
from openavmkit.pipeline import init_notebook, load_settings
# Set up the environment for a specific locality
init_notebook(locality="us-nc-guilford")
# Load configuration
settings = load_settings("in/settings.json")
The init_notebook() function:
Sets up the working directory structure
Loads environment variables from .env
Configures logging and warnings
Creates a NotebookState object for session management
Load and process data
from openavmkit.data import load_dataframe, process_data, SalesUniversePair
# Load raw data
df_parcels = load_dataframe("in/parcels.parquet")
df_sales = load_dataframe("in/sales.parquet")
# Create a SalesUniversePair object
sup = SalesUniversePair(
    universe=df_parcels,
    sales=df_sales
)
# Process and enrich the data
sup = process_data(
    sup=sup,
    settings=settings,
    enrich_census=True,
    enrich_osm=False
)
The SalesUniversePair (SUP) is OpenAVM Kit’s core data structure:
universe : All parcels in the jurisdiction
sales : Sales transactions for model training/testing
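Conceptually, a SUP is just a pairing of two tables that share a parcel key. The minimal analogue below (not the library's actual class; the "key" column name is an assumption for illustration) shows why the pairing is useful, e.g. joining parcel characteristics onto sales:

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class SimpleSUP:
    """Illustrative analogue of a SalesUniversePair."""
    universe: pd.DataFrame  # one row per parcel in the jurisdiction
    sales: pd.DataFrame     # one row per sale transaction

    def sales_with_characteristics(self) -> pd.DataFrame:
        """Join parcel characteristics onto sales by parcel key."""
        return self.sales.merge(self.universe, on="key", how="left")

universe = pd.DataFrame({"key": ["A", "B"], "land_area_sf": [5000, 7500]})
sales = pd.DataFrame({"key": ["A"], "sale_price": [250_000]})
sup_demo = SimpleSUP(universe, sales)
joined = sup_demo.sales_with_characteristics()
```

Keeping both tables in one object ensures that any enrichment applied to the universe stays consistent with the sales used for training and testing.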
Train a model
from openavmkit.modeling import fit_model
from openavmkit.utilities.modeling import XGBoostModel
# Define model configuration
model_config = {
    "type": "xgboost",
    "features": [
        "land_area_sf",
        "building_area_sf",
        "year_built",
        "bedrooms",
        "bathrooms"
    ],
    "target": "sale_price"
}
# Fit the model
model = fit_model(
    sup=sup,
    settings=settings,
    model_type=XGBoostModel,
    features=model_config["features"]
)
# Generate predictions
sup.sales["predicted_value"] = model.predict(sup.sales)
from openavmkit.ratio_study import RatioStudy
# Create ratio study from predictions
rs = RatioStudy(
    predictions=sup.sales["predicted_value"].values,
    ground_truth=sup.sales["sale_price"].values,
    max_trim=0.25
)
# Display metrics
print("Model Performance:")
print(f"  Median Ratio: {rs.median_ratio:.3f}")
print(f"  COD: {rs.cod:.2f}")
print(f"  PRD: {rs.prd:.3f}")
print(f"  PRB: {rs.prb:.3f}")
# Check IAAO standards
if rs.cod < 15.0 and 0.98 <= rs.prd <= 1.03:
    print("✓ Model meets IAAO standards for residential assessments")
else:
    print("✗ Model needs improvement to meet IAAO standards")
Download example data
OpenAVM Kit includes a public dataset for learning. Here’s how to download it:
Create locality structure
# Navigate to your notebooks directory
cd notebooks/pipeline/data
# Create folder for the example locality
mkdir us-nc-guilford
cd us-nc-guilford
Configure cloud access
Create cloud.json pointing at the public dataset container:
{
  "type": "azure",
  "azure_storage_container_url": "https://landeconomics.blob.core.windows.net/localities-public"
}
This is a public container, so no authentication is required.
Sync data
In a Jupyter notebook:
from openavmkit.pipeline import init_notebook
from openavmkit.cloud.cloud import cloud_sync
# Initialize for the example locality
init_notebook(locality="us-nc-guilford")
# Download data from cloud
cloud_sync()
This creates two folders:
in/ - Input files including settings.json
out/ - Output files from your analysis
Run the pipeline
Open notebooks/pipeline/01-assemble.ipynb and run the cells to:
Load and validate the data
Perform initial data quality checks
Prepare features for modeling
Continue with subsequent notebooks:
02-clean.ipynb - Data cleaning and filtering
03-model.ipynb - Model training and evaluation
assessment_quality.ipynb - Comprehensive quality analysis
Key concepts
SalesUniversePair Core data structure containing both the parcel universe and sales observations, enabling consistent operations across the entire workflow.
Settings dictionary JSON configuration defining field mappings, feature specifications, modeling parameters, and locality-specific settings.
Ratio studies IAAO-standard statistical analysis measuring assessment quality through metrics like COD, PRD, and PRB.
Pipeline functions High-level functions in openavmkit.pipeline that orchestrate common workflows for data processing, modeling, and reporting.
Common workflows
Time adjustment for sales
from openavmkit.time_adjustment import enrich_time_adjustment
# Adjust sale prices to a common date
sup = enrich_time_adjustment(
    sup=sup,
    settings=settings,
    valuation_date="2024-01-01"
)
Time adjustment normalizes sale prices to a valuation date, accounting for market appreciation.
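The underlying idea can be shown with a toy monthly price index. This sketch is illustrative only; the library derives its adjustment from the data rather than from a hand-built index:

```python
# Toy monthly market index: the market rises 1% per month (month 0 = January)
index = {month: 1.01 ** month for month in range(13)}

def time_adjust(price, sale_month, valuation_month, index):
    """Restate a sale price in valuation-date dollars via an index ratio."""
    return price * index[valuation_month] / index[sale_month]

# A $200,000 sale in March (month 2) adjusted to December (month 11):
# the price is scaled up by nine months of 1% appreciation
adjusted = time_adjust(200_000, 2, 11, index)
```

Without this normalization, early-year sales would look systematically cheap relative to late-year sales, and the model would learn the market trend instead of property characteristics.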
Sales scrutiny and filtering
from openavmkit.sales_scrutiny_study import run_sales_scrutiny_per_model_group
from openavmkit.cleaning import filter_invalid_sales
# Identify problematic sales
sup = run_sales_scrutiny_per_model_group(
    sup=sup,
    settings=settings
)
# Remove invalid transactions
sup = filter_invalid_sales(sup, settings)
Sales scrutiny detects outliers, non-arm’s-length transactions, and data errors.
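As a rough illustration of the kind of screening involved (this is not the library's scrutiny algorithm), sales with extreme assessment ratios can be flagged with a simple interquartile-range rule:

```python
import numpy as np

def flag_ratio_outliers(assessed, prices, k=1.5):
    """Flag sales whose assessment ratio falls outside the IQR fences."""
    ratios = np.asarray(assessed) / np.asarray(prices)
    q1, q3 = np.percentile(ratios, [25, 75])
    iqr = q3 - q1
    return (ratios < q1 - k * iqr) | (ratios > q3 + k * iqr)

# The fourth sale's ratio of 3.0 is far outside the fences and gets flagged
flags = flag_ratio_outliers([95, 100, 105, 300], [100, 100, 100, 100])
```

Production scrutiny goes further, e.g. checking deed types, related-party transfers, and parcel characteristics, but the goal is the same: keep non-representative transactions out of the training set.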
Compare multiple models
from openavmkit.utilities.modeling import XGBoostModel, LightGBMModel, CatBoostModel
models = {
    "XGBoost": XGBoostModel,
    "LightGBM": LightGBMModel,
    "CatBoost": CatBoostModel
}
results = {}
for name, model_class in models.items():
    model = fit_model(sup, settings, model_class, features)
    predictions = model.predict(sup.sales)
    rs = RatioStudy(predictions, sup.sales["sale_price"], max_trim=0.25)
    results[name] = {"cod": rs.cod, "prd": rs.prd}
# Compare models
import pandas as pd
df_comparison = pd.DataFrame(results).T
print(df_comparison)
Test multiple algorithms to find the best fit for your data.
Add spatial features
from openavmkit.utilities.data import calc_spatial_lag
# Calculate spatial lag of sale prices
sup.sales["spatial_lag_price"] = calc_spatial_lag(
    gdf=sup.sales,
    values=sup.sales["sale_price"],
    k_neighbors=10,
    bandwidth=1000  # meters
)
# Use as a feature
features.append("spatial_lag_price")
Spatial lag captures neighborhood effects by averaging nearby property values.
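The mechanics of a k-nearest-neighbor spatial lag can be sketched with scipy's KD-tree. This is an unweighted illustration of the idea; calc_spatial_lag handles geometry and distance weighting for you:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_spatial_lag(coords, values, k=3):
    """For each point, average the values of its k nearest neighbors."""
    tree = cKDTree(coords)
    # Query k+1 neighbors because each point's nearest neighbor is itself,
    # then drop the self-match in the first column
    _, idx = tree.query(coords, k=k + 1)
    return values[idx[:, 1:]].mean(axis=1)

# Three clustered points and one distant point
coords = np.array([[0, 0], [0, 1], [1, 0], [10, 10]])
values = np.array([100.0, 200.0, 300.0, 400.0])
lag = knn_spatial_lag(coords, values, k=2)
```

Note that the distant point still receives a lag from its two nearest neighbors, however far away they are; a bandwidth cutoff (as in calc_spatial_lag) prevents such remote values from leaking into the feature.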
Next steps
Explore the notebooks
Work through the complete pipeline notebooks in notebooks/pipeline/ to see the full workflow in action.
Review the API documentation
Browse the API reference to understand available functions and their parameters.
Customize for your data
Create a new locality folder with your jurisdiction’s data and adapt the settings file to match your schema.
Build production models
Use the checkpoint system and cloud sync features to build robust, reproducible modeling pipelines.
Need help? Join the discussion on GitHub for questions and support.