The data assembly phase loads your raw data files and combines them into a unified structure called a SalesUniversePair (or sup). This structure contains two key dataframes: Universe (all parcels) and Sales (transactions).

Understanding SalesUniversePair

A SalesUniversePair is the fundamental data structure in OpenAVM Kit:
  • Universe DataFrame: All parcels in your jurisdiction with their current characteristics
  • Sales DataFrame: Transaction records with prices, dates, and historical characteristics
# Accessing the dataframes
sup = SalesUniversePair(...)
df_universe = sup["universe"]
df_sales = sup["sales"]
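The two dataframes are linked by a shared parcel key. A minimal pandas sketch of the relationship (column names here are illustrative, not OpenAVM Kit's exact schema):

```python
import pandas as pd

# Illustrative stand-ins for the two linked dataframes, assuming a
# shared "key" column (column names are examples, not the real schema).
df_universe = pd.DataFrame({
    "key": ["P1", "P2", "P3"],
    "land_area_sqft": [5000, 7500, 6200],
})
df_sales = pd.DataFrame({
    "key": ["P1", "P3"],
    "sale_price": [250000, 310000],
})

# Every sale should reference a parcel present in the universe
assert df_sales["key"].isin(df_universe["key"]).all()
```

Note that the universe holds every parcel while sales only covers parcels that transacted, so the sales frame is typically much smaller.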

Loading Data

Use load_dataframes() to read all configured data sources from your settings.
from openavmkit.pipeline import load_dataframes, load_settings

settings = load_settings()
dataframes = load_dataframes(settings, verbose=True)
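The result maps each configured source name to its DataFrame, so you can quickly inspect what was loaded (sketched here with a stand-in dict rather than real files):

```python
import pandas as pd

# Stand-in for the mapping returned by load_dataframes
# (source name -> DataFrame); real keys come from your settings.
dataframes = {
    "geo_parcels": pd.DataFrame({"key": ["P1", "P2"]}),
    "sales_records": pd.DataFrame({"key": ["P1"], "sale_price": [250000]}),
}

for name, df in dataframes.items():
    print(f"{name}: {len(df)} rows x {len(df.columns)} columns")
```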

Supported File Formats

  • GeoPackage (.gpkg) - Recommended
  • Shapefile (.shp)
  • GeoJSON (.geojson)
  • Parquet with geometry (.parquet)

Configuration Example

Define your data sources in settings.json:
settings.json
{
  "data": {
    "load": {
      "geo_parcels": {
        "path": "parcels.gpkg",
        "type": "geopackage",
        "layer": "parcels"
      },
      "property_characteristics": {
        "path": "property_data.csv",
        "type": "csv"
      },
      "sales_records": {
        "path": "sales.csv",
        "type": "csv"
      }
    }
  }
}
The geo_parcels layer is required and must contain a geometry column with parcel boundaries.
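Because the geo_parcels entry is mandatory, a quick stdlib sanity check on your settings file can catch a missing entry before a full pipeline run (a hand-rolled sketch, not an OpenAVM Kit API):

```python
import json

# Parse a settings snippet like the one above and verify the
# required geo_parcels entry exists under data.load.
settings_text = """
{
  "data": {
    "load": {
      "geo_parcels": {"path": "parcels.gpkg", "type": "geopackage", "layer": "parcels"},
      "sales_records": {"path": "sales.csv", "type": "csv"}
    }
  }
}
"""
sources = json.loads(settings_text)["data"]["load"]
assert "geo_parcels" in sources, "geo_parcels entry is required"
print(sorted(sources))  # → ['geo_parcels', 'sales_records']
```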

Processing Data

Once loaded, use process_dataframes() to merge and enrich the data:
pipeline.py:764-790
from openavmkit.pipeline import process_dataframes

sales_univ_pair = process_dataframes(
    dataframes=dataframes,
    settings=settings,
    verbose=True
)
This function:
  1. Merges dataframes using key fields
  2. Enriches geometries with calculated metrics (area, aspect ratio, etc.)
  3. Processes sales records and links them to parcels
  4. Validates data and ensures consistency
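The merge-and-link steps can be sketched in plain pandas (a simplified illustration of steps 1 and 3, not the library's actual implementation):

```python
import pandas as pd

# Toy stand-ins for the loaded sources, joined on a common "key" field
parcels = pd.DataFrame({"key": ["P1", "P2", "P3"], "neighborhood": ["A", "A", "B"]})
chars = pd.DataFrame({"key": ["P1", "P2", "P3"], "bldg_sqft": [1400, 2100, 1750]})
sales = pd.DataFrame({"key": ["P2"], "sale_price": [325000]})

# Step 1: merge data sources into a single universe frame
universe = parcels.merge(chars, on="key", how="left")

# Step 3: link sales to parcels, so each sale carries its
# parcel's characteristics alongside the transaction record
sales_linked = sales.merge(universe, on="key", how="left")

assert len(universe) == 3 and len(sales_linked) == 1
```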

What process_data() Does

The process_data() function (called internally by process_dataframes()) performs the core assembly work:
data.py
def process_data(dataframes: dict[str, pd.DataFrame], settings: dict, verbose: bool = False):
    """Process and merge dataframes according to settings.
    
    - Combines multiple data sources
    - Enriches with time fields
    - Calculates geometric properties
    - Splits into sales and universe
    """
Key operations:
  • Merge all dataframes on common keys (typically key or parcel_id)
  • Calculate land area from GIS geometries
  • Enrich time-based fields for sales dates
  • Separate sales transactions from universe parcels
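"Calculate land area from GIS geometries" boils down to a planar polygon area computation; the shoelace formula below shows the idea (geometry libraries such as GeoPandas expose this as the geometry column's `area` property):

```python
def polygon_area(coords):
    """Planar polygon area via the shoelace formula.

    coords: list of (x, y) vertices in order around the boundary.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# A 100 x 50 rectangular parcel in projected (e.g. feet) coordinates
print(polygon_area([(0, 0), (100, 0), (100, 50), (0, 50)]))  # → 5000.0
```

Areas are only meaningful in a projected coordinate system; computing them on raw latitude/longitude gives distorted results.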

Examining Your Data

After assembly, inspect the results to verify everything loaded correctly:
from openavmkit.pipeline import examine_sup

examine_sup(sales_univ_pair, settings)
Output shows:
  • Field names and types
  • Non-zero and non-null counts
  • Unique value counts for categorical fields
  • Distribution statistics
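The same checks can be reproduced with plain pandas if you want to script your own validation (a sketch with illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["P1", "P2", "P3"],
    "bldg_sqft": [1400.0, None, 1750.0],
    "neighborhood": ["A", "A", "B"],
})

print(df.dtypes)                     # field names and types
print(df.notna().sum())              # non-null counts per field
print(df["neighborhood"].nunique())  # unique values in a categorical field
print(df["bldg_sqft"].gt(0).sum())   # non-zero, non-null count
```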

Detailed Examination

For deeper inspection, use the detailed examination function:
from openavmkit.pipeline import examine_sup_in_ridiculous_detail

examine_sup_in_ridiculous_detail(sales_univ_pair, settings)
This displays:
  • describe() statistics for all numeric fields
  • value_counts() for all categorical fields
  • Data quality metrics

Example Workflow

Here’s a complete example from the 01-assemble.ipynb notebook:
1. Load settings and initialize
settings = load_settings()
2. Load all dataframes
dataframes = from_checkpoint("1-assemble-01-load_dataframes", load_dataframes,
    {
        "settings": load_settings(),
        "verbose": verbose
    }
)
3. Process into SalesUniversePair
sales_univ_pair = from_checkpoint("1-assemble-02-process_data", process_dataframes,
    {
        "dataframes": dataframes,
        "settings": load_settings(),
        "verbose": verbose
    }
)
4. Enrich with street data (optional)
sales_univ_pair = from_checkpoint("1-assemble-03-enrich_streets", enrich_sup_streets,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)
5. Tag model groups
sales_univ_pair = from_checkpoint("1-assemble-04-tag_modeling_groups", tag_model_groups_sup,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)
6. Write outputs
write_notebook_output_sup(
    sales_univ_pair,
    "1-assemble",
    parquet=True,
    gpkg=False,
    shp=False
)
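Each from_checkpoint call caches its step's result, so re-running the notebook skips work that already completed. The pattern is roughly this (a simplified sketch, not OpenAVM Kit's actual implementation):

```python
import os
import pickle

def from_checkpoint(name, func, params, cache_dir="out/checkpoints"):
    """Return a cached result if one exists; otherwise compute and cache it.

    Simplified sketch of checkpoint caching: the real library may use a
    different cache location, serialization format, and invalidation rules.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}.pickle")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = func(**params)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Delete a checkpoint file (or the cache directory) to force a step to re-run after changing its inputs.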

Output Files

The assembly process writes several output files to out/look/:
  • 1-assemble-universe.parquet - All parcels with current characteristics
  • 1-assemble-sales.parquet - Sales metadata only
  • 1-assemble-sales-hydrated.parquet - Sales with full parcel characteristics
  • 1-assemble-sup.pickle - Complete SalesUniversePair object
Load these files in QGIS, ArcGIS, or Felt to visualize your data on a map and verify spatial accuracy.

Common Issues

Missing geo_parcels layer

The geo_parcels layer is required and must be defined in your settings:
"geo_parcels": {
  "path": "parcels.gpkg",
  "type": "geopackage"
}
Ensure the file it points to contains a geometry column with valid polygon or multipolygon geometries.

Mismatched merge keys

Verify that all dataframes share a common key field (typically key or parcel_id). Configure merge keys in your settings if using custom field names.
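To diagnose key mismatches before assembly, a left merge with pandas' indicator flag reveals which sales fail to match any parcel (a hand-rolled check, not an OpenAVM Kit function):

```python
import pandas as pd

# Toy data: sale "P9" references a parcel that doesn't exist
parcels = pd.DataFrame({"key": ["P1", "P2", "P3"]})
sales = pd.DataFrame({"key": ["P2", "P9"], "sale_price": [325000, 410000]})

# indicator=True adds a "_merge" column flagging unmatched rows
merged = sales.merge(parcels, on="key", how="left", indicator=True)
orphans = merged.loc[merged["_merge"] == "left_only", "key"].tolist()
print(orphans)  # → ['P9']
```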

Next Steps

Data Cleaning

Clean and validate your assembled data

Modeling

Build predictive models with your data
