The data assembly phase loads your raw data files and combines them into a unified structure called a SalesUniversePair (or sup). This structure contains two key dataframes: Universe (all parcels) and Sales (transactions).

Understanding SalesUniversePair

A SalesUniversePair is the fundamental data structure in OpenAVM Kit:
  • Universe DataFrame: All parcels in your jurisdiction with their current characteristics
  • Sales DataFrame: Transaction records with prices, dates, and historical characteristics
# Accessing the dataframes
sup = SalesUniversePair(...)
df_universe = sup["universe"]
df_sales = sup["sales"]
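The two dataframes are linked by a shared parcel key. A minimal pandas sketch of the relationship (column names here are illustrative, not OpenAVM Kit's exact schema):

```python
import pandas as pd

# Illustrative stand-ins for the two linked dataframes, assuming a
# shared "key" column (column names are examples, not the real schema).
df_universe = pd.DataFrame({
    "key": ["P1", "P2", "P3"],
    "land_area_sqft": [5000, 7500, 6200],
})
df_sales = pd.DataFrame({
    "key": ["P1", "P3"],
    "sale_price": [250000, 310000],
})

# Every sale should reference a parcel present in the universe
assert df_sales["key"].isin(df_universe["key"]).all()
```

Note that the universe holds every parcel while sales only covers parcels that transacted, so the sales frame is typically much smaller.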

Loading Data

Use load_dataframes() to read all configured data sources from your settings.
from openavmkit.pipeline import load_dataframes, load_settings

settings = load_settings()
dataframes = load_dataframes(settings, verbose=True)
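The result maps each configured source name to its DataFrame, so you can quickly inspect what was loaded (sketched here with a stand-in dict rather than real files):

```python
import pandas as pd

# Stand-in for the mapping returned by load_dataframes
# (source name -> DataFrame); real keys come from your settings.
dataframes = {
    "geo_parcels": pd.DataFrame({"key": ["P1", "P2"]}),
    "sales_records": pd.DataFrame({"key": ["P1"], "sale_price": [250000]}),
}

for name, df in dataframes.items():
    print(f"{name}: {len(df)} rows x {len(df.columns)} columns")
```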

Supported File Formats

  • GeoPackage (.gpkg) - Recommended
  • Shapefile (.shp)
  • GeoJSON (.geojson)
  • Parquet with geometry (.parquet)

Configuration Example

Define your data sources in settings.json:
settings.json
{
  "data": {
    "load": {
      "geo_parcels": {
        "path": "parcels.gpkg",
        "type": "geopackage",
        "layer": "parcels"
      },
      "property_characteristics": {
        "path": "property_data.csv",
        "type": "csv"
      },
      "sales_records": {
        "path": "sales.csv",
        "type": "csv"
      }
    }
  }
}
The geo_parcels layer is required and must contain a geometry column with parcel boundaries.
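Because the geo_parcels entry is mandatory, a quick stdlib sanity check on your settings file can catch a missing entry before a full pipeline run (a hand-rolled sketch, not an OpenAVM Kit API):

```python
import json

# Parse a settings snippet like the one above and verify the
# required geo_parcels entry exists under data.load.
settings_text = """
{
  "data": {
    "load": {
      "geo_parcels": {"path": "parcels.gpkg", "type": "geopackage", "layer": "parcels"},
      "sales_records": {"path": "sales.csv", "type": "csv"}
    }
  }
}
"""
sources = json.loads(settings_text)["data"]["load"]
assert "geo_parcels" in sources, "geo_parcels entry is required"
print(sorted(sources))  # → ['geo_parcels', 'sales_records']
```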

Processing Data

Once loaded, use process_dataframes() to merge and enrich the data:
pipeline.py:764-790
from openavmkit.pipeline import process_dataframes

sales_univ_pair = process_dataframes(
    dataframes=dataframes,
    settings=settings,
    verbose=True
)
This function:
  1. Merges dataframes using key fields
  2. Enriches geometries with calculated metrics (area, aspect ratio, etc.)
  3. Processes sales records and links them to parcels
  4. Validates data and ensures consistency
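The merge-and-link steps can be sketched in plain pandas (a simplified illustration of steps 1 and 3, not the library's actual implementation):

```python
import pandas as pd

# Toy stand-ins for the loaded sources, joined on a common "key" field
parcels = pd.DataFrame({"key": ["P1", "P2", "P3"], "neighborhood": ["A", "A", "B"]})
chars = pd.DataFrame({"key": ["P1", "P2", "P3"], "bldg_sqft": [1400, 2100, 1750]})
sales = pd.DataFrame({"key": ["P2"], "sale_price": [325000]})

# Step 1: merge data sources into a single universe frame
universe = parcels.merge(chars, on="key", how="left")

# Step 3: link sales to parcels, so each sale carries its
# parcel's characteristics alongside the transaction record
sales_linked = sales.merge(universe, on="key", how="left")

assert len(universe) == 3 and len(sales_linked) == 1
```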

What process_data() Does

The process_data() function (called internally by process_dataframes()) performs the core assembly work:
data.py
def process_data(dataframes: dict[str, pd.DataFrame], settings: dict, verbose: bool = False):
    """Process and merge dataframes according to settings.
    
    - Combines multiple data sources
    - Enriches with time fields
    - Calculates geometric properties
    - Splits into sales and universe
    """
Key operations:
  • Merge all dataframes on common keys (typically key or parcel_id)
  • Calculate land area from GIS geometries
  • Enrich time-based fields for sales dates
  • Separate sales transactions from universe parcels
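"Calculate land area from GIS geometries" boils down to a planar polygon area computation; the shoelace formula below shows the idea (geometry libraries such as GeoPandas expose this as the geometry column's `area` property):

```python
def polygon_area(coords):
    """Planar polygon area via the shoelace formula.

    coords: list of (x, y) vertices in order around the boundary.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# A 100 x 50 rectangular parcel in projected (e.g. feet) coordinates
print(polygon_area([(0, 0), (100, 0), (100, 50), (0, 50)]))  # → 5000.0
```

Areas are only meaningful in a projected coordinate system; computing them on raw latitude/longitude gives distorted results.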

Examining Your Data

After assembly, inspect the results to verify everything loaded correctly:
from openavmkit.pipeline import examine_sup

examine_sup(sales_univ_pair, settings)
Output shows:
  • Field names and types
  • Non-zero and non-null counts
  • Unique value counts for categorical fields
  • Distribution statistics
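The same checks can be reproduced with plain pandas if you want to script your own validation (a sketch with illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["P1", "P2", "P3"],
    "bldg_sqft": [1400.0, None, 1750.0],
    "neighborhood": ["A", "A", "B"],
})

print(df.dtypes)                     # field names and types
print(df.notna().sum())              # non-null counts per field
print(df["neighborhood"].nunique())  # unique values in a categorical field
print(df["bldg_sqft"].gt(0).sum())   # non-zero, non-null count
```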

Detailed Examination

For deeper inspection, use the detailed examination function:
from openavmkit.pipeline import examine_sup_in_ridiculous_detail

examine_sup_in_ridiculous_detail(sales_univ_pair, settings)
This displays:
  • describe() statistics for all numeric fields
  • value_counts() for all categorical fields
  • Data quality metrics

Example Workflow

Here’s a complete example from the 01-assemble.ipynb notebook:
1. Load settings and initialize
settings = load_settings()
2. Load all dataframes
dataframes = from_checkpoint("1-assemble-01-load_dataframes", load_dataframes,
    {
        "settings": load_settings(),
        "verbose": verbose
    }
)
3. Process into SalesUniversePair
sales_univ_pair = from_checkpoint("1-assemble-02-process_data", process_dataframes,
    {
        "dataframes": dataframes,
        "settings": load_settings(),
        "verbose": verbose
    }
)
4. Enrich with street data (optional)
sales_univ_pair = from_checkpoint("1-assemble-03-enrich_streets", enrich_sup_streets,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)
5. Tag model groups
sales_univ_pair = from_checkpoint("1-assemble-04-tag_modeling_groups", tag_model_groups_sup,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)
6. Write outputs
write_notebook_output_sup(
    sales_univ_pair,
    "1-assemble",
    parquet=True,
    gpkg=False,
    shp=False
)
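Each from_checkpoint call caches its step's result, so re-running the notebook skips work that already completed. The pattern is roughly this (a simplified sketch, not OpenAVM Kit's actual implementation):

```python
import os
import pickle

def from_checkpoint(name, func, params, cache_dir="out/checkpoints"):
    """Return a cached result if one exists; otherwise compute and cache it.

    Simplified sketch of checkpoint caching: the real library may use a
    different cache location, serialization format, and invalidation rules.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}.pickle")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = func(**params)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Delete a checkpoint file (or the cache directory) to force a step to re-run after changing its inputs.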

Output Files

The assembly process writes several output files to out/look/:
  • 1-assemble-universe.parquet - All parcels with current characteristics
  • 1-assemble-sales.parquet - Sales metadata only
  • 1-assemble-sales-hydrated.parquet - Sales with full parcel characteristics
  • 1-assemble-sup.pickle - Complete SalesUniversePair object
Load these files in QGIS, ArcGIS, or Felt to visualize your data on a map and verify spatial accuracy.

Common Issues

Missing geo_parcels layer

The geo_parcels layer is required and must be defined in your settings:
"geo_parcels": {
  "path": "parcels.gpkg",
  "type": "geopackage"
}
Ensure the file it points to contains a geometry column with valid polygon or multipolygon geometries.

Mismatched merge keys

Verify that all dataframes share a common key field (typically key or parcel_id). Configure merge keys in your settings if using custom field names.
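To diagnose key mismatches before assembly, a left merge with pandas' indicator flag reveals which sales fail to match any parcel (a hand-rolled check, not an OpenAVM Kit function):

```python
import pandas as pd

# Toy data: sale "P9" references a parcel that doesn't exist
parcels = pd.DataFrame({"key": ["P1", "P2", "P3"]})
sales = pd.DataFrame({"key": ["P2", "P9"], "sale_price": [325000, 410000]})

# indicator=True adds a "_merge" column flagging unmatched rows
merged = sales.merge(parcels, on="key", how="left", indicator=True)
orphans = merged.loc[merged["_merge"] == "left_only", "key"].tolist()
print(orphans)  # → ['P9']
```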

Next Steps

Data Cleaning

Clean and validate your assembled data

Modeling

Build predictive models with your data
