SalesUniversePair (or sup). This structure contains two key dataframes: Universe (all parcels) and Sales (transactions).
Understanding SalesUniversePair
A SalesUniversePair is the fundamental data structure in OpenAVM Kit:
- Universe DataFrame: All parcels in your jurisdiction with their current characteristics
- Sales DataFrame: Transaction records with prices, dates, and historical characteristics
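Conceptually, the pair is a container holding the two DataFrames together. Below is an illustrative sketch, not the library's actual class definition; the `hydrated_sales` helper and column names (`key`, `land_area`, `sale_price`) are assumptions for demonstration:

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class SalesUniversePairSketch:
    """Illustrative stand-in for OpenAVM Kit's SalesUniversePair."""
    universe: pd.DataFrame  # one row per parcel, current characteristics
    sales: pd.DataFrame     # one row per transaction: price, date, etc.

    def hydrated_sales(self) -> pd.DataFrame:
        # Join each sale to its parcel's characteristics via the shared key
        return self.sales.merge(self.universe, on="key", suffixes=("", "_univ"))

universe = pd.DataFrame({"key": ["p1", "p2"], "land_area": [5000, 7200]})
sales = pd.DataFrame({"key": ["p1"], "sale_price": [250_000]})
sup = SalesUniversePairSketch(universe=universe, sales=sales)
print(sup.hydrated_sales()[["key", "sale_price", "land_area"]])
```

Keeping the two tables in one object ensures sales and parcels always travel together through the pipeline with a consistent merge key.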
Loading Data
Use load_dataframes() to read all configured data sources from your settings.
Supported File Formats
Geospatial:
- GeoPackage (.gpkg) - Recommended
- Shapefile (.shp)
- GeoJSON (.geojson)
- Parquet with geometry (.parquet)
Configuration Example
Define your data sources in settings.json:
settings.json
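A minimal sketch of what such a file might contain. Only the geo_parcels layer name is confirmed by this guide; every other key and path below is an illustrative assumption, so consult your own settings schema:

```json
{
  "data": {
    "geo_parcels": {
      "path": "in/geo/parcels.gpkg"
    },
    "sales": {
      "path": "in/sales.csv"
    }
  }
}
```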
The geo_parcels layer is required and must contain a geometry column with parcel boundaries.
Processing Data
Once loaded, use process_dataframes() to merge and enrich the data:
pipeline.py:764-790
- Merges dataframes using key fields
- Enriches geometries with calculated metrics (area, aspect ratio, etc.)
- Processes sales records and links them to parcels
- Validates data and ensures consistency
What process_data() Does
The process_data() function (called internally by process_dataframes()) performs:
data.py
- Merge all dataframes on common keys (typically key or parcel_id)
- Calculate land area from GIS geometries
- Enrich time-based fields for sales dates
- Separate sales transactions from universe parcels
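In pandas terms, the steps above amount to something like the following sketch. The column names (sale_price, sale_date) and derived fields are assumptions for illustration, not the library's actual field names:

```python
import pandas as pd

# Two illustrative source tables sharing the common "key" field
parcels = pd.DataFrame({"key": ["p1", "p2", "p3"], "bldg_sqft": [1200, 1550, 980]})
sales = pd.DataFrame({
    "key": ["p1", "p3"],
    "sale_price": [250_000, 190_000],
    "sale_date": pd.to_datetime(["2021-06-15", "2023-02-01"]),
})

# 1. Merge dataframes on the common key
merged = parcels.merge(sales, on="key", how="left")

# 2. (Land area from GIS geometries would be computed here, via the geometry column)

# 3. Enrich time-based fields for sales dates
merged["sale_year"] = merged["sale_date"].dt.year
merged["sale_age_days"] = (pd.Timestamp("2024-01-01") - merged["sale_date"]).dt.days

# 4. Separate sales transactions from universe parcels
sales_df = merged[merged["sale_price"].notna()]
universe_df = parcels  # the universe keeps every parcel, sold or not
print(len(universe_df), len(sales_df))
```

Note that the universe keeps all three parcels while the sales table only holds rows with an actual transaction.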
Examining Your Data
After assembly, inspect the results to verify everything loaded correctly:
- Field names and types
- Non-zero and non-null counts
- Unique value counts for categorical fields
- Distribution statistics
Detailed Examination
For deeper inspection, use the detailed examination function, which reports:
- describe() statistics for all numeric fields
- value_counts() for all categorical fields
- Data quality metrics
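The same checks can be done by hand with plain pandas. This is a sketch of the equivalent inspection, not the library's own examination function:

```python
import pandas as pd

df = pd.DataFrame({
    "sale_price": [250_000, 190_000, None],
    "neighborhood": ["north", "north", "south"],
})

# describe() statistics for numeric fields
print(df["sale_price"].describe())

# value_counts() for categorical fields
print(df["neighborhood"].value_counts())

# Data quality: non-null counts per column
print(df.notna().sum())
```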
Example Workflow
Here’s a complete example from the 01-assemble.ipynb notebook:
```python
dataframes = from_checkpoint("1-assemble-01-load_dataframes", load_dataframes,
    {
        "settings": load_settings(),
        "verbose": verbose
    }
)
sales_univ_pair = from_checkpoint("1-assemble-02-process_data", process_dataframes,
    {
        "dataframes": dataframes,
        "settings": load_settings(),
        "verbose": verbose
    }
)
sales_univ_pair = from_checkpoint("1-assemble-03-enrich_streets", enrich_sup_streets,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)
sales_univ_pair = from_checkpoint("1-assemble-04-tag_modeling_groups", tag_model_groups_sup,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)
```
Output Files
The assembly process writes several output files to out/look/:
- 1-assemble-universe.parquet - All parcels with current characteristics
- 1-assemble-sales.parquet - Sales metadata only
- 1-assemble-sales-hydrated.parquet - Sales with full parcel characteristics
- 1-assemble-sup.pickle - Complete SalesUniversePair object
Common Issues
Missing 'geo_parcels' error
The geo_parcels layer is required and must be defined in your settings.
No geometry column found
Ensure your geo_parcels file contains a geometry column with valid polygon or multipolygon geometries.
Merge key mismatches
Verify that all dataframes share a common key field (typically key or parcel_id). Configure merge keys in your settings if using custom field names.
Next Steps
Data Cleaning
Clean and validate your assembled data
Modeling
Build predictive models with your data