SalesUniversePair
The SalesUniversePair is the fundamental data structure in OpenAVM Kit. Nearly every function in the library operates on or returns this structure.
Defined in openavmkit/data.py:96, SalesUniversePair is a Python dataclass that bundles two related DataFrames that need to be processed in tandem.
Structure definition
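A minimal sketch of such a dataclass, assuming it holds exactly two fields named sales and universe. The class name PairSketch is a hypothetical stand-in used for illustration; it is not the code at openavmkit/data.py:96.

```python
from dataclasses import dataclass, fields

import pandas as pd


@dataclass
class PairSketch:
    """Hypothetical stand-in for SalesUniversePair; field names are assumed."""

    sales: pd.DataFrame     # one row per transaction
    universe: pd.DataFrame  # one row per parcel


# Inspect the declared fields without building any data.
field_names = [f.name for f in fields(PairSketch)]
```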
The two DataFrames
Sales DataFrame
Purpose: Contains transaction records with known sale prices
Key characteristics:
- Represents transactions and any known data at the time of the transaction
- Allows duplicate parcel keys since a parcel may have sold multiple times
- Each row has a unique key_sale identifier
- Used for training and validating predictive models
Fields:
- key_sale - Unique transaction identifier
- key - Parcel identifier (can appear multiple times)
- sale_price - Transaction price
- sale_date - When the transaction occurred
- valid_sale - Boolean indicating if sale should be used for modeling
- vacant_sale - Boolean indicating if parcel was vacant at time of sale
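To make the shape concrete, here is a small sales DataFrame built with plain pandas. The column names come from the list above; the key values, prices, and dates are invented for illustration.

```python
import pandas as pd

# Illustrative rows only: parcel P1 sold twice, parcel P2 sold once.
sales = pd.DataFrame(
    {
        "key_sale": ["S1", "S2", "S3"],
        "key": ["P1", "P1", "P2"],
        "sale_price": [210_000, 265_000, 180_000],
        "sale_date": pd.to_datetime(["2019-05-01", "2023-08-15", "2022-03-10"]),
        "valid_sale": [True, True, False],
        "vacant_sale": [False, False, True],
    }
)

# Parcel keys may repeat, but every transaction key must be unique.
has_repeat_parcels = bool(sales["key"].duplicated().any())
unique_sale_keys = bool(sales["key_sale"].is_unique)

# Only rows flagged as valid would be used for modeling.
modeling_sales = sales[sales["valid_sale"]]
```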
Universe DataFrame
Purpose: Contains the current state of all parcels in the jurisdiction
Key characteristics:
- Represents the current state of all parcels
- Forbids duplicate parcel keys - each parcel appears exactly once
- Each row has a unique key identifier
- This is the dataset we generate predictions for
Fields:
- key - Unique parcel identifier
- is_vacant - Boolean indicating current vacancy status
- model_group - Classification (residential, commercial, etc.)
- Various characteristics (land area, building area, zoning, etc.)
Why this structure exists
The SalesUniversePair structure is necessary because:
- Consistency: Sales and universe data need to be processed together to ensure consistency in field calculations, transformations, and enrichments
- Historical context: Sales represent historical transactions with characteristics at the time of sale, while universe represents current parcel state
- Overlays: The sales DataFrame acts as an “overlay” on the universe, containing only transaction-specific information without duplicating parcel characteristics
Many functions “hydrate” sales by merging them with universe data. This combines the transaction information with the full parcel characteristics.
Key operations
Creating a SalesUniversePair
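A hedged sketch of building a pair from two DataFrames. PairSketch is a hypothetical stand-in with assumed field names sales and universe, so the example runs without OpenAVM Kit installed; the real constructor may differ.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class PairSketch:
    """Hypothetical stand-in for SalesUniversePair."""

    sales: pd.DataFrame
    universe: pd.DataFrame


# Universe: one row per parcel, no duplicate keys.
universe = pd.DataFrame(
    {
        "key": ["P1", "P2", "P3"],
        "is_vacant": [False, True, False],
        "model_group": ["residential", "residential", "commercial"],
    }
)

# Sales: one row per transaction, parcel keys may repeat.
sales = pd.DataFrame(
    {
        "key_sale": ["S1", "S2"],
        "key": ["P1", "P1"],
        "sale_price": [210_000, 265_000],
    }
)

sup = PairSketch(sales=sales, universe=universe)
```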
Accessing DataFrames
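A sketch of reading the two DataFrames back out, assuming they are exposed as attributes named sales and universe (PairSketch is a hypothetical stand-in). As with any dataclass field, attribute access returns the stored object rather than a copy, so in-place edits to the result are visible through the pair.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class PairSketch:
    """Hypothetical stand-in for SalesUniversePair."""

    sales: pd.DataFrame
    universe: pd.DataFrame


sup = PairSketch(
    sales=pd.DataFrame({"key_sale": ["S1", "S2", "S3"], "key": ["P1", "P1", "P2"]}),
    universe=pd.DataFrame({"key": ["P1", "P2"], "is_vacant": [False, True]}),
)

df_sales = sup.sales
df_univ = sup.universe

# Attribute access hands back the stored DataFrame, not a copy.
same_object = df_sales is sup.sales

# Look up every recorded sale for one parcel.
p1_sales = df_sales[df_sales["key"] == "P1"]
```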
Modifying DataFrames
Using set()
Replace an entire DataFrame:
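A sketch of what a whole-frame replacement could look like. The set(key, value) signature and the error raised on an unknown key are assumptions made for this stand-in class, not a statement of the library's API.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class PairSketch:
    """Hypothetical stand-in; the set() signature is assumed."""

    sales: pd.DataFrame
    universe: pd.DataFrame

    def set(self, key: str, value: pd.DataFrame) -> None:
        # Swap out one of the two frames wholesale.
        if key not in ("sales", "universe"):
            raise ValueError(f"Invalid key: {key}")
        setattr(self, key, value)


sup = PairSketch(
    sales=pd.DataFrame({"key_sale": ["S1"], "key": ["P1"]}),
    universe=pd.DataFrame({"key": ["P1"], "is_vacant": [False]}),
)

# Build an enriched universe, then replace the stored one in a single step.
enriched = sup.universe.assign(land_area=[5000])
sup.set("universe", enriched)
```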
Using update_sales()
Update the sales DataFrame as an overlay without redundancy.
This function:
- Preserves existing fields from the original sales DataFrame
- Adds only new fields generated in the update
- Avoids duplicating information already in universe
- Optionally filters rows based on allow_remove_rows
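The four behaviours listed above can be approximated with plain pandas. The function below, overlay_update, is a hypothetical illustration of the idea and not the library's update_sales() implementation; in particular, how the real method treats rows missing from the update when allow_remove_rows is false is not covered here.

```python
import pandas as pd


def overlay_update(sales, universe, new_sales, allow_remove_rows=False):
    """Hypothetical sketch: add only genuinely new fields to sales."""
    # Fields already in sales or in universe (other than key) are not re-added.
    known = set(sales.columns) | (set(universe.columns) - {"key"})
    new_fields = [c for c in new_sales.columns if c not in known]

    base = sales
    if allow_remove_rows:
        # Keep only transactions that survived in the updated frame.
        base = sales[sales["key_sale"].isin(new_sales["key_sale"])]

    return base.merge(new_sales[["key_sale"] + new_fields], on="key_sale", how="left")


sales = pd.DataFrame(
    {"key_sale": ["S1", "S2", "S3"], "key": ["P1", "P1", "P2"],
     "sale_price": [200_000, 250_000, 180_000]}
)
universe = pd.DataFrame({"key": ["P1", "P2"], "bldg_area": [1000, 900]})

# An updated frame: hydrated with bldg_area, one row dropped, one field added.
new_sales = pd.DataFrame(
    {"key_sale": ["S1", "S3"], "key": ["P1", "P2"],
     "sale_price": [200_000, 180_000], "bldg_area": [1000, 900],
     "price_per_area": [200.0, 200.0]}
)

trimmed = overlay_update(sales, universe, new_sales, allow_remove_rows=True)
kept = overlay_update(sales, universe, new_sales, allow_remove_rows=False)
```

With allow_remove_rows=True the dropped transaction disappears from the result; with False it stays and the new field is left empty for that row.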
Using copy()
Create a deep copy of the entire structure:
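A sketch of deep-copy behaviour on a hypothetical stand-in class. It assumes copy() duplicates both DataFrames, which pandas does by default with DataFrame.copy(), so edits to the copy leave the original untouched.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class PairSketch:
    """Hypothetical stand-in; copy() semantics are assumed."""

    sales: pd.DataFrame
    universe: pd.DataFrame

    def copy(self) -> "PairSketch":
        # DataFrame.copy() is deep by default, so the data is duplicated.
        return PairSketch(self.sales.copy(), self.universe.copy())


sup = PairSketch(
    sales=pd.DataFrame({"key_sale": ["S1"], "key": ["P1"], "sale_price": [200_000]}),
    universe=pd.DataFrame({"key": ["P1"], "is_vacant": [False]}),
)

clone = sup.copy()
clone.sales.loc[0, "sale_price"] = 1  # modify the copy only

original_price = int(sup.sales.loc[0, "sale_price"])
```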
Hydrating sales data
The get_hydrated_sales_from_sup() function merges sales and universe data:
- Takes the universe DataFrame and filters to parcels that have sales
- Merges universe data with sales data
- Sales data overrides universe data where conflicts exist
- Returns a GeoDataFrame if geometry is present
This creates a “complete” sales DataFrame with all parcel characteristics at the time of sale.
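The merge steps above can be sketched in plain pandas. hydrate_sketch is a hypothetical helper written for illustration, not the body of get_hydrated_sales_from_sup(); it ignores geometry and resolves conflicts by dropping the universe copy of any shared column so the sales value wins.

```python
import pandas as pd


def hydrate_sketch(sales: pd.DataFrame, universe: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of hydration: sales fields win on conflict."""
    # Columns present in both frames (other than the join key) are conflicts.
    shared = [c for c in universe.columns if c in sales.columns and c != "key"]
    # Keep only parcels that have at least one sale, minus conflicting columns.
    sold = universe[universe["key"].isin(sales["key"])].drop(columns=shared)
    # One output row per transaction, enriched with parcel characteristics.
    return sales.merge(sold, on="key", how="left")


sales = pd.DataFrame(
    {"key_sale": ["S1", "S2"], "key": ["P1", "P1"],
     "sale_price": [200_000, 250_000],
     "bldg_area": [1000, 1000]}  # building area as recorded at time of sale
)
universe = pd.DataFrame(
    {"key": ["P1", "P2"], "bldg_area": [1500, 900], "land_area": [5000, 4000]}
)

hydrated = hydrate_sketch(sales, universe)
```

Here bldg_area keeps the value stored in sales, while land_area is filled in from universe.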
Related data structures
TimingData
Used internally to track performance metrics during data processing.
TreeBasedCategoricalData
Stores categorical variable encodings for tree-based machine learning models.
Model result structures
Various model classes return structured results containing:
- Trained model objects
- Predictions
- Performance metrics
- Feature importance
- SHAP values (for tree-based models)
Data flow through the pipeline
Best practices
Keep sales as an overlay
Don’t duplicate universe fields in sales unless they differ at time of sale. Let hydration merge them when needed.
Use update_sales() for incremental changes
When adding new calculated fields to sales, use update_sales() rather than set() to maintain the overlay structure.
Validate keys
Ensure universe has unique keys and sales has valid key references to universe parcels.
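One way to run these checks with plain pandas is sketched below. check_keys is a hypothetical helper, not a library function; it returns a list of problems instead of raising so that all issues can be reported at once.

```python
import pandas as pd


def check_keys(sales: pd.DataFrame, universe: pd.DataFrame) -> list:
    """Hypothetical validator: return a list of key problems (empty if clean)."""
    problems = []
    if not universe["key"].is_unique:
        problems.append("universe has duplicate key values")
    if not sales["key_sale"].is_unique:
        problems.append("sales has duplicate key_sale values")
    # Sales rows whose parcel key does not exist in universe.
    missing = ~sales["key"].isin(universe["key"])
    orphans = sorted(set(sales.loc[missing, "key"]))
    if orphans:
        problems.append(f"sales references unknown parcels: {orphans}")
    return problems


good_universe = pd.DataFrame({"key": ["P1", "P2"]})
bad_universe = pd.DataFrame({"key": ["P1", "P1"]})
sales = pd.DataFrame({"key_sale": ["S1", "S2"], "key": ["P1", "P2"]})

clean = check_keys(sales, good_universe)
dirty = check_keys(sales, bad_universe)
```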
Handle geometry carefully
If working with spatial data, ensure both DataFrames are GeoDataFrames with consistent CRS:
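A sketch using GeoPandas, assuming both frames carry a geometry column. align_crs is a hypothetical helper, and reprojecting sales to match universe (rather than the other way around) is a choice made for this example only.

```python
import geopandas as gpd
from shapely.geometry import Point


def align_crs(sales, universe):
    """Hypothetical helper: return both frames sharing the universe CRS."""
    for name, df in (("sales", sales), ("universe", universe)):
        if not isinstance(df, gpd.GeoDataFrame):
            raise TypeError(f"{name} is not a GeoDataFrame")
        if df.crs is None:
            raise ValueError(f"{name} has no CRS set")
    if sales.crs != universe.crs:
        # Reproject sales so both frames use the same coordinate system.
        sales = sales.to_crs(universe.crs)
    return sales, universe


universe = gpd.GeoDataFrame(
    {"key": ["P1"]}, geometry=[Point(-97.74, 30.27)], crs="EPSG:4326"
)
# Sales deliberately stored in a different CRS to show the mismatch.
sales = gpd.GeoDataFrame(
    {"key_sale": ["S1"], "key": ["P1"]},
    geometry=[Point(-97.74, 30.27)],
    crs="EPSG:4326",
).to_crs("EPSG:3857")

sales_aligned, universe_aligned = align_crs(sales, universe)
```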