The data module provides core data structures and functions for loading, processing, and enriching assessment and sales data.

Core Data Structures

SalesUniversePair

A container for the sales and universe DataFrames. Many functions operate on this data structure.
from openavmkit.data import SalesUniversePair

sup = SalesUniversePair(sales=df_sales, universe=df_universe)

Parameters:

sales : pd.DataFrame (required)
    DataFrame containing sales data.
universe : pd.DataFrame (required)
    DataFrame containing universe (parcel) data.

Methods

copy()

Create a copy of the SalesUniversePair object.

sup_copy = sup.copy()

Returns:

sup_copy : SalesUniversePair
    A new SalesUniversePair object with copied DataFrames.
set(key, value)

Set the sales or universe DataFrame.

sup.set("sales", new_sales_df)
sup.set("universe", new_universe_df)

Parameters:

key : str (required)
    Either "sales" or "universe".
value : pd.DataFrame (required)
    The new DataFrame to set for the specified key.
update_sales(new_sales, allow_remove_rows)

Update the sales DataFrame with new information, applied as an overlay so that unchanged data is not duplicated.

sup.update_sales(new_sales_df, allow_remove_rows=True)

Parameters:

new_sales : pd.DataFrame (required)
    New sales DataFrame with updates.
allow_remove_rows : bool (required)
    If True, the update may remove rows from sales. If False, all original rows are preserved.
limit_sales_to_keys(new_sale_keys)

Filter the sales DataFrame down to rows whose key appears in new_sale_keys.

sup.limit_sales_to_keys(["sale_123", "sale_456", "sale_789"])

Parameters:

new_sale_keys : list[str] (required)
    List of sale keys to keep.
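
The overlay semantics of update_sales can be illustrated with a toy sketch. Plain dicts keyed by sale id stand in for the DataFrames; the real method operates on pandas objects and its exact merge rules may differ, so treat this only as an illustration of the allow_remove_rows contract described above:

```python
# Toy illustration of an "overlay" update keyed by sale id.
# Dicts stand in for the sales DataFrame; this is NOT the
# actual openavmkit implementation.

def overlay_update(sales, new_sales, allow_remove_rows):
    """Overlay new_sales onto sales without duplicating rows."""
    if allow_remove_rows:
        # The update defines the surviving sales set: dropped keys go away,
        # but original field values not overridden by the update are kept.
        return {
            key: {**sales.get(key, {}), **record}
            for key, record in new_sales.items()
        }
    # Preserve every original row; only overlay the updated fields.
    merged = {k: dict(v) for k, v in sales.items()}
    for key, record in new_sales.items():
        merged[key] = {**merged.get(key, {}), **record}
    return merged

sales = {
    "sale_1": {"price": 100_000, "valid": True},
    "sale_2": {"price": 250_000, "valid": True},
}
updates = {"sale_1": {"valid": False}}

kept = overlay_update(sales, updates, allow_remove_rows=False)
pruned = overlay_update(sales, updates, allow_remove_rows=True)
print(sorted(kept))    # both original rows survive
print(sorted(pruned))  # only keys present in the update survive
```

Note how the non-destructive path keeps sale_2 untouched while still applying the field-level change to sale_1.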

Data Loading Functions

load_dataframe()

Load a single DataFrame based on configuration settings.
from openavmkit.data import load_dataframe

df = load_dataframe(
    entry,
    settings,
    verbose=True,
    fields_cat=categorical_fields,
    fields_bool=boolean_fields,
    fields_num=numeric_fields
)
Parameters:

entry : dict (required)
    Configuration entry for loading the dataframe.
settings : dict (required)
    Settings dictionary.
verbose : bool (default: False)
    If True, prints detailed logs during data loading.
fields_cat : list[str] (default: None)
    List of categorical field names.
fields_bool : list[str] (default: None)
    List of boolean field names.
fields_num : list[str] (default: None)
    List of numeric field names.

Returns:

df : pd.DataFrame
    The loaded DataFrame.
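
The fields_cat / fields_bool / fields_num hints suggest that columns are coerced to their declared types during loading. A minimal stdlib sketch of that idea follows; the real function reads files and assigns pandas dtypes, and the column names here are invented:

```python
# Toy sketch of coercing raw string columns by declared type, in the
# spirit of fields_cat / fields_bool / fields_num. A dict of lists
# stands in for a DataFrame; not the actual openavmkit logic.

def coerce_fields(table, fields_bool=(), fields_num=()):
    out = {}
    for name, values in table.items():
        if name in fields_bool:
            # Interpret common truthy spellings as True.
            out[name] = [v.strip().lower() in ("true", "1", "yes") for v in values]
        elif name in fields_num:
            out[name] = [float(v) for v in values]
        else:
            # Treat everything else as categorical text.
            out[name] = [v.strip() for v in values]
    return out

raw = {
    "is_vacant": ["True", "no", "1"],
    "sale_price": ["100000", "250000.5", "0"],
    "zoning": [" R1", "C2 ", "R1"],
}
clean = coerce_fields(raw, fields_bool=["is_vacant"], fields_num=["sale_price"])
print(clean["is_vacant"])   # [True, False, True]
print(clean["sale_price"])  # [100000.0, 250000.5, 0.0]
```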

Data Processing Functions

process_data()

Process raw dataframes according to settings and return a SalesUniversePair.
from openavmkit.data import process_data

sup = process_data(dataframes, settings, verbose=True)
Parameters:

dataframes : dict[str, pd.DataFrame] (required)
    Dictionary mapping keys to DataFrames.
settings : dict (required)
    Settings dictionary.
verbose : bool (default: False)
    If True, prints progress information.

Returns:

sup : SalesUniversePair
    A SalesUniversePair containing processed sales and universe data.

get_hydrated_sales_from_sup()

Merge the sales and universe DataFrames to "hydrate" the sales data.
from openavmkit.data import get_hydrated_sales_from_sup

df_hydrated = get_hydrated_sales_from_sup(sup)
Parameters:

sup : SalesUniversePair (required)
    SalesUniversePair containing sales and universe DataFrames.

Returns:

df_hydrated : pd.DataFrame | gpd.GeoDataFrame
    The merged (hydrated) sales DataFrame.

get_sup_model_group()

Get a subset of a SalesUniversePair for a specific model group.
from openavmkit.data import get_sup_model_group

sup_mg = get_sup_model_group(sup, model_group_id)
Parameters:

sup : SalesUniversePair (required)
    The SalesUniversePair to filter.
model_group_id : str (required)
    The model group identifier to filter by.

Returns:

sup_mg : SalesUniversePair
    A new SalesUniversePair containing only the specified model group.

Enrichment Functions

enrich_time()

Enrich the DataFrame by converting specified time fields to datetime and deriving additional fields.
from openavmkit.data import enrich_time

df = enrich_time(df, time_formats, settings)
Parameters:

df : pd.DataFrame (required)
    Input DataFrame.
time_formats : dict (required)
    Dictionary mapping field names to datetime formats.
settings : dict (required)
    Settings dictionary.

Returns:

df : pd.DataFrame
    DataFrame with enriched time fields.
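
The core idea, parsing each configured field with its format and deriving calendar fields from it, can be sketched with the stdlib alone. The derived column names below (`*_year`, `*_month`) are an assumption for illustration, not the documented openavmkit output schema:

```python
# Toy sketch of time enrichment: parse date strings using the
# per-field formats, then derive calendar fields. A list of dicts
# stands in for the DataFrame.
from datetime import datetime

def enrich_time_rows(rows, time_formats):
    for row in rows:
        for field, fmt in time_formats.items():
            dt = datetime.strptime(row[field], fmt)
            row[field] = dt
            # Hypothetical derived fields for illustration.
            row[field + "_year"] = dt.year
            row[field + "_month"] = dt.month
    return rows

rows = [{"sale_date": "2023-06-15"}, {"sale_date": "2024-01-02"}]
rows = enrich_time_rows(rows, {"sale_date": "%Y-%m-%d"})
print(rows[0]["sale_date_year"], rows[0]["sale_date_month"])  # 2023 6
```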

enrich_sup_spatial_lag()

Enrich the sales and universe DataFrames with spatial lag features.
from openavmkit.data import enrich_sup_spatial_lag

sup = enrich_sup_spatial_lag(sup, settings, verbose=True)
Parameters:

sup : SalesUniversePair (required)
    SalesUniversePair containing sales and universe DataFrames.
settings : dict (required)
    Settings dictionary.
verbose : bool (default: False)
    If True, prints progress information.

Returns:

sup : SalesUniversePair
    Enriched SalesUniversePair with spatial lag features.
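
A spatial lag feature summarizes nearby observed values for each parcel, for example, the average sale price of its nearest sold neighbors. The sketch below shows that concept with a k-nearest average over raw coordinates; the actual weighting scheme and neighbor search used by openavmkit are not documented here and real implementations use spatial indexes rather than a full distance scan:

```python
# Toy sketch of a spatial lag feature: for each parcel, average the
# sale prices of its k nearest sold neighbors (brute force).
import math

def spatial_lag(parcels, sales, k=2):
    """parcels: list of (x, y); sales: list of (x, y, price)."""
    lags = []
    for px, py in parcels:
        dists = sorted(
            (math.hypot(px - sx, py - sy), price) for sx, sy, price in sales
        )
        nearest = dists[:k]
        lags.append(sum(price for _, price in nearest) / len(nearest))
    return lags

sales = [(0.0, 0.0, 100.0), (1.0, 0.0, 200.0), (10.0, 10.0, 900.0)]
parcels = [(0.5, 0.0), (9.0, 9.0)]
print(spatial_lag(parcels, sales, k=2))  # [150.0, 550.0]
```

The first parcel sits between the two cheap sales, so its lag is their mean; the second parcel is dominated by the expensive sale next to it.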

enrich_df_streets()

Enrich a GeoDataFrame with street network data.
This function can be VERY computationally and memory intensive for large datasets.
from openavmkit.data import enrich_df_streets

df = enrich_df_streets(
    df,
    settings,
    spacing=1.0,
    max_ray_length=25.0,
    network_buffer=500.0,
    verbose=True
)
Parameters:

df_in : gpd.GeoDataFrame (required)
    Input GeoDataFrame containing parcels.
settings : dict (required)
    Settings dictionary containing configuration for the enrichment.
spacing : float (default: 1.0)
    Spacing in meters for ray casting to calculate distances to streets.
max_ray_length : float (default: 25.0)
    Maximum length of rays to shoot for distance calculations, in meters.
network_buffer : float (default: 500.0)
    Buffer around the street network to consider for distance calculations, in meters.
verbose : bool (default: False)
    If True, prints progress information.

Returns:

df : gpd.GeoDataFrame
    Enriched GeoDataFrame with additional columns for street-related metrics.

Utility Functions

get_sale_field()

Determine the appropriate sale price field based on time adjustment settings.
from openavmkit.data import get_sale_field

sale_field = get_sale_field(settings, df)
Parameters:

settings : dict (required)
    Settings dictionary.
df : pd.DataFrame (default: None)
    Optional DataFrame to check field existence.

Returns:

sale_field : str
    Field name to be used for sale price (either "sale_price" or "sale_price_time_adj").
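
The selection logic can be pictured as: prefer the time-adjusted price when time adjustment is enabled and the column actually exists, otherwise fall back to the raw price. The settings key used below ("time_adjustment") is an assumed placeholder, not the actual openavmkit settings schema:

```python
# Toy sketch of sale-field selection. The "time_adjustment" settings
# key is hypothetical; only the two field names come from the docs.

def pick_sale_field(settings, columns=None):
    use_adj = settings.get("time_adjustment", {}).get("enabled", False)
    if use_adj and (columns is None or "sale_price_time_adj" in columns):
        return "sale_price_time_adj"
    return "sale_price"

print(pick_sale_field({"time_adjustment": {"enabled": True}},
                      ["sale_price", "sale_price_time_adj"]))  # sale_price_time_adj
print(pick_sale_field({}, ["sale_price"]))                     # sale_price
```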

get_vacant_sales()

Filter the sales DataFrame to return only vacant (unimproved) sales.
from openavmkit.data import get_vacant_sales

df_vacant = get_vacant_sales(df, settings, invert=False)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame.
settings : dict (required)
    Settings dictionary.
invert : bool (default: False)
    If True, return non-vacant (improved) sales.

Returns:

df_vacant : pd.DataFrame
    DataFrame with an added is_vacant column.

get_vacant()

Filter the DataFrame based on the is_vacant column.
from openavmkit.data import get_vacant

df_vacant = get_vacant(df, settings, invert=False)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame.
settings : dict (required)
    Settings dictionary.
invert : bool (default: False)
    If True, return non-vacant rows.

Returns:

df_vacant : pd.DataFrame
    DataFrame filtered by the is_vacant flag.
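
The invert flag simply flips which side of the boolean filter is kept. A toy sketch over a list of dicts (standing in for the DataFrame) makes the two paths concrete:

```python
# Toy sketch of filtering on an is_vacant flag with an invert option.
# Not the actual openavmkit implementation.

def filter_vacant(rows, invert=False):
    # invert=False keeps vacant rows; invert=True keeps improved rows.
    return [r for r in rows if r["is_vacant"] != invert]

rows = [
    {"key": "a", "is_vacant": True},
    {"key": "b", "is_vacant": False},
    {"key": "c", "is_vacant": True},
]
print([r["key"] for r in filter_vacant(rows)])              # ['a', 'c']
print([r["key"] for r in filter_vacant(rows, invert=True)]) # ['b']
```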

get_train_test_keys()

Get the training and testing keys for the sales DataFrame.
from openavmkit.data import get_train_test_keys

train_keys, test_keys = get_train_test_keys(df, settings)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame containing sales data.
settings : dict (required)
    Settings dictionary.

Returns:

keys_train : np.ndarray
    Keys for the training set.
keys_test : np.ndarray
    Keys for the testing set.

get_train_test_masks()

Get the training and testing masks for the sales DataFrame.
from openavmkit.data import get_train_test_masks

mask_train, mask_test = get_train_test_masks(df, settings)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame containing sales data.
settings : dict (required)
    Settings dictionary.

Returns:

mask_train : pd.Series
    Boolean mask for the training set.
mask_test : pd.Series
    Boolean mask for the testing set.
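
A key-based split like the one these functions return can be made reproducible by deriving each key's bucket from a hash rather than a random draw. The 80/20 hash split below is purely illustrative; openavmkit's actual split criteria (ratio, stratification, settings keys) are not documented in this section:

```python
# Toy sketch of a deterministic key-based train/test split.
# Hashing each key gives a stable pseudo-random bucket, so repeated
# runs produce identical splits without storing any state.
import hashlib

def split_train_test_keys(keys, test_fraction=0.2):
    train, test = [], []
    for key in keys:
        digest = hashlib.sha256(key.encode()).digest()
        bucket = digest[0] / 255.0  # deterministic value in [0, 1]
        (test if bucket < test_fraction else train).append(key)
    return train, test

keys = [f"sale_{i}" for i in range(100)]
train_keys, test_keys = split_train_test_keys(keys)
print(len(train_keys) + len(test_keys))  # 100
```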

Field Classification Functions

get_field_classifications()

Retrieve a mapping of field names to their classifications (land, improvement, or other) and types.
from openavmkit.data import get_field_classifications

field_map = get_field_classifications(settings)
Parameters:

settings : dict (required)
    Settings dictionary.

Returns:

field_map : dict
    Dictionary mapping field names to type and class.

get_important_field()

Retrieve the important field name for a given field alias from settings.
from openavmkit.data import get_important_field

field_name = get_important_field(settings, "deed_id", df)
Parameters:

settings : dict (required)
    Settings dictionary.
field_name : str (required)
    Identifier (alias) for the field.
df : pd.DataFrame (default: None)
    Optional DataFrame to check field existence.

Returns:

field_name : str | None
    The mapped field name if found, else None.

get_report_locations()

Retrieve report location fields from settings.
from openavmkit.data import get_report_locations

locations = get_report_locations(settings, df)
Parameters:

settings : dict (required)
    Settings dictionary.
df : pd.DataFrame (default: None)
    Optional DataFrame to filter available locations.

Returns:

locations : list[str]
    List of report location field names.

Output Functions

write_parquet()

Write data to a parquet file.
from openavmkit.data import write_parquet

write_parquet(df, "out/data.parquet")
Parameters:

df : pd.DataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the parquet file.

write_gpkg()

Write data to a GeoPackage file.
from openavmkit.data import write_gpkg

write_gpkg(gdf, "out/data.gpkg")
Parameters:

df : gpd.GeoDataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the GeoPackage.

write_zipped_shapefile()

Write data to a zipped shapefile.
from openavmkit.data import write_zipped_shapefile

write_zipped_shapefile(gdf, "out/data.shp.zip")
Parameters:

df : gpd.GeoDataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the zipped shapefile.

write_csv()

Write data to a CSV file.
from openavmkit.data import write_csv

write_csv(df, "out/data.csv")
Parameters:

df : pd.DataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the CSV.
