The data module provides core data structures and functions for loading, processing, and enriching assessment and sales data.

Core Data Structures

SalesUniversePair

A container for the sales and universe DataFrames. Many functions operate on this data structure.
from openavmkit.data import SalesUniversePair

sup = SalesUniversePair(sales=df_sales, universe=df_universe)

Parameters:

sales : pd.DataFrame (required)
    DataFrame containing sales data.
universe : pd.DataFrame (required)
    DataFrame containing universe (parcel) data.

Methods

copy()

Create a copy of the SalesUniversePair object.

sup_copy = sup.copy()

Returns:

sup_copy : SalesUniversePair
    A new SalesUniversePair object with copied DataFrames.
set(key, value)

Set the sales or universe DataFrame.

sup.set("sales", new_sales_df)
sup.set("universe", new_universe_df)

Parameters:

key : str (required)
    Either "sales" or "universe".
value : pd.DataFrame (required)
    The new DataFrame to set for the specified key.
update_sales(new_sales, allow_remove_rows)

Update the sales DataFrame with new information, applied as an overlay so that unchanged data is not duplicated.

sup.update_sales(new_sales_df, allow_remove_rows=True)

Parameters:

new_sales : pd.DataFrame (required)
    New sales DataFrame with updates.
allow_remove_rows : bool (required)
    If True, the update may remove rows from sales. If False, all original rows are preserved.
limit_sales_to_keys(new_sale_keys)

Filter the sales DataFrame down to rows whose key appears in new_sale_keys.

sup.limit_sales_to_keys(["sale_123", "sale_456", "sale_789"])

Parameters:

new_sale_keys : list[str] (required)
    List of sale keys to keep.
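
The overlay semantics of update_sales can be illustrated with a toy sketch. Plain dicts keyed by sale id stand in for the DataFrames; the real method operates on pandas objects and its exact merge rules may differ, so treat this only as an illustration of the allow_remove_rows contract described above:

```python
# Toy illustration of an "overlay" update keyed by sale id.
# Dicts stand in for the sales DataFrame; this is NOT the
# actual openavmkit implementation.

def overlay_update(sales, new_sales, allow_remove_rows):
    """Overlay new_sales onto sales without duplicating rows."""
    if allow_remove_rows:
        # The update defines the surviving sales set: dropped keys go away,
        # but original field values not overridden by the update are kept.
        return {
            key: {**sales.get(key, {}), **record}
            for key, record in new_sales.items()
        }
    # Preserve every original row; only overlay the updated fields.
    merged = {k: dict(v) for k, v in sales.items()}
    for key, record in new_sales.items():
        merged[key] = {**merged.get(key, {}), **record}
    return merged

sales = {
    "sale_1": {"price": 100_000, "valid": True},
    "sale_2": {"price": 250_000, "valid": True},
}
updates = {"sale_1": {"valid": False}}

kept = overlay_update(sales, updates, allow_remove_rows=False)
pruned = overlay_update(sales, updates, allow_remove_rows=True)
print(sorted(kept))    # both original rows survive
print(sorted(pruned))  # only keys present in the update survive
```

Note how the non-destructive path keeps sale_2 untouched while still applying the field-level change to sale_1.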

Data Loading Functions

load_dataframe()

Load a single DataFrame based on configuration settings.
from openavmkit.data import load_dataframe

df = load_dataframe(
    entry,
    settings,
    verbose=True,
    fields_cat=categorical_fields,
    fields_bool=boolean_fields,
    fields_num=numeric_fields
)
Parameters:

entry : dict (required)
    Configuration entry for loading the dataframe.
settings : dict (required)
    Settings dictionary.
verbose : bool (default: False)
    If True, prints detailed logs during data loading.
fields_cat : list[str] (default: None)
    List of categorical field names.
fields_bool : list[str] (default: None)
    List of boolean field names.
fields_num : list[str] (default: None)
    List of numeric field names.

Returns:

df : pd.DataFrame
    The loaded DataFrame.
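
The fields_cat / fields_bool / fields_num hints suggest that columns are coerced to their declared types during loading. A minimal stdlib sketch of that idea follows; the real function reads files and assigns pandas dtypes, and the column names here are invented:

```python
# Toy sketch of coercing raw string columns by declared type, in the
# spirit of fields_cat / fields_bool / fields_num. A dict of lists
# stands in for a DataFrame; not the actual openavmkit logic.

def coerce_fields(table, fields_bool=(), fields_num=()):
    out = {}
    for name, values in table.items():
        if name in fields_bool:
            # Interpret common truthy spellings as True.
            out[name] = [v.strip().lower() in ("true", "1", "yes") for v in values]
        elif name in fields_num:
            out[name] = [float(v) for v in values]
        else:
            # Treat everything else as categorical text.
            out[name] = [v.strip() for v in values]
    return out

raw = {
    "is_vacant": ["True", "no", "1"],
    "sale_price": ["100000", "250000.5", "0"],
    "zoning": [" R1", "C2 ", "R1"],
}
clean = coerce_fields(raw, fields_bool=["is_vacant"], fields_num=["sale_price"])
print(clean["is_vacant"])   # [True, False, True]
print(clean["sale_price"])  # [100000.0, 250000.5, 0.0]
```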

Data Processing Functions

process_data()

Process raw dataframes according to settings and return a SalesUniversePair.
from openavmkit.data import process_data

sup = process_data(dataframes, settings, verbose=True)
Parameters:

dataframes : dict[str, pd.DataFrame] (required)
    Dictionary mapping keys to DataFrames.
settings : dict (required)
    Settings dictionary.
verbose : bool (default: False)
    If True, prints progress information.

Returns:

sup : SalesUniversePair
    A SalesUniversePair containing processed sales and universe data.

get_hydrated_sales_from_sup()

Merge the sales and universe DataFrames to "hydrate" the sales data.
from openavmkit.data import get_hydrated_sales_from_sup

df_hydrated = get_hydrated_sales_from_sup(sup)
Parameters:

sup : SalesUniversePair (required)
    SalesUniversePair containing sales and universe DataFrames.

Returns:

df_hydrated : pd.DataFrame | gpd.GeoDataFrame
    The merged (hydrated) sales DataFrame.

get_sup_model_group()

Get a subset of a SalesUniversePair for a specific model group.
from openavmkit.data import get_sup_model_group

sup_mg = get_sup_model_group(sup, model_group_id)
Parameters:

sup : SalesUniversePair (required)
    The SalesUniversePair to filter.
model_group_id : str (required)
    The model group identifier to filter by.

Returns:

sup_mg : SalesUniversePair
    A new SalesUniversePair containing only the specified model group.

Enrichment Functions

enrich_time()

Enrich the DataFrame by converting specified time fields to datetime and deriving additional fields.
from openavmkit.data import enrich_time

df = enrich_time(df, time_formats, settings)
Parameters:

df : pd.DataFrame (required)
    Input DataFrame.
time_formats : dict (required)
    Dictionary mapping field names to datetime formats.
settings : dict (required)
    Settings dictionary.

Returns:

df : pd.DataFrame
    DataFrame with enriched time fields.
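
The core idea, parsing each configured field with its format and deriving calendar fields from it, can be sketched with the stdlib alone. The derived column names below (`*_year`, `*_month`) are an assumption for illustration, not the documented openavmkit output schema:

```python
# Toy sketch of time enrichment: parse date strings using the
# per-field formats, then derive calendar fields. A list of dicts
# stands in for the DataFrame.
from datetime import datetime

def enrich_time_rows(rows, time_formats):
    for row in rows:
        for field, fmt in time_formats.items():
            dt = datetime.strptime(row[field], fmt)
            row[field] = dt
            # Hypothetical derived fields for illustration.
            row[field + "_year"] = dt.year
            row[field + "_month"] = dt.month
    return rows

rows = [{"sale_date": "2023-06-15"}, {"sale_date": "2024-01-02"}]
rows = enrich_time_rows(rows, {"sale_date": "%Y-%m-%d"})
print(rows[0]["sale_date_year"], rows[0]["sale_date_month"])  # 2023 6
```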

enrich_sup_spatial_lag()

Enrich the sales and universe DataFrames with spatial lag features.
from openavmkit.data import enrich_sup_spatial_lag

sup = enrich_sup_spatial_lag(sup, settings, verbose=True)
Parameters:

sup : SalesUniversePair (required)
    SalesUniversePair containing sales and universe DataFrames.
settings : dict (required)
    Settings dictionary.
verbose : bool (default: False)
    If True, prints progress information.

Returns:

sup : SalesUniversePair
    Enriched SalesUniversePair with spatial lag features.
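
A spatial lag feature summarizes nearby observed values for each parcel, for example, the average sale price of its nearest sold neighbors. The sketch below shows that concept with a k-nearest average over raw coordinates; the actual weighting scheme and neighbor search used by openavmkit are not documented here and real implementations use spatial indexes rather than a full distance scan:

```python
# Toy sketch of a spatial lag feature: for each parcel, average the
# sale prices of its k nearest sold neighbors (brute force).
import math

def spatial_lag(parcels, sales, k=2):
    """parcels: list of (x, y); sales: list of (x, y, price)."""
    lags = []
    for px, py in parcels:
        dists = sorted(
            (math.hypot(px - sx, py - sy), price) for sx, sy, price in sales
        )
        nearest = dists[:k]
        lags.append(sum(price for _, price in nearest) / len(nearest))
    return lags

sales = [(0.0, 0.0, 100.0), (1.0, 0.0, 200.0), (10.0, 10.0, 900.0)]
parcels = [(0.5, 0.0), (9.0, 9.0)]
print(spatial_lag(parcels, sales, k=2))  # [150.0, 550.0]
```

The first parcel sits between the two cheap sales, so its lag is their mean; the second parcel is dominated by the expensive sale next to it.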

enrich_df_streets()

Enrich a GeoDataFrame with street network data.
This function can be VERY computationally and memory intensive for large datasets.
from openavmkit.data import enrich_df_streets

df = enrich_df_streets(
    df,
    settings,
    spacing=1.0,
    max_ray_length=25.0,
    network_buffer=500.0,
    verbose=True
)
Parameters:

df_in : gpd.GeoDataFrame (required)
    Input GeoDataFrame containing parcels.
settings : dict (required)
    Settings dictionary containing configuration for the enrichment.
spacing : float (default: 1.0)
    Spacing in meters for ray casting to calculate distances to streets.
max_ray_length : float (default: 25.0)
    Maximum length of rays to shoot for distance calculations, in meters.
network_buffer : float (default: 500.0)
    Buffer around the street network to consider for distance calculations, in meters.
verbose : bool (default: False)
    If True, prints progress information.

Returns:

df : gpd.GeoDataFrame
    Enriched GeoDataFrame with additional columns for street-related metrics.

Utility Functions

get_sale_field()

Determine the appropriate sale price field based on time adjustment settings.
from openavmkit.data import get_sale_field

sale_field = get_sale_field(settings, df)
Parameters:

settings : dict (required)
    Settings dictionary.
df : pd.DataFrame (default: None)
    Optional DataFrame to check field existence.

Returns:

sale_field : str
    Field name to be used for sale price (either "sale_price" or "sale_price_time_adj").
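
The selection logic can be pictured as: prefer the time-adjusted price when time adjustment is enabled and the column actually exists, otherwise fall back to the raw price. The settings key used below ("time_adjustment") is an assumed placeholder, not the actual openavmkit settings schema:

```python
# Toy sketch of sale-field selection. The "time_adjustment" settings
# key is hypothetical; only the two field names come from the docs.

def pick_sale_field(settings, columns=None):
    use_adj = settings.get("time_adjustment", {}).get("enabled", False)
    if use_adj and (columns is None or "sale_price_time_adj" in columns):
        return "sale_price_time_adj"
    return "sale_price"

print(pick_sale_field({"time_adjustment": {"enabled": True}},
                      ["sale_price", "sale_price_time_adj"]))  # sale_price_time_adj
print(pick_sale_field({}, ["sale_price"]))                     # sale_price
```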

get_vacant_sales()

Filter the sales DataFrame to return only vacant (unimproved) sales.
from openavmkit.data import get_vacant_sales

df_vacant = get_vacant_sales(df, settings, invert=False)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame.
settings : dict (required)
    Settings dictionary.
invert : bool (default: False)
    If True, return non-vacant (improved) sales.

Returns:

df_vacant : pd.DataFrame
    DataFrame with an added is_vacant column.

get_vacant()

Filter the DataFrame based on the is_vacant column.
from openavmkit.data import get_vacant

df_vacant = get_vacant(df, settings, invert=False)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame.
settings : dict (required)
    Settings dictionary.
invert : bool (default: False)
    If True, return non-vacant rows.

Returns:

df_vacant : pd.DataFrame
    DataFrame filtered by the is_vacant flag.
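
The invert flag simply flips which side of the boolean filter is kept. A toy sketch over a list of dicts (standing in for the DataFrame) makes the two paths concrete:

```python
# Toy sketch of filtering on an is_vacant flag with an invert option.
# Not the actual openavmkit implementation.

def filter_vacant(rows, invert=False):
    # invert=False keeps vacant rows; invert=True keeps improved rows.
    return [r for r in rows if r["is_vacant"] != invert]

rows = [
    {"key": "a", "is_vacant": True},
    {"key": "b", "is_vacant": False},
    {"key": "c", "is_vacant": True},
]
print([r["key"] for r in filter_vacant(rows)])              # ['a', 'c']
print([r["key"] for r in filter_vacant(rows, invert=True)]) # ['b']
```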

get_train_test_keys()

Get the training and testing keys for the sales DataFrame.
from openavmkit.data import get_train_test_keys

train_keys, test_keys = get_train_test_keys(df, settings)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame containing sales data.
settings : dict (required)
    Settings dictionary.

Returns:

keys_train : np.ndarray
    Keys for the training set.
keys_test : np.ndarray
    Keys for the testing set.

get_train_test_masks()

Get the training and testing masks for the sales DataFrame.
from openavmkit.data import get_train_test_masks

mask_train, mask_test = get_train_test_masks(df, settings)
Parameters:

df_in : pd.DataFrame (required)
    Input DataFrame containing sales data.
settings : dict (required)
    Settings dictionary.

Returns:

mask_train : pd.Series
    Boolean mask for the training set.
mask_test : pd.Series
    Boolean mask for the testing set.
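
A key-based split like the one these functions return can be made reproducible by deriving each key's bucket from a hash rather than a random draw. The 80/20 hash split below is purely illustrative; openavmkit's actual split criteria (ratio, stratification, settings keys) are not documented in this section:

```python
# Toy sketch of a deterministic key-based train/test split.
# Hashing each key gives a stable pseudo-random bucket, so repeated
# runs produce identical splits without storing any state.
import hashlib

def split_train_test_keys(keys, test_fraction=0.2):
    train, test = [], []
    for key in keys:
        digest = hashlib.sha256(key.encode()).digest()
        bucket = digest[0] / 255.0  # deterministic value in [0, 1]
        (test if bucket < test_fraction else train).append(key)
    return train, test

keys = [f"sale_{i}" for i in range(100)]
train_keys, test_keys = split_train_test_keys(keys)
print(len(train_keys) + len(test_keys))  # 100
```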

Field Classification Functions

get_field_classifications()

Retrieve a mapping of field names to their classifications (land, improvement, or other) and types.
from openavmkit.data import get_field_classifications

field_map = get_field_classifications(settings)
Parameters:

settings : dict (required)
    Settings dictionary.

Returns:

field_map : dict
    Dictionary mapping field names to type and class.

get_important_field()

Retrieve the important field name for a given field alias from settings.
from openavmkit.data import get_important_field

field_name = get_important_field(settings, "deed_id", df)
Parameters:

settings : dict (required)
    Settings dictionary.
field_name : str (required)
    Identifier (alias) for the field.
df : pd.DataFrame (default: None)
    Optional DataFrame to check field existence.

Returns:

field_name : str | None
    The mapped field name if found, else None.

get_report_locations()

Retrieve report location fields from settings.
from openavmkit.data import get_report_locations

locations = get_report_locations(settings, df)
Parameters:

settings : dict (required)
    Settings dictionary.
df : pd.DataFrame (default: None)
    Optional DataFrame to filter available locations.

Returns:

locations : list[str]
    List of report location field names.

Output Functions

write_parquet()

Write data to a parquet file.
from openavmkit.data import write_parquet

write_parquet(df, "out/data.parquet")
Parameters:

df : pd.DataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the parquet file.

write_gpkg()

Write data to a GeoPackage file.
from openavmkit.data import write_gpkg

write_gpkg(gdf, "out/data.gpkg")
Parameters:

df : gpd.GeoDataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the GeoPackage.

write_zipped_shapefile()

Write data to a zipped shapefile.
from openavmkit.data import write_zipped_shapefile

write_zipped_shapefile(gdf, "out/data.shp.zip")
Parameters:

df : gpd.GeoDataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the zipped shapefile.

write_csv()

Write data to a CSV file.
from openavmkit.data import write_csv

write_csv(df, "out/data.csv")
Parameters:

df : pd.DataFrame (required)
    Data to be written.
path : str (required)
    File path for saving the CSV.
