Classes
SalesScrutinyStudy
Sales scrutiny study object that performs cluster-based analysis on sales to identify anomalies.
Attributes:
- df_vacant (pd.DataFrame): DataFrame of sales that were allegedly vacant (no building) at time of sale
- df_improved (pd.DataFrame): DataFrame of sales that were allegedly improved (had building) at time of sale
- settings (dict): Settings dictionary
- model_group (str): The model group to investigate
- summaries (dict[str, SalesScrutinyStudySummary]): Dictionary in which the results are stored
- unit (str): The area unit (“sqft” or “sqm”)
__init__
The data you wish to analyze
Settings dictionary
Model group to analyze
write()
Writes the sales scrutiny report to disk.
The root path for output files
get_scrutinized()
Remove flagged sales from the dataset and return the modified dataset.
The DataFrame you wish to clean
Whether to print verbose output
DataFrame with flagged sales marked as invalid
SalesScrutinyStudySummary
Summary statistics for a sales scrutiny study.
Attributes:
- num_sales_flagged (int): The number of sales flagged by the study
- num_sales_total (int): The number of sales that were tested
- num_flagged_sales_by_type (dict[str, int]): Dictionary breaking down number of flagged sales by anomaly type
Functions
run_heuristics
Identifies and flags anomalous sales by heuristic. Drops them if the user specifies.
The data you want to analyze/clean
Settings dictionary
If True, drops all sales flagged by this method
Whether to print verbose output
The original data with any modifications
- Duplicate Deed IDs and Dates: Flags sales with identical deed IDs and sale dates (potential multi-parcel sales)
- Duplicate Dates and Prices: Flags sales made on the same date for the same price (potential multi-parcel sales)
- Misclassified Vacant Sales: Flags vacant sales where the building's year built is earlier than the sale year
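The three heuristics above can be illustrated with a minimal pandas sketch. This is not the library's implementation: the function name `flag_heuristics` and every column name (`deed_id`, `sale_date`, `sale_price`, `vacant_sale`, `bldg_year_built`) are assumptions chosen for illustration.

```python
import pandas as pd

def flag_heuristics(df: pd.DataFrame) -> pd.DataFrame:
    """Add one boolean flag column per heuristic (all column names are hypothetical)."""
    out = df.copy()
    # Same deed ID and sale date on more than one row
    out["flag_dup_deed_date"] = out.duplicated(["deed_id", "sale_date"], keep=False)
    # Same sale date and sale price on more than one row
    out["flag_dup_date_price"] = out.duplicated(["sale_date", "sale_price"], keep=False)
    # Vacant sale whose building was already built before the sale year
    sale_year = pd.to_datetime(out["sale_date"]).dt.year
    out["flag_misclassified_vacant"] = (
        out["vacant_sale"]
        & out["bldg_year_built"].gt(0)
        & out["bldg_year_built"].lt(sale_year)
    )
    return out

sales = pd.DataFrame({
    "key_sale": ["a", "b", "c", "d"],
    "deed_id": ["D1", "D1", "D2", "D3"],
    "sale_date": ["2021-05-01", "2021-05-01", "2021-06-15", "2022-01-10"],
    "sale_price": [100000, 100000, 250000, 50000],
    "vacant_sale": [False, False, False, True],
    "bldg_year_built": [1990, 1985, 2001, 1975],
})
flagged = flag_heuristics(sales)
```

Dropping the flagged rows, as the drop option describes, would then amount to filtering on the flag columns.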
drop_manual_exclusions
Drops sales that the user has individually marked as invalid.
The data you want to clean
Settings dictionary
Whether to print verbose output
The original data with any modifications
Reads the file specified in settings.analysis.sales_scrutiny.invalid_key_file, which contains the sale keys to exclude.
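As a rough illustration of key-based exclusion, the sketch below drops every sale whose key appears in an exclusion file. The file layout (a CSV with a single `key_sale` column), the column name, and the helper `drop_by_key_file` are all assumptions; the actual format expected by `invalid_key_file` is not described here.

```python
import io
import pandas as pd

def drop_by_key_file(df: pd.DataFrame, key_file, key_col: str = "key_sale") -> pd.DataFrame:
    """Drop rows whose key appears in a CSV of excluded keys (hypothetical layout)."""
    excluded = pd.read_csv(key_file, dtype=str)[key_col]
    return df[~df[key_col].isin(excluded)].reset_index(drop=True)

sales = pd.DataFrame({"key_sale": ["a", "b", "c"], "sale_price": [100, 200, 300]})
# An in-memory stand-in for the exclusion file on disk
key_file = io.StringIO("key_sale\nb\n")
cleaned = drop_by_key_file(sales, key_file)
```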
run_sales_scrutiny
Run sales scrutiny analysis on an individual model group.
The data that you want to analyze
Configuration settings
The model group you want to analyze
If True, enables verbose logging
Updated DataFrame after sales scrutiny analysis with flagged sales marked as invalid
run_sales_scrutiny_per_model_group
Run sales scrutiny analysis for each model group within a DataFrame.
The data that you want to analyze
Configuration settings
If True, enables verbose logging
Updated DataFrame after sales scrutiny analysis
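The per-model-group pattern can be sketched generically: split the data by model group, run an analysis on each part, and reassemble the result in the original row order. The helper `run_per_group`, the `model_group` column name, and the toy `flag_high` rule are assumptions for illustration only, not the library's code.

```python
import pandas as pd

def run_per_group(df: pd.DataFrame, analyze, group_col: str = "model_group") -> pd.DataFrame:
    """Apply `analyze(part, name)` to each group and restore the original row order."""
    parts = [analyze(part.copy(), name) for name, part in df.groupby(group_col, sort=False)]
    # Assumes the index of `df` is unique
    return pd.concat(parts).loc[df.index]

def flag_high(part: pd.DataFrame, name: str) -> pd.DataFrame:
    # Toy stand-in for a real analysis: flag prices above twice the group median
    part["flagged"] = part["sale_price"] > 2 * part["sale_price"].median()
    return part

sales = pd.DataFrame({
    "model_group": ["res", "com", "res", "res"],
    "sale_price": [100, 900, 110, 500],
})
result = run_per_group(sales, flag_high)
```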
mark_ss_ids_per_model_group
Cluster parcels for a sales scrutiny study by assigning sales scrutiny IDs. This function processes each model group within the provided dataset, identifies clusters of parcels for scrutiny, and writes the cluster identifiers into a new field.
The data you want to mark
Configuration settings
If True, prints verbose output during processing
Updated DataFrame with marked sales scrutiny IDs (ss_id column)
Anomaly Types
The sales scrutiny study identifies five types of anomalies:
Anomaly 1: Weird Price/Area and Weird Area
Criteria:
- Price per area is extremely high or low (> 2 standard deviations)
- Area itself is extremely high or low (> 2 standard deviations)
Anomaly 2: Low Price and Low Price/Area
Criteria:
- Sale price is low (< median for cluster)
- Price per area is low (more than 2 standard deviations below median)
- Area is in normal range (within 1 standard deviation)
Possible causes:
- Non-arm's-length transaction
- Distressed sale
- Missing property characteristics
- Property condition issues not captured in data
Anomaly 3: High Price and High Price/Area
Criteria:
- Sale price is high (> median for cluster)
- Price per area is high (more than 2 standard deviations above median)
- Area is in normal range (within 1 standard deviation)
Possible causes:
- Luxury property with superior characteristics not captured
- Data entry error (inflated sale price)
- Unique property features
Anomaly 4: Normal Price and High Price/Area
Criteria:
- Sale price is in normal range (within 1 standard deviation of median)
- Price per area is high (more than 2 standard deviations above median)
- Area is in normal range (within 1 standard deviation)
Possible causes:
- Incorrectly classified property (e.g., vacant sale with building present)
- Missing area data
- Premium location within cluster
Anomaly 5: Normal Price and Low Price/Area
Criteria:
- Sale price is in normal range (within 1 standard deviation of median)
- Price per area is low (more than 2 standard deviations below median)
- Area is in normal range (within 1 standard deviation)
Possible causes:
- Large property in a typically smaller-property area
- Inflated area measurement
- Property with significant unusable area
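The five sets of criteria above can be expressed as a small decision function. The sketch below assumes that each quantity is measured as a signed distance from the cluster median in units of the cluster's sample standard deviation, and that a "normal" price (anomalies 4 and 5) is checked before a merely above- or below-median price (anomalies 2 and 3). That ordering is an assumption, because the criteria as stated overlap; the helper names `deviations` and `classify` are hypothetical as well.

```python
from statistics import median, stdev

def deviations(values):
    """Signed distance of each value from the median, in standard deviations."""
    med, sd = median(values), stdev(values)
    return [(v - med) / sd for v in values]

def classify(price_dev, ppa_dev, area_dev):
    """Return the anomaly number (1-5), or None if no criteria match.

    Arguments are deviations of sale price, price per area, and area
    from their cluster medians, in standard-deviation units.
    """
    # Anomaly 1: both price per area and area are extreme
    if abs(ppa_dev) > 2 and abs(area_dev) > 2:
        return 1
    # Anomalies 2-5 all require a normal area and an extreme price per area
    if abs(area_dev) > 1 or abs(ppa_dev) <= 2:
        return None
    high_ppa = ppa_dev > 2
    # Assumed precedence: a normal price wins over an above/below-median price
    if abs(price_dev) <= 1:
        return 4 if high_ppa else 5
    if price_dev < 0 and not high_ppa:
        return 2
    if price_dev > 0 and high_ppa:
        return 3
    return None

example_devs = deviations([100, 110, 120])
```

In practice `deviations` would be applied once per cluster to sale price, price per area, and area, and `classify` once per sale.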
Clustering Methodology
Sales are grouped into clusters based on:
- Location: Neighborhood or other geographic identifier
- Categorical fields: Property characteristics (from settings)
- Numeric fields: Continuous property characteristics (from settings)
- Vacant status: Analyzed separately for vacant vs. improved
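One simple way to form such clusters is to bin each numeric field and then group on the combination of vacant status, location, categorical fields, and bins. The sketch below takes that approach with quantile bins; it is an illustrative assumption and not necessarily how the library clusters. The function `assign_cluster_ids` and all column names except `ss_id` are hypothetical.

```python
import pandas as pd

def assign_cluster_ids(df, location, categoricals, numerics, n_bins=2):
    """Write a cluster identifier into `ss_id` (hypothetical grouping scheme)."""
    out = df.copy()
    keys = ["vacant_sale", location, *categoricals]
    for col in numerics:
        bin_col = f"{col}_bin"
        # Quantile bins; duplicates="drop" tolerates repeated bin edges
        out[bin_col] = pd.qcut(out[col], q=n_bins, labels=False, duplicates="drop")
        keys.append(bin_col)
    # Each distinct key combination becomes one cluster
    out["ss_id"] = out.groupby(keys, sort=False).ngroup().astype(str)
    return out

sales = pd.DataFrame({
    "vacant_sale": [False, False, False, False],
    "neighborhood": ["N1", "N1", "N1", "N2"],
    "bldg_type": ["A", "A", "A", "A"],
    "bldg_area": [1000, 1100, 3000, 1050],
})
clustered = assign_cluster_ids(sales, "neighborhood", ["bldg_type"], ["bldg_area"])
```

Because vacant status is part of the key, vacant and improved sales never share a cluster.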
Bimodal Cluster Detection
The study also identifies clusters with bimodal distributions using Gaussian Mixture Models.
Criteria for Bimodal Classification:
- BIC(1) - BIC(2) ≥ 10 (two components strongly preferred)
- Ashman’s D > 2.0 (components well-separated)
- Minimum component weight ≥ 0.15 (avoids spurious modes)
Possible causes:
- Mixed property types incorrectly grouped together
- Distinct sub-markets within the cluster
- Invalid sales mixed with valid sales
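The three bimodality criteria can be combined into a single check once one- and two-component mixtures have been fitted. The sketch below covers only the decision step and takes the fitted quantities as inputs; fitting could be done, for example, with scikit-learn's `GaussianMixture`, though that choice is an assumption. Ashman's D is computed with its standard definition, D = sqrt(2) * |mu1 - mu2| / sqrt(sigma1^2 + sigma2^2). The function names are hypothetical.

```python
import math

def ashman_d(mu1, sigma1, mu2, sigma2):
    """Ashman's D: separation of two Gaussian components."""
    return math.sqrt(2) * abs(mu1 - mu2) / math.sqrt(sigma1**2 + sigma2**2)

def is_bimodal(bic1, bic2, means, sigmas, weights):
    """Apply the three criteria to an already-fitted pair of mixture models.

    bic1, bic2: BIC of the one- and two-component fits
    means, sigmas, weights: parameters of the two-component fit
    """
    d = ashman_d(means[0], sigmas[0], means[1], sigmas[1])
    return (bic1 - bic2) >= 10 and d > 2.0 and min(weights) >= 0.15

# Two well-separated, reasonably balanced components
well_separated = is_bimodal(500.0, 480.0, (0.0, 4.0), (1.0, 1.0), (0.6, 0.4))
# Same components, but one carries too little weight
spurious_mode = is_bimodal(500.0, 480.0, (0.0, 4.0), (1.0, 1.0), (0.9, 0.1))
```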