Data cleaning prepares your assembled data for modeling by validating sales, filling gaps, and filtering out invalid transactions. This ensures only high-quality, market-reflective sales are used for training.

Why Clean Data?

Raw sales data often contains:
  • Non-market transactions: Family transfers, estate sales, foreclosures
  • Invalid prices: Zero or negative sale prices, missing values
  • Outdated sales: Transactions too old to reflect current market
  • Data gaps: Missing property characteristics needed for modeling

The Cleaning Pipeline

The cleaning process follows these steps:
1. Load assembled data

sales_univ_pair = read_pickle("out/1-assemble-sup")

2. Fill unknown values

sales_univ_pair = fill_unknown_values_sup(sales_univ_pair, settings)

3. Cluster for equity analysis

sales_univ_pair = mark_horizontal_equity_clusters_per_model_group_sup(
    sup=sales_univ_pair,
    settings=settings,
    verbose=verbose
)

4. Process and validate sales

sales_univ_pair = process_sales(
    sup=sales_univ_pair,
    settings=settings,
    verbose=verbose
)

5. Run sales scrutiny

sales_univ_pair = run_sales_scrutiny(
    sup=sales_univ_pair,
    settings=settings,
    drop_cluster_outliers=False,
    drop_heuristic_outliers=True,
    verbose=verbose
)

Sales Validation

The clean_valid_sales() function performs initial validation:
cleaning.py:18-184
def clean_valid_sales(sup: SalesUniversePair, settings: dict):
    """Clean and validate sales data.
    
    - Filters by date range
    - Removes invalid prices (zero, negative, null)
    - Checks vacancy status consistency
    - Marks sales valid for ratio studies
    """

What Gets Filtered

Sales are marked invalid if:
  • Sale price is zero
  • Sale price is negative
  • Sale price is null/missing
cleaning.py:88-92
# sale prices of 0 and negative and null are invalid
df_sales.loc[
    df_sales["sale_price"].isna() | df_sales["sale_price"].le(0), "valid_sale"
] = False
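The rule above can be seen end to end in a small, self-contained sketch (the DataFrame here is illustrative; the real pipeline operates on the sales side of the SalesUniversePair):

```python
import pandas as pd

# Illustrative data: one good sale, then zero, negative, and missing prices.
df_sales = pd.DataFrame({
    "sale_price": [250000, 0, -100, None, 180000],
    "valid_sale": [True, True, True, True, True],
})

# Sale prices of 0, negative, and null are invalid.
df_sales.loc[
    df_sales["sale_price"].isna() | df_sales["sale_price"].le(0), "valid_sale"
] = False

print(df_sales["valid_sale"].tolist())  # only the two positive prices survive
```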

Ratio Study Validity

Two validity flags are created:

valid_for_ratio_study: Sale is valid AND vacancy status matches current parcel status
cleaning.py:127-134
df_sales.loc[
    df_sales["valid_sale"] & df_sales["vacant_sale"].eq(df_sales["univ_is_vacant"]),
    "valid_for_ratio_study",
] = True
valid_for_land_ratio_study: Sale is valid AND was vacant at time of sale
cleaning.py:136-144
df_sales.loc[
    df_sales["valid_sale"] & df_sales["vacant_sale"].eq(True),
    "valid_for_land_ratio_study",
] = True
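Both flag assignments can be combined into one runnable sketch; the columns here mirror the snippets above, but the DataFrame itself is illustrative:

```python
import pandas as pd

# Row 0: valid sale, vacancy matches. Row 1: vacancy mismatch. Row 2: invalid sale.
df = pd.DataFrame({
    "valid_sale":     [True,  True,  False],
    "vacant_sale":    [True,  False, True],
    "univ_is_vacant": [True,  True,  True],
})
df["valid_for_ratio_study"] = False
df["valid_for_land_ratio_study"] = False

# Valid AND vacancy status at sale matches current parcel status.
df.loc[
    df["valid_sale"] & df["vacant_sale"].eq(df["univ_is_vacant"]),
    "valid_for_ratio_study",
] = True

# Valid AND vacant at time of sale.
df.loc[
    df["valid_sale"] & df["vacant_sale"].eq(True),
    "valid_for_land_ratio_study",
] = True

print(df["valid_for_ratio_study"].tolist())       # [True, False, False]
print(df["valid_for_land_ratio_study"].tolist())  # [True, False, False]
```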

Filtering Invalid Sales

Use custom filters to exclude non-market transactions:
cleaning.py:228-293
def filter_invalid_sales(sup: SalesUniversePair, settings: dict, verbose: bool = False):
    """Filter out invalid sales based on configurable conditions.
    
    Applies user-defined filters to identify:
    - Non-arms-length transactions
    - Family transfers
    - Unusual sale types
    """

Configuration Example

settings.json
{
  "data": {
    "process": {
      "invalid_sales": {
        "enabled": true,
        "filter": [
          {
            "field": "sale_type",
            "operator": "in",
            "value": ["Estate Sale", "Family Transfer", "Foreclosure"]
          },
          {
            "field": "deed_type",
            "operator": "==",
            "value": "Gift"
          }
        ]
      }
    }
  }
}
This excludes:
  • Estate sales, family transfers, and foreclosures
  • Gift deeds (non-market transactions)
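To make the filter semantics concrete, here is a minimal sketch of how conditions like these could be applied; `apply_filters` is a hypothetical helper, and the real filter_invalid_sales() implementation may differ:

```python
import pandas as pd

def apply_filters(df, filters):
    """Hypothetical helper: drop rows matching any filter condition."""
    mask = pd.Series(False, index=df.index)
    for f in filters:
        if f["operator"] == "in":
            mask |= df[f["field"]].isin(f["value"])
        elif f["operator"] == "==":
            mask |= df[f["field"]].eq(f["value"])
    return df.loc[~mask]  # keep only rows matching none of the conditions

sales = pd.DataFrame({
    "sale_type": ["Market Sale", "Estate Sale", "Market Sale"],
    "deed_type": ["Warranty", "Warranty", "Gift"],
})
filters = [
    {"field": "sale_type", "operator": "in",
     "value": ["Estate Sale", "Family Transfer", "Foreclosure"]},
    {"field": "deed_type", "operator": "==", "value": "Gift"},
]
print(len(apply_filters(sales, filters)))  # 1 market sale remains
```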

Sales Scrutiny

Sales scrutiny identifies outlier transactions that don’t reflect market value, using both rule-based heuristics and statistical clustering.

Heuristic Method

Applies rule-based filters to flag suspicious sales:
sales_univ_pair = run_heuristics(
    sup=sales_univ_pair,
    settings=settings,
    drop_outliers=True,
    verbose=verbose
)
Common heuristics:
  • Sales far from typical price ranges
  • Unusually short or long time on market
  • Sales between related parties

Cluster Method

Groups similar properties and identifies price outliers within each cluster:
pipeline.py:1030-1082
def run_sales_scrutiny(
    sup: SalesUniversePair,
    settings: dict,
    drop_cluster_outliers: bool = False,
    drop_heuristic_outliers: bool = True,
    verbose: bool = False
):
    """Run sales scrutiny analysis.
    
    1. Apply heuristic filters
    2. Drop manually excluded sales
    3. Cluster-based outlier detection
    """
Configuration:
settings.json
{
  "analysis": {
    "sales_scrutiny": {
      "clusters_enabled": true,
      "heuristics_enabled": true,
      "invalid_key_file": "in/invalid_sales.csv"
    }
  }
}
Be cautious with drop_cluster_outliers=True. This permanently removes sales from your dataset. Start with False to review flagged sales first.

Filling Unknown Values

Models cannot handle missing data, so gaps must be filled:
cleaning.py:187-225
def fill_unknown_values_sup(sup: SalesUniversePair, settings: dict):
    """Fill unknown values with defaults specified in settings.
    
    - Universe: Fill all characteristics
    - Sales: Fill only sale-specific metadata
    """

Configuration

Define fill values in your data dictionary:
settings.json
{
  "data": {
    "dictionary": {
      "characteristics": {
        "numeric": {
          "bldg_area_finished_sqft": {
            "fill_value": 0
          },
          "bldg_year_built": {
            "fill_value": 1950
          }
        },
        "categorical": {
          "bldg_type": {
            "fill_value": "UNKNOWN"
          }
        },
        "boolean": {
          "has_garage": {
            "fill_value": false
          }
        }
      }
    }
  }
}
Use sensible defaults that represent “typical” or “neutral” values rather than statistical means, which can introduce bias.
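Given a dictionary like the one above, the fill step amounts to replacing nulls column by column. A minimal sketch, assuming the per-column fill values have already been flattened out of the settings (the real fill_unknown_values_sup() reads them from the data dictionary):

```python
import pandas as pd

# Per-column fill values, as they might be flattened from settings.json.
fill_values = {
    "bldg_area_finished_sqft": 0,
    "bldg_year_built": 1950,
    "bldg_type": "UNKNOWN",
    "has_garage": False,
}

universe = pd.DataFrame({
    "bldg_area_finished_sqft": [1500.0, None],
    "bldg_year_built": [1987.0, None],
    "bldg_type": ["ranch", None],
    "has_garage": [True, None],
})

# fillna with a dict applies each default only to its own column.
filled = universe.fillna(value=fill_values)
print(filled.iloc[1].tolist())  # [0.0, 1950.0, 'UNKNOWN', False]
```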

Output Files

Cleaned data is written to:
  • out/2-clean-sup.pickle - Complete cleaned SalesUniversePair
  • out/look/2-clean-universe.parquet - Cleaned universe data
  • out/look/2-clean-sales-hydrated.parquet - Cleaned sales with characteristics
  • out/sales_scrutiny/ - Sales scrutiny reports and flagged sales

Validation Checklist

Before proceeding to modeling, verify:
1. Check sales counts

print(f"Valid sales: {len(sup.sales[sup.sales['valid_sale'] == True])}")
print(f"Valid for ratio study: {len(sup.sales[sup.sales['valid_for_ratio_study'] == True])}")

2. Review excluded sales

Check out/sales_scrutiny/ for reports on excluded sales.

3. Examine data quality

examine_sup(sales_univ_pair, settings)

Look for:
  • High non-null percentages for key fields
  • Reasonable value distributions
  • No unexpected missing data

4. Visualize on map

Load parquet files in GIS software to spot spatial patterns.

Next Steps

Modeling

Build predictive models with your cleaned data

Configuration Reference

Explore all cleaning and validation options
