
perform_spatial_inference

perform_spatial_inference(
    df: gpd.GeoDataFrame, 
    s_infer: dict, 
    key: str, 
    verbose: bool = False
) -> gpd.GeoDataFrame
Perform spatial inference using specified model(s) to predict missing values in spatial datasets. This is the main orchestration function that handles the entire inference pipeline.
df (gpd.GeoDataFrame, required)
    Input GeoDataFrame with features and target variable.
s_infer (dict, required)
    Inference settings from config. Dictionary where keys are field names to infer and values are configuration dicts containing:
      • model: Model configuration (type, proxies, locations, interactions)
      • filters: Optional filter conditions to determine inference scope
      • fill: Optional list of fields to use for direct filling before inference
      • round: Whether to round predictions (default: True)
key (str, required)
    Key field name for caching and identification.
verbose (bool, default: False)
    Whether to print progress and diagnostic information.

Returns (gpd.GeoDataFrame)
    GeoDataFrame with inferred values. Adds a boolean field inferred_{field_name} for each inferred field.

Example

import geopandas as gpd
from openavmkit.inference import perform_spatial_inference

# Load data
df = gpd.read_file("parcels.geojson")

# Configure inference
s_infer = {
    "bldg_sqft": {
        "model": {
            "type": "lightgbm",
            "proxies": ["bldg_footprint_sqft", "land_sqft"],
            "locations": ["neighborhood", "property_class"],
            "interactions": [["neighborhood", "property_class"]]
        },
        "fill": ["tax_bldg_sqft"],
        "round": True
    }
}

# Perform inference
df_result = perform_spatial_inference(df, s_infer, key="parcel_id", verbose=True)

InferenceModel (Base Class)

class InferenceModel(ABC)
Abstract base class for all inference models. All inference models must implement the three core methods: fit(), predict(), and evaluate().

fit

fit(df: pd.DataFrame, target: str, settings: Dict[str, Any]) -> None
Fit the model using training data.
df (pd.DataFrame, required)
    Training data containing features and target variable.
target (str, required)
    Field name of the target variable to predict.
settings (Dict[str, Any], required)
    Settings dictionary containing model configuration:
      • proxies: List of proxy/feature field names
      • locations: List of location/grouping field names
      • interactions: List of interaction feature definitions

predict

predict(df: pd.DataFrame) -> pd.Series
Make predictions on new data.
df (pd.DataFrame, required)
    Data to perform predictions on.

Returns (pd.Series)
    Predicted values of the target variable chosen during fit().

evaluate

evaluate(df: pd.DataFrame, target: str) -> Dict[str, float]
Evaluate model performance on training data.
df (pd.DataFrame, required)
    Training data with features and true target values.
target (str, required)
    Field name of the target variable.

Returns (Dict[str, float])
    Dictionary containing evaluation metrics:
      • mae: Mean Absolute Error
      • mape: Mean Absolute Percentage Error
      • rmse: Root Mean Squared Error
      • r2: R-squared score
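The three-method contract above can be satisfied by any custom model. The sketch below re-declares a minimal base class so it runs without the library installed, and uses a toy mean-predictor (not part of openavmkit) to show the shape of a conforming subclass:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import numpy as np
import pandas as pd


class InferenceModel(ABC):
    """Minimal stand-in for openavmkit's base class, for illustration only."""

    @abstractmethod
    def fit(self, df: pd.DataFrame, target: str, settings: Dict[str, Any]) -> None: ...

    @abstractmethod
    def predict(self, df: pd.DataFrame) -> pd.Series: ...

    @abstractmethod
    def evaluate(self, df: pd.DataFrame, target: str) -> Dict[str, float]: ...


class MeanModel(InferenceModel):
    """Toy model: always predicts the training mean of the target."""

    def fit(self, df: pd.DataFrame, target: str, settings: Dict[str, Any]) -> None:
        self.mean_ = float(df[target].mean())

    def predict(self, df: pd.DataFrame) -> pd.Series:
        # Same constant for every row, aligned to the input index
        return pd.Series(self.mean_, index=df.index)

    def evaluate(self, df: pd.DataFrame, target: str) -> Dict[str, float]:
        err = df[target] - self.predict(df)
        return {
            "mae": float(err.abs().mean()),
            "rmse": float(np.sqrt((err ** 2).mean())),
        }


df = pd.DataFrame({"x": [1, 2, 3], "bldg_sqft": [1000.0, 2000.0, 3000.0]})
model = MeanModel()
model.fit(df, "bldg_sqft", {})
print(model.predict(df).tolist())  # [2000.0, 2000.0, 2000.0]
print(model.evaluate(df, "bldg_sqft"))
```

A real subclass would additionally use the proxies, locations, and interactions from settings; the point here is only the fit/predict/evaluate interface.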

Available Models

RatioProxyModel

model = RatioProxyModel()
Ratio-based proxy model that calculates median ratios between target and proxy variables, stratified by location and grouping variables. Best for: linear relationships, simple proxy-based predictions
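The median-ratio idea can be sketched in a few lines of pandas. This is an illustration of the technique, not the library's internal implementation; the column names are taken from the earlier example:

```python
import pandas as pd

# Training data: learn median(target / proxy) per location group
train = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B"],
    "bldg_footprint_sqft": [500, 800, 400, 600],
    "bldg_sqft": [1000, 1600, 1200, 1800],
})

# Per-neighborhood median ratio of target to proxy
ratios = (train["bldg_sqft"] / train["bldg_footprint_sqft"]).groupby(
    train["neighborhood"]
).median()  # A -> 2.0, B -> 3.0

# Predict: proxy value times the group's median ratio
new = pd.DataFrame({"neighborhood": ["A", "B"],
                    "bldg_footprint_sqft": [700, 500]})
pred = new["bldg_footprint_sqft"] * new["neighborhood"].map(ratios)
print(pred.tolist())  # [1400.0, 1500.0]
```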

RandomForestModel

model = RandomForestModel()
Random Forest regressor with 200 trees. Handles non-linear relationships and feature interactions well. Configuration:
  • n_estimators: 200
  • max_depth: None
  • min_samples_split: 5
  • min_samples_leaf: 2
Best for: Non-linear relationships, robust predictions

LightGBMModel

model = LightGBMModel()
LightGBM gradient boosting model. Fast training and high accuracy with good handling of categorical features. Configuration:
  • n_estimators: 200
  • max_depth: -1 (no limit)
  • learning_rate: 0.05
  • subsample: 0.8
  • colsample_bytree: 0.8
Best for: Large datasets, categorical features, high accuracy requirements

XGBoostModel

model = XGBoostModel()
XGBoost gradient boosting model. Excellent performance with regularization to prevent overfitting. Configuration:
  • n_estimators: 200
  • max_depth: 6
  • learning_rate: 0.05
  • subsample: 0.8
  • colsample_bytree: 0.8
Best for: Structured data, preventing overfitting, competition-grade accuracy

EnsembleModel

model = EnsembleModel()
Ensemble model combining LightGBM, XGBoost, and Random Forest. Automatically optimizes weights for each model to minimize RMSE on validation data. Best for: Maximum accuracy, reducing model variance
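The weight-optimization step can be pictured as a search over convex blend weights that minimizes validation RMSE. The sketch below uses two members and a coarse grid search; the real EnsembleModel's optimizer and member set may differ:

```python
import numpy as np

# Validation targets and per-model predictions (toy numbers)
y_true = np.array([100.0, 200.0, 300.0])
preds = {
    "lightgbm": np.array([110.0, 190.0, 310.0]),
    "xgboost": np.array([90.0, 210.0, 290.0]),
}

def rmse(w_lgb: float) -> float:
    """RMSE of a convex blend: w*lightgbm + (1-w)*xgboost."""
    blend = w_lgb * preds["lightgbm"] + (1 - w_lgb) * preds["xgboost"]
    return float(np.sqrt(np.mean((blend - y_true) ** 2)))

# Coarse grid search over the single free weight
weights = np.linspace(0, 1, 101)
best_w = weights[np.argmin([rmse(w) for w in weights])]
print(best_w)  # 0.5: the two models' errors are opposite and cancel exactly
```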

Model Selection

To experiment with different models and find the best performer:
s_infer = {
    "bldg_sqft": {
        "model": {
            "type": "lightgbm",  # Current model
            "experiment": True,   # Enable experimentation
            "proxies": ["bldg_footprint_sqft"],
            "locations": ["neighborhood"]
        }
    }
}
When experiment: True, the system will:
  1. Train and evaluate all available models
  2. Report performance metrics for each
  3. Compare results on both validation and fill data
  4. Recommend the best performing model
  5. Still use the specified type for final predictions

CategoricalEncoder

encoder = CategoricalEncoder()
Universal categorical encoder that handles unseen categories by mapping them to a special unknown value.

fit

fit(series: pd.Series) -> None
Fit encoder by learning categories from the series.

transform

transform(series: pd.Series) -> np.ndarray
Transform values, mapping unseen categories to unknown.

fit_transform

fit_transform(series: pd.Series) -> np.ndarray
Fit and transform in one step.
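The unseen-category behavior described above can be sketched with a small stand-in class (this is illustrative, not openavmkit's implementation): known categories get stable integer codes, and anything new maps to a reserved unknown code.

```python
import numpy as np
import pandas as pd


class SimpleCategoricalEncoder:
    """Toy encoder: unseen categories map to a reserved code of -1."""

    UNKNOWN = -1

    def fit(self, series: pd.Series) -> None:
        # Assign codes in order of first appearance
        self.mapping_ = {cat: i for i, cat in enumerate(pd.unique(series.dropna()))}

    def transform(self, series: pd.Series) -> np.ndarray:
        return series.map(self.mapping_).fillna(self.UNKNOWN).to_numpy(dtype=int)

    def fit_transform(self, series: pd.Series) -> np.ndarray:
        self.fit(series)
        return self.transform(series)


enc = SimpleCategoricalEncoder()
codes = enc.fit_transform(pd.Series(["A", "B", "A"]))
print(codes.tolist())                                  # [0, 1, 0]
print(enc.transform(pd.Series(["B", "C"])).tolist())   # [1, -1]  ("C" is unseen)
```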

Advanced Configuration

Interaction Features

Create interaction features by combining multiple fields:
s_infer = {
    "bldg_sqft": {
        "model": {
            "type": "xgboost",
            "proxies": ["bldg_footprint_sqft", "stories"],
            "locations": ["neighborhood", "property_class"],
            "interactions": [
                ["neighborhood", "property_class"],
                ["neighborhood", "zoning"]
            ]
        }
    }
}
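One common way to materialize an interaction such as ["neighborhood", "property_class"] is to concatenate the fields into a single combined categorical feature. This sketch shows that idea; the exact column naming openavmkit uses internally may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["Elm", "Elm", "Oak"],
    "property_class": ["R1", "R2", "R1"],
})

# Combined categorical feature: each distinct pair becomes its own level
df["neighborhood_x_property_class"] = (
    df["neighborhood"].astype(str) + "_" + df["property_class"].astype(str)
)
print(df["neighborhood_x_property_class"].tolist())
# ['Elm_R1', 'Elm_R2', 'Oak_R1']
```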

Fill Before Inference

Use known values from related fields before running inference:
s_infer = {
    "bldg_sqft": {
        "model": {...},
        "fill": ["tax_bldg_sqft", "assessor_bldg_sqft"]
    }
}
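The fill step amounts to taking the first non-null value from the fill fields, in the order listed, before any model-based inference runs. A minimal sketch of that behavior (illustrative, using the field names from the config above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bldg_sqft":          [1500.0, np.nan, np.nan],
    "tax_bldg_sqft":      [np.nan, 1800.0, np.nan],
    "assessor_bldg_sqft": [np.nan, 1750.0, 2100.0],
})

# Fill missing target values from each source field, in priority order
for field in ["tax_bldg_sqft", "assessor_bldg_sqft"]:
    df["bldg_sqft"] = df["bldg_sqft"].fillna(df[field])

print(df["bldg_sqft"].tolist())  # [1500.0, 1800.0, 2100.0]
```

Row 2 takes 1800.0 from tax_bldg_sqft (first in the list) even though assessor_bldg_sqft also has a value; only rows still missing after that fall through to the next field.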

Filtered Inference

Only infer values for specific subsets:
s_infer = {
    "bldg_sqft": {
        "model": {...},
        "filters": [
            ["and",
                ["==", "is_vacant", False],
                [">", "land_sqft", 1000]
            ]
        ]
    }
}
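Conceptually, the nested prefix expressions above evaluate to a boolean mask over the DataFrame, and only rows matching the mask are inferred. The recursive evaluator below is a hypothetical sketch of those semantics, not the library's actual filter engine:

```python
import pandas as pd

# Comparison operators used by the filter expressions in this sketch
OPS = {
    "==": lambda s, v: s == v,
    ">":  lambda s, v: s > v,
}

def eval_filter(df: pd.DataFrame, expr: list) -> pd.Series:
    """Evaluate a prefix filter expression to a boolean mask."""
    op = expr[0]
    if op == "and":
        mask = eval_filter(df, expr[1])
        for sub in expr[2:]:
            mask &= eval_filter(df, sub)
        return mask
    # Leaf comparison: [op, field, value]
    return OPS[op](df[expr[1]], expr[2])

df = pd.DataFrame({"is_vacant": [False, True, False],
                   "land_sqft": [2000, 5000, 500]})
mask = eval_filter(df, ["and",
                        ["==", "is_vacant", False],
                        [">", "land_sqft", 1000]])
print(mask.tolist())  # [True, False, False]
```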
