```python
perform_spatial_inference(
    df: gpd.GeoDataFrame,
    s_infer: dict,
    key: str,
    verbose: bool = False
) -> gpd.GeoDataFrame
```
Perform spatial inference using specified model(s) to predict missing values in spatial datasets. This is the main orchestration function that handles the entire inference pipeline.
Parameters:
- df: Input GeoDataFrame with features and target variable.
- s_infer: Inference settings from config. A dictionary where keys are field names to infer and values are configuration dicts containing:
  - model: Model configuration (type, proxies, locations, interactions)
  - filters: Optional filter conditions to determine inference scope
  - fill: Optional list of fields to use for direct filling before inference
  - round: Whether to round predictions (default: True)
- key: Key field name for caching and identification.
- verbose: Whether to print progress and diagnostic information.

Returns:
GeoDataFrame with inferred values. Adds a boolean field inferred_{field_name} for each inferred field.
Example

```python
import geopandas as gpd
from openavmkit.inference import perform_spatial_inference

# Load data
df = gpd.read_file("parcels.geojson")

# Configure inference
s_infer = {
    "bldg_sqft": {
        "model": {
            "type": "lightgbm",
            "proxies": ["bldg_footprint_sqft", "land_sqft"],
            "locations": ["neighborhood", "property_class"],
            "interactions": [["neighborhood", "property_class"]]
        },
        "fill": ["tax_bldg_sqft"],
        "round": True
    }
}

# Perform inference
df_result = perform_spatial_inference(df, s_infer, key="parcel_id", verbose=True)
```
InferenceModel (Base Class)
```python
class InferenceModel(ABC)
```
Abstract base class for all inference models. All inference models must implement the three core methods: fit(), predict(), and evaluate().
fit

```python
fit(df: pd.DataFrame, target: str, settings: Dict[str, Any]) -> None
```

Fit the model using training data.

Parameters:
- df: Training data containing features and the target variable.
- target: Field name of the target variable to predict.
- settings: Settings dictionary containing model configuration:
  - proxies: List of proxy/feature field names
  - locations: List of location/grouping field names
  - interactions: List of interaction feature definitions
predict

```python
predict(df: pd.DataFrame) -> pd.Series
```

Make predictions on new data.

Parameters:
- df: Data to perform predictions on.

Returns:
Predicted values of the target variable chosen during fit().
evaluate

```python
evaluate(df: pd.DataFrame, target: str) -> Dict[str, float]
```

Evaluate model performance on training data.

Parameters:
- df: Training data with features and true target values.
- target: Field name of the target variable.

Returns:
Dictionary containing evaluation metrics:
- mae: Mean Absolute Error
- mape: Mean Absolute Percentage Error
- rmse: Root Mean Squared Error
- r2: R-squared score
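As a sketch of what implementing this interface looks like, here is a toy subclass that predicts the per-location mean of the target. The base class is reproduced from the signatures above; the mean-baseline logic is purely illustrative and is not an openavmkit model.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import numpy as np
import pandas as pd


class InferenceModel(ABC):
    """Stand-in for the abstract base class described above."""

    @abstractmethod
    def fit(self, df: pd.DataFrame, target: str, settings: Dict[str, Any]) -> None: ...

    @abstractmethod
    def predict(self, df: pd.DataFrame) -> pd.Series: ...

    @abstractmethod
    def evaluate(self, df: pd.DataFrame, target: str) -> Dict[str, float]: ...


class MeanBaselineModel(InferenceModel):
    """Toy model: predicts the mean target per location group."""

    def fit(self, df, target, settings):
        # Uses only the first location field, for brevity
        self._loc = settings["locations"][0]
        self._means = df.groupby(self._loc)[target].mean()
        self._global = df[target].mean()

    def predict(self, df):
        # Fall back to the global mean for unseen locations
        return df[self._loc].map(self._means).fillna(self._global)

    def evaluate(self, df, target):
        err = df[target] - self.predict(df)
        return {
            "mae": float(err.abs().mean()),
            "rmse": float(np.sqrt((err ** 2).mean())),
        }
```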
Available Models
RatioProxyModel
```python
model = RatioProxyModel()
```
Ratio-based proxy model that calculates median ratios between target and proxy variables, stratified by location and grouping variables. Simple but effective for linear relationships.
Best for: Linear relationships, simple proxy-based predictions
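The core idea can be sketched in a few lines. Field names are reused from the earlier example; this illustrates the median-ratio approach, not the class's actual implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood":        ["A",  "A",  "B",  "B"],
    "bldg_footprint_sqft": [1000, 1200, 800,  900],
    "bldg_sqft":           [2000, 2400, None, 1800],
})

# Median target/proxy ratio per location, computed on known rows only
known = df.dropna(subset=["bldg_sqft"])
ratios = (known["bldg_sqft"] / known["bldg_footprint_sqft"]) \
    .groupby(known["neighborhood"]).median()

# Predict missing values as proxy * location ratio
pred = df["bldg_footprint_sqft"] * df["neighborhood"].map(ratios)
```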
RandomForestModel
```python
model = RandomForestModel()
```
Random Forest regressor with 200 trees. Handles non-linear relationships and feature interactions well.
Configuration:
- n_estimators: 200
- max_depth: None
- min_samples_split: 5
- min_samples_leaf: 2
Best for: Non-linear relationships, robust predictions
LightGBMModel
```python
model = LightGBMModel()
```

LightGBM gradient boosting model. Fast training and high accuracy with good handling of categorical features.
Configuration:
- n_estimators: 200
- max_depth: -1 (no limit)
- learning_rate: 0.05
- subsample: 0.8
- colsample_bytree: 0.8
Best for: Large datasets, categorical features, high accuracy requirements
XGBoostModel
```python
model = XGBoostModel()
```

XGBoost gradient boosting model. Excellent performance with regularization to prevent overfitting.
Configuration:
- n_estimators: 200
- max_depth: 6
- learning_rate: 0.05
- subsample: 0.8
- colsample_bytree: 0.8
Best for: Structured data, preventing overfitting, competition-grade accuracy
EnsembleModel
```python
model = EnsembleModel()
```

Ensemble model combining LightGBM, XGBoost, and Random Forest. Automatically optimizes weights for each model to minimize RMSE on validation data.
Best for: Maximum accuracy, reducing model variance
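Weight optimization of this sort can be sketched as a search for blend weights that minimize RMSE on held-out predictions. This simplified grid search is illustrative only; openavmkit's actual optimizer may differ.

```python
import itertools

import numpy as np


def optimize_weights(preds, y_true, step=0.1):
    """Return (weights, rmse) minimizing blended RMSE over a weight grid.

    preds: list of prediction arrays, one per base model.
    """
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_rmse = None, np.inf
    for w in itertools.product(grid, repeat=len(preds)):
        if abs(sum(w) - 1.0) > 1e-9:
            continue  # keep only convex combinations (weights sum to 1)
        blend = sum(wi * p for wi, p in zip(w, preds))
        rmse = float(np.sqrt(np.mean((blend - y_true) ** 2)))
        if rmse < best_rmse:
            best_w, best_rmse = w, rmse
    return best_w, best_rmse
```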
Model Selection
To experiment with different models and find the best performer:
```python
s_infer = {
    "bldg_sqft": {
        "model": {
            "type": "lightgbm",    # Current model
            "experiment": True,    # Enable experimentation
            "proxies": ["bldg_footprint_sqft"],
            "locations": ["neighborhood"]
        }
    }
}
```
When experiment: True, the system will:
- Train and evaluate all available models
- Report performance metrics for each
- Compare results on both validation and fill data
- Recommend the best performing model
- Still use the specified type for final predictions
CategoricalEncoder
```python
encoder = CategoricalEncoder()
```
Universal categorical encoder that handles unseen categories by mapping them to a special unknown value.
fit

```python
fit(series: pd.Series) -> None
```

Fit encoder by learning categories from the series.

transform

```python
transform(series: pd.Series) -> np.ndarray
```

Transform values, mapping unseen categories to unknown.

fit_transform

```python
fit_transform(series: pd.Series) -> np.ndarray
```

Fit and transform in one step.
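A minimal sketch of this behavior, mirroring the interface above. The real CategoricalEncoder's internals and its choice of unknown code may differ.

```python
import numpy as np
import pandas as pd


class SimpleCategoricalEncoder:
    UNKNOWN = -1  # reserved code for categories not seen during fit

    def fit(self, series: pd.Series) -> None:
        # Assign a stable integer code to each observed category
        self._codes = {cat: i for i, cat in enumerate(pd.unique(series.dropna()))}

    def transform(self, series: pd.Series) -> np.ndarray:
        # Unseen (or missing) values fall back to the unknown code
        return series.map(self._codes).fillna(self.UNKNOWN).to_numpy(dtype=int)

    def fit_transform(self, series: pd.Series) -> np.ndarray:
        self.fit(series)
        return self.transform(series)
```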
Advanced Configuration
Interaction Features
Create interaction features by combining multiple fields:
```python
s_infer = {
    "bldg_sqft": {
        "model": {
            "type": "xgboost",
            "proxies": ["bldg_footprint_sqft", "stories"],
            "locations": ["neighborhood", "property_class"],
            "interactions": [
                ["neighborhood", "property_class"],
                ["neighborhood", "zoning"]
            ]
        }
    }
}
```
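Interaction features of this kind are commonly materialized by concatenating the component fields into a single categorical key, which an encoder can then code. This is an assumption about the implementation; the helper below is illustrative.

```python
import pandas as pd


def add_interaction(df: pd.DataFrame, fields: list) -> pd.Series:
    """Combine several categorical fields into one interaction feature."""
    return df[fields].astype(str).agg("_".join, axis=1)


df = pd.DataFrame({
    "neighborhood":   ["N1", "N2"],
    "property_class": ["R",  "C"],
})
df["neighborhood_x_property_class"] = add_interaction(
    df, ["neighborhood", "property_class"]
)
```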
Fill Before Inference
Use known values from related fields before running inference:
```python
s_infer = {
    "bldg_sqft": {
        "model": {...},
        "fill": ["tax_bldg_sqft", "assessor_bldg_sqft"]
    }
}
```
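The fill step amounts to coalescing the target from the listed fields, in priority order, before the model ever runs. The helper below is an illustrative sketch, not openavmkit's code; field names come from the config above.

```python
import pandas as pd


def fill_from(df: pd.DataFrame, target: str, fill_fields: list) -> pd.DataFrame:
    """Fill missing target values from related fields, in priority order."""
    out = df.copy()
    for field in fill_fields:
        out[target] = out[target].fillna(out[field])
    return out


df = pd.DataFrame({
    "bldg_sqft":          [None,  None,   1100.0],
    "tax_bldg_sqft":      [950.0, None,   None],
    "assessor_bldg_sqft": [900.0, 1050.0, None],
})
filled = fill_from(df, "bldg_sqft", ["tax_bldg_sqft", "assessor_bldg_sqft"])
```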
Filtered Inference
Only infer values for specific subsets:
```python
s_infer = {
    "bldg_sqft": {
        "model": {...},
        "filters": [
            ["and",
                ["==", "is_vacant", False],
                [">", "land_sqft", 1000]
            ]
        ]
    }
}
```
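These prefix-style expressions can be evaluated recursively against the DataFrame to produce a boolean mask of rows in scope. The sketch below supports only the operators used above; openavmkit's actual filter engine presumably handles more.

```python
import pandas as pd


def eval_filter(df: pd.DataFrame, expr: list) -> pd.Series:
    """Evaluate a prefix filter expression to a boolean row mask."""
    op = expr[0]
    if op == "and":
        # Combine sub-expressions with logical AND
        masks = [eval_filter(df, sub) for sub in expr[1:]]
        out = masks[0]
        for m in masks[1:]:
            out &= m
        return out
    field, value = expr[1], expr[2]
    if op == "==":
        return df[field] == value
    if op == ">":
        return df[field] > value
    raise ValueError(f"unsupported operator: {op}")
```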