The shap_analysis module provides comprehensive SHAP (SHapley Additive exPlanations) value calculation and visualization tools for XGBoost, LightGBM, and CatBoost models. SHAP values explain individual predictions by quantifying each feature’s contribution.
Core Functions
get_full_model_shaps
get_full_model_shaps(
    model: XGBoostModel | LightGBMModel | CatBoostModel,
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    X_sales: pd.DataFrame,
    X_univ: pd.DataFrame,
    verbose: bool = False
) -> dict
Calculate SHAP values for all data subsets (train, test, sales, universe) from a trained model.
This is the primary function for generating comprehensive SHAP explanations across all relevant datasets in a typical AVM workflow.
- model (XGBoostModel | LightGBMModel | CatBoostModel, required): Trained prediction model (must be one of the supported tree-based models)
- X_train (pd.DataFrame, required): Training set features (independent variables)
- X_test (pd.DataFrame, required): Test set features
- X_sales (pd.DataFrame, required): Sales set features
- X_univ (pd.DataFrame, required): Universe (full population) features
- verbose (bool, default: False): Print detailed progress information during SHAP calculation
Dictionary containing shap.Explanation objects with keys:
- "train": SHAP values for training data
- "test": SHAP values for test data
- "sales": SHAP values for sales data
- "univ": SHAP values for universe data
Performance Notes:
- XGBoost: Uses approximate mode by default for speed
- LightGBM: Uses the exact native pred_contrib=True method
- CatBoost: Uses native get_feature_importance(type="ShapValues") with approximate mode
make_shap_table
make_shap_table(
    expl: shap.Explanation,
    list_keys: list[str],
    list_vars: list[str],
    list_keys_sale: list[str] | None = None,
    include_pred: bool = True
) -> pd.DataFrame
Convert SHAP explanation into a tabular DataFrame breaking down feature contributions.
This function transforms SHAP values into a flat table format suitable for analysis, export, or further processing.
- expl (shap.Explanation, required): SHAP Explanation object (output from get_full_model_shaps or a tree explainer)
- list_keys (list[str], required): Primary keys in the same row order as the explained data (e.g., parcel IDs)
- list_vars (list[str], required): Feature names in canonical training order
- list_keys_sale (list[str] | None, default: None): Optional transaction keys (for sales data)
- include_pred (bool, default: True): Include a column reconstructing the model prediction: base_value + sum(SHAP values)
DataFrame with columns:
- key: Primary identifier
- key_sale: Transaction identifier (if provided)
- base_value: Model's base prediction value
- One column per feature with SHAP contribution values
- contribution_sum: Reconstructed prediction (if include_pred=True)
Column Order: [key, key_sale?, base_value, feature_1, feature_2, ..., feature_n, contribution_sum?]
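A minimal hand-built sketch of this layout with pandas (toy numbers matching the examples below; this is an illustration of the output shape, not the actual make_shap_table implementation):

```python
import numpy as np
import pandas as pd

# Toy SHAP values: one row per parcel, one column per feature.
keys = ["P001", "P002"]
base_value = 11.852
features = ["bldg_area", "land_area", "age"]
shap_vals = np.array([[0.324, 0.156, -0.089],
                      [0.187, 0.098, -0.034]])

# Assemble in the documented column order: key, base_value,
# per-feature contributions, then the reconstructed prediction.
df = pd.DataFrame(shap_vals, columns=features)
df.insert(0, "key", keys)
df.insert(1, "base_value", base_value)
df["contribution_sum"] = df["base_value"] + shap_vals.sum(axis=1)

print(list(df.columns))
# ['key', 'base_value', 'bldg_area', 'land_area', 'age', 'contribution_sum']
```

Note that contribution_sum is exactly base_value plus the row sum of the feature columns, which is what include_pred=True reconstructs.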
plot_full_beeswarm
plot_full_beeswarm(
    explanation: shap.Explanation,
    title: str = "SHAP Beeswarm",
    save_path: str | None = None,
    save_kwargs: dict | None = None,
    wrap_width: int = 20
) -> None
Create a comprehensive SHAP beeswarm plot with wrapped feature names.
Beeswarm plots show the distribution of SHAP values for each feature, with color indicating feature value magnitude.
- explanation (shap.Explanation, required): SHAP Explanation object to visualize
- title (str, default: "SHAP Beeswarm"): Plot title
- save_path (str | None, default: None): Optional file path to save the figure (e.g., "plots/shap_beeswarm.png"); format is inferred from the extension (.png, .pdf, .svg)
- save_kwargs (dict | None, default: None): Additional arguments for plt.savefig() (e.g., {"dpi": 300, "bbox_inches": "tight"})
- wrap_width (int, default: 20): Maximum character width for feature name wrapping
Features:
- Automatic figure sizing based on feature count
- Wrapped feature names for readability
- Color-coded by feature value (red = high, blue = low)
- Sorted by mean absolute SHAP value
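The name wrapping can be approximated with the standard library textwrap module. This is a sketch of the assumed behavior (simple character-width breaking of long names), not the module's internal code:

```python
import textwrap

def wrap_name(name: str, width: int = 20) -> str:
    # textwrap breaks long unbroken tokens at `width` characters,
    # which is how a wrap_width of 20 would split feature names.
    return "\n".join(textwrap.wrap(name, width))

print(wrap_name("dist_to_central_business_district"))
# dist_to_central_busi
# ness_district
```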
Private Helper Functions
_xgboost_shap
_xgboost_shap(
    model: XGBoostModel,
    X_train: pd.DataFrame,
    background_size: int = 100,
    approximate: bool = True,
    check_additivity: bool = False
) -> shap.TreeExplainer
Create a SHAP TreeExplainer for XGBoost models with categorical feature support.
Settings:
- Uses tree_path_dependent perturbation for categorical splits
- Enables categorical DMatrix properties automatically
_lightgbm_shap
_lightgbm_shap(
    model: LightGBMModel,
    X_train: pd.DataFrame,
    background_size: int = 100,
    approximate: bool = True,
    check_additivity: bool = False
) -> shap.TreeExplainer
Create a SHAP TreeExplainer for LightGBM models.
Behavior:
- With categorical features: uses tree_path_dependent mode without background data
- Without categorical features: uses interventional mode with background samples
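The branching above can be sketched as a tiny helper (hypothetical name and return shape; the real private function builds a shap.TreeExplainer directly):

```python
def lightgbm_shap_settings(has_categoricals: bool) -> dict:
    """Pick explainer settings for a LightGBM model (illustrative sketch).

    Interventional SHAP needs background data but does not support
    categorical splits, so categorical models fall back to
    path-dependent perturbation with no background sample.
    """
    if has_categoricals:
        return {"feature_perturbation": "tree_path_dependent", "background": None}
    return {"feature_perturbation": "interventional", "background": "sampled"}

print(lightgbm_shap_settings(True)["feature_perturbation"])   # tree_path_dependent
print(lightgbm_shap_settings(False)["feature_perturbation"])  # interventional
```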
_catboost_shap
_catboost_shap(
    model: CatBoostModel,
    X_train: pd.DataFrame,
    background_size: int = 100,
    approximate: bool = True,
    check_additivity: bool = False
) -> shap.TreeExplainer
Create a SHAP TreeExplainer for CatBoost models.
Settings:
- Uses tree_path_dependent mode (required for categorical splits)
- Tags the explainer with a _cb_model attribute for special handling
_shap_explain
_shap_explain(
    model_type: str,
    te: shap.TreeExplainer,
    X_to_explain: pd.DataFrame,
    approximate: bool = True,
    check_additivity: bool = False,
    cat_data: TreeBasedCategoricalData | None = None,
    verbose: bool = False,
    label: str = ""
) -> shap.Explanation
Internal function that computes SHAP values using the appropriate backend for each model type.
Fast Paths:
- CatBoost: Uses native get_feature_importance(type="ShapValues") with "Approximate" mode
- LightGBM: Uses native booster.predict(pred_contrib=True) for speed
- XGBoost: Uses the standard TreeExplainer with categorical support
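The fast-path selection amounts to a dispatch on model type; a minimal, illustrative sketch (a hypothetical helper, not the real _shap_explain, which also handles categorical encodings and verbosity):

```python
def shap_backend(model_type: str) -> str:
    """Return which SHAP computation path a model type uses (sketch)."""
    backends = {
        "catboost": 'get_feature_importance(type="ShapValues")',
        "lightgbm": "booster.predict(pred_contrib=True)",
        "xgboost": "shap.TreeExplainer",
    }
    try:
        return backends[model_type]
    except KeyError:
        raise ValueError(f"unsupported model type: {model_type}")

print(shap_backend("lightgbm"))  # booster.predict(pred_contrib=True)
```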
Usage Examples
Example 1: Calculate SHAP Values for All Datasets
from openavmkit.shap_analysis import get_full_model_shaps, plot_full_beeswarm
from openavmkit.modeling import XGBoostModel
# Train model (simplified)
model = XGBoostModel()
model.fit(X_train, y_train)
# Calculate SHAP values for all subsets
shap_results = get_full_model_shaps(
    model=model,
    X_train=X_train,
    X_test=X_test,
    X_sales=X_sales,
    X_univ=X_universe,
    verbose=True,
)

# Visualize test set SHAP values
plot_full_beeswarm(
    shap_results["test"],
    title="SHAP Values - Test Set",
    save_path="plots/shap_test_beeswarm.png",
    save_kwargs={"dpi": 300, "bbox_inches": "tight"},
)
Example 2: Create SHAP Contribution Table
from openavmkit.shap_analysis import make_shap_table
# Convert SHAP values to tabular format
df_shap = make_shap_table(
    expl=shap_results["univ"],
    list_keys=df_universe["key"].tolist(),
    list_vars=feature_names,
    include_pred=True,
)

# Inspect contributions
print(df_shap.head())
#    key  base_value  bldg_area  land_area    age  ...  contribution_sum
# 0  P001     11.852      0.324      0.156 -0.089  ...            12.243
# 1  P002     11.852      0.187      0.098 -0.034  ...            12.103

# Find properties where age has a large negative impact.
# On the log scale, a SHAP value of -0.5 is roughly a 39% reduction
# in value, since exp(-0.5) ≈ 0.61.
df_age_impact = df_shap[df_shap["age"] < -0.5]
print(f"Properties where age reduces value by ~39% or more: {len(df_age_impact)}")

# Export for further analysis
df_shap.to_csv("out/shap_contributions_universe.csv", index=False)
Example 3: Compare Feature Importance Across Subsets
import numpy as np
import pandas as pd
# Calculate mean absolute SHAP values for each subset
subsets = ["train", "test", "sales", "univ"]
importance_data = {}
for subset in subsets:
    shap_vals = shap_results[subset].values
    mean_abs_shap = np.abs(shap_vals).mean(axis=0)
    importance_data[subset] = mean_abs_shap

df_importance = pd.DataFrame(
    importance_data,
    index=feature_names,
).sort_values(by="test", ascending=False)

print(df_importance.head(10))
#                     train   test  sales   univ
# bldg_area_finished  0.432  0.428  0.445  0.431
# land_area           0.287  0.289  0.312  0.285
# age                 0.156  0.162  0.178  0.154
Example 4: Individual Prediction Explanation
# Explain a single property's valuation
property_idx = 42
shap_table = make_shap_table(
    shap_results["univ"],
    list_keys=df_universe["key"].tolist(),
    list_vars=feature_names,
)
property_shap = shap_table.iloc[property_idx]
base_price = np.exp(property_shap["base_value"])

print(f"Property: {property_shap['key']}")
print(f"Base Value: ${base_price:,.0f}")
print(f"Predicted Value: ${np.exp(property_shap['contribution_sum']):,.0f}")

# Per-feature dollar impact relative to the base price (log-scale model):
# base_price * (exp(shap_value) - 1)
contributions = property_shap.drop(["key", "base_value", "contribution_sum"])

print("\nTop 5 Positive Contributors:")
for feature, value in contributions.nlargest(5).items():
    print(f"  {feature}: +${base_price * (np.exp(value) - 1):,.0f}")

print("\nTop 5 Negative Contributors:")
for feature, value in contributions.nsmallest(5).items():
    print(f"  {feature}: -${base_price * (1 - np.exp(value)):,.0f}")
Understanding SHAP Values
What SHAP Values Represent
- Base Value: Average model prediction across the training data
- SHAP Value: Change in prediction (on log scale for log models) attributable to that feature
- Prediction: base_value + sum(all SHAP values)
Interpretation
- Positive SHAP value: Feature increases predicted value
- Negative SHAP value: Feature decreases predicted value
- Magnitude: Larger absolute values = stronger influence
- Additivity: SHAP values sum exactly to the prediction
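The additivity property can be checked numerically on any explanation. A minimal NumPy sketch, reusing the toy numbers from the Example 2 table:

```python
import numpy as np

# Toy SHAP output: 2 properties x 3 features, with a shared base value
# (these numbers mirror the illustrative table in Example 2).
values = np.array([[0.324, 0.156, -0.089],
                   [0.187, 0.098, -0.034]])
base_values = np.full(2, 11.852)

# Additivity: base value plus the row sum of SHAP values
# reconstructs each prediction exactly.
preds = base_values + values.sum(axis=1)
print(preds)  # [12.243 12.103]
```

On real explanations, the same check is `explanation.base_values + explanation.values.sum(axis=1)` against the model's predictions (exactness can degrade slightly in approximate modes).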
For Log-Scale Models
If your model predicts log(price), SHAP values are also on log scale:
# Convert from log scale to dollar impact
base_price = np.exp(base_value)
feature_impact_dollars = base_price * (np.exp(shap_value) - 1)
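As a concrete worked example, using the illustrative base value from the Example 2 table (dollar figures are approximate):

```python
import numpy as np

base_value = 11.852   # log-scale base prediction (from the example table)
shap_value = 0.324    # log-scale SHAP contribution of one feature

base_price = np.exp(base_value)                        # ≈ $140,000
feature_impact_dollars = base_price * (np.exp(shap_value) - 1)
print(f"${feature_impact_dollars:,.0f}")               # ≈ $54,000
```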
Model Support Matrix
| Model Type | Categorical Support | Approximate Mode | Native SHAP |
|---|---|---|---|
| XGBoost | ✓ | ✓ | via TreeExplainer |
| LightGBM | ✓ | ✗ | ✓ (pred_contrib) |
| CatBoost | ✓ | ✓ | ✓ (get_feature_importance) |
Notes:
- All models support categorical features through appropriate handling
- LightGBM’s native method is exact and fast
- CatBoost’s “Approximate” mode provides significant speed gains