This guide demonstrates a complete classification workflow using the Bank Marketing dataset (11,162 samples).

Dataset Overview

The Bank Marketing dataset contains information about direct marketing campaigns (phone calls) of a Portuguese banking institution. Features:
  • Demographics: age, job, marital, education
  • Financial: default, balance, housing, loan
  • Campaign: contact, day, month, duration, campaign
  • Previous campaigns: pdays, previous, poutcome
Target:
  • deposit: Whether the client subscribed to a term deposit (yes / no)
Size: 11,162 rows
Challenge: This is an imbalanced classification problem with far more “no” than “yes” responses.

Basic Usage

Run a classification experiment with default settings:
```bash
python -m src.main run \
  --data data/sample/bank.csv \
  --target deposit \
  --task classification \
  --max-iterations 3 \
  --verbose
```

What Happens

  1. Data Profiling: Analyzes schema, categorical features, class distribution
  2. Baseline Model: Trains Logistic Regression to establish baseline accuracy
  3. Iteration Loop: Gemini designs experiments considering class imbalance
  4. Report Generation: Creates narrative report with classification insights
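Step 2 can be sketched roughly as follows. This is illustrative code on a tiny synthetic stand-in for bank.csv, not the project's actual implementation in `src.main`: one-hot encode categoricals, scale numericals, fit Logistic Regression.

```python
# Illustrative baseline sketch (synthetic stand-in data, not the
# project's actual pipeline).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 40, 35, 50, 29, 61, 33, 47],
    "balance": [100, 2500, 300, 4000, 150, 5200, 700, 900],
    "job": ["student", "admin", "admin", "retired",
            "student", "retired", "admin", "services"],
    "deposit": ["no", "no", "no", "yes", "no", "yes", "no", "no"],
})
X, y = df.drop(columns="deposit"), df["deposit"]

baseline = Pipeline([
    ("prep", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["job"]),
        ("scale", StandardScaler(), ["age", "balance"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X, y)
print(baseline.predict(X)[:3])
```
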

Expected Output

```
╔══════════════════════════════════════════════════════════════╗
║  DATA PROFILE                                                ║
╚══════════════════════════════════════════════════════════════╝

Dataset: bank.csv
Rows: 11,162 | Columns: 17
Task: classification

Target Distribution:
  no:  9,629 (86.3%)
  yes: 1,533 (13.7%)

⚠ Class imbalance detected (ratio: 6.3:1)

Features:
  Categorical: 10 (job, marital, education, default, housing, ...)
  Numerical: 7 (age, balance, duration, campaign, pdays, ...)
  Missing values: 0

╔══════════════════════════════════════════════════════════════╗
║  ITERATION 1 - GEMINI'S REASONING                            ║
║  Thought Signature Active | Context: 4 turns                 ║
╚══════════════════════════════════════════════════════════════╝

Based on the data profile, I observe:
- Significant class imbalance (86.3% "no" vs 13.7% "yes")
- Mix of categorical and numerical features
- "duration" may be a strong predictor

For this iteration, I'm testing Random Forest with class weights
to handle the imbalance...

┌─────────────────────────────────────────────────────────────┐
│ RESULTS ANALYSIS                                            │
├─────────────────────────────────────────────────────────────┤
│ Trend: IMPROVING                                            │
│ F1-Score: 0.4523   ★ NEW BEST                               │
│ Accuracy: 0.8912                                            │
│ Precision: 0.6234 | Recall: 0.3521                          │
│                                                             │
│ Key Observations:                                           │
│   - Class weights improved minority class recall            │
│   - Tree-based model handles mixed feature types well       │
│   - Precision-recall tradeoff evident                       │
└─────────────────────────────────────────────────────────────┘
```

With Classification Constraints

Optimize for specific classification metrics:
```bash
python -m src.main run \
  --data data/sample/bank.csv \
  --target deposit \
  --task classification \
  --constraints constraints_classification.md \
  --max-iterations 5 \
  --verbose
```

Impact of Constraints

With F1-focused constraints, Gemini will:
  • Balance precision and recall optimization
  • Apply techniques like SMOTE or class weights
  • Test different decision thresholds
  • Focus on ensemble methods
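One of these techniques, class weighting, can be sketched on a synthetic imbalanced problem. This is illustrative only, not the code the tool generates:

```python
# Sketch of the class-weight technique on a synthetic imbalanced
# problem (illustrative; not the generated experiment code).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.86, 0.14], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=42)

plain = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced",
                                  random_state=42).fit(X_tr, y_tr)

print("plain    F1:", round(f1_score(y_te, plain.predict(X_te)), 4))
print("weighted F1:", round(f1_score(y_te, weighted.predict(X_te)), 4))
```

`class_weight="balanced"` reweights each class inversely to its frequency, which typically trades a little precision for better minority-class recall.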

Advanced Configuration

Example constraints file (such as the `constraints_classification.md` passed above):

```md
# Imbalance-Aware Constraints

## Class Imbalance Strategy
- Use SMOTE for oversampling minority class
- Apply class_weight='balanced' for all models
- Consider ensemble methods

## Metrics
- Primary: F1-score
- Secondary: ROC-AUC
- Report precision and recall separately

## Models
- XGBClassifier with scale_pos_weight
- RandomForest with class_weight
- LGBMClassifier with is_unbalance=True
```
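The `scale_pos_weight` value mentioned above is conventionally set to the negative-to-positive count ratio. With the class counts from the data profile earlier on this page:

```python
# scale_pos_weight is conventionally neg_count / pos_count.
# Counts taken from the data profile earlier on this page.
neg, pos = 9_629, 1_533   # "no" vs "yes" in bank.csv
scale_pos_weight = neg / pos
print(round(scale_pos_weight, 1))  # 6.3, matching the reported ratio
```
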

Interpreting Classification Results

Confusion Matrix

The final report includes a confusion matrix:

```
                 Predicted
                 No    Yes
Actual   No    8,234   145
         Yes     856   677
```
Interpretation:
  • True Negatives (8,234): Correctly predicted “no”
  • False Positives (145): Predicted “yes” but actual “no”
  • False Negatives (856): Predicted “no” but actual “yes” (costly!)
  • True Positives (677): Correctly predicted “yes”
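A minimal sketch of producing such a matrix with scikit-learn, using hand-made toy labels rather than the bank data:

```python
# Toy confusion matrix with scikit-learn (hand-made labels,
# not the bank data).
from sklearn.metrics import confusion_matrix

y_true = ["no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "yes", "no", "yes", "no", "yes", "no"]

# Rows = actual, columns = predicted; labels=[...] pins the order.
cm = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
tn, fp, fn, tp = cm.ravel()
print(cm)  # [[4 1]
           #  [1 2]]
```
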

Key Metrics

```md
## Best Model

**Model**: XGBClassifier
**F1-Score**: 0.5234
**Accuracy**: 0.8923
**Precision**: 0.6521
**Recall**: 0.4412
**ROC-AUC**: 0.8456

### Hyperparameters
- n_estimators: 200
- max_depth: 5
- learning_rate: 0.05
- scale_pos_weight: 6.3
- subsample: 0.8

### Preprocessing
- Categorical encoding: One-hot
- Feature scaling: StandardScaler
- Class imbalance: scale_pos_weight
```

Metric Definitions

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| Accuracy | (TP + TN) / Total | Overall correctness (misleading with imbalance) |
| Precision | TP / (TP + FP) | Of predicted “yes”, how many are correct? |
| Recall | TP / (TP + FN) | Of actual “yes”, how many did we catch? |
| F1-Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall |
| ROC-AUC | Area under ROC curve | Overall discrimination ability |
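The formulas above can be evaluated directly. The counts below are arbitrary illustrative values, not results from this dataset:

```python
# The formulas above, evaluated on arbitrary illustrative counts
# (not results from this dataset).
tp, fp, fn, tn = 40, 10, 20, 130

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 0.85
precision = tp / (tp + fp)                    # 0.80
recall    = tp / (tp + fn)                    # ~0.667
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(f1, 3))
```

Note how the F1-score (≈0.727) sits between precision and recall but closer to the lower of the two, which is exactly why it is preferred over accuracy on imbalanced data.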

Feature Importance

Gemini’s analysis typically reveals:
```md
### Top 5 Features
1. **duration** (0.352) - Call duration is the strongest predictor
2. **poutcome** (0.128) - Previous campaign outcome matters
3. **month** (0.095) - Seasonality affects response
4. **balance** (0.082) - Account balance correlates with subscription
5. **age** (0.071) - Age influences decision-making
```
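A sketch of how such a ranking can be extracted from a fitted tree ensemble. The data is synthetic and the feature names are stand-ins, so the scores will not match those reported above:

```python
# Sketch: ranking feature_importances_ from a fitted tree ensemble.
# Synthetic data; the names are stand-ins for the real columns.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
names = ["duration", "poutcome", "month", "balance", "age"]

model = RandomForestClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```
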

Common Results

Typical progression for this dataset:
| Iteration | Model | F1-Score | Precision | Recall | Strategy |
|-----------|-------|----------|-----------|--------|----------|
| Baseline | LogisticRegression | 0.3421 | 0.5234 | 0.2567 | Baseline |
| 1 | RandomForest | 0.4523 | 0.6234 | 0.3521 | Class weights |
| 2 | XGBClassifier | 0.5012 | 0.6432 | 0.4098 | scale_pos_weight |
| 3 | XGBClassifier + tuning | 0.5234 | 0.6521 | 0.4412 | Hyperparameter opt |
| 4 | LGBMClassifier | 0.5156 | 0.6289 | 0.4378 | Alternative booster |
| 5 | Ensemble | 0.5312 | 0.6678 | 0.4456 | Voting classifier |

Why These Results?

  • Class imbalance is challenging: Only 13.7% positive class
  • Duration is a strong predictor: But it may not be available before the call is made
  • Boosting handles imbalance well: XGBoost and LightGBM with class weights
  • F1-score around 0.50-0.53: Typical for this dataset with proper handling
  • Precision-recall tradeoff: Can tune threshold based on business needs

Threshold Optimization

For classification, you can optimize the decision threshold:
```md
## Threshold Analysis

| Threshold | Precision | Recall | F1-Score | Use Case |
|-----------|-----------|--------|----------|----------|
| 0.3 | 0.4512 | 0.6234 | 0.5234 | Maximize reach |
| 0.5 | 0.6521 | 0.4412 | 0.5234 | Balanced (default) |
| 0.7 | 0.7834 | 0.2891 | 0.4234 | High confidence |

**Recommendation**: Use threshold 0.4 to balance cost of false negatives
vs. campaign efficiency.
```
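Such a threshold sweep can be sketched with scikit-learn. Synthetic imbalanced data, so the metric values will differ from the table above; the point is the precision/recall movement as the threshold rises:

```python
# Sketch: sweeping the decision threshold over predict_proba scores
# (synthetic imbalanced data; values differ from the table above).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.86, 0.14],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(positive class)

for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_te, pred, zero_division=0), 3),
          round(recall_score(y_te, pred), 3))
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases with the threshold, while precision usually rises.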

Viewing Results in MLflow

```bash
mlflow ui --backend-store-uri file:./outputs/mlruns
# Open http://127.0.0.1:5000
```
In MLflow, you can:
  • Compare F1-scores across iterations
  • View confusion matrices
  • Analyze precision-recall curves
  • Download classification reports
  • Compare feature importance across models

Next Steps

  • Regression Example: Learn about regression experiments
  • Advanced Constraints: Complex constraint configurations
  • Metrics: Understanding evaluation metrics
  • Class Imbalance: Handling imbalanced datasets
