Why Use Pipelines?
Prevent data leakage
A pipeline ensures that preprocessing is fit only on the training data and then applied, unchanged, to the test data.
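The idea of fitting on training data only can be illustrated with a minimal, self-contained sketch. `MeanCenterer` is a hypothetical stand-in written for this example, not one of the library's transformers; it learns a mean in `fit()` and reuses it in `transform()`.

```typescript
// Hypothetical minimal scaler written only for this sketch (not a library class).
class MeanCenterer {
  private mean = 0;

  // Learn the mean from whatever data is passed in (training data only).
  fit(values: number[]): this {
    const total = values.reduce((sum, v) => sum + v, 0);
    this.mean = values.length > 0 ? total / values.length : 0;
    return this;
  }

  // Apply the already-learned mean; nothing is re-estimated here.
  transform(values: number[]): number[] {
    return values.map((v) => v - this.mean);
  }
}

const trainValues = [1, 2, 3];
const testValues = [10, 20];

// Statistics come from the training data only (mean = 2).
const centerer = new MeanCenterer().fit(trainValues);
const trainCentered = centerer.transform(trainValues); // [-1, 0, 1]
const testCentered = centerer.transform(testValues); // [8, 18]
```

Because the test values never reach `fit()`, they cannot influence the learned mean, which is the guarantee a pipeline provides automatically.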
Creating a Pipeline
A pipeline is a sequence of named steps. Every step except the last must be a transformer; the last step can be a transformer or a predictor.
Basic Example
Step Naming Convention
Each step is a tuple of [name, transformer].
Step names must be unique, non-empty strings. They’re used to access steps and set parameters.
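The naming rules can be expressed as a small check. This is a sketch under assumptions: `Step` and `validateStepNames` are hypothetical names introduced here, and the library's own validation and error messages may differ.

```typescript
// Hypothetical types and helper for this sketch; not the library's API.
type Step = [name: string, transformer: unknown];

function validateStepNames(steps: Step[]): void {
  const seen = new Set<string>();
  for (const [name] of steps) {
    if (name.length === 0) {
      throw new Error("Step names must be non-empty strings");
    }
    if (seen.has(name)) {
      throw new Error(`Duplicate step name: ${name}`);
    }
    seen.add(name);
  }
}

// Placeholder objects stand in for real transformers and models.
const validSteps: Step[] = [
  ["scaler", {}],
  ["model", {}],
];
validateStepNames(validSteps); // passes silently

let duplicateRejected = false;
try {
  validateStepNames([
    ["scaler", {}],
    ["scaler", {}],
  ]);
} catch {
  duplicateRejected = true; // the repeated "scaler" name is rejected
}
```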
Pipeline Workflow
During Fitting
The pipeline applies fitTransform() to each intermediate step, passing the transformed data to the next step, and then fits the final step on the result.
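The fitting flow can be sketched as a simple loop. The interfaces and the `fitPipeline` function below are hypothetical and exist only to show the order of calls; they are not the library's implementation.

```typescript
type Matrix = number[][];

// Hypothetical interfaces for this sketch only.
interface Transformer {
  fitTransform(X: Matrix, y: number[]): Matrix;
}
interface FinalStep {
  fit(X: Matrix, y: number[]): void;
}

// Fit intermediate steps in order, feeding each output to the next step,
// then fit the final step on the fully transformed data.
function fitPipeline(
  intermediate: Array<[string, Transformer]>,
  final: [string, FinalStep],
  X: Matrix,
  y: number[],
): void {
  let current = X;
  for (const [, step] of intermediate) {
    current = step.fitTransform(current, y);
  }
  final[1].fit(current, y);
}

// Toy steps: add one, then double; the final step records what it receives.
const addOne: Transformer = {
  fitTransform: (X) => X.map((row) => row.map((v) => v + 1)),
};
const double: Transformer = {
  fitTransform: (X) => X.map((row) => row.map((v) => v * 2)),
};
let seenByModel: Matrix = [];
const recorder: FinalStep = {
  fit: (X) => {
    seenByModel = X;
  },
};

fitPipeline(
  [
    ["addOne", addOne],
    ["double", double],
  ],
  ["model", recorder],
  [[1], [2]],
  [0, 1],
);
// seenByModel is now [[4], [6]]: (1 + 1) * 2 and (2 + 1) * 2
```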
During Prediction
The pipeline applies transform() to each intermediate step, then calls predict() on the final step with the transformed data.
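Prediction follows the same chain without any fitting. The sketch below assumes the steps are already fitted; `FittedTransformer`, `FittedPredictor`, and `predictWithPipeline` are hypothetical names used for illustration.

```typescript
type Matrix = number[][];

// Hypothetical interfaces; the steps are assumed to be fitted already.
interface FittedTransformer {
  transform(X: Matrix): Matrix;
}
interface FittedPredictor {
  predict(X: Matrix): number[];
}

function predictWithPipeline(
  intermediate: Array<[string, FittedTransformer]>,
  final: [string, FittedPredictor],
  X: Matrix,
): number[] {
  let current = X;
  for (const [, step] of intermediate) {
    current = step.transform(current); // transform only, no fitting
  }
  return final[1].predict(current);
}

// Toy fitted steps: subtract a stored mean of 5, then classify by sign.
const centered: FittedTransformer = {
  transform: (X) => X.map((row) => row.map((v) => v - 5)),
};
const signClassifier: FittedPredictor = {
  predict: (X) => X.map((row) => (row.reduce((s, v) => s + v, 0) > 0 ? 1 : 0)),
};

const predictions = predictWithPipeline(
  [["center", centered]],
  ["model", signClassifier],
  [[3], [8]],
); // [0, 1]
```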
Accessing Pipeline Steps
You can access individual steps in several ways.
Classification Pipelines
Pipelines work the same way with classification models.
Complex Pipelines
Multiple Preprocessing Steps
Pipeline with Feature Selection
Setting Pipeline Parameters
During Initialization
After Creation with setParams
Use double-underscore notation to set nested parameters.
Getting Pipeline Parameters
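The double-underscore convention can be used both to set nested parameters and to read them back. The sketch below assumes keys of the form `stepName__paramName`, as in scikit-learn; the step representation, the helper functions, and the parameter names `withMean` and `alpha` are all hypothetical.

```typescript
// Hypothetical representation: each step is a name plus a plain params object.
type Params = Record<string, unknown>;
type ParamStep = [name: string, params: Params];

// Apply "stepName__paramName" updates to the matching step.
function setNestedParams(steps: ParamStep[], updates: Params): void {
  for (const [key, value] of Object.entries(updates)) {
    const separator = key.indexOf("__");
    const stepName = key.slice(0, Math.max(separator, 0));
    const paramName = separator >= 0 ? key.slice(separator + 2) : "";
    if (stepName.length === 0 || paramName.length === 0) {
      throw new Error(`Expected "step__param", got "${key}"`);
    }
    const match = steps.find(([name]) => name === stepName);
    if (match === undefined) {
      throw new Error(`Unknown step: ${stepName}`);
    }
    match[1][paramName] = value;
  }
}

// Read all parameters back using the same flattened naming.
function getNestedParams(steps: ParamStep[]): Params {
  const flat: Params = {};
  for (const [name, params] of steps) {
    for (const [paramName, value] of Object.entries(params)) {
      flat[`${name}__${paramName}`] = value;
    }
  }
  return flat;
}

const paramSteps: ParamStep[] = [
  ["scaler", { withMean: true }],
  ["model", { alpha: 1.0 }],
];
setNestedParams(paramSteps, { model__alpha: 0.1, scaler__withMean: false });
const flatParams = getNestedParams(paramSteps);
// flatParams: { scaler__withMean: false, model__alpha: 0.1 }
```

Splitting on the first double underscore is why step names are the handle for parameter access.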
Pipeline with Sample Weights
Pass sample weights through the entire pipeline. Sample weights are automatically routed to all steps that support them.
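One way to picture this routing is a loop that forwards the weights only to steps that declare support for them. This is a simplified sketch: the `supportsSampleWeight` flag and the `fitWithWeights` helper are assumptions made for illustration, the library may detect support differently, and data transformation between steps is omitted.

```typescript
// Hypothetical step shape: a flag marks steps that can use sample weights.
interface WeightAwareStep {
  name: string;
  supportsSampleWeight: boolean;
  fit(values: number[], sampleWeight?: number[]): void;
}

// Forward the weights only to steps that declare support for them,
// and report which steps received them.
function fitWithWeights(
  steps: WeightAwareStep[],
  values: number[],
  sampleWeight: number[],
): string[] {
  const received: string[] = [];
  for (const step of steps) {
    if (step.supportsSampleWeight) {
      step.fit(values, sampleWeight);
      received.push(step.name);
    } else {
      step.fit(values);
    }
  }
  return received;
}

let weightedMean = NaN;
const routedSteps: WeightAwareStep[] = [
  { name: "scaler", supportsSampleWeight: false, fit: () => {} },
  {
    name: "model",
    supportsSampleWeight: true,
    // Toy "model": stores the weighted mean of its inputs.
    fit: (values, sampleWeight) => {
      const w = sampleWeight ?? values.map(() => 1);
      const totalWeight = w.reduce((s, v) => s + v, 0);
      weightedMean = values.reduce((s, v, i) => s + v * (w[i] ?? 0), 0) / totalWeight;
    },
  },
];

const stepsGivenWeights = fitWithWeights(routedSteps, [1, 3], [3, 1]);
// stepsGivenWeights: ["model"]; weightedMean: (1 * 3 + 3 * 1) / 4 = 1.5
```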
Pipeline Methods
Pipelines support all standard model methods:
- fit()
- predict()
- predictProba()
- score()
- transform()
- fitTransform()
fit() trains the entire pipeline: it fits each step sequentially, transforming the data between steps.
Hyperparameter Tuning with Pipelines
Pipelines integrate with GridSearchCV, so preprocessing and model parameters can be tuned together.
Complete Example
Here’s a real-world pipeline for a classification task:
Best Practices
Order matters
Arrange steps in the following order, skipping any you don't need:
- Handle missing values (SimpleImputer)
- Generate features (PolynomialFeatures)
- Scale/normalize (StandardScaler)
- Select features (SelectKBest)
- Train model
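To see why the order matters, consider just the first and third items of the list above: imputing before scaling. The sketch below uses two hypothetical helper functions (not the library's SimpleImputer or StandardScaler) on a single column with one missing value.

```typescript
// Replace NaN entries with the mean of the observed values (a simple imputer).
function imputeMean(values: number[]): number[] {
  const observed = values.filter((v) => !Number.isNaN(v));
  const mean = observed.reduce((s, v) => s + v, 0) / observed.length;
  return values.map((v) => (Number.isNaN(v) ? mean : v));
}

// Subtract the mean of all values (a simple centering scaler).
function centerValues(values: number[]): number[] {
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  return values.map((v) => v - mean);
}

const raw = [1, NaN, 3];

// Impute first: the gap becomes 2, and centering yields [-1, 0, 1].
const imputeThenScale = centerValues(imputeMean(raw));

// Scale first: the NaN poisons the mean, so every entry becomes NaN
// and there is nothing left for the imputer to recover.
const scaleThenImpute = imputeMean(centerValues(raw));
```

The same reasoning applies to the other steps: each one should receive data in the form it expects from the step before it.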
Advanced Topics
ColumnTransformer
Apply different transformations to different columns.
FeatureUnion
Combine multiple transformers in parallel.
Related Topics
- Data Preprocessing - Learn about transformers
- Model Training - Understand fit/predict
- Model Evaluation - Evaluate pipelines
- Model Selection - Grid search with pipelines