Pipelines are a powerful way to chain together preprocessing steps and models into a single, cohesive workflow. They help prevent data leakage, ensure consistency, and make your code cleaner.
Why Use Pipelines?
Prevent data leakage
A pipeline ensures that preprocessing is fit only on the training data and then applied, unchanged, to the test data.
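To make the leakage point concrete, here is a minimal, library-independent sketch. The helpers `fitScaler` and `applyScaler` are hypothetical and are not part of bun-scikit; they only illustrate that the scaling statistics come from the training split alone and are then reused for the test split.

```typescript
// Hypothetical helpers for illustration only; not the bun-scikit API.
type ScalerStats = { mean: number; std: number };

// Learn scaling statistics from the training values only.
function fitScaler(train: number[]): ScalerStats {
  const mean = train.reduce((a, b) => a + b, 0) / train.length;
  const variance = train.reduce((a, b) => a + (b - mean) ** 2, 0) / train.length;
  // Fall back to 1 when the variance is zero so the division stays finite.
  return { mean, std: Math.sqrt(variance) || 1 };
}

// Apply previously learned statistics to any split.
function applyScaler(values: number[], stats: ScalerStats): number[] {
  return values.map((v) => (v - stats.mean) / stats.std);
}

const trainValues = [1, 2, 3, 4, 5];
const testValues = [10];

// The statistics never see the test split.
const stats = fitScaler(trainValues);
const scaledTrain = applyScaler(trainValues, stats);
const scaledTest = applyScaler(testValues, stats);
```

Calling `fitScaler` on the combined train and test values would let information from the test split influence the scaling, which is the leakage a pipeline is meant to rule out.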
Creating a Pipeline
A pipeline is a sequence of named steps. Each step except the last must be a transformer; the last step can be either a transformer or a predictor.
Basic Example
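A minimal, self-contained sketch of this structure follows. It does not use the bun-scikit API: `Transformer`, `Predictor`, `Centerer`, `SlopeModel`, and `SketchPipeline` are hypothetical stand-ins, assuming 2-D numeric input and a numeric target.

```typescript
// Hypothetical stand-ins for illustration; not the bun-scikit classes.
type Matrix = number[][];

interface Transformer {
  fitTransform(X: Matrix): Matrix;
  transform(X: Matrix): Matrix;
}

interface Predictor {
  fit(X: Matrix, y: number[]): void;
  predict(X: Matrix): number[];
}

// Transformer: subtracts the per-column mean learned during fitting.
class Centerer implements Transformer {
  private means: number[] = [];
  fitTransform(X: Matrix): Matrix {
    this.means = X[0].map((_, j) => X.reduce((s, row) => s + row[j], 0) / X.length);
    return this.transform(X);
  }
  transform(X: Matrix): Matrix {
    return X.map((row) => row.map((v, j) => v - this.means[j]));
  }
}

// Predictor: least-squares line fitted on the first column.
class SlopeModel implements Predictor {
  private slope = 0;
  private intercept = 0;
  fit(X: Matrix, y: number[]): void {
    const x = X.map((row) => row[0]);
    const mx = x.reduce((a, b) => a + b, 0) / x.length;
    const my = y.reduce((a, b) => a + b, 0) / y.length;
    const sxy = x.reduce((s, v, i) => s + (v - mx) * (y[i] - my), 0);
    const sxx = x.reduce((s, v) => s + (v - mx) ** 2, 0);
    this.slope = sxy / sxx;
    this.intercept = my - this.slope * mx;
  }
  predict(X: Matrix): number[] {
    return X.map((row) => this.intercept + this.slope * row[0]);
  }
}

// Named steps: every step before the last is a transformer, the last is a predictor.
class SketchPipeline {
  private transformers: [string, Transformer][];
  private finalStep: [string, Predictor];
  constructor(transformers: [string, Transformer][], finalStep: [string, Predictor]) {
    this.transformers = transformers;
    this.finalStep = finalStep;
  }
  fit(X: Matrix, y: number[]): this {
    let data = X;
    for (const [, step] of this.transformers) data = step.fitTransform(data);
    this.finalStep[1].fit(data, y);
    return this;
  }
  predict(X: Matrix): number[] {
    let data = X;
    for (const [, step] of this.transformers) data = step.transform(data);
    return this.finalStep[1].predict(data);
  }
}

const pipeline = new SketchPipeline(
  [["centerer", new Centerer()]],
  ["model", new SlopeModel()],
);
pipeline.fit([[1], [2], [3]], [2, 4, 6]);
const predictions = pipeline.predict([[4]]);
```

The sketch separates the intermediate transformers from the final predictor only to keep the types simple; a single list of named steps, as described above, expresses the same idea.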
Step Naming Convention
Each step is a tuple of [name, transformer].
Step names must be unique and non-empty strings. They’re used to access steps and set parameters.
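The naming rules can be sketched as a small check. `validateStepNames` and `namesAreValid` are hypothetical helpers written for this illustration, not functions exported by bun-scikit, and the error messages are invented.

```typescript
// Hypothetical validator for illustration; not part of bun-scikit.
type Step = [name: string, estimator: unknown];

function validateStepNames(steps: Step[]): void {
  const seen = new Set<string>();
  for (const [name] of steps) {
    if (typeof name !== "string" || name.length === 0) {
      throw new Error("Step names must be non-empty strings");
    }
    if (seen.has(name)) {
      throw new Error(`Duplicate step name: ${name}`);
    }
    seen.add(name);
  }
}

// Returns true when validation passes, false when it throws.
function namesAreValid(steps: Step[]): boolean {
  try {
    validateStepNames(steps);
    return true;
  } catch {
    return false;
  }
}

const ok = namesAreValid([["scaler", {}], ["model", {}]]);
const duplicate = namesAreValid([["scaler", {}], ["scaler", {}]]);
const empty = namesAreValid([["", {}]]);
```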
Pipeline Workflow
During Fitting
The pipeline applies fitTransform() to each intermediate step, passing the transformed data to the next step.
During Prediction
The pipeline applies transform() to each intermediate step.
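The two flows can be traced with stub steps that record each call. This is a sketch under stated assumptions: the stubs and the `fitPipeline` and `predictPipeline` functions are hypothetical, and the sketch assumes the final step receives `fit()` during fitting and `predict()` during prediction.

```typescript
// Hypothetical stubs that record which method a pipeline would call on them.
const calls: string[] = [];

function makeTransformer(label: string) {
  return {
    fitTransform(X: number[][]): number[][] {
      calls.push(`${label}.fitTransform`);
      return X;
    },
    transform(X: number[][]): number[][] {
      calls.push(`${label}.transform`);
      return X;
    },
  };
}

const model = {
  fit(_X: number[][], _y: number[]): void {
    calls.push("model.fit");
  },
  predict(X: number[][]): number[] {
    calls.push("model.predict");
    return X.map(() => 0);
  },
};

const intermediate = [makeTransformer("imputer"), makeTransformer("scaler")];

// Fitting: fitTransform() on each intermediate step, then fit() on the final step.
function fitPipeline(X: number[][], y: number[]): void {
  let data = X;
  for (const step of intermediate) data = step.fitTransform(data);
  model.fit(data, y);
}

// Prediction: transform() on each intermediate step, then predict() on the final step.
function predictPipeline(X: number[][]): number[] {
  let data = X;
  for (const step of intermediate) data = step.transform(data);
  return model.predict(data);
}

fitPipeline([[1], [2]], [0, 1]);
predictPipeline([[3]]);
```

After both calls, `calls` lists the three fitting calls followed by the three prediction calls, and no intermediate step is refitted during prediction.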
Accessing Pipeline Steps
You can access individual steps in several ways:
Classification Pipelines
Pipelines work seamlessly with classification models.
Complex Pipelines
Multiple Preprocessing Steps
Pipeline with Feature Selection
Setting Pipeline Parameters
During Initialization
After Creation with setParams
Use double-underscore notation (stepName__paramName) to set nested parameters.
Getting Pipeline Parameters
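As a rough sketch of how double-underscore keys relate to nested parameters, the example below parses a key when setting a value and produces the same flat keys when reading values back. The `stepParams` store, the `setParam` and `getParams` functions, and the parameter names `withMean` and `alpha` are all hypothetical; they are not the bun-scikit implementation.

```typescript
// Hypothetical parameter store for illustration; not the bun-scikit implementation.
type Params = Record<string, unknown>;

const stepParams: Record<string, Params> = {
  scaler: { withMean: true },
  model: { alpha: 1.0 },
};

// Set a nested parameter addressed as "<step>__<param>".
function setParam(key: string, value: unknown): void {
  const separator = key.indexOf("__");
  if (separator <= 0) throw new Error(`Expected "<step>__<param>", got "${key}"`);
  const step = key.slice(0, separator);
  const param = key.slice(separator + 2);
  const target = stepParams[step];
  if (target === undefined) throw new Error(`Unknown step: ${step}`);
  target[param] = value;
}

// Read every nested parameter back as a flat "<step>__<param>" map.
function getParams(): Params {
  const flat: Params = {};
  for (const [step, params] of Object.entries(stepParams)) {
    for (const [param, value] of Object.entries(params)) {
      flat[`${step}__${param}`] = value;
    }
  }
  return flat;
}

setParam("model__alpha", 0.1);
const flatParams = getParams();
```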
Pipeline with Sample Weights
Pass sample weights through the entire pipeline. Sample weights are automatically routed to all steps that support them.
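One way such routing could work is sketched below. The `supportsSampleWeight` flag, the `WeightedStep` interface, and the `fitAll` function are assumptions made for this illustration; how bun-scikit detects support is not shown here. For brevity the sketch fits every step on the same values instead of chaining transformed data.

```typescript
// Hypothetical step shapes for illustration; not the bun-scikit API.
interface WeightedStep {
  supportsSampleWeight: boolean;
  fit(values: number[], sampleWeight?: number[]): void;
}

// Supports weights: learns a weighted mean.
class WeightedMean implements WeightedStep {
  supportsSampleWeight = true;
  mean = 0;
  fit(values: number[], sampleWeight?: number[]): void {
    const w = sampleWeight ?? values.map(() => 1);
    const total = w.reduce((a, b) => a + b, 0);
    this.mean = values.reduce((s, v, i) => s + v * w[i], 0) / total;
  }
}

// Does not support weights: records whether it received any.
class PlainStep implements WeightedStep {
  supportsSampleWeight = false;
  receivedWeights = false;
  fit(_values: number[], sampleWeight?: number[]): void {
    this.receivedWeights = sampleWeight !== undefined;
  }
}

// Route weights only to the steps that declare support for them.
function fitAll(steps: WeightedStep[], values: number[], sampleWeight: number[]): void {
  for (const step of steps) {
    step.fit(values, step.supportsSampleWeight ? sampleWeight : undefined);
  }
}

const weighted = new WeightedMean();
const plain = new PlainStep();
fitAll([weighted, plain], [1, 2, 3], [0, 0, 1]);
```

With all of the weight on the last sample, the weighted step learns a mean of 3, while the plain step never receives the weights.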
Pipeline Methods
Pipelines support all standard model methods:
- fit()
- predict()
- predictProba()
- score()
- transform()
- fitTransform()
fit() trains the entire pipeline: it fits each step sequentially, transforming the data between steps.
Hyperparameter Tuning with Pipelines
Pipelines integrate seamlessly with GridSearchCV.
Complete Example
Here’s a real-world pipeline for a classification task:
Best Practices
Order matters
Put steps in the right order:
1. Handle missing values (SimpleImputer)
2. Generate features (PolynomialFeatures)
3. Scale/normalize (StandardScaler)
4. Select features (SelectKBest)
5. Train model
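A small numeric sketch shows why the order of feature generation and scaling matters. The `standardize` and `square` helpers are standalone illustrations, not library code; squaring stands in for a generated polynomial feature.

```typescript
// Standalone helpers for illustration; not library code.
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Scale to zero mean and unit (population) standard deviation.
function standardize(xs: number[]): number[] {
  const m = mean(xs);
  const std = Math.sqrt(mean(xs.map((x) => (x - m) ** 2))) || 1;
  return xs.map((x) => (x - m) / std);
}

const square = (xs: number[]): number[] => xs.map((x) => x * x);

const raw = [1, 2, 3];

// Recommended order: generate the feature, then scale it.
const generatedThenScaled = standardize(square(raw));

// Reversed order: the squared feature is no longer centered.
const scaledThenGenerated = square(standardize(raw));

const meanRecommended = mean(generatedThenScaled); // approximately 0
const meanReversed = mean(scaledThenGenerated); // approximately 1
```

Scaling last guarantees that the model sees centered, unit-variance features; scaling first leaves the generated feature on whatever scale the generation step produces.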
Advanced Topics
ColumnTransformer
Apply different transformations to different columns.
FeatureUnion
Combine multiple transformers in parallel.
Related Topics
- Data Preprocessing - Learn about transformers
- Model Training - Understand fit/predict
- Model Evaluation - Evaluate pipelines
- Model Selection - Grid search with pipelines