Overview
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow. The pipeline concept is inspired by scikit-learn.

Main Concepts
DataFrame
Machine learning can be applied to a wide variety of data types. The Pipeline API uses DataFrames from Spark SQL to support various data types, including:
- Basic types (Double, String, Array, etc.)
- Structured types from Spark SQL
- ML Vector types for feature vectors
Pipeline Components
Transformer
A Transformer is an algorithm that transforms one DataFrame into another DataFrame. It implements a transform() method.
Examples:
- A feature transformer reads a column (e.g., text), maps it to a new column (e.g., feature vectors), and outputs a new DataFrame
- A learning model reads features, predicts labels, and outputs a DataFrame with predictions
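The Transformer contract can be sketched in plain Python. The toy class below (hypothetical, not the pyspark API) uses a list of dicts to stand in for a DataFrame:

```python
# Conceptual sketch of the Transformer contract: transform(df) -> new df.
# Lists of dicts stand in for Spark DataFrames; this is not the pyspark API.

class ToyTokenizer:
    """A feature transformer: reads a text column, writes a words column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Return a new dataset with one extra column; the input is untouched.
        return [{**row, self.output_col: row[self.input_col].split()}
                for row in rows]

rows = [{"id": 0, "text": "hello spark"}]
tokenized = ToyTokenizer("text", "words").transform(rows)
print(tokenized[0]["words"])  # ['hello', 'spark']
```

Note that transform() returns a new dataset rather than mutating its input, mirroring the immutability of Spark DataFrames.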
Estimator
An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. It implements a fit() method.
Example:
- LogisticRegression is an Estimator that trains on data
- Calling fit() produces a LogisticRegressionModel (a Transformer)
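The fit-produces-a-Transformer relationship can be sketched the same way (toy classes, not the pyspark API; the "model" here just learns a mean as a threshold):

```python
# Conceptual sketch of the Estimator contract: fit(df) -> Transformer.

class ToyThresholdModel:
    """The Transformer produced by fitting: predicts 1.0 above the learned mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**r, "prediction": 1.0 if r["feature"] > self.mean else 0.0}
                for r in rows]

class ToyThresholdEstimator:
    """An Estimator: all learning happens in fit(), which returns a Transformer."""

    def fit(self, rows):
        mean = sum(r["feature"] for r in rows) / len(rows)
        return ToyThresholdModel(mean)

model = ToyThresholdEstimator().fit([{"feature": 1.0}, {"feature": 3.0}])
preds = model.transform([{"feature": 4.0}])
print(preds[0]["prediction"])  # 1.0
```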
Pipeline
A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. A Pipeline is itself an Estimator that produces a PipelineModel (a Transformer) when fit.

Parameter
All Transformers and Estimators share a common API for specifying parameters.

How Pipelines Work
Training Time
When you call pipeline.fit() on training data:
- Each Transformer’s transform() method is called on the DataFrame
- Each Estimator’s fit() method is called, producing a Transformer
- The Transformers produced by Estimators are used in the resulting PipelineModel

Prediction Time
When you call model.transform() on test data:
- Data flows through the fitted pipeline in order
- Each stage’s transform() method is called
- Final predictions are produced

Pipelines ensure that training and test data go through identical feature processing steps, preventing training/serving skew.
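The training-time and prediction-time flows above can be sketched in a few lines of plain Python (toy classes, with plain lists standing in for DataFrames; not the pyspark implementation):

```python
# Conceptual sketch of Pipeline.fit() and PipelineModel.transform().

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, df):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it, keep the model
                stage = stage.fit(df)
            fitted.append(stage)           # every kept stage is a Transformer
            df = stage.transform(df)       # pass transformed data to the next stage
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, df):               # prediction time: transform only
        for stage in self.stages:
            df = stage.transform(df)
        return df

class AddOne:                              # a Transformer stage
    def transform(self, xs):
        return [x + 1 for x in xs]

class MeanCenter:                          # an Estimator stage
    def fit(self, xs):
        mean = sum(xs) / len(xs)
        class Model:
            def transform(self, ys):
                return [y - mean for y in ys]
        return Model()

model = ToyPipeline([AddOne(), MeanCenter()]).fit([1, 2, 3])
print(model.transform([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```

Because the PipelineModel replays the same stages, training and test data go through identical transformations.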
Complete Pipeline Example
Here’s a complete example that demonstrates building and using a Pipeline:
Pipeline Parameters
You can specify parameters for individual stages in a Pipeline.

DAG Pipelines
While most examples show linear Pipelines, you can create non-linear Pipelines as long as the data flow forms a Directed Acyclic Graph (DAG). The graph is specified implicitly by the input and output column names of each stage. If your Pipeline forms a DAG, the stages must be specified in topological order.
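The implicit column-name wiring can be illustrated with a toy sketch (plain Python with a hypothetical `AddCol` helper, not the pyspark API): two stages each derive a column from `text`, and a third consumes both of their outputs.

```python
# Toy sketch: the "graph" exists only through the columns each stage reads/writes.

class AddCol:
    """A transformer that computes output_col = fn(*input_cols)."""

    def __init__(self, output_col, fn, *input_cols):
        self.output_col, self.fn, self.input_cols = output_col, fn, input_cols

    def transform(self, rows):
        return [{**r, self.output_col: self.fn(*(r[c] for c in self.input_cols))}
                for r in rows]

stages = [
    AddCol("length", len, "text"),          # text -> length
    AddCol("words", str.split, "text"),     # text -> words
    AddCol("chars_per_word",                # length + words -> chars_per_word
           lambda n, ws: n / len(ws), "length", "words"),
]

rows = [{"text": "a bb ccc"}]
for stage in stages:                        # must run in topological order
    rows = stage.transform(rows)
print(rows[0]["chars_per_word"])            # 8 characters / 3 words
```

Running the last stage before the first two would fail, because the columns it reads would not exist yet; that is what "topological order" means here.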
Runtime Checking
Since Pipelines operate on DataFrames with varied types, they use runtime checking instead of compile-time type checking. Type checking is done using the DataFrame schema before running the Pipeline.

Pipeline Persistence
You can save and load Pipelines and PipelineModels.

Backwards Compatibility
MLlib maintains backwards compatibility for ML persistence. Can a model or Pipeline saved in version X be loaded in version Y?
- Minor/patch versions: yes, these maintain full backwards compatibility
- Major versions: best-effort compatibility, but no guarantees

Does the loaded model behave identically?
- Minor/patch versions: identical behavior, except for bug fixes
- Major versions: best-effort, but no guarantees
Best Practices
Use unique Pipeline stages
Each Pipeline stage should be a unique instance. Don’t insert the same instance (e.g., myHashingTF) twice, as stages must have unique IDs.

Consistent feature processing
Use Pipelines to ensure training and test data undergo identical feature transformations, preventing subtle bugs.
Organize complex workflows
Break complex ML workflows into clear stages. This makes your code more maintainable and easier to debug.
Save entire pipelines
Save the entire PipelineModel rather than individual stages. This ensures you can reproduce predictions exactly.
Next Steps
- Model Tuning: learn how to tune hyperparameters with cross-validation
- Classification & Regression: explore ML algorithms for supervised learning
- Feature Engineering: transform and prepare data with feature transformers
- Estimator & Transformer: deep dive into the core abstractions
