What is MLlib?
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline. At a high level, it provides:- ML Algorithms: Common learning algorithms like classification, regression, clustering, and collaborative filtering
- Featurization: Feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: Tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: Saving and loading algorithms, models, and Pipelines
- Utilities: Linear algebra, statistics, and data handling
DataFrame-based API
The primary Machine Learning API for Spark uses DataFrames from Spark SQL. The DataFrame-based API in thespark.ml package is now the main API, while the RDD-based API in spark.mllib is in maintenance mode.
As of Spark 2.0, the RDD-based API has entered maintenance mode. The DataFrame-based API provides better performance and a more user-friendly interface.
Why DataFrames?
DataFrames provide several advantages over RDDs:- User-friendly API: More intuitive than RDDs with support for SQL/DataFrame queries
- Performance: Tungsten and Catalyst optimizations improve execution speed
- Uniform APIs: Consistent interface across languages (Python, Scala, Java, R)
- ML Pipelines: Facilitate feature transformations and model workflows
- Data sources: Easy integration with various data formats and sources
Key Components
MLlib provides several core abstractions:Transformer
A Transformer converts one DataFrame into another, typically by appending columns. For example:- A feature transformer reads a column and maps it into a new column
- A learned model transforms features into predictions
Estimator
An Estimator fits on a DataFrame to produce a Transformer. For example:- A learning algorithm like LogisticRegression is an Estimator
- Calling
fit()trains the model and produces a Transformer
Pipeline
A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.Getting Started Example
Here’s a simple example using logistic regression:Dependencies
MLlib uses linear algebra packages Breeze and dev.ludovic.netlib for optimized numerical processing. These packages may call native acceleration libraries like Intel MKL or OpenBLAS if available.Python Requirements
To use MLlib in Python, you need:- NumPy version 1.4 or newer
- PySpark installed and configured
Common ML Tasks
MLlib supports various machine learning tasks:Classification & Regression
Logistic regression, decision trees, random forests, gradient-boosted trees, and more
Clustering
K-means, LDA, bisecting k-means, Gaussian mixture models, and power iteration clustering
Collaborative Filtering
Alternating Least Squares (ALS) for recommendation systems
Feature Engineering
Feature extractors, transformers, and selectors for data preprocessing
ML Pipelines
Pipelines are a key concept in MLlib that help you chain together multiple transformations and models:Model Persistence
You can save and load models for later use:ML persistence maintains backwards compatibility across minor and patch versions of Spark, but not necessarily across major versions.
Next Steps
ML Pipelines
Learn how to build and use ML Pipelines
Classification & Regression
Explore supervised learning algorithms
Feature Engineering
Transform and prepare your data for ML
Clustering
Discover unsupervised learning techniques
