MLlib is Spark’s machine learning library designed to make practical machine learning scalable and easy. It provides tools for common ML tasks including classification, regression, clustering, and collaborative filtering.

What is MLlib?

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline. At a high level, it provides:
  • ML Algorithms: Common learning algorithms like classification, regression, clustering, and collaborative filtering
  • Featurization: Feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: Tools for constructing, evaluating, and tuning ML Pipelines
  • Persistence: Saving and loading algorithms, models, and Pipelines
  • Utilities: Linear algebra, statistics, and data handling

DataFrame-based API

The primary machine learning API for Spark is the DataFrame-based API in the spark.ml package, built on DataFrames from Spark SQL. As of Spark 2.0, the RDD-based API in spark.mllib has entered maintenance mode: it continues to receive bug fixes but no new features. The DataFrame-based API provides better performance and a more user-friendly interface.

Why DataFrames?

DataFrames provide several advantages over RDDs:
  • User-friendly API: More intuitive than RDDs with support for SQL/DataFrame queries
  • Performance: Tungsten and Catalyst optimizations improve execution speed
  • Uniform APIs: Consistent interface across languages (Python, Scala, Java, R)
  • ML Pipelines: Facilitate feature transformations and model workflows
  • Data sources: Easy integration with various data formats and sources

Key Components

MLlib provides several core abstractions:

Transformer

A Transformer converts one DataFrame into another, typically by appending columns. For example:
  • A feature transformer reads a column and maps it into a new column
  • A learned model transforms features into predictions

Estimator

An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example:
  • A learning algorithm like LogisticRegression is an Estimator
  • Calling fit() trains the model and produces a Transformer

Pipeline

A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

Getting Started Example

Here’s a simple example using logistic regression:
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Create logistic regression model
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print model parameters
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Dependencies

MLlib uses linear algebra packages Breeze and dev.ludovic.netlib for optimized numerical processing. These packages may call native acceleration libraries like Intel MKL or OpenBLAS if available.
If native acceleration libraries are not enabled, you’ll see a warning message and a pure JVM implementation will be used instead, which may be slower.

Python Requirements

To use MLlib in Python, you need:
  • NumPy version 1.4 or newer
  • PySpark installed and configured
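A quick way to confirm the NumPy requirement is met before running MLlib code (a simple version check, nothing MLlib-specific):

```python
import numpy as np

# Compare the installed NumPy against the documented minimum (1.4)
major, minor = (int(p) for p in np.__version__.split(".")[:2])
assert (major, minor) >= (1, 4), "MLlib requires NumPy 1.4 or newer"
print("NumPy", np.__version__, "OK")
```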

Common ML Tasks

MLlib supports various machine learning tasks:

Classification & Regression

Logistic regression, decision trees, random forests, gradient-boosted trees, and more

Clustering

K-means, LDA, bisecting k-means, Gaussian mixture models, and power iteration clustering

Collaborative Filtering

Alternating Least Squares (ALS) for recommendation systems

Feature Engineering

Feature extractors, transformers, and selectors for data preprocessing

ML Pipelines

Pipelines are a key concept in MLlib that help you chain together multiple transformations and models:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Configure pipeline stages
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Create pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit pipeline to training data
model = pipeline.fit(training)

# Make predictions
predictions = model.transform(test)

Model Persistence

You can save and load fitted models for later use. For example, the logistic regression model trained in the Getting Started example:
# Save model
lrModel.save("path/to/model")

# Load model
from pyspark.ml.classification import LogisticRegressionModel
loadedModel = LogisticRegressionModel.load("path/to/model")
ML persistence maintains backwards compatibility across minor and patch versions of Spark, but not necessarily across major versions.

Next Steps

ML Pipelines

Learn how to build and use ML Pipelines

Classification & Regression

Explore supervised learning algorithms

Feature Engineering

Transform and prepare your data for ML

Clustering

Discover unsupervised learning techniques
