There are several ways to monitor Spark applications: web UIs, metrics, and external instrumentation.

Spark History Server

It is possible to construct the UI of an application through Spark’s history server, provided that the application’s event logs exist.

Starting the History Server

You can start the history server by executing:
./sbin/start-history-server.sh
By default this serves a web interface at http://<server-url>:18080, listing completed and incomplete applications and their attempts.

Configuring Event Logging

To view the web UI after an application has finished, you need to configure event logging:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
When using the file-system provider class, the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option and should contain sub-directories, each of which represents an application's event logs.

Environment Variables

You can configure the history server using these environment variables:
  • SPARK_DAEMON_MEMORY - Memory to allocate to the history server (default: 1g)
  • SPARK_DAEMON_JAVA_OPTS - JVM options for the history server (default: none)
  • SPARK_DAEMON_CLASSPATH - Classpath for the history server (default: none)
  • SPARK_PUBLIC_DNS - The public address for the history server
  • SPARK_HISTORY_OPTS - spark.history.* configuration options for the history server

Configuration Options

General options:
  • spark.history.provider (default: org.apache.spark.deploy.history.FsHistoryProvider) - Name of the class implementing the application history backend
  • spark.history.fs.logDirectory (default: file:/tmp/spark-events) - Directory containing application event logs to load
  • spark.history.fs.update.interval (default: 10s) - Period at which the filesystem history provider checks for new or updated logs
  • spark.history.ui.port (default: 18080) - Port to which the web interface binds
  • spark.history.retainedApplications (default: 50) - Number of applications to retain UI data for in the cache

Kerberos options:
  • spark.history.kerberos.enabled (default: false) - Whether the History Server should use Kerberos to log in
  • spark.history.kerberos.principal (default: none) - Kerberos principal name for the History Server
  • spark.history.kerberos.keytab (default: none) - Location of the Kerberos keytab file

Cleaner options:
  • spark.history.fs.cleaner.enabled (default: false) - Whether the History Server should periodically clean up event logs
  • spark.history.fs.cleaner.interval (default: 1d) - How often the filesystem job history cleaner checks for files to delete
  • spark.history.fs.cleaner.maxAge (default: 7d) - Job history files older than this will be deleted
  • spark.history.fs.cleaner.maxNum (default: Int.MaxValue) - Maximum number of files in the event log directory

Local store options:
  • spark.history.store.maxDiskUsage (default: 10g) - Maximum disk usage for the local directory where cached application history information is stored
  • spark.history.store.path (default: none) - Local directory in which to cache application history data
  • spark.history.store.serializer (default: JSON) - Serializer for writing/reading in-memory UI objects (JSON or PROTOBUF)

Event Log Compaction

A long-running application can generate a single huge event log file, which can be expensive to store and replay. Setting spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize makes Spark write rolling event log files instead.
The History Server can then compact rolling event log files to reduce the overall size of the logs; the number of non-compacted files to retain is controlled by spark.history.fs.eventLog.rolling.maxFilesToRetain. Note that compaction is a LOSSY operation: discarded events are no longer visible in the UI. Compaction tries to exclude events that point to outdated data:
  • Events for jobs that have finished, and related stage/tasks events
  • Events for executors that have terminated
  • Events for SQL executions that have finished, and related job/stage/tasks events
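As a sketch, a spark-defaults.conf fragment enabling rolling logs and compaction might look like the following; the file-size and retain-count values are illustrative, not recommendations:

```
spark.eventLog.enabled true
spark.eventLog.rolling.enabled true
spark.eventLog.rolling.maxFileSize 128m
spark.history.fs.eventLog.rolling.maxFilesToRetain 2
```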

REST API

In addition to being viewable in the UI, the metrics are also available as JSON. This gives you an easy way to create new visualizations and monitoring tools for Spark.

API Access

The endpoints are mounted at /api/v1. For example:
  • History server: http://<server-url>:18080/api/v1
  • Running application: http://localhost:4040/api/v1
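As a minimal sketch, the endpoint URLs can be assembled from a base URL and path segments; the localhost addresses below assume a local history server and driver on the default ports:

```python
from urllib.parse import urljoin

def api_url(base, *parts):
    """Join a server base URL with the /api/v1 root and path segments."""
    path = "/".join(["api/v1", *parts])
    return urljoin(base.rstrip("/") + "/", path)

# History server endpoint for the application list:
print(api_url("http://localhost:18080", "applications"))
# Jobs of one application on a running driver UI:
print(api_url("http://localhost:4040", "applications", "app-1", "jobs"))
```

A helper like this can then be handed to any HTTP client (e.g. urllib.request.urlopen) to fetch the JSON payloads.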

API Endpoints

Applications

  • /applications - List all applications
  • /applications/[app-id]/jobs - All jobs for an application
  • /applications/[app-id]/stages - All stages for an application

Executors

  • /applications/[app-id]/executors - Active executors
  • /applications/[app-id]/allexecutors - All executors
  • /applications/[app-id]/executors/[executor-id]/threads - Thread dumps

Storage

  • /applications/[app-id]/storage/rdd - Stored RDDs
  • /applications/[app-id]/storage/rdd/[rdd-id] - RDD storage status

SQL

  • /applications/[app-id]/sql - All queries
  • /applications/[app-id]/sql/[execution-id] - Query details
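As a quick sketch of consuming the /applications listing: the JSON below is a hand-written illustration of the response shape (a list of applications, each with an id, a name, and one or more attempts), not output captured from a real server:

```python
import json

# Hand-written sample shaped like a /applications response (illustrative only).
sample = '''
[
  {"id": "app-20240101120000-0000",
   "name": "WordCount",
   "attempts": [{"completed": true, "duration": 120000}]}
]
'''

apps = json.loads(sample)
# Keep only applications whose every attempt has completed.
completed = [a["id"] for a in apps if all(at["completed"] for at in a["attempts"])]
print(completed)  # ['app-20240101120000-0000']
```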

Executor Task Metrics

The REST API exposes Task Metrics collected by Spark executors at the granularity of task execution:
  • executorRunTime - Elapsed time the executor spent running this task (milliseconds)
  • executorCpuTime - CPU time the executor spent running this task (nanoseconds)
  • executorDeserializeTime - Elapsed time spent to deserialize this task (milliseconds)
  • jvmGCTime - Elapsed time the JVM spent in garbage collection (milliseconds)
  • resultSerializationTime - Elapsed time spent serializing the task result (milliseconds)
  • memoryBytesSpilled - Number of in-memory bytes spilled by this task
  • diskBytesSpilled - Number of on-disk bytes spilled by this task
  • peakExecutionMemory - Peak memory used by internal data structures during shuffles, aggregations and joins
  • inputMetrics.bytesRead - Total number of bytes read
  • inputMetrics.recordsRead - Total number of records read
  • outputMetrics.bytesWritten - Total number of bytes written
  • shuffleReadMetrics.remoteBytesRead - Number of remote bytes read in shuffle operations
  • shuffleReadMetrics.fetchWaitTime - Time spent waiting for remote shuffle blocks (milliseconds)
  • shuffleWriteMetrics.bytesWritten - Number of bytes written in shuffle operations
  • shuffleWriteMetrics.writeTime - Time spent blocking on writes to disk or buffer cache (nanoseconds)
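For example, these per-task counters can be combined into simple health ratios, such as the fraction of wall-clock time spent on CPU versus garbage collection. The numeric values below are made up for illustration; note the unit conversion, since executorCpuTime is reported in nanoseconds while the elapsed times are in milliseconds:

```python
# Illustrative task-metrics record (values invented; units per the list above).
task = {
    "executorRunTime": 2000,           # milliseconds
    "executorCpuTime": 1_500_000_000,  # nanoseconds
    "jvmGCTime": 200,                  # milliseconds
}

cpu_ms = task["executorCpuTime"] / 1_000_000     # nanoseconds -> milliseconds
cpu_fraction = cpu_ms / task["executorRunTime"]  # share of run time on CPU
gc_fraction = task["jvmGCTime"] / task["executorRunTime"]  # share spent in GC
print(cpu_fraction, gc_fraction)  # 0.75 0.1
```

A low cpu_fraction with a high gc_fraction is a common sign of memory pressure worth investigating.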

Executor Metrics

Executor-level metrics are sent from each executor to the driver as part of the Heartbeat:
  • memoryUsed - Storage memory used by this executor
  • diskUsed - Disk space used for RDD storage
  • maxMemory - Total amount of memory available for storage
  • usedOnHeapStorageMemory - Used on heap memory for storage
  • usedOffHeapStorageMemory - Used off heap memory for storage
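A small sketch of working with these heartbeat values, e.g. computing the storage memory still available on an executor; the figures are invented for illustration:

```python
# Illustrative executor heartbeat values (bytes; numbers are made up).
executor = {
    "memoryUsed": 512 * 1024**2,  # storage memory in use: 512 MiB
    "maxMemory": 2 * 1024**3,     # total storage memory: 2 GiB
    "diskUsed": 0,
}

free = executor["maxMemory"] - executor["memoryUsed"]
print(free // 1024**2)  # 1536  (MiB of storage memory still available)
```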

Metrics System

Spark has a configurable metrics system based on the Dropwizard Metrics Library. This allows you to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV files.

Configuration

The metrics system is configured via a configuration file at $SPARK_HOME/conf/metrics.properties. A custom file location can be specified via the spark.metrics.conf configuration property.
Instead of using a configuration file, you can set the same options as Spark configuration properties with the spark.metrics.conf. prefix.
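For example, a minimal metrics.properties enabling a console sink for every instance might look like this (the 10-second polling period is illustrative):

```
# Report metrics from all instances to the console every 10 seconds
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```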

Namespace Configuration

By default, the root namespace used for driver or executor metrics is the value of spark.app.id, which changes with every invocation. To track metrics across runs of the same application, set spark.metrics.namespace:
spark.metrics.namespace=${spark.app.name}

Metrics Instances

Spark’s metrics are decoupled into different instances:
  • master - The Spark standalone master process
  • applications - A component within the master which reports on various applications
  • worker - A Spark standalone worker process
  • executor - A Spark executor
  • driver - The Spark driver process
  • shuffleService - The Spark shuffle service
  • applicationMaster - The Spark ApplicationMaster when running on YARN

Available Sinks

Each instance can report to zero or more sinks:

ConsoleSink

Logs metrics information to the console

CSVSink

Exports metrics data to CSV files at regular intervals

JmxSink

Registers metrics for viewing in a JMX console

MetricsServlet

Adds a servlet within the existing Spark UI to serve metrics data as JSON

PrometheusServlet

Adds a servlet to serve metrics data in Prometheus format (Experimental)

GraphiteSink

Sends metrics to a Graphite node

Slf4jSink

Sends metrics to slf4j as log entries

StatsdSink

Sends metrics to a StatsD node

Prometheus Endpoints

The Prometheus Servlet exposes the same metrics as the JSON endpoints, in Prometheus' time-series text format:
  • Master (port 8080): JSON endpoint /metrics/master/json/, Prometheus endpoint /metrics/master/prometheus/
  • Master (port 8080): JSON endpoint /metrics/applications/json/, Prometheus endpoint /metrics/applications/prometheus/
  • Worker (port 8081): JSON endpoint /metrics/json/, Prometheus endpoint /metrics/prometheus/
  • Driver (port 4040): JSON endpoint /metrics/json/, Prometheus endpoint /metrics/prometheus/
  • Driver (port 4040): JSON endpoint /api/v1/applications/{id}/executors/, Prometheus endpoint /metrics/executors/prometheus/
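A metrics.properties fragment wiring up the PrometheusServlet on these paths might look like the sketch below; the paths follow the endpoint list above, and to my understanding the driver's executor endpoint additionally requires spark.ui.prometheus.enabled=true:

```
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
```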

Example Configuration

Using Spark configuration parameters for a Graphite sink:
spark.metrics.conf.*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
spark.metrics.conf.*.sink.graphite.host=<graphiteEndPoint_hostName>
spark.metrics.conf.*.sink.graphite.port=<graphite_listening_port>
spark.metrics.conf.*.sink.graphite.period=10
spark.metrics.conf.*.sink.graphite.unit=seconds
spark.metrics.conf.*.sink.graphite.prefix=optional_prefix
Default metrics configuration values:
*.sink.servlet.class = org.apache.spark.metrics.sink.MetricsServlet
*.sink.servlet.path = /metrics/json
master.sink.servlet.path = /metrics/master/json
applications.sink.servlet.path = /metrics/applications/json

Advanced Instrumentation

Custom Metrics

You can define custom metrics sources to instrument your Spark applications with application-specific metrics.

External Monitoring

Spark’s metrics can be integrated with external monitoring systems through:
  • Prometheus: Use the PrometheusServlet for scraping metrics
  • Graphite: Configure GraphiteSink to push metrics
  • JMX: Use JmxSink for JMX-based monitoring tools
  • StatsD: Configure StatsdSink for StatsD integration
For production environments, configure both event logging and metrics collection to ensure comprehensive monitoring and troubleshooting capabilities.
