There are several ways to monitor Spark applications: web UIs, metrics, and external instrumentation.

Spark History Server

It is possible to construct the UI of an application through Spark’s history server, provided that the application’s event logs exist.

Starting the History Server

You can start the history server by executing:
./sbin/start-history-server.sh
By default this serves a web interface at http://<server-url>:18080, listing completed and incomplete applications and their attempts.

Configuring Event Logging

To view the web UI after an application has finished, you need to configure event logging:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
When using the file-system provider class, the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option and should contain sub-directories, each of which represents an application's event logs.

Environment Variables

You can configure the history server using these environment variables:
  • SPARK_DAEMON_MEMORY - Memory to allocate to the history server (default: 1g)
  • SPARK_DAEMON_JAVA_OPTS - JVM options for the history server (default: none)
  • SPARK_DAEMON_CLASSPATH - Classpath for the history server (default: none)
  • SPARK_PUBLIC_DNS - The public address for the history server
  • SPARK_HISTORY_OPTS - spark.history.* configuration options for the history server

Configuration Options

General options:
  • spark.history.provider (default: org.apache.spark.deploy.history.FsHistoryProvider) - Name of the class implementing the application history backend
  • spark.history.fs.logDirectory (default: file:/tmp/spark-events) - Directory containing application event logs to load
  • spark.history.fs.update.interval (default: 10s) - Period at which the filesystem history provider checks for new or updated logs
  • spark.history.ui.port (default: 18080) - Port to which the web interface binds
  • spark.history.retainedApplications (default: 50) - Number of applications to retain UI data for in the cache

Kerberos options:
  • spark.history.kerberos.enabled (default: false) - Whether the History Server should use Kerberos to log in
  • spark.history.kerberos.principal (default: none) - Kerberos principal name for the History Server
  • spark.history.kerberos.keytab (default: none) - Location of the Kerberos keytab file

Cleaner options:
  • spark.history.fs.cleaner.enabled (default: false) - Whether the History Server should periodically clean up event logs
  • spark.history.fs.cleaner.interval (default: 1d) - How often the filesystem job history cleaner checks for files to delete
  • spark.history.fs.cleaner.maxAge (default: 7d) - Job history files older than this will be deleted
  • spark.history.fs.cleaner.maxNum (default: Int.MaxValue) - Maximum number of files in the event log directory

Local store options:
  • spark.history.store.maxDiskUsage (default: 10g) - Maximum disk usage for the local directory where cached application history information is stored
  • spark.history.store.path (default: none) - Local directory in which to cache application history data
  • spark.history.store.serializer (default: JSON) - Serializer for writing/reading in-memory UI objects (JSON or PROTOBUF)

Event Log Compaction

A long-running application can generate a single huge event log file, which can be expensive to store and replay. Setting spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize makes Spark write rolling event log files instead.
The History Server can then compact rolling event log files to reduce the overall size of the logs; the number of non-compacted files to retain is controlled by spark.history.fs.eventLog.rolling.maxFilesToRetain. Note that compaction is a LOSSY operation: discarded events are no longer visible in the UI. Compaction tries to exclude events that point to outdated data:
  • Events for jobs that have finished, and related stage/tasks events
  • Events for executors that have terminated
  • Events for SQL executions that have finished, and related job/stage/tasks events
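As a sketch, a spark-defaults.conf fragment enabling rolling logs and compaction might look like the following; the file-size and retain-count values are illustrative, not recommendations:

```
spark.eventLog.enabled true
spark.eventLog.rolling.enabled true
spark.eventLog.rolling.maxFileSize 128m
spark.history.fs.eventLog.rolling.maxFilesToRetain 2
```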

REST API

In addition to being viewable in the UI, the metrics are also available as JSON. This gives you an easy way to create new visualizations and monitoring tools for Spark.

API Access

The endpoints are mounted at /api/v1. For example:
  • History server: http://<server-url>:18080/api/v1
  • Running application: http://localhost:4040/api/v1
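As a minimal sketch, the endpoint URLs can be assembled from a base URL and path segments; the localhost addresses below assume a local history server and driver on the default ports:

```python
from urllib.parse import urljoin

def api_url(base, *parts):
    """Join a server base URL with the /api/v1 root and path segments."""
    path = "/".join(["api/v1", *parts])
    return urljoin(base.rstrip("/") + "/", path)

# History server endpoint for the application list:
print(api_url("http://localhost:18080", "applications"))
# Jobs of one application on a running driver UI:
print(api_url("http://localhost:4040", "applications", "app-1", "jobs"))
```

A helper like this can then be handed to any HTTP client (e.g. urllib.request.urlopen) to fetch the JSON payloads.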

API Endpoints

Applications

  • /applications - List all applications
  • /applications/[app-id]/jobs - All jobs for an application
  • /applications/[app-id]/stages - All stages for an application

Executors

  • /applications/[app-id]/executors - Active executors
  • /applications/[app-id]/allexecutors - All executors
  • /applications/[app-id]/executors/[executor-id]/threads - Thread dumps

Storage

  • /applications/[app-id]/storage/rdd - Stored RDDs
  • /applications/[app-id]/storage/rdd/[rdd-id] - RDD storage status

SQL

  • /applications/[app-id]/sql - All queries
  • /applications/[app-id]/sql/[execution-id] - Query details
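As a quick sketch of consuming the /applications listing: the JSON below is a hand-written illustration of the response shape (a list of applications, each with an id, a name, and one or more attempts), not output captured from a real server:

```python
import json

# Hand-written sample shaped like a /applications response (illustrative only).
sample = '''
[
  {"id": "app-20240101120000-0000",
   "name": "WordCount",
   "attempts": [{"completed": true, "duration": 120000}]}
]
'''

apps = json.loads(sample)
# Keep only applications whose every attempt has completed.
completed = [a["id"] for a in apps if all(at["completed"] for at in a["attempts"])]
print(completed)  # ['app-20240101120000-0000']
```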

Executor Task Metrics

The REST API exposes Task Metrics collected by Spark executors at the granularity of task execution:
  • executorRunTime - Elapsed time the executor spent running this task (milliseconds)
  • executorCpuTime - CPU time the executor spent running this task (nanoseconds)
  • executorDeserializeTime - Elapsed time spent to deserialize this task (milliseconds)
  • jvmGCTime - Elapsed time the JVM spent in garbage collection (milliseconds)
  • resultSerializationTime - Elapsed time spent serializing the task result (milliseconds)
  • memoryBytesSpilled - Number of in-memory bytes spilled by this task
  • diskBytesSpilled - Number of on-disk bytes spilled by this task
  • peakExecutionMemory - Peak memory used by internal data structures during shuffles, aggregations and joins
  • inputMetrics.bytesRead - Total number of bytes read
  • inputMetrics.recordsRead - Total number of records read
  • outputMetrics.bytesWritten - Total number of bytes written
  • shuffleReadMetrics.remoteBytesRead - Number of remote bytes read in shuffle operations
  • shuffleReadMetrics.fetchWaitTime - Time spent waiting for remote shuffle blocks (milliseconds)
  • shuffleWriteMetrics.bytesWritten - Number of bytes written in shuffle operations
  • shuffleWriteMetrics.writeTime - Time spent blocking on writes to disk or buffer cache (nanoseconds)
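For example, these per-task counters can be combined into simple health ratios, such as the fraction of wall-clock time spent on CPU versus garbage collection. The numeric values below are made up for illustration; note the unit conversion, since executorCpuTime is reported in nanoseconds while the elapsed times are in milliseconds:

```python
# Illustrative task-metrics record (values invented; units per the list above).
task = {
    "executorRunTime": 2000,           # milliseconds
    "executorCpuTime": 1_500_000_000,  # nanoseconds
    "jvmGCTime": 200,                  # milliseconds
}

cpu_ms = task["executorCpuTime"] / 1_000_000     # nanoseconds -> milliseconds
cpu_fraction = cpu_ms / task["executorRunTime"]  # share of run time on CPU
gc_fraction = task["jvmGCTime"] / task["executorRunTime"]  # share spent in GC
print(cpu_fraction, gc_fraction)  # 0.75 0.1
```

A low cpu_fraction with a high gc_fraction is a common sign of memory pressure worth investigating.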

Executor Metrics

Executor-level metrics are sent from each executor to the driver as part of the Heartbeat:
  • memoryUsed - Storage memory used by this executor
  • diskUsed - Disk space used for RDD storage
  • maxMemory - Total amount of memory available for storage
  • usedOnHeapStorageMemory - Used on heap memory for storage
  • usedOffHeapStorageMemory - Used off heap memory for storage
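A small sketch of working with these heartbeat values, e.g. computing the storage memory still available on an executor; the figures are invented for illustration:

```python
# Illustrative executor heartbeat values (bytes; numbers are made up).
executor = {
    "memoryUsed": 512 * 1024**2,  # storage memory in use: 512 MiB
    "maxMemory": 2 * 1024**3,     # total storage memory: 2 GiB
    "diskUsed": 0,
}

free = executor["maxMemory"] - executor["memoryUsed"]
print(free // 1024**2)  # 1536  (MiB of storage memory still available)
```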

Metrics System

Spark has a configurable metrics system based on the Dropwizard Metrics Library. This allows you to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV files.

Configuration

The metrics system is configured via a configuration file at $SPARK_HOME/conf/metrics.properties. A custom file location can be specified via the spark.metrics.conf configuration property.
Instead of using a configuration file, you can set the same options as Spark configuration properties with the spark.metrics.conf. prefix.
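For example, a minimal metrics.properties enabling a console sink for every instance might look like this (the 10-second polling period is illustrative):

```
# Report metrics from all instances to the console every 10 seconds
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```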

Namespace Configuration

By default, the root namespace used for driver or executor metrics is the value of spark.app.id, which changes with every invocation. To track metrics across runs of the same application, set spark.metrics.namespace:
spark.metrics.namespace=${spark.app.name}

Metrics Instances

Spark’s metrics are decoupled into different instances:
  • master - The Spark standalone master process
  • applications - A component within the master which reports on various applications
  • worker - A Spark standalone worker process
  • executor - A Spark executor
  • driver - The Spark driver process
  • shuffleService - The Spark shuffle service
  • applicationMaster - The Spark ApplicationMaster when running on YARN

Available Sinks

Each instance can report to zero or more sinks:

ConsoleSink

Logs metrics information to the console

CSVSink

Exports metrics data to CSV files at regular intervals

JmxSink

Registers metrics for viewing in a JMX console

MetricsServlet

Adds a servlet within the existing Spark UI to serve metrics data as JSON

PrometheusServlet

Adds a servlet to serve metrics data in Prometheus format (Experimental)

GraphiteSink

Sends metrics to a Graphite node

Slf4jSink

Sends metrics to slf4j as log entries

StatsdSink

Sends metrics to a StatsD node

Prometheus Endpoints

The Prometheus Servlet exposes the same metrics as the JSON endpoints, in Prometheus' time-series text format:
  • Master (port 8080): JSON endpoint /metrics/master/json/, Prometheus endpoint /metrics/master/prometheus/
  • Master (port 8080): JSON endpoint /metrics/applications/json/, Prometheus endpoint /metrics/applications/prometheus/
  • Worker (port 8081): JSON endpoint /metrics/json/, Prometheus endpoint /metrics/prometheus/
  • Driver (port 4040): JSON endpoint /metrics/json/, Prometheus endpoint /metrics/prometheus/
  • Driver (port 4040): JSON endpoint /api/v1/applications/{id}/executors/, Prometheus endpoint /metrics/executors/prometheus/
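A metrics.properties fragment wiring up the PrometheusServlet on these paths might look like the sketch below; the paths follow the endpoint list above, and to my understanding the driver's executor endpoint additionally requires spark.ui.prometheus.enabled=true:

```
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
```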

Example Configuration

Using Spark configuration parameters for a Graphite sink:
spark.metrics.conf.*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
spark.metrics.conf.*.sink.graphite.host=<graphiteEndPoint_hostName>
spark.metrics.conf.*.sink.graphite.port=<graphite_listening_port>
spark.metrics.conf.*.sink.graphite.period=10
spark.metrics.conf.*.sink.graphite.unit=seconds
spark.metrics.conf.*.sink.graphite.prefix=optional_prefix
Default metrics configuration values:
*.sink.servlet.class = org.apache.spark.metrics.sink.MetricsServlet
*.sink.servlet.path = /metrics/json
master.sink.servlet.path = /metrics/master/json
applications.sink.servlet.path = /metrics/applications/json

Advanced Instrumentation

Custom Metrics

You can define custom metrics sources to instrument your Spark applications with application-specific metrics.

External Monitoring

Spark’s metrics can be integrated with external monitoring systems through:
  • Prometheus: Use the PrometheusServlet for scraping metrics
  • Graphite: Configure GraphiteSink to push metrics
  • JMX: Use JmxSink for JMX-based monitoring tools
  • StatsD: Configure StatsdSink for StatsD integration
For production environments, configure both event logging and metrics collection to ensure comprehensive monitoring and troubleshooting capabilities.
