The Apache Spark platform adapter lets Wayang route operators to a Spark cluster (or a local Spark context) without changing a single line of your pipeline code. You registerDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
Spark.basicPlugin() alongside any other platforms you have, and Wayang’s cost-based optimizer decides which operators run on Spark and which run elsewhere. Spark’s strength is large-scale batch processing: its in-memory RDD model and cluster scheduler amortize startup overhead over large datasets in a way that local engines cannot match.
Requirements
Before registering the Spark platform, ensure the following are in place:- Apache Spark 3.4.4 with Scala 2.12 installed and
SPARK_HOMEset to the installation directory. - Apache Hadoop 3+ with
HADOOP_HOMEset (required for HDFS and distributed file I/O). - Java 17 with
JAVA_HOMEset. Running on Java 17 requires passing extra JVM flags to open internal modules — see the installation guide for the exact flags.
Maven dependency
WAYANG_VERSION with the latest release on Maven Central.
Entry-point class: Spark
The Spark class in org.apache.wayang.spark exposes four static factory methods:
basicPlugin() and graphPlugin() are independent — register only the ones you need. conversionPlugin() is included automatically when you register basicPlugin(), but you can add it explicitly if you need fine-grained control over cross-platform transfers.
Supported operators
Spark.basicPlugin() covers all standard Wayang operators on Spark:
- Sources:
TextFileSource,ObjectFileSource,CollectionSource,ParquetSource,KafkaTopicSource - Transformations:
MapOperator,MapPartitionsOperator,FlatMapOperator,FilterOperator,SortOperator,DistinctOperator,ZipWithIdOperator - Aggregations:
ReduceByOperator,GlobalReduceOperator,MaterializedGroupByOperator,GlobalMaterializedGroupOperator,CountOperator - Multi-input:
JoinOperator,CartesianOperator,CoGroupOperator,UnionAllOperator,IntersectOperator - Control flow:
LoopOperator,DoWhileOperator,RepeatOperator,SampleOperator - Sinks:
TextFileSink,ObjectFileSink,LocalCallbackSink,KafkaTopicSink,ParquetSink,TableSink
Spark.graphPlugin() adds PageRankOperator.
Spark.mlPlugin() adds KMeans, LinearRegression, LogisticRegression, DecisionTreeClassification, DecisionTreeRegression, LinearSVC, ModelTransform, and PredictOperator.
Configuration
- Default properties
- Override via Configuration
- conf/spark/default.properties
wayang-spark-defaults.properties (bundled inside wayang-spark) sets these defaults:wayang.spark.init.ms = 4500 is the estimated startup cost (in milliseconds) the optimizer charges for any plan that uses Spark. This is the primary reason the optimizer avoids Spark for tiny datasets — the 4.5-second fixed cost only pays off when the data volume is large enough to justify distributed execution.RDD channels vs Dataset channels
By default, Wayang’s Spark backend represents data as Java RDDs (JavaRDD<T>). As of recent releases, Spark Dataset/DataFrame channels are also available for operators that benefit from Spark’s Catalyst optimizer (structured query optimization, columnar reads, Spark ML integration).
- RDD channels are the default for general-purpose pipelines.
- Dataset channels are preferable when reading Parquet/Delta, interoperating with Spark ML stages, or using schema-aware processing.
readParquet(..., preferDataset = true) API and the automatic RDD ↔ Dataset conversion operators (SparkRddToDatasetOperator, SparkDatasetToRddOperator), see the Spark Dataset Pipelines guide.
Example: register only Spark (force Spark execution)
When you register only Spark, the optimizer has no other choice — every operator runs on Spark. This is useful for integration tests that must verify Spark behavior, or for pipelines that are known to be large and should never fall back to local execution.Why register only Spark in tests? If you register both Java and Spark, the optimizer will keep small test datasets on Java Streams (correct behavior, but it means Spark is never exercised). Registering Spark alone forces every operator onto Spark so you can confirm connectivity, serialization, and UDF behavior.
