Wayang’s extensibility model is built entirely on plugins. Every platform that Wayang can execute on — Java Streams, Apache Spark, Apache Flink, PostgreSQL, and more — is connected to the framework through a plugin. Plugins are additive: the more you register, the more options the optimizer has when planning your job. This page explains what a plugin is, how it works internally, and which built-in plugins are available.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
What Is a Plugin?
APlugin is an interface in wayang-core that bundles three categories of components and registers all of them with the Configuration at startup:
wayang.withPlugin(Java.basicPlugin()), Wayang calls plugin.configure(configuration), which whitelists all the plugin’s platforms, mappings, and channel conversions. From that point on, the optimizer knows about them and can select them during plan enumeration.
How Mappings Work
AMapping is a set of PlanTransformation rules. Each rule specifies a pattern (a subgraph of logical operators) and a replacement (one or more execution operators on a specific platform). The optimizer applies these transformations when expanding a logical plan into candidate physical plans.
For example, Java.basicPlugin() contributes a mapping that says: “a MapOperator<In, Out> can be replaced by a JavaMapOperator<In, Out> running on the JavaPlatform.” Similarly, Spark.basicPlugin() contributes “a MapOperator<In, Out> can be replaced by a SparkMapOperator<In, Out> running on the SparkPlatform.”
With both plugins registered, the optimizer has two candidate implementations for every MapOperator and will pick whichever is cheaper given the estimated data size and platform costs.
Channel Conversions
When the optimizer assigns two adjacent operators to different platforms, it needs a recipe to move data across the boundary. That recipe is aChannelConversion. Each conversion has a source channel type, a target channel type, and an estimated cost (time + resources).
Typical conversions:
- Java
CollectionChannel→ HDFS text file → SparkRddChannel— used when the optimizer sends data from a Java stage to a Spark stage. - Spark
RddChannel→ JavaCollectionChannel— used when the optimizer pulls Spark results back into a local Java stage. - Any platform → JDBC → PostgreSQL table — used when the Postgres plugin is active and the optimizer routes output to a SQL stage.
Java.channelConversionPlugin(), Spark.conversionPlugin()).
The optimizer includes channel conversion costs in its total plan cost. A cross-platform split is only chosen if the savings from using a faster platform outweigh the serialization overhead of the boundary.
Built-In Plugins
Java
The Java plugin maps most built-in operators to Java Streams and in-memory collections. It is the default choice for small data and for local development because it has zero startup overhead.| Entry point | What it adds |
|---|---|
Java.basicPlugin() | Mappings for all standard operators (map, filter, flatMap, reduceByKey, join, etc.) to Java Streams |
Java.channelConversionPlugin() | Conversions between Java channels and cross-platform serialized formats |
Java.graphPlugin() | PageRankOperator mapping for the JavaPlatform |
Apache Spark
The Spark plugin maps operators to Spark RDD and DataFrame operations. Add it alongside the Java plugin to let the optimizer choose between local and distributed execution per operator.| Entry point | What it adds |
|---|---|
Spark.basicPlugin() | Mappings for standard operators to Spark RDD operations |
Spark.graphPlugin() | PageRankOperator mapping for Spark GraphX |
Spark.conversionPlugin() | Conversions between Spark channels and cross-platform formats |
Spark.mlPlugin() | KMeansOperator, LogisticRegressionOperator, etc. via Spark MLlib |
Apache Flink
The Flink plugin enables execution on Apache Flink for batch workloads. It mirrors the Java/Spark structure.PostgreSQL and SQLite3
The database plugins unlock SQL pushdown: when the optimizer assigns an operator to a database platform, the execution operator translates it to SQL and runs it inside the database engine. This is especially efficient forFilterOperator (maps to WHERE) and MapOperator over Record types (maps to SELECT).
Apache Kafka
Kafka support is built into both the Java and Spark platform modules. When eitherJava.basicPlugin() or Spark.basicPlugin() is registered, the KafkaTopicSource and KafkaTopicSink operators are automatically mapped to their corresponding Kafka implementations.
Configuration:
IEJoin (Inequality Join)
The IEJoin plugin adds support forIEJoinOperator and IESelfJoinOperator — operators for non-equi joins with inequality predicates (e.g., a.ts < b.ts AND a.ts > b.ts - interval). It provides implementations on both Java and Spark.
IEJoin.plugin() requires both Java.basicPlugin() and Spark.basicPlugin() to already be registered (or registered alongside it), because it depends on the JavaPlatform and SparkPlatform being active.
Spatial
The Spatial plugin enables geo-spatial operators (SpatialFilterOperator, SpatialJoinOperator) and GeoJsonFileSource. Implementations are available on Java, Spark, and PostgreSQL (with PostGIS).
Machine Learning
MachineLearning.plugin() is a lightweight coordination plugin for ML workflows. It does not register platform mappings on its own — actual ML operator implementations are contributed by Spark.mlPlugin() (for Spark MLlib operators) or by the wayang-tensorflow module. Use it as the entry point to pull in the wayang-ml module dependency and to combine with the platform-specific ML plugins below.
Registering Multiple Plugins
Plugins are completely additive. EverywithPlugin() call extends the optimizer’s search space without removing anything that was registered before. The more plugins you register, the more platform options the optimizer has for each operator.
Writing a Custom Plugin
If you need an operator or platform not covered by the built-in plugins, you can implementPlugin directly:
