Apache Wayang abstracts your data pipeline away from any single execution engine by treating every engine as a pluggable platform. You register the platforms you want available, and Wayang’s cost-based optimizer decides — at runtime — which platform (or combination of platforms) best executes each operator in your job. Adding more platforms to the registry gives the optimizer more choices, which generally produces better execution plans, especially when data volumes or query shapes change across pipeline stages.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
All supported platforms
Java Streams
Single-JVM execution backed by
java.util.stream. Zero setup, ideal for development, testing, and small datasets.Apache Spark
Distributed batch processing on Spark 3.4.4 / Scala 2.12. Handles large-scale data across a cluster or locally.
Apache Flink
Unified stream and batch engine. DataSet API (legacy) and bounded DataStream API both supported.
PostgreSQL
Push filter, projection, join, and reduce operators into PostgreSQL via JDBC. Avoids moving data out of the database.
SQLite3
Lightweight embedded JDBC platform. Good for edge devices and integration tests that need an in-process SQL engine.
Apache Kafka
KafkaTopicSource and KafkaTopicSink are first-class logical operators. Both Java Streams and Spark can execute them.Apache Giraph
Graph-processing platform for large-scale PageRank and similar BSP algorithms. Entry point:
Giraph.plugin().TensorFlow
Machine-learning platform exposing
DLTrainingOperator and PredictOperator. Entry point: Tensorflow.plugin().Platform table
| Platform | Maven module | Entry-point class | Notes |
|---|---|---|---|
| Java Streams | wayang-java | Java | Always available; fallback for many operator types |
| Apache Spark | wayang-spark | Spark | Requires SPARK_HOME; Spark 3.4.4, Scala 2.12 |
| Apache Flink | wayang-flink | Flink | DataSet (deprecated) and bounded stream modes |
| PostgreSQL | wayang-postgres | Postgres | JDBC; filter/join/projection pushed into SQL |
| SQLite3 | wayang-sqlite3 | Sqlite3 | JDBC; embedded, file-based or in-memory |
| Apache Kafka | wayang-java / wayang-spark | KafkaTopicSource / KafkaTopicSink | Logical operators; executed on Java or Spark |
| Apache Giraph | wayang-giraph | Giraph | PageRank; needs a Hadoop/Giraph cluster |
| TensorFlow | wayang-tensorflow | Tensorflow | DL training and inference operators |
Plugin architecture
Every platform ships as one or more plugins. A plugin bundles:- Operator mappings — a table that tells the optimizer “logical operator X can be executed as physical operator Y on this platform.”
- Channel conversions — cost-annotated rules for moving data between platforms (for example, collecting a Spark RDD back to a Java
Collectionto hand off to a downstream operator on Java Streams). - Default configuration — a bundled
.propertiesfile with cost-model parameters, parallelism settings, and connection defaults.
WayangContext:
You only pay the overhead of a platform (driver startup, network round-trips, JVM serialization) when the optimizer actually chooses it for at least one operator in the job. Registering an additional platform does not force its use.
Choosing which plugins to register
| Scenario | Recommended plugins |
|---|---|
| Local development / unit tests | Java.basicPlugin() |
| Small–medium production jobs | Java.basicPlugin() |
| Large datasets on a cluster | Spark.basicPlugin() |
| Let Wayang pick automatically | Java.basicPlugin() + Spark.basicPlugin() |
| Graph analytics | add Spark.graphPlugin() or Giraph.plugin() |
| SQL-heavy pipelines | add Postgres.plugin() or Sqlite3.plugin() |
| ML training | add Tensorflow.plugin() |
