Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt

Use this file to discover all available pages before exploring further.

Apache Wayang abstracts your data pipeline away from any single execution engine by treating every engine as a pluggable platform. You register the platforms you want available, and Wayang’s cost-based optimizer decides — at runtime — which platform (or combination of platforms) best executes each operator in your job. Adding more platforms to the registry gives the optimizer more choices, which generally produces better execution plans, especially when data volumes or query shapes change across pipeline stages.

All supported platforms

Java Streams

Single-JVM execution backed by java.util.stream. Zero setup, ideal for development, testing, and small datasets.

Apache Spark

Distributed batch processing on Spark 3.4.4 / Scala 2.12. Handles large-scale data across a cluster or locally.

Apache Flink

Unified stream and batch engine. DataSet API (legacy) and bounded DataStream API both supported.

PostgreSQL

Push filter, projection, join, and reduce operators into PostgreSQL via JDBC. Avoids moving data out of the database.

SQLite3

Lightweight embedded JDBC platform. Good for edge devices and integration tests that need an in-process SQL engine.

Apache Kafka

KafkaTopicSource and KafkaTopicSink are first-class logical operators. Both Java Streams and Spark can execute them.

Apache Giraph

Graph-processing platform for large-scale PageRank and similar BSP algorithms. Entry point: Giraph.plugin().

TensorFlow

Machine-learning platform exposing DLTrainingOperator and PredictOperator. Entry point: Tensorflow.plugin().

Platform table

PlatformMaven moduleEntry-point classNotes
Java Streamswayang-javaJavaAlways available; fallback for many operator types
Apache Sparkwayang-sparkSparkRequires SPARK_HOME; Spark 3.4.4, Scala 2.12
Apache Flinkwayang-flinkFlinkDataSet (deprecated) and bounded stream modes
PostgreSQLwayang-postgresPostgresJDBC; filter/join/projection pushed into SQL
SQLite3wayang-sqlite3Sqlite3JDBC; embedded, file-based or in-memory
Apache Kafkawayang-java / wayang-sparkKafkaTopicSource / KafkaTopicSinkLogical operators; executed on Java or Spark
Apache Giraphwayang-giraphGiraphPageRank; needs a Hadoop/Giraph cluster
TensorFlowwayang-tensorflowTensorflowDL training and inference operators

Plugin architecture

Every platform ships as one or more plugins. A plugin bundles:
  1. Operator mappings — a table that tells the optimizer “logical operator X can be executed as physical operator Y on this platform.”
  2. Channel conversions — cost-annotated rules for moving data between platforms (for example, collecting a Spark RDD back to a Java Collection to hand off to a downstream operator on Java Streams).
  3. Default configuration — a bundled .properties file with cost-model parameters, parallelism settings, and connection defaults.
You register plugins on the WayangContext:
WayangContext wayang = new WayangContext(new Configuration())
        .withPlugin(Java.basicPlugin())   // Java Streams operators
        .withPlugin(Spark.basicPlugin()); // Spark operators
The optimizer inspects every registered mapping when it builds the execution plan. More mappings available means more candidate plans to compare, which typically lets the optimizer find cheaper routes, especially when it can split a single logical plan across platforms.
You only pay the overhead of a platform (driver startup, network round-trips, JVM serialization) when the optimizer actually chooses it for at least one operator in the job. Registering an additional platform does not force its use.

Choosing which plugins to register

ScenarioRecommended plugins
Local development / unit testsJava.basicPlugin()
Small–medium production jobsJava.basicPlugin()
Large datasets on a clusterSpark.basicPlugin()
Let Wayang pick automaticallyJava.basicPlugin() + Spark.basicPlugin()
Graph analyticsadd Spark.graphPlugin() or Giraph.plugin()
SQL-heavy pipelinesadd Postgres.plugin() or Sqlite3.plugin()
ML trainingadd Tensorflow.plugin()
In production, register every platform you have available and let the cost-based optimizer decide. Artificially limiting the platform set only helps when you need to guarantee a specific engine for compliance or environment reasons.

Build docs developers (and LLMs) love