Apache Wayang is built around a clean separation between what your pipeline does and where it runs. You write a logical plan once using Wayang’s API, and the framework handles the entire journey from that high-level description down to physical execution on one or more platforms. Understanding the layers between your code and the actual compute engine helps you reason about performance, debugging, and extensibility.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
Architectural Layers
Wayang’s execution pipeline passes through five distinct layers. Each layer transforms the pipeline representation into something more concrete, until it eventually runs on one or more real execution engines.User API → WayangPlan
Your application code uses either the fluent Java/Scala API (
JavaPlanBuilder / PlanBuilder) or the low-level operator graph API. Either way, the result is a WayangPlan — a directed acyclic graph (or DAG with loops) of logical operators connected by typed slots. Logical operators such as MapOperator, FilterOperator, and TextFileSource describe intent without knowing which platform will run them.WayangPlan → Optimizer
When you call
execute(), the WayangContext creates a Job and hands the WayangPlan to the optimizer. The optimizer annotates every operator in the plan with cardinality estimates (how many records will flow through each edge), computes load profiles (CPU, memory, network cost), and enumerates candidate physical implementations for every logical operator.Optimizer → Physical Execution Plan (ExecutionPlan)
The optimizer selects the cheapest combination of execution operators by comparing cost across all registered platforms. The resulting
ExecutionPlan is a graph of ExecutionTask nodes, each backed by a platform-specific execution operator (e.g., JavaMapOperator or SparkFilterOperator). Operators are grouped into ExecutionStage blocks that belong to a single platform.ExecutionPlan → Platform Executors
The
CrossPlatformExecutor walks the ExecutionPlan stage by stage. For each stage it calls the appropriate platform’s Executor (e.g., JavaExecutor, SparkExecutor). Stages on the same platform run without data serialization; stages on different platforms exchange data through channels — the typed data conduits described below.The WayangContext Entry Point
WayangContext is the single object your application interacts with before execution. It holds the Configuration and the registered plugins. Every call to withPlugin() extends the set of platforms and operator mappings the optimizer can use.
withPlugin() method returns this, so you can chain as many plugins as needed. Once a plugin is registered, its platforms, operator mappings, and channel conversions become available to the optimizer.
You can also inspect or export plans before running them using wayang.explain(wayangPlan), which prints both the logical and the resolved physical plan to help with debugging.
Logical vs. Execution Operators
The operator model has two tiers:| Tier | Example classes | Role |
|---|---|---|
| Logical operators | MapOperator, FilterOperator, TextFileSource | Describe the computation in platform-agnostic terms. Live in wayang-basic. |
| Execution operators | JavaMapOperator, SparkFilterOperator | Platform-specific implementations. Live in the platform modules (e.g., wayang-java, wayang-spark). |
Mapping objects contributed by plugins. During optimization, the optimizer expands each logical operator into its candidate execution operators and picks the one with the lowest estimated cost.
How Plugins Connect Operators to Platforms
APlugin bundles three things that it contributes to the Configuration when registered:
Mappings — each mapping declares that a logical operator pattern can be replaced by one or more execution operators on a specific platform.ChannelConversions — recipes for converting data between channel types when the optimizer routes consecutive stages to different platforms.Configurationproperties — platform-specific tuning parameters (e.g., Spark master URL, cost coefficients).
Java.basicPlugin() is registered, it contributes JavaMapOperator as the execution implementation of MapOperator on the JavaPlatform, along with channel descriptors for Java collection and iterator channels.
Channels: Data Flow Between Operators
AChannel in Wayang’s execution plan is the data conduit between two ExecutionTask nodes. Each platform defines its own channel types:
- Java —
CollectionChannel(in-memoryCollection<T>),StreamChannel(JavaStream<T>) - Spark —
RddChannel(Spark RDD),BroadcastChannel - Cross-platform — when two consecutive stages run on different platforms, a
ChannelConversionwrites the data in a serialized format (e.g., a temp file) that the next platform can read
Execution Flow Diagram
The optimizer re-evaluates cardinality estimates after each loop iteration when a
doWhile or repeat operator is present. If actual observed cardinalities differ significantly from estimates, the CrossPlatformExecutor can replan subsequent stages at runtime (adaptive re-optimization).Configuration
Configuration is a layered key-value store that controls every aspect of the runtime. You can override defaults by setting properties programmatically or by placing a wayang.properties file on the classpath.
WayangContext forks the configuration it is given (so the original is not modified), and plugins may further layer their own properties on top via setProperties(Configuration).