One of Apache Wayang’s defining capabilities is that you don’t pick a platform — the optimizer does. Given a logical plan and a set of registered plugins, the optimizer assigns every operator to the platform where it will run cheapest, and it can even split a single job across multiple engines within the same execution. This page explains the mechanics: how data sizes are estimated, how costs are computed, how the search space is pruned, and how the final assignment is made.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
The Cost Model at a Glance
The optimizer works in four phases:- Cardinality estimation — estimate how many records flow through every edge in the plan.
- Load estimation — for each (operator, platform) pair, compute a
LoadProfile(CPU, memory, network). - Plan enumeration — expand the logical plan into candidate physical plans by substituting each logical operator with one or more execution operators.
- Pruning and selection — discard dominated candidates, squash uncertainty intervals to scalars, and pick the plan with the lowest total cost.
ProbabilisticDoubleInterval — an interval with an associated correctness probability — so the optimizer can reason about uncertainty rather than pretending estimates are exact.
Cardinality Estimation
Every logical operator that contributes aCardinalityEstimator tells the optimizer how its output cardinality relates to its input cardinalities. Estimates are represented as CardinalityEstimate(lowerBound, upperBound, correctnessProbability).
MapOperator— output cardinality equals input cardinality (1:1 transformation). Its estimator usesDefaultCardinalityEstimatorwith a selectivity of1.0.FilterOperator— reads aProbabilisticDoubleIntervalselectivity from theConfigurationfor the givenPredicateDescriptor. Output range is[lower × selectivityLower, upper × selectivityUpper]with combined correctness probability.TextFileSource— samples the first 1 MiB of the file, measures average bytes per line, and extrapolates to the full file size with a correctness probability of0.95.FlatMapOperator,JoinOperator,GroupByOperator— each supplies its own estimator appropriate to the semantics.
LoopContext, which manages per-iteration OptimizationContexts and can refine estimates as the loop progresses.
Wayang stores observed actual cardinalities in a
CardinalityRepository. Over time, this data is used to improve future estimates for the same operator types and UDF signatures — a lightweight form of learned statistics.ProbabilisticDoubleInterval: Uncertainty-Aware Costs
All cost quantities — selectivities, load profiles, time estimates, and final cost estimates — are represented asProbabilisticDoubleInterval:
wayang.core.optimizer.cost.squasher) collapses an interval to a single scalar for the final comparison. The default squasher is the geometric mean: sqrt(lower × upper).
Load Estimation: CPU, Memory, and Network
Once cardinality estimates are available, the optimizer estimates the actual resource consumption of each execution operator using aLoadProfileEstimator. A LoadProfile contains:
| Resource | Unit |
|---|---|
| CPU load | operations (scaled by frequency) |
| RAM load | bytes |
| Disk I/O | bytes |
| Network I/O | bytes |
LoadProfile is converted to a TimeEstimate (wall-clock time with a correctness interval) using the platform’s LoadProfileToTimeConverter, and then to a cost using the platform’s TimeToCostConverter. Different platforms have very different cost profiles:
- Java Streams — minimal startup overhead; cost scales linearly with data size. Cheap for small data.
- Apache Spark — significant startup overhead (JVM initialization, SparkContext creation, task scheduling). Worth it only when data is large enough that the parallel speedup outweighs the startup cost.
- Apache Flink — similar structure to Spark; optimized for streaming but also effective for large batch workloads.
- PostgreSQL / SQLite — near-zero startup; highly efficient for filter-heavy workloads that fit in a database’s query optimizer.
Plan Enumeration
The plan enumerator expands every logical operator into the set of execution operators provided by registered plugins. For a plan withn operators, each with k platform options, the search space is k^n candidate plans. Two strategies keep this tractable:
- Operator-by-operator expansion — the enumerator builds partial plans left-to-right, pruning dominated partial assignments before expanding the next operator.
- Pruning strategies (
PlanEnumerationPruningStrategy) — pluggable strategies discard any partial plan whose lower-bound cost already exceeds the best known complete plan. The default strategy isLatentOperatorPruningStrategy.
How the Optimizer Chooses Platforms
The optimizer assigns asquashedCostEstimate to every candidate complete plan and selects the plan with the smallest value. The key comparison is:
operatorCost is obtained by the OperatorContext.updateCostEstimate() path, which calls LoadProfileEstimators.estimateLoadProfile(), converts to time via LoadProfileToTimeConverter, and converts to cost via TimeToCostConverter.
- Run
TextFileSource→FilterOperatoron Java (small data, filter is cheap locally). - Switch to Spark for a
JoinOperatoron two large datasets (parallelism pays off). - Use a cross-platform channel (temp file on HDFS) to hand data from Java to Spark.
Cross-Platform Execution
When the optimizer selects different platforms for different parts of the plan, theCrossPlatformExecutor orchestrates the hand-off. It groups execution tasks into ExecutionStage blocks: a stage boundary exists wherever two adjacent tasks belong to different platforms.
At a stage boundary, Wayang inserts a ChannelConversion operator to serialize the output of the upstream stage into a format the downstream platform can read. Common conversions include:
- Java
Collection→ HDFS text file → Spark RDD - Spark RDD → materialized file → Flink
DataSet - Any platform → JDBC → PostgreSQL table
Adaptive Re-Optimization in Loops
When the plan contains adoWhile or repeat operator, Wayang can observe actual output cardinalities after each iteration and compare them with estimates. If the deviation exceeds a configured threshold (a CardinalityBreakpoint), the executor re-invokes the optimizer to produce a revised plan for subsequent iterations. This is especially valuable for iterative algorithms like PageRank or K-Means where convergence speed is data-dependent.
Customizing the Cost Model
For advanced use cases, you can replace the default cost model entirely by implementingEstimatableCost and registering it with the Configuration:
wayang-ml module ships a reference implementation (OrtMLModel) that loads an ONNX model and uses a one-hot encoding of the plan graph to predict execution time — a drop-in replacement for the formula-based cost estimator.