Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt

Use this file to discover all available pages before exploring further.

Wayang ships a rich library of built-in operators that cover the full spectrum of data processing tasks — from simple field projections to iterative machine-learning loops spanning multiple platforms. Every operator in this catalog is a logical operator: it describes what to compute, not how or where. The optimizer resolves each logical operator to a platform-specific execution operator at runtime, drawing on the mappings contributed by whatever plugins you have registered.

Logical vs. Execution Operators

The distinction between logical and execution operators is fundamental to Wayang’s design:
Logical operators live in wayang-basic and wayang-core. They describe computation in abstract, platform-agnostic terms. You use them directly when building a WayangPlan or when calling methods on DataQuanta.
// MapOperator is a logical operator — no platform knowledge here
MapOperator<String, Integer> lengthOp =
    new MapOperator<>(String::length, String.class, Integer.class);
Key properties of logical operators:
  • They hold a FunctionDescriptor (UDF) or other parameters.
  • They provide a CardinalityEstimator so the optimizer can reason about data sizes.
  • They know nothing about JVM threads, Spark RDDs, or JDBC connections.

UDFs: FunctionDescriptor Hierarchy

User-defined functions (UDFs) are wrapped in typed descriptor objects rather than passed as raw lambdas. This lets the optimizer inspect metadata (load estimates, complexity class, SQL translations) without executing the function.
DescriptorUsed byJava / Scala type
TransformationDescriptorMapOperator, MapPartitionsOperatorT → R
PredicateDescriptorFilterOperatorT → Boolean
FlatMapDescriptorFlatMapOperatorT → Iterable<R>
ReduceDescriptorReduceOperator, ReduceByOperator, GlobalReduceOperator(T, T) → T
MapPartitionsDescriptorMapPartitionsOperatorIterable<T> → Iterable<R>
All descriptor interfaces extend FunctionDescriptor.SerializableFunction (or the predicate/binary-operator variants), which extends java.io.Serializable — a requirement for shipping UDFs to remote executors like Spark workers.

Transformations

These operators process elements one by one or partition by partition.

map

Applies a TransformationDescriptor to each element. Output cardinality equals input cardinality. The most common operator in any pipeline.

flatMap

Applies a FlatMapDescriptor that returns zero or more output elements per input element. Output cardinality is estimated via a configurable selectivity.

filter

Applies a PredicateDescriptor. Only elements for which the predicate returns true are forwarded. Also accepts an optional SQL WHERE clause for database pushdown.

mapPartitions

Applies a MapPartitionsDescriptor to an entire partition at once — useful when the UDF needs to look at multiple records together (e.g., mini-batch scoring).

sort

Sorts the dataset by a key function. Platform executors use native sort implementations (e.g., RDD.sortBy on Spark).

distinct

Removes duplicate elements. Uses hashing or sorting depending on the platform.

sample

Returns a fixed or fraction-based random sample. Supports Bernoulli, reservoir, and other sampling strategies.

zipWithId

Appends a unique 64-bit ID to every element, producing Tuple2<Long, T>. IDs are not necessarily sequential but are globally unique within the dataset.

Aggregations

reduceByKey

Groups by a key function and reduces values within each group using a binary operator. Equivalent to groupByKey + reduce.

reduce / globalReduce

ReduceOperator reduces a keyed group; GlobalReduceOperator reduces the entire dataset to a single element.

groupBy / materializedGroupBy

GroupByOperator streams groups lazily; MaterializedGroupByOperator collects each group into a Collection before forwarding. Use the materialized variant when downstream UDFs need random access within a group.

count

Counts the number of elements in the dataset. Returns a DataQuanta<Long>.

coGroup

Joins two datasets by key and forwards matched groups as Tuple2<Iterable<A>, Iterable<B>>. Similar to SQL FULL OUTER JOIN at the group level.

Binary and Set Operators

OperatorDescription
JoinOperatorInner equi-join on two datasets by key functions.
CartesianOperatorCross product of two datasets. Use with caution on large inputs.
UnionAllOperatorConcatenates two datasets without deduplication.
IntersectOperatorReturns elements present in both datasets.
CoGroupOperatorGroups both inputs by the same key for custom join logic.

Control Flow

Wayang supports iterative algorithms natively through loop operators. All three compile down to the LoopSubplan / LoopHeadOperator mechanism in the core.
Runs the body subplan, evaluates a convergence predicate on a designated convergence output, and repeats until the predicate returns false (or until a maximum iteration count is reached).
// Conceptual doWhile usage via the Scala/Java API
DataQuanta<PageRankVector> ranks = initialRanks
    .doWhile(
        /* convergence check */ vectors -> !hasConverged(vectors),
        start -> start.map(/* pagerank step */)
    );

Sources

Sources have no inputs; they produce records from an external system.
OperatorOutput typeNotes
TextFileSourceStringReads lines from any URL the configured filesystem supports (local, HDFS, S3). Samples the first MiB for cardinality estimation.
ObjectFileSourceTReads serialized Java objects from a binary file.
CollectionSourceTWraps an in-memory Collection. Useful for small reference data or tests.
ParquetSourceRecordReads columnar Parquet files; projection and predicate pushdown where supported.
GeoJsonFileSourceSpatialGeometryReads GeoJSON files for spatial processing (requires Spatial.plugin()).

Sinks

Sinks have no outputs; they write records to an external system.
OperatorTargetNotes
TextFileSinkText fileWrites each record as a line via a TransformationDescriptor<T, String>.
ObjectFileSinkBinary fileSerializes Java objects for later reading by ObjectFileSource.
TableSinkSQL tableInserts records into a database table via JDBC (Postgres / SQLite3).
ParquetSinkParquet fileWrites columnar Parquet output.
KafkaTopicSinkKafka topicProduces records to a Kafka topic. Supported on Java and Spark.
ApacheIcebergSinkIceberg tableWrites records to an Apache Iceberg table.
LocalCallbackSinkIn-process callbackCalls a Consumer<T> for every output record, e.g., to collect results into a local List.

ML Operators

Machine learning operators follow the same logical/execution split: you express intent (train a model, predict labels) and the optimizer selects the implementation (Spark MLlib, TensorFlow, or a Java-native implementation).

KMeansOperator

Clusters input feature vectors using iterative K-Means. Returns cluster centroids or assignments depending on variant.

DLTrainingOperator

Trains a deep learning model described by a DLModel. Backed by TensorFlow or other configured DL frameworks.

PredictOperator

Applies a trained model to feature vectors to produce predictions. Works with any model type that implements the model interface.

LogisticRegressionOperator

Trains a logistic regression classifier. Takes feature vectors and binary labels (0.0/1.0). Returns a LogisticRegressionModel.

LinearRegressionOperator

Trains a linear regression model for continuous prediction targets.

DecisionTreeClassificationOperator

Trains a decision tree for classification. Configurable max depth and minimum instances per node.

DecisionTreeRegressionOperator

Trains a decision tree for regression (continuous labels). Same parameters as the classification variant.

LinearSVCOperator

Trains a linear support vector classifier with an L2 regularization parameter.
ML operators require MachineLearning.plugin() and, for deep learning, wayang-tensorflow on the classpath. See the Plugins page for dependency details.

Chaining Operators with DataQuanta

The DataQuanta<T> class in wayang-api-scala-java is the fluent handle you use to chain operators without constructing operator objects manually. Every method returns a new DataQuanta representing the next output slot.
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;
import java.util.Arrays;
import java.util.Collection;

public class WordCount {
    public static void main(String[] args) {
        WayangContext wayang = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayang)
                .withJobName("WordCount")
                .withUdfJarOf(WordCount.class);

        // Build the plan and collect results
        Collection<Tuple2<String, Integer>> results = planBuilder
                // Source
                .readTextFile("file:///path/to/input.txt")          // TextFileSource
                // Transformations
                .flatMap(line -> Arrays.asList(line.split("\\W+"))) // FlatMapOperator
                .filter(word -> !word.isEmpty())                     // FilterOperator
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))    // MapOperator
                // Aggregation
                .reduceByKey(
                    Tuple2::getField0,
                    (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1())
                )                                                    // ReduceByOperator
                // Execute and collect
                .collect();                                          // triggers execution

        results.forEach(t -> System.out.println(t.getField0() + ": " + t.getField1()));
    }
}
DataQuanta methods like filter(), map(), and flatMap() accept plain Scala/Java lambdas. Under the hood they wrap them in the appropriate FunctionDescriptor (e.g., PredicateDescriptor for filter). You only need to construct descriptors directly if you want to attach a custom LoadProfileEstimator or SQL alternative.

Build docs developers (and LLMs) love