Wayang ships a rich library of built-in operators that cover the full spectrum of data processing tasks — from simple field projections to iterative machine-learning loops spanning multiple platforms. Every operator in this catalog is a logical operator: it describes what to compute, not how or where. The optimizer resolves each logical operator to a platform-specific execution operator at runtime, drawing on the mappings contributed by whatever plugins you have registered.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
Logical vs. Execution Operators
The distinction between logical and execution operators is fundamental to Wayang’s design:- Logical Operators
- Execution Operators
Logical operators live in Key properties of logical operators:
wayang-basic and wayang-core. They describe computation in abstract, platform-agnostic terms. You use them directly when building a WayangPlan or when calling methods on DataQuanta.- They hold a
FunctionDescriptor(UDF) or other parameters. - They provide a
CardinalityEstimatorso the optimizer can reason about data sizes. - They know nothing about JVM threads, Spark RDDs, or JDBC connections.
UDFs: FunctionDescriptor Hierarchy
User-defined functions (UDFs) are wrapped in typed descriptor objects rather than passed as raw lambdas. This lets the optimizer inspect metadata (load estimates, complexity class, SQL translations) without executing the function.| Descriptor | Used by | Java / Scala type |
|---|---|---|
TransformationDescriptor | MapOperator, MapPartitionsOperator | T → R |
PredicateDescriptor | FilterOperator | T → Boolean |
FlatMapDescriptor | FlatMapOperator | T → Iterable<R> |
ReduceDescriptor | ReduceOperator, ReduceByOperator, GlobalReduceOperator | (T, T) → T |
MapPartitionsDescriptor | MapPartitionsOperator | Iterable<T> → Iterable<R> |
FunctionDescriptor.SerializableFunction (or the predicate/binary-operator variants), which extends java.io.Serializable — a requirement for shipping UDFs to remote executors like Spark workers.
Transformations
These operators process elements one by one or partition by partition.map
Applies a
TransformationDescriptor to each element. Output cardinality equals input cardinality. The most common operator in any pipeline.flatMap
Applies a
FlatMapDescriptor that returns zero or more output elements per input element. Output cardinality is estimated via a configurable selectivity.filter
Applies a
PredicateDescriptor. Only elements for which the predicate returns true are forwarded. Also accepts an optional SQL WHERE clause for database pushdown.mapPartitions
Applies a
MapPartitionsDescriptor to an entire partition at once — useful when the UDF needs to look at multiple records together (e.g., mini-batch scoring).sort
Sorts the dataset by a key function. Platform executors use native sort implementations (e.g.,
RDD.sortBy on Spark).distinct
Removes duplicate elements. Uses hashing or sorting depending on the platform.
sample
Returns a fixed or fraction-based random sample. Supports Bernoulli, reservoir, and other sampling strategies.
zipWithId
Appends a unique 64-bit ID to every element, producing
Tuple2<Long, T>. IDs are not necessarily sequential but are globally unique within the dataset.Aggregations
reduceByKey
Groups by a key function and reduces values within each group using a binary operator. Equivalent to
groupByKey + reduce.reduce / globalReduce
ReduceOperator reduces a keyed group; GlobalReduceOperator reduces the entire dataset to a single element.groupBy / materializedGroupBy
GroupByOperator streams groups lazily; MaterializedGroupByOperator collects each group into a Collection before forwarding. Use the materialized variant when downstream UDFs need random access within a group.count
Counts the number of elements in the dataset. Returns a
DataQuanta<Long>.coGroup
Joins two datasets by key and forwards matched groups as
Tuple2<Iterable<A>, Iterable<B>>. Similar to SQL FULL OUTER JOIN at the group level.Binary and Set Operators
| Operator | Description |
|---|---|
JoinOperator | Inner equi-join on two datasets by key functions. |
CartesianOperator | Cross product of two datasets. Use with caution on large inputs. |
UnionAllOperator | Concatenates two datasets without deduplication. |
IntersectOperator | Returns elements present in both datasets. |
CoGroupOperator | Groups both inputs by the same key for custom join logic. |
Control Flow
Wayang supports iterative algorithms natively through loop operators. All three compile down to theLoopSubplan / LoopHeadOperator mechanism in the core.
- doWhile
- repeat
- loop
Runs the body subplan, evaluates a convergence predicate on a designated convergence output, and repeats until the predicate returns
false (or until a maximum iteration count is reached).Sources
Sources have no inputs; they produce records from an external system.- File / Object Sources
- Database / Table Sources
- Streaming / Cloud Sources
| Operator | Output type | Notes |
|---|---|---|
TextFileSource | String | Reads lines from any URL the configured filesystem supports (local, HDFS, S3). Samples the first MiB for cardinality estimation. |
ObjectFileSource | T | Reads serialized Java objects from a binary file. |
CollectionSource | T | Wraps an in-memory Collection. Useful for small reference data or tests. |
ParquetSource | Record | Reads columnar Parquet files; projection and predicate pushdown where supported. |
GeoJsonFileSource | SpatialGeometry | Reads GeoJSON files for spatial processing (requires Spatial.plugin()). |
Sinks
Sinks have no outputs; they write records to an external system.| Operator | Target | Notes |
|---|---|---|
TextFileSink | Text file | Writes each record as a line via a TransformationDescriptor<T, String>. |
ObjectFileSink | Binary file | Serializes Java objects for later reading by ObjectFileSource. |
TableSink | SQL table | Inserts records into a database table via JDBC (Postgres / SQLite3). |
ParquetSink | Parquet file | Writes columnar Parquet output. |
KafkaTopicSink | Kafka topic | Produces records to a Kafka topic. Supported on Java and Spark. |
ApacheIcebergSink | Iceberg table | Writes records to an Apache Iceberg table. |
LocalCallbackSink | In-process callback | Calls a Consumer<T> for every output record, e.g., to collect results into a local List. |
ML Operators
Machine learning operators follow the same logical/execution split: you express intent (train a model, predict labels) and the optimizer selects the implementation (Spark MLlib, TensorFlow, or a Java-native implementation).KMeansOperator
Clusters input feature vectors using iterative K-Means. Returns cluster centroids or assignments depending on variant.
DLTrainingOperator
Trains a deep learning model described by a
DLModel. Backed by TensorFlow or other configured DL frameworks.PredictOperator
Applies a trained model to feature vectors to produce predictions. Works with any model type that implements the model interface.
LogisticRegressionOperator
Trains a logistic regression classifier. Takes feature vectors and binary labels (0.0/1.0). Returns a
LogisticRegressionModel.LinearRegressionOperator
Trains a linear regression model for continuous prediction targets.
DecisionTreeClassificationOperator
Trains a decision tree for classification. Configurable max depth and minimum instances per node.
DecisionTreeRegressionOperator
Trains a decision tree for regression (continuous labels). Same parameters as the classification variant.
LinearSVCOperator
Trains a linear support vector classifier with an L2 regularization parameter.
ML operators require
MachineLearning.plugin() and, for deep learning, wayang-tensorflow on the classpath. See the Plugins page for dependency details.Chaining Operators with DataQuanta
TheDataQuanta<T> class in wayang-api-scala-java is the fluent handle you use to chain operators without constructing operator objects manually. Every method returns a new DataQuanta representing the next output slot.
