The Java Streams platform is Wayang’s built-in local execution engine. It runs entirely inside the JVM that starts your application, usesDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
java.util.stream under the hood, and requires no external cluster, daemon, or additional installation. Because of this it starts in milliseconds, making it the right choice for development, unit tests, exploratory data work, and any job where the dataset fits comfortably in local memory. When Wayang’s cost-based optimizer is given both a local engine and a distributed engine, it almost always prefers Java Streams for small data because the startup cost of a cluster engine exceeds the savings from parallelism.
Maven dependency
Add thewayang-java artifact to your project. You also need wayang-core, wayang-basic, and the API module — wayang-java is the platform-specific piece on top of those:
WAYANG_VERSION with the latest release on Maven Central.
Entry-point class: Java
The Java class in org.apache.wayang.java is the registry for all Java-platform components. It exposes three static factory methods:
Java.channelConversionPlugin() is not needed when you register only the Java platform. It becomes relevant when you register both Java and another platform (such as Spark) and the optimizer needs to move data between the two engines within a single job.Supported operators
Java.basicPlugin() provides implementations for all standard Wayang logical operators:
- Sources:
TextFileSource,ObjectFileSource,CollectionSource,ParquetSource,KafkaTopicSource, cloud storage sources (S3, GCS, Azure Blob), Apache Iceberg - Transformations:
MapOperator,MapPartitionsOperator,FlatMapOperator,FilterOperator,SortOperator,DistinctOperator,ZipWithIdOperator - Aggregations:
ReduceByOperator,GlobalReduceOperator,MaterializedGroupByOperator,GlobalMaterializedGroupOperator,CountOperator - Multi-input:
JoinOperator,CartesianOperator,CoGroupOperator,UnionAllOperator,IntersectOperator - Control flow:
LoopOperator,DoWhileOperator,RepeatOperator,SampleOperator - Sinks:
TextFileSink,ObjectFileSink,LocalCallbackSink,KafkaTopicSink,ParquetSink,TableSink, Apache Iceberg sink
Java.graphPlugin() adds PageRankOperator on top of those.
When the optimizer picks Java Streams
Wayang’s optimizer assigns a cost to each candidate execution plan. Cost is a function of CPU, memory, disk I/O, and network I/O predicted for each operator. Java Streams costs are dominated by CPU time on a single core. Distributed engines like Spark carry a fixed initialization cost (wayang.spark.init.ms = 4500 milliseconds by default in wayang-spark-defaults.properties) plus serialization and network overhead.
For small datasets the fixed overhead of a cluster engine makes its total cost higher than single-core Java execution, so the optimizer picks Java Streams. The crossover point depends on your hardware and configuration, but a rough heuristic is: if your dataset fits in a few hundred megabytes and your pipeline is not graph-heavy, expect the optimizer to stay on Java Streams.
If you want to guarantee Java Streams execution without registering any other platform, just don’t register any other platform:
Configuration defaults
The Java platform reads its defaults fromwayang-java-defaults.properties, bundled inside wayang-java. The key properties:
Configuration object before passing it to WayangContext:
Complete local word-count example
The following example runs entirely on the Java Streams platform. No cluster, no environment variables, no additional setup required beyond the Maven dependencies.Using Java Streams alongside other platforms
Registering both Java Streams and a distributed engine is the recommended production pattern. The optimizer picks the cheapest engine per operator:When mixing platforms, Wayang inserts conversion operators automatically wherever data must cross a platform boundary. Each conversion carries a measured cost, so the optimizer accounts for the transfer overhead when choosing the plan.
