Wayang’s Spark backend can run entire pipelines on SparkDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/wayang/llms.txt
Use this file to discover all available pages before exploring further.
Dataset[Row] — also known as DataFrames. Use this mode when you ingest data from lakehouse formats such as Parquet or Delta, interoperate with Spark ML stages, or prefer schema-aware processing over raw RDDs. This guide explains how to opt in, how mixing Dataset and RDD operators works, and what to watch for in practice.
When to use Dataset channels
Lakehouse storage
Reading Parquet or Delta directly into datasets avoids repeated schema inference and keeps Spark’s optimized Parquet reader in play.
Spark ML
ML operators already convert RDDs into DataFrames internally. Feeding them a dataset channel skips that conversion and preserves column names.
Federated pipelines
Mix dataset-backed Spark stages with other platforms. Wayang inserts conversions only where strictly necessary.
How preferDataset works
The ParquetSource logical operator exposes a preferDatasetOutput(boolean) builder method. When set to true, the Spark execution operator (SparkParquetSource) will advertise DatasetChannel.UNCACHED_DESCRIPTOR as its preferred output channel and hand a Dataset<Row> directly to downstream operators instead of converting to an RDD.
ParquetSink accepts a preferDataset flag in its constructor:
Full example pipeline
Mixing with RDD operators
If a downstream operator only supportsRddChannel, Wayang inserts a conversion operator automatically. Two conversion operators are available:
SparkRddToDatasetOperator
Converts a JavaRDD<Record> into a Dataset<Row>. The conversion uses DatasetConverters.recordsToDataset(...), which infers the schema from the RecordType attached to the data type, or falls back to sampled schema inference.
SparkDatasetToRddOperator
Converts a Dataset<Row> back into a JavaRDD<Record>.
RecordType for schema preservation
RecordType carries field names so that SparkRddToDatasetOperator can produce a precise schema without sampling the RDD. Provide field names when constructing your sources:
RecordType is present, the RDD-to-Dataset conversion maps each field by name rather than by column position, which avoids silent schema mismatches when parquet files evolve.
Developer checklist
Use RecordType whenever possible
Attach field names to your logical operator output types. This lets the RDD↔Dataset converters produce a precise schema and avoids sampling overhead.
Reuse sparkExecutor.ss
When writing custom Spark execution operators that build DataFrames, always obtain the
SparkSession from sparkExecutor.ss rather than calling SparkSession.builder(). Creating extra contexts causes failures unless spark.driver.allowMultipleContexts is true.Inspect plan explanations
Enable
wayang.core.explain.enabled = true and review the generated plan files to verify that dataset-to-RDD conversions are not inserted unexpectedly. Each conversion adds serialization overhead.Advertise DatasetChannel in custom operators
If you write a custom execution operator that consumes or produces a
Dataset<Row>, declare DatasetChannel.UNCACHED_DESCRIPTOR (and optionally DatasetChannel.CACHED_DESCRIPTOR) in getSupportedInputChannels / getSupportedOutputChannels. Otherwise the optimizer cannot keep data in dataset form through your operator.Current limitations
- Parquet only. Only
ParquetSourceandParquetSinkexpose thepreferDatasetAPI today. Text sources (TextFileSource) and object sources still produceRddChanneloutput. - ML4All pipelines. ML4All stages currently emit plain
double[]/DoubleRDDs. They benefit from the internal DataFrame conversions within ML operators but do not exposeDatasetChanneloutputs yet. - No dataset-aware
map/filter. The standardMapOperatorandFilterOperatorSpark implementations useRddChannel. A dataset-aware variant would allow schema-preserving transforms without leaving Catalyst’s purview. - No Delta Lake source. There is no built-in Delta source yet; Delta tables can be read as Parquet using
SparkParquetSourcebut without Delta transaction log awareness.
Spark channel type summary
| Channel | Class | Carries | Notes |
|---|---|---|---|
| Uncached RDD | RddChannel.UNCACHED_DESCRIPTOR | JavaRDD<T> | Default Spark channel |
| Cached RDD | RddChannel.CACHED_DESCRIPTOR | Persisted JavaRDD<T> | Used when data is reused |
| Uncached Dataset | DatasetChannel.UNCACHED_DESCRIPTOR | Dataset<Row> | Preferred for Parquet/lakehouse |
| Cached Dataset | DatasetChannel.CACHED_DESCRIPTOR | Persisted Dataset<Row> | For multi-consumer dataset plans |
| Broadcast | BroadcastChannel.DESCRIPTOR | Broadcast variable | For small side-inputs |
