The DataGen connector provides aDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/apache/flink/llms.txt
Use this file to discover all available pages before exploring further.
Source implementation that generates input data for Flink pipelines. It is useful when developing locally or running demos without access to external systems such as Kafka. The connector is built-in and requires no additional dependencies.
How it works
DataGeneratorSource produces N records in parallel. The source divides the full record count into sub-sequences—one per parallel subtask—and maps each sequence index to a record using a user-supplied GeneratorFunction<Long, T>.
The output order depends on parallelism. With parallelism 1, records are produced in index order. With higher parallelism, each subtask produces its sub-range in order but the overall interleaving is not deterministic.
Basic usage
The following example produces["Number: 0", "Number: 1", ..., "Number: 999"]:
Rate limiting
DataGeneratorSource has built-in support for rate limiting. The rate applies across all parallel subtasks combined.
RateLimiterStrategy implementations:
| Strategy | Description |
|---|---|
RateLimiterStrategy.perSecond(rate) | Limits throughput to the specified number of records per second across all subtasks. |
RateLimiterStrategy.perCheckpoint(recordsPerCheckpoint) | Emits at most the specified number of records between consecutive checkpoints. |
Boundedness
DataGeneratorSource is always bounded. Setting numberOfRecords to Long.MAX_VALUE makes it effectively unbounded in practice—the source will never reach the end.
For finite sequences, run the job in BATCH execution mode for better performance:
Watermarks
You can produce deterministic watermarks directly from generated events by providing a customWatermarkStrategy:
Exactly-once guarantees
DataGeneratorSource supports at-least-once and end-to-end exactly-once processing under one condition: the GeneratorFunction must be deterministic—the same Long index must always produce the same output record. When this holds, Flink can replay the same indices after a recovery and produce identical records.
If your
GeneratorFunction uses ThreadLocalRandom, System.currentTimeMillis(), or any other non-deterministic call, the source cannot provide exactly-once guarantees because re-executed indices will produce different records.Configuration summary
| Parameter | Type | Description |
|---|---|---|
generatorFunction | GeneratorFunction<Long, T> | Maps a sequence index to a generated record. |
count | long | Total number of records to generate. Use Long.MAX_VALUE for an unbounded stream. |
rateLimiterStrategy | RateLimiterStrategy | Optional. Controls the maximum emission rate. |
outputTypeInfo | TypeInformation<T> | Type information for the generated records. |

