The I/O packages provide transforms for reading from and writing to various data sources and sinks.
textio
Package textio provides transforms for reading and writing text files.
Read
Reads a set of files matching a glob pattern and returns lines as PCollection<string>.
- The scope to insert the transform into
- File path or glob pattern (e.g., “gs://bucket/*.txt”, “/path/to/*.log”)
- Optional configuration: ReadAutoCompression(), ReadGzip(), ReadUncompressed()
- Returns: PCollection<string> of lines (newlines removed)
ReadAll
Expands and reads filenames from an input PCollection<string> of globs.
- The scope to insert the transform into
- PCollection<string> of file paths or glob patterns
- Optional compression configuration
- Returns: PCollection<string> of all lines from all matched files
ReadWithFilename
Reads files and returns PCollection<KV<string,string>> of filename and line.
- Returns: PCollection<KV<string,string>> where key is filename and value is line content
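Because each element is a filename/line pair, downstream logic can take the key and the value as two separate arguments. The sketch below shows such a per-element function in plain Go, without importing Beam. `tagLine` and the sample paths are hypothetical, and wiring the function into a pipeline step is assumed rather than shown.

```go
package main

import (
	"fmt"
	"path"
)

// tagLine is a hypothetical per-element function. It receives the key
// (file name) and value (line) of one pair and returns a single string
// prefixed with the file's base name.
func tagLine(filename, line string) string {
	return path.Base(filename) + ": " + line
}

func main() {
	// Stand-in for two elements of the filename/line collection.
	pairs := [][2]string{
		{"/var/log/app/a.log", "started"},
		{"/var/log/app/b.log", "stopped"},
	}
	for _, p := range pairs {
		fmt.Println(tagLine(p[0], p[1]))
	}
}
```

The `path` package is used instead of `path/filepath` so that slash-separated URIs are handled the same way on every platform.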
Write
Writes a PCollection<string> to a file as separate lines.
- The scope to insert the transform into
- Output file path
- PCollection<string> to write (each element becomes a line)
Immediate
Reads a local file at pipeline construction time and embeds data into the pipeline.
- The scope to insert the transform into
- Local file path to read immediately
- Returns: PCollection<string> with file contents, or error if read fails
fileio
Package fileio provides lower-level transforms for matching and reading files with more control.
MatchFiles
Finds all files matching a glob pattern and returns PCollection<FileMetadata>.
- The scope to insert the transform into
- File path or glob pattern
- Options: MatchEmptyAllow(), MatchEmptyDisallow(), MatchEmptyAllowIfWildcard()
- Returns: PCollection<FileMetadata> with file path, size, and last modified time
MatchAll
Matches files from an input PCollection<string> of glob patterns.
- PCollection<string> of glob patterns
- Returns: PCollection<FileMetadata> of all matching files
MatchContinuously
Continuously watches for new files matching a pattern at regular intervals.
- The scope to insert the transform into
- File pattern to watch
- How often to check for new files
- Options: MatchStart(), MatchEnd(), MatchDuplicateSkip(), MatchDuplicateAllowIfModified(), MatchApplyWindow()
- Returns: Unbounded PCollection<FileMetadata> of matching files over time
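The names MatchDuplicateSkip() and MatchDuplicateAllowIfModified() suggest two deduplication policies: emit each path once, or emit it again when its modification time changes. The following plain-Go model illustrates that difference. It does not use Beam; `tracker`, `dedupMode`, and the sample paths are hypothetical, and the real options may differ in detail.

```go
package main

import "fmt"

// dedupMode selects how repeated sightings of a path are treated.
type dedupMode int

const (
	skipDuplicates  dedupMode = iota // emit each path at most once
	allowIfModified                  // emit again when the timestamp changes
)

// tracker remembers the last-modified time (Unix seconds) seen per path.
type tracker struct {
	mode dedupMode
	seen map[string]int64
}

func newTracker(mode dedupMode) *tracker {
	return &tracker{mode: mode, seen: make(map[string]int64)}
}

// shouldEmit reports whether a sighting of path with the given
// last-modified time should be passed downstream, and records it.
func (t *tracker) shouldEmit(path string, modified int64) bool {
	prev, ok := t.seen[path]
	if !ok {
		t.seen[path] = modified
		return true
	}
	if t.mode == allowIfModified && modified != prev {
		t.seen[path] = modified
		return true
	}
	return false
}

func main() {
	t := newTracker(allowIfModified)
	fmt.Println(t.shouldEmit("in/a.csv", 100)) // true: first sighting
	fmt.Println(t.shouldEmit("in/a.csv", 100)) // false: unchanged
	fmt.Println(t.shouldEmit("in/a.csv", 250)) // true: modified since last poll
}
```

With `skipDuplicates`, the third call would return false as well, because the path has already been seen.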
ReadMatches
Converts PCollection<FileMetadata> to PCollection<ReadableFile>.
- The scope to insert the transform into
- PCollection<FileMetadata> from MatchFiles/MatchAll
- Options: ReadAutoCompression(), ReadGzip(), ReadUncompressed(), ReadDirectorySkip()
- Returns: PCollection<ReadableFile> ready for reading
FileMetadata
Contains metadata about a matched file.
- Full path to the file
- File size in bytes
- When the file was last modified
ReadableFile
Wrapper around FileMetadata providing methods to read file contents.
Open
Opens the file for reading.
- Context for the file operation
- Returns: Reader for file contents (caller must close) or error
Read
Reads the entire file into memory.
- Context for the file operation
- Returns: Complete file contents or error
ReadString
Reads the entire file as a string.
- Returns: File contents as string or error
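Open hands back a reader that the caller must close, whereas Read and ReadString load the whole file into memory. The sketch below shows the usual open, defer-close, read-all pattern in plain Go. It does not use Beam: `opener`, `readAll`, and `fromString` are hypothetical stand-ins built on an in-memory reader.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// opener is a hypothetical stand-in for a method that returns a reader
// the caller is responsible for closing.
type opener func() (io.ReadCloser, error)

// readAll opens the source, reads everything, and always closes the reader.
func readAll(open opener) (string, error) {
	rc, err := open()
	if err != nil {
		return "", err
	}
	defer rc.Close()

	data, err := io.ReadAll(rc)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

// fromString builds an opener over in-memory text, for demonstration only.
func fromString(s string) opener {
	return func() (io.ReadCloser, error) {
		return io.NopCloser(strings.NewReader(s)), nil
	}
}

func main() {
	text, err := readAll(fromString("line one\nline two\n"))
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Print(text)
}
```

Reading everything at once is convenient for small files; for large ones, streaming from the opened reader avoids holding the full contents in memory.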
Other I/O Packages
The Go SDK includes I/O connectors for various data sources:
- avroio: Read and write Avro files.
- bigqueryio: Read from and write to Google BigQuery.
- parquetio: Read and write Parquet files.
- pubsubio: Read from and write to Google Cloud Pub/Sub.
- databaseio: Generic database I/O for SQL databases.
- mongodbio: Read from and write to MongoDB.
Complete Example
Best Practices
File Paths
- Use appropriate URI schemes: gs:// for GCS, s3:// for S3, local paths for local files
- Glob patterns support * and ** wildcards
- Always validate file paths before pipeline execution
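One way to validate paths before building a pipeline is to classify the URI scheme and check local glob syntax up front. The sketch below uses only the Go standard library. `schemeOf` and `checkLocalGlob` are hypothetical helpers, and `filepath.Match` is not Beam's matcher (it gives `**` no special meaning), so this only catches syntax errors such as an unclosed bracket.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// schemeOf returns "gs", "s3", or "local" for a path or glob pattern.
func schemeOf(p string) string {
	switch {
	case strings.HasPrefix(p, "gs://"):
		return "gs"
	case strings.HasPrefix(p, "s3://"):
		return "s3"
	default:
		return "local"
	}
}

// checkLocalGlob reports a syntax error in a local glob pattern.
// filepath.Match returns ErrBadPattern for malformed patterns; the name
// "x" is only a probe and its match result is ignored.
func checkLocalGlob(pattern string) error {
	if _, err := filepath.Match(pattern, "x"); err != nil {
		return fmt.Errorf("bad glob %q: %w", pattern, err)
	}
	return nil
}

func main() {
	for _, p := range []string{"gs://bucket/*.txt", "/path/to/*.log", "[abc"} {
		if schemeOf(p) != "local" {
			fmt.Println(p, "-> remote, skipping local syntax check")
			continue
		}
		fmt.Println(p, "->", checkLocalGlob(p))
	}
}
```

Remote patterns are skipped here because checking them would require the corresponding storage client, which is outside the scope of this sketch.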
Compression
- Use ReadAutoCompression() for mixed compression types
- Explicit compression options are faster (no detection overhead)
- Gzip is automatically detected from the .gz extension
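The extension rule above can be modelled with a few lines of standard-library Go: wrap the reader in a gzip decoder when the name ends in .gz, and pass it through otherwise. This is an illustrative sketch, not Beam's implementation; `isGzipPath`, `openAuto`, `roundTrip`, and the file names are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// isGzipPath reports whether the file name ends in ".gz".
func isGzipPath(path string) bool {
	return strings.HasSuffix(path, ".gz")
}

// openAuto wraps r in a gzip reader when the path looks gzipped,
// and returns r unchanged otherwise.
func openAuto(path string, r io.Reader) (io.Reader, error) {
	if !isGzipPath(path) {
		return r, nil
	}
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	return zr, nil
}

// roundTrip gzips text in memory, then reads it back through openAuto
// under the given file name.
func roundTrip(path, text string) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(text)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	r, err := openAuto(path, &buf)
	if err != nil {
		return "", err
	}
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	text, err := roundTrip("events.log.gz", "hello\n")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Print(text)
}
```

Choosing the decoder from the name alone is cheap, but it relies on files being named consistently; a gzipped file without the .gz suffix would be read as raw bytes in this model.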
Performance
- textio.Read uses splittable DoFn for parallel processing
- Large files are automatically split into chunks
- Use fileio for custom file processing logic
- Avoid Immediate() for large files (embeds in pipeline)
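To make the idea of splitting a large file into chunks concrete, the sketch below divides a file of a given size into consecutive byte ranges that could be processed independently. It is a simplified illustration, not how Beam's splittable DoFn is implemented: `splitRange` is hypothetical, and a real line reader must also handle lines that straddle a range boundary.

```go
package main

import "fmt"

// splitRange divides [0, size) into consecutive half-open byte ranges
// of at most chunk bytes each. It returns nil for non-positive inputs.
func splitRange(size, chunk int64) [][2]int64 {
	if size <= 0 || chunk <= 0 {
		return nil
	}
	var out [][2]int64
	for start := int64(0); start < size; start += chunk {
		end := start + chunk
		if end > size {
			end = size
		}
		out = append(out, [2]int64{start, end})
	}
	return out
}

func main() {
	// A 10-byte file split into chunks of at most 4 bytes.
	fmt.Println(splitRange(10, 4)) // prints [[0 4] [4 8] [8 10]]
}
```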
Streaming
- Use MatchContinuously for streaming file ingestion
- Configure deduplication based on your use case
- Consider windowing strategies for time-based processing