
Overview

For extremely large datasets, Lodum supports O(1) memory streaming serialization. This allows you to encode massive object graphs directly to an IO stream (like a file or socket) without building the entire representation in memory.
Streaming serialization is essential when:
  • Working with datasets that don’t fit in memory
  • Sending data over networks incrementally
  • Processing large files without loading them entirely

Streaming Dump (Serialization)

Use json.dump() to write JSON directly to a stream:

Writing to Files

from lodum import lodum, json
from pathlib import Path

@lodum
class Point:
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y

@lodum
class LargeDataset:
    def __init__(self, points: list[Point]):
        self.points = points

# Create a large dataset
points = [Point(i, i * 2) for i in range(1_000_000)]
dataset = LargeDataset(points=points)

# Stream to file without loading entire JSON in memory
output_path = Path("large_dataset.json")
json.dump(dataset, output_path)

Writing to File Objects

import sys

@lodum
class Data:
    def __init__(self, items: list[int]):
        self.items = items

data = Data(items=list(range(1_000_000)))

# Stream to stdout
json.dump(data, sys.stdout)

# Stream to a file handle
with open("output.json", "w") as f:
    json.dump(data, f)

String vs Streaming Modes

Lodum automatically chooses the right mode based on whether you provide a target:
# String mode (IR-based, entire result in memory)
json_string = json.dumps(data)

# Streaming mode (O(1) memory, writes directly to stream)
json.dump(data, output_file)
json.dump() returns None when writing to a stream; when no target is provided, it returns the serialized string instead.

Streaming Load (Deserialization)

Use json.stream() for lazy, iterator-based deserialization of JSON arrays:

Processing Large Arrays

from lodum import lodum, json
from pathlib import Path

@lodum
class Record:
    def __init__(self, id: int, value: str):
        self.id = id
        self.value = value

# Assuming records.json contains: [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}, ...]
for record in json.stream(Record, Path("records.json")):
    process_record(record)  # Process one at a time
    # Only one Record object is in memory at a time

Streaming from File Objects

# Stream from a file handle (must be opened in binary mode)
with open("large_array.json", "rb") as f:
    for item in json.stream(Record, f):
        print(f"Processing record {item.id}")
The stream() function requires the ijson package for incremental JSON parsing:
pip install lodum[ijson]
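To see what incremental array parsing looks like under the hood, here is a rough, dependency-free sketch of the same idea using only the standard library's json.JSONDecoder.raw_decode to pull one array element at a time from a buffer. This is an illustration of the technique, not Lodum's or ijson's actual implementation (ijson additionally parses incrementally from a byte stream rather than a complete string):

```python
import json

def iter_json_array(text: str):
    """Yield items from a JSON array one at a time (simplified sketch)."""
    decoder = json.JSONDecoder()
    idx = text.index("[") + 1             # skip the opening bracket
    while True:
        # Skip whitespace and commas between elements
        while idx < len(text) and text[idx] in " \t\n\r,":
            idx += 1
        if idx >= len(text) or text[idx] == "]":
            return                         # end of array
        item, idx = decoder.raw_decode(text, idx)
        yield item

records = '[{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]'
for rec in iter_json_array(records):
    print(rec["id"])  # → 1, then 2
```

Because raw_decode reports where each value ends, only one decoded item exists at a time; ijson takes this further by never requiring the full input text in memory either.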

Complete Example

Here’s a complete example showing both streaming serialization and deserialization:
from lodum import lodum, json
from pathlib import Path
import io

@lodum
class DataPoint:
    def __init__(self, timestamp: int, value: float):
        self.timestamp = timestamp
        self.value = value

@lodum
class TimeSeries:
    def __init__(self, name: str, data: list[DataPoint]):
        self.name = name
        self.data = data

# Generate large dataset
data_points = [
    DataPoint(timestamp=i, value=i * 0.5)
    for i in range(1_000_000)
]
series = TimeSeries(name="sensor_readings", data=data_points)

# Stream to file (O(1) memory)
output = Path("sensor_data.json")
json.dump(series, output)

print(f"Serialized {len(data_points)} points to {output}")

# Load back from file
loaded_series = json.load(TimeSeries, output)
print(f"Loaded series: {loaded_series.name}")
print(f"First point: {loaded_series.data[0].timestamp}")

Streaming vs Non-Streaming Comparison

from lodum import json

# Non-streaming: entire JSON string built in memory, then written out
json_string = json.dumps(large_object)
with open("output.json", "w") as f:
    f.write(json_string)

# Streaming: written incrementally with O(1) memory
with open("output.json", "w") as f:
    json.dump(large_object, f)

Format Support

Streaming serialization is currently available for:
  • JSON: full streaming support via json.dump() and json.stream()
  • Other formats: check the format-specific documentation for streaming capabilities

Memory Efficiency

The streaming dumper writes JSON tokens directly to the stream as objects are traversed:
# Internal implementation (simplified)
class JsonStreamingDumper:
    def begin_struct(self, cls):
        self.write_raw("{")
        self._first_item = True

    def field(self, name, value, handler, depth, seen):
        if not self._first_item:
            self.write_raw(",")
        self._first_item = False
        self.write_raw(json.dumps(name))   # field name as a JSON string
        self.write_raw(":")
        handler(value, self, depth, seen)  # Recursively stream the value

    def end_struct(self):
        self.write_raw("}")
This architecture ensures that:
  • Only the current object being serialized is in memory
  • No intermediate IR (intermediate representation) is built
  • Large collections are streamed item-by-item
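The token-by-token approach described above can be sketched in plain Python: a writer that emits a JSON array directly to a file-like object, encoding one element at a time so the full string is never held in memory. This is a minimal, stdlib-only illustration, not Lodum's actual dumper:

```python
import io
import json

def dump_list_streaming(items, fp):
    """Write a JSON array to fp one element at a time (O(1) extra memory)."""
    fp.write("[")
    first = True
    for item in items:
        if not first:
            fp.write(",")
        first = False
        fp.write(json.dumps(item))  # only one item is encoded at a time
    fp.write("]")

buf = io.StringIO()
dump_list_streaming(({"i": i} for i in range(3)), buf)  # works on generators too
print(buf.getvalue())  # → [{"i": 0},{"i": 1},{"i": 2}]
```

Note that the input can be a generator: since elements are consumed and written one by one, neither the source collection nor the output string needs to be materialized in full.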

Reading Large Files

When loading large JSON files without streaming, use the max_size parameter to control memory limits:
from lodum import json

# Limit input size to 50MB
data = json.load(MyClass, "large_file.json", max_size=50 * 1024 * 1024)
The default max_size is 10MB. Adjust this based on your memory constraints.
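The guard that max_size provides can be approximated by hand with a size check before loading. The helper below is a hypothetical, stdlib-only sketch of that idea, not Lodum's implementation:

```python
import json
import os
import tempfile

def load_with_limit(path, max_size=10 * 1024 * 1024):
    """Refuse to load files larger than max_size bytes (illustrative sketch)."""
    size = os.path.getsize(path)
    if size > max_size:
        raise ValueError(f"{path} is {size} bytes, exceeds limit of {max_size}")
    with open(path) as f:
        return json.load(f)

# Demonstrate with a small temporary file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"ok": true}')
    tmp_path = tmp.name

print(load_with_limit(tmp_path))  # loads fine under the default limit
os.remove(tmp_path)
```

Checking the on-disk size before parsing rejects oversized inputs cheaply, before any memory is spent on decoding.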

Best Practices

If your dataset exceeds available memory or you’re processing millions of records, always use streaming:
# Good: O(1) memory
json.dump(huge_dataset, output_file)

# Bad: Entire dataset in memory
json_str = json.dumps(huge_dataset)
The json.stream() function requires binary mode because it uses ijson internally:
# Correct
with open("data.json", "rb") as f:
    for item in json.stream(Record, f):
        process(item)

# Incorrect (will fail)
with open("data.json", "r") as f:  # Text mode
    for item in json.stream(Record, f):
        process(item)
Don’t accumulate streamed items back into a list:
# Good: Process one at a time
for record in json.stream(Record, file):
    process_and_discard(record)

# Bad: Defeats the purpose of streaming
all_records = list(json.stream(Record, file))

Performance Considerations

Streaming serialization provides:
  • Constant memory usage regardless of dataset size
  • Lower peak memory compared to building entire JSON string
  • Faster time-to-first-byte when sending over networks
However, streaming may be slightly slower than non-streaming for small datasets due to the overhead of writing individual tokens.

Next Steps

  • Schema Generation: generate JSON schemas from your data models
  • Extensions: learn about numpy, pandas, and polars support
