Meganeura’s data loading primitives keep all data in CPU memory as flat f32 slices and yield mini-batches on demand. The DataLoader handles shuffling, batching, and epoch reset. MnistDataset parses the MNIST IDX format from disk.
DataLoader
DataLoader iterates over an in-memory dataset in mini-batches:
```rust
pub struct DataLoader {
    // flat f32 data: n * sample_size elements
    // flat f32 labels: n * label_size elements
    batch_size: usize,
    // ...
}
```
Data is stored as row-major flat arrays. Each call to next_batch() gathers batch_size samples into contiguous scratch buffers according to the current permutation.
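A minimal sketch of this gather step, assuming a permutation-indexed copy into a reused scratch buffer (illustrative only, not Meganeura's actual internals):

```rust
// Illustrative permutation-driven gather: copy `batch_size` rows, chosen by
// the current shuffled order, into one contiguous scratch buffer.
fn gather_batch(
    data: &[f32],           // n * sample_size flat row-major samples
    perm: &[usize],         // current shuffled sample order
    pos: usize,             // index of the first sample of this batch
    sample_size: usize,
    batch_size: usize,
    scratch: &mut Vec<f32>, // reused across calls; overwritten by each gather
) {
    scratch.clear();
    for &i in &perm[pos..pos + batch_size] {
        scratch.extend_from_slice(&data[i * sample_size..(i + 1) * sample_size]);
    }
}

fn main() {
    // 4 samples of 2 floats each
    let data = vec![0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1];
    let perm = vec![2, 0, 3, 1];
    let mut scratch = Vec::new();
    gather_batch(&data, &perm, 0, 2, 2, &mut scratch);
    // The first batch holds samples 2 and 0, back to back
    assert_eq!(scratch, vec![2.0, 2.1, 0.0, 0.1]);
}
```

Because the scratch buffer is reused, each gather overwrites the previous batch — which is why `Batch` (described below) is only valid until the next call.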
Creating a loader from raw tensors
```rust
use meganeura::DataLoader;

let n = 3200;
let input_dim = 784;
let classes = 10;
let batch = 32;

// n * input_dim elements
let images: Vec<f32> = (0..n * input_dim)
    .map(|i| ((i % 256) as f32) / 255.0)
    .collect();

// n * classes elements, one-hot encoded
let mut labels = vec![0.0_f32; n * classes];
for b in 0..n {
    labels[b * classes + (b % classes)] = 1.0;
}

let mut loader = DataLoader::new(images, labels, input_dim, classes, batch);
```
DataLoader::new accepts:
- `data` — all samples concatenated; length must equal `n * sample_size`
- `labels` — all labels concatenated; length must equal `n * label_size`
- `sample_size` — number of floats per sample
- `label_size` — number of floats per label
- `batch_size` — samples per mini-batch
The dataset must contain at least batch_size samples. DataLoader::new panics if n < batch_size.
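These invariants reduce to simple arithmetic; here is a hedged sketch of the kind of validation such a constructor performs (illustrative, not the actual source):

```rust
// Illustrative validation for a DataLoader-style constructor: derive n from
// the data length and check every invariant stated above.
fn validate(
    data_len: usize,
    labels_len: usize,
    sample_size: usize,
    label_size: usize,
    batch_size: usize,
) -> usize {
    assert_eq!(data_len % sample_size, 0, "data length must be n * sample_size");
    let n = data_len / sample_size;
    assert_eq!(labels_len, n * label_size, "labels length must be n * label_size");
    assert!(n >= batch_size, "dataset must hold at least one full batch");
    n
}

fn main() {
    // Matches the example above: 3200 MNIST-sized samples, 10 classes
    let n = validate(3200 * 784, 3200 * 10, 784, 10, 32);
    assert_eq!(n, 3200);
}
```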
Iterating batches
Call next_batch() in a loop. It returns None when fewer than batch_size samples remain (the partial last batch is dropped):
```rust
while let Some(batch) = loader.next_batch() {
    session.set_input("x", batch.data);
    session.set_input("labels", batch.labels);
    session.step();
    session.wait();
}
```
Batch borrows the loader's internal scratch buffers; the borrow is valid only until the next call to next_batch, shuffle, or reset:
```rust
pub struct Batch<'a> {
    pub data: &'a [f32],   // batch_size * sample_size elements
    pub labels: &'a [f32], // batch_size * label_size elements
}
```
Shuffling and resetting
Shuffle the sample order before each epoch. Pass the epoch index as a seed so every epoch uses a different permutation:
```rust
loader.shuffle(epoch as u64);
loader.reset();
```
shuffle uses Fisher-Yates with a lightweight LCG PRNG. reset moves the position cursor back to the start without changing the order.
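The combination can be sketched in a few lines. The LCG constants below are illustrative (taken from common 64-bit parameters); Meganeura's own PRNG may use different ones:

```rust
// Fisher-Yates shuffle driven by a small linear congruential generator.
// Constants are the common MMIX parameters; the library's PRNG may differ.
fn lcg_next(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

fn shuffle_perm(perm: &mut [usize], seed: u64) {
    let mut state = seed.wrapping_add(0x9E3779B97F4A7C15); // avoid a zero state
    // Walk from the end, swapping each slot with a uniformly chosen earlier one
    for i in (1..perm.len()).rev() {
        let j = (lcg_next(&mut state) % (i as u64 + 1)) as usize;
        perm.swap(i, j);
    }
}

fn main() {
    let mut a: Vec<usize> = (0..8).collect();
    let mut b: Vec<usize> = (0..8).collect();
    shuffle_perm(&mut a, 1);
    shuffle_perm(&mut b, 1);
    assert_eq!(a, b); // same seed, same permutation (reproducible epochs)
    let mut sorted = a.clone();
    sorted.sort();
    assert_eq!(sorted, (0..8).collect::<Vec<usize>>()); // still a permutation
}
```

Seeding with the epoch index keeps training runs reproducible while still giving each epoch a distinct sample order.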
Trainer::train_epoch calls loader.shuffle(epoch as u64) and loader.reset() automatically. You only need to call these explicitly when managing the loop yourself.
Utility methods
| Method | Returns | Description |
|---|---|---|
| `loader.len()` | `usize` | Total number of samples |
| `loader.num_batches()` | `usize` | Number of complete batches per epoch |
| `loader.is_empty()` | `bool` | Whether the dataset has zero samples |
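These values relate by integer arithmetic: num_batches is len / batch_size, and the remainder len % batch_size is the partial batch that next_batch drops. A quick illustration:

```rust
fn main() {
    // Values mirroring the table above, for a 100-sample dataset
    let len = 100;       // loader.len()
    let batch_size = 32;
    let num_batches = len / batch_size; // loader.num_batches()
    let dropped = len % batch_size;     // samples skipped each epoch
    assert_eq!(num_batches, 3);
    assert_eq!(dropped, 4);
}
```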
MnistDataset
MnistDataset loads the MNIST handwritten digit dataset from the standard IDX binary format. Images are flattened to [N, 784] and normalized to [0, 1]. Labels are one-hot encoded to [N, 10].
```rust
pub struct MnistDataset {
    pub images: Vec<f32>, // shape [n, 784], values in [0, 1]
    pub labels: Vec<f32>, // shape [n, 10], one-hot
    pub n: usize,
}
```
Loading from disk
```rust
use meganeura::MnistDataset;
use std::path::Path;

// Gzip-compressed files (the common download format)
let mnist = MnistDataset::load_gz(
    Path::new("data/train-images-idx3-ubyte.gz"),
    Path::new("data/train-labels-idx1-ubyte.gz"),
)?;

// Or raw uncompressed IDX files
let mnist = MnistDataset::load(
    Path::new("data/train-images-idx3-ubyte"),
    Path::new("data/train-labels-idx1-ubyte"),
)?;

println!("loaded {} images", mnist.n);
```
Both methods return io::Result<MnistDataset>. Download the standard MNIST files from yann.lecun.com/exdb/mnist.
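For reference, the IDX image header is four big-endian u32 values (magic 2051, count, rows, cols) followed by one byte per pixel; label files use magic 2049. A minimal parsing sketch — not the library's actual loader, which also handles labels, gzip, and length validation:

```rust
// Minimal IDX image-file parsing sketch. Header: magic, count, rows, cols,
// all big-endian u32, then one u8 per pixel. Real loaders should also
// bounds-check the payload length.
fn parse_idx_images(bytes: &[u8]) -> Result<(usize, Vec<f32>), String> {
    let be = |i: usize| u32::from_be_bytes(bytes[i..i + 4].try_into().unwrap());
    if be(0) != 2051 {
        return Err("bad magic for IDX image file".into());
    }
    let (n, rows, cols) = (be(4) as usize, be(8) as usize, be(12) as usize);
    let pixels = &bytes[16..16 + n * rows * cols];
    // Normalize to [0, 1], matching MnistDataset's convention
    Ok((n, pixels.iter().map(|&p| p as f32 / 255.0).collect()))
}

fn main() {
    // Build a tiny in-memory IDX file: 2 "images" of 2x2 pixels
    let mut buf = Vec::new();
    for v in [2051u32, 2, 2, 2] {
        buf.extend_from_slice(&v.to_be_bytes());
    }
    buf.extend_from_slice(&[0, 128, 255, 64, 255, 0, 32, 16]);
    let (n, images) = parse_idx_images(&buf).unwrap();
    assert_eq!(n, 2);
    assert_eq!(images.len(), 8);
    assert!((images[2] - 1.0).abs() < 1e-6); // pixel 255 -> 1.0
}
```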
Converting to a DataLoader
Call .loader(batch_size) to consume the dataset and produce a DataLoader:
```rust
let mut loader = mnist.loader(32); // batch_size = 32
```
This is equivalent to calling DataLoader::new(mnist.images, mnist.labels, 784, 10, batch_size).
Custom datasets
You can wire up any data source by loading it into flat Vec<f32> arrays and passing them to DataLoader::new. The following example from examples/mnist.rs shows how to create a synthetic dataset when real data is unavailable:
```rust
fn synthetic_loader(n: usize, input_dim: usize, classes: usize, batch: usize) -> DataLoader {
    let images: Vec<f32> = (0..n * input_dim)
        .map(|i| ((i % 256) as f32) / 255.0)
        .collect();
    let mut labels = vec![0.0_f32; n * classes];
    for b in 0..n {
        labels[b * classes + (b % classes)] = 1.0;
    }
    DataLoader::new(images, labels, input_dim, classes, batch)
}
```
For multi-modal or image datasets:

1. **Preprocess all samples offline.** Load images, tokenize text, or run feature extraction. Store the result as a flat Vec<f32> where each sample occupies sample_size contiguous elements.
2. **Encode labels as floats.** One-hot encode classification labels, or use raw regression targets. Each label must occupy exactly label_size contiguous floats.
3. **Create the loader.** Pass both flat arrays to DataLoader::new with the correct sample_size, label_size, and batch_size.
4. **Wire input names to your graph.** Set TrainConfig::data_input and TrainConfig::label_input to match the names you used in g.input(...). The default names are "x" and "labels".
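As a hedged sketch of the flat-array preparation for a regression dataset (label_size = 1), showing only steps 1–3; the resulting arrays would go to DataLoader::new exactly as in the synthetic example above:

```rust
// Preparing a tiny regression dataset as flat arrays (illustrative).
// Each sample: 3 features; each label: 1 regression target.
fn main() {
    let (n, sample_size, label_size) = (5, 3, 1);
    let mut data = Vec::with_capacity(n * sample_size);
    let mut labels = Vec::with_capacity(n * label_size);
    for i in 0..n {
        let x = i as f32;
        data.extend_from_slice(&[x, x * 0.5, x * x]); // 3 contiguous floats
        labels.push(2.0 * x + 1.0);                   // target y = 2x + 1
    }
    // Invariants DataLoader::new expects:
    assert_eq!(data.len(), n * sample_size);
    assert_eq!(labels.len(), n * label_size);
    // DataLoader::new(data, labels, sample_size, label_size, batch_size)
}
```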
Complete data loading example
The following pattern from examples/mnist.rs handles both real and synthetic data gracefully:
```rust
fn load_mnist_or_synthetic(batch: usize, input_dim: usize, classes: usize) -> DataLoader {
    let data_dir = Path::new("data");
    let gz_images = data_dir.join("train-images-idx3-ubyte.gz");
    let gz_labels = data_dir.join("train-labels-idx1-ubyte.gz");
    let raw_images = data_dir.join("train-images-idx3-ubyte");
    let raw_labels = data_dir.join("train-labels-idx1-ubyte");

    if gz_images.exists() && gz_labels.exists() {
        let mnist = MnistDataset::load_gz(&gz_images, &gz_labels)
            .expect("failed to parse MNIST gz files");
        return mnist.loader(batch);
    }
    if raw_images.exists() && raw_labels.exists() {
        let mnist = MnistDataset::load(&raw_images, &raw_labels)
            .expect("failed to parse MNIST files");
        return mnist.loader(batch);
    }

    // Fall back to deterministic synthetic data
    let n = 3200;
    let images: Vec<f32> = (0..n * input_dim)
        .map(|i| ((i % 256) as f32) / 255.0)
        .collect();
    let mut labels = vec![0.0_f32; n * classes];
    for b in 0..n {
        labels[b * classes + (b % classes)] = 1.0;
    }
    DataLoader::new(images, labels, input_dim, classes, batch)
}
```