Overview
The nanochat dataloader provides efficient, distributed data loading for pretraining and fine-tuning, with automatic tokenization and sequence packing.
tokenizing_distributed_data_loader_bos_bestfit
Creates a distributed data loader with BOS-aligned best-fit packing.
Parameters
List of paths to compressed data shard files (.zst format)
Number of sequences per batch (per device)
Maximum sequence length (context window)
Tokenizer instance with an encode() method
Total number of distributed processes
Rank of current process (0 to world_size-1)
Returns
Generator yielding batches with:
- input_ids: Token IDs of shape (batch_size, sequence_len)
- attention_mask: Optional attention mask (for packed sequences)
tokenizing_distributed_data_loader_with_state_bos_bestfit
Stateful version that supports checkpointing and resuming.
Mutable dictionary to track loader state:
- shard_idx: Current shard index
- byte_offset: Byte offset within current shard
- tokens_consumed: Total tokens processed
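As a rough illustration of how such a state dictionary could be checkpointed, the sketch below saves it to JSON and restores it. The helper names `save_loader_state` and `load_loader_state` and the file name are hypothetical and are not part of nanochat; only the three keys come from the description above.

```python
import json
import os
import tempfile


def save_loader_state(state, path):
    """Write the loader state dict to disk as JSON (hypothetical helper)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    # Rename into place so a crash mid-write cannot leave a truncated file.
    os.replace(tmp_path, path)


def load_loader_state(path):
    """Read a previously saved loader state dict (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Example values only; the keys match the state fields listed above.
state = {"shard_idx": 3, "byte_offset": 1_048_576, "tokens_consumed": 52_428_800}
ckpt_path = os.path.join(tempfile.mkdtemp(), "loader_state.json")

save_loader_state(state, ckpt_path)
restored = load_loader_state(ckpt_path)
```

On resume, the restored dictionary would be handed back to the stateful loader so that it can continue from the recorded shard and offset.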
BOS-Aligned Best-Fit Packing
The dataloader uses a sophisticated packing algorithm optimized for LLM training.
BOS Alignment
- Every sequence starts with a Beginning of Sequence (BOS) token
- Multiple documents can be packed into a single sequence
- Each document boundary is marked with BOS
- This allows the model to learn document boundaries naturally
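The layout described above can be sketched in a few lines. This is a toy illustration, assuming a hypothetical BOS token ID of 0 and documents that are already tokenized; the function names are made up for the example and do not come from nanochat.

```python
BOS_ID = 0  # hypothetical BOS token ID, chosen only for this example


def pack_with_bos(docs, bos_id=BOS_ID):
    """Concatenate tokenized documents into one row, each prefixed with BOS."""
    row = []
    for doc in docs:
        row.append(bos_id)  # mark the start of every document
        row.extend(doc)
    return row


def document_starts(row, bos_id=BOS_ID):
    """Return the positions in the row where a new document begins."""
    return [i for i, tok in enumerate(row) if tok == bos_id]


docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
row = pack_with_bos(docs)
starts = document_starts(row)
```

Because the row begins with BOS and every later boundary is also a BOS token, document boundaries can be recovered from the token stream alone.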
Best-Fit Packing
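The sketch below shows a generic best-fit-decreasing heuristic on document lengths, as one way to picture the idea. It is an assumption for illustration, not necessarily the exact procedure nanochat implements; `best_fit_pack` and `utilization` are hypothetical names.

```python
def best_fit_pack(doc_lengths, sequence_len):
    """Assign documents to rows, placing each one in the tightest row that fits."""
    rows = []  # document lengths placed in each row
    free = []  # remaining capacity of each row
    for length in sorted(doc_lengths, reverse=True):
        if length > sequence_len:
            raise ValueError("document longer than sequence_len")
        best = None
        for i, space in enumerate(free):
            # Best fit: among rows with enough room, prefer the least free space.
            if length <= space and (best is None or space < free[best]):
                best = i
        if best is None:
            rows.append([length])
            free.append(sequence_len - length)
        else:
            rows[best].append(length)
            free[best] -= length
    return rows


def utilization(rows, sequence_len):
    """Fraction of token slots that hold real tokens."""
    used = sum(sum(r) for r in rows)
    return used / (len(rows) * sequence_len)


lengths = [6, 5, 4, 3, 2, 2, 1, 1]  # toy document lengths, BOS included
rows = best_fit_pack(lengths, sequence_len=8)
packed_util = utilization(rows, 8)
naive_util = utilization([[n] for n in lengths], 8)  # one document per row
```

In this toy input the packed rows are completely full, while one document per row leaves most slots as padding. Real corpora will not pack this cleanly.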
Efficiency
Packing achieves ~99% token utilization compared to ~50-60% for naive batching.
Distributed Loading
Each rank loads a different subset of data:
- No data duplication across ranks
- Deterministic training (same data order for same seed)
- Balanced workload (each rank processes similar amount of data)
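One common way to get these properties is a strided assignment of shards to ranks, sketched below. This is an assumption for illustration: the actual loader may partition at a different granularity, and `shards_for_rank` and the file names are hypothetical.

```python
def shards_for_rank(shard_paths, rank, world_size):
    """Return the shards a given rank reads, using a strided split."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Sorting first makes the assignment identical on every process.
    return sorted(shard_paths)[rank::world_size]


paths = [f"shard_{i:05d}.zst" for i in range(10)]  # made-up file names
assignment = {r: shards_for_rank(paths, r, world_size=4) for r in range(4)}
```

Each shard lands on exactly one rank, the mapping depends only on the sorted paths, and shard counts differ by at most one between ranks.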
Data Shard Format
Data shards are Zstandard-compressed text files:
- Raw text documents separated by newlines
- ~250M characters per shard (~100MB compressed)
- UTF-8 encoding
Creating Shards
See dev/repackage_data_reference.py for shard generation.
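For intuition only, the sketch below groups documents into newline-separated shards under a character budget. It is not the referenced script; `group_into_shards` is a hypothetical helper, the budget is tiny so the example is easy to follow, and Zstandard compression of each shard is left out.

```python
def group_into_shards(documents, chars_per_shard):
    """Group documents into newline-separated shards under a character budget."""
    shards, current, current_chars = [], [], 0
    for doc in documents:
        # Keep one document per line by removing embedded newlines.
        doc = doc.replace("\n", " ")
        if current and current_chars + len(doc) > chars_per_shard:
            shards.append("\n".join(current))
            current, current_chars = [], 0
        current.append(doc)
        current_chars += len(doc)
    if current:
        shards.append("\n".join(current))
    return shards


docs = ["alpha beta", "gamma\ndelta", "epsilon", "zeta eta theta"]
shards = group_into_shards(docs, chars_per_shard=20)
```

Each resulting string would then be encoded as UTF-8 and compressed before being written out as one shard file.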
Usage Example
See Also
- Pretraining - Using the dataloader for base model training
- Tokenizer - BPE tokenizer used by the dataloader