The fastest way to get started with nanoGPT is to train a character-level model on the works of Shakespeare. This small-scale training run completes in about 3 minutes on a GPU.
Prepare the dataset
First, download and tokenize the Shakespeare dataset:
python data/shakespeare_char/prepare.py
This creates train.bin and val.bin files containing the character-level tokenized text.
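To make "character-level tokenized" concrete, here is a minimal sketch of the idea: each distinct character gets an integer ID, and the encoded text is split into a training part and a validation part. The function names and the 90/10 split are assumptions for illustration; the actual prepare.py may differ in detail.

```python
def build_vocab(text):
    """Map each distinct character to an integer ID (sorted for determinism)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of integer IDs, one per character."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of integer IDs back into a string."""
    return "".join(itos[i] for i in ids)


def split_ids(ids, train_fraction=0.9):
    """Split encoded IDs into train and validation portions (assumed 90/10)."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    text = "To be, or not to be"
    stoi, itos = build_vocab(text)
    ids = encode(text, stoi)
    train_ids, val_ids = split_ids(ids)
    print(len(stoi), len(train_ids), len(val_ids))
    print(decode(ids, itos))
```

Because the vocabulary is just the set of characters in the corpus, it stays very small, which is part of why this run is so quick.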
Training configurations
GPU training
Train a baby GPT using the default configuration:
python train.py config/train_shakespeare_char.py
Model architecture
The configuration in config/train_shakespeare_char.py defines a small Transformer:
# Model parameters
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
# Training parameters
batch_size = 64
block_size = 256 # context of up to 256 previous characters
learning_rate = 1e-3
max_iters = 5000
This configuration trains a 6-layer Transformer with 6 attention heads and 384 feature channels.
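To get a feel for how small this model is, the parameter count can be estimated from n_layer and n_embd. This is a rough sketch under stated assumptions, not the repository's own counting code: it assumes a standard GPT block (attention projections plus an MLP with 4x expansion), no bias terms, tied input/output embeddings, position embeddings left out of the count, and a hypothetical vocabulary of 65 characters.

```python
def head_dim(n_embd, n_head):
    """Channels per attention head; n_embd must split evenly across heads."""
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    return n_embd // n_head


def estimate_params(n_layer, n_embd, vocab_size):
    """Rough parameter count for a GPT-style model (see assumptions above)."""
    attn = 4 * n_embd * n_embd   # query, key, value, and output projections
    mlp = 8 * n_embd * n_embd    # two linear layers with a 4x hidden size
    norms = 2 * n_embd           # two LayerNorm weight vectors per block
    per_block = attn + mlp + norms
    final_norm = n_embd
    token_embedding = vocab_size * n_embd  # assumed shared with the output layer
    return n_layer * per_block + final_norm + token_embedding


if __name__ == "__main__":
    # vocab_size=65 is an assumption, not a value taken from this page
    total = estimate_params(n_layer=6, n_embd=384, vocab_size=65)
    print(f"{total / 1e6:.2f}M parameters, {head_dim(384, 6)} channels per head")
```

Almost all of the parameters sit in the Transformer blocks; with a character-level vocabulary the embedding table is a small fraction of the total.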
Expected results
Training time: ~3 minutes on an A100 GPU
Best validation loss: 1.4697
Output directory: out-shakespeare-char
CPU training
If you don’t have a GPU, you can still train on CPU with reduced parameters:
python train.py config/train_shakespeare_char.py \
--device=cpu \
--compile=False \
--eval_iters=20 \
--log_interval=1 \
--block_size=64 \
--batch_size=12 \
--n_layer=4 \
--n_head=4 \
--n_embd=128 \
--max_iters=2000 \
--lr_decay_iters=2000 \
--dropout=0.0
Parameter adjustments
Set --device=cpu and turn off PyTorch 2.0 compilation with --compile=False
Reduce the context size to 64 characters (--block_size=64)
Use a smaller batch size of 12 (--batch_size=12)
Use a smaller model: 4 layers, 4 heads, and an embedding size of 128 (--n_layer=4 --n_head=4 --n_embd=128)
Train for fewer iterations: 2000 (--max_iters=2000 --lr_decay_iters=2000)
Disable dropout, since the network is small (--dropout=0.0)
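The effect of these adjustments is easiest to see as tokens processed per iteration. A minimal sketch, assuming a single device and that each iteration consumes batch_size sequences of block_size characters (the function names are hypothetical):

```python
def tokens_per_iter(batch_size, block_size, gradient_accumulation_steps=1):
    """Characters consumed by one optimizer step on a single device."""
    return batch_size * block_size * gradient_accumulation_steps


def total_tokens(batch_size, block_size, max_iters):
    """Characters consumed over a whole training run."""
    return tokens_per_iter(batch_size, block_size) * max_iters


gpu_per_iter = tokens_per_iter(batch_size=64, block_size=256)  # 16384
cpu_per_iter = tokens_per_iter(batch_size=12, block_size=64)   # 768

if __name__ == "__main__":
    print("GPU run:", gpu_per_iter, "tokens/iter,", total_tokens(64, 256, 5000), "total")
    print("CPU run:", cpu_per_iter, "tokens/iter,", total_tokens(12, 64, 2000), "total")
```

The CPU settings see far less data per step and overall, which is consistent with the higher validation loss reported for that run.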
Expected results
Training time: ~3 minutes on CPU
Validation loss: ~1.88 (higher than the GPU run)
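The two validation losses (1.4697 on GPU, ~1.88 on CPU) are easier to compare as per-character perplexities. A small sketch, assuming the reported loss is a mean cross-entropy in nats and using a hypothetical vocabulary of 65 characters for the uniform-guess baseline:

```python
import math


def perplexity(loss_nats):
    """Effective number of equally likely next characters implied by the loss."""
    return math.exp(loss_nats)


def uniform_loss(vocab_size):
    """Loss of a model that guesses every character with equal probability."""
    return math.log(vocab_size)


if __name__ == "__main__":
    for name, loss in [("GPU run", 1.4697), ("CPU run", 1.88)]:
        print(f"{name}: loss {loss} -> perplexity {perplexity(loss):.2f}")
    # vocab_size=65 is an assumption for illustration
    print(f"uniform baseline over 65 characters: loss {uniform_loss(65):.2f}")
```

Both runs land well below the uniform baseline, and the gap between them reflects the smaller model and shorter context of the CPU settings.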
Apple Silicon
On Apple Silicon MacBooks with a recent PyTorch version, use Metal Performance Shaders (MPS):
python train.py config/train_shakespeare_char.py --device=mps
The --device=mps flag uses the on-chip GPU and can significantly accelerate training (2-3x speedup).
Configuration parameters
Key parameters from config/train_shakespeare_char.py:
| Parameter | Value | Description |
| --- | --- | --- |
| out_dir | 'out-shakespeare-char' | Checkpoint directory |
| eval_interval | 250 | Steps between evaluations |
| eval_iters | 200 | Batches to use for evaluation |
| gradient_accumulation_steps | 1 | No gradient accumulation |
| batch_size | 64 | Batch size per iteration |
| block_size | 256 | Context length in characters |
| learning_rate | 1e-3 | Higher LR for baby networks |
| max_iters | 5000 | Total training iterations |
| warmup_iters | 100 | Linear warmup steps |
| beta2 | 0.99 | Adam beta2 (higher due to small batch) |
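The learning_rate, warmup_iters, and lr_decay_iters settings together describe a learning-rate schedule. The sketch below shows one common shape for it: linear warmup followed by cosine decay to a floor. The cosine shape, the min_lr value of 1e-4, and the lr_decay_iters default of 5000 are assumptions that are not stated on this page, so treat this as an illustration and not as the exact schedule in train.py.

```python
import math


def get_lr(it, learning_rate=1e-3, warmup_iters=100,
           lr_decay_iters=5000, min_lr=1e-4):
    """Learning rate at iteration `it` (warmup, then assumed cosine decay)."""
    if it < warmup_iters:
        # linear ramp from 0 up to the peak learning rate
        return learning_rate * it / warmup_iters
    if it >= lr_decay_iters:
        # after the decay window, stay at the floor
        return min_lr
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)


if __name__ == "__main__":
    for step in (0, 50, 100, 2550, 5000):
        print(step, f"{get_lr(step):.6f}")
```

Under this shape, raising max_iters without also raising lr_decay_iters would leave the extra iterations running at the floor rate, which is why the two are usually changed together.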
Sample from the model
After training completes, generate text samples:
python sample.py --out_dir=out-shakespeare-char
Example output
After 3 minutes of training on GPU:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.
DUKE VINCENTIO:
I thank your eyes against it.
DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?
Character-level models produce lower-quality text than BPE-tokenized models. For better results, consider finetuning a pretrained GPT-2 model on this dataset.
Advanced configuration
Adjust model size
Modify n_layer, n_head, and n_embd in the config file or via command line. n_embd must be divisible by n_head, so change the two together:
python train.py config/train_shakespeare_char.py --n_layer=8 --n_head=8 --n_embd=512
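Before launching a run with a new model size, it can help to check that the embedding width splits evenly across the attention heads. A minimal sketch, assuming each head receives n_embd / n_head channels; check_model_shape is a hypothetical helper, not part of nanoGPT:

```python
def check_model_shape(n_layer, n_head, n_embd):
    """Validate a model size and return the number of channels per head."""
    for name, value in (("n_layer", n_layer), ("n_head", n_head), ("n_embd", n_embd)):
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    if n_embd % n_head != 0:
        raise ValueError(
            f"n_embd={n_embd} is not divisible by n_head={n_head}"
        )
    return n_embd // n_head


if __name__ == "__main__":
    print(check_model_shape(n_layer=6, n_head=6, n_embd=384))  # default config
    print(check_model_shape(n_layer=8, n_head=8, n_embd=512))  # larger model
    try:
        check_model_shape(n_layer=8, n_head=6, n_embd=512)
    except ValueError as err:
        print("invalid:", err)
```

Keeping the per-head width the same as in the default configuration (384 / 6) is one simple way to pick a matching n_head when n_embd grows.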
Extend training
Increase max_iters and lr_decay_iters for longer training:
python train.py config/train_shakespeare_char.py --max_iters=10000 --lr_decay_iters=10000
Enable logging
Track training progress with Weights & Biases:
python train.py config/train_shakespeare_char.py --wandb_log=True
Next steps
Reproduce GPT-2: Train a 124M-parameter model on OpenWebText
Finetuning: Finetune pretrained GPT-2 models on custom data