The goal of nanochat is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000.
Philosophy
Accessibility is about overall cost but also about cognitive complexity. nanochat is not an exhaustively configurable LLM “framework”:
❌ No giant configuration objects
❌ No model factories
❌ No if-then-else monsters in the codebase
✅ Single, cohesive, minimal codebase
✅ Readable and hackable
✅ Maximally forkable “strong baseline”
✅ Runs start to end to produce a ChatGPT model you can talk to
Current Focus
The most interesting area of contribution is speeding up the time to GPT-2 (achieving a CORE score above 0.256525). Currently this takes ~3 hours on an 8XH100 node, but we can improve it further by optimizing the pretraining stage. See the Time-to-GPT-2 Leaderboard for details on how to participate.
Contribution Guidelines
Code Quality
- Keep code minimal, readable, and hackable
- Avoid adding abstraction layers or configuration complexity
- Don’t significantly bloat the codebase
- Avoid esoteric or overly specialized optimizations
Principled Improvements
nanochat cares about training an entire miniseries of models, not just targeting a single model size. Your changes must:
✅ Generalize across different model depths (--depth parameter)
✅ Work for the full range of model sizes (not just d24 or d26)
✅ Maintain the “single dial of complexity” philosophy
The depth parameter automatically determines all other hyperparameters (width, heads, learning rate, training horizon, weight decay, etc.) so models come out compute-optimal. Users shouldn’t have to think about these details.
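To make the "single dial" idea concrete, here is a minimal Python sketch of deriving width and head count from depth alone. The function name `derive_config` and the constants `aspect_ratio=64` and `head_dim=128` are illustrative assumptions, not nanochat's actual rules; the real derivation (including learning rate, training horizon, and weight decay) lives in the nanochat codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedConfig:
    depth: int
    width: int
    num_heads: int


def derive_config(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> DerivedConfig:
    """Derive model shape from depth alone.

    Hypothetical illustration: aspect_ratio and head_dim are assumed
    values chosen for this sketch, not nanochat's real constants.
    """
    if depth <= 0:
        raise ValueError("depth must be a positive integer")
    raw_width = depth * aspect_ratio
    # Round up to a multiple of head_dim so the heads divide the width evenly.
    width = -(-raw_width // head_dim) * head_dim
    return DerivedConfig(depth=depth, width=width, num_heads=width // head_dim)


if __name__ == "__main__":
    for d in (12, 16, 20, 24):
        print(derive_config(d))
```

The point of the design is that a contribution touching any derived quantity should express it as a function of depth, rather than adding a new knob that users must set per model size.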
Submitting Changes
- Test across depths: Verify your change works for multiple --depth settings (e.g., d12, d16, d20, d24)
- Measure improvements: Show gains in:
  - Training time (wall clock)
  - Validation loss (val_bpb)
  - CORE metric
  - Efficiency (MFU, throughput)
- Document your approach: Explain the reasoning and any tradeoffs
- Create a PR: Include:
  - Clear description of the change
  - Performance improvements with evidence
  - Any AI-assisted code (see policy below)
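One way to assemble the per-depth evidence for a PR is sketched below. This is a hypothetical helper, not part of nanochat: `RunResult`, `compare`, and `generalizes` are names invented for this example, and the numbers in the usage section are placeholders rather than measured results. It assumes you have already collected wall-clock hours, val_bpb, and CORE for a baseline and a candidate run at each depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    depth: int
    train_hours: float  # wall-clock training time
    val_bpb: float      # validation bits per byte (lower is better)
    core: float         # CORE metric (higher is better)


def compare(baseline: list[RunResult], candidate: list[RunResult]) -> list[dict]:
    """Pair runs by depth and report speedup and metric deltas."""
    base_by_depth = {r.depth: r for r in baseline}
    rows = []
    for cand in sorted(candidate, key=lambda r: r.depth):
        base = base_by_depth[cand.depth]  # KeyError if a depth has no baseline
        rows.append({
            "depth": cand.depth,
            "speedup": base.train_hours / cand.train_hours,
            "val_bpb_delta": cand.val_bpb - base.val_bpb,
            "core_delta": cand.core - base.core,
        })
    return rows


def generalizes(rows: list[dict]) -> bool:
    """True only if no depth regresses on val_bpb or CORE."""
    return all(r["val_bpb_delta"] <= 0 and r["core_delta"] >= 0 for r in rows)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real measurements.
    baseline = [RunResult(12, 1.0, 1.00, 0.10), RunResult(16, 2.0, 0.95, 0.15)]
    candidate = [RunResult(12, 0.8, 0.99, 0.11), RunResult(16, 1.7, 0.94, 0.15)]
    rows = compare(baseline, candidate)
    for row in rows:
        print(row)
    print("generalizes across depths:", generalizes(rows))
```

A table built from rows like these, one per depth, makes it easy for reviewers to see whether a gain holds across the miniseries or only at one size.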
AI Contribution Policy
Disclosure required. When submitting a PR, please declare:
- Any parts with substantial LLM contribution
- Code you have not written personally
- Code you do not fully understand
Development Workflow
Quick Iteration
For rapid experimentation (~5 minutes per run), train a d12 model and watch:
- Validation loss curves
- Training throughput
- Final CORE score
Scaling Laws
For deeper analysis, run scaling law experiments.
Full Miniseries
Train the complete miniseries across all depths.
Monitoring
Watch these WandB metrics:
- Loss curves: val_bpb vs. step, total_training_time, total_training_flops
- Capability: core_metric (DCLM CORE score)
- Efficiency: train/mfu, train/tok_per_sec, VRAM usage
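For readers unfamiliar with these metrics, the sketch below shows the standard definitions of bits per byte and model FLOPs utilization (MFU). How nanochat computes them internally is not stated here, so treat the function names and signatures as assumptions; `flops_per_token` and `peak_flops_per_gpu` must come from your own model and hardware specifications.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats) into bits per byte of text.

    Normalizing by bytes rather than tokens keeps the metric comparable
    across tokenizers with different vocabularies.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_loss_nats / (math.log(2) * total_bytes)


def mfu(tokens_per_sec: float, flops_per_token: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved = tokens_per_sec * flops_per_token
    return achieved / (num_gpus * peak_flops_per_gpu)
```

An MFU of 0.5 means the run sustains half of the hardware's theoretical peak, which makes it a useful hardware-normalized complement to raw tokens per second.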
Areas to Contribute
Pretraining Optimization
- Training efficiency improvements
- Better hyperparameter scaling across depths
- Data loading and preprocessing speedups
- Mixed precision strategies
Model Architecture
- Architecture improvements that generalize
- Attention mechanisms
- Normalization strategies
- Initialization methods
Evaluation
- Additional task implementations
- Improved evaluation metrics
- Faster evaluation methods
Fine-tuning
- SFT improvements
- RL training enhancements
- New capabilities (see counting r in strawberry guide)
Documentation
- Tutorials and guides
- Example notebooks
- Architecture explanations
- Performance optimization tips
What NOT to Contribute
❌ Configuration complexity: Giant YAML configs, complex factories, excessive abstraction
❌ Single-model optimizations: Tweaks that only work for d24 or d26
❌ Framework bloat: Trying to make nanochat support every possible use case
❌ Breaking changes: Modifications that fundamentally alter the simplicity philosophy
Remember: nanochat is intentionally not a framework. It’s a strong baseline that should stay minimal and hackable.
Getting Help
- DeepWiki: Use DeepWiki to ask questions about the repo
- Discussions: GitHub Discussions for design questions and ideas
- Discord: #nanochat channel for real-time help
- Issues: GitHub Issues for bug reports
Community Resources
- Leaderboard - Time-to-GPT-2 competition
- Guides - Tutorials and writeups
- GitHub Discussions - Q&A and announcements
Recognition
Contributors who improve the leaderboard get:
- Credit in the leaderboard table
- Recognition in commit history
- Mention in related writeups and discussions
Acknowledgements
nanochat benefits from the broader community:
- Inspired by nanoGPT and modded-nanoGPT
- Built on datasets from HuggingFace
- Developed with compute from Lambda
- Guidance from Alec Radford
- Repo management by @svlandeg