nanochat includes a comprehensive task system for evaluating and fine-tuning language models. Tasks provide datasets of conversations along with evaluation criteria.
Task Base Classes
Task
The base class for all tasks provides a lightweight slicing interface over datasets.
from tasks.common import Task
Properties:
eval_type: Returns either 'categorical' for multiple choice tasks or 'generative' for open-ended tasks
start, stop, step: Allow logical slicing over the dataset
Methods:
num_examples(): Returns total number of examples in the dataset
get_example(index): Returns a conversation dict with messages array
evaluate(conversation, assistant_response): Returns evaluation score (typically 0 or 1)
__len__(): Returns the effective length considering slicing parameters
__getitem__(index): Array-style access to conversations
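The interface above can be sketched with a small stand-in class. This is an illustration under assumptions, not the code in tasks/common.py: the names `SlicedTask` and `ToyTask` are hypothetical, and the index arithmetic is one plausible way to implement `start`, `stop`, and `step`.

```python
class SlicedTask:
    """Hypothetical stand-in for a Task-like base class (not nanochat's code)."""

    def __init__(self, start=0, stop=None, step=1):
        assert start >= 0 and step >= 1
        assert stop is None or stop >= start
        self.start, self.stop, self.step = start, stop, step

    def num_examples(self):
        raise NotImplementedError

    def get_example(self, index):
        raise NotImplementedError

    def __len__(self):
        total = self.num_examples()
        stop = total if self.stop is None else min(self.stop, total)
        span = max(stop - self.start, 0)
        return (span + self.step - 1) // self.step  # ceiling division

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Translate the logical index into a physical dataset index
        return self.get_example(self.start + index * self.step)


class ToyTask(SlicedTask):
    """Ten tiny conversations, used only to exercise the slicing logic."""

    @property
    def eval_type(self):
        return "generative"

    def num_examples(self):
        return 10

    def get_example(self, index):
        return {"messages": [
            {"role": "user", "content": f"question {index}"},
            {"role": "assistant", "content": f"answer {index}"},
        ]}

    def evaluate(self, conversation, assistant_response):
        reference = conversation["messages"][-1]["content"]
        return int(assistant_response.strip() == reference)


task = ToyTask(start=2, stop=9, step=3)  # logical view over physical indices 2, 5, 8
first_questions = [task[i]["messages"][0]["content"] for i in range(len(task))]
```

In this sketch `len(task)` reports the sliced length rather than the full dataset size, which is the distinction between `__len__()` and `num_examples()` described above.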
TaskMixture
Combines multiple tasks with deterministic shuffling for SFT training.
from tasks.common import TaskMixture
mixed = TaskMixture([task1, task2, task3])
Examples from all tasks are shuffled together with a fixed seed (42), so each task's examples are spread throughout training and the order is identical on every run. To oversample a task, include it multiple times in the list.
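One way to picture the deterministic shuffle is as an index map built once up front. The sketch below is an assumption about the general technique, not the actual TaskMixture implementation; `build_mixture_index` is a hypothetical helper name.

```python
import random


def build_mixture_index(task_lengths, seed=42):
    """Return a shuffled list of (task_idx, local_idx) pairs covering every example."""
    index_map = [
        (task_idx, local_idx)
        for task_idx, length in enumerate(task_lengths)
        for local_idx in range(length)
    ]
    # A private RNG keeps the shuffle reproducible without touching global random state
    random.Random(seed).shuffle(index_map)
    return index_map


# Three toy tasks with 3, 2, and 2 examples; passing a task twice would oversample it
index_map = build_mixture_index([3, 2, 2])
```

Because the seed is fixed, calling the helper again with the same lengths yields the same order.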
TaskSequence
Sequentially concatenates tasks for curriculum-based training.
from tasks.common import TaskSequence
sequence = TaskSequence([task1, task2, task3])
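Sequential concatenation amounts to mapping a global index onto the task that contains it. The following is a minimal sketch of that lookup, assuming each task exposes its length; `locate_in_sequence` is a hypothetical name and not part of tasks.common.

```python
def locate_in_sequence(task_lengths, index):
    """Map a global index to (task_idx, local_idx) when tasks are laid end to end."""
    if index < 0:
        raise IndexError(index)
    for task_idx, length in enumerate(task_lengths):
        if index < length:
            return task_idx, index
        index -= length  # skip past this task and keep looking
    raise IndexError("index is past the end of the sequence")


lengths = [3, 2, 4]
order = [locate_in_sequence(lengths, i) for i in range(sum(lengths))]
```

Unlike the shuffled mixture, every example of the first task is visited before any example of the second, which is what makes the ordering usable as a curriculum.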
Evaluation Tasks
ARC
Multiple choice science questions from Allen AI.
from tasks.arc import ARC
# Easy subset
task = ARC(subset="ARC-Easy", split="validation")
# Challenge subset
task = ARC(subset="ARC-Challenge", split="test")
Parameters:
subset: "ARC-Easy" or "ARC-Challenge"
split: "train", "validation", or "test"
Eval type: categorical
Dataset: allenai/ai2_arc
MMLU
Massive Multitask Language Understanding - multiple choice questions across 57 subjects.
from tasks.mmlu import MMLU
# All subjects
task = MMLU(subset="all", split="validation")
# Auxiliary training data
task = MMLU(subset="auxiliary_train", split="train")
Parameters:
subset: "all" or "auxiliary_train"
split: "train", "validation", "dev", or "test"
Eval type: categorical
Subjects: 57 topics including abstract_algebra, anatomy, astronomy, computer_science, mathematics, physics, and more
Dataset: cais/mmlu
GSM8K
Roughly 8,500 grade school math problems (the "8K" in the name) with step-by-step solutions that use calculator tool calls.
from tasks.gsm8k import GSM8K
task = GSM8K(subset="main", split="train")
Parameters:
subset: "main" or "socratic"
split: "train" or "test"
Eval type: generative
Format: Solutions use <<expression=result>> syntax for calculator tool calls. Final answers are marked with #### number.
Example:
Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Answer: Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
#### 10
Dataset: openai/gsm8k
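The solution format can be parsed with two regular expressions. This is a sketch under assumptions: the patterns and the helper names `extract_tool_calls` and `extract_final_answer` are illustrative and may differ from what tasks/gsm8k.py actually does.

```python
import re

# <<expression=result>> calculator annotations
TOOL_CALL_RE = re.compile(r"<<([^=<>]+)=([^<>]+)>>")
# "#### number" final-answer marker (digits, optional sign, commas, decimal point)
FINAL_ANSWER_RE = re.compile(r"#### (-?[0-9.,]+)")


def extract_tool_calls(solution):
    """Return a list of (expression, result) string pairs."""
    return TOOL_CALL_RE.findall(solution)


def extract_final_answer(solution):
    """Return the final answer as a string without thousands separators, or None."""
    match = FINAL_ANSWER_RE.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


solution = (
    "Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.\n"
    "Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.\n"
    "#### 10"
)
calls = extract_tool_calls(solution)
answer = extract_final_answer(solution)
```

A generative evaluation could then compare the extracted answer from the model's response with the one extracted from the reference solution.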
HumanEval
Python coding benchmark (the name is a misnomer - it has nothing to do with humans).
from tasks.humaneval import HumanEval
task = HumanEval()
Eval type: generative
Format: Each example contains a function signature with docstring (prompt), the canonical solution, and test cases. Evaluation executes the generated code against test cases.
Dataset: openai/openai_humaneval
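Execution-based scoring can be sketched as follows. The toy problem is made up for illustration and is not taken from the dataset, `passes_tests` is a hypothetical helper, and the bare `exec` call is only acceptable for trusted code; a real harness should run model output in a sandbox with timeouts.

```python
def passes_tests(prompt, completion, test, entry_point):
    """Run prompt + completion, then the dataset-style check(candidate) tests."""
    program = f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"
    namespace = {}
    try:
        exec(program, namespace)  # UNSAFE for untrusted model output; sandbox in real use
    except Exception:
        return False  # syntax errors, runtime errors, and failed assertions all count as failure
    return True


# Toy problem in the same shape as a dataset row (prompt, test, entry_point)
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
test = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

good = passes_tests(prompt, "    return a + b\n", test, "add")
bad = passes_tests(prompt, "    return a - b\n", test, "add")
```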
Fine-tuning Tasks
SmolTalk
General conversational dataset from HuggingFace.
from tasks.smoltalk import SmolTalk
task = SmolTalk(split="train") # 460K conversations
task = SmolTalk(split="test") # 24K conversations
Parameters:
split: "train" or "test"
Format: Multi-turn conversations with optional system message. Conversations alternate between user and assistant roles.
Dataset: HuggingFaceTB/smol-smoltalk
SpellingBee
Teaches models to spell words and count letter occurrences.
from tasks.spellingbee import SpellingBee
task = SpellingBee(size=1000, split="train")
Parameters:
size: Number of examples to generate
split: "train" or "test"
Eval type: generative
Purpose: Smaller models struggle with character-level understanding since they work with tokens. This task helps by:
- Practicing word spelling (mapping tokens to character sequences)
- Counting letter occurrences using both manual and Python verification
Example question variations:
- “How many r are in strawberry?”
- “Count the number of e in the word hello”
- Includes Spanish, Chinese, Korean, French, German, and Japanese variations
Response format: The assistant manually spells out the word, counts occurrences step by step, then verifies the count with Python.
Final answer uses GSM8K-style #### 3 format.
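To make the spell, count, verify flow concrete, here is a sketch that builds a response of that general shape. The exact wording of the real SpellingBee responses is not reproduced here; the line format and the helper name `count_letter_response` are assumptions for illustration.

```python
def count_letter_response(word, letter):
    """Build a spell-then-count response ending in a '#### <count>' line."""
    lines = [f"Spelling out '{word}': {','.join(word)}"]
    running = 0
    for position, char in enumerate(word, start=1):
        if char == letter:
            running += 1
            lines.append(f"{position}:{char} matches, count={running}")
        else:
            lines.append(f"{position}:{char}")
    # Cross-check the manual count against Python's built-in str.count
    assert running == word.count(letter)
    lines.append(f"#### {running}")
    return "\n".join(lines)


response = count_letter_response("strawberry", "r")
final_line = response.splitlines()[-1]
```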
SimpleSpelling
Condensed version focusing only on spelling practice.
from tasks.spellingbee import SimpleSpelling
task = SimpleSpelling(size=1000, split="train")
Format: User asks “Spell the word: example” and assistant responds “example:e,x,a,m,p,l,e”
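The format above is simple enough to reproduce in a few lines. This sketch assumes the conversation dict shape with a messages array described for Task; `simple_spelling_pair` is a hypothetical helper, not a function from tasks.spellingbee.

```python
def simple_spelling_pair(word):
    """Return a two-message conversation in the 'word:l,e,t,t,e,r,s' format."""
    return {"messages": [
        {"role": "user", "content": f"Spell the word: {word}"},
        {"role": "assistant", "content": f"{word}:{','.join(word)}"},
    ]}


conversation = simple_spelling_pair("example")
```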
CustomJSON
Load custom conversations from JSONL files.
from tasks.customjson import CustomJSON
task = CustomJSON(filepath="data/conversations.jsonl")
File format: Each line is a JSON array of message objects:
[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello"}]
[{"role":"user","content":"Another conversation"},{"role":"assistant","content":"Yes"}]
Requirements:
- At least 2 messages per conversation
- Messages must alternate: user, assistant, user, assistant…
- Each message needs role and content fields
- Content must be a string
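Before training on a custom file, it can help to check each line against these requirements. The validator below is a sketch of the rules as listed, not CustomJSON's own checks; `validate_conversation` and `is_valid` are hypothetical names.

```python
import json


def validate_conversation(line):
    """Parse one JSONL line and raise ValueError if it breaks a listed requirement."""
    messages = json.loads(line)
    if not isinstance(messages, list) or len(messages) < 2:
        raise ValueError("expected a JSON array with at least 2 messages")
    for i, message in enumerate(messages):
        if not isinstance(message, dict) or "role" not in message or "content" not in message:
            raise ValueError(f"message {i} needs role and content fields")
        expected_role = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError(f"message {i} should have role {expected_role!r}")
        if not isinstance(message["content"], str):
            raise ValueError(f"message {i} content must be a string")
    return messages


def is_valid(line):
    """Convenience wrapper that returns a boolean instead of raising."""
    try:
        validate_conversation(line)
    except ValueError:  # also covers json.JSONDecodeError
        return False
    return True


valid = is_valid('[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello"}]')
too_short = is_valid('[{"role":"user","content":"Hi"}]')
```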
Helper Functions
render_mc
Standard format for multiple choice questions:
from tasks.common import render_mc
question = "What is the capital of France?"
letters = ("A", "B", "C", "D")
choices = ["London", "Paris", "Berlin", "Madrid"]
user_message = render_mc(question, letters, choices)
Important design decisions:
- Letter comes AFTER the choice for better token binding in smaller models
- No whitespace before the letter (“=A” not “= A”) to match tokenization of assistant responses
Output format:
Multiple Choice question: What is the capital of France?
- London=A
- Paris=B
- Berlin=C
- Madrid=D
Respond only with the letter of the correct answer.
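The output shown above can be reproduced with a short re-implementation. This is a sketch inferred from the displayed format only; the real render_mc in tasks.common may differ in whitespace or validation, and `render_mc_sketch` is a hypothetical name.

```python
def render_mc_sketch(question, letters, choices):
    """Render a multiple choice prompt with the letter placed after each choice."""
    assert len(letters) == len(choices), "need exactly one letter per choice"
    lines = [f"Multiple Choice question: {question}"]
    # Choice first, then '=' and the letter with no space in between
    lines += [f"- {choice}={letter}" for letter, choice in zip(letters, choices)]
    lines.append("Respond only with the letter of the correct answer.")
    return "\n".join(lines)


rendered = render_mc_sketch(
    "What is the capital of France?",
    ("A", "B", "C", "D"),
    ["London", "Paris", "Berlin", "Madrid"],
)
```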
Usage Examples
Training with task mixtures
from tasks.smoltalk import SmolTalk
from tasks.spellingbee import SpellingBee
from tasks.common import TaskMixture
# Oversample SpellingBee by including it twice
task = TaskMixture([
    SmolTalk(split="train"),
    SpellingBee(size=5000, split="train"),
    SpellingBee(size=5000, split="train"),
])
Slicing datasets
from tasks.mmlu import MMLU
# First 100 examples
task = MMLU(subset="all", split="validation", start=0, stop=100)
# Every 10th example
task = MMLU(subset="all", split="validation", step=10)
Custom evaluation
from tasks.gsm8k import GSM8K
task = GSM8K(subset="main", split="test")
for i in range(10):
    conversation = task[i]
    # Generate a response with your model (`model` is a placeholder for your own inference code)
    response = model.generate(conversation['messages'][0]['content'])
    # Score the response against the reference answer
    score = task.evaluate(conversation, response)
    print(f"Problem {i}: {'✓' if score else '✗'}")