Overview
The Engine class provides an efficient interface for autoregressive token generation with KV caching and tool use support.
Engine Class
Constructor
- The GPT model instance to use for generation.
- Tokenizer instance with encode(), decode(), and encode_special() methods (needed for tool use).
Generation Method
generate()
Generator that yields tokens one at a time for autoregressive generation.
- Input token IDs as a list of integers. Must be pre-tokenized.
- Number of parallel samples to generate (num_samples). Uses KV cache cloning for efficiency.
- Maximum number of tokens to generate. If None, generates until the model outputs an end token or reaches the context limit.
- Sampling temperature. Higher values (e.g., 1.5) make output more random, lower values (e.g., 0.7) more deterministic. Set to 0.0 for greedy decoding.
- If set, only sample from the top-k most likely tokens. None means no top-k filtering.
- Random seed for reproducible generation.
Yields
The generator yields (token_column, token_masks) tuples:
- token_column: List of length num_samples containing the next token ID for each sample.
- token_masks: List of length num_samples with values:
  - 1 if the token was sampled from the model
  - 0 if the token was forced (e.g., tool use output)
Tool Use
The engine automatically detects and executes Python expressions in tool blocks:
- When the model generates <|python_start|>, the engine enters tool mode
- Collects tokens until <|python_end|>
- Evaluates the Python expression safely
- Forces the <|output_start|>result<|output_end|> tokens into the output
Supported expressions:
- Basic arithmetic: 2 + 2, 10 * 5
- String methods: 'strawberry'.count('r')
Safety limits:
- 3-second timeout per expression
- No dangerous operations (import, exec, eval, file access)
- Limited character set for string operations
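The steps above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the engine's implementation: string tokens stand in for token IDs, the names eval_expr and run_tool_loop are hypothetical, the evaluator is a whitelist built on Python's ast module, and the 3-second timeout is not reproduced.

```python
import ast
import operator

# Arithmetic operators allowed by this sketch (an assumption, not the engine's list).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_expr(expr):
    """Evaluate a tiny whitelisted subset of Python; return None if unsupported."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            left, right = walk(node.left), walk(node.right)
            # Arithmetic is restricted to numbers, so 'a' * 10**9 is rejected.
            if isinstance(left, str) or isinstance(right, str):
                raise ValueError("arithmetic on strings is not allowed")
            return _OPS[type(node.op)](left, right)
        # The only method call allowed here is str.count(substring).
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
                and node.func.attr == "count" and len(node.args) == 1
                and not node.keywords):
            target, arg = walk(node.func.value), walk(node.args[0])
            if isinstance(target, str) and isinstance(arg, str):
                return target.count(arg)
        raise ValueError("unsupported expression")
    try:
        return walk(ast.parse(expr, mode="eval").body)
    except (ValueError, SyntaxError, TypeError, ZeroDivisionError):
        return None

def run_tool_loop(sampled_tokens):
    """Replay sampled tokens and inject forced tool output after each tool block."""
    out = []  # (token, mask) pairs: mask 1 = sampled, mask 0 = forced
    in_python, expr_tokens = False, []
    for tok in sampled_tokens:
        out.append((tok, 1))
        if tok == "<|python_start|>":
            in_python, expr_tokens = True, []
        elif tok == "<|python_end|>" and in_python:
            in_python = False
            result = eval_expr("".join(expr_tokens))
            if result is not None:
                forced = ["<|output_start|>", str(result), "<|output_end|>"]
                out.extend((t, 0) for t in forced)
        elif in_python:
            expr_tokens.append(tok)
    return out

stream = ["<|python_start|>", "2", " + ", "2", "<|python_end|>", "done"]
trace = run_tool_loop(stream)
```

In this sketch an expression that fails the whitelist produces no forced tokens, so generation simply continues with the next sampled token.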
KV Cache
The engine automatically manages a key-value cache for efficient generation.
Benefits
- Speed: each step reuses cached key/value states instead of recomputing them for the whole sequence
- Efficient Sampling: clone the cache once for multiple parallel samples
- Memory Optimized: only stores compressed KV states, not full activations
- Flash Attention 3: optimized for FA3’s flash_attn_with_kvcache API
Cache Structure
The KVCache stores tensors in Flash Attention 3’s native (B, T, H, D) layout:
- Keys: (batch_size, seq_len, n_kv_head, head_dim)
- Values: (batch_size, seq_len, n_kv_head, head_dim)
The cache_seqlens tensor allows FA3 to update the cache in place.
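To make the layout concrete, here is a toy cache in plain Python. It is only a sketch of the idea: nested lists stand in for tensors, the class name ToyKVCache and its append() method are hypothetical, and no attention is computed. The per-row length counter plays the role described for cache_seqlens: it marks where the next write lands, so new keys and values are written into preallocated storage rather than concatenated.

```python
class ToyKVCache:
    """Toy cache in (B, T, H, D) layout using nested lists (illustration only)."""

    def __init__(self, batch_size, max_seq_len, n_kv_head, head_dim):
        def zeros():
            return [[[[0.0] * head_dim for _ in range(n_kv_head)]
                     for _ in range(max_seq_len)]
                    for _ in range(batch_size)]
        self.k = zeros()  # keys:   (batch_size, seq_len, n_kv_head, head_dim)
        self.v = zeros()  # values: (batch_size, seq_len, n_kv_head, head_dim)
        # Number of valid positions per batch row; writes start at this offset.
        self.cache_seqlens = [0] * batch_size

    def append(self, row, new_k, new_v):
        """Write new (T_new, H, D) keys/values in place and advance the length."""
        start = self.cache_seqlens[row]
        if start + len(new_k) > len(self.k[row]):
            raise ValueError("cache is full")
        for offset, (k_t, v_t) in enumerate(zip(new_k, new_v)):
            self.k[row][start + offset] = k_t
            self.v[row][start + offset] = v_t
        self.cache_seqlens[row] = start + len(new_k)

cache = ToyKVCache(batch_size=2, max_seq_len=4, n_kv_head=1, head_dim=2)
# Prefill two prompt positions for row 0, then decode one more token.
cache.append(0, new_k=[[[1.0, 2.0]], [[3.0, 4.0]]], new_v=[[[5.0, 6.0]], [[7.0, 8.0]]])
cache.append(0, new_k=[[[9.0, 9.0]]], new_v=[[[0.5, 0.5]]])
```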
Helper Functions
sample_next_token()
Samples the next token from logits:
- Logits tensor of shape (B, vocab_size).
- Random number generator for sampling.
- Sampling temperature. 0.0 for greedy decoding.
- If set, only sample from the top-k tokens.
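The sampling behaviour described above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: it works on a single list of logits rather than a (B, vocab_size) tensor, uses random.Random as the generator, and the name sample_from_logits is hypothetical.

```python
import math
import random

def sample_from_logits(logits, rng, temperature=1.0, top_k=None):
    """Pick one token ID from a list of logits (hypothetical helper)."""
    # Temperature 0.0 means greedy decoding: return the argmax.
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Optionally keep only the k highest-scoring candidates.
    candidates = list(range(len(logits)))
    if top_k is not None:
        k = min(top_k, len(logits))
        candidates = sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the scaled logits (subtract the max for stability).
    scaled = [logits[i] / temperature for i in candidates]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # random.Random.choices draws proportionally to the weights.
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(42)
logits = [0.1, 2.0, -1.0, 1.5]
greedy = sample_from_logits(logits, rng, temperature=0.0)
top2 = [sample_from_logits(logits, rng, temperature=1.0, top_k=2) for _ in range(20)]
```

With top_k=2 only the two highest logits (indices 1 and 3 here) can ever be drawn, whatever the temperature.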
Usage Examples
Basic Generation
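A minimal sketch of the consumption loop for a single sample. StubEngine is a hypothetical stand-in that only mimics the documented output shape of generate() so the snippet runs on its own; the keyword name max_tokens is an assumption, and a real Engine would be built from a model and tokenizer as described in the Constructor section.

```python
class StubEngine:
    """Hypothetical stand-in that mimics the documented generate() output shape."""

    def generate(self, tokens, num_samples=1, max_tokens=None):
        # A real engine would run the model; this stub just counts upward.
        steps = 3 if max_tokens is None else max_tokens
        last = tokens[-1]
        for step in range(1, steps + 1):
            token_column = [last + step] * num_samples  # one next-token ID per sample
            token_masks = [1] * num_samples             # 1 = sampled from the model
            yield token_column, token_masks

engine = StubEngine()
prompt_tokens = [10, 11, 12]  # input must already be tokenized
generated = []
for token_column, token_masks in engine.generate(prompt_tokens, num_samples=1, max_tokens=4):
    generated.append(token_column[0])  # single sample, so read index 0
```

Because generate() is a generator, tokens can be decoded and displayed as they arrive instead of after the loop finishes.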
Chat Conversation
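A rough sketch of turning a user message into prompt tokens for generate(). Everything here is an assumption made for illustration: StubTokenizer is a hypothetical tokenizer exposing the encode() and encode_special() methods mentioned in the Constructor section, render_prompt is a hypothetical helper, and the chat marker names (<|bos|>, <|user_start|>, <|user_end|>, <|assistant_start|>) are assumed rather than taken from this page.

```python
class StubTokenizer:
    """Hypothetical tokenizer: one ID per special token, one ID per character."""

    # Assumed chat markers; the real vocabulary may differ.
    SPECIAL = ["<|bos|>", "<|user_start|>", "<|user_end|>", "<|assistant_start|>"]

    def encode_special(self, name):
        return self.SPECIAL.index(name)

    def encode(self, text):
        # Offset character codes so they never collide with special token IDs.
        return [len(self.SPECIAL) + ord(ch) for ch in text]

def render_prompt(tokenizer, user_message):
    """Wrap a user message in chat markers and leave the assistant turn open."""
    tokens = [tokenizer.encode_special("<|bos|>")]
    tokens.append(tokenizer.encode_special("<|user_start|>"))
    tokens.extend(tokenizer.encode(user_message))
    tokens.append(tokenizer.encode_special("<|user_end|>"))
    tokens.append(tokenizer.encode_special("<|assistant_start|>"))
    return tokens

prompt_tokens = render_prompt(StubTokenizer(), "Hi")
```

The resulting list is the kind of pre-tokenized input that generate() expects; the model then continues from the open assistant turn.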
Multiple Samples
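A sketch of collecting several parallel samples. Each yielded token_column holds one token per sample, so building per-sample sequences means transposing the stream. stub_generate is a hypothetical stand-in for Engine.generate() that emits fake token IDs so the snippet runs by itself.

```python
def stub_generate(tokens, num_samples=1, max_tokens=3):
    """Hypothetical stand-in for Engine.generate(); yields fake token IDs."""
    for step in range(max_tokens):
        # Sample i gets token 100 * i + step so the rows are easy to tell apart.
        yield [100 * i + step for i in range(num_samples)], [1] * num_samples

num_samples = 3
sequences = [[] for _ in range(num_samples)]
for token_column, token_masks in stub_generate([1, 2, 3], num_samples=num_samples):
    # token_column[i] is the next token for sample i at this step.
    for i, token in enumerate(token_column):
        sequences[i].append(token)
```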
Tool Use
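A sketch of using token_masks to tell sampled tokens from forced tool output. stub_generate_with_tool is a hypothetical stand-in that replays a fixed stream in the documented (token_column, token_masks) shape, with strings in place of token IDs for readability.

```python
def stub_generate_with_tool():
    """Hypothetical stand-in for a generate() run that triggers one tool call."""
    stream = [
        ("<|python_start|>", 1), ("2 + 2", 1), ("<|python_end|>", 1),
        ("<|output_start|>", 0), ("4", 0), ("<|output_end|>", 0),
        (" is the answer", 1),
    ]
    for token, mask in stream:
        yield [token], [mask]  # num_samples == 1

sampled, forced = [], []
for token_column, token_masks in stub_generate_with_tool():
    # Mask 1 means the model sampled the token; mask 0 means the engine forced it.
    (sampled if token_masks[0] == 1 else forced).append(token_column[0])
```

Separating the two streams this way is useful whenever only model-produced tokens should count, for example when scoring or logging what the model itself wrote.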
See Also
- Chat CLI - Command-line interface using Engine
- Chat Web UI - Web-based interface using Engine
- GPT Model - Underlying model architecture