GPT class

The main GPT Language Model implementation. This class provides the complete transformer architecture with support for training, inference, and text generation.

Constructor

GPT(config)
Initializes a new GPT model instance.
config
GPTConfig
required
Configuration object specifying model architecture parameters

Attributes

config
GPTConfig
The configuration object containing model hyperparameters
transformer
nn.ModuleDict
Container with the following components:
  • wte: Token embedding layer (vocab_size x n_embd)
  • wpe: Position embedding layer (block_size x n_embd)
  • drop: Dropout layer
  • h: ModuleList of transformer blocks
  • ln_f: Final layer normalization
lm_head
nn.Linear
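Language model head that projects embeddings to vocabulary logits (n_embd x vocab_size, without bias). Its weight matrix is tied to transformer.wte, so the token embeddings and the output projection share parameters.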

Methods

forward

forward(idx, targets=None)
Performs a forward pass through the model.
idx
torch.LongTensor
required
Input token indices of shape (batch_size, sequence_length)
targets
torch.LongTensor
Optional target token indices of shape (batch_size, sequence_length). If provided, the cross-entropy loss between the logits and these targets is computed.
Returns: Tuple of (logits, loss)
  • logits: Predicted logits of shape (batch_size, sequence_length, vocab_size) when targets are provided, or (batch_size, 1, vocab_size) when targets is None, because only the last position is projected for inference
  • loss: Cross-entropy loss if targets are provided, otherwise None
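To make the loss value concrete, here is a minimal pure-Python sketch of a mean cross-entropy over flattened positions. It is an illustration under assumptions, not the model's code: the function name is hypothetical, logits are plain lists instead of tensors, and the ignore_index=-1 convention for masked targets is assumed.

```python
import math

def cross_entropy(logits, targets, ignore_index=-1):
    """Mean negative log-likelihood over flattened (batch, position) pairs.

    logits:  list of rows, each of length vocab_size
    targets: list of int token ids, one per row
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked positions do not contribute to the loss
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count if count else float("nan")
```

With uniform logits over a vocabulary of size V, this evaluates to log(V), which is a useful sanity check for the initial loss of an untrained model.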

generate

@torch.no_grad()
generate(idx, max_new_tokens, temperature=1.0, top_k=None)
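Generates new tokens autoregressively from a conditioning sequence. Each step feeds the sequence back into the model, samples one token, and appends it. If the sequence grows longer than block_size, only the last block_size tokens are used as context.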
idx
torch.LongTensor
required
Conditioning sequence of token indices with shape (batch_size, sequence_length)
max_new_tokens
int
required
Number of new tokens to generate
temperature
float
default:"1.0"
Sampling temperature. Values < 1.0 make the model more confident (less random), values > 1.0 make it more diverse (more random).
top_k
int
If specified, only the top_k most likely tokens are considered for sampling. Others are set to zero probability.
Returns: torch.LongTensor of shape (batch_size, sequence_length + max_new_tokens) with generated tokens appended
Make sure the model is in eval mode (model.eval()) before calling this method for generation.
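The effect of temperature and top_k on a single sampling step can be sketched in pure Python. This is a simplified illustration on a plain list of logits; the helper names are hypothetical and the real method operates on batched tensors.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn one row of logits into sampling probabilities."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        threshold = sorted(scaled, reverse=True)[k - 1]
        # Everything below the k-th largest logit is masked out.
        scaled = [x if x >= threshold else float("-inf") for x in scaled]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token id from the adjusted distribution."""
    probs = next_token_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Setting top_k=1 makes the step deterministic (greedy decoding), while lowering the temperature concentrates probability on the highest-scoring tokens without removing the others entirely.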

from_pretrained

@classmethod
from_pretrained(cls, model_type, override_args=None)
Loads pretrained GPT-2 weights released by OpenAI, fetched through the Hugging Face transformers library.
model_type
str
required
One of: 'gpt2' (124M), 'gpt2-medium' (350M), 'gpt2-large' (774M), or 'gpt2-xl' (1558M)
override_args
dict
Optional arguments to override. Currently only dropout can be overridden.
Returns: GPT model instance with loaded pretrained weights
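As a rough sketch of how a model_type could map to configuration values, the snippet below uses the published GPT-2 architecture sizes. The dictionary and function names are hypothetical, and the fixed values (vocab_size 50257, block_size 1024, bias True) are assumptions about what the GPT-2 checkpoints require; the actual method also downloads and copies the weights.

```python
# Published GPT-2 architecture sizes, keyed by model_type.
GPT2_ARCHITECTURES = {
    "gpt2":        {"n_layer": 12, "n_head": 12, "n_embd": 768},   # 124M
    "gpt2-medium": {"n_layer": 24, "n_head": 16, "n_embd": 1024},  # 350M
    "gpt2-large":  {"n_layer": 36, "n_head": 20, "n_embd": 1280},  # 774M
    "gpt2-xl":     {"n_layer": 48, "n_head": 25, "n_embd": 1600},  # 1558M
}

def pretrained_config_args(model_type, override_args=None):
    """Build keyword arguments for a GPTConfig matching a GPT-2 checkpoint."""
    override_args = override_args or {}
    if model_type not in GPT2_ARCHITECTURES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    unsupported = set(override_args) - {"dropout"}
    if unsupported:
        raise ValueError(f"only dropout can be overridden, got: {sorted(unsupported)}")
    config_args = dict(GPT2_ARCHITECTURES[model_type])
    # Assumed to be fixed by the checkpoints themselves.
    config_args.update(vocab_size=50257, block_size=1024, bias=True)
    config_args.update(override_args)
    return config_args
```

For example, pretrained_config_args("gpt2", {"dropout": 0.1}) yields the 124M architecture with dropout enabled for finetuning.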

crop_block_size

crop_block_size(block_size)
Reduces the model’s context length via model surgery.
block_size
int
required
New block size (must be ≤ current block_size)
This method modifies the model in-place by truncating position embeddings and attention bias matrices. Use this when you want to reduce context length, for example when loading GPT-2 (block size 1024) but using a smaller context window.
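The truncation itself amounts to slicing. The following sketch mimics it with nested lists standing in for the position embedding table and the causal attention mask; the helper names are hypothetical and the real method slices tensors inside the model.

```python
def causal_mask(n):
    """Lower-triangular n x n mask: position i may attend to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def crop_position_table(wpe, block_size):
    """Keep only the first block_size rows of a (block_size x n_embd) table."""
    if block_size > len(wpe):
        raise ValueError("new block_size must not exceed the current one")
    return wpe[:block_size]

def crop_causal_mask(mask, block_size):
    """Keep the top-left block_size x block_size corner of the mask."""
    if block_size > len(mask):
        raise ValueError("new block_size must not exceed the current one")
    return [row[:block_size] for row in mask[:block_size]]
```

Because a causal mask is lower triangular, its top-left corner is itself a valid causal mask for the shorter context, which is why cropping leaves the model's behavior on short sequences unchanged.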

get_num_params

get_num_params(non_embedding=True)
Counts the total number of parameters in the model.
non_embedding
bool
default:"True"
If True, subtracts the position embeddings (wpe) from the count. The token embeddings (wte) are still counted because their weights are shared with the final lm_head layer.
Returns: int - Total number of parameters
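The count can be cross-checked analytically. The sketch below assumes a standard GPT-2 block layout (a fused 3x QKV projection, an output projection, a 4x MLP, and two LayerNorms per block) and tied input/output embeddings; the function name is hypothetical.

```python
def count_gpt_params(vocab_size, block_size, n_layer, n_embd,
                     bias=True, non_embedding=True):
    """Analytic parameter count for a GPT-2 style model with tied embeddings."""
    b = 1 if bias else 0
    wte = vocab_size * n_embd         # token embeddings, shared with lm_head
    wpe = block_size * n_embd         # position embeddings
    layer_norm = n_embd + b * n_embd  # weight plus optional bias
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    block = 2 * layer_norm + attn + mlp
    total = wte + wpe + n_layer * block + layer_norm  # plus the final ln_f
    return total - wpe if non_embedding else total
```

Under these assumptions, the smallest GPT-2 configuration (vocab_size 50257, block_size 1024, 12 layers, n_embd 768, bias enabled) comes to 124,439,808 parameters in total, or 123,653,376 once the position embeddings are subtracted.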

configure_optimizers

configure_optimizers(weight_decay, learning_rate, betas, device_type)
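Creates an AdamW optimizer with two parameter groups. Parameters with two or more dimensions (weight matrices and embeddings) receive weight decay; one-dimensional parameters (biases and LayerNorm weights) do not.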
weight_decay
float
required
Weight decay coefficient (typically 0.1)
learning_rate
float
required
Learning rate for the optimizer
betas
tuple
required
Beta coefficients for AdamW (typically (0.9, 0.95))
device_type
str
required
Device type ('cuda' or 'cpu'). The fused AdamW implementation is used only when device_type is 'cuda' and the installed PyTorch version supports it.
Returns: torch.optim.AdamW optimizer
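A dimensionality-based split for weight decay can be sketched without PyTorch by working on parameter shapes alone. The function name and the example parameter names are hypothetical; in the real method the groups hold tensors and are passed to torch.optim.AdamW.

```python
def split_for_weight_decay(param_shapes, weight_decay):
    """Group parameter names by dimensionality.

    param_shapes: dict mapping parameter name -> shape tuple
    Returns two optimizer-style groups: decayed (ndim >= 2) and non-decayed.
    """
    decay = [name for name, shape in param_shapes.items() if len(shape) >= 2]
    no_decay = [name for name, shape in param_shapes.items() if len(shape) < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Illustrative shapes for a handful of parameters.
example_shapes = {
    "wte.weight": (50304, 768),
    "h.0.attn.c_attn.weight": (2304, 768),
    "h.0.attn.c_attn.bias": (2304,),
    "ln_f.weight": (768,),
}
groups = split_for_weight_decay(example_shapes, weight_decay=0.1)
```

Splitting on dimensionality avoids maintaining a list of layer types: matrices and embedding tables are regularized, while scale and shift vectors are left alone.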

estimate_mfu

estimate_mfu(fwdbwd_per_iter, dt)
Estimates Model FLOPs Utilization (MFU) in units of A100 bfloat16 peak FLOPS.
fwdbwd_per_iter
int
required
Number of forward-backward passes per iteration (typically batch_size * gradient_accumulation_steps)
dt
float
required
Time delta in seconds for the iteration
Returns: float - MFU as a ratio (0.0 to 1.0+) where 1.0 represents 100% of A100 peak performance
MFU calculation is based on the PaLM paper (Appendix B). A100 bfloat16 peak FLOPS is assumed to be 312 TFLOPS.
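The arithmetic behind the estimate can be written out as a standalone function. This sketch follows the PaLM-style accounting of roughly 6 FLOPs per parameter per token plus an attention term; the function signature is hypothetical, and n_params is assumed to be the value returned by get_num_params().

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 fwdbwd_per_iter, dt, peak_flops=312e12):
    """Ratio of achieved FLOPS to a peak figure (A100 bfloat16 by default)."""
    L, H, T = n_layer, n_head, block_size
    Q = n_embd // n_head                       # per-head dimension
    flops_per_token = 6 * n_params + 12 * L * H * Q * T
    flops_per_fwdbwd = flops_per_token * T     # one full sequence
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt       # FLOPs per second
    return flops_achieved / peak_flops
```

Since the result is normalized by a single-A100 peak, values above 1.0 are possible when the measured iteration runs on several GPUs or on faster hardware.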

GPTConfig

Dataclass containing model architecture configuration.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True
block_size
int
default:"1024"
Maximum sequence length / context window size
vocab_size
int
default:"50304"
Vocabulary size. The default is GPT-2’s 50257 rounded up to the nearest multiple of 64 for efficiency.
n_layer
int
default:"12"
Number of transformer blocks
n_head
int
default:"12"
Number of attention heads
n_embd
int
default:"768"
Embedding dimension size
dropout
float
default:"0.0"
Dropout probability. Use 0.0 for pretraining, try 0.1+ for finetuning.
bias
bool
default:"True"
Whether to use bias in Linear and LayerNorm layers. True matches GPT-2, False is slightly better and faster.
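Two of the defaults above follow from simple arithmetic, sketched here with hypothetical helper names. The divisibility check reflects the assumption that n_embd is split evenly across the attention heads.

```python
def pad_vocab_size(vocab_size, multiple=64):
    """Round vocab_size up to the nearest multiple (64 by default)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

def head_size(n_embd, n_head):
    """Per-head dimension; n_embd is assumed to divide evenly by n_head."""
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    return n_embd // n_head
```

With the defaults, pad_vocab_size(50257) gives 50304 and head_size(768, 12) gives 64.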
