The transformer architecture consists of stacked blocks, each containing attention and feedforward layers with layer normalization.

Block

The Block class represents a single transformer block with pre-normalization architecture.

Class definition

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
Location: model.py:94-106

Parameters

config
GPTConfig
required
Configuration object containing model hyperparameters.

Components

ln_1
LayerNorm
First layer normalization applied before the attention layer.
attn
CausalSelfAttention
Multi-head causal self-attention mechanism.
ln_2
LayerNorm
Second layer normalization applied before the MLP layer.
mlp
MLP
Feedforward network applied after attention.

Architecture

The Block implements pre-normalization with residual connections:
  1. Apply LayerNorm to input
  2. Apply attention
  3. Add residual connection
  4. Apply LayerNorm
  5. Apply MLP
  6. Add residual connection
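Each LayerNorm is applied to the input of its sublayer (pre-normalization) rather than after the residual addition (post-normalization), so the residual path x is never normalized inside the block. This arrangement tends to be more stable to train.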
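The difference between the two orderings can be sketched without PyTorch. The snippet below is a minimal illustration, not the nanoGPT implementation: `toy_norm`, `halve`, and `double` are hypothetical stand-ins for LayerNorm, attention, and the MLP, chosen only so the arithmetic is easy to follow by hand.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used for the residual connections."""
    return [u + v for u, v in zip(a, b)]


def pre_norm_block(x: Vector, norm_1: Sublayer, attn: Sublayer,
                   norm_2: Sublayer, mlp: Sublayer) -> Vector:
    """Same ordering as Block.forward: normalize, apply sublayer, add residual."""
    x = add(x, attn(norm_1(x)))  # steps 1-3
    x = add(x, mlp(norm_2(x)))   # steps 4-6
    return x


def post_norm_block(x: Vector, norm_1: Sublayer, attn: Sublayer,
                    norm_2: Sublayer, mlp: Sublayer) -> Vector:
    """Post-normalization for comparison: the norm wraps the residual sum."""
    x = norm_1(add(x, attn(x)))
    x = norm_2(add(x, mlp(x)))
    return x


# Hypothetical toy stand-ins (assumptions for illustration only).
def toy_norm(v: Vector) -> Vector:
    mean = sum(v) / len(v)
    return [e - mean for e in v]  # centers the vector, no rescaling


def halve(v: Vector) -> Vector:
    return [0.5 * e for e in v]


def double(v: Vector) -> Vector:
    return [2.0 * e for e in v]


x = [1.0, 3.0]
pre = pre_norm_block(x, toy_norm, halve, toy_norm, double)
post = post_norm_block(x, toy_norm, halve, toy_norm, double)
print(pre)   # [-2.5, 6.5]
print(post)  # [-4.5, 4.5]
```

In the pre-normalization result the mean of the input (2.0) survives, because the residual path is only ever added to. In the post-normalization result the toy norm has re-centered the whole stream, including the residual.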

MLP

The MLP class is a two-layer feedforward network with GELU activation.

Class definition

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x
Location: model.py:78-92

Parameters

config
GPTConfig
required
Configuration object containing model hyperparameters.

Components

c_fc
nn.Linear
First linear layer that expands dimensionality from n_embd to 4 * n_embd.
gelu
nn.GELU
Gaussian Error Linear Unit activation function.
c_proj
nn.Linear
Second linear layer that projects back down from 4 * n_embd to n_embd.
dropout
nn.Dropout
Dropout layer applied to the output of c_proj, with drop probability config.dropout.
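For reference, nn.GELU() with its default arguments computes the exact form GELU(x) = x * Phi(x), where Phi is the standard normal CDF. The scalar function below is a standard-library sketch of that formula, written only to show the shape of the activation. It is not how PyTorch implements it internally.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Negative inputs are damped rather than zeroed; large positive inputs pass through.
for value in (-1.0, 0.0, 1.0, 10.0):
    print(value, gelu(value))
```

Unlike ReLU, the function is smooth around zero and slightly negative for small negative inputs.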

Architecture

The MLP follows the standard transformer feedforward network design:
input (n_embd) → Linear (n_embd → 4*n_embd) → GELU → Linear (4*n_embd → n_embd) → Dropout → output (n_embd)
The hidden dimension is 4 times the embedding dimension (for example, 3072 when n_embd is 768), the same ratio used in the original Transformer and in GPT-2.
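The 4x expansion is where most of the MLP's parameters come from. The helper below is a small sketch that counts them from the two nn.Linear shapes shown in the class definition. `mlp_param_count` is a hypothetical name introduced here, and 768 is used only as an illustrative width.

```python
def mlp_param_count(n_embd: int, bias: bool = True) -> int:
    """Count learnable parameters in c_fc and c_proj (GELU and Dropout have none)."""
    hidden = 4 * n_embd
    weights = n_embd * hidden + hidden * n_embd   # c_fc.weight + c_proj.weight
    biases = (hidden + n_embd) if bias else 0     # c_fc.bias + c_proj.bias
    return weights + biases


print(mlp_param_count(768, bias=True))   # 4722432
print(mlp_param_count(768, bias=False))  # 4718592
```

The weight count grows as 8 * n_embd^2, so doubling the embedding width roughly quadruples the size of each MLP.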

LayerNorm

Custom LayerNorm implementation with optional bias parameter.

Class definition

class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
Location: model.py:18-27

Parameters

ndim
int
required
Dimensionality to normalize over (typically config.n_embd).
bias
bool
required
Whether to include a learnable bias parameter. If False, no bias tensor is created and only the learnable scale is applied after normalization.

Components

weight
nn.Parameter
Learnable scale parameter initialized to ones with shape (ndim,).
bias
nn.Parameter | None
Optional learnable bias parameter initialized to zeros with shape (ndim,). Set to None if bias=False.

Why custom LayerNorm?

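When this class was written, PyTorch’s built-in nn.LayerNorm had no option to disable the bias parameter (later releases, starting with PyTorch 2.1, added a bias argument). This implementation lets you set bias=False in the config, which removes the bias from both the Linear and LayerNorm layers and may make the model slightly better and faster.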
The epsilon value is fixed at 1e-5 for numerical stability during normalization.
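The computation that F.layer_norm performs over the last dimension can be written out for a single vector. This is a plain-Python sketch of the math, assuming one 1-D input. The function name and list-based types are illustrative and not part of model.py.

```python
import math
from typing import List, Optional


def layer_norm(x: List[float], weight: List[float],
               bias: Optional[List[float]] = None,
               eps: float = 1e-5) -> List[float]:
    """Normalize x to zero mean and unit variance, then scale and optionally shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased (population) variance
    inv_std = 1.0 / math.sqrt(var + eps)            # eps guards against var == 0
    out = [(v - mean) * inv_std * w for v, w in zip(x, weight)]
    if bias is not None:                            # mirrors bias=None when bias=False
        out = [o + b for o, b in zip(out, bias)]
    return out


x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
no_bias = layer_norm(x, ones)                          # bias=False case
with_bias = layer_norm(x, ones, bias=[10.0] * len(x))  # bias=True case
print(no_bias)
print(with_bias)
```

With the weight at its initial value of ones and no bias, the output has mean approximately 0 and variance approximately 1. The bias, when present, simply shifts each normalized element.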
