Overview
The FeedForward class implements a SwiGLU (Swish-Gated Linear Unit) feedforward network, which is a key component of each Transformer layer. It multiplies a SiLU-activated (Swish) projection of the input element-wise with a second linear projection, then projects the result back to the model dimension.
Definition
Initialization
Parameters
- dim: Input dimension. This is the model’s hidden dimension from ModelArgs.dim.
- hidden_dim: Base hidden dimension of the feedforward layer. The actual hidden dimension is computed as int(2 * hidden_dim / 3) and then adjusted.
- multiple_of: The computed hidden dimension is rounded up to the nearest multiple of this value for computational efficiency on modern hardware.
- ffn_dim_multiplier: Optional custom multiplier for the hidden dimension. When provided, scales the computed hidden dimension by this factor before rounding to multiple_of.
Hidden Dimension Computation
The actual hidden dimension is computed using the following logic (model.py:331-335):
- Base scaling by 2/3
- Optional custom scaling via ffn_dim_multiplier
- Rounding up to the nearest multiple of multiple_of
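The three steps listed above can be sketched in plain Python. This is a minimal illustration rather than the code at the cited lines of model.py: the function name swiglu_hidden_dim is hypothetical, and the input values are arbitrary.

```python
from typing import Optional


def swiglu_hidden_dim(
    hidden_dim: int,
    multiple_of: int,
    ffn_dim_multiplier: Optional[float] = None,
) -> int:
    """Hypothetical helper mirroring the three steps described above."""
    # Step 1: base scaling by 2/3 (int() truncates toward zero).
    hidden_dim = int(2 * hidden_dim / 3)
    # Step 2: optional custom scaling, applied only when a multiplier is given.
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Step 3: round up to the nearest multiple of multiple_of.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


# Illustrative inputs: a base hidden dimension of 4 * 512 and multiple_of = 256.
print(swiglu_hidden_dim(4 * 512, 256))
print(swiglu_hidden_dim(4 * 512, 256, ffn_dim_multiplier=1.5))
```

A value that is already a multiple of multiple_of after scaling is left unchanged by the rounding step, because the ceiling division only rounds up when there is a remainder.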
Attributes
- w1: Linear transformation for the first gate layer. Projects from dim to hidden_dim without bias.
- w2: Linear transformation for the output layer. Projects from hidden_dim back to dim without bias.
- w3: Linear transformation for the second gate layer. Projects from dim to hidden_dim without bias.
Forward Pass
Parameters
- x: Input tensor with shape (batch_size, seq_len, dim).
Returns
Output tensor with shape (batch_size, seq_len, dim).
Implementation
The forward pass implements the SwiGLU activation function (model.py:348):
- Gate branch: F.silu(self.w1(x)) applies the SiLU activation to the first projection
- Linear branch: self.w3(x) is the second projection, without activation
- Gating: Element-wise multiplication of the two branches
- Output projection: self.w2(...) projects back to the model dimension
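To make the four steps concrete, here is a dependency-free sketch that applies them to a single token vector. It stands in for the tensor-based module, not for the actual implementation: plain Python lists replace tensors, and the helper names (silu, linear, swiglu_forward) and the toy weights are assumptions chosen only for illustration.

```python
import math


def silu(v: float) -> float:
    """SiLU (Swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def linear(weight, x):
    """Bias-free linear map. weight has shape (out, in); x has length in."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def swiglu_forward(x, w1, w2, w3):
    gate = [silu(v) for v in linear(w1, x)]         # gate branch: SiLU(w1(x))
    lin = linear(w3, x)                             # linear branch: w3(x)
    hidden = [g * l for g, l in zip(gate, lin)]     # element-wise gating
    return linear(w2, hidden)                       # output projection: w2(...)


# Toy sizes: dim = 2, hidden_dim = 3. Weights are arbitrary.
x = [1.0, -2.0]
w1 = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]   # shape (hidden_dim, dim)
w3 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # shape (hidden_dim, dim)
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # shape (dim, hidden_dim)

y = swiglu_forward(x, w1, w2, w3)
print(len(y))  # output has the same length as the input (dim)
```

The output has the same length as the input, matching the (batch_size, seq_len, dim) in and out shapes documented for the forward pass.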
SwiGLU Activation
SwiGLU combines two concepts:
- GLU (Gated Linear Unit): Uses a gating mechanism with element-wise multiplication
- SiLU/Swish: Smooth activation function x * sigmoid(x)
SwiGLU(x) = W2 · (Swish(W1·x) ⊙ (W3·x))
Where ⊙ represents element-wise multiplication.
Usage in TransformerBlock
The FeedForward module is instantiated in TransformerBlock with specific dimension calculations:
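The original instantiation snippet is not reproduced on this page. The sketch below shows only which arguments such a call would plausibly receive, based on the 4x base expansion mentioned under Performance Considerations. The ModelArgs stand-in, its default values, and the helper feed_forward_kwargs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:
    # Stand-in holding only the fields used here; values are illustrative.
    dim: int = 4096
    multiple_of: int = 256
    ffn_dim_multiplier: Optional[float] = None


def feed_forward_kwargs(args: ModelArgs) -> dict:
    """Hypothetical helper: arguments a block would pass to FeedForward."""
    return {
        "dim": args.dim,
        "hidden_dim": 4 * args.dim,  # 4x base expansion, scaled down inside FeedForward
        "multiple_of": args.multiple_of,
        "ffn_dim_multiplier": args.ffn_dim_multiplier,
    }


kwargs = feed_forward_kwargs(ModelArgs())
print(kwargs)
```

Passing 4 * dim as the base and letting FeedForward apply the 2/3 factor keeps the scaling and rounding logic in one place.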
Performance Considerations
- The multiple_of parameter (default 256) ensures the hidden dimension is a multiple of a large power of 2, improving GPU/TPU efficiency
- Model parallelism is handled via ColumnParallelLinear and RowParallelLinear from FairScale
- The 2/3 scaling factor and 4x base expansion result in an effective hidden dimension of approximately 8/3 * dim before rounding
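As a worked example of the last point, assume dim = 4096 (an illustrative value), multiple_of = 256, and no ffn_dim_multiplier. The symbols h_base, h_scaled, and h_final are introduced here only for the derivation.

```latex
\begin{align*}
h_{\text{base}}   &= 4 \cdot \text{dim} = 16384 \\
h_{\text{scaled}} &= \left\lfloor \tfrac{2}{3}\, h_{\text{base}} \right\rfloor
                   = \lfloor 32768 / 3 \rfloor = 10922
                   \approx \tfrac{8}{3} \cdot \text{dim} \\
h_{\text{final}}  &= 256 \cdot \left\lceil 10922 / 256 \right\rceil
                   = 256 \cdot 43 = 11008 \approx 2.69 \cdot \text{dim}
\end{align*}
```

Rounding up to a multiple of 256 pushes the effective ratio slightly above 8/3 (about 2.67).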