Before any attention or feed-forward computation can happen, raw integer token IDs must be converted into dense floating-point vectors that the rest of the network can process. Then, because a transformer has no inherent sense of sequence order, fixed position signals must be injected into those vectors. These two steps — token embedding lookup and sinusoidal positional encoding — are the first things transformer_block executes on both the source and target token sequences.

Token embeddings

A token embedding is conceptually simple: maintain a matrix of shape (vocab_size, d_model) where each row is the learned vector for one vocabulary entry. Given a token ID, fetch that row. The only extra step here is multiplying each fetched value by sqrt(d_model), a convention taken from the original “Attention Is All You Need” paper.

Function signature

void embeddings_forward(float* out, int* tokens, float* weight, int B, int T, int d_model);
| Parameter | Shape | Description |
| --- | --- | --- |
| out | B × T × d_model | Output buffer (caller-allocated) |
| tokens | B × T | Integer token IDs |
| weight | vocab_size × d_model | The embedding lookup table |
| B | scalar | Batch size |
| T | scalar | Sequence length |
| d_model | scalar | Embedding dimension |
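All three array arguments are flat, row-major buffers. As a rough sketch of what "caller-allocated" means in practice, the helpers below size the output buffer and compute the flat offset of element [b][t][d]. The names embedding_out_count, embedding_index, and alloc_embedding_out are hypothetical and are not part of the repository.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: number of floats the caller must provide for `out`. */
static size_t embedding_out_count(int B, int T, int d_model) {
    return (size_t)B * (size_t)T * (size_t)d_model;
}

/* Hypothetical helper: flat offset of element [b][t][d] in a row-major
   B x T x d_model buffer. Equivalent to b*T*d_model + t*d_model + d. */
static size_t embedding_index(int b, int t, int d, int T, int d_model) {
    return ((size_t)b * (size_t)T + (size_t)t) * (size_t)d_model + (size_t)d;
}

/* Hypothetical helper: allocate a zero-initialized output buffer.
   Returns NULL on failure; the caller releases it with free(). */
static float* alloc_embedding_out(int B, int T, int d_model) {
    return (float*)calloc(embedding_out_count(B, T, d_model), sizeof(float));
}
```

With B = 2, T = 3, d_model = 4 the buffer holds 24 floats, and the last element [1][2][3] sits at offset 23.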

Implementation

void embeddings_forward(float* out, int* tokens, float* weight, int B, int T, int d_model){
    // A lookup table. Given a token ID, fetch its row from the weight matrix. Then scale by sqrt(d_model)
    float scale_factor = sqrt(d_model);
    for(int b = 0; b<B; b++){
        for(int t = 0; t<T; t++){
            int curr_token = tokens[b*T + t];
            // Look up its row in the weight matrix
            for(int row = 0; row < d_model; row++){
                // Scale each value by sqrt(d_model) and write to output
                out[b*T*d_model + t*d_model + row] = weight[curr_token*d_model + row] * scale_factor;
            }
        }
    }
}
The three nested loops are:
  1. Batch loop (b): iterates over each sequence in the batch independently.
  2. Sequence loop (t): steps through each token position. The token ID at position [b, t] is read as tokens[b*T + t].
  3. Dimension loop (row): copies one element of the embedding vector, applying the scale factor. The weight row for curr_token starts at weight[curr_token * d_model].

Why scale by √d_model?

The original “Attention Is All You Need” paper multiplies the embedding weights by sqrt(d_model) but does not state a reason. A commonly cited interpretation is that every sinusoidal positional value lies in [-1, 1], so if the embedding entries are small, scaling them up keeps the token signal from being swamped once the positional encoding is added. This is separate from the 1/sqrt(d_k) factor inside attention, where d_k = d_model / num_heads: that factor keeps the dot products between Q and K vectors from growing with dimension.

Positional encoding

A transformer processes all token positions simultaneously; unlike an RNN, it has no recurrence that naturally encodes order. Positional encoding compensates for this by adding a fixed, deterministic vector to each token’s embedding. The vectors are constructed using alternating sine and cosine functions at different frequencies, so every position gets a unique signal and nearby positions get similar signals.

The formulas

PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
  • pos: the token’s position in the sequence (0-indexed)
  • i: the pair index, stepping through 0 … d_model/2 - 1; dimensions 2i and 2i+1 share one frequency
  • Even indices get a sin value; odd indices get a cos value
  • The denominator 10000^(2i/d_model) creates a geometric progression of wavelengths, from 2π for the first pair (i = 0, highest frequency) up to nearly 10000 · 2π for the last pair (lowest frequency)

Function signature

void positional_encoding(float* x, int B, int T, int d_model);
x is modified in-place. There is no separate output buffer — the function adds the PE values directly on top of the existing embedding vectors.

Implementation

void positional_encoding(float* x, int B, int T, int d_model){
    // Transformers have no sense of order by default.
    // PE injects position information by adding a fixed vector to each token's embedding.
    // The values come from sin/cos formulas, not learned.

    for(int b = 0; b<B; b++){
        for(int t = 0; t<T; t++){
            for(int i = 0; i<d_model/2; i++){
                float den = pow(10000,(2.0f*i / d_model));
                float even = sin(t/den);
                float odd  = cos(t/den);
                x[b*T*d_model + t*d_model + 2*i]   += even;
                x[b*T*d_model + t*d_model + 2*i+1] += odd;
            }
        }
    }
}
The innermost loop runs d_model / 2 times — once per pair of (even, odd) dimensions. For each pair:
  1. The denominator den = 10000^(2i/d_model) is computed via pow. It depends only on i, but the loop order recomputes it for every batch element and position.
  2. sin(t / den) is computed for the even index 2*i.
  3. cos(t / den) is computed for the odd index 2*i + 1.
  4. Both are added (+=) rather than assigned, so the embedding values written by embeddings_forward are preserved.
positional_encoding is always called immediately after embeddings_forward on the same buffer. The embedding lookup overwrites out with plain assignments; positional encoding then adds the PE signal on top. In transformer_block, this two-step sequence happens independently for source tokens and target tokens before they enter their respective encoder or decoder stacks.
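Sinusoidal PE can be computed for any position, including positions beyond the longest training sequence, whereas a learned position-embedding table has a fixed number of rows. The original paper chose the sinusoidal form on the hypothesis that it may help the model extrapolate to longer sequences; the formulas guarantee only that the encoding is defined there, not that the model handles such sequences well.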
