Before any attention or feed-forward computation can happen, raw integer token IDs must be converted into dense floating-point vectors that the rest of the network can process. Then, because a transformer has no inherent sense of sequence order, fixed position signals must be injected into those vectors. These two steps, token embedding lookup and sinusoidal positional encoding, are the first things transformer_block executes on both the source and target token sequences.
Token embeddings
A token embedding is conceptually simple: maintain a matrix of shape (vocab_size, d_model) where each row is the learned vector for one vocabulary entry. Given a token ID, fetch that row. The only extra step here is a scale factor of sqrt(d_model), which keeps embedding magnitudes consistent with the scale of dot products computed later in attention.
Function signature
| Parameter | Shape | Description |
|---|---|---|
| out | B × T × d_model | Output buffer (caller-allocated) |
| tokens | B × T | Integer token IDs |
| weight | vocab_size × d_model | The embedding lookup table |
| B | scalar | Batch size |
| T | scalar | Sequence length |
| d_model | scalar | Embedding dimension |
Implementation
- Batch loop (b): iterates over each sequence in the batch independently.
- Sequence loop (t): steps through each token position. The token ID at position [b, t] is read as tokens[b*T + t].
- Dimension loop (row): copies one element of the embedding vector, applying the scale factor. The weight row for curr_token starts at weight[curr_token * d_model].
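The three loops described above can be sketched as a plain CPU reference. This is a minimal sketch, not the project's actual code: the signature is assumed from the parameter table, the loop variable names follow the list above, and the non-negativity check is an addition for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical CPU reference for embeddings_forward. The signature is an
// assumption derived from the parameter table (out, tokens, weight, B, T, d_model).
void embeddings_forward(float* out, const int* tokens, const float* weight,
                        int B, int T, int d_model) {
    const float scale = std::sqrt(static_cast<float>(d_model));
    for (int b = 0; b < B; ++b) {          // batch loop
        for (int t = 0; t < T; ++t) {      // sequence loop
            const int curr_token = tokens[b * T + t];
            // vocab_size is not a parameter, so only the lower bound can be checked.
            assert(curr_token >= 0);
            const float* src = weight + static_cast<std::size_t>(curr_token) * d_model;
            float* dst = out + (static_cast<std::size_t>(b) * T + t) * d_model;
            for (int row = 0; row < d_model; ++row) {  // dimension loop
                dst[row] = src[row] * scale;           // copy one element, scaled
            }
        }
    }
}
```

Because vocab_size is not passed in, a caller of a function shaped like this would need to validate token IDs against the table size before the call.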
Why scale by √d_model?
The dot products inside attention are computed between Q and K vectors of dimension d_k = d_model / num_heads. Dividing by sqrt(d_k) in the attention formula prevents those dot products from becoming very large in high dimensions. Scaling embeddings up by sqrt(d_model) at the input stage counterbalances this and keeps the magnitude of the embedded vectors consistent with the scale of later dot products. It also keeps the embedding values from being swamped by the positional encoding added next, whose entries all lie in [-1, 1]. The scaling itself comes from the original “Attention Is All You Need” paper, which multiplies the embedding weights by sqrt(d_model).
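The role of the 1/sqrt(d_k) factor can be made concrete with a short variance calculation. It assumes, purely for illustration, that the components of q and k are independent with zero mean and unit variance; real activations need not satisfy this.

```latex
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\!\left(\sum_{j=1}^{d_k} q_j k_j\right)
  = \sum_{j=1}^{d_k} \operatorname{Var}(q_j k_j)
  = d_k,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1 .
```

Under that assumption the unscaled dot product has a standard deviation of sqrt(d_k), which is why the attention formula divides by exactly that amount.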
Positional encoding
A transformer processes all token positions simultaneously; unlike an RNN, it has no recurrence that naturally encodes order. Positional encoding compensates for this by adding a fixed, deterministic vector to each token’s embedding. The vectors are constructed using alternating sine and cosine functions at different frequencies, so every position gets a unique signal and nearby positions get similar signals.
The formulas
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))
- pos: the token’s position in the sequence (0-indexed)
- i: the dimension index, stepping through 0 … d_model/2 - 1
- Even indices get a sin value; odd indices get a cos value
- The denominator 10000^(2i/d_model) creates a geometric progression of wavelengths, from very short (high-frequency) to very long (low-frequency)
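To see the geometric progression of wavelengths, the denominator can be turned into a wavelength per dimension pair. The helper below is hypothetical and not part of the documented API; it relies only on the fact that sin(pos / den) repeats every 2π · den positions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (illustration only): wavelength, measured in positions,
// of the sin/cos pair at each index i in 0 .. d_model/2 - 1.
std::vector<double> pe_wavelengths(int d_model) {
    assert(d_model > 0 && d_model % 2 == 0);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<double> wavelengths(d_model / 2);
    for (int i = 0; i < d_model / 2; ++i) {
        const double den = std::pow(10000.0, (2.0 * i) / d_model);
        wavelengths[i] = two_pi * den;  // sin(pos / den) has period 2*pi*den
    }
    return wavelengths;
}
```

For d_model = 4 this gives two wavelengths, 2π and 200π. In general each consecutive pair is longer than the previous one by the constant factor 10000^(2/d_model), which is what makes the progression geometric.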
Function signature
x is modified in-place. There is no separate output buffer — the function adds the PE values directly on top of the existing embedding vectors.
Implementation
The loop over i runs d_model / 2 times, once per pair of (even, odd) dimensions. For each pair:
- The denominator den = 10000^(2i/d_model) is computed once via pow.
- sin(t / den) is added to the even index 2*i.
- cos(t / den) is added to the odd index 2*i + 1.
- Both use += rather than plain assignment, so the embedding values written by embeddings_forward are preserved.
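The per-pair steps above can be sketched as a CPU reference. The parameter list (x, B, T, d_model) is an assumption, since the text only states that x is modified in place. The sketch also assumes an even d_model and that positions restart at 0 for every sequence in the batch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical CPU reference for positional_encoding. The parameter list is
// assumed; x holds B * T * d_model floats and is updated in place.
void positional_encoding(float* x, int B, int T, int d_model) {
    assert(d_model % 2 == 0);  // the (even, odd) pairing assumes an even d_model
    for (int b = 0; b < B; ++b) {
        for (int t = 0; t < T; ++t) {
            float* vec = x + (static_cast<std::size_t>(b) * T + t) * d_model;
            for (int i = 0; i < d_model / 2; ++i) {
                // Denominator computed once per pair.
                const float den = std::pow(10000.0f, (2.0f * i) / d_model);
                vec[2 * i]     += std::sin(t / den);  // even index gets sin
                vec[2 * i + 1] += std::cos(t / den);  // odd index gets cos
            }
        }
    }
}
```

At t = 0 every pair contributes sin(0) = 0 and cos(0) = 1, so the first position of each sequence only has its odd indices shifted by one.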
positional_encoding is always called immediately after embeddings_forward on the same buffer. The embedding lookup writes absolute values into out; positional encoding then adds the PE signal on top. In transformer_block, this two-step sequence happens independently for source tokens and target tokens before they enter their respective encoder or decoder stacks.