transformer_block is the function that assembles every component into a single executable forward pass. It takes raw integer token IDs for both the source and target sequences, runs them through embeddings and positional encoding, stacks N encoder blocks and N decoder blocks in sequence, applies a final linear projection, and outputs a probability distribution over the vocabulary for each token position. Nothing is hidden — every allocation, every function call, and every pointer hand-off is explicit in the source.
Function signature
out must be pre-allocated by the caller with size B * T * vocab_size floats. It receives the final softmax probabilities.
Data flow
Source token embedding + positional encoding
Source integer token IDs are converted to dense vectors and position signals are injected in-place:
After these two calls, src_embeddings_out holds a B × T × d_model tensor where each token vector encodes both semantic identity (from the lookup table) and position (from the sinusoidal PE).
N encoder blocks
The encoder stack is a simple loop. The first iteration reads from src_embeddings_out; every subsequent iteration reads from enc_out (the previous block’s output):
After the loop completes, enc_out holds the final encoder hidden states for all token positions.
Target token embedding + positional encoding
The same two-step process is repeated for the target sequence, completely independently of the source path:
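The two-step process used for both the source and target paths can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from model.cpp: the function names embed_tokens and add_positional_encoding are hypothetical, tensors are assumed to be row-major, and the position signal is assumed to follow the common sinusoidal formulation (sine on even dimensions, cosine on odd dimensions, base 10000).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: copy one embedding row per token ID.
// table is vocab_size x d_model (row-major); out is (B*T) x d_model.
void embed_tokens(const int* tokens, const float* table, float* out,
                  int B, int T, int d_model, int vocab_size) {
    for (int i = 0; i < B * T; ++i) {
        assert(tokens[i] >= 0 && tokens[i] < vocab_size);
        const float* row = table + static_cast<std::size_t>(tokens[i]) * d_model;
        for (int j = 0; j < d_model; ++j) {
            out[static_cast<std::size_t>(i) * d_model + j] = row[j];
        }
    }
}

// Hypothetical helper: add sinusoidal position signals in place.
// Assumed formula: PE(t, 2i) = sin(t / 10000^(2i/d_model)),
//                  PE(t, 2i+1) = cos(t / 10000^(2i/d_model)).
void add_positional_encoding(float* x, int B, int T, int d_model) {
    for (int b = 0; b < B; ++b) {
        for (int t = 0; t < T; ++t) {
            for (int j = 0; j < d_model; ++j) {
                const int pair = j / 2;  // index i shared by a sin/cos pair
                const float angle =
                    t / std::pow(10000.0f, 2.0f * pair / d_model);
                const float pe = (j % 2 == 0) ? std::sin(angle) : std::cos(angle);
                x[(static_cast<std::size_t>(b) * T + t) * d_model + j] += pe;
            }
        }
    }
}
```

Because the position signal depends only on t and j, the same values are added to every sequence in the batch, and the source and target paths can share both helpers.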
N decoder blocks
The decoder loop follows the same pattern as the encoder. Every decoder block receives enc_out as its cross-attention source:
The encoder output enc_out is read-only in this loop; it is computed once and reused by all N decoder blocks.
Linear projection
The decoder output is projected from d_model dimensions to vocab_size dimensions with a single matrix multiply:
projection_forward is a thin wrapper around matmul:
Softmax → output probabilities
A numerically stable softmax converts the raw logits into a probability distribution over the vocabulary for each token position:
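A numerically stable softmax is usually implemented by subtracting each row's maximum before exponentiating. The sketch below shows that technique on the CPU; the name softmax_rows and its signature are hypothetical and are not taken from model.cpp. Here rows would correspond to B * T and cols to vocab_size.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Row-wise softmax over `rows` rows of `cols` logits each (row-major).
// Subtracting the row maximum keeps exp() from overflowing on large logits;
// the result is mathematically identical to the unshifted softmax.
void softmax_rows(const float* logits, float* probs, int rows, int cols) {
    assert(rows >= 0 && cols > 0);
    for (int r = 0; r < rows; ++r) {
        const float* in = logits + static_cast<std::size_t>(r) * cols;
        float* out = probs + static_cast<std::size_t>(r) * cols;
        const float max_val = *std::max_element(in, in + cols);
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) {
            out[c] = std::exp(in[c] - max_val);
            sum += out[c];
        }
        for (int c = 0; c < cols; ++c) {
            out[c] /= sum;  // sum >= 1 because the max element maps to exp(0)
        }
    }
}
```

Without the shift, a logit of 1000 would overflow exp() in single precision and produce NaN after normalization; with it, the largest term is always exp(0) = 1.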
out now holds B × T × vocab_size floats. Each row of vocab_size values sums to 1.0 and represents the model’s predicted next-token probability distribution at that position.
Memory management
transformer_block allocates five buffers and frees all of them before returning:
| Buffer | Shape | Purpose |
|---|---|---|
| src_embeddings_out | B × T × d_model | Source token embeddings + PE |
| enc_out | B × T × d_model | Encoder stack output |
| tgt_embeddings_out | B × T × d_model | Target token embeddings + PE |
| dec_out | B × T × d_model | Decoder stack output |
| proj_out | B × T × vocab_size | Pre-softmax logits |
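The sizes in the table can be tallied with a small helper. This is only a sketch for reasoning about the scratch footprint: the WorkspaceSizes struct and workspace_sizes function are hypothetical, and the code counts floats rather than reproducing the allocation calls in model.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of the five scratch buffers, measured in floats.
struct WorkspaceSizes {
    std::size_t activation;  // each of the four B x T x d_model buffers
    std::size_t logits;      // proj_out: B x T x vocab_size
    std::size_t total;       // 4 * activation + logits
};

WorkspaceSizes workspace_sizes(int B, int T, int d_model, int vocab_size) {
    assert(B > 0 && T > 0 && d_model > 0 && vocab_size > 0);
    const std::size_t tokens = static_cast<std::size_t>(B) * T;
    WorkspaceSizes s{};
    s.activation = tokens * d_model;
    s.logits = tokens * vocab_size;
    s.total = 4 * s.activation + s.logits;
    return s;
}
```

With illustrative dimensions B = 1, T = 4, d_model = 4, and vocab_size = 5, this gives 16 floats per activation buffer, 20 floats for proj_out, and 84 floats in total. The caller-provided out buffer is not included, since transformer_block does not allocate it.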
Test case
The main() function in model.cpp exercises the full transformer with small, deterministic parameters:
src_tokens and tgt_tokens use the same tokens array and the same embedding weight matrix — a 5×4 near-identity matrix. All weight matrices are identity. This configuration allows the output to be verified by hand: any unexpected value is a bug in the index arithmetic, not in learned parameters.
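The reason identity weights make hand verification practical is that multiplying by an identity matrix leaves the activations unchanged, so each projection passes its input straight through. The sketch below illustrates that property; make_identity and matmul_naive are hypothetical helpers written for this example and are not the project's matmul.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: n x n identity matrix, row-major.
std::vector<float> make_identity(int n) {
    assert(n > 0);
    std::vector<float> m(static_cast<std::size_t>(n) * n, 0.0f);
    for (int i = 0; i < n; ++i) {
        m[static_cast<std::size_t>(i) * n + i] = 1.0f;
    }
    return m;
}

// Naive reference matmul: C (M x N) = A (M x K) * B (K x N), all row-major.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int M, int K, int N) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < K; ++k) {
            for (int j = 0; j < N; ++j) {
                C[static_cast<std::size_t>(i) * N + j] +=
                    A[static_cast<std::size_t>(i) * K + k] *
                    B[static_cast<std::size_t>(k) * N + j];
            }
        }
    }
    return C;
}
```

For any T × d_model activation matrix X, matmul_naive(X, make_identity(d_model), T, d_model, d_model) returns X exactly, which is why a wrong value in the test output points to indexing rather than to the weights.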
Expected output
The first token’s probability row (position 0) should be approximately:
Output from main() (the debug helper PrintOutputFlat captures the first 16 of the 20 output elements, so the full Token 3 row is not printed):
Utility functions
Two helper functions in model.cpp are available for debugging during development:
Both are written for the small test configuration (the d_model=4, T=4 test case) and are useful for checking that a specific component produces the right values before wiring it into the full forward pass.
In this test, all N encoder blocks share the same weight matrices Wq/Wk/Wv/Wo, and all N decoder blocks share Wq1/Wk1/Wv1/Wo1 and Wq2/Wk2/Wv2/Wo2. A production transformer has distinct learned weight matrices for each layer. To support that, transformer_block would need to accept weight arrays indexed by layer number, or be refactored into a loop that receives per-layer weight structs.
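One possible shape for the per-layer refactor is sketched below. Everything here is hypothetical: the EncoderLayerWeights struct, the run_encoder_stack function, and the block callable's signature are assumptions made for illustration, and the real encoder block is passed in as a parameter rather than reproduced. The loop keeps the chaining described earlier, where the first block reads the source embeddings and later blocks read enc_out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-layer weight bundle for one encoder block.
struct EncoderLayerWeights {
    const float* Wq;
    const float* Wk;
    const float* Wv;
    const float* Wo;
};

// Runs one block per entry in `layers`. `block` is any callable with the
// assumed signature:
//   block(const float* in, float* out, const EncoderLayerWeights& w, std::size_t n)
// From the second layer on, `in` and `out` both point at enc_out, so the
// block must tolerate aliasing (or use its own temporaries internally).
template <typename BlockFn>
void run_encoder_stack(const float* src_embeddings_out, float* enc_out,
                       const std::vector<EncoderLayerWeights>& layers,
                       std::size_t num_floats, BlockFn block) {
    assert(!layers.empty());
    for (std::size_t l = 0; l < layers.size(); ++l) {
        const float* in = (l == 0) ? src_embeddings_out : enc_out;
        block(in, enc_out, layers[l], num_floats);
    }
}
```

The number of layers is then implied by layers.size() rather than passed as a separate N, and a decoder version would follow the same pattern with a struct holding both the self-attention and cross-attention matrices.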