Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt

Use this file to discover all available pages before exploring further.

transformer_block is the function that assembles every component into a single executable forward pass. It takes raw integer token IDs for both the source and target sequences, runs them through embeddings and positional encoding, stacks N encoder blocks and N decoder blocks in sequence, applies a final linear projection, and outputs a probability distribution over the vocabulary for each token position. Nothing is hidden — every allocation, every function call, and every pointer hand-off is explicit in the source.

Function signature

void transformer_block(float* out, int* src_tokens, float* src_embed_weight,
                       int* tgt_tokens, float* tgt_embed_weight,
                       float* Wq,  float* Wk,  float* Wv,  float* Wo,   // encoder self-attn
                       float* Wq1, float* Wk1, float* Wv1, float* Wo1,  // decoder masked self-attn
                       float* Wq2, float* Wk2, float* Wv2, float* Wo2,  // decoder cross-attn
                       float* W1, float* b1, float* W2, float* b2,       // FFN weights
                       float* gamma1, float* beta1,  // layernorm 1
                       float* gamma2, float* beta2,  // layernorm 2
                       float* gamma3, float* beta3,  // layernorm 3 (decoder only)
                       float* proj_weight, int vocab_size,
                       float eps, int B, int T, int num_heads,
                       int d_model, int d_ff, int N);
The output buffer out must be pre-allocated by the caller with size B * T * vocab_size floats. It receives the final softmax probabilities.

Data flow

1

Source token embedding + positional encoding

Source integer token IDs are converted to dense vectors and position signals are injected in-place:
float* src_embeddings_out = new float[B * T * d_model];

embeddings_forward(src_embeddings_out, src_tokens, src_embed_weight, B, T, d_model);
positional_encoding(src_embeddings_out, B, T, d_model);
After these two calls, src_embeddings_out holds a B × T × d_model tensor where each token vector encodes both semantic identity (from the lookup table) and position (from the sinusoidal PE).
2

N encoder blocks

The encoder stack is a simple loop. The first iteration reads from src_embeddings_out; every subsequent iteration reads from enc_out (the previous block’s output):
float* enc_out = new float[B * T * d_model];

for(int n = 0; n < N; n++){
    if(n == 0){
        encoder_block(enc_out, src_embeddings_out,
                      Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                      gamma1, beta1, gamma2, beta2,
                      eps, B, T, num_heads, d_model, d_ff);
    } else {
        encoder_block(enc_out, enc_out,
                      Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                      gamma1, beta1, gamma2, beta2,
                      eps, B, T, num_heads, d_model, d_ff);
    }
}
After the loop completes, enc_out holds the final encoder hidden states for all token positions.
3

Target token embedding + positional encoding

The same two-step process is repeated for the target sequence, completely independently of the source path:
float* tgt_embeddings_out = new float[B * T * d_model];

embeddings_forward(tgt_embeddings_out, tgt_tokens, tgt_embed_weight, B, T, d_model);
positional_encoding(tgt_embeddings_out, B, T, d_model);
4

N decoder blocks

The decoder loop follows the same pattern as the encoder. Every decoder block receives enc_out as its cross-attention source:
float* dec_out = new float[B * T * d_model];

for(int n = 0; n < N; n++){
    if(n == 0){
        decoder_block(dec_out, tgt_embeddings_out, enc_out,
                      Wq1, Wk1, Wv1, Wo1,
                      Wq2, Wk2, Wv2, Wo2,
                      W1, b1, W2, b2,
                      gamma1, beta1, gamma2, beta2, gamma3, beta3,
                      eps, B, T, num_heads, d_model, d_ff);
    } else {
        decoder_block(dec_out, dec_out, enc_out,
                      Wq1, Wk1, Wv1, Wo1,
                      Wq2, Wk2, Wv2, Wo2,
                      W1, b1, W2, b2,
                      gamma1, beta1, gamma2, beta2, gamma3, beta3,
                      eps, B, T, num_heads, d_model, d_ff);
    }
}
The encoder output enc_out is read-only in this loop — it is computed once and reused by all N decoder blocks.
5

Linear projection

The decoder output is projected from d_model dimensions to vocab_size dimensions with a single matrix multiply:
float* proj_out = new float[B * T * vocab_size];
projection_forward(proj_out, dec_out, proj_weight, B, T, d_model, vocab_size);
projection_forward is a thin wrapper around matmul:
void projection_forward(float* out, float* x, float* W,
                        int B, int T, int d_model, int vocab_size){
    matmul(x, W, nullptr, out, B*T, d_model, vocab_size);
}
6

Softmax → output probabilities

A numerically stable softmax converts the raw logits into a probability distribution over the vocabulary for each token position:
softmax(out, proj_out, B, T, vocab_size);
out now holds B × T × vocab_size floats. Each row of vocab_size values sums to 1.0 and represents the model’s predicted next-token probability distribution at that position.

Memory management

transformer_block allocates five buffers and frees all of them before returning:
BufferShapePurpose
src_embeddings_outB × T × d_modelSource token embeddings + PE
enc_outB × T × d_modelEncoder stack output
tgt_embeddings_outB × T × d_modelTarget token embeddings + PE
dec_outB × T × d_modelDecoder stack output
proj_outB × T × vocab_sizePre-softmax logits
// At end of transformer_block:
delete[] src_embeddings_out;
delete[] enc_out;
delete[] tgt_embeddings_out;
delete[] dec_out;
delete[] proj_out;
Each encoder and decoder block also allocates and frees its own internal buffers independently. The peak memory usage at any given moment is the five buffers above plus the internal buffers of a single block — the blocks free their memory before returning, so they do not accumulate.

Test case

The main() function in model.cpp exercises the full transformer with small, deterministic parameters:
int B = 1, T = 4, d_model = 4, d_ff = 16;
int num_heads = 2, vocab_size = 5, N = 2;

int tokens[] = {2, 0, 3, 1};

// All QKV and FFN weights are identity matrices
float Wq[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
// ... (Wk, Wv, Wo, Wq1, ... all identity)

// Projection: maps first 4 of 5 vocab dims, last col is zero
float proj_weight[4 * 5] = {
    1, 0, 0, 0, 0,
    0, 1, 0, 0, 0,
    0, 0, 1, 0, 0,
    0, 0, 0, 1, 0
};

float transformer_out[1 * 4 * 5] = {0};

transformer_block(transformer_out, tokens, weight, tokens, weight, ...);
Both src_tokens and tgt_tokens use the same tokens array and the same embedding weight matrix — a 5×4 near-identity matrix. All weight matrices are identity. This configuration allows the output to be verified by hand: any unexpected value is a bug in the index arithmetic, not in learned parameters.

Expected output

The first token’s probability row (position 0) should be approximately:
[0.0565683, 0.0630902, 0.667757, 0.0911744, 0.121411]
These five values sum to approximately 1.0. The values for all four token positions, as logged by main() (the debug helper PrintOutputFlat captures the first 16 of the 20 output elements, so the full Token 3 row is not printed):
Token 0:  0.0565683  0.0630902  0.667757   0.0911744  0.121411
Token 1:  0.554341   0.0482132  0.0564849  0.207866   0.133096
Token 2:  0.112971   0.0572973  0.0554605  0.650646   0.123625
Token 3:  0.0814983  (remaining vocab positions not captured by the debug helper)

Utility functions

Two helper functions in model.cpp are available for debugging during development:
// Prints a 4×4 output as a 2D matrix — better for visual inspection
void PrintOutputMatrix(float* weight, float* arr);

// Prints weight-output pairs on separate lines — easier for value-by-value comparison
void PrintOutputFlat(float* weight, float* arr);
Both functions print 16 elements (hard-coded for the d_model=4, T=4 test case) and are useful for checking that a specific component produces the right values before wiring it into the full forward pass.
In this test, all N encoder blocks share the same weight matrices Wq/Wk/Wv/Wo, and all N decoder blocks share Wq1/Wk1/Wv1/Wo1 and Wq2/Wk2/Wv2/Wo2. A production transformer has distinct learned weight matrices for each layer. To support that, transformer_block would need to accept weight arrays indexed by layer number, or be refactored into a loop that receives per-layer weight structs.
To port this to CUDA: replace every new float[size]() with cudaMalloc, replace the three-loop matmul with a __global__ kernel that assigns one output cell per thread, and replace softmax with a parallel reduction kernel. The function signatures and data flow stay exactly the same — only the memory allocator and the compute kernels change.

Build docs developers (and LLMs) love