There is no “2D” or “3D” in hardware. Memory is one long line of boxes. A matrix is a human concept: a way of grouping a run of numbers into rows and columns so the math is easier to reason about. CPU and GPU memory are identical in this regard: when you allocate a float*, you get a pointer to a contiguous block of floats with no structure at all. Computing the flat offset for any position in an N-dimensional tensor is the one prerequisite for reading or writing any kernel in this module. Without it, every loop body is opaque.

The Core Index Rule

To access element (i1, i2, i3) in a tensor stored in row-major order with shape (D1, D2, D3), the flat offset is:
flat_offset = i1 * (D2 * D3) + i2 * D3 + i3
Each index is multiplied by the product of all dimensions to its right. This generalises to any number of dimensions. The rightmost index is multiplied by 1 (no dimensions to its right). The leftmost index is multiplied by the product of every other dimension.
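As a sanity check on the rule above, here is a minimal sketch in C. The names flat_offset3 and count_layout_mismatches are hypothetical helpers introduced for illustration and are not part of the module. The second function copies a nested C array (which the language itself lays out in row-major order) into a flat buffer and counts how often the formula disagrees with that layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: flat offset of (i1, i2, i3) in a row-major tensor
   of shape (D1, D2, D3). D1 is not needed because the leftmost dimension
   never appears in any stride. */
size_t flat_offset3(size_t i1, size_t i2, size_t i3, size_t D2, size_t D3) {
    assert(i2 < D2 && i3 < D3);
    return i1 * (D2 * D3) + i2 * D3 + i3;
}

/* Hypothetical check: returns how many positions disagree between the
   formula and the compiler's own layout of a nested array. */
int count_layout_mismatches(void) {
    enum { D1 = 2, D2 = 3, D3 = 4 };
    float nested[D1][D2][D3];
    float flat[D1 * D2 * D3];
    float next = 0.0f;

    /* Give every element a distinct value. */
    for (size_t i = 0; i < D1; i++)
        for (size_t j = 0; j < D2; j++)
            for (size_t k = 0; k < D3; k++) {
                nested[i][j][k] = next;
                next += 1.0f;
            }

    /* Copy the raw bytes into a flat buffer with no structure. */
    memcpy(flat, nested, sizeof flat);

    int mismatches = 0;
    for (size_t i = 0; i < D1; i++)
        for (size_t j = 0; j < D2; j++)
            for (size_t k = 0; k < D3; k++)
                if (flat[flat_offset3(i, j, k, D2, D3)] != nested[i][j][k])
                    mismatches++;
    return mismatches;
}
```

If the rule holds, count_layout_mismatches() returns 0: the nested array and the flat formula describe the same bytes.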

Examples From the Kernels

These three index patterns appear directly in matmul.cpp and are used unchanged throughout the whole codebase:

2D, shape (M, K): a simple matrix
Element (m, k) lives at:
m * K + k
A 3×3 matrix of values 1–9 is stored in memory as nine consecutive floats:
// "Matrix" A, shape (3, 3) — stored as a flat array
float A[] = {1, 2, 3,   // row 0
             4, 5, 6,   // row 1
             7, 8, 9};  // row 2

// Element at row 1, column 2 → offset = 1*3 + 2 = 5
float val = A[1 * 3 + 2];  // val == 6

3D, shape (B, T, C): batch of token sequences
Element (b, t, c) lives at:
b * T * C + t * C + c
This is the shape of every activation tensor in a transformer. B is the batch size, T is the sequence length, C is the embedding dimension. To jump to the start of token (b, t)’s full C-dimensional vector, drop the + c and keep a pointer:
float* x_bt = x + b * T * C + t * C;
// x_bt[i] now accesses x[b, t, i] for i in 0..C-1
This pointer-into-flat-array pattern appears in every function in layernorm.cpp, softmax.cpp, and train_gpt2_annotated.c.

4D, shape (B, NH, T, T): the attention weight matrix
Element (b, h, t1, t2) lives at:
b * NH * T * T + h * T * T + t1 * T + t2
This is the shape of the preatt and att tensors in attention_forward. To get a pointer to the row of attention scores for a fixed (b, h, t1), spanning all key positions t2:
float* att_bth = att + b * NH * T * T + h * T * T + t1 * T;
// att_bth[t2] is the attention score from query t1 to key t2
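
The row-pointer pattern used for x_bt and att_bth can be wrapped in small helpers and checked on tiny tensors. The following is a sketch only: token_row, att_row, and row_sum are hypothetical names chosen for illustration, not functions from the module.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pointer to the C-element vector of token (b, t)
   in a tensor x of shape (B, T, C). */
float *token_row(float *x, size_t b, size_t t, size_t T, size_t C) {
    assert(x != NULL);
    return x + b * T * C + t * C;
}

/* Hypothetical helper: pointer to the T attention scores of query t1 for
   batch b and head h, in a tensor att of shape (B, NH, T, T). */
float *att_row(float *att, size_t b, size_t h, size_t t1, size_t NH, size_t T) {
    assert(att != NULL);
    return att + b * NH * T * T + h * T * T + t1 * T;
}

/* Sum of n consecutive floats starting at row. Once the row pointer is
   computed, the inner loop needs only a single index. */
float row_sum(const float *row, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += row[i];
    return total;
}
```

The design point is that the stride arithmetic happens once, outside the inner loop; inside the loop the row behaves like an ordinary 1D array.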

The Derivation Method

To find the flat index for any element, ask one question per dimension: what is the product of all dimensions to the right of this one? That product is the stride for this dimension’s index. Multiply each index by its stride and sum them.
For shape (D1, D2, D3, D4):
| Index | Stride |
| --- | --- |
| i1 | D2 * D3 * D4 |
| i2 | D3 * D4 |
| i3 | D4 |
| i4 | 1 |
Flat offset = i1*(D2*D3*D4) + i2*(D3*D4) + i3*D4 + i4.
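
The derivation method can also be written as code for any number of dimensions. This is a minimal sketch under the assumption that shapes and indices are passed as size_t arrays; row_major_strides and flat_index are hypothetical names, not functions from the module.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: strides[d] becomes the product of all dimensions
   to the right of d. Walks right to left while keeping a running product. */
void row_major_strides(const size_t *shape, size_t *strides, size_t ndim) {
    assert(shape != NULL && strides != NULL);
    size_t running = 1;
    for (size_t d = ndim; d-- > 0; ) {
        strides[d] = running;   /* the rightmost dimension gets stride 1 */
        running *= shape[d];
    }
}

/* Hypothetical helper: multiply each index by its stride and sum. */
size_t flat_index(const size_t *idx, const size_t *strides, size_t ndim) {
    size_t offset = 0;
    for (size_t d = 0; d < ndim; d++)
        offset += idx[d] * strides[d];
    return offset;
}
```

For shape (2, 3, 4, 5) this yields strides (60, 20, 5, 1), so index (1, 2, 3, 4) maps to 60 + 40 + 15 + 4 = 119, the last of the 120 elements.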

Why This Matters for GPU

A CUDA kernel does not loop over every output position; it launches threads. Each thread derives its own output position (m, n) from its block and thread indices, and it uses m * N + n to find where to write its result. The index formula is identical to the CPU version. The only thing that changes between a CPU kernel and a GPU kernel is the execution model: instead of a for (int m ...) loop, each iteration runs as a separate thread in parallel.

This is the deeper reason why the kernels in this module are written in plain C loops before being ported to CUDA. In the simplest port, each iteration of the loops over output positions becomes one CUDA thread. Understanding the index math at the CPU level means the GPU port is largely a mechanical substitution of loop variables with blockIdx and threadIdx expressions.
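
The loop-to-thread correspondence can be emulated on the CPU. The sketch below is plain C, not CUDA, and rests on assumptions that are not taken from the module: the function names are hypothetical, the second operand is assumed to be stored row-major with shape (K, N), and a single linear id tid stands in for the value a real kernel would typically compute as blockIdx.x * blockDim.x + threadIdx.x.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU stand-in for the body of one GPU thread.
   a has shape (M, K), b is assumed to have shape (K, N), out has shape (M, N).
   tid plays the role of the thread's global linear index. */
void matmul_one_output(const float *a, const float *b, float *out,
                       size_t M, size_t K, size_t N, size_t tid) {
    assert(a != NULL && b != NULL && out != NULL);
    if (tid >= M * N) return;      /* surplus threads do nothing */
    size_t m = tid / N;            /* recover the output row */
    size_t n = tid % N;            /* recover the output column */
    float acc = 0.0f;
    for (size_t k = 0; k < K; k++)
        acc += a[m * K + k] * b[k * N + n];
    out[m * N + n] = acc;          /* same flat index as the CPU loop */
}

/* Emulates a launch: a serial loop over ids that a GPU would run in parallel. */
void matmul_emulated(const float *a, const float *b, float *out,
                     size_t M, size_t K, size_t N) {
    for (size_t tid = 0; tid < M * N; tid++)
        matmul_one_output(a, b, out, M, K, N, tid);
}
```

Each call to matmul_one_output writes exactly one element and reads no other thread's output, which is why the iterations can run in any order or all at once.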
