There is no “2D” or “3D” in hardware. Memory is one long line of boxes. A matrix is a human concept, a way of grouping a run of numbers into rows and columns so the math is easier to reason about. Both CPU and GPU memory are identical in this regard: when you allocate a float*, you get a pointer to a contiguous block of floats with no structure at all. Understanding how to compute the flat offset for any position in an N-dimensional tensor is the single prerequisite for reading or writing any kernel in this module. Without it, every loop body is opaque.
The Core Index Rule
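To access element (i1, i2, i3) in a tensor stored in row-major order with shape (D1, D2, D3), the flat offset is i1 * (D2 * D3) + i2 * D3 + i3. Each index is multiplied by the product of all dimensions to its right, so D1 never appears in the formula.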
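As a minimal sketch of the rule for a 3D tensor, the helper below computes the flat offset directly. The name offset3d and the use of std::size_t are assumptions made for illustration; they are not taken from the codebase.

```cpp
#include <cstddef>

// Flat offset of element (i1, i2, i3) in a row-major tensor of shape (D1, D2, D3).
// D1 is not a parameter: the outermost dimension never affects the offset.
// offset3d is a hypothetical helper name used only for this illustration.
inline std::size_t offset3d(std::size_t i1, std::size_t i2, std::size_t i3,
                            std::size_t D2, std::size_t D3) {
    return i1 * (D2 * D3) + i2 * D3 + i3;
}
```

For shape (2, 3, 4), offset3d(1, 2, 3, 3, 4) evaluates to 1 * 12 + 2 * 4 + 3 = 23, which is the last of the 24 elements.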
Examples From the Kernels
These three index patterns appear directly in matmul.cpp and are used unchanged throughout the whole codebase:
2D — shape (M, K): a simple matrix
Element (m, k) lives at flat offset m * K + k.
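To see the 2D pattern with two different shapes at once, here is a small sketch of a transpose. The function name transpose and the use of std::vector are assumptions for illustration; the kernels in this module work on raw float pointers.

```cpp
#include <cstddef>
#include <vector>

// Transpose a row-major (M, K) matrix into a row-major (K, M) matrix.
// Hypothetical helper, written only to show the two index formulas side by side.
std::vector<float> transpose(const std::vector<float>& a, std::size_t M, std::size_t K) {
    std::vector<float> out(M * K);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t k = 0; k < K; ++k) {
            // Read (m, k) from shape (M, K): row length is K.
            // Write (k, m) into shape (K, M): row length is M.
            out[k * M + m] = a[m * K + k];
        }
    }
    return out;
}
```

The row length used in each formula is always the size of the last dimension of the tensor being indexed, which is why the input uses K and the output uses M.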
3D — shape (B, T, C): batch of token sequences
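Element (b, t, c) lives at flat offset b * (T * C) + t * C + c.
B is the batch size, T is the sequence length, C is the embedding dimension. To jump to the start of token (b, t)’s full C-dimensional vector, drop the + c and keep a pointer to offset b * (T * C) + t * C. The C floats starting there are that token's vector.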
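The pointer trick can be sketched as follows. The function token_mean and the parameter name x are hypothetical; computing a mean is just one plausible use of a per-token row pointer, not a copy of any kernel in the module.

```cpp
#include <cstddef>

// Mean of the C-dimensional vector for token (b, t) in a row-major (B, T, C) tensor.
// token_mean and x are hypothetical names chosen for this illustration.
float token_mean(const float* x, std::size_t b, std::size_t t,
                 std::size_t T, std::size_t C) {
    // Drop the "+ c" term: row points at element (b, t, 0).
    const float* row = x + b * (T * C) + t * C;
    float sum = 0.0f;
    for (std::size_t c = 0; c < C; ++c) {
        sum += row[c];  // same element as x[b * T * C + t * C + c]
    }
    return sum / static_cast<float>(C);
}
```

Hoisting the row pointer out of the inner loop keeps the loop body down to a single index, which is also how the formula is easiest to read.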
This pattern appears in layernorm.cpp, softmax.cpp, and train_gpt2_annotated.c.
4D — shape (B, NH, T, T): the attention weight matrix
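Element (b, h, t1, t2) lives at flat offset b * (NH * T * T) + h * (T * T) + t1 * T + t2, where NH is the number of attention heads.
This is the layout of the preatt and att tensors in attention_forward. For a fixed (b, h, t1), the row of attention scores across all key positions t2 starts at offset b * (NH * T * T) + h * (T * T) + t1 * T and spans T contiguous floats.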
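A rough sketch of working on one such row is shown below. The helper names attention_row and softmax_row are assumptions, and the softmax here covers the full row with no causal mask, which may differ from what attention_forward actually does.

```cpp
#include <cmath>
#include <cstddef>

// Pointer to the T scores for query position t1 of head h in batch b,
// inside a row-major (B, NH, T, T) tensor. Hypothetical helper name.
float* attention_row(float* att, std::size_t b, std::size_t h, std::size_t t1,
                     std::size_t NH, std::size_t T) {
    return att + b * (NH * T * T) + h * (T * T) + t1 * T;
}

// In-place softmax over one row of T scores (T must be at least 1).
// Assumption: no causal mask is applied, to keep the example short.
void softmax_row(float* row, std::size_t T) {
    float maxval = row[0];
    for (std::size_t t2 = 1; t2 < T; ++t2) {
        if (row[t2] > maxval) maxval = row[t2];
    }
    float sum = 0.0f;
    for (std::size_t t2 = 0; t2 < T; ++t2) {
        row[t2] = std::exp(row[t2] - maxval);  // subtract max for numerical stability
        sum += row[t2];
    }
    for (std::size_t t2 = 0; t2 < T; ++t2) {
        row[t2] /= sum;
    }
}
```

Once the row pointer is fixed, the remaining loop only ever touches t2, so the 4D structure disappears from the loop body entirely.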
The Derivation Method
For shape (D1, D2, D3, D4), the stride of each index is the product of all dimensions to its right:
| Index | Stride |
|---|---|
| i1 | D2 * D3 * D4 |
| i2 | D3 * D4 |
| i3 | D4 |
| i4 | 1 |
The flat offset is the sum of each index multiplied by its stride: i1 * (D2 * D3 * D4) + i2 * (D3 * D4) + i3 * D4 + i4.
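The same derivation works for any number of dimensions. The sketch below builds the stride table from a shape and then sums index times stride. Both function names are hypothetical, and std::vector is used only for convenience.

```cpp
#include <cstddef>
#include <vector>

// Row-major strides: strides[d] is the product of all dimensions to the right of d.
// Hypothetical helper for illustration; the last stride is always 1.
std::vector<std::size_t> row_major_strides(const std::vector<std::size_t>& shape) {
    std::vector<std::size_t> strides(shape.size(), 1);
    for (std::size_t d = shape.size(); d > 1; --d) {
        // Walk right to left, accumulating the running product.
        strides[d - 2] = strides[d - 1] * shape[d - 1];
    }
    return strides;
}

// Flat offset = sum over dimensions of index[d] * strides[d].
std::size_t flat_offset(const std::vector<std::size_t>& index,
                        const std::vector<std::size_t>& strides) {
    std::size_t offset = 0;
    for (std::size_t d = 0; d < index.size(); ++d) {
        offset += index[d] * strides[d];
    }
    return offset;
}
```

For shape (2, 3, 4, 5) the strides come out as (60, 20, 5, 1), and index (1, 2, 3, 4) maps to 60 + 40 + 15 + 4 = 119, the last of the 120 elements.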
Why This Matters for GPU
A CUDA kernel does not loop over every output position; it runs as many threads. In the simplest mapping, each thread derives its own output position (m, n) from the built-in blockIdx, blockDim, and threadIdx values, and it uses m * N + n to find where to write its result. The index formula is identical to the CPU version. The only thing that changes between a CPU kernel and a GPU kernel is the execution model: instead of a for (int m ...) loop running its iterations one after another, each iteration runs as a separate thread in parallel.
This is the deeper reason why the kernels in this module are written in plain C loops before being ported to CUDA. Every loop iteration corresponds to exactly one future CUDA thread. Understanding the index math at the CPU level means the GPU port is a mechanical substitution of loop variables with blockIdx and threadIdx expressions.
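The substitution can be imitated on the CPU without any CUDA toolchain. The sketch below is plain C++ that mimics the thread grid: Idx2 is a hypothetical stand-in for CUDA's built-in index types, and the matrix layouts (a is (M, K), b is (K, N), out is (M, N), all row-major) are assumptions rather than the layout used in matmul.cpp.

```cpp
#include <cstddef>

// Hypothetical CPU-only stand-in for CUDA's blockIdx / blockDim / threadIdx.
struct Idx2 { std::size_t x, y; };

// Body of one "thread": computes output element (m, n) of out = a * b.
// Assumed layouts: a is (M, K), b is (K, N), out is (M, N), all row-major.
void matmul_thread(const float* a, const float* b, float* out,
                   std::size_t M, std::size_t K, std::size_t N,
                   Idx2 blockIdx, Idx2 blockDim, Idx2 threadIdx) {
    // The loop variables m and n are replaced by index expressions.
    std::size_t m = blockIdx.y * blockDim.y + threadIdx.y;
    std::size_t n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;  // threads past the matrix edge do nothing
    float acc = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
        acc += a[m * K + k] * b[k * N + n];
    }
    out[m * N + n] = acc;  // same flat offset as the CPU loop version
}

// Sequential "launch": visits every (block, thread) pair one after another.
// On a GPU these calls would run in parallel instead.
void launch_matmul(const float* a, const float* b, float* out,
                   std::size_t M, std::size_t K, std::size_t N, Idx2 blockDim) {
    Idx2 gridDim = { (N + blockDim.x - 1) / blockDim.x,
                     (M + blockDim.y - 1) / blockDim.y };
    for (std::size_t by = 0; by < gridDim.y; ++by)
        for (std::size_t bx = 0; bx < gridDim.x; ++bx)
            for (std::size_t ty = 0; ty < blockDim.y; ++ty)
                for (std::size_t tx = 0; tx < blockDim.x; ++tx) {
                    Idx2 block = { bx, by };
                    Idx2 thread = { tx, ty };
                    matmul_thread(a, b, out, M, K, N, block, blockDim, thread);
                }
}
```

The bounds check matters because the grid is rounded up to whole blocks, so some threads land outside the (M, N) output and must exit without writing.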