Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt

Use this file to discover all available pages before exploring further.

Matrix multiplication is the most-executed operation in any transformer. Every linear projection — the QKV weight matrices, the output projection, both layers of the feed-forward block — is a matmul. In PyTorch that single call to F.linear hides a highly tuned kernel. Here, there is no hiding: the kernel is three nested loops, a local accumulator, and one memory write per output element. Understanding this baseline is the foundation for understanding what cuBLAS and Tensor Cores actually accelerate.

Function Signature

void matmul(float* A, float* B, float* bias, float* out, int M, int K, int N);
ParameterShapeDescription
A(M, K)Left matrix — M rows, K columns
B(K, N)Right matrix — K rows, N columns
bias(N,)Optional bias vector; nullptr to skip
out(M, N)Output matrix — M rows, N columns
MNumber of output rows
KShared inner dimension (contraction axis)
NNumber of output columns
All four arrays are flat 1D float* buffers. The 2D structure is encoded entirely in the index arithmetic inside the function.

Full Implementation

void matmul(float* A, float* B, float* bias, float* out, int M, int K, int N) {

    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float val = (bias != nullptr) ? bias[n] : 0.0f;
            for (int k = 0; k < K; k++) {
                val += A[m * K + k] * B[k * N + n];
            }
            out[m * N + n] = val;
        }
    }

}

The Accumulator Pattern

The local variable val is the accumulator. It is initialised once — either to bias[n] or to 0.0f — and then updated K times inside the inner loop. Only after all K multiplications are done does the result get written to out[m*N + n]. This matters because writes to main memory are expensive. By accumulating in a register-sized local variable, the kernel performs exactly one write to the output array per (m, n) output element, regardless of how large K is.

Optional Bias

float val = (bias != nullptr) ? bias[n] : 0.0f;
The bias is a 1D vector of shape (N,). Its index is always just n — the output column. Initialising the accumulator with bias[n] before the inner loop means the bias addition costs zero extra passes over the data.

Flat Index Formulas

ArrayShapeElement (r, c) → flat offset
A(M, K)m * K + k
B(K, N)k * N + n
out(M, N)m * N + n
Each formula follows the same rule: the first index is multiplied by the size of the second dimension (the number of columns), then the column index is added.

Worked Example

int main() {
    float A[]    = {1,2,3, 4,5,6, 7,8,9};  // shape (3,3), flat
    float B[]    = {1,2,3, 4,5,6, 7,8,9};
    float bias[] = {1,2,3};                  // shape (3,), one per output column
    float out[9] = {0};                      // shape (3,3), initialised to 0

    matmul(A, B, bias, out, 3, 3, 3);

    for (int i = 0; i < 9; i++) {
        cout << out[i] << endl;
    }
}
Expected output (printed row by row):
31
38
45
67
83
99
103
128
153
Row 0 of the output is [31, 38, 45]. Without bias that would be [30, 36, 42]. The bias [1, 2, 3] adds one element per column, giving [31, 38, 45]. Rows 1 and 2 follow the same logic.
This is the same pattern a CUDA kernel uses. In a GPU version you assign one (m, n) output position per thread — the outer two loops disappear and become threadIdx / blockIdx coordinates. The inner loop over k remains, and the same accumulator pattern applies. The index formulas m*K+k and k*N+n do not change at all.

Build docs developers (and LLMs) love