Matrix Multiplication: Core of Every Neural Network Layer

Matrix multiplication is the most-executed operation in any transformer. Every linear projection — the QKV weight matrices, the output projection, both layers of the feed-forward block — is a matmul. In PyTorch that single call to F.linear hides a highly tuned kernel. Here, there is no hiding: the kernel is three nested loops, a local accumulator, and one memory write per output element. Understanding this baseline is the foundation for understanding what cuBLAS and Tensor Cores actually accelerate.

Function Signature

void matmul(float* A, float* B, float* bias, float* out, int M, int K, int N);

Parameter	Shape	Description
`A`	`(M, K)`	Left matrix — M rows, K columns
`B`	`(K, N)`	Right matrix — K rows, N columns
`bias`	`(N,)`	Optional bias vector; `nullptr` to skip
`out`	`(M, N)`	Output matrix — M rows, N columns
`M`	—	Number of output rows
`K`	—	Shared inner dimension (contraction axis)
`N`	—	Number of output columns

All four arrays are flat 1D float* buffers. The 2D structure is encoded entirely in the index arithmetic inside the function.

Full Implementation

void matmul(float* A, float* B, float* bias, float* out, int M, int K, int N) {

    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float val = (bias != nullptr) ? bias[n] : 0.0f;
            for (int k = 0; k < K; k++) {
                val += A[m * K + k] * B[k * N + n];
            }
            out[m * N + n] = val;
        }
    }

}

The Accumulator Pattern

The local variable val is the accumulator. It is initialised once — either to bias[n] or to 0.0f — and then updated K times inside the inner loop. Only after all K multiplications are done does the result get written to out[m*N + n]. This matters because writes to main memory are expensive. By accumulating in a register-sized local variable, the kernel performs exactly one write to the output array per (m, n) output element, regardless of how large K is.

Optional Bias

float val = (bias != nullptr) ? bias[n] : 0.0f;

The bias is a 1D vector of shape (N,). Its index is always just n — the output column. Initialising the accumulator with bias[n] before the inner loop means the bias addition costs zero extra passes over the data.

Flat Index Formulas

Array	Shape	Element `(r, c)` → flat offset
`A`	`(M, K)`	`m * K + k`
`B`	`(K, N)`	`k * N + n`
`out`	`(M, N)`	`m * N + n`

Each formula follows the same rule: the first index is multiplied by the size of the second dimension (the number of columns), then the column index is added.

Worked Example

int main() {
    float A[]    = {1,2,3, 4,5,6, 7,8,9};  // shape (3,3), flat
    float B[]    = {1,2,3, 4,5,6, 7,8,9};
    float bias[] = {1,2,3};                  // shape (3,), one per output column
    float out[9] = {0};                      // shape (3,3), initialised to 0

    matmul(A, B, bias, out, 3, 3, 3);

    for (int i = 0; i < 9; i++) {
        cout << out[i] << endl;
    }
}

Expected output (printed row by row):

Row 0 of the output is [31, 38, 45]. Without bias that would be [30, 36, 42]. The bias [1, 2, 3] adds one element per column, giving [31, 38, 45]. Rows 1 and 2 follow the same logic.

This is the same pattern a CUDA kernel uses. In a GPU version you assign one (m, n) output position per thread — the outer two loops disappear and become threadIdx / blockIdx coordinates. The inner loop over k remains, and the same accumulator pattern applies. The index formulas m*K+k and k*N+n do not change at all.

Overview

C++ Core

LLM Kernels in C

GPU Fundamentals

Transformer in C++

Matrix Multiplication: Core of Every Neural Network Layer

Function Signature

Full Implementation

The Accumulator Pattern

Optional Bias

Flat Index Formulas

Worked Example

Build docs developers (and LLMs) love

Overview

C++ Core

LLM Kernels in C

GPU Fundamentals

Transformer in C++

Documentation Index

​Function Signature

​Full Implementation

​The Accumulator Pattern

​Optional Bias

​Flat Index Formulas

​Worked Example

Build docs developers (and LLMs) love

Function Signature

Full Implementation

The Accumulator Pattern

Optional Bias

Flat Index Formulas

Worked Example