Matrix multiplication is the most-executed operation in any transformer. Every linear projection — the QKV weight matrices, the output projection, both layers of the feed-forward block — is a matmul. In PyTorch that single call toDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt
Use this file to discover all available pages before exploring further.
F.linear hides a highly tuned kernel. Here, there is no hiding: the kernel is three nested loops, a local accumulator, and one memory write per output element. Understanding this baseline is the foundation for understanding what cuBLAS and Tensor Cores actually accelerate.
Function Signature
| Parameter | Shape | Description |
|---|---|---|
A | (M, K) | Left matrix — M rows, K columns |
B | (K, N) | Right matrix — K rows, N columns |
bias | (N,) | Optional bias vector; nullptr to skip |
out | (M, N) | Output matrix — M rows, N columns |
M | — | Number of output rows |
K | — | Shared inner dimension (contraction axis) |
N | — | Number of output columns |
float* buffers. The 2D structure is encoded entirely in the index arithmetic inside the function.
Full Implementation
The Accumulator Pattern
The local variableval is the accumulator. It is initialised once — either to bias[n] or to 0.0f — and then updated K times inside the inner loop. Only after all K multiplications are done does the result get written to out[m*N + n]. This matters because writes to main memory are expensive. By accumulating in a register-sized local variable, the kernel performs exactly one write to the output array per (m, n) output element, regardless of how large K is.
Optional Bias
(N,). Its index is always just n — the output column. Initialising the accumulator with bias[n] before the inner loop means the bias addition costs zero extra passes over the data.
Flat Index Formulas
| Array | Shape | Element (r, c) → flat offset |
|---|---|---|
A | (M, K) | m * K + k |
B | (K, N) | k * N + n |
out | (M, N) | m * N + n |
Worked Example
[31, 38, 45]. Without bias that would be [30, 36, 42]. The bias [1, 2, 3] adds one element per column, giving [31, 38, 45]. Rows 1 and 2 follow the same logic.
This is the same pattern a CUDA kernel uses. In a GPU version you assign one
(m, n) output position per thread — the outer two loops disappear and become threadIdx / blockIdx coordinates. The inner loop over k remains, and the same accumulator pattern applies. The index formulas m*K+k and k*N+n do not change at all.