Example: Matrix Multiplication on the GPU with cudaz

This example demonstrates a 2D launch configuration — the key step up from the 1D “one thread per element” model used in the increment example. A matrix multiplication kernel assigns one output cell C[ROW][COL] to each thread in a 2D block, so the grid and block dimensions map directly onto the rows and columns of the matrix. The kernel and test are taken from the cudaz test suite (test/general.zig) and adapted here as a standalone program.

The CUDA Kernel

extern "C" __global__ void matmul(float* A, float* B, float* C, const int N) {
    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    int COL = blockIdx.x*blockDim.x+threadIdx.x;
    float tmpSum = 0;
    if (ROW < N && COL < N) {
        for (int i = 0; i < N; i++) {
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
    }
    C[ROW * N + COL] = tmpSum;
}

Each thread is responsible for a single element of the output matrix C. The thread’s row is derived from the Y dimension of its block/thread coordinates, and its column from the X dimension. The inner loop accumulates the dot product of row ROW of A with column COL of B.

The matrices are stored in row-major order as flat 1D arrays. Element M[row][col] of an N×N matrix lives at index row * N + col in the flat buffer. The kernel uses this convention for both reading A and B and writing C.

The Full Zig Program

main.zig

const std = @import("std");
const Cuda = @import("cudaz");

const cuda_src =
    \\extern "C" __global__ void matmul(float* A, float* B, float* C, const int N) {
    \\    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    \\    int COL = blockIdx.x*blockDim.x+threadIdx.x;
    \\    
    \\    float tmpSum = 0;
    \\
    \\    if (ROW < N && COL < N) {
    \\        // each thread computes one element of C
    \\        for (int i = 0; i < N; i++) {
    \\            tmpSum += A[ROW * N + i] * B[i * N + COL];
    \\        }
    \\    }
    \\    C[ROW * N + COL] = tmpSum;
    \\ }
;

pub fn main() !void {
    var gpa = std.heap.DebugAllocator(.{}).init;
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const device = try Cuda.Device.default();
    defer device.deinit();

    // Compile the kernel source to PTX at runtime
    const ptx_data = try Cuda.Compile.cudaText(cuda_src, .{}, allocator);
    defer allocator.free(ptx_data);

    // Load PTX and get a handle to the "matmul" function
    const module = try Cuda.Device.loadPtxText(ptx_data);
    const func = try module.getFunc("matmul");

    // A = B = [[1, 2], [3, 4]] in row-major order
    const a = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
    const a_slice = try device.htodCopy(f32, &a);
    const b_slice = try a_slice.clone();
    var c_slice = try a_slice.clone();

    // 2×2 thread block, 1×1 grid — one thread per output cell
    const cfg = Cuda.LaunchConfig{
        .block_dim = .{ 2, 2, 1 },
        .grid_dim  = .{ 1, 1, 1 },
        .shared_mem_bytes = 0,
    };

    // N must be `var` so the driver can take a pointer to it (see warning below)
    var n: i32 = 2;

    try func.run(.{ &a_slice.device_ptr, &b_slice.device_ptr, &c_slice.device_ptr, &n }, cfg);

    var result = try Cuda.Device.syncReclaim(f32, allocator, c_slice);
    defer result.deinit(allocator);
    // result.items = { 7.0, 10.0, 15.0, 22.0 }
    std.debug.print("C = {any}\n", .{result.items});
}

Key Concepts

2D Thread Indexing

In a 1D kernel every thread has a single index i. In a 2D kernel each thread has two coordinates computed from the Y and X dimensions independently:

int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;

This maps naturally onto matrix row/column addressing with no index arithmetic in user code.

2D LaunchConfig

const cfg = Cuda.LaunchConfig{
    .block_dim = .{ 2, 2, 1 },
    .grid_dim  = .{ 1, 1, 1 },
    .shared_mem_bytes = 0,
};

block_dim = .{ 2, 2, 1 } creates a 2×2 thread block (4 threads total), which is exactly enough for a 2×2 output matrix with one thread per cell. For an N×N matrix you would typically use a tile size like { 16, 16, 1 } and set grid_dim to { N/16, N/16, 1 }.

Kernel integer parameters like N must be declared as var, not const, and must not be comptime values. The CUDA driver passes parameters by pointer — it takes the address of each argument and hands that pointer to the kernel. If n is const or a comptime integer, the compiler may not allocate stack storage for it, leaving the driver with a dangling or invalid pointer. The result is a segmentation fault at launch time. Always write var n: i32 = 2; rather than const n = 2 or passing 2 directly.

Cloning Device Buffers

const b_slice = try a_slice.clone();
var c_slice   = try a_slice.clone();

clone() allocates a new GPU buffer of the same size and copies the device data into it — a device-to-device copy that avoids a round-trip through host memory. Here it is used to give B the same values as A and to give C an output buffer of the right size (its initial contents are overwritten by the kernel).

Expected Output

C = { 7.0e0, 1.0e1, 1.5e1, 2.2e1 }

The 2×2 result represents A × B where A = B = [[1,2],[3,4]]:

	col 0	col 1
row 0	1·1 + 2·3 = 7	1·2 + 2·4 = 10
row 1	3·1 + 4·3 = 15	3·2 + 4·4 = 22

Getting Started

Core Concepts

Guides

Examples

Example: Matrix Multiplication on the GPU with cudaz

The CUDA Kernel

The Full Zig Program

Key Concepts

2D Thread Indexing

2D LaunchConfig

Cloning Device Buffers

Expected Output

Build docs developers (and LLMs) love

Getting Started

Core Concepts

Guides

Examples

Documentation Index

​The CUDA Kernel

​The Full Zig Program

​Key Concepts

​2D Thread Indexing

​2D LaunchConfig

​Cloning Device Buffers

​Expected Output

Build docs developers (and LLMs) love

The CUDA Kernel

The Full Zig Program

Key Concepts

2D Thread Indexing

2D LaunchConfig

Cloning Device Buffers

Expected Output