Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/akhildevelops/cudaz/llms.txt

Use this file to discover all available pages before exploring further.

This example demonstrates a 2D launch configuration — the key step up from the 1D “one thread per element” model used in the increment example. A matrix multiplication kernel assigns one output cell C[ROW][COL] to each thread in a 2D block, so the grid and block dimensions map directly onto the rows and columns of the matrix. The kernel and test are taken from the cudaz test suite (test/general.zig) and adapted here as a standalone program.

The CUDA Kernel

extern "C" __global__ void matmul(float* A, float* B, float* C, const int N) {
    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    int COL = blockIdx.x*blockDim.x+threadIdx.x;
    float tmpSum = 0;
    if (ROW < N && COL < N) {
        for (int i = 0; i < N; i++) {
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
    }
    C[ROW * N + COL] = tmpSum;
}
Each thread is responsible for a single element of the output matrix C. The thread’s row is derived from the Y dimension of its block/thread coordinates, and its column from the X dimension. The inner loop accumulates the dot product of row ROW of A with column COL of B.
The matrices are stored in row-major order as flat 1D arrays. Element M[row][col] of an N×N matrix lives at index row * N + col in the flat buffer. The kernel uses this convention for both reading A and B and writing C.

The Full Zig Program

main.zig
const std = @import("std");
const Cuda = @import("cudaz");

const cuda_src =
    \\extern "C" __global__ void matmul(float* A, float* B, float* C, const int N) {
    \\    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    \\    int COL = blockIdx.x*blockDim.x+threadIdx.x;
    \\    
    \\    float tmpSum = 0;
    \\
    \\    if (ROW < N && COL < N) {
    \\        // each thread computes one element of C
    \\        for (int i = 0; i < N; i++) {
    \\            tmpSum += A[ROW * N + i] * B[i * N + COL];
    \\        }
    \\    }
    \\    C[ROW * N + COL] = tmpSum;
    \\ }
;

pub fn main() !void {
    var gpa = std.heap.DebugAllocator(.{}).init;
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const device = try Cuda.Device.default();
    defer device.deinit();

    // Compile the kernel source to PTX at runtime
    const ptx_data = try Cuda.Compile.cudaText(cuda_src, .{}, allocator);
    defer allocator.free(ptx_data);

    // Load PTX and get a handle to the "matmul" function
    const module = try Cuda.Device.loadPtxText(ptx_data);
    const func = try module.getFunc("matmul");

    // A = B = [[1, 2], [3, 4]] in row-major order
    const a = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
    const a_slice = try device.htodCopy(f32, &a);
    const b_slice = try a_slice.clone();
    var c_slice = try a_slice.clone();

    // 2×2 thread block, 1×1 grid — one thread per output cell
    const cfg = Cuda.LaunchConfig{
        .block_dim = .{ 2, 2, 1 },
        .grid_dim  = .{ 1, 1, 1 },
        .shared_mem_bytes = 0,
    };

    // N must be `var` so the driver can take a pointer to it (see warning below)
    var n: i32 = 2;

    try func.run(.{ &a_slice.device_ptr, &b_slice.device_ptr, &c_slice.device_ptr, &n }, cfg);

    var result = try Cuda.Device.syncReclaim(f32, allocator, c_slice);
    defer result.deinit(allocator);
    // result.items = { 7.0, 10.0, 15.0, 22.0 }
    std.debug.print("C = {any}\n", .{result.items});
}

Key Concepts

2D Thread Indexing

In a 1D kernel every thread has a single index i. In a 2D kernel each thread has two coordinates computed from the Y and X dimensions independently:
int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;
This maps naturally onto matrix row/column addressing with no index arithmetic in user code.

2D LaunchConfig

const cfg = Cuda.LaunchConfig{
    .block_dim = .{ 2, 2, 1 },
    .grid_dim  = .{ 1, 1, 1 },
    .shared_mem_bytes = 0,
};
block_dim = .{ 2, 2, 1 } creates a 2×2 thread block (4 threads total), which is exactly enough for a 2×2 output matrix with one thread per cell. For an N×N matrix you would typically use a tile size like { 16, 16, 1 } and set grid_dim to { N/16, N/16, 1 }.
Kernel integer parameters like N must be declared as var, not const, and must not be comptime values. The CUDA driver passes parameters by pointer — it takes the address of each argument and hands that pointer to the kernel. If n is const or a comptime integer, the compiler may not allocate stack storage for it, leaving the driver with a dangling or invalid pointer. The result is a segmentation fault at launch time. Always write var n: i32 = 2; rather than const n = 2 or passing 2 directly.

Cloning Device Buffers

const b_slice = try a_slice.clone();
var c_slice   = try a_slice.clone();
clone() allocates a new GPU buffer of the same size and copies the device data into it — a device-to-device copy that avoids a round-trip through host memory. Here it is used to give B the same values as A and to give C an output buffer of the right size (its initial contents are overwritten by the kernel).

Expected Output

C = { 7.0e0, 1.0e1, 1.5e1, 2.2e1 }
The 2×2 result represents A × B where A = B = [[1,2],[3,4]]:
col 0col 1
row 01·1 + 2·3 = 71·2 + 2·4 = 10
row 13·1 + 4·3 = 153·2 + 4·4 = 22

Build docs developers (and LLMs) love