Documentation Index
Fetch the complete documentation index at: https://mintlify.com/akhildevelops/cudaz/llms.txt
Use this file to discover all available pages before exploring further.
This example demonstrates a 2D launch configuration — the key step up from the 1D “one thread per element” model used in the increment example. A matrix multiplication kernel assigns one output cell C[ROW][COL] to each thread in a 2D block, so the grid and block dimensions map directly onto the rows and columns of the matrix. The kernel and test are taken from the cudaz test suite (test/general.zig) and adapted here as a standalone program.
The CUDA Kernel
extern "C" __global__ void matmul(float* A, float* B, float* C, const int N) {
int ROW = blockIdx.y*blockDim.y+threadIdx.y;
int COL = blockIdx.x*blockDim.x+threadIdx.x;
float tmpSum = 0;
if (ROW < N && COL < N) {
for (int i = 0; i < N; i++) {
tmpSum += A[ROW * N + i] * B[i * N + COL];
}
}
C[ROW * N + COL] = tmpSum;
}
Each thread is responsible for a single element of the output matrix C. The thread’s row is derived from the Y dimension of its block/thread coordinates, and its column from the X dimension. The inner loop accumulates the dot product of row ROW of A with column COL of B.
The matrices are stored in row-major order as flat 1D arrays. Element
M[row][col] of an N×N matrix lives at index row * N + col in the
flat buffer. The kernel uses this convention for both reading A and B
and writing C.
The Full Zig Program
const std = @import("std");
const Cuda = @import("cudaz");
const cuda_src =
\\extern "C" __global__ void matmul(float* A, float* B, float* C, const int N) {
\\ int ROW = blockIdx.y*blockDim.y+threadIdx.y;
\\ int COL = blockIdx.x*blockDim.x+threadIdx.x;
\\
\\ float tmpSum = 0;
\\
\\ if (ROW < N && COL < N) {
\\ // each thread computes one element of C
\\ for (int i = 0; i < N; i++) {
\\ tmpSum += A[ROW * N + i] * B[i * N + COL];
\\ }
\\ }
\\ C[ROW * N + COL] = tmpSum;
\\ }
;
pub fn main() !void {
var gpa = std.heap.DebugAllocator(.{}).init;
defer _ = gpa.deinit();
const allocator = gpa.allocator();
const device = try Cuda.Device.default();
defer device.deinit();
// Compile the kernel source to PTX at runtime
const ptx_data = try Cuda.Compile.cudaText(cuda_src, .{}, allocator);
defer allocator.free(ptx_data);
// Load PTX and get a handle to the "matmul" function
const module = try Cuda.Device.loadPtxText(ptx_data);
const func = try module.getFunc("matmul");
// A = B = [[1, 2], [3, 4]] in row-major order
const a = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
const a_slice = try device.htodCopy(f32, &a);
const b_slice = try a_slice.clone();
var c_slice = try a_slice.clone();
// 2×2 thread block, 1×1 grid — one thread per output cell
const cfg = Cuda.LaunchConfig{
.block_dim = .{ 2, 2, 1 },
.grid_dim = .{ 1, 1, 1 },
.shared_mem_bytes = 0,
};
// N must be `var` so the driver can take a pointer to it (see warning below)
var n: i32 = 2;
try func.run(.{ &a_slice.device_ptr, &b_slice.device_ptr, &c_slice.device_ptr, &n }, cfg);
var result = try Cuda.Device.syncReclaim(f32, allocator, c_slice);
defer result.deinit(allocator);
// result.items = { 7.0, 10.0, 15.0, 22.0 }
std.debug.print("C = {any}\n", .{result.items});
}
Key Concepts
2D Thread Indexing
In a 1D kernel every thread has a single index i. In a 2D kernel each thread has two coordinates computed from the Y and X dimensions independently:
int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;
This maps naturally onto matrix row/column addressing with no index arithmetic in user code.
2D LaunchConfig
const cfg = Cuda.LaunchConfig{
.block_dim = .{ 2, 2, 1 },
.grid_dim = .{ 1, 1, 1 },
.shared_mem_bytes = 0,
};
block_dim = .{ 2, 2, 1 } creates a 2×2 thread block (4 threads total), which is exactly enough for a 2×2 output matrix with one thread per cell. For an N×N matrix you would typically use a tile size like { 16, 16, 1 } and set grid_dim to { N/16, N/16, 1 }.
Kernel integer parameters like N must be declared as var, not
const, and must not be comptime values. The CUDA driver passes
parameters by pointer — it takes the address of each argument and hands that
pointer to the kernel. If n is const or a comptime integer, the compiler
may not allocate stack storage for it, leaving the driver with a dangling or
invalid pointer. The result is a segmentation fault at launch time. Always
write var n: i32 = 2; rather than const n = 2 or passing 2 directly.
Cloning Device Buffers
const b_slice = try a_slice.clone();
var c_slice = try a_slice.clone();
clone() allocates a new GPU buffer of the same size and copies the device data into it — a device-to-device copy that avoids a round-trip through host memory. Here it is used to give B the same values as A and to give C an output buffer of the right size (its initial contents are overwritten by the kernel).
Expected Output
C = { 7.0e0, 1.0e1, 1.5e1, 2.2e1 }
The 2×2 result represents A × B where A = B = [[1,2],[3,4]]:
| col 0 | col 1 |
|---|
| row 0 | 1·1 + 2·3 = 7 | 1·2 + 2·4 = 10 |
| row 1 | 3·1 + 4·3 = 15 | 3·2 + 4·4 = 22 |