This guide walks you through a complete, working cudaz program: an array increment example that copies three f32 values to the GPU, runs a CUDA kernel that adds 1 to each element in parallel, and retrieves the results back to the host. By the end you will have seen every essential cudaz concept — device setup, memory management, kernel compilation, kernel execution, and result retrieval — in one self-contained program.

1. Add imports and create type aliases

Import cudaz and create short aliases for the three types you will use most often: Device, Compile, and LaunchConfig.
const std = @import("std");
const Cuda = @import("cudaz");
const CuDevice = Cuda.Device;
const CuCompile = Cuda.Compile;
const CuLaunchConfig = Cuda.LaunchConfig;
Cuda.Device handles device lifecycle and memory transfers. Cuda.Compile compiles CUDA C source to PTX at runtime. Cuda.LaunchConfig describes the GPU thread hierarchy (grids, blocks, threads) for a kernel launch.

2. Initialize the GPU device

Call CuDevice.default() to initialize the CUDA runtime and acquire the primary context for GPU ordinal 0. Use defer device.deinit() so the context is released automatically when main returns.
const device = try CuDevice.default();
defer device.deinit();
If you have multiple GPUs and want a specific one, use CuDevice.new(ordinal) instead.

3. Copy data to the GPU

Define your host data as a fixed-size array, then call device.htodCopy(f32, &data) to allocate GPU memory of the correct size and copy the values over in one call. The returned CudaSlice(f32) owns the device allocation, so defer cu_slice.free() will release it.
const data = [_]f32{ 1.2, 2.8, 0.123 };
const cu_slice = try device.htodCopy(f32, &data);
defer cu_slice.free();
CudaSlice(T) is a type-safe wrapper around a CUdeviceptr. It carries both the device pointer and the element count, so subsequent operations never need a separate length argument.
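
As a rough illustration of why an element type plus an element count is enough to size a transfer, the sketch below computes the byte size of the host array from those two pieces of information. It assumes, without confirmation from the cudaz source, that htodCopy allocates `@sizeOf(T) * len` bytes. `byteSize` is a hypothetical helper for this illustration, not a cudaz function.

```zig
const std = @import("std");

/// Hypothetical helper (not part of cudaz): bytes needed to hold `len` elements of type T.
fn byteSize(comptime T: type, len: usize) usize {
    return @sizeOf(T) * len;
}

pub fn main() void {
    const data = [_]f32{ 1.2, 2.8, 0.123 };
    // 3 elements * 4 bytes per f32 = 12 bytes
    std.debug.print("{d} elements, {d} bytes\n", .{ data.len, byteSize(f32, data.len) });
}
```

A wrapper that stores the element count alongside the pointer can always recompute this size, which is why later calls in this guide take only the slice.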

4. Compile and load the kernel

Write the CUDA C kernel as an inline Zig multiline string literal, compile it to PTX using NVRTC, load the PTX into the device, and look up the function by name.
// Define the CUDA kernel as an inline string
const increment_kernel =
    \\extern "C" __global__ void increment(float *out)
    \\{
    \\    int i = blockIdx.x * blockDim.x + threadIdx.x;
    \\    out[i] = out[i] + 1;
    \\}
;

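// Compile the kernel source to PTX (NVIDIA's textual intermediate representation) via NVRTC.
// `allocator` is any std.mem.Allocator; the full example below sets one up.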
const ptx = try CuCompile.cudaText(increment_kernel, .{}, allocator);
defer allocator.free(ptx);

// Load the PTX into the device and retrieve the function handle
const module = try CuDevice.loadPtxText(ptx);
const function = try module.getFunc("increment");
CuCompile.cudaText invokes NVRTC at runtime — no separate nvcc compilation step is needed. The resulting PTX is a null-terminated [:0]const u8 that loadPtxText feeds directly to the CUDA driver.
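
To make the kernel's index expression concrete, here is a host-only sketch that emulates `blockIdx.x * blockDim.x + threadIdx.x` with ordinary loops. It does not touch the GPU or cudaz. `incrementThread` is a hypothetical stand-in for the kernel body, and its parameters stand in for CUDA's built-in index variables.

```zig
const std = @import("std");

/// Host-side emulation of the `increment` kernel body for a single thread.
/// block_idx, block_dim and thread_idx stand in for blockIdx.x, blockDim.x and threadIdx.x.
fn incrementThread(out: []f32, block_idx: usize, block_dim: usize, thread_idx: usize) void {
    const i = block_idx * block_dim + thread_idx;
    out[i] = out[i] + 1;
}

pub fn main() void {
    var out = [_]f32{ 1.2, 2.8, 0.123 };
    const grid_dim: usize = 1; // one block
    const block_dim: usize = 3; // three threads per block
    // On a GPU these iterations run in parallel; here they run one after another.
    for (0..grid_dim) |b| {
        for (0..block_dim) |t| {
            incrementThread(&out, b, block_dim, t);
        }
    }
    std.debug.print("{any}\n", .{out});
}
```

With one block of three threads, the computed indices are 0, 1, and 2, so each thread touches exactly one element of the array.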

5. Run the kernel

Launch the kernel by passing a tuple of kernel arguments and a LaunchConfig that specifies the thread hierarchy. Because the input array has 3 elements, set block_dim to {3, 1, 1} so each thread handles one element.
try function.run(
    .{&cu_slice.device_ptr},
    CuLaunchConfig{
        .block_dim = .{ 3, 1, 1 },
        .grid_dim  = .{ 1, 1, 1 },
        .shared_mem_bytes = 0,
    },
);
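The first argument to function.run is a tuple of pointers to the kernel parameters; here it holds only the device pointer. LaunchConfig maps directly to the CUDA grid/block/thread hierarchy. Because the increment kernel has no bounds check, the total number of threads launched must not exceed the number of elements, otherwise the extra threads would write past the end of the allocation.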
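
For arrays that do not fit in a single block, the grid size is commonly derived from the element count by ceiling division. The sketch below shows that arithmetic in plain Zig. `blocksFor` is a hypothetical helper, not a cudaz function, and this guide does not say whether cudaz provides an equivalent.

```zig
const std = @import("std");

/// Hypothetical helper: smallest block count such that
/// blocks * threads_per_block >= n.
fn blocksFor(n: u32, threads_per_block: u32) u32 {
    std.debug.assert(threads_per_block > 0);
    if (n == 0) return 0;
    // Ceiling division written as (n - 1) / tpb + 1 to avoid overflow in n + tpb - 1.
    return (n - 1) / threads_per_block + 1;
}

pub fn main() void {
    std.debug.print("{d}\n", .{blocksFor(3, 3)}); // 1
    std.debug.print("{d}\n", .{blocksFor(1000, 256)}); // 4
    std.debug.print("{d}\n", .{blocksFor(1025, 256)}); // 5
}
```

When the element count is not a multiple of the block size, the last block contains surplus threads. A kernel launched this way would need the element count as an extra parameter and a guard such as `if (i < n)`, which the kernel in this guide does not have.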

6. Retrieve results from the GPU

Call CuDevice.syncReclaim to synchronize the device, copy the GPU memory back to the host, and return a std.ArrayList(f32).
var incremented_arr = try CuDevice.syncReclaim(f32, allocator, cu_slice);
defer incremented_arr.deinit(allocator);
After this call, incremented_arr.items contains approximately { 2.2, 3.8, 1.123 }: each original value incremented by 1. The values are f32, so compare them with a tolerance rather than for exact equality.
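
One way to check the retrieved values is a tolerance-based comparison. The sketch below uses `std.math.approxEqAbs` from the Zig standard library. `allClose` is a hypothetical helper, and the `result` array is computed on the host as a stand-in for incremented_arr.items so that the sketch runs without a GPU.

```zig
const std = @import("std");

/// Hypothetical helper: true when both slices have the same length and
/// every pair of elements differs by at most `tol`.
fn allClose(actual: []const f32, expected: []const f32, tol: f32) bool {
    if (actual.len != expected.len) return false;
    for (actual, expected) |a, e| {
        if (!std.math.approxEqAbs(f32, a, e, tol)) return false;
    }
    return true;
}

pub fn main() void {
    // Stand-in for incremented_arr.items, computed on the host.
    const input = [_]f32{ 1.2, 2.8, 0.123 };
    var result: [input.len]f32 = undefined;
    for (input, &result) |x, *r| r.* = x + 1;

    const expected = [_]f32{ 2.2, 3.8, 1.123 };
    std.debug.print("match: {any}\n", .{allClose(&result, &expected, 1e-5)});
}
```

In the real program, pass incremented_arr.items as the `actual` argument. The tolerance of 1e-5 is an arbitrary choice for this example and may need adjusting for other kernels.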

Full working example

The following is the complete source from the increment example in the cudaz repository. You can use this as a starting template for your own project.
const std = @import("std");
const Cuda = @import("cudaz");
const CuDevice = Cuda.Device;
const CuCompile = Cuda.Compile;
const CuLaunchConfig = Cuda.LaunchConfig;

// Cuda Kernel
const increment_kernel = @embedFile("./increment.cu");

pub fn main() !void {
    // Initialize allocator
    var gpa = std.heap.DebugAllocator(.{}).init;
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();
    std.debug.print("Initialized allocator\n", .{});

    // Initialize GPU
    const device = try CuDevice.default();
    defer device.deinit();
    std.debug.print("Cuda device is setup\n", .{});

    // Copy data from host to GPU
    const data = [_]f32{ 1.2, 2.8, 0.123 };
    const cu_slice = try device.htodCopy(f32, &data);
    defer cu_slice.free();
    std.debug.print("Copied array {any} from system to GPU\n", .{data});

    // Compile and load the Kernel
    std.debug.print("Kernel program:\n{s}\n\n", .{increment_kernel});
    const ptx = try CuCompile.cudaText(increment_kernel, .{}, allocator);
    defer allocator.free(ptx);
    const module = try CuDevice.loadPtxText(ptx);
    const function = try module.getFunc("increment");
    std.debug.print("Compiled Cuda Kernel that increments each value by 1 and loaded into GPU\n", .{});

    // Run the kernel on the data
    try function.run(
        .{&cu_slice.device_ptr},
        CuLaunchConfig{ .block_dim = .{ 3, 1, 1 }, .grid_dim = .{ 1, 1, 1 }, .shared_mem_bytes = 0 },
    );
    std.debug.print("Ran the Kernel against the array in GPU\n", .{});

    // Retrieve incremented data back to the system
    var incremented_arr = try CuDevice.syncReclaim(f32, allocator, cu_slice);
    defer incremented_arr.deinit(allocator);
    std.debug.print("Retrieved incremented data {any} from GPU to system\n", .{incremented_arr.items});
}
The increment example uses @embedFile("./increment.cu") to load the kernel from a separate .cu file rather than an inline string — a pattern that scales better as kernel complexity grows. For more advanced patterns including custom C struct passing and random number generation, explore the examples directory in the cudaz repository.
