This example is the canonical “hello world” for cudaz: it takes a small float array on the CPU, copies it to the GPU, launches a parallel CUDA kernel that adds 1 to every element, and copies the results back to the host. It covers the complete round trip: host allocation, device transfer, kernel compilation, kernel launch, and result retrieval.
Prerequisites
Before running this example, make sure you have the following installed:
- CUDA Toolkit: provides libcuda, libnvrtc, and the nvcc compiler. Download from developer.nvidia.com/cuda-downloads.
- Zig 0.14.0: the version targeted by cudaz. Install from ziglang.org/download.
- cudaz: declared as a dependency in build.zig.zon (shown below); fetched automatically by zig build.
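Project Structure
The example consists of the files referenced on this page: build.zig and build.zig.zon at the project root, plus src/main.zig and src/increment.cu.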
The CUDA Kernel
The kernel is a plain CUDA C file that NVRTC compiles at runtime to PTX assembly.
src/increment.cu
Each thread computes its global index as i = blockIdx.x * blockDim.x + threadIdx.x. Thread i reads out[i], adds 1, and writes the result back. All 3 threads (or however many you launch) run in parallel with no inter-thread dependencies.
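To make the index arithmetic concrete, here is a minimal CPU-side sketch in plain C. It is not the actual src/increment.cu: it replaces the GPU's parallel threads with two nested loops that visit the same (block, thread) pairs in sequence. The function name increment_emulated and its grid_dim_x and block_dim_x parameters are hypothetical stand-ins for the CUDA built-ins gridDim.x and blockDim.x.

```c
#include <stddef.h>

/* CPU emulation of the kernel's indexing (illustrative only, not CUDA code).
 * On the GPU every (block, thread) pair runs concurrently; here the two
 * loops visit the same pairs one after another. */
void increment_emulated(float *out, unsigned grid_dim_x, unsigned block_dim_x)
{
    for (unsigned block_idx = 0; block_idx < grid_dim_x; ++block_idx) {
        for (unsigned thread_idx = 0; thread_idx < block_dim_x; ++thread_idx) {
            /* Same formula as in the text:
             * i = blockIdx.x * blockDim.x + threadIdx.x */
            size_t i = (size_t)block_idx * block_dim_x + thread_idx;
            out[i] = out[i] + 1.0f; /* each index is touched exactly once */
        }
    }
}
```

Calling increment_emulated(data, 1, 3) on a 3-element array mirrors the one-block, three-thread launch used in this example. As in the kernel described above, there is no bounds check, so grid_dim_x * block_dim_x must not exceed the array length.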
The Main Program
The Zig host program orchestrates the entire workflow: allocator setup, device initialization, memory transfer, kernel compilation, launch, and result retrieval.
src/main.zig
Build Configuration
The build script wires up the cudaz dependency, links the required system libraries, and creates a run step.
build.zig
If your CUDA toolkit is not on the default system library path, pass a custom path to the cudaz dependency: .{ .CUDA_PATH = @as([]const u8, "/usr/local/cuda") }.
Running the Example
Clone the repository and run the example with a single command:
Expected Output
Step-by-Step Walkthrough
Initialize allocator and device
A DebugAllocator is created for all host-side heap allocations. CuDevice.default() selects GPU 0 and initializes the CUDA driver context. Both resources are cleaned up with defer.
Copy the host array to GPU with htodCopy
device.htodCopy(f32, &data) allocates GPU memory exactly large enough to hold the slice, then performs a host-to-device DMA transfer in one call. The returned cu_slice holds the device pointer and the element count; call cu_slice.free() when done.
Compile the kernel inline with Compile.cudaText
CuCompile.cudaText(increment_kernel, .{}, allocator) invokes NVRTC to JIT-compile the embedded .cu source string into PTX at runtime. The second argument is a compile-options struct where you can enable fast-math flags and similar settings. The returned ptx slice is heap-allocated and must be freed after the module is loaded.
Load PTX and get a function handle
CuDevice.loadPtxText(ptx) uploads the PTX to the driver, which translates it into GPU machine code for the attached hardware. module.getFunc("increment") looks up the exported kernel by name and returns a callable handle.
Launch the kernel with run()
function.run(.{&cu_slice.device_ptr}, cfg) launches the kernel. The first argument is a tuple of kernel parameters, here a single pointer to the device buffer. The LaunchConfig specifies block_dim = .{ 3, 1, 1 } so that 3 threads run in a single block, one per array element.
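The example launches exactly one thread per element. As a hedged sketch of how that sizing generalizes, the plain C helpers below compute how many blocks are needed to cover n elements with a given block size, using ceiling division. The names blocks_needed and thread_in_bounds are hypothetical and are not part of cudaz. A kernel launched with rounded-up dimensions would also need the bounds guard shown here, which the kernel described on this page does not have.

```c
#include <stdbool.h>
#include <stddef.h>

/* Number of blocks required so that blocks * block_size >= n.
 * Assumes block_size > 0. Hypothetical helper, not a cudaz function. */
size_t blocks_needed(size_t n, size_t block_size)
{
    return (n + block_size - 1) / block_size; /* ceiling division */
}

/* Guard a kernel would apply when the launch has more threads than
 * elements: only threads whose global index is below n do any work. */
bool thread_in_bounds(size_t block_idx, size_t block_size,
                      size_t thread_idx, size_t n)
{
    size_t i = block_idx * block_size + thread_idx;
    return i < n;
}
```

For this example, blocks_needed(3, 3) is 1, matching the single block of 3 threads. With 1000 elements and 256 threads per block it yields 4 blocks (1024 threads), and the last 24 threads fail the guard.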