This guide walks you through a complete, working cudaz program: an array increment example that copies three f32 values to the GPU, runs a CUDA kernel that adds 1 to each element in parallel, and retrieves the results back to the host. By the end you will have seen every essential cudaz concept (device setup, memory management, kernel compilation, kernel execution, and result retrieval) in one self-contained program.
Add imports and create type aliases
Import cudaz and create short aliases for the three types you will use most often: Device, Compile, and LaunchConfig.
Cuda.Device handles device lifecycle and memory transfers. Cuda.Compile compiles CUDA C source to PTX at runtime. Cuda.LaunchConfig describes the GPU thread hierarchy (grids, blocks, threads) for a kernel launch.
Initialize the GPU device
Call CuDevice.default() to initialize the CUDA runtime and acquire the primary context for GPU ordinal 0. Use defer device.deinit() so the context is released automatically when main returns.
If you have multiple GPUs and want a specific one, use CuDevice.new(ordinal) instead.
Copy data to the GPU
Define your host data as a fixed-size array, then call device.htodCopy(f32, &data) to allocate GPU memory of the correct size and copy the values over in one call. The returned CudaSlice(f32) owns the device allocation, so defer cu_slice.free() will release it.
CudaSlice(T) is a type-safe wrapper around a CUdeviceptr. It carries both the device pointer and the element count, so subsequent operations never need a separate length argument.
Compile and load the kernel
Write the CUDA C kernel as an inline Zig multiline string literal, compile it to PTX using NVRTC, load the PTX into the device, and look up the function by name.
CuCompile.cudaText invokes NVRTC at runtime, so no separate nvcc compilation step is needed. The resulting PTX is a null-terminated [:0]const u8 that loadPtxText feeds directly to the CUDA driver.
Run the kernel
Launch the kernel by passing a tuple of kernel arguments and a LaunchConfig that specifies the thread hierarchy. Because the input array has 3 elements, set block_dim to {3, 1, 1} so that a block contains three threads and each thread handles one element.
The first argument to function.run is a tuple of pointers to the kernel parameters (here, just the device pointer). LaunchConfig maps directly to the CUDA grid/block/thread hierarchy.
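
The steps up to this point (aliases, device setup, host-to-device copy, runtime compilation, and launch) might fit together roughly as in the sketch below. It is a minimal illustration rather than the library's reference listing, and several details are assumptions not confirmed by the text above: the import name "cudaz", the use of std.heap.page_allocator, the sample values, the kernel name increment and its body, the argument order of cudaText (source, options, allocator), the module.getFunc lookup, calling loadPtxText on the CuDevice type, the device_ptr field on the slice, and the grid_dim and shared_mem_bytes fields of LaunchConfig. Check each of these against the cudaz source for your version before relying on it.

```zig
const std = @import("std");
// Assumption: the package is exposed to the build under the name "cudaz".
const Cuda = @import("cudaz");
const CuDevice = Cuda.Device;
const CuCompile = Cuda.Compile;
const CuLaunchConfig = Cuda.LaunchConfig;

// CUDA C kernel as a Zig multiline string literal.
// The name "increment" and the body are illustrative: each thread adds 1
// to the element at its global index.
const increment_kernel =
    \\extern "C" __global__ void increment(float *out)
    \\{
    \\    int i = blockIdx.x * blockDim.x + threadIdx.x;
    \\    out[i] = out[i] + 1;
    \\}
;

pub fn main() !void {
    // Assumption: any std allocator works for holding the PTX text.
    const allocator = std.heap.page_allocator;

    // Acquire the primary context for GPU ordinal 0; release it on exit.
    const device = try CuDevice.default();
    defer device.deinit();

    // Three sample f32 values (illustrative), copied to the GPU in one call.
    const data = [_]f32{ 1.2, 2.8, 0.123 };
    const cu_slice = try device.htodCopy(f32, &data);
    defer cu_slice.free();

    // Compile CUDA C to PTX with NVRTC at runtime.
    // Assumed signature: cudaText(source, options, allocator).
    const ptx = try CuCompile.cudaText(increment_kernel, .{}, allocator);
    defer allocator.free(ptx);

    // Load the PTX and look up the kernel by name.
    // Assumed: loadPtxText is called on the type and the module has getFunc.
    const module = try CuDevice.loadPtxText(ptx);
    const function = try module.getFunc("increment");

    // One block of three threads, one thread per element.
    // Assumed field names: device_ptr, grid_dim, shared_mem_bytes.
    try function.run(.{&cu_slice.device_ptr}, CuLaunchConfig{
        .block_dim = .{ 3, 1, 1 },
        .grid_dim = .{ 1, 1, 1 },
        .shared_mem_bytes = 0,
    });
}
```

The two defer statements run in reverse order of declaration, so the device allocation is freed before the context is released. The kernel has no bounds check, which is safe here only because the launch creates exactly as many threads as there are elements.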