CudaSlice(T) is fully generic—T can be any C-compatible struct, not just primitives like f32 or i32. This lets you move arrays of rich data types (coordinates, colors, records) to the GPU in a single htodCopy call, operate on them in a CUDA kernel, and retrieve the results with syncReclaim. The key requirement is that the struct layout is identical on both sides of the language boundary, which means defining it in a shared C header and then importing that header into both Zig and the CUDA source.
Defining the Shared C Type
Place your struct in a plain C header file that lives alongside your project source. Both Zig (via @cImport) and the CUDA kernel source will reference this definition, so naming it something neutral like c/tuple.h keeps it easy to find.
c/tuple.h
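The header's contents are not reproduced on this page. As a minimal sketch, assuming a struct with two hypothetical float fields named x and y (the real fields may differ), such a header might look like this:

```c
/* c/tuple.h: sketch of a shared struct definition.
 * The field names and types below are hypothetical placeholders. */
#ifndef TUPLE_H
#define TUPLE_H

typedef struct {
    float x; /* hypothetical field */
    float y; /* hypothetical field */
} tuple;

#endif /* TUPLE_H */
```

Using a typedef with plain scalar fields keeps the type usable from C, from Zig's C import, and from CUDA source without any language-specific attributes.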
Importing the Type in Zig
Use @cImport and @cInclude to pull the header into Zig. The resulting namespace exposes the struct as Ctype.tuple, and its memory layout is guaranteed to match what the C compiler (and NVCC) expects.
Allocating and Copying Custom Types to the GPU
Once the type is imported, use it anywhere you would use a primitive with CudaSlice. Build a host-side ArrayList, populate it, then call device.htodCopy to allocate GPU memory and copy the data in one step.
htodCopy allocates @sizeOf(Ctype.tuple) * src_array.items.len bytes on the device, copies the host slice into that allocation, and returns a CudaSlice(Ctype.tuple) that you pass directly to your kernel.
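The size arithmetic described above can be sketched in C. This is only an illustration of the byte count, not cudaz code: the struct fields and the helper name tuple_buffer_bytes are hypothetical.

```c
#include <stddef.h>

/* Stand-in for the shared struct; the fields are hypothetical. */
typedef struct {
    float x;
    float y;
} tuple;

/* Bytes required on the device to hold `len` elements,
 * mirroring @sizeOf(Ctype.tuple) * src_array.items.len. */
static size_t tuple_buffer_bytes(size_t len) {
    return sizeof(tuple) * len;
}
```

Because the element size includes any padding the compiler inserts, the byte count follows the struct's full in-memory size rather than the sum of its field sizes.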
Writing the CUDA Kernel
Because NVRTC (the runtime compiler cudaz uses) does not support #include directives, you cannot simply include tuple.h from inside your .cu source string. Instead, redefine the struct inline at the top of the kernel source. The layout produced by NVRTC will be identical to the one in the header as long as the field order and types match.
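Keeping two copies of a definition in sync is easy to get wrong, so it can help to check them against each other. The sketch below assumes hypothetical float fields x and y and renames the second copy to tuple_kernel_copy so that both fit in one translation unit. It checks only the host C compiler's layout, not NVRTC's, so treat it as a guard against edits that change one copy but not the other.

```c
#include <stddef.h>

/* Definition as it might appear in the shared header (hypothetical fields). */
typedef struct {
    float x;
    float y;
} tuple;

/* The copy pasted into the kernel source, renamed here only so
 * that both definitions can coexist in a single file. */
typedef struct {
    float x;
    float y;
} tuple_kernel_copy;

/* Compile-time checks: same total size and same field offsets. */
_Static_assert(sizeof(tuple) == sizeof(tuple_kernel_copy), "size mismatch");
_Static_assert(offsetof(tuple, x) == offsetof(tuple_kernel_copy, x), "x offset mismatch");
_Static_assert(offsetof(tuple, y) == offsetof(tuple_kernel_copy, y), "y offset mismatch");
```

If a field is reordered or retyped in only one of the two definitions, compilation fails instead of the kernel silently reading misaligned data.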
offset.cu
Compile the kernel source (loaded with @embedFile or passed as a string) through Cuda.Compile.cudaText, then load the resulting PTX.
Running the Kernel
Allocate the output buffer on the device, then call function.run with a tuple of pointers to each kernel argument’s device_ptr. The LaunchConfig controls grid and block dimensions.
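A common way to choose the grid dimension is to round the element count up to a whole number of blocks. The helper below is a generic sketch of that arithmetic with a hypothetical name, blocks_needed. How the result maps onto the LaunchConfig fields is not shown here.

```c
/* Smallest number of blocks of `block_size` threads that covers `n`
 * elements (ceiling division). Assumes block_size > 0 and that
 * n + block_size - 1 does not overflow unsigned int. */
static unsigned int blocks_needed(unsigned int n, unsigned int block_size) {
    return (n + block_size - 1u) / block_size;
}
```

With this rounding the last block may contain threads whose index is past the end of the array, so the kernel should compare each thread's index against the element count before touching memory.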
Retrieve the results from the device with Device.syncReclaim.