cudaz supports two approaches to loading GPU kernels: compiling CUDA C source code at runtime using NVIDIA’s NVRTC library, and loading pre-compiled PTX directly. Runtime compilation viaDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/akhildevelops/cudaz/llms.txt
Use this file to discover all available pages before exploring further.
Cuda.Compile lets you embed kernel source in your Zig program or read it from a .cu file, compile it on the fly, and load the resulting PTX into the active CUDA context—no offline nvcc step required.
Compile from Inline Text
The most straightforward path is to embed your CUDA C source as a Zig string literal and pass it toCuda.Compile.cudaText. The function returns a sentinel-terminated PTX string that you load with Device.loadPtxText.
@embedFile to pull a .cu file into your binary at compile time and pass the resulting slice directly to cudaText, which is the pattern used in the custom-type example.
Compile from a File
When you want to read a.cu file at runtime rather than embed it, use Cuda.Compile.cudaFile. It accepts a std.Io.File handle and an std.Io instance, reads up to 1 MiB of source, and delegates to cudaText internally.
Compile Options
BothcudaText and cudaFile accept an optional Options struct as their second argument. When null or .{} is passed, all fields default to their zero values and NVRTC uses its built-in defaults.
| Field | Type | NVRTC flag | Description |
|---|---|---|---|
ftz | ?bool | --ftz=true/false | Flush denormal values to zero. |
prec_sqrt | ?bool | --prec-sqrt=true/false | Use IEEE-compliant square root. |
prec_div | ?bool | --prec-div=true/false | Use IEEE-compliant division. |
use_fast_math | ?bool | --fmad=true/false | Enable fused multiply-add fast math. |
maxrregcount | ?usize | --maxrregcount=N | Cap the number of registers per thread. |
include_paths | [][]const u8 | --include-path=… | Additional directories to search for headers. |
arch | [][]const u8 | -arch=… | Target GPU virtual architectures (e.g. compute_86). |
macro | [][]const u8 | --define-macro=… | Preprocessor macro definitions. |
null or an empty slice is simply omitted from the NVRTC command line:
Loading Pre-compiled PTX
If you already have PTX output from an offlinenvcc invocation, you can skip runtime compilation entirely and load the PTX directly via the Device module.
From an in-memory slice (e.g., produced by cudaText or embedded with @embedFile):
Module from which you retrieve individual kernel functions by name using module.getFunc.
Compilation Error Handling
NVRTC errors surface as Zig error values from theNvrtcError.Error error set. When the source code itself is invalid—missing semicolons, type errors, and so on—NVRTC returns NVRTC_ERROR_COMPILATION. cudaz intercepts that specific error inside cudaProgram, fetches the compiler log with nvrtcGetProgramLog, and prints it to stderr. After printing, cudaProgram returns normally, and the subsequent call to getPtx will return its own NVRTC error because no valid PTX was produced. All other NVRTC errors (out-of-memory, invalid input, etc.) are returned directly without printing a log.
The PTX returned by
cudaText and cudaFile is a sentinel-terminated [:0]const u8. You are responsible for freeing it with allocator.free(ptx) once the corresponding Module has been loaded—the module retains its own copy of the code inside the CUDA driver.