Device is the central type in cudaz. It represents a single NVIDIA GPU and its primary CUDA context. You use a Device to allocate GPU memory, copy data between host and device, load compiled PTX modules, and free resources. Most cudaz workflows begin by calling Device.default() or Device.new(gpu) and end with device.deinit().
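The workflow described above can be sketched as a small round trip. This is a minimal sketch, assuming the package is importable as "cudaz" and that the signatures documented on this page apply. The choice of std.heap.page_allocator and the sample values are illustrative only.

```zig
const std = @import("std");
const CuDevice = @import("cudaz").Device;

pub fn main() !void {
    // Any std.mem.Allocator works here; page_allocator keeps the sketch short.
    const allocator = std.heap.page_allocator;

    // Begin with a Device for GPU 0 and release it on scope exit.
    const device = try CuDevice.default();
    defer device.deinit();

    // Copy a small host array to a freshly allocated GPU slice.
    const host_data = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
    const gpu_slice = try device.htodCopy(f32, &host_data);
    defer gpu_slice.free();

    // Copy the data back into a host buffer owned by the caller.
    const round_trip = try CuDevice.dtohCopy(f32, allocator, gpu_slice);
    defer allocator.free(round_trip);

    std.debug.print("round trip: {any}\n", .{round_trip});
}
```

Running this requires an NVIDIA GPU and a working CUDA driver.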

Import

const CuDevice = @import("cudaz").Device;

Fields

| Field | Type | Description |
| --- | --- | --- |
| device | cuda.CUdevice | Underlying CUDA device handle from the driver API |
| primary_context | cuda.CUcontext | The primary context retained for this device |
| ordinal | u16 | Zero-based GPU index used to select this device |

Initialization

default

pub fn default() CudaError.Error!Device
Initializes and returns a Device for GPU 0. Calls cuInit, retrieves device 0, retains its primary context, and sets it as the current context. Equivalent to Device.new(0).
Returns: CudaError.Error!Device
const device = try CuDevice.default();
defer device.deinit();

new

pub fn new(gpu: u16) CudaError.Error!Device
Initializes and returns a Device for the GPU at the specified ordinal index.
gpu (u16, required): Zero-based index of the GPU to initialize. Pass 0 for the first GPU, 1 for the second, and so on.
Returns: CudaError.Error!Device
const device = try CuDevice.new(1); // second GPU
defer device.deinit();

deinit

pub fn deinit(self: *const Device) void
Releases the primary context for this device via cuDevicePrimaryCtxRelease. Panics if the release fails. Call this with defer immediately after a successful default() or new().
self (*const Device, required): Pointer to the Device to release.
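Because Zig runs deferred statements in reverse order of registration, placing defer device.deinit() right after initialization means it runs last, after any slices allocated later have been freed. The sketch below illustrates that ordering. It assumes the "cudaz" import name and the documented signatures; the buffer size is arbitrary.

```zig
const CuDevice = @import("cudaz").Device;

pub fn main() !void {
    const device = try CuDevice.new(0);
    // Registered first, so this runs last on scope exit.
    defer device.deinit();

    const scratch = try device.alloc(u8, 4096);
    // Registered second, so this runs first, while the context is still retained.
    defer scratch.free();

    // ... use scratch here ...
}
```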

Memory Allocation

alloc

pub fn alloc(self: Device, comptime T: type, length: usize) CudaError.Error!CudaSlice(T)
Allocates uninitialized GPU memory for length elements of type T. The returned CudaSlice(T) holds the device pointer and length. The caller is responsible for freeing the slice with slice.free().
self (Device, required): The device on which to allocate memory.
T (type, required): Comptime element type (e.g., f32, i32, u8).
length (usize, required): Number of elements to allocate. Total bytes allocated is @sizeOf(T) * length.
Returns: CudaError.Error!CudaSlice(T)
const slice = try device.alloc(f32, 1024);
defer slice.free();
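To make the size arithmetic concrete, the following sketch computes @sizeOf(T) * length on the host without touching the GPU. The helper allocationBytes is hypothetical and not part of cudaz; it only mirrors the formula stated above.

```zig
const std = @import("std");

/// Bytes needed for `length` elements of `T`, mirroring @sizeOf(T) * length.
/// Hypothetical helper for illustration; not part of cudaz.
fn allocationBytes(comptime T: type, length: usize) usize {
    return @sizeOf(T) * length;
}

pub fn main() void {
    std.debug.print("f32 x 1024 = {d} bytes\n", .{allocationBytes(f32, 1024)}); // 4096
    std.debug.print("u8  x 1024 = {d} bytes\n", .{allocationBytes(u8, 1024)}); // 1024
}
```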

allocZeros

pub fn allocZeros(self: Device, comptime data_type: type, length: usize) CudaError.Error!CudaSlice(data_type)
Allocates GPU memory for length elements of type data_type and immediately zero-initializes every byte using cuMemsetD8_v2. Equivalent to alloc followed by memsetZeros.
self (Device, required): The device on which to allocate memory.
data_type (type, required): Comptime element type.
length (usize, required): Number of elements to allocate and zero.
Returns: CudaError.Error!CudaSlice(data_type)
const zeros = try device.allocZeros(f32, 256);
defer zeros.free();

allocR

pub fn allocR(self: Device, dtype: DType, length: usize) CudaError.Error!CudaSliceR
Allocates uninitialized GPU memory using a runtime DType value instead of a comptime type. Returns a CudaSliceR whose element_type field carries the DType for later dispatch.
self (Device, required): The device on which to allocate memory.
dtype (DType, required): Runtime element type: .f16 (2 bytes) or .f32 (4 bytes).
length (usize, required): Number of elements to allocate.
Returns: CudaError.Error!CudaSliceR
const slice = try device.allocR(.f32, 512);
defer slice.free();

allocZerosR

pub fn allocZerosR(self: Device, dtype: DType, length: usize) CudaError.Error!CudaSliceR
Runtime-typed equivalent of allocZeros. Allocates and zero-initializes GPU memory with element type determined at runtime.
self (Device, required): The device on which to allocate memory.
dtype (DType, required): Runtime element type.
length (usize, required): Number of elements to allocate and zero.
Returns: CudaError.Error!CudaSliceR
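A minimal usage sketch for allocZerosR, modeled on the allocR example. It assumes the "cudaz" import name and that CudaSliceR exposes free(), as the free section states. The claim that allocR followed by memsetZerosR gives the same result is an inference from the allocZeros description, not something this page confirms.

```zig
const CuDevice = @import("cudaz").Device;

pub fn main() !void {
    const device = try CuDevice.default();
    defer device.deinit();

    // Runtime-typed, zero-filled buffer of 512 half-precision elements.
    const zeros = try device.allocZerosR(.f16, 512);
    defer zeros.free();

    // Presumably the same effect: allocate first, then clear explicitly.
    const scratch = try device.allocR(.f16, 512);
    defer scratch.free();
    try CuDevice.memsetZerosR(.f16, scratch);
}
```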

Host ↔ Device Transfers

htodCopy

pub fn htodCopy(self: Device, comptime T: type, src: []const T) CudaError.Error!CudaSlice(T)
Allocates a new CudaSlice(T) on the device and copies all elements from the host slice src into it in one operation. The returned slice has the same length as src.
self (Device, required): The device to allocate on and copy to.
T (type, required): Comptime element type.
src ([]const T, required): Host slice to copy from. The entire slice is transferred.
Returns: CudaError.Error!CudaSlice(T)
const host_data = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
const gpu_slice = try device.htodCopy(f32, &host_data);
defer gpu_slice.free();

htodCopyInto

pub fn htodCopyInto(comptime T: type, src: []const T, destination: CudaSlice(T)) CudaError.Error!void
Copies src into an already-allocated CudaSlice(T). Asserts that src.len == destination.len. Use this when you want to reuse an existing GPU allocation rather than allocating a new one.
T (type, required): Comptime element type.
src ([]const T, required): Host slice to copy from. Must have the same length as destination.
destination (CudaSlice(T), required): Pre-allocated GPU slice to copy into.
Returns: CudaError.Error!void
const new_data = [_]f32{ 5.0, 6.0, 7.0, 8.0 };
try CuDevice.htodCopyInto(f32, &new_data, gpu_slice);

dtohCopy

pub fn dtohCopy(comptime T: type, allocator: std.mem.Allocator, slice: CudaSlice(T)) ![]T
Allocates a host buffer with allocator and copies the contents of slice into it. The returned []T is owned by the caller and must be freed with allocator.free(...).
T (type, required): Comptime element type.
allocator (std.mem.Allocator, required): Allocator used to create the host buffer.
slice (CudaSlice(T), required): GPU slice to copy from.
Returns: ![]T — heap-allocated host slice
const host_buf = try CuDevice.dtohCopy(f32, allocator, gpu_slice);
defer allocator.free(host_buf);

syncReclaim

pub fn syncReclaim(comptime T: type, allocator: std.mem.Allocator, slice: CudaSlice(T)) !std.ArrayList(T)
Copies the GPU slice to a newly created ArrayList(T). The returned ArrayList owns its memory; call deinit() on it when done. This is a convenience wrapper around dtohCopy for callers that prefer ArrayList.
T (type, required): Comptime element type.
allocator (std.mem.Allocator, required): Allocator for the ArrayList.
slice (CudaSlice(T), required): GPU slice to copy from.
Returns: !std.ArrayList(T)
var results = try CuDevice.syncReclaim(f32, allocator, gpu_slice);
defer results.deinit();
std.debug.print("first element: {d}\n", .{results.items[0]});

syncReclaimR

pub fn syncReclaimR(comptime T: type, allocator: std.mem.Allocator, slice: CudaSliceR) !std.ArrayList(T)
Runtime-typed variant of syncReclaim. Copies a CudaSliceR to an ArrayList(T). You must supply the concrete T at the call site; it must match the runtime DType stored in slice.element_type.
T (type, required): Comptime element type; must match slice.element_type.
allocator (std.mem.Allocator, required): Allocator for the ArrayList.
slice (CudaSliceR, required): Runtime-typed GPU slice to copy from.
Returns: !std.ArrayList(T)
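A minimal sketch of reading back a runtime-typed slice, following the syncReclaim example. It assumes the "cudaz" import name and the managed ArrayList API used elsewhere on this page, where deinit() takes no arguments. The element count is arbitrary.

```zig
const std = @import("std");
const CuDevice = @import("cudaz").Device;

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    const device = try CuDevice.default();
    defer device.deinit();

    // The runtime DType is .f32, so the comptime type passed below is f32.
    const gpu_slice = try device.allocZerosR(.f32, 8);
    defer gpu_slice.free();

    var values = try CuDevice.syncReclaimR(f32, allocator, gpu_slice);
    defer values.deinit();

    std.debug.print("copied {d} elements\n", .{values.items.len});
}
```

Keeping the DType literal and the comptime type side by side at the call site makes a mismatch easier to spot.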

PTX Loading

loadPtx

pub fn loadPtx(file_path: path.PathBuffer) CudaError.Error!Module
Loads a PTX file from disk into a CUmodule using cuModuleLoad. The PathBuffer wraps a null-terminated path string.
file_path (path.PathBuffer, required): Null-terminated path to the .ptx file to load.
Returns: CudaError.Error!Module
const module = try CuDevice.loadPtx(path_buf);

loadPtxText

pub fn loadPtxText(ptx: [:0]const u8) CudaError.Error!Module
Loads a PTX module directly from a null-terminated PTX string in memory using cuModuleLoadData. Use this after compiling with Cuda.Compile.cudaText or cudaFile.
ptx ([:0]const u8, required): Null-terminated PTX string, as returned by the Compile functions.
Returns: CudaError.Error!Module
const ptx = try Cuda.Compile.cudaText(kernel_source, null, allocator);
defer allocator.free(ptx);
const module = try CuDevice.loadPtxText(ptx);
const func = try module.getFunc("my_kernel");

Memory Utilities

memsetZeros

pub fn memsetZeros(comptime data_type: type, cuda_slice: CudaSlice(data_type)) CudaError.Error!void
Sets every byte of the GPU slice to zero using cuMemsetD8_v2. Useful for resetting a previously allocated buffer.
data_type (type, required): Comptime element type of the slice.
cuda_slice (CudaSlice(data_type), required): The GPU buffer to zero-fill.
Returns: CudaError.Error!void
try CuDevice.memsetZeros(f32, my_slice);

memsetZerosR

pub fn memsetZerosR(dtype: DType, cuda_slice: CudaSliceR) CudaError.Error!void
Runtime-typed variant of memsetZeros. Sets every byte of a CudaSliceR to zero using cuMemsetD8_v2. The byte count is computed using dtype.size() * cuda_slice.len.
dtype (DType, required): Runtime element type: .f16 (2 bytes) or .f32 (4 bytes).
cuda_slice (CudaSliceR, required): The runtime-typed GPU buffer to zero-fill.
Returns: CudaError.Error!void
try CuDevice.memsetZerosR(.f32, my_runtime_slice);

free

pub fn free(ptr: cuda.CUdeviceptr) !void
Frees a raw GPU device pointer via cuMemFree_v2. Prefer calling slice.free() on a CudaSlice or CudaSliceR instead, which handles the pointer internally.
ptr (cuda.CUdeviceptr, required): The raw device pointer to free.
Returns: !void
Calling free on a pointer that was already freed, or on a pointer not obtained from cuMemAlloc, results in a CUDA error, which is why the free() method on CudaSlice(T) and CudaSliceR is the safer choice.
