Device API Reference — cudaz GPU Management

Device is the central type in cudaz. It represents a single NVIDIA GPU and its primary CUDA context. You use a Device to allocate GPU memory, copy data between host and device, load compiled PTX modules, and free resources. Most cudaz workflows begin by calling Device.default() or Device.new(gpu) and end with device.deinit().

Import

const CuDevice = @import("cudaz").Device;

Fields

Field	Type	Description
`device`	`cuda.CUdevice`	Underlying CUDA device handle from the driver API
`primary_context`	`cuda.CUcontext`	The primary context retained for this device
`ordinal`	`u16`	Zero-based GPU index used to select this device

Initialization

`default`

pub fn default() CudaError.Error!Device

Initializes and returns a Device for GPU 0. Calls cuInit, retrieves device 0, retains its primary context, and sets it as the current context. Equivalent to Device.new(0).

Returns: CudaError.Error!Device

const device = try CuDevice.default();
defer device.deinit();

`new`

pub fn new(gpu: u16) CudaError.Error!Device

Initializes and returns a Device for the GPU at the specified ordinal index.

gpu

u16

required

Zero-based index of the GPU to initialize. Pass 0 for the first GPU, 1 for the second, and so on.

Returns: CudaError.Error!Device

const device = try CuDevice.new(1); // second GPU
defer device.deinit();

`deinit`

pub fn deinit(self: *const Device) void

Releases the primary context for this device via cuDevicePrimaryCtxRelease. Panics if the release fails. Call this with defer immediately after a successful default() or new().

self

*const Device

required

Pointer to the Device to release.

Memory Allocation

`alloc`

pub fn alloc(self: Device, comptime T: type, length: usize) CudaError.Error!CudaSlice(T)

Allocates uninitialized GPU memory for length elements of type T. The returned CudaSlice(T) holds the device pointer and length. The caller is responsible for freeing the slice with slice.free().

self

Device

required

The device on which to allocate memory.

T

type

required

Comptime element type (e.g., f32, i32, u8).

length

usize

required

Number of elements to allocate. Total bytes allocated is @sizeOf(T) * length.

Returns: CudaError.Error!CudaSlice(T)

const slice = try device.alloc(f32, 1024);
defer slice.free();
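
The byte count given above, @sizeOf(T) * length, can overflow usize for very large requests. The following is a minimal sketch in plain Zig, independent of cudaz, of computing that size with an overflow check before calling alloc. The helper name byteCount is hypothetical and not part of the library; whether alloc performs a similar check internally is not stated on this page.

```zig
const std = @import("std");

/// Hypothetical helper (not part of cudaz): number of bytes needed for
/// `length` elements of `T`, or error.Overflow if the product does not
/// fit in a usize.
fn byteCount(comptime T: type, length: usize) error{Overflow}!usize {
    return std.math.mul(usize, @sizeOf(T), length);
}

pub fn main() !void {
    // 1024 f32 elements, the same request as the alloc example.
    const bytes = try byteCount(f32, 1024);
    std.debug.print("f32 x 1024 = {d} bytes\n", .{bytes});
}
```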

`allocZeros`

pub fn allocZeros(self: Device, comptime data_type: type, length: usize) CudaError.Error!CudaSlice(data_type)

Allocates GPU memory for length elements of type data_type and immediately zero-initializes every byte using cuMemsetD8_v2. Equivalent to alloc followed by memsetZeros.

self

Device

required

The device on which to allocate memory.

data_type

type

required

Comptime element type.

length

usize

required

Number of elements to allocate and zero.

Returns: CudaError.Error!CudaSlice(data_type)

const zeros = try device.allocZeros(f32, 256);
defer zeros.free();

`allocR`

pub fn allocR(self: Device, dtype: DType, length: usize) CudaError.Error!CudaSliceR

Allocates uninitialized GPU memory using a runtime DType value instead of a comptime type. Returns a CudaSliceR whose element_type field carries the DType for later dispatch.

self

Device

required

The device on which to allocate memory.

dtype

DType

required

Runtime element type — .f16 (2 bytes) or .f32 (4 bytes).

length

usize

required

Number of elements to allocate.

Returns: CudaError.Error!CudaSliceR

const slice = try device.allocR(.f32, 512);
defer slice.free();
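
As a rough illustration of how a runtime element type maps to a byte count, the sketch below models DType as a two-variant enum with the sizes listed above. This DType is a stand-in written for the example, not the cudaz definition; the real enum may have more variants or a different size() implementation.

```zig
const std = @import("std");

/// Stand-in for cudaz's DType (assumption: only the two variants named above).
const DType = enum {
    f16,
    f32,

    /// Size in bytes of one element of this type.
    fn size(self: DType) usize {
        return switch (self) {
            .f16 => 2,
            .f32 => 4,
        };
    }
};

/// Bytes needed for `length` elements of a runtime-selected type.
fn runtimeByteCount(dtype: DType, length: usize) usize {
    return dtype.size() * length;
}

pub fn main() void {
    // Same request as the allocR example: 512 f32 elements.
    std.debug.print("{d} bytes\n", .{runtimeByteCount(.f32, 512)});
}
```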

`allocZerosR`

pub fn allocZerosR(self: Device, dtype: DType, length: usize) CudaError.Error!CudaSliceR

Runtime-typed equivalent of allocZeros. Allocates and zero-initializes GPU memory with element type determined at runtime.

self

Device

required

The device on which to allocate memory.

dtype

DType

required

Runtime element type.

length

usize

required

Number of elements to allocate and zero.

Returns: CudaError.Error!CudaSliceR

Host ↔ Device Transfers

`htodCopy`

pub fn htodCopy(self: Device, comptime T: type, src: []const T) CudaError.Error!CudaSlice(T)

Allocates a new CudaSlice(T) on the device and copies all elements from the host slice src into it in one operation. The returned slice has the same length as src.

self

Device

required

The device to allocate on and copy to.

T

type

required

Comptime element type.

src

[]const T

required

Host slice to copy from. The entire slice is transferred.

Returns: CudaError.Error!CudaSlice(T)

const host_data = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
const gpu_slice = try device.htodCopy(f32, &host_data);
defer gpu_slice.free();

`htodCopyInto`

pub fn htodCopyInto(comptime T: type, src: []const T, destination: CudaSlice(T)) CudaError.Error!void

Copies src into an already-allocated CudaSlice(T). Asserts that src.len == destination.len. Use this when you want to reuse an existing GPU allocation rather than allocating a new one.

T

type

required

Comptime element type.

src

[]const T

required

Host slice to copy from. Must have the same length as destination.

destination

CudaSlice(T)

required

Pre-allocated GPU slice to copy into.

Returns: CudaError.Error!void

const new_data = [_]f32{ 5.0, 6.0, 7.0, 8.0 };
try CuDevice.htodCopyInto(f32, &new_data, gpu_slice);

`dtohCopy`

pub fn dtohCopy(comptime T: type, allocator: std.mem.Allocator, slice: CudaSlice(T)) ![]T

Allocates a host buffer with allocator and copies the contents of slice into it. The returned []T is owned by the caller and must be freed with allocator.free(...).

T

type

required

Comptime element type.

allocator

std.mem.Allocator

required

Allocator used to create the host buffer.

slice

CudaSlice(T)

required

GPU slice to copy from.

Returns: ![]T (host slice allocated with allocator)

const host_buf = try CuDevice.dtohCopy(f32, allocator, gpu_slice);
defer allocator.free(host_buf);

`syncReclaim`

pub fn syncReclaim(comptime T: type, allocator: std.mem.Allocator, slice: CudaSlice(T)) !std.ArrayList(T)

Copies the GPU slice into a newly created ArrayList(T). The returned ArrayList owns its memory; call deinit() on it when done. This is a convenience wrapper around dtohCopy for callers that prefer ArrayList.

T

type

required

Comptime element type.

allocator

std.mem.Allocator

required

Allocator for the ArrayList.

slice

CudaSlice(T)

required

GPU slice to copy from.

Returns: !std.ArrayList(T)

var results = try CuDevice.syncReclaim(f32, allocator, gpu_slice);
defer results.deinit();
std.debug.print("first element: {d}\n", .{results.items[0]});

`syncReclaimR`

pub fn syncReclaimR(comptime T: type, allocator: std.mem.Allocator, slice: CudaSliceR) !std.ArrayList(T)

Runtime-typed variant of syncReclaim. Copies a CudaSliceR to an ArrayList(T). You must supply the concrete T at the call site; it must match the runtime DType stored in slice.element_type.

T

type

required

Comptime element type — must match slice.element_type.

allocator

std.mem.Allocator

required

Allocator for the ArrayList.

slice

CudaSliceR

required

Runtime-typed GPU slice to copy from.

Returns: !std.ArrayList(T)
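
Because T is fixed at compile time while the element type of a CudaSliceR is only known at runtime, a mismatch is easy to introduce. One way to guard the call is to check the pair first. The sketch below is plain Zig with a stand-in DType enum and a hypothetical matches helper; neither is taken from cudaz, and the library may already validate this internally.

```zig
const std = @import("std");

/// Stand-in for cudaz's DType (assumption: the two variants documented on this page).
const DType = enum { f16, f32 };

/// Hypothetical helper: true when comptime `T` is the Zig type that
/// corresponds to the runtime tag `dtype`.
fn matches(comptime T: type, dtype: DType) bool {
    return switch (dtype) {
        .f16 => T == f16,
        .f32 => T == f32,
    };
}

pub fn main() void {
    const tag: DType = .f32;
    // A caller would only proceed to syncReclaimR(f32, ...) when the check passes.
    std.debug.print("f32 matches: {}\n", .{matches(f32, tag)});
    std.debug.print("f16 matches: {}\n", .{matches(f16, tag)});
}
```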

PTX Loading

`loadPtx`

pub fn loadPtx(file_path: path.PathBuffer) CudaError.Error!Module

Loads a PTX file from disk into a CUmodule using cuModuleLoad. The PathBuffer wraps a null-terminated path string.

file_path

path.PathBuffer

required

Null-terminated path to the .ptx file to load.

Returns: CudaError.Error!Module

const module = try CuDevice.loadPtx(path_buf);

`loadPtxText`

pub fn loadPtxText(ptx: [:0]const u8) CudaError.Error!Module

Loads a PTX module directly from a null-terminated PTX string in memory using cuModuleLoadData. Use this after compiling with Cuda.Compile.cudaText or cudaFile.

ptx

[:0]const u8

required

Null-terminated PTX string, as returned by the Compile functions.

Returns: CudaError.Error!Module

const ptx = try Cuda.Compile.cudaText(kernel_source, null, allocator);
defer allocator.free(ptx);
const module = try CuDevice.loadPtxText(ptx);
const func = try module.getFunc("my_kernel");
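
loadPtxText needs a [:0]const u8, whereas PTX read from a file or another buffer often arrives as a plain []const u8. A minimal sketch of the conversion using the standard library's Allocator.dupeZ follows; the helper name toPtxText and the placeholder string are illustrative only and not part of cudaz.

```zig
const std = @import("std");

/// Hypothetical helper: copies `bytes` into a new buffer that ends with a
/// 0 sentinel. The caller owns the result and frees it with `allocator.free`.
fn toPtxText(allocator: std.mem.Allocator, bytes: []const u8) ![:0]u8 {
    return allocator.dupeZ(u8, bytes);
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // Placeholder text standing in for real PTX, held as a non-terminated slice.
    const raw: []const u8 = "// placeholder PTX";

    const ptx = try toPtxText(allocator, raw);
    defer allocator.free(ptx);

    // `ptx` coerces to [:0]const u8, the parameter type of loadPtxText.
    std.debug.print("{d} bytes plus sentinel\n", .{ptx.len});
}
```

String literals and @embedFile results are already null-terminated in Zig, so they can be passed without this copy.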

Memory Utilities

`memsetZeros`

pub fn memsetZeros(comptime data_type: type, cuda_slice: CudaSlice(data_type)) CudaError.Error!void

Sets every byte of the GPU slice to zero using cuMemsetD8_v2. Useful for resetting a previously allocated buffer.

data_type

type

required

Comptime element type of the slice.

cuda_slice

CudaSlice(data_type)

required

The GPU buffer to zero-fill.

Returns: CudaError.Error!void

try CuDevice.memsetZeros(f32, my_slice);

`memsetZerosR`

pub fn memsetZerosR(dtype: DType, cuda_slice: CudaSliceR) CudaError.Error!void

Runtime-typed variant of memsetZeros. Sets every byte of a CudaSliceR to zero using cuMemsetD8_v2. The byte count is computed as dtype.size() * cuda_slice.len.

dtype

DType

required

Runtime element type — .f16 (2 bytes) or .f32 (4 bytes).

cuda_slice

CudaSliceR

required

The runtime-typed GPU buffer to zero-fill.

Returns: CudaError.Error!void

try CuDevice.memsetZerosR(.f32, my_runtime_slice);

`free`

pub fn free(ptr: cuda.CUdeviceptr) !void

Frees a raw GPU device pointer via cuMemFree_v2. Prefer calling slice.free() on a CudaSlice or CudaSliceR instead, which handles the pointer internally.

ptr

cuda.CUdeviceptr

required

The raw device pointer to free.

Returns: !void

Calling free on a pointer that was already freed, or on one not obtained from cuMemAlloc, results in a CUDA error.
