Every CUDA kernel launch requires a grid configuration that tells the driver how many thread blocks to create and how many threads to put in each block. cudaz captures this in the LaunchConfig struct, which maps directly to the grid/block arguments of cuLaunchKernel. You can construct a LaunchConfig manually for full control over 1D, 2D, or 3D grids, or use the for_num_elems helper to automatically compute a 1D grid that covers n elements.
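To make the grid/block relationship concrete, here is a minimal C++ sketch of a struct holding the same three values. It is a hypothetical mirror for illustration only, not cudaz's Zig definition: the type name LaunchConfigSketch, the helper functions, and the field name shared_mem_bytes are assumptions.

```cpp
#include <array>
#include <cstdint>

// Hypothetical C++ mirror of a launch configuration (not the cudaz type).
struct LaunchConfigSketch {
    std::array<std::uint32_t, 3> grid_dim;   // thread blocks in X, Y, Z
    std::array<std::uint32_t, 3> block_dim;  // threads per block in X, Y, Z
    std::uint32_t shared_mem_bytes;          // assumed name: dynamic shared memory per block
};

// Total number of thread blocks in the grid: X * Y * Z.
inline std::uint64_t total_blocks(const LaunchConfigSketch& c) {
    return static_cast<std::uint64_t>(c.grid_dim[0]) * c.grid_dim[1] * c.grid_dim[2];
}

// Threads in one block: X * Y * Z. Devices cap this value, typically at 1024.
inline std::uint64_t threads_per_block(const LaunchConfigSketch& c) {
    return static_cast<std::uint64_t>(c.block_dim[0]) * c.block_dim[1] * c.block_dim[2];
}
```

For example, a 2D configuration with grid_dim = {120, 68, 1} and block_dim = {16, 16, 1} describes 8160 blocks of 256 threads each, which is enough to cover a 1920 × 1080 image with one thread per pixel.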
Import
Struct Fields
Number of thread blocks in the X, Y, and Z dimensions of the grid. The total number of thread blocks launched equals grid_dim[0] * grid_dim[1] * grid_dim[2]. For 1D workloads, set Y and Z to 1.
Number of threads per block in the X, Y, and Z dimensions. The total threads per block equals block_dim[0] * block_dim[1] * block_dim[2]. This must not exceed the device's maximum threads-per-block limit (typically 1024).
Number of bytes of dynamic shared memory to allocate per block for this launch. Pass 0 if the kernel does not use dynamic shared memory.
Functions
for_num_elems
Creates a 1D launch configuration that covers n elements using blocks of 1024 threads. The grid size is rounded up so that every element gets a thread, even if n is not a multiple of 1024.
Parameter n: Total number of elements (threads) to cover. Must be greater than zero.
Returns: Self (LaunchConfig)
Formula: num_blocks = (n + 1023) / 1024, using integer division.
Example: for n = 10000, num_blocks = (10000 + 1023) / 1024 = 10, yielding 10 × 1024 = 10,240 total threads, enough to cover all 10,000 elements (each kernel should guard with if (i < n)).
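The rounding-up arithmetic can be sketched in C++ as follows. This illustrates only the formula. It is not the cudaz implementation, and the names Grid1D, kThreadsPerBlock, and for_num_elems_sketch are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Block size stated on this page for the 1D helper.
constexpr std::uint32_t kThreadsPerBlock = 1024;

// Hypothetical result type: a 1D grid and the block size it uses.
struct Grid1D {
    std::uint32_t num_blocks;
    std::uint32_t threads_per_block;
};

// Ceiling division: the smallest number of 1024-thread blocks covering n elements.
inline Grid1D for_num_elems_sketch(std::uint32_t n) {
    assert(n > 0);  // the page requires n to be greater than zero
    // Widen to 64 bits so n + 1023 cannot overflow for very large n.
    const std::uint64_t blocks =
        (static_cast<std::uint64_t>(n) + kThreadsPerBlock - 1) / kThreadsPerBlock;
    return {static_cast<std::uint32_t>(blocks), kThreadsPerBlock};
}
```

Calling for_num_elems_sketch(10000) gives 10 blocks, matching the worked example. An exact multiple such as 1024 gives 1 block, and 1025 spills into a second block.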
Kernels launched with for_num_elems will have more threads than elements when n is not a multiple of 1024. Always add a bounds check inside your kernel: if (i < n) { ... }.
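The effect of the guard can be seen in a small CPU-side emulation. This is ordinary C++ that loops over the block and thread indices a 1D launch would produce. It is not device code, and the function name run_guarded is hypothetical. In a real kernel, the index i would be computed as blockIdx.x * blockDim.x + threadIdx.x.

```cpp
#include <cstdint>
#include <vector>

// Emulates a 1D launch on the CPU: doubles each element and returns the
// number of surplus "threads" that the bounds check skipped.
inline std::uint64_t run_guarded(std::vector<float>& data,
                                 std::uint32_t num_blocks,
                                 std::uint32_t block_size) {
    const std::uint64_t n = data.size();
    std::uint64_t skipped = 0;
    for (std::uint32_t block = 0; block < num_blocks; ++block) {
        for (std::uint32_t thread = 0; thread < block_size; ++thread) {
            // Global index, as a kernel would compute it per thread.
            const std::uint64_t i =
                static_cast<std::uint64_t>(block) * block_size + thread;
            if (i < n) {
                data[i] *= 2.0f;  // in range: do the work
            } else {
                ++skipped;        // out of range: a real kernel must do nothing here
            }
        }
    }
    return skipped;
}
```

With 10,000 elements and 10 blocks of 1024 threads, 240 threads fall outside the array. Without the guard, each of those would read and write past the end of the buffer.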