Every CUDA kernel launch requires a grid configuration that tells the GPU how many thread blocks to create and how many threads to place in each block. cudaz captures this through the LaunchConfig struct, which is passed as the second argument to function.run(params, cfg). Getting the configuration right is important: too few threads leave the GPU underutilised, a grid that is too small silently skips elements at the end of an array, and a grid that is too large causes out-of-bounds memory accesses unless the kernel checks its index.
LaunchConfig Fields
The LaunchConfig struct has three fields, which together map onto the gridDimX/Y/Z, blockDimX/Y/Z, and sharedMemBytes parameters of cuLaunchKernel:
| Field | Type | Description |
|---|---|---|
| grid_dim | struct { u32, u32, u32 } | Number of thread blocks in the x, y, and z dimensions of the grid |
| block_dim | struct { u32, u32, u32 } | Number of threads per block in the x, y, and z dimensions |
| shared_mem_bytes | u32 | Bytes of dynamic shared memory to allocate per block |
For 1D workloads, the values that matter are grid_dim[0] and block_dim[0]; the remaining dimensions default to 1. For 2D workloads such as matrix operations, you set both the x and y dimensions. The z dimension is typically left at 1 unless you are working with 3D data.
Manual Configuration
You can construct a LaunchConfig literal directly when you know the exact launch geometry.
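As a rough illustration of the geometry such a literal describes, the sketch below mirrors the three fields in a plain C++ struct. The names Extent3, LaunchConfigSketch, total_threads, and cfg are assumptions made for this example; they are not part of cudaz, whose real struct is written in Zig.

```cpp
#include <cstdint>

// Hypothetical mirror of the three LaunchConfig fields (not the cudaz type).
struct Extent3 {
    std::uint32_t x, y, z;
};

struct LaunchConfigSketch {
    Extent3 grid_dim;                // thread blocks in the grid (x, y, z)
    Extent3 block_dim;               // threads per block (x, y, z)
    std::uint32_t shared_mem_bytes;  // dynamic shared memory per block
};

// Total threads launched = blocks in the grid * threads per block.
constexpr std::uint64_t total_threads(const LaunchConfigSketch& c) {
    return static_cast<std::uint64_t>(c.grid_dim.x) * c.grid_dim.y * c.grid_dim.z
         * c.block_dim.x * c.block_dim.y * c.block_dim.z;
}

// 4 blocks of 256 threads each, with no dynamic shared memory.
constexpr LaunchConfigSketch cfg{{4, 1, 1}, {256, 1, 1}, 0};
```

Keeping the thread count as a separate helper makes it easy to compare the launch size against the number of elements before launching.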
For example, setting grid_dim to {4, 1, 1} and block_dim to {256, 1, 1} launches 4 × 256 = 1024 threads arranged as 4 blocks of 256 threads each. The total thread count must be at least as large as the number of elements you want to process. Any threads beyond the last valid element must be guarded by a bounds check inside the kernel.
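The effect of the guard can be sketched on the CPU. The plain C++ function below replays the index arithmetic that each GPU thread performs (blockIdx.x * blockDim.x + threadIdx.x in CUDA) and applies the same i < n check. LaunchReplay and replay_guarded_launch are hypothetical names for this illustration; in a real kernel the two loops are replaced by the hardware's thread indices.

```cpp
#include <cstdint>

// Outcome of replaying a 1D launch on the CPU.
struct LaunchReplay {
    std::uint32_t processed;  // threads that passed the bounds check
    std::uint32_t idle;       // surplus threads turned away by the guard
};

// Each (block, thread) pair plays the role of one GPU thread.
constexpr LaunchReplay replay_guarded_launch(std::uint32_t grid_x,
                                             std::uint32_t block_x,
                                             std::uint32_t n) {
    LaunchReplay r{0, 0};
    for (std::uint32_t block = 0; block < grid_x; ++block) {
        for (std::uint32_t thread = 0; thread < block_x; ++thread) {
            // Global index, computed as a kernel would compute it.
            const std::uint64_t i =
                static_cast<std::uint64_t>(block) * block_x + thread;
            if (i < n) {
                ++r.processed;  // a kernel would read or write element i here
            } else {
                ++r.idle;       // without the guard this would be out of bounds
            }
        }
    }
    return r;
}
```

With 4 blocks of 256 threads and 1000 elements, 1000 threads do work and 24 are idle. Dropping to 3 blocks processes only 768 elements, so the last 232 are silently skipped.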
Helper: LaunchConfig.for_num_elems(n)
For standard 1D element-wise operations, for_num_elems computes the grid and block dimensions automatically from the number of elements. It uses 1024 threads per block, enough blocks to cover all n elements, and shared_mem_bytes = 0.
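The arithmetic behind a helper of this kind can be sketched as follows, assuming 1024 threads per block and a block count obtained by ceiling division. Geometry1D and for_num_elems_sketch are hypothetical C++ names used only for illustration; the library's own implementation may differ in detail.

```cpp
#include <cstdint>

// Hypothetical 1D launch geometry (x dimension only; y and z are 1).
struct Geometry1D {
    std::uint32_t grid_x;            // number of blocks
    std::uint32_t block_x;           // threads per block
    std::uint32_t shared_mem_bytes;  // dynamic shared memory per block
};

// Assumed block size, taken from the 1024-thread figure in this document.
constexpr std::uint32_t kThreadsPerBlock = 1024;

constexpr Geometry1D for_num_elems_sketch(std::uint32_t n) {
    // Ceiling division written as quotient plus remainder test, so that
    // n close to the u32 maximum cannot overflow as n + 1023 would.
    return Geometry1D{
        n / kThreadsPerBlock + (n % kThreadsPerBlock != 0 ? 1u : 0u),
        kThreadsPerBlock,
        0};
}
```

Because the block count is rounded up, the last block usually contains surplus threads, which is why the kernel still needs its bounds check. In this sketch n = 0 yields zero blocks, a case the caller would have to handle separately.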
2D Configuration (Matrix Operations)
For 2D kernels such as matrix multiplication, set both the x and y dimensions of block_dim and compute grid_dim to cover the full matrix.
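A minimal sketch of that computation, assuming a square 16 × 16 thread block (a common but arbitrary choice) and a mapping of columns to x and rows to y. Geometry2D, ceil_div, and cover_matrix are hypothetical names for this example.

```cpp
#include <cstdint>

// Hypothetical 2D launch geometry (z is left at 1).
struct Geometry2D {
    std::uint32_t grid_x, grid_y;    // blocks along columns and rows
    std::uint32_t block_x, block_y;  // threads per block along columns and rows
};

// Smallest number of chunks of size b needed to cover a items.
constexpr std::uint32_t ceil_div(std::uint32_t a, std::uint32_t b) {
    return a / b + (a % b != 0 ? 1u : 0u);
}

// Cover a rows x cols matrix with square tiles of tile x tile threads.
constexpr Geometry2D cover_matrix(std::uint32_t rows, std::uint32_t cols,
                                  std::uint32_t tile = 16) {
    return Geometry2D{ceil_div(cols, tile), ceil_div(rows, tile), tile, tile};
}
```

For a 500 × 1000 matrix this gives a 63 × 32 grid of 256-thread blocks, covering 1008 columns and 512 rows. A kernel launched this way would need to check both its row and its column index against the matrix size.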
Shared Memory
Set shared_mem_bytes to allocate dynamic shared memory per block. Shared memory is accessible to all threads in the same block and is much faster than global device memory.
Inside the kernel, the dynamically sized buffer is declared with extern __shared__.
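One way to choose the byte count is to size the buffer from the block size and element type. The sketch below is host-only C++ and assumes one element of scratch space per thread; shared_bytes_per_block is a hypothetical helper, and the kernel-side lines appear only as comments because they are CUDA C++ rather than plain C++.

```cpp
#include <cstdint>

// Bytes of dynamic shared memory for one element of type T per thread.
template <typename T>
constexpr std::uint32_t shared_bytes_per_block(std::uint32_t threads_per_block) {
    return threads_per_block * static_cast<std::uint32_t>(sizeof(T));
}

// Kernel side, for orientation only (CUDA C++, not compiled here):
//   extern __shared__ float scratch[];  // length is set by shared_mem_bytes
//   scratch[threadIdx.x] = input[i];    // each thread fills its own slot
//   __syncthreads();                    // wait until the whole block has written
```

For a block of 256 threads working on 4-byte floats, the value to pass as shared_mem_bytes would be 1024.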
CUDA hardware limits the total number of threads per block to 1024 on most GPUs (the exact limit is reported by cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK). LaunchConfig.for_num_elems uses exactly 1024 threads per block, which is the safe maximum. If you set block_dim manually to a product exceeding 1024 (e.g., {32, 32, 2} = 2048), the kernel launch will fail at runtime.
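A block shape can be checked on the host before launching. The sketch below hard-codes the 1024-thread figure as an assumption instead of querying the device; kAssumedMaxThreadsPerBlock and block_dim_fits are hypothetical names for this illustration.

```cpp
#include <cstdint>

// Assumed limit; the real value should come from the device attribute query.
constexpr std::uint32_t kAssumedMaxThreadsPerBlock = 1024;

// True when every dimension is nonzero and the product stays within the limit.
constexpr bool block_dim_fits(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    // Each dimension is bounded first, so the 64-bit product cannot overflow.
    return x >= 1 && y >= 1 && z >= 1 &&
           x <= kAssumedMaxThreadsPerBlock &&
           y <= kAssumedMaxThreadsPerBlock &&
           z <= kAssumedMaxThreadsPerBlock &&
           static_cast<std::uint64_t>(x) * y * z <= kAssumedMaxThreadsPerBlock;
}
```

Under this assumption {32, 32, 1} is accepted because it is exactly 1024 threads, while {32, 32, 2} is rejected.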