Use this file to discover all available pages before exploring further.
Lectures 3 and 5 by Jeremy Howard are designed specifically for Python programmers coming from NumPy and PyTorch. The key insight is that you do not need to abandon Python — you can write CUDA C++ kernels inline, compile them on the fly, and call them as regular Python functions. This page walks through the core tools and patterns from those lectures.
In NumPy or PyTorch, you write operations over arrays and the framework handles parallelism:
# PyTorch — no explicit threadingoutput = input ** 2
In CUDA, you write a function that a single thread executes, then launch millions of copies:
// CUDA — each thread handles one element__global__ void square_kernel(float* input, float* output, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { output[i] = input[i] * input[i]; }}
The mental shift is from “operate on the whole array” to “what does one thread do?”
Every thread knows its position through three built-in variables:
Variable
Meaning
threadIdx.x
Thread’s index within its block (0 to blockDim.x - 1)
blockIdx.x
Block’s index within the grid
blockDim.x
Number of threads in a block
The global index for 1D problems is always:
int i = blockIdx.x * blockDim.x + threadIdx.x;
Always add a bounds check (if (i < n)) because the total number of threads launched is usually rounded up to a multiple of the block size.
Visualize a 1D launch with blockDim.x = 4 and 3 blocks: thread global indices run 0–3, 4–7, and 8–11. If your array has 10 elements, threads 10 and 11 are out of bounds and the guard protects against writing garbage.
The GPU Mode lectures use torch.utils.cpp_extension.load_inline to compile and load CUDA kernels directly from Python. The utils.py file in the lectures repository wraps this with a convenient helper:
from torch.utils.cpp_extension import load_inlinecuda_begin = r'''#include <torch/extension.h>#include <stdio.h>#include <c10/cuda/CUDAException.h>#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)#define CUDA_ERR(ans) { gpuAssert((ans), __FILE__, __LINE__); }inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){ if (code != cudaSuccess) { fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line); if (abort) exit(code); }}__host__ __device__ inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a+b-1)/b;}'''def load_cuda(cuda_src, cpp_src, funcs, opt=True, verbose=False, name=None): "Simple wrapper for torch.utils.cpp_extension.load_inline" if name is None: name = funcs[0] flags = "-O3 -Xptxas -O3 -Xcompiler -O3" if opt else "-O0 -Xptxas -O0 -Xcompiler -O0" return load_inline(cuda_sources=[cuda_src], cpp_sources=[cpp_src], functions=funcs, extra_cuda_cflags=[flags], verbose=verbose, name=name)
cuda_begin provides the standard includes and macros used in every kernel. The CHECK_INPUT macro verifies that a tensor is on the GPU and contiguous before passing it to a kernel — always check this to avoid silent bugs.
A slightly more complex example — adding two vectors — shows how two input tensors are handled:
cuda_src = cuda_begin + r'''__global__ void add_kernel(float* a, float* b, float* c, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) c[i] = a[i] + b[i];}torch::Tensor add(torch::Tensor a, torch::Tensor b) { CHECK_INPUT(a); CHECK_INPUT(b); int n = a.numel(); auto c = torch::empty_like(a); int threads = 256; add_kernel<<<cdiv(n, threads), threads>>>( a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), n ); C10_CUDA_KERNEL_LAUNCH_CHECK(); return c;}'''cpp_src = "torch::Tensor add(torch::Tensor a, torch::Tensor b);"module = load_cuda(cuda_src, cpp_src, ['add'])