Getting Started with CUDA for Python Programmers

Lectures 3 and 5 by Jeremy Howard are designed specifically for Python programmers coming from NumPy and PyTorch. The key insight is that you do not need to abandon Python — you can write CUDA C++ kernels inline, compile them on the fly, and call them as regular Python functions. This page walks through the core tools and patterns from those lectures.

From Python to GPU kernels

In NumPy or PyTorch, you write operations over arrays and the framework handles parallelism:

# PyTorch — no explicit threading
output = input ** 2

In CUDA, you write a function that a single thread executes, then launch millions of copies:

// CUDA — each thread handles one element
__global__ void square_kernel(float* input, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = input[i] * input[i];
    }
}

The mental shift is from “operate on the whole array” to “what does one thread do?”

Thread indexing

Every thread knows its position through three built-in variables:

Variable	Meaning
`threadIdx.x`	Thread’s index within its block (0 to `blockDim.x - 1`)
`blockIdx.x`	Block’s index within the grid
`blockDim.x`	Number of threads in a block

The global index for 1D problems is always:

int i = blockIdx.x * blockDim.x + threadIdx.x;

Always add a bounds check (if (i < n)) because the total number of threads launched is usually rounded up to a multiple of the block size.

Visualize a 1D launch with blockDim.x = 4 and 3 blocks: thread global indices run 0–3, 4–7, and 8–11. If your array has 10 elements, threads 10 and 11 are out of bounds and the guard protects against writing garbage.

The `load_cuda` utility

The GPU Mode lectures use torch.utils.cpp_extension.load_inline to compile and load CUDA kernels directly from Python. The utils.py file in the lectures repository wraps this with a convenient helper:

from torch.utils.cpp_extension import load_inline

cuda_begin = r'''
#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
#define CUDA_ERR(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}
__host__ __device__ inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a+b-1)/b;}
'''

def load_cuda(cuda_src, cpp_src, funcs, opt=True, verbose=False, name=None):
    "Simple wrapper for torch.utils.cpp_extension.load_inline"
    if name is None: name = funcs[0]
    flags = "-O3 -Xptxas -O3 -Xcompiler -O3" if opt else "-O0 -Xptxas -O0 -Xcompiler -O0"
    return load_inline(cuda_sources=[cuda_src], cpp_sources=[cpp_src], functions=funcs,
                       extra_cuda_cflags=[flags], verbose=verbose, name=name)

cuda_begin provides the standard includes and macros used in every kernel. The CHECK_INPUT macro verifies that a tensor is on the GPU and contiguous before passing it to a kernel — always check this to avoid silent bugs.

The `cdiv` utility

Ceiling division comes up constantly when computing grid sizes. Both a Python and a CUDA version are provided in utils.py:

def cdiv(a, b):
    "Int ceiling division of `a` over `b`"
    return (a + b - 1) // b

Use it to compute the number of blocks needed to cover n elements:

n = 1_000_000
threads_per_block = 256
num_blocks = cdiv(n, threads_per_block)  # = 3907 blocks

Writing your first kernel end-to-end

Here is the full pattern for writing, compiling, and calling a CUDA kernel from Python using the lecture utilities:

Write the CUDA source

Concatenate cuda_begin with your kernel and a PyTorch-compatible wrapper function:

cuda_src = cuda_begin + r'''
__global__ void square_kernel(float* input, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) output[i] = input[i] * input[i];
}

torch::Tensor square(torch::Tensor input) {
    CHECK_INPUT(input);
    int n = input.numel();
    auto output = torch::empty_like(input);
    int threads = 256;
    square_kernel<<<cdiv(n, threads), threads>>>(
        input.data_ptr<float>(),
        output.data_ptr<float>(),
        n
    );
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}
'''

Write the C++ binding

Declare the function signature for the Python binding:

cpp_src = "torch::Tensor square(torch::Tensor input);"

Compile and load

Call load_cuda to JIT-compile and load the module:

module = load_cuda(cuda_src, cpp_src, ['square'])

This triggers nvcc compilation behind the scenes. The result is cached, so subsequent runs are fast.

Call the kernel

Use the compiled module like any Python function:

import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0], device='cuda')
y = module.square(x)
print(y)  # tensor([ 1.,  4.,  9., 16.], device='cuda:0')

Vector addition example

A slightly more complex example — adding two vectors — shows how two input tensors are handled:

cuda_src = cuda_begin + r'''
__global__ void add_kernel(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

torch::Tensor add(torch::Tensor a, torch::Tensor b) {
    CHECK_INPUT(a); CHECK_INPUT(b);
    int n = a.numel();
    auto c = torch::empty_like(a);
    int threads = 256;
    add_kernel<<<cdiv(n, threads), threads>>>(
        a.data_ptr<float>(),
        b.data_ptr<float>(),
        c.data_ptr<float>(),
        n
    );
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return c;
}
'''

cpp_src = "torch::Tensor add(torch::Tensor a, torch::Tensor b);"
module = load_cuda(cuda_src, cpp_src, ['add'])

Moving to more complex kernels

Once you have the basic pattern down, Lecture 5 extends it to:

2D kernels for matrix operations (using threadIdx.y, blockIdx.y)
Shared memory to reduce global memory traffic
Multiple outputs per thread (thread coarsening)
RGB image processing as a concrete 2D problem

Memory architecture

Understand shared memory, caches, and bandwidth — the next step after writing your first kernels

Performance checklist

Mark Saroufim’s checklist for finding and fixing kernel performance issues

Lecture references

Lecture 3 Colab notebook

Jeremy Howard’s first CUDA for Python Programmers notebook

Lecture 5 folder

Going Further with CUDA for Python Programmers

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Getting Started with CUDA for Python Programmers

From Python to GPU kernels

Thread indexing

The `load_cuda` utility

The `cdiv` utility

Writing your first kernel end-to-end

Vector addition example

Moving to more complex kernels

Memory architecture

Performance checklist

Lecture references

Lecture 3 Colab notebook

Lecture 5 folder

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​From Python to GPU kernels

​Thread indexing

​The load_cuda utility

​The cdiv utility

​Writing your first kernel end-to-end

​Vector addition example

​Moving to more complex kernels

Memory architecture

Performance checklist

​Lecture references

Lecture 3 Colab notebook

Lecture 5 folder

Build docs developers (and LLMs) love

From Python to GPU kernels

Thread indexing

The `load_cuda` utility

The `cdiv` utility

Writing your first kernel end-to-end

Vector addition example

Moving to more complex kernels

Lecture references