Set up your GPU programming environment

Before running any lecture code, you need a working CUDA environment with PyTorch, Triton, and the NVIDIA profiling tools. This page walks through the full setup, starting from a fresh Linux machine with an NVIDIA GPU.

Prerequisites

An NVIDIA GPU (Ampere or newer recommended for all lectures)
Linux (Ubuntu 20.04+ or similar)
NVIDIA driver installed (nvidia-smi should run without errors)
Python 3.10+

The lectures assume a Linux environment. macOS is not supported — CUDA does not run on macOS. Windows via WSL2 is possible but not officially tested by the lecture authors.
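The prerequisites above can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the lectures repository: the function name `check_prerequisites` and its result keys are illustrative. It only confirms that `nvidia-smi` is on the PATH, not that it runs without errors.

```python
import platform
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite to True/False for this machine.

    Hypothetical helper for illustration; it is not shipped in utils.py.
    """
    return {
        # platform.system() returns "Linux" on Linux (including WSL2).
        "linux": platform.system() == "Linux",
        # A driver install normally places nvidia-smi on PATH. This does not
        # prove the driver is healthy; run nvidia-smi yourself to confirm.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Compare the running interpreter against the minimum version.
        "python_version_ok": sys.version_info >= min_python,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

Any line marked MISSING points at the prerequisite to fix before continuing with the installation steps.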

Installation

Install PyTorch with CUDA

Install a PyTorch build that bundles the CUDA runtime libraries. Choose a build whose CUDA version is no newer than the one your driver supports.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Replace cu121 with the tag for the CUDA build you want (for example, cu118 for CUDA 11.8). Run nvidia-smi to check: the top-right corner of its header shows the maximum CUDA version your driver supports, so pick a build at or below that version.
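As a rough sketch of that selection rule, the snippet below reads the `CUDA Version: X.Y` field from `nvidia-smi` output and picks the newest wheel tag that does not exceed it. The function names and the tag list are assumptions for illustration. Which tags are actually published depends on the PyTorch release, so check pytorch.org before relying on one.

```python
import re


def max_cuda_from_smi(smi_text):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_tag(smi_text, published_tags):
    """Return the newest tag in published_tags that the driver supports.

    published_tags is a caller-supplied list such as ["cu118", "cu121"];
    this helper does not know which tags PyTorch really publishes.
    """
    driver_max = max_cuda_from_smi(smi_text)
    if driver_max is None:
        return None
    usable = []
    for tag in published_tags:
        digits = tag[2:]  # "cu121" -> "121"
        version = (int(digits[:-1]), int(digits[-1]))  # -> (12, 1)
        if version <= driver_max:
            usable.append((version, tag))
    return max(usable)[1] if usable else None


if __name__ == "__main__":
    # Made-up sample header line, not output captured from a real machine.
    sample = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
    print(pick_tag(sample, ["cu118", "cu121", "cu124"]))
```

On a real machine you could pass the text returned by `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the sample string.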

Install Triton

Triton is used in several lectures for high-level GPU kernel authoring. It ships with recent PyTorch installs, but you can install it directly:

pip install triton

Verify it works:

import triton
print(triton.__version__)

Install Numba and other dependencies

Some lectures use Numba for CUDA kernel authoring and Matplotlib for visualisations.

pip install numba matplotlib

Numba requires the CUDA toolkit to be installed separately on the host (not just bundled with PyTorch). Install it via conda install cudatoolkit or from developer.nvidia.com/cuda-downloads.
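Once the packages are installed, a quick way to confirm what the current environment actually contains is to query each one without importing it. The sketch below uses only the standard library. The helper name `installed_versions` is illustrative, and the `"unknown"` fallback covers installs where the distribution name differs from the import name.

```python
from importlib import metadata, util


def installed_versions(packages=("torch", "triton", "numba", "matplotlib")):
    """Map each package to its installed version string, or None if absent."""
    versions = {}
    for name in packages:
        # find_spec locates a top-level package without importing it.
        if util.find_spec(name) is None:
            versions[name] = None
            continue
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but installed under a different distribution name.
            versions[name] = "unknown"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:12s} {version if version else 'not installed'}")
```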

Clone the lectures repository

All lecture code lives in the gpu-mode/lectures repository.

git clone https://github.com/gpu-mode/lectures.git
cd lectures

Verify your CUDA setup

The repository ships a utils.py with a print_cuda_info() helper that gives you a quick sanity check:

utils.py

import torch
import subprocess

def print_cuda_info():
    print("=== PyTorch CUDA Info ===")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"    Current device: {torch.cuda.current_device()}")
        print(f"    Memory allocated: {torch.cuda.memory_allocated(i)/1e6:.2f} MB")
        print(f"    Memory cached   : {torch.cuda.memory_reserved(i)/1e6:.2f} MB")

    print("\n=== nvidia-smi Info (if available) ===")
    try:
        subprocess.run(["nvidia-smi"], check=True)
    except Exception as e:
        print(f"nvidia-smi not available: {e}")

Run it from the repository root:

import sys
sys.path.insert(0, '.')
from utils import print_cuda_info

print_cuda_info()

A healthy setup prints your GPU name, CUDA version, and memory stats followed by nvidia-smi output.

Profiling tools

Two NVIDIA tools are used throughout the lectures. Both ship with the full CUDA toolkit (not with the PyTorch wheels) and are also available as standalone downloads from NVIDIA.

Nsight Systems (`nsys`)

Nsight Systems provides a timeline view of CPU and GPU activity. It is the right tool for finding where time is spent at a coarse level: which kernels launch, how long they take, and whether the CPU is the bottleneck.

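# Record a profile. With --capture-range=cudaProfilerApi, nsys collects data only
# between cudaProfilerStart/Stop calls (torch.cuda.profiler.start()/stop() in PyTorch).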
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas \
  --capture-range=cudaProfilerApi \
  --cudabacktrace=true \
  -o my_profile \
  python my_script.py

# Open the report in the Nsight Systems GUI
nsys-ui my_profile.nsys-rep
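Because the command above passes `--capture-range=cudaProfilerApi`, the script being profiled has to mark the region it wants recorded. One way to do that in PyTorch is sketched below. The `capture_region` context manager is a hypothetical helper, not part of utils.py, and it falls back to a no-op when PyTorch or a CUDA device is unavailable.

```python
from contextlib import contextmanager

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: the helper becomes a no-op
    torch = None
    HAVE_CUDA = False


@contextmanager
def capture_region(name):
    """Mark a region for nsys: start the profiler API and open an NVTX range.

    Yields True when a CUDA device is in use, False when running as a no-op.
    """
    if not HAVE_CUDA:
        yield False
        return
    torch.cuda.profiler.start()        # wraps cudaProfilerStart
    torch.cuda.nvtx.range_push(name)   # named range shown on the timeline
    try:
        yield True
    finally:
        torch.cuda.synchronize()       # let queued kernels finish first
        torch.cuda.nvtx.range_pop()
        torch.cuda.profiler.stop()     # wraps cudaProfilerStop


if __name__ == "__main__":
    with capture_region("matmul") as active:
        if active:
            a = torch.randn(1024, 1024, device="cuda")
            (a @ a).sum().item()
        else:
            print("No CUDA device found; nothing to profile.")
```

Work done outside the `with` block, such as warm-up iterations, is excluded from the recording, which keeps the timeline focused on the code under study.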

Nsight Compute (`ncu`)

Nsight Compute profiles individual CUDA kernels with hardware-counter metrics such as memory throughput, occupancy, and warp stalls. Use it after nsys has told you which kernel to optimise.

# Profile all kernels in a script
ncu python my_script.py

# Profile and save a report
ncu --set full -o kernel_report python my_script.py

Running ncu directly against a script that uses load_inline (JIT-compiled extensions) can fail with a CUDA initialisation error. Profile JIT-compiled kernels via nsys or use the --target-processes all flag with ncu.

CUDA boilerplate from `utils.py`

Every lecture that writes a custom CUDA extension starts from the cuda_begin constant in utils.py. It includes the standard headers, input-checking macros, a GPU error handler, and a cdiv ceiling-division helper:

utils.py — cuda_begin

#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
#define CUDA_ERR(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}
__host__ __device__ inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a+b-1)/b;}

Import it in your own lecture code with:

from utils import cuda_begin, load_cuda
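The `cdiv` helper at the end of cuda_begin is a ceiling division, typically used to work out how many thread blocks are needed to cover an input. The Python sketch below mirrors the same arithmetic. The `launch_config` function and the default block size of 256 are illustrative assumptions, not values taken from the lectures.

```python
def cdiv(a, b):
    """Ceiling division for non-negative integers, same formula as the C++ cdiv."""
    return (a + b - 1) // b


def launch_config(n_elements, threads_per_block=256):
    """Return (blocks, threads_per_block) covering n_elements with one thread each."""
    blocks = cdiv(n_elements, threads_per_block)
    return blocks, threads_per_block


if __name__ == "__main__":
    # 1000 elements with 256 threads per block need 4 blocks (1024 threads),
    # so the kernel must bounds-check the last 24 thread indices.
    print(launch_config(1000))
```

Rounding up rather than down is what guarantees the final partial block is launched, which is also why kernels written this way guard against out-of-range indices.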

Running notebooks and Colab

Most lectures are accompanied by Jupyter notebooks.

pip install jupyter
jupyter notebook

If you don’t have a local NVIDIA GPU, Google Colab provides free T4 access. Open any .ipynb from the repository in Colab and set the runtime to GPU under Runtime → Change runtime type. The utils.py helpers work unchanged in Colab — clone the repo or copy the file into your Colab session first.

For the fastest iteration cycle on a remote machine, use jupyter notebook --no-browser --port=8888 on the server and forward the port with SSH: ssh -L 8888:localhost:8888 user@host.
