Serving LLMs efficiently requires more than fast kernels: you need smart scheduling, KV cache reuse, and careful management of the memory-compute tradeoff between prefill and decode. SGLang (Structured Generation Language) is an open-source LLM serving framework built around these ideas. Lecture 35, presented by Yineng Zhang, covers how SGLang achieves high throughput and low latency through RadixAttention, continuous batching, and chunked prefill. This page walks through the core techniques and their performance implications.
What SGLang is
SGLang is a serving framework and programming model for LLMs that focuses on two things:
- Structured generation: a frontend language for expressing programs that interleave LLM generation with control flow, enabling multi-turn prompts, tool calls, and constrained outputs.
- High-performance runtime: a backend that maximizes GPU utilization through efficient KV cache management, batching, and scheduling.
RadixAttention: KV cache reuse via prefix tree
The most distinctive feature of SGLang is RadixAttention, a KV cache management scheme that automatically reuses cached key-value pairs across requests that share a common prefix.
Why prefix reuse matters
Many LLM workloads share prefixes:
- System prompts: every request in a chat application starts with the same system message.
- Few-shot examples: classification tasks with the same examples in every prompt.
- RAG: retrieved documents that appear in multiple requests.
- Multi-turn conversations: each turn contains all previous turns as context.
How RadixAttention works
SGLang maintains a radix tree (a compressed trie, in which a single edge can carry a run of tokens) of cached KV blocks, indexed by token sequence. When a new request arrives:
Prefix lookup
SGLang walks the radix tree to find the longest matching prefix in the cache. The matched KV blocks are reused directly — no recomputation needed.
Incremental prefill
Only the unmatched suffix of the prompt is prefilled. The new KV blocks are appended to the matched prefix in the tree.
RadixAttention is transparent to the user — you don’t need to explicitly mark shared prefixes. SGLang identifies them automatically from the token sequences of incoming requests.
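The two steps above can be sketched with a toy radix tree over token ids. This is an illustrative sketch, not SGLang's implementation: the names `Node`, `PrefixCache`, `match_prefix`, and `insert` are hypothetical, and the tree stores only token ids where a real cache would also hold the KV blocks for each node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tokens: tuple = ()                            # edge label: a run of token ids
    children: dict = field(default_factory=dict)  # first token of an edge -> child Node


def _common_len(a, b):
    """Length of the shared prefix of two token sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class PrefixCache:
    """Toy radix tree (hypothetical); a real cache would attach KV blocks to nodes."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            k = _common_len(child.tokens, tokens[matched:])
            matched += k
            if k < len(child.tokens):  # diverged in the middle of an edge
                break
            node = child
        return matched

    def insert(self, tokens):
        """Cache `tokens`; return the number of tokens that still needed prefill."""
        tokens = tuple(tokens)
        node, pos = self.root, 0
        while pos < len(tokens):
            child = node.children.get(tokens[pos])
            if child is None:
                # No edge starts with this token: append the whole unmatched suffix.
                node.children[tokens[pos]] = Node(tokens[pos:])
                return len(tokens) - pos
            k = _common_len(child.tokens, tokens[pos:])
            if k < len(child.tokens):
                # Split the edge so the shared part becomes its own node.
                shared = Node(child.tokens[:k], {child.tokens[k]: child})
                child.tokens = child.tokens[k:]
                node.children[tokens[pos]] = shared
                child = shared
            node, pos = child, pos + k
        return 0  # the whole sequence was already cached


cache = PrefixCache()
system = [1, 2, 3, 4]                              # shared system-prompt tokens
first = cache.insert(system + [10, 11])            # cold cache: everything is prefilled
hit = cache.match_prefix(system + [20, 21, 22])    # only the system prompt matches
second = cache.insert(system + [20, 21, 22])       # only the new suffix is prefilled
```

The second request shares only the system prompt with the first, so the tree splits the original edge at that point and the two suffixes become sibling nodes under the shared prefix.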
PagedAttention vs. RadixAttention
Both systems manage KV cache in fixed-size pages/blocks rather than contiguous buffers, but they differ in how they track and reuse blocks:
| Feature | PagedAttention (vLLM) | RadixAttention (SGLang) |
|---|---|---|
| Storage unit | Fixed-size pages | Fixed-size blocks in a radix tree |
| Prefix reuse | Manual prefix caching (added later) | Automatic, built-in |
| Eviction policy | LRU per request | LRU on tree leaves |
| Multi-request sharing | Copy-on-write | Shared tree nodes |
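To make the "LRU on tree leaves" row concrete, here is a minimal sketch of leaf-first eviction. It rests on assumptions that the table does not state: `CacheNode`, `evict_lru_leaves`, and the `ref_count` pinning rule are hypothetical names and behavior chosen for illustration, and the leaf scan is deliberately naive where a production system would more likely keep a priority queue.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison; nodes reference each other
class CacheNode:
    name: str
    num_tokens: int              # KV entries held by this node
    last_access: int             # logical timestamp of the last cache hit
    ref_count: int = 0           # assumed: running requests currently using this node
    parent: "CacheNode | None" = None
    children: list = field(default_factory=list)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def evict_lru_leaves(root, tokens_needed):
    """Free at least `tokens_needed` tokens by removing idle leaves, oldest first."""
    freed, evicted = 0, []
    while freed < tokens_needed:
        # Collect evictable leaves: no children, not in use, not the root.
        stack, leaves = [root], []
        while stack:
            node = stack.pop()
            stack.extend(node.children)
            if node is not root and not node.children and node.ref_count == 0:
                leaves.append(node)
        if not leaves:
            break  # everything left is pinned by a running request
        victim = min(leaves, key=lambda n: n.last_access)
        victim.parent.children.remove(victim)
        freed += victim.num_tokens
        evicted.append(victim.name)
    return freed, evicted


root = CacheNode("root", 0, 0)
system = root.add_child(CacheNode("system", 100, last_access=5))
chat_a = system.add_child(CacheNode("chat_a", 40, last_access=1))
chat_b = system.add_child(CacheNode("chat_b", 30, last_access=3, ref_count=1))

# chat_b is still running, so only chat_a can be evicted.
freed_1, evicted_1 = evict_lru_leaves(root, tokens_needed=120)

# Once chat_b finishes, it becomes evictable, and then so does its parent.
chat_b.ref_count = 0
freed_2, evicted_2 = evict_lru_leaves(root, tokens_needed=120)
```

Evicting only leaves means a shared prefix is never dropped while a longer sequence that depends on it is still cached.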
Continuous batching
Early LLM serving systems used static batching: collect a fixed batch of requests, run them to completion, then start the next batch. This wastes GPU cycles when requests in the batch finish at different times. Continuous batching (also called in-flight batching or iteration-level scheduling) eliminates this waste by:
- Running the scheduler at every decode step, not just between batches.
- Immediately inserting new requests into the running batch when existing requests finish.
- Maintaining high GPU utilization by keeping the batch full whenever requests are waiting.
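The difference between the two policies can be seen in a small step-counting simulation. This is a simplified sketch under stated assumptions: every request is reduced to its number of output tokens, prefill cost is ignored, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Refill free batch slots from the queue before every decode step."""
    waiting = deque(output_lens)
    running = []  # remaining output tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                                   # one decode iteration
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps


# Output lengths for eight requests, served with two batch slots.
lens = [2, 8, 2, 8, 2, 2, 2, 2]
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
```

With these lengths, static batching needs 20 decode steps because each short request waits for the long one in its batch, while continuous batching needs 14, which is the minimum for 28 tokens on two slots.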
Chunked prefill: balancing prefill and decode
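Prefill (processing the input prompt) and decode (generating output tokens) have very different compute characteristics:
- Prefill is compute-bound: processing N prompt tokens in parallel is a large matrix multiplication.
- Decode is memory-bandwidth-bound: each step generates one token per request but still has to read all model weights from memory.
Because both phases share the GPU, one long prefill can stall the decode steps of every other running request. Chunked prefill bounds that stall in two steps.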
Chunk the prompt
Instead of prefilling all N tokens at once, split the prompt into chunks of C tokens (e.g. C=512).
Interleave with decode
Each iteration processes one chunk of the long prompt alongside the normal decode tokens from other requests. The prefill work per step is bounded by C tokens.
Chunked prefill adds a small overhead: more scheduler iterations are needed per request, and the partial KV cache must be stored between chunks. Prompts that fit in a single chunk are prefilled in one pass, so they are unaffected.
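As a rough illustration of the two steps, the sketch below splits one long prompt into chunks and lists the token counts the GPU would see at each iteration. The helper names and the plan format are hypothetical, and the model is simplified: one long prefill, a fixed set of decoding requests, and no new arrivals.

```python
def split_into_chunks(num_prompt_tokens, chunk_size):
    """Return (start, end) token ranges covering the prompt, each at most chunk_size long."""
    return [(start, min(start + chunk_size, num_prompt_tokens))
            for start in range(0, num_prompt_tokens, chunk_size)]


def schedule(num_prompt_tokens, chunk_size, num_decode_requests):
    """Per-iteration token counts when one long prefill is interleaved with decode."""
    plan = []
    for start, end in split_into_chunks(num_prompt_tokens, chunk_size):
        plan.append({
            "prefill_tokens": end - start,          # one chunk of the long prompt
            "decode_tokens": num_decode_requests,   # one new token per running request
        })
    return plan


# A 1300-token prompt with C=512, sharing the GPU with 8 decoding requests.
plan = schedule(num_prompt_tokens=1300, chunk_size=512, num_decode_requests=8)
for step, work in enumerate(plan):
    print(step, work)
```

The decoding requests make progress at every iteration, and no iteration handles more than C prefill tokens; the cost is that the long prompt now takes three iterations to reach its first output token instead of one.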
Tensor parallelism in SGLang
For models too large to fit on a single GPU, SGLang uses tensor parallelism (TP) to split individual layers across multiple devices. In TP:
- Linear layers are split column-wise or row-wise across tp_size GPUs.
- Each GPU holds a shard of the weight matrix and computes its partial output.
- AllReduce collectives synchronize the partial outputs after each layer.
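The arithmetic behind these bullets can be checked on a tiny example. The sketch below assumes the common pairing of a column-split layer followed by a row-split layer, uses plain Python lists in place of GPU tensors, and omits the activation function between the two layers; `all_reduce_sum` is a hypothetical stand-in for the real collective.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split matrix m into `parts` equal groups of columns."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split matrix m into `parts` equal groups of rows."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of per-device partial outputs (stand-in for AllReduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


tp_size = 2
x = [[1, 2], [3, 4]]                     # activations: 2 tokens, hidden size 2
w1 = [[1, 0, 2, 1], [0, 1, 1, 2]]        # first linear layer: 2 -> 4
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]    # second linear layer: 4 -> 2

reference = matmul(matmul(x, w1), w2)    # single-device result

partials = []
for w1_shard, w2_shard in zip(split_columns(w1, tp_size), split_rows(w2, tp_size)):
    hidden = matmul(x, w1_shard)                # column-split layer: no communication
    partials.append(matmul(hidden, w2_shard))   # row-split layer: partial sums
output = all_reduce_sum(partials)               # one collective for the pair of layers
```

Each simulated device only ever touches its own shards, and the summed partial outputs equal the single-device `reference`.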
Key performance metrics
When benchmarking LLM serving, track these metrics:
| Metric | Full name | What it measures |
|---|---|---|
| TTFT | Time to first token | Latency from request submission to first output token. Dominated by prefill time. |
| TPOT | Time per output token | Average latency between successive output tokens. Dominated by decode speed. |
| Throughput | Tokens/second | Total output tokens generated per second across all requests. |
| Cache hit rate | — | Fraction of KV blocks served from RadixAttention cache vs. recomputed. |
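The definitions in the table can be turned into a few lines of arithmetic over per-request timestamps. This is a sketch under assumptions: `RequestTrace` and its fields are a hypothetical record format, not the output of any SGLang tool, and the cache hit rate is counted per prompt token rather than per KV block.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    submitted: float            # time the request was sent (seconds)
    token_times: list           # arrival time of each output token (seconds)
    prompt_tokens: int
    cached_prompt_tokens: int   # prompt tokens served from the prefix cache


def ttft(trace):
    """Time to first token for one request."""
    return trace.token_times[0] - trace.submitted


def tpot(trace):
    """Mean gap between successive output tokens, excluding the first token."""
    n = len(trace.token_times)
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1) if n > 1 else 0.0


def throughput(traces):
    """Output tokens per second over the whole run, across all requests."""
    start = min(t.submitted for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return sum(len(t.token_times) for t in traces) / (end - start)


def cache_hit_rate(traces):
    """Fraction of prompt tokens that did not need to be recomputed."""
    return sum(t.cached_prompt_tokens for t in traces) / sum(t.prompt_tokens for t in traces)


traces = [
    RequestTrace(0.0, [0.5, 0.75, 1.0, 1.25, 1.5], prompt_tokens=100, cached_prompt_tokens=0),
    RequestTrace(0.5, [0.75, 1.0, 1.25, 1.5, 2.5], prompt_tokens=100, cached_prompt_tokens=50),
]
print(ttft(traces[0]), tpot(traces[0]), throughput(traces), cache_hit_rate(traces))
```

TTFT and TPOT are per-request latencies, usually reported as a mean or percentile, while throughput and cache hit rate are aggregates over the whole run.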
Benchmarking and configuration
SGLang ships with a benchmark script for measuring throughput and latency.
Key configuration parameters
| Parameter | Description |
|---|---|
| --tp | Tensor parallel degree |
| --mem-fraction-static | Fraction of GPU memory reserved for KV cache (default: 0.9) |
| --chunked-prefill-size | Chunk size C for chunked prefill (default: 512) |
| --max-prefill-tokens | Max total tokens in a prefill batch |
| --disable-radix-cache | Disable RadixAttention (useful for measuring its impact) |
| --attention-backend | Attention kernel backend: flashinfer (default) or triton |
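One way to use these flags for an A/B comparison is to build two otherwise identical launch commands, one with the radix cache disabled. The sketch below only assembles argument lists and executes nothing. The flags come from the table above; the `sglang.launch_server` module and the `--model-path` flag are not mentioned on this page and are assumed here, so check them against your installed version. The model name is a placeholder and `build_launch_command` is a hypothetical helper.

```python
import shlex
import sys


def build_launch_command(model_path, tp=1, mem_fraction_static=None,
                         chunked_prefill_size=None, disable_radix_cache=False):
    """Assemble an argv list for the SGLang server; nothing is executed here."""
    cmd = [sys.executable, "-m", "sglang.launch_server",  # assumed entry point
           "--model-path", model_path,                    # assumed flag name
           "--tp", str(tp)]
    if mem_fraction_static is not None:
        cmd += ["--mem-fraction-static", str(mem_fraction_static)]
    if chunked_prefill_size is not None:
        cmd += ["--chunked-prefill-size", str(chunked_prefill_size)]
    if disable_radix_cache:
        cmd.append("--disable-radix-cache")
    return cmd


# Same settings twice; only the radix cache differs between the two runs.
baseline = build_launch_command("your-org/your-model", tp=2, chunked_prefill_size=512)
no_cache = build_launch_command("your-org/your-model", tp=2, chunked_prefill_size=512,
                                disable_radix_cache=True)
print(shlex.join(baseline))
print(shlex.join(no_cache))
```

Running the same workload against both servers and comparing TTFT and throughput isolates the contribution of prefix reuse.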
Lecture 35 slides
Yineng Zhang’s full SGLang performance optimization slides
GPU Mode Discord
Ask questions and discuss SGLang, RadixAttention, and LLM serving