Serving LLMs efficiently requires more than fast kernels — you need smart scheduling, KV cache reuse, and careful management of the memory-compute tradeoff between prefill and decode. SGLang (Structured Generation Language) is an open-source LLM serving framework built around these ideas. Lecture 35, presented by Yineng Zhang, covers how SGLang achieves high throughput and low latency through RadixAttention, continuous batching, and chunked prefill. This page walks through the core techniques and their performance implications.

What SGLang is

SGLang is a serving framework and programming model for LLMs that focuses on two things:
  1. Structured generation — a frontend language for expressing programs that interleave LLM generation with control flow, enabling multi-turn prompts, tool calls, and constrained outputs.
  2. High-performance runtime — a backend that maximizes GPU utilization through efficient KV cache management, batching, and scheduling.
The runtime is the focus of Lecture 35. It is built on top of FlashInfer for attention kernels and implements its own scheduler, cache manager, and tensor parallel execution.

RadixAttention: KV cache reuse via prefix tree

The most distinctive feature of SGLang is RadixAttention — a KV cache management scheme that automatically reuses cached key-value pairs across requests that share a common prefix.

Why prefix reuse matters

Many LLM workloads share prefixes:
  • System prompts: every request in a chat application starts with the same system message.
  • Few-shot examples: classification tasks with the same examples in every prompt.
  • RAG: retrieved documents that appear in multiple requests.
  • Multi-turn conversations: each turn contains all previous turns as context.
Without prefix reuse, every request recomputes attention over the shared prefix from scratch. This wastes compute and reduces throughput.
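
As a rough illustration of the waste, the sketch below counts how many prompt tokens need prefill with and without reuse of a shared system prompt. The request count and token lengths are made-up numbers, and the function `prefill_tokens` is a hypothetical helper, not part of SGLang. It assumes the shared prefix stays cached after the first request.

```python
def prefill_tokens(num_requests, shared_prefix_len, unique_suffix_len, reuse_prefix):
    """Count prompt tokens that must be prefilled across all requests."""
    if not reuse_prefix:
        # every request recomputes the shared prefix and its own suffix
        return num_requests * (shared_prefix_len + unique_suffix_len)
    # the prefix is computed once; each request only prefills its own suffix
    return shared_prefix_len + num_requests * unique_suffix_len

baseline = prefill_tokens(100, 1000, 50, reuse_prefix=False)
with_reuse = prefill_tokens(100, 1000, 50, reuse_prefix=True)
saved = 1 - with_reuse / baseline
print(baseline, with_reuse, f"{saved:.1%}")  # 105000 6000 94.3%
```

In this toy setting the shared prefix accounts for almost all prefill work, which is why reuse pays off most when the prefix is long relative to the unique part of each prompt.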

How RadixAttention works

SGLang maintains a radix tree (a compressed prefix tree, closely related to a trie) of cached KV blocks, indexed by token sequence. When a new request arrives:
  1. Prefix lookup: SGLang walks the radix tree to find the longest matching prefix in the cache. The matched KV blocks are reused directly, with no recomputation.
  2. Incremental prefill: Only the unmatched suffix of the prompt is prefilled. The new KV blocks are appended to the matched prefix in the tree.
  3. Cache eviction: When memory is full, SGLang evicts the least-recently-used leaf nodes from the radix tree, freeing KV blocks for new requests.
RadixAttention is transparent to the user — you don’t need to explicitly mark shared prefixes. SGLang identifies them automatically from the token sequences of incoming requests.
The cache hit rate depends on workload structure. For applications with a long shared system prompt (e.g. 1000+ tokens), RadixAttention can eliminate the majority of prefill compute.
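
The lookup, incremental prefill, and eviction steps can be sketched with a toy prefix cache. This is a minimal illustration under stated assumptions, not SGLang's implementation: it stores one token per node (a plain trie, where a real radix tree would merge single-child chains), it counts tokens instead of holding KV tensors, and the names `ToyPrefixCache`, `match_prefix`, and `insert` are hypothetical.

```python
import itertools

class Node:
    def __init__(self, parent=None, token=None):
        self.parent = parent
        self.token = token
        self.children = {}      # token id -> Node
        self.last_used = 0      # logical clock value of the last access

class ToyPrefixCache:
    """Token-level trie; a real radix tree would merge single-child chains."""

    def __init__(self, capacity):
        self.root = Node()
        self.capacity = capacity        # max number of cached tokens (nodes)
        self.size = 0
        self.clock = itertools.count(1)

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched, now = self.root, 0, next(self.clock)
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            child.last_used = now       # touching a node refreshes its LRU stamp
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Cache entries for tokens; return how many had to be computed."""
        matched = self.match_prefix(tokens)
        node = self.root
        for tok in tokens[:matched]:
            node = node.children[tok]
        now = next(self.clock)
        for tok in tokens[matched:]:    # only the unmatched suffix is new work
            child = Node(parent=node, token=tok)
            child.last_used = now
            node.children[tok] = child
            node = child
            self.size += 1
        self._evict()
        return len(tokens) - matched

    def _evict(self):
        """Drop least-recently-used leaves until the cache fits its capacity."""
        while self.size > self.capacity:
            leaves = [n for n in self._nodes() if not n.children]
            victim = min(leaves, key=lambda n: n.last_used)
            del victim.parent.children[victim.token]
            self.size -= 1

    def _nodes(self):
        stack = list(self.root.children.values())
        while stack:
            n = stack.pop()
            yield n
            stack.extend(n.children.values())

cache = ToyPrefixCache(capacity=16)
system = [1, 2, 3, 4, 5, 6]                   # stand-in for a shared system prompt
print(cache.insert(system + [10, 11]))        # 8: nothing cached yet
print(cache.insert(system + [20, 21, 22]))    # 3: the 6-token prefix is reused
print(cache.match_prefix(system + [30]))      # 6
```

Evicting only leaves keeps every remaining path a valid prefix. A production cache would also need to protect nodes that running requests still reference, which this sketch omits.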

PagedAttention vs. RadixAttention

Both systems manage KV cache in fixed-size pages/blocks rather than contiguous buffers, but they differ in how they track and reuse blocks:
| Feature | PagedAttention (vLLM) | RadixAttention (SGLang) |
| --- | --- | --- |
| Storage unit | Fixed-size pages | Fixed-size blocks in a radix tree |
| Prefix reuse | Added later as a separate prefix caching feature | Automatic, built-in |
| Eviction policy | LRU per request | LRU on tree leaves |
| Multi-request sharing | Copy-on-write | Shared tree nodes |
RadixAttention is most impactful for workloads where many requests share long prefixes. For purely interactive chat with unique prompts, the benefit is smaller.

Continuous batching

Early LLM serving systems used static batching: collect a fixed batch of requests, run them to completion, then start the next batch. This wastes GPU cycles when requests in the batch finish at different times. Continuous batching (also called in-flight batching or iteration-level scheduling) eliminates this waste by:
  • Running the scheduler at every decode step, not just between batches.
  • Immediately inserting new requests into the running batch when existing requests finish.
  • Maintaining high GPU utilization by keeping the batch full at all times.
SGLang’s scheduler operates at the iteration level: after each forward pass, it checks for completed sequences, removes them from the batch, and fills the freed memory with new requests. This keeps the GPU busy and reduces queuing latency for incoming requests.
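
The difference between the two policies can be seen in a small simulation. This is a toy model, not SGLang's scheduler: it ignores prefill cost and memory limits, treats each request as a number of decode steps, and the function names are hypothetical.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """Refill free slots from the waiting queue before every decode step."""
    waiting = deque(output_lens)
    running = []          # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1        # one forward pass yields one token per running request
        running = [r - 1 for r in running if r > 1]   # drop finished requests
    return steps

lens = [2, 8, 2, 8, 2, 2, 2, 2]   # output lengths in tokens, in arrival order
print(static_batching_steps(lens, 2), continuous_batching_steps(lens, 2))  # 20 14
```

With static batching, short requests wait for the long request in their batch, so slots sit idle. With iteration-level scheduling the same work finishes in 14 steps, which here equals the lower bound of 28 tokens spread over 2 slots.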

Chunked prefill: balancing prefill and decode

Prefill (processing the input prompt) and decode (generating output tokens) have very different compute characteristics:
  • Prefill is compute-bound: processing N prompt tokens in parallel is a large matrix multiplication.
  • Decode is memory-bandwidth-bound: each step generates one token per sequence but still has to read all model weights, plus the KV cache, from GPU memory.
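
A back-of-the-envelope estimate makes the contrast concrete. The sketch below computes arithmetic intensity (FLOPs per byte moved) for a single linear layer, assuming fp16 values and that weights, inputs, and outputs each cross the memory bus once. The layer size is an arbitrary example and the helper name is hypothetical.

```python
def linear_arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W with x of shape (num_tokens, d_in)."""
    flops = 2 * num_tokens * d_in * d_out    # one multiply and one add per term
    bytes_moved = bytes_per_elem * (
        d_in * d_out                 # read the weight matrix
        + num_tokens * d_in          # read the activations
        + num_tokens * d_out         # write the outputs
    )
    return flops / bytes_moved

decode = linear_arithmetic_intensity(1, 4096, 4096)       # one token per step
prefill = linear_arithmetic_intensity(2048, 4096, 4096)   # 2048 prompt tokens
print(round(decode, 2), round(prefill, 2))  # 1.0 1024.0
```

At roughly one FLOP per byte, a decode step spends most of its time waiting on memory, while a long prefill performs enough work per byte to keep the compute units busy. Batching several decoding requests raises `num_tokens` and therefore the intensity, which is one reason batching helps decode throughput.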
When a long prompt is prefilled in the same batch as decoding requests, the prefill computation dominates the step time, increasing latency for all decoding requests in the batch (known as prefill stalls). Chunked prefill breaks long prompts into smaller chunks and interleaves them with decode steps:
  1. Chunk the prompt: Instead of prefilling all N tokens at once, split the prompt into chunks of C tokens (e.g. C=512).
  2. Interleave with decode: Each iteration processes one chunk of the long prompt alongside the normal decode tokens from other requests. The per-step work is bounded by C prefill tokens plus one token for each decoding request.
  3. Resume next chunk: The partially prefilled request's KV cache is kept between iterations. Each following iteration processes the next chunk until the prompt is fully processed.
This bounds the maximum step latency and prevents individual long prompts from stalling the entire batch.
Chunked prefill adds a small overhead: more scheduler iterations are needed per request, and the partial KV cache must be stored between chunks. Prompts that fit within a single chunk are prefilled in one step, so chunking does not change how they are processed.
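
The chunking steps can be sketched as a small planner that reports how many tokens each iteration would process. This is an illustrative model only: `plan_steps` is a hypothetical helper, and it assumes a fixed number of decoding requests that each contribute one token per iteration.

```python
def plan_steps(prompt_len, chunk_size, num_decode_requests):
    """Yield (prefill_tokens, decode_tokens) for each iteration of a long prefill."""
    done = 0
    while done < prompt_len:
        chunk = min(chunk_size, prompt_len - done)   # last chunk may be shorter
        # each running decode request adds one token to the same forward pass
        yield chunk, num_decode_requests
        done += chunk

steps = list(plan_steps(prompt_len=1300, chunk_size=512, num_decode_requests=8))
print(steps)                                  # [(512, 8), (512, 8), (276, 8)]
print(max(p + d for p, d in steps))           # 520 tokens in the largest step
```

Without chunking, the same prompt would produce a single step of 1308 tokens, and the 8 decoding requests would wait for all of it before receiving their next token.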

Tensor parallelism in SGLang

For models too large to fit on a single GPU, SGLang uses tensor parallelism (TP) to split individual layers across multiple devices. In TP:
  • Linear layers are split column-wise or row-wise across tp_size GPUs.
  • Each GPU holds a shard of the weight matrix and computes its partial output.
  • AllReduce collectives synchronize the partial outputs after each layer.
SGLang uses NCCL for the AllReduce operations. Typical TP degrees are 2, 4, and 8 (powers of two that match common NVLink topologies); the degree also has to divide the model's attention head count evenly.
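from sglang import Engine

# The engine starts worker processes, so create it under a main guard.
if __name__ == "__main__":
    engine = Engine(
        model_path="meta-llama/Meta-Llama-3-70B-Instruct",
        tp_size=4,  # shard each layer across 4 GPUs
    )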
Tensor parallelism reduces per-GPU memory and increases compute throughput, but adds AllReduce latency per layer. For latency-sensitive workloads, use the smallest tp_size that fits the model in memory.
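
To show why sharding a linear layer gives the same result as the unsharded layer, here is a pure-Python sketch of column-wise and row-wise splits. It runs the "ranks" sequentially in one process and uses an elementwise sum as a stand-in for the AllReduce collective. The function names are hypothetical and the matrices are tiny integer examples.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_parallel(x, w, tp_size):
    """Each rank owns a slice of W's columns and produces a slice of the output."""
    cols = len(w[0]) // tp_size
    shards = [[row[r * cols:(r + 1) * cols] for row in w] for r in range(tp_size)]
    partial = [matmul(x, shard) for shard in shards]      # one result per rank
    # concatenate the per-rank output slices along the feature dimension
    return [sum((p[i] for p in partial), []) for i in range(len(x))]

def row_parallel(x, w, tp_size):
    """Each rank owns a slice of W's rows; partial outputs are summed."""
    rows = len(w) // tp_size
    partial = []
    for r in range(tp_size):
        x_shard = [row[r * rows:(r + 1) * rows] for row in x]
        w_shard = w[r * rows:(r + 1) * rows]
        partial.append(matmul(x_shard, w_shard))
    # the elementwise sum across ranks stands in for the AllReduce collective
    return [[sum(vals) for vals in zip(*rank_rows)] for rank_rows in zip(*partial)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]
full = matmul(x, w)
print(column_parallel(x, w, 2) == full, row_parallel(x, w, 2) == full)  # True True
```

The column-wise split needs no reduction, only a concatenation, while the row-wise split needs the sum. That sum is the per-layer communication cost that grows with `tp_size`.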

Key performance metrics

When benchmarking LLM serving, track these metrics:
| Metric | Full name | What it measures |
| --- | --- | --- |
| TTFT | Time to first token | Latency from request submission to first output token. Dominated by prefill time. |
| TPOT | Time per output token | Average latency between successive output tokens. Dominated by decode speed. |
| Throughput | Tokens/second | Total output tokens generated per second across all requests. |
| Cache hit rate | n/a | Fraction of KV blocks served from RadixAttention cache vs. recomputed. |
For interactive applications (chatbots, coding assistants), TTFT and TPOT are the primary user-visible metrics. For batch processing (summarization, evaluation), throughput matters more.
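
The latency and throughput metrics can be computed from per-token timestamps. The sketch below assumes each request records its submission time and the arrival time of every output token, in seconds. The record layout and the helper `serving_metrics` are hypothetical, not an SGLang API.

```python
def serving_metrics(requests):
    """requests: list of dicts with 'submit' and 'token_times' (seconds)."""
    ttfts, tpots, total_tokens = [], [], 0
    for r in requests:
        times = r["token_times"]
        ttfts.append(times[0] - r["submit"])          # time to first token
        if len(times) > 1:
            # average gap between successive output tokens
            tpots.append((times[-1] - times[0]) / (len(times) - 1))
        total_tokens += len(times)
    start = min(r["submit"] for r in requests)
    end = max(r["token_times"][-1] for r in requests)
    return {
        "mean_ttft": sum(ttfts) / len(ttfts),
        "mean_tpot": sum(tpots) / len(tpots) if tpots else 0.0,
        "throughput": total_tokens / (end - start),   # output tokens per second
    }

sample = [
    {"submit": 0.0, "token_times": [0.5, 0.75, 1.0, 1.25, 1.5]},
    {"submit": 1.0, "token_times": [1.25, 1.5, 2.0]},
]
metrics = serving_metrics(sample)
print(metrics)  # {'mean_ttft': 0.375, 'mean_tpot': 0.3125, 'throughput': 4.0}
```

Benchmark tools usually report percentiles (for example P50 and P99) in addition to means, since tail latency is what users notice in interactive workloads.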

Benchmarking and configuration

SGLang ships with a benchmark script for measuring throughput and latency:
# Start the SGLang server
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --tp 1 \
    --port 30000

# Benchmark with the built-in tool
python -m sglang.bench_serving \
    --backend sglang \
    --dataset-name sharegpt \
    --num-prompts 1000 \
    --request-rate 10

# Programmatic usage with the Python API
import sglang as sgl

# Point the frontend at the server started above
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def chat(s, question):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

state = chat.run(question="Explain RadixAttention in one paragraph.")
print(state["answer"])

Key configuration parameters

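| Parameter | Description |
| --- | --- |
| --tp | Tensor parallel degree |
| --mem-fraction-static | Fraction of GPU memory used for static allocation (model weights and the KV cache pool). Values around 0.9 are typical; the default varies by version and configuration |
| --chunked-prefill-size | Chunk size C for chunked prefill, in tokens (e.g. 512; the default varies by version) |
| --max-prefill-tokens | Max total tokens in a prefill batch |
| --disable-radix-cache | Disable RadixAttention (useful for measuring its impact) |
| --attention-backend | Attention kernel backend, e.g. flashinfer or triton |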

Lecture 35 slides

Yineng Zhang’s full SGLang performance optimization slides

GPU Mode Discord

Ask questions and discuss SGLang, RadixAttention, and LLM serving
