ModelRunner runs on every GPU rank. The rank-0 instance is owned directly by LLMEngine; ranks 1..N-1 run in separate worker processes and communicate via shared memory + multiprocessing.Event. It handles weight loading, KV cache allocation, prefill/decode input preparation, and optional CUDA graph capture for fast decode.
Constructor
During __init__, the following steps happen in order:
- dist.init_process_group("nccl", ...): collective barrier; all ranks must call this.
- Model construction and weight loading on the assigned GPU.
- warmup_model(): dry-run forward pass to measure peak memory.
- allocate_kv_cache(): allocates the KV cache pool.
- capture_cudagraph(): captures decode graphs (skipped if enforce_eager=True).
- Shared-memory setup for IPC (skipped when world_size == 1).
The same config dict passed to LLMEngine. Must include all model architecture keys (vocab_size, hidden_size, num_layers, etc.) as well as the following runtime keys consumed directly by ModelRunner:
| Key | Used by | Description |
|---|---|---|
| block_size | all | Tokens per KV cache block |
| world_size | all | Number of GPU ranks |
| enforce_eager | __init__ | Skip CUDA graph capture when True |
| max_num_batch_tokens | warmup_model | Tokens in warmup dry-run batch |
| max_model_length | warmup_model, capture_cudagraph | Max total sequence length |
| max_num_seqs | capture_cudagraph | Upper bound for CUDA graph batch sizes (only required when enforce_eager=False) |
| gpu_memory_utilization | allocate_kv_cache | Fraction of free GPU memory for KV pool |
| num_layers | allocate_kv_cache | Number of transformer layers |
| num_kv_heads | allocate_kv_cache | Total KV head count (divided by world_size per rank) |
| head_dim | allocate_kv_cache | Head dimension (or inferred from hidden_size / num_heads) |
| vocab_size | capture_cudagraph | Vocabulary size for output buffer pre-allocation |
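As an illustration, a config for a small two-GPU setup might look like the sketch below. All values are made up for the example, and missing_runtime_keys is a hypothetical helper (not part of ModelRunner) that only encodes the two conditional rules from the table: max_num_seqs is required only when enforce_eager=False, and head_dim may be inferred from hidden_size / num_heads.

```python
# Illustrative values only; a real config must match the model checkpoint.
config = {
    # model architecture keys
    "vocab_size": 32000,
    "hidden_size": 2048,
    "num_layers": 16,
    "num_heads": 16,
    "num_kv_heads": 8,
    # runtime keys read directly by ModelRunner
    "block_size": 256,
    "world_size": 2,
    "enforce_eager": False,
    "max_num_batch_tokens": 4096,
    "max_model_length": 1024,
    "max_num_seqs": 64,
    "gpu_memory_utilization": 0.9,
}

ALWAYS_REQUIRED = [
    "block_size", "world_size", "enforce_eager", "max_num_batch_tokens",
    "max_model_length", "gpu_memory_utilization", "num_layers",
    "num_kv_heads", "vocab_size",
]


def missing_runtime_keys(cfg):
    """Hypothetical helper: list the runtime keys from the table that cfg lacks."""
    required = list(ALWAYS_REQUIRED)
    # max_num_seqs is only needed when CUDA graphs will be captured.
    if not cfg.get("enforce_eager", False):
        required.append("max_num_seqs")
    missing = [key for key in required if key not in cfg]
    # head_dim is optional when it can be inferred from hidden_size / num_heads.
    can_infer = all(key in cfg for key in ("hidden_size", "num_heads"))
    if "head_dim" not in cfg and not can_infer:
        missing.append("head_dim")
    return missing


print(missing_runtime_keys(config))
```

Running a check like this before constructing the engine surfaces a missing key early, rather than partway through warmup or graph capture.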
CUDA device index for this worker. Rank 0 is the primary process; ranks 1..world_size-1 are spawned workers.
For rank 0: a list of multiprocessing.Event objects, one per worker rank, used to signal that new IPC data has been written to shared memory. For rank != 0: a single Event used to wait for commands from rank 0.
Methods
warmup_model
Runs a dry-run forward pass on a dummy batch (max_num_batch_tokens // max_model_length sequences of length max_model_length) to force all CUDA kernels to JIT-compile and to record the peak GPU memory footprint. The result is used by allocate_kv_cache() to determine how much memory is available for the KV cache pool.
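The size of the dummy batch follows directly from the two config values. A minimal sketch of that arithmetic (warmup_batch_shape is a hypothetical name used only for this example):

```python
def warmup_batch_shape(max_num_batch_tokens, max_model_length):
    """Return (num_seqs, seq_len) of the warmup batch described above."""
    num_seqs = max_num_batch_tokens // max_model_length
    return num_seqs, max_model_length


num_seqs, seq_len = warmup_batch_shape(4096, 1024)
print(num_seqs, seq_len)    # 4 1024
print(num_seqs * seq_len)   # 4096
```

Because of the floor division, a max_num_batch_tokens that is not a multiple of max_model_length yields a warmup batch slightly smaller than the token budget.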
allocate_kv_cache
Allocates the KV cache pool as a single tensor of shape
(2, num_layers, max_cached_blocks, block_size, num_kv_heads_per_rank, head_dim)
and assigns slices of it to each attention layer’s k_cache / v_cache attributes.
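The number of blocks is derived from gpu_memory_utilization, the peak memory footprint recorded by warmup_model(), and the size of one block.
When world_size > 1, an all_reduce(MIN) synchronises the block count across ranks so the scheduler never allocates more blocks than the most memory-constrained rank can hold.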
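The per-block size follows from the tensor shape above; how the byte budget itself is obtained is not reproduced here. The sketch below therefore takes the budget as an input and assumes a 2-byte dtype (for example fp16). All function names are hypothetical, and min() stands in for the all_reduce(MIN) across ranks.

```python
def kv_block_bytes(num_layers, block_size, num_kv_heads, world_size,
                   head_dim, dtype_bytes=2):
    """Bytes one block occupies on one rank, across K and V for every layer."""
    num_kv_heads_per_rank = num_kv_heads // world_size
    # Leading factor 2 is the K/V dimension of the cache tensor.
    return (2 * num_layers * block_size * num_kv_heads_per_rank
            * head_dim * dtype_bytes)


def blocks_per_rank(budget_bytes, block_bytes):
    """Whole blocks that fit in this rank's KV budget (budget is an assumed input)."""
    return budget_bytes // block_bytes


def synced_block_count(per_rank_counts):
    """Stand-in for all_reduce(MIN): every rank adopts the smallest count."""
    return min(per_rank_counts)


block_bytes = kv_block_bytes(num_layers=2, block_size=16, num_kv_heads=8,
                             world_size=2, head_dim=64)
counts = [blocks_per_rank(b, block_bytes) for b in (1 << 30, 1 << 29)]
print(block_bytes, counts, synced_block_count(counts))
```

Taking the minimum means a rank with more free memory leaves some of it unused, which is the price of keeping one block-ID space valid on every rank.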
max_cached_blocks in config is overwritten with the computed value after this method returns.
prepare_prefill
Builds the following tensors for a prefill step and registers them via set_context(is_prefill=True, ...):
| Tensor | Description |
|---|---|
| input_ids | Concatenated prompt tokens (excluding prefix-cached tokens) |
| cu_seqlens_q | Cumulative query sequence lengths, shape (num_seqs + 1,) |
| cu_seqlens_k | Cumulative key sequence lengths (includes cached prefix) |
| slot_mapping | KV cache slot indices for each new token to be written |
| block_tables | Per-sequence block ID table for reading cached KV values |
Prefix-cached tokens are omitted from input_ids and slot_mapping: only uncached tokens require a new forward pass.
Sequences scheduled for prefill. Each sequence must have an allocated block_table.
Return value
1-D long tensor of concatenated (non-cached) token IDs, on the current CUDA device.
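To make the relationship between these tensors concrete, here is a minimal sketch in plain Python lists (no CUDA tensors). SeqSketch and its field names are hypothetical, and the slot formula block_id * block_size + offset is the usual paged-KV convention, assumed here rather than taken from the implementation.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class SeqSketch:
    """Hypothetical stand-in for a scheduled sequence."""
    token_ids: list
    num_cached_tokens: int
    block_table: list


def prepare_prefill_sketch(seqs, block_size):
    input_ids, slot_mapping = [], []
    q_lens, k_lens = [], []
    for seq in seqs:
        total = len(seq.token_ids)
        start = seq.num_cached_tokens           # skip prefix-cached tokens
        input_ids.extend(seq.token_ids[start:])
        q_lens.append(total - start)            # queries: new tokens only
        k_lens.append(total)                    # keys: cached prefix + new tokens
        for pos in range(start, total):
            block_id = seq.block_table[pos // block_size]
            slot_mapping.append(block_id * block_size + pos % block_size)
    cu_seqlens_q = [0, *accumulate(q_lens)]
    cu_seqlens_k = [0, *accumulate(k_lens)]
    return input_ids, cu_seqlens_q, cu_seqlens_k, slot_mapping


seqs = [
    SeqSketch([10, 11, 12, 13, 14], num_cached_tokens=0, block_table=[3, 7]),
    SeqSketch([20, 21, 22, 23, 24, 25], num_cached_tokens=4, block_table=[1, 5]),
]
result = prepare_prefill_sketch(seqs, block_size=4)
print(result)
```

In the second sequence the first four tokens are cached, so it contributes two query tokens but six key positions, which is why cu_seqlens_q and cu_seqlens_k diverge.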
prepare_decode
Builds the following tensors for a decode step and registers them via set_context(is_prefill=False, ...):
| Tensor | Description |
|---|---|
| input_ids | Last token of each sequence, shape (batch_size,) |
| slot_mapping | KV cache slot for the new token in each sequence |
| context_lens | Total token count for each sequence (for attention masking) |
| block_tables | Per-sequence block ID table |
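A decode step contributes exactly one token per sequence, so the preparation reduces to a per-sequence lookup. The sketch below uses plain lists and (token_ids, block_table) tuples as hypothetical stand-ins for sequence objects; it assumes the slot of the new token is computed from its position with the usual block_id * block_size + offset convention.

```python
def prepare_decode_sketch(seqs, block_size):
    """seqs: list of (token_ids, block_table) pairs, a hypothetical layout."""
    input_ids, slot_mapping, context_lens = [], [], []
    for token_ids, block_table in seqs:
        last_pos = len(token_ids) - 1
        input_ids.append(token_ids[-1])          # only the newest token is fed in
        context_lens.append(len(token_ids))      # attention covers the full history
        block_id = block_table[last_pos // block_size]
        slot_mapping.append(block_id * block_size + last_pos % block_size)
    return input_ids, slot_mapping, context_lens


decode_batch = [
    ([10, 11, 12, 13, 14], [3, 7]),
    ([20, 21, 22, 23, 24, 25], [1, 5]),
]
decode_result = prepare_decode_sketch(decode_batch, block_size=4)
print(decode_result)
```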
Sequences scheduled for decode.
Return value
1-D long tensor of last-token IDs, shape (batch_size,), on the current CUDA device.
run_model
- Prefill / eager mode: calls self.model(input_ids) directly, then model.compute_logits(hidden_states).
- Decode (CUDA graph): finds the smallest captured graph whose batch size is ≥ len(seqs), copies the current tensors into the pre-allocated graph variables, replays the graph, and computes logits from graph_vars["outputs"].
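The graph lookup in the decode path can be sketched as a search over the sorted captured sizes. select_graph_size is a hypothetical helper, and raising on an oversized batch is an assumption about error handling, not documented behaviour.

```python
from bisect import bisect_left


def select_graph_size(captured_sizes, batch_size):
    """Return the smallest captured batch size that is >= batch_size."""
    sizes = sorted(captured_sizes)
    idx = bisect_left(sizes, batch_size)
    if idx == len(sizes):
        # Assumption: a batch larger than every captured graph is an error.
        raise ValueError(f"no captured graph fits batch size {batch_size}")
    return sizes[idx]


captured = [1, 2, 4, 8, 16, 32]
for batch_size in (5, 16, 17):
    chosen = select_graph_size(captured, batch_size)
    print(batch_size, chosen, chosen - batch_size)  # last column: padded rows
```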
Token IDs prepared by prepare_prefill or prepare_decode.
True to use the eager path; False to use CUDA graph replay.
Return value
Logit tensor. Shape (num_tokens, vocab_size) for prefill; (batch_size, vocab_size) for decode.
run
Runs one full step: calls prepare_prefill or prepare_decode, then run_model, then the sampler. Only rank 0 samples tokens; all other ranks return None.
Sequences to process.
Whether to run a prefill or decode step.
Return value
1-D tensor of sampled token IDs on rank 0 (one per sequence). None on worker ranks.
capture_cudagraph
Captures one CUDA graph per decode batch size: [1, 2, 4, 8] plus every multiple of 16 up to max_num_seqs.
At decode time run_model selects the smallest captured graph whose size is
≥ batch_size, so the overhead of padding is minimised.
Captured graphs are stored in self.graphs (dict mapping batch size → CUDAGraph) and input/output tensors are stored in self.graph_vars.
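The batch-size schedule described above can be written as a one-line expression. cudagraph_batch_sizes is a hypothetical name; the sketch simply follows the stated rule and does not model what the implementation does when max_num_seqs is below 8.

```python
def cudagraph_batch_sizes(max_num_seqs):
    """[1, 2, 4, 8] plus every multiple of 16 up to max_num_seqs."""
    return [1, 2, 4, 8] + list(range(16, max_num_seqs + 1, 16))


for limit in (8, 16, 32, 64):
    print(limit, cudagraph_batch_sizes(limit))
```

Small batch sizes get dense coverage because padding a batch of 3 up to 16 would waste most of the replayed work, while steps of 16 keep the number of captured graphs manageable at larger sizes.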
CUDA graph capture is skipped when config["enforce_eager"] = True.
call
Entry point used by LLMEngine to invoke model operations.
- On rank 0: serialises (method_name, *args) into shared memory via write_shm(), then executes the method locally.
- On all ranks: looks up method_name on self and calls it with args.
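The dispatch pattern above can be sketched with getattr. RunnerSketch is a hypothetical stand-in for ModelRunner: a plain list replaces the shared-memory buffer, pickle is an assumed wire format, and the world_size > 1 guard is inferred from the constructor skipping shared-memory setup on a single GPU.

```python
import pickle


class RunnerSketch:
    """Hypothetical stand-in; only the dispatch logic of call() is modelled."""

    def __init__(self, rank, world_size, mailbox):
        self.rank = rank
        self.world_size = world_size
        self.mailbox = mailbox  # list standing in for the shared-memory buffer

    def write_shm(self, method_name, *args):
        # Assumption: commands are pickled; the real wire format may differ.
        self.mailbox.append(pickle.dumps((method_name, *args)))

    def call(self, method_name, *args):
        if self.world_size > 1 and self.rank == 0:
            self.write_shm(method_name, *args)    # publish to workers first
        return getattr(self, method_name)(*args)  # then run locally

    def echo(self, value):
        # Hypothetical method used only to demonstrate dispatch.
        return (self.rank, value)


mailbox = []
runner = RunnerSketch(rank=0, world_size=2, mailbox=mailbox)
print(runner.call("echo", 7))
print(pickle.loads(mailbox[0]))
```

Looking methods up by name keeps the IPC payload down to a string plus arguments, so every rank can execute the same call without sharing function objects across processes.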
Name of the ModelRunner method to invoke (e.g. "run", "exit").
Positional arguments forwarded to the method.
loop
Main loop for worker processes (rank != 0). Continuously reads IPC commands from shared memory via read_shm(), dispatches them via call(), and exits cleanly when "exit" is received.
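A minimal sketch of such a worker loop, with the reader and dispatcher passed in as plain callables so it runs without shared memory. It assumes the "exit" command is dispatched before the loop stops; the actual ordering is not stated here.

```python
from collections import deque


def worker_loop(read_shm, call):
    """Read (method_name, *args) commands until "exit" has been handled."""
    while True:
        method_name, *args = read_shm()
        call(method_name, *args)
        if method_name == "exit":
            break


# A deque stands in for the shared-memory channel; the payloads are illustrative.
commands = deque([("run", [1, 2], True), ("run", [1, 2], False),
                  ("exit",), ("run", [], False)])
handled = []
worker_loop(commands.popleft, lambda name, *args: handled.append(name))
print(handled)
```

The command queued after "exit" is never read, which is the behaviour a clean shutdown needs.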
exit
CUDA graph batch sizes
The table below shows which graphs are captured for a given max_num_seqs.
| max_num_seqs | Captured batch sizes |
|---|---|
| 8 | 1, 2, 4, 8 |
| 16 | 1, 2, 4, 8, 16 |
| 32 | 1, 2, 4, 8, 16, 32 |
| 64 | 1, 2, 4, 8, 16, 32, 48, 64 |
At decode time the smallest captured graph with bs_ >= actual_batch_size is replayed, so at most bs_ - actual_batch_size dummy tokens are computed.