The `LLM` class is the primary public interface for nano-vLLM. It is imported directly from the top-level package and inherits all of its logic from `LLMEngine`.
Constructor
Constructing an `LLM` instance initializes the engine config, loads the model, and starts any worker processes required for tensor parallelism.
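A sketch of a plausible argument set for the constructor (all values are illustrative, and the keyword names follow nano-vLLM's engine config; treat them as assumptions if your version differs):

```python
# Illustrative constructor arguments; the path is a placeholder for a
# locally downloaded model directory containing config.json.
llm_kwargs = dict(
    model="/models/Qwen3-0.6B",
    max_model_len=4096,            # capped to the model's max_position_embeddings
    max_num_batched_tokens=16384,  # must be >= max_model_len
    max_num_seqs=256,
    gpu_memory_utilization=0.9,    # fraction of GPU memory reserved for KV cache
    tensor_parallel_size=1,        # 1-8; values > 1 spawn worker processes
    enforce_eager=False,           # True disables CUDA graph capture
    kvcache_block_size=256,        # must be a multiple of 256
)

# The documented constraints can be checked up front:
assert llm_kwargs["max_num_batched_tokens"] >= llm_kwargs["max_model_len"]
assert 0.0 < llm_kwargs["gpu_memory_utilization"] <= 1.0
assert 1 <= llm_kwargs["tensor_parallel_size"] <= 8
assert llm_kwargs["kvcache_block_size"] % 256 == 0
```

`LLM(**llm_kwargs)` would then construct the engine and load the model.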
Parameters
- `model`: Path to the local model directory. The directory must exist and contain a valid Hugging Face-style model config (`config.json`). Passing a hub model name like `"Qwen/Qwen3-0.6B"` will raise an `AssertionError`.
- `max_num_batched_tokens`: Maximum total number of tokens (across all sequences) processed in a single forward pass. Must be >= `max_model_len`.
- `max_num_seqs`: Maximum number of sequences that can be in flight concurrently.
- `max_model_len`: Maximum sequence length (prompt + completion). Automatically capped to the model's `max_position_embeddings`.
- `gpu_memory_utilization`: Fraction of total GPU memory to reserve for the KV cache. Values between `0.0` and `1.0`.
- `tensor_parallel_size`: Number of GPUs to use for tensor parallelism. Must be between 1 and 8 inclusive. Values greater than 1 spawn additional worker processes via `torch.multiprocessing`.
- `enforce_eager`: When `True`, disables CUDA graph capture and always runs in eager mode. Useful for debugging or when CUDA graphs cause issues.
- `kvcache_block_size`: Number of tokens per KV cache block. Must be a multiple of 256.

Methods
generate
- `prompts`: A list of prompts to process. Each prompt can be either a plain string (which will be tokenized internally) or a pre-tokenized list of integer token IDs.
- `sampling_params`: Sampling configuration. Pass a single `SamplingParams` to apply to all prompts, or a list with one entry per prompt.
- `use_tqdm`: When `True`, displays a tqdm progress bar showing the number of completed requests along with live prefill and decode throughput (tokens/sec).

Returns `list[dict]`, one entry per prompt, in the original prompt order:
- `text`: The decoded completion string.
- `token_ids`: The raw completion token IDs (not including the prompt tokens).
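The return shape can be illustrated with a hand-written stand-in result (no model involved; a real call would be `llm.generate(prompts, sampling_params)`, and the `text`/`token_ids` field names and values here are assumptions for illustration):

```python
prompts = ["The capital of France is", "The sky is"]

# Hypothetical return value in the documented shape:
# one dict per prompt, in the same order as `prompts`.
outputs = [
    {"text": " Paris.", "token_ids": [12095, 13]},
    {"text": " blue.", "token_ids": [10331, 13]},
]

for prompt, out in zip(prompts, outputs):
    completion = out["text"]       # decoded completion string
    ids = out["token_ids"]         # completion token IDs only, prompt excluded
    assert isinstance(completion, str)
    assert all(isinstance(t, int) for t in ids)
```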
add_request
Adds a single request to the engine's queue without running generation. Use it together with `step()` and `is_finished()` for manual, fine-grained control over the generation loop.
- `prompt`: The prompt as a string or a list of pre-tokenized integer IDs.
- `sampling_params`: Sampling configuration for this specific request.
step
Runs one engine iteration (a prefill or decode pass) and returns a tuple `(outputs, num_tokens)`:
- `outputs`: A list of `(seq_id, completion_token_ids)` pairs for every sequence that finished during this step. Empty if no sequences finished.
- `num_tokens`: Positive value: the number of prefill tokens processed in this step. Negative value: the negative of the decode batch size (i.e., `-len(seqs)`).

is_finished
Returns `True` when both the waiting queue and the running queue are empty, i.e. all submitted requests have completed.
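The manual control loop built from `add_request`, `step`, and `is_finished` can be sketched as follows. To keep the snippet self-contained it drives a tiny hypothetical stand-in object with the same method surface; a real `LLM` instance would take its place, with `SamplingParams` instead of `None`.

```python
class StubEngine:
    """Hypothetical stand-in mimicking the documented step()/is_finished() API."""

    def __init__(self):
        # Scripted steps: one prefill pass, then two decode passes.
        self._script = [
            ([], 7),                          # prefill: 7 prompt tokens processed
            ([], -2),                         # decode: batch of 2 sequences
            ([(0, [11, 12]), (1, [21])], -2), # both sequences finish this step
        ]

    def add_request(self, prompt, sampling_params):
        pass  # a real engine would enqueue the request here

    def is_finished(self):
        return not self._script

    def step(self):
        return self._script.pop(0)


llm = StubEngine()
llm.add_request("Hello", None)
llm.add_request("World", None)

completions = {}
while not llm.is_finished():
    outputs, num_tokens = llm.step()
    # num_tokens > 0: prefill tokens this step; num_tokens < 0: -decode batch size
    for seq_id, token_ids in outputs:
        completions[seq_id] = token_ids
```

After the loop, `completions` maps each sequence ID to its completion token IDs.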