The LLM class is the primary public interface for nano-vLLM. It is imported directly from the top-level package and inherits all logic from LLMEngine.
from nanovllm import LLM

Constructor

LLM(model: str, **kwargs)
Creates an LLM instance, initializes the engine config, loads the model, and starts any worker processes required for tensor parallelism.
model must be a path to a local directory containing model weights and config files. Hugging Face model hub names are not supported.

Parameters

model
str
required
Path to the local model directory. The directory must exist and contain a valid Hugging Face-style model config (config.json). Passing a model name like "Qwen/Qwen3-0.6B" will raise an AssertionError.
max_num_batched_tokens
int
default:"16384"
Maximum total number of tokens (across all sequences) processed in a single forward pass. Must be >= max_model_len.
max_num_seqs
int
default:"512"
Maximum number of sequences that can be in flight concurrently.
max_model_len
int
default:"4096"
Maximum sequence length (prompt + completion). Automatically capped to the model’s max_position_embeddings.
gpu_memory_utilization
float
default:"0.9"
Fraction of total GPU memory to reserve for the KV cache. Must be between 0.0 and 1.0.
tensor_parallel_size
int
default:"1"
Number of GPUs to use for tensor parallelism. Must be between 1 and 8 inclusive. Values greater than 1 spawn additional worker processes via torch.multiprocessing.
enforce_eager
bool
default:"false"
When True, disables CUDA graph capture and always runs in eager mode. Useful for debugging or when CUDA graphs cause issues.
kvcache_block_size
int
default:"256"
Number of tokens per KV cache block. Must be a multiple of 256.

Methods

generate

llm.generate(
    prompts: list[str] | list[list[int]],
    sampling_params: SamplingParams | list[SamplingParams],
    use_tqdm: bool = True,
) -> list[dict]
Runs inference on a list of prompts and returns results in the same order as the input.
prompts
list[str] | list[list[int]]
required
A list of prompts to process. Each prompt can be either a plain string (which will be tokenized internally) or a pre-tokenized list of integer token IDs.
sampling_params
SamplingParams | list[SamplingParams]
required
Sampling configuration. Pass a single SamplingParams to apply to all prompts, or a list with one entry per prompt.
use_tqdm
bool
default:"true"
When True, displays a tqdm progress bar showing the number of completed requests along with live prefill and decode throughput (tokens/sec).
Returns list[dict] — one entry per prompt, in the original prompt order:
text
str
The decoded completion string.
token_ids
list[int]
The raw completion token IDs (not including the prompt tokens).

add_request

llm.add_request(
    prompt: str | list[int],
    sampling_params: SamplingParams,
)
Adds a single request to the scheduler’s waiting queue. Strings are tokenized automatically. Use this alongside step() and is_finished() for manual, fine-grained control over the generation loop.
prompt
str | list[int]
required
The prompt as a string or a list of pre-tokenized integer IDs.
sampling_params
SamplingParams
required
Sampling configuration for this specific request.

step

llm.step() -> tuple[list[tuple[int, list[int]]], int]
Executes one scheduling and inference step. Schedules a batch (either a prefill batch or a decode batch), runs the model, and post-processes results. Returns a tuple (outputs, num_tokens):
outputs
list[tuple[int, list[int]]]
A list of (seq_id, completion_token_ids) pairs for every sequence that finished during this step. Empty if no sequences finished.
num_tokens
int
Positive value: the number of prefill tokens processed in this step. Negative value: the negative of the decode batch size (i.e., -len(seqs)).
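
The sign convention above can be unpacked into a (phase, count) pair for logging or throughput accounting. `classify_step` is a hypothetical helper for illustration, not part of the nano-vLLM API:

```python
def classify_step(num_tokens: int) -> tuple[str, int]:
    """Interpret step()'s num_tokens value: a positive value is a
    prefill batch of that many tokens; a negative value is a decode
    batch of -num_tokens sequences (one new token per sequence)."""
    if num_tokens > 0:
        return ("prefill", num_tokens)
    return ("decode", -num_tokens)
```

For example, a return value of 4096 means 4096 prompt tokens were prefetched in one prefill pass, while -32 means 32 sequences each decoded one token.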

is_finished

llm.is_finished() -> bool
Returns True when both the waiting queue and the running queue are empty — i.e., all submitted requests have completed.

Usage Example

from nanovllm import LLM, SamplingParams

# Initialize the engine from a local model directory
llm = LLM("/models/Qwen3-0.6B", max_model_len=2048, gpu_memory_utilization=0.85)

sampling_params = SamplingParams(temperature=0.6, max_tokens=128)

prompts = [
    "The capital of France is",
    "Explain the difference between a list and a tuple in Python.",
]

outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt!r}")
    print(f"Output: {output['text']!r}")
    print()

Manual generation loop

from nanovllm import LLM, SamplingParams

llm = LLM("/models/Qwen3-0.6B")
sp = SamplingParams(temperature=0.8, max_tokens=64)

llm.add_request("Hello, world!", sp)
llm.add_request("What is 2 + 2?", sp)

results = {}
while not llm.is_finished():
    outputs, num_tokens = llm.step()
    for seq_id, token_ids in outputs:
        results[seq_id] = token_ids
