The LLM class is the primary public interface for nano-vLLM. It is imported directly from the top-level package and inherits all logic from LLMEngine.
from nanovllm import LLM

Constructor

LLM(model: str, **kwargs)
Creates an LLM instance, initializes the engine config, loads the model, and starts any worker processes required for tensor parallelism.
model must be a path to a local directory containing model weights and config files. Hugging Face model hub names are not supported.

Parameters

model
str
required
Path to the local model directory. The directory must exist and contain a valid Hugging Face-style model config (config.json). Passing a model name like "Qwen/Qwen3-0.6B" will raise an AssertionError.
max_num_batched_tokens
int
default:"16384"
Maximum total number of tokens (across all sequences) processed in a single forward pass. Must be >= max_model_len.
max_num_seqs
int
default:"512"
Maximum number of sequences that can be in flight concurrently.
max_model_len
int
default:"4096"
Maximum sequence length (prompt + completion). Automatically capped to the model’s max_position_embeddings.
gpu_memory_utilization
float
default:"0.9"
Fraction of total GPU memory to reserve for the KV cache. Must be between 0.0 and 1.0.
tensor_parallel_size
int
default:"1"
Number of GPUs to use for tensor parallelism. Must be between 1 and 8 inclusive. Values greater than 1 spawn additional worker processes via torch.multiprocessing.
enforce_eager
bool
default:"false"
When True, disables CUDA graph capture and always runs in eager mode. Useful for debugging or when CUDA graphs cause issues.
kvcache_block_size
int
default:"256"
Number of tokens per KV cache block. Must be a multiple of 256.

Methods

generate

llm.generate(
    prompts: list[str] | list[list[int]],
    sampling_params: SamplingParams | list[SamplingParams],
    use_tqdm: bool = True,
) -> list[dict]
Runs inference on a list of prompts and returns results in the same order as the input.
prompts
list[str] | list[list[int]]
required
A list of prompts to process. Each prompt can be either a plain string (which will be tokenized internally) or a pre-tokenized list of integer token IDs.
sampling_params
SamplingParams | list[SamplingParams]
required
Sampling configuration. Pass a single SamplingParams to apply to all prompts, or a list with one entry per prompt.
use_tqdm
bool
default:"true"
When True, displays a tqdm progress bar showing the number of completed requests along with live prefill and decode throughput (tokens/sec).
Returns list[dict] — one entry per prompt, in the original prompt order:
text
str
The decoded completion string.
token_ids
list[int]
The raw completion token IDs (not including the prompt tokens).

add_request

llm.add_request(
    prompt: str | list[int],
    sampling_params: SamplingParams,
)
Adds a single request to the scheduler’s waiting queue. Strings are tokenized automatically. Use this alongside step() and is_finished() for manual, fine-grained control over the generation loop.
prompt
str | list[int]
required
The prompt as a string or a list of pre-tokenized integer IDs.
sampling_params
SamplingParams
required
Sampling configuration for this specific request.

step

llm.step() -> tuple[list[tuple[int, list[int]]], int]
Executes one scheduling and inference step. Schedules a batch (either a prefill batch or a decode batch), runs the model, and post-processes results. Returns a tuple (outputs, num_tokens):
outputs
list[tuple[int, list[int]]]
A list of (seq_id, completion_token_ids) pairs for every sequence that finished during this step. Empty if no sequences finished.
num_tokens
int
Positive value: the number of prefill tokens processed in this step. Negative value: the negative of the decode batch size (i.e., -len(seqs)).
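
The sign convention above can be unpacked into a (phase, count) pair for logging or throughput accounting. `classify_step` is a hypothetical helper for illustration, not part of the nano-vLLM API:

```python
def classify_step(num_tokens: int) -> tuple[str, int]:
    """Interpret step()'s num_tokens value: a positive value is a
    prefill batch of that many tokens; a negative value is a decode
    batch of -num_tokens sequences (one new token per sequence)."""
    if num_tokens > 0:
        return ("prefill", num_tokens)
    return ("decode", -num_tokens)
```

For example, a return value of 4096 means 4096 prompt tokens were prefetched in one prefill pass, while -32 means 32 sequences each decoded one token.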

is_finished

llm.is_finished() -> bool
Returns True when both the waiting queue and the running queue are empty — i.e., all submitted requests have completed.

Usage Example

from nanovllm import LLM, SamplingParams

# Initialize the engine from a local model directory
llm = LLM("/models/Qwen3-0.6B", max_model_len=2048, gpu_memory_utilization=0.85)

sampling_params = SamplingParams(temperature=0.6, max_tokens=128)

prompts = [
    "The capital of France is",
    "Explain the difference between a list and a tuple in Python.",
]

outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt!r}")
    print(f"Output: {output['text']!r}")
    print()

Manual generation loop

from nanovllm import LLM, SamplingParams

llm = LLM("/models/Qwen3-0.6B")
sp = SamplingParams(temperature=0.8, max_tokens=64)

llm.add_request("Hello, world!", sp)
llm.add_request("What is 2 + 2?", sp)

results = {}
while not llm.is_finished():
    outputs, num_tokens = llm.step()
    for seq_id, token_ids in outputs:
        results[seq_id] = token_ids
