# The generate() method
LLMEngine.generate is the primary entry point for offline inference. It accepts a list of prompts, runs the full prefill-decode loop, and returns the completed sequences.
```python
def generate(
    self,
    prompts: list[str] | list[list[int]],
    sampling_params: SamplingParams | list[SamplingParams],
    use_tqdm: bool = True,
) -> list[dict]:
```
| Parameter | Type | Description |
|---|---|---|
| `prompts` | `list[str]` or `list[list[int]]` | Text strings or pre-tokenized token ID lists |
| `sampling_params` | `SamplingParams` or `list[SamplingParams]` | A single shared config or one per prompt |
| `use_tqdm` | `bool` | Show a progress bar with live prefill/decode throughput (default `True`) |
Each element of the returned list is a dict with two keys:
```python
{
    "text": str,            # decoded output text
    "token_ids": list[int]  # raw output token IDs (not including the prompt)
}
```
Outputs are returned in the same order as the input prompts, regardless of which requests finish first.
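This ordering guarantee boils down to keying finished outputs by a submission-order sequence ID. A minimal sketch of the idea (the sorting-by-ID approach here is illustrative, not Nano-vLLM's actual code):

```python
# Requests may finish out of order; restore input order by sorting on
# seq_id, assuming IDs are assigned 0..n-1 in submission order.
finished = [(2, "third"), (0, "first"), (1, "second")]  # (seq_id, text), completion order
ordered = [text for _, text in sorted(finished, key=lambda pair: pair[0])]
print(ordered)  # ['first', 'second', 'third']
```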
## Single prompt
Pass a one-element list to run a single request:
```python
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
```
## Batch inference
Passing multiple prompts in a single generate() call allows the engine to schedule them together, which significantly improves GPU utilization compared to running them one at a time.
```python
import os

from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = [
    "introduce yourself",
    "list all prime numbers within 100",
]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for prompt in prompts
]

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt!r}")
    print(f"Completion: {output['text']!r}")
```
The scheduler batches waiting sequences together during prefill and decode phases. Larger batches amortize the fixed GPU kernel launch overhead and improve overall throughput.
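Batch formation can be pictured as draining a waiting queue up to a capacity limit. A hypothetical sketch (`form_batch`, `max_batch_size`, and the FIFO policy are illustrative assumptions, not the engine's actual scheduler):

```python
from collections import deque


def form_batch(waiting: deque, max_batch_size: int = 4) -> list:
    """Drain up to max_batch_size sequences from the waiting queue (FIFO)."""
    batch = []
    while waiting and len(batch) < max_batch_size:
        batch.append(waiting.popleft())
    return batch


waiting = deque(["seq0", "seq1", "seq2", "seq3", "seq4"])
print(form_batch(waiting))  # ['seq0', 'seq1', 'seq2', 'seq3']
print(list(waiting))        # ['seq4'] carries over to the next round
```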
## Per-request SamplingParams
When sampling_params is a single SamplingParams instance it is broadcast to every prompt. To use different parameters per request, pass a list of the same length as prompts:
```python
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model")
prompts = [
    "Write a haiku about the ocean.",
    "Explain quantum entanglement in one sentence.",
    "List the capitals of all G7 countries.",
]
sampling_params = [
    SamplingParams(temperature=0.9, max_tokens=64),   # creative
    SamplingParams(temperature=0.3, max_tokens=128),  # factual
    SamplingParams(temperature=0.6, max_tokens=256),  # balanced
]
outputs = llm.generate(prompts, sampling_params)
```
The engine requires len(sampling_params) == len(prompts) when a list is provided.
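The broadcast-or-validate rule can be sketched as a small helper (hypothetical; the function name is illustrative of the behavior described above, not the engine's internals):

```python
def broadcast_params(prompts, sampling_params):
    """Replicate a single config across prompts, or validate a per-prompt list."""
    if not isinstance(sampling_params, list):
        # single config: broadcast to every prompt
        return [sampling_params] * len(prompts)
    if len(sampling_params) != len(prompts):
        raise ValueError(
            f"got {len(sampling_params)} sampling_params for {len(prompts)} prompts"
        )
    return sampling_params


print(broadcast_params(["a", "b", "c"], "shared"))  # ['shared', 'shared', 'shared']
```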
## Pre-tokenized prompts
Instead of raw strings, you can pass pre-tokenized sequences as list[list[int]]. This skips the internal tokenization step and is useful when you have already applied a chat template or need precise control over the input.
```python
from random import randint, seed

from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

path = "/path/to/model"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path)

# pass raw token IDs directly
seed(0)
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(8)
]
sampling_params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(prompt_token_ids, sampling_params)
for output in outputs:
    print(output["token_ids"])  # list[int]
```
Internally, add_request checks isinstance(prompt, str) and calls tokenizer.encode only when a string is received:
```python
def add_request(self, prompt: str | list[int], sampling_params: SamplingParams):
    if isinstance(prompt, str):
        prompt = self.tokenizer.encode(prompt)
    seq = Sequence(prompt, sampling_params)
    self.scheduler.add(seq)
```
## Streaming-style control with add_request() and step()
For fine-grained control — such as processing completions as they arrive — you can drive the engine manually using the lower-level add_request / step API instead of generate.
step() runs one scheduling round (either a prefill or a decode pass) and returns the sequences that finished during that round:
```python
def step(self) -> tuple[list[tuple[int, list[int]]], int]:
    seqs, is_prefill = self.scheduler.schedule()
    token_ids = self.model_runner.call("run", seqs, is_prefill)
    self.scheduler.postprocess(seqs, token_ids)
    outputs = [(seq.seq_id, seq.completion_token_ids) for seq in seqs if seq.is_finished]
    num_tokens = sum(len(seq) for seq in seqs) if is_prefill else -len(seqs)
    return outputs, num_tokens
```
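Note the `num_tokens` convention in step(): during a prefill round it is the total number of tokens processed, while during a decode round it is the negative of the batch size (each sequence emits exactly one token), so the caller can tell the two phases apart from the sign alone. A tiny illustration of that expression in isolation:

```python
def count_tokens(seq_lens: list[int], is_prefill: bool) -> int:
    # Mirrors step(): positive = prefill tokens processed,
    # negative = number of decoding sequences this round.
    return sum(seq_lens) if is_prefill else -len(seq_lens)


print(count_tokens([5, 7, 3], is_prefill=True))   # 15
print(count_tokens([5, 7, 3], is_prefill=False))  # -3
```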
Example — add requests and drain the engine step-by-step:
```python
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model")
sp = SamplingParams(temperature=0.6, max_tokens=128)

prompts = ["Tell me a joke.", "What is 2 + 2?"]
for prompt in prompts:
    llm.add_request(prompt, sp)

results = {}
while not llm.is_finished():
    outputs, num_tokens = llm.step()
    for seq_id, token_ids in outputs:
        # decode as each sequence finishes
        text = llm.tokenizer.decode(token_ids)
        results[seq_id] = text
        print(f"[seq {seq_id}] {text}")
```
step() returns only the sequences that finished in that round. Sequences still generating tokens are not included in the output until they hit max_tokens or produce an EOS token.
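That finish condition can be sketched as follows (the EOS token ID is hypothetical; the real value comes from the model's tokenizer config):

```python
EOS_TOKEN_ID = 151645  # hypothetical; the real ID comes from the tokenizer config


def is_finished(completion_ids: list[int], max_tokens: int) -> bool:
    """A sequence stops on EOS or once max_tokens completions have been produced."""
    if completion_ids and completion_ids[-1] == EOS_TOKEN_ID:
        return True
    return len(completion_ids) >= max_tokens


print(is_finished([1, 2, EOS_TOKEN_ID], max_tokens=128))  # True (hit EOS)
print(is_finished([1, 2, 3], max_tokens=3))               # True (hit max_tokens)
print(is_finished([1, 2], max_tokens=128))                # False (still generating)
```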