
The generate() method

LLMEngine.generate is the primary entry point for offline inference. It accepts a list of prompts, runs the full prefill-decode loop, and returns the completed sequences.
def generate(
    self,
    prompts: list[str] | list[list[int]],
    sampling_params: SamplingParams | list[SamplingParams],
    use_tqdm: bool = True,
) -> list[dict]:
Parameter        Type                                    Description
prompts          list[str] or list[list[int]]            Text strings or pre-tokenized token ID lists
sampling_params  SamplingParams or list[SamplingParams]  A single shared config or one per prompt
use_tqdm         bool                                    Show a progress bar with live prefill/decode throughput (default True)
Each element of the returned list is a dict with two keys:
{
    "text": str,          # decoded output text
    "token_ids": list[int]  # raw output token IDs (not including the prompt)
}
Outputs are returned in the same order as the input prompts, regardless of which requests finish first.
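One way to picture the bookkeeping behind this ordering guarantee is a map from sequence ID back to input position, so results arriving in completion order can be slotted into input order. The helper below is an illustrative sketch, not Nano-vLLM's actual implementation:

```python
# Toy sketch (hypothetical helper, not Nano-vLLM's code): results arrive
# keyed by seq_id in completion order, but are written back in input order.
def collect_in_order(input_seq_ids, finished):
    """input_seq_ids: seq_ids in submission order.
    finished: dict seq_id -> output dict, in any completion order."""
    index_of = {seq_id: i for i, seq_id in enumerate(input_seq_ids)}
    outputs = [None] * len(input_seq_ids)
    for seq_id, result in finished.items():
        outputs[index_of[seq_id]] = result
    return outputs

# Requests 0, 1, 2 finish in the order 2, 0, 1:
finished = {2: {"text": "c"}, 0: {"text": "a"}, 1: {"text": "b"}}
ordered = collect_in_order([0, 1, 2], finished)
print([o["text"] for o in ordered])  # ['a', 'b', 'c']
```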

Single prompt

Pass a one-element list to run a single request:
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])

Batch inference

Passing multiple prompts in a single generate() call allows the engine to schedule them together, which significantly improves GPU utilization compared to running them one at a time.
import os
from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = [
    "introduce yourself",
    "list all prime numbers within 100",
]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for prompt in prompts
]
outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt!r}")
    print(f"Completion: {output['text']!r}")
The scheduler batches waiting sequences together during prefill and decode phases. Larger batches amortize the fixed GPU kernel launch overhead and improve overall throughput.
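The amortization effect can be shown with back-of-the-envelope arithmetic. The latencies below are made up for illustration, not measurements of any real GPU:

```python
# Illustrative arithmetic only (assumed latencies, not measurements):
# the fixed per-step launch cost is paid once per batch, so per-sequence
# cost shrinks as the batch grows.
launch_overhead_ms = 5.0   # fixed cost per scheduling round (assumed)
per_seq_ms = 1.0           # marginal cost per sequence in the batch (assumed)

def tokens_per_ms(batch_size):
    # one decode round produces one token per sequence
    return batch_size / (launch_overhead_ms + per_seq_ms * batch_size)

print(round(tokens_per_ms(1), 3))   # 0.167
print(round(tokens_per_ms(16), 3))  # 0.762
```

With these numbers, batching 16 sequences yields roughly 4.5x the decode throughput of running them one at a time, because the fixed overhead is paid once instead of 16 times.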

Per-request SamplingParams

When sampling_params is a single SamplingParams instance it is broadcast to every prompt. To use different parameters per request, pass a list of the same length as prompts:
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model")

prompts = [
    "Write a haiku about the ocean.",
    "Explain quantum entanglement in one sentence.",
    "List the capitals of all G7 countries.",
]

sampling_params = [
    SamplingParams(temperature=0.9, max_tokens=64),   # creative
    SamplingParams(temperature=0.3, max_tokens=128),  # factual
    SamplingParams(temperature=0.6, max_tokens=256),  # balanced
]

outputs = llm.generate(prompts, sampling_params)
The engine requires len(sampling_params) == len(prompts) when a list is provided.
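The broadcast-or-validate behavior can be sketched as follows. This is a plausible reconstruction of the logic, not the engine's exact code, and the SamplingParams stand-in below only mimics the real class:

```python
from dataclasses import dataclass

@dataclass
class SamplingParams:  # minimal stand-in for nanovllm.SamplingParams
    temperature: float = 1.0
    max_tokens: int = 16

def resolve_sampling_params(prompts, sampling_params):
    # A single instance is broadcast to every prompt;
    # a list must match prompts one-to-one.
    if not isinstance(sampling_params, list):
        return [sampling_params] * len(prompts)
    if len(sampling_params) != len(prompts):
        raise ValueError("sampling_params list must match prompts in length")
    return sampling_params

params = resolve_sampling_params(["a", "b"], SamplingParams(temperature=0.6))
print(len(params))  # 2
```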

Token ID input

Instead of raw strings, you can pass pre-tokenized sequences as list[list[int]]. This skips the internal tokenization step and is useful when you have already applied a chat template or need precise control over the input.
from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer
from random import randint, seed

path = "/path/to/model"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path)

# pass raw token IDs directly
seed(0)
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(8)
]
sampling_params = SamplingParams(temperature=0.6, max_tokens=128)

outputs = llm.generate(prompt_token_ids, sampling_params)
for output in outputs:
    print(output["token_ids"])  # list[int]
Internally, add_request checks isinstance(prompt, str) and calls tokenizer.encode only when a string is received:
def add_request(self, prompt: str | list[int], sampling_params: SamplingParams):
    if isinstance(prompt, str):
        prompt = self.tokenizer.encode(prompt)
    seq = Sequence(prompt, sampling_params)
    self.scheduler.add(seq)

Streaming-style control with add_request() and step()

For fine-grained control — such as processing completions as they arrive — you can drive the engine manually using the lower-level add_request / step API instead of generate. step() runs one scheduling round (either a prefill or a decode pass) and returns the sequences that finished during that round:
def step(self) -> tuple[list[tuple[int, list[int]]], int]:
    seqs, is_prefill = self.scheduler.schedule()
    token_ids = self.model_runner.call("run", seqs, is_prefill)
    self.scheduler.postprocess(seqs, token_ids)
    outputs = [(seq.seq_id, seq.completion_token_ids) for seq in seqs if seq.is_finished]
    num_tokens = sum(len(seq) for seq in seqs) if is_prefill else -len(seqs)
    return outputs, num_tokens
Example — add requests and drain the engine step-by-step:
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model")
sp = SamplingParams(temperature=0.6, max_tokens=128)

prompts = ["Tell me a joke.", "What is 2 + 2?"]
for prompt in prompts:
    llm.add_request(prompt, sp)

results = {}
while not llm.is_finished():
    outputs, num_tokens = llm.step()
    for seq_id, token_ids in outputs:
        # decode as each sequence finishes
        text = llm.tokenizer.decode(token_ids)
        results[seq_id] = text
        print(f"[seq {seq_id}] {text}")
step() returns only the sequences that finished in that round. Sequences still generating tokens are not included in the output until they hit max_tokens or produce an EOS token.
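This "finished sequences only" contract can be seen with a toy model that has nothing to do with the real scheduler: each round advances every active sequence by one token, and only sequences that just hit their token budget appear in that round's output:

```python
# Toy model (illustrative only) of step()'s contract: a round returns
# only the sequences that finished during that round.
def make_engine(tokens_to_generate):
    # tokens_to_generate: dict seq_id -> tokens still to produce
    remaining = dict(tokens_to_generate)

    def step():
        finished = []
        for seq_id in list(remaining):
            remaining[seq_id] -= 1          # generate one token
            if remaining[seq_id] == 0:      # budget exhausted -> finished
                finished.append(seq_id)
                del remaining[seq_id]
        return finished

    def is_finished():
        return not remaining

    return step, is_finished

step, is_finished = make_engine({0: 2, 1: 3, 2: 1})
rounds = []
while not is_finished():
    rounds.append(step())
print(rounds)  # [[2], [0], [1]]
```

Sequence 2 (budget 1) finishes in round one, sequence 0 in round two, and sequence 1 in round three; each appears in exactly one round's output.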
