# The generate() method
LLMEngine.generate is the primary entry point for offline inference. It accepts a list of prompts, runs the full prefill-decode loop, and returns the completed sequences.
```python
def generate(
    self,
    prompts: list[str] | list[list[int]],
    sampling_params: SamplingParams | list[SamplingParams],
    use_tqdm: bool = True,
) -> list[dict]:
```
| Parameter | Type | Description |
|---|---|---|
| `prompts` | `list[str]` or `list[list[int]]` | Text strings or pre-tokenized token ID lists |
| `sampling_params` | `SamplingParams` or `list[SamplingParams]` | A single shared config or one per prompt |
| `use_tqdm` | `bool` | Show a progress bar with live prefill/decode throughput (default `True`) |
Each element of the returned list is a dict with two keys:
```python
{
    "text": str,            # decoded output text
    "token_ids": list[int]  # raw output token IDs (not including the prompt)
}
```
Outputs are returned in the same order as the input prompts, regardless of which requests finish first.
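This ordering guarantee boils down to keying finished outputs by a submission-order sequence ID. A minimal sketch of the idea (the sorting-by-ID approach here is illustrative, not Nano-vLLM's actual code):

```python
# Requests may finish out of order; restore input order by sorting on
# seq_id, assuming IDs are assigned 0..n-1 in submission order.
finished = [(2, "third"), (0, "first"), (1, "second")]  # (seq_id, text), completion order
ordered = [text for _, text in sorted(finished, key=lambda pair: pair[0])]
print(ordered)  # ['first', 'second', 'third']
```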
## Single prompt
Pass a one-element list to run a single request:
```python
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
```
## Batch inference
Passing multiple prompts in a single generate() call allows the engine to schedule them together, which significantly improves GPU utilization compared to running them one at a time.
```python
import os

from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = [
    "introduce yourself",
    "list all prime numbers within 100",
]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for prompt in prompts
]

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt!r}")
    print(f"Completion: {output['text']!r}")
```
The scheduler batches waiting sequences together during prefill and decode phases. Larger batches amortize the fixed GPU kernel launch overhead and improve overall throughput.
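Batch formation can be pictured as draining a waiting queue up to a capacity limit. A hypothetical sketch (`form_batch`, `max_batch_size`, and the FIFO policy are illustrative assumptions, not the engine's actual scheduler):

```python
from collections import deque


def form_batch(waiting: deque, max_batch_size: int = 4) -> list:
    """Drain up to max_batch_size sequences from the waiting queue (FIFO)."""
    batch = []
    while waiting and len(batch) < max_batch_size:
        batch.append(waiting.popleft())
    return batch


waiting = deque(["seq0", "seq1", "seq2", "seq3", "seq4"])
print(form_batch(waiting))  # ['seq0', 'seq1', 'seq2', 'seq3']
print(list(waiting))        # ['seq4'] carries over to the next round
```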
## Per-request SamplingParams
When sampling_params is a single SamplingParams instance it is broadcast to every prompt. To use different parameters per request, pass a list of the same length as prompts:
```python
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model")
prompts = [
    "Write a haiku about the ocean.",
    "Explain quantum entanglement in one sentence.",
    "List the capitals of all G7 countries.",
]
sampling_params = [
    SamplingParams(temperature=0.9, max_tokens=64),   # creative
    SamplingParams(temperature=0.3, max_tokens=128),  # factual
    SamplingParams(temperature=0.6, max_tokens=256),  # balanced
]
outputs = llm.generate(prompts, sampling_params)
```
The engine requires len(sampling_params) == len(prompts) when a list is provided.
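The broadcast-or-validate rule can be sketched as a small helper (hypothetical; the function name is illustrative of the behavior described above, not the engine's internals):

```python
def broadcast_params(prompts, sampling_params):
    """Replicate a single config across prompts, or validate a per-prompt list."""
    if not isinstance(sampling_params, list):
        # single config: broadcast to every prompt
        return [sampling_params] * len(prompts)
    if len(sampling_params) != len(prompts):
        raise ValueError(
            f"got {len(sampling_params)} sampling_params for {len(prompts)} prompts"
        )
    return sampling_params


print(broadcast_params(["a", "b", "c"], "shared"))  # ['shared', 'shared', 'shared']
```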
## Pre-tokenized prompts
Instead of raw strings, you can pass pre-tokenized sequences as list[list[int]]. This skips the internal tokenization step and is useful when you have already applied a chat template or need precise control over the input.
```python
from random import randint, seed

from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

path = "/path/to/model"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path)

# pass raw token IDs directly
seed(0)
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(8)
]
sampling_params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(prompt_token_ids, sampling_params)
for output in outputs:
    print(output["token_ids"])  # list[int]
```
Internally, add_request checks isinstance(prompt, str) and calls tokenizer.encode only when a string is received:
```python
def add_request(self, prompt: str | list[int], sampling_params: SamplingParams):
    if isinstance(prompt, str):
        prompt = self.tokenizer.encode(prompt)
    seq = Sequence(prompt, sampling_params)
    self.scheduler.add(seq)
```
## Streaming-style control with add_request() and step()
For fine-grained control — such as processing completions as they arrive — you can drive the engine manually using the lower-level add_request / step API instead of generate.
step() runs one scheduling round (either a prefill or a decode pass) and returns the sequences that finished during that round:
```python
def step(self) -> tuple[list[tuple[int, list[int]]], int]:
    seqs, is_prefill = self.scheduler.schedule()
    token_ids = self.model_runner.call("run", seqs, is_prefill)
    self.scheduler.postprocess(seqs, token_ids)
    outputs = [(seq.seq_id, seq.completion_token_ids) for seq in seqs if seq.is_finished]
    num_tokens = sum(len(seq) for seq in seqs) if is_prefill else -len(seqs)
    return outputs, num_tokens
```
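Note the `num_tokens` convention in step(): during a prefill round it is the total number of tokens processed, while during a decode round it is the negative of the batch size (each sequence emits exactly one token), so the caller can tell the two phases apart from the sign alone. A tiny illustration of that expression in isolation:

```python
def count_tokens(seq_lens: list[int], is_prefill: bool) -> int:
    # Mirrors step(): positive = prefill tokens processed,
    # negative = number of decoding sequences this round.
    return sum(seq_lens) if is_prefill else -len(seq_lens)


print(count_tokens([5, 7, 3], is_prefill=True))   # 15
print(count_tokens([5, 7, 3], is_prefill=False))  # -3
```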
Example — add requests and drain the engine step-by-step:
```python
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model")
sp = SamplingParams(temperature=0.6, max_tokens=128)

prompts = ["Tell me a joke.", "What is 2 + 2?"]
for prompt in prompts:
    llm.add_request(prompt, sp)

results = {}
while not llm.is_finished():
    outputs, num_tokens = llm.step()
    for seq_id, token_ids in outputs:
        # decode as each sequence finishes
        text = llm.tokenizer.decode(token_ids)
        results[seq_id] = text
        print(f"[seq {seq_id}] {text}")
```
step() returns only the sequences that finished in that round. Sequences still generating tokens are not included in the output until they hit max_tokens or produce an EOS token.
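That finish condition can be sketched as follows (the EOS token ID is hypothetical; the real value comes from the model's tokenizer config):

```python
EOS_TOKEN_ID = 151645  # hypothetical; the real ID comes from the tokenizer config


def is_finished(completion_ids: list[int], max_tokens: int) -> bool:
    """A sequence stops on EOS or once max_tokens completions have been produced."""
    if completion_ids and completion_ids[-1] == EOS_TOKEN_ID:
        return True
    return len(completion_ids) >= max_tokens


print(is_finished([1, 2, EOS_TOKEN_ID], max_tokens=128))  # True (hit EOS)
print(is_finished([1, 2, 3], max_tokens=3))               # True (hit max_tokens)
print(is_finished([1, 2], max_tokens=128))                # False (still generating)
```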