This guide walks you through running your first inference with Nano-vLLM. You should have already installed Nano-vLLM and downloaded a model before continuing.

Basic Usage

1

Import LLM and SamplingParams

The public API is exported directly from the nanovllm package:
from nanovllm import LLM, SamplingParams
2

Initialize the LLM engine

Pass the path to your local model directory. Use enforce_eager=True to disable CUDA graphs (useful for debugging or low-VRAM setups) and tensor_parallel_size to spread the model across multiple GPUs.
llm = LLM("/path/to/your/model", enforce_eager=True, tensor_parallel_size=1)
tensor_parallel_size must be between 1 and 8. Set it to the number of GPUs you want to use.
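As an illustration of that constraint, here is a small helper (not part of Nano-vLLM; the function name and logic are hypothetical) that validates a requested tensor-parallel size against the documented 1-8 range and the number of GPUs actually available:

```python
# Hypothetical helper, not a Nano-vLLM API: pick a tensor_parallel_size
# that respects the documented 1-8 range and the available GPU count.
def choose_tp_size(requested: int, gpu_count: int) -> int:
    if not 1 <= requested <= 8:
        raise ValueError("tensor_parallel_size must be between 1 and 8")
    # Never request more shards than there are GPUs to hold them.
    return min(requested, gpu_count)
```

You might pair this with `torch.cuda.device_count()` to discover the GPU count at runtime.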
3

Define sampling parameters

SamplingParams controls how tokens are sampled during generation:
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
| Parameter   | Type  | Default | Description                                                                      |
|-------------|-------|---------|----------------------------------------------------------------------------------|
| temperature | float | 1.0     | Sampling temperature. Must be greater than 0 (greedy decoding is not supported). |
| max_tokens  | int   | 64      | Maximum number of tokens to generate per sequence.                               |
| ignore_eos  | bool  | False   | If True, continue generating past the EOS token.                                 |
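To make the defaults and the temperature constraint concrete, here is a minimal sketch of a parameter object with the same fields. This is not Nano-vLLM's actual implementation, only an illustration of the documented behavior:

```python
# Illustrative sketch only; Nano-vLLM's real SamplingParams may differ
# internally, but the fields, defaults, and constraint match the table above.
from dataclasses import dataclass

@dataclass
class SamplingParamsSketch:
    temperature: float = 1.0   # must be > 0; greedy decoding is unsupported
    max_tokens: int = 64       # cap on generated tokens per sequence
    ignore_eos: bool = False   # if True, keep generating past EOS

    def __post_init__(self):
        if self.temperature <= 0:
            raise ValueError("temperature must be greater than 0")
```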
4

Generate completions

Pass a list of prompt strings and the sampling parameters to llm.generate():
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
generate() processes all prompts in a single batched call and returns a list of output dictionaries — one per prompt.
5

Access the results

Each element of outputs is a dictionary with two keys:
  • "text" — the generated text as a string
  • "token_ids" — the generated token IDs as a list of integers
print(outputs[0]["text"])       # generated string
print(outputs[0]["token_ids"]) # list of token IDs

Full Example

import os
from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = [
    "introduce yourself",
    "list all prime numbers within 100",
]

# Apply the model's chat template before passing to the engine
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for prompt in prompts
]

outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt!r}")
    print(f"Completion: {output['text']!r}")
Instruction-tuned models (such as Qwen3, Llama-Instruct, etc.) expect input formatted with their chat template. Use tokenizer.apply_chat_template() as shown above, or the model may produce low-quality output.

Output Format

llm.generate() returns a list of dictionaries, one per input prompt:
[
    {"text": "<generated text>", "token_ids": [id1, id2, ...]},
    ...
]
The order of outputs matches the order of the input prompts list.
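A quick sketch of consuming this structure, using hand-written placeholder data rather than a real model run (the strings and token IDs below are made up for illustration):

```python
# Placeholder data in the documented output format; not real model output.
prompts = ["introduce yourself", "list all prime numbers within 100"]
outputs = [
    {"text": "Hello! I am a language model.", "token_ids": [101, 202, 303]},
    {"text": "2, 3, 5, 7, ...", "token_ids": [404, 505]},
]

# Outputs are positionally aligned with the input prompts, so zip is safe.
for prompt, output in zip(prompts, outputs):
    assert isinstance(output["text"], str)
    assert all(isinstance(t, int) for t in output["token_ids"])
    print(f"{prompt!r} -> {output['text']!r}")
```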
