This guide walks you through running your first inference with Nano-vLLM. You should have already installed Nano-vLLM and downloaded a model before continuing.

Basic Usage

1

Import LLM and SamplingParams

The public API is exported directly from the nanovllm package:
from nanovllm import LLM, SamplingParams
2

Initialize the LLM engine

Pass the path to your local model directory. Use enforce_eager=True to disable CUDA graphs (useful for debugging or low-VRAM setups) and tensor_parallel_size to spread the model across multiple GPUs.
llm = LLM("/path/to/your/model", enforce_eager=True, tensor_parallel_size=1)
tensor_parallel_size must be between 1 and 8. Set it to the number of GPUs you want to use.
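As an illustration of that constraint, here is a small helper (not part of Nano-vLLM; the function name and logic are hypothetical) that validates a requested tensor-parallel size against the documented 1-8 range and the number of GPUs actually available:

```python
# Hypothetical helper, not a Nano-vLLM API: pick a tensor_parallel_size
# that respects the documented 1-8 range and the available GPU count.
def choose_tp_size(requested: int, gpu_count: int) -> int:
    if not 1 <= requested <= 8:
        raise ValueError("tensor_parallel_size must be between 1 and 8")
    # Never request more shards than there are GPUs to hold them.
    return min(requested, gpu_count)
```

You might pair this with `torch.cuda.device_count()` to discover the GPU count at runtime.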
3

Define sampling parameters

SamplingParams controls how tokens are sampled during generation:
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
| Parameter   | Type  | Default | Description                                                                      |
|-------------|-------|---------|----------------------------------------------------------------------------------|
| temperature | float | 1.0     | Sampling temperature. Must be greater than 0 (greedy decoding is not supported). |
| max_tokens  | int   | 64      | Maximum number of tokens to generate per sequence.                               |
| ignore_eos  | bool  | False   | If True, continue generating past the EOS token.                                 |
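To make the defaults and the temperature constraint concrete, here is a minimal sketch of a parameter object with the same fields. This is not Nano-vLLM's actual implementation, only an illustration of the documented behavior:

```python
# Illustrative sketch only; Nano-vLLM's real SamplingParams may differ
# internally, but the fields, defaults, and constraint match the table above.
from dataclasses import dataclass

@dataclass
class SamplingParamsSketch:
    temperature: float = 1.0   # must be > 0; greedy decoding is unsupported
    max_tokens: int = 64       # cap on generated tokens per sequence
    ignore_eos: bool = False   # if True, keep generating past EOS

    def __post_init__(self):
        if self.temperature <= 0:
            raise ValueError("temperature must be greater than 0")
```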
4

Generate completions

Pass a list of prompt strings and the sampling parameters to llm.generate():
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
generate() processes all prompts in a single batched call and returns a list of output dictionaries — one per prompt.
5

Access the results

Each element of outputs is a dictionary with two keys:
  • "text" — the generated text as a string
  • "token_ids" — the generated token IDs as a list of integers
print(outputs[0]["text"])       # generated string
print(outputs[0]["token_ids"]) # list of token IDs

Full Example

import os
from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = [
    "introduce yourself",
    "list all prime numbers within 100",
]

# Apply the model's chat template before passing to the engine
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for prompt in prompts
]

outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt!r}")
    print(f"Completion: {output['text']!r}")
Instruction-tuned models (such as Qwen3, Llama-Instruct, etc.) expect input formatted with their chat template. Use tokenizer.apply_chat_template() as shown above, or the model may produce low-quality output.

Output Format

llm.generate() returns a list of dictionaries, one per input prompt:
[
    {"text": "<generated text>", "token_ids": [id1, id2, ...]},
    ...
]
The order of outputs matches the order of the input prompts list.
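A quick sketch of consuming this structure, using hand-written placeholder data rather than a real model run (the strings and token IDs below are made up for illustration):

```python
# Placeholder data in the documented output format; not real model output.
prompts = ["introduce yourself", "list all prime numbers within 100"]
outputs = [
    {"text": "Hello! I am a language model.", "token_ids": [101, 202, 303]},
    {"text": "2, 3, 5, 7, ...", "token_ids": [404, 505]},
]

# Outputs are positionally aligned with the input prompts, so zip is safe.
for prompt, output in zip(prompts, outputs):
    assert isinstance(output["text"], str)
    assert all(isinstance(t, int) for t in output["token_ids"])
    print(f"{prompt!r} -> {output['text']!r}")
```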
