Basic Usage
Initialize the LLM engine
Pass the path to your local model directory. Use
enforce_eager=True to disable CUDA graphs (useful for debugging or low-VRAM setups) and tensor_parallel_size to spread the model across multiple GPUs. tensor_parallel_size must be between 1 and 8; set it to the number of GPUs you want to use.

Define sampling parameters
SamplingParams controls how tokens are sampled during generation:

| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Sampling temperature. Must be greater than 0 (greedy decoding is not supported). |
| max_tokens | int | 64 | Maximum number of tokens to generate per sequence. |
| ignore_eos | bool | False | If True, continue generating past the EOS token. |
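A minimal sketch of constructing these parameters, assuming the class is importable as SamplingParams from the package (the import path is an assumption; adjust it to your install):

```python
from nanovllm import SamplingParams  # assumed import path

# temperature must be > 0 (greedy decoding is not supported), so use a
# small positive value for near-deterministic sampling.
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=256,   # cap on generated tokens per sequence
)
```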
Generate completions
Pass a list of prompt strings and the sampling parameters to
llm.generate(). generate() processes all prompts in a single batched call and returns a list of output dictionaries, one per prompt.

Full Examples
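A fuller sketch tying the steps together, assuming the package imports as nanovllm and using Hugging Face transformers' AutoTokenizer to apply the chat template (both import paths are assumptions; adjust the model path to your setup):

```python
from nanovllm import LLM, SamplingParams   # assumed import path
from transformers import AutoTokenizer     # assumed tokenizer source

model_path = "/path/to/your/model"

# enforce_eager=True skips CUDA graph capture; tensor_parallel_size
# shards the model across that many GPUs (must be between 1 and 8).
llm = LLM(model_path, enforce_eager=True, tensor_parallel_size=1)
tokenizer = AutoTokenizer.from_pretrained(model_path)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

# Apply the model's chat template so instruction-tuned models see the
# input format they were trained on.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": "Introduce yourself briefly."}],
        tokenize=False,
        add_generation_prompt=True,
    )
]

# One batched call; returns a list of output dicts, one per prompt.
outputs = llm.generate(prompts, sampling_params)
```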
Instruction-tuned models (such as Qwen3 or Llama-Instruct) expect input formatted with their chat template. Use
tokenizer.apply_chat_template() as shown above, or the model may produce low-quality output.

Output Format
llm.generate() returns a list of dictionaries, one per input prompt, in the same order as the prompts list.
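A small illustration of consuming that list, using hypothetical output values; the "text" key is an assumption here, not confirmed by this page, so check your library version for the exact keys:

```python
# Hypothetical outputs mirroring the documented "list of dicts, one per
# prompt" shape; the "text" key is an assumption.
prompts = ["Hello!", "Name the capital of France."]
outputs = [
    {"text": "Hi there! How can I help?"},
    {"text": "The capital of France is Paris."},
]

# Results line up with the prompts list by index.
for prompt, out in zip(prompts, outputs):
    print(f"{prompt!r} -> {out['text']!r}")
```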