Config is an internal dataclass that centralises every engine-level setting. You do not construct it directly — instead pass its fields as keyword arguments when creating an LLM instance.
```python
# All Config fields can be passed directly to LLM()
llm = LLM(
    "/models/Qwen3-0.6B",
    max_model_len=8192,
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
)
```

Fields

model (str, required)
Local filesystem path to the model directory. The directory must contain a Hugging Face-compatible model config (config.json) and weight files. Passing a model hub name raises an AssertionError.

max_num_batched_tokens (int, default: 16384)
Maximum total number of tokens (summed across all sequences) processed in a single forward pass. Must be >= max_model_len.

max_num_seqs (int, default: 512)
Maximum number of sequences that can be scheduled concurrently.

max_model_len (int, default: 4096)
Maximum supported sequence length in tokens (prompt + completion combined). During initialisation this value is silently capped to the model's own max_position_embeddings when that limit is smaller.

gpu_memory_utilization (float, default: 0.9)
Fraction of total GPU memory to allocate for the KV cache, after accounting for model weights and activation memory. For example, 0.9 means 90% of GPU memory may be used.

tensor_parallel_size (int, default: 1)
Number of GPUs to use for tensor-parallel inference. Must be between 1 and 8 inclusive. When set to a value greater than 1, additional worker processes are spawned via torch.multiprocessing using the spawn start method.

enforce_eager (bool, default: False)
When True, CUDA graph capture is skipped and every step is executed in PyTorch eager mode. Recommended for debugging, profiling, or environments where CUDA graphs are unsupported.

kvcache_block_size (int, default: 256)
Number of tokens per KV cache block. Must be a multiple of 256. Larger values reduce block-management overhead but may waste memory for short sequences.
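
To make the block-size trade-off concrete, the allocation arithmetic can be sketched as follows. This is an illustration of ceiling-division block allocation, not engine code, and the helper names are hypothetical:

```python
def blocks_needed(seq_len: int, block_size: int = 256) -> int:
    """Number of KV cache blocks a sequence of seq_len tokens occupies."""
    return (seq_len + block_size - 1) // block_size  # ceiling division

def wasted_slots(seq_len: int, block_size: int = 256) -> int:
    """Token slots allocated but unused in the sequence's last block."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len

# A 1000-token sequence with the default 256-token blocks:
print(blocks_needed(1000))  # 4 blocks allocated
print(wasted_slots(1000))   # 24 slots unused in the last block
```

Only the final block of each sequence can be partially filled, so the worst-case waste per sequence is block_size - 1 tokens; larger blocks mean fewer block-table entries to manage but a larger potential remainder.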

Internal-only fields

These fields are set automatically during initialisation and should not be provided by the user:
hf_config (AutoConfig | None)
Loaded Hugging Face model config. Set by __post_init__.

eos (int)
EOS token ID. Populated from the tokenizer after the engine starts.

num_kvcache_blocks (int)
Total number of KV cache blocks, computed after GPU memory profiling.
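
As a rough illustration of how a block count can be derived from a memory budget (this is not the engine's actual profiling logic, and every number below is hypothetical):

```python
def kv_block_bytes(block_size: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Bytes one KV cache block occupies: K and V tensors across all layers."""
    return 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical model shape and memory figures, for illustration only:
block = kv_block_bytes(256, num_layers=28, num_kv_heads=8,
                       head_dim=128, dtype_bytes=2)  # 28 MiB per block
budget = int(8 * 2**30 * 0.9) - 2 * 2**30  # 90% of 8 GiB minus ~2 GiB for weights/activations
num_kvcache_blocks = budget // block
```

The factor of 2 accounts for storing both keys and values; in the real engine the weight and activation terms come from profiling a forward pass rather than a fixed constant.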

Validation rules

Config.__post_init__ enforces the following constraints at construction time:
```python
def __post_init__(self):
    assert os.path.isdir(self.model)
    assert self.kvcache_block_size % 256 == 0
    assert 1 <= self.tensor_parallel_size <= 8
    self.hf_config = AutoConfig.from_pretrained(self.model)
    self.max_model_len = min(self.max_model_len, self.hf_config.max_position_embeddings)
    assert self.max_num_batched_tokens >= self.max_model_len
max_model_len is silently reduced if the value you supply exceeds the model’s own max_position_embeddings. No warning is emitted.
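
The silent cap can be demonstrated with a minimal standalone sketch of the same validation, with no Hugging Face dependency. Here hf_max_pos is a stand-in for hf_config.max_position_embeddings, and the class is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ConfigSketch:
    """Simplified stand-in for Config, reproducing only the validation logic."""
    max_model_len: int = 4096
    max_num_batched_tokens: int = 16384
    kvcache_block_size: int = 256
    tensor_parallel_size: int = 1
    hf_max_pos: int = 2048  # stand-in for hf_config.max_position_embeddings

    def __post_init__(self):
        assert self.kvcache_block_size % 256 == 0
        assert 1 <= self.tensor_parallel_size <= 8
        # The silent cap: no warning is emitted when the cap applies
        self.max_model_len = min(self.max_model_len, self.hf_max_pos)
        assert self.max_num_batched_tokens >= self.max_model_len

cfg = ConfigSketch(max_model_len=8192)  # model only supports 2048
print(cfg.max_model_len)  # 2048, reduced without any warning
```

If you need to detect the cap, compare the value you passed against the resulting max_model_len after construction.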
