Config is an internal dataclass that centralises every engine-level setting. You do not construct it directly — instead pass its fields as keyword arguments when creating an LLM instance.
```python
# All Config fields can be passed directly to LLM()
llm = LLM(
    "/models/Qwen3-0.6B",
    max_model_len=8192,
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
)
```

Fields

model (str, required)
Local filesystem path to the model directory. The directory must contain a Hugging Face-compatible model config (config.json) and weight files. Passing a model hub name raises an AssertionError.

max_num_batched_tokens (int, default: 16384)
Maximum total number of tokens (summed across all sequences) processed in a single forward pass. Must be >= max_model_len.

max_num_seqs (int, default: 512)
Maximum number of sequences that can be scheduled concurrently.

max_model_len (int, default: 4096)
Maximum supported sequence length in tokens (prompt + completion combined). During initialisation this value is silently capped to the model's own max_position_embeddings when that limit is smaller.

gpu_memory_utilization (float, default: 0.9)
Fraction of total GPU memory to allocate for the KV cache, after accounting for model weights and activation memory. For example, 0.9 means 90% of GPU memory may be used.

tensor_parallel_size (int, default: 1)
Number of GPUs to use for tensor-parallel inference. Must be between 1 and 8 inclusive. When set to a value greater than 1, additional worker processes are spawned via torch.multiprocessing using the spawn start method.

enforce_eager (bool, default: False)
When True, CUDA graph capture is skipped and every step is executed in PyTorch eager mode. Recommended for debugging, profiling, or environments where CUDA graphs are unsupported.

kvcache_block_size (int, default: 256)
Number of tokens per KV cache block. Must be a multiple of 256. Larger values reduce block-management overhead but may waste memory for short sequences.
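
To make the block-size trade-off concrete, the allocation arithmetic can be sketched as follows. This is an illustration of ceiling-division block allocation, not engine code, and the helper names are hypothetical:

```python
def blocks_needed(seq_len: int, block_size: int = 256) -> int:
    """Number of KV cache blocks a sequence of seq_len tokens occupies."""
    return (seq_len + block_size - 1) // block_size  # ceiling division

def wasted_slots(seq_len: int, block_size: int = 256) -> int:
    """Token slots allocated but unused in the sequence's last block."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len

# A 1000-token sequence with the default 256-token blocks:
print(blocks_needed(1000))  # 4 blocks allocated
print(wasted_slots(1000))   # 24 slots unused in the last block
```

Only the final block of each sequence can be partially filled, so the worst-case waste per sequence is block_size - 1 tokens; larger blocks mean fewer block-table entries to manage but a larger potential remainder.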

Internal-only fields

These fields are set automatically during initialisation and should not be provided by the user:
hf_config (AutoConfig | None)
Loaded Hugging Face model config. Set by __post_init__.

eos (int)
EOS token ID. Populated from the tokenizer after the engine starts.

num_kvcache_blocks (int)
Total number of KV cache blocks, computed after GPU memory profiling.
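
As a rough illustration of how a block count can be derived from a memory budget (this is not the engine's actual profiling logic, and every number below is hypothetical):

```python
def kv_block_bytes(block_size: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Bytes one KV cache block occupies: K and V tensors across all layers."""
    return 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical model shape and memory figures, for illustration only:
block = kv_block_bytes(256, num_layers=28, num_kv_heads=8,
                       head_dim=128, dtype_bytes=2)  # 28 MiB per block
budget = int(8 * 2**30 * 0.9) - 2 * 2**30  # 90% of 8 GiB minus ~2 GiB for weights/activations
num_kvcache_blocks = budget // block
```

The factor of 2 accounts for storing both keys and values; in the real engine the weight and activation terms come from profiling a forward pass rather than a fixed constant.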

Validation rules

Config.__post_init__ enforces the following constraints at construction time:
```python
def __post_init__(self):
    assert os.path.isdir(self.model)
    assert self.kvcache_block_size % 256 == 0
    assert 1 <= self.tensor_parallel_size <= 8
    self.hf_config = AutoConfig.from_pretrained(self.model)
    self.max_model_len = min(self.max_model_len, self.hf_config.max_position_embeddings)
    assert self.max_num_batched_tokens >= self.max_model_len
max_model_len is silently reduced if the value you supply exceeds the model’s own max_position_embeddings. No warning is emitted.
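
The silent cap can be demonstrated with a minimal standalone sketch of the same validation, with no Hugging Face dependency. Here hf_max_pos is a stand-in for hf_config.max_position_embeddings, and the class is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ConfigSketch:
    """Simplified stand-in for Config, reproducing only the validation logic."""
    max_model_len: int = 4096
    max_num_batched_tokens: int = 16384
    kvcache_block_size: int = 256
    tensor_parallel_size: int = 1
    hf_max_pos: int = 2048  # stand-in for hf_config.max_position_embeddings

    def __post_init__(self):
        assert self.kvcache_block_size % 256 == 0
        assert 1 <= self.tensor_parallel_size <= 8
        # The silent cap: no warning is emitted when the cap applies
        self.max_model_len = min(self.max_model_len, self.hf_max_pos)
        assert self.max_num_batched_tokens >= self.max_model_len

cfg = ConfigSketch(max_model_len=8192)  # model only supports 2048
print(cfg.max_model_len)  # 2048, reduced without any warning
```

If you need to detect the cap, compare the value you passed against the resulting max_model_len after construction.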
