Config is an internal dataclass that centralises every engine-level setting. You do not construct it directly — instead pass its fields as keyword arguments when creating an LLM instance.
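To make the ownership boundary concrete, here is a minimal sketch of the pattern: user-facing keyword arguments on LLM are validated against the dataclass fields and forwarded into an internally constructed Config. The field subset and the forwarding logic are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass, fields

@dataclass
class Config:
    """Illustrative subset of the engine-level settings."""
    model: str
    max_model_len: int = 4096
    enforce_eager: bool = False

class LLM:
    def __init__(self, model: str, **kwargs):
        # Keyword arguments are checked against Config's declared fields,
        # then forwarded into the internal dataclass.
        allowed = {f.name for f in fields(Config)}
        unknown = set(kwargs) - allowed
        assert not unknown, f"unexpected engine arguments: {unknown}"
        self.config = Config(model, **kwargs)

llm = LLM("/path/to/model", enforce_eager=True)
```

Because the user never touches Config directly, the engine is free to add internal-only fields (see below) without widening the public constructor.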
Fields

model (str)
Local filesystem path to the model directory. The directory must contain a Hugging Face-compatible model config (config.json) and weight files. Passing a model-hub name instead of a local path raises an AssertionError.

max_num_batched_tokens (int)
Maximum total number of tokens (summed across all sequences) processed in a single forward pass. Must be >= max_model_len.

max_num_seqs (int)
Maximum number of sequences that can be scheduled concurrently.

max_model_len (int)
Maximum supported sequence length in tokens (prompt + completion combined). During initialisation this value is silently capped to the model's own max_position_embeddings if that is smaller.

gpu_memory_utilization (float)
Fraction of total GPU memory to allocate for the KV cache, after accounting for model weights and activation memory. For example, 0.9 means 90% of GPU memory may be used.

tensor_parallel_size (int)
Number of GPUs to use for tensor-parallel inference. Must be between 1 and 8 inclusive. When set to a value greater than 1, additional worker processes are spawned via torch.multiprocessing using the spawn start method.

enforce_eager (bool)
When True, CUDA graph capture is skipped and every step is executed in PyTorch eager mode. Recommended for debugging, profiling, or environments where CUDA graphs are unsupported.

kvcache_block_size (int)
Number of tokens per KV cache block. Must be a multiple of 256. Larger values reduce block-management overhead but may waste memory for short sequences.

Internal-only fields
These fields are set automatically during initialisation and should not be provided by the user:

| Field | Type | Description |
|---|---|---|
| hf_config | AutoConfig \| None | Loaded Hugging Face model config. Set by __post_init__. |
| eos | int | EOS token ID. Populated from the tokenizer after the engine starts. |
| num_kvcache_blocks | int | Total number of KV cache blocks, computed after GPU memory profiling. |
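The engine's actual profiling logic for num_kvcache_blocks is not shown here, but the arithmetic it must perform can be sketched: the memory budget (free GPU memory scaled by gpu_memory_utilization) is divided by the byte size of one KV cache block. The layer, head, and dimension counts below are made-up example values, not taken from any real model.

```python
def kv_bytes_per_block(block_size, num_layers, num_kv_heads, head_dim,
                       dtype_bytes=2):
    # Each cached token stores one key and one value vector per layer.
    return block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_bytes

def num_kvcache_blocks(free_gpu_bytes, gpu_memory_utilization,
                       bytes_per_block):
    # How many whole blocks fit in the fraction of memory we may use.
    return int(free_gpu_bytes * gpu_memory_utilization) // bytes_per_block

# Example with invented model dimensions (fp16, so 2 bytes per element):
per_block = kv_bytes_per_block(block_size=256, num_layers=32,
                               num_kv_heads=8, head_dim=128)
blocks = num_kvcache_blocks(free_gpu_bytes=40 * 2**30,
                            gpu_memory_utilization=0.9,
                            bytes_per_block=per_block)
```

This also shows why kvcache_block_size matters for memory: doubling the block size doubles per-block cost, so short sequences strand more unused slots inside their last block.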
Validation rules
Config.__post_init__ enforces the following constraints at construction time:
- The model path must point to an existing local directory; passing a model-hub name raises an AssertionError.
- kvcache_block_size must be a multiple of 256.
- tensor_parallel_size must be between 1 and 8 inclusive.
- max_num_batched_tokens must be >= max_model_len.
- max_model_len is silently reduced if the value you supply exceeds the model's own max_position_embeddings. No warning is emitted.
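The rules above can be sketched as a __post_init__ on a reduced version of the dataclass. This is an illustration of the documented constraints, not the engine's real code; in particular, max_position_embeddings is hard-coded here, whereas the real class reads it from the loaded Hugging Face config.

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass
class Config:
    """Illustrative subset; the real class has more fields."""
    model: str
    max_num_batched_tokens: int = 16384
    max_model_len: int = 4096
    tensor_parallel_size: int = 1
    kvcache_block_size: int = 256
    max_position_embeddings: int = 2048  # really taken from hf_config

    def __post_init__(self):
        assert os.path.isdir(self.model), "model must be a local directory"
        assert self.kvcache_block_size % 256 == 0
        assert 1 <= self.tensor_parallel_size <= 8
        # Silent cap: no warning is emitted when the supplied value is too big.
        self.max_model_len = min(self.max_model_len,
                                 self.max_position_embeddings)
        assert self.max_num_batched_tokens >= self.max_model_len

with tempfile.TemporaryDirectory() as d:
    cfg = Config(model=d, max_model_len=8192)  # silently capped to 2048
```

Note the capping happens before the max_num_batched_tokens check, so a batch-token limit that looked too small for your requested max_model_len may still pass validation after the cap.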