SamplingParams controls how tokens are sampled from the model’s output distribution. It is a Python dataclass and is imported alongside LLM from the top-level package.
Fields
temperature
Scales the logits before sampling. Higher values produce more random output; lower values concentrate probability on the top tokens. Constraint: must be strictly greater than 1e-10. Passing 0.0 or any value <= 1e-10 raises an AssertionError with the message "greedy sampling is not permitted".

max_tokens
Maximum number of tokens to generate per request. Generation stops when this limit is reached or when the EOS token is emitted, whichever comes first.

ignore_eos
When True, the EOS token does not stop generation. The sequence continues until max_tokens is exhausted. Useful for throughput benchmarking where controlled-length outputs are needed.

Validation
The dataclass runs __post_init__ on construction, which asserts that temperature is strictly greater than 1e-10 and raises an AssertionError otherwise.
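As a minimal sketch of the shape described above (the field names, defaults, and ordering here are assumptions for illustration, not copied from the library source), the dataclass and its validation might look like:

```python
from dataclasses import dataclass

@dataclass
class SamplingParams:
    # Field names and defaults below are illustrative assumptions.
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False

    def __post_init__(self):
        # Greedy sampling is rejected: temperature must exceed 1e-10.
        assert self.temperature > 1e-10, "greedy sampling is not permitted"
```

Constructing SamplingParams(temperature=0.0) under this sketch fails immediately with the AssertionError quoted above, rather than silently falling back to greedy decoding.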
Examples
Per-request sampling params
You can pass a list of SamplingParams to LLM.generate() to apply different settings to each prompt:
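A sketch of the per-request pattern follows. SamplingParams is stubbed here so the snippet runs standalone; with the real library you would import LLM and SamplingParams from the top-level package, and the commented generate() call is the assumed entry point.

```python
from dataclasses import dataclass

# Stub standing in for the real SamplingParams so this sketch is self-contained.
@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False

prompts = ["Explain KV caching.", "Write a haiku about GPUs."]

# One SamplingParams per prompt: a low-temperature factual answer
# and a higher-temperature creative one.
per_request = [
    SamplingParams(temperature=0.2, max_tokens=256),
    SamplingParams(temperature=1.0, max_tokens=64),
]

# The params list is assumed to pair with prompts one-to-one.
assert len(per_request) == len(prompts)

# With the real library:
# outputs = llm.generate(prompts, per_request)
```

Because the settings pair positionally with the prompts, mixing deterministic-leaning and exploratory requests in one batch requires no extra plumbing.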