SamplingParams

The SamplingParams class controls how text is generated from language models. It configures sampling strategies, stopping conditions, penalties, and output options.

Constructor

from tensorrt_llm.sampling_params import SamplingParams

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_tokens=256
)

Parameters

Token Generation

max_tokens
int
default:"32"
Maximum number of tokens to generate per output sequence.
min_tokens
int
default:"None"
Minimum number of tokens to generate. Values < 1 have no effect. Prevents early stopping.

Sampling Strategy

temperature
float
default:"None"
Temperature for sampling (≥ 0). Controls randomness:
  • 0.0: Greedy decoding (deterministic)
  • < 1.0: More focused, deterministic outputs
  • = 1.0: Standard sampling
  • > 1.0: More random, creative outputs
If None and neither top_p nor top_k are specified, defaults to greedy decoding.
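As an illustration of how temperature reshapes the token distribution, here is a minimal sketch in plain Python. The helper name `softmax_with_temperature` is hypothetical and this is not TensorRT-LLM's implementation; it only shows the usual technique of dividing logits by the temperature before the softmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # A temperature of 0 corresponds to greedy decoding (argmax), not a softmax.
        raise ValueError("temperature must be > 0 for sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # more mass on the top token
flat = softmax_with_temperature(logits, 2.0)   # mass spread more evenly
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution.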
top_p
float
default:"None"
Nucleus sampling threshold (0 to 1). Sampling is restricted to the smallest set of most probable tokens whose cumulative probability reaches top_p.
  • 1.0: Consider all tokens (standard sampling)
  • < 1.0: Consider only most probable tokens
If None and neither temperature nor top_k are specified, defaults to greedy decoding.
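Nucleus (top-p) filtering can be sketched in a few lines of plain Python. The function name `top_p_filter` is hypothetical and the code is an illustration of the general technique, not the library's kernel.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest set of most probable tokens
    whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)   # tokens 0, 1 and 2 cover 0.95 of the mass
everything = top_p_filter(probs, 1.0)
```

Sampling then happens only among the kept indices, after renormalizing their probabilities.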
top_k
int
default:"None"
Sample from the top K most likely tokens.
  • 0: Consider all tokens
  • 1: Greedy decoding
  • > 1: Limit sampling to top K tokens
If None and neither temperature nor top_p are specified, defaults to greedy decoding.
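The three top_k cases above can be sketched as follows. `top_k_filter` is a hypothetical helper written for illustration; it is not part of the tensorrt_llm API.

```python
def top_k_filter(logits, top_k):
    """Return the indices eligible for sampling under a top-K limit.

    top_k == 0 keeps every token; top_k == 1 leaves only the argmax (greedy).
    """
    if top_k == 0 or top_k >= len(logits):
        return list(range(len(logits)))
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(order[:top_k])

logits = [0.1, 2.0, 1.5, -1.0]
top_two = top_k_filter(logits, 2)
greedy = top_k_filter(logits, 1)
unrestricted = top_k_filter(logits, 0)
```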
top_p_min
float
default:"None"
Lower bound for top-P decay algorithm. Defaults to 1e-6.
top_p_decay
float
default:"None"
Decay factor for top-P algorithm. Defaults to 1.0.
top_p_reset_ids
int
default:"None"
Token ID where top-P decay resets. Defaults to 1.
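The interaction of top_p_decay, top_p_min, and top_p_reset_ids can be pictured with the sketch below. It assumes the commonly described update rule (multiply the running top-P by the decay after each token, clamp it at the lower bound, and restore the initial value when the reset token is generated); the helper `next_top_p` is hypothetical and the actual runtime behavior may differ.

```python
def next_top_p(current_top_p, initial_top_p, token_id,
               top_p_decay=1.0, top_p_min=1e-6, top_p_reset_ids=1):
    """Return the top-P value to use for the next decoding step (assumed rule)."""
    if token_id == top_p_reset_ids:
        return initial_top_p  # reset token seen: restore the initial threshold
    return max(current_top_p * top_p_decay, top_p_min)

top_p = 0.9
history = []
for token_id in [17, 42, 1, 8]:  # token 1 is the reset token in this example
    top_p = next_top_p(top_p, 0.9, token_id, top_p_decay=0.5, top_p_min=0.3)
    history.append(top_p)
```

With the default decay of 1.0 the threshold never changes, so the decay parameters only matter when top_p_decay is set below 1.0.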
min_p
float
default:"None"
Minimum token probability threshold, expressed relative to the most likely token: a token is kept only if its probability is at least min_p times the probability of the most likely token. Defaults to 0.0.
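A minimal sketch of min-p filtering, assuming the usual definition in which the cutoff is min_p multiplied by the highest token probability. `min_p_filter` is a hypothetical name used only for this example.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p * max(probs)."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

probs = [0.6, 0.25, 0.1, 0.05]
kept = min_p_filter(probs, 0.2)      # cutoff is 0.2 * 0.6 = 0.12
kept_all = min_p_filter(probs, 0.0)  # min_p of 0.0 disables the filter
```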
seed
int
default:"None"
Random seed for reproducible sampling. Defaults to 0.
use_beam_search
bool
default:"False"
Enable beam search instead of sampling. When True, best_of becomes the beam width.
n
int
default:"1"
Number of output sequences to return per prompt.
best_of
int
default:"None"
Number of sequences to generate for selection:
  • Sampling mode: Generate best_of sequences, return top n by cumulative log probability
  • Beam search mode: Use beam width of best_of, return top n
Must satisfy best_of >= n. Defaults to n.
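The sampling-mode selection described above (generate best_of candidates, return the top n by cumulative log probability) can be sketched like this. The function `select_top_n` and the candidate data are hypothetical; the sketch only illustrates the ranking step.

```python
def select_top_n(candidates, n):
    """Rank (text, token_logprobs) candidates by cumulative log probability
    and return the texts of the best n."""
    if n > len(candidates):
        raise ValueError("best_of (number of candidates) must be >= n")
    ranked = sorted(candidates, key=lambda c: sum(c[1]), reverse=True)
    return [text for text, _ in ranked[:n]]

# Three hypothetical candidates, i.e. best_of=3
candidates = [
    ("a", [-0.1, -0.2]),  # cumulative log probability -0.3
    ("b", [-1.0, -0.5]),  # cumulative log probability -1.5
    ("c", [-0.2, -0.3]),  # cumulative log probability -0.5
]
best_two = select_top_n(candidates, 2)  # n=2
```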
beam_search_diversity_rate
float
default:"None"
Diversity penalty for beam search. Values > 1.0 encourage diverse beams. Defaults to 1.0.
beam_width_array
List[int]
default:"None"
Array of beam widths for variable-beam-width search.
length_penalty
float
default:"None"
Exponential penalty for sequence length in beam search. Defaults to 0.0.
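To see why length_penalty matters, consider the sketch below. It assumes the common normalization in which a beam's cumulative log probability is divided by its length raised to length_penalty; the helper name and the example beams are hypothetical, and the runtime's exact formula may differ.

```python
def length_normalized_score(cumulative_logprob, length, length_penalty):
    """Divide the cumulative log probability by length ** length_penalty."""
    if length < 1:
        raise ValueError("length must be >= 1")
    return cumulative_logprob / (length ** length_penalty)

short_beam = (-4.0, 4)  # (cumulative log probability, number of tokens)
long_beam = (-6.0, 8)

# With a penalty of 0.0 the raw scores are compared and the short beam wins.
raw_short = length_normalized_score(*short_beam, 0.0)
raw_long = length_normalized_score(*long_beam, 0.0)

# With a penalty of 1.0 scores are averaged per token and the long beam wins.
norm_short = length_normalized_score(*short_beam, 1.0)
norm_long = length_normalized_score(*long_beam, 1.0)
```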
early_stopping
int
default:"None"
Stop beam search when beam_width complete sequences are generated. Defaults to 1.

Stopping Conditions

end_id
int
default:"None"
End-of-sequence token ID. Generation stops when this token is generated. Defaults to tokenizer’s EOS token.
pad_id
int
default:"None"
Padding token ID. Defaults to end_id.
stop
str | List[str]
default:"None"
Stop string(s). Generation stops as soon as any of these strings is generated.
SamplingParams(stop=["\n\n", "END", "###"])
stop_token_ids
List[int]
default:"None"
Stop token IDs. Generation stops as soon as any of these tokens is generated.
include_stop_str_in_output
bool
default:"False"
Include stop string in the output text. When False, stop strings are removed.
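The effect of stop together with include_stop_str_in_output on the returned text can be sketched in plain Python. `apply_stop_strings` is a hypothetical post-processing helper, not the library's code path; it cuts the text at the earliest stop string.

```python
def apply_stop_strings(text, stop, include_stop_str_in_output=False):
    """Truncate text at the earliest occurrence of any stop string."""
    stops = [stop] if isinstance(stop, str) else list(stop)
    best = None  # (start index, stop string) of the earliest match
    for s in stops:
        idx = text.find(s)
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, s)
    if best is None:
        return text  # no stop string present
    idx, s = best
    return text[: idx + len(s)] if include_stop_str_in_output else text[:idx]

raw = "Answer: 42\n\nEND of story"
trimmed = apply_stop_strings(raw, ["\n\n", "END"])
with_stop = apply_stop_strings(raw, ["\n\n", "END"], include_stop_str_in_output=True)
```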
ignore_eos
bool
default:"False"
Continue generation after EOS token is generated.

Bad Words / Tokens

bad
str | List[str]
default:"None"
Bad string(s) to avoid. If generation would produce one of these strings, decoding is redirected to alternative tokens so that the string does not appear in the output.
bad_token_ids
List[int]
default:"None"
Token IDs to avoid during generation.

Repetition Control

repetition_penalty
float
default:"None"
Penalty for repeating tokens:
  • < 1.0: Encourage repetition
  • = 1.0: No penalty (default)
  • > 1.0: Discourage repetition
presence_penalty
float
default:"None"
Penalty for tokens that have already appeared (independent of frequency):
  • < 0.0: Encourage repetition
  • = 0.0: No penalty (default)
  • > 0.0: Discourage repetition
frequency_penalty
float
default:"None"
Penalty based on token frequency in generated text:
  • < 0.0: Encourage repetition
  • = 0.0: No penalty (default)
  • > 0.0: Discourage repetition (stronger for more frequent tokens)
prompt_ignore_length
int
default:"None"
Number of prompt tokens to ignore for presence/frequency penalties. Defaults to 0.
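The three penalties above are often described with the following arithmetic: repetition_penalty rescales the logit of any token that has already appeared, presence_penalty subtracts a fixed amount once per seen token, and frequency_penalty subtracts an amount proportional to the token's count. The sketch below assumes those conventional formulas and an arbitrary order of application; `apply_penalties` is a hypothetical helper, not the runtime's implementation.

```python
from collections import Counter

def apply_penalties(logits, generated_token_ids, repetition_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a penalized copy of logits for tokens that were already generated."""
    counts = Counter(generated_token_ids)
    out = list(logits)
    for token_id, count in counts.items():
        # Multiplicative penalty: shrink positive logits, push negative ones lower.
        if out[token_id] > 0:
            out[token_id] /= repetition_penalty
        else:
            out[token_id] *= repetition_penalty
        # Additive penalties: a flat term per seen token plus a per-occurrence term.
        out[token_id] -= presence_penalty + frequency_penalty * count
    return out

logits = [2.0, -1.0, 1.0]
generated = [0, 0, 1]  # token 0 appeared twice, token 1 once, token 2 never
only_repetition = apply_penalties(logits, generated, repetition_penalty=2.0)
only_additive = apply_penalties(logits, generated,
                                presence_penalty=0.5, frequency_penalty=0.25)
```

Tokens that never appeared (token 2 here) keep their original logit in both cases.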
no_repeat_ngram_size
int
default:"None"
Prevent repetition of n-grams of this size. Defaults to a very large value (effectively no restriction).
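A common way to enforce an n-gram repetition limit is to ban, at each step, every token that would complete an n-gram already present in the sequence. The sketch below illustrates that idea; `banned_next_tokens` is a hypothetical helper and not the library's kernel.

```python
def banned_next_tokens(token_ids, ngram_size):
    """Return the set of tokens that would repeat an existing n-gram."""
    if ngram_size < 1 or len(token_ids) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(token_ids[len(token_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(token_ids) - ngram_size + 1):
        ngram = token_ids[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

tokens = [1, 2, 3, 4, 1, 2]
banned = banned_next_tokens(tokens, 3)  # (1, 2, 3) already exists, so 3 is banned
```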

Output Control

logprobs
int
default:"None"
Number of log probabilities to return per output token:
  • None: No log probabilities
  • 0: Only the sampled token’s log probability
  • K > 0: Top-K log probabilities plus sampled token (if not in top-K)
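The shape of the returned data for the three logprobs cases can be pictured with this sketch, which builds the per-token dictionary from a toy probability distribution. `top_k_logprobs` is a hypothetical helper; the real output uses the library's own Logprob objects.

```python
import math

def top_k_logprobs(probs, sampled_id, k):
    """Return {token_id: logprob} for the top-k tokens plus the sampled token."""
    logprobs = {i: math.log(p) for i, p in enumerate(probs) if p > 0}
    top = sorted(logprobs, key=logprobs.get, reverse=True)[:k]
    result = {i: logprobs[i] for i in top}
    # The sampled token is always reported, even when it is outside the top-k.
    result[sampled_id] = logprobs[sampled_id]
    return result

probs = [0.5, 0.3, 0.15, 0.05]
top_two_plus_sampled = top_k_logprobs(probs, sampled_id=3, k=2)
sampled_only = top_k_logprobs(probs, sampled_id=3, k=0)
```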
prompt_logprobs
int
default:"None"
Number of log probabilities to return per prompt token. Same format as logprobs.
logprobs_mode
LogprobMode
default:"LogprobMode.RAW"
Log probability calculation mode:
  • LogprobMode.RAW: Raw log probabilities from model output
  • LogprobMode.PROCESSED: After applying sampling parameters (temperature, top-k, top-p)
return_context_logits
bool
default:"False"
Return full logits tensor for prompt tokens.
return_generation_logits
bool
default:"False"
Return full logits tensor for generated tokens.
exclude_input_from_output
bool
default:"True"
Exclude input tokens from output token IDs.
return_encoder_output
bool
default:"False"
Return encoder hidden states for encoder-decoder models.
return_perf_metrics
bool
default:"False"
Include performance metrics in output (TTFT, latency, throughput, etc.).
additional_model_outputs
List[str]
default:"None"
Additional model outputs to gather (model-specific).

Tokenization

detokenize
bool
default:"True"
Convert output token IDs to text.
add_special_tokens
bool
default:"True"
Add special tokens (BOS, EOS) when encoding the prompt.
skip_special_tokens
bool
default:"True"
Skip special tokens when decoding output text.
spaces_between_special_tokens
bool
default:"True"
Add spaces between special tokens in decoded output.
truncate_prompt_tokens
int
default:"None"
Truncate prompt to last K tokens (left truncation). Must be ≥ 1.
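Left truncation keeps the end of the prompt and drops the beginning. A minimal sketch, using a hypothetical helper `truncate_prompt` on a plain list of token IDs:

```python
def truncate_prompt(token_ids, truncate_prompt_tokens):
    """Keep only the last K prompt tokens (left truncation)."""
    if truncate_prompt_tokens < 1:
        raise ValueError("truncate_prompt_tokens must be >= 1")
    return token_ids[-truncate_prompt_tokens:]

prompt = [101, 7, 8, 9, 10, 102]
tail = truncate_prompt(prompt, 3)         # keeps the three most recent tokens
unchanged = truncate_prompt(prompt, 100)  # K larger than the prompt is a no-op
```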

Advanced

embedding_bias
torch.Tensor
default:"None"
Embedding bias tensor of shape [vocab_size] with dtype float32.
logits_processor
LogitsProcessor | List[LogitsProcessor]
default:"None"
Custom logits processor callback(s) to modify logits before sampling. Can be a single processor or list.
apply_batched_logits_processor
bool
default:"False"
Apply batched logits processor. Processor must be provided when initializing LLM.
lookahead_config
LookaheadDecodingConfig
default:"None"
Configuration for lookahead decoding optimization.
guided_decoding
GuidedDecodingParams
default:"None"
Guided decoding parameters for structured output (JSON, regex, grammar).

Usage Examples

Basic Sampling

from tensorrt_llm.sampling_params import SamplingParams

# Balanced creativity
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

Greedy Decoding

# Deterministic output
params = SamplingParams(
    temperature=0.0,  # or top_k=1
    max_tokens=100
)

Beam Search

params = SamplingParams(
    use_beam_search=True,
    best_of=4,  # beam width
    n=1,        # return best sequence
    max_tokens=200
)

Stop Strings

params = SamplingParams(
    stop=["\n\n", "###", "END"],
    include_stop_str_in_output=False,
    max_tokens=500
)

Multiple Outputs

# Generate 5 candidates, return best 3
params = SamplingParams(
    n=3,
    best_of=5,
    temperature=0.8,
    max_tokens=150
)

Log Probabilities

params = SamplingParams(
    logprobs=5,              # Top-5 token log probs
    prompt_logprobs=3,       # Top-3 for prompt tokens
    max_tokens=100
)

# Assumes `llm` is an already-initialized tensorrt_llm.LLM instance
output = llm.generate("Hello", sampling_params=params)
for token_logprobs in output.outputs[0].logprobs:
    print(token_logprobs)  # Dict[token_id -> Logprob]

Repetition Control

params = SamplingParams(
    repetition_penalty=1.2,      # Discourage repetition
    frequency_penalty=0.5,       # Penalize frequent tokens
    presence_penalty=0.3,        # Penalize any repeated tokens
    no_repeat_ngram_size=3,      # No 3-gram repetition
    max_tokens=300
)

Structured Output with Guided Decoding

from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams

# JSON output
params = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        json={
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"}
            },
            "required": ["name", "age"]
        }
    ),
    max_tokens=200
)

# Regex pattern
params = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        regex=r"\d{3}-\d{3}-\d{4}"  # Phone number format
    ),
    max_tokens=50
)

Performance Monitoring

params = SamplingParams(
    return_perf_metrics=True,
    max_tokens=256
)

# Assumes `llm` is an already-initialized tensorrt_llm.LLM instance
output = llm.generate("Test prompt", sampling_params=params)
metrics = output.outputs[0].request_perf_metrics
print(f"Time to first token: {metrics.ttft}")
print(f"Throughput: {metrics.throughput}")

GuidedDecodingParams

Parameters for structured output generation:
from tensorrt_llm.sampling_params import GuidedDecodingParams

guided = GuidedDecodingParams(
    json_object=True  # Any valid JSON object
)
# OR
guided = GuidedDecodingParams(
    json={...}  # Specific JSON schema
)
# OR
guided = GuidedDecodingParams(
    regex="pattern"  # Regex pattern
)
# OR
guided = GuidedDecodingParams(
    grammar="EBNF grammar"  # EBNF grammar
)

LogprobMode

Enum for log probability modes:
from tensorrt_llm.sampling_params import LogprobMode

LogprobMode.RAW        # Raw log probabilities from model output
LogprobMode.PROCESSED  # After temperature/top-k/top-p
