llama-cli

Interactive command-line inference. Accepts the same model and context flags as llama-server. Pass -i for interactive (chat) mode. Use --completion-bash to generate shell tab-completion.
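As a minimal sketch, an interactive session and the tab-completion setup described above might look like this (the model path and completion file location are placeholders):

```shell
# Start an interactive chat session (model path is hypothetical).
llama-cli -m ./models/model.gguf -i

# Generate bash tab-completion and load it into the current shell.
llama-cli --completion-bash > ~/.llama-completion.bash
source ~/.llama-completion.bash
```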
llama-quantize
Quantize a full-precision GGUF to a smaller quantization type.
| Flag | Description |
|---|---|
| --imatrix FILE | Apply an importance matrix to improve quantization quality. Recommended for types below Q6_0. |
| --custom-q "regex=type,..." | Mix quantization types per tensor using regular expressions. |
| --dry-run | Print tensor types and output sizes without running quantization. Use to preview --custom-q mixes before committing. |
| --partial-requant | Quantize only missing split files in a destination directory. |
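For example, a typical invocation quantizing an f16 model to Q4_K_M with an importance matrix, and a dry-run preview of a custom mix, might look like this (file names and the regex=type pairs are illustrative, not prescribed):

```shell
# Quantize with an importance matrix (file names are hypothetical).
llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Preview a per-tensor mix without writing anything (patterns are examples only).
llama-quantize --dry-run --custom-q "attn=q8_0,ffn=q4_k" model-f16.gguf model-mix.gguf Q4_K_M
```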
llama-imatrix

Generate an importance matrix from a calibration dataset. The output .imatrix file is passed to llama-quantize.

| Flag | Description |
|---|---|
| --layer-similarity / -lsim | Collect cosine-similarity statistics on layer activations. |
| --hide-imatrix | Anonymize the output by storing top_secret in the file name and zeroing calibration metadata. |
Use convert_imatrix_gguf_to_dat.py to convert GGUF imatrix files to the legacy .dat format if needed.
llama-bench
Standard benchmark utility for measuring prompt processing (PP) and token generation (TG) throughput.
| Flag | Description |
|---|---|
| -tgb, --threads-gen-batch N,M | Test different thread counts for generation vs. batch processing in a single run. |
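A basic run measuring PP and TG throughput might look like this (model path is a placeholder; -p and -n set the prompt and generation token counts):

```shell
# Benchmark 512-token prompt processing and 128-token generation.
llama-bench -m model.gguf -p 512 -n 128
```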
llama-sweep-bench

Extended benchmark that runs a series of PP batches followed by TG without clearing the KV cache. The N_KV column in the output shows the KV cache occupancy at each measurement point. Accepts the same model/context flags as llama-server.

| Flag | Description |
|---|---|
| -nrep N, --n-repetitions N | Number of repetitions at zero context before the sweep begins. |
| -n N | Number of TG tokens per step. Defaults to ubatch / 4 if not set. |
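Since the tool accepts the same model/context flags as llama-server, a sweep over a larger context can be sketched as follows (model path and context size are illustrative):

```shell
# Sweep PP/TG measurements up to an 8192-token context without clearing the KV cache.
llama-sweep-bench -m model.gguf -c 8192
```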
convert_hf_to_gguf.py

Convert a HuggingFace model checkpoint to GGUF format. Supports legacy quantization conversion schemes. Run with --help for all options. For split output models, combine with llama-gguf-split to produce multi-part GGUFs.
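A typical conversion of a local HuggingFace checkpoint to an f16 GGUF, followed by splitting into shards, might look like this (directory names, output paths, and the shard size are placeholders):

```shell
# Convert a HuggingFace checkpoint directory to a single f16 GGUF (paths are hypothetical).
python convert_hf_to_gguf.py ./hf-model --outfile model-f16.gguf --outtype f16

# Optionally split the result into multi-part GGUFs of at most ~2 GB each.
llama-gguf-split --split --split-max-size 2G model-f16.gguf model-f16-split
```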