Fit analysis determines whether a model can run on your hardware and how efficiently it will use available memory. llmfit evaluates fit across four levels with dynamic quantization selection.
Four Fit Levels
Perfect
Criteria:
- Running on GPU (not CPU-only)
- Recommended memory met
- Comfortable headroom for inference
Good
Criteria:
- Fits with ≥20% headroom
- Best achievable for MoE offload
- Best achievable for CPU+GPU offload
Marginal
Criteria:
- Minimum memory met but tight
- Best achievable for CPU-only
- Risk of OOM under load
Too Tight
Criteria:
- Insufficient memory in all pools
- Model will not run
Fit Scoring Logic
Fit level depends on both memory headroom and run mode.
Key insight: CPU-only and offload modes can never achieve Perfect. Perfect requires GPU acceleration with comfortable memory.
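The criteria above can be sketched as a small scoring function. This is a minimal illustration, not llmfit's actual code: the names `RunMode`, `FitLevel`, and `score_fit` are hypothetical, headroom is assumed to be measured as the free fraction of the available pool, and "comfortable headroom" is assumed to mean that the recommended memory fits.

```python
from enum import Enum
from typing import Optional


class RunMode(Enum):
    GPU = "gpu"
    MOE_OFFLOAD = "moe_offload"
    CPU_GPU_OFFLOAD = "cpu_gpu_offload"
    CPU_ONLY = "cpu_only"


class FitLevel(Enum):
    PERFECT = "perfect"
    GOOD = "good"
    MARGINAL = "marginal"
    TOO_TIGHT = "too_tight"


def score_fit(mode: RunMode, required_gb: float, available_gb: float,
              recommended_gb: Optional[float] = None) -> FitLevel:
    """Map run mode and memory headroom to a fit level (hypothetical sketch)."""
    # Not enough memory in the pool: the model will not run.
    if available_gb <= 0 or required_gb > available_gb:
        return FitLevel.TOO_TIGHT
    # Assumption: headroom is the fraction of the pool left free.
    headroom = (available_gb - required_gb) / available_gb
    # CPU-only can never do better than Marginal.
    if mode is RunMode.CPU_ONLY:
        return FitLevel.MARGINAL
    # Perfect requires the GPU path with the recommended memory met.
    if (mode is RunMode.GPU and recommended_gb is not None
            and recommended_gb <= available_gb):
        return FitLevel.PERFECT
    # Everything else, including both offload modes, tops out at Good.
    return FitLevel.GOOD if headroom >= 0.20 else FitLevel.MARGINAL


print(score_fit(RunMode.GPU, 10.0, 24.0, recommended_gb=14.0).value)
print(score_fit(RunMode.CPU_GPU_OFFLOAD, 10.0, 22.0).value)
```

Keeping the run-mode cap separate from the headroom check makes the "offload can never be Perfect" rule explicit rather than a side effect of thresholds.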
Run Mode Determination
llmfit tries execution paths in order of preference:
Unified Memory Detection
If system.unified_memory == true (Apple Silicon, NVIDIA Grace):
- GPU and CPU share the same memory pool
- No separate CPU+GPU offload path
- Use GPU path with full memory budget
Try GPU Path (Discrete VRAM)
Attempt to fit the model in VRAM with dynamic quantization. If any quantization level fits, use the GPU path.
Try MoE Offload (MoE Models Only)
For Mixture-of-Experts models, try expert offloading.
Requirements:
- Active experts fit in VRAM
- Inactive experts fit in system RAM
Try CPU+GPU Offload
If the model doesn't fit in VRAM, spill the remainder to system RAM.
Penalty: 0.5× GPU speed (RAM bandwidth bottleneck)
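The ordering of these paths can be sketched as a chain of checks. The sketch below is illustrative only: `System`, `ModelEstimate`, and `choose_run_mode` are hypothetical names, the shared pool on unified-memory machines is assumed to be reported in `ram_gb`, CPU+GPU offload is assumed to fit when the model fits in VRAM plus RAM combined, and the final CPU-only fallback is an assumption based on the fit levels described earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class System:
    vram_gb: float
    ram_gb: float
    unified_memory: bool = False


@dataclass
class ModelEstimate:
    total_gb: float            # smallest footprint after quantization selection
    is_moe: bool = False
    active_gb: float = 0.0     # active experts (MoE only)
    inactive_gb: float = 0.0   # inactive experts (MoE only)


def choose_run_mode(system: System, model: ModelEstimate) -> Optional[str]:
    """Return the first run mode that fits, or None if nothing fits."""
    # Unified memory: one shared pool, so only the GPU path applies.
    if system.unified_memory:
        return "gpu" if model.total_gb <= system.ram_gb else None
    # 1. Whole model in discrete VRAM.
    if model.total_gb <= system.vram_gb:
        return "gpu"
    # 2. MoE offload: active experts in VRAM, inactive experts in RAM.
    if (model.is_moe and model.active_gb <= system.vram_gb
            and model.inactive_gb <= system.ram_gb):
        return "moe_offload"
    # 3. CPU+GPU offload: spill what does not fit in VRAM to RAM (assumed rule).
    if system.vram_gb > 0 and model.total_gb <= system.vram_gb + system.ram_gb:
        return "cpu_gpu_offload"
    # 4. CPU-only fallback (assumption; not listed in the steps above).
    if model.total_gb <= system.ram_gb:
        return "cpu_only"
    return None


print(choose_run_mode(System(vram_gb=6, ram_gb=16), ModelEstimate(total_gb=9.0)))
```

Each check returns as soon as it succeeds, so a faster path always wins over a slower one when both would fit.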
Dynamic Quantization Selection
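Instead of using the model's default quantization, llmfit walks a hierarchy to find the best quality that fits.
GGUF Quantization Hierarchy (llama.cpp)
The order tried in the example below is Q8_0 → Q6_K → Q5_K_M → Q4_K_M → Q3_K_M → Q2_K, highest quality first.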
MLX Quantization Hierarchy (Apple Silicon)
Selection Algorithm
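A minimal sketch of the walk, under stated assumptions: `select_quant`, `lookup`, and `TABLE` are hypothetical names, the memory figures are copied from the Llama-3.1-70B example that follows, the unlabelled rows of that example are assumed to be at the model's full 131072-token context, and the 48 GB budget is made up for illustration. The loop tries every quantization at the first context length before moving to the next, shorter one.

```python
from typing import Callable, Optional, Sequence, Tuple

# Order taken from the example: highest quality first.
GGUF_HIERARCHY = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]


def select_quant(estimate_gb: Callable[[str, int], float],
                 budget_gb: float,
                 contexts: Sequence[int],
                 hierarchy: Sequence[str] = GGUF_HIERARCHY
                 ) -> Optional[Tuple[str, int, float]]:
    """Return (quant, context, GB) for the first combination that fits."""
    for ctx in contexts:              # preferred context first, then shorter
        for quant in hierarchy:       # best quality first
            need = estimate_gb(quant, ctx)
            if need <= budget_gb:
                return quant, ctx, need
    return None                       # nothing fits at any context length


# Footprints in GB from the Llama-3.1-70B example.
TABLE = {
    131072: {"Q8_0": 75.2, "Q6_K": 58.0, "Q5_K_M": 49.3,
             "Q4_K_M": 42.2, "Q3_K_M": 34.7, "Q2_K": 26.7},
    65536: {"Q8_0": 73.1, "Q6_K": 55.9, "Q5_K_M": 47.2,
            "Q4_K_M": 40.1, "Q3_K_M": 32.6, "Q2_K": 24.6},
}


def lookup(quant: str, ctx: int) -> float:
    return TABLE[ctx][quant]


# Hypothetical 48 GB budget: Q5_K_M (49.3 GB) is too large, Q4_K_M fits.
print(select_quant(lookup, 48.0, [131072, 65536]))  # ('Q4_K_M', 131072, 42.2)
```

Passing the estimator as a callable keeps the walk independent of how memory is computed, so the same loop works with a lookup table or a formula.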
Example: Llama-3.1-70B on RTX 4090 (24 GB VRAM)
- Q8_0: 75.2 GB — doesn’t fit
- Q6_K: 58.0 GB — doesn’t fit
- Q5_K_M: 49.3 GB — doesn’t fit
- Q4_K_M: 42.2 GB — doesn’t fit
- Q3_K_M: 34.7 GB — doesn’t fit
- Q2_K: 26.7 GB — doesn’t fit
- Q8_0 @ 65K ctx: 73.1 GB — doesn’t fit
- Q6_K @ 65K ctx: 55.9 GB — doesn’t fit
- Q5_K_M @ 65K ctx: 47.2 GB — doesn’t fit
- Q4_K_M @ 65K ctx: 40.1 GB — doesn’t fit
- Q3_K_M @ 65K ctx: 32.6 GB — doesn’t fit
- Q2_K @ 65K ctx: 24.6 GB — fits! ✓
Result: Q2_K at 65K context, 24.6 GB VRAM
Memory Estimation Formula
llmfit computes memory requirements dynamically from three components. In the formulas below, params is in billions and the result is in GB.
Model Weights
Formula: params × bytes_per_param(quant)
Example: 7B @ Q4_K_M = 7 × 0.58 = 4.06 GB
This is the bulk of memory usage: the model parameters themselves.
KV Cache
Formula: 0.000008 × params × context_length
Example: 7B @ 8K context = 0.000008 × 7 × 8192 = 0.46 GB
Stores key/value tensors for the attention mechanism. Grows linearly with context length.
Runtime Overhead
Fixed: 0.5 GB
Covers CUDA/Metal context, buffer allocations, and framework overhead.
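The three components add up to a single estimate. The sketch below simply transcribes the formulas above; the function and constant names are hypothetical, and the bytes-per-parameter value is passed in by the caller because only the Q4_K_M figure (0.58) is given here.

```python
OVERHEAD_GB = 0.5                 # fixed runtime overhead
KV_GB_PER_BPARAM_TOKEN = 0.000008 # KV cache coefficient from the formula above


def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Model weights: params (billions) x bytes per parameter."""
    return params_b * bytes_per_param


def kv_cache_gb(params_b: float, context_length: int) -> float:
    """KV cache: grows linearly with context length."""
    return KV_GB_PER_BPARAM_TOKEN * params_b * context_length


def total_memory_gb(params_b: float, bytes_per_param: float,
                    context_length: int) -> float:
    """Weights + KV cache + fixed overhead, in GB."""
    return (weights_gb(params_b, bytes_per_param)
            + kv_cache_gb(params_b, context_length)
            + OVERHEAD_GB)


# 7B @ Q4_K_M with an 8K context: 4.06 + 0.46 + 0.5
print(round(total_memory_gb(7, 0.58, 8192), 2))  # 5.02
```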
Memory Utilization Targets
llmfit aims for specific utilization ranges:
- GPU Inference
- CPU+GPU Offload
- CPU-Only
GPU Inference target: 50-80% of VRAM
Sweet spot: Efficient use without risking OOM.
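As a small illustration of the 50-80% VRAM target, the check below computes utilization and tests it against the range. The function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def vram_utilization(required_gb: float, vram_gb: float) -> float:
    """Fraction of VRAM the model would occupy."""
    return required_gb / vram_gb


def in_gpu_target(required_gb: float, vram_gb: float,
                  low: float = 0.50, high: float = 0.80) -> bool:
    """True when utilization falls inside the target range (bounds inclusive)."""
    return low <= vram_utilization(required_gb, vram_gb) <= high


print(in_gpu_target(14.0, 24.0))  # about 58% of VRAM
print(in_gpu_target(22.0, 24.0))  # about 92% of VRAM, above the target
```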
MoE Expert Offloading
Mixture-of-Experts models can split across VRAM and RAM.
Memory Split Calculation
Example: Mixtral 8x7B @ Q4_K_M
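One way to approximate the split is by parameter count: weights used on every token stay in VRAM and the rest go to system RAM. This is a sketch under assumptions rather than llmfit's exact calculation: `moe_split_gb` is a hypothetical name, the split covers weights only (no KV cache or runtime overhead), and the Mixtral figures of 46.7B total and 12.9B active parameters are the publicly stated ones.

```python
from typing import Tuple


def moe_split_gb(total_params_b: float, active_params_b: float,
                 bytes_per_param: float) -> Tuple[float, float]:
    """Return (vram_gb, ram_gb) for active and inactive weights."""
    active_gb = active_params_b * bytes_per_param
    inactive_gb = (total_params_b - active_params_b) * bytes_per_param
    return active_gb, inactive_gb


# Mixtral 8x7B @ Q4_K_M (0.58 bytes per parameter)
vram_gb, ram_gb = moe_split_gb(46.7, 12.9, 0.58)
print(f"active in VRAM: {vram_gb:.1f} GB, inactive in RAM: {ram_gb:.1f} GB")
```

Under these assumptions, roughly 7.5 GB of weights would need to sit in VRAM and about 19.6 GB in RAM.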
Fit Analysis Examples
Example 1: Perfect Fit
Hardware: RTX 4090 (24 GB VRAM), 64 GB RAM
Model: Qwen2.5-Coder-14B-Instruct
Result: Excellent fit, with high-quality quantization and plenty of headroom.
Example 2: MoE Offload
Hardware: RTX 3070 (8 GB VRAM), 32 GB RAM
Model: Mixtral 8x7B-Instruct
Result: Fits via expert offloading. It wouldn't run otherwise.
Example 3: CPU+GPU Offload
Hardware: GTX 1660 Ti (6 GB VRAM), 16 GB RAM
Model: Llama-3.1-8B-Instruct
Result: Spills to RAM. Significant performance hit, but runnable.
Example 4: Too Tight
Hardware: GTX 1650 (4 GB VRAM), 8 GB RAM
Model: Llama-3.1-70B-Instruct
Result: Unrunnable. Model ranks last in fit results.
Context-Length Capping
Use --max-context to reduce memory requirements.
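The saving comes from the KV cache term, which scales linearly with context length. The sketch below applies the KV cache formula from the memory estimation section with an optional cap; `capped_kv_cache_gb` is a hypothetical helper, and the 7B model with a 32768-token native context is an illustrative assumption.

```python
from typing import Optional


def capped_kv_cache_gb(params_b: float, model_context: int,
                       max_context: Optional[int] = None) -> float:
    """KV cache in GB, using the smaller of the model context and the cap."""
    ctx = model_context if max_context is None else min(model_context, max_context)
    return 0.000008 * params_b * ctx


full = capped_kv_cache_gb(7, 32768)                      # no cap
capped = capped_kv_cache_gb(7, 32768, max_context=8192)  # capped at 8K
print(f"uncapped: {full:.2f} GB, capped: {capped:.2f} GB")
```

A cap larger than the model's own context has no effect, since the smaller of the two values is used.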
Run Mode Selection Summary
llmfit always tries the fastest path first (GPU) and falls back gracefully to slower modes when memory is insufficient.
