Fit Analysis API

The fit module analyzes how well models run on specific hardware, computing fit levels, run modes, and multi-dimensional scores.

Core Types

ModelFit

Complete analysis of a model’s fit on specific hardware:
pub struct ModelFit {
    pub model: LlmModel,
    pub fit_level: FitLevel,
    pub run_mode: RunMode,
    pub memory_required_gb: f64,
    pub memory_available_gb: f64,
    pub utilization_pct: f64,
    pub notes: Vec<String>,
    pub moe_offloaded_gb: Option<f64>,
    pub score: f64,
    pub score_components: ScoreComponents,
    pub estimated_tps: f64,
    pub best_quant: String,
    pub use_case: UseCase,
    pub runtime: InferenceRuntime,
    pub installed: bool,
}
Key Fields:
  • model - The analyzed model
  • fit_level - Memory fit quality (Perfect, Good, Marginal, TooTight)
  • run_mode - Execution path (Gpu, CpuOffload, CpuOnly, MoeOffload)
  • memory_required_gb - Memory needed for this run mode
  • memory_available_gb - Memory pool capacity
  • utilization_pct - Memory utilization percentage
  • notes - Human-readable analysis notes
  • score - Composite score (0-100)
  • score_components - Individual dimension scores
  • estimated_tps - Estimated tokens per second
  • best_quant - Optimal quantization for this hardware
  • runtime - Inference runtime (MLX or llama.cpp)
  • installed - Whether model is locally installed

FitLevel

Memory fit quality:
pub enum FitLevel {
    Perfect,  // Recommended memory met on GPU
    Good,     // Fits with headroom
    Marginal, // Minimum memory met but tight
    TooTight, // Does not fit in available memory
}
Fit Criteria:
  • Perfect: GPU path with recommended memory available
  • Good: Adequate headroom (1.2x minimum or more)
  • Marginal: Meets minimum but tight
  • TooTight: Insufficient memory
Note: CPU-only paths cap at Marginal since they’re always a compromise.
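
The criteria above can be read as a small decision function. The following is a minimal sketch, not the crate's actual implementation: the enums are local stand-ins, the parameters `min_gb` and `recommended_gb` are hypothetical inputs, and the order in which the rules are checked is an assumption.

```rust
/// Local stand-ins for the crate's enums so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FitLevel {
    Perfect,
    Good,
    Marginal,
    TooTight,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RunMode {
    Gpu,
    MoeOffload,
    CpuOffload,
    CpuOnly,
}

/// Classify a fit from memory figures in GB.
/// `min_gb` and `recommended_gb` are hypothetical inputs; the real crate
/// may derive and combine them differently.
pub fn classify_fit(
    run_mode: RunMode,
    min_gb: f64,
    recommended_gb: f64,
    available_gb: f64,
) -> FitLevel {
    // Below the minimum, the model does not fit at all.
    if available_gb < min_gb {
        return FitLevel::TooTight;
    }
    // CPU-only paths cap at Marginal.
    if run_mode == RunMode::CpuOnly {
        return FitLevel::Marginal;
    }
    // Perfect requires the GPU path and the recommended memory.
    if run_mode == RunMode::Gpu && available_gb >= recommended_gb {
        return FitLevel::Perfect;
    }
    // Good needs at least 1.2x the minimum; otherwise the fit is tight.
    if available_gb >= 1.2 * min_gb {
        FitLevel::Good
    } else {
        FitLevel::Marginal
    }
}
```

Under these assumptions, a model with an 8 GB minimum and a 12 GB recommendation is Perfect on a 16 GB GPU, Good with 10 GB, and Marginal with 8.5 GB.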

RunMode

Execution path:
pub enum RunMode {
    Gpu,        // Fully loaded into VRAM (fast)
    MoeOffload, // MoE: active experts in VRAM, inactive in RAM
    CpuOffload, // Partial GPU, spills to RAM (mixed)
    CpuOnly,    // Entirely in system RAM (slow)
}
Performance Order:
  1. Gpu: Full GPU inference (fastest)
  2. MoeOffload: Expert switching with some latency
  3. CpuOffload: Significant slowdown from memory transfers
  4. CpuOnly: Slowest, but works without GPU

InferenceRuntime

Software framework:
pub enum InferenceRuntime {
    LlamaCpp, // llama.cpp / Ollama
    Mlx,      // Apple MLX framework
}

ScoreComponents

Multi-dimensional scoring:
pub struct ScoreComponents {
    pub quality: f64,  // 0-100
    pub speed: f64,    // 0-100
    pub fit: f64,      // 0-100
    pub context: f64,  // 0-100
}
Dimensions:
  • Quality: Model capability (params + family + quant + task alignment)
  • Speed: Estimated tok/s vs. target for use case
  • Fit: Memory efficiency (50-80% utilization = optimal)
  • Context: Context window vs. use case needs

SortColumn

Sort options for ranked models:
pub enum SortColumn {
    Score,       // Composite score (default)
    Tps,         // Estimated tokens/second
    Params,      // Parameter count
    MemPct,      // Memory utilization
    Ctx,         // Context length
    ReleaseDate, // Release date
    UseCase,     // Use case category
}

Functions

ModelFit::analyze()

Analyzes model fit on hardware:
pub fn analyze(model: &LlmModel, system: &SystemSpecs) -> Self
Parameters:
  • model - Model to analyze
  • system - System hardware specs
Returns: Complete fit analysis
Example:
use llmfit_core::{ModelFit, ModelDatabase, SystemSpecs};

let specs = SystemSpecs::detect();
let db = ModelDatabase::new();

let model = &db.get_all_models()[0];
let fit = ModelFit::analyze(model, &specs);

println!("{}: {}", model.name, fit.fit_text());
println!("  Run mode: {}", fit.run_mode_text());
println!("  Memory: {:.2}/{:.2} GB ({:.1}%)",
    fit.memory_required_gb,
    fit.memory_available_gb,
    fit.utilization_pct
);
println!("  Speed: {:.1} tok/s", fit.estimated_tps);
println!("  Score: {:.1}/100", fit.score);

for note in &fit.notes {
    println!("  - {}", note);
}

ModelFit::analyze_with_context_limit()

Analyzes with custom context limit:
pub fn analyze_with_context_limit(
    model: &LlmModel,
    system: &SystemSpecs,
    context_limit: Option<u32>,
) -> Self
Parameters:
  • model - Model to analyze
  • system - System hardware specs
  • context_limit - Maximum context length to assume
Example:
// Analyze with 8K context instead of model's full 128K
let fit = ModelFit::analyze_with_context_limit(
    model,
    &specs,
    Some(8192)
);

// This reduces memory estimates for long-context models

Helper Methods

impl ModelFit {
    pub fn fit_emoji(&self) -> &str;      // "🟢" / "🟡" / "🟠" / "🔴"
    pub fn fit_text(&self) -> &str;       // "Perfect" / "Good" / etc.
    pub fn runtime_text(&self) -> &str;   // "llama.cpp" / "MLX"
    pub fn run_mode_text(&self) -> &str;  // "GPU" / "CPU" / etc.
}

Ranking Functions

rank_models_by_fit()

Ranks models by composite score:
pub fn rank_models_by_fit(models: Vec<ModelFit>) -> Vec<ModelFit>
Returns: Models sorted by score (best first), with TooTight models at the end
Example:
use llmfit_core::{rank_models_by_fit, FitLevel, ModelDatabase, ModelFit, SystemSpecs};

let specs = SystemSpecs::detect();
let db = ModelDatabase::new();

let fits: Vec<ModelFit> = db.get_all_models()
    .iter()
    .map(|m| ModelFit::analyze(m, &specs))
    .collect();

let ranked = rank_models_by_fit(fits);

println!("Top 5 models:");
for (i, fit) in ranked.iter()
    .filter(|f| f.fit_level != FitLevel::TooTight)
    .take(5)
    .enumerate()
{
    println!("{}. {} - Score {:.1}, {:.1} tok/s",
        i + 1,
        fit.model.name,
        fit.score,
        fit.estimated_tps
    );
}

rank_models_by_fit_opts()

Ranks with installed-first option:
pub fn rank_models_by_fit_opts(
    models: Vec<ModelFit>,
    installed_first: bool,
) -> Vec<ModelFit>
Example:
// Rank with installed models first
let ranked = rank_models_by_fit_opts(fits, true);
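
The ordering these functions describe can be sketched with a chained comparator. This is an illustration only: `Fit` is a hypothetical, trimmed-down stand-in for `ModelFit`, `rank` is not a crate function, and placing TooTight models last even when they are installed is an assumption.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FitLevel {
    Perfect,
    Good,
    Marginal,
    TooTight,
}

/// Hypothetical, trimmed-down stand-in for `ModelFit`.
#[derive(Debug, Clone)]
pub struct Fit {
    pub name: String,
    pub fit_level: FitLevel,
    pub score: f64,
    pub installed: bool,
}

/// Sort best-first: runnable models before TooTight ones, optionally
/// installed models first, then by descending score.
pub fn rank(mut fits: Vec<Fit>, installed_first: bool) -> Vec<Fit> {
    fits.sort_by(|a, b| {
        let a_tight = a.fit_level == FitLevel::TooTight;
        let b_tight = b.fit_level == FitLevel::TooTight;
        // false < true, so models that fit sort ahead of TooTight ones.
        a_tight
            .cmp(&b_tight)
            .then_with(|| {
                if installed_first {
                    // Reversed operands so installed (true) comes first.
                    b.installed.cmp(&a.installed)
                } else {
                    Ordering::Equal
                }
            })
            // Higher score first; incomparable scores (NaN) count as equal.
            .then_with(|| b.score.partial_cmp(&a.score).unwrap_or(Ordering::Equal))
    });
    fits
}
```

`sort_by` is stable, so models that compare equal keep their input order.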

rank_models_by_fit_opts_col()

Ranks with custom sort column:
pub fn rank_models_by_fit_opts_col(
    models: Vec<ModelFit>,
    installed_first: bool,
    sort_column: SortColumn,
) -> Vec<ModelFit>
Example:
use llmfit_core::SortColumn;

// Sort by speed (tok/s)
let by_speed = rank_models_by_fit_opts_col(
    fits.clone(),
    false,
    SortColumn::Tps
);

// Sort by parameter count
let by_size = rank_models_by_fit_opts_col(
    fits.clone(),
    false,
    SortColumn::Params
);

// Sort by release date (newest first)
let by_date = rank_models_by_fit_opts_col(
    fits.clone(),
    false,
    SortColumn::ReleaseDate
);

Scoring Algorithm

The composite score combines four dimension scores (quality, speed, fit, and context), each on a 0-100 scale, using use-case-specific weights.

Quality Score (0-100)

Based on:
  • Parameter count tier (1B = 45, 7B = 75, 70B+ = 95)
  • Family reputation (Qwen +2, DeepSeek +3, Llama +2)
  • Quantization penalty (Q8 = 0, Q4 = -5, Q2 = -12)
  • Task alignment bonus (coding models on coding tasks +6)

Speed Score (0-100)

Normalized against use-case target:
  • General/Coding/Chat: 40 tok/s target
  • Reasoning: 25 tok/s target
  • Embedding: 200 tok/s target
Score = (estimated_tps / target) * 100, capped at 100.
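
As a minimal sketch of the formula above: the `UseCase` enum here is a local stand-in limited to the use cases with a listed target (no target is given for Multimodal, so it is left out), and the lower clamp at 0 is an added safeguard rather than documented behavior.

```rust
/// Local stand-in covering only the use cases with a documented target.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum UseCase {
    General,
    Coding,
    Chat,
    Reasoning,
    Embedding,
}

/// Target tokens per second for each use case, as listed above.
pub fn target_tps(use_case: UseCase) -> f64 {
    match use_case {
        UseCase::General | UseCase::Coding | UseCase::Chat => 40.0,
        UseCase::Reasoning => 25.0,
        UseCase::Embedding => 200.0,
    }
}

/// Score = (estimated_tps / target) * 100, capped at 100.
pub fn speed_score(estimated_tps: f64, use_case: UseCase) -> f64 {
    let raw = estimated_tps / target_tps(use_case) * 100.0;
    // The cap at 100 is documented; the floor at 0 is an assumption.
    raw.clamp(0.0, 100.0)
}
```

For instance, 20 tok/s scores 50 for Chat, while 50 tok/s scores only 25 for Embedding because its target is 200 tok/s.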

Fit Score (0-100)

Memory efficiency:
  • 50-80% utilization: 100 points (sweet spot)
  • 30-50% utilization: 60-100 points (scaled)
  • 80-90% utilization: 70 points (getting tight)
  • 90-100% utilization: 50 points (very tight)
  • Over 100% utilization: 0 points (doesn’t fit)
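
The bands above can be sketched as a piecewise function. Several details are assumptions, not documented behavior: the scaling between 30% and 50% is taken to be linear, band boundaries belong to the more favorable band, anything above 100% scores 0, and utilization below 30% (which the list does not cover) is held at 60 points.

```rust
/// Map memory utilization (in percent) to a 0-100 fit score.
pub fn fit_score(utilization_pct: f64) -> f64 {
    let u = utilization_pct;
    if u > 100.0 {
        // Needs more memory than is available.
        0.0
    } else if u > 90.0 {
        // Very tight.
        50.0
    } else if u > 80.0 {
        // Getting tight.
        70.0
    } else if u >= 50.0 {
        // Sweet spot.
        100.0
    } else if u >= 30.0 {
        // Assumed linear ramp: 60 points at 30% up to 100 points at 50%.
        60.0 + (u - 30.0) / 20.0 * 40.0
    } else {
        // Not covered by the documented bands; assumed floor of 60.
        60.0
    }
}
```

With these assumptions, 40% utilization lands halfway up the ramp at 80 points.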

Context Score (0-100)

Context window vs. use case needs:
  • Meets or exceeds target: 100 points
  • At least half of target: 70 points
  • Below half: 30 points
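
A minimal sketch of these three tiers, assuming both lengths are token counts; `context_score` and its parameter names are hypothetical, and how the crate chooses the target for each use case is not shown here.

```rust
/// Score a model's context window against a use-case target (both in tokens).
pub fn context_score(context_len: u32, target_len: u32) -> f64 {
    if context_len >= target_len {
        // Meets or exceeds the target.
        100.0
    } else if u64::from(context_len) * 2 >= u64::from(target_len) {
        // At least half of the target; widened to u64 to avoid overflow.
        70.0
    } else {
        // Below half of the target.
        30.0
    }
}
```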

Weighted Composite

Use-case-specific weights [Quality, Speed, Fit, Context]:
  • General: [0.45, 0.30, 0.15, 0.10]
  • Coding: [0.50, 0.20, 0.15, 0.15]
  • Reasoning: [0.55, 0.15, 0.15, 0.15]
  • Chat: [0.40, 0.35, 0.15, 0.10]
  • Multimodal: [0.50, 0.20, 0.15, 0.15]
  • Embedding: [0.30, 0.40, 0.20, 0.10]
Example:
let fit = ModelFit::analyze(model, &specs);
let sc = &fit.score_components; // borrow, so `fit` stays fully usable below

println!("Quality: {:.1}/100", sc.quality);
println!("Speed: {:.1}/100", sc.speed);
println!("Fit: {:.1}/100", sc.fit);
println!("Context: {:.1}/100", sc.context);
println!("Weighted: {:.1}/100", fit.score);
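
To show how the weight table could combine the four components, here is a self-contained sketch. It assumes the composite is a plain weighted sum; `UseCase` and `ScoreComponents` are local stand-ins, and `weights` and `composite` are hypothetical helper names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum UseCase {
    General,
    Coding,
    Reasoning,
    Chat,
    Multimodal,
    Embedding,
}

/// Local stand-in mirroring the documented struct.
pub struct ScoreComponents {
    pub quality: f64,
    pub speed: f64,
    pub fit: f64,
    pub context: f64,
}

/// Weights in the order [Quality, Speed, Fit, Context], from the table above.
pub fn weights(use_case: UseCase) -> [f64; 4] {
    match use_case {
        UseCase::General => [0.45, 0.30, 0.15, 0.10],
        UseCase::Coding => [0.50, 0.20, 0.15, 0.15],
        UseCase::Reasoning => [0.55, 0.15, 0.15, 0.15],
        UseCase::Chat => [0.40, 0.35, 0.15, 0.10],
        UseCase::Multimodal => [0.50, 0.20, 0.15, 0.15],
        UseCase::Embedding => [0.30, 0.40, 0.20, 0.10],
    }
}

/// Assumed combination rule: a weighted sum of the four components.
pub fn composite(sc: &ScoreComponents, use_case: UseCase) -> f64 {
    let [wq, ws, wf, wc] = weights(use_case);
    wq * sc.quality + ws * sc.speed + wf * sc.fit + wc * sc.context
}
```

Each row of weights sums to 1.0, so a weighted sum stays within 0-100. With components of 80, 50, 100, and 70, this sketch gives 71.5 for Chat and 73.0 for General.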

Speed Estimation

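Token generation is memory-bandwidth-bound, so the estimator derives speed from GPU memory bandwidth when the GPU is recognized and falls back to fixed per-backend constants otherwise.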

Bandwidth-Based (Preferred)

For recognized GPUs:
model_gb = params * bytes_per_param(quant)
raw_tps = (bandwidth_gbps / model_gb) * 0.55
estimated_tps = raw_tps * run_mode_factor
Efficiency factor (0.55) accounts for:
  • Kernel launch overhead
  • KV-cache reads
  • Memory controller inefficiency
Run mode factors:
  • Gpu: 1.0
  • MoeOffload: 0.8
  • CpuOffload: 0.5
  • CpuOnly: 0.3
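
The bandwidth-based formula can be sketched as follows. The units are assumptions: `params_b` is taken to be the parameter count in billions (so that the product with bytes per parameter is in GB) and bandwidth is taken to be in GB/s. The caller supplies `bytes_per_param` (roughly 0.5 for a 4-bit quantization, for example), because the crate's exact per-quantization values are not listed here. `RunMode` is a local stand-in and `estimate_tps` is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RunMode {
    Gpu,
    MoeOffload,
    CpuOffload,
    CpuOnly,
}

/// Run mode factors as listed above.
pub fn run_mode_factor(mode: RunMode) -> f64 {
    match mode {
        RunMode::Gpu => 1.0,
        RunMode::MoeOffload => 0.8,
        RunMode::CpuOffload => 0.5,
        RunMode::CpuOnly => 0.3,
    }
}

/// Estimate tokens per second from memory bandwidth.
/// Returns None for non-positive inputs instead of dividing by zero.
pub fn estimate_tps(
    params_b: f64,        // assumed: parameters in billions
    bytes_per_param: f64, // assumed: supplied by the caller per quantization
    bandwidth_gb_s: f64,  // assumed: memory bandwidth in GB/s
    mode: RunMode,
) -> Option<f64> {
    let model_gb = params_b * bytes_per_param;
    if model_gb <= 0.0 || bandwidth_gb_s <= 0.0 {
        return None;
    }
    // 0.55 is the efficiency factor described above.
    let raw_tps = bandwidth_gb_s / model_gb * 0.55;
    Some(raw_tps * run_mode_factor(mode))
}
```

Under these assumptions, a 7B model at 0.5 bytes per parameter occupies 3.5 GB; with 1000 GB/s of bandwidth the sketch yields about 157 tok/s in Gpu mode and about 47 tok/s in CpuOnly mode.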

Fallback (Unknown GPUs)

Fixed constants by backend:
  • Metal + MLX: 250
  • Metal + llama.cpp: 160
  • CUDA: 220
  • ROCm: 180
  • Vulkan: 150
  • SYCL: 100
  • CPU ARM: 90
  • CPU x86: 70
  • Ascend: 390
Example:
let fit = ModelFit::analyze(model, &specs);

println!("Estimated speed: {:.1} tok/s", fit.estimated_tps);

if fit.runtime == InferenceRuntime::Mlx {
    println!("Using MLX runtime on Apple Silicon");
}

Fit Analysis Notes

The notes field contains human-readable explanations:
for note in &fit.notes {
    println!("  {}", note);
}
Common notes:
  • “GPU: model loaded into VRAM”
  • “Unified memory: GPU and CPU share the same pool”
  • “MoE: 2/8 experts active in VRAM (12.5 GB) at Q4_K_M”
  • “Inactive experts offloaded to system RAM (18.3 GB)”
  • “GPU: insufficient VRAM, spilling to system RAM”
  • “Best quantization for hardware: Q4_K_M (model default: Q8_0)”
  • “MLX runtime: ~35% faster than llama.cpp (85.2 vs 63.1 tok/s)”
  • “Baseline estimated speed: 45.3 tok/s”

Backend Compatibility

Checks whether a model can run on the system's backend:
pub fn backend_compatible(model: &LlmModel, system: &SystemSpecs) -> bool
Example:
use llmfit_core::fit::backend_compatible;

let db = ModelDatabase::new();
let specs = SystemSpecs::detect();

for model in db.get_all_models() {
    if !backend_compatible(model, &specs) {
        println!("{}: incompatible with {}",
            model.name,
            specs.backend.label()
        );
    }
}
MLX models require Apple Silicon (Metal backend + unified memory).
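
The MLX rule can be expressed as a single boolean check. This is a sketch under stated assumptions rather than the crate's logic: the `Backend` enum and the `is_mlx_model` and `unified_memory` parameters are hypothetical, and treating every non-MLX model as compatible with every backend is a simplification.

```rust
/// Hypothetical backend enum, named after the backends listed in this document.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    Metal,
    Cuda,
    Rocm,
    Vulkan,
    Sycl,
    CpuArm,
    CpuX86,
    Ascend,
}

/// MLX models need the Metal backend plus unified memory.
/// Assumption: all other models are treated as compatible everywhere.
pub fn backend_compatible_sketch(
    is_mlx_model: bool,
    backend: Backend,
    unified_memory: bool,
) -> bool {
    !is_mlx_model || (backend == Backend::Metal && unified_memory)
}
```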
