QMD uses vector embeddings to enable semantic search. Documents are chunked into ~900 token pieces, embedded using a local GGUF model, and stored in a vector index for fast similarity lookup.
What are Embeddings?
Embeddings are dense vector representations of text that capture semantic meaning. Similar concepts have similar vectors, even if they use different words.
Generating Embeddings
Run qmd embed to generate embeddings for all indexed documents.
Embeddings are generated once and cached. Subsequent qmd embed calls only process new or changed documents.
Force Re-embedding
To regenerate all embeddings (e.g., after a model upgrade), force a full re-embed rather than relying on the cache.
Embedding Model
QMD uses embeddinggemma-300M-Q8_0 for embeddings:
| Property | Value |
|---|---|
| Model | embeddinggemma-300M-Q8_0.gguf |
| Size | ~300MB |
| Dimensions | 1024 |
| Context | 2048 tokens |
| Format | GGUF (runs via node-llama-cpp) |
The model is downloaded to ~/.cache/qmd/models/ on first use.
Embedding Format
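Documents are formatted for embeddinggemma using nomic-style prompts. Document chunks take the form title: ... | text: ..., and queries use their own prompt.
Smart Chunking Strategy
Documents are split into chunks before embedding. QMD uses markdown-aware smart chunking to preserve semantic units.
Chunking Parameters
The defaults are a 900-token chunk size, a 15% overlap (135 tokens), and a 200-token window for break point search.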
Why 900 Tokens?
- Fits embedding model context (2048 tokens)
- Balances granularity and coherence (too small = fragmented, too large = diluted)
- Leaves room for overlap (15% = 135 tokens)
- Accommodates title prefix (~50 tokens for title: ... | text: ...)
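The budget in the list above can be checked with a little arithmetic. The following is a minimal sketch, not QMD's implementation: CHUNK_SIZE_TOKENS is named on this page, while the other constants and the `chunk_windows` helper are hypothetical names, and the fixed-stride windowing deliberately ignores the smart boundary detection described later on this page.

```python
CHUNK_SIZE_TOKENS = 900      # default chunk size from this page
OVERLAP_RATIO = 0.15         # 15% overlap
TITLE_PREFIX_TOKENS = 50     # approximate "title: ... | text: ..." prefix
MODEL_CONTEXT_TOKENS = 2048  # embedding model context

overlap_tokens = round(CHUNK_SIZE_TOKENS * OVERLAP_RATIO)  # 135
stride_tokens = CHUNK_SIZE_TOKENS - overlap_tokens         # 765


def chunk_windows(total_tokens: int) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for fixed-size overlapping chunks.

    Hypothetical helper: hard cuts only, no markdown-aware boundaries.
    """
    windows = []
    start = 0
    while start < total_tokens:
        end = min(start + CHUNK_SIZE_TOKENS, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            break
        # Each new chunk starts 135 tokens before the previous one ended.
        start += stride_tokens
    return windows
```

Under these assumptions a 2,000-token document yields three chunks, and a full chunk plus its title prefix (about 950 tokens) stays well inside the 2048-token context.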
15% Overlap
Chunks overlap by 135 tokens to avoid cutting concepts in half.
Smart Boundary Detection
Instead of cutting at hard token boundaries, QMD finds natural markdown break points within a 200-token window.
Break Point Scores
| Pattern | Score | Description |
|---|---|---|
| # Heading | 100 | H1 - major section |
| ## Heading | 90 | H2 - subsection |
| ### Heading | 80 | H3 |
| #### Heading | 70 | H4 |
| ##### Heading | 60 | H5 |
| ###### Heading | 50 | H6 |
| ``` | 80 | Code block boundary |
| --- / *** | 60 | Horizontal rule |
| Blank line | 20 | Paragraph boundary |
| - item / 1. item | 5 | List item |
| Line break | 1 | Minimal break |
Scoring Algorithm
- Scan document for all break points
- When approaching 900-token target, search 200 tokens backward
- Score each break point: finalScore = baseScore × (1 - (distance/window)² × 0.7)
- Cut at highest-scoring break point
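The steps above can be sketched in a few lines. This is an illustrative reading of the formula, not QMD's source: the `BreakPoint` type, the function names, and the hard-cut fallback when no break point is found are all assumptions.

```python
from dataclasses import dataclass

WINDOW_TOKENS = 200  # search window from this page
DECAY = 0.7          # distance penalty factor from the formula


@dataclass
class BreakPoint:
    position: int    # token offset of the candidate break (hypothetical field)
    base_score: int  # score from the break point table, e.g. 90 for H2


def final_score(base_score: float, distance: float, window: float = WINDOW_TOKENS) -> float:
    """finalScore = baseScore * (1 - (distance / window)^2 * 0.7)."""
    return base_score * (1 - (distance / window) ** 2 * DECAY)


def best_cut(break_points: list[BreakPoint], target: int, window: int = WINDOW_TOKENS) -> int:
    """Pick the highest-scoring break point in [target - window, target]."""
    candidates = [bp for bp in break_points if target - window <= bp.position <= target]
    if not candidates:
        return target  # assumption: fall back to a hard cut at the target
    best = max(
        candidates,
        key=lambda bp: final_score(bp.base_score, target - bp.position, window),
    )
    return best.position
```

Because the penalty is quadratic and capped at 70%, a strong break far from the target can still win: an H2 heading at the far edge of the window scores about 27, which beats a blank line right at the target (20).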
Code Fence Protection
Break points inside code blocks are ignored, so code stays together. If a code block exceeds the chunk size, it’s kept whole when possible.
vectors_vec
Vector data is stored using sqlite-vec.
Vector Search Process
Compute Distances
sqlite-vec computes the cosine distance between the query vector and all document chunk vectors.
Normalize Scores
Cosine distance is converted to a similarity score. Range: 0.0 (dissimilar) to 1.0 (identical).
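To make the two steps concrete, here is a small pure-Python sketch. The cosine distance definition is standard; the conversion to a 0.0 to 1.0 score is an assumption (1 - distance, clamped), since this page does not state the exact formula QMD uses, and both function names are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity; ranges from 0.0 to 2.0.

    Assumes neither vector is all zeros (that would divide by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity_from_distance(distance: float) -> float:
    """ASSUMED conversion: 1 - distance, clamped to [0.0, 1.0]."""
    return max(0.0, min(1.0, 1.0 - distance))
```

Identical directions give a distance of 0.0 and a score of 1.0; orthogonal or opposing vectors end up at 0.0 under this clamped conversion.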
Embedding Pipeline
Performance Characteristics
Embedding Speed
| Hardware | Speed | Chunking + Embedding |
|---|---|---|
| CUDA GPU | ~500 chunks/sec | ~100 docs/sec |
| Apple M1/M2 | ~200 chunks/sec | ~40 docs/sec |
| CPU only | ~20 chunks/sec | ~4 docs/sec |
Embedding speed scales with parallelism. QMD creates multiple embedding contexts based on available VRAM/cores.
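As a rough planning aid, the table's approximate rates can be turned into a time estimate. This is only back-of-the-envelope arithmetic: the dictionary keys and the function name are hypothetical, and real throughput will vary with hardware and parallelism.

```python
# Approximate throughput from the table above, in chunks per second.
CHUNKS_PER_SEC = {"cuda": 500, "apple_m1_m2": 200, "cpu": 20}


def estimated_embed_seconds(num_chunks: int, hardware: str) -> float:
    """Rough time to embed num_chunks at the table's approximate rate."""
    return num_chunks / CHUNKS_PER_SEC[hardware]
```

At these rates, 10,000 chunks would take roughly 500 seconds on CPU and about 20 seconds on a CUDA GPU.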
Search Speed
| Index Size | GPU | CPU |
|---|---|---|
| 1K chunks | ~10ms | ~50ms |
| 10K chunks | ~30ms | ~200ms |
| 100K chunks | ~100ms | ~1s |
GPU Acceleration
QMD auto-detects GPU support and uses the best available backend:
- CUDA (NVIDIA GPUs)
- Metal (Apple Silicon)
- Vulkan (cross-platform)
- CPU (fallback)
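One way to picture "best available backend" is a simple priority scan. The sketch below assumes the list above is ordered by preference, which the page implies but does not state; `pick_backend` and the backend identifiers are hypothetical, and actual detection is handled by the runtime rather than by code like this.

```python
# Assumed preference order, taken from the list above.
BACKEND_PRIORITY = ["cuda", "metal", "vulkan", "cpu"]


def pick_backend(available: set[str]) -> str:
    """Return the first backend in priority order that is available."""
    for backend in BACKEND_PRIORITY:
        if backend in available:
            return backend
    return "cpu"  # CPU is the fallback when nothing else is detected
```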
Model Cache
Models are downloaded to ~/.cache/qmd/models/.
Updating Embeddings
Embeddings are content-addressed by document hash. When a document changes:
- New hash is computed
- New embeddings are generated
- Old embeddings remain until cleanup
Cleanup removes:
- Orphaned embeddings (no matching document)
- Inactive document records
- Stale LLM cache entries
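The content-addressing idea can be illustrated with a toy in-memory store. Everything here is a simplified assumption: the `EmbeddingStore` class, the choice of SHA-256, and one vector per document (QMD embeds per chunk and stores vectors in SQLite) are for illustration only.

```python
import hashlib


def content_hash(text: str) -> str:
    """Content address for a document (SHA-256 is an assumed hash choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingStore:
    """Toy model of hash-keyed embeddings with orphan cleanup."""

    def __init__(self) -> None:
        self.documents: dict[str, str] = {}           # path -> current content hash
        self.embeddings: dict[str, list[float]] = {}  # content hash -> vector

    def embed(self, path: str, text: str, embed_fn) -> bool:
        """Embed only if this content hash is new. Returns True if work was done."""
        digest = content_hash(text)
        self.documents[path] = digest
        if digest in self.embeddings:
            return False  # unchanged content: cached embedding is reused
        self.embeddings[digest] = embed_fn(text)
        return True

    def cleanup(self) -> int:
        """Drop embeddings whose hash no longer matches any document."""
        live = set(self.documents.values())
        orphaned = [h for h in self.embeddings if h not in live]
        for h in orphaned:
            del self.embeddings[h]
        return len(orphaned)
```

In this model, editing a document leaves the old vector behind under its old hash, which is why orphaned entries accumulate until cleanup runs.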
Context and Embeddings
Context is NOT embedded in vectors. Chunk embeddings only include:
- Document title
- Chunk text
Context affects search quality through:
- LLM reranking (reranker sees context)
- Query expansion (context can inform expansions)
- User understanding (helps interpret results)
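A short sketch makes the separation explicit. It uses the title: ... | text: ... prompt shape mentioned earlier on this page; the function name and the unused `context` parameter are hypothetical and exist only to show that context never reaches the embedding input.

```python
def format_document_for_embedding(title: str, chunk_text: str, context: str = "") -> str:
    """Build the embedding input from title and chunk text only.

    `context` is accepted to make the point explicit: it is deliberately unused.
    """
    return f"title: {title} | text: {chunk_text}"
```

Because context is left out, changing a collection's context in this model would not require re-embedding; it only affects the later stages listed above.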
Best Practices
Embed before searching
Vector search (vsearch, query) requires embeddings, so run qmd embed first.
Re-embed after major changes
If you restructure documents or add significant content, re-embed.
Use cleanup regularly
Orphaned embeddings waste space. Run cleanup monthly.
Monitor embedding coverage
Check how many documents need embedding by looking for the “Documents need embedding” count.
Balance chunk size
The default 900 tokens works well for most documents. Don’t change CHUNK_SIZE_TOKENS unless you have a specific need (e.g., very short or very long documents).
Troubleshooting
“sqlite-vec extension unavailable”
Vector search requires sqlite-vec. On macOS, set BREW_PREFIX if Homebrew is in a non-standard location.
Slow embedding on CPU
CPU-only embedding is slow. Options:
- Use a GPU (CUDA, Metal, or Vulkan)
- Reduce collection size
- Be patient — embedding is one-time per document
Out of memory during embedding
QMD auto-detects available VRAM/RAM and creates appropriate parallelism. If you still run out:
- Close other applications
- Reduce collection size
- Embed in batches (use collection filters)
Embeddings not updating
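Embeddings are cached by document hash. If content changes, the document gets a new hash and the next qmd embed run generates new embeddings for it. To regenerate everything, force a full re-embed.
Related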
- Search Modes - Using vsearch and query
- Collections - Organizing documents
- Context Management - Adding metadata to chunks
- CLI Reference - Embed command documentation