## What are Embeddings?
Embeddings are dense vector representations of text that capture semantic meaning. Similar concepts have similar vectors, even if they use different words.

## Generating Embeddings
Run `qmd embed` to generate embeddings for all indexed documents.
Embeddings are generated once and cached. Subsequent `qmd embed` calls only process new or changed documents.

## Force Re-embedding
To regenerate all embeddings (e.g., after a model upgrade), force a re-embed.

## Embedding Model
QMD uses embeddinggemma-300M-Q8_0 for embeddings:

| Property | Value |
|---|---|
| Model | embeddinggemma-300M-Q8_0.gguf |
| Size | ~300MB |
| Dimensions | 1024 |
| Context | 2048 tokens |
| Format | GGUF (runs via node-llama-cpp) |
The model is downloaded to `~/.cache/qmd/models/` on first use.
## Embedding Format
Documents are formatted for embeddinggemma using nomic-style prompts; queries use a separate query prompt.

## Smart Chunking Strategy
Documents are split into chunks before embedding. QMD uses markdown-aware smart chunking to preserve semantic units.

### Chunking Parameters
### Why 900 Tokens?
- Fits embedding model context (2048 tokens)
- Balances granularity and coherence (too small = fragmented, too large = diluted)
- Leaves room for overlap (15% = 135 tokens)
- Accommodates title prefix (~50 tokens for `title: ... | text: ...`)
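The token budget above reduces to simple arithmetic. A sketch (constant names other than `CHUNK_SIZE_TOKENS` are illustrative, not QMD's actual identifiers):

```typescript
// Chunk budget arithmetic from the list above.
const CHUNK_SIZE_TOKENS = 900;     // target chunk size
const OVERLAP_RATIO = 0.15;        // 15% overlap between consecutive chunks
const TITLE_PREFIX_TOKENS = 50;    // approximate "title: ... | text: ..." prefix
const MODEL_CONTEXT_TOKENS = 2048; // embeddinggemma context window

const overlapTokens = CHUNK_SIZE_TOKENS * OVERLAP_RATIO;        // 135
const worstCaseInput = CHUNK_SIZE_TOKENS + TITLE_PREFIX_TOKENS; // 950, well under 2048
```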
### 15% Overlap
Chunks overlap by 135 tokens to avoid cutting concepts in half.

### Smart Boundary Detection
Instead of cutting at hard token boundaries, QMD finds natural markdown break points within a 200-token window.

### Break Point Scores
| Pattern | Score | Description |
|---|---|---|
| `# Heading` | 100 | H1 - major section |
| `## Heading` | 90 | H2 - subsection |
| `### Heading` | 80 | H3 |
| `#### Heading` | 70 | H4 |
| `##### Heading` | 60 | H5 |
| `###### Heading` | 50 | H6 |
| `` ``` `` | 80 | Code block boundary |
| `---` / `***` | 60 | Horizontal rule |
| Blank line | 20 | Paragraph boundary |
| `- item` / `1. item` | 5 | List item |
| Line break | 1 | Minimal break |
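The table translates naturally into a small line classifier. A sketch of that mapping (QMD's actual matching logic may differ):

```typescript
// Assign a base score to a markdown line, mirroring the break-point table above.
function baseScore(line: string): number {
  const heading = line.match(/^(#{1,6})\s/);
  if (heading) return 110 - 10 * heading[1].length;     // "#" -> 100 ... "######" -> 50
  if (/^```/.test(line.trim())) return 80;              // code block boundary
  if (/^(---|\*\*\*)\s*$/.test(line.trim())) return 60; // horizontal rule
  if (line.trim() === "") return 20;                    // blank line = paragraph boundary
  if (/^(\s*[-*]\s|\s*\d+\.\s)/.test(line)) return 5;   // list item
  return 1;                                             // any other line break
}
```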
### Scoring Algorithm
- Scan document for all break points
- When approaching 900-token target, search 200 tokens backward
- Score each break point: `finalScore = baseScore × (1 - (distance/window)² × 0.7)`
- Cut at the highest-scoring break point
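The last two steps can be sketched as follows, assuming `distance` is measured in tokens back from the 900-token target and `window` is the 200-token search window (function names are hypothetical):

```typescript
interface BreakPoint {
  offsetTokens: number; // token position of the break point in the document
  baseScore: number;    // from the break-point score table
}

// Distance-weighted score: break points nearer the target keep more of
// their base score; points at the far edge of the window lose up to 70%.
function finalScore(baseScore: number, distance: number, window: number): number {
  return baseScore * (1 - (distance / window) ** 2 * 0.7);
}

// Choose the best break point within `window` tokens before `target`.
function pickBreakPoint(points: BreakPoint[], target: number, window = 200): BreakPoint | undefined {
  let best: BreakPoint | undefined;
  let bestScore = -Infinity;
  for (const p of points) {
    const distance = target - p.offsetTokens;
    if (distance < 0 || distance > window) continue; // outside the search window
    const s = finalScore(p.baseScore, distance, window);
    if (s > bestScore) { bestScore = s; best = p; }
  }
  return best;
}
```

With these weights, an H2 heading 150 tokens back scores 90 × (1 − 0.75² × 0.7) ≈ 54.6, so it still outscores a blank line sitting right at the target (20): structure wins over proximity.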
### Code Fence Protection
Break points inside code blocks are ignored, so code stays together. If a code block exceeds the chunk size, it's kept whole when possible.

## vectors_vec
Vector data is stored using sqlite-vec.

## Vector Search Process
### Compute Distances
sqlite-vec computes the cosine distance between the query vector and all document chunk vectors.
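sqlite-vec performs this computation inside SQLite; the same cosine distance, written out in TypeScript for clarity (not QMD's code):

```typescript
// Cosine distance between two equal-length vectors: 1 - cos(theta).
// 0 = identical direction, 1 = orthogonal, 2 = opposite.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```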
### Normalize Scores
Cosine distance is converted to a similarity score in the range 0.0 (dissimilar) to 1.0 (identical).
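The exact conversion isn't shown in this section. One mapping consistent with the stated range, assuming cosine distance falls in [0, 2], would be:

```typescript
// Assumed normalization - QMD may use a different formula.
// Maps cosine distance (0 = identical, 2 = opposite) onto 0..1 similarity.
function toSimilarity(distance: number): number {
  return 1 - distance / 2;
}
```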
## Embedding Pipeline
## Performance Characteristics
### Embedding Speed
| Hardware | Embedding speed | Chunking + embedding |
|---|---|---|
| CUDA GPU | ~500 chunks/sec | ~100 docs/sec |
| Apple M1/M2 | ~200 chunks/sec | ~40 docs/sec |
| CPU only | ~20 chunks/sec | ~4 docs/sec |
Embedding speed scales with parallelism. QMD creates multiple embedding contexts based on available VRAM/cores.
### Search Speed
| Index Size | GPU | CPU |
|---|---|---|
| 1K chunks | ~10ms | ~50ms |
| 10K chunks | ~30ms | ~200ms |
| 100K chunks | ~100ms | ~1s |
## GPU Acceleration
QMD auto-detects GPU support and uses the best available backend:

- CUDA (NVIDIA GPUs)
- Metal (Apple Silicon)
- Vulkan (cross-platform)
- CPU (fallback)
## Model Cache
Models are downloaded to `~/.cache/qmd/models/`.

## Updating Embeddings
Embeddings are content-addressed by document hash. When a document changes:

- A new hash is computed
- New embeddings are generated
- Old embeddings remain until cleanup
Cleanup removes:

- Orphaned embeddings (no matching document)
- Inactive document records
- Stale LLM cache entries
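The content-addressed update cycle can be sketched with a hash-keyed cache (the cache shape and function names are hypothetical, using Node's standard crypto module):

```typescript
import { createHash } from "node:crypto";

// Embeddings cache keyed by content hash - a new hash means a re-embed.
const cache = new Map<string, number[][]>(); // hash -> chunk embedding vectors

function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

function needsEmbedding(text: string): boolean {
  return !cache.has(contentHash(text));
}

function storeEmbeddings(text: string, vectors: number[][]): void {
  cache.set(contentHash(text), vectors);
  // entries for previous hashes stay behind until cleanup removes orphans
}
```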
## Context and Embeddings
Context is NOT embedded in vectors. Chunk embeddings only include:

- Document title
- Chunk text
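Combined with the nomic-style document prompt described under Embedding Format, the string that is actually embedded for a chunk would look like this (a sketch; QMD's exact formatting may differ):

```typescript
// Build the text to embed for one chunk: title + chunk text only, no context.
function embeddingInput(title: string, chunkText: string): string {
  return `title: ${title} | text: ${chunkText}`;
}
```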
Context affects search quality through:
- LLM reranking (reranker sees context)
- Query expansion (context can inform expansions)
- User understanding (helps interpret results)
## Best Practices
### Embed before searching
Vector search (`vsearch`, `query`) requires embeddings.

### Re-embed after major changes
If you restructure documents or add significant content, re-embed.
### Use cleanup regularly
Orphaned embeddings waste space. Run cleanup monthly.
### Monitor embedding coverage
Check how many documents need embedding by looking for the “Documents need embedding” count.
### Balance chunk size
The default 900 tokens works well for most documents. Don't change `CHUNK_SIZE_TOKENS` unless you have a specific need (e.g., very short or very long documents).

## Troubleshooting
### “sqlite-vec extension unavailable”
Vector search requires sqlite-vec. On macOS, set `BREW_PREFIX` if Homebrew is installed in a non-standard location.
### Slow embedding on CPU
CPU-only embedding is slow. Options:

- Use a GPU (CUDA, Metal, or Vulkan)
- Reduce collection size
- Be patient — embedding is one-time per document
### Out of memory during embedding
QMD auto-detects available VRAM/RAM and creates appropriate parallelism. If you still run out:

- Close other applications
- Reduce collection size
- Embed in batches (use collection filters)
### Embeddings not updating
Embeddings are cached by document hash. If content changes, the hash changes and the document is re-embedded on the next run.

## Related
- Search Modes - Using `vsearch` and `query`
- Collections - Organizing documents
- Context Management - Adding metadata to chunks
- CLI Reference - `embed` command documentation