Overview
know is a semantic search CLI that combines vector embeddings and lexical search for fast, accurate retrieval of information from local documents and code. The system is built on ChromaDB for vector storage, llama-index for document processing, and BM25 for lexical search.
Core Components
1. Directory Tracker
Location: src/know.py (INDEX_FILE)
Purpose: Manages the list of directories to watch and index.
Storage: ~/.know_dirs - Plain text file with one directory path per line
Operations:
- `add` - Append a directory to the watch list
- `remove` - Remove a directory from the watch list
- `dirs` - Display all watched directories
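Since the watch list is a plain text file with one path per line, the three operations reduce to simple line edits. A minimal sketch, assuming hypothetical helper names (the real implementation lives in src/know.py):

```python
from pathlib import Path

def read_dirs(index_file: Path) -> list[str]:
    """Return all watched directories, one per line."""
    if not index_file.exists():
        return []
    return [line for line in index_file.read_text().splitlines() if line.strip()]

def add_dir(index_file: Path, directory: str) -> None:
    """Append a directory to the watch list if not already present."""
    dirs = read_dirs(index_file)
    if directory not in dirs:
        dirs.append(directory)
        index_file.write_text("\n".join(dirs) + "\n")

def remove_dir(index_file: Path, directory: str) -> None:
    """Remove a directory from the watch list."""
    dirs = [d for d in read_dirs(index_file) if d != directory]
    index_file.write_text("\n".join(dirs) + ("\n" if dirs else ""))
```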
2. Document Loader
Library: llama-index `SimpleDirectoryReader`
Configuration:
- Automatic file type detection based on extension
- Recursive directory traversal
- Glob pattern filtering
- Modification time filtering (`--since`)
3. Text Chunking
Library: llama-index `SentenceSplitter`
Default Configuration:
- Chunk size: 512 tokens
- Chunk overlap: 50 tokens
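To make the size/overlap interaction concrete, here is a simplified sliding-window chunker with the same defaults. This is a sketch only: the real `SentenceSplitter` additionally respects sentence boundaries rather than cutting at fixed token offsets.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Fixed-window chunking: each chunk holds up to `chunk_size` tokens
    and repeats the last `overlap` tokens of its predecessor."""
    step = chunk_size - overlap  # advance 462 tokens per chunk by default
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, a 1000-token document yields three chunks, and each chunk shares its first 50 tokens with the previous one.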
4. Deduplication System
Method: MD5 hash of `path:chunk_index:text`
Location: src/db.py:201-203
Implementation:
- Existing chunks: Check against ChromaDB for already-indexed chunks
- Within-batch duplicates: Track seen IDs to prevent duplicate inserts
- Collision detection: Report chunks with identical content from different files
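The ID scheme and within-batch tracking can be sketched as follows. The exact hash-input formatting is an assumption based on the description above; the real code is at src/db.py:201-203, and `dedup_batch` is a hypothetical helper.

```python
import hashlib

def chunk_id(path: str, chunk_index: int, text: str) -> str:
    """MD5 over `path:chunk_index:text` (format assumed from the docs)."""
    return hashlib.md5(f"{path}:{chunk_index}:{text}".encode("utf-8")).hexdigest()

def dedup_batch(chunks, existing_ids):
    """Keep only chunks whose ID is neither already indexed in ChromaDB
    (`existing_ids`) nor already seen earlier in this batch."""
    seen, fresh = set(existing_ids), []
    for path, idx, text in chunks:
        cid = chunk_id(path, idx, text)
        if cid in seen:
            continue
        seen.add(cid)
        fresh.append((cid, path, idx, text))
    return fresh
```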
5. Vector Store (ChromaDB)
Type: `chromadb.PersistentClient`
Location: ./know_index (relative to working directory)
Collection: documents
Metadata Schema:
Operations:
- `upsert()` - Add or update chunks in batches of 100
- `query()` - Semantic similarity search with cosine distance
- `get()` - Retrieve chunks by ID or metadata filters
- `delete()` - Remove chunks (used by the `prune` command)
6. BM25 Lexical Search
Library: `bm25s` with PyStemmer
Cache Location: ./know_index/bm25/
Cache Files:
- `meta.json` - Document count for cache validation
- `ids.json` - List of chunk IDs in index order
- BM25 index files (managed by bm25s)

Search Process:
- Check if a cached index exists and matches the current document count
- If cache miss: retrieve all documents, build the BM25 index, save the cache
- If cache hit: load the pre-built index and ID mapping
- Tokenize the query and retrieve the top-k ranked results
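The cache-validation step above compares the stored document count against the live collection. A minimal sketch, assuming a `doc_count` key inside `meta.json` (the actual key name in the cache file may differ):

```python
import json
from pathlib import Path

def bm25_cache_valid(cache_dir: Path, doc_count: int) -> bool:
    """Cache hit only if meta.json exists and its stored document
    count matches the current collection size."""
    meta_path = cache_dir / "meta.json"
    if not meta_path.exists():
        return False
    meta = json.loads(meta_path.read_text())
    return meta.get("doc_count") == doc_count

def save_bm25_meta(cache_dir: Path, doc_count: int) -> None:
    """Record the document count alongside the freshly built index."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "meta.json").write_text(json.dumps({"doc_count": doc_count}))
```

Any add or prune that changes the collection size invalidates the cache on the next search, triggering a rebuild.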
7. File Cache
Location: ./know_index/file_cache.json
Purpose: Skip re-indexing unchanged files for faster incremental updates.
Schema:
- A file is skipped if its `mtime` and `size` match the cached values
- The cache is invalidated if chunk size or overlap settings change
- Cache entries are cleaned during the `prune` operation
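The mtime/size check can be sketched as below. The cache entry key names (`mtime`, `size`) are assumptions based on the description; the real structure lives in ./know_index/file_cache.json.

```python
import os

def file_unchanged(cache: dict, path: str) -> bool:
    """A file is skipped when its current mtime and size match the cache."""
    entry = cache.get(path)
    if entry is None:
        return False  # never indexed before
    stat = os.stat(path)
    return entry["mtime"] == stat.st_mtime and entry["size"] == stat.st_size

def update_cache(cache: dict, path: str) -> None:
    """Record the file's current mtime and size after indexing it."""
    stat = os.stat(path)
    cache[path] = {"mtime": stat.st_mtime, "size": stat.st_size}
```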
8. Search Engine
Modes:

Dense (Vector) Search
- Uses ChromaDB’s embedding-based semantic search
- Returns results sorted by cosine distance
- Best for conceptual and semantic queries
BM25 (Lexical) Search
- Uses BM25 term-frequency ranking
- Returns results sorted by BM25 score
- Best for exact term matches and keyword queries
Hybrid Search
- Combines dense and BM25 results using Reciprocal Rank Fusion (RRF)
- Fetches 3x candidate results from each method
- Fuses rankings with RRF (k=60)
- Returns top results by fused score
- Best for balanced recall and precision
The fused score is score(d) = Σ_i 1/(k + rank_i(d)), where k = 60 and rank_i(d) is the rank of document d under each retrieval method.
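The RRF fusion step can be sketched directly from that formula (`rrf_fuse` is a hypothetical helper name; the inputs are the ID lists returned by the dense and BM25 searches):

```python
def rrf_fuse(dense_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over the rankings it appears in, with ranks starting at 1."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by both methods (e.g. "b" below) beats one ranked first by only a single method, which is why RRF balances recall and precision:

```python
rrf_fuse(["a", "b", "c"], ["b", "c", "a"])  # "b" wins: 1/62 + 1/61
```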
Data Flow
Indexing Flow
Search Flow
Storage Layout
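A sketch of the on-disk layout, assembled from the paths mentioned in the sections above (ChromaDB's internal file names are omitted):

```
~/.know_dirs               # watch list, one directory per line
./know_index/              # ChromaDB persistent store (SQLite + vectors)
├── file_cache.json        # mtime/size cache for incremental indexing
└── bm25/                  # BM25 cache
    ├── meta.json          # document count for cache validation
    ├── ids.json           # chunk IDs in index order
    └── ...                # index files managed by bm25s
```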
Performance Optimizations
Batch Processing
- Documents are upserted to ChromaDB in batches of 100
- Existing ID checks are batched to reduce round trips
- BM25 queries fetch 3x candidates to account for filtering
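The batching pattern for upserts is straightforward; a sketch with the stated batch size of 100 (`batched` is a hypothetical helper name):

```python
def batched(items: list, batch_size: int = 100):
    """Yield successive fixed-size batches, as used for ChromaDB upserts."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded batch would be passed to a single `upsert()` call, turning N chunks into roughly N/100 round trips.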
Caching Strategies
- File Cache: Skip unchanged files based on mtime/size
- BM25 Cache: Persist pre-built BM25 index between searches
- ID Deduplication: Track seen IDs in-memory to avoid duplicate processing
Incremental Updates
- Only new and modified files are processed during re-indexing
- Existing chunks are preserved and reused
- Cache invalidation ensures consistency with configuration changes
Dependencies
Core Libraries
- chromadb: Vector database with embedding generation
- llama-index-core: Document loading and text splitting
- bm25s: Fast BM25 implementation
- PyStemmer: English word stemming for BM25
- typer: CLI framework with type annotations
- rich: Terminal output formatting and progress bars
Storage Requirements
- Vector embeddings: ~384 dimensions × 4 bytes per chunk
- BM25 index: Sparse matrix (~10-20% of document size)
- File cache: ~100 bytes per indexed file
- ChromaDB overhead: SQLite metadata (~1-5% of total)
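The embedding figure above works out as follows (assuming float32 storage, which matches the stated 4 bytes per dimension):

```python
EMBED_DIM = 384        # dimensions per embedding, as stated above
BYTES_PER_FLOAT = 4    # float32

def embedding_bytes(num_chunks: int) -> int:
    """Raw embedding storage: 1536 bytes per chunk before ChromaDB overhead."""
    return num_chunks * EMBED_DIM * BYTES_PER_FLOAT
```

So 10,000 chunks need roughly 15 MB of raw embedding data, before BM25 and SQLite overhead.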
Related Documentation
- Supported File Types - List of indexable extensions
- FAQ - Common questions and troubleshooting
- Commands Reference - Command documentation