Configuration

know provides several configuration options to optimize indexing for your content. Understanding these settings helps you balance search quality, performance, and storage.

Chunk Size and Overlap

Documents are split into chunks for indexing. The chunk size and overlap determine how content is divided.

Chunk Size

The --chunk-size parameter sets the maximum size of each chunk in tokens. A token is somewhat shorter than a word on average: the default of 512 tokens corresponds to roughly 400 words.

# Default: 512 tokens (~400 words)
know index

# Smaller chunks for more precise results
know index --chunk-size 256

# Larger chunks for more context
know index --chunk-size 1024

Three common sizes are Small (256), Medium (512), and Large (1024). The guidance below is for Small (256).

Use for:

Short documents (notes, snippets)
Precise location matching
Code files with small functions
FAQ-style content

Tradeoffs:

✅ More precise results
✅ Better for short queries
❌ May split logical units
❌ More chunks = more storage
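To make the storage tradeoff concrete, the following minimal sketch estimates how many chunks a document produces at each chunk size. It assumes a plain fixed-size sliding window with the default 50-token overlap. estimate_chunk_count is a hypothetical helper, not part of know, and the real splitter respects sentence boundaries, so actual counts will differ somewhat.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, overlap: int = 50) -> int:
    """Estimate the chunk count for a fixed-size sliding window (illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # Every chunk after the first contributes (chunk_size - overlap) new tokens.
    stride = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


for size in (256, 512, 1024):
    print(size, estimate_chunk_count(10_000, size))
```

Under these assumptions a 10,000-token document yields 49 chunks at 256 tokens, 22 at 512, and 11 at 1024, which is why smaller chunks cost more storage.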

Chunk Overlap

The --overlap parameter controls how many tokens overlap between consecutive chunks.

# Default: 50 tokens
know index

# No overlap
know index --overlap 0

# More overlap for better context
know index --overlap 100

# Combine with chunk size
know index --chunk-size 512 --overlap 75

Why Use Chunk Overlap?

Overlap ensures that content at chunk boundaries isn’t lost or split awkwardly.

Without overlap (--overlap 0):

Chunk 1: "...the function returns a value."
Chunk 2: "The value is then processed by..."

A search for “return value processed” might miss this, because no single chunk contains all of the query terms.

With overlap (--overlap 50):

Chunk 1: "...the function returns a value."
Chunk 2: "...returns a value. The value is then processed by..."

Now the second chunk contains the complete context.

Recommended overlap:

10-20% of chunk size
Default 50 tokens works well with 512 chunk size (~10%)
Increase for narrative content
Decrease for independent items (logs, code)
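The boundary effect can be reproduced with a naive sliding window over whitespace-separated words. This is only a sketch: know chunks by tokens with LlamaIndex's SentenceSplitter, whereas chunk_words and recommended_overlap are hypothetical helpers that use words and a fixed stride.

```python
def recommended_overlap(chunk_size: int, fraction: float = 0.10) -> int:
    """Overlap as a fraction of the chunk size (the 10-20% rule of thumb)."""
    return round(chunk_size * fraction)


def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word windows that share `overlap` words with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    words = text.split()
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
        start += stride
    return chunks


text = "the function returns a value. The value is then processed by the caller."
print(chunk_words(text, chunk_size=5, overlap=0))
print(chunk_words(text, chunk_size=5, overlap=3))
print(recommended_overlap(512))
```

With overlap 0 the second window starts at "The value is then processed". With overlap 3 it starts at "returns a value. The value", so the words around the boundary appear together in one chunk.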

How Chunking Works

know uses LlamaIndex’s SentenceSplitter for intelligent chunking:

# From src/db.py:183
splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
nodes = splitter.get_nodes_from_documents(filtered)

The splitter:

Respects sentence boundaries when possible
Avoids breaking words or sentences mid-way
Maintains metadata (file path, chunk index)
Creates overlapping regions for context

Implementation reference: src/db.py:79-313
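As a rough illustration of sentence-aware chunking, the sketch below packs whole sentences into chunks without exceeding a word budget. It is a simplified stand-in, not LlamaIndex's actual algorithm: pack_sentences is a hypothetical helper that counts words rather than tokens, uses a naive regular expression for sentence boundaries, and omits overlap and metadata.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text: str, max_words: int) -> list[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact in its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "The function returns a value. The value is then processed. Done."
for chunk in pack_sentences(sample, max_words=6):
    print(repr(chunk))
```

No sentence is cut in the middle: a chunk is closed as soon as the next sentence would push it over the budget.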

Caching and Performance

File Cache

know maintains a cache in ~/.cache/know/ to track indexed files:

# Cache stores: modification time, file size, chunk settings
# Files are skipped if unchanged

The cache is invalidated when:

File modification time changes
File size changes
Chunk size changes
Chunk overlap changes

This ensures re-indexing only when necessary.

# From src/db.py:142-173
cache = load_file_cache(chunk_size, chunk_overlap)
for doc in documents:
    source_path = doc.metadata["file_path"]
    stat = Path(source_path).stat()
    cache_entry = cache.get(source_path)
    if (
        cache_entry
        and cache_entry.get("indexed")
        and cache_entry.get("mtime") == stat.st_mtime
        and cache_entry.get("size") == stat.st_size
    ):
        skipped_cache += 1
        continue

Implementation reference: src/db.py:142-173
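The invalidation rules can be sketched as a freshness check. This is an illustration under stated assumptions rather than know's code: make_entry and is_fresh are hypothetical helpers, and storing the chunk settings inside each entry is a guess at one way to invalidate on setting changes (the excerpt above passes them to load_file_cache instead).

```python
import tempfile
from pathlib import Path


def make_entry(path, chunk_size: int, chunk_overlap: int) -> dict:
    """Record what the file looked like, and which settings were used, at index time."""
    stat = Path(path).stat()
    return {
        "indexed": True,
        "mtime": stat.st_mtime,
        "size": stat.st_size,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }


def is_fresh(entry, path, chunk_size: int, chunk_overlap: int) -> bool:
    """Return True only if the file and the chunk settings are all unchanged."""
    if not entry or not entry.get("indexed"):
        return False
    stat = Path(path).stat()
    return (
        entry.get("mtime") == stat.st_mtime
        and entry.get("size") == stat.st_size
        and entry.get("chunk_size") == chunk_size
        and entry.get("chunk_overlap") == chunk_overlap
    )


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "note.md"
    doc.write_text("hello")
    entry = make_entry(doc, 512, 50)
    print(is_fresh(entry, doc, 512, 50))  # unchanged file, same settings
    print(is_fresh(entry, doc, 256, 50))  # chunk size changed
    doc.write_text("hello, world")        # file size changes
    print(is_fresh(entry, doc, 512, 50))
```

Only the first check reports the file as fresh. Changing the chunk size or the file contents forces a re-index.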

Deduplication

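know deduplicates chunks automatically. Each chunk ID is the MD5 hash of its source path, chunk index, and text, so indexing the same chunk again produces the same ID: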

# From src/db.py:201-203
chunk_id = hashlib.md5(
    f"{source_path}:{chunk_index}:{node.text}".encode()
).hexdigest()

This prevents:

Duplicate content from being indexed multiple times
Wasted storage and compute
Redundant search results

Implementation reference: src/db.py:198-255
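A minimal sketch of ID-based deduplication, using the same hash input as the excerpt above. The dedupe helper and its tuple-based input are assumptions made for illustration. In know the IDs are passed to the vector store rather than collected in a set.

```python
import hashlib


def chunk_id(source_path: str, chunk_index: int, text: str) -> str:
    """Deterministic ID: the same path, index, and text always hash to the same value."""
    return hashlib.md5(f"{source_path}:{chunk_index}:{text}".encode()).hexdigest()


def dedupe(chunks: list[tuple[str, int, str]]) -> list[tuple[str, str]]:
    """Keep the first occurrence of each (path, index, text) triple."""
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for source_path, index, text in chunks:
        cid = chunk_id(source_path, index, text)
        if cid in seen:
            continue  # already queued, skip the duplicate
        seen.add(cid)
        unique.append((cid, text))
    return unique


chunks = [
    ("docs/a.md", 0, "Install with pip."),
    ("docs/a.md", 0, "Install with pip."),  # exact repeat, dropped
    ("docs/b.md", 0, "Install with pip."),  # same text, different file, kept
]
print(len(dedupe(chunks)))
```

Because the path and chunk index are part of the hash input, identical text in two different files produces two different IDs.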

Batch Processing

Indexing uses batched operations for performance:

# From src/db.py:190-301
batch_size = 100

# Batch upsert for embeddings
for i in range(0, len(pending_ids), batch_size):
    batch_ids = pending_ids[i : i + batch_size]
    batch_docs = pending_docs[i : i + batch_size]
    batch_metas = pending_metas[i : i + batch_size]
    
    dense_collection.upsert(
        ids=batch_ids,
        documents=batch_docs,
        metadatas=batch_metas,
    )

Batching reduces:

API call overhead
Memory usage
Indexing time

Implementation reference: src/db.py:190-301
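The slicing pattern in the excerpt can be factored into a small generator. This is a sketch: iter_batches is a hypothetical helper, not a function from know.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 100) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


ids = [f"chunk-{i}" for i in range(250)]
print([len(batch) for batch in iter_batches(ids)])
```

For 250 pending IDs and a batch size of 100 this produces batches of 100, 100, and 50 items, so three upsert calls replace 250 individual ones.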

Directory Management

know tracks watched directories in ~/.know_dirs:

# Add directories to watch
know add ~/Documents
know add ~/Projects/my-app

# List watched directories
know dirs

# Remove a directory
know remove ~/Documents

When you run know index, all watched directories are indexed together.

Recursive Scanning

By default, indexing is recursive. You can disable this:

# Index only top-level files
know index --no-recursive

# Recursive is the default
know index --recursive
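The difference between the two modes can be sketched with pathlib. How know walks directories internally is not shown here, so list_files is a hypothetical helper that only illustrates the behavior.

```python
import tempfile
from pathlib import Path


def list_files(root, recursive: bool = True) -> list[Path]:
    """List files under root, descending into subdirectories only if recursive."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "top.md").write_text("top level")
    (base / "sub").mkdir()
    (base / "sub" / "nested.md").write_text("nested")
    print(len(list_files(base, recursive=True)))
    print(len(list_files(base, recursive=False)))
```

The recursive listing finds both files, while the non-recursive one finds only top.md.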

Index Storage

know stores indexes in ./know_index/:

know_index/
├── chroma.sqlite3          # ChromaDB database
├── [uuid]/                 # Collection data
└── bm25/                   # BM25 index cache
    ├── indices.npz         # BM25 index
    ├── ids.json            # Document IDs
    └── meta.json           # Metadata

Storage Requirements

Approximate storage per 1000 documents (512 token chunks):

Dense vectors: ~5-10 MB (depends on embedding model)
BM25 index: ~2-5 MB (depends on vocabulary size)
Metadata: ~1 MB

Total: ~8-16 MB per 1000 documents
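These figures can be scaled to a rough estimate for a larger collection. The sketch below assumes storage grows linearly with document count, which is a simplification. estimate_storage_mb is a hypothetical helper.

```python
def estimate_storage_mb(num_documents: int) -> tuple[float, float]:
    """Return a (low, high) storage estimate in MB, scaled linearly from the per-1000 totals."""
    low_per_1000 = 5.0 + 2.0 + 1.0    # dense vectors + BM25 index + metadata (lower bounds)
    high_per_1000 = 10.0 + 5.0 + 1.0  # same components, upper bounds
    scale = num_documents / 1000
    return (low_per_1000 * scale, high_per_1000 * scale)


low, high = estimate_storage_mb(25_000)
print(f"~{low:.0f}-{high:.0f} MB")
```

For 25,000 documents this gives roughly 200-400 MB.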

Maintenance Operations

Pruning Orphaned Chunks

Remove chunks from deleted files:

# Show what would be pruned
know prune --dry

# Actually prune
know prune

# Show details
know prune --log

# From src/db.py:528-582
def prune(dry_run: bool = False, log: bool = False) -> tuple[int, int]:
    all_data = dense_collection.get(include=["metadatas"])
    
    orphan_ids: list[str] = []
    checked_paths: dict[str, bool] = {}  # cache path existence checks
    
    for chunk_id, meta in zip(all_data["ids"], all_data["metadatas"]):
        path = meta.get("path", "")
        if not path:
            orphan_ids.append(chunk_id)
            continue
        
        if path not in checked_paths:
            checked_paths[path] = Path(path).exists()
        
        if not checked_paths[path]:
            orphan_ids.append(chunk_id)

    # (excerpt ends here; the rest of the function is not shown)

Implementation reference: src/db.py:528-582
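The orphan check can be tried without a vector store. The sketch below applies the same logic to plain lists of IDs and metadata. find_orphans is a hypothetical helper, and the deletion step is left out because the excerpt does not show it.

```python
import tempfile
from pathlib import Path


def find_orphans(ids: list[str], metadatas: list[dict]) -> list[str]:
    """Return IDs of chunks whose source file is missing or unrecorded."""
    orphan_ids: list[str] = []
    checked_paths: dict[str, bool] = {}  # cache so each path is checked on disk once
    for chunk_id, meta in zip(ids, metadatas):
        path = (meta or {}).get("path", "")
        if not path:
            orphan_ids.append(chunk_id)  # no recorded source: treat as orphan
            continue
        if path not in checked_paths:
            checked_paths[path] = Path(path).exists()
        if not checked_paths[path]:
            orphan_ids.append(chunk_id)
    return orphan_ids


with tempfile.TemporaryDirectory() as tmp:
    kept = Path(tmp) / "kept.md"
    kept.write_text("still here")
    ids = ["c1", "c2", "c3"]
    metas = [{"path": str(kept)}, {"path": str(Path(tmp) / "deleted.md")}, {}]
    print(find_orphans(ids, metas))
```

Here c2 points at a file that no longer exists and c3 has no path at all, so both are reported as orphans.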

Resetting the Index

Clear everything and start fresh:

# WARNING: This deletes all indexed data
know reset

# Re-index everything
know index --force

Implementation reference: src/db.py:519-525

Advanced Indexing Options

Dry Run

Preview what would be indexed without making changes:

# See what would be indexed
know index --dry

# With detailed logs
know index --dry --log

# With filters
know index --dry --glob "*.md" --since 7d

Detailed Logging

Get verbose output during indexing:

# Show progress and details
know index --log

# Output includes:
# - Scanning progress
# - Document counts
# - Chunk counts
# - Skipped files
# - Processing time

Force Reindex

Bypass cache and reindex everything:

# Clear index and reindex all directories
know index --force

# Be careful: this deletes existing data!

Skip Reports

Generate detailed reports of skipped chunks:

# Write skip report to file
know index --report skip_report.json

Report includes:

Files skipped (unchanged)
Chunks skipped (already indexed)
Chunks skipped (duplicate content)
Collision details

Implementation reference: src/db.py:57-73, src/db.py:146-267

Extension Filtering

Control which file types to index:

# Default extensions (see SUPPORTED_EXTENSIONS)
know index

# Index only specific extensions
know index --ext py --ext js
know index --ext .md --ext .txt  # Leading dot optional
know index --ext "py,js,ts"      # Comma-separated

# Combine with other filters
know index --ext py --glob "src/**" --since 7d

Default supported extensions in src/db.py:29-54:

Documents: .md, .txt, .pdf, .docx, .pptx, .html
Code: .py, .js, .ts, .jsx, .tsx, .go, .rs, .java, .c, .cpp, .h, .hpp, .rb, .sh, .lua, .swift
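The accepted --ext spellings (repeated flags, optional leading dot, comma-separated values) suggest a normalization step along these lines. This is a sketch of one possible approach: normalize_extensions is a hypothetical helper and know's actual parsing may differ.

```python
def normalize_extensions(values: list[str]) -> set[str]:
    """Turn values such as 'py', '.md', or 'js,ts' into a set of dotted extensions."""
    extensions: set[str] = set()
    for value in values:
        for part in value.split(","):
            part = part.strip()
            if not part:
                continue  # ignore empty pieces such as a trailing comma
            extensions.add(part if part.startswith(".") else f".{part}")
    return extensions


print(sorted(normalize_extensions(["py", ".md", "js,ts"])))
```

All three spellings end up in the same dotted form, so they can be compared directly against a file's suffix.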

Optimization Guidelines

Small Documents

know index \
  --chunk-size 256 \
  --overlap 25

Notes, snippets, short docs

Long Documents

know index \
  --chunk-size 1024 \
  --overlap 100

Papers, books, articles

Code Files

know index \
  --chunk-size 512 \
  --overlap 50 \
  --ext py,js,ts

Source code repositories

Documentation

know index \
  --chunk-size 512 \
  --overlap 75 \
  --glob "**/*.md"

Markdown documentation

Configuration Recommendations

General Purpose (Default)

know index --chunk-size 512 --overlap 50

Good for mixed content types.

Technical Documentation

know index --chunk-size 512 --overlap 75 --glob "**/*.md"

Higher overlap preserves code examples and explanations.

Code Search

know index --chunk-size 400 --overlap 40 --ext py,js,ts,go

Slightly smaller chunks for function-level granularity.

Notes and Snippets

know index --chunk-size 256 --overlap 25

Smaller chunks for precise matching.

Academic Papers

know index --chunk-size 1024 --overlap 150

Larger chunks preserve complex arguments.
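For scripted indexing, the recommendations above can be kept in one table and turned into command lines. This is a convenience sketch: the preset names and build_command are hypothetical, while the flag values are copied from the recommendations.

```python
import shlex

# Preset names are illustrative; values mirror the recommendations above.
PRESETS = {
    "general": {"chunk_size": 512, "overlap": 50},
    "technical_docs": {"chunk_size": 512, "overlap": 75, "glob": "**/*.md"},
    "code_search": {"chunk_size": 400, "overlap": 40, "ext": "py,js,ts,go"},
    "notes": {"chunk_size": 256, "overlap": 25},
    "academic": {"chunk_size": 1024, "overlap": 150},
}


def build_command(preset: str) -> list[str]:
    """Build the argument list for `know index` from a named preset."""
    options = PRESETS[preset]
    command = [
        "know", "index",
        "--chunk-size", str(options["chunk_size"]),
        "--overlap", str(options["overlap"]),
    ]
    if "glob" in options:
        command += ["--glob", options["glob"]]
    if "ext" in options:
        command += ["--ext", options["ext"]]
    return command


for name in PRESETS:
    print(shlex.join(build_command(name)))
```

The returned list can be passed to subprocess.run(build_command("notes"), check=True) on a machine where know is installed.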

Next Steps

Search Modes

Learn about dense, BM25, and hybrid search

Output Formats

Explore different output formats
