Overview

Airweave automatically chunks documents before embedding to ensure:
  • Semantic coherence: Chunks represent complete ideas
  • Token limits: Chunks fit within embedding model limits (8192 tokens max)
  • Search quality: Granular chunks return precise results
Two specialized chunkers handle different content types:

Semantic Chunker

For natural language content (docs, emails, support tickets). Uses embedding similarity to find topic boundaries.

Code Chunker

For source code files. Uses AST parsing to chunk at function/class boundaries.

Semantic Chunker

The SemanticChunker uses local embedding models to detect semantic boundaries without external API calls.

How It Works

1. Sentence Splitting

Splits document into sentences using delimiters:
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]
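As a rough sketch of this step (the helper below is illustrative, not Airweave's actual splitter), delimiter-based splitting that keeps each delimiter with the preceding sentence can be written as:

```python
import re

# Delimiters mirrored from the configuration above.
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]

def split_sentences(text: str) -> list[str]:
    """Split text on the delimiters, keeping each delimiter with the
    previous sentence (the "prev" behavior described under Advanced
    Features). Illustrative only."""
    pattern = "|".join(re.escape(d) for d in SENTENCE_DELIMITERS)
    parts = re.split(f"({pattern})", text)
    sentences, buf = [], ""
    for part in parts:
        buf += part
        if part in SENTENCE_DELIMITERS:
            sentences.append(buf)
            buf = ""
    if buf:
        sentences.append(buf)
    return sentences

print(split_sentences("First idea. Second idea! A question? Last line"))
```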

2. Embedding Similarity

Computes embeddings for each sentence using a lightweight local model:
# Default: minishlab/potion-base-8M (8M params, ~0.5s/doc)
EMBEDDING_MODEL = "minishlab/potion-base-8M"
Compares similarity in a sliding window:
SIMILARITY_WINDOW = 10  # Compare 10 consecutive sentences

3. Boundary Detection

Identifies topic shifts when similarity drops below threshold:
SIMILARITY_THRESHOLD = 0.01  # Lower = larger chunks
Creates semantic groups (chunks) of related sentences.
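A simplified sketch of the boundary-detection step (a comparison window of 1 instead of the chunker's 10, and illustrative function names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def group_sentences(embeddings, threshold=0.01):
    """Start a new group whenever similarity between consecutive
    sentence embeddings drops below the threshold. Simplified stand-in
    for the real sliding-window comparison."""
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])
        groups[-1].append(i)
    return groups

# Three sentence embeddings: the first two point the same way, the third is orthogonal.
print(group_sentences([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.5))
# [[0, 1], [2]]
```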

4. Token Recounting

Recounts tokens using OpenAI’s tiktoken (cl100k_base) for accuracy:
chunk.token_count = len(
    tiktoken_tokenizer.encode(chunk.text, allowed_special="all")
)

5. Safety Net

Splits any oversized chunks (>8192 tokens) at exact token boundaries:
if chunk.token_count > MAX_TOKENS_PER_CHUNK:  # 8192
    # Use TokenChunker to force-split
    split_chunks = token_chunker.chunk_batch([chunk.text])
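Conceptually, the force-split slices the token sequence into fixed-size windows. A minimal sketch (illustrative; Chonkie's TokenChunker operates on text, not raw token lists):

```python
def force_split(tokens, max_tokens=8192):
    """Slice a token sequence into pieces of at most max_tokens each.
    Illustrative stand-in for the TokenChunker safety net."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# 10 tokens with a limit of 4 -> pieces of 4, 4, and 2 tokens.
print([len(piece) for piece in force_split(list(range(10)), max_tokens=4)])
# [4, 4, 2]
```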

Configuration

All constants are defined in platform/chunkers/semantic.py:
MAX_TOKENS_PER_CHUNK (int, default: 8192)
Hard limit matching OpenAI’s text-embedding-3-small limit. Enforced by the TokenChunker safety net.
SEMANTIC_CHUNK_SIZE (int, default: 4096)
Target size for semantic groups (soft limit). Tradeoff:
  • Larger = more context per chunk, fewer API calls
  • Smaller = more precise search results, more chunks
OVERLAP_TOKENS (int, default: 128)
Token overlap between consecutive chunks (reserved for future use)
EMBEDDING_MODEL (string, default: "minishlab/potion-base-8M")
Local embedding model for chunking decisions. Available options (sorted by speed):
Model2Vec (included with chonkie[semantic]):
  • minishlab/potion-base-8M - 8M params, ~0.5s/doc, good quality ⭐
  • minishlab/potion-base-32M - 32M params, ~1s/doc, better quality
  • minishlab/potion-base-128M - 128M params, ~2-3s/doc, best Model2Vec quality
SentenceTransformer (requires: poetry add sentence-transformers):
  • all-MiniLM-L6-v2 - 33M params, ~1-2s/doc, good quality
  • all-MiniLM-L12-v2 - 66M params, ~2-3s/doc, better quality
  • all-mpnet-base-v2 - 110M params, ~3-5s/doc, best quality
This model is only used for chunking (finding semantic boundaries). Final embeddings use your configured DENSE_EMBEDDER (OpenAI/Mistral/Local).
SIMILARITY_THRESHOLD (float, default: 0.01)
Threshold for detecting topic boundaries (0-1 range). Tradeoff:
  • Lower (0.001-0.01): Larger chunks, fewer splits, more context
  • Higher (0.05-0.1): Smaller chunks, more splits, precise boundaries
The default (0.01) balances context and granularity.
SIMILARITY_WINDOW (int, default: 10)
Number of consecutive sentences to compare for similarity. A larger window means smoother chunking but slower processing.
MIN_SENTENCES_PER_CHUNK (int, default: 1)
Minimum sentences per chunk (prevents tiny fragments)
MIN_CHARACTERS_PER_SENTENCE (int, default: 24)
Minimum characters for a span to count as a sentence

Advanced Features

Similarity Smoothing

Smooths similarity scores to reduce noisy boundaries:
FILTER_WINDOW = 5         # Window length for filter
FILTER_POLYORDER = 3      # Polynomial order
FILTER_TOLERANCE = 0.2    # Boundary detection tolerance
Reduces over-segmentation from minor similarity fluctuations.
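The effect can be illustrated with a simple moving-average smoother (a stand-in for the polynomial filter configured above; the function below is not from the Airweave codebase):

```python
def smooth(scores, window=5):
    """Moving-average smoothing of a similarity-score series.
    Stand-in for the chunker's actual polynomial filter."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

# A single noisy dip gets spread out instead of triggering a hard boundary.
print(smooth([0.9, 0.9, 0.1, 0.9, 0.9], window=3))
```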
Skip Window

Merges non-consecutive similar groups:
SKIP_WINDOW = 0  # 0=disabled, >0=merge similar groups
Currently disabled (0). Enable to merge related sections separated by short transitions.
Delimiter Handling

Configures how sentence delimiters are preserved:
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]
INCLUDE_DELIMITER = "prev"  # Include with previous sentence
Options: "prev", "next", "none"

Two-Stage Pipeline

The semantic chunker uses a two-stage approach:
  • Stage 1: Semantic boundary detection (local embedding model)
  • Stage 1.5: Token recounting with tiktoken (OpenAI compatibility)
  • Stage 2: Safety net for oversized chunks (force-split at token boundaries)
The TokenChunker safety net guarantees all chunks are ≤8192 tokens, even if semantic chunking produces large groups.

Example Workflow

from airweave.platform.chunkers.semantic import SemanticChunker

chunker = SemanticChunker()  # Singleton instance

# Batch processing
documents = [
    "Long document about machine learning...",
    "Technical guide to databases...",
    "Product requirements document..."
]

results = await chunker.chunk_batch(documents)

# results[0] = List of chunks for document 0
for chunk in results[0]:
    print(f"Chunk: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['token_count']}")
    print(f"Range: {chunk['start_index']}-{chunk['end_index']}")
    print()

Code Chunker

The CodeChunker uses AST (Abstract Syntax Tree) parsing to chunk at logical code boundaries.

How It Works

1. Language Detection

Auto-detects programming language using Magika:
language="auto"  # Supports Python, JS, Java, Go, etc.

2. AST Parsing

Parses code into syntax tree nodes:
  • Functions
  • Classes
  • Methods
  • Modules
Chunks at natural boundaries between nodes.
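The idea can be sketched with Python's stdlib ast module (a simplified stand-in for the tree-sitter-based parsing the CodeChunker actually uses):

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source at top-level node boundaries (functions,
    classes, etc.). Illustrative only: the real chunker handles many
    languages and groups nodes up to a target size."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # end_lineno is available on AST nodes since Python 3.8.
        chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = "def f():\n    return 1\n\nclass C:\n    pass\n"
print(chunk_python_source(src))
# ['def f():\n    return 1', 'class C:\n    pass']
```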

3. Token Recounting

Recounts tokens with tiktoken:
# Chonkie's CodeChunker underestimates tokens
# (counts AST nodes, not whitespace/gaps)
chunk.token_count = len(
    tiktoken_tokenizer.encode(chunk.text, allowed_special="all")
)

4. Safety Net

Splits oversized chunks (>8192 tokens) at token boundaries:
if chunk.token_count > MAX_TOKENS_PER_CHUNK:
    split_chunks = token_chunker.chunk_batch([chunk.text])

Configuration

MAX_TOKENS_PER_CHUNK (int, default: 8192)
Hard limit enforced by the TokenChunker safety net
CHUNK_SIZE (int, default: 2048)
Target chunk size for AST groups. Note: this can be exceeded by large AST nodes (e.g., a 3,000-line function); the safety net handles this.
TOKENIZER (string, default: "cl100k_base")
OpenAI’s tiktoken encoding for accurate token counting

Supported Languages

The CodeChunker auto-detects and supports:
  • Python
  • JavaScript / TypeScript
  • Java
  • Go
  • C / C++
  • Rust
  • Ruby
  • PHP
  • And more via tree-sitter grammars
For unsupported languages, the chunker falls back to token-based splitting.

Example Workflow

from airweave.platform.chunkers.code import CodeChunker

chunker = CodeChunker()  # Singleton instance

# Batch processing
code_files = [
    "def calculate_total(items):\n    return sum(item.price for item in items)\n\nclass Order:\n    ...",
    "function processPayment(amount) {\n    // ...\n}\n\nclass PaymentProcessor {\n    ..."
]

results = await chunker.chunk_batch(code_files)

# results[0] = List of chunks for code_files[0]
for chunk in results[0]:
    print(f"Chunk: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['token_count']}")

Advantages Over Token-Based Chunking

AST-based (Code Chunker):
# Chunk 1: Complete function
def calculate_total(items):
    subtotal = sum(item.price for item in items)
    tax = subtotal * 0.08
    return subtotal + tax

# Chunk 2: Complete class
class Order:
    def __init__(self, items):
        self.items = items
Token-based (naive):
# Chunk 1: Incomplete function
def calculate_total(items):
    subtotal = sum(item.price for item in items)
    tax = subtotal * 0.08

# Chunk 2: Orphaned code
    return subtotal + tax

class Order:
    def __init__(self, items):

Token Counting

Both chunkers use tiktoken for accurate OpenAI token counting:
from airweave.platform.tokenizers import get_tokenizer

tokenizer = get_tokenizer("cl100k_base")

# Count tokens
token_count = len(tokenizer.encode(
    text, 
    allowed_special="all"  # Handle special tokens like <|endoftext|>
))

Performance Considerations

Singleton Pattern

Both chunkers use singletons to avoid reloading models:
# Models loaded once per pod, shared across all syncs
chunker = SemanticChunker()  # Returns same instance
Benefits:
  • No model reload overhead (~2-3s per sync)
  • Lower memory usage
  • Faster sync throughput
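A minimal sketch of the pattern (the class body below is illustrative, not the actual Airweave implementation):

```python
class SingletonChunker:
    """Singleton via __new__: the expensive model load runs only on
    first construction; later constructions return the same instance.
    Illustrative stand-in for Airweave's chunker classes."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            # Expensive one-time setup (e.g., loading the embedding model).
            cls._instance.model = "loaded-once"
        return cls._instance

a = SingletonChunker()
b = SingletonChunker()
print(a is b)
# True
```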

Async Thread Pool

Chonkie chunkers are synchronous, so we use thread pools:
from airweave.platform.sync.async_helpers import run_in_thread_pool

# Prevents blocking the event loop
results = await run_in_thread_pool(
    self._semantic_chunker.chunk_batch, 
    texts
)
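The same pattern can be reproduced with the standard library alone via asyncio.to_thread (the chunker function below is a stand-in, not the Airweave helper):

```python
import asyncio

def chunk_batch_sync(texts):
    """Stand-in for a synchronous (blocking) Chonkie chunk_batch call."""
    return [[text[:10]] for text in texts]

async def chunk_without_blocking(texts):
    # Run the blocking call in a worker thread so the event loop
    # keeps serving other coroutines in the meantime.
    return await asyncio.to_thread(chunk_batch_sync, texts)

print(asyncio.run(chunk_without_blocking(["hello world", "foo"])))
# [['hello worl'], ['foo']]
```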

Batch Processing

# Process multiple documents in one call
results = await chunker.chunk_batch([
    document_1,
    document_2,
    document_3
])

# Returns: [
#   [chunk1_doc1, chunk2_doc1],  # Document 1 chunks
#   [chunk1_doc2],                # Document 2 chunks
#   [chunk1_doc3, chunk2_doc3, chunk3_doc3]  # Document 3 chunks
# ]

Troubleshooting

Symptom:
SyncFailureError: PROGRAMMING ERROR: Chunk has 9500 tokens after 
TokenChunker fallback (max: 8192). TokenChunker failed to enforce hard limit.
Cause: Bug in the TokenChunker safety net (should never happen).
Solution: This indicates a chunker bug. Report it to the Airweave team.
Symptom:
[CodeChunker] Skipping empty chunk - this may indicate a chunker bug
Cause: Edge case in AST parsing producing empty nodes.
Solution: Empty chunks are automatically filtered. No action needed.
Symptom: Code is chunked poorly, as plain text.
Cause: Unsupported language or ambiguous file extension.
Solution:
  1. Check whether the language is supported (Python, JS, Java, Go, etc.)
  2. If not, the chunker falls back to token-based chunking (still works, but less optimal)
Symptom: First sync takes 2-3 seconds longer.
Cause: Lazy model initialization on first use.
Solution: This is expected. Subsequent syncs reuse the loaded model (singleton).
Symptom: Chunks are too granular or too large.
Solution: Adjust the semantic chunker settings in platform/chunkers/semantic.py.
Fewer, larger chunks:
SEMANTIC_CHUNK_SIZE = 6144  # Increase from 4096
SIMILARITY_THRESHOLD = 0.005  # Decrease from 0.01
More, smaller chunks:
SEMANTIC_CHUNK_SIZE = 2048  # Decrease from 4096
SIMILARITY_THRESHOLD = 0.05  # Increase from 0.01

Next Steps

Embeddings

Configure embedding models for chunked content

Transformers

Transform entities before chunking

Search

Use chunks in hybrid search queries

Development

Learn about chunking internals
