## Why Chunking Matters

Large language models have limited context windows. When your document exceeds this limit, LangExtract automatically chunks the text into smaller segments for processing.

An effective chunking strategy balances:

- **Recall**: finding all relevant entities
- **Precision**: avoiding false positives
- **Cost**: minimizing API token usage
- **Speed**: maximizing parallel processing

For documents under ~4,000 characters with gemini-2.5-flash, chunking is typically unnecessary. For longer documents, strategic chunking is critical.
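As a trivial pre-flight check, you can test the rule of thumb above before tuning any chunking parameters (the 4,000-character figure is the heuristic stated here, not a hard API limit, and `needs_chunking` is an illustrative helper, not part of LangExtract):

```python
CHUNKING_THRESHOLD = 4_000  # Heuristic: below this, chunking is typically unnecessary

def needs_chunking(text: str, threshold: int = CHUNKING_THRESHOLD) -> bool:
    """Quick check for whether a document will likely be split into chunks."""
    return len(text) > threshold

print(needs_chunking("A short clinical note."))  # False
print(needs_chunking("x" * 150_000))             # True
```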
## The max_char_buffer Parameter

The `max_char_buffer` parameter controls the maximum size of each text chunk:

```python
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,  # Maximum ~1000 characters per chunk
)
```
## How Chunking Works

LangExtract uses intelligent sentence-aware chunking (see `chunking.py:343-507`):

- **Sentence boundaries**: chunks respect sentence boundaries when possible
- **Newline preservation**: breaks at newlines for structured text (e.g., poetry, lists)
- **Token overflow**: single tokens exceeding the buffer become their own chunk
- **Maximum packing**: multiple sentences pack into chunks up to the limit
### Example Chunking Behavior

Given this poem:

```
No man is an island,
Entire of itself,
Every man is a piece of the continent,
A part of the main.
```

With `max_char_buffer=40`:

```
Chunk 1: "No man is an island,\nEntire of itself," (38 chars)
Chunk 2: "Every man is a piece of the continent," (38 chars)
Chunk 3: "A part of the main." (19 chars)
```

See `chunking.py:349-383` for detailed chunking logic.
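The chunking behavior shown above can be reproduced with a toy greedy packer. This is a simplified sketch of the packing idea, not LangExtract's actual `ChunkIterator` (which is token-aware):

```python
def pack_sentences(sentences, max_char_buffer):
    """Greedily pack sentences into chunks of at most max_char_buffer chars.

    A sentence that doesn't fit starts a new chunk; an oversized
    sentence becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current}\n{sentence}" if current else sentence
        if len(candidate) <= max_char_buffer:
            current = candidate  # Pack another sentence into the current chunk
        else:
            if current:
                chunks.append(current)
            current = sentence   # Start a fresh chunk with this sentence
    if current:
        chunks.append(current)
    return chunks

poem = [
    "No man is an island,",
    "Entire of itself,",
    "Every man is a piece of the continent,",
    "A part of the main.",
]
for i, chunk in enumerate(pack_sentences(poem, 40), 1):
    print(f"Chunk {i}: {chunk!r} ({len(chunk)} chars)")
```

Running this reproduces the three chunks from the example above: the first two lines pack together (38 chars), while each remaining line starts a new chunk.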
## Choosing max_char_buffer

### Smaller Buffers (500-1500 chars)

**Advantages:**

- Higher precision (focused context)
- Better for dense documents
- Reduced hallucination risk

**Disadvantages:**

- More API calls (higher cost)
- May miss cross-chunk entities
- Slower without parallelization

**Use when:**

- Extracting from dense medical records
- Processing structured data (tables, lists)
- High precision is critical
```python
result = lx.extract(
    text_or_documents=medical_record,
    prompt_description="Extract medications, dosages, and frequencies",
    examples=examples,
    max_char_buffer=1000,  # Focused chunks for precision
)
```
### Larger Buffers (2000-5000 chars)

**Advantages:**

- Fewer API calls (lower cost)
- Better cross-reference resolution
- Faster with fewer chunks

**Disadvantages:**

- May reduce precision
- "Needle in a haystack" problem
- Higher memory usage

**Use when:**

- Processing narrative text
- Extracting sparse entities
- Cost optimization is important
```python
result = lx.extract(
    text_or_documents=novel_text,
    prompt_description="Extract character names and locations",
    examples=examples,
    max_char_buffer=4000,  # Larger chunks for sparse entities
)
```
## Recommended Values

| Task Type | max_char_buffer | Example Documents |
|---|---|---|
| High precision | 500-1500 characters | Medical records, legal documents, structured data |
| Balanced | 1500-2500 characters | News articles, research papers, general text |
| High recall | 2500-4000 characters | Novels, narrative text, sparse entities |
| Cost optimization | 4000-6000 characters | Reduce API calls for large-scale processing |
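These recommendations can be encoded as a small lookup helper if you select buffer sizes programmatically. Everything here (`BUFFER_RANGES`, `suggest_buffer`) is hypothetical glue code, not part of LangExtract:

```python
# Recommended buffer ranges from the table above, by task profile
BUFFER_RANGES = {
    "high_precision": (500, 1500),
    "balanced": (1500, 2500),
    "high_recall": (2500, 4000),
    "cost_optimized": (4000, 6000),
}

def suggest_buffer(task: str) -> int:
    """Return the midpoint of the recommended range for a task profile."""
    low, high = BUFFER_RANGES[task]
    return (low + high) // 2

print(suggest_buffer("balanced"))  # 2000
```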
## Context Windows

The `context_window_chars` parameter includes text from the previous chunk to help with coreference resolution:

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=2000,
    context_window_chars=500,  # Include 500 chars from previous chunk
)
```
### How Context Windows Work

See `annotation.py:358-361` for context-aware prompt building:

```python
# Chunk 1: characters 0-2000
chunk_1_text = document[0:2000]
prompt_1 = generate_prompt(chunk_1_text)  # No context

# Chunk 2: characters 2000-4000
context = document[1500:2000]  # Last 500 chars of previous chunk
chunk_2_text = document[2000:4000]
prompt_2 = generate_prompt(context + chunk_2_text)  # With context
```
### Example: Coreference Resolution

Without a context window:

```
Chunk 1: "Dr. Smith examined the patient. She noted elevated blood pressure."
Chunk 2: "She prescribed lisinopril 10mg daily."  # Who is "She"?
```

Extraction from Chunk 2 may not know "She" refers to "Dr. Smith".

With a context window:

```
Context: "...Dr. Smith examined the patient. She noted elevated blood pressure."
Chunk 2: "She prescribed lisinopril 10mg daily."
```

The LLM can now resolve "She" → "Dr. Smith".
### When to Use Context Windows

✅ Use context windows when:

- Documents use pronouns extensively
- Entities span chunk boundaries
- Cross-chunk relationships matter
- Processing narrative text

⚠️ Context windows increase token usage:

- Each chunk reprocesses context characters
- May increase API costs by 10-30%
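The overhead is easy to estimate: every chunk after the first re-sends `context_window_chars` extra characters. A rough sketch (assuming full, evenly sized chunks):

```python
def context_overhead(total_chars, max_char_buffer, context_window_chars):
    """Approximate fraction of extra characters sent due to context windows."""
    num_chunks = -(-total_chars // max_char_buffer)  # Ceiling division
    extra = (num_chunks - 1) * context_window_chars  # First chunk has no context
    return extra / total_chars

# 100k chars in 2000-char chunks with 500-char context:
# 50 chunks, 49 of which carry 500 repeated chars
print(f"{context_overhead(100_000, 2000, 500):.1%}")  # 24.5%
```

With these settings the overhead lands near the top of the 10-30% range quoted above; shrinking `context_window_chars` relative to `max_char_buffer` brings it down proportionally.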
## Multiple Extraction Passes

The `extraction_passes` parameter runs sequential independent extractions and merges non-overlapping results:

```python
result = lx.extract(
    text_or_documents=long_document,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    extraction_passes=3,  # Run extraction 3 times
)
```
### How Multiple Passes Work

See `annotation.py:442-525` for the sequential pass implementation:

1. **Pass 1**: extract entities from all chunks → result set A
2. **Pass 2**: extract again from all chunks → result set B
3. **Pass 3**: extract again from all chunks → result set C
4. **Merge**: combine A, B, C using a first-pass-wins strategy
### Merge Strategy

From `annotation.py:46-84`:

```python
def _merge_non_overlapping_extractions(all_extractions):
    """Merge extractions from multiple passes.

    When extractions overlap in character positions,
    the extraction from the earlier pass is kept.
    """
    merged = list(all_extractions[0])  # Start with pass 1
    for pass_extractions in all_extractions[1:]:
        for extraction in pass_extractions:
            if not overlaps_with_any(extraction, merged):
                merged.append(extraction)  # Add non-overlapping
    return merged
```

Overlapping extractions are discarded from later passes (earlier passes win).
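To make the first-pass-wins rule concrete, here is a self-contained sketch using plain `(start, end)` character intervals as stand-ins for extraction objects (the helper names are illustrative, not LangExtract internals):

```python
def overlaps(a, b):
    """Two (start, end) character intervals overlap if they share any position."""
    return a[0] < b[1] and b[0] < a[1]

def merge_first_pass_wins(all_passes):
    """Keep every extraction from pass 1; add later-pass extractions
    only when they don't overlap anything already kept."""
    merged = list(all_passes[0])
    for pass_extractions in all_passes[1:]:
        for ex in pass_extractions:
            if not any(overlaps(ex, kept) for kept in merged):
                merged.append(ex)
    return merged

pass_a = [(0, 10), (20, 30)]
pass_b = [(5, 15), (40, 50)]  # (5, 15) overlaps (0, 10) and is dropped
print(merge_first_pass_wins([pass_a, pass_b]))  # [(0, 10), (20, 30), (40, 50)]
```

Note the asymmetry: pass 2's `(5, 15)` is discarded even if it were the better extraction, because pass 1 already claimed that span.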
### When to Use Multiple Passes

Use `extraction_passes=2-3` when:

- Processing very long documents (>10,000 characters)
- Recall is critical (medical records, legal discovery)
- Entities are sparse across the document
- You're willing to pay 2-3x the API cost
### Example: Romeo and Juliet Extraction

From the README example (lines 142-152):

```python
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # Multiple passes for recall
    max_workers=20,        # Parallel processing
    max_char_buffer=1000,  # Smaller chunks for precision
)
```

This configuration:

- Uses small chunks (1000 chars) for precision
- Runs 3 independent passes for recall
- Parallelizes with 20 workers for speed
- Processes the full novel (147,843 chars)

**Cost consideration**: `extraction_passes=3` means the document is processed 3 times, tripling token costs.
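A back-of-the-envelope check of this configuration's workload (assuming evenly filled chunks; the real chunk count varies with sentence boundaries):

```python
import math

total_chars = 147_843  # Full Romeo and Juliet text
max_char_buffer = 1000
extraction_passes = 3
max_workers = 20

chunks_per_pass = math.ceil(total_chars / max_char_buffer)  # 148
total_chunks = chunks_per_pass * extraction_passes          # 444
batches_per_pass = math.ceil(chunks_per_pass / max_workers) # 8 waves of workers

print(f"{chunks_per_pass} chunks/pass, {total_chunks} total chunk requests")
```

So the three passes turn ~148 chunks into ~444 model calls, which is exactly the 3x token-cost multiplier noted above.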
## Parallel Processing

The `max_workers` parameter controls parallel processing:

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    batch_length=20,  # Process 20 chunks per batch
    max_workers=20,   # Use 20 parallel workers
)
```
### Batch Processing

From `extraction.py:111-115`:

```python
batch_length: int = 10  # Number of chunks per batch
max_workers: int = 10   # Maximum parallel workers

# Effective parallelization = min(batch_length, max_workers)
```

If `batch_length < max_workers`, only `batch_length` workers are used. Set `batch_length >= max_workers` for full parallelization.
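That interaction can be written out directly (a trivial illustrative helper, not part of the LangExtract API):

```python
def effective_workers(batch_length: int, max_workers: int) -> int:
    """Parallelism is capped by both the batch size and the worker pool."""
    return min(batch_length, max_workers)

print(effective_workers(10, 20))  # 10 — batch_length limits parallelism
print(effective_workers(50, 20))  # 20 — worker pool limits parallelism
```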
### Optimizing for Speed

For fastest processing:

```python
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=2000,  # Reasonable chunk size
    batch_length=50,       # Large batches
    max_workers=50,        # Maximum parallelization
)
```
### API Rate Limits

For large-scale processing, use Tier 2 Gemini quota to increase throughput and avoid rate limits. See the rate-limit documentation.
## Chunking Best Practices

### 1. Start Conservative

Begin with small chunks and a single pass:

```python
# Initial configuration
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    extraction_passes=1,
)
```

Measure precision and recall, then adjust.
### 2. Monitor Token Usage

Estimate costs before scaling:

```python
total_chars = len(document_text)
chunks_per_pass = total_chars / max_char_buffer
total_chunks = chunks_per_pass * extraction_passes

print(f"Estimated chunks to process: {total_chunks}")
print(f"Estimated tokens (rough): {total_chunks * max_char_buffer / 4}")
```
### 3. Tune Based on Document Type

| Document Type | Recommended Config |
|---|---|
| Medical records | `max_char_buffer=1000, extraction_passes=1` |
| Legal contracts | `max_char_buffer=1500, context_window_chars=300` |
| Research papers | `max_char_buffer=2000, extraction_passes=1` |
| Novels | `max_char_buffer=1000, extraction_passes=3, max_workers=20` |
| News articles | `max_char_buffer=2500, extraction_passes=1` |
### 4. Use Context Windows Selectively

Enable them only when necessary:

```python
# For narrative text with pronouns
result = lx.extract(
    text_or_documents=narrative_text,
    max_char_buffer=2000,
    context_window_chars=500,  # Enable context
)

# For structured data (no pronouns)
result = lx.extract(
    text_or_documents=tabular_data,
    max_char_buffer=1000,
    # context_window_chars not needed
)
```
### 5. Balance Recall and Cost

Use multiple passes strategically:

```python
# High-value documents: maximize recall
result = lx.extract(
    text_or_documents=important_document,
    max_char_buffer=1000,
    extraction_passes=3,  # High recall
)

# Bulk processing: optimize cost
result = lx.extract(
    text_or_documents=bulk_documents,
    max_char_buffer=3000,  # Fewer chunks
    extraction_passes=1,   # Single pass
)
```
## Advanced: Custom Tokenization

Provide a custom tokenizer for specialized chunking:

```python
from langextract.core import tokenizer as tokenizer_lib

custom_tokenizer = tokenizer_lib.RegexTokenizer()

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    tokenizer=custom_tokenizer,
)
```

See `chunking.py:385-421` for ChunkIterator tokenizer integration.
## Debugging Chunking Issues

### Inspect Chunks

```python
from langextract import chunking
from langextract.core import tokenizer as tokenizer_lib

tokenizer = tokenizer_lib.RegexTokenizer()
tokenized = tokenizer.tokenize(long_text)

chunk_iter = chunking.ChunkIterator(
    text=tokenized,
    max_char_buffer=1000,
    tokenizer_impl=tokenizer,
)

for i, chunk in enumerate(chunk_iter):
    print(f"Chunk {i}: {len(chunk.chunk_text)} chars")
    print(f"Preview: {chunk.chunk_text[:100]}...")
    print(f"Char interval: {chunk.char_interval}")
    print()
```
### Measure Processing Time

```python
import time

start_time = time.time()
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    show_progress=True,  # Enable progress bar
)
elapsed = time.time() - start_time

print(f"Processed {len(text)} chars in {elapsed:.2f}s")
print(f"Found {len(result.extractions)} extractions")
```
## Next Steps

- **Source Grounding**: learn how chunking affects alignment
- **Prompt Engineering**: optimize prompts for chunked extraction