
Why Chunking Matters

Large language models have limited context windows. When a document exceeds this limit, LangExtract automatically chunks the text into smaller segments for processing. An effective chunking strategy balances:
  • Recall: Finding all relevant entities
  • Precision: Avoiding false positives
  • Cost: Minimizing API token usage
  • Speed: Maximizing parallel processing
For documents under ~4,000 characters with gemini-2.5-flash, chunking is typically unnecessary. For longer documents, strategic chunking is critical.

The max_char_buffer Parameter

The max_char_buffer parameter controls the maximum size of each text chunk:
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,  # Maximum ~1000 characters per chunk
)

How Chunking Works

LangExtract uses intelligent sentence-aware chunking (see chunking.py:343-507):
  1. Sentence boundaries: Chunks respect sentence boundaries when possible
  2. Newline preservation: Breaks at newlines for structured text (e.g., poetry, lists)
  3. Token overflow: Single tokens exceeding the buffer become their own chunk
  4. Maximum packing: Multiple sentences pack into chunks up to the limit
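
The packing behavior described above can be sketched in plain Python. This is a simplified illustration, not LangExtract's actual implementation (see chunking.py for the real logic); it packs sentences greedily until the next one would exceed the buffer:

```python
def pack_sentences(sentences, max_char_buffer):
    """Greedily pack sentences into chunks of at most max_char_buffer chars."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_char_buffer:
            current = candidate          # Sentence fits: keep packing
        elif current:
            chunks.append(current)       # Buffer full: start a new chunk
            current = sentence
        else:
            # A single sentence longer than the buffer becomes its own chunk
            chunks.append(sentence)
    if current:
        chunks.append(current)
    return chunks
```

Running this on the poem below with a 40-character buffer produces three chunks, matching the example that follows.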

Example Chunking Behavior

Given this poem:
No man is an island,
Entire of itself,
Every man is a piece of the continent,
A part of the main.
With max_char_buffer=40:
  • Chunk 1: "No man is an island,\nEntire of itself," (38 chars)
  • Chunk 2: "Every man is a piece of the continent," (38 chars)
  • Chunk 3: "A part of the main." (19 chars)
See chunking.py:349-383 for detailed chunking logic.

Choosing max_char_buffer

Smaller Buffers (500-1500 chars)

Advantages:
  • Higher precision (focused context)
  • Better for dense documents
  • Reduced hallucination risk
Disadvantages:
  • More API calls (higher cost)
  • May miss cross-chunk entities
  • Slower without parallelization
Use when:
  • Extracting from dense medical records
  • Processing structured data (tables, lists)
  • High precision is critical
result = lx.extract(
    text_or_documents=medical_record,
    prompt_description="Extract medications, dosages, and frequencies",
    examples=examples,
    max_char_buffer=1000,  # Focused chunks for precision
)

Larger Buffers (2000-5000 chars)

Advantages:
  • Fewer API calls (lower cost)
  • Better cross-reference resolution
  • Faster with fewer chunks
Disadvantages:
  • May reduce precision
  • “Needle in haystack” problem
  • Higher memory usage
Use when:
  • Processing narrative text
  • Extracting sparse entities
  • Cost optimization is important
result = lx.extract(
    text_or_documents=novel_text,
    prompt_description="Extract character names and locations",
    examples=examples,
    max_char_buffer=4000,  # Larger chunks for sparse entities
)

High Precision Tasks

500-1500 characters: Medical records, legal documents, structured data

Balanced Tasks

1500-2500 characters: News articles, research papers, general text

High Recall Tasks

2500-4000 characters: Novels, narrative text, sparse entities

Cost Optimization

4000-6000 characters: Reduce API calls for large-scale processing

Context Windows

The context_window_chars parameter includes text from the previous chunk to help with coreference resolution:
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=2000,
    context_window_chars=500,  # Include 500 chars from previous chunk
)

How Context Windows Work

See annotation.py:358-361 for context-aware prompt building:
# Chunk 1: Characters 0-2000
chunk_1_text = document[0:2000]
prompt_1 = generate_prompt(chunk_1_text)  # No context

# Chunk 2: Characters 2000-4000
context = document[1500:2000]  # Last 500 chars of previous chunk
chunk_2_text = document[2000:4000]
prompt_2 = generate_prompt(context + chunk_2_text)  # With context

Example: Coreference Resolution

Without context window:
Chunk 1: "Dr. Smith examined the patient. She noted elevated blood pressure."
Chunk 2: "She prescribed lisinopril 10mg daily."  # Who is "She"?
Extraction from Chunk 2 may not know “She” refers to “Dr. Smith”. With context window:
Context: "...Dr. Smith examined the patient. She noted elevated blood pressure."
Chunk 2: "She prescribed lisinopril 10mg daily."
The LLM can now resolve “She” → “Dr. Smith”.

When to Use Context Windows

✅ Use context windows when:
  • Documents use pronouns extensively
  • Entities span chunk boundaries
  • Cross-chunk relationships matter
  • Processing narrative text
⚠️ Context windows increase token usage:
  • Each chunk reprocesses context characters
  • May increase API costs by 10-30%
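
The overhead is easy to estimate up front. This is illustrative arithmetic only, not a LangExtract API call: each chunk after the first re-sends context_window_chars of already-processed text.

```python
# Each chunk after the first re-sends context_window_chars of input text,
# so the per-chunk input overhead is roughly context / buffer.
max_char_buffer = 2000
context_window_chars = 500

overhead_ratio = context_window_chars / max_char_buffer
print(f"Extra input per chunk: ~{overhead_ratio:.0%}")  # ~25%
```

With the settings shown earlier (2000-char buffer, 500-char context), the overhead lands at roughly 25%, within the 10-30% range noted above.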

Multiple Extraction Passes

The extraction_passes parameter runs sequential independent extractions and merges non-overlapping results:
result = lx.extract(
    text_or_documents=long_document,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    extraction_passes=3,  # Run extraction 3 times
)

How Multiple Passes Work

See annotation.py:442-525 for sequential pass implementation:
  1. Pass 1: Extract entities from all chunks → Result set A
  2. Pass 2: Extract again from all chunks → Result set B
  3. Pass 3: Extract again from all chunks → Result set C
  4. Merge: Combine A, B, C using first-pass-wins strategy

Merge Strategy

Simplified from annotation.py:46-84:
def _merge_non_overlapping_extractions(all_extractions):
    """
    Merge extractions from multiple passes.

    When extractions overlap in character positions,
    the extraction from the earlier pass is kept.
    """
    merged = list(all_extractions[0])  # Start with pass 1

    for pass_extractions in all_extractions[1:]:
        for extraction in pass_extractions:
            # overlaps_with_any: True if this extraction's character
            # interval intersects any already-merged extraction
            if not overlaps_with_any(extraction, merged):
                merged.append(extraction)  # Add non-overlapping

    return merged
Overlapping extractions are discarded from later passes (earlier passes win).

When to Use Multiple Passes

Use extraction_passes=2-3 when:
  • Processing very long documents (>10,000 characters)
  • Recall is critical (medical records, legal discovery)
  • Entities are sparse across the document
  • You’re willing to pay 2-3x the API cost
Example: Romeo and Juliet extraction (from the README example, lines 142-152):
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # Multiple passes for recall
    max_workers=20,         # Parallel processing
    max_char_buffer=1000    # Smaller chunks for precision
)
This configuration:
  • Uses small chunks (1000 chars) for precision
  • Runs 3 independent passes for recall
  • Parallelizes with 20 workers for speed
  • Processes the full novel (147,843 chars)
Cost consideration: extraction_passes=3 means the document is processed 3 times, tripling token costs.
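
The cost consideration above can be sanity-checked with back-of-the-envelope arithmetic (illustrative only; actual chunk counts depend on sentence boundaries):

```python
import math

# Romeo and Juliet: ~147,843 chars, 1,000-char chunks, 3 passes
total_chars = 147_843
max_char_buffer = 1_000
extraction_passes = 3

chunks_per_pass = math.ceil(total_chars / max_char_buffer)  # ~148 chunks
total_chunks = chunks_per_pass * extraction_passes          # ~444 chunks
print(total_chunks)
```

Roughly 148 chunks per pass become ~444 API calls across three passes, which is why max_workers=20 matters for wall-clock time.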

Parallel Processing

The max_workers parameter controls parallel processing:
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    batch_length=20,      # Process 20 chunks per batch
    max_workers=20,       # Use 20 parallel workers
)

Batch Processing

From extraction.py:111-115:
batch_length: int = 10  # Number of chunks per batch
max_workers: int = 10   # Maximum parallel workers

# Effective parallelization = min(batch_length, max_workers)
If batch_length < max_workers, only batch_length workers are used. Set batch_length >= max_workers for full parallelization.
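
The relationship is plain arithmetic, worth spelling out (not a LangExtract call):

```python
# With the library defaults of batch_length=10 and max_workers=10, raising
# only max_workers leaves 10 chunks in flight at a time.
batch_length = 10
max_workers = 20

effective_workers = min(batch_length, max_workers)
print(effective_workers)  # 10
```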

Optimizing for Speed

For fastest processing:
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=2000,  # Reasonable chunk size
    batch_length=50,       # Large batches
    max_workers=50,        # Maximum parallelization
)

API Rate Limits

For large-scale processing, use Tier 2 Gemini quota to increase throughput and avoid rate limits. See rate-limit documentation.

Chunking Best Practices

1. Start Conservative

Begin with small chunks and single pass:
# Initial configuration
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    extraction_passes=1,
)
Measure precision and recall, then adjust.

2. Monitor Token Usage

Estimate costs before scaling:
# Rough pre-flight cost estimate (assumes ~4 characters per token)
max_char_buffer = 1000
extraction_passes = 1

total_chars = len(document_text)
chunks_per_pass = total_chars / max_char_buffer
total_chunks = chunks_per_pass * extraction_passes

print(f"Estimated chunks to process: {total_chunks:.0f}")
print(f"Estimated tokens (rough): {total_chunks * max_char_buffer / 4:.0f}")

3. Tune Based on Document Type

Recommended configuration by document type:
  • Medical records: max_char_buffer=1000, extraction_passes=1
  • Legal contracts: max_char_buffer=1500, context_window_chars=300
  • Research papers: max_char_buffer=2000, extraction_passes=1
  • Novels: max_char_buffer=1000, extraction_passes=3, max_workers=20
  • News articles: max_char_buffer=2500, extraction_passes=1

4. Use Context Windows Selectively

Enable only when necessary:
# For narrative text with pronouns
result = lx.extract(
    text_or_documents=narrative_text,
    max_char_buffer=2000,
    context_window_chars=500,  # Enable context
)

# For structured data (no pronouns)
result = lx.extract(
    text_or_documents=tabular_data,
    max_char_buffer=1000,
    # context_window_chars not needed
)

5. Balance Recall and Cost

Use multiple passes strategically:
# High-value documents: Maximize recall
result = lx.extract(
    text_or_documents=important_document,
    max_char_buffer=1000,
    extraction_passes=3,      # High recall
)

# Bulk processing: Optimize cost
result = lx.extract(
    text_or_documents=bulk_documents,
    max_char_buffer=3000,     # Fewer chunks
    extraction_passes=1,      # Single pass
)

Advanced: Custom Tokenization

Provide a custom tokenizer for specialized chunking:
from langextract.core import tokenizer as tokenizer_lib

custom_tokenizer = tokenizer_lib.RegexTokenizer()

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    tokenizer=custom_tokenizer,
)
See chunking.py:385-421 for ChunkIterator tokenizer integration.

Debugging Chunking Issues

Inspect Chunks

from langextract import chunking
from langextract.core import tokenizer as tokenizer_lib

tokenizer = tokenizer_lib.RegexTokenizer()
tokenized = tokenizer.tokenize(long_text)

chunk_iter = chunking.ChunkIterator(
    text=tokenized,
    max_char_buffer=1000,
    tokenizer_impl=tokenizer,
)

for i, chunk in enumerate(chunk_iter):
    print(f"Chunk {i}: {len(chunk.chunk_text)} chars")
    print(f"Preview: {chunk.chunk_text[:100]}...")
    print(f"Char interval: {chunk.char_interval}")
    print()

Track Performance

import time

start_time = time.time()

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    show_progress=True,  # Enable progress bar
)

elapsed = time.time() - start_time
print(f"Processed {len(text)} chars in {elapsed:.2f}s")
print(f"Found {len(result.extractions)} extractions")

Next Steps

Source Grounding

Learn how chunking affects alignment

Prompt Engineering

Optimize prompts for chunked extraction
