## Why Chunking Matters

Large language models have limited context windows. When your document exceeds this limit, LangExtract automatically chunks the text into smaller segments for processing.

An effective chunking strategy balances:

- **Recall**: finding all relevant entities
- **Precision**: avoiding false positives
- **Cost**: minimizing API token usage
- **Speed**: maximizing parallel processing

For documents under ~4,000 characters with gemini-2.5-flash, chunking is typically unnecessary. For longer documents, strategic chunking is critical.
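As a trivial pre-flight check, you can test the rule of thumb above before tuning any chunking parameters (the 4,000-character figure is the heuristic stated here, not a hard API limit, and `needs_chunking` is an illustrative helper, not part of LangExtract):

```python
CHUNKING_THRESHOLD = 4_000  # Heuristic: below this, chunking is typically unnecessary

def needs_chunking(text: str, threshold: int = CHUNKING_THRESHOLD) -> bool:
    """Quick check for whether a document will likely be split into chunks."""
    return len(text) > threshold

print(needs_chunking("A short clinical note."))  # False
print(needs_chunking("x" * 150_000))             # True
```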
## The max_char_buffer Parameter

The `max_char_buffer` parameter controls the maximum size of each text chunk:

```python
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,  # Maximum ~1000 characters per chunk
)
```
## How Chunking Works

LangExtract uses intelligent sentence-aware chunking (see `chunking.py:343-507`):

- **Sentence boundaries**: chunks respect sentence boundaries when possible
- **Newline preservation**: breaks at newlines for structured text (e.g., poetry, lists)
- **Token overflow**: single tokens exceeding the buffer become their own chunk
- **Maximum packing**: multiple sentences pack into chunks up to the limit
### Example Chunking Behavior

Given this poem:

```
No man is an island,
Entire of itself,
Every man is a piece of the continent,
A part of the main.
```

With `max_char_buffer=40`:

```
Chunk 1: "No man is an island,\nEntire of itself," (38 chars)
Chunk 2: "Every man is a piece of the continent," (38 chars)
Chunk 3: "A part of the main." (19 chars)
```

See `chunking.py:349-383` for detailed chunking logic.
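The chunking behavior shown above can be reproduced with a toy greedy packer. This is a simplified sketch of the packing idea, not LangExtract's actual `ChunkIterator` (which is token-aware):

```python
def pack_sentences(sentences, max_char_buffer):
    """Greedily pack sentences into chunks of at most max_char_buffer chars.

    A sentence that doesn't fit starts a new chunk; an oversized
    sentence becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current}\n{sentence}" if current else sentence
        if len(candidate) <= max_char_buffer:
            current = candidate  # Pack another sentence into the current chunk
        else:
            if current:
                chunks.append(current)
            current = sentence   # Start a fresh chunk with this sentence
    if current:
        chunks.append(current)
    return chunks

poem = [
    "No man is an island,",
    "Entire of itself,",
    "Every man is a piece of the continent,",
    "A part of the main.",
]
for i, chunk in enumerate(pack_sentences(poem, 40), 1):
    print(f"Chunk {i}: {chunk!r} ({len(chunk)} chars)")
```

Running this reproduces the three chunks from the example above: the first two lines pack together (38 chars), while each remaining line starts a new chunk.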
## Choosing max_char_buffer

### Smaller Buffers (500-1500 chars)

**Advantages:**

- Higher precision (focused context)
- Better for dense documents
- Reduced hallucination risk

**Disadvantages:**

- More API calls (higher cost)
- May miss cross-chunk entities
- Slower without parallelization

**Use when:**

- Extracting from dense medical records
- Processing structured data (tables, lists)
- High precision is critical
```python
result = lx.extract(
    text_or_documents=medical_record,
    prompt_description="Extract medications, dosages, and frequencies",
    examples=examples,
    max_char_buffer=1000,  # Focused chunks for precision
)
```
### Larger Buffers (2000-5000 chars)

**Advantages:**

- Fewer API calls (lower cost)
- Better cross-reference resolution
- Faster with fewer chunks

**Disadvantages:**

- May reduce precision
- "Needle in a haystack" problem
- Higher memory usage

**Use when:**

- Processing narrative text
- Extracting sparse entities
- Cost optimization is important
```python
result = lx.extract(
    text_or_documents=novel_text,
    prompt_description="Extract character names and locations",
    examples=examples,
    max_char_buffer=4000,  # Larger chunks for sparse entities
)
```
## Recommended Values

| Task Type | max_char_buffer | Example Documents |
|---|---|---|
| High precision | 500-1500 characters | Medical records, legal documents, structured data |
| Balanced | 1500-2500 characters | News articles, research papers, general text |
| High recall | 2500-4000 characters | Novels, narrative text, sparse entities |
| Cost optimization | 4000-6000 characters | Reduce API calls for large-scale processing |
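These recommendations can be encoded as a small lookup helper if you select buffer sizes programmatically. Everything here (`BUFFER_RANGES`, `suggest_buffer`) is hypothetical glue code, not part of LangExtract:

```python
# Recommended buffer ranges from the table above, by task profile
BUFFER_RANGES = {
    "high_precision": (500, 1500),
    "balanced": (1500, 2500),
    "high_recall": (2500, 4000),
    "cost_optimized": (4000, 6000),
}

def suggest_buffer(task: str) -> int:
    """Return the midpoint of the recommended range for a task profile."""
    low, high = BUFFER_RANGES[task]
    return (low + high) // 2

print(suggest_buffer("balanced"))  # 2000
```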
## Context Windows

The `context_window_chars` parameter includes text from the previous chunk to help with coreference resolution:

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=2000,
    context_window_chars=500,  # Include 500 chars from previous chunk
)
```
### How Context Windows Work

See `annotation.py:358-361` for context-aware prompt building:

```python
# Chunk 1: characters 0-2000
chunk_1_text = document[0:2000]
prompt_1 = generate_prompt(chunk_1_text)  # No context

# Chunk 2: characters 2000-4000
context = document[1500:2000]  # Last 500 chars of previous chunk
chunk_2_text = document[2000:4000]
prompt_2 = generate_prompt(context + chunk_2_text)  # With context
```
### Example: Coreference Resolution

Without a context window:

```
Chunk 1: "Dr. Smith examined the patient. She noted elevated blood pressure."
Chunk 2: "She prescribed lisinopril 10mg daily."  # Who is "She"?
```

Extraction from Chunk 2 may not know "She" refers to "Dr. Smith".

With a context window:

```
Context: "...Dr. Smith examined the patient. She noted elevated blood pressure."
Chunk 2: "She prescribed lisinopril 10mg daily."
```

The LLM can now resolve "She" → "Dr. Smith".
### When to Use Context Windows

✅ Use context windows when:

- Documents use pronouns extensively
- Entities span chunk boundaries
- Cross-chunk relationships matter
- Processing narrative text

⚠️ Context windows increase token usage:

- Each chunk reprocesses context characters
- May increase API costs by 10-30%
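The overhead is easy to estimate: every chunk after the first re-sends `context_window_chars` extra characters. A rough sketch (assuming full, evenly sized chunks):

```python
def context_overhead(total_chars, max_char_buffer, context_window_chars):
    """Approximate fraction of extra characters sent due to context windows."""
    num_chunks = -(-total_chars // max_char_buffer)  # Ceiling division
    extra = (num_chunks - 1) * context_window_chars  # First chunk has no context
    return extra / total_chars

# 100k chars in 2000-char chunks with 500-char context:
# 50 chunks, 49 of which carry 500 repeated chars
print(f"{context_overhead(100_000, 2000, 500):.1%}")  # 24.5%
```

With these settings the overhead lands near the top of the 10-30% range quoted above; shrinking `context_window_chars` relative to `max_char_buffer` brings it down proportionally.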
## Multiple Extraction Passes

The `extraction_passes` parameter runs sequential independent extractions and merges non-overlapping results:

```python
result = lx.extract(
    text_or_documents=long_document,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    extraction_passes=3,  # Run extraction 3 times
)
```
### How Multiple Passes Work

See `annotation.py:442-525` for the sequential pass implementation:

1. **Pass 1**: extract entities from all chunks → result set A
2. **Pass 2**: extract again from all chunks → result set B
3. **Pass 3**: extract again from all chunks → result set C
4. **Merge**: combine A, B, C using a first-pass-wins strategy
### Merge Strategy

From `annotation.py:46-84`:

```python
def _merge_non_overlapping_extractions(all_extractions):
    """Merge extractions from multiple passes.

    When extractions overlap in character positions,
    the extraction from the earlier pass is kept.
    """
    merged = list(all_extractions[0])  # Start with pass 1
    for pass_extractions in all_extractions[1:]:
        for extraction in pass_extractions:
            if not overlaps_with_any(extraction, merged):
                merged.append(extraction)  # Add non-overlapping
    return merged
```

Overlapping extractions are discarded from later passes (earlier passes win).
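To make the first-pass-wins rule concrete, here is a self-contained sketch using plain `(start, end)` character intervals as stand-ins for extraction objects (the helper names are illustrative, not LangExtract internals):

```python
def overlaps(a, b):
    """Two (start, end) character intervals overlap if they share any position."""
    return a[0] < b[1] and b[0] < a[1]

def merge_first_pass_wins(all_passes):
    """Keep every extraction from pass 1; add later-pass extractions
    only when they don't overlap anything already kept."""
    merged = list(all_passes[0])
    for pass_extractions in all_passes[1:]:
        for ex in pass_extractions:
            if not any(overlaps(ex, kept) for kept in merged):
                merged.append(ex)
    return merged

pass_a = [(0, 10), (20, 30)]
pass_b = [(5, 15), (40, 50)]  # (5, 15) overlaps (0, 10) and is dropped
print(merge_first_pass_wins([pass_a, pass_b]))  # [(0, 10), (20, 30), (40, 50)]
```

Note the asymmetry: pass 2's `(5, 15)` is discarded even if it were the better extraction, because pass 1 already claimed that span.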
### When to Use Multiple Passes

Use `extraction_passes=2-3` when:

- Processing very long documents (>10,000 characters)
- Recall is critical (medical records, legal discovery)
- Entities are sparse across the document
- You're willing to pay 2-3x the API cost
### Example: Romeo and Juliet Extraction

From the README example (lines 142-152):

```python
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # Multiple passes for recall
    max_workers=20,        # Parallel processing
    max_char_buffer=1000,  # Smaller chunks for precision
)
```

This configuration:

- Uses small chunks (1000 chars) for precision
- Runs 3 independent passes for recall
- Parallelizes with 20 workers for speed
- Processes the full novel (147,843 chars)

**Cost consideration**: `extraction_passes=3` means the document is processed 3 times, tripling token costs.
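A back-of-the-envelope check of this configuration's workload (assuming evenly filled chunks; the real chunk count varies with sentence boundaries):

```python
import math

total_chars = 147_843  # Full Romeo and Juliet text
max_char_buffer = 1000
extraction_passes = 3
max_workers = 20

chunks_per_pass = math.ceil(total_chars / max_char_buffer)  # 148
total_chunks = chunks_per_pass * extraction_passes          # 444
batches_per_pass = math.ceil(chunks_per_pass / max_workers) # 8 waves of workers

print(f"{chunks_per_pass} chunks/pass, {total_chunks} total chunk requests")
```

So the three passes turn ~148 chunks into ~444 model calls, which is exactly the 3x token-cost multiplier noted above.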
## Parallel Processing

The `max_workers` parameter controls parallel processing:

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    batch_length=20,  # Process 20 chunks per batch
    max_workers=20,   # Use 20 parallel workers
)
```
### Batch Processing

From `extraction.py:111-115`:

```python
batch_length: int = 10  # Number of chunks per batch
max_workers: int = 10   # Maximum parallel workers

# Effective parallelization = min(batch_length, max_workers)
```

If `batch_length < max_workers`, only `batch_length` workers are used. Set `batch_length >= max_workers` for full parallelization.
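That interaction can be written out directly (a trivial illustrative helper, not part of the LangExtract API):

```python
def effective_workers(batch_length: int, max_workers: int) -> int:
    """Parallelism is capped by both the batch size and the worker pool."""
    return min(batch_length, max_workers)

print(effective_workers(10, 20))  # 10 — batch_length limits parallelism
print(effective_workers(50, 20))  # 20 — worker pool limits parallelism
```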
### Optimizing for Speed

For fastest processing:

```python
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=2000,  # Reasonable chunk size
    batch_length=50,       # Large batches
    max_workers=50,        # Maximum parallelization
)
```
### API Rate Limits

For large-scale processing, use Tier 2 Gemini quota to increase throughput and avoid rate limits. See the rate-limit documentation.
## Chunking Best Practices

### 1. Start Conservative

Begin with small chunks and a single pass:

```python
# Initial configuration
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    extraction_passes=1,
)
```

Measure precision and recall, then adjust.
### 2. Monitor Token Usage

Estimate costs before scaling:

```python
total_chars = len(document_text)
chunks_per_pass = total_chars / max_char_buffer
total_chunks = chunks_per_pass * extraction_passes

print(f"Estimated chunks to process: {total_chunks}")
print(f"Estimated tokens (rough): {total_chunks * max_char_buffer / 4}")
```
### 3. Tune Based on Document Type

| Document Type | Recommended Config |
|---|---|
| Medical records | `max_char_buffer=1000, extraction_passes=1` |
| Legal contracts | `max_char_buffer=1500, context_window_chars=300` |
| Research papers | `max_char_buffer=2000, extraction_passes=1` |
| Novels | `max_char_buffer=1000, extraction_passes=3, max_workers=20` |
| News articles | `max_char_buffer=2500, extraction_passes=1` |
### 4. Use Context Windows Selectively

Enable them only when necessary:

```python
# For narrative text with pronouns
result = lx.extract(
    text_or_documents=narrative_text,
    max_char_buffer=2000,
    context_window_chars=500,  # Enable context
)

# For structured data (no pronouns)
result = lx.extract(
    text_or_documents=tabular_data,
    max_char_buffer=1000,
    # context_window_chars not needed
)
```
### 5. Balance Recall and Cost

Use multiple passes strategically:

```python
# High-value documents: maximize recall
result = lx.extract(
    text_or_documents=important_document,
    max_char_buffer=1000,
    extraction_passes=3,  # High recall
)

# Bulk processing: optimize cost
result = lx.extract(
    text_or_documents=bulk_documents,
    max_char_buffer=3000,  # Fewer chunks
    extraction_passes=1,   # Single pass
)
```
## Advanced: Custom Tokenization

Provide a custom tokenizer for specialized chunking:

```python
from langextract.core import tokenizer as tokenizer_lib

custom_tokenizer = tokenizer_lib.RegexTokenizer()

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    tokenizer=custom_tokenizer,
)
```

See `chunking.py:385-421` for ChunkIterator tokenizer integration.
## Debugging Chunking Issues

### Inspect Chunks

```python
from langextract import chunking
from langextract.core import tokenizer as tokenizer_lib

tokenizer = tokenizer_lib.RegexTokenizer()
tokenized = tokenizer.tokenize(long_text)

chunk_iter = chunking.ChunkIterator(
    text=tokenized,
    max_char_buffer=1000,
    tokenizer_impl=tokenizer,
)

for i, chunk in enumerate(chunk_iter):
    print(f"Chunk {i}: {len(chunk.chunk_text)} chars")
    print(f"Preview: {chunk.chunk_text[:100]}...")
    print(f"Char interval: {chunk.char_interval}")
    print()
```
### Measure Processing Time

```python
import time

start_time = time.time()
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    max_char_buffer=1000,
    show_progress=True,  # Enable progress bar
)
elapsed = time.time() - start_time

print(f"Processed {len(text)} chars in {elapsed:.2f}s")
print(f"Found {len(result.extractions)} extractions")
```
## Next Steps

- **Source Grounding**: learn how chunking affects alignment
- **Prompt Engineering**: optimize prompts for chunked extraction