This example demonstrates how to use LangExtract to extract structured information from Japanese text.
For non-spaced languages like Japanese, use UnicodeTokenizer to ensure correct character-based segmentation and alignment.

Overview

When working with Japanese and other non-spaced languages, proper tokenization is critical for accurate entity extraction and position tracking. LangExtract provides UnicodeTokenizer specifically for this purpose.

Why UnicodeTokenizer?

  • Correct Grapheme Segmentation: Japanese uses multiple character systems (Hiragana, Katakana, Kanji) that require proper Unicode handling
  • Accurate Position Tracking: Ensures extracted entities map to correct character positions in the original text
  • Language Compatibility: Works with any non-spaced language including Chinese, Thai, and others
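
As a quick plain-Python illustration of why whitespace tokenization fails here (this snippet is illustrative only, not LangExtract API):

text = "東京出身の田中さんはGoogleで働いています。"

# Splitting on whitespace finds no word boundaries in Japanese:
print(text.split())  # ['東京出身の田中さんはGoogleで働いています。'] -- one giant "token"

# Character-level positions, by contrast, are well defined,
# and they are what UnicodeTokenizer lets LangExtract report:
print(text[0:2])  # 東京 (the span a Location entity occupies)
print(text[5:7])  # 田中 (the span a Person entity occupies)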

Full Pipeline Example

Step 1: Import and define input

import langextract as lx
from langextract.core import tokenizer

# Japanese text with entities (Person, Location, Organization)
# "Mr. Tanaka from Tokyo works at Google."
input_text = "東京出身の田中さんはGoogleで働いています。"

Step 2: Define extraction task

# Define extraction prompt
prompt_description = "Extract named entities including Person, Location, and Organization."

# Define example data (few-shot examples help the model understand the task)
examples = [
    lx.data.ExampleData(
        text="大阪の山田さんはソニーに入社しました。",  # Mr. Yamada from Osaka joined Sony.
        extractions=[
            lx.data.Extraction(extraction_class="Location", extraction_text="大阪"),
            lx.data.Extraction(extraction_class="Person", extraction_text="山田"),
            lx.data.Extraction(extraction_class="Organization", extraction_text="ソニー"),
        ]
    )
]
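
If you also want structured detail on each entity, lx.data.Extraction accepts an optional attributes dictionary. A hedged variant of the example above (the attribute key "reading" is an illustrative choice, not a field LangExtract requires):

examples_with_attributes = [
    lx.data.ExampleData(
        text="大阪の山田さんはソニーに入社しました。",  # Mr. Yamada from Osaka joined Sony.
        extractions=[
            lx.data.Extraction(
                extraction_class="Person",
                extraction_text="山田",
                attributes={"reading": "やまだ"},  # illustrative attribute key
            ),
        ],
    )
]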

Step 3: Initialize UnicodeTokenizer

# 1. Initialize the UnicodeTokenizer
# Essential for Japanese to ensure correct grapheme segmentation.
unicode_tokenizer = tokenizer.UnicodeTokenizer()

Step 4: Run extraction with custom tokenizer

# 2. Run Extraction with the Custom Tokenizer
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",
    tokenizer=unicode_tokenizer,   # <--- Pass the tokenizer here
    api_key="your-api-key-here"    # Optional if env var is set
)
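
If you prefer not to hard-code the key, you can rely on an environment variable instead. LangExtract's documentation uses LANGEXTRACT_API_KEY; confirm the variable name against your installed version:

# Alternative: omit api_key and export the key beforehand, e.g. in your shell:
#   export LANGEXTRACT_API_KEY="your-api-key-here"
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",
    tokenizer=unicode_tokenizer,
)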

Step 5: Display results

# 3. Display Results
print(f"Input: {input_text}\n")
print("Extracted Entities:")
for entity in result.extractions:
    position_info = ""
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    
    print(f"• {entity.extraction_class}: {entity.extraction_text}{position_info}")

Expected Output

Input: 東京出身の田中さんはGoogleで働いています。

Extracted Entities:
• Location: 東京 (pos: 0-2)
• Person: 田中 (pos: 5-7)
• Organization: Google (pos: 10-16)

Key Points

Always use UnicodeTokenizer for:
  • Japanese (日本語)
  • Chinese (中文)
  • Korean (한국어)
  • Thai (ไทย)
  • And other languages without clear word boundaries
The default tokenizer assumes space-separated words and will not work correctly for these languages.

Best Practices

  1. Always use UnicodeTokenizer for non-spaced languages
  2. Provide language-specific examples to guide the model
  3. Verify position alignment in your output to ensure correct tokenization (see the sketch after this list)
  4. Test with mixed scripts if your use case involves multiple writing systems
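
For point 3, a quick way to verify alignment is to check that each reported character interval slices the original text back to exactly the extracted string. This sketch uses only the objects already shown above:

# Sanity check: each char_interval should reproduce the extracted text exactly.
for entity in result.extractions:
    if entity.char_interval is None:
        continue
    start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
    assert input_text[start:end] == entity.extraction_text, (
        f"Misaligned span for {entity.extraction_class!r}: "
        f"{input_text[start:end]!r} != {entity.extraction_text!r}"
    )
print("All extraction spans align with the source text.")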
While this example focuses on Japanese, the same approach works for any non-spaced language. Simply adapt your prompt and examples to match your target language while using UnicodeTokenizer for proper segmentation.
