This example demonstrates how to use LangExtract to extract structured information from Japanese text.
For non-spaced languages like Japanese, use UnicodeTokenizer to ensure correct character-based segmentation and alignment.
Overview
When working with Japanese and other non-spaced languages, proper tokenization is critical for accurate entity extraction and position tracking. LangExtract provides UnicodeTokenizer specifically for this purpose.
Why UnicodeTokenizer?
- Correct Grapheme Segmentation: Japanese uses multiple character systems (Hiragana, Katakana, Kanji) that require proper Unicode handling
- Accurate Position Tracking: Ensures extracted entities map to correct character positions in the original text
- Language Compatibility: Works with any non-spaced language including Chinese, Thai, and others
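The underlying problem is easy to demonstrate with plain Python, no LangExtract required: whitespace splitting, which a space-based tokenizer relies on, cannot segment Japanese at all, while character-level segmentation at least preserves per-character positions.

```python
# Japanese has no spaces, so whitespace splitting yields a single unusable "token".
text = "東京出身の田中さんはGoogleで働いています。"
print(text.split())       # ['東京出身の田中さんはGoogleで働いています。']
print(len(text.split()))  # 1

# Character-level (code point) segmentation preserves positions:
chars = list(text)
print(len(chars))   # 24
print(chars[:3])    # ['東', '京', '出']
```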
Full Pipeline Example
Import and define input
import langextract as lx
from langextract.core import tokenizer
# Japanese text with entities (Person, Location, Organization)
# "Mr. Tanaka from Tokyo works at Google."
input_text = "東京出身の田中さんはGoogleで働いています。"
Define extraction task
# Define extraction prompt
prompt_description = "Extract named entities including Person, Location, and Organization."
# Define example data (few-shot examples help the model understand the task)
examples = [
    lx.data.ExampleData(
        text="大阪の山田さんはソニーに入社しました。",  # Mr. Yamada from Osaka joined Sony.
        extractions=[
            lx.data.Extraction(extraction_class="Location", extraction_text="大阪"),
            lx.data.Extraction(extraction_class="Person", extraction_text="山田"),
            lx.data.Extraction(extraction_class="Organization", extraction_text="ソニー"),
        ],
    )
]
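Few-shot examples can be sanity-checked offline before any API call: every extraction_text should occur verbatim in its example text, otherwise the model receives a misaligned demonstration. A minimal check in plain Python (the dicts below are a hypothetical stand-in for the lx.data classes, used only so the check runs standalone):

```python
# Stand-in for ExampleData/Extraction, using plain dicts for illustration.
example = {
    "text": "大阪の山田さんはソニーに入社しました。",  # Mr. Yamada from Osaka joined Sony.
    "extractions": [
        {"extraction_class": "Location", "extraction_text": "大阪"},
        {"extraction_class": "Person", "extraction_text": "山田"},
        {"extraction_class": "Organization", "extraction_text": "ソニー"},
    ],
}

for ex in example["extractions"]:
    span = ex["extraction_text"]
    # Each labeled span must appear verbatim in the example text.
    assert span in example["text"], f"{span!r} not found in example text"
    print(f"{ex['extraction_class']}: {span} at {example['text'].index(span)}")
```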
Initialize UnicodeTokenizer
# 1. Initialize the UnicodeTokenizer
# Essential for Japanese to ensure correct grapheme segmentation.
unicode_tokenizer = tokenizer.UnicodeTokenizer()
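Conceptually, a character-based tokenizer turns the text into per-code-point tokens, each carrying (start, end) offsets into the original string. The sketch below is an illustrative approximation in plain Python, not the library's actual implementation (a real grapheme segmenter would also group combining marks per UAX #29):

```python
def char_tokenize(text):
    """Illustrative character-level tokenizer: one token per code point."""
    return [(ch, i, i + 1) for i, ch in enumerate(text)]

tokens = char_tokenize("東京出身の田中さんは")
print(tokens[:2])  # [('東', 0, 1), ('京', 1, 2)]

# Every token's offsets round-trip back to the original string:
assert all("東京出身の田中さんは"[s:e] == ch for ch, s, e in tokens)
```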
Run extraction with custom tokenizer
# 2. Run Extraction with the Custom Tokenizer
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",
    tokenizer=unicode_tokenizer,  # <--- Pass the tokenizer here
    api_key="your-api-key-here",  # Optional if env var is set
)
Display results
# 3. Display Results
print(f"Input: {input_text}\n")
print("Extracted Entities:")
for entity in result.extractions:
    position_info = ""
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    print(f"• {entity.extraction_class}: {entity.extraction_text}{position_info}")
Expected Output
Input: 東京出身の田中さんはGoogleで働いています。
Extracted Entities:
• Location: 東京 (pos: 0-2)
• Person: 田中 (pos: 5-7)
• Organization: Google (pos: 10-16)
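The reported positions are ordinary Python string indices (code points), so they can be verified by slicing the input directly:

```python
text = "東京出身の田中さんはGoogleで働いています。"

# Slicing with each reported (start, end) pair recovers the entity verbatim.
assert text[0:2] == "東京"
assert text[5:7] == "田中"
assert text[10:16] == "Google"
print("all spans align")
```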
Key Points

Tokenizer Selection

Always use UnicodeTokenizer for:
- Japanese (日本語)
- Chinese (中文)
- Korean (한국어)
- Thai (ไทย)
- And other languages without clear word boundaries

The default tokenizer assumes space-separated words and will not work correctly for these languages.

Position Accuracy

The UnicodeTokenizer ensures that:
- Character positions are counted correctly
- Multi-byte Unicode characters are handled properly
- Extracted spans align perfectly with the source text

This is critical for visualization and verification of extraction results.

Few-Shot Examples

Providing examples in the target language helps the model:
- Understand language-specific entity patterns
- Learn proper entity boundaries
- Maintain consistency in extraction style

Always include at least one example in the same language as your input text.

Mixed Language Text

The example demonstrates mixed script handling:
- Kanji characters: 東京, 田中
- Hiragana: の, さん, は, で
- Katakana: (none in this example)
- Latin alphabet: Google

UnicodeTokenizer handles all scripts seamlessly within the same text.
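A note on multi-byte handling: positions are counted in Unicode code points, not UTF-8 bytes, and the two diverge immediately in mixed Japanese/Latin text:

```python
text = "東京Google"
print(len(text))                  # 8 code points
print(len(text.encode("utf-8")))  # 12 bytes: each kanji is 3 bytes in UTF-8

# Byte offsets would place "Google" at 6-12; code-point offsets place it at 2-8.
assert text[2:8] == "Google"
```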
Best Practices
- Always use UnicodeTokenizer for non-spaced languages
- Provide language-specific examples to guide the model
- Verify position alignment in your output to ensure correct tokenization
- Test with mixed scripts if your use case involves multiple writing systems
While this example focuses on Japanese, the same approach works for any non-spaced language. Simply adapt your prompt and examples to match your target language while using UnicodeTokenizer for proper segmentation.