This example demonstrates how to use LangExtract to extract structured information from Japanese text.
For non-spaced languages like Japanese, use UnicodeTokenizer to ensure correct character-based segmentation and alignment.
Overview
When working with Japanese and other non-spaced languages, proper tokenization is critical for accurate entity extraction and position tracking. LangExtract provides UnicodeTokenizer specifically for this purpose.
Why UnicodeTokenizer?
- Correct Grapheme Segmentation: Japanese uses multiple character systems (Hiragana, Katakana, Kanji) that require proper Unicode handling
- Accurate Position Tracking: Ensures extracted entities map to correct character positions in the original text
- Language Compatibility: Works with any non-spaced language including Chinese, Thai, and others
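The underlying problem is easy to demonstrate with plain Python, no LangExtract required: whitespace splitting, which a space-based tokenizer relies on, cannot segment Japanese at all, while character-level segmentation at least preserves per-character positions.

```python
# Japanese has no spaces, so whitespace splitting yields a single unusable "token".
text = "東京出身の田中さんはGoogleで働いています。"
print(text.split())       # ['東京出身の田中さんはGoogleで働いています。']
print(len(text.split()))  # 1

# Character-level (code point) segmentation preserves positions:
chars = list(text)
print(len(chars))   # 24
print(chars[:3])    # ['東', '京', '出']
```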
Full Pipeline Example
Import and define input
import langextract as lx
from langextract.core import tokenizer
# Japanese text with entities (Person, Location, Organization)
# "Mr. Tanaka from Tokyo works at Google."
input_text = "東京出身の田中さんはGoogleで働いています。"
Define extraction task
# Define extraction prompt
prompt_description = "Extract named entities including Person, Location, and Organization."
# Define example data (few-shot examples help the model understand the task)
examples = [
    lx.data.ExampleData(
        text="大阪の山田さんはソニーに入社しました。",  # Mr. Yamada from Osaka joined Sony.
        extractions=[
            lx.data.Extraction(extraction_class="Location", extraction_text="大阪"),
            lx.data.Extraction(extraction_class="Person", extraction_text="山田"),
            lx.data.Extraction(extraction_class="Organization", extraction_text="ソニー"),
        ],
    )
]
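Few-shot examples can be sanity-checked offline before any API call: every extraction_text should occur verbatim in its example text, otherwise the model receives a misaligned demonstration. A minimal check in plain Python (the dicts below are a hypothetical stand-in for the lx.data classes, used only so the check runs standalone):

```python
# Stand-in for ExampleData/Extraction, using plain dicts for illustration.
example = {
    "text": "大阪の山田さんはソニーに入社しました。",  # Mr. Yamada from Osaka joined Sony.
    "extractions": [
        {"extraction_class": "Location", "extraction_text": "大阪"},
        {"extraction_class": "Person", "extraction_text": "山田"},
        {"extraction_class": "Organization", "extraction_text": "ソニー"},
    ],
}

for ex in example["extractions"]:
    span = ex["extraction_text"]
    # Each labeled span must appear verbatim in the example text.
    assert span in example["text"], f"{span!r} not found in example text"
    print(f"{ex['extraction_class']}: {span} at {example['text'].index(span)}")
```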
Initialize UnicodeTokenizer
# 1. Initialize the UnicodeTokenizer
# Essential for Japanese to ensure correct grapheme segmentation.
unicode_tokenizer = tokenizer.UnicodeTokenizer()
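Conceptually, a character-based tokenizer turns the text into per-code-point tokens, each carrying (start, end) offsets into the original string. The sketch below is an illustrative approximation in plain Python, not the library's actual implementation (a real grapheme segmenter would also group combining marks per UAX #29):

```python
def char_tokenize(text):
    """Illustrative character-level tokenizer: one token per code point."""
    return [(ch, i, i + 1) for i, ch in enumerate(text)]

tokens = char_tokenize("東京出身の田中さんは")
print(tokens[:2])  # [('東', 0, 1), ('京', 1, 2)]

# Every token's offsets round-trip back to the original string:
assert all("東京出身の田中さんは"[s:e] == ch for ch, s, e in tokens)
```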
Run extraction with custom tokenizer
# 2. Run Extraction with the Custom Tokenizer
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",
    tokenizer=unicode_tokenizer,  # <--- Pass the tokenizer here
    api_key="your-api-key-here",  # Optional if env var is set
)
Display results
# 3. Display Results
print(f"Input: {input_text}\n")
print("Extracted Entities:")
for entity in result.extractions:
    position_info = ""
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    print(f"• {entity.extraction_class}: {entity.extraction_text}{position_info}")
Expected Output
Input: 東京出身の田中さんはGoogleで働いています。
Extracted Entities:
• Location: 東京 (pos: 0-2)
• Person: 田中 (pos: 5-7)
• Organization: Google (pos: 10-16)
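The reported positions are ordinary Python string indices (code points), so they can be verified by slicing the input directly:

```python
text = "東京出身の田中さんはGoogleで働いています。"

# Slicing with each reported (start, end) pair recovers the entity verbatim.
assert text[0:2] == "東京"
assert text[5:7] == "田中"
assert text[10:16] == "Google"
print("all spans align")
```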
Key Points

Tokenizer Selection

Always use UnicodeTokenizer for:
- Japanese (日本語)
- Chinese (中文)
- Korean (한국어)
- Thai (ไทย)
- And other languages without clear word boundaries

The default tokenizer assumes space-separated words and will not work correctly for these languages.

Position Accuracy

The UnicodeTokenizer ensures that:
- Character positions are counted correctly
- Multi-byte Unicode characters are handled properly
- Extracted spans align perfectly with the source text

This is critical for visualization and verification of extraction results.

Few-Shot Examples

Providing examples in the target language helps the model:
- Understand language-specific entity patterns
- Learn proper entity boundaries
- Maintain consistency in extraction style

Always include at least one example in the same language as your input text.

Mixed Language Text

The example demonstrates mixed script handling:
- Kanji characters: 東京, 田中
- Hiragana: の, さん, は, で
- Katakana: (none in this example)
- Latin alphabet: Google

UnicodeTokenizer handles all scripts seamlessly within the same text.
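A note on multi-byte handling: positions are counted in Unicode code points, not UTF-8 bytes, and the two diverge immediately in mixed Japanese/Latin text:

```python
text = "東京Google"
print(len(text))                  # 8 code points
print(len(text.encode("utf-8")))  # 12 bytes: each kanji is 3 bytes in UTF-8

# Byte offsets would place "Google" at 6-12; code-point offsets place it at 2-8.
assert text[2:8] == "Google"
```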
Best Practices
- Always use UnicodeTokenizer for non-spaced languages
- Provide language-specific examples to guide the model
- Verify position alignment in your output to ensure correct tokenization
- Test with mixed scripts if your use case involves multiple writing systems
While this example focuses on Japanese, the same approach works for any non-spaced language. Simply adapt your prompt and examples to match your target language while using UnicodeTokenizer for proper segmentation.