Function Signature
Parameters
- Array of source segments (Arabic text with IDs). May be the full corpus; the validator automatically reduces it to only those IDs parsed from the response. Each segment has the shape { id: string; text: string }.
- The raw LLM translation response text, containing segment markers in the format ID - Translation text. The validator normalizes this input (splits merged markers, normalizes line endings) before validation.
- Custom validation rules to apply. If not provided, all default rules are used. Default rules include: invalid_marker_format, newline_after_id, truncated_segment, duplicate_id, invented_id, missing_id_gap, arabic_leak, empty_parentheses, length_mismatch, all_caps, collapsed_speakers, multiword_translit_without_gloss.
- Configuration object for validation rules.
  - allCapsWordRunThreshold (number, default: 5): Minimum number of consecutive ALL CAPS words needed to trigger an all_caps error.
Return Value
The normalized version of the input response (merged markers split, line endings normalized, escaped brackets removed).
Array of segment IDs successfully parsed from the response (in order of appearance).
Array of validation errors found. Each error contains:
- type (ValidationErrorType): Machine-readable error type
- message (string): Human-readable error message
- range (Range): Character range { start: number; end: number } in the raw response
- matchText (string): The text that triggered the error
- id (string, optional): The segment ID associated with this error
- ruleId (string, optional): Stable rule identifier for tooling/triage
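The error fields listed above can be written down as a type. This is a minimal sketch, not the library's own declaration: the string alias for ValidationErrorType, the sliceRange helper, and the sample values are all assumptions made for illustration, and the helper assumes that the end offset is exclusive.

```typescript
// Assumption: the real ValidationErrorType is likely a union of rule names;
// a plain string alias stands in for it here.
type ValidationErrorType = string;

// Field names and optionality follow the list above.
type ValidationError = {
    type: ValidationErrorType;
    message: string;
    range: { start: number; end: number };
    matchText: string;
    id?: string;
    ruleId?: string;
};

// Hypothetical helper: recover the flagged text from the raw response,
// assuming `end` is exclusive (as with String.prototype.slice).
function sliceRange(rawResponse: string, range: { start: number; end: number }): string {
    return rawResponse.slice(range.start, range.end);
}

// Illustrative values only; the real message wording may differ.
const sampleResponse = "P1 - Peace be upon him\nP9 - Extra";
const sampleError: ValidationError = {
    type: "invented_id",
    message: "ID P9 does not exist in the source segments",
    range: { start: 23, end: 25 },
    matchText: "P9",
    id: "P9",
};
```

Under the exclusive-end assumption, slicing the raw response with an error's range gives back its matchText, which is handy for highlighting the offending span in an editor.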
Behavior
Normalization
The function normalizes the response before validation:
- Splits merged markers (e.g., helloP1 - Text becomes hello\nP1 - Text)
- Normalizes line endings
- Removes escaped brackets (e.g., \[ becomes [)
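The three normalization steps can be approximated with a few regular expressions. The sketch below is not the validator's actual code: the function name normalizeResponseSketch is hypothetical, it assumes segment IDs are one uppercase letter followed by digits (as in P1), and it also unescapes \] on the assumption that closing brackets are treated like opening ones.

```typescript
// Hypothetical stand-in for the validator's normalization step.
// Assumption: IDs look like one uppercase letter plus digits (e.g. P1).
function normalizeResponseSketch(raw: string): string {
    return raw
        // CRLF and lone CR become LF.
        .replace(/\r\n?/g, "\n")
        // Escaped brackets lose their backslash: \[ -> [ and \] -> ].
        .replace(/\\([\[\]])/g, "$1")
        // A marker glued to preceding text moves to its own line.
        .replace(/(\S)(?=[A-Z]\d+ - )/g, "$1\n");
}

const normalized = normalizeResponseSketch("helloP1 - Text\r\nP2 - \\[note\\]");
```

Here the merged marker is split off from hello, the Windows line ending becomes a plain newline, and the escaped brackets around note are unescaped.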
ID Validation
- No valid markers: If no valid ID - Text patterns are found, returns a single no_valid_markers error
- Invented IDs: Detects IDs in the response that don’t exist in the source segments
- Duplicate IDs: Flags IDs that appear more than once in the response
- Missing ID gaps: Detects when the response contains IDs A and C but the corpus order includes B between them
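One way to picture the ID checks above is as a single pass over the parsed IDs, compared against the corpus order. This is a rough sketch under stated assumptions, not the library's implementation: checkIds and IdIssue are hypothetical names, and the gap rule shown here (flag every source ID skipped between two consecutive response IDs) is only one plausible reading of the description.

```typescript
type IdIssue = {
    type: "invented_id" | "duplicate_id" | "missing_id_gap";
    id: string;
};

// Hypothetical helper: sourceIds is the corpus order, parsedIds is the
// order in which IDs appear in the response.
function checkIds(sourceIds: string[], parsedIds: string[]): IdIssue[] {
    const issues: IdIssue[] = [];
    const seen: string[] = [];
    let previousIndex = -1; // corpus position of the last valid ID seen

    for (const id of parsedIds) {
        const index = sourceIds.indexOf(id);
        if (index === -1) {
            issues.push({ type: "invented_id", id });
            continue;
        }
        if (seen.indexOf(id) !== -1) {
            issues.push({ type: "duplicate_id", id });
            continue;
        }
        seen.push(id);
        // Report every corpus ID skipped between the previous ID and this one.
        if (previousIndex !== -1 && index > previousIndex + 1) {
            for (let i = previousIndex + 1; i < index; i++) {
                issues.push({ type: "missing_id_gap", id: sourceIds[i] });
            }
        }
        previousIndex = index;
    }
    return issues;
}

const idIssues = checkIds(["P1", "P2", "P3", "P4"], ["P1", "P3", "P3", "P9"]);
```

With this input the response jumps from P1 to P3 (so P2 is reported as a gap), repeats P3, and mentions P9, which is not in the source.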
Content Validation
- Arabic leak: Detects Arabic script characters (except ﷺ which is allowed)
- Truncated segments: Flags segments containing only …, ..., or [INCOMPLETE]
- Length mismatch: Checks if translation is too short relative to Arabic source (ratio-based heuristic, only for Arabic text ≥ 100 chars)
- Empty parentheses: Detects excessive () patterns (> 3) indicating failed transliterations
- All caps: Flags runs of N consecutive ALL CAPS words (configurable threshold)
- Collapsed speakers: Detects speaker labels that appear mid-line instead of at line start
- Multi-word transliteration without gloss: Flags patterns like al-hajr fi al-madajīʿ without immediate parenthetical gloss
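Two of the content checks above, the Arabic leak and the all-caps run, are simple enough to sketch. The code below is an illustration rather than the validator's own logic: findArabicLeaks and hasAllCapsRun are hypothetical names, the Unicode ranges are an assumed approximation of "Arabic script", and the default threshold of 5 mirrors the documented allCapsWordRunThreshold default.

```typescript
type Leak = { start: number; end: number; matchText: string };

// Assumed approximation of Arabic script: the main Arabic blocks plus the
// presentation forms, with U+FDFA (ﷺ) carved out because it is allowed.
function findArabicLeaks(text: string): Leak[] {
    const arabic =
        /[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDF9\uFDFB-\uFDFF\uFE70-\uFEFF]+/g;
    const leaks: Leak[] = [];
    let match: RegExpExecArray | null;
    while ((match = arabic.exec(text)) !== null) {
        leaks.push({
            start: match.index,
            end: match.index + match[0].length,
            matchText: match[0],
        });
    }
    return leaks;
}

// True when at least `threshold` consecutive words are entirely upper case.
function hasAllCapsRun(text: string, threshold = 5): boolean {
    let run = 0;
    for (const word of text.split(/\s+/)) {
        const isCaps = /[A-Z]/.test(word) && word === word.toUpperCase();
        run = isCaps ? run + 1 : 0;
        if (run >= threshold) {
            return true;
        }
    }
    return false;
}
```

For example, a translation that keeps ﷺ passes the leak check, while one that leaves an untranslated Arabic word in place yields a leak with its character offsets.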
Format Validation
- Invalid marker format: Detects malformed markers (wrong ID shape, missing content after dash, dollar signs, etc.)
- Newline after ID: Flags ID -\nText instead of ID - Text
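The newline-after-ID rule can be illustrated with a multiline regular expression. This is a sketch only: findNewlineAfterId is a hypothetical name, and the pattern assumes IDs are one uppercase letter followed by digits, which may be narrower than the ID shapes the validator actually accepts.

```typescript
// Hypothetical helper: returns the IDs whose marker is followed directly
// by a line break instead of the translation text.
// Assumption: IDs look like one uppercase letter plus digits (e.g. P1).
function findNewlineAfterId(response: string): string[] {
    const pattern = /^([A-Z]\d+) -[ \t]*\n/gm;
    const ids: string[] = [];
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(response)) !== null) {
        ids.push(match[1]);
    }
    return ids;
}

const brokenMarkers = findNewlineAfterId("P1 -\nText\nP2 - Fine");
```

Only P1 is reported here, because the text of P2 follows its marker on the same line.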
Examples
Valid Response (No Errors)
Invented ID Error
Arabic Leak Error
Allowed ﷺ Symbol
Collapsed Speaker Labels
Missing ID Gap
Custom Configuration
Error Ranges
All errors include character ranges that map to the original raw response.
See Also
- VALIDATION_ERROR_TYPE_INFO - Human-readable descriptions for all error types
- ValidationError - Error object structure with all fields