Arabic text normalization

Arabic text normalization is essential for accurate search matching. The library provides two normalization functions that handle diacritics, Unicode variants, and character unification.

Overview

The normalization module exports two primary functions:

removeTashkeel() - Removes diacritics and Quranic marks
normalizeArabic() - Advanced normalization for search indexing

Remove diacritics

`removeTashkeel(text: string): string`

Removes Tashkeel (diacritics) and Quranic marks from Arabic text. Use case: Stripping diacritics for display or simple comparisons.

import { removeTashkeel } from 'quran-search-engine';

const out = removeTashkeel('بِسْمِ ٱللَّهِ');
// out => 'بسم الله'
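
As a rough illustration of the "simple comparisons" use case, the sketch below checks whether two strings differ only in diacritics. It uses a local copy of the function body from the source listing on this page so that it runs without the package installed; `stripTashkeel` and `sameSkeleton` are hypothetical names, not library exports.

```typescript
// Local copy of the removeTashkeel body shown in the source listing on this page.
const stripTashkeel = (text: string): string =>
  text
    .replace(/\u0671/g, '\u0627') // wasl alef -> regular alef
    .replace(/[\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E8\u06EA-\u06FC]/g, '');

// Hypothetical helper: true when two strings are equal once diacritics are stripped.
const sameSkeleton = (a: string, b: string): boolean =>
  stripTashkeel(a) === stripTashkeel(b);

const vocalizedBism = '\u0628\u0650\u0633\u0652\u0645\u0650'; // بِسْمِ
const plainBism = '\u0628\u0633\u0645'; // بسم

console.log(vocalizedBism === plainBism); // false
console.log(sameSkeleton(vocalizedBism, plainBism)); // true
```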

What it removes

From src/utils/normalization.ts:7-11:

export const removeTashkeel = (text: string): string => {
  return text
    .replace(/\u0671/g, '\u0627') // Wasl alef → regular alef
    .replace(/[\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E8\u06EA-\u06FC]/g, '');
};

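The function makes these changes:

Wasl alef (U+0671) is replaced with regular alef (U+0627), not removed
Diacritical marks in the range U+064B-U+065F are removed, including maddah and the combining hamza marks (U+0653-U+0655)
Superscript (dagger) alef (U+0670) is removed
Quranic annotation marks in U+06D6-U+06DC and U+06DF-U+06E8 are removed
Everything in U+06EA-U+06FC is removed; besides Quranic marks, this range also covers the extended Arabic-Indic digits (U+06F0-U+06F9)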
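
To see which code points the character class actually touches, the following sketch tests a handful of sample characters against the same class used in the listing above. The sample set and the `isRemoved` helper are illustrative choices, not part of the library.

```typescript
// Same character class as in removeTashkeel, without the global flag so test() is stateless.
const removedClass = /[\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E8\u06EA-\u06FC]/;

// Hypothetical helper: does the class match this single character?
const isRemoved = (ch: string): boolean => removedClass.test(ch);

const sampleChars: Array<[string, string]> = [
  ['\u064E', 'fatha'],
  ['\u0651', 'shadda'],
  ['\u0670', 'superscript alef'],
  ['\u06D6', 'Quranic small high mark'],
  ['\u06DD', 'end of ayah sign'],
  ['\u0661', 'Arabic-Indic digit one'],
  ['\u06F1', 'extended Arabic-Indic digit one'],
];

for (const [ch, label] of sampleChars) {
  console.log(`${label}: ${isRemoved(ch) ? 'removed' : 'kept'}`);
}
```

With this class, the end of ayah sign (U+06DD) and the standard Arabic-Indic digits are kept, while the extended digits fall inside the last range and are removed.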

Advanced normalization

`normalizeArabic(text: string): string`

Advanced Arabic normalization for search indexing. Handles Unicode normalization, variant unification, and cleanup. Use case: preparing user input and indexed text for searching (removes tashkeel, unifies alef, hamza, and ya variants, and strips non-Arabic characters).

import { normalizeArabic } from 'quran-search-engine';

const out = normalizeArabic('بِسْمِ ٱللَّهِ');
// out => 'بسم الله'

Normalization steps

Remove diacritics

Calls removeTashkeel() to strip diacritical marks, then applies Unicode NFC normalization to the result.

let normalizedText = removeTashkeel(text).normalize('NFC');
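
One consequence of stripping before composing is worth checking with a small experiment. The combining hamza above (U+0654) sits inside the U+064B-U+065F range, so a decomposed waw with hamza loses its hamza before NFC can compose it. The sketch below compares both orders; `composeThenStrip` is a hypothetical alternative shown only for contrast, not the library's behaviour.

```typescript
// Local copy of the removeTashkeel body from the listing on this page.
const stripMarks = (text: string): string =>
  text
    .replace(/\u0671/g, '\u0627')
    .replace(/[\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E8\u06EA-\u06FC]/g, '');

// Order used by normalizeArabic: strip marks first, then compose.
const stripThenCompose = (text: string): string => stripMarks(text).normalize('NFC');

// Hypothetical alternative order: compose first, then strip marks.
const composeThenStrip = (text: string): string => stripMarks(text.normalize('NFC'));

const decomposedWawHamza = '\u0648\u0654'; // waw + combining hamza above
const precomposedWawHamza = '\u0624'; // ؤ as a single code point

console.log(stripThenCompose(decomposedWawHamza) === '\u0648'); // true: hamza is lost
console.log(stripThenCompose(precomposedWawHamza) === '\u0624'); // true: hamza survives
console.log(composeThenStrip(decomposedWawHamza) === '\u0624'); // true: composed before stripping
```

If input may arrive in decomposed form (for example NFD text from some file systems or keyboards), composing it before calling normalizeArabic() is one way to keep both spellings consistent.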

Remove special characters

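Removes tatweel (U+0640, the elongation character). The pattern also lists dagger alif (U+0670), but removeTashkeel() has already stripped it in the previous step, so that part acts as a safeguard.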

// dagger alif + tatweel
normalizedText = normalizedText.replace(/[\u0670\u0640]/g, '');
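
A minimal sketch of this step in isolation, using a stretched spelling as input. The helper name `dropTatweel` is made up for the example.

```typescript
// Same character class as the step above: dagger alif (U+0670) and tatweel (U+0640).
const dropTatweel = (text: string): string => text.replace(/[\u0670\u0640]/g, '');

const stretchedAllah = '\u0627\u0644\u0644\u0640\u0640\u0640\u0647'; // اللـــه
const compactAllah = '\u0627\u0644\u0644\u0647'; // الله

console.log(dropTatweel(stretchedAllah) === compactAllah); // true
```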

Unify alef variants

Normalizes all alef variants to a single form.

// alef variants → ا
normalizedText = normalizedText.replace(/[إأآٱ]/g, 'ا');

Converts: إ أ آ ٱ → ا
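
The effect on matching can be sketched as follows: spellings that differ only in the alef form collapse to the same string. `unifyAlefOnly` is a hypothetical name for this single step.

```typescript
// Same class as the step above, written with escapes: إ أ آ ٱ -> ا
const unifyAlefOnly = (text: string): string =>
  text.replace(/[\u0625\u0623\u0622\u0671]/g, '\u0627');

const ahmadWithHamza = '\u0623\u062D\u0645\u062F'; // أحمد
const ahmadBare = '\u0627\u062D\u0645\u062F'; // احمد
const amanaWithMadda = '\u0622\u0645\u0646'; // آمن

console.log(unifyAlefOnly(ahmadWithHamza) === ahmadBare); // true
console.log(unifyAlefOnly(amanaWithMadda) === '\u0627\u0645\u0646'); // true
```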

Unify hamza variants

Normalizes hamza on different carriers to standalone hamza.

// hamza variants → ء
normalizedText = normalizedText.replace(/[ؤئء]/g, 'ء');

Converts: ؤ ئ → ء
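
Because the alef step runs before this one, a hamza carried by alef never becomes a standalone hamza; only waw and ya carriers do. The sketch below chains the two steps to show that ordering. The helper names are illustrative.

```typescript
// The two steps in the order normalizeArabic applies them.
const alefStep = (text: string): string =>
  text.replace(/[\u0625\u0623\u0622\u0671]/g, '\u0627');
const hamzaStep = (text: string): string =>
  text.replace(/[\u0624\u0626\u0621]/g, '\u0621');
const alefThenHamza = (text: string): string => hamzaStep(alefStep(text));

console.log(alefThenHamza('\u0645\u0624\u0645\u0646') === '\u0645\u0621\u0645\u0646'); // مؤمن -> مءمن
console.log(alefThenHamza('\u0628\u0626\u0631') === '\u0628\u0621\u0631'); // بئر -> بءر
console.log(alefThenHamza('\u0633\u0623\u0644') === '\u0633\u0627\u0644'); // سأل -> سال (alef, not hamza)
```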

Unify ya variants

Converts alif maqsura to regular ya.

// alif maqsura → ي
normalizedText = normalizedText.replace(/ى/g, 'ي');

Converts: ى → ي
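
This unification is a deliberate trade-off: words that differ only in the final letter become indistinguishable. A small sketch, with `unifyYaOnly` as a hypothetical name for the step:

```typescript
// Same replacement as the step above: alif maqsura (U+0649) -> ya (U+064A).
const unifyYaOnly = (text: string): string => text.replace(/\u0649/g, '\u064A');

const prepositionAla = '\u0639\u0644\u0649'; // على
const nameAli = '\u0639\u0644\u064A'; // علي

console.log(unifyYaOnly(prepositionAla) === unifyYaOnly(nameAli)); // true: both match the same token
console.log(unifyYaOnly('\u0645\u0648\u0633\u0649') === '\u0645\u0648\u0633\u064A'); // موسى -> موسي
```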

Clean whitespace and control characters

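Replaces line breaks with a space, removes every character outside the basic Arabic letter range U+0621-U+064A except whitespace and the hyphen, then collapses runs of whitespace into a single space.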

// remove control chars / CRLF / non-Arabic symbols
normalizedText = normalizedText.replace(/[\r\n]+/g, ' ');
normalizedText = normalizedText.replace(/[^\u0621-\u064A\s-]+/g, '');
normalizedText = normalizedText.replace(/\s{2,}/g, ' ');
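
The three replacements can be tried on mixed input to see what survives. In this sketch `cleanupStep` is a hypothetical wrapper around the same three calls; the sample string is invented for the example.

```typescript
// The same three replacements as above, wrapped in one function.
const cleanupStep = (text: string): string =>
  text
    .replace(/[\r\n]+/g, ' ') // line breaks -> single space
    .replace(/[^\u0621-\u064A\s-]+/g, '') // drop Latin letters, digits, punctuation
    .replace(/\s{2,}/g, ' '); // collapse whitespace runs

// "Surah سورة 2", line break, "البقرة - (255)"
const mixedInput =
  'Surah \u0633\u0648\u0631\u0629 2\r\n\u0627\u0644\u0628\u0642\u0631\u0629 - (255)';

const cleanedText = cleanupStep(mixedInput).trim();
console.log(cleanedText === '\u0633\u0648\u0631\u0629 \u0627\u0644\u0628\u0642\u0631\u0629 -'); // true
```

Digits, including the Arabic-Indic digits U+0660-U+0669, fall outside U+0621-U+064A and are removed, so verse numbers do not survive normalization.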

Trim and return

Removes leading and trailing whitespace.

return normalizedText.trim();

Full implementation

From src/utils/normalization.ts:20-43:

export const normalizeArabic = (text: string): string => {
  if (!text) return '';

  let normalizedText = removeTashkeel(text).normalize('NFC');

  // dagger alif + tatweel
  normalizedText = normalizedText.replace(/[\u0670\u0640]/g, '');

  // alef variants → ا
  normalizedText = normalizedText.replace(/[إأآٱ]/g, 'ا');

  // hamza variants → ء
  normalizedText = normalizedText.replace(/[ؤئء]/g, 'ء');

  // alif maqsura → ي
  normalizedText = normalizedText.replace(/ى/g, 'ي');

  // remove control chars / CRLF / non-Arabic symbols
  normalizedText = normalizedText.replace(/[\r\n]+/g, ' ');
  normalizedText = normalizedText.replace(/[^\u0621-\u064A\s-]+/g, '');
  normalizedText = normalizedText.replace(/\s{2,}/g, ' ');

  return normalizedText.trim();
};
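
To check the pipeline end to end, the sketch below runs a few inputs through local copies of both functions and compares the results with the expected strings. The copies mirror the listings on this page (written with Unicode escapes and a chained form); the test cases themselves are illustrative.

```typescript
// Local copies of the two functions listed on this page, so the check runs standalone.
const removeTashkeel = (text: string): string =>
  text
    .replace(/\u0671/g, '\u0627')
    .replace(/[\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E8\u06EA-\u06FC]/g, '');

const normalizeArabic = (text: string): string => {
  if (!text) return '';
  return removeTashkeel(text)
    .normalize('NFC')
    .replace(/[\u0670\u0640]/g, '') // dagger alif + tatweel
    .replace(/[\u0625\u0623\u0622\u0671]/g, '\u0627') // alef variants
    .replace(/[\u0624\u0626\u0621]/g, '\u0621') // hamza variants
    .replace(/\u0649/g, '\u064A') // alif maqsura
    .replace(/[\r\n]+/g, ' ')
    .replace(/[^\u0621-\u064A\s-]+/g, '')
    .replace(/\s{2,}/g, ' ')
    .trim();
};

// [input, expected output]
const pipelineCases: Array<[string, string]> = [
  ['', ''],
  // بِسْمِ ٱللَّهِ -> بسم الله
  ['\u0628\u0650\u0633\u0652\u0645\u0650 \u0671\u0644\u0644\u0651\u064E\u0647\u0650', '\u0628\u0633\u0645 \u0627\u0644\u0644\u0647'],
  // مؤمن -> مءمن
  ['\u0645\u0624\u0645\u0646', '\u0645\u0621\u0645\u0646'],
  // موسى -> موسي
  ['\u0645\u0648\u0633\u0649', '\u0645\u0648\u0633\u064A'],
  // padded "أحمد", line break, "آمن" -> "احمد امن"
  ['  \u0623\u062D\u0645\u062F\r\n\u0622\u0645\u0646  ', '\u0627\u062D\u0645\u062F \u0627\u0645\u0646'],
];

const pipelineResults = pipelineCases.map(([input, expected]) => normalizeArabic(input) === expected);
console.log(pipelineResults.every(Boolean)); // true
```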

Usage in search

The normalizeArabic() function is used internally throughout the search engine to ensure consistent matching:

import { normalizeArabic } from 'quran-search-engine';

export function containsAllTokens(value: string, query: string): boolean {
  const normalizedQuery = normalizeArabic(query);
  if (!normalizedQuery) return false;

  const tokens = normalizedQuery.split(/\s+/);
  const normalizedValue = normalizeArabic(value);
  return tokens.every((token) => normalizedValue.includes(token));
}

Always normalize both the search query and the text being searched to ensure accurate matching regardless of input variations.
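
When the same texts are searched repeatedly, one option is to normalize them once up front and normalize only the query per search. The sketch below shows that idea with local copies of the functions; `IndexedEntry`, `buildIndex`, and `searchIndex` are hypothetical names and do not describe the library's internal index.

```typescript
// Local, condensed copies of the functions from this page.
const removeTashkeel = (text: string): string =>
  text
    .replace(/\u0671/g, '\u0627')
    .replace(/[\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E8\u06EA-\u06FC]/g, '');

const normalizeArabic = (text: string): string => {
  if (!text) return '';
  return removeTashkeel(text)
    .normalize('NFC')
    .replace(/[\u0670\u0640]/g, '')
    .replace(/[\u0625\u0623\u0622\u0671]/g, '\u0627')
    .replace(/[\u0624\u0626\u0621]/g, '\u0621')
    .replace(/\u0649/g, '\u064A')
    .replace(/[\r\n]+/g, ' ')
    .replace(/[^\u0621-\u064A\s-]+/g, '')
    .replace(/\s{2,}/g, ' ')
    .trim();
};

// Hypothetical index entry: original text plus its normalized form.
interface IndexedEntry {
  id: number;
  text: string;
  normalized: string;
}

// Normalize every text once, when the index is built.
const buildIndex = (items: Array<{ id: number; text: string }>): IndexedEntry[] =>
  items.map((item) => ({ ...item, normalized: normalizeArabic(item.text) }));

// Normalize only the query at search time, then require every token to appear.
const searchIndex = (index: IndexedEntry[], query: string): IndexedEntry[] => {
  const normalizedQuery = normalizeArabic(query);
  if (!normalizedQuery) return [];
  const tokens = normalizedQuery.split(/\s+/);
  return index.filter((entry) => tokens.every((token) => entry.normalized.includes(token)));
};

const demoIndex = buildIndex([
  { id: 1, text: '\u0628\u0650\u0633\u0652\u0645\u0650 \u0671\u0644\u0644\u0651\u064E\u0647\u0650' }, // بِسْمِ ٱللَّهِ
  { id: 2, text: '\u0671\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F \u0644\u0650\u0644\u0651\u064E\u0647\u0650' }, // ٱلْحَمْدُ لِلَّهِ
]);

// A vocalized query still matches, because both sides are normalized.
const hitIds = searchIndex(demoIndex, '\u0671\u0644\u0644\u0651\u064E\u0647\u0650').map((e) => e.id);
console.log(hitIds); // [1]
```

Like containsAllTokens(), this uses substring matching, so a short token can match inside a longer word.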

Character unification table

Category	Original Characters	Normalized To
Alef variants	إ أ آ ٱ	ا
Hamza variants	ؤ ئ	ء
Ya variants	ى	ي
Removed	Diacritics, tatweel, dagger alif	(empty)

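normalizeArabic() returns an empty string for empty input. Its `if (!text)` guard also returns '' for null or undefined at runtime, even though the signature only accepts string. removeTashkeel() has no such guard and throws a TypeError on null or undefined.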
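
If removeTashkeel() may receive missing values, a thin wrapper can supply the same kind of guard that normalizeArabic() has. A minimal sketch, assuming a local copy of the function; `safeRemoveTashkeel` is a hypothetical helper, not a library export.

```typescript
// Local copy of the removeTashkeel body from the listing on this page.
const removeTashkeel = (text: string): string =>
  text
    .replace(/\u0671/g, '\u0627')
    .replace(/[\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E8\u06EA-\u06FC]/g, '');

// Hypothetical wrapper: returns '' for null, undefined, or empty input.
const safeRemoveTashkeel = (text: string | null | undefined): string =>
  text ? removeTashkeel(text) : '';

console.log(safeRemoveTashkeel(null) === ''); // true
console.log(safeRemoveTashkeel(undefined) === ''); // true
console.log(safeRemoveTashkeel('\u0628\u0650') === '\u0628'); // true: بِ -> ب
```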
