Overview

Replaces Arabic-Indic numeral page markers enclosed in tortoise shell brackets ⦗ ⦘ (U+2997 and U+2998) with a single space. Shamela texts commonly use these markers to denote page numbers in the original printed edition.

Signature

removeArabicNumericPageMarkers(text: string): string

Parameters

text
string
required
Text potentially containing page markers

Returns

string
The text with numeric markers replaced by a single space

Behavior

  • Matches one or more Arabic-Indic numerals (٠-٩) enclosed in ⦗ ⦘ brackets
  • Includes up to two preceding whitespace characters (space or \r) in the match
  • Includes up to one following whitespace character (space or \r) in the match
  • Replaces the entire match, surrounding whitespace included, with a single space
  • Uses the pattern: /(?: |\r){0,2}⦗[\u0660-\u0669]+⦘(?: |\r)?/g
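
The rules above can be checked with a small sketch. It does not import the library; `stripMarkers` is a hypothetical local stand-in built from the documented pattern, so treat the results as what that pattern implies rather than as confirmed library output.

```typescript
// Hypothetical stand-in built from the documented pattern; not the library source.
const PAGE_MARKER = /(?: |\r){0,2}⦗[\u0660-\u0669]+⦘(?: |\r)?/g;

const stripMarkers = (text: string): string => text.replace(PAGE_MARKER, ' ');

// Surrounding spaces are absorbed into the match, so only one space remains.
const spaced = stripMarkers('أ ⦗١٢⦘ ب');

// With no surrounding whitespace, a single space is still inserted.
const tight = stripMarkers('أ⦗١٢⦘ب');

// Latin digits fall outside U+0660-U+0669, so this marker is left untouched.
const latin = stripMarkers('أ ⦗12⦘ ب');

// A line feed (\n) is not in the whitespace group, so it is preserved.
const newline = stripMarkers('أ\n⦗١٢⦘\nب');

console.log([spaced, tight, latin, newline]);
```

Because the replacement is always a space, a marker wedged between two words without whitespace still leaves the words separated.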

Example

import { removeArabicNumericPageMarkers } from 'shamela';

const text = 'النص الأول ⦗١٢٣⦘ النص الثاني ⦗٤٥٦⦘ النص الثالث';
const cleaned = removeArabicNumericPageMarkers(text);

console.log(cleaned);
// => "النص الأول النص الثاني النص الثالث"

Arabic Numerals

The function recognizes Arabic-Indic numerals (٠-٩):
Arabic    Latin    Unicode
٠         0        U+0660
١         1        U+0661
٢         2        U+0662
٣         3        U+0663
٤         4        U+0664
٥         5        U+0665
٦         6        U+0666
٧         7        U+0667
٨         8        U+0668
٩         9        U+0669
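
Since the ten digits occupy a contiguous code point range, the mapping in the table reduces to an offset from U+0660. The sketch below uses that to read page numbers out of the markers before they are removed. Both `toLatinDigits` and `extractPageNumbers` are hypothetical helpers written for illustration; they are not part of the shamela API.

```typescript
// Hypothetical helper: map each Arabic-Indic digit (U+0660-U+0669) to its Latin counterpart.
const toLatinDigits = (value: string): string =>
    value.replace(/[\u0660-\u0669]/g, (digit) => String(digit.charCodeAt(0) - 0x0660));

// Hypothetical helper: collect the page numbers found inside ⦗ ⦘ markers.
const extractPageNumbers = (text: string): number[] => {
    const marker = /⦗([\u0660-\u0669]+)⦘/g;
    const pages: number[] = [];
    let match: RegExpExecArray | null;
    while ((match = marker.exec(text)) !== null) {
        pages.push(Number(toLatinDigits(match[1])));
    }
    return pages;
};

const pages = extractPageNumbers('النص الأول ⦗١٢٣⦘ النص الثاني ⦗٤٥٦⦘ النص الثالث');
console.log(pages); // [123, 456]
```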

Use Cases

  • Clean display text - Remove page markers before displaying to users
  • Search preparation - Remove markers before indexing for search
  • Text analysis - Clean text for linguistic analysis
  • Export formatting - Remove markers when exporting to other formats
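
As an illustration of the search preparation case, the sketch below shows a phrase that spans a page break. It is self-contained, so `stripPageMarkers` is a hypothetical stand-in based on the documented pattern; in a real project the function would be imported from 'shamela' instead.

```typescript
// Hypothetical stand-in using the documented pattern; not the library source.
const stripPageMarkers = (text: string): string =>
    text.replace(/(?: |\r){0,2}⦗[\u0660-\u0669]+⦘(?: |\r)?/g, ' ');

const raw = 'النص الأول ⦗١٢٣⦘ النص الثاني';
const query = 'الأول النص';

// The marker splits the phrase in the raw text, so a plain substring search misses it.
const foundInRaw = raw.includes(query);

// After cleaning, the two halves are joined by a single space and the phrase matches.
const foundInCleaned = stripPageMarkers(raw).includes(query);

console.log({ foundInRaw, foundInCleaned });
```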

Processing Order

Recommended order in a processing pipeline:
import {
  mapPageCharacterContent,
  removeTagsExceptSpan,
  removeArabicNumericPageMarkers,
  parseContentRobust,
} from 'shamela';

// 1. Normalize characters
let content = mapPageCharacterContent(rawContent);

// 2. Remove unwanted tags
content = removeTagsExceptSpan(content);

// 3. Remove page markers
content = removeArabicNumericPageMarkers(content);

// 4. Parse into structured lines
const lines = parseContentRobust(content);
