Overview

Parses the raw HTML of a Shamela page into an array of structured lines. This is the primary function for processing page content, and it preserves both the title hierarchy and the Arabic punctuation of the source.

Signature

parseContentRobust(content: string): Line[]

Parameters

content (string, required)
The raw HTML markup representing a page.

Returns

Line[] (array)
An array of Line objects, each containing text and an optional ID.
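
The shape of a Line is not spelled out on this page. The sketch below infers it from the `line.id` and `line.text` accesses in the example; the `Line` type as written here and the helpers `isTitleLine` and `collectTitles` are hypothetical illustrations, not exports of the library.

```typescript
// Assumed shape, inferred from how the example on this page reads `line.id` and `line.text`.
type Line = {
    id?: string; // assumed to be set only on lines that came from a title span
    text: string;
};

// Hypothetical helper: treats any line with a non-empty id as a title,
// matching the `if (line.id)` check used in the example.
const isTitleLine = (line: Line): boolean => Boolean(line.id);

// Hypothetical helper: keeps only the title lines, e.g. to build a table of contents.
const collectTitles = (lines: Line[]): Line[] => lines.filter(isTitleLine);
```

Filtering on `id` like this is one way to pull the headings out of a parsed page without re-reading the HTML.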

Behavior

  • Normalizes line endings to Unix-style (\n) before processing
  • Takes a fast path that skips tokenization when no <span> tags are present
  • Preserves title hierarchy from <span data-type="title" id="..."> elements
  • Merges punctuation-only lines into preceding titles
  • Handles nested spans and maintains title context across line breaks
  • Filters out empty lines from the result
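
As a rough illustration of the first two behaviors and the last one, the sketch below normalizes line endings, checks for `<span>` tags, and drops empty lines. It is a minimal approximation, not the library's code: `normalizeLineEndings` and `parsePlainText` are hypothetical names, and trimming each line is an assumption.

```typescript
type Line = { id?: string; text: string };

// Hypothetical helper: converts \r\n and bare \r to \n.
const normalizeLineEndings = (content: string): string => content.replace(/\r\n?/g, '\n');

// Hypothetical fast path: without a <span> tag there can be no title,
// so the content is split directly. Returns null when spans are present,
// signalling that the caller would need the full tokenizer instead.
const parsePlainText = (content: string): Line[] | null => {
    const normalized = normalizeLineEndings(content);
    if (/<span\b/i.test(normalized)) {
        return null;
    }
    return normalized
        .split('\n')
        .map((text) => text.trim()) // assumption: surrounding whitespace is not significant
        .filter((text) => text.length > 0) // empty lines are dropped
        .map((text) => ({ text }));
};
```

Returning `null` keeps the sketch honest about its limits: anything containing a span is left for the slower path.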

Example

import { parseContentRobust } from 'shamela';

const rawHtml = `
<span data-type="title" id="toc-123">الباب الأول</span>
النص العادي
<span data-type="title" id="toc-456">الباب الثاني</span>
نص آخر
`;

const lines = parseContentRobust(rawHtml);

lines.forEach((line) => {
  if (line.id) {
    console.log(`Title [${line.id}]: ${line.text}`);
  } else {
    console.log(`Text: ${line.text}`);
  }
});

// Output:
// Title [123]: الباب الأول
// Text: النص العادي
// Title [456]: الباب الثاني
// Text: نص آخر

Processing Pipeline

  1. Normalize line endings - Convert all line endings to \n
  2. Fast path check - Skip tokenization if no spans present
  3. Tokenize HTML - Break HTML into structural tokens
  4. Process tokens - Extract text and title metadata
  5. Merge punctuation - Combine dangling punctuation with titles
  6. Filter empties - Remove empty lines
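
Steps 3 to 6 can be sketched roughly as follows, for input whose line endings are already normalized. This is a simplified illustration under stated assumptions, not the library's implementation: the `Token` type, `tokenize`, `toLines`, `mergePunctuation`, and `parseSketch` are hypothetical names, the punctuation set is a guess, and dropping the `toc-` prefix from IDs is assumed only because the example output on this page shows `123` for `id="toc-123"`.

```typescript
type Line = { id?: string; text: string };
type Token =
    | { kind: 'open'; titleId: string | undefined }
    | { kind: 'close' }
    | { kind: 'text'; value: string };

// Step 3 (simplified): split markup into <span> tags and text runs.
const tokenize = (html: string): Token[] => {
    const tokens: Token[] = [];
    const tagPattern = /<span\b([^>]*)>|<\/span>/gi;
    let cursor = 0;
    let match = tagPattern.exec(html);
    while (match !== null) {
        if (match.index > cursor) {
            tokens.push({ kind: 'text', value: html.slice(cursor, match.index) });
        }
        const attrs: string | undefined = match[1]; // undefined for a closing tag
        if (attrs === undefined) {
            tokens.push({ kind: 'close' });
        } else {
            const idMatch = /data-type="title"/.test(attrs) ? /\sid="([^"]*)"/.exec(attrs) : null;
            const rawId = idMatch === null ? undefined : idMatch[1];
            // Assumption: the "toc-" prefix is dropped, mirroring the example output.
            tokens.push({ kind: 'open', titleId: rawId === undefined ? undefined : rawId.replace(/^toc-/, '') });
        }
        cursor = match.index + match[0].length;
        match = tagPattern.exec(html);
    }
    if (cursor < html.length) {
        tokens.push({ kind: 'text', value: html.slice(cursor) });
    }
    return tokens;
};

// Steps 4 and 6 (simplified): cut a line at every \n, tag it with the
// innermost enclosing title id, and never emit an empty line.
const toLines = (tokens: Token[]): Line[] => {
    const lines: Line[] = [];
    const openSpans: (string | undefined)[] = []; // undefined marks a non-title span
    let buffer = '';
    let bufferId: string | undefined;
    const currentTitle = (): string | undefined => {
        for (let i = openSpans.length - 1; i >= 0; i--) {
            const id = openSpans[i];
            if (id !== undefined) return id;
        }
        return undefined;
    };
    const flush = (): void => {
        const text = buffer.trim();
        if (text.length > 0) {
            lines.push(bufferId === undefined ? { text } : { id: bufferId, text });
        }
        buffer = '';
        bufferId = undefined;
    };
    for (const token of tokens) {
        if (token.kind === 'open') {
            openSpans.push(token.titleId);
        } else if (token.kind === 'close') {
            openSpans.pop();
        } else {
            token.value.split('\n').forEach((piece, index) => {
                if (index > 0) flush();
                buffer += piece;
                if (bufferId === undefined && piece.trim().length > 0) bufferId = currentTitle();
            });
        }
    }
    flush();
    return lines;
};

// Step 5 (simplified): append a punctuation-only line to a preceding title.
const PUNCTUATION_ONLY = /^[\s.,:;!?،؛؟«»()\[\]-]+$/; // assumed punctuation set
const mergePunctuation = (lines: Line[]): Line[] => {
    const merged: Line[] = [];
    for (const line of lines) {
        const previous = merged.length > 0 ? merged[merged.length - 1] : undefined;
        if (previous !== undefined && previous.id !== undefined && PUNCTUATION_ONLY.test(line.text)) {
            merged[merged.length - 1] = { id: previous.id, text: previous.text + line.text };
        } else {
            merged.push(line);
        }
    }
    return merged;
};

const parseSketch = (normalized: string): Line[] => mergePunctuation(toLines(tokenize(normalized)));
```

Keeping a stack of open spans is what lets a title ID carry across a line break or into a nested span; a regex that matched one title per line would lose that context.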
