parseContentRobust()

Overview

Parses raw Shamela page HTML into an array of structured lines. This is the primary function for processing page content, and it keeps title information and Arabic punctuation intact.

Signature

parseContentRobust(content: string): Line[]

Parameters

content: string (required)

The raw HTML markup representing a page.

Returns

Line[]

An array of Line objects, each containing the line's text and an optional ID.

Line properties

id: string (optional)

Title identifier extracted from data-type="title" spans.

text: string

The text content of the line.
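The property list above can be read as the following type. This is a sketch inferred from this page, so the library's own declaration may differ in detail; `isTitleLine` is a hypothetical helper and not part of the library.

```typescript
// Assumed shape, inferred from the documented properties.
interface Line {
    /** Title identifier; optional. */
    id?: string;
    /** The text content of the line. */
    text: string;
}

// Hypothetical helper: narrows a Line to one that carries a title ID.
function isTitleLine(line: Line): line is Line & { id: string } {
    return typeof line.id === 'string' && line.id.length > 0;
}

// Sample data written by hand for illustration.
const sampleLines: Line[] = [
    { id: '123', text: 'الباب الأول' },
    { text: 'النص العادي' },
];

// Collect the IDs of title lines only.
const titleIds = sampleLines.filter(isTitleLine).map((line) => line.id);
console.log(titleIds);
```

A type guard like this keeps downstream code from having to re-check `line.id` for `undefined` after filtering.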

Behavior

Normalizes line endings to Unix-style (\n) before processing
Takes a fast path that skips tokenization when the content contains no <span> tags
Preserves title information from <span data-type="title" id="..."> elements
Merges punctuation-only lines into the preceding title
Handles nested spans and maintains title context across line breaks
Filters empty lines out of the result
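To make the first two behaviors and the last one concrete, here is a minimal sketch of what a span-free fast path could look like. It is not the library's implementation: `normalizeNewlines` and `parsePlainContent` are hypothetical names, and trimming each line is an assumption.

```typescript
interface Line {
    id?: string;
    text: string;
}

// Hypothetical stand-in for the normalization step: CRLF and lone CR become LF.
function normalizeNewlines(content: string): string {
    return content.replace(/\r\n?/g, '\n');
}

// Hypothetical fast path: with no <span> tags there are no titles to track,
// so every non-empty line becomes a plain Line without an id.
// Returns null when spans are present and full tokenization would be needed.
function parsePlainContent(content: string): Line[] | null {
    const normalized = normalizeNewlines(content);
    if (/<span\b/i.test(normalized)) {
        return null;
    }
    return normalized
        .split('\n')
        .map((text) => text.trim()) // assumption: surrounding whitespace is dropped
        .filter((text) => text.length > 0) // empty lines are filtered out
        .map((text) => ({ text }));
}

const plainResult = parsePlainContent('first line\r\n\r\nsecond line\rthird line');
const spanResult = parsePlainContent('<span data-type="title" id="toc-1">x</span>');
console.log(plainResult, spanResult);
```

The check for `<span` is what makes the shortcut safe in this sketch: title IDs only come from span elements, so content without them can be split on newlines directly.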

Example

import { parseContentRobust } from 'shamela';

const rawHtml = `
<span data-type="title" id="toc-123">الباب الأول</span>
النص العادي
<span data-type="title" id="toc-456">الباب الثاني</span>
نص آخر
`;

const lines = parseContentRobust(rawHtml);

lines.forEach((line) => {
  if (line.id) {
    console.log(`Title [${line.id}]: ${line.text}`);
  } else {
    console.log(`Text: ${line.text}`);
  }
});

// Output:
// Title [123]: الباب الأول
// Text: النص العادي
// Title [456]: الباب الثاني
// Text: نص آخر

Processing Pipeline

1. Normalize line endings - Convert all line endings to \n
2. Fast path check - Skip tokenization if no spans are present
3. Tokenize HTML - Break HTML into structural tokens
4. Process tokens - Extract text and title metadata
5. Merge punctuation - Combine dangling punctuation with titles
6. Filter empties - Remove empty lines
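
The punctuation merge step in the pipeline above can be illustrated in isolation. This is a sketch under stated assumptions, not the library's code: the set of characters treated as punctuation, the choice to append without a separator, and the name `mergeDanglingPunctuation` are all hypothetical.

```typescript
interface Line {
    id?: string;
    text: string;
}

// Assumption: a line is "punctuation-only" when it consists solely of
// whitespace and these Latin or Arabic punctuation characters.
const PUNCTUATION_ONLY = /^[\s.,;:!?،؛؟«»()\[\]"'\-]+$/;

// Append punctuation-only lines to the preceding line when that line is a title.
function mergeDanglingPunctuation(lines: Line[]): Line[] {
    const merged: Line[] = [];
    for (const line of lines) {
        const previous = merged[merged.length - 1];
        const trimmed = line.text.trim();
        const isDangling = !line.id && PUNCTUATION_ONLY.test(trimmed);
        if (previous?.id && isDangling) {
            // Replace the last entry with a copy that carries the punctuation.
            merged[merged.length - 1] = { ...previous, text: previous.text + trimmed };
        } else {
            merged.push(line);
        }
    }
    return merged;
}

const mergedLines = mergeDanglingPunctuation([
    { id: '1', text: 'الباب الأول' },
    { text: ':' }, // follows a title, so it is merged
    { text: 'نص' },
    { text: '.' }, // follows plain text, so it stays on its own line
]);
console.log(mergedLines.length);
```

Only punctuation that directly follows a title is merged in this sketch, matching the wording "combine dangling punctuation with titles".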

Related Functions

removeTagsExceptSpan() - Strip all tags except spans before parsing
normalizeLineEndings() - Normalize line endings
convertContentToMarkdown() - Full pipeline including this function
