Utilities for parsing, sanitizing, and transforming Shamela HTML content.
Parse HTML Content
Parse Shamela HTML into structured lines while preserving title hierarchy:
import { parseContentRobust } from 'shamela/content';
const rawHtml = `
<span data-type="title" id="toc-10">كِتَابُ الْإِيمَانِ</span>
حَدَّثَنَا أَبُو بَكْرٍ
`;
const lines = parseContentRobust(rawHtml);
lines.forEach((line) => {
    if (line.id) {
        console.log(`Title ${line.id}: ${line.text}`);
    } else {
        console.log(`Content: ${line.text}`);
    }
});
Output:
[
    {
        id: '10',
        text: 'كِتَابُ الْإِيمَانِ'
    },
    {
        text: 'حَدَّثَنَا أَبُو بَكْرٍ'
    }
]
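Once the lines are parsed, a common next step is to group body lines under the title that precedes them. The sketch below assumes the line shape seen in the output above (an optional `id` plus `text`); `ParsedLine`, `Section`, and `groupLinesByTitle` are hypothetical names for illustration, not part of the library.

```typescript
// Assumed shape, inferred from the sample output: only title lines carry an `id`.
type ParsedLine = { id?: string; text: string };
type Section = { titleId: string | null; title: string | null; body: string[] };

// Group each body line under the most recent title line.
function groupLinesByTitle(parsed: ParsedLine[]): Section[] {
    const grouped: Section[] = [];
    let current: Section | null = null;
    for (const line of parsed) {
        if (line.id !== undefined) {
            // A title starts a new section.
            current = { titleId: line.id, title: line.text, body: [] };
            grouped.push(current);
        } else {
            if (current === null) {
                // Text that appears before any title goes into an untitled section.
                current = { titleId: null, title: null, body: [] };
                grouped.push(current);
            }
            current.body.push(line.text);
        }
    }
    return grouped;
}

const sampleLines: ParsedLine[] = [
    { id: '10', text: 'كِتَابُ الْإِيمَانِ' },
    { text: 'حَدَّثَنَا أَبُو بَكْرٍ' },
];
const groupedSections = groupLinesByTitle(sampleLines);
console.log(groupedSections.length); // 1
```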
Character Normalization
Apply regex-based replacement rules to normalize Arabic text:
import { mapPageCharacterContent } from 'shamela/content';
// Default rules: remove \u821C, fix img tags, expand abbreviations
const text = 'Prophet Muhammad \uFD4C was born';
const normalized = mapPageCharacterContent(text);
console.log(normalized);
// Output: "Prophet Muhammad صلى الله عليه وآله وسلم was born"
Custom Mapping Rules
import { mapPageCharacterContent } from 'shamela/content';
import { DEFAULT_MAPPING_RULES } from 'shamela/constants';
// Extend the default rules with an extra pattern/replacement pair
const customRules = {
    ...DEFAULT_MAPPING_RULES,
    'customPattern': 'replacement',
};
// rawContent is the page content string you want to normalize
const processed = mapPageCharacterContent(rawContent, customRules);
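To make the idea of a pattern-to-replacement rule map concrete, here is a minimal sketch of how such a map could be applied. It assumes each key is a regular-expression source and each value its replacement string; `applyMappingRules` is a hypothetical helper and may differ from what the library does internally.

```typescript
// Hypothetical helper: treats every key as a regex source applied globally, in insertion order.
function applyMappingRules(input: string, rules: Record<string, string>): string {
    let output = input;
    for (const [pattern, replacement] of Object.entries(rules)) {
        output = output.replace(new RegExp(pattern, 'g'), replacement);
    }
    return output;
}

const expandedHonorific = 'صلى الله عليه وآله وسلم';
const sampleRules: Record<string, string> = {
    '\uFD4C': expandedHonorific, // expand the honorific ligature
    '\\s{2,}': ' ',              // collapse runs of whitespace
};
const mappedSample = applyMappingRules('Prophet  Muhammad \uFD4C was born', sampleRules);
console.log(mappedSample);
```

Keys are compiled as regular expressions in this sketch, so literal characters such as `.` or `(` would need escaping.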
Separate Body from Footnotes
Split page content from trailing footnotes:
import { splitPageBodyFromFooter } from 'shamela/content';
const content = 'Main text content_________Footnote text here';
const [body, footnotes] = splitPageBodyFromFooter(content);
console.log('Body:', body); // "Main text content"
console.log('Footnotes:', footnotes); // "Footnote text here"
To split on a custom marker, pass it as the second argument:
import { splitPageBodyFromFooter } from 'shamela/content';
const content = 'Text===NOTES===Footnotes';
const [body, footnotes] = splitPageBodyFromFooter(content, '===NOTES===');
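The split behaviour shown above can be sketched with plain string operations. This is an illustration only: `splitAtMarker` and `ASSUMED_MARKER` are hypothetical, and returning an empty footnote string when no marker is found is an assumption rather than documented library behaviour.

```typescript
// Assumed default marker: the run of underscores used in the first example.
const ASSUMED_MARKER = '_________';

// Hypothetical helper: split at the first occurrence of the marker.
function splitAtMarker(page: string, marker: string = ASSUMED_MARKER): [string, string] {
    const index = page.indexOf(marker);
    if (index === -1) {
        // No marker found: treat the whole page as body (an assumption).
        return [page, ''];
    }
    return [page.slice(0, index), page.slice(index + marker.length)];
}

const [sampleBody, sampleNotes] = splitAtMarker(`Main text content${ASSUMED_MARKER}Footnote text here`);
console.log(sampleBody);  // "Main text content"
console.log(sampleNotes); // "Footnote text here"
```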
Remove Page Markers
Remove page markers written as Arabic-Indic digits enclosed in tortoise shell brackets (⦗ ⦘):
import { removeArabicNumericPageMarkers } from 'shamela/content';
const text = 'النص ⦗١٢٣⦘ هنا';
const clean = removeArabicNumericPageMarkers(text);
console.log(clean); // "النص هنا"
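A marker of this kind can be matched with a single regular expression. The sketch below is an assumption about the approach, not the library's code: `stripPageMarkers` is a hypothetical helper that matches Arabic-Indic digits (U+0660 to U+0669) between the two brackets and replaces the marker and its surrounding whitespace with one space.

```typescript
// Hypothetical helper: remove ⦗digits⦘ markers and collapse the whitespace around them.
function stripPageMarkers(input: string): string {
    return input.replace(/\s*⦗[\u0660-\u0669]+⦘\s*/g, ' ');
}

const markerSample = stripPageMarkers('النص ⦗١٢٣⦘ هنا');
console.log(markerSample); // "النص هنا"
```

A marker at the very start or end of the input leaves a single leading or trailing space in this sketch, so a final `trim()` may be appropriate.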
Remove anchor and hadeeth tags while preserving span elements:
import { removeTagsExceptSpan } from 'shamela/content';
const html = 'قبل <a href="#">رابط</a> <hadeeth>نص</hadeeth> <span>يبقى</span>';
const clean = removeTagsExceptSpan(html);
console.log(clean); // "قبل رابط نص <span>يبقى</span>"
Convert to Markdown
Convert Shamela HTML to Markdown format:
import { htmlToMarkdown } from 'shamela/content';
const html = `
<span data-type="title">باب الإيمان</span>
حَدَّثَنَا <a href="inr://man-123">أبو بكر</a>
`;
const markdown = htmlToMarkdown(html);
console.log(markdown);
// Output: "## باب الإيمان\nحَدَّثَنَا أبو بكر"
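The conversion above amounts to turning title spans into headings and dropping the remaining tags. The following is a rough sketch of that idea, not the library's implementation; `titlesToHeadings` is a hypothetical name, and the heading level `##` is taken from the output shown above.

```typescript
// Hypothetical helper: title spans become "## ..." headings, every other tag is stripped.
function titlesToHeadings(html: string): string {
    return html
        .replace(/<span data-type="title"[^>]*>([^<]*)<\/span>/g, '## $1')
        .replace(/<[^>]+>/g, '')
        .trim();
}

const headingSample = titlesToHeadings(`
<span data-type="title">Faith</span>
Narrated by <a href="inr://man-123">Abu Bakr</a>
`);
console.log(headingSample); // "## Faith\nNarrated by Abu Bakr"
```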
Normalize Title Spans
Handle consecutive title spans that would produce multiple headings:
import { normalizeTitleSpans } from 'shamela/content';
const html = '<span data-type="title">باب الميم</span><span data-type="title">من اسمه محمد</span>';
// Split onto separate lines (recommended)
const split = normalizeTitleSpans(html, { strategy: 'splitLines' });
// Output: "<span data-type=\"title\">باب الميم</span>\n<span data-type=\"title\">من اسمه محمد</span>"
// Merge into single title
const merged = normalizeTitleSpans(html, { strategy: 'merge', separator: ' — ' });
// Output: "<span data-type=\"title\">باب الميم — من اسمه محمد</span>"
// Convert to hierarchy
const hierarchy = normalizeTitleSpans(html, { strategy: 'hierarchy' });
// Output: First span stays title, rest become data-type="subtitle"
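For intuition, the `splitLines` strategy can be approximated with one regular expression that inserts a newline between adjacent title spans. This is a sketch under that assumption; `splitAdjacentTitles` is a hypothetical helper and does not cover the `merge` or `hierarchy` strategies.

```typescript
// Hypothetical helper: put each of several consecutive title spans on its own line.
function splitAdjacentTitles(html: string): string {
    // Match a complete title span only when another title span follows it directly.
    const adjacentTitle = /(<span data-type="title"[^>]*>[^<]*<\/span>)\s*(?=<span data-type="title")/g;
    return html.replace(adjacentTitle, '$1\n');
}

const titleSample = splitAdjacentTitles(
    '<span data-type="title">A</span><span data-type="title">B</span><span data-type="title">C</span>',
);
console.log(titleSample.split('\n').length); // 3
```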
Move Pre-Title Text
Move text that sits between a line break and a title span into that span:
import { moveContentAfterLineBreakIntoSpan } from 'shamela/content';
const html = '\r١ - <span data-type="title">الباب الأول</span>';
const result = moveContentAfterLineBreakIntoSpan(html);
console.log(result);
// Output: "\r<span data-type=\"title\">١ - الباب الأول</span>"
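The transformation can be pictured as a capture-and-reorder replacement: the text between the line break and the opening tag is moved to just after the tag. The sketch below assumes that reading of the example; `pullPrefixIntoTitle` is a hypothetical helper, not the library function.

```typescript
// Hypothetical helper: move "prefix" text that follows a line break into the title span after it.
function pullPrefixIntoTitle(html: string): string {
    // $1 = line break, $2 = prefix text, $3 = opening title tag
    return html.replace(/(\r|\n)([^\r\n<]+)(<span data-type="title"[^>]*>)/g, '$1$3$2');
}

const prefixSample = pullPrefixIntoTitle('\r1 - <span data-type="title">Chapter One</span>');
console.log(JSON.stringify(prefixSample));
// "\r<span data-type=\"title\">1 - Chapter One</span>"
```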
Full Markdown Conversion Pipeline
Apply the recommended transformation sequence:
import { convertContentToMarkdown } from 'shamela/content';
const html = '<span data-type="title">كتاب</span><span data-type="title">الإيمان</span>';
const markdown = convertContentToMarkdown(html);
console.log(markdown);
// Output: "## كتاب\n## الإيمان"
With Custom Options
import { convertContentToMarkdown } from 'shamela/content';
const html = '<span data-type="title">First</span><span data-type="title">Second</span>';
const markdown = convertContentToMarkdown(html, {
    strategy: 'merge',
    separator: ' - '
});
console.log(markdown);
// Output: "## First - Second"
Remove all HTML tags, keeping only text:
import { stripHtmlTags } from 'shamela/content';
const html = '<div><p>Hello <strong>World</strong></p></div>';
const text = stripHtmlTags(html);
console.log(text); // "Hello World"
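Tag stripping of this kind is often done with a simple pattern that removes anything between `<` and `>`. The sketch below shows that technique with a hypothetical `stripAllTags` helper; it is a regex approximation, not a full HTML parser, and it is not claimed to be how the library implements `stripHtmlTags`.

```typescript
// Hypothetical helper: drop every <...> sequence and keep the text between tags.
function stripAllTags(html: string): string {
    return html.replace(/<[^>]+>/g, '');
}

const strippedSample = stripAllTags('<div><p>Hello <strong>World</strong></p></div>');
console.log(strippedSample); // "Hello World"
```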
Normalize HTML for Styling
Convert hadeeth tags to standard spans:
import { normalizeHtml } from 'shamela/content';
const html = '<hadeeth-123>Hadith content</hadeeth>';
const normalized = normalizeHtml(html);
console.log(normalized);
// Output: "<span class=\"hadeeth\">Hadith content</span>"
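The mapping from numbered hadeeth tags to styled spans can be illustrated with two replacements, one for opening tags and one for closing tags. `hadeethTagsToSpans` is a hypothetical helper written for this sketch; the tag forms it accepts are inferred from the example above.

```typescript
// Hypothetical helper: <hadeeth>, <hadeeth-N> and their closing tags become span elements.
function hadeethTagsToSpans(html: string): string {
    return html
        .replace(/<hadeeth(?:-\d+)?>/g, '<span class="hadeeth">')
        .replace(/<\/hadeeth(?:-\d+)?>/g, '</span>');
}

const spanSample = hadeethTagsToSpans('<hadeeth-123>Hadith content</hadeeth>');
console.log(spanSample); // <span class="hadeeth">Hadith content</span>
```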
Normalize Line Endings
Convert all line endings to Unix-style:
import { normalizeLineEndings } from 'shamela/content';
const windowsText = 'line1\r\nline2';
const normalized = normalizeLineEndings(windowsText);
console.log(normalized); // "line1\nline2"
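Line-ending normalization is a one-line replacement. The sketch below uses a hypothetical `toUnixLineEndings` helper that converts both Windows (`\r\n`) and lone carriage-return (`\r`) endings; whether the library also handles lone `\r` is an assumption.

```typescript
// Hypothetical helper: "\r\n" and lone "\r" both become "\n".
function toUnixLineEndings(input: string): string {
    return input.replace(/\r\n?/g, '\n');
}

const unixSample = toUnixLineEndings('line1\r\nline2\rline3\nline4');
console.log(JSON.stringify(unixSample)); // "line1\nline2\nline3\nline4"
```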
Complete Processing Pipeline
Combine utilities for comprehensive content cleaning:
import {
    mapPageCharacterContent,
    removeTagsExceptSpan,
    removeArabicNumericPageMarkers,
    splitPageBodyFromFooter,
    parseContentRobust
} from 'shamela/content';
function processPageContent(rawHtml: string) {
    // 1. Remove unwanted tags
    let content = removeTagsExceptSpan(rawHtml);
    // 2. Normalize characters
    content = mapPageCharacterContent(content);
    // 3. Remove page markers
    content = removeArabicNumericPageMarkers(content);
    // 4. Separate body from footnotes
    const [body, footnotes] = splitPageBodyFromFooter(content);
    // 5. Parse into structured lines
    const lines = parseContentRobust(body);
    return { lines, footnotes };
}
// Use the pipeline
const result = processPageContent('<a href="#">Text</a> ⦗١٢٣⦘_________Footnotes');
console.log('Lines:', result.lines);
console.log('Footnotes:', result.footnotes);
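Because the cleaning steps in the pipeline above each take a string and return a string, they can also be chained with a small composition helper. The sketch below is self-contained and uses stand-in transforms so that it runs without the library; `composeTransforms` and the three stand-ins are hypothetical.

```typescript
type Transform = (input: string) => string;

// Hypothetical helper: run each transform in order, feeding one result into the next.
function composeTransforms(...steps: Transform[]): Transform {
    return (input) => steps.reduce((acc, step) => step(acc), input);
}

// Stand-in transforms, defined here only so the sketch is runnable on its own.
const dropAnchors: Transform = (s) => s.replace(/<\/?a\b[^>]*>/g, '');
const dropPageMarkers: Transform = (s) => s.replace(/\s*⦗[\u0660-\u0669]+⦘\s*/g, ' ');
const trimEdges: Transform = (s) => s.trim();

const cleanPage = composeTransforms(dropAnchors, dropPageMarkers, trimEdges);
const cleanedSample = cleanPage('<a href="#">Text</a> ⦗١٢٣⦘ more');
console.log(cleanedSample); // "Text more"
```

With the library installed, the same helper could presumably chain `removeTagsExceptSpan`, `mapPageCharacterContent`, and `removeArabicNumericPageMarkers`, since the examples above call each of them with a single string argument.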
Browser-Only Usage
Import content utilities without the full library:
// Lightweight import for browser (no sql.js dependency)
import {
    mapPageCharacterContent,
    splitPageBodyFromFooter,
    removeTagsExceptSpan,
    parseContentRobust,
    htmlToMarkdown,
    convertContentToMarkdown
} from 'shamela/content';
// Process pre-downloaded content in the browser
// (rawContent is the page content string you fetched earlier)
const clean = removeTagsExceptSpan(mapPageCharacterContent(rawContent));
const [body, footnotes] = splitPageBodyFromFooter(clean);
const markdown = htmlToMarkdown(body);
The shamela/content export is well suited to client-side React/Next.js components that should not load the sql.js WASM binary: the content utilities are roughly 1.5 KB gzipped, versus roughly 900 KB when sql.js is included.