Utility functions provide common text processing operations with special support for Arabic text.

createHints

Creates normalized hints for robust Arabic matching (diacritics/punctuation tolerant). Hints are used by markTokensWithDividers to insert hard segment breaks at specific multi-word phrases.
function createHints(
  first: ArabicNormalizationOptions | string,
  ...restHints: string[]
): Hints

Parameters

first
ArabicNormalizationOptions | string
required
Either the first hint string, or an options object overriding the default normalization
restHints
string[]
Additional hint strings. When the first argument is an options object, all hint strings are passed here.

Returns

hints
Hints
A normalized hint map plus the normalization settings used for matching

Example

import { createHints, markTokensWithDividers } from 'paragrafs';

// Default normalization (Arabic-first)
const hints = createHints(
  'أحسن الله إليكم',
  'بارك الله فيكم'
);

// Custom normalization
const customHints = createHints(
  { normalizeAlef: false, normalizeYa: true },
  'custom hint',
  'another hint'
);

// Use with markTokensWithDividers
const tokens = [/* ... */];
const marked = markTokensWithDividers(tokens, {
  gapThreshold: 3,
  hints
});

Default Normalization

By default, hints use the following Arabic normalization:
  • normalizeAlef: true - Converts أإآ → ا
  • normalizeYa: true - Converts ى → ي
  • removeTatweel: true - Removes tatweel (ـ)
  • normalizeHamza: false - Preserves hamza variations
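
To make the defaults above concrete, here is a minimal sketch of how such character mappings could be applied. It is not the library's implementation: `SketchNormalizationOptions`, `SKETCH_DEFAULTS`, and `sketchApplyArabicOptions` are hypothetical names, the input is assumed to be in precomposed (NFC) form, and `normalizeHamza` is left as a no-op because this page does not say what it maps to.

```typescript
// Hypothetical option shape mirroring the four settings listed above.
interface SketchNormalizationOptions {
  normalizeAlef: boolean;
  normalizeYa: boolean;
  removeTatweel: boolean;
  normalizeHamza: boolean;
}

// Defaults copied from the list above.
const SKETCH_DEFAULTS: SketchNormalizationOptions = {
  normalizeAlef: true,
  normalizeYa: true,
  removeTatweel: true,
  normalizeHamza: false,
};

function sketchApplyArabicOptions(
  text: string,
  overrides: Partial<SketchNormalizationOptions> = {},
): string {
  const options = { ...SKETCH_DEFAULTS, ...overrides };
  let result = text;
  // Tatweel (U+0640) is a purely typographic stretch character.
  if (options.removeTatweel) result = result.replace(/\u0640/g, '');
  // U+0623, U+0625, U+0622 (alef with hamza or madda) become plain alef U+0627.
  if (options.normalizeAlef) result = result.replace(/[\u0623\u0625\u0622]/g, '\u0627');
  // Alef maqsura U+0649 becomes ya U+064A.
  if (options.normalizeYa) result = result.replace(/\u0649/g, '\u064A');
  // normalizeHamza is intentionally not implemented in this sketch.
  return result;
}
```

Passing `{ normalizeAlef: false }` as the second argument leaves the alef variants untouched while the other defaults still apply.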

formatSecondsToTimestamp

Formats seconds into a human-readable timestamp.
function formatSecondsToTimestamp(seconds: number): string

Parameters

seconds
number
required
The time duration in seconds

Returns

timestamp
string
Formatted timestamp string:
  • For durations less than an hour: m:ss (e.g., “1:05”)
  • For durations an hour or longer: h:mm:ss (e.g., “1:02:05”)
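
The two output formats above can be reproduced with a few lines of arithmetic. The following is a sketch under assumptions, not the library's source: `sketchFormatTimestamp` is a hypothetical name, fractional seconds are assumed to be truncated, and negative or non-finite input is not handled.

```typescript
// Zero-pads a value below 10 to two digits.
function sketchPad2(value: number): string {
  return value < 10 ? `0${value}` : String(value);
}

function sketchFormatTimestamp(seconds: number): string {
  const total = Math.floor(seconds); // assumption: drop fractional seconds
  const hours = Math.floor(total / 3600);
  const minutes = Math.floor((total % 3600) / 60);
  const secs = total % 60;

  // h:mm:ss once the duration reaches an hour, m:ss otherwise.
  if (hours > 0) {
    return `${hours}:${sketchPad2(minutes)}:${sketchPad2(secs)}`;
  }
  return `${minutes}:${sketchPad2(secs)}`;
}
```

Only the leading unit is left unpadded, which is why 65 seconds renders as 1:05 rather than 01:05.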

Example

import { formatSecondsToTimestamp } from 'paragrafs';

console.log(formatSecondsToTimestamp(65));
// "1:05"

console.log(formatSecondsToTimestamp(3725));
// "1:02:05"

console.log(formatSecondsToTimestamp(45));
// "0:45"

console.log(formatSecondsToTimestamp(0));
// "0:00"

isEndingWithPunctuation

Checks if a text string ends with sentence-ending punctuation. Supports both English and Arabic punctuation marks.
function isEndingWithPunctuation(text: string): boolean

Parameters

text
string
required
The text to check for ending punctuation

Returns

hasPunctuation
boolean
true if the text ends with punctuation, false otherwise

Supported Punctuation

  • Period: .
  • Question mark: ? or ؟ (Arabic)
  • Exclamation: !
  • Arabic semicolon: ؛
  • Ellipsis: …
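
A check like this is commonly written as a single anchored character class. The sketch below assumes the set of marks is exactly the one listed above (with the ellipsis being the single character U+2026) and that trailing whitespace is not trimmed first. `SKETCH_SENTENCE_END` and `sketchEndsWithPunctuation` are hypothetical names, not part of the library.

```typescript
// . ? ! plus Arabic question mark (U+061F), Arabic semicolon (U+061B), ellipsis (U+2026)
const SKETCH_SENTENCE_END = /[.?!\u061F\u061B\u2026]$/;

function sketchEndsWithPunctuation(text: string): boolean {
  // Only the final character is inspected; an empty string never matches.
  return SKETCH_SENTENCE_END.test(text);
}
```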

Example

import { isEndingWithPunctuation } from 'paragrafs';

console.log(isEndingWithPunctuation('Hello world.'));
// true

console.log(isEndingWithPunctuation('Hello world'));
// false

console.log(isEndingWithPunctuation('كيف حالك؟'));
// true (Arabic question mark)

console.log(isEndingWithPunctuation('Wait...'));
// true (the final character is a period, which is in the supported list)

console.log(isEndingWithPunctuation('Wait…'));
// true (ellipsis character)

tokenizeGroundTruth

Splits ground truth text into tokens, keeping punctuation attached to the preceding word instead of emitting it as a separate token.
function tokenizeGroundTruth(groundTruth: string): string[]

Parameters

groundTruth
string
required
The ground truth text to tokenize

Returns

tokens
string[]
The tokenized ground truth with punctuation properly attached to preceding words

Example

import { tokenizeGroundTruth } from 'paragrafs';

const text = 'Hello world! How are you?';
const tokens = tokenizeGroundTruth(text);
console.log(tokens);
// ['Hello', 'world!', 'How', 'are', 'you?']
// Note: punctuation is attached to words, not separate tokens

// Handles Arabic punctuation
const arabic = 'السلام عليكم! كيف حالك؟';
const arabicTokens = tokenizeGroundTruth(arabic);
console.log(arabicTokens);
// ['السلام', 'عليكم!', 'كيف', 'حالك؟']
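
One way to get this behavior is to split on whitespace and then glue any punctuation-only piece onto the word before it. The sketch below illustrates that idea only. `sketchTokenize` and `SKETCH_PUNCT_ONLY` are hypothetical names, and the punctuation set is an assumption, so the library may handle edge cases differently.

```typescript
// Assumed punctuation set: ASCII marks plus Arabic comma, semicolon,
// question mark, and the ellipsis character.
const SKETCH_PUNCT_ONLY = /^[.,;:?!\u060C\u061B\u061F\u2026]+$/;

function sketchTokenize(groundTruth: string): string[] {
  const tokens: string[] = [];
  for (const piece of groundTruth.trim().split(/\s+/)) {
    if (piece === '') continue; // empty input splits into one empty piece
    if (tokens.length > 0 && SKETCH_PUNCT_ONLY.test(piece)) {
      // A stray mark such as "!" is appended to the previous word.
      tokens[tokens.length - 1] += piece;
    } else {
      tokens.push(piece);
    }
  }
  return tokens;
}
```

With this approach, input where a mark is separated by a space (for example `'Hello world !'`) still yields `'world!'` as one token.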

normalizeTokenText

Normalizes token text for Arabic-first matching and mining. It builds on basic normalization (diacritic removal and punctuation trimming) and adds optional Arabic-specific normalizations. Use the same normalization for:
  • Mining repeated sequences
  • Matching hints against tokens
function normalizeTokenText(
  text: string,
  options?: ArabicNormalizationOptions
): string

Parameters

text
string
required
The token text to normalize
options
ArabicNormalizationOptions
Optional Arabic-specific normalizations

Returns

normalized
string
A normalized token string suitable for comparisons

Normalization Process

  1. Decomposes Unicode characters (NFD normalization)
  2. Removes zero-width characters
  3. Removes Arabic diacritics
  4. Strips leading/trailing punctuation
  5. Applies optional Arabic-specific normalizations
  6. Recomposes Unicode characters (NFC normalization)
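
The six steps above can be sketched as a small pipeline. This is an illustration under assumptions rather than the library's code: `SketchOptions`, `sketchTrimEdges`, and `sketchNormalizeToken` are hypothetical names, the edge punctuation set and the diacritic range are guesses, and the option defaults are assumed to match those listed for hints. One detail worth noticing is that after NFD decomposition the alef variants are a plain alef followed by a combining mark, so the alef step has to match that decomposed form.

```typescript
// Hypothetical option names borrowed from this page; the real type may differ.
interface SketchOptions {
  normalizeAlef?: boolean;
  normalizeYa?: boolean;
  removeTatweel?: boolean;
}

// Assumed edge punctuation; the library's actual set is not documented here.
const SKETCH_EDGE_CHARS = '.,;:!?()[]{}"\'-\u00AB\u00BB\u060C\u061B\u061F\u2026';

function sketchTrimEdges(text: string): string {
  let start = 0;
  let end = text.length;
  while (start < end && SKETCH_EDGE_CHARS.indexOf(text.charAt(start)) !== -1) start++;
  while (end > start && SKETCH_EDGE_CHARS.indexOf(text.charAt(end - 1)) !== -1) end--;
  return text.slice(start, end);
}

function sketchNormalizeToken(text: string, options: SketchOptions = {}): string {
  const { normalizeAlef = true, normalizeYa = true, removeTatweel = true } = options;

  let result = text.normalize('NFD');                     // 1. decompose
  result = result.replace(/[\u200B-\u200D\uFEFF]/g, '');  // 2. zero-width characters
  result = result.replace(/[\u064B-\u0652\u0670]/g, '');  // 3. tashkeel and dagger alef
  result = sketchTrimEdges(result);                       // 4. edge punctuation

  // 5. Optional Arabic-specific steps. In NFD form an alef variant is
  //    U+0627 followed by a combining madda or hamza (U+0653..U+0655).
  if (normalizeAlef) result = result.replace(/\u0627[\u0653-\u0655]/g, '\u0627');
  if (normalizeYa) result = result.replace(/\u0649/g, '\u064A');
  if (removeTatweel) result = result.replace(/\u0640/g, '');

  return result.normalize('NFC');                         // 6. recompose
}
```

Because the diacritic range in step 3 deliberately excludes the combining hamza and madda marks, calling the sketch with `{ normalizeAlef: false }` lets step 6 recompose the original alef variants.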

Example

import { normalizeTokenText } from 'paragrafs';

// Remove diacritics
const text1 = normalizeTokenText('أَحْسَنَ');
console.log(text1);
// 'احسن'

// Normalize alef variants (default)
const text2 = normalizeTokenText('أإآ', { normalizeAlef: true });
console.log(text2);
// 'ااا' (each variant becomes a plain alef)

// Preserve alef variants
const text3 = normalizeTokenText('أإآ', { normalizeAlef: false });
console.log(text3);
// 'أإآ' (preserved)

// Remove punctuation from edges
const text4 = normalizeTokenText('(hello!)');
console.log(text4);
// 'hello'

Use Cases

  • Hint matching: Normalize both hints and tokens for robust matching
  • Phrase mining: Normalize text before counting n-gram frequencies
  • Search: Normalize search queries and transcript text for better results
  • Deduplication: Identify duplicate phrases despite different spellings
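
As an illustration of the phrase mining use case, the sketch below counts n-gram frequencies over normalized tokens. `sketchCountNgrams` is a hypothetical helper, not a library function. It takes the normalizer as a parameter, so `normalizeTokenText` could be passed in, and any other `(token: string) => string` function works for experimentation.

```typescript
function sketchCountNgrams(
  tokens: string[],
  n: number,
  normalize: (token: string) => string,
): Map<string, number> {
  const counts = new Map<string, number>();
  if (n < 1) return counts;

  // Normalize first and drop tokens that normalize to nothing (e.g. bare punctuation).
  const normalized = tokens.map(normalize).filter((token) => token.length > 0);

  // Slide a window of n tokens and count each joined phrase.
  for (let i = 0; i + n <= normalized.length; i++) {
    const key = normalized.slice(i, i + n).join(' ');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Usage with a stand-in normalizer (lowercase, strip one trailing period).
const sketchCounts = sketchCountNgrams(
  ['Thank', 'you.', 'thank', 'YOU', 'all'],
  2,
  (token) => token.toLowerCase().replace(/\.$/, ''),
);
```

Because both spellings of the phrase normalize to the same key, they are counted together, which is the same effect the deduplication use case relies on.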
