Utility Functions

Utility functions provide common text processing operations with special support for Arabic text.

createHints

Creates normalized hints for robust Arabic matching that tolerates differences in diacritics and punctuation. markTokensWithDividers uses these hints to insert hard segment breaks at specific multi-word phrases.

function createHints(
  first: ArabicNormalizationOptions | string,
  ...restHints: string[]
): Hints

Parameters

first

ArabicNormalizationOptions | string

required

Either the first hint string, or an options object overriding the default normalization

restHints

string[]

Remaining hint strings. If the first argument is an options object, these are all of the hints; otherwise they follow the first hint string.

Returns

hints

Hints

A normalized hint map plus the normalization settings used for matching

Properties

map

HintMap

Map of hints organized by their first word

normalization

ArabicNormalizationOptions

The normalization settings used
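
To illustrate what "organized by their first word" could look like, here is a minimal sketch. The names HintMapSketch and buildHintMapSketch are hypothetical, and the internal shape of the real HintMap is not documented on this page, so treat this as one plausible layout rather than the library's actual structure.

```typescript
// Hypothetical layout: first normalized word -> list of hint phrases (as word arrays).
type HintMapSketch = Map<string, string[][]>;

function buildHintMapSketch(
  hints: string[],
  normalize: (word: string) => string
): HintMapSketch {
  const map: HintMapSketch = new Map();
  for (const hint of hints) {
    // Split each hint into words and normalize every word the same way.
    const words = hint
      .split(/\s+/)
      .map(normalize)
      .filter((word) => word.length > 0);
    if (words.length === 0) continue;
    // Group phrases under their first word so lookup needs only the current token.
    const bucket = map.get(words[0]) || [];
    bucket.push(words);
    map.set(words[0], bucket);
  }
  return map;
}
```

Keying by the first word means a matcher only has to compare full phrases when the current token already matches the start of some hint.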

Example

import { createHints, markTokensWithDividers } from 'paragrafs';

// Default normalization (Arabic-first)
const hints = createHints(
  'أحسن الله إليكم',
  'بارك الله فيكم'
);

// Custom normalization
const customHints = createHints(
  { normalizeAlef: false, normalizeYa: true },
  'custom hint',
  'another hint'
);

// Use with markTokensWithDividers
const tokens = [/* ... */];
const marked = markTokensWithDividers(tokens, {
  gapThreshold: 3,
  hints
});

Default Normalization

By default, hints use the following Arabic normalization:

normalizeAlef: true - Converts أإآ → ا
normalizeYa: true - Converts ى → ي
removeTatweel: true - Removes tatweel (ـ)
normalizeHamza: false - Preserves hamza variations
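
The sketch below shows one way the overloaded first argument and the defaults listed above might be resolved. The names NormalizationSketch, DEFAULTS_SKETCH, and resolveHintArgsSketch are hypothetical, and merging partial overrides onto the defaults is an assumption; this page does not state how unspecified options are filled in.

```typescript
// Hypothetical mirror of the four documented options.
type NormalizationSketch = {
  normalizeAlef: boolean;
  normalizeHamza: boolean;
  normalizeYa: boolean;
  removeTatweel: boolean;
};

// Defaults as listed in this section.
const DEFAULTS_SKETCH: NormalizationSketch = {
  normalizeAlef: true,
  normalizeHamza: false,
  normalizeYa: true,
  removeTatweel: true,
};

function resolveHintArgsSketch(
  first: Partial<NormalizationSketch> | string,
  ...rest: string[]
): { hints: string[]; normalization: NormalizationSketch } {
  if (typeof first === 'string') {
    // No options object: the first argument is itself a hint.
    return { hints: [first, ...rest], normalization: { ...DEFAULTS_SKETCH } };
  }
  // Options object: assumed to override only the keys it provides.
  return { hints: rest, normalization: { ...DEFAULTS_SKETCH, ...first } };
}
```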

formatSecondsToTimestamp

Formats seconds into a human-readable timestamp.

function formatSecondsToTimestamp(seconds: number): string

Parameters

seconds

number

required

The time duration in seconds

Returns

timestamp

string

Formatted timestamp string:

For durations less than an hour: m:ss (e.g., “1:05”)
For durations an hour or longer: h:mm:ss (e.g., “1:02:05”)

Example

import { formatSecondsToTimestamp } from 'paragrafs';

console.log(formatSecondsToTimestamp(65));
// "1:05"

console.log(formatSecondsToTimestamp(3725));
// "1:02:05"

console.log(formatSecondsToTimestamp(45));
// "0:45"

console.log(formatSecondsToTimestamp(0));
// "0:00"
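
The two output formats can be reproduced with plain integer arithmetic. The following is a minimal sketch, not the library's implementation; formatTimestampSketch is a hypothetical name, and flooring fractional seconds is an assumption, since this page does not say how non-integer input is handled.

```typescript
function formatTimestampSketch(seconds: number): string {
  // Assumption: fractional seconds are dropped.
  const total = Math.floor(seconds);
  const hours = Math.floor(total / 3600);
  const minutes = Math.floor((total % 3600) / 60);
  const secs = total % 60;
  // Zero-pad a value to two digits.
  const pad = (n: number): string => (n < 10 ? '0' + n : String(n));

  if (hours > 0) {
    // h:mm:ss for durations of an hour or longer.
    return hours + ':' + pad(minutes) + ':' + pad(secs);
  }
  // m:ss for durations under an hour.
  return minutes + ':' + pad(secs);
}
```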

isEndingWithPunctuation

Checks if a text string ends with sentence-ending punctuation. Supports both English and Arabic punctuation marks.

function isEndingWithPunctuation(text: string): boolean

Parameters

text

string

required

The text to check for ending punctuation

Returns

hasPunctuation

boolean

true if the text ends with punctuation, false otherwise

Supported Punctuation

Period: .
Question mark: ? or ؟ (Arabic)
Exclamation: !
Arabic semicolon: ؛
Ellipsis: …

Example

import { isEndingWithPunctuation } from 'paragrafs';

console.log(isEndingWithPunctuation('Hello world.'));
// true

console.log(isEndingWithPunctuation('Hello world'));
// false

console.log(isEndingWithPunctuation('كيف حالك؟'));
// true (Arabic question mark)

console.log(isEndingWithPunctuation('Wait...'));
// true (the last character is a period, which is in the supported list above)

console.log(isEndingWithPunctuation('Wait…'));
// true (ellipsis character)
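
A check of this kind can be sketched as a lookup of the final character against the supported set. The names SENTENCE_ENDINGS_SKETCH and endsWithSentencePunctuationSketch are hypothetical, and the sketch assumes no trimming of trailing whitespace, which this page does not specify.

```typescript
// The characters from the Supported Punctuation list, written as escapes.
const SENTENCE_ENDINGS_SKETCH: string[] = [
  '.',      // period
  '?',      // question mark
  '\u061F', // Arabic question mark
  '!',      // exclamation mark
  '\u061B', // Arabic semicolon
  '\u2026', // ellipsis character
];

function endsWithSentencePunctuationSketch(text: string): boolean {
  if (text.length === 0) return false;
  // All supported marks are single UTF-16 code units, so the last unit is enough.
  const last = text.charAt(text.length - 1);
  return SENTENCE_ENDINGS_SKETCH.indexOf(last) !== -1;
}
```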

tokenizeGroundTruth

Splits ground truth text into tokens, keeping punctuation attached to the preceding word instead of emitting it as a separate token.

function tokenizeGroundTruth(groundTruth: string): string[]

Parameters

groundTruth

string

required

The ground truth text to tokenize

Returns

tokens

string[]

The tokenized ground truth with punctuation properly attached to preceding words

Example

import { tokenizeGroundTruth } from 'paragrafs';

const text = 'Hello world! How are you?';
const tokens = tokenizeGroundTruth(text);
console.log(tokens);
// ['Hello', 'world!', 'How', 'are', 'you?']
// Note: punctuation is attached to words, not separate tokens

// Handles Arabic punctuation
const arabic = 'السلام عليكم! كيف حالك؟';
const arabicTokens = tokenizeGroundTruth(arabic);
console.log(arabicTokens);
// ['السلام', 'عليكم!', 'كيف', 'حالك؟']
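
One way to get this behavior is to split on whitespace and then merge any punctuation-only piece into the token before it. The sketch below is an assumption about the approach, not the library's code; tokenizeSketch is a hypothetical name and the punctuation set is chosen for illustration.

```typescript
function tokenizeSketch(text: string): string[] {
  const tokens: string[] = [];
  // Assumed punctuation set: Latin marks plus Arabic comma, semicolon, question mark, ellipsis.
  const punctuationOnly = /^[.,!?;:\u060C\u061B\u061F\u2026]+$/;

  for (const piece of text.trim().split(/\s+/)) {
    if (piece === '') continue; // empty input yields no tokens
    if (punctuationOnly.test(piece) && tokens.length > 0) {
      // A stray punctuation piece is glued onto the preceding word.
      tokens[tokens.length - 1] += piece;
    } else {
      tokens.push(piece);
    }
  }
  return tokens;
}
```

With this sketch, input that already has punctuation attached passes through unchanged, while input such as 'you ?' is repaired into a single token.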

normalizeTokenText

Normalizes token text for Arabic-first matching and mining. It builds on basic normalization (diacritic removal and trimming of edge punctuation) and adds optional Arabic-specific normalizations. Use the same normalization settings for both of the following so that their results stay comparable:

Mining repeated sequences
Matching hints against tokens

function normalizeTokenText(
  text: string,
  options?: ArabicNormalizationOptions
): string

Parameters

text

string

required

The token text to normalize

options

ArabicNormalizationOptions

Optional Arabic-specific normalizations

Properties

normalizeAlef

boolean

default: true

Convert أإآ → ا

normalizeHamza

boolean

default: false

Normalize hamza variations

normalizeYa

boolean

default: true

Convert ى → ي

removeTatweel

boolean

default: true

Remove tatweel character (ـ)

Returns

normalized

string

A normalized token string suitable for comparisons

Normalization Process

Decomposes Unicode characters (NFD normalization)
Removes zero-width characters
Removes Arabic diacritics
Strips leading/trailing punctuation
Applies optional Arabic-specific normalizations
Recomposes Unicode characters (NFC normalization)
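
The steps above can be sketched in order as follows. This is an illustration under stated assumptions, not the library's implementation: normalizeTokenSketch, NormalizeSketchOptions, and EDGE_PUNCTUATION_SKETCH are hypothetical names, the diacritic range and edge punctuation set are assumed, and normalizeHamza is left out because its exact behavior is not described on this page.

```typescript
// Hypothetical option shape; the real type is ArabicNormalizationOptions.
type NormalizeSketchOptions = {
  normalizeAlef?: boolean;
  normalizeYa?: boolean;
  removeTatweel?: boolean;
};

// Assumed edge punctuation: whitespace, common Latin marks, Arabic comma,
// Arabic semicolon, Arabic question mark, and the ellipsis character.
const EDGE_PUNCTUATION_SKETCH =
  /^[\s.,!?;:()\[\]{}"'\u060C\u061B\u061F\u2026]+|[\s.,!?;:()\[\]{}"'\u060C\u061B\u061F\u2026]+$/g;

function normalizeTokenSketch(text: string, options: NormalizeSketchOptions = {}): string {
  const { normalizeAlef = true, normalizeYa = true, removeTatweel = true } = options;

  // 1. Decompose, so alef variants become plain alef plus a combining mark.
  let out = text.normalize('NFD');
  // 2. Remove zero-width characters.
  out = out.replace(/[\u200B-\u200D\uFEFF]/g, '');
  // 3. Remove Arabic diacritics (tanween, short vowels, shadda, sukun, dagger alef).
  out = out.replace(/[\u064B-\u0652\u0670]/g, '');
  // 4. Strip leading and trailing punctuation.
  out = out.replace(EDGE_PUNCTUATION_SKETCH, '');
  // 5. Optional Arabic-specific steps.
  if (normalizeAlef) {
    // Drop the combining madda/hamza marks that follow a decomposed alef.
    out = out.replace(/\u0627[\u0653-\u0655]/g, '\u0627');
  }
  if (normalizeYa) out = out.replace(/\u0649/g, '\u064A'); // alef maksura to ya
  if (removeTatweel) out = out.replace(/\u0640/g, '');
  // 6. Recompose whatever was not normalized away.
  return out.normalize('NFC');
}
```

Removing diacritics before the alef step matters in this sketch: NFD reorders combining marks, so a short vowel can sit between the alef and its hamza mark until step 3 removes it.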

Example

import { normalizeTokenText } from 'paragrafs';

// Remove diacritics
const text1 = normalizeTokenText('أَحْسَنَ');
console.log(text1);
// 'احسن'

// Normalize alef variants (default)
const text2 = normalizeTokenText('أإآ', { normalizeAlef: true });
console.log(text2);
// 'ااا' (each variant becomes a plain alef)

// Preserve alef variants
const text3 = normalizeTokenText('أإآ', { normalizeAlef: false });
console.log(text3);
// 'أإآ' (preserved)

// Remove punctuation from edges
const text4 = normalizeTokenText('(hello!)');
console.log(text4);
// 'hello'

Use Cases

Hint matching: Normalize both hints and tokens for robust matching
Phrase mining: Normalize text before counting n-gram frequencies
Search: Normalize search queries and transcript text for better results
Deduplication: Identify duplicate phrases despite different spellings
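
As an illustration of the phrase mining use case, the sketch below counts n-grams over normalized tokens. countNgramsSketch is a hypothetical helper, not part of the documented API; the normalizer is passed in as a parameter so the same function used for hint matching can be reused, for example a wrapper around normalizeTokenText.

```typescript
function countNgramsSketch(
  tokens: string[],
  n: number,
  normalize: (token: string) => string
): Map<string, number> {
  // Normalize first so spelling variants fall into the same bucket.
  const words = tokens.map(normalize).filter((word) => word.length > 0);
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= words.length; i++) {
    const key = words.slice(i, i + n).join(' ');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Usage with a simple stand-in normalizer (lowercase, strip basic punctuation).
const standIn = (token: string): string => token.toLowerCase().replace(/[.,!?]/g, '');
const bigrams = countNgramsSketch(['Hello,', 'world', 'hello', 'world!'], 2, standIn);
```

Sequences that occur often after normalization are natural candidates to pass to createHints.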
