The hint generation functions mine frequent n-grams (runs of n consecutive tokens) from token streams and return hint candidates in sorted order. This is particularly useful for Arabic transcripts, where repeated phrases like “أحسن الله إليكم” should trigger segment breaks.

generateHintsFromTokens

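Mines frequent n-grams from a token stream and returns hint candidates sorted by frequency. The function is Arabic-first: n-grams are counted on normalized token text, so surface variants of the same phrase (for example, with or without diacritics or trailing punctuation) can be counted together.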
function generateHintsFromTokens(
  tokens: Token[],
  options?: GenerateHintsOptions
): GeneratedHint[]
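
To make the idea of mining on normalized text concrete, here is a minimal sketch of n-gram counting. It is not the paragrafs implementation: `normalizeToken` and `countNgrams` are hypothetical helpers, and the normalization rules (strip diacritics and punctuation, unify alef forms) are assumptions chosen for illustration.

```typescript
// Illustrative only: not the paragrafs implementation.
type Token = { start: number; end: number; text: string };

// Hypothetical normalizer (assumed rules, not the library's).
const normalizeToken = (text: string): string =>
  text
    .replace(/[\u064B-\u0652\u0670]/g, '') // Arabic diacritics (tashkeel)
    .replace(/[\u0622\u0623\u0625]/g, '\u0627') // alef variants -> bare alef
    .replace(/[\u060C\u061B\u061F.,!?:;]/g, ''); // Arabic and Latin punctuation

// Count every n-gram of length minN..maxN over the normalized token texts.
function countNgrams(tokens: Token[], minN: number, maxN: number): Map<string, number> {
  const words = tokens.map((t) => normalizeToken(t.text)).filter((w) => w.length > 0);
  const counts = new Map<string, number>();
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= words.length; i++) {
      const key = words.slice(i, i + n).join(' ');
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

const sample: Token[] = [
  { start: 0, end: 1, text: 'أَحْسَنَ' },
  { start: 1, end: 2, text: 'الله' },
  { start: 2, end: 3, text: 'إليكم،' },
  { start: 3, end: 4, text: 'شيخنا' },
  { start: 5, end: 6, text: 'أَحْسَنَ' },
  { start: 6, end: 7, text: 'الله' },
  { start: 7, end: 8, text: 'إليكم' },
];

const counts = countNgrams(sample, 2, 4);
// Keep only n-grams seen at least twice (a minimum-count threshold).
const repeated = Array.from(counts.entries())
  .filter(([, count]) => count >= 2)
  .map(([phrase]) => phrase);
console.log(repeated); // two 2-grams and one 3-gram, each seen twice
```

Because counting happens after normalization, the diacritized opening and the trailing comma in the first occurrence do not prevent the two occurrences from being counted as the same phrase.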

Parameters

tokens
Token[]
required
Array of tokens to mine for repeated phrases
options
GenerateHintsOptions
Configuration options for hint generation

Returns

hints
GeneratedHint[]
Array of generated hints sorted by frequency, then length, then alphabetically
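
The three-level sort order can be written as a comparator. This sketch assumes descending frequency, then longer phrases first, then ascending string order; the page does not state the sort directions, so treat them as assumptions. `HintLike` is a hypothetical type that borrows only the `phrase`, `count`, and `length` fields shown in the example output.

```typescript
// Illustrative comparator; sort directions are assumed, not documented.
type HintLike = { phrase: string; count: number; length: number };

const compareHints = (a: HintLike, b: HintLike): number => {
  if (a.count !== b.count) return b.count - a.count; // more frequent first
  if (a.length !== b.length) return b.length - a.length; // longer first
  return a.phrase < b.phrase ? -1 : a.phrase > b.phrase ? 1 : 0; // then by string order
};

const unsorted: HintLike[] = [
  { phrase: 'b c', count: 2, length: 2 },
  { phrase: 'a b c', count: 2, length: 3 },
  { phrase: 'x y', count: 5, length: 2 },
  { phrase: 'a b', count: 2, length: 2 },
];

const sortedPhrases = unsorted.slice().sort(compareHints).map((h) => h.phrase);
console.log(sortedPhrases); // [ 'x y', 'a b c', 'a b', 'b c' ]
```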

Example

import { generateHintsFromTokens, createHints } from 'paragrafs';

const tokens = [
  { start: 0, end: 1, text: 'أَحْسَنَ' },
  { start: 1, end: 2, text: 'الله' },
  { start: 2, end: 3, text: 'إليكم،' },
  { start: 3, end: 4, text: 'شيخنا' },
  { start: 5, end: 6, text: 'أَحْسَنَ' },
  { start: 6, end: 7, text: 'الله' },
  { start: 7, end: 8, text: 'إليكم' },
  // ... more tokens ...
];

const mined = generateHintsFromTokens(tokens, {
  minN: 2,
  maxN: 4,
  minCount: 2,
  dedupe: 'closed',
  normalization: { normalizeAlef: true }
});

console.log(mined);
// [
//   {
//     phrase: 'أحسن الله إليكم',
//     normalizedPhrase: 'احسن الله اليكم',
//     count: 2,
//     length: 3,
//     firstOccurrenceIndex: 0,
//     topSurfaceForms: ['أحسن الله إليكم', 'أَحْسَنَ الله إليكم،']
//   }
// ]

// Convert the top 25 mined phrases into hints
const hints = createHints(
  { normalizeAlef: true },
  ...mined.slice(0, 25).map(h => h.phrase)
);
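
The example passes `dedupe: 'closed'`, which this page does not define. One common meaning, borrowed from frequent-pattern mining, is to drop any phrase that is contained in a longer phrase with the same count; that would explain why the sample output lists only the three-word phrase. The sketch below illustrates that interpretation only. `keepClosed` and `Counted` are hypothetical names, and the library's behavior may differ.

```typescript
// Illustrative only: one possible reading of "closed" deduplication.
type Counted = { phrase: string; count: number };

// True when `inner` occurs as a contiguous word sequence inside `outer`.
const containsPhrase = (outer: string[], inner: string[]): boolean => {
  for (let i = 0; i + inner.length <= outer.length; i++) {
    if (inner.every((w, j) => outer[i + j] === w)) return true;
  }
  return false;
};

// Drop a phrase when a longer phrase with the same count contains it.
function keepClosed(candidates: Counted[]): Counted[] {
  const withWords = candidates.map((c) => ({ item: c, words: c.phrase.split(' ') }));
  return withWords
    .filter(
      (c) =>
        !withWords.some(
          (o) =>
            o.words.length > c.words.length &&
            o.item.count === c.item.count &&
            containsPhrase(o.words, c.words),
        ),
    )
    .map((c) => c.item);
}

const closed = keepClosed([
  { phrase: 'a b', count: 2 },
  { phrase: 'b c', count: 2 },
  { phrase: 'a b c', count: 2 },
  { phrase: 'b d', count: 3 },
]);
console.log(closed.map((c) => c.phrase)); // [ 'a b c', 'b d' ]
```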

Use Cases

  • Auto-discovery: Automatically find repeated phrases in long transcripts
  • Quality improvement: Identify common expressions that should trigger segment breaks
  • Arabic lectures: Find repeated phrases like greetings, blessings, and transitions
  • Custom segmentation: Use discovered phrases to improve transcript formatting

generateHintsFromSegments

Mine frequent n-grams from segments. By default, phrases cannot cross segment boundaries (use boundaryStrategy: 'none' to mine across boundaries).
function generateHintsFromSegments(
  segments: Segment[],
  options?: GenerateHintsOptions
): GeneratedHint[]

Parameters

segments
Segment[]
required
Array of segments to mine for repeated phrases
options
GenerateHintsOptions
Configuration options (same as generateHintsFromTokens, plus boundaryStrategy)

Returns

hints
GeneratedHint[]
Array of generated hints sorted by frequency, then length, then alphabetically

Example

import { generateHintsFromSegments } from 'paragrafs';

const segments = [
  {
    start: 0,
    end: 10,
    text: 'أحسن الله إليكم يا شيخ',
    tokens: [
      { start: 0, end: 2, text: 'أحسن' },
      { start: 2, end: 4, text: 'الله' },
      { start: 4, end: 6, text: 'إليكم' },
      { start: 6, end: 8, text: 'يا' },
      { start: 8, end: 10, text: 'شيخ' }
    ]
  },
  {
    start: 10,
    end: 18,
    text: 'بارك الله فيكم',
    tokens: [
      { start: 10, end: 13, text: 'بارك' },
      { start: 13, end: 15, text: 'الله' },
      { start: 15, end: 18, text: 'فيكم' }
    ]
  },
  // ... more segments with repeated phrases ...
];

// Default: phrases don't cross segment boundaries
const hints = generateHintsFromSegments(segments, {
  minN: 2,
  maxN: 4,
  minCount: 2
});

// Allow phrases to cross segment boundaries
const crossBoundaryHints = generateHintsFromSegments(segments, {
  minN: 2,
  maxN: 4,
  minCount: 2,
  boundaryStrategy: 'none'
});

Boundary Strategy Comparison

boundaryStrategy: 'segment' (default)
  • Mines each segment independently
  • Merges results across segments
  • Phrases that span a boundary (the last tokens of one segment followed by the first tokens of the next) won’t be detected
  • Recommended for most use cases
boundaryStrategy: 'none'
  • Treats all segments as one continuous token stream
  • Can detect phrases that span segment boundaries
  • May find less meaningful phrases at segment edges
  • Useful for finding transitions between segments
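
The difference between the two strategies can be seen by enumerating candidate bigrams both ways. This is a conceptual sketch, not the library's code: `bigrams` is a hypothetical helper, and the `Token` and `Segment` shapes are copied from the example above.

```typescript
// Illustrative only: contrasts per-segment and flattened n-gram enumeration.
type Token = { start: number; end: number; text: string };
type Segment = { start: number; end: number; text: string; tokens: Token[] };

// All adjacent word pairs in a word list.
const bigrams = (words: string[]): string[] =>
  words.slice(0, -1).map((w, i) => `${w} ${words[i + 1]}`);

const segs: Segment[] = [
  {
    start: 0, end: 10, text: 'أحسن الله إليكم يا شيخ',
    tokens: [
      { start: 0, end: 2, text: 'أحسن' },
      { start: 2, end: 4, text: 'الله' },
      { start: 4, end: 6, text: 'إليكم' },
      { start: 6, end: 8, text: 'يا' },
      { start: 8, end: 10, text: 'شيخ' },
    ],
  },
  {
    start: 10, end: 18, text: 'بارك الله فيكم',
    tokens: [
      { start: 10, end: 13, text: 'بارك' },
      { start: 13, end: 15, text: 'الله' },
      { start: 15, end: 18, text: 'فيكم' },
    ],
  },
];

const wordsPerSegment = segs.map((s) => s.tokens.map((t) => t.text));

// Like 'segment': enumerate inside each segment, then merge the results.
const perSegment = wordsPerSegment.reduce<string[]>((acc, w) => acc.concat(bigrams(w)), []);

// Like 'none': join all tokens into one stream before enumerating.
const allWords = wordsPerSegment.reduce<string[]>((acc, w) => acc.concat(w), []);
const flattened = bigrams(allWords);

// Bigrams that exist only because two segments were joined.
const crossing = flattened.filter((b) => perSegment.indexOf(b) === -1);
console.log(perSegment.length, flattened.length, crossing);
```

Here the flattened stream yields exactly one extra candidate, formed from the last token of the first segment and the first token of the second, which is the kind of edge phrase the 'none' strategy can surface.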

Use Cases

  • Segment-aware mining: Find repeated phrases within natural segment boundaries
  • Lecture analysis: Identify repeated expressions in educational content
  • Quality metrics: Measure how often specific phrases appear across a transcript
  • Custom formatting: Use discovered patterns to improve segment formatting
