Overview

Paragrafs provides a simple API for converting raw AI transcription tokens into properly formatted paragraphs. This guide covers the essential functions you’ll need to get started.

Installation

First, install Paragrafs in your project:
npm install paragrafs

Core Workflow

The basic workflow for processing transcriptions involves three main steps:
1. Estimate segments from tokens: Convert multi-word tokens into segments with word-level timing information.
2. Mark and combine segments: Process segments to identify natural paragraph breaks based on fillers, gaps, and punctuation.
3. Format into readable output: Transform marked segments into clean, formatted text.
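To make the first step more concrete, the sketch below shows one way word-level timings could be estimated from a multi-word token. It assumes each word receives an equal share of the token's duration; `splitTokenEvenly` is a hypothetical helper written for illustration and is not part of the Paragrafs API, whose actual estimation strategy may differ.

```typescript
// Illustrative sketch only; this is NOT the Paragrafs implementation.
type Token = { start: number; end: number; text: string };

// Hypothetical helper: split a multi-word token into per-word tokens,
// assuming every word takes an equal share of the token's duration.
function splitTokenEvenly(token: Token): Token[] {
  const words = token.text.split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) {
    return [];
  }
  const step = (token.end - token.start) / words.length;
  return words.map((text, i) => ({
    start: token.start + i * step,
    end: token.start + (i + 1) * step,
    text,
  }));
}

const words = splitTokenEvenly({ start: 0, end: 4, text: 'The quick brown fox' });
console.log(words);
```

In practice, call `estimateSegmentFromToken` from the library rather than a hand-rolled helper like this one.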

Quick Start Example

Here’s a complete example showing how to process a simple transcription:
import {
  estimateSegmentFromToken,
  mapSegmentsIntoFormattedSegments,
} from 'paragrafs';

// Example token from transcription
const token = {
  start: 0,
  end: 5,
  text: 'This is a sample text. It should be properly segmented.',
};

// Estimate segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Format the segment (a single segment needs no combining)
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);

console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."

Working with Multiple Segments

For more complex transcriptions with multiple segments, use the complete processing pipeline:
import {
  markAndCombineSegments,
  mapSegmentsIntoFormattedSegments,
} from 'paragrafs';

// Example transcription segments
const segments = [
  {
    start: 0,
    end: 6.5,
    text: 'The quick brown fox!',
    tokens: [
      { start: 0, end: 1, text: 'The' },
      { start: 1, end: 2, text: 'quick' },
      { start: 2, end: 3, text: 'brown' },
      { start: 3, end: 6.5, text: 'fox!' },
    ],
  },
  {
    start: 8,
    end: 13,
    text: 'Jumps right over the',
    tokens: [
      { start: 8, end: 9, text: 'Jumps' },
      { start: 9, end: 10, text: 'right' },
      { start: 10, end: 11, text: 'over' },
      { start: 12, end: 13, text: 'the' },
    ],
  },
];

// Options for segment formatting
const options = {
  fillers: ['uh', 'umm', 'hmmm'],
  gapThreshold: 3,
  maxSecondsPerSegment: 12,
  minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

console.log(formattedSegments);

Configuration Options

The markAndCombineSegments function accepts several options to customize paragraph reconstruction:
Option | Type | Description
--- | --- | ---
fillers | string[] | Words to treat as filler (e.g., “uh”, “umm”) that trigger segment breaks
gapThreshold | number | Minimum time gap in seconds to trigger a segment break
maxSecondsPerSegment | number | Maximum duration in seconds for a single segment
minWordsPerSegment | number | Minimum words required for a segment to stand alone
hints | Hints | Optional multi-word phrase hints for custom break points
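The following sketch illustrates how the `gapThreshold` and `fillers` options might be interpreted. It assumes a break is triggered when the silence between two consecutive tokens is at least `gapThreshold` seconds, and that fillers are matched case-insensitively after trailing punctuation is stripped. Both `findGapBreaks` and `isFiller` are hypothetical helpers for illustration, not library functions, and the library's exact comparison rules may differ.

```typescript
// Illustrative sketch only; this is NOT the Paragrafs implementation.
type Token = { start: number; end: number; text: string };

// Hypothetical helper: indices of tokens preceded by a silence of at least
// gapThreshold seconds (assumes an inclusive ">=" comparison).
function findGapBreaks(tokens: Token[], gapThreshold: number): number[] {
  const breaks: number[] = [];
  for (let i = 1; i < tokens.length; i++) {
    const gap = tokens[i].start - tokens[i - 1].end;
    if (gap >= gapThreshold) {
      breaks.push(i);
    }
  }
  return breaks;
}

// Hypothetical helper: case-insensitive filler match, ignoring trailing punctuation.
function isFiller(token: Token, fillers: string[]): boolean {
  const word = token.text.toLowerCase().replace(/[.,!?;:]+$/, '');
  return fillers.indexOf(word) !== -1;
}

const tokens: Token[] = [
  { start: 0, end: 1, text: 'The' },
  { start: 1, end: 2, text: 'quick' },
  { start: 2, end: 3, text: 'brown' },
  { start: 3, end: 6.5, text: 'fox!' },
  { start: 8, end: 9, text: 'Jumps' },
  { start: 9, end: 10, text: 'right' },
  { start: 10, end: 11, text: 'over' },
  { start: 12, end: 13, text: 'the' },
];

const breaksAtThree = findGapBreaks(tokens, 3);
const breaksAtOnePointFive = findGapBreaks(tokens, 1.5);
const ummIsFiller = isFiller({ start: 0, end: 1, text: 'Umm,' }, ['uh', 'umm', 'hmmm']);
console.log(breaksAtThree, breaksAtOnePointFive, ummIsFiller);
```

Under these assumptions, the 1.5-second pause before 'Jumps' is too short to count as a gap when `gapThreshold` is 3, but it does count once the threshold is lowered to 1.5.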

Core Data Types

Understanding the basic types will help you work effectively with Paragrafs:
type Token = {
  start: number;  // Start time in seconds
  end: number;    // End time in seconds
  text: string;   // The transcribed text
};

type Segment = Token & {
  tokens: Token[];  // Word-by-word breakdown with timings
};

type MarkedSegment = {
  start: number;
  end: number;
  tokens: MarkedToken[];  // Tokens with break markers
};
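As a small illustration of how `Token` and `Segment` relate, the sketch below builds a `Segment` from word-level tokens by taking the first token's start, the last token's end, and the space-joined text. `segmentFromTokens` is a hypothetical helper for illustration and is not exported by Paragrafs.

```typescript
// Illustrative sketch only; the types mirror the definitions above.
type Token = { start: number; end: number; text: string };
type Segment = Token & { tokens: Token[] };

// Hypothetical helper: derive a Segment's bounds and text from its tokens.
// Assumes the tokens are already in chronological order.
function segmentFromTokens(tokens: Token[]): Segment {
  if (tokens.length === 0) {
    throw new Error('Cannot build a segment from an empty token list');
  }
  return {
    start: tokens[0].start,
    end: tokens[tokens.length - 1].end,
    text: tokens.map((t) => t.text).join(' '),
    tokens,
  };
}

const segment = segmentFromTokens([
  { start: 0, end: 1, text: 'The' },
  { start: 1, end: 2, text: 'quick' },
  { start: 2, end: 3, text: 'brown' },
  { start: 3, end: 6.5, text: 'fox!' },
]);
console.log(segment.text, segment.start, segment.end);
```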

Next Steps

Timestamped Transcripts

Learn how to create human-readable transcripts with timestamps

Ground Truth Alignment

Align AI tokens with human-edited text

Auto-Hint Generation

Automatically discover repeated phrases

Arabic Support

Work with Arabic text normalization
