Overview

Kokokor groups text lines into paragraphs using layout heuristics. Two parameters control most of this behavior: verticalJumpFactor and widthTolerance. Understanding how they work helps you tune paragraph detection for different document types.

Core Concepts

Paragraph detection in Kokokor uses four coordinated signals:
  1. Vertical Jump Detection - Detects spacing increases between lines
  2. Indent Detection - Identifies right-edge indentation from baseline
  3. List-Start Detection - Recognizes repeated left-edge patterns
  4. Short Line Detection - Marks paragraph-ending lines
The verticalJumpFactor and widthTolerance options primarily control signals 1 and 4.

verticalJumpFactor

What It Does

The verticalJumpFactor determines when a vertical gap between lines is large enough to indicate a new paragraph. It works by comparing consecutive gaps:
// A new paragraph starts when:
// currentGap > previousGap * verticalJumpFactor

Default Value

{
  paragraph: {
    verticalJumpFactor: 2.0 // default
  }
}

How It Works

Consider three consecutive lines:
Line A         (y: 100)
  gap: 25px
Line B         (y: 125)
  gap: 60px    ← Is this a paragraph break?
Line C         (y: 185)
With verticalJumpFactor = 2.0:
  • currentGap = 60
  • previousGap = 25
  • threshold = 25 * 2.0 = 50
  • 60 > 50 → Yes, new paragraph
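
The comparison above can be sketched as a small helper. This is an illustration only, not Kokokor's internal code: `Line`, `gapsBetween`, and `isVerticalJump` are hypothetical names, and the gap is assumed to be the difference between consecutive `y` values, as in the example.

```typescript
// Hypothetical minimal line shape: only the vertical position matters here.
interface Line {
  y: number;
}

// Gap between each line and the one before it (difference of y positions).
function gapsBetween(lines: Line[]): number[] {
  return lines.slice(1).map((line, i) => line.y - lines[i].y);
}

// The rule from this section: a break is signalled when the current gap
// exceeds the previous gap scaled by verticalJumpFactor.
function isVerticalJump(
  currentGap: number,
  previousGap: number,
  verticalJumpFactor = 2.0,
): boolean {
  return currentGap > previousGap * verticalJumpFactor;
}

// Lines A, B, C from the example: gaps of 25 and 60.
const gaps = gapsBetween([{ y: 100 }, { y: 125 }, { y: 185 }]);
const breakBeforeLineC = isVerticalJump(gaps[1], gaps[0]); // 60 > 25 * 2.0
console.log(gaps, breakBeforeLineC);
```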

Tuning the Factor

// verticalJumpFactor: 1.5
// Lower factor: smaller spacing increases create new paragraphs
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.5
  }
});

// Example: gap of 40px after 30px → same paragraph
// 40 > 30 * 1.5 (45)? No
// But gap of 50px after 30px → new paragraph
// 50 > 30 * 1.5 (45)? Yes

Important Notes

The vertical jump signal only activates when the preceding lines are full-width (not short lines). This prevents false breaks after natural line endings.
// This will NOT trigger a vertical break:
This is a long line that ends the paragraph.
Short line.        ← Short line
  (big gap)
Next paragraph.    ← Gap is ignored due to short line above

// The short line already signals the paragraph break,
// so vertical jump detection is suppressed
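
One way to read this rule is as a guard in front of the gap comparison. The sketch below assumes that only the immediately preceding line is checked for shortness, which the text does not spell out. `GapContext` and `verticalJumpSignal` are hypothetical names, not part of Kokokor's API.

```typescript
// Hypothetical input for a single decision point between two lines.
interface GapContext {
  currentGap: number;
  previousGap: number;
  precedingLineIsShort: boolean; // assumption: only the line directly above is checked
}

function verticalJumpSignal(ctx: GapContext, verticalJumpFactor = 2.0): boolean {
  // A short line above already marks the paragraph end, so the gap is ignored.
  if (ctx.precedingLineIsShort) {
    return false;
  }
  return ctx.currentGap > ctx.previousGap * verticalJumpFactor;
}

const afterShortLine = verticalJumpSignal({
  currentGap: 60,
  previousGap: 25,
  precedingLineIsShort: true,
});
const afterFullLine = verticalJumpSignal({
  currentGap: 60,
  previousGap: 25,
  precedingLineIsShort: false,
});
console.log(afterShortLine, afterFullLine); // false true
```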

widthTolerance

What It Does

The widthTolerance determines what counts as a “short line” that signals a paragraph ending. It is a fraction of the reference width: a line narrower than referenceWidth * widthTolerance causes the following line to start a new paragraph.

Default Value

{
  paragraph: {
    widthTolerance: 0.85 // default (85% of reference width)
  }
}

How It Works

Kokokor computes a reference width from the document:
  1. Collects all line widths
  2. Calculates the 75th percentile (p75) width
  3. This becomes the reference width
Then for each line:
thresholdWidth = referenceWidth * widthTolerance

if (line.width < thresholdWidth) {
  // This is a "short line" - next line starts new paragraph
}

Example Calculation

// Document with line widths: [400, 420, 410, 300, 415, 405, 350]
// Sorted: [300, 350, 400, 405, 410, 415, 420]
// p75 width ≈ 415 (75th percentile; the exact value depends on how the percentile index is rounded)

// With widthTolerance = 0.85:
thresholdWidth = 415 * 0.85 = 352.75

// Line classification:
// 400px → full-width (400 > 352.75)
// 420px → full-width (420 > 352.75)
// 350px → SHORT LINE (350 < 352.75) → triggers paragraph break
// 300px → SHORT LINE (300 < 352.75) → triggers paragraph break
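
The classification above can be reproduced with a one-line predicate. This is a sketch under the stated rule, not Kokokor's source. `isShortLine` is a hypothetical name, and the reference width of 415 is taken from the example rather than computed.

```typescript
// A line is "short" when it is narrower than referenceWidth * widthTolerance.
function isShortLine(
  lineWidth: number,
  referenceWidth: number,
  widthTolerance = 0.85,
): boolean {
  return lineWidth < referenceWidth * widthTolerance;
}

const referenceWidth = 415;
const widths = [400, 420, 350, 300];

// Default tolerance: threshold is 415 * 0.85 = 352.75.
const withDefault = widths.map((w) => isShortLine(w, referenceWidth));
// Lower tolerance: threshold is 415 * 0.75 = 311.25, so 350 is no longer short.
const withLower = widths.map((w) => isShortLine(w, referenceWidth, 0.75));

console.log(withDefault); // [ false, false, true, true ]
console.log(withLower);   // [ false, false, false, true ]
```

Lowering the tolerance lowers the threshold, so fewer lines qualify as short.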

Tuning the Tolerance

// widthTolerance: 0.75
// Fewer lines are considered "short"
const result = reconstructParagraphs(input, {
  paragraph: {
    widthTolerance: 0.75
  }
});

// With reference width 400:
// threshold = 400 * 0.75 = 300 (the default 0.85 gives 340)
// Only lines < 300px are short
// Fewer paragraph breaks from short lines

How They Work Together

The two parameters work in coordination:
// Example document:
This is a long line of text that continues.  (width: 420)
This is another long line in same paragraph. (width: 415)
Short line.                                  (width: 300)
                                             (gap: 50px)
This starts a new paragraph with more text.  (width: 410)
And this continues that paragraph.           (width: 405)
                                             (gap: 80px, previous gap: 25px)
This is another paragraph after big gap.     (width: 418)

// With defaults (verticalJumpFactor: 2.0, widthTolerance: 0.85):
// Reference width (p75): ~415
// Threshold width: 415 * 0.85 = 352.75

// Paragraph 1: Lines 1-3
//   Line 3 is short (300 < 352.75) → line 4 starts a new paragraph

// Paragraph 2: Lines 4-5
//   Gap of 80px vs previous 25px
//   80 > 25 * 2.0? Yes → triggers break

// Paragraph 3: Line 6
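
The walkthrough above can be approximated with a toy grouping function that uses only the short-line and vertical-jump signals. It is a sketch, not Kokokor's algorithm: it ignores indent and list-start detection. `LayoutLine` and `groupLines` are hypothetical names, and the 25px gaps not stated in the example are assumed.

```typescript
interface LayoutLine {
  width: number;
  gapBefore: number; // vertical gap to the previous line; 0 for the first line
}

interface ParagraphOptions {
  verticalJumpFactor: number;
  widthTolerance: number;
}

// Returns paragraphs as arrays of 1-based line numbers.
function groupLines(
  lines: LayoutLine[],
  referenceWidth: number,
  options: ParagraphOptions,
): number[][] {
  const threshold = referenceWidth * options.widthTolerance;
  const paragraphs: number[][] = [];

  lines.forEach((line, i) => {
    if (i === 0) {
      paragraphs.push([1]);
      return;
    }
    const previous = lines[i - 1];
    const previousIsShort = previous.width < threshold;
    // The jump check needs two gaps (so i >= 2) and is skipped after a short line.
    const verticalJump =
      !previousIsShort &&
      i >= 2 &&
      line.gapBefore > previous.gapBefore * options.verticalJumpFactor;

    if (previousIsShort || verticalJump) {
      paragraphs.push([i + 1]);
    } else {
      paragraphs[paragraphs.length - 1].push(i + 1);
    }
  });

  return paragraphs;
}

const exampleLines: LayoutLine[] = [
  { width: 420, gapBefore: 0 },
  { width: 415, gapBefore: 25 }, // assumed gap
  { width: 300, gapBefore: 25 }, // assumed gap
  { width: 410, gapBefore: 50 },
  { width: 405, gapBefore: 25 },
  { width: 418, gapBefore: 80 },
];

const grouped = groupLines(exampleLines, 415, {
  verticalJumpFactor: 2.0,
  widthTolerance: 0.85,
});
console.log(JSON.stringify(grouped)); // [[1,2,3],[4,5],[6]]
```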

Tuning for Document Types

Dense Academic Papers

Academic papers often have consistent spacing and few short lines:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.8,  // Slightly more sensitive to spacing than the default
    widthTolerance: 0.90       // Lines under 90% of the reference width count as short
  }
});

Books and Novels

Books have clear paragraph breaks with indentation and spacing:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.0,  // Standard sensitivity
    widthTolerance: 0.85       // Standard threshold
  }
});

Technical Documents

Technical docs may have lists, code blocks, and varied formatting:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.5,  // Less sensitive
    widthTolerance: 0.75       // Only lines under 75% of the reference width count as short
  }
});

Poetry Collections

Poetry is handled separately, but for prose sections:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.5,  // Very sensitive
    widthTolerance: 0.95       // Lines under 95% of the reference width end a paragraph
  }
});
Poetry detection (isPoetic flag) happens before paragraph grouping. Poetic lines are never merged into paragraphs regardless of these settings.

Multi-Column Layouts

const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.5,  // Conservative on spacing
    widthTolerance: 0.70       // Only lines under 70% of the reference width count as short
  }
});

Diagnostic Tips

Too Many Paragraphs?

// Reduce paragraph breaks by:
// 1. Increase verticalJumpFactor (less sensitive to spacing)
// 2. Decrease widthTolerance (fewer lines count as short)

const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.5,  // Was: 2.0
    widthTolerance: 0.80       // Was: 0.85
  }
});

Too Few Paragraphs?

// Increase paragraph breaks by:
// 1. Decrease verticalJumpFactor (more sensitive to spacing)
// 2. Increase widthTolerance (more lines count as short)

const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.5,  // Was: 2.0
    widthTolerance: 0.90       // Was: 0.85
  }
});

Debug Paragraph Detection

// Use low-level API for detailed control
import { mapObservationsToTextLines, mapTextLinesToParagraphs } from 'kokokor';

const lines = mapObservationsToTextLines(observations, page, {
  log: console.log // Enable debug logging
});

console.log('Line widths:', lines.map(l => l.bbox.width));
console.log('Line gaps:', lines.map((l, i) => 
  i > 0 ? l.bbox.y - lines[i-1].bbox.y : 0
));

const paragraphs = mapTextLinesToParagraphs(lines, {
  verticalJumpFactor: 2.0,
  widthTolerance: 0.85
});

console.log(`${lines.length} lines → ${paragraphs.length} paragraphs`);
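
When tuning, it can help to see how close each line sits to the two thresholds. The helper below is a sketch that assumes each line exposes `bbox.y` and `bbox.width` as in the snippet above, and that the caller supplies the reference width. `BoxedLine`, `LineReport`, and `describeLines` are hypothetical names.

```typescript
interface BoxedLine {
  bbox: { y: number; width: number };
}

interface LineReport {
  index: number;
  gapRatio: number | null; // currentGap / previousGap; compare with verticalJumpFactor
  widthRatio: number;      // width / referenceWidth; compare with widthTolerance
}

function describeLines(lines: BoxedLine[], referenceWidth: number): LineReport[] {
  return lines.map((line, index) => {
    const gap = index >= 1 ? line.bbox.y - lines[index - 1].bbox.y : null;
    const previousGap =
      index >= 2 ? lines[index - 1].bbox.y - lines[index - 2].bbox.y : null;
    // A ratio is only defined once two gaps exist and the previous one is positive.
    const gapRatio =
      gap !== null && previousGap !== null && previousGap > 0
        ? gap / previousGap
        : null;
    return { index, gapRatio, widthRatio: line.bbox.width / referenceWidth };
  });
}

const report = describeLines(
  [
    { bbox: { y: 100, width: 400 } },
    { bbox: { y: 125, width: 410 } },
    { bbox: { y: 185, width: 300 } },
  ],
  400,
);
console.table(report);
```

Under the rules described on this page, a gapRatio above verticalJumpFactor marks a candidate vertical break before that line, and a widthRatio below widthTolerance marks a short line whose successor would start a new paragraph.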

Advanced: Reference Width Calculation

Kokokor uses the 75th percentile (p75) of line widths as the reference width:
// Internal algorithm (simplified):
function computeReferenceWidth(lines) {
  const widths = lines.map(l => l.bbox.width).sort((a, b) => a - b);
  
  // Use p75 if we have enough lines, otherwise max width
  if (widths.length >= 4) {
    const p75Index = Math.floor((widths.length - 1) * 0.75);
    return widths[p75Index];
  }
  
  return widths[widths.length - 1];
}
Because the reference is a percentile rather than the maximum, a few unusually wide lines do not inflate it, which keeps the short-line threshold stable across varied layouts.
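
A typed variant of the simplified function, run on a few inputs, shows the p75 selection, the fallback for fewer than four lines, and the effect of an outlier. It is a sketch based only on the simplified algorithm above. Taking plain widths instead of line objects, and returning `undefined` for empty input, are assumptions made for illustration.

```typescript
// Takes widths directly; returns undefined when there are no lines.
function computeReferenceWidth(widths: number[]): number | undefined {
  const sorted = [...widths].sort((a, b) => a - b);

  // Use p75 if there are enough lines, otherwise the maximum width.
  if (sorted.length >= 4) {
    return sorted[Math.floor((sorted.length - 1) * 0.75)];
  }
  return sorted[sorted.length - 1];
}

const sixLines = computeReferenceWidth([420, 415, 300, 410, 405, 418]);
const threeLines = computeReferenceWidth([300, 420, 410]);
const withOutlier = computeReferenceWidth([400, 405, 410, 415, 900]);

console.log(sixLines, threeLines, withOutlier); // 415 420 415
```

The floor-based index is one of several percentile conventions, and others can select a neighbouring value from the sorted list.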
