Overview

Kokokor provides extensive configuration options to handle different document types, languages, and layouts. This guide covers all configurable parameters and their effects.

Configuration Structure

The reconstructParagraphs function accepts an optional second parameter for configuration:
const result = reconstructParagraphs(input, {
  line: { /* line detection options */ },
  paragraph: { /* paragraph grouping options */ },
  format: { /* text formatting options */ }
});

Line Detection Options

Control how OCR observations are grouped into text lines:

Pixel Tolerance

Additional vertical tolerance for grouping observations into the same line:
const result = reconstructParagraphs(input, {
  line: {
    pixelTolerance: 5 // pixels at 72 DPI (default)
  }
});
pixelTolerance
number
default:5
Vertical tolerance in pixels at 72 DPI. This value is automatically scaled based on the document’s actual DPI.
  • Higher values: More permissive line grouping (more text on same line)
  • Lower values: Stricter line grouping (more separate lines)
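As a rough illustration of the DPI scaling described above, the sketch below assumes the tolerance scales linearly from the 72 DPI baseline. The helper name `scalePixelTolerance` is hypothetical and is not part of Kokokor's API; the library's actual scaling may differ.

```typescript
// Hypothetical helper for illustration; not a Kokokor export.
// Assumption: the tolerance scales linearly with DPI from a 72 DPI baseline.
const BASE_DPI = 72;

function scalePixelTolerance(pixelTolerance: number, dpiY: number): number {
  return pixelTolerance * (dpiY / BASE_DPI);
}

// The default of 5 px stays 5 px at 72 DPI and grows to roughly 20.8 px at 300 DPI.
const toleranceAt72 = scalePixelTolerance(5, 72);
const toleranceAt300 = scalePixelTolerance(5, 300);
```

Under this assumption, the same setting behaves consistently across low-resolution and high-resolution scans, which is why the option is expressed at a fixed reference DPI.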

Line Height Factor

Fixed line height factor for grouping observations:
const result = reconstructParagraphs(input, {
  line: {
    lineHeightFactor: 0.3 // optional fixed factor
  }
});
lineHeightFactor
number
default:"adaptive"
If not provided, the system computes an adaptive factor based on document analysis. Typical values:
  • 0.15 - Very tight line grouping (small gaps)
  • 0.3 - Standard line height
  • 0.5 - Generous spacing tolerance
When lineHeightFactor is not specified, Kokokor analyzes the document’s spacing patterns using an internal adaptive algorithm to determine the optimal value automatically.

RTL Text Support

Enable right-to-left text processing:
const result = reconstructParagraphs(input, {
  line: {
    isRTL: true // default is true
  }
});
isRTL
boolean
default:true
When enabled, coordinates are flipped for proper RTL text alignment. The default is true as Kokokor was originally designed for Arabic text processing.
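One plausible reading of "coordinates are flipped" is a horizontal mirror around the page width, so that the right edge of the page becomes the origin. The sketch below shows that transform under this assumption; the `Box` type and `flipHorizontally` function are hypothetical names, not Kokokor exports.

```typescript
// Hypothetical type and helper for illustration only.
// Assumption: "flipping" means mirroring x-coordinates across the page width.
interface Box {
  x: number;
  width: number;
}

function flipHorizontally(box: Box, pageWidth: number): Box {
  // The box's old right edge becomes its distance from the new left origin.
  return { ...box, x: pageWidth - (box.x + box.width) };
}

const original: Box = { x: 100, width: 300 };
const flipped = flipHorizontally(original, 1000);
const restored = flipHorizontally(flipped, 1000);
```

Applying the mirror twice returns the original box, so under this assumption the transform is safe to reverse when mapping results back to page coordinates.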

Centering Detection

Control how centered text (titles, headings, poetry) is identified:
const result = reconstructParagraphs(input, {
  line: {
    centerToleranceRatio: 0.05, // 5% of page width
    minMarginRatio: 0.2          // 20% minimum margin
  }
});
centerToleranceRatio
number
Tolerance for center point alignment as a ratio of page width.
  • 0.02 - Stricter centering (within 2%)
  • 0.05 - Standard centering (within 5%)
  • 0.1 - Looser centering (within 10%)
minMarginRatio
number
Minimum margin required on each side as a ratio of page width.
  • 0.1 - At least 10% whitespace on each side
  • 0.2 - At least 20% whitespace (default)
  • 0.3 - At least 30% whitespace (very strict)
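The two ratios can be read as a pair of checks: the line's midpoint must sit close to the page's midpoint, and both side margins must be wide enough. The following is a minimal sketch of that interpretation, not Kokokor's internal code; `HorizontalSpan` and `looksCentered` are hypothetical names, and the default values simply mirror the ones shown above.

```typescript
// Hypothetical type and helper for illustration only.
interface HorizontalSpan {
  x: number;     // left edge in pixels
  width: number; // width in pixels
}

function looksCentered(
  span: HorizontalSpan,
  pageWidth: number,
  centerToleranceRatio = 0.05,
  minMarginRatio = 0.2
): boolean {
  const leftMargin = span.x;
  const rightMargin = pageWidth - (span.x + span.width);
  const centerOffset = Math.abs(span.x + span.width / 2 - pageWidth / 2);

  // Midpoint must fall within the tolerance band around the page center.
  const centeredEnough = centerOffset <= pageWidth * centerToleranceRatio;
  // Both margins must meet the minimum whitespace requirement.
  const marginsWideEnough =
    leftMargin >= pageWidth * minMarginRatio &&
    rightMargin >= pageWidth * minMarginRatio;

  return centeredEnough && marginsWideEnough;
}
```

For a 1000 px page, a 400 px line starting at x = 300 passes both checks, while a line spanning 800 px fails the margin check even though its midpoint is perfectly centered.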

Poetry Detection Options

Fine-tune the poetry detection algorithm:
const result = reconstructParagraphs(input, {
  line: {
    poetryDetectionOptions: {
      // Centering for poetry
      centerToleranceRatio: 0.05,
      minMarginRatio: 0.1,
      
      // Hemistich detection
      maxVerticalGapRatio: 2.0,
      pairWidthSimilarityRatio: 0.4,
      pairWordCountSimilarityRatio: 0.5,
      
      // Wide line detection
      minWidthRatioForMerged: 0.6,
      wordDensityComparisonRatio: 0.95,
      
      // General
      minWordCount: 2
    },
    poetryPairDelimiter: ' ... ' // delimiter for hemistichs
  }
});

Hemistich Pair Detection

maxVerticalGapRatio
number
Maximum vertical gap between two observations to be considered a poetry pair (hemistichs). Measured as a ratio of average line height:
  • 1.5 - Closer spacing required
  • 2.0 - Standard spacing
  • 3.0 - Wider spacing allowed
pairWidthSimilarityRatio
number
How similar in width two hemistichs must be. The check: |width1 - width2| / average < ratio
  • 0.2 - Very similar widths required
  • 0.4 - Moderate similarity
  • 0.6 - More variation allowed
pairWordCountSimilarityRatio
number
How similar in word count two hemistichs must be. The check: |count1 - count2| / max < ratio
  • 0.3 - Very similar counts
  • 0.5 - Moderate similarity
  • 0.7 - More variation allowed
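The three pair checks above can be combined into a single predicate. This sketch follows the formulas as stated, with two assumptions that are not confirmed by this page: "average" is taken to be the mean of the two widths, and the vertical gap is measured between the top edges of the two observations. The type and function names are hypothetical.

```typescript
// Hypothetical types and helper for illustration only.
interface HemistichCandidate {
  y: number;         // top edge in pixels (assumed measurement point)
  width: number;     // width in pixels
  wordCount: number; // number of words in the observation
}

interface PairOptions {
  maxVerticalGapRatio: number;
  pairWidthSimilarityRatio: number;
  pairWordCountSimilarityRatio: number;
}

function looksLikeHemistichPair(
  a: HemistichCandidate,
  b: HemistichCandidate,
  avgLineHeight: number,
  opts: PairOptions
): boolean {
  const verticalGap = Math.abs(a.y - b.y);
  const averageWidth = (a.width + b.width) / 2;
  const maxCount = Math.max(a.wordCount, b.wordCount, 1); // guard against division by zero

  const closeEnough = verticalGap <= avgLineHeight * opts.maxVerticalGapRatio;
  const similarWidth =
    averageWidth > 0 &&
    Math.abs(a.width - b.width) / averageWidth < opts.pairWidthSimilarityRatio;
  const similarCount =
    Math.abs(a.wordCount - b.wordCount) / maxCount < opts.pairWordCountSimilarityRatio;

  return closeEnough && similarWidth && similarCount;
}

// Values taken from the configuration example above.
const pairOptions: PairOptions = {
  maxVerticalGapRatio: 2.0,
  pairWidthSimilarityRatio: 0.4,
  pairWordCountSimilarityRatio: 0.5,
};
```

With these options, two observations of widths 400 and 360 with 4 and 5 words pass (width difference about 0.105, word count difference 0.2), while pairing a 400 px observation with a 150 px one fails the width check.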

Wide Poetic Line Detection

minWidthRatioForMerged
number
Minimum width a single line must have to be analyzed for poetry. As a ratio of page width:
  • 0.4 - Shorter lines included
  • 0.6 - Standard threshold
  • 0.8 - Only very wide lines
wordDensityComparisonRatio
number
Word density threshold for identifying poetry. Poetry typically has lower word density than prose. A line is poetic if its density ≤ ratio * avgProseDensity:
  • 0.7 - Very sparse text required
  • 0.9 - Moderate spacing required
  • 0.95 - Close to prose density allowed
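The density comparison can be sketched as follows. How Kokokor measures density is not specified here, so this example assumes words per pixel of line width; `wordDensity` and `isSparseEnoughForPoetry` are hypothetical helpers.

```typescript
// Hypothetical helpers for illustration only.
// Assumption: density is measured as words per pixel of line width.
function wordDensity(wordCount: number, width: number): number {
  return width > 0 ? wordCount / width : 0;
}

function isSparseEnoughForPoetry(
  lineDensity: number,
  avgProseDensity: number,
  wordDensityComparisonRatio = 0.95
): boolean {
  // Mirrors the documented check: density <= ratio * avgProseDensity.
  return lineDensity <= wordDensityComparisonRatio * avgProseDensity;
}

const avgProseDensity = wordDensity(12, 1000); // typical prose line
const sparseLine = wordDensity(8, 1000);       // wide line with fewer words
const denseLine = wordDensity(12, 1000);       // same density as prose
```

Here the 8-word line passes at the 0.95 threshold and the 12-word line does not. Lowering the ratio toward 0.7 demands a larger gap between the candidate line and ordinary prose.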

General Poetry Options

minWordCount
number
default:2
Minimum words required for a line to be considered poetry. Filters out noise like page numbers.
poetryPairDelimiter
string
default:" "
Delimiter used when merging detected poetry pairs (hemistichs). Examples:
  • " " - Simple space
  • " ... " - Visual separator: صدر ... عجز
  • " – " - En dash separator

Layout Elements

Provide structural hints for better text classification:
const result = reconstructParagraphs({
  observations: ocrObservations,
  page: pageContext,
  layout: {
    horizontalLines: horizontalLineBboxes,
    rectangles: rectangleBboxes
  }
});
horizontalLines
BoundingBox[]
default:[]
Array of horizontal line elements detected in the document. Used to identify footnote boundaries: text appearing below the last horizontal line is classified as footnotes.
rectangles
BoundingBox[]
default:[]
Array of rectangle elements detected in the document. Text within rectangles is classified as headings.
See the Layout Elements guide for detailed examples of working with horizontal lines and rectangles.
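The two classification rules above can be sketched as a small function. The `BoundingBox` field names used here (`x`, `y`, `width`, `height`) and a y-axis that grows downward are assumptions, since this page does not spell out the type, and `classifyByLayout` is a hypothetical helper rather than a Kokokor export.

```typescript
// Assumed shape; check the library's type definitions for the real fields.
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

type LayoutRole = 'footnote' | 'heading' | 'body';

// Hypothetical helper for illustration only.
function classifyByLayout(
  text: BoundingBox,
  horizontalLines: BoundingBox[],
  rectangles: BoundingBox[]
): LayoutRole {
  // "Last" horizontal line = the lowest one on the page (largest y).
  const lastLineY = horizontalLines.reduce(
    (lowest, line) => Math.max(lowest, line.y),
    -Infinity
  );
  if (horizontalLines.length > 0 && text.y > lastLineY) {
    return 'footnote';
  }

  // Text fully contained in any rectangle is treated as a heading.
  const insideRectangle = rectangles.some(
    (rect) =>
      text.x >= rect.x &&
      text.y >= rect.y &&
      text.x + text.width <= rect.x + rect.width &&
      text.y + text.height <= rect.y + rect.height
  );
  return insideRectangle ? 'heading' : 'body';
}
```

With both arrays left at their default of `[]`, every observation falls through to `'body'`, which matches the idea that layout elements are optional hints.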

Paragraph Grouping Options

Control how text lines are grouped into paragraphs:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.0,
    widthTolerance: 0.85
  }
});
verticalJumpFactor
number
Factor for detecting paragraph breaks based on vertical spacing. A new paragraph starts when gap > previousGap * verticalJumpFactor:
  • 1.5 - More sensitive to spacing changes
  • 2.0 - Standard sensitivity
  • 3.0 - Less sensitive (fewer paragraph breaks)
widthTolerance
number
Threshold for identifying “short” lines that indicate paragraph endings. As a ratio of reference width:
  • 0.75 - Mark more lines as short
  • 0.85 - Standard threshold
  • 0.95 - Only very short lines marked
See the Paragraph Options guide for detailed explanations of how these options affect paragraph breaks.
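To make the vertical spacing rule concrete, the sketch below applies `gap > previousGap * verticalJumpFactor` to a list of line positions. It is a simplified illustration that ignores `widthTolerance` and any other signals Kokokor may combine with it; `findParagraphBreaks` is a hypothetical name.

```typescript
// Hypothetical helper for illustration only.
// Returns the indexes of lines that start a new paragraph under the gap rule.
function findParagraphBreaks(lineTops: number[], verticalJumpFactor = 2.0): number[] {
  const breaks: number[] = [];
  let previousTop: number | undefined;
  let previousGap: number | undefined;

  lineTops.forEach((top, index) => {
    if (previousTop !== undefined) {
      const gap = top - previousTop;
      // A break needs a previous gap to compare against.
      if (previousGap !== undefined && gap > previousGap * verticalJumpFactor) {
        breaks.push(index);
      }
      previousGap = gap;
    }
    previousTop = top;
  });

  return breaks;
}

// Gaps are 20, 20, 60, 20: only the 60 px gap exceeds 2.0 * 20.
const breaksAtDefault = findParagraphBreaks([0, 20, 40, 100, 120]);
// With a factor of 3.0 the 60 px gap no longer exceeds 3.0 * 20.
const breaksAtThree = findParagraphBreaks([0, 20, 40, 100, 120], 3.0);
```

Because the comparison is relative to the previous gap, the same factor works for documents with different line spacing.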

Text Formatting Options

Control the final text output format:
const result = reconstructParagraphs(input, {
  format: {
    footerSymbol: '---' // inserted before first footnote
  }
});
footerSymbol
string
Optional symbol to insert before the first footnote in the formatted text output. Examples:
  • "---" - Horizontal line separator
  • "\n***\n" - Decorative separator
  • "Footnotes:" - Text label
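The effect of the footer symbol can be pictured with a small formatter. This sketch assumes paragraphs are joined with newlines and that the symbol is emitted only when at least one footnote exists; both are assumptions, and `formatWithFooter` is a hypothetical helper rather than Kokokor's formatter.

```typescript
// Hypothetical helper for illustration only.
function formatWithFooter(
  bodyParagraphs: string[],
  footnotes: string[],
  footerSymbol?: string
): string {
  const parts = [...bodyParagraphs];
  if (footnotes.length > 0) {
    // Insert the symbol once, immediately before the first footnote.
    if (footerSymbol !== undefined) {
      parts.push(footerSymbol);
    }
    parts.push(...footnotes);
  }
  return parts.join('\n');
}

const formatted = formatWithFooter(['First paragraph', 'Second paragraph'], ['(1) A note'], '---');
```

A symbol that already contains newlines, such as `'\n---\n'`, would add blank lines around the separator under this joining scheme.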

Debug Logging

Enable detailed logging for troubleshooting:
const result = reconstructParagraphs(input, {
  line: {
    log: (message, ...args) => {
      console.log(`[Kokokor] ${message}`, ...args);
    }
  }
});
log
function
Optional logging function for debugging. Receives detailed information about processing decisions and intermediate steps.
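When console output is too noisy, one option is to buffer messages and inspect them afterwards. The sketch below builds a logger with the same `(message, ...args)` shape shown above. The `LogFn` and `LogEntry` types and the `createBufferedLogger` factory are hypothetical, and typing `message` as a string is an assumption.

```typescript
// Hypothetical types and factory for illustration only.
type LogFn = (message: string, ...args: unknown[]) => void;

interface LogEntry {
  message: string;
  args: unknown[];
}

function createBufferedLogger(): { log: LogFn; entries: LogEntry[] } {
  const entries: LogEntry[] = [];
  const log: LogFn = (message, ...args) => {
    // Store each call instead of printing it immediately.
    entries.push({ message, args });
  };
  return { log, entries };
}

const logger = createBufferedLogger();
logger.log('grouped observations into lines', { lineCount: 12 });
```

The resulting `logger.log` could then be passed as the `log` option in the `line` configuration, and `logger.entries` filtered or written to a file after processing.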

Complete Configuration Example

Here’s a complete example with all major options configured:
import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs(
  {
    observations: ocrObservations,
    page: {
      width: 2480,
      height: 3508,
      dpiX: 300,
      dpiY: 300
    },
    layout: {
      horizontalLines: detectedHorizontalLines,
      rectangles: detectedRectangles
    }
  },
  {
    line: {
      pixelTolerance: 5,
      lineHeightFactor: 0.3,
      isRTL: true,
      centerToleranceRatio: 0.05,
      minMarginRatio: 0.2,
      poetryDetectionOptions: {
        centerToleranceRatio: 0.05,
        minMarginRatio: 0.1,
        maxVerticalGapRatio: 2.0,
        minWidthRatioForMerged: 0.6,
        minWordCount: 2,
        pairWidthSimilarityRatio: 0.4,
        pairWordCountSimilarityRatio: 0.5,
        wordDensityComparisonRatio: 0.95
      },
      poetryPairDelimiter: ' ... '
    },
    paragraph: {
      verticalJumpFactor: 2.0,
      widthTolerance: 0.85
    },
    format: {
      footerSymbol: '\n---\n'
    }
  }
);

Next Steps

Paragraph Options

Deep dive into paragraph grouping behavior

Layout Elements

Work with horizontal lines and rectangles
