What is Kokokor?

Kokokor is a lightweight TypeScript library that reconstructs paragraphs from OCR output. It turns unstructured OCR text into well-formatted, readable paragraphs while preserving the semantic structure of the original document. It analyzes text layout and spacing to decide where lines and paragraphs begin and end, which makes it suitable for Arabic manuscripts, poetry collections, and multi-column documents.

Quick Start

Get up and running with Kokokor in minutes

Installation

Install Kokokor in your project

API Reference

Explore the complete API documentation

Examples

View real-world usage examples

Key Features

Intelligent Text Analysis

Smart Line Grouping

Groups text lines based on vertical proximity and adaptive spacing analysis with DPI-aware thresholds
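
To make the idea of vertical-proximity grouping concrete, here is a minimal sketch. It is not Kokokor's implementation: the `Box` type, the `groupIntoLines` function, and the 0.05 inch tolerance are all assumptions chosen for illustration. The only point it shows is that a threshold expressed in inches and multiplied by the DPI behaves the same at any scan resolution.

```typescript
// Hypothetical shape for illustration; not Kokokor's actual observation type.
type Box = { text: string; x: number; y: number; width: number; height: number };

// Assumed rule: a box joins the current line when its vertical center is within
// a tolerance of the previous box's center. The tolerance is given in inches
// and converted to pixels using the page DPI.
function groupIntoLines(boxes: Box[], dpiY: number, toleranceInches = 0.05): Box[][] {
  const tolerance = toleranceInches * dpiY;
  const centerOf = (b: Box) => b.y + b.height / 2;
  const sorted = [...boxes].sort((a, b) => centerOf(a) - centerOf(b));

  const lines: Box[][] = [];
  let lastCenter = Number.NEGATIVE_INFINITY;

  for (const box of sorted) {
    const center = centerOf(box);
    const current = lines[lines.length - 1];
    if (current && center - lastCenter <= tolerance) {
      current.push(box);
    } else {
      lines.push([box]);
    }
    lastCenter = center;
  }

  // Order the boxes left to right within each line.
  return lines.map((line) => [...line].sort((a, b) => a.x - b.x));
}

const lines = groupIntoLines(
  [
    { text: 'world', x: 200, y: 102, width: 80, height: 30 },
    { text: 'Hello', x: 100, y: 100, width: 80, height: 30 },
    { text: 'Next', x: 100, y: 160, width: 80, height: 30 },
  ],
  300,
);

console.log(lines.map((line) => line.map((b) => b.text).join(' ')));
// [ 'Hello world', 'Next' ]
```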

Paragraph Reconstruction

Advanced paragraph detection using vertical gap and line width analysis
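
The combination of vertical gap and line width can be sketched as follows. This is an illustrative heuristic under stated assumptions, not the library's algorithm: the `Line` type, `splitIntoParagraphs`, `gapFactor`, and `shortLineRatio` are hypothetical names. A new paragraph starts when the gap above a line is large relative to the line height, or when the previous line stopped well short of the widest line.

```typescript
// Hypothetical shape for illustration; lines are assumed sorted top to bottom.
type Line = { text: string; y: number; height: number; width: number };

function splitIntoParagraphs(lines: Line[], gapFactor = 1.0, shortLineRatio = 0.8): string[] {
  if (lines.length === 0) return [];

  const maxWidth = Math.max(...lines.map((l) => l.width));
  const paragraphs: string[][] = [[lines[0].text]];

  for (let i = 1; i < lines.length; i++) {
    const prev = lines[i - 1];
    const curr = lines[i];

    // Signal 1: the white space between the two lines is unusually tall.
    const gap = curr.y - (prev.y + prev.height);
    const largeGap = gap > gapFactor * prev.height;

    // Signal 2: the previous line ended early, which suggests a paragraph end.
    const prevEndedShort = prev.width < shortLineRatio * maxWidth;

    if (largeGap || prevEndedShort) {
      paragraphs.push([curr.text]);
    } else {
      paragraphs[paragraphs.length - 1].push(curr.text);
    }
  }

  return paragraphs.map((p) => p.join(' '));
}

const paragraphs = splitIntoParagraphs([
  { text: 'First paragraph starts', y: 100, height: 30, width: 1000 },
  { text: 'and ends here.', y: 140, height: 30, width: 400 },
  { text: 'Second paragraph starts', y: 180, height: 30, width: 1000 },
  { text: 'and continues.', y: 220, height: 30, width: 1000 },
  { text: 'Third paragraph.', y: 320, height: 30, width: 1000 },
]);

console.log(paragraphs);
```

In this sample the second paragraph is triggered by the short line before it, and the third by the 70 px gap above it.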

Poetry Detection

Automatically identifies and preserves poetic content using centering, word density, and hemistich analysis
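
A centering test is one plausible ingredient of such a check. The sketch below borrows the option names `centerToleranceRatio`, `minMarginRatio`, and `minWordCount` from the Poetry Preservation example on this page, but how they are interpreted here (as fractions of the page width, plus a word-count floor) is an assumption, and `looksCentered` is a hypothetical helper rather than a Kokokor function. Hemistich analysis is not covered.

```typescript
// Hypothetical shapes for illustration only.
type LineBox = { text: string; x: number; width: number };

type CenteringOptions = {
  centerToleranceRatio: number; // assumed: max offset from page center, as a fraction of page width
  minMarginRatio: number; // assumed: min margin on each side, as a fraction of page width
  minWordCount: number; // assumed: min number of words on the line
};

function looksCentered(line: LineBox, pageWidth: number, opts: CenteringOptions): boolean {
  const leftMargin = line.x;
  const rightMargin = pageWidth - (line.x + line.width);
  const centerOffset = Math.abs(line.x + line.width / 2 - pageWidth / 2);
  const wordCount = line.text.trim().split(/\s+/).filter(Boolean).length;

  return (
    centerOffset <= opts.centerToleranceRatio * pageWidth &&
    leftMargin >= opts.minMarginRatio * pageWidth &&
    rightMargin >= opts.minMarginRatio * pageWidth &&
    wordCount >= opts.minWordCount
  );
}

const centeringOptions: CenteringOptions = {
  centerToleranceRatio: 0.05,
  minMarginRatio: 0.1,
  minWordCount: 2,
};

// A short line with equal margins on a 2480 px wide page.
const verse: LineBox = { text: 'first hemistich second hemistich', x: 600, width: 1280 };
// A prose line that nearly fills the page width.
const bodyLine: LineBox = { text: 'A full-width line of ordinary prose text', x: 200, width: 2080 };

console.log(looksCentered(verse, 2480, centeringOptions)); // true
console.log(looksCentered(bodyLine, 2480, centeringOptions)); // false
```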

Layout Recognition

Recognizes document structure including headings, footnotes, and multi-column layouts

Multilingual Support

  • Right-to-Left (RTL) Text: Full support for Arabic, Hebrew, and other RTL languages with coordinate flipping and normalization
  • Coordinate Normalization: Ensures consistent results regardless of source document resolution
  • DPI-Aware Processing: Adapts thresholds based on document DPI for accurate spacing analysis
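
The geometry behind coordinate flipping and resolution normalization is simple enough to sketch. The helpers below are hypothetical and are not Kokokor's `flipAndAlignObservations`; they only assume axis-aligned boxes measured in pixels from the top-left corner of the page. Flipping mirrors a box around the vertical center of the page so that RTL text can be processed with left-to-right logic, and normalization rescales coordinates to a common reference DPI.

```typescript
// Hypothetical shape for illustration; not Kokokor's actual observation type.
type Rect = { x: number; y: number; width: number; height: number };

// Mirror a box horizontally: the distance to the right edge becomes the new x.
function flipHorizontally(box: Rect, pageWidth: number): Rect {
  return { ...box, x: pageWidth - (box.x + box.width) };
}

// Rescale a box from the source DPI to a reference DPI (72 is an arbitrary default).
function normalizeToDpi(box: Rect, sourceDpi: number, targetDpi = 72): Rect {
  const scale = targetDpi / sourceDpi;
  return {
    x: box.x * scale,
    y: box.y * scale,
    width: box.width * scale,
    height: box.height * scale,
  };
}

const original: Rect = { x: 2000, y: 300, width: 300, height: 60 };
const mirrored = flipHorizontally(original, 2480);
const rescaled = normalizeToDpi(mirrored, 300, 150);

console.log(mirrored); // { x: 180, y: 300, width: 300, height: 60 }
console.log(rescaled); // { x: 90, y: 150, width: 150, height: 30 }
```

Flipping twice with the same page width returns the original box, which is a convenient sanity check when wiring this kind of step into a pipeline.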

Advanced Capabilities

  • Surya OCR Integration: Built-in support for converting Surya OCR format to Kokokor observations
  • Noise Filtering: Removes OCR artifacts to improve text quality
  • Customizable Parameters: Tune detection thresholds for different document types and languages
  • Comprehensive Metadata: Text blocks include centering, heading, footnote, and poetry flags
  • Line Spacing Analytics: Adaptive line height factors based on document characteristics

Real-World Use Cases

OCR Post-Processing

Transform raw OCR output from tools like Tesseract, Google Vision API, or Surya into clean, readable text with proper paragraph breaks:
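import { reconstructParagraphs } from 'kokokor';

// ocrOutput holds the observations produced by your OCR engine
const result = reconstructParagraphs({
  observations: ocrOutput,
  // 2480 x 3508 px corresponds to an A4 page scanned at 300 DPI
  page: { dpiX: 300, dpiY: 300, width: 2480, height: 3508 }
});

console.log(result.text); // Properly formatted paragraphs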

Arabic Text Processing

Handle Arabic manuscripts and documents with RTL support and poetry detection:
import { flipAndAlignObservations, reconstructParagraphs } from 'kokokor';

// Flip coordinates for RTL text
const flipped = flipAndAlignObservations(
  observations,
  pageWidth,
  dpiX
);

const result = reconstructParagraphs({
  observations: flipped,
  page: { dpiX, dpiY, width: pageWidth, height: pageHeight }
});

Poetry Preservation

Automatically detect and preserve poetic structures in mixed-content documents:
import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs({
  observations,
  page: pageContext,
  options: {
    poetryDetectionOptions: {
      centerToleranceRatio: 0.05,
      minMarginRatio: 0.1,
      minWordCount: 2
    },
    poetryPairDelimiter: ' ... '
  }
});

// Poetry lines are preserved with metadata
result.textBlocks.forEach(block => {
  if (block.isPoetic) {
    console.log('Poetry:', block.text);
  }
});

Document Structure Analysis

Extract headings, footnotes, and body content from complex documents:
import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs({
  observations,
  page: pageContext,
  options: {
    rectangles: headingBoundingBoxes,
    horizontalLines: separatorLines
  }
});

// Access structured content
result.textBlocks.forEach(block => {
  if (block.isHeading) console.log('Heading:', block.text);
  if (block.isFootnote) console.log('Footnote:', block.text);
});

Why Kokokor?

Built for Production: Kokokor is designed to handle real-world OCR challenges including variable spacing, noise, multi-column layouts, and mixed content types.
  • Zero Configuration: Works out of the box with sensible defaults
  • Highly Customizable: Fine-tune every aspect of paragraph detection
  • Type-Safe: Written in TypeScript with comprehensive type definitions
  • Lightweight: Minimal dependencies, tree-shakeable ESM modules
  • Well-Tested: Comprehensive test suite with snapshot testing
  • Production Ready: Used in real-world OCR processing pipelines

Next Steps

Install Kokokor

Add Kokokor to your project

Quick Start Guide

Build your first OCR reconstruction pipeline
