Overview

Kokokor can process documents with complex layouts when you supply layout elements, such as rectangles and horizontal lines, alongside the OCR observations. These elements identify structural components such as headings and footnotes.

Layout Elements

Rectangles

Used to identify headings and boxed content. Text within rectangles is marked with isHeading: true.

Horizontal Lines

Used to detect footnotes. Text below the last horizontal line (outside rectangles) is marked with isFootnote: true.
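
The two rules above can be pictured as a small classifier. This is a minimal sketch and not Kokokor's internal code: the names `contains` and `classify` are hypothetical, containment is checked without any tolerance, and "last horizontal line" is assumed to mean the line with the largest y value (y growing downward).

```typescript
// Hypothetical types and helpers for illustration; not Kokokor's internals.
type BoundingBox = { x: number; y: number; width: number; height: number };
type Role = 'heading' | 'footnote' | 'body';

// True when `inner` lies entirely within `outer` (no tolerance applied).
const contains = (outer: BoundingBox, inner: BoundingBox): boolean =>
  inner.x >= outer.x &&
  inner.y >= outer.y &&
  inner.x + inner.width <= outer.x + outer.width &&
  inner.y + inner.height <= outer.y + outer.height;

function classify(
  bbox: BoundingBox,
  rectangles: BoundingBox[],
  horizontalLines: BoundingBox[],
): Role {
  // Rule 1: text inside any rectangle is a heading.
  if (rectangles.some((rect) => contains(rect, bbox))) return 'heading';
  if (horizontalLines.length === 0) return 'body';
  // Rule 2 (assumption): the "last" line is the one with the largest y.
  const lastLineY = Math.max(...horizontalLines.map((line) => line.y));
  return bbox.y > lastLineY ? 'footnote' : 'body';
}

const rectangles: BoundingBox[] = [{ x: 95, y: 95, width: 610, height: 40 }];
const horizontalLines: BoundingBox[] = [{ x: 100, y: 1050, width: 600, height: 2 }];

const titleRole = classify({ x: 100, y: 100, width: 600, height: 30 }, rectangles, horizontalLines);
const bodyRole = classify({ x: 100, y: 200, width: 340, height: 18 }, rectangles, horizontalLines);
const noteRole = classify({ x: 100, y: 1100, width: 400, height: 14 }, rectangles, horizontalLines);

console.log(titleRole, bodyRole, noteRole); // heading body footnote
```

Checking rectangles first mirrors the "outside rectangles" condition in the footnote rule: boxed text low on the page stays a heading instead of becoming a footnote.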

Basic Layout Example

import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs(
  {
    observations: [
      // Chapter title (inside rectangle)
      { text: 'Chapter 5: Advanced Topics', bbox: { x: 100, y: 100, width: 600, height: 30 } },
      
      // Main body text
      { text: 'This chapter covers', bbox: { x: 100, y: 200, width: 340, height: 18 } },
      { text: 'advanced topics in', bbox: { x: 100, y: 223, width: 330, height: 18 } },
      { text: 'document processing.', bbox: { x: 100, y: 246, width: 360, height: 18 } },
      
      // Footnote (below horizontal line)
      { text: '1. See Smith (2020) for details.', bbox: { x: 100, y: 1100, width: 400, height: 14 } },
    ],
    page: {
      width: 800,
      height: 1200,
      dpiX: 72,
      dpiY: 72,
    },
    layout: {
      rectangles: [
        // Rectangle around chapter title
        { x: 95, y: 95, width: 610, height: 40 },
      ],
      horizontalLines: [
        // Line separating body from footnotes
        { x: 100, y: 1050, width: 600, height: 2 },
      ],
    },
  },
  {
    format: {
      footerSymbol: '___',  // Insert separator before footnotes
    },
  }
);

console.log(result.text);
// Output:
// Chapter 5: Advanced Topics
//
// This chapter covers advanced topics in document processing.
//
// ___
// 1. See Smith (2020) for details.

Multi-Column Document

Kokokor’s current paragraph detection works best with single-column layouts. For multi-column documents, pre-process observations to group them by column, or process each column separately. The example below passes a two-column page through as-is and only inspects which lines were detected as headings.

import { reconstructParagraphs, filterHorizontalLinesOutsideRectangles } from 'kokokor';

const document = {
  observations: [
    // Header in rectangle
    { text: 'Document Title', bbox: { x: 50, y: 50, width: 700, height: 35 } },
    
    // Left column
    { text: 'This is the first column', bbox: { x: 50, y: 120, width: 320, height: 18 } },
    { text: 'of text in a two-column', bbox: { x: 50, y: 143, width: 310, height: 18 } },
    { text: 'layout.', bbox: { x: 50, y: 166, width: 120, height: 18 } },
    
    // Right column (note: same y-coordinates as left column)
    { text: 'This is the second column', bbox: { x: 430, y: 120, width: 330, height: 18 } },
    { text: 'which runs parallel to the', bbox: { x: 430, y: 143, width: 320, height: 18 } },
    { text: 'first column.', bbox: { x: 430, y: 166, width: 220, height: 18 } },
  ],
  page: {
    width: 800,
    height: 1200,
    dpiX: 150,
    dpiY: 150,
  },
  layout: {
    rectangles: [
      { x: 45, y: 45, width: 710, height: 45 },  // Header box
    ],
    horizontalLines: [],  // No separator lines on this page
  },
};

// Keep only the horizontal lines that are not inside rectangles
const relevantLines = filterHorizontalLinesOutsideRectangles(
  document.layout?.rectangles || [],
  document.layout?.horizontalLines || [],
  5  // pixel tolerance
);

const result = reconstructParagraphs(document, {
  line: {
    rectangles: document.layout?.rectangles,
    horizontalLines: relevantLines,
  },
});

// Check which lines are headings
result.lines.forEach(line => {
  if (line.isHeading) {
    console.log('Heading:', line.text);
  }
});
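
The column pre-processing recommended for multi-column pages could look roughly like the following. This is a sketch under assumptions: `splitIntoColumns` is a hypothetical helper, not part of Kokokor; the gutter position (x = 400) is supplied by the caller, not detected; and full-width elements such as the boxed header would need to be handled separately.

```typescript
type BoundingBox = { x: number; y: number; width: number; height: number };
type Observation = { text: string; bbox: BoundingBox };

// Hypothetical helper: assign each observation to the left or right column by
// comparing its horizontal centre with a caller-supplied gutter position.
function splitIntoColumns(observations: Observation[], gutterX: number): Observation[][] {
  const left: Observation[] = [];
  const right: Observation[] = [];
  for (const obs of observations) {
    const centerX = obs.bbox.x + obs.bbox.width / 2;
    (centerX < gutterX ? left : right).push(obs);
  }
  // Sort each column top to bottom so its lines are in reading order.
  const byY = (a: Observation, b: Observation): number => a.bbox.y - b.bbox.y;
  return [left.sort(byY), right.sort(byY)];
}

// Column text only, interleaved the way an OCR engine might emit it.
const columnObservations: Observation[] = [
  { text: 'This is the first column', bbox: { x: 50, y: 120, width: 320, height: 18 } },
  { text: 'This is the second column', bbox: { x: 430, y: 120, width: 330, height: 18 } },
  { text: 'of text in a two-column', bbox: { x: 50, y: 143, width: 310, height: 18 } },
  { text: 'which runs parallel to the', bbox: { x: 430, y: 143, width: 320, height: 18 } },
  { text: 'layout.', bbox: { x: 50, y: 166, width: 120, height: 18 } },
  { text: 'first column.', bbox: { x: 430, y: 166, width: 220, height: 18 } },
];

const columns = splitIntoColumns(columnObservations, 400);
const columnTexts = columns.map((column) => column.map((obs) => obs.text).join(' '));
console.log(columnTexts);
```

Each group in `columns` could then be passed to `reconstructParagraphs` as the `observations` of its own document, and the resulting texts joined in reading order.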

Complete Example with All Features

import { 
  reconstructParagraphs,
  filterHorizontalLinesOutsideRectangles
} from 'kokokor';

// Layout elements from document analysis
const rectangles = [
  // Chapter title box
  { x: 100, y: 80, width: 600, height: 50 },
  // Sidebar/callout box
  { x: 100, y: 400, width: 600, height: 100 },
];

const horizontalLines = [
  // Separator before footnotes
  { x: 100, y: 1000, width: 600, height: 2 },
  // Line in callout box (should be filtered)
  { x: 110, y: 450, width: 580, height: 1 },
];

// Filter out lines inside rectangles
const footnoteSeparators = filterHorizontalLinesOutsideRectangles(
  rectangles,
  horizontalLines,
  5  // pixel tolerance
);

const document = {
  observations: [
    // Chapter title
    { text: 'Chapter 3: Layout Analysis', bbox: { x: 120, y: 95, width: 560, height: 30 } },
    
    // Body paragraph 1
    { text: 'Document layout analysis', bbox: { x: 100, y: 180, width: 380, height: 18 } },
    { text: 'is crucial for accurate', bbox: { x: 100, y: 203, width: 360, height: 18 } },
    { text: 'text reconstruction.', bbox: { x: 100, y: 226, width: 340, height: 18 } },
    
    // Body paragraph 2
    { text: 'Kokokor uses rectangles', bbox: { x: 100, y: 270, width: 380, height: 18 } },
    { text: 'and horizontal lines to', bbox: { x: 100, y: 293, width: 370, height: 18 } },
    { text: 'identify structure.', bbox: { x: 100, y: 316, width: 320, height: 18 } },
    
    // Callout box content
    { text: 'Important Note:', bbox: { x: 120, y: 420, width: 260, height: 16 } },
    { text: 'Layout elements must be', bbox: { x: 120, y: 445, width: 340, height: 14 } },
    { text: 'provided by the OCR system', bbox: { x: 120, y: 464, width: 360, height: 14 } },
    
    // More body text
    { text: 'The reconstruction quality', bbox: { x: 100, y: 550, width: 400, height: 18 } },
    { text: 'improves significantly', bbox: { x: 100, y: 573, width: 360, height: 18 } },
    { text: 'with layout information.', bbox: { x: 100, y: 596, width: 380, height: 18 } },
    
    // Footnotes
    { text: '1. Smith et al. (2020)', bbox: { x: 100, y: 1030, width: 280, height: 14 } },
    { text: '2. See documentation for details', bbox: { x: 100, y: 1050, width: 380, height: 14 } },
  ],
  page: {
    width: 800,
    height: 1200,
    dpiX: 150,
    dpiY: 150,
  },
  layout: {
    rectangles,
    horizontalLines: footnoteSeparators,
  },
};

const result = reconstructParagraphs(document, {
  line: {
    rectangles: document.layout.rectangles,
    horizontalLines: document.layout.horizontalLines,
    pixelTolerance: 5,
  },
  paragraph: {
    verticalJumpFactor: 2,
    widthTolerance: 0.85,
  },
  format: {
    footerSymbol: '___',
  },
});

console.log(result.text);

// Analyze the structure
console.log('\nStructure Analysis:');
result.paragraphs.forEach((para, i) => {
  const flags = [];
  if (para.isHeading) flags.push('HEADING');
  if (para.isFootnote) flags.push('FOOTNOTE');
  if (para.isCentered) flags.push('CENTERED');
  if (para.isPoetic) flags.push('POETRY');
  
  console.log(`${i + 1}. ${flags.length ? `[${flags.join(', ')}]` : '[BODY]'} ${para.text}`);
});

Utility Functions

filterHorizontalLinesOutsideRectangles

Removes horizontal lines that fall inside any rectangle (within the given pixel tolerance) and returns the lines that remain:

import { filterHorizontalLinesOutsideRectangles } from 'kokokor';

const filteredLines = filterHorizontalLinesOutsideRectangles(
  rectangles,        // Array of rectangle bounding boxes
  horizontalLines,   // Array of horizontal line bounding boxes
  5                  // Pixel tolerance
);
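
To make the role of the tolerance concrete, here is an approximation of what such a filter might do. It is an illustrative re-implementation under assumptions, not the library's code: `linesOutsideRectangles` is a hypothetical name, and the rule used (a line is dropped when it fits inside a rectangle grown by the tolerance on every side) may differ from Kokokor's exact containment test.

```typescript
type BoundingBox = { x: number; y: number; width: number; height: number };

// Hypothetical sketch: keep only lines that are not contained in any
// rectangle after expanding the rectangle by `tolerance` pixels per side.
function linesOutsideRectangles(
  rectangles: BoundingBox[],
  lines: BoundingBox[],
  tolerance: number,
): BoundingBox[] {
  const isInside = (line: BoundingBox, rect: BoundingBox): boolean =>
    line.x >= rect.x - tolerance &&
    line.y >= rect.y - tolerance &&
    line.x + line.width <= rect.x + rect.width + tolerance &&
    line.y + line.height <= rect.y + rect.height + tolerance;

  return lines.filter((line) => !rectangles.some((rect) => isInside(line, rect)));
}

const boxes: BoundingBox[] = [
  { x: 100, y: 80, width: 600, height: 50 },   // Chapter title box
  { x: 100, y: 400, width: 600, height: 100 }, // Callout box
];
const detectedLines: BoundingBox[] = [
  { x: 100, y: 1000, width: 600, height: 2 }, // Footnote separator
  { x: 110, y: 450, width: 580, height: 1 },  // Rule inside the callout box
];

const kept = linesOutsideRectangles(boxes, detectedLines, 5);
console.log(kept); // only the separator at y = 1000 remains
```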

calculateDPI

Calculates the horizontal and vertical DPI of a page image from its size in pixels and the PDF page size in points:

import { calculateDPI } from 'kokokor';

const dpi = calculateDPI(
  { width: 2480, height: 3508 },  // Image size in pixels
  { width: 595, height: 842 }     // PDF size in points (72 points = 1 inch)
);

console.log(dpi);  // Approximately { x: 300, y: 300 }
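
The arithmetic behind this result follows from 72 points per inch: DPI is the pixel count divided by the page size in inches. The sketch below uses a hypothetical `dpiFromSizes` function to show the unrounded values for the A4 sizes above; whether `calculateDPI` rounds its result is not stated here, so treat the exact output as an assumption.

```typescript
type Size = { width: number; height: number };

const POINTS_PER_INCH = 72;

// Hypothetical helper: DPI = pixels / (points / 72).
function dpiFromSizes(image: Size, pdf: Size): { x: number; y: number } {
  return {
    x: image.width / (pdf.width / POINTS_PER_INCH),
    y: image.height / (pdf.height / POINTS_PER_INCH),
  };
}

const unroundedDpi = dpiFromSizes(
  { width: 2480, height: 3508 }, // Image size in pixels
  { width: 595, height: 842 },   // PDF size in points
);

console.log(unroundedDpi.x.toFixed(2), unroundedDpi.y.toFixed(2)); // 300.10 299.97
```

The small deviation from 300 arises because A4 is not a whole number of points, so 595 × 842 is itself a rounded page size.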

mapMatrixToBoundingBox

Converts an array-format bounding box ([x1, y1, x2, y2]) to object format ({ x, y, width, height }):

import { mapMatrixToBoundingBox } from 'kokokor';

const bbox = mapMatrixToBoundingBox([100, 200, 400, 220]);
console.log(bbox);
// { x: 100, y: 200, width: 300, height: 20 }
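
Judging by the example output, the array holds two corner points, and width and height are their differences. The following sketch spells out that conversion and its inverse; `matrixToBoundingBox` and `boundingBoxToMatrix` are hypothetical names for illustration, and the [x1, y1, x2, y2] ordering is inferred from the example and not taken from the API reference.

```typescript
type BoundingBox = { x: number; y: number; width: number; height: number };
type Matrix = [number, number, number, number]; // Assumed order: [x1, y1, x2, y2]

// Corner points to origin plus size.
function matrixToBoundingBox([x1, y1, x2, y2]: Matrix): BoundingBox {
  return { x: x1, y: y1, width: x2 - x1, height: y2 - y1 };
}

// Inverse conversion, useful when an OCR engine expects corner points.
function boundingBoxToMatrix(bbox: BoundingBox): Matrix {
  return [bbox.x, bbox.y, bbox.x + bbox.width, bbox.y + bbox.height];
}

const converted = matrixToBoundingBox([100, 200, 400, 220]);
const roundTrip = boundingBoxToMatrix(converted);

console.log(converted); // { x: 100, y: 200, width: 300, height: 20 }
console.log(roundTrip); // [ 100, 200, 400, 220 ]
```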

Configuration Options

layout.rectangles
BoundingBox[]
Array of rectangle coordinates for heading detection. Text within rectangles is marked with isHeading: true.
layout.horizontalLines
BoundingBox[]
Array of horizontal line coordinates for footnote detection. Text below the last line (outside rectangles) is marked with isFootnote: true.
format.footerSymbol
string
Optional symbol to insert before the first footnote. Common values: '___', '---', '***'.
line.pixelTolerance
number
default: 5
Tolerance in pixels (at 72 DPI) for determining if a line is inside a rectangle.

Best Practices

Filter Separator Lines: Use filterHorizontalLinesOutsideRectangles() to exclude horizontal lines that are part of headers or callout boxes.
Multi-Column Limitation: Kokokor’s paragraph detection assumes single-column layout. For multi-column documents, pre-process observations to separate columns.
Heading Detection: Text is marked as a heading only if it is contained within a rectangle. Use your OCR system’s layout analysis to identify heading boxes.

Advanced: Custom Layout Processing

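For complex layouts, use the low-level API:

import { 
  mapObservationsToTextLines,
  mapTextLinesToParagraphs,
  formatTextBlocks 
} from 'kokokor';

// `observations`, `page`, `rectangles`, and `horizontalLines` are assumed to be
// defined as in the examples above.

// Step 1: Process observations to lines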
const lines = mapObservationsToTextLines(
  observations,
  page,
  {
    rectangles,
    horizontalLines,
  }
);

// Step 2: Separate body and footnotes
const bodyLines = lines.filter(line => !line.isFootnote);
const footnoteLines = lines.filter(line => line.isFootnote);

// Step 3: Process each section separately
const bodyParagraphs = mapTextLinesToParagraphs(bodyLines);
const footnoteParagraphs = mapTextLinesToParagraphs(footnoteLines);

// Step 4: Format with custom logic
const bodyText = formatTextBlocks(bodyParagraphs);
const footnoteText = formatTextBlocks(footnoteParagraphs);

const finalText = `${bodyText}\n\n___\n${footnoteText}`;
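
One design point in Step 4: the template string always emits the separator, even when a page has no footnotes. A small helper can guard against that. The sketch below is a hypothetical illustration (`joinSections` is not a Kokokor export) that works on the plain strings produced in Step 4.

```typescript
// Hypothetical helper: join body and footnote text, adding the separator
// only when both sections are non-empty.
function joinSections(bodyText: string, footnoteText: string, footerSymbol = '___'): string {
  const body = bodyText.trim();
  const footnotes = footnoteText.trim();
  if (!footnotes) return body;
  if (!body) return footnotes;
  return `${body}\n\n${footerSymbol}\n${footnotes}`;
}

const withFootnotes = joinSections('Body paragraph.', '1. Smith et al. (2020)');
const withoutFootnotes = joinSections('Body paragraph.', '');

console.log(withFootnotes);
console.log(withoutFootnotes); // Body paragraph.
```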

See Also

Simple OCR: Basic paragraph reconstruction
Poetry Documents: Preserve poetic formatting
API Reference: Complete API documentation
Arabic Text: RTL text processing
