Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/ragaeeb/kokokor/llms.txt

Use this file to discover all available pages before exploring further.

Overview

Surya is a powerful multilingual document OCR toolkit. Kokokor provides seamless integration with Surya through format conversion utilities that transform Surya’s output into Kokokor’s observation format.

What is Surya?

Surya is an OCR toolkit that provides:
  • Multilingual text recognition
  • Layout detection
  • Table recognition
  • Reading order detection
  • Line-level text extraction
Kokokor focuses on the final step: taking Surya’s line-level text output and reconstructing it into properly formatted paragraphs with intelligent break detection.

Format Conversion

Surya uses a different bounding box format than Kokokor. Kokokor provides the mapMatrixToBoundingBox utility to handle this conversion:

Surya Format

Surya represents bounding boxes as arrays with corner coordinates:
// Surya format: [x1, y1, x2, y2]
// where (x1, y1) is top-left, (x2, y2) is bottom-right
const suryaBbox = [100, 100, 400, 120];

Kokokor Format

Kokokor uses an object format with position and dimensions:
// Kokokor format: { x, y, width, height }
// where (x, y) is top-left, width and height are dimensions
const kokokorBbox = {
  x: 100,
  y: 100,
  width: 300,  // x2 - x1
  height: 20   // y2 - y1
};

Using mapMatrixToBoundingBox

The mapMatrixToBoundingBox function converts between these formats:
import { mapMatrixToBoundingBox } from 'kokokor';

// Convert Surya [x1, y1, x2, y2] to Kokokor format
const bbox = mapMatrixToBoundingBox([100, 100, 400, 120]);
// Result: { x: 100, y: 100, width: 300, height: 20 }
box
[number, number, number, number]
required
Array containing [x1, y1, x2, y2] coordinates where:
  • x1, y1 - Top-left corner coordinates
  • x2, y2 - Bottom-right corner coordinates
BoundingBox
object

Converting Surya Output

Here’s how to convert Surya’s OCR results to Kokokor observations:
import { mapMatrixToBoundingBox, reconstructParagraphs } from 'kokokor';

// Surya OCR result structure
const suryaResult = {
  text_lines: [
    {
      bbox: [100, 100, 400, 120], // [x1, y1, x2, y2] format
      text: 'First line of text from Surya OCR',
    },
    {
      bbox: [100, 130, 450, 150],
      text: 'Second line of text',
    },
    {
      bbox: [100, 160, 380, 180],
      text: 'Third line of text',
    },
  ],
};

// Convert Surya bounding boxes to Kokokor format
const observations = suryaResult.text_lines.map((line) => ({
  text: line.text,
  bbox: mapMatrixToBoundingBox(line.bbox as [number, number, number, number]),
}));

// Now use with Kokokor
const result = reconstructParagraphs({
  observations,
  page: {
    width: 2480,
    height: 3508,
    dpiX: 300,
    dpiY: 300,
  },
});

console.log(result.text);

Complete Working Example

Here’s a complete example showing the full workflow from Surya OCR to formatted text:
import { mapMatrixToBoundingBox, reconstructParagraphs } from 'kokokor';
// Assume you have Surya results already

// Step 1: Your Surya OCR results
const suryaOcrResult = {
  text_lines: [
    {
      bbox: [100, 100, 400, 120],
      text: 'Text from Surya OCR',
      confidence: 0.95,
    },
    {
      bbox: [100, 130, 450, 150],
      text: 'can be easily converted',
      confidence: 0.93,
    },
    {
      bbox: [100, 160, 380, 175],
      text: 'to Kokokor format.',
      confidence: 0.97,
    },
    {
      bbox: [100, 200, 420, 220],
      text: 'This is a new paragraph with proper spacing.',
      confidence: 0.96,
    },
  ],
  image_bbox: [0, 0, 2480, 3508],
};

// Step 2: Extract page dimensions from Surya result
const [, , pageWidth, pageHeight] = suryaOcrResult.image_bbox;

// Step 3: Convert Surya observations to Kokokor format
const observations = suryaOcrResult.text_lines.map((line) => ({
  text: line.text,
  bbox: mapMatrixToBoundingBox(line.bbox as [number, number, number, number]),
}));

// Step 4: Reconstruct paragraphs with Kokokor
const result = reconstructParagraphs({
  observations,
  page: {
    width: pageWidth,
    height: pageHeight,
    dpiX: 300, // Adjust based on your Surya configuration
    dpiY: 300,
  },
});

// Step 5: Access results
console.log('Formatted text:');
console.log(result.text);

console.log('\nLine metadata:');
result.lines.forEach((line, i) => {
  console.log(`Line ${i + 1}: ${line.text}`);
  if (line.isCentered) console.log('  - Centered');
  if (line.isPoetic) console.log('  - Poetry');
  if (line.isHeading) console.log('  - Heading');
});

console.log(`\nGrouped into ${result.paragraphs.length} paragraphs`);

Working with Surya Layout Detection

If you’re also using Surya’s layout detection features, you can pass detected rectangles and lines to Kokokor:
import { mapMatrixToBoundingBox, reconstructParagraphs } from 'kokokor';

// Surya layout detection results
const suryaLayout = {
  bboxes: [
    {
      bbox: [200, 50, 600, 90],
      label: 'heading',
    },
    {
      bbox: [100, 2900, 700, 2905],
      label: 'line',
    },
  ],
};

// Separate rectangles and horizontal lines
const rectangles = suryaLayout.bboxes
  .filter((item) => item.label === 'heading')
  .map((item) => mapMatrixToBoundingBox(item.bbox as [number, number, number, number]));

const horizontalLines = suryaLayout.bboxes
  .filter((item) => item.label === 'line')
  .map((item) => mapMatrixToBoundingBox(item.bbox as [number, number, number, number]));

// Use with Kokokor
const result = reconstructParagraphs(
  {
    observations: convertedObservations,
    page: pageContext,
    layout: {
      rectangles,
      horizontalLines,
    },
  }
);
When using Surya’s layout detection, map detected headings to rectangles and detected lines/separators to horizontalLines for optimal text classification.

Handling Different DPI Values

If you know the DPI at which Surya processed your images, make sure to provide it to Kokokor:
import { calculateDPI } from 'kokokor';

// If you have both image dimensions and original PDF dimensions
const dpi = calculateDPI(
  { width: 2480, height: 3508 }, // Image size from Surya
  { width: 595, height: 842 }     // PDF size in points (A4)
);

const result = reconstructParagraphs({
  observations: convertedObservations,
  page: {
    width: 2480,
    height: 3508,
    dpiX: dpi.x,
    dpiY: dpi.y,
  },
});
Accurate DPI values are important for proper scaling of pixel-based tolerances. If you don’t know the exact DPI, 300 is a common default for high-quality scans.

TypeScript Type Safety

For full type safety when working with Surya results:
import type { Observation } from 'kokokor';
import { mapMatrixToBoundingBox } from 'kokokor';

// Define Surya result type
type SuryaTextLine = {
  bbox: [number, number, number, number];
  text: string;
  confidence?: number;
};

// Type-safe conversion function
function convertSuryaToKokokor(suryaLines: SuryaTextLine[]): Observation[] {
  return suryaLines.map((line) => ({
    text: line.text,
    bbox: mapMatrixToBoundingBox(line.bbox),
  }));
}

// Use the conversion
const observations = convertSuryaToKokokor(suryaResult.text_lines);

Next Steps

Advanced Configuration

Fine-tune paragraph reconstruction for your documents

Layout Elements

Learn more about working with rectangles and horizontal lines

Build docs developers (and LLMs) love