What is Kokokor?
Kokokor is a lightweight TypeScript library designed to reconstruct paragraphs from OCR outputs. It transforms unstructured OCR text into well-formatted, readable paragraphs while preserving the semantic structure of the original document. Whether you’re processing Arabic manuscripts, poetry collections, or multi-column documents, Kokokor analyzes text layout and spacing to deliver accurate paragraph reconstruction.
Quick Start
Get up and running with Kokokor in minutes
Installation
Install Kokokor in your project
API Reference
Explore the complete API documentation
Examples
View real-world usage examples
Key Features
Intelligent Text Analysis
Smart Line Grouping
Groups text lines based on vertical proximity and adaptive spacing analysis with DPI-aware thresholds
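As a rough illustration of grouping by vertical proximity, the sketch below clusters bounding boxes into lines when their vertical centers are close, and scales the tolerance with the document DPI. The `Box` type, `groupIntoLines`, `scaleTolerance`, and the default values are hypothetical names chosen for this example; they are not Kokokor's API, and the library's actual heuristics may differ.

```typescript
// Hypothetical shape for illustration only; not Kokokor's observation type.
type Box = { text: string; x: number; y: number; width: number; height: number };

// Scale a pixel tolerance defined at a reference DPI to the document's DPI.
function scaleTolerance(baseTolerancePx: number, dpi: number, referenceDpi = 72): number {
    return baseTolerancePx * (dpi / referenceDpi);
}

// Group boxes into lines: a box joins the current line when its vertical
// center is within the tolerance of the previous box's vertical center.
function groupIntoLines(boxes: Box[], dpi: number, baseTolerancePx = 5): Box[][] {
    const tolerance = scaleTolerance(baseTolerancePx, dpi);
    const sorted = [...boxes].sort((a, b) => a.y - b.y);
    const lines: Box[][] = [];

    for (const box of sorted) {
        const center = box.y + box.height / 2;
        const current = lines[lines.length - 1];
        if (current) {
            const last = current[current.length - 1];
            const lastCenter = last.y + last.height / 2;
            if (Math.abs(center - lastCenter) <= tolerance) {
                current.push(box);
                continue;
            }
        }
        lines.push([box]);
    }

    // Order the boxes of each line from left to right.
    return lines.map((line) => line.sort((a, b) => a.x - b.x));
}
```

Scaling the tolerance by DPI keeps the same physical distance on the page whether the scan is 72 or 300 DPI, which is the general idea behind a DPI-aware threshold.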
Paragraph Reconstruction
Advanced paragraph detection using vertical gap and line width analysis
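One way to combine vertical gaps with line widths is sketched below: a new paragraph starts when the gap above a line is clearly larger than the typical gap, or when the previous line is noticeably shorter than the widest line. The `TextLine` type, `splitIntoParagraphs`, and the `gapFactor` and `shortLineRatio` defaults are assumptions made for this example and do not describe Kokokor's implementation.

```typescript
// Hypothetical line shape for illustration only.
type TextLine = { text: string; y: number; height: number; width: number };

function median(values: number[]): number {
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Lines are expected in reading order (top to bottom).
function splitIntoParagraphs(lines: TextLine[], gapFactor = 1.5, shortLineRatio = 0.8): string[] {
    if (lines.length === 0) return [];

    // gaps[i] is the vertical distance between line i and line i + 1.
    const gaps = lines.slice(1).map((line, i) => line.y - (lines[i].y + lines[i].height));
    const typicalGap = gaps.length > 0 ? median(gaps) : 0;
    const maxWidth = Math.max(...lines.map((line) => line.width));

    const paragraphs: string[][] = [[lines[0].text]];
    for (let i = 1; i < lines.length; i++) {
        const largeGap = gaps[i - 1] > typicalGap * gapFactor;
        const previousIsShort = lines[i - 1].width < maxWidth * shortLineRatio;
        if (largeGap || previousIsShort) {
            paragraphs.push([]);
        }
        paragraphs[paragraphs.length - 1].push(lines[i].text);
    }
    return paragraphs.map((parts) => parts.join(" "));
}
```

Using the median gap as the baseline makes the threshold adapt to each document's line spacing instead of relying on a fixed pixel value.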
Poetry Detection
Automatically identifies and preserves poetic content using centering, word density, and hemistich analysis
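The centering and word-density signals can be pictured with a small sketch like the one below. It covers only those two signals and leaves out hemistich analysis. `LineBox`, `isCentered`, `wordDensity`, `looksLikePoetry`, and every threshold are hypothetical and chosen for illustration; Kokokor's real detection logic is not shown in this page and may work differently.

```typescript
// Hypothetical line shape for illustration only.
type LineBox = { text: string; x: number; width: number };

// A line counts as centered when its left and right margins are nearly equal
// and it does not span almost the full page width.
function isCentered(line: LineBox, pageWidth: number, tolerance = 0.05): boolean {
    const leftMargin = line.x;
    const rightMargin = pageWidth - (line.x + line.width);
    const marginsMatch = Math.abs(leftMargin - rightMargin) <= pageWidth * tolerance;
    return marginsMatch && line.width < pageWidth * 0.9;
}

// Words per unit of horizontal space; verse often has fewer words per line than prose.
function wordDensity(line: LineBox): number {
    const words = line.text.trim().split(/\s+/).filter(Boolean).length;
    return line.width > 0 ? words / line.width : 0;
}

// Combine both signals; maxDensity is an arbitrary illustrative threshold.
function looksLikePoetry(line: LineBox, pageWidth: number, maxDensity = 0.02): boolean {
    return isCentered(line, pageWidth) && wordDensity(line) <= maxDensity;
}
```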
Layout Recognition
Recognizes document structure including headings, footnotes, and multi-column layouts
Multilingual Support
- Right-to-Left (RTL) Text: Full support for Arabic, Hebrew, and other RTL languages with coordinate flipping and normalization
- Coordinate Normalization: Ensures consistent results regardless of source document resolution
- DPI-Aware Processing: Adapts thresholds based on document DPI for accurate spacing analysis
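To make coordinate flipping and normalization concrete, here is a minimal sketch. Mirroring a box around the page's vertical axis lets RTL layouts be handled with the same left-to-right logic, and rescaling to a common DPI makes thresholds comparable across scans. The `Rect` type and the `flipHorizontally` and `normalizeToDpi` helpers are hypothetical names for this example, not functions exported by Kokokor.

```typescript
// Hypothetical rectangle shape for illustration only.
type Rect = { x: number; y: number; width: number; height: number };

// Mirror a box horizontally within a page of the given width.
function flipHorizontally(rect: Rect, pageWidth: number): Rect {
    return { ...rect, x: pageWidth - (rect.x + rect.width) };
}

// Rescale coordinates from the source DPI to a common target DPI.
function normalizeToDpi(rect: Rect, sourceDpi: number, targetDpi = 72): Rect {
    const scale = targetDpi / sourceDpi;
    return {
        x: rect.x * scale,
        y: rect.y * scale,
        width: rect.width * scale,
        height: rect.height * scale,
    };
}
```

Flipping is its own inverse, so applying `flipHorizontally` twice with the same page width returns the original box.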
Advanced Capabilities
- Surya OCR Integration: Built-in support for converting Surya OCR format to Kokokor observations
- Noise Filtering: Removes OCR artifacts to improve text quality
- Customizable Parameters: Tune detection thresholds for different document types and languages
- Comprehensive Metadata: Text blocks include centering, heading, footnote, and poetry flags
- Line Spacing Analytics: Adaptive line height factors based on document characteristics
Real Use Cases
OCR Post-Processing
Transform raw OCR output from tools like Tesseract, Google Vision API, or Surya into clean, readable text with proper paragraph breaks.
Arabic Text Processing
Handle Arabic manuscripts and documents with RTL support and poetry detection.
Poetry Preservation
Automatically detect and preserve poetic structures in mixed-content documents.
Document Structure Analysis
Extract headings, footnotes, and body content from complex documents.
Why Kokokor?
Built for Production: Kokokor is designed to handle real-world OCR challenges including variable spacing, noise, multi-column layouts, and mixed content types.
- Zero Configuration: Works out of the box with sensible defaults
- Highly Customizable: Fine-tune every aspect of paragraph detection
- Type-Safe: Written in TypeScript with comprehensive type definitions
- Lightweight: Minimal dependencies, tree-shakeable ESM modules
- Well-Tested: Comprehensive test suite with snapshot testing
- Production Ready: Used in real-world OCR processing pipelines
Next Steps
Install Kokokor
Add Kokokor to your project
Quick Start Guide
Build your first OCR reconstruction pipeline