Overview
Kokokor processes OCR observations through a three-stage pipeline that progressively reconstructs document structure:
1. Observations → Text Lines: Groups OCR observations into lines using vertical proximity analysis.
2. Text Lines → Paragraphs: Merges lines into paragraphs while preserving poetry and special formatting.
3. Paragraphs → Formatted Text: Converts structured blocks into readable text with proper spacing.
Pipeline Architecture
The complete pipeline is orchestrated by the reconstructParagraphs function:
export const reconstructParagraphs = (
  input: ReconstructInput,
  options: ReconstructOptions = {}
): ReconstructResult => {
  // Stage 1: Observations → Text Lines
  const lines = mapObservationsToTextLines(
    input.observations,
    input.page,
    {
      horizontalLines: input.layout?.horizontalLines,
      rectangles: input.layout?.rectangles,
      ...(options.line ?? {}),
    }
  );

  // Stage 2: Text Lines → Paragraphs
  const paragraphs = mapTextLinesToParagraphs(
    lines,
    options.paragraph ?? {}
  );

  // Stage 3: Paragraphs → Formatted Text
  const text = formatTextBlocks(
    paragraphs,
    options.format?.footerSymbol
  );

  return { lines, paragraphs, text };
};
Reference: src/index.ts:35
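One detail of the Stage 1 call is worth spelling out: because `options.line` is spread after the layout keys, a value passed in `options.line` takes precedence over the same key from `input.layout`. The sketch below isolates that merge. The `mergeLineOptions` helper and the trimmed `LineOptions` type are hypothetical names used only for illustration; they are not part of the library.

```typescript
type Box = { x: number; y: number; width: number; height: number };
type Layout = { horizontalLines?: Box[]; rectangles?: Box[] };
// Trimmed, illustrative option type (the real one has more fields).
type LineOptions = { horizontalLines?: Box[]; rectangles?: Box[]; isRTL?: boolean };

// Hypothetical helper mirroring the object literal passed to Stage 1.
const mergeLineOptions = (layout?: Layout, line?: LineOptions): LineOptions => ({
  horizontalLines: layout?.horizontalLines,
  rectangles: layout?.rectangles,
  // Spread last, so keys present in `line` override the layout values above.
  ...(line ?? {}),
});

const fromLayout: Box = { x: 0, y: 900, width: 800, height: 1 };
const fromOptions: Box = { x: 0, y: 950, width: 800, height: 1 };

const inherited = mergeLineOptions({ horizontalLines: [fromLayout] }, { isRTL: true });
const overridden = mergeLineOptions(
  { horizontalLines: [fromLayout] },
  { horizontalLines: [fromOptions] }
);

console.log(inherited.horizontalLines?.[0]?.y, overridden.horizontalLines?.[0]?.y);
```

Note that an explicit `horizontalLines: undefined` in `options.line` would also override the layout value under this spread order.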
Stage 1: Observations to Text Lines
Purpose
Convert raw OCR observations (individual words) into structured text lines with rich metadata.
Process
Preprocessing
Normalize coordinates, filter noise, and handle RTL text direction.
observations = flipAndAlignObservations(
  observations,
  page.width,
  page.dpiX,
  options
);
Line Grouping
Group observations by vertical proximity using adaptive spacing analysis.
const marked = indexItemsAsLines(
  observations,
  page.dpiY,
  options.pixelTolerance,
  options.lineHeightFactor
);
Metadata Detection
Identify centering, headings, footnotes, and poetry.
// Internal: centering detection algorithm
if (textIsCentered(o.bbox, page.width, options)) {
  e.isCentered = true;
}
if (footerLineY !== undefined && o.bbox.y > footerLineY) {
  e.isFootnote = true;
}
Poetry Detection
Apply multiple heuristics to identify poetic content.
// Internal: poetry detection uses multiple heuristics
if (groupMatchesPoetryCriteria(group, page.width, avgProseWordDensity, options)) {
  for (const observation of group) {
    observation.isPoetic = true;
  }
}
Reference: src/utils/paragraphs.ts:82
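To make the line-grouping step more concrete, here is a minimal sketch of grouping by vertical proximity. It is not the library's `indexItemsAsLines`: the `groupIntoLines` helper, the `Word` type, and the rule "same line if the vertical offset is within a fraction of the first word's height" are all assumptions chosen for illustration.

```typescript
type Box = { x: number; y: number; width: number; height: number };
type Word = { bbox: Box; text: string };

// Hypothetical helper: groups words whose top edges are close to the line's first word.
const groupIntoLines = (words: Word[], lineHeightFactor = 0.25): Word[][] => {
  const sorted = [...words].sort((a, b) => a.bbox.y - b.bbox.y);
  const lines: Word[][] = [];

  for (const word of sorted) {
    const current: Word[] | undefined = lines[lines.length - 1];
    const anchor = current?.[0];
    // Same line if the vertical offset is a small fraction of the anchor's height.
    if (
      current &&
      anchor &&
      Math.abs(word.bbox.y - anchor.bbox.y) <= anchor.bbox.height * lineHeightFactor
    ) {
      current.push(word);
    } else {
      lines.push([word]);
    }
  }

  // Order words left to right within each line (an RTL page would need the reverse).
  return lines.map((line) => [...line].sort((a, b) => a.bbox.x - b.bbox.x));
};

const sample: Word[] = [
  { bbox: { x: 310, y: 2, width: 200, height: 20 }, text: "line" },
  { bbox: { x: 100, y: 0, width: 200, height: 20 }, text: "First" },
  { bbox: { x: 100, y: 40, width: 200, height: 20 }, text: "Second" },
];

const grouped = groupIntoLines(sample).map((line) => line.map((w) => w.text).join(" "));
console.log(grouped); // [ 'First line', 'Second' ]
```

The 2-pixel offset between "First" and "line" falls within the 5-pixel tolerance (20 × 0.25), so both words land on the same line, while "Second" starts a new one.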
Adaptive Line Detection
The algorithm automatically adjusts to document characteristics:
Spacing Analysis: Calculates median and 75th percentile gaps between observations
Line Height Factor: Adapts based on gap-to-height ratio:
  0.15 for small gaps (tight line grouping)
  0.25 for medium gaps (standard spacing)
  0.4 for large gaps (widely spaced lines)
DPI Scaling: Adjusts pixel tolerances based on document resolution
Reference: src/utils/layout.ts:196
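The following sketch shows one way the spacing analysis and factor selection could fit together. The three factors (0.15, 0.25, 0.4) come from the list above, but the ratio boundaries that separate "small", "medium", and "large" gaps are not stated in this page, so `SMALL_GAP_RATIO` and `LARGE_GAP_RATIO` are placeholder assumptions, as are the helper names.

```typescript
// Linearly interpolated percentile of an ascending-sorted array (p in [0, 1]).
const percentile = (sorted: number[], p: number): number => {
  if (sorted.length === 0) return 0;
  const index = (sorted.length - 1) * p;
  const lower = Math.floor(index);
  const upper = Math.ceil(index);
  const weight = index - lower;
  return (sorted[lower] ?? 0) * (1 - weight) + (sorted[upper] ?? 0) * weight;
};

// Assumed cut-offs: the documentation names the factors but not these boundaries.
const SMALL_GAP_RATIO = 0.5;
const LARGE_GAP_RATIO = 1.5;

// Hypothetical helper: derives a line height factor from observed vertical gaps.
const pickLineHeightFactor = (gaps: number[], avgHeight: number) => {
  const sorted = [...gaps].sort((a, b) => a - b);
  const median = percentile(sorted, 0.5);
  const p75 = percentile(sorted, 0.75);
  const ratio = avgHeight > 0 ? median / avgHeight : 0;
  const factor = ratio < SMALL_GAP_RATIO ? 0.15 : ratio < LARGE_GAP_RATIO ? 0.25 : 0.4;
  return { median, p75, ratio, factor };
};

const tight = pickLineHeightFactor([4, 6, 8, 30], 20);
const loose = pickLineHeightFactor([40, 44, 48], 20);
console.log(tight.factor, loose.factor); // 0.15 0.4
```

In this sketch only the median drives the choice; the 75th percentile is computed and returned because the list above mentions it, without guessing how the real algorithm uses it.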
Stage 2: Text Lines to Paragraphs
Purpose
Merge text lines into coherent paragraphs while preserving poetry and special formatting.
Process
The algorithm separates body content from footnotes and processes each independently:
export const mapTextLinesToParagraphs = (
  textLines: TextBlock[],
  options: ParagraphOptions = {}
) => {
  // Excerpt: `resolvedOptions` is defined in the full source (not shown here),
  // presumably `options` with defaults applied.
  const bodyBlocks = groupProseToParagraphs(
    textLines.filter((t) => !t.isFootnote),
    resolvedOptions.verticalJumpFactor,
    resolvedOptions.widthTolerance
  );
  const footerBlocks = groupProseToParagraphs(
    textLines.filter((t) => t.isFootnote),
    resolvedOptions.verticalJumpFactor,
    resolvedOptions.widthTolerance
  );
  return bodyBlocks.concat(footerBlocks);
};
Reference: src/utils/paragraphs.ts:236
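A consequence of this split is that footnote blocks always follow body blocks in the result, and a footnote line is never merged into a body paragraph. The sketch below demonstrates only that ordering; `groupStub` is a hypothetical pass-through standing in for `groupProseToParagraphs`, and the `Block` type is trimmed for illustration.

```typescript
type Block = { text: string; isFootnote?: boolean };

// Hypothetical stand-in for the grouping step: returns its input unchanged.
const groupStub = (lines: Block[]): Block[] => lines;

const orderBodyThenFootnotes = (lines: Block[]): Block[] => {
  const body = groupStub(lines.filter((l) => !l.isFootnote));
  const footer = groupStub(lines.filter((l) => l.isFootnote));
  return body.concat(footer);
};

const ordered = orderBodyThenFootnotes([
  { text: "Body A" },
  { text: "Note 1", isFootnote: true },
  { text: "Body B" },
]).map((b) => b.text);

console.log(ordered); // [ 'Body A', 'Body B', 'Note 1' ]
```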
Break Detection Signals
The paragraph grouping algorithm uses four coordinated signals:
Vertical Jump: Significant spacing increase between full-width lines
Indent Start: Line that newly indents from the right-edge baseline
List Start: Repeated left-edge starts with short continuations
Short Line: Lines significantly narrower than reference width
Reference: src/utils/marking.ts:525
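As a rough illustration, two of these signals (vertical jump and short line) can be sketched with the `verticalJumpFactor` and `widthTolerance` options. This is a simplified stand-in, not the library's algorithm: `splitIntoParagraphs` is a hypothetical helper, the "typical gap" is assumed to be the median line-to-line distance, and the reference width is assumed to be the widest line. The indent and list signals are omitted.

```typescript
type Line = { bbox: { x: number; y: number; width: number; height: number }; text: string };

// Hypothetical helper covering only the vertical-jump and short-line signals.
const splitIntoParagraphs = (
  lines: Line[],
  verticalJumpFactor = 2,
  widthTolerance = 0.85
): string[] => {
  const referenceWidth = Math.max(0, ...lines.map((l) => l.bbox.width));

  // Assumed baseline: median distance between consecutive line tops.
  const gaps: number[] = [];
  let prev: Line | undefined;
  for (const line of lines) {
    if (prev) gaps.push(line.bbox.y - prev.bbox.y);
    prev = line;
  }
  gaps.sort((a, b) => a - b);
  const typicalGap = gaps[Math.floor(gaps.length / 2)] ?? 0;

  const paragraphs: string[] = [];
  let current: string[] = [];
  const flush = () => {
    if (current.length > 0) paragraphs.push(current.join(" "));
    current = [];
  };

  prev = undefined;
  for (const line of lines) {
    const gap = prev ? line.bbox.y - prev.bbox.y : 0;
    // Vertical jump: break BEFORE a line that follows an unusually large gap.
    if (typicalGap > 0 && gap > typicalGap * verticalJumpFactor) flush();
    current.push(line.text);
    // Short line: break AFTER a line much narrower than the reference width.
    if (line.bbox.width < referenceWidth * widthTolerance) flush();
    prev = line;
  }
  flush();
  return paragraphs;
};

const mk = (text: string, y: number, width: number): Line => ({
  bbox: { x: 100, y, width, height: 20 },
  text,
});

const paragraphsOut = splitIntoParagraphs([
  mk("A", 0, 600), mk("B", 30, 600), mk("C", 60, 300),
  mk("D", 90, 600), mk("E", 120, 600), mk("F", 210, 600),
]);
console.log(paragraphsOut); // [ 'A B C', 'D E', 'F' ]
```

Line C is half the reference width, so it closes the first paragraph; the 90-pixel gap before F exceeds twice the typical 30-pixel gap, so F starts a new one.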
Poetry Preservation
Poetic content receives special treatment:
for (const line of textLines) {
  if (line.isPoetic) {
    // Poetry lines are NOT merged into paragraphs
    result.push(line);
  } else {
    // Prose lines accumulate for paragraph grouping
    current.push(line);
  }
}
Reference: src/utils/paragraphs.ts:204
Poetry lines maintain their individual line breaks to preserve artistic and structural integrity.
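The excerpt above omits what happens to accumulated prose when a poetic line is reached. A self-contained sketch of the whole loop might look like the following, assuming pending prose is flushed before each poetic line so that reading order is preserved. That flush, the `mergeProse` stand-in (which simply joins lines with spaces), and the trimmed `TextLine` type are assumptions, not the library's code.

```typescript
type TextLine = { text: string; isPoetic?: boolean };

// Hypothetical stand-in for prose grouping: joins pending lines into one paragraph.
const mergeProse = (lines: TextLine[]): TextLine[] =>
  lines.length > 0 ? [{ text: lines.map((l) => l.text).join(" ") }] : [];

const groupPreservingPoetry = (textLines: TextLine[]): TextLine[] => {
  const result: TextLine[] = [];
  let current: TextLine[] = [];

  for (const line of textLines) {
    if (line.isPoetic) {
      // Assumed behavior: emit pending prose first, then the verse line untouched.
      result.push(...mergeProse(current));
      current = [];
      result.push(line);
    } else {
      current.push(line);
    }
  }
  // Emit any prose left over after the last line.
  result.push(...mergeProse(current));
  return result;
};

const mixed = groupPreservingPoetry([
  { text: "Prose one" },
  { text: "continues here." },
  { text: "Verse line one", isPoetic: true },
  { text: "Verse line two", isPoetic: true },
  { text: "Prose resumes." },
]).map((l) => l.text);

console.log(mixed);
```

Here the two prose lines before the verse collapse into one block, each verse line stays separate, and the trailing prose forms its own block.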
Stage 3: Paragraphs to Formatted Text
Purpose
Convert structured text blocks into a readable string with proper line breaks and spacing.
Process
export const formatTextBlocks = (
  textBlocks: TextBlock[],
  footerSymbol?: string
) => {
  let isAtLeastOneFootnoteHit = false;
  const paragraphs = textBlocks.flatMap((t) => {
    // Insert footer symbol before first footnote
    if (footerSymbol && t.isFootnote && !isAtLeastOneFootnoteHit) {
      isAtLeastOneFootnoteHit = true;
      return [footerSymbol, t.text];
    }
    // Add blank line after headings
    if (t.isHeading) {
      return [t.text, ''];
    }
    return [t.text];
  });
  return paragraphs.join('\n');
};
Reference: src/index.ts:11
Headings receive a blank line after them for visual separation:
Chapter Title

First paragraph text...
Each poetic line appears on its own line:
Poetic line one
Poetic line two
Paragraphs are separated by single newlines:
First paragraph text.
Second paragraph text.
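To see these rules together, the sketch below runs a local copy of the formatting logic shown earlier on a small set of blocks. The `TextBlock` type here is trimmed to the flags the formatter reads (the real type presumably carries more metadata), and the sample blocks are invented for illustration.

```typescript
// Trimmed, illustrative type: only the fields the formatter inspects.
type TextBlock = { text: string; isFootnote?: boolean; isHeading?: boolean; isPoetic?: boolean };

// Local copy of the formatting logic from this section, for a runnable demo.
const formatTextBlocks = (textBlocks: TextBlock[], footerSymbol?: string): string => {
  let footnoteSeen = false;
  const parts = textBlocks.flatMap((t) => {
    // Footer symbol goes once, before the first footnote.
    if (footerSymbol && t.isFootnote && !footnoteSeen) {
      footnoteSeen = true;
      return [footerSymbol, t.text];
    }
    // An empty entry after a heading becomes a blank line once joined.
    if (t.isHeading) return [t.text, ""];
    return [t.text];
  });
  return parts.join("\n");
};

const formatted = formatTextBlocks(
  [
    { text: "Chapter Title", isHeading: true },
    { text: "First paragraph text." },
    { text: "Poetic line one", isPoetic: true },
    { text: "Poetic line two", isPoetic: true },
    { text: "1. A footnote", isFootnote: true },
    { text: "2. Another footnote", isFootnote: true },
  ],
  "---"
);

console.log(formatted);
// Chapter Title
//
// First paragraph text.
// Poetic line one
// Poetic line two
// ---
// 1. A footnote
// 2. Another footnote
```

The separator appears only once even though two footnotes follow, and omitting `footerSymbol` would drop it entirely.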
Configuration Options
Line Detection Options
type MapObservationsToTextLinesOptions = {
  // RTL text handling
  isRTL?: boolean;

  // Spacing tolerance
  pixelTolerance?: number; // Default: 5px at 72 DPI
  lineHeightFactor?: number; // Adaptive if not provided

  // Layout elements
  horizontalLines?: BoundingBox[]; // For footnote detection
  rectangles?: BoundingBox[]; // For heading detection

  // Centering detection
  centerToleranceRatio?: number; // Default: 0.05 (5%)
  minMarginRatio?: number; // Default: 0.2 (20%)

  // Poetry detection
  poetryDetectionOptions?: Partial<PoetryDetectionOptions>;
  poetryPairDelimiter?: string; // Default: " "

  // Debugging
  log?: (message: string, ...args: any[]) => void;
};
Reference: src/types.ts:69
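The sketch below gives one plausible reading of how `pixelTolerance`, `centerToleranceRatio`, and `minMarginRatio` could be applied. Both helpers are hypothetical: linear DPI scaling from a 72 DPI baseline is inferred from the "5px at 72 DPI" comment, and the centering rule (midpoint close to the page midpoint, with both margins at least `minMarginRatio` of the page width) is an assumption, not the library's `textIsCentered`.

```typescript
type Box = { x: number; y: number; width: number; height: number };

// Assumption: the tolerance is specified at 72 DPI and scales linearly with resolution.
const scaleTolerance = (pixelTolerance: number, dpi: number, baseDpi = 72): number =>
  pixelTolerance * (dpi / baseDpi);

// Assumption: centered means the box midpoint is near the page midpoint
// and both side margins are sufficiently wide.
const isCentered = (
  bbox: Box,
  pageWidth: number,
  centerToleranceRatio = 0.05,
  minMarginRatio = 0.2
): boolean => {
  const leftMargin = bbox.x;
  const rightMargin = pageWidth - (bbox.x + bbox.width);
  const offset = Math.abs(bbox.x + bbox.width / 2 - pageWidth / 2);
  return (
    offset <= pageWidth * centerToleranceRatio &&
    leftMargin >= pageWidth * minMarginRatio &&
    rightMargin >= pageWidth * minMarginRatio
  );
};

const scaled = scaleTolerance(5, 300); // about 20.83 px at 300 DPI
const centeredTitle = isCentered({ x: 250, y: 0, width: 300, height: 20 }, 800);
const fullWidthLine = isCentered({ x: 50, y: 0, width: 700, height: 20 }, 800);
const offCenterLine = isCentered({ x: 200, y: 0, width: 300, height: 20 }, 800);

console.log(scaled.toFixed(2), centeredTitle, fullWidthLine, offCenterLine);
```

Under these assumptions a full-width line is rejected by the margin check even though its midpoint is centered, which is why the two ratios are separate options.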
Paragraph Grouping Options
type ParagraphOptions = {
  // Vertical spacing threshold
  verticalJumpFactor?: number; // Default: 2

  // Short line threshold
  widthTolerance?: number; // Default: 0.85 (85%)
};
Reference: src/types.ts:142
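Since both fields are optional, callers can override one and leave the other at its default. A minimal sketch of that resolution, using the defaults documented above, could look like this. The `resolveParagraphOptions` helper and `DEFAULT_PARAGRAPH_OPTIONS` constant are hypothetical names; the library's own resolution step is not shown on this page.

```typescript
type ParagraphOptions = {
  verticalJumpFactor?: number;
  widthTolerance?: number;
};

// Defaults taken from the comments in the type above.
const DEFAULT_PARAGRAPH_OPTIONS: Required<ParagraphOptions> = {
  verticalJumpFactor: 2,
  widthTolerance: 0.85,
};

// Hypothetical helper: fills in any missing field with its documented default.
const resolveParagraphOptions = (options: ParagraphOptions = {}): Required<ParagraphOptions> => ({
  verticalJumpFactor: options.verticalJumpFactor ?? DEFAULT_PARAGRAPH_OPTIONS.verticalJumpFactor,
  widthTolerance: options.widthTolerance ?? DEFAULT_PARAGRAPH_OPTIONS.widthTolerance,
});

const partial = resolveParagraphOptions({ widthTolerance: 0.9 });
const explicitUndefined = resolveParagraphOptions({ verticalJumpFactor: undefined });

console.log(partial, explicitUndefined);
```

Using `??` per field rather than an object spread means an explicitly passed `undefined` still falls back to the default.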
Complete Example
import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs(
  {
    observations: [
      { bbox: { x: 100, y: 0, width: 200, height: 20 }, text: "First" },
      { bbox: { x: 310, y: 0, width: 200, height: 20 }, text: "line" },
      // ... more observations
    ],
    page: {
      width: 800,
      height: 1200,
      dpiX: 300,
      dpiY: 300,
    },
    layout: {
      horizontalLines: [],
      rectangles: [],
    },
  },
  {
    line: {
      isRTL: true,
      poetryDetectionOptions: {
        minWordCount: 2,
      },
    },
    paragraph: {
      verticalJumpFactor: 2,
      widthTolerance: 0.85,
    },
    format: {
      footerSymbol: '---',
    },
  }
);

console.log(result.text);
// First line
// Second paragraph...
Next Steps
TextBlock Type: Learn about metadata and properties
Poetry Detection: Understand poetry identification algorithms
RTL Support: Explore right-to-left text handling
API Reference: View complete API documentation