Poetry Detection

Overview

Kokokor uses layout heuristics to identify poetic content in OCR output, distinguishing poetry from prose by visual layout, word density, and structural patterns.

Poetry detection is crucial for preserving the artistic and structural integrity of verses, where line breaks carry semantic meaning.

Why Poetry Detection Matters

Poetry requires different handling than prose:

Line Breaks

Each line must remain separate (not merged into paragraphs)

Visual Layout

Centering and spacing are semantically meaningful

Hemistichs

Two-part verses common in Arabic/Persian poetry

Word Density

Poetry typically has more spacing than prose

Detection Strategy

The algorithm uses multiple coordinated heuristics:

// Internal detection algorithm (not exported)
// Kokokor uses this logic internally within mapObservationsToTextLines

function detectPoetryInGroup(
  group: Observation[],
  imageWidth: number,
  avgProseWordDensity: number,
  options: PoetryDetectionOptions
) {
  // Wide poetic line (Heuristic 2 below): a single observation spanning most of the page.
  // Skipped when minWidthRatioForMerged is null.
  if (group.length === 1 && options.minWidthRatioForMerged !== null) {
    return isWidePoeticLine(
      group[0],
      imageWidth,
      avgProseWordDensity,
      options
    );
  }

  // Paired hemistichs (Heuristic 1 below): two observations in the same group
  if (group.length === 2) {
    return isPoetryPair(group[0], group[1], imageWidth, options);
  }

  return false;
}

These algorithms are internal to Kokokor. Configure behavior via poetryDetectionOptions in mapObservationsToTextLines.

Heuristic 1: Paired Hemistichs

Concept

Traditional poetry (especially Arabic and Persian) often splits verses into two balanced parts called hemistichs:

صدر البيت (first hemistich)    عجز البيت (second hemistich)

Detection Criteria

Two observations form a poetry pair when they meet ALL conditions:

Minimum Word Count

Both hemistichs must have at least minWordCount words (default: 2)

const words1 = getWordCount(obs1.text);
const words2 = getWordCount(obs2.text);

if (words1 < minWordCount || words2 < minWordCount) {
  return false;
}
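The snippet above calls getWordCount, whose implementation is not shown on this page. A minimal sketch, assuming a word is any whitespace-separated token (the real helper may treat diacritics or punctuation differently; meetsMinWordCount is an illustrative name, not a Kokokor export):

```typescript
// Hypothetical stand-in for Kokokor's internal getWordCount helper.
// Assumption: a "word" is any run of non-whitespace characters.
function getWordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Both hemistichs must reach the minimum before any other check runs.
function meetsMinWordCount(a: string, b: string, minWordCount = 2): boolean {
  return getWordCount(a) >= minWordCount && getWordCount(b) >= minWordCount;
}

const pairOk = meetsMinWordCount("في البدء كانت الكلمة", "والكلمة عند الله"); // true (4 and 3 words)
const tooShort = meetsMinWordCount("قال", "والكلمة عند الله"); // false (1 word on the first side)
```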

Compatible Widths

Widths must be similar within tolerance (default: 40%)

const avgWidth = (obs1.bbox.width + obs2.bbox.width) / 2;
const widthDiffRatio = Math.abs(obs1.bbox.width - obs2.bbox.width) / avgWidth;

return widthDiffRatio < pairWidthSimilarityRatio; // Default: 0.4

Reference: src/utils/poetry.ts:67
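To make the width tolerance concrete, the same formula can be wrapped in a small function and tried on sample widths. This is a sketch only; widthsAreSimilar is an illustrative name, not part of Kokokor's API:

```typescript
// Relative width difference, measured against the average of the two widths.
function widthsAreSimilar(
  width1: number,
  width2: number,
  pairWidthSimilarityRatio = 0.4
): boolean {
  const avgWidth = (width1 + width2) / 2;
  const widthDiffRatio = Math.abs(width1 - width2) / avgWidth;
  return widthDiffRatio < pairWidthSimilarityRatio;
}

const balanced = widthsAreSimilar(220, 210); // 10 / 215 ≈ 0.047, so true
const lopsided = widthsAreSimilar(300, 150); // 150 / 225 ≈ 0.667, so false
```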

Compatible Word Counts

Word counts must be similar within tolerance (default: 50%)

const maxWords = Math.max(words1, words2);
const wordCountDiffRatio = Math.abs(words1 - words2) / maxWords;

return wordCountDiffRatio < pairWordCountSimilarityRatio; // Default: 0.5

Reference: src/utils/poetry.ts:74
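Because the comparison is strict and divides by the larger count, a pair where one side has exactly half the words of the other is rejected at the default ratio. A short sketch of that boundary (wordCountsAreSimilar is an illustrative name):

```typescript
// Relative word-count difference, measured against the larger count.
function wordCountsAreSimilar(
  words1: number,
  words2: number,
  pairWordCountSimilarityRatio = 0.5
): boolean {
  const maxWords = Math.max(words1, words2);
  const wordCountDiffRatio = Math.abs(words1 - words2) / maxWords;
  return wordCountDiffRatio < pairWordCountSimilarityRatio;
}

const close = wordCountsAreSimilar(4, 3); // 1 / 4 = 0.25, so true
const boundary = wordCountsAreSimilar(4, 2); // 2 / 4 = 0.5, not below 0.5, so false
```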

Compatible Vertical Gap

Vertical distance must be within tolerance (default: 200% of height)

const centerY1 = obs1.bbox.y + obs1.bbox.height / 2;
const centerY2 = obs2.bbox.y + obs2.bbox.height / 2;
const dy = Math.abs(centerY1 - centerY2);
const avgHeight = (obs1.bbox.height + obs2.bbox.height) / 2;

return dy <= maxVerticalGapRatio * avgHeight; // Default: 2.0

Reference: src/utils/poetry.ts:81
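The vertical check compares centre lines rather than top edges, so small baseline jitter between the two halves is tolerated. A self-contained sketch with sample boxes (the Bbox type mirrors the fields used above; the function name is illustrative):

```typescript
type Bbox = { x: number; y: number; width: number; height: number };

// True when the vertical distance between box centres is at most
// maxVerticalGapRatio times the average box height.
function verticalGapIsCompatible(
  a: Bbox,
  b: Bbox,
  maxVerticalGapRatio = 2.0
): boolean {
  const centerYa = a.y + a.height / 2;
  const centerYb = b.y + b.height / 2;
  const dy = Math.abs(centerYa - centerYb);
  const avgHeight = (a.height + b.height) / 2;
  return dy <= maxVerticalGapRatio * avgHeight;
}

const first: Bbox = { x: 150, y: 200, width: 220, height: 18 };
const sameLine = verticalGapIsCompatible(first, { x: 430, y: 204, width: 210, height: 18 }); // dy = 4, limit 36: true
const farBelow = verticalGapIsCompatible(first, { x: 430, y: 260, width: 210, height: 18 }); // dy = 60, limit 36: false
```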

Combined Centering

When combined, the hemistichs must be centered on the page

const combinedBbox = {
  x: Math.min(obs1.bbox.x, obs2.bbox.x),
  width: Math.max(
    obs1.bbox.x + obs1.bbox.width,
    obs2.bbox.x + obs2.bbox.width
  ) - Math.min(obs1.bbox.x, obs2.bbox.x),
  // ... height and y
};

return textIsCentered(
  combinedBbox,
  imageWidth,
  centeringOptions
);

Reference: src/utils/poetry.ts:276
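The implementation of textIsCentered is not shown on this page, and the snippet above elides the y and height of the combined box. The sketch below fills both in under stated assumptions: the combined box is taken as the union of the two boxes, and "centered" is taken to mean that the box centre lies within centerToleranceRatio of the page centre with both side margins at least minMarginRatio of the page width. Kokokor's actual logic may differ.

```typescript
type Bbox = { x: number; y: number; width: number; height: number };
type CenteringOptions = { centerToleranceRatio: number; minMarginRatio: number };

// Assumption: the combined box is the union of both boxes.
function combineBboxes(a: Bbox, b: Bbox): Bbox {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  const right = Math.max(a.x + a.width, b.x + b.width);
  const bottom = Math.max(a.y + a.height, b.y + b.height);
  return { x, y, width: right - x, height: bottom - y };
}

// Hypothetical stand-in for the internal textIsCentered helper.
function textIsCentered(bbox: Bbox, imageWidth: number, opts: CenteringOptions): boolean {
  const offset = Math.abs(bbox.x + bbox.width / 2 - imageWidth / 2);
  const leftMargin = bbox.x;
  const rightMargin = imageWidth - (bbox.x + bbox.width);
  return (
    offset <= imageWidth * opts.centerToleranceRatio &&
    leftMargin >= imageWidth * opts.minMarginRatio &&
    rightMargin >= imageWidth * opts.minMarginRatio
  );
}

const combined = combineBboxes(
  { x: 150, y: 200, width: 220, height: 18 },
  { x: 430, y: 200, width: 210, height: 18 }
); // x = 150, width = 490
const centered = textIsCentered(combined, 800, {
  centerToleranceRatio: 0.05,
  minMarginRatio: 0.1,
}); // centre 395 vs 400, margins 150 and 160: true
```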

Adaptive Centering

For hemistichs with significant gaps (visual separation), centering tolerance is relaxed:

const hasSignificantGap = gap > imageWidth * 0.07 || gap > avgWidth * 0.15;

if (hasSignificantGap) {
  return {
    centerToleranceRatio: (options.centerToleranceRatio ?? 0.05) * 2.5,
    minMarginRatio: (options.minMarginRatio ?? 0.1) * 0.75,
  };
}

Reference: src/utils/poetry.ts:116
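With the defaults, the relaxed values work out to a centre tolerance of 12.5% and a minimum margin of about 7.5%. The sketch below wraps the fragment above in a function; what happens when the gap is not significant is not shown on this page, so returning the base values unchanged is an assumption, as is the function name.

```typescript
type CenteringOptions = { centerToleranceRatio?: number; minMarginRatio?: number };

function adaptCenteringOptions(
  gap: number,
  avgWidth: number,
  imageWidth: number,
  options: CenteringOptions
): Required<CenteringOptions> {
  const base = {
    centerToleranceRatio: options.centerToleranceRatio ?? 0.05,
    minMarginRatio: options.minMarginRatio ?? 0.1,
  };
  const hasSignificantGap = gap > imageWidth * 0.07 || gap > avgWidth * 0.15;
  if (!hasSignificantGap) {
    return base; // Assumption: no relaxation for tightly spaced pairs
  }
  return {
    centerToleranceRatio: base.centerToleranceRatio * 2.5,
    minMarginRatio: base.minMarginRatio * 0.75,
  };
}

// 60px gap on an 800px page: 60 > 56, so the tolerances are relaxed.
const relaxed = adaptCenteringOptions(60, 215, 800, {});
// 10px gap: below both 56px and 32.25px, so the base values are kept.
const unchanged = adaptCenteringOptions(10, 215, 800, {});
```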

Asymmetry Detection

Rejects pairs with asymmetric sparse gaps (likely not poetry):

const pageCenter = imageWidth / 2;
const innerLeft = leftObs.bbox.x + leftObs.bbox.width;
const innerRight = rightObs.bbox.x;
const leftDelta = Math.abs(pageCenter - innerLeft);
const rightDelta = Math.abs(innerRight - pageCenter);
const asymmetry = Math.abs(leftDelta - rightDelta);
const isVerySparsePair = gap > avgWidth * 2;

return isVerySparsePair && asymmetry > imageWidth * 0.12;

Reference: src/utils/poetry.ts:98
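The fragment above uses gap and avgWidth without defining them. The sketch below assumes gap is the horizontal distance between the inner edges of the two boxes and avgWidth is the mean of their widths, then runs the check on three layouts. The function name is illustrative.

```typescript
type Bbox = { x: number; y: number; width: number; height: number };

// Returns true when the pair should be rejected: the halves are very far
// apart AND their inner edges sit unevenly around the page centre.
function hasAsymmetricSparseGap(leftObs: Bbox, rightObs: Bbox, imageWidth: number): boolean {
  const innerLeft = leftObs.x + leftObs.width;
  const innerRight = rightObs.x;
  const gap = innerRight - innerLeft; // Assumption: gap between inner edges
  const avgWidth = (leftObs.width + rightObs.width) / 2;

  const pageCenter = imageWidth / 2;
  const leftDelta = Math.abs(pageCenter - innerLeft);
  const rightDelta = Math.abs(innerRight - pageCenter);
  const asymmetry = Math.abs(leftDelta - rightDelta);
  const isVerySparsePair = gap > avgWidth * 2;

  return isVerySparsePair && asymmetry > imageWidth * 0.12;
}

const box = (x: number, width: number): Bbox => ({ x, y: 200, width, height: 18 });

const densePair = hasAsymmetricSparseGap(box(150, 220), box(430, 210), 800); // gap 60 is not sparse: false
const sparseSymmetric = hasAsymmetricSparseGap(box(100, 60), box(640, 60), 800); // deltas 240 and 240: false
const sparseLopsided = hasAsymmetricSparseGap(box(50, 60), box(500, 60), 800); // deltas 290 and 100: true
```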

Heuristic 2: Wide Poetic Lines

Concept

Some poetry appears as single wide lines rather than split hemistichs. These are identified by comparing to prose characteristics.

Detection Criteria

Minimum Word Count

Must have at least minWordCount words (default: 2)

No Prose Punctuation

Filters out prose that might otherwise match

const PROSE_PUNCTUATION_PATTERN = /[،,؛;؟?۔.:()]/;

if (PROSE_PUNCTUATION_PATTERN.test(obs.text)) {
  return false; // Likely prose
}

Reference: src/utils/constants.ts:73
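The pattern covers both Latin and Arabic-script punctuation, so a single comma, Arabic comma, or full stop is enough to disqualify a line. A quick check against sample strings (containsProsePunctuation is an illustrative wrapper, not a Kokokor export):

```typescript
// Same character class as shown above; inside [...] the characters
// ".", "?", "(" and ")" are literals and need no escaping.
const PROSE_PUNCTUATION_PATTERN = /[،,؛;؟?۔.:()]/;

function containsProsePunctuation(text: string): boolean {
  return PROSE_PUNCTUATION_PATTERN.test(text);
}

const verse = containsProsePunctuation("يا ليل الصب متى غده"); // false
const latinProse = containsProsePunctuation("This is a regular paragraph of text, with commas."); // true
const arabicProse = containsProsePunctuation("قال الراوي، ثم مضى"); // true (Arabic comma)
```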

Centered on Page

Must be centered with adequate margins

if (!textIsCentered(obs.bbox, imageWidth, options)) {
  return false;
}

Poetry-Like Density

Word density must be lower than average prose

const obsDensity = wordCount / obs.bbox.width;
const densityRatio = obsDensity / avgProseWordDensity;

// Threshold varies by line width
const widthRatio = obs.bbox.width / imageWidth;
const requiredDensityRatio = widthRatio > 0.75
  ? wordDensityComparisonRatio * 0.95  // Very wide lines: just under the configured ratio
  : 0.5;                                // Other lines: fixed cutoff at half of prose density

return densityRatio < requiredDensityRatio;

Reference: src/utils/poetry.ts:142
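Since the line passes only when its density ratio is below the threshold, the fixed 0.5 cutoff is the harder one to meet: a line at 70% of the page width needs less than half the prose density, while a line above 75% only needs to stay under roughly 0.9 with the default settings. A sketch with sample numbers (function names are illustrative):

```typescript
function requiredDensityRatio(widthRatio: number, wordDensityComparisonRatio = 0.95): number {
  return widthRatio > 0.75 ? wordDensityComparisonRatio * 0.95 : 0.5;
}

function hasPoetryLikeDensity(
  wordCount: number,
  width: number,
  imageWidth: number,
  avgProseWordDensity: number,
  wordDensityComparisonRatio = 0.95
): boolean {
  const densityRatio = wordCount / width / avgProseWordDensity;
  const threshold = requiredDensityRatio(width / imageWidth, wordDensityComparisonRatio);
  return densityRatio < threshold;
}

// 800px page, prose baseline of 0.015 words per pixel.
const veryWide = hasPoetryLikeDensity(5, 620, 800, 0.015); // ratio ≈ 0.54, threshold ≈ 0.9025: true
const mediumDense = hasPoetryLikeDensity(5, 560, 800, 0.015); // ratio ≈ 0.60, threshold 0.5: false
const mediumSparse = hasPoetryLikeDensity(3, 560, 800, 0.015); // ratio ≈ 0.36, threshold 0.5: true
```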

Minimum Width Check

Only lines spanning significant page width are considered:

if (obs.bbox.width <= imageWidth * minWidthRatioForMerged) {
  return false; // Too narrow
}
// Default minWidthRatioForMerged: 0.6 (60% of page width)

Reference: src/utils/poetry.ts:151

Prose Density Baseline

The wide-line heuristic compares each candidate against the average prose word density of the observations, which is calculated once as a baseline:

// Internal function: calculates baseline word density
// This is done automatically by mapObservationsToTextLines

function calculateProseDensityBaseline(
  observations: Observation[],
  imageWidth: number,
  options: PoetryDetectionOptions
): number {
  let totalWords = 0;
  let totalWidth = 0;

  for (const obs of observations) {
    const wordCount = getWordCount(obs.text);

    // Identify likely prose (not centered, wide, moderate word count)
    const isLikelyProse =
      !textIsCentered(obs.bbox, imageWidth, options) &&
      obs.bbox.width > imageWidth * 0.4 &&
      wordCount >= options.minWordCount &&
      wordCount <= MAX_PROSE_WORD_COUNT; // Default: 25

    if (isLikelyProse) {
      totalWords += wordCount;
      totalWidth += obs.bbox.width;
    }
  }

  return totalWords / totalWidth; // Words per pixel
}

Prose Identification

Prose is identified by:

Not centered (left-aligned text)
Width > 40% of page width
Word count between minimum and maximum (2-25 words)
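Note that the baseline is a pooled ratio (total words over total width), not the mean of per-line densities, so long lines weigh more than short ones. The sketch below contrasts the two on made-up line measurements; the zero-width guard is an addition for the sketch, and both function names are illustrative.

```typescript
type ProseLine = { wordCount: number; width: number };

// Pooled density: total words divided by total width, as in the baseline above.
function pooledDensity(lines: ProseLine[]): number {
  let totalWords = 0;
  let totalWidth = 0;
  for (const line of lines) {
    totalWords += line.wordCount;
    totalWidth += line.width;
  }
  // Assumption: fall back to 0 when no line qualified, to avoid 0 / 0.
  return totalWidth > 0 ? totalWords / totalWidth : 0;
}

// For comparison only: the unweighted mean of per-line densities.
function meanOfPerLineDensities(lines: ProseLine[]): number {
  if (lines.length === 0) return 0;
  const sum = lines.reduce((acc, line) => acc + line.wordCount / line.width, 0);
  return sum / lines.length;
}

// Hypothetical prose lines on an 800px page (all wider than 40% of the page).
const proseLines: ProseLine[] = [
  { wordCount: 12, width: 700 },
  { wordCount: 3, width: 330 },
  { wordCount: 14, width: 720 },
];

const pooled = pooledDensity(proseLines); // 29 / 1750 ≈ 0.0166 words per pixel
const averaged = meanOfPerLineDensities(proseLines); // ≈ 0.0152 words per pixel
```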

Configuration Options

type PoetryDetectionOptions = {
  // Centering detection
  centerToleranceRatio: number;       // Default: 0.05 (5%)
  minMarginRatio: number;              // Default: 0.1 (10%)

  // Paired hemistichs
  maxVerticalGapRatio: number;         // Default: 2.0 (200%)
  pairWidthSimilarityRatio: number;    // Default: 0.4 (40%)
  pairWordCountSimilarityRatio: number; // Default: 0.5 (50%)

  // Wide poetic lines
  minWidthRatioForMerged: number | null; // Default: 0.6 (60%)
  wordDensityComparisonRatio: number;    // Default: 0.95 (95%)

  // General
  minWordCount: number;                  // Default: 2
};

Reference: src/types.ts:283
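For experimenting with these values outside the library, the documented defaults can be collected into one object and overlaid with partial overrides. This is a sketch only: DEFAULT_POETRY_OPTIONS and withDefaults are hypothetical names, and how Kokokor merges user options internally is not described on this page.

```typescript
type PoetryDetectionOptions = {
  centerToleranceRatio: number;
  minMarginRatio: number;
  maxVerticalGapRatio: number;
  pairWidthSimilarityRatio: number;
  pairWordCountSimilarityRatio: number;
  minWidthRatioForMerged: number | null;
  wordDensityComparisonRatio: number;
  minWordCount: number;
};

// Defaults as listed in the type comments above.
const DEFAULT_POETRY_OPTIONS: PoetryDetectionOptions = {
  centerToleranceRatio: 0.05,
  minMarginRatio: 0.1,
  maxVerticalGapRatio: 2.0,
  pairWidthSimilarityRatio: 0.4,
  pairWordCountSimilarityRatio: 0.5,
  minWidthRatioForMerged: 0.6,
  wordDensityComparisonRatio: 0.95,
  minWordCount: 2,
};

// Later spread wins, so any key present in overrides replaces the default.
function withDefaults(overrides: Partial<PoetryDetectionOptions> = {}): PoetryDetectionOptions {
  return { ...DEFAULT_POETRY_OPTIONS, ...overrides };
}

const stricter = withDefaults({ minWordCount: 3, centerToleranceRatio: 0.03 });
const pairsOnly = withDefaults({ minWidthRatioForMerged: null }); // null disables wide-line detection
```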

Real-World Examples

Example 1: Arabic Poetry Pair (Hemistichs)

Input Observations:

[
  {
    bbox: { x: 150, y: 200, width: 220, height: 18 },
    text: "في البدء كانت الكلمة"  // 4 words
  },
  {
    bbox: { x: 430, y: 200, width: 210, height: 18 },
    text: "والكلمة عند الله"      // 3 words
  }
]

Analysis:

✓ Word counts: 4 and 3 (difference ratio 1/4 = 0.25, within 50% tolerance)
✓ Widths: 220px and 210px (difference ratio ≈ 0.05, within 40% tolerance)
✓ Vertical gap: 0px (same Y coordinate)
✓ Combined width: 490px starting at x=150
✓ Combined center: (150 + 640) / 2 = 395px
✓ Page center: 400px for an 800px-wide page (5px off, within 5% tolerance)

Result: isPoetic = true

Output:

في البدء كانت الكلمة والكلمة عند الله

Example 2: Wide Poetic Line

Input Observation:

{
  bbox: { x: 100, y: 150, width: 600, height: 20 },
  text: "يا ليل الصب متى غده"  // 5 words
}

Page Width: 800px
Avg Prose Density: 0.02 words/pixel

Analysis:

✓ Word count: 5 (>= 2)
✓ No prose punctuation
✓ Width: 600px (75% of page, above the 60% minimum)
✓ Centered: x=100, width=600, center=400 vs page center=400
✓ Density: 5/600 = 0.0083 words/pixel
✓ Density ratio: 0.0083/0.02 = 0.42 (< 0.5, the cutoff for lines no wider than 75% of the page)

Result: isPoetic = true

Example 3: Prose (Not Poetry)

Input Observation:

{
  bbox: { x: 50, y: 300, width: 700, height: 20 },
  text: "This is a regular paragraph of text, with commas and punctuation."
}

Analysis:

✗ Contains prose punctuation (comma, period)
✗ High word density (prose-like)
✗ Fails the centering check: the 50px left margin is below the 10% minimum (at least 75px for a page this wide)

Result: isPoetic = false

Custom Configuration Example

import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs(
  { observations, page, layout },
  {
    line: {
      poetryDetectionOptions: {
        // Stricter centering for poetry
        centerToleranceRatio: 0.03,    // 3% instead of 5%
        minMarginRatio: 0.15,           // 15% instead of 10%

        // More lenient hemistich matching
        pairWidthSimilarityRatio: 0.5,  // 50% instead of 40%
        pairWordCountSimilarityRatio: 0.6, // 60% instead of 50%

        // Require minimum 3 words
        minWordCount: 3,

        // Disable wide poetic line detection
        minWidthRatioForMerged: null,

        // Lower density threshold (has no effect while wide-line detection is disabled above)
        wordDensityComparisonRatio: 0.85, // 85% instead of 95%
      },
      poetryPairDelimiter: ' ... ',  // Custom separator
    },
  }
);

Integration with Pipeline

Poetry detection runs during Stage 1 (Observations → Text Lines):

const avgProseWordDensity = calculateProseDensityBaseline(
  observations,
  page.width,
  options.poetryDetectionOptions
);

for (const group of groups) {
  if (groupMatchesPoetryCriteria(
    group,
    page.width,
    avgProseWordDensity,
    options.poetryDetectionOptions
  )) {
    for (const observation of group) {
      observation.isPoetic = true;
    }
  }
}

Reference: src/utils/paragraphs.ts:159

Disabling Poetry Detection

To disable poetry detection entirely:

const result = reconstructParagraphs(
  { observations, page, layout },
  {
    line: {
      poetryDetectionOptions: undefined, // Disable detection
    },
  }
);

Performance Considerations

Poetry detection runs once per document during line grouping. The prose density calculation is O(n) where n is the number of observations.

Optimizations:

Prose density calculated once, reused for all groups
Early rejection based on word count (cheapest check)
Width and word count checks before expensive centering calculations

Next Steps

TextBlock Metadata

Learn about the isPoetic flag

Processing Pipeline

See where poetry detection fits

Configuration

Explore all configuration options

RTL Support

Poetry detection for RTL languages
