CLI

1. Install the CLI

Install the lit command globally:
npm install -g @llamaindex/liteparse
Verify the install:
lit --version
2. Parse your first document

Pass a PDF (or a file in another supported document format) to lit parse:
lit parse document.pdf
Output is printed to stdout as plain text with the spatial layout preserved. To save it to a file:
lit parse document.pdf -o output.txt
Pipe a remote PDF straight into lit parse without saving it to disk first. The trailing - stands in for the file path, so the document is read from stdin:
curl -sL https://example.com/report.pdf | lit parse -
3. Try JSON output

Use --format json to get structured output with per-page text items and bounding boxes:
lit parse document.pdf --format json -o output.json
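Once output.json exists, a short script can inspect it. The sketch below assumes the file has the same shape as the library's result.json, that is, a top-level pages array whose entries carry page and textItems. That shape is an assumption for the CLI output, so check it against a real file first. summarizePages is a hypothetical helper, and the sample data is made up.

```typescript
// Assumed shape of the CLI's JSON output, mirroring the library's result.json.
// Verify against a real output.json before relying on it.
interface PageSummaryInput {
  page: number;
  textItems: { text: string }[];
}

// Hypothetical helper: returns one "Page N: M text items" line per page.
function summarizePages(jsonText: string): string[] {
  const parsed = JSON.parse(jsonText) as { pages?: PageSummaryInput[] };
  const pages = parsed.pages ?? [];
  return pages.map((p) => `Page ${p.page}: ${p.textItems.length} text items`);
}

// Made-up sample standing in for the contents of output.json.
const sampleJson = JSON.stringify({
  pages: [
    { page: 1, textItems: [{ text: 'Hello' }, { text: 'world' }] },
    { page: 2, textItems: [] },
  ],
});

for (const line of summarizePages(sampleJson)) {
  console.log(line);
}
```

To run this against a real file, read output.json as a UTF-8 string (for example with readFile from fs/promises) and pass that string to summarizePages.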

Library

1. Install the package

Add LiteParse as a dependency:
npm install @llamaindex/liteparse
2. Parse a file

Import LiteParse and call parse() with a file path:
import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);
result.text contains the full document text with spatial layout preserved across all pages. Per-page data is available in result.pages.
3. Parse from a Buffer or Uint8Array

You can pass raw bytes instead of a file path. PDF bytes go straight to the parser with zero disk I/O; non-PDF bytes are written to a temp file for format conversion.
import { LiteParse } from '@llamaindex/liteparse';
import { readFile } from 'fs/promises';

const parser = new LiteParse();

const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);
console.log(result.text);
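If you want to predict which of the two paths your bytes are likely to take, you can check for the PDF header yourself. PDF files start with the ASCII bytes %PDF-. The sketch below is not part of LiteParse, and how LiteParse itself tells PDF from non-PDF input is not described here; looksLikePdf is a hypothetical helper that only inspects the first five bytes.

```typescript
// PDF files begin with the ASCII bytes "%PDF-".
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper, not a LiteParse API: true if the bytes start with "%PDF-".
// It checks offset 0 only, so a file with leading junk before the header is rejected.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((expected, i) => bytes[i] === expected);
}

// Made-up inputs: a PDF-style header and a ZIP-style header (as used by .docx files).
const pdfLike = new TextEncoder().encode('%PDF-1.7\n');
const zipLike = new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00]);

console.log(looksLikePdf(pdfLike));
console.log(looksLikePdf(zipLike));
```

A check like this is only useful for logging or for deciding whether temp-file access matters in your environment; parse() accepts the bytes either way.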
4. Get structured JSON output

Set outputFormat: 'json' to receive per-page text items with coordinates and font metadata:
import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ outputFormat: 'json' });
const result = await parser.parse('document.pdf');

for (const page of result.json.pages) {
  console.log(`Page ${page.page}: ${page.textItems.length} text items`);
  for (const item of page.textItems) {
    console.log(`  "${item.text}" at (${item.x}, ${item.y})`);
  }
}
Each textItem includes text, x, y, width, height, fontName, and fontSize.
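Because every item carries coordinates, you can rebuild reading order yourself. The following is a minimal sketch, not a LiteParse feature: it declares a local TextItem type with the fields listed above (the string and number types are assumptions), then groups items whose y values fall within a small tolerance into one line and orders each line by x. groupIntoLines, the tolerance value, and the sample items are all illustrative.

```typescript
// Local type with the fields listed above; the field types are assumed.
interface TextItem {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
  fontName: string;
  fontSize: number;
}

// Hypothetical helper: groups items with nearly equal y into lines,
// then joins each line's items from left to right.
function groupIntoLines(items: TextItem[], tolerance = 2): string[] {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: TextItem[][] = [];
  for (const item of sorted) {
    const current = lines[lines.length - 1];
    // Compare against the first item of the current line to avoid drift.
    if (current && Math.abs(item.y - current[0].y) <= tolerance) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }
  return lines.map((line) =>
    [...line].sort((a, b) => a.x - b.x).map((i) => i.text).join(' ')
  );
}

// Made-up items; only text, x, and y matter for the grouping.
const makeItem = (text: string, x: number, y: number): TextItem => ({
  text, x, y, width: 40, height: 10, fontName: 'Helvetica', fontSize: 10,
});

const sampleItems = [
  makeItem('world', 60, 100.5),
  makeItem('Hello', 10, 100),
  makeItem('Second', 10, 120),
];

console.log(groupIntoLines(sampleItems));
```

With real output you would pass page.textItems for each page. The right tolerance depends on the document's font sizes, so treat the default of 2 as a starting point.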
OCR is enabled by default and uses the built-in Tesseract.js engine, so no setup is required. On the first run, Tesseract.js downloads language data from the internet. For offline use, set TESSDATA_PREFIX to a directory containing pre-downloaded .traineddata files.
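Before going offline, it can help to confirm that the directory actually holds the language files you need. Tesseract language data is conventionally named after the language code, for example eng.traineddata. The sketch below is a hypothetical pre-flight check, not a LiteParse API; which languages your setup requests is an assumption, and eng and deu are only examples.

```typescript
// Hypothetical helper: given the file names found in the TESSDATA_PREFIX
// directory and the language codes you plan to use, return the codes
// that have no matching <lang>.traineddata file.
function missingTrainedData(fileNames: string[], languages: string[]): string[] {
  const present = new Set(fileNames);
  return languages.filter((lang) => !present.has(`${lang}.traineddata`));
}

// Made-up directory listing for illustration.
const tessdataFiles = ['eng.traineddata', 'README.md'];
console.log(missingTrainedData(tessdataFiles, ['eng', 'deu']));

// Usage sketch in Node.js (directory listing read from the real path):
//   import { readdirSync } from 'node:fs';
//   const dir = process.env.TESSDATA_PREFIX ?? '.';
//   const missing = missingTrainedData(readdirSync(dir), ['eng']);
```

An empty result means every requested language has a file in the directory.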

Next steps

Library usage

Explore the full LiteParse API: configuration options, OCR setup, screenshot generation, and more.

CLI reference

Full reference for lit parse, lit batch-parse, and lit screenshot commands and all their flags.
