The export:text command reads a PDF and writes its text content to a file. By default it produces a .txt file next to the input PDF, but it can also output HTML or Markdown, write to stdout, or append to an existing file. It handles password-protected documents and embedded PDFs, and the -rotationMagic option lets it extract rotated or skewed text.

Usage

java -jar pdfbox-app-3.0.0.jar export:text -i <input.pdf> [options]

Options

| Option | Default | Description |
| --- | --- | --- |
| -i, --input | (required) | Path to the input PDF file |
| -o, --output | (auto) | Path for the output file; defaults to the input filename with .txt/.html/.md |
| -password | (none) | Password to open an encrypted PDF or certificate keystore |
| -encoding | UTF-8 | Output character encoding (e.g. ISO-8859-1, UTF-16BE) |
| -startPage | 1 | First page to extract (1-based) |
| -endPage | (all) | Last page to extract (1-based, inclusive) |
| -html | false | Output HTML instead of plain text (forces UTF-8 encoding) |
| -md | false | Output Markdown instead of plain text |
| -sort | false | Sort text by position before writing |
| -ignoreBeads | false | Disable bead-based text separation |
| -rotationMagic | false | Detect and handle rotated/skewed text per page (slower; ignored with -html) |
| -alwaysNext | false | Continue to the next page even if an IOException occurs (ignored with -html) |
| -console | false | Write output to stdout instead of a file |
| -addFileName | false | Prepend the PDF filename to the output text |
| -append | false | Open the output file in append mode |
| -debug | false | Print timing information for each processing stage to stderr |

-html and -md are mutually exclusive. -html always uses UTF-8 regardless of the -encoding value. -encoding is ignored when -console is set.
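
The interaction rules above can be checked before a long batch run. The following is a minimal sketch of a hypothetical helper, check_export_text_opts, which is not part of PDFBox. It only scans the argument list for the flags named in the options table and reports the documented conflicts. It assumes no option value (such as a password) happens to look like one of those flags.

```shell
# Hypothetical pre-flight check (not part of PDFBox): scans an export:text
# argument list for the flag interactions described in the options table.
# Variables are global because POSIX sh has no "local".
check_export_text_opts() {
    html=0; md=0; console=0; encoding=0; rotation=0; always_next=0
    for arg in "$@"; do
        case "$arg" in
            -html)          html=1 ;;
            -md)            md=1 ;;
            -console)       console=1 ;;
            -encoding)      encoding=1 ;;
            -rotationMagic) rotation=1 ;;
            -alwaysNext)    always_next=1 ;;
        esac
    done

    # -html and -md cannot be combined.
    if [ "$html" -eq 1 ] && [ "$md" -eq 1 ]; then
        echo "error: -html and -md are mutually exclusive" >&2
        return 1
    fi

    # Options that are documented as having no effect with -html.
    if [ "$html" -eq 1 ]; then
        if [ "$encoding" -eq 1 ]; then
            echo "warning: -encoding is ignored with -html (UTF-8 is always used)" >&2
        fi
        if [ "$rotation" -eq 1 ]; then
            echo "warning: -rotationMagic is ignored with -html" >&2
        fi
        if [ "$always_next" -eq 1 ]; then
            echo "warning: -alwaysNext is ignored with -html" >&2
        fi
    fi

    # -encoding has no effect when writing to stdout.
    if [ "$console" -eq 1 ] && [ "$encoding" -eq 1 ]; then
        echo "warning: -encoding is ignored with -console" >&2
    fi
    return 0
}
```

One way to use it is as a guard in a wrapper script: call check_export_text_opts "$@" first, and run java -jar pdfbox-app-3.0.0.jar export:text "$@" only if it returns 0.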

Examples

Extract all text from a PDF to a .txt file alongside the source:
java -jar pdfbox-app-3.0.0.jar export:text -i report.pdf
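
When scripting around the default behavior, it helps to know where the output will land. The sketch below uses a hypothetical function, default_output, to predict the path. It assumes the tool swaps a trailing .pdf extension for .txt, .html, or .md depending on the selected format. Treat that as an approximation of the documented default, not the exact PDFBox logic.

```shell
# Hypothetical helper (not part of PDFBox): predicts the default output path.
# Assumption: the trailing .pdf extension (any letter case) is replaced by the
# extension of the chosen format; other input names just get it appended.
default_output() {
    input=$1
    format=${2:-txt}          # txt (default), html, or md
    case "$format" in
        txt|html|md) ;;
        *) echo "unknown format: $format" >&2; return 1 ;;
    esac
    printf '%s.%s\n' "${input%.[Pp][Dd][Ff]}" "$format"
}

default_output report.pdf            # prints report.txt
default_output docs/report.pdf html  # prints docs/report.html
```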
Extract pages 3 through 7 with position-sorted text, saved to a specific output file:
java -jar pdfbox-app-3.0.0.jar export:text -i report.pdf -o pages3-7.txt \
  -startPage 3 -endPage 7 -sort
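
The same two options can split a long document into several files. The sketch below uses a hypothetical function, print_page_chunks, that only prints one export:text command per page range and runs nothing. It assumes the total page count is already known (it is passed in by the caller) and that the input path ends in .pdf and contains no spaces. The output naming scheme is an arbitrary choice for illustration.

```shell
# Hypothetical helper (not part of PDFBox): prints one export:text command per
# chunk of pages. Both -startPage and -endPage are 1-based and inclusive.
print_page_chunks() {
    input=$1    # path to the PDF, assumed to end in .pdf
    total=$2    # total number of pages, supplied by the caller
    size=$3     # pages per chunk
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + size - 1))
        # Clamp the final chunk to the last page.
        if [ "$end" -gt "$total" ]; then
            end=$total
        fi
        echo "java -jar pdfbox-app-3.0.0.jar export:text -i $input -o ${input%.pdf}-p${start}-${end}.txt -startPage $start -endPage $end"
        start=$((end + 1))
    done
}

# A 7-page document in chunks of 3 yields ranges 1-3, 4-6, and 7-7.
print_page_chunks report.pdf 7 3
```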
Output HTML to stdout from a password-protected PDF:
java -jar pdfbox-app-3.0.0.jar export:text -i protected.pdf -password secret \
  -html -console
Extract text with rotation handling from a document that contains rotated or skewed text:
java -jar pdfbox-app-3.0.0.jar export:text -i scanned.pdf -rotationMagic