PDFBox provides several classes for extracting text from PDF documents.
PDFTextStripper handles full-document and page-range extraction, PDFTextStripperByArea lets you extract text from a specific rectangular region, and TextPosition gives you per-glyph coordinates and font metrics.
Basic text extraction
The simplest approach is to create a PDFTextStripper, optionally configure it, then call getText(). The example below is adapted from ExtractTextSimple.java:
ExtractTextSimple.java
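A minimal sketch of this pattern, assuming PDFBox 3.x (where Loader.loadPDF opens documents). The class name BasicTextExtraction and the command-line argument are illustrative and are not taken from the PDFBox examples module:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Illustrative class name; not part of PDFBox.
final class BasicTextExtraction {

    /** Extracts the text of every page in the given document. */
    static String extractAll(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        // Optional: order glyphs by position instead of content-stream order.
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: BasicTextExtraction <input.pdf>");
            return;
        }
        // try-with-resources closes the document and releases its file handle.
        try (PDDocument document = Loader.loadPDF(new File(args[0]))) {
            System.out.println(extractAll(document));
        }
    }
}
```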
setSortByPosition(true) re-orders characters by their x/y coordinates before building the string, which is useful when text is rendered out of reading order. Turn it off if the PDF’s content stream already encodes the correct reading order (for example, multi-column layouts where the stream follows column order).
Extracting text from a specific page range
Set startPage and endPage (both 1-based) before calling getText(). To process each page individually:
ExtractTextSimple.java
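A sketch of per-page extraction under the same PDFBox 3.x assumption. The helper name extractPages is hypothetical; it narrows the page window to a single page on each iteration and collects one string per page:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Illustrative class name; not part of PDFBox.
final class PerPageExtraction {

    /** Returns one string per page, in page order. */
    static List<String> extractPages(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        List<String> pages = new ArrayList<>();
        // Page numbers are 1-based and the end page is inclusive.
        for (int page = 1; page <= document.getNumberOfPages(); page++) {
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            pages.add(stripper.getText(document));
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PerPageExtraction <input.pdf>");
            return;
        }
        try (PDDocument document = Loader.loadPDF(new File(args[0]))) {
            List<String> pages = extractPages(document);
            for (int i = 0; i < pages.size(); i++) {
                System.out.println("--- Page " + (i + 1) + " ---");
                System.out.println(pages.get(i));
            }
        }
    }
}
```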
To extract a contiguous range such as pages 3 to 5, call setStartPage(3) and setEndPage(5) once and call getText() a single time.
Extracting text by region
PDFTextStripperByArea extracts text from a named rectangular region. Coordinates use Java screen coordinates where y = 0 is the top of the page (not the PDF default where y = 0 is the bottom).
ExtractTextByArea.java
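A sketch of region extraction, assuming PDFBox 3.x. The region name "target", the helper extractRegion, and the rectangle values in main are placeholders chosen for illustration:

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Illustrative class name; not part of PDFBox.
final class RegionExtraction {

    /** Extracts the text inside one rectangle on the page at the given 0-based index. */
    static String extractRegion(PDDocument document, int pageIndex, Rectangle2D region)
            throws IOException {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        // Register the region under a name before processing the page.
        stripper.addRegion("target", region);
        PDPage page = document.getPage(pageIndex);
        stripper.extractRegions(page);
        return stripper.getTextForRegion("target");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: RegionExtraction <input.pdf>");
            return;
        }
        try (PDDocument document = Loader.loadPDF(new File(args[0]))) {
            // x, y, width, height with y measured down from the top of the page.
            Rectangle2D region = new Rectangle2D.Double(10, 280, 275, 60);
            System.out.println(extractRegion(document, 0, region));
        }
    }
}
```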
Add every region with addRegion() before calling extractRegions(), then retrieve each with getTextForRegion(name).
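Rectangles measured in PDF user space (origin at the bottom-left, y increasing upward) need their y value flipped before they can serve as top-left-origin regions. The conversion can be sketched in plain Java; the helper toTopLeftOrigin is hypothetical, and it assumes an unrotated page whose lower-left corner sits at (0, 0):

```java
import java.awt.geom.Rectangle2D;

// Illustrative helper; not part of PDFBox.
final class RegionCoordinates {

    /**
     * Converts a rectangle given by its lower-left corner in PDF user space
     * into the equivalent rectangle with y measured down from the page top.
     */
    static Rectangle2D.Double toTopLeftOrigin(double x, double lowerY,
                                              double width, double height,
                                              double pageHeight) {
        // The rectangle's top edge is at lowerY + height in PDF space.
        double topY = pageHeight - (lowerY + height);
        return new Rectangle2D.Double(x, topY, width, height);
    }

    public static void main(String[] args) {
        // US Letter page: 612 x 792 points.
        Rectangle2D.Double region = toTopLeftOrigin(72, 700, 200, 50, 792);
        System.out.println(region.getY()); // 42.0
    }
}
```

Pages with a rotation or an offset crop box need additional adjustment that this sketch does not cover.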
Getting text positions
Subclass PDFTextStripper and override writeString() to access individual TextPosition objects with character-level coordinates and font information. DrawPrintTextLocations.java in the examples module demonstrates the full pattern:
DrawPrintTextLocations.java
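A reduced sketch of the subclassing pattern, assuming PDFBox 3.x. It only prints the per-glyph values and omits the drawing that the full example performs; the class name PrintTextPositions is illustrative:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Illustrative class name; not part of PDFBox.
final class PrintTextPositions extends PDFTextStripper {

    PrintTextPositions() throws IOException {
        super();
    }

    /** Called once per run of text; textPositions holds one entry per glyph. */
    @Override
    protected void writeString(String text, List<TextPosition> textPositions)
            throws IOException {
        for (TextPosition position : textPositions) {
            System.out.printf("'%s' x=%.2f y=%.2f width=%.2f size=%.1fpt font=%s%n",
                    position.getUnicode(),
                    position.getX(),
                    position.getY(),
                    position.getWidth(),
                    position.getFontSizeInPt(),
                    position.getFont().getName());
        }
        // Keep the default behaviour so getText() still returns the text.
        super.writeString(text, textPositions);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PrintTextPositions <input.pdf>");
            return;
        }
        try (PDDocument document = Loader.loadPDF(new File(args[0]))) {
            PrintTextPositions stripper = new PrintTextPositions();
            stripper.setSortByPosition(true);
            stripper.getText(document);
        }
    }
}
```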
TextPosition methods:
| Method | Description |
|---|---|
| getX() | X coordinate in user space |
| getY() | Y coordinate in user space |
| getWidth() | Width of the character |
| getFontSizeInPt() | Rendered font size in points |
| getUnicode() | Unicode string for this glyph |
| getFont() | The PDFont used to render this character |
Handling permissions
PDFs can restrict content extraction. Always check AccessPermission.canExtractContent() before attempting extraction:
Permission check
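A sketch of the check, assuming PDFBox 3.x. The helper extractIfAllowed and the exception message are illustrative; the optional second command-line argument is treated as the document password:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;

// Illustrative class name; not part of PDFBox.
final class PermissionCheck {

    /** Extracts text only when the document's permissions allow it. */
    static String extractIfAllowed(PDDocument document) throws IOException {
        AccessPermission permission = document.getCurrentAccessPermission();
        if (!permission.canExtractContent()) {
            throw new IOException("Document does not permit content extraction");
        }
        return new PDFTextStripper().getText(document);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: PermissionCheck <input.pdf> [password]");
            return;
        }
        // An empty password is used when none is supplied.
        String password = args.length > 1 ? args[1] : "";
        try (PDDocument document = Loader.loadPDF(new File(args[0]), password)) {
            System.out.println(extractIfAllowed(document));
        }
    }
}
```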
Loader.loadPDF(file, password) accepts a password string to open encrypted documents before checking permissions.
Common issues and tips
Extracted text is empty or garbled
Some PDFs encode text using custom glyph mappings without proper Unicode ToUnicode entries. PDFBox can only decode what the PDF provides. Try opening the file in Adobe Reader and copying text manually. If that also fails, the PDF likely does not contain extractable text. If Adobe Reader can copy text but PDFBox cannot, file a bug report with a minimal reproducer.
Text is extracted in the wrong order
By default, PDFTextStripper follows the content stream order, which may not match reading order. Enable setSortByPosition(true) to sort characters by their x/y coordinates. For multi-column documents, position-based sorting can merge columns incorrectly. In that case, call setSortByPosition(false) and rely on stream order instead.
Spaces are missing between words
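PDFBox infers word spacing from glyph positions. If the gap between two glyphs is smaller than the threshold, no space is emitted. The threshold can be adjusted with setSpacingTolerance(float) and setAverageCharTolerance(float) on PDFTextStripper; larger values reduce the number of spaces added, so lower them when spaces go missing. For lower-level control, subclass PDFTextStripper and override writeString(), which receives the TextPosition objects for each run of text.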
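The idea of inferring spaces from glyph gaps can be illustrated with a deliberately simplified sketch. This is not PDFBox's actual algorithm: the Glyph class, the join helper, and the tolerance values are all hypothetical and exist only to show how a gap threshold decides where spaces appear:

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration only; PDFBox's real heuristic is more involved.
final class GapSpaceInference {

    /** Hypothetical glyph: its text, left edge, and advance width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs left to right, inserting a space whenever the gap to the
     * previous glyph exceeds tolerance * spaceWidth.
     */
    static String join(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous != null) {
                double gap = glyph.x - (previous.x + previous.width);
                if (gap > tolerance * spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(glyph.text);
            previous = glyph;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("a", 0.0, 5.0),
                new Glyph("b", 5.5, 5.0),   // gap of 0.5 after "a"
                new Glyph("c", 14.0, 5.0)); // gap of 3.5 after "b"
        System.out.println(join(glyphs, 3.0, 0.5)); // ab c
        System.out.println(join(glyphs, 3.0, 2.0)); // abc
    }
}
```

With the larger tolerance the 3.5-unit gap no longer counts as a space, which mirrors how a threshold that is too high for a given document drops spaces between words.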
Scanned PDFs return no text
Scanned documents are images, not text. PDFBox cannot extract text from image-only PDFs. You need an OCR engine such as Tesseract to first convert the scanned image to a text layer before PDFBox can process it.
Text extraction is slow on large files
Loading a large PDF fully into memory via Loader.loadPDF(File) can be slow. Use Loader.loadPDF(RandomAccessRead) backed by a memory-mapped file for better performance. You can also limit the page range with setStartPage / setEndPage to avoid processing the entire document.