This quickstart walks you through adding PDFBox to a Java project, creating a PDF that contains a line of text, and extracting text from an existing PDF. By the end you will have a working setup and two runnable Java programs to build on.
PDFBox 3.x requires Java 8 or higher. Make sure your JAVA_HOME points to a compatible JDK before proceeding.
1. Add the dependency

Add the pdfbox artifact to your build. The groupId is org.apache.pdfbox. The snippet below uses 3.0.0, the first release of the 3.x line; check Maven Central for the newest 3.0.x version and substitute it.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.0</version>
</dependency>
If you use encryption or digital signatures, also add the Bouncy Castle provider. PDFBox 3.x uses Bouncy Castle 1.7x or later.
<dependency>
    <groupId>org.bouncycastle</groupId>
    <artifactId>bcprov-jdk18on</artifactId>
    <version>1.84</version>
</dependency>
<dependency>
    <groupId>org.bouncycastle</groupId>
    <artifactId>bcpkix-jdk18on</artifactId>
    <version>1.84</version>
</dependency>
2. Create a blank PDF

The simplest possible PDFBox program creates a one-page document and saves it to disk. A valid PDF must contain at least one page.
CreateBlankPDF.java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public final class CreateBlankPDF
{
    public static void main(String[] args) throws IOException
    {
        String filename = "blank.pdf";

        try (PDDocument doc = new PDDocument())
        {
            // a valid PDF document requires at least one page
            PDPage blankPage = new PDPage();
            doc.addPage(blankPage);
            doc.save(filename);
        }
    }
}
PDDocument implements AutoCloseable, so wrapping it in a try-with-resources statement ensures the document and any underlying scratch files are closed correctly.
3. Write text to a PDF

To add content to a page, open a PDPageContentStream and use the text operators. The example below uses the built-in Helvetica Bold font from the PDF standard-14 set, which requires no font embedding.
HelloWorld.java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;

public final class HelloWorld
{
    public static void main(String[] args) throws IOException
    {
        String filename = "hello.pdf";
        String message  = "Hello, PDFBox!";

        try (PDDocument doc = new PDDocument())
        {
            PDPage page = new PDPage();
            doc.addPage(page);

            PDFont font = new PDType1Font(FontName.HELVETICA_BOLD);

            try (PDPageContentStream contents = new PDPageContentStream(doc, page))
            {
                contents.beginText();
                contents.setFont(font, 12);
                contents.newLineAtOffset(100, 700);
                contents.showText(message);
                contents.endText();
            }

            doc.save(filename);
        }
    }
}
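To illustrate a few more text operators than the single showText call above, here is a minimal sketch that writes several lines of text. The class name MultiLineText and the render helper are illustrative, not part of PDFBox. It assumes each string is a single line: showText does not interpret line breaks, so every line gets its own showText call followed by newLine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;

// Hypothetical helper class for illustration only.
public final class MultiLineText
{
    // Renders the given lines onto a single page and returns the PDF bytes.
    static byte[] render(String[] lines) throws IOException
    {
        try (PDDocument doc = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream())
        {
            PDPage page = new PDPage();
            doc.addPage(page);

            PDFont font = new PDType1Font(FontName.HELVETICA);
            float fontSize = 12f;

            try (PDPageContentStream contents = new PDPageContentStream(doc, page))
            {
                contents.beginText();
                contents.setFont(font, fontSize);
                // Leading is the distance between baselines; 1.2 x font size is a common choice
                contents.setLeading(fontSize * 1.2f);
                contents.newLineAtOffset(100, 700);
                for (String line : lines)
                {
                    contents.showText(line);
                    // Move down by the leading set above
                    contents.newLine();
                }
                contents.endText();
            }

            doc.save(out);
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException
    {
        byte[] pdf = render(new String[] { "First line", "Second line", "Third line" });
        Files.write(Paths.get("lines.pdf"), pdf);
    }
}
```

Running main should leave a lines.pdf file in the working directory with three lines of text starting near the top left of the page.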
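Coordinates in PDFBox follow the PDF convention: the origin (0, 0) is at the bottom-left corner of the page, and Y increases upward. Units are points (1/72 inch). An A4 page is 595 × 842 points; a US Letter page is 612 × 792 points. The no-argument PDPage constructor used above creates a US Letter page, so the text at (100, 700) lands near the top left.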
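Because the origin is at the bottom left, positions measured from the top of the page have to be flipped before they are passed to newLineAtOffset. The following is a small sketch in plain Java (no PDFBox calls) of the arithmetic involved. The class PageCoords and its methods are hypothetical helpers, and the conversion assumes the default user-space unit of 72 points per inch.

```java
// Hypothetical helper for illustration; not part of PDFBox.
public final class PageCoords
{
    // Default PDF user space: 72 points per inch
    private static final double POINTS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    private PageCoords()
    {
    }

    // Converts millimetres to PDF points.
    public static double mmToPoints(double mm)
    {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    // Converts a distance measured down from the top edge into a PDF y coordinate.
    public static double yFromTop(double pageHeight, double distanceFromTop)
    {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args)
    {
        // A4 is 210 mm x 297 mm
        System.out.printf("A4: %.0f x %.0f points%n", mmToPoints(210), mmToPoints(297));
        // prints "A4: 595 x 842 points"

        // A baseline one inch (72 pt) below the top of a US Letter page (792 pt high)
        System.out.println(yFromTop(792, 72));
        // prints "720.0"
    }
}
```

The resulting y value can be passed directly as the second argument of newLineAtOffset when it is the first positioning call in a text block.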
4. Extract text from a PDF

Use Loader.loadPDF() to open an existing PDF, then PDFTextStripper to extract the text. The example below iterates page-by-page and prints each page’s content to standard output.
ExtractTextSimple.java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractTextSimple
{
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            System.err.println("Usage: java ExtractTextSimple <input-pdf>");
            System.exit(-1);
        }

        try (PDDocument document = Loader.loadPDF(new File(args[0])))
        {
            AccessPermission ap = document.getCurrentAccessPermission();
            if (!ap.canExtractContent())
            {
                throw new IOException("You do not have permission to extract text");
            }

            PDFTextStripper stripper = new PDFTextStripper();

            // Sorting by position helps with multi-column layouts. In some
            // files it can change the order — disable if results look wrong.
            stripper.setSortByPosition(true);

            for (int p = 1; p <= document.getNumberOfPages(); ++p)
            {
                // Extract one page at a time
                stripper.setStartPage(p);
                stripper.setEndPage(p);

                String text = stripper.getText(document);

                String pageStr = String.format("page %d:", p);
                System.out.println(pageStr);
                for (int i = 0; i < pageStr.length(); ++i)
                {
                    System.out.print("-");
                }
                System.out.println();
                System.out.println(text.trim());
                System.out.println();
            }
        }
    }
}
In PDFBox 3.x, use Loader.loadPDF() to open documents — not PDDocument.load(), which was removed in 3.0. See the migration guide for the full list of breaking changes.
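Loader.loadPDF() also accepts a byte array, which makes it possible to exercise text writing and text extraction together without touching the disk. The sketch below is one way to do that; the class name RoundTrip and its two helper methods are illustrative assumptions, not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical class for illustration only.
public final class RoundTrip
{
    // Builds a one-page PDF containing the message and returns it as bytes.
    static byte[] createPdf(String message) throws IOException
    {
        try (PDDocument doc = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream())
        {
            PDPage page = new PDPage();
            doc.addPage(page);

            try (PDPageContentStream contents = new PDPageContentStream(doc, page))
            {
                contents.beginText();
                contents.setFont(new PDType1Font(FontName.HELVETICA_BOLD), 12);
                contents.newLineAtOffset(100, 700);
                contents.showText(message);
                contents.endText();
            }

            // save(OutputStream) writes the document to memory instead of a file
            doc.save(out);
            return out.toByteArray();
        }
    }

    // Loads the PDF from memory with the 3.x Loader API and returns its trimmed text.
    static String extractText(byte[] pdfBytes) throws IOException
    {
        try (PDDocument doc = Loader.loadPDF(pdfBytes))
        {
            return new PDFTextStripper().getText(doc).trim();
        }
    }

    public static void main(String[] args) throws IOException
    {
        byte[] pdf = createPdf("Hello, PDFBox!");
        System.out.println(extractText(pdf));
    }
}
```

If the dependency is set up correctly, running main is expected to print the same message that was written. This can serve as a quick sanity check of the installation.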
5. Build and run

Compile and run with Maven:
# Compile
mvn compile

# Run the blank PDF creator
mvn exec:java -Dexec.mainClass="CreateBlankPDF"

# Run text extraction on an existing PDF
mvn exec:java -Dexec.mainClass="ExtractTextSimple" -Dexec.args="input.pdf"
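Alternatively, build a fat JAR and run it directly. A plain mvn package produces a JAR without its dependencies, so this assumes your build is configured with a plugin such as maven-shade-plugin to bundle PDFBox. Replace my-app.jar with the name of the JAR your build produces: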
mvn package
java -cp target/my-app.jar ExtractTextSimple input.pdf
You should see each page’s text printed to the console. If a page prints empty or garbled characters, the PDF may use a non-standard font encoding. Consult the text extraction guide for troubleshooting steps.