Get started with Apache PDFBox

This quickstart walks you through adding PDFBox to a Java project, creating a blank PDF, writing a line of text to a page, and extracting text from an existing PDF. By the end you will have a working setup and three runnable Java programs to build on.

PDFBox 3.x requires Java 8 or higher. Make sure JAVA_HOME points to a compatible JDK before proceeding.
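If you are unsure which JVM your build picks up, a small check like the following can help. It is a minimal sketch, not part of PDFBox: the class and method names are hypothetical, and the minimum version is a parameter you supply from the requirement stated above. It reads the standard java.specification.version system property, which is "1.8" on Java 8 and a plain number such as "11" or "17" on later releases.

```java
public final class JavaVersionCheck
{
    // Returns the feature release for a java.specification.version value,
    // for example "1.8" -> 8, "11" -> 11, "17" -> 17.
    public static int featureVersion(String specVersion)
    {
        String v = specVersion.trim();
        if (v.startsWith("1."))
        {
            // Java 8 and earlier report versions as 1.x
            v = v.substring(2);
        }
        int dot = v.indexOf('.');
        if (dot >= 0)
        {
            v = v.substring(0, dot);
        }
        return Integer.parseInt(v);
    }

    // True if the given specification version is at least minimumFeature.
    public static boolean meetsMinimum(String specVersion, int minimumFeature)
    {
        return featureVersion(specVersion) >= minimumFeature;
    }

    public static void main(String[] args)
    {
        String running = System.getProperty("java.specification.version");
        System.out.println("Running on Java " + featureVersion(running));

        if (args.length == 1)
        {
            // Optional argument: the minimum feature version to check against
            int minimum = Integer.parseInt(args[0]);
            System.out.println(meetsMinimum(running, minimum)
                    ? "OK: meets minimum " + minimum
                    : "Too old: need at least " + minimum);
        }
    }
}
```

Run it with the required version as the only argument to get a pass or fail line in addition to the detected version.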

Add the dependency

Add the pdfbox artifact to your build. The groupId is org.apache.pdfbox. The snippet below pins 3.0.0, the first release of the 3.0 line; check Maven Central for the latest 3.0.x version and use that instead.

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.0</version>
</dependency>

If you use encryption or digital signatures, also add the Bouncy Castle provider. PDFBox 3.x uses Bouncy Castle 1.7x or later.

<dependency>
    <groupId>org.bouncycastle</groupId>
    <artifactId>bcprov-jdk18on</artifactId>
    <version>1.84</version>
</dependency>
<dependency>
    <groupId>org.bouncycastle</groupId>
    <artifactId>bcpkix-jdk18on</artifactId>
    <version>1.84</version>
</dependency>

Create a blank PDF

The simplest possible PDFBox program creates a one-page document and saves it to disk. A valid PDF must contain at least one page.

CreateBlankPDF.java

import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public final class CreateBlankPDF
{
    public static void main(String[] args) throws IOException
    {
        String filename = "blank.pdf";

        try (PDDocument doc = new PDDocument())
        {
            // a valid PDF document requires at least one page
            PDPage blankPage = new PDPage();
            doc.addPage(blankPage);
            doc.save(filename);
        }
    }
}

PDDocument implements AutoCloseable, so wrapping it in a try-with-resources statement ensures the document and any underlying scratch files are closed correctly.

Write text to a PDF

To add content to a page, open a PDPageContentStream and use the text operators. The example below uses the built-in Helvetica Bold font from the PDF standard-14 set, which requires no font embedding.

HelloWorld.java

import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;

public final class HelloWorld
{
    public static void main(String[] args) throws IOException
    {
        String filename = "hello.pdf";
        String message  = "Hello, PDFBox!";

        try (PDDocument doc = new PDDocument())
        {
            PDPage page = new PDPage();
            doc.addPage(page);

            PDFont font = new PDType1Font(FontName.HELVETICA_BOLD);

            try (PDPageContentStream contents = new PDPageContentStream(doc, page))
            {
                contents.beginText();
                contents.setFont(font, 12);
                contents.newLineAtOffset(100, 700);
                contents.showText(message);
                contents.endText();
            }

            doc.save(filename);
        }
    }
}
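In the program above, the inner try-with-resources block closes the content stream before doc.save() runs, and the document itself is closed last. That ordering matters because the page content is only complete once its stream is closed. The sketch below illustrates the ordering with plain Java; Tracked is a hypothetical stand-in for the PDFBox classes, not a PDFBox type.

```java
import java.util.ArrayList;
import java.util.List;

public final class CloseOrderDemo
{
    // Hypothetical stand-in for a closeable resource; records when it is closed.
    static final class Tracked implements AutoCloseable
    {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log)
        {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close()
        {
            log.add("close " + name);
        }
    }

    // Mirrors the nesting used in HelloWorld and returns the order of events.
    public static List<String> run()
    {
        List<String> log = new ArrayList<>();
        try (Tracked doc = new Tracked("document", log))
        {
            try (Tracked contents = new Tracked("content stream", log))
            {
                log.add("write text");
            }
            // The inner resource is already closed when this line runs.
            log.add("save");
        }
        return log;
    }

    public static void main(String[] args)
    {
        run().forEach(System.out::println);
    }
}
```

The same shape applies when a page needs several content streams: close each one before saving the document.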

Coordinates in PDFBox follow the PDF convention: the origin (0, 0) is at the bottom-left corner of the page, and Y increases upward. An A4 page is 595 × 842 points; a US Letter page is 612 × 792 points.
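Because Y grows upward, positions that are naturally measured from the top of the page have to be subtracted from the page height. The helper below is a minimal sketch in plain Java using the rounded page sizes quoted above; the class and method names are hypothetical. A no-argument PDPage is US Letter, so the offset (100, 700) used in HelloWorld sits 92 points below the top edge.

```java
public final class PageCoordinates
{
    // Rounded page heights in points, as quoted in the text (1 inch = 72 points).
    public static final float A4_HEIGHT = 842f;
    public static final float LETTER_HEIGHT = 792f;

    // Converts a distance measured down from the top edge of the page
    // into a PDF y-coordinate measured up from the bottom edge.
    public static float fromTop(float pageHeight, float distanceFromTop)
    {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args)
    {
        float margin = 72f; // one inch

        System.out.println("A4 baseline, 1 inch from top: "
                + fromTop(A4_HEIGHT, margin));
        System.out.println("Letter baseline, 1 inch from top: "
                + fromTop(LETTER_HEIGHT, margin));
    }
}
```

In a real program you would likely read the height from the page itself, for example with page.getMediaBox().getHeight(), rather than hard-coding it.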

Extract text from a PDF

Use Loader.loadPDF() to open an existing PDF, then PDFTextStripper to extract the text. The example below iterates page-by-page and prints each page’s content to standard output.

ExtractTextSimple.java

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractTextSimple
{
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            System.err.println("Usage: java ExtractTextSimple <input-pdf>");
            System.exit(-1);
        }

        try (PDDocument document = Loader.loadPDF(new File(args[0])))
        {
            AccessPermission ap = document.getCurrentAccessPermission();
            if (!ap.canExtractContent())
            {
                throw new IOException("You do not have permission to extract text");
            }

            PDFTextStripper stripper = new PDFTextStripper();

            // Sorting by position helps with multi-column layouts. In some
            // files it can change the order — disable if results look wrong.
            stripper.setSortByPosition(true);

            for (int p = 1; p <= document.getNumberOfPages(); ++p)
            {
                // Extract one page at a time
                stripper.setStartPage(p);
                stripper.setEndPage(p);

                String text = stripper.getText(document);

                String pageStr = String.format("page %d:", p);
                System.out.println(pageStr);
                for (int i = 0; i < pageStr.length(); ++i)
                {
                    System.out.print("-");
                }
                System.out.println();
                System.out.println(text.trim());
                System.out.println();
            }
        }
    }
}

In PDFBox 3.x, use Loader.loadPDF() to open documents — not PDDocument.load(), which was removed in 3.0. See the migration guide for the full list of breaking changes.

Build and run

Compile and run with Maven:

# Compile
mvn compile

# Run the blank PDF creator
mvn exec:java -Dexec.mainClass="CreateBlankPDF"

# Run the text-writing example
mvn exec:java -Dexec.mainClass="HelloWorld"

# Run text extraction on an existing PDF
mvn exec:java -Dexec.mainClass="ExtractTextSimple" -Dexec.args="input.pdf"

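Or package the project and run the JAR directly. A plain mvn package produces a JAR that contains only your own classes, so the command below works only if your build bundles its dependencies into a fat JAR, for example with the Maven Shade Plugin. The name target/my-app.jar is a placeholder for your own artifact.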

mvn package
java -cp target/my-app.jar ExtractTextSimple input.pdf

You should see each page’s text printed to the console. If a page prints empty or garbled characters, the PDF may use a non-standard font encoding. Consult the text extraction guide for troubleshooting steps.
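To spot problem pages without reading every line of output, you could run a rough check over each extracted string. The following is a heuristic sketch in plain Java, not a PDFBox feature: the class name, the method names, and the choice of what counts as suspicious (the Unicode replacement character, control characters, and private-use code points) are all assumptions.

```java
public final class ExtractionCheck
{
    // True if the page produced no visible text at all.
    public static boolean looksEmpty(String pageText)
    {
        return pageText == null || pageText.trim().isEmpty();
    }

    // Fraction of non-whitespace code points that look like extraction
    // damage. Returns 0.0 for text with no non-whitespace characters.
    public static double suspiciousRatio(String pageText)
    {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < pageText.length(); )
        {
            int cp = pageText.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp))
            {
                continue;
            }
            total++;
            if (cp == 0xFFFD
                    || Character.isISOControl(cp)
                    || Character.getType(cp) == Character.PRIVATE_USE)
            {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    public static void main(String[] args)
    {
        String sample = args.length > 0 ? args[0] : "Hello, PDFBox!";
        System.out.println("empty: " + looksEmpty(sample));
        System.out.println("suspicious ratio: " + suspiciousRatio(sample));
    }
}
```

In ExtractTextSimple, the string returned by stripper.getText(document) could be passed to these methods for each page. What ratio should count as garbled depends on your documents and is left for you to choose.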

Ready for more? Continue with the core guides:

Creating PDFs — add images, shapes, and custom fonts
Text extraction — region-based extraction, sorting, and encoding issues
Rendering — convert pages to PNG/TIFF images at custom DPI
Migrating from 2.x — if you are upgrading an existing project
