How it works

Overview

The PDF visual diff tool renders each page of both PDFs to an image, scores every page pair with the Structural Similarity Index (SSIM), and writes a highlighted diff image for each pair that scores below the threshold. Comparison is strictly page by page: page N of the first PDF is only ever compared with page N of the second.

Core libraries

The tool is built on three essential Python libraries:

PyMuPDF (fitz)

Used for PDF parsing and rendering. PyMuPDF converts PDF pages into high-resolution pixmaps for comparison.

Key features:

Fast PDF page rendering
Configurable DPI via zoom matrix
Efficient memory handling for large PDFs

See pdf_visual_diff.py:2 for the import and pdf_visual_diff.py:19-20 for PDF loading.

scikit-image (SSIM)

Provides the Structural Similarity Index (SSIM) algorithm for quantitative image comparison.

Key features:

Perceptual similarity score (1.0 for identical images; the full range is -1.0 to 1.0)
Multi-channel support for color images
Closer to perceived visual difference than raw pixel-by-pixel comparison

The SSIM calculation happens at pdf_visual_diff.py:54 with a configurable threshold.

Pillow (PIL)

Handles image manipulation, difference visualization, and output generation.

Key features:

RGB pixmap conversion
ImageChops for pixel-level differences
Alpha compositing for highlighted diffs

Image processing occurs at pdf_visual_diff.py:42-69.

The comparison algorithm

PDF loading and validation

The tool opens both PDFs and validates page counts. If the PDFs have different page counts, it warns the user and compares up to the minimum page count.

pdf1 = fitz.open(pdf1_path)
pdf2 = fitz.open(pdf2_path)

if len(pdf1) != len(pdf2):
    print(f"Warning: PDFs have different page counts...")

page_count = min(len(pdf1), len(pdf2))

See pdf_visual_diff.py:19-30

Page rendering at high resolution

Each page is rendered at 2x zoom, which corresponds to 144 DPI because PyMuPDF renders at 72 DPI by default. Both PDFs are rendered with the same zoom matrix, so the resulting pixmaps are directly comparable.

zoom = 2  # DPI = 144
mat = fitz.Matrix(zoom, zoom)

img1 = page1.get_pixmap(matrix=mat)
img2 = page2.get_pixmap(matrix=mat)

The rendering happens at pdf_visual_diff.py:31-39
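As a rough check on the numbers above, the relationship between zoom, DPI, and pixel dimensions can be worked out by hand. This is a minimal sketch that assumes PyMuPDF's default scale of 72 DPI at zoom 1 and a US Letter page (612 x 792 points). The helpers `effective_dpi` and `rendered_size` are hypothetical and are not part of the tool.

```python
BASE_DPI = 72  # PyMuPDF renders at 72 DPI when no zoom matrix is applied


def effective_dpi(zoom: float) -> float:
    """DPI produced by a uniform zoom matrix such as fitz.Matrix(zoom, zoom)."""
    return BASE_DPI * zoom


def rendered_size(width_pt: float, height_pt: float, zoom: float) -> tuple[int, int]:
    """Approximate pixmap size in pixels for a page measured in points."""
    return round(width_pt * zoom), round(height_pt * zoom)


# US Letter page at the tool's default zoom of 2
dpi = effective_dpi(2)
size = rendered_size(612, 792, 2)
print(dpi, size)
```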

Image normalization

PyMuPDF pixmaps are converted to PIL RGB images. If dimensions differ, the second image is resized to match the first using LANCZOS interpolation.

pil_img1 = Image.frombytes("RGB", [img1.width, img1.height], img1.samples)
pil_img2 = Image.frombytes("RGB", [img2.width, img2.height], img2.samples)

if pil_img1.size != pil_img2.size:
    pil_img2 = pil_img2.resize(pil_img1.size, Image.LANCZOS)

See pdf_visual_diff.py:42-47

SSIM calculation

The Structural Similarity Index scores the perceptual similarity of the two images, with 1.0 meaning identical. The default threshold is 0.999: any page pair that scores below it is recorded as different.

# Convert to numpy arrays for ssim
np_img1 = np.array(pil_img1)
np_img2 = np.array(pil_img2)

# Compute SSIM with color channel support
similarity = ssim(np_img1, np_img2, channel_axis=-1, data_range=255)

if similarity < threshold:
    diff_pages.append(i + 1)

SSIM computation at pdf_visual_diff.py:49-57
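To give a sense of what the score measures, the sketch below evaluates the SSIM formula once over a whole grayscale image in pure Python. This is a deliberate simplification and not what the tool runs: scikit-image evaluates SSIM over local sliding windows and averages the results, so its values will differ. `global_ssim` is a hypothetical name, and the constants 0.01 and 0.03 are the conventional SSIM defaults.

```python
def global_ssim(x: list[float], y: list[float], data_range: float = 255.0) -> float:
    """Single-window SSIM over two equal-length lists of pixel intensities."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


checker = [0, 255, 0, 255]
inverted = [255, 0, 255, 0]
same = global_ssim(checker, checker)       # identical inputs
opposite = global_ssim(checker, inverted)  # anti-correlated inputs
print(same, opposite)
```

Identical inputs score 1.0, while the inverted pattern scores below zero, which is why the score is not strictly confined to the 0 to 1 range.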

Difference visualization

When differences are detected, ImageChops creates a pixel-level difference image. The differences are thresholded and highlighted in red with 50% transparency.

# Calculate pixel differences
diff = ImageChops.difference(pil_img1, pil_img2)

# Threshold to make differences more visible
thresholded_diff = diff.point(lambda p: 255 if p > 20 else 0)

# Create red highlight overlay
if thresholded_diff.getbbox():
    drawing_layer = Image.new("RGBA", pil_img1.size, (0,0,0,0))
    drawing_layer.paste((255,0,0,128), mask=thresholded_diff.convert('L'))
    highlighted_img = Image.alpha_composite(pil_img1.convert("RGBA"), drawing_layer)
    highlighted_img.convert("RGB").save(os.path.join(output_dir, f"diff_page_{i+1}.png"))

Visualization logic at pdf_visual_diff.py:59-69
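The highlight step can be exercised on a tiny synthetic pair to see what each stage produces. This sketch follows the same Pillow calls as the excerpt above but wraps them in a hypothetical `highlight_differences` helper and returns the image instead of saving it. The 4x4 test images are made up for illustration.

```python
from PIL import Image, ImageChops


def highlight_differences(img1: Image.Image, img2: Image.Image, cutoff: int = 20):
    """Return img1 with changed pixels tinted red, or None if nothing exceeds cutoff."""
    diff = ImageChops.difference(img1, img2)
    # Per channel: 255 where the difference exceeds the cutoff, else 0
    mask = diff.point(lambda p: 255 if p > cutoff else 0).convert("L")
    if mask.getbbox() is None:
        return None
    overlay = Image.new("RGBA", img1.size, (0, 0, 0, 0))
    overlay.paste((255, 0, 0, 128), mask=mask)  # half-transparent red
    return Image.alpha_composite(img1.convert("RGBA"), overlay).convert("RGB")


# Two 4x4 white images that differ in a single pixel
base = Image.new("RGB", (4, 4), (255, 255, 255))
changed = base.copy()
changed.putpixel((1, 1), (0, 0, 0))

result = highlight_differences(base, changed)
unchanged = highlight_differences(base, base)
print(result.getpixel((1, 1)), result.getpixel((0, 0)), unchanged)
```

One consequence of converting the thresholded image to mode "L" is that the channels are weighted, so a difference confined to a single color channel produces a fainter tint than a difference in all three.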

Results generation

The tool generates a JSON report with detailed comparison results including timestamps, page counts, diff locations, and status.

results = {
    "timestamp": timestamp,
    "status": "success" if (not diff_pages and not extra_pages) else "error",
    "description": description,
    "pdf1_pages": pdf1_page_count,
    "pdf2_pages": pdf2_page_count,
    "threshold": threshold,
    "identical": not diff_pages and not extra_pages,
    "diff_pages": diff_pages,
    "extra_pages": extra_pages
}

Results are saved at pdf_visual_diff.py:109-126
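The report structure can be reproduced with the standard library alone. In this sketch the field names come from the excerpt above, while the `build_report` helper, the ISO 8601 timestamp format, and the sample values are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def build_report(description, pdf1_pages, pdf2_pages, threshold, diff_pages, extra_pages):
    """Assemble the report dictionary; field names follow the excerpt above."""
    identical = not diff_pages and not extra_pages
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # format is an assumption
        "status": "success" if identical else "error",
        "description": description,
        "pdf1_pages": pdf1_pages,
        "pdf2_pages": pdf2_pages,
        "threshold": threshold,
        "identical": identical,
        "diff_pages": diff_pages,
        "extra_pages": extra_pages,
    }


# Hypothetical run: page 2 differs and the second PDF has one extra page
report = build_report("sample run", 3, 4, 0.999, diff_pages=[2], extra_pages=[4])
text = json.dumps(report, indent=2)
print(text)
```

Note that "status" and "identical" are derived from the same condition, so a report with status "error" means that differences were found, not that the tool failed.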

Handling edge cases

Different page counts

When PDFs have different page counts, the tool compares pages up to the minimum count and exports extra pages from the longer PDF as separate images.

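if len(pdf1) > len(pdf2):
    longer_pdf = "PDF1"
    for i in range(page_count, len(pdf1)):
        extra_pages.append(i + 1)
        # Render the extra page (assumed to mirror the rendering step above;
        # the original excerpt omits these two lines)
        pix = pdf1[i].get_pixmap(matrix=mat)
        pil_img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        # Export extra page as image
        pil_img.save(os.path.join(output_dir, f"extra_page_{i+1}_only_in_pdf1.png"))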

See pdf_visual_diff.py:72-89
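The page bookkeeping for mismatched documents can be sketched without opening any PDFs. `plan_comparison` is a hypothetical helper, and the "PDF2" label is an assumption, since the excerpt above only shows the branch where the first PDF is longer.

```python
def plan_comparison(pdf1_pages: int, pdf2_pages: int):
    """Return (pages to compare, extra pages, longer PDF label), 1-based."""
    shared = min(pdf1_pages, pdf2_pages)
    compared = list(range(1, shared + 1))
    extra = list(range(shared + 1, max(pdf1_pages, pdf2_pages) + 1))
    if pdf1_pages > pdf2_pages:
        longer = "PDF1"
    elif pdf2_pages > pdf1_pages:
        longer = "PDF2"  # assumed label for the mirrored case
    else:
        longer = None
    return compared, extra, longer


compared, extra, longer = plan_comparison(5, 3)
print(compared, extra, longer)
```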

Different page dimensions

When page dimensions differ, the second image is automatically resized to match the first using high-quality LANCZOS resampling (pdf_visual_diff.py:45-47).

The default SSIM threshold of 0.999 is very strict. Use the --threshold flag to adjust it: a lower value tolerates more variation before a page is reported, and a higher value flags smaller differences.
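The effect of the threshold is easiest to see on a fixed set of scores. The per-page scores below are invented for illustration, and `pages_below_threshold` is a hypothetical helper that applies the same `similarity < threshold` test as the tool.

```python
def pages_below_threshold(scores: list[float], threshold: float = 0.999) -> list[int]:
    """Return 1-based page numbers whose SSIM score is below the threshold."""
    return [i + 1 for i, score in enumerate(scores) if score < threshold]


# Hypothetical per-page scores, for illustration only
scores = [1.0, 0.9995, 0.998, 0.95]
strict = pages_below_threshold(scores)         # default threshold of 0.999
relaxed = pages_below_threshold(scores, 0.99)  # tolerates small variations
print(strict, relaxed)
```

With the default threshold, pages 3 and 4 are reported. Relaxing it to 0.99 leaves only page 4.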

Performance considerations

DPI setting: The 2x zoom factor (144 DPI) balances quality and performance. Higher zoom resolves finer detail, but pixmap memory grows with the square of the zoom factor.
Memory usage: Each page is processed individually and closed after comparison to manage memory efficiently.
Output optimization: Only pages with detected differences, plus any extra pages from the longer PDF, generate output images, minimizing disk usage.
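A back-of-the-envelope estimate shows how quickly memory grows with zoom. This sketch assumes an uncompressed 8-bit RGB pixmap (3 bytes per pixel) and a US Letter page. `rgb_pixmap_bytes` is a hypothetical helper, and the real footprint will be higher because the comparison also holds PIL and NumPy copies of both pages.

```python
def rgb_pixmap_bytes(width_pt: float, height_pt: float, zoom: float) -> int:
    """Approximate size of an uncompressed 8-bit RGB pixmap (3 bytes per pixel)."""
    width_px = round(width_pt * zoom)
    height_px = round(height_pt * zoom)
    return width_px * height_px * 3


for zoom in (1, 2, 4):
    size = rgb_pixmap_bytes(612, 792, zoom)
    print(f"zoom {zoom}: {size / 1_000_000:.1f} MB")
```

Doubling the zoom quadruples the size of each pixmap: roughly 5.8 MB per page at zoom 2 becomes roughly 23 MB at zoom 4.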

Code references

The core comparison logic is in the compare_pdfs() function at pdf_visual_diff.py:10-136. Key sections:

PDF loading: Lines 19-20
Page rendering: Lines 31-39
Image conversion: Lines 42-47
SSIM calculation: Lines 49-57
Diff visualization: Lines 59-69
Extra page handling: Lines 72-89
Results generation: Lines 109-126
