PDF Parsing & AcroForm Detection

Overview

The PDF Form Parser automatically analyzes uploaded PDF files to detect AcroForm fields, including text inputs, checkboxes, radio buttons, and digital signature fields. This enables form templates to be created from existing PDF documents without manual field mapping.

How PDF Parsing Works

When you upload a PDF form template, the system uses the PdfFormsParserService to extract all interactive fields.

Upload PDF Document

Navigate to Form Templates and click New Form Template. Upload a PDF file containing AcroForm fields.

Automatic Field Detection

The system uses pdftk to enumerate all form fields in the PDF, detecting:

Text fields (TextBox)
Checkboxes (Button)
Radio buttons (Button with options)
Signature fields (/FT /Sig)
Dropdown lists (Choice)

Field Metadata Extraction

For each field, the parser extracts:

Field name (original and sanitized)
Field type
Available options (for checkboxes/radio buttons)
Human-readable label (generated from field name)
Signature metadata (for signature fields)

Structure Generation

The parsed fields are stored as JSON in the form_structure column, ready for customization in the Form Builder.
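
As a rough illustration of what one stored entry might look like, the sketch below serializes a single field hash with Ruby's standard JSON library. The keys mirror the hash built by the parse method shown under Standard Field Parsing; the field values and local variable names are made up for the example, and the actual persistence code is not shown in this guide.

```ruby
require 'json'

# Hypothetical parsed field: keys follow the parser's output hash,
# values are illustrative only.
parsed_fields = [
  {
    name: 'Inspector_Name',
    original_name: 'Inspector_Name',
    type: 'Text',
    value: '',
    options: nil,
    human_label: 'Inspector Name',
    label_name: 'Inspector Name'
  }
]

# Serialize to the JSON text that could be written to form_structure.
form_structure_json = JSON.generate(parsed_fields)

# Reading it back yields the same data when symbol keys are requested.
restored = JSON.parse(form_structure_json, symbolize_names: true)
```

Because the structure is plain JSON, the Form Builder can reorder or rename entries without touching the original PDF.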

Field Detection Engine

The parsing service is located in app/services/pdf_forms_parser_service.rb:6.

Standard Field Parsing

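def parse
  # Raw field objects as reported by pdftk
  raw_fields = get_fields_with_encoding

  # Build one hash per field
  parsed = raw_fields.map do |field|
    {
      name: sanitize_field_name(field.name),          # UTF-8-safe name
      original_name: field.name,                      # name as stored in the PDF
      type: field.type,
      value: '',
      options: field.options,
      human_label: generate_human_label(field.name),
      label_name: field.value                         # taken from the field's value in the PDF
    }
  end
end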

The parser automatically generates human-readable labels by transforming field names:

Location_row_1 → “Location Row 1”
buildingAddress → “Building Address”
Inspector_Name → “Inspector Name”
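
The implementation of generate_human_label is not shown here. A minimal sketch that reproduces the three transformations above might look like the following; the method name humanize_field_name is a hypothetical stand-in, and the real method may handle more cases (acronyms, punctuation, and so on).

```ruby
# Hypothetical stand-in for generate_human_label.
def humanize_field_name(name)
  name.to_s
      .gsub(/([a-z\d])([A-Z])/, '\1 \2') # split camelCase: buildingAddress -> building Address
      .tr('_', ' ')                      # underscores become spaces
      .split                             # collapse repeated whitespace
      .map(&:capitalize)                 # capitalize each word
      .join(' ')
end

labels = %w[Location_row_1 buildingAddress Inspector_Name].map do |field_name|
  humanize_field_name(field_name)
end
```

Note that capitalize lowercases the rest of each word, so a name such as HVAC_Unit would come out as "Hvac Unit" with this sketch.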

Signature Field Detection

Signature fields are detected using HexaPDF to identify PDF signature annotations (/FT /Sig type).

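def self.list_signature_fields(file_path)
  doc = HexaPDF::Document.open(file_path)

  fields = []
  # acro_form is nil when the PDF contains no interactive form
  return fields unless doc.acro_form

  doc.acro_form.each_field do |field|
    next unless signature_field?(field)

    # Assumes extract_signature_info returns nil for an unsigned field
    info = extract_signature_info(field)

    fields << {
      name: field.full_field_name,
      is_signed: !info.nil?,
      info: info
    }
  end

  fields
end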

Signature fields are always preserved during parsing, even if they’re empty or unsigned. The system marks them with is_signature: true for special handling.

UTF-8 and Special Characters

The parser handles international characters and special symbols through multiple encoding strategies:

UTF-8 Sanitization: Field names are sanitized to remove invalid UTF-8 sequences
Fallback Parsing: If standard parsing fails, the system uses pdftk dump_data_fields as a backup
Character Replacement: Invalid characters are replaced rather than causing parse failures

def sanitize_field_name(name)
  name.to_s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end
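
The same idea can be tried in isolation. The sketch below is a variant, not the service's code: it uses String#scrub instead of encode, and the helper name scrub_field_name is hypothetical. It assumes the raw bytes are meant to be UTF-8, so it retags the string before dropping invalid sequences, which keeps valid multi-byte characters even when the input arrives tagged as binary.

```ruby
# Hypothetical helper: treat the bytes as UTF-8 and drop invalid sequences.
def scrub_field_name(name)
  # dup so the caller's string is not retagged in place
  name.to_s.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Binary-tagged input whose bytes are valid UTF-8 is preserved.
valid = scrub_field_name("Geb\u00e4ude_Adresse".b)

# A stray 0xFF byte is not valid UTF-8 and is removed.
cleaned = scrub_field_name("Inspector\xFFName".b)
```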

Field Filtering

The parser automatically filters out empty or invalid fields:

Fields with empty label_name values are excluded (except signature fields)
Fields with value “Off” (unchecked checkboxes in their default state) are filtered
Signature fields are always preserved regardless of their state

If your PDF has fields with blank labels or default values, they may be filtered out during parsing. Use the Form Builder to manually add fields if needed.
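
The filtering rules can be expressed as a single predicate. This is a sketch under assumptions: the method name keep_field? and the sample data are hypothetical, and the "Off" check is applied to label_name because the parse method shown earlier copies the field's value into that key.

```ruby
# Hypothetical predicate implementing the three filtering rules.
def keep_field?(field)
  return true if field[:is_signature] # signature fields are always kept

  label = field[:label_name].to_s.strip
  # Drop blank labels and unchecked checkboxes in their default state
  !label.empty? && label != 'Off'
end

fields = [
  { name: 'Inspector_Name', label_name: 'Inspector Name' },
  { name: 'Unused',         label_name: '' },
  { name: 'Sprinklers_OK',  label_name: 'Off' },
  { name: 'Signature1',     label_name: '', is_signature: true }
]

kept_names = fields.select { |field| keep_field?(field) }.map { |field| field[:name] }
```

Here only Inspector_Name and Signature1 survive, which is why a checkbox left at its default state may need to be re-added in the Form Builder.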

Error Handling

The parser includes robust error handling for corrupted or non-standard PDFs:

PdftkError - Standard Parsing Failed

When pdftk cannot read the PDF structure, the system automatically switches to the dump_data_fields method, which extracts field data from pdftk's raw output.

Resolution: No action needed; the fallback is automatic.
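
pdftk's dump_data_fields command prints one block per field, separated by lines containing ---, with entries such as FieldName, FieldType, FieldValue, and FieldStateOption. A minimal sketch of turning that text into field hashes follows. The method name and the sample output are illustrative, the service's own fallback code is not shown in this guide, and values that span multiple lines are not handled.

```ruby
# Hypothetical parser for the text printed by `pdftk file.pdf dump_data_fields`.
def parse_dump_data_fields(output)
  output.split(/^---\s*$/).filter_map do |block|
    field = { options: [] }

    block.each_line do |line|
      key, value = line.chomp.split(':', 2)
      next if value.nil? # not a "Key: value" line

      value = value.strip
      case key
      when 'FieldName'        then field[:name] = value
      when 'FieldType'        then field[:type] = value
      when 'FieldValue'       then field[:value] = value
      when 'FieldStateOption' then field[:options] << value
      end
    end

    field if field[:name] # skip empty blocks
  end
end

sample_output = <<~DUMP
  ---
  FieldType: Text
  FieldName: Inspector_Name
  FieldFlags: 0
  FieldValue: J. Smith
  ---
  FieldType: Button
  FieldName: Sprinklers_OK
  FieldFlags: 0
  FieldValue: Off
  FieldStateOption: Off
  FieldStateOption: Yes
DUMP

fallback_fields = parse_dump_data_fields(sample_output)
```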

StandardError - Unexpected PDF Format

If the PDF structure is completely unreadable, parsing returns an empty array and logs the error.

Resolution: Verify the PDF is a valid AcroForm document. Some PDFs created with form builders may not have proper field annotations.

Encoding Errors

For PDFs with special characters in field names, the parser attempts multiple encoding approaches.

Resolution: Handled automatically through UTF-8 sanitization and fallback methods.
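
The control flow described in this section can be sketched with plain Ruby rescue clauses. Everything named here is an assumption made for illustration: PdftkError is a local stand-in for the error class mentioned above, and parse_with_fallback, standard, and fallback are hypothetical names for the two parsing strategies.

```ruby
require 'logger'

# Hypothetical stand-in for the pdftk error class named above.
class PdftkError < StandardError; end

# standard and fallback are callables that each return an array of fields.
def parse_with_fallback(standard:, fallback:, logger: Logger.new($stderr))
  standard.call
rescue PdftkError => e
  # Standard parsing failed: switch to the fallback automatically.
  logger.warn("Standard parsing failed (#{e.message}); using fallback")
  begin
    fallback.call
  rescue StandardError => fallback_error
    logger.error("Fallback parsing failed: #{fallback_error.message}")
    []
  end
rescue StandardError => e
  # Unreadable PDF: log the error and return an empty structure.
  logger.error("Unexpected PDF format: #{e.message}")
  []
end
```

The nested begin block is needed because an exception raised inside one rescue clause is not caught by a sibling rescue clause of the same method.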

Background Processing

Large PDFs with many fields are processed asynchronously to avoid blocking the web interface:

Upload initiates ParseFormTemplateJob
Job processes PDF in background worker
Form structure is saved when complete
Page automatically refreshes to show parsed fields

def create
  if @form_template.save
    ParseFormTemplateJob.perform_later(@form_template.id)
    redirect_to @form_template, 
      notice: 'The file is being processed and the structure will appear shortly.'
  end
end

Supported Field Types

PDF Field Type	Detected As	Usage
`/FT /Tx`	Text	Single-line or multi-line text input
`/FT /Btn` (checkbox)	Button	Checkbox with On/Off state
`/FT /Btn` (radio)	Button	Radio button group with options
`/FT /Ch`	Choice	Dropdown or list selection
`/FT /Sig`	Signature_Field	Digital signature field

Next Steps

After PDF parsing completes:

Customize Fields

Use the Form Builder to organize, rename, and configure parsed fields

Create Inspections

Start using your form template for fire safety inspections
