Parsing PDFs with Ruby

In our journey through the world of test automation with Ruby, we have found that the data we need to validate is sometimes locked up in a PDF file and can be difficult to get at. The 3Qi Labs team decided there had to be a way to automate the extraction and parsing of these PDFs within our test automation scripts, and the search began. Enter Xpdf.

Many programs and libraries can do the parsing job we need, such as PDFMiner, PoDoFo, Origami, and the PDF-Reader gem, but we have found Xpdf to be the best choice for both viewing PDF files and parsing out their data when testing includes validating the contents of generated PDFs. Xpdf is an open source viewer for PDF files that includes a set of utilities for just about everything you would want to do to a PDF: extracting the document info, attachments, or images, or converting the PDF to a bitmap format. The utility we are after here is Xpdf's text extractor, pdftotext.exe, which does just what its name says: it converts your PDF to a text document (examples below).

This post shows how to use Xpdf to convert a PDF into a text file and then use Ruby to parse the resulting text. We start by explaining how to install the utility (our example is for Windows), then go over the methods we used to do the conversion and parse the data.

To install Xpdf, download the package for your platform (we are currently working with Windows) from http://foolabs.com/xpdf/download.html. Install it so that pdftotext.exe is on your PATH.
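Before running any conversions, it can help to confirm that Ruby can actually see the utility. The following is a minimal sketch, not part of our original scripts: find_executable is a hypothetical helper name, and it assumes the usual PATH and (on Windows) PATHEXT environment variables.

```ruby
# Hypothetical helper: return the full path of an executable found on PATH,
# or nil if it is not there. On Windows, PATHEXT supplies extensions such as .EXE.
def find_executable(name,
                    path_env = ENV['PATH'].to_s,
                    exts = [''] + ENV['PATHEXT'].to_s.split(';'))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    exts.each do |ext|
      candidate = File.join(dir, name + ext)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

if __FILE__ == $PROGRAM_NAME
  found = find_executable('pdftotext')
  puts(found ? "pdftotext found at #{found}" : 'pdftotext is not on the PATH')
end
```

Calling this once at the start of a test run gives a clearer failure message than a missing .txt file later on.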

This method can be used to do the actual pdf conversion in Ruby:

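def pdf_to_text(file, noblank = true)
  # Strip the extension so one stem names both the .pdf and the .txt file.
  spec = file.sub(/\.pdf$/, '')
  # pdftotext writes <stem>.txt next to the PDF. Quote the path in case it contains spaces.
  `pdftotext "#{spec}.pdf"`
  text = []
  File.readlines("#{spec}.txt").each do |l|
    l.chomp! if noblank
    # After chomp!, a blank line has length 0 and is skipped.
    text << l if l.length > 0
  end
  text
end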

Where file is the full path to the PDF file to be converted and noblank indicates whether to remove empty lines from the text output.

The output is an array of strings, each entry representing a line in the file produced by pdftotext.exe.

This method reads the .txt file that pdftotext writes alongside the PDF. pdftotext can also send the text to stdout when the output file is given as -, which would avoid the intermediate file.
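As an alternative to reading the intermediate file, the Xpdf man page for pdftotext describes passing - as the output file name to write the text to stdout. Assuming your build supports that, a sketch of the conversion could look like the following. The helper names (pdftotext_command, text_to_lines, pdf_to_text_stdout) are ours for illustration, and here noblank simply drops empty lines.

```ruby
require 'open3'

# Build the argument list for pdftotext. "-" as the output file asks
# pdftotext to write the text to stdout (assumption: supported by your build).
def pdftotext_command(pdf_path)
  ['pdftotext', pdf_path, '-']
end

# Split raw text into lines, optionally dropping empty ones.
def text_to_lines(raw, noblank = true)
  lines = raw.split(/\r?\n/)
  noblank ? lines.reject(&:empty?) : lines
end

# Run pdftotext and return the converted text as an array of lines.
# Passing the command as separate arguments avoids shell quoting problems
# with paths that contain spaces.
def pdf_to_text_stdout(pdf_path, noblank = true)
  out, status = Open3.capture2(*pdftotext_command(pdf_path))
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  text_to_lines(out, noblank)
end
```

The return value has the same shape as pdf_to_text, an array of strings, so the parsing code does not need to change.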

How you parse the output depends on the original PDF document, how it gets converted, and what you are validating.

The following parse method is built for a document that contains a report name and date in a header, a report number in a footer, and several subject headings that may have information in them to be validated. Here we are just validating the presence of the subject headings and the expected values of the report name, date, and number.

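The text converted from our sample PDF looks like this with blank lines removed. The page header content (the lines from the first "First Street Bank" through the report date) and the page footer content (the "Confidential" line through the report number) repeat for each page: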

First Street Bank
Client Name
First Street Bank
Commercial Division
Report Date
Important Report
07/05/2011
Table of Contents
Recommendation Brief Business Description Borrower/Management Analysis Collateral Analysis Financial Analysis
Recommendation
Enter Text Here
Brief Business Description
Enter Text Here
Borrower/Management Analysis
Edit text
First Street Bank Confidential
Report #:
12345
First Street Bank
Client Name
First Street Bank
Commercial Division
Report Date
Important Report
07/05/2011
Collateral Analysis
Enter Text Here
Financial Analysis
Enter Text Here
First Street Bank Confidential
Report #:
12345

The parsing method for this particular PDF report format looks like this:

def parse_standard_report_pdf(file, has, client, dt, name, nbr)
  text     = pdf_to_text(file)
  rpt_date = nil
  rpt_name = nil
  rpt_nbr  = nil
  rpt_client   = nil

  text.each_index do |idx|
    ln = text[idx]
    if ln =~ /Report #/ and not rpt_nbr
      rpt_nbr = text[idx + 1]
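    elsif ln =~ /Client Name/ and not rpt_client
      # The client name is on the line after the "Client Name" label.
      rpt_client = text[idx + 1]
    elsif ln =~ /Report Date/ and not rpt_date
      # The report name and the date follow the "Report Date" label, in that order.
      rpt_name = text[idx + 1]
      rpt_date = text[idx + 2]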
    elsif has.has_key?(ln)
      has[ln] = idx
    end
  end

  if rpt_client
    if rpt_client == client
      puts "Found correct client name: #{rpt_client}"
    else
      puts "Found wrong client name: #{rpt_client}: expected #{client}"
    end
  else
    puts "Client name not found"
  end
  if rpt_nbr
    if rpt_nbr == nbr
      puts "Found correct report number: #{rpt_nbr}"
    else
      puts "Found wrong report number: #{rpt_nbr}: expected #{nbr}"
    end
  else
    puts "Report number not found"
  end
  if rpt_name
    if rpt_name == name
      puts "Found correct report name: #{rpt_name}"
    else
      puts "Found wrong report name: #{rpt_name}: expected #{name}"
    end
  else
    puts "Report name not found"
  end
  if rpt_date
    if rpt_date == dt
      puts "Found correct report date: #{rpt_date}"
    else
      puts "Found wrong report date: #{rpt_date}: expected #{dt}"
    end
  else
    puts "Report date not found"
  end
  has.each_key do |key|
    msg = "Find #{key}"
    if has[key] > 0
      #~ passed_to_log(msg)
      puts msg + " passed."
    else
      puts msg + " FAILED."
    end
  end
end

Input parameters:
file is the full path to the target PDF file.

has is a hash with an entry for each expected document section, each with a starting value of 0. The method stores the line index at which each section heading is found.

client, dt, name, nbr are the expected values for the client name, report date, report name, and report number.

The has hash looks like this:

required_topics = {
  'Borrower/Management Analysis' => 0,
  'Collateral Analysis' => 0,
  'Recommendation' => 0,
  'Brief Business Description' => 0,
  'Favorite Passtime' => 0,
  'Risk Analysis' => 0,
  'Financial Analysis' => 0
}

A call to the method looks like this:

parse_standard_report_pdf(
  '/directorypath/my_report.pdf',
  required_topics,
  'First Street Bank',
  '07/05/2011',
  'Important Report',
  '12345'
)

The converted PDF contains values in the line or lines following a label, so the line index is used to get the client name, report name, date, and number relative to the label each one follows. Each value is captured only the first time its label is encountered, because the header and footer repeat on every page.
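The label-then-value lookup can also be pulled out into a small helper. This is only a sketch of the idea: value_after is a hypothetical name, and the sample lines are taken from the converted text shown earlier.

```ruby
# Hypothetical helper: return the line `offset` lines after the first line
# that matches `label`, or nil if the label (or the value) is not there.
def value_after(lines, label, offset = 1)
  idx = lines.index { |ln| ln =~ label }
  idx && lines[idx + offset]
end

sample = [
  'First Street Bank', 'Client Name', 'First Street Bank',
  'Commercial Division', 'Report Date', 'Important Report', '07/05/2011',
  'First Street Bank Confidential', 'Report #:', '12345'
]

puts value_after(sample, /Client Name/)     # client name, one line after the label
puts value_after(sample, /Report Date/, 1)  # report name
puts value_after(sample, /Report Date/, 2)  # report date
puts value_after(sample, /Report #/)        # report number
```

Because the helper stops at the first match, it has the same "first occurrence only" behavior as the guards in the parsing method.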

The index where a topic name has been found is saved in the has hash for possible later use in collecting and validating the information between it and the next topic.
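We have not shown that later step in this post, but one way it might look is sketched below. section_bodies is a hypothetical helper: it takes the converted lines and a hash of topic => line index (with 0 meaning "not found") and returns the lines between each found heading and the next one.

```ruby
# Hypothetical helper: map each found topic to the lines between its heading
# and the next found heading (or the end of the text).
def section_bodies(lines, topic_index)
  # Keep only topics that were found, ordered by where they appear.
  found = topic_index.select { |_, idx| idx > 0 }.sort_by { |_, idx| idx }
  bodies = {}
  found.each_with_index do |(topic, idx), i|
    next_idx = i + 1 < found.length ? found[i + 1][1] : lines.length
    bodies[topic] = lines[(idx + 1)...next_idx]
  end
  bodies
end

lines = [
  'Report Date',
  'Recommendation', 'Enter Text Here',
  'Brief Business Description', 'Enter Text Here',
  'Borrower/Management Analysis', 'Edit text'
]
topics = {
  'Recommendation' => 1,
  'Brief Business Description' => 3,
  'Borrower/Management Analysis' => 5,
  'Risk Analysis' => 0
}

section_bodies(lines, topics).each do |topic, body|
  puts "#{topic}: #{body.join(' ')}"
end
```

Note that in a real multi-page report, page header and footer lines that fall between two headings would be included in the body, so they would need to be filtered out first.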

Then the captured values and the hash are evaluated and report messages are generated. The client name, report name, date, and number must be present and must match the expected values. Topics must simply be present.
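The four value checks all follow the same pattern, so they could be collapsed into one helper that builds the message. A minimal sketch, with check_message as a hypothetical name and the same message wording as the parsing method:

```ruby
# Hypothetical helper: build the report message for one captured value.
# `label` is a lowercase description such as "report number".
def check_message(label, actual, expected)
  return "#{label.capitalize} not found" if actual.nil?

  if actual == expected
    "Found correct #{label}: #{actual}"
  else
    "Found wrong #{label}: #{actual}: expected #{expected}"
  end
end

puts check_message('report number', '12345', '12345')
puts check_message('report number', '12345', '54321')
puts check_message('report date', nil, '07/05/2011')
```

Returning the message instead of printing it also makes it easier to route results to a log instead of the console.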

Sample output looks like this:

Found correct client name: First Street Bank
Found correct report number: 12345
Found correct report name: Important Report
Found correct report date: 07/05/2011
Find Financial Analysis passed.
Find Borrower/Management Analysis passed.
Find Favorite Passtime FAILED.
Find Recommendation passed.
Find Collateral Analysis passed.
Find Brief Business Description passed.

In this post we described how to use Xpdf to convert a PDF into a text file and then use Ruby to parse the resulting text. We started by explaining how to get the Xpdf pdftotext.exe utility installed in a Windows environment and then discussed the methods we used to convert the PDF and parse the data. Hopefully this helps anyone looking to add data extracted from PDF files to their test automation methods.

Credit to our team members Pat and Deepti for all their contributions to this post.
