Parsing PDFs with Ruby
In our journey through the world of test automation with Ruby, we have found that the data we need to validate is sometimes locked up in a PDF file and can be difficult to get at. The 3Qi Labs team decided there had to be a way to automate the extraction and parsing of these PDFs within our test automation scripts, and the search began. Enter Xpdf.
Many programs and libraries can do the parsing job we need, such as PDFMiner, PoDoFo, Origami, and the PDF-Reader gem, but we have found Xpdf to be the best choice for both viewing PDF files and parsing out their data when testing includes validating the contents of generated PDFs. Xpdf is an open source viewer for Adobe PDF files that includes a set of utilities for just about everything you would want to do to a PDF: extracting its info, attachments, or images, or converting it to a bitmap format. The utility we are after here is Xpdf's text extractor, pdftotext.exe, which does just what its name says: it converts your PDF to a text document (examples below).
This post shows how to use Xpdf to convert a PDF into a text file and then use Ruby to parse the returned data. We start by explaining how to install the utility (the example is for Windows), then go over the methods we used to do the conversion and parse the data.
To install Xpdf, download the package for your platform (we are currently working with Windows) from http://foolabs.com/xpdf/download.html. Install it so that pdftotext.exe is on the PATH.
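Before running any conversions it can help to confirm that the utility is actually reachable. The sketch below is one way to check from Ruby. The helper name find_on_path is our own illustration (it is not part of Xpdf or Ruby), and it assumes that on Windows the executable carries one of the extensions listed in the PATHEXT environment variable.

```ruby
# Hypothetical helper: return the full path of an executable found on the
# search path, or nil when it is not there. The keyword defaults read the
# real environment; they can be overridden when experimenting.
def find_on_path(name, path: ENV['PATH'].to_s, exts: ENV['PATHEXT'].to_s)
  # PATHEXT is normally only set on Windows (".COM;.EXE;...").
  # Elsewhere only the bare name is tried.
  suffixes = exts.empty? ? [''] : [''] + exts.split(';')
  path.split(File::PATH_SEPARATOR).each do |dir|
    suffixes.each do |suffix|
      candidate = File.join(dir, name + suffix)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

location = find_on_path('pdftotext')
puts(location ? "pdftotext found at #{location}" : 'pdftotext is not on the PATH')
```

If the second message appears, the install directory still needs to be added to the PATH before the methods in this post will work.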
This method can be used to do the actual PDF conversion in Ruby:

    def pdf_to_text(file, noblank = true)
      # Name the output file explicitly: report.pdf -> report.txt
      txt = file.sub(/\.pdf$/i, '') + '.txt'
      # Separate arguments keep paths that contain spaces intact.
      system('pdftotext', file, txt) or raise "pdftotext failed for #{file}"
      text = []
      File.foreach(txt) do |l|
        l.chomp!
        # Skip empty lines when noblank is set.
        text << l unless noblank && l.empty?
      end
      text
    end
Where file is the full path to the PDF file to be converted and noblank indicates whether to remove empty lines from the text output.
The output is an array of strings, each entry representing a line in the file produced by pdftotext.exe.
By default pdftotext writes its output to a text file, so the method reads that file back. (Giving - as the output file name sends the text to stdout instead.)
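If the intermediate .txt file is unwanted, pdftotext can be given - as the output file name, in which case it writes the text to stdout. The following is a minimal sketch of that variant, not the method we used above. The names text_to_lines and pdf_to_text_stdout are our own, and the sketch assumes pdftotext is on the PATH.

```ruby
require 'open3'

# Turn raw pdftotext output into an array of lines, optionally dropping
# blank ones. pdftotext separates pages with form-feed characters, so
# those are removed first.
def text_to_lines(raw, noblank = true)
  lines = raw.delete("\f").lines.map(&:chomp)
  noblank ? lines.reject(&:empty?) : lines
end

# Variant of pdf_to_text that reads the conversion from stdout.
# The trailing '-' asks pdftotext to write to stdout instead of a file.
def pdf_to_text_stdout(file, noblank = true)
  raw, status = Open3.capture2('pdftotext', file, '-')
  raise "pdftotext failed for #{file}" unless status.success?
  text_to_lines(raw, noblank)
end
```

Keeping the line splitting in its own method means it can be exercised with a plain string, without a PDF or the utility being present.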
How you parse the output depends on the original PDF document, how it gets converted, and what you are validating.
The following parse method is built for a document that contains a report name and date in a header, a report number in a footer, and several subject headings that may have information in them to be validated. Here we are just validating the presence of the subject headings and the expected values of the report name, date, and number.
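The converted text from the sample PDF looks like this with blank lines removed. Each page begins with the page header content (from the first "First Street Bank" line through the report date) and ends with the page footer content (from "First Street Bank Confidential" through the report number):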
First Street Bank
Client Name
First Street Bank
Commercial Division
Report Date
Important Report
07/05/2011
Table of Contents
Recommendation Brief Business Description Borrower/Management Analysis Collateral Analysis Financial Analysis
Recommendation
Enter Text Here
Brief Business Description
Enter Text Here
Borrower/Management Analysis
Edit text
First Street Bank Confidential
Report #:
12345
First Street Bank
Client Name
First Street Bank
Commercial Division
Report Date
Important Report
07/05/2011
Collateral Analysis
Enter Text Here
Financial Analysis
Enter Text Here
First Street Bank Confidential
Report #:
12345
The parsing method for this particular PDF report format looks like this:
    def parse_standard_report_pdf(file, has, client, dt, name, nbr)
      text = pdf_to_text(file)
      rpt_date = nil
      rpt_name = nil
      rpt_nbr = nil
      rpt_client = nil

      # Capture each value the first time its label is seen.
      text.each_index do |idx|
        ln = text[idx]
        if ln =~ /Report #/ && rpt_nbr.nil?
          rpt_nbr = text[idx + 1]
        elsif ln =~ /Client Name/ && rpt_client.nil?
          # The client name is on the line after the label.
          rpt_client = text[idx + 1]
        elsif ln =~ /Report Date/ && rpt_date.nil?
          # The report name and date follow the "Report Date" label.
          rpt_name = text[idx + 1]
          rpt_date = text[idx + 2]
        elsif has.has_key?(ln)
          # Remember where this topic heading was found.
          has[ln] = idx
        end
      end

      if rpt_client
        if rpt_client == client
          puts "Found correct client name: #{rpt_client}"
        else
          puts "Found wrong client name: #{rpt_client}: expected #{client}"
        end
      else
        puts "Client name not found"
      end

      if rpt_nbr
        if rpt_nbr == nbr
          puts "Found correct report number: #{rpt_nbr}"
        else
          puts "Found wrong report number: #{rpt_nbr}: expected #{nbr}"
        end
      else
        puts "Report number not found"
      end

      if rpt_name
        if rpt_name == name
          puts "Found correct report name: #{rpt_name}"
        else
          puts "Found wrong report name: #{rpt_name}: expected #{name}"
        end
      else
        puts "Report name not found"
      end

      if rpt_date
        if rpt_date == dt
          puts "Found correct report date: #{rpt_date}"
        else
          puts "Found wrong report date: #{rpt_date}: expected #{dt}"
        end
      else
        puts "Report date not found"
      end

      # A topic passes if its heading was found anywhere in the text.
      has.each_key do |key|
        msg = "Find #{key}"
        if has[key] > 0
          #~ passed_to_log(msg)
          puts msg + " passed."
        else
          puts msg + " FAILED."
        end
      end
    end
Input parameters:
file is the full path to the target PDF file
has is a hash that contains an index for each expected document section with starting value of 0.
client, dt, name, nbr are the expected values for the client name, report date, report name, and report number
The has hash looks like this:
    required_topics = {
      'Borrower/Management Analysis' => 0,
      'Collateral Analysis' => 0,
      'Recommendation' => 0,
      'Brief Business Description' => 0,
      'Favorite Passtime' => 0,
      'Risk Analysis' => 0,
      'Financial Analysis' => 0}
A call to the method looks like this:
    parse_standard_report_pdf(
      '/directorypath/my_report.pdf',
      required_topics,
      'First Street Bank .',
      '07/05/2011',
      'Important Report',
      '54321'
    )
The converted PDF contains values in the line or lines following a label so the index (line count) is used to get the values for report name, date, and number relative to the search string for each. These are only captured the first time encountered.
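The label-then-offset lookup described here can be isolated into a small helper. The sketch below is only an illustration; value_after is a name we made up, and the sample lines are the page header from the converted text shown earlier.

```ruby
# Hypothetical helper: find the first line matching label and return the
# line `offset` positions after it, or nil if the label never appears.
def value_after(lines, label, offset = 1)
  idx = lines.index { |ln| ln =~ label }
  idx && lines[idx + offset]
end

header = ['First Street Bank', 'Client Name', 'First Street Bank',
          'Commercial Division', 'Report Date', 'Important Report', '07/05/2011']

puts value_after(header, /Client Name/)     # prints "First Street Bank"
puts value_after(header, /Report Date/)     # prints "Important Report"
puts value_after(header, /Report Date/, 2)  # prints "07/05/2011"
```

Because Array#index stops at the first match, the value is taken from the first page only, which matches the capture-once behavior of the parse method.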
The index where a topic name has been found is saved in the has hash for possible later use in collecting and validating the information between it and the next topic.
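As a rough sketch of that later use, the saved indexes can be turned into the lines that sit between one topic heading and the next. The helper name section_bodies is hypothetical and was not part of our scripts. Note that any page header or footer lines falling between two headings would end up in the section body and would need to be filtered separately.

```ruby
# Hypothetical helper: given the converted lines and a hash of
# topic => line index (0 meaning "not found", as in the has hash),
# return topic => the lines between that heading and the next found heading.
def section_bodies(lines, topic_index)
  found = topic_index.select { |_, idx| idx > 0 }.sort_by { |_, idx| idx }
  bodies = {}
  found.each_with_index do |(topic, idx), pos|
    # The section ends where the next found heading starts, or at the end of the text.
    next_idx = pos + 1 < found.length ? found[pos + 1][1] : lines.length
    bodies[topic] = lines[(idx + 1)...next_idx]
  end
  bodies
end

lines = ['Report Date', 'Recommendation', 'Enter Text Here',
         'Brief Business Description', 'Enter Text Here', 'Edit text']
topics = { 'Recommendation' => 1, 'Brief Business Description' => 3, 'Risk Analysis' => 0 }

section_bodies(lines, topics).each do |topic, body|
  puts "#{topic}: #{body.join(' | ')}"
end
```

Topics that were never found (index 0) are left out of the result rather than mapped to an empty body.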
Then the captured values and the hash are evaluated and report messages are generated. Client name, report name, date, and number must be present and match the expected values. Topics must simply be present.
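The four value checks follow the same pattern, so they could be folded into one helper. This is a sketch of that idea under our own naming (check_value is hypothetical), producing the same message wording as the parse method.

```ruby
# Hypothetical helper: build the report message for one captured value.
def check_value(label, actual, expected)
  if actual.nil?
    "#{label.capitalize} not found"
  elsif actual == expected
    "Found correct #{label}: #{actual}"
  else
    "Found wrong #{label}: #{actual}: expected #{expected}"
  end
end

puts check_value('report number', '12345', '12345')
puts check_value('report number', '12345', '54321')
puts check_value('report date', nil, '07/05/2011')
```

Returning the message instead of printing it inside the helper leaves the caller free to send it to a log as well as to the console.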
Sample output looks like this (from a run in which the expected values matched the report):
Found correct client name: First Street Bank.
Found correct report number: 12345
Found correct report name: IMPORTANT REPORT
Found correct report date: 07/05/2011
Find Financial Analysis passed.
Find Borrower/Management Analysis passed.
Find Favorite Passtime FAILED.
Find Recommendation passed.
Find Collateral Analysis passed.
Find Brief Business Description passed.
In this post we described how to use Xpdf to convert a PDF into a text file and then use Ruby to parse the returned data. We started by explaining how to get the Xpdf pdftotext.exe utility installed in a Windows environment and then discussed the methods we used to convert the PDF and parse the data. Hopefully this can help anyone looking to add data extracted from PDF files to their test automation methods.
Credit our team members Pat and Deepti for all their contributions to this post.