Hadoop Binary Streaming and PDF File Inclusion

In a previous post I talked about Hadoop Binary Streaming for the processing of Microsoft Office Word documents. However, due to their popularity, I thought adding support for Adobe PDF documents would be beneficial. To this end I have updated the source code to support processing of both “.docx” and “.pdf” documents.

iTextSharp

To support reading PDFs I have used the open source library provided by iText (http://itextpdf.com/). iText is a library that allows you to read, create and manipulate PDF documents (http://itextpdf.com/download.php). The original code was written in Java but a port for .NET is also available (http://sourceforge.net/projects/itextsharp/files/).

Of these libraries I only use the PdfReader class, from the Core library. This class exposes the page count directly, and the author through its Info property.

To use the library in Hadoop one just has to add a file option to the streaming job for the iTextSharp core library:

-file "C:\Reference Assemblies\itextsharp.dll"

This assumes the downloaded and extracted DLL has been copied to and referenced from the “Reference Assemblies” folder.

Source Code Changes

To support PDF documents, only two changes to the code were necessary.

Firstly, a new Mapper was defined that supports the processing of a PdfReader type and returns the author and pages for the document:

namespace FSharp.Hadoop.MapReduce

open System

open iTextSharp.text
open iTextSharp.text.pdf

// Calculates the pages per author for a Pdf document
module OfficePdfPageMapper =

    let authorKey = "Author"
    let unknownAuthor = "unknown author"

    let getAuthors (document:PdfReader) =          
        // For PDF documents perform the split on a ","
        if document.Info.ContainsKey(authorKey) then
            let creators = document.Info.[authorKey]
            if String.IsNullOrWhiteSpace(creators) then
                [| unknownAuthor |]
            else
                creators.Split(',')
        else
            [| unknownAuthor |]

    let getPages (document:PdfReader) =
        // return page count
        document.NumberOfPages

    // Map the data from input name/value to output name/value
    let Map (document:PdfReader) =
        let pages = getPages document
        (getAuthors document)
        |> Seq.map (fun author -> (author, pages))

Secondly, one has to call the correct mapper based on the document type, as determined by the file extension:

let (|WordDocument|PdfDocument|UnsupportedDocument|) extension = 
    if String.Equals(extension, ".docx", StringComparison.InvariantCultureIgnoreCase) then
        WordDocument
    elif String.Equals(extension, ".pdf", StringComparison.InvariantCultureIgnoreCase) then
        PdfDocument
    else
        UnsupportedDocument

// Check we do not have a null document
if (reader.Length > 0L) then
    try
        match Path.GetExtension(filename) with
        | WordDocument ->
            // Get access to the word processing document from the input stream
            use document = WordprocessingDocument.Open(reader, false)
            // Process the word document with the mapper
            OfficeWordPageMapper.Map document
            |> Seq.iter (fun value -> outputCollector value)        
            // close document
            document.Close()
        | PdfDocument ->
            // Get access to the pdf processing document from the input stream
            let document = new PdfReader(reader)
            // Process the pdf document with the mapper
            OfficePdfPageMapper.Map document
            |> Seq.iter (fun value -> outputCollector value)        
            // close document
            document.Close()
        | UnsupportedDocument ->
            ()
    with
    | :? System.IO.FileFormatException ->
        // Ignore invalid file formats
        ()

And that is it.

Conclusion

For Word documents, if one needs to process the actual text/words of a document, this is relatively straightforward:

document.MainDocumentPart.Document.Body.InnerText

Using iText the text/word extraction code is a little more complex but relatively easy. An example can be found here:

http://itextpdf.com/examples/iia.php?id=275
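
As a rough illustration of the iText side, here is a minimal sketch. It assumes iTextSharp 5.x, where the PdfTextExtractor class in the iTextSharp.text.pdf.parser namespace offers a GetTextFromPage method. The getText helper is a hypothetical name of mine and is not part of the updated source code; PDFs with complex layouts may need a different extraction strategy.

```fsharp
open System.Text
open iTextSharp.text.pdf
open iTextSharp.text.pdf.parser

// Hypothetical helper (not in the original source):
// concatenates the extracted text of every page in the document
let getText (document:PdfReader) =
    let builder = StringBuilder()
    // PDF page numbers are 1-based
    for page in 1 .. document.NumberOfPages do
        builder.AppendLine(PdfTextExtractor.GetTextFromPage(document, page)) |> ignore
    builder.ToString()
```

A mapper could then split the returned string into words in much the same way as it would the InnerText of a Word document.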

Enjoy!
