Channel: Carl's Blog

Hadoop Streaming and Reporting


If, like me, you are a .NET developer who has written some Streaming jobs, it is not immediately obvious how one can do any reporting. However, if you dig through the Hadoop Streaming documentation you will come across this in the FAQs:

  • How do I update counters in streaming applications? A streaming process can use stderr to emit counter information. reporter:counter:<group>,<counter>,<amount> should be sent to stderr to update the counter.

  • How do I update status in streaming applications? A streaming process can use stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.

So this provides an easy mechanism for getting feedback from a running streaming job.
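Both reporter lines are just plain text written to standard error. A minimal sketch of emitting each, assuming nothing beyond the formats quoted above (the helper function names and the sample group, counter, and message values are my own):

```fsharp
// Emit a counter update: reporter:counter:<group>,<counter>,<amount>
let reportCounter (group: string) (counter: string) (amount: int) =
    stderr.WriteLine (sprintf "reporter:counter:%s,%s,%d" group counter amount)

// Emit a status message: reporter:status:<message>
let reportStatus (message: string) =
    stderr.WriteLine (sprintf "reporter:status:%s" message)

// Illustrative calls; group/counter/message values are hypothetical
reportCounter "Documents Processed" "Word Document" 1
reportStatus "Processing input split"
```

Hadoop scans the stderr stream of the streaming process for these prefixes, so no extra libraries are needed.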

If you take my last binary streaming post, when running the code one has no idea how many Microsoft Word, PDF, or unknown documents have been processed.

Thus using the counter output format, one can define a simple counterReporter function:

// Write a counter update for the "Documents Processed" group to stderr
let counterReporter docType =
    stderr.WriteLine (sprintf "reporter:counter:Documents Processed,%s,1" docType)

One can then easily report on documents processed using the following slight code modification:

match Path.GetExtension(filename) with
| WordDocument ->
    // Get access to the word processing document from the input stream
    use document = WordprocessingDocument.Open(reader, false)
    // Process the word document with the mapper
    OfficeWordPageMapper.Map document
    |> Seq.iter (fun value -> outputCollector value)        
    // close document
    document.Close()
    counterReporter "Word Document"
| PdfDocument ->
    // Get access to the pdf document from the input stream
    let document = new PdfReader(reader)
    // Process the pdf document with the mapper
    OfficePdfPageMapper.Map document
    |> Seq.iter (fun value -> outputCollector value)        
    // close document
    document.Close()
    counterReporter "PDF Document"
| UnsupportedDocument ->
    counterReporter "Unknown Document Type"

Thus we update the group “Documents Processed” with the document type each time we process a document. Looking at the Hadoop job log we can now see:


Group                       Counter                          Map        Reduce  Total
Documents Processed         PDF Document                     1          0       1
                            Word Document                    3          0       3
File Input Format Counters  Bytes Read                       2,003,157  0       2,003,157
Job Counters                SLOTS_MILLIS_MAPS                0          0       28,925
                            Launched reduce tasks            0          0       1
                            Launched map tasks               0          0       4
                            Data-local map tasks             0          0       4
FileSystemCounters          HDFS_BYTES_READ                  2,003,620  0       2,003,620
                            FILE_BYTES_WRITTEN               90,354     0       90,354
Map-Reduce Framework        Map output materialized bytes    98         0       98
                            Combine output records           5          0       5
                            Map input records                4          0       4
                            Spilled Records                  5          0       5
                            Map output bytes                 64         0       64
                            Map input bytes                  2,003,157  0       2,003,157
                            SPLIT_RAW_BYTES                  463        0       463
                            Map output records               5          0       5
                            Combine input records            5          0       5

All nice and easy.

If you want to do some status or error reporting, the process is the same, just with the reporter:status:<message> string format.
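As one possible sketch, an error reporter could surface the failure in the job status and count it at the same time; the helper names and the "Errors" counter group here are my own, not from the code above:

```fsharp
// Set the task status message: reporter:status:<message>
let statusReporter (message: string) =
    stderr.WriteLine (sprintf "reporter:status:%s" message)

// Hypothetical error reporter: show the failure in the status line
// and bump a counter keyed by the exception type
let errorReporter (filename: string) (ex: exn) =
    statusReporter (sprintf "Error processing %s: %s" filename ex.Message)
    stderr.WriteLine (sprintf "reporter:counter:Errors,%s,1" (ex.GetType().Name))
```

This way a failing document shows up both in the task status and in the job counters, without aborting the streaming process.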

