Hadoop with superCSV

I have to process data in very large text files (around 5 TB in size). The processing logic uses superCSV to parse the data and run some checks on it. Since the size is quite large, we plan to use Hadoop to take advantage of parallel computation. I have installed Hadoop on my machine and started writing the mapper and reducer classes, but I am stuck: the map function requires a key-value pair, so I am not sure what the key and the value should be when reading this text file. Can someone help me out with that?
My thought process is something like this (let me know if I am correct):
1) Read the file using superCSV and have Hadoop generate the superCSV beans for each chunk of the file in HDFS. (I am assuming that Hadoop takes care of splitting the file.)
2) For each of these superCSV beans, run my check logic.

Is the data newline-separated? i.e., if you just split the data on each newline character, will each chunk always be a single, complete record? This depends on how superCSV encodes text, and on whether your actual data contains newline characters.
If yes:
Just use TextInputFormat. It provides the byte offset as the map key and the whole line as the value. You can ignore the key and parse the line using superCSV, as in the sketch below.
If no:
You'll have to write your own custom InputFormat - here's a good tutorial: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat. The specifics of exactly what the key is and what the value is don't matter too much to the mapper input; just make sure one of the two contains the actual data that you want. You can even use NullWritable as the type for one of them.
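For the "if yes" case, here is a minimal sketch of such a mapper, assuming the new mapreduce API and that every line is a complete record; the passesChecks() method is just a placeholder for your validation logic:

import java.io.IOException;
import java.io.StringReader;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

public class CsvCheckMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The byte-offset key is ignored; the whole line is parsed as one CSV record.
        CsvListReader reader = new CsvListReader(
                new StringReader(line.toString()), CsvPreference.STANDARD_PREFERENCE);
        try {
            List<String> columns = reader.read();
            if (columns != null && !passesChecks(columns)) {
                // Emit the records that fail the checks; adjust to whatever output you need.
                context.write(line, NullWritable.get());
            }
        } finally {
            reader.close();
        }
    }

    // Placeholder for the real validation logic.
    private boolean passesChecks(List<String> columns) {
        return columns.size() > 1;
    }
}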

Related

Understanding TextInputFormat in Hadoop

I am from a mainframe background and am trying to understand the different input formats in Hadoop. Can anyone please explain the three input formats below:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat: It reads lines of text files and provides the byte offset of the line as the key to the mapper and the line itself as the value.
TextInputFormat is an InputFormat for plain text files. Files are broken into lines; either linefeed or carriage return is used to signal the end of a line. Keys are the position in the file, and values are the line of text.
KeyValueTextInputFormat (old KeyValueInputFormat): This format also treats each line of input as a separate record. Everything up to the first tab character is sent as the key to the mapper, and the remainder of the line is sent as the value.
While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line itself into key and value by searching for a tab character.
SequenceFileInputFormat: reads special binary files that are specific to Hadoop. These files include many features designed to allow data to be rapidly read into Hadoop mappers.
Sequence files can be block-compressed and provide direct serialization and deserialization of arbitrary data types (not just text). Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
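As a quick illustration, selecting one of these formats in the newer mapreduce API is a single call on the Job (the input path here is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(InputFormatDemo.class);

        // Pick exactly one of the three formats described above:
        job.setInputFormatClass(TextInputFormat.class);            // key = byte offset, value = line of text
        // job.setInputFormatClass(KeyValueTextInputFormat.class); // key = text before first tab, value = rest of line
        // job.setInputFormatClass(SequenceFileInputFormat.class); // key/value = whatever types were written to the file

        FileInputFormat.addInputPath(job, new Path("/data/input")); // example path
    }
}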

Is it possible to know the serial number of the block of input data on which map function is currently working?

I am a novice in Hadoop, and I have the following questions:
(1) As I understand it, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
(2) Is there any way to learn, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block, starting from the first block of the input text.
(3) Is it possible to split the input text in such a way that each block has a predefined word count? If so, how?
Any help would be appreciated.
As I understand it, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
No. A block (a split, to be precise) gets processed by only one mapper.
Is there any way to learn, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block, starting from the first block of the input text.
You can get some valuable info, such as the path of the file containing the split's data and the position of the first byte in the file to process, with the help of the FileSplit class. You might find it helpful.
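For example, a minimal sketch that grabs the split in the mapper's setup() with the new mapreduce API and logs its path, start offset, and length; a "serial number" would still have to be derived from those values (e.g. start offset divided by the block size):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Works when the input format produces FileSplits (e.g. TextInputFormat).
        FileSplit split = (FileSplit) context.getInputSplit();
        System.err.println("Processing file " + split.getPath()
                + " from byte " + split.getStart()
                + " for " + split.getLength() + " bytes");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Normal per-record processing goes here.
    }
}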
Is it possible to split the input text in such a way that each block has a predefined word count? If so, how?
You can do that by extending the FileInputFormat class. To begin with, you could do this:
In your getSplits() method, maintain a counter. As you read the file line by line, keep tokenizing the lines and increase the counter by 1 for each token. Once the counter reaches the desired value, emit the data read up to this point as one split, then reset the counter and start on the next split. A rough sketch of this idea follows.
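A rough sketch of that approach, assuming plain uncompressed text input and reusing the standard LineRecordReader; the class name and the words-per-split constant are made up for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

public class WordCountSplitInputFormat extends FileInputFormat<LongWritable, Text> {

    private static final long WORDS_PER_SPLIT = 10000; // example target

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            Path path = status.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            LineReader reader = new LineReader(fs.open(path));
            try {
                Text line = new Text();
                long pos = 0, splitStart = 0, words = 0;
                int bytesRead;
                while ((bytesRead = reader.readLine(line)) > 0) {
                    pos += bytesRead;
                    words += new StringTokenizer(line.toString()).countTokens();
                    if (words >= WORDS_PER_SPLIT) {
                        // Emit everything read since the last boundary as one split.
                        splits.add(new FileSplit(path, splitStart, pos - splitStart, null));
                        splitStart = pos;
                        words = 0;
                    }
                }
                if (pos > splitStart) { // leftover tail of the file
                    splits.add(new FileSplit(path, splitStart, pos - splitStart, null));
                }
            } finally {
                reader.close();
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // Reuse the standard line reader; note that it adjusts split boundaries to whole
        // lines, so the per-split word count is only approximate at the edges.
        return new LineRecordReader();
    }
}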
HTH
If you define a small max split size you can actually have multiple mappers processing a single HDFS block (say a 32 MB max split for a 128 MB block size - you'll get 4 mappers working on the same HDFS block). With the standard input formats, you'll typically never see two or more mappers processing the same part of the block (the same records).
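For example (new mapreduce API; the 32 MB figure and the input path are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplitsDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-splits");
        // With a 128 MB HDFS block, a 32 MB max split yields roughly 4 mappers per block.
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // example path
    }
}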
MapContext.getInputSplit() can usually be cast to a FileSplit, and then you have the path, offset, and length of the file/block being processed.
If your input files are true text files, then you can use the method suggested by Tariq, but note that this is highly inefficient for larger data sources, as the job client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format) and write the records to disk with a fixed number of words per file (using MultipleOutputs to get a file per number of words, but this again is inefficient). Maybe if you shared the use case for why you want a fixed number of words, we could better understand your needs and come up with alternatives.

How to use "typedbytes" or "rawbytes" in Hadoop Streaming?

I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which allows one to analyze binary data in a language other than Java. (Without this, Streaming interprets some characters, usually \t and \n, as delimiters and complains about non-UTF-8 characters. Converting all my binary data to Base64 would slow down the workflow, defeating the purpose.)
These binary modes were added by HADOOP-1722. On the command line that invokes a Hadoop Streaming job, "-io rawbytes" lets you define your data as a 32-bit integer size followed by raw data of that size, and "-io typedbytes" lets you define your data as a 1-byte zero (which means raw bytes), followed by a 32-bit integer size, followed by raw data of that size. I have created files in these formats (with one or many records) and verified that they are in the right format by checking them against the output of typedbytes.py. I've also tried all conceivable variations (big-endian, little-endian, different byte offsets, etc.). I'm using Hadoop 0.20 from CDH4, which has the classes that implement the typedbytes handling, and it is entering those classes when the "-io" switch is set.
I copied the binary file to HDFS with "hadoop fs -copyFromLocal". When I try to use it as input to a map-reduce job, it fails with an OutOfMemoryError on the line where it tries to make a byte array of the length I specify (e.g. 3 bytes). It must be reading the number incorrectly and trying to allocate a huge block instead. Despite this, it does manage to get a record to the mapper (the previous record? not sure), which writes it to standard error so that I can see it. There are always too many bytes at the beginning of the record: for instance, if the file is "\x00\x00\x00\x00\x03hey", the mapper would see "\x04\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\x00\x03hey" (reproducible bits, though no pattern that I can see).
From page 5 of this talk, I learned that there are "loadtb" and "dumptb" subcommands of streaming, which copy to/from HDFS and wrap/unwrap the typed bytes in a SequenceFile, in one step. When used with "-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat", Hadoop correctly unpacks the SequenceFile, but then misinterprets the typedbytes contained within, in exactly the same way.
Moreover, I can find no documentation of this feature. On Feb 7 (I e-mailed it to myself), it was briefly mentioned in the streaming.html page on Apache, but this r0.21.0 webpage has since been taken down and the equivalent page for r1.1.1 has no mention of rawbytes or typedbytes.
So my question is: what is the correct way to use rawbytes or typedbytes in Hadoop Streaming? Has anyone ever gotten it to work? If so, could someone post a recipe? It seems like this would be a problem for anyone who wants to use binary data in Hadoop Streaming, which ought to be a fairly broad group.
P.S. I noticed that Dumbo, Hadoopy, and rmr all use this feature, but there ought to be a way to use it directly, without being mediated by a Python-based or R-based framework.
Okay, I've found a combination that works, but it's weird.
Prepare a valid typedbytes file in your local filesystem, following the documentation or by imitating typedbytes.py.
Use
hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb
to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.
Use
hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...
to run a map-reduce job in which the mapper gets input from the SequenceFile. Note that -io typedbytes or -D stream.map.input=typedbytes should not be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries and not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n', like this:
32-bit signed integer, representing length (note: no type character)
block of raw binary with that length: this is the key
'\t' (tab character... why?)
32-bit signed integer, representing length
block of raw binary with that length: this is the value
'\n' (newline character... ?)
If you want to additionally send raw data from mapper to reducer, add
-D stream.map.output=typedbytes -D stream.reduce.input=typedbytes
to your Hadoop command line and format the mapper's output and the reducer's expected input as valid typedbytes. The key-value pairs again alternate, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by key.
The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)
This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at the binary record boundaries, rather than '\n', which is the really important thing, and strong typing between the mapper and the reducer.
We solved the binary data issue by hex-encoding the data at the split level when streaming it down to the mapper. This keeps the operation parallel and efficient, instead of first transforming all of your data before processing on a node.
Apparently there is a patch for a JustBytes IO mode for streaming that feeds a whole input file to the mapper command:
https://issues.apache.org/jira/browse/MAPREDUCE-5018

Hadoop: Mapping binary files

Typically, the input file in a MapReduce job can be partially read and processed by the mapper function (as with text files). Is there anything that can be done to handle binaries (say images or serialized objects) that would require all the blocks to be on the same host before processing can start?
Stick your images into a SequenceFile; then you will be able to process them iteratively, using map-reduce.
To be a bit less cryptic: Hadoop does not natively know anything about text versus not-text. It just has a class that knows how to open an input stream (HDFS handles stitching together blocks on different nodes to make them appear as one large file). On top of that, you have a Reader and an InputFormat that know how to determine where in that stream records start, where they end, and how to find the beginning of the next record if you are dropped somewhere in the middle of the file. TextInputFormat is just one implementation, which treats newlines as the record delimiter. There is also a special format called a SequenceFile that you can write arbitrary binary records into and then get them back out. Use that.
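A hedged sketch of that idea: pack local image files into a SequenceFile keyed by filename (paths are hypothetical), and then read the records back with SequenceFileInputFormat as filename/bytes pairs:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("/data/images.seq"); // example HDFS path

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, output, Text.class, BytesWritable.class);
        try {
            // One record per image: key = filename, value = raw image bytes.
            for (File image : new File("local-images").listFiles()) {
                byte[] bytes = Files.readAllBytes(image.toPath());
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}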

Hadoop 0.2: How to read outputs from TextOutputFormat?

My reducer class produces outputs with TextOutputFormat (the default OutputFormat given by Job). I would like to consume these outputs after the MapReduce job completes in order to aggregate them. In addition, I would like to write out the aggregated information so that it can be read back with TextInputFormat and consumed by the next iteration of the MapReduce task. Can anyone give me an example of how to write and read with these text formats? By the way, the reason I am using the text formats, rather than SequenceFile, is interoperability: the outputs should be consumable by any software.
Don't rule out sequence files just yet; they make it fast and easy to chain MapReduce jobs, and you can use "hadoop fs -text filename" to output them in text format if you need them that way for other things.
But, back to your original question: to use TextInputFormat, set it as the input format in the Job, and then use TextInputFormat.setInputPaths to specify what files it should use as input. The key to your mapper should be a LongWritable, and the value a Text.
To use TextOutputFormat for output, set it as the output format in the Job, and then use TextOutputFormat.setOutputPath to specify the output path. Your reducers (or mappers, if it is a map-only job) need to use NullWritable as the type of the output key to get just the text representation of the values, one per line; otherwise each line will be the text representation of the key and the value separated by a tab (by default; you can change this by setting "mapred.textoutputformat.separator" to a different separator).
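Putting that together, a minimal wiring sketch with hypothetical paths (add your mapper and reducer classes as usual):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TextIoJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-io");
        job.setJarByClass(TextIoJob.class);
        // job.setMapperClass(...) and job.setReducerClass(...) as usual.

        // Read the previous job's text output: key = LongWritable offset, value = Text line.
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.setInputPaths(job, new Path("/data/previous-output"));

        // Write plain text; NullWritable keys give one value per line with no leading tab.
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        TextOutputFormat.setOutputPath(job, new Path("/data/aggregated-output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}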
