Hadoop read multiple lines at a time - hadoop

I have a file in which a set of every four lines represents a record.
eg, first four lines represent record1, next four represent record 2 and so on..
How can I ensure Mapper input these four lines at a time?
Also, I want the file splitting in Hadoop to happen at the record boundary (line number should be a multiple of four), so records don't get span across multiple split files..
How can this be done?

A few approaches, some dirtier than others:
The right way
You may have to define your own RecordReader, InputSplit, and InputFormat. Depending on exactly what you are trying to do, you will be able to reuse some of the already existing ones of the three above. You will likely have to write your own RecordReader to define the key/value pair and you will likely have to write your own InputSplit to help define the boundary.
Another right way, which may not be possible
The above task is quite daunting. Do you have any control over your data set? Can you preprocess it in someway (either while it is coming in or at rest)? If so, you should strongly consider trying to transform your dataset int something that is easier to read out of the box in Hadoop.
Something like:
ALine1
ALine2 ALine1;Aline2;Aline3;Aline4
ALine3
ALine4 ->
BLine1
BLine2 BLine1;Bline2;Bline3;Bline4;
BLine3
BLine4
Down and Dirty
Do you have any control over the file sizes of your data? If you manually split your data on the block boundary, you can force Hadoop to not care about records spanning splits. For example, if your block size is 64MB, write your files out in 60MB chunks.
Without worrying about input splits, you could do something dirty: In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
The reason why you have to manually split the data is that you are not going to be guaranteed that an entire 4-row record will be given to the same map task.

Another way (easy but may not be efficient in some cases) is to implement the FileInputFormat#isSplitable(). Then the input files are not split and are processed one per map.
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;
public class NonSplittableTextInputFormat extends TextInputFormat {
#Override
protected boolean isSplitable(FileSystem fs, Path file) {
return false;
}
}
And as orangeoctopus said
In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
This has some overhead for the following reasons
Time to process the largest file drags the job completion time.
A lot of data may be transferred between the data nodes.
The cluster is not properly utilized, since # of maps = # of files.
** The above code is from Hadoop : The Definitive Guide

Related

Is it possible to know the serial number of the block of input data on which map function is currently working?

I am a novice in Hadoop and here I have the following questions:
(1) As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
(2) Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
(3) Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
Any help would be appreciated.
As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
No. A block(split to be precise) gets processed by only one mapper.
Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
You can get some valuable info, like the file containing split's data, the position of the first byte in the file to process. etc, with the help of FileSplit class. You might find it helpful.
Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
You can do that by extending FileInputFormat class. To begin with you could do this :
In your getSplits() method maintain a counter. Now, as you read the file line by line keep on tokenizing them. Collect each token and increase the counter by 1. Once the counter reaches the desired value, emit the data read upto this point as one split. Reset the counter and start with the second split.
HTH
If you define a small max split size you can actually have multiple mappers processing a single HDFS block (say 32mb max split for a 128 MB block size - you'll get 4 mappers working on the same HDFS block). With the standard input formats, you'll typically never see two or more mappers processing the same part of the block (the same records).
MapContext.getInputSplit() can usually be cast to a FileSplit and then you have the Path, offset and length of the file being / block being processed).
If your input files are true text flies, then you can use the method suggested by Tariq, but note this is highly inefficient for larger data sources as the Job Client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format), and write the records down to disk with a fixed number of words per file (using Multiple outputs to get a file per number of words, but this again is inefficient). Maybe if you shared the use case as for why you want a fixed number of words, we can better understand your needs and come up with alternatives

How to go through the OutputFormat.RecordWriter write(key,value) twice in Hadoop

I have a situation where I need to go through the key/value pairs of my OutputFormat twice. In essence:
OutputFormat.getRecordWriter() // returns RecordWriteType1
... and when all those are complete across all machines
OutputFormat.getRecordWriter() // return RecordWriterType2
The typing of both RecordWriterType1/2 are the same. Is there a way to do this?
Thank you,
Marko.
Unfortunately you cannot simply run over the reducer data twice.
You do have some options to possibly work around:
Use an identity reducer to output the sorted data to HDFS, then run two jobs over the data with identity mappers - wasteful but simple if you don't have that much data
As above, but you could use map only jobs and the key comparator to emulate the reducer function as you know the input is already sorted (you'll need to make sure the split size is set sufficiently large to ensure all data from the first reducer output file is processed in a single mapper and not split over 2+ mapper instances
You could write the reducer key/values to local disk in your reducer, and then in the clean up method of the reducer, opening the local file up and process as detailed in the second option (using the group comparator to detemine key boundary).
If you dig through the source for ReduceTask, you may even be able to 'abuse' the merged sorted segments on local disk and run over the data again, but this option is pure unadulterated hackery...

Hadoop with supercsv

I have to process data in very large text files(like 5 TB in size). The processing logic uses supercsv to parse through the data and run some checks on it. Obviously as the size is quite large, we planned on using hadoop to take advantage of parallel computation. I install hadoop on my machine and I start off to write the mapper and reducer classes and I am stuck. Because the map requires a key value pair, so to read this text file I am not sure what should be the key and value in this particular scenario. Can someone help me out with that.
My thought process is something like this (let me know if I am correct)
1) Read the file using superCSV and hadoop generate the supercsv beans for each chunk of file in hdfs.(I am assuming that hadoop takes care of splitting the file)
2) For each of these supercsvbeans run my check logic.
Is the data newline-separated? i.e., if you just split the data on each newline character, will each chunk always be a single, complete record? This depends on how superCSV encodes text, and on whether your actual data contains newline characters.
If yes:
Just use TextInputFormat. It provides you with (I think) the byte offset as the map key, and the whole line as the value. You can ignore the key, and parse the line using superCSV.
If no:
You'll have to write your own custom InputFormat - here's a good tutorial: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat. The specifics of exactly what the key is and what the value is don't matter too much to the mapper input; just make sure one of the two contains the actual data that you want. You can even use NullWritable as the type for one of them.

How to create hadoop input splits that span two files?

My data input files are all of the same length, but, the records therein may span two files (starting at the end of the first file and finishing at the beginning of the second).
Is it possible to create an inputsplit that would allow me to span those two files?
Is it better to create an entirely new set of files so that records do not span more than one file?
I would definitely ensure your records do not span more than one file: you could, theoretically, write your own input format that takes care of this, but the overhead is likely to be considerable as you are - in having to ensure that you know which files belong together - taking over part of the responsiblity which the jobtracker and name node fulfill for you.
You should be free to tell the jobtracker/name node where the inputs are, and for the processing to be truly parallel, you don't want to then have to take back some of that control: IMHO it would partially defeat the object of using haoop in the first place.

Hadoop: Mapping binary files

Typically in a the input file is capable of being partially read and processed by Mapper function (as in text files). Is there anything that can be done to handle binaries (say images, serialized objects) which would require all the blocks to be on same host, before the processing can start.
Stick your images into a SequenceFile; then you will be able to process them iteratively, using map-reduce.
To be a bit less cryptic: Hadoop does not natively know anything about text and not-text. It just has a class that knows how to open an input stream (hdfs handles sticthing together blocks on different nodes, to make them appear as one large files). On top of that, you have an Reader and an InputFormat that knows how to determine where in that stream records start, where they end, and how to find the beginning of the next record if you are dropped somewhere in the middle of the file. TextInputFormat is just one implementation, which treats newlines as record delimiter. There is also a special format called a SequenceFile that you can write arbitrary binary records into, and then get them back out. Use that.

Resources