isSplitable() method - hadoop

I have a question about isSplitable() in the FileInputFormat class. As I understand the definition, this method restricts the creation of multiple mappers for an input file. But the number of mappers is based on the number of splits of the file: a 160 MB file is broken into 3 splits of, say, 64, 64 and 32 MB, and there will be 3 map tasks, one for each input split. If I override isSplitable() to return false, what exactly does it restrict? Won't there still be 3 mappers processing the file, one per input split?

If you do not want your data file to be split, i.e., you want a single mapper to process the entire file, then extending the MapReduce InputFormat and overriding the isSplitable() method to return false will do that: each file will be processed by exactly one mapper.
Splitting the file and reading the entire file as one chunk are two different things.
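A minimal sketch of what such an override could look like with the newer mapreduce API (the class name here is made up for illustration):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Behaves exactly like TextInputFormat, except that no input file handed to it
// is ever divided into more than one split.
public class WholeFileTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false makes FileInputFormat create exactly one split per file,
        // so a single mapper reads the whole file.
        return false;
    }
}
```

In the driver you would then select it with job.setInputFormatClass(WholeFileTextInputFormat.class).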

Related

Single or multiple files per mapper in hadoop?

Does a mapper process multiple files at the same time, or can a mapper only process a single file at a time? I want to know the default behaviour.
Typical MapReduce jobs process one input split per mapper by default.
If the file size is larger than the split size (i.e., the file has more than one input split), then it is multiple mappers per file.
It is one file per mapper if the file is not splittable, such as a gzip file, or if the process is DistCp, where a file is the finest level of granularity.
If you look at the definition of FileInputFormat you will see that near the top it has three methods:
addInputPath(JobConf conf, Path path) - Add a Path to the list of inputs for the map-reduce job. It will pick up all files in that directory, not just a single one, as you say.
addInputPathRecursively(List result, FileSystem fs, Path path, PathFilter inputFilter) - Add files in the input path recursively into the results.
addInputPaths(JobConf conf, String commaSeparatedPaths) - Add the given comma-separated paths to the list of inputs for the map-reduce job.
Using these three methods (as shown in the sketch below) you can easily set up any combination of inputs you want. Then the InputSplits of your InputFormat divide this data among the mapper tasks. The Map-Reduce framework relies on the InputFormat of the job to:
Validate the input-specification of the job.
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
So technically a single mapper will process only its own part, which can contain data from several files. But for each particular format you should look into its InputSplit to understand how the data will be distributed across the mappers.
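As a rough sketch of how these three methods are typically used in a driver, with the old mapred API that the signatures above come from (the paths are invented for illustration):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MultiInputDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultiInputDriver.class);
        conf.setJobName("multi-input-example");

        // A directory: every file directly under it becomes part of the input.
        FileInputFormat.addInputPath(conf, new Path("/data/logs/day1"));

        // Several locations at once, as a comma-separated list.
        FileInputFormat.addInputPaths(conf, "/data/logs/day2,/data/logs/day3");

        // ... set the mapper, reducer, output path, etc., then submit with JobClient.runJob(conf)
    }
}
```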

How to set split size equal one line in Hadoop's MapReduce Streaming?

Goal: each node, having a copy of the matrix, reads the matrix, calculates some value via mapper(matrix, key), and emits <key, value>
I'm trying to use a mapper written in Python via streaming. There are no reducers.
Essentially, I'm trying to do the task similar to https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html#How_do_I_process_files_one_per_map
Approach: I generated an input file (tasks) in the following format (each line is a file path and a key):
/path/matrix.csv 0
/path/matrix.csv 1
...
/path/matrix.csv 99
Then I run the (Hadoop Streaming) mapper on this tasks file. The mapper parses a line to get its arguments - filename, key; then the mapper reads the matrix from that filename and calculates the value associated with the key; then it emits <key, value>.
Problem: the current approach works and produces correct results, but it does so in one mapper, since the input file is a mere 100 lines of text and does not get split across several mappers. How do I force such a split despite the small input size?
I realized that instead of running several mappers and no reducers, I could do the exact opposite. Now my architecture is as follows:
a thin mapper simply reads the input parameters and emits key, value
fat reducers read the files and execute the algorithm with the received key, then emit results
set -D mapreduce.job.reduces=10 to change the parallelization level
The original approach was a silly (mistaken) one, but the correct one was not obvious either.

Is it possible to know the serial number of the block of input data on which map function is currently working?

I am a novice in Hadoop and I have the following questions:
(1) As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
(2) Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
(3) Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
Any help would be appreciated.
As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
No. A block (a split, to be precise) gets processed by only one mapper.
Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
You can get some valuable info, such as the file containing the split's data and the position in the file of the first byte to process, with the help of the FileSplit class. You might find it helpful.
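For example, something along these lines in the mapper (a sketch for the newer mapreduce API; the key/value types are placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Logs which part of which file this map task was given.
public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // With file-based input formats the split is normally a FileSplit.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("File:   " + split.getPath());
        System.out.println("Start:  " + split.getStart());   // offset of the first byte of this split
        System.out.println("Length: " + split.getLength());  // number of bytes in this split
    }
}
```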
Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
You can do that by extending the FileInputFormat class. To begin with, you could do this:
In your getSplits() method, maintain a counter. As you read the file line by line, keep tokenizing the lines. Collect each token and increase the counter by 1. Once the counter reaches the desired value, emit the data read up to that point as one split. Reset the counter and start building the next split.
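A rough skeleton of that idea (the class name and word threshold are invented, the record reader is left as the stock line reader, and '\n' line endings are assumed when tracking byte offsets):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class WordCountSplitInputFormat extends FileInputFormat<LongWritable, Text> {

    private static final int WORDS_PER_SPLIT = 10000; // illustrative value

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            Path path = status.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            long splitStart = 0;  // byte offset where the current split begins
            long pos = 0;         // byte offset just past the last line read
            int words = 0;
            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = reader.readLine()) != null) {
                pos += line.getBytes("UTF-8").length + 1; // +1 for the newline
                words += new StringTokenizer(line).countTokens();
                if (words >= WORDS_PER_SPLIT) {
                    splits.add(new FileSplit(path, splitStart, pos - splitStart, null));
                    splitStart = pos;
                    words = 0;
                }
            }
            reader.close();
            if (pos > splitStart) { // whatever is left over becomes the last split
                splits.add(new FileSplit(path, splitStart, pos - splitStart, null));
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new LineRecordReader();
    }
}
```

Note that the client has to walk every file once just to place the split boundaries, which is the inefficiency mentioned further below.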
HTH
If you define a small max split size you can actually have multiple mappers processing a single HDFS block (say a 32 MB max split for a 128 MB block size - you'll get 4 mappers working on the same HDFS block). With the standard input formats, you'll typically never see two or more mappers processing the same part of a block (the same records).
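As an illustration, a driver-side sketch using the Hadoop 2 mapreduce API (the job name is arbitrary):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-splits");

        // Cap splits at 32 MB, so a 128 MB HDFS block produces four splits
        // and therefore four map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

        // Equivalent configuration property in Hadoop 2:
        // conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024);
    }
}
```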
MapContext.getInputSplit() can usually be cast to a FileSplit, and then you have the Path, offset and length of the file/block being processed.
If your input files are true text files, then you can use the method suggested by Tariq, but note this is highly inefficient for larger data sources, as the Job Client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format) and write the records down to disk with a fixed number of words per file (using MultipleOutputs to get a file per number of words, but this again is inefficient). Maybe if you shared the use case for why you want a fixed number of words, we could better understand your needs and come up with alternatives.

Get Line number in map method using FileInputFormat

I was wondering whether it is possible to get the line number in my map method?
My input file is just a single column of values like,
Apple
Orange
Banana
Is it possible to get Key: 1, Value: Apple, Key: 2, Value: Orange, ... in my map method?
Using CDH3/CDH4. Changing the input data so as to use KeyValueInputFormat is not an option.
Thanks ahead.
The default behaviour of InputFormats such as TextInputFormat is to give the byte offset of the record rather than the actual line number - this is mainly due to being unable to determine the true line number when an input file is splittable and being processed by two or more mappers.
You could create your own InputFormat (based upon TextInputFormat and its associated LineRecordReader) to produce line numbers rather than byte offsets, but you'd need to configure your input format to return false from the isSplitable() method (meaning that a large input file would not be processed by multiple mappers). If you have small files, or files that are close in size to the HDFS block size, then this shouldn't be a problem. Also, non-splittable compression formats (gzip .gz, for example) mean the entire file will be processed by a single mapper anyway.
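A sketch of what that could look like (the class names are made up; it simply wraps the stock LineRecordReader and counts records):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Non-splittable text format whose keys are 1-based line numbers instead of
// byte offsets. Only safe because isSplitable() returns false, so a single
// reader sees the file from the first line onwards.
public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    public static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final LongWritable lineNumber = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            boolean hasNext = delegate.nextKeyValue();
            if (hasNext) {
                lineNumber.set(lineNumber.get() + 1); // 1, 2, 3, ...
            }
            return hasNext;
        }

        @Override
        public LongWritable getCurrentKey() { return lineNumber; }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}
```

Because the format refuses to split, the counter really does start at line 1 of the file for the single mapper that reads it.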

How can I emit a list of values from a Mapper or Reducer?

I have a file that contains some geophysical data (seismic data). I am reading these files from the local file system and storing them as Hadoop sequence files in HDFS.
Now I want to write a MapReduce job that can read the values from these sequence files and store them into an HBase table. These files are not simply flat files. Instead they consist of many pieces, where each piece is a block of 240 bytes containing several fields. Each field can be either a short or an integer. I am using the block number as the key and the 240-byte byte array (which contains all the fields) as the value of each sequence file record. So each sequence file holds all the blocks as byte arrays, keyed by block number.
My question is: while processing such a file, how can I read each 240-byte block, read the individual fields, and emit all the fields in one shot once a 240-byte block is done? Suppose I have a file that has 1000 blocks. In my MapReduce program I have to read these 1000 blocks one at a time, extract each field (short or int) and emit all the fields as the result of one map call.
I need some help, regarding this.
Just to make sure: you want to read each 240-byte block, emit the block number as the key and the byte array as the value? I think you'll have to extend the default SequenceFileInputFormat. I'm not exactly sure how sequence files work, or what their structure is like (sorry), but I was trying to read the entire contents of a file to emit as the output value, and the way I did it was to extend FileInputFormat. Perhaps you can take a look at the source code for SequenceFileInputFormat and see if there is a way to make an InputSplit every 240 bytes (if your data is structured), or at some delimiter.
Hope this helps!
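If each sequence-file record already holds exactly one 240-byte block, a plain mapper over SequenceFileInputFormat may be all that is needed, rather than a custom input format. A hedged sketch, assuming the key is an IntWritable block number, the value is a BytesWritable, and the field layout shown is invented:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeismicBlockMapper extends Mapper<IntWritable, BytesWritable, Text, LongWritable> {

    @Override
    protected void map(IntWritable blockNumber, BytesWritable block, Context context)
            throws IOException, InterruptedException {
        // getBytes() may return a padded buffer, so limit to getLength().
        ByteBuffer buf = ByteBuffer.wrap(block.getBytes(), 0, block.getLength());

        // Invented field layout, purely to show the slicing pattern:
        int traceSequence = buf.getInt();   // bytes 0-3
        short sampleCount = buf.getShort(); // bytes 4-5
        short sampleRate  = buf.getShort(); // bytes 6-7
        // ... continue pulling shorts/ints until all 240 bytes are consumed

        // Emit each field tagged with the block number; the output schema is
        // whatever your HBase loading step expects.
        context.write(new Text(blockNumber.get() + ":traceSequence"), new LongWritable(traceSequence));
        context.write(new Text(blockNumber.get() + ":sampleCount"), new LongWritable(sampleCount));
        context.write(new Text(blockNumber.get() + ":sampleRate"), new LongWritable(sampleRate));
    }
}
```

With job.setInputFormatClass(SequenceFileInputFormat.class) in the driver, each call to map() receives one block, so there is no need to create an InputSplit every 240 bytes yourself.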
