How does MapReduce process multiple input files? - hadoop

So I'm writing an MR job to read hundreds of files from an input folder. Since all the files are compressed, instead of using the default TextInputFormat I'm using the WholeFileReadFormat from an online code source.
So my question is: does the Mapper process multiple input files in sequence? I mean, if I have three files A, B and C, and since I'm reading the whole file content as the map input value, will MapReduce process the files in the order of, say, A -> B -> C, meaning that only after finishing A will the Mapper start to process B?
Actually, I'm kind of confused about the concepts of map job and map task. In my understanding, the map job is the same thing as the Mapper, and a map job contains several map tasks; in my case, each map task will read in a single file. But what I don't understand is that I think map tasks are executed in parallel, so all the input files should be processed in parallel, which seems like a paradox...
Can anyone please explain this to me?

Related

Single or multiple files per mapper in hadoop?

Does a mapper process multiple files at the same time, or can a mapper only process a single file at a time? I want to know the default behaviour.
Typical MapReduce jobs use one input split per mapper by default.
If the file is larger than the split size (i.e., it has more than one input split), then multiple mappers process that one file.
It is one file per mapper if the file is not splittable, like a gzip file, or if the process is DistCp, where the file is the finest level of granularity. A sketch of how the split size drives this is below.
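As a minimal, hypothetical sketch (class name and input path are made up) of the split-size rule above: a large splittable text file is cut into roughly 128 MB splits, one mapper each, while a gzip file of any size still arrives as a single split and thus a single mapper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");

            FileInputFormat.addInputPath(job, new Path("/data/input"));

            // Cap splits at 128 MB: a 1 GB splittable text file becomes ~8 map
            // tasks, but a 1 GB .gz file still yields exactly one map task
            // because gzip is not splittable.
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            // ... mapper, reducer and output settings would follow here.
        }
    }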
If you look at the definition of FileInputFormat you will see that it has three methods at the top:
addInputPath(JobConf conf, Path path) - Add a Path to the list of inputs for the map-reduce job. So it will pick up all files in the directory, not just a single one, as you say.
addInputPathRecursively(List result, FileSystem fs, Path path, PathFilter inputFilter) - Add files in the input path recursively into the results.
addInputPaths(JobConf conf, String commaSeparatedPaths) - Add the given comma-separated paths to the list of inputs for the map-reduce job.
Using these three methods you can easily set up any combination of inputs you want. Then the InputFormat splits this data into InputSplits and distributes them among the map tasks. The MapReduce framework relies on the InputFormat of the job to:
Validate the input-specification of the job.
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
So technically a single mapper will process only its own split, which can contain data from several files. But for each particular format you should look into its InputSplit to understand how the data will be distributed across the mappers. A minimal driver setting up multiple inputs is sketched below.
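A minimal driver sketch (paths and class name are hypothetical) using the FileInputFormat methods quoted above to register multiple inputs with the old mapred API; the InputFormat then turns these paths into InputSplits for the map tasks.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class MultiInputDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MultiInputDriver.class);
            conf.setInputFormat(TextInputFormat.class);

            // Picks up every file in the directory, not just a single one.
            FileInputFormat.addInputPath(conf, new Path("/data/logs"));

            // Adds several inputs at once as a comma-separated list.
            FileInputFormat.addInputPaths(conf, "/data/extra1,/data/extra2");

            // ... mapper, reducer and output path would be configured here
            // before submitting the job.
            JobClient.runJob(conf);
        }
    }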

Hadoop streaming with multiple input files

I want to build an inverted index from a set of files with Hadoop using the Streaming API. The documentation always refers to using a file whose lines contain the entries to be fed to the mapper. But in this case I have multiple input files, and I need the mappers to process only one file at a time. Is there a way to accomplish that? For preprocessing reasons I need the input to be like this, and I cannot have the input in the classic line = key, value format that the documentation refers to.
By default a mapper only processes one file, unless you use an input class that allows combining inputs, like CombineFileInputFormat.
Then, if you have 10 files you will end up with 10 mappers, and each of them will process only one file. If you are only using mappers (no reducers), that will result in 10 output files (one for each mapper).
On the other hand, if you have big enough splittable files, it is possible for one file to be processed by several mappers at the same time; the sketch below shows one way to rule that out.
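A minimal sketch (the class name is hypothetical) of guarding against that last case: making the input format non-splittable guarantees exactly one whole file per mapper, even for large files. For a streaming job, a class like this would be passed via the -inputformat option, which is why it extends the old mapred API.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class WholeFilePerMapperInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            // Never split: each input file becomes exactly one InputSplit,
            // and therefore exactly one map task.
            return false;
        }
    }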

how output files(part-m-0001/part-r-0001) are created in map reduce

I understand that the MapReduce output is stored in files named like part-r-* for the reducer and part-m-* for the mapper.
When I run a MapReduce job, sometimes I get the whole output in a single file (size around 150 MB), and sometimes for almost the same data size I get two output files (one 100 MB and the other 50 MB). This seems very random to me. I can't find any reason for this.
I want to know how it is decided whether to put the data in a single output file or multiple output files, and whether there is any way we can control it.
Thanks
Unlike what is specified in the answer by Jijo here - the number of files depends on the number of Reducers/Mappers.
It has nothing to do with the number of physical nodes in the cluster.
The rule is: one part-r-* file for one Reducer. The number of Reducers is set by job.setNumReduceTasks();
If there are no Reducers in your job, then there is one part-m-* file per Mapper. There is one Mapper per InputSplit (usually, unless you use a custom InputFormat implementation, there is one InputSplit per HDFS block of your input data).
The number of output files part-m-* and part-r-* is set according to the number of map tasks and the number of reduce tasks, respectively, as the sketch below illustrates.
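A minimal sketch (the job itself is hypothetical) tying the rule above to code: the number of part-r-* files equals the reducer count you set, and with zero reducers you instead get one part-m-* file per map task.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class OutputFileCountDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "output-file-count");

            // Exactly two reduce tasks -> part-r-00000 and part-r-00001.
            job.setNumReduceTasks(2);

            // job.setNumReduceTasks(0) would skip the reduce phase entirely
            // and leave one part-m-* file per map task instead.
        }
    }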

Writing to single file from mappers

I am working on a MapReduce job that generates a CSV file out of some data that is read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format at the job level; is it possible to do a similar thing for mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce-Job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However, I am not sure how the job output is handled in this case. Usually you get one result file per reducer. Without reducers I could imagine that you either get one file per mapper or that you cannot produce job output. You will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, which just forwards the mapper output (identity function); both options are sketched below.
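A minimal sketch (the job itself is hypothetical) of the two options from the answer above: a map-only job writes one part-m-* file per mapper, while a single default (identity) Reducer funnels everything into one part-r-00000 file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SingleOutputDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "single-output");

            // Option 1: no reduce phase, one output file per mapper.
            // job.setNumReduceTasks(0);

            // Option 2: one identity reducer, one combined output file.
            job.setReducerClass(Reducer.class); // base Reducer just forwards its input
            job.setNumReduceTasks(1);
        }
    }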
Seriously, this is not how MapReduce works.
Why do you even need a Job for that? Write a simple Java application that does the same for you. There are also command line utils that do the same for you.

Correlating input files to output files

I have an MR streaming job. My code is in C++. It's a mapper-only job, with no reducer. The input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.
Input files are like:
MyDir/file1
MyDir/file2
MyDir/file3
Output file are like:
MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002
I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e., the mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.
I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?
One way I can think of is to have the input and output file names of the job be the same. Get the input file name (the map.input.file property) which the mapper is processing and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method; a sketch is below.
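A minimal Java-API sketch (class names are hypothetical) of that approach: the mapper tags each record with its input file name read from the map.input.file property, and a MultipleTextOutputFormat subclass names the output file after that key instead of part-00000, part-00001, etc.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class CorrelateByInputFile {

        public static class TaggingMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            private final Text fileName = new Text();

            @Override
            public void configure(JobConf conf) {
                // Name of the file this map task is processing.
                fileName.set(new Path(conf.get("map.input.file")).getName());
            }

            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> out, Reporter reporter)
                    throws IOException {
                // Tag every record with the file it came from.
                out.collect(fileName, value);
            }
        }

        public static class InputNamedOutputFormat
                extends MultipleTextOutputFormat<Text, Text> {
            @Override
            protected String generateFileNameForKeyValue(Text key, Text value,
                                                         String name) {
                // Write records to an output file named after their input file.
                return key.toString();
            }
        }
    }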
With how Hadoop is designed, the only relationship that you can rely on, without you expressly naming the output files as per the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).
If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.
