order of execution of different components/actions in map-reduce - hadoop

What is the order of execution of the actions/components in map-reduce -
Mapper --> Combiner --> Shuffling/Sorting --> Partitioner --> Reducer
Is this order correct?

The order is almost correct, but let's go through it in more detail.
The map phase starts first, by running map.
Once map has processed its input, the output is sorted before it is saved to the local file system; this is the sort step. Partitioning also happens on the map side, before the sort, and the combiner (if one is configured) runs on the sorted map output, so the order is closer to Mapper --> Partitioner --> Sort --> Combiner --> Shuffle --> Merge --> Reducer.
The sorted data is then copied to the reducers, which is the shuffle phase.
Since the output of each mapper is already sorted, the node running the reducer only has to merge-sort the incoming data by key.
Once the merge is done, the data is ready to enter the reduce phase. Whether a reduce phase runs at all depends on how you configure the job.
We can also set the number of reducers to zero. In that case all the map output is written directly to the output path, on either the local file system or HDFS.
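To make the ordering concrete, here is a toy, single-process word count in plain Java. It is only a sketch and not Hadoop code: the class and method names are made up for illustration, a TreeMap stands in for the sort, and summing locally stands in for a combiner. It assumes the usual arrangement in which partitioning and combining happen on the map side, before the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of map -> partition -> sort -> combine -> shuffle -> merge -> reduce. */
final class PipelineSketch {

    /** Hash partitioning: which reducer a key is sent to. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Map side for one input split. Returns one sorted (word -> partial count)
     * map per reducer partition, i.e. what would sit on the mapper's local disk.
     */
    static List<TreeMap<String, Integer>> mapSide(String split, int numReducers) {
        List<TreeMap<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>()); // TreeMap keeps keys sorted
        }
        for (String word : split.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // map emits (word, 1); merge() sums locally, like a combiner would
            partitions.get(partition(word, numReducers)).merge(word, 1, Integer::sum);
        }
        return partitions;
    }

    /**
     * Reduce side for one partition: fetch that partition from every mapper
     * (shuffle), merge the sorted runs, and sum the counts per key (reduce).
     */
    static TreeMap<String, Integer> reduceSide(
            List<List<TreeMap<String, Integer>>> mapOutputs, int partitionIndex) {
        TreeMap<String, Integer> merged = new TreeMap<>();
        for (List<TreeMap<String, Integer>> oneMapper : mapOutputs) {
            for (Map.Entry<String, Integer> e : oneMapper.get(partitionIndex).entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    /** Runs the toy job and returns the reducer outputs, concatenated in reducer order. */
    static String run(List<String> splits, int numReducers) {
        List<List<TreeMap<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(mapSide(split, numReducers));
        }
        StringBuilder out = new StringBuilder();
        for (int p = 0; p < numReducers; p++) {
            out.append(reduceSide(mapOutputs, p));
        }
        return out.toString();
    }
}
```

Calling PipelineSketch.run with two splits and one reducer shows every key arriving at the single reducer in sorted order; with more reducers, each reducer only sees the keys of its own partition.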
Hope it helps!

Related

How does MapReduce process multiple input files?

I'm writing an MR job that reads hundreds of files from an input folder. Since all the files are compressed, I am using the WholeFileReadFormat from an online code source instead of the default TextInputFormat.
My question is: does the Mapper process multiple input files in sequence? That is, if I have three files A, B and C, and I'm reading the whole file content as the map input value, will MapReduce process the files in the order A->B->C, so that the Mapper starts on B only after it is done with A?
Actually, I'm confused about the concepts of Map job and Map task. In my understanding, the Map job is the same thing as the Mapper, and a map job contains several map tasks; in my case, each map task reads a single file. But I also think map tasks are executed in parallel, so all the input files should be processed in parallel, which looks like a paradox.
Can anyone please explain this to me?

how output files(part-m-0001/part-r-0001) are created in map reduce

I understand that MapReduce output is stored in files named like part-r-* for reducers and part-m-* for mappers.
When I run a MapReduce job, sometimes I get the whole output in a single file (around 150 MB), and sometimes, for almost the same data size, I get two output files (one of 100 MB and the other of 50 MB). This seems very random to me, and I can't find any reason for it.
I want to know how it is decided whether the data goes into a single output file or multiple output files, and whether there is any way to control it.
Thanks
Contrary to what the answer by Jijo here says, the number of files depends on the number of Reducers/Mappers.
It has nothing to do with the number of physical nodes in the cluster.
The rule is: one part-r-* file per Reducer. The number of Reducers is set by job.setNumReduceTasks();
If there are no Reducers in your job, there is one part-m-* file per Mapper. There is one Mapper per InputSplit (and usually, unless you use a custom InputFormat implementation, there is one InputSplit per HDFS block of your input data).
The number of output files part-m-* and part-r-* is set according to the number of map tasks and the number of reduce tasks respectively.
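As a rough illustration of the rule above, the sketch below maps a key to the part-r-* file it would end up in. It is plain Java rather than Hadoop code, and the class and method names are hypothetical. The partition formula is the one used by Hadoop's default hash partitioner, but String.hashCode() is only a stand-in for the hash of a real Writable key, so the actual file a given key lands in may differ.

```java
import java.util.Locale;

/** Sketch: which output file a key ends up in, given the number of reduce tasks. */
final class OutputFileSketch {

    /** Default-style hash partitioning; String.hashCode() is an assumption for illustration. */
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** part-r-NNNNN for a reduce task, part-m-NNNNN for a map task in a map-only job. */
    static String partFileName(boolean isReduce, int taskIndex) {
        return String.format(Locale.ROOT, "part-%s-%05d", isReduce ? "r" : "m", taskIndex);
    }

    /** One file per reducer, so the key's partition number picks the file. */
    static String fileForKey(String key, int numReduceTasks) {
        return partFileName(true, partitionFor(key, numReduceTasks));
    }
}
```

With a single reduce task every key maps to part-r-00000, which matches the "one output file" case; with two reduce tasks the keys are spread over part-r-00000 and part-r-00001, and the file sizes depend on how the keys happen to hash.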

hadoop get actual number of mappers

In the map phase of my program, I need to know the total number of mappers that are created. This will help me in the key creation process of the map (I want to emit as many key-value pairs for each object as the number of mappers).
I know that setting the number of mappers is just a hint, but how can I get the actual number of mappers?
I tried the following in the configure method of my Mapper:
public void configure(JobConf conf) {
    // print what the job configuration reports for this task
    System.out.println("map tasks: " + conf.get("mapred.map.tasks"));
    System.out.println("tipid: " + conf.get("mapred.tip.id"));
    System.out.println("taskpartition: " + conf.get("mapred.task.partition"));
}
But I get the results:
map tasks: 1
tipid: task_local1204340194_0001_m_000000
taskpartition: 0
map tasks: 1
tipid: task_local1204340194_0001_m_000001
taskpartition: 1
which means (?) that there are two map tasks, not just one as printed (which is quite natural, since I have two small input files). Shouldn't the number after "map tasks" be 2?
For now, I just count the number of files in the input folder, but this is not a good solution, since a file could be larger than the block size and result in more than one input split, and hence more than one mapper. Any suggestions?
Finally, it seems that conf.get("mapred.map.tasks") DOES work after all when I generate an executable jar file and run my program on the cluster or locally. Now the output of "map tasks" is correct.
It failed only when I ran my MapReduce program locally on Hadoop from the Eclipse plugin. Maybe it is an issue with the Eclipse plugin.
I hope this will help someone else having the same issue. Thank you for your answers!
I don't think there is an easy way to do this. I've implemented my own InputFormat class; if you do that, you can add a method that counts the number of InputSplits and call it in the process that starts the job. If you put that number in some Configuration setting, you can read it in your mapper process.
By the way, the number of input files is not always the number of mappers, as large files can be split.
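To show why counting files is not enough, here is a small sketch that estimates the number of splits (and hence mappers) from file sizes. It is plain Java with hypothetical names, modelled on my understanding of how the stock file-based input formats cut splittable files: full-size splits are taken while the remainder is more than 1.1 times the split size, and the tail becomes the last split. Treat the 1.1 factor and the assumption that the split size equals the block size as things to verify against your Hadoop version and settings; the sketch also ignores non-splittable (for example, some compressed) files, which would yield one split each.

```java
/** Sketch: estimate how many input splits a set of splittable files produces. */
final class SplitCountSketch {

    /** Assumed slack factor: a tail up to 10% larger than a split is not cut again. */
    static final double SPLIT_SLOP = 1.1;

    /** Number of splits for one splittable file of positive length. */
    static int countSplits(long fileLength, long splitSize) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int splits = 0;
        long remaining = fileLength;
        // take full-size splits while the rest is clearly bigger than one split
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // whatever is left becomes the final split
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    /** Total over all input files; this is the estimated number of map tasks. */
    static int totalSplits(long[] fileLengths, long splitSize) {
        int total = 0;
        for (long length : fileLengths) {
            total += countSplits(length, splitSize);
        }
        return total;
    }
}
```

The driver could compute such a number before submitting the job and store it in the job configuration under a key of its own choosing, as suggested above, so that each mapper can read it back.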

How to go through the OutputFormat.RecordWriter write(key,value) twice in Hadoop

I have a situation where I need to go through the key/value pairs of my OutputFormat twice. In essence:
OutputFormat.getRecordWriter() // returns RecordWriterType1
... and when all those are complete across all machines
OutputFormat.getRecordWriter() // returns RecordWriterType2
The key/value types of RecordWriterType1 and RecordWriterType2 are the same. Is there a way to do this?
Thank you,
Marko.
Unfortunately you cannot simply run over the reducer data twice.
You do have some options for working around this:
Use an identity reducer to output the sorted data to HDFS, then run two jobs over the data with identity mappers. This is wasteful but simple if you don't have that much data.
As above, but use map-only jobs and the key comparator to emulate the reducer function, since you know the input is already sorted. (You'll need to make sure the split size is set large enough that all the data from the first reducer's output file is processed in a single mapper and not split over 2+ mapper instances.)
You could write the reducer key/values to local disk in your reducer, and then, in the cleanup method of the reducer, open the local file and process it as described in the second option (using the group comparator to determine key boundaries).
If you dig through the source for ReduceTask, you may even be able to 'abuse' the merged sorted segments on local disk and run over the data again, but this option is pure unadulterated hackery...
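The local-disk option can be sketched in plain Java as follows. This is not Hadoop code: the class and method names are hypothetical, the first loop stands in for what the reducer would do on each call, and the second loop stands in for the work done in the cleanup method. It assumes keys and values are strings without tabs or newlines, and it uses simple key equality where a real job would use its grouping comparator.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: spool sorted key/value pairs to local disk, then read them a second time. */
final class ReplaySketch {

    /** Returns the number of distinct key groups found during the second pass. */
    static int countKeyGroups(String[][] sortedRecords) {
        Path spool = null;
        try {
            spool = Files.createTempFile("reduce-spool", ".tsv");

            // Pass 1: what the reducer would do as each key/value pair arrives.
            try (BufferedWriter writer = Files.newBufferedWriter(spool, StandardCharsets.UTF_8)) {
                for (String[] kv : sortedRecords) {
                    writer.write(kv[0] + "\t" + kv[1]);
                    writer.newLine();
                }
            }

            // Pass 2: what cleanup would do; a new group starts whenever the key changes.
            int groups = 0;
            String previousKey = null;
            try (BufferedReader reader = Files.newBufferedReader(spool, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String key = line.substring(0, line.indexOf('\t'));
                    if (!key.equals(previousKey)) {
                        groups++;
                        previousKey = key;
                    }
                }
            }
            return groups;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (spool != null) {
                try {
                    Files.deleteIfExists(spool); // do not leave the spool file behind
                } catch (IOException ignored) {
                    // best effort only
                }
            }
        }
    }
}
```

Because the records are spooled in the order the reducer received them, the second pass sees them in the same sorted order, which is what makes the key-boundary check sufficient.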

Intermediate output when a reducer is specified

I've written a Hadoop MapReduce job. When I run it locally, I notice that if I don't specify any reduce tasks, some temporary files are written to the output directory. If I specify reducers, no temporary files are written. Is this normal behavior? I would expect to see the temporary files written in that case too; otherwise it would mean that the mapper is trying to do everything in memory and then transfer the data to the reducer in memory. This strikes me as implausible.
Any insights into how/when/where the mapper writes intermediate output to the file system would be appreciated.
Thanks
Map tasks write their output to the local disk, not to HDFS. Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill.
But if we set the number of reducers to 0, the map output is stored on HDFS as the final output. There is no reduce phase, so the output of the mappers is the output of the whole job.
Additionally here is how to look into intermediate files even if reducer is specified.

Resources