What if we only have one reducer - hadoop

As we know, Hadoop tends to launch the reducer on one of the machines where the corresponding mappers ran. What if we have 100 mappers and 1 reducer? We know that each mapper stores its output on local disk; will all the mapped data be transferred to the single reducer?

Yes, if the reducer is only one, all the data will be transferred to that reducer.
Each mapper initially stores its output in an in-memory buffer (100 MB by default), and when the buffer fills to the percentage defined by io.sort.spill.percent, the contents are spilled to disk in the directories defined by mapred.local.dir.
These files are copied to the reducer during the copy phase, in which the reducer fetches map outputs using mapred.reduce.parallel.copies parallel threads (5 by default).
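For reference, these knobs live in the job configuration. A sketch using the classic (pre-YARN) property names mentioned above, with their usual defaults (newer Hadoop releases renamed them, e.g. mapreduce.reduce.shuffle.parallelcopies):

```xml
<!-- Classic (Hadoop 1.x) property names; values shown are the defaults -->
<property>
  <name>io.sort.mb</name>
  <value>100</value>            <!-- in-memory map output buffer size, in MB -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value>           <!-- spill to mapred.local.dir when 80% full -->
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>5</value>              <!-- parallel fetch threads in the copy phase -->
</property>
```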

If you fix the number of reducers to one (by job.setNumReduceTasks(1) or -Dmapred.reduce.tasks=1), then all data from the mappers will be transferred to that one reducer, which will process all keys.

If you have only 1 reducer then all the data gets transferred to that reducer, and all the output will be stored in HDFS as a single file.
If you do not specify the number of reducers, the default number of reducers that run is one.
You can set the number of reducers using job.setNumReduceTasks(n), and if you are using ToolRunner you can set the number of reducers from the command line itself:
-Dmapred.reduce.tasks=4

Related

Map task results in case of no reducer

While a MapReduce job runs, the map task results are stored in the local file system, and the final results from the reducer are stored in HDFS. The questions are:
1) What is the reason that map task results are stored in the local file system?
2) In the case of a MapReduce job where there is no reduce phase (only a map phase exists), where is the final result stored?
1) Mapper output is stored in the local FS because, in most scenarios, we are only interested in the output of the Reducer phase (also known as the final output). The Mapper's <K,V> pairs are intermediate output, which is of little importance once passed to the Reducer. If we stored Mapper output in HDFS, it would be a waste of storage, because HDFS has a replication factor (3 by default), and hence data that is not required for any further processing would take 3 times the space.
2) In the case of a map-only job, the final output is stored in HDFS.
1) After the TaskTracker (TT) has run the mapper logic, and before sending the output to the sort and shuffle phase, the TT stores the output in temporary files on the local file system (LFS).
This is to avoid re-running the entire MR job in case of a network glitch: once stored on the LFS, the mapper output can be picked up directly from there. This data is called intermediate data.
The intermediate data is deleted once the job is completed; otherwise the LFS would grow in size with intermediate data from different jobs as time progresses.
Keeping the output on the local file system applies only to the Mapper phase, not to the sort & shuffle or Reducer phases.
2) When there is no reducer phase, the intermediate data is instead written to HDFS as the final output.
What is the reason that map task results being stored in local file system ?
Mapper output is temporary output and is relevant only to the Reducer. Storing temporary output in HDFS (with its replication factor) would be overkill. For this reason, the Hadoop framework stores the output of the Mapper in the local file system instead of HDFS. This saves a lot of disk space.
One more important point from Apache tutorial page :
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The Mapper outputs are sorted and then partitioned per Reducer
In the case of map reduce job where there is no reduce phase(only map phase exist) where is the final result stored ?
You can find more details about this query on the Apache tutorial page, under Reducer NONE:
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
If the number of Reducers is greater than 0, mapper outputs are stored in the local file system and sorted before being sent to the Reducers. If the number of Reducers is 0, then mapper outputs are stored in HDFS without sorting.

hadoop job output files

I currently have one Hadoop Oozie job running. The output files are automatically generated. The expected number of output files is just ONE; however, there are two output files, called part-r-00000 and part-r-00001. Sometimes the first one (part-r-00000) has data and the second one (part-r-00001) doesn't; sometimes the second one has data and the first one doesn't. Can anyone tell me why? Also, how do I get the output into a single part-r-00000 file?
In Hadoop, the output files are a product of the Reducers (or Mappers if it's a map-side only job, in which case it will be a part-m-xxxxx file). If your job uses two reducers, that means that after each has finished with its portion, it will write to the output directory in the form of part-r-xxxxx, where the numbers denote which reducer wrote it out.
That said, you cannot specify a single output file, but only the directory. To get all of the files from the output directory into a single file, use:
hdfs dfs -getmerge <src> <localdst> [addnl]
Or if you're using an older version of hadoop:
hadoop fs -getmerge <src> <localdst> [addnl]
See the shell guide for more info.
As to why one of your output files is empty: data is routed from Mappers to Reducers by the partitioner (and records are grouped within a reducer by the grouping comparator). If you specify two reducers but all of the data falls into a single partition, one reducer receives no data and writes an empty file. Alternatively, if some logic within the reducer prevents a write operation, that is another reason a reducer may produce no output.
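To make the partitioning point concrete, here is the logic of Hadoop's default HashPartitioner, reproduced as a plain-Java sketch (no Hadoop dependencies): when every record carries the same key, every record hashes to the same partition, so the other reducer receives nothing and emits an empty part-r file.

```java
public class PartitionDemo {
    // Mirrors HashPartitioner.getPartition: (hashCode & MAX_VALUE) % numReduceTasks
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;
        // All records share one key -> all land in the same partition;
        // the other reducer gets no input and writes an empty output file.
        int p = getPartition("only-key", reducers);
        System.out.println("all records go to reducer " + p);
    }
}
```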
The output files are by default named part-x-yyyyy where:
x is either 'm' or 'r', depending on whether this file was generated by a map or reduce task
yyyyy is the mapper or reducer task number (zero based)
The number of tasks has nothing to do with the number of physical nodes in the cluster. For map task output the number of tasks is given by the input splits. Usually the reducer task are set with job.setNumReduceTasks() or passed as input parameter.
A job which has 100 reducers will have files named part-r-00000 to part-r-00099, one for each reducer task.
A map-only job with 100 input splits will have files named part-m-00000 to part-m-00099, one for each map task.
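The naming scheme and the split count above can be sketched in plain Java. Note the split calculation is a simplification: the real FileInputFormat also honors minimum/maximum split size settings, not just the block size.

```java
public class OutputNaming {
    // part-<m|r>-NNNNN, zero-padded to five digits
    static String partFileName(boolean mapOnly, int taskId) {
        return String.format("part-%s-%05d", mapOnly ? "m" : "r", taskId);
    }

    // Simplified split count: one split per HDFS block (ceiling division)
    static long numSplits(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(partFileName(false, 0));   // part-r-00000
        System.out.println(partFileName(true, 99));   // part-m-00099
        // e.g. a 1 GB file with 128 MB blocks -> 8 splits -> 8 map tasks
        System.out.println(numSplits(1024L << 20, 128L << 20));
    }
}
```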
The number of files output is dependent on the number of mappers and reducers. In your case, the number of files and names of files indicates that your output came from 2 reducers.
How to limit the number of mappers or reducers depends on your language (Hive, Java, etc.), but each has a property you can set to limit these. See here for Java MapReduce jobs.
Files can be empty if that particular mapper or reducer task had no resulting data on the given data node.
Finally, I don't think you want to limit your mappers and reducers; that would defeat the point of using Hadoop. If you're aiming to read all the files as one, make sure they are consolidated in a given directory and pass the directory as the input path; the files will be treated as one.

Mapreduce: writing from both mapper and reducer in single job

I need to send only selected records from the mapper to the reducer, and write the rest of the filtered records to HDFS from the mapper itself. The reducer will write the records sent to it. My job processes huge data, in the 20 TB range, and uses 30K mappers, so I believe I cannot write from the mapper's cleanup method either, because loading that data from the 30K mappers' output files (30K files) would be another problem for the next job. I am using CDH4. Has anyone implemented a similar scenario with a different approach?
When you want to write the data to HDFS, is it through a Java client to HDFS? If yes, then you can write conditional logic in the mapper: records meeting the condition are written to the job output location, from where the reducer picks them up, while records not meeting the condition are written by the mapper directly to a separate HDFS location.
By default the job output location is also an HDFS location, but you have to decide which way you want the data to land in HDFS for your case.

Hadoop: Where the mapper output stored (local or HDFS) if no reducer job available?

In Hadoop, all mapper outputs are stored on local disk (not in HDFS). It is also possible for a Hadoop job to have zero reducers.
In that case, will the mapper output still be stored on local disk? What about reliability if the output is stored on local disk? Is there any way to store the mapper output in HDFS when no reducer is available?
Thanks and Regards,
KB Devaraj
An MR job can be defined with no reducer. In this case, all the mappers write their outputs under the specified job output directory in HDFS, so there will be no sorting and no partitioning.
Just set the number of reduces to 0.
job.setNumReduceTasks(0);
The number of output files will then be equal to the number of mappers, and the files will be named part-m-00000, part-m-00001, and so on.
Once the number of reduce tasks is set to zero, the result will be unsorted.
If we do not set this property in the configuration, a single identity Reducer runs by default, which simply emits each value along with its incoming key, and the output file will be part-r-00000.

Write Reducer output of a Mapreduce job to a single File

I have written a MapReduce job for data in HBase. It contains multiple mappers and just a single reducer. The reducer takes in the data supplied by the mappers and does some analysis on it. After the processing is complete for all the data in HBase, I want to write the result back to a file in HDFS through the single reducer. Presently I am able to write data to HDFS each time a new record arrives, but I can't figure out how to write the final conclusion to HDFS only at the end.
So, if you are trying to write a final result from a single reducer to HDFS, you can try any one of the approaches below:
1. Use the Hadoop FileSystem API's create() method to write to HDFS from the reducer.
2. Emit a single key and value from the reducer after the final calculation.
3. Override the Reducer's cleanup() method and do point (1) there.
Details on 3:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html#cleanup-org.apache.hadoop.mapreduce.Reducer.Context-
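The pattern behind points (2) and (3) — accumulate across all key groups in reduce(), then emit a single final value in cleanup() — can be sketched without Hadoop dependencies. The class below is hypothetical; a real implementation would extend org.apache.hadoop.mapreduce.Reducer and, in cleanup(), write the final value via context.write() or the FileSystem create() call from point (1).

```java
import java.util.Arrays;

// Plain-Java sketch of "aggregate in reduce(), emit once in cleanup()".
public class FinalResultReducerSketch {
    private long runningTotal = 0;

    // Analogous to Reducer#reduce: called once per key group, only accumulates
    void reduce(String key, Iterable<Long> values) {
        for (long v : values) runningTotal += v;
    }

    // Analogous to Reducer#cleanup: called once after all groups,
    // which is where the single final result would be written out
    long cleanup() {
        return runningTotal;
    }

    public static void main(String[] args) {
        FinalResultReducerSketch r = new FinalResultReducerSketch();
        r.reduce("a", Arrays.asList(1L, 2L));
        r.reduce("b", Arrays.asList(3L));
        System.out.println("final result: " + r.cleanup()); // 6
    }
}
```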
Hope this helps.

Resources