Hadoop job output files

I currently have one Hadoop Oozie job running. The output files are generated automatically. The expected number of output files is just ONE; however, there are two output files, called part-r-00000 and part-r-00001. Sometimes the first one (part-r-00000) has data and the second one (part-r-00001) doesn't; sometimes the second one has data and the first one doesn't. Can anyone tell me why? Also, how do I set the output to a single part-r-00000 file?

In Hadoop, the output files are a product of the Reducers (or of the Mappers if it's a map-only job, in which case they will be part-m-xxxxx files). If your job uses two reducers, then after each one has finished with its portion, it writes to the output directory as part-r-xxxxx, where the number denotes which reducer wrote it.
That said, you cannot specify a single output file, only the output directory. To get all of the files from the output directory into a single local file, use:
hdfs dfs -getmerge <src> <localdst> [addnl]
Or, if you're using an older version of Hadoop:
hadoop fs -getmerge <src> <localdst> [addnl]
See the shell guide for more info.
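If you'd rather do the merge from the driver itself, a hedged alternative for Hadoop 2.x is FileUtil.copyMerge (it was removed in Hadoop 3.0); the class name and paths below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Sketch: a programmatic equivalent of getmerge for Hadoop 2.x.
public class MergeOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);
        // Concatenate every file under /user/out into one local file.
        FileUtil.copyMerge(hdfs, new Path("/user/out"),
                           local, new Path("/tmp/merged.txt"),
                           false /* keep the source files */, conf, null);
    }
}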
As to why one of your output files is empty: data is routed from Mappers to Reducers by the partitioner. If you specify two reducers but the partitioner sends every key to the same partition, one reducer receives no data and writes an empty file. Alternatively, if some logic within the reducer prevents a write operation, that's another reason a reducer may produce no output.
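To make that concrete, here is a deliberately skewed, purely illustrative partitioner; registered with job.setPartitionerClass(...) on a job with two reducers, it reproduces exactly this symptom, because reducer 1 never receives a record and writes an empty part-r-00001:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner that routes every key to partition 0.
public class AllToOnePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return 0; // every key lands on the first reducer
    }
}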

The output files are by default named part-x-yyyyy where:
x is either 'm' or 'r', depending on whether this file was generated by a map or reduce task
yyyyy is the mapper or reducer task number (zero based)
The number of tasks has nothing to do with the number of physical nodes in the cluster. For map tasks, the count is determined by the number of input splits. The number of reduce tasks is usually set with job.setNumReduceTasks() or passed as an input parameter.
A job which has 100 reducers will have files named part-r-00000 to part-r-00099, one for each reduce task.
A map-only job with 100 input splits will have files named part-m-00000 to part-m-00099, one for each map task.
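As a small illustration (the class and job names are invented, and the mapper/reducer setup is omitted), setting three reduce tasks yields exactly three part-r files, regardless of how many nodes the cluster has:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NamingDemo {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration(), "naming-demo");
        // 3 reduce tasks -> part-r-00000 through part-r-00002 in the output dir
        job.setNumReduceTasks(3);
    }
}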

The number of files output is dependent on the number of mappers and reducers. In your case, the number of files and names of files indicates that your output came from 2 reducers.
Limiting the number of mappers or reducers depends on your language (Hive, Java, etc.), but each has a property you can set. See here for Java MapReduce jobs.
Files can be empty if that particular map or reduce task produced no output records.
Finally, I don't think you want to limit your mappers and reducers; that would defeat the point of using Hadoop. If you're aiming to read all of the files as one, make sure they are consolidated in a single directory and pass the directory as the input path; the files will be treated as one.
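For example, a follow-up job can be pointed at the whole output directory; the fragment below is a sketch that assumes an existing Job and uses /user/out as a placeholder path:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DirAsInput {
    static void wire(Job job) throws IOException {
        // Every part-* file inside /user/out becomes input to this job.
        FileInputFormat.addInputPath(job, new Path("/user/out"));
    }
}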

Related

Map task results in case of no reducer

While a MapReduce job runs, the map task results are stored in the local file system, and the final results from the reducer are stored in HDFS. The questions are:
What is the reason that map task results are stored in the local file system?
In the case of a MapReduce job where there is no reduce phase (only a map phase), where is the final result stored?
1) Mapper output is stored in the local file system because, in most scenarios, we are interested in the output of the Reducer phase (which is also known as the final output). The Mapper's <K,V> pairs are intermediate output, which is of little importance once passed to the Reducer. If we stored Mapper output in HDFS, it would be a waste of storage, because HDFS has a replication factor (3 by default), so the data would take three times the space even though it is not required for any further processing.
2) In the case of a map-only job, the final output is stored in HDFS.
1) After the TaskTracker (TT) has finished running the mapper logic, and before sending the output to the Sort and Shuffle phase, the TT stores the output in temporary files on the local file system (LFS).
This avoids having to restart the entire MR job in case of a network glitch: once stored in the LFS, the mapper output can be picked up directly from there. This data is called intermediate data, and the concept is called Data Localization.
The intermediate data is deleted once the job completes; otherwise, the LFS would grow in size over time with intermediate data from different jobs.
Data Localization applies only to the Mapper phase, not to the Sort & Shuffle or Reducer phases.
2) When there is no reducer phase, the intermediate data is instead pushed onto HDFS as the final output.
What is the reason that map task results are stored in the local file system?
Mapper output is temporary output and is relevant only to the Reducer. Storing temporary output in HDFS (with its replication factor) would be overkill, so the Hadoop framework stores the Mapper output in the local file system rather than HDFS. This saves a lot of disk space.
One more important point from the Apache tutorial page:
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The Mapper outputs are sorted and then partitioned per Reducer
In the case of a MapReduce job where there is no reduce phase (only a map phase), where is the final result stored?
You can find more details about this in the Apache tutorial page.
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
If the number of Reducers is greater than 0, mapper outputs are stored in the local file system and sorted before being sent to the Reducers. If the number of Reducers is 0, the mapper outputs are stored in HDFS without sorting.

is there a way to collect output from reducer in hadoop?

Is there a way to collect the output from a reducer and prevent it from writing to file? I'd like to sort the reduced output before writing to file.
No, there is no way to do that: a MapReduce job must finish by writing its results to files.
If I understand correctly, you want to sort the reducer output in a certain way instead of the default sorting by the keys passed to the reducer.
You have two possible ways to do this:
1) Change the output key in the Map phase to one by which your data should be sorted in the Reduce phase.
2) If the first way is impossible, sort the reducer output with another MapReduce job or a different tool. You can start the sorting job right after the main job, from the same driver, by specifying the output directory of the main job as the input directory for the sorting job, as in the sketch below.
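A minimal driver sketch of that second option; the paths and job names are placeholders, and the mapper/reducer classes (which would define the new sort key) are omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job mainJob = Job.getInstance(conf, "main-job");
        FileInputFormat.addInputPath(mainJob, new Path("/user/in"));
        FileOutputFormat.setOutputPath(mainJob, new Path("/user/main-out"));
        if (!mainJob.waitForCompletion(true)) System.exit(1);

        // The sorting job reads the main job's output; choosing its map
        // output key controls the order of the final result.
        Job sortJob = Job.getInstance(conf, "sort-job");
        FileInputFormat.addInputPath(sortJob, new Path("/user/main-out"));
        FileOutputFormat.setOutputPath(sortJob, new Path("/user/sorted-out"));
        System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
    }
}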

Hadoop: Where is the mapper output stored (local or HDFS) if no reducer is available?

In Hadoop, all mapper outputs are stored on local disk (not in HDFS), and it is possible for a Hadoop job to have zero reducers.
In that case, will the mapper output still be stored on local disk? What about reliability if the output is stored on local disk? Is there any way to store the mapper output in HDFS if no reducer is available?
An MR job can be defined with no reducer. In that case, all the mappers write their outputs under the specified job output directory in HDFS, and there is no sorting and no partitioning.
Just set the number of reducers to 0:
job.setNumReduceTasks(0);
The number of output files will then equal the number of mappers, and the files will be named part-m-00000, part-m-00001, and so on.
Once the reducer count is set to zero, the result will also be unsorted.
If you do not set this property in the configuration, a single identity Reducer runs by default, simply emitting each incoming key and value unchanged, and the output file will be part-r-00000.
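A minimal map-only driver along these lines (the class name, job name, and paths are made up; a real job would also set its mapper and output types):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-demo");
        job.setJarByClass(MapOnlyDriver.class);
        job.setNumReduceTasks(0); // no reduce phase: no sort, no shuffle
        FileInputFormat.addInputPath(job, new Path("/user/in"));
        FileOutputFormat.setOutputPath(job, new Path("/user/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}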

What if we only have one reducer

As we know, Hadoop tends to launch the reducer on a machine where a corresponding mapper ran. What if we have 100 mappers and 1 reducer? We know that mappers store their data on local disk; will all of the mapped data be transferred to the single reducer?
Yes, if there is only one reducer, all the data will be transferred to that reducer.
Each mapper initially stores its output in a local buffer (100 MB by default); when the buffer fills to the percentage defined by io.sort.spill.percent, the contents are spilled to disk at the location defined by mapred.local.dir.
These spill files are copied to the reducer during the copy phase, in which each mapper's output is fetched by mapred.reduce.parallel.copies parallel threads (5 by default).
If you fix the reducer count at one (via job.setNumReduceTasks(1) or -Dmapred.reduce.tasks=1), then all data from the mappers will be transferred to that one reducer, which will process every key.
If you have only 1 reducer, then all the data gets transferred to that reducer and all the output is stored in HDFS as a single file.
If you do not specify the number of reducers, the default is one.
You can set the number of reducers with job.setNumReduceTasks(n), and if you are using ToolRunner you can set it from the command line itself:
-Dmapred.reduce.tasks=4
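A bare-bones ToolRunner driver might look like the sketch below (the class name is hypothetical and the mapper/reducer setup is omitted). Because ToolRunner parses the generic options, placing -Dmapred.reduce.tasks=4 before the positional arguments overrides the reducer count without any code change:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReducerCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains anything passed via -D on the command line.
        Job job = Job.getInstance(getConf(), "reducer-count-demo");
        job.setJarByClass(ReducerCountDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar app.jar ReducerCountDriver -Dmapred.reduce.tasks=4 /in /out
        System.exit(ToolRunner.run(new Configuration(), new ReducerCountDriver(), args));
    }
}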

How to configure Avro MapReduce job to output results into a single file?

I have a three-node cluster, and when the Avro job completes it creates three output files (split files); however, I would like to output only one file. Any suggestions?
Set mapred.reduce.tasks=1, but this might increase the execution time.
You could also use the hadoop fs -getmerge command to get a single file after the job is over.
