What are _SUCCESS and part-r-00000 files in Hadoop?

Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the _SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the _SUCCESS file? Why does the output file have the name part-r-00000? Is there any significance to the nomenclature, or is the name just randomly chosen?

See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/
On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (MAPREDUCE-947)
This would typically be used by job scheduling systems (such as Oozie) to signal that follow-on processing of the contents of this directory can commence, because all the data has been written.
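A downstream step could gate on the marker roughly as follows. This is a minimal sketch that assumes the output directory is visible as a local path (for example when running in local mode); the class and method names are made up for illustration. On HDFS the same check would go through the Hadoop FileSystem API instead of java.nio.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SuccessMarkerCheck {
    // True when the output directory contains the _SUCCESS marker file.
    static boolean isComplete(Path outputDir) {
        return Files.isRegularFile(outputDir.resolve("_SUCCESS"));
    }

    // Creates a temporary local directory standing in for a job output
    // directory; the marker is written only when 'succeeded' is true.
    static Path simulateJobOutput(boolean succeeded) {
        try {
            Path dir = Files.createTempDirectory("job-output");
            if (succeeded) {
                Files.createFile(dir.resolve("_SUCCESS"));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unfinished job complete? " + isComplete(simulateJobOutput(false)));
        System.out.println("finished job complete?   " + isComplete(simulateJobOutput(true)));
    }
}
```

Checking for the marker rather than for the presence of part files avoids reading a directory that a still-running job is only halfway through writing.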
Update (in response to comment)
The output files are by default named part-x-yyyyy, where:
x is either 'm' or 'r', depending on whether the file was written by a mapper (in a map-only job) or by a reducer
yyyyy is the mapper or reducer task number (zero-based)
So a job which has 32 reducers will have files named part-r-00000 to part-r-00031, one for each reducer task.
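The naming scheme can be reproduced with a few lines of plain Java. This sketch only formats names according to the pattern described above; it does not call Hadoop, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PartFileNames {
    // Builds <base>-<x>-<yyyyy>: x is 'm' or 'r', and yyyyy is the
    // zero-based task number padded to five digits.
    static String partFileName(String base, char taskType, int taskNumber) {
        return String.format(Locale.ROOT, "%s-%c-%05d", base, taskType, taskNumber);
    }

    // Lists the expected output names for a job with the given number of reducers.
    static List<String> reducerOutputs(int numReducers) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            names.add(partFileName("part", 'r', i));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = reducerOutputs(32);
        // prints "part-r-00000 ... part-r-00031"
        System.out.println(names.get(0) + " ... " + names.get(names.size() - 1));
    }
}
```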

Related

What are the difference between part-r-00000 and part-m-00000 files in Hadoop?

We are working with big data, using Hadoop in a VirtualBox VM running CentOS. Whenever we run certain programs, two different kinds of files are created: 1) part-r-00000 and 2) part-m-00000. What is the difference between these two files, and what is the purpose of each?
The output files are by default named part-x-yyyyy, where:
1) x is either ‘m’ or ‘r’, depending on whether the file was written by a mapper (in a map-only job) or by a reducer
2) yyyyy is the mapper or reducer task number (zero-based, starting at 00000)
So if a job has 10 reducers, the generated files will be named part-r-00000 to part-r-00009, one for each reducer task.
It is possible to change the default name.
This is all you need to do in the Driver class to change the default base name of the output files:
job.getConfiguration().set("mapreduce.output.basename", "Neo");
So this will result in your files being called “Neo-r-00000”.
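Because the base name can change, code that inspects an output directory may want to parse the file names rather than hard-code "part". The following is a sketch under the assumption that names follow the basename-x-yyyyy pattern described above; the class name and the regular expression are illustrative and not part of Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartNameParser {
    // <basename>-<m|r>-<task number padded to at least five digits>
    private static final Pattern PART = Pattern.compile("^(.+)-([mr])-(\\d{5,})$");

    // Returns {basename, task type, task number}, or null if the name
    // does not follow the pattern (for example "_SUCCESS").
    static String[] parse(String fileName) {
        Matcher m = PART.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        String taskNumber = String.valueOf(Integer.parseInt(m.group(3)));
        return new String[] { m.group(1), m.group(2), taskNumber };
    }

    public static void main(String[] args) {
        String[] parts = parse("Neo-r-00000");
        // prints "basename=Neo type=r task=0"
        System.out.println("basename=" + parts[0] + " type=" + parts[1] + " task=" + parts[2]);
    }
}
```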
These are files produced by MapReduce jobs. r means the file has been output by a Reducer, m means the file has been output by a Mapper.

Hadoop Streaming Job with no input file

Is it possible to execute a Hadoop Streaming job that has no input file?
In my use case, I'm able to generate the necessary records for the reducer with a single mapper and execution parameters. Currently, I'm using a stub input file with a single line; I'd like to remove this requirement.
We have 2 use cases in mind.
1) I want to distribute the loading of files into HDFS from a network location available to all nodes. Basically, I'm going to run ls in the mapper and send the output to a small set of reducers.
2) We are going to be running fits leveraging several different parameter ranges against several models. The model names do not change and will go to the reducer as keys while the list of tests to run is generated in the mapper.
According to the docs this is not possible. The following are required parameters for execution:
input directoryname or filename
output directoryname
mapper executable or JavaClassName
reducer executable or JavaClassName
It looks like providing a dummy input file is the way to go currently.
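With a one-line stub input, the mapper can simply consume and ignore that line and then emit the records it generates itself. Below is a rough sketch of such a streaming mapper written as a standalone Java program; the model names and parameter values are hypothetical placeholders, and the tab-separated key/value layout is the usual streaming convention.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

class GeneratingMapper {
    // Hypothetical model names and parameter values, for illustration only.
    static final String[] MODELS = { "modelA", "modelB" };
    static final double[] PARAMS = { 0.1, 0.5, 1.0 };

    // Produces one tab-separated key/value record per (model, parameter) pair.
    static List<String> generateRecords() {
        List<String> records = new ArrayList<>();
        for (String model : MODELS) {
            for (double param : PARAMS) {
                records.add(model + "\t" + param);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Drain the stub input; its content is ignored.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        while (in.readLine() != null) {
            // nothing to do with the stub line
        }
        for (String record : generateRecords()) {
            System.out.println(record);
        }
    }
}
```

Since the model name is the key, all tests for one model arrive at the same reducer.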

How to control the number of hadoop streaming output files

Here is the detail:
The input files are in the HDFS path /user/rd/input, and the HDFS output path is /user/rd/output.
The input path contains 20,000 files, from part-00000 to part-19999, each about 64 MB.
What I want to do is write a Hadoop streaming job to merge these 20,000 files into 10,000 files.
Is there a way to merge these 20,000 files into 10,000 files using a Hadoop streaming job? In other words, is there a way to control the number of Hadoop streaming output files?
Thanks in advance!
It looks like right now you have a map-only streaming job. The behavior with a map-only job is to have one output file per map task. There isn't much you can do about changing this behavior.
You can exploit the way MapReduce works by adding a reduce phase with 10,000 reducers. Each reducer will then output one file, so you are left with 10,000 files. Note that your data records will be "scattered" across the 10,000 files; each output file will not simply be two input files concatenated. To do this, use the -D mapred.reduce.tasks=10000 flag in your command line args.
This is probably the default behavior, but you can also specify the identity reducer as your reducer. This doesn't do anything other than pass on the record, which is what I think you want here. Use this flag to do this: -reducer org.apache.hadoop.mapred.lib.IdentityReducer
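Why the records end up scattered can be illustrated with the formula used by Hadoop's default hash partitioner: the key's hash, made non-negative, modulo the number of reducers. The sketch below is an approximation, since it uses String.hashCode as a stand-in for the hash of the real key type, so the actual partition numbers in a job may differ. The record keys are made up.

```java
import java.util.Map;
import java.util.TreeMap;

class PartitionSketch {
    // Non-negative hash modulo the reducer count. String.hashCode is a
    // stand-in here, so real jobs may assign different partition numbers.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Count how many of 1,000 made-up keys land in each partition.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            counts.merge(partitionFor("record-" + i, reducers), 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```

Records with the same key always go to the same reducer, but which input file a record came from plays no role in the assignment.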

Intermediate output when a reducer is specified

I've written a Hadoop MapReduce job. When I run it locally, I notice that if I don't specify any reduce tasks, some temporary files are written to the output directory. If I specify reducers, no temporary files are written. Is this normal behavior? I would expect to see the temporary files written in both cases; otherwise it would mean that the mapper is trying to do everything in memory and then transfer it to the reducer in memory. This strikes me as implausible.
Any insights into how/when/where the mapper writes intermediate output to the file system would be appreciated.
Thanks
Map tasks write their output to the local disk, not to HDFS. Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill.
But if we set the number of reducers to 0, the map output is stored on HDFS as the final output. There is no reduce phase, so the output of the mappers is the output of the whole job.
Additionally here is how to look into intermediate files even if reducer is specified.

Correlating input files to output files

I have an MR streaming job. My code is in C++. It's a mapper-only job, with no reducer. The input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.
Input files are like:
MyDir/file1
MyDir/file2
MyDir/file3
Output files are like:
MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002
I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e. mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.
I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?
One way I can think of is to give the job's input and output files the same names. Get the name of the input file the mapper is processing (the map.input.file environment property) and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method.
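The name-derivation step of this idea can be sketched in plain Java. The environment variable names below are assumptions: streaming is expected to export job properties with dots replaced by underscores, so map.input.file would appear as map_input_file (or mapreduce_map_input_file on newer versions). The naming rule itself is hypothetical, and the sketch does not call MultipleOutputFormat.

```java
import java.util.Map;

class OutputNameFromInput {
    // Returns the input path exported to the task, or null if neither
    // variable is set. Both variable names are assumptions.
    static String inputPath(Map<String, String> env) {
        String path = env.get("mapreduce_map_input_file");
        return path != null ? path : env.get("map_input_file");
    }

    // Keeps only the last path component, e.g. "MyDir/file1" -> "file1".
    static String leafName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    // Hypothetical naming rule: reuse the input leaf name, falling back
    // to the default name when the input path is unknown.
    static String outputName(Map<String, String> env, String defaultName) {
        String path = inputPath(env);
        return path == null ? defaultName : leafName(path);
    }

    public static void main(String[] args) {
        System.out.println(outputName(System.getenv(), "part-00000"));
    }
}
```

With such a rule, the output for MyDir/file1 would be named file1 instead of a task-numbered part file, which makes the input-to-output mapping explicit.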
With how Hadoop is designed, the only relationship that you can rely on, without you expressly naming the output files as per the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).
If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.
