Output file of size zero - Hadoop

I am running a Hadoop MapReduce streaming job (a mappers-only job). In some cases my job writes to stdout, and an output file with non-zero size is created. In other cases my job writes nothing to stdout, but an output file of size zero is still created. Is there a way to avoid creating this zero-size file when nothing is written to stdout?

If you don't mind extending your current output format, you just need to override the OutputCommitter to 'abort' the commitTask stage when no data was written.
Note that not all output formats show zero file bytes for an empty file (sequence files for example have a header), so you can't just check the output file size.
Look at the source for the following classes:
OutputCommitter - The base abstract class
FileOutputCommitter - Most FileOutputFormats use this committer, so it's a good place to start. Look at the private method moveTaskOutputs; this is where your logic would most likely go (skip copying the file if nothing was written).
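For illustration, here is a minimal sketch of the idea (the class name and the size-based emptiness check are my own, not from the answer above). As noted above, a pure size check is only good enough for plain text output; formats that write headers (e.g. sequence files) need a different test. You would also need to wire this committer into your job, for example by overriding your output format's getOutputCommitter, and the details differ between the mapred and mapreduce APIs.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class SkipEmptyOutputCommitter extends FileOutputCommitter {

    public SkipEmptyOutputCommitter(Path outputPath, TaskAttemptContext context) throws IOException {
        super(outputPath, context);
    }

    @Override
    public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        // Skip the commit (so no part file is promoted) unless at least one
        // file in the task's work directory actually contains data.
        Path workPath = getWorkPath();
        FileSystem fs = workPath.getFileSystem(context.getConfiguration());
        if (!fs.exists(workPath)) {
            return false; // the task never opened an output file
        }
        for (FileStatus status : fs.listStatus(workPath)) {
            if (status.getLen() > 0) {
                return super.needsTaskCommit(context); // real data: commit as usual
            }
        }
        return false; // only empty files were written: nothing to commit
    }
}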

Are you using MultipleOutputs?
If yes, MultipleOutputs creates default files even if the reducer has nothing to write to the output.
To avoid this default zero-sized output, you can use LazyOutputFormat.setOutputFormatClass() (see the sketch after this answer).
In my experience, even if you are using LazyOutputFormat, zero-sized files are created when the reducer has some data to write (so the output file is created) but gets killed before it finishes writing. I believe this is a timing issue, so you might observe that only partial reducer output files are present in HDFS, or you may not observe this at all.
For example, if you have 10 reducers, you might end up with only n (n <= 10) files, and some of them may be 0 bytes in size.
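As a concrete illustration, here is a minimal driver fragment (the job name and output format are placeholders) showing the LazyOutputFormat wiring. The wrapped output format is only instantiated when the first record is written, so a task that emits nothing creates no file at all.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "lazy-output-example");
// Instead of job.setOutputFormatClass(TextOutputFormat.class):
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);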

Related

Hadoop streaming with multiple input files

I want to build an inverted index from a set of files with Hadoop using the Streaming API. The documentation always refers to a single file whose lines are fed to the mapper as entries. But in this case I have multiple input files, and I need each mapper to process only one file at a time. Is there a way to accomplish that? For preprocessing reasons I need the input to be structured this way, and I cannot have the input in the classic line = key, value format that the documentation refers to.
By default a mapper only processes one file, unless you use an input format that combines inputs, such as CombineFileInputFormat.
So if you have 10 files you will end up with 10 mappers, each processing only one file. If you are only using mappers (no reducers), that results in 10 output files (one per mapper).
On the other hand, if your files are big and splittable, a single file may be processed by several mappers at the same time (a sketch of how to prevent that follows).
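If you need to guarantee one mapper per file even for large, splittable files, one option is a non-splittable input format; a minimal sketch is below. The class name is a placeholder, and it extends the old mapred API, which is what streaming's -inputformat option expects.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        // Never split: each input file is handed to exactly one mapper.
        return false;
    }
}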

How to delete intermediate output files from HDFS

I am trying to delete the intermediate output directory of my MapReduce program using
FileUtils.deleteDirectory(new File(tempFiles));
but this command doesn't delete directories from HDFS.
MapReduce does not write intermediate results to HDFS; it writes them to the local disk.
Whenever a mapper produces output, it first goes into an in-memory buffer where partitioning and sorting take place; when the buffer exceeds its default capacity, it spills those results to the local disk.
In summary, the output produced by the mapper goes to the local file system.
There is only one condition under which mappers write their output to HDFS: when the driver class explicitly sets the job to use no reducers.
In that case the output is the final output, so we wouldn't call it intermediate.
You are using the wrong API! You should be using Hadoop's FileUtil instead of Commons IO's FileUtils; the latter is used for file manipulation on the local filesystem.
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileUtil.html#fullyDelete
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html
I understand that one can easily pick the wrong one due to the similar names. Your current code is looking at your local file system and deletes that path there, with no effect on HDFS.
Sample code:
FileUtil.fullyDelete(new File("pathToDir"));
On the other hand, you can use the FileSystem API itself, which has a delete method. You need to get the FileSystem object first, e.g.:
filesystem.delete(new Path("pathToDir"), true);
The second argument is the recursive flag.
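Putting the FileSystem pieces together, a minimal sketch looks like this (the path is a placeholder); FileSystem.get resolves to HDFS when fs.defaultFS in the configuration points at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);   // HDFS when fs.defaultFS points at the cluster
fs.delete(new Path("pathToDir"), true); // true = delete recursively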

What are _SUCCESS and part-r-00000 files in Hadoop

Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the _SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the _SUCCESS file? Why does the output file have the name part-r-00000? Is there any significance or nomenclature to it, or is it just defined randomly?
See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/
On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (MAPREDUCE-947)
This would typically be used by job scheduling systems (such as OOZIE), to denote that follow-on processing on the contents of this directory can commence as all the data has been output.
Update (in response to comment)
The output files are by default named part-x-yyyyy where:
x is either 'm' or 'r', depending on whether the file was produced by a map task (map-only job) or a reduce task
yyyyy is the mapper or reducer task number (zero based)
So a job which has 32 reducers will have files named part-r-00000 to part-r-00031, one for each reducer task.
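As a small illustration of how the _SUCCESS marker is typically consumed, here is a hedged sketch (the output path is a placeholder): downstream code reads the part-* files only once the marker exists.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path outputDir = new Path("/user/me/job-output");
FileSystem fs = FileSystem.get(new Configuration());
if (fs.exists(new Path(outputDir, "_SUCCESS"))) {
    // the producing job completed successfully; the part-r-* files are complete
}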

Intermediate output when a reducer is specified

I've written a Hadoop MapReduce job. When I run it locally, I notice that if I don't specify any reduce tasks, some temporary files are written to the output directory. If I specify reducers, no temporary files are written. Is this normal behavior? I would expect to see the temporary files written; otherwise it would mean that the mapper is trying to do everything in memory and then transfer it to the reducer in memory, which strikes me as implausible.
Any insights into how/when/where the mapper writes intermediate output to the file system would be appreciated.
Thanks
Map tasks write their output to the local disk, not to HDFS. Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill.
But if we set the number of reducers to 0, then the map output is stored on HDFS as the final output. There is no reduce phase, so the output of the mappers is the output of the whole job (a driver fragment follows below).
Additionally, here is how to look into the intermediate files even if a reducer is specified.
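A minimal driver fragment (names and paths are placeholders) for the zero-reducer case described above: with no reduce phase, each mapper's output is written straight to the HDFS output directory as a final part-m-NNNNN file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Job job = Job.getInstance(new Configuration(), "map-only-example");
job.setNumReduceTasks(0);                        // no reduce phase at all
FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));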

Correlating input files to output files

I have an MR streaming job. My code is in C++. It's a mapper-only job, with no reducer. The input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.
Input files are like:
MyDir/file1
MyDir/file2
MyDir/file3
Output file are like:
MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002
I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e. the mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.
I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?
One way I can think of is to make the input and output file names of the job match. Get the input file name (the map.input.file property) that the mapper is processing, and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method (a sketch follows).
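Here is a hedged sketch of that idea using the old mapred API's MultipleTextOutputFormat. It assumes a convention you would add yourself: the mapper emits the basename of map.input.file as the key, so each record lands in a file named after its originating input file.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class InputNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // The key holds the input file's basename (set by the mapper); use it as
        // the output leaf name instead of the default part-NNNNN.
        return key.toString();
    }
}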
With how Hadoop is designed, the only relationship you can rely on, without expressly naming the output files as in the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).
If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.
