Having multiple reduce tasks assemble a single HDFS file as output - hadoop

Is there any low-level API in Hadoop that allows multiple reduce tasks running on different machines to assemble a single HDFS file as the output of their computation?
Something like: a stub HDFS file is created at the beginning of the job, then each reducer creates a variable number of data blocks as output and assigns them to this file in a certain order.

The answer is no; that would be an unnecessary complication for a rare use case.
What you should do
option 1 - add some code at the end of your Hadoop driver
int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK
// ls job output directory, collect part-r-XXXXX file names
// create HDFS readers for files
// merge them in a single file in whatever way you want
}
All of the required methods are available in the Hadoop FileSystem API.
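A minimal sketch of option 1 using the FileSystem API, assuming the job wrote its part-r-XXXXX files to a directory like /job/output (both paths and the class name below are placeholders, not anything from the original job):
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outputDir = new Path("/job/output");            // assumed job output directory
        Path merged = new Path("/job/output-merged/result"); // assumed merged target file
        FileStatus[] parts = fs.listStatus(outputDir);
        Arrays.sort(parts); // part-r-00000, part-r-00001, ... in order
        try (FSDataOutputStream out = fs.create(merged)) {
            for (FileStatus status : parts) {
                // Concatenate only the reducer output files, skipping _SUCCESS etc.
                if (status.getPath().getName().startsWith("part-r-")) {
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, conf, false); // keep 'out' open between files
                    }
                }
            }
        }
    }
}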
option 2 - add job to merge files
You can create a generic Hadoop job that accepts a directory name as input and passes everything as-is to a single reducer, which merges the results into one output file. Run this job in a pipeline after your main job.
This approach is faster for large inputs.
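A minimal sketch of such a merge job, assuming the first job's output is tab-separated text (the class and path names are placeholders; the usual org.apache.hadoop.mapreduce imports are omitted to keep the snippet short):
// Identity pass-through: KeyValueTextInputFormat reads each line as (Text, Text),
// the default Mapper and Reducer forward records unchanged, and a single reducer
// funnels everything into one part-r-00000 file. Note that the shuffle sorts
// records by key, so the original line order is not preserved.
Configuration conf = new Configuration();
Job merge = Job.getInstance(conf, "merge-parts");
merge.setJarByClass(MergeDriver.class);      // MergeDriver is a placeholder class name
merge.setInputFormatClass(KeyValueTextInputFormat.class);
merge.setMapperClass(Mapper.class);          // identity mapper
merge.setReducerClass(Reducer.class);        // identity reducer
merge.setOutputKeyClass(Text.class);
merge.setOutputValueClass(Text.class);
merge.setNumReduceTasks(1);                  // single reducer => single output file
FileInputFormat.addInputPath(merge, new Path("/job/output"));          // assumed input dir
FileOutputFormat.setOutputPath(merge, new Path("/job/output-merged")); // assumed output dir
System.exit(merge.waitForCompletion(true) ? 0 : 1);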

If you want the merged output file on the local filesystem, you can use the getmerge command to combine the multiple reduce-task files into a single local output file. The command is below:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

Related

access hadoop intermediate map output files

Is it possible to access or read the map intermediate output files, i.e. the sequence file file.out?
I want to read a file.out file. I used the approach mentioned in this link, http://hadooptutorial.info/hadoop-sequence-files-example/,
but it reports that file.out is not a sequence file.
If you want to read the output of the mapper, pass the following configuration argument when running the job:
-Dmapreduce.job.reduces=0
It sets the number of reducers to 0, so no partitioning or shuffling takes place and only the mapper output is written to the output directory.
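The same thing can be done in the driver instead of on the command line; a one-line sketch, where job is the org.apache.hadoop.mapreduce.Job being configured:
// Map-only job: mapper output goes straight to part-m-XXXXX files in the output directory.
job.setNumReduceTasks(0);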

Hadoop job just ends

I'm having a rather strange problem with Hadoop.
I wrote an MR job that ends just like that, without executing the map or reduce code. It produces the output folder, but that folder is empty. I see no reason for such behavior.
I'm even trying this with the default Mapper and Reducer, just to find the problem, but I get no exception and no error; the job just finishes and produces an empty folder. Here's the simplest driver:
Configuration conf = new Configuration();
//DistributedCache.addCacheFile(new URI(firstPivotsInput), conf);
Job pivotSelection = new Job(conf);
pivotSelection.setJarByClass(Driver.class);
pivotSelection.setJobName("Silhoutte");
pivotSelection.setMapperClass(Mapper.class);
pivotSelection.setReducerClass(Reducer.class);
pivotSelection.setMapOutputKeyClass(IntWritable.class);
pivotSelection.setMapOutputValueClass(Text.class);
pivotSelection.setOutputKeyClass(IntWritable.class);
pivotSelection.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(pivotSelection, new Path("/home/pera/WORK/DECOMPRESSION_RESULT.csv"));
FileOutputFormat.setOutputPath(pivotSelection, new Path("/home/pera/WORK/output"));
pivotSelection.setNumReduceTasks(1);
pivotSelection.waitForCompletion(true);
What could be the problem in such a simple example?
The simplest explanation is that the input path ("/home/pera/WORK/DECOMPRESSION_RESULT.csv") does not contain anything on HDFS. You can verify that by the value of the MAP_INPUT_RECORDS counter. You can also check the size of this file on HDFS with hadoop dfs -ls /home/pera/WORK, or even see its first few lines with hadoop dfs -cat /home/pera/WORK/DECOMPRESSION_RESULT.csv | head (or -text instead of -cat if it is compressed).
Another possibility is that the reducer has a special (if) condition that fails for every mapper output, but that should not happen with the identity mapper and reducer, so I believe it is the former case.
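A small sketch of that counter check, placed after the waitForCompletion(true) call in the driver above (TaskCounter is org.apache.hadoop.mapreduce.TaskCounter):
// If MAP_INPUT_RECORDS is 0, the mappers read nothing, i.e. the input path matched no data.
long mapInputRecords = pivotSelection.getCounters()
        .findCounter(TaskCounter.MAP_INPUT_RECORDS)
        .getValue();
System.out.println("MAP_INPUT_RECORDS = " + mapInputRecords);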

Merging MapReduce output

I have two MapReduce jobs which produce files in two separate directories which look like so:
Directory output1:
------------------
/output/20140101-r-00000.txt
/output/20140102-r-00000.txt
/output/20140103-r-00000.txt
/output/20140104-r-00000.txt
Directory output2:
------------------
/output-update/20140102-r-00000.txt
I want to merge these two directories into a new directory /output-complete/, where 20140102-r-00000.txt replaces the original file in the /output directory and the "-r-0000x" suffix is removed from every file name. The two original directories will then be empty, and the resulting directory should look as follows:
Directory output3:
-------------------
/output-complete/20140101.txt
/output-complete/20140102.txt
/output-complete/20140103.txt
/output-complete/20140104.txt
What is the best way to do this? Can I use only HDFS shell commands? Do I need to create a java program to traverse both directories and do the logic?
You can use Pig (PigStorage is the standard load function):
get_data = LOAD '/output*/20140102*.txt' USING PigStorage();
STORE get_data INTO '/output-complete/20140102.txt';
or an HDFS command:
hadoop fs -cat '/output*/20140102*.txt' > output-complete/20140102.txt
If single quotes do not work, try double quotes.
You can also use the hadoop fs -getmerge command for merging HDFS files into a single local file.
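If you do decide to write a small Java program, a hedged sketch of the traverse-and-rename logic might look like this (the directory names come from the question; the regex and the overwrite behavior are assumptions):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path dest = new Path("/output-complete");
fs.mkdirs(dest);
// Process /output first and /output-update last, so the updated files replace the originals.
for (Path dir : new Path[] { new Path("/output"), new Path("/output-update") }) {
    for (FileStatus status : fs.listStatus(dir)) {
        String name = status.getPath().getName();          // e.g. 20140102-r-00000.txt
        String newName = name.replaceAll("-r-\\d+", "");   // e.g. 20140102.txt
        Path target = new Path(dest, newName);
        fs.delete(target, false);                          // drop an existing copy, if any
        fs.rename(status.getPath(), target);               // move, leaving the source dirs empty
    }
}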

Hadoop: How to generate custom reduce output file name?

Now, I use MultipleOutputs.
I would like to remove the suffix string "-00001" from the reducer's output filename, such as "xxxx-[r/m]-00001".
Any ideas?
Thanks.
From the Hadoop javadoc for the write() method of MultipleOutputs:
Output path is a unique file generated for the namedOutput. For example, {namedOutput}-(m|r)-{part-number}
So you need to rename or merge these files on the HDFS.
I think you can do it in the job driver: when your job completes, rename the files on HDFS. You could also do it with terminal commands.
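A hedged sketch of the rename-in-the-driver approach, assuming a single part file per named output (outputPath and the "xxxx" prefix are taken from the question; with several reducers the renames below would collide):
FileSystem fs = FileSystem.get(conf);
for (FileStatus status : fs.listStatus(outputPath)) {
    String name = status.getPath().getName();              // e.g. xxxx-r-00001
    if (name.matches("xxxx-[mr]-\\d+")) {
        // Strip the "-r-00001" / "-m-00001" suffix by renaming the file on HDFS.
        fs.rename(status.getPath(), new Path(outputPath, "xxxx"));
    }
}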

Hadoop preinstalled example Jars

I just successfully set up Hadoop on my local machine. I am following one of the examples in a popular book I just bought. I am trying to get a list of all the Hadoop examples that come with the installation, so I type the following command:
bin/hadoop jar hadoop-*-examples.jar
Once I enter this, I am supposed to get a list of Hadoop examples, right? However, all I see is this error message:
Not a valid JAR: /home/user/hadoop/hadoop-*-examples.jar
How do I solve this problem? Is it just a simple permission issue?
This is most probably a configuration issue or an invalid file path.
The name hadoop-*-examples.jar is likely not correct, because in my version of Hadoop (1.0.0) the file name is hadoop-examples-1.0.0.jar.
So I ran the following command to list all the examples, and it works like a charm:
bin/hadoop jar hadoop-examples-*.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
Also, if I use the same file name pattern as you, I get an error:
bin/hadoop jar hadoop-*examples.jar
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-*examples.jar
HTH
You must specify the name of the class (program) inside the jar file that you want to run:
hadoop jar pathtojarfile classname arg1 arg2 ..
Example:
hadoop jar example.jar wordcount inputPath outputPath
@Anup The full or relative path to the jar file is required.
In your case it might be /home/user/hadoop/share/hadoop-*-examples.jar
The complete command, run from Hadoop's directory, might be
/home/user/hadoop/bin/hadoop jar /home/user/hadoop/share/hadoop-*-examples.jar
(I used absolute full paths there, but you can use relative paths).
You will find the jar in $HADOOP_HOME/share/hadoop/mapreduce/hadoop-*-examples*.jar
