Hadoop reducer cleanup function

In my Hadoop reduce code, I have a cleanup function that prints the total count, but it prints twice. I think this is because it's printing the count of key+values and the count alone, but I'm not sure.
My code has this:
protected void cleanup(Context context) throws IOException, InterruptedException {
    Text t1 = new Text("Total Count");
    context.write(t1, new IntWritable(count));
}
inside the reducer class and the output is:
Total Count 9477
Total Count 4738

The cleanup method is called once at the end of each task. So I assume you are running 2 reducers in the job, hence the 2 output lines.
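If you want the total printed only once, one option is to force a single reduce task; a minimal driver sketch, assuming the usual Job setup (the class names here are placeholders, not from the question):
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(TotalCountDriver.class);    // placeholder driver class
job.setReducerClass(TotalCountReducer.class); // the reducer containing the cleanup() above
job.setNumReduceTasks(1);                     // one reduce task -> cleanup() runs once -> one "Total Count" line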

Related

Hadoop - Classic MapReduce Wordcount

In my Reducer code, I am using this code snippet to sum the values:
for (IntWritable val : values) {
    sum += val.get();
}
Since the above gives me the expected output, I tried changing the code to:
for (IntWritable val : values) {
    sum += 1;
}
Can anyone please explain what difference it makes when I use sum += 1 in the reducer rather than sum += val.get()? Why does it give me the same output? Does it have anything to do with the Combiner? When I used this same reducer code as the Combiner class, the output was incorrect, with all words showing a count of 1.
Mapper Code :
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer token = new StringTokenizer(line);
    while (token.hasMoreTokens()) {
        word.set(token.nextToken());
        context.write(word, new IntWritable(1));
    }
}
Reducer Code :
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        //sum += val.get();
        sum += 1;
    }
    context.write(key, new IntWritable(sum));
}
Driver Code:
job.setJarByClass(WordCountWithCombiner.class);
//job.setJobName("WordCount");
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Input - "to be or not to be"
Expected Output - (be,2) , (to,2) , (or,1) , (not,1)
But the output I am getting is - (be,1) , (to,1) , (or,1) , (not,1)
Can anyone please explain what difference it makes when I use sum += 1 in the reducer rather than sum += val.get()?
Both statements perform an addition. In the first, you are counting how many times the for-loop has run. In the latter, you are actually summing the int values returned by each val object for the given key.
Why does it give me the same output? Does it have anything to do with the Combiner?
The answer is yes. It is because of the Combiner.
Now let's look at the input you are passing; it will instantiate only one Mapper. The output of the Mapper is:
(to,1), (be,1), (or,1), (not,1), (to,1), (be,1)
This then goes to the Combiner, which has essentially the same logic as the Reducer. The Combiner's output will be:
(be,2) , (to,2) , (or,1) , (not,1)
Now the above output of the Combiner goes to the Reducer, which performs the sum operation however you define it. So if your logic is sum += 1, the output will be:
(be,1) , (to,1) , (or,1) , (not,1)
But if your logic is sum += val.get() then your output will be:
(be,2) , (to,2) , (or,1) , (not,1)
I hope you understand it now. The logic of the Combiner and the Reducer is the same, but the input coming to them for processing is different.
It all depends on the value returned by val.get().
If val.get() always returns 1, then sum += val.get(); is the same as sum += 1;, which is what is happening in your reducer.
BUT
The combiner is used to do a pre-aggregation (similar to the reducer's aggregation) on the mapper side, before sending the key-value pairs to the reducer(s).
The Hadoop framework doesn't guarantee how many times the combiner is executed per Mapper; it depends on the volume of the Mapper's output. So if the combiner is executed just once, the aggregation on the mapper side will be fine, but the reducer, instead of receiving only 1's, could receive other numbers (val.get() >= 1). If you then use sum += 1; in your reducer, you will be dropping the counts already aggregated on the mapper side, generating wrong output.
If the combiner is executed more than once on the Mapper side, you can imagine the problem getting even worse.
In summary, sum += 1; works if and only if that statement is executed exactly once per original key-value pair. With a combiner, that is not guaranteed.
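To make that concrete, keeping the sum based on val.get() (the commented-out line in the question's reducer) makes the same class safe to use as both Combiner and Reducer, because summing partial sums still yields the correct total; a minimal sketch:
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get(); // correct whether val is a raw 1 from the map or a partial sum from the combiner
    }
    context.write(key, new IntWritable(sum));
}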
val.get() returns an int, so basically both pieces of code are the same. Whether we use val.get() depends on the problem we are trying to solve. In your case, the mapper emits each word as the key with a value of 1, so in the reducer (with no combiner in between) you can be sure that val.get() will always return 1; hence the hard-coded value 1 gives the same result.
Also, using the same reducer as the combiner should not cause any problem as long as the reducer sums val.get() rather than hard-coding 1. One scenario where every word would show a count of '1' is when the number of reducers is set to 0 and the mapper output is written directly to the output path.

MapReduce Job distribution among reducers

I developed a small MapReduce program. When I opened the process log, I saw that one map and two reducers were created by the framework. I had only one input file and got two output files. Now please tell me:
1) Is the number of mappers and reducers determined by the framework, or can it be changed?
2) Is the number of output files always equal to the number of reducers, i.e. does each reducer create its own output file?
3) How is one input file distributed among mappers? And how is the output of one mapper distributed among multiple reducers (is this done by the framework, or can you change it)?
4) How do I manage multiple input files, i.e. a directory containing input files?
Please answer these questions. I am a beginner to MapReduce.
Let me attempt to answer your questions. Please tell me wherever you think I am incorrect -
1) Is the number of mappers and reducers determined by the framework, or can it be changed?
The total number of map tasks depends on the total number of logical splits made out of the HDFS blocks. With TextInputFormat, roughly each logical split equals one block, so fixing the total number of map tasks is usually not possible: different files have different sizes and therefore different numbers of blocks.
Unlike the number of mappers, the number of reducers can be fixed.
2) Is the number of output files always equal to the number of reducers, i.e. does each reducer create its own output file?
To a certain degree, yes, but there are ways to create more than one output file from a reducer. For example: MultipleOutputs.
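As an illustration (the named output "totals" and the reducer class below are placeholders I'm adding, not from the post), MultipleOutputs from org.apache.hadoop.mapreduce.lib.output can be wired in roughly like this:
// In the driver:
MultipleOutputs.addNamedOutput(job, "totals", TextOutputFormat.class, Text.class, IntWritable.class);

// In the reducer:
public static class MultiOutReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        mos.write("totals", key, new IntWritable(sum)); // lands in totals-r-xxxxx instead of part-r-xxxxx
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush the named outputs
    }
}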
3) How is one input file distributed among mappers? And how is the output of one mapper distributed among multiple reducers (is this done by the framework, or can you change it)?
Each file in HDFS is composed of blocks. Those blocks are replicated and can reside on multiple nodes (machines). Map tasks are then scheduled to run on these blocks.
The level of concurrency with which map tasks can run depends on the number of processors each machine has.
E.g., if 10,000 map tasks are scheduled for a file, then depending on the total number of processors throughout the cluster, perhaps only 100 can run concurrently at a time.
By default, Hadoop uses HashPartitioner, which computes the hash code of each key emitted by the Mapper and maps it to a partition.
E.g.:
public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
As you can see above, a partition is selected from the fixed number of reducers based on the key's hash code. So, if numReduceTasks = 4, the value returned would be between 0 and 3.
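If the default routing doesn't suit you, you can supply your own Partitioner; a minimal sketch (the class name and the first-letter rule are made up for illustration):
public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // route keys by their first character instead of the full hash code
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : k.charAt(0);
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
job.setPartitionerClass(FirstLetterPartitioner.class);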
4) How do I manage multiple input files, i.e. a directory containing input files?
Hadoop supports a directory consisting of multiple files as an input to a job.
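For example (the paths below are placeholders), a directory or several paths can be passed straight to FileInputFormat:
// A directory works directly; every file inside it becomes input to the job.
FileInputFormat.addInputPath(job, new Path("/data/input-dir"));

// Several files or directories can also be combined:
FileInputFormat.setInputPaths(job, new Path("/data/part1"), new Path("/data/part2"));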
As explained by 'SSaikia_JtheRocker', mapper tasks are created according to the total number of logical splits over the HDFS blocks.
I would like to add something to question #3: "How is one input file distributed among mappers? And how is the output of one mapper distributed among multiple reducers (is this done by the framework, or can you change it)?"
For example, consider my word count program, which counts the number of words in a file, shown below:
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context) // Context context is the output
            throws IOException, InterruptedException {
        // value = "How Are You"
        String line = value.toString(); // converts Hadoop's Text "How Are You" into a plain Java String
        StringTokenizer tokenizer = new StringTokenizer(line); // tokenizer yields the tokens {"How", "Are", "You"}
        while (tokenizer.hasMoreTokens()) // hasMoreTokens() returns true while tokens remain
        {
            value.set(tokenizer.nextToken()); // value is overwritten with the current token, e.g. "How"
            context.write(value, new IntWritable(1)); // writes the pair to the map output
            // How, 1
            // Are, 1
            // You, 1
            // map() is called once per line of input
        }
    }
}
So in the above program, the line "How Are You" is split into 3 words by StringTokenizer; map() itself is called once for this line, and the while loop then calls context.write() once per word, emitting the 3 (word, 1) pairs shown.
For the reducers, we can specify how many reduce tasks (and hence output files) we want using the 'job.setNumReduceTasks(5);' statement. The code snippet below will give you an idea.
public class BooksMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use the programArgs array to retrieve program arguments.
        String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf);
        job.setJarByClass(BooksMain.class);
        job.setMapperClass(BookMapper.class);
        job.setReducerClass(BookReducer.class);
        job.setNumReduceTasks(5);
        // job.setCombinerClass(BookReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TODO: Update the input path for the location of the inputs of the map-reduce job.
        FileInputFormat.addInputPath(job, new Path(programArgs[0]));
        // TODO: Update the output path for the output directory of the map-reduce job.
        FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
        // Submit the job and wait for it to finish.
        job.waitForCompletion(true);
        // Submit and return immediately:
        // job.submit();
    }
}

How to determine the right number of mappers in Hadoop?

I feed my Hadoop program an input file of size 4 MB (which has 100k records). As each HDFS block is 64 MB and the file fits in a single block, I set the number of mappers to 1. However, when I increase the number of mappers (let's say to 24), the running time becomes much better. I have no idea why that is the case, as the whole file can be read by only one mapper.
A brief description of the algorithm: the clusters are read from the DistributedCache in the configure function and stored in a global variable called clusters. The mapper reads its chunk line by line and finds the cluster to which each line belongs. Here is some of the code:
public void configure(JobConf job) {
    // retrieve the clusters from the DistributedCache
    try {
        Path[] eqFile = DistributedCache.getLocalCacheFiles(job);
        BufferedReader reader = new BufferedReader(new FileReader(eqFile[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            // construct the cluster represented by ``line`` and add it to the global variable ``clusters``
        }
        reader.close();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
and the mapper
public void map(LongWritable key, Text value, OutputCollector<IntWritable, EquivalenceClsAggValue> output, Reporter reporter) throws IOException {
    // assign each record to one of the existing clusters in ``clusters``.
    String record = value.toString();
    EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
    outputValue.addRecord(record);
    int eqID = MondrianTree.findCluster(record, clusters);
    IntWritable outputKey = new IntWritable(eqID);
    output.collect(outputKey, outputValue);
}
I have input files of different sizes (starting from 4 MB up to 4GB). How can I find the optimal number of mappers/reducers? Each node in my Hadoop cluster has 2 cores and I have 58 nodes.
as the whole file can be read by only one mapper.
This isn't really the case. A few points to keep in mind...
That single block is replicated 3 times (by default), which means that three separate nodes have access to the same block without having to go over the network.
There's no reason a single block can't be copied to multiple machines, where each mapper then seeks to the split it has been allocated.
You need to adjust "mapred.max.split.size". Give the appropriate size in bytes as the value; the MR framework will compute the correct number of mappers based on this and the block size.
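As a rough sketch (the 1 MB figure is only an example, and MyJob is a placeholder class; the question uses the old mapred API), lowering the maximum split size makes the framework create more splits, and therefore more map tasks, for the same 4 MB file:
JobConf conf = new JobConf(MyJob.class);
// cap each split at 1 MB, so a 4 MB file yields roughly 4 splits / 4 map tasks
conf.set("mapred.max.split.size", String.valueOf(1024 * 1024));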

Hadoop / MapReduce - Optimizing "Top N" Word Count MapReduce Job

I'm working on something similar to the canonical MapReduce example - the word count, but with a twist in that I'm looking to only get the Top N results.
Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:
"This is a test of test data and a good one to test this"
The result set from the standard MapReduce word count job would be:
test:3, a:2, this:2, is: 1, etc..
But what if I ONLY want to get the Top 3 words that were used in my entire set of data?
I can still run the exact same standard MapReduce word-count job, and then just take the Top 3 results once it is ready and is spitting out the count for EVERY word, but that seems a little inefficient, because a lot of data needs to be moved around during the shuffle phase.
What I'm thinking is that, if the sample is large enough and the data is random and well distributed in HDFS, each Mapper does not need to send ALL of its word counts to the Reducers, but rather only some of the top data. So if one mapper has this:
a:8234, the: 5422, man: 4352, ...... many more words ... , rareword: 1, weirdword: 1, etc.
Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.
Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
This is a very good question, because you have hit the inefficiency of Hadoop's word count example.
The tricks to optimize your problem are the following:
Do a HashMap-based grouping in your local map stage; you can also use a combiner for that. It can look like the following. I'm using Guava's HashMultiset, which facilitates a nice counting mechanism.
public static class WordFrequencyMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final HashMultiset<String> wordCountSet = HashMultiset.create();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            wordCountSet.add(token);
        }
    }
And you emit the result in your cleanup stage:
@Override
protected void cleanup(Context context) throws IOException,
        InterruptedException {
    Text key = new Text();
    LongWritable value = new LongWritable();
    for (Entry<String> entry : wordCountSet.entrySet()) {
        key.set(entry.getElement());
        value.set(entry.getCount());
        context.write(key, value);
    }
}
So you have grouped the words in a local block of work, reducing network usage by spending a bit of RAM. You can also do the same with a Combiner, but it sorts to group, so it would be slower (especially for strings!) than using a HashMultiset.
To just get the Top N, you will only have to write the Top N in that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side.
This saves you a lot of network bandwidth as well; the only drawback is that you need to sort the word-count tuples in your cleanup method.
A part of the code might look like this:
Set<String> elementSet = wordCountSet.elementSet();
String[] array = elementSet.toArray(new String[elementSet.size()]);
Arrays.sort(array, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        // sort descending by count
        return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
    }
});
Text key = new Text();
LongWritable value = new LongWritable();
// just emit the first N records
for (int i = 0; i < Math.min(N, array.length); i++) {
    key.set(array[i]);
    value.set(wordCountSet.count(array[i]));
    context.write(key, value);
}
Hope you get the gist: do as much of the work locally as possible, and then just aggregate the top N of the top N's ;)
Quoting Thomas
To just get the Top N, you will only have to write the Top N in that
local HashMultiset to the output collector and aggregate the results
in your normal way on the reduce side. This saves you a lot of network
bandwidth as well, the only drawback is that you need to sort the
word-count tuples in your cleanup method.
If you write only the top N from the local HashMultiset, there is a possibility that you will miss the count of an element that, if passed on from this local HashMultiset, could have become one of the overall top N elements.
For example, consider the following three maps, in the format MapName : elementName,elementCount:
Map A : Ele1,4 : Ele2,5 : Ele3,5 : Ele4,2
Map B : Ele1,1 : Ele5,7 : Ele6, 3 : Ele7,6
Map C : Ele5,4 : Ele8,3 : Ele1,1 : Ele9,3
Now, if we consider only the top 3 from each mapper, we will undercount the element "Ele1", whose total count should have been 6; since we only pass on each mapper's top 3, we see "Ele1"'s total count as 4.
I hope that makes sense. Please let me know what you think about it.
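One way to keep the local grouping but stay exact (my own sketch, not from either answer): have the mappers emit their full local counts, and apply the top-N cut only in a single reducer, e.g. with a TreeMap that never holds more than N entries:
// Requires job.setNumReduceTasks(1) so every word's partial counts meet in one place.
public static class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int N = 3; // example value
    private final TreeMap<Long, String> topN = new TreeMap<Long, String>(); // count -> word

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) {
        long total = 0;
        for (LongWritable v : values) {
            total += v.get(); // exact total: all partial counts for this word arrive here
        }
        topN.put(total, key.toString()); // note: equal counts overwrite each other in this simplified sketch
        if (topN.size() > N) {
            topN.remove(topN.firstKey()); // evict the currently smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Long, String> e : topN.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
        }
    }
}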

Permutations with MapReduce

Is there a way to generate permutations with MapReduce?
input file:
1 title1
2 title2
3 title3
my goal:
1,2 title1,title2
1,3 title1,title3
2,3 title2,title3
Since a file with n inputs yields on the order of n^2 pairs (n(n-1)/2 unique combinations, as in your example), it makes sense to have n tasks each perform n of those operations. I believe you could do this (assuming only one file):
Put your input file into the DistributedCache so it is accessible read-only to your Mappers/Reducers. Make an input split on each line of the file (like in WordCount). The mapper will thus receive one line (e.g. title1 in your example). Then read the lines out of the file in the DistributedCache and emit your key/value pairs, with the key as your input line and the values as each line from the DistributedCache file.
In this model, you should only need a Map step.
Something like:
public static class PermuteMapper
        extends Mapper<Object, Text, Text, Text> {

    private static final String IN_FILENAME = "file.txt";

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        String inputLine = value.toString();
        // set the property mapred.cache.files in your
        // configuration for the file to be available
        Path[] cachedPaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cachedPaths[0].getName().equals(IN_FILENAME)) {
            // function defined elsewhere
            String[] cachedLines = getLinesFromPath(cachedPaths[0]);
            for (String line : cachedLines) {
                context.write(new Text(inputLine), new Text(line));
            }
        }
    }
}
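For completeness, a rough sketch of how the cached file might be registered in the driver (the path is a placeholder, and 'job' is the driver's Job object; on newer Hadoop versions job.addCacheFile(...) does the same thing):
// Make /input/file.txt available locally to every mapper as IN_FILENAME.
DistributedCache.addCacheFile(new Path("/input/file.txt").toUri(), job.getConfiguration());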
