Hadoop - Classic MapReduce Wordcount

In my Reducer code, I am using this code snippet to sum the values:
for(IntWritable val : values) {
sum += val.get();
}
Since the above gives me the expected output, I tried changing the code to:
for(IntWritable val : values) {
sum += 1;
}
Can anyone please explain what difference it makes when I use sum += 1 in the reducer rather than sum += val.get()? Why does it give me the same output? Does it have anything to do with the Combiner? When I used this same reducer code as the Combiner class, the output was incorrect, with all words showing a count of 1.
Mapper Code :
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer token = new StringTokenizer(line);
    while (token.hasMoreTokens()) {
        word.set(token.nextToken());
        context.write(word, new IntWritable(1));
    }
}
Reducer Code :
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        //sum += val.get();
        sum += 1;
    }
    context.write(key, new IntWritable(sum));
}
Driver Code:
job.setJarByClass(WordCountWithCombiner.class);
//job.setJobName("WordCount");
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Input - "to be or not to be"
Expected Output - (be,2) , (to,2) , (or,1) , (not,1)
But the output I am getting is - (be,1) , (to,1) , (or,1) , (not,1)

Can anyone please explain what is the difference it makes when I use sum += 1 in the reducer rather than sum += val.get()?
Both statements perform an addition. In the first, you are counting how many times the for-loop has run. In the latter, you are actually summing the int values returned by each val object for a given key.
Why does it give me the same output? Does it have anything to do with Combiner
The answer is Yes. It is because of the Combiner.
Now let's look at the input you are passing; it will instantiate only one Mapper. The output of the Mapper is:
(to,1), (be,1), (or,1), (not,1), (to,1), (be,1)
When this goes to the Combiner, which has essentially the same logic as the Reducer, the output will be:
(be,2) , (to,2) , (or,1) , (not,1)
Now the above output of the Combiner goes to the Reducer, which performs the sum operation however you define it. So if your logic is sum += 1, the output will be:
(be,1) , (to,1) , (or,1) , (not,1)
But if your logic is sum += val.get() then your output will be:
(be,2) , (to,2) , (or,1) , (not,1)
I hope you understand it now. The logic of the Combiner and the Reducer is the same, but the input coming to them for processing is different.

It all depends on the values that val.get() returns.
If val.get() always returns 1, then sum += val.get(); is the same as sum += 1;, which is what is happening in your reducer.
BUT
The combiner is used to do a pre-aggregation (similar to the reducer's aggregation) on the mapper side, before the key-value pairs are sent to the reducer(s).
The Hadoop framework doesn't guarantee how many times the combiner is executed per Mapper; that depends on the volume of the Mapper's output. Even if the combiner is executed only once, the aggregation on the mapper side will be fine, but the reducer, instead of receiving only 1s, could receive other numbers (val.get() >= 1). And if you use sum += 1; in your reducer, you will be dropping the numbers already aggregated on the mapper side, generating wrong output.
If the combiner is executed more than once on the Mapper side, you can imagine that the problem could be even worse.
In summary, sum += 1; works if and only if that statement is executed exactly once for each key-value pair. With a combiner, that is not guaranteed.
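For reference, here is a minimal sketch (based on the code in the question, not a drop-in replacement) of a reducer that stays correct whether or not it is also registered as the combiner, because it always sums the incoming values instead of assuming each value is 1:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            // Works both for the raw 1s from the mapper and for the partial
            // sums produced by a combiner pass.
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}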

val.get() returns an int, so basically both pieces of code are the same. Whether to use val.get() depends on the problem you are trying to solve. In your case, the mapper emits each word as the key with a value of 1, so in the reducer you can be sure that val.get() will always return 1. Hence the hard-coded integer value 1 gives the same result.
Also, using the same reducer as the combiner function should not cause any problem. One scenario where every word would show a count of 1 is when the number of reducers is set to 0 and the mapper output is written directly to the output path.
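To illustrate that scenario (just a sketch of the map-only case mentioned above, not something present in your driver): with zero reduce tasks the framework skips the shuffle and reduce phases entirely and writes the mapper's (word, 1) pairs straight to the output path, so every count shows up as 1.
// Map-only job: no combiner or reducer runs, mapper output is the final output.
job.setNumReduceTasks(0);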

Related

MapReduce TotalOrderPartitioning writes output to only one file?

I am running a MapReduce job which reads the input and sorts it using multiple reducers.
I am able to get the output sorted with the number of reducers set to 5. However, the output is written to only 1 file, with 4 empty files alongside it.
I am using an InputSampler and TotalOrderPartitioner for global sorting.
My driver looks like follows:
int numReduceTasks = 5;
Configuration conf = new Configuration();
Job job = new Job(conf, "DictionarySorter");
job.setJarByClass(SampleEMR.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setNumReduceTasks(numReduceTasks);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, input);
FileOutputFormat.setOutputPath(job, new Path(output
+ ".dictionary.sorted." + getCurrentDateTime()));
job.setPartitionerClass(TotalOrderPartitioner.class);
Path inputDir = new Path("/others/partitions");
Path partitionFile = new Path(inputDir, "partitioning");
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
partitionFile);
double pcnt = 1.0;
int numSamples = numReduceTasks;
int maxSplits = numReduceTasks - 1;
if (0 >= maxSplits)
maxSplits = Integer.MAX_VALUE;
InputSampler.Sampler<LongWritable, Text> sampler = new InputSampler.RandomSampler<LongWritable, Text>(pcnt,
numSamples, maxSplits);
InputSampler.writePartitionFile(job, sampler);
job.waitForCompletion(true);
Your RandomSampler parameters seem suspicious to me:
The first parameter freq is a probability, not a percentage. For pcnt = 1 you are sampling 100% of the records.
The second parameter numSamples should be bigger. It should be enough to represent the distribution of your whole dataset.
Imagine you have the following keys: 4,7,8,9,4,1,2,5,6,3,2,4,7,4,8,1,7,1,8,9,9,9,9
Using freq = 0.3 and numSamples = 10: for the sake of simplicity, let's say 0.3 means one of every 3 keys is sampled. This will collect the following sample: 4,9,2,3,7,1,8,9. This will be sorted into 1,2,3,4,7,8,9,9. This sample has 8 elements, so all of them are kept, because it does not exceed the maximum number of samples numSamples = 10.
Out of this sample, the boundaries for your reducers will be something like 2,4,8,9. This means that if a pair has the key "1" it will end up in Reducer #1. A pair with key "2" will end up in Reducer #2. A pair with key "5" will end up in Reducer #3, etc... This would be a good distribution.
Now let's run your values on the same example keys. Your freq = 1, so you take every key into the sample; your sample would be the same as the initial key set, except that you set the maximum number of samples to numSamples = 4, which means you only keep 4 elements. Your final sample is likely to be 9,9,9,9. In this case all your boundaries are the same, so all pairs always go to Reducer #5.
In my example it looks like we were very unlucky to have the same last 4 keys. But if your original dataset is already sorted, this is likely to happen (and the boundary distribution is guaranteed to be bad) if you use a high frequency with a small number of samples.
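For comparison, a sketch of more typical sampler settings (the numbers below are illustrative, not tuned for your data): sample roughly 10% of the records, cap the sample at 10,000 keys, and look at up to 10 splits.
// Sketch only: freq = 0.1, numSamples = 10000, maxSplitsSampled = 10.
InputSampler.Sampler<LongWritable, Text> sampler =
        new InputSampler.RandomSampler<LongWritable, Text>(0.1, 10000, 10);
// The rest of the driver (partition file, writePartitionFile) stays the same.
InputSampler.writePartitionFile(job, sampler);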
This blog post has lots of details on Sampling and TotalOrderPartitioning.

Hadoop reducer cleanup function

In my Hadoop reduce code, I have a cleanup function which prints the total count, but it prints twice. I think this is because it's printing the count of key+values and the count alone, but I'm not sure.
My code has this:
protected void cleanup(Context context) throws IOException, InterruptedException {
    Text t1 = new Text("Total Count");
    context.write(t1, new IntWritable(count));
}
inside the reducer class and the output is:
Total Count 9477
Total Count 4738
The cleanup method is called at the end of each reduce task. So I assume you are running 2 reducers in that job; therefore you get 2 outputs.
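If you really want a single "Total Count" line, one option (a sketch, not the only possible fix) is to run the job with a single reduce task, so that cleanup() executes exactly once:
// One reduce task means one cleanup() call and one "Total Count" line.
job.setNumReduceTasks(1);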

MapReduce Job distribution among reducers

I developed a small MapReduce program. When I opened the process log, I saw that one map and two reducers were created by the framework. I had only one file for input and got two output files. Now please tell me:
1) Number of mapper and reducer are created by framework or it can be changed?
2) Number of output files always equal to number of reducers? i.e. each reducer
creates its own output file?
3) How one input file is distributed among mappers? And output of one mapper is
distributed among multiple reducers (this is done by framework or you can change)?
4) How to manage when multiple input files are there i.e. A directory ,
containing input files?
Please answer these questions. I am beginner to MapReduce.
Let me attempt to answer your questions. Please tell me wherever you think I am incorrect -
1) Number of mapper and reducer are created by framework or it can be changed?
The total number of map tasks depends on the total number of logical splits made out of the HDFS blocks. So fixing the number of map tasks may not always be possible, because different files can have different sizes and therefore different numbers of blocks. If you are using TextInputFormat, each logical split roughly equals one block, and fixing the total number of map tasks is not possible since each file can produce a different number of blocks.
Unlike number of mappers, reducers can be fixed.
2) Number of output files always equal to number of reducers? i.e. each reducer
creates its own output file?
To a certain degree yes, but there are ways to create more than one output file from a reducer, e.g. MultipleOutputs.
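For illustration, a rough sketch of MultipleOutputs in a reducer (the class name and the "small"/"large" output names are made up; it assumes the driver has registered the named outputs as shown in the comments):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Assumed driver setup for this sketch:
// MultipleOutputs.addNamedOutput(job, "small", TextOutputFormat.class, Text.class, IntWritable.class);
// MultipleOutputs.addNamedOutput(job, "large", TextOutputFormat.class, Text.class, IntWritable.class);
public class SplitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Route each record to one of the two named outputs based on its count.
        mos.write(sum < 100 ? "small" : "large", key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush the extra output files
    }
}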
3) How one input file is distributed among mappers? And output of one mapper is
distributed among multiple reducers (this is done by framework or you can change)?
Each file in HDFS is composed of blocks. Those blocks are replicated and can reside on multiple nodes (machines). Map tasks are then scheduled to run on these blocks.
The level of concurrency with which map tasks can run depends on the number of processors each machine has.
E.g., if 10,000 map tasks are scheduled for a file, then depending on the total number of processors throughout the cluster, only about 100 can run concurrently at a time.
By default, Hadoop uses the HashPartitioner, which calculates the hash code of each key sent from the Mapper and maps it to a partition.
E.g.:
public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
As you can see above, a partition is selected out of the total number of reducers based on the hash code. So if numReduceTasks = 4, the value returned will be between 0 and 3.
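The partitioner is also something you can change. As a purely illustrative sketch (the class name and routing rule below are made up), a custom partitioner only has to extend Partitioner and override getPartition:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Route keys by the character code of their first letter instead of the default hash code.
        String word = key.toString();
        int bucket = Character.toLowerCase(word.isEmpty() ? 'a' : word.charAt(0));
        return (bucket & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Registered in the driver with:
// job.setPartitionerClass(FirstLetterPartitioner.class);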
4) How to manage when multiple input files are there i.e. A directory ,
containing input files?
Hadoop supports a directory consisting of multiple files as the input to a job.
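For example (the paths below are made up for illustration), you can point the job at a whole directory, at several individual files, or at a comma-separated list of paths:
FileInputFormat.addInputPath(job, new Path("/data/input-dir"));        // every file in the directory
FileInputFormat.addInputPath(job, new Path("/data/extra/part-00000")); // plus an individual file
FileInputFormat.addInputPaths(job, "/data/logs/2015,/data/logs/2016"); // or a comma-separated list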
As explained by 'SSaikia_JtheRocker' mapper tasks are created according to the total number of logical splits on HDFS blocks.
I would like to add something to the question #3 "How one input file is distributed among mappers? And output of one mapper is distributed among multiple reducers (this is done by framework or you can change)?"
For example, consider my word count program, which counts the number of words in a file, shown below:
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context) // Context context is the output
            throws IOException, InterruptedException {
        // value = "How Are You"
        String line = value.toString(); // converts Hadoop's Text "How Are You" into a Java String
        StringTokenizer tokenizer = new StringTokenizer(line); // splits the line into the tokens {"How", "Are", "You"}
        while (tokenizer.hasMoreTokens()) // hasMoreTokens returns true while tokens remain
        {
            value.set(tokenizer.nextToken()); // value is overwritten with the current token, e.g. "How"
            context.write(value, new IntWritable(1)); // emit (word, 1) to the map output
            // How, 1
            // Are, 1
            // You, 1
            // map() itself runs once per input line
        }
    }
}
So in the above program, the line "How Are You" is split into 3 words by the StringTokenizer; the map method itself is called once per input line, and inside the while loop context.write is called once per word, so here 3 (word, 1) pairs are emitted.
As for reducers, we can specify how many reducers we want our output to be generated by, using the job.setNumReduceTasks(5); statement. The code snippet below will give you an idea.
public class BooksMain {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use programArgs array to retrieve program arguments.
        String[] programArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        Job job = new Job(conf);
        job.setJarByClass(BooksMain.class);
        job.setMapperClass(BookMapper.class);
        job.setReducerClass(BookReducer.class);
        job.setNumReduceTasks(5);
        // job.setCombinerClass(BookReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TODO: Update the input path for the location of the inputs of the map-reduce job.
        FileInputFormat.addInputPath(job, new Path(programArgs[0]));
        // TODO: Update the output path for the output directory of the map-reduce job.
        FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
        // Submit the job and wait for it to finish.
        job.waitForCompletion(true);
        // Submit and return immediately:
        // job.submit();
    }
}

Need help in writing Map/Reduce job to find average

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file as below:
ProcessName Time
process1 10
process2 20
processn 30
I went through a few tutorials, but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to directly store the average in some sort of a variable?
Thanks.
Your Mappers read the text file and apply the following map function on every line
map: (key, value)
time = value[2]
emit("1", time)
All map calls emit the key "1" which will be processed by one single reduce function
reduce: (key, values)
result = sum(values) / n
emit("1", result)
Since you're using Hadoop, you've probably seen the use of StringTokenizer in the map function; you can use it to extract only the time from a line. You can also think about how to compute n (the number of processes); for example, you could use a Counter in another job which just counts lines.
Update
If you were to execute this job, for each line a tuple would have to be sent to the reducer, potentially clogging the network if you run a Hadoop cluster on multiple machines.
A more clever approach can compute the sum of the times closer to the inputs, e.g. by specifying a combiner:
combine: (key, values)
emit(key, sum(values))
This combiner is then executed on the results of all map functions of the same machine, i.e., without networking in between.
The reducer would then only get as many tuples as there are machines in the cluster, rather than as many as lines in your log files.
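A rough Java rendering of that combine step could look like the sketch below (the class and field names are illustrative; it assumes the mapper emits each time as an IntWritable under the single key "1"):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PartialSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable partialSum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up the times seen on this map side
        }
        partialSum.set(sum);
        // One partial sum per combiner invocation instead of one value per input line.
        context.write(key, partialSum);
    }
}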
Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like
ProcessName Time
process1 10
process2 20
.
.
.
Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}
Your reducer takes these values, and simply computes the average. This would look something like
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}
I'm making a lot of assumptions here, about your input format and what not, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
Will my output always be a text file or is it possible to directly store the average in some sort of a variable?
You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in a counter, for example.
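A sketch of the counter option (counters only hold longs, so the average is scaled before it is stored; the "stats"/"avg_times_1000" group and counter names are made up for illustration):
// In the reducer, after computing `average`:
context.getCounter("stats", "avg_times_1000").increment((long) (average.get() * 1000));

// In the driver, after job.waitForCompletion(true):
double avg = job.getCounters().findCounter("stats", "avg_times_1000").getValue() / 1000.0;
System.out.println("Average time: " + avg);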

Hadoop / MapReduce - Optimizing "Top N" Word Count MapReduce Job

I'm working on something similar to the canonical MapReduce example - the word count, but with a twist in that I'm looking to only get the Top N results.
Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:
"This is a test of test data and a good one to test this"
The result set from the standard MapReduce word count job would be:
test:3, a:2, this:2, is: 1, etc..
But what if I ONLY want to get the Top 3 words that were used in my entire set of data?
I can still run the exact same standard MapReduce word-count job, and then just take the Top 3 results once it is ready and is spitting out the count for EVERY word, but that seems a little inefficient, because a lot of data needs to be moved around during the shuffle phase.
What I'm thinking is that, if this sample is large enough and the data is randomly and well distributed in HDFS, each Mapper does not need to send ALL of its word counts to the Reducers, but rather only some of the top data. So if one mapper has this:
a:8234, the: 5422, man: 4352, ...... many more words ... , rareword: 1, weirdword: 1, etc.
Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.
Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
This is a very good question, because you have hit the inefficiency of Hadoop's word count example.
The tricks to optimize your problem are the following:
Do a HashMap-based grouping in your local map stage; you can also use a combiner for that. It can look like the code below. I'm using Guava's HashMultiset, which facilitates a nice counting mechanism.
public static class WordFrequencyMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final HashMultiset<String> wordCountSet = HashMultiset.create();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            wordCountSet.add(token);
        }
    }
And you emit the result in your cleanup stage:
    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        Text key = new Text();
        LongWritable value = new LongWritable();
        for (Entry<String> entry : wordCountSet.entrySet()) {
            key.set(entry.getElement());
            value.set(entry.getCount());
            context.write(key, value);
        }
    }
}
So you have grouped the words in a local block of work, reducing network usage at the cost of a bit of RAM. You can also do the same with a Combiner, but it sorts to group, so it would be slower (especially for strings!) than using a HashMultiset.
To just get the Top N, you will only have to write the Top N in that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side.
This saves you a lot of network bandwidth as well, the only drawback is that you need to sort the word-count tuples in your cleanup method.
A part of the code might look like this:
Set<String> elementSet = wordCountSet.elementSet();
String[] array = elementSet.toArray(new String[elementSet.size()]);
Arrays.sort(array, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        // sort descending
        return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
    }
});
Text key = new Text();
LongWritable value = new LongWritable();
// just emit the first n records
for (int i = 0; i < N; i++) {
    key.set(array[i]);
    value.set(wordCountSet.count(array[i]));
    context.write(key, value);
}
Hope you get the gist: do as much of the word counting locally as possible, then just aggregate the top N of the top N's ;)
Quoting Thomas
To just get the Top N, you will only have to write the Top N in that
local HashMultiset to the output collector and aggregate the results
in your normal way on the reduce side. This saves you a lot of network
bandwidth as well, the only drawback is that you need to sort the
word-count tuples in your cleanup method.
If you write only the top N from the local HashMultiset, there is a possibility that you will miss the count of an element that, had it been passed on from this local HashMultiset, could have become one of the overall top N elements.
For example, consider the following three maps, in the format MapName: elementName,elementCount:
Map A : Ele1,4 : Ele2,5 : Ele3,5 : Ele4,2
Map B : Ele1,1 : Ele5,7 : Ele6, 3 : Ele7,6
Map C : Ele5,4 : Ele8,3 : Ele1,1 : Ele9,3
Now, if we consider the top 3 of each mapper, we will get the wrong count for the element "Ele1": its total count should have been 6, but since we only take each mapper's top 3, we see "Ele1"'s total count as 4.
I hope that makes sense. Please let me know what you think about it.
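One way around that pitfall (a sketch, not taken from the answers above) is to let the map side emit every word with its local count and do the top-N selection only once, in a single reducer that keeps the N largest totals in a TreeMap and emits them in cleanup():
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private static final int N = 3;
    // total count -> word; note that ties on the count overwrite each other in this simple sketch
    private final TreeMap<Long, String> topN = new TreeMap<Long, String>();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        topN.put(sum, key.toString());
        if (topN.size() > N) {
            topN.remove(topN.firstKey()); // evict the smallest total
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit from largest to smallest; this only works correctly with a single reduce task.
        for (Map.Entry<Long, String> entry : topN.descendingMap().entrySet()) {
            context.write(new Text(entry.getValue()), new LongWritable(entry.getKey()));
        }
    }
}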
