How to get one single key-value pair as output from reducer - hadoop

I am new to Hadoop MapReduce. I have a requirement where, let's say, I want to find the student name with the highest total marks. Consider the sample dataset:
Harry Maths 80
Harry Physics 67
Daisy Science 89
Daisy Physics 90
Greg Maths 70
Greg Chemistry 79
I know that the reducer is called once for each unique key, so I am going to get 3 output key-value pairs with name and total marks. But I need only the name of the student with the highest total mark, i.e. reducer output -> Daisy 179
Following is the reduce function I have written:
public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    private static int maxMark = 0;
    private static Text name = new Text();

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int totalMarks = 0;
        while (values.hasNext()) {
            totalMarks += values.next().get();
        }
        if (totalMarks > maxMark) {
            maxMark = totalMarks;
            name = new Text(key);   // copy the key; Hadoop reuses the Text object it passes in
            output.collect(name, new IntWritable(maxMark));
        }
    }
}
But this logic is going to output the previously saved student's name and mark as well!
I could solve this problem if I knew the number of input keys before the reducer is even called, so that when the reducer processes the last key (name), I could call output.collect(name, new IntWritable(maxMark)); just once.
So, is there a way to find the number of input keys to the reducer? Or else, what are the alternatives for getting a single output from the reducer?

You need two MapReduce jobs. The first will total up the marks by name, irrespective of subject. Then you can run a second job with a mapper that turns the keys and values around, so the key is the sum of marks from the previous step, making sure to use a descending comparator. Configure this job to use only a single reducer task, and it can flag itself to ignore all but the first call to reduce. A sketch of that second job is below.
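A rough sketch of the second job, under the assumption that the first job has written lines of the form name TAB totalMarks (the class names are illustrative, not from the question; imports from org.apache.hadoop.io and org.apache.hadoop.mapreduce are omitted as in the other snippets here):
// Mapper of the second job: swap (name, total) so the total becomes the key.
public static class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");   // e.g. "Daisy\t179"
        context.write(new IntWritable(Integer.parseInt(fields[1])), new Text(fields[0]));
    }
}

// Reducer: with job.setNumReduceTasks(1) and a descending sort comparator
// registered via job.setSortComparatorClass(...), the first reduce() call
// sees the highest total, so emit it and ignore every later call.
public static class TopOneReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
    private boolean emitted = false;

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (!emitted) {
            context.write(values.iterator().next(), key);   // e.g. "Daisy 179"
            emitted = true;
        }
    }
}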

Related

Number of key-value pairs in reducer

My question is:
In Hadoop MapReduce, for each intermediate key, can a reducer task emit only one final key-value pair per key, or as many as the programmer wants?
Two points here:
A reducer can emit many key-value pairs.
All keys must be of the same type, and all values must be of the same type.
For example,
public static class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    // inside reduce(...):
    context.write(new Text("key1"), new LongWritable(4));
    context.write(new LongWritable(1), new Text("value")); // this line gives a compile-time error
Keys must be of Text type and values must be of LongWritable type.
Suppose your key is a LongWritable and the values are Text. Then in the reducer you expect to get many Text values for the same key, and you probably want to write each of these values on its own line:
for (Text value : values) {
    context.write(key, value);
}
As many as the programmer wants; the only constraint is that all keys must be of the same type and all values of the same type. MapReduce doesn't restrict how many pairs you emit, as long as you are using Writables.
So for a particular key,
for (Text value : values) {
    context.write(key, value);
}
and
for (int i = 0; i < 10000; i++) {
    context.write(key, new Text(String.valueOf(i)));
    // context.write(new Text("MyRandomKey"), new Text(String.valueOf(i)));
}
are both fine, considering you have defined your keys and values as Text in the reducer.

Need help in writing Map/Reduce job to find average

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file like the one below:
ProcessName Time
process1 10
process2 20
processn 30
I went through a few tutorials but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to store the average directly in some sort of variable?
Thanks.
Your mappers read the text file and apply the following map function to every line:
map: (key, value)
    time = value[2]
    emit("1", time)
All map calls emit the key "1", which will be processed by one single reduce function:
reduce: (key, values)
    result = sum(values) / n
    emit("1", result)
Since you're using Hadoop, you have probably seen the use of StringTokenizer in the map function; you can use it to extract just the time from a line. You also need some way to compute n (the number of processes); you could, for example, use a Counter in another job that just counts lines.
Update
If you were to execute this job, for each line a tuple would have to be sent to the reducer, potentially clogging the network if you run a Hadoop cluster on multiple machines.
A more clever approach can compute the sum of the times closer to the inputs, e.g. by specifying a combiner:
combine: (key, values)
    emit(key, sum(values))
This combiner is then executed on the results of all map functions of the same machine, i.e., without networking in between.
The reducer would then only get as many tuples as there are machines in the cluster, rather than as many as lines in your log files.
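In Java, a combiner along these lines could be a plain Reducer that just sums. This is a sketch, assuming the mapper emits a constant Text key "1" with IntWritable times; it would be registered in the driver via job.setCombinerClass(SumCombiner.class):
// Pre-sums the times emitted on one machine before they cross the network.
// The final reducer still divides by n, which it must obtain separately (e.g. from a counter).
public static class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}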
Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like
ProcessName Time
process1 10
process2 20
.
.
.
Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}
Your reducer takes these values, and simply computes the average. This would look something like
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}
I'm making a lot of assumptions here, about your input format and what not, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
Will my output always be a text file or is it possible to directly store the average in some sort of a variable?
You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in a counter, for example.
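For the counter option, a minimal sketch (the counter group and name "stats" / "AVG_TIMES_1000" are made up for illustration; counters hold longs, so the average is scaled before storing it):
// In the reducer, after computing the average:
context.getCounter("stats", "AVG_TIMES_1000").increment((long) (average.get() * 1000));

// In the driver, after job.waitForCompletion(true) returns:
long scaled = job.getCounters().findCounter("stats", "AVG_TIMES_1000").getValue();
double avg = scaled / 1000.0;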

Hadoop / MapReduce - Optimizing "Top N" Word Count MapReduce Job

I'm working on something similar to the canonical MapReduce example - the word count, but with a twist in that I'm looking to only get the Top N results.
Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:
"This is a test of test data and a good one to test this"
The result set from the standard MapReduce word count job would be:
test:3, a:2, this:2, is:1, etc.
But what if I ONLY want to get the Top 3 words that were used in my entire set of data?
I can still run the exact same standard MapReduce word-count job, and then just take the Top 3 results once it is ready and is spitting out the count for EVERY word, but that seems a little inefficient, because a lot of data needs to be moved around during the shuffle phase.
What I'm thinking is that, if this sample is large enough and the data is well randomized and well distributed in HDFS, then each mapper does not need to send ALL of its word counts to the reducers, but rather only some of the top data. So if one mapper has this:
a:8234, the: 5422, man: 4352, ...... many more words ... , rareword: 1, weirdword: 1, etc.
Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.
Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
This is a very good question, because you have hit the inefficiency of Hadoop's word count example.
The tricks to optimize your problem are the following:
Do a HashMap-based grouping in your local map stage; you can also use a combiner for that. It can look like the following; I'm using Guava's HashMultiset, which facilitates a nice counting mechanism.
public static class WordFrequencyMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final HashMultiset<String> wordCountSet = HashMultiset.create();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            wordCountSet.add(token);
        }
    }
And you emit the result in your cleanup stage:
    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        Text key = new Text();
        LongWritable value = new LongWritable();
        for (Entry<String> entry : wordCountSet.entrySet()) {
            key.set(entry.getElement());
            value.set(entry.getCount());
            context.write(key, value);
        }
    }
So you have grouped the words in a local block of work, thus reducing network usage by using a bit of RAM. You can also do the same with a combiner, but a combiner sorts in order to group, so it would be slower (especially for strings!) than using a HashMultiset.
To just get the Top N, you will only have to write the Top N in that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side.
This saves you a lot of network bandwidth as well; the only drawback is that you need to sort the word-count tuples in your cleanup method.
A part of the code might look like this:
Set<String> elementSet = wordCountSet.elementSet();
String[] array = elementSet.toArray(new String[elementSet.size()]);
Arrays.sort(array, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        // sort descending by count
        return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
    }
});
Text key = new Text();
LongWritable value = new LongWritable();
// just emit the first N records (or fewer, if there aren't that many words)
for (int i = 0; i < Math.min(N, array.length); i++) {
    key.set(array[i]);
    value.set(wordCountSet.count(array[i]));
    context.write(key, value);
}
Hope you get the gist: do as much of the work locally and then just aggregate the top N of the top N's ;)
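For the reduce-side aggregation that is described only in words above, one possible sketch (assuming a single reduce task so the result is global, N assumed to be 3 as in the question; ties on the count overwrite each other in this simple TreeMap version):
public static class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int N = 3;
    // count -> word, keeping at most N entries
    private final TreeMap<Long, String> topN = new TreeMap<Long, String>();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        topN.put(sum, key.toString());
        if (topN.size() > N) {
            topN.remove(topN.firstKey());   // drop the smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit in descending order of count
        for (Map.Entry<Long, String> entry : topN.descendingMap().entrySet()) {
            context.write(new Text(entry.getValue()), new LongWritable(entry.getKey()));
        }
    }
}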
Quoting Thomas
To just get the Top N, you will only have to write the Top N in that
local HashMultiset to the output collector and aggregate the results
in your normal way on the reduce side. This saves you a lot of network
bandwidth as well, the only drawback is that you need to sort the
word-count tuples in your cleanup method.
If you write only the top N from the local HashMultiset, then there is a possibility that you will miss the count of an element that, had it been passed on from this local HashMultiset, could have become one of the overall top N elements.
For example, consider the following counts from three mappers, in the format MapName : elementName,elementCount:
Map A : Ele1,4 : Ele2,5 : Ele3,5 : Ele4,2
Map B : Ele1,1 : Ele5,7 : Ele6, 3 : Ele7,6
Map C : Ele5,4 : Ele8,3 : Ele1,1 : Ele9,3
Now, if we consider only the top 3 of each mapper, we miss part of the count of the element "Ele1", whose total count should have been 6; since we are taking each mapper's top 3, we see "Ele1"'s total count as 4.
I hope that makes sense. Please let me know what you think about it.

hadoop streaming getting optimal number of slots

I have a streaming map-reduce job. I have some 30 slots for processing. Initially I get a single input file containing 60 records (fields are tab-separated); the first field of every record is a number: 1 for the first record, 2 for the second, and so on. I want to create 30 files from these records for the next step of processing, each containing 2 records (an even distribution).
For this to work I set the number of reducers for the Hadoop job to 30. I expected that the first field would be used as the key and that I would get 30 output files, each containing 2 records.
I do get 30 output files, but they don't all contain the same number of records. Some files are even empty (zero size). Any idea why?
By default, Hadoop shuffles the map task outputs into reducer input, so map outputs with the same key are sent to the same reducer. With this default partitioning, some reducers may receive no input at all, which is why a file such as part-00005 can end up with a size of 0 KB.
What's your output key type? If you're using Text rather than IntWritable (which I assume you must be, as you're using streaming), then the reducer number is calculated from the hash of the UTF-8 byte representation of the key value. You can write a simple unit test to observe this in action:
public class TextHashTest {
    @Test
    public void testHash() {
        int partitions = 30;
        for (int x = 0; x < 100; x++) {
            int hash = new Text(String.valueOf(x)).hashCode();
            int part = hash % partitions;
            System.err.printf("%d = %d => %d\n", x, hash, part);
        }
    }
}
I won't paste the output, but of the 100 values, partition bins 0-7 never receive any value.
So, like Thomas Jungblut says in his comment, you'll need to write a custom partitioner to translate the Text value back into an integer value and then take that number modulo the total number of partitions. This may still not give you an 'even' distribution if the values themselves are not in a 1-up sequence (which you say they are, so you should be OK):
public class IntTextPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
        return Integer.valueOf(key.toString()) % numPartitions;
    }
}
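For reference, if this partitioner were wired up from a plain Java driver with the old JobConf API, it might look like the sketch below (the driver class name is a placeholder); with streaming itself you would instead package the class into a jar and pass it via the -partitioner option:
JobConf conf = new JobConf(MyDriver.class);          // MyDriver is a placeholder
conf.setPartitionerClass(IntTextPartitioner.class);
conf.setNumReduceTasks(30);                          // one output file per key bucket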

Permutations with MapReduce

Is there a way to generate permutations with MapReduce?
input file:
1 title1
2 title2
3 title3
my goal:
1,2 title1,title2
1,3 title1,title3
2,3 title2,title3
Since a file will have n inputs, the pairings should have on the order of n^2 outputs (n(n-1)/2 if, as in your example, you only want each unordered pair once). It makes sense that you could have n tasks each perform n of those operations. I believe you could do this (assuming only one file):
Put your input file into the DistributedCache so it is accessible read-only to your mappers/reducers. Make an input split on each line of the file (like in WordCount). The mapper will thus receive one line (e.g. title1 in your example). Then read the lines out of the file in the DistributedCache and emit your key/value pairs, with the key being your input line and one value for each line read from the DistributedCache.
In this model, you should only need a Map step.
Something like:
public static class PermuteMapper
        extends Mapper<Object, Text, Text, Text> {

    private static final String IN_FILENAME = "file.txt";

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String inputLine = value.toString();
        // add the file to the DistributedCache in the driver
        // (mapred.cache.files) so it is available here
        Path[] cachedPaths =
                DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cachedPaths[0].getName().equals(IN_FILENAME)) {
            // getLinesFromPath is defined elsewhere
            String[] cachedLines = getLinesFromPath(cachedPaths[0]);
            for (String line : cachedLines) {
                context.write(new Text(inputLine), new Text(line));
            }
        }
    }
}
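The driver-side setup that the comment in the mapper alludes to could look roughly like this (old DistributedCache API; the HDFS path is illustrative and its file name must match IN_FILENAME):
// In the driver, e.g. inside main(String[] args) throws Exception, before submitting the job:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/me/file.txt"), conf);   // illustrative path
Job job = new Job(conf, "permutations");
// ...set mapper, input/output formats and paths as usual, then submit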
