Hadoop streaming: getting the optimal number of slots

I have a streaming map-reduce job with about 30 slots available for processing. Initially I get a single input file containing 60 records (fields are tab-separated); the first field of every record is a number: 1 for the first record, 2 for the second, and so on. From these records I want to create 30 files for the next step of processing, each containing 2 records (an even distribution).
To achieve this I set the number of reducers for the Hadoop job to 30. I expected the first field to be used as the key, giving me 30 output files with 2 records each.
I do get 30 output files, but they don't all contain the same number of records; some are even empty (zero size). Any idea why?

By default Hadoop shuffles and combines the map task outputs into reducer input, so map output records with the same key value are sent to the same reducer. As a result, some reducers may receive no input at all, and their output files (say part-00005) will be 0 KB in size.

What's your output key type? If you're using Text rather than IntWritable (which I assume you must be, as you're using streaming), then the reducer number is calculated from the hash of the UTF-8 byte representation of the key value. You can write a simple unit test to observe this in action:
import org.apache.hadoop.io.Text;
import org.junit.Test;

public class TextHashTest {

    @Test
    public void testHash() {
        int partitions = 30;
        for (int x = 0; x < 100; x++) {
            int hash = new Text(String.valueOf(x)).hashCode();
            int part = hash % partitions;
            System.err.printf("%d = %d => %d\n", x, hash, part);
        }
    }
}
I won't paste the output, but of the 100 values, partition bins 0-7 never receive any value.
So, as Thomas Jungblut says in his comment, you'll need to write a custom partitioner that translates the Text value back into an integer and takes it modulo the total number of partitions. Note that this may still not give you an 'even' distribution if the values themselves are not a 1-up sequence (you say they are, so you should be OK):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class IntTextPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
        // parse the numeric key back to an int and map it directly to a partition
        return Integer.valueOf(key.toString()) % numPartitions;
    }
}
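To actually use it, the partitioner class has to be registered with the job. For a plain Java job that is done on the JobConf; for a streaming job the compiled class can be named with the -partitioner option and shipped to the cluster (e.g. with -libjars). A minimal sketch of the Java-side wiring, where MyDriver is just a placeholder name:

JobConf conf = new JobConf(MyDriver.class);          // MyDriver is a hypothetical driver class
conf.setNumReduceTasks(30);                          // one reduce task per desired output file
conf.setPartitionerClass(IntTextPartitioner.class);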

Related

Number of key-value pairs in reducer

My question is:
In Hadoop MapReduce, for each intermediate key, can each reducer task emit only one final key-value pair per key, or as many as the programmer wants?
Two points here:
A reducer can emit many key-value pairs.
All keys must be of the same type, and all values must be of the same type.
For example:
public static class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(new Text("key1"), new LongWritable(4));
        // context.write(new LongWritable(1), new Text("value")); -- this line gives a compile-time error
    }
}
Here keys must be of type Text and values of type LongWritable.
Suppose your key is a LongWritable and the values are Text. Then in the reducer you expect to get many Text values for the same key, and you probably want to write each of these values on its own line:
for (Text value : values) {
    context.write(key, value);
}
As many as the programmer wants; the only constraint is that all keys should be of the same type and all values of the same type. MapReduce doesn't restrict how you use the keys and values as long as you are using Writables.
So, for a particular key,
for (Text value : values) {
    context.write(key, value);
}
and
for (int i = 0; i < 10000; i++) {
    context.write(key, new Text(String.valueOf(i)));
    // context.write(new Text("MyRandomKey"), new Text(String.valueOf(i)));
}
are both fine, provided you have declared your reducer's output keys and values as Text.
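To make this concrete, here is a minimal sketch of a reducer that emits several pairs per input key (using the new org.apache.hadoop.mapreduce API; the class name MultiEmitReducer is just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MultiEmitReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // one output pair per input value -- a reducer may call write() any number of times
        for (Text value : values) {
            context.write(key, value);
        }
        // plus an extra, synthetic pair for the same key
        context.write(key, new Text("done"));
    }
}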

How to get one single key-value pair as output from reducer

I am new to Hadoop MapReduce. I have a requirement where, let's say, I want to find the student with the highest total mark. Consider the sample dataset:
Harry Maths 80
Harry Physics 67
Daisy Science 89
Daisy Physics 90
Greg Maths 70
Greg Chemistry 79
I know that the reducer iterates over each unique key, hence I am going to get 3 output key-value pairs with name and total marks. But I need only the student with the highest total mark, i.e. reducer output -> Daisy 179.
Following is the reduce function I have written:
public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    private static int maxMark = 0;
    private static Text name = new Text();

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int totalMarks = 0;
        while (values.hasNext()) {
            totalMarks += values.next().get();
        }
        if (totalMarks > maxMark) {
            maxMark = totalMarks;
            name = key;
            output.collect(name, new IntWritable(maxMark));
        }
    }
}
But this logic is going to output the previously saved student's name and mark as well!
I could solve this problem if I knew the number of input keys to the reducer before the reducer is even called, so that when the reducer iterates over the last key (name), I could call output.collect(name, new IntWritable(maxMark)); just once.
So, is there a way to find the number of input keys to the reducer? Or else, what are the other alternatives to get one single output from reducer?
You need two MapReduce jobs. The first totals the marks by name, irrespective of subject. Then run a second job whose mapper swaps the keys and values around, so the key becomes the total mark from the previous step, making sure to use a descending sort comparator. Configure this job to use only a single reducer task, and have that reducer ignore all but the first call to reduce().
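A rough sketch of what that second job could look like (new mapreduce API; the class names are illustrative only, and it assumes the first job wrote lines of the form name<TAB>totalMarks):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// mapper: swap "name <TAB> totalMarks" into (totalMarks, name)
class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        context.write(new IntWritable(Integer.parseInt(fields[1])), new Text(fields[0]));
    }
}

// reducer: with a descending key sort and a single reduce task,
// the first key seen is the highest total -- emit it once and ignore the rest
class TopOneReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
    private boolean done = false;

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (!done) {
            context.write(values.iterator().next(), key);
            done = true;
        }
    }
}

The driver for this job would call job.setNumReduceTasks(1) and register a descending IntWritable comparator via job.setSortComparatorClass(...).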

Need help in writing Map/Reduce job to find average

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file as below:
ProcessName Time
process1 10
process2 20
processn 30
I went through a few tutorials but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to store the average directly in some sort of variable?
Thanks.
Your mappers read the text file and apply the following map function to every line:
map: (key, value)
    time = value[2]
    emit("1", time)
All map calls emit the key "1", which will be processed by a single reduce function:
reduce: (key, values)
    result = sum(values) / n
    emit("1", result)
Since you're using Hadoop, you have probably seen the use of StringTokenizer in the map function; you can use it to extract just the time from each line. You'll also need some way to compute n (the number of processes); for example, you could use a Counter in another job that just counts lines.
Update
If you were to execute this job, for each line a tuple would have to be sent to the reducer, potentially clogging the network if you run a Hadoop cluster on multiple machines.
A more clever approach can compute the sum of the times closer to the inputs, e.g. by specifying a combiner:
combine: (key, values)
    emit(key, sum(values))
This combiner is then executed locally on the output of each map task, i.e. without any network transfer in between.
The reducer would then only receive roughly as many tuples as there are map tasks, rather than as many as there are lines in your log files.
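A Java version of that combiner could be a simple summing reducer; here is a sketch assuming the maps emit (Text "1", IntWritable time):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// combiner: pre-sums the times emitted by a map task before they cross the network
public class TimeSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

It would be registered with job.setCombinerClass(TimeSumCombiner.class). Note that with a combiner in place the final reducer can no longer derive n by counting its input values, so n has to come from elsewhere (e.g. the line-counting Counter mentioned above).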
Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like
ProcessName Time
process1 10
process2 20
.
.
.
Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}
Your reducer takes these values and simply computes the average. This would look something like
private DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}
I'm making a lot of assumptions here, about your input format and what not, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
Will my output always be a text file or is it possible to directly store the average in some sort of a variable?
You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in a counter, for example.
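For instance, one option (a sketch only; the counter group and names "averages", "TIME_SUM" and "TIME_COUNT" are made up for the example) is to have the reducer publish the sum and count through counters and let the driver compute the average once the job has finished:

// in the reducer, after the loop over values:
context.getCounter("averages", "TIME_SUM").increment(sum);
context.getCounter("averages", "TIME_COUNT").increment(count);

// in the driver, once the job has completed:
job.waitForCompletion(true);
long sum = job.getCounters().findCounter("averages", "TIME_SUM").getValue();
long count = job.getCounters().findCounter("averages", "TIME_COUNT").getValue();
double average = sum / (double) count;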

Hadoop / MapReduce - Optimizing "Top N" Word Count MapReduce Job

I'm working on something similar to the canonical MapReduce example - the word count, but with a twist in that I'm looking to only get the Top N results.
Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:
"This is a test of test data and a good one to test this"
The result set from the standard MapReduce word count job would be:
test:3, a:2, this:2, is:1, etc.
But what if I ONLY want to get the Top 3 words that were used in my entire set of data?
I can still run the exact same standard MapReduce word-count job, and then just take the Top 3 results once it is ready and is spitting out the count for EVERY word, but that seems a little inefficient, because a lot of data needs to be moved around during the shuffle phase.
What I'm thinking is that, if this sample is large enough and the data is well randomized and well distributed in HDFS, each Mapper does not need to send ALL of its word counts to the Reducers, but rather only some of the top data. So if one mapper has this:
a:8234, the: 5422, man: 4352, ...... many more words ... , rareword: 1, weirdword: 1, etc.
Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.
Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
This is a very good question, because you have hit on the inefficiency of Hadoop's word count example.
The tricks to optimize your problem are the following:
Do a HashMap-based grouping in your local map stage; you can also use a combiner for that. It can look like this (I'm using Guava's HashMultiset, which facilitates a nice counting mechanism):
public static class WordFrequencyMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final HashMultiset<String> wordCountSet = HashMultiset.create();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            wordCountSet.add(token);
        }
    }
And you emit the result in your cleanup stage:
    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        Text key = new Text();
        LongWritable value = new LongWritable();
        for (Entry<String> entry : wordCountSet.entrySet()) {
            key.set(entry.getElement());
            value.set(entry.getCount());
            context.write(key, value);
        }
    }
So you have grouped the words within a local block of work, reducing network usage at the cost of a bit of RAM. You can do the same with a Combiner, but a combiner sorts to group, so it would be slower (especially for strings!) than using a HashMultiset.
To just get the Top N, you will only have to write the Top N in that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side.
This saves you a lot of network bandwidth as well; the only drawback is that you need to sort the word-count tuples in your cleanup method.
A part of the code might look like this:
Set<String> elementSet = wordCountSet.elementSet();
String[] array = elementSet.toArray(new String[elementSet.size()]);
Arrays.sort(array, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        // sort descending by count
        return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
    }
});
Text key = new Text();
LongWritable value = new LongWritable();
// just emit the first N records
for (int i = 0; i < Math.min(N, array.length); i++) {
    key.set(array[i]);
    value.set(wordCountSet.count(array[i]));
    context.write(key, value);
}
Hope you get the gist: do as much of the work locally as possible, and then just aggregate the top N of the top N's ;)
Quoting Thomas:
"To just get the Top N, you will only have to write the Top N in that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side. This saves you a lot of network bandwidth as well, the only drawback is that you need to sort the word-count tuples in your cleanup method."
If you write only the top N from each local HashMultiset, there is a possibility that you will miss part of the count of an element which, had it been passed on from that local HashMultiset, could have become one of the overall top N elements.
For example, consider the following three maps, in the format MapName : elementName,elementCount:
Map A : Ele1,4 : Ele2,5 : Ele3,5 : Ele4,2
Map B : Ele1,1 : Ele5,7 : Ele6, 3 : Ele7,6
Map C : Ele5,4 : Ele8,3 : Ele1,1 : Ele9,3
Now, if we take only the top 3 of each mapper, we undercount the element "Ele1": its total count should have been 6, but because only Map A's top 3 includes it, we see "Ele1"'s total count as 4.
I hope that makes sense. Please let me know what you think about it.

How can I get an integer index for a key in hadoop?

Intuitively, Hadoop is doing something like this to distribute keys to mappers (in Python-esque pseudocode):
# data is a dict with many key-value pairs
keys = data.keys()
key_set_size = len(keys) / num_mappers
index = 0
mapper_keys = []
for i in range(num_mappers):
    end_index = index + key_set_size
    send_to_mapper(keys[int(index):int(end_index)], i)
    index = end_index
# And something vaguely similar for the reducer (but not exactly).
It seems like somewhere hadoop knows the index of each key it is passing around, since it distributes them evenly among the mappers (or reducers). My question is: how can I access this index? I'm looking for a range of integers [0, n) mapping to all my n keys; this is what I mean by an "index".
I'm interested in the ability to get the index from within either the mapper or reducer.
After doing more research on this question, I don't believe it is possible to do exactly what I want. Hadoop does not seem to have such an index that is user-visible after all, although it does try to distribute work evenly among the mappers (so such an index is theoretically possible).
Actually, your reducer (each individual one) gets back an array of items that correspond to the reduce key. So do you want the offset of items within the reduce key in your reducer, or do you want the overall offset of the particular item in the global array of all lines being processed? To get an index in your mapper, you can simply prepend a line number to each line of the file before the file gets to the mapper. This will tell you the "global index". However, keep in mind that with 1 000 000 items, item 662 345 could be processed before item 10 000.
If you are using the new MR API, then org.apache.hadoop.mapreduce.lib.partition.HashPartitioner is the default partitioner; otherwise org.apache.hadoop.mapred.lib.HashPartitioner is the default. You can call getPartition() on either HashPartitioner to get the partition number for a key (which is what you referred to as the index).
Note that the HashPartitioner class is only used to distribute the keys to the Reducer. When it comes to a mapper, each input split is processed by a map task and the keys are not distributed.
Here is the getPartition() code from HashPartitioner. You can write a simple Java program to try it out for yourself:
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
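For instance, a small standalone check (a sketch, using the new-API HashPartitioner, an arbitrary reducer count of 30, and made-up keys) could look like this:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        HashPartitioner<Text, NullWritable> partitioner = new HashPartitioner<Text, NullWritable>();
        int numReduceTasks = 30; // arbitrary number of reducers for this check
        for (int i = 0; i < 10; i++) {
            Text key = new Text("key-" + i);
            int partition = partitioner.getPartition(key, NullWritable.get(), numReduceTasks);
            System.out.println(key + " -> partition " + partition);
        }
    }
}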
Edit: here is another way to get the index. The following code should also work; it can be included in the mapper or reducer class:
public void configure(JobConf job) {
    // 'partition' is a field of the mapper/reducer class
    partition = job.getInt("mapred.task.partition", 0);
}
