Need help in writing Map/Reduce job to find average - hadoop

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find average time taken by n processes, given an input text file as below:
ProcessName Time
process1 10
process2 20
processn 30
I went through a few tutorials, but I still don't have a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to store the average directly in some sort of variable?
Thanks.

Your Mappers read the text file and apply the following map function on every line
map: (key, value)
time = value[2]
emit("1", time)
All map calls emit the key "1" which will be processed by one single reduce function
reduce: (key, values)
result = sum(values) / n
emit("1", result)
Since you're using Hadoop, you have probably seen StringTokenizer used in the map function; you can use it to extract just the time from each line. You also need a way to compute n (the number of processes); for example, you could use a Counter in another job that just counts lines.
Update
If you were to execute this job, for each line a tuple would have to be sent to the reducer, potentially clogging the network if you run a Hadoop cluster on multiple machines.
A more clever approach can compute the sum of the times closer to the inputs, e.g. by specifying a combiner:
combine: (key, values)
emit(key, sum(values))
This combiner is executed on the output of each map task, on the machine where that task ran, i.e., without networking in between.
The reducer would then only get roughly as many tuples as there are map tasks, rather than as many as there are lines in your log files.
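One caveat with the combiner: once partial sums are combined, the reducer no longer sees one value per line, so it cannot count the lines itself. A minimal sketch in plain Java (no Hadoop dependency; the class and method names are hypothetical) of carrying a (sum, count) pair through the combine step, so that the final division is still exact:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of map -> combine -> reduce for an average.
// Hypothetical names; this illustrates the data flow, not the Hadoop API.
class AverageSketch {

    // A partial result: a sum of times and how many lines contributed to it.
    static final class Partial {
        final long sum;
        final long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // "Map": one input line "processName time" becomes (time, 1).
    static Partial map(String line) {
        String[] fields = line.trim().split("\\s+");
        return new Partial(Long.parseLong(fields[1]), 1);
    }

    // "Combine": merge partials locally; sums and counts are both additive.
    static Partial combine(List<Partial> partials) {
        long sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    // "Reduce": merge the combined partials, then divide once at the end.
    static double reduce(List<Partial> partials) {
        Partial total = combine(partials);
        return total.sum / (double) total.count;
    }

    // Runs the whole pipeline over several input splits.
    static double average(List<List<String>> splits) {
        List<Partial> combined = new ArrayList<>();
        for (List<String> split : splits) {
            List<Partial> mapped = new ArrayList<>();
            for (String line : split) {
                mapped.add(map(line));
            }
            combined.add(combine(mapped));
        }
        return reduce(combined);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("process1 10", "process2 20"),
            Arrays.asList("process3 60"));
        System.out.println(average(splits)); // 30.0
    }
}
```

Averaging the per-split averages instead (15 and 60) would give 37.5, which is why the count travels with the sum.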

Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like
ProcessName Time
process1 10
process2 20
.
.
.
Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like
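private final IntWritable one = new IntWritable(1);
private final IntWritable output = new IntWritable();

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Text has no split(); convert to a String first.
    // Assumes tab-separated columns and no header row.
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}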
Your reducer takes these values, and simply computes the average. This would look something like
private final DoubleWritable average = new DoubleWritable();

@Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}
I'm making a lot of assumptions here, about your input format and what not, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
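One of those assumptions is worth spelling out: the sample input starts with a "ProcessName Time" header and looks space-separated, so parsing the second column as an integer would fail on the header, and splitting on a tab would not split the rows. A small plain-Java sketch of tolerant line parsing that a mapper could call (parseTime and TimeLineParser are hypothetical names, not part of Hadoop):

```java
import java.util.OptionalInt;

class TimeLineParser {

    // Returns the time column of a "name time" line, or empty for lines
    // that should be skipped (blank lines, the header, malformed rows).
    static OptionalInt parseTime(String line) {
        String[] fields = line.trim().split("\\s+"); // splits on tabs or spaces
        if (fields.length < 2) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(fields[1]));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // e.g. the "ProcessName Time" header
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTime("ProcessName Time").isPresent()); // false
        System.out.println(parseTime("process1 10").getAsInt());       // 10
    }
}
```

In the mapper, a line for which the result is empty would simply not be written to the context.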
Will my output always be a text file or is it possible to directly store the average in some sort of a variable?
You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in a counter, for example.

Related

hadoop job with single mapper and two different reducers

I have a large document corpus as an input to a MapReduce job (old hadoop API). In the mapper, I can produce two kinds of output: one counting words and one producing minHash signatures. What I need to do is:
give the word counting output to one reducer class (a typical WordCount reducer) and
give the minHash signatures to another reducer class (performing some calculations on the size of the buckets).
The input is the same corpus of documents and there is no need to process it twice. I think that MultipleOutputs is not the solution, as I cannot find a way to give my Mapper output to two different Reduce classes.
In a nutshell, what I need is the following:
            WordCounting Reducer --> WordCount output
           /
Input --> Mapper
           \
            MinHash Buckets Reducer --> MinHash output
Is there any way to use the same Mapper (in the same job), or should I split that in two jobs?
You can do it, but it will involve some coding tricks (a Partitioner and a prefix convention). The idea is for the mapper to output each word prefixed with "W:" and each minhash prefixed with "M:". Then use a Partitioner to decide which partition (aka reducer) each key goes to.
Pseudo code
MAIN method:
    Set number of reducers to 2
MAPPER:
    .... parse the word ...
    ... generate minhash ..
    context.write("W:" + word, 1);
    context.write("M:" + minhash, 1);
Partitioner:
    IF Key starts with "W:" { return 0; } // reducer 1
    IF Key starts with "M:" { return 1; } // reducer 2
Combiner:
    IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;}
    Iterate and context.write all of the values
Reducer:
    IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;}
    IF Key starts with "M:" { perform min hash logic }
In the output, part-00000 will hold your word counts and part-00001 your min hash calculations.
Unfortunately it is not possible to provide different Reducer classes, but with IF and prefix you can simulate it.
Also, having just 2 reducers might not be efficient from a performance point of view; in that case you could have the Partitioner allocate the first N partitions to the word count.
If you do not like the prefix idea, then you would need to implement a secondary sort with a custom WritableComparable class for the key. But that is worth the effort only in more sophisticated cases.
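The partitioning rule, including the variant that gives the first N partitions to the word count, can be sketched in plain Java. This is only the arithmetic a Partitioner's getPartition() would perform; the class, method and parameter names are hypothetical, and the "W:" prefix follows the convention described above:

```java
class PrefixPartitionSketch {

    // Keys starting with "W:" are spread over partitions [0, wordPartitions);
    // all other keys (the "M:" keys) are spread over the remaining partitions.
    static int partitionFor(String key, int numPartitions, int wordPartitions) {
        if (wordPartitions <= 0 || wordPartitions >= numPartitions) {
            throw new IllegalArgumentException("need 0 < wordPartitions < numPartitions");
        }
        int hash = key.hashCode() & Integer.MAX_VALUE; // force a non-negative hash
        if (key.startsWith("W:")) {
            return hash % wordPartitions;
        }
        return wordPartitions + hash % (numPartitions - wordPartitions);
    }

    public static void main(String[] args) {
        // With 2 partitions and 1 word partition this is the two-reducer layout.
        System.out.println(partitionFor("W:hadoop", 2, 1)); // 0
        System.out.println(partitionFor("M:12345", 2, 1));  // 1
    }
}
```

With more than two partitions, the reducer still has to branch on the prefix, because every reducer task runs the same Reducer class.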
AFAIK this is not possible in a single MapReduce job: only the default output files (part--r--0000) are fed to the reducer. So if you create two named outputs, WordCount--m--0 and MinHash--m--0,
you can then create two further Map/Reduce jobs, each with an identity mapper and the respective reducer, specifying hdfspath/WordCount--* and hdfspath/MinHash--* as the inputs to the respective jobs.

MapReduce Job distribution among reducers

I developed a small MapReduce program. When I opened the process log, I saw that the framework had created one map task and two reducers. I had only one input file and got two output files. Now please tell me:
1) Number of mapper and reducer are created by framework or it can be changed?
2) Number of output files always equal to number of reducers? i.e. each reducer
creates its own output file?
3) How one input file is distributed among mappers? And output of one mapper is
distributed among multiple reducers (this is done by framework or you can change)?
4) How to manage when multiple input files are there i.e. A directory ,
containing input files?
Please answer these questions. I am beginner to MapReduce.
Let me attempt to answer your questions. Please tell me if you think anything is incorrect.
1) Number of mapper and reducer are created by framework or it can be changed?
The total number of map tasks depends on the number of logical splits made from the HDFS blocks. Fixing the number of map tasks is therefore not always possible, because different files have different sizes and hence different numbers of blocks. If you are using TextInputFormat, each logical split corresponds roughly to one block, so the number of map tasks follows from the input rather than from a fixed setting.
Unlike the number of mappers, the number of reducers can be fixed.
2) Number of output files always equal to number of reducers? i.e. each reducer
creates its own output file?
To a certain degree, yes, but there are ways to create more than one output file from a reducer, e.g. MultipleOutputs.
3) How one input file is distributed among mappers? And output of one mapper is
distributed among multiple reducers (this is done by framework or you can change)?
Each file in HDFS is composed of blocks. Those blocks are replicated and can reside on multiple nodes (machines). Map tasks are then scheduled to run on these blocks.
How many map tasks can run concurrently depends on the number of processors each machine has.
E.g. if 10,000 map tasks are scheduled for a file, then depending on the total number of processors in the cluster, perhaps only 100 can run at a time.
By default Hadoop uses HashPartitioner, which calculates the hashcode of the keys being sent from the Mapper to the framework and converts them to a partition.
E.g.:
public int getPartition(K2 key, V2 value,
int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
As you can see above, the partition is derived from the key's hash code and the fixed number of reducers. So, if numReduceTasks = 4, the value returned is between 0 and 3.
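The bitwise AND with Integer.MAX_VALUE in getPartition() matters because Java's % operator keeps the sign of the dividend, and hash codes can be negative. A minimal plain-Java sketch of the same arithmetic (the class name is hypothetical; Integer is used as the key type only because its hash code equals its value):

```java
class HashPartitionSketch {

    // Same arithmetic as the getPartition() shown above, applied to a raw hash code.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int hash = Integer.valueOf(-7).hashCode(); // an Integer hashes to its own value: -7
        System.out.println(hash % 4);              // -3, not a valid partition number
        System.out.println(partition(hash, 4));    // 1
    }
}
```

Clearing the sign bit guarantees a result in the range 0 to numReduceTasks - 1 for every possible hash code, including Integer.MIN_VALUE.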
4) How to manage when multiple input files are there i.e. A directory ,
containing input files?
Hadoop supports a directory consisting of multiple files as an input to a job.
As explained by 'SSaikia_JtheRocker' mapper tasks are created according to the total number of logical splits on HDFS blocks.
I would like to add something to the question #3 "How one input file is distributed among mappers? And output of one mapper is distributed among multiple reducers (this is done by framework or you can change)?"
For example, consider my word count program below, which counts the words in a file:
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) // context collects the output
            throws IOException, InterruptedException {
        // value = "How Are You"
        String line = value.toString(); // convert Hadoop's Text to a Java String
        StringTokenizer tokenizer = new StringTokenizer(line); // iterates over the tokens "How", "Are", "You"
        while (tokenizer.hasMoreTokens()) { // true while tokens remain
            value.set(tokenizer.nextToken()); // reuse value: it now holds the current token, e.g. "How"
            context.write(value, new IntWritable(1)); // emit (word, 1) as map output
            // How, 1
            // Are, 1
            // You, 1
        }
        // map() is called once per input line
    }
}
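So in the above program, the line "How are you" is split into 3 words by StringTokenizer. The map() method is called once for this line, and the while loop inside it runs 3 times, emitting 3 key/value pairs. The number of map tasks is a separate matter: it is determined by the number of input splits, not by the number of lines or words.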
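To make the counts concrete: a single call to map() handles one line, and the loop inside it emits one pair per token. A plain-Java sketch of just that loop (no Hadoop types; emitPairs and the tab-separated output strings are hypothetical stand-ins for context.write):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizeSketch {

    // Mimics the body of map() for one line: one (word, 1) pair per token.
    static List<String> emitPairs(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            emitted.add(tokenizer.nextToken() + "\t1");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // One call for one line produces three pairs.
        System.out.println(emitPairs("How Are You").size()); // 3
    }
}
```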
As for reducers, we can specify how many we want with the 'job.setNumReduceTasks(5);' statement. The code snippet below will give you an idea.
public class BooksMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use programArgs array to retrieve program arguments.
        String[] programArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        Job job = new Job(conf);
        job.setJarByClass(BooksMain.class);
        job.setMapperClass(BookMapper.class);
        job.setReducerClass(BookReducer.class);
        job.setNumReduceTasks(5);
        // job.setCombinerClass(BookReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TODO: Update the input path for the location of the inputs of the map-reduce job.
        FileInputFormat.addInputPath(job, new Path(programArgs[0]));
        // TODO: Update the output path for the output directory of the map-reduce job.
        FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
        // Submit the job and wait for it to finish.
        job.waitForCompletion(true);
        // Submit and return immediately:
        // job.submit();
    }
}

Hadoop / MapReduce - Optimizing "Top N" Word Count MapReduce Job

I'm working on something similar to the canonical MapReduce example - the word count, but with a twist in that I'm looking to only get the Top N results.
Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:
"This is a test of test data and a good one to test this"
The result set from the standard MapReduce word count job would be:
test:3, a:2, this:2, is: 1, etc..
But what if I ONLY want to get the Top 3 words that were used in my entire set of data?
I can still run the exact same standard MapReduce word-count job, and then just take the Top 3 results once it is ready and is spitting out the count for EVERY word, but that seems a little inefficient, because a lot of data needs to be moved around during the shuffle phase.
What I'm thinking is that, if this sample is large enough and the data is randomly and well distributed in HDFS, each Mapper does not need to send ALL of its word counts to the Reducers, but only some of the top data. So if one mapper has this:
a:8234, the: 5422, man: 4352, ...... many more words ... , rareword: 1, weirdword: 1, etc.
Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.
Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
This is a very good question, because you have hit the inefficiency of Hadoop's word count example.
The tricks to optimize your problem are the following:
Do a HashMap-based grouping in your local map stage (you can also use a combiner for that). It can look like this; I'm using Guava's HashMultiset, which provides a nice counting mechanism.
public static class WordFrequencyMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final HashMultiset<String> wordCountSet = HashMultiset.create();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            wordCountSet.add(token);
        }
    }
And you emit the result in your cleanup stage:
    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        Text key = new Text();
        LongWritable value = new LongWritable();
        for (Entry<String> entry : wordCountSet.entrySet()) {
            key.set(entry.getElement());
            value.set(entry.getCount());
            context.write(key, value);
        }
    }
}
So you have grouped the words in a local block of work, thus reducing network usage by using a bit of RAM. You can also do the same with a Combiner, but a combiner groups by sorting, so it would be slower (especially for strings!) than using a HashMultiset.
To just get the Top N, you will only have to write the Top N in that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side.
This saves you a lot of network bandwidth as well, the only drawback is that you need to sort the word-count tuples in your cleanup method.
A part of the code might look like this:
Set<String> elementSet = wordCountSet.elementSet();
String[] array = elementSet.toArray(new String[elementSet.size()]);
Arrays.sort(array, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        // sort descending
        return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
    }
});
Text key = new Text();
LongWritable value = new LongWritable();
// just emit the first n records (or fewer, if there are fewer distinct words)
for (int i = 0; i < Math.min(N, array.length); i++) {
    key.set(array[i]);
    value.set(wordCountSet.count(array[i]));
    context.write(key, value);
}
Hope you get the gist of doing as much of the work locally as possible and then just aggregating the top N of the top N's ;)
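If sorting every distinct word in cleanup turns out to be the expensive part, one alternative (a different technique from the full sort above, not part of the original answer) is a bounded min-heap that never holds more than N entries. A plain-Java sketch over an ordinary Map of counts; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopNSketch {

    // Returns the n entries with the highest counts, in descending order.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        // Min-heap on count: the root is always the smallest of the kept entries.
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // descending
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("a", 8234L);
        counts.put("the", 5422L);
        counts.put("man", 4352L);
        counts.put("rareword", 1L);
        for (Map.Entry<String, Long> e : topN(counts, 2)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

This only changes how the local top N is selected; which entries are emitted is the same as with the full sort, apart from the order of ties.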
Quoting Thomas
To just get the Top N, you will only have to write the Top N in that
local HashMultiset to the output collector and aggregate the results
in your normal way on the reduce side. This saves you a lot of network
bandwidth as well, the only drawback is that you need to sort the
word-count tuples in your cleanup method.
If you write only the top N from the local HashMultiset, there is a possibility that you will miss the count of an element that, had it been passed on from this local HashMultiset, could have become one of the overall top N elements.
For example, consider the following three maps, in the format MapName : elementName,elementCount:
Map A : Ele1,4 : Ele2,5 : Ele3,5 : Ele4,2
Map B : Ele1,1 : Ele5,7 : Ele6, 3 : Ele7,6
Map C : Ele5,4 : Ele8,3 : Ele1,1 : Ele9,3
Now if we take only the top 3 of each mapper, we will miss the element "Ele1": its total count should be 6, but because we only keep each mapper's top 3, we see its total count as 4.
I hope that makes sense. Please let me know what you think about it.
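The three example maps above can be run through a small plain-Java simulation to confirm the numbers. This is only a sketch of the effect, with hypothetical class and method names; merge stands in for the reducer's summing:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocalTopNPitfall {

    // Builds one mapper's local counts from parallel name/count arrays.
    static Map<String, Integer> counts(String[] names, int[] values) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            m.put(names[i], values[i]);
        }
        return m;
    }

    // The three example maps A, B and C.
    static List<Map<String, Integer>> exampleMappers() {
        return Arrays.asList(
            counts(new String[] {"Ele1", "Ele2", "Ele3", "Ele4"}, new int[] {4, 5, 5, 2}),
            counts(new String[] {"Ele1", "Ele5", "Ele6", "Ele7"}, new int[] {1, 7, 3, 6}),
            counts(new String[] {"Ele5", "Ele8", "Ele1", "Ele9"}, new int[] {4, 3, 1, 3}));
    }

    // Keeps only the n highest-count entries of one mapper's local counts.
    static Map<String, Integer> localTopN(Map<String, Integer> local, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(local.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    // Sums the counts per element across all mapper outputs.
    static Map<String, Integer> merge(List<Map<String, Integer>> outputs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> output : outputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> mappers = exampleMappers();
        List<Map<String, Integer>> truncated = new ArrayList<>();
        for (Map<String, Integer> m : mappers) {
            truncated.add(localTopN(m, 3));
        }
        System.out.println(merge(mappers).get("Ele1"));   // 6
        System.out.println(merge(truncated).get("Ele1")); // 4
    }
}
```

With the exact totals, Ele1 (6) ties with Ele7 for second place behind Ele5 (11); with the truncated totals it falls behind Ele2 and Ele3 (5 each) and drops out of the top 3.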

hadoop streaming getting optimal number of slots

I have a streaming map-reduce job. I have some 30 slots for processing. Initially I get a single input file containing 60 records (fields are tab separated), first field of every record is a number, for first record number(first field) is 1, for second record number(first field) is 2 and so on. I want to create 30 files from these records for next step of processing, each containing 2 records each (even distribution).
For this to work I specified number of reducers to hadoop job as 30. I expected that first field will be used as key and I will get 30 output files each containing 2 records.
I do get 30 output files, but they do not all contain the same number of records. Some files are even empty (zero size). Any idea why?
By default, Hadoop shuffles and groups the map task outputs to form the reducer inputs, so map outputs that have the same key go to the same reducer. As a result, some reducers may receive no input at all, so, say, the part-00005 file will have a size of 0 KB.
What's your output key type? If you're using Text rather than IntWritable (which I assume you must be, as you're using streaming), then the reducer number is calculated from the hash of the UTF-8 byte representation of the key's string value. You can write a simple unit test to observe this in action:
public class TextHashTest {
    @Test
    public void testHash() {
        int partitions = 30;
        for (int x = 0; x < 100; x++) {
            int hash = new Text(String.valueOf(x)).hashCode();
            int part = hash % partitions;
            System.err.printf("%d = %d => %d\n", x, hash, part);
        }
    }
}
I won't paste the output, but of the 100 values, partition bins 0-7 never receive any value.
So like Thomas Jungblut says in his comment, you'll need to write a custom partitioner to translate the Text value back into an integer value, and then modulo this number by total number of partitions - but this may still not give you 'even' distribution if the values themselves are not in a 1-up sequence (which you say they are so you should be ok)
public class IntTextPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
        return Integer.valueOf(key.toString()) % numPartitions;
    }
}
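For the case in the question (keys 1 to 60, 30 reducers), the integer-modulo rule can be checked without a cluster. A plain-Java sketch of the same arithmetic (the class and method names are hypothetical, and it assumes every key is a non-negative decimal integer):

```java
import java.util.Arrays;

class EvenPartitionSketch {

    // Same arithmetic as the partitioner above: numeric key modulo partition count.
    static int partition(String key, int numPartitions) {
        return Integer.parseInt(key.trim()) % numPartitions;
    }

    // Counts how many of the keys 1..numKeys land in each partition.
    static int[] histogram(int numKeys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int k = 1; k <= numKeys; k++) {
            counts[partition(String.valueOf(k), numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Every one of the 30 partitions receives exactly 2 of the 60 keys.
        System.out.println(Arrays.toString(histogram(60, 30)));
    }
}
```

Keys 30 and 60 land in partition 0, and keys k and k + 30 share partition k for k from 1 to 29, which gives the even two-records-per-file split the question asks for.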

How can I get an integer index for a key in hadoop?

Intuitively, hadoop is doing something like this to distribute keys to mappers, using python-esque pseudocode.
# data is a dict with many key-value pairs
keys = data.keys()
key_set_size = len(keys) / num_mappers
index = 0
mapper_keys = []
for i in range(num_mappers):
    end_index = index + key_set_size
    send_to_mapper(keys[int(index):int(end_index)], i)
    index = end_index
# And something vaguely similar for the reducer (but not exactly).
It seems like somewhere hadoop knows the index of each key it is passing around, since it distributes them evenly among the mappers (or reducers). My question is: how can I access this index? I'm looking for a range of integers [0, n) mapping to all my n keys; this is what I mean by an "index".
I'm interested in the ability to get the index from within either the mapper or reducer.
After doing more research on this question, I don't believe it is possible to do exactly what I want. Hadoop does not seem to have such an index that is user-visible after all, although it does try to distribute work evenly among the mappers (so such an index is theoretically possible).
Actually, your reducer (each individual one) gets an array of items back that correspond to the reduce key. So do you want the offset of items within the reduce key in your reducer, or do you want the overall offset of the particular item in the global array of all lines being processed? To get an index in your mapper, you can simply prepend a line number to each line of the file before the file gets to the mapper. This will tell you the "global index". However, keep in mind that with 1 000 000 items, item 662 345 could be processed before item 10 000.
If you are using the new MR API then the org.apache.hadoop.mapreduce.lib.partition.HashPartitioner is the default partitioner or else org.apache.hadoop.mapred.lib.HashPartitioner is the default partitioner. You can call the getPartition() on either of the HashPartitioner to get the partition number for the key (which you mentioned as index).
Note that the HashPartitioner class is only used to distribute the keys to the Reducer. When it comes to a mapper, each input split is processed by a map task and the keys are not distributed.
Here is the code from HashPartitioner for the getPartition(). You can write a simple Java program for the same.
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
Edit: Including another way to get the index.
The following code should also work. Include it in the mapper or reducer class (old API).
private int partition; // id of this task within the job

public void configure(JobConf job) {
    partition = job.getInt("mapred.task.partition", 0);
}

Resources