MapReduce example - hadoop

I was reading about mapreduce and I was wondering about a particular scenario. Let's say we have a few files (fileA, fileB, fileC for example), each consisting of multiple integers. If we wanted to sort the numbers from all the files to create something like this:
23 fileA
34 fileB
35 fileA
60 fileA
60 fileC
how would the map and reduce process work?
Currently, this is what I have, but it is not quite correct:
(fileName, fileContent) -> (map to) (Number, fileName)
sort the temporary key,value pairs and get
(Number, (list of){fileName1, fileName2...})
Reduce the temporary pairs and get
(Number, fileName1)
(Number, fileName2)
and so on
The problem is that during the sorting phase, the filenames may not be in alphabetical order, so the reduce part will not generate correct output. Could someone provide some insight into the correct approach for this scenario?

The best way to achieve this is through secondary sort. You need to sort both the keys (in your case the numbers) and the values (in your case the file names). In Hadoop, the mapper output is sorted only on keys.
This can be achieved by using a composite key: a key that is a combination of both the number and the file name. E.g. for the first record, the key will be (23, fileA) instead of just (23).
You can read about secondary sort here: https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch01.html
You can also go through the section "Secondary Sort", in "Hadoop The Definitive Guide" book.
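For illustration, a minimal sketch of such a composite key might look like the following (my own sketch, not from either book; a complete secondary-sort job would also need a custom partitioner and a grouping comparator on the number alone so that all file names for the same number meet in one reduce call):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class NumberFileKey implements WritableComparable<NumberFileKey> {

    private int number;
    private String fileName;

    public NumberFileKey() {}

    public NumberFileKey(int number, String fileName) {
        this.number = number;
        this.fileName = fileName;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(number);
        out.writeUTF(fileName);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        number = in.readInt();
        fileName = in.readUTF();
    }

    // Sort by number first, then by file name, which gives the desired order.
    @Override
    public int compareTo(NumberFileKey other) {
        int cmp = Integer.compare(number, other.number);
        return (cmp != 0) ? cmp : fileName.compareTo(other.fileName);
    }

    public int getNumber() { return number; }

    public String getFileName() { return fileName; }
}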
For the sake of simplicity, I have written a program that achieves the same result.
In this program, the map output keys are sorted by the framework as usual; I have added logic to sort the values on the reducer side. So it takes care of sorting both keys and values and produces the desired output.
Following is the program:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.*;

public class SortedValue {

    public static class SortedValueMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is expected to look like "23 fileA".
            String[] tokens = value.toString().split(" ");
            if (tokens.length == 2) {
                // Emit (fileName, number) so that the file name becomes the key.
                context.write(new Text(tokens[1]), new IntWritable(Integer.parseInt(tokens[0])));
            }
        }
    }

    public static class SortedValueReducer
            extends Reducer<Text, IntWritable, IntWritable, Text> {

        Map<String, ArrayList<Integer>> valueMap = new HashMap<String, ArrayList<Integer>>();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            String keyStr = key.toString();
            ArrayList<Integer> storedValues = valueMap.get(keyStr);
            // Collect all numbers for this file name.
            for (IntWritable value : values) {
                if (storedValues == null) {
                    storedValues = new ArrayList<Integer>();
                    valueMap.put(keyStr, storedValues);
                }
                storedValues.add(value.get());
            }
            // Sort the numbers and emit (number, fileName) pairs.
            Collections.sort(storedValues);
            for (Integer val : storedValues) {
                context.write(new IntWritable(val), key);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CompositeKeyExample");

        job.setJarByClass(SortedValue.class);
        job.setMapperClass(SortedValueMapper.class);
        job.setReducerClass(SortedValueReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/in/in1.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper Logic:
Parses each line, assuming that key and value are separated by a blank character (" ").
If the line contains 2 tokens, it emits (filename, integer value). E.g. for the first record, it emits (fileA, 23).
Reducer Logic:
It puts the (key, value) pairs in a HashMap, where the key is the file name and the value is the list of integers for that file. E.g. for fileA, the values stored will be 23, 34 and 35.
Finally, it sorts the values for a particular key and, for each value, emits (value, key) from the reducer. E.g. for fileA, the records output are (23, fileA), (34, fileA) and (35, fileA).
I ran this program for the following input:
34 fileB
35 fileA
60 fileC
60 fileA
23 fileA
I got the following output:
23 fileA
35 fileA
60 fileA
34 fileB
60 fileC

Related

Hadoop - Classic MapReduce Wordcount

In my Reducer code, I am using this code snippet to sum the values:
for (IntWritable val : values) {
    sum += val.get();
}
Since the above gives me the expected output, I tried changing the code to:
for (IntWritable val : values) {
    sum += 1;
}
Can anyone please explain what difference it makes when I use sum += 1 in the reducer rather than sum += val.get()? Why does it give me the same output? Does it have anything to do with the Combiner? Because when I used this same reducer code as the Combiner class, the output was incorrect, with all words showing a count of 1.
Mapper Code :
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer token = new StringTokenizer(line);
    while (token.hasMoreTokens()) {
        word.set(token.nextToken());             // word is a Text field of the mapper (declared outside this snippet)
        context.write(word, new IntWritable(1));
    }
}
Reducer Code :
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        //sum += val.get();
        sum += 1;
    }
    context.write(key, new IntWritable(sum));
}
Driver Code:
job.setJarByClass(WordCountWithCombiner.class);
//job.setJobName("WordCount");
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Input - "to be or not to be"
Expected Output - (be,2) , (to,2) , (or,1) , (not,1)
But the output I am getting is - (be,1), (to,1), (or,1), (not,1)
Can anyone please explain what difference it makes when I use sum += 1 in the reducer rather than sum += val.get()?
Both statements perform an addition. In the first case you are simply counting how many times the for-loop has run. In the latter, you are actually summing the int values returned by each val object for a given key.
Why does it give me the same output? Does it have anything to do with Combiner
The answer is Yes. It is because of the Combiner.
Now let's look at the input you are passing; it is small enough that only one Mapper is instantiated. The output of that Mapper is:
(to,1), (be,1), (or,1), (not,1), (to,1), (be,1)
This then goes to the Combiner, which runs essentially the same logic as the Reducer. Its output will be:
(be,2) , (to,2) , (or,1) , (not,1)
Now the above Combiner output goes to the Reducer, which performs the sum operation however you define it. So if your logic is sum += 1, then the output will be:
(be,1) , (to,1) , (or,1) , (not,1)
But if your logic is sum += val.get(), then your output will be:
(be,2) , (to,2) , (or,1) , (not,1)
I hope this makes it clear: the logic of the Combiner and the Reducer is the same, but the input coming to each of them is different.
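For completeness, a reducer that is also going to be used as a combiner has to be written so that it gives the correct result no matter how many times it runs over partially aggregated data. A minimal sketch of such a combiner-safe reducer (assuming the same Text/IntWritable types and imports as your existing job):

public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            // Summing val.get() works whether the incoming values are the raw 1s
            // from the mapper or partial counts already aggregated by a combiner.
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}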
It all depends on the value returned by val.get();
If val.get() always returns 1, then sum += val.get(); is the same as sum += 1;, which is what is happening in your reducer.
BUT
The combiner is used to do a pre-aggregation (similar to the reducer's aggregation) on the mapper side, before the key-value pairs are sent to the reducer(s).
The Hadoop framework doesn't guarantee how many times the combiner is executed per Mapper; that depends on the volume of the Mapper's output. So even if the combiner is executed only once, the aggregation on the mapper side will be fine, but the reducer, instead of receiving only 1's, may receive other numbers (val.get() >= 1). If you then use sum += 1; in your reducer, you are dropping the counts already aggregated on the mapper side, generating wrong output.
If the combiner is executed more than once on the Mapper side, you can imagine that the problem gets even worse.
In summary, sum += 1; works if and only if that statement is executed exactly once for each original key-value pair. With a combiner, that is not guaranteed.
val.get() returns an int, so in your case both versions of the code are effectively the same. Whether to use val.get() depends on the problem you are trying to solve. Here the mapper emits each word as the key with a value of 1, so in the reducer you can be sure that val.get() will always return 1; hence the hard-coded value 1 gives the same result.
Also, using the same reducer as the combiner function should not by itself cause any problem. One scenario where the output would show every word with a count of 1 is when the number of reducers is set to 0 and the mapper output is written directly to the output path.
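As a side note (my own illustration, not from the original post), that map-only scenario corresponds to configuring the job with zero reduce tasks:

// With zero reduce tasks the job becomes map-only: no shuffle, no combiner,
// no reducer; the mapper's raw (word, 1) pairs are written straight to the output path.
job.setNumReduceTasks(0);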

Hadoop reducer cleanup function

In my Hadoop reduce code, I have a cleanup function which prints the total count, but it prints twice. I think this is because it's printing both the count of key+values and the count alone, but I'm not sure.
My code has this:
protected void cleanup(Context context) throws IOException, InterruptedException {
    Text t1 = new Text("Total Count");
    context.write(t1, new IntWritable(count));
}
inside the reducer class and the output is:
Total Count 9477
Total Count 4738
The cleanup method is called once at the end of each task. So I assume you are running 2 reducers in this job; therefore 2 output lines.
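If you want a single grand total, one option (my suggestion, not part of the original answer) is to force a single reduce task so that cleanup() runs exactly once:

// One reduce task means one cleanup() call and therefore one "Total Count" line.
job.setNumReduceTasks(1);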

How to get one single key-value pair as output from reducer

I am new to Hadoop MapReduce. I have a requirement where, let's say, I want to find the student name with the highest total mark. Consider the sample dataset:
Harry Maths 80
Harry Physics 67
Daisy Science 89
Daisy Physics 90
Greg Maths 70
Greg Chemistry 79
I know that the reducer iterates over each unique key, hence I am going to get 3 output key-value pairs with name and total marks. But I need only the name of the student with the highest total mark, i.e. reducer output -> Daisy 179
Following is the reduce function I have written :
static int maxMark = 0;
static Text name = new Text();

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int totalMarks = 0;
        while (values.hasNext()) {
            totalMarks += values.next().get();
        }
        if (totalMarks > maxMark) {
            maxMark = totalMarks;
            name = key;
            output.collect(name, new IntWritable(maxMark));
        }
    }
}
But this logic is going to output the previously saved student's name and mark as well!
I could solve this problem if I knew the number of input keys to the reducer before the reducer is even called, so that when the reducer iterates over the last key (name), I could call output.collect(name, new IntWritable(maxMark)); just once.
So, is there a way to find the number of input keys to the reducer? Or else, what are the other alternatives to get one single output from reducer?
You need two MapReduce jobs. The first will total up the marks by name, irrespective of subject. Then you can run a second job whose mapper swaps the keys and values around, so the key is the total mark from the previous step, making sure to use a descending key comparator. Configure this job to use only a single reduce task, and the reducer can flag itself to ignore all but the first call to reduce.
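For illustration, the reducer of that second job could look something like this minimal sketch (my own code, not the answerer's, written against the newer org.apache.hadoop.mapreduce API; it assumes the second job's map output is (total marks, name), the keys are sorted in descending order, and there is a single reduce task):

public static class TopStudentReducer
        extends Reducer<IntWritable, Text, Text, IntWritable> {

    private boolean emitted = false;

    @Override
    protected void reduce(IntWritable totalMarks, Iterable<Text> names, Context context)
            throws IOException, InterruptedException {
        // Keys arrive in descending order of total marks, so the first reduce
        // call holds the top student; ignore everything after that.
        if (!emitted) {
            context.write(new Text(names.iterator().next()), new IntWritable(totalMarks.get()));
            emitted = true;
        }
    }
}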

hadoop streaming getting optimal number of slots

I have a streaming map-reduce job. I have some 30 slots for processing. Initially I get a single input file containing 60 records (fields are tab separated); the first field of every record is a number: for the first record the number (first field) is 1, for the second record it is 2, and so on. I want to create 30 files from these records for the next step of processing, each containing 2 records (an even distribution).
For this to work I specified the number of reducers for the Hadoop job as 30. I expected that the first field would be used as the key and that I would get 30 output files, each containing 2 records.
I do get 30 output files, but not all of them contain the same number of records. Some files are even empty (zero size). Any idea?
By default Hadoop shuffles the Map task outputs and groups them as Reducer input, so map output records that have the same key are sent to the same reducer. Because of this, some reducers may receive no input at all, in which case, say, the part-00005 file will be 0 KB in size.
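For reference, the default routing of a key to a reducer boils down to the logic of Hadoop's HashPartitioner, sketched below (paraphrased from the Hadoop source):

import org.apache.hadoop.mapreduce.Partitioner;

// In essence what the default HashPartitioner does:
public class DefaultHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative, then take the modulus.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}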
What's your output key type? If you're using Text rather than IntWritable (which I assume you must be, as you're using streaming), then the reduce number is calculated based upon the hash of the UTF-8 byte representation of the key value. You can write a simple unit test to observe this in action:
public class TextHashTest {

    @Test
    public void testHash() {
        int partitions = 30;
        for (int x = 0; x < 100; x++) {
            int hash = new Text(String.valueOf(x)).hashCode();
            // Hadoop's HashPartitioner additionally masks the hash with
            // Integer.MAX_VALUE to keep the partition number non-negative.
            int part = hash % partitions;
            System.err.printf("%d = %d => %d\n", x, hash, part);
        }
    }
}
I won't paste the output, but of the 100 values, partition bins 0-7 never receive any value.
So, as Thomas Jungblut says in his comment, you'll need to write a custom partitioner that translates the Text value back into an integer and then takes that number modulo the total number of partitions. This may still not give you an 'even' distribution if the values themselves are not in a 1-up sequence (but you say they are, so you should be ok):
public class IntTextPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
        return Integer.valueOf(key.toString()) % numPartitions;
    }
}
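As a usage sketch (my own addition; MyJob is a placeholder driver class), the partitioner would be registered on the old-API JobConf like this, while a streaming job would instead ship the compiled class in a jar and name it on the command line (e.g. via the -partitioner option):

// Register the custom partitioner and the desired number of reduce tasks.
JobConf conf = new JobConf(MyJob.class);
conf.setNumReduceTasks(30);
conf.setPartitionerClass(IntTextPartitioner.class);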

Permutations with MapReduce

Is there a way to generate permutations with MapReduce?
input file:
1 title1
2 title2
3 title3
my goal:
1,2 title1,title2
1,3 title1,title3
2,3 title2,title3
Since a file has n inputs, the pairings will have roughly n^2 outputs (your example goal keeps only the n*(n-1)/2 unordered pairs). It makes sense that you could have n tasks each perform n of those operations. I believe you could do this (assuming only one file):
Put your input file into the DistributedCache so it is accessible read-only to your Mappers/Reducers. Make an input split on each line of the file (like in WordCount); the mapper will thus receive one line (e.g. title1 in your example). Then read the lines of the file back out of the DistributedCache and emit your key/value pairs: the key is your input line and the values are each line read from the cached copy of the file.
In this model, you should only need a Map step.
Something like:
public static class PermuteMapper
        extends Mapper<Object, Text, Text, Text> {

    private static final String IN_FILENAME = "file.txt";

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String inputLine = value.toString();
        // set the property mapred.cache.files in your
        // configuration for the file to be available
        Configuration conf = context.getConfiguration();
        Path[] cachedPaths = DistributedCache.getLocalCacheFiles(conf);
        if (cachedPaths[0].getName().equals(IN_FILENAME)) {
            // helper defined elsewhere
            String[] cachedLines = getLinesFromPath(cachedPaths[0]);
            for (String line : cachedLines) {
                context.write(new Text(inputLine), new Text(line));
            }
        }
    }
}
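The getLinesFromPath helper is left undefined above; a possible implementation (my own sketch, purely illustrative, reading the locally cached copy of the file on the task node) could look like:

// Reads all lines from a local filesystem path; the DistributedCache hands
// back local paths, so a plain FileReader is enough here.
private static String[] getLinesFromPath(Path path) throws IOException {
    List<String> lines = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(new FileReader(path.toString()));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
    } finally {
        reader.close();
    }
    return lines.toArray(new String[lines.size()]);
}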
