Reducer receives identical value multiple times instead of expected input - hadoop

While writing a map-reduce job in my local hadoop environment I ran into the problem that the Reducer did not receive the values I expected. I abstracted the problem down to the following:
I create an arbitrary input file with 10 lines so that the map method is executed 10 times. In the mapper I keep an invocation count and write this count as the value, with 0 as the key if the count is even and 1 as the key if it is odd, i.e. the following (key, value) pairs:
(1,1), (0,2), (1,3), (0,4), (1,5), etc.
I would expect to receive two calls to the Reducer with
0 > [2,4,6,8,10]
1 > [1,3,5,7,9]
but I get two calls with
0 > [2,2,2,2,2]
1 > [1,1,1,1,1]
instead. It seems I receive the first value written in the mapper, repeated as many times as the key occurs (if I reverse the counter, I receive the values 10 and 9 instead of 2 and 1). From my understanding this is not the expected behaviour(?), but I cannot figure out what I am doing wrong.
I use the following Mapper and Reducer:
public class TestMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
int count = 0;
@Override
protected void map(LongWritable keyUnused, Text valueUnused, Context context) throws IOException, InterruptedException {
count += 1;
context.write(new IntWritable(count % 2), new IntWritable(count));
System.err.println((count % 2) + "|" + count);
}
}
public class TestReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>{
@Override
protected void reduce(IntWritable key, Iterable<IntWritable> valueItr, Context context) throws IOException, InterruptedException {
List<IntWritable> values = Lists.newArrayList(valueItr);
System.err.println(key + "|" + values);
}
}
I run the Hadoop job with a local test runner, as described for example in the book "Hadoop: The Definitive Guide" (O'Reilly):
public class TestDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job jobConf = Job.getInstance(getConf());
jobConf.setJarByClass(getClass());
jobConf.setJobName("TestJob");
jobConf.setMapperClass(TestMapper.class);
jobConf.setReducerClass(TestReducer.class);
FileInputFormat.addInputPath(jobConf, new Path(args[0]));
FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
jobConf.setOutputKeyClass(IntWritable.class);
jobConf.setOutputValueClass(IntWritable.class);
return jobConf.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new TestDriver(), args));
}
packaged in a jar and run with 'hadoop jar test.jar infile.txt /tmp/testout'.

Hadoop reuses the same value object while streaming the reducer values.
So in order to capture all of your different values, you need to copy each one:
@Override
protected void reduce(IntWritable key, Iterable<IntWritable> valueItr, Context context) throws IOException, InterruptedException {
List<IntWritable> values = Lists.newArrayList();
for(IntWritable writable : valueItr) {
values.add(new IntWritable(writable.get()));
}
System.err.println(key + "|" + values);
}
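An equivalent copy can also be made with Hadoop's WritableUtils helper; a minimal sketch of the same reducer (the only assumptions beyond the question's code are the standard org.apache.hadoop.io.WritableUtils class and a plain java.util.ArrayList):
@Override
protected void reduce(IntWritable key, Iterable<IntWritable> valueItr, Context context) throws IOException, InterruptedException {
    // requires: import java.util.ArrayList; import org.apache.hadoop.io.WritableUtils;
    List<IntWritable> values = new ArrayList<IntWritable>();
    for (IntWritable writable : valueItr) {
        // clone() serializes and deserializes the Writable, producing an independent copy
        values.add(WritableUtils.clone(writable, context.getConfiguration()));
    }
    System.err.println(key + "|" + values);
}
Either way, the essential point is that a new object is created per value instead of storing the single instance that Hadoop keeps reusing.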

Related

Get Top N items from mapper output - Mapreduce

My Mapper task gives me the following output:
2 c
2 g
3 a
3 b
6 r
I have written reducer code and a key comparator that produce the correctly sorted output, but how do I get the top 3 (top N by count) out of the Mapper output?
public static class WLReducer2 extends
Reducer<IntWritable, Text, Text, IntWritable> {
@Override
protected void reduce(IntWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
for (Text x : values) {
context.write(new Text(x), key);
}
};
}
public static class KeyComparator extends WritableComparator {
protected KeyComparator() {
super(IntWritable.class, true);
}
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
// TODO Auto-generated method stub
// Logger.error("--------------------------> writing Keycompare data = ----------->");
IntWritable ip1 = (IntWritable) w1;
IntWritable ip2 = (IntWritable) w2;
int cmp = -1 * ip1.compareTo(ip2);
return cmp;
}
}
This is the reducer output:
r 6
b 3
a 3
g 2
c 2
The expected output from reducer is top 3 by count which is:
r 6
b 3
a 3
Restrict the output from your reducer. Something like this:
public static class WLReducer2 extends
Reducer<IntWritable, Text, Text, IntWritable> {
int count=0;
@Override
protected void reduce(IntWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
for (Text x : values) {
if (count < 3)
context.write(new Text(x), key);
count++;
}
};
}
Set number of reducers to 1. job.setNumReduceTasks(1).
If your Top-N elements fit in memory, and your process can be aggregated using only one reducer, you can use a TreeMap to store the Top-N elements:
Instantiate a TreeMap instance variable in the setup() method of your reducer.
Inside your reduce() method, aggregate all the values for the key group and then compare the result with the first (lowest) key in the tree, map.firstKey(). If your current value is bigger than the lowest value in the tree, insert the current value with map.put(value, item) and then delete the lowest entry with map.remove(map.firstKey()).
In the reducer's cleanup() method, write all the TreeMap's elements to the output in the required order.
Note: the number you compare your records by must be the key in your TreeMap, and the value of your TreeMap should be the description, tag, letter, etc. associated with that number.
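A minimal sketch of that pattern (illustrative only: it assumes a word-count-style job where the reducer key is the Text item and the values are partial IntWritable counts to be summed, N is fixed at 3, and ties on the count simply overwrite one another in the TreeMap):
// requires: import java.util.Map; import java.util.TreeMap;
public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int N = 3;
    // TreeMap keyed by the aggregated count (lowest first); the value is the item itself
    private TreeMap<Integer, String> topN;

    @Override
    protected void setup(Context context) {
        topN = new TreeMap<Integer, String>();
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                 // aggregate all values for this key group
        }
        topN.put(sum, key.toString());
        if (topN.size() > N) {
            topN.remove(topN.firstKey());   // evict the current lowest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // write the surviving entries, highest count first
        for (Map.Entry<Integer, String> entry : topN.descendingMap().entrySet()) {
            context.write(new Text(entry.getValue()), new IntWritable(entry.getKey()));
        }
    }
}
As noted above, this only produces a global Top-N when the job runs with a single reducer (job.setNumReduceTasks(1)).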

Hadoop not all values get assembled for one key

I have some data that I would like to aggregate by key using Mapper code and then perform something on all values that belong to a key using Reducer code. For example if I have:
key = 1, val = 1,
key = 1, val = 2,
key = 1, val = 3
I would like to get key=1, val=[1,2,3] in my Reducer.
The thing is, I get something like
key = 1, val=[1,2]
key = 1, val=[3]
Why is that so?
I thought that all the values for one specific key would be assembled in a single reducer call, but now it seems there can be more than one (key, val[]) pair, since there can be multiple reducers. Is that so?
Should I set number of reducers to be 1?
I'm new to Hadoop so this confuses me.
Here's the code
public class SomeJob {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(SomeJob.class);
FileInputFormat.addInputPath(job, new Path("/home/pera/data/input/some.csv"));
FileOutputFormat.setOutputPath(job, new Path("/home/pera/data/output"));
job.setMapperClass(SomeMapper.class);
job.setReducerClass(SomeReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
}
}
public class SomeMapper extends Mapper<LongWritable, Text, Text, Text>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String parts[] = line.split(";");
context.write(new Text(parts[0]), new Text(parts[4]));
}
}
public class SomeReducer extends Reducer<Text, Text, Text, Text>{
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String properties = "";
for(Text value : values)
{
properties += value + " ";
}
context.write(key, new Text(properties));
}
}

Can I get a Partition number of Hadoop?

I am a Hadoop newbie.
I want to get the partition number in the output file.
First, I made a customized partitioner:
public static class MyPartitioner extends Partitioner<Text, LongWritable> {
public int getPartition(Text key, LongWritable value, int numReduceTasks) {
int numOfChars = key.toString().length();
return numOfChars % numReduceTasks;
}
}
It works. But I want to output the partition numbers 'visually' in the Reducer.
How can I get the partition number?
Below is my reducer source.
public static class MyReducer extends Reducer<Text, LongWritable, Text, Text>{
private Text textList = new Text();
public void reduce(Text key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
String list = new String();
for(LongWritable value: values) {
list = new String(list + "\t" + value.toString());
}
textList.set(list);
context.write(key, textList);
}
}
I want to append the partition number to 'list'; it will be '0' or '1'.
list = new String(list + "\t" + value.toString() + "\t" + ??);
It would be great if someone helps me.
Update:
Thanks to the answer, I tried a solution. But it didn't work, and I think I did something wrong.
Below is the modified MyPartitioner.
public static class MyPartitioner extends Partitioner {
public int getPartition(Text key, LongWritable value, int numReduceTasks) {
int numOfChars = key.toString().length();
return numOfChars % numReduceTasks;
private int bring_num = 0;
public void configure(JobConf job) {
bring_num = jobConf.getInt(numOfChars & numReduceTasks);
}
}
}
Add the code below to the Reducer class to capture the partition number in an instance variable, which can later be used in the reduce method.
String partition;
protected void setup(Context context) throws IOException,
InterruptedException {
Configuration conf = context.getConfiguration();
partition = conf.get("mapred.task.partition");
}
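With that in place, the '??' placeholder from the question can simply be replaced by that variable; a minimal sketch of the loop inside reduce() (note that on newer Hadoop versions the same value is exposed as mapreduce.task.partition, with the old name still resolved through the deprecated-key mapping):
for (LongWritable value : values) {
    // append the value and the partition number captured in setup()
    list = list + "\t" + value.toString() + "\t" + partition;
}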

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter 6 (listings 6.1-6.4) of the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
Mapper<LongWritable,Text,VarLongWritable,VarLongWritable> {
private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
@Override
public void map(LongWritable key,
Text value,
Context context)
throws IOException, InterruptedException {
String line = value.toString();
Matcher m = NUMBERS.matcher(line);
m.find();
VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
VarLongWritable itemID = new VarLongWritable();
while (m.find()) {
itemID.set(Long.parseLong(m.group()));
context.write(userID, itemID);
}
}
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
@Override
public void reduce(VarLongWritable userID,
Iterable<VarLongWritable> itemPrefs,
Context context)
throws IOException, InterruptedException {
Vector userVector = new RandomAccessSparseVector(
Integer.MAX_VALUE, 100);
for (VarLongWritable itemPref : itemPrefs) {
userVector.set((int)itemPref.get(), 1.0f);
}
//LongWritable userID_lw = new LongWritable(userID.get());
context.write(userID, new VectorWritable(userVector));
//context.write(userID_lw, new VectorWritable(userVector));
}
}
The reducer outputs a userID and a userVector, which looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
@Override
public void map(VarLongWritable userID,
VectorWritable userVector,
Context context)
throws IOException, InterruptedException {
Iterator<Vector.Element> it = userVector.get().iterateNonZero();
while (it.hasNext()) {
int index1 = it.next().index();
Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
while (it2.hasNext()) {
int index2 = it2.next().index();
context.write(new IntWritable(index1),
new IntWritable(index2));
}
}
}
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
Reducer<IntWritable, IntWritable, IntWritable, VectorWritable> {
@Override
public void reduce(IntWritable itemIndex1,
Iterable<IntWritable> itemIndex2s,
Context context)
throws IOException, InterruptedException {
Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
for (IntWritable intWritable : itemIndex2s) {
int itemIndex2 = intWritable.get();
cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
}
context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
}
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job_preferenceValues = new Job (getConf());
job_preferenceValues.setJarByClass(RecommenderJob.class);
job_preferenceValues.setJobName("job_preferenceValues");
job_preferenceValues.setInputFormatClass(TextInputFormat.class);
job_preferenceValues.setOutputFormatClass(SequenceFileOutputFormat.class);
FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
job_preferenceValues.setMapOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setMapOutputValueClass(VarLongWritable.class);
job_preferenceValues.setOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setOutputValueClass(VectorWritable.class);
job_preferenceValues.setMapperClass(WikipediaToItemPrefsMapper.class);
job_preferenceValues.setReducerClass(WikipediaToUserVectorReducer.class);
job_preferenceValues.waitForCompletion(true);
Job job_cooccurence = new Job (getConf());
job_cooccurence.setJarByClass(RecommenderJob.class);
job_cooccurence.setJobName("job_cooccurence");
job_cooccurence.setInputFormatClass(SequenceFileInputFormat.class);
job_cooccurence.setOutputFormatClass(TextOutputFormat.class);
SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
job_cooccurence.setOutputKeyClass(IntWritable.class);
job_cooccurence.setOutputValueClass(VectorWritable.class);
job_cooccurence.setMapperClass(UserVectorToCooccurenceMapper.class);
job_cooccurence.setReducerClass(UserVectorToCooccurenceReducer.class);
job_cooccurence.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
ToolRunner.run(new Configuration(), new RecommenderJob(), args);
}
}
The error that I get is:
java.io.IOException: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
In the course of Googling for a fix, I found that my issue is similar to this question. The difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob does more or less the same thing. In my understanding, and according to the Yahoo tutorial:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? I would really appreciate some pointers from someone. I spent the day trying to fix this and got nowhere :(
Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable>
But you define the following in your driver code:
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);

"Type mismatch in key from map: expected org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.LongWritable" -Every thing looks right

I am trying to write a simple MapReduce program to find the largest prime number using the new API (0.20.2). This is how my Map and Reduce classes look:
public class PrimeNumberMap extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
public void map (LongWritable key, Text Kvalue,Context context) throws IOException,InterruptedException
{
Integer value = new Integer(Kvalue.toString());
if(isNumberPrime(value))
{
context.write(new IntWritable(value), new IntWritable(new Integer(key.toString())));
}
}
boolean isNumberPrime(Integer number)
{
if (number == 1) return false;
if (number == 2) return true;
for (int counter =2; counter<(number/2);counter++)
{
if(number%counter ==0 )
return false;
}
return true;
}
}
public class PrimeNumberReduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
public void reduce ( IntWritable primeNo, Iterable<IntWritable> Values,Context context) throws IOException ,InterruptedException
{
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : Values)
{
maxValue= Math.max(maxValue, value.get());
}
//output.collect(primeNo, new IntWritable(maxValue));
context.write(primeNo, new IntWritable(maxValue));
}
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException{
if (args.length ==0)
{
System.err.println(" Usage:\n\tPrimenumber <input Directory> <output Directory>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Main.class);
job.setJobName("Prime");
// Creating job configuration object
FileInputFormat.addInputPath(job, new Path (args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
String star ="*********************************************";
System.out.println(star+"\n Prime number computer \n"+star);
System.out.println(" Application started ... keeping fingers crossed :/ ");
System.exit(job.waitForCompletion(true)?0:1);
}
}
I am still getting an error regarding a key type mismatch from the map:
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1034)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:595)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:668)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1109)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-06-13 14:27:21,116 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
Can someone please suggest what is wrong? I have tried every hook and crook.
You haven't configured the Mapper or Reducer classes in your main block, so the default Mapper is being used - known as the identity mapper - which writes each pair it receives straight to the output (hence the LongWritable as the output key):
job.setMapperClass(PrimeNumberMap.class);
job.setReducerClass(PrimeNumberReduce.class);
The mapper should be defined as below,
public class PrimeNumberMap extends Mapper<**IntWritable**, Text, IntWritable, IntWritable> {
instead of
public class PrimeNumberMap extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
As mentioned in the comment before, you should have the mapper and reducer defined:
job.setMapperClass(PrimeNumberMap.class);
job.setReducerClass(PrimeNumberReduce.class);
Please refer to Hadoop: The Definitive Guide, 3rd edition, Chapter 2, page 24.
I am new to Hadoop MapReduce programming.
In the map phase I use IntWritable, and in the reduce phase I sum the IntWritable values and convert the result to a double before writing it out as a DoubleWritable with context.write.
It fails when running.
My approach for converting the int from the map phase to a double in the reduce phase is:
Mapper(LongWritable,Text,Text,DoubleWritable)
Reducer(Text,DoubleWritable,Text,DoubleWritable)
job.setOutputValueClass(DoubleWritable.Class)
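It is hard to say without the full code, but note that the prose above (IntWritable values out of the map, DoubleWritable values out of the reduce) does not match the listed Mapper/Reducer signatures, and that the intermediate map-output types must be declared separately from the final output types. A minimal sketch, assuming a Mapper<LongWritable, Text, Text, IntWritable> and a Reducer<Text, IntWritable, Text, DoubleWritable> (types illustrative, not taken from the original code):
// map output types: what the mapper emits and the reducer receives
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// final job output types: what the reducer emits
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
Without setMapOutputValueClass, Hadoop assumes the map output value class is the same as the final output value class, so a mapper emitting IntWritable against a declared DoubleWritable typically fails with exactly this kind of type-mismatch error.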
