Erroneous input value to Reducer in Hadoop

I have defined a custom Writable (called EquivalenceClsAggValue) which has a field of type ArrayList (called aggValues) in Hadoop. Using my test data, the size of aggValues for each entry output by the Mapper is 2. However, when I check the size of aggValues in the Reducer, it gives me a different size! That is, the size accumulates (the first element has size 2, the second has size 4, the third has size 6, and so on). What can be the problem?
This is how I output in the Mapper:
EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
.....
output.collect(new IntWritable(outputValue.aggValues.size()),outputValue);
And in the Reducer:
public void reduce(IntWritable key, Iterator<EquivalenceClsAggValue> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
        EquivalenceClsAggValue e = values.next();
        output.collect(new Text(key.toString()), new IntWritable(e.aggValues.size()));
        .....
and the output is:
2 2
2 4
2 6

In your readFields method, you need to clear out any previous contents of the array list - Hadoop re-uses the same object between calls.
Sorry, I missed this in your previous post:
@Override
public void readFields(DataInput arg0) throws IOException {
    // add this statement to clear out previous contents
    aggValues.clear();
    int size = arg0.readInt();
    for (int i = 0; i < size; i++) {
        SortedMapWritable tmp = new SortedMapWritable();
        tmp.readFields(arg0);
        aggValues.add(tmp);
    }
}
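For reference, the matching write() method should serialize the list in the same order that readFields() reads it back. This is only a sketch, assuming aggValues is an ArrayList<SortedMapWritable> as the loop above implies:
@Override
public void write(DataOutput arg0) throws IOException {
    // write the element count first, then each SortedMapWritable, mirroring readFields()
    arg0.writeInt(aggValues.size());
    for (SortedMapWritable tmp : aggValues) {
        tmp.write(arg0);
    }
}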

Related

reducer stuck at 70%

I am putting together my very first programming task with Hadoop, going with the classic word count problem.
I have put a sample file on HDFS and tried to run wordcount on it. The mapper goes through just fine; however, the reducer is stuck at 70% and never moves forward.
I tried this with files on the local file system too, and got the same behaviour.
What could I be doing wrong?
Here are the map and reduce functions:
public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // TODO Auto-generated method stub
    String line = value.toString();
    String[] lineparts = line.split(",");
    for (int i = 0; i < lineparts.length; ++i) {
        output.collect(new Text(lineparts[i]), new IntWritable(1));
    }
}
public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // TODO Auto-generated method stub
    int count = 0;
    while (values.hasNext()) {
        count = count + 1;
    }
    output.collect(key, new IntWritable(count));
}
You never call next() on your iterator, so you're basically creating an infinite loop.
As a side note, the preferred way to implement this word count example is not to increment the count by 1, but to use the value instead:
IntWritable value = values.next();
count += value.get();
This way, you can reuse your Reducer as a Combiner so that it will compute partial counts for each mapper and emit ("wordX", 7) to the reducer instead of 7 occurrences of ("wordX", 1) from a given mapper. You can read more about combiners here.
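Putting the two points together, a corrected reduce method might look like this (a sketch only, keeping the question's old-API signature):
public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    int count = 0;
    // consume each value from the iterator so the loop can terminate,
    // and sum the counts rather than adding 1 per value
    while (values.hasNext()) {
        count += values.next().get();
    }
    output.collect(key, new IntWritable(count));
}
Because it sums the incoming values instead of counting iterations, the same class can be registered as both the combiner and the reducer (via setCombinerClass), as described above.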

hadoop - total line of input files

I have an input file that contains:
id value
1e 1
2e 1
...
2e 1
3e 1
4e 1
And I would like to find the total number of distinct ids in my input file. So in my main class, I have declared a set so that when I read the input file, I will insert each id into it.
MainDriver.java
public static Set list = new HashSet();
and in my map:
// Apply regex to find the id
...
// Insert id to the list
MainDriver.list.add(regex.group(1)); // add 1e, 2e, 3e ...
and in my reduce, I try to use the list as:
public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException
{
    ...
    output.collect(key, new IntWritable(MainDriver.list.size()));
}
So I expect the value printed out to the file to be 4 in this case, but it actually prints out 0.
I have verified that regex.group(1) extracts a valid id, so I have no clue why the size of my list is 0 in the reduce process.
The mappers and reducers run on separate JVMs (and often separate machines altogether) both from each other and from the driver program, so there is no common instance of your list Set variable that all of those methods can concurrently read and write to.
One way in MapReduce to count the number of keys is:
Emit (id, 1) from your mapper
(optionally) Sum the 1s for each mapper using a combiner to minimize network and reducer I/O
In the reducer:
In setup() initialize a class-scope numeric variable (int or long presumably) to 0
In reduce() increment the counter, and ignore the values
In cleanup() emit the counter value now that all keys have been processed
Run the job with a single reducer, so all the keys go to the same JVM where a single count can be made
This is basically ignoring the advantage of using MapReduce in the first place.
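A minimal sketch of such a reducer, using the newer org.apache.hadoop.mapreduce API (the class name and output key text are placeholders, not taken from the question):
public static class DistinctIdCountReducer
        extends Reducer<Text, IntWritable, Text, LongWritable> {

    private long count; // class-scope counter, one per reducer task

    @Override
    protected void setup(Context context) {
        count = 0;
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        count++; // one increment per distinct key; the values themselves are ignored
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // all keys have been processed by now, so emit the single total
        context.write(new Text("total distinct ids"), new LongWritable(count));
    }
}
Run the job with job.setNumReduceTasks(1) so that every id reaches this one reducer and a single count is produced.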
Correct me if I'm wrong, but it appears you can key the output from your Mapper by the id, and then in your Reducer you receive something like Text key, Iterator values as the parameters.
You can then just sum up the values and emit output.collect(key, <total value>);
Example (apologies for using Context rather than OutputCollector, but the logic is the same):
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text countOne = new Text("1");
    private final Text id = new Text();

    public void map(LongWritable key, Text value,
            Context context) throws IOException, InterruptedException {
        id.set(regex.group(1)); // do whatever you do to extract the id
        context.write(id, countOne);
    }
}
public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final IntWritable totalCount = new IntWritable();

    public void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        int cnt = 0;
        for (Text value : values) {
            cnt++;
        }
        totalCount.set(cnt);
        context.write(key, totalCount);
    }
}

map emitted keys change inside reducer/combiner

I need to do two separate matrix multiplications (S*A and P*A) inside my mappers and emit the results of both. I know I could do that easily with two MapReduce jobs, but in order to save running time I need to do them in one job. So what I do is that after doing both multiplications I put both outputs into the context object, but with different keys so that I can distinguish them inside the reducer:
LongWritable One = new LongWritable();
One.set(1);
context.write(One, partialSA);
LongWritable two = new LongWritable();
two.set(2);
context.write(two, partialPA);
In reduce, I just need to add all partialSA matrices together and add all partialPA matrices together too. The problem is that if I use a combiner, the emitted keys I receive inside the combiner are 0 and 1 instead of 1 and 2! And if I don't use a combiner, inside the reducer I receive 0 and 1 as keys instead of 1 and 2.
Why is this happening? What is the problem?
Here is the exact cleanup function of my mapper:
public void cleanup(Context context) throws IOException, InterruptedException {
    LongWritable one = new LongWritable();
    one.set(1);
    LongWritable two = new LongWritable();
    two.set(2);
    context.write(one, partialSA);
    context.write(two, partialPA);
}
Here is the reduce() code:
public void reduce(LongWritable key, Iterable<MatrixWritable> values, Context context) throws IOException, InterruptedException {
    System.out.println("*** In reduce() **** " + key.get());
    Iterator<MatrixWritable> itr = values.iterator();
    if (key.get() == 1) {
        while (itr.hasNext()) {
            SA.addMatrices(itr.next());
        }
    } else if (key.get() == 2) {
        while (itr.hasNext()) {
            PA.addMatrices(itr.next());
        }
    }
}

why Hadoop combiner output not merged by reducer

I ran a simple wordcount MapReduce example, adding a combiner with a small change in the combiner output. The output of the combiner is not merged by the reducer. The scenario is as follows:
Test:
Map -> Combiner -> Reducer
In the combiner I added two extra lines to output the word "different" with count 1; the reducer is not summing the "different" word counts. Output pasted below.
Text t = new Text("different"); // Added a my own output
context.write(t, new IntWritable(1)); // Added my own output
public class wordcountcombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
        Text t = new Text("different"); // Added my own output
        context.write(t, new IntWritable(1)); // Added my own output
    }
}
Input:
I ran a simple wordcount MapReduce example adding combiner with a small change in combiner output, The output of combiner is not merged by reducer. scenario is as follows
In combiner I added two extra lines to out put a word different and count 1, reducer is not suming the "different" word count. output pasted below.
Output:
"different" 1
different 1
different 1
I 2
different 1
In 1
different 1
MapReduce 1
different 1
The 1
different 1
...
How can this happen?
Full code:
I ran the wordcount program with a combiner and, just for fun, I tweaked it in the combiner, so I faced this issue.
I have three separate classes for the mapper, combiner, and reducer.
Driver:
public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // TODO Auto-generated method stub
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(wordcountmapper.class);
        job.setJobName("Word Count");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(wordcountmapper.class);
        job.setCombinerClass(wordcountcombiner.class);
        job.setReducerClass(wordcountreducer.class);
        job.getConfiguration().set("fs.file.impl", "com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper:
public class wordcountmapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();
    IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        String line = value.toString();
        StringTokenizer token = new StringTokenizer(line);
        while (token.hasMoreTokens())
        {
            word.set(token.nextToken());
            context.write(word, one);
        }
    }
}
Combiner:
public class wordcountcombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
        Text t = new Text("different");
        context.write(t, new IntWritable(1));
    }
}
Reducer:
public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
The output is normal because you have two lines doing the wrong thing:
Why do you have this code?
Text t = new Text("different"); // Added my own output
context.write(t, new IntWritable(1)); // Added my own output
In your combiner's reduce method you're doing the sum and then you're adding "different 1" to the output...
You are writing an extra "different 1" in the reduce function, without doing any kind of aggregation. The reduce function is called once per key; as you can see in the method signature, it takes as arguments a key and the list of values for that key, which means it is called once for each key.
Since you are using a word as the key, and in each call of reduce you are writing "different 1" to the output, you will get one of those for each of the words in the input data.
Hadoop requires that the reduce method in the combiner write only the same key that it receives as input. This is required because Hadoop sorts the keys only before the combiner is called; it does not re-sort them after the combiner has run. In your program, the reduce method writes the key "different" in addition to the key that it received as input. This means that the key "different" then appears in different positions in the order of keys, and these occurrences are not merged before they get passed to the reducer.
For example:
Assume the sorted list of keys output by the mapper is: "alpha", "beta", "gamma"
Your combiner is then called three times (once for "alpha", once for "beta", once for "gamma") and produces keys "alpha", "different", then keys "beta", "different", then keys "gamma", "different".
The "sorted" (but actually not sorted) list of keys after the combiner has executed is then:
"alpha", "different", "beta", "different", "gamma", "different"
This list does not get sorted again, so the different occurrences of "different" do not get merged.
The reducer is then called separately six times, and the key "different" appears 3 times in the output of the reducer.
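In other words, the combiner must only aggregate the key it was given. A version of the combiner without the two extra lines (which makes its logic identical to wordcountreducer) behaves correctly; shown here only for contrast:
public class wordcountcombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        // emit only the key that was received, together with its partial sum
        context.write(key, new IntWritable(sum));
    }
}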

Weird error in Hadoop reducer

The reducer in my map-reduce job is as follows:
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text, Text> {
    public void reduce(IntWritable key, Iterator<Neighbourhood> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
        while (values.hasNext()) {
            Neighbourhood n = values.next();
            cachedValues.add(n);
            // correct output
            // output.collect(new Text(n.source), new Text(n.neighbours));
        }
        for (Neighbourhood node : cachedValues) {
            // wrong output
            output.collect(new Text(key.toString()), new Text(node.source + "\t\t" + node.neighbours));
        }
    }
}
The Neighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key which has 19 values (of type Neighbourhood) assigned to it. When I output source and neighbours inside the while loop, I get 19 different values. However, if I output them after the while loop, as shown in the code, I get 19 identical values. That is, one object gets output 19 times! It is very weird. Any idea what is happening?
Here is the code of the Neighbourhood class:
public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {
    Text source;
    Text neighbours;

    public Neighbourhood() {
        source = new Text();
        neighbours = new Text();
    }

    public Neighbourhood(String s, String n) {
        source = new Text(s);
        neighbours = new Text(n);
    }

    @Override
    public void readFields(DataInput arg0) throws IOException {
        source.readFields(arg0);
        neighbours.readFields(arg0);
    }

    @Override
    public void write(DataOutput arg0) throws IOException {
        source.write(arg0);
        neighbours.write(arg0);
    }

    @Override
    public int compareTo(Neighbourhood o) {
        return 0;
    }
}
You're being caught out by an efficiency mechanism employed by Hadoop: object reuse.
Your calls to values.next() are returning the same object reference each time; all Hadoop is doing behind the scenes is replacing the contents of that same object with the underlying bytes (deserialized using the readFields() method).
To avoid this you'll need to create deep copies of the object returned from values.next(). Hadoop actually has a utility class to do this for you called ReflectionUtils.copy. A simple fix would be as follows:
while (values.hasNext()) {
    Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
    ReflectionUtils.copy(conf, values.next(), n);
You'll need to cache a version of the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your Reducer:
@Override
public void configure(JobConf job) {
    conf = job;
}
Be warned though - accumulating a list in this way is often the cause of memory problems in your job, especially if you have 100,000+ values for a given single key.
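Putting the deep-copy and configure(JobConf) pieces together, the reducer might look something like this (a sketch against the question's old mapred API, with imports omitted as elsewhere in this post):
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text, Text> {
    private JobConf conf;

    @Override
    public void configure(JobConf job) {
        conf = job; // cache the job configuration for ReflectionUtils
    }

    public void reduce(IntWritable key, Iterator<Neighbourhood> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
        while (values.hasNext()) {
            // deep-copy each value, because Hadoop reuses the object behind values.next()
            Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
            ReflectionUtils.copy(conf, values.next(), n);
            cachedValues.add(n);
        }
        for (Neighbourhood node : cachedValues) {
            output.collect(new Text(key.toString()), new Text(node.source + "\t\t" + node.neighbours));
        }
    }
}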
