Explanation of a mapper program for search in Hadoop

I am new to Hadoop, so I am having a little difficulty understanding the programs. Could someone help me understand this mapper program?
package SearchTxn;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String Txn = value.toString();
        String[] TxnParts = Txn.split(",");
        Double Amt = Double.parseDouble(TxnParts[3]);
        String Uid = TxnParts[2];
        if (Uid.equals("4000010") && Amt > 100) {
            context.write(null, value);
        }
    }
}

The code basically filters lines in which Uid (TxnParts[2], the third comma-separated column of your CSV) is "4000010" and Amt (presumably the amount, TxnParts[3], the fourth column) is greater than 100.

To add to the answer from Thomas Jungblut: the line below declares the Mapper class's overall input and output types. Here NullWritable is used as the key type (nothing meaningful is returned as a key) and Text as the value type.
public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text>{
The same applies to the parameters of the write method:
context.write(null, value);
It is not always necessary to write a key from the Mapper class for serialization. Depending on your use case, the key, the value, or both can be passed to context.write.
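As a small, hedged aside: although passing a literal null compiles here, the conventional idiom with a NullWritable key type is to emit the singleton NullWritable.get(). A minimal sketch of the same mapper written that way (class name and column indexes taken from the question; the behaviour is otherwise assumed to be unchanged):

package SearchTxn;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] txnParts = value.toString().split(",");
        double amt = Double.parseDouble(txnParts[3]); // amount column, as in the question
        String uid = txnParts[2];                     // user id column, as in the question

        if (uid.equals("4000010") && amt > 100) {
            // NullWritable.get() returns the shared "empty key" singleton,
            // so the output contains only the filtered line itself.
            context.write(NullWritable.get(), value);
        }
    }
}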

Related

java.lang.ArrayIndexOutOfBoundsException during execution

package com.HadoopExpert;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
        String s = value.toString();
        String[] words = s.split(",");
        String gender = words[4];
        int sal = Integer.parseInt(words[2]);
        con.write(new Text(gender), new IntWritable(sal));
    }
}
This is my mapper class code. I want to fetch an array element by index, but I am getting an ArrayIndexOutOfBoundsException.
Thanks in advance.
According to the data mentioned in your comment, the index of gender should be 3. Note that array indexes start from 0 in Java.
You should also always check your data before using it, for example:
if (words.length > 3) {
    String gender = words[3];
    ......
}
And you should think about how to handle the bad records (count them and then skip them, try to fix them, and so on).
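A minimal sketch of the mapper with that guard in place, assuming the gender field is at index 3 and the salary at index 2 as discussed above; the "BadRecords" counter group and counter names are made up for illustration:

package com.HadoopExpert;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context con)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(",");

        // Guard against short or malformed lines instead of indexing blindly.
        if (words.length > 3) {
            String gender = words[3];                  // assumed gender column
            try {
                int sal = Integer.parseInt(words[2]);  // assumed salary column
                con.write(new Text(gender), new IntWritable(sal));
            } catch (NumberFormatException e) {
                // Count the bad record and move on rather than failing the task.
                con.getCounter("BadRecords", "non-numeric-salary").increment(1);
            }
        } else {
            con.getCounter("BadRecords", "too-few-columns").increment(1);
        }
    }
}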

Hashmap in each mapper should be used in a single reducer

In one of my classes I am using a HashMap. I am calling that class inside my mapper, so each mapper has its own HashMap. Can I now combine all the HashMaps in a single reducer? Each HashMap uses a filename as its key and a Set as its value. I want to take all the HashMaps containing the same filename, club all the values (Sets) together, and then write the resulting HashMap to my HDFS file.
Yes, you can do that. If your mapper produces output in the form of a HashMap, you can use Hadoop's MapWritable as the mapper's value type.
For example:
public class MyMapper extends Mapper<LongWritable, Text, Text, MapWritable>
You have to convert your HashMap into MapWritable format:
MapWritable mapWritable = new MapWritable();
for (Map.Entry<String, String> entry : yourHashMap.entrySet()) {
    if (null != entry.getKey() && null != entry.getValue()) {
        mapWritable.put(new Text(entry.getKey()), new Text(entry.getValue()));
    }
}
Then write the MapWritable to your context:
ctx.write(new Text("my_key"), mapWritable);
For the Reducer class, you have to take MapWritable as your input value type:
public class MyReducer extends Reducer<Text, MapWritable, Text, Text>
public void reduce(Text key, Iterable<MapWritable> values, Context ctx) throws IOException, InterruptedException
Then iterate through the map and extract the values the way you want. For example:
for (MapWritable entry : values) {
    for (Entry<Writable, Writable> extractData : entry.entrySet()) {
        // your logic for the data will go here.
    }
}
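Putting the pieces together, here is a hedged sketch of a reducer that merges the per-mapper maps and clubs the Sets for the same filename. The class name is hypothetical, and it assumes each Set was serialized into the MapWritable as a comma-joined Text value, which the question does not specify:

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class MergeSetsReducer extends Reducer<Text, MapWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<MapWritable> values, Context ctx)
            throws IOException, InterruptedException {
        // filename -> merged set of values collected from every mapper's map
        Map<String, Set<String>> merged = new HashMap<String, Set<String>>();

        for (MapWritable map : values) {
            for (Map.Entry<Writable, Writable> e : map.entrySet()) {
                String fileName = e.getKey().toString();
                Set<String> set = merged.get(fileName);
                if (set == null) {
                    set = new HashSet<String>();
                    merged.put(fileName, set);
                }
                // assumption: the Set was serialized as a comma-joined string
                for (String v : e.getValue().toString().split(",")) {
                    set.add(v);
                }
            }
        }

        // emit one line per filename with the clubbed set re-joined by commas
        for (Map.Entry<String, Set<String>> e : merged.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (String v : e.getValue()) {
                if (sb.length() > 0) {
                    sb.append(',');
                }
                sb.append(v);
            }
            ctx.write(new Text(e.getKey()), new Text(sb.toString()));
        }
    }
}

Because all mappers emit the same key ("my_key" in the example above), every MapWritable reaches the same reduce() call, which is what lets the Sets for identical filenames be merged in one place.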

Not able to access the reduce method in the Reducer class

public static class Reduce extends Reducer<Text, Text, Text, Text>
{
    private Text txtReduceOutputKey = new Text("");
    private Text txtReduceOutputValue = new Text("");

    public void reduce(Text key, Iterator<Text> values, Context context) throws IOException, InterruptedException
    {
        //some code;
    }
}
It does not give any error. I can access the class, since the variables txtReduceOutputKey and txtReduceOutputValue are initialised, but the reduce method is ignored during execution, so the //some code in the method above never runs. I am using the packages below.
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
Any idea how I can fix this?
Please make sure that you have set the Reducer class in the driver code, for example:
job.setReducerClass(Base_Reducer.class);
It was because of Iterator: it should be replaced with Iterable.
Thank you.
Such issues usually come down to a couple of reasons:
Not overriding the reduce() method correctly; it is better to annotate it with @Override so the compiler flags a mismatched signature.
As mentioned in an earlier comment, forgetting to set the reducer class in the driver code.
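For reference, a sketch of the corrected class, assuming the rest of it stays as in the question; with Iterable and @Override the method actually overrides Reducer.reduce(), so the framework calls it once per key instead of silently using the default identity implementation:

public static class Reduce extends Reducer<Text, Text, Text, Text>
{
    private Text txtReduceOutputKey = new Text("");
    private Text txtReduceOutputValue = new Text("");

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        // Iterable (not Iterator) is the parameter type the framework expects.
        for (Text value : values) {
            //some code;
        }
    }
}

And in the driver, the reducer still has to be registered, e.g. job.setReducerClass(Reduce.class);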

Run and reduce methods in the Reducer class

Can anybody help me by explaining the execution flow of the run() and reduce() methods in a Reducer class? I am trying to calculate the average of word counts in my MapReduce job. My Reducer class receives a "word" and an "iterable of occurrences" as key-value pairs.
My objective is to calculate the average number of occurrences per word across all the words in the document. Can the run() method in the reducer iterate through all the keys and count the total number of words? I could then use this count to find the average by looping through the iterable values provided with each key.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable average = new IntWritable();
    private static int count = 0;

    protected void run()
    {
        //loop through all the keys and increment count
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum = sum + val.get();
        }
        average.set(sum / count);
        context.write(key, average);
    }
}
As described here, you cannot iterate over the values twice. And I think it is a bad idea to override the run() method; it just iterates through the keys and calls the reduce() method for every pair (source). So you cannot calculate the average of word occurrences with only one map-reduce job.
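That said, if a single reduce task is acceptable, one common workaround (a sketch, not taken from the answer above) is to accumulate the totals across reduce() calls and emit the overall average from cleanup(), which the framework calls once after the last key. The caveat is that it only works when the job is configured with exactly one reduce task:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes job.setNumReduceTasks(1); otherwise each reducer only sees part of the keys.
public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {

    private long totalOccurrences = 0;
    private long distinctWords = 0;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        totalOccurrences += sum;
        distinctWords++;
        // Per-word counts could also be written here if they are needed as output.
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once after the final reduce() call, so the totals are complete here.
        if (distinctWords > 0) {
            context.write(new Text("average"),
                    new DoubleWritable((double) totalOccurrences / distinctWords));
        }
    }
}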

Method vs. class-level variables in Hadoop MapReduce

This is a question regarding the performance of Writable variables and allocation within a map-reduce step. Here is a reducer:
static public class MyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            context.write(key, new Text(val));
        }
    }
}
Or is this better performance-wise:
static public class MyReducer extends Reducer<Text, Text, Text, Text> {
    private Text myText = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            myText.set(val);
            context.write(key, myText);
        }
    }
}
In the Hadoop Definitive Guide all the examples are in the first form, but I'm not sure whether that is for the sake of shorter code samples or because it is more idiomatic.
The book may use the first form because it is more concise. However, it is less efficient: for large input files, that approach creates a large number of short-lived objects, and the excessive object creation slows down your job. Performance-wise, the second approach is preferable.
Some references that discuss this issue:
Tip 7 here,
On Hadoop object re-use, and
This JIRA.
Yes, the second approach is preferable if the reducer has a large amount of data to process. The first approach keeps creating new objects, and cleaning them up is left to the garbage collector.
