about context object in map-reduce - hadoop

Can anyone explain why the arguments are written in angle brackets in the statement below, and why the output key/value types are declared there?
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
What is the Context object, and why is it used in the statement below?
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException

To add to what @Vasu answered:
Context stores references to RecordReader and RecordWriter.
Whenever context.getCurrentKey() and context.getCurrentValue() are used to retrieve the key/value pair, the request is delegated to the RecordReader. When context.write() is called, it is delegated to the RecordWriter.
Here RecordReader and RecordWriter are actually abstract classes.

<> is used to indicate generics in Java.
The four type parameters of Mapper<LongWritable, Text, Text, IntWritable> are, in order, the input key, input value, output key and output value types. Here the mapper reads <LongWritable, Text> pairs and emits <Text, IntWritable> pairs. If you try to use any other Writable types with this mapper, you will get a type error.
The Context object is used to write output key/value pairs and to access the configuration, counters, cache files, etc. from within the Mapper.
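To make the four type parameters concrete, here is a minimal sketch in plain Java. MiniMapper, MiniContext and WordMapper are hypothetical stand-ins written for this illustration, not Hadoop classes, and plain Long/String/Integer are used in place of the Writable types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Hadoop's Mapper and Context; not the real API.
public class MiniMapperDemo {

    /** Collects output pairs, playing the role of context.write(). */
    static class MiniContext<KOUT, VOUT> {
        final List<String> written = new ArrayList<>();

        void write(KOUT key, VOUT value) {
            written.add(key + "=" + value);
        }
    }

    /** Four type parameters: input key, input value, output key, output value. */
    abstract static class MiniMapper<KIN, VIN, KOUT, VOUT> {
        abstract void map(KIN key, VIN value, MiniContext<KOUT, VOUT> context);
    }

    /** Input is (offset, line); output is (word, 1). */
    static class WordMapper extends MiniMapper<Long, String, String, Integer> {
        @Override
        void map(Long offset, String line, MiniContext<String, Integer> context) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Only (String, Integer) pairs compile here.
                    context.write(word, 1);
                }
            }
        }
    }

    /** Runs the mapper on a single line and returns what it wrote. */
    static List<String> run(String line) {
        MiniContext<String, Integer> context = new MiniContext<>();
        new WordMapper().map(0L, line, context);
        return context.written;
    }

    public static void main(String[] args) {
        System.out.println(run("to be or not to be"));
    }
}
```

Changing context.write(word, 1) to write, say, an Integer key would be rejected by the compiler, which is the kind of check the angle-bracket arguments buy you.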

Related

Could the reducer class not be launched by any chance? Can't see System.out.println statements in the reducer logs

I have a driver class, a mapper class and a reducer class. The MapReduce job runs fine, but it does not produce the desired output. I have put System.out.println statements in the reducer and looked at the logs of the mapper and the reducer. The println statements from the mapper appear in the logs, but those from the reducer do not. Is it possible that the reducer is not launched at all?
This is the log file from the reducer.
I assume this question is based on the code in your earlier question: mapreduce composite Key sample - doesn't show the desired output
public class CompositeKeyReducer extends Reducer<Country, IntWritable, Country, IntWritable> {
    public void reduce(Country key, Iterator<IntWritable> values, Context context) throws IOException, InterruptedException {
    }
}
The reduce isn't running because the reduce method signature is wrong. You have:
public void reduce(Country key, Iterator<IntWritable> values, Context context)
It should be:
public void reduce(Country key, Iterable<IntWritable> values, Context context)
To make sure this doesn't happen again, add the @Override annotation to the reduce method. The compiler will then report an error if the signature does not match the one declared in Reducer.
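The effect of the wrong parameter type can be reproduced without Hadoop. The sketch below uses a hypothetical BaseReducer (not Hadoop's Reducer) whose run method always calls the Iterable version, which is assumed here to mirror how the framework invokes reduce. A method that takes Iterator is only an overload, so the default implementation runs instead.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the framework; BaseReducer is not Hadoop's Reducer.
public class OverrideDemo {

    static class BaseReducer {
        /** Default behaviour: a pass-through, similar in spirit to an identity reduce. */
        String reduce(String key, Iterable<Integer> values) {
            return key + ":identity";
        }

        /** The "framework" only ever calls the Iterable version. */
        final String run(String key, List<Integer> values) {
            return reduce(key, values);
        }
    }

    static class WrongReducer extends BaseReducer {
        // Iterator, not Iterable: this is an overload, so run() never calls it.
        // Adding @Override here would be a compile error, which exposes the bug.
        String reduce(String key, Iterator<Integer> values) {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            return key + ":" + sum;
        }
    }

    static class RightReducer extends BaseReducer {
        @Override
        String reduce(String key, Iterable<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return key + ":" + sum;
        }
    }

    static String runWrong() {
        return new WrongReducer().run("US", Arrays.asList(1, 2, 3));
    }

    static String runRight() {
        return new RightReducer().run("US", Arrays.asList(1, 2, 3));
    }

    public static void main(String[] args) {
        System.out.println(runWrong()); // the default reduce ran
        System.out.println(runRight()); // the overriding reduce ran
    }
}
```

In the wrong variant the custom code is silently skipped, which matches the symptom of println statements never showing up in the reducer logs.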
No change in the code. It works now.
All I did was restart my Hadoop Cloudera image, and it works now. I can't believe this happened.

Hashmap in each mapper should be used in a single reducer

One of my classes uses a HashMap, and I call that class inside my mapper, so each mapper has its own HashMap. Can I use all of these HashMaps in a single reducer? The key of my HashMap is a filename and the value is a Set, so each HashMap maps a filename to a Set. I want to take all the HashMaps containing the same filename, combine their values (the Sets), and then write the resulting HashMap to a file in HDFS.
Yes, you can do that. If your mapper produces its output as a HashMap, you can use Hadoop's MapWritable as the mapper's output value type.
For example:
public class MyMapper extends Mapper<LongWritable, Text, Text, MapWritable>
You have to convert your HashMap into a MapWritable:
MapWritable mapWritable = new MapWritable();
for (Map.Entry<String, String> entry : yourHashMap.entrySet()) {
    if (null != entry.getKey() && null != entry.getValue()) {
        mapWritable.put(new Text(entry.getKey()), new Text(entry.getValue()));
    }
}
Then write the MapWritable to your context:
ctx.write(new Text("my_key"), mapWritable);
In the Reducer class, you have to take MapWritable as your input value type:
public class MyReducer extends Reducer<Text, MapWritable, Text, Text>
public void reduce(Text key, Iterable<MapWritable> values, Context ctx) throws IOException, InterruptedException
Then iterate over the maps and extract the values the way you want. For example:
for (MapWritable entry : values) {
    for (Map.Entry<Writable, Writable> extractData : entry.entrySet()) {
        // your logic for the data will go here.
    }
}
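The "club all the Sets for the same filename" step inside that loop could look roughly like the plain-Java sketch below. It works on ordinary java.util collections rather than MapWritable, so the class and method names (MergeSetsDemo, merge) are illustrative only. In a real reducer the Sets would first need a Writable representation, which is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: merges per-mapper maps of filename -> set of values.
public class MergeSetsDemo {

    /** Unions the sets of all maps that share the same filename key. */
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perMapper) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> oneMap : perMapper) {
            for (Map.Entry<String, Set<String>> e : oneMap.entrySet()) {
                // Create the target set on first sight of a filename, then union into it.
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    /** Builds two sample per-mapper maps and merges them. */
    static Map<String, Set<String>> sample() {
        Map<String, Set<String>> first = new HashMap<>();
        first.put("a.txt", new TreeSet<>(Arrays.asList("x", "y")));

        Map<String, Set<String>> second = new HashMap<>();
        second.put("a.txt", new TreeSet<>(Arrays.asList("y", "z")));
        second.put("b.txt", new TreeSet<>(Arrays.asList("q")));

        List<Map<String, Set<String>>> all = new ArrayList<>();
        all.add(first);
        all.add(second);
        return merge(all);
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

TreeMap and TreeSet are used only to keep the printed order deterministic; HashMap and HashSet would merge the same way.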

Method v Class level variables in Hadoop MapReduce

This is a question regarding the performance of writable variables and allocation within a map reduce step. Here is a reducer:
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            // allocates a new Text for every value
            context.write(key, new Text(val));
        }
    }
}
Or is this better performance-wise:
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    // allocated once per reducer instance and reused for every value
    private Text myText = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            myText.set(val);
            context.write(key, myText);
        }
    }
}
In Hadoop: The Definitive Guide all the examples are in the first form, but I'm not sure whether that is to keep the code samples short or because it's more idiomatic.
The book may use the first form because it is more concise. However, it is less efficient. For large input files, that approach will create a large number of objects. This excessive object creation would slow down your performance. Performance-wise, the second approach is preferable.
Some references that discuss this issue:
Tip 7 here,
On Hadoop object re-use, and
This JIRA.
Yes, the second approach is preferable if the reducer has a lot of data to process. The first approach keeps creating new objects, and cleaning them up is left to the garbage collector.
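The difference in allocation count can be illustrated with a small plain-Java sketch. Holder and Sink are hypothetical stand-ins for Text and the output writer. The sketch assumes the sink copies the value out at write time, which is what makes reusing one mutable object safe; it counts allocations and does not measure real Hadoop performance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins: Holder plays the role of Text, Sink the role of the writer.
public class ReuseDemo {

    /** Mutable value holder that counts how many instances were created. */
    static class Holder {
        static int created = 0;
        private String value = "";

        Holder() {
            created++;
        }

        void set(String v) {
            value = v;
        }

        String get() {
            return value;
        }
    }

    /** Copies the value out on write, so later mutation of the Holder is harmless. */
    static class Sink {
        final List<String> out = new ArrayList<>();

        void write(Holder h) {
            out.add(h.get());
        }
    }

    /** First form: one new Holder per value. Returns the number of allocations. */
    static int allocationsFresh(List<String> values, Sink sink) {
        int before = Holder.created;
        for (String v : values) {
            Holder h = new Holder();
            h.set(v);
            sink.write(h);
        }
        return Holder.created - before;
    }

    /** Second form: a single Holder reused for every value. */
    static int allocationsReused(List<String> values, Sink sink) {
        int before = Holder.created;
        Holder h = new Holder();
        for (String v : values) {
            h.set(v);
            sink.write(h);
        }
        return Holder.created - before;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "c");
        System.out.println(allocationsFresh(values, new Sink()));
        System.out.println(allocationsReused(values, new Sink()));
    }
}
```

Both forms write the same output; the first allocates one object per value, the second allocates one object in total.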

Use two Mappers on same file simultaneously in Hadoop

Suppose there is a file and two different, independent mappers that should run on that file in parallel. As far as I know, doing that requires a copy of the file.
What I want to know is whether the same file can be used for both mappers, which would reduce resource utilization and make the system more time-efficient.
Is there any research in this area, or any existing tool in Hadoop, that can help with this?
Assuming that both Mappers have the same K,V signature, you could use a delegating mapper and then call the map method of your two mappers:
// Assumes MyMapper1 and MyMapper2 both extend Mapper<LongWritable, Text, Text, Text>
// and declare setup, map and cleanup as public so they can be called from here.
public class DelegatingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MyMapper1 mapper1;
    private MyMapper2 mapper2;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mapper1 = new MyMapper1();
        mapper1.setup(context);
        mapper2 = new MyMapper2();
        mapper2.setup(context);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // each input record is passed to both mappers
        mapper1.map(key, value, context);
        mapper2.map(key, value, context);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mapper1.cleanup(context);
        mapper2.cleanup(context);
    }
}
On a high level, there are 2 scenarios I could imagine with the question in hand.
Case 1:
If you are trying to write the SAME implementation in both Mapper classes to process the same input file, with the sole aim of efficient resource utilization, this probably isn't the correct approach, because when a file is saved in the cluster it is already divided into blocks and replicated across data nodes.
This already gives you efficient resource utilization, as the data blocks of the same input file are processed in PARALLEL.
Case 2:
If you are trying to write two DIFFERENT Mapper implementations (each with its own business logic) for a particular workflow required by your business requirements, then yes, you can pass the same input file to two different mappers using the MultipleInputs class.
MultipleInputs.addInputPath(job, file1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, file1, TextInputFormat.class, Mapper2.class);
This could only be a workaround based on what you want to implement.
Thanks.

Why cannot Reducer.class used as a real reducer in Hadoop MapReduce?

I noticed that Mapper.class can be used as a real mapper in a phase, together with a user-defined reducer. For example,
Phase 1:
Mapper.class -> WordCountReduce.class
This will work.
However, Reducer.class cannot be used the same way. Namely something like
Phase 2:
WordReadMap.class -> Reducer.class
will not work.
Why is that?
I don't see why it wouldn't, as long as the outputs are of the same class as the inputs. The default in the new API just writes out whatever you pass into it. It's implemented as:
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                      ) throws IOException, InterruptedException {
    for (VALUEIN value : values) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}
For the old API, it's an interface, and you can't directly instantiate an interface. If you're using that, then that's the reason it fails. Then again, the Mapper is an interface as well, and you shouldn't be able to instantiate it...
