Difference between combiner and in-mapper combiner in mapreduce? - hadoop

I'm new to hadoop and mapreduce. Could someone clarify the difference between a combiner and an in-mapper combiner or are they the same thing?

You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various cluster Reducers.
The in-mapper combiner takes this optimization a bit further: the aggregations do not even write to local disk: they occur in-memory in the Mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of
org.apache.hadoop.mapreduce.Mapper
to create an in-memory map along the following lines:
Map<LongWritable, Text> inmemMap = null;

@Override
protected void setup(Mapper.Context context) throws IOException, InterruptedException {
    inmemMap = new HashMap<LongWritable, Text>();
}
Then during each map() invocation you add values to that in-memory map (instead of calling context.write() on each value). Finally, the MapReduce framework will automatically call:
@Override
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException {
    for (LongWritable key : inmemMap.keySet()) {
        // do some aggregation on the values stored in inmemMap
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}
Notice that instead of calling context.write() every time, you add entries to the in-memory map. Then, in the cleanup() method, you call context.write() with the condensed/pre-aggregated results from your in-memory map. Therefore your local map output spill files (the ones that will be read by the reducers) will be much smaller.
In both cases - in-memory and external combiner - you gain the benefit of less network traffic to the reducers thanks to smaller map spill files, which also decreases the work the reducers have to do.
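For completeness, here is a minimal sketch of what the corresponding map() method might look like, using the same types as the snippets above (combineValues() is a hypothetical helper that merges two values; it is not part of the Hadoop API):
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Accumulate into the in-memory map instead of calling context.write().
    // Hadoop reuses the key/value objects it passes in, so copy them before storing.
    Text existing = inmemMap.get(key);
    if (existing == null) {
        inmemMap.put(new LongWritable(key.get()), new Text(value));
    } else {
        combineValues(existing, value); // hypothetical: fold value into existing in place
    }
}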

Related

cluster.getJob is returning null in hadoop

public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    org.apache.hadoop.mapreduce.Cluster cluster = new org.apache.hadoop.mapreduce.Cluster(conf);
    Job currentJob = cluster.getJob(context.getJobID());
    mapperCounter = currentJob.getCounters().findCounter(TEST).getValue();
}
I wrote the code above to get the value of a counter that I am incrementing in my mapper function. The problem is that the currentJob returned by cluster.getJob turns out to be null. Does anyone know how I can fix this?
My question is different because I am trying to access my counter in the reducer, not after all the map and reduce tasks are done. The code I have pasted here belongs in my reducer class.
It seems that cluster.getJob(context.getJobID()) does not work in Hadoop's standalone operation.
Try running your program with YARN in Hadoop's single-node (pseudo-distributed) cluster mode, as described in the documentation: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
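If you do stay in standalone mode, the lookup will still fail; a small defensive sketch (reusing the setup() above) at least turns the silent null into an explicit error instead of a NullPointerException later on:
Job currentJob = cluster.getJob(context.getJobID());
if (currentJob == null) {
    // The job lookup is unavailable in standalone mode, so fail fast with a clear message.
    throw new IOException("Could not look up job " + context.getJobID());
}
mapperCounter = currentJob.getCounters().findCounter(TEST).getValue();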

output HBase Increment in MR reducer

I have a mapreduce job that writes to HBase. I know you can output Put and Delete from the reducer using TableMapReduceUtil.
Is it possible to emit Increment objects to increment values in an HBase table, instead of emitting Puts and Deletes? If yes, how can I do it, and if not, why not?
I'm using CDH3.
public static class TheReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // ... do some stuff here ...
        Increment increment = new Increment(row);
        increment.addColumn(col, qual, 1L);
        context.write(null, increment); // <--- I want to be able to do this
    }
}
Thanks
As far as I know you can't write an Increment through the reducer's context, but you can always open a connection to HBase and write Increments yourself anywhere (mapper, mapper cleanup, reducer, etc.).
Do note that increments are not idempotent, so the result might be problematic on partial success of the map/reduce job and/or if you have speculative execution enabled (i.e. multiple task attempts doing the same work).
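As a rough sketch of that approach (TheIncrementReducer, the table name, and the column names below are placeholders, and the exact client API depends on your HBase version), you could open an HTable in setup() and send the Increments yourself instead of going through context.write():
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class TheIncrementReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    private static final byte[] COL = Bytes.toBytes("cf");     // placeholder column family
    private static final byte[] QUAL = Bytes.toBytes("count"); // placeholder qualifier
    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        table = new HTable(context.getConfiguration(), "my_table"); // placeholder table name
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // ... do some stuff here ...
        Increment increment = new Increment(Bytes.toBytes(key.toString()));
        increment.addColumn(COL, QUAL, 1L);
        table.increment(increment); // written directly to HBase, not through context.write()
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();
    }
}
Because these writes bypass the framework, the idempotency caveat above applies even more strongly: re-run or speculative task attempts will apply their increments again.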

How do I create a new, unique key in a Hadoop Reducer

In a Hadoop Reducer, I would like to create and emit new keys under specific conditions, and I'd like to ensure that these keys are unique.
The pseudo-code for what I want goes like:
@Override
protected void reduce(WritableComparable key, Iterable<Writable> values, Context context)
        throws IOException, InterruptedException {
    // do stuff:
    // ...
    // write original key:
    context.write(key, data);
    // write extra key:
    if (someConditionIsMet) {
        WritableComparable extraKey = createNewKey();
        context.write(extraKey, moreData);
    }
}
So I now have two questions:
Is it possible at all to emit more than one different key in reduce()? I know that keys won't be re-sorted, but that is OK for me.
The extra key has to be unique across all reducers, both for application reasons and because I think it would otherwise violate the contract of the reduce stage.
What is a good way to generate a key that is unique across reducers (and possibly across jobs)?
Maybe get reducer/job IDs and incorporate them into key generation?
Yes, you can output any number of keys from reduce().
You can incorporate the task attempt information into your key to make it unique within the job (across the reducers, and even handling speculative execution if you want). You can acquire this information from the reducer's Context.getTaskAttemptID() method and then pull out the reducer ID number with TaskAttemptID.getTaskID().getId().
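A minimal sketch of that idea, assuming Text keys (createNewKey() stands in for the hypothetical helper from the question; the job ID is included in case you also need uniqueness across jobs):
private long localCounter = 0;

private Text createNewKey(Context context) {
    // The task ID is unique per reducer within a job, the job ID distinguishes jobs,
    // and the counter makes each key generated by this reducer unique.
    int reducerId = context.getTaskAttemptID().getTaskID().getId();
    String jobId = context.getJobID().toString();
    return new Text(jobId + "-" + reducerId + "-" + (localCounter++));
}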

hadoop-streaming : writing output to different files

Here is the scenario
          Reducer1
        /
Mapper -- Reducer2
        \
          ReducerN
In the reducer I want to write the data to different files; let's say the reducer looks like:
def reduce():
    for line in sys.stdin:
        if line == type1:
            create_type_1_file(line)
        if line == type2:
            create_type_2_file(line)
        if line == type3:
            create_type_3_file(line)
        # ... and so on

def create_type_1_file(line):
    pass  # writes to file1

def create_type_2_file(line):
    pass  # writes to file2

def create_type_3_file(line):
    pass  # writes to file3
Consider the paths to write to as:
file1 = /home/user/data/file1
file2 = /home/user/data/file2
file3 = /home/user/data/file3
When I run in pseudo-distributed mode (a single machine with the HDFS daemons running), things are fine since all the processes write to the same set of files.
Question:
- If I run this on a cluster of 1000 machines, will they still write to the same set of files? In this case I am writing to the local filesystem. Is there a better way to perform this operation in hadoop streaming?
Typically the output of reduce is written to a reliable storage system like HDFS, because if one of the nodes goes down then the reduce data associated with that node is lost, and it is not possible to re-run that particular reduce task outside the context of the Hadoop framework. Also, once the job is complete, the output from the 1000 nodes has to be consolidated for the different input types.
Concurrent writing is not supported in HDFS: multiple reducers writing to the same file in HDFS might corrupt it. When multiple reduce tasks are running on a single node, concurrency can also be a problem when writing to a single local file.
One solution is to use a reduce-task-specific file name and later combine all the files for a specific input type.
Output can be written from the reducer to more than one location using the MultipleOutputs class. You can regard file1, file2 and file3 as three folders and write the 1000 reducers' output data to these folders separately.
Usage pattern for job submission:
Job job = new Job();
FileInputFormat.addInputPath(job, inDir);
// outDir is the root path; in this case, outDir = "/home/user/data/"
FileOutputFormat.setOutputPath(job, outDir);
// You have to assign the output format class. Using MultipleOutputs in this way will
// still create zero-sized default output files, e.g. part-00000. To prevent this, use
// LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) instead of
// job.setOutputFormatClass(TextOutputFormat.class) in your Hadoop job configuration.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MOMap.class);
job.setReducerClass(MOReduce.class);
...
job.waitForCompletion(true);
Usage in Reducer:
private MultipleOutputs<Text, Text> out;

public void setup(Context context) {
    out = new MultipleOutputs<Text, Text>(context);
    ...
}

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // '/' characters in baseOutputPath are translated into directory levels in your
    // file system. Also, append your custom-generated path with "part" or similar,
    // otherwise your output files will be named -00000, -00001, etc. No call to
    // context.write() is necessary.
    for (Text line : values) {
        if (line.equals(type1))
            out.write(key, new Text(line), "file1/part");
        else if (line.equals(type2))
            out.write(key, new Text(line), "file2/part");
        else if (line.equals(type3))
            out.write(key, new Text(line), "file3/part");
    }
}

protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close();
}
ref:https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

How to call a Partitioner in Hadoop v0.21

In my application I want to create as many reducer tasks as possible based on the keys. Right now my implementation writes all the keys and values into a single (reducer) output file. To solve this I have written a partitioner, but I cannot get the class to be called. The partitioner should be called after the selection map task and before the selection reduce task, but it is not. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct for partitioning the records based on the keys, and will the output be transferred to the reducers automatically?
You don't show all of your code, but there is usually a driver class (often called the "Job" or "MR" class) that configures the mapper, reducer, partitioner, etc. and then actually submits the job to Hadoop. In this class you will have a job configuration object with many properties, one of which is the number of reducers. Set this property to whatever number your Hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will then return the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible.
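A minimal driver sketch along those lines (MultiWayJoinDriver, MyMapper and MyReducer are placeholder names; only the partitioner class comes from the question):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiWayJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "multi-way join");
        job.setJarByClass(MultiWayJoinDriver.class);
        job.setMapperClass(MyMapper.class);                     // placeholder mapper
        job.setReducerClass(MyReducer.class);                   // placeholder reducer
        job.setPartitionerClass(MultiWayJoinPartitioner.class); // the partitioner above
        job.setNumReduceTasks(8); // this value is what getPartition() receives as nbPartitions
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
With, say, 8 reduce tasks configured, getPartition() spreads the keys over partitions 0 through 7, and the framework routes each key group to the corresponding reducer automatically.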
