output HBase Increment in MR reducer - hadoop

I have a mapreduce job that writes to HBase. I know you can output Put and Delete from the reducer using the TableMapReduceUtil.
Is it possible to emit Increment to increment values in an HBase table instead of emitting Puts and Deletes? If yes, how do I do it, and if not, why not?
I'm using CDH3
public static class TheReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // ...do some stuff here...
        Increment increment = new Increment(row);
        increment.addColumn(col, qual, 1L);
        context.write(null, increment); // <--- I want to be able to do this
    }
}
Thanks

As far as I know you can't emit an Increment through the context - but you can always open a connection to HBase yourself and write Increments anywhere (mapper, mapper cleanup, reducer, etc.).
Do note that increments are not idempotent, so the result might be problematic on partial success of the map/reduce job and/or if you have speculative execution enabled for M/R (i.e. multiple mappers doing the same work).
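The non-idempotency point is worth seeing concretely. In this HBase-free sketch (the Map-backed "table" is purely illustrative, not an HBase API), replaying a Put is harmless but replaying an Increment double-counts - exactly what a retried or speculatively executed task would do:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotencyDemo {
    // A toy "table": cell name -> value. Stands in for an HBase table.
    static Map<String, Long> table = new HashMap<String, Long>();

    // A Put overwrites: replaying it leaves the same result (idempotent).
    static void put(String cell, long value) {
        table.put(cell, value);
    }

    // An Increment adds: replaying it changes the result (not idempotent).
    static void increment(String cell, long delta) {
        Long old = table.get(cell);
        table.put(cell, (old == null ? 0L : old) + delta);
    }

    public static void main(String[] args) {
        put("row1:count", 5L);
        put("row1:count", 5L);        // task retried: replayed Put
        long afterPuts = table.get("row1:count");  // still 5

        increment("row2:count", 5L);
        increment("row2:count", 5L);  // task retried: replayed Increment
        long afterIncs = table.get("row2:count");  // now 10, not 5

        System.out.println(afterPuts + " " + afterIncs); // prints "5 10"
    }
}
```

This is why the answer warns about partial job success and speculative execution: any mechanism that runs the same task twice will silently double-apply Increments.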


cluster.getJob is returning null in hadoop

public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    org.apache.hadoop.mapreduce.Cluster cluster = new org.apache.hadoop.mapreduce.Cluster(conf);
    Job currentJob = cluster.getJob(context.getJobID());
    mapperCounter = currentJob.getCounters().findCounter(TEST).getValue();
}
I wrote the above code to get the value of a counter that I am incrementing in my mapper function. The problem is that the currentJob returned by cluster.getJob is turning out to be null. Does anyone know how I can fix this?
My question is different because I am trying to access my counter in the reducer, not after all the map/reduce tasks are done. The code I have pasted here belongs in my reducer class.
It seems that cluster.getJob(context.getJobID()) does not work in Hadoop's Standalone Operation.
Try running your program with YARN in Hadoop's Single Node Cluster (pseudo-distributed) mode, as described in the documentation: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

Difference between combiner and in-mapper combiner in mapreduce?

I'm new to hadoop and mapreduce. Could someone clarify the difference between a combiner and an in-mapper combiner or are they the same thing?
You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various cluster Reducers.
The in-mapper combiner takes this optimization a bit further: the aggregations do not even write to local disk: they occur in-memory in the Mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of
org.apache.hadoop.mapreduce.Mapper
to create an in-memory map along the following lines:
Map<LongWritable, Text> inmemMap = null;

protected void setup(Mapper.Context context) throws IOException, InterruptedException {
    inmemMap = new HashMap<LongWritable, Text>();
}
Then during each map() invocation you add values to that in-memory map (instead of calling context.write() on each value). Finally, the Map/Reduce framework will automatically call:
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException {
    for (LongWritable key : inmemMap.keySet()) {
        // do some aggregation on the inmemMap
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}
Notice that instead of calling context.write() every time, you add entries to the in-memory map. Then in the cleanup() method you call context.write(), but with the condensed/pre-aggregated results from your in-memory map. Therefore your local map output spool files (which will be read by the reducers) will be much smaller.
In both cases - in-memory and external combiner - you gain the benefit of less network traffic to the reducers due to smaller map spool files. That also decreases the reducer processing.
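Stripped of the Hadoop API, the in-mapper combiner is just an in-memory map that absorbs per-record emits and flushes once at the end. This standalone word-count sketch (all names are illustrative, not Hadoop classes) shows the effect - five input records collapse to two emitted records:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMapperCombineDemo {
    // Plays the role of the in-memory map created in setup().
    static Map<String, Integer> counts = new HashMap<String, Integer>();
    // Plays the role of context.write(): each entry is one emitted record.
    static List<String> emitted = new ArrayList<String>();

    // Called once per input token, like map(): aggregate instead of emitting.
    static void map(String word) {
        Integer c = counts.get(word);
        counts.put(word, c == null ? 1 : c + 1);
    }

    // Called once at the end, like cleanup(): emit the condensed results.
    static void cleanup() {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            emitted.add(e.getKey() + "\t" + e.getValue());
        }
    }

    public static void main(String[] args) {
        for (String w : new String[] {"a", "b", "a", "a", "b"}) {
            map(w);  // 5 input records come in...
        }
        cleanup();
        System.out.println(emitted.size()); // ...but only 2 records go out
    }
}
```

An ordinary combiner achieves the same reduction, but only after the per-record output has already been written to the mapper's local spool files; here it never leaves memory.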

How to process all map outputs in one reducer at the same time?

I have written a MapReduce application in which the mappers produce output in the following form:
key1 value1
key2 value2
keyn valuen
What I want to do is to sum all of the values for all the keys in my reducer. Basically:
sum = value1+value2+value3
Is that possible? From what I understand, the reducer is currently called separately for each key/value pair. One solution that came to my mind was to have a private sum variable maintaining the sum of the values processed thus far. In that case, however, how do I know that all of the pairs have been processed so that the sum may be written out to the collector?
If you don't need the key, then use a single, constant key. If you have to have several key values, you could set the number of reducers to 1 and use an instance variable in the reducer class to hold the sum of all the values. Initialize the variable in the setup() method and report the overall sum in the cleanup() method.
Another approach would be to write the sum of the values for a given key by incrementing a counter with the sum in the reduce method. Let hadoop bring all the values together in a single counter value.
I am also new to Hadoop and while doing research on the same problem, I found out the Mapper and Reducer classes also have setup() and cleanup() methods along with map() and reduce().
First, set number of Reducers to 1.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    int sum = 0;

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        for (IntWritable value : values) {
            sum += value.get();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(new Text("Sum:"), new IntWritable(sum));
    }
}

Hadoop variable set in reducer and read in driver

How can I set a variable in a reducer, so that after all tasks finish their execution the driver can read it? Something like:
class Driver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        ...
        JobClient.runJob(conf); // reducer sets some variable
        String varValue = ...;  // variable value is read by driver
    }
}
WORKAROUND
I came up with this "ugly" workaround. The main idea is that you create a group of counters in which you hold only one counter, whose name is the value you wish to return (you ignore the actual counter value). The code looks like this:
// reducer || mapper
reporter.incrCounter("Group name", "counter name -> actual value", 0);
// driver
RunningJob runningJob = JobClient.runJob(conf);
String value = runningJob.getCounters().getGroup("Group name").iterator().next().getName();
The same will work for mappers as well. Though this solves my problem, I think this type of solution is "ugly". Thus I leave the question open.
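The trick in this workaround is that counter names are arbitrary strings, so the name itself smuggles the value out while the numeric count is ignored. A Hadoop-free sketch of the idea (the Map-backed counter group and all names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterChannelDemo {
    // Stands in for a counter group: counter name -> numeric count.
    static Map<String, Long> group = new LinkedHashMap<String, Long>();

    // Reducer side: "increment" a counter whose NAME carries the payload.
    static void incrCounter(String name, long amount) {
        Long old = group.get(name);
        group.put(name, (old == null ? 0L : old) + amount);
    }

    // Driver side: read back the first counter's name, ignoring its value.
    static String readValue() {
        return group.keySet().iterator().next();
    }

    public static void main(String[] args) {
        incrCounter("result=42", 0);     // the value is encoded in the name
        System.out.println(readValue()); // prints "result=42"
    }
}
```

One caveat this makes visible: if two tasks report different names into the same group, the driver sees multiple counters and must decide which one to trust - the same ambiguity the answer below raises for configuration values.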
You can't amend the configuration in a map/reduce task and expect that change to be persisted to the configurations of other tasks and/or the job client that submitted the job (let's say you write different values in different reducers - which one 'wins' and is persisted back?).
You can, however, write files to HDFS yourself, which can then be read back when your job returns - no less ugly, really, but there isn't a way that doesn't involve another technology (ZooKeeper, HBase or any other NoSQL / RDB) holding the value between your task ending and your retrieving it upon job success.

How to call Partitioner in Hadoop v0.21

In my application I want to create as many reducer tasks as possible based on the keys. Right now my implementation writes all the keys and values to a single (reducer) output file. To solve this, I have written a partitioner, but I cannot get the class to be called. The partitioner should be invoked after the map task and before the reduce task, but it is not. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct to partition the files based on the keys, and will the output be transferred to the reducers automatically?
You don't show all of your code, but there is usually a class (called the "Job" or "MR" class) that configures the mapper, reducer, partitioner, etc. and then actually submits the job to hadoop. In this class you will have a job configuration object that has many properties, one of which is the number of reducers. Set this property to whatever number your hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will then return the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible.
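As a side note on the formula itself: hashCode() can be negative in Java, and the `& Integer.MAX_VALUE` mask clears the sign bit so the modulo always lands in [0, nbPartitions). A standalone check of that arithmetic (plain String keys stand in for the Hadoop Text type):

```java
public class PartitionFormulaDemo {
    // Same arithmetic as getPartition(), without the Hadoop types.
    static int partitionFor(String key, int nbPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }

    public static void main(String[] args) {
        // A string with a negative hashCode; a plain hashCode() % n
        // here would return a negative (invalid) partition number.
        String key = "polygenelubricants";
        System.out.println(key.hashCode() < 0);      // prints "true"
        System.out.println(partitionFor(key, 10));   // a valid partition in [0, 10)

        // Equal keys always map to the same partition, which is what
        // keeps all values for one key on one reducer.
        System.out.println(partitionFor("abc", 7) == partitionFor("abc", 7)); // prints "true"
    }
}
```

Without the mask, a negative return value from getPartition would cause the job to fail at shuffle time.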
