Hadoop word count: get the most frequently occurring word

I am very new to Hadoop. I have finished the word-count example and now I want to make a modification.
I want to get the word that occurs most often in a text file. If the normal word count program gives this output:
a 1
b 4
c 2
I want to write a program that gives me only this output:
b 4
Here is my reducer function:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
    int max_sum = 0;
    Text max_occured_key;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        if (sum > max_sum)
        {
            max_sum = sum;
            max_occured_key = key;
        }
        context.write(max_occured_key, new IntWritable(max_sum));
        //context.write(key, new IntWritable(sum));
    }
}
But it is not giving the right output.
Can anyone help, please?

You're writing out the maximum value so far at the end of each reduce call, so you'll get more than a single entry per reducer. You're also running into a reference re-use problem: you're copying the reference of the key into your max_occured_key variable, rather than copying its value.
You should amend as follows:
Initialize the max_occured_key variable at construction time (to an empty Text).
Call max_occured_key.set(key); rather than using an assignment. The key parameter's reference is reused across all invocations of the reduce method, so the actual object stays the same and only its underlying contents change per invocation.
Override the cleanup method and move the context.write call there, so that you get only one K,V output pair per reducer.
For example:
@Override
protected void cleanup(Context context)
        throws IOException, InterruptedException {
    context.write(max_occured_key, new IntWritable(max_sum));
}
The cleanup method is called once all the data has been passed through your map or reduce task, and it is called once per task instance (so if you have 10 reducers, it will be called for each instance).
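Putting those three changes together, the amended reducer might look like this (a sketch that keeps the question's field names):
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private int max_sum = 0;
    // initialized at construction time so set() always has a target
    private Text max_occured_key = new Text();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        if (sum > max_sum)
        {
            max_sum = sum;
            // copy the contents, not the reference - the key object is reused
            max_occured_key.set(key);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException
    {
        // emit exactly one pair per reducer, once all keys have been seen
        context.write(max_occured_key, new IntWritable(max_sum));
    }
}
Note that with more than one reducer you will still get one pair per reducer instance; run with a single reducer if you need one global maximum.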

Related

How can I tell that the limit was exceeded when using limit() on a stream of items with a Java 8 lambda?

How can I know, without using a separate condition that compares map.size() with limitValue, that the limit was exceeded when my stream iterated?
Here,
for limitValue = 3, it should return false.
for limitValue = 4, it should return true.
I cannot use an outside int field, as it must be final to be used inside the lambda.
import java.util.*;
import java.util.stream.*;

public class Test {
    public static void main(String[] args) throws Exception {
        Map<Integer, String> map = new HashMap<>();
        map.put(1, "foo");
        map.put(2, "bar");
        map.put(3, "baz");

        int limitValue = 3;

        String result = map.entrySet()
                .stream()
                .limit(limitValue)
                .map(entry -> entry.getKey() + " - " + entry.getValue())
                .collect(Collectors.joining(", "));

        System.out.println(result);
    }
}
"I can not use an outside int field as it must be final to be used inside lambda."
Yes, this is because, within a lambda expression, you can only reference local variables whose value doesn't change (in Java).
This is a good thing in a way, as mutating variables inside a lambda is not thread-safe when executing in parallel.
So the compiler helps you prevent such scenarios at compile time by allowing only final or effectively final local variables to be used in lambdas.
Note, this restriction only holds for local variables.
Anyhow, my advice is not to mutate variables that are not solely contained within a given function itself, as doing so introduces a side effect, and side effects in behavioral parameters to stream operations are in general discouraged.
Keep things simple and proceed with the approach below:
boolean exceeded = limitValue > map.size();
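Applied to the example above, a minimal sketch (same map and limitValue as in the question):
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class Test {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        map.put(1, "foo");
        map.put(2, "bar");
        map.put(3, "baz");

        int limitValue = 4;

        // the limit was "exceeded" if it asked for more entries than the map holds
        boolean exceeded = limitValue > map.size();
        System.out.println(exceeded); // true for limitValue = 4, false for 3

        String result = map.entrySet()
                .stream()
                .limit(limitValue)
                .map(entry -> entry.getKey() + " - " + entry.getValue())
                .collect(Collectors.joining(", "));
        System.out.println(result);
    }
}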

Selecting max key in the reducer function [duplicate]

This question already has an answer here: Finding biggest value for key (1 answer). Closed 7 years ago.
My understanding of the reducer is that it processes one key and its values from the intermediate output of sort and shuffle. I don't know how to access that intermediate file holding the sorted and shuffled key-value pairs, and since I cannot access it, I cannot write code in the reducer to select the largest key. I have no clue how to program the reducer, which receives one key and its values at a time, to print only the largest key and its corresponding values to the final output file.
Suppose this is the intermediate output from the mapper, after sort and shuffle:
1 a
2 be to
4 this what
I would want the reducer to print only "4 this what" in the final output file. Since the reducer does not have the entire file in memory, it is not possible to write this logic in the reducer directly. I am wondering if there is any API support to pick the last line from the intermediate file, which would eventually hold the max key (keys are sorted by default),
OR
do I have to override the default sort comparator to achieve what I want?
You can set a different comparator for sorting in your job:
job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
This, for example, will sort in decreasing order by a LongWritable key.
A simple solution would be to have one Reducer (so all key-value pairs go to it), and have it keep track of the greatest key.
IntWritable currentMax = new IntWritable(-1);

public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    if (key.compareTo(currentMax) > 0) {
        currentMax.set(key.get());
        // copy 'values' somewhere
    }
}

public void cleanup(Context context) throws IOException, InterruptedException {
    Text outputValue = null; // create the output from the saved max values
    context.write(currentMax, outputValue);
}
An additional optimization would be to either emit only the maximum key in the same way from the Mapper, or use this Reducer implementation as your Combiner class, for instance:
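A sketch of the driver wiring for this single-reducer variant, assuming the Hadoop 2 Job API (MaxKeyReducer is a hypothetical name for the reducer above; conf is your Configuration):
Job job = Job.getInstance(conf, "max key");
job.setNumReduceTasks(1);                  // route every key-value pair to one reducer
job.setReducerClass(MaxKeyReducer.class);
job.setCombinerClass(MaxKeyReducer.class); // optional: pre-filters the max per map task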
A better approach, thanks to Thomas Jungblut:
For your driver:
job.setSortComparatorClass(IntWritable.DecreasingComparator.class);
For your reducer:
boolean biggestKeyDone = false;

public void reduce(IntWritable key, Iterable<Text> values, Context context) {
    if (!biggestKeyDone) {
        // write or whatever with the values of the biggest key
        biggestKeyDone = true;
    }
}
If you only want to write the values of the biggest key in your reducer, I suggest saving the biggest key detected in your mapper into the configuration, like this:
Integer currentMax = null;

public void map(IntWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (currentMax == null) {
        currentMax = key.get();
    } else {
        currentMax = Math.max(currentMax.intValue(), key.get());
    }
    context.write(key, value);
}

protected void cleanup(Context context) {
    if (currentMax != null) {
        context.getConfiguration().set("biggestKey", currentMax.toString());
    }
}
Then, in your reducer:
int biggestKey = -1;

protected void setup(Context context) {
    biggestKey = Integer.parseInt(context.getConfiguration().get("biggestKey"));
}

public void reduce(IntWritable key, Iterable<Text> values, Context context) {
    if (biggestKey == key.get()) {
        // write or whatever with the values of the biggest key
    }
}
This way you avoid wasting memory and time copying values.

How to process all map outputs in one reducer at the same time?

I have written a MapReduce application in which the mappers produce output in the following form:
key1 value1
key2 value2
keyn valuen
What I want to do is sum all of the values across all the keys in my reducer. Basically:
sum = value1 + value2 + ... + valuen
Is that possible? From what I understand, the reducer is called separately for each key and its values. One solution that came to mind was a private sum variable maintaining the sum of the values processed so far. In that case, however, how do I know that all of the pairs have been processed, so that the sum may be written out to the collector?
If you don't need the key, then use a single, constant key. If you have to have several key values, you could set the number of reducers to 1 and use an instance variable in the reducer class to hold the sum of all the values. Initialize the variable in the setup() method and report the overall sum in the cleanup() method.
Another approach would be to write the sum of the values for each key by incrementing a counter by that sum in the reduce method, and let Hadoop bring all the per-key sums together in a single counter value, as sketched below.
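A sketch of that counter approach (the counter group and name, "stats" and "totalSum", are arbitrary choices):
public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Hadoop aggregates counter increments from all tasks into one value
        context.getCounter("stats", "totalSum").increment(sum);
        context.write(key, new IntWritable(sum));
    }
}
After the job finishes, the driver can read the overall sum with job.getCounters().findCounter("stats", "totalSum").getValue().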
I am also new to Hadoop, and while researching the same problem I found out that the Mapper and Reducer classes also have setup() and cleanup() methods along with map() and reduce().
First, set the number of reducers to 1.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    int sum = 0;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            sum += value.get();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(new Text("Sum:"), new IntWritable(sum));
    }
}
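And in the driver, something like this (a sketch of the relevant lines only):
job.setNumReduceTasks(1);          // a single reducer instance sees every key
job.setReducerClass(Reduce.class);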

output HBase Increment in MR reducer

I have a MapReduce job that writes to HBase. I know you can output Put and Delete from the reducer using TableMapReduceUtil.
Is it possible to emit Increment to increment values in an HBase table, instead of emitting Puts and Deletes? If yes, how do I do it, and if not, why not?
I'm using CDH3.
public static class TheReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        ///....DO SOME STUFF HERE
        Increment increment = new Increment(row);
        increment.addColumn(col, qual, 1L);
        context.write(null, increment); // <--- I want to be able to do this
    }
}
Thanks
As far as I know you can't write an Increment through the context, but you can always open a connection to HBase and write Increments anywhere (mapper, mapper cleanup, reducer, etc.).
Do note that increments are not idempotent, so the result might be problematic on partial success of the map/reduce job and/or if you have speculative execution enabled for M/R (i.e. multiple mappers doing the same work).
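For example, a sketch using the old HTable client API, assuming a client version where HTable.increment(Increment) is available; the table name, column family, and qualifier are placeholders:
public static class TheReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        // open a plain HBase connection alongside the normal M/R output
        table = new HTable(context.getConfiguration(), "my_table");
    }

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // hypothetical: row derived from the reduce key, column hard-coded
        byte[] row = Bytes.toBytes(key.toString());
        Increment increment = new Increment(row);
        increment.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);
        table.increment(increment); // written directly, not through context.write
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();
    }
}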

Map Reduce - not able to get the right key

Hi, I am writing MapReduce code to find the maximum temperature. The problem is that I am getting the maximum temperature but without the corresponding key.
public static class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    Text year = new Text();
    int maxTemperature = Integer.MIN_VALUE;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable valTemp : values) {
            maxTemperature = Math.max(maxTemperature, valTemp.get());
        }
        //System.out.println("The maximum temperature is " + maxTemperature);
        context.write(year, new IntWritable(maxTemperature));
    }
}
The mapper output looks something like:
1955 52
1958 7
1985 22
1999 32
and so on.
It is overwriting the keys and printing all the data. I want only the maximum temperature and its year.
I see a couple of things wrong with your code sample:
Reset maxTemperature inside the reduce method (as the first statement); at the moment you have a bug in that it outputs the maximum temperature seen across all preceding keys and values.
Where are you configuring the contents of year? In fact you don't need to; just call context.write(key, new IntWritable(maxTemperature)); as the input key is the year.
You might want to create an IntWritable instance variable and re-use it, rather than creating a new IntWritable when writing out the output value (this is an efficiency improvement rather than a potential cause of your problem).
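Putting those fixes together, a sketch of the corrected reducer:
public static class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable(); // re-used output value

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxTemperature = Integer.MIN_VALUE; // reset for every key
        for (IntWritable valTemp : values) {
            maxTemperature = Math.max(maxTemperature, valTemp.get());
        }
        result.set(maxTemperature);
        context.write(key, result); // the input key already is the year
    }
}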
