My understanding of the reducer is that it processes one key/value pair at a time from the intermediate output of the sort and shuffle phase. I don't know how to access that intermediate file which holds the sorted and shuffled key/value pairs. Since I cannot access the intermediate file, I cannot write code in the reducer to select the largest key. I have no clue how to program a reducer that receives one K,V pair at a time so that it prints only the largest key and its corresponding values to the final output file.
Suppose this is the intermediate file from the mapper, after sorting and shuffling:
1 a
2 be to
4 this what
I would want the reducer to print only "4 this what" in the final output file. Since the reducer does not have the entire file in its memory, it's not possible to write this logic in the reducer. I am wondering if there is any API support to pick the last line from the intermediate file, which would eventually hold the max key (keys are sorted by default),
OR
Do I have to override the default sort comparator to do what I want to achieve?
You can set a different comparator for sorting in your job:
job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
This for example will sort decreasingly by a LongWritable key.
A simple solution would be to have one Reducer (so all key-value pairs go to it), and have it keep track of the greatest key.
IntWritable currentMax = new IntWritable(-1);

public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    if (key.compareTo(currentMax) > 0) {
        currentMax.set(key.get());
        // copy 'values' somewhere, e.g. into an instance field
    }
}

public void cleanup(Context context) throws IOException, InterruptedException {
    Text outputValue = ...; // create output from the saved max values
    context.write(currentMax, outputValue);
}
An additional optimization would be to either only emit the maximum key in the same way from the Mapper, or use this Reducer implementation as your Combiner class.
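If it helps, the driver wiring for this approach might look roughly like the following; MaxKeyReducer is just a placeholder name for the reducer sketched above:
// Hypothetical driver snippet; MaxKeyReducer stands in for the reducer shown above.
job.setNumReduceTasks(1);                  // route every key/value pair to one reducer
job.setReducerClass(MaxKeyReducer.class);
job.setCombinerClass(MaxKeyReducer.class); // optional: apply the same logic map-side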
A better approach, thanks to Thomas Jungblut:
For your driver:
job.setSortComparatorClass(IntWritable.DecreasingComparator.class);
For your reducer:
boolean biggestKeyDone = false;

public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    if (!biggestKeyDone) {
        // write or do whatever with the values of the biggest key
        biggestKeyDone = true;
    }
}
If you only want to write the values of the biggest key in your reducer, I suggest saving the biggest key detected in your mapper in the configuration, like this:
Integer currentMax = null;

public void map(IntWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (currentMax == null) {
        currentMax = key.get();
    } else {
        currentMax = Math.max(currentMax, key.get());
    }
    context.write(key, value);
}

protected void cleanup(Context context) {
    if (currentMax != null) {
        context.getConfiguration().set("biggestKey", currentMax.toString());
    }
}
Then, in your reducer:
int biggestKey = -1;

protected void setup(Context context) {
    biggestKey = Integer.parseInt(context.getConfiguration().get("biggestKey"));
}

public void reduce(IntWritable key, Iterable<Text> values, Context context) {
    if (biggestKey == key.get()) {
        // write or whatever with the values of the biggest key
    }
}
This way you avoid wasting memory and time copying values.
I am trying to create a method that will take a list of items with set weights and choose one at random. My solution was to use a HashMap that uses an Integer weight to randomly select one of its keys. The keys of the HashMap can be a mix of object types, and I want to return one of the selected keys.
However, I would like to avoid returning a null value, on top of avoiding mutation. Yes, I know this is Java, but there are more elegant ways to write Java, and I'm hoping to solve this problem as it stands.
public <T> T getRandomValue(HashMap<?, Integer> VALUES) {
    final int SIZE = VALUES.values().stream().reduce(0, (a, b) -> a + b);
    final int RAND_SELECTION = ThreadLocalRandom.current().nextInt(SIZE) + 1;
    int currentWeightSum = 0;
    for (Map.Entry<?, Integer> entry : VALUES.entrySet()) {
        if (RAND_SELECTION > currentWeightSum && RAND_SELECTION <= (currentWeightSum + entry.getValue())) {
            return (T) entry.getKey();
        } else {
            currentWeightSum += entry.getValue();
        }
    }
    return null;
}
Since the code after the loop should never be reached under normal circumstances, you should indeed not write something like return null at this point, but rather throw an exception, so that irregular conditions can be spotted right at this point, instead of forcing the caller to eventually debug a NullPointerException, perhaps occurring at an entirely different place.
public static <T> T getRandomValue(Map<T, Integer> values) {
    if (values.isEmpty())
        throw new NoSuchElementException();
    final int totalSize = values.values().stream().mapToInt(Integer::intValue).sum();
    if (totalSize <= 0)
        throw new IllegalArgumentException("sum of weights is " + totalSize);
    final int threshold = ThreadLocalRandom.current().nextInt(totalSize) + 1;
    int currentWeightSum = 0;
    for (Map.Entry<T, Integer> entry : values.entrySet()) {
        currentWeightSum += entry.getValue();
        if (threshold <= currentWeightSum) {
            return entry.getKey();
        }
    }
    // if we reach this point, the map's content must have been changed in-between
    throw new ConcurrentModificationException();
}
Note that the code fixes some other issues in your code. You should not promise to return an arbitrary T without knowing the actual key type of the map. If the map contains keys of different types, i.e. is a Map<Object, Integer>, the caller can't expect to get anything more specific than Object. Besides that, you should not insist on the parameter being a HashMap when any Map is sufficient. Further, I changed the variable names to adhere to Java's naming conventions and simplified the loop's body.
If you want to support empty maps as legal input, changing the return type to Optional<T> would be the best solution, returning an empty optional for empty maps and an optional containing the value otherwise (this would disallow null keys). Still, the supposed-to-be-unreachable code point after the loop should be flagged with an exception.
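A minimal sketch of that Optional-based variant, assuming it delegates to the getRandomValue method above (the name getRandomValueOptional is just for illustration):
// Hypothetical wrapper around the method above; returns an empty Optional for an empty map.
public static <T> Optional<T> getRandomValueOptional(Map<T, Integer> values) {
    if (values.isEmpty()) {
        return Optional.empty();
    }
    return Optional.of(getRandomValue(values)); // Optional.of rejects null keys
}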
I have written a MapReduce application in which the mappers produce output in the following form:
key1 value1
key2 value2
keyn valuen
What I want to do is to sum all of the values for all the keys in my reducer. Basically:
sum = value1+value2+value3
Is that possible? From what I understand, the reducer is currently called separately for each key/value pair. One solution that came to my mind was to have a private sum variable maintaining the sum of the values processed thus far. In that case, however, how do I know that all of the pairs have been processed so that the sum may be written out to the collector?
If you don't need the key then use a single, constant key. If you have to have several key values, you could set the number of reducers to 1 and use an instance variable in the reducer class to hold the sum of all the values. Initialize the variable in the setup() method and report the overall sum in the cleanup() method (close() in the old API).
Another approach is to increment a counter by the sum of the values for each key inside the reduce method, and let Hadoop bring all the values together in a single counter value.
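A rough sketch of that counter-based idea; the group and counter names "stats"/"totalSum" are arbitrary placeholders, and the key/value types follow the word-count-style reducer shown further down:
// Hypothetical reducer body: accumulate per-key sums into a framework counter.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    long sum = 0;
    for (IntWritable value : values) {
        sum += value.get();
    }
    context.getCounter("stats", "totalSum").increment(sum);
}

// In the driver, after job.waitForCompletion(true), read the aggregated value:
// long total = job.getCounters().findCounter("stats", "totalSum").getValue();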
I am also new to Hadoop and while doing research on the same problem, I found out the Mapper and Reducer classes also have setup() and cleanup() methods along with map() and reduce().
First, set number of Reducers to 1.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    int sum = 0;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            sum += value.get();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(new Text("Sum:"), new IntWritable(sum));
    }
}
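For reference, the "one reducer" requirement from the first step is a single driver line (assuming the usual Job instance named job):
job.setNumReduceTasks(1); // one reducer sees every key, so 'sum' covers all values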
Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine, for all of its output (key, value) pairs, which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
Problem: how does Hadoop do this? Does it use a hash function? What is the default function?
The default partitioner in Hadoop is the HashPartitioner which has a method called getPartition. It takes key.hashCode() & Integer.MAX_VALUE and finds the modulus using the number of reduce tasks.
For example, if there are 10 reduce tasks, getPartition will return values 0 through 9 for all keys.
Here is the code:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
To create a custom partitioner, extend Partitioner, override the getPartition method, then set your partitioner in the driver code (job.setPartitionerClass(CustomPartitioner.class);). This is particularly helpful when doing secondary sort operations, for example.
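As a rough illustration (the class name CustomPartitioner matches the driver line above, but the first-letter routing rule and the Text/IntWritable types are made up for the example):
// Hypothetical example: route keys to reducers by their first letter (made-up rule).
public class CustomPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        // keys a-m go to one bucket, n-z to another; wrap around the available reducers
        return (first <= 'm' ? 0 : 1) % numReduceTasks;
    }
}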
I am very new to Hadoop. I am done with word count and now I want to make a modification.
I want to get the word that has occurred most often in a text file. If the normal word count program gives this output:
a 1
b 4
c 2
I want to write a program that will give me only this output:
b 4
Here is my reducer function:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    int max_sum = 0;
    Text max_occured_key;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (sum > max_sum) {
            max_sum = sum;
            max_occured_key = key;
        }
        context.write(max_occured_key, new IntWritable(max_sum));
        //context.write(key, new IntWritable(sum));
    }
}
but it is not giving the right output.
Can anyone help, please?
You're writing out the maximum value so far at the end of each reduce function - so you'll get more than a single entry per reducer. You're also running into reference re-use problems as you're copying the reference of the key to your max_occured_key variable (rather than copying the value).
You should probably amend as follows:
Initialize the max_occured_key variable at construction time (to an empty Text)
Call max_occured_key.set(key); rather than using assignment - the reference in the key parameter is reused for all invocations of the reduce method, so the actual object will remain the same and only its underlying contents are amended per invocation
Override the cleanup method and move the context.write call to that method - so that you'll only get one K,V output pair per reducer.
For example:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(max_occured_key, new IntWritable(max_sum));
}
The cleanup method is called once all the data has been passed through your map or reduce task. It is called per task instance, so if you have 10 reducers, this method will be called once for each instance.
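Putting the three amendments together, the reducer could look roughly like this (a sketch keeping your variable names, not a tested drop-in):
// Sketch of the amended reducer: track the max across keys, emit once in cleanup.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int max_sum = 0;
    private Text max_occured_key = new Text();   // initialized at construction time

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (sum > max_sum) {
            max_sum = sum;
            max_occured_key.set(key);            // copy the contents, not the reference
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(max_occured_key, new IntWritable(max_sum));
    }
}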
Hi, I am writing MapReduce code to find the maximum temperature. The problem is that I am getting the maximum temperature but without the corresponding key.
public static class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    Text year = new Text();
    int maxTemperature = Integer.MIN_VALUE;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable valTemp : values) {
            maxTemperature = Math.max(maxTemperature, valTemp.get());
        }
        //System.out.println("The maximunm temperature is " + maxTemperature);
        context.write(year, new IntWritable(maxTemperature));
    }
}
Imagine the mapper output looks like this:
1955 52
1958 7
1985 22
1999 32
and so on.
It is overwriting the keys and printing all the data. I want only the maximum temperature and its year.
I see a couple of things wrong with your code sample:
Reset maxTemperature inside the reduce method (as the first statement); at the moment you have a bug in that it will output the maximum temperature seen over all preceding keys/values
Where are you configuring the contents of year? In fact you don't need to: just call context.write(key, new IntWritable(maxTemperature)); as the input key is the year
You might want to create an IntWritable instance variable and re-use it rather than creating a new IntWritable when writing out the output value (this is an efficiency thing rather than a potential cause of your problem). A corrected sketch incorporating these points follows this list.
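Roughly, with those changes applied, the reducer might look like this (a sketch based on your class signature, not a tested drop-in):
// Sketch of the corrected reducer: per-key max, reset each call, reused output writable.
public static class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxTemperature = Integer.MIN_VALUE;   // reset for every key
        for (IntWritable valTemp : values) {
            maxTemperature = Math.max(maxTemperature, valTemp.get());
        }
        result.set(maxTemperature);
        context.write(key, result);               // the input key already is the year
    }
}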