How to find one specific key value pair as output from reducer - hadoop

I need to find the student with max marks using MR
Paul 90
Ben 20
Cook 80
Joe 85
So the output of the reducer should be (Paul 90).
Can anyone help me with this?

A good way of doing this is to do a secondary sort in Hadoop. Your Map output key should be a combination of (Name, Marks).
You would then implement a custom sort comparator that compares two keys on Marks only, so that the key with the higher marks sorts first.
Typically we implement a grouping comparator to group on the natural key, but in this case we want all the keys to go to a single reducer as one group. So the grouping comparator ignores the key differences.
In the reducer, just take the first value and exit.
Details of secondary sort : Secondary Sort
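As a rough illustration of the idea above, here is a minimal plain-Python sketch that imitates the sort and grouping steps without using Hadoop at all. The function names (sort_comparator, grouping_comparator, group) are hypothetical and only stand in for the comparators you would register on a real job.

```python
from functools import cmp_to_key

# Composite map output keys: (name, marks).
records = [("Paul", 90), ("Ben", 20), ("Cook", 80), ("Joe", 85)]

def sort_comparator(a, b):
    """Order composite keys by marks only, highest first."""
    return b[1] - a[1]

def grouping_comparator(a, b):
    """Treat every key as equal so all records fall into one group."""
    return 0

def group(sorted_keys):
    """Split sorted keys into groups, as the framework would per reduce call."""
    groups = []
    for key in sorted_keys:
        if groups and grouping_comparator(groups[-1][-1], key) == 0:
            groups[-1].append(key)
        else:
            groups.append([key])
    return groups

shuffled = sorted(records, key=cmp_to_key(sort_comparator))
groups = group(shuffled)

# The "reducer" reads only the first record of the single group and stops.
top_student = groups[0][0]
print(top_student)
```

Because the sort already puts the highest marks first, the reducer never has to scan the rest of the group.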

You can map all input tuples to the same key, with a value being the same as each input tuple, like (the-one-key, (Ben, 20)), and use a reduce function that returns only the tuple that has the maximum grade (since there is only one key).
To make sure that MR parallelism kicks in, using a combiner with the same function as the reducer (above) should do the trick. That way, the reducer will only get one tuple from each mapper and will have less work to do.
Edit: even better, you can already eliminate all but the max in the mapping function to get best performance (see Venkat's remark that combiners are not guaranteed to be used).
Example with two mappers:
Paul 90
Ben 20
Cook 80
Joe 85
Mapped to:
Mapper 1
(the-one-key, (Paul, 90))
(the-one-key, (Ben, 20))
Mapper 2
(the-one-key, (Cook, 80))
(the-one-key, (Joe, 85))
Combined to (still on the mappers' side):
Mapper 1
(the-one-key, (Paul, 90))
Mapper 2
(the-one-key, (Joe, 85))
Reduced to:
(the-one-key, (Paul, 90))
A final remark: MapReduce may be "too much" for this if you have a small data set. A simple scan in local memory would be faster if you only have a few hundred or a few thousand values.
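The two-mapper example above can be replayed in plain Python as a sanity check. This is only a sketch of the data flow, not Hadoop code; ONE_KEY, map_record and max_by_marks are illustrative names, and the same function plays the role of both combiner and reducer.

```python
ONE_KEY = "the-one-key"  # constant key shared by every record

def map_record(line):
    """Mapper: emit (the-one-key, (name, marks)) for one input line."""
    name, marks = line.split()
    return (ONE_KEY, (name, int(marks)))

def max_by_marks(pairs):
    """Combiner and reducer: keep only the tuple with the highest marks."""
    return max(pairs, key=lambda pair: pair[1])

# One list of input lines per mapper.
splits = [["Paul 90", "Ben 20"], ["Cook 80", "Joe 85"]]

combined = []
for split in splits:
    mapped = [map_record(line)[1] for line in split]
    combined.append(max_by_marks(mapped))  # runs on the mapper side

result = (ONE_KEY, max_by_marks(combined))  # runs on the single reducer
print(result)
```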

Take a look at the following code at gist:
https://gist.github.com/meshekhar/6dd773abf2af6ff631054facab885bf3
In mapper, data gets mapped to key value pair:
key: "Paul 90"
key: "Ben 20"
key: "Cook 80"
key: "Joe 85"
In the reducer, a while loop iterates through all the records; each value is split into name and marks, and the max marks seen so far are stored in a temp variable.
At the end, the max value and the corresponding name are returned as a pair, e.g. Paul 90.
I tested it on a single-node system with more than 1 million records; it takes less than 10 seconds.

Related

How to find the global average in a large dataset?

I am writing simple mapreduce programs to find the average, smallest number and largest number present in my data (many text files). I guess using a combiner to find the desired result within the numbers processed by a single mapper first would make it more efficient.
However, I am concerned that, in order to find the average, smallest number or largest number, we would require the data from all mappers (and hence all combiners) to go to a single reducer, so that we can find the universal average, smallest number or largest number. In the case of larger data sets, this would be a huge bottleneck.
I am sure there is some way to handle this issue in hadoop that I cannot think of. Can someone please guide me? I have been asked this sort of question in a couple of interviews as well.
Also, while running my 'Find Average' mapreduce program I am facing an issue: the only running mapper is taking too long to complete. I have increased the map task time-out as well, but it still gets stuck. With the help of stdout logs I have found that my mapper and combiner are executed smoothly. Hence I am not able to figure out what is causing my mapreduce job to hang.
Averages can be calculated on a stream of data. Try holding on to the following:
Current average
Number of elements
This way you'll know how much weight to give to an incoming number as well as a batch of numbers.
Here are a few solutions:
find-running-median-from-a-stream-of-integers
average-of-a-stream-of-numbers
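To make the weighting idea concrete, here is a small Python sketch that keeps only the current average and the number of elements. The helper names merge and add_value are made up for this illustration, and the sample numbers are arbitrary.

```python
def merge(avg_a, n_a, avg_b, n_b):
    """Combine two (average, count) summaries, weighting each by its count."""
    n = n_a + n_b
    if n == 0:
        return 0.0, 0
    return (avg_a * n_a + avg_b * n_b) / n, n

def add_value(avg, n, x):
    """Fold a single incoming number into the running average."""
    return merge(avg, n, float(x), 1)

# Stream three numbers one at a time.
avg, n = 0.0, 0
for x in [1, 2, 3]:
    avg, n = add_value(avg, n, x)

# Fold in a whole batch that is already summarized as (average 7.5, count 2).
total_avg, total_n = merge(avg, n, 7.5, 2)
print(avg, n, total_avg, total_n)
```

The count is what lets a batch summary be merged correctly; an average on its own does not carry enough information.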
For the average, use a single reducer: every mapper emits the same key for all pairs, with the number to be averaged as the value. Do not use a combiner here, since the average is not associative, i.e., the average of averages is not the global average.
Example:
values in Mapper 1: 1, 2, 3
values in Mapper 2: 5, 10
The average of the values of Mapper 1 is 2 = (1+2+3)/3.
The average of the values of Mapper 2 is 7.5 = (5+10)/2.
The average of the averages is 4.75 = (2+7.5)/2.
The global average is 4.2 = (1+2+3+5+10)/5.
For a more detailed answer, including a tricky solution with a combiner, see my slides (starting from slide 7), inspired by Donald Miner's book "MapReduce Design Patterns".
For the min/max, do the following logic:
Again, you can use a single reducer, with all the mappers emitting the same key always and the value being each of the values that you want to find the min/max.
A combiner (which is the same as the reducer) receives a list of values and emits the local min/max. Then the single reducer receives a list of local mins/maxes and emits the global min/max (min and max ARE associative).
In pseudocode:
map(key, value):
    emit(1, value);

reduce(key, list<values>):  // same as the combiner
    min = first_value;
    for each value
        if value < min
            min = value;
    emit(key, min);
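The pseudocode can be tried out in plain Python. This sketch is a small adaptation rather than a literal translation: so that one function can serve as both combiner and reducer while tracking min and max together, the mapper emits each value as a (min, max) candidate pair. The names map_value and reduce_min_max are illustrative only.

```python
def map_value(value):
    """Mapper: constant key 1, value wrapped as a (min, max) candidate pair."""
    return (1, (value, value))

def reduce_min_max(key, pairs):
    """Combiner and reducer: fold (min, max) pairs into a single pair."""
    lows, highs = zip(*pairs)
    return (key, (min(lows), max(highs)))

# One list of input values per mapper.
mapper_inputs = [[1, 2, 3], [5, 10]]

local = []
for values in mapper_inputs:
    mapped = [map_value(v)[1] for v in values]
    local.append(reduce_min_max(1, mapped)[1])  # combiner, on the mapper side

result = reduce_min_max(1, local)  # single reducer
print(local, result)
```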
Logic 1
From Map
Output the key as NullWritable and the value as (sum of values, count).
In Reducer
Split the value into sum and count.
Sum the values and the counts individually.
Divide the total sum by the total count.
Output the average from the reducer.
Logic 2
Create a Writable which can hold count and sum. Pass this variable from the map and reduce it with a single reducer.
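Here is a minimal plain-Python sketch of the (sum, count) approach, with tuples standing in for the custom Writable. The function names are hypothetical, and the sketch assumes at least one input value so that the final division is defined.

```python
def map_split(values):
    """Mapper side: emit one (sum, count) pair for the whole split."""
    return (sum(values), len(values))

def reduce_average(pairs):
    """Reducer: add up sums and counts separately, then divide once."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

pairs = [map_split([1, 2, 3]), map_split([5, 10])]
average = reduce_average(pairs)
print(pairs, average)
```

Since sums and counts can each be added in any order, the same addition step could also run as a combiner without changing the result.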

top-N b values for each a value using mapreduce

I am new to hadoop and have been struggling to write a mapreduce algorithm for finding top N values for each A value. Any help or guide to code implementation would be highly appreciated.
Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1
output
a 1,4,7,9
b 1,3,5
I believe we should write a Mapper that would read the line, split the values and allow it to be collected by reducer. And once in the reducer we have to do the sorting part.
If the number of values per key is small enough, the simple approach of just having the reducer read all values associated to a given key and output the top N is probably best.
If the number of values per key is large enough that this would be a poor choice, then a composite key is going to work better, and a custom partitioner and comparator will be needed. You'd want to partition based on the natural key (here 'a' or 'b', so that these end up at the same reducer) but with a secondary sort on the value (so that the reducer will see the largest values first).
The secondary sort trick mentioned by cohoz seems to be what you're looking for.
There's a nice guide here, which even has a similar structure to your problem (in the example, the author is seeking to walk over each integer timestamp (1,2,3) in sorted order for each class (a,b,c)). You'll simply need to modify the reducer in the example to walk over only the top n items, emit them, and then stop.
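The composite-key approach can be imitated in plain Python to see what the reducer would receive. In this sketch the sort key plays the role of the custom comparator (natural key first, then value descending) and groupby plays the role of grouping on the natural key; N = 2 is an arbitrary choice for the illustration.

```python
from itertools import groupby, islice

lines = ["a,1", "a,9", "b,3", "b,5", "a,4", "a,7", "b,1"]
N = 2  # how many values to keep per key (arbitrary for this sketch)

def map_line(line):
    """Mapper: build the composite key (natural key, value)."""
    key, value = line.split(",")
    return (key, int(value))

composite = [map_line(line) for line in lines]

# Sort: natural key ascending, value descending (the secondary sort).
shuffled = sorted(composite, key=lambda kv: (kv[0], -kv[1]))

# Group on the natural key only; each reduce call reads N items and stops.
top = {}
for natural_key, group in groupby(shuffled, key=lambda kv: kv[0]):
    top[natural_key] = [value for _, value in islice(group, N)]

print(top)
```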

How to control the sort order of mapper result in mapreduce before being sent to reducer

Taking a slight variation of the word count example to explain what I am trying to do.
I have 3 mappers each producing a complete word count result on 3 large input files.
Let us say the output is:
Mapper 1 Result:
-------
cat 100
dog 50
fox 10
Mapper 2 Result:
-------
fox 200
pig 5
rat 1
Mapper 3 Result:
-------
dog 70
rat 50
fox 10
Notice that each result is a complete word count with unique key,count results for given files.
Now on the reducer side my algorithm requires that there be only one reducer,
and for reasons that are a bit too lengthy to discuss here, I want the results from each mapper to be fed into the reducer in descending order of counts, but without performing any shuffle and sort step. That is, I would like the reducer to receive the results from each mapper in the following order without any grouping by key:
cat 100
dog 50
fox 10
fox 200
pig 5
rat 1
dog 70
rat 50
fox 10
In other words, just load the results of each mapper into the reducer in descending order of value (not key).
Seems like this should be a Map-only job since you don't want Shuffle and Sort to happen.
If you REALLY need to use Reduce then I suggest you need to have a composite key and do secondary sort.
The key would include a mapper id, normal key and the count value. You would do primary sort on mapper id and secondary sort on count. You would also need a grouping comparator that did not group anything (or grouped on mapper id and normal key only).
Again, looking at all the stuff you would need to do to use a Reducer just to prevent Shuffle and Sort, seems like this should be a Map-only job unless the output must be in a single file.
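For what it is worth, the ordering produced by such a composite key can be checked with a few lines of plain Python. The tuple layout (mapper id, word, count) and the variable names are assumptions made for this sketch; the input order inside each mapper is deliberately scrambled to show that the sort restores it.

```python
# Word counts per mapper, keyed by a mapper id.
mapper_outputs = {
    1: [("fox", 10), ("cat", 100), ("dog", 50)],
    2: [("rat", 1), ("fox", 200), ("pig", 5)],
    3: [("fox", 10), ("dog", 70), ("rat", 50)],
}

# Composite key: (mapper id, word, count).
composite_keys = [
    (mapper_id, word, count)
    for mapper_id, output in mapper_outputs.items()
    for word, count in output
]

# Primary sort on mapper id, secondary sort on count descending.
ordered = sorted(composite_keys, key=lambda k: (k[0], -k[2]))

# What the single reducer would see, one record at a time, with no grouping.
received = [(word, count) for _, word, count in ordered]
for word, count in received:
    print(word, count)
```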

MapReduce - how do I calculate relative values (average, top k and so)?

I'm looking for a way to calculate "global" or "relative" values during a MapReduce process - an average, sum, top etc. Say I have a list of workers, with their IDs associated with their salaries (and a bunch of other stuff). At some stage of the processing, I'd like to know who are the workers who earn the top 10% of salaries. For that I need some "global" view of the values, which I can't figure out.
If I have all values sent into a single reducer, it has that global view, but then I lose concurrency, and it seems awkward. Is there a better way?
(The framework I'd like to use is Google's, but I'm trying to figure out the technique - no framework specific tricks please)
My first thought is to do something like this:
MAP: Use some dummy value as the key, maybe the empty string for efficiency, and create a class that holds both a salary and an employee ID. In each Mapper, create an array that holds 10 elements. Fill it up with the first ten salaries you see, sorted (so location 0 is the highest salary, location 9 is the 10th highest). For every salary after that, see if it is in the top ten and, if it is, insert it in the correct location and then move the lower salaries down, as appropriate.
Combiner/Reducer: merge sort the lists. I'd basically do the same thing as in the mapper by creating a ten-element array and then looping over all the arrays that match the key, merging them in according to the same comparison/replace/move-down sequence as in the mapper.
If you run this with one reducer, it should ensure that the top 10 salaries are output.
I don't see a way to do this while using more than one reducer. If you use a combiner, then the reducer should only have to merge a ten-element array for each node that ran mappers (which should be manageable unless you're running on thousands of nodes).
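A compact way to try this out is the following plain-Python sketch, which uses heapq in place of the hand-maintained ten-element array. The record layout (salary, employee_id), the sample data and the function names are all made up for the illustration.

```python
import heapq
from itertools import islice

K = 10

def map_top_k(records, k=K):
    """Mapper: keep only the k highest (salary, employee_id) pairs of a split."""
    return heapq.nlargest(k, records)  # returned in descending order

def reduce_top_k(partial_lists, k=K):
    """Combiner/reducer: merge descending lists and stop after k items."""
    merged = heapq.merge(*partial_lists, reverse=True)
    return list(islice(merged, k))

# Two splits of synthetic (salary, employee_id) records.
split_a = [(1000 + 10 * i, f"a{i}") for i in range(15)]
split_b = [(1005 + 10 * i, f"b{i}") for i in range(15)]

partials = [map_top_k(split_a), map_top_k(split_b)]
top_ten = reduce_top_k(partials)
print(top_ten)
```

Each mapper ships at most K records, so the single reducer only merges short, already sorted lists.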
[Edit: I misunderstood. Update for the Top 10% ]
For doing something that relates to the "total" there is no other way than determining the total first and then do the calculations.
So the "Top 10% salaries" could be done roughly as follows:
Determine total:
MAP: Identity
REDUCE: Aggregate the info of all records that go through and create a new "special" record with the "total". Note that you would like to scale
This can also be done by letting the MAP output 2 records (data, total) and the reducer then only touches the "total" records by aggregating.
Use total:
MAP: Prepare for SecondarySort
SHUFFLE/SORT: Sort the records so that the records with the "total" are first into the reducer.
REDUCE: Depending on your implementation the reducer may get a bunch of these total records (aggregate them) and for all subsequent records determine where they are in relation to everything else.
The biggest question on this kind of processing is: Will it scale?
Remember you are breaking the biggest "must have" for scaling out: Independent chunks of information. This makes them dependent around the "total" value.
I expect a technically different way of making the "total" value available to the second step is essential in making this work on "big data".
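The two-step outline can be sketched in plain Python as follows. This is only an illustration under a stated assumption: the "total" is taken to be the number of records, which is what a top 10% cut-off needs. The function names and sample data are hypothetical, and the second step is done with one in-memory sort rather than a real secondary sort.

```python
import math

def determine_total(splits):
    """Step 1: aggregate a single 'total' (here, the record count)."""
    return sum(len(split) for split in splits)

def use_total(splits, total, percent=10):
    """Step 2: with the total known, keep the top percent of records by salary."""
    cutoff = math.ceil(total * percent / 100)
    ordered = sorted(
        (record for split in splits for record in split),
        key=lambda record: record[1],
        reverse=True,
    )
    return ordered[:cutoff]

# Twenty synthetic (worker_id, salary) records spread over two splits.
workers = [(f"w{i}", 1000 + 100 * i) for i in range(20)]
splits = [workers[0::2], workers[1::2]]

total = determine_total(splits)
top_earners = use_total(splits, total)
print(total, top_earners)
```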
The book "Hadoop: The Definitive Guide" by Tom White has a very good chapter on Secondary Sort.
I would do something like this
The mapper will use a UUID as part of the key, created in the setup() method of the mapper. The mapper emits, as the key, the UUID appended with either 0 or the salary. The mapper accumulates the count and total.
In the cleanup() method, the mapper emits UUID appended with 0 as the key and the count and total as the value. In the map() method, the mapper emits the UUID appended with salary as the key and salary as the value.
Since the keys are sorted, the first call to combiner will have the count and total as the value. The combiner could store them as class members. We could also find out what 10% of total count is and save that as well as class member (call it top). We initialize a list and save it as a class member.
Subsequent calls to combiner will contain the salary as the value, arriving in sorted order. We add the value to the list and at the same time increment a counter. When the counter reaches the value top, we don't store any more values in our list. We ignore values in rest of the combiner calls.
In the combiner cleanup(), we do the emit. The combiner will emit only the UUID as the key. The value will contain count and total followed by the top 10% of the values. So the output of the combiner will have partial results, based on the subset of the data that passed through the mapper.
The reducer will be called as many times as the number of mappers in this case, because each mapper/combiner emits only one key.
The reducer will accumulate the counts, totals and the top 10% values in the reduce() method. In the cleanup() method, the average is calculated. The top 10% is also calculated in the cleanup() method from the aggregation of the top 10% values arriving in each call of the reducer. This is basically a merge sort.
The reducer cleanup() method could do multiple emits, so that average is in the first row, followed by the top 10% of salaries in the subsequent rows.
Finally, to ensure that final aggregate statistics are global, you have to set the number of reducers to one.
Since there is data accumulation and sorting in the reducer, although on partial data set, there may be memory issues.

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment showing how to use this in the Hadoop 0.20 API?
This definitely sounds like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide", if you like. It's from O'Reilly, and you can also access it online. It describes a pretty good implementation.
I implemented it by myself too. Basically it works this way:
The partitioner takes care that all the key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms groupings. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groupings, while the number of partitions equals the number of reducers. The comparator also lets you influence ordering, as it implements a comparison method.
With this method, you can make sure that the 10 best/worst/highest/lowest (or whatever) keys reach the reducer first. So after you have read these 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the Map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
update 2
So, now that we know that the rank is cumulative and that, as a result, you can't filter the data early by using combiners, the only thing to do is what you suggested: get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 ).
If I understand the question properly, you'll need to use a TotalOrderPartitioner.

Resources