MapReduce - how do I calculate relative values (average, top k and so on)? - hadoop

I'm looking for a way to calculate "global" or "relative" values during a MapReduce process - an average, sum, top etc. Say I have a list of workers, with their IDs associated with their salaries (and a bunch of other stuff). At some stage of the processing, I'd like to know who are the workers who earn the top 10% of salaries. For that I need some "global" view of the values, which I can't figure out.
If I have all values sent into a single reducer, it has that global view, but then I lose concurrency, and it seems awkward. Is there a better way?
(The framework I'd like to use is Google's, but I'm trying to figure out the technique - no framework specific tricks please)

My first thought is to do something like this:
MAP: Use some dummy value as the key, maybe the empty string for efficiency, and create a class that holds both a salary and an employee ID. In each mapper, create an array that holds 10 elements. Fill it up with the first ten salaries you see, sorted (so location 0 is the highest salary, location 9 is the 10th highest). For every salary after that, see if it is in the top ten and, if it is, insert it in the correct location and then move the lower salaries down, as appropriate.
Combiner/Reducer: merge-sort the lists. I'd basically do the same thing as in the mapper: create a ten-element array, then loop over all the arrays that match the key, merging them in with the same compare/insert/shift-down sequence used in the mapper.
If you run this with one reducer, it should ensure that the top 10 salaries are output.
I don't see a way to do this while using more than one reducer. If you use a combiner, then the reducer should only have to merge a ten-element array for each node that ran mappers (which should be manageable unless you're running on thousands of nodes).
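A minimal sketch of that mapper in Java (the newer org.apache.hadoop.mapreduce API), assuming input lines look like "employeeId,salary"; a TreeMap stands in for the hand-maintained 10-element array, and a combiner/reducer would apply the same keep-the-largest-10 logic to the merged lists:

    import java.io.IOException;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Keeps the 10 highest salaries seen by this mapper and emits them once,
    // in cleanup(). A combiner/reducer with the same logic merges the
    // per-mapper lists; a single reducer then yields the global top 10.
    public class Top10SalaryMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        // salary -> original record; ties on salary overwrite each other in
        // this simplified sketch
        private final TreeMap<Double, String> top = new TreeMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String[] parts = line.toString().split(",");   // assumed "id,salary" input
            double salary = Double.parseDouble(parts[1]);
            top.put(salary, line.toString());
            if (top.size() > 10) {
                top.remove(top.firstKey());                // evict the smallest
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            for (String record : top.values()) {
                context.write(NullWritable.get(), new Text(record));
            }
        }
    }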

[Edit: I misunderstood. Updated for the top 10%.]
For anything that relates to the "total", there is no way around determining the total first and then doing the calculations.
So the "Top 10% salaries" could be done roughly as follows:
Determine total:
MAP: Identity
REDUCE: Aggregate the info of all records that go through and create a new "special" record with the "total". (Note the scaling concern discussed below.)
This can also be done by letting the MAP output two records per input (data, total); the reducer then only touches the "total" records, aggregating them.
Use total:
MAP: Prepare for SecondarySort
SHUFFLE/SORT: Sort the records so that the records with the "total" come into the reducer first.
REDUCE: Depending on your implementation the reducer may get a bunch of these total records (aggregate them) and for all subsequent records determine where they are in relation to everything else.
The biggest question with this kind of processing is: will it scale?
Remember that you are breaking the biggest "must have" for scaling out: independent chunks of information. Here the chunks all become dependent on the "total" value.
I expect that a technically different way of making the "total" value available to the second step will be essential in making this work on "big data".
The book "Hadoop: The Definitive Guide" by Tom White has a very good chapter on secondary sort.

I would do something like this
The mapper will use a UUID as part of the key, created in the setup() method of the mapper. The mapper emits, as the key, the UUID with either 0 or the salary appended. The mapper also accumulates the count and total.
In the cleanup() method, the mapper emits UUID appended with 0 as the key and the count and total as the value. In the map() method, the mapper emits the UUID appended with salary as the key and salary as the value.
Since the keys are sorted, the first call to the combiner will have the count and total as the value. The combiner could store them as class members. We could also work out what 10% of the total count is and save that as a class member too (call it top). We initialize a list and save it as a class member.
Subsequent calls to the combiner will contain the salaries as the values, arriving in sorted order. We add each value to the list and at the same time increment a counter. When the counter reaches top, we stop storing values in our list and ignore the values in the rest of the combiner calls.
In the combiner cleanup(), we do the emit. The combiner will emit only the UUID as the key. The value will contain count and total followed by the top 10% of the values. So the output of the combiner will have partial results, based on the subset of the data that passed through the mapper.
The reducer will be called as many times as the number of mappers in this case, because each mapper/combiner emits only one key.
The reducer will accumulate the counts, totals and the top 10% values in the reduce() method. In the cleanup() method, the average is calculated. The top 10% is also calculated in the cleanup() method, from the aggregated top-10% lists arriving in each call to the reducer. This is basically a merge sort.
The reducer cleanup() method could do multiple emits, so that average is in the first row, followed by the top 10% of salaries in the subsequent rows.
Finally, to ensure that final aggregate statistics are global, you have to set the number of reducers to one.
Since there is data accumulation and sorting in the reducer, albeit on a partial data set, there may be memory issues.
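A sketch of just the mapper half of this scheme, against the newer API. The zero-padded key format is my own assumption so that the string sort order of the salary keys is well defined and the "0" summary key from cleanup() sorts ahead of them; for "top" salaries you would invert the sort (e.g. key on a large constant minus the salary) so the largest arrive first.

    import java.io.IOException;
    import java.util.UUID;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UuidSalaryMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String uuid;
        private long count = 0;
        private double total = 0;

        @Override
        protected void setup(Context context) {
            uuid = UUID.randomUUID().toString();   // one UUID per mapper
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            double salary = Double.parseDouble(line.toString().trim());  // assumed one salary per line
            count++;
            total += salary;
            // zero-padded so salaries sort as strings; they all sort after
            // the "<uuid>:0" summary key emitted in cleanup()
            context.write(new Text(uuid + ":" + String.format("%015.2f", salary)),
                          new Text(Double.toString(salary)));
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // summary record: count and running total for this mapper
            context.write(new Text(uuid + ":0"), new Text(count + "," + total));
        }
    }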

Related

Designing the "mapper" and "reducer" functions' functionality for Hadoop?

I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.
The input to my mapper is a large directed graph's connectivity. It is a 2 column input where each row is an individual edge connectivity. The first column is the start node id and the second column is the end node id of each edge. I'm trying to output the number of neighbors for each start node id into a 2 column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper and then counting the distinct values per key in the reducer (e.g., by adding them to a set and emitting the set size as the value from the reducer) is one way of doing it, but a bit redundant, as you say.
In general, you want to reduce the network traffic, so you may want to do some more computations before the shuffling (yes, this is done by Hadoop).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way, you will send fewer key-value pairs to the reducers, and also, some values may be skipped, since they are already in the local value set for the same key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., a hashmap or multimap). The key can be the map input key and the value can be the set of values seen so far for this key. Each time you meet a new value for this key, you add it to this structure. At the end of each mapper, you emit this structure (or convert the values to an array) from the close() method (cleanup() in the newer API); see the sketch below.
You can lookup both methods using the keywords "combiner" and "map-side aggregation".
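A rough sketch of that map-side aggregation for this neighbor-count problem, using the newer API; it assumes the per-mapper map of sets fits in memory and that the input is whitespace-separated "startId endId" pairs.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Collects, per start node, the set of distinct end nodes seen by this
    // mapper, and emits each set once in cleanup(); the reducer unions the
    // sets per start node and outputs their sizes.
    public class NeighborSetMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, Set<String>> neighbors = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String[] cols = line.toString().trim().split("\\s+");  // assumed "startId endId"
            neighbors.computeIfAbsent(cols[0], k -> new HashSet<>()).add(cols[1]);
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            for (Map.Entry<String, Set<String>> e : neighbors.entrySet()) {
                context.write(new Text(e.getKey()),
                              new Text(String.join(",", e.getValue())));
            }
        }
    }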
A global sort on the key is a bit trickier. Again, there are two basic options, though neither is really good:
1) you use a single reducer, but then you don't gain anything from parallelism,
2) you use a total order partitioner, which needs some extra coding.
Other than that, you may want to move to Spark for a more intuitive and efficient solution.

How does map-reduce work? Did I get it right?

I'm trying to understand how map-reduce actually works. Please read what I have written below and tell me if there are any missing or incorrect parts.
Thank you.
The data is first split into what are called input splits (a logical kind of grouping whose size we define according to our record-processing needs).
Then, there is a mapper for every input split, which takes the input split and sorts it by key and value.
Then, there is the shuffling process, which takes all of the data from the mappers (key-value pairs) and merges all identical keys with their values (its output is all the keys, each with its list of values). The shuffling process occurs in order to give the reducer an input of one entry per distinct key, together with its values.
Then, the reducer merges all the key-value pairs into one place (a page, maybe?), which is the final result of the MapReduce process.
We only have to make sure to define the code for the Map step (which always outputs key-value pairs) and the Reduce step (the final result: it gets a key with its values as input and can compute a count, sum, average, etc.).
Your understanding is slightly wrong, especially regarding how the mapper works.
I have a very nice pictorial image to explain it in simple terms.
It is similar to the wordcount program, where
Each bundle of chocolates is an InputSplit, which is handled by a mapper. So we have 3 bundles.
Each chocolate is a word. One or more words (making a sentence) form a record, which is the input to a single map call. So, within one input split there may be multiple records, and each record is the input to a single map call.
The mapper counts the occurrences of each word (chocolate) and emits the counts. Note that each map call works on only one line (record); as soon as it is done, the mapper picks up the next record from the input split. (2nd phase in the image)
Once the map phase is finished, sorting and shuffling take place to make a bucket of counts for each kind of chocolate. (3rd phase in the image)
One reduce call gets one bucket, with the key being the name of the chocolate (or the word) and the value being a list of counts. So, there are as many reduce calls as there are distinct words in the whole input file.
The reducer iterates through the counts and sums them up to produce the final count, which it emits against the word.
The diagram below shows how one single input split of the word-count program works:
Similar QA - Simple explanation of MapReduce?
Also, this post explains Hadoop - HDFS & MapReduce in a very simple way: https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures
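For reference, the word-count map and reduce that the analogy is describing, boiled down to the essentials (standard Hadoop example, newer API):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // one call per record (line); emit ("word", 1) for every word
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        context.write(new Text(token), ONE);
                    }
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // the shuffle has already grouped all counts for this word
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }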

How to find the global average in a large dataset?

I am writing simple MapReduce programs to find the average, smallest number and largest number present in my data (many text files). I guess using a combiner to first find the desired statistics within the numbers processed by a single mapper would make it more efficient.
However, I am concerned that in order to find the average, smallest number or largest number, we would require the data from all mappers (and hence all combiners) to go to a single reducer, so that we can find the universal average, smallest number or largest number, which for larger data sets would be a huge bottleneck.
I am sure there is some way to handle this issue in Hadoop that I just cannot think of. Can someone please guide me? I have been asked this sort of question in a couple of interviews as well.
Also, while running my 'Find Average' MapReduce program I am facing an issue: the only running mapper is taking too long to complete. I have increased the map task time-out as well, but it still gets stuck, whereas the stdout logs show that my mapper and combiner execute smoothly. Hence I am not able to figure out what is causing my MapReduce job to hang.
Averages can be calculated on a stream of data. Try holding on to the following:
Current average
Number of elements
This way you'll know how much weight to give to an incoming number as well as a batch of numbers.
Here are a few solutions:
find-running-median-from-a-stream-of-integers
average-of-a-stream-of-numbers
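To make the "current average + number of elements" idea above concrete: two partial summaries can be merged exactly without keeping the raw numbers, which is what lets this run over a stream or across mappers. A tiny sketch:

    // Merge (avg1 over n1 elements) with (avg2 over n2 elements) into the
    // exact average over all n1 + n2 elements.
    static double mergeAverages(double avg1, long n1, double avg2, long n2) {
        return (avg1 * n1 + avg2 * n2) / (n1 + n2);
    }

    // e.g. mergeAverages(2.0, 3, 7.5, 2) == 4.2, the true global average of
    // {1, 2, 3} and {5, 10} from the example in the next answer.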
For the average, use a single reducer: emit the same key for all pairs, with the value being each of the values you want to average, and do not use a combiner, since averaging is not associative, i.e., the average of averages is not the global average.
Example:
values in Mapper 1: 1, 2, 3
values in Mapper 2: 5, 10
The average of the values of Mapper 1 is 2 = (1+2+3)/3.
The average of the values of Mapper 2 is 7.5 = (5+10)/2.
The average of the averages is 4.75 = (2+7.5)/2.
The global average is 4.2 = (1+2+3+5+10)/5.
For a more detailed answer, including a tricky solution with a combiner, see my slides (starting from slide 7), inspired by Donald Miner's book "MapReduce Design Patterns".
For the min/max, do the following logic:
Again, you can use a single reducer, with all the mappers always emitting the same key and the value being each of the values for which you want to find the min/max.
A combiner (which is the same as the reducer) receives a list of values and emits the local min/max. Then the single reducer receives a list of local mins/maxes and emits the global min/max (min and max ARE associative).
In pseudocode:
    map(key, value):
        emit(1, value);

    reduce(key, list<values>):   // same as combiner
        min = first_value;
        for each value:
            if value <= min:
                min = value;
        emit(key, min);
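The same logic written against the newer Java API; because min is associative, the class below can be registered as both the combiner and the reducer (with the mapper always emitting the constant key 1):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MinReducer
            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double min = Double.POSITIVE_INFINITY;   // stands in for "first_value"
            for (DoubleWritable v : values) {
                if (v.get() <= min) {
                    min = v.get();
                }
            }
            context.write(key, new DoubleWritable(min));
        }
    }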
From the map:
Output the key as NullWritable and the value as (sum of values, count).
In the reducer:
Split the value into the sum and the count.
Sum the sums and the counts individually.
Divide the total sum by the total count.
Output the average from the reducer.
Logic 2
Create a Writable which can hold the count and the sum. Pass this from the map and reduce it with a single reducer.
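A sketch of the first logic above, with the map-side value encoded as the text "sum,count" for brevity (the custom Writable of Logic 2 is the cleaner version of the same thing). The single reducer folds all the partial pairs into the global average; a combiner, if used, would have to emit partial "sum,count" pairs rather than the final average, so it would need a separate class.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Values arrive as "sum,count" strings under a single NullWritable key.
    public class AverageReducer
            extends Reducer<NullWritable, Text, NullWritable, DoubleWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double totalSum = 0;
            long totalCount = 0;
            for (Text v : values) {
                String[] parts = v.toString().split(",");
                totalSum += Double.parseDouble(parts[0]);
                totalCount += Long.parseLong(parts[1]);
            }
            context.write(NullWritable.get(), new DoubleWritable(totalSum / totalCount));
        }
    }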

top-N b values for each a value using mapreduce

I am new to hadoop and have been struggling to write a mapreduce algorithm for finding top N values for each A value. Any help or guide to code implementation would be highly appreciated.
Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1
output
a 1,4,7,9
b 1,3,5
I believe we should write a mapper that reads each line, splits the values and lets them be collected by the reducer. And once in the reducer, we have to do the sorting part.
If the number of values per key is small enough, the simple approach of just having the reducer read all values associated to a given key and output the top N is probably best.
If the number of values per key is large enough that this would be a poor choice, then a composite key is going to work better, and a custom partitioner and comparator will be needed. You'd want to partition based on the natural key (here 'a' or 'b', so that these end up at the same reducer) but with a secondary sort on the value (so that the reducer will see the largest values first).
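For the first case (few enough values per key to hold in memory), a sketch of such a reducer; the configuration property name "top.n" is just an illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Collects all values for one key, sorts them, and emits the N largest as
    // a comma-separated list.
    public class TopNPerKeyReducer extends Reducer<Text, IntWritable, Text, Text> {

        private int n;

        @Override
        protected void setup(Context context) {
            n = context.getConfiguration().getInt("top.n", 10);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            List<Integer> all = new ArrayList<>();
            for (IntWritable v : values) {
                all.add(v.get());
            }
            all.sort(Collections.reverseOrder());          // largest first
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < Math.min(n, all.size()); i++) {
                if (i > 0) {
                    out.append(',');
                }
                out.append(all.get(i));
            }
            context.write(key, new Text(out.toString()));
        }
    }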
The secondary sort trick mentioned by cohoz seems to be what you're looking for.
There's a nice guide here, which even has a similar structure to your problem (in the example, the author walks over each integer timestamp (1,2,3) in sorted order for each class (a,b,c)). You'll simply need to modify the reducer in the example to walk over just the top n items, emit them, then stop.

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "the 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment on how to use this with the Hadoop 0.20 API?
Sounds definitely like a secondary-sort problem. Take a look at "Hadoop: The Definitive Guide" if you like; it's from O'Reilly, and you can also access it online. They describe a pretty good implementation there.
I implemented it myself too. Basically it works this way:
The partitioner will take care that all the key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms the groupings. One group is actually passed as an iterator to one reduce() call, so a partition can contain multiple groupings. However, the number of partitions should equal the number of reducers. The grouping also allows you to do some sorting, as it implements a compareTo method.
With this method, you can control that the 10 best/worst/highest/lowest keys reach the reducer first. So after you have read these 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
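For orientation, this is roughly how those pieces are wired together in the driver with the newer API; CompositeKey (natural key plus relevance), NaturalKeyPartitioner, CompositeKeyComparator and NaturalKeyGroupingComparator are classes you write yourself, named here only as placeholders.

    // Fragment of the usual driver main(), where conf is a Configuration.
    Job job = Job.getInstance(conf, "10 worst values per key");
    job.setMapOutputKeyClass(CompositeKey.class);                   // (naturalKey, relevance)
    job.setPartitionerClass(NaturalKeyPartitioner.class);           // partition on naturalKey only
    job.setSortComparatorClass(CompositeKeyComparator.class);       // naturalKey, then relevance
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // group on naturalKey only
    // The reducer now sees the values for one naturalKey in relevance order
    // and can simply return after reading the first 10.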
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs: <key, [set of {score, data}]> and does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
Update 2
So, now that we know that the rank is cumulative and, as a result, you can't filter the data early using combiners, the only option is to do what you suggested -- get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545).
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
