How to find the global average in a large dataset? - hadoop

I am writing simple MapReduce programs to find the average, smallest number, and largest number in my data (many text files). I guess using a combiner to first find the desired values within the numbers processed by a single mapper would make it more efficient.
However, I am concerned that in order to find the average, smallest number, or largest number, the data from all mappers (and hence all combiners) must go to a single reducer so that we can find the universal average, smallest number, or largest number, which for larger data sets would be a huge bottleneck.
I am sure there is some way to handle this in Hadoop that I just cannot think of. Can someone please guide me? I have been asked this sort of question in a couple of interviews as well.
Also, while running my 'Find Average' MapReduce program I am facing an issue: the only running mapper is taking too long to complete. I have increased the map task time-out, but it still gets stuck. From the stdout logs I can see that my mapper and combiner execute smoothly, so I am not able to figure out what is causing my MapReduce job to hang.

Averages can be calculated on a stream of data. Try holding on to the following:
Current average
Number of elements
This way you'll know how much weight to give to an incoming number as well as a batch of numbers.
Here are a few solutions:
find-running-median-from-a-stream-of-integers
average-of-a-stream-of-numbers
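For illustration, a small sketch of the running-average bookkeeping described above (the RunningAverage class is just made up for this example):

import java.util.Objects;

// Tracks the current average and the number of elements seen so far.
public class RunningAverage {
    private double average = 0.0;
    private long count = 0;

    // Fold a single incoming number into the running average.
    public void add(double x) {
        count++;
        average += (x - average) / count;
    }

    // Fold in a pre-aggregated batch, e.g. the partial result of another mapper.
    public void addBatch(double batchAverage, long batchCount) {
        long total = count + batchCount;
        average = (average * count + batchAverage * batchCount) / total;
        count = total;
    }

    public double get() { return average; }
    public long size() { return count; }
}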

For the average, use a single reducer: have every mapper emit the same key for all pairs, with the number you want to average as the value, as in the example below (without a combiner, since averaging is not associative, i.e., the average of averages is not the global average).
Example:
values in Mapper 1: 1, 2, 3
values in Mapper 2: 5, 10
The average of the values of Mapper 1 is 2 = (1+2+3)/3.
The average of the values of Mapper 2 is 7.5 = (5+10)/2.
The average of the averages is 4.75 = (2+7.5)/2.
The global average is 4.2 = (1+2+3+5+10)/5.
For a more detailed answer, including a tricky solution with a combiner, see my slides (starting from slide 7), inspired by Donald Miner's book "MapReduce Design Patterns".
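As a rough sketch of the single-reducer approach above (new Hadoop API; class names are placeholders and one number per input line is assumed, so treat this as an illustration rather than the canonical implementation):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Every record is emitted under the same key, so a single reducer sees all values.
class GlobalAverageMapper extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private static final IntWritable ONE_KEY = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(ONE_KEY, new DoubleWritable(Double.parseDouble(line.toString().trim())));
    }
}

class GlobalAverageReducer extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable v : values) {
            sum += v.get();
            count++;
        }
        context.write(key, new DoubleWritable(sum / count));   // global average
    }
}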
For the min/max, use the following logic:
Again, you can use a single reducer, with all the mappers always emitting the same key and, as the value, each of the values over which you want to find the min/max.
A combiner (which is the same as the reducer) receives a list of values and emits the local min/max. Then the single reducer receives a list of local mins/maxes and emits the global min/max (min and max ARE associative).
In pseudocode:
map(key, value):
    emit(1, value);

reduce(key, list<values>):   // same as the combiner
    min = first_value;
    for each value:
        if value <= min:
            min = value;
    emit(key, min);
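A minimal sketch of this in the new Hadoop API (the class name is made up; since min is associative, the very same class is registered as both the combiner and the reducer):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

// Receives all values for the single shared key and keeps only the minimum.
class MinReducer extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long min = Long.MAX_VALUE;
        for (LongWritable v : values) {
            min = Math.min(min, v.get());
        }
        context.write(key, new LongWritable(min));
    }
}

// Driver wiring (sketch):
// job.setCombinerClass(MinReducer.class);
// job.setReducerClass(MinReducer.class);
// job.setNumReduceTasks(1);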

From the map:
Output the key as NullWritable and the value as (sum of values, count).
In the reducer:
Split the value into the sum and the count.
Sum the sums and the counts individually.
Divide the total sum by the total count.
Output the average from the reducer.
Logic 2
Create a Writable that can hold both the count and the sum. Pass this variable from the map and reduce it with a single reducer, as in the sketch below.
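A possible shape for that Writable (a sketch; the class name SumCountWritable and its helper methods are made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Carries a partial sum and a count so mappers (and an optional combiner)
// can pre-aggregate before the single reducer divides total sum by total count.
public class SumCountWritable implements Writable {
    private double sum;
    private long count;

    public SumCountWritable() {}                                   // needed by Hadoop reflection
    public SumCountWritable(double sum, long count) { this.sum = sum; this.count = count; }

    public void merge(SumCountWritable other) {                    // combine two partial aggregates
        sum += other.sum;
        count += other.count;
    }

    public double average() { return sum / count; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(sum);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum = in.readDouble();
        count = in.readLong();
    }
}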

Related

How to find one specific key value pair as output from reducer

I need to find the student with max marks using MR
Paul 90
Ben 20
Cook 80
Joe 85
So output of reducer should be (Paul 90)
Can anyone help me with this?
A good way of doing this is to do a secondary sort in Hadoop. Your Map output key should be a combination of (Name, Marks).
You would then implement a custom comparator that takes this key and, based on the marks only, compares two given values, sorting higher marks first.
Typically we would also implement a grouping comparator, but in this case we want all the keys to go into a single reducer, so we ignore the key differences in the grouping comparator.
In the reducer, just take the first value and exit.
Details of secondary sort : Secondary Sort
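A rough sketch of the two comparators (assuming, purely for illustration, that the composite key is serialized as a Text of the form "Name<TAB>Marks"; in a real job you would more likely write a custom WritableComparable):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sort comparator: order composite "name\tmarks" keys by marks, descending.
// (Each class goes in its own source file.)
public class MarksDescendingComparator extends WritableComparator {
    protected MarksDescendingComparator() { super(Text.class, true); }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        int marksA = Integer.parseInt(a.toString().split("\t")[1]);
        int marksB = Integer.parseInt(b.toString().split("\t")[1]);
        return Integer.compare(marksB, marksA);   // higher marks first
    }
}

// Grouping comparator: treat every key as equal so all records reach one reduce() call.
public class SingleGroupComparator extends WritableComparator {
    protected SingleGroupComparator() { super(Text.class, true); }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return 0;
    }
}

// Driver wiring (sketch):
// job.setSortComparatorClass(MarksDescendingComparator.class);
// job.setGroupingComparatorClass(SingleGroupComparator.class);
// job.setNumReduceTasks(1);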
You can map all input tuples to the same key, with a value being the same as each input tuple, like (the-one-key, (Ben, 20)), and use a reduce function that returns only the tuple that has the maximum grade (since there is only one key).
To make sure that MR parallelism kicks in, using a combiner with the same function as the reducer (above) should do the trick. That way, the reducer will only get one tuple from each mapper and will have less work to do.
Edit: even better, you can already eliminate all but the max in the map function itself to get the best performance (see Venkat's remark that combiners are not guaranteed to be used); a sketch of this follows the example below.
Example with two mappers:
Paul 90
Ben 20
Cook 80
Joe 85
Mapped to:
Mapper 1
(the-one-key, (Paul, 90))
(the-one-key, (Ben, 20))
Mapper 2
(the-one-key, (Cook, 80))
(the-one-key, (Joe, 85))
Combined to (still on the mappers' side):
Mapper 1
(the-one-key, (Paul, 90))
Mapper 2
(the-one-key, (Joe, 85))
Reduced to:
(the-one-key, (Paul, 90))
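A minimal sketch of that mapper-side elimination (in-mapper combining: track the local max and emit it once in cleanup(); the class name and the assumption of whitespace-separated "Name Marks" lines are mine):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps only the local maximum per mapper and emits it once, in cleanup(),
// so each mapper sends exactly one record to the single reducer.
class MaxMarksMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String bestLine = null;
    private int bestMarks = Integer.MIN_VALUE;

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        String[] parts = line.toString().trim().split("\\s+");   // e.g. "Paul 90"
        int marks = Integer.parseInt(parts[1]);
        if (marks > bestMarks) {
            bestMarks = marks;
            bestLine = line.toString();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        if (bestLine != null) {
            context.write(new Text("the-one-key"), new Text(bestLine));
        }
    }
}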
A final remark: MapReduce may be "too much" for this if you have a small data set. A simple scan in local memory would be faster if you only have a few hundred or a few thousand values.
Take a look at the following code at gist:
https://gist.github.com/meshekhar/6dd773abf2af6ff631054facab885bf3
In the mapper, the data gets mapped to key-value pairs:
key: "Paul 90"
key: "Ben 20"
key: "Cook 80"
key: "Joe 85"
In the reducer, iterating through all the records with a while loop, each value is split into name and marks, and the maximum marks are kept in a temporary variable.
At the end, the maximum value and the corresponding name are emitted, e.g. Paul 90.
I tested it on a single-node system with more than 1 million records; it takes less than 10 seconds.

algorithm: is there a map-reduce way to merge a group of sets by deleting all the subsets

The problem is: Suppose we have a group of Sets: Set(1,2,3) Set(1,2,3,4) Set(4,5,6) Set(1,2,3,4,6), we need to delete all the subsets and finally get the Result: Set(4,5,6) Set(1,2,3,4,6). (Since both Set(1,2,3) and Set(1,2,3,4) are the subsets of Set(1,2,3,4,6), both are removed.)
And suppose that the elements of the set have order, which can be Int, Char, etc.
Is it possible to do it in a map-reduce way?
The reason to do it in a map-reduce way is that sometimes the group of Sets is very large, which makes it impossible to handle in the memory of a single machine. So we hope to do it in a map-reduce way; it may not be very efficient, but it just needs to work.
My problem is:
I don't know how to define a key for the key-value pair in the map-reduce process to group Sets properly.
I don't know when the process should finish, i.e., when all the subsets have been removed.
EDIT:
The size of the data will keep growing larger in the future.
The input can be either a group of sets or multiple lines, with each line containing a group of sets. Currently the input is val data = RDD[Set]; I first do data.collect(), which results in the overall group of sets. But I can modify the generation of the input into an RDD[Array[Set]], which will give me multiple lines with each line containing a group of sets.
The elements in each set can be sorted by modifying other parts of the program.
I doubt this can be done by a traditional map-reduce technique which is essentially a divide-and-conquer method. This is because:
in this problem each set has to essentially be compared to all of the sets of larger cardinality whose min and max elements lie around the min and max of the smaller set.
unlike sorting and other problems amenable to map-reduce, we don't have a transitivity relation, i.e., if A is not-a-subset-of B and B is-not-subset-of C, we cannot make any statement about A w.r.t. C.
Based on the above observations this problem seems to be similar to duplicate detection and there is research on duplicate detection, for example here. Similar techniques will work well for the current problem.
Since subset-of is a transitive relation (proof), you could take advantage of that and design an iterative algorithm that eliminates subsets in each iteration.
The logic is the following:
Mapper:
eliminate local subsets and emit only the supersets. Let the key be the first element of each superset.
Reducer:
eliminate local subsets and emit only the supersets.
You could also use a combiner with the same logic.
Each time, the number of reducers should decrease, until, in the last iteration, a single reducer is used. This way, you can define the number of iterations from the beginning. E.g., by initially setting 8 reducers, and each time using half of them in the next iteration, your program will terminate after 4 iterations (8 reducers, then 4, then 2, and then 1). In general, it will terminate in log n + 1 iterations (log base 2), where n is the initial number of reducers, so n should be a power of 2 and, of course, less than the number of mappers. If this feels restrictive, you can think of more drastic decreases in the number of reducers (e.g., decrease by 1/4, or more).
Regarding the choice of the key, this can create balancing issues, if, for example, most of the sets start with the same element. So, perhaps you could also make use of other keys, or define a partitioner to better balance the load. This policy makes sure, though, that sets that are equal will be eliminated as early as possible.
If you have MapReduce v.2, you could implement the aforementioned logic like this (pseudocode):
Mapper:

Set<Set> superSets;

setup() {
    superSets = new HashSet<>();
}

map(inputSet) {
    Set<Set> subsumed = new HashSet<>();
    for (Set superSet : superSets) {
        if (superSet.containsAll(inputSet)) {
            return;                      // inputSet is a subset of a set we already keep
        }
        if (inputSet.containsAll(superSet)) {
            subsumed.add(superSet);      // an already-kept set is a subset of inputSet
        }
    }
    superSets.removeAll(subsumed);
    superSets.add(inputSet);
}

cleanup() {
    for (Set superSet : superSets) {
        context.write(superSet.iterator().next(), superSet);
    }
}
You can use the same code in the reducer and in the combiner.
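For completeness, a sketch of a driver that wires these rounds together, halving the reducer count each round (the paths, job names and the SupersetMapper/SupersetReducer classes are hypothetical placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Runs the subset-elimination job repeatedly, halving the number of reducers each
// round, until the last round uses a single reducer.
public class SupersetDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        int reducers = 8;                       // power of two, so halving ends at 1
        int round = 0;

        while (true) {
            Path output = new Path(args[1] + "/round-" + round);
            Job job = Job.getInstance(conf, "superset-round-" + round);
            job.setJarByClass(SupersetDriver.class);
            job.setMapperClass(SupersetMapper.class);       // hypothetical mapper
            job.setCombinerClass(SupersetReducer.class);    // same logic as the reducer
            job.setReducerClass(SupersetReducer.class);
            job.setOutputKeyClass(Text.class);              // e.g. first element of the set
            job.setOutputValueClass(Text.class);            // e.g. the serialized set
            job.setNumReduceTasks(reducers);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) System.exit(1);

            if (reducers == 1) break;           // the single-reducer round was the last one
            input = output;                     // this round's output feeds the next round
            reducers /= 2;
            round++;
        }
    }
}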
As a final note, I doubt that MapReduce is the right environment for this kind of computations. Perhaps Apache Spark, or Apache Flink offer some better alternatives.
If I understand correctly:
your goal is to detect and remove subsets within a large collection of sets
there are too many sets to manage all at once (memory limit)
the strategy is map and reduce (or some sort of it)
What I take into account:
the main problem is that you cannot manage everything at the same time
the usual map/reduce method supposes you split the data and treat each part separately; that cannot be done exactly like that here
(because each subset can intersect with every other subset).
If I do some calculations:
suppose you have a large collection: 1,000,000 sets of 3 to 20 numbers drawn from 1 to 100.
you would have to compare on the order of 1,000 billion pairs of sets
Even with 100,000 sets (10 billion pairs), it takes too much time (I stopped it).
What I propose (tested with 100,000 sets):
1. Define a criterion to split the collection into smaller compatible packages. Compatible packages are groups of sets such that, if one set is a subset of another, both are guaranteed to land in at least one common package: then you are sure to find the subsets to remove with this method. Said differently: if set A is a subset of set B, then A and B will both reside in one (or several) such packages.
I simply take: every set that contains one given element (1, 2, 3, ...) => this gives approximately 11,500 sets per package under the above assumptions.
It becomes reasonable to compare them (about 120,000 comparisons).
It takes 180 seconds on my machine, and it found 900 subsets to remove.
You have to do it 100 times (so 18,000 seconds).
And of course you can find duplicates (but not too many: a few percent, and the goal is to eliminate them).
2. At the end it is easy and fast to agglomerate the results. The duplicate work is light.
3. Bigger filters:
with a filter of 2 elements, you reduce each package to 1,475 sets => you get approximately 30 sets to delete; it takes 2-3 seconds
but you have to do that 10,000 times
Interest of this method:
the selection of sets by the criterion is linear and very simple. It is also hierarchical:
split on one element, then on a second, and so on.
it is stateless: you can filter millions of sets, and you only have to keep the good ones. The more data you have,
the more filtering you have to do => the solution is scalable.
if you want to treat small clusters, you can take 3, 4 elements in common, etc.
in this way, you can spread your treatment among multiple machines (as many as you have).
At the end, you have to reconcile all your data/deletions.
This solution does not save a lot of time overall (you can do the calculations), but it suits the need of splitting the work.
Hope it helps.

Why split points are out of order on Hadoop total order partitioner?

I use Hadoop total order partitioner and random sampler as input sampler.
But when I increase my slave nodes and reduce tasks to 8, I get the following error:
Caused by: java.io.IOException: Split points are out of order
I don't know the reason for this error.
How should I set the three parameters of the InputSampler.RandomSampler function?
Two possible problems
You have duplicate keys
You are using a different comparator for the input sampler and the task on which you are running the total order partitioner
You can diagnose this by downloading the partition file and examining its contents. The partitions file is the value of total.order.partitioner.path if it is set or _partition.lst otherwise. If your keys are text, you can run hdfs dfs -text path_to_partition_file | less to get a look. This may also work for other key types, but I haven't tried it.
If there are duplicate lines in the partition file, you have duplicate keys, otherwise you're probably using the wrong comparator.
How to fix
Duplicate Keys
My best guess is that your keys are so unbalanced that an even division of records among partitions is generating partitions with identical split points.
To solve this you have several options:
Choose a value to use as a key that better distinguishes your inputs (probably not possible, but much better if you can)
Use fewer partitions and reducers (not as scalable or as certain as the next solution, but simpler to implement, especially if you have only a few duplicates). Divide the original number of partitions by the largest number of duplicate entries. For example, if your partition key file lists a, a, b, c, c, c, d, e as split points, then you have 9 reducers (8 split points) and a maximum of 3 duplicates, so use 3 reducers (3 = floor(9/3)); if your sampling is good, you'll probably end up with proper split points. For complete stability you'd need to be able to re-run the partition step whenever it produces duplicate entries, to guard against the occasional over-sampling of the unbalanced keys, but at that level of complexity you may as well look into the next solution.
Read the partitions file, rewrite it without duplicate entries, count the number of entries (call it num_non_duplicates) and use num_non_duplicates+1 reducers. The reducers with the duplicated keys will have much more work than the other reducers and run longer. If the reduce operation is commutative and associative, you may be able to mitigate this by using combiners.
Using the wrong comparator
Make sure you have mapred.output.key.comparator.class set identically in both the call to writePartitionFile and the job using TotalOrderPartitioner
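A sketch of that wiring in the new API, also showing the three RandomSampler parameters asked about in the question (sampling frequency, total number of samples, and the maximum number of splits to sample); the job name, partition file path and the choice of Text keys are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSetup {
    // Configures sampler, partition file and comparator consistently for one job.
    // (Input format and input paths must already be set on the job before sampling.)
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "total-order-sort");
        job.setNumReduceTasks(8);
        job.setMapOutputKeyClass(Text.class);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        // Use the SAME comparator class here and when the partition file is written.
        job.setSortComparatorClass(Text.Comparator.class);

        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/_partition.lst"));              // placeholder path

        // RandomSampler(freq, numSamples, maxSplitsSampled):
        //   freq             - probability of keeping each key encountered while sampling
        //   numSamples       - total number of sampled keys to collect
        //   maxSplitsSampled - how many input splits to read samples from
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);
        return job;
    }
}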
Extra stuff you don't need to read but might enjoy:
The Split points are out of order error message comes from the code:
RawComparator<K> comparator =
    (RawComparator<K>) job.getOutputKeyComparator();
for (int i = 0; i < splitPoints.length - 1; ++i) {
    if (comparator.compare(splitPoints[i], splitPoints[i+1]) >= 0) {
        throw new IOException("Split points are out of order");
    }
}
The line comparator.compare(splitPoints[i], splitPoints[i+1]) >= 0 means that a pair of split points is rejected if they are either identical or out-of-order.
1 or 2 reducers will never generate this error since there can't be more than 1 split point and the loop will never execute.
Are you sure you are generating enough keys?
From the javadoc: TotalOrderPartitioner
The input file must be sorted with the same comparator and contain
JobContextImpl.getNumReduceTasks() - 1 keys.

Hadoop. Reducing result to the single value

I started learning Hadoop, and am a bit confused by MapReduce. For tasks where the result is natively a list of key-value pairs everything seems clear. But I don't understand how I should solve tasks where the result is a single value (say, the sum of squared input decimals, or the centre of mass of input points).
On the one hand, I could send all the mapper results to the same key. But as far as I understand, in that case the only reducer will process the whole data set (calculating the sum, or the mean coordinates), which doesn't look like a good solution.
Another approach I can imagine is to group the mapper results. Say, the mapper that processed examples 0-999 will produce key 0, the one that processed 1000-1999 will produce key 1, and so on. Since there will still be multiple reducer results, it would be necessary to build a chain of reducers (repeating the reduction until only one result remains). That looks much more computationally efficient, but a bit complicated.
I still hope that Hadoop has an off-the-shelf tool that executes this superposition of reducers to maximise the efficiency of reducing the whole data set to a single value, although I have failed to find one.
What is the best practice for solving tasks where the result is a single value?
If you are able to reformulate your task in terms of a commutative reduce, you should look at Combiners. Either way, take a look at them; they can significantly reduce the amount of data to shuffle.
From my point of view, you are tackling the problem from the wrong angle.
Consider the problem where you need to sum the squares of your input; let's assume you have many large text input files, each consisting of one number per line.
Then ideally you want to parallelize the summing in the mappers and then just sum up the sums in the reducer.
e.g.:
map: (input "x", temporary sum "s") -> s += (x*x)
At the end of the map phase, each mapper emits its temporary sum under a single global key.
In the reduce stage, you basically get all the sums from your mappers and sum them up. Note that this is fairly small (n times a single number, where n is the number of mappers) in relation to your huge input files, and therefore a single reducer is really not a scalability bottleneck.
You want to cut down the communication cost between the mapper and the reducer, not proxy all your data to a single reducer and read through it there; that would not parallelize anything.
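A minimal sketch of that mapper-side accumulation (the class name is a placeholder; one number per input line is assumed):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

// Squares each number, accumulates locally, and emits a single
// (global key, partial sum) pair in cleanup().
class SumOfSquaresMapper extends Mapper<LongWritable, Text, NullWritable, DoubleWritable> {
    private double partialSum = 0.0;

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        double x = Double.parseDouble(line.toString().trim());
        partialSum += x * x;
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(NullWritable.get(), new DoubleWritable(partialSum));
    }
}

// The single reducer then only adds up one partial sum per mapper.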
I think your analysis of the specific use cases you bring up is spot on. These use cases still fall into a rather inclusive scope of what you can do with Hadoop, and there are certainly other things that Hadoop just wasn't designed to handle. If I had to solve the same problem, I would follow your first approach unless I knew the data was too big, in which case I'd follow your two-step approach.

MapReduce - how do I calculate relative values (average, top k and so on)?

I'm looking for a way to calculate "global" or "relative" values during a MapReduce process - an average, sum, top etc. Say I have a list of workers, with their IDs associated with their salaries (and a bunch of other stuff). At some stage of the processing, I'd like to know who are the workers who earn the top 10% of salaries. For that I need some "global" view of the values, which I can't figure out.
If I have all values sent into a single reducer, it has that global view, but then I lose concurrency, and it seems awkward. Is there a better way?
(The framework I'd like to use is Google's, but I'm trying to figure out the technique - no framework specific tricks please)
My first thought is to do something like this:
MAP: Use some dummy value as the key, maybe the empty string for efficiency, and create a class that holds both a salary and an employee ID. In each mapper, create an array that holds 10 elements. Fill it up with the first ten salaries you see, sorted (so location 0 is the highest salary and location 9 is the 10th highest). For every salary after that, see if it is in the top ten and, if it is, insert it in the correct location and then move the lower salaries down, as appropriate.
Combiner/Reducer: merge-sort the lists. I'd basically do the same thing as in the mapper by creating a ten-element array and then looping over all the arrays that match the key, merging them in according to the same compare/insert/shift-down sequence as in the mapper.
If you run this with one reducer, it should ensure that the top 10 salaries are output.
I don't see a way to do this while using more than one reducer. If you use a combiner, then the reducer should only have to merge a ten-element array for each node that ran mappers (which should be manageable unless you're running on thousands of nodes).
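A sketch of that mapper-side top-10 (using a min-heap instead of a hand-maintained array; the class name and the assumption of tab-separated "employeeId<TAB>salary" input are mine):

import java.io.IOException;
import java.util.PriorityQueue;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps the ten highest salaries seen by this mapper and emits them once in
// cleanup() under a dummy key, so the reducer only merges small lists.
class Top10Mapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int K = 10;
    // The smallest of the current top-K sits at the head, so it is cheap to evict.
    private final PriorityQueue<long[]> topK =
            new PriorityQueue<>((a, b) -> Long.compare(a[1], b[1]));

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        String[] parts = line.toString().split("\t");      // assumed: employeeId <TAB> salary
        long id = Long.parseLong(parts[0]);
        long salary = Long.parseLong(parts[1]);
        topK.offer(new long[] { id, salary });
        if (topK.size() > K) {
            topK.poll();                                    // drop the smallest
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (long[] entry : topK) {
            context.write(new Text(""), new Text(entry[0] + "\t" + entry[1]));
        }
    }
}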
[Edit: I misunderstood. Update for the Top 10% ]
For doing something that relates to the "total", there is no other way than to determine the total first and then do the calculations.
So the "Top 10% salaries" could be done roughly as follows:
Determine total:
MAP: Identity
REDUCE: Aggregate the info of all records that pass through and create a new "special" record with the "total". Note that you would like this step to scale.
This can also be done by letting the MAP output 2 records (data, total) and the reducer then only touches the "total" records by aggregating.
Use total:
MAP: Prepare for SecondarySort
SHUFFLE/SORT: Sort the records so that the records with the "total" are first into the reducer.
REDUCE: Depending on your implementation the reducer may get a bunch of these total records (aggregate them) and for all subsequent records determine where they are in relation to everything else.
The biggest question with this kind of processing is: will it scale?
Remember you are breaking the biggest "must have" for scaling out: Independent chunks of information. This makes them dependent around the "total" value.
I expect a technically different way of making the "total" value available to the second step is essential in making this work on "big data".
The "Hadoop - The definitive Guide" book by Tom White has a very good chapter on Secondary Sort.
I would do something like this
The mapper will use an UUID as part of the key, created in the setup() method of the mapper. The mapper emits as key, the UUID appended with either 0 or the salary. The mapper accumulates the count and total.
In the cleanup() method, the mapper emits UUID appended with 0 as the key and the count and total as the value. In the map() method, the mapper emits the UUID appended with salary as the key and salary as the value.
Since the keys are sorted, the first call to the combiner will have the count and total as the value. The combiner could store them as class members. We could also work out what 10% of the total count is and save that as a class member as well (call it top). We initialize a list and save it as a class member.
Subsequent calls to combiner will contain the salary as the value, arriving in sorted order. We add the value to the list and at the same time increment a counter. When the counter reaches the value top, we don't store any more values in our list. We ignore values in rest of the combiner calls.
In the combiner cleanup(), we do the emit. The combiner will emit only the UUID as the key. The value will contain count and total followed by the top 10% of the values. So the output of the combiner will have partial results, based on the subset of the data that passed through the mapper.
The reducer will be called as many times as the number of mappers in this case, because each mapper/combiner emits only one key.
The reducer will accumulate the counts, totals and the top 10% values in the reduce() method. In the cleanup() method, the average is calculated. The top 10% is also calculated in the cleanup() method from the aggregation of the top-10% lists arriving in each call of the reducer. This is basically a merge sort.
The reducer cleanup() method could do multiple emits, so that average is in the first row, followed by the top 10% of salaries in the subsequent rows.
Finally, to ensure that final aggregate statistics are global, you have to set the number of reducers to one.
Since there is data accumulation and sorting in the reducer, although on partial data set, there may be memory issues.
