Computing median in map reduce - hadoop

Can someone example the computation of median/quantiles in map reduce?
My understanding of Datafu's median is that the 'n' mappers sort the
data and send the data to "1" reducer which is responsible for sorting
all the data from n mappers and finding the median(middle value)
Is my understanding correct?,
if so, does this approach scale for
massive amounts of data as i can clearly see the one single reducer
struggling to do the final task.
Thanks

Trying to find the median (middle number) in a series is going to require that 1 reducer is passed the entire range of numbers to determine which is the 'middle' value.
Depending on the range and uniqueness of values in your input set, you could introduce a combiner to output the frequency of each value - reducing the number of map outputs sent to your single reducer. Your reducer can then consume the sort value / frequency pairs to identify the median.
Another way you could scale this (again if you know the range and rough distribution of values) is to use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 2, and so on). This will however require some secondary job to examine the reducer outputs and perform the final median calculation (knowing for example the number of keys in each reducer, you can calculate which reducer output will contain the median, and at which offset)

Do you really need the exact median and quantiles?
A lot of the time, you are better off with just getting approximate values, and working with them, in particular if you use this for e.g. data partitioning.
In fact, you can use the approximate quantiles to speed up finding the exact quantiles (actually in O(n/p) time), here is a rough outline of the strategy:
Have a mapper for each partition compute the desired quantiles, and output them to a new data set. This data set should be several order of magnitues smaller (unless you ask for too many quantiles!)
Within this data set, compute the quantiles again, similar to "median of medians". These are your initial estimates.
Repartition the data according to these quantiles (or even additional partitions obtained this way). The goal is that in the end, the true quantile is guaranteed to be in one partition, and there should be at most one of the desired quantiles in each partition
Within each of the partitions, perform a QuickSelect (in O(n)) to find the true quantile.
Each of the steps is in linear time. The most costly step is part 3, as it will require the whole data set to be redistributed, so it generates O(n) network traffic.
You can probably optimize the process by choosing "alternate" quantiles for the first iteration. Say, you want to find the global median. You can't find it in a linear process easily, but you can probably narrow it down to 1/kth of the data set, when it is split into k partitions. So instead of having each node report its median, have each node additionally report the objects at (k-1)/(2k) and (k+1)/(2k). This should allow you to narrow down the range of values where the true median must lie signficantly. So in the next step, you can each node send those objects that are within the desired range to a single master node, and choose the median within this range only.

O((n log n)/p) to sort it then O(1) to get the median.
Yes... you can get O(n/p) but you can't use the out of the box sort functionality in Hadoop. I would just sort and get the center item unless you can justify the 2-20 hours of development time to code the parallel kth largest algorithm.

In many real-world scenarios, the cardinality of values in a dataset will be relatively small. In such cases, the problem can be efficiently solved with two MapReduce jobs:
Calculate frequencies of values in your dataset (Word Count job, basically)
Identity mapper + a reducer which calculates median based on < value - frequency> pairs
Job 1. will drastically reduce the amount of data and can be executed fully in parallel. Reducer of job 2. will only have to process n (n = cardinality of your value set) items instead of all values, as with the naive approach.
Below, an example reducer of the job 2. It's is python script that could be used directly in Hadoop streaming. Assumes values in your dataset are ints, but can be easily adopted for doubles
import sys
item_to_index_range = []
total_count = 0
# Store in memory a mapping of a value to the range of indexes it has in a sorted list of all values
for line in sys.stdin:
item, count = line.strip().split("\t", 1)
new_total_count = total_count + int(count)
item_to_index_range.append((item, (total_count + 1, new_total_count + 1)))
total_count = new_total_count
# Calculate index(es) of middle items
middle_items_indexes = [(total_count / 2) + 1]
if total_count % 2 == 0:
middle_items_indexes += [total_count / 2]
# Retrieve middle item(s)
middle_items = []
for i in middle_items_indexes:
for item, index_range in item_to_index_range:
if i in range(*index_range):
middle_items.append(item)
continue
print sum(middle_items) / float(len(middle_items))
This answer builds up on top of a suggestion initially coming from the answer of Chris White. The answer suggests using a combiner as a mean to calculate frequencies of values. However, in MapReduce, combiners are not guaranteed to be always executed. This has some side effects:
reducer will first have to compute final < value - frequency > pairs and then calculate median.
In the worst case scenario, combiners will never be executed and the reducer will still have to struggle with processing all individual values

Related

Select and filter algorithm

I would like to select the top n values from a dataset, but ignore elements based on what I have already selected - i.e., given a set of points (x,y), I would like to select the top 100 values of x (which are all distinct) but not select any points such that y equals the y of any already-selected point. I would like to make sure that the highest values of x are prioritized.
Is there any existing algorithm for this, or at least similar ones? I have a huge amount of data and would like to do this as efficiently as possible. Memory is not as much of a concern.
You can do this in O(n log k) time where n is the number of values in the dataset and k are the number of top values you'd like to get.
Store the values you wish to exclude in a hash table.
Make an empty min-heap.
Iterate over all of the values and for each value:
If it is in the hash table skip it.
If the heap contains fewer than k values, add it to the heap.
If the heap contains >=k values, if the value you're looking at is greater than the smallest member of the minheap, pop that value and add the new one.
I will share my thoughts and since the author still has not specified the scope of data to be processed, I will assume that it is too large to be handled by a single machine and I will also assume that the author is familiar with Hadoop.
So I would suggest using the MapReduce as follows:
Mappers simply emit pairs (x,y)
Combiners select k pairs with largest values of x (k=100 in this case) in the meantime maintaining the unique y's in the hashset to avoid duplicates, then emit k pairs found.
There should be only one reducer in this job since it has to get all pairs from combiners to finalize the job by selecting k pairs for the last time. Reducer's implementation is identical to combiner.
The number of combiners should be selected considering memory resources needed to select top k pairs out of incoming dataset since whichever method is used (sorting, heap or anything else) it is going to be done in-memory, as well as keeping that hashset with unique y's

hash table about the load factor

I'm studying about hash table for algorithm class and I became confused with the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table when all of the elements are stored in a single slot?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table returns below some fixed amount of steps.
If you have a hash table with m buckets and you add elements indefinitely (i.e. n>>m), then also the size of the lists will grow and you can't guarantee that expected constant time for look ups, but you will rather get linear time (since the running time you need to traverse the ever increasing linked lists will outweigh the lookup for the bucket).
So, how can we achieve that the lists don't grow? Well, you have to make sure that the length of the list is bounded by some fixed constant - how we do that? Well, we have to add additional buckets.
If the hash table is well implemented, then the hash function being used to map the elements to buckets, should distribute the elements evenly across the buckets. If the hash function does this, then the length of the lists will be roughly the same.
How long is one of the lists if the elements are distributed evenly? Clearly we'll have total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
Hence, to ensure constant time look up, what we have to do is keep track of the load factor (again: expected length of the lists) such that, when it goes above the fixed constant we can add additional buckets.
Of course, there are more problems which come in, such as how to redistribute the elements you already stored or how many buckets should you add.
The important message to take away, is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, then the average length of each list won't be worth much. All this stuff only makes sense, if you distribute evenly across the buckets.
*Note the expected - I can't emphasize this enough. Its typical to hear "hash table have constant look up time". They do not! Worst case is always O(n) and you can't make that go away.
Adding to the existing answers, let me just put in a quick derivation.
Consider a arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the ith element is inserted into this element and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... E[X_n]
Now we need to find the value of E[X_i]. This is simply (1/m) 1 + (1 - (1/m) 0) = 1/m by the definition of expected values. So summing up the values for all i's, we get 1/m + 1/m + 1/m n times. This equals n/m. We have just found out the expected number of elements inserted into a random bucket and this is the load factor.

How to efficiently find top-k elements?

I have a big sequence file storing the tfidf values for documents. Each line represents line and the columns are the value of tfidfs for each term (the row is a sparse vector). I'd like to pick the top-k words for each document using Hadoop. The naive solution is to loop through all the columns for each row in the mapper and pick the top-k but as the file becomes bigger and bigger I don't think this is a good solution. Is there a better way to do that in Hadoop?
1. In every map calculate TopK (this is local top K for each map)
2. Spawn a signle reduce , now top K from all mappers will flow to this reducer and hence global Top K will be evaluated.
Think of the problem as
1. You have been given the results of X number of horse races.
2. You need to find Top N fastest horse.

incremental way of counting quantiles for large set of data

I need to count the quantiles for a large set of data.
Let's assume we can get the data only through some portions (i.e. one row of a large matrix). To count the Q3 quantile one need to get all the portions of the data and store it somewhere, then sort it and count the quantile:
List<double> allData = new List<double>();
// This is only an example; the portions of data are not really rows of some matrix
foreach(var row in matrix)
{
allData.AddRange(row);
}
allData.Sort();
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];
I would like to find a way of obtaining the quantile without storing the data in an intermediate variable. The best solution would be to count some parameters of mid-results for first row and then adjust it step by step for next rows.
Note:
These datasets are really big (ca 5000 elements in each row)
The Q3 can be estimated, it doesn't have to be an exact value.
I call the portions of data "rows", but they can have different leghts! Usually it varies not so much (+/- few hundred samples) but it varies!
This question is similar to “On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis, but I need to count quantiles.
ALso there are few articles in this topic, i.e.:
An Efficient Algorithm for the Approximate Median Selection Problem
Incremental quantile estimation for massive tracking
Before trying to implement these approaches, I wondered if there are maybe any other, quicker ways of counting the 0.25/0.75 quantiles?
I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.
If you can't do random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution, then while working through your data if any bucket (typically the first or last bucket) gets overfull, start over again with a new bucket range.
There is a more recent and much simpler algorithm for this that provides very good estimates of the extreme quantiles.
The basic idea is that smaller bins are used at the extremes in a way that both bounds the size of the data structure and guarantees higher accuracy for small or large q. The algorithm is available in several languages and many packages. The MergingDigest version requires no dynamic allocation ... once the MergingDigest is instantiated, no further heap allocation is required.
See https://github.com/tdunning/t-digest
Only retrieve the data you really need -- i.e., whatever value(s) is/are being used as the key for sorting, not everything else associated with it.
You can probably use Tony Hoare's Select algorithm to find your quantile more quickly than sorting all the data.
If your data has a Gaussian distribution, you can estimate the quantiles from the standard deviation. I assume your data isn't Gaussian distributed or you'd just be using the SD anyway.
If you can pass through your data twice, I'd do the following:
First pass, compute the max, min, SD and mean.
Second pass, divide the range [min,max] into some number of buckets (e.g. 100); do the same for (mean - 2*SD,mean + 2*SD) (with extra buckets for outliers). Then run through the data again, tossing numbers into these buckets.
Count buckets until you are at 25% and 75% of the data. If you want to get extra-fancy, you can interpolate between bucket values. (I.e. if you need 10% of a bucket to hit your 25th quantile, assume the value is 10% of the way from the low bound to the upper bound.)
This should give you a pretty good linear-time algorithm that works okay for most sets of not-entirely-perverse data.
Inspired by this answer I created a method that estimates the quantiles quite good. It is approximation close enough for my purposes.
The idea is following: the 0.75 quantile is in fact a median of all values that lies above the global median. And respectively, 0.25 quantile is a median of all values below the global median.
So if we can approximate the median, we can in similar way approximate the quantiles.
double median = 0;
double q1 = 0;
double q3 = 0;
double eta = 0.005;
foreach( var value in listOfValues) // or stream, or any other large set of data...
{
median += eta * Math.Sign(p.Int - median);
}
// Second pass. We know the median, so we can count the quantiles.
foreach(var value in listOfValues)
{
if(p.Int < median)
q1 += eta*Math.Sign(p.Int - q1);
else
q3 += eta*Math.Sign(p.Int - q3);
}
Remarks:
If distribution of your data is strange, you will need to have bigger eta in order to fit to the strange data. But the accuracy will be worse.
If the distribution is strange, but you know the total size of your collection (i.e. N) you can adjust the eta parameter in this way: at the beggining set the eta to be almost equal some large value (i.e. 0.2). As the loop passes, lower the value of eta so when you reach almost the end of the collection, the eta will be almost equal 0 (for example, in loop compute it like that: eta = 0.2 - 0.2*(i/N);
q-digest is an approximate online algorithm that lets you compute quantile: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
Here is an implementation:
https://github.com/airlift/airlift/blob/master/stats/src/main/java/io/airlift/stats/QuantileDigest.java

How to calculate or approximate the median of a list without storing the list

I'm trying to calculate the median of a set of values, but I don't want to store all the values as that could blow memory requirements. Is there a way of calculating or approximating the median without storing and sorting all the individual values?
Ideally I would like to write my code a bit like the following
var medianCalculator = new MedianCalculator();
foreach (var value in SourceData)
{
medianCalculator.Add(value);
}
Console.WriteLine("The median is: {0}", medianCalculator.Median);
All I need is the actual MedianCalculator code!
Update: Some people have asked if the values I'm trying to calculate the median for have known properties. The answer is yes. One value is in 0.5 increments from about -25 to -0.5. The other is also in 0.5 increments from -120 to -60. I guess this means I can use some form of histogram for each value.
Thanks
Nick
If the values are discrete and the number of distinct values isn't too high, you could just accumulate the number of times each value occurs in a histogram, then find the median from the histogram counts (just add up counts from the top and bottom of the histogram until you reach the middle). Or if they're continuous values, you could distribute them into bins - that wouldn't tell you the exact median but it would give you a range, and if you need to know more precisely you could iterate over the list again, examining only the elements in the central bin.
There is the 'remedian' statistic. It works by first setting up k arrays, each of length b. Data values are fed in to the first array and, when this is full, the median is calculated and stored in the first pos of the next array, after which the first array is re-used. When the second array is full the median of its values is stored in the first pos of the third array, etc. etc. You get the idea :)
It's simple and pretty robust. The reference is here...
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf
Hope this helps
Michael
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
Also, I modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
I would love to see an incremental mode estimator of a similar form...
(Note: I also posted this to a similar topic here: "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?)
Here is a crazy approach that you might try. This is a classical problem in streaming algorithms. The rules are
You have limited memory, say O(log n) where n is the number of items you want
You can look at each item once and make a decision then and there what to do with it, if you store it, it costs memory, if you throw it away it is gone forever.
The idea for the finding a median is simple. Sample O(1 / a^2 * log(1 / p)) * log(n) elements from the list at random, you can do this via reservoir sampling (see a previous question). Now simply return the median from your sampled elements, using a classical method.
The guarantee is that the index of the item returned will be (1 +/- a) / 2 with probability at least 1-p. So there is a probability p of failing, you can choose it by sampling more elements. And it wont return the median or guarantee that the value of the item returned is anywhere close to the median, just that when you sort the list the item returned will be close to the half of the list.
This algorithm uses O(log n) additional space and runs in Linear time.
This is tricky to get right in general, especially to handle degenerate series that are already sorted, or have a bunch of values at the "start" of the list but the end of the list has values in a different range.
The basic idea of making a histogram is most promising. This lets you accumulate distribution information and answer queries (like median) from it. The median will be approximate since you obviously don't store all values. The storage space is fixed so it will work with whatever length sequence you have.
But you can't just build a histogram from say the first 100 values and use that histogram continually.. the changing data may make that histogram invalid. So you need a dynamic histogram that can change its range and bins on the fly.
Make a structure which has N bins. You'll store the X value of each slot transition (N+1 values total) as well as the population of the bin.
Stream in your data. Record the first N+1 values. If the stream ends before this, great, you have all the values loaded and you can find the exact median and return it. Else use the values to define your first histogram. Just sort the values and use those as bin definitions, each bin having a population of 1. It's OK to have dupes (0 width bins).
Now stream in new values. For each one, binary search to find the bin it belongs to.
In the common case, you just increment the population of that bin and continue.
If your sample is beyond the histogram's edges (highest or lowest), just extend the end bin's range to include it.
When your stream is done, you find the median sample value by finding the bin which has equal population on both sides of it, and linearly interpolating the remaining bin-width.
But that's not enough.. you still need to ADAPT the histogram to the data as it's being streamed in. When a bin gets over-full, you're losing information about that bin's sub distribution.
You can fix this by adapting based on some heuristic... The easiest and most robust one is if a bin reaches some certain threshold population (something like 10*v/N where v=# of values seen so far in the stream, and N is the number of bins), you SPLIT that overfull bin. Add a new value at the midpoint of the bin, give each side half of the original bin's population. But now you have too many bins, so you need to DELETE a bin. A good heuristic for that is to find the bin with the smallest product of population and width. Delete it and merge it with its left or right neighbor (whichever one of the neighbors itself has the smallest product of width and population.). Done!
Note that merging or splitting bins loses information, but that's unavoidable.. you only have fixed storage.
This algorithm is nice in that it will deal with all types of input streams and give good results. If you have the luxury of choosing sample order, a random sample is best, since that minimizes splits and merges.
The algorithm also allows you to query any percentile, not just median, since you have a complete distribution estimate.
I use this method in my own code in many places, mostly for debugging logs.. where some stats that you're recording have unknown distribution. With this algorithm you don't need to guess ahead of time.
The downside is the unequal bin widths means you have to do a binary search for each sample, so your net algorithm is O(NlogN).
David's suggestion seems like the most sensible approach for approximating the median.
A running mean for the same problem is a much easier to calculate:
Mn = Mn-1 + ((Vn - Mn-1) / n)
Where Mn is the mean of n values, Mn-1 is the previous mean, and Vn is the new value.
In other words, the new mean is the existing mean plus the difference between the new value and the mean, divided by the number of values.
In code this would look something like:
new_mean = prev_mean + ((value - prev_mean) / count)
though obviously you may want to consider language-specific stuff like floating-point rounding errors etc.
I don't think it is possible to do without having the list in memory. You can obviously approximate with
average if you know that the data is symmetrically distributed
or calculate a proper median of a small subset of data (that fits in memory) - if you know that your data has the same distribution across the sample (e.g. that the first item has the same distribution as the last one)
Find Min and Max of the list containing N items through linear search and name them as HighValue and LowValue
Let MedianIndex = (N+1)/2
1st Order Binary Search:
Repeat the following 4 steps until LowValue < HighValue.
Get MedianValue approximately = ( HighValue + LowValue ) / 2
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
is K = MedianIndex, then return MedianValue
is K > MedianIndex ? then HighValue = MedianValue Else LowValue = MedianValue
It will be faster without consuming memory
2nd Order Binary Search:
LowIndex=1
HighIndex=N
Repeat Following 5 Steps until (LowIndex < HighIndex)
Get Approximate DistrbutionPerUnit=(HighValue-LowValue)/(HighIndex-LowIndex)
Get Approximate MedianValue = LowValue + (MedianIndex-LowIndex) * DistributionPerUnit
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
is (K=MedianIndex) ? return MedianValue
is (K > MedianIndex) ? then HighIndex=K and HighValue=MedianValue Else LowIndex=K and LowValue=MedianValue
It will be faster than 1st order without consuming memory
We can also think of fitting HighValue, LowValue and MedianValue with HighIndex, LowIndex and MedianIndex to a Parabola, and can get ThirdOrder Binary Search which will be faster than 2nd order without consuming memory and so on...
Usually if the input is within a certain range, say 1 to 1 million, it's easy to create an array of counts: read the code for "quantile" and "ibucket" here: http://code.google.com/p/ea-utils/source/browse/trunk/clipper/sam-stats.cpp
This solution can be generalized as an approximation by coercing the input into an integer within some range using a function that you then reverse on the way out: IE: foo.push((int) input/1000000) and quantile(foo)*1000000.
If your input is an arbitrary double precision number, then you've got to autoscale your histogram as values come in that are out of range (see above).
Or you can use the median-triplets method described in this paper: http://web.cs.wpi.edu/~hofri/medsel.pdf
I picked up the idea of iterative quantile calculation. It is important to have a good value for starting point and eta, these may come from mean and sigma. So I programmed this:
Function QuantileIterative(Var x : Array of Double; n : Integer; p, mean, sigma : Double) : Double;
Var eta, quantile,q1, dq : Double;
i : Integer;
Begin
quantile:= mean + 1.25*sigma*(p-0.5);
q1:=quantile;
eta:=0.2*sigma/xy(1+n,0.75); // should not be too large! sets accuracy
For i:=1 to n Do
quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
dq:=abs(q1-quantile);
If dq>eta
then Begin
If dq<3*eta then eta:=eta/4;
For i:=1 to n Do
quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
end;
QuantileIterative:=quantile
end;
As the median for two elements would be the mean, I used a smoothed signum function, and xy() is x^y. Are there ideas to make it better? Of course if we have some more a-priori knowledge we can add code using min and max of the array, skew, etc. For big data you would not use an array perhaps, but for testing it is easier.
On homogeneous random ordered and for big enough list, this pseudo code can work:
# find min on the fly
if minDataPoint > dataPoint:
minDataPoint = dataPoint
# find max on the fly
if maxDataPoint < dataPoint:
maxDataPoint = dataPoint
# estimate median base on the current data
estimate_mid = (maxDataPoint + minDataPoint) / 2
#if **new** dataPoint is closer to the mid? stor it
if abs(midDataPoint - estimate_mid) > abs(dataPoint - estimate_mid):
midDataPoint = dataPoint
Inspired by #lakshmanaraj

Resources