Understanding the median-of-medians algorithm

I've searched around the web and visited the Wikipedia page for the median-of-medians algorithm, but I can't seem to find an explicit answer to my question:
If one has a very, very large list of integers (TBs in size) and wants to find the median of this list in a distributed manner, would breaking the list up into sublists (of varying or equal sizes, it doesn't really matter), computing the medians of those smaller sublists, and then computing the median of those medians result in the median of the original large list?
Furthermore, does this also hold for any of the k-th order statistics? I'd be interested in links to research, etc., in this area.

The answer to your question is no.
If you want to understand how to actually select the k-th order statistic (including the median, of course) in a parallel setting (a distributed setting is of course not really different), take a look at this recent paper, in which I proposed a new algorithm improving the previous state-of-the-art algorithm for parallel selection:
Deterministic parallel selection algorithms on coarse-grained multicomputers
Here, we use two weighted 3-medians as pivots and partition around these pivots using five-way partitioning. We also implemented and tested the algorithm using MPI. The results are very good, taking into account that this is a deterministic algorithm built on the worst-case O(n) selection algorithm. Using the randomized, expected-O(n) QuickSelect algorithm instead yields an extremely fast parallel algorithm.

If one has a very, very large list of integers (TBs in size) and wants to find the median of this list in a distributed manner, would breaking the list up into sublists (of varying or equal sizes, it doesn't really matter), computing the medians of those smaller sublists, and then computing the median of those medians result in the median of the original large list?
No. The actual median of the entire list is not necessarily a median of any of the sublists.
Median-of-medians can give you a good pivot for quickselect, because it is guaranteed to lie reasonably close to the actual median (roughly between the 30th and 70th percentiles), but you would still have to run the rest of the quickselect algorithm to locate the actual median of the larger list.
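A tiny, self-contained counterexample (a Python sketch using the standard statistics module; the numbers are mine, not from the question) makes the "no" concrete:

import statistics

# Fifteen integers split into three sublists of five.
sublists = [
    [1, 2, 3, 4, 10],
    [11, 12, 13, 14, 15],
    [5, 6, 7, 8, 9],
]
full_list = [x for sub in sublists for x in sub]

sub_medians = [statistics.median(sub) for sub in sublists]   # [3, 13, 7]
print(statistics.median(sub_medians))                        # 7  (median of the sublist medians)
print(statistics.median(full_list))                          # 8  (true median of all 15 values)

Even with equal-sized sublists, the median of the sublist medians lands on the wrong value.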

Related

Choosing a Pivot in QuickSort

I was reading about QuickSort, and it appears that, ideally, a randomized algorithm is used for choosing a pivot that gives at least a 25-75 split of the array.
Why can't they calculate the median value of the array and choose the value nearest to the median in every recursive call?
I think it would take the same amount of running time, or maybe even be better than the randomized approach.
Using median of medians, a near-median can be chosen, but the overhead is significant, since it effectively sorts groups of 5. Wiki article:
https://en.wikipedia.org/wiki/Median_of_medians
Note that median of medians can be implemented in place.
As for a random pivot, the code to calculate a random index takes a significant amount of the time spent in a partition step.
A simpler approach is to use the median of the first, middle, and last elements, to avoid worst-case time for already sorted or reverse-sorted data, and, as answered by yeputons, to use introsort, which switches to heap sort (based on the recursion depth) to avoid worst-case time.
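A minimal sketch of the median-of-first-middle-last pivot choice (the helper name median_of_three_index is mine, not from the answer):

def median_of_three_index(a, lo, hi):
    # Return the index of the median of a[lo], a[mid], a[hi] (hi inclusive).
    mid = (lo + hi) // 2
    trio = sorted((lo, mid, hi), key=lambda i: a[i])
    return trio[1]

# Example: pick the pivot for a partition step over a[0 .. len(a) - 1]:
# pivot_index = median_of_three_index(a, 0, len(a) - 1)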
Because calculating the median value takes at least linear time (compared to the constant time required for random selection), and it's not trivial to do in linear time. So even though the asymptotic performance becomes guaranteed, wall-clock performance decreases. I believe it's more practical to guarantee performance in other ways, e.g. by using Introsort.
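A rough sketch of that Introsort fallback idea (my own simplified illustration: the partition is not in-place, and heapq stands in for the heapsort phase):

import heapq
import math

def introsort(a):
    # Quicksort, but fall back to a heap-based sort once recursion gets too deep.
    def heapsort_slice(lo, hi):
        h = a[lo:hi]
        heapq.heapify(h)
        a[lo:hi] = [heapq.heappop(h) for _ in range(len(h))]

    def sort(lo, hi, depth):
        if hi - lo <= 1:
            return
        if depth == 0:
            heapsort_slice(lo, hi)   # depth limit hit: guarantee O(n log n)
            return
        pivot = a[(lo + hi - 1) // 2]
        # Simple three-way partition (not in-place, for clarity only).
        less = [x for x in a[lo:hi] if x < pivot]
        equal = [x for x in a[lo:hi] if x == pivot]
        greater = [x for x in a[lo:hi] if x > pivot]
        a[lo:hi] = less + equal + greater
        sort(lo, lo + len(less), depth - 1)
        sort(hi - len(greater), hi, depth - 1)

    if a:
        sort(0, len(a), 2 * int(math.log2(len(a))) + 1)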

KD-Tree implementation

I'm trying to write my own KD-Tree implementation, and eventually a kNN implementation, and I'm having a bit of difficulty understanding how the KD-Tree constructs the search tree.
On Wikipedia it says that it finds the median of the values and uses that as the root of the tree.
When there are many dimensions, however, how would you compute the median?
You don't find the median in several dimensions (in fact, there is no meaningful order for multidimensional numbers). At every level of the kd-tree, you focus on one dimension. You choose the median based on this dimension, ignoring the other components.
Note that you can use many criteria other than the median, depending on what you want to do. Likewise, selecting a good scheme for deciding the dimension for each node is an art, though virtually every scheme is correct.
It is not required to find the medians; from Wikipedia:
Note also that it is not required to select the median point. In that case, the result is simply that there is no guarantee that the tree will be balanced. A simple heuristic to avoid coding a complex linear-time median-finding algorithm, or using an O(n log n) sort of all n points, is to use a sort to find the median of a fixed number of randomly selected points to serve as the splitting plane. In practice, this technique often results in nicely balanced trees.
KD-Tree from Wikipedia
You can simply sort the points according to one dimension, then choose the median as the root, then recursively construct the subtrees (sorting by another dimension at each level).
here is an implementation:
https://github.com/tavaresdong/cs106l/blob/master/KDTree/src/KDTree.h
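As a rough sketch of that sort-and-split idea (in Python rather than the linked C++ code; names like build_kdtree are mine, and re-sorting at every level is simpler but slower than a linear-time median selection):

from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build_kdtree(points, depth=0, k=2):
    # Split on one axis per level, using the median point along that axis.
    if not points:
        return None
    axis = depth % k                                  # cycle through the k dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                            # median along the current axis
    return Node(point=points[mid],
                axis=axis,
                left=build_kdtree(points[:mid], depth + 1, k),
                right=build_kdtree(points[mid + 1:], depth + 1, k))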

Finding the median of medians of quicksort

I am working on quicksort with the median-of-medians algorithm. I normally use selection sort to get the median of each subarray of 5 elements. However, if there are thousands of subarrays, that means I have to find the median of thousands of medians. I think I cannot use selection sort to find that median because it is not optimal.
Question:
Can anyone suggest me a better way to find that median?
Thanks in advance.
The median-of-medians algorithm doesn't work by finding the median of each block of size 5 and then running a sorting algorithm on them to find the median. Instead, you typically would sort each block, take the median of each, then recursively invoke the median-of-medians algorithm on these medians to get a good pivot. It's very uncommon to see the median-of-medians algorithm used in quicksort, since the constant factor in the O(n) runtime of the median-of-medians algorithm is so large that it tends to noticeably degrade performance.
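A small Python sketch of that recursive pivot-selection step (illustrative only; a real quickselect would partition around the value this returns):

def median_of_medians(a):
    # Pivot selection only; quickselect still has to partition around the result.
    if len(a) <= 5:
        return sorted(a)[len(a) // 2]
    # Median of each block of 5, found by sorting the constant-size block.
    blocks = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(b)[len(b) // 2] for b in blocks]
    # Recurse on the medians instead of sorting or selection-sorting them all.
    return median_of_medians(medians)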
There are several possible improvements you can try over this original approach. The simplest way to get a good pivot is just to pick a random element - this leads to Θ(n log n) runtime with very high probability. If you're not comfortable using randomness, you can try using the introselect algorithm, which is a modification of the median-of-medians algorithm that tries to lower the constant factor by guessing an element that might be a good pivot and cutting off the recursion early if one is found. You could also try writing introsort, which uses quicksort and switches to a different algorithm (usually heapsort) if it appears that the algorithm is degenerating.
Hope this helps!

Sorting algorithms for data of known statistical distribution?

It just occurred to me, if you know something about the distribution (in the statistical sense) of the data to sort, the performance of a sorting algorithm might benefit if you take that information into account.
So my question is, are there any sorting algorithms that take into account that kind of information? How good are they?
An example to clarify: if you know the distribution of your data to be Gaussian, you could estimate the mean and variance on the fly as you process the data. This would give you an estimate of the final position of each number, which you could use to place them close to their final positions.
I'm pretty surprised the answer isn't a wiki link to a thorough page discussing this issue. Isn't this a very common case (the Gaussian case, for example)?
I'm adding a bounty to this question, because I'm looking for definite answers with sources, not speculation. Something like "in the case of Gaussian-distributed data, XYZ algorithm is the fastest on average, as was proved by Smith et al. [1]". However, any additional information is welcome.
If the data you are sorting has a known distribution, I would use a Bucket Sort algorithm. You could add some extra logic to it so that you calculate the size and/or positions of the various buckets based upon properties of the distribution (e.g., for a Gaussian, you might have a bucket every sigma/k away from the mean, where sigma is the standard deviation of the distribution).
By having a known distribution and modifying the standard Bucket Sort algorithm in this way, you would probably get the Histogram Sort algorithm or something close to it. Of course, your algorithm would be computationally faster than Histogram Sort because there would probably be no need to do the first pass (described in the link), since you already know the distribution.
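A hedged sketch of that bucket placement (the helper gaussian_bucket_index and its parameters are my own illustration of the sigma/k idea, not code from the answer):

import math

def gaussian_bucket_index(x, mu, sigma, k, num_buckets):
    # Buckets are sigma/k wide and centred on the mean; clamp so the
    # distribution's tails land in the first and last buckets.
    steps_from_mean = (x - mu) / (sigma / k)
    idx = int(math.floor(steps_from_mean)) + num_buckets // 2
    return max(0, min(num_buckets - 1, idx))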
Edit: given the new criteria in your question (though my previous answer concerning Histogram Sort links to the respectable NIST and contains performance information), here is a peer-reviewed article from the International Conference on Parallel Processing:
Adaptive Data Partition for Sorting Using Probability Distribution
The authors claim this algorithm has better performance (up to 30% better) than the popular Quick-Sort Algorithm.
Sounds like you might want to read Self-Improving Algorithms: they achieve an eventual optimal expected running time for arbitrary input distributions.
We give such self-improving algorithms for two problems: (i) sorting a sequence of numbers and (ii) computing the Delaunay triangulation of a planar point set. Both algorithms achieve optimal expected limiting complexity. The algorithms begin with a training phase during which they collect information about the input distribution, followed by a stationary regime in which the algorithms settle to their optimized incarnations.
If you already know your input distribution is approximately Gaussian, then perhaps another approach would be more efficient in terms of space complexity, but in terms of expected running time this is a rather wonderful result.
Knowing the distribution of the data source, one can build a good hash function. Knowing the distribution well, the hash function may prove to be a perfect hash function, or close to perfect, for many input vectors.
Such a function would divide an input of size n into n bins, such that the smallest item maps into the first bin and the largest item maps into the last bin. When the hash is perfect, we achieve a sort just by inserting all the items into the bins.
Inserting all the items into a hash table and then extracting them in order will be O(n) when the hash is perfect (assuming the hash function calculation cost is O(1) and the operations of the underlying hash data structure are O(1)).
I would use an array of Fibonacci heaps to implement the hash table.
For input vectors for which the hash function is not perfect (but still close to perfect), it would still be much better than O(n log n). When it is perfect, it would be O(n). I'm not sure how to calculate the average complexity, but if forced to, I would bet on O(n log log n).
Computer sorting algorithms can be classified into two categories: comparison-based sorting and non-comparison-based sorting. For comparison-based sorting, the sorting time in its best-case performance is Ω(n log n), while in its worst-case performance the sorting time can rise up to O(n²). In recent years, some improved algorithms have been proposed to speed up comparison-based sorting, such as advanced quick sort according to data distribution characteristics. However, the average sorting time for these algorithms is just Ω(n log₂ n), and only in the best case can it reach O(n).
Different from comparison-based sorting, non-comparison-based sorting such as count sorting, bucket sorting and radix sorting depends mainly on key and address calculation. When the values of keys are finite, ranging from 1 to m, the computational complexity of non-comparison-based sorting is O(m + n). In particular, when m = O(n), the sorting time can reach O(n). However, when m = n², n³, …, the upper bound of linear sorting time cannot be obtained.
Among non-comparison-based sorting, bucket sorting distributes a group of records with similar keys into the appropriate "bucket", then another sorting algorithm is applied to the records in each bucket. With bucket sorting, the partition of records into m buckets is less time consuming, while only a few records will be contained in each bucket so that the "cleanup sorting" algorithm can be applied very fast. Therefore, bucket sorting has the potential to asymptotically save sorting time compared with Ω(n log n) algorithms. Obviously, how to uniformly distribute all records into buckets plays a critical role in bucket sorting. Hence what you need is a method to construct a hash function according to data distribution, which is used to uniformly distribute n records into n buckets based on the key of each record. Hence, the sorting time of the proposed bucket sorting algorithm will reach O(n) under any circumstance.
check this paper: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5170434&tag=1
Bucket sort would give you a linear time sorting algorithm, as long as you can compute the CDF of each point in O(1) time.
The algorithm, which you can also look up elsewhere, is as follows:
a = array(0, n - 1, [])              // create an empty list for each bucket
for x in input:
    a[floor(n * cdf(x))].append(x)   // O(1) time for each x
input.clear()
for i in {0, ..., n - 1}:
    // this sorting step costs O(|a[i]|^2) time for each bucket,
    // but most buckets are small and the cost is O(1) per bucket in expectation
    insertion_sort(a[i])
    input.concatenate(a[i])
The running time is O(n) in expectation because in expectation there are O(n) pairs (x, y) such that x and y fall in the same bucket, and the running time of insertion sort is precisely O(n + # pairs in the same bucket). The analysis is similar to that of FKS static perfect hashing.
EDIT: If you don't know the distribution, but you know what family it's from, you can just estimate the distribution in O(n), in the Gaussian case by computing the mean and variance, and then use the same algorithm (incidentally, computing the cdf in this case is nontrivial).
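For the Gaussian case mentioned in that edit, a small sketch of the estimation and CDF steps (helper names are mine; math.erf gives the standard closed form):

import math

def gaussian_cdf(x, mu, sigma):
    # Closed form via the error function: Phi((x - mu) / sigma).
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def estimate_mu_sigma(sample):
    # One O(n) pass each for the mean and the variance.
    n = len(sample)
    mu = sum(sample) / n
    variance = sum((v - mu) ** 2 for v in sample) / n
    return mu, math.sqrt(variance)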
You could use that information in quicksort to select the pivot value. I think it would improve the probability of the algorithm staying away from the O(n²) worst-case complexity.
I think cycle sort falls into this category. You use it when you know the exact position that you want each element to end up at.
Cyclesort has some nice properties - for certain restricted types of data it can do a stable, in-place sort in linear time, while guaranteeing that each element will be moved at most once.
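A small sketch of that idea for the special case where the data is a permutation of 0..n-1, so each value's final index is the value itself (a restricted illustration, not a general cycle sort):

def cycle_sort_permutation(a):
    # a must be a permutation of 0..n-1; each index is written at most once.
    for start in range(len(a)):
        val = a[start]
        if val == start:
            continue
        # Walk the cycle, dropping each value directly into its final slot.
        while val != start:
            a[val], val = val, a[val]
        a[start] = val
    return a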
Use the inverse of the CDF for PDFs other than the uniform distribution:
a = array(0, n - 1, [])                      // create an empty list for each bucket
for x in input:
    a[floor(n * inverse_cdf(x))].append(x)   // O(1) time for each x
input.clear()
for i in {0, ..., n - 1}:
    // this sorting step costs O(|a[i]|^2) time for each bucket,
    // but most buckets are small and the cost is O(1) per bucket in expectation
    insertion_sort(a[i])
    input.concatenate(a[i])

Parallel computation of the median of a large array

I got asked this question once and still haven't been able to figure it out:
You have an array of N integers, where N is large, say, a billion. You want to calculate the median value of this array. Assume you have m+1 machines (m workers, one master) to distribute the job to. How would you go about doing this?
Since the median is a nonlinear operator, you can't just find the median in each machine and then take the median of those values.
Depending on the Parallel Computation Model, algorithms could vary. (Note: the PDF linked to in the previous sentence just contains some of the many possible ones.)
Finding the median is a special case of finding the i-th element. This problem is called the 'selection problem', so you need to search the web for parallel selection.
Here is one paper (unfortunately, not free) which might be useful: Parallel Selection Algorithms With Analysis on Clusters.
And Google's first link for the query "Parallel Selection" gives: http://www.umiacs.umd.edu/research/EXPAR/papers/3494/node18.html which actually uses the median of medians for the general problem and not just median finding.
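To make "parallel selection" concrete, here is a rough single-process Python sketch of a quickselect-style scheme (my own illustration, not the algorithm from the linked page or paper; the inner comprehensions stand in for per-worker work and the outer loop for the master):

import random

def distributed_select(chunks, k):
    # The master broadcasts a pivot, each worker reports how its own chunk
    # splits around it, and the search narrows until the k-th smallest
    # element (0-based) over all chunks is found.
    while True:
        nonempty = [c for c in chunks if c]
        pivot = random.choice(random.choice(nonempty))                 # master's pivot choice
        less = sum(sum(1 for x in c if x < pivot) for c in chunks)     # per-worker counts
        equal = sum(sum(1 for x in c if x == pivot) for c in chunks)
        if k < less:
            chunks = [[x for x in c if x < pivot] for c in chunks]     # workers keep < pivot
        elif k < less + equal:
            return pivot
        else:
            chunks = [[x for x in c if x > pivot] for c in chunks]     # workers keep > pivot
            k -= less + equal

For the median of N elements you would ask for k = N // 2; each round costs one pivot broadcast plus two counts per worker, which is why selection parallelises far more cheaply than a full distributed sort.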
You could do a highly parallelizable sort (like merge sort) and get the median from the result.
Would sorting the array be overkill? If not, then my suggestion is to divide up the array, sort the pieces, and then merge the results together.
