Parallel computation of the median of a large array - parallel-processing

I got asked this question once and still haven't been able to figure it out:
You have an array of N integers, where N is large, say, a billion. You want to calculate the median value of this array. Assume you have m+1 machines (m workers, one master) to distribute the job to. How would you go about doing this?
Since the median is a nonlinear operator, you can't just find the median in each machine and then take the median of those values.

Depending on the Parallel Computation Model, algorithms can vary. (Note: the PDF linked in the previous sentence just covers some of the many possible models.)
Finding the median is a special case of finding the i-th element. This problem is known as the 'selection problem', so what you want to search the web for is parallel selection.
Here is one paper (unfortunately, not free) which might be useful: Parallel Selection Algorithms With Analysis on Clusters.
And Google's first result for the query "parallel selection" gives: http://www.umiacs.umd.edu/research/EXPAR/papers/3494/node18.html which actually uses median of medians for the general selection problem, not just median finding.

You could do a highly parallelizable sort (like merge sort) and get the median from the result.

Would sorting the array be overkill? If not, my suggestion would be to divide up the array, sort the pieces in parallel, and then merge the results together.
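As a minimal single-process sketch of that sort-then-merge idea (distributed_median is an illustrative name; sorting each chunk stands in for the per-worker sort, and actually shipping chunks to machines, e.g. via MPI or a job queue, is assumed):

import heapq
from itertools import islice

def distributed_median(values, m):
    # Split the data into m chunks; in a real deployment each worker
    # would sort its own chunk and stream it back to the master.
    chunk_size = (len(values) + m - 1) // m
    sorted_chunks = [sorted(values[i:i + chunk_size])
                     for i in range(0, len(values), chunk_size)]

    # The master k-way merges the sorted streams and stops at the middle.
    merged = heapq.merge(*sorted_chunks)
    n = len(values)
    if n % 2:
        return next(islice(merged, n // 2, None))
    lo, hi = islice(merged, n // 2 - 1, n // 2 + 1)
    return (lo + hi) / 2

print(distributed_median(list(range(10)), m=4))   # 4.5

Note that the master still touches all N elements during the merge; the parallel-selection references above avoid that.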

Related

How to tell if two sequences differ by a permutation?

Given any two sequences of n real numbers, say (a1,a2,...,an) and (b1,b2,...,bn), how to tell if one sequence (which can also be viewed as a vector) is a permutation of the other?
I plan to develop an algorithm and run it in Matlab to do this job. I can only think of an algorithm that costs O(n!) time: just try all the permutations.
Is there a faster algorithm?
First of all, why n!? If for every a_i you search for a match among the b_j, you get O(n^2).
In any case, it is more efficient to sort both, with O(n log n) complexity:
A=[3,1,2,7];
B=[2,3,1,7];
isPermutated=isequal(sort(A),sort(B))
Just sort both sequences and compare sorted results.
In some situations you might find it useful to create sets/maps/dictionaries (with counters if duplicate elements are possible) from both sequences and check for every element's presence in the other.
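A sketch of that counter-based approach in Python (is_permutation is an illustrative name; collections.Counter builds the multiset in expected O(n) time):

from collections import Counter

def is_permutation(a, b):
    # The multisets are equal iff one sequence is a permutation of the other.
    # Expected O(n) time versus O(n log n) for sort-and-compare.
    return len(a) == len(b) and Counter(a) == Counter(b)

print(is_permutation([3, 1, 2, 7], [2, 3, 1, 7]))  # True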

Sorting Algorithm that minimizes the maximum number of comparisons in which individual items are involved

I'm interested in finding a comparison sorting algorithm that minimizes the number of times each single element is compared with others during the run of the algorithm.
For a randomly sorted list, I'm interested in two distributions: the number of comparisons that are needed to sort a list (this is the traditional criterion) and the number of comparisons in which each single element of the list is involved.
Among the algorithms that have a good performance in terms of the number of comparisons, say achieving O(n log(n)) on average, I would like to find out the one for which, on average, the number of times a single element is compared with others is minimized.
I suppose the theoretical minimum is O(log(n)), obtained by dividing the above figure for the total number of comparisons by n.
I'm also interested in the case where data are likely to be already ordered to some extent.
Is perhaps a simulation the best way to go about finding an answer?
(My previous question was put on hold. This is now a very clear question; if you can't understand it, please explain why.)
Yes, you definitely should do simulations.
There you will implicitly set the size and pre-ordering constraints in a way that may allow more specific statements than the general question you raised.
There cannot, however, be a clear answer to such a question in general.
Big-O deals with asymptotic behaviour, while your question seems to target smaller problem sizes. So Big-O could hint at the best candidates for sufficiently large input sets to a sort run. (But, e.g., if you are interested in size <= 5, the results may be completely different!)
To get a proper estimate of the comparison operations, you would need to analyze each individual algorithm.
In the end, the result (for a given algorithm) will necessarily be specific to the dataset being sorted.
Also, "on average" is not well defined in your context. I'd assume you intend to refer to the number of comparisons the participating objects take part in for a given sort, and not an average over a (sufficiently large) set of sort runs.
Even within a single algorithm, the distribution of comparisons an individual object takes part in may show a large standard deviation in one case and be (nearly) uniform in another.
As the complexity of a sorting algorithm is determined by the total number of comparisons (and the position changes that follow from them), I do not expect theoretical analysis to contribute much to an answer.
Maybe you can add some background on what would make an answer to your question "interesting" in a practical sense?
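If you do go the simulation route, here is a minimal sketch of the instrumentation (Counted and per_element_comparisons are illustrative names; list.sort stands in for whatever algorithm you want to measure):

import random

class Counted:
    # Wraps a value and counts how many comparisons it takes part in.
    def __init__(self, value):
        self.value = value
        self.comparisons = 0

    def __lt__(self, other):
        self.comparisons += 1
        other.comparisons += 1
        return self.value < other.value

def per_element_comparisons(values):
    wrapped = [Counted(v) for v in values]
    wrapped.sort()                      # replace with the algorithm under test
    return [w.comparisons for w in wrapped]

counts = per_element_comparisons(random.sample(range(10_000), 1_000))
print(min(counts), sum(counts) / len(counts), max(counts))

Running this over many random (or partially pre-ordered) inputs gives you both distributions you asked about.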

Median of medians algorithm understanding

I've searched around the web and visited the wiki page for the median of medians algorithm, but I can't seem to find an explicit answer to my question:
If one has a very, very large list of integers (TBs in size) and wants to find the median of this list in a distributed manner, would breaking the list up into sub-lists (of varying or equal sizes, it doesn't really matter), computing the medians of those smaller sub-lists, and then taking the median of those medians give the median of the original large list?
Furthermore, does this also hold for any of the k-th order statistics? I'd be interested in links to research etc. in this area.
The answer to your question is no.
If you want to understand how to actually select the k-th order statistic (including the median, of course) in a parallel setting (a distributed setting is of course not really different), take a look at this recent paper, in which I proposed a new algorithm improving on the previous state-of-the-art algorithm for parallel selection:
Deterministic parallel selection algorithms on coarse-grained multicomputers
Here, we use two weighted 3-medians as pivots and partition around these pivots using five-way partitioning. We also implemented and tested the algorithm using MPI. The results are very good, taking into account that this is a deterministic algorithm exploiting the worst-case O(n) selection algorithm. Using the randomized O(n) quickselect algorithm instead yields an extremely fast parallel algorithm.
No. The actual median of the entire list is not necessarily a median of any of the sublists.
Median-of-medians can give you a good choice of pivot for quickselect by virtue of being nearer the actual median than a randomly selected element, but you would have to do the rest of the quickselect algorithm to locate the actual median of the larger list.
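A tiny script makes this concrete; the split below is just one illustrative way to partition the same nine numbers:

from statistics import median

data = list(range(1, 10))                     # true median is 5
sublists = [[1, 2, 9], [3, 4, 8], [5, 6, 7]]  # one way to split the same data
sub_medians = [median(s) for s in sublists]   # [2, 4, 6]
print(median(sub_medians), median(data))      # 4 vs 5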

Sorting algorithms for data of known statistical distribution?

It just occurred to me, if you know something about the distribution (in the statistical sense) of the data to sort, the performance of a sorting algorithm might benefit if you take that information into account.
So my question is, are there any sorting algorithms that take into account that kind of information? How good are they?
An example to clarify: if you know the distribution of your data to be Gaussian, you could estimate the mean and variance on the fly as you process the data. This would give you an estimate of the final position of each number, which you could use to place numbers close to their final position.
I'm pretty surprised the answer isn't a wiki link to a thorough page discussing this issue. Isn't this a very common case (the Gaussian case, for example)?
I'm adding a bounty to this question because I'm looking for definite answers with sources, not speculation. Something like "in the case of Gaussian-distributed data, algorithm XYZ is the fastest on average, as was proved by Smith et al. [1]". However, any additional information is welcome.
If the data you are sorting has a known distribution, I would use a bucket sort algorithm. You could add some extra logic to it so that you calculate the size and/or positions of the various buckets based upon properties of the distribution (e.g. for Gaussian, you might have a bucket every sigma/k away from the mean, where sigma is the standard deviation of the distribution).
By having a known distribution and modifying the standard bucket sort algorithm in this way, you would probably get the histogram sort algorithm or something close to it. Of course, your algorithm would be computationally faster than histogram sort because there would probably be no need to do the first pass (described in the link), since you already know the distribution.
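A minimal sketch of that bucket-index calculation, assuming an estimated mean and sigma and an arbitrary choice of k and bucket count:

import math

def gaussian_bucket_index(x, mean, sigma, k=4, num_buckets=64):
    # Bucket boundaries are spaced sigma/k apart around the mean;
    # indices are clamped so the tails land in the outermost buckets.
    offset = int(math.floor((x - mean) / (sigma / k)))
    return max(0, min(num_buckets - 1, num_buckets // 2 + offset))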
Edit: given the new criteria in your question (though my previous answer concerning histogram sort links to the respectable NIST and contains performance information), here is a peer-reviewed paper from the International Conference on Parallel Processing:
Adaptive Data Partition for Sorting Using Probability Distribution
The authors claim this algorithm has better performance (up to 30% better) than the popular Quick-Sort Algorithm.
Sounds like you might want to read Self-Improving Algorithms: they achieve an eventual optimal expected running time for arbitrary input distributions.
We give such self-improving algorithms for two problems: (i) sorting a sequence of numbers and (ii) computing the Delaunay triangulation of a planar point set. Both algorithms achieve optimal expected limiting complexity. The algorithms begin with a training phase during which they collect information about the input distribution, followed by a stationary regime in which the algorithms settle to their optimized incarnations.
If you already know your input distribution is approximately Gaussian, then perhaps another approach would be more efficient in terms of space complexity, but in terms of expected running time this is a rather wonderful result.
Knowing the data source distribution, one can build a good hash function. Knowing the distribution well, the hash function may prove to be a perfect hash function, or close to perfect, for many input vectors.
Such a function would divide an input of size n into n bins, such that the smallest item maps into the 1st bin and the largest item maps to the last bin. When the hash is perfect, we achieve a sort just by inserting all the items into the bins.
Inserting all the items into a hash table and then extracting them in order will be O(n) when the hash is perfect (assuming the hash function calculation cost is O(1) and the operations of the underlying hash data structure are O(1)).
I would use an array of Fibonacci heaps to implement the hash table.
For input vectors for which the hash function isn't perfect (but still close to perfect), it would still be much better than O(n log n). When it is perfect, it would be O(n). I'm not sure how to calculate the average complexity, but if forced to, I would bet on O(n log log n).
Computer sorting algorithms can be classified into two categories: comparison-based sorting and non-comparison-based sorting. For comparison-based sorting, the sorting time in its best-case performance is Ω(n log n), while in its worst-case performance the sorting time can rise up to O(n^2). In recent years, some improved algorithms have been proposed to speed up comparison-based sorting, such as advanced quicksort according to data distribution characteristics. However, the average sorting time for these algorithms is just Ω(n log2 n), and only in the best case can it reach O(n).
Different from comparison-based sorting, non-comparison-based sorting such as count sorting, bucket sorting and radix sorting depends mainly on key and address calculation. When the values of keys are finite, ranging from 1 to m, the computational complexity of non-comparison-based sorting is O(m + n). In particular, when m = O(n), the sorting time can reach O(n). However, when m = n^2, n^3, ..., the upper bound of linear sorting time cannot be obtained.
Among non-comparison-based sorting methods, bucket sorting distributes a group of records with similar keys into the appropriate "bucket", then another sorting algorithm is applied to the records in each bucket. With bucket sorting, the partition of records into m buckets is less time consuming, while only a few records will be contained in each bucket, so that a "cleanup sorting" algorithm can be applied very quickly. Therefore, bucket sorting has the potential to asymptotically save sorting time compared with Ω(n log n) algorithms. Obviously, how to uniformly distribute all records into buckets plays a critical role in bucket sorting. Hence what you need is a method to construct a hash function according to the data distribution, which is used to uniformly distribute n records into n buckets based on the key of each record. Hence, the sorting time of the proposed bucket sorting algorithm will reach O(n) under any circumstances.
check this paper: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5170434&tag=1
Bucket sort would give you a linear time sorting algorithm, as long as you can compute the CDF of each point in O(1) time.
The algorithm, which you can also look up elsewhere, is as follows:
a = [[] for _ in range(n)]          # create an empty list for each bucket
for x in input:
    # O(1) time for each x; clamp so that cdf(x) == 1.0 still lands in the last bucket
    a[min(n - 1, int(n * cdf(x)))].append(x)
input.clear()
for i in range(n):
    # this sorting step costs O(|a[i]|^2) time for each bucket,
    # but most buckets are small and the cost is O(1) per bucket in expectation
    insertion_sort(a[i])
    input.extend(a[i])
The running time is O(n) in expectation because in expectation there are O(n) pairs (x, y) such that x and y fall in the same bucket, and the running time of insertion sort is precisely O(n + # pairs in the same bucket). The analysis is similar to that of FKS static perfect hashing.
EDIT: If you don't know the distribution but you know what family it's from, you can estimate the distribution in O(n) time (in the Gaussian case, by computing the mean and variance) and then use the same algorithm (incidentally, computing the CDF in this case is nontrivial).
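For the Gaussian case specifically, the estimated CDF can be written in terms of the error function; a sketch using Python's math.erf (estimate_gaussian_cdf is an illustrative name, and scipy.stats.norm would do the same job):

import math

def estimate_gaussian_cdf(sample):
    # One O(n) pass for the mean, another for the variance.
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    std = math.sqrt(var) or 1.0   # guard against a constant sample

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2))))

    return cdf

cdf = estimate_gaussian_cdf([1.2, 0.8, 1.1, 0.9, 1.0])
print(cdf(1.0))   # 0.5 for a sample centred at 1.0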
You could use that information in quicksort to select the pivot value. I think it would improve the probability of the algorithm staying away from the O(N^2) worst-case complexity.
I think cycle sort falls into this category. You use it when you know the exact position that you want each element to end up at.
Cyclesort has some nice properties - for certain restricted types of data it can do a stable, in-place sort in linear time, while guaranteeing that each element will be moved at most once.
For distributions other than uniform, use the CDF of the distribution (the probability integral transform) when computing the bucket index, i.e. the a[floor(n * cdf(x))] step in the sketch above, with cdf taken from the known (or estimated) distribution. The CDF maps every value into [0, 1], so the buckets end up roughly evenly loaded for any distribution whose CDF you can evaluate.

Is it possible to calculate median of a list of numbers better than O(n log n)?

I know that it is possible to calculate the mean of a list of numbers in O(n). But what about the median? Is there any better algorithm than sort (O(n log n)) and lookup middle element (or mean of two middle elements if an even number of items in list)?
Yes. You can do it (deterministically) in O(n).
What you're talking about is a selection algorithm, where k = n/2. There is a method based on the same partitioning function used in quicksort which works. It is called, not surprisingly, quickselect. While it can, like quicksort, have an O(n^2) worst case, this can be brought down to worst-case linear time using proper pivot selection (the median-of-medians pivot rule).
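A hedged sketch of quickselect with a random pivot (expected linear time; quickselect is just an illustrative standalone function here, and swapping the random pivot for a median-of-medians pivot gives the worst-case O(n) guarantee):

import random

def quickselect(values, k):
    # Returns the k-th smallest element (k is 0-based), expected O(n) time.
    values = list(values)
    while True:
        if len(values) == 1:
            return values[0]
        pivot = random.choice(values)
        lows = [x for x in values if x < pivot]
        pivots = [x for x in values if x == pivot]
        highs = [x for x in values if x > pivot]
        if k < len(lows):
            values = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            values = highs

data = [7, 1, 5, 3, 9, 4, 8]
print(quickselect(data, len(data) // 2))   # 5, the median of these 7 elements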
Partially irrelevant, but: a quick tip on how to quickly find answers to common basic questions like this on the web.
We're talking about medians? So go to the Wikipedia page about medians.
Search the page for "algorithm":
Efficient computation of the sample median
Even though sorting n items takes in general O(n log n) operations, by using a "divide and conquer" algorithm the median of n items can be computed with only O(n) operations (in fact, you can always find the k-th element of a list of values with this method; this is called the selection problem).
Follow the link to the selection problem for a description of the algorithm. Read the intro:
... There are worst-case linear time selection algorithms. ...
And if you're interested read about the actual ingenious algorithm.
If the numbers are discrete (e.g. integers) and there is a manageable number of distinct values, you can use a bucket sort, which is O(N), and then iterate over the buckets to figure out which bucket holds the median. The complete calculation is O(N) in time and O(B) in space, where B is the number of buckets.
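A sketch of that idea, assuming the value range itself (not just the number of distinct values) is manageable; histogram_median is an illustrative name, and the lower median is returned for even N just to keep the example short:

def histogram_median(ints):
    # O(N) to fill the buckets, then a walk over the value range to
    # find the bucket holding the median position.
    lo, hi = min(ints), max(ints)
    buckets = [0] * (hi - lo + 1)
    for x in ints:
        buckets[x - lo] += 1
    target = (len(ints) - 1) // 2      # index of the (lower) median
    seen = 0
    for value, count in enumerate(buckets, start=lo):
        seen += count
        if seen > target:
            return value

print(histogram_median([5, 1, 3, 3, 2, 8, 7]))   # 3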
Just for fun (and who knows, it may be faster), there's another randomized median algorithm, explained technically in Mitzenmacher's and Upfal's book. Basically, you choose a polynomially smaller subset of the list that (with some fancy bookkeeping) probably contains the real median, and then use it to find the real median. The book is on Google Books, and here's a link. Note: I was able to read the pages describing the algorithm, so assuming that Google Books shows the same pages to everyone, you can read them too.
It is a randomized algorithm such that if it finds the answer, it is 100% certain to be the correct answer (this is called Las Vegas style). The randomness arises in the runtime: occasionally (with probability 1/sqrt(n), I think) it fails to find the median and must be re-run.
Asymptotically, it is exactly linear when you take the chance of failure into account; that is to say, it is a wee bit less than linear per run, exactly so that when you account for the number of times you may need to re-run it, it becomes linear.
Note: I'm not saying this is better or worse; I certainly haven't done a real-life runtime comparison between these algorithms! I'm simply presenting an additional algorithm that has linear runtime but works in a significantly different way.
This link has popped up recently on calculating median: http://matpalm.com/median/question.html .
In general I think you can't go beyond O(n log n) time, but I don't have any proof of that :). No matter how much you parallelize it, aggregating the results into a single value takes at least log n levels of execution.
Try the randomized algorithm: the sample size (e.g. 2000) is independent of the data size n and still gives sufficiently high (99%) accuracy. If you need higher accuracy, just increase the sample size. A Chernoff bound can be used to prove the error probability for a given sample size. I've written some JavaScript code implementing the algorithm; feel free to take it. http://www.sfu.ca/~wpa10
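A minimal Python sketch of the same sampling idea (the function name and default sample size are illustrative):

import random
from statistics import median

def approximate_median(values, sample_size=2000):
    # The sample size is independent of len(values); a Chernoff-style
    # argument bounds how far the sample median's rank can drift.
    if len(values) <= sample_size:
        return median(values)
    return median(random.sample(values, sample_size))

data = [random.gauss(0, 1) for _ in range(1_000_000)]
print(approximate_median(data))   # close to 0 with high probability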
