Reducing the Average Number of Comparisons in Selection - algorithm

The problem here is to reduce the average number of comparisons need in a selection sort.
I am reading an article on this and here is text snippet:
More generally, a sample S' of s elements is chosen from the n
elements. Let "delta" be some number, which we will choose later so
as to minimize the average number of comparisons used by the
procedure. We find the (v1 = (k * s)/(n - delta))th and (v2 = (k* * s)/(n + delta)
)th smallest elements in S'. Almost certainly, the kth smallest
element in S will fall between v1 and v2, so we are left with a
selection problem on (2 * delta) elements. With low probability, the
kth smallest element does not fall in this range, and we have
considerable work to do. However, with a good choice of s and delta,
we can ensure, by the laws of probability, that the second case does
not adversely affect the total work.
I do not follow the above text. Can anyone please explain to me with examples. How did the author reduce to 2 * delta elements? And how does he know that there is a low probablity that element does not fall into this category.
Thanks!

The basis for the idea is that the normal selection algorithm has linear runtime complexity, but in practical terms is slow. We need to sort all the elements in groups of five, and recursively do even more work. O(n) but with too large a constant. The idea then, is to reduce the number of comparisons in the selection algorithm (not a selection sort necessarily). Intuitively it is the same as in basic statistics; if I take a sample subspace of large enough proportion, it is likely that the distribution of data in the subspace adequately reflects the data in the whole space.
So if I'm looking for the kth number in a set of size one million, I could instead take say 10 000 (already one hundredth the size), which is still large enough to be a good representation of the global distribution, and look for the k/100th number. That's simple scaling. So if the space was 10 and I was looking for the 3rd, that's like looking for the 30th in 100, or the 300th in 1000, etc. Essentially k/S = k'/S' (where we're looking for the kth number in S, and we translate that to the k'th number in S' our subspace) and therefore k' = k*S'/S which should look familiar, since in the text you quoted S' is denoted by s, and S by n, and that's the same fraction quoted.
Now in order to take statistical fluctuations into account, we don't assume that the subspace will be a perfect representation of the data's distribution, so we allow for some fluctuation, namely, delta. We say let's find the k'th-delta and k'th+delta elements in S', and then we can say with great certainty (i.e. high mathematical probability) that the kth value from S is in the interval (k'th-delta, k'th+delta).
To wrap it all up we perform these two selections on S', then partition S accordingly, and now do [normal] selection on the much smaller interval in the partition. This ends up being almost optimal for the elements outside the interval, because we don't do selection on those, only partition them. So the selection process is faster, because we have reduced the problem size from S to S'.

Related

Partitioning an ordered list of weights into N sub-lists of approximately equal weight

Suppose I have an ordered list of weights, having length M. I want to divide this list into N ordered non-empty sublists, where the sum of the weights in each sublist are as close to each other as possible. Finally, the length of the list will always be greater than or equal to the number of partitions.
For example:
A reader of epoch fantasy wants to read the entire Wheel of Time series in N = 90 days. She wants to read approximately the same amount of words each day, but she doesn't want to break a single chapter across two days. Obviously, she also doesn't want to read it out of order either. The series has a total of M chapters, and she has a list of the word counts in each.
What algorithm could she use to calculate the optimum reading schedule?
In this example, the weights probably won't vary much, but the algorithm I'm seeking should be general enough to handle weights that vary widely.
As for what I consider optimum, I would say that given the choice between having two or three partitions vary in weight a small amount from the average would be better than having one partition vary a lot. Or in other words, She would rather have several days where she reads a few hundred more or fewer words than the average, if it means she can avoid having to read a thousand words more or fewer than the average, even once. My thinking is to use something like this to compute the score of any given solution:
let W_1, W_2, W_3 ... w_N be the weights of each partition (calculated by simply summing the weights of its elements).
let x be the total weight of the list, divided by its length M.
Then the score would be the sum, where I goes from 1 to N of (X - w_i)^2
So, I think I know a way to score each solution. The question is, what's the best way to minimize the score, other than brute force?
Any help or pointers in the right direction would be much appreciated!
As hinted by the first entry under "Related" on the right column of this page, you are probably looking for a "minimum raggedness word wrap" algorithm.

How to detect partitions (clusters) of sparse data in linear time, and (hopefully) sublinear space?

Let's say I have m integers from n disjoint integer intervals, which are in some sense "far" apart.
n is not known beforehand, but it is known to be small (see assumptions below).
For example, for n = 3, I might have been given randomly distributed integers from the intervals 105-2400, 58030-571290, 1000000-1000100.
Finding the minimum (105) and maximum (1000100) is clearly trivial.
But is there any way to efficiently (in O(m) time and hopefully o(m) space) find the intervals' boundary points, so that I can quickly partition the data for separate processing?
If there is no efficient way to do this exactly, is there an efficient way to approximate the partitions, to within a small constant factor (like 2)?
(For example, 4000 would be an acceptable approximation of the upper bound of the smaller interval, and 30000 would be an acceptable approximation of the lower bound of the middle interval.)
Assumptions:
Everything is nonnegative
n is very small (say, < 10)
The max value is comparatively large (say, on the order of 226)
The intervals are dense (i.e. there exists an integer in the array for most values inside that interval)
Two clusters are far apart if their closest boundaries are at least a constant factor c apart.
(Edit: It makes sense for c to be relative to the cluster size rather than relative to the bound. So a cluster of 1 element at 1000000 should not be approximated as originating from the interval 500000-2000000.)
The integers are not sorted, and this is crucial. In fact sorting them in O(m) time is impossible without radix sort, but radix sort could have O(max value) complexity, and there is no guarantee the max value is anywhere close to m.
Again, speed is the most important factor here; inaccuracy is tolerated as long as it's within a reasonable factor.
I say move to a logarithmic scale of factor c to search for the intervals, since you know them being at least c factor apart. Then, make an array of counters, each counter counts numbers in intervals (0.5X .. 0.5X+0.5) logarithmic scale, where X is the index of selected counter.
Say, c is 2, and you know the maximum upper bound of 226, so you create 52 counters, then calculate floor(2*log<sub>2</sub>i) where i is the current integer, and increment that counter. After you parse all of m integers, walk through that array, and each sequence of zeroes in there will mean that the corresponding logarithmic interval is empty.
So the output of this will be the sequence of occupied intervals, logarithmically aligned to halves of power of c, aka 128, 181, 256, 363, 512 etc. This satisfies your requirements on precision for intervals' boundaries.
Update: You can also store lowest and highest number out of those that hit the interval. Once you do, the intervals' boundaries are calculated as follows:
Find the first nonzero counter from current position in the counters array. Take its lowest number as lower bound.
Progress through the counters array until you find a zero in the counter or hit the array's end. Take the last nonzero counter's highest number. This will be your upper bound for the current interval.
Proceed until full traversal of the array.
Return the set of intervals found. The boundaries will be strict.
An example: (abstract language code)
counters=[];
lowest=[];
highest=[];
for (i=0;i<m;i++) {
x=getNextInteger();
n=Math.floor(2.0*logByBase(c,x));
counters[n]++;
if (counters[n]==1) {
lowest[n]=x;
highest[n]=x;
} else {
if (lowest[n]>x) lowest[n]=x;
if (highest[n]<x) highest[n]=x;
}
}
zeroflag=true; /// are we in mode of finding a zero or a nonzero
intervals=[];
currentLow=0;
currentHigh=0;
for (i=0;i<counters.length;i++) {
if (zeroflag) {
// we search for nonzero
if (counters[i]>0) {
currentLow=lowest[i]; // there's a value
zeroflag=false;
} // else skip
} else {
if (counters[i]==0) {
currentHigh=highest[i-1]; // previous was nonzero, get its highest
intervals.push([currentLow,currentHigh]); // store interval
zeroflag=true;
}
}
if (!zeroflag) { // unfinished interval
currentHigh=highest[counters.length-1];
intervals.push([currentLow,currentHigh]); // store last interval
}
You may want to look at aproximate median finding.
These methods can often be generalized to finding arbitrary quantiles with a reasonable precision; and quantiles are good for distributing your work load.
Here's my take, using two passes over the data set.
sample 10000 objects from your data set.
Solve your problem for the sample objects.
Re-scan your data set, assigning each object to the nearest interval from your sample, and tracking the minimum and maximum of each interval.
If your gaps are prominent enough, they should still be visible in the sample. The second pass is only to refine the interval boundaries.
Split the total range into buckets. The bucket boundaries X_i should be spaced in an appropriate way, like e.g. linearly X_i=16*i. Other options would be quadratic spacing like X_i=4*i*i or logarithmic X_i=2^(i/16), here the total number of buckets would be smaller but finding the right bucket for a given number would be more effort. Each bucket is empty or non-empty, so one bit would be sufficient.
You iterate over the set of numbers, and for each number you mark its bucket as non-empty. Then the gaps between your intervals are represented by series of empty buckets. So now you find all sufficiently long series of empty buckets, and you have the interval gaps. The accuracy of the interval boundary will be determined by the bucket size, so assuming a bucket size of 16 you interval border is off by at most 15. If the max number is 226 and buckets are size 16 and you use one bit for each bucket you need 219 byte, or 512kB of memory.
Actually I think I was wrong when I posted the question -- the answer does seem to be radix sort.
The number of buckets is arbitrary, it doesn't have to correlate with the sizes of the intervals.
It might even be 2, if I go bit-by-bit.
Thus radix sort could help me sort the data in O(m log(max value)) ≈ O(m) time (since log(max value) is essentially a constant factor of 26 according to the assumptions), at which point the problem becomes trivial.

Find medians in multiple sub ranges of a unordered list

E.g. given a unordered list of N elements, find the medians for sub ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub range and run the standard median finding algorithms.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5 . (Not 4, since we are asking the median of the first three elements of the original list)
INTEGER ELEMENTS:
If the type of your elements are integers, then the best way is to have a bucket for each number lies in any of your sub-ranges, where each bucket is used for counting the number its associated integer found in your input elements (for example, bucket[100] stores how many 100s are there in your input sequence). Basically you can achieve it in the following steps:
create buckets for each number lies in any of your sub-ranges.
iterate through all elements, for each number n, if we have bucket[n], then bucket[n]++.
compute the medians based on the aggregated values stored in your buckets.
Put it in another way, suppose you have a sub-range [0, 10], and you would like to compute the median. The bucket approach basically computes how many 0s are there in your inputs, and how many 1s are there in your inputs and so on. Suppose there are n numbers lies in range [0, 10], then the median is the n/2th largest element, which can be identified by finding the i such that bucket[0] + bucket[1] ... + bucket[i] greater than or equal to n/2 but bucket[0] + ... + bucket[i - 1] is less than n/2.
The nice thing about this is that even your input elements are stored in multiple machines (i.e., the distributed case), each machine can maintain its own buckets and only the aggregated values are required to pass through the intranet.
You can also use hierarchical-buckets, which involves multiple passes. In each pass, bucket[i] counts the number of elements in your input lies in a specific range (for example, [i * 2^K, (i+1) * 2^K]), and then narrow down the problem space by identifying which bucket will the medium lies after each step, then decrease K by 1 in the next step, and repeat until you can correctly identify the medium.
FLOATING-POINT ELEMENTS
The entire elements can fit into memory:
If your entire elements can fit into memory, first sorting the N element and then finding the medians for each sub ranges is the best option. The linear time heap solution also works well in this case if the number of your sub-ranges is less than logN.
The entire elements cannot fit into memory but stored in a single machine:
Generally, an external sort typically requires three disk-scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, then first sorting the N elements and then finding the medians for each sub ranges by only loading necessary elements from the disk is the best choice. Otherwise, simply performing a scan for each sub-ranges and pick up those elements in the sub-range is better.
The entire elements are stored in multiple machines:
Since finding median is a holistic operator, meaning you cannot derive the final median of the entire input based on the medians of several parts of input, it is a hard problem that one cannot describe its solution in few sentences, but there are researches (see this as an example) have been focused on this problem.
I think that as the number of sub ranges increases you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.
In practice, because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, because since you are dealing with integers you need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs then a little bit twiddling will in fact allow you to use integer sort on them - from - http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers - The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply.
So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
The answer is ultimately going to be "in depends". There are a variety of approaches, any one of which will probably be suitable under most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators) about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size of the set n
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
Sort the n integers (for millions of integers i.e. 10s to 100s of MB of ram, this should not be a problem), and sum the k-1 smallest. Call this total offset.
Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
Restrict your set of n integers to those less than or equal to this bound.
Repeat (goto 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.
Even if only 1 in 1000 of the k-sized sets meets your condition, That's still far too many combinations to test. I believe runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28/1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds. =10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant amount of (2^20 choose 5) subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, assuming just 1 instruction can compute a set and evaluate that condition, going through that entire set would still take 3 quadrillion years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
min sum(i) w_i*x_i
x_i binary
sum over x_i = k
subject to (some constraints on w_i*x_i)
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (Example: max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution since you may have a very large number of values that cannot be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to the positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum for all such vectors v' is always greater than the smallest sum for all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
sort arr
for i = 1 to N
for v = 5-element subsets of (1, ..., i)
set = arr{v}
if condition(set) is satisfied
break_loop = true
compute sum(set), keep set if it is the best so far
break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.
This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
Complete sort may not really be needed as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated / reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

How to calculate or approximate the median of a list without storing the list

I'm trying to calculate the median of a set of values, but I don't want to store all the values as that could blow memory requirements. Is there a way of calculating or approximating the median without storing and sorting all the individual values?
Ideally I would like to write my code a bit like the following
var medianCalculator = new MedianCalculator();
foreach (var value in SourceData)
{
medianCalculator.Add(value);
}
Console.WriteLine("The median is: {0}", medianCalculator.Median);
All I need is the actual MedianCalculator code!
Update: Some people have asked if the values I'm trying to calculate the median for have known properties. The answer is yes. One value is in 0.5 increments from about -25 to -0.5. The other is also in 0.5 increments from -120 to -60. I guess this means I can use some form of histogram for each value.
Thanks
Nick
If the values are discrete and the number of distinct values isn't too high, you could just accumulate the number of times each value occurs in a histogram, then find the median from the histogram counts (just add up counts from the top and bottom of the histogram until you reach the middle). Or if they're continuous values, you could distribute them into bins - that wouldn't tell you the exact median but it would give you a range, and if you need to know more precisely you could iterate over the list again, examining only the elements in the central bin.
There is the 'remedian' statistic. It works by first setting up k arrays, each of length b. Data values are fed in to the first array and, when this is full, the median is calculated and stored in the first pos of the next array, after which the first array is re-used. When the second array is full the median of its values is stored in the first pos of the third array, etc. etc. You get the idea :)
It's simple and pretty robust. The reference is here...
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf
Hope this helps
Michael
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
Also, I modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
I would love to see an incremental mode estimator of a similar form...
(Note: I also posted this to a similar topic here: "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?)
Here is a crazy approach that you might try. This is a classical problem in streaming algorithms. The rules are
You have limited memory, say O(log n) where n is the number of items you want
You can look at each item once and make a decision then and there what to do with it, if you store it, it costs memory, if you throw it away it is gone forever.
The idea for the finding a median is simple. Sample O(1 / a^2 * log(1 / p)) * log(n) elements from the list at random, you can do this via reservoir sampling (see a previous question). Now simply return the median from your sampled elements, using a classical method.
The guarantee is that the index of the item returned will be (1 +/- a) / 2 with probability at least 1-p. So there is a probability p of failing, you can choose it by sampling more elements. And it wont return the median or guarantee that the value of the item returned is anywhere close to the median, just that when you sort the list the item returned will be close to the half of the list.
This algorithm uses O(log n) additional space and runs in Linear time.
This is tricky to get right in general, especially to handle degenerate series that are already sorted, or have a bunch of values at the "start" of the list but the end of the list has values in a different range.
The basic idea of making a histogram is most promising. This lets you accumulate distribution information and answer queries (like median) from it. The median will be approximate since you obviously don't store all values. The storage space is fixed so it will work with whatever length sequence you have.
But you can't just build a histogram from say the first 100 values and use that histogram continually.. the changing data may make that histogram invalid. So you need a dynamic histogram that can change its range and bins on the fly.
Make a structure which has N bins. You'll store the X value of each slot transition (N+1 values total) as well as the population of the bin.
Stream in your data. Record the first N+1 values. If the stream ends before this, great, you have all the values loaded and you can find the exact median and return it. Else use the values to define your first histogram. Just sort the values and use those as bin definitions, each bin having a population of 1. It's OK to have dupes (0 width bins).
Now stream in new values. For each one, binary search to find the bin it belongs to.
In the common case, you just increment the population of that bin and continue.
If your sample is beyond the histogram's edges (highest or lowest), just extend the end bin's range to include it.
When your stream is done, you find the median sample value by finding the bin which has equal population on both sides of it, and linearly interpolating the remaining bin-width.
But that's not enough.. you still need to ADAPT the histogram to the data as it's being streamed in. When a bin gets over-full, you're losing information about that bin's sub distribution.
You can fix this by adapting based on some heuristic... The easiest and most robust one is if a bin reaches some certain threshold population (something like 10*v/N where v=# of values seen so far in the stream, and N is the number of bins), you SPLIT that overfull bin. Add a new value at the midpoint of the bin, give each side half of the original bin's population. But now you have too many bins, so you need to DELETE a bin. A good heuristic for that is to find the bin with the smallest product of population and width. Delete it and merge it with its left or right neighbor (whichever one of the neighbors itself has the smallest product of width and population.). Done!
Note that merging or splitting bins loses information, but that's unavoidable.. you only have fixed storage.
This algorithm is nice in that it will deal with all types of input streams and give good results. If you have the luxury of choosing sample order, a random sample is best, since that minimizes splits and merges.
The algorithm also allows you to query any percentile, not just median, since you have a complete distribution estimate.
I use this method in my own code in many places, mostly for debugging logs.. where some stats that you're recording have unknown distribution. With this algorithm you don't need to guess ahead of time.
The downside is the unequal bin widths means you have to do a binary search for each sample, so your net algorithm is O(NlogN).
David's suggestion seems like the most sensible approach for approximating the median.
A running mean for the same problem is a much easier to calculate:
Mn = Mn-1 + ((Vn - Mn-1) / n)
Where Mn is the mean of n values, Mn-1 is the previous mean, and Vn is the new value.
In other words, the new mean is the existing mean plus the difference between the new value and the mean, divided by the number of values.
In code this would look something like:
new_mean = prev_mean + ((value - prev_mean) / count)
though obviously you may want to consider language-specific stuff like floating-point rounding errors etc.
I don't think it is possible to do without having the list in memory. You can obviously approximate with
average if you know that the data is symmetrically distributed
or calculate a proper median of a small subset of data (that fits in memory) - if you know that your data has the same distribution across the sample (e.g. that the first item has the same distribution as the last one)
Find Min and Max of the list containing N items through linear search and name them as HighValue and LowValue
Let MedianIndex = (N+1)/2
1st Order Binary Search:
Repeat the following 4 steps until LowValue < HighValue.
Get MedianValue approximately = ( HighValue + LowValue ) / 2
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
is K = MedianIndex, then return MedianValue
is K > MedianIndex ? then HighValue = MedianValue Else LowValue = MedianValue
It will be faster without consuming memory
2nd Order Binary Search:
LowIndex=1
HighIndex=N
Repeat Following 5 Steps until (LowIndex < HighIndex)
Get Approximate DistrbutionPerUnit=(HighValue-LowValue)/(HighIndex-LowIndex)
Get Approximate MedianValue = LowValue + (MedianIndex-LowIndex) * DistributionPerUnit
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
is (K=MedianIndex) ? return MedianValue
is (K > MedianIndex) ? then HighIndex=K and HighValue=MedianValue Else LowIndex=K and LowValue=MedianValue
It will be faster than 1st order without consuming memory
We can also think of fitting HighValue, LowValue and MedianValue with HighIndex, LowIndex and MedianIndex to a Parabola, and can get ThirdOrder Binary Search which will be faster than 2nd order without consuming memory and so on...
Usually if the input is within a certain range, say 1 to 1 million, it's easy to create an array of counts: read the code for "quantile" and "ibucket" here: http://code.google.com/p/ea-utils/source/browse/trunk/clipper/sam-stats.cpp
This solution can be generalized as an approximation by coercing the input into an integer within some range using a function that you then reverse on the way out: IE: foo.push((int) input/1000000) and quantile(foo)*1000000.
If your input is an arbitrary double precision number, then you've got to autoscale your histogram as values come in that are out of range (see above).
Or you can use the median-triplets method described in this paper: http://web.cs.wpi.edu/~hofri/medsel.pdf
I picked up the idea of iterative quantile calculation. It is important to have a good value for starting point and eta, these may come from mean and sigma. So I programmed this:
Function QuantileIterative(Var x : Array of Double; n : Integer; p, mean, sigma : Double) : Double;
Var eta, quantile,q1, dq : Double;
i : Integer;
Begin
quantile:= mean + 1.25*sigma*(p-0.5);
q1:=quantile;
eta:=0.2*sigma/xy(1+n,0.75); // should not be too large! sets accuracy
For i:=1 to n Do
quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
dq:=abs(q1-quantile);
If dq>eta
then Begin
If dq<3*eta then eta:=eta/4;
For i:=1 to n Do
quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
end;
QuantileIterative:=quantile
end;
As the median for two elements would be the mean, I used a smoothed signum function, and xy() is x^y. Are there ideas to make it better? Of course if we have some more a-priori knowledge we can add code using min and max of the array, skew, etc. For big data you would not use an array perhaps, but for testing it is easier.
On homogeneous random ordered and for big enough list, this pseudo code can work:
# find min on the fly
if minDataPoint > dataPoint:
minDataPoint = dataPoint
# find max on the fly
if maxDataPoint < dataPoint:
maxDataPoint = dataPoint
# estimate median base on the current data
estimate_mid = (maxDataPoint + minDataPoint) / 2
#if **new** dataPoint is closer to the mid? stor it
if abs(midDataPoint - estimate_mid) > abs(dataPoint - estimate_mid):
midDataPoint = dataPoint
Inspired by #lakshmanaraj

Resources