Incremental median computation with max memory efficiency - algorithm

I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.
If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?
Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.

You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could be the median. It is probably simplest to just store the points and find the median. If saving ceil(n/2) points is of value, then read in the first n/2 points into a sorted list (a binary tree is probably best), then as new points are added throw out the low or high points and keep track of the number of points on either end thrown out.
Edit:
If the stream length is unknown, then obviously, as Stephen observed in the comments, then we have no choice but to remember everything. If duplicate items are likely, we could possibly save a bit of memory using Dolphins idea of storing values and counts.

I had the same problem and got a way that has not been posted here. Hopefully my answer can help someone in the future.
If you know your value range and don't care much about median value precision, you can incrementally create a histogram of quantized values using constant memory. Then it is easy to find median or any position of values, with your quantization error.
For example, suppose your data stream is image pixel values and you know these values are integers all falling within 0~255. To create the image histogram incrementally, just create 256 counters (bins) starting from zeros and count one on the bin corresponding to the pixel value while scanning through the input. Once the histogram is created, find the first cumulative count that is larger than half of the data size to get median.
For data that are real numbers, you can still compute histogram with each bin having quantized values (e.g. bins of 10's, 1's, or 0.1's etc.), depending on your expected data value range and precision you want.
If you don't know the value range of entire data sample, you can still estimate the possible value range of median and compute histogram within this range. This drops outliers by nature but is exactly what we want when computing median.

You can
Use statistics, if that's acceptable - for example, you could use sampling.
Use knowledge about your number stream
using a counting sort like approach: k distinct values means storing O(k) memory)
or toss out known outliers and keep a (high,low) counter.
If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).

If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.
Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored number gets to (say) 12,000 you could discard the highest 1000 and lowest 1000, dropping storage back to 10,000.
If the distribution of values is fairly consistent, this would work well. However if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically if you discard a "high" value that is less than the (eventual) median or a "low" value that is equal or greater than the (eventual) median then your calculation is off.
Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.
Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.
However, lets say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
With a median of 6 which is incorrect.
If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.

Related

How to make a uniform random distribution but where result is revealed in steps?

For example, let's say there is a array of items each equally likely to be chosen, and the output of this random function will tell which item to be chosen, but I want the function to be split into multiple steps so that along each step the list of potential items is narrowed in giving better insight on the result probabilities.
Here's a step by step example of how it might work:
Step 1: Every item is 1/1000 chance.
Step 2: Random subset of half the original set is removed, so each remaining item is 1/500 now.
Step 3: Repeat step 2 until narrowed down to a single item.
The requirements I'd like for the algorithm is < O(n) time complexity and at each step the distribution is still uniformly random.
Initially I though to have an algorithm which:
Start with variables min and max describing the current range of values left.
Shrink the range by generating random float number between [-1, 1] which is applied to the range to shrink it proportionally. If random number is negative then lower the max, otherwise raise the min. So 50% of the time it is shifting the min up, and shifting the max down, and the range is shrinking by a factor between [0,1].
Repeat 2. until range converges on a single number.
But I noticed this doesn't have a uniform distribution, and instead it is more common for the chosen result to be closer to starting min and max values. So to fix this I think one could add a preliminary step where the starting range is offset by another random value. But this would only fix in making the starting distribution uniformly random, and it still doesn't fit my requirement of making it uniformly random at every step.
The naive solution is to generate random numbers and remove those from the list until at each step, but that is a O(n) solution so I hope there is something better.
You just have to apply Bayes' Theorem.
If you randomly remove a portion p of the remaining possibilities, the remaining items have their probabilities multiplied by 1/(1-p). So in your step 2, the probabilities change by an amount corresponding to how much the range changed. And not by a fixed factor.
This problem has some very simple answers so maybe that is why people seemed confused.
One solution is to generate a random number between [0,n] where n is the number of items in the current set, and instead of just removing it, you remove a range of items around that point.
Solution two is a bit more complicated but has the property of preserving set order + location such that the resulting set is just a spliced section of the original set, wheras the first solution's resulting set could be made of up multiple sections of the original set. The method here is as described initially in my post, but you also apply the random offset during each turn, not just once at the beginning.

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid with an array of lamp coordinates. Each lamp provides illumination to every square on their x axis, every square on their y axis, and every square that lies in their diagonal (think of a Queen in chess). Given an array of query coordinates, determine whether that point is illuminated or not. The catch is when checking a query all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity but it's not that fast, exponential in fact. Where should I focus on instead, speed or space? Also, if you have any input as to how you would solve this problem please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs like on by using the first point as the "origin" for both nw-se and ne-sw. The diagonals through this point are both numbered zero. The nw-se diagonals increase per-pixel in e.g the northeast direction, and decreasing (negative) to the southwest. Similarly ne-sw are numbered increasing in the e.g. the northwest direction and decreasing (negative) to the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne dag #). You don't need to store these explicitly. Rather you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal and similarly for the other maps.
Given a query point, look up it's corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If it's only 3 or fewer, it's a constant time operation to see if they're adjacent.
This solution requires the input points to be represented in 4 lists. Since they need to be represented in one list, you can argue that this algorithm requires only a constant factor of space with respect to the input. (I.e. the same sort of cost as mergesort.)
Run time is expected constant per query point for 4 hash table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient and easiest to run it on one big machine. With a billion lamposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lampost in an unboxed structures language like C. So a ~32Gb RAM machine ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10 core machine ought to do a billion queries in well less than a minute.
There is very easy Answer which works
Create Grid of NxN
Now for each Lamp increment the count of all the cells which suppose to be illuminated by the Lamp.
For each query check if cell on that query has value > 0;
For each adjacent cell find out all illuminated cells and reduce the count by 1
This worked fine but failed for size limit when trying for 10000 X 10000 grid

How to detect partitions (clusters) of sparse data in linear time, and (hopefully) sublinear space?

Let's say I have m integers from n disjoint integer intervals, which are in some sense "far" apart.
n is not known beforehand, but it is known to be small (see assumptions below).
For example, for n = 3, I might have been given randomly distributed integers from the intervals 105-2400, 58030-571290, 1000000-1000100.
Finding the minimum (105) and maximum (1000100) is clearly trivial.
But is there any way to efficiently (in O(m) time and hopefully o(m) space) find the intervals' boundary points, so that I can quickly partition the data for separate processing?
If there is no efficient way to do this exactly, is there an efficient way to approximate the partitions, to within a small constant factor (like 2)?
(For example, 4000 would be an acceptable approximation of the upper bound of the smaller interval, and 30000 would be an acceptable approximation of the lower bound of the middle interval.)
Assumptions:
Everything is nonnegative
n is very small (say, < 10)
The max value is comparatively large (say, on the order of 226)
The intervals are dense (i.e. there exists an integer in the array for most values inside that interval)
Two clusters are far apart if their closest boundaries are at least a constant factor c apart.
(Edit: It makes sense for c to be relative to the cluster size rather than relative to the bound. So a cluster of 1 element at 1000000 should not be approximated as originating from the interval 500000-2000000.)
The integers are not sorted, and this is crucial. In fact sorting them in O(m) time is impossible without radix sort, but radix sort could have O(max value) complexity, and there is no guarantee the max value is anywhere close to m.
Again, speed is the most important factor here; inaccuracy is tolerated as long as it's within a reasonable factor.
I say move to a logarithmic scale of factor c to search for the intervals, since you know them being at least c factor apart. Then, make an array of counters, each counter counts numbers in intervals (0.5X .. 0.5X+0.5) logarithmic scale, where X is the index of selected counter.
Say, c is 2, and you know the maximum upper bound of 226, so you create 52 counters, then calculate floor(2*log<sub>2</sub>i) where i is the current integer, and increment that counter. After you parse all of m integers, walk through that array, and each sequence of zeroes in there will mean that the corresponding logarithmic interval is empty.
So the output of this will be the sequence of occupied intervals, logarithmically aligned to halves of power of c, aka 128, 181, 256, 363, 512 etc. This satisfies your requirements on precision for intervals' boundaries.
Update: You can also store lowest and highest number out of those that hit the interval. Once you do, the intervals' boundaries are calculated as follows:
Find the first nonzero counter from current position in the counters array. Take its lowest number as lower bound.
Progress through the counters array until you find a zero in the counter or hit the array's end. Take the last nonzero counter's highest number. This will be your upper bound for the current interval.
Proceed until full traversal of the array.
Return the set of intervals found. The boundaries will be strict.
An example: (abstract language code)
counters=[];
lowest=[];
highest=[];
for (i=0;i<m;i++) {
x=getNextInteger();
n=Math.floor(2.0*logByBase(c,x));
counters[n]++;
if (counters[n]==1) {
lowest[n]=x;
highest[n]=x;
} else {
if (lowest[n]>x) lowest[n]=x;
if (highest[n]<x) highest[n]=x;
}
}
zeroflag=true; /// are we in mode of finding a zero or a nonzero
intervals=[];
currentLow=0;
currentHigh=0;
for (i=0;i<counters.length;i++) {
if (zeroflag) {
// we search for nonzero
if (counters[i]>0) {
currentLow=lowest[i]; // there's a value
zeroflag=false;
} // else skip
} else {
if (counters[i]==0) {
currentHigh=highest[i-1]; // previous was nonzero, get its highest
intervals.push([currentLow,currentHigh]); // store interval
zeroflag=true;
}
}
if (!zeroflag) { // unfinished interval
currentHigh=highest[counters.length-1];
intervals.push([currentLow,currentHigh]); // store last interval
}
You may want to look at aproximate median finding.
These methods can often be generalized to finding arbitrary quantiles with a reasonable precision; and quantiles are good for distributing your work load.
Here's my take, using two passes over the data set.
sample 10000 objects from your data set.
Solve your problem for the sample objects.
Re-scan your data set, assigning each object to the nearest interval from your sample, and tracking the minimum and maximum of each interval.
If your gaps are prominent enough, they should still be visible in the sample. The second pass is only to refine the interval boundaries.
Split the total range into buckets. The bucket boundaries X_i should be spaced in an appropriate way, like e.g. linearly X_i=16*i. Other options would be quadratic spacing like X_i=4*i*i or logarithmic X_i=2^(i/16), here the total number of buckets would be smaller but finding the right bucket for a given number would be more effort. Each bucket is empty or non-empty, so one bit would be sufficient.
You iterate over the set of numbers, and for each number you mark its bucket as non-empty. Then the gaps between your intervals are represented by series of empty buckets. So now you find all sufficiently long series of empty buckets, and you have the interval gaps. The accuracy of the interval boundary will be determined by the bucket size, so assuming a bucket size of 16 you interval border is off by at most 15. If the max number is 226 and buckets are size 16 and you use one bit for each bucket you need 219 byte, or 512kB of memory.
Actually I think I was wrong when I posted the question -- the answer does seem to be radix sort.
The number of buckets is arbitrary, it doesn't have to correlate with the sizes of the intervals.
It might even be 2, if I go bit-by-bit.
Thus radix sort could help me sort the data in O(m log(max value)) ≈ O(m) time (since log(max value) is essentially a constant factor of 26 according to the assumptions), at which point the problem becomes trivial.

Generate random sequence of integers differing by 1 bit without repeats

I need to generate a (pseudo) random sequence of N bit integers, where successive integers differ from the previous by only 1 bit, and the sequence never repeats. I know a Gray code will generate non-repeating sequences with only 1 bit difference, and an LFSR will generate non-repeating random-like sequences, but I'm not sure how to combine these ideas to produce what I want.
Practically, N will be very large, say 1000. I want to randomly sample this large space of 2^1000 integers, but I need to generate something like a random walk because the application in mind can only hop from one number to the next by flipping one bit.
Use any random number generator algorithm to generate an integer between 1 and N (or 0 to N-1 depending on the language). Use the result to determine the index of the bit to flip.
In order to satisfy randomness you will need to store previously generated numbers (thanks ShreevatsaR). Additionally, you may run into a scenario where no non-repeating answers are possible so this will require a backtracking algorithm as well.
This makes me think of fractals - following a boundary in a julia set or something along those lines.
If N is 1000, use a 2^500 x 2^500 fractal bitmap (obviously don't generate it in advance - you can derive each pixel on demand, and most won't be needed). Each pixel move is one pixel up, down, left or right following the boundary line between pixels, like a simple bitmap tracing algorithm. So long as you start at the edge of the bitmap, you should return to the edge of the bitmap sooner or later - following a specific "colour" boundary should always give a closed curve with no self-crossings, if you look at the unbounded version of that fractal.
The x and y axes of the bitmap will need "Gray coded" co-ordinates, of course - a bit like oversized Karnaugh maps. Each step in the tracing (one pixel up, down, left or right) equates to a single-bit change in one bitmap co-ordinate, and therefore in one bit of the resulting values in the random walk.
EDIT
I just realised there's a problem. The more wrinkly the boundary, the more likely you are in the tracing to hit a point where you have a choice of directions, such as...
* | .
---+---
. | *
Whichever direction you enter this point, you have a choice of three ways out. Choose the wrong one of the other two and you may return back to this point, therefore this is a possible self-crossing point and possible repeat. You can eliminate the continue-in-the-same-direction choice - whichever way you turn should keep the same boundary colours to the left and right of your boundary path as you trace - but this still leaves a choice of two directions.
I think the problem can be eliminated by making having at least three colours in the fractal, and by always keeping the same colour to one particular side (relative to the trace direction) of the boundary. There may be an "as long as the fractal isn't too wrinkly" proviso, though.
The last resort fix is to keep a record of points where this choice was available. If you return to the same point, backtrack and take the other alternative.
While an algorithm like this:
seed()
i = random(0, n)
repeat:
i ^= >> (i % bitlen)
yield i
…would return a random sequence of integers differing each by 1 bit, it would require a huge array for backtracing to ensure uniqueness of numbers.
Further more your running time would increase exponentially(?) with increasing density of your backtrace, as the chance to hit a new and non-repeating number decreases with every number in the sequence.
To reduce time and space one could try to incorporate one of these:
Bloom Filter
Use a Bloom Filter to drastically reduce the space (and time) needed for uniqueness-backtracing.
As Bloom Filters come with the drawback of producing false positives from time to time a certain rate of falsely detected repeats (sic!) (which thus are skipped) in your sequence would occur.
While the use of a Bloom Filter would reduce the space and time your running time would still increase exponentially(?)…
Hilbert Curve
A Hilbert Curve represents a non-repeating (kind of pseudo-random) walk on a quadratic plane (or in a cube) with each step being of length 1.
Using a Hilbert Curve (on an appropriate distribution of values) one might be able to get rid of the need for a backtrace entirely.
To enable your sequence to get a seed you'd generate n (n being the dimension of your plane/cube/hypercube) random numbers between 0 and s (s being the length of your plane's/cube's/hypercube's sides).
Not only would a Hilbert Curve remove the need for a backtrace, it would also make the sequencer run in O(1) per number (in contrast to the use of a backtrace, which would make your running time increase exponentially(?) over time…)
To seed your sequence you'd wrap-shift your n-dimensional distribution by random displacements in each of its n dimension.
Ps: You might get better answers here: CSTheory # StackExchange (or not, see comments)

Finding the Nth largest value in a group of numbers as they are generated

I'm writing a program than needs to find the Nth largest value in a group of numbers. These numbers are generated by the program, but I don't have enough memory to store N numbers. Is there a better upper bound than N that can be acheived for storage? The upper bound for the size of the group of numbers (and for N) is approximately 100,000,000.
Note: The numbers are decimals and the list can include duplicates.
[Edit]: My memory limit is 16 MB.
This is a multipass algorithm (therefore, you must be able to generate the same list multiple times, or store the list off to secondary storage).
First pass:
Find the highest value and the lowest value. That's your initial range.
Passes after the first:
Divide the range up into 10 equally spaced bins. We don't need to store any numbers in the bins. We're just going to count membership in the bins. So we just have an array of integers (or bigints--whatever can accurately hold our counts) Note that 10 is an arbitrary choice for the number of bins. Your sample size and distribution will determine the best choice.
Spin through each number in the data, incrementing the count of whichever bin holds the number you see.
Figure out which bin has your answer, and add how many numbers are above that bin to a count of numbers above the winning bin.
The winning bin's top and bottom range are your new range.
Loop through these steps again until you have enough memory to hold the numbers in the current bin.
Last pass:
You should know how many numbers are above the current bin by now.
You have enough storage to grab all the numbers within your range of the current bin, so you can spin through and grab the actual numbers. Just sort them and grab the correct number.
Example: if the range you see is 0.0 through 1000.0, your bins' ranges will be:
(- 0.0 - 100.0]
(100.0 - 200.0]
(200.0 - 300.0]
...
(900.0 - 1000.0)
If you find through the counts that your number is in the (100.0 - 2000.0] bin, your next set of bins will be:
(100.0 - 110.0]
(110.0 - 120.0]
etc.
Another multipass idea:
Simply do a binary search. Choose the midpoint of the range as the first guess. Your passes just need to do an above/below count to determine the next estimate (which can be weighted by the count, or a simple average for code simplicity).
Are you able to regenerate the same group of numbers from start? If you are, you could make multiple passes over the output: start by finding the largest value, restart the generator, find the largest number smaller than that, restart the generator, and repeat this until you have your result.
It's going to be a real performance killer, because you have a lot of numbers and a lot of passes will be required - but memory-wise, you will only need to store 2 elements (the current maximum and a "limit", the number you found during the last pass) and a pass counter.
You could speed it up by using your priority queue to find the M largest elements (choosing some M that you are able to fit in memory), allowing you to reduce the number of passes required to N/M.
If you need to find, say, the 10th largest element in a list of 15 numbers, you could save time by working the other way around. Since it is the 10th largest element, that means there are 15-10=5 elements smaller than this element - so you could look for the 6th smallest element instead.
This is similar to another question -- C Program to search n-th smallest element in array without sorting? -- where you may get some answers.
The logic will work for Nth largest/smallest search similarly.
Note: I am not saying this is a duplicate of that.
Since you have a lot (nearly 1 billion?) numbers, here is another way for space optimization.
Lets assume your numbers fit in 32-bit values, so about 1 billion would require sometime close to 32GB space. Now, if you can afford about 128MB of working memory, we can do this in one pass.
Imagine a 1 billion bit-vector stored as an array of 32-bit words
Let it be initialized to all zeros
Start running through your numbers and keep setting the correct bit position for the value of the number
When you are done with one pass, start counting from the start of this bit vector for the Nth set-bit
That bit's position gives you the value for your Nth largest number
You have actually sorted all the numbers in the process (however, count of duplicates is not tracked)
If I understood well, the upper bound memory usage for your program is O(N) (possibly N+1). You can maintain a list of the generated values that are greater than the current X (the Nth largest value so far) ordered by lowest first. As soon as a new greater value is generated, you can replace the current X by the first element of the list and insert the just generated value to its corresponding position in the list.
sort -n | uniq -c and the Nth should be the Nth row

Resources