I want to sample K items from a stream of N items that I see one at a time. I don't know how big N is until the last item turns up, and I want the space consumption to depend on K rather than N.
So far I've described a reservoir sampling problem. The major ask though is that I'd like the samples to be 'evenly spaced', or at least more evenly spaced than reservoir sampling manages. This is vague; one formalization would be that the sample indices are a low-discrepancy sequence, but I'm not particularly tied to that.
I'd also like the process to be random and every possible sample to have a non-zero probability of appearing, but I'm not particularly tied to this either.
My intuition is that this is a feasible problem, and the algorithm I imagine preferentially drops samples from the 'highest density' part of the reservoir in order to make space for samples from the incoming stream. It also seems like a common enough problem that someone should have written a paper on it, but Googling combinations of 'evenly spaced', 'reservoir', 'quasirandom', and 'sampling' hasn't gotten me anywhere.
edit #1: An example might help.
Example
Suppose K=3, and I get items 0, 1, 2, 3, 4, 5, ....
After 3 items, the sample would be [0, 1, 2], with spaces of {1}
After 6 items, I'd like to most frequently get [0, 2, 4] with its spaces of {2}, but commonly getting samples like [0, 3, 5] or [0, 2, 5] with spaces of {2, 3} would be good too.
After 9 items, I'd like to most frequently get [0, 4, 8] with its spaces of {4}, but commonly getting samples like [0, 4, 7] with spaces of {4, 3} would be good too.
edit #2: I've learnt a lesson here about providing lots of context when requesting answers. David and Matt's answers are promising, but in case anyone sees this and has a perfect solution, here's some more information:
Context
I have hundreds of low-res videos streaming through a GPU. Each stream is up to 10,000 frames long, and - depending on application - I want to sample 10 to 1000 frames from each. Once a stream is finished and I've got a sample, it's used to train a machine learning algorithm, then thrown away. Another stream is started in its place. The GPU's memory is 10 gigabytes, and a 'good' set of reservoirs occupies a few gigabytes in the current application and plausibly close to the entire memory in future applications.
If space isn't at a premium, I'd oversample using the uniform random reservoir algorithm by some constant factor (e.g., if you need k items, sample 10k) and remember the index that each sampled item appeared at. At the end, use dynamic programming to choose k indexes to maximize (e.g.) the sum of the logs of the gaps between consecutive chosen indexes.
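A rough Python sketch of that idea (my own illustration; the factor of 10 and the log-gap objective are just the example choices from above):

import math
import random

def oversample_then_select(stream, k, factor=10):
    # Sketch, not a definitive method: reservoir-sample factor*k
    # (index, item) pairs, then pick k of them by dynamic programming,
    # maximizing the sum of logs of gaps between consecutive chosen indices.
    m = factor * k
    reservoir = []
    for i, item in enumerate(stream):
        if len(reservoir) < m:
            reservoir.append((i, item))
        else:
            j = random.randrange(i + 1)
            if j < m:
                reservoir[j] = (i, item)   # standard uniform reservoir update
    reservoir.sort()
    idx = [p[0] for p in reservoir]
    n = len(idx)
    if n == 0:
        return []
    k = min(k, n)
    # dp[j][i] = best score using j chosen points, the last one at position i
    dp = [[float("-inf")] * n for _ in range(k + 1)]
    back = [[0] * n for _ in range(k + 1)]
    for i in range(n):
        dp[1][i] = 0.0
    for j in range(2, k + 1):
        for i in range(j - 1, n):
            for p in range(j - 2, i):
                v = dp[j - 1][p] + math.log(idx[i] - idx[p])
                if v > dp[j][i]:
                    dp[j][i], back[j][i] = v, p
    # recover the best chain of k points
    i = max(range(n), key=lambda t: dp[k][t])
    chosen = []
    for j in range(k, 0, -1):
        chosen.append(reservoir[i])
        i = back[j][i]
    return chosen[::-1]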
Here's an algorithm that doesn't require much extra memory. Hopefully it meets your quality requirements.
The high-level idea is to divide the input into k segments and choose one element uniformly at random from each segment. Given the memory constraint, we can't make the segments as even as we would like, but they'll be within a factor of two.
The simple version of this algorithm (which uses 2k reservoir slots and may return a sample of any size between k and 2k) starts by reading the first k elements, then proceeds in rounds. In round r (counting from zero), we read k·2^r elements, using the standard reservoir algorithm to choose one random sample from each segment of 2^r elements. At the end of each round, we append these samples to the existing reservoir and do the following compression step: for each pair of consecutive elements, choose one uniformly at random to retain and discard the other.
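A rough Python sketch of the simple version, as I understand it (the bookkeeping is illustrative, not canonical):

import random

def simple_segment_sample(stream, k):
    # Sketch of the "simple version": 2k slots, final sample size in [k, 2k].
    it = iter(stream)
    reservoir = []
    for _ in range(k):                    # the first k elements seed k slots
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir
    seg = 1                               # segment size 2**r in round r
    while True:
        new = []
        for _ in range(k):                # k segments per round
            sample, count = None, 0
            for _ in range(seg):
                try:
                    x = next(it)
                except StopIteration:
                    if count:
                        new.append(sample)
                    return reservoir + new
                count += 1
                if random.randrange(count) == 0:   # keep x with prob 1/count
                    sample = x
            new.append(sample)
        reservoir += new
        # compression: all 2k slots now cover equal-sized segments, so keeping
        # one of each consecutive pair uniformly preserves uniformity
        reservoir = [pair[random.randrange(2)]
                     for pair in zip(reservoir[::2], reservoir[1::2])]
        seg *= 2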
The complicated version of this algorithm uses k slots and returns a sample of size k by interleaving the round sampling step with compression. Rather than write a formal description, I'll demonstrate it, since I think that will be easier to understand.
Let k = 8. We pick up after 32 elements have been read. I use the notation [a-b] to mean a random element whose index is between a and b inclusive. The reservoir looks like this:
[0-3] [4-7] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
Before we process the next element (32), we have to make room. This means merging [0-3] and [4-7] into [0-7].
[0-7] [32] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
We merge the next few elements into [32].
[0-7] [32-39] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
Element 40 requires another merge, this time of [16-19] and [20-23]. In general, we do merges in a low-discrepancy order.
[0-7] [32-39] [8-11] [12-15] [16-23] [40] [24-27] [28-31]
Keep going.
[0-7] [32-39] [8-11] [12-15] [16-23] [40-47] [24-27] [28-31]
At the end of the round, the reservoir looks like this.
[0-7] [32-39] [8-15] [48-55] [16-23] [40-47] [24-31] [56-63]
We use standard techniques from FFT to undo the butterfly permutation of the new samples and move them to the end.
[0-7] [8-15] [16-23] [24-31] [32-39] [40-47] [48-55] [56-63]
Then we start the next round.
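The only primitive the walkthrough needs is merging two slots. A sketch, where I represent each slot as a (sample, weight) pair (my own bookkeeping, not part of the original description):

import random

def merge_slots(slot_a, slot_b):
    # Each slot is (sample, weight): one uniform sample from a segment of
    # `weight` stream elements. Keeping `a` with probability wa/(wa+wb)
    # makes the merged slot a uniform sample over the combined segment.
    (xa, wa), (xb, wb) = slot_a, slot_b
    keep_a = random.randrange(wa + wb) < wa
    return (xa if keep_a else xb, wa + wb)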
Perhaps the simplest way to do reservoir sampling is to associate a random score with each sample, and then use a heap to remember the k samples with the highest scores.
This corresponds to applying a threshold operation to white noise, where the threshold value is chosen to admit the correct number of samples. Every sample has the same chance of being included in the output set, exactly as if k samples were selected uniformly.
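Here's a minimal Python sketch of that score-and-heap formulation (my illustration):

import heapq
import random

def reservoir_by_scores(stream, k):
    # Attach an independent uniform score to every item; keep the k items
    # with the largest scores. The min-heap root is the current threshold.
    heap = []
    for item in stream:
        score = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))  # evict the lowest score
    return [item for _, item in heap]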
If you sample blue noise instead of white noise to produce your scores, however, then applying a threshold operation will produce a low-discrepancy sequence and the samples in your output set will be more evenly spaced. This effect occurs because, while white noise samples are all independent, blue noise samples are temporally anti-correlated.
This technique is used to create pleasing halftone patterns (google Blue Noise Mask).
Theoretically, it works for any final sampling ratio, but realistically it's limited by numeric precision. I think it has a good chance of working OK for your range of 1-100, but I'd be more comfortable with 1-20.
There are many ways to generate blue noise, but probably your best choices are to apply a high-pass filter to white noise or to construct an approximation directly from 1D Perlin noise.
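Here's a rough sketch of the first option; I've used a first-difference filter as the high-pass, which is my own simplification - real blue noise generators are more careful:

import heapq
import random

def blue_noise_reservoir(stream, k):
    # Same thresholding as before, but scores are high-pass-filtered white
    # noise: score_i = w_i - w_{i-1}. Adjacent scores share a w term with
    # opposite signs, so neighbours rarely both clear the threshold.
    heap = []
    prev = random.random()
    for item in stream:
        w = random.random()
        score, prev = w - prev, w
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))
    return sorted(item for _, item in heap)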
Related
This should be a quite simple problem, but I don't have proper algorithmic training and find myself stuck trying to solve this.
I need to calculate the possible combinations to reach a number by adding a limited set of smaller numbers together.
Imagine that we are playing with LEGO and I have a brick that is 12 units long and I need to list the possible substitutions I can make with shorter bricks. For this example we may say that the available bricks are 2, 4, 6 and 12 units long.
What might be a good approach to building an algorithm that can calculate the substitutions? There are no bounds on how many bricks I can use at a time, so it could be 6x2 as well as 1x12; the important thing is I need to list all of the options.
So the inputs are the target length (in this case 12) and available bricks (an array of numbers (arbitrary length), in this case [2, 4, 6, 12]).
My approach was to start with the lowest number and add it up until I reach the target, then take the next lowest and so on. But that way I miss the combinations of multiple numbers, and when I try to factor that in, it gets really messy.
I suggest a recursive approach: given a function f(target,permissibles) to list all representations of target as a combination of permissibles, you can do this:
def f(target, permissibles):
    if target == 0:
        return [[]]  # one way to make 0: the empty combination
    return [[x] + rest
            for x in permissibles if x <= target
            for rest in f(target - x, permissibles)]
if you do not want to differentiate between 12 = 4+4+2+2 and 12 = 2+4+2+4, you need to sort permissibles in descending order and do
def f(target, permissibles):
    if target == 0:
        return [[]]
    return [[x] + rest
            for x in permissibles if x <= target
            for rest in f(target - x, [p for p in permissibles if p <= x])]
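If I haven't slipped, calling the second version as f(12, [12, 6, 4, 2]) should yield the eight substitutions from the LEGO example: 12; 6+6; 6+4+2; 6+2+2+2; 4+4+4; 4+4+2+2; 4+2+2+2+2; and 2+2+2+2+2+2.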
I want to pick the top "range" of cards based upon a percentage. I have all my possible 2 card hands organized in an array in order of the strength of the hand, like so:
AA, KK, AKsuited, QQ, AKoff-suit ...
I had been picking the top 10% of hands by multiplying the length of the card array by the percentage, which would give me the index of the last card in the range. Then I would just make a copy of the sub-array:
Arrays.copyOfRange(cardArray, 0, 16);
However, I realize now that this is incorrect because there are more possible combinations of, say, Ace King off-suit - 12 combinations (i.e. an ace of one suit and a king of another suit) than there are combinations of, say, a pair of aces - 6 combinations.
When I pick the top 10% of hands, therefore, I want it to be based on the top 10% of all 2-card combinations - 52 choose 2 = 1326.
I thought I could have an array of integers where each index held the combined total of all the combinations up to that point (each index would correspond to a hand from the original array). So the first few indices of the array would be:
6, 12, 16, 22
because there are 6 combinations of AA, 6 combinations of KK, 4 combinations of AKsuited, 6 combinations of QQ.
Then I could do a binary search, which runs in O(log n) time. In other words, I could multiply the total number of combinations (1326) by the percentage, search for the first cumulative count greater than or equal to this number, and that would give me the index into the original array that I need.
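A sketch of that lookup (in Python for brevity, with the cumulative counts from above):

import bisect

cumulative = [6, 12, 16, 22]   # running combo totals: AA, KK, AKs, QQ, ...
TOTAL = 1326                    # 52 choose 2

def cutoff_index(fraction):
    # first hand whose cumulative count reaches fraction * TOTAL
    return bisect.bisect_left(cumulative, fraction * TOTAL)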
I wonder if there is a way that I could do this in constant time instead?
As Groo suggested, if precomputation and memory overhead permit, it would be more efficient to create 6 copies of AA, 6 copies of KK, etc., and store them in a sorted array. Then you could run your original algorithm on this properly weighted list.
This is best if the number of queries is large.
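A sketch of that precomputation (Python for brevity; hand names and counts truncated to the example above):

hands  = ["AA", "KK", "AKs", "QQ"]   # ...continue in strength order
counts = [6, 6, 4, 6]                # combinations per hand

# one entry per concrete 2-card combination, still in strength order
expanded = [h for h, c in zip(hands, counts) for _ in range(c)]

def cutoff_hand(fraction):
    # O(1) per query: index straight into the weighted list
    i = min(int(fraction * len(expanded)), len(expanded) - 1)
    return expanded[i]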
Otherwise, I don't think you can achieve constant time for each query. This is because the queries depend on the entire frequency distribution: you can't look at only a constant number of elements to determine whether it's the correct percentile.
We had a similar discussion here: Algorithm for picking thumbed-up items (basically what you want to do with your list of cards). As a comment to my answer, someone suggested a particular data structure: http://en.wikipedia.org/wiki/Fenwick_tree
Also, make sure your data structure will be able to provide efficient access to, say, the range between top 5% and 15% (not a coding-related tip though ;).
I need to generate a (pseudo) random sequence of N bit integers, where successive integers differ from the previous by only 1 bit, and the sequence never repeats. I know a Gray code will generate non-repeating sequences with only 1 bit difference, and an LFSR will generate non-repeating random-like sequences, but I'm not sure how to combine these ideas to produce what I want.
Practically, N will be very large, say 1000. I want to randomly sample this large space of 2^1000 integers, but I need to generate something like a random walk because the application in mind can only hop from one number to the next by flipping one bit.
Use any random number generator algorithm to generate an integer between 1 and N (or 0 to N-1 depending on the language). Use the result to determine the index of the bit to flip.
In order to satisfy the non-repetition requirement, you will need to store previously generated numbers (thanks ShreevatsaR). Additionally, you may run into a scenario where no non-repeating continuation is possible, so this will require a backtracking algorithm as well.
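A sketch of that approach (the backtracking policy here is one simple choice among many):

import random

def bit_flip_walk(n, steps):
    # Sketch: flip one random bit per step; a visited set enforces
    # non-repetition and we backtrack out of dead ends. Memory grows
    # with the number of states visited, as noted above.
    state = random.getrandbits(n)
    path, visited = [state], {state}
    while len(path) < steps + 1:
        bits = list(range(n))
        random.shuffle(bits)
        for b in bits:
            nxt = path[-1] ^ (1 << b)
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                break
        else:                      # every neighbour already visited
            path.pop()             # backtrack one step
            if not path:
                raise RuntimeError("walk boxed in; no continuation exists")
    return path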
This makes me think of fractals - following a boundary in a Julia set or something along those lines.
If N is 1000, use a 2^500 x 2^500 fractal bitmap (obviously don't generate it in advance - you can derive each pixel on demand, and most won't be needed). Each pixel move is one pixel up, down, left or right following the boundary line between pixels, like a simple bitmap tracing algorithm. So long as you start at the edge of the bitmap, you should return to the edge of the bitmap sooner or later - following a specific "colour" boundary should always give a closed curve with no self-crossings, if you look at the unbounded version of that fractal.
The x and y axes of the bitmap will need "Gray coded" co-ordinates, of course - a bit like oversized Karnaugh maps. Each step in the tracing (one pixel up, down, left or right) equates to a single-bit change in one bitmap co-ordinate, and therefore in one bit of the resulting values in the random walk.
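For reference, the binary-reflected Gray code and its inverse are tiny; a sketch:

def gray(i):
    # consecutive integers map to codes differing in exactly one bit
    return i ^ (i >> 1)

def gray_inverse(g):
    # undo the encoding by folding the bits back down
    i = g
    while g:
        g >>= 1
        i ^= g
    return i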
EDIT
I just realised there's a problem. The more wrinkly the boundary, the more likely you are in the tracing to hit a point where you have a choice of directions, such as...
* | .
---+---
. | *
Whichever direction you enter this point, you have a choice of three ways out. Choose the wrong one of the other two and you may return to this point, so this is a possible self-crossing point and a possible repeat. You can eliminate the continue-in-the-same-direction choice - whichever way you turn should keep the same boundary colours to the left and right of your boundary path as you trace - but this still leaves a choice of two directions.
I think the problem can be eliminated by having at least three colours in the fractal, and by always keeping the same colour to one particular side (relative to the trace direction) of the boundary. There may be an "as long as the fractal isn't too wrinkly" proviso, though.
The last resort fix is to keep a record of points where this choice was available. If you return to the same point, backtrack and take the other alternative.
While an algorithm like this:
seed()
i = random(0, 2**n)         # random n-bit starting value
repeat:
    i ^= 1 << random(0, n)  # flip one randomly chosen bit
    yield i
…would return a random sequence of integers, each differing from the previous by 1 bit, it would require a huge array for backtracing to ensure uniqueness of the numbers.
Furthermore, your running time would increase exponentially(?) with the increasing density of your backtrace, as the chance of hitting a new, non-repeating number decreases with every number added to the sequence.
To reduce time and space one could try to incorporate one of these:
Bloom Filter
Use a Bloom Filter to drastically reduce the space (and time) needed for uniqueness-backtracing.
As Bloom filters come with the drawback of producing false positives from time to time, a certain rate of falsely detected repeats (which are thus skipped) would occur in your sequence.
While the use of a Bloom filter would reduce the space and time needed for the backtrace, your running time would still increase exponentially(?)…
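A minimal Bloom filter sketch (the hashing scheme - salted SHA-256 - is just one simple choice):

import hashlib

class BloomFilter:
    # Minimal sketch for the uniqueness check described above. False
    # positives show up as states wrongly flagged "already visited".
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))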
Hilbert Curve
A Hilbert Curve represents a non-repeating (kind of pseudo-random) walk on a square grid (or in a cube), with each step being of length 1.
Using a Hilbert Curve (on an appropriate distribution of values) one might be able to get rid of the need for a backtrace entirely.
To give your sequence a seed, you'd generate n random numbers between 0 and s (n being the dimension of your plane/cube/hypercube, s being the length of its sides).
Not only would a Hilbert Curve remove the need for a backtrace, it would also make the sequencer run in O(1) per number (in contrast to the use of a backtrace, which would make your running time increase exponentially(?) over time…)
That is, to seed your sequence you'd wrap-shift your n-dimensional distribution by these random displacements in each of its n dimensions.
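For the 2-D case, the standard iterative index-to-coordinates conversion looks like this (a sketch; the n-dimensional version you'd need for N=1000 is considerably more involved):

def hilbert_d2xy(order, d):
    # Map distance d along a Hilbert curve filling a 2**order x 2**order
    # grid to (x, y). Consecutive d values give neighbouring cells.
    x = y = 0
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                  # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y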
Ps: You might get better answers here: CSTheory # StackExchange (or not, see comments)
I have a list of size n which contains n consecutive members of an arithmetic progression which are not in order. I changed less than half of the elements in this list with some random integer. From this new list, how can I find the difference of the initial arithmetic progression?
I thought a lot about it but except brute force, I was not able to come up with any other thing :(
Thanks for thinking on this one :)
It's not possible to solve this in general and be 100% sure that your answer is correct. Let's say that the initial list is the following arithmetic progression (not in order):
1 3 2 4
Change less than half the elements at random... let's say for example that we changed 2 to 5:
1 3 5 4
If we could first find out which numbers we need to change to obtain a valid shuffled arithmetic sequence, then we could easily solve the problem stated in the question. However, we can see that there are multiple possible answers depending on which number we choose to change:
6, 3, 5, 4 (difference is 1)
1, 3, 2, 4 (difference is 1)
1, 3, 5, 7 (difference is 2)
There is no way to know which of these possible sequence is the original sequence, so you cannot be sure what the original difference was.
Since there is no deterministic solution for the problem (as stated by Mark Byers), you can try a probabilistic approach.
It's difficult to recover the original progression, but its rate can be obtained easily by comparing the differences between elements: the differences between original elements will be multiples of the rate.
Consider taking 2 elements from the list (the probability that both of them belong to the original sequence is 1/4) and computing their difference. This difference, with probability 1/4, will be a multiple of the rate. Decompose it into prime factors and count them (for example, 12 = 2^2 * 3 will add 2 to 2's counter and will increment 3's counter).
After many such iterations (it looks like a good problem for probabilistic methods, like Monte Carlo), you could analyze the counters.
If a prime factor belongs to the rate, its counter will be at least num_iterations/4 (or num_iterations/2 if it appears twice).
The main problem is that small factors will appear with high probability in random input (for example, the difference between two random numbers will be divisible by 2 with 50% probability). So you'll have to compensate for it: since 3/4 of your differences were random, you'll have to consider that (3/8)*num_iterations of 2's counter must be ignored. Since this also applies to all powers of two, the simplest way is to pregenerate a "white noise mask" by taking the differences only between random numbers.
EDIT: let's take this approach further. Consider that you create this "white noise mask" (let's call it a spectrum) for random numbers, and consider that it's a base-1 spectrum, since the smallest "largest common factor" of random numbers is 1. By computing the same thing for the differences of the arithmetic sequence, you'll obtain a base-R spectrum, where R is the rate, and it will be equivalent to a shifted version of the base-1 spectrum. So you have to find the value of R such that
your_spectrum ~= spectrum(1)*3/4 + spectrum(R)*1/4
You could also check for largest number R such that at least half of the elements will be equal modulo R.
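A brute-force sketch of that last check (workable only for small value ranges; a real implementation would test only divisors of pairwise differences):

from collections import Counter

def find_rate(values):
    # Largest R such that more than half the values agree modulo R.
    # Original AP members are all congruent mod the true difference,
    # and they make up more than half the list.
    n = len(values)
    for r in range(max(values) - min(values), 0, -1):
        residues = Counter(v % r for v in values)
        if max(residues.values()) * 2 > n:
            return r
    return 1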
I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.
If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?
Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.
You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could turn out to be the median. It is probably simplest to just store all the points and find the median. If saving ceil(n/2) points is of value, then read the first n/2 points into a sorted list (a binary tree is probably best), then, as new points are added, throw out the low or high points and keep track of the number of points thrown out on each end.
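When n is known up front, that ceil(n/2) bound is achievable with a single heap; a sketch for odd n:

import heapq

def median_known_n(stream, n):
    # Keep the k = (n + 1) // 2 smallest values seen so far in a max-heap
    # (simulated by negation). After n items its root is the k-th smallest,
    # i.e. the median for odd n.
    k = (n + 1) // 2
    heap = []
    for v in stream:
        heapq.heappush(heap, -v)
        if len(heap) > k:
            heapq.heappop(heap)     # evict the largest of the kept values
    return -heap[0]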
Edit:
If the stream length is unknown, then obviously, as Stephen observed in the comments, we have no choice but to remember everything. If duplicate items are likely, we could possibly save a bit of memory using Dolphin's idea of storing values and counts.
I had the same problem and found an approach that has not been posted here. Hopefully my answer can help someone in the future.
If you know your value range and don't care much about the precision of the median, you can incrementally build a histogram of quantized values using constant memory. Then it is easy to find the median, or the value at any rank, up to your quantization error.
For example, suppose your data stream is image pixel values and you know these values are integers all falling within 0-255. To build the image histogram incrementally, just create 256 counters (bins) starting from zero and increment the bin corresponding to each pixel value while scanning through the input. Once the histogram is built, find the first cumulative count that is larger than half of the data size to get the median.
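A sketch of the pixel example (assuming 8-bit integer values, as above):

def histogram_median(pixels):
    # Constant memory: 256 counters for 8-bit values, then walk the
    # cumulative counts until half the data size is reached.
    bins = [0] * 256
    n = 0
    for v in pixels:
        bins[v] += 1
        n += 1
    half = (n + 1) // 2
    cumulative = 0
    for value, count in enumerate(bins):
        cumulative += count
        if cumulative >= half:
            return value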
For data that are real numbers, you can still compute a histogram, with each bin covering a quantized range of values (e.g. bins of 10s, 1s, or 0.1s etc.), depending on your expected data value range and the precision you want.
If you don't know the value range of the entire data sample, you can still estimate the possible value range of the median and compute the histogram within this range. This naturally drops outliers, which is exactly what we want when computing the median.
You can:
- Use statistics, if that's acceptable - for example, you could use sampling.
- Use knowledge about your number stream:
  - a counting-sort-like approach (k distinct values means O(k) memory),
  - or tossing out known outliers and keeping a (high, low) counter.
- If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).
If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.
Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored count reaches (say) 12,000 you could discard the highest 1,000 and lowest 1,000, dropping storage back to 10,000.
If the distribution of values is fairly consistent, this would work well. However, if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically, if you discard a "high" value that is less than the (eventual) median, or a "low" value that is equal to or greater than the (eventual) median, then your calculation is off.
Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.
Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.
However, let's say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
That gives a median of 6, which is incorrect.
If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.
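For what it's worth, a sketch of the trimming scheme with the numbers from the 100,000-value example (subject to exactly the caveat demonstrated above):

import bisect

def trimmed_median(stream, cap=12000, trim=1000):
    # Sketch of the trimming idea above. Dropping equal counts from both
    # ends keeps the buffer centred on the true median ONLY when the
    # extremes really do straddle it; see the caveat in the example.
    buf = []
    for v in stream:
        bisect.insort(buf, v)
        if len(buf) >= cap:
            del buf[:trim]          # discard the lowest `trim` values
            del buf[-trim:]         # discard the highest `trim` values
    m = len(buf)
    return buf[m // 2] if m % 2 else (buf[m // 2 - 1] + buf[m // 2]) / 2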