Something like a reversed random number generator - algorithm

I really don't know what the name of this problem is, but it's something like lossy compression. My English isn't great, but I will try to describe it as well as I can.
Suppose I have a list of unsorted unique numbers from an unknown source. The length is usually between 255 and 512, with values in the range 0 to 512.
I wonder if there is some kind of algorithm that reads the data and returns something like a seed number that I can use to regenerate a list close to the original, but with some degree of error.
For example:
Original list:
{5, 13, 25, 33, 3, 10}
Regenerated list:
{4, 10, 30, 30, 5, 5} or {8, 20, 20, 35, 5, 9} // and so on
Does this problem have a name, and is there an algorithm that can do what I just described?
Is it the same as the Monte Carlo method? From what I understand, it isn't.
Is it possible to use some of the techniques used in lossy compression to get this kind of approximation?
What I tried was to use a simple 16-bit RNG, brute-force all possible seed values, compare each generated list to the original, and pick the seed with the minimum difference, but I think this approach is rather dumb and inefficient.
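For what it's worth, a minimal sketch of that brute-force search, assuming the regenerated list comes from a seeded PRNG drawing the same number of values in the same range, and scoring candidates by the sum of absolute element-wise differences (both of those choices are mine, not part of the question):

import random

def regenerate(seed, count, lo=0, hi=512):
    # Rebuild a candidate list from a single seed value.
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for _ in range(count)]

def best_seed(original):
    # Try every 16-bit seed and keep the one whose output is closest
    # to the original (sum of absolute element-wise differences).
    best, best_err = None, float("inf")
    for seed in range(1 << 16):
        candidate = regenerate(seed, len(original))
        err = sum(abs(a - b) for a, b in zip(original, candidate))
        if err < best_err:
            best, best_err = seed, err
    return best, best_err

print(best_seed([5, 13, 25, 33, 3, 10]))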

This is indeed lossy compression.
You don't tell us the range of the values in the list. From the samples you give, we can extrapolate that they take at least 6 bits each (0 to 63). In total, you have up to 3072 bits to compress.
If these sequences have no special property and appear to be random, I doubt there is any way to achieve significant compression. Consider that the probability of an arbitrary sequence being matched from a 32-bit seed is 2^32 · 2^(-3072) ≈ 7×10^(-916), i.e. less than infinitesimal. Even if you allow 10% error on every value, the probability of a match is about 2^32 · 0.1^512 ≈ 4×10^(-503).
A trivial way to compress with 12.5% accuracy is to drop the three LSBs of each value, giving 50% savings (1536 bits), but I doubt this is what you are looking for.
It would be useful to measure the entropy of the sequences (http://en.wikipedia.org/wiki/Entropy_(information_theory)) and/or look for possible correlations between the values. This can be done by plotting all (Vi, Vi+1) pairs, or (Vi, Vi+1, Vi+2) triples, and looking for patterns.
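As a rough sketch of both measurements (empirical entropy of the values and the lag pairs you would plot), assuming the sequence is just a Python list:

import math
from collections import Counter

def empirical_entropy(values):
    # Shannon entropy in bits per symbol, estimated from value frequencies.
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def lag_pairs(values, lag=1):
    # (V_i, V_{i+lag}) pairs; plot these and look for structure.
    return list(zip(values, values[lag:]))

seq = [5, 13, 25, 33, 3, 10]
print(empirical_entropy(seq))
print(lag_pairs(seq))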

Related

Evenly-spaced samples from a stream of unknown length

I want to sample K items from a stream of N items that I see one at a time. I don't know how big N is until the last item turns up, and I want the space consumption to depend on K rather than N.
So far I've described a reservoir sampling problem. The major ask though is that I'd like the samples to be 'evenly spaced', or at least more evenly spaced than reservoir sampling manages. This is vague; one formalization would be that the sample indices are a low-discrepancy sequence, but I'm not particularly tied to that.
I'd also like the process to be random and every possible sample to have a non-zero probability of appearing, but I'm not particularly tied to this either.
My intuition is that this is a feasible problem, and the algorithm I imagine preferentially drops samples from the 'highest density' part of the reservoir in order to make space for samples from the incoming stream. It also seems like a common enough problem that someone should have written a paper on it, but Googling combinations of 'evenly spaced', 'reservoir', 'quasirandom', and 'sampling' hasn't gotten me anywhere.
edit #1: An example might help.
Example
Suppose K=3, and I get items 0, 1, 2, 3, 4, 5, ....
After 3 items, the sample would be [0, 1, 2], with spaces of {1}
After 6 items, I'd like to most frequently get [0, 2, 4] with its spaces of {2}, but commonly getting samples like [0, 3, 5] or [0, 2, 5] with spaces of {2, 3} would be good too.
After 9 items, I'd like to most frequently get [0, 4, 8] with its spaces of {4}, but commonly getting samples like [0, 4, 7] with spaces of {4, 3} would be good too.
edit #2: I've learnt a lesson here about providing lots of context when requesting answers. David and Matt's answers are promising, but in case anyone sees this and has a perfect solution, here's some more information:
Context
I have hundreds of low-res videos streaming through a GPU. Each stream is up to 10,000 frames long, and - depending on application - I want to sample 10 to 1000 frames from each. Once a stream is finished and I've got a sample, it's used to train a machine learning algorithm, then thrown away. Another stream is started in its place. The GPU's memory is 10 gigabytes, and a 'good' set of reservoirs occupies a few gigabytes in the current application and plausibly close to the entire memory in future applications.
If space isn't at a premium, I'd oversample using the uniform random reservoir algorithm by some constant factor (e.g., if you need k items, sample 10k) and remember the index that each sampled item appeared at. At the end, use dynamic programming to choose k indexes to maximize (e.g.) the sum of the logs of the gaps between consecutive chosen indexes.
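Here is a hedged sketch of the selection stage of that idea, assuming the oversampled indices have already been collected by an ordinary reservoir and that the objective is the sum of logs of the gaps between consecutive chosen indices (a plain O(k·m^2) dynamic program, nothing clever):

import math

def choose_spread(indices, k):
    # Pick k of the oversampled indices so that the sum of log(gap)
    # between consecutive chosen indices is maximised.
    idx = sorted(indices)
    m = len(idx)
    NEG = float("-inf")
    # dp[j][i] = best score choosing j indices with idx[i] chosen last
    dp = [[NEG] * m for _ in range(k + 1)]
    back = [[None] * m for _ in range(k + 1)]
    for i in range(m):
        dp[1][i] = 0.0
    for j in range(2, k + 1):
        for i in range(m):
            for p in range(i):
                if dp[j - 1][p] == NEG:
                    continue
                score = dp[j - 1][p] + math.log(idx[i] - idx[p])
                if score > dp[j][i]:
                    dp[j][i], back[j][i] = score, p
    # Recover the best chain of k indices ending anywhere.
    i = max(range(m), key=lambda t: dp[k][t])
    chosen, j = [], k
    while i is not None:
        chosen.append(idx[i])
        i, j = back[j][i], j - 1
    return chosen[::-1]

print(choose_spread([0, 3, 4, 9, 11, 15, 20, 26, 30], 4))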
Here's an algorithm that doesn't require much extra memory. Hopefully it meets your quality requirements.
The high-level idea is to divide the input into k segments and choose one element uniformly at random from each segment. Given the memory constraint, we can't make the segments as even as we would like, but they'll be within a factor of two.
The simple version of this algorithm (which uses 2k reservoir slots and may return a sample of any size between k and 2k) starts by reading the first k elements, then proceeds in rounds. In round r (counting from zero), we read k·2^r elements, using the standard reservoir algorithm to choose one random sample from each segment of length 2^r. At the end of each round, we append these samples to the existing reservoir and do the following compression step: for each pair of consecutive elements, choose one uniformly at random to retain and discard the other.
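A minimal Python sketch of the simple version as I read it (how to handle a stream that ends mid-round is my own choice, not specified above):

import random

def sample_evenly_simple(stream, k):
    # Up to 2k reservoir slots; returns between k and 2k samples.
    it = iter(stream)
    reservoir = []
    # Seed the reservoir with the first k elements.
    for _ in range(k):
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir
    r = 0
    while True:
        seg_len = 2 ** r
        new_samples, exhausted = [], False
        # Round r: read k segments of length 2^r, one uniform sample per segment.
        for _ in range(k):
            chosen = None
            for i in range(seg_len):
                try:
                    x = next(it)
                except StopIteration:
                    exhausted = True
                    break
                if random.randrange(i + 1) == 0:   # size-1 reservoir sampling
                    chosen = x
            if chosen is not None:
                new_samples.append(chosen)
            if exhausted:
                break
        reservoir.extend(new_samples)
        if exhausted:
            return reservoir
        # Compression: of each consecutive pair, keep one uniformly at random.
        reservoir = [random.choice(pair)
                     for pair in zip(reservoir[0::2], reservoir[1::2])]
        r += 1

print(sample_evenly_simple(range(100), 8))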
The complicated version of this algorithm uses k slots and returns a sample of size k by interleaving the round sampling step with compression. Rather than write a formal description, I'll demonstrate it, since I think that will be easier to understand.
Let k = 8. We pick up after 32 elements have been read. I use the notation [a-b] to mean a random element whose index is between a and b inclusive. The reservoir looks like this:
[0-3] [4-7] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
Before we process the next element (32), we have to make room. This means merging [0-3] and [4-7] into [0-7].
[0-7] [32] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
We merge the next few elements into [32].
[0-7] [32-39] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
Element 40 requires another merge, this time of [16-19] and [20-23]. In general, we do merges in a low-discrepancy order.
[0-7] [32-39] [8-11] [12-15] [16-23] [40] [24-27] [28-31]
Keep going.
[0-7] [32-39] [8-11] [12-15] [16-23] [40-47] [24-27] [28-31]
At the end of the round, the reservoir looks like this.
[0-7] [32-39] [8-15] [48-55] [16-23] [40-47] [24-31] [56-63]
We use standard techniques from FFT to undo the butterfly permutation of the new samples and move them to the end.
[0-7] [8-15] [16-23] [24-31] [32-39] [40-47] [48-55] [56-63]
Then we start the next round.
Perhaps the simplest way to do reservoir sampling is to associate a random score with each sample, and then use a heap to remember the k samples with the highest scores.
This corresponds to applying a threshold operation to white noise, where the threshold value is chosen to admit the correct number of samples. Every sample has the same chance of being included in the output set, exactly as if k samples were selected uniformly.
If you sample blue noise instead of white noise to produce your scores, however, then applying a threshold operation will produce a low-discrepancy sequence and the samples in your output set will be more evenly spaced. This effect occurs because, while white noise samples are all independent, blue noise samples are temporally anti-correlated.
This technique is used to create pleasing halftone patterns (google Blue Noise Mask).
Theoretically, it works for any final sampling ratio, but realistically it's limited by numeric precision. I think it has a good chance of working OK for your range of 1-100, but I'd be more comfortable with 1-20.
There are many ways to generate blue noise, but probably your best choices are to apply a high-pass filter to white noise or to construct an approximation directly from 1D Perlin noise.
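As a rough illustration of the scoring idea (not the exact blue noise construction described above), here is a sketch where the scores are white noise passed through a first-difference high-pass filter, so neighbouring scores are anti-correlated and the kept items spread out more evenly:

import heapq
import random

def blue_noise_sample(stream, k, seed=0):
    # Heap-based reservoir: keep the k items with the highest scores.
    rng = random.Random(seed)
    prev = rng.random()
    heap = []  # (score, index, item)
    for index, item in enumerate(stream):
        w = rng.random()
        score = w - prev        # crude high-pass filter of white noise
        prev = w
        if len(heap) < k:
            heapq.heappush(heap, (score, index, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, index, item))
    # Return the kept items in stream order.
    return [item for _, _, item in sorted(heap, key=lambda t: t[1])]

print(blue_noise_sample(range(100), 10))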

How to cluster values based on their frequency of occurrence?

I am working on a clustering algorithm where I need to cluster values based on their frequency in the data. This would indicate which values are not important and should be treated as part of a larger cluster rather than as individual entities.
I am new to data science and would like to know the best algorithm/approach to achieve this.
For example, I have the following data set. The first column are the property values and second column denotes their frequency of occurrence.
Value = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
Frequency = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
Here, Frequency[i] corresponds to Value[i]
The frequency can be thought of as the importance of a value. The other thing that denotes the importance of a value is the distance between the elements in the array. For example, 1.5 is not that significant compared to 32 or 128, since it has much closer neighbours, such as 1 and 2.
When clustering these values, I need to look at both the distances between values and the frequency of their occurrence. A possible output for the above problem would be
Clust_value = [(1, 1.5), 2, 3, 4, (6, 8), 16, (32, 128)]
This is not the best cluster but one possible answer. I need to know the best algorithm to approach this problem.
Firstly, I tried to solve this problem without taking into account the spread of elements in the values array, but that gave wrong answers in some situations. We have tried using the mean and median for clustering values, again with no success.
We have tried comparing frequencies of neighbours and then merging the values into one cluster. We also tried finding the minimum distance between the elements of the values array and putting them into one cluster if their difference was greater than a threshold value, but this failed to cluster values if they had low frequencies. I also looked for clustering algorithms online but did not find any useful resource relevant to the problem defined above.
Is there any better way to approach the problem?
You need to come up with some mathematical quality criterion of what makes one solution better than another. Unless you have thousands of numbers, you can afford a rather 'brute force' method: begin with the first number, add the next as long as your quality increases, otherwise begin a new cluster. Because your data are sorted this will be fairly efficient and find a rather good solution (you can try additional splits to further improve quality).
So it all boils down to you needing to specify quality.
Do not assume that existing criteria (e.g. variance in k-means) work for you. At most, you may be able to find a data transformation such that your requirements turn into variance, but that also will be specific to your problem.
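A minimal sketch of that greedy pass; the quality function here is a hypothetical stand-in (as said above, defining it is the real problem):

def greedy_cluster(values, freqs, quality):
    # Extend the current cluster while that improves 'quality',
    # otherwise start a new cluster.  Clusters are lists of
    # (value, frequency) pairs, scanned in sorted value order.
    items = sorted(zip(values, freqs))
    clusters = [[items[0]]]
    for item in items[1:]:
        current = clusters[-1]
        if quality(current + [item]) >= quality(current):
            current.append(item)
        else:
            clusters.append([item])
    return clusters

def toy_quality(cluster):
    # Hypothetical criterion: reward total frequency, penalise spread.
    vals = [v for v, _ in cluster]
    mass = sum(f for _, f in cluster)
    return mass / (1.0 + max(vals) - min(vals))

Value = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
Frequency = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
print(greedy_cluster(Value, Frequency, toy_quality))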

Determine Depth of Huffman Tree using Input Character Pattern (or Frequency)?

I'd like to ask a variation on this question regarding Huffman tree building. Is there any way to calculate the depth of a Huffman tree from the input (or the frequencies), without drawing the tree?
If there is no quick way, how was the answer to that question found? A specific example: for 10 input symbols with frequencies 1 to 10, the depth is 5.
If you are looking for an equation to take the frequencies and give you the depth, then no, no such equation exists. The proof is that there exist sets of frequencies on which you will have arbitrary choices to make in applying the Huffman algorithm that result in different depth trees! So there isn't even a unique answer to "What is the depth of the Huffman tree?" for some sets of frequencies.
A simple example is the set of frequencies 1, 1, 2, and 2, which can give a depth of 2 or 3 depending on which minimum frequencies are paired when applying the Huffman algorithm.
The only way to get the answer is to apply the Huffman algorithm. You can take some shortcuts to get just the depth, since you won't be using the tree at the end. But you will be effectively building the tree no matter what.
You might be able to approximate the depth, or at least put bounds on it, with an entropy equation. In some special cases the bounds may be restrictive enough to give you the exact depth. E.g. if all of the frequencies are equal, then you can calculate the depth to be the ceiling of the log base 2 of the number of symbols.
A cool example that shows that a simple entropy bound won't be strong enough to get the exact answer is when you use the Fibonacci sequence for the frequencies. This assures that the depth of the tree is the number of symbols minus one. So the frequencies 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, and 610 will result in a depth of 14 bits even though the entropy of the lowest frequency symbol is 10.64 bits.
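For reference, here is a short sketch of the "shortcut" version: run the Huffman merging with a heap but track only (weight, depth) per subtree instead of building the tree. Ties are broken by the heap order, so other valid Huffman trees for the same frequencies may come out shallower or deeper:

import heapq

def huffman_depth(freqs):
    if len(freqs) < 2:
        return 0
    heap = [(f, 0) for f in freqs]          # (subtree weight, subtree depth)
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, d1 = heapq.heappop(heap)
        w2, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, max(d1, d2) + 1))
    return heap[0][1]

print(huffman_depth([1, 1, 2, 2]))                      # one valid answer: 2
print(huffman_depth([1, 1, 2, 3, 5, 8, 13, 21, 34,
                     55, 89, 144, 233, 377, 610]))      # 14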

Exhaustive random number generator

In one of my projects I encountered the need to generate a set of numbers in a given range that will be:
Exhaustive, meaning it will cover most of the given range without any repetition.
Deterministic (the sequence will be the same every time). This can probably be achieved with a fixed seed.
Random (I am not very well versed in random number theory, but I guess there is a bunch of rules that describe randomness; from that perspective, something like 0, 1, 2, ..., N is not random).
Ranges I am talking about can be ranges of integers, or of real numbers.
For example, if I use the standard C# random generator to generate 10 numbers in the range [0, 9] I get this:
0 0 1 2 0 1 5 6 2 6
As you can see, a big part of the given range still remains 'unexplored' and there are many repetitions.
Of course, the input space can be very large, so remembering previously chosen values is not an option.
What would be the right way to tackle this problem?
Thanks.
After the comments:
OK, I agree that "random" is not the right word, but I hope you understand what I am trying to achieve. I want to explore a given range that can be big, so an in-memory list is not an option. If the range is (0, 10) and I want three numbers, I want to guarantee that those numbers will be different and that they will 'describe the range' (i.e. they won't all be in the lower half, etc.).
The determinism part means that I would like to use something like a standard RNG with a fixed seed, so I can fully control the sequence.
I hope I made things a bit clearer.
Thanks.
Here are three options with different tradeoffs:
Generate a list of numbers ahead of time, and shuffle them using the Fisher-Yates shuffle. Select from the list as needed. O(n) total memory, and O(1) time per element. Randomness is as good as the PRNG you used to do the shuffle. The simplest of the three alternatives, too.
Use a Linear Feedback Shift Register, which will generate every value in its sequence exactly once before repeating (see the sketch after this list). O(log n) total memory, and O(1) time per element. It's easy to determine future values based on the present value, however, and LFSRs are most easily constructed for power-of-two periods (but you can pick the next biggest power of two and skip any out-of-range values).
Use a secure permutation based on a block cipher. Usable for any power-of-two period, and with a little extra trickery, any arbitrary period. O(log n) total space and O(1) time per element; randomness is as good as the block cipher. The most complex of the three to implement.
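As a sketch of the LFSR option, here is a 16-bit Galois LFSR (the tap mask 0xB400 is a standard maximal-length choice, so every non-zero 16-bit value is visited exactly once per period), with out-of-range values skipped as described; it assumes your range fits in 16 bits:

def lfsr16(seed=0xACE1):
    # 16-bit Galois LFSR; any non-zero seed works.
    state = seed & 0xFFFF
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400
        yield state

def exhaustive_range(n, seed=0xACE1):
    # Yield each value in [0, n) exactly once (n <= 65535),
    # skipping LFSR outputs that fall outside the range.
    produced = 0
    for value in lfsr16(seed):
        if value <= n:          # the LFSR never emits 0, so map 1..n to 0..n-1
            yield value - 1
            produced += 1
            if produced == n:
                return

print(list(exhaustive_range(10)))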
If you just need something, what about something like this?
maxint = 16
step = 7
sequence = 7, 14, 5, 12, 3, 10, 1, 8, 15, 6, 13, 4, 11, 2, 9, 0
If you pick step right, it will generate the entire interval before repeating. You can play around with different values of step to get something that "looks" good. The "seed" here is where you start in the sequence.
Is this random? Of course not. Will it look random according to a statistical test of randomness? It might depend on the step, but likely this will not look very statistically random at all. However, it certainly picks the numbers in the range, not in their original order, and without any memory of the numbers picked so far.
In fact, you could make this look even better by making a list of factors - like [1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16] - and using shuffled versions of those to compute step * factor (mod maxint). Let's say we shuffled the example factor lists like [3, 2, 4, 5, 1], [6, 8, 9, 10, 7], [13, 16, 12, 11, 14, 15]. Then we'd get the sequence
5, 14, 12, 3, 7, 10, 8, 15, 6, 1, 11, 0, 4, 13, 2, 9
The size of the factor list is completely tunable, so you can use as much memory as you like. Bigger factor lists, more randomness. No repeats regardless of factor list size. When you exhaust a factor list, generating a new one is as easy as counting and shuffling.
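A Python sketch of both variants (the block size and seed here are arbitrary choices):

import random
from math import gcd

def stepped_sequence(maxint, step, start=1):
    # Visits every value in [0, maxint) exactly once, as long as
    # step is coprime with maxint.  'start' acts as the seed.
    assert gcd(step, maxint) == 1
    for i in range(maxint):
        yield (step * (start + i)) % maxint

def stepped_sequence_shuffled(maxint, step, block=5, seed=0):
    # The factor-list variant: shuffle each block of factors before
    # multiplying, which scrambles local order while still never repeating.
    assert gcd(step, maxint) == 1
    rng = random.Random(seed)
    for lo in range(1, maxint + 1, block):
        factors = list(range(lo, min(lo + block, maxint + 1)))
        rng.shuffle(factors)
        for f in factors:
            yield (step * f) % maxint

print(list(stepped_sequence(16, 7)))           # the sequence from the example above
print(list(stepped_sequence_shuffled(16, 7)))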
It is my impression that what you are looking for is a randomly-ordered list of numbers, not a random list of numbers. You should be able to get this with something like the following. Better math-ies may be able to tell me if this is in fact not random:

import random

items = list(range(1, 101))
for index in range(len(items)):
    location = random.randrange(len(items) - index)   # random offset into the rest of the list
    items[index], items[location + index] = items[location + index], items[index]

Basically, go through the list and pick a random item from the rest of the list to use in the position you are at. This should randomly arrange the items in your list. If you need to reproduce the same random order each time, consider saving the array, or seeding the random number generator so that it always produces the same sequence.
Generate an array that contains the range, in order. So the array contains [0, 1, 2, 3, 4, 5, ... N]. Then use a Fisher-Yates Shuffle to scramble the array. You can then iterate over the array to get your random numbers.
If you need repeatability, seed your random number generator with the same value at the start of the shuffle.
Do not use a random number generator to select numbers in a range. What will eventually happen is that you have one number left to fill, and your random number generator will cycle repeatedly until it selects that number. Depending on the random number generator, there is no guarantee that will ever happen.
What you should do is generate a list of numbers on the desired range, then use a random number generator to shuffle the list. The shuffle is known as the Fisher-Yates shuffle, or sometimes called the Knuth shuffle. Here's pseudocode to shuffle an array x of n elements with indices from 0 to n-1:
for i from n-1 to 1
j = random integer such that 0 ≤ j ≤ i
swap x[i] and x[j]
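In Python, a minimal version of that shuffle with a fixed seed for the repeatability mentioned above:

import random

def shuffled_range(n, seed=42):
    # Fisher-Yates shuffle of [0, n); the same seed gives the same order.
    rng = random.Random(seed)
    x = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)          # 0 <= j <= i
        x[i], x[j] = x[j], x[i]
    return x

print(shuffled_range(10))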

Sorting structures in order of least change

This came out incomprehensible, so I will rephrase.
Is there an algorithm or approach that will allow sorting an array in such a way that it minimizes the differences between successive elements?
struct element
{
    uint32 positions[8];
};
These records are order-insensitive.
The output file format is defined to be:
byte   present;      // each bit indicates whether position[i] is present
uint32 position0;
...
uint32 positionN;    // N is the bit count of "present"
byte   nextpresent;
// Only the positions whose bits are set in "present" are actually written to the file.
All records are guaranteed to be unique, so a 'present' byte of 0 represents EOF.
The file is parsed by updating a "current" structure with the present fields, and the result is added to the list.
E.g. the records {1, 2, 3}, {2, 3, 2}, {4, 2, 3}, reordered as {1, 2, 3}, {4, 2, 3}, {2, 3, 2},
would be written as: 111b 1 2 3 001b 4 111b 2 3 2
saving 2 numbers compared to the unsorted order.
My goal is to minimize the output file size.
Your problem
I think this question should really be tagged with 'compression'.
As I understand it, you have unordered records which consist of eight 4-byte integers: 32 bytes in total. You want to store these records with a minimum file size, and have decided to use some form of delta encoding based on a Hamming distance. You're asking how to best sort your data for the compression scheme you've constructed.
Your assumptions
From what you've told us, I don't see any real reason for you to split up your 32 bytes in the way you've described (apart from the fact that word boundaries are convenient)! If you get the same data back, do you really care if it's encoded as eight lots of 4 bytes, or sixteen lots of 2 bytes, or as one huge 32-byte integer?
Furthermore, unless there's something about the problem domain which makes your method the favourite, your best bet is probably to use a tried-and-tested compression scheme. You should be able to find code that's already written, and you'll get good performance on typical data.
Your question
Back to your original question, if you really do want to take this route. It's easy to imagine picking a starting record (I don't think it will make much difference which, but it probably makes sense to pick the 'smallest' or 'largest'), and computing the Hamming distance to all other records. You could then pick the one with the minimum distance to store next, and repeat. Obviously this is O(n^2) in the number of records. Unfortunately, this paper (which I haven't read or understood in detail) makes it look like computing the minimum Hamming distance from one string to a set of others is intrinsically hard, and doesn't have very good approximations.
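A rough sketch of that greedy ordering, where the "distance" counted is the number of differing positions (which is what the delta format actually pays for); the starting record and tie-breaking are arbitrary choices:

def field_distance(a, b):
    # Number of positions that differ between two records.
    return sum(x != y for x, y in zip(a, b))

def greedy_order(records):
    # O(n^2) greedy nearest-neighbour ordering, starting from the smallest record.
    remaining = sorted(records)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        nxt = min(range(len(remaining)),
                  key=lambda i: field_distance(last, remaining[i]))
        ordered.append(remaining.pop(nxt))
    return ordered

print(greedy_order([(1, 2, 3), (2, 3, 2), (4, 2, 3)]))

On the three-record example from the question, this reproduces the {1, 2, 3}, {4, 2, 3}, {2, 3, 2} order.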
You could obviously get better complexity by sorting your records based on Hamming weight (which comes down to the population count of that 32-byte integer), which is O(n log(n)) in the number of records. Then use some difference coding on the result. But I don't think this will make a terribly good compression scheme: the integers from 0 to 7 might end up as something like:
000, 100, 010, 001, 101, 011, 110, 111
0, 4, 2, 1, 5, 3, 6, 7
Which brings us back to the question I asked before: are you sure your compression scheme is better than something more standard for your particular data?
You're looking at a pair of subproblems: defining the difference between structures, then the sort.
I'm not terribly clear on your description of the structure, nor on the precedence of differences, but I'll assume you can work that out and compute a difference score between two instances. For files, there are known algorithms for computing these things, like the one used in diff.
For your ordering, you're looking at a classic travelling salesman problem. If you're sorting a few of these things, it's easy. If you are sorting a lot of them, you'll have to settle for a 'good enough' sort, unless you're ready to apply domain knowledge and many little tricks from TSP to the effort.
