Sorting structures in order of least change - algorithm

My first attempt at this came out incomprehensible, so I'll rephrase:
Is there an algorithm or approach that will allow sorting an array in such a way that it minimizes the differences between successive elements?
struct element
{
    uint32 positions[8];
};
These records are order-insensitive.
The output file format is defined to be:
byte present;      // each bit indicates whether positions[i] is present
uint32 position0;
...
uint32 positionN;  // N is the popcount of "present"; only fields whose
                   // bits are set in "present" are actually written
byte nextpresent;
All records are guaranteed to be unique, so a 'present' byte of 0 represents EOF.
The file is parsed by updating a "current" structure with the present fields, and the result is added to the list.
Eg: the records { 1, 2, 3 }, { 2, 3, 2 }, { 4, 2, 3 }, reordered as { 1, 2, 3 }, { 4, 2, 3 }, { 2, 3, 2 },
would be written: 111b 1 2 3 001b 4 111b 2 3 2
saving 2 numbers off the unsorted approach (7 numbers instead of 9).
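For reference, a minimal writer/reader sketch in Python, under my reading of the format (bit i of "present" corresponds to positions[i], integers little-endian; all names illustrative):

import struct

def write_records(records, out):
    # records: 8-tuples, already ordered to minimize field changes
    current = [None] * 8
    for rec in records:
        present, changed = 0, []
        for i, v in enumerate(rec):
            if v != current[i]:
                present |= 1 << i            # bit i set: positions[i] follows
                changed.append(v)
                current[i] = v
        out.write(struct.pack('<B', present))
        out.write(struct.pack('<%dI' % len(changed), *changed))
    out.write(struct.pack('<B', 0))          # present == 0 marks EOF

def read_records(inp):
    current, records = [0] * 8, []
    while True:
        present = inp.read(1)[0]
        if present == 0:                     # records are unique, so 0 = EOF
            return records
        for i in range(8):
            if present & (1 << i):
                current[i] = struct.unpack('<I', inp.read(4))[0]
        records.append(tuple(current))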
My goal is to minimize the output file size.

Your problem
I think this question should really be tagged with 'compression'.
As I understand it, you have unordered records which consist of eight 4-byte integers: 32 bytes in total. You want to store these records with a minimum file size, and have decided to use some form of delta encoding based on a Hamming distance. You're asking how to best sort your data for the compression scheme you've constructed.
Your assumptions
From what you've told us, I don't see any real reason for you to split up your 32 bytes in the way you've described (apart from the fact that word boundaries are convenient)! If you get the same data back, do you really care if it's encoded as eight lots of 4 bytes, or sixteen lots of 2 bytes, or as one huge 32-byte integer?
Furthermore, unless there's something about the problem domain which makes your method the favourite, your best bet is probably to use a tried-and-tested compression scheme. You should be able to find code that's already written, and you'll get good performance on typical data.
Your question
Back to your original question, if you really do want to take this route. It's easy to imagine picking a starting record (I don't think it will make much difference which, but it probably makes sense to pick the 'smallest' or 'largest'), and computing the Hamming distance to all other records. You could then pick the one with the minimum distance to store next, and repeat. Obviously this is O(n^2) in the number of records. Unfortunately, this paper (which I haven't read or understood in detail) makes it look like computing the minimum Hamming distance from one string to a set of others is intrinsically hard, and doesn't have very good approximations.
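A sketch of that greedy nearest-neighbour ordering. Note I'm taking the "distance" field-wise (the number of positions that differ), since that is what the format actually charges for; counting differing bits of the 256-bit string would be the literal Hamming alternative:

def field_distance(a, b):
    # number of positions whose values differ (each costs one uint32 in the file)
    return sum(x != y for x, y in zip(a, b))

def greedy_order(records):
    remaining = sorted(records)          # start from the 'smallest' record
    order = [remaining.pop(0)]
    while remaining:                     # O(n^2) in the number of records
        last = order[-1]
        i = min(range(len(remaining)),
                key=lambda j: field_distance(last, remaining[j]))
        order.append(remaining.pop(i))
    return order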
You could obviously get better complexity by sorting your records based on Hamming weight (which comes down to the population count of that 32-byte integer), which is O(n log(n)) in the number of records. Then use some difference coding on the result. But I don't think this will make a terribly good compression scheme: the integers from 0 to 7 might end up as something like:
000, 100, 010, 001, 101, 011, 110, 111
(i.e. 0, 4, 2, 1, 5, 3, 6, 7)
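That weight sort is cheap to express, viewing each record as one big integer (a sketch):

def hamming_weight(record):
    # population count of the record viewed as one 256-bit integer
    return sum(bin(v).count('1') for v in record)

records.sort(key=hamming_weight)   # O(n log n), then difference-code the result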
Which brings us back to the question I asked before: are you sure your compression scheme is better than something more standard for your particular data?

You're looking at a pair of subproblems, defining the difference between structures, then the sort.
I'm not terribly clear on your description of the structure, nor on the precedence of differences, but I'll assume you can work that out and compute a difference score between two instances. For files, there are well-known algorithms for computing such differences, like the one used in diff.
For your ordering, you're looking at a classic travelling salesman problem. If you're sorting a few of these things, it's easy. If you're sorting a lot of them, you'll have to settle for a 'good enough' sort (see the sketch below), unless you're ready to apply domain knowledge and the many little tricks from TSP to the effort.
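For instance, one classic 'good enough' trick is a 2-opt improvement pass over some initial ordering; a sketch, where dist is whatever (symmetric) difference score you settle on:

def two_opt(order, dist):
    # reverse segments whenever doing so reduces the total successive difference
    improved = True
    while improved:
        improved = False
        for i in range(1, len(order) - 1):
            for j in range(i + 1, len(order) - 1):
                # cost of the two 'edges' this reversal would replace
                before = dist(order[i-1], order[i]) + dist(order[j], order[j+1])
                after  = dist(order[i-1], order[j]) + dist(order[i], order[j+1])
                if after < before:
                    order[i:j+1] = order[i:j+1][::-1]
                    improved = True
    return order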

Related

Evenly-spaced samples from a stream of unknown length

I want to sample K items from a stream of N items that I see one at a time. I don't know how big N is until the last item turns up, and I want the space consumption to depend on K rather than N.
So far I've described a reservoir sampling problem. The major ask though is that I'd like the samples to be 'evenly spaced', or at least more evenly spaced than reservoir sampling manages. This is vague; one formalization would be that the sample indices are a low-discrepancy sequence, but I'm not particularly tied to that.
I'd also like the process to be random and every possible sample to have a non-zero probability of appearing, but I'm not particularly tied to this either.
My intuition is that this is a feasible problem, and the algorithm I imagine preferentially drops samples from the 'highest density' part of the reservoir in order to make space for samples from the incoming stream. It also seems like a common enough problem that someone should have written a paper on it, but Googling combinations of 'evenly spaced', 'reservoir', 'quasirandom', and 'sampling' hasn't gotten me anywhere.
edit #1: An example might help.
Example
Suppose K=3, and I get items 0, 1, 2, 3, 4, 5, ....
After 3 items, the sample would be [0, 1, 2], with spaces of {1}
After 6 items, I'd like to most frequently get [0, 2, 4] with its spaces of {2}, but commonly getting samples like [0, 3, 5] or [0, 2, 5] with spaces of {2, 3} would be good too.
After 9 items, I'd like to most frequently get [0, 4, 8] with its spaces of {4}, but commonly getting samples like [0, 4, 7] with spaces of {4, 3} would be good too.
edit #2: I've learnt a lesson here about providing lots of context when requesting answers. David and Matt's answers are promising, but in case anyone sees this and has a perfect solution, here's some more information:
Context
I have hundreds of low-res videos streaming through a GPU. Each stream is up to 10,000 frames long, and - depending on application - I want to sample 10 to 1000 frames from each. Once a stream is finished and I've got a sample, it's used to train a machine learning algorithm, then thrown away. Another stream is started in its place. The GPU's memory is 10 gigabytes, and a 'good' set of reservoirs occupies a few gigabytes in the current application and plausibly close to the entire memory in future applications.
If space isn't at a premium, I'd oversample using the uniform random reservoir algorithm by some constant factor (e.g., if you need k items, sample 10k) and remember the index that each sampled item appeared at. At the end, use dynamic programming to choose k indexes to maximize (e.g.) the sum of the logs of the gaps between consecutive chosen indexes.
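A sketch of that scheme (illustrative names; the DP is O(m^2 k) over the m = 10k retained indices and maximizes the sum of log-gaps):

import math, random

def reservoir_with_indices(stream, m):
    # standard uniform reservoir of size m, remembering stream indices
    sample = []
    for i, item in enumerate(stream):
        if len(sample) < m:
            sample.append((i, item))
        else:
            j = random.randrange(i + 1)
            if j < m:
                sample[j] = (i, item)
    return sorted(sample)

def pick_spread(sample, k):
    # DP: choose k of the sampled indices maximizing sum(log(gap))
    idx = [i for i, _ in sample]
    m, NEG = len(idx), float('-inf')
    best = [[NEG] * m for _ in range(k + 1)]   # best[t][j]: t picks, last at j
    back = [[-1] * m for _ in range(k + 1)]
    best[1] = [0.0] * m
    for t in range(2, k + 1):
        for j in range(1, m):
            for p in range(j):
                if best[t-1][p] > NEG:
                    s = best[t-1][p] + math.log(idx[j] - idx[p])
                    if s > best[t][j]:
                        best[t][j], back[t][j] = s, p
    j = max(range(m), key=lambda q: best[k][q])
    picked = []
    for t in range(k, 0, -1):
        picked.append(sample[j])
        j = back[t][j]
    return picked[::-1]

# e.g. pick_spread(reservoir_with_indices(stream, 10 * k), k)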
Here's an algorithm that doesn't require much extra memory. Hopefully it meets your quality requirements.
The high-level idea is to divide the input into k segments and choose one element uniformly at random from each segment. Given the memory constraint, we can't make the segments as even as we would like, but they'll be within a factor of two.
The simple version of this algorithm (which uses 2k reservoir slots and may return a sample of any size between k and 2k) starts by reading the first k elements, then proceeds in rounds. In round r (counting from zero), we read k·2^r elements, using the standard reservoir algorithm to choose one random sample from each of the k segments of length 2^r. At the end of each round, we append these samples to the existing reservoir and do the following compression step: for each pair of consecutive elements, choose one uniformly at random to retain and discard the other.
The complicated version of this algorithm uses k slots and returns a sample of size k by interleaving the round sampling step with compression. Rather than write a formal description, I'll demonstrate it, since I think that will be easier to understand.
Let k = 8. We pick up after 32 elements have been read. I use the notation [a-b] to mean a random element whose index is between a and b inclusive. The reservoir looks like this:
[0-3] [4-7] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
Before we process the next element (32), we have to make room. This means merging [0-3] and [4-7] into [0-7].
[0-7] [32] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
We merge the next few elements into [32].
[0-7] [32-39] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]
Element 40 requires another merge, this time of [16-19] and [20-23]. In general, we do merges in a low-discrepancy order.
[0-7] [32-39] [8-11] [12-15] [16-23] [40] [24-27] [28-31]
Keep going.
[0-7] [32-39] [8-11] [12-15] [16-23] [40-47] [24-27] [28-31]
At the end of the round, the reservoir looks like this.
[0-7] [32-39] [8-15] [48-55] [16-23] [40-47] [24-31] [56-63]
We use standard techniques from FFT to undo the butterfly permutation of the new samples and move them to the end.
[0-7] [8-15] [16-23] [24-31] [32-39] [40-47] [48-55] [56-63]
Then we start the next round.
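Here's a sketch of the simple (2k-slot) version in Python, under my reading of the rounds above (the k-slot version adds the interleaved merging just demonstrated):

import random

def sample_about_k(stream, k):
    # returns between k and 2k samples, one per (roughly even) segment;
    # assumes the stream never yields None
    it = iter(stream)
    reservoir = []
    for _ in range(k):                        # seed with the first k elements
        x = next(it, None)
        if x is None:
            return reservoir
        reservoir.append(x)
    seg = 1                                   # segment length, 2^r in round r
    while True:
        fresh, exhausted = [], False
        for _ in range(k):                    # k segments per round
            chosen, seen = None, 0
            for _ in range(seg):
                x = next(it, None)
                if x is None:
                    exhausted = True
                    break
                seen += 1
                if random.randrange(seen) == 0:    # size-1 reservoir
                    chosen = x
            if chosen is not None:
                fresh.append(chosen)
            if exhausted:
                break
        reservoir += fresh
        if exhausted:
            return reservoir
        # compression: keep one of each consecutive pair, uniformly at random
        reservoir = [random.choice(pair)
                     for pair in zip(reservoir[0::2], reservoir[1::2])]
        seg *= 2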
Perhaps the simplest way to do reservoir sampling is to associate a random score with each sample, and then use a heap to remember the k samples with the highest scores.
This corresponds to applying a threshold operation to white noise, where the threshold value is chosen to admit the correct number of samples. Every sample has the same chance of being included in the output set, exactly as if k samples were selected uniformly.
If you sample blue noise instead of white noise to produce your scores, however, then applying a threshold operation will produce a low-discrepancy sequence and the samples in your output set will be more evenly spaced. This effect occurs because, while white noise samples are all independent, blue noise samples are temporally anti-correlated.
This technique is used to create pleasing halftone patterns (google Blue Noise Mask).
Theoretically, it works for any final sampling ratio, but realistically it's limited by numeric precision. I think it has a good chance of working OK for your range of 1-100, but I'd be more comfortable with 1-20.
There are many ways to generate blue noise, but probably your best choices are to apply a high-pass filter to white noise or to construct an approximation directly from 1D Perlin noise.
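An offline sketch of the scoring idea with NumPy; the cutoff is a free knob (something on the order of the sampling ratio seems reasonable), and a streaming version would need noise you can generate incrementally, e.g. the Perlin route:

import numpy as np

def blue_noise_scores(n, cutoff, seed=None):
    # high-pass filter white noise in the frequency domain -> blue noise
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    spectrum[np.fft.rfftfreq(n) < cutoff] = 0     # suppress low frequencies
    return np.fft.irfft(spectrum, n)

def threshold_sample(n, k, cutoff):
    # indices of the k highest scores: a more evenly spaced k-subset of range(n)
    return np.sort(np.argsort(blue_noise_scores(n, cutoff))[-k:])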

Algorithm for seeing if many different arrays are subsets of another one?

Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349, 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly, up to 10,000 times per second. The small arrays are "fixed" (they change infrequently, like once every few seconds), while the large array is different for each data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure the best data structures or ways to represent this. One solution would be to turn it around and see what small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to build a tree from the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3' since it occurs in half the small arrays. Now that's the head node in the tree. Now put all the small lists that contain 3 to the left subtree and all the other lists to the right subtree. Now repeat this process recursively on each subtree. Then when a large array comes in, reverse index it, and then traverse the subtree to get the lists.
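A sketch of that construction and lookup (illustrative code of my own; it splits on positive terms only, prunes the 'requires pivot' branch whenever the pivot is absent from the input, and the candidates surviving at the leaves still need the full check, including the NOT terms):

from collections import Counter

def build_tree(filters, ids=None, leaf_size=4):
    # filters: dict id -> set of required integers (NOT terms checked later)
    if ids is None:
        ids = list(filters)
    if len(ids) <= leaf_size:
        return ids
    freq = Counter(n for i in ids for n in filters[i])
    # pivot: the number occurring in closest to half of the remaining filters
    pivot = min(freq, key=lambda n: abs(freq[n] - len(ids) / 2))
    with_p = [i for i in ids if pivot in filters[i]]
    without_p = [i for i in ids if pivot not in filters[i]]
    if not with_p or not without_p:
        return ids
    return (pivot, build_tree(filters, with_p, leaf_size),
                   build_tree(filters, without_p, leaf_size))

def candidates(tree, big):
    # big: set of integers in the large array
    if not isinstance(tree, tuple):
        return tree                         # leaf: verify these in full
    pivot, with_p, without_p = tree
    if pivot in big:                        # pivot present: both sides possible
        return candidates(with_p, big) + candidates(without_p, big)
    return candidates(without_p, big)       # pivot absent: left side pruned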
You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test whether an integer is present in O(1).
Then, given that 50,000 (arrays) × 20 (terms each) × 8 (bytes per term) = 8 megabytes plus hash-map overhead does not seem large for most systems either, I would use another hash map to store already-tested arrays. That way you don't have to re-test duplicates.
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting: make a hashmap from each input integer to the IDs of the filter arrays it appears in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You then have to combine the per-integer sets of filter IDs.
Optional: make a second hashmap from each input integer to its frequency across the set of filters. When an input comes in, sort it using the second hashmap, then start with the least common input integer, so you have less overall work to do on each step. Also compute the frequencies for the "NOT" cases, so you get the most bang for your buck on each step.
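Concretely, the combining step can be done as a per-filter hit count rather than a literal set intersection: a filter matches when every one of its required terms is hit and none of its NOT terms is. A sketch (illustrative names):

from collections import defaultdict

def build_indexes(filters):
    # filters: dict id -> (required_set, forbidden_set)
    required, forbidden = defaultdict(list), defaultdict(list)
    for fid, (req, forb) in filters.items():
        for n in req:
            required[n].append(fid)
        for n in forb:
            forbidden[n].append(fid)
    return required, forbidden

def matches(big_array, filters, required, forbidden):
    big, hits, dead = set(big_array), defaultdict(int), set()
    for n in big:
        dead.update(forbidden.get(n, ()))   # a present NOT term kills a filter
        for fid in required.get(n, ()):
            hits[fid] += 1                  # one more required term satisfied
    return [fid for fid, h in hits.items()
            if h == len(filters[fid][0]) and fid not in dead]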
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.

Something like a reversed random number generator

I don't really know the name of this problem, but it's something like lossy compression; my English isn't great, so I'll describe it as best I can.
Suppose I have a list of unsorted unique numbers from an unknown source. The length is usually between 255 and 512, with values ranging from 0 to 512.
I wonder if there is some kind of algorithm that reads the data and returns something like a seed number that I can use to regenerate a list close to the original, with some degree of error.
For example
original list
{5, 13, 25, 33, 3, 10}
regenerated list
{4, 10, 30, 30, 5, 5} or {8, 20, 20, 35, 5, 9} //and so on
Does this problem have a name, and is there an algorithm that can do what I just described?
Is it the same as the Monte Carlo method? From what I understand, it isn't.
Is it possible to use some of the techniques used in lossy compression to get this kind of approximation?
What I tried, to solve this problem, was to use a simple 16-bit RNG, brute-force all possible seeds, compare each generated list to the original, and pick the seed with the minimum difference, but I think this way is rather dumb and inefficient.
This is indeed lossy compression.
You don't tell us the range of the values in the list. From the samples you give, we can extrapolate that they take at least 6 bits each (0 to 63). In total, you have up to 3072 bits to compress.
If these sequences have no special property and appear to be random, I doubt there is any way to achieve significant compression. Consider that the probability of an arbitrary sequence being matched from a 32-bit seed is 2^32 · 2^(-3072) ≈ 7·10^(-916), i.e. less than infinitesimal. If you allow 10% error on every value, the probability of a match is 2^32 · 0.1^512 ≈ 4·10^(-503).
A trivial way to compress with 12.5% accuracy is to drop the three LSBs of each value, a 50% saving (1536 bits), but I doubt this is what you are looking for.
It would be useful to measure the entropy of the sequences (http://en.wikipedia.org/wiki/Entropy_(information_theory)) and/or look for correlations between the values. This can be done by plotting all (Vi, Vi+1) pairs, or (Vi, Vi+1, Vi+2) triples, and looking for patterns.
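The entropy check is quick to compute; compare the result against log2 of the alphabet size (about 9 bits for values 0-512):

import math
from collections import Counter

def empirical_entropy(values):
    # Shannon entropy of the observed value distribution, in bits per value
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())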

Subset calculation of list of integers

I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers, where each set could potentially contain around 1,000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly calculate which sets this InputSet is a subset of. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extent, it should be memory-efficient. I already optimised the calculation by making use of Bloom filters to quickly eliminate sets of which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well, it seems that the bottleneck is the number of sets, so instead of finding a set by iterating over all of them, you could enhance performance by mapping from elements to all the sets containing them, and return the sets containing all the elements you searched for.
This is very similar to what is done in AND query when searching the inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set1000000]
8 -> [set2]
...
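A sketch of that AND query over the posting lists (starting with the rarest element keeps the intermediate result small):

from collections import defaultdict

def build_index(sets):
    # sets: dict name -> iterable of integers
    index = defaultdict(set)
    for name, values in sets.items():
        for v in values:
            index[v].add(name)
    return index

def supersets(index, input_set):
    # names of the sets containing every element of input_set
    postings = sorted((index.get(v, set()) for v in input_set), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

# supersets(build_index(sets), [1, 7]) -> {'Set1', 'Set1000000'}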
EDIT:
In inverted index in IR, to save space we sometimes use d-gaps - meaning we store the offset between documents and not the actual number. For example, [2,5,10] will become [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find whether a specific set/document is in it, and cannot use binary search, but it is sometimes worth it, especially if it makes the difference between fitting the index into RAM or not.)
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example, using an interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [1,2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you would consider intervals (1-5) and (6-10) (which can be determined simply by knowing the size of the interval). Intersecting those candidate lists gives you [1, 2, 1000000]. A binary search within each of those sets then shows that (1,7) exists in Set1 and Set1000000, but not in Set2.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the interval size should be something like (max - min)/N, where 2N is the maximum number of values that will need to be binary searched in each set. For example, "does set 3 contain any values from 5 to 10?" This is done by finding the closest values to 5 (which is 3) and to 10 (which is 11); in this case, no, it does not. You would have to go through each set and do binary searches for the interval values that could be within the set. This means ensuring that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max). However, the issue is that I suspect your numbers are going to be clustered, so this may not provide much discrimination. Although, as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what range to use: too large, and it'll take a long time to build the data structure (1000 × a million × log(N)); too small, and you'll start to run into space issues. The ideal range size probably ensures that the number of sets related to each range is approximately equal, while also keeping the total number of ranges from getting too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to increase the interval size and split the current intervals, to ensure that the search stays fast. This is especially true if preprocessing time isn't a major issue. A sketch of the whole interval variant follows.
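A sketch of this interval variant (interval size as a parameter; candidate intersection first, then binary-search verification as described):

import bisect
from collections import defaultdict

def build_interval_index(sets, size):
    # sets: dict name -> sorted list of integers
    index = defaultdict(set)
    for name, values in sets.items():
        for v in values:
            index[v // size].add(name)
    return index

def supersets(sets, index, size, input_set):
    candidates = None
    for v in input_set:                  # intersect per-interval candidates
        c = index.get(v // size, set())
        candidates = c if candidates is None else candidates & c
    def contains(values, v):             # exact membership via binary search
        i = bisect.bisect_left(values, v)
        return i < len(values) and values[i] == v
    return [name for name in (candidates or ())
            if all(contains(sets[name], v) for v in input_set)]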
Start searching from the biggest number (7) of the input set to eliminate most of the sets (only Set1 and Set1000000 will be returned). Then search for the other input elements (1) in the remaining sets.

Tinyurl-style unique code: potential algorithm to prevent collisions

I have a system that requires a unique 6-digit code to represent an object, and I'm trying to think of a good algorithm for generating them. Here are the pre-reqs:
I'm using a base-20 system (no caps, numbers, vowels, or l to prevent confusion and naughty words)
Six base-20 digits allow 64 million combinations
I'll be inserting potentially 5-10 thousand entries at once, so in theory I'd use bulk inserts, which means relying on a unique key probably won't be efficient or pretty (especially once collisions become frequent)
It's not out of the question to fill up 10% of the combinations so there's a high potential for lots of collisions
I want to make sure the codes are non-consecutive
I had an idea that sounded like it would work, but I'm not good enough at math to figure out how to implement it: if I start at 0 and increment by N, then convert to base-20, it seems like there should be some value for N that lets me count each value from 0-63,999,999 before repeating any.
For example, going from 0 through 9 using N=3 (i.e. repeatedly adding 3 modulo 10): 0, 3, 6, 9, 2, 5, 8, 1, 4, 7.
Is there some magic math method for figuring out values of N for some larger number that is able to count through the whole range without repeating? Ideally, the number I choose would sort of jump around the set such that it wasn't obvious that there was a pattern, but I'm not sure how possible that is.
Alternatively, a hashing algorithm that guaranteed uniqueness for values 0-64 million would work, but I'm way too dumb to know if that's possible.
All you need is a number that shares no factors with your key space. The easiest choice is a prime number. You can google for large primes, or use http://primes.utm.edu/lists/small/10000.txt
Any prime number which is not a factor of the length of the sequence should be able to span the sequence without repeating. For 64000000, that means you shouldn't use 2 or 5. Of course, if you don't want them to be generated consecutively, generating them 2 or 5 apart is probably also not very good. I personally like the number 73973!
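A sketch of that stepping scheme (the alphabet below is just my guess at a 20-letter set matching the question's constraints, i.e. the consonants minus 'l'; 73973 is odd and not a multiple of 5, hence coprime to 20^6 = 2^12 · 5^6, so the sequence visits all 64 million codes before repeating):

ALPHABET = "bcdfghjkmnpqrstvwxyz"   # hypothetical 20-character alphabet
SPACE = 20 ** 6                     # 64,000,000 codes
STEP = 73973                        # coprime to SPACE, so the cycle is full

def code(i, seed=0):
    # the i-th code: walks seed, seed+STEP, seed+2*STEP, ... mod SPACE
    n = (seed + i * STEP) % SPACE
    chars = []
    for _ in range(6):
        n, d = divmod(n, 20)        # peel off base-20 digits
        chars.append(ALPHABET[d])
    return ''.join(chars)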
There is another method to get a similar result (jumping over the entire set of values without repeating, non-consecutively) without using primes: maximum-length sequences, which you can generate with specially constructed (linear feedback) shift registers.
My math is a bit rusty, but I think you just need to ensure that the GCF of N and 64 million is 1. I'd go with a prime number (that doesn't divide evenly into 64 million) just in case though.
#Nick Lewis:
Well, only if the prime number doesn't divide 64 million. So, for the questioner's purposes, numbers like 2 or 5 would probably not be advisable.
Don't reinvent the wheel:
http://en.wikipedia.org/wiki/Universally_Unique_Identifier
