Generator for iterating over a random non-repeated sequence - algorithm

I have p items (let's assume p=5, items={0,1,2,3,4}). I need to be able to iterate over them in a random order, but without repeating them (unless all were visited) while maintaining only as small seed-like metadata as possible between the iterations. The generator is otherwise stateless. It would be used like this:
Initialization (metadata is long in this example, but it could be anything "small"):
long metadata = randomLong()
Usage:
(metadata, result) = generator.generate(metadata)
return(result)
If it works properly, it should continuously return something like 3, 1, 0, 4, 2, 3, 1, 0, 4, 2, 3...
Is that possible?
I know I could easily pre-generate the sequence, then metadata would contain whole this sequence and an index, but that's not viable for me, as the sequence will have thousands of items and the metadata must be slim.
I also found this, which resembles what I am trying to achieve, but it's either too brief or too math-y for me.
Added: I am aware of the fact, that for p=1000, there are 1000! ways of ordering the sequence, which would definitely not fit into a long, but both "having metadata something bigger than long" and "generator may be unable to generate some sequences" is OK for me.

I would, as a base, use Fisher-Yates algorithm.
It is able to construct a random permutation of a given ordered list of elements in O(n).
Then the trick could be to construct an iterator that shuffles an internal list of elements and iterate through it, and when this internal iteration ends, shuffles again and iterate on the result...
Something like:
function next() -> element {
internal data:
i an integer;
d an array of elements;
code:
if i equals to d.length { shuffle(d); i <-- 0; }
return d[i++];
}

Related

Sum of Function defined on Subsets

I want to know if their are any fast approaches to solve the following problem. I have a list of codes somewhere in the thousands (A0, A1, A2, ...). There is a positive value attached to about a million distinct combinations (A0-A1, A2-A10, A1-A2-A10, ...). Let the values be denoted f(A0-A1). Note that not all the combinations have the value attached.
For each listed combination, I want to calculate the sum of values of the values attached to each set that contains the given combination. For instance, for A2-A10,
calculate
g(A2-A10) = f(A2-A10) + f(A1-A2-A10) + ...
I would like to do this with minimal time complexity. A simpler related problem is to find all combinations where g(C) is greater than a threshold value.
Key the existing combinations with a bit map, where bit n denotes whether An is in that particular coding. Store the values keyed by the bit map for each in your favorite hash-map structure. Thus, f(A0, A1, A10, A12) would be combo_val[11000000001010000...]
To sum all of the desired combinations, build a bit map of your root. For instance, with the combination above, we'd have root = 1100000000101000 (cutting off at 16 total elements for the sake of illustration.
Now simply loop through the keys of the hashmap, using root as a mask. Sum the desired values:
total = 0
for key in combo_val.keys()
if root && key == root
total += combo_val[key]
Does that get you moving?
I thought waaay too long before coming up with the following approach.
Index the million combinations. So you know which you want. In your example:
0: A0-A1
1: A2-A10
2: A1-A2-A10
For each code, create an ordered list of combinations that contain that code. Call that code_combs. In your example:
A0: [0]
A1: [0, 2]
A2: [1, 2]
A10: [1, 2]
Now we have a combination of codes, like A2-A10. We create two arrays, one of codes, the other of indices. Set indices at 0. So:
codes = ['A2', 'A10']
indices = [0, 0]
And now do the following:
while not done:
let max_comb = max(code_combs[codes[i]][indices[i]] over i in range(len(codes))
Advance each index until we are at the max_comb or greater
(if we reach the end of any list, we are done)
If all are at the same max_comb, we add its value.
Advance all indexes by 1.
(if we reach the end of any list, we are done)
Basically this is a k-way intersection of ordered lists. Now here is the trick. If we advance naively, this will be slightly faster because we only have to look at combinations that contain a code. However we can use a clever advance strategy like this:
Advance by 1, 2, 4, 8, etc until we reach or pass the point we want.
Do a binary search between the last two values until we find the point we want
(Be warned, implementing binary search is not always so easy to get right.)
And now we are crossing fingers. But if any one of our codes has few combinations that it is in, and there aren't too many codes in our combination, we can compute our intersection quite quickly.

Compare rotated lists, containing duplicates [duplicate]

This question already has answers here:
How to check whether two lists are circularly identical in Python
(18 answers)
Closed 7 years ago.
I'm looking for an efficient way to compare lists of numbers to see if they match at any rotation (comparing 2 circular lists).
When the lists don't have duplicates, picking smallest/largest value and rotating both lists before comparisons works.
But when there may be many duplicate large values, this isn't so simple.
For example, lists [9, 2, 0, 0, 9] and [0, 0, 9, 9, 2] are matches,where [9, 0, 2, 0, 9] won't (since the order is different).
Heres an example of an in-efficient function which works.
def min_list_rotation(ls):
return min((ls[i:] + ls[:i] for i in range(len(ls))))
# example use
ls_a = [9, 2, 0, 0, 9]
ls_b = [0, 0, 9, 9, 2]
print(min_list_rotation(ls_a) == min_list_rotation(ls_b))
This can be improved on for efficiency...
check sorted lists match before running exhaustive tests.
only test rotations that start with the minimum value(skipping matching values after that)effectively finding the minimum value with the furthest & smallest number after it (continually - in the case there are multiple matching next-biggest values).
compare rotations without creating the new lists each time..
However its still not a very efficient method since it relies on checking many possibilities.
Is there a more efficient way to perform this comparison?
Related question:
Compare rotated lists in python
If you are looking for duplicates in a large number of lists, you could rotate each list to its lexicographically minimal string representation, then sort the list of lists or use a hash table to find duplicates. This canonicalisation step means that you don't need to compare every list with every other list. There are clever O(n) algorithms for finding the minimal rotation described at https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation.
You almost have it.
You can do some kind of "normalization" or "canonicalisation" of a list independently of the others, then you only need to compare item by item (or if you want, put them in a map, in a set to eliminate duplicates, ..."
1 take the minimum item, which is not preceded by itself (in a circular way)
In you example 92009, you should take the first 0 (not the second one)
2 If you have always the same item (say 00000), you just keep that: 00000
3 If you have the same item several times, take the next item, which is minimal, and keep going until you find one unique path with minimums.
Example: 90148301562 => you have 0148.. and 0156.. => you take 0148
4 If you can not separate the different paths (= if you have equality at infinite), you have a repeating pattern: then, no matters: you take any of them.
Example: 014376501437650143765 : you have the same pattern 0143765...
It is like AAA, where A = 0143765
5 When you have your list in this form, it is easy to compare two of them.
How to do that efficiently:
Iterate on your list to get the minimums Mx (not preceded by itself). If you find several, keep all of them.
Then, iterate from each minimum Mx, take the next item, and keep the minimums. If you do an entire cycle, you have a repeating pattern.
Except the case of repeating pattern, this must be the minimal way.
Hope it helps.
I would do this in expected O(N) time using a polynomial hash function to compute the hash of list A, and every cyclic shift of list B. Where a shift of list B has the same hash as list A, I'd compare the actual elements to see if they are equal.
The reason this is fast is that with polynomial hash functions (which are extremely common!), you can calculate the hash of each cyclic shift from the previous one in constant time, so you can calculate hashes for all of the cyclic shifts in O(N) time.
It works like this:
Let's say B has N elements, then the the hash of B using prime P is:
Hb=0;
for (i=0; i<N ; i++)
{
Hb = Hb*P + B[i];
}
This is an optimized way to evaluate a polynomial in P, and is equivalent to:
Hb=0;
for (i=0; i<N ; i++)
{
Hb += B[i] * P^(N-1-i); //^ is exponentiation, not XOR
}
Notice how every B[i] is multiplied by P^(N-1-i). If we shift B to the left by 1, then every every B[i] will be multiplied by an extra P, except the first one. Since multiplication distributes over addition, we can multiply all the components at once just by multiplying the whole hash, and then fix up the factor for the first element.
The hash of the left shift of B is just
Hb1 = Hb*P + B[0]*(1-(P^N))
The second left shift:
Hb2 = Hb1*P + B[1]*(1-(P^N))
and so on...

Data Structure / Hash Function to link Sets of Ints to Value

Given n integer id's, I wish to link all possible sets of up to k id's to a constant value. What I'm looking for is a way to translate sets (e.g. {1, 5}, {1, 3, 5} and {1, 2, 3, 4, 5, 6, 7}) to unique values.
Guarantees:
n < 100 and k < 10 (again: set sizes will range in [1, k]).
The order of id's doesn't matter: {1, 5} == {5, 1}.
All combinations are possible, but some may be excluded.
All sets and values are constant and made only once. No deletes or inserts, no value updates.
Once generated, the only operations taking place will be look-ups.
Look-ups will be frequent and one-directional (given set, look up value).
There is no need to sort (or otherwise organize) the values.
Additionally, it would be nice (but not obligatory) if "neighboring" sets (drop one id, add one id, swap one id, etc) are easy to reach, as well as "all sets that include at least this set".
Any ideas?
Enumerate using the product of primes.
a -> 2
b -> 3
c -> 5
d -> 7
et cetera
Now hash(ab) := 6, and hash (abc) := 30
And a nice side effect is that, if "ab" is a subset of "abc", then:
hash(abc) % hash(ab) == 0
and
hash(abc) / hash(ab) == hash(c)
The bad news: You might run into overflow, the 100th prime will probably be around 1000, and 64 bits cannot accomodate 1000**10. This will not affect the functioning as a hash function; only the subset thingy will fail to work. the same method applied to anagrams
The other option is Zobrist-hashing. It is equivalent to the the primes method, but instead of primes you use a fixed set of (random) numbers, and instead of multiplying you use XOR.
For a fixed small (it needs << ~70 bits) set like yours, it might be possible to tune the zobrist tables to totally avoid collisions (yielding a perfect hash).
And the final (and simplest) way is to use a (100bit) bitmap, and treat that as a hashvalue (maybe after modulo table size)
And a totally unrelated method is to just build a decision tree on the bits of the bitmap. (the tree would have a maximal depth of k) a related kD tree on bit values
May be not the best solution, but you can do the following:
Sort the set from Lowest to highest with a simple IntegerComparator
Add each item of the set to a String
so if you have {2,5,9,4} first Step->{2,4,5,9}; second->"2459"
This way you will get a unique String from a unique set. If you really need to map them to an integer value, you can hash the string after that.
A second way I can think of is to store them in a java Set and simply map it against a HashMap with set as keys
Calculate a 'diff' from each set {1, 6, 87, 89} = {1,5,81,2,0,0,...}
{1,2,3,4} = { 1,1,1,1,0,0,0,0... };
Then binary encode each number with a variable length encoding and concatenate the bits.
It's hard to compare the sets (except for the first few equal bits), but because there can't be many large intervals in a set, all possible values just might fit into 64 bits. (slack of 16 bits at least...)

Good hash function for permutations?

I have got numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection is done quite often, and I need to check if a permutation of the chosen numbers has already been selected.
e.g one step selects [1, 10, 3, 18] and another one [10, 18, 3, 1] then the second selection can be discarded because it is a permutation.
I need to do this check very fast. Right now I put all arrays in a hashmap, and use a custom hash function: just sums up all the elements, so 1+10+3+18=32, and also 10+18+3+1=32. For equals I use a bitset to quickly check if elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
This works ok, but can generate lots of collisions, so the equals() method is called quite often. I was wondering if there is a faster way to check for permutations?
Are there any good hash functions for permutations?
UPDATE
I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, and array length 1 to 9. There are 3003 possible permutations, and a good hash should generated close to this many different hashes (I use 32 bit numbers for the hash):
41 different hashes for just adding (so there are lots of collisions)
8 different hashes for XOR'ing values together
286 different hashes for multiplying
3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)
So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
PS: I do not want to sort the values when I do not have to, because this would get too slow.
One potential candidate might be this.
Fix a odd integer R.
For each element e you want to hash compute the factor (R + 2*e).
Then compute the product of all these factors.
Finally divide the product by 2 to get the hash.
The factor 2 in (R + 2e) guarantees that all factors are odd, hence avoiding
that the product will ever become 0. The division by 2 at the end is because
the product will always be odd, hence the division just removes a constant bit.
E.g. I choose R = 1779033703. This is an arbitrary choice, doing some experiments should show if a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints) is
(R + 2) * (R + 20) * (R + 6) * (R + 36) = 3376724311
Hence the hash would be
3376724311/2 = 1688362155.
Summing the elements is already one of the simplest things you could do. But I don't think it's a particularly good hash function w.r.t. pseudo randomness.
If you sort your arrays before storing them or computing hashes, every good hash function will do.
If it's about speed: Have you measured where the bottleneck is? If your hash function is giving you a lot of collisions and you have to spend most of the time comparing the arrays bit-by-bit the hash function is obviously not good at what it's supposed to do. Sorting + Better Hash might be the solution.
If I understand your question correctly you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to make a call to a brute-force set comparison) you'll be able to compare such sets by checking whether their Bloom filter hash is equal.
The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.
depending if you have a lot of collisions (so the same hash but not a permutation), you might presort the arrays while hashing them. In that case you can do a more aggressive kind of hashing where you don't only add up the numbers but add some bitmagick to it as well to get quite different hashes.
This is only beneficial if you get loads of unwanted collisions because the hash you are doing now is too poor. If you hardly get any collisions, the method you are using seems fine
I would suggest this:
1. Check if the lengths of permutations are the same (if not - they are not equal)
Sort only 1 array. Instead of sorting another array iterate through the elements of the 1st array and search for the presence of each of them in the 2nd array (compare only while the elements in the 2nd array are smaller - do not iterate through the whole array).
note: if you can have the same numbers in your permutaions (e.g. [1,2,2,10]) then you will need to remove elements from the 2nd array when it matches a member from the 1st one.
pseudo-code:
if length(arr1) <> length(arr2) return false;
sort(arr2);
for i=1 to length(arr1) {
elem=arr1[i];
j=1;
while (j<=length(arr2) and elem<arr2[j]) j=j+1;
if elem <> arr2[j] return false;
}
return true;
the idea is that instead of sorting another array we can just try to match all of its elements in the sorted array.
You can probably reduce the collisions a lot by using the product as well as the sum of the terms.
1*10*3*18=540 and 10*18*3*1=540
so the sum-product hash would be [32,540]
you still need to do something about collisions when they do happen though
I like using string's default hash code (Java, C# not sure about other languages), it generates pretty unique hash codes.
so if you first sort the array, and then generates a unique string using some delimiter.
so you can do the following (Java):
int[] arr = selectRandomNumbers();
Arrays.sort(arr);
int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();
if performance is an issue, you can change the suggested inefficient string concatenation to use StringBuilder or String.format
String.format("{0},{1},{2},{3}", arr[0],arr[1],arr[2],arr[3]);
String hash code of course doesn't guarantee that two distinct strings have different hash, but considering this suggested formatting, collisions should be extremely rare

sorting algorithm where pairwise-comparison can return more information than -1, 0, +1

Most sort algorithms rely on a pairwise-comparison the determines whether A < B, A = B or A > B.
I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).
For some applications (such as near sorting or approximate sorting) this would enable a reasonable sort to be determined with less comparisons.
The extra information can indeed be used to minimize the total number of comparisons. Calls to the super_comparison function can be used to make deductions equivalent to a great number of calls to a regular comparsion function. For example, a much-less-than b and c little-less-than b implies a < c < b.
The deductions cans be organized into bins or partitions which can each be sorted separately. Effectively, this is equivalent to QuickSort with n-way partition. Here's an implementation in Python:
from collections import defaultdict
from random import choice
def quicksort(seq, compare):
'Stable in-place sort using a 3-or-more-way comparison function'
# Make an n-way partition on a random pivot value
segments = defaultdict(list)
pivot = choice(seq)
for x in seq:
ranking = 0 if x is pivot else compare(x, pivot)
segments[ranking].append(x)
seq.clear()
# Recursively sort each segment and store it in the sequence
for ranking, segment in sorted(segments.items()):
if ranking and len(segment) > 1:
quicksort(segment, compare)
seq += segment
if __name__ == '__main__':
from random import randrange
from math import log10
def super_compare(a, b):
'Compare with extra logarithmic near/far information'
c = -1 if a < b else 1 if a > b else 0
return c * (int(log10(max(abs(a - b), 1.0))) + 1)
n = 10000
data = [randrange(4*n) for i in range(n)]
goal = sorted(data)
quicksort(data, super_compare)
print(data == goal)
By instrumenting this code with the trace module, it is possible to measure the performance gain. In the above code, a regular three-way compare uses 133,000 comparisons while a super comparison function reduces the number of calls to 85,000.
The code also makes it easy to experiment with a variety comparison functions. This will show that naïve n-way comparison functions do very little to help the sort. For example, if the comparison function returns +/-2 for differences greater than four and +/-1 for differences four or less, there is only a modest 5% reduction in the number of comparisons. The root cause is that the course grained partitions used in the beginning only have a handful of "near matches" and everything else falls in "far matches".
An improvement to the super comparison is to covers logarithmic ranges (i.e. +/-1 if within ten, +/-2 if within a hundred, +/- if within a thousand.
An ideal comparison function would be adaptive. For any given sequence size, the comparison function should strive to subdivide the sequence into partitions of roughly equal size. Information theory tells us that this will maximize the number of bits of information per comparison.
The adaptive approach makes good intuitive sense as well. People should first be partitioned into love vs like before making more refined distinctions such as love-a-lot vs love-a-little. Further partitioning passes should each make finer and finer distinctions.
You can use a modified quick sort. Let me explain on an example when you comparison function returns [-2, -1, 0, 1, 2]. Say, you have an array A to sort.
Create 5 empty arrays - Aminus2, Aminus1, A0, Aplus1, Aplus2.
Pick an arbitrary element of A, X.
For each element of the array, compare it with X.
Depending on the result, place the element in one of the Aminus2, Aminus1, A0, Aplus1, Aplus2 arrays.
Apply the same sort recursively to Aminus2, Aminus1, Aplus1, Aplus2 (note: you don't need to sort A0, as all he elements there are equal X).
Concatenate the arrays to get the final result: A = Aminus2 + Aminus1 + A0 + Aplus1 + Aplus2.
It seems like using raindog's modified quicksort would let you stream out results sooner and perhaps page into them faster.
Maybe those features are already available from a carefully-controlled qsort operation? I haven't thought much about it.
This also sounds kind of like radix sort except instead of looking at each digit (or other kind of bucket rule), you're making up buckets from the rich comparisons. I have a hard time thinking of a case where rich comparisons are available but digits (or something like them) aren't.
I can't think of any situation in which this would be really useful. Even if I could, I suspect the added CPU cycles needed to sort fuzzy values would be more than those "extra comparisons" you allude to. But I'll still offer a suggestion.
Consider this possibility (all strings use the 27 characters a-z and _):
11111111112
12345678901234567890
1/ now_is_the_time
2/ now_is_never
3/ now_we_have_to_go
4/ aaa
5/ ___
Obviously strings 1 and 2 are more similar that 1 and 3 and much more similar than 1 and 4.
One approach is to scale the difference value for each identical character position and use the first different character to set the last position.
Putting aside signs for the moment, comparing string 1 with 2, the differ in position 8 by 'n' - 't'. That's a difference of 6. In order to turn that into a single digit 1-9, we use the formula:
digit = ceiling(9 * abs(diff) / 27)
since the maximum difference is 26. The minimum difference of 1 becomes the digit 1. The maximum difference of 26 becomes the digit 9. Our difference of 6 becomes 3.
And because the difference is in position 8, out comparison function will return 3x10-8 (actually it will return the negative of that since string 1 comes after string 2.
Using a similar process for strings 1 and 4, the comparison function returns -5x10-1. The highest possible return (strings 4 and 5) has a difference in position 1 of '-' - 'a' (26) which generates the digit 9 and hence gives us 9x10-1.
Take these suggestions and use them as you see fit. I'd be interested in knowing how your fuzzy comparison code ends up working out.
Considering you are looking to order a number of items based on human comparison you might want to approach this problem like a sports tournament. You might allow each human vote to increase the score of the winner by 3 and decrease the looser by 3, +2 and -2, +1 and -1 or just 0 0 for a draw.
Then you just do a regular sort based on the scores.
Another alternative would be a single or double elimination tournament structure.
You can use two comparisons, to achieve this. Multiply the more important comparison by 2, and add them together.
Here is a example of what I mean in Perl.
It compares two array references by the first element, then by the second element.
use strict;
use warnings;
use 5.010;
my #array = (
[a => 2],
[b => 1],
[a => 1],
[c => 0]
);
say "$_->[0] => $_->[1]" for sort {
($a->[0] cmp $b->[0]) * 2 +
($a->[1] <=> $b->[1]);
} #array;
a => 1
a => 2
b => 1
c => 0
You could extend this to any number of comparisons very easily.
Perhaps there's a good reason to do this but I don't think it beats the alternatives for any given situation and certainly isn't good for general cases. The reason? Unless you know something about the domain of the input data and about the distribution of values you can't really improve over, say, quicksort. And if you do know those things, there are often ways that would be much more effective.
Anti-example: suppose your comparison returns a value of "huge difference" for numbers differing by more than 1000, and that the input is {0, 10000, 20000, 30000, ...}
Anti-example: same as above but with input {0, 10000, 10001, 10002, 20000, 20001, ...}
But, you say, I know my inputs don't look like that! Well, in that case tell us what your inputs really look like, in detail. Then someone might be able to really help.
For instance, once I needed to sort historical data. The data was kept sorted. When new data were added it was appended, then the list was run again. I did not have the information of where the new data was appended. I designed a hybrid sort for this situation that handily beat qsort and others by picking a sort that was quick on already sorted data and tweaking it to be fast (essentially switching to qsort) when it encountered unsorted data.
The only way you're going to improve over the general purpose sorts is to know your data. And if you want answers you're going to have to communicate that here very well.

Resources