Using a set of integers to generate unique key - algorithm

Now I have some sets of integers, say:
set1 = {int1, int2, int3};
set2 = {int2, int3, int1};
set3 = {int1, int4, int2};
The order of the numbers is not taken into consideration, so set1 and set2 are the same, while set3 is different from the other two.
Now I want to generate a unique key for these sets to distinguish them, in that way, set1 and set2 should generate the same key.
I thought about this for a while; ideas such as summing up the integers came to mind, but those are easily proved wrong. Sorting the set and computing
key = n1 + n2*2^16 + n3*2^32
may be a possible way, but I wonder if this can be solved more elegantly.
The key can be either integer or string.
So does anyone have an idea about solving this as fast as possible? Any reading material is welcome.
More info:
The numbers are in fact colors, so each integer is less than 0xffffff.

If these were small integers (all within range(0, 63), for example) then you could represent each set as a bitstring (1 for any integer that's present in the set; 0 for any that's absent). For sparse sets of large integers this would be horrendously expensive in terms of storage/memory.
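As a sketch of the small-integer case (Java; the helper name is mine), a single long works as an order-free key when all elements fit in 0..63:

static long bitmaskKey(int[] smallInts) {
    long key = 0;
    for (int v : smallInts) {
        key |= 1L << v;    // set bit v; insertion order is irrelevant
    }
    return key;
}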
One other method that comes to mind would be to sort the set and form the key as the concatenation of each number's digital representation (separated by some delimiter). So the set {2,1,3} -> "1/2/3" (using "/" as the delimiter) and {30,1,2,4} -> "1/2/4/30".
I suppose you could also use a hybrid approach. All elements < 63 are encoded into a hex string and all others are encoded into a string as described. Then your final resulting key is formed by: HEXxA/B/C... (with the "x" separating the small-int hex string from the larger ints in the set).
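A minimal sketch of the sort-and-concatenate key in Java (the method name is my own); {2,1,3} and {3,2,1} both map to "1/2/3":

import java.util.Arrays;

static String setKey(int[] values) {
    int[] sorted = values.clone();           // leave the caller's array untouched
    Arrays.sort(sorted);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < sorted.length; i++) {
        if (i > 0) sb.append('/');           // "/" as the delimiter
        sb.append(sorted[i]);
    }
    return sb.toString();
}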

If the numbers in your set are not too large, I think hashing each set into one string can be a proper solution.
If they are larger, you can make them smaller with a mod function or something similar, and then they can be dealt with in the same way.
Hope this helps if there is no better idea.

I think a key of practical size can only be a hash value - there will always be a few pairs of inputs that hash to the same key, but you can make this unlikely.
I think the idea of sorting and then applying a standard hash function is good, but I don't like your hash multipliers. If arithmetic is mod 2^32, then multiplying by 2^32 is multiplying by zero. If it is mod 2^64, then multiplying by 2^32 will lose the top 32 bits of the input.
I would use a hash function like that described in Why chose 31 to do the multiplication in the hashcode() implementation ?, where you keep a running total, multiplying the hash value by some odd number before you add the next item into it. Multiplying by an odd number mod 2^n will at least not lose information immediately. I would suggest 131, but Java has a tradition of using 31.
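As a sketch, sorting first and then folding with an odd multiplier might look like this in Java (131 as suggested above; collisions remain possible, as with any hash):

import java.util.Arrays;

static int setHash(int[] values) {
    int[] sorted = values.clone();
    Arrays.sort(sorted);                 // makes the hash order-independent
    int h = 1;
    for (int v : sorted) {
        h = h * 131 + v;                 // mod-2^32 arithmetic; the odd multiplier loses no information immediately
    }
    return h;
}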

Related

Repeated DNA sequence

The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach using a rolling hash function, where for each sequence of length k a hash is computed and stored in a map. To check whether the current sequence is a repetition, we compute its hash and check if the hash already exists in the hash map. If yes, then we include this sequence in our result; otherwise we add it to the hash map.
Rolling hash here means that when moving on to the next sequence by sliding the window by one, we reuse the hash of the previous sequence: we remove the contribution of the first character of the previous sequence and add the contribution of the newly added character, i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't come up with a perfect hash function so that collisions are avoided. It would be a great help if somebody could explain how to make a perfect hash under any circumstances, and most importantly in this case.
This is actually a research problem.
Let's come to terms with some facts:
Input = N, input length = |N|
You have to move a size-k (here k=10) sliding window over the input, therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality-sensitive deterministic hashing. The downside of deterministic hashing is that its benefit is greatly diminished: the more often you encounter similar strings, the harder they become to hash apart.
The longer your input, the less effective hashing will be.
Given these facts "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
So what alternatives do you have?
Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes give false positives. But this can be mitigated by using several filters.
Cuckoo hashing: similar to Bloom filters, but uses less memory and has locality-sensitive "hashing" and worst-case constant lookup time.
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has at least 2 children, with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward but memory consumption is less.
My favorite: the FM-Index. In my opinion the cleanest solution uses the Burrows-Wheeler Transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence into an integer by bit manipulation.
If your input k is relatively small, say around 10, then you can encode your DNA sequence in an int via bit manipulation. Since there are only 4 possibilities for each character in the sequence (A, C, G, T), you can simply make your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string with 10 characters as the hash key. Instead, you only need 20 bits of an integer to represent the previous key string.
Then when you do your rolling hash, you left-shift the integer that stores your previous sequence by 2 bits, then use a bit operation like |= to set the last two bits to your new character. And remember to clear the two bits just above your 2k-bit window that you just shifted out, meaning you are removing the oldest character from your sliding window.
By doing this, a string can be stored in an integer, and using that integer as the hash key might be nicer and cheaper in terms of the complexity of the hash function computation. If your input length k is longer than 16 (up to 32), you may be able to use a long value. Otherwise, you might use a bitset or a bit array, but hashing them becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
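A minimal Java sketch of this trick (assuming k <= 15 so the 2k-bit mask arithmetic stays safely within an int; use a long for larger k):

import java.util.*;

static List<String> repeatedSequences(String s, int k) {
    Map<Character, Integer> code = Map.of('A', 0, 'C', 1, 'G', 2, 'T', 3);
    int mask = (1 << (2 * k)) - 1;                 // keeps only the low 2k bits
    Set<Integer> seen = new HashSet<>();
    Set<Integer> reported = new HashSet<>();
    List<String> result = new ArrayList<>();
    int window = 0;
    for (int i = 0; i < s.length(); i++) {
        // shift in 2 bits for the new character, mask out the oldest character's bits
        window = ((window << 2) | code.get(s.charAt(i))) & mask;
        if (i < k - 1) continue;                   // window not yet full
        if (!seen.add(window) && reported.add(window)) {
            result.add(s.substring(i - k + 1, i + 1));
        }
    }
    return result;
}

With the question's input, repeatedSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10) yields [AAAAACCCCC, CCCCCAAAAA].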
You can build the suffix array and the LCP array. Iterate through the LCP array: every time you see a value greater than or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater than or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both the suffix array and the LCP array can be done in linear time, so overall the solution is linear with respect to the size of the input plus output.
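A compact Java sketch of this approach. For brevity the suffix array is built naively by sorting suffixes (O(n^2 log n) worst case) rather than with a linear-time construction; the LCP array uses Kasai's algorithm, and a set takes care of skipping repeated reports:

import java.util.*;

static Set<String> repeatsOfLengthK(String s, int k) {
    int n = s.length();
    Integer[] sa = new Integer[n];
    for (int i = 0; i < n; i++) sa[i] = i;
    // naive suffix array: sort suffix start positions lexicographically
    Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));
    int[] rank = new int[n];
    for (int i = 0; i < n; i++) rank[sa[i]] = i;
    // Kasai's algorithm: lcp[i] = common prefix length of suffixes sa[i-1] and sa[i]
    int[] lcp = new int[n];
    for (int i = 0, h = 0; i < n; i++) {
        if (rank[i] > 0) {
            int j = sa[rank[i] - 1];
            while (i + h < n && j + h < n && s.charAt(i + h) == s.charAt(j + h)) h++;
            lcp[rank[i]] = h;
            if (h > 0) h--;
        } else {
            h = 0;
        }
    }
    Set<String> out = new LinkedHashSet<>();       // dedupes repeated reports
    for (int i = 1; i < n; i++)
        if (lcp[i] >= k) out.add(s.substring(sa[i], sa[i] + k));
    return out;
}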
What you could do is use the Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have the three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size 10^18, more or less. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all --- as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than that a collision will happen, since you can drive the collision probability to be arbitrarily small.
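A sketch of the idea in Java (assuming the 2-bit A/C/G/T encoding from the earlier answer): hash the window modulo each of the three primes and use the residue triple as the key, which behaves like a single hash modulo their roughly 10^18 product:

static final long[] MODS = {1_000_003L, 1_000_033L, 1_000_037L};

// residue triple for s[from, from+k), base-4 polynomial hash per modulus
static String tripleKey(String s, int from, int k) {
    long[] h = new long[3];
    for (int t = 0; t < 3; t++) {
        long m = MODS[t], v = 0;
        for (int i = from; i < from + k; i++) {
            v = (v * 4 + "ACGT".indexOf(s.charAt(i))) % m;
        }
        h[t] = v;
    }
    return h[0] + "/" + h[1] + "/" + h[2];
}

Each modulus also admits the usual rolling update, so sliding the window stays O(1) per step.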

How good is hash function that is linear combination of values?

I was reading a text about hashing and found that a naive hash code of a char string can be implemented as a polynomial hash function:
h(S0, S1, S2, ..., SN-1) = S0*A^(N-1) + S1*A^(N-2) + S2*A^(N-3) + ... + SN-1*A^0
where Si is the character at index i and A is some integer.
But can't we straightaway sum as
h(S0, S1, S2, ..., SN-1) = S0*N + S1*(N-1) + S2*(N-2) + ... + SN-1*1?
I see this function as also good, since the two values 2*S0 + S1 and 2*S1 + S0 (each the reverse of the other) are not hashed to the same value. But nowhere do I find this type of hash function.
Suppose we work with strings of 30 characters. That's not long, but it's not so short that problems with the hash should arise purely because the strings are too short.
The sum of the weights is 465 (1+2+...+30); with printable ASCII characters that makes the maximum hash 58590, attained by "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~". There are a lot more possible printable ASCII strings of 30 characters than that (95^30 ≈ 2×10^59), but they all hash into the range of 0 to 58590. Naturally you cannot actually have that many strings at the same time, but you could have a lot more than 58590, and that would guarantee collisions just based on counting (it is very likely to happen much sooner, of course).
The maximum hash grows only slowly, you'd need strings of 34 million characters before the entire range of a 32bit integer is used.
The other way, multiplying by powers of A (this can be evaluated with Horner's scheme so no powers need to be calculated explicitly; it still only costs an addition and a multiplication per character, though the naive way is not the fastest way to compute that hash), does not have this problem. The powers of A quickly get big (and start wrapping, which is fine as long as A is odd), so strings with 30 characters stand a good chance of covering the entire range of whatever integer type you're using.
The problem with a linear hash function is that it's much easier to generate collisions.
Consider a string with 3 chars: S0, S1, S2.
The proposed hash code would be 3 * S0 + 2 * S1 + S2.
Every time we decrease char S2 by two (e.g. e --> c), and increase char S1 by one (e.g. m --> n), we obtain the same hash code.
Even just the fact that it is so easy to describe a hash-preserving operation should be an alarm (because some algorithm might process the string in exactly that manner). As a more extreme case, consider just summing the characters. In this situation all the anagrams of the original string would generate the same hash code (thus this hash would be useless in an application processing anagrams).
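A tiny Java demonstration of how easy collisions are to construct for the weighted-sum hash (the string pair is just one example produced by the S1+1 / S2-2 rule above):

static int weightedSum(String s) {
    int n = s.length(), h = 0;
    for (int i = 0; i < n; i++) {
        h += (n - i) * s.charAt(i);    // weights N, N-1, ..., 1
    }
    return h;
}

public static void main(String[] args) {
    // "ame": 3*97 + 2*109 + 101 = 610; "anc": 3*97 + 2*110 + 99 = 610
    System.out.println(weightedSum("ame") + " " + weightedSum("anc"));  // prints 610 610
}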

How many hash functions are required in a minhash algorithm

I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write up, but it leaves open the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog:
It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?
One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over. Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location.
Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.
This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
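A Java sketch of the XOR scheme (the class name, mask count, and seed are my own choices): one decent hash per shingle, then 199 further values derived by XOR with fixed random masks, keeping the minimum of each across all shingles:

import java.util.Arrays;
import java.util.Random;

class CheapMinHash {
    static final int NUM_HASHES = 200;
    static final int[] MASKS = new int[NUM_HASHES];
    static {
        Random rnd = new Random(0xBEEF);   // fixed seed: identical masks for every document
        MASKS[0] = 0;                      // function 0 keeps the unmodified good hash
        for (int i = 1; i < NUM_HASHES; i++) MASKS[i] = rnd.nextInt();
    }

    static int[] signature(Iterable<String> shingles) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String shingle : shingles) {
            int h = shingle.hashCode();    // stand-in for a good 32-bit hash
            for (int i = 0; i < NUM_HASHES; i++) {
                int v = h ^ MASKS[i];
                if (v < sig[i]) sig[i] = v;   // keep the minimum per hash function
            }
        }
        return sig;
    }
}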
Pretty much... but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means that a reported measurement of 78% could easily come from only 50% similarity.
Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is: calculate a decent-quality hashcode for each shingle/string once, then apply 200 cheap & fast transformations to emphasise certain factors / bring certain bits to the front.
I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine rotation by some number of bits with XORing by a randomly generated integer.
This both "spreads" the selectivity of the min() function around the bits, and varies which value min() ends up selecting.
The rationale for rotation is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same do lower bits have any effect in the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.
The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.
There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers, where it would be obvious for unsigned. XOR must still be used in these circumstances anyway.
Java has 32-bit hashcodes built in. And if you use the Google Guava libraries, there are 64-bit hashcodes available.
Thanks to @BillDimm for his input and persistence in pointing out that XOR was necessary.
What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable information in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
x is the key you want to hash
a is any number you can choose between 1 and p-1 inclusive
b is any number you can choose between 0 and p-1 inclusive
p is a prime number that is greater than the max possible value of x
m is the max possible value you want for the hash code, plus 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is size of machine word (typically 32)
- M is size of hash code you want in bits
- a is any odd integer that fits in to machine word
- b is any integer less than 2^(w-M)
The above works for hashing a number. To hash a string, get a hash code via built-in functions like GetHashCode, and then use that value in the above formula.
For example, let's say you need 200 16-bit hash codes for a string s. Then the following code can be written as an implementation:
public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = sizeof(int) * 8;                   // w = 32 bits
    var hashCodeSize = machineWordSize / 2;                  // M = 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;   // w - M
    var hstart = s.GetHashCode();
    var bmax = 1 << hashCodeSizeDiff;                        // b must be below 2^(w-M)
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        // a = i*2 + 1 is always odd; the shift must be unsigned per the formula
        var a = hstart * (i * 2 + 1);
        hashCodes[i] = (int)((uint)(a + rnd.Next(0, bmax)) >> hashCodeSizeDiff);
    }
    return hashCodes;
}
Notes:
I'm using a hash code size of half the machine word size, which in most cases would be 16 bits. This is not ideal and has a far greater chance of collision. This can be improved by upgrading all arithmetic to 64-bit.
Normally you want to select a and b both randomly within the above-stated ranges.
Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state-of-the-art practical and theoretical bounds. It has a nice graph explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes, the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε, you need k = 1/(f ε^2); if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
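A rough Java sketch of the one-hash, bottom-k idea (hash() here is only a placeholder for a decent 2-independent hash; the estimator is the standard bottom-k one, counting, among the k smallest values of the merged sketches, those present in both):

import java.util.TreeSet;

static TreeSet<Long> bottomK(Iterable<String> items, int k) {
    TreeSet<Long> sketch = new TreeSet<>();
    for (String item : items) {
        sketch.add(hash(item));
        if (sketch.size() > k) sketch.pollLast();   // evict the current largest
    }
    return sketch;
}

static double jaccardEstimate(TreeSet<Long> a, TreeSet<Long> b, int k) {
    TreeSet<Long> merged = new TreeSet<>(a);
    merged.addAll(b);
    int shared = 0, seen = 0;
    for (long v : merged) {                         // iterates in ascending order
        if (seen == k) break;
        seen++;
        if (a.contains(v) && b.contains(v)) shared++;
    }
    return (double) shared / seen;
}

static long hash(String s) {                        // placeholder, not production quality
    long h = 1125899906842597L;
    for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
    return h;
}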
It seems like another way to get N good hash values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XORing against N values, but there seems to be a possibility of better-quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.

Returning i-th combination of a bit array

Given a bit array of fixed length and the number of 0s and 1s it contains, how can I arrange all possible combinations such that returning the i-th combinations takes the least possible time?
The order in which they are returned is not important.
Here is an example:
array length = 6
number of 0s = 4
number of 1s = 2
possible combinations (6! / 4! / 2! = 15)
000011 000101 000110 001001 001010
001100 010001 010010 010100 011000
100001 100010 100100 101000 110000
problem
1st combination = 000011
5th combination = 001010
9th combination = 010100
With a different arrangement such as
100001 100010 100100 101000 110000
001100 010001 010010 010100 011000
000011 000101 000110 001001 001010
it shall return
1st combination = 100001
5th combination = 110000
9th combination = 010100
Currently I am using an O(n) algorithm which tests each bit for whether it is a 1 or a 0. The problem is that I need to handle lots of very long arrays (on the order of 10000 bits), and so it is still very slow (and caching is out of the question). I would like to know if you think a faster algorithm may exist.
Thank you
I'm not sure I understand the problem, but if you only want the i-th combination without generating the others, here is a possible algorithm:
There are C(M,N) = M!/(N!(M-N)!) combinations of N bits set to 1 whose set bits all lie within the lowest M positions.
You want the i-th: you iteratively increment M until C(M,N)>=i
while( C(M,N) < i ) M = M + 1
That will tell you the highest bit that is set.
Of course, you compute the combination iteratively with
C(M+1,N) = C(M,N)*(M+1)/(M+1-N)
Once found, you have the problem of finding the (i - C(M-1,N))-th combination of N-1 bits, so you can apply recursion on N...
Here is a possible variant with D = C(M+1,N) - C(M,N), and i = i-1 to make it start at zero:
SOL = 0
i = i - 1
while (N > 0)
    M = N
    C = 1
    D = 1
    while (i >= D)
        i = i - D
        M = M + 1
        D = N * C / (M - N)
        C = C + D
    SOL = SOL + (1 << (M - 1))
    N = N - 1
RETURN SOL
This will require large integer arithmetic if you have that many bits...
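A Java sketch of the same idea using BigInteger for the large arithmetic; it returns the i-th combination (0-indexed) of k set bits in the colexicographic order matching the question's first listing:

import java.math.BigInteger;

static BigInteger ithCombination(int k, BigInteger i) {
    BigInteger result = BigInteger.ZERO;
    while (k > 0) {
        // find the largest m with C(m, k) <= i, tracking c = C(m, k)
        int m = k - 1;
        BigInteger c = BigInteger.ZERO;             // C(k-1, k) = 0
        BigInteger next = BigInteger.ONE;           // C(k, k) = 1
        while (next.compareTo(i) <= 0) {
            m++;
            c = next;
            // C(m+1, k) = C(m, k) * (m+1) / (m+1-k), always an exact division
            next = next.multiply(BigInteger.valueOf(m + 1))
                       .divide(BigInteger.valueOf(m + 1 - k));
        }
        result = result.setBit(m);                  // place the highest remaining set bit
        i = i.subtract(c);
        k--;
    }
    return result;
}

For example, ithCombination(2, BigInteger.valueOf(4)) returns 10 (binary 001010), matching the 5th combination above.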
If the ordering doesn't matter (it just needs to remain consistent), I think the fastest thing to do would be to have combination(i) return anything you want that has the desired density the first time combination() is called with argument i. Then store that value in a member variable (say, a hashmap that has the value i as key and the combination you returned as its value). The second time combination(i) is called, you just look up i in the hashmap, figure out what you returned before and return it again.
Of course, when you're returning the combination for argument i, you'll need to make sure it's not something you have returned before for some other argument.
If the number you will ever be asked to return is significantly smaller than the total number of combinations, an easy implementation for the first call to combination(i) would be to make a value of the right length with all 0s, randomly set num_ones of the bits to 1, and then make sure it's not one you've already returned for a different value of i.
Your problem appears to be constrained by the binomial coefficient. In the example you give, the problem can be translated as follows:
there are 6 items that can be chosen 2 at a time. By using the binomial coefficient, the total number of unique combinations can be calculated as N! / (K! (N - K)!), which for the case of K = 2 simplifies to N(N-1)/2. Plugging 6 in for N, we get 15, which is the same number of combinations that you calculated with 6! / 4! / 2! - which appears to be another way to calculate the binomial coefficient that I have never seen before. I have tried other combinations as well and both formulas generate the same number of combinations. So, it looks like your problem can be translated to a binomial coefficient problem.
Given this, it looks like you might be able to take advantage of a class that I wrote to handle common functions for working with the binomial coefficient:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
It should not be hard to convert this class to the language of your choice.
There may be some limitations since you are using a very large N that could end up creating larger numbers than the program can handle. This is especially true if K can be large as well. Right now, the class is limited to the size of an int. But, it should not be hard to update it to use longs.

Generating ids for a set of integers

Background:
I'm working with permutations of the sequence of integers {0, 1, 2 ... , n}.
I have a local search algorithm that transforms a permutation in some systematic way into another permutation. The point of the algorithm is to produce a permutation that minimises a cost function. I'd like to work with a wide range of problems, from n=5 to n=400.
The problem:
To reduce search effort I need to be able to check if I've processed a particular permutation of integers before. I'm using a hash table for this and I need to be able to generate an id for each permutation which I can use as a key into the table. However, I can't think of any nice hash function that maps a set of integers into a key such that collisions do not occur too frequently.
Stuff I've tried:
I started out by generating a sequence of n prime numbers, multiplying the ith number in my permutation by the ith prime, and then summing the results. The resulting key, however, produces collisions even for n=5.
I also thought of concatenating the values of all numbers together and taking the integer value of the resulting string as a key, but the id quickly becomes too big even for small values of n. Ideally, I'd like to be able to store each key as an integer.
Does stackoverflow have any suggestions for me?
Zobrist hashing might work for you. You need to create an NxN matrix of random integers, cell (i, j) representing that element i is in the jth position of the current permutation.
For a given permutation you pick the N cell values and XOR them one by one to get the permutation's key (note that key uniqueness is not guaranteed).
The point of this algorithm is that if you swap two elements in your permutation, you can easily generate the new key from the current one by simply XOR-ing out the old positions and XOR-ing in the new ones.
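A Java sketch of Zobrist hashing for permutations up to n = 400 (the class name, table size, and seed are my own choices):

import java.util.Random;

class ZobristPermutation {
    static final int MAX_N = 400;
    static final long[][] Z = new long[MAX_N][MAX_N];   // Z[element][position]
    static {
        Random rnd = new Random(12345);   // fixed seed so keys are stable across runs
        for (int i = 0; i < MAX_N; i++)
            for (int j = 0; j < MAX_N; j++)
                Z[i][j] = rnd.nextLong();
    }

    static long key(int[] perm) {
        long h = 0;
        for (int pos = 0; pos < perm.length; pos++)
            h ^= Z[perm[pos]][pos];       // fold in "element perm[pos] at position pos"
        return h;
    }

    // Incremental update after swapping positions p and q (perm already reflects the swap):
    static long afterSwap(long h, int[] perm, int p, int q) {
        h ^= Z[perm[p]][q] ^ Z[perm[q]][p];   // XOR out the old placements
        h ^= Z[perm[p]][p] ^ Z[perm[q]][q];   // XOR in the new placements
        return h;
    }
}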
Judging by your question, and the comments you've left, I'd say your problem is not possible to solve.
Let me explain.
You say that you need a unique hash from your combination, so let's make that rule #1:
1: Need a unique number to represent a combination of an arbitrary number of digits/numbers
Ok, then in a comment you've said that since you're using quite a few numbers, storing them as a string or whatnot as a key to the hashtable is not feasible, due to memory constraints. So let's rewrite that into another rule:
2: Cannot use the actual data that were used to produce the hash as they are no longer in memory
Basically, you're trying to take a large number, and store that into a much smaller number range, and still have uniqueness.
Sorry, but you can't do that.
Typical hashing algorithms produce relatively unique hash values, so unless you're willing to accept collisions, in the sense that a new combination might be flagged as "already seen" even though it hasn't, then you're out of luck.
If you were to try a bit-field, where each combination has a bit, which is 0 if it hasn't been seen, you still need large amounts of memory.
For the n=20 permutation that you mentioned in a comment, you have 20! (2,432,902,008,176,640,000) combinations; if you tried to simply store each combination as a 1-bit entry in a bit-field, it would require 276,589 TB of storage.
You're going to have to limit your scope of what you're trying to do.
As others have suggested, you can use hashing to generate an integer that will be unique with high probability. However, if you need the integer to always be unique, you should rank the permutations, i.e. assign an order to them. For example, a common order of permutations for set {1,2,3} is the lexicographical order:
1,2,3
1,3,2
2,1,3
2,3,1
3,1,2
3,2,1
In this case, the id of a permutation is its index in the lexicographical order. There are other methods of ranking permutations, of course.
Making ids a range of continuous integers makes it possible to implement the storage of processed permutations as a bit field or a boolean array.
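A Java sketch of lexicographic ranking via the Lehmer code (O(n^2); BigInteger because ranks up to 400! far exceed any machine word):

import java.math.BigInteger;

static BigInteger lexRank(int[] perm) {
    int n = perm.length;
    BigInteger[] fact = new BigInteger[n];
    fact[0] = BigInteger.ONE;
    for (int i = 1; i < n; i++) fact[i] = fact[i - 1].multiply(BigInteger.valueOf(i));
    BigInteger rank = BigInteger.ZERO;
    for (int i = 0; i < n; i++) {
        int smaller = 0;                  // later elements smaller than perm[i]
        for (int j = i + 1; j < n; j++)
            if (perm[j] < perm[i]) smaller++;
        rank = rank.add(fact[n - 1 - i].multiply(BigInteger.valueOf(smaller)));
    }
    return rank;                          // 0-indexed position in lexicographic order
}

For instance, lexRank(new int[]{2, 1, 3}) returns 2, the 0-indexed position of 2,1,3 in the listing above. Note that while the ids are unique and continuous, for large n the range (n!) is far too large for a bit field, as an earlier answer points out.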
How fast does it need to be?
You could always gather the integers as a string, then take the hash of that, and then just grab the first 4 bytes.
For a hash you could use any function really, like MD5 or SHA-256.
You could MD5 hash a comma separated string containing your ints.
In C# it would look something like this (Disclaimer: I have no compiler on the machine I'm using today):
using System;
using System.Security.Cryptography;
using System.Text;

public class SomeClass {
    static Guid GetHash(int[] numbers) {
        // e.g. {3, 1, 2} -> "3,1,2"
        string csv = string.Join(",", numbers);
        // MD5 produces 16 bytes, which is exactly the size of a Guid
        byte[] hash = new MD5CryptoServiceProvider().ComputeHash(Encoding.ASCII.GetBytes(csv));
        return new Guid(hash);
    }
}
Edit: What was I thinking? As stated by others, you don't need a hash. The CSV should be sufficient as a string Id (unless your numbers array is big).
Convert each number to a String, concatenate the Strings (via StringBuffer) and take the contents of the StringBuffer as a key.
Not related directly to the question, but as an alternative solution you may use a trie as the lookup structure. Tries are very good for string operations, relatively easy to implement, and lookup should be faster (at most O(k), where k is the length of a key) than a hashset for a big number of long strings. And you aren't limited in key size (unlike a regular hashset, where the key must fit in an int). The key in your case would be a string of all the numbers separated by some char.
Prime powers would work: if p_i is the ith prime and a_i is the ith element of your tuple, then
p_0**a_0 * p_1**a_1 * ... * p_n**a_n
should be unique by the Fundamental Theorem of Arithmetic. Those numbers will get pretty big, though :-)
(e.g. for n=5, (1,2,3,4,5) will map to 870,037,764,750 which is already more than 32 bits)
Similar to Bojan's post, it seems like the best way to go is to have a deterministic order over the permutations. If you process them in that order, then there is no need to do a lookup to see if you have already done any particular permutation.
Get two permutations of the same series of numbers {1, ..., n} and construct a mapping tuple (id, permutation1[id], permutation2[id]), or (id, f1(id), f2(id)); you will get a unique map by {f3(id) | for tuple (id, f1(id), f2(id)), from id we get f2(id), and find an id' from the tuple (id', f1(id'), f2(id')) where f1(id') == f2(id)}.
