Background:
I'm working with permutations of the sequence of integers {0, 1, 2, ..., n}.
I have a local search algorithm that transforms a permutation in some systematic way into another permutation. The point of the algorithm is to produce a permutation that minimises a cost function. I'd like to work with a wide range of problems, from n=5 to n=400.
The problem:
To reduce search effort I need to be able to check if I've processed a particular permutation of integers before. I'm using a hash table for this and I need to be able to generate an id for each permutation which I can use as a key into the table. However, I can't think of any nice hash function that maps a set of integers into a key such that collisions do not occur too frequently.
Stuff I've tried:
I started out by generating a sequence of n prime numbers and multiplying the ith number in my permutation with the ith prime then summing the results. The resulting key however produces collisions even for n=5.
I also thought to concatenate the values of all numbers together and take the integer value of the resulting string as a key but the id quickly becomes too big even for small values of n. Ideally, I'd like to be able to store each key as an integer.
Does stackoverflow have any suggestions for me?
Zobrist hashing might work for you. You create an n×n matrix of random integers, where cell (i, j) represents "element i is in position j of the current permutation".
For a given permutation you pick the n cell values corresponding to its element placements and xor them together to get the permutation's key (note that key uniqueness is not guaranteed).
The point of this algorithm is that if you swap two elements in your permutation, you can cheaply derive the new key from the current one by xor-ing out the old positions and xor-ing in the new ones.
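A minimal Python sketch of the idea (the function names and the 64-bit key width are my choices, not part of the answer):

```python
import random

def make_zobrist_table(n, seed=0):
    # n x n table: entry [elem][pos] is a random 64-bit value meaning
    # "element elem sits at position pos"
    rng = random.Random(seed)
    return [[rng.getrandbits(64) for _ in range(n)] for _ in range(n)]

def zobrist_key(perm, table):
    # xor together one table entry per element placement
    key = 0
    for pos, elem in enumerate(perm):
        key ^= table[elem][pos]
    return key

def key_after_swap(key, perm, i, j, table):
    # incremental update for swapping positions i and j
    # (call before actually swapping perm[i] and perm[j])
    key ^= table[perm[i]][i] ^ table[perm[j]][j]  # xor out old placements
    key ^= table[perm[i]][j] ^ table[perm[j]][i]  # xor in new placements
    return key
```

The incremental update is what makes this attractive for local search: a swap costs four xors instead of a full rescan of the permutation.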
Judging by your question, and the comments you've left, I'd say your problem is not possible to solve.
Let me explain.
You say that you need a unique hash from your combination, so let's make that rule #1:
1: Need a unique number to represent a combination of an arbitrary number of digits/numbers
Ok, then in a comment you've said that since you're using quite a few numbers, storing them as a string or whatnot as a key to the hashtable is not feasible, due to memory constraints. So let's rewrite that into another rule:
2: Cannot use the actual data that were used to produce the hash as they are no longer in memory
Basically, you're trying to take a large number, and store that into a much smaller number range, and still have uniqueness.
Sorry, but you can't do that.
Typical hashing algorithms produce relatively unique hash values, so unless you're willing to accept collisions, in the sense that a new combination might be flagged as "already seen" even though it hasn't, then you're out of luck.
If you were to try a bit-field, where each combination has a bit, which is 0 if it hasn't been seen, you still need large amounts of memory.
For the permutation in n=20 that you left in a comment, you have 20! (2,432,902,008,176,640,000) combinations, which if you tried to simply store each combination as a 1-bit in a bit-field, would require 276,589TB of storage.
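The arithmetic above checks out; a quick sketch of it in Python:

```python
import math

bits = math.factorial(20)        # one bit per permutation of 20 elements
terabytes = bits / 8 / 2**40     # bits -> bytes -> binary terabytes
# terabytes comes out to roughly 276,589
```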
You're going to have to limit your scope of what you're trying to do.
As others have suggested, you can use hashing to generate an integer that will be unique with high probability. However, if you need the integer to always be unique, you should rank the permutations, i.e. assign an order to them. For example, a common order of permutations for set {1,2,3} is the lexicographical order:
1,2,3
1,3,2
2,1,3
2,3,1
3,1,2
3,2,1
In this case, the id of a permutation is its index in the lexicographical order. There are other methods of ranking permutations, of course.
Making the ids a contiguous range of integers makes it possible to implement the storage of processed permutations as a bit field or a boolean array.
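One standard ranking is the Lehmer code; here is a short Python sketch (my implementation, O(n²), which is fine for n up to a few hundred):

```python
from math import factorial

def lex_rank(perm):
    # 0-based index of perm in the lexicographic order of all
    # permutations of its elements
    n = len(perm)
    rank = 0
    for i in range(n):
        # count elements to the right that are smaller than perm[i]
        smaller = sum(1 for j in range(i + 1, n) if perm[j] < perm[i])
        rank += smaller * factorial(n - 1 - i)
    return rank
```

Note that for n=400 the rank has hundreds of digits, so it is a bignum rather than a machine integer; Python handles that transparently, and the rank is still an exact, collision-free key.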
How fast does it need to be?
You could always gather the integers as a string, then take the hash of that, and then just grab the first 4 bytes.
For a hash you could use any function really, like MD5 or SHA-256.
You could MD5 hash a comma-separated string containing your ints.
In C# it would look something like this (Disclaimer: I have no compiler on the machine I'm using today):
using System;
using System.Security.Cryptography;
using System.Text;

public class SomeClass {
    static Guid GetHash(int[] numbers) {
        string csv = string.Join(",", numbers);
        using (var md5 = MD5.Create()) {
            // MD5 produces 16 bytes, which is exactly the size of a Guid
            return new Guid(md5.ComputeHash(Encoding.ASCII.GetBytes(csv)));
        }
    }
}
Edit: What was I thinking? As stated by others, you don't need a hash. The CSV should be sufficient as a string Id (unless your numbers array is big).
Convert each number to a String, concatenate the Strings (via a StringBuffer), and take the contents of the StringBuffer as the key.
Not related directly to the question, but as an alternative solution you could use a trie as the lookup structure. Tries are very good for string operations, relatively easy to implement, and lookup is O(k) where k is the length of the key, which can beat a hash set for a large number of long strings. You also aren't limited in key size (unlike a regular hash keyed by a fixed-width int). The key in your case would be a string of all the numbers separated by some character.
Prime powers would work: if p_i is the ith prime and a_i is the ith element of your tuple, then
p_0**a_0 * p_1**a_1 * ... * p_n**a_n
should be unique by the Fundamental Theorem of Arithmetic. Those numbers will get pretty big, though :-)
(e.g. for n=5, (1,2,3,4,5) will map to 870,037,764,750 which is already more than 32 bits)
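A sketch in Python (the helper names are mine); it reproduces the value quoted above:

```python
def first_primes(n):
    # first n primes by trial division against the primes found so far
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def prime_power_key(seq):
    # p_0**a_0 * p_1**a_1 * ...: unique by the Fundamental Theorem
    # of Arithmetic, but the key grows very quickly
    key = 1
    for p, a in zip(first_primes(len(seq)), seq):
        key *= p ** a
    return key
```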
Similar to Bojan's post it seems like the best way to go is to have a deterministic order to the permutations. If you process them in that order then there is no need to do a lookup to see if you have already done any particular permutation.
Take two permutations of the same series of numbers {1, ..., n} and construct a mapping tuple (id, permutation1[id], permutation2[id]), or (id, f1(id), f2(id)); you get a unique map f3(id): for the tuple (id, f1(id), f2(id)), from id we get f2(id), then find an id' from the tuple (id', f1(id'), f2(id')) where f1(id') == f2(id).
Related
The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach using a rolling hash function, where for each sequence of length k a hash is computed and stored in a map. To check whether the current sequence is a repetition, we compute its hash and check if the hash already exists in the hash map. If yes, we include this sequence in our result; otherwise we add it to the hash map.
Rolling hash here means, when moving on to the next sequence by sliding the window by one, we use the hash of previous sequence in a way that we remove the contribution of the first character of previous sequence and add the contribution of the newly added char i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't go about making a perfect hash function so that collisions are avoided. It would be a great help if somebody can explain how to make a perfect hash under any circumstance and most importantly in this case.
This is actually a research problem.
Let's come to terms with some facts
Input = N, Input length = |N|
You have to move a size k, here k=10, sliding window over the input. Therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality-sensitive deterministic hashing. The downside of deterministic hashing is that its benefit diminishes: the more often you encounter similar strings, the harder it becomes to tell them apart by hash.
The longer your input the less effective hashing will be
Given these facts "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
So what alternatives do you have?
Bloom Filters. They are much more robust than simple hashing. The downside is sometimes they have a false positives. But this can be mitigated by using several filters.
Cuckoo hashing is similar to Bloom filters, but uses less memory, has locality-sensitive "hashing", and worst-case constant lookup time.
Just stick every suffix in a suffix trie. Once this is done, output every string at depth 10 that also has at least 2 children, with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward, but memory consumption is lower.
My favorite is the FM-index. In my opinion the cleanest solution uses the Burrows-Wheeler transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence as an integer by bit manipulation.
If your input k is relatively small, say around 10, you can encode your DNA sequence in an int. Since each character in the sequence has only 4 possibilities, A, C, G, T, you can make your own mapping which uses 2 bits per letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
This way, if k is 10, you don't need a string of 10 characters as the hash key. Instead, you only need 20 bits of an integer to represent the key string.
Then when you roll the hash, you left-shift the integer that stores the previous sequence by 2 bits, and use a bit operation such as |= to set the last two bits to the new character. Remember to mask off the 2 bits above the 2k-bit window, which removes the character that just slid out.
By doing this, a string can be stored in an integer, and using that integer as the hash key is nicer and cheaper in terms of hash computation. If k is larger than 16, the 2k bits no longer fit in a 32-bit int, but a 64-bit long covers k up to 32. Beyond that you would need a bitset or bit array, and hashing those becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
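A Python sketch of the whole scheme (function and mapping names are mine), reusing the example input from the question:

```python
ENCODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def repeated_kmers(dna, k):
    # slide a 2k-bit integer window across the sequence; every k-mer
    # maps to a distinct integer, so there are no collisions at all
    mask = (1 << (2 * k)) - 1           # keep only the lowest 2k bits
    window = 0
    seen, repeats = set(), set()
    for i, ch in enumerate(dna):
        window = ((window << 2) | ENCODE[ch]) & mask
        if i >= k - 1:                  # window now holds a full k-mer
            if window in seen:
                repeats.add(dna[i - k + 1 : i + 1])
            seen.add(window)
    return repeats
```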
You can build the suffix array and the LCP array. Iterate through the LCP array, every time you see a value greater or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both the suffix array and the LCP array can be done in linear time. So overall the solution is linear with respect to the size of the input plus output.
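A compact Python sketch of the reporting logic (for clarity I build the suffix array naively by sorting suffixes, which is O(n² log n); swap in SA-IS and Kasai's algorithm for the linear-time version the answer describes):

```python
def repeats_via_suffix_array(s, k):
    # naive suffix array: suffix start indices sorted by their suffixes
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    found = []
    in_run = False
    for a, b in zip(sa, sa[1:]):
        # LCP of two adjacent suffixes in the suffix array
        lcp = 0
        while a + lcp < len(s) and b + lcp < len(s) and s[a + lcp] == s[b + lcp]:
            lcp += 1
        if lcp >= k:
            if not in_run:          # report each run of >=k LCPs only once
                found.append(s[a:a + k])
            in_run = True
        else:
            in_run = False
    return found
```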
What you could do is use the Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have the three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size roughly 10^18. With a sufficiently large modulus you can more or less disregard the idea of a collision happening at all; as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than for a collision to happen, since you can drive the collision probability to be arbitrarily small.
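A sketch of the idea in Python, using the moduli from the answer (they are pairwise coprime, which is all CRT requires) with a polynomial rolling hash per modulus; the base 131 and the +1 offset are my choices:

```python
MODS = (10**6 + 3, 10**6 + 33, 10**6 + 37)   # pairwise coprime
BASE = 131

def multi_mod_key(seq):
    # one polynomial hash per modulus; by CRT the tuple of residues
    # is equivalent to a single residue mod the ~1e18 product
    keys = []
    for m in MODS:
        h = 0
        for x in seq:
            h = (h * BASE + x + 1) % m   # +1 so leading zeros still contribute
        keys.append(h)
    return tuple(keys)
```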
I came across this question while going through previous interview questions. Any direction to approach this ?
Find first unique number in an unsorted array of 32 bit numbers
without using hash tables or array of counters.
Seeing that the input array is unsorted, you can solve the problem by sorting it. This is a bit silly - why give an answer to the question in the question itself? - but the technicalities of the sorting are a little interesting, so maybe this answer isn't trivial after all.
When looking at the array after sorting, you will find several numbers that are not equal to their predecessor and successor; from these, you want to choose the first one in the original array.
To do that efficiently, in your temporary array which is being sorted, for each number, store also the index of that number in the original array. So, at the end, choose the number which is not equal to its predecessor and successor, and which has the lowest index in the original array.
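A Python sketch of that procedure (my code):

```python
def first_unique(arr):
    # sort (value, original index) pairs; a value is unique when it
    # differs from both neighbours in the sorted order
    pairs = sorted((v, i) for i, v in enumerate(arr))
    n = len(pairs)
    best = None                      # lowest original index of a unique value
    for j, (v, i) in enumerate(pairs):
        if (j == 0 or pairs[j - 1][0] != v) and (j == n - 1 or pairs[j + 1][0] != v):
            if best is None or i < best:
                best = i
    return arr[best] if best is not None else None
```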
When you have to "do X without using Y", you can sometimes use Z, which has the same effect as Y, and argue that you were not using Y. Or you can disguise Y well enough so no one would recognize using it at first sight.
With that in mind, consider storing repetition counters for all the numbers in a trie. To choose the first number from the set of all unique numbers, store also the indices together with repetition counters.
I can claim that a trie is not an array of repetition counters, because you don't have to allocate and initialize 2^32 memory cells for the array. This is more like a glorified hashtable, but looks different enough.
My question has to do with collisions. What is the maximum number of collisions that may result from hashing n keys? I believe you would be able to find this by taking n-1, but I am unsure if this is correct. I'm specifically trying to figure out a hash function that would produce that many collisions. I'm just having a hard time understanding the concept of the question. Any help on the subject would be appreciated!
The maximum number of collisions is equal to the number of items you hash.
Example:
hash function: h(x) = 3
All items will be hashed to key 3.
Notice that number of keys, n in your case, doesn't affect the answer, since no matter how many keys you have, your items are always going to be hashed in key 3, with the h(x) I provided above.
Visualization:
With an ordinary hash function, items spread across many different keys; but if I want the maximum number of collisions, then by using the h(x) provided above all my items get hashed to the very same key, i.e. key 3.
So in that case the maximum number of collisions is the number of items, 5.
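A tiny Python sketch of that worst case (the item names are made up):

```python
def h(x):
    # constant hash function: every item lands in bucket 3
    return 3

items = ["Ada", "Bea", "Cal", "Dan", "Eve"]
buckets = {}
for item in items:
    buckets.setdefault(h(item), []).append(item)

# all 5 items share bucket 3; whether you call that 5 collisions
# (items sharing one bucket) or 4 (insertions after the first that
# collide) is a matter of counting convention
collisions = sum(len(b) - 1 for b in buckets.values())
```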
The maximum expected number of collisions in uniform hashing is $$O\left(\frac{\log n}{\log\log n}\right)$$
I have an array with, for example, 1000000000000 of elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? Elements must be unique in whole array, not in list of N (3 in my example) elements.
I read about reservoir sampling, but it only provides a method to pick random elements, which may not be unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
counts = [0] * (MAXINT - MININT + 1)
for value in elements:
    counts[value - MININT] += 1
uniques = [v + MININT for v, c in enumerate(counts) if c == 1]
result = random.sample(uniques, 3)
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for a unique element in that bucket. If it is already picked in another bucket, find another. (Should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
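A single-hash Python sketch of the idea (following the note that one hash algorithm with its X lowest codes is often enough; the names and the SHA-256 choice are mine):

```python
import hashlib
import heapq

def stable_hash(value, salt=0):
    # deterministic per-value hash: every duplicate of a value collides
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def unique_candidates(elements, x=16, salt=0):
    # one pass: track the x values with the smallest hash codes and
    # how often each occurs; values with count 1 are unique, and the
    # min-hash selection makes the choice pseudo-random
    counts = {}   # tracked value -> occurrences
    heap = []     # max-heap over tracked hashes (negated for heapq)
    for v in elements:
        if v in counts:
            counts[v] += 1
            continue
        h = stable_hash(v, salt)
        if len(counts) < x:
            counts[v] = 1
            heapq.heappush(heap, (-h, v))
        elif -heap[0][0] > h:
            # v beats the worst tracked hash; evict that value
            _, evicted = heapq.heappop(heap)
            del counts[evicted]
            counts[v] = 1
            heapq.heappush(heap, (-h, v))
    return [v for v, c in counts.items() if c == 1]
```

Because the threshold (the x-th smallest hash seen) only tightens over time, any value that qualifies is tracked from its first occurrence onward, so the counts of tracked values are exact; pick your 3 samples from the returned candidates.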
Now I have some sets of integers, say:
set1 = {int1, int2, int3};
set2 = {int2, int3, int1};
set3 = {int1, int4, int2};
The order of the numbers is not taken into consideration, so set1 and set2 are the same, while set3 differs from the other two.
Now I want to generate a unique key for these sets to distinguish them, in that way, set1 and set2 should generate the same key.
I thought about this for a while; ideas such as summing the integers came to mind, but that is easily proved wrong (different sets can share the same sum). Sorting the set and doing
key = n1 + n2*2^16 + n3*2^32
may be a possible way but I wonder if this can be solved more elegantly.
The key can be either integer or string.
So any one has some idea about solving this as fast as possible? Or any reading material is welcome.
More info:
The numbers are in fact colors so each integer is less than 0xffffff
If these were small integers (all within the range 0-63, for example) then you could represent each set as a bit string: 1 for any integer that's present in the set, 0 for any that's absent. For sparse sets of large integers this would be horrendously expensive in terms of storage/memory.
One other method that comes to mind would be to sort the set and form the key as the concatenation of each number's digital representation (separated by some delimiter). So the set {2,1,3} -> "1/2/3" (using "/" as the delimiter) and {30,1,2,4} => "1/2/4/30"
I suppose you could also use a hybrid approach. All elements < 63 are encoded into a hex string and all others are encoded into a string as described. Then your final resulting key is formed by: HEXxA/B/c ... (with the "x" separating the small int hex string from the larger ints in the set).
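Two quick Python sketches of the sorted-key ideas above, one string-based and one packed-integer (taking advantage of each color fitting in 24 bits; function names are mine):

```python
def string_key(colors):
    # order-independent key: sort, then join with a delimiter
    return "/".join(str(c) for c in sorted(colors))

def int_key(colors):
    # pack each sorted 24-bit color into its own slot of one big integer
    key = 0
    for c in sorted(colors):
        key = (key << 24) | c
    return key
```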
If the numbers in your set are not too large, I think hashing each set into one string is a proper solution.
If they are larger, you can make them smaller with a mod function or similar, and then deal with them in the same way.
Hope this helps your solution if there is no better idea.
I think a key of practical size can only be a hash value - there will always be a few pairs of inputs that hash to the same key, but you can make this unlikely.
I think the idea of sorting and then applying a standard hash function is good, but I don't like your hash multipliers. If arithmetic is mod 2^32, then multiplying by 2^32 is multiplying by zero. If it is mod 2^64, then multiplying by 2^32 will lose the top 32 bits of the input.
I would use a hash function like that described in Why chose 31 to do the multiplication in the hashcode() implementation ?, where you keep a running total, multiplying the hash value by some odd number before you add the next item into it. Multiplying by an odd number mod 2^n at least does not lose information immediately. I would suggest 131, but Java has a tradition of using 31.
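A Python sketch of that running-total hash over the sorted set (the 64-bit wrap mirrors fixed-width integer arithmetic; multiplier 131 as suggested):

```python
def running_hash(items, multiplier=131):
    # sort first so equal sets hash equally, then fold each item in;
    # an odd multiplier is invertible mod 2^64, so the multiplication
    # itself loses no information
    h = 0
    for x in sorted(items):
        h = (h * multiplier + x) % 2**64
    return h
```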