Resolving second case collisions using Double Hashing


When provided with a second collision case, how is this resolved?
I.E:
Let's say we have an array of numbers:
[22, 1, 13, 11, 24, -1, -1, -1, -1, -1, -1]
Where -1 indicates empty in the array....
if we were to attempt to insert 33 using
h1(key) = key % 11
h2(key) = 7 - (key % 7)
Passing in 33 would give 2, where the array location 2 is already occupied (with 13). How do we handle this collision case? Do we pass the returned mod value into h2 again? Do we replace the value at that array location? (I suspect the latter is not the case.)
Edit: Added parentheses to h2

With double hashing, you compute the position as:
pos = (h1 + i * h2) mod table_size
Here h1 and h2 are both computed from the original key, and i starts at 0. The trick is to increment i until an empty position in the hash table is found. The computation is therefore done not just once but repeatedly, until a free slot has been found. See the Wikipedia article for details.
There are other forms of open addressing similar to double hashing that are very efficient too, for example cuckoo hashing and robin-hood hashing.
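As a minimal sketch of the probing loop described above, here is the table from the question with 33 inserted. The helper name `insert` and the `EMPTY` marker are illustrative assumptions; h1 and h2 are the functions from the question.

```python
EMPTY = -1  # marker for a free slot, as in the question


def h1(key, size):
    return key % size


def h2(key):
    return 7 - (key % 7)


def insert(table, key):
    """Insert key using double hashing and return the slot that was used."""
    size = len(table)
    start, step = h1(key, size), h2(key)  # both derived from the original key
    for i in range(size):
        pos = (start + i * step) % size
        if table[pos] == EMPTY:
            table[pos] = key
            return pos
    raise RuntimeError("no free slot found")


table = [22, 1, 13, 11, 24, -1, -1, -1, -1, -1, -1]
print(insert(table, 33))  # probes slots 0, 2 and 4 (all occupied), then uses slot 6
```

Because h2 never returns 0 and the table size 11 is prime, the probe sequence visits every slot before it repeats.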

Related

Hashing with division remainder method

I don't understand this exercise.
Hash the keys: (13,17,39,27,1,20,4,40,25,9,2,37) into a hash table of size 13 using the division-remainder method.
a) find a suitable value for m.
b) handle collisions using linked lists and visualize the result in a table like this
0→
1→
2→
3→
4→
5→
6→
...
c) Handle collisions with linear probing using the sequence s(j) = j and illustrate the development in a table by starting a new column for every insert (don’t forget to copy the cells already filled to the right) and by using downwards arrows to show the probing steps in case of collisions.
my attempt:
a) if the table size is 13, m also has to be 13 because of the residue classes
b) for example 0→ 39 -> 13 ....
c) I have no idea
It would be really great if someone could help me solve it. :)

Let me give a brief overview of all the topics that will be used here.
A hash map is a data structure that uses a hash function to map identifying values, known as keys, to their associated values. It contains key-value pairs and allows retrieving a value by its key.
Just as you can get any element of an array by its index, you can get any value in a hash map by its key.
Basically, something like this happens: you are given a key (a string here), the key is hashed, and the value is stored at the resulting index in an array.
In our example image, if you want the value for "Billy", we hash "Billy" again and get 03. Now we just check the value at index 3, and that is the stored value for the key "Billy".
In your case you have to hash integers, not strings.
Now, how to hash keys?
There are several methods: you could sum the ASCII values of the characters of a string, or use anything else you can think of.
Let's say you have this array: [100, 1, 3, 56, 80]
and you have to store it in a bucket array of size 13.
We can't use those values directly as indices, because we would need both index 1 and index 100, which would force the bucket array to have more than 100 slots.
But if you take the remainder of each number when divided by 13, the remainder is always guaranteed to be from 0 to 12, so you can use a bucket array of size 13 if you hash keys using the division method.
[100, 1, 3, 56, 80] remainder with 13 -> [9, 1, 3, 4, 2]
Thus you store 100's value at index 9, and so on.
Collision:
But what if the array contains both 2 and 80? Both give remainder 2. What do we store at index 2 now?
In our example image, let's say "SACHU" also gives 03 after hashing. Now two keys give the same index; this is called a collision, and it can be resolved using two methods:
linked-list storage (store both values at the same index using a linked list)
linear probing: in simple words, index 03 is already occupied, so we try to find the next empty index. With the simplest probing, in our image example 06 is empty, so we store the value for "SACHU" at 06, not 03.
(This is a little hard, so I highly suggest you read up on hashing and collisions on the internet.)
Now, there is one method where h(x) denotes the hash of an integer x.
If the number is x, the first hash is h1 = h(x).
If index h1 is not empty, we hash that index again: h2 = h(h1).
And so on. I am not sure, but I guess this is what is meant by the s(j) = j method.
THESE ARE THE METHODS WHICH YOU HAVE TO USE IN YOUR PROBLEM.
I suggest you give it a try first.
You can read more about it online, and comment if you are still not able to solve it.
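For checking a hand-drawn table, here is a rough sketch of both collision strategies applied to the exercise keys with m = 13. It assumes the common reading of the probing sequence s(j) = j as "try slot (h(k) + s(j)) mod m for j = 0, 1, 2, ...", which is plain linear probing rather than the rehashing guess above; some textbooks subtract s(j) instead, which probes in the other direction. The function names are illustrative.

```python
M = 13  # table size and modulus from the exercise
KEYS = [13, 17, 39, 27, 1, 20, 4, 40, 25, 9, 2, 37]


def h(key):
    """Division-remainder hash."""
    return key % M


def chain_all(keys):
    """Separate chaining: each slot holds a list of the keys hashed to it."""
    table = [[] for _ in range(M)]
    for key in keys:
        table[h(key)].append(key)
    return table


def probe_all(keys):
    """Linear probing, assuming slot = (h(key) + j) % M with s(j) = j."""
    table = [None] * M
    for key in keys:
        for j in range(M):
            slot = (h(key) + j) % M
            if table[slot] is None:
                table[slot] = key
                break
        else:
            raise RuntimeError("table is full")
    return table


for index, bucket in enumerate(chain_all(KEYS)):
    print(index, "->", bucket)
print(probe_all(KEYS))
# [13, 39, 27, 1, 17, 4, 40, 20, 2, 9, None, 37, 25]
```

With chaining, slot 1 ends up holding 27, 1 and 40; with linear probing under the stated assumption, 40 walks from slot 1 down to slot 6 before it finds a free cell.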

Simple hashcode in hashmap misconception?

I am implementing my own specialized hashmap which has generic value types, but keys are always of type long. Here and there, I am seeing people suggesting that I should multiply key by a prime and then get modulo by number of buckets:
int bucket = (key * prime) % numOfBuckets;
and I don't understand why? It seems to me that it has exactly the same distribution as simple:
int bucket = key % numOfBuckets;
For example, if numOfBuckets is 8, with second "algorithm" we get buckets like {0, 1, 2, 3, 4, 5, 6, 7} repeating for key = 0 to infinity. In first algorithm for same keys we get buckets {0, 3, 6, 1, 4, 7, 2, 5} (or similar) also repeating. Basically we have the same problem like when using identity hash.
Basically, in both cases we get collisions for keys:
key = x + k*numOfBuckets (for k = 1 to infinity; and x = key % numOfBuckets)
because when we get modulo by numOfBuckets we always get x. So, what's the deal with first algorithm, can someone enlighten me?

If numOfBuckets is a power of two and the prime is odd (which seems to be the intended use case), then we have gcd(numOfBuckets, prime) == 1. That in turn means there is a number inverse such that inverse * prime = 1 (mod numOfBuckets), so the multiplication is a bijective operation that just shuffles the buckets around a bit. That is of course useless, so your conclusions are correct.
Or perhaps more intuitively: in a multiplication information only flows from the lowest bit to the highest, never in reverse. So any of the bits that the bucket index would not rely on without the multiplication, are still discarded with the multiplication.
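A small check of this argument, written in Python purely for illustration (the thread itself is about Java), with 8 buckets and the prime 3 as assumed example values:

```python
num_buckets = 8   # a power of two
prime = 3         # any odd multiplier is coprime to a power of two

# Modular inverse of prime modulo num_buckets (three-argument pow, Python 3.8+).
inverse = pow(prime, -1, num_buckets)
print(inverse, (inverse * prime) % num_buckets)  # 3 1

keys = range(32)
plain = [k % num_buckets for k in keys]
mixed = [(k * prime) % num_buckets for k in keys]
print(mixed[:8])  # [0, 3, 6, 1, 4, 7, 2, 5]

# Two keys collide after the multiplication exactly when they collided before.
same_collisions = all(
    (plain[a] == plain[b]) == (mixed[a] == mixed[b])
    for a in keys for b in keys
)
print(same_collisions)  # True
```

The bucket labels are permuted, but the groups of colliding keys are identical, which is why the multiplication alone does not help.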
Some other techniques do help, for example Java's HashMap uses this:
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
Another thing that works is multiplying by some large constant and then using the upper bits of the result (which contain a mixture of the bits below them, so all bits of the key can be used that way).
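To make the "upper bits" idea concrete, here is a sketch in Python (chosen only for illustration). The 64-bit constant is an assumption on my part: it is the odd number closest to 2^64 divided by the golden ratio, often used for this kind of multiplicative hashing, and the function names are made up for the example.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # assumed constant, roughly 2**64 / golden ratio


def bucket_mod(key, bits):
    """Plain key % numOfBuckets with numOfBuckets = 2**bits."""
    return key % (1 << bits)


def bucket_mul_mod(key, bits, prime=31):
    """(key * prime) % numOfBuckets: still depends only on the low bits."""
    return (key * prime) % (1 << bits)


def bucket_mul_high(key, bits):
    """Multiply in 64-bit arithmetic, then keep the top `bits` bits."""
    return ((key * GOLDEN) & MASK64) >> (64 - bits)


keys = range(0, 800, 8)  # keys that all share the same low three bits
print(len({bucket_mod(k, 3) for k in keys}))       # 1
print(len({bucket_mul_mod(k, 3) for k in keys}))   # 1
print(len({bucket_mul_high(k, 3) for k in keys}))  # 8
```

Keys that are all multiples of 8 pile into one bucket with both modulo variants, while the upper-bits variant spreads them over all eight buckets.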

How to generate a pseudo-random involution?

For generating a pseudo-random permutation, the Knuth shuffles can be used. An involution is a self-inverse permutation and I guess, I could adapt the shuffles by forbidding touching an element multiple times. However, I'm not sure whether I could do it efficiently and whether it generates every involution equiprobably.
I'm afraid an example is needed: on the set {0,1,2}, there are 6 permutations, out of which 4 are involutions. I'm looking for an algorithm that generates one of them at random, each with the same probability.
A correct but very inefficient algorithm would be: Use Knuth shuffle, retry if it's no involution.
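For reference, that rejection approach could look roughly like the sketch below (function names are illustrative). It is uniform because the shuffle is uniform over all permutations and every non-involution is simply discarded, but the expected number of retries grows quickly with n.

```python
import random


def is_involution(p):
    """True if applying the permutation p twice gives the identity."""
    return all(p[p[i]] == i for i in range(len(p)))


def random_involution_by_rejection(n):
    """Shuffle repeatedly until the permutation happens to be an involution."""
    while True:
        p = list(range(n))
        random.shuffle(p)  # uniform shuffle (Fisher-Yates / Knuth)
        if is_involution(p):
            return p


print(random_involution_by_rejection(6))
```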

Let's here use a(n) as the number of involutions on a set of size n (as OEIS does). For a given set of size n and a given element in that set, the total number of involutions on that set is a(n). That element must either be unchanged by the involution or be swapped with another element. The number of involutions that leave our element fixed is a(n-1), since those are involutions on the other elements. Therefore a uniform distribution on the involutions must have a probability of a(n-1)/a(n) of keeping that element fixed. If it is to be fixed, just leave that element alone. Otherwise, choose another element that has not yet been examined by our algorithm to swap with our element. We have just decided what happens with one or two elements in the set: keep going and decide what happens with one or two elements at a time.
To do this, we need a list of the counts of involutions for each i <= n, but that is easily done with the recursion formula
a(i) = a(i-1) + (i-1) * a(i-2)
(Note that this formula from OEIS also comes from my algorithm: the first term counts the involutions keeping the first element where it is, and the second term is for the elements that are swapped with it.) If you are working with involutions, this is probably important enough to break out into another function, precompute some smaller values, and cache the function's results for greater speed, as in this code:
# Counts of involutions (self-inverse permutations) for each size
_invo_cnts = [1, 1, 2, 4, 10, 26, 76, 232, 764, 2620, 9496, 35696, 140152]

def invo_count(n):
    """Return the number of involutions of size n and cache the result."""
    for i in range(len(_invo_cnts), n+1):
        _invo_cnts.append(_invo_cnts[i-1] + (i-1) * _invo_cnts[i-2])
    return _invo_cnts[n]
We also need a way to keep track of the elements that have not yet been decided, so we can efficiently choose one of those elements with uniform probability and/or mark an element as decided. We can keep them in a shrinking list, with a marker to the current end of the list. When we decide an element, we move the current element at the end of the list to replace the decided element then reduce the list. With that efficiency, the complexity of this algorithm is O(n), with one random number calculation for each element except perhaps the last. No better order complexity is possible.
Here is code in Python 3.5.2. The code is somewhat complicated by the indirection involved through the list of undecided elements.
from random import randrange

def randinvolution(n):
    """Return a random (uniform) involution of size n."""
    # Set up main variables:
    # -- the result so far as a list
    involution = list(range(n))
    # -- the list of indices of unseen (not yet decided) elements.
    #    unseen[0:cntunseen] are unseen/undecided elements, in any order.
    unseen = list(range(n))
    cntunseen = n
    # Make an involution, progressing one or two elements at a time
    while cntunseen > 1:  # if only one element remains, it must be fixed
        # Decide whether current element (index cntunseen-1) is fixed
        if randrange(invo_count(cntunseen)) < invo_count(cntunseen - 1):
            # Leave the current element as fixed and mark it as seen
            cntunseen -= 1
        else:
            # In involution, swap current element with another not yet seen
            idxother = randrange(cntunseen - 1)
            other = unseen[idxother]
            current = unseen[cntunseen - 1]
            involution[current], involution[other] = (
                involution[other], involution[current])
            # Mark both elements as seen by removing from start of unseen[]
            unseen[idxother] = unseen[cntunseen - 2]
            cntunseen -= 2
    return involution
I did several tests. Here is the code I used to check for validity and uniform distribution:
def isinvolution(p):
    """Flag if a permutation is an involution."""
    return all(p[p[i]] == i for i in range(len(p)))

# test the validity and uniformness of randinvolution()
n = 4
cnt = 10 ** 6
distr = {}
for j in range(cnt):
    inv = tuple(randinvolution(n))
    assert isinvolution(inv)
    distr[inv] = distr.get(inv, 0) + 1
print('In {} attempts, there were {} random involutions produced,'
      ' with the distribution...'.format(cnt, len(distr)))
for x in sorted(distr):
    print(x, str(distr[x]).rjust(2 + len(str(cnt))))
And the results were
In 1000000 attempts, there were 10 random involutions produced, with the distribution...
(0, 1, 2, 3) 99874
(0, 1, 3, 2) 100239
(0, 2, 1, 3) 100118
(0, 3, 2, 1) 99192
(1, 0, 2, 3) 99919
(1, 0, 3, 2) 100304
(2, 1, 0, 3) 100098
(2, 3, 0, 1) 100211
(3, 1, 2, 0) 100091
(3, 2, 1, 0) 99954
That looks pretty uniform to me, as do other results I checked.

An involution is a one-to-one mapping that is its own inverse. Any cipher is a one-to-one mapping; it has to be in order for a ciphertext to be unambiguously decrypted.
For an involution you need a cipher that is its own inverse. Such ciphers exist, ROT13 is an example. See Reciprocal Cipher for some others.
For your question I would suggest an XOR cipher. Pick a random key at least as long as the longest piece of data in your initial data set. If you are using 32 bit numbers, then use a 32 bit key. To permute, XOR the key with each piece of data in turn. The reverse permutation (equivalent to decrypting) is exactly the same XOR operation and will get back to the original data.
This will solve the mathematical problem, but it is most definitely not cryptographically secure. Repeatedly using the same key will allow an attacker to discover the key. I assume that there is no security requirement over and above the need for a random-seeming involution with an even distribution.
ETA: This is a demo, in Java, of what I am talking about in my second comment. Being Java, I use indexes 0..12 for your 13 element set.
public static void Demo() {
    final int key = 0b1001;
    System.out.println("key = " + key);
    System.out.println();
    for (int i = 0; i < 13; ++i) {
        System.out.print(i + " -> ");
        int ctext = i ^ key;
        while (ctext >= 13) {
            System.out.print(ctext + " -> ");
            ctext = ctext ^ key;
        }
        System.out.println(ctext);
    }
} // end Demo()
The output from the demo is:
key = 9
0 -> 9
1 -> 8
2 -> 11
3 -> 10
4 -> 13 -> 4
5 -> 12
6 -> 15 -> 6
7 -> 14 -> 7
8 -> 1
9 -> 0
10 -> 3
11 -> 2
12 -> 5
Where a transformed key would fall off the end of the array it is transformed again until it falls within the array. I am not sure if a while construction will fall within the strict mathematical definition of a function.
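On that last doubt: since (i XOR key) XOR key is i again, the loop body can run at most once, so the whole thing collapses to a single conditional and is an ordinary function. A short Python sketch of the same mapping (the function name is made up for illustration) makes this explicit and checks the involution property:

```python
def xor_involution(n, key):
    """Map i to i ^ key when that stays below n, otherwise leave i fixed."""
    return [i ^ key if (i ^ key) < n else i for i in range(n)]


p = xor_involution(13, 0b1001)
print(p)  # [9, 8, 11, 10, 4, 12, 6, 7, 1, 0, 3, 2, 5]
print(all(p[p[i]] == i for i in range(len(p))))  # True
```

The elements that would leave the range (4, 6 and 7 here) become fixed points, matching the demo output above.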

Unique pair of two integers [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Mapping two integers to one, in a unique and deterministic way
I'm trying to create a unique identifier for a pair of two integers (Ruby):
f(i1,i2) = f(i2, i1) = some_unique_value
So i1+i2, i1*i2 and i1^i2 are not unique, and neither is (i1>i2) ? "i1" + "i2" : "i2" + "i1".
I think the following solution will be OK:
(i1>i2) ? "i1" + "_" + "i2" : "i2" + "_" + "i1"
but:
I have to save the result in a DB and index it, so I would prefer it to be an integer and as small as possible.
Can Zlib.crc32(f(i1,i2)) guarantee uniqueness?
Thanks.
UPD:
Actually, I'm not sure the result MUST be integer. Maybe I can convert it to decimal:
(i1>i2) ? i1.i2 : i2.i1
?

What you're looking for is called a Pairing function.
The following illustration from the German wikipedia page clearly shows how it works:
Implemented in Ruby:
def cantor_pairing(n, m)
  (n + m) * (n + m + 1) / 2 + m
end

(0..5).map do |n|
  (0..5).map do |m|
    cantor_pairing(n, m)
  end
end
=> [[ 0,  2,  5,  9, 14, 20],
    [ 1,  4,  8, 13, 19, 26],
    [ 3,  7, 12, 18, 25, 33],
    [ 6, 11, 17, 24, 32, 41],
    [10, 16, 23, 31, 40, 50],
    [15, 22, 30, 39, 49, 60]]
Note that you will need to store the result of this pairing in a datatype with at least as many bits as both your input numbers put together. (If both input numbers are 32-bit, there are 2^64 possible combinations, so you need at least a 64-bit datatype; the Cantor function itself reaches values just under 2^65 for such inputs, so it needs slightly more.)
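The pairing function as written distinguishes (n, m) from (m, n), whereas the question asks for f(i1, i2) = f(i2, i1). One way to get that, sketched here in Python as an illustration (the wrapper name `unordered_pair_id` is my own, and non-negative inputs are assumed), is to sort the two numbers before pairing:

```python
def cantor_pairing(n, m):
    """Cantor pairing for non-negative integers, as in the Ruby version."""
    return (n + m) * (n + m + 1) // 2 + m


def unordered_pair_id(i1, i2):
    """Symmetric variant: order the inputs first so (a, b) and (b, a) agree."""
    lo, hi = sorted((i1, i2))
    return cantor_pairing(lo, hi)


print(cantor_pairing(3, 5))     # 41, matching the table above
print(unordered_pair_id(5, 3))  # 41
print(unordered_pair_id(3, 5))  # 41
```

Sorting keeps the result unique per unordered pair, because the Cantor function is injective on ordered pairs and each unordered pair maps to exactly one sorted ordered pair.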

No, Zlib.crc32(f(i1,i2)) is not unique for all integer values of i1 and i2.
If i1 and i2 are also 32bit numbers then there are many more combinations of them than can be stored in a 32bit number, which is returned by CRC32.

CRC32 is not unique, and wouldn't be good to use as a key. Assuming you know the maximum value of your integers i1 and i2:
unique_id = (max_i2+1)*i1 + i2
If your integers can be negative, or will never be below a certain positive integer, you'll need the max and min values:
(max_i2-min_i2+1) * (i1-min_i1) + (i2-min_i2)
This will give you the absolute smallest number possible to identify both integers.
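A small Python sketch of the second formula, with a decoder added to show that the mapping is reversible. The function names and the sample bounds are assumptions for illustration; to satisfy f(i1, i2) = f(i2, i1) from the question you would additionally sort the two inputs first.

```python
def bounded_pair_id(i1, i2, min_i1, min_i2, max_i2):
    """Row-major index of (i1, i2) in a grid whose rows are max_i2 - min_i2 + 1 wide."""
    width = max_i2 - min_i2 + 1
    return width * (i1 - min_i1) + (i2 - min_i2)


def bounded_pair_decode(pair_id, min_i1, min_i2, max_i2):
    """Inverse of bounded_pair_id."""
    width = max_i2 - min_i2 + 1
    row, col = divmod(pair_id, width)
    return row + min_i1, col + min_i2


# Assumed sample ranges: i1 >= -5 and 0 <= i2 <= 9
print(bounded_pair_id(-2, 7, -5, 0, 9))  # 37
print(bounded_pair_decode(37, -5, 0, 9))  # (-2, 7)
```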

Well, no 4-byte hash will be unique when its input is an arbitrary binary string of more than 4 bytes. Your strings are from a highly restricted symbol set, so collisions will be fewer, but "no, not unique".
There are two ways to use a smaller integer than the possible range of values for both of your integers:
Have a system that works despite occasional collisions
Check for collisions and use some sort of rehash
The obvious way to solve your problem with a 1:1 mapping requires that you know the maximum value of one of the integers. Just multiply one by the maximum value and add the other, or determine a power of two ceiling, shift one value accordingly, then OR in the other. Either way, every bit is reserved for one or the other of the integers. This may or may not meet your "as small as possible" requirement.
Your ###_### string is unique per pair; if you could just store that as a string you win.

Here's a better, more space-efficient solution: my answer on it here.

Compute rank of a combination?

I want to pre-compute some values for each combination in a set of combinations. For example, when choosing 3 numbers from 0 to 12, I'll compute some value for each one:
>>> for n in choose(range(13), 3):
        print n, foo(n)
(0, 1, 2) 78
(0, 1, 3) 4
(0, 1, 4) 64
(0, 1, 5) 33
(0, 1, 6) 20
(0, 1, 7) 64
(0, 1, 8) 13
(0, 1, 9) 24
(0, 1, 10) 85
(0, 1, 11) 13
etc...
I want to store these values in an array so that, given the combination, I can compute its index and get the value. For example:
>>> a = [78, 4, 64, 33]
>>> a[magic((0,1,2))]
78
What would magic be?
Initially I thought to just store it as a 3-d matrix of size 13 x 13 x 13, so I can easily index it that way. While this is fine for 13 choose 3, this would have way too much overhead for something like 13 choose 7.
I don't want to use a dict because eventually this code will be in C, and an array would be much more efficient anyway.
UPDATE: I also have a similar problem, but using combinations with repetitions, so any answers on how to get the rank of those would be much appreciated =).
UPDATE: To make it clear, I'm trying to conserve space. Each of these combinations actually indexes into something that takes up a lot of space, let's say 2 kilobytes. If I were to use a 13x13x13 array, that would be 4 megabytes, of which I only need 572 kilobytes using (13 choose 3) spots.

Here is a conceptual answer and a code based on how lex ordering works. (So I guess my answer is like that of "moron", except that I think that he has too few details and his links have too many.) I wrote a function unchoose(n,S) for you that works assuming that S is an ordered list subset of range(n). The idea: Either S contains 0 or it does not. If it does, remove 0 and compute the index for the remaining subset. If it does not, then it comes after the binomial(n-1,k-1) subsets that do contain 0.
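def binomial(n, k):
    """Binomial coefficient C(n, k); 0 outside the valid range."""
    if n < 0 or k < 0 or k > n: return 0
    b = 1
    # Integer division is exact here: b always holds C(n, i) after step i.
    for i in range(k): b = b*(n-i)//(i+1)
    return b

def unchoose(n, S):
    """Lexicographic rank of the sorted subset S of range(n)."""
    k = len(S)
    if k == 0 or k == n: return 0
    j = S[0]
    if k == 1: return j
    S = [x-1 for x in S]
    # S contained 0: drop it and rank the rest within range(n-1).
    if not j: return unchoose(n-1, S[1:])
    # Otherwise skip the binomial(n-1, k-1) subsets that contain 0.
    return binomial(n-1, k-1) + unchoose(n-1, S)

def choose(X, k):
    """All k-element sublists of the list X, in lexicographic order."""
    n = len(X)
    if k < 0 or k > n: return []
    if not k: return [[]]
    if k == n: return [X]
    return [X[:1] + S for S in choose(X[1:], k-1)] + choose(X[1:], k)

(n, k) = (13, 3)
# list(...) is needed in Python 3, where range objects cannot be concatenated.
for S in choose(list(range(n)), k): print(unchoose(n, S), S)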
Now, it is also true that you can cache or hash values of both functions, binomial and unchoose. And what's nice about this is that you can compromise between precomputing everything and precomputing nothing. For instance you can precompute only for len(S) <= 3.
You can also optimize unchoose so that it adds the binomial coefficients with a loop if S[0] > 0, instead of decrementing and using tail recursion.

You can try using the lexicographic index of the combination. Maybe this page will help: http://saliu.com/bbs/messages/348.html
This MSDN page has more details: Generating the mth Lexicographical Element of a Mathematical Combination.
NOTE: The MSDN page has been retired. If you download the documentation at the above link, you will find the article on page 10201 of the pdf that is downloaded.
To be a bit more specific:
When treated as a tuple, you can order the combinations lexicographically.
So (0,1,2) < (0,1,3) < (0,1,4) etc.
Say you had the number 0 to n-1 and chose k out of those.
Now if the first element is zero, you know that it is one among the first n-1 choose k-1.
If the first element is 1, then it is one among the next n-2 choose k-1.
This way you can recursively compute the exact position of the given combination in the lexicographic ordering and use that to map it to your number.
This works in reverse too and the MSDN page explains how to do that.
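The counting argument above can also be written as a loop. The following is a sketch of that idea in Python, not the code from the linked pages; the name `lex_rank` is illustrative and `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb


def lex_rank(n, combo):
    """Lexicographic rank of a sorted k-combination drawn from range(n)."""
    k = len(combo)
    rank = 0
    prev = -1
    for pos, c in enumerate(combo):
        # Skip every combination that has a smaller value v at this position:
        # the remaining k-1-pos elements then come from the n-1-v values above v.
        for v in range(prev + 1, c):
            rank += comb(n - 1 - v, k - 1 - pos)
        prev = c
    return rank


print(lex_rank(13, (0, 1, 2)))     # 0
print(lex_rank(13, (0, 1, 3)))     # 1
print(lex_rank(13, (10, 11, 12)))  # 285, the last of the 286 combinations

# Cross-check against the order in which itertools enumerates combinations.
assert all(lex_rank(13, c) == i
           for i, c in enumerate(combinations(range(13), 3)))
```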

Use a hash table to store the results. A decent hash function could be something like:
h(x) = (x1*p^(k - 1) + x2*p^(k - 2) + ... + xk*p^0) % pp
Where x1 ... xk are the numbers in your combination (for example (0, 1, 2) has x1 = 0, x2 = 1, x3 = 2) and p and pp are primes.
So you would store Hash[h(0, 1, 2)] = 78 and then you would retrieve it the same way.
Note: the hash table is just an array of size pp, not a dict.
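A minimal Python sketch of this polynomial hash, evaluated with Horner's rule. The primes p = 31 and pp = 1009 are arbitrary example values, not taken from the answer, and since distinct combinations can still land on the same slot, a real table would also need to store the key or handle collisions.

```python
def poly_hash(combo, p=31, pp=1009):
    """(x1*p^(k-1) + x2*p^(k-2) + ... + xk) % pp, computed by Horner's rule."""
    h = 0
    for x in combo:
        h = (h * p + x) % pp
    return h


table = [None] * 1009                # a plain array of size pp
table[poly_hash((0, 1, 2))] = 78     # store
print(poly_hash((0, 1, 2)))          # 33
print(table[poly_hash((0, 1, 2))])   # 78
```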

I would suggest a specialised hash table. The hash for a combination should be the exclusive-or of the hashes for the values. Hashes for values are basically random bit-patterns.
You could code the table to cope with collisions, but it should be fairly easy to derive a minimal perfect hash scheme - one where no two three-item combinations give the same hash value, and where the hash-size and table-size are kept to a minimum.
This is basically Zobrist hashing - think of a "move" as adding or removing one item of the combination.
EDIT
The reason to use a hash table is that the lookup performance is O(n), where n is the number of items in the combination (assuming no collisions). Calculating lexicographical indexes into the combinations is significantly slower, IIRC.
The downside is obviously the up-front work done to generate the table.
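To illustrate the XOR scheme, here is a small Python sketch. The seed, the 64-bit width and the names are assumptions for the example; it does not attempt the search for a minimal perfect hash mentioned above, it only shows the order independence and the incremental "move" update.

```python
import random

rng = random.Random(12345)  # fixed seed so the bit patterns are reproducible
ZOBRIST = [rng.getrandbits(64) for _ in range(13)]  # one pattern per value 0..12


def zobrist_hash(combo):
    """XOR of the per-value bit patterns; the order of the items is irrelevant."""
    h = 0
    for item in combo:
        h ^= ZOBRIST[item]
    return h


h = zobrist_hash((0, 1, 2))
print(h == zobrist_hash((2, 0, 1)))  # True

# A "move": remove item 2 and add item 5 with two XORs.
h_moved = h ^ ZOBRIST[2] ^ ZOBRIST[5]
print(h_moved == zobrist_hash((0, 1, 5)))  # True
```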

For now, I've reached a compromise: I have a 13x13x13 array which just maps to the index of the combination, taking up 13x13x13x2 bytes = 4 kilobytes (using short ints), plus the normal-sized (13 choose 3) * 2 kilobytes = 572 kilobytes, for a total of 576 kilobytes. Much better than 4 megabytes, and also faster than a rank calculation!
I did this partly because I couldn't seem to get Moron's answer to work. Also, this is more extensible: I have a case where I need combinations with repetitions, and I haven't found a way to compute the rank of those yet.

What you want are called combinadics. Here's my implementation of this concept, in Python:
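from math import comb

# ncombs was not defined in the original post; it is assumed here to be the
# binomial coefficient C(n, k), which is 0 when k > n.
ncombs = comb

def nthresh(k, idx):
    """Finds the largest value m such that C(m, k) <= idx."""
    mk = k
    while ncombs(mk, k) <= idx:
        mk += 1
    return mk - 1

def idx_to_set(k, idx):
    """Return the k-combination with the given index, largest element first."""
    ret = []
    for i in range(k, 0, -1):
        element = nthresh(i, idx)
        ret.append(element)
        idx -= ncombs(element, i)
    return ret

def set_to_idx(input):
    """Return the index of a combination given as any iterable of distinct ints."""
    ret = 0
    for k, ck in enumerate(sorted(input)):
        ret += ncombs(ck, k + 1)
    return ret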

I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration and it does not use very much memory. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to C++.
