Hashing with division remainder method - algorithm

I don't understand this exercise.
Hash the keys: (13,17,39,27,1,20,4,40,25,9,2,37) into a hash table of size 13 using the division-remainder method.
a) find a suitable value for m.
b) handle collisions using linked lists and visualize the result in a table like this
0→
1→
2→
3→
4→
5→
6→
...
c) Handle collisions with linear probing using the sequence s(j) = j and illustrate the development in a table by starting a new column for every insert (don't forget to copy the already-filled cells to the right) and by using downward arrows to show the probing steps in case of collisions.
my attempt:
a) if the table size is 13, m also has to be 13 because of the residue classes
b) for example 0 → 39 → 13 ...
c) I have no idea
It would be really great if someone could help me solve it. :)

Let me give a brief overview of all the topics that will be used here.
A hash map is a data structure that uses a hash function to map identifying values, known as keys, to their associated values. It contains key-value pairs and allows retrieving a value by its key.
Just as you can get any element of an array by its index, you can get any value in a hash map by its key.
Basically, something like this happens: you are given a key, which is a string here; it is hashed, and we put the value at that index in an array.
In our example image, if you want the value for "Billy", we hash "Billy" again and get 03. Now we just check the value at index 3, and that's the stored value for the key "Billy".
In your case you have to hash integers, not strings.
Now, how do we hash keys?
There are several possible methods: you may sum the ASCII values of the characters of a string, or anything else you can think of.
let's say you have this array [100, 1, 3, 56, 80]
and you have to store it in a bucket of size 13.
We can't use those array values directly as indices, because we would need both index 1 and index 100, which would force the bucket to have more than 100 slots.
But if you take the remainder of each array value divided by 13, the remainder is always guaranteed to be between 0 and 12, so you can use a bucket of size 13 if you hash keys using the division method:
[100, 1, 3, 56, 80] remainder with 13 -> [9, 1, 3, 4, 2]
Thus you store 100's value at index 9, and so on.
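In Java, that mapping is just the % operator. A minimal sketch (class and variable names are only for this example):

public class DivisionRemainder {
    public static void main(String[] args) {
        int[] keys = {100, 1, 3, 56, 80};
        int m = 13; // bucket count
        for (int k : keys)
            System.out.println(k + " % " + m + " = " + (k % m)); // 9, 1, 3, 4, 2
    }
}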
Collision:
But what if the array contained both 15 and 80? 15 % 13 = 2 and 80 % 13 = 2, so both map to index 2. What do we store at index 2 now?
In our example image, let's say "SACHU" also gives 03 after hashing. Two keys gave the same index, and this is called a collision. It can be resolved using two methods:
Linked-list storage (chaining): store both values at the same index using a linked list.
Linear probing: in simple words, index 03 is already occupied, so we try to find the next empty index. Using the simplest probing, 06 is empty in our image example, so we store "SACHU"'s value at 06 instead of 03.
(this part is a little hard, so I highly suggest you read up on hashing and collisions on the internet)
Now, about the probing sequence: let h(k) denote the hash of a key k, so the first try is index h(k).
With linear probing and the sequence s(j) = j, if that slot is occupied you probe (h(k) + 1) mod m on the first collision, (h(k) + 2) mod m on the second, and so on, until an empty slot is found.
(Some textbooks subtract s(j) instead of adding it; check your course notes for the exact convention.) That is what the s(j) = j method in your exercise means.
These are the methods you have to use in your problem.
I suggest you give it a try first.
You can read more about it online, and comment if you still can't solve it.
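If it helps, here is a minimal Java sketch of parts b) and c) for the given keys. It assumes h(k) = k mod 13 and that s(j) = j means probing (h(k) + j) mod 13 (again, check your course's convention); all names are invented for illustration:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class DivisionRemainderDemo {
    static final int M = 13; // table size = m

    public static void main(String[] args) {
        int[] keys = {13, 17, 39, 27, 1, 20, 4, 40, 25, 9, 2, 37};

        // b) chaining: every slot holds a linked list of the keys mapped to it
        List<List<Integer>> chains = new ArrayList<>();
        for (int i = 0; i < M; i++) chains.add(new LinkedList<>());
        for (int k : keys) chains.get(k % M).add(k);
        for (int i = 0; i < M; i++) System.out.println(i + " -> " + chains.get(i));

        // c) linear probing: on a collision try (h(k) + j) mod M for j = 1, 2, ...
        Integer[] table = new Integer[M];
        for (int k : keys) {
            int j = 0; // probe step s(j) = j
            while (table[(k % M + j) % M] != null) j++;
            table[(k % M + j) % M] = k;
        }
        for (int i = 0; i < M; i++) System.out.println(i + " -> " + table[i]);
    }
}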

Related

Sum of Function defined on Subsets

I want to know if there are any fast approaches to solve the following problem. I have a list of codes somewhere in the thousands (A0, A1, A2, ...). There is a positive value attached to about a million distinct combinations (A0-A1, A2-A10, A1-A2-A10, ...). Let the values be denoted f(A0-A1). Note that not all combinations have a value attached.
For each listed combination, I want to calculate the sum of the values attached to every set that contains the given combination. For instance, for A2-A10,
calculate
g(A2-A10) = f(A2-A10) + f(A1-A2-A10) + ...
I would like to do this with minimal time complexity. A simpler related problem is to find all combinations where g(C) is greater than a threshold value.
Key the existing combinations with a bit map, where bit n denotes whether An is in that particular coding. Store the values in your favorite hash-map structure, keyed by the bit map. Thus, f(A0, A1, A10, A12) would be combo_val[11000000001010000...]
To sum all of the desired combinations, build a bit map of your root. For instance, with the combination above, we'd have root = 1100000000101000 (cutting off at 16 total elements for the sake of illustration).
Now simply loop through the keys of the hashmap, using root as a mask. Sum the desired values:
total = 0
for key in combo_val.keys():
    if (root & key) == root:
        total += combo_val[key]
Does that get you moving?
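As a rough Java sketch of this idea, assuming at most 64 codes so a long can hold the bitmap (the map contents below are invented for illustration):

import java.util.HashMap;
import java.util.Map;

public class SubsetSum {
    public static void main(String[] args) {
        Map<Long, Double> comboVal = new HashMap<>();
        comboVal.put(bits(0, 1), 1.5);     // f(A0-A1)
        comboVal.put(bits(2, 10), 2.0);    // f(A2-A10)
        comboVal.put(bits(1, 2, 10), 0.5); // f(A1-A2-A10)

        long root = bits(2, 10);           // query: g(A2-A10)
        double total = 0;
        for (Map.Entry<Long, Double> e : comboVal.entrySet()) {
            if ((e.getKey() & root) == root) // root is a subset of this key
                total += e.getValue();
        }
        System.out.println("g(A2-A10) = " + total); // 2.0 + 0.5
    }

    static long bits(int... codes) { // bit n set <=> code An present
        long b = 0;
        for (int c : codes) b |= 1L << c;
        return b;
    }
}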
I thought waaay too long before coming up with the following approach.
Index the million combinations, so you know which ones you want. In your example:
0: A0-A1
1: A2-A10
2: A1-A2-A10
For each code, create an ordered list of combinations that contain that code. Call that code_combs. In your example:
A0: [0]
A1: [0, 2]
A2: [1, 2]
A10: [1, 2]
Now suppose we are given a combination of codes, like A2-A10. We create two arrays, one of codes, the other of indices, with the indices initialized to 0. So:
codes = ['A2', 'A10']
indices = [0, 0]
And now do the following:
while not done:
    let max_comb = max(code_combs[codes[i]][indices[i]] for i in range(len(codes)))
    advance each index until it is at max_comb or greater
        (if we reach the end of any list, we are done)
    if all are at the same max_comb, add its value
    advance all indices by 1
        (if we reach the end of any list, we are done)
Basically this is a k-way intersection of ordered lists. Now here is the trick. If we advance naively, this will only be slightly faster, because we still have to look at every combination that contains a code. However, we can use a clever advance strategy like this:
Advance by 1, 2, 4, 8, etc until we reach or pass the point we want.
Do a binary search between the last two values until we find the point we want
(Be warned, implementing binary search is not always so easy to get right.)
And now we are crossing fingers. But if any one of our codes has few combinations that it is in, and there aren't too many codes in our combination, we can compute our intersection quite quickly.
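Here is a sketch of that intersection in Java, assuming each code's list of combination indices is a sorted int[]; advance() does the gallop-then-binary-search step, and all names are invented for illustration:

import java.util.Arrays;

public class KWayIntersect {

    // Smallest position p >= from with list[p] >= target, found by galloping
    // then binary search; returns list.length if no such position exists.
    static int advance(int[] list, int from, int target) {
        int step = 1, hi = from;
        while (hi < list.length && list[hi] < target) { hi += step; step *= 2; }
        int lo = Math.max(from, hi - step / 2);       // last position known < target
        int p = Arrays.binarySearch(list, lo, Math.min(hi, list.length), target);
        return p >= 0 ? p : -p - 1;                   // insertion point if not found
    }

    static void intersect(int[][] lists) {
        int k = lists.length;
        int[] idx = new int[k];
        outer:
        while (true) {
            int max = lists[0][idx[0]];
            for (int i = 1; i < k; i++) max = Math.max(max, lists[i][idx[i]]);
            boolean allEqual = true;
            for (int i = 0; i < k; i++) {
                idx[i] = advance(lists[i], idx[i], max);
                if (idx[i] == lists[i].length) break outer; // a list is exhausted
                if (lists[i][idx[i]] != max) allEqual = false;
            }
            if (allEqual) {
                System.out.println("common combination: " + max); // add f(max) here
                for (int i = 0; i < k; i++)
                    if (++idx[i] == lists[i].length) break outer;
            }
        }
    }

    public static void main(String[] args) {
        // code_combs for A2 and A10 from the example above
        intersect(new int[][] {{1, 2}, {1, 2}});
    }
}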

Copying the hash table to new rehashed table

I have a question about rehashing. Say we have a hash table of size 7, and our hash function is key % tableSize. We insert 24, and it goes to index 3, since 24 % 7 = 3. Then we add more elements, and now we want to rehash. The new table will be twice the size of the initial table, i.e. 14. While copying the elements to the new hash table, will an element such as 24 stay at index 3, or will it move to index 24 % 14 = 10? In other words, do we use the new table size while copying the elements, or do the elements stay at their initial indexes?
Thanks
It depends on your hashing function. In your case you should use key % size_of_table with the new size; otherwise the slots after 7 will never be mapped to by the hash function. Those slots would then only be occupied if you use linear probing to handle collisions (where we look for the next empty slot). Choosing the new size also helps reduce collisions early on; otherwise you could face lots of collisions even though the table hasn't reached its load factor yet.
Important thing about the hash tables is that the order of the elements is not guaranteed, it depends on the hash function.
For your example: if you copy the data into the new table still using 7 as the hash size, indexes 7, 8, 9, 10, 11, 12 and 13 of the new array will be unused, because your hash function can't give a result bigger than 6 even though you've allocated a bigger array. These unused indexes are wasted space, so it's better to use key % 14 instead.
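As a minimal Java sketch of that (a chained table; all names are invented for illustration):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class RehashDemo {
    static List<List<Integer>> makeTable(int size) {
        List<List<Integer>> t = new ArrayList<>();
        for (int i = 0; i < size; i++) t.add(new LinkedList<>());
        return t;
    }

    static List<List<Integer>> rehash(List<List<Integer>> old, int newSize) {
        List<List<Integer>> fresh = makeTable(newSize);
        for (List<Integer> bucket : old)
            for (int key : bucket)
                fresh.get(key % newSize).add(key); // recompute with the NEW size
        return fresh;
    }

    public static void main(String[] args) {
        List<List<Integer>> table = makeTable(7);
        table.get(24 % 7).add(24); // index 3
        table = rehash(table, 14);
        System.out.println("24 is now in bucket " + (24 % 14)); // 10
    }
}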
An interesting detail is that the internal state of a hash table depends not only on the hash function but can also depend on the order in which the elements have been inserted. For example, imagine a hash table x (implemented with an array and linked lists) of size 4, and you insert the elements 2, 3, 6, 10 in that order:
x
{
[0] -> []
[1] -> []
[2] -> [2,6,10]
[3] -> [3]
}
The hash function again is key % size.
Now if we insert the keys in a different order - 10, 6, 3, 2 - we get:
x
{
[0] -> []
[1] -> []
[2] -> [10,6,2]
[3] -> [3]
}
I've written all these lines above just to show you that two copies of a hash table can look different internally depending on many factors. I think that was the point of your question.

Data Structure / Hash Function to link Sets of Ints to Value

Given n integer id's, I wish to link all possible sets of up to k id's to a constant value. What I'm looking for is a way to translate sets (e.g. {1, 5}, {1, 3, 5} and {1, 2, 3, 4, 5, 6, 7}) to unique values.
Guarantees:
n < 100 and k < 10 (again: set sizes will range in [1, k]).
The order of id's doesn't matter: {1, 5} == {5, 1}.
All combinations are possible, but some may be excluded.
All sets and values are constant and made only once. No deletes or inserts, no value updates.
Once generated, the only operations taking place will be look-ups.
Look-ups will be frequent and one-directional (given set, look up value).
There is no need to sort (or otherwise organize) the values.
Additionally, it would be nice (but not obligatory) if "neighboring" sets (drop one id, add one id, swap one id, etc) are easy to reach, as well as "all sets that include at least this set".
Any ideas?
Enumerate using the product of primes.
a -> 2
b -> 3
c -> 5
d -> 7
et cetera
Now hash(ab) := 6, and hash(abc) := 30
And a nice side effect is that, if "ab" is a subset of "abc", then:
hash(abc) % hash(ab) == 0
and
hash(abc) / hash(ab) == hash(c)
The bad news: you might run into overflow; the 100th prime is 541, and 64 bits cannot accommodate 541**10. This will not affect the functioning as a hash function; only the subset trick will fail to work. (The same method can be applied to anagrams.)
The other option is Zobrist hashing. It is equivalent to the primes method, but instead of primes you use a fixed set of (random) numbers, and instead of multiplying you use XOR.
For a fixed small (it needs << ~70 bits) set like yours, it might be possible to tune the Zobrist tables to totally avoid collisions (yielding a perfect hash).
And the final (and simplest) way is to use a (100-bit) bitmap and treat that as the hash value (maybe after taking it modulo the table size).
And a totally unrelated method is to just build a decision tree on the bits of the bitmap (the tree would have a maximal depth of k); a related idea is a k-d tree on bit values.
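For illustration, here is a small Java sketch of the Zobrist variant. The seed and table size are arbitrary, and using the 64-bit value directly as a map key assumes the table has been tuned to be collision-free (or that you accept a tiny collision risk):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class ZobristSets {
    static final long[] TABLE = new long[100]; // one random number per id, n < 100
    static {
        Random rnd = new Random(42); // fixed seed -> reproducible table
        for (int i = 0; i < TABLE.length; i++) TABLE[i] = rnd.nextLong();
    }

    static long hash(int... ids) {
        long h = 0;
        for (int id : ids) h ^= TABLE[id]; // XOR is commutative: order can't matter
        return h;
    }

    public static void main(String[] args) {
        Map<Long, String> values = new HashMap<>();
        values.put(hash(1, 5), "value for {1, 5}");
        values.put(hash(1, 3, 5), "value for {1, 3, 5}");
        System.out.println(values.get(hash(5, 1))); // {5, 1} == {1, 5}
    }
}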
Maybe not the best solution, but you can do the following:
Sort the set from lowest to highest with a simple integer comparator.
Add each item of the set to a String, separated by a delimiter (without a delimiter, the sorted sets {2, 9, 45} and {29, 45} would both give "2945").
So if you have {2,5,9,4}: first step -> {2,4,5,9}; second -> "2-4-5-9".
This way you will get a unique String from a unique set. If you really need to map them to an integer value, you can hash the string after that.
A second way I can think of is to store them in a java.util.Set and simply use that Set as a key in a HashMap.
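A short Java sketch of that second suggestion; java.util.Set defines equals() and hashCode() to be order-independent, so it works directly as a HashMap key (Set.of requires Java 9+):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class SetAsKey {
    public static void main(String[] args) {
        Map<Set<Integer>, Integer> values = new HashMap<>();
        values.put(Set.of(1, 5), 42);
        values.put(Set.of(1, 3, 5), 7);
        System.out.println(values.get(Set.of(5, 1))); // 42; {5, 1} == {1, 5}
    }
}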
Calculate a 'diff' from each set: {1, 6, 87, 89} = {1, 5, 81, 2, 0, 0, ...}
{1, 2, 3, 4} = {1, 1, 1, 1, 0, 0, 0, 0, ...}
Then binary-encode each number with a variable-length encoding and concatenate the bits.
It's hard to compare the sets this way (except for the first few equal bits), but because there can't be many large gaps in a set, all possible values just might fit into 64 bits (with a slack of at least 16 bits...).
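As a rough Java sketch of the diff idea (the 7-bit varint used here is just one possible variable-length encoding, chosen for illustration; the answer doesn't prescribe one):

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class DiffEncoding {
    // Encode a sorted set as the gaps between consecutive ids, each gap as a varint.
    static byte[] encode(int[] sortedSet) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : sortedSet) {
            int gap = id - prev;
            prev = id;
            while (gap >= 0x80) {               // 7 payload bits per byte,
                out.write((gap & 0x7F) | 0x80); // high bit marks "more bytes follow"
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = encode(new int[] {1, 6, 87, 89}); // gaps 1, 5, 81, 2
        System.out.println(Arrays.toString(a) + " hash=" + Arrays.hashCode(a));
    }
}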

Create Ancestor Matrix from given Binary Tree

The question is, given an ancestor matrix as a bitmap of 1s and 0s, to construct the corresponding binary tree. Can anyone give me an idea on how to do it? I found a solution on Stack Overflow, but the line a[root->data][temp[i]]=1 seems wrong: there is no guarantee that the nodes will contain data 1 to n. A node may contain, say, 2000, in which case there will be no a[2000][some_column], since there are only 7 nodes, hence 7 rows and columns in the matrix.
Two ways:
Normalize your node values such that they are all from 1 to n. If you have nodes 1, 2, 5000 for example, make them 1, 2, 3. You can do this by sorting or hashing your labels and keeping something like normalized[i] = normalized value of node i. normalized can be a map / hash table if you have very large labels or even text labels.
You might be able to use a sparse matrix for this, implementable with a hash table or a set: keep a hash table of hash tables. H[x] stores another hash table that stores your y values. So if in a naive matrix solution you had a[2000][5000] = 1, you would use H.get(2000) => returns a hash table H' of values stored on the 2000th row => H'.get(5000) => returns the value you want.
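A tiny Java sketch of the first approach, assigning normalized ids in first-seen order (names invented for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

public class NormalizeLabels {
    public static void main(String[] args) {
        int[] labels = {1, 2, 5000}; // the example labels from above
        Map<Integer, Integer> normalized = new LinkedHashMap<>();
        for (int label : labels)
            normalized.putIfAbsent(label, normalized.size() + 1); // next free id
        System.out.println(normalized); // {1=1, 2=2, 5000=3}
    }
}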

Good hash function for permutations?

I have numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection is done quite often, and I need to check whether a permutation of the chosen numbers has already been selected.
E.g. one step selects [1, 10, 3, 18] and another one [10, 18, 3, 1]; then the second selection can be discarded because it is a permutation of the first.
I need to do this check very fast. Right now I put all arrays in a hashmap, and use a custom hash function: just sums up all the elements, so 1+10+3+18=32, and also 10+18+3+1=32. For equals I use a bitset to quickly check if elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
This works ok, but can generate lots of collisions, so the equals() method is called quite often. I was wondering if there is a faster way to check for permutations?
Are there any good hash functions for permutations?
UPDATE
I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, with array lengths 1 to 9. There are 3003 possible combinations, and a good hash should generate close to this many different hashes (I use 32-bit numbers for the hash):
41 different hashes for just adding (so there are lots of collisions)
8 different hashes for XOR'ing values together
286 different hashes for multiplying
3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)
So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
PS: I do not want to sort the values when I do not have to, because this would get too slow.
One potential candidate might be this.
Fix an odd integer R.
For each element e you want to hash compute the factor (R + 2*e).
Then compute the product of all these factors.
Finally divide the product by 2 to get the hash.
The factor 2 in (R + 2e) guarantees that all factors are odd, so the product can never become 0. The division by 2 at the end is there because the product is always odd, so the division just removes a constant bit.
E.g. I choose R = 1779033703. This is an arbitrary choice, doing some experiments should show if a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints, i.e. modulo 2^32) is
(R + 2) * (R + 20) * (R + 6) * (R + 36) = 108059677
Hence the hash would be
108059677 / 2 = 54029838.
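In Java this could look like the following sketch; long arithmetic masked to the low 32 bits makes the overflow behaviour explicit (it prints the hashes rather than asserting values, since the exact number depends on your language's overflow semantics):

public class ProductHash {
    static final long R = 1779033703L;
    static final long MASK = 0xFFFFFFFFL; // keep only the low 32 bits

    static long hash(int[] values) {
        long product = 1;
        for (int e : values)
            product = product * ((R + 2L * e) & MASK) & MASK; // all factors odd
        return product >>> 1; // product is odd; drop the constant low bit
    }

    public static void main(String[] args) {
        System.out.println(hash(new int[] {1, 10, 3, 18}));
        System.out.println(hash(new int[] {10, 18, 3, 1})); // same: order-independent
    }
}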
Summing the elements is already one of the simplest things you could do, but I don't think it's a particularly good hash function with respect to pseudo-randomness.
If you sort your arrays before storing them or computing hashes, every good hash function will do.
If it's about speed: have you measured where the bottleneck is? If your hash function gives you a lot of collisions and you have to spend most of the time comparing the arrays bit by bit, the hash function is obviously not good at what it's supposed to do. Sorting + a better hash might be the solution.
If I understand your question correctly you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to make a call to a brute-force set comparison) you'll be able to compare such sets by checking whether their Bloom filter hash is equal.
The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.
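A minimal Java sketch of that idea; the two position mixers below are made up for illustration and are not from a real Bloom-filter library:

import java.util.BitSet;

public class BloomSignature {
    // Each element sets a few bit positions; the set's signature is the OR of
    // all element signatures. OR is commutative, so permutations agree.
    static BitSet signature(int[] elements) {
        BitSet sig = new BitSet(256);
        for (int e : elements) {
            sig.set((e * 31 + 7) % 256);  // first hash position
            sig.set((e * 131 + 3) % 256); // second hash position
        }
        return sig;
    }

    public static void main(String[] args) {
        BitSet a = signature(new int[] {1, 10, 3, 18});
        BitSet b = signature(new int[] {10, 18, 3, 1});
        System.out.println(a.equals(b)); // true; equal signatures still need a
                                         // brute-force check (false positives)
    }
}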
Depending on whether you get a lot of collisions (the same hash but not a permutation), you might presort the arrays while hashing them. In that case you can do a more aggressive kind of hashing, where you don't only add up the numbers but mix in some bit magic as well to get quite different hashes.
This is only beneficial if you get loads of unwanted collisions because the hash you are doing now is too poor. If you hardly get any collisions, the method you are using seems fine.
I would suggest this:
1. Check if the lengths of the permutations are the same (if not, they are not equal).
2. Sort only one array. Instead of sorting the other array, iterate through its elements and search for the presence of each of them in the sorted array (compare only while the elements in the sorted array are smaller; do not iterate through the whole array).
Note: if your permutations can contain repeated numbers (e.g. [1, 2, 2, 10]), you will need to remove an element from the sorted array whenever it matches one from the first array.
pseudo-code:
if length(arr1) <> length(arr2) return false;
sort(arr2);
for i=1 to length(arr1) {
    elem = arr1[i];
    j = 1;
    while (j <= length(arr2) and arr2[j] < elem) j = j + 1;
    if (j > length(arr2) or elem <> arr2[j]) return false;
}
return true;
The idea is that instead of sorting the other array, we can just try to match all of its elements against the sorted array.
You can probably reduce the collisions a lot by using the product as well as the sum of the terms.
1*10*3*18=540 and 10*18*3*1=540
so the sum-product hash would be [32,540]
you still need to do something about collisions when they do happen though
I like using String's default hash code (Java, C#; not sure about other languages); it generates pretty unique hash codes.
So first sort the array, then generate a unique string using some delimiter.
You can do the following (Java):
int[] arr = selectRandomNumbers();
Arrays.sort(arr);
int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();
if performance is an issue, you can replace the inefficient string concatenation with StringBuilder or String.format:
String.format("%d,%d,%d,%d", arr[0], arr[1], arr[2], arr[3]);
A String hash code of course doesn't guarantee that two distinct strings have different hashes, but with this formatting, collisions should be extremely rare.
