Copying the hash table to a new rehashed table

I have a question about rehashing. Let us say we have a hash table of size 7, and our hash function is (key % tableSize). We insert 24, and it goes to index 3 since 24 % 7 = 3. Then, let us say we add more elements and now want to rehash. The new table will be twice the size of the initial one, i.e. the new table size will be 14. While copying the elements to the new hash table, will an element such as 24 still be at index 3, or will it be at index 24 % 14 = 10? I mean, do we use the new table size while copying the elements, or do the elements stay at their initial indexes?
Thanks

It depends on your hash function. In your case you should use key % new_table_size; otherwise the slots from index 7 onward would never be mapped to by the hash function. Those slots would only get occupied if you chose linear probing to tackle collisions (where we look for the next empty slot). Using the new size also helps reduce collisions early on; otherwise you would face lots of collisions even though the table hasn't yet reached its load factor.

An important thing about hash tables is that the order of the elements is not guaranteed; it depends on the hash function.
For your example: if you copy the data into the new table still using 7 as the hash size, indexes 7, 8, 9, 10, 11, 12 and 13 of the new array will be unused, because the hash function can't produce a result bigger than 6. These unused indexes are a bad thing because you simply don't need them, so it's better to use key % 14 instead.
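To make this concrete, here is a minimal rehashing sketch in Python (chaining with lists; the helper name rehash is illustrative, not from any library):

def rehash(buckets):
    new_size = 2 * len(buckets)               # double the table size
    new_buckets = [[] for _ in range(new_size)]
    for bucket in buckets:
        for key in bucket:
            new_buckets[key % new_size].append(key)   # hash with the NEW size
    return new_buckets

table = [[] for _ in range(7)]
table[24 % 7].append(24)    # 24 lands at index 3
table = rehash(table)
print(table[10])            # [24] -- because 24 % 14 == 10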
An interesting thing is that the internal hash table state depends not only on the hash function but can also depend on the order in which the elements were inserted. For example, imagine a hash table (implemented with an array of linked lists) X of size 4, and you insert the elements 2, 3, 6, 10 in that order:
X
{
[0] -> []
[1] -> []
[2] -> [2,6,10]
[3] -> [3]
}
The hash function is again key % size.
Now if we insert the keys in a different order - 10, 6, 3, 2 - we get:
X
{
[0] -> []
[1] -> []
[2] -> [10,6,2]
[3] -> [3]
}
I've written all these lines above just to show you that two copies of a hash table can look different internally depending on many factors. I think that was the point of your question.


Hashing with division remainder method

I don't understand this exercise.
Hash the keys: (13,17,39,27,1,20,4,40,25,9,2,37) into a hash table of size 13 using the division-remainder method.
a) find a suitable value for m.
b) handle collisions using linked lists and visualize the result in a table like this
0→
1→
2→
3→
4→
5→
6→
...
c) Handle collisions with linear probing using the sequence s(j) = j and illustrate the development in a table by starting a new column for every insert (don’t forget to copy the cells already filled to the right) and by using downward arrows to show the probing steps in case of collisions.
my attempt:
a) if the table size is 13, m also has to be 13 because of the residue classes
b) for example 0 → 39 → 13 ...
c) I have no idea
It would be really great if someone could help me solve it. :)
Let me give a brief overview of all the topics that come up here.
A hash map is a data structure that uses a hash function to map identifying values, known as keys, to their associated values. It contains key-value pairs and allows retrieving a value by its key.
Just like you can get any element of an array using its index, you can get any value in a hash map using its key.
Basically something like this happens: you are given a key (a string here), it is hashed, and the value is put at that index in an array.
In the example image, if you want the value for "Billy", you hash "Billy" again and get 03. Now you just check the value at index 3, and that's the stored value for the key "Billy".
In your case you have to hash integers, not strings.
Now, how do we hash keys?
There are several methods: you might sum the ASCII values of the characters of a string, or anything else you can think of.
Let's say you have this array: [100, 1, 3, 56, 80],
and you have to store it in a bucket of size 13.
We can't use those array values directly as indexes, because we would need both index 1 and index 100, which would force the bucket to have size 101.
But if you take the remainder of each array number with 13, the remainder is guaranteed to be from 0 to 12, so you can use a bucket of size 13 if you hash keys using the division method.
[100, 1, 3, 56, 80] remainder with 13 -> [9, 1, 3, 4, 2]
Thus you store 100's value at index 9, and so on.
Collision:
But what if the array contained both 2 and 80? Both give remainder 2 with 13. What do we store at index 2 then?
In the example image,
let's say "SACHU" also hashes to 03. Now two keys give the same index; this is called a collision, and it can be resolved using two methods:
linked-list storage (chaining): store both values at the same index using a linked list;
linear probing: in simple words, index 03 is already occupied, so we try to find the next empty index. Using the simplest probing, in the image example 06 is empty, so we store "SACHU"'s value at 06 instead of 03.
(This part is a little harder, so I highly suggest you read up on hashing and collisions.)
Now, about the sequence s(j) = j in part (c): let h(x) denote the hash of an integer x. If index h(x) is already occupied, linear probing tries the indexes (h(x) + s(1)) mod m, (h(x) + s(2)) mod m, and so on. With s(j) = j that simply means: try the next slot, then the one after, until an empty one is found.
THESE ARE THE METHODS WHICH YOU HAVE TO USE IN YOUR PROBLEM.
I'd prefer you give it a try first.
You can read more about it online, and comment if you still aren't able to solve it.
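For reference, here is a minimal sketch in Python (not the required table layout, just the mechanics; the variable names are my own) that hashes the exercise keys with m = 13 using both collision strategies from parts (b) and (c):

keys = [13, 17, 39, 27, 1, 20, 4, 40, 25, 9, 2, 37]
m = 13

# (b) chaining: each slot holds a linked list (a Python list here)
chains = [[] for _ in range(m)]
for k in keys:
    chains[k % m].append(k)

# (c) linear probing with s(j) = j: on a collision, try the next slot
table = [None] * m
for k in keys:
    j = k % m
    while table[j] is not None:
        j = (j + 1) % m    # these are the probing steps (the downward arrows)
    table[j] = k

for i in range(m):
    print(i, chains[i], table[i])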

Preallocate or change size of vector

I have a situation where I have a process which needs to "burn in". This means that I
Start with p values, p relatively small
For n > p, generate the nth value using the most recently generated p values (e.g. value p+1 is generated from values 1 to p, value p+2 from values 2 to p+1, etc.)
Repeat until n = N, where N is large
Now, only the most recently generated p values will be useful to me, so there are two ways for me to implement this. I can either
Start with a vector of p initial values. At each iteration, mutate the vector, removing the first element and appending the most recently generated value; or,
Preallocate a large array of length N, where the first p elements are the initial values. At iteration n, write the newly generated value into the nth slot
There are pros and cons to both approaches.
Pros of the first are that we only store the most relevant values. Cons of the first are that we change the length of the vector at each iteration.
Pros of the second are that we preallocate all the memory we need. Cons of the second are that we store much more than we need.
What is the best way to proceed? Does it depend on what aspect of performance I most need to care about? What will be the quickest?
Cheers in advance.
edit: approximately, p is usually in the order of low tens, N can be several thousand
The first solution has another huge con: removing the first item of an array takes O(n) time, since the remaining elements must be moved in memory. This causes the algorithm to run in quadratic time, which is not reasonable. Shifting the items as proposed by @ForceBru has the same quadratic running time (many items are moved just to add one value each time).
The second solution should be pretty fast compared to the first, but it can use a lot of memory, so it is sub-optimal (it takes time to write all those values to RAM).
A faster solution is to use a data structure called a deque. Such a data structure enables you to remove the first item in constant time and to append a new value at the end, also in constant time. That being said, it introduces some overhead to achieve that. Julia provides such a data structure (more specifically, queues).
Since the number of in-flight items is bounded in your algorithm, you can instead use a rolling buffer. Fortunately, Julia implements this too: see CircularBuffer. This solution should be quite simple and fast (the operations you need run in O(1) time on it).
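To illustrate the rolling-buffer idea (a sketch in Python for brevity; Julia's CircularBuffer behaves analogously), a deque created with maxlen=p discards the oldest value automatically on every append:

from collections import deque

p = 4
buf = deque([1, 2, 3, 4], maxlen=p)   # the p initial values
for _ in range(6):
    new_value = sum(buf)              # placeholder for your real generator
    buf.append(new_value)             # the oldest value is dropped, O(1)
print(list(buf))                      # always exactly the p most recent values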
It is probably simplest to use CircularArrays.jl for your use case:
julia> using CircularArrays
julia> c = CircularArray([1,2,3,4])
4-element CircularVector(::Vector{Int64}):
1
2
3
4
julia> for i in 5:10
           c[i] = i
           @show c
       end
c = [5, 2, 3, 4]
c = [5, 6, 3, 4]
c = [5, 6, 7, 4]
c = [5, 6, 7, 8]
c = [9, 6, 7, 8]
c = [9, 10, 7, 8]
In this way - as you can see - you can continue using an increasing index, and the array will wrap around internally as needed (discarding old values that are no longer needed).
This way you always store the last p values in the array without having to copy anything or re-allocate memory at each step.
...only the most recently generated p values will be useful to me...
Start with a vector of p initial values. At each iteration, mutate the vector, removing the first element and appending the most recently generated value.
Cons of the first are that we change the length of the vector at each iteration.
There's no need to change the length of the vector. Simply shift its elements to the left (overwriting the first element) and write the new data to the_vector[end]:
the_vector = [1,2,3,4,5,6]
function shift_and_add!(vec::AbstractVector, value)
    vec[1:end-1] .= @view vec[2:end]  # shift the elements left
    vec[end] = value                  # replace the last value
    vec
end
@assert shift_and_add!(the_vector, 80) == [2,3,4,5,6,80]
# `the_vector` will be mutated
@assert the_vector == [2,3,4,5,6,80]

C++ map indices to sorted indices

A standard problem in many languages is to sort an array and sort the indices along with it. For instance, if a = {4,1,3,2}, the sorted array is b = {1,2,3,4} and the original indices, reordered, are {1,3,2,0}. This is easy to do, for instance by sorting a vector of pairs.
What I want instead is an array c so that c[i] is the new position of element a[i] in the array b. So, in my example, c = {3,0,2,1}, because 4 moves to position 3, 1 moves to position 0, and so on.
One way is to look up each element a[i] in b (perhaps using binary search to reduce the lookup time) and then write the corresponding index into c. Is there a more efficient way?
Can you assume that you have the array of original indices moved, {1,3,2,0}? It's the only array above that you didn't assign to a variable. If so, one efficient way of solving this problem is to back-calculate c from that array.
All you need to do is march through it and put each value's own index at the position the value indicates.
So index 0 holds a 1; that means index 1 of the new array should hold a 0. Index 1 holds a 3, so index 3 of the new array gets a 1. Continuing like this, you get your goal of {3,0,2,1}.
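A minimal sketch of that back-calculation (in Python for brevity; a C++ version is a direct transliteration using std::sort and a plain loop):

a = [4, 1, 3, 2]
idx = sorted(range(len(a)), key=lambda i: a[i])   # the moved indices: [1, 3, 2, 0]
c = [0] * len(a)
for new_pos, old_pos in enumerate(idx):
    c[old_pos] = new_pos   # the element from old_pos ends up at new_pos
print(c)                   # [3, 0, 2, 1]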

Data Structure / Hash Function to link Sets of Ints to Value

Given n integer ids, I wish to link all possible sets of up to k ids to a constant value. What I'm looking for is a way to translate sets (e.g. {1, 5}, {1, 3, 5} and {1, 2, 3, 4, 5, 6, 7}) into unique values.
Guarantees:
n < 100 and k < 10 (again: set sizes will range in [1, k]).
The order of id's doesn't matter: {1, 5} == {5, 1}.
All combinations are possible, but some may be excluded.
All sets and values are constant and made only once. No deletes or inserts, no value updates.
Once generated, the only operations taking place will be look-ups.
Look-ups will be frequent and one-directional (given set, look up value).
There is no need to sort (or otherwise organize) the values.
Additionally, it would be nice (but not obligatory) if "neighboring" sets (drop one id, add one id, swap one id, etc.) were easy to reach, as well as "all sets that include at least this set".
Any ideas?
Enumerate using the product of primes.
a -> 2
b -> 3
c -> 5
d -> 7
et cetera
Now hash(ab) := 6, and hash(abc) := 30.
And a nice side effect is that, if "ab" is a subset of "abc", then:
hash(abc) % hash(ab) == 0
and
hash(abc) / hash(ab) == hash(c)
The bad news: you might run into overflow. The 100th prime is 541, and 64 bits cannot accommodate 541**10. This will not affect the functioning as a hash function; only the subset trick will fail to work. (The same method can be applied to anagrams.)
The other option is Zobrist hashing. It is analogous to the primes method, but instead of primes you use a fixed set of (random) numbers, and instead of multiplying you use XOR.
For a fixed small set like yours (it needs fewer than ~70 bits), it might be possible to tune the Zobrist tables to avoid collisions entirely (yielding a perfect hash).
The final (and simplest) way is to use a (100-bit) bitmap and treat that as the hash value (maybe after taking it modulo the table size).
And a totally unrelated method is to just build a decision tree on the bits of the bitmap (the tree would have a maximal depth of k; compare a k-d tree on bit values).
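A minimal Zobrist-style sketch in Python (the seed and table size are arbitrary choices; note that without tuning, the XOR hashes can still collide):

import random

random.seed(42)   # fixed tables: the same hashes on every run
zobrist = [random.getrandbits(64) for _ in range(100)]   # one random number per id, n < 100

def set_hash(ids):
    h = 0
    for i in ids:
        h ^= zobrist[i]   # XOR is commutative, so the order of ids is ignored
    return h

assert set_hash({1, 5}) == set_hash({5, 1})
values = {set_hash({1, 5}): "A", set_hash({1, 3, 5}): "B"}
print(values[set_hash({5, 1})])   # A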
Maybe not the best solution, but you can do the following:
Sort the set from lowest to highest with a simple IntegerComparator
Append each item of the set to a String, with a delimiter between items
So if you have {2,5,9,4}: first step -> {2,4,5,9}; second -> "2-4-5-9". (The delimiter matters: without it, {2,45} and {24,5} would both become "245".)
This way you get a unique String from a unique set. If you really need to map them to an integer value, you can hash the string after that.
A second way I can think of is to store them in a Java Set and simply use that as the key in a HashMap.
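For illustration, here are the same two ideas in Python (a sorted tuple or a frozenset is hashable and order-independent, so either can key a dict directly):

values = {}
values[tuple(sorted({2, 5, 9, 4}))] = "A"   # the sort-then-key variant
values[frozenset({1, 3, 5})] = "B"          # the set-as-key variant
print(values[(2, 4, 5, 9)])                 # A
print(values[frozenset({5, 3, 1})])         # B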
Calculate a 'diff' of each set: {1, 6, 87, 89} = {1, 5, 81, 2, 0, 0, ...} and {1, 2, 3, 4} = {1, 1, 1, 1, 0, 0, 0, 0, ...}.
Then binary-encode each number with a variable-length encoding and concatenate the bits.
It's hard to compare the sets this way (except for the first few equal bits), but because there can't be many large gaps in a set, all possible values just might fit into 64 bits (with a slack of at least 16 bits...).
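A simplified sketch of the delta idea in Python (it uses a fixed byte per delta instead of a true variable-length code, which is enough here since every id is below 100; the function name is my own):

def set_key(ids):
    # delta-encode the sorted ids, packing one byte per delta
    key = 0
    prev = 0
    for x in sorted(ids):
        key = (key << 8) | (x - prev)
        prev = x
    return key

print(hex(set_key({1, 6, 87, 89})))   # 0x1055102: the deltas 1, 5, 81, 2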

Create Ancestor Matrix from given Binary Tree

The question is: given an ancestor matrix, as a bitmap of 1s and 0s, construct the corresponding binary tree. Can anyone give me an idea of how to do it? I found a solution at Stack Overflow, but the line a[root->data][temp[i]]=1 seems wrong: there is no guarantee that the nodes will contain data 1 to n. A node may contain, say, 2000, in which case there will be no a[2000][some_column], since there are only 7 nodes, hence 7 rows and columns in the matrix.
Two ways:
Normalize your node values such that they are all from 1 to n. If you have nodes 1, 2, 5000 for example, make them 1, 2, 3. You can do this by sorting or hashing your labels and keeping something like normalized[i] = normalized value of node i. normalized can be a map / hash table if you have very large labels or even text labels.
You might be able to use a sparse matrix for this, implementable with a hash table or a set: keep a hash table of hash tables. H[x] stores another hash table that holds your y values. So where a naive matrix solution had a[2000][5000] = 1, you would call H.get(2000), which returns a hash table H' of the values stored on the 2000th row, and then H'.get(5000) returns the value you want.
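A minimal sketch of that sparse representation in Python (a dict of sets; the names are illustrative):

from collections import defaultdict

ancestor = defaultdict(set)   # H: row label -> set of column labels

def mark(x, y):
    ancestor[x].add(y)        # the sparse version of a[x][y] = 1

def is_marked(x, y):
    return y in ancestor[x]

mark(2000, 5000)
print(is_marked(2000, 5000))   # True
print(is_marked(5000, 2000))   # False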
