Hash Function in Separate Chaining Vs. Open Addressing

I'm reading Weiss's Data Structures book, and I'm confused with the difference between hash function in Separate Chaining Vs. hash function in Open Addressing.
In separate chaining, the hash function is defined as:
hash(x) = x mod tableSize
whereas in open addressing:
h_i(x) = (hash(x) + f(i)) mod tableSize
where i is the number of the current trial and f(i) is a function such as f(i) = i for Linear Probing, f(i) = i^2 for Quadratic Probing, etc.
I have 2 questions:
1) In Separate Chaining, does it make sense to have a hash function:
hash(x) = x mod 10
when the table size equals, let's say, 11?
2) In Open Addressing, do we always have to take the result mod tableSize twice, once inside hash(x) and once over the whole expression?

1) Not really. It will be correct, but not efficient. If you mod by less than the table size, there will be at least one bucket unused at the top of your table. If there is a specific reason to choose that value to mod by (there might be, if you're looking for certain properties) then you could just trim the table to that size and avoid the waste.
2) That isn't really necessary: ((a mod c) + b) mod c equals (a + b) mod c, so the inner mod is redundant. And that isn't the only definition in the first place. Slightly more generally you have h_i(x) = f(x, i) mod tableSize; some obvious choices for f include
f(x, i) = x + i (linear probing)
f(x, i) = x + a * i + b * i * i for some constants a and b != 0 (quadratic probing)
f(x, i) = h1(x) + i * h2(x) for some suitable hash functions h1 and h2 (double hashing)
That last one is especially susceptible to overflow, which could mess up some properties, so you might want to perform some calculations modulo the table size (especially if that's a prime number, because then you have a nice field to work in).
Also, you're always going to use f(x, i) mod tableSize before you need f(x, i + 1), so you might as well calculate f incrementally, taking the result mod tableSize at every step, since you have to do that anyway.
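For instance, here is a minimal Java sketch of incremental quadratic probing with f(x, i) = x + i^2, using the identity (i + 1)^2 = i^2 + 2i + 1 so each probe costs one addition and one mod (EMPTY is a hypothetical sentinel for an unused slot):
static final int EMPTY = Integer.MIN_VALUE; // hypothetical "slot unused" sentinel

static int findSlot(int[] table, int x) {
    int m = table.length;
    int index = Math.floorMod(x, m); // h_0(x) = x mod tableSize
    int step = 1;                    // (i + 1)^2 - i^2 = 2i + 1, starting at i = 0
    while (table[index] != EMPTY && table[index] != x) {
        index = (index + step) % m;  // advance from h_i(x) to h_{i+1}(x)
        step += 2;                   // next odd increment
    }
    return index; // note: a sketch only; loops forever if the table is full
}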
But we're certainly not limited to those forms of f or indeed to this scheme of open addressing where we search for an open spot. Cuckoo hashing (and variants) has two candidate places to insert an item, and will kick out an item and move it to its alt-location (possibly also displacing an item) if both places are full (with some care taken to avoid infinite loops). That way a lookup only has two places to look at, instead of potentially the entire table. It has many variants.
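A rough sketch of that insertion loop in Java, assuming two hash functions h1 and h2 (both arbitrary choices here) and a bounded number of kicks before giving up; a real implementation would rehash or grow the table on failure:
static int h1(int x, int m) { return Math.floorMod(x, m); }
static int h2(int x, int m) { return Math.floorMod(x * 0x9e3779b9, m); } // arbitrary second hash

static boolean cuckooInsert(Integer[] table, int key, int maxKicks) {
    int item = key;
    int index = h1(item, table.length);
    for (int kicks = 0; kicks < maxKicks; kicks++) {
        if (table[index] == null) { table[index] = item; return true; }
        int evicted = table[index];   // kick out the occupant...
        table[index] = item;
        item = evicted;               // ...and move it to its alternate location
        int a = h1(item, table.length), b = h2(item, table.length);
        index = (index == a) ? b : a;
    }
    return false; // gave up; caller should rehash or grow the table
}
A lookup then only has to check table[h1(key, m)] and table[h2(key, m)].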

Related

Generate random non repeating pairs of numbers within 2 ranges

I want to create random pairs of numbers within 2 ranges.
So for example if I want 3 random pairs of numbers where 10 < n1 < 20 and 30 < n2 < 50, then an acceptable output would be this: [[11,35],[15,31],[15,42]] but not [[11,35],[11,35],[12,39]]
I would like an efficient (both computationally and memory wise) algorithm to do this. The language doesn't really matter because I can adapt it later (although Python would be preferred).
So far the best idea I have had is to create a dictionary keyed by every possible n1 value, where each value is the list of n2 values already used with that n1. Then I can just pick a random n1 and find an n2 which hasn't been used with it yet.
This isn't very efficient space-wise though, and I'm hoping for something better. It also seems computationally inefficient to repeatedly search for an n2 not already in the used set.
I could also do the opposite and have the dictionary populated with all the numbers not yet used, and just pop a random number off the list. But this would use much more space.
Is there any efficient way to do this? Is this a common problem?
Edit: It would be good if this could easily be expanded to more dimensions (so sets of N numbers). But this isn't really needed yet.
An integer pair (x, y) in [min_x, min_x + s) X [min_y, min_y + t), where s and t are the sizes of the two ranges, can be mapped to an integer m in the 1D space [min_x * t, (min_x + s) * t) by calculating m = x * t + y - min_y. The inverse mapping from m to (x, y) is (m // t, min_y + m % t) in Python.
Therefore the problem is transformed into choosing multiple values from [min_x * t, (min_x + s) * t) without replacement (i.e. no duplicates in the returned sequence). This can be done by simply calling Python's random.sample function. According to the docs, the underlying implementation is space efficient for sequence inputs, so the entire problem can be solved in Python as follows:
from random import sample

# Example bounds matching the question; max_x and max_y are exclusive while min_x and min_y are inclusive
min_x, max_x = 11, 20
min_y, max_y = 31, 50
t = max_y - min_y
sampled_pairs = [(m // t, min_y + m % t) for m in sample(range(min_x * t, max_x * t), k=3)]

Implementing the square root method through successive approximation

Determining the square root through successive approximation is implemented using the following algorithm:
Begin by guessing that the square root is x / 2. Call that guess g.
The actual square root must lie between g and x/g. At each step in the successive approximation, generate a new guess by averaging g and x/g.
Repeat step 2 until the values of g and x/g are as close together as the precision of the hardware allows. In Java, the best way to check for this condition is to test whether the average is equal to either of the values used to generate it.
What really confuses me is the last statement of step 3. I interpreted it as follows:
private double sqrt(double x) {
    double g = x / 2;
    while (true) {
        double average = (g + x / g) / 2;
        if (average == g || average == x / g) break;
        g = average;
    }
    return g;
}
This seems to just cause an infinite loop. I am following the algorithm exactly: if the average equals either g or x/g (the two values used to generate it), then we have our answer, don't we?
Why would anyone ever use that approach, when they could simply use the identities (2n)^2 = 4n^2 and (n + 1)^2 = n^2 + 2n + 1 to populate each bit in the mantissa, and divide the exponent by two, multiplying the mantissa by two iff the exponent mod two equals 1?
To check whether g and x/g are as close as the hardware allows, look at the relative difference and compare it with the epsilon of your floating point format. If it is within a small integer multiple of epsilon, you are OK.
For the relative difference of x and y, see https://en.wikipedia.org/wiki/Relative_change_and_difference
The epsilon for 32-bit IEEE floats is about 1.0e-7, as in one of the other answers here, but that answer used the absolute rather than the relative difference.
In practice, that means something like:
Math.abs(g-x/g)/Math.max(Math.abs(g),Math.abs(x/g)) < 3.0e-7
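Applied to the code from the question, a sketch (assuming x > 0, and using a threshold of a few multiples of the double-precision epsilon of about 2.2e-16 rather than the float value above) looks like this:
// Same iteration as in the question, but terminating on a small relative
// difference instead of exact equality; assumes x > 0.
private double sqrt(double x) {
    double g = x / 2;
    while (true) {
        double q = x / g;
        double rel = Math.abs(g - q) / Math.max(Math.abs(g), Math.abs(q));
        if (rel < 1e-15) break; // a few multiples of the double epsilon
        g = (g + q) / 2;
    }
    return g;
}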
Never compare floating point values for equality. The result is not reliable.
Use an epsilon like so:
if(Math.abs(average-g) < 1e-7 || Math.abs(average-x/g) < 1e-7)
You can change the epsilon value to whatever you need. Best is probably something scaled to the original x.

Behaviour of my own Haskell function: sometimes stops producing (easy to produce) results

I wrote a Haskell function to produce the prime factorizations of all numbers up to a certain threshold that are made up of a given set of prime factors. A minimal working example can be found here:
http://lpaste.net/117263
The problem: it works very well for "threshould <= 10^9" on my computer, but beginning with "threshould = 10^10" the method doesn't produce any results; I never even see the first list element on my screen. The critical function is "exponentSets". For every prime in the list 'factors', it computes the possible exponents (with respect to already chosen exponents for other primes). Further comments are in the code. If 10^10 works fine on your machine, try a higher exponent (10^11, ...).
My question: what is responsible for that? How can I improve the function "exponentSets"? (I'm still not very experienced in Haskell, so someone more experienced might have an idea.)
Even though you are using 64-bit integers, you still do not have enough capacity to store a temporary integer which is created in intLog:
intLog base num =
  let searchExtend lower@(e, n) =
        let upper@(e', n') = (2 * e, n ^ 2) -- this line is what causes the problems
        -- some code
  in (some if) searchExtend (1, base)
rawLists is defined like this:
rawLists = recCall 1 threshould
Which in turn sets remaining_threshould in recCall to
threshould `quot` 1 -- same as threshould
Now intLog gets called by recCall like this:
intLog p remaining_threshould
which is the same as
intLog p threshould
Now comes the interesting part: since the base p is smaller than num, your threshold, you call searchExtend (1, base), which then in turn does this:
searchExtend (e, n) =
  let (e', n') = (2 * e, n ^ 2)
Since n is remaining_threshould, which is the same as threshould, you essentially square 2^32 + 1 and store this in an Int, which overflows and causes rawLists to give bogus results.
(2 ^ 32 + 1) ^ 2 :: Int is 8589934593
(2 ^ 32 + 1) ^ 2 :: Integer is 18446744082299486209

Hashing sets of integers

I'm looking for a hash function over sets H(.) and a relation R(.,.) such that if A is included in B then R(H(A), H(B)). Of course, R(.,.) must be easy to verify (constant time), and H(A) should be computed in linear time.
One example of H and R is:
H(A) = OR over 1 << (h(x) % k), for x in A, k a fixed integer and h(x) a hash function over integers.
R(H(A), H(B)) = ((H(A) & H(B)) == H(A))
Are there any other good examples? (Good is hard to define, but intuitively: if R(H(A), H(B)) then with high probability A is included in B.)
After thinking about this, I ended up with the example you gave. I.e. each element in B sets a bit in the hash, and A is only contained in B if each bit which is set in H(A) is also set in H(B).
Maybe a Bloom filter is applicable in your case. It seems to use the same bit trick, but with multiple hash functions.
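For reference, a small Java sketch of the H and R from the question with k = 64 bits packed into a long; h is an arbitrary integer hash chosen for the example:
static int h(int x) { return x * 0x9e3779b9; } // arbitrary integer hash for the example

static long H(int[] set) {
    long mask = 0L;
    for (int x : set)
        mask |= 1L << Math.floorMod(h(x), 64); // one bit per element, as in the question
    return mask;
}

static boolean R(long hA, long hB) {
    return (hA & hB) == hA; // every bit set in H(A) is also set in H(B)
}
As with a Bloom filter, R can answer "included" spuriously when unrelated elements happen to map to the same bits, so a positive result should be verified against the actual sets if false positives matter.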

Lists Hash function

I'm trying to make a hash function so I can tell if two lists of the same size contain the same elements.
For example this is what I want:
f((1 2 3)) = f((1 3 2)) = f((2 1 3)) = f((2 3 1)) = f((3 1 2)) = f((3 2 1)).
Any idea how I can approach this problem? I've tried the sum of the squares of all elements, but it turned out that there are collisions, for example f((2 2 5)) = 33 = f((1 4 4)), which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.
Sort the list and then:
hash = 0
MAX_HASH_VALUE = 2**31 - 1  # any sufficiently large prime will do
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options, from simple things like your suggestion (sum the squares of the elements) or xor-ing the elements, to complicated things like sorting them, printing them to a string, and computing MD5 on that. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).
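As a sketch of the "sort, print, MD5" option in Java (MessageDigest is part of the standard library; representing the sorted list via Arrays.toString is an arbitrary choice):
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

static byte[] listHash(int[] values) throws NoSuchAlgorithmException {
    int[] sorted = values.clone();
    Arrays.sort(sorted); // order-insensitive: equal multisets produce equal strings
    byte[] bytes = Arrays.toString(sorted).getBytes(java.nio.charset.StandardCharsets.UTF_8);
    return MessageDigest.getInstance("MD5").digest(bytes);
}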
So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write a hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.
hash(i)=i*2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of keys to hash results with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary mostly in the upper bits: as is true of all multiplications, variations in the upper digits do not influence the lower digits of the product.
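In Java this is a single multiplication, since int arithmetic already wraps modulo 2^32 (a sketch, not code from the article):
static int knuthHash(int key) {
    return key * 0x9E3779B1; // 2654435761; int overflow gives "mod 2^32" for free
}
For a table of 2^b slots you would then keep the upper b bits, e.g. knuthHash(key) >>> (32 - b), since as noted above the lower bits of the product are the least mixed.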
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
    a=a-b; a=a-c; a=a^(c >>> 13);
    b=b-c; b=b-a; b=b^(a << 8);
    c=c-a; c=c-b; c=c^(b >>> 13);
    a=a-b; a=a-c; a=a^(c >>> 12);
    b=b-c; b=b-a; b=b^(a << 16);
    c=c-a; c=c-b; c=c^(b >>> 5);
    a=a-b; a=a-c; a=a^(c >>> 3);
    b=b-c; b=b-a; b=b^(a << 10);
    c=c-a; c=c-b; c=c^(b >>> 15);
    return c;
}
You can read details from here
If all the elements are numbers and they have a maximum, this is not too complicated: sort the elements, then put them together one after the other as digits in base maximum+1.
This is hard to describe in words, so here are some examples.
For example, if your maximum is 9 (that makes it easy to understand), you'd have:
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum was 99, you'd have:
f(16 2 76 8) = (0)2081676
In your example with 2, 2 and 5, if you know you will never get anything higher than 5, you can compose the result in base 6, so that would be:
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
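A sketch of this scheme in Java, assuming non-negative elements bounded by max; note the result overflows a long quickly once the list gets long or the base large:
static long positionalHash(int[] values, int max) {
    int[] sorted = values.clone();
    java.util.Arrays.sort(sorted);
    long base = max + 1L, result = 0;
    for (int v : sorted)
        result = result * base + v; // append v as the next digit in base (max + 1)
    return result;
}
For example, positionalHash(new int[]{2, 2, 5}, 5) returns 89, matching the base-6 computation above.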
Combining hash values is hard. I've found this approach (no explanation given, though perhaps someone will recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast, since only shifts, additions and xors take place (apart from the actual hashing).
However, the requirement that the order of the list must not influence the end result means you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible without more stringent bounds to provide a collision-free hash function, you'll still have to actually compare the sorted lists whenever the hashes are equal...
I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem applicable to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the usual case, it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as the exponents of successive primes in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is your hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating these as the exponents of the primes 2, 3, 5 gives
2^1 * 3^2 * 5^3
which evaluates to
2 * 9 * 125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.
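A sketch of this idea in Java, with a fixed prime table and a modulus to bound the result (both arbitrary choices; elements are assumed to be small non-negative exponents, and the list no longer than the prime table):
static long primePowerHash(int[] values, long mod) {
    final long[] primes = {2, 3, 5, 7, 11, 13, 17, 19}; // enough primes for short lists
    int[] sorted = values.clone();
    java.util.Arrays.sort(sorted);
    long result = 1;
    for (int i = 0; i < sorted.length; i++)
        for (int e = 0; e < sorted[i]; e++)
            result = (result * primes[i]) % mod; // primes[i] ^ sorted[i], reduced as we go
    return result;
}
With these choices, primePowerHash(new int[]{2, 1, 3}, 1_000_000_007L) returns 2250, the value in the worked example.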
A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.
