Possible duplicate: Why should hash functions use a prime number modulus?
Why is it necessary for a hash table's (the data structure's) size to be a prime?
From what I understand, it ensures a more even distribution, but is there any other reason?
The only reason is to avoid clustering of values into a small number of buckets (yes, distribution). A more evenly distributed hash table will perform more consistently.
from http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html
Suppose your hashCode function produces hash codes such as {x, 2x, 3x, 4x, 5x, 6x, ...}. Then all of these will be clustered into just m buckets, where m = table_length / GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this; a short sketch follows the two options below.) To avoid this clustering, you can do one of the following:
Make sure that you don't generate too many hash codes that are multiples of another hash code, as in {x, 2x, 3x, 4x, 5x, 6x, ...}. But this may be difficult if your hash table is supposed to have millions of entries.
Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, make sure that table_length is a prime number.
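A small sketch of that clustering (the numbers are illustrative, not from the post):

from math import gcd

table_length = 12
x = 4                                          # every hash code is a multiple of x
buckets_hit = {(k * x) % table_length for k in range(1, 1000)}
print(len(buckets_hit))                        # 3
print(table_length // gcd(table_length, x))    # 3 == the m from the formula above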
Update: (from original answer author)
This answer is correct for common implementations of a hash table, including Java's original Hashtable as well as the current implementation of .NET's Dictionary.
Both the answer and the supposition that the capacity should be prime are not accurate for Java's HashMap, though. Its implementation is very different: it uses a power-of-two table size to store the buckets and computes the bucket index as (n - 1) & hash rather than the more traditional hash % n.
Java's HashMap will force the actual capacity used to be the next power of two at or above the requested capacity.
Compare Hashtable:
int index = (hash & 0x7FFFFFFF) % tab.length
https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/util/Hashtable.java#L364
To HashMap:
first = tab[(n - 1) & hash]
https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/util/HashMap.java#L569
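As a rough illustration of the two quoted formulas (the values are made up; Python is used here only to mimic Java's integer arithmetic):

hash_code = -123456789    # Java hash codes are signed 32-bit ints and may be negative
n = 16                    # HashMap table sizes are always powers of two

print((hash_code & 0x7FFFFFFF) % n)   # Hashtable style: strip the sign bit, then take the remainder -> 11
print((n - 1) & hash_code)            # HashMap style: mask the low bits -> 11; same bucket only because n is a power of two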
The problem
We have a set of symbol sequences, which should be mapped to a pre-defined number of bucket-indexes.
Prerequisites
The symbol sequences are restricted in length (64 characters/bytes), and the hash algorithm used is the Delphi implementation of the Bob Jenkins hash, producing a 32-bit hash value.
To further distribute these hash values over a certain number of buckets, we use the formula:
bucket_number := (hashvalue mod (num_buckets - 2)) + 2;
(We don't want {0,1} to be in the result set)
The question
A colleague raised the concern that we need to choose a prime number for num_buckets to achieve an optimal[1] distribution when mapping the symbol sequences to bucket numbers.
The majority of the team believes this is an unproven assumption, though our teammate simply claimed it is mathematically intrinsic (without a more in-depth explanation).
I can imagine that certain symbol sequence patterns we use (just a very limited subset of what's actually allowed) may favour certain hash values, but generally I don't believe that is really significant for a large number of symbol sequences.
The hash algorithm should already distribute the hash values optimally, and I doubt that a prime modulus would really make a significant difference (I couldn't measure that empirically either), especially since the Bob Jenkins hash calculation doesn't involve any prime numbers either, as far as I can see.
[TL;DR]
Does a prime number mod divisor matter for this case, or not?
[1] Optimal simply means a stable average value of number-of-sequences per bucket, which doesn't change (much) with the total number of sequences.
Your colleague is simply wrong.
If a hash works well, all hash values should be equally likely, with a relationship that is not obvious from the input data.
When you take the hash mod some value, you are then mapping equally likely hash inputs to a reduced number of output buckets. The result is now not evenly distributed to the extent that outputs can be produced by different numbers of inputs. As long as the number of buckets is small relative to the range of hash values, this discrepancy is small. It is on the order of # of buckets / # of hash values. Since the number of buckets is typically under 10^6 and the number of hash values is more than 10^19, this is very small indeed. But if the number of buckets divides the range of hash values, there is no discrepancy.
Primality doesn't enter into it except from the point that you get the best distribution when the number of buckets divides the range of the hash function. Since the range of the hash function is usually a power of 2, a prime number of buckets is unlikely to do anything for you.
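As a small sketch of that discrepancy, shrinking the hash range to 16 bits so the counts can be enumerated exactly (the argument is the same for 2^32 or 2^64):

HASH_RANGE = 2 ** 16
buckets = 1000                       # does not divide 2^16
counts = [0] * buckets
for h in range(HASH_RANGE):
    counts[h % buckets] += 1
print(min(counts), max(counts))      # 65 66: some buckets receive one extra hash value,
                                     # a relative imbalance on the order of buckets / hash range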
Imagine you have N distinct people and a record of where these people are: exactly M such records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, obviously, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(A, B) every time A co-occurs with B in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
from collections import Counter
from itertools import combinations

count = Counter()    # (i, j) -> number of rows in which i and j co-occur
answer = []
for S in sets:       # each S is one row, as a set of person ids
    for (i, j) in combinations(sorted(S), 2):
        count[(i, j)] += 1
        if count[(i, j)] == threshold:     # report each pair exactly once
            answer.append((i, j))
If you have M sets of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
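As a quick check, running the loop above on the three rows from the question with sets = [{1, 50, 299}, {1, 2, 3, 4, 5, 50, 287}, {1, 50, 299}] and threshold = 3 leaves answer == [(1, 50)], since persons 1 and 50 appear together in all three rows while every other pair co-occurs at most twice.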
The relevant concept for your case is market basket analysis. In this context there are different algorithms; for example, the Apriori algorithm can be used for your case, specialized to item sets of size 2.
Moreover, for finding association rules with specific support and conditions (which in your case is the threshold value), LSH and MinHash can be used as well.
You could use probability to speed it up, e.g. only count each pair with 1/50 probability. That will give you roughly a 50x speed-up. Then double-check any pairs whose count gets close enough to 1/50th of M.
To double-check those pairs, you can either go through the whole list again, or you can do it more efficiently with some clever kind of reverse indexing as you go. E.g. encode each person's row indices into 64-bit integers; you can then use binary-search / merge-sort style techniques to decide which 64-bit integers to compare, and use bit operations to compare them for matches. Other things to look up are reverse indexing and binary indexed (Fenwick) trees.
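A possible sketch of the sampling pass (assuming the rows are available as sets of person ids; the rate, the slack factor, and the interpretation of the cutoff as the scaled-down threshold are my choices, not the answer's):

import random
from collections import Counter
from itertools import combinations

def likely_pairs(rows, threshold, rate=1/50, slack=0.5):
    sampled = Counter()
    for row in rows:
        for pair in combinations(sorted(row), 2):
            if random.random() < rate:       # count each co-occurrence with probability `rate`
                sampled[pair] += 1
    # pairs whose sampled count comes close enough to rate * threshold become
    # candidates for an exact recount in a second pass
    return [p for p, c in sampled.items() if c >= rate * threshold * slack]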
What I know:
1. Hashtable size depends on the load factor.
2. It must be a prime number, and that prime number is used as the modulo value in the hash function.
3. The prime number must not be too close to a power of 2 or a power of 10.
The doubt I am having:
Does the size of the hashtable depend on the length of the key?
The following paragraph is from the book Introduction to Algorithms by Cormen.
Does n = 2000 mean the length of a string or the number of elements that will be stored in the hash table?
Good values for m are primes not too close to exact powers of 2. For example, suppose we wish to allocate a hash table, with collisions resolved by chaining, to hold roughly n = 2000 character strings, where a character has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful search, so we allocate a hash table of size m = 701. The number 701 is chosen because it is a prime near 2000/3 but not near any power of 2. Treating each key k as an integer, our hash function would be
h(k) = k mod 701.
Can somebody explain it?
Here's a general overview of the tradeoff with hash tables.
Suppose you have a hash table with m buckets with chains storing a total of n objects.
If you store only references to objects, the total memory consumed is O (m + n).
Now, suppose that, for an average object, its size is s, it takes O (s) time to compute its hash once, and O (s) to compare two such objects.
Consider an operation checking whether an object is present in the hash table.
The bucket will have n / m elements on average, so the operation will take O (s n / m) time.
So, the tradeoff is this: when you increase the number of buckets m, you increase memory consumption but decrease average time for a single operation.
For the original question - does the size of the hashtable depend on the length of the key? - No, it should not, at least not directly.
The paragraph you cite only mentions the strings as an example of an object to store in a hash table.
One mentioned property is that they are 8-bit character strings.
The other is that "We don't mind examining an average of 3 elements in an unsuccessful search".
And that wraps the properties of the stored object into the form: how many elements on average do we want to place in a single bucket?
The length of strings themselves is not mentioned anywhere.
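As a quick sketch of the arithmetic in the quoted paragraph (n and the target load come from the CLRS example; the variable names are mine):

n = 2000            # number of strings to store (from the quoted example)
target_load = 3     # acceptable average number of elements examined per unsuccessful search
print(n / target_load)   # about 667; 701 is a prime near this value and not near 512 or 1024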
(2) and (3) are false. It is common for a hash table to have 2^n buckets, as long as you use the right hash function. On (1), the memory a hash table takes equals the number of buckets times the length of a key. Note that for string keys we usually keep pointers to the strings, not the strings themselves, so the length of a key is the length of a pointer, which is 8 bytes on 64-bit machines.
Algorithm-wise, no!
The length of the key is irrelevant here.
Moreover, the key itself is not important, what's important is the number of different keys you predict you'll have.
Implementation-wise, yes! Since you must store the key itself in your hashtable, the key's size is reflected in the table's memory footprint.
For your second question, 'n' means the number of different keys to hold.
Given a bit array of fixed length and the number of 0s and 1s it contains, how can I arrange all possible combinations such that returning the i-th combinations takes the least possible time?
It is not important the order in which they are returned.
Here is an example:
array length = 6
number of 0s = 4
number of 1s = 2
possible combinations (6! / 4! / 2!)
000011 000101 000110 001001 001010
001100 010001 010010 010100 011000
100001 100010 100100 101000 110000
problem
1st combination = 000011
5th combination = 001010
9th combination = 010100
With a different arrangement such as
100001 100010 100100 101000 110000
001100 010001 010010 010100 011000
000011 000101 000110 001001 001010
it shall return
1st combination = 100001
5th combination = 110000
9th combination = 010100
Currently I am using an O(n) algorithm which tests each bit for whether it is a 1 or a 0. The problem is that I need to handle lots of very long arrays (on the order of 10000 bits), so it is still very slow (and caching is out of the question). I would like to know whether you think a faster algorithm may exist.
Thank you
I'm not sure I understand the problem, but if you only want the i-th combination without generating the others, here is a possible algorithm:
There are C(M,N) = M!/(N!(M-N)!) combinations of N bits set to 1 whose highest set bit is at position at most M.
You want the i-th: you iteratively increment M until C(M,N)>=i
while( C(M,N) < i ) M = M + 1
That will tell you the highest bit that is set.
Of course, you compute these coefficients iteratively with
C(M+1,N) = C(M,N)*(M+1)/(M+1-N)
Once found, you are left with the problem of finding the (i - C(M-1,N))-th combination of N-1 bits, so you can recurse on N...
Here is a possible variant, maintaining D = C(M,N) - C(M-1,N) and decrementing i by one so the rank starts at zero, written here as a small Python function:
def unrank_combination(i, N):
    # Returns the i-th (1-based) combination of N set bits, in colexicographic order.
    SOL = 0
    i = i - 1                       # make the rank zero-based
    while N > 0:
        M = N                       # candidate (1-based) position of the highest set bit
        C = 1                       # C(M, N)
        D = 1                       # C(M, N) - C(M-1, N): combinations whose highest bit is exactly M
        while i >= D:
            i = i - D
            M = M + 1
            D = N * C // (M - N)    # the new C(M, N) - C(M-1, N); the division is exact
            C = C + D               # the new C(M, N)
        SOL = SOL + (1 << (M - 1))  # set the bit at position M
        N = N - 1                   # continue with the remaining N-1 lower bits
    return SOL
This will require large integer arithmetic if you have that many bits...
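As a quick check of the sketch above against the example in the question (array length 6, two 1s):

for i in (1, 5, 9):
    print(format(unrank_combination(i, 2), '06b'))
# prints 000011, 001010 and 010100 -- the question's 1st, 5th and 9th combinations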
If the ordering doesn't matter (it just needs to remain consistent), I think the fastest thing to do would be to have combination(i) return anything you want that has the desired density the first time combination() is called with argument i. Then store that value in a member variable (say, a hashmap that has the value i as key and the combination you returned as its value). The second time combination(i) is called, you just look up i in the hashmap, figure out what you returned before and return it again.
Of course, when you're returning the combination for argument i, you'll need to make sure it's not something you have returned before for some other argument.
If the number you will ever be asked to return is significantly smaller than the total number of combinations, an easy implementation for the first call to combination(i) would be to make a value of the right length with all 0s, randomly set num_ones of the bits to 1, and then make sure it's not one you've already returned for a different value of i.
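A minimal sketch of that memoization scheme (the function name, the array length and the number of ones are illustrative, not from the answer):

import random

returned = {}       # i -> combination already handed out for this argument
used = set()        # combinations handed out so far, for any argument

def combination(i, length=10000, num_ones=100):
    if i in returned:
        return returned[i]
    while True:
        bits = ['0'] * length
        for pos in random.sample(range(length), num_ones):   # set num_ones random bits
            bits[pos] = '1'
        candidate = ''.join(bits)
        if candidate not in used:                            # avoid reusing a value for another i
            break
    used.add(candidate)
    returned[i] = candidate
    return candidate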
Your problem appears to be constrained by the binomial coefficient. In the example you give, the problem can be translated as follows:
there are 6 items that can be chosen 2 at a time. Using the binomial coefficient, the total number of unique combinations can be calculated as N! / (K! (N - K)!), which for the case of K = 2 simplifies to N(N-1)/2. Plugging 6 in for N, we get 15, which is the same number of combinations that you calculated with 6! / 4! / 2! - which appears to be another way of calculating the binomial coefficient that I had not seen before. I have tried other combinations as well, and both formulas generate the same number of combinations. So, it looks like your problem can be translated to a binomial coefficient problem.
Given this, it looks like you might be able to take advantage of a class that I wrote to handle common functions for working with the binomial coefficient:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
It should not be hard to convert this class to the language of your choice.
There may be some limitations since you are using a very large N that could end up creating larger numbers than the program can handle. This is especially true if K can be large as well. Right now, the class is limited to the size of an int. But, it should not be hard to update it to use longs.
I have two arrays, N and M. They are both arbitrarily sized, though N is usually smaller than M. I want to find out which elements of N also exist in M, in the fastest way possible.
To give you an example of one possible instance of the program, N is an array 12 units in size, and M is an array 1,000 units in size. I want to find which elements in N also exist in M. (There may not be any matches.) The more parallel the solution, the better.
I used to use a hash map for this, but it's not quite as efficient as I'd like it to be.
Typing this out, I just thought of running a binary search of M on sizeof(N) independent threads. (Using CUDA) I'll see how this works, though other suggestions are welcome.
1000 is a very small number. Also, keep in mind that parallelizing a search will only give you speedup as the number of cores you have increases. If you have more threads than cores, your application will start to slow down again due to context switching and aggregating information.
A simple solution for your problem is to use a hash join. Build a hash table from M, then look up the elements of N in it (or vice versa; since both your arrays are small it doesn't matter much).
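A minimal sketch of the (non-parallel) hash join, using a Python set as the hash table; the names are illustrative:

def hash_join(N, M):
    m_set = set(M)                        # build phase: hash all elements of M
    return [x for x in N if x in m_set]   # probe phase: look up each element of N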
Edit: in response to your comment, my answer doesn't change too much. You can still speed up linearly only until your number of threads equals your number of processors, and not past that.
If you want to implement a parallel hash join, this would not be difficult. Start by building X-1 hash tables, where X is the number of threads/processors you have. Use a second hash function which returns a value modulo X-1 to determine which hash table each element should be in.
When performing the search, your main thread can apply the auxiliary hash function to each element to determine which thread to hand it off to for searching.
Just sort N. Then for each element of M, do a binary search for it over sorted N. Finding the M items in N is trivially parallel even if you do a linear search over an unsorted N of size 12.
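A minimal single-threaded sketch of this approach (the question mentions CUDA; this only shows the algorithm, not the parallelization):

import bisect

def matches(N, M):
    sorted_N = sorted(N)                  # sort the smaller array once
    hits = []
    for x in M:                           # each lookup is an independent binary search
        pos = bisect.bisect_left(sorted_N, x)
        if pos < len(sorted_N) and sorted_N[pos] == x:
            hits.append(x)
    return hits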