Use of universal hashing - data-structures

I'm trying to understand the usefulness of universal hashing over normal hashing, other than that the function is randomly chosen every time, while reading Cormen's book.
From what i understand in universal hashing we choose the function to be
H(x)=[(ax+b)mod p]mod m
with p being a prime number larger than all the keys, m the size of the hash table, and a, b random numbers (a drawn from {1, ..., p-1} and b from {0, ..., p-1}).
So, for example, if I want to store the IDs of 80 people, and each ID has a value in [0, 200], then m would be 80 and p would be 211 (the next prime number). Right?
I could use the function lets say
H(x)=[(100x+50)mod 211]mod 80
But why would this help? There is a high chance that I'm going to end up with a lot of empty slots in the table, taking up space for no reason. Wouldn't it be more useful to lower m in order to get a smaller table, so that space isn't wasted?
Any help appreciated

I think the best way to answer your question is to abstract away from the particulars of the formula that you're using to compute hash codes and to think more about, generally, what the impact is of changing the size of a hash table.
The parameter m that you're considering tuning adjusts how many slots are in your hash table. Let's imagine that you're planning on dropping n items into your hash table. The ratio n / m is called the load factor of the hash table and is typically denoted by the letter α.
If you have a table with a high load factor (large α, small m), then you'll have less wasted space in the table. However, you'll also increase the cost of doing a lookup, since with lots of objects distributed into a small space you're likely to get a bunch of collisions that will take time to resolve.
On the other hand, if you have a table with a low load factor (small α, large m), then you decrease the likelihood of collisions and therefore will improve the cost of performing lookups. However, if α gets too small - say, you have 1,000 slots per element actually stored - then you'll have a lot of wasted space.
Part of the engineering aspect of crafting a good hash table is figuring out how to draw the balance between these two options. The best way to see what works and what doesn't is to pull out a profiler and measure how changes to α change your runtime.
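As a concrete, hedged illustration of the family from the question (the class name and the way a and b are drawn are just one way to do it in Java), here is a minimal sketch that picks a and b at random and maps integer keys into m slots; note that changing m only changes the load factor n / m, not the correctness of the scheme:

    import java.util.Random;

    // Minimal sketch of the family h(x) = ((a*x + b) mod p) mod m from the
    // question, assuming non-negative integer keys smaller than the prime p.
    class UniversalHash {
        private final long a, b;
        private final int p, m;

        UniversalHash(int p, int m, Random rnd) {
            this.p = p;                      // prime larger than every key
            this.m = m;                      // number of slots in the table
            this.a = 1 + rnd.nextInt(p - 1); // a drawn from {1, ..., p-1}
            this.b = rnd.nextInt(p);         // b drawn from {0, ..., p-1}
        }

        int slot(int key) {
            return (int) (((a * key + b) % p) % m);
        }
    }

With the numbers from the question this would be new UniversalHash(211, 80, new Random()); shrinking m saves slots but raises the load factor α = n / m, with the lookup-cost consequences described above.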

Related

Are there optimal sizes for a Hashtbl in OCaml?

Say I need to store 20 key/value pairs; would it be more efficient to use a power of 2, e.g. 32? I read a paper where the authors used a size of 251 (for an unknown number of keys/values). Is this just a random number, or is there some reasoning behind it?
I’m talking about the n in Hashtbl.create n.
It's not entirely clear what you're asking. Since you ask about Hashtbl by name, I assume you're talking about the standard hash table module. This module always allocates tables in power-of-2 sizes. So you don't have to worry about it.
There are two basic "extra good" sizes for hash tables. Powers of two are good because they make it easy to find your hash bucket. The last step of a hashing procedure is to take the hash value modulo the size of your table. If the table size is a power of two, this modulo operation can be done very quickly with a masking operation. I'm not sure this matters in today's world, unless your hash function itself is very fast to compute.
The second good value is a prime number. A prime number is good because it tends to spread values throughout the table. If you have hash values that happen to be predominantly a multiple of some number, this will cause dense clusters in the hash table unless the hash table size is relatively prime to the predominant number. A large-ish prime number is relatively prime to virtually everything, so it prevents clustering. So, 251 is good because it's a prime number.
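To make the power-of-two point concrete, here is a small Java illustration (not the OCaml Hashtbl internals, just the arithmetic): reducing a hash code modulo a power-of-two size is a single bitwise AND.

    // Illustration only: for a power-of-two table size, (h mod size) equals
    // (h & (size - 1)), so the bucket index needs one AND instead of a division.
    public class MaskDemo {
        public static void main(String[] args) {
            int size = 32;                              // power of two
            int h = "some key".hashCode() & 0x7fffffff; // force a non-negative hash
            System.out.println(h % size);               // modulo: uses division
            System.out.println(h & (size - 1));         // mask: same bucket, no division
        }
    }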

Data structure to store 2D points clustered near the origin?

I need to use a spatial 2D map for my application. The map usually contains a small number of values in the (-200, -200) to (200, 200) rectangle, most of them around (0, 0).
I thought of using a hash map, but then I need a hash function. I thought of x * 200 + y, but then adding just (0, 0) and (1, 0) would require 800 bytes for the hash table alone, and memory is a problem in my application.
The map is immutable after initial setup, so insertion time isn't a concern, but there are a lot of accesses (about 600 per second) and the target CPU isn't really fast.
What are the general memory/access-time trade-offs between a hash map and an ordinary map (I believe a red-black tree in the STL) for small areas? And what is a good hash function for small areas?
I think that there are a few things that I need to explain in a bit more detail to answer your question.
For starters, there is an important distinction between the hash function as it's typically used in a program and the number of buckets used in a hash table. In most implementations, the hash function is some mapping from objects to integers. The hash table is then free to pick any number of buckets it wants and to map those integers back onto buckets, commonly by taking the hash code and modding it by the number of buckets. This means that if you want to store points in a hash table, you don't need to worry about how large the values that your hash function produces are.

For example, if the hash table has only three buckets and you produce objects with hash codes 0 and 1,000,000,000, the first object hashes to bucket 0 and the second to bucket 1,000,000,000 % 3 = 1. You wouldn't need 1,000,000,000 buckets. Consequently, you shouldn't worry about picking a hash function like x * 200 + y: unless your hash table is implemented very oddly, you don't need to worry about its space usage.
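As a short Java sketch of that separation (the Point class here is purely illustrative), the hash code can be any integer at all; the map decides how many buckets to use and reduces the code into that range internally:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    // Sketch: hash codes can be arbitrarily large; the table's memory depends on
    // how many entries you insert, not on how large the hash codes are.
    final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override public int hashCode() { return Objects.hash(x, y); }
        @Override public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
        }
    }

    class Demo {
        public static void main(String[] args) {
            Map<Point, String> m = new HashMap<>();
            m.put(new Point(0, 0), "origin"); // no large table is needed just because
            m.put(new Point(1, 0), "east");   // the hash codes are spread out
        }
    }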
If you are creating a hash table in a way where you will be inserting only once and then spending a lot of time doing accesses, you may want to look into perfect hash functions and perfect hash tables. These are data structures that work by trying to find a hash function for the set of points that you're storing such that no collisions ever occur. They take (expected) O(n) time to create, and can do lookups in worst-case O(1) time. Barring the overhead from computing the hash function, this is the fastest way to look up points in space.
If you were just to dump everything in a tree-based map like most implementations of std::map, though, you should be perfectly fine. With at most 400x400 = 160,000 points, looking up a point would take about lg 160,000 ≈ 18 comparisons. This is unlikely to be a bottleneck in any application, though if you really need all the performance you can get, the aforementioned perfect hash table is likely to be the best option.
However, both of these solutions only work if the queries you are interested in are of the form "does point p exist in the set or not?" If you want to do more complex geometric queries like nearest-neighbor lookups or finding all the points in a bounding box, you may want to look into more complex data structures like the k-d tree, which supports extremely fast (O(log n)) lookups and fast nearest-neighbor and range searches.
Hope this helps!
Slightly confused by your terminology.
The "map" objects in the standard library are implementations of associative arrays (either via hash tables or binary search trees).
If you're doing 2D spatial processing and are looking to implement a search structure, there are many dedicated data structures - e.g. quadtrees and k-d trees.
Edit: For a few ideas on implementations, perhaps check: https://stackoverflow.com/questions/1402014/kdtree-implementation-c.
Honestly - the data structures aren't that complex - I've always rolled my own.

Why are hash table expansions usually done by doubling the size?

I've done a little research on hash tables, and I keep running across the rule of thumb that when there are a certain number of entries (either max or via a load factor like 75%) the hash table should be expanded.
Almost always, the recommendation is to double (or double plus 1, i.e., 2n+1) the size of the hash table. However, I haven't been able to find a good reason for this.
Why double the size, rather than, say, increasing it 25%, or increasing it to the size of the next prime number, or next k prime numbers (e.g., three)?
I already know that it's often a good idea to choose an initial hash table size which is a prime number, at least if your hash function uses modulus such as universal hashing. And I know that's why it's usually recommended to do 2n+1 instead of 2n (e.g., http://www.concentric.net/~Ttwang/tech/hashsize.htm)
However as I said, I haven't seen any real explanation for why doubling or doubling-plus-one is actually a good choice rather than some other method of choosing a size for the new hash table.
(And yes I've read the Wikipedia article on hash tables :) http://en.wikipedia.org/wiki/Hash_table
Hash-tables could not claim "amortized constant time insertion" if, for instance, the resizing was by a constant increment. In that case the cost of resizing (which grows with the size of the hash-table) would make the cost of one insertion linear in the total number of elements to insert. Because resizing becomes more and more expensive with the size of the table, it has to happen "less and less often" to keep the amortized cost of insertion constant.
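As a rough back-of-the-envelope sketch of that argument, using the same informal notation as the rest of this thread: growing by a constant increment c means inserting n elements triggers about n/c resizes that copy roughly c, 2c, 3c, ... elements, for a total of about n^2/(2c) copy work, i.e. O(n) per insertion on average. Doubling instead copies about m/2 + m/4 + m/8 + ... < m elements in total on the way to a table of size m, so the total copy work is O(n) and the amortized cost per insertion stays O(1).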
Most implementations allow the average bucket occupation to grow up to a bound fixed in advance before resizing (anywhere between 0.5 and 3, which are all acceptable values). With this convention, just after resizing the average bucket occupation becomes half that bound. Resizing by doubling keeps the average bucket occupation within a factor of 2 of that bound.
Sub-note: because of statistical clustering, you have to take an average bucket occupation as low as 0.5 if you want many buckets to have at most one element (maximum lookup speed, ignoring the complex effects of cache size), or as high as 3 if you want to minimize the number of empty buckets (which correspond to wasted space).
I had read a very interesting discussion on growth strategy on this very site... just cannot find it again.
While 2 is commonly used, it has been argued that it is not the best value. One often-cited problem is that it does not cope well with allocator schemes (which often allocate power-of-two blocks), since it would always require a reallocation, while a smaller factor might in fact be reallocated in the same block (simulating in-place growth) and thus be faster.
Thus, for example, the VC++ Standard Library uses a growth factor of 1.5 (ideally it should stay below the golden ratio if a first-fit memory allocation strategy is being used) after an extensive discussion on the mailing list. The reasoning is explained here:
I'd be interested if any other vector implementations use a growth factor other than 2, and I'd also like to know whether VC7 uses 1.5 or 2 (since I don't have that compiler here).
There is a technical reason to prefer 1.5 to 2 -- more specifically, to prefer values less than (1+sqrt(5))/2.
Suppose you are using a first-fit memory allocator, and you're progressively appending to a vector. Then each time you reallocate, you allocate new memory, copy the elements, then free the old memory. That leaves a gap, and it would be nice to be able to use that memory eventually. If the vector grows too rapidly, it will always be too big for the available memory.
It turns out that if the growth factor is >= (1+sqrt(5))/2, the new memory will always be too big for the hole that has been left so far; if it is < (1+sqrt(5))/2, the new memory will eventually fit. So 1.5 is small enough to allow the memory to be recycled.
Surely, if the growth factor is >= 2 the new memory will always be too big for the hole that has been left so far; if it is < 2, the new memory will eventually fit. Presumably the reason for (1+sqrt(5))/2 is...
Initial allocation is s.
The first resize is k*s.
The second resize is k*k*s, which will fit the hole iff k*k*s <= k*s+s, i.e. iff k <= (1+sqrt(5))/2
...the hole can be recycled asap.
It could, by storing its previous size, grow fibonaccily.
Of course, it should be tailored to the memory allocation strategy.
One reason for doubling size that is specific to hash containers is that if the container capacity is always a power of two, then instead of using a general-purpose modulo for converting a hash to an offset, the same result can be achieved with a bit mask. Modulo is a slow operation for the same reasons that integer division is slow. (Whether integer division is "slow" in the context of whatever else is going on in a program is of course case dependent, but it's certainly slower than other basic integer arithmetic.)
Doubling the memory when expanding any type of collection is a commonly used strategy to prevent memory fragmentation and to avoid reallocating too often. As you point out, there might be reasons to have a prime number of elements. When you know your application and your data, you might also be able to predict the growth of the number of elements and thus choose another (larger or smaller) growth factor than doubling.
The general implementations found in libraries are exactly that: General implementations. They have to focus on being a reasonable choice in a variety of different situations. When knowing the context, it is almost always possible to write a more specialized and more efficient implementation.
If you don't know how many objects you will end up using (let's say N), then by doubling the space you'll do at most log2(N) reallocations.
I assume that if you choose a proper initial n, you increase the odds that 2*n + 1 will produce prime numbers in subsequent reallocations.
The same reasoning applies for doubling the size as for vector/ArrayList implementations; see this answer.

When should I do rehashing of entire hash table?

How do I decide when I should rehash the entire hash table?
This depends a great deal on how you're resolving collisions. If you use linear probing, performance usually starts to drop pretty badly with a load factor much higher than 60% or so. If you use double hashing, a load factor of 80-85% is usually pretty reasonable. If you use collision chaining, performance usually remains reasonable with load factors up to around 150% or more.
I've sometimes even created a hash table with balanced trees for collision resolution. In this case, you can almost forget about re-hashing -- the performance doesn't start to deteriorate noticeably until the number of items exceeds the table size by at least a couple orders of magnitude.
Generally, you have a hash table containing N elements distributed in an array of M slots.
There is a percent value (called "growthFactor") defined by the user when instantiating the hash table that is used in this way:
if ((double) N / M > growthFactor)
    Rehash();
The rehash means your array of M slots should be resized to hold more elements (a prime number roughly twice the current size is ideal) and that your elements must be redistributed into the new, larger array.
Such a value is usually set to something between 0.6 and 0.8.
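A minimal Java sketch of that check for a chained table (the names growthFactor and buckets are just illustrative, not any particular library's API):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    // Sketch of the resize test described above for a collision-chaining table.
    class ChainedTable<K> {
        private List<LinkedList<K>> buckets = newBuckets(11);
        private int n = 0;                         // number of stored elements
        private final double growthFactor = 0.75;  // rehash once N/M exceeds this

        void add(K key) {
            if ((double) n / buckets.size() > growthFactor) rehash();
            buckets.get(index(key, buckets.size())).add(key);
            n++;
        }

        private void rehash() {
            List<LinkedList<K>> old = buckets;
            buckets = newBuckets(2 * old.size() + 1);  // roughly double the slots
            for (LinkedList<K> chain : old)
                for (K key : chain)
                    buckets.get(index(key, buckets.size())).add(key);
        }

        private int index(K key, int m) {
            return (key.hashCode() & 0x7fffffff) % m;
        }

        private static <T> List<LinkedList<T>> newBuckets(int m) {
            List<LinkedList<T>> b = new ArrayList<>(m);
            for (int i = 0; i < m; i++) b.add(new LinkedList<>());
            return b;
        }
    }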
A rule of thumb is to resize the table once it's 3/4 full.

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course, the obvious solution is to maintain a running hash set of the elements you encounter and count them at the end. This would yield an exact result, but it would require me to carry a potentially large amount of state as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there is a statistical approach that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state as I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating-point number). It is assumed that these objects can be hashed (i.e. you can put them in a HashSet if you want to). Typically they will be strings or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
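A hedged Java sketch of Linear Counting as described above (the bitvector size is up to you and controls the accuracy; it should be comfortably larger than the expected number of distinct elements):

    import java.util.BitSet;

    // Sketch of Linear Counting: hash each element to one bit, then estimate the
    // number of distinct elements from the fraction of bits that remain unset.
    class LinearCounter {
        private final BitSet bits;
        private final int totalBits;

        LinearCounter(int totalBits) {
            this.totalBits = totalBits;
            this.bits = new BitSet(totalBits);
        }

        void add(Object element) {
            bits.set((element.hashCode() & 0x7fffffff) % totalBits);
        }

        double estimate() {
            int unsetBits = totalBits - bits.cardinality();
            // D = -total_bits * ln(unset_bits / total_bits); if every bit is set,
            // the bitvector was too small and the estimate blows up.
            return -totalBits * Math.log((double) unsetBits / totalBits);
        }
    }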
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
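That sampling idea in a few lines of Java (the "first two bits are zero" filter and the resulting factor of 4 are just the example numbers from the answer above):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: keep only elements whose top two hash bits are zero (roughly a
    // 1-in-4 sample of the distinct values), count those exactly, then scale up.
    class SampledDistinctCounter {
        private final Set<Object> sample = new HashSet<>();

        void add(Object element) {
            if ((element.hashCode() >>> 30) == 0)   // "first two bits of the hash are 0"
                sample.add(element);
        }

        double estimate() {
            return 4.0 * sample.size();             // inverse of the sampling rate
        }
    }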
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.
