Can dynamic resizing alone preserve correctness of a hash table? - data-structures

I was looking through MIT OpenCourseWare 6.006, and it had the following question:
Alyssa decides to implement both collision resolution and dynamic resizing for her hash table. However, she doesn't want to do more work than necessary, so she wonders if she needs both to maintain the correctness and performance she expects. After all, if she has dynamic resizing, she can resize to avoid collisions; and if she has collision resolution, collisions don't cause correctness issues. Which statement about these two properties is true?
1. Dynamic resizing alone will preserve both properties.
2. Dynamic resizing alone will preserve correctness, but not performance.
3. Collision resolution alone will preserve performance, but not correctness.
4. Both are necessary to maintain performance and correctness.
The correct answer was given in the solution as 4:
"Without collision resolution, no correctness: could have an actual hash collision, and then no amount of resizing will let both be entered into the table."
This is where I get confused: if we keep increasing the table size, and our hash function is simple uniform hashing, then won't the hash values change eventually?
if h(k1) = h(k2), i.e. k1 mod size1 = k2 mod size1,
and k1, k2 are distinct, then
for some size x,
k1 mod x != k2 mod x
right?

Hash functions already have an upper bound on the size of the calculated hash. A hash function calculates a fixed-length value from a potentially variable-length input, i.e. there will always be a chance of collisions, no matter how large your hash table is.
Simply because you are mapping a larger number of possible values to a smaller number of slots, some slots will have more than one item assigned. That's a collision, and it's unavoidable.
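A tiny sketch of this point (using a made-up `toy_hash`, not any standard function): once two distinct keys produce the same fixed-width hash value, no amount of resizing can separate them, because the slot index is always derived from that same hash.

```python
def toy_hash(key: str) -> int:
    """Sum of character codes, truncated to 8 bits - collisions are easy to find."""
    return sum(ord(c) for c in key) % 256

k1, k2 = "ad", "bc"                      # distinct keys...
assert k1 != k2
assert toy_hash(k1) == toy_hash(k2)      # ...but identical hash values (both 197)

# Resizing cannot help: the slot index agrees for EVERY table size m.
for m in (8, 16, 101, 1 << 20):
    assert toy_hash(k1) % m == toy_hash(k2) % m
```

This is exactly why collision resolution is needed for correctness: without it, one of the two keys simply could not be stored.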

Related

Use of universal hashing

I'm trying to understand the usefulness of universal hashing over normal hashing, other than that the function is randomly produced every time, reading Cormen's book.
From what i understand in universal hashing we choose the function to be
H(x) = ((ax + b) mod p) mod m
with p being a prime number larger than all the keys, m the size of the hash table, and a, b random numbers.
So for example, if I want to read the IDs of 80 people, and each ID has a value in [0, 200], then m would be 80 and p would be 211 (the next prime number). Right?
I could use, let's say, the function
H(x) = ((100x + 50) mod 211) mod 80
But why would this help? There is a high chance that I'm going to end up having a lot of empty slots in the table, taking up space for no reason. Wouldn't it be more useful to lower the number m in order to get a smaller table, so space isn't wasted?
Any help appreciated
I think the best way to answer your question is to abstract away from the particulars of the formula that you're using to compute hash codes and to think more about, generally, what the impact is of changing the size of a hash table.
The parameter m that you're considering tuning adjusts how many slots are in your hash table. Let's imagine that you're planning on dropping n items into your hash table. The ratio n / m is called the load factor of the hash table and is typically denoted by the letter α.
If you have a table with a high load factor (large α, small m), then you'll have less wasted space in the table. However, you'll also increase the cost of doing a lookup, since with lots of objects distributed into a small space you're likely to get a bunch of collisions that will take time to resolve.
On the other hand, if you have a table with a low load factor (small α, large m), then you decrease the likelihood of collisions and therefore will improve the cost of performing lookups. However, if α gets too small - say, you have 1,000 slots per element actually stored - then you'll have a lot of wasted space.
Part of the engineering aspect of crafting a good hash table is figuring out how to draw the balance between these two options. The best way to see what works and what doesn't is to pull out a profiler and measure how changes to α change your runtime.
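As a concrete sketch of the family discussed above, here is how drawing one function at random from the CLRS family h(x) = ((ax + b) mod p) mod m might look (names like `make_universal_hash` are illustrative, not from any library):

```python
import random

def make_universal_hash(p: int, m: int):
    """Draw one hash function at random from the universal family
    h_{a,b}(x) = ((a*x + b) mod p) mod m, with a != 0."""
    a = random.randrange(1, p)   # a in {1, ..., p-1}
    b = random.randrange(0, p)   # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m

p, m = 211, 80                   # keys in [0, 200], 80 slots, as in the question
h = make_universal_hash(p, m)
for key in (0, 42, 200):
    assert 0 <= h(key) < m       # every key maps into one of the m slots
```

The point of the random draw is that no fixed set of keys can be pathological for *every* (a, b) pair: an adversary who knows the family but not the drawn function cannot force collisions.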

Hash table - why is it faster than arrays?

In cases where I have a key for each element and I don't know the index of the element in an array, hash tables perform better than arrays (O(1) vs O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
In cases where I have a key for each element and I don't know the index of the element in an array, hash tables perform better than arrays (O(1) vs O(n)).
The hash table search performs O(1) in the average case. In the worst case, the hash table search performs O(n): when you have collisions and the hash function always returns the same slot. One may think "this is a remote situation," but a good analysis should consider it anyway. In that case you iterate through all the elements, as in an array or linked list (O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
You have a key, you hash it.. you have the hash: the index into the hash table where the element lives (if it was inserted earlier). At this point you can access the hash table record in O(1). If the load factor is small, it's unlikely that more than one element is there, so the first element you see should be the one you are looking for. Otherwise, if there is more than one element, you must compare each element you find in that position against the one you are looking for. In this case you have O(1) + O(number_of_colliding_elements).
In the average case, the hash table search complexity is O(1) + O(load_factor) = O(1 + load_factor).
Remember that in the worst case all n elements collide in a single slot, so the load factor is proportional to n and the search complexity is O(n).
I don't know what you mean with "trick behind the memory disposition". Under some points of view, the hash table (with its structure and collisions resolution by chaining) can be considered a "smart trick".
Of course, the hash table analysis results can be proven by math.
With arrays: if you know the value, you have to search on average half the values (unless the array is sorted) to find its location.
With hashes: the location is generated based on the value. So, given that value again, you can calculate the same hash you calculated when inserting. Sometimes, more than 1 value results in the same hash, so in practice each "location" is itself an array (or linked list) of all the values that hash to that location. In this case, only this much smaller (unless it's a bad hash) array needs to be searched.
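A minimal sketch of the chaining scheme just described (class and method names are illustrative): each slot holds a small list of (key, value) pairs, so a lookup hashes once and then scans only that one, usually tiny, chain instead of the whole table.

```python
class ChainedHashTable:
    """Toy separate-chaining hash table: each slot is a list of (key, value) pairs."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _bucket(self, key):
        # One hash computation picks the single chain to examine.
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):   # scan one chain only, not the table
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("car", 1)
t.put("bus", 2)
assert t.get("car") == 1 and t.get("bus") == 2
```

With a decent hash and a reasonable load factor, each chain holds O(1) entries on average, which is where the O(1) expected lookup comes from.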
Hash tables are a bit more complex. They put elements in different buckets based on their hash % some value. In an ideal situation, each bucket holds very few items and there aren't many empty buckets.
Once you know the key, you compute the hash. Based on the hash, you know which bucket to look for. And as stated above, the number of items in each bucket should be relatively small.
Hash tables are doing a lot of magic internally to make sure buckets are as small as possible while not consuming too much memory for empty buckets. Also, much depends on the quality of the key -> hash function.
Wikipedia provides a very comprehensive description of hash tables.
A hash table will not have to compare every element in the table. It calculates the hash code from the key; for example, if the key is 4, the hash code might be computed by some formula like 4*x*y. That hash code tells it exactly which slot to check.
With an array, by contrast, it would have to traverse the whole array to search for the element.
Why is [it] that [hashtables perform lookups by key better than arrays (O(1) vs O(n))]? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
Once you have the hash, it lets you calculate an "ideal" or expected location in the array of buckets, commonly:
ideal bucket = hash % num_buckets
The problem is then that another value may have already hashed to that bucket, in which case the hash table implementation has two main choices:
1) try another bucket
2) let several distinct values "belong" to one bucket, perhaps by making the bucket hold a pointer into a linked list of values
For implementation 1, known as open addressing or closed hashing, you jump around to other buckets: if you find your value, great; if you find a never-used bucket, then you can store your value there if inserting, or you know you'll never find your value when searching. The search can be even worse than O(n) if the way you traverse alternative buckets ends up visiting the same bucket multiple times; for example, with quadratic probing you try the ideal bucket index +1, then +4, then +9, then +16 and so on - but you must avoid out-of-bounds bucket access using e.g. % num_buckets, so if there are, say, 12 buckets then ideal+4 and ideal+16 search the same bucket. It can be expensive to track which buckets have been searched, so it can also be hard to know when to give up: the implementation can be optimistic and assume it will always find either the value or an unused bucket (risking spinning forever), or it can keep a counter and, after a threshold of tries, either give up or fall back to a linear bucket-by-bucket search.
For implementation 2, known as closed addressing or separate chaining, you have to search inside the container/data-structure of values that all hashed to the ideal bucket. How efficient this is depends on the type of container used. It's generally expected that the number of elements colliding at one bucket will be small, which is true of a good hash function with non-adversarial inputs, and typically true enough of even a mediocre hash function, especially with a prime number of buckets. So, a linked list or contiguous array is often used, despite the O(n) search properties: linked lists are simple to implement and operate on, and arrays pack the data together for better memory cache locality and access speed. The worst possible case, though, is that every value in your table hashed to the same bucket, and the container at that bucket now holds all the values: your entire hash table is then only as efficient as the bucket's container. Some Java hash table implementations have started using binary trees if the number of elements hashing to the same bucket passes a threshold, to make sure complexity is never worse than O(log n).
Python's dicts and sets are an example of (1), open addressing (a.k.a. closed hashing); C++'s std::unordered_set is an example of (2), separate chaining (a.k.a. closed addressing).
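A minimal sketch of option 1, open addressing with simple linear probing (names are illustrative, and a real implementation would also handle resizing and deletion):

```python
_EMPTY = object()   # sentinel marking a never-used bucket

class LinearProbingSet:
    """Toy open-addressing set: on collision, step to the next bucket, wrapping around."""

    def __init__(self, num_buckets=16):
        self.buckets = [_EMPTY] * num_buckets

    def _probe(self, key):
        i = hash(key) % len(self.buckets)        # the "ideal" bucket
        for _ in range(len(self.buckets)):       # visit each bucket at most once
            yield i
            i = (i + 1) % len(self.buckets)      # wrap at the end of the array

    def add(self, key):
        for i in self._probe(key):
            if self.buckets[i] is _EMPTY or self.buckets[i] == key:
                self.buckets[i] = key
                return
        raise RuntimeError("table full - a real table would resize here")

    def __contains__(self, key):
        for i in self._probe(key):
            if self.buckets[i] is _EMPTY:
                return False                     # never-used slot: key can't be present
            if self.buckets[i] == key:
                return True
        return False

s = LinearProbingSet()
for word in ("ad", "bc", "car"):
    s.add(word)
assert "car" in s and "zzz" not in s
```

Note how the never-used sentinel doubles as the "give up" signal during search, exactly as described above.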
The purpose of hashing is to produce an index into the underlying array, which enables you to jump straight to the element in question. This is usually accomplished by dividing the hash by the size of the array and taking the remainder: index = hash % capacity.
The type/size of the hash is typically that of the smallest integer large enough to index all of RAM. On a 32-bit system this is a 32-bit integer; on a 64-bit system, a 64-bit integer. In C++ this corresponds to unsigned int and unsigned long long respectively. To be pedantic, C++ technically specifies minimum sizes for its primitives, i.e. at least 32 bits and at least 64 bits, but that's beside the point. For the sake of making code portable, C++ also provides a size_t primitive which corresponds to the appropriate unsigned integer. You'll see that type a lot in for loops which index into arrays, in well-written code. In the case of a language like Python, the integer primitive grows to whatever size it needs to be; this is typically implemented in the standard libraries of other languages under the name "big integer". To deal with this, the Python programming language simply truncates whatever value you return from the __hash__() method down to the appropriate size.
On this score I think it's worth giving a word to the wise. The result of the arithmetic is the same regardless of whether you compute the remainder at the end or at each step along the way; truncation is equivalent to computing the remainder modulo 2^n, where n is the number of bits you leave intact. Now, you might think that computing the remainder at each step would be foolish, since you're incurring an extra computation at every step along the way. However, this is not the case, for two reasons. First, computationally speaking, truncation is extraordinarily cheap, far cheaper than generalized division. Second, and this is the real reason (the first alone wouldn't justify it), taking the remainder at each step keeps the numbers (relatively) small. So instead of something like product = 31*product + hash(array[index]), you'll want something like product = hash(31*product + hash(array[index])). The primary purpose of the inner hash() call is to take something which might not be a number and turn it into one, whereas the primary purpose of the outer hash() call is to take a potentially oversized number and truncate it. Lastly, I'll note that in languages like C++, where integer primitives have a fixed size, this truncation step is performed automatically after every operation.
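The claim above can be checked concretely: reducing modulo 2^n at every step of a polynomial hash gives exactly the same result as reducing once at the end, while keeping intermediate values small. Here `MASK` emulates the 32-bit truncation that C++ or Java would perform automatically (function names are illustrative):

```python
MASK = (1 << 32) - 1   # truncation to 32 bits == remainder modulo 2^32

def poly_hash_stepwise(values):
    product = 0
    for v in values:
        product = (31 * product + v) & MASK   # truncate at every step
    return product

def poly_hash_at_end(values):
    product = 0
    for v in values:
        product = 31 * product + v            # let it grow (Python big integers)
    return product & MASK                     # truncate once at the end

data = [7, 1 << 40, 123456789, 42]
assert poly_hash_stepwise(data) == poly_hash_at_end(data)
```

The equality holds because modular reduction commutes with addition and multiplication; in a fixed-width language you get the stepwise version for free.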
Now for the elephant in the room. You've probably realized that, hash codes being generally speaking smaller than the objects they correspond to (and the indices derived from them generally smaller still), it's entirely possible for two objects to hash to the same index. This is called a hash collision. Data structures backed by a hash table, like Python's set or dict or C++'s std::unordered_set or std::unordered_map, primarily handle this in one of two ways. The first is called separate chaining, and the second is called open addressing. In separate chaining, the array functioning as the hash table is itself an array of lists (or, in some cases where the developer feels like getting fancy, some other data structure like a binary search tree), and every time an element hashes to a given index it gets added to the corresponding list. In open addressing, if an element hashes to an index which is already occupied, the data structure probes over to the next index (or, in some cases where the developer feels like getting fancy, an index defined by some other function, as is the case in quadratic probing), and so on until it finds an empty slot, of course wrapping around when it reaches the end of the array.
Next, a word about load factor. There is of course an inherent space/time trade-off when it comes to increasing or decreasing the load factor. The higher the load factor, the less wasted space the table consumes; however, this comes at the expense of increasing the likelihood of performance-degrading collisions. Generally speaking, hash tables implemented with separate chaining are less sensitive to load factor than those implemented with open addressing. This is due to the phenomenon known as clustering, whereby clusters in an open-addressed hash table tend to become larger and larger in a positive feedback loop: the larger they become, the more likely they are to contain the preferred index of a newly added element. This is actually the reason why the aforementioned quadratic probing scheme, which progressively increases the jump distance, is often preferred. In the extreme case of load factors greater than 1, open addressing can't work at all, as the number of elements exceeds the available space. That being said, load factors greater than 1 are exceedingly rare in general. At the time of writing, Python's set and dict classes employ a max load factor of 2/3, Java's java.util.HashSet and java.util.HashMap use 3/4, and C++'s std::unordered_set and std::unordered_map take the cake with a max load factor of 1. Unsurprisingly, Python's hash-table-backed data structures handle collisions with open addressing, whereas their Java and C++ counterparts do it with separate chaining.
Last, a comment about table size. When the max load factor is exceeded, the size of the hash table must of course be grown. Since this requires that every element therein be reindexed, it's highly inefficient to grow the table by a fixed amount; doing so would incur on the order of table-size operations every time a new element is added. The standard fix for this problem is the same as that employed by most dynamic array implementations: at every point where we need to grow the table, we simply increase its size by its current size. This, unsurprisingly, is known as table doubling.
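A sketch of the growth policy just described (the class is illustrative, and MAX_LOAD = 0.75 is just an example threshold, matching Java's default): grow only when the max load factor is exceeded, and double rather than adding a fixed amount, so each element is rehashed O(1) times amortized.

```python
MAX_LOAD = 0.75   # illustrative max load factor

class DoublingTable:
    """Toy model of table doubling: tracks only the counts, not the actual slots."""

    def __init__(self):
        self.num_slots = 8
        self.items = []          # stand-in for the stored elements

    def insert(self, key):
        self.items.append(key)
        if len(self.items) / self.num_slots > MAX_LOAD:
            self.num_slots *= 2  # double: a real table rehashes every element here

table = DoublingTable()
for k in range(100):
    table.insert(k)
assert table.num_slots == 256                        # grew 8 -> 16 -> ... -> 256
assert len(table.items) / table.num_slots <= MAX_LOAD
```

Because each doubling rehashes all current elements but also doubles the headroom, the total rehashing work over n inserts is O(n), i.e. O(1) amortized per insert.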
I think you answered your own question there. "Shouldn't the algorithm compare this hash against every element's hash?" That's kind of what it does when it doesn't know the index location of what you're searching for: it compares each element to find the one you're looking for.
E.g. let's say you're looking for an item called "Car" inside an array of strings. You need to go through every item and check item.Hash() == "Car".Hash() to find out whether that is the item you're looking for. (Obviously real searches don't always use the hash, but the example stands.) Then you have a hash table. What a hash table does is create a sparse array, or sometimes an array of buckets as mentioned above. It then uses "Car".Hash() to deduce where in the sparse array your "Car" item actually is. This means it doesn't have to search through the entire array to find your item.

Data structure to store 2D points clustered near the origin?

I need to use a spatial 2D map for my application. The map usually contains a small number of values in the (-200, -200) - (200, 200) rectangle, most of them around (0, 0).
I thought of using a hash map, but then I need a hash function. I thought of x * 200 + y, but then adding just (0, 0) and (1, 0) would require 800 bytes for the hash table alone, and memory is a problem in my application.
The map is immutable after initial setup, so insertion time isn't a concern, but there is a lot of access (about 600 lookups a second) and the target CPU isn't very fast.
What are the general memory/access-time trade-offs between a hash map and an ordinary map (an RB-tree in the STL, I believe) for small areas? What is a good hash function for small areas?
I think that there are a few things that I need to explain in a bit more detail to answer your question.
For starters, there is a strong distinction between a hash function as its typically used in a program and the number of buckets used in a hash table. In most implementations of a hash function, the hash function is some mapping from objects to integers. The hash table is then free to pick any number of buckets it wants, then maps back from the integers to those buckets. Commonly, this is done by taking the hash code and then modding it by the number of buckets. This means that if you want to store points in a hash table, you don't need to worry about how large the values that your hash function produces are. For example, if the hash table has only three buckets and you produce objects with hash codes 0 and 1,000,000,000, then the first object would hash to the zeroth bucket and the second object would hash to the 1,000,000,000 % 3 = 1st bucket. You wouldn't need 1,000,000,000 buckets. Consequently, you shouldn't worry about picking a hash function like x * 200 + y, since unless your hash table is implemented very oddly you don't need to worry about space usage.
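The point above, made concrete: the magnitude of the hash value doesn't determine the table's memory use, because the table mods it down to its bucket count. In Python terms, a dict keyed by coordinate tuples stores only the points you insert, not the whole (-200, 200) x (-200, 200) grid:

```python
points = {}
for x, y in [(0, 0), (1, 0), (-200, 199), (37, -45)]:
    # hash((x, y)) may be a huge integer, but storage is O(number of points):
    # the table mods the hash down to its own (small) bucket count.
    points[(x, y)] = True

assert (1, 0) in points
assert (5, 5) not in points
assert len(points) == 4   # four entries, regardless of the coordinate range
```

So a formula like x * 200 + y is perfectly usable as a hash; it is only a problem if you use it directly as an array index.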
If you are creating a hash table in a way where you will be inserting only once and then spending a lot of time doing accesses, you may want to look into perfect hash functions and perfect hash tables. These are data structures that work by trying to find a hash function for the set of points that you're storing such that no collisions ever occur. They take (expected) O(n) time to create, and can do lookups in worst-case O(1) time. Barring the overhead from computing the hash function, this is the fastest way to look up points in space.
If you were just to dump everything in a tree-based map like most implementations of std::map, though, you should be perfectly fine. With at most 400x400 = 160,000 points, the time required to look up a point would be about lg 160,000 ≈ 18 comparisons. This is unlikely to be a bottleneck in any application, though if you really need all the performance you can get, the aforementioned perfect hash table is likely to be the best option.
However, both of these solutions only work if the queries you are interested in are of the form "does point p exist in the set or not?" If you want to do more complex geometric queries like nearest-neighbor lookups or finding all the points in a bounding box, you may want to look into more complex data structures like the k-d tree, which supports extremely fast (O(log n)) lookups and fast nearest-neighbor and range searches.
Hope this helps!
Slightly confused by your terminology.
The "map" objects in the standard library are implementations of associative arrays (either via hash tables or binary search trees).
If you're doing 2D spatial processing and are looking to implement a search structure, there are many dedicated data structures, e.g. quadtrees and k-d trees.
Edit: For a few ideas on implementations, perhaps check: https://stackoverflow.com/questions/1402014/kdtree-implementation-c.
Honestly - the data structures aren't that complex - I've always rolled my own.

When should I do rehashing of entire hash table?

How do I decide when should I do rehashing of entire hash table?
This depends a great deal on how you're resolving collisions. If you use linear probing, performance usually starts to drop pretty badly with a load factor much higher than 60% or so. If you use double hashing, a load factor of 80-85% is usually pretty reasonable. If you use collision chaining, performance usually remains reasonable with load factors up to around 150% or more.
I've sometimes even created a hash table with balanced trees for collision resolution. In this case, you can almost forget about re-hashing -- the performance doesn't start to deteriorate noticeably until the number of items exceeds the table size by at least a couple orders of magnitude.
Generally, you have a hash table containing N elements distributed in an array of M slots.
There is a fractional value (call it "growthFactor") defined by the user when instantiating the hash table, used in this way:
if (growthFactor < (N/M))
    Rehash();
The rehash means your array of M slots should be resized to hold more elements (a prime number greater than the current size, or 2x the current size, is a common choice), and your elements must be redistributed into the new, larger array.
This value should be set to something between 0.6 and 0.8.
A rule of thumb is to resize the table once it's 3/4 full.
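A sketch of the rule above (function names and the 0.75 threshold are illustrative): rehash when the load factor N/M passes the threshold, growing M to roughly double, here the next prime above 2*M, which is one common choice.

```python
def next_prime(n):
    """Smallest prime >= n (trial division; fine for table-sized numbers)."""
    def is_prime(k):
        return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def maybe_grow(num_items, num_slots, max_load=0.75):
    """Return the new slot count: unchanged, or the next prime >= 2*M on rehash."""
    if num_items / num_slots > max_load:
        return next_prime(2 * num_slots)   # a real table rehashes all items here
    return num_slots

assert maybe_grow(5, 11) == 11             # load ~0.45: no rehash needed
assert maybe_grow(9, 11) == 23             # load ~0.82: grow to next prime >= 22
```

Note the division is done in floating point; the C-style pseudocode above would need a cast, since integer N/M truncates to 0 whenever N < M.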

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course, the obvious solution is to maintain a running hash set of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (ie. you can put them in a HashSet if you want to). Typically they will be strings, or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
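A sketch of Linear Counting per the formula above (the function name is illustrative): hash each element to one position of an m-bit vector, like a one-hash Bloom filter, then estimate D = -total_bits * ln(unset_bits / total_bits).

```python
import math

def linear_count(elements, m=4096):
    """Estimate the number of distinct elements with an m-bit vector."""
    bits = [False] * m
    for e in elements:
        bits[hash(e) % m] = True     # a single hash function suffices
    unset = bits.count(False)
    if unset == 0:
        return float("inf")          # bit vector saturated: choose a larger m
    return -m * math.log(unset / m)

# 100,000 elements, but only 1,000 distinct values:
data = [f"user-{i % 1000}" for i in range(100_000)]
estimate = linear_count(data)
assert abs(estimate - 1000) / 1000 < 0.10   # typically within a few percent
```

The log correction accounts for distinct elements that happened to land on the same bit; the state is just m bits, regardless of how many elements are scanned.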
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
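A sketch of this hash-range sampling idea (the function name and the choice K = 4 are illustrative): keep only elements whose 32-bit hash falls in a fixed 1/2^K slice, count distinct survivors exactly, and scale back up by 2^K.

```python
K = 4                                    # keep hashes whose top 4 bits are zero
MASK32 = (1 << 32) - 1

def sampled_distinct_estimate(elements):
    """Estimate distinct count by exactly counting a 1/2^K hash-range sample."""
    kept = set()
    for e in elements:
        h = hash(e) & MASK32             # fold Python's hash to 32 bits
        if h >> (32 - K) == 0:           # top K bits zero: ~1/16 of all hashes
            kept.add(e)
        # everything else is discarded, so state stays ~1/16 of the exact set
    return len(kept) * (1 << K)

data = [f"item-{i % 5000}" for i in range(200_000)]
estimate = sampled_distinct_estimate(data)
assert abs(estimate - 5000) / 5000 < 0.25   # a rough estimate; variance is real
```

This trades memory for accuracy: larger K means less state but a noisier estimate, since the survivor count follows a binomial distribution.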
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.
