Binary-search tree vs hash table in terms of memory - algorithm

I have heard different things on the internet including someone clearly stating that BST requires less memory compared to hast tables, because hash tables acquire more memory at once than they need at the moment.
Can one tell me the pros and cons of each structure compared to the other only only it terms of memory.

A binary search tree is nothing but a linked list with 2 pointers per node. The memory required is of the order O(n) ,i.e., same as the number of elements stored in it.
A hash map on the other hand, is generally implemented as an array. So, there is some unused space on it always. Have a read here about how a hash map is implemented in java.: http://java-performance.info/memory-consumption-of-java-data-types-2/
HashMap is built on top of the array of Map.Entry objects. The
implementation ensures that this array length is always equal to at
least max( size, capacity ) / load_factor. Default load factor for
HashMap is 0.75 and default capacity is 16. Load factor specifies
which part of an array could be used for storage and is a value
between 0 and 1. The higher is the load factor, the less space is
being wasted, but HashMap starts to work slower due to increased rate
of collisions. The smaller if the load factor, the more memory is
wasted, but the performance of a HashMap is increasing due to smaller
possibility of collisions.

Related

Use of universal hashing

I'm trying to understand the usefullness of universal hashing over normal hashing, other than the function is randomly produced everytime, reading Cormen's book.
From what i understand in universal hashing we choose the function to be
H(x)=[(ax+b)mod p]mod m
with p being a prime number larger than all the keys, m the size of the data table, and a,b random numbers.
So for example if i want to read the ID of 80 people, and each ID has a value between [0,200], then m would be 80 and p would be 211(next prime number). Right?
I could use the function lets say
H(x)=[(100x+50)mod 211]mod 80
But why would this help? There is a high chance that i'm going to end up having a lot of empty slots of the table, taking space without reason. Wouldn't it be more usefull to lower the number m in order to get a smaller table so space isn't used wtihout reason?
Any help appreciated
I think the best way to answer your question is to abstract away from the particulars of the formula that you're using to compute hash codes and to think more about, generally, what the impact is of changing the size of a hash table.
The parameter m that you're considering tuning adjusts how many slots are in your hash table. Let's imagine that you're planning on dropping n items into your hash table. The ratio n / m is called the load factor of the hash table and is typically denoted by the letter α.
If you have a table with a high load factor (large α, small m), then you'll have less wasted space in the table. However, you'll also increase the cost of doing a lookup, since with lots of objects distributed into a small space you're likely to get a bunch of collisions that will take time to resolve.
On the other hand, if you have a table with a low load factor (small α, large m), then you decrease the likelihood of collisions and therefore will improve the cost of performing lookups. However, if α gets too small - say, you have 1,000 slots per element actually stored - then you'll have a lot of wasted space.
Part of the engineering aspect of crafting a good hash table is figuring out how to draw the balance between these two options. The best way to see what works and what doesn't is to pull out a profiler and measure how changes to α change your runtime.

Hash table - why is it faster than arrays?

In cases where I have a key for each element and I don't know the index of the element into an array, hashtables perform better than arrays (O(1) vs O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
In cases where I have a key for each element and I don't know the
index of the element into an array, hashtables perform better than
arrays (O(1) vs O(n)).
The hash table search performs O(1) in the average case. In the worst case, the hash table search performs O(n): when you have collisions and the hash function always returns the same slot. One may think "this is a remote situation," but a good analysis should consider it. In this case you should iterate through all the elements like in an array or linked lists (O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash..
shouldn't the algorithm compare this hash against every element's
hash? I think there's some trick behind the memory disposition, isn't
it?
You have a key, You hash it.. you have the hash: the index of the hash table where the element is present (if it has been located before). At this point you can access the hash table record in O(1). If the load factor is small, it's unlikely to see more than one element there. So, the first element you see should be the element you are looking for. Otherwise, if you have more than one element you must compare the elements you will find in the position with the element you are looking for. In this case you have O(1) + O(number_of_elements).
In the average case, the hash table search complexity is O(1) + O(load_factor) = O(1 + load_factor).
Remember, load_factor = n in the worst case. So, the search complexity is O(n) in the worst case.
I don't know what you mean with "trick behind the memory disposition". Under some points of view, the hash table (with its structure and collisions resolution by chaining) can be considered a "smart trick".
Of course, the hash table analysis results can be proven by math.
With arrays: if you know the value, you have to search on average half the values (unless sorted) to find its location.
With hashes: the location is generated based on the value. So, given that value again, you can calculate the same hash you calculated when inserting. Sometimes, more than 1 value results in the same hash, so in practice each "location" is itself an array (or linked list) of all the values that hash to that location. In this case, only this much smaller (unless it's a bad hash) array needs to be searched.
Hash tables are a bit more complex. They put elements in different buckets based on their hash % some value. In an ideal situation, each bucket holds very few items and there aren't many empty buckets.
Once you know the key, you compute the hash. Based on the hash, you know which bucket to look for. And as stated above, the number of items in each bucket should be relatively small.
Hash tables are doing a lot of magic internally to make sure buckets are as small as possible while not consuming too much memory for empty buckets. Also, much depends on the quality of the key -> hash function.
Wikipedia provides very comprehensive description of hash table.
A Hash Table will not have to compare every element in the Hash. It will calculate the hashcode according to the key. For example, if the key is 4, then hashcode may be - 4*x*y. Now the pointer knows exactly which element to pick.
Whereas if it has been an array, it will have to traverse through the whole array to search for this element.
Why is [it] that [hashtables perform lookups by key better than arrays (O(1) vs O(n))]? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
Once you have the hash, it lets you calculate an "ideal" or expected location in the array of buckets: commonly:
ideal bucket = hash % num_buckets
The problem is then that another value may have already hashed to that bucket, in which case the hash table implementation has two main choice:
1) try another bucket
2) let several distinct values "belong" to one bucket, perhaps by making the bucket hold a pointer into a linked list of values
For implementation 1, known as open addressing or closed hashing, you jump around other buckets: if you find your value, great; if you find a never-used bucket, then you can store your value in there if inserting, or you know you'll never find your value when searching. There's a potential for the searching to be even worse than O(n) if the way you traverse alternative buckets ends up searching the same bucket multiple times; for example, if you use quadratic probing you try the ideal bucket index +1, then +4, then +9, then +16 and so on - but you must avoid out-of-bounds bucket access using e.g. % num_buckets, so if there are say 12 buckets then ideal+4 and ideal+16 search the same bucket. It can be expensive to track which buckets have been searched, so it can be hard to know when to give up too: the implementation can be optimistic and assume it will always find either the value or an unused bucket (risking spinning forever), it can have a counter and after a threshold of tries either give up or start a linear bucket-by-bucket search.
For implementation 2, known as closed addressing or separate chaining, you have to search inside the container/data-structure of values that all hashed to the ideal bucket. How efficient this is depends on the type of container used. It's generally expected that the number of elements colliding at one bucket will be small, which is true of a good hash function with non-adversarial inputs, and typically true enough of even a mediocre hash function especially with a prime number of buckets. So, a linked list or contiguous array is often used, despite the O(n) search properties: linked lists are simple to implement and operate on, and arrays pack the data together for better memory cache locality and access speed. The worst possible case though is that every value in your table hashed to the same bucket, and the container at that bucket now holds all the values: your entire hash table is then only as efficient as the bucket's container. Some Java hash table implementations have started using binary trees if the number of elements hashing to the same buckets passes a threshold, to make sure complexity is never worse than O(log2n).
Python hashes are an example of 1 = open addressing = closed hashing. C++ std::unordered_set is an example of closed addressing = separate chaining.
The purpose of hashing is to produce an index into the underlying array, which enables you to jump straight to the element in question. This is usually accomplished by dividing the hash by the size of the array and taking the remainder index = hash%capacity.
The type/size of the hash is typically that of the smallest integer large enough to index all of RAM. On a 32 bit system this is a 32 bit integer. On a 64 bit system this is a 64 bit integer. In C++ this corresponds to unsigned int and unsigned long long respectively. To be pedantic C++ technically specifies minimum sizes for its primitives i.e. at least 32 bits and at least 64 bits, but that's beside the point. For the sake of making code portable C++ also provides a size_t primative which corresponds to the appropriate unsigned integer. You'll see that type a lot in for loops which index into arrays, in well written code. In the case of a language like Python the integer primitive grows to whatever size it needs to be. This is typically implemented in the standard libraries of other languages under the name "Big Integer". To deal with this the Python programming language simply truncates whatever value you return from the __hash__() method down to the appropriate size.
On this score I think it's worth giving a word to the wise. The result of arithmetic is the same regardless of whether you compute the remainder at the end or at each step along the way. Truncation is equivalent to computing the remainder modulo 2^n where n is the number of bits you leave intact. Now you might think that computing the remainder at each step would be foolish due to the fact that you're incurring an extra computation at every step along the way. However this is not the case for two reasons. First, computationally speaking, truncation is extraordinarily cheap, far cheaper than generalized division. Second, and this is the real reason as the first is insufficient, and the claim would generally hold even in its absence, taking the remainder at each step keeps the number (relatively) small. So instead of something like product = 31*product + hash(array[index]), you'll want something like product = hash(31*product + hash(array[index])). The primary purpose of the inner hash() call is to take something which might not be a number and turn it into one, where as the primary purpose of the outer hash() call is to take a potentially oversized number and truncate it. Lastly I'll note that in languages like C++ where integer primitives have a fixed size this truncation step is automatically performed after every operation.
Now for the elephant in the room. You've probably realized that hash codes being generally speaking smaller than the objects they correspond to, not to mention that the indices derived from them are again generally speaking even smaller still, it's entirely possible for two objects to hash to the same index. This is called a hash collision. Data structures backed by a hash table like Python's set or dict or C++'s std::unordered_set or std::unordered_map primarily handle this in one of two ways. The first is called separate chaining, and the second is called open addressing. In separate chaining the array functioning as the hash table is itself an array of lists (or in some cases where the developer feels like getting fancy, some other data structure like a binary search tree), and every time an element hashes to a given index it gets added to the corresponding list. In open addressing if an element hashes to an index which is already occupied the data structure probes over to the next index (or in some cases where the developer feels like getting fancy, an index defined by some other function as is the case in quadratic probing) and so on until it finds an empty slot, of course wrapping around when it reaches the end of the array.
Next a word about load factor. There is of course an inherent space/time trade off when it comes to increasing or decreasing the load factor. The higher the load factor the less wasted space the table consumes; however this comes at the expense of increasing the likelihood of performance degrading collisions. Generally speaking hash tables implemented with separate chaining are less sensitive to load factor than those implemented with open addressing. This is due to the phenomenon known as clustering where by clusters in an open addressed hash table tend to become larger and larger in a positive feed back loop as a result of the fact that the larger they become the more likely they are to contain the preferred index of a newly added element. This is actually the reason why the afore mentioned quadratic probing scheme, which progressively increases the jump distance, is often preferred. In the extreme case of load factors greater than 1, open addressing can't work at all as the number of elements exceeds the available space. That being said load factors greater than 1 are exceedingly rare in general. At time of writing Python's set and dict classes employ a max load factor of 2/3 where as Java's java.util.HashSet and java.util.HashMap use 3/4 with C++'s std::unordered_set and std::unordered_map taking the cake with a max load factor of 1. Unsurprisingly Python's hash table backed data structures handle collisions with open addressing where as their Java and C++ counterparts do it with separate chaining.
Last a comment about table size. When the max load factor is exceeded, the size of the hash table must of course be grown. Due to the fact that this requires that every element there in be reindexed, it's highly inefficient to grow the table by a fixed amount. To do so would incur order size operations every time a new element is added. The standard fix for this problem is the same as that employed by most dynamic array implementations. At every point where we need to grow the table we simply increase its size by its current size. This unsurprisingly is known as table doubling.
I think you answered your own question there. "shouldn't the algorithm compare this hash against every element's hash". That's kind of what it does when it doesn't know the index location of what you're searching for. It compares each element to find the one you're looking for:
E.g. Let's say you're looking for an item called "Car" inside an array of strings. You need to go through every item and check item.Hash() == "Car".Hash() to find out that that is the item you're looking for. Obviously it doesn't use the hash when searching always, but the example stands. Then you have a hash table. What a hash table does is it creates a sparse array, or sometimes array of buckets as the guy above mentioned. Then it uses the "Car".Hash() to deduce where in the sparse array your "Car" item is actually. This means that it doesn't have to search through the entire array to find your item.

When performing a looking in a Hash data structure, why is it a fast operation?

When you perform a lookup in a Hashtable, the key is converted into a hash. Now using that hashed value, does it directly map to a memory location, or are there more steps?
Just trying to understand things a little more under the covers.
And what other key based lookup data structures are there and why are they slower than a hash?
Hash tables are not necessarily fast. People consider hash tables a "fast" data structure because the retrieval time does not depend on the number of entries in the table. That is, retrieval from a hash table is an "O(1)" (constant time) operation.
Retrieval time from other data structures can vary depending on the number of entries in the map. For example, for a balanced binary tree, the retrieval time scales with the base-2 logarithm of its size; it's "O(log n)".
However, actually computing a hash code for an single object, in practice, often takes many times longer than comparing that type of object to others. So, you could find that for a small map, something like a red-black tree is faster than a hash table. As the maps grow, the hash table retrieval time will stay constant, and the red-black tree time will slowly grow until it is slower than a hash table.
A Hash (aka Hash Table) implies more than a Map (or Associative Array).
In particular, a Map (or Associative Array) is an Abstract Data Type:
...an associative array (also called a map or a dictionary) is an abstract data type composed of a collection of (key,value) pairs, such that each possible key appears at most once in the collection.
While a Hash table is an implementation of a Map (although it could also be considered an ADT that includes a "cost"):
...a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys [...], to their associated values [...]. Thus, a hash table implements an associative array [or, map].
Thus it is an implementation-detail leaking out: a HashMap is a Map that uses a Hash-table algorithm and thus provides the expected performance characteristics of such an algorithm. The "leaking" of the implementation detail is good in this case because it provides some basic [expected] bound guarantees, such as an [expected] O(1) -- or constant time -- get.
Hint: a hash function is important part of a hash-table algorithm and sets a HashMap apart from other Map implementations such as a TreeMap (that uses a red-black tree) or a ConcurrentSkipListMap (that uses a skip list).
Another form of a Map is an Association List (or "alist", which is common in LISP programming). While association lists are O(n) for get, they can have much less overhead for small n, which brings up another point: Big-Oh describes limiting behavior (as n -> infinity) and does not address the relative performance for a particular [smallish] n:
A description of a function in terms of big O notation usually only provides an upper bound on the growth rate of the function.
Please refer to the links above (including the javadoc) for the basic characteristics and different implementation strategies -- anything else I say here is already said there (or in other SO answers). If there are specific questions, open a new SO post if warranted :-)
Happy coding.
Here is the source for the HashMap implementation that is in OpenJDK 7. Looking at the put method shows that it a simple chaining as a collision-resolution method and that the underlying "bucket array" will grow by a factor of 2 each resize (which is triggered when the load factor is reached). The load factor and amortized performance expectations -- including those of the hashing function used -- are covered in the class documentation.
"Key-based" implies a mapping of some sort. You can implement one in a linked list or array, and it would probably be pretty slow (O(n)) for lookups or deletes.
Hashing takes constant time. In the more sophisticated implementations it will typically map to a memory address which stores a list of pointers back at the key object in addition to the mapped object or value, for collision detection and resolution.
The expensive operations are following the list of the "hashed to this location" objects to figure out which one you are really looking for. In theory, this could be O(n) for each lookup! However, if we use a larger space the probability of this occurring is reduced (although a few collisions is almost inevitable per the Birthday Problem) drastically.
If you start getting over a certain threshold of collisions, most implementations will expand the size of the hash table, which also takes another O(n) time. However, this will on average take place no more often than every 1/n inserts. So we have amortized constant time.

Why are hash table expansions usually done by doubling the size?

I've done a little research on hash tables, and I keep running across the rule of thumb that when there are a certain number of entries (either max or via a load factor like 75%) the hash table should be expanded.
Almost always, the recommendation is to double (or double plus 1, i.e., 2n+1) the size of the hash table. However, I haven't been able to find a good reason for this.
Why double the size, rather than, say, increasing it 25%, or increasing it to the size of the next prime number, or next k prime numbers (e.g., three)?
I already know that it's often a good idea to choose an initial hash table size which is a prime number, at least if your hash function uses modulus such as universal hashing. And I know that's why it's usually recommended to do 2n+1 instead of 2n (e.g., http://www.concentric.net/~Ttwang/tech/hashsize.htm)
However as I said, I haven't seen any real explanation for why doubling or doubling-plus-one is actually a good choice rather than some other method of choosing a size for the new hash table.
(And yes I've read the Wikipedia article on hash tables :) http://en.wikipedia.org/wiki/Hash_table
Hash-tables could not claim "amortized constant time insertion" if, for instance, the resizing was by a constant increment. In that case the cost of resizing (which grows with the size of the hash-table) would make the cost of one insertion linear in the total number of elements to insert. Because resizing becomes more and more expensive with the size of the table, it has to happen "less and less often" to keep the amortized cost of insertion constant.
Most implementations allow the average bucket occupation to grow to until a bound fixed in advance before resizing (anywhere between 0.5 and 3, which are all acceptable values). With this convention, just after resizing the average bucket occupation becomes half that bound. Resizing by doubling keeps the average bucket occupation in a band of width *2.
Sub-note: because of statistical clustering, you have to take an average bucket occupation as low as 0.5 if you want many buckets to have at most one elements (maximum speed for finding ignoring the complex effects of cache size), or as high as 3 if you want a minimum number of empty buckets (that correspond to wasted space).
I had read a very interesting discussion on growth strategy on this very site... just cannot find it again.
While 2 is commonly used, it's been demonstrated that it was not the best value. One often cited problem is that it does not cope well with allocators schemes (which often allocate power of twos blocks) since it would always require a reallocation while a smaller number might in fact be reallocated in the same block (simulating in-place growth) and thus being faster.
Thus, for example, the VC++ Standard Library uses a growth factor of 1.5 (ideally should be the golden number if a first-fit memory allocation strategy is being used) after an extensive discussion on the mailing list. The reasoning is explained here:
I'd be interested if any other vector implementations uses a growth factor other than 2, and I'd also like to know whether VC7 uses 1.5 or 2 (since I don't have that compiler here).
There is a technical reason to prefer 1.5 to 2 -- more specifically, to prefer values less than 1+sqrt(5)/2.
Suppose you are using a first-fit memory allocator, and you're progressively appending to a vector. Then each time you reallocate, you allocate new memory, copy the elements, then free the old memory. That leaves a gap, and it would be nice to be able to use that memory eventually. If the vector grows too rapidly, it will always be too big for the available memory.
It turns out that if the growth factor is >= 1+sqrt(5)/2, the new memory will always be too big for the hole that has been left sofar; if it is < 1+sqrt(5)/2, the new memory will eventually fit. So 1.5 is small enough to allow the memory to be recycled.
Surely, if the growth factor is >= 2 the new memory will always be too big for the hole that has been left so far; if it is < 2, the new memory will eventually fit. Presumably the reason for (1+sqrt(5))/2 is...
Initial allocation is s.
The first resize is k*s.
The second resize is k*k*s, which will fit the hole iff k*k*s <= k*s+s, i.e. iff k <= (1+sqrt(5))/2
...the hole can be recycled asap.
It could, by storing its previous size, grow fibonaccily.
Of course, it should be tailored to the memory allocation strategy.
One reason for doubling size that is specific to hash containers is that if the container capacity is always a power of two, then instead of using a general purpose modulo for converting a hash to an offset, the same result can be achieved with bit shifting. Modulo is a slow operation for the same reasons that integer division is slow. (Whether integer division is "slow" in the context of whatever else is going in a program is of course case dependent but it's certainly slower than other basic integer arithmetic.)
Doubling the memory when expanding any type of collection is an oftenly used strategy to prevent memory fragmentation and not having to reallocate too often. As you point out there might be reasons to have a prime number of elements. When knowing your application and your data, you might also be able to predict the growth of the number of elements and thus choose another (larger or smaller) growth factor than doubling.
The general implementations found in libraries are exactly that: General implementations. They have to focus on being a reasonable choice in a variety of different situations. When knowing the context, it is almost always possible to write a more specialized and more efficient implementation.
If you don't know how many objects you will end up using (lets say N),
by doubling the space you'll do log2N reallocations at most.
I assume that if you choose a proper initial "n", you increase the odds
that 2*n + 1 will produce prime numbers in subsequent reallocations.
The same reasoning applies for doubling the size as for vector/ArrayList implementations, see this answer.

When should I do rehashing of entire hash table?

How do I decide when should I do rehashing of entire hash table?
This depends a great deal on how you're resolving collisions. If you user linear probing, performance usually starts to drop pretty badly with a load factor much higher than 60% or so. If you use double hashing, a load factor of 80-85% is usually pretty reasonable. If you use collision chaining, performance usually remains reasonable with load factors up to around 150% or or more.
I've sometimes even created a hash table with balanced trees for collision resolution. In this case, you can almost forget about re-hashing -- the performance doesn't start to deteriorate noticeably until the number of items exceeds the table size by at least a couple orders of magnitude.
Generally, you have a hash table containing N elements distributed in an array of M slots.
There is a percent value (called "growthFactor") defined by the user when instantiating the hash table that is used in this way:
if (growthRatio < (N/M))
Rehash();
the rehash means your array of M slots should be resized to contain more elements (a prime number greater than the current size (or 2x greater) is ideal) and that your elements must be distributed in the new larger array.
Such value should set to something between 0.6 and 0.8.
A rule of thumb is to resize the table once it's 3/4 full.

Resources