In cases where I have a key for each element and I don't know the index of the element in the array, hashtables perform better than arrays (O(1) vs O(n)).
Why is that? I mean: I have a key, I hash it... I have the hash... shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't there?
In cases where I have a key for each element and I don't know the index of the element in the array, hashtables perform better than arrays (O(1) vs O(n)).
The hash table search performs in O(1) in the average case. In the worst case, it performs in O(n): this happens when you have collisions and the hash function keeps returning the same slot. One may think "this is a remote situation," but a good analysis should consider it. In that case you must iterate through all the elements, as in an array or a linked list (O(n)).
Why is that? I mean: I have a key, I hash it... I have the hash... shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't there?
You have a key, you hash it... you have the hash: the index into the hash table where the element lives (if it has been inserted before). At this point you can access the hash table record in O(1). If the load factor is small, it's unlikely that more than one element is there, so the first element you see should be the one you are looking for. Otherwise, if there is more than one element in that position, you must compare each of them against the element you are looking for. In that case you have O(1) + O(number_of_elements).
In the average case, the hash table search complexity is O(1) + O(load_factor) = O(1 + load_factor).
Remember that load_factor can be as large as n in the worst case (every element chained into the same slot), so the search complexity is O(n) in the worst case.
I don't know what you mean by "trick behind the memory disposition". From some points of view, the hash table (with its structure and collision resolution by chaining) can be considered a "smart trick".
Of course, the hash table analysis results can be proven by math.
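To make the mechanics concrete, here is a minimal sketch (in Java, with illustrative names, not any particular library's implementation) of a lookup in a chained hash table: one array index in O(1), then a scan of whatever short chain lives in that bucket.

    import java.util.LinkedList;

    // Minimal chained hash table sketch: lookup cost is one array index (O(1))
    // plus a scan of the chain in that bucket (O(load_factor) on average).
    class ChainedTable<K, V> {
        private static class Entry<K, V> {
            final K key; V value;
            Entry(K key, V value) { this.key = key; this.value = value; }
        }

        private final LinkedList<Entry<K, V>>[] buckets;

        @SuppressWarnings("unchecked")
        ChainedTable(int capacity) {
            buckets = new LinkedList[capacity];
            for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
        }

        private int slot(K key) {
            // Map the key's hash code onto a bucket index; mask off the sign bit first.
            return (key.hashCode() & 0x7fffffff) % buckets.length;
        }

        void put(K key, V value) {
            for (Entry<K, V> e : buckets[slot(key)]) {
                if (e.key.equals(key)) { e.value = value; return; }
            }
            buckets[slot(key)].add(new Entry<>(key, value));
        }

        V get(K key) {
            for (Entry<K, V> e : buckets[slot(key)]) {  // usually 0 or 1 entries here
                if (e.key.equals(key)) return e.value;
            }
            return null;                                // not present
        }
    }

With a sensible load factor the chain scanned in get is tiny, which is where the O(1 + load_factor) figure above comes from; if every key landed in the same bucket, the very same loop would degrade to O(n).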
With arrays: if you know the value, you have to search on average half the values (unless sorted) to find its location.
With hashes: the location is generated based on the value. So, given that value again, you can calculate the same hash you calculated when inserting. Sometimes, more than 1 value results in the same hash, so in practice each "location" is itself an array (or linked list) of all the values that hash to that location. In this case, only this much smaller (unless it's a bad hash) array needs to be searched.
Hash tables are a bit more complex. They put elements in different buckets based on their hash % some value. In an ideal situation, each bucket holds very few items and there aren't many empty buckets.
Once you know the key, you compute the hash. Based on the hash, you know which bucket to look for. And as stated above, the number of items in each bucket should be relatively small.
Hash tables are doing a lot of magic internally to make sure buckets are as small as possible while not consuming too much memory for empty buckets. Also, much depends on the quality of the key -> hash function.
Wikipedia provides a very comprehensive description of hash tables.
A hash table does not have to compare every element in the table. It calculates a hash code from the key; for example, if the key is 4, the hash code might be something like 4*x*y. That code tells it exactly which slot to look in to pick the element.
Whereas if it had been an array, it would have to traverse the whole array to find this element.
Why is [it] that [hashtables perform lookups by key better than arrays (O(1) vs O(n))]? I mean: I have a key, I hash it... I have the hash... shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't there?
Once you have the hash, it lets you calculate an "ideal" or expected location in the array of buckets, commonly:
ideal bucket = hash % num_buckets
The problem is that another value may have already hashed to that bucket, in which case the hash table implementation has two main choices:
1) try another bucket
2) let several distinct values "belong" to one bucket, perhaps by making the bucket hold a pointer into a linked list of values
For implementation 1, known as open addressing or closed hashing, you jump around to other buckets: if you find your value, great; if you find a never-used bucket, then you can store your value there if inserting, or you know you'll never find your value when searching. There's a potential for the searching to be even worse than O(n) if the way you traverse alternative buckets ends up searching the same bucket multiple times; for example, with quadratic probing you try the ideal bucket index +1, then +4, then +9, then +16 and so on, but you must avoid out-of-bounds bucket access using e.g. % num_buckets, so if there are say 12 buckets then ideal+4 and ideal+16 search the same bucket. It can be expensive to track which buckets have been searched, so it can be hard to know when to give up too: the implementation can be optimistic and assume it will always find either the value or an unused bucket (risking spinning forever), or it can keep a counter and, after a threshold of tries, either give up or fall back to a linear bucket-by-bucket search.
For implementation 2, known as closed addressing or separate chaining, you have to search inside the container/data structure of values that all hashed to the ideal bucket. How efficient this is depends on the type of container used. It's generally expected that the number of elements colliding at one bucket will be small, which is true of a good hash function with non-adversarial inputs, and typically true enough of even a mediocre hash function, especially with a prime number of buckets. So a linked list or contiguous array is often used, despite the O(n) search properties: linked lists are simple to implement and operate on, and arrays pack the data together for better memory cache locality and access speed. The worst possible case, though, is that every value in your table hashed to the same bucket, and the container at that bucket now holds all the values: your entire hash table is then only as efficient as that bucket's container. Some Java hash table implementations have started using binary trees if the number of elements hashing to the same bucket passes a threshold, to make sure complexity is never worse than O(log2 n).
Python's dicts are an example of (1), open addressing / closed hashing. C++'s std::unordered_set is an example of (2), closed addressing / separate chaining.
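To illustrate option (1), here is a small sketch of linear-probe insertion and lookup (Java, illustrative names, no deletion or resizing, so it also shows the "could spin forever if the table fills up" risk mentioned above):

    // Open addressing (linear probing) sketch: on a collision we step to the
    // next slot, wrapping around, until we find the key or an empty slot.
    class ProbingTable {
        private final String[] keys;
        private final int[] values;

        ProbingTable(int capacity) {
            keys = new String[capacity];
            values = new int[capacity];
        }

        private int ideal(String key) {
            return (key.hashCode() & 0x7fffffff) % keys.length;
        }

        void put(String key, int value) {            // assumes the table never fills up
            int i = ideal(key);
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;           // probe the next bucket
            }
            keys[i] = key;
            values[i] = value;
        }

        Integer get(String key) {
            int i = ideal(key);
            while (keys[i] != null) {                // an empty slot means "not here"
                if (keys[i].equals(key)) return values[i];
                i = (i + 1) % keys.length;
            }
            return null;
        }
    }

Note the optimism discussed above: if this table ever filled up completely, get on an absent key would loop forever, which is exactly why real implementations watch the load factor and resize long before that point.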
The purpose of hashing is to produce an index into the underlying array, which enables you to jump straight to the element in question. This is usually accomplished by dividing the hash by the size of the array and taking the remainder: index = hash % capacity.
The type/size of the hash is typically that of the smallest integer large enough to index all of RAM. On a 32-bit system this is a 32-bit integer; on a 64-bit system this is a 64-bit integer. In C++ this corresponds to unsigned int and unsigned long long respectively. To be pedantic, C++ technically specifies minimum sizes for its primitives, i.e. at least 32 bits and at least 64 bits, but that's beside the point. For the sake of making code portable C++ also provides a size_t primitive which corresponds to the appropriate unsigned integer; in well-written code you'll see that type a lot in for loops that index into arrays. In the case of a language like Python the integer primitive grows to whatever size it needs to be. This is typically implemented in the standard libraries of other languages under the name "big integer". To deal with this, the Python programming language simply truncates whatever value you return from the __hash__() method down to the appropriate size.
On this score I think it's worth giving a word to the wise. The result of the arithmetic is the same regardless of whether you compute the remainder at the end or at each step along the way. Truncation is equivalent to computing the remainder modulo 2^n, where n is the number of bits you leave intact. Now you might think that computing the remainder at each step would be foolish, since you're incurring an extra computation at every step along the way. However, this is not the case, for two reasons. First, computationally speaking, truncation is extraordinarily cheap, far cheaper than generalized division. Second, and this is the real reason (the first alone would not justify it), taking the remainder at each step keeps the number (relatively) small. So instead of something like product = 31*product + hash(array[index]), you'll want something like product = hash(31*product + hash(array[index])). The primary purpose of the inner hash() call is to take something which might not be a number and turn it into one, whereas the primary purpose of the outer hash() call is to take a potentially oversized number and truncate it. Lastly I'll note that in languages like C++, where integer primitives have a fixed size, this truncation step is automatically performed after every operation.
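A minimal Java sketch of the same idea, relying on the fact that Java's 32-bit int arithmetic wraps (i.e. truncates modulo 2^32) after every operation, so the running value never grows:

    // Polynomial hash over an array; Java int arithmetic wraps modulo 2^32,
    // so the running product is truncated automatically at every step.
    class PolyHash {
        static int arrayHash(Object[] array) {
            int product = 1;
            for (Object element : array) {
                int h = (element == null) ? 0 : element.hashCode(); // turn the element into a number
                product = 31 * product + h;                         // overflow acts as truncation
            }
            return product;
        }
    }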
Now for the elephant in the room. You've probably realized that hash codes are, generally speaking, smaller than the objects they correspond to, not to mention that the indices derived from them are, generally speaking, smaller still, so it's entirely possible for two objects to hash to the same index. This is called a hash collision. Data structures backed by a hash table, like Python's set or dict or C++'s std::unordered_set or std::unordered_map, primarily handle this in one of two ways. The first is called separate chaining, and the second is called open addressing. In separate chaining the array functioning as the hash table is itself an array of lists (or, in some cases where the developer feels like getting fancy, some other data structure like a binary search tree), and every time an element hashes to a given index it gets added to the corresponding list. In open addressing, if an element hashes to an index which is already occupied, the data structure probes over to the next index (or, in some cases where the developer feels like getting fancy, an index defined by some other function, as is the case in quadratic probing) and so on until it finds an empty slot, of course wrapping around when it reaches the end of the array.
Next, a word about load factor. There is of course an inherent space/time trade-off when it comes to increasing or decreasing the load factor. The higher the load factor, the less wasted space the table consumes; however, this comes at the expense of increasing the likelihood of performance-degrading collisions. Generally speaking, hash tables implemented with separate chaining are less sensitive to load factor than those implemented with open addressing. This is due to the phenomenon known as clustering, whereby clusters in an open-addressed hash table tend to become larger and larger in a positive feedback loop: the larger they become, the more likely they are to contain the preferred index of a newly added element. This is actually the reason why the aforementioned quadratic probing scheme, which progressively increases the jump distance, is often preferred. In the extreme case of load factors greater than 1, open addressing can't work at all, as the number of elements exceeds the available space. That being said, load factors greater than 1 are exceedingly rare in general. At the time of writing, Python's set and dict classes employ a max load factor of 2/3, whereas Java's java.util.HashSet and java.util.HashMap use 3/4, with C++'s std::unordered_set and std::unordered_map taking the cake with a max load factor of 1. Unsurprisingly, Python's hash table backed data structures handle collisions with open addressing, whereas their Java and C++ counterparts do it with separate chaining.
Last, a comment about table size. When the max load factor is exceeded, the size of the hash table must of course be grown. Because this requires that every element therein be re-indexed, it's highly inefficient to grow the table by a fixed amount; to do so would incur on the order of size operations every time a new element is added. The standard fix for this problem is the same as that employed by most dynamic array implementations: at every point where we need to grow the table, we simply increase its size by its current size. This, unsurprisingly, is known as table doubling.
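A sketch of that resize step with separate chaining (illustrative types and names, keys only for brevity; a real implementation would do this inside the map and track the element count incrementally):

    import java.util.ArrayList;
    import java.util.List;

    // Table doubling sketch: when the load factor is exceeded, allocate a table
    // twice as large and re-insert (re-index) every element into it.
    class Rehash {
        static List<List<String>> grow(List<List<String>> oldBuckets) {
            int newCapacity = oldBuckets.size() * 2;         // double, don't add a constant
            List<List<String>> newBuckets = new ArrayList<>(newCapacity);
            for (int i = 0; i < newCapacity; i++) newBuckets.add(new ArrayList<>());

            for (List<String> chain : oldBuckets) {
                for (String key : chain) {
                    int slot = (key.hashCode() & 0x7fffffff) % newCapacity;
                    newBuckets.get(slot).add(key);           // every element is re-indexed: O(n)
                }
            }
            return newBuckets;
        }
    }

Because the capacity doubles each time, the O(n) re-indexing can only be triggered again after roughly n further insertions, so the cost averages out to O(1) per insert.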
I think you answered your own question there. "shouldn't the algorithm compare this hash against every element's hash". That's kind of what it does when it doesn't know the index location of what you're searching for. It compares each element to find the one you're looking for:
E.g. let's say you're looking for an item called "Car" inside an array of strings. You need to go through every item and check item.Hash() == "Car".Hash() to find out that it is the item you're looking for. Obviously it doesn't always use the hash when searching, but the example stands. Then you have a hash table. What a hash table does is create a sparse array, or sometimes an array of buckets as mentioned in another answer. Then it uses "Car".Hash() to deduce where in the sparse array your "Car" item actually is. This means that it doesn't have to search through the entire array to find your item.
Related
I need a data structure that stores tuples and would allow me to do a query like: given a tuple (x,y,z) of integers, find the next one (an upper bound for it). By that I mean considering the natural ordering (a,b,c)<=(d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the above query to happen within O(log n), where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys (assuming 32-bit ints) with (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for (if using C, there is even a library function to do this), and the next one is simply one position along. Worst-case performance is O(log N), and if you can do interpolation search (http://en.wikipedia.org/wiki/Interpolation_search) then you might even approach O(log log N).
The problem with packed binary keys is that it might be tricky to add new values, and you may need gyrations if you are going to exceed available memory. But it is fast: only a few random memory accesses on average.
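A sketch of the packed-key-plus-binary-search idea in Java; since three 32-bit values don't fit in one 64-bit long, it compares the triples lexicographically instead of physically packing them. Note this is lexicographic order, which is what a single packed key gives you, not the componentwise order stated in the question.

    import java.util.Arrays;
    import java.util.Comparator;

    // Binary search for the successor of a query tuple in a sorted array of
    // (a, b, c) triples, using lexicographic order (the order a packed key gives).
    class TupleSearch {
        static final Comparator<int[]> LEX = Comparator
                .<int[]>comparingInt(t -> t[0])
                .thenComparingInt(t -> t[1])
                .thenComparingInt(t -> t[2]);

        // Returns the smallest stored tuple strictly greater than query, or null.
        static int[] successor(int[][] sorted, int[] query) {
            int pos = Arrays.binarySearch(sorted, query, LEX);
            int next = (pos >= 0) ? pos + 1 : -pos - 1;   // insertion point if not found
            return next < sorted.length ? sorted[next] : null;
        }

        public static void main(String[] args) {
            int[][] tuples = { {1, 2, 3}, {1, 5, 0}, {2, 0, 0} };
            Arrays.sort(tuples, LEX);                     // keep the array sorted up front
            System.out.println(Arrays.toString(successor(tuples, new int[] {1, 2, 3}))); // [1, 5, 0]
        }
    }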
Alternatively, you could build a hash table by generating a key with a|b|c in some form, and then have the hash data point to a structure that contains the next value, whatever that might be. Possibly a little harder to create in the first place, as when generating the table you need to know the next value already.
The problems with the hash approach are that it will likely use more memory than the binary search method, and that performance is great until you get hash collisions, at which point it starts to drop off, although there are variations on this algorithm that help in some cases. The hash approach is possibly much easier for inserting new values.
I also see you had a similar question along these lines, so I guess the gist of what I am saying is: combine a, b, c to produce a single long key, and use that with binary search, a hash, or even a B-tree. If the length of the key is your problem (what language are you using?), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete it, so your question remains unanswered rather than carrying a useless answer.
When we have a hash table with chaining:
I am just wondering if maintaining the list at each key in order affects the running time for searching, inserting and deleting in the hash table?
In theory: yes, since in the average case you will only have to walk half the chain to find if an item is on the chain or not.
In practice, there is probably not much difference, since the chains are typically very short, and the increased code complexity would also cost some cycles, mainly in the "insert" case.
BTW: in most cases the number of slots is considerably smaller than the "keyspace" of the hash values. If you can afford the space, storing the hash values in the chain nodes will save recomputing the hash value on every hop, and will avoid most of the final compares. This of course is a space<-->time tradeoff. As in:
    struct hashnode **this;

    /* Walk the chain for this slot. Compare the stored hash values first and
       only fall back to the (possibly expensive) payload compare on a match. */
    for (this = &table[slot]; *this; this = &(*this)->link) {
        if ((*this)->hash != the_hash) continue;
        if (compare((*this)->payload, the_value)) continue;
        break;
    }
    /* At this point "this" points to the pointer that points to the wanted element,
       or to the NULL pointer where it should be inserted.
       For the sorted-list variant, you would instead break out of the loop
       when the compare function returns > 0, and handle that special case here. */
Hypothetically, you've chosen your hash algorithm and map size to mitigate the number of collisions you will get in the first place. At that point you should have a very small list (ideally one or two elements) at any position, so the extra effort of maintaining a sorted structure in the chain almost certainly costs more than simply iterating over the small number of items in that bucket.
Yes, of course. The usually-cited O(1) for a hashtable assumes perfect hashing - where no two distinct items resolve to the same hash.
In practice, that won't be the case. You'll always have (for a big enough data set) collisions. And collisions will mean more work at lookup time, regardless of whether you're using chaining or some other collision-resolution technique.
That's why it's very, very important to select a good hash function that is well designed/written, and a good match for the data you'll be using as the key for your hash table. Different types of data will hash better with different hash functions, in practice.
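As one concrete illustration of what a hash function looks like in code, here is a per-character FNV-1a variant in Java (a common general-purpose choice, shown only as an example; whether it is a good match still depends on your keys, and the canonical FNV-1a operates on bytes rather than chars):

    // FNV-1a style string hash: every character perturbs all bits of the
    // running value, which tends to spread similar keys across buckets.
    class Fnv {
        static int fnv1a(String key) {
            int hash = 0x811c9dc5;              // FNV offset basis (32-bit)
            for (int i = 0; i < key.length(); i++) {
                hash ^= key.charAt(i);
                hash *= 0x01000193;             // FNV prime (32-bit)
            }
            return hash;
        }
    }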
When you perform a lookup in a Hashtable, the key is converted into a hash. Now using that hashed value, does it directly map to a memory location, or are there more steps?
Just trying to understand things a little more under the covers.
And what other key based lookup data structures are there and why are they slower than a hash?
Hash tables are not necessarily fast. People consider hash tables a "fast" data structure because the retrieval time does not depend on the number of entries in the table. That is, retrieval from a hash table is an "O(1)" (constant time) operation.
Retrieval time from other data structures can vary depending on the number of entries in the map. For example, for a balanced binary tree, the retrieval time scales with the base-2 logarithm of its size; it's "O(log n)".
However, actually computing a hash code for a single object, in practice, often takes many times longer than comparing that type of object to others. So you could find that for a small map, something like a red-black tree is faster than a hash table. As the maps grow, the hash table retrieval time will stay constant, and the red-black tree time will slowly grow until it is slower than the hash table.
A Hash (aka Hash Table) implies more than a Map (or Associative Array).
In particular, a Map (or Associative Array) is an Abstract Data Type:
...an associative array (also called a map or a dictionary) is an abstract data type composed of a collection of (key,value) pairs, such that each possible key appears at most once in the collection.
While a Hash table is an implementation of a Map (although it could also be considered an ADT that includes a "cost"):
...a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys [...], to their associated values [...]. Thus, a hash table implements an associative array [or, map].
Thus it is an implementation-detail leaking out: a HashMap is a Map that uses a Hash-table algorithm and thus provides the expected performance characteristics of such an algorithm. The "leaking" of the implementation detail is good in this case because it provides some basic [expected] bound guarantees, such as an [expected] O(1) -- or constant time -- get.
Hint: a hash function is an important part of a hash-table algorithm and sets a HashMap apart from other Map implementations such as a TreeMap (which uses a red-black tree) or a ConcurrentSkipListMap (which uses a skip list).
Another form of a Map is an Association List (or "alist", which is common in LISP programming). While association lists are O(n) for get, they can have much less overhead for small n, which brings up another point: Big-Oh describes limiting behavior (as n -> infinity) and does not address the relative performance for a particular [smallish] n:
A description of a function in terms of big O notation usually only provides an upper bound on the growth rate of the function.
Please refer to the links above (including the javadoc) for the basic characteristics and different implementation strategies -- anything else I say here is already said there (or in other SO answers). If there are specific questions, open a new SO post if warranted :-)
Happy coding.
Here is the source for the HashMap implementation that is in OpenJDK 7. Looking at the put method shows that it uses simple chaining as its collision-resolution method and that the underlying "bucket array" grows by a factor of 2 on each resize (which is triggered when the load factor is reached). The load factor and amortized performance expectations -- including those of the hashing function used -- are covered in the class documentation.
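A small sketch of the ADT-versus-implementation distinction using the standard Java collections: both objects below satisfy the Map contract, but the hash-based one gives expected O(1) get while the tree-based one gives O(log n) and keeps its keys sorted.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    // Both maps honour the Map contract; only the implementation (and therefore
    // the performance profile and iteration order) differs.
    class MapDemo {
        public static void main(String[] args) {
            Map<String, Integer> hashed = new HashMap<>(); // hash table: expected O(1) get/put
            Map<String, Integer> sorted = new TreeMap<>(); // red-black tree: O(log n) get/put

            for (String fruit : new String[] { "banana", "apple", "cherry" }) {
                hashed.put(fruit, fruit.length());
                sorted.put(fruit, fruit.length());
            }

            System.out.println(hashed.keySet()); // order depends on the hash function and capacity
            System.out.println(sorted.keySet()); // [apple, banana, cherry] -- always sorted
        }
    }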
"Key-based" implies a mapping of some sort. You can implement one in a linked list or array, and it would probably be pretty slow (O(n)) for lookups or deletes.
Hashing takes constant time. In the more sophisticated implementations it will typically map to a memory address which stores a list of pointers back to the key objects, in addition to the mapped objects or values, for collision detection and resolution.
The expensive operation is following the list of objects that "hashed to this location" to figure out which one you are really looking for. In theory, this could be O(n) for each lookup! However, if we use a larger space the probability of this occurring is reduced drastically (although a few collisions are almost inevitable, per the birthday problem).
If you start getting over a certain threshold of collisions, most implementations will expand the size of the hash table, which also takes another O(n) time. However, with table doubling this happens at most roughly once every n inserts, so on average we have amortized constant time.
I would like to create a data structure or collection which will have O(1) complexity in adding, removing and calculating no. of elements. How am I supposed to start?
I have thought of a solution: I will use a Hashtable and for each key / value pair inserted, I will have only one hash code, that is: my hash code algorithm will generate a unique hash value every time, so the index at which the value is stored will be unique (i.e. no collisions).
Will that give me O(1) complexity?
Yes, that will work, but as you mentioned, your hashing function needs to be 100% unique. Any duplicates will result in your having to use some sort of conflict resolution. I would recommend chaining.
edit: HashMap.size() gives the element count in O(1).
edit 2: Response to the confusion Larry has caused =P
Yes, hashing is O(k) where k is the key length. Everyone can agree on that. However, if you do not have a perfect hash, you simply cannot get O(1) time. Your claim was that you do not need uniqueness to achieve O(1) deletion of a specific element. I guarantee you that is wrong.
Consider a worst-case scenario: every element hashes to the same thing. You end up with a single linked list, which, as everyone knows, does not have O(1) deletion. I would hope, as you mentioned, nobody is dumb enough to make a hash like this.
Point is, uniqueness of the hash is a prerequisite for O(1) runtime.
Even then, though, it is technically not O(1) Big-O efficiency. Only by using amortized analysis will you achieve constant-time efficiency in the worst case. As noted in Wikipedia's article on amortized analysis:
The basic idea is that a worst case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.
That is referring to the idea that resizing your hashtable (altering the state of your data structure) at certain load factors can ensure a smaller chance of collisions etc.
I hope this clears everything up.
Adding, removing and size (provided it is tracked separately, using a simple counter) can be provided by a linked list, unless you need to remove a specific item. You should be more specific about your requirements.
Doing a totally non-clashing hash function is quite tricky even when you know exactly the space of things being hashed, and it's impossible in general. It also depends deeply on the size of the array that you're hashing into. That is, you need to know exactly what you're doing to make that work.
But if you instead relax that a bit so that identical hash codes don't imply equality [1], then you can use the existing Java HashMap framework for all the other parts. All you need to do is to plug in your own hashCode() implementation in your key class, which is something that Java has always supported. And make sure that you've got equality defined right too. At that point, you've got the various operations being not much more expensive than O(1), especially if you've got a good initial estimation for the capacity and load factor.
[1] Equality must imply equal hash codes, of course.
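For illustration, a hypothetical key class of the kind that answer describes, with equals() and hashCode() derived from the same fields (the class and field names here are made up):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    // Hypothetical key class: equality is defined on both fields, and hashCode()
    // is derived from the same fields so that equal keys always map to the same bucket.
    final class AccountKey {
        private final String region;
        private final long id;

        AccountKey(String region, long id) {
            this.region = region;
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AccountKey)) return false;
            AccountKey other = (AccountKey) o;
            return id == other.id && region.equals(other.region);
        }

        @Override
        public int hashCode() {
            return Objects.hash(region, id);   // equal keys -> equal hash codes
        }

        public static void main(String[] args) {
            Map<AccountKey, String> owners = new HashMap<>();
            owners.put(new AccountKey("eu", 42L), "alice");
            // A distinct but equal key finds the same entry in O(1) expected time.
            System.out.println(owners.get(new AccountKey("eu", 42L))); // alice
        }
    }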
Even if your hash codes are unique, this doesn't guarantee a collision-free collection. This is because your hash map is not of unlimited size: the hash code has to be reduced to the number of buckets in your hash map, and after this reduction you can still get collisions.
e.g. say I have three objects: A (hash: 2), B (hash: 18), C (hash: 66). All unique.
Say you put them in a HashMap with a capacity of 16 (the default). If they were mapped to a bucket with % 16 (it's actually more complex than this), then after reducing the hash codes we have A (hash: 2 % 16 = 2), B (hash: 18 % 16 = 2), C (hash: 66 % 16 = 2).
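In code the reduction looks roughly like this (the real HashMap also spreads the hash bits and uses a power-of-two mask, but the effect is the same):

    // Three distinct hash codes, one bucket: 2 % 16 == 18 % 16 == 66 % 16 == 2.
    class BucketDemo {
        public static void main(String[] args) {
            int capacity = 16;
            int[] hashes = { 2, 18, 66 };
            for (int h : hashes) {
                System.out.println("hash " + h + " -> bucket " + (h % capacity));
            }
        }
    }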
HashMap is likely to be faster than Hashtable, unless you need thread safety (in which case I suggest you use ConcurrentHashMap).
IMHO, Hashtable has been a legacy collection for 12 years, and I would suggest you only use it if you have to.
What functionality do you need that a linked list won't give you?
Surprisingly, your idea will work, if you know all the keys you want to put in the collection in advance. The idea is to generate a special hash function which maps each key to a unique value in the range (1, n). Then our "hash table" is just a simple array (+ an integer to cache the number of elements)
Implementing this is not trivial, but it's not rocket science either. I'll leave it to Steve Hanov to explain the ins-and-outs, as he gives a much better explanation than I ever could.
It's simple. Just use a hash map. You don't need to do anything special. A hash map itself is O(1) for insertion, deletion, and calculating the number of elements.
Even if the keys are not unique, the algorithm will still be O(1) as long as the Hashmap is automatically expanded in size if the collection gets too large (most implementations will do this for you automatically).
So, just use the Hash map according to the given documentation, and all will be well. Don't think up anything more complicated, it will just be a waste of time.
Avoiding collisions is really impossible with a hash; if it were possible, then it would basically just be an array or a mapping to an array, not a hash. But it isn't necessary to avoid collisions; it will still be O(1) on average with collisions.
Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range. (provided that it's balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
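For example, with Java's TreeMap (the same red-black tree machinery as TreeSet) a date-range query falls straight out of subMap; the event data below is made up:

    import java.time.LocalDate;
    import java.util.TreeMap;

    // A balanced BST keyed by date: range queries are O(log n + k) via subMap.
    class DateRangeDemo {
        public static void main(String[] args) {
            TreeMap<LocalDate, String> events = new TreeMap<>();
            events.put(LocalDate.of(2023, 1, 10), "invoice sent");
            events.put(LocalDate.of(2023, 3, 5),  "payment received");
            events.put(LocalDate.of(2023, 7, 22), "contract renewed");

            LocalDate from = LocalDate.of(2023, 1, 1);
            LocalDate to   = LocalDate.of(2023, 4, 1);

            // All events with from <= date < to, already in sorted order.
            System.out.println(events.subMap(from, to).values());
        }
    }

If several objects can share a date, map each date to a list of objects instead of a single value.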
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
Assuming you mean sorted by date, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
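A sketch of that sorted-array approach in Java, using Arrays.binarySearch for the lower bound (if duplicate dates are possible you would also scan back to the first equal entry, since binarySearch makes no promise about which duplicate it lands on):

    import java.time.LocalDate;
    import java.util.Arrays;

    // Sorted-array approach: binary search for the first date >= start,
    // then walk forward until the end date is exceeded.
    class SortedArrayRange {
        static int lowerBound(LocalDate[] sorted, LocalDate start) {
            int pos = Arrays.binarySearch(sorted, start);
            return pos >= 0 ? pos : -pos - 1;           // insertion point if not present
        }

        public static void main(String[] args) {
            LocalDate[] dates = {
                LocalDate.of(2023, 1, 10),
                LocalDate.of(2023, 3, 5),
                LocalDate.of(2023, 7, 22),
            };
            LocalDate start = LocalDate.of(2023, 2, 1);
            LocalDate end   = LocalDate.of(2023, 8, 1);

            for (int i = lowerBound(dates, start); i < dates.length && !dates[i].isAfter(end); i++) {
                System.out.println(dates[i]);           // every date in [start, end]
            }
        }
    }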
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range, as Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most data structures have trade-offs between lookups and updates. If you're doing lots of updates then a data structure that is optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array, where all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, and in O(log(n)).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or the insertion position, that is, the position of the earliest element later than A) and the position of B (or the insertion position of B), and return all the objects between those positions (which is simply the section between those positions in the array/heap).