questions on data structure design for vectors?


While reading some materials on data structure design for sparse vectors, the authors make some statements as follows.
A hash table could be used
to implement a simple index-to-value mapping. Accessing an index value is slower than with direct array
access, but not by much.
Why is accessing an index value slower when using a hash table?
Further, the authors state that
The problem with a hash-backed implementation is that it becomes relatively slow to iterate through
all values in order by index.
An ordered mapping based on a tree structure or
similar can address this problem, since it maintains keys in order. The price of this feature is longer access
time.
Why does a hash-based implementation perform badly when iterating through all values? Is that due to the slower operation of accessing an index?
How can a tree structure help with this kind of issue?

Accessing an index in a hash table is just a bit slower because of the calculation overhead.
In a hash table, if you request item 452345435, that doesn't mean it is stored in cell 452345435. The hash table performs a series of calculations (hashing the key, then resolving any collision) to find the right cell. The details are implementation dependent.
Hash table Performance analysis
Hash tables don't store their data in sorted order. So if you want to get the items in index order, the keys have to be sorted first.
To solve that, you can use a tree, or any other sorted data structure.
But that will increase the insertion complexity from O(1) (hash table) to O(log n) (insertion into a tree or other sorted structure).
That is because each index will be added to both data structures, and the complexity will be O(1) + O(log n) = O(log n).
It will still take only O(1) to retrieve the data, because it's enough to request it from the hash table.
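To make the point about ordered iteration concrete, here is a minimal sketch of a hash-backed sparse vector. The class name SparseVector and its method names are made up for illustration and are not taken from the material quoted in the question. Lookup is a single dict access, but walking the entries in index order needs a sort first.

```python
# Hypothetical sparse vector backed by a hash table (a Python dict).
class SparseVector:
    def __init__(self, default=0.0):
        self._data = {}           # index -> value, only non-default entries
        self._default = default

    def __getitem__(self, index):
        # Hash lookup: O(1) on average, but more work than plain array indexing.
        return self._data.get(index, self._default)

    def __setitem__(self, index, value):
        if value == self._default:
            self._data.pop(index, None)   # keep the table sparse
        else:
            self._data[index] = value

    def items_in_index_order(self):
        # The dict is not ordered by index, so the keys are sorted first:
        # O(k log k) for k stored entries.
        return [(i, self._data[i]) for i in sorted(self._data)]


v = SparseVector()
v[452345435] = 2.5
v[7] = 1.0
v[1000] = -3.0
ordered = v.items_in_index_order()
```

If ordered iteration is frequent, keeping the indices in a sorted structure next to the dict avoids re-sorting every time, at the cost of slower inserts.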

Related

Why not use hashing/hash tables for everything?

In computer science, it is said that the insert, delete and searching operations for hash tables have a complexity of O(1), which is the best. So, I was wondering, why do we need to use other data structures since hashing operations are so fast? Why can't we just simply use hashing/hash tables for everything?

Hash tables, on average, do have excellent time complexity for insertion, retrieval, and deletion. BUT:
Big-O complexity isn't everything. The constant factor is also very important. You could use hashtables in place of arrays, with the array indexes as hash keys. In either case, the time complexity of retrieving an item is O(1). But the constant factor is way higher for the hash table as opposed to the array.
Memory consumption may be much higher. This is certainly true if you use hash tables to replace arrays. (Of course, if the array is sparse, then the hash table may take less memory.)
There are some operations which are not efficiently supported by hash tables, such as iterating over all the elements whose keys are within a certain range, finding the element with the largest key or smallest key, and so on.
The O(1) complexity is on average. In some extreme cases (for example, all keys falling into the same bucket), operations degrade to O(n).
All of that aside, you do still have a good point. Hashtables have an extraordinarily broad range of suitable use cases. That's why they are the primary built-in data structure in some scripting languages, like Lua.
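As a rough illustration of the range and min/max point, here is a small sketch comparing a dict with a sorted list of the same keys. The function names and the sample data are invented for this example. The sorted list stands in for any ordered structure such as a balanced tree.

```python
import bisect

table = {17: "q", 3: "a", 42: "z", 8: "c", 25: "m"}


def range_scan_hash(table, lo, hi):
    # A hash table has no key order, so every key must be examined: O(n).
    return sorted(k for k in table if lo <= k <= hi)


def range_scan_sorted(keys, lo, hi):
    # With sorted keys, two binary searches bound the range: O(log n + k).
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    return keys[left:right]


sorted_keys = sorted(table)
in_range_hash = range_scan_hash(table, 5, 30)
in_range_sorted = range_scan_sorted(sorted_keys, 5, 30)

# Smallest and largest key: O(1) on the sorted list, O(n) scan on the dict.
smallest, largest = sorted_keys[0], sorted_keys[-1]
```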

You may use a hash table to search for an element, but you cannot use it to do things like finding the largest number quickly. You should use the data structure suited to the specific problem. Hashing cannot solve every problem.

A hash table is not the answer to everything. If your hash function does not distribute your keys well, then the hash map may degenerate into a linked list, in which case insertion, deletion and search take O(N) in the worst case.
A hash map has a significant memory footprint, so in use cases where memory is more precious than time complexity, a hash map may not be the best choice.
A hash map is not an answer for range queries or prefix queries. That is why most database vendors implement indexing with B-trees rather than only with hashing, to support range or prefix queries.
Hash tables in general exhibit poor locality of reference; that is, the data to be accessed is distributed seemingly at random in memory.
For certain string processing applications, such as spellchecking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.

Hash Tables are not sorted (map)
Hash Tables are not best for head/tail insert (link list/deque)
Hash Tables have overhead to support searching (vector/array)

The potential security issues of hash tables on the web should also be pointed out. If someone knows the hash function, that person may perform a denial-of-service attack by creating lots of items with the same hashcode.

I don't get it, enum/symbol-keys not wasteful enough? ;) What about just using the raw string pointer as key? I must have overlooked some obvious advantage in hashing... but now thinking about it, it makes less and less sense.
It's all just local representation anyway, right? I mean, I could share the data everywhere... API's, IPC or RPC - but not sure how helpful those hashed keys are unless the full string is embedded too.
Meaning you just spent a lot of time hashing strings back and forth for your own amusement.
I'll just leave this here...

What is the best data structure for fast dictionary search?

As far as I know, hash tables and double array tries are two of fastest data structures for searching a dictionary. Are there any other data structures or algorithms that can beat them?

A hash table is not always a fast data structure for searching. It really depends on how good your hash function is. If your hash function is not very good, it could map multiple keys to the same index, causing collisions and making the hash table degenerate to O(n) run time.
Self-balancing trees are considered fast data structures as well, as they guarantee O(log n) operations.
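The effect of a poor hash function can be observed with Python's built-in dict. This is only a sketch: the Key class and the count_comparisons helper are hypothetical, and the exact number of comparisons depends on the interpreter's dict implementation. The assumption is that keys sharing one hash value force the table to compare a new key against many stored keys.

```python
class Key:
    eq_calls = 0  # counts how often two keys had to be compared

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # A deliberately bad hash sends every key to the same place.
        return 1 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def count_comparisons(n, constant_hash):
    Key.eq_calls = 0
    table = {}
    for i in range(n):
        table[Key(i, constant_hash)] = i
    return Key.eq_calls


good = count_comparisons(200, constant_hash=False)
bad = count_comparisons(200, constant_hash=True)
```

With the constant hash, the comparison count grows much faster than the number of keys, which is the degeneration to linear-time operations described above.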

Must a hash table be implemented using an array?

Must a hash table be implemented using an array? Will an alternative data structure achieve the same efficiency? If yes, why? If no, what condition must the data structure satisfy to ensure the same efficiency as provided by arrays?

Must a hash table be implemented using an array?
No. You could implement the hash table interface with data structures other than an array, e.g. a red-black tree (Java's TreeMap).
This offers O(log N) access time.
But a hash table is expected to have O(1) access time (in the best case, with no collisions).
This can be achieved only via an array, which offers random access in constant time.
what condition must the data structure satisfy to ensure the same
efficiency as provided by arrays?
It must have access performance comparable to an array (better than O(N)). A TreeMap, for example, has O(log N) worst-case access time for all operations.
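To show where the array's constant-time random access comes in, here is a minimal chained hash table. The class ChainedHashTable is a sketch written for this answer, with a fixed capacity and no resizing, not the implementation of any particular library.

```python
class ChainedHashTable:
    def __init__(self, capacity=8):
        # The backing array: indexing into it is the O(1) step.
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket(self, key):
        # Hash the key, then jump straight to its bucket by index.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))
        self._size += 1

    def get(self, key, default=None):
        # Only the one bucket is scanned, not the whole table.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def __len__(self):
        return self._size


table = ChainedHashTable()
table.put("apple", 1)
table.put("pear", 2)
table.put("apple", 3)
```

Replacing the backing list with a structure that lacks constant-time indexing would add that structure's access cost to every operation.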

What datastructure is effective for minimizing the cost of look ups in hash table buckets?

Given a hash table with collisions, the generic hash table implementation will cause look ups within a bucket to run in O(n), assuming that a linkedlist is used.
If we switch the linked list for a binary search tree, we go down to O(log n). Is this the best we can do, or is there a better data structure for this use case?
Using hash tables for the buckets themselves would bring the look up time to O(1), but that would require clever revisions of the hash function.

There is a trade-off between insertion time and look-up time in your solution (keeping each bucket sorted).
If you keep every bucket sorted, you get O(log n) look-up time using binary search. However, when you insert a new element, you have to place it in the right location so that the bucket stays sorted, which takes an O(log n) search to find the position for the new element.
So in your solution, you get a total complexity of O(log n) for both insertion and look-up.
(In contrast to the traditional solution, which takes O(n) for look-up in the worst case and O(1) for insertion.)
EDIT :
If you choose to use a sorted bucket, of course you can't use a linked list any more. You can switch to any other suitable data structure.
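A sorted bucket could be sketched like this, using a plain sorted array and binary search. SortedBucket is a hypothetical name. Note the assumption: with an array, finding the position is O(log n), but the insert itself still shifts elements, so O(log n) insertion overall would need a balanced tree instead.

```python
import bisect


class SortedBucket:
    def __init__(self):
        self._keys = []     # kept sorted at all times
        self._values = []   # values stored at matching positions

    def insert(self, key, value):
        i = bisect.bisect_left(self._keys, key)    # O(log n) search
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value                # key exists: overwrite
        else:
            self._keys.insert(i, key)              # O(n) shift in an array
            self._values.insert(i, value)

    def lookup(self, key, default=None):
        i = bisect.bisect_left(self._keys, key)    # O(log n) search
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default


bucket = SortedBucket()
for k, v in [(30, "c"), (10, "a"), (20, "b"), (10, "A")]:
    bucket.insert(k, v)
```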

Perfect hashing is known to achieve collision-free O(1) hashing of a limited set of keys known at the time the hash function is constructed. The Wikipedia article mentions several approaches for applying those ideas to a dynamic set of keys, like dynamic perfect hashing and cuckoo hashing, which might be of interest to you.

You've pretty much answered your own question. Since a hash table is just an array of other data structures, your lookup time is just dependent on the lookup time of the secondary data structure and how well your hash function distributes items across the buckets.

Advantages of Binary Search Trees over Hash Tables

What are the advantages of binary search trees over hash tables?
Hash tables can look up any element in Theta(1) time and it is just as easy to add an element....but I'm not sure of the advantages going the other way around.

One advantage that no one else has pointed out is that binary search tree allows you to do range searches efficiently.
To illustrate the idea, take an extreme case. Say you want to get all the elements whose keys are between 0 and 5000, and there is actually only one such element, plus 10000 other elements whose keys are not in the range. A BST can do range searches quite efficiently, since it does not search a subtree that cannot contain the answer.
By contrast, how can you do range searches in a hash table? You either need to iterate over every bucket, which is O(n), or you have to check whether each of 1, 2, 3, 4, ... up to 5000 exists.
(And what if the keys between 0 and 5000 form an infinite set? For example, keys can be decimals.)
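The pruning described above can be sketched with a small unbalanced BST. The names Node, insert and range_search are chosen for this example only. Because the search compares keys instead of enumerating them, it works for decimal keys as well.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    # Plain (unbalanced) BST insert; duplicates are ignored.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def range_search(root, lo, hi, out=None):
    # Collect keys in [lo, hi] in sorted order, skipping subtrees
    # that cannot contain a key in the range.
    if out is None:
        out = []
    if root is None:
        return out
    if lo < root.key:                 # left subtree may hold keys >= lo
        range_search(root.left, lo, hi, out)
    if lo <= root.key <= hi:
        out.append(root.key)
    if root.key < hi:                 # right subtree may hold keys <= hi
        range_search(root.right, lo, hi, out)
    return out


root = None
for key in [7000.5, 2500.25, 9000, 6000, 8000, 12000]:
    root = insert(root, key)

hits = range_search(root, 0, 5000)
```

In this tree the search never descends into the subtree rooted at 9000, since its parent 7000.5 is already above the upper bound.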

Remember that Binary Search Trees (reference-based) are memory-efficient. They do not reserve more memory than they need to.
For instance, if a hash function has a range R(h) = 0...100, then you need to allocate an array of 100 (pointers-to) elements, even if you are just hashing 20 elements. If you were to use a binary search tree to store the same information, you would only allocate as much space as you needed, as well as some metadata about links.

One "advantage" of a binary tree is that it may be traversed to list all elements in order. This is not impossible with a hash table, but it is not a normal operation one designs into a hashed structure.

In addition to all the other good comments:
Hash tables in general have better cache behavior, requiring fewer memory reads than a binary tree. For a hash table you normally incur only a single read before you have access to a reference holding your data. The binary tree, if it is a balanced variant, requires something on the order of k * lg(n) memory reads for some constant k.
On the other hand, if an enemy knows your hash function, the enemy can force collisions in your hash table, greatly hampering its performance. The workaround is to choose the hash function randomly from a family, but a BST does not have this disadvantage. Also, when the hash table load grows too high, you usually have to enlarge and reallocate the hash table, which may be an expensive operation. The BST has simpler behavior here and does not tend to suddenly allocate a lot of data and do a rehashing operation.
Trees tend to be the ultimate average data structure. They can act as lists, can easily be split for parallel operation, have fast removal, insertion and lookup on the order of O(lg n). They do nothing particularly well, but they don't have any excessively bad behavior either.
Finally, BSTs are much easier to implement in (pure) functional languages compared to hash-tables and they do not require destructive updates to be implemented (the persistence argument by Pascal above).

The main advantage of a binary tree over a hash table is that the binary tree gives you two additional operations you can't do (easily, quickly) with a hash table:
find the element closest to (not necessarily equal to) some arbitrary key value (or closest above/below)
iterate through the contents of the tree in sorted order
The two are connected -- the binary tree keeps its contents in a sorted order, so things that require that sorted order are easy to do.
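For the "closest key" operation, a sketch like the following shows what sorted order buys you. A sorted Python list stands in for the tree's in-order sequence here, and the helper names floor_key, ceiling_key and closest_key are made up for illustration.

```python
import bisect


def floor_key(keys, target):
    # Largest key <= target, or None if there is none.
    i = bisect.bisect_right(keys, target)
    return keys[i - 1] if i > 0 else None


def ceiling_key(keys, target):
    # Smallest key >= target, or None if there is none.
    i = bisect.bisect_left(keys, target)
    return keys[i] if i < len(keys) else None


def closest_key(keys, target):
    # Nearest key on either side; ties go to the smaller key.
    below, above = floor_key(keys, target), ceiling_key(keys, target)
    if below is None:
        return above
    if above is None:
        return below
    return below if target - below <= above - target else above


keys = [3, 8, 17, 25, 42]   # sorted, as an in-order traversal would yield
nearest = closest_key(keys, 20)
```

A hash table would have to examine every key to answer the same question, because neighbouring keys are not stored near each other.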

A (balanced) binary search tree also has the advantage that its asymptotic complexity is actually an upper bound, while the "constant" times for hash tables are average-case (and, with resizing, amortized) times: if you have an unsuitable hash function, you could end up degrading to linear time, rather than constant.

A binary tree is slower to search and insert into, but has the very nice feature of in-order traversal, which essentially means that you can iterate through the nodes of the tree in sorted order.
Iterating through the entries of a hash table in key order just doesn't work well, because they are scattered across the buckets in no particular order.

A hash table takes up more space when it is first created: it has slots available for elements that are yet to be inserted (whether or not they are ever inserted), while a binary search tree is only as big as it needs to be. Also, when a hash table needs more room, expanding to another structure can be time-consuming, but that might depend on the implementation.

A binary search tree can be implemented with a persistent interface, where a new tree is returned but the old tree continues to exist. Implemented carefully, the old and new trees share most of their nodes. You cannot do this with a standard hash table.

BSTs also provide the "findPredecessor" and "findSuccessor" operations (to find the next smallest and next largest elements) in O(log n) time, which can be very handy. A hash table can't provide them with that time efficiency.

From Cracking the Coding Interview, 6th Edition
We can implement the hash table with a balanced binary search tree (BST). This gives us an O(log n) lookup time. The advantage of this is potentially using less space, since we no longer allocate a large array. We can also iterate through the keys in order, which can be useful sometimes.

GCC C++ case study
Let's also get some insight from one of the most important implementations in the world. As we will see, it actually matches our theory perfectly!
As shown at What is the underlying data structure of a STL set in C++?, in GCC 6.4:
std::map uses BST
std::unordered_map uses hashmap
So this already points to the fact that you can't traverse a hashmap in key order efficiently, which is perhaps the main advantage of a BST.
And then, I also benchmarked insertion times in hash map vs BST vs heap at Heap vs Binary Search Tree (BST) which clearly highlights the key performance characteristics:
BST insertion is O(log n), hashmap insertion is O(1). And in this particular implementation, hashmap is almost always faster than BST, even for relatively small sizes
hashmap, although much faster in general, has some extremely slow insertions visible as single points in the zoomed out plot.
These happen when the implementation decides that it is time to increase its size, and it needs to be copied over to a larger one.
In more precise terms, this is because only its amortized complexity is O(1), not the worst case, which is actually O(n) during the array copy.
This might make hashmaps inadequate for certain real-time applications, where you need stronger time guarantees.
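The amortized-versus-worst-case behavior can be reproduced with a tiny simulation that only counts element moves. This is a simplification under stated assumptions: the table doubles when it is completely full, whereas real hash maps usually grow at some load factor below 1. The function name simulate_doubling is made up.

```python
def simulate_doubling(n, initial_capacity=1):
    # Returns, for each of n inserts, how many existing elements had to be
    # moved into a larger array before the insert could happen.
    capacity = initial_capacity
    size = 0
    moves_per_insert = []
    for _ in range(n):
        moves = 0
        if size == capacity:
            moves = size          # copy everything into the bigger array
            capacity *= 2
        size += 1
        moves_per_insert.append(moves)
    return moves_per_insert


moves = simulate_doubling(1025)
total_moves = sum(moves)      # stays below 2 * n: amortized O(1) per insert
worst_single = max(moves)     # one unlucky insert pays for a full copy
```

Most inserts move nothing, but the insert that triggers a resize moves every element stored so far, which matches the isolated slow points in the benchmark.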
Related:
Binary Trees vs. Linked Lists vs. Hash Tables
https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees

If you want to access the data in a sorted manner, then a sorted list has to be maintained in parallel to the hash table. A good example is Dictionary in .Net. (see http://msdn.microsoft.com/en-us/library/3fcwy8h6.aspx).
This not only slows inserts, it also consumes a larger amount of memory than a b-tree.
Further, since a b-tree is sorted, it is simple to find ranges of results, or to perform unions or merges.

It also depends on the use. A hash table lets you locate an exact match. If you want to query for a range, then a BST is the better choice. Suppose you have lots of data e1, e2, e3, ..., en.
With hash table you can locate any element in constant time.
If you want to find range values greater than e41 and less than e8, BST can quickly find that.
The key thing is the hash function used to avoid a collision. Of course, we cannot totally avoid a collision, in which case we resort to chaining or other methods. This makes retrieval no longer constant time in worst cases.
Once full, a hash table has to increase its number of buckets and copy over all the elements again. This is an additional cost that a BST does not have.

Binary search trees are a good choice for implementing a dictionary if the keys have some total order defined on them (the keys are comparable) and you want to preserve the order information.
As BST preserves the order information, it provides you with four additional dynamic set operations that cannot be performed (efficiently) using hash tables. These operations are:
Maximum
Minimum
Successor
Predecessor
All these operations, like every BST operation, have a time complexity of O(H), where H is the height of the tree. Additionally, all the stored keys remain sorted in the BST, so you can get the sorted sequence of keys just by traversing the tree in order.
In summary, if all you want are the operations insert, delete and lookup, then a hash table is unbeatable (most of the time) in performance. But if you want any or all of the operations listed above, you should use a BST, preferably a self-balancing BST.
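The four operations listed above can be sketched on a plain BST as follows. This is an unbalanced tree without parent pointers, so successor and predecessor walk down from the root; all names are illustrative. Each function follows a single root-to-leaf path, hence O(H).

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def minimum(root):
    # Assumes a non-empty tree: keep going left.
    while root.left is not None:
        root = root.left
    return root.key


def maximum(root):
    # Assumes a non-empty tree: keep going right.
    while root.right is not None:
        root = root.right
    return root.key


def successor(root, key):
    # Smallest stored key strictly greater than key, or None.
    best = None
    while root is not None:
        if key < root.key:
            best = root.key       # candidate; look for a smaller one
            root = root.left
        else:
            root = root.right
    return best


def predecessor(root, key):
    # Largest stored key strictly smaller than key, or None.
    best = None
    while root is not None:
        if key > root.key:
            best = root.key       # candidate; look for a larger one
            root = root.right
        else:
            root = root.left
    return best


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
```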

A hashmap is an associative array: your input values get pooled into buckets. In an open addressing scheme, you have a pointer to a bucket, and each time you add a new value into a bucket, you have to find out where in the bucket there is free space. There are a few ways to do this. You can start at the beginning of the bucket, increment the pointer each time, and test whether that slot is occupied. This is called linear probing. Alternatively, you can jump ahead by steps that grow with each failed attempt (offsets of 1, 4, 9, ... from the starting slot) until you find a free space. This is called quadratic probing.
OK. Now the problem with both these methods is that if the bucket overflows into the next bucket's address, then you need to:
Double each bucket's size, via malloc(N buckets), and/or change the hash function.
Time required: depends on the malloc implementation.
Transfer/copy each of the earlier buckets' data into the new buckets. This is an O(N) operation, where N represents the whole data set.
OK, but if you use a linked list there shouldn't be such a problem, right? Yes, with linked lists you don't have this problem. Consider each bucket to begin with a linked list: if you have 100 elements in a bucket, you have to traverse those 100 elements to reach the end of the linked list, hence List.add(Element E) will take time to:
Hash the element to a bucket. This is normal, as in all implementations.
Find the last element in said bucket. This is an O(N) operation.
The advantage of the linked list implementation is that you don't need the memory allocation and the O(N) transfer/copy of all buckets, as in the case of the open addressing implementation.
So, the way to minimize the O(N) operation is to convert the implementation to that of a binary search tree, where find operations are O(log(N)) and you add the element in its position based on its value. The added feature of a BST is that it comes sorted!
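For reference, here is a small sketch of how probing picks slots in open addressing. It uses the common textbook definitions, which is an assumption on my part: linear probing tries offsets 0, 1, 2, ... from the home slot and quadratic probing tries offsets 0, 1, 4, 9, .... The function names are invented, and the table stores bare keys with no resizing.

```python
def probe_sequence(home, capacity, attempts, scheme):
    # Slots to try, in order, for a key whose home slot is `home`.
    if scheme == "linear":
        return [(home + i) % capacity for i in range(attempts)]
    if scheme == "quadratic":
        return [(home + i * i) % capacity for i in range(attempts)]
    raise ValueError("unknown probing scheme: " + scheme)


def insert(slots, key, scheme="linear"):
    # Place `key` in the first free slot along its probe sequence
    # and return the index that was used.
    capacity = len(slots)
    home = hash(key) % capacity
    for index in probe_sequence(home, capacity, capacity, scheme):
        if slots[index] is None or slots[index] == key:
            slots[index] = key
            return index
    raise RuntimeError("no free slot found; a real table would grow and rehash")


linear_slots = [None] * 11
quadratic_slots = [None] * 11
# 3, 14 and 25 all have home slot 3 in a table of size 11.
linear_used = [insert(linear_slots, k, "linear") for k in (3, 14, 25)]
quadratic_used = [insert(quadratic_slots, k, "quadratic") for k in (3, 14, 25)]
```

Linear probing packs the colliding keys into adjacent slots, while quadratic probing spreads them out, which reduces clustering around a busy home slot.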

Hash tables are not good for indexing when you need range searches; for those, BSTs are better. That's the reason why most database indexes use B+ trees instead of hash tables.

Binary search trees can be faster when used with string keys. Especially when strings are long.
Binary search trees use less/greater comparisons, which are fast for strings (when they are not equal), since a comparison can stop at the first differing character. So a BST can quickly answer when a string is not found.
When it is found, only one full comparison is needed.
In a hash table, you need to calculate the hash of the string, which means going through all of its bytes at least once to compute the hash, and then once more for the comparison when a matching entry is found.
