A data structure with certain properties - c++11

I want to implement a data structure myself in C++11. What I'm planning to do is having a data structure with the following properties:
search. O(log(n))
insert. O(log(n))
delete. O(log(n))
iterate. O(n)
What I have been thinking about after research was implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.

First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild you can combine the two by using a tree instead of a linked list to handle hash collisions (although this has a good potential to actually make your implementation slower if you don't make massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function) although this only has uses in special applications (I had the opportunity to use a perfect minimal hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope so you can also go wild on that front.

Related

What is the utility of treap data structure?

I am currently studying advanced data structures and I came across a weird data structure called Treap. I understand what Treap is but I can't seem to find it's utility in a valid use case scenario.
Why should you use such a data structure and in what type of problems/conditions treaps are best used?
I find myself much more into using either hash maps, min/max heaps, binary search tree or balanced binary search trees, but I can't tell on why should you use a treap.
They are easier to implement and more importantly, that makes them easier to modify/maintain into the future if you want to make slight variations on them or change them some way. They also allow for efficient parallel versions of set operations Union/Intersect/Difference which is extremely valuable. Using them simultaneously as a heap and binary tree isn't really very handy unless the stuff you use for priorities are coincidentally really nicely randomly distributed/permuted. I suppose there might be a case where that would be handy, but it seems really unlikely. Stuff so randomly distributed is usually more like a hash key which typically aren't useful as ordered data. How often do you want to pull people out in order of their SSNs? I guess it's possible but unlikely.

What are the advantages of splay tree over hash table?

Since splay trees are used for caching I was wondering what are the advantages of Splay Tree over HashTable when I want to cache efficiently?
When should I prefer splay tree over hash table?
I guess it is a more specialized case than BST, there please don't link to BST vs Hashtable answer.
It really depends on what you mean by efficiency.
If its on the storage/space consumed by the data structure, they would be the same. On the other hand, when we talk about time complexity, it would still depend on how your data structure is being used.
If your users per say, use your data structure in a way that they will be accessing very similar data (related data everytime), caching using the splay tree would be great since its worst case revolves around O(log n), whereas a hashtable has a worst case of O(n).
An instance where a hash table would work really bad is when your hashes are stored in less than 3 buckets or worse just 1 bucket or chain or thread. I suppose you could imagine what happens when you try to access the data.
but in general when it comes to caching, Hash table would work very well since it has an average time complexity of O(1).
you could also think of it this way. If what you want is to make your data structure access the "most recent"/"most accesssed" data faster then you could consider working with splay trees. but if you want your data structure to be able to access different kinds of data at an ease then you would want to consider using the hash table.
Also you would want to note that it is very important to make sure to choose a good hash function, table size and data structure to use in a bucket to make it work well with your use case.
In fact combining the two could also work :)
This is really just based on my opinion so I hope I helped! It's really a very shallow comparison, I suggest you just google how both works when it comes to caching and make your own comparison! Kudos!

Hashing Performances

I am studying for a test and in the study sheet it asks about the Performances of hashing it asks things such as
Add
Remove
Search / Contains
Space vs Time
Applications where it makes sense
Performance of Hashing means analysis of how various data structures using hash technique.
For finding out of performance of hashing, first find out which are available data structures which use hashing. Example could be hash tables, hash maps, hash trees etc.
After that have a round of performance testing for below operations on each of data structure.
Add an element
Remove
Search
Find out Time and space complexity for each operation and each data structure.
After that you will quite a good idea of where you can use which data structure or if a given a requirement you will be able to find out which data structure will be performance efficient.
Hope this helps and clears your doubts :)

Why are Haskell Maps implemented as balanced binary trees instead of traditional hashtables?

From my limited knowledge of Haskell, it seems that Maps (from Data.Map) are supposed to be used much like a dictionary or hashtable in other languages, and yet are implemented as self-balancing binary search trees.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1) and requires that the elements be in Ord. Certainly there is a good reason, so what are the advantages of using a binary tree?
Also:
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Hash tables can't be implemented efficiently without mutable state, because they're based on array lookup. The key is hashed and the hash determines the index into an array of buckets. Without mutable state, inserting elements into the hashtable becomes O(n) because the entire array must be copied (alternative non-copying implementations, like DiffArray, introduce a significant performance penalty). Binary-tree implementations can share most of their structure so only a couple pointers need to be copied on inserts.
Haskell certainly can support traditional hash tables, provided that the updates are in a suitable monad. The hashtables package is probably the most widely used implementation.
One advantage of binary trees and other non-mutating structures is that they're persistent: it's possible to keep older copies of data around with no extra book-keeping. This might be useful in some sort of transaction algorithm for example. They're also automatically thread-safe (although updates won't be visible in other threads).
Traditional hashtables rely on memory mutation in their implementation. Mutable memory and referential transparency are at ends, so that relegates hashtable implementations to either the IO or ST monads. Trees can be implemented persistently and efficiently by leaving old leaves in memory and returning new root nodes which point to the updated trees. This lets us have pure Maps.
The quintessential reference is Chris Okasaki's Purely Functional Data Structures.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1)
Lookup is only one of the operations; insertion/modification may be more important in many cases; there are also memory considerations. The main reason the tree representation was chosen is probably that it is more suited for a pure functional language. As "Real World Haskell" puts it:
Maps give us the same capabilities as hash tables do in other languages. Internally, a map is implemented as a balanced binary tree. Compared to a hash table, this is a much more efficient representation in a language with immutable data. This is the most visible example of how deeply pure functional programming affects how we write code: we choose data structures and algorithms that we can express cleanly and that perform efficiently, but our choices for specific tasks are often different their counterparts in imperative languages.
This:
and requires that the elements be in Ord.
does not seem like a big disadvantage. After all, with a hash map you need keys to be Hashable, which seems to be more restrictive.
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Unfortunately, I cannot provide an extensive comparative analysis, but there is a hash map package, and you can check out its implementation details and performance figures in this blog post and decide for yourself.
My answer to what the advantage of using binary trees is, would be: range queries. They require, semantically, a total preorder, and profit from a balanced search tree organization algorithmically. For simple lookup, I'm afraid there may only be good Haskell-specific answers, but not good answers per se: Lookup (and indeed hashing) requires only a setoid (equality/equivalence on its key type), which supports efficient hashing on pointers (which, for good reasons, are not ordered in Haskell). Like various forms of tries (e.g. ternary tries for elementwise update, others for bulk updates) hashing into arrays (open or closed) is typically considerably more efficient than elementwise searching in binary trees, both space and timewise. Hashing and Tries can be defined generically, though that has to be done by hand -- GHC doesn't derive it (yet?). Data structures such as Data.Map tend to be fine for prototyping and for code outside of hotspots, but where they are hot they easily become a performance bottleneck. Luckily, Haskell programmers need not be concerned about performance, only their managers. (For some reason I presently can't find a way to access the key redeeming feature of search trees amongst the 80+ Data.Map functions: a range query interface. Am I looking the wrong place?)

Iterable O(1) insert and random delete collection

I am looking to implement my own collection class. The characteristics I want are:
Iterable - order is not important
Insertion - either at end or at iterator location, it does not matter
Random Deletion - this is the tricky one. I want to be able to have a reference to a piece of data which is guaranteed to be within the list, and remove it from the list in O(1) time.
I plan on the container only holding custom classes, so I was thinking a doubly linked list that required the components to implement a simple interface (or abstract class).
Here is where I am getting stuck. I am wondering whether it would be better practice to simply have the items in the list hold a reference to their node, or to build the node right into them. I feel like both would be fairly simple, but I am worried about coupling these nodes into a bunch of classes.
I am wondering if anyone has an idea as to how to minimize the coupling, or possibly know of another data structure that has the characteristics I want.
It'd be hard to beat a hash map.
Take a look at tries.
Apparently they can beat hashtables:
Unlike most other algorithms, tries have the peculiar feature that the time to insert, or to delete or to find is almost identical because the code paths followed for each are almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure tries can handily beat binary search trees or even hash tables, as well as being better for the CPU's instruction and branch caches.
It may or may not fit your usage, but if it does, it's likely one of the best options possible.
In C++, this sounds like the perfect fit for std::unordered_set (that's std::tr1::unordered_set or boost::unordered_set to you if you have an older compiler). It's implemented as a hash set, which has the characteristics you describe.
Here's the interface documentation. Note that the hash containers actually offer two sets of iterators, the usual ones and local ones which only go through one bucket.
Many other languages have "hash sets" as well, certainly Java and C#.

Resources