How to deal with duplicates in red-black trees?

So far I've been trying, unsuccessfully, to make my red-black tree implementation work consistently with duplicates, but it always seems to be missing some small detail, so here I am.
I tried making the tree lean to one side, but that didn't seem to balance it properly (from the color perspective). How should one go about adding duplicates to a red-black tree (apart, obviously, from making the nodes fat, i.e. holding or pointing to the duplicate key values)?
I'm not really looking for a code review; I'm more interested in suggestions. The methods I use for insert and balancing (taken from Introduction to Algorithms, Third Edition) are these (the rotations are pretty obvious):

If you look at the pseudo-code you wrote here, it is completely agnostic to the question of whether keys are duplicate or not. The code here only looks at the result of comparing keys, and doesn't care if they are identical or not. In fact, unique-key implementations need to go out of their way to make RB-Insert detect duplicate keys. The data structure doesn't care about this naturally, and the algorithms and proofs hold whether there are duplicate keys or not. If you implemented these functions correctly, it should work as is.
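To make the point about comparisons concrete, here is a minimal sketch of just the descent part of a textbook-style insert, written as a plain unbalanced BST. The names BstNode, bst_insert and sorted_keys are made up for illustration, and the colors and the fixup step are deliberately left out, so this is not a red-black tree; it only shows that equal keys take the "not smaller" branch without any special casing.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical node type; a real red-black node would also carry a color
// and a parent pointer.
struct BstNode {
    int key;
    std::unique_ptr<BstNode> left, right;
    explicit BstNode(int k) : key(k) {}
};

// Go left if the new key is smaller, right otherwise. A key equal to an
// existing one simply takes the "otherwise" branch.
void bst_insert(std::unique_ptr<BstNode>& root, int key) {
    std::unique_ptr<BstNode>* slot = &root;
    while (*slot) {
        slot = (key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
    }
    slot->reset(new BstNode(key));
}

// In-order walk; appends the keys to out.
void inorder(const BstNode* n, std::vector<int>& out) {
    if (!n) return;
    inorder(n->left.get(), out);
    out.push_back(n->key);
    inorder(n->right.get(), out);
}

std::vector<int> sorted_keys(const std::unique_ptr<BstNode>& root) {
    std::vector<int> out;
    inorder(root.get(), out);
    // The search-tree ordering still holds when keys repeat.
    assert(std::is_sorted(out.begin(), out.end()));
    return out;
}
```

Inserting 5, 3, 5, 8, 5, 3 and calling sorted_keys gives the keys back in non-decreasing order with every duplicate present.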
I also disagree with the comments advising you to use what you call "fat nodes". Storing duplicate keys in separate nodes is the common implementation of C++'s std::multimap, for example. Note that, from a computational complexity point of view, if you have n keys altogether and each one appears k times, then with the "efficient" fat-node version the basic find operation costs Θ(log(n / k)) = Θ(log(n) - log(k)), while with the one-node-per-key version it costs Θ(log(n)). In real-life cases k << n most likely, which means that the relative difference is negligible.
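As a rough illustration of the two layouts being compared, the sketch below builds both from the same input using standard containers. The helper names build_fat and build_flat are invented here. The standard does not require std::map or std::multiset to be red-black trees, although common implementations use one, so treat the node counts as indicative only.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// "Fat node" layout: one tree node per distinct key, carrying a count.
// With n keys that each appear k times the tree has n / k nodes.
std::map<int, std::size_t> build_fat(const std::vector<int>& keys) {
    std::map<int, std::size_t> counts;
    for (int k : keys) ++counts[k];
    return counts;
}

// One node per key: duplicates are ordinary elements, so the tree has n nodes.
std::multiset<int> build_flat(const std::vector<int>& keys) {
    return std::multiset<int>(keys.begin(), keys.end());
}
```

For 12 keys made of 4 distinct values repeated 3 times each, build_fat yields 4 nodes and build_flat yields 12, which is the log(n / k) versus log(n) difference in search depth.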

Related

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning to do is having a data structure with the following properties:
search. O(log(n))
insert. O(log(n))
delete. O(log(n))
iterate. O(n)
What I have been thinking about after research was implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations), B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are the data structures usually used to accomplish efficient insertion and lookup. If you want to go wild you can combine the two by using a tree instead of a linked list to handle hash collisions (although this may well make your implementation slower, unless you have made massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function) although this only has uses in special applications (I had the opportunity to use a perfect minimal hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope so you can also go wild on that front.
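For comparison while writing your own, here is a small sketch of the four required operations on std::set, which common standard libraries implement as a balanced binary search tree. The function name exercise_ordered_set is made up for this example.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Runs the four operations from the question on a std::set and returns
// what an in-order iteration sees afterwards.
std::vector<int> exercise_ordered_set() {
    std::set<int> s;
    for (int k : {40, 10, 30, 20}) s.insert(k);   // insert: O(log n) each
    assert(s.find(30) != s.end());                // search: O(log n)
    s.erase(10);                                  // delete: O(log n)
    return std::vector<int>(s.begin(), s.end());  // iterate: O(n), in key order
}
```

A hand-written tree can be checked against this by running the same sequence of operations on both and comparing the iteration results.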

What are appropriate applications for a linked (doubly as well) list?

I have a question about fundamentals in data structures.
I understand that an array's access time is faster than a linked list's: O(1) for the array vs. O(N) for the linked list.
But a linked list beats an array at removing an element, since no shifting is needed: O(N) for the array vs. O(1) for the linked list.
So my understanding is that if the majority of operations on the data is delete then using a linked list is preferable.
But if the use case is:
delete elements but not too frequently
access ALL elements
Is there a clear winner? In a general case I understand that the downside of using the list is that I access each node which could be on a separate page while an array has better locality.
But is this a theoretical or an actual concern that I should have?
And is the mixed approach, i.e. creating a linked list inside an array (using extra fields), a good idea?
Also does my question depend on the language? I assume that shifting elements in array has the same cost in all languages (at least asymptotically)
Singly-linked lists are very useful and can be better performance-wise relative to arrays if you are doing a lot of insertions/deletions, as opposed to pure referencing.
I haven't seen a good use for doubly-linked lists for decades.
I suppose there are some.
In terms of performance, never make decisions without understanding relative performance of your particular situation.
It's fairly common to see people asking about things that, comparatively speaking, are like getting a haircut to lose weight.
Before writing an app, I first ask if it should be compute-bound or IO-bound.
If IO-bound I try to make sure it actually is, by avoiding inefficiencies in IO, and keeping the processing straightforward.
If it should be compute-bound then I look at what its inner loop is likely to be, and try to make that swift.
Regardless, no matter how much I try, there will be (sometimes big) opportunities to make it go faster, and to find them I use this technique.
Whatever you do, don't just try to think it out or go back to your class notes.
Your problem is different from anyone else's, and so is the solution.
The problem with a list is not just the fragmentation, but mostly the data dependency. If you access every Nth element in an array you don't have locality, but the accesses may still go to memory in parallel since you know the addresses in advance. In a list each address depends on the data being retrieved, so traversing a list effectively serializes your memory accesses, making it much slower in practice. This of course is orthogonal to asymptotic complexities, and would harm you regardless of the size.
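The difference shows up directly in the shape of the two loops. This is only a sketch to point at where the dependency sits, not a benchmark; ListNode, sum_strided and sum_list are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ListNode {
    long value;
    ListNode* next;
};

// Array walk: the address of element i + stride is computed from i alone,
// so it is known before element i has been loaded and several loads can be
// in flight at once.
long sum_strided(const std::vector<long>& a, std::size_t stride) {
    assert(stride > 0);
    long sum = 0;
    for (std::size_t i = 0; i < a.size(); i += stride) sum += a[i];
    return sum;
}

// List walk: the address of the next node is stored inside the current
// node, so each load has to finish before the next one can start.
long sum_list(const ListNode* head) {
    long sum = 0;
    for (const ListNode* n = head; n != nullptr; n = n->next) sum += n->value;
    return sum;
}
```

Both functions do the same amount of arithmetic per element; any measured gap between them comes from how the memory accesses can overlap.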

Iterable O(1) insert and random delete collection

I am looking to implement my own collection class. The characteristics I want are:
Iterable - order is not important
Insertion - either at end or at iterator location, it does not matter
Random Deletion - this is the tricky one. I want to be able to have a reference to a piece of data which is guaranteed to be within the list, and remove it from the list in O(1) time.
I plan on the container only holding custom classes, so I was thinking a doubly linked list that required the components to implement a simple interface (or abstract class).
Here is where I am getting stuck. I am wondering whether it would be better practice to simply have the items in the list hold a reference to their node, or to build the node right into them. I feel like both would be fairly simple, but I am worried about coupling these nodes into a bunch of classes.
I am wondering if anyone has an idea as to how to minimize the coupling, or possibly know of another data structure that has the characteristics I want.
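For reference, the "build the node right into them" option can be sketched as an intrusive doubly linked list. Everything here (ListHook, IntrusiveList, Entity, ids) is a hypothetical name chosen for the example. The coupling is limited to inheriting one small base struct, and removal needs only a reference to the item.

```cpp
#include <cassert>
#include <vector>

// Hypothetical base that items inherit from to become list members.
struct ListHook {
    ListHook* prev = nullptr;
    ListHook* next = nullptr;
};

class IntrusiveList {
    ListHook head_;  // sentinel: head_.next is the first item, head_.prev the last
public:
    IntrusiveList() { head_.prev = head_.next = &head_; }
    IntrusiveList(const IntrusiveList&) = delete;
    IntrusiveList& operator=(const IntrusiveList&) = delete;

    // O(1): link the item in just before the sentinel, i.e. at the end.
    void push_back(ListHook& item) {
        assert(item.prev == nullptr && item.next == nullptr);
        item.prev = head_.prev;
        item.next = &head_;
        head_.prev->next = &item;
        head_.prev = &item;
    }

    // O(1): unlink an item that is known to be in a list; no search needed.
    static void remove(ListHook& item) {
        assert(item.prev != nullptr && item.next != nullptr);
        item.prev->next = item.next;
        item.next->prev = item.prev;
        item.prev = item.next = nullptr;
    }

    // O(n): visit every item in list order.
    template <class F>
    void for_each(F f) const {
        for (const ListHook* h = head_.next; h != &head_; h = h->next) f(*h);
    }
};

// Example item type: it only needs to derive from ListHook.
struct Entity : ListHook {
    int id;
    explicit Entity(int i) : id(i) {}
};

// Collects the ids of all Entity items currently in the list.
std::vector<int> ids(const IntrusiveList& list) {
    std::vector<int> out;
    list.for_each([&out](const ListHook& h) {
        out.push_back(static_cast<const Entity&>(h).id);
    });
    return out;
}
```

The list does not own its items, so the caller has to keep them alive while they are linked, and an item can sit in only one such list at a time. That is the price of avoiding a separate node allocation.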
It'd be hard to beat a hash map.
Take a look at tries.
Apparently they can beat hashtables:
Unlike most other algorithms, tries have the peculiar feature that the time to insert, or to delete or to find is almost identical because the code paths followed for each are almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure tries can handily beat binary search trees or even hash tables, as well as being better for the CPU's instruction and branch caches.
It may or may not fit your usage, but if it does, it's likely one of the best options possible.
In C++, this sounds like the perfect fit for std::unordered_set (that's std::tr1::unordered_set or boost::unordered_set to you if you have an older compiler). It's implemented as a hash set, which has the characteristics you describe.
Here's the interface documentation. Note that the hash containers actually offer two sets of iterators, the usual ones and local ones which only go through one bucket.
Many other languages have "hash sets" as well, certainly Java and C#.
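A short sketch of the two kinds of iteration mentioned above, using only std::unordered_set members from the standard; the helper name count_via_buckets is made up. Removal by value with erase is O(1) on average, not in the worst case.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Walks every bucket with the local iterators and counts the elements seen.
// The result should always equal s.size().
std::size_t count_via_buckets(const std::unordered_set<int>& s) {
    std::size_t n = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        for (auto it = s.begin(b); it != s.end(b); ++it) ++n;
    }
    return n;
}
```

Ordinary iteration with begin() and end() visits the same elements in an unspecified order, which matches the "order is not important" requirement.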

Data structure name: combination array/linked list

I have come up with a data structure that combines some of the advantages of linked lists with some of the advantages of fixed-size arrays. It seems very obvious to me, and so I'd expect someone to have thought of it and named it already. Does anyone know what this is called:
Take a small fixed-size array. If the number of elements you want to put in your array is greater than the size of the array, add a new array and whatever pointers you like between the old and the new.
Thus you have:
Static array
—————————————————————————
|1|2|3|4|5|6|7|8|9|a|b|c|
—————————————————————————
Linked list
————  ————  ————  ————  ————
|1|*->|2|*->|3|*->|4|*->|5|*->NULL
————  ————  ————  ————  ————
My thing:
————————————  ————————————
|1|2|3|4|5|*->|6|7|8|9|a|*->NULL
————————————  ————————————
Edit: For reference, this algorithm provides pretty poor worst-case addition/deletion performance, and not much better average-case. The big advantage for my scenario is the improved cache performance for read operations.
Edit re bounty: Antal S-Z's answer was so complete and well-researched that I wanted to provide em with a bounty for it. Apparently Stack Overflow doesn't let me accept an answer as soon as I've offered a bounty, so I'll have to wait (admittedly I am abusing the intention of the bounty system somewhat, although it's in the name of rewarding someone for an excellent answer). Of course, if someone does manage to provide a better answer, more power to them, and they can most certainly have the bounty instead!
Edit re names: I'm not interested in what you'd call it, unless you'd call it that because that's what authorities on the subject would call it. If it's a name you just came up with, I'm not interested. What I want is a name that I can look up in text books and with Google. (Also, here's a tip: Antal's answer is what I was looking for. If your answer isn't "unrolled linked list" without a very good reason, it's just plain wrong.)
It's called an unrolled linked list. There appear to be a couple of advantages, one in speed and one in space. First, if the number of elements in each node is appropriately sized (e.g., at most the size of one cache line), you get noticeably better cache performance from the improved memory locality. Second, since you have O(n/m) links, where n is the number of elements in the unrolled linked list and m is the number of elements you can store in any node, you can also save an appreciable amount of space, which is particularly noticeable if each element is small. When constructing unrolled linked lists, apparently implementations will try to generally leave space in the nodes; when you try to insert in a full node, you move half the elements out. Thus, at most one node will be less than half full. And according to what I can find (I haven't done any analysis myself), if you insert things randomly, nodes tend to actually be about three-quarters full, or even fuller if operations tend to be at the end of the list.
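The split-on-full behaviour described above can be sketched as follows. This is a minimal illustration under assumptions of my own: a node capacity of 4 (kCap), int elements, insertion only, and the class name UnrolledList. A real implementation would size nodes to a cache line and would also handle deletion and merging.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

class UnrolledList {
    static constexpr std::size_t kCap = 4;  // assumed elements per node
    struct Node {
        std::array<int, kCap> items;
        std::size_t count = 0;
        std::unique_ptr<Node> next;
    };
    std::unique_ptr<Node> head_;
    std::size_t size_ = 0;

public:
    std::size_t size() const { return size_; }

    // Inserts value so that it ends up at index pos (0 <= pos <= size()).
    void insert(std::size_t pos, int value) {
        assert(pos <= size_);
        if (!head_) head_.reset(new Node());
        Node* n = head_.get();
        // Skip whole nodes until pos falls inside (or at the end of) one.
        while (pos > n->count) {
            pos -= n->count;
            n = n->next.get();
        }
        if (n->count == kCap) {
            // Full node: move the upper half into a fresh node linked after it.
            std::unique_ptr<Node> half(new Node());
            const std::size_t keep = kCap / 2;
            std::copy(n->items.begin() + keep, n->items.end(), half->items.begin());
            half->count = kCap - keep;
            n->count = keep;
            half->next = std::move(n->next);
            n->next = std::move(half);
            if (pos > n->count) {
                pos -= n->count;
                n = n->next.get();
            }
        }
        // Shift the tail of this one node right by one and place the value.
        std::copy_backward(n->items.begin() + pos, n->items.begin() + n->count,
                           n->items.begin() + n->count + 1);
        n->items[pos] = value;
        ++n->count;
        ++size_;
    }

    // Flattens the list into a vector, node by node.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (const Node* n = head_.get(); n; n = n->next.get()) {
            out.insert(out.end(), n->items.begin(), n->items.begin() + n->count);
        }
        return out;
    }
};
```

Finding the node for an index still walks the node chain, so positional insert is O(n / m) node hops plus at most m element moves, where m is the node capacity.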
And as everyone else (including Wikipedia) is saying, you might want to check out skip lists. Skip lists are a nifty probabilistic data structure used to store ordered data with O(log n) expected running time for insert, delete, and find. It's implemented by a "tower" of linked lists, each layer having fewer elements the higher up it is. On the bottom, there's an ordinary linked list, having all the elements. At each successive layer, there are fewer elements, by a factor of p (usually 1/2 or 1/4). The way it's built is as follows. Each time an element is added to the list, it's inserted into the appropriate place in the bottom row (this uses the "find" operation, which can also be made fast). Then, with probability p, it's inserted into the appropriate place in the linked list "above" it, creating that list if it needs to; if it was placed in a higher list, then it will again appear above with probability p. To query something in this data structure, you always check the top lane, and see if you can find it. If the element you see is too large, you drop to the next lowest lane and start looking again. It's sort of like a binary search. Wikipedia explains it very well, and with nice diagrams. The memory usage is going to be worse, of course, and you're not going to have the improved cache performance, but it is generally going to be faster.
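A compact sketch of the insert and find operations described above, assuming p = 1/2, int keys, a cap of 16 levels and a fixed RNG seed so that runs are repeatable. The class name SkipList and all of those constants are choices made for this example. Deletion is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

class SkipList {
    static constexpr int kMaxLevel = 16;  // assumed cap on tower height
    struct Node {
        int key;
        std::vector<Node*> next;  // next[i] is the successor on level i
        Node(int k, int levels) : key(k), next(levels, nullptr) {}
    };
    Node head_;       // sentinel present on every level
    int levels_ = 1;  // number of levels currently in use
    std::mt19937 rng_;

    // Height 1 always; each extra level is added with probability 1/2.
    int random_level() {
        int lvl = 1;
        while (lvl < kMaxLevel && (rng_() & 1u)) ++lvl;
        return lvl;
    }

public:
    SkipList() : head_(0, kMaxLevel), rng_(12345) {}
    SkipList(const SkipList&) = delete;
    SkipList& operator=(const SkipList&) = delete;
    ~SkipList() {
        for (Node* n = head_.next[0]; n != nullptr;) {
            Node* nx = n->next[0];
            delete n;
            n = nx;
        }
    }

    void insert(int key) {
        Node* update[kMaxLevel];  // last node before the key on each level
        Node* cur = &head_;
        for (int i = levels_ - 1; i >= 0; --i) {
            while (cur->next[i] && cur->next[i]->key < key) cur = cur->next[i];
            update[i] = cur;
        }
        const int lvl = random_level();
        for (int i = levels_; i < lvl; ++i) update[i] = &head_;
        if (lvl > levels_) levels_ = lvl;
        Node* n = new Node(key, lvl);
        for (int i = 0; i < lvl; ++i) {
            n->next[i] = update[i]->next[i];
            update[i]->next[i] = n;
        }
    }

    // Start on the top level; drop down whenever the next key is too large.
    bool contains(int key) const {
        const Node* cur = &head_;
        for (int i = levels_ - 1; i >= 0; --i) {
            while (cur->next[i] && cur->next[i]->key < key) cur = cur->next[i];
        }
        cur = cur->next[0];
        return cur != nullptr && cur->key == key;
    }

    // Bottom level holds every element in sorted order.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (const Node* n = head_.next[0]; n; n = n->next[0]) out.push_back(n->key);
        return out;
    }
};
```

The tower heights depend on the random draws, but the bottom level is always a complete sorted list, so the contents and the results of contains do not depend on the seed.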
References
“Unrolled Linked List”, http://en.wikipedia.org/wiki/Unrolled_linked_list
“Unrolled Linked Lists”, Link
“Skip List”, http://en.wikipedia.org/wiki/Skip_list
The skip list lecture(s) from my algorithms class.
CDR coding (if you're old enough to remember Lisp Machines).
Also see ropes which is a generalization of this list/array idea for strings.
I would call this a bucket list.
While I don't know your task, I would strongly suggest you have a look at skip lists.
As for a name, I'm thinking "bucket list" would probably be most apropos.
You can call it LinkedArrays.
Also, I would like to see the pseudo-code for the removeIndex operation.
What are the advantages of this data structure in terms of insertion and deletion?
Ex:
What if you want to add an element between 3 and 4? You still have to do a shift, which takes O(N).
How do you find out the correct bucket for elementAt?
I agree with jer: you should take a look at skip lists. They bring together the advantages of linked lists and arrays. Most operations are done in O(log N).

Hash tables using VLists

Phil Bagwell, in his 2002 paper on the VList data structure, indicates that you can use a VList to implement a persistent hash table. However, his explanation of how that worked didn't include much detail, and I don't understand it. Can anybody give me a more detailed explanation, or even examples?
Further, it appears to me from what I can see that this data structure, while it may have the same big-O complexity as a Hashtable, will be slower because it does additional lookups. Does anybody care to do a detailed analysis of just how much slower, preferably including cache behaviour? How does the performance relationship between the two change in the case of having no collisions or many?
I had a look at this paper, and it appears very preliminary. The fact that no later version has been published, and that the original appeared in IFL (which is a work-in-progress sort of meeting), suggests that you may be wasting your time.
Hrmm there seem to be a number of issues with the data structures proposed by the paper in question.
Off the cuff, the naive vlists mentioned first seem to need unique references in order to get anything near the time guarantees proposed. For the most part you lose the ability to share tails. You can share the tiny nodes towards the back of the list, but you wind up having to duplicate the largest vlist node the moment you cons something onto the cdr of a vlist that is still active. That cost is proportional to the cost of copying the whole list.
With the 2d modifications mentioned later it becomes constant again, but it's a pretty large constant, since you wind up at least copying the head of a list of pages (or worse, a vlist) and the first page in your list.
The functional hash list stuff in there didn't seem to make much sense to me to be honest. It was just a brief blurb that seemed to be bolted onto an otherwise unrelated paper, without enough detail to really make out how practical it is.

Resources