Data structure needed - data-structures

After doing some thought I came to the conclusion that I require a data structure that supports:
Insert
Remove
Find
Delete minimum
of course I want to implement this in the best complexity I can.
My thoughts are that a Self-balancing binary search tree will do A-D in O(log(n)) (worst case).
Maybe this can be improved somehow so A-C will be in O(log(n)) and D (that I think will be more frequent) will run in O(1).
I do a worst case analysis, but if you can think of something that will run 'fast' but it's Amortized analysis or on average than it's no problem.
any improvement to what I have in mind is welcomed!
(note: I believe that A and D will be much more frequent that B and C)

It needs to be some sort of sorted, balanced tree. It is not likely that any tree will be significantly better suited for the minimum deletion, as it will still require re-balancing anyway. All of the operations you ask for will be O(log(n)). Red-black trees are readily available in C++ and Java.

What you’re describing is a priority queue, augmented by a “find” operation.
It is usually implemented in terms of a min-heap. All operations you listed, except “find”, run in O(log n), and it is notably the most efficient overall data structure for this job. It is important to note that this is a special case of a binary tree that can be implemented much more efficiently than a general binary search tree, both in terms of memory consumption and performance (same asymptotic performance but much better constant factors).
Unfortunately, “find” still takes O(n).
It is implemented in Java in the PriorityQueue class.

Related

Splay Tree Worst Case Search Time

Since splay tree is a type of unbalanced binary search tree (brilliant.org/wiki/splay-tree), it cannot guarantee a height of at most O(log(n)). Thus, I would think it cannot guarantee a worst case search time of O(log(n)).
But according to bigocheatsheet.com:
Splay Tree has worst case search time of O(log(n))???
You’re correct; the cost of a lookup in a splay tree can reach Θ(n) for an imbalanced tree.
Many resources like the big-O cheat sheet either make simplifying assumptions or just have factually incorrect data in them. It’s unclear whether they were just wrong here, or whether they were talking amortized worst case, etc.
It’s always best to know the internals of the data structures you’re working with so that you can understand where the runtimes come from.

time complexity for binary search trees

If I use an insert() function for my bst, the time complexity can be as bad as O(n) and as good as O(log n). I'm assumng that if I had a perfectly balanced tree, the time complexity is log n because I am able to ignore half of the tree every time I go down a "branch". And if my tree is completely unbalanced it would be O(n). Am I correct for thinking this?
Yes, that is correct, see e.g. wikipedia, http://en.wikipedia.org/wiki/Binary_search_tree#Searching.
If you use e.g. C++ STL std::map or std::set, you get a red-black, balanced tree. Also worth noting is that with these STL data structures, you get this performance 100% of the time, which can be very important in e.g. hard real-time systems. Hash tables are even faster, but are not fast a 100% of the time like the red-black trees.

Implementation of priority queue by AVL Tree data structure

Priority queue:
Basic operations: Insertion
Delete (Delete minumum element)
Goal: To provide efficient running time or order of growth for above functionality.
Implementation of Priority queue By:
Linked List: Insertion will take o(n) in case of insertion at end o(1) in case of
insertion at head.
Delet (Finding minumum and Delete this ) will take o(n)
BST:
Insertion/Deltion of minimum = In avg case it will take o(logn) worst case 0(n)
AVL Tree:
Insertion/deletion/searching: o(log n) in all cases.
My confusion goes here:
Why not we have used AVL Tree for implementation of Priority queue, Why we gone
for Binary heap...While as we know that in AVL Tree we can do insertion/ Deletion/searching in o(log n) in worst case.
Complexity isn't everything, there are other considerations for actual performance.
For most purposes, most people don't even use an AVL tree as a balanced tree (Red-Black trees are more common as far as I've seen), let alone as a priority queue.
This is not to say that AVL trees are useless, I quite like them. But they do have a relatively expensive insert. What AVL trees are good for (beating even Red-Black trees) is doing lots and lots of lookups without modification. This is not what you need for a priority queue.
As a separate consideration -- never mind your O(log n) insert for a binary heap, a fibonacci heap has O(1) insert and O(log N) delete-minimum. There are a lot of data structures to choose from with slightly different trade-offs, so you wouldn't expect to see everyone just pick the first thing that satisfies your (quite brief) criteria.
Binary heap is not Binary Search Tree (BST). If severely unbalanced / deteriorated into a list, it will indeed take O(n) time. Heaps are usually always O(log(n)) or better. IIRC Sedgewick claimed O(1) average-time for array-based heaps.
Why not AVL? Because it maintains too much order in a structure. Too much order means, too much effort went into maintaining that order. The less order we can get away with, the better - it will usually translate to faster operations. For example, RBTs are better than AVL trees. RBTs, red-black trees, are almost balanced trees - they save operations while still ensuring O(log(n)) time.
But any tree is totally-ordered structure, so heaps are generally better, because they only ensure that the minimal element is on top. They are only partially ordered.
Because in a binary heap the minimum element is the root.

Hash Table v/s Trees

Are hashtables always faster than trees? Though Hashtables have O(1) search complexity but suppose if due to badly designed hash function lot of collisions happen and if we handle collisions using chained structure (say a balanced tree) then the worst case running time for search would be O(log n). So can I conclude for big or small data sets even in case of worst case scenarios hash tables will always be faster than trees? Also If I have ample memory and I dont want range searches can I always go for a hash table?
Are hashtables always faster than trees?
No, not always. This depends on many things, such as the size of the collection, the hash function, and for some hash table implementations - also the number of delete ops.
hash-tables are O(1) per op on average - but this is not always the case. They might be O(n) in worst cases.
Some reasons I can think of at the moment to prefer trees:
Ordering is important. [hash-tables are not maintaining order, BST is sorted by definition]
Latency is an issue - and you cannot suffer the O(n) that might occur. [This might be critical for real-time systems]
Ther data might be "similar" related to your hash function, and many elements hashed to the same locations [collisions] is not unprobable. [this can be sometimes solved by using a different hash function]
For relatively small collections - many times the hidden constant between hashtable's O(1) is much higher then the tree's - and using a tree might be faster for small collections.
However - if the data is huge, latency is not an issue and collisions are unprobable - hash-tables are asymptotically better then using a tree.
If due to badly designed hash function lot of collisions happen and if we handle collisions using chained structure (say a balanced tree) then the worst case running time for search would be O(n) (not O(log n)). Therefore you cannot conclude for big or small data sets even in case of worst case scenarios hash tables will always be faster than trees.
Use hashtable, and init it with the proper dimension. For example if you use only half space the collisions are very few.
In worst case scenario you'll have O(n) time in hast-tables. But this is a billions less probable then sun exploding write now, so when using a good hash-function you can safely assume it works in O(1) unless sun explodes.
On the other hand, performance of both Hash-Tables and Trees can vary on implementation, language, and phase of the moon, so the only good answer to this question is "Try both, think and pick better".

Using red black trees for sorting

The worst-case running time of insertion on a red-black tree is O(lg n) and if I perform a in-order walk on the tree, I essentially visit each node, so the total worst-case runtime to print the sorted collection would be O(n lg n)
I am curious, why are red-black trees not preferred for sorting over quick sort (whose average-case running time is O(n lg n).
I see that maybe because red-black trees do not sort in-place, but I am not sure, so maybe someone could help.
Knowing which sort algorithm performs better really depend on your data and situation.
If you are talking in general/practical terms,
Quicksort (the one where you select the pivot randomly/just pick one fixed, making worst case Omega(n^2)) might be better than Red-Black Trees because (not necessarily in order of importance)
Quicksort is in-place. The keeps your memory footprint low. Say this quicksort routine was part of a program which deals with a lot of data. If you kept using large amounts of memory, your OS could start swapping your process memory and trash your perf.
Quicksort memory accesses are localized. This plays well with the caching/swapping.
Quicksort can be easily parallelized (probably more relevant these days).
If you were to try and optimize binary tree sorting (using binary tree without balancing) by using an array instead, you will end up doing something like Quicksort!
Red-Black trees have memory overheads. You have to allocate nodes possibly multiple times, your memory requirements with trees is doubles/triple that using arrays.
After sorting, say you wanted the 1045th (say) element, you will need to maintain order statistics in your tree (extra memory cost because of this) and you will have O(logn) access time!
Red-black trees have overheads just to access the next element (pointer lookups)
Red-black trees do not play well with the cache and the pointer accesses could induce more swapping.
Rotation in red-black trees will increase the constant factor in the O(nlogn).
Perhaps the most important reason (but not valid if you have lib etc available), Quicksort is very simple to understand and implement. Even a school kid can understand it!
I would say you try to measure both implementations and see what happens!
Also, Bob Sedgewick did a thesis on quicksort! Might be worth reading.
There are plenty of sorting algorithms which are worst case O(n log n) - for example, merge sort. The reason quicksort is preferred is because it is faster in practice, even though algorithmically it may not be as good as some other algorithms.
Often in-built sorts use a combination of various methods depending on the values of n.
There are many cases where red-back trees are not bad for sorting. My testing showed, compared to natural merge sort, that red-black trees excel where:
Trees are better for Dups:
All the tests where dups need to be eleminated, tree algorithm is better. This is not astonishing, since the tree can be kept very small from the beginning, whereby algorithms that are designed for inline array sort might pass around larger segments for a longer time.
Trees are better for Random:
All the tests with random, tree algorithm is better. This is also not astonishing, since in a tree distance between elements is shorter and shifting is not necessary. So repeatedly inserting into a tree could need less effort than sorting an array.
So we get the impression that the natural merge sort only excels in ascending and descending special cases. Which cant be even said for quick sort.
Gist with the test cases here.
P.S.: it should be noted that using trees for sorting is non-trivial. One has not only to provide an insert routine but also a routine that can linearize the tree back to an array. We are currently using a get_last and a predecessor routine, which doesn't need a stack. But these routines are not O(1) since they contain loops.
Big-O time complexity measures do not usually take into account scalar factors, e.g., O(2n) and O(4n) are usually just reduced to O(n). Time complexity analysis is based on operational steps at an algorithmic level, not at a strict programming level, i.e., no source code or native machine instruction considerations.
Quicksort is generally faster than tree-based sorting since (1) the methods have the same algorithmic average time complexity, and (2) lookup and swapping operations require fewer program commands and data accesses when working with simple arrays than with red-black trees, even if the tree uses an underlying array-based implementation. Maintenance of the red-black tree constraints requires additional operational steps, data field value storage/access (node colors), etc than the simple array partition-exchange steps of a quicksort.
The net result is that red-black trees have higher scalar coefficients than quicksort does that are being obscured by the standard O(n log n) average time complexity analysis result.
Some other practical considerations related to machine architectures are briefly discussed in the Quicksort article on Wikipedia
Generally, representations of O(nlgn) algorithms can be expanded to A*nlgn + B where A and B are constants. There are many algorithmic proofs that show the coefficients for quicksort are smaller than those of other algorithms. That is in best-case (quick sort performs horribly on sorted data).
Hi the best way to explain the difference between all sorting routine in my opinion is.
(My answer is for people who are confused how quick sort is faster in practice than another sorting algo).
"Think u are running on a very slow computer".
First thing one comparing operation takes 1 hour.
One shifting operation takes 2 hours.
"I am using hour just to make people understand how important time is".
Now from all the sorting operations quick-sort have very very less comparisons and very less swapping for elements.
Quick-sort is faster for this main reason.

Resources