Most efficient data structure for inserting and sorting

I need an efficient data structure for storing elements: I need to insert them and occasionally retrieve them in sorted order. Keeping the order after every insert is not necessary, and I expect sorts to be much less frequent than inserts.
I was considering red-black trees, but I'm not sure how fast inserting a node in an RB tree is (compared to inserting it in a list, for example); on the other hand, sorting with an RB tree is much more time-efficient.
What data structure is the most efficient for that?
Thanks for your time

I'm not sure how fast it is inserting a node in an RB tree (compared to inserting it in a list for example)
Insertion has this average time complexity:
RB tree (also worst case): O(logn)
Sorted list: O(n)
Unsorted list: O(1)
Traversing all values in sorted order:
RB tree: O(n)
Sorted list: O(n)
Unsorted list: O(nlogn)
So as the data size grows, insertion will eventually run faster on an RB tree than on a sorted list, although for small sizes the list can be faster (as it has less constant overhead). The actual tipping point will depend on implementation aspects, including the programming language and the structure of the values to compare. But insertion into an unsorted list will of course outperform both.
Sorting on demand, as an unsorted list requires, has a cost, but it is "only" O(nlogn) compared to O(n) for traversing a structure that is already sorted. So if sorting doesn't have to happen that frequently, it may be a viable option. Again, the tipping point, where the overall running time of several inserts and sorts becomes faster than the alternatives, depends on implementation aspects.
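To make the comparison concrete, here is a minimal Python sketch of the two list-based strategies. The function names are made up for illustration; `bisect.insort` and `list.sort` are standard library calls. It only shows the shape of each approach and is not a benchmark.

```python
import bisect

def build_with_insort(values):
    """Keep the list sorted after every insert.

    Each insert does a binary search (O(log n) comparisons) and then
    shifts elements to make room (O(n) moves).
    """
    result = []
    for v in values:
        bisect.insort(result, v)
    return result

def build_then_sort(values):
    """Append in amortized O(1), then sort once on demand in O(n log n)."""
    result = []
    for v in values:
        result.append(v)
    result.sort()
    return result
```

Both functions return the same sorted list; which one is faster overall depends on how often a sorted view is needed between inserts.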
What data structure is the most efficient for that?
In practice I have found B+ trees to be a good choice for fast insertion. Just like RB trees they have O(logn) insertion time, but one can tune the data structure with varying block sizes, trying to find out which one works best for your actual case. This is not possible with RB trees. Also B+ trees have the sorted list sitting in a linked list of sorted blocks, so iteration in sorted order is trivial. Nothing much is going to beat the speed of that.
Another interesting alternative is a skip list. It resembles a B+ tree a bit, but its operations are easier to implement. It uses a constant factor more memory (same asymptotic complexity), because it has no blocks and needs more pointers.
Which one will work the best depends on implementation/platform factors. In the end you'll want to implement some alternatives and compare them with benchmark tests.

Related

Why "delete" operation is considered to be "slow" on a sorted array?

I am currently studying algorithms and data structures with the help of the famous Stanford course by Tim Roughgarden. In video 13-1 when explaining Balanced Binary Search Trees he compared them to sorted arrays and mentioned that we do not do deletion on sorted array because it is too slow (I believe he meant "slow in comparison with other operations, that we can run in constant [Select, Min/Max, Pred/Succ], O(log n) [Search, Rank] and O(n) [Output/print] time").
I cannot stop thinking about this statement. Namely I cannot wrap my mind around the following:
Let's say we are given an order statistic or a value of the item we
want to delete from a sorted (ascending) array.
We can most certainly find its position in array using Select or
Search in constant or O(log n) time respectively.
We can then remove this item and iterate over the items to the right
of the deleted one, decrementing their indices by one, which will take
O(n) time. [this is me (possibly unsuccessfully) trying to describe
the 'move each of them 1 position to the left' operation]
The whole operation will take linear time - O(n) - in the worst case
scenario.
Key question - Am I thinking in a wrong way? If not, why is it considered slow and undesirable?
You are correct: deleting from an array is slow because you have to move all elements after it one position to the left, so that you can cover the hole you created.
Whether O(n) is considered slow depends on the situation. Deleting from an array is most likely part of a larger, more complex algorithm, e.g. inside a loop. This then would add a factor of n to your final complexity, which is usually bad. Using a tree would only add a factor of log n, and O(n log n) is much better than O(n^2) (asymptotically).
The statement is relative to the specific data structure which is being used to hold the sorted values: A sorted array. This specific data structure would be selected for simplicity, for efficient storage, and for quick searches, but is slow for adding and removing elements from the data structure.
Other data structures which hold sorted values may be selected. For example, a binary tree, or a balanced binary tree, or a trie. Each has different characteristics in terms of operation performance and storage efficiency, and would be selected based on the intended usage.
A sorted array is slow for additions and removals because, on average, these operations require shifting half of the array to make room for a new element (or, respectively, to fill in an emptied cell).
However, on many architectures, the simplicity of the data structure and the speed of shifting means that the data structure is fine for "small" data sets.
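The delete-and-shift operation discussed above can be sketched in Python as follows. The function name is hypothetical, and the explicit loop is written out only to make the O(n) shifting visible (in practice `del arr[i]` does the same shifting internally).

```python
import bisect

def delete_from_sorted(arr, value):
    """Remove one occurrence of value from a sorted list in place.

    Returns the number of elements that had to be shifted left.
    """
    i = bisect.bisect_left(arr, value)          # O(log n) search
    if i == len(arr) or arr[i] != value:
        raise ValueError(f"{value!r} not found")
    # Move every element right of the hole one position to the left: O(n).
    for j in range(i, len(arr) - 1):
        arr[j] = arr[j + 1]
    arr.pop()                                   # drop the now-duplicated last slot
    return len(arr) - i
```

Deleting the first element shifts all the others, while deleting the last one shifts nothing, which is why the cost is linear in the worst case.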

Complexity of maintaining a sorted list vs inserting all values then sorting

Would the time and space complexity to maintain a list of numbers in sorted order (i.e start with the first one insert it, 2nd one comes along you insert it in sorted order and so on ..) be the same as inserting them as they appear and then sorting after all insertions have been made?
How do I make this decision? Can you demonstrate in terms of time and space complexity for 'n' elements?
I was thinking in terms of phonebook, what is the difference of storing it in a set and presenting sorted data to the user each time he inserts a record into the phonebook VS storing the phonebook records in a sorted order in a treeset. What would it be for n elements?
Every time you insert into a sorted list and maintain its sortedness, it is O(logn) comparisons to find where to place it but O(n) movements to place it. Since we insert n elements this is O(n^2). But, I think that if you use a data structure that is designed for inserting sorted data into (such as a binary tree) then do a pass at the end to turn it into a list/array, it is only O(nlogn). On the other hand, using such a more complex data structure will use about O(n) additional space, whereas all other approaches can be done in-place and use no additional space.
Every time you insert into an unsorted list it is O(1). Sorting it all at the end is O(nlogn). This means overall it is O(nlogn).
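As a rough illustration of where the O(n^2) comes from, the following Python sketch counts how many elements get moved while a sorted list is maintained insert by insert. The function name is invented for this example.

```python
import bisect

def count_shifts_insertion(values):
    """Insert values one by one into a sorted list.

    Returns the sorted list and the total number of elements that had
    to be moved to the right to make room.
    """
    sorted_list = []
    shifts = 0
    for v in values:
        pos = bisect.bisect_right(sorted_list, v)   # O(log n) comparisons
        shifts += len(sorted_list) - pos            # elements right of pos move
        sorted_list.insert(pos, v)
    return sorted_list, shifts
```

For input in descending order every insert lands at position 0, so n inserts move 0 + 1 + ... + (n-1) = n(n-1)/2 elements; for input that is already ascending, nothing moves at all.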
However, if you are not going to make lists of many elements (1000 or less) it probably doesn't matter what big-O it is, and you should either focus on what runs faster for small data sets, or not worry at all if it is not a performance issue.
It depends on what data structure you are inserting them into. If you are asking about inserting into an array, the answer is no. It takes O(n) space and time to store the n elements, and then O(n log n) to sort them, so O(n log n) total. Keeping the array sorted, by contrast, may require moving Ω(n) elements per insertion, so it takes Θ(n^2) in the worst case. The same problem arises with most "sequential" data structures. Sorry.
On the other hand, some priority queues such as lazy leftist heaps, Fibonacci heaps, and Brodal queues have O(1) insert. A finger tree, meanwhile, gives O(log n) per insert (O(n log n) for all n elements) AND linear-time sequential access (finger trees are as good as a linked list for what a linked list is good for and as good as balanced binary search trees for what binary search trees are good for; they are kind of amazing).
There are going to be application-specific trade-offs to algorithm selection. The reasons one might use an insertion sort rather than some kind of offline sorting algorithm are enumerated on the Insertion Sort wikipedia page.
The determining factor here is less likely to be asymptotic complexity and more likely to be what you know about your data (e.g., is it likely to be already sorted?)
I'd go further, but I'm not convinced that this isn't a homework question asked verbatim.
Option 1
Insert at correct position in sorted order.
Time taken to find the position for the (i+1)-th element: O(log i)
Time taken to insert and maintain order for the (i+1)-th element: O(i)
Space complexity: O(N)
Total time: (log 1 + 1) + (log 2 + 2) + ... + (log(N-1) + (N-1)) = O(N^2)
Understand that this is just the time complexity. The running time can be very different from this.
Option 2:
Insert element: O(1)
Sort elements: O(N log N)
Depending on the sort you employ the space complexity varies, though you can use something like quicksort, which doesn't need much space anyway.
In conclusion, option 2 has the better worst-case bound, but these bounds say little about constant factors. Also note that worst-case complexity may never be encountered in practical situations; probably you will see only average cases all the time. If performance is such a vital issue in your application, you should test both sets of code on random samples. Do tell me which one works faster after your tests. My guess is that option 1 can still win on small inputs.

Fastest data structure for inserting/sorting

I need a data structure that can insert elements and sort itself as quickly as possible. I will be inserting a lot more than sorting. Deleting is not much of a concern and neither is space. My specific implementation will additionally store nodes in an array, so lookup will be O(1), i.e. you don't have to worry about it.
If you're inserting a lot more than sorting, then it may be best to use an unsorted list/vector, and quicksort it when you need it sorted. This keeps inserts very fast. The one drawback [1] is that sorting is a comparatively lengthy operation, since it's not amortized over the many inserts. If you depend on relatively constant time, this can be bad.
[1] Come to think of it, there's a second drawback. If you underestimate your sort frequency, this could quickly end up being overall slower than a tree or a sorted list. If you sort after every insert, for instance, then the insert+quicksort cycle would be a bad idea.
Just use one of the self-balanced binary search trees, such as red-black tree.
Use any of the Balanced binary trees like AVL trees. It should give O(lg N) time complexity for both of the operations you are looking for.
If you don't need random access into the array, you could use a Heap.
Worst and average time complexity:
O(log N) insertion
O(1) read largest value
O(log N) to remove the largest value
Can be reconfigured to give smallest value instead of largest. By repeatedly removing the largest/smallest value you get a sorted list in O(N log N).
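A small Python sketch of this heap approach, using the standard `heapq` module. Note that `heapq` is a min-heap, so it yields the smallest value first, which is the "reconfigured" variant mentioned above; the function name is made up for the example.

```python
import heapq

def heap_sorted(values):
    """Insert everything into a binary heap, then pop repeatedly.

    Each push and each pop is O(log N), so the whole thing is O(N log N).
    """
    heap = []
    for v in values:
        heapq.heappush(heap, v)      # O(log N) insertion
    # heap[0] is always the smallest value (O(1) read).
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

The heap itself is never fully sorted; only the repeated removal produces the sorted sequence, and it consumes the heap in the process.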
If you can do a lot of inserts before each sort then obviously you should just append the items and sort no sooner than you need to. My favorite is merge sort. That is O(N*Log(N)), is well behaved, and has a minimum of storage manipulation (new, malloc, tree balancing, etc.)
HOWEVER, if the values in the collection are integers and reasonably dense, you can use an O(N) sort, where you just use each value as an index into a big-enough array, and set a boolean TRUE at that index. Then you just scan the whole array and collect the indices that are TRUE.
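The boolean-array idea can be sketched like this in Python. It assumes non-negative integers no larger than a known `max_value` (a parameter introduced for this example), and because it only stores TRUE/FALSE per index, duplicates collapse into a single entry.

```python
def bitmap_sort(values, max_value):
    """Sort distinct non-negative integers in the range 0..max_value.

    Runs in O(N + max_value) time and uses O(max_value) extra space.
    """
    present = [False] * (max_value + 1)
    for v in values:
        present[v] = True            # use the value itself as an index
    # Scan the whole array and collect the indices that are set.
    return [i for i, flag in enumerate(present) if flag]
```

If duplicates must be preserved, the booleans would have to become counters (a counting sort), which is a variation not described in the answer above.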
You say you're storing items in an array where lookup is O(1). Unless you're using a hash table, that suggests your items may be dense integers, so I'm not sure if you even have a problem.
Regardless, memory allocating/deleting is expensive, and you should avoid it by pre-allocating or pooling if you can.
I have had good experience using a skip list for that kind of task.
At least in my case it was about 5 times faster compared to adding everything to a list first and then running a sort over it at the end.
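For readers who have not seen one, here is a minimal skip list sketch in Python supporting only insertion and in-order iteration. It is not the implementation the answer above measured; the class names, the level cap of 16, and the promotion probability of 0.5 are all assumptions chosen for illustration.

```python
import random

class _Node:
    __slots__ = ("key", "forward")

    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level    # forward[i] = next node on level i

class SkipList:
    MAX_LEVEL = 16                       # assumed cap on tower height

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._level = 1                  # number of levels currently in use
        self._head = _Node(None, self.MAX_LEVEL)   # sentinel, key never compared

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        # update[i] = last node on level i whose key is < key
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for lvl in range(self._level - 1, -1, -1):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        new_level = self._random_level()
        self._level = max(self._level, new_level)
        new_node = _Node(key, new_level)
        for lvl in range(new_level):     # splice into each level's linked list
            new_node.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new_node

    def __iter__(self):
        node = self._head.forward[0]     # level 0 links every node in order
        while node is not None:
            yield node.key
            node = node.forward[0]
```

Insertion takes O(log n) expected time, and iterating in sorted order is a plain walk along the bottom-level linked list.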

Using red black trees for sorting

The worst-case running time of insertion on a red-black tree is O(lg n), and if I perform an in-order walk on the tree, I essentially visit each node, so the total worst-case runtime to print the sorted collection would be O(n lg n).
I am curious: why are red-black trees not preferred for sorting over quicksort (whose average-case running time is O(n lg n))?
I see that maybe because red-black trees do not sort in-place, but I am not sure, so maybe someone could help.
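The procedure described in the question (insert everything, then do an in-order walk) looks roughly like the Python sketch below. To keep it short it uses a plain binary search tree without any balancing, so it is not a red-black tree and its worst case is O(n^2); a red-black tree would add recoloring and rotations to `bst_insert`. All names are invented for this example.

```python
class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    """Iteratively insert key into a non-balancing BST; returns the root."""
    new = Node(key)
    if root is None:
        return new
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = new
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = new
                return root
            node = node.right

def in_order(root):
    """Yield keys in sorted order using an explicit stack: O(n) total."""
    stack, node = [], root
    while stack or node is not None:
        while node is not None:          # descend as far left as possible
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.key
        node = node.right

def tree_sort(values):
    root = None
    for v in values:
        root = bst_insert(root, v)
    return list(in_order(root))
```

Note that the sorted output is a new list and every element needs its own node, which is the "not in-place" concern raised in the question.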
Knowing which sort algorithm performs better really depends on your data and situation.
If you are talking in general/practical terms,
Quicksort (the one where you select the pivot randomly/just pick one fixed, making worst case Omega(n^2)) might be better than Red-Black Trees because (not necessarily in order of importance)
Quicksort is in-place. This keeps your memory footprint low. Say this quicksort routine was part of a program which deals with a lot of data. If you kept using large amounts of memory, your OS could start swapping your process memory and trash your performance.
Quicksort memory accesses are localized. This plays well with the caching/swapping.
Quicksort can be easily parallelized (probably more relevant these days).
If you were to try and optimize binary tree sorting (using binary tree without balancing) by using an array instead, you will end up doing something like Quicksort!
Red-black trees have memory overheads. You have to allocate nodes, possibly multiple times, and your memory requirements with trees are double or triple those of arrays.
After sorting, say you wanted the 1045th (say) element, you will need to maintain order statistics in your tree (extra memory cost because of this) and you will have O(logn) access time!
Red-black trees have overheads just to access the next element (pointer lookups)
Red-black trees do not play well with the cache and the pointer accesses could induce more swapping.
Rotation in red-black trees will increase the constant factor in the O(nlogn).
Perhaps the most important reason (but not valid if you have lib etc available), Quicksort is very simple to understand and implement. Even a school kid can understand it!
I would say you try to measure both implementations and see what happens!
Also, Bob Sedgewick did a thesis on quicksort! Might be worth reading.
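For comparison with the tree-based approach, here is a minimal in-place quicksort sketch in Python with a randomly chosen pivot. It uses the Lomuto partition scheme, which is one common variant picked here for brevity; the answer above does not specify a particular scheme.

```python
import random

def quicksort(a, lo=0, hi=None):
    """Sort list a in place between indices lo and hi (inclusive)."""
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        # Pick a random pivot and move it to the end of the range.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):          # partition: smaller elements go left
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]        # pivot lands in its final position
        # Recurse into the smaller side, loop on the larger one,
        # so the recursion depth stays O(log n).
        if i - lo < hi - i:
            quicksort(a, lo, i - 1)
            lo = i + 1
        else:
            quicksort(a, i + 1, hi)
            hi = i - 1
```

Everything happens inside the one array through swaps of neighbouring regions, with no node allocation and no pointers to follow, which is the locality and memory argument made in the list above.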
There are plenty of sorting algorithms which are worst case O(n log n) - for example, merge sort. The reason quicksort is preferred is because it is faster in practice, even though algorithmically it may not be as good as some other algorithms.
Often in-built sorts use a combination of various methods depending on the values of n.
There are many cases where red-black trees are not bad for sorting. My testing showed, compared to natural merge sort, that red-black trees excel where:
Trees are better for Dups:
In all the tests where dups need to be eliminated, the tree algorithm is better. This is not astonishing, since the tree can be kept very small from the beginning, whereas algorithms that are designed for inline array sort might pass around larger segments for a longer time.
Trees are better for Random:
In all the tests with random input, the tree algorithm is better. This is also not astonishing, since in a tree the distance between elements is shorter and shifting is not necessary. So repeatedly inserting into a tree could need less effort than sorting an array.
So we get the impression that the natural merge sort only excels in the ascending and descending special cases, which can't even be said for quicksort.
Gist with the test cases here.
P.S.: it should be noted that using trees for sorting is non-trivial. One has not only to provide an insert routine but also a routine that can linearize the tree back to an array. We are currently using a get_last and a predecessor routine, which doesn't need a stack. But these routines are not O(1) since they contain loops.
Big-O time complexity measures do not usually take into account scalar factors, e.g., O(2n) and O(4n) are usually just reduced to O(n). Time complexity analysis is based on operational steps at an algorithmic level, not at a strict programming level, i.e., no source code or native machine instruction considerations.
Quicksort is generally faster than tree-based sorting since (1) the methods have the same algorithmic average time complexity, and (2) lookup and swapping operations require fewer program commands and data accesses when working with simple arrays than with red-black trees, even if the tree uses an underlying array-based implementation. Maintenance of the red-black tree constraints requires additional operational steps, data field value storage/access (node colors), etc than the simple array partition-exchange steps of a quicksort.
The net result is that red-black trees have higher scalar coefficients than quicksort does that are being obscured by the standard O(n log n) average time complexity analysis result.
Some other practical considerations related to machine architectures are briefly discussed in the Quicksort article on Wikipedia
Generally, representations of O(nlgn) algorithms can be expanded to A*nlgn + B where A and B are constants. There are many algorithmic proofs that show the coefficients for quicksort are smaller than those of other algorithms. That is in the best case, though (quicksort with a naive pivot choice performs horribly on sorted data).
Hi, the best way to explain the difference between sorting routines, in my opinion, is this.
(My answer is for people who are confused about how quicksort is faster in practice than other sorting algorithms.)
"Think you are running on a very slow computer."
First, one comparison operation takes 1 hour.
One shifting operation takes 2 hours.
"I am using hours just to make people understand how important time is."
Now, of all the sorting algorithms, quicksort does very few comparisons and very few element swaps.
Quicksort is faster for this main reason.

Best self-balancing BST for quick insertion of a large number of nodes

I've been able to find details on several self-balancing BSTs through several sources, but I haven't found any good descriptions detailing which one is best to use in different situations (or if it really doesn't matter).
I want a BST that is optimal for storing in excess of ten million nodes. The order of insertion of the nodes is basically random, and I will never need to delete nodes, so insertion time is the only thing that would need to be optimized.
I intend to use it to store previously visited game states in a puzzle game, so that I can quickly check if a previous configuration has already been encountered.
Red-black is better than AVL for insertion-heavy applications. If you foresee relatively uniform look-up, then Red-black is the way to go. If you foresee a relatively unbalanced look-up where more recently viewed elements are more likely to be viewed again, you want to use splay trees.
Why use a BST at all? From your description a dictionary will work just as well, if not better.
The only reason for using a BST would be if you wanted to list out the contents of the container in key order. It certainly doesn't sound like you want to do that, in which case go for the hash table. O(1) insertion and search, no worries about deletion, what could be better?
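A sketch of the hash-table suggestion for the visited-states use case, in Python. It assumes game states can be represented as hashable values such as tuples; `explore`, `neighbors`, and `adjacent_swaps` are hypothetical names, and the toy puzzle (swapping adjacent tiles in a row) merely stands in for the real game.

```python
def explore(start, neighbors):
    """Breadth-first search that records every reachable state once.

    `neighbors` is a caller-supplied function yielding successor states.
    """
    visited = {start}                    # hash set: expected O(1) insert/lookup
    frontier = [start]
    while frontier:
        next_frontier = []
        for state in frontier:
            for nxt in neighbors(state):
                if nxt not in visited:   # the "already encountered?" check
                    visited.add(nxt)
                    next_frontier.append(nxt)
        frontier = next_frontier
    return visited

def adjacent_swaps(state):
    """Toy move generator: swap any two adjacent positions of a tuple."""
    for i in range(len(state) - 1):
        s = list(state)
        s[i], s[i + 1] = s[i + 1], s[i]
        yield tuple(s)
```

With ten million states the memory needed per stored state will likely matter more than the choice between a tree and a hash table, so a compact state encoding is worth considering either way.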
The two self-balancing BSTs I'm most familiar with are red-black and AVL, so I can't say for certain if any other solutions are better, but as I recall, red-black has faster insertion and slower retrieval compared to AVL.
So if insertion is a higher priority than retrieval, red-black may be a better solution.
[hash tables have] O(1) insertion and search
I think this is wrong.
First of all, if you limit the keyspace to be finite, you could store the elements in an array and do an O(1) linear scan. Or you could shufflesort the array and then do a linear scan in O(1) expected time. When stuff is finite, stuff is easily O(1).
So let's say your hash table will store any arbitrary bit string; it doesn't much matter, as long as there's an infinite set of keys, each of which is finite. Then you have to read all the bits of any query and insertion input, else I insert y0 in an empty hash and query on y1, where y0 and y1 differ at a single bit position which you don't look at.
But let's say the key lengths are not a parameter. If your insertion and search take O(1), then in particular hashing takes O(1) time, which means that you only look at a finite amount of output from the hash function (from which there's likely to be only a finite output, granted).
This means that with finitely many buckets, there must be an infinite set of strings which all have the same hash value. Suppose I insert a lot, i.e. ω(1), of those, and start querying. This means that your hash table has to fall back on some other O(1) insertion/search mechanism to answer my queries. Which one, and why not just use that directly?

Resources