Sorting an array - performance

I want to know which approach is better for sorting an array of elements.
For good performance, is it better to sort the array once at the end, after I have finished filling it? Or is it better to sort it each time I add an element?

Ignoring the application-specific information for a moment, consider that sorted insertion requires, worst case, O(n) operations for each element. For n elements, this of course gives us O(n^2). If this sounds familiar, it's because what you're doing is (as another commenter pointed out) an insertion sort. In contrast, performing one quicksort on the entire list will take, on average, O(n log n) time.
So, does this mean you should definitely wait until the end to sort? No. The important thing to remember is that O(n log n) is the best we can do for sorting in the general case. Your application-specific knowledge of the data can influence the best algorithm for the job. For example, if you know your data is already almost sorted, then insertion sort will give you linear time complexity.
Your mileage may vary, but I think the algorithmic viewpoint is useful when examining the general problem of "When should I sort?"
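For illustration, here's a rough Python sketch (Python only for brevity; the question is language-agnostic) comparing the two strategies. bisect.insort does the find-the-position-then-shift work of sorted insertion, so each insert is O(n) in the worst case, while the fill-then-sort version pays a single O(n log n) sort at the end. Timing both as the input grows shows the gap the complexity analysis predicts.

import bisect
import random

data = [random.randrange(1000) for _ in range(10000)]

# Strategy 1: keep the array sorted after every insert (an insertion sort).
# Each insort is a binary search (O(log n)) plus an O(n) shift,
# so the whole loop is O(n^2) in the worst case.
sorted_as_we_go = []
for x in data:
    bisect.insort(sorted_as_we_go, x)

# Strategy 2: fill the array first, then sort once at the end: O(n log n).
sort_at_end = list(data)
sort_at_end.sort()

assert sorted_as_we_go == sort_at_end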

It depends on what is critical to you. Do you need to be able to insert very fast (many inserts but few queries), or do you need to be able to query very fast and can tolerate slower inserts (many queries but few inserts)?
This is basically your problem to solve. When you know this you can select an appropriate sorting algorithm and apply it.
Edit: This is assuming that either choice actually matters. This depends a lot on your activity (inserts vs queries) and the amount of data that you need to sort.

Related

Algorithmic complexity vs real life situations?

My question is about the theory-versus-practice side of things.
Let's say, for example, that I want to sort a list of numbers. Mergesort has a complexity of O(n log n) while bubblesort has a complexity of O(n^2).
This means that mergesort is quicker. But the complexity doesn't take into account everything that happens on a real computer. What I mean by that is that mergesort, for example, is a divide-and-conquer algorithm and it needs more space than bubblesort.
So isn't it possible that allocating this additional space and using those resources (time to transfer the data, to load the code, etc.) takes more time than bubblesort, which doesn't use any additional space?
Wouldn't it be possible for an algorithm with worse ("bigger") complexity to be more efficient than another for certain input lengths (maybe small ones)?
The answer is a clear yes.
A classic example is that insertion sort is O(n^2). However, efficient sorting implementations often switch to insertion sort when something like 100 elements are left, because insertion sort makes really good use of the cache and avoids pipeline stalls in the CPU. No, insertion sort won't scale, but on inputs that small it outperforms the asymptotically faster algorithms.
The way that I put it is that scalability is like a Mack Truck. You want it for a big load, but it might not be the best thing to take for a shopping trip at the local grocery store.
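To make that switch-over concrete, here's a hedged Python sketch of a quicksort that falls back to insertion sort on small sub-ranges. The cutoff of 16 is an arbitrary assumption for illustration; real libraries tune the threshold (and choose pivots more carefully) for their own platforms.

import random

CUTOFF = 16  # assumed threshold; real implementations tune this empirically

def insertion_sort(a, lo, hi):
    # Sorts a[lo..hi] in place; cheap for small, cache-resident ranges.
    for i in range(lo + 1, hi + 1):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def hybrid_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        if hi - lo + 1 <= CUTOFF:
            insertion_sort(a, lo, hi)   # small range: let insertion sort finish it
            return
        # Plain Lomuto partition around a random pivot.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot, store = a[hi], lo
        for i in range(lo, hi):
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]
        # Recurse on the smaller half, loop on the larger, to keep the stack O(log n).
        if store - lo < hi - store:
            hybrid_quicksort(a, lo, store - 1)
            lo = store + 1
        else:
            hybrid_quicksort(a, store + 1, hi)
            hi = store - 1

data = [random.randrange(100) for _ in range(1000)]
expected = sorted(data)
hybrid_quicksort(data)
assert data == expected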
Algorithmic complexity only tells you how two algorithms will compare as their input grows larger, i.e. approaches infinity. It tells you nothing about how they will compare on smaller inputs. The only way to know that for sure is to benchmark on data and equipment that represents a typical situation.

What sorting algorithm fits this 'stream-like' condition?

I have a buffer receiving data, which means the data arrive like a 'stream' and there is latency in the I/O. The way I am doing it now is: when the buffer is full, use qsort to sort the buffer and write the result to disk. But there is an obvious pause while qsort runs, so I am looking for some other sorting algorithm that can start sorting while the data is still being added to the buffer, in order to reduce the time consumed overall.
I don't know if I have made myself clear; leave a comment if anything needs clarifying, thanks.
Heap sort keeps the data permanently in a partially sorted condition and so is comparable to insertion sort. But it is substantially quicker and has a worst case of O(n log n) compared with O(n^2) for insertion sort.
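For what it's worth, here is a rough Python sketch of that idea using the standard heapq module: each arriving item is pushed onto a heap in O(log n), so the sorting work is spread across the slow I/O, and flushing a full buffer just pops the heap in order. The buffer size and the read_next() source are placeholders for your real stream.

import heapq
import random

BUFFER_SIZE = 1024  # placeholder for your real buffer capacity

def read_next():
    # Stand-in for the real, latency-bound stream read.
    return random.random()

def stream_sorted_blocks(n_items):
    heap = []
    for _ in range(n_items):
        heapq.heappush(heap, read_next())   # O(log n) per arrival
        if len(heap) == BUFFER_SIZE:
            # Buffer full: drain it in sorted order, e.g. to write to disk.
            yield [heapq.heappop(heap) for _ in range(BUFFER_SIZE)]
    if heap:
        yield [heapq.heappop(heap) for _ in range(len(heap))]

for block in stream_sorted_blocks(5000):
    pass  # each block is a sorted list ready to be written out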
How is this going to work? Presumably at some point you have to stop reading from the stream, store what you have sorted, and start reading a new set of data?
I think merge sort or tree sort can be of great help. Look up why on Wikipedia.
When you can cut the huge input into reasonably large blocks, merge sort is more appropriate.
When you insert small pieces at a time, tree sort is more appropriate.
You want to implement an online sorting algorithm, i.e. an algorithm which runs while receiving the data in a streaming fashion. Search the web for online algorithms and you may find other nice ones.
In your case I would use tree sort. It doesn't have a better complexity than quicksort (both are O(n log n) most of the time and O(n^2) in a few bad cases), but it amortizes the cost over each insertion, which means the delay you have to wait after the last piece of data arrives is not of order O(n log n), but O(log n).
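A minimal tree sort sketch in Python, just to show the shape of the idea: each arriving value is inserted into a binary search tree, and when the stream ends an in-order walk emits everything in sorted order. Note this toy tree is unbalanced, so the O(log n) per-insert figure only holds if you swap in a self-balancing tree (red-black, AVL, or similar).

class Node:
    __slots__ = ("value", "left", "right")
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

def insert(root, value):
    # O(height) per insert: O(log n) if the tree is kept balanced.
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def in_order(root):
    # Final walk once the last element has arrived.
    if root is not None:
        yield from in_order(root.left)
        yield root.value
        yield from in_order(root.right)

root = None
for x in [5, 1, 4, 2, 3]:   # stand-in for values arriving from the stream
    root = insert(root, x)
print(list(in_order(root)))  # [1, 2, 3, 4, 5]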
You can try my Link Array structure. It should be OK for sequentially adding random data while keeping it sorted (look at the numbers in the table). It is a variation of the Skip list approach, but with easier implementation and logic (although the performance of a Skip list should be better).

Reading streamed data into a sorted list

We know that, in general, the "smarter" comparison sorts on arbitrary data run in worst case complexity O(N * log(N)).
My question is what happens if we are asked not to sort a collection, but a stream of data. That is, values are given to us one by one with no indicator of what comes next (other than that the data is valid/in range). Intuitively, one might think that it is superior then to sort data as it comes in (like picking up a poker hand one by one) rather than gathering all of it and sorting later (sorting a poker hand after it's dealt). Is this actually the case?
Gathering and sorting would be O(N + N * log(N)) = O(N * log(N)). However if we sort it as it comes in, it is O(N * K), where K = time to find the proper index + time to insert the element. This complicates things, since the value of K now depends on our choice of data structure. An array is superior in finding the index but wastes time inserting the element. A linked list can insert more easily but cannot binary search to find the index.
Is there a complete discussion on this issue? When should we use one method or another? Might there be a desirable in-between strategy of sorting every once in a while?
Balanced tree sort has O(N log N) complexity and maintains the list in sorted order while elements are added.
Absolutely not!
Firstly, if I could sort in-streaming data faster, I could just accept all my data in O(N), stream it to myself, and sort it with that quicker method. In other words, the all-data case reduces to the streaming case, so the streaming case cannot be asymptotically faster.
Secondly, you're describing an insertion sort, which actually runs in O(N^2) time (i.e. your description of O(NK) was right, but K is not constant; it is a function of N), because each insertion can cost O(N): with a linked list you may spend O(N) finding the right position, and with an array you can binary-search for the position in O(log N) but still spend O(N) shifting elements to make room. So even the binary-search optimisation leaves the whole thing at O(N^2), and you haven't really saved anything.
Probably also worth mentioning the general principle: as long as you're in the comparison model (i.e. you don't have any non-trivial and helpful information about the data you're sorting, which is the general case), any sorting algorithm will be at best O(N log N). That is, the worst-case running time of any sorting algorithm in this model is Ω(N log N). That's not a hypothesis, but a theorem, so it is impossible to find anything faster under the same assumptions.
OK, if the timing of the stream is relatively slow, you will have a completely sorted list (minus the last element) by the time your last element arrives. Then all that remains is a single binary-search insertion, O(log n) to find the position plus one final shift, rather than a complete sort, O(n log n). So there is a perceived performance gain, since you are getting a head start on the other sort algorithms.
Managing, queuing, and extracting data from a stream is a completely different issue and might be counter-productive to your intentions. I would not recommend this unless you can sort the complete data set in about the same time it takes to stream one or maybe two elements (and you feel good about coding the streaming portion).
Use heap sort in the cases where tree sort will behave badly, i.e. on large data sets, since tree sort needs additional space to store the tree structure.

Sorting algorithm that runs in time O(n) and also sorts in place

Is there any sorting algorithm which has running time of O(n) and also sorts in place?
There are a few whose best case is O(n), but that is usually because the collection of items is already sorted (or nearly so). You're looking at O(n log n) on average for the better general-purpose ones.
With that said, the Wiki on sorting algorithms is quite good. There's a table that compares popular algorithms, stating their complexity, memory requirements (indicating whether the algorithm might be "in place"), and whether they leave equal value elements in their original order ("stability").
http://en.wikipedia.org/wiki/Sorting_algorithm
Here's a little more interesting look at performance, provided by this table (from the above Wiki):
http://en.wikipedia.org/wiki/File:SortingAlgoComp.png
Some will obviously be easier to implement than others, but I'm guessing that the ones worth implementing have already been done so in a library for your choosing.
No.
There's a proven lower bound of Ω(n log n) for general comparison-based sorting.
Radix sort is based on knowing the numeric range of the data, but the in-place radix sorts mentioned here in practice require multiple passes for real-world data.
Radix Sort can do that:
http://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations
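To make the multiple-passes point concrete, here is a hedged Python sketch of an in-place MSD radix sort for non-negative integers: it partitions on one binary digit per level, so it makes a pass per bit over each range, but it uses no extra array.

def radix_sort_msd(a, lo=0, hi=None, bit=31):
    # In-place binary MSD radix sort for non-negative integers < 2**32.
    if hi is None:
        hi = len(a)
    if hi - lo <= 1 or bit < 0:
        return
    mask = 1 << bit
    i, j = lo, hi - 1
    while i <= j:
        if a[i] & mask == 0:      # current bit is 0: belongs on the left
            i += 1
        else:                     # current bit is 1: swap it to the right
            a[i], a[j] = a[j], a[i]
            j -= 1
    radix_sort_msd(a, lo, i, bit - 1)   # recurse on the 0-bit group
    radix_sort_msd(a, i, hi, bit - 1)   # recurse on the 1-bit group

data = [170, 45, 75, 90, 802, 24, 2, 66]
radix_sort_msd(data)
print(data)  # [2, 24, 45, 66, 75, 90, 170, 802]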
Depends on the input and the problem. For example, a permutation of the numbers 1..n can be sorted in O(n) in place.
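A quick sketch of that special case, assuming the array really is a permutation of 1..n: repeatedly swap each value into slot value - 1. Every swap puts at least one element into its final position, so there are at most n - 1 swaps and the whole thing runs in O(n) with no extra storage.

def sort_permutation(a):
    # a must contain each of 1..len(a) exactly once.
    i = 0
    while i < len(a):
        if a[i] != i + 1:
            target = a[i] - 1
            a[i], a[target] = a[target], a[i]   # send a[i] to its home slot
        else:
            i += 1
    return a

print(sort_permutation([3, 1, 5, 2, 4]))  # [1, 2, 3, 4, 5]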
Spaghetti sort is O(n), though arguably not in-place. Also, it's analog only.

Stable, efficient sort?

I'm trying to create an unusual associative array implementation that is very space-efficient, and I need a sorting algorithm that meets all of the following:
Stable (Does not change the relative ordering of elements with equal keys.)
In-place or almost in-place (O(log n) stack is fine, but no O(n) space usage or heap allocations).
O(n log n) time complexity.
Also note that the data structure to be sorted is an array.
It's easy to see that there's a basic algorithm that matches any 2 of these three (insertion sort matches 1 and 2, merge sort matches 1 and 3, heap sort matches 2 and 3), but I cannot for the life of me find anything that matches all three of these criteria.
Merge sort can be written to be in-place I believe. That may be the best route.
Note: standard quicksort is not O(n log n)! In the worst case, it can take up to O(n^2) time. The problem is that you might pivot on an element which is far from the median, so that your recursive calls are highly unbalanced.
There is a way to combat this, which is to carefully pick a pivot which is guaranteed, or at least very likely, to be close to the median. It is surprising that you can actually find the exact median in linear time, although in your case it sounds like you care about speed, so I would not suggest doing this.
I think the most practical approach is to implement a stable quicksort (it's easy to keep it stable) but use the median of 5 random values as the pivot at each step. This makes it highly unlikely that you'll hit a slow case, and it is stable.
By the way, merge sort can be done in-place, although it's tricky to do both in-place and stable.
What about quicksort?
Exchange sort can do that too; it might be more "stable" in your terms, but quicksort is faster.
There's a list of sort algorithms on Wikipedia. It includes categorization by execution time, stability, and allocation.
Your best bet is probably going to be modifying an efficient unstable sort to be stable, thereby making it less efficient.
There is a class of stable in-place merge algorithms, although they are complicated and linear with a rather high constant hidden in the O(n). To learn more, have a look at this article, and its bibliography.
Edit: the merge phase is linear, thus the merge sort is O(n log n).
Because your elements are in an array (rather than, say, a linked list) you have some information about their original order available to you in the array indices themselves. You can take advantage of this by writing your sort and comparison functions to be aware of the indices:
function cmp( ar, idx1, idx2 )
{
// first compare elements as usual
rc = (ar[idx1]<ar[idx2]) ? -1 : ( (ar[idx1]>ar[idx2]) ? 1 : 0 );
// if the elements are identical, then compare their positions
if( rc == 0 )
rc = (idx1<idx2) ? -1 : ((idx1>idx2) ? 1 : 0);
return rc;
}
This technique can be used to make any sort stable, as long as the sort ONLY performs element swaps. The indices of elements will change during the sort, but because identical elements are compared by position, their relative order stays the same, so the sort remains stable. It won't work out of the box for a sort like heapsort, because the original heapification "throws away" the relative ordering, though you might be able to adapt the idea to other sorts.
Quicksort can be made stable reasonably easily, simply by adding a sequence field to each record, initializing it to the index before sorting, and using it as the least significant part of the sort key.
This has a slightly adverse effect on the time taken, but it doesn't affect the time complexity of the algorithm. It also has a small per-record storage overhead, but that rarely matters until you get very large numbers of records (and it is minimized with larger record sizes).
I've used this method with C's qsort() function to avoid writing my own. Each record has a 32-bit integer added and populated with the starting sequence number before calling qsort().
Then the comparison function checked the keys and the sequence (this guarantees there are no duplicate keys), turning the quicksort into a stable one. I recall that it still outperformed the inherently stable mergesort for the data sets I was using.
Your mileage may vary, so always remember: Measure, don't guess!
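The same sequence-number idea, sketched in Python (not C's qsort(), just to keep the example short) with a heap-based sort standing in for quicksort: heapsort is not stable on its own, but tagging each record with its arrival index and using that index as the least significant part of the key guarantees that equal keys come back out in their original order.

import heapq

records = [("b", 1), ("a", 1), ("b", 0), ("a", 0)]  # sort by the letter only

# Tag each record with its arrival index so that no two composite keys are equal.
tagged = [(rec[0], seq, rec) for seq, rec in enumerate(records)]

heapq.heapify(tagged)                 # an unstable sort on its own...
stable = [heapq.heappop(tagged)[2] for _ in range(len(records))]
print(stable)  # [('a', 1), ('a', 0), ('b', 1), ('b', 0)] - ties keep arrival order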
Quicksort can be made stable by doing it on a linked list. This costs n to pick random or median of 3 pivots but with a very small constant (list traversal).
By splitting the list so that elements are appended to the left or right sublist in their original order, with equal values consistently kept on the same side, the sort will be implicitly stable for no real extra cost. Also, since this deals with assignment rather than swapping, I think the speed might actually be slightly better than a quicksort on an array, since there's only a single write.
So, in conclusion: put all your items in a list and run quicksort on it.
Don't worry too much about O(n log n) until you can demonstrate that it matters. If you can find an O(n^2) algorithm with a drastically lower constant, go for it!
The general worst-case scenario is not relevant if your data is highly constrained.
In short: run some tests.
There's a nice list of sorting algorithms on Wikipedia that can help you find whatever type of sort you're after.
For example, to address your specific question, it looks like an in-place merge sort is what you want.
However, you might also want to take a look at strand sort, it's got some very interesting properties.
I have implemented a stable in-place quicksort and a stable in-place merge sort. The merge sort is a bit faster, and is guaranteed to run in O(n*log(n)^2), which the quicksort is not. Both use O(log(n)) space.
Perhaps shell sort? If I recall my data structures course correctly, it tended to be stable, but its worst-case time is O(n log^2 n), although it performs at O(n) on almost-sorted data. It's based on insertion sort, so it sorts in place.
Maybe I'm in a bit of a rut, but I like hand-coded merge sort. It's simple, stable, and well-behaved. The additional temporary storage it needs is only N*sizeof(int), which isn't too bad.

Resources