In this sorting animation, I saw that heap sort and merge sort work best for an array containing random numbers. But what if we compare these sorting algorithms with radix sort and introsort?
In short, which type of sorting algorithm is best to sort an array consisting of random numbers?
Thanks
For an array of random numbers, a least-significant-digit-first counting variation of radix sort is normally fastest for smaller arrays that fit within cache, while for larger arrays, using one most-significant-digit-first pass to split the array into smaller sub-arrays that fit in cache will be faster. Since the data is random, the main time overhead for a radix sort is the randomly distributed writes, which are not cache friendly if the array is significantly larger than the cache. If the original and working arrays fit within cache, the random-access writes don't incur a significant time penalty on most systems.
There is also a choice of base for a radix sort. For example, 32-bit numbers can be sorted in 4 passes using base 256 (8-bit "digits"). Using base 65536 (16-bit "digits") usually exceeds the size of the L1 and/or L2 caches, so it's not faster in most cases even though it takes only two passes. For 64-bit numbers, four 11-bit "digits" and two 10-bit "digits" could be used to sort in 6 passes, instead of eight 8-bit "digits" in 8 passes. However, the 11/10-bit digit variation isn't going to be faster unless the array is large enough and the distribution of the random numbers is uniform enough to use most of the storage that holds the counts / indexes.
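For illustration, here is a minimal sketch of the base-256 LSD counting variation described above, written for 32-bit unsigned integers; the function name and the use of a same-size working buffer are my own choices, not from any particular library:

#include <cstdint>
#include <cstddef>
#include <vector>

// LSD radix sort, base 256: four passes over 8-bit "digits".
// Uses a working buffer the same size as the input (counting variation).
void radix_sort_u32(std::vector<uint32_t>& a) {
    std::vector<uint32_t> buf(a.size());
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (uint32_t x : a)                      // histogram of this digit
            ++count[(x >> shift) & 0xFF];
        size_t index[256];
        size_t sum = 0;
        for (int d = 0; d < 256; ++d) {           // prefix sums -> start index per digit
            index[d] = sum;
            sum += count[d];
        }
        for (uint32_t x : a)                      // scatter, stable within each digit
            buf[index[(x >> shift) & 0xFF]++] = x;
        a.swap(buf);                              // output of this pass feeds the next
    }
}

Since 32 / 8 gives an even number of passes, the sorted result ends up back in the original vector after the swaps.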
Link to a prior thread about radix sort variations:
Radix Sort Optimization
merge sort "best case"
For a standard merge sort, the number of moves is always the same, but if the data is already sorted, only half the number of compares is done.
quick sort / intro sort "best case"
The best case for quicksort is random data. Using the middle value as the partition pivot doesn't make much difference for random data, but if the data is already sorted, the middle-value pivot turns that into a best case as well. Introsort generally adds extra code to check whether the recursion is getting too deep, at which point it switches to heap sort. For random data this shouldn't happen, so the depth check for the switch to heap sort is just extra overhead.
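As a rough sketch of that depth check (this is only the general shape of introsort with a middle-element pivot, not any specific library's implementation):

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of introsort's control flow: quicksort until the recursion depth
// budget is exhausted, then fall back to heap sort. Real implementations also
// hand tiny ranges to insertion sort, omitted here for brevity.
void intro_sort(std::vector<int>& a, int lo, int hi, int depth_limit) {
    if (hi - lo <= 1) return;
    if (depth_limit == 0) {                       // recursion too deep: switch to heap sort
        std::make_heap(a.begin() + lo, a.begin() + hi);
        std::sort_heap(a.begin() + lo, a.begin() + hi);
        return;
    }
    int pivot = a[lo + (hi - lo) / 2];            // middle element as pivot
    int i = lo, j = hi - 1;
    while (i <= j) {                              // Hoare-style partition
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    intro_sort(a, lo, j + 1, depth_limit - 1);
    intro_sort(a, i, hi, depth_limit - 1);
}

void intro_sort(std::vector<int>& a) {
    int depth = a.empty() ? 0 : 2 * static_cast<int>(std::log2(a.size()));
    intro_sort(a, 0, static_cast<int>(a.size()), depth);
}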
Here you can see time complexities of various sorting algorithms in best, average and worst cases: http://bigocheatsheet.com/
As you want to compare the time complexities of sorting algorithms with random numbers, we can simply compare their average-case time complexities.
You can further look into their algorithms and analyze their time complexities.
https://www.geeksforgeeks.org/merge-sort/
https://www.geeksforgeeks.org/radix-sort/
https://www.geeksforgeeks.org/heap-sort/
https://www.geeksforgeeks.org/know-your-sorting-algorithm-set-2-introsort-cs-sorting-weapon/
Merge sort is a widely used sorting algorithm in the sort-function implementations of various libraries.
Merge sort sorts in O(n log n) time and O(n) space.
Heap sort sorts in O(n log n) time and O(1) space.
Radix sort sorts in O(nk) time and O(n+k) space.
Introsort sorts in O(n log n) time and O(log n) space.
Introsort is a mix of quicksort, insertion sort, and heap sort.
Introsort is probably the best one.
There is no perfect algorithm, different algorithms have a different set of advantages and disadvantages.
Note: All time complexities are average case.
Related
Suppose we have 1 million entries of an object 'Person' with two fields 'Name', 'Age'. The problem was to sort the entries based on the 'Age' of the person.
I was asked this question in an interview. I answered that we could use an array to store the objects and use quicksort, as that would save us from using additional space, but the interviewer said that memory was not a factor.
My question is what would be the factor that would decide which sort to use?
Also what would be the preferred way to store this?
In this scenario does any sorting algorithm have an advantage over another sorting algorithm and would result in a better complexity?
This Stackoverflow link may be useful to you.
The answers above are sufficient, but I would like to add some more information, copied from the answers in the link above.
We should note that even if the fields in the object are very big (e.g. long names) you do not need to use a file-system sort; you can use an in-memory sort, because
# elements * 8 bytes ≈ 762 MB for the 10^8 elements in the linked question (most modern systems have enough memory for that), where the 8 bytes per element is the key (age) plus a pointer to the struct on a 32-bit system. For the 10^6 entries here that is only about 8 MB.
It is important to minimize disk accesses, because disks are not random access, and disk accesses are MUCH slower than RAM accesses.
Now, use a sort of your choice on that - and avoid using disk for the sorting process.
Some possibilities of sorts (on RAM) for this case are:
Standard quicksort or merge-sort (Which you had already thought of)
Bucket sort can also be applied here, since the range is limited to [0,150] (which others have specified here under the name counting sort)
Radix sort (for the same reason; with a radix of 2, radix sort will need ceil(log_2(150)) ≈ 8 iterations)
I wanted to point out the memory aspect in case you encounter the same question but need to answer it taking memory constraints into consideration. In fact, your constraints are even smaller (10^6 entries compared to the 10^8 in the other question).
As for the matter of storing it -
The quickest way to sort it would be to allocate 151 linked lists/vectors (call them buckets, or whatever you like, depending on the language you prefer) and put each person's data structure in the bucket according to his/her age (all people's ages are between 0 and 150):
bucket[person->age].add(person)
As others have pointed out, bucket sort is going to be the better option for you.
In fact, the beauty of bucket sort is that if you have to perform any operation on ranges of ages (like from 10 to 50 years of age), you can partition your bucket sizes according to your requirements (e.g. use a different age range for each bucket).
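A minimal sketch of that bucketing idea, assuming a placeholder Person struct with just a name and an age:

#include <string>
#include <vector>

struct Person {
    std::string name;
    int age;                                         // assumed to be in [0, 150]
};

// Bucket the people by age, then concatenate the buckets in age order.
std::vector<Person> sort_by_age(const std::vector<Person>& people) {
    std::vector<std::vector<Person>> bucket(151);    // one bucket per possible age
    for (const Person& p : people)
        bucket[p.age].push_back(p);                  // bucket[person->age].add(person)
    std::vector<Person> sorted;
    sorted.reserve(people.size());
    for (const auto& b : bucket)                     // ages 0..150, already in order
        sorted.insert(sorted.end(), b.begin(), b.end());
    return sorted;
}

Because each bucket preserves insertion order, this is also a stable sort.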
Again, I have copied this information from the answers in the link given above, but I believe it might be useful to you.
If the array has n elements, then quicksort (or, actually, any comparison-based sort) is Ω(n log(n)).
Here, though, it looks like you have an alternative to comparison-based sorting, since you need to sort only on age. Suppose there are m distinct ages. In this case, counting sort will be Θ(m + n). Given the specifics of your question, assuming that age is in years, m is much smaller than n, and you can do this in linear time.
The implementation is trivial. Simply create an array of, say, 200 entries (200 being an upper bound on the age). The array is of linked lists. Scan over the people, and place each person in the linked list in the appropriate entry. Now, just concatenate the lists according to the positions in the array.
Different sorting algorithms perform at different complexities, yes. Some use different amounts of space. And in practice, real performance with the same complexity varies too.
http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html
There are different ways to set up a quicksort's partition method that could matter for ages. Shell sorts can have different gap settings that perform better for certain types of input. But maybe your interviewer was more interested in you thinking about 1 million people having a lot of duplicate ages, which might mean you want a 3-way quicksort or, as suggested in the comments, a counting sort.
This is an interview question, so I guess the interviewee's reasoning is more important than naming the correct sorting algorithm. Your problem is sorting an array of objects whose age field is an integer. Age has some special properties:
integer: there are sorting algorithms specially designed for integers.
finite: you know the maximum age of people, right? For example, say it is 200.
I will list some sorting algorithms for this problem, with advantages and disadvantages, that are suitable for an interview session:
Quicksort: complexity is O(N log N) and it can be applied to any data set. Quicksort is the fastest of the sorts that use a compare operation between two elements. Its biggest disadvantage is that it isn't stable: two objects with equal ages don't keep their relative order after sorting.
Merge sort: complexity is O(N log N). A little slower than quicksort, but it is a stable sort. This algorithm can also be applied to any data set.
Radix sort: complexity is O(w*n), where n is the size of your list and w is the maximum number of digits in your dataset. For example, the length of 12 is 2 and the length of 154 is 3. So if the maximum age is 99, the complexity is O(2*n). This algorithm can only be applied to integers or strings.
Counting sort: complexity is O(m+n), where n is the size of your list and m is the number of distinct ages. This algorithm can only be applied to integers.
Because we are sorting a million entries and all values are integers in the range 0..200, there are tons of duplicate values. So counting sort is the best fit, with complexity O(200 + N) where N ≈ 1,000,000, and 200 is negligible.
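A sketch of counting sort specialised to this case, assuming ages fall in 0..200 and using the same kind of placeholder Person struct as in the earlier sketch; the function and variable names are mine:

#include <string>
#include <vector>

struct Person {                                   // placeholder struct for illustration
    std::string name;
    int age;                                      // assumed to be in [0, 200]
};

// Counting sort keyed on age: O(N + 201) time, stable.
std::vector<Person> counting_sort_by_age(const std::vector<Person>& people) {
    const int kMaxAge = 200;
    std::vector<size_t> count(kMaxAge + 1, 0);
    for (const Person& p : people)                // histogram of ages
        ++count[p.age];
    std::vector<size_t> start(kMaxAge + 1, 0);
    for (int age = 1; age <= kMaxAge; ++age)      // prefix sums -> first index per age
        start[age] = start[age - 1] + count[age - 1];
    std::vector<Person> sorted(people.size());
    for (const Person& p : people)                // place each person, preserving order
        sorted[start[p.age]++] = p;
    return sorted;
}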
If you assume that you have a finite number of different values of age (usually people are not older than 100) then you could use
counting sort (https://en.wikipedia.org/wiki/Counting_sort). You would be able to sort in linear time.
I recently did a revision of sorting algorithms. While revising, I imagined some code that selects the better of two available sorting algorithms to sort an array, according to the array's size. For example, it has to choose between insertion sort and quicksort.
It's well known that quicksort is used extensively to sort large arrays and that it achieves its average-case time of O(n log n), although its worst-case time is O(n^2). On the other hand, insertion sort isn't recursive, so it may consume less CPU time when sorting a small array. So, what would be a good threshold size for the aforementioned code to choose the more efficient of those algorithms?
Other performance factors, like "how close" is a given sequence to its sorted permutation, aren't concerning me right now.
From Princeton University's quicksort page
Cutoff to insertion sort. As with mergesort, it pays to switch to insertion sort for tiny arrays. The optimum value of the cutoff is system-dependent, but any value between 5 and 15 is likely to work well in most situations.
I personally prefer a cutoff size of 15. But again, that is system-dependent and may or may not be the best in your case.
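A sketch of such a hybrid, with the cutoff pulled out as a constant so it can be tuned per system; the value 15 just follows the preference stated above, and the partitioning details are my own illustration:

#include <utility>
#include <vector>

const int kCutoff = 15;                           // system-dependent; 5..15 is typical

// Insertion sort on a[lo..hi) -- cheap for tiny ranges, no recursion.
void insertion_sort(std::vector<int>& a, int lo, int hi) {
    for (int i = lo + 1; i < hi; ++i) {
        int key = a[i];
        int j = i - 1;
        while (j >= lo && a[j] > key) {           // shift larger elements right
            a[j + 1] = a[j];
            --j;
        }
        a[j + 1] = key;
    }
}

// Quicksort that hands small partitions to insertion sort.
void hybrid_quicksort(std::vector<int>& a, int lo, int hi) {
    if (hi - lo <= kCutoff) {
        insertion_sort(a, lo, hi);
        return;
    }
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi - 1;
    while (i <= j) {                              // Hoare-style partition
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    hybrid_quicksort(a, lo, j + 1);
    hybrid_quicksort(a, i, hi);
}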
Since all comparison sort algorithms take at least n lg n time, why would we ever need to use something like quicksort when we can express the items in the list as bits and use something like radix sort, which is linear?
Radix sort tends to exhibit poor cache locality, see for example this paper for an analysis of different sorting algorithms under the influence of cache (skip to the conclusion for a discussion of radix sort's poor cache locality compared to quicksort and mergesort). Quicksort and mergesort partition the data such that after a few iterations a partition will fit on a few cache lines, whereas radix sort keeps shuffling the data. In addition, radix sort either needs to use linked data structures for its buckets (which exhibit poor cache performance), or else it needs to use over-large arrays (which waste memory).
Also, depending on radix sort's radix size, its constant factor may be larger than quicksort's / mergesort's log factor. In an extreme case, using a radix of 2 on 64-bit integers, radix sort has a constant factor of 64 (one pass per bit), whereas it's highly unlikely that quicksort's / mergesort's log factor is that large (as this would imply that you're sorting 2^64 elements).
Modern implementations of mergesort using a SIMD kernel to sort short arrays can be very, very fast. This paper by some folks at Intel describes one such implementation. The chief advantage here is that the SIMD kernel can do several comparisons and swaps per clock cycle, gaining and taking advantage of several bits of information about the array to be sorted per clock cycle.
Quicksort requires a test, a store, and an increment of one of two pointers at each iteration, which forms a single huge dependency chain. This isn't great, since it means you're gaining one bit of information about the array every few clock cycles.
Radix sorts have the same problem as Quicksort (each pass is a single huge dependency chain with an access and increment of one pointer from a largish, uniformly-distributed set). However, on uniformly-distributed inputs, a properly implemented MSD radix sort using five- or six-bit keys can do in one pass over the input what Quicksort will take five or six passes to do. I haven't timed this stuff recently, but a good MSD radix sort might still be the best way to sort large arrays of ints or long longs.
None of this stuff about radix sort will keep you warm at night if your input is badly distributed and the universe of possible keys is large when compared to the number of keys in your input.
Is selection sort faster than insertion sort for big arrays? What about in the worst case?
I know insertion sort is usually faster than selection sort, but what about for large arrays and in the worst case?
The size of the array involved is rarely of much consequence.
The real question is the speed of comparison vs. copying. Selection sort wins when a comparison is a lot faster than a copy. Just as an example, let's assume two fields: a single int as a key, and another megabyte of data attached to it.
In such a case, comparisons involve only that single int, so it's really fast, but copying involves the entire megabyte, so it's almost certainly quite a bit slower.
Since the selection sort does a lot of comparisons, but relatively few copies, this sort of situation will favor it. The insertion sort does a lot more copies, so in a situation like this, the slower copies will slow it down quite a bit.
As far as the worst case for an insertion sort, it'll be pretty much the opposite: anything where copying is fast but comparison is slow. There are a few more cases that favor insertion as well, such as when the elements are only slightly scrambled, so each is still within a short distance of its final sorted location.
If the data doesn't provide a solid indication in either direction, chances are pretty decent that insertion sort will work out better.
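To make the comparison-versus-copy trade-off concrete, here is a hedged sketch of selection sort on records with a small key and a large payload; the Record layout is invented purely for illustration:

#include <array>
#include <utility>
#include <vector>

struct Record {
    int key;                                      // cheap to compare
    std::array<char, 1 << 20> payload;            // ~1 MB: expensive to copy
};

// Selection sort: O(n^2) comparisons but at most n-1 swaps,
// which is what matters when every copy moves a megabyte.
void selection_sort(std::vector<Record>& a) {
    for (size_t i = 0; i + 1 < a.size(); ++i) {
        size_t min_idx = i;
        for (size_t j = i + 1; j < a.size(); ++j)
            if (a[j].key < a[min_idx].key)        // compares touch only the int key
                min_idx = j;
        if (min_idx != i)
            std::swap(a[i], a[min_idx]);          // one expensive move per position
    }
}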
According to the Wikipedia article,
In general, insertion sort will write to the array O(n^2) times, whereas selection sort will write only O(n) times. For this reason selection sort may be preferable in cases where writing to memory is significantly more expensive than reading, such as with EEPROM or flash memory.
That's going to be true regardless of the array size. In fact, the difference will be more pronounced as the arrays get larger.
Insertion sort, if well implemented, uses memcpy() to move the other values. So it depends on the processor, the cache speed (first-level and second-level cache), and the cache size when one algorithm becomes faster than the other.
I remember an implementation (was it Java?) where one algorithm was used when the number of elements did not exceed a specific hard-coded threshold, and another algorithm was used otherwise.
So, you simply have to measure it.
Big-O notation is a bit misleading for small and medium arrays, because O(N) really means c * O(N).
And the factor c influences the total execution time.
For general-purpose sorting, the answer appears to be no, as quick sort, merge sort and heap sort tend to perform better in the average- and worst-case scenarios. However, insertion sort appears to excel at incremental sorting, that is, adding elements to a list one at a time over an extended period of time while keeping the list sorted, especially if the insertion sort is implemented as a linked list (O(log n) average case vs. O(n)). However, a heap seems to be able to perform just (or nearly) as well for incremental sorting (adding or removing a single element from a heap has a worst-case scenario of O(log n)). So what exactly does insertion sort have to offer over other comparison-based sorting algorithms or heaps?
From http://www.sorting-algorithms.com/insertion-sort:
Although it is one of the elementary sorting algorithms with O(n^2) worst-case time, insertion sort is the algorithm of choice either when the data is nearly sorted (because it is adaptive) or when the problem size is small (because it has low overhead).
For these reasons, and because it is also stable, insertion sort is often used as the recursive base case (when the problem size is small) for higher-overhead divide-and-conquer sorting algorithms, such as merge sort or quick sort.
An important concept in the analysis of algorithms is asymptotic analysis. In the case of two algorithms with different asymptotic running times, such as one O(n^2) and one O(n log n), as is the case with insertion sort and quicksort respectively, it is not definite that one is always faster than the other.
The important distinction with this sort of analysis is that for sufficiently large N, one algorithm will be faster than the other. When analyzing an algorithm down to a term like O(n log n), you drop constants. When realistically analyzing the running time of an algorithm, those constants will be important only for small n.
So what does this mean? It means that for certain small n, some algorithms are faster. This article from EmbeddedGurus.net includes an interesting perspective on choosing different sorting algorithms for a system with limited space (16k) and limited memory. Of course, the article references only sorting a list of 20 integers, so larger orders of n are irrelevant. Shorter code and less memory consumption (as well as avoiding recursion) were ultimately the more important considerations.
Insertion sort has low overhead, it can be written fairly succinctly, and it has two key benefits: it is stable, and it runs quite fast when the input is nearly sorted.
Yes, there is a reason to use either an insertion sort or one of its variants.
The sorting alternatives (quick sort, etc.) of the other answers here make the assumption that the data is already in memory and ready to go.
But if you are attempting to read in a large amount of data from a slower external source (say a hard drive), there is a large amount of time wasted as the bottleneck is clearly the data channel or the drive itself. It just cannot keep up with the CPU. A natural series of waits occur during any read. These waits are wasted CPU cycles unless you use them to sort as you go.
For instance, if you were to make your solution to this be the following:
Read a ton of data in a dedicated loop into memory
Sort that data
You would very likely take longer than if you did the following in two threads.
Thread A:
Read a datum
Place datum into FIFO queue
(Repeat until data exhausted from drive)
Thread B:
Get a datum from the FIFO queue
Insert it into the proper place in your sorted list
(repeat until queue empty AND Thread A says "done").
...the above will allow you to use the otherwise wasted time. Note: Thread B does not impede Thread A's progress.
By the time the data is fully read, it will have been sorted and ready for use.
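Here is a simplified, single-threaded sketch of the "sort as you go" idea: each datum is inserted into its sorted position as soon as it is read, so the insertion work is interleaved with the reads; in the two-thread design above, this loop would be Thread B consuming the FIFO queue.

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> sorted;
    int datum;
    // "Read a datum" -- stands in for the slow external source.
    while (std::cin >> datum) {
        // Binary-search the insertion point, then insert in place,
        // keeping the list sorted after every read.
        auto pos = std::upper_bound(sorted.begin(), sorted.end(), datum);
        sorted.insert(pos, datum);
    }
    for (int x : sorted)
        std::cout << x << '\n';
}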
Most sorting procedures will use quicksort and then insertion sort for very small data sets.
If you're talking about maintaining a sorted list, there is no advantage over some kind of tree, it's just slower.
Well, maybe it consumes less memory or is a simpler implementation.
Inserting into a sorted list will involve a scan, which means that each insert is O(n), therefore sorting n items becomes O(n^2)
Inserting into a container such as a balanced tree is typically O(log n), therefore the sort is O(n log n), which is of course better.
But for small lists it hardly makes any difference. You might use an insert sort if you have to write it yourself without any libraries, the lists are small and/or you don't care about performance.
YES,
Insertion sort is better than Quick Sort on short lists.
In fact, an optimized quicksort has a size threshold at which it stops recursing, and then the entire array is finished by insertion sort over those below-threshold partitions.
Also...
For maintaining a scoreboard, Binary Insertion Sort may be as good as it gets.
See this page.
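For reference, a minimal binary insertion sort sketch (binary search for the insertion point, then shift the tail); the function name is mine:

#include <algorithm>
#include <vector>

// Binary insertion sort: O(log n) comparisons per element to find the slot,
// but still O(n) shifts, so it shines when comparisons are the expensive part.
void binary_insertion_sort(std::vector<int>& a) {
    for (size_t i = 1; i < a.size(); ++i) {
        int key = a[i];
        // Find where key belongs among a[0..i) with binary search.
        auto pos = std::upper_bound(a.begin(), a.begin() + i, key);
        // Shift everything from pos..i one slot to the right, then place key.
        std::move_backward(pos, a.begin() + i, a.begin() + i + 1);
        *pos = key;
    }
}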
For small array sizes, insertion sort outperforms quicksort.
Java 7 and Java 8 use dual-pivot quicksort to sort primitive data types.
Dual-pivot quicksort outperforms the typical single-pivot quicksort. According to the dual-pivot quicksort algorithm:
For small arrays (length < 27), use the Insertion sort algorithm.
Choose two pivot...........
Definitely, insertion sort outperforms quicksort for smaller array sizes, which is why the implementation switches to insertion sort for arrays of length less than 27. One reason could be that there is no recursion in insertion sort.
Source: http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf