Related
If there are so many faster and more efficient sorting algorithms available (merge sort, heap sort, quick sort), why is selection sort still taught? If it is because they are still used, when are some examples where this would be true?
I believe it's still taught because it's a simple algorithm to understand and helps build the foundation for other sorting algorithms. It's also an easy exercise in understanding time and space complexity for algorithms. Not aware of any practical usages in modern computing, but it does have very low memory overhead so can be ideal for situations where memory is at a premium.
Personally, Selection Sort only exists as a teaching process, besides that, I wouldn't see any other reason to use it.
If gives you a great understanding on Big-O and it's effective to compare selection sort to quick sort/merge sort/heap sort so that you can actually experience the difference in run-time
In short. Selection sort is used for educational purposes 😂
Selection sort has a few advantages
Less memory write compare to other algorithms. So,may be useful for disk operations or in ROMs. Also an in-place sort.
It is the basic idea of Heap Sort.
I'm working on detecting duplicates in a list of around 5 million addresses, and was wondering if there was consensus on an efficient algorithm for such a purpose. I've looked at the Dedupe library on Gitbub (https://github.com/datamade/dedupe), but based on the documentation I'm not clear that this would scale to a large application well.
As an aside, I'm just looking to define duplicates based on textual similarity - have already done a lot of cleaning on the addresses. I've been using a crude method using Levenshtein distance, but was wondering if there's anything more efficient for large datasets.
Thanks,
Dedupe should work fine for data of that size.
There has been some excellent work by Michael Wick and Beka Steorts that have better complexity than dedupe.
I was using the built-in sort function of Matlab:
[temp, Idx] = sort(M,2);
I would like to have the sorted index of each row of M, which is a matrix of size > 50k.
I searched hard but did not find anything.. It would be greatly appreciated if you have any comments!
To get a sense of how much room for improvement you have, I would suggest writing a test program in C and use qsort or in C++ and user sort and carefully time it on 7000 inputs of size 7000 (or whatever setup you have in Matlab).
I'm going to give you my estimate: probably Matlab's sort runs (on properly vectorized code, like yours) as fast as C++, and you're just seeing the effect of running an algorithm that takes O(n^2 log n). It is reported in Matlab's marketing material that its sort function was faster than C's qsort, but take it with a grain of salt.
The best way to speed up that sort is to get a faster computer. It will speed everything else up too. :)
The fact is, you can rarely speed up a single call to something like a sort. MATLAB is already doing that in an efficient manner, using an optimized code internally. (Reread the carlosdc answer.) The things you can sometimes get a boost on are tools that are written in MATLAB itself.
So, what can you do? Short of buying that new computer, you can look at your overall code. One single sort of that size is never that big of a problem. But the reason for doing that sort over and over again is. Think carefully about the code, about whether you can change the flow or avoid a many times repeated sort. Algorithm change is often a FAR bigger source of improvement than the wee bit you would ever get even if you could improve that sort.
Sorting is fundamentally O(n log n).
As long as you have a reasonably efficient implementation, this is unlikely to change much.
That said, as Andrew Janke's comment suggests, multi-threading can improve things dramatically.
GPU programming can be a way to get massive speedups. If you have R2010b or later, you may be able to use accelerated versions of built-in functions like sort from Mathworks.
Otherwise, write a mex wrapper around the CUDA Thrust library which includes a sort.
You could write your own sort function in C/C++ as MEX. MATLAB documentation has examples for it.
There exist many sort algorithms which are better then other in edge cases, for example almost sorted data or stability (which does not matter in MATLAB because all its types are value types).
Is your data numeric or strings? For strings there are probably special algorithms for ASCII sort, sometimes natural sort is preferable.
Why would you choose bubble sort over other sorting algorithms?
You wouldn't.
Owen Astrachan of Duke University once wrote a research paper tracing the history of bubble sort (Bubble Sort: An Archaeological Algorithmic Analysis) and quotes CS legend Don Knuth as saying
In short, the bubble sort seems to have nothing to recommend it, except a catchy name
and the fact that it leads to some interesting theoretical problems.
The paper concludes with
In this paper we have investigated the origins of bubble sort and its enduring popularity despite warnings against its use by many experts. We confirm the warnings by analyzing its complexity both in coding and runtime.
Bubble sort is slower than the other O(n2) sorts; it's about four times as slow as insertion sort and twice as slow as selection sort. It does have good best-case behavior (if you include a check for no swaps), but so does Insertion Sort: just one pass over an already-sorted array.
Bubble Sort is impractically slow on almost all real data sets. Any good implementation of quicksort, heapsort, or mergesort is likely to outperform it by a wide margin. Recursive sorts that use a simpler sorting algorithm for small-enough base-cases use Insertion Sort, not Bubble Sort.
Also, the President of the United States says you shouldn't use it.
Related: Why bubble sort is not efficient? has some more details.
There's one circumstance in which bubble sort is optimal, but it's one that can only really occur with ancient hardware (basically, something like a drum memory with two heads, where you can only read through the data in order, and only work with two data items that are directly next to each other on the drum).
Other than that, it's utterly useless, IMO. Even the excuse of getting something up and running quickly is nonsense, at least in my opinion. A selection sort or insertion sort is easier to write and/or understand.
You would implement bubble sort if you needed to create a web page showing an animation of bubble sort in action.
When all of the following conditions are true
Implementing speed is way more important than execution speed (probability <1%)
Bubble sort is the only sorting algorithm you remember from university class (probability 99%)
You have no sorting library at hand (probability <1%)
You don't have access to Google (probability <1%)
That would be less than 0,000099 % chance that you need to implement bubble sort, that is less than one in a million.
If your data is on a tape that is fast to read forward, slow to seek backward, and fast to rewind (or is a loop so it doesn't need rewinding), then bubblesort will perform quite well.
I suspect a trick question. No one would choose bubble sort over other sorting algorithms in the general case. The only time it really makes any sense is when you're virtually certain that the input is (nearly) sorted already.
Bubble sort is easy to implement. While the 'standard' implementation has poor performance, there is a very simple optimization which makes it a strong contender compared to many other simple algorithms. Google 'combsort', and see the magic of a few well placed lines. Quicksort still outperforms this, but is less obvious to implement and needs a language that supports recursive implementations.
I can think of a few reasons for bubble sort:
It's a basic elementary sort. They're great for beginner programmers learning the if, for, and while statements.
I can picture some free time for a programmer to experiment on how all the sorts work. What better to start with at the top with than the bubble sort (yes, this does demean its rank, but who doesn't think 'bubble sort' if someone says 'sorting algorithms').
Very easy to remember and work with for any algorithm.
When I was starting on linked lists, bubble sort helped me understand how all the nodes worked well with each other.
Now I'm feeling like a lame commercial advertising about bubble sort so I'll be quiet now.
I suppose you would choose bubble sort if you needed a sorting algorithm which was guaranteed to be stable and had a very small memory footprint. Basically, if memory is really scarce in the system (and performance isn't a concern) then it would work, and would be easily understood by anybody supporting the code. It also helps if you know ahead of time that the values are mostly sorted already.
Even in that case, insertion sort would probably be better.
And if it's a trick question, next time suggest Bogosort as an alternative. After all, if they're looking for bad sorting, that's the way to go.
It's useful for "Baby's First Sort" types of exercises in school because it's easy to explain how it works and it's easy to implement. Once you've written it, and maybe run it once, delete it and never think of it again.
You might use Bubblesort if you just wanted to try something quickly. If, for instance, you are in a new environment and you are playing around with a new idea, you can quickly throw in a bubble sort in very little time. It might take you much longer to remember and write a different sort and debug it and you still might not get it right. If your experiment works out and you need to use the code for something real, then you can spend the time to get it right.
No sense putting a lot of effort into the sort algorithm if you are just prototyping.
When demonstrating with a concrete example how not to implement a sort routine.
Because your other sorting algorithm is Monkey Sort? ;)
Seriously though, bubble sort is mainly a sorting algorithm for educational reasons and has no practical value.
When the array is already "almost" sorted or you have few additions into an already sorted-list, you can use bubble sort to resort it. Bubble sort usually works for small data-sets.
Sorting has been studied for decades, so surely the sorting algorithms provide by any programming platform (java, .NET, etc.) must be good by now, right? Is there any reason to override something like System.Collections.SortedList?
There are absolutely times where your intimate understanding of your data can result in much, much more efficient sorting algorithms than any general purpose algorithm available. I shared an example of such a situation in another post at SO, but I'll share it hear just to provide a case-in-point:
Back in the days of COBOL, FORTRAN, etc... a developer working for a phone company had to take a relatively large chunk of data that consisted of active phone numbers (I believe it was in the New York City area), and sort that list. The original implementation used a heap sort (these were 7 digit phone numbers, and a lot of disk swapping was taking place during the sort, so heap sort made sense).
Eventually, the developer stumbled on a different approach: By realizing that one, and only one of each phone number could exist in his data set, he realized that he didn't have to store the actual phone numbers themselves in memory. Instead, he treated the entire 7 digit phone number space as a very long bit array (at 8 phone numbers per byte, 10 million phone numbers requires just over a meg to capture the entire space). He then did a single pass through his source data, and set the bit for each phone number he found to 1. He then did a final pass through the bit array looking for high bits and output the sorted list of phone numbers.
This new algorithm was much, much faster (at least 1000x faster) than the heap sort algorithm, and consumed about the same amount of memory.
I would say that, in this case, it absolutely made sense for the developer to develop his own sorting algorithm.
If your application is all about sorting, and you really know your problem space, then it's quite possible for you to come up with an application specific algorithm that beats any general purpose algorithm.
However, if sorting is an ancillary part of your application, or you are just implementing a general purpose algorithm, chances are very, very good that some extremely smart university types have already provided an algorithm that is better than anything you will be able to come up with. Quick Sort is really hard to beat if you can hold things in memory, and heap sort is quite effective for massive data set ordering (although I personally prefer to use B+Tree type implementations for the heap b/c they are tuned to disk paging performance).
Generally no.
However, you know your data better than the people who wrote those sorting algorithms. Perhaps you could come up with an algorithm that is better than a generic algorithm for your specific set of data.
Implementing you own sorting algorithm is akin to optimization and as Sir Charles Antony Richard Hoare said, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil".
Certain libraries (such as Java's very own Collections.sort) implement a sort based on criteria that may or may not apply to you. For example, Collections.sort uses a merge sort for it's O(n log(n)) efficiency as well as the fact that it's an in-place sort. If two different elements have the same value, the first element in the original collection stays in front (good for multi-pass sorting to different criteria (first scan for date, then for name, the collection stays name (then date) sorted)) However, if you want slightly better constants or have a special data-set, it might make more sense to implement your own quick sort or radix sort specific exactly to what you want to do.
That said, all operations are fast on sufficiently small n
Short answer; no, except for academic interest.
You might want to multi-thread the sorting implementation.
You might need better performance characteristics than Quicksorts O(n log n), think bucketsort for example.
You might need a stable sort while the default algorithm uses quicksort. Especially for user interfaces you'll want to have the sorting order be consistent.
More efficient algorithms might be available for the data structures you're using.
You might need an iterative implementation of the default sorting algorithm because of stack overflows (eg. you're sorting large sets of data).
Ad infinitum.
A few months ago the Coding Horror blog reported on some platform with an atrociously bad sorting algorithm. If you have to use that platform then you sure do want to implement your own instead.
The problem of general purpose sorting has been researched to hell and back, so worrying about that outside of academic interest is pointless. However, most sorting isn't done on generalized input, and often you can use properties of the data to increase the speed of your sorting.
A common example is the counting sort. It is proven that for general purpose comparison sorting, O(n lg n) is the best that we can ever hope to do.
However, suppose that we know the range that the values to be sorted are in a fixed range, say [a,b]. If we create an array of size b - a + 1 (defaulting everything to zero), we can linearly scan the array, using this array to store the count of each element - resulting in a linear time sort (on the range of the data) - breaking the n lg n bound, but only because we are exploiting a special property of our data. For more detail, see here.
So yes, it is useful to write your own sorting algorithms. Pay attention to what you are sorting, and you will sometimes be able to come up with remarkable improvements.
If you have experience at implementing sorting algorithms and understand the way the data characteristics influence their performance, then you would already know the answer to your question. In other words, you would already know things like a QuickSort has pedestrian performance against an almost sorted list. :-) And that if you have your data in certain structures, some sorts of sorting are (almost) free. Etc.
Otherwise, no.