How to speed up 'sort' function in Matlab? - performance

I was using the built-in sort function of Matlab:
[temp, Idx] = sort(M,2);
I would like to have the sorted index of each row of M, which is a matrix of size > 50k.
I searched hard but did not find anything. Any comments would be greatly appreciated!

To get a sense of how much room for improvement you have, I would suggest writing a test program in C using qsort, or in C++ using std::sort, and carefully timing it on 7000 inputs of size 7000 (or whatever setup you have in Matlab).
I'm going to give you my estimate: Matlab's sort probably runs (on properly vectorized code, like yours) about as fast as C++, and you're just seeing the effect of running an algorithm that takes O(n^2 log n) overall, since each of the n rows costs O(n log n) to sort. Matlab's marketing material reportedly claims that its sort function is faster than C's qsort, but take that with a grain of salt.
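To make that estimate concrete, here is a rough Python sketch (Python rather than MATLAB, purely for illustration) of what [temp, Idx] = sort(M,2) computes, together with a back-of-the-envelope comparison count. The names argsort_rows, estimated_comparisons and time_argsort are made up for this example, the indices are 0-based unlike MATLAB's, and the timing helper only gives a ballpark figure for your own machine.

```python
import math
import random
import time


def argsort_rows(matrix):
    """Sort each row and return (sorted_rows, index_rows), with 0-based indices."""
    sorted_rows, index_rows = [], []
    for row in matrix:
        # Stable argsort of one row: positions ordered by the values they hold.
        idx = sorted(range(len(row)), key=row.__getitem__)
        index_rows.append(idx)
        sorted_rows.append([row[i] for i in idx])
    return sorted_rows, index_rows


def estimated_comparisons(n_rows, n_cols):
    """Rough comparison count: n_rows independent sorts of n_cols items each."""
    return n_rows * n_cols * math.log2(n_cols)


def time_argsort(n, seed=0):
    """Time argsort_rows on a random n-by-n matrix; returns elapsed seconds."""
    rng = random.Random(seed)
    matrix = [[rng.random() for _ in range(n)] for _ in range(n)]
    start = time.perf_counter()
    argsort_rows(matrix)
    return time.perf_counter() - start
```

For the 7000 by 7000 case, estimated_comparisons(7000, 7000) comes out on the order of 6e8, which is why even a well-tuned sort takes noticeable time at that size.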

The best way to speed up that sort is to get a faster computer. It will speed everything else up too. :)
The fact is, you can rarely speed up a single call to something like a sort. MATLAB is already doing that in an efficient manner, using optimized code internally. (Reread the carlosdc answer.) The things you can sometimes get a boost on are tools that are written in MATLAB itself.
So, what can you do? Short of buying that new computer, you can look at your overall code. One single sort of that size is never that big of a problem. But the reason for doing that sort over and over again is. Think carefully about the code, about whether you can change the flow or avoid repeating the sort many times. Algorithm change is often a FAR bigger source of improvement than the wee bit you would ever get even if you could improve that sort.

Sorting is fundamentally O(n log n).
As long as you have a reasonably efficient implementation, this is unlikely to change much.
That said, as Andrew Janke's comment suggests, multi-threading can improve things dramatically.
GPU programming can be a way to get massive speedups. If you have R2010b or later, you may be able to use accelerated versions of built-in functions like sort from Mathworks.
Otherwise, write a mex wrapper around the CUDA Thrust library which includes a sort.

You could write your own sort function in C/C++ as MEX. MATLAB documentation has examples for it.
There exist many sort algorithms which are better than others in edge cases, for example on almost sorted data, or which offer stability (which does not matter in MATLAB because all its types are value types).
Is your data numeric or strings? For strings there are probably special algorithms for ASCII sort, sometimes natural sort is preferable.

Related

What are some examples of Sorting methods in Web development?

I am a TA for an algorithms class and we are doing a unit on sorting, and I wanted to have a discussion of quicksort. There are many good theoretical discussions of sorting methods on the web showing which one is better in which circumstances...
What are some real-life instances of quicksort I can give to my students, especially in the field of web development?
Does Django use quicksort?
Does React?
Does Leaflet use any kind of sort?
In fact, I don't really care about quicksort particularly. Any sorting method will do if I can point to a specific library that uses it. Thanks.
Why are my students learning sort? Why am I teaching this? I can think of academic or theoretical reasons... basically that we are constantly ordering things, either in their own right or as part of another algorithm. But what about my students, who may never have to write their own sort function?
I'll answer the question "why do we learn how to write a sort function?" Why do we learn to write anything that's already given to us by a library? Hashes, lists, queues, trees... why learn to write any of them?
The most important is to appreciate their performance consequences and when to use which one. For example, Ruby Arrays supply a lot of built in functionality. They're so well done and easy to use that it's easy to forget you're working with a list and write yourself a pile of molasses.
Look at this loop that finds a thing in a list and replaces it.
things.each { |thing|
  idx = thing.index(marker)
  thing[idx] = stuff
}
With no understanding of the underlying algorithms that seems perfectly sensible.
For each list in the list of things.
Find the item to replace.
Insert a new item in its place.
Two steps per thing. What could be simpler? And when they run it with a small amount of test data it's fine. When they put it into production with a real amount of data and having to do it thousands of times per second it's dog slow. Why? Without an appreciation for what all those methods are doing under the hood, they cannot know.
things.each { |thing|          # O(things)
  idx = thing.index(marker)    # O(thing)
  thing[idx] = stuff           # O(1)
}
Those deceptively simple looking Array methods are their own hidden loops. In the worst case each one must scan the whole list. Loops inside loops multiply the cost: this is O(n*m), which is quadratic when the lists are of similar size (not exponential, but slow enough). How slow? If things is 1000 items long, and each thing has 1000 items in it that's... 1000 * 1000 or 1,000,000 operations!
And this isn't nearly the worst trouble students can get into; normally they write O(n!) loops. I actually find it hard to come up with an example because I'm so ingrained against it.
But that only becomes apparent after you throw a ton of data at it. While you're writing it, how can you know?
How can they make it faster? Without understanding the other options available to them and their performance characteristics, like hashes and sets and trees, they cannot know. An experienced programmer would make one immediate change to the data structure and change things to a list of sets.
things.each { |thing|    # O(things)
  thing.delete(marker)   # O(1)
  thing.add(stuff)       # O(1)
}
This is much faster. Deleting and adding with an unordered set is O(1) so it's effectively free no matter how large thing gets. Now if things is 1000 items long, and each thing has 1000 items in it that's 1000 operations. By using a more appropriate data structure I just sped up that loop by 1000 times. Really what I did is changed it from O(n*m) to O(n).
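For anyone who wants to try the comparison outside Ruby, here is a minimal Python sketch of the same two loops. The function names are made up for illustration, and it assumes marker is present in every inner collection (list.index and set.remove both raise an error otherwise).

```python
def replace_in_lists(things, marker, stuff):
    """List version: each index() call is a hidden linear scan, so O(n*m) overall."""
    for thing in things:
        idx = thing.index(marker)  # scans the inner list until marker is found
        thing[idx] = stuff         # O(1) once the position is known


def replace_in_sets(things, marker, stuff):
    """Set version: remove and add are hash operations, O(1) on average."""
    for thing in things:
        thing.remove(marker)
        thing.add(stuff)
```

Timing both on 1000 collections of 1000 items each should show the gap described above, though the exact ratio depends on where marker sits in each list.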
Another solid example is learning how to write a solid comparison function for multi-level data. Why is the Schwartzian transform fast? You can't appreciate that without understanding how sorting works.
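As a rough illustration of the Schwartzian transform (decorate, sort, undecorate), here is a small Python sketch. The name schwartzian_sort and the expensive_key parameter are hypothetical. The point is that the key is computed once per item instead of once per comparison. Python's own sorted(key=...) already works this way internally, so this is only for seeing the mechanics.

```python
def schwartzian_sort(items, expensive_key):
    """Sort items by expensive_key, calling the key exactly once per item."""
    # Decorate: pair each item with its precomputed key. The original position
    # breaks ties, so the items themselves are never compared.
    decorated = [(expensive_key(item), pos, item) for pos, item in enumerate(items)]
    # Sort: only the cheap (key, position) tuples take part in comparisons.
    decorated.sort()
    # Undecorate: drop the keys and keep the items in their new order.
    return [item for _, _, item in decorated]
```

With a comparison function that recomputes the key each time, a sort of n items would evaluate it roughly n log n times; here it is evaluated n times.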
You could simply be told these things, sorting is O(n log n), finding something in a list is O(n), and so on... but having to do it yourself gives you a visceral appreciation for what's going on under the hood. It makes you appreciate all the work a modern language does for you.
That said, there's little point in writing six different sort algorithms, or four different trees, or five different hash conflict resolution functions. Write one of each to appreciate them, then just learn about the rest so you know they exist and when to use them. 98% of the time the exact algorithm doesn't matter, but sometimes it's good to know that a merge sort might work better than a quick sort.
Because honestly, you're never going to write your own sort function. Or tree. Or hash. Or queue. And if you do, you probably shouldn't be. Unless you intend to be the 1% that writes the underlying libraries (like I do), if you're just going to write web apps and business logic, you don't need a full blown Computer Science education. Spend that time learning Software Engineering instead: testing, requirements, estimation, readability, communications, etc...
So when a student asks "why are we learning this stuff when it's all built into the language now?" (echoes of "why do I have to learn math when I have a calculator?") have them write their naive loop with their fancy methods. Shove big data at it and watch it slow to a crawl. Then write an efficient loop with good selection of data structures and algorithms and show how it screams through the data. That's their answer.
NOTE: This is the original answer before the question was understood.
Most modern languages use quicksort as their default sort, but usually modified to avoid the O(n^2) worst case. Here's the BSD man page on their implementation of qsort_r(). Ruby uses qsort_r.
The qsort() and qsort_r() functions are an implementation of C.A.R. Hoare's ``quicksort'' algorithm, a variant of partition-exchange sorting; in particular, see D.E. Knuth's Algorithm Q. Quicksort takes O N lg N average time. This implementation uses median selection to avoid its O N**2 worst-case behavior.
PHP also uses quicksort, though I don't know which particular implementation.
Perl uses its own implementation of quicksort by default. But you can also request a merge sort via the sort pragma.
In Perl versions 5.6 and earlier the quicksort algorithm was used to implement "sort()", but in Perl 5.8 a mergesort algorithm was also made available, mainly to guarantee worst case O(N log N) behaviour: the worst case of quicksort is O(N**2). In Perl 5.8 and later, quicksort defends against quadratic behaviour by shuffling large arrays before sorting.
Python since 2.3 uses Timsort and is guaranteed to be stable. Any software written in Python (Django) is likely to also use the default Timsort.
Javascript, or really the ECMAScript specification, does not say what type of sorting algorithm to use for Array.prototype.sort. Historically it did not even guarantee stability; since ECMAScript 2019 the sort is required to be stable, but the particular algorithm is still left to the Javascript implementation. Like Python, any Javascript frameworks such as React or Leaflet are likely to use the built in sort.
Visual Basic for Applications (VBA) comes with NO sorting algorithm. You have to write your own. This is a bizarre oversight for any language, but particularly one that's designed for business use and spreadsheets.
Almost any table is sorted. Most web apps are backed by an SQL database and the actual sorting is performed inside that database, for example with the SQL query SELECT id, date, total FROM orders ORDER BY date DESC. This kind of sorting uses already sorted database indexes, which are mostly implemented using B-trees (or data structures inspired by B-trees). But if data needs to be sorted on the fly then I think quicksort is usually used.
Sorting, merging of sorted files and binary search in sorted files is often used in big data processing, analytics, ad dispatching, fulltext search... Even Google results are sorted :)
Sometimes you don't need sort, but partial sort, or min-heap. For example in Dijkstra's algorithm for finding shortest path. Which is used (or can be used, or I would use it :) ) for example in route planning (Google Maps).
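To show what "a min-heap instead of a full sort" looks like in practice, here is a minimal Dijkstra sketch in Python using the standard heapq module. The graph format (a dict mapping each node to a list of (neighbor, weight) pairs) is an assumption made for this example, and weights are assumed to be non-negative.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> [(neighbor, weight), ...]."""
    dist = {source: 0}
    heap = [(0, source)]  # min-heap of (distance so far, node)
    while heap:
        d, node = heapq.heappop(heap)  # closest unsettled node, no full sort needed
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist
```

The heap only ever has to hand back the current minimum, which is cheaper than keeping the whole frontier sorted.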
As pointed out by Schwern, the sorting is almost always provided by the programming language or its implementation engine, and libraries / frameworks just use that algorithm, with a custom comparison function when they need to sort complex objects.
Now if your objective is to have a real life example in the Web context, you could instead use the lack of a sorting method in SVG, and make an exercise out of it. Unlike other DOM elements, an SVG container paints its children in the order they are appended, irrespective of any "z-index" equivalent. So to implement a "z-index" functionality, you have to re-order the nodes yourself.
And to avoid just using a custom comparison function and relying on array.sort, you could add extra constraints, like stability, typically to preserve the current order of nodes with the same "z-index".
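One way such an exercise could look, sketched in Python rather than Javascript: a hand-written stable insertion sort that orders nodes by a "z-index" and keeps the existing order among nodes that share one. The node representation and the z_index accessor are assumptions for illustration, not anything from SVG or Leaflet.

```python
def stable_sort_by_z(nodes, z_index):
    """Insertion sort by z_index(node); nodes with equal z keep their current order."""
    ordered = []
    for node in nodes:
        pos = len(ordered)
        # Walk left only past nodes with a strictly greater z-index, so a node
        # never jumps ahead of an earlier node with the same z-index.
        while pos > 0 and z_index(ordered[pos - 1]) > z_index(node):
            pos -= 1
        ordered.insert(pos, node)
    return ordered
```

Students can then re-append the nodes to the container in the returned order and check that equal "z-index" shapes do not flicker between repaints.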
Since you mention Leaflet, one of the frustrations with the pre 1.0 version (e.g. 0.7.7) was that all vector shapes are appended into the same single SVG container, without any provided sorting functionality, except for bringToFront / bringToBack.

CUDA parallel sorting algorithm vs single thread sorting algorithms

I have a large amount of data which I need to sort: several million arrays, each with tens of thousands of values. What I'm wondering is the following:
Is it better to implement a parallel sorting algorithm, on the GPU, and run it across all the arrays
OR
implement a single thread algorithm, like quicksort, and assign each thread, of the GPU, a different array.
Obviously speed is the most important factor. For a single-threaded sorting algorithm, memory is a limiting factor. I've already tried to implement a recursive quicksort, but it doesn't seem to work for large amounts of data, so I'm assuming there is a memory issue.
The data type to be sorted is long, so I don't believe a radix sort would be possible, because the binary representation of the numbers would be too long.
Any pointers would be appreciated.
Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
Generally I would say the same thing about CPU sorting code. Don't write your own.
EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread.) There are two related reasons for this:
Most of the fast sorting work has been done along the lines of your first method, not the second.
The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing and minimizing branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder if the same criticism might be levelled at the first approach, but since the libraries I mention are written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of SIMT architecture.
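To give a feel for how many arrays can be sorted "at once" with nothing but a general stable sort, here is a CPU-side Python sketch of one common formulation: flatten everything, stable-sort by value, then stable-sort by array id. This is only the idea, not thrust or CUB code, and segmented_sort is a made-up name.

```python
def segmented_sort(arrays):
    """Sort every array independently using two stable sorts over the flattened data."""
    values = [v for arr in arrays for v in arr]
    segments = [seg for seg, arr in enumerate(arrays) for _ in arr]
    # Pass 1: order all positions by value, ignoring which array they came from.
    order = sorted(range(len(values)), key=values.__getitem__)
    # Pass 2: stable sort by array id. Stability keeps each array's values in
    # the sorted order established by pass 1.
    order = sorted(order, key=segments.__getitem__)
    flat = [values[i] for i in order]
    # Cut the flat result back into pieces of the original lengths.
    result, start = [], 0
    for arr in arrays:
        result.append(flat[start:start + len(arr)])
        start += len(arr)
    return result
```

On a GPU each of the two passes would be a single large, regular sort over all the data, which is the kind of work the hardware handles well.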

Fast/Area optimised sorting in hardware (fpga)

I'm trying to sort an array of 8-bit numbers using VHDL.
I'm trying to find one method that optimises delay and another that uses less hardware.
The size of the array is fixed, but I'm also interested in extending the functionality to variable lengths.
I've come across 3 algorithms so far:
Batcher's parallel method
Green sort
Van Voorhis sort
Which of these will do the best job? Are there any other methods I should be looking at?
Thanks.
There are a lot of research articles on the matter. You could try to search the web for them. I did a search for "Sorting Networks" and came up with a lot of comparisons of different algorithms and how well they fit into an FPGA.
The algorithm you choose will greatly depend on which parameter is most important to optimize for, i.e. latency, area, etc. Another important factor is where the values are stored at the beginning and end of the sort. If they are stored in registers, all might be accessed at once, but if you have to read them from a memory with a limited width, you should consider that in your implementation as well, because then you will have to sort values in a stream, and rearrange that stream before saving it back to memory.
Personally, I'd consider something time-constant like merge-sort, which has a constant time to sort, so you could easily schedule the sort for a fixed size array. I'm however not sure how well this scales or works with arbitrary sized arrays. You'd probably have to set an upper limit on array size, and also this approach works best if all data is stored in registers.
I read about this in a book by Knuth and according to that book, the Batcher's parallel merge sort is the fastest algorithm and also the most hardware efficient.
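For experimenting before committing anything to VHDL, here is a small Python sketch that generates the comparator list of Batcher's odd-even merge sort for n inputs and applies it in software. It assumes n is a power of two, and the function names are made up for this example. Each (a, b) pair would correspond to one compare-and-swap element in hardware.

```python
def batcher_network(n):
    """Comparator pairs (a, b), a < b, of Batcher's odd-even merge sort; n a power of two."""
    comparators = []
    p = 1
    while p < n:                      # p: size of the sorted runs being merged
        k = p
        while k >= 1:                 # k: distance between compared wires
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare wires that belong to the same 2p-wide merge block.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        comparators.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return comparators


def apply_network(values, comparators):
    """Run the fixed comparator sequence on a copy of values."""
    out = list(values)
    for a, b in comparators:
        if out[a] > out[b]:
            out[a], out[b] = out[b], out[a]
    return out
```

The comparator sequence is the same for every input, which is what makes the latency fixed and easy to schedule. For 8 inputs this produces 19 comparators.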

Performance Testing for Calculation-Heavy Programs

What are some good tips and/or techniques for optimizing and improving the performance of calculation-heavy programs? I'm talking about things like complicated graphics calculations or mathematical and simulation types of programming where every second saved is useful, as opposed to IO-heavy programs where only a certain amount of speedup is helpful.
While changing the algorithm is frequently mentioned as the most effective method here, I'm trying to find out how effective different algorithms are in the first place, so I want to create as much efficiency with each algorithm as is possible. The "problem" I'm solving isn't something that's well known, so there are few if any algorithms on the web, but I'm looking for any good advice on how to proceed and what to look for.
I am exploring the differences in effectiveness between evolutionary algorithms and more straightforward approaches for a particular group of related problems. I have written three evolutionary algorithms for the problem already and now I have written a brute force technique that I am trying to make as fast as possible.
Edit: To specify a bit more. I am using C# and my algorithms all revolve around calculating and solving constraint type problems for expressions (using expression trees). By expressions I mean things like x^2 + 4 or anything else like that which would be parsed into an expression tree. My algorithms all create and manipulate these trees to try to find better approximations. But I wanted to put the question out there in a general way in case it would help anyone else.
I am trying to find out if it is possible to write a useful evolutionary algorithm for finding expressions that are a good approximation for various properties. Both because I want to know what a good approximation would be and to see how the evolutionary stuff compares to traditional methods.
It's pretty much the same process as any other optimization: profile, experiment, benchmark, repeat.
First you have to figure out what sections of your code are taking up the time. Then try different methods to speed them up (trying methods based on merit would be a better idea than trying things at random). Benchmark to find out if you actually did speed them up. If you did, replace the old method with the new one. Profile again.
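The benchmark step can be as small as the following Python sketch (the asker uses C#, so treat this as a pattern rather than drop-in code). The helper names and the two toy candidates are made up for illustration. The one design choice worth copying is checking that the candidates agree before timing them, so a faster but wrong version never replaces the old one.

```python
import time


def best_time(fn, *args, repeats=5):
    """Best wall-clock time of fn(*args) over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def compare(candidates, *args, repeats=5):
    """Check that all candidates return the same result, then time each one."""
    results = [fn(*args) for fn in candidates]
    if any(r != results[0] for r in results):
        raise ValueError("candidates disagree; fix correctness before timing")
    return {fn.__name__: best_time(fn, *args, repeats=repeats) for fn in candidates}


def sum_squares_loop(n):
    """Old method: add up i*i for i in 0..n-1 one term at a time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n):
    """New method: closed form for the same sum."""
    return (n - 1) * n * (2 * n - 1) // 6
```

Calling compare([sum_squares_loop, sum_squares_formula], 100000) returns a dict of timings keyed by function name.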
I would recommend against a brute force approach if it's at all possible to do it some other way. But, here are some guidelines that should help you speed your code up either way.
There are many, many different optimizations you could apply to your code, but before you do anything, you should profile to figure out where the bottleneck is. Here are some profilers that should give you a good idea about where the hot spots are in your code:
GProf
PerfMon2
OProfile
HPCToolkit
These all use sampling to get their data, so the overhead of running them with your code should be minimal. Only GProf requires that you recompile your code. Also, the last three let you do both time and hardware performance counter profiles, so once you do a time (or CPU cycle) profile, you can zoom in on the hotter regions and find out why they might be running slow (cache misses, FP instruction counts, etc.).
Beyond that, it's a matter of thinking about how best to restructure your code, and this depends on what the problem is. It may be that you've just got a loop that the compiler doesn't optimize well, and you can inline or move things in/out of the loop to help the compiler out. Or, if you're running as fast as you can with basic arithmetic ops, you may want to try to exploit vector instructions (SSE, etc.) If your code is parallel, you might have load balance problems, and you may need to restructure your code so that data is better distributed across cores.
These are just a few examples. Performance optimization is complex, and it might not help you nearly enough if you're doing a brute force approach to begin with.
For more information on ways people have optimized things, there were some pretty good examples in the recent Why do you program in assembly? question.
If your optimization problem is (quasi-)convex or can be transformed into such a form, there are far more efficient algorithms than evolutionary search.
If you have large matrices, pay attention to your linear algebra routines. The right algorithm can shave an order of magnitude off the computation time, especially if your matrices are sparse.
Think about how data is loaded into memory. Even when you think you're spending most of your time on pure arithmetic, you're actually spending a lot of time moving things between levels of cache etc. Do as much as you can with the data while it's in the fastest memory.
Try to avoid unnecessary memory allocation and de-allocation. Here's where it can make sense to back away from a purely OO approach.
This is more of a tip to find holes in the algorithm itself...
To realize maximum performance, simplify everything inside the innermost loop at the expense of everything else.
One example of keeping things simple is the classic bouncing ball animation. You can implement gravity by looking up the definition in your physics book and plugging in the numbers, or you can do it like this and save precious clock cycles:
initialize:
    float y = 0;   // y coordinate
    float yi = 0;  // incremental variable
loop:
    y += yi;
    yi += 0.001;
    if (y > 10)
        yi = -yi;
But now let's say you're having to do this with nested loops in an N-body simulation where every particle is attracted to every other particle. This can be an enormously processor intensive task when you're dealing with thousands of particles.
You should of course take the same approach and simplify everything inside the innermost loop. But more than that, at the very simplest level you should also use data types wisely. For example, math operations are often faster when working with integers than with floating point variables. Also, addition is generally faster than multiplication, and multiplication is faster than division.
So with all of that in mind, you should be able to simplify the innermost loop using primarily addition and multiplication of integers. And then any scaling down you might need to do can be done afterwards. To take the y and yi example, if yi is an integer that you modify inside the inner loop then you could scale it down after the loop like this:
y += yi * 0.01;
These are very basic low-level performance tips, but they're all things I try to keep in mind whenever I'm working with processor intensive algorithms. Of course, if you then take these ideas and apply them to parallel processing on a GPU then you can take your algorithm to a whole new level. =)
Well, how you do this depends mostly on which language you are using. Still, the key in any language is the profiler. Profile your code. See which functions/operations are taking the most time and then determine if you can make these costly operations more efficient.
Standard bottlenecks in numerical algorithms are memory usage (do you access matrices in the order in which the elements are stored in memory?), communication overhead, etc. They can be a little different from those in other, non-numerical programs.
Moreover, many other factors such as preconditioning, etc. can lead to drastically different performance behavior of the SAME algorithm on the same problem. Make sure you determine optimal parameters for your implementations.
As for comparing different algorithms, I recommend reading the paper "Benchmarking optimization software with performance profiles," Jorge Moré and Elizabeth D. Dolan, Mathematical Programming 91 (2002), 201-213.
It provides a nice, uniform way to compare different algorithms being applied to the same problem set. It really should be better known outside of the optimization community (in my not so humble opinion at least).
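As I understand the idea, a performance profile reports, for each solver, the fraction of problems it solves within a factor tau of the best time any solver achieved on that problem. A minimal Python sketch, with a made-up input format (a dict mapping solver name to a list of times, one per problem, all positive):

```python
def performance_profile(times, tau):
    """Fraction of problems each solver finishes within tau times the best solver's time."""
    solvers = list(times)
    n_problems = len(times[solvers[0]])
    # Best (smallest) time achieved by any solver on each problem.
    best = [min(times[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        # Performance ratio on problem p is times[s][p] / best[p]; count ratios <= tau.
        within = sum(1 for p in range(n_problems) if times[s][p] / best[p] <= tau)
        profile[s] = within / n_problems
    return profile
```

Evaluating this for a range of tau values and plotting one curve per solver gives the kind of comparison the paper advocates. Failed runs are not handled in this sketch.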
Good luck!

Is there any reason to implement my own sorting algorithm?

Sorting has been studied for decades, so surely the sorting algorithms provided by any programming platform (Java, .NET, etc.) must be good by now, right? Is there any reason to override something like System.Collections.SortedList?
There are absolutely times where your intimate understanding of your data can result in much, much more efficient sorting algorithms than any general purpose algorithm available. I shared an example of such a situation in another post on SO, but I'll share it here just to provide a case in point:
Back in the days of COBOL, FORTRAN, etc... a developer working for a phone company had to take a relatively large chunk of data that consisted of active phone numbers (I believe it was in the New York City area), and sort that list. The original implementation used a heap sort (these were 7 digit phone numbers, and a lot of disk swapping was taking place during the sort, so heap sort made sense).
Eventually, the developer stumbled on a different approach: By realizing that one, and only one of each phone number could exist in his data set, he realized that he didn't have to store the actual phone numbers themselves in memory. Instead, he treated the entire 7 digit phone number space as a very long bit array (at 8 phone numbers per byte, 10 million phone numbers requires just over a meg to capture the entire space). He then did a single pass through his source data, and set the bit for each phone number he found to 1. He then did a final pass through the bit array looking for high bits and output the sorted list of phone numbers.
This new algorithm was much, much faster (at least 1000x faster) than the heap sort algorithm, and consumed about the same amount of memory.
I would say that, in this case, it absolutely made sense for the developer to develop his own sorting algorithm.
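The bit-array trick is short enough to sketch in Python. This is a reconstruction of the idea as described, not the original developer's code. It assumes every number is unique and lies in 0..universe_size-1, and bitmap_sort is a made-up name.

```python
def bitmap_sort(numbers, universe_size):
    """Sort unique integers in [0, universe_size) using one bit per possible value."""
    bits = bytearray((universe_size + 7) // 8)  # 8 numbers per byte, all bits start at 0
    # Pass 1: set the bit that corresponds to each number seen in the input.
    for n in numbers:
        bits[n >> 3] |= 1 << (n & 7)
    # Pass 2: walk the bit array in order and emit the position of every set bit.
    result = []
    for byte_index, byte in enumerate(bits):
        if byte:  # skip empty bytes quickly
            for bit in range(8):
                if byte & (1 << bit):
                    result.append(byte_index * 8 + bit)
    return result
```

For 7-digit phone numbers universe_size would be 10**7, so the bit array takes about 1.25 MB regardless of how many numbers are actually present.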
If your application is all about sorting, and you really know your problem space, then it's quite possible for you to come up with an application specific algorithm that beats any general purpose algorithm.
However, if sorting is an ancillary part of your application, or you are just implementing a general purpose algorithm, chances are very, very good that some extremely smart university types have already provided an algorithm that is better than anything you will be able to come up with. Quick Sort is really hard to beat if you can hold things in memory, and heap sort is quite effective for massive data set ordering (although I personally prefer to use B+Tree type implementations for the heap b/c they are tuned to disk paging performance).
Generally no.
However, you know your data better than the people who wrote those sorting algorithms. Perhaps you could come up with an algorithm that is better than a generic algorithm for your specific set of data.
Implementing your own sorting algorithm is akin to optimization and, as Donald Knuth wrote (the line is often attributed to Sir Charles Antony Richard Hoare), "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil".
Certain libraries (such as Java's very own Collections.sort) implement a sort based on criteria that may or may not apply to you. For example, Collections.sort uses a merge sort for its O(n log(n)) efficiency as well as the fact that it's a stable sort: if two different elements compare as equal, the one that came first in the original collection stays in front. That is good for multi-pass sorting on different criteria (sort by date first, then by name, and the collection ends up sorted by name, then date). However, if you want slightly better constants or have a special data set, it might make more sense to implement your own quick sort or radix sort tailored exactly to what you want to do.
That said, all operations are fast on sufficiently small n
Short answer; no, except for academic interest.
You might want to multi-thread the sorting implementation.
You might need better performance characteristics than Quicksort's O(n log n); think bucket sort, for example.
You might need a stable sort while the default algorithm uses quicksort. Especially for user interfaces you'll want to have the sorting order be consistent.
More efficient algorithms might be available for the data structures you're using.
You might need an iterative implementation of the default sorting algorithm because of stack overflows (eg. you're sorting large sets of data).
Ad infinitum.
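On the stack overflow point, here is a hedged Python sketch of a quicksort that uses an explicit stack instead of recursion. It is one possible formulation (middle pivot, Hoare-style partition), not what any particular platform ships, and the name is made up.

```python
def quicksort_iterative(values):
    """Quicksort with an explicit stack of (lo, hi) ranges instead of recursion."""
    a = list(values)
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue  # ranges of zero or one element are already sorted
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:  # Hoare-style partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Push the larger side first so the smaller side is handled next;
        # this keeps the explicit stack at O(log n) entries.
        larger, smaller = (lo, j), (i, hi)
        if j - lo < hi - i:
            larger, smaller = smaller, larger
        stack.append(larger)
        stack.append(smaller)
    return a
```

The pending work lives in an ordinary list on the heap, so input size no longer translates into call depth.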
A few months ago the Coding Horror blog reported on some platform with an atrociously bad sorting algorithm. If you have to use that platform then you sure do want to implement your own instead.
The problem of general purpose sorting has been researched to hell and back, so worrying about that outside of academic interest is pointless. However, most sorting isn't done on generalized input, and often you can use properties of the data to increase the speed of your sorting.
A common example is the counting sort. It is proven that for general purpose comparison sorting, O(n lg n) is the best that we can ever hope to do.
However, suppose that we know the values to be sorted lie in a fixed range, say [a,b]. If we create an array of size b - a + 1 (defaulting everything to zero), we can linearly scan the input, using this array to store the count of each element, and then read the counts back in order. The result is a sort that is linear in the input size plus the size of the range, breaking the n lg n bound, but only because we are exploiting a special property of our data. For more detail, see here.
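A minimal Python sketch of that counting sort, assuming integer values that all lie in [a, b] (the function name and signature are made up for this example):

```python
def counting_sort(values, a, b):
    """Sort integers known to lie in [a, b] in O(n + (b - a + 1)) time."""
    counts = [0] * (b - a + 1)  # one counter per possible value, all zero
    # Linear scan of the input: tally how often each value occurs.
    for v in values:
        counts[v - a] += 1
    # Linear scan of the counters: emit each value as many times as it was seen.
    result = []
    for offset, count in enumerate(counts):
        result.extend([a + offset] * count)
    return result
```

No element is ever compared with another, which is why the comparison-sort lower bound does not apply. The price is memory proportional to the size of the range.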
So yes, it is useful to write your own sorting algorithms. Pay attention to what you are sorting, and you will sometimes be able to come up with remarkable improvements.
If you have experience at implementing sorting algorithms and understand the way the data characteristics influence their performance, then you would already know the answer to your question. In other words, you would already know things like a QuickSort has pedestrian performance against an almost sorted list. :-) And that if you have your data in certain structures, some sorts of sorting are (almost) free. Etc.
Otherwise, no.

Resources