What is the quickest method of matrix multiplication? - algorithm

I've been working on a rather extensive program as of late, and I'm currently at a point where I have to use matrix multiplication. Thing is, for this particular program, speed is crucial. I'm familiar with a number of matrix setups, but I would like to know which method will run the fastest. I've done extensive research, but turned up very few results. Here is a list of the matrix multiplication algorithms I am familiar with:
Iterative algorithm
Divide and Conquer algorithm
Sub Cubic algorithms
Shared Memory Parallelism
If anyone needs clarification on the methods I listed, or on the question in general, feel free to ask.

The Strassen algorithm and the naive (O(n^3)) one are the most used in practice.
More complex algorithms with tighter asymptotic bounds, e.g. the Coppersmith-Winograd algorithm, are not used because their complexity means the benefits would be apparent only for extremely large matrices.
As others pointed out, you might want to use a library like ATLAS, which will automatically tune the algorithm depending on the characteristics of the platform where you are executing, e.g. L1/L2 cache sizes.
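As a point of reference, the naive O(n^3) algorithm mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not a tuned implementation; the function name `matmul_ikj` is made up for this example. It uses the i-k-j loop order, so the innermost loop walks one row of `b` and one row of the result in sequence, which is the kind of memory-access decision a tuned library makes for you.

```python
def matmul_ikj(a, b):
    """Multiply two matrices given as lists of rows (naive O(n^3))."""
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for k in range(m):
            a_ik = a[i][k]
            row_b = b[k]
            # Innermost loop walks one row of b and one row of c in order.
            for j in range(p):
                row_c[j] += a_ik * row_b[j]
    return c


product = matmul_ikj([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In an interpreted language the loop order matters far less than in C or Fortran; the sketch only shows the structure that an optimized library replaces.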

The quickest way might be to use an existing library that's already optimized; you don't have to reinvent the wheel every time.

Related

How does matrix multiplication scale with optimized linalg libraries?

What is the scaling of matrix-matrix multiplications when using an optimized linear algebra library (let's consider MKL and OpenBLAS)? I did some googling and the Strassen algorithm came up, which is O(N^2.807) scaling, but I couldn't find a clear answer. Can anybody confirm this or is there a better algorithm that is used instead these days?
I'm asking because I'm seeing unexpected scaling in my code. The matrices are quite large and I wonder if it's really all the pdgemms taking up the time, something that I usually don't have to worry about at all.
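For reference, the O(N^2.807) figure quoted for Strassen follows from its recurrence: it replaces the eight half-size block products of the straightforward divide-and-conquer scheme with seven, at the cost of extra matrix additions. A short derivation, where T(N) denotes the operation count (notation introduced here for illustration only):

```latex
T(N) = 7\,T\!\left(\tfrac{N}{2}\right) + \Theta(N^2)
\;\Longrightarrow\;
T(N) = \Theta\!\left(N^{\log_2 7}\right) \approx \Theta\!\left(N^{2.807}\right),
\qquad\text{versus}\qquad
T(N) = 8\,T\!\left(\tfrac{N}{2}\right) + \Theta(N^2) = \Theta\!\left(N^{3}\right).
```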

A good parameter optimization algorithm for a limited number of points with variance

I'm trying to meta-optimize an algorithm which has almost a dozen constants. I guess some form of genetic algorithm should be used. However, the algorithm itself is quite heavy and probabilistic by nature (a version of ant colony optimization). Thus calculating the fitness for some set of parameters is quite slow and the results include a lot of variance. Even the order of magnitude for some of the parameters is not exactly clear, so the distribution on some components will likely need to be logarithmic.
Would someone have ideas about suitable algorithms for this problem? I.e. it would need to converge with a limited number of measurement points and also be able to handle randomness in the measured fitness. Also, the easier it is to implement with Java the better of course. :)
If you can express your model algebraically (or as differential equations), consider trying a derivative-based optimization method. These have the theoretical properties you desire and are much more computationally efficient than black-box/derivative-free optimization methods. If you have a MATLAB license, give fmincon a try. Note: fmincon will work much better if you supply derivative information. Other modeling environments include Pyomo, CasADi and Julia/JuMP, which will automatically calculate derivatives and interface with powerful optimization solvers.
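Two requirements from the question, sampling parameters on a logarithmic scale and coping with a noisy fitness measurement, can be illustrated independently of which optimizer is chosen. The sketch below uses a plain random search purely as a stand-in (it is not the genetic algorithm or the derivative-based methods discussed above), and `noisy_fitness` is a made-up objective, not the asker's ant colony algorithm.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def noisy_fitness(x, rng):
    """Hypothetical stand-in for the expensive, noisy objective (lower is better)."""
    return (math.log10(x) + 2.0) ** 2 + rng.gauss(0.0, 0.1)


def random_search(rng, trials=40, repeats=5):
    best_x, best_score = None, float("inf")
    for _ in range(trials):
        x = sample_log_uniform(rng, 1e-6, 1e2)
        # Average several evaluations to damp the variance of the measurement.
        score = sum(noisy_fitness(x, rng) for _ in range(repeats)) / repeats
        if score < best_score:
            best_x, best_score = x, score
    return best_x, best_score


best_x, best_score = random_search(random.Random(0))
```

The `repeats` parameter trades evaluation budget for lower variance; with a slow fitness function that trade-off is the main thing to tune.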

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates, it is however difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.
The simplest possible example:
Accumulation of a sum. Unrolling using multiple accumulators and vectorization allows a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. Cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
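The multiple-accumulator idea can be sketched as follows. Python will not show the speedup (that needs a compiled language where the independent adds can overlap in the pipeline), so this is only meant to show the structure; the function name and the choice of four accumulators are assumptions made for the example.

```python
import math


def sum_four_accumulators(values):
    """Sum a sequence using four independent partial sums."""
    acc = [0.0, 0.0, 0.0, 0.0]
    n = len(values) - len(values) % 4
    for i in range(0, n, 4):
        # Four independent dependency chains: no add waits on the previous one.
        acc[0] += values[i]
        acc[1] += values[i + 1]
        acc[2] += values[i + 2]
        acc[3] += values[i + 3]
    total = (acc[0] + acc[1]) + (acc[2] + acc[3])
    for x in values[n:]:  # leftover elements when len(values) % 4 != 0
        total += x
    return total


data = [0.1] * 1003
result = sum_four_accumulators(data)
reference = math.fsum(data)  # accurately rounded sum, for comparison
```

Each partial sum stays smaller than a single running total would, which is where the reduced rounding error mentioned above comes from.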
A few classics from image/signal processing:
convolution with small kernels (especially small 2d convolves like a 3x3 or 5x5 kernel). In some sense this is cheating, because convolution is matrix multiplication, and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small kernel convolutions are quite different from either.
erode and dilate.
what image people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1] and sub-ulp accuracy is rarely needed to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
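The red/blue variation above is small enough to sketch as a brute-force kernel. This is an illustrative baseline in plain Python, with a made-up function name and made-up sample points; the inner distance loop is the part a tuned implementation would vectorise.

```python
def nearest_red_for_each_blue(red, blue):
    """For each blue point return the index of the closest red point.

    Brute force over all pairs; squared distances avoid the square root.
    """
    nearest = []
    for b in blue:
        best_index, best_d2 = -1, float("inf")
        for index, r in enumerate(red):
            d2 = sum((bi - ri) ** 2 for bi, ri in zip(b, r))
            if d2 < best_d2:
                best_index, best_d2 = index, d2
        nearest.append(best_index)
    return nearest


red_points = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (-3.0, 1.0, 2.0)]
blue_points = [(4.0, 4.0, 6.0), (-2.0, 0.0, 1.0), (0.5, -0.5, 0.0)]
matches = nearest_red_for_each_blue(red_points, blue_points)
```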
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup, by doing a quick test to see if a sphere is farther away than the best seen so far, before doing all the 3-d geometry math.
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.
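The quick rejection test suggested above might look something like this. It is a sketch under assumptions, not c-ray's actual code: the function signatures are invented here, the ray direction is assumed to be unit length, and spheres are plain `(center, radius)` tuples.

```python
import math


def ray_sphere(origin, direction, center, radius):
    """Return the nearest positive hit distance along a unit-length direction, or None."""
    oc = [c - o for o, c in zip(origin, center)]
    tc = sum(a * b for a, b in zip(oc, direction))  # projection of center onto ray
    d2 = sum(a * a for a in oc) - tc * tc           # squared ray-to-center distance
    if d2 > radius * radius:
        return None
    half_chord = math.sqrt(radius * radius - d2)
    t = tc - half_chord
    if t <= 0.0:
        t = tc + half_chord
    return t if t > 0.0 else None


def nearest_hit(origin, direction, spheres):
    """Linear search over (center, radius) pairs with a cheap early rejection."""
    best_t, best_index = float("inf"), -1
    for index, (center, radius) in enumerate(spheres):
        tc = sum((c - o) * d for o, c, d in zip(origin, center, direction))
        # Any hit on this sphere lies at distance >= tc - radius along the ray,
        # so it cannot beat the best hit found so far.
        if tc - radius >= best_t:
            continue
        t = ray_sphere(origin, direction, center, radius)
        if t is not None and t < best_t:
            best_t, best_index = t, index
    return best_index, best_t


scene = [((0.0, 0.0, 10.0), 1.0), ((0.0, 0.0, 5.0), 1.0), ((3.0, 0.0, 5.0), 1.0)]
hit_index, hit_t = nearest_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
```

The rejection costs one dot product per sphere and skips the square root whenever the sphere cannot improve on the current best hit.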

Is there a chart of all the data structures and algorithms listed anywhere?

Is there a chart or table anywhere that displays a lot of data structures and algorithms (at least the popular ones) with their running times and efficiency?
What I am looking for is something that I can glance at, and decide which structure/algorithm is best for a particular case. It would be helpful when working on a new project or just as a study guide.
A chart or table isn't going to be a particularly useful reference.
If you're going to be using a particular algorithm or data structure to tackle a problem, you'd better know and understand it inside and out. And that includes knowing (and knowing how to derive) their respective efficiencies. It's not particularly difficult. Most standard algorithms have simple, intuitive run-times like N^2, N*logN, etc.
That being said, run-time Big-O isn't everything. Take sorting for example. Heap sort has a better worst-case Big-O than, say, quick sort, yet quick sort performs much better in practice. Constant factors hidden by Big-O can also make a huge difference.
When you're talking about data structures, there's a lot more to them than meets the eye. For example, a hash map seems like just a tree map with much better performance, but you get additional sorting structure with a tree map.
Knowing which algorithm/data structure is best to use is a matter of knowledge and experience, not a look-up table.
Though back to your question, I don't know of any such reference. It would be a good exercise to make one yourself though. Wikipedia has pretty decent articles on common algorithms/data structures along with some decent analysis.
I don't believe that any such list exists. The sheer number of known algorithms and data structures is staggering, and new ones are being developed all the time. Moreover, many of these algorithms and data structures are specialized, meaning that even if you had a list in front of you it would be difficult to know which ones were applicable for the particular problems you were trying to solve.
Another concern with such a list is how to quantify efficiency. If you were to rank algorithms in terms of asymptotic complexity (big-O), then you might end up putting certain algorithms and data structures that are asymptotically optimal but impractically slow on small inputs ahead of algorithms that are known to be fast for practical cases but might not be theoretically perfect. As an example, consider looking up the median-of-medians algorithm for linear time order statistics, which has such a huge constant factor that other algorithms tend to be much better in practice. Or consider quicksort, which in the worst case is O(n^2) but in practice has average complexity O(n lg n) and is much faster than other sorting algorithms.
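The gap between quicksort's worst case and its typical behaviour is easy to see by counting comparisons. The sketch below uses a deliberately simple first-element pivot (real library sorts choose pivots more carefully), and the function name and input size are chosen only for illustration.

```python
import random


def quicksort_comparisons(items):
    """Sort a copy with first-element-pivot quicksort; return (sorted, comparisons)."""
    comparisons = 0

    def sort(seq):
        nonlocal comparisons
        if len(seq) <= 1:
            return seq
        pivot, rest = seq[0], seq[1:]
        comparisons += len(rest)  # count one pivot comparison per remaining element
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        return sort(smaller) + [pivot] + sort(larger)

    return sort(list(items)), comparisons


n = 200
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(1).shuffle(shuffled)

_, worst = quicksort_comparisons(already_sorted)  # n*(n-1)/2 comparisons
_, typical = quicksort_comparisons(shuffled)      # roughly proportional to n lg n
```

On already-sorted input every partition is maximally unbalanced, which is exactly the quadratic case a table of big-O values would list without saying how rarely it occurs with a sensible pivot.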
On the other hand, were you to try to list the algorithms by runtime efficiency, the list would be misleading. Runtime efficiency is based on a number of factors that are machine- and input-specific (such as locality, size of the input, shape of the input, speed of the machine, processor architecture, etc.). It might be useful as a rule of thumb, but in many cases you might be misled by the numbers into picking one algorithm when another is far superior.
There's also implementation complexity to consider. Many algorithms exist only in papers, or have reference implementations that are not optimized or are written in a language that isn't what you're looking for. If you find a Holy Grail algorithm that does exactly what you want but no implementation for it, it might be impossibly difficult to code up and debug your own version. For example, if there weren't a preponderance of red/black tree implementations, do you think you'd be able to code it up on your own? How about Fibonacci heaps? Or (from personal experience) van Emde Boas trees? Often it may be a good idea to pick a simpler algorithm that's "good enough" but easy to implement over a much more complex algorithm.
In short, I wish a table like this could exist that really had all this information, but practically speaking I doubt it could be constructed in a way that's useful. The Wikipedia links from @hammar's comments are actually quite good, but the best way to learn what algorithms and data structures to use in practice is by getting practice trying them out.
Collecting all algorithms and/or data structures is essentially impossible -- even as I'm writing this, there's undoubtedly somebody, somewhere, inventing some new algorithm or data structure. In the greater scheme of things, it's probably not of much significance, but it's still probably new and (ever so slightly) different from anything anybody's done before (though, of course, it's always possible it'll turn out to be a big, important thing).
That said, the US NIST has a Dictionary of Algorithms and Data Structures that lists more than most people ever know or care about. It covers most of the obvious "big" ones that everybody knows, and an awful lot of less-known ones as well. The University of Canterbury has another that is (or at least seems to me) a bit more modest, but still covers most of what a typical programmer probably cares about, and is a bit better organized for finding an algorithm to solve a particular problem, rather than being based primarily on already knowing the name of the algorithm you want.
There are also various collections/lists that are more specialized. For example, The Stony Brook Algorithm Repository is devoted primarily (exclusively?) to combinatorial algorithms. It's based on the Algorithm Design Manual, so it can be particularly useful if you have/use that book (and in case you're wondering, this book is generally quite highly regarded).
The first priority for a computer program is correctness and the second, most of the time, is programmer time - something directly linked to maintainability and extensibility.
Because of this, there is a school of programming that advocates just using simple stuff like arrays of records, unless it happens to be a performance-sensitive part, in which case you need to consider not only data structures and algorithms but also the "architecture" that led you to have that problem in the first place.

Performance Testing for Calculation-Heavy Programs

What are some good tips and/or techniques for optimizing and improving the performance of calculation-heavy programs? I'm talking about things like complicated graphics calculations or mathematical and simulation types of programming where every second saved is useful, as opposed to IO-heavy programs where only a certain amount of speedup is helpful.
While changing the algorithm is frequently mentioned as the most effective method here, I'm trying to find out how effective different algorithms are in the first place, so I want to create as much efficiency with each algorithm as is possible. The "problem" I'm solving isn't something that's well known, so there are few if any algorithms on the web, but I'm looking for any good advice on how to proceed and what to look for.
I am exploring the differences in effectiveness between evolutionary algorithms and more straightforward approaches for a particular group of related problems. I have written three evolutionary algorithms for the problem already and now I have written a brute force technique that I am trying to make as fast as possible.
Edit: To specify a bit more. I am using C# and my algorithms all revolve around calculating and solving constraint type problems for expressions (using expression trees). By expressions I mean things like x^2 + 4 or anything else like that which would be parsed into an expression tree. My algorithms all create and manipulate these trees to try to find better approximations. But I wanted to put the question out there in a general way in case it would help anyone else.
I am trying to find out if it is possible to write a useful evolutionary algorithm for finding expressions that are a good approximation for various properties. Both because I want to know what a good approximation would be and to see how the evolutionary stuff compares to traditional methods.
It's pretty much the same process as any other optimization: profile, experiment, benchmark, repeat.
First you have to figure out what sections of your code are taking up the time. Then try different methods to speed them up (trying methods based on merit would be a better idea than trying things at random). Benchmark to find out if you actually did speed them up. If you did, replace the old method with the new one. Profile again.
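The "profile first" step can be sketched with Python's standard-library `cProfile` and `pstats` modules. Python stands in here only for illustration (the question is about C#, where a different profiler would be used), and the workload functions are made up for the example.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Deliberately slow: a pure-Python loop that dominates the run time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def cheap_step(n):
    return n * (n - 1) // 2


def workload():
    return slow_square_sum(200_000) + cheap_step(200_000)


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
profile_text = report.getvalue()  # print(profile_text) to see the hot spots
```

The report should show `slow_square_sum` accounting for nearly all of the time, which is the function to experiment on before benchmarking and profiling again.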
I would recommend against a brute force approach if it's at all possible to do it some other way. But, here are some guidelines that should help you speed your code up either way.
There are many, many different optimizations you could apply to your code, but before you do anything, you should profile to figure out where the bottleneck is. Here are some profilers that should give you a good idea about where the hot spots are in your code:
GProf
PerfMon2
OProfile
HPCToolkit
These all use sampling to get their data, so the overhead of running them with your code should be minimal. Only GProf requires that you recompile your code. Also, the last three let you do both time and hardware performance counter profiles, so once you do a time (or CPU cycle) profile, you can zoom in on the hotter regions and find out why they might be running slow (cache misses, FP instruction counts, etc.).
Beyond that, it's a matter of thinking about how best to restructure your code, and this depends on what the problem is. It may be that you've just got a loop that the compiler doesn't optimize well, and you can inline or move things in/out of the loop to help the compiler out. Or, if you're running as fast as you can with basic arithmetic ops, you may want to try to exploit vector instructions (SSE, etc.) If your code is parallel, you might have load balance problems, and you may need to restructure your code so that data is better distributed across cores.
These are just a few examples. Performance optimization is complex, and it might not help you nearly enough if you're doing a brute force approach to begin with.
For more information on ways people have optimized things, there were some pretty good examples in the recent Why do you program in assembly? question.
If your optimization problem is (quasi-)convex or can be transformed into such a form, there are far more efficient algorithms than evolutionary search.
If you have large matrices, pay attention to your linear algebra routines. The right algorithm can shave an order of magnitude off the computation time, especially if your matrices are sparse.
Think about how data is loaded into memory. Even when you think you're spending most of your time on pure arithmetic, you're actually spending a lot of time moving things between levels of cache etc. Do as much as you can with the data while it's in the fastest memory.
Try to avoid unnecessary memory allocation and de-allocation. Here's where it can make sense to back away from a purely OO approach.
This is more of a tip to find holes in the algorithm itself...
To realize maximum performance, simplify everything inside the innermost loop at the expense of everything else.
One example of keeping things simple is the classic bouncing ball animation. You can implement gravity by looking up the definition in your physics book and plugging in the numbers, or you can do it like this and save precious clock cycles:
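initialize:
    float y = 0.0f;   // y coordinate
    float yi = 0.0f;  // per-step increment (velocity)
loop:
    y += yi;          // move by the current velocity
    yi += 0.001f;     // constant acceleration stands in for gravity
    if (y > 10)
        yi = -yi;     // bounce: reverse direction at the floor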
But now let's say you're having to do this with nested loops in an N-body simulation where every particle is attracted to every other particle. This can be an enormously processor intensive task when you're dealing with thousands of particles.
You should of course take the same approach and simplify everything inside the innermost loop. But more than that, at the very simplest level you should also use data types wisely. For example, math operations are faster when working with integers than with floating point variables. Also, addition is faster than multiplication, and multiplication is faster than division.
So with all of that in mind, you should be able to simplify the innermost loop using primarily addition and multiplication of integers. And then any scaling down you might need to do can be done afterwards. To take the y and yi example, if yi is an integer that you modify inside the inner loop then you could scale it down after the loop like this:
y += yi * 0.01;
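One way to read the scaling idea is to keep both `y` and `yi` in integer units inside the loop and convert once at the end. The sketch below compares that with multiplying inside the loop; the function names and the unit size `scale` are assumptions made for illustration. Whether the integer version is actually faster depends on the hardware and language, so it is worth measuring before relying on it.

```python
def accumulate_scaled(steps, scale=0.01):
    """Keep the inner loop in integer arithmetic and scale once at the end."""
    y_units = 0   # position, in integer units of `scale`
    yi_units = 0  # per-step increment, in the same integer units
    for _ in range(steps):
        y_units += yi_units
        yi_units += 1
    return y_units * scale  # single floating-point multiply after the loop


def accumulate_float(steps, scale=0.01):
    """Same update with a floating-point multiply inside the loop."""
    y = 0.0
    yi_units = 0
    for _ in range(steps):
        y += yi_units * scale
        yi_units += 1
    return y


y_fixed = accumulate_scaled(1000)
y_float = accumulate_float(1000)
```

Both versions land on about 4995; the integer version also avoids accumulating a rounding error on every step.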
These are very basic low-level performance tips, but they're all things I try to keep in mind whenever I'm working with processor intensive algorithms. Of course, if you then take these ideas and apply them to parallel processing on a GPU then you can take your algorithm to a whole new level. =)
How you do this depends mostly on which language you are using. Still, the key in any language is the profiler. Profile your code. See which functions/operations are taking the most time and then determine if you can make these costly operations more efficient.
Standard bottlenecks in numerical algorithms are memory usage (do you access matrices in the order in which the elements are stored in memory?), communication overhead, etc. They can be a little different from those of non-numerical programs. Moreover, many other factors, such as preconditioning, can lead to drastically different performance behavior of the SAME algorithm on the same problem. Make sure you determine optimal parameters for your implementations.
As for comparing different algorithms, I recommend reading the paper "Benchmarking optimization software with performance profiles," Jorge Moré and Elizabeth D. Dolan, Mathematical Programming 91 (2002), 201-213. It provides a nice, uniform way to compare different algorithms being applied to the same problem set. It really should be better known outside of the optimization community (in my not so humble opinion at least).
Good luck!
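The performance profiles recommended above are simple to compute: for each problem, divide every solver's time by the best time on that problem, then report the fraction of problems each solver handles within a factor tau of the best. A minimal sketch follows; the function name, solver names and timings are all made up for illustration.

```python
def performance_profile(times, taus):
    """Compute performance profiles from per-problem run times.

    times maps solver name -> list of run times, one per problem
    (float("inf") marks a failure). Returns solver -> list of fractions,
    one per tau: the share of problems solved within tau times the best time.
    """
    solvers = list(times)
    n_problems = len(times[solvers[0]])
    best = [min(times[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        ratios = [times[s][p] / best[p] for p in range(n_problems)]
        profile[s] = [sum(r <= tau for r in ratios) / n_problems for tau in taus]
    return profile


# Made-up timings in seconds for two hypothetical solvers on four problems.
timings = {
    "evolutionary": [2.0, 8.0, 1.0, float("inf")],
    "brute_force": [1.0, 4.0, 3.0, 6.0],
}
profile = performance_profile(timings, taus=[1.0, 2.0, 4.0])
```

The value at tau = 1 is the share of problems on which a solver was the fastest, and the value at large tau shows how often it succeeds at all. The sketch assumes at least one solver succeeds on every problem.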

Resources