As great as MATLAB is as a mathematical language, its speed is not as fast as one would like it to be. I am wondering what the general practices are for speeding up MATLAB code. For example, I know that if, instead of running for loops, one does the computations in vector/matrix form, one will see a speedup.
I am wondering what other suggestions there are.
Here are a few basic performance tips:
Learn to use the profiler to understand which parts of your computation are slow
Limit the number of expensive function calls via vectorization
Preallocate arrays instead of growing them in loops (see the sketch after this list)
Use multithreaded functions (such as bsxfun)
Use the latest version of Matlab - there have been tremendous performance gains over the last 5 years
Use the Parallel Computing Toolbox for multicore and/or GPU processing
Use efficient algorithms
Use Java or C/C++ code where appropriate (though the speed-up can be disappointing)
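As a rough illustration of the vectorization and preallocation points above, here is a minimal MATLAB sketch (the size and the element-wise operation are made up for the example):
% Slow: growing the array and looping element by element
n = 1e6;
y = [];
for i = 1:n
    y(end+1) = sin(i) * 2;   % y is reallocated on every iteration
end

% Better: preallocate, or avoid the loop entirely via vectorization
y = zeros(1, n);             % preallocated
for i = 1:n
    y(i) = sin(i) * 2;
end
y = sin(1:n) * 2;            % vectorized: one call, no explicit loop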
If you're doing a lot of easily parallelizable operations, parfor will parallelize your for loops for you: http://www.mathworks.com/help/toolbox/distcomp/parfor.html
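For example, a minimal parfor sketch might look like the following; it assumes the Parallel Computing Toolbox is available, and the loop body is an arbitrary stand-in for independent per-iteration work:
n = 200;
results = zeros(1, n);            % preallocate the output
parfor i = 1:n
    % each iteration is independent, so the workers can run them in parallel
    results(i) = sum(svd(rand(100)));
end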
Installing Lightspeed.
I have recently been through the frustrating process of installing Tom Minka's Lightspeed on my Mac. Along the way I learnt a few hard lessons worth sharing with other Mac users.
My system has the following specifications:
OS X version 10.8.5
Xcode version 4.6.3
Matlab version 2011a
1) Make sure that Lightspeed is installed on a path with NO spaces in its name. I made the mistake of putting it inside "Library/Application Support/Matlab", which caused me endless trouble. In particular, it led to the same issue reported by Tomer Levinboim (levinboim.blogspot.co.nz), with the added complication that his fixes did not fully resolve it!
2) Read Michel Valstar's notes "Compiling Matlab Mex Files on a Mac" and install the recommended patch from Mathworks (http://www.mathworks.com/matlabcentral/answers/94092). This patch applies all the needed flag/option changes that Levinboim identifies.
3) Change the options.COMPFLAGS line in the install_lightspeed.m file inside the lightspeed folder to point to the following SDK (a rough illustration appears after these steps):
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk
4) In Matlab, check that the current path points at the Lightspeed folder. Run the command install_lightspeed. If successful, run test_lightspeed. You should now have a working version of Lightspeed!
5) Path settings persist between sessions so the startup.sh approach suggested in the Read Me appears to be unnecessary on a Mac. However, if you wish to go down that track, first read:
Where is startup.m supposed to be?
http://obasic.net/set-your-customized-startup-file-for-matlab.
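For step 3, a hypothetical illustration of the kind of edit involved is shown below; the exact variable names and layout inside install_lightspeed.m may differ, so treat this only as a sketch of pointing the compiler flags at the 10.8 SDK:
% inside install_lightspeed.m (illustrative only)
options.COMPFLAGS = [options.COMPFLAGS ...
    ' -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk'];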
You might begin by reviewing some ways of thinking about vectorization here.
After that, the PDF given here, even though incomplete, provides many Matlab idioms that give good performance.
I just found this: Writing Fast MATLAB Code by Pascal Getreuer, and also this: the Lightspeed Toolbox. Great stuff...
Related
Are there any libraries that can be used on Windows for SfM or SLAM?
This will be in Python, by the way.
So far, everything I am seeing is for Linux.
Since you asked about SfM, I assume you are looking for visual SLAM solutions. These are computationally expensive, because you basically deal with a lot of iterative minimizations over large parameter spaces. Because of that, high-level languages are poorly suited for the task. So I can recommend one of two things (depending on the precision you need):
1) Don't use SfM or SLAM, but just some simple visual odometry Python package (there are quite a few on GitHub). If you are not familiar with the term, we can say it's just local pose computation, without the optimizations that are used in SLAM or SfM. So you might get locally decent results, but forget about globally coherent trajectories.
2) Use one of the freely available state-of-the-art libraries, such as ORBSLAM_2_windows, and write your own Python wrappers.
I recently discovered that one can use JIT (just-in-time) compilation with R using the compiler package (I summarized my findings on this topic in a recent blog post).
One of the questions I was asked is:
Is there any pitfall? It sounds too good to be true: just put in one line of code and that's it.
After looking around, I could find one possible issue having to do with the "start-up" time for the JIT. But are there any other issues to be careful about when using JIT?
I guess there will be some limitation having to do with R's environment architecture, but I cannot think of a simple illustration of the problem off the top of my head. Any suggestions or red flags would be of great help.
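For context, here is a minimal sketch of the "one line" in question; enableJIT and cmpfun both come from the base compiler package:
library(compiler)
enableJIT(3)     # turn on just-in-time compilation (levels 0-3)

# or compile a single function explicitly:
f  <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
fc <- cmpfun(f)  # byte-compiled copy of f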
The output of a simple test with rpart suggests that one should not use enableJIT in ALL cases:
library(rpart)
fo <- function() for(i in 1:500){rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)}
system.time(fo())
#   user  system elapsed
#   2.11    0.00    2.11
require(compiler)
enableJIT(3)
system.time(fo())
#   user  system elapsed
#  35.46    0.00   35.60
Any explanation?
The rpart example given above no longer seems to be an issue:
library("rpart")
fo = function() {
for(i in 1:500){
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
}
}
system.time(fo())
# user system elapsed
# 1.212 0.000 1.206
compiler::enableJIT(3)
# [1] 3
system.time(fo())
# user system elapsed
# 1.212 0.000 1.210
I've also tried a number of other examples, such as:
growing a vector;
a function that's just a wrapper around mean.
While I don't always get a speed-up, I've never experienced a significant slow-down.
R> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS
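For reference, hedged sketches of the two tests mentioned above might look like this (the sizes are arbitrary, and actual timings will vary by machine and R version):
library(compiler)

# Test 1: growing a vector in a loop (scalar-heavy code that tends to
# benefit from byte-compilation)
grow <- function(n) {
  x <- numeric(0)
  for (i in 1:n) x <- c(x, i^2)
  x
}

# Test 2: a thin wrapper around mean (already spends its time in compiled
# C code, so little change is expected)
wrap_mean <- function(x) mean(x)

enableJIT(0); system.time(grow(2e4)); system.time(wrap_mean(runif(1e7)))
enableJIT(3); system.time(grow(2e4)); system.time(wrap_mean(runif(1e7)))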
In principle, once the byte-code is compiled and loaded, it should always run at least as fast as under the original AST interpreter. Some code will see big speedups; this is usually code with a lot of scalar operations and loops, where most of the time is spent in R interpretation (I've seen examples with a 10x speedup, but arbitrary micro-benchmarks could indeed inflate this as needed). Some code will run at the same speed; this is usually well-vectorized code that spends nearly no time in interpretation. Now, compilation itself can be slow. Hence, the just-in-time compiler now does not compile functions when it guesses that compilation won't pay off (and the heuristics change over time; this is already the case in 3.4.x). The heuristics don't always guess right, so there may be situations when compilation won't pay off. Typical problematic patterns are code generation, code modification, and manipulation of bindings of environments captured in closures.
Packages can be byte-compiled at installation time so that the compilation cost is not paid (repeatedly) at run time, at least for code that is known ahead of time. This is now the default in the development version of R. While loading compiled code is much faster than compiling it, in some situations one may be loading code that will never be executed, so there may actually be an overhead; but overall, pre-compilation is beneficial. Recently, some parameters of the GC have been tuned to reduce the cost of loading code that won't be executed.
My recommendation for package writers would be to use the defaults (just-in-time compilation is now on by default in released versions; byte-compilation at package installation time is now on in the development version). If you find an example where the byte-code compiler does not perform well, please submit a bug report (I've also seen a case involving rpart in earlier versions). I would recommend against code generation and code manipulation, particularly in hot loops. This includes defining closures, and deleting and inserting bindings in environments captured by closures. One should definitely not do eval(parse(text= in hot loops (and this was already bad without byte-compilation). It is always better to use branches than to dynamically generate new closures (without branches). It is also better to write code with loops than to dynamically generate code with huge expressions (without loops). With the byte-code compiler, it is now often OK to write loops operating on scalars in R (the performance won't be as bad as before, so one can more often get away without switching to C for the performance-critical parts).
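To make the closure advice concrete, here is a small hedged sketch of the contrast; the function names are invented for illustration:
# Discouraged in hot code: generating a fresh closure per case
make_op <- function(op) {
  if (op == "add") function(x, y) x + y else function(x, y) x - y
}

# Preferred: one function with an ordinary branch
apply_op <- function(op, x, y) {
  if (op == "add") x + y else x - y
}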
Further to the previous answer, experimentation shows that the problem is not with the compilation of the loop but with the compilation of closures. [enableJIT(0) or enableJIT(1) leave the code fast, enableJIT(2) slows it down dramatically, and enableJIT(3) is slightly faster than the previous option (but still very slow)]. Also, contrary to Hansi's comment, cmpfun slows execution to a similar extent.
I am working on converting an existing program to take advantage of some of the parallel functionality of the STL.
Specifically, I've rewritten a big loop to work with std::accumulate. It runs nicely.
Now I want that accumulate operation to run in parallel.
The documentation I've seen for GCC outlines two specific steps.
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
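For reference, here is a hedged sketch of the two routes described in the question; the flags follow the GCC parallel-mode documentation (note that -fopenmp is needed in addition to -D_GLIBCXX_PARALLEL, otherwise the code stays serial), and none of this has been verified on OS X:
// Build (illustrative): g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL sum.cpp
#include <numeric>
#include <parallel/numeric>   // explicit __gnu_parallel algorithms
#include <vector>
#include <iostream>

int main() {
    std::vector<double> v(10000000, 1.0);

    // Route 1: plain std::accumulate, parallelized implicitly by
    // -D_GLIBCXX_PARALLEL at compile time.
    double s1 = std::accumulate(v.begin(), v.end(), 0.0);

    // Route 2: call the parallel-mode algorithm explicitly.
    double s2 = __gnu_parallel::accumulate(v.begin(), v.end(), 0.0);

    std::cout << s1 << ' ' << s2 << '\n';
    return 0;
}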
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed with the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform (a rough sketch follows below).
A third alternative is to use Grand Central Dispatch - the official multicore API on OS X; again, hardly STL.
A fourth alternative is to wait for C++17, which will have a Parallelism module.
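As a rough sketch of the OpenMP alternative mentioned above, a reduction equivalent to the std::accumulate call might look like this (the vector size and build flags are illustrative):
// Build (illustrative): g++ -O2 -fopenmp reduce.cpp   (or clang++ with OpenMP support)
#include <vector>
#include <cstddef>
#include <iostream>

int main() {
    std::vector<double> v(10000000, 1.0);

    double sum = 0.0;
    // Each thread accumulates a private partial sum; OpenMP combines them at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += v[i];
    }

    std::cout << sum << '\n';
    return 0;
}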
Here is a good linear solver named GotoBLAS. It is available for download and runs on most computing platforms. My question is: is there an easy way to link this solver with the Mathematica kernel, so that we can call it like LinearSolve? One thing most of you may agree on for sure is that if we have a very large linear system, then we had better get it solved by some industry-standard linear solver. The built-in solver is not meant for really large problems.
Now that Mathematica 8 has come out with better compilation and LibraryLink capabilities, we can expect to use some of those solvers from within Mathematica. The question is whether that requires a little tuning of the source code, or whether you need to be an advanced wizard to do it. Here in this forum we could start linking excellent open-source programs like GotoBLAS with Mathematica and exchange our views. Less experienced people can get some insight from the pro users, and in the end we get a much stronger Mathematica. It would be an open project for the ever-growing Mathematica community and a platform where these newly introduced capabilities of Mathematica 8 could be transparently documented for future users.
I hope some of you here will give solid ideas on how we can get GotoBLAS running from within Mathematica. As the newer compilation and LibraryLink capabilities are not very well documented, they are not used very often by ordinary users. This question can act as a toy example for documenting these new capabilities of Mathematica. Help in this direction from the experienced forum members will really lift the motivation of new users like me, and it will teach us a very useful way to extend Mathematica's number-crunching arsenal.
The short answer, I think, is that this is not something you really want to do.
GotoBLAS, as I understand it, is a specific implementation of BLAS, which stands for Basic Linear Algebra Subroutines. "Basic" really means quite basic here - multiply a matrix times a vector, for example. Thus, BLAS is not a solver that a function like LinearSolve would call. LinearSolve would (depending on the exact form of the arguments) call a LAPACK command, which is a higher level package built on top of BLAS. Thus, to really link GotoBLAS (or any BLAS) into Mathematica, one would really need to recompile the whole kernel.
Of course, one could write a C/Fortran program that was compiled against GotoBLAS and then link that into Mathematica. The resulting program would only use GotoBLAS when running whatever specific commands you've linked into Mathematica, however, which rather misses the whole point of BLAS.
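To make the "write your own wrapper and link it in" route a bit more concrete, here is a hedged Mathematica-side sketch; libgotosolve and gotoSolve are hypothetical names for a C wrapper that you would have to write and compile against GotoBLAS/LAPACK yourself:
(* illustrative only: assumes a hand-written C wrapper exposing gotoSolve *)
solve = LibraryFunctionLoad["libgotosolve", "gotoSolve",
   {{Real, 2}, {Real, 1}}, {Real, 1}];
x = solve[RandomReal[1, {1000, 1000}], RandomReal[1, 1000]];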
The Wolfram Kernel (Mathematica) is already linked against the highly optimized Intel Math Kernel Library (MKL), which is distributed with Mathematica. The MKL is multithreaded and vectorized, so I'm not sure what GotoBLAS would improve upon.
I currently don't know either of the two languages. The design of a piece of software is close to complete.
The intriguing points:
Ruby: Enjoyable. Follows thought process. Made for humans.
Go: Good performance. Fast compile times.
I don't know about Ruby's performance. If it's a lot slower than Go, I'll go with the latter (talking about typical speed here).
I'll learn both eventually, but right now, this will determine which one first.
Update: It's a very basic image-editing program. Technical and especially perceived speed should be high. Startup time is especially important.
Sadly, neither language is appropriate for a desktop image editing program.
You haven't told us which desktop you have in mind; I'll assume it's either Windows or Mac.
Ruby is not appropriate because it fails 2 of your requirements:
it has a terrible startup time because at startup it has to initialize a rather complicated VM, which involves loading quite a big part of its standard library
it's very slow (compared to C/Java/Go) doing the kind of computations that image processing entails
Go is statically linked and is compiled to machine code, so its startup time is excellent and the speed is close to C (i.e. it's the fastest language you can hope to choose after C/C++).
However, Go has no support whatsoever for writing Mac desktop apps (i.e. it has no bridge to Objective-C/Cocoa runtime) and the support for writing Windows desktop apps is extremely poor.
If you're targeting Windows, the only languages that give you a fast startup time are C/C++/Delphi. C# might have an acceptable startup time, and it's fast enough for the task (the very popular Paint.NET is written in C#, and you can find an old version of its code that is BSD-licensed and re-use a lot of it).
For Mac, I would recommend Objective-C - it's the native language of the platform, the best documented, and it comes with the best free dev tools (Xcode). You can use https://github.com/philippec/Pixen as a starting point.
You really need to give us some idea of what you consider to be good and bad performance, because it's a very subjective matter.
For example, people are usually willing to trade a certain amount of technical or perceived speed for a system that is easier to work with or develop. It also matters what you are trying to do. Each language has its own strengths and weaknesses; Ruby may be faster at some things than Go. Then again, if you really need speed, perhaps you should be looking at a language that is closer to the metal, such as C.
Sometimes, though, requests for speed from users are subjective too. I once had a system that the users thought was taking too long to do a specific task. There was no technical way to speed it up, so I animated the "Processing ..." window. Because the users could now see something "happening" on the screen, they thought it was going faster. On a stopwatch, it actually took a couple of seconds longer.
I think those languages are among the worst you can choose for a performance-critical application. I don't know much about Go, but Ruby is similar to Python (even slower), and Python is slow as hell. From what I've been reading, Go is much faster than Ruby, but still something like two or three times slower than other programming languages... It depends on what you are trying to do, of course; e.g. I wouldn't choose either of those for real-time physics or anything like that.
http://shootout.alioth.debian.org/u32/performance.php?test=nbody
Why is go language so slow?
http://attractivechaos.github.com/plb/
I've been working with Python for a couple of years and it's really slow; I'm sure you will hate it. Ruby is very similar to Python, and it's slower. Go is too new, so I don't really know much about it and can't tell.