very slow matlab jacket if statement - performance

I encountered a very slow if statement response using cuda\jacket in matlab. (5 sec vs 0.02 sec for the same code that finds local maxima, using a simple for loop and an if condition)
Being new to GPU programming, I went reading and when I saw a previous matlab if statements with CUDA SO discussion, I felt something is missing.
You don't need to use cuda to know that it is better to vectorize your code. However, there are cases where you will need to use an if statement anyway.
For example, I'd like to find whether a pixel of a 2D image (say m(a,b)) is the local maximum of its 8 nearest neighbors. In matlab, an easy way to do that is by using 8 logical conditions on an if statement:
if m(a,b)>m(a-1,b-1) & m(a,b)>m(a,b-1) & m(a,b)>m(a+1,b-1) & ... etc on all nearest neighbors
I'd appreciate if you have an idea how to resolve (or vectorize) this...

The problem with using multiple "if" statements (or any other conditional statements) is that for each of the statements, the result is copied from the GPU to the host, and this can be costly.
The simplest way is to vectorize in the following manner.
window = m(a-1:a+1, b-1:b+1);
if all(window(:) <= m(a,b))
% do something
end
This can be further optimized if you can show what the if / else branches are doing, i.e. please post the if/else code so we can look at possible ways to remove the if condition entirely.
EDIT
With new information, here is what can be done.
for j = 1:length(y)
a = x(j);
b = y(j);
window = d(a-1:a+1, b-1:b+1);
condition = all(window(:) <= d(a,b));
M(a, b) = condition + ~condition * M(a,b);
end
You can use a gfor loop to make it even faster.
gfor j = 1:length(y)
a = x(j);
b = y(j);
window = d(a-1:a+1, b-1:b+1);
condition = all(window(:) <= d(a,b));
M(a, b) = condition + ~condition * M(a,b);
gend

Using built-in functions
The easiest approach, and one that is already optimized, is probably to use the imregionalmax function,
maxinI = imregionalmax(I, CONN);
where CONN is the desired connectivity (in your case 8).
Note however that imregionalmax is part of the image processing toolbox.
Using the max function
If you're trying to see if just that one pixel is the local maximum of its neighbors you would probably do something like
if m(a,b) == max(max(m( (a-1) : (a+1), (b-1) : (b+1))))
Or perhaps rather than taking two max it may be faster in some cases to reshape,
if m(a,b) == max(reshape (m( (a-1) : (a+1), (b-1) : (b+1)), 9,1) )
Without the max function
Lastly if you want to avoid the max function altogether that is also possible in a more vectorized form than you have so far, namely
if all(reshape( m(a,b) >= m( (a-1) : (a+1), (b-1) : (b+1)), 9,1))
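If the goal is to test every pixel rather than one pixel at a time, the comparison can be applied to the whole image at once by comparing the interior against its eight shifted copies. The following is only a sketch under a few assumptions: it looks for strict maxima, it skips the border pixels, the names rows, cols and isMax are made up for illustration, and magic(6) merely stands in for a real image. It has not been tried on Jacket GPU arrays.

```matlab
% Stand-in image; replace with your own 2D array.
m = magic(6);

% Interior pixel coordinates (border pixels have no full 8-neighbourhood).
rows = 2:size(m, 1) - 1;
cols = 2:size(m, 2) - 1;
c = m(rows, cols);

% Start by assuming every interior pixel is a maximum, then knock out
% those that are not strictly greater than one of the 8 neighbours.
isMax = true(size(c));
for da = -1:1
    for db = -1:1
        if da == 0 && db == 0
            continue;                       % skip the pixel itself
        end
        isMax = isMax & (c > m(rows + da, cols + db));
    end
end

% isMax(i, j) refers to pixel m(i+1, j+1).
[r, q] = find(isMax);
localMaxima = [r + 1, q + 1];               % coordinates in the original image
```

The two short loops run only 8 times in total regardless of image size; all per-pixel work happens in the array comparisons, so no if statement is evaluated per pixel.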

Related

Vectorization or alternative to speed up MATLAB loop

I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6, length(vec2)=1.7e3 and the portions indexed by idx_start/end are slightly shorter on average than the latter, although not significantly.
Of course, 3.3 s is not particularly worrying in itself, but I would like to be able to increase N and vec1 in particular by 1 or 2 orders of magnitude, and then such a loop will take considerably longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but it actually made it slower, much to my surprise. Generally, I would create a variable for an array that is used more than once thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 seconds for the code that is commented out.
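One loop-free alternative that may be worth timing is to expand every index up front and let accumarray add up the overlapping contributions. This is a sketch, not a measured improvement: it assumes all segments have the same length (as in the test setup above), it relies on implicit expansion (R2016b or later), the sizes are deliberately small, and the names len, offs, idx1, idx2, vals, ref and out are illustrative. Memory use grows with the total number of indexed elements, so for large N it may well be slower than the plain loop.

```matlab
rng(0);
N = 200; len = 50;                          % small illustrative sizes
vec1 = zeros(1, 5000);
vec2 = 1:300;
a = randn(N, 1);
idx_start1 = randi([1, numel(vec1) - len], N, 1);
idx_start2 = randi([1, numel(vec2) - len], N, 1);

% Reference result: the original loop.
ref = vec1;
for k = 1:N
    i1 = idx_start1(k):idx_start1(k) + len;
    i2 = idx_start2(k):idx_start2(k) + len;
    ref(i1) = ref(i1) + a(k) * vec2(i2);
end

% Loop-free version: one row of indices per iteration k.
offs = 0:len;
idx1 = idx_start1 + offs;                   % N-by-(len+1) target indices
idx2 = idx_start2 + offs;                   % N-by-(len+1) source indices
vals = a .* vec2(idx2);                     % scaled source segments

% accumarray sums all values that land on the same target index,
% which handles overlapping segments correctly.
out = vec1 + accumarray(idx1(:), vals(:), [numel(vec1) 1]).';

fprintf('max abs difference: %g\n', max(abs(out - ref)));
```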

Huge memory allocation running a julia function?

I am trying to run the following function at the Julia prompt, but when I time it I see a huge amount of memory allocation and I can't figure out why.
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
for m = 1:length(Pf)
i = 0
for k = 1:iters
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
#thresh = erfcinv(2Pf) * sqrt(2/L) + 1
#Pd_the = 0.5 * erfc(((thresh - (snr + 1)) * sqrt(L)) / (2*(snr + 1)))
end
Running that function at the Julia prompt on my laptop, I get the following shocking numbers:
julia> @time pdpf(1000, 10000)
17.621551 seconds (9.00 M allocations: 30.294 GB, 7.10% gc time)
What is wrong with my code? Any help is appreciated.
I don't think this memory allocation is so surprising. For instance, consider all of the times that the inner loop gets executed:
for m = 1:length(Pf) this gives you 100 executions
for k = 1:iters this gives you 10,000 executions based on the arguments you supply to the function.
randn(L) this gives you a random vector of length 1,000, based on the arguments you supply to the function.
Thus, just considering these, you've got 100*10,000*1000 = 1 billion Float64 random numbers being generated. Each one of them takes 64 bits = 8 bytes. I.e. 8GB right there. And, you've got two calls to randn(L) which means that you're at 16GB allocations already.
You then have y = s + n which means another 8GB of allocations, taking you up to 24GB. I haven't looked in detail at the remaining code to get you from 24GB to 30GB, but this should show you that it's not hard for the allocations to start adding up in your code.
If you're looking at places to improve, I'll give you a hint that these lines can be improved by using the properties of normal random variables:
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
You should easily be able to cut down the allocations here from 24GB to 8GB in this way. Note that y will be a normal random variable here as you've defined it, and think up a way to generate a normal random variable with an identical distribution to what y has now.
Another small thing, snr is a constant inside your function. Yet, you keep taking its sqrt 1 million separate times. In some settings, 'checking your work' can be helpful, but I think that you can be confident the computer will get it right the first time and thus you don't need to make it keep re-doing this calculation ; ). There are other similar places you can improve your code to avoid duplicate computations here that I'll leave to you to locate.
aireties gives a good answer for why you have so many allocations. You can do more to reduce the number of allocations. The sum of two independent zero-mean normal random variables is itself normal, with a variance equal to the sum of the two variances. So y = s+n, which is really y = sqrt(snr) * randn(L) + randn(L), can instead be generated as y = rvvar*randn(L), where rvvar = sqrt(1+sqrt(snr)^2) is defined outside the loop (thanks for the fix!). Despite its name, rvvar is the standard deviation of y, not its variance. This will halve the number of random variables needed.
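Spelled out as a short derivation, with z denoting a standard normal vector (a symbol introduced here only for illustration) and snr = 10^(-10/10) = 0.1 as in the function:

```latex
\begin{align*}
s &\sim \mathcal{N}(0,\ \mathrm{snr}), \qquad n \sim \mathcal{N}(0,\ 1), \qquad s \perp n \\
y = s + n &\sim \mathcal{N}(0,\ 1 + \mathrm{snr}) \\
y &\overset{d}{=} \sqrt{1 + \mathrm{snr}}\; z, \qquad z \sim \mathcal{N}(0,\ 1) \\
\sqrt{1 + \mathrm{snr}} &= \sqrt{1.1} \approx 1.0488
\end{align*}
```

The equality is in distribution only: individual samples differ from those of the two-call version, but the statistics of energy_fin are unchanged.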
Outside the loop you can save sqrt(2/L) to cut down a little bit of time.
I don't think transpose is special-cased yet, so try using dot(y,y) instead of y'*y. I know dot for sure is just a loop without having to transpose, while the other may transpose depending on the version of Julia.
Something that would help performance (but not allocations) would be to use one big randn(L,iters) and loop through that. The reason is that generating all of your random numbers at once is faster, since it can use SIMD and a bunch of other goodies. If you want to do that implicitly without changing your code much, you can use ChunkedArrays.jl, where you use rands = ChunkedArray(randn,L) to initialize it and then every time you want a randn(L), you instead use next(rands). Inside the ChunkedArray it actually makes bigger vectors and replenishes them as needed, but this way you can just get your randn(L) without having to keep track of all of that.
Edit:
ChunkedArrays probably only save time when L is smaller. This gives the code:
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
rvvar= sqrt(1+sqrt(snr)^2)
for m = 1:length(Pf)
i = 0
for k = 1:iters
y = rvvar*randn(L)
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
end
which runs in half the time of the version with two randn calls. Indeed from the ProfileViewer we get:
@profile pdpf(1000, 10000)
using ProfileView
ProfileView.view()
I circled the two parts for the line y = rvvar*randn(L), so the vast majority of the time is random number generation. Last time I checked you could still get a decent speedup on random number generation by changing to the VSL.jl library, but you need MKL linked to your Julia build. Note that from the Google Summer of Code page you can see that there is a project to make a repo RNG.jl with faster pseudo-RNGs. It looks like it already has a few new ones implemented. You may want to check them out and see if they give speedups (or help out with that project!)

Exclude matrix elements from calculation with respect to performance

I am trying to save some calculation time. I am doing some Image processing with the well known Lucas Kanade algorithm. Starting point was this paper by Baker / Simon.
I am doing this in Matlab and I also use a background subtractor. I want the subtractor to set all background to 0 or to give me a logical mask with 1 as foreground and 0 as background.
What I want to have is to exclude all matrix elements which are background from the calculation. My goal is to save time for the calculation. I am aware that I can use syntax like
A(A>0) = ...
But that doesn't work in a way like
B(A>0) = A.*C.*D
because I am getting an error:
In an assignment A(I) = B, the number of elements in B and I must be the same.
This is probably because A,B and C all together have more elements than only matrix A.
In C code I would just loop over the matrix, check if the pixel has the value 0 and then continue. That way I save a whole bunch of calculations.
In matlab however it's not very fast to loop through the matrix. So is there a fast way to solve my problem? I couldn't find a sufficient answer to my problem here.
In case anybody is interested: I am trying to use robust error functions instead of quadratic ones.
Update:
I tried the following approach to test the speed as suggested by @Acorbe:
function MatrixTest()
n = 100;
A = rand(n,n);
B = rand(n,n);
C = rand(n,n);
D = rand(n,n);
profile clear, profile on;
for i=1:10000
tests(A,B,C,D);
end
profile off, profile report;
function result = tests(A,B,C,D)
idx = (B>0);
t = A(idx).*B(idx).*C(idx).*D(idx);
LGS1a(idx) = t;
LGS1b = A.*B.*C.*D;
And I got the following results with the MATLAB profiler:
t = A(idx).*B(idx).*C(idx).*D(idx); 1.520 seconds
LGS1a(idx) = t; 0.513 seconds
idx = (B>0); 0.264 seconds
LGS1b = A.*B.*C.*D; 0.155 seconds
As you can see, the overhead of accessing the matrix by index costs far more than just computing the full element-wise product.
What about the following?
mask = A>0;
B = zeros(size(A)); % # some initialization
t = A.*C.*D;
B( mask ) = t( mask );
in this way you select just the needed elements of t. Maybe there is some overhead in the calculation, although likely negligible with respect to for loops slowness.
EDIT:
If you want more speed, you can try a more selective approach which uses the mask everywhere.
t = A(mask).*C(mask).*D(mask);
B( mask ) = t;
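To check on your own data which variant wins, something like the sketch below can be used. It assumes small random test matrices, and the third variant (multiplying by the mask instead of indexing with it) is an extra suggestion of mine that is not part of the approaches above; the names B1, B2 and B3 are illustrative. All three should produce the same B, so only the timings differ.

```matlab
rng(0);
A = rand(100) - 0.5;                        % roughly half the entries are > 0
C = rand(100);
D = rand(100);
mask = A > 0;

% Variant 1: compute everything, then copy only the masked elements.
t = A .* C .* D;
B1 = zeros(size(A));
B1(mask) = t(mask);

% Variant 2: index with the mask before multiplying.
B2 = zeros(size(A));
B2(mask) = A(mask) .* C(mask) .* D(mask);

% Variant 3 (assumption, not from the thread): no indexing at all,
% background entries are zeroed by multiplying with the logical mask.
B3 = A .* C .* D .* mask;

% Time the core expression of each variant.
fprintf('full product:    %g s\n', timeit(@() A .* C .* D));
fprintf('masked operands: %g s\n', timeit(@() A(mask) .* C(mask) .* D(mask)));
fprintf('mask multiply:   %g s\n', timeit(@() A .* C .* D .* mask));
```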

a faster way of implementing the nested loop with gamma function

I am trying to evaluate the following integral:
I can find the area for the following polynomial as follows:
pn =
-0.0250 0.0667 0.2500 -0.6000 0
First using the integration by Simpson's rule
fn=@(x) exp(polyval(pn,x));
area=quad(fn,-10,10);
fprintf('area evaluated by Simpsons rule : %f \n',area)
and the result is area evaluated by Simpsons rule : 11.483072
Then with the following code that evaluates the summation in the above formula with gamma function
a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
for n=0:40;
for m=0:40;
for p=0:40;
if(rem(n+p,2)==0)
result=result+ (b^n * c^m * d^p) / ( factorial(n)*factorial(m)*factorial(p) ) *...
gamma( (3*n+2*m+p+1)/4 ) / (-a)^( (3*n+2*m+p+1)/4 );
end
end
end
end
result=result*1/2*exp(f)
and this returns 11.4831. More or less the same result with the quad function. Now my question is whether or not it is possible for me to get rid of this nested loop as I will construct the cumulative distribution function so that I can get samples from this distribution using the inverse CDF transform. (for constructing the cdf I will use gammainc i.e. the incomplete gamma function instead of gamma)
I will need to sample from such densities that may have different polynomial coefficients and speed is of concern to me. I can already sample from such densities using Monte Carlo methods but I would like to see whether or not it is possible for me to use exact sampling from the density in order to speed up.
Thank you very much in advance.
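For reference, the sum that the nested loops evaluate can be written out as below. This is read off the code rather than copied from a formula, it assumes a = pn(1) < 0 so that the integral converges, and in the code the three infinite sums are truncated at 40.

```latex
\begin{align*}
\int_{-\infty}^{\infty} e^{a x^4 + b x^3 + c x^2 + d x + f}\,dx
  &= \frac{e^{f}}{2}
     \sum_{\substack{n,m,p \ge 0 \\ n+p\ \text{even}}}
     \frac{b^{n} c^{m} d^{p}}{n!\, m!\, p!}\;
     \frac{\Gamma\!\left(\frac{3n+2m+p+1}{4}\right)}{(-a)^{(3n+2m+p+1)/4}},
  \qquad a < 0, \\
\text{using}\quad
\int_{-\infty}^{\infty} x^{k} e^{a x^4}\,dx
  &= \begin{cases}
       \dfrac{1}{2}\,\Gamma\!\left(\dfrac{k+1}{4}\right)(-a)^{-(k+1)/4}, & k \text{ even}, \\[1ex]
       0, & k \text{ odd},
     \end{cases}
  \qquad k = 3n + 2m + p .
\end{align*}
```

Since 3n + 2m + p has the same parity as n + p, the odd-k terms are exactly the ones the rem(n+p,2)==0 test skips.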
There are several things one might do. The simplest is to avoid calling factorial. Instead one can use the relation that
factorial(n) = gamma(n+1)
Since gamma seems to be actually faster than a call to factorial, you can save a bit there:
>> timeit(@() factorial(40))
ans =
4.28681157826087e-05
>> timeit(@() gamma(41))
ans =
2.06671024634146e-05
>> timeit(@() gammaln(41))
ans =
2.17632543333333e-05
Even better, one can do all 4 calls in a single call to gammaln. For example, think about what this does:
gammaln([(3*n+2*m+p+1)/4,n+1,m+1,p+1])*[1 -1 -1 -1]'
Note that this call has no problem with overflows either in case your numbers get large enough. And since gammaln is vectorized, that one call is fast. It costs little more time to compute 4 values than it does to compute one.
>> timeit(@() gammaln([15 20 40 30]))
ans =
2.73937416896552e-05
>> timeit(@() gammaln(40))
ans =
2.46521943333333e-05
Admittedly, if you use gammaln, you will need a call to exp at the end to recover the final result. You could do it with a single call to gamma however too. Perhaps like this:
g = gamma([(3*n+2*m+p+1)/4,n+1,m+1,p+1]);
g = g(1)/(g(2)*g(3)*g(4));
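As a quick sanity check that the two single-call forms really compute the same coefficient as the original factorial expression, here is a small sketch. The index values n, m, p are arbitrary and the variable names direct, viaLog and viaGamma are only illustrative.

```matlab
n = 4; m = 3; p = 2;                        % arbitrary example indices
k = (3*n + 2*m + p + 1) / 4;                % argument of the leading gamma

% Original form: one gamma call and three factorial calls.
direct = gamma(k) / (factorial(n) * factorial(m) * factorial(p));

% Single gammaln call; the row-times-column product forms
% log(gamma(k)) - log(n!) - log(m!) - log(p!), and exp undoes the log.
viaLog = exp(gammaln([k, n+1, m+1, p+1]) * [1 -1 -1 -1]');

% Single gamma call, combined afterwards.
g = gamma([k, n+1, m+1, p+1]);
viaGamma = g(1) / (g(2) * g(3) * g(4));

fprintf('%.15g  %.15g  %.15g\n', direct, viaLog, viaGamma);
```

The gammaln form agrees with the others only up to rounding, since it goes through a logarithm and back.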
Next, you can be more creative in the inner loop on p. Rather than a full loop, coupled with a test to ignore the combinations you don't need, why not just do this?
for p=mod(n,2):2:40
That statement will select only those values of p that would have been used anyway, so now you can drop the if statement completely.
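If you want to convince yourself that the stepped range visits exactly the same p values as the original if test, a short check along these lines should do. The names pLoop, pAll and pTest are made up for this sketch.

```matlab
nlim = 40;
for n = 0:nlim
    % p values produced by the stepped loop header
    pLoop = mod(n, 2):2:nlim;

    % p values that pass the original test rem(n+p,2)==0
    pAll = 0:nlim;
    pTest = pAll(rem(n + pAll, 2) == 0);

    % Both selections must be identical for every n.
    assert(isequal(pLoop, pTest));
end
disp('stepped range matches the if test for all n');
```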
All of the above will give you what I'll guess is about a 5x speed increase in your loops. But it still has a set of nested loops. With some effort, you might be able to improve that too.
For example, rather than computing all of those factorials (or gamma functions) many times, do it ONCE. This should work:
a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
nlim = 40;
facts = factorial(0:nlim);
gammas = gamma((0:(6*nlim+1))/4);
for n=0:nlim
for m=0:nlim
for p=mod(n,2):2:nlim
result = result + (b.^n * c.^m * d.^p) ...
.*gammas(3*n+2*m+p+1 + 1) ...
./ (facts(n+1).*facts(m+1).*facts(p+1)) ...
./ (-a)^( (3*n+2*m+p+1)/4 );
end
end
end
result=result*1/2*exp(f)
In my test on my machine, I find that your triply nested loops required 4.3 seconds to run. My version above produces the same result, yet required only 0.028418 seconds, a speedup of roughly 150 to 1, despite the triply nested loops.
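If you want to drop the loops altogether, one possible route is to build the three index grids with ndgrid and sum the masked terms in one go. This is a sketch that I have not benchmarked against the cached-table version; it trades the loops for three 41x41x41 temporary arrays, uses gammaln to form the coefficient, and the names keep, logCoef and terms are illustrative.

```matlab
pn = [-0.0250 0.0667 0.2500 -0.6000 0];
a = pn(1); b = pn(2); c = pn(3); d = pn(4); f = pn(5);
nlim = 40;

% All (n, m, p) combinations at once.
[n, m, p] = ndgrid(0:nlim);
k = 3*n + 2*m + p + 1;

% Keep only the combinations the original if statement accepted.
keep = mod(n + p, 2) == 0;

% log of gamma(k/4) / (n! m! p!), computed without overflow.
logCoef = gammaln(k/4) - gammaln(n+1) - gammaln(m+1) - gammaln(p+1);

% Powers are taken directly so that negative coefficients keep their sign.
terms = (b.^n .* c.^m .* d.^p) .* exp(logCoef) ./ (-a).^(k/4);

result = 0.5 * exp(f) * sum(terms(keep));
fprintf('area from vectorized sum: %f\n', result);
```

The terms are added in a different order than in the loops, so the last few digits may differ from the loop result.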
Well, without even making changes to your code you could install an excellent package from Tom Minka at Microsoft called lightspeed which replaces some built-in matlab functions with much faster versions. I know there's a replacement for gammaln().
You'll get nontrivial speed improvements, though I'm not sure how much, and it's straight-forward to install.

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total length of time taken up by the second example is 12 times less than the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is a factor of four, not twelve as I originally posted. Each of the functions has the additional function-wrapping overhead, while in my initial post I just summed up the individual lines.)
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
function swap(self, i1, i2)
swap1(self, i1, i2);
swap2(self, i1, i2);
swap3(self, i1, i2);
self.SwapCount = self.SwapCount + 1;
end
end
methods (Access = private)
%
% swap1: stores values in temporary doubles
% This has the best performance
%
function swap1(self, i1, i2)
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
end
%
% swap2: stores values in a temporary matrix
% Marginally slower than swap1
%
function swap2(self, i1, i2)
m = self.Data([i1, i2]);
self.Data([i2, i1]) = m;
end
%
% swap3: does not use variables for storage.
% This has the worst performance
%
function swap3(self, i1, i2)
self.Data([i1, i2]) = self.Data([i2, i1]);
end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
you could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple years old) and only saw a speed difference of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
vec(i) = vec(i)+1;
end;
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since MATLAB's execution engine was redesigned in R2015b, which changed how code is JIT-compiled, and timeit (available since R2013b) now gives more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: the performance of directly swapping 2 elements becomes increasingly poor compared to using a temporary variable as the array grows.
function benchie()
% Variables for plotting, loop to increase size of the arrays
M = 15; D = zeros(1,M); W = zeros(1,M);
for n = 1:M;
N = 2^n;
% Create some random array of length N, and random indices to swap
v = rand(N,1);
x = randi([1, N], N, 1);
y = randi([1, N], N, 1);
% Time the functions
D(n) = timeit(@()direct);
W(n) = timeit(@()withtemp);
end
% Plotting
plot(2.^(1:M), D, 2.^(1:M), W);
legend('direct', 'with temp')
xlabel('number of elements'); ylabel('time (s)')
function direct()
% Direct swapping of two elements
for k = 1:N
v([x(k) y(k)]) = v([y(k) x(k)]);
end
end
function withtemp()
% Using an intermediate temporary variable
for k = 1:N
tmp = v(y(k));
v(y(k)) = v(x(k));
v(x(k)) = tmp;
end
end
end

Resources