arrayfire evaluation of equations running really slowly - gpgpu

I have been working on a project to simulate biologically inspired neural networks using ArrayFire. I got to the point of doing some timing tests and was disappointed with the results. I decided to use one of the fastest, dirt-simple models as a timing test case: the Izhikevich model. When I ran the new test with that model, the results were worse. The code I am using is below. It is not doing anything fancy. It is just standard matrix algebra. However, it takes over 5 seconds to do a single evaluation of the equation for just 10 neurons! Every step after that takes roughly the same amount of time as well.
Code:
unsigned int neuron_count = 10;
array a = af::constant(0.02, neuron_count);
array b = af::constant(0.2, neuron_count);
array c = af::constant(-65.0, neuron_count);
array d = af::constant(6, neuron_count);
array v = af::constant(-70.0, neuron_count);
array u = af::constant(-20.0, neuron_count);
array i = af::constant(14, neuron_count);
double tau = 0.2;
void StepIzhikevich()
{
v = v + tau*(0.04*pow(v, 2) + 5 * v + 140 - u + i);
//af_print(v);
u = u + tau*a*(b*v - u);
//Leaving off spike threshold checks for now
}
void TestIzhikevich()
{
StepIzhikevich();
timer::start();
StepIzhikevich();
printf("elapsed seconds: %g\n", timer::stop());
}
Here are the timing results for different numbers of neurons.
results:
neurons seconds
10 5.18275
100 5.27969
1000 5.20637
10000 4.86609
Increasing the number of neurons does not appear to have a huge effect. The time goes down a little. Am I doing something wrong here? Is there a better way to optimize things with arrayfire to get better results?
When I switched the v equation to use v*v instead of pow(v, 2), the time required for a step went down to 3.75762 seconds. That is still extremely slow though, so something odd is happening.
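For sanity-checking the numbers that come back from the GPU, a plain CPU version of the same update can serve as a reference. The following is only a minimal sketch using std::vector: the struct name Izhikevich and the helper make_network are made up for illustration, and a and b are kept as scalars rather than arrays because every neuron in the test uses the same value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model state and parameters, mirroring the arrays in the question.
// a and b are scalars here (an assumption: all neurons share one value).
struct Izhikevich {
    std::vector<double> v, u, i;  // membrane potential, recovery variable, input current
    double a = 0.02, b = 0.2, tau = 0.2;
};

// Hypothetical helper: build a network with the question's initial values.
Izhikevich make_network(std::size_t count) {
    Izhikevich s;
    s.v.assign(count, -70.0);
    s.u.assign(count, -20.0);
    s.i.assign(count, 14.0);
    return s;
}

// Advance every neuron by one step of size tau.
// The spike threshold check is left out, as in the question.
void step(Izhikevich& s) {
    assert(s.v.size() == s.u.size() && s.v.size() == s.i.size());
    for (std::size_t n = 0; n < s.v.size(); ++n) {
        s.v[n] += s.tau * (0.04 * s.v[n] * s.v[n] + 5.0 * s.v[n] + 140.0 - s.u[n] + s.i[n]);
        s.u[n] += s.tau * s.a * (s.b * s.v[n] - s.u[n]);  // uses the updated v, like the original
    }
}
```

With the initial values from the question, one step should move v from -70 to -66 and u from -20 to about -19.9728, which gives a quick check that the GPU version computes the same thing.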
[EDIT]
I tried to split the processing up into pieces and found something new. Here is the code I am using now.
Code:
unsigned int neuron_count = 10;
array a = af::constant(0.02, neuron_count);
array b = af::constant(0.2, neuron_count);
array c = af::constant(-65.0, neuron_count);
array d = af::constant(6, neuron_count);
array v = af::constant(-70.0, neuron_count);
array u = af::constant(-20.0, neuron_count);
array i = af::constant(14, neuron_count);
array g = af::constant(0.0, neuron_count);
double tau = 0.2;
void StepIzhikevich()
{
array j = tau*(0.04*pow(v, 2));
//af_print(j);
array k = 5 * v + 140 - u + i;
//af_print(k);
array l = v + j + k;
//af_print(l);
v = l; //If this line is here time is long on second loop
//g = l; //If this is here then time is short.
//u = u + tau*a*(b*v - u);
//Leaving off spike threshold checks for now
}
void TestIzhikevich()
{
timer::start();
StepIzhikevich();
printf("elapsed seconds: %g\n", timer::stop());
timer::start();
StepIzhikevich();
printf("elapsed seconds: %g\n", timer::stop());
}
When I run it without reassigning back to v, or when I assign the result to a new variable g instead, the step time on both the first and second runs is small.
results:
elapsed seconds: 0.0036143
elapsed seconds: 0.00340621
However, when I put v = l; back in, then the first time it runs it is fast, but from then on it is slow.
results:
elapsed seconds: 0.0034497
elapsed seconds: 2.98624
Any ideas on what is causing this?
[EDIT 2]
I still do not know why it is doing this, but I have found a workaround by copying the v array before using it again.
Code:
unsigned int neuron_count = 100000;
array v = af::constant(-70.0, neuron_count);
array u = af::constant(-20.0, neuron_count);
array i = af::constant(14, neuron_count);
double tau = 0.2;
void StepIzhikevich()
{
//array vp = v;
array vp = v.copy();
//af_print(vp);
array j = tau*(0.04*pow(vp, 2));
//af_print(j);
array k = 5 * vp + 140 - u + i;
//af_print(k);
array l = vp + j + k;
//af_print(l);
v = l; //If this line is here time is long on second loop
}
void TestIzhikevich()
{
for (int i = 0; i < 10; i++)
{
timer::start();
StepIzhikevich();
printf("loop: %d ", i);
printf("elapsed seconds: %g\n", timer::stop());
timer::start();
}
}
Here are the results now. The second time it runs it is a bit slow, but now it is fast after that. Huge improvement over before.
Results:
loop: 0 elapsed seconds: 0.657355
loop: 1 elapsed seconds: 0.981287
loop: 2 elapsed seconds: 0.000416182
loop: 3 elapsed seconds: 0.000415045
loop: 4 elapsed seconds: 0.000421014
loop: 5 elapsed seconds: 0.000413339
loop: 6 elapsed seconds: 0.00041675
loop: 7 elapsed seconds: 0.000412202
loop: 8 elapsed seconds: 0.000473321
loop: 9 elapsed seconds: 0.000677432

Related

Vectorized code slower than loops? MATLAB

In the problem I'm working on there is a piece of code like the one shown below. The definition part is just to show you the sizes of the arrays. Below the loop version I pasted a vectorized version, and it is more than 2x slower. Why does this happen? I know that it can happen if vectorization requires large temporary variables, but (it seems) that is not the case here.
And generally, what (other than parfor, which I already use) can I do to speed up this code?
maxN = 100;
levels = maxN+1;
xElements = 101;
umn = complex(zeros(levels, levels));
umn2 = umn;
bessels = ones(xElements, xElements, levels); % 1.09 GB
posMcontainer = ones(xElements, xElements, maxN);
tic
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
nn = n + 1;
mm = 1;
for m = 1 : 2 : n
umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
mm = mm + 1;
end
end
end
end
toc % 0.520594 seconds
tic
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
end
end
toc % 1.275926 seconds
sum(sum(umn-umn2)) % verifying that both versions give the same result
Best regards,
Alex
From the profiler: [screenshot not preserved]
Edit:
In reply to @Jason's answer, this alternative takes the same time:
for n = 1:2:maxN
nn(n) = n + 1;
numOfEl(n) = ceil(n/2);
end
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
umn2(nn(n), 1:numOfEl(n)) = bessels(i, j, nn(n)) * posMcontainer(i, j, 1:2:n);
end
end
end
Edit2:
In reply to @EBH:
The point is to do the following:
parfor i = 1 : xElements
for j = 1 : xElements
umn = complex(zeros(levels, levels)); % cleaning
for n = 0:maxN
mm = 1;
for m = -n:2:n
nn = n + 1; % for indexing
if m < 0
umn(nn, mm) = bessels(i, j, nn) * negMcontainer(i, j, abs(m));
end
if m > 0
umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
if m == 0
umn(nn, mm) = bessels(i, j, nn);
end
mm = mm + 1; % for indexing
end % m
end % n
beta1 = sum(sum(Aj1.*umn));
betaSumSq1(i, j) = abs(beta1).^2;
beta2 = sum(sum(Aj2.*umn));
betaSumSq2(i, j) = abs(beta2).^2;
end % j
end % i
I sped it up as much as I was able to. What you have written takes only the last bessels and posMcontainer values, so it does not produce the same result. In the real code, those two containers are filled not with 1, but with some precalculated values.
After your edit, I can see that umn is just a temporary variable for another calculation. It can still be mostly vectorized:
betaSumSq1 = zeros(xElements); % preallocating
betaSumSq2 = zeros(xElements); % preallocating
% an index matrix to fetch the right values from negMcontainer and
% posMcontainer:
indmat = tril(repmat([0 1;1 0],ceil((maxN+1)/2),floor(levels/2)));
indmat(end,:) = [];
% an index matrix to fetch the values in correct order for umn:
b_ind = repmat([1;0],ceil((maxN+1)/2),1);
b_ind(end) = [];
tempind = logical([fliplr(indmat) b_ind indmat+triu(ones(size(indmat)))]);
% permute the arrays to prevent squeeze:
PM = permute(posMcontainer,[3 1 2]);
NM = permute(negMcontainer,[3 1 2]);
B = permute(bessels,[3 1 2]);
for k = 1 : maxN+1 % third dim
for jj = 1 : xElements % columns
b = B(:,jj,k); % get one vector of B
% perform b*NM for every row of NM*indmat, then flip the result:
neg = fliplr(bsxfun(@times,bsxfun(@times,indmat,NM(:,jj,k).'),b));
% perform b*PM for every row of PM*indmat:
pos = bsxfun(@times,bsxfun(@times,indmat,PM(:,jj,k).'),b);
temp = [neg mod(1:levels,2).'.*b pos].'; % concat neg and pos
% assign them to the right place in umn:
umn = reshape(temp(tempind.'),[levels levels]).';
beta1 = Aj1.*umn;
betaSumSq1(jj,k) = abs(sum(beta1(:))).^2;
beta2 = Aj2.*umn;
betaSumSq2(jj,k) = abs(sum(beta2(:))).^2;
end
end
This reduces the running time from ~95 seconds to less than 3 seconds (both without parfor), an improvement of almost 97%.
I would suspect it is memory allocation. You are re-allocating the m array inside a 3-deep loop.
Try rearranging the code:
tic
for n = 1 : 2 : maxN
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
for j = 1 : xElements
for i = 1 : xElements
umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
end
end
toc % 1.275926 seconds
I was trying this in Igor Pro, which is a similar language but with different optimizations. So the direct translations don't time the same way as in MATLAB (vectorized was slightly faster in Igor). But reordering the loops did speed up the vectorized form.
In your second part of the code, that is setting umn2, inside the loops, you have:
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
Those 3 lines don't require any input from the i and j loops; they only use the n loop. So reordering the loops such that i and j are inside the n loop means that those 3 lines are executed xElements^2 (101^2) times less often. I suspect it is the m = 1:2:n line that takes the time, since it allocates an array.

octave is slow; suggestions

I have run the following code in both Octave 4.0.0 and MATLAB 2014. The time difference is silly, i.e. more than two orders of magnitude. This is on a Windows laptop. What can be done to improve Octave's computational speed?
startTime = cputime;
iter = 1; % iter is the current iteration of the loop
itSum = 0; % itSum is the sum of the iterations
stopCrit = sqrt(275); % stopCrit is the stopping criteria for the while loop
while itSum < stopCrit
itSum = itSum + 1/iter;
iter = iter + 1;
if iter > 1e7, break, end
end
iter-1
totTime = cputime - startTime
Octave: totTime ~ 112
MATLAB: totTime < 0.4
It takes a lot of iterations of the loop to compute the result in your code. Vectorizing the code will speed it up a lot. The following code does what yours does, but vectorizes most of the computation. See if it helps.
startTime = cputime;
iter = 1; % iter is the current iteration of the loop
itSum = 0; % itSum is the sum of the iterations
stopCrit = sqrt(275); % stopCrit is the stopping criteria for the while loop
step=1000;
while(itSum < stopCrit && iter <= 1e7)
itSum=itSum+sum(1./(iter:iter+step));
iter = iter + step+ 1;
end
iter=iter-step-1;
itSum=sum(1./(1:iter));
for i=(iter+1):(iter+step)
itSum=itSum+1/i;
if(itSum+1/i>stopCrit)
iter=i-1;
break;
end
end
totTime = cputime - startTime
My runtime is only about 0.6 seconds using the above code. If you do not care about exactly when the loop stops, the following code is even faster:
startTime = cputime;
iter = 1; % iter is the current iteration of the loop
itSum = 0; % itSum is the sum of the iterations
stopCrit = sqrt(275); % stopCrit is the stopping criteria for the while loop
step=1000;
while(itSum < stopCrit && iter <= 1e7)
itSum=itSum+sum(1./(iter:iter+step));
iter = iter + step+ 1;
end
iter=iter-step-1;
totTime = cputime - startTime
My runtime is only about 0.35 seconds in the latter case.
You can also try:
itSum = sum(1./(1:exp(stopCrit)));
%start the iteration
iter = exp(stopCrit-((stopCrit-itSum)/abs(stopCrit-itSum))*(stopCrit-itSum));
itSum = sum(1./(1:iter))
With this method you will only have 1 or 2 iterations. But of course you sum the whole array each time.
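The idea of jumping close to the stopping point analytically can be made explicit with the standard approximation H_n ≈ ln(n) + gamma for the harmonic sum, where gamma is the Euler-Mascheroni constant. The C++ sketch below is an illustration rather than a drop-in replacement for the Octave code: the function names terms_needed and estimated_terms are made up, and the 1e7 safety cap from the question is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Number of terms of 1 + 1/2 + 1/3 + ... that must be added before the
// running sum reaches `target` (the question's loop, without the 1e7 cap).
std::int64_t terms_needed(double target) {
    assert(target > 0.0);
    double sum = 0.0;
    std::int64_t n = 0;
    while (sum < target) {
        ++n;
        sum += 1.0 / static_cast<double>(n);
    }
    return n;
}

// Closed-form estimate from H_n ~ ln(n) + gamma, solved for n.
double estimated_terms(double target) {
    const double gamma = 0.5772156649015329;  // Euler-Mascheroni constant
    return std::exp(target - gamma);
}
```

For target = sqrt(275) the estimate is roughly 8.9 million terms, which is below the 1e7 cap in the question, so the original loop ends because of the stopping criterion and not because of the cap. The estimate should land within a term or two of the exact loop count, so it can be used either as the answer itself or as a starting point for a short correction loop.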

How can I vectorize these nested for-loops in Matlab?

I have a piece of code here I need to streamline as it is greatly increasing the runtime of my script:
size=300;
resultLength = (size+1)^3;
freqResult=zeros(1, resultLength);
inc=1;
for i=0:size,
for j=0:size,
for k=0:size,
freqResult(inc)=(c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
inc=inc+1;
end
end
end
c, L, W, and H are all constants. As the size input gets over about 400, the runtime is too long to wait for, and I can watch my disk space draining by the gigabyte. Any advice?
Thanks!
What about this:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
for indx = 1:numel(iT)
i = iT(indx) - 1;
j = jT(indx) - 1;
k = kT(indx) - 1;
freqResult1(indx) = (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
end
On my PC, for size = 400, version with 3 loops takes 136s and this one takes 19s.
For a more "MATLAB-y" way, you could also do the following:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
func = @(i, j, k) (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
freqResult2 = arrayfun(func, iT-1, jT-1, kT-1);
But for some reason, this is slower than the above version.
A faster solution can be (based on Marcin's answer):
[k, j, i] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
freqResult = (c/2)*sqrt(((i-1)/L).^2+((j-1)/W).^2+((k-1)/H).^2);
It takes about 5 seconds to run on my PC for size = 300
The following is even faster (but it doesn't look very good):
k = repmat(0:size,[1 (size+1)^2]);
j = repmat(kron(0:size, ones(1,size+1)),[1 (size+1)]);
i = kron(0:size, ones(1,(size+1)^2));
freqResult = (c/2)*sqrt((i/L).^2+(j/W).^2+(k/H).^2);
which takes ~3.5s for size = 300
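On the memory side: for size = 400 the result has 401^3 = 64,481,201 elements, which is about 0.5 GB as doubles, and the vectorized versions above also hold full-length i, j and k arrays, so around 2 GB in total. If the full table is not needed at once, each entry can be computed from its position on demand. The C++ sketch below shows that index arithmetic; the struct Params and the function mode_frequency are names invented for this example, and the values of c, L, W and H are whatever the caller supplies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// The four constants from the question, supplied by the caller.
struct Params { double c, L, W, H; };

// Value stored at 0-based position `idx` of freqResult when i, j, k each run
// over 0..size in the question's loop order (i outermost, k innermost).
double mode_frequency(std::size_t idx, std::size_t size, const Params& p) {
    const std::size_t n = size + 1;
    assert(idx < n * n * n);
    const double k = static_cast<double>(idx % n);        // fastest-changing index
    const double j = static_cast<double>((idx / n) % n);
    const double i = static_cast<double>(idx / (n * n));  // slowest-changing index
    return (p.c / 2.0) * std::sqrt((i / p.L) * (i / p.L) +
                                   (j / p.W) * (j / p.W) +
                                   (k / p.H) * (k / p.H));
}
```

Note that MATLAB's inc starts at 1, so freqResult(inc) corresponds to idx = inc - 1 here.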

Making a more efficient monte carlo simulation

So, I've written this code that should estimate the area under the curve of the function defined as h(x). My problem is that I need to be able to estimate the area to within 6 decimal places, but the algorithm I've defined in estimateN seems to be too heavy for my machine. Essentially the question is: how can I make the following code more efficient? Is there a way I can get rid of that loop?
h = function(x) {
return(1+(x^9)+(x^3))
}
estimateN = function(n) {
count = 0
k = 1
xpoints = runif(n, 0, 1)
ypoints = runif(n, 0, 3)
while(k <= n){
if(ypoints[k]<=h(xpoints[k]))
count = count+1
k = k+1
}
#because of the range that im using for y
return(3*(count/n))
}
#uses the fact that err<=1/sqrt(n) to determine size of dataset
estimate_to = function(i) {
n = (10^i)^2
print(paste(n, " repetitions: ", estimateN(n)))
}
estimate_to(6)
Replace this code:
count = 0
k = 1
while(k <= n){
if(ypoints[k]<=h(xpoints[k]))
count = count+1
k = k+1
}
With this line:
count <- sum(ypoints <= h(xpoints))
If it's truly efficiency you're striving for, integrate is several orders of magnitude faster (not to mention more memory efficient) for this problem.
integrate(h, 0, 1)
# 1.35 with absolute error < 1.5e-14
microbenchmark(integrate(h, 0, 1), estimate_to(3), times=10)
# Unit: microseconds
#                expr        min         lq     median         uq        max neval
#  integrate(h, 0, 1)     14.456     17.769     42.918     54.514     83.125    10
#      estimate_to(3) 151980.781 159830.956 162290.668 167197.742 174881.066    10
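It is also worth looking at why estimate_to(6) is too heavy: it sets n = (10^6)^2 = 10^12, and two numeric vectors of that length would need about 8 TB each. If a Monte Carlo estimate is still wanted, drawing the points one at a time keeps memory constant. The C++ sketch below illustrates that streaming approach; the function name estimate_area and the choice of std::mt19937_64 as the generator are assumptions made for this example, not part of the original R code. For reference, the exact area is 1 + 1/10 + 1/4 = 1.35.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Integrand from the question: h(x) = 1 + x^9 + x^3 on [0, 1].
double h(double x) {
    const double x3 = x * x * x;
    return 1.0 + x3 * x3 * x3 + x3;
}

// Hit-or-miss estimate of the area under h on [0, 1], sampling the box
// [0, 1] x [0, 3]. Points are drawn one at a time, so memory use does not
// grow with n.
double estimate_area(std::uint64_t n, std::uint64_t seed) {
    assert(n > 0);
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> ux(0.0, 1.0), uy(0.0, 3.0);
    std::uint64_t hits = 0;
    for (std::uint64_t k = 0; k < n; ++k) {
        const double x = ux(rng);
        const double y = uy(rng);
        if (y <= h(x)) ++hits;
    }
    // Scale the hit fraction by the area of the sampling box (1 * 3).
    return 3.0 * static_cast<double>(hits) / static_cast<double>(n);
}
```

Even in streaming form, 10^12 samples is a lot of work for a one-dimensional integral, which is why a quadrature routine like integrate is the better tool here.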

Purposefully Slow MATLAB Function?

I want to write a really, really, slow program for MATLAB. I'm talking like, O(2^n) or worse. It has to finish, and it has to be deterministically slow, so no "if rand() = 123,123, exit!" This sounds crazy, but it's actually for a distributed systems test. I need to create a .m file, compile it (with MCC), and then run it on my distributed system to perform some debugging operations.
The program must constantly be doing work, so sleep() is not a valid option.
I tried making a random large matrix and finding its inverse, but this was completing too quickly. Any ideas?
This naive implementation of the Discrete Fourier Transform takes ~ 9 seconds for a 2048 long input vector x on my 1.86 GHz single core machine. Going to 4096 inputs extends the time to ~ 35 seconds, close to the 4x I would expect for O(N^2). I don't have the patience to try longer inputs :)
function y = SlowDFT(x)
t = cputime;
y = zeros(size(x));
for c1=1:length(x)
for c2=1:length(x)
y(c1) = y(c1) + x(c2)*(cos((c1-1)*(c2-1)*2*pi/length(x)) - ...
1j*sin((c1-1)*(c2-1)*2*pi/length(x)));
end
end
disp(cputime-t);
EDIT: Or if you're looking to stress memory more than CPU:
function y = SlowDFT_MemLookup(x)
t = cputime;
y = zeros(size(x));
cosbuf = cos((0:1:(length(x)-1))*2*pi/length(x));
for c1=1:length(x)
cosctr = 1;
sinctr = round(3*length(x)/4)+1;
for c2=1:length(x)
y(c1) = y(c1) + x(c2)*(cosbuf(cosctr) ...
-1j*cosbuf(sinctr));
cosctr = cosctr + (c1-1);
if cosctr > length(x), cosctr = cosctr - length(x); end
sinctr = sinctr + (c1-1);
if sinctr > length(x), sinctr = sinctr - length(x); end
end
end
disp(cputime-t);
This is faster than calculating sin and cos on each iteration. A 2048 long input took ~ 3 seconds, and a 16384 long input took ~ 180 seconds.
Count to 2^n. Optionally, make a slow function call in each iteration.
If you want real work that's easy to set up and stresses CPU way over memory:
Large dense matrix inversion (not slow enough? make it bigger.)
Factor an RSA number
How about using inv? It has been reported to be quite slow.
Do some work in a loop. You can tune the time it takes to complete using the number of loop iterations.
I don't speak MATLAB but something equivalent to the following might work.
loops = 0
counter = 0
while (loops < MAX_INT) {
counter = counter + 1;
if (counter == MAX_INT) {
loops = loops + 1;
counter = 0;
}
}
This will iterate MAX_INT*MAX_INT times. You can put some computationally heavy thing in the loop for it to take longer if this is not enough.
Easy! Go back to your Turing machine roots and think of processes that are O(2^n) or worse.
Here's a fairly simple one (warning, untested but you get the point)
N = 12; radix = 10;
odometer = zeros(N, 1);
done = false;
while (~done)
done = true;
for i = 1:N
odometer(i) = odometer(i) + 1;
if (odometer(i) >= radix)
odometer(i) = 0;
else
done = false;
break;
end
end
end
Even better, how about calculating Fibonacci numbers recursively? Runtime is O(2^N), since fib(N) has to make two function calls fib(N-1) and fib(N-2), but stack depth is O(N), since only one of those function calls happens at a time.
function y = fib(n)
if (n <= 1)
y = 1;
else
y = fib(n-1) + fib(n-2);
end
end
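To see how fast the work grows, one can count the calls instead of timing them. The C++ sketch below mirrors the MATLAB fib above (same base case, fib(0) = fib(1) = 1) and adds a call counter; the counter parameter and the helper fib_calls are additions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Same recursion as the MATLAB fib above (fib(0) = fib(1) = 1),
// with a counter for the number of calls made.
std::uint64_t fib(unsigned n, std::uint64_t& calls) {
    ++calls;
    if (n <= 1) return 1;
    return fib(n - 1, calls) + fib(n - 2, calls);
}

// Total number of calls triggered by evaluating fib(n).
std::uint64_t fib_calls(unsigned n) {
    assert(n <= 40);  // keeps the demonstration itself from running for a long time
    std::uint64_t calls = 0;
    fib(n, calls);
    return calls;
}
```

The call count works out to 2*fib(n) - 1, so each increment of n multiplies the work by roughly 1.6. That is somewhat below the 2^N upper bound but still exponential, so a modest change in N is enough to tune the runtime from seconds to hours.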
You could ask it to factor(X) for a suitably large X
You could also test if a given input is prime by just dividing it by all smaller numbers. That is O(n) divisions for a single input n, or O(n^2) if you test every number up to n.
Try this one:
tic
isprime( primes(99999999) );
toc
EDIT:
For a more extensive set of tests, use these benchmarks (perhaps for multiple repetitions even):
disp(repmat('-',1,85))
disp(['MATLAB Version ' version])
disp(['Operating System: ' system_dependent('getos')])
disp(['Java VM Version: ' version('-java')]);
disp(['Date: ' date])
disp(repmat('-',1,85))
N = 3000; % matrix size
A = rand(N,N);
A = A*A;
tic; A*A; t=toc;
fprintf('A*A \t\t\t%f sec\n', t)
tic; [L,U,P] = lu(A); t=toc; clear L U P
fprintf('LU(A)\t\t\t%f sec\n', t)
tic; inv(A); t=toc;
fprintf('INV(A)\t\t\t%f sec\n', t)
tic; [U,S,V] = svd(A); t=toc; clear U S V
fprintf('SVD(A)\t\t\t%f sec\n', t)
tic; [Q,R,P] = qr(A); t=toc; clear Q R P
fprintf('QR(A)\t\t\t%f sec\n', t)
tic; [V,D] = eig(A); t=toc; clear V D
fprintf('EIG(A)\t\t\t%f sec\n', t)
tic; det(A); t=toc;
fprintf('DET(A)\t\t\t%f sec\n', t)
tic; rank(A); t=toc;
fprintf('RANK(A)\t\t\t%f sec\n', t)
tic; cond(A); t=toc;
fprintf('COND(A)\t\t\t%f sec\n', t)
tic; sqrtm(A); t=toc;
fprintf('SQRTM(A)\t\t%f sec\n', t)
tic; fft(A(:)); t=toc;
fprintf('FFT\t\t\t%f sec\n', t)
tic; isprime(primes(10^7)); t=toc;
fprintf('Primes\t\t\t%f sec\n', t)
The following are the results on my machine using N=1000 for one iteration only (note that primes uses 10^7 as its upper bound, NOT 10^8, which takes far longer):
A*A       0.178329 sec
LU(A)     0.118864 sec
INV(A)    0.319275 sec
SVD(A)   15.236875 sec
QR(A)     0.841982 sec
EIG(A)    3.967812 sec
DET(A)    0.121882 sec
RANK(A)   1.813042 sec
COND(A)   1.809365 sec
SQRTM(A) 22.750331 sec
FFT       0.113233 sec
Primes   27.080918 sec
This will keep the CPU at 100% for WANTED_TIME seconds:
WANTED_TIME = 2^n; % seconds
t0=cputime;
t=cputime;
while (t-t0 < WANTED_TIME)
t=cputime;
end;
