How to improve the performance of this piece of code? - wolfram-mathematica

I'm trying to learn a bit of Julia. After reading the manual for several hours, I wrote the following piece of code:
ie = 200;
ez = zeros(ie + 1);
hy = zeros(ie);
fdtd1d (steps)=
for n in 1:steps
for i in 2:ie
ez[i]+= (hy[i] - hy[i-1])
end
ez[1]= sin(n/10)
for i in 1:ie
hy[i]+= (ez[i+1]- ez[i])
end
end
@time fdtd1d(10000);
elapsed time: 2.283153795 seconds (239659044 bytes allocated)
I believe it's under-optimized, because it's much slower than the corresponding Mathematica version:
ie = 200;
ez = ConstantArray[0., {ie + 1}];
hy = ConstantArray[0., {ie}];
fdtd1d = Compile[{{steps}},
Module[{ie = ie, ez = ez, hy = hy},
Do[ez[[2 ;; ie]] += (hy[[2 ;; ie]] - hy[[1 ;; ie - 1]]);
ez[[1]] = Sin[n/10];
hy[[1 ;; ie]] += (ez[[2 ;; ie + 1]] - ez[[1 ;; ie]]), {n,
steps}]; Sow@ez; Sow@hy]];
result = fdtd1d[10000]; // AbsoluteTiming
{0.1280000, Null}
So, how to make the Julia version of fdtd1d faster?

Two things:
The first time you run the function, the time will include the compile time of the code. If you want an apples-to-apples comparison with a compiled function in Mathematica, you should run the function twice and time the second run. With your code I get:
elapsed time: 1.156531976 seconds (447764964 bytes allocated)
for the first run which includes the compile time and
elapsed time: 1.135681299 seconds (447520048 bytes allocated)
for the second run when you don't need to compile.
The second thing, and arguably the bigger one, is that you should avoid global variables in performance-critical code. This is the first tip in the performance tips section of the manual.
Here is the same code using local variables:
function fdtd1d_local(steps, ie = 200)
ez = zeros(ie + 1);
hy = zeros(ie);
for n in 1:steps
for i in 2:ie
ez[i]+= (hy[i] - hy[i-1])
end
ez[1]= sin(n/10)
for i in 1:ie
hy[i]+= (ez[i+1]- ez[i])
end
end
return (ez, hy)
end
fdtd1d_local(10000)
@time fdtd1d_local(10000);
For comparison, your Mathematica code on my machine gives
{0.094005, Null}
while the result from @time for fdtd1d_local is:
elapsed time: 0.015188926 seconds (4176 bytes allocated)
That is about 6 times faster. Global variables make a big difference.

I believe in using a limited number of loops, and in using loops only when required. Expressions can often be used in place of loops. It is not possible to avoid every loop, but reducing their number can help.
In the program above I optimized a bit by using array expressions, which cut the time almost in half.
ORIGINAL CODE :
ie = 200;
ez = zeros(ie + 1);
hy = zeros(ie);
fdtd1d (steps)=
for n in 1:steps
for i in 2:ie
ez[i]+= (hy[i] - hy[i-1])
end
ez[1]= sin(n/10)
for i in 1:ie
hy[i]+= (ez[i+1]- ez[i])
end
end
@time fdtd1d(10000);
The output is
julia>
elapsed time: 1.845615295 seconds (239687888 bytes allocated)
OPTIMIZED CODE:
ie = 200;
ez = zeros(ie + 1);
hy = zeros(ie);
fdtd1d (steps)=
for n in 1:steps
ez[2:ie] = ez[2:ie]+hy[2:ie]-hy[1:ie-1];
ez[1]= sin(n/10);
hy[1:ie] = hy[1:ie]+ez[2:end]- ez[1:end-1]
end
@time fdtd1d(10000);
OUTPUT
julia>
elapsed time: 0.93926323 seconds (206977748 bytes allocated)

Related

Julia: why doesn't shared memory multi-threading give me a speedup?

I want to use shared memory multi-threading in Julia. As done by the Threads.@threads macro, I can use ccall(:jl_threading_run ...) to do this. And whilst my code now runs in parallel, I don't get the speedup I expected.
The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having: [EDIT: See later for even more minimal example]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
work_idx = 1
my_result = results[Threads.threadid()]
while work_idx > 0
my_result += objects[work_idx]
work_idx += nthreads
if work_idx > test_size
break
end
counts[Threads.threadid()] += 1
end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multi-threading slows things down! Why?
EDIT: A better minimal example can be created using the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
a[Threads.threadid()] += b[i]
calls[Threads.threadid()] += 1
end
I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.
The problem you have is most probably false sharing.
You can solve it by separating the areas you write to far enough like this (here is a "quick and dirty" implementation to show the essence of the change):
julia> function f(spacing)
test_size = 1000000
a = zeros(Threads.nthreads()*spacing)
b = rand(test_size)
calls = zeros(Threads.nthreads()*spacing)
Threads.@threads for i = 1 : test_size
@inbounds begin
a[Threads.threadid()*spacing] += b[i]
calls[Threads.threadid()*spacing] += 1
end
end
a, calls
end
f (generic function with 1 method)
julia> @btime f(1);
41.525 ms (35 allocations: 7.63 MiB)
julia> @btime f(8);
2.189 ms (35 allocations: 7.63 MiB)
or doing per-thread accumulation on a local variable like this (this is a preferred approach as it should be uniformly faster):
function getrange(n)
tid = Threads.threadid()
nt = Threads.nthreads()
d , r = divrem(n, nt)
from = (tid - 1) * d + min(r, tid - 1) + 1
to = from + d - 1 + (tid ≤ r ? 1 : 0)
from:to
end
function f()
test_size = 10^8
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
Threads.@threads for k = 1 : Threads.nthreads()
local_a = 0.0
local_c = 0.0
for i in getrange(test_size)
for j in 1:10
local_a += b[i]
local_c += 1
end
end
a[Threads.threadid()] = local_a
calls[Threads.threadid()] = local_c
end
a, calls
end
Also note that you are probably using 4 threads on a machine with 2 physical cores (and only 4 virtual cores), so the gains from threading will not be linear.
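The index arithmetic in getrange is easy to check outside Julia. Here is a minimal Python sketch (the names chunk_range and partial_sums are mine, not from the answer) that reproduces the same 1-based split and accumulates each chunk in a local variable before storing it. It runs the chunks one after another, so it only illustrates the partitioning and the local-accumulator pattern, not the threading itself:

```python
def chunk_range(n, tid, nt):
    """Return the 1-based inclusive (start, stop) chunk for worker tid in 1..nt."""
    d, r = divmod(n, nt)
    # The first r workers each take one extra element.
    start = (tid - 1) * d + min(r, tid - 1) + 1
    stop = start + d - 1 + (1 if tid <= r else 0)
    return start, stop


def partial_sums(values, nt):
    """Sum each chunk into a local accumulator, then store it once per worker."""
    totals = []
    for tid in range(1, nt + 1):
        start, stop = chunk_range(len(values), tid, nt)
        local = 0.0  # private to this chunk, so no shared slot is hit per element
        for i in range(start - 1, stop):  # convert to 0-based indexing
            local += values[i]
        totals.append(local)
    return totals


if __name__ == "__main__":
    print([chunk_range(10, tid, 4) for tid in range(1, 5)])
    print(partial_sums(list(range(1, 11)), 4))
```

With 10 elements and 4 workers the chunks come out as 1..3, 4..6, 7..8 and 9..10, so every index is covered exactly once.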

Memory allocation in a fixed point algorithm

I need to find the fixed point of a function f. The algorithm is very simple:
1. Given X, compute f(X).
2. If ||X-f(X)|| is lower than a certain tolerance, exit and return X;
3. otherwise set X equal to f(X) and go back to 1.
I'd like to be sure I'm not allocating memory for a new object at every iteration.
For now, the algorithm looks like this:
iter1 = function(x::Vector{Float64})
for iter in 1:max_it
oldx = copy(x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
Here g1(x) is a function that sets x to f(x).
But it seems this loop allocates a new vector at every iteration (see below).
Another way to write the algorithm is the following:
iter2 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
(oldx, x) = (x, oldx)
g2(x, oldx)
delta = vnormdiff(oldx, x, 2)
if delta < tolerance
break
end
end
end
where g2(x1, x2) is a function that sets x1 to f(x2).
Is this the most efficient and natural way to write this kind of iteration problem?
Edit1: timing shows that the second code is faster:
using NumericExtensions
max_it = 1000
tolerance = 1e-8
max_it = 100
g1 = function(x::Vector{Float64})
for i in 1:length(x)
x[i] = x[i]/2
end
end
g2 = function(newx::Vector{Float64}, x::Vector{Float64})
for i in 1:length(x)
newx[i] = x[i]/2
end
end
x = fill(1e7, int(1e7))
@time iter1(x)
# elapsed time: 4.688103075 seconds (4960117840 bytes allocated, 29.72% gc time)
x = fill(1e7, int(1e7))
@time iter2(x)
# elapsed time: 2.187916177 seconds (80199676 bytes allocated, 0.74% gc time)
Edit2: using copy!
iter3 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
copy!(oldx, x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
x = fill(1e7, int(1e7))
@time iter3(x)
# elapsed time: 2.745350176 seconds (80008088 bytes allocated, 1.11% gc time)
I think replacing the following lines in the first code
for iter = 1:max_it
oldx = copy( x )
...
by
oldx = zeros( N )
for iter = 1:max_it
oldx[:] = x # or copy!( oldx, x )
...
will be more efficient because no array is allocated. Also, the code can be made more efficient by writing for-loops explicitly. This can be seen, for example, from the following comparison
function test()
N = 1000000
a = zeros( N )
b = zeros( N )
@time c = copy( a )
@time b[:] = a
@time copy!( b, a )
@time \
for i = 1:length(a)
b[i] = a[i]
end
@time \
for i in eachindex(a)
b[i] = a[i]
end
end
test()
The result obtained with Julia 0.4.0 on Linux (x86_64) is
elapsed time: 0.003955609 seconds (7 MB allocated)
elapsed time: 0.001279142 seconds (0 bytes allocated)
elapsed time: 0.000836167 seconds (0 bytes allocated)
elapsed time: 1.19e-7 seconds (0 bytes allocated)
elapsed time: 1.28e-7 seconds (0 bytes allocated)
It seems that copy!() is faster than using [:] on the left-hand side,
though the difference becomes marginal in repeated calculations (there seems to be
some overhead for the first [:] calculation). By the way, the last example using eachindex() is very convenient for looping over multi-dimensional arrays.
A similar comparison can be made for vnormdiff(), where using norm( x - oldx ) etc. is slower than an explicit loop for the vector norm, because the former allocates a temporary array for x - oldx.
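To make the buffer-swapping idea concrete, here is a rough Python sketch of the iter2 pattern, assuming f halves every element as in the timing example. The helper names (halve_into, norm_diff, fixed_point) are made up for illustration. Two lists are allocated once and swapped each iteration, and the norm of the difference is computed with an explicit loop so that no temporary for x - oldx is created:

```python
import math


def halve_into(dst, src):
    """Write f(src) into dst in place; here f halves every element."""
    for i in range(len(src)):
        dst[i] = src[i] / 2


def norm_diff(a, b):
    """Euclidean norm of a - b, computed without building a temporary list."""
    total = 0.0
    for ai, bi in zip(a, b):
        total += (ai - bi) ** 2
    return math.sqrt(total)


def fixed_point(step_into, x, tolerance=1e-8, max_it=100):
    """Iterate x <- f(x) with two preallocated buffers that swap roles."""
    new = x
    old = [0.0] * len(x)  # the only extra allocation, done once
    for it in range(1, max_it + 1):
        old, new = new, old      # swap references, no copying
        step_into(new, old)      # new now holds f(old)
        if norm_diff(new, old) < tolerance:
            return new, it
    return new, max_it


if __name__ == "__main__":
    result, iterations = fixed_point(halve_into, [1.0, 2.0])
    print(result, iterations)
```

Note that the list passed in is reused as one of the two buffers, so the caller should use the returned list rather than the original.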

Matlab is slow when using user defined function with calculation in GPU

When I run the code shown below, the tic/toc pair inside the function shows that it takes a very short time (<< 1 sec) to go through all the lines. However, it actually takes around 2.3 secs to get the outputs! I measure this with the tic/toc pair around the call.
tic
rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');
inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps =100;
inData.BatchSize = 10000;
[H,OX] = forward_pass(rnn, inData)
toc
All the matrices in rnn and inData are gpuArrays, so all the calculations are carried out on the GPU. The outputs are also gpuArrays.
function [H,OX] = forward_pass(rnn, inData)
tic;
%initial hidden state values
H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));
%initialize state H
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
%initialize OX (which is H * Who)
OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');
for t = 1 : inData.TimeSteps
if t == 1
HX_t = H_init * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
else
HX_t = H(:,:,(t-1)) * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
end
H(:,:,t) = tanh(HX_t);
OX(:,:,t) = H(:,:,t) * rnn.W_ho;
end
toc;
end
Normally, if you use the gather() function, it will be slow. I didn't use gather() to transfer the outputs to the workspace, so I don't know why it is still so slow. It looks like the last line, "end", takes more than 2 secs.
Does anyone know how to speed up the function call?
First off, for proper benchmarking you do need to use gather, either inside the function call or afterwards. In the former case the function returns a non-GPU output; in the latter case the output is a GPU-based datatype. Now, back to your problem: you are using very few TimeSteps, so any optimization you try won't show a large effect. Here's an optimized version that will show increased performance as you increase TimeSteps:
function [H,OX] = forward_pass(rnn, inData)
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;
H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));
for t = 2 : inData.TimeSteps
H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
end
A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);
return;
Benchmarking
Test Case #1
Parameters
rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;
Results
---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.
Test Case #2
Parameters
inData.TimeSteps = 50000; (rest are same as in Test Case #1)
Results
---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.
Please note that these are tested on GTX 750 Ti.

More on using i and j as variables in Matlab: speed

The Matlab documentation says that
For speed and improved robustness, you can replace complex i and j by 1i. For example, instead of using
a = i;
use
a = 1i;
The robustness part is clear, as there might be variables called i or j. However, as for speed, I have made a simple test in Matlab 2010b and I obtain results which seem to contradict the claim:
>>clear all
>> a=0; tic, for n=1:1e8, a=i; end, toc
Elapsed time is 3.056741 seconds.
>> a=0; tic, for n=1:1e8, a=1i; end, toc
Elapsed time is 3.205472 seconds.
Any ideas? Could it be a version-related issue?
After comments by @TryHard and @horchler, I have tried assigning other values to the variable a, with these results:
Increasing order of elapsed time:
"i" < "1i" < "1*i" (trend "A")
"2i" < "2*1i" < "2*i" (trend "B")
"1+1i" < "1+i" < "1+1*i" (trend "A")
"2+2i" < "2+2*1i" < "2+2*i" (trend "B")
I think you are looking at a pathological example. Try something more complex (results shown for R2012b on OSX):
(repeated addition)
>> clear all
>> a=0; tic, for n=1:1e8, a = a + i; end, toc
Elapsed time is 2.217482 seconds. % <-- slower
>> clear all
>> a=0; tic, for n=1:1e8, a = a + 1i; end, toc
Elapsed time is 1.962985 seconds. % <-- faster
(repeated multiplication)
>> clear all
>> a=0; tic, for n=1:1e8, a = a * i; end, toc
Elapsed time is 2.239134 seconds. % <-- slower
>> clear all
>> a=0; tic, for n=1:1e8, a = a * 1i; end, toc
Elapsed time is 1.998718 seconds. % <-- faster
One thing to remember is that optimizations are applied differently whether you are running from the command line or a saved M-function.
Here is a test of my own:
function testComplex()
tic, test1(); toc
tic, test2(); toc
tic, test3(); toc
tic, test4(); toc
tic, test5(); toc
tic, test6(); toc
end
function a = test1
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2i;
end
end
function a = test2
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2j;
end
end
function a = test3
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*i;
end
end
function a = test4
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*j;
end
end
function a = test5
a = zeros(1e7,1);
for n=1:1e7
a(n) = complex(2,2);
end
end
function a = test6
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*sqrt(-1);
end
end
The results on my Windows machine running R2013a:
>> testComplex
Elapsed time is 0.946414 seconds. %// 2 + 2i
Elapsed time is 0.947957 seconds. %// 2 + 2j
Elapsed time is 0.811044 seconds. %// 2 + 2*i
Elapsed time is 0.685793 seconds. %// 2 + 2*j
Elapsed time is 0.767683 seconds. %// complex(2,2)
Elapsed time is 8.193529 seconds. %// 2 + 2*sqrt(-1)
Note that the results fluctuate a little bit with different runs where the order of calls is shuffled. So take the timings with a grain of salt.
My conclusion: it doesn't matter in terms of speed whether you use 1i or 1*i.
One interesting difference is that if you use i as the imaginary unit and also assign to a variable named i in the same function scope, MATLAB throws an error:
Error: File: testComplex.m Line: 38 Column: 5
"i" previously appeared to be used as a function or command, conflicting with its
use here as the name of a variable.
A possible cause of this error is that you forgot to initialize the variable, or you
have initialized it implicitly using load or eval.
To see the error, change the above test3 function into:
function a = test3
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*i;
end
i = rand(); %// added this line!
end
i.e., the variable i was used as both a function and a variable in the same local scope.

Purposefully Slow MATLAB Function?

I want to write a really, really, slow program for MATLAB. I'm talking like, O(2^n) or worse. It has to finish, and it has to be deterministically slow, so no "if rand() = 123,123, exit!" This sounds crazy, but it's actually for a distributed systems test. I need to create a .m file, compile it (with MCC), and then run it on my distributed system to perform some debugging operations.
The program must constantly be doing work, so sleep() is not a valid option.
I tried making a random large matrix and finding its inverse, but this was completing too quickly. Any ideas?
This naive implementation of the Discrete Fourier Transform takes ~ 9 seconds for a 2048 long input vector x on my 1.86 GHz single core machine. Going to 4096 inputs extends the time to ~ 35 seconds, close to the 4x I would expect for O(N^2). I don't have the patience to try longer inputs :)
function y = SlowDFT(x)
t = cputime;
y = zeros(size(x));
for c1=1:length(x)
for c2=1:length(x)
y(c1) = y(c1) + x(c2)*(cos((c1-1)*(c2-1)*2*pi/length(x)) - ...
1j*sin((c1-1)*(c2-1)*2*pi/length(x)));
end
end
disp(cputime-t);
EDIT: Or if you're looking to stress memory more than CPU:
function y = SlowDFT_MemLookup(x)
t = cputime;
y = zeros(size(x));
cosbuf = cos((0:1:(length(x)-1))*2*pi/length(x));
for c1=1:length(x)
cosctr = 1;
sinctr = round(3*length(x)/4)+1;
for c2=1:length(x)
y(c1) = y(c1) + x(c2)*(cosbuf(cosctr) ...
-1j*cosbuf(sinctr));
cosctr = cosctr + (c1-1);
if cosctr > length(x), cosctr = cosctr - length(x); end
sinctr = sinctr + (c1-1);
if sinctr > length(x), sinctr = sinctr - length(x); end
end
end
disp(cputime-t);
This is faster than calculating sin and cos on each iteration. A 2048 long input took ~ 3 seconds, and a 16384 long input took ~ 180 seconds.
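For reference, the same naive O(N^2) transform is easy to sketch in Python using only the standard library (slow_dft is my name for it, not part of the MATLAB code above). It is handy for checking the output on tiny inputs before scaling up the slow version:

```python
import math


def slow_dft(x):
    """Naive discrete Fourier transform: two nested loops, O(N^2) work."""
    n = len(x)
    y = [0j] * n
    for k in range(n):
        for m in range(n):
            angle = 2 * math.pi * k * m / n
            # cos(angle) - 1j*sin(angle), matching the MATLAB expression
            y[k] += x[m] * complex(math.cos(angle), -math.sin(angle))
    return y


if __name__ == "__main__":
    # A constant signal puts all of its energy in the first bin.
    print(slow_dft([1, 1, 1, 1]))
```

Doubling the input length quadruples the number of inner-loop passes, which matches the roughly 4x timing jump reported above.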
Count to 2^n. Optionally, make a slow function call in each iteration.
If you want real work that's easy to set up and stresses CPU way over memory:
Large dense matrix inversion (not slow enough? make it bigger.)
Factor an RSA number
How about using inv? It has been reported to be quite slow.
Do some work in a loop. You can tune the time it takes to complete using the number of loop iterations.
I don't speak MATLAB but something equivalent to the following might work.
loops = 0
counter = 0
while (loops < MAX_INT) {
counter = counter + 1;
if (counter == MAX_INT) {
loops = loops + 1;
counter = 0;
}
}
This will iterate MAX_INT*MAX_INT times. You can put some computationally heavy thing in the loop for it to take longer if this is not enough.
Easy! Go back to your Turing machine roots and think of processes that are O(2^n) or worse.
Here's a fairly simple one (warning, untested but you get the point)
N = 12; radix = 10;
odometer = zeros(N, 1);
done = false;
while (~done)
done = true;
for i = 1:N
odometer(i) = odometer(i) + 1;
if (odometer(i) >= radix)
odometer(i) = 0;
else
done = false;
break;
end
end
end
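If you want to convince yourself of the iteration count before burning CPU time, here is a Python sketch of the same odometer loop that counts passes through the outer loop (odometer_steps is a hypothetical name). With N digits in the given radix it runs radix^N times, so the N = 12, radix = 10 settings above mean 10^12 passes:

```python
def odometer_steps(n_digits, radix):
    """Count outer-loop passes until the odometer wraps back to all zeros."""
    odometer = [0] * n_digits
    steps = 0
    done = False
    while not done:
        steps += 1
        done = True
        for i in range(n_digits):
            odometer[i] += 1
            if odometer[i] >= radix:
                odometer[i] = 0   # this digit rolls over, carry to the next one
            else:
                done = False      # no carry left, so this pass is finished
                break
    return steps


if __name__ == "__main__":
    print(odometer_steps(3, 10))
```

Adding one digit multiplies the running time by the radix, which is what makes the loop exponential in N.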
Even better, how about calculating Fibonacci numbers recursively? Runtime is O(2^N), since fib(N) has to make two function calls fib(N-1) and fib(N-2), but stack depth is O(N), since only one of those function calls happens at a time.
function y = fib(n)
if (n <= 1)
y = 1;
else
y = fib(n-1) + fib(n-2);
end
end
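To see how quickly the call count blows up, here is a small Python sketch of the same recursion (with the same base case, fib(0) = fib(1) = 1) that also returns the number of calls made. The name fib_with_calls is just for illustration:

```python
def fib_with_calls(n):
    """Return (fib(n), number_of_calls) for the naive two-branch recursion."""
    if n <= 1:
        return 1, 1
    a, calls_a = fib_with_calls(n - 1)
    b, calls_b = fib_with_calls(n - 2)
    # One call for this frame plus everything made below it.
    return a + b, calls_a + calls_b + 1


if __name__ == "__main__":
    for n in (10, 15, 20):
        print(n, fib_with_calls(n))
```

For n = 10 this gives the value 89 after 177 calls. The count grows by roughly the golden ratio for each extra step, which is within the O(2^N) bound quoted above.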
You could ask it to factor(X) for a suitably large X
You could also test whether a given input is prime by just dividing it by all smaller numbers. That is O(n) divisions per input, or O(n^2) if you test every number up to n.
Try this one:
tic
isprime( primes(99999999) );
toc
EDIT:
For a more extensive set of tests, use these benchmarks (perhaps for multiple repetitions even):
disp(repmat('-',1,85))
disp(['MATLAB Version ' version])
disp(['Operating System: ' system_dependent('getos')])
disp(['Java VM Version: ' version('-java')]);
disp(['Date: ' date])
disp(repmat('-',1,85))
N = 3000; % matrix size
A = rand(N,N);
A = A*A;
tic; A*A; t=toc;
fprintf('A*A \t\t\t%f sec\n', t)
tic; [L,U,P] = lu(A); t=toc; clear L U P
fprintf('LU(A)\t\t\t%f sec\n', t)
tic; inv(A); t=toc;
fprintf('INV(A)\t\t\t%f sec\n', t)
tic; [U,S,V] = svd(A); t=toc; clear U S V
fprintf('SVD(A)\t\t\t%f sec\n', t)
tic; [Q,R,P] = qr(A); t=toc; clear Q R P
fprintf('QR(A)\t\t\t%f sec\n', t)
tic; [V,D] = eig(A); t=toc; clear V D
fprintf('EIG(A)\t\t\t%f sec\n', t)
tic; det(A); t=toc;
fprintf('DET(A)\t\t\t%f sec\n', t)
tic; rank(A); t=toc;
fprintf('RANK(A)\t\t\t%f sec\n', t)
tic; cond(A); t=toc;
fprintf('COND(A)\t\t\t%f sec\n', t)
tic; sqrtm(A); t=toc;
fprintf('SQRTM(A)\t\t%f sec\n', t)
tic; fft(A(:)); t=toc;
fprintf('FFT\t\t\t%f sec\n', t)
tic; isprime(primes(10^7)); t=toc;
fprintf('Primes\t\t\t%f sec\n', t)
The following are the results on my machine using N=1000 for one iteration only (note primes is using as upper bound 10^7 NOT 10^8 [takes way more time!])
A*A 0.178329 sec
LU(A) 0.118864 sec
INV(A) 0.319275 sec
SVD(A) 15.236875 sec
QR(A) 0.841982 sec
EIG(A) 3.967812 sec
DET(A) 0.121882 sec
RANK(A) 1.813042 sec
COND(A) 1.809365 sec
SQRTM(A) 22.750331 sec
FFT 0.113233 sec
Primes 27.080918 sec
This will run at 100% CPU for WANTED_TIME seconds:
WANTED_TIME = 2^n; % seconds
t0=cputime;
t=cputime;
while (t-t0 < WANTED_TIME)
t=cputime;
end;

Resources