I seem to be getting different performance results when using eigs. On the same matrix, calling
[c, v] = eigs(A, 2, 'sm');
sometimes takes 30 seconds and sometimes 2 seconds.
I need to know whether eigs uses some kind of caching to speed up subsequent calls on the same matrix, since I need to report the times.
If so, this doesn't appear to be a generic feature. I ran this test from the command line
A = randn(10000);
B = randn(10000);
C = B;
tic; [c1,v1] = eigs(A,2,'sm'); toc;
tic; [c2,v2] = eigs(A,2,'sm'); toc;
tic; [c3,v3] = eigs(B,2,'sm'); toc;
tic; [c4,v4] = eigs(C,2,'sm'); toc
and got this result
Elapsed time is 32.373128 seconds.
Elapsed time is 28.412905 seconds.
Elapsed time is 32.752616 seconds.
Elapsed time is 29.024055 seconds.
I'm surprised, because usually MATLAB tries to outsmart you and will store results for reuse.
Under some circumstances, a large enough matrix might push things into virtual memory, or not, depending upon whether there is a large enough block of contiguous RAM available. Or, you may be doing something on the side.
You can verify what is happening by watching a process monitor as you do the test. Are there suddenly large amounts of disk accesses? If so, then virtual memory is being touched. Is there a different, unrelated process active that is hogging the CPU?
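To separate run-to-run noise from a real caching effect, it can help to repeat the same call several times and look at the spread instead of comparing two numbers. The following is a minimal Python sketch of that idea, not MATLAB code; `time_repeated` and `workload` are hypothetical names, and the workload only stands in for the expensive eigs call.

```python
import statistics
import time


def time_repeated(fn, repeats=5):
    """Return the wall-clock duration of each of `repeats` calls to fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def workload():
    # Hypothetical stand-in for an expensive call such as eigs(A, 2, 'sm').
    return sum(i * i for i in range(200_000))


durations = time_repeated(workload, repeats=5)
print("first call :", durations[0])
print("median     :", statistics.median(durations))
print("min / max  :", min(durations), max(durations))
```

If only the first call is slow, some form of caching or warm-up is plausible; if slow calls appear at random positions, outside load such as paging or a competing process is the more likely cause.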
I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel computing.
In order to learn, I parallelized code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of the components of arrays whose elements are drawn at random from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
    iter = 100000*2
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
    iter = 100000
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the definition of iter in the serial code is there because I ran the code on a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution time.
Questions
Why is the parallelized version slower? (What am I doing wrong?)
Which is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as @HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array size   Single thread (s)   Parallel (s)   Ratio (parallel/single)
5000                    2.69           2.89     1.07
100 000               488.77         346.00     0.71
1 000 000            4776.58        4438.09     0.93
I am puzzled about why there is no clear trend with array size.
You should pay close attention to the suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the pmap and/or vcat call, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use the pmap source code as a springboard.
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally the performance of a function f() is measured in Julia by running it twice and measuring only the second run. In the first run, the JIT compiler compiles f() before running it, while the second run uses the compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
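The "run once to warm up, then time the second run" pattern can be illustrated outside Julia as well. The sketch below is a Python analogy under stated assumptions: Python has no JIT compilation step here, so a hypothetical function `f` fakes a one-time setup cost on its first call, and `timed` is a hypothetical helper.

```python
import time

_table = None  # filled on first use, mimicking a one-time compilation cost


def f(n):
    """Hypothetical function whose first call pays a one-time setup cost."""
    global _table
    if _table is None:
        _table = [i * i for i in range(500_000)]  # one-time work
    return sum(_table[:n])


def timed(fn, *args):
    """Return (result, elapsed wall-clock seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


_, first = timed(f, 1000)   # includes the one-time setup
_, second = timed(f, 1000)  # steady-state cost only
print(f"first call: {first:.6f} s, second call: {second:.6f} s")
```

Reporting only the second measurement keeps the one-time cost out of the benchmark, which is the same reasoning as timing the second run of a Julia function.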
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)
### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
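To make the allocation point concrete, here is a rough Python analogy (not the Julia code above) that compares the peak memory of storing every sample against keeping only a running total. The function names are hypothetical, and `tracemalloc` measures Python-level allocations only, so the numbers are indicative rather than comparable to Julia's @time output.

```python
import random
import tracemalloc


def mean_materialized(n, rng):
    """Store every sample in a list, then average."""
    samples = [rng.random() for _ in range(n)]
    return sum(samples) / n


def mean_streaming(n, rng):
    """Keep only a running total; no per-sample storage."""
    total = 0.0
    for _ in range(n):
        total += rng.random()
    return total / n


def peak_bytes(fn, n, seed=0):
    """Return (peak traced allocation in bytes, result) for fn(n, rng)."""
    rng = random.Random(seed)
    tracemalloc.start()
    result = fn(n, rng)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak, result


peak_list, m1 = peak_bytes(mean_materialized, 100_000)
peak_stream, m2 = peak_bytes(mean_streaming, 100_000)
print("peak bytes, materialized:", peak_list)
print("peak bytes, streaming   :", peak_stream)
```

Both versions estimate the same mean, but the streaming one never holds the samples in memory, which is the kind of saving that reduces garbage-collection pressure.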
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
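One reason a single pass over all entries can replace the nested scheme: when every chunk has the same length, the mean of the chunk means equals the mean of all entries. A small Python check of that identity follows; the sizes `iters` and `p` are hypothetical and deliberately tiny.

```python
import random
from statistics import fmean

rng = random.Random(42)
iters, p = 200, 50  # small hypothetical sizes so the sketch runs quickly

data = [rng.random() for _ in range(iters * p)]

# Mean of the per-chunk means, as in the question's mean_estimate.
chunk_means = [fmean(data[i * p:(i + 1) * p]) for i in range(iters)]
mean_of_means = fmean(chunk_means)

# Mean of all entries at once, as suggested above.
overall_mean = fmean(data)

print(mean_of_means, overall_mean)
```

The identity relies on equal chunk sizes and holds only up to floating-point rounding. It does not carry over to the median: a median of chunk medians is generally not the overall median.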
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
Many scientific computing languages make a distinction between absolute time (wall clock) and CPU time (processor cycles). For example, in Matlab we have:
>> tic; pause(1); toc
Elapsed time is 1.009068 seconds.
>> start = cputime; pause(1); elapsed = cputime - start
elapsed =
0
and in Mathematica we have:
In[1]:= AbsoluteTiming[Pause[1]]
Out[1]= {1.0010572, Null}
In[2]:= Timing[Pause[1]]
Out[2]= {0., Null}
This distinction is useful when benchmarking code run on computation servers, where there may be high variance in the absolute timing results depending on what other processes are running concurrently.
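For comparison, the same wall-clock versus CPU-time distinction can be shown in Python, whose standard library exposes both clocks. This is only an illustration of the concept and does not answer the Julia question; the 0.2 second pause is an arbitrary choice.

```python
import time

wall_start = time.perf_counter()  # wall-clock timer
cpu_start = time.process_time()   # CPU time used by this process

time.sleep(0.2)                   # idle wait: wall time passes, CPU stays mostly idle

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall: {wall_elapsed:.3f} s, cpu: {cpu_elapsed:.3f} s")
```

As with pause in MATLAB and Pause in Mathematica, the sleep shows up almost entirely in the wall-clock figure and barely at all in the CPU figure.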
The Julia standard library provides support for timing of expressions through tic(), toc(), @time and a few other functions/macros, all based on time_ns(), a function that measures absolute time.
julia> @time sleep(1)
elapsed time: 1.017056895 seconds (135788 bytes allocated)
My question: Is there a simple way to get the elapsed CPU time for an expression evaluation in Julia?
(Side note: doing some digging, it appears that Julia timing is based on the uv_hrtime() function from libuv. It seems to me that using uv_getrusage from the same library might give a way to access elapsed CPU time in Julia, but I'm no expert. Has anybody tried using anything like this?)
I couldn't find any existing solutions, so I've put together a package with some simple CPU timing functionality here: https://github.com/schmrlng/CPUTime.jl. The package is completely untested on parallel code and may have other bugs, but if anybody else would like to try it out, calling
Pkg.clone("https://github.com/schmrlng/CPUTime.jl.git")
from the julia> prompt should install the package.
Julia does have the commands tic() and toc() which work just like tic and toc in Matlab:
julia> tic(); 7^1000000000; toc()
elapsed time: 0.046563597 seconds
0.046563597
I am writing code that outputs to a DAQ which controls a device. I want it to send a signal out precisely every 1 second. Depending on the performance of my processor, the code sometimes takes longer or shorter than 1 second. Is there any way to improve this bit of code?
for i= 1:100
    tic
    pause(.99)
    toc
end
Elapsed time is 1.000877 seconds.
Elapsed time is 0.992847 seconds.
Elapsed time is 0.996886 seconds.
Using pause is known to be fairly imprecise (on the order of 10 ms). Recent versions of MATLAB have optimized tic/toc to be low-overhead and as precise as possible.
You can make use of tic toc to be more precise than pause using the following code:
ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
for i= 1:ntimes
    outer = tic;
    while toc(outer) < time_dur
    end
    times(i) = toc(outer);
end
mean(times)
std(times)
Here is my outcome for 50 measurements: mean = 0.9900 with a std = 1.0503e-5, which is much more precise than using pause.
Using the original code with just pause, for 50 measurements I get: mean = 0.9981 with a std = 0.0037.
This is an improved version of shimizu's answer. The main issue is a minimal clock drift. Each iteration, the time stamp is taken and then the timer is reset. The clock drifts by the execution time of these two commands.
A secondary, minor improvement combines pause and the tic-toc technique to lower the CPU load.
ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
t = tic;
for ix= 1:ntimes
    pause((time_dur*ix-toc(t)-0.1))
    while toc(t) < time_dur*ix
    end
    times(ix) = toc(t);
end
mean(diff(times))
std(diff(times))
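The scheme above, with absolute deadlines measured from a single start time plus a coarse sleep followed by a short busy-wait, can also be sketched in Python. This is an illustrative translation rather than the answer's code: `run_periodic` and `spin_margin` are hypothetical names, and the short 0.05 second period is chosen only so the example finishes quickly.

```python
import time


def run_periodic(period, n_ticks, spin_margin=0.002):
    """Fire n_ticks events on a fixed schedule; return tick times relative to start."""
    start = time.perf_counter()
    tick_times = []
    for k in range(1, n_ticks + 1):
        deadline = start + k * period  # absolute deadline, so errors do not accumulate
        remaining = deadline - time.perf_counter()
        if remaining > spin_margin:
            time.sleep(remaining - spin_margin)  # coarse wait, low CPU load
        while time.perf_counter() < deadline:    # fine wait, busy loop
            pass
        tick_times.append(time.perf_counter() - start)
    return tick_times


ticks = run_periodic(period=0.05, n_ticks=5)
intervals = [b - a for a, b in zip(ticks, ticks[1:])]
print("tick times:", ticks)
print("intervals :", intervals)
```

Because each deadline is computed from the common start time, a late tick shortens the following interval instead of shifting every later tick.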
If you want your DAQ to update exactly every second, use a DAQ with a FIFO buffer and a clock and configured to read a value from the FIFO exactly once per second.
Even if you got your MATLAB task iterations running exactly one second apart, the inconsistent delay in communication with the DAQ would mess up your timing.
I tried MATLAB's convolution functions conv2 and convn with gpuArray.
For example, I ran convn(gpuArray.rand(100,100,10,'single'),gpuArray.rand(5,'single')) and compared it to the CPU version convn(rand(100,100,10),rand(5)). Unfortunately the GPU version is much slower than the CPU version, which is especially noticeable when I put the function into a loop (which will be relevant for me). Does anyone know an alternative for fast convolution using MATLAB and the GPU for relatively small filtering kernels from 5x5 to 14x14?
The GPU performance is limited by the small data array sizes, [100x100x10] and [5x5], in your test case.
The actual performance also depends on the GPU and CPU models. For your data size (test case 2 of the following code), I get a performance improvement (2.75x) with a Tesla M2090 GPU over a Xeon E5-2609 CPU.
I used the following MATLAB test code:
m=1000;
n=100;
k=5;
gc=convn(gpuArray.rand(m,m,10,'single'),gpuArray.rand(k,'single'));
tic;
for i=1:n
    gc=convn(gpuArray.rand(m,m,10,'single'),gpuArray.rand(k,'single'));
end
toc
c=convn(rand(m,m,10,'single'),rand(k,'single'));
tic;
for i=1:n
    c=convn(rand(m,m,10,'single'),rand(k,'single'));
end
toc
When m=1000; n=100; k=5; I got very good performance improvement (11.6x) on GPU.
Elapsed time is 2.367453 seconds.
Elapsed time is 27.502952 seconds.
But when m=100; n=1000; k=5; I got only 2.75x
Elapsed time is 1.206053 seconds.
Elapsed time is 3.330559 seconds.
When m=100; n=1000; k=14;, it becomes better (4.84x).
Elapsed time is 2.804957 seconds.
Elapsed time is 13.585698 seconds.
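The quoted speedups are the CPU time divided by the GPU time for each configuration. As a quick arithmetic check in Python, assuming the first elapsed time of each pair is the GPU loop and the second the CPU loop, as in the test code:

```python
# (GPU seconds, CPU seconds) as reported for each configuration
timings = {
    "m=1000, n=100, k=5": (2.367453, 27.502952),
    "m=100, n=1000, k=5": (1.206053, 3.330559),
    "m=100, n=1000, k=14": (2.804957, 13.585698),
}

# Speedup = CPU time / GPU time
speedups = {name: cpu / gpu for name, (gpu, cpu) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios come out at roughly 11.6x, 2.8x and 4.8x, in line with the figures quoted above.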
In trying to choose which indexing method to recommend, I tried to measure the performance. However, the measurements confused me a lot. I ran this multiple times in different orders, but the measurements remain consistent.
Here is how I measured the performance:
for N = [10000 15000 100000 150000]
    x = round(rand(N,1)*5)-2;
    idx1 = x~=0;
    idx2 = abs(x)>0;
    tic
    for t = 1:5000
        idx1 = x~=0;
    end
    toc
    tic
    for t = 1:5000
        idx2 = abs(x)>0;
    end
    toc
end
And this is the result:
Elapsed time is 0.203504 seconds.
Elapsed time is 0.230439 seconds.
Elapsed time is 0.319840 seconds.
Elapsed time is 0.352562 seconds.
Elapsed time is 2.118108 seconds. % This is the strange part
Elapsed time is 0.434818 seconds.
Elapsed time is 0.508882 seconds.
Elapsed time is 0.550144 seconds.
I checked and for values around 100000 this also happens, even at 50000 the strange measurements occur.
So my question is: Does anyone else experience this for a certain range, and what causes this? (Is it a bug?)
I think this is something to do with JIT (results below are using 2011b). Depending on system, version of Matlab, the size of variables, and exactly what is in the loop(s), it is not always faster to use JIT. This is related to the "warm-up" effect, where sometimes if you run an m-file more than once in a session it gets quicker after the first run, as the accelerator only has to compile some parts of the code once.
JIT on (feature accel on)
Elapsed time is 0.176765 seconds.
Elapsed time is 0.185301 seconds.
Elapsed time is 0.252631 seconds.
Elapsed time is 0.284415 seconds.
Elapsed time is 1.782446 seconds.
Elapsed time is 0.693508 seconds.
Elapsed time is 0.855005 seconds.
Elapsed time is 1.004955 seconds.
JIT off (feature accel off)
Elapsed time is 0.143924 seconds.
Elapsed time is 0.184360 seconds.
Elapsed time is 0.206405 seconds.
Elapsed time is 0.306424 seconds.
Elapsed time is 1.416654 seconds.
Elapsed time is 2.718846 seconds.
Elapsed time is 2.110420 seconds.
Elapsed time is 4.027782 seconds.
ETA, kinda interesting to see what happens if you use integers instead of doubles:
JIT on, same code but converted x using int8
Elapsed time is 0.202201 seconds.
Elapsed time is 0.192103 seconds.
Elapsed time is 0.294974 seconds.
Elapsed time is 0.296191 seconds.
Elapsed time is 2.001245 seconds.
Elapsed time is 2.038713 seconds.
Elapsed time is 0.870500 seconds.
Elapsed time is 0.898301 seconds.
JIT off, using int8
Elapsed time is 0.198611 seconds.
Elapsed time is 0.187589 seconds.
Elapsed time is 0.282775 seconds.
Elapsed time is 0.282938 seconds.
Elapsed time is 1.837561 seconds.
Elapsed time is 1.846766 seconds.
Elapsed time is 2.746034 seconds.
Elapsed time is 2.760067 seconds.
This may be due to some automatic optimization MATLAB uses for its Basic Linear Algebra Subprograms (BLAS).
Just like yours, my configuration (OSX 10.8.4, R2012a with default settings) takes longer to compute idx1 = x~=0 for x (10e5 elements) than x (11e5 elements). See the left panel of the figure, where the processing time (y-axis) is measured for different vector sizes (x-axis). You will see a lower processing time for N>103000. In this panel, I also displayed the number of cores that were active during the calculation. You will see that there is no drop for the one-core configuration. This means that MATLAB does not optimize the execution of ~= when only one core is active (no parallelization possible). MATLAB enables some optimization routines when two conditions are met: multiple cores and a vector of sufficient size.
The right panel displays the results when feature('accel','on'/'off') is set to off. Here, only one core is active (1-core and 4-core are identical) and therefore no optimization is possible.
Finally, the function I used for activating/deactivating the cores is maxNumCompThreads. According to Loren Shure, maxNumCompThreads controls both the JIT and BLAS. Since feature('JIT','on'/'off') did not play a role in the performance, BLAS is the last option remaining.
I will leave the final sentence to Loren: "The main message here is that you should not generally need to use this function [maxNumCompThreads] at all! Why? Because we'd like to make MATLAB do the best job possible for you."
accel = {'on';'off'};
figure('Color','w');
N = 100000:1000:105000;
for ind_accel = 2:-1:1
    eval(['feature(''accel'',''' accel{ind_accel} ''')']);
    tElapsed = zeros(4,length(N));
    for ind_core = 1:4
        maxNumCompThreads(ind_core);
        n_core = maxNumCompThreads;
        for ii = 1:length(N)
            % print the vector size N(ii), not the loop index
            fprintf('core asked: %d(true:%d) - N:%d\n',ind_core,n_core, N(ii));
            x = round(rand(N(ii),1)*5)-2;
            idx1 = x~=0;
            tStart = tic;
            for t = 1:5000
                idx1 = x~=0;
            end
            tElapsed(ind_core,ii) = toc(tStart);
        end
    end
    h2 = subplot(1,2,ind_accel);
    plot(N, tElapsed,'-o','MarkerSize',10);
    legend({('1':'4')'});
    xlabel('Vector size','FontSize',14);
    ylabel('Processing time','FontSize',14);
    set(gca,'FontSize',14,'YLim',[0.2 0.7]);
    title(['accel ' accel{ind_accel}]);
end