I have a complex matrix A, and would like to modify it Nt times according to A = exp( -1i*(A + abs(A).^2) ). The size of A is typically 1000x1000, and the number of times to run would be around 10000.
I am looking to reduce the time taken to carry out these operations. For 1000 iterations on the CPU, I measure around 6.4 seconds. Following the Matlab documentation, I was able to move this to the GPU, which reduced the time taken to 0.07 seconds (an incredible x91 improvement!). So far so good.
However, I also now read this link in the docs, which describes how we can sometimes find even further improvement for element-wise calculations if we use arrayfun() as well. If I try to follow the tutorial, the time taken is actually worse, clocking in at 0.47 seconds. My tests are shown below:
Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;
A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix
gpu_A = gpuArray(A); % Transfer matrix to a GPU array
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt
cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;
%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;
%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;
%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );
%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end
and the results are:
Time taken using only CPU: 6.38785
Time taken using gpuArray directly: 0.0680587
Time taken using GPU + arrayfun(): 0.474612
My questions are:
Am I using arrayfun() correctly in this situation, and is it expected that arrayfun() should be slower here?
If so, and it is really just expected that it is slower than the direct gpuArray method, is there any easy (i.e non-MEX) way to speed up such a calculation? (I see they also mention using pagefun for example).
Thanks in advance for any advice.
(The graphics card is Nvidia Quadro M4000, and I am running Matlab R2017a)
Edit
After reading @Edric's answer, I think it is important to show a little more of the wider code. One thing I didn't mention in the OP is that, in my actual main code, the k=1:Nt loop contains an additional operation: a matrix multiplication with the transpose of a sparse, tridiagonal matrix. Here is a more fleshed-out MWE of what is really going on:
Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
cpu_data_out = A;
for k = 1:Nt
cpu_data_out = test_function( cpu_data_out );
cpu_data_out = cpu_data_out*M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end
I'm very sorry for not including that in the OP - I did not realise at the time that it might be relevant to the solution. Does this change things? Are there still gains to be made with arrayfun() on the GPU, or is this now not suitable for converting to arrayfun() ?
A few points here. Firstly, (and most importantly), to time code on the GPU, you need to use either gputimeit, or you need to inject a call to wait(gpuDevice) before calling toc. That's because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the gpuArray method, and 0.18 seconds for the arrayfun version.
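The two timing approaches mentioned above can be sketched as follows. This is a minimal illustration, assuming the Parallel Computing Toolbox and a supported GPU; the variable names (f, dev, t_loop, t_single) are mine, and the sizes simply mirror the question's test.

```matlab
% Sketch: timing asynchronous GPU work correctly.
f     = @(x) exp(-1i*(x + abs(x).^2));        % same element-wise update as the question
gpu_A = gpuArray(rand(500, 600) + 1i*rand(500, 600));
Nt    = 1000;
dev   = gpuDevice;                            % handle to the currently selected GPU

% Option 1: tic/toc with an explicit synchronization before toc
out = gpu_A;
tic
for k = 1:Nt
    out = f(out);
end
wait(dev);                                    % block until all queued GPU work has finished
t_loop = toc;

% Option 2: gputimeit synchronizes internally and repeats the measurement
t_single = gputimeit(@() f(gpu_A));           % time for ONE application of f

fprintf('Loop of %d iterations (tic/toc + wait): %g s\n', Nt, t_loop);
fprintf('Single application (gputimeit):        %g s\n', t_single);
```

Without the wait(dev) call, toc can return while kernels are still queued, which is why the timings in the question look more favourable than they really are.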
Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the arrayfun function body so that that loop runs directly on the GPU. Like this:
%%% Function to operate on matrices %%%
function x = test_function(x,Nt)
    for ii = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
end
You'll need to invoke it like A = arrayfun(@test_function, A, Nt). On my GPU, this brings the arrayfun time down to 0.05 seconds, so about twice as fast as the plain gpuArray version.
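For completeness, a full call might look like the sketch below. It assumes a release that allows local functions at the end of a script (R2016b or later) and that GPU arrayfun accepts a handle to such a function; if it does not in your release, save test_function in its own test_function.m file instead. The names gpu_out, result and t_arrayfun are illustrative only.

```matlab
% Sketch: running the whole Nt-step loop inside a single arrayfun call.
Nt    = 1000;
gpu_A = gpuArray(rand(500, 600) + 1i*rand(500, 600));
dev   = gpuDevice;

tic
gpu_out = arrayfun(@test_function, gpu_A, Nt);  % scalar Nt is expanded to every element
wait(dev);                                      % synchronize before reading the timer
t_arrayfun = toc;

result = gather(gpu_out);                       % copy back to host memory only once
fprintf('arrayfun with in-kernel loop: %g s\n', t_arrayfun);

% Element-wise kernel: x and Nt are scalars inside the function body.
function x = test_function(x, Nt)
    for ii = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
end
```

Note that this only works while every element evolves independently of the others. An operation that mixes elements between iterations, such as the sparse matrix product in the question's edit, is not element-wise and so cannot be folded into the arrayfun body this way.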
I get a pretty consistent time difference for small matrices in favor of max(A(:)):
>> A=rand(100); tic; max(A(:)); toc; tic; max(max(A)); toc;
Elapsed time is 0.000060 seconds.
Elapsed time is 0.000083 seconds.
but for large matrices, the time difference is inconsistent:
>> A=rand(1e3); tic; max(A(:)); toc; tic; max(max(A)); toc;
Elapsed time is 0.001072 seconds.
Elapsed time is 0.001103 seconds.
>> A=rand(1e3); tic; max(A(:)); toc; tic; max(max(A)); toc;
Elapsed time is 0.000847 seconds.
Elapsed time is 0.000792 seconds.
same for larger,
>> A = rand(1e4); tic; max(A(:)); toc; tic; max(max(A)); toc;
Elapsed time is 0.049073 seconds.
Elapsed time is 0.050206 seconds.
>> A = rand(1e4); tic; max(A(:)); toc; tic; max(max(A)); toc;
Elapsed time is 0.072577 seconds.
Elapsed time is 0.060357 seconds.
Why is there a difference and what would be the best practice?
As horchler says, this is machine dependent. However, on my machine I saw a clear performance decrease for max(max(max(... in higher dimensions. I also saw a slight (but consistent) speed advantage for max(A(:)) on a more ordered type of matrix such as a Toeplitz matrix. Still, for the test case that you tried I saw hardly any difference.
Also, max(max(max(... is error prone because of all the parentheses, so I would prefer max(A(:)). The execution time of this form seems to be stable across all dimensions, which makes it easy to predict how long it takes to execute.
Thirdly, max seems to be very fast, which means that performance should be a minor issue here. That makes max(A(:)) preferable in this case for its readability.
So as a conclusion, I would prefer max(A(:)), but if you think that max(max(A)) is clearer you could probably use this.
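To make the readability point concrete, here is a small sketch on a 3-D array (the array size is arbitrary). The last form, max(A, [], 'all'), is not mentioned above; it exists only in R2018b and later, so it is guarded by a version check.

```matlab
% Sketch: three ways to get the global maximum of an N-D array.
A = rand(4, 5, 6);

m_colon  = max(A(:));            % one call, independent of ndims(A)
m_nested = max(max(max(A)));     % needs one max() per dimension

assert(m_colon == m_nested);     % both return the same element of A

% R2018b (version 9.5) and later also accept the 'all' option.
if ~verLessThan('matlab', '9.5')
    m_all = max(A, [], 'all');
    assert(m_all == m_colon);
end
```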
On my machine there are no differences in times that are really worth worrying about.
n = 2:0.2:4;
for i = 1:numel(n)
    a = rand(floor(10^n(i)));
    t1(i) = timeit(@()max(a(:)));
    t2(i) = timeit(@()max(max(a)));
end
>> t1
t1 =
Columns 1 through 7
7.4706e-06 1.5349e-05 3.1569e-05 2.803e-05 5.6141e-05 0.00041006 0.0011328
Columns 8 through 11
0.0027755 0.006876 0.0171 0.042889
>> t2
t2 =
Columns 1 through 7
1.1959e-05 2.2539e-05 2.3641e-05 4.1313e-05 7.6301e-05 0.00040654 0.0011396
Columns 8 through 11
0.0027885 0.0068966 0.01718 0.042997
I am new to using distributed and codistributed arrays in MATLAB. The parallel code I have produced works, but is much slower than the serial version and I have no idea why. The code examples below compute the eigenvalues of Hessian matrices from volumetric data.
Serial version:
S = size(D);
Dsmt=imgaussian(D,2,20);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
d = zeros([3 S(1) S(2) S(3)]);
for i = 1 : S(1)
fprintf('Slice %d out of %d\n', i, S(1));
for ii = 1 : S(2)
for iii = 1 : S(3)
d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
end
end
end
Parallel version:
S = size(D);
Dsmt=imgaussian(D,2,20);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
CDHess = distributed(DHess);
spmd
d = zeros([3 S(1) S(2) S(3)], codistributor('1d',4));
for i = 1 : S(1)
fprintf('Slice %d out of %d\n', i, S(1));
for ii = 1 : S(2)
for iii = drange(1 : S(3))
d(:,i,ii,iii) = eig(squeeze(CDHess(:,:,i,ii,iii)));
end
end
end
end
If someone could shed some light on the issue I would be very grateful
Here is a rewritten version of your code. I have split the work over the outer-most loop rather than, as in your case, the inner-most loop. I have also explicitly allocated the local parts of the d result array and of the Hessian matrix.
In your code you rely on drange to split the work, and you access the distributed arrays directly to avoid extracting the local part. Admittedly, it should not result in such a great slowdown if MATLAB did everything correctly. The bottom line is, I don't know why your code is so slow - most likely because MATLAB does some remote data accessing despite the fact that you distributed your matrices.
Anyway, the below code runs and gives pretty good speedup on my computer using 4 labs. I have generated synthetic random input data to have something to work on. Have a look at the comments. If something is unclear, I can elaborate later.
clear all;
D = rand(512, 512, 3);
S = size(D);
[fx, fy, fz] = gradient(D);
% this part could also be parallelized - at least a bit.
tic;
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
toc
% your sequential implementation
d = zeros([3, S(1) S(2) S(3)]);
disp('sequential')
tic
for i = 1 : S(1)
for ii = 1 : S(2)
for iii = 1 : S(3)
d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
end
end
end
toc
% my parallel implementation
disp('parallel')
tic
spmd
% just for information
disp(['lab ' num2str(labindex)]);
% distribute the input data along the third dimension
% This is the dimension of the outer-most loop, hence this is where we
% want to parallelize!
DHess_dist = codistributed(DHess, codistributor1d(3));
DHess_local = getLocalPart(DHess_dist);
% create an output data distribution -
% note that this time we split along the second dimension
codist = codistributor1d(2, codistributor1d.unsetPartition, [3, S(1) S(2) S(3)]);
localSize = [3 codist.Partition(labindex) S(2) S(3)];
% allocate local part of the output array d
d_local = zeros(localSize);
% your ordinary loop, BUT! the outermost loop is split amongst the
% threads explicitly, using local indexing. In the loop only local parts
% of matrix d and DHess are accessed
for i = 1:size(d_local,2)
for ii = 1 : S(2)
for iii = 1 : S(3)
d_local(:,i,ii,iii) = eig(squeeze(DHess_local(:,:,i,ii,iii)));
end
end
end
% assemble local results to a codistributed matrix
d_dist = codistributed.build(d_local, codist);
end
toc
isequal(d, d_dist)
And the output
Elapsed time is 0.364255 seconds.
sequential
Elapsed time is 33.498985 seconds.
parallel
Lab 1:
lab 1
Lab 2:
lab 2
Lab 3:
lab 3
Lab 4:
lab 4
Elapsed time is 9.445856 seconds.
ans =
1
Edit I have checked the performance on a reshaped matrix DHess=[3x3xN]. The performance is not much better (10%), so it is not substantial. But maybe you can implement the eig a bit differently? After all, those are 3x3 matrices you are dealing with.
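One way to "implement the eig a bit differently" for 3x3 matrices is a closed-form solution. The sketch below is not from the answer above: it uses the well-known trigonometric solution of the characteristic cubic, and it assumes every 3x3 slice is exactly symmetric (a finite-difference Hessian is only approximately so, and would need symmetrizing first). Random data in the reshaped 3x3xN layout stands in for DHess, and all variable names are illustrative.

```matlab
% Sketch: vectorized closed-form eigenvalues for N symmetric 3x3 matrices.
N = 1000;
H = randn(3, 3, N);
H = (H + permute(H, [2 1 3])) / 2;            % force exact symmetry for this test

a11 = squeeze(H(1,1,:)); a22 = squeeze(H(2,2,:)); a33 = squeeze(H(3,3,:));
a12 = squeeze(H(1,2,:)); a13 = squeeze(H(1,3,:)); a23 = squeeze(H(2,3,:));

q  = (a11 + a22 + a33) / 3;                   % mean of the three eigenvalues
p1 = a12.^2 + a13.^2 + a23.^2;
p  = sqrt(((a11-q).^2 + (a22-q).^2 + (a33-q).^2 + 2*p1) / 6);
ps = max(p, realmin);                         % avoids 0/0 when a slice equals q*I

% r = det(B)/2 with B = (H - q*I)/p, written out for a symmetric matrix
b11 = (a11-q)./ps; b22 = (a22-q)./ps; b33 = (a33-q)./ps;
b12 = a12./ps;     b13 = a13./ps;     b23 = a23./ps;
r = ( b11.*(b22.*b33 - b23.^2) ...
    - b12.*(b12.*b33 - b23.*b13) ...
    + b13.*(b12.*b23 - b22.*b13) ) / 2;
r   = min(max(r, -1), 1);                     % clamp rounding errors before acos
phi = acos(r) / 3;

e_max = q + 2*p.*cos(phi);
e_min = q + 2*p.*cos(phi + 2*pi/3);
e_mid = 3*q - e_max - e_min;                  % the trace fixes the third eigenvalue
d_fast = [e_min, e_mid, e_max].';             % 3-by-N, ascending like eig

% Reference: the per-matrix eig loop
d_ref = zeros(3, N);
for k = 1:N
    d_ref(:,k) = eig(H(:,:,k));
end
err = max(abs(d_fast(:) - d_ref(:)));
fprintf('max abs deviation from eig: %g\n', err);
```

The result can be reshaped to the original layout with reshape(d_fast, [3 S]). The closed form may lose accuracy for nearly repeated eigenvalues, so it is worth checking against eig on a sample of your own data before relying on it.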
You didn't specify where you've opened your matlabpool, and that will be the main factor determining what speedup you get.
If you are using the 'local' scheduler, then there is often no benefit to using distributed arrays. In particular, if the time-consuming operations are multithreaded in MATLAB already, then they will almost certainly slow down when using the local scheduler since the matlabpool workers run in single-threaded mode.
If you are using some other scheduler with the workers on a separate machine then you might be able to get speedup, but that depends on what you're doing. There's an example here http://www.mathworks.com/products/parallel-computing/examples.html?file=/products/demos/shipping/distcomp/paralleldemo_backslash_bench.html which shows some benchmarks of MATLAB's \ operator.
Finally, it's worth noting that indexing distributed arrays is unfortunately rather slow, especially compared to MATLAB's built-in indexing. If you can extract the 'local part' of your codistributed arrays inside the spmd block and work exclusively with those, that might also help.
I have 2 input variables:
a vector of p-values (p) with N elements (unsorted)
and N x M matrix with p-values obtained by random permutations (pr) with M iterations. N is quite large, 10K to 100K or more. M let's say 100.
I'm estimating the False Discovery Rate (FDR) for each element of p representing how many p-values from random permutations will pass if the current p-value (from p) will be the threshold.
I wrote the function with ARRAYFUN, but it takes a lot of time for large N (2 min for N=20K), comparable to a for-loop.
function pfdr = fdr_from_random_permutations(p, pr)
%# ... skipping arguments checks
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
Any ideas how to make it faster?
Comments about statistical issues here are also welcome.
The test data can be generated as p = rand(N,1); pr = rand(N,M);.
Well, the trick was indeed sorting the vectors. I give credit to @EgonGeerardyn for that. Also, there is no need to use mean. You can just divide everything afterwards by M. When p is sorted, the number of values that are less than or equal to the current x is just a running index. pr is a more interesting case - I used a running index called place to discover how many elements are less than or equal to x.
Edit(2): Here is the fastest version I came up with:
function Speedup2()
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1); pr = rand(N,M);
tic
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
tic
out = zeros(numel(p),1);
[p,sortIndex] = sort(p);
pr = sort(pr(:));
pr(end+1) = Inf;
place = 1;
N = numel(pr);
for i=1:numel(p)
x = p(i);
while pr(place)<=x
place = place+1;
end
exp1a = place-1;
exp2 = i;
out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
disp(max(abs(pfdr-out)));
end
And the benchmark results for N = 10000/4 ; M = 100/4 :
Elapsed time is 0.898689 seconds.
Elapsed time is 0.007697 seconds.
2.220446049250313e-016
and for N = 10000 ; M = 100 ;
Elapsed time is 39.730695 seconds.
Elapsed time is 0.088870 seconds.
2.220446049250313e-016
First of all, try to analyze this using the profiler. Profiling should ALWAYS be the first step when trying to improve performance. We can all guess at what is causing your performance drop, but the only way to be sure and focus on the right part is to inspect the profiler report.
I didn't run the profiler on your code, as I don't want to generate test data to do so; but I have some ideas about what work is being carried out in vain. In your function mean(sum(pr<=x))./sum(p<=x), you are repeatedly summing over p<=x. All in all, one call includes N comparisons and N-1 summations. So for both, you have behavior that is quadratic in N when all N values of p are calculated.
If you step through a sorted version of p, you need fewer calculations and comparisons, as you can keep track of a running sum (i.e. behavior that is linear in N). I guess a similar method could be applied to the other part of the calculation.
edit:
The implementation of my idea as expressed above:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
[pr] = sort(pr(:));
pfdr = NaN(N,1);
parfor iP = 1:N
x = p(iP);
m = sum(pr<=x)/M;
pfdr(iP) = m/iP;
end
pfdr(idxP) = pfdr;
If you have access to the Parallel Computing Toolbox, the parfor loop will allow you to gain some performance. I used two basic ideas. First, mean(sum(pr<=x)) is actually equal to sum(pr(:)<=x)/M. Second, since p is sorted, you can just take the index as the number of elements less than or equal to x (under the assumption that every element is unique; otherwise you'll have to work with unique to do the full rigorous analysis).
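If p may contain repeated values, the "work with unique" remark could be realized roughly as follows. This is a sketch of one possible approach, not the code used above; the tiny input vector and the names u, ic, cnt_u, cnt and ref are made up for illustration.

```matlab
% Sketch: number of elements of p that are <= p(i), correct even with ties.
p = [0.3; 0.1; 0.3; 0.2];               % note the repeated 0.3

[u, ~, ic] = unique(p);                 % u is sorted and p == u(ic)
cnt_u = cumsum(accumarray(ic, 1));      % cnt_u(k) = number of p <= u(k)
cnt   = cnt_u(ic);                      % map back to the original order of p

% Brute-force reference, quadratic in numel(p)
ref = arrayfun(@(x) sum(p <= x), p);
assert(isequal(cnt, ref));              % both give [4; 1; 4; 2]
```

With all values distinct, cnt in sorted order reduces to 1:N, which is the plain index used in the loop above.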
As you should already know very well by running the profiler yourself, the line m = sum(pr<=x)/M; is the main resource hog. This can be tackled similarly to p by making use of the sorted nature of pr.
I tested my code (both for identical results and for time consumption) against yours. For N=20e3; M=100, I get about 63 seconds to run your code and 43 seconds to run mine on my main computer (MATLAB 2011a on 64 bit Arch Linux, 8 GiB RAM, Core i7 860). For smaller values of M the gain is larger. But this gain is in part due to parallelization.
edit2: Apparently I arrived at results very similar to Andrey's; my timings would have been much the same had I pursued the same approach.
However, I realised that there are some built-in functions that do more or less what you need, since the problem is quite similar to determining the empirical cumulative distribution function. This can be done by constructing the histogram:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
count = histc(pr(:), [0; p]);
count = cumsum(count(1:N));
pfdr = count./(1:N).';
pfdr(idxP) = pfdr/M;
For the same M and N as above, this code takes 228 milliseconds on my computer. It takes 104 milliseconds for Andrey's parameters, so on my computer it turns out a bit slower, but I think this code is far more readable than intricate for loops (as was the case in both our examples).
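In newer releases histc is documented as not recommended, and histcounts (available since R2014b) is the suggested replacement. A possible adaptation of the histogram approach is sketched below. It is an assumption-laden variant rather than the code timed above: it assumes there are no ties within p or between p and pr (true with probability one for rand data), and it checks itself against the original arrayfun formulation on a small problem.

```matlab
% Sketch: histogram-based FDR estimate using histcounts instead of histc.
N = 200; M = 10;
p  = rand(N, 1);
pr = rand(N, M);

% Reference: the original arrayfun formulation (quadratic in N)
ref = arrayfun(@(x) mean(sum(pr <= x)) ./ sum(p <= x), p);

[ps, idx] = sort(p);
counts = histcounts(pr(:), [0; ps]);    % N bins, edges at 0, ps(1), ..., ps(N)
cum    = cumsum(counts(:));             % cum(k) = number of pr values below ps(k)

pfdr = zeros(N, 1);
pfdr(idx) = cum ./ (1:N).' / M;         % undo the sort and divide by M

fprintf('max abs difference to arrayfun version: %g\n', max(abs(pfdr - ref)));
```

The bins are half-open on the right, so a pr value exactly equal to an interior edge would be counted one bin later than the <= comparison in the original formula implies; this is why the no-ties assumption matters.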
Following the discussion between me and Andrey in this question, this very late answer is just to prove to Andrey that vectorized solutions are still faster than JIT'ed loops, they sometimes just aren't as easy to find.
I am more than willing to remove this answer if it is deemed inappropriate by the OP.
Now, on to business, here's the original arrayfun, looped version by Andrey, and vectorized version by Egon:
function test
clc
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1);
pr = rand(N,M);
%% first option
tic
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
%% second option
tic
out = zeros(numel(p),1);
[p2,sortIndex] = sort(p);
pr2 = sort(pr(:));
pr2(end+1) = Inf;
place = 1;
for i=1:numel(p2)
x = p2(i);
while pr2(place)<=x
place = place+1;
end
exp1a = place-1;
exp2 = i;
out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
%% third option
tic
[p2,sortIndex] = sort(p);
count = histc(pr2(:), [0; p2]);
count = cumsum(count(1:N));
out = count./(1:N).';
out(sortIndex) = out/M;
toc
end
Results on my laptop:
Elapsed time is 0.916196 seconds.
Elapsed time is 0.011429 seconds.
Elapsed time is 0.007328 seconds.
and for N=10000; M = 100; :
Elapsed time is 38.082718 seconds.
Elapsed time is 0.127052 seconds.
Elapsed time is 0.042686 seconds.
So: vectorized is roughly 1.5 to 3 times faster than the JIT'ed loop.
I'm trying to use the parfor loop in the matlab parallelism package.
I'm having a similar problem to this guy: MATLAB parfor slicing issue?. The output matrix doesn't seem to be recognized as a sliced variable. In my specific case, I'm trying to nest other for loops inside the parfor, and I have trouble applying the solution proposed in the other thread to my problem. Here is a dummy example of what I'm trying to do:
n=175;
matlabpool;
Matred=zeros(n,n);
Matx2Cell = cell(n);
parfor i=1:n
for j=1:n
for k=1:n
Matred(j,k)=exp((j+i+k)/500)
end;
end;
Matx2Cell{i}=Matred;
end;
matlabpool close;
P.S. I know it would work to put the parfor on the k-loop instead of the i-loop... But I'd still like to put it on the i-loop (I believe it would be more time-efficient in my real program).
Thanks a lot
Frédéric Godin
You can put Matred = zeros(n); into the parfor body, but this is very slow. Instead define a function with Matred = zeros(n); in it: effectively the same thing, but much faster:
function Matred = calcMatred(i,n)
Matred=zeros(n);
for j=1:n
for k=1:n
Matred(j,k)=exp((j+i+k)/500);
end
end
Here is a time comparison:
matlabpool
n = 175;
Matx2Cell = cell(n,1);
tic
parfor i=1:n
Matred=zeros(n);
for j=1:n
for k=1:n
Matred(j,k)=exp((j+i+k)/500);
end
end
Matx2Cell{i}=Matred;
end
toc
tic
parfor i=1:n
Matx2Cell{i}=calcMatred(i,n);
end
toc
matlabpool close
On my machine, it takes 7 seconds for the first one and 0.3 seconds for the second.
Also note that I've changed the declaration of Matx2Cell to cell(n,1) since cell(n) makes an n x n cell array.
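The difference between the two cell calls is easy to check with a short snippet (n = 3 is chosen arbitrarily, and the variable names are illustrative):

```matlab
% Sketch: cell(n) allocates an n-by-n cell array, cell(n,1) an n-by-1 column.
n = 3;
c_square = cell(n);
c_column = cell(n, 1);

disp(size(c_square))    % 3 3
disp(size(c_column))    % 3 1
```

With cell(n), the loop above would fill only the first n of the n^2 cells (using linear indexing) and leave the rest empty.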
You need to move Matred into the parfor loop body. This needs to be done because each iteration of the parfor needs a new copy of Matred.
n=175;
matlabpool;
Matx2Cell = cell(n);
parfor i=1:n
    Matred=zeros(n,n); % temporary variable: recreated in every iteration
    for j=1:n
        for k=1:n
            Matred(j,k)=exp((j+i+k)/500);
        end
    end
    Matx2Cell{i}=Matred;
end
matlabpool close;