Can I calculate least squares in parallel in DolphinDB? - cpu

X = randNormal(0, 1, 1000000*1000)$1000000:1000
Y = randNormal(0, 1, 1000000*1)
bb = ols(Y, X)
The script above runs slowly in DolphinDB, with only one CPU core active. How can I improve its efficiency?
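For background on why this is parallelizable at all: OLS reduces to the normal equations, and the cross-products X'X and X'Y decompose into sums over row blocks, so each block of rows can be processed independently and the partial sums combined at the end. Here is a minimal sketch of that decomposition in MATLAB (illustrative only, not DolphinDB syntax; sizes scaled down):
% Blockwise normal equations: each chunk's cross-products can be
% computed independently (in parallel) and summed by a reducer.
n = 1e5; p = 10; chunks = 4;          % small stand-ins for 1000000 x 1000
X = randn(n, p); Y = randn(n, 1);
edges = round(linspace(0, n, chunks + 1));
XtX = zeros(p); XtY = zeros(p, 1);
for c = 1 : chunks                    % each iteration is independent work
    r = edges(c)+1 : edges(c+1);
    XtX = XtX + X(r,:)' * X(r,:);
    XtY = XtY + X(r,:)' * Y(r);
end
b = XtX \ XtY;                        % algebraically the same as X \ Y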

Related

Efficiency of diag() - MATLAB

Motivation:
In writing out a matrix operation that was to be performed over tens of thousands of vectors I kept coming across the warning:
Requested 200000x200000 (298.0GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.
The reason for this was my use of diag() to get the values down the diagonal of a matrix product. Because MATLAB is generally optimized for vector/matrix operations, I usually reach for the vectorized form when I first write code. In this case, however, MATLAB has to build the entire matrix just to read off its diagonal, which causes the memory and speed issues.
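The wasted work is clear if you write out what the diagonal actually is (my notation): \(\operatorname{diag}(x y^{\top})_i = x_i \, y_i\), so only \(n\) products are ever needed, while \(x y^{\top}\) materializes all \(n^2\) entries.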
Experiment:
I decided to test the use of diag() vs a for loop to see if at any point it was more efficient to use diag():
num = 200000; % Matrix dimension
x = ones(num, 1);
y = 2 * ones(num, 1);
% z = diag(x*y'); % Expression to solve
% Loop approach
tic
z = zeros(num,1);
for i = 1 : num
    z(i) = x(i)*y(i);
end
loopTime = toc;
% Dividing the too-large matrix into process-able chunks
fraction = [10, 20, 50, 100, 500, 1000, 5000, 10000, 20000];
chunkTime = zeros(size(fraction));
for k = 1 : length(fraction)
    f = fraction(k);
    % Operation to time
    tic
    z = zeros(num,1);
    for i = 1 : f                     % loop over all f chunks (not k)
        first = (i-1) * (num / f);
        last  = first + (num / f);
        z(first + 1 : last) = diag(x(first + 1 : last) * y(first + 1 : last)');
    end
    chunkTime(k) = toc;
end
% Plot results
figure;
hold on
plot(log10(fraction), log10(chunkTime));
plot(log10(fraction), repmat(log10(loopTime), 1, length(fraction)));
plot(log10(fraction), log10(chunkTime), 'g*'); % Mark measured points
legend('Partitioned Running Time', 'Loop Running Time');
xlabel('Log_{10}(Fractional Size)'), ylabel('Log_{10}(Running Time)'), title('Running Time Comparison');
This is the result of the test:
(NOTE: the red line represents the loop time as a threshold; it does not mean that the total loop time is constant regardless of the number of loops.)
From the graph it is clear that the chunked diag approach only becomes faster than the plain loop once the operation is broken down into roughly 200x200 square matrices.
Question:
Can someone explain why I'm seeing these results? Also, I would think that with MATLAB's ever-more optimized design, there would be built-in handling of these massive matrices within a diag() function call. For example, it could just perform the i = j indexed operations. Is there a particular reason why this might be prohibitive?
I also haven't really thought of memory implications for diag using the partition method, although it's clear that as the partition size decreases, memory requirements drop.
Test of speed of diag vs. a loop.
Initialization:
n = 10000;
M = randn(n, n); % create a random matrix
Test speed of diag:
tic;
d = diag(M);
toc;
Test speed of loop:
tic;
d = zeros(n, 1);
for i = 1:n
    d(i) = M(i,i);
end
toc;
This would test diag. Your code is not a clean test of diag...
Comment on where there might be confusion
Diag only extracts the diagonal of a matrix. If x and y are vectors and you do d = diag(x * y'), MATLAB first constructs the n-by-n matrix x*y' and then calls diag on that. This is why you get the "cannot construct 298GB matrix" error. The MATLAB interpreter does not optimize in some clever way here: it will not realize you only want the diagonal and construct just a vector instead of the full x*y' matrix. That simply does not happen.
Not sure if you're asking this, but the fastest way to calculate d = diag(x*y'), where x and y are n-by-1 vectors, would simply be: d = x.*y
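A quick way to check that claim, at a size where the full matrix still fits in memory (a timing sketch; timeit is the built-in MATLAB timer):
% Compare building the full outer product vs. the elementwise form.
n = 2000;                             % small enough that x*y' fits in memory
x = ones(n, 1);  y = 2 * ones(n, 1);
t_diag = timeit(@() diag(x * y'));    % builds the full n-by-n matrix first
t_elem = timeit(@() x .* y);          % touches only 2n elements
fprintf('diag: %.4gs   elementwise: %.4gs\n', t_diag, t_elem);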

symmetric multicore processor

For this question,
Consider the following code statements executing at the same time on four processors in a symmetric multicore processor (SMP). Assume that before these statements are executed, both x and y are 0.
Core1:x=2;
Core2:y=2;
Core3:w=x+y+1;
Core4:z=x+y;
What are all the possible resulting values of w, x, y and z? For each possible outcome, explain how you arrived at those values. You will need to examine all possible interleavings of instructions. (9 marks)
Would I be right in thinking that the answer would be:
x = 2, y = 2, w = 1, z = 0
x = 2, y = 2, w = 3, z = 2
x = 2, y = 2, w = 5, z = 4
As the code is executing on a symmetric multicore processor, the processor uses a single address space, meaning that if loads and stores are not synchronised, one of the processors could start working on the data before another has finished.
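If it helps to sanity-check the enumeration, here is a small brute-force sketch (my addition) that simulates every ordering of the four statements, assuming each assignment executes atomically:
% Enumerate all interleavings of the four statements and collect the
% reachable (w, z) pairs; x and y always end up 2.
results = zeros(0, 2);
orders = perms(1:4);                 % 24 possible orderings
for p = 1:size(orders, 1)
    x = 0; y = 0; w = NaN; z = NaN;  % initial state
    for stmt = orders(p, :)
        switch stmt
            case 1, x = 2;           % Core 1
            case 2, y = 2;           % Core 2
            case 3, w = x + y + 1;   % Core 3
            case 4, z = x + y;       % Core 4
        end
    end
    results(end+1, :) = [w z];       %#ok<AGROW>
end
unique(results, 'rows')              % all reachable (w, z) combinations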

Matlab's bsxfun() - what explains the performance differences when expanding along different dimensions?

In my line of work (econometrics/statistics), I frequently have to multiply matrices of different sizes and then perform additional operations on the resulting matrix. I have always relied on bsxfun() to vectorize the code, which in general I find to be more efficient than repmat(). But what I don't understand is why the performance of bsxfun() can sometimes be very different when expanding the matrices along different dimensions.
Consider this specific example:
x = ones(j, k, m);
beta = rand(k, m, s);
exp_xBeta = zeros(j, m, s);
for im = 1 : m
    for is = 1 : s
        xBeta = x(:, :, im) * beta(:, im, is);
        exp_xBeta(:, im, is) = exp(xBeta);
    end
end
y = mean(exp_xBeta, 3);
Context:
We have data from m markets, and within each market we want to calculate the expectation of exp(X * beta), where X is a j x k matrix and beta is a k x 1 random vector. We compute this expectation by Monte Carlo integration: make s draws of beta, compute exp(X * beta) for each draw, and then take the mean. Typically we get data with m > k > j, and we use a very large s. In this example I simply let X be a matrix of ones.
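In symbols, the loop computes the Monte Carlo estimate (my notation): \( y_{\cdot m} \approx \frac{1}{S} \sum_{s=1}^{S} \exp(X_m \beta_{ms}) \), where \( X_m \in \mathbb{R}^{j \times k} \) and \( \beta_{ms} \in \mathbb{R}^{k} \) is the s-th draw for market m.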
I did 3 versions of vectorization using bsxfun(), they differ by how X and beta are shaped:
Vectorization 1
x1 = x; % size [ j k m 1 ]
beta1 = permute(beta, [4 1 2 3]); % size [ 1 k m s ]
tic
xBeta = bsxfun(@times, x1, beta1);
exp_xBeta = exp(sum(xBeta, 2));
y1 = permute(mean(exp_xBeta, 4), [1 3 2 4]); % size [ j m ]
time1 = toc;
Vectorization 2
x2 = permute(x, [4 1 2 3]); % size [ 1 j k m ]
beta2 = permute(beta, [3 4 1 2]); % size [ s 1 k m ]
tic
xBeta = bsxfun(@times, x2, beta2);
exp_xBeta = exp(sum(xBeta, 3));
y2 = permute(mean(exp_xBeta, 1), [2 4 1 3]); % size [ j m ]
time2 = toc;
Vectorization 3
x3 = permute(x, [2 1 3 4]); % size [ k j m 1 ]
beta3 = permute(beta, [1 4 2 3]); % size [ k 1 m s ]
tic
xBeta = bsxfun(@times, x3, beta3);
exp_xBeta = exp(sum(xBeta, 1));
y3 = permute(mean(exp_xBeta, 4), [2 3 1 4]); % size [ j m ]
time3 = toc;
And this is how they performed (typically we get data with m > k > j, and we used a very large s):
j = 5, k = 15, m = 100, s = 2000:
For-loop version took 0.7286 seconds.
Vectorized version 1 took 0.0735 seconds.
Vectorized version 2 took 0.0369 seconds.
Vectorized version 3 took 0.0503 seconds.
j = 10, k = 15, m = 150, s = 5000:
For-loop version took 2.7815 seconds.
Vectorized version 1 took 0.3565 seconds.
Vectorized version 2 took 0.2657 seconds.
Vectorized version 3 took 0.3433 seconds.
j = 15, k = 35, m = 150, s = 5000:
For-loop version took 3.4881 seconds.
Vectorized version 1 took 1.0687 seconds.
Vectorized version 2 took 0.8465 seconds.
Vectorized version 3 took 0.9414 seconds.
Why is version 2 consistently the fastest? Initially, I thought the performance advantage came from s being set to dimension 1, which Matlab might be able to compute faster since it stores data in column-major order. But Matlab's profiler told me that the time taken to compute the mean was rather insignificant and more or less the same among all 3 versions. Matlab spent most of the time evaluating the line with bsxfun(), and that is also where the run-time difference was biggest among the 3 versions.
Any thought on why version 1 is always the slowest and version 2 is always the fastest?
I've updated my test code here:
Code
EDIT: earlier version of this post was incorrect. beta should be of size (k, m, s).
bsxfun is of course one of the good tools to vectorize things, but if you can somehow introduce matrix multiplication, that would be the best way to go about it, as matrix multiplication is really fast in MATLAB.
It seems here you can use matrix-multiplication to get exp_xBeta like so -
[m1,n1,r1] = size(x);
n2 = size(beta,2);
exp_xBeta_matmult = reshape(exp(reshape(permute(x,[1 3 2]),[],n1)*beta),m1,r1,n2)
Or directly get y as shown below -
y_matmult = reshape(mean(exp(reshape(permute(x,[1 3 2]),[],n1)*beta),2),m1,r1)
Explanation
To explain it in a bit more detail, we have the sizes as -
x : (j, k, m)
beta : (k, s)
Our end goal is to "eliminate" the k's from x and beta using matrix-multiplication. We can "push" the k in x to the end with permute, then reshape to a 2D array with k as the columns, i.e. ( j * m , k ), and then perform matrix-multiplication with beta ( k , s ) to give us ( j * m , s ). The product can then be reshaped to a 3D array ( j , m , s ), and taking the elementwise exponential gives exp_xBeta.
Now, if the final goal is y, which is getting the mean along the third dimension of exp_xBeta, it would be equivalent to calculating the mean along the rows of the matrix-multiplication product (j * m, s ) and then reshaping to ( j , m ) to get us y directly.
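As a sanity check (my addition), the reshape/matmult route can be verified against a plain loop at small sizes, with beta of size (k, s) as this answer assumes:
% Verify the matrix-multiplication route against a straightforward loop.
j = 4; k = 3; m = 5; s = 7;
x = rand(j, k, m);  beta = rand(k, s);
[m1, n1, r1] = size(x);
y_matmult = reshape(mean(exp(reshape(permute(x,[1 3 2]),[],n1)*beta),2),m1,r1);
y_loop = zeros(j, m);
for im = 1:m
    acc = zeros(j, 1);
    for is = 1:s
        acc = acc + exp(x(:,:,im) * beta(:,is));
    end
    y_loop(:,im) = acc / s;
end
max(abs(y_matmult(:) - y_loop(:)))   % tiny, on the order of machine eps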
I did some more experiments this morning. It seems that it has to do with the fact that Matlab stores data in column major order after all.
In doing these experiments, I also added vectorization version 4 which does the same thing but orders the dimensions slightly different than versions 1-3.
To recap, here are how x and beta are ordered in all 4 versions:
Vectorization 1:
x : (j, k, m, 1)
beta : (1, k, m, s)
Vectorization 2:
x : (1, j, k, m)
beta : (s, 1, k, m)
Vectorization 3:
x : (k, j, m, 1)
beta : (k, 1, m, s)
Vectorization 4:
x : (1, k, j, m)
beta : (s, k, 1, m)
code : bsxfun_test.m
The two most costly operations in this code are:
(a) xBeta = bsxfun(@times, x, beta);
(b) exp_xBeta = exp(sum(xBeta, dimK));
where dimK is the dimension of k.
In (a), bsxfun() has to expand x along the dimension of s and beta along the dimension of j. When s is much larger than other dimensions, we should see some performance advantage in vectorizations 2 and 4, since they assign s as the first dimension.
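To make precise what "expand" means: bsxfun virtually replicates singleton dimensions, so the output size is the elementwise maximum of the two input sizes. A tiny illustration (sizes arbitrary):
% Dimension 1 of x2 and dimension 2 of beta2 are singletons, so both are
% expanded; non-singleton dimensions must already agree.
x2 = rand(1, 5, 3, 4);  beta2 = rand(100, 1, 3, 4);
size(bsxfun(@times, x2, beta2))      % returns [100 5 3 4]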
j = 100; k = 100; m = 100; s = 1000;
Vectorized version 1 took 2.4719 seconds.
Vectorized version 2 took 2.1419 seconds.
Vectorized version 3 took 2.5071 seconds.
Vectorized version 4 took 2.0825 seconds.
If instead s is trivial and k is huge, then vectorization 3 should be the fastest since it puts k in dimension 1:
j = 10; k = 10000; m = 100; s = 1;
Vectorized version 1 took 0.0329 seconds.
Vectorized version 2 took 0.1442 seconds.
Vectorized version 3 took 0.0253 seconds.
Vectorized version 4 took 0.1415 seconds.
If we swap the values of k and j in the last example, vectorization 1 becomes the fastest since j is assigned to dimension 1:
j = 10000; k = 10; m = 100; s = 1;
Vectorized version 1 took 0.0316 seconds.
Vectorized version 2 took 0.1402 seconds.
Vectorized version 3 took 0.0385 seconds.
Vectorized version 4 took 0.1608 seconds.
But in general, when k and j are close, j > k does not necessarily imply that vectorization 1 is faster than vectorization 3, since the operations performed in (a) and (b) are different.
In practice, I often have to run computation with s >>>> m > k > j. In such cases, it seems that ordering them in vectorization 2 or 4 gives the best results:
j = 10; k = 30; m = 100; s = 5000;
Vectorized version 1 took 0.4621 seconds.
Vectorized version 2 took 0.3373 seconds.
Vectorized version 3 took 0.3713 seconds.
Vectorized version 4 took 0.3533 seconds.
j = 15; k = 50; m = 150; s = 5000;
Vectorized version 1 took 1.5416 seconds.
Vectorized version 2 took 1.2143 seconds.
Vectorized version 3 took 1.2842 seconds.
Vectorized version 4 took 1.2684 seconds.
Takeaway: if bsxfun() has to expand along a dimension of size much bigger than other dimensions, assign that dimension to dimension 1!
Refer to this other question and answer
If you are going to process matrices of different dimensions using bsxfun, make sure that the biggest dimension of the matrices is kept in the first dimension.
Here is my small example test:
%// Inputs
%// Taking one very big and one small vector, so that the difference could be seen clearly
a = rand(1000000,1);
b = rand(1,5);
%//---------------- testing with inbuilt function
%// preferred orientation [1]
t1 = timeit(@() bsxfun(@times, a, b))
%// not preferred [2]
t2 = timeit(@() bsxfun(@times, b.', a.'))
%//---------------- testing with anonymous function
%// preferred orientation [1]
t3 = timeit(@() bsxfun(@(x,y) x*y, a, b))
%// not preferred [2]
t4 = timeit(@() bsxfun(@(x,y) x*y, b.', a.'))
[1] Preferred orientation - larger dimension as the first dimension
[2] Not preferred - smaller dimension as the first dimension
Small note: the outputs given by all four methods are the same, even though their dimensions may differ.
Results:
t1 = 0.0461
t2 = 0.0491
t3 = 0.0740
t4 = 7.5249
>> t4/t3
ans = 101.6878
Method 3 is roughly 100 times faster than Method 4.
To conclude:
Although the difference between the preferred and unfavored orientations is minimal for the built-in function, the difference becomes huge for the anonymous function. So it is best practice to put the bigger dimension in dimension 1.

matlab sum(X-Y) vs sum(X) - sum(Y)

If we have two matrices X and Y, both two-dimensional, then mathematically sum(X-Y) = sum(X) - sum(Y).
Which is more efficient in Matlab? Which is faster?
On my machine, sum(x-y) is slightly faster for small arrays, but sum(x)-sum(y) is quite a lot faster for larger arrays. To benchmark, I'm using MATLAB R2015a on a Windows 7 machine with 32GB memory.
n = ceil(logspace(0,4,25));
for i = 1:numel(n)
    x = rand(n(i));
    y = rand(n(i));
    t1(i) = timeit(@()sum(x-y));
    t2(i) = timeit(@()sum(x)-sum(y));
    clear x y
end
figure; hold on
plot(n, t1)
plot(n, t2)
legend({'sum(x-y)', 'sum(x)-sum(y)'})
xlabel('n'); ylabel('time')
set(gca, 'XScale', 'log', 'YScale', 'log')
You got me all curious and I decided to run some benchmarks. By the time I was done, it seems knedlsepp had it right: for larger matrices sum(X-Y) becomes quite a bit slower.
The crossover seems to happen around 10^3 elements.
%% // Benchmark code
nElem = (1:9).'*(10.^(1:6)) ; nElem = nElem(:) ; %'//damn pretifier
nIter = numel(nElem) ;
res = zeros(nIter,2) ;
for ii = 1:nIter
    X = rand(nElem(ii), 1);
    Y = rand(nElem(ii), 1);
    f1 = @() sum(X-Y);
    f2 = @() sum(X)-sum(Y);
    res(ii,1) = timeit(f1);
    res(ii,2) = timeit(f2);
    clear f1 f2 X Y
end
loglog(nElem,res,'DisplayName','nElem')
I ran that a few times and the results are quite consistent on my machine. I blew my memory trying to go above 10^7 elements. Feel free to extend the test but I don't think the relationship is going to change much.
Specs: Windows 8.1 Pro / Matlab R2013a.
Assuming that both x and y have N x M = K elements, then:
For sum(x) - sum(y) you have:
K memory accesses to read x, K memory accesses to read y, and one memory access to write the result --> 2K + 1 memory accesses (assuming the running total inside the sum function is held in a CPU register).
2K additions + 1 subtraction --> 2K + 1 CPU operations.
For sum(x - y):
Matlab will calculate x - y, store the output, and then calculate the sum. So we have K memory accesses to read x, K memory accesses to read y, K memory accesses to write the result of the subtraction, K memory accesses to read the subtraction result back for the sum function, and then 1 memory access to write the sum result --> 4K + 1 memory accesses.
K subtractions + K additions --> 2K CPU operations.
As we can see, sum(x - y) performs roughly twice as many memory accesses, so with a large number of elements it takes more time; I don't have an explanation for why it's faster for small numbers of elements.
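One caveat worth adding: the identity sum(X-Y) = sum(X) - sum(Y) is exact only in real arithmetic. In floating point the two expressions accumulate rounding differently and generally disagree by a small error, as a quick check shows:
% Mathematically equal, but not bit-for-bit identical in floating point.
x = rand(1e6, 1);  y = rand(1e6, 1);
abs(sum(x - y) - (sum(x) - sum(y)))  % small, but typically nonzero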

Matlab slow parallel processing with distributed arrays

I am new to using distributed and codistributed arrays in Matlab. The parallel code I have produced works, but is much slower than the serial version, and I have no idea why. The code examples below compute the eigenvalues of Hessian matrices from volumetric data.
Serial version:
S = size(D);
Dsmt=imgaussian(D,2,20);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
d = zeros([3 S(1) S(2) S(3)]);
for i = 1 : S(1)
    fprintf('Slice %d out of %d\n', i, S(1));
    for ii = 1 : S(2)
        for iii = 1 : S(3)
            d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
        end
    end
end
Parallel version:
S = size(D);
Dsmt=imgaussian(D,2,20);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
CDHess = distributed(DHess);
spmd
    d = zeros([3 S(1) S(2) S(3)], codistributor('1d',4));
    for i = 1 : S(1)
        fprintf('Slice %d out of %d\n', i, S(1));
        for ii = 1 : S(2)
            for iii = drange(1 : S(3))
                d(:,i,ii,iii) = eig(squeeze(CDHess(:,:,i,ii,iii)));
            end
        end
    end
end
If someone could shed some light on the issue I would be very grateful.
Here is a re-written version of your code. I have split the work over the outermost loop, not, as in your case, the innermost loop. I have also explicitly allocated local parts of the d result vector, and the local part of the Hessian matrix.
In your code you rely on drange to split the work, and you access the distributed arrays directly to avoid extracting the local part. Admittedly, it should not result in such a great slowdown if MATLAB did everything correctly. The bottom line is, I don't know why your code is so slow - most likely because MATLAB does some remote data accessing despite the fact that you distributed your matrices.
Anyway, the below code runs and gives pretty good speedup on my computer using 4 labs. I have generated synthetic random input data to have something to work on. Have a look at the comments. If something is unclear, I can elaborate later.
clear all;
D = rand(512, 512, 3);
S = size(D);
[fx, fy, fz] = gradient(D);
% this part could also be parallelized - at least a bit.
tic;
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
toc
% your sequential implementation
d = zeros([3, S(1) S(2) S(3)]);
disp('sequential')
tic
for i = 1 : S(1)
    for ii = 1 : S(2)
        for iii = 1 : S(3)
            d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
        end
    end
end
toc
% my parallel implementation
disp('parallel')
tic
spmd
% just for information
disp(['lab ' num2str(labindex)]);
% distribute the input data along the third dimension
% This is the dimension of the outer-most loop, hence this is where we
% want to parallelize!
DHess_dist = codistributed(DHess, codistributor1d(3));
DHess_local = getLocalPart(DHess_dist);
% create an output data distribution -
% note that this time we split along the second dimension
codist = codistributor1d(2, codistributor1d.unsetPartition, [3, S(1) S(2) S(3)]);
localSize = [3 codist.Partition(labindex) S(2) S(3)];
% allocate local part of the output array d
d_local = zeros(localSize);
% your ordinary loop, BUT! the outermost loop is split amongst the
% threads explicitly, using local indexing. In the loop only local parts
% of matrix d and DHess are accessed
for i = 1:size(d_local,2)
    for ii = 1 : S(2)
        for iii = 1 : S(3)
            d_local(:,i,ii,iii) = eig(squeeze(DHess_local(:,:,i,ii,iii)));
        end
    end
end
% assemble local results to a codistributed matrix
d_dist = codistributed.build(d_local, codist);
end
toc
isequal(d, d_dist)
And the output
Elapsed time is 0.364255 seconds.
sequential
Elapsed time is 33.498985 seconds.
parallel
Lab 1:
lab 1
Lab 2:
lab 2
Lab 3:
lab 3
Lab 4:
lab 4
Elapsed time is 9.445856 seconds.
ans =
1
Edit: I have checked the performance on a reshaped matrix DHess = [3 x 3 x N]. The performance is not much better (10%), so it is not substantial. But maybe you can implement the eig computation a bit differently? After all, those are 3x3 matrices you are dealing with.
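On that last point, one option (a sketch, my suggestion; it assumes each 3x3 Hessian is symmetric, which holds up to gradient-discretization error) is the trigonometric closed form for the eigenvalues of a symmetric 3x3 matrix, which avoids the per-voxel call overhead of eig:
% Closed-form eigenvalues of a symmetric 3x3 matrix (standard
% trigonometric method); save as eig3x3sym.m.
function d = eig3x3sym(A)
    q = trace(A) / 3;
    B = A - q * eye(3);
    p = sqrt(trace(B * B) / 6);
    if p < eps                       % (near-)multiple of the identity
        d = [q; q; q];
        return
    end
    r = det(B / p) / 2;
    r = max(-1, min(1, r));          % clamp against rounding error
    phi = acos(r) / 3;
    d = q + 2 * p * [cos(phi); cos(phi + 2*pi/3); cos(phi - 2*pi/3)];
end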
You didn't specify where you've opened your matlabpool, and that will be the main factor determining what speedup you get.
If you are using the 'local' scheduler, then there is often no benefit to using distributed arrays. In particular, if the time-consuming operations are multithreaded in MATLAB already, then they will almost certainly slow down when using the local scheduler since the matlabpool workers run in single-threaded mode.
If you are using some other scheduler with the workers on a separate machine then you might be able to get speedup, but that depends on what you're doing. There's an example here http://www.mathworks.com/products/parallel-computing/examples.html?file=/products/demos/shipping/distcomp/paralleldemo_backslash_bench.html which shows some benchmarks of MATLAB's \ operator.
Finally, it's worth noting that indexing distributed arrays is unfortunately rather slow, especially compared to MATLAB's built-in indexing. If you can extract the 'local part' of your codistributed arrays inside the spmd block and work exclusively with those, that might also help.
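A minimal sketch of that last suggestion (variable names mine; the partition construction mirrors the answer above): extract the local part once, loop over it with ordinary indexing, and rebuild a codistributed result only if you need one:
spmd
    A = codistributed(rand(1000), codistributor1d(2)); % split by columns
    L = getLocalPart(A);             % plain numeric array on this worker
    out = zeros(1, size(L, 2));
    for i = 1 : size(L, 2)           % fast built-in indexing on L
        out(i) = max(L(:, i));
    end
    % reassemble with the same default column partition as A
    outD = codistributed.build(out, ...
        codistributor1d(2, codistributor1d.unsetPartition, [1 size(A, 2)]));
end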
