Efficient multiplication of very large matrices in MATLAB - algorithm

I don't have enough memory to simply create a diagonal D-by-D matrix, since D is large; I keep getting an 'out of memory' error.
Instead of performing M x D x D operations in the first multiplication, I do M x D operations, but my code still takes ages to run.
Can anybody suggest a more efficient way to perform the multiplication A'*B*A? Here's what I've attempted so far:
D = 20000;
M = 25;
A = floor(rand(D,M)*10);
B = floor(rand(1,D)*10);   % the diagonal of the D-by-D matrix
result = zeros(D,M);
for i = 1:D
    for j = 1:M
        result(i,j) = A(i,j) * B(1,i);   % scale row i of A by B(i)
    end
end
manual = result' * A;      % M-by-M
auto = A'*diag(B)*A;       % M-by-M, but diag(B) is the D-by-D memory hog
isequal(manual,auto)

One option that should solve your problem is using sparse matrices. Here's an example:
D = 20000;
M = 25;
A = floor(rand(D,M).*10);    % A D-by-M matrix
diagB = rand(1,D).*10;       % Main diagonal of B
B = sparse(1:D,1:D,diagB);   % A sparse D-by-D diagonal matrix
result = (A.'*B)*A;          % An M-by-M result
Another option would be to replicate the D elements along the main diagonal of B to create an M-by-D matrix using the function REPMAT, then use element-wise multiplication with A.':
B = repmat(diagB,M,1);   % Replicate diagB to create an M-by-D matrix
result = (A.'.*B)*A;     % An M-by-M result
And yet another option would be to use the function BSXFUN:
result = bsxfun(@times,A.',diagB)*A;   % An M-by-M result

Maybe I'm having a bit of a brainfart here, but can't you turn your DxD matrix into a DxM matrix (with M copies of the diagonal vector you're given) and then use .* on the last two matrices rather than matrix-multiplying them (and then, of course, ordinarily multiply the first matrix by the resulting product)?

You are getting "out of memory" because MATLAB cannot find a chunk of memory large enough to hold the entire matrix. There are different techniques for avoiding this error, described in the MATLAB documentation.
In MATLAB you usually do not need to program explicit loops, because you can use the * operator directly. There is also a technique for speeding up matrix multiplication when it is done with explicit loops; here is an example in C#. Its key idea is that a (potentially large) matrix can be split into smaller sub-matrices, which in MATLAB you can hold in a cell array. It is much more likely that the system will find enough RAM for two small sub-matrices at a time than for the single large matrix.
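As a hedged illustration of that splitting idea (my own sketch, not from the linked example; the block size blk is an arbitrary choice), here is how A'*diag(B)*A from the question could be accumulated over row blocks of A, so that no D-by-D matrix is ever formed:
D = 20000; M = 25; blk = 5000;     % blk = rows of A processed per chunk
A = floor(rand(D,M)*10);
diagB = floor(rand(1,D)*10);       % diagonal of B
result = zeros(M,M);
for r0 = 1:blk:D
    rows = r0:min(r0+blk-1,D);
    Ablk = A(rows,:);                              % blk-by-M slice
    scaled = bsxfun(@times, diagB(rows).', Ablk);  % scale each row by B's diagonal
    result = result + Ablk.' * scaled;             % partial A'*diag(B)*A
end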

Related

How to perform operation for all matrix elements in Scilab?

I'm trying to simulate the heat distribution on an infinite plate over time. For this purpose, I've written a Scilab script. Now, the crucial point of it is the calculation of temperature for all plate points, and it has to be done for every time instant I want to observe:
for j=2:S-1
    for i=2:S-1
        heat(i,j) = tcoeff*10000*(plate(i-1,j) + plate(i+1,j) - 4*plate(i,j) + plate(i,j-1) + plate(i,j+1)) + plate(i,j);
    end
end
The problem is that for a 100x100-point plate this means (counting only the inner part, without boundary conditions) looping 98x98 = 9604 times, at every turn calculating the heat at a given (i,j) point. If I want to observe that for, say, 100 seconds with a 1 s step, I have to repeat it 100 times, giving 960,400 iterations in total. That takes quite a long time, and I'd like to avoid it. Up to a 50x50 plate it all happens in a reasonable 4-5 second time frame.
Now my question is - is it necessary to do all this using for loops? Is there any built-in aggregate function in Scilab, that will let me do this for all elements of a matrix? The reason I haven't found a way yet, is that the result for every point depends on the values of other matrix points, and that made me do it with nested loops. Any ideas on how to make it faster appreciated.
It seems to me that you want to compute a 2D intercorrelation of your heat field with a certain diffusion pattern. This pattern can be thought of as a "filter" kernel, which is a common way to modify images with a linear filter matrix. Your "filter" is:
F=[0,1,0;1,-4,1;0,1,0];
If you install the Image Processing Design Toolbox (IPD) you will have a MaskFilter function to do this 2D intercorrelation.
S=500;
plate=rand(S,S);
tcoeff=1;

//your solution with nested for loops
t0=getdate();
for j=2:S-1
    for i=2:S-1
        heat(i,j) = tcoeff*10000*(plate(i-1,j)+plate(i+1,j)-..
                    4*plate(i,j)+plate(i,j-1)+plate(i,j+1))+plate(i,j);
    end
end
t1=getdate();
T0=etime(t1,t0);
mprintf("\nNested for loops: %f s (100 %%)",T0);

//optimised nested for loops
F=[0,1,0;1,-4,1;0,1,0]; //"filter" matrix
F=tcoeff*10000*F;
heat2=zeros(plate);
t0=getdate();
for j=2:S-1
    for i=2:S-1
        heat2(i,j)=sum(F.*plate(i-1:i+1,j-1:j+1));
    end
end
heat2=heat2+plate;
t1=getdate();
T2=etime(t1,t0);
mprintf("\nNested for loops optimised: %f s (%.2f %%)",T2,T2/T0*100);

//MaskFilter from the IPD toolbox
t0=getdate();
heat3=MaskFilter(plate,F);
heat3=heat3+plate;
t1=getdate();
T3=etime(t1,t0);
mprintf("\nWith MaskFilter: %f s (%.2f %%)",T3,T3/T0*100);
disp(heat3(1:10,1:10)-heat(1:10,1:10),"Difference of the results (heat3-heat):");
Please note, that MaskFilter pads the image (the original matrix) before applying the filter, and as far as I know it uses a "mirror" array across the border. You should check whether this behaviour is appropriate for you or not.
The speed increase is about 320x (the execution time is 0.32% of your original code's). Is that fast enough?
In theory it could be done with two 2D Fourier transforms (perhaps with Scilab's builtin mfft), but it might not be faster than this. See here: http://mailinglists.scilab.org/Image-processing-filter-td2618144.html#a2618168
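As a hedged cross-check of this filtering view (my own sketch, written in MATLAB; Scilab's conv2 behaves similarly), the whole stencil can be applied in one call: F is symmetric under rotation, so convolution equals correlation here.
S = 500; plate = rand(S,S); tcoeff = 1;
F = tcoeff*10000*[0 1 0; 1 -4 1; 0 1 0];   % the stencil above
heat = plate + conv2(plate, F, 'same');    % applies the stencil everywhere
% Note: conv2 zero-pads the border, while the question's loops leave the
% boundary untouched, so only the interior values match the loop version.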
Please consider that there is a big difference between vectorizing an operation and parallel computation, as I have explained here. Although vectorizing might improve performance a little, that's not comparable to what you can achieve through GPU computing, for example via OpenCL. I will try to explain a vectorized form of your code without going too much into the details. Consider these as given:
S = ...;
tcoeff = ...;
function Plate = plate(i, j)
...;
endfunction
function Heat = heat(i, j)
...;
endfunction
Now you could define a meshgrid:
x = 2 : S - 1;
y = 2 : S - 1;
[M, N] = meshgrid(x,y);
Result = feval(M, N, heat);
The feval is the key here which will broadcast the feval function over the M and N matrices.
Your scheme is a finite-difference discretization of the Laplacian operator on a rectangular grid. If you choose a row-wise or column-wise numbering of your degrees of freedom (here the plate(i,j)) so that they can be treated as one vector, then applying your "discrete" Laplacian amounts to multiplying by a sparse matrix on the left, which is very fast. This is particularly well explained in the following document:
https://www.math.uci.edu/~chenlong/226/FDMcode.pdf
The implementation there is written in Matlab but is easily translated into Scilab.
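A minimal MATLAB sketch of that idea (my own, assuming the same 5-point stencil; translating it to Scilab is mechanical):
S = 100;
e = ones(S,1);
L1 = spdiags([e -2*e e], -1:1, S, S);   % 1D second-difference matrix
I = speye(S);
Lap = kron(I,L1) + kron(L1,I);          % sparse 2D discrete Laplacian, S^2-by-S^2
plate = rand(S,S); tcoeff = 1;
heat = plate + tcoeff*10000*reshape(Lap*plate(:), S, S);
% Interior points match the nested loops; the rows of Lap touching the
% boundary differ from the question's untouched border and would need adjusting.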

Faster alternative to INTERSECT with 'rows' - MATLAB

I have code written in Matlab that uses 'intersect' to find the vectors (and their indices) common to two large matrices. I found that 'intersect' is the slowest line in my code, by a large margin. Unfortunately I couldn't find a faster alternative so far.
As an example, running the code below takes approx 5 seconds on my pc:
profile on
for i = 1 : 500
    a = rand(10000,5);
    b = rand(10000,5);
    [intersectVectors, ind_a, ind_b] = intersect(a,b,'rows');
end
profile viewer
I was wondering if there is a faster way. Note that the matrices (a) and (b) have 5 columns. The number of rows doesn't necessarily have to be the same for the two matrices.
Any help would be great.
Thanks
Discussion and solution codes
You can use an approach that leverages fast matrix multiplication in MATLAB to convert those 5 columns of the input arrays into one column, by considering each column as a "digit" of a single number (for example, with entries in 0-9 the row [3 1 4] maps to the key 314). Thus, you end up with an array with only one column, and then you can use intersect or ismember without 'rows', and that must speed up the code in a big way!
Here are the promised implementations as function codes for easy usage -
intersectrows_fast_v1.m:
function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v1(a,b)
% Calculate equivalent one-column versions of input arrays
base = 10^ceil(log10(1+max([a(:);b(:)])));   % base larger than any entry
mult = (base.^(size(a,2)-1:-1:0)).';         % column vector of digit weights
acol1 = a*mult;
bcol1 = b*mult;
% Use intersect without 'rows' option for a good speedup
[~, ind_a, ind_b] = intersect(acol1,bcol1);
intersectVectors = a(ind_a,:);
return;
intersectrows_fast_v2.m:
function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v2(a,b)
% Calculate equivalent one-column versions of input arrays
base = 10^ceil(log10(1+max([a(:);b(:)])));   % base larger than any entry
mult = (base.^(size(a,2)-1:-1:0)).';         % column vector of digit weights
acol1 = a*mult;
bcol1 = b*mult;
% Use ismember to get indices of the common elements
[match_a,idx_b] = ismember(acol1,bcol1);
% With ismember, duplicate items are not taken care of automatically as
% they are with intersect, so find the duplicate items and remove them
% from the outputs of ismember
[~,a_sorted_ind] = sort(acol1);
a_rm_ind = a_sorted_ind([false;diff(sort(acol1))==0]);   % indices to be removed
match_a(a_rm_ind) = 0;
intersectVectors = a(match_a,:);
ind_a = find(match_a);
ind_b = idx_b(match_a);
return;
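A quick usage sketch (my own, not from the answer): the one-column keys are exact when the entries are non-negative integers, as with the floor-of-rand data below; with arbitrary floating-point data the weighted sums could in principle collide, so verify against plain intersect on your own data:
a = floor(rand(10000,5)*10);
b = floor(rand(10000,5)*10);
[iv1, ia1, ib1] = intersectrows_fast_v1(a,b);
[iv0, ia0, ib0] = intersect(a,b,'rows');   % reference result
isequal(sortrows(iv1), sortrows(iv0))      % should display 1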
Quick tests and conclusions
With the datasizes listed in the question, the runtimes were -
-------------------------- With original approach
Elapsed time is 3.885792 seconds.
-------------------------- With Proposed approach - Version - I
Elapsed time is 0.581123 seconds.
-------------------------- With Proposed approach - Version - II
Elapsed time is 0.963409 seconds.
The results seem to suggest a big advantage in favour of Version I of the two proposed approaches, with a whopping speedup of around 6.7x over the original approach!
Also, please note that if you don't need any one or two of the three outputs from the original intersect with 'rows' based approach, then both the proposed approaches could be further shortened for better runtime performances!

Looking for efficient way to perform a computation - Matlab

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w
    for y=1:h
        X{x,y}=zeros(w,h);
        for i=1:w
            for j=1:h
                X{x,y}(i,j)=f([x,y],[i,j]);
            end
        end
    end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the two innermost loops and replace them with a vectorised version. By the look of your f function this shouldn't be too bad.
First we need to construct two w-by-h matrices, one holding the row index i at every position and one holding the column index j, like so:
wMat=repmat((1:w)',1,h); % wMat(i,j) = i
hMat=repmat(1:h,w,1);    % hMat(i,j) = j
These represent the inner two loops, covering all (i,j) combinations at once. Now we can vectorise the calculation of f([x,y],[i,j]) = exp(-norm([x,y]-[i,j])^2/sigma^2):
for x=1:w
    for y=1:h
        sqDist=(x-wMat).^2+(y-hMat).^2; % squared Euclidean norms
        X{x,y}=exp(-sqDist/sigma^2);    % note the minus sign and the squared norm
    end
end
Here we have computed the (squared) Euclidean norm for all pairs of nodes handled by the inner loops at once.
Some discussion and code
The trick here is to perform the norm calculations on numeric arrays and to convert the results into the cell array version as late as possible. For the norm calculations you can take help of ndgrid and bsxfun, plus some permute + reshape to give the result the "shape" needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
% Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
% Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
                bsxfun(@minus,yi(:),yi(:).').^2);
% Get the actual function values
vals = exp(-normvals.^2/sigma^2);
% Get the values into blocks of a 4D array and then re-arrange to match
% the shape of the numeric-array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
% Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
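As a small sanity check (my own sketch, not part of the answer), the chain above can be compared against the direct definition of f for hypothetical small sizes:
w = 4; h = 3; sigma = 2;   % small illustrative sizes
[xi,yi] = ndgrid(1:w,1:h);
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + bsxfun(@minus,yi(:),yi(:).').^2);
vals = exp(-normvals.^2/sigma^2);
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
f = @(p,q) exp(-norm(p-q)^2/sigma^2);   % the original scalar function
abs(X{2,3}(4,1) - f([2,3],[4,1]))       % should print ~0 (up to round-off)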
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and inlining of the function f, runtime benchmarks were performed against the proposed vectorized approach with datasizes w, h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggests a whopping speedup of close to 20x with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, you may simply not have enough memory for bsxfun to work with; bsxfun is known to use up a lot of memory in exchange for a performance-efficient vectorized solution. So, for such huge-datasize cases, you can use the following loopy approach to replace the normvals calculation listed in the earlier bsxfun-based solution -
% Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
    normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
It seems to me that when you run through the cycle for x=w, y=h, you are calculating values that you need elsewhere as well, so you don't need to recalculate them. Once you have this:
for i=1:w
    for j=1:h
        temp(i,j)=f([x,y],[i,j]);
    end
end
then, since f is symmetric in its two arguments, every temp(i,j) is also X{i,j}(x,y), so each pass through the outer loops fills in one entry of every other cell. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler; see the sketch below.
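Taking that reuse idea further (my own hedged sketch, not from the answer): f depends only on the offset [x-i, y-j], so a single (2w-1)-by-(2h-1) template holds every value that any cell needs, and each X{x,y} is just a shifted window of it:
w = 4; h = 3; sigma = 2;                 % illustrative sizes
[di,dj] = ndgrid(-(w-1):w-1, -(h-1):h-1);
T = exp(-(di.^2 + dj.^2)/sigma^2);       % T(di+w,dj+h) = f at offset (di,dj)
X = cell(w,h);
for x = 1:w
    for y = 1:h
        X{x,y} = T(x-(1:w)+w, y-(1:h)+h);   % shifted window of the template
    end
end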

Why is my Matlab for-loop code faster than my vectorized version

I had always heard that vectorized code runs faster than for loops in MATLAB. However, when I tried vectorizing my MATLAB code it seemed to run slower.
I used tic and toc to measure the times. I changed only the implementation of a single function in my program. My vectorized version ran in 47.228801 seconds and my for-loop version ran in 16.962089 seconds.
Also, in my main program I used a large number for N, N = 1000000, and DataSet's size is 1x301, and I ran each version several times for different data sets of the same size and N.
Why is the vectorized version so much slower, and how can I improve the speed further?
The "vectorized" version
function [RNGSet] = RNGAnal(N,DataSet)
%Creates a random-number-generated set of numbers to check overall accuracy
% This function produces random numbers and normalizes a new data set
% derived from an old data set by multiplying by the random numbers and
% then dividing by N/2
randData = randint(N,length(DataSet));
tempData = repmat(DataSet,N,1);
RNGSet = randData .* tempData;
RNGSet = sum(RNGSet,1) / (N/2); % sum and normalize by the N
end
The "for-loop" version
function [RNGData] = RNGAnsys(N,Data)
%RNGAnsys This function produces statistical RNG data using a for loop
% This function will produce RNGData that will be used to plot on another
% plot that possesses the actual data
multData = zeros(N,length(Data));
for i = 1:length(Data)
    photAbs = randint(N,1); % Create N random 0's or 1's
    multData(:,i) = Data(i) * photAbs; % multiply each element in the molar data by the random numbers
end
sumData = sum(multData,1); % sum each individual energy level's data points
RNGData = (sumData/(N/2))'; % divide by N, but account for the 0.5 average by using N/2
end
Vectorization
A first glance at the for-loop code tells us that, since photAbs is a binary array each column of which is scaled by the corresponding element of Data, this binary structure can be exploited for vectorization, as is done in the code here -
function RNGData = RNGAnsys_vect1(N,Data)
% Get the 2D matrix of random ones and zeros
photAbsAll = randint(N,numel(Data));
% Take care of multData internally by summing along the columns of the
% binary 2D matrix and then multiplying each element of the sum with the
% corresponding scalar from Data (elementwise multiplication)
sumData = Data.*sum(photAbsAll,1);
% Divide by n, but account for 0.5 average by n/2
RNGData = (sumData./(N/2))';
return;
After profiling, it appears that the bottleneck is the creation of the random binary array. So, using a faster random binary array creator, as suggested in this smart solution, the above function could be further optimized like so -
function RNGData = RNGAnsys_vect2(N,Data)
% Create a random binary array and sum along the columns on the fly to
% save on any variable space that would otherwise be required.
% Also perform the elementwise multiplication as discussed before.
sumData = Data.*sum(rand(N,numel(Data))<0.5,1);
% Divide by n, but account for 0.5 average by n/2
RNGData = (sumData./(N/2))';
return;
Using that smart binary random array creator, the original code could be optimized as well; that version will be used for a fair benchmark between the optimized for-loop and vectorized codes later on. The optimized for-loop code is listed here -
function RNGData = RNGAnsys_opt1(N,Data)
multData = zeros(N,numel(Data));
for i = 1:numel(Data)
    % Create N random 0's or 1's using the smart approach, then multiply
    % each element of the molar data by the random numbers. Note the
    % parentheses: without them, <.5 would be applied to the product.
    multData(:,i) = Data(i) * (rand(N,1)<.5);
end
sumData = sum(multData,1); % sum each individual energy level's data points
RNGData = (sumData/(N/2))'; % divide by n, but account for 0.5 average by n/2
return;
Benchmarking
Benchmarking Code
N = 15000; % Kept at this value as it goes out of memory with higher N's.
           % The dataset size matters more anyway, as it decides how well
           % the vectorized code fares against the for-loop code
DS_arr = [50 100 200 500 800 1500 5000]; % Dataset sizes
timeall = zeros(2,numel(DS_arr));
for k1 = 1:numel(DS_arr)
    DS = DS_arr(k1);
    Data = rand(1,DS);
    f = @() RNGAnsys_opt1(N,Data); % Optimized for-loop code
    timeall(1,k1) = timeit(f);
    clear f
    f = @() RNGAnsys_vect2(N,Data); % Vectorized code
    timeall(2,k1) = timeit(f);
    clear f
end
% Display benchmark results
figure, hold on, grid on
plot(DS_arr,timeall(1,:),'-ro')
plot(DS_arr,timeall(2,:),'-kx')
legend('Optimized for-loop code','Vectorized code')
xlabel('Dataset size ->'), ylabel('Time(sec) ->')
avg_speedup = mean(timeall(1,:)./timeall(2,:))
title(['Average Speedup with vectorized code = ' num2str(avg_speedup) 'x'])
Results
(The benchmark figure, a plot of runtime against dataset size for the two versions whose title reports the average speedup, is not reproduced here.)
Concluding remarks
Based on my experience with MATLAB so far, neither for-loops nor vectorized techniques fit all situations; everything is situation-specific.
Try using the MATLAB profiler to determine which line or lines of code are taking the most time. That way you can find out whether the repmat function is what is slowing you down, as has been suggested. Let us know what you find, I'm interested!
randData = randint(N,length(DataSet));
allocates roughly a 2.4 GB array (301 x 1,000,000 doubles at 8 bytes each). Implicitly you create up to 4 of these monsters in your program, causing continuous cache misses.
Your for-loop code, by contrast, works on one column at a time and could nearly run inside the processor cache (or does, on the bigger Xeons).
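A middle ground (my own hedged sketch, not from the answers) that keeps the vectorized arithmetic but never materializes an N-by-length(DataSet) monster is to process the columns in blocks; the block size blk is an arbitrary tuning knob, and rand(...)<0.5 stands in for randint as above:
function RNGData = RNGAnsys_chunked(N,Data)
blk = 32;                   % columns per block; tune to your cache/RAM
M = numel(Data);
sumData = zeros(1,M);
for c0 = 1:blk:M
    cols = c0:min(c0+blk-1,M);
    % random 0/1 draws for this block only, summed immediately
    sumData(cols) = Data(cols) .* sum(rand(N,numel(cols))<0.5,1);
end
RNGData = (sumData/(N/2))'; % normalize as in the original code
end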

Matlab - if exists a faster way to assign values to big matrix?

I am a new student learning to use Matlab.
Could anyone please tell me whether there is a faster way, possibly without loops, to assign in each row just two values, 1 and -1, at different positions of a big sparse matrix?
My code builds a bimatrix (or bibimatrix) for the MILP constraint
f^k_{ij} <= y_{ij} for every arc (i,j) and all k ~= r,
in a multi-commodity flow model.
Naive approach:
bimatrix=[];
% create each row and then add it to bimatrix
newrow4=zeros(1,n*(n+1)^2);
for k=1:n
    for i=0:n
        for j=1:n
            if j~=i
                % change the values at two positions to 1 and -1
                newrow4(i*n^2+(j-1)*n+k)=1;
                newrow4((n+1)*n^2+i*n+j)=-1;
                % add to bimatrix
                bimatrix=[bimatrix; newrow4];
                % change newrow4 back to a zero row
                newrow4(i*n^2+(j-1)*n+k)=0;
                newrow4((n+1)*n^2+i*n+j)=0;
            end
        end
    end
end
OR:
% Generate the big matrix first (note that zeros() makes it dense).
bibimatrix=zeros(n^3,n*(n+1)^2);
t=1;
for k=1:n
    for i=0:n
        for j=1:n
            if j~=i
                % Change 2 positions in each row to 1 and -1.
                bibimatrix(t,i*n^2+(j-1)*n+k)=1;
                bibimatrix(t,(n+1)*n^2+i*n+j)=-1;
                t=t+1; % the semicolon matters: without it t is printed on every iteration
            end
        end
    end
end
With the above code in Matlab, the time to generate this matrix with n ~ 12 is more than 3 s. I need to generate a larger matrix in less time.
Thank you.
Suggestion: Use sparse matrices.
You should be able to create two vectors containing the column numbers where you want the +1 and -1 in each row. Let's call these two vectors vec_1 and vec_2. You should be able to do this without loops (if not, I still think the procedure below will be faster).
Let the size of your matrix be (max_row X max_col). Then you can create your matrix like this:
bibimatrix = sparse(1:max_row,vec_1,1,max_row,max_col);
bibimatrix = bibimatrix + sparse(1:max_row, vec_2,-1,max_row,max_col)
If you want to see the entire matrix (which you don't, since it's huge) you can write: full(bibimatrix).
EDIT:
You may also do it this way:
col_vec = [vec_1, vec_2];
row_vec = [1:max_row, 1:max_row];
s = [ones(1,max_row), -1*ones(1,max_row)];
bibimatrix = sparse(row_vec, col_vec, s, max_row, max_col)
Disclaimer: I don't have MATLAB available, so it might not be error-free.
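For this particular constraint matrix, the index vectors themselves can also be built without loops. A hedged sketch of my own, following the same k-outer, i-middle, j-inner ordering as the question's loops:
n = 12;
[J,I,K] = ndgrid(1:n, 0:n, 1:n);   % J varies fastest, matching the loop order
keep = J(:) ~= I(:);               % drop the j == i cases
j = J(keep); i = I(keep); k = K(keep);
vec_1 = i*n^2 + (j-1)*n + k;       % columns of the +1 entries
vec_2 = (n+1)*n^2 + i*n + j;       % columns of the -1 entries
max_row = numel(vec_1);            % = n^3
max_col = n*(n+1)^2;
bibimatrix = sparse([(1:max_row)'; (1:max_row)'], [vec_1; vec_2], ...
                    [ones(max_row,1); -ones(max_row,1)], max_row, max_col);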
