Vectorizing Generalised Hebb Algorithm - performance

I have this code, which runs very slowly. Can someone help me vectorize it?
for ii=1:K,
y=w*x(ii,:)'; % y is N by 1
u=zeros(N,M);
disp(num2str(ii));
for jj=1:N,
u(jj,:)=y(jj)*(x(ii,:)-y(1:jj)'*w(1:jj,:));
end
wold=w;
w=wold+eta*u; % updated weight matrix
end
The inner loop takes the most time. The code is for the generalised Hebbian algorithm.
Input sizes:
M=153600;
K=5000;
N=400;
eta=0.004;
size(w)=400x153600 (N x M)
size(x)=5000x153600 (K x M)

You can eliminate the inner loop and compute u with bsxfun:
yN = y(1:N);
u = bsxfun(@times,yN,bsxfun(@minus,x(ii,:),cumsum(bsxfun(@times,w(1:N,:),yN))));
The outer loop is harder to vectorize because of the data dependency between iterations: each iteration uses the w updated by the previous one.
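The bsxfun answer rests on one identity: row jj of cumsum(y.*w) (summing down the rows) equals y(1:jj)'*w(1:jj,:), so a single running sum replaces the partial products the inner loop recomputes on every pass. As a sketch under that reading, here is a plain-Python check of the identity on small integer data; nested lists stand in for the N-by-M matrices, and the function names are hypothetical.

```python
def gha_u_loop(w, x_row):
    """Inner-loop form: u[j] = y[j] * (x - sum_{k<=j} y[k] * w[k])."""
    n, m = len(w), len(x_row)
    y = [sum(w[j][c] * x_row[c] for c in range(m)) for j in range(n)]  # y = w*x'
    u = []
    for j in range(n):
        # recomputes y(1:jj)'*w(1:jj,:) from scratch on every pass
        recon = [sum(y[k] * w[k][c] for k in range(j + 1)) for c in range(m)]
        u.append([y[j] * (x_row[c] - recon[c]) for c in range(m)])
    return u


def gha_u_cumsum(w, x_row):
    """Cumulative-sum form: reuse a running sum, as cumsum(bsxfun(@times,w,y)) does."""
    n, m = len(w), len(x_row)
    y = [sum(w[j][c] * x_row[c] for c in range(m)) for j in range(n)]
    running = [0] * m  # current row of cumsum(y .* w)
    u = []
    for j in range(n):
        running = [running[c] + y[j] * w[j][c] for c in range(m)]
        u.append([y[j] * (x_row[c] - running[c]) for c in range(m)])
    return u


w = [[1, 2, 0], [0, 1, 1], [2, -1, 1]]
x_row = [1, 1, 2]
print(gha_u_loop(w, x_row) == gha_u_cumsum(w, x_row))  # True
```

The running sum cuts the inner-loop work from O(N^2 M) to O(N M). In MATLAB, the same saving comes from the single cumsum call, at the cost of holding an N-by-M temporary.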

Related

Is there a more efficient way to code a program for the average clustering coefficient?

Calculation of Average clustering coefficient of a graph
I am getting the correct result, but it takes a huge amount of time as the graph dimension increases. I need an alternative that executes in less time. Is there any way to simplify the code?
%// A is adjacency matrix N X N,
%// d is degree ,
N=100;
d=10;
rand('state',0)
A = zeros(N,N);
kv=d*(d-1)/2;
%% Creating A matrix %%%
for i = 1:(d*N/2)
j = floor(N*rand)+1;
k = floor(N*rand)+1;
while (j==k)||(A(j,k)==1)
j = floor(N*rand)+1;
k = floor(N*rand)+1;
end
A(j,k)=1;
A(k,j)=1;
end
%% Calculation of clustering Coeff %%
for i=1:N
J=find(A(i,:));
et=0;
for ii=1:(size(J,2))-1
for jj=ii+1:size(J,2)
et=et+A(J(ii),J(jj));
end
end
Cv(i)=et/kv;
end
Avg_clustering_coeff=sum(Cv)/N;
The output I got:
Avg_clustering_coeff = 0.1107
The clustering coefficient calculation can be vectorized with nchoosek, removing the two innermost nested loops, like so:
CvOut = zeros(1,N);
for k=1:N
J=find(A(k,:));
if numel(J)>1
idx = nchoosek(J,2);
CvOut(k) = sum(A(sub2ind([N N],idx(:,1),idx(:,2))));
end
end
CvOut=CvOut/kv;
Hopefully, this boosts the performance quite a bit!
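The nchoosek version enumerates each pair of neighbours exactly once and counts how many of those pairs are adjacent. A minimal Python sketch of the same counting, in which itertools.combinations plays the role of nchoosek(J,2) and the normaliser kv = d*(d-1)/2 is kept exactly as the question defines it (function names are illustrative):

```python
from itertools import combinations


def neighbour_edge_counts(A):
    """For each node k, count edges among its neighbours (the 'et' of the question)."""
    n = len(A)
    counts = []
    for k in range(n):
        J = [j for j in range(n) if A[k][j]]  # like find(A(k,:))
        # combinations(J, 2) yields each neighbour pair once, like nchoosek(J,2)
        counts.append(sum(A[a][b] for a, b in combinations(J, 2)))
    return counts


def avg_clustering(A, d):
    """Average of counts/kv, using the question's fixed normaliser kv = d*(d-1)/2."""
    kv = d * (d - 1) / 2
    return sum(c / kv for c in neighbour_edge_counts(A)) / len(A)


# Triangle 0-1-2 plus a pendant node 3 attached to node 2
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(neighbour_edge_counts(A))  # [1, 1, 1, 0]
print(avg_clustering(A, d=2))    # 0.75
```

Nodes with fewer than two neighbours produce no pairs and contribute 0, which is what the numel(J)>1 guard handles in the MATLAB version.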
To speed up your code, you can follow my comment, but you will not reduce the computation time drastically, because the time complexity doesn't change.
But if you don't need an exact result, you can use probability instead.
probnum = cumsum(1:d);
probnum = mean(probnum(end-1:end)); %theoretical number of elements created by your second loop (for each row).
probfind = d*N/(N^2); %probability of finding a non zero value.
coeff = probnum*probfind/kv;
This probabilistic coeff approaches Avg_clustering_coeff for large N.
So you can use the exact method for small N and this estimate for large N.
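Working the formula through helps explain why it matches: mean(probnum(end-1:end)) averages d(d-1)/2 and d(d+1)/2, which gives d^2/2, so the estimate simplifies to (d^2/2)(d/N)/(d(d-1)/2) = d^2/((d-1)N). For d = 10 and N = 100 that is 1/9 ≈ 0.1111, close to the 0.1107 measured above. A small Python transcription of the answer's three lines, assuming d >= 2 (the function name is illustrative):

```python
from itertools import accumulate


def estimated_clustering(N, d):
    """Closed-form estimate from the answer above; assumes d >= 2."""
    cs = list(accumulate(range(1, d + 1)))  # cumsum(1:d)
    probnum = (cs[-2] + cs[-1]) / 2         # mean(probnum(end-1:end)), equal to d^2/2
    probfind = d * N / N ** 2               # probability that an entry of A is nonzero
    kv = d * (d - 1) / 2
    return probnum * probfind / kv


print(round(estimated_clustering(100, 10), 4))  # 0.1111
```

Since the estimate costs O(d) regardless of N, it only makes sense when an approximate value is acceptable.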

Looking for an efficient way to perform a computation - MATLAB

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w;
for y=1:h;
X{x,y}=zeros(w,h);
for i=1:w
for j=1:h
X{x,y}(i,j)=f([x,y],[i,j]);
end
end
end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the two innermost loops and replace them with a vectorised version. Judging by your f function, this shouldn't be too hard.
First, construct two w-by-h matrices: wMat holds the row index i (1 to w down every column) and hMat holds the column index j (1 to h along every row):
wMat=repmat((1:w)',1,h); % wMat(i,j) = i
hMat=repmat(1:h,w,1); % hMat(i,j) = j
Together these represent the inner two loops and cover every (i,j) combination. Now we can vectorise the calculation of f([x,y],[i,j]) = exp(-norm([x,y]-[i,j])^2/sigma^2):
X=cell(w,h);
for x=1:w
for y=1:h
sqDist=(x-wMat).^2+(y-hMat).^2; % squared Euclidean norm, so no sqrt is needed
X{x,y}=exp(-sqDist/(sigma^2));
end
end
Here f is evaluated for all (i,j) pairs of the inner loops at once.
Some discussion and code
The trick here is to perform the norm calculations on numeric arrays and convert the results into a cell array as late as possible. For the norm calculations you can use ndgrid and bsxfun, followed by some permute and reshape calls to give the data the shape needed for the final cell array. Here is the vectorized approach:
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
Benchmarking and runtimes
After improving the original loopy code with pre-allocation of X and inlining of f, runtime benchmarks were run against the proposed vectorized approach with data sizes w = h = 60. The results were:
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggests a whopping speedup of close to 20x with the proposed solution!
For extremely large data sizes
If you are dealing with huge data sizes, bsxfun may not have enough memory to work with, since it uses a lot of memory to deliver a fast vectorized solution. For such cases, you can replace the normvals calculation in the bsxfun-based solution above with the following loop:
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
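Python has no bsxfun, so the following sketch only mirrors the data layout of this answer rather than its speed: every pairwise value is computed once into a flat numeric table whose rows and columns follow the column-major order of xi(:) and yi(:), and the table is carved into per-(x, y) blocks only at the very end, as mat2cell does. Indices are 0-based, and the names are illustrative.

```python
import math


def f(x, y, i, j, sigma):
    """The question's kernel: exp(-||[x,y]-[i,j]||^2 / sigma^2)."""
    return math.exp(-((x - i) ** 2 + (y - j) ** 2) / sigma ** 2)


def build_X(w, h, sigma):
    """Return X with X[x][y][i][j] == f(x, y, i, j, sigma) (0-based indices)."""
    # Grid points in column-major order, like xi(:), yi(:) from ndgrid(1:w,1:h)
    pts = [(i, j) for j in range(h) for i in range(w)]
    # All pairwise values computed in one numeric pass: the 'vals' matrix
    vals = [[f(px, py, qx, qy, sigma) for (qx, qy) in pts] for (px, py) in pts]
    # Split into per-(x, y) blocks only at the end
    X = [[None] * h for _ in range(w)]
    for x in range(w):
        for y in range(h):
            row = vals[x + w * y]  # column-major linear index of point (x, y)
            X[x][y] = [[row[i + w * j] for j in range(h)] for i in range(w)]
    return X


X = build_X(3, 2, sigma=1.5)
print(X[2][1][0][0] == f(2, 1, 0, 0, 1.5))  # True
```

The table has (w*h)^2 entries, which is exactly why the answer falls back to a column-by-column loop for very large w and h.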
It seems to me that when you run through the cycle for x=w, y=h, you are calculating all the values you need at once, so you don't need to recalculate them. Once you have this:
for i=1:w
for j=1:h
temp(i,j)=f([x,y],[i,j]);
end
end
Then, e.g. X{1,1} is just temp(1,1), X{2,2} is just temp(1:2,1:2), and so on. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.
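One way to make this reuse idea exact, offered as a sketch rather than as this answer's own indexing: because f depends only on the offset [x-i, y-j], every X{x,y} is a shifted w-by-h window cut from a single (2w-1)-by-(2h-1) table of kernel values. That needs O(wh) exponentials instead of (wh)^2. Indices are 0-based, and the names are illustrative.

```python
import math


def f(x, y, i, j, sigma):
    return math.exp(-((x - i) ** 2 + (y - j) ** 2) / sigma ** 2)


def build_X_from_offsets(w, h, sigma):
    """X[x][y][i][j] == f(x, y, i, j, sigma), read from one table of offsets."""
    # Offsets dx = x - i and dy = y - j range over -(w-1)..(w-1) and -(h-1)..(h-1)
    K = [[math.exp(-(dx * dx + dy * dy) / sigma ** 2) for dy in range(-(h - 1), h)]
         for dx in range(-(w - 1), w)]
    # Each X[x][y] is a w-by-h window of K, shifted according to (x, y)
    return [[[[K[x - i + w - 1][y - j + h - 1] for j in range(h)] for i in range(w)]
             for y in range(h)] for x in range(w)]


X = build_X_from_offsets(4, 3, sigma=2.0)
print(math.isclose(X[1][2][3][0], f(1, 2, 3, 0, 2.0)))  # True
```

In MATLAB, the same table can be built once and each cell filled by indexing a sub-block of it, with no further calls to exp.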

How to multiply each column of matrix A by each row of matrix B and sum resulting matrices in Matlab?

I have a problem which I hope can be easily solved.
A is an N×G matrix and B is an N×G matrix. The goal is to get the matrix C,
which equals the result of multiplying each column of transposed A by each row of B and summing the resulting matrices; before summing there are N×N such matrices, each of size G×G.
This can easily be done in MATLAB with two for-loops:
N=5;
G=10;
A=rand(N,G);
B=rand(N,G);
C=zeros(G);
for n=1:1:N
for m=1:1:N
C=C+A(m,:)'*B(n,:);
end
end
However, for large matrices it is quite slow.
So, my question is:
is there a more efficient way for calculating C matrix in Matlab?
Thank you
If you write it all out for two 3×3 matrices, you'll find that the operation basically equals this:
C = bsxfun(@times, sum(B), sum(A).');
Running each of the answers here for N=50, G=100 and repeating each method 100 times:
Elapsed time is 13.839893 seconds. %// OP's original method
Elapsed time is 19.773445 seconds. %// Luis' method
Elapsed time is 0.306447 seconds. %// Robert's method
Elapsed time is 0.005036 seconds. %// Rody's method
(a factor of ≈ 4000 between the fastest and slowest method...)
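Writing the sum out explains the identity: entry (g1,g2) of C is the sum over m and n of A(m,g1)·B(n,g2), which factors into (sum over m of A(m,g1))·(sum over n of B(n,g2)). That is the outer product sum(A).'*sum(B), which is what the bsxfun call computes, and it is also what the one-loop answer below accumulates. A plain-Python check on small integer matrices (function names are hypothetical):

```python
def c_double_loop(A, B):
    """The question's loops: C = sum over m, n of A(m,:)' * B(n,:)."""
    G = len(A[0])
    C = [[0] * G for _ in range(G)]
    for a_row in A:
        for b_row in B:
            for g1 in range(G):
                for g2 in range(G):
                    C[g1][g2] += a_row[g1] * b_row[g2]
    return C


def c_from_sums(A, B):
    """Outer product of column sums: sum(A).' * sum(B)."""
    sa = [sum(col) for col in zip(*A)]  # sum(A), a 1-by-G row
    sb = [sum(col) for col in zip(*B)]  # sum(B)
    return [[sa[g1] * sb[g2] for g2 in range(len(sb))] for g1 in range(len(sa))]


A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0, 2], [3, 1, 1]]
print(c_from_sums(A, B) == c_double_loop(A, B))  # True
```

The double loop does O(N^2 G^2) work, while the column sums plus outer product need only O(N G + G^2), which matches the size of the gap in the timings above.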
I think this should improve the performance significantly
C = zeros(G);
for n = 1:N
C = C + sum(A,1)'*B(n,:);
end
You avoid one loop, and should also avoid running out of memory. According to my benchmarks, it's about 20 times faster than the approach with two loops. (Note: I had to benchmark in Octave since I don't have MATLAB on this PC.)
Use bsxfun instead of the loops, and then sum twice:
C = sum(sum(bsxfun(@times, permute(A, [2 3 1]), permute(B,[3 2 4 1])), 3), 4);

How to make this loop faster in matlab

I have to multiply arrays A and B element by element, sum along the first dimension, and return the result in C. A is an N-by-M-by-L array and B is an N-by-1-by-L array. N and M are below 30, but L is very large. My code is:
C=zeros(size(B));
parfor i=1:size(A,2)
C(i,1,:) = sum(bsxfun(@times, A(:,i,:), B(:,1,:)), 1);
end
The problem is that the code is slow. Can anyone help make it faster? Thank you very much.
How about something along the lines of this:
C = permute(sum(A.*repmat(B,1,M)),[2,1,3]);
This speeds up the computation on my PC by a factor of ~4. Interestingly enough, you can actually speed up the computation by a factor of 2 (at least on my PC) simply by changing the parfor loop to a for loop.
If I understand correctly, just do this:
C = squeeze(sum(bsxfun(@times, A, B)));
This gives C with size M x L.
Building on Luis Mendo's comments, I propose this command:
C=reshape(sum(bsxfun(@times, A, B), 1), [size(A,2) 1 size(A,3)]) %// M-by-1-by-L, the layout of the loop's C(i,1,:)
I think this is the fastest.
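All three answers compute the same contraction, C(m,l) = sum over n of A(n,m,l)·B(n,1,l). They differ only in how the result is shaped (M-by-L after squeeze, or M-by-1-by-L after permute or reshape). The plain-Python sketch below spells out that contraction on a tiny example so the indices are explicit. B is stored as N-by-L with its singleton middle dimension dropped, which is an assumption made for brevity, and the function name is hypothetical.

```python
def weighted_column_sums(A, B):
    """C[m][l] = sum_n A[n][m][l] * B[n][l], returned as an M-by-L nested list.

    A is N x M x L; B is N x L (the N x 1 x L array with its singleton dim dropped).
    """
    N, M, L = len(A), len(A[0]), len(A[0][0])
    return [[sum(A[n][m][l] * B[n][l] for n in range(N)) for l in range(L)]
            for m in range(M)]


A = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [10, 11, 12]]]  # N=2, M=2, L=3
B = [[1, 1, 1], [2, 0, 1]]       # N=2, L=3
print(weighted_column_sums(A, B))  # [[15, 2, 12], [24, 5, 18]]
```

Because the reduction runs over the short first dimension, the vectorized MATLAB forms do it in one pass across the long L dimension instead of looping over M.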

Need to know if this is a unique way to divide?

A few months ago I asked a question on Stack Overflow about an "Algorithm to find factors for primes in linear time".
The replies made it clear that my assumptions were wrong and that the algorithm cannot find factors in linear time.
However, I would like to know whether the algorithm is a unique way to do division and find factors; that is, is any similar or identical method of division known? I am posting the algorithm here again:
Input: A number N (whose factors are to be found)
Output: Two factors of N. If one of the factors found is 1, it can be concluded that N is prime.
Integer N, mL, mR, r;
Integer temp1; // used for temporary data storage
mR = mL = floor(square root of (N));
/*Check if perfect square*/
temp1 = mL * mR;
if temp1 equals N then
{
r = 0; //answer is found
End;
}
mR = N/mL; (have the value of mL less than mR)
r = N%mL;
while r not equals 0 do
{
mL = mL-1;
r = r+ mR;
temp1 = r/mL;
mR = mR + temp1;
r = r%mL;
}
End; //mR and mL hold the answer
Let me know your input. The question is purely out of personal interest: I want to know whether a similar algorithm for division and finding factors exists, as I have not been able to find one.
I understand and appreciate that you may need to work through my funny algorithm to answer! :)
Further explanation:
Yes, it works on numbers above 10 (which I tested) and on all positive integers.
The algorithm relies on the remainder r to proceed. My idea was that the factors of a number give the sides of rectangles whose area is the number itself. For any other side length that is not a factor, a remainder is left over, so the rectangle cannot be completed.
So for each decrease of mL, we increase r to r+mR (basically shifting one mR from mR×mL into r), and then divide this larger r by mL to see how much we can increase mR (how many times we can increase mR for one decrease of mL). The remaining r is then r mod mL.
I have counted the number of while-loop iterations it takes to find the factors, and it comes to at most 5*N for all numbers. Trial division would take more.
Thanks for your time, Harish
The main loop is equivalent to the following C code:
mR = mL = sqrt(N);
...
mR = N/mL; // have the value of mL less than mR
r = N%mL;
while (r) {
mL = mL-1;
r += mR;
mR = mR + r/mL;
r = r%mL;
}
Note that after each r += mR statement, the value of r is r%(mL-1)+mR. Since r%(mL-1) < mL, the value of r/mL in the next statement is either mR/mL or 1 + mR/mL. I agree (as a result of numerical testing) that it works out that mR*mL = N when you come out of the loop, but I don't understand why. If you know why, you should explain why, if you want your method to be taken seriously.
In terms of efficiency, your method uses the same number of loops as Fermat factorization although the inner loop of Fermat factorization can be written without using any divides, where your method uses two division operations (r/mL and r%mL) inside its inner loop. In the worst case for both methods, the inner loop runs about sqrt(N) times.
There are other methods, for example Pollard's rho algorithm and GNFS (the general number field sieve), which you were already told about in the previous question.
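One way to see why the loop ends with mR*mL = N is that it preserves the invariant mL*mR + r = N. Decrementing mL while adding mR to r leaves the sum unchanged, and the divide/modulo step only moves whole multiples of mL from r into mR. After each pass 0 <= r < mL, so r equals N mod mL and mR equals N div mL. The loop therefore tries mL = floor(sqrt(N)), floor(sqrt(N))-1, ... and stops at the largest divisor not exceeding sqrt(N), reaching mL = 1 for a prime. Below is a Python transcription of the question's pseudocode that checks this. The names are illustrative, and it assumes N >= 1.

```python
from math import isqrt


def factor_pair(n):
    """The question's loop for n >= 1; returns (mL, mR) with mL * mR == n and mL <= mR."""
    m_l = isqrt(n)                 # integer square root
    if m_l * m_l == n:             # perfect square
        return m_l, m_l
    m_r, r = divmod(n, m_l)        # invariant from here on: m_l * m_r + r == n
    while r != 0:
        m_l -= 1
        r += m_r                   # move one m_r from the rectangle into the remainder
        m_r += r // m_l            # grow m_r by as many whole columns as r allows
        r %= m_l                   # now 0 <= r < m_l, i.e. r == n % m_l
    return m_l, m_r


print(factor_pair(91))  # (7, 13)
print(factor_pair(13))  # (1, 13): 13 is prime
```

Seen this way, each pass updates N mod mL incrementally instead of recomputing it, which is trial division run downward from sqrt(N). For a prime, that means about sqrt(N) passes.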
