Optimised code for computing truncated weighted sum of matrix products - algorithm

I need to perform a double sum over a weighted matrix product. As implemented in the code below, the expression is
$$t_2 = \sum_{i}\sum_{j} C_i \, X_i \, m \, X_j^{H},$$
where C is a list containing a different (complex) weight for each term, X is a 3D array whose index i selects the i-th 2D array along the depth axis, and $X_j^{H}$ is the conjugate transpose of $X_j$. I need to evaluate a sum of such expressions thousands of times (for different 'm' arrays), so the execution time is very important. In the real problem the arrays have greater dimensions. An illustrative minimal example of my code is:
import numpy as np
from time import perf_counter  # time.clock was removed in Python 3.8

X = np.random.random((12,12,31))
m = np.random.random((12,12))
C = list(map(lambda x: x*2+1j*x, range(31)))

def summing(m):
    start = perf_counter()
    t2 = np.zeros((12,12),dtype = complex)
    for i in range(len(C)):
        for j in range(len(C)):
            t2 += C[i]*(X[:,:,i].dot(m).dot(np.conj(X[:,:,j].T)))
    return t2, perf_counter()-start

summing(m)[1]
The execution time for me is around 0.025s, which is way too slow for the number of computations I need to perform. Running in parallel is not an option as this function is part of a larger parallel computation, so parallelising this would create nested parallel computations.
Do any of you see a much cleverer way of performing this computation in a faster and hopefully a more pythonic way?
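One possible direction, sketched here under the assumption that X and C stay fixed while only m changes: matrix multiplication is linear, so the double sum factors as (sum_i C[i] X_i) m (sum_j X_j)^H. The function names below are illustrative and not part of the question's code.

```python
import numpy as np

def summing_naive(X, C, m):
    """Reference double loop, as in the question."""
    n = X.shape[2]
    t2 = np.zeros(m.shape, dtype=complex)
    for i in range(n):
        for j in range(n):
            t2 += C[i] * (X[:, :, i] @ m @ X[:, :, j].conj().T)
    return t2

def summing_factored(X, C, m):
    """Same sum with both loops collapsed by linearity."""
    A = X @ np.asarray(C)   # sum_i C[i] * X[:, :, i]
    B = X.sum(axis=2)       # sum_j X[:, :, j]
    return A @ m @ B.conj().T

rng = np.random.default_rng(0)
X = rng.random((12, 12, 31))
m = rng.random((12, 12))
C = np.array([2 * k + 1j * k for k in range(31)])

print(np.allclose(summing_naive(X, C, m), summing_factored(X, C, m)))
```

Since A and B do not depend on m, they could be computed once and reused, which would leave two small matrix products per call.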


Finding a perfect matching in graphs

I have this question:
An airline company has N different planes and T pilots. Every pilot has a list of planes he can fly. Every flight needs 2 pilots. The company wants to have as many flights running simultaneously as possible. Find an algorithm that determines whether all the flights can run simultaneously.
The solution I thought of is to find the max flow on a graph connecting a source to the pilots, the pilots to the planes they can fly, and the planes to a sink.
I am just not sure what the capacities should be. Can you help me with that?
Great idea to find the max flow.
For each edge from source --> pilot, assign a capacity of 1. Each pilot can only fly one plane at a time since they are running simultaneously.
For each edge from pilot --> plane, assign a capacity of 1. If this edge is filled with flow of 1, it represents that the given pilot is flying that plane.
For each edge from plane --> sink, assign a capacity of 2. This represents that each plane must be supplied by exactly 2 pilots.
Now, find a maximum flow. If the resulting maximum flow is two times the number of planes, then it's possible to satisfy the constraints. In this case, the edges between planes and pilots that are at capacity represent the matching.
The other answer is fine but you don't really need to involve flow as this can be reduced just as well to ordinary maximum bipartite matching:
For each plane, add another auxiliary plane to the plane partition with edges to the same pilots as the first plane.
Find a maximum bipartite matching M.
The answer is now true if and only if |M| = 2N.
If you like, you can think of this as saying that each plane needs a pilot and a co-pilot, and the two vertices associated with each plane now represent those two roles.
The reduction to maximum bipartite matching is linear time, so using e.g. the Hopcroft–Karp algorithm to find the matching, you can solve the problem in O(|E| √|V|), where |E| is the number of edges between the partitions and |V| = T + N (doubling the planes only changes these by a constant factor).
In practice, the improvement over using a maximum flow based approach should depend on the quality of your implementations as well as the particular choice of representation of the graph, but chances are that you're better off this way.
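As a minimal sketch of the reduction described above, here is a plain-Python version that duplicates each plane into a pilot seat and a co-pilot seat. Note that it uses simple augmenting-path search rather than Hopcroft–Karp, so it does not reach the stated complexity bound; the function name and the input format (a list of flyable planes per pilot) are assumptions made for illustration.

```python
def can_fly_all(planes_of_pilot, n_planes):
    """Return True if every plane can be given two distinct pilots.

    planes_of_pilot[p] lists the planes (0-based) that pilot p can fly.
    Plane k is split into seats 2*k (pilot) and 2*k + 1 (co-pilot).
    """
    # Both seats of a plane are adjacent to the same pilots.
    pilots_of_seat = [[] for _ in range(2 * n_planes)]
    for pilot, planes in enumerate(planes_of_pilot):
        for k in planes:
            pilots_of_seat[2 * k].append(pilot)
            pilots_of_seat[2 * k + 1].append(pilot)

    seat_of_pilot = [-1] * len(planes_of_pilot)  # current matching

    def try_assign(seat, seen):
        # Depth-first search for an augmenting path starting at `seat`.
        for pilot in pilots_of_seat[seat]:
            if pilot in seen:
                continue
            seen.add(pilot)
            if seat_of_pilot[pilot] == -1 or try_assign(seat_of_pilot[pilot], seen):
                seat_of_pilot[pilot] = seat
                return True
        return False

    matched = sum(try_assign(seat, set()) for seat in range(2 * n_planes))
    return matched == 2 * n_planes

print(can_fly_all([[0], [0], [1], [1]], 2))
print(can_fly_all([[0, 1], [0], [1]], 2))
```

The first call has two dedicated pilots per plane; the second has only three pilots for four seats.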
Implementation example
To illustrate the last point, let's give an idea of how the two reductions could look in practice. One representation of a graph that's often useful due to its built-in memory locality is that of a CSR matrix, so let us assume that the input is such a matrix, whose rows correspond to the planes, and whose columns correspond to the pilots.
We will use the Python library SciPy which comes with algorithms for both maximum bipartite matching and maximum flow, and which works with CSR matrix representations for graphs under the hood.
In the algorithm given above, we will then need to construct the biadjacency matrix of the graph with the additional vertices added. This is nothing but the result of stacking the input matrix on top of itself, which is straightforward to phrase in terms of the CSR data structures: following Wikipedia's notation, COL_INDEX should just be repeated, and ROW_INDEX should be replaced with ROW_INDEX concatenated with a copy of ROW_INDEX, without its leading zero, in which all elements are increased by the final element of ROW_INDEX.
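To make the stacking rule concrete, here is a small plain-Python sketch on list-based CSR arrays. The helper names are made up for this illustration, and the copy of ROW_INDEX is shifted without its leading zero so that the row pointers stay contiguous.

```python
def stack_csr_on_itself(row_index, col_index):
    """CSR arrays of a 0/1 matrix stacked on top of itself."""
    nnz = row_index[-1]
    new_col_index = col_index + col_index
    # Shift the copy by nnz and drop its leading zero.
    new_row_index = row_index + [p + nnz for p in row_index[1:]]
    return new_row_index, new_col_index

def csr_to_dense(row_index, col_index, n_cols):
    """Expand list-based CSR arrays of a 0/1 matrix into nested lists."""
    dense = []
    for r in range(len(row_index) - 1):
        row = [0] * n_cols
        for k in range(row_index[r], row_index[r + 1]):
            row[col_index[k]] = 1
        dense.append(row)
    return dense

# 2 planes x 3 pilots: plane 0 can be flown by pilots 0 and 2, plane 1 by pilot 1.
row_index = [0, 2, 3]
col_index = [0, 2, 1]
stacked_rows, stacked_cols = stack_csr_on_itself(row_index, col_index)
print(csr_to_dense(stacked_rows, stacked_cols, 3))
```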
In SciPy, a complete implementation which answers yes or no to the problem in OP would look as follows:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def reduce_to_max_matching(a):
    i, j = a.shape
    data = np.ones(a.nnz * 2, dtype=bool)
    indices = np.concatenate([a.indices, a.indices])
    indptr = np.concatenate([a.indptr, a.indptr[1:] + a.indptr[-1]])
    graph = csr_matrix((data, indices, indptr), shape=(2*i, j))
    return (maximum_bipartite_matching(graph) != -1).sum() == 2 * i
In the maximum flow approach given by @HeatherGuarnera's answer, we will need to set up the full adjacency matrix of the new graph. This is also relatively straightforward; the input matrix will appear as a certain submatrix of the adjacency matrix, and we need to add a row for the source vertex and a column for the target. The example section of the documentation for SciPy's max flow solver actually contains an illustration of what this looks like in practice. Adopting this, a complete solution looks as follows:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_flow

def reduce_to_max_flow(a):
    i, j = a.shape
    n = a.nnz
    data = np.concatenate([2*np.ones(i, dtype=int), np.ones(n + j, dtype=int)])
    indices = np.concatenate([np.arange(1, i + 1),
                              a.indices + i + 1,
                              np.repeat(i + j + 1, j)])
    indptr = np.concatenate([[0],
                             a.indptr + i,
                             np.arange(n + i + 1, n + i + j + 1),
                             [n + i + j]])
    graph = csr_matrix((data, indices, indptr), shape=(2+i+j, 2+i+j))
    flow = maximum_flow(graph, 0, graph.shape[0]-1)
    return flow.flow_value == 2*i
Let us compare the timings of the two approaches on a single example consisting of 40 planes and 100 pilots, on a graph whose edge density is 0.1:
from scipy.sparse import random
inp = random(40, 100, density=.1, format='csr', dtype=bool)
%timeit reduce_to_max_matching(inp) # 191 µs ± 3.57 µs per loop
%timeit reduce_to_max_flow(inp) # 1.29 ms ± 20.1 µs per loop
The matching-based approach is faster, but not by a crazy amount. On larger problems, we'll start to see the advantages of using matching instead; with 400 planes and 1000 pilots:
inp = random(400, 1000, density=.1, format='csr', dtype=bool)
%timeit reduce_to_max_matching(inp) # 473 µs ± 5.52 µs per loop
%timeit reduce_to_max_flow(inp) # 68.9 ms ± 555 µs per loop
Again, this exact comparison relies on the use of specific predefined solvers from SciPy and how those are implemented, but if nothing else, this hints that simpler is better.

Weighted Sum Scheduling in Halide

I am implementing a Radial Basis Function in Halide, and while I have it running successfully it is quite slow. For each pixel I compute the distance, then take a weighted sum of this distance to produce the output. To loop over the weights I use an RDom (as seen below). In this implementation, every pixel computation requires reloading all of the many (3000+) weights, hence the slow speed.
My question is how to take advantage of Halide's scheduling functionality in this instance. My desire is to load some of the weights, compute partial weighted sums for a subset of the pixels, load the next set of weights, and continue to completion. This keeps locality for each smaller group of weights, and that kind of thing is exactly what Halide is built for. Unfortunately I haven't found anything for this specific problem. The RDom seems to be at a lower level of abstraction than the scheduling primitives, so it's unclear how to schedule this.
Any alternative suggestions for weighted sum implementation in Halide are welcome. No need to do this with an RDom, I'm just not aware of any other way.
Func rbf_ctrl_pts("rbf_ctrl_pts");
// Initialization with all zero
rbf_ctrl_pts(x,y,c) = cast<float>(0);
// Index to iterate with
RDom idx(0,num_ctrl_pts);
// Loop code
// Subtract the vectors
Expr red_sub = (*in_func)(x,y,0) - (*ctrl_pts_h)(0,idx);
Expr green_sub = (*in_func)(x,y,1) - (*ctrl_pts_h)(1,idx);
Expr blue_sub = (*in_func)(x,y,2) - (*ctrl_pts_h)(2,idx);
// Take the L2 norm to get the distance
Expr dist = sqrt( red_sub*red_sub +
                  green_sub*green_sub +
                  blue_sub*blue_sub );
// Update persistent loop variables
rbf_ctrl_pts(x,y,c) = select( c == 0, rbf_ctrl_pts(x,y,c) +
                                      ( (*weights_h)(0,idx) * dist),
                              c == 1, rbf_ctrl_pts(x,y,c) +
                                      ( (*weights_h)(1,idx) * dist),
                                      rbf_ctrl_pts(x,y,c) +
                                      ( (*weights_h)(2,idx) * dist));
You can use split or tile and rfactor in the idx dimension of rbf_ctrl_pts to factor and schedule the reduction operation. Getting locality on the weights should be doable via these mechanisms. I'm not 100% sure the associative prover will handle the select so it may be required to unroll by channels or move to using a Tuple across the channels, although in the code above, I'm not sure the select is doing anything compared to passing c through.
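The loop order that splitting the reduction aims for can be illustrated outside Halide. The following NumPy sketch is not Halide code and makes no claim about Halide's API: it only shows that accumulating partial weighted sums over blocks of control points gives the same result as the full reduction. The array layouts (control points and weights stored as K x 3) and the block size are assumptions for this example.

```python
import numpy as np

def rbf_full(pixels, ctrl_pts, weights):
    """Weighted sum of distances over all control points at once."""
    # dist[p, k] = Euclidean distance between pixel p and control point k.
    dist = np.linalg.norm(pixels[:, None, :] - ctrl_pts[None, :, :], axis=2)
    return dist @ weights  # shape (P, 3): one column per output channel

def rbf_blocked(pixels, ctrl_pts, weights, block=256):
    """Same reduction, accumulated one block of control points at a time."""
    out = np.zeros((pixels.shape[0], weights.shape[1]))
    for start in range(0, ctrl_pts.shape[0], block):
        stop = start + block
        dist = np.linalg.norm(
            pixels[:, None, :] - ctrl_pts[None, start:stop, :], axis=2)
        out += dist @ weights[start:stop]  # partial sum for this block
    return out

rng = np.random.default_rng(0)
pixels = rng.random((50, 3))
ctrl_pts = rng.random((1000, 3))
weights = rng.random((1000, 3))
print(np.allclose(rbf_full(pixels, ctrl_pts, weights),
                  rbf_blocked(pixels, ctrl_pts, weights, block=128)))
```

Indexing the weights by channel, as `dist @ weights` does here, corresponds to passing c through instead of using a select.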

Is there a more efficient way to code a program for Avg Clustering Coeff

Calculation of the average clustering coefficient of a graph
I am getting the correct result, but it takes a huge amount of time as the graph dimension increases. I need an alternative way that takes less time to execute. Is there any way to simplify the code?
%// A is adjacency matrix N X N,
%// d is degree ,
N=100;
d=10;
rand('state',0)
A = zeros(N,N);
kv=d*(d-1)/2;
%% Creating A matrix %%%
for i = 1:(d*N/2)
    j = floor(N*rand)+1;
    k = floor(N*rand)+1;
    while (j==k)||(A(j,k)==1)
        j = floor(N*rand)+1;
        k = floor(N*rand)+1;
    end
    A(j,k)=1;
    A(k,j)=1;
end
%% Calculation of clustering Coeff %%
for i=1:N
    J=find(A(i,:));
    et=0;
    for ii=1:(size(J,2))-1
        for jj=ii+1:size(J,2)
            et=et+A(J(ii),J(jj));
        end
    end
    Cv(i)=et/kv;
end
Avg_clustering_coeff=sum(Cv)/N;
Output I got.
Avg_clustering_coeff = 0.1107
The "Calculation of clustering Coeff" part could be vectorized using nchoosek to remove the innermost two nested loops, like so -
CvOut = zeros(1,N);
for k=1:N
    J=find(A(k,:));
    if numel(J)>1
        idx = nchoosek(J,2);
        CvOut(k) = sum(A(sub2ind([N N],idx(:,1),idx(:,2))));
    end
end
CvOut=CvOut/kv;
Hopefully, this would boost up the performance quite a bit!
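Another way to remove the loops, sketched here in NumPy rather than MATLAB: for a symmetric 0/1 adjacency matrix with a zero diagonal, the number of edges among the neighbours of node i equals the i-th diagonal entry of A^3 divided by 2. The function names are illustrative, and kv is the same fixed normalisation used in the question.

```python
import numpy as np

def clustering_loop(A, kv):
    """Per-node coefficient using the question's nested loops."""
    n = A.shape[0]
    cv = np.zeros(n)
    for i in range(n):
        nbrs = np.flatnonzero(A[i])
        et = 0
        for a in range(len(nbrs) - 1):
            for b in range(a + 1, len(nbrs)):
                et += A[nbrs[a], nbrs[b]]
        cv[i] = et / kv
    return cv

def clustering_matrix(A, kv):
    """Same values from matrix products: et(i) = (A^3)_{ii} / 2."""
    # For symmetric A, the row sums of (A @ A) * A equal diag(A @ A @ A).
    et = ((A @ A) * A).sum(axis=1) / 2
    return et / kv

rng = np.random.default_rng(0)
N, d = 100, 10
upper = np.triu(rng.random((N, N)) < d / N, k=1)
A = (upper | upper.T).astype(float)   # symmetric, zero diagonal
kv = d * (d - 1) / 2

print(np.allclose(clustering_loop(A, kv), clustering_matrix(A, kv)))
```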
You can speed up your code as suggested in my comment, but you are not going to reduce the computation time drastically, because the time complexity doesn't change.
But if you don't need an exact result, you can use a probabilistic estimate.
probnum = cumsum(1:d);
probnum = mean(probnum(end-1:end)); %theoretical number of elements created by your second loop (for each row).
probfind = d*N/(N^2); %probability of finding a non zero value.
coeff = probnum*probfind/kv;
This probabilistic coeff is going to be equal to Avg_clustering_coeff for big N.
So you can use the normal method for small N and this method for big N.

Finding parameters of exponentially decaying sinusoids (Matrix Pencil Method)

The matrix pencil method is an algorithm which can be used to find the individual exponential decaying sinusoids' parameters (frequency, amplitude, decay factor and initial phase) in a signal consisting of multiple such signals added. I am trying to implement the algorithm. The algorithm can be found in the paper from this link:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=370583 OR
http://krein.unica.it/~cornelis/private/IEEE/IEEEAntennasPropagMag_37_48.pdf
In order to test the algorithm, I created a synthetic signal composed of four exponentially decaying sinusoids generated as follows:
fs=2205;
t=0:1/fs:249/fs;
f(1)=80;
f(2)=120;
f(3)=250;
f(4)=560;
a(1)=.4;
a(2)=1;
a(3)=0.89;
a(4)=.65;
d(1)=70;
d(2)=50;
d(3)=90;
d(4)=80;
for i=1:4
    x(i,:)=a(i)*exp(-d(i)*t).*cos(2*pi*f(i)*t);
end
y=x(1,:)+x(2,:)+x(3,:)+x(4,:);
I then feed this signal to the algorithm described in the paper as follows:
function [f d] = mpencil(y)
%construct hankel matrix
N = size(y,2);
L1 = ceil(1/3 * N);
L2 = floor(2/3 * N);
L = ceil((L1 + L2) / 2);
fs=2205;
for i=1:1:(N-L)
    Y(i,:)=y(i:(i+L));
end
Y1=Y(:,1:L);
Y2=Y(:,2:(L+1));
[U,S,V] = svd(Y);
D=diag(S);
tol=1e-3;
m=0;
l=length(D);
for i=1:l
    if( abs(D(i)/D(1)) >= tol)
        m=m+1;
    end
end
Ss=S(:,1:m);
Vnew=V(:,1:m);
a=size(Vnew,1);
Vs1=Vnew(1:(a-1),:);
Vs2=Vnew(2:end,:);
Y1=U*Ss*(Vs1');
Y2=U*Ss*(Vs2');
D_fil=(pinv(Y1))*Y2;
z = eig(D_fil);
l=length(z);
for i=1:2:l
    f((i+1)/2)= (angle(z(i))*fs)/(2*pi);
    d((i+1)/2)=-real(z(i))*fs;
end
In the output from the above code, I am correctly getting the four constituent frequency components but am not getting their decay factors. If anybody has prior experience with this algorithm or has some understanding of why this discrepancy might be there, I would be very grateful for your help. I have tried rewriting the code from scratch multiple times but it has been of no help, giving the same results.
Any help would be highly appreciated.
I found the problem.
There are two small glitches in the code:
1. The SVD output is the conjugate transpose of the right singular matrix (i.e., Vh), and according to the IEEE paper it needs to be converted to V first. This V is then filtered to reduce the dimension. After reducing the dimensions of V, V1 and V2 are calculated from V. (In your case, you are using Vh directly for calculating V1 and V2!) When calculating Y1 and Y2, the complex conjugates of V1 and V2 are used.
2. You did not consider the absolute magnitude of the complex eigenvalues, but only the real part. The damping coefficient is zeta = log(abs(z))/Ts, where Ts = 1/fs is the sampling period. For a component decaying as exp(-d*t), each pole is z = exp((-d + 2*pi*f*1j)*Ts), so zeta = -d.
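The two points above can be put together in a short NumPy sketch. This is one possible arrangement of the method and not the paper's reference implementation: the model order is passed in explicitly (an assumption, instead of the tolerance test in the question), the pencil parameter is simply half the signal length, and the poles are read from the truncated right singular vectors.

```python
import numpy as np

def matrix_pencil(y, order, fs):
    """Estimate frequencies (Hz) and decay factors (1/s) of damped sinusoids.

    `order` is the number of complex poles to keep (two per real sinusoid).
    """
    n = len(y)
    L = n // 2                                   # pencil parameter
    # Hankel data matrix whose row i is y[i], ..., y[i + L].
    Y = np.array([y[i:i + L + 1] for i in range(n - L)])
    _, _, Vh = np.linalg.svd(Y, full_matrices=False)
    Vh = Vh[:order]                              # dominant right singular vectors
    V1h, V2h = Vh[:, :-1], Vh[:, 1:]             # drop last / first column
    z = np.linalg.eigvals(V2h @ np.linalg.pinv(V1h))
    z = z[z.imag > 0]                            # one pole per conjugate pair
    freq = np.angle(z) * fs / (2 * np.pi)
    decay = -np.log(np.abs(z)) * fs              # uses |z|, not real(z)
    idx = np.argsort(freq)
    return freq[idx], decay[idx]

# Synthetic test signal from the question.
fs = 2205
t = np.arange(250) / fs
f_true = np.array([80, 120, 250, 560])
a_true = np.array([0.4, 1.0, 0.89, 0.65])
d_true = np.array([70, 50, 90, 80])
y = sum(a * np.exp(-d * t) * np.cos(2 * np.pi * f * t)
        for f, a, d in zip(f_true, a_true, d_true))

freq, decay = matrix_pencil(y, order=8, fs=fs)
print(np.round(freq, 3), np.round(decay, 3))
```

With noisy data the order would have to be chosen from the singular values, for instance with a relative tolerance as in the question.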

Looking for efficient way to perform a computation - Matlab

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w
    for y=1:h
        X{x,y}=zeros(w,h);
        for i=1:w
            for j=1:h
                X{x,y}(i,j)=f([x,y],[i,j]);
            end
        end
    end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the two innermost loops and replace them with a vectorised version. By the look of your f function this shouldn't be too bad.
First we need to construct two w x h matrices, one containing 1 to w down every column and one containing 1 to h along every row, like so:
wMat=repmat((1:w)',1,h);
hMat=repmat(1:h,w,1);
These represent the inner two loops: entry (i,j) of wMat is i and entry (i,j) of hMat is j, so together they cover all combinations. Now we can vectorise the calculation (f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2)):
for x=1:w
    for y=1:h
        temp1=(x-wMat).^2+(y-hMat).^2;
        X{x,y}=exp(-temp1/(sigma^2));
    end
end
Here we have computed the squared Euclidean norm for all pairs of nodes in the inner loops at once.
Some discussion and code
The trick here is to perform the norm-calculations with numeric arrays and save the results into a cell array version as late as possible. For performing the norm-calculations you can take help of ndgrid, bsxfun and some permute + reshape to give it the "shape" as needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
    bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and function-inlining of f, runtime benchmarks were performed with it against the proposed vectorized approach with datasizes of w, h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggests a whopping close-to-20x speedup with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, bsxfun may not have enough memory to work with, since it is known to use up a lot of memory in exchange for a performance-efficient vectorized solution. So, for such huge-datasize cases, you can use the following loopy approach to replace the normvals calculation that was listed in the earlier bsxfun based solution -
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
    normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
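A further simplification is possible because the function only involves squared differences: exp(-((x-i)^2 + (y-j)^2)/sigma^2) is the product of a factor in (x, i) and a factor in (y, j), so neither the square root nor the full (w*h) x (w*h) distance matrix is needed. The sketch below shows this in NumPy rather than MATLAB; the function names are illustrative, and G[x-1, y-1] plays the role of X{x,y}.

```python
import numpy as np

def gaussian_blocks(w, h, sigma):
    """4-D array G with G[x, y, i, j] = exp(-((x-i)**2 + (y-j)**2) / sigma**2)."""
    xs = np.arange(1, w + 1)
    ys = np.arange(1, h + 1)
    gx = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / sigma ** 2)   # (w, w)
    gy = np.exp(-(ys[:, None] - ys[None, :]) ** 2 / sigma ** 2)   # (h, h)
    # Broadcast the two factors to shape (w, h, w, h).
    return gx[:, None, :, None] * gy[None, :, None, :]

def gaussian_blocks_loops(w, h, sigma):
    """Reference version with four nested loops, as in the question."""
    G = np.zeros((w, h, w, h))
    for x in range(w):
        for y in range(h):
            for i in range(w):
                for j in range(h):
                    G[x, y, i, j] = np.exp(-((x - i) ** 2 + (y - j) ** 2) / sigma ** 2)
    return G

G = gaussian_blocks(4, 3, 2.5)
print(np.allclose(G, gaussian_blocks_loops(4, 3, 2.5)))
```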
It seems to me that when you run through the cycle for x=w, y=h, you are calculating all the values you need at once. So you don't need to recalculate them. Once you have this:
for i=1:w
    for j=1:h
        temp(i,j)=f([x,y],[i,j]);
    end
end
Then, e.g. X{1,1} is just temp(1,1), X{2,2} is just temp(1:2,1:2), and so on. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.
