Speed up Python nested loops with conditional statements - performance

I am converting code from MATLAB to Python in order to speed up simple operations. I have written a function that contains nested loops and a conditional statement; its purpose is to return, for each element of array y, the index of the nearest element in array x. I am comparing on the order of 1e5 items, which takes about 30 s to run. Any help speeding this up would be greatly appreciated! I have had partial success using the NumbaPro automatic just-in-time compiler:
@autojit()
def find_nearest(x,y,idx):
    idx_old = 0
    rng1 = range(y.shape[0])
    rng2 = range(x.shape[0])
    for i in rng1:
        prev = abs(x[idx_old]-y[i])
        for j in rng2:
            if abs(x[j]-y[i]) < prev:
                prev = abs(x[j]-y[i])
                idx_old = j
        idx[i] = idx_old
    return idx
Sorry for being such a noob, I am brand new to python!

There is nothing wrong with your Numba code, except that the algorithm is not as efficient as it could be: it compares every element of y against every element of x. It is much better to sort the x array and do a binary search, very similar to this answer and also this answer:
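import numpy as np

def find_nearest(x, y):
    indices = np.argsort(x)                     # permutation that sorts x
    loc = np.searchsorted(x[indices], y)        # insertion point of each y in sorted x
    right = indices.take(loc, mode='clip')      # candidate neighbour on the right
    left = indices.take(loc-1, mode='clip')     # candidate neighbour on the left
    return np.where(abs(y-x[left]) < abs(y-x[right]), left, right)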
On my PC this is about 80x faster than even the KDTree approach for x and y having 10^6 and 10^5 elements respectively. About two-thirds of the time is spent argsort-ing the array, so I don't think you can gain much with Numba here.
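As a sanity check on the sort-and-search idea, the sketch below compares it against a brute-force reference on small random inputs. The helper name find_nearest_brute, the array sizes and the seed are assumptions made for illustration only; they are not part of the original answer.

```python
import numpy as np

def find_nearest(x, y):
    """For each y[i], return the index of the closest element of x."""
    indices = np.argsort(x)                      # permutation that sorts x
    loc = np.searchsorted(x[indices], y)         # insertion points in sorted x
    right = indices.take(loc, mode='clip')       # neighbour at or above y[i]
    left = indices.take(loc - 1, mode='clip')    # neighbour below y[i]
    return np.where(np.abs(y - x[left]) < np.abs(y - x[right]), left, right)

def find_nearest_brute(x, y):
    """Hypothetical O(len(x) * len(y)) reference, used only for checking."""
    return np.abs(x[:, None] - y[None, :]).argmin(axis=0)

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice
x = rng.random(500)
y = rng.random(200)

idx = find_nearest(x, y)
ref = find_nearest_brute(x, y)

# Compare distances rather than indices so that ties cannot cause a mismatch.
same = np.allclose(np.abs(x[idx] - y), np.abs(x[ref] - y))
print(same)
```

The brute-force version builds a len(x) by len(y) matrix, so it is only practical for small inputs; it is there to confirm the result, not to be fast.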

I have found an interim solution to my problem. By using scipy.spatial's KDTree I was able to cut the run time from 32 s to just under 10 s. This is still four times slower than the MATLAB knnsearch algorithm, and understanding how to speed up loops with conditional statements is still important. But for the moment this revised implementation is faster:
from scipy import spatial
from numpy import matrix
tree = spatial.KDTree(matrix(x).T)
(_, idxx) = tree.query(matrix(y).T)
The arrays x and y were in flat 1d formats; the tree required queries to be in column vector form.
Any suggestions to improve the run time of the original implementation would be greatly appreciated!

Related

solving sparse Ax=b in scipy

I need to solve Ax=b where A is the matrix that represents finite difference method for PDEs. Typical size of A for a 2D problem is around (256^2)x(256^2), and it consists of a few diagonals. The following sample code is how I construct A:
N = Nx*Ny # Nx is no. of cols (size in x-direction), Ny is no. rows (size in y-direction)
# finite difference in x-direction
up1 = (0.5)*c
up1[Nx-1::Nx] = 0
down1 = (-0.5)*c
down1[::Nx] = 0
matX = diags([down1[1:], up1[:-1]], [-1,1], format='csc')
# finite difference in y-direction
up1 = (0.5)*c
down1 = (-0.5)*c
matY = diags([down1[Nx:], up1[:N-Nx]], [-Nx,Nx], format='csc')
Adding matX and matY together results in four diagonals. The above is for second-order discretization. For fourth-order discretization, I have eight diagonals. If I have a second derivative, then the main diagonal is nonzero as well.
I use the following code to do the actual solving:
# Initialize A_fixed, B_fixed
if const is True: # the potential term V(x) is time-independent
    A = A_fixed + sp.sparse.diags(V_func(x))
    B = B_fixed + sp.sparse.diags(V_func(x))
    A_factored = sp.sparse.linalg.factorized(A)
for idx, t in enumerate(t_steps[1:],1):
    # Solve Ax=b=Bu
    if const is False: # time-dependent potential: rebuild A and B at every step
        A = A_fixed + sp.sparse.diags(V_func(x,t))
        B = B_fixed + sp.sparse.diags(V_func(x,t))
    psi_n = B.dot(psi_old)
    if const is True:
        psi_new = A_factored(psi_n)
    else:
        psi_new = sp.sparse.linalg.spsolve(A,psi_n,use_umfpack=False)
    psi_old = psi_new.copy()
I have a couple questions:
What's the best way to solve Ax=b in scipy? I use the spsolve in the sp.sparse.linalg library, which uses the LU-decomposition. I tried using the standard sp.linalg.solve for dense matrix, but it's considerably slower. I also tried using all the other iterative solvers in the sp.sparse.linalg library, but they are also slower (for 1000x1000, they all take a couple seconds, compared to less than half a second for spsolve, and my A is likely to be a lot bigger). Are there any alternative ways to do the computation?
Edit: The problem I'm trying to solve is actually the time-dependent Schrodinger equation. If the potential term is time-independent, then as suggested I can pre-factorize the matrix A to speed up the code, but this doesn't work if the potential is time-varying, as I need to change the diagonal term of both matrices A and B at each time step. For this specific problem, are there any ways to speed up the code using a method similar to pre-factorization, or in some other way?
I have installed umfpack. I tried setting use_umfpack to both True and False to test it, but surprisingly use_umfpack=True takes nearly twice as long as use_umfpack=False. I thought having this package should give a speed-up. Any idea why that's the case? (PS: I am using Anaconda Spyder to run the code, if that makes any difference.)
I have used cProfile to profile my code, and nearly all the time is spent on the line with spsolve. So I think the remaining part of the code (matrix/problem initialization) is pretty much optimized, and it's the matrix solving part that needs to be improved.
Thanks.

Optimised code for computing truncated weighted sum of matrix products

I need to perform a double sum over a weighted matrix product like this: [image: mathematical expression of problem]
where C is a list containing a different (complex) weight for each term and X is a 3D array whose index i refers to the i-th 2D array along the depth axis. I need to evaluate a sum of such expressions thousands of times (for different 'm' arrays), so the execution time is very important. In the real problem the arrays have greater dimensions. An illustrative minimal example of my code is
import numpy as np
from time import perf_counter  # time.clock was removed in Python 3.8
X = np.random.random((12,12,31))
m = np.random.random((12,12))
C = list(map(lambda x: x*2+1j*x, range(31)))
def summing(m):
    start = perf_counter()
    t2 = np.zeros((12,12),dtype = complex)
    for i in range(len(C)):
        for j in range(len(C)):
            t2 += C[i]*(X[:,:,i].dot(m).dot(np.conj(X[:,:,j].T)))
    return t2, perf_counter()-start
summing(m)[1]
The execution time for me is around 0.025s, which is way too slow for the number of computations I need to perform. Running in parallel is not an option as this function is part of a larger parallel computation, so parallelising this would create nested parallel computations.
Do any of you see a much cleverer way of performing this computation in a faster and hopefully a more pythonic way?
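One possible direction, sketched under the assumption that the weight depends only on the first index exactly as in the loop above (C[i], not C[i, j], and no truncation of the index range): by linearity the double sum factors into (sum_i C[i] X_i) m (sum_j X_j)^H, which needs two reductions and two matrix products instead of 31*31 iterations. The function names summing_loops and summing_factored are made up for this illustration. If the real formula couples i and j, this factorisation does not apply as written.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, arbitrary choice
X = rng.random((12, 12, 31))
m = rng.random((12, 12))
C = np.array([2 * k + 1j * k for k in range(31)])

def summing_loops(m):
    """Reference: the double loop from the question."""
    t2 = np.zeros((12, 12), dtype=complex)
    for i in range(len(C)):
        for j in range(len(C)):
            t2 += C[i] * (X[:, :, i].dot(m).dot(np.conj(X[:, :, j].T)))
    return t2

def summing_factored(m):
    """Same sum, using linearity to pull both sums out of the product."""
    A = X @ C            # sum_i C[i] * X[:, :, i], shape (12, 12)
    B = X.sum(axis=2)    # sum_j X[:, :, j],        shape (12, 12)
    return A @ m @ B.conj().T

agree = np.allclose(summing_loops(m), summing_factored(m))
print(agree)
```

Since A and B do not depend on m, they could be computed once and reused across the thousands of different m arrays.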

Looking for efficient way to perform a computation - Matlab

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w;
    for y=1:h;
        X{x,y}=zeros(w,h);
        for i=1:w
            for j=1:h
                X{x,y}(i,j)=f([x,y],[i,j])
            end
        end
    end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the 2 innermost loops and replace them with a vectorised version. By the look of your f function this shouldn't be too bad.
First we need to construct two matrices containing 1 to w on every row and 1 to h on every column, like so:
wMat=repmat(1:w,h,1);
hMat=repmat(1:h,w,1)';
This is going to represent the inner two loops, and the transpose will allow us to get all combinations. Now we can vectorise the calculation (f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2)):
for x=1:w;
    for y=1:h;
        temp1=sqrt((x-wMat).^2+(y-hMat).^2);
        % negate and square the norm to match f = exp(-norm^2/sigma^2)
        X{x,y}=exp(-temp1.^2/(sigma^2));
    end
end
Where we have computed the Euclidean norm for all pairs of nodes in the inner loops at once.
Some discussion and code
The trick here is to perform the norm-calculations with numeric arrays and save the results into a cell array version as late as possible. For performing the norm-calculations you can take help of ndgrid, bsxfun and some permute + reshape to give it the "shape" as needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
    bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
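For readers working in NumPy rather than MATLAB, the same "compute everything as one numeric array, split late" idea can be sketched with broadcasting. This is an illustrative translation, not the answer's code: the function name gaussian_blocks and the choice to return a single 4D array K (with K[x-1, y-1, i-1, j-1] = f([x,y],[i,j]) because of 0-based indexing) are assumptions.

```python
import numpy as np

def gaussian_blocks(w, h, sigma):
    """Return K with K[x, y, i, j] = exp(-((x-i)^2 + (y-j)^2) / sigma^2)."""
    xs = np.arange(1, w + 1)
    ys = np.arange(1, h + 1)
    dx2 = (xs[:, None] - xs[None, :]) ** 2     # (w, w): squared x-offsets
    dy2 = (ys[:, None] - ys[None, :]) ** 2     # (h, h): squared y-offsets
    # Broadcast to (w, h, w, h); the squared Euclidean norm is dx2 + dy2,
    # so no square root is needed before squaring again.
    sq = dx2[:, None, :, None] + dy2[None, :, None, :]
    return np.exp(-sq / sigma**2)

K = gaussian_blocks(3, 4, sigma=2.0)
print(K.shape)  # (3, 4, 3, 4)
```

K[x, y] then plays the role of the cell X{x+1, y+1}; memory grows as (w*h)^2, the same concern raised for bsxfun.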
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and function-inlining of f, runtime benchmarks were performed against the proposed vectorized approach with datasizes w, h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggested a whopping close to 20x speedup with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, there may not be enough memory for bsxfun to work with, since bsxfun is known to use a lot of memory in exchange for a performance-efficient vectorized solution. For such cases, you can use the following loopy approach to replace the normvals calculation listed in the earlier bsxfun-based solution -
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
    normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
It seems to me that when you run through the cycle for x=w, y=h, you are calculating all the values you need at once, so you don't need to recalculate them. Once you have this:
for i=1:w
    for j=1:h
        temp(i,j)=f([x,y],[i,j])
    end
end
Then, e.g. X{1,1} is just temp(1,1), X{2,2} is just temp(1:2,1:2), and so on. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.

How to write vectorized functions in MATLAB

I am just learning MATLAB and I find it hard to understand the performance factors of loops vs vectorized functions.
In my previous question, Nested for loops extremely slow in MATLAB (preallocated), I realized that using a vectorized function instead of 4 nested loops made a 7x difference in running time.
In that example, instead of looping through all dimensions of a 4-dimensional array and calculating the median for each vector, it was much cleaner and faster to just call median(stack, n), where n is the working dimension of the median function.
But median is just a very easy example and I was just lucky that it had this dimension parameter implemented.
My question is: how do you write a function yourself that works as efficiently as one that has this dimension parameter implemented?
For example you have a function my_median_1D which only works on a 1-D vector and returns a number.
How do you write a function my_median_nD which acts like MATLAB's median, by taking an n-dimensional array and a "working dimension" parameter?
Update
I found the code for calculating median in higher dimensions
% In all other cases, use linear indexing to determine exact location
% of medians. Use linear indices to extract medians, then reshape at
% end to appropriate size.
cumSize = cumprod(s);
total = cumSize(end); % Equivalent to NUMEL(x)
numMedians = total / nCompare;
numConseq = cumSize(dim - 1); % Number of consecutive indices
increment = cumSize(dim); % Gap between runs of indices
ixMedians = 1;
y = repmat(x(1),numMedians,1); % Preallocate appropriate type
% Nested FOR loop tracks down medians by their indices.
for seqIndex = 1:increment:total
    for consIndex = half*numConseq:(half+1)*numConseq-1
        absIndex = seqIndex + consIndex;
        y(ixMedians) = x(absIndex);
        ixMedians = ixMedians + 1;
    end
end
% Average in second value if n is even
if 2*half == nCompare
    ixMedians = 1;
    for seqIndex = 1:increment:total
        for consIndex = (half-1)*numConseq:half*numConseq-1
            absIndex = seqIndex + consIndex;
            y(ixMedians) = meanof(x(absIndex),y(ixMedians));
            ixMedians = ixMedians + 1;
        end
    end
end
% Check last indices for NaN
ixMedians = 1;
for seqIndex = 1:increment:total
    for consIndex = (nCompare-1)*numConseq:nCompare*numConseq-1
        absIndex = seqIndex + consIndex;
        if isnan(x(absIndex))
            y(ixMedians) = NaN;
        end
        ixMedians = ixMedians + 1;
    end
end
Could you explain to me why this code is so effective compared to the simple nested loops? It has nested loops just like the other function.
I don't understand how it could be 7x faster, and also why it is so complicated.
Update 2
I realized that using median was not a good example as it is a complicated function itself requiring sorting of the array or other neat tricks. I re-did the tests with mean instead and the results are even more crazy:
19 seconds vs 0.12 seconds.
It means that the built-in way is about 160 times faster than the nested loops.
It is really hard for me to understand how an industry-leading language can have such an extreme performance difference based on programming style, but I see the points mentioned in the answers below.
Update 2 (to address your updated question)
MATLAB is optimized to work well with arrays. Once you get used to it, it is actually really nice to just have to type one line and have MATLAB do the full 4D looping stuff itself without having to worry about it. MATLAB is often used for prototyping / one-off calculations, so it makes sense to save time for the person coding, and giving up some of C[++|#]'s flexibility.
This is why MATLAB internally does some loops really well - often by coding them as a compiled function.
The code snippet you give doesn't really contain the relevant line of code which does the main work, namely
% Sort along given dimension
x = sort(x,dim);
In other words, the code you show only needs to access the median values by their correct index in the now-sorted multi-dimensional array x (which doesn't take much time). The actual work accessing all array elements was done by sort, which is a built-in (i.e. compiled and highly optimized) function.
Original answer (about how to build your own fast functions working on arrays)
There are actually quite a few built-ins that take a dimension parameter: min(stack, [], n), max(stack, [], n), mean(stack, n), std(stack, [], n), median(stack,n), sum(stack, n)... Together with the fact that other built-in functions like exp(), sin() automatically work on each element of your whole array (i.e. sin(stack) automatically does four nested loops for you if stack is 4D), you can build up a lot of functions that you might need just by relying on the existing built-ins.
If this is not enough for a particular case you should have a look at repmat, bsxfun, arrayfun and accumarray, which are very powerful functions for doing things "the MATLAB way". Just search on SO for questions (or rather answers) using one of these; I learned a lot about MATLAB's strong points that way.
As an example, say you wanted to implement the p-norm of stack along dimension n, you could write
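function result=pnorm(stack, p, n)
% element-wise power (.^) for the outer root; abs() keeps the p-norm correct for negative entries
result=sum(abs(stack).^p,n).^(1/p);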
... where you effectively reuse the "which-dimension-capability" of sum.
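The same "borrow the dimension argument from a built-in reduction" pattern carries over to NumPy, where the axis keyword of np.sum plays the role of MATLAB's dimension parameter. The sketch below is an illustrative translation; the function name pnorm mirrors the MATLAB example and the sample array is arbitrary.

```python
import numpy as np

def pnorm(stack, p, axis):
    """p-norm along one axis, reusing the axis handling of np.sum."""
    return np.sum(np.abs(stack) ** p, axis=axis) ** (1.0 / p)

# Arbitrary 3D test data containing negative values.
stack = np.arange(-12, 12, dtype=float).reshape(2, 3, 4)
result = pnorm(stack, 3, axis=2)
print(result.shape)  # (2, 3)
```

The result can be cross-checked against np.linalg.norm(stack, ord=3, axis=2), which computes vector norms along the given axis.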
Update
As Max points out in the comments, also have a look at the colon operator (:), which is a very powerful tool for selecting elements from an array (or even changing its shape, which is more generally done with reshape).
In general, have a look at the section Array Operations in the help - it contains repmat et al. mentioned above, but also cumsum and some more obscure helper functions which you should use as building blocks.
Vectorization
In addition to what's already been said, you should also understand that vectorization involves parallelization, i.e. performing concurrent operations on data as opposed to sequential execution (think SIMD instructions), and even taking advantage of threads and multiprocessors in some cases...
MEX-files
Now although the "interpreted vs. compiled" point has already been argued, no one mentioned that you can extend MATLAB by writing MEX-files, which are compiled executables written in C that can be called directly as a normal function from inside MATLAB. This allows you to implement performance-critical parts using a lower-level language like C.
Column-major order
Finally, when trying to optimize some code, always remember that MATLAB stores matrices in column-major order. Accessing elements in that order can yield significant improvements compared to other arbitrary orders.
For example, in your previous linked question, you were computing the median of a set of stacked images along some dimension. Now the order of those dimensions greatly affects the performance. Illustration:
%# sequence of 10 images
fPath = fullfile(matlabroot,'toolbox','images','imdemos');
files = dir( fullfile(fPath,'AT3_1m4_*.tif') );
files = strcat(fPath,{filesep},{files.name}'); %'
I = imread( files{1} );
%# stacked images along the 1st dimension: [numImages H W RGB]
stack1 = zeros([numel(files) size(I) 3], class(I));
for i=1:numel(files)
    I = imread( files{i} );
    stack1(i,:,:,:) = repmat(I, [1 1 3]); %# grayscale to RGB
end
%# stacked images along the 4th dimension: [H W RGB numImages]
stack4 = permute(stack1, [2 3 4 1]);
%# compute median image from each of these two stacks
tic, m1 = squeeze( median(stack1,1) ); toc
tic, m4 = median(stack4,4); toc
isequal(m1,m4)
The timing difference was huge:
Elapsed time is 0.257551 seconds. %# stack1
Elapsed time is 17.405075 seconds. %# stack4
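Memory order can be made visible in NumPy, which defaults to row-major (C) order but can also hold column-major (Fortran) arrays like MATLAB's. The sketch below only inspects strides, i.e. how many bytes separate neighbouring elements along each axis; it makes no timing claim, and the array shape is an arbitrary choice.

```python
import numpy as np

a_c = np.zeros((3, 4), dtype=np.float64)   # row-major (C order), NumPy's default
a_f = np.asfortranarray(a_c)               # column-major, the layout MATLAB uses

# Strides are in bytes: a small stride means neighbours along that axis
# sit next to each other in memory.
print(a_c.strides)  # (32, 8)
print(a_f.strides)  # (8, 24)
```

In the column-major array the first axis has the smallest stride, which is consistent with the observation above that reducing along the first dimension (stack1) was the fast case in MATLAB.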
Could you explain to me why this code is so effective compared to the simple nested loops? It has nested loops just like the other function.
The problem with nested loops is not the nested loops themselves. It's the operations you perform inside.
Each function call (especially to a non-built-in function) generates a little bit of overhead; more so if the function performs e.g. error checking that takes the same amount of time regardless of input size. Thus, if a function has only a 1 ms overhead, if you call it 1000 times, you will have wasted a second. If you can call it once to perform a vectorized calculation, you pay overhead only once.
Furthermore, the JIT compiler (pdf) can help vectorize simple for-loops, where you, for example, only perform basic arithmetic operations. Thus, the loops with simple calculations in your post are sped up by a lot, while the loops calling median are not.
In this case
M = median(A,dim) returns the median values for elements along the dimension of A specified by scalar dim
But with a general function you can try splitting your array with mat2cell (which can work with n-D arrays and not just matrices) and applying your my_median_1D function through cellfun. Below I will use median as an example to show that you get equivalent results, but instead you can pass it any function defined in an m-file, or an anonymous function defined with the @(args) notation.
>> testarr = [[1 2 3]' [4 5 6]']
testarr =
1 4
2 5
3 6
>> median(testarr,2)
ans =
2.5000
3.5000
4.5000
>> shape = size(testarr)
shape =
3 2
>> cellfun(@median,mat2cell(testarr,repmat(1,1,shape(1)),[shape(2)]))
ans =
2.5000
3.5000
4.5000
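A rough NumPy counterpart of the mat2cell + cellfun approach is np.apply_along_axis, which applies a 1-D function to every 1-D slice along a chosen axis. The names my_median_1d and my_median_nd below are hypothetical and only mirror the names used in the question. Like cellfun, this is convenient rather than fast, because the 1-D function is still called once per slice.

```python
import numpy as np

def my_median_1d(v):
    """Median of a 1-D vector (hypothetical stand-in for the user's function)."""
    s = np.sort(v)
    n = s.size
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def my_median_nd(a, axis):
    """Apply the 1-D function along the given axis of an n-D array."""
    return np.apply_along_axis(my_median_1d, axis, a)

testarr = np.array([[1, 4], [2, 5], [3, 6]])
print(my_median_nd(testarr, 1))  # [2.5 3.5 4.5]
```

Note that NumPy axes are 0-based, so axis=1 here corresponds to dimension 2 in the MATLAB session above.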

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total length of time taken up by the second example is 12 times less than the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is a factor of four, not twelve as I originally posted. Each of the functions has the additional function-wrapping overhead, while in my initial post I just summed up the individual lines.)
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end
methods (Access = private)
    %
    % swap1: stores values in temporary doubles
    % This has the best performance
    %
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    %
    % swap2: stores values in a temporary matrix
    % Marginally slower than swap1
    %
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    %
    % swap3: does not use variables for storage.
    % This has the worst performance
    %
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
You could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple years old) and only saw a speed difference of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
    vec(i) = vec(i)+1;
end;
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: as the array grows, directly swapping 2 elements performs increasingly poorly compared to using a temporary variable.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M;
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')
    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end
    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end
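The two swapping styles have direct NumPy analogues, sketched below for comparison. This is not a port of the MATLAB benchmark and makes no claim about relative speed in either language; the function names swap_direct and swap_with_temp are made up for the example, and the standard timeit module could be used to time them on a given machine.

```python
import numpy as np

def swap_direct(v, i, j):
    """Swap via index arrays: the right-hand side is a temporary 2-element array."""
    v[[i, j]] = v[[j, i]]

def swap_with_temp(v, i, j):
    """Swap via a scalar temporary, one element at a time."""
    tmp = v[i]          # indexing a 1-D array with an int yields a scalar copy
    v[i] = v[j]
    v[j] = tmp

a = np.arange(10.0)
b = a.copy()
swap_direct(a, 2, 7)
swap_with_temp(b, 2, 7)
print(np.array_equal(a, b))
```

Both functions modify the array in place and leave it in the same state; they differ only in how the intermediate values are held.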
