Perform numpy.sum (or scipy.integrate.simps()) on a large split array efficiently

Let's consider a very large numpy array a of shape (M, N), where M is typically 1 to 100 and N can be 10 to 100,000,000.
We have an array of indices that can split it into many (K = 1,000,000) sub-arrays along axis=1.
We want to efficiently perform an operation like integration along axis=1 (np.sum, to take the simplest form) on each sub-array and return an (M, K) array.
An elegant and efficient solution was proposed by @Divakar in question 41920367, "How to split numpy array and perform certain actions on split arrays [Python]", but my understanding is that it only applies to cases where all sub-arrays have the same shape, which allows for reshaping.
But in our case the sub-arrays don't have the same shape, which, so far, has forced me to loop over the indices... please take me out of my misery...
Example
a = np.random.random((10, 100000000))
ind = np.sort(np.random.randint(10, 9000000, 1000000))
The sizes of the sub-arrays are not homogeneous:
sizes = np.diff(ind)
print(sizes.min(), sizes.max())
2, 8732
So far, the best I found is:
output = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
Possible feature request for numpy and scipy:
If looping is really unavoidable, at least having it done in C inside the numpy and scipy.integrate.simps (or romb) functions would probably speed up the computation.
Something like
output = np.sum(a, axis=1, split_ind=ind)
output = scipy.integrate.simps(a, x=x, axis=1, split_ind=ind)
output = scipy.integrate.romb(a, x=x, axis=1, split_ind=ind)
would be very welcome!
(where x itself could be splittable, or not)
Side note:
While trying this example, I noticed that with these numbers there was almost always an element of sizes equal to 0 (the sizes.min() is almost always zero).
This looks peculiar to me: we are picking 1,000,000 integers between 10 and 9,000,000, and I expected the odds of the same number coming up twice (such that diff = 0) to be small, yet they seem to be very close to 1.
Would that be due to the algorithm behind np.random.randint?

What you want is np.add.reduceat
output = np.add.reduceat(a, ind, axis = 1)
output.shape
Out[]: (10, 1000000)
Universal functions (ufuncs) are a very powerful tool in numpy.
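A quick illustration of reduceat's segment semantics (a sketch, not from the original answer): each index in ind starts a new segment, the last segment runs to the end of the axis, and a leading 0 must be prepended to also cover the head of the array (which is why ind2 appears in the follow-up below):

import numpy as np
x = np.arange(10)
np.add.reduceat(x, [2, 5])     # sums x[2:5] and x[5:]        -> array([ 9, 35])
np.add.reduceat(x, [0, 2, 5])  # prepend 0 to also get x[0:2] -> array([ 1,  9, 35])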
As for the repeated indices, that's simply the Birthday Problem cropping up.
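A back-of-the-envelope check with the standard birthday approximation P(collision) ≈ 1 - exp(-K(K-1)/(2D)) for K draws from D equally likely values (a sketch assuming uniform, independent draws):

import numpy as np
K, D = 1_000_000, 9_000_000 - 10   # number of draws and range size from the example above
p = 1 - np.exp(-K * (K - 1) / (2 * D))
print(p)  # ~1.0, so a zero-length segment is all but guaranteed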

Great!
Thanks! On my CentOS 6.9 VM I have the following results:
In [71]: a = np.random.random((10, 10000000))
In [72]: ind = np.unique(np.random.randint(10, 9000000, 100000))
In [73]: ind2 = np.append([0], ind)
In [74]: out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
In [75]: out2 = np.add.reduceat(a, ind2, axis=1)
In [83]: np.allclose(out, out2)
Out[83]: True
In [84]: %timeit out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
2.7 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [85]: %timeit out2 = np.add.reduceat(a, ind2, axis=1)
179 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's a good 93% reduction in runtime (a factor of 15 faster) over the list concatenation :-)

Related

Julia: FAST way of calculating the smallest distances between two sets of points

I have 5000 3D points in a matrix A and another 5000 3D points in a matrix B.
For each point in A I want to find the smallest distance to a point in B. These distances should be stored in an array with 5000 entries.
So far I have this solution, running in about 0.145342 seconds (23 allocations: 191.079 MiB). How can I improve this further?
using Distances
A = rand(5000, 3)
B = rand(5000, 3)
mis = @time minimum(Distances.pairwise(SqEuclidean(), A, B, dims=1), dims=2)
This is a standard way to do it as it will have a better time complexity (especially for larger data):
using NearestNeighbors
nn(KDTree(B'; leafsize = 10), A')[2] .^ 2
Two comments:
by default Euclidean distance is computed (so I square it)
by default NearestNeighbors.jl assumes observations are stored in columns (so I need B' and A' in the solution; if your original data were transposed, it would not be needed; the reason it is designed this way is that Julia uses column-major matrix storage)
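For comparison only (not part of the original answer), the same KD-tree approach written in Python with SciPy's cKDTree would look roughly like this:

import numpy as np
from scipy.spatial import cKDTree

A = np.random.rand(5000, 3)
B = np.random.rand(5000, 3)

# build the tree on B and query it with every point of A;
# query returns Euclidean distances, so square them to match SqEuclidean
dist, _ = cKDTree(B).query(A)
mis = dist ** 2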
Generating a big distance matrix using Distances.pairwise(SqEuclidean(), A, B, dims=1) is not efficient, because main memory is pretty slow nowadays compared to CPU caches and the computing power of modern CPUs, and this is not going to get better any time soon (see "memory wall"). It is faster to compute the minimum on the fly using two basic nested for loops. Additionally, one can use multiple threads to compute this faster on multiple cores.
function computeMinDist(A, B)
    n, m = size(A, 1), size(B, 1)
    result = zeros(n)
    Threads.@threads for i = 1:n
        minSqDist = Inf
        @inbounds for j = 1:m
            dx = A[i,1] - B[j,1]
            dy = A[i,2] - B[j,2]
            dz = A[i,3] - B[j,3]
            sqDist = dx*dx + dy*dy + dz*dz
            if sqDist < minSqDist
                minSqDist = sqDist
            end
        end
        result[i] = minSqDist
    end
    return result
end
mis = @time computeMinDist(A, B)
Note that Julia uses 1 thread by default, but this can be tuned using the environment variable JULIA_NUM_THREADS=auto or by running with the flag --threads=auto. See the multi-threading documentation for more information.
Performance results
Here are performance results on my i5-9600KF machine with 6 cores (with two 5000x3 matrices):
Initial implementation: 93.4 ms
This implementation: 4.4 ms
This implementation is thus 21 times faster.
Results are the same to within a few ULPs.
Note the code can certainly be optimized further using loop tiling, and possibly by transposing A and B so the JIT can generate a more efficient implementation using SIMD instructions.

How to speed up this MATLAB code which is already vectorized

I'm trying to speed up steps 1-4 in the following code (the rest is setup that will be predetermined for my actual problem.)
% Given sizes:
m = 200;
n = 1e8;
% Given vectors:
value_vector = rand(m, 1);
index_vector = randi([0 200], n, 1);
% Objective: determine the entries of values based on the indices provided
% by index_vector, which index into value_vector
% 0. Preallocate
values = zeros(n, 1);
% 1. Remove "0" indices since these won't have values assigned
nonzero_inds = (index_vector ~= 0);
% 2. Examine only nonzero indices
value_inds = index_vector(nonzero_inds);
% 3. Get the values for these indices
nonzero_values = value_vector(value_inds);
% 4. Assign values to output (0 for those with 0 index)
values(nonzero_inds) = nonzero_values;
Here's my analysis of these portions of the code:
1. Necessary since index_vector will contain zeros which need to be ferreted out. O(n), since it's just a matter of going through the vector one element at a time and checking (value ~= 0).
2. Should be O(n) to go through index_vector and retain those that are nonzero from the previous step.
3. Should be O(n), since we have to check each nonzero index_vector element, and for each element we access value_vector, which is O(1).
4. Should be O(n) to go through each element of nonzero_inds, access the corresponding values index, access the corresponding nonzero_values element, and assign it to the values vector.
The code above takes about 5 seconds to run through steps 1-4 on 4 cores, 3.8GHz. Do you all have any ideas on how this could be sped up? Thanks.
Wow, I found something really interesting. I saw this link in the "related" section about indexing vectors being inefficient in Matlab sometimes, so I decided to try a for loop. This code ended up being an order of magnitude faster!
for i = 1:n
    if index_vector(i) > 0
        values(i) = value_vector(index_vector(i));
    end
end
EDIT: Another interesting thing, unfortunately detrimental to my problem though. The speed of this solution depends on the proportion of zeros in index_vector. With index_vector = randi([0 200], n, 1), a small proportion of the values are zeros, but if I try index_vector = randi([0 1], n, 1), approximately half of the values will be zero, and then the above for loop is actually an order of magnitude slower. However, using ~= instead of > speeds the loop back up so that it's on a similar order of magnitude. Very interesting and odd behavior.
If you stick to MATLAB and the flow of the algorithm you want, rather than doing this in Fortran or C, here's a small start:
Change the randi to rand, round by casting to uint8, and use the > logical operation, which for some reason is faster on my end.
to sum up:
value_vector = rand(m, 1 );
index_vector = uint8(-0.5+201*rand(n,1) );
values = zeros(n, 1);
values(index_vector>0) = value_vector(index_vector(index_vector>0)); % assign in place so zero-index entries keep their 0
This improved things on my end by a factor of 1.6.
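As a cross-check (a sketch, not part of the original answer), the same masked lookup in NumPy is a single fancy-indexing step; note that NumPy is 0-based, so the valid indices must be shifted down by one:

import numpy as np

m, n = 200, 10_000_000  # smaller n than the MATLAB example, for a quick run
value_vector = np.random.rand(m)
index_vector = np.random.randint(0, 201, n)  # 0 still means "no value"

values = np.zeros(n)
mask = index_vector != 0
values[mask] = value_vector[index_vector[mask] - 1]  # -1 shifts to 0-based indexing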

Efficiently calculate histogram of a 3D numpy array along an axis with different bin edges

Problem description
I have a 3D numpy array, denoted as data, of shape N x R x C, i.e. N samples, R rows and C columns. I would like to obtain histograms along the column axis for each combination of sample and row. However, the bin edges (see the bins argument of numpy.histogram), of fixed length S + 1, differ between rows but are shared across samples. Consider this example for illustration: for the 1st sample (data[0]), the bin edge sequence for its 1st row is different from that for its 2nd row, but is the same as that for the 1st row of the 2nd sample (data[1]). Thus all the bin edge sequences are stored in a 2D numpy array of shape R x (S + 1), denoted as bin_edges.
My question is how to efficiently calculate the histograms?
A working but slow solution
Using numpy.histogram, I was able to come up with a working but fairly slow solution, as shown in the code snippet below:
import numpy as np

# Get dummy data
# N: number of samples
# R: number of rows (or kernels)
# C: number of columns (or pixels)
# S: number of bins
N, R, C, S = 100, 50, 1000, 10
data = np.random.randn(N, R, C)
# for each row/kernel, pool pixels of all samples
poolsamples = np.swapaxes(data, 0, 1).reshape(R, -1)
# use quantiles as bin edges
percentiles = np.linspace(0, 100, num=(S + 1))
bin_edges = np.transpose(np.percentile(poolsamples, percentiles, axis=1))

# a working but slow way of getting histograms along the column axis
hist = np.empty((N, R, S))
for idx in np.arange(R):
    bin_edges_i = bin_edges[idx, :]
    counts = np.apply_along_axis(
        lambda a: np.histogram(a, bins=bin_edges_i)[0],
        1, data[:, idx, :])
    hist[:, idx, :] = counts
Possible directions
A fancy numpy reshape to avoid the for loop entirely
This problem arises from extracting low-end characteristics for each image forwarded through a trained neural network. Therefore, if the histogram extraction can be embedded in a TensorFlow graph and ultimately be carried out on a GPU, that would be ideal!
I noticed a Python package, fast-histogram, which claims to be 7-15x faster than numpy.histogram. However, its 1D histogram function only takes a number of bins instead of actual bin positions.
numexpr?
I would love to hear any inputs! Thanks in advance!
Making use of a 2D version of np.searchsorted, searchsorted2d -
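(The helper searchsorted2d is not defined in this post; it refers to Divakar's offset trick for doing a row-wise searchsorted in one flat call. A version along these lines is assumed:)

def searchsorted2d(a, b):
    # shift each row into its own disjoint value band, then do one flat
    # searchsorted over everything and undo the row offsets afterwards
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * np.arange(m)[:, None]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel()).reshape(m, -1)
    return p - n * np.arange(m)[:, None]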
def vectorized_app(data, bin_edges):
    N, R, C = data.shape
    a = np.sort(data.reshape(-1, C), 1)
    b = np.repeat(bin_edges[None], N, axis=0).reshape(-1, bin_edges.shape[-1])
    idx = searchsorted2d(a, b)
    idx[:, 0] = 0
    idx[:, -1] = a.shape[1]
    out = (idx[:, 1:] - idx[:, :-1]).reshape(N, R, -1)
    return out
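(The timings below refer to the loopy solution as org_app; a wrapper along the following lines is assumed, since it is not shown in the post:)

def org_app(data, bin_edges):
    # the question's loopy solution, wrapped as a function for timing
    N, R, C = data.shape
    S = bin_edges.shape[-1] - 1
    hist = np.empty((N, R, S))
    for idx in np.arange(R):
        bin_edges_i = bin_edges[idx, :]
        hist[:, idx, :] = np.apply_along_axis(
            lambda a: np.histogram(a, bins=bin_edges_i)[0],
            1, data[:, idx, :])
    return hist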
Runtime test -
In [591]: N, R, C, S = 100, 50, 1000, 10
...: data = np.random.randn(N, R, C)
...:
...: # for each row/kernel, pool pixels of all samples
...: poolsamples = np.swapaxes(data, 0, 1).reshape(R, -1)
...: # use quantiles as bin edges
...: percentiles = np.linspace(0, 100, num=(S + 1))
...: bin_edges = np.transpose(np.percentile(poolsamples, percentiles, axis=1))
...:
In [592]: %timeit org_app(data, bin_edges)
1 loop, best of 3: 481 ms per loop
In [593]: %timeit vectorized_app(data, bin_edges)
1 loop, best of 3: 224 ms per loop
In [595]: np.allclose(org_app(data, bin_edges), vectorized_app(data, bin_edges))
Out[595]: True
More than 2x speedup there.
A closer look reveals that the bottleneck of the proposed vectorized method is the sorting itself -
In [594]: %timeit np.sort(data.reshape(-1,C),1)
1 loop, best of 3: 194 ms per loop
We need this sorting to use searchsorted.

Looking for efficient way to perform a computation - Matlab

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w
    for y=1:h
        X{x,y}=zeros(w,h);
        for i=1:w
            for j=1:h
                X{x,y}(i,j)=f([x,y],[i,j]);
            end
        end
    end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the two innermost loops and replace them with a vectorised version. By the look of your f function, this shouldn't be too bad.
First we need to construct two matrices containing i = 1:w down every column and j = 1:h along every row, like so:
wMat=repmat((1:w)',1,h);
hMat=repmat(1:h,w,1);
These represent the inner two loops, giving all (i,j) combinations at once. Now we can vectorise the calculation of f([x,y],[i,j]) = exp(-norm([x,y]-[i,j])^2/sigma^2):
for x=1:w
    for y=1:h
        temp1=(x-wMat).^2+(y-hMat).^2;   % squared Euclidean distances to every (i,j)
        X{x,y}=exp(-temp1/(sigma^2));    % note the minus sign from the definition of f
    end
end
where we have computed the squared Euclidean norm for all pairs handled by the inner two loops at once.
Some discussion and code
The trick here is to perform the norm-calculations with numeric arrays and save the results into a cell array version as late as possible. For performing the norm-calculations you can take help of ndgrid, bsxfun and some permute + reshape to give it the "shape" as needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
                bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
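For reference (a sketch, not part of the original answer), the same all-pairs computation in NumPy is a short broadcasting exercise; sizes are kept modest here because the broadcasted difference array is O((w*h)^2) in memory:

import numpy as np

w, h, sigma = 40, 40, 1.0
x, y = np.meshgrid(np.arange(1, w + 1), np.arange(1, h + 1), indexing='ij')
pts = np.stack([x.ravel(), y.ravel()], axis=1)           # (w*h, 2) grid points
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # all-pairs squared distances
vals = np.exp(-sq / sigma**2)
# vals[p, q] = f(pts[p], pts[q]); reshape so X[x-1, y-1, i-1, j-1] = f([x,y],[i,j])
X = vals.reshape(w, h, w, h)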
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and inlining of f, runtime benchmarks were performed against the proposed vectorized approach with datasizes w = h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggests a whopping speedup of close to 20x with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, essentially you are not giving bsxfun enough memory to work with, and bsxfun is known to use a lot of memory to deliver a performance-efficient vectorized solution. So, for such huge-datasize cases, you can use the following loopy approach to replace the normvals calculation listed in the earlier bsxfun-based solution -
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
    normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
It seems to me that when you run through the loop iteration for x=w, y=h, you are calculating all the values you need at once, so you don't need to recalculate them. Once you have this:
for i=1:w
    for j=1:h
        temp(i,j)=f([x,y],[i,j]);
    end
end
Then, e.g. X{1,1} is just temp(1,1), X{2,2} is just temp(1:2,1:2), and so on. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices by different weights and then adds them together to form a single matrix. I have identified this process as the bottleneck in my program (this weighting will be performed many times during a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran, so I appreciate any help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
n = 46;
m = 1800;
A = rand(n,m,m);
w = rand(n,1);
tic;
B = squeeze(sum(bsxfun(@times,w,A),1));
toc;
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
    implicit none
    integer, parameter :: n = 46
    integer, parameter :: m = 1800
    double precision, dimension(n), intent(in) :: w
    double precision, dimension(n,m,m), intent(in) :: A
    double precision, dimension(m,m) :: B
    integer :: i
    B = 0
    do i = 1,n
        B = B + w(i)*A(i,:,:)
    end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (the function call timed with "call cpu_time(t)"). If I manually unroll the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running I would like to try different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means a quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant, after init., so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
Thank you in advance for any input! :)
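As an aside (not part of the original question): in NumPy the whole weighting-and-summing collapses to a single tensor contraction, which can double as a reference implementation for checking the Fortran results:

import numpy as np

n, m = 46, 180  # reduced m for a quick check; the post uses m = 1800
A = np.random.rand(n, m, m)
w = np.random.rand(n)

B = np.tensordot(w, A, axes=(0, 0))  # B[j,k] = sum_i w[i] * A[i,j,k]
assert np.allclose(B, sum(w[i] * A[i] for i in range(n)))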
Perhaps you could try something like
do k=1,m
    do j=1,m
        B(j,k) = sum( [ (w(i)*A(i,j,k), i=1,n) ] )
    enddo
enddo
The square brackets are a newer form of (/ /), the 1-D array constructor. The term inside sum is an array of dimension (n), and sum adds up all of its elements. This is precisely what your unrolled code does (and is not exactly equal to the do loop you have).
I tried to refine Kyle Vanos' solution, deciding to use sum and Fortran's vector capabilities. I don't know if the results are correct, because I only looked at the timings!
Version 1: (for comparison)
B = 0
do i = 1,n
    B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
    do j=1,m
        B(j,k) = sum( [ (w(i)*A(i,j,k), i=1,n) ] )
    enddo
enddo
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
    B(:,j) = sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
enddo
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
Timing
As you can see, I had to mix up the indices to get faster execution times. The third solution is really strange, because the matrix number is the middle index, but this is necessary for memory-ordering reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
Concluding, I would say that you can get a massive speedup if you have the option to rearrange the order of the matrix indices.
I would not hide any looping, as this is usually slower. If you write it out explicitly, you'll see that the inner-loop access is over the last index, making it inefficient. So, you should make sure your n dimension is the last one by storing A as A(m,m,n):
B = 0
do i = 1,n
    w_tmp = w(i)
    do j = 1,m
        do k = 1,m
            B(k,j) = B(k,j) + w_tmp*A(k,j,i)
        end do
    end do
end do
this should be much more efficient as you are now accessing consecutive elements in memory in the inner loop.
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
    CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
EDIT: you might also want to use some parallellization? (I don't know if Matlab takes advantage of that)
EDIT2: In Kyle's answer, the inner loop runs over the different values of w, which is more efficient than reloading B n times, since w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
    do j = 1,m
        B(j,i) = 0.0d0
        do k = 1,n
            B(j,i) = B(j,i) + w(k)*A(k,j,i)
        end do
    end do
end do
This explicit looping performs about 10% better than Kyle's code, which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s; with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post, but I am glad to contribute, as I have played with most of the posted solutions. Adding a local unroll to the weights loop (from Steabert's answer) gives me a speed-up over the complete-unroll version, ranging from 10% to 80% depending on the matrix sizes. The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
    implicit none
    integer, parameter :: n = 46
    integer, parameter :: m = 1800
    real(8), intent(in) :: w(n)
    real(8), intent(in) :: A(n,m,m)
    real(8) :: B(m,m)
    real(8) :: Btemp(4)
    integer :: i, j, k, l, ndiv, nmod, roll
    !==================================================
    roll = 4
    ndiv = n / roll
    nmod = mod( n, roll )
    do i = 1,m
        do j = 1,m
            B(j,i) = 0.0d0
            k = 1
            do l = 1,ndiv
                Btemp(1) = w(k  )*A(k  ,j,i)
                Btemp(2) = w(k+1)*A(k+1,j,i)
                Btemp(3) = w(k+2)*A(k+2,j,i)
                Btemp(4) = w(k+3)*A(k+3,j,i)
                k = k + roll
                B(j,i) = B(j,i) + sum( Btemp )
            end do
            do l = 1,nmod   !---- process the rest of the loop
                B(j,i) = B(j,i) + w(k)*A(k,j,i)
                k = k + 1
            enddo
        end do
    end do
end function
