Minimizing overhead due to the large number of Numpy dot calls - performance

My problem is the following: I have an iterative algorithm that at each iteration needs to perform several matrix-matrix multiplications dot(A_i, B_i), for i = 1 ... k. Since these multiplications are performed with Numpy's dot, I know they call a BLAS-3 implementation, which is quite fast. The problem is that the number of calls is huge, and it has turned out to be a bottleneck in my program. I would like to minimize the overhead due to all these calls by making fewer products but with bigger matrices.
For simplicity, consider that all matrices are n x n (usually n is not big; it ranges between 1 and 1000). One way around my problem would be to consider the block-diagonal matrix diag(A_1, ..., A_k) and multiply it by the vertically stacked matrix [B_1; ...; B_k].
This is just one call to the function dot, but now the program wastes a lot of time performing multiplications with zeros. The idea doesn't seem practical, but it does give the result [A_1 B_1, ..., A_k B_k], that is, all the products stacked in a single big matrix.
My question is this: is there a way to compute [A_1 B_1, ..., A_k B_k] with a single function call? Or, even more to the point, how can I compute these products faster than with a loop of Numpy dots?
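To make the setting concrete, here is a minimal sketch of the baseline loop I am trying to speed up (the names A_list and B_list are just illustrative):

import numpy as np

# One BLAS call per pair: each call is fast, but the per-call overhead
# dominates when k is large and n is small.
k, n = 1000, 8
A_list = [np.random.rand(n, n) for _ in range(k)]
B_list = [np.random.rand(n, n) for _ in range(k)]
products = [np.dot(A, B) for A, B in zip(A_list, B_list)]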

It depends on the size of the matrices
Edit
For larger n x n matrices (approx. size 20 and above) a BLAS call from compiled code is faster; for smaller matrices custom Numba or Cython kernels are usually faster.
The following method generates custom dot functions for given input shapes. With this method it is also possible to benefit from compiler-related optimizations like loop unrolling, which are especially important for small matrices.
Note that generating and compiling one kernel takes approx. 1 s, so make sure to call the generator only if you really have to.
Generator function
import numpy as np
import numba as nb

def gen_dot_nm(x, y, z):
    # small kernels
    @nb.njit(fastmath=True, parallel=True)
    def dot_numba(A, B):
        """
        Calculate the dot product for (x,y)x(y,z)
        """
        assert A.shape[0] == B.shape[0]
        assert A.shape[2] == B.shape[1]
        assert A.shape[1] == x
        assert B.shape[1] == y
        assert B.shape[2] == z
        res = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            for i in range(x):
                for j in range(z):
                    acc = 0.
                    for k in range(y):
                        acc += A[ii, i, k] * B[ii, k, j]
                    res[ii, i, j] = acc
        return res

    # large kernels
    @nb.njit(fastmath=True, parallel=True)
    def dot_BLAS(A, B):
        assert A.shape[0] == B.shape[0]
        assert A.shape[2] == B.shape[1]
        res = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            res[ii] = np.dot(A[ii], B[ii])
        return res

    # For square matrices above approx. size 20,
    # calling BLAS is faster
    if x >= 20 or y >= 20 or z >= 20:
        return dot_BLAS
    else:
        return dot_numba
Usage example
A=np.random.rand(1000,2,2)
B=np.random.rand(1000,2,2)
dot22=gen_dot_nm(2,2,2)
X=dot22(A,B)
%timeit X3=dot22(A,B)
#5.94 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
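Since each kernel takes roughly a second to generate and compile, it can pay off to memoize the generator so each shape triple is compiled only once. A minimal sketch using functools.lru_cache (the wrapper name get_dot is an illustrative choice, assuming gen_dot_nm from above is in scope):

from functools import lru_cache

@lru_cache(maxsize=None)
def get_dot(x, y, z):
    # Each distinct (x, y, z) compiles once; repeated calls hit the cache.
    return gen_dot_nm(x, y, z)

dot22 = get_dot(2, 2, 2)        # compiled on the first call
dot22_again = get_dot(2, 2, 2)  # returned instantly from the cache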
Old answer
Another alternative, but with more work to do, would be to use special BLAS implementations that create custom kernels for very small matrices just in time and then call those kernels from C.
Example
import numpy as np
import numba as nb

# Don't use this for larger submatrices
@nb.njit(fastmath=True, parallel=True)
def dot(A, B):
    assert A.shape[0] == B.shape[0]
    assert A.shape[2] == B.shape[1]
    res = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        for i in range(A.shape[1]):
            for j in range(B.shape[2]):
                acc = 0.
                for k in range(B.shape[1]):
                    acc += A[ii, i, k] * B[ii, k, j]
                res[ii, i, j] = acc
    return res

@nb.njit(fastmath=True, parallel=True)
def dot_22(A, B):
    assert A.shape[0] == B.shape[0]
    assert A.shape[1] == 2
    assert A.shape[2] == 2
    assert B.shape[1] == 2
    assert B.shape[2] == 2
    res = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        res[ii, 0, 0] = A[ii, 0, 0] * B[ii, 0, 0] + A[ii, 0, 1] * B[ii, 1, 0]
        res[ii, 0, 1] = A[ii, 0, 0] * B[ii, 0, 1] + A[ii, 0, 1] * B[ii, 1, 1]
        res[ii, 1, 0] = A[ii, 1, 0] * B[ii, 0, 0] + A[ii, 1, 1] * B[ii, 1, 0]
        res[ii, 1, 1] = A[ii, 1, 0] * B[ii, 0, 1] + A[ii, 1, 1] * B[ii, 1, 1]
    return res
Timings
A = np.random.rand(1000, 2, 2)
B = np.random.rand(1000, 2, 2)
X = A @ B
X2 = np.einsum("xik,xkj->xij", A, B)
X3 = dot_22(A, B)  # avoid measuring compilation overhead
X4 = dot(A, B)     # avoid measuring compilation overhead
%timeit X = A @ B
#262 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum("xik,xkj->xij", A, B, optimize=True)
#264 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit X3 = dot_22(A, B)
#5.68 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit X4 = dot(A, B)
#9.79 µs ± 61.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You can stack the arrays to have shape (k, n, n), and call numpy.matmul or use the @ operator.
For example,
In [18]: A0 = np.array([[1, 2], [3, 4]])
In [19]: A1 = np.array([[1, 2], [-3, 5]])
In [20]: A2 = np.array([[4, 0], [1, 1]])
In [21]: B0 = np.array([[1, 4], [-3, 4]])
In [22]: B1 = np.array([[2, 1], [1, 1]])
In [23]: B2 = np.array([[-2, 9], [0, 1]])
In [24]: np.matmul([A0, A1, A2], [B0, B1, B2])
Out[24]:
array([[[-5, 12],
        [-9, 28]],

       [[ 4,  3],
        [-1,  2]],

       [[-8, 36],
        [-2, 10]]])
Or, using @:
In [32]: A = np.array([A0, A1, A2])
In [33]: A
Out[33]:
array([[[ 1,  2],
        [ 3,  4]],

       [[ 1,  2],
        [-3,  5]],

       [[ 4,  0],
        [ 1,  1]]])
In [34]: B = np.array([B0, B1, B2])
In [35]: A @ B
Out[35]:
array([[[-5, 12],
        [-9, 28]],

       [[ 4,  3],
        [-1,  2]],

       [[-8, 36],
        [-2, 10]]])

If you don't want to waste time multiplying zeros, then what you really want are sparse matrices. Using the A and B matrices from @WarrenWeckesser's answer:
from scipy import sparse
sparse.block_diag((A0, A1, A2), format="csr") @ np.concatenate((B0, B1, B2), axis=0)
Out[]:
array([[-5, 12],
       [-9, 28],
       [ 4,  3],
       [-1,  2],
       [-8, 36],
       [-2, 10]], dtype=int32)
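If the k separate products are needed individually, the vertically stacked (k*n, n) result can be reshaped back to (k, n, n); a small sketch using the values above (the name stacked is illustrative):

import numpy as np

stacked = np.array([[-5, 12],
                    [-9, 28],
                    [ 4,  3],
                    [-1,  2],
                    [-8, 36],
                    [-2, 10]])
# Recover the three separate 2x2 products from the vertical stack.
products = stacked.reshape(3, 2, 2)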
This is likely a speedup for large matrices. For smaller ones, @max9111 probably has the right idea using numba.

Related

pytorch: efficient way to perform operations on 2 tensors of different sizes, where one has a one-to-many relation

I have 2 tensors. The first tensor is 1D (e.g. a tensor of 3 values). The second tensor is 2D, with its first column holding IDs into the first tensor in a one-to-many relationship (e.g. a tensor with a shape of (6, 2)).
# e.g. simple example of dot product
import torch
a = torch.tensor([2, 4, 3])
b = torch.tensor([[0, 2], [0, 3], [0, 1], [1, 4], [2, 3], [2, 1]]) # 1st column is the index to tensor a, 2nd column is the value
output = [(2*2)+(2*3)+(2*1),(4*4),(3*3)+(3*1)]
output = [12, 16, 12]
What I currently have is: find the size of each id group in b (e.g. [3, 1, 2]), use torch.split to group the values into a list of tensors, and run a for loop over the groups, as sketched below. This is fine for a small tensor, but when the sizes of the tensors are in the millions, with tens of thousands of arbitrarily-sized groups, it becomes very slow.
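For reference, a minimal sketch of that split-and-loop approach (assuming, as in the example, that b is sorted by id):

import torch

a = torch.tensor([2, 4, 3])
b = torch.tensor([[0, 2], [0, 3], [0, 1], [1, 4], [2, 3], [2, 1]])

counts = torch.bincount(b[:, 0]).tolist()  # group sizes, e.g. [3, 1, 2]
groups = torch.split(b[:, 1], counts)      # one tensor of values per id
output = torch.stack([a[i] * g.sum() for i, g in enumerate(groups)])
print(output)  # tensor([12, 16, 12])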
Any better solutions?
You can use numpy.bincount or torch.bincount to sum the elements of b by key:
import numpy as np
a = np.array([2,4,3])
b = np.array([[0,2], [0,3], [0,1], [1,4], [2,3], [2,1]])
print( np.bincount(b[:,0], b[:,1]) )
# [6. 4. 4.]
print( a * np.bincount(b[:,0], b[:,1]) )
# [12. 16. 12.]
import torch
a = torch.tensor([2,4,3])
b = torch.tensor([[0,2], [0,3], [0,1], [1,4], [2,3], [2,1]])
torch.bincount(b[:,0], b[:,1])
# tensor([6., 4., 4.], dtype=torch.float64)
a * torch.bincount(b[:,0], b[:,1])
# tensor([12., 16., 12.], dtype=torch.float64)
References:
numpy.bincount official documentation;
torch.bincount official documentation;
How can I reduce a numpy array based on a key rather than an axis?
Another alternative in pytorch, if gradients are needed:
import torch
a = torch.tensor([2,4,3])
b = torch.tensor([[0,2], [0,3], [0,1], [1,4], [2,3], [2,1]])
output = torch.zeros(a.shape[0], dtype=torch.long).index_add_(0, b[:, 0], b[:, 1]) * a
Alternatively, torch.Tensor.scatter_add_ also works.
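A minimal sketch of that variant, with the same inputs as above (scatter_add_ accumulates src values into the positions named by index along the given dimension):

import torch

a = torch.tensor([2, 4, 3])
b = torch.tensor([[0, 2], [0, 3], [0, 1], [1, 4], [2, 3], [2, 1]])

# Sum b[:, 1] into slots selected by b[:, 0], then scale by a.
sums = torch.zeros(a.shape[0], dtype=torch.long).scatter_add_(0, b[:, 0], b[:, 1])
print(sums * a)  # tensor([12, 16, 12])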

Julia sparse matrix with random 1's

So I have a size N in julia and I need an NxN sparse matrix with N ones in it, in random places. What would be the best way to go about this?
At first I thought about randomly generating indexes and then setting those numbers to 1 in a sparse matrix, but then I found the sprand functions. However, I don't understand how to use them correctly or apply them to my problem; I tried using them with my limited understanding and they keep generating error messages. Help is of course always greatly appreciated :)
Inspired by @DanGetz's comment above, the following solution is a one-line function using randperm. I deleted the original answer as it was not very helpful.
using SparseArrays, Random
sparseN(N) = sparse(randperm(N), randperm(N), ones(N), N, N)
This is also incredibly fast:
@time sparseN(10_000);
0.000558 seconds (30 allocations: 782.563 KiB)
A sparse matrix of dimension (N rows) x (M columns) has at most N*M components, which can be indexed by the integer set K = [0, N*M). For any k in K you can retrieve the element indices (i, j) via Euclidean division: k = i + j*N (column-major layout here).
To randomly sample n elements of K (without repetition), you can use Knuth's "Algorithm S (Selection sampling technique)" from Section 3.4.2 of The Art of Computer Programming, Vol. 2: Seminumerical Algorithms.
In Julia:
function random_select(n::Int64, K::Int64)
    @assert 0 <= n <= K
    sample = Vector{Int64}(undef, n)
    t = Int64(0)
    m = Int64(0)
    while m < n
        if (K - t) * rand() >= n - m
            t += 1
        else
            m += 1
            sample[m] = t
            t += 1
        end
    end
    sample
end
The next part simply retrieves the I,J indices to create the sparse matrix from its coordinate form:
using SparseArrays

function create_sparseMatrix(n::Int64, N::Int64, M::Int64)
    @assert (0 <= N) && (0 <= M)
    @assert 0 <= n <= N*M
    nonZero = random_select(n, N*M)
    # column major: k = i + j*N
    I = map(k -> mod(k, N), nonZero)
    J = map(k -> div(k, N), nonZero)
    sparse(I .+ 1, J .+ 1, ones(n), N, M)
end
Usage example: a 4x5 sparse matrix with 3 nonzero (=1.0) at random positions:
julia> create_sparseMatrix(3,4,5)
4×5 SparseMatrixCSC{Float64,Int64} with 3 stored entries:
[4, 1] = 1.0
[3, 2] = 1.0
[3, 3] = 1.0
Border case tests:
julia> create_sparseMatrix(0,4,5)
4×5 SparseMatrixCSC{Float64,Int64} with 0 stored entries
julia> create_sparseMatrix(4*5,4,5)
4×5 SparseMatrixCSC{Float64,Int64} with 20 stored entries:
[1, 1] = 1.0
[2, 1] = 1.0
[3, 1] = 1.0
[4, 1] = 1.0
⋮
[4, 4] = 1.0
[1, 5] = 1.0
[2, 5] = 1.0
[3, 5] = 1.0
[4, 5] = 1.0
Insisting on a one-line-ish solution:
using StatsBase, SparseArrays

sparseones(N, M, K) = sparse(
    (x -> (first.(x) .+ 1, last.(x) .+ 1))(divrem.(sample(0:N*M-1, K, replace=false), M))...,
    ones(K), N, M
)
Giving:
julia> sparseones(3,4,5)
3×4 SparseMatrixCSC{Float64,Int64} with 5 stored entries:
[1, 1] = 1.0
[2, 1] = 1.0
[3, 3] = 1.0
[2, 4] = 1.0
[3, 4] = 1.0
This method is essentially the same as the earlier answer, with the advantage of reusing the existing sample function from StatsBase and being much shorter. It is even faster on larger matrices.

How to vectorize getting sub arrays from numpy array using indexing arrays

I want to get a numpy array of sub-arrays from a base array using some type of indexing arrays (the style/format of the indexing arrays is open to suggestions). I can easily do this with a for loop, but I am wondering if there is a clever way to use numpy broadcasting?
Constraints: Sub-arrays are guaranteed to be the same size.
import numpy as np

up_idx = np.array([[0, 0],
                   [0, 2],
                   [1, 1]])
lw_idx = np.array([[2, 2],
                   [2, 4],
                   [3, 3]])
base = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

samples = []
for index in range(up_idx.shape[0]):
    up_row = up_idx[index, 0]
    up_col = up_idx[index, 1]
    lw_row = lw_idx[index, 0]
    lw_col = lw_idx[index, 1]
    samples.append(base[up_row:lw_row, up_col:lw_col])
samples = np.array(samples)
print(samples)
> [[[ 1  2]
   [ 5  6]]

  [[ 3  4]
   [ 7  8]]

  [[ 6  7]
   [10 11]]]
I've tried:
vector_s = base[up_idx[:, 0]:lw_idx[:, 1], up_idx[:, 1]:lw_idx[:, 1]]
But that was just nonsensical, it seems.
I don't think there is a fast way to do this in general via numpy broadcasting operations. For one thing, the way you have set up the problem, there is no guarantee that the resulting sub-arrays will be the same shape, and thus able to fit into a single output array.
The most succinct and efficient way to solve this is probably via a list comprehension; e.g.
result = np.array([base[i1:i2, j1:j2] for (i1, j1), (i2, j2) in zip(up_idx, lw_idx)])
Unless your base array is very large, this shouldn't be much of a bottleneck.
If you have different problem constraints (i.e. same size slice in every case) it may be possible to come up with a faster vectorized solution based on fancy indexing. For example, if every slice is of size two (as in your example above) then you can use fancy indexing like this to obtain the same result:
i, j = up_idx.T[:, :, None] + np.arange(2)
result = base[i[:, :, None], j[:, None]]
The key to understanding this fancy indexing is to realize that the result follows the broadcasted shape of the index arrays.
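For what it's worth, here is a sketch of how the same idea generalizes to any fixed window shape (h, w); the helper take_windows is my own name, not part of the answer above:

import numpy as np

def take_windows(base, up_idx, h, w):
    # Row/column start indices plus in-window offsets: shapes (k, h) and (k, w).
    i = up_idx[:, 0, None] + np.arange(h)
    j = up_idx[:, 1, None] + np.arange(w)
    # Broadcast (k, h, 1) against (k, 1, w) to gather a (k, h, w) result.
    return base[i[:, :, None], j[:, None, :]]

base = np.arange(1, 13).reshape(3, 4)
up_idx = np.array([[0, 0], [0, 2], [1, 1]])
print(take_windows(base, up_idx, 2, 2))  # same result as the loop above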

List of all possible permutations of factors of a number

I am trying to find all the possible factorizations of a given number in Python.
For example:
1) given n=12, the output will be f(n) = [[2,2,3], [4,3], [6,2], [12]]
2) given n=24, the output will be f(n) = [[2,2,2,3], [2,2,6], [2,12], [4,6], [8,3], [24]]
Here is my code:
def p(a):
    k = 1
    m = 1
    n = []
    for i in range(len(a)):
        for j in range(0, i+1):
            k *= a[j]
        for l in range(i+1, len(a)):
            m *= a[l]
        n += [[k, m], ]
        k = 1
        m = 1
    return n

def f(n):
    primfac = []
    d = 2
    while d*d <= n:
        while (n % d) == 0:
            primfac.append(d)
            n //= d
        d += 1
    if n > 1:
        primfac.append(n)
    return p(primfac)
But my code returns the following values:
1) For n=12, the output is [[2, 10], [4, 5], [20, 1]]
2) For n=24, the output is [[2, 12], [4, 6], [8, 3], [24, 1]]
What can I do to get the relevant results?
I don't know Python, so I can't help you with the code, but here is an explanation I provided for a related question (with a bit of Java code as well, if you can read Java):
1. Get your number factored with multiplicity. This is, with high probability, the most expensive step, O(sqrt(N)); you can stop here if this is all that you want.
2. Build the sets {1, p_i, p_i^2, ..., p_i^(m_i)}, where p_i is a prime factor with multiplicity m_i.
3. Perform a Cartesian product between these sets and you'll get all the divisors of your number. You'll spend longer here only for numbers with many distinct factors (and multiplicities); e.g. 2^10 * 3^8 * 5^4 * 7^3 has (10+1)(8+1)(4+1)(3+1) = 1980 divisors.
Now, each divisor d resulting from the above comes with its pair N/d, so if you want distinct factorisations irrespective of order, you'll need to sort them and eliminate the duplicates.
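A hedged Python sketch of steps 1-3 (the helper names factor_with_multiplicity and divisors are mine, not from the linked answer):

from itertools import product
from math import prod

def factor_with_multiplicity(n):
    # Step 1: trial division up to sqrt(n).
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def divisors(n):
    # Steps 2-3: build {1, p, p^2, ..., p^m} per prime factor, then
    # take the Cartesian product of those sets.
    prime_powers = [[p**k for k in range(m + 1)]
                    for p, m in factor_with_multiplicity(n).items()]
    return sorted(prod(combo) for combo in product(*prime_powers))

print(divisors(12))  # [1, 2, 3, 4, 6, 12]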

concatenated matrix multiplication

Any suggestions for a fast multiply of A * diag(e) * A^T * f, for a dense matrix A and vectors e and f?
This is what I have now.
v[:] = 0
for i in range(N):
    for j in range(N):
        v[i] = v[i] + A[i,j]*e[j]*np.dot(A[:,j], f)
Thanks,
A comment by @rubenvb suggested A.dot(np.diag(e)).dot(A.transpose()).dot(f), which should already make it really fast. But we don't really need to materialize diag(e) as a 2D array there, so we can skip one matrix multiplication. Additionally, we can swap the places of A.T and f and thus avoid the transpose too. A simplified and much more efficient solution then emerges:
A.dot(e*f.dot(A))
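As a quick sanity check on random data (a sketch, not part of the original answer), the one-liner matches the full triple product:

import numpy as np

N = 50
A = np.random.rand(N, N)
e = np.random.rand(N)
f = np.random.rand(N)

reference = A.dot(np.diag(e)).dot(A.T).dot(f)  # A * diag(e) * A^T * f
fast = A.dot(e * f.dot(A))                     # no diag, no transpose
print(np.allclose(reference, fast))  # True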
Here's a quick runtime test on decent-sized arrays for all the proposed approaches:
In [226]: # Setup inputs
...: N = 200
...: A = np.random.rand(N,N)
...: e = np.random.rand(N,)
...: f = np.random.rand(N,)
...:
In [227]: %timeit np.einsum('ij,j,kj,k', A, e, A, f) # @Warren Weckesser's soln
10 loops, best of 3: 77.6 ms per loop
In [228]: %timeit A.dot(np.diag(e)).dot(A.transpose()).dot(f) # @rubenvb's soln
10 loops, best of 3: 18.6 ms per loop
In [229]: %timeit A.dot(e*f.dot(A)) # Proposed here
10000 loops, best of 3: 100 µs per loop
The suggestion made by @rubenvb is probably the simplest way to do it. Another way is to use einsum.
Here's an example. I'll use the following a, e and f:
In [95]: a
Out[95]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [96]: e
Out[96]: array([-1, 2, 3])
In [97]: f
Out[97]: array([5, 4, 1])
This is the direct translation of your formula into numpy code. It is basically the same as @rubenvb's suggestion:
In [98]: a.dot(np.diag(e)).dot(a.T).dot(f)
Out[98]: array([ 556, 1132, 1708])
Here's the einsum version:
In [99]: np.einsum('ij,j,jk,k', a, e, a.T, f)
Out[99]: array([ 556, 1132, 1708])
You can eliminate the need to transpose a by swapping the index labels associated with that argument:
In [100]: np.einsum('ij,j,kj,k', a, e, a, f)
Out[100]: array([ 556, 1132, 1708])
