concatenated matrix multiplication

concatenated matrix multiplication - performance

any suggestions for a fast multiply of A * diag(e) * A^T * f for dense matrix A and vectors e, f?
This is what I have now.
v[:] = 0
for i in range(N):
    for j in range(N):
        v[i] = v[i]+A[i,j]*e[j]*np.dot(A[:,j],f)
Thanks,

The approach suggested in @rubenvb's comment, A.dot(np.diag(e)).dot(A.transpose()).dot(f), is already much faster than the loop. But we don't need to build the 2D array diag(e): multiplying by diag(e) is just an element-wise scaling by e, so one matrix multiplication can be skipped. Additionally, we can swap the places of A.T and f, since A.T.dot(f) equals f.dot(A) for a 1D f, and thus avoid the transpose too. A simplified and much more efficient solution follows, like so -
A.dot(e*f.dot(A))
Here's a quick runtime test on decent sized arrays on all the proposed approaches -
In [226]: # Setup inputs
...: N = 200
...: A = np.random.rand(N,N)
...: e = np.random.rand(N,)
...: f = np.random.rand(N,)
...:
In [227]: %timeit np.einsum('ij,j,kj,k', A, e, A, f) # @Warren Weckesser's soln
10 loops, best of 3: 77.6 ms per loop
In [228]: %timeit A.dot(np.diag(e)).dot(A.transpose()).dot(f) # @rubenvb's soln
10 loops, best of 3: 18.6 ms per loop
In [229]: %timeit A.dot(e*f.dot(A)) # Proposed here
10000 loops, best of 3: 100 µs per loop
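As a quick sanity check on the algebra, here is a minimal sketch that compares the loop from the question, the diag-based chain, and the simplified expression. The seed and the size N = 50 are arbitrary choices for illustration, not the benchmark setup above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, purely illustrative
N = 50
A = rng.random((N, N))
e = rng.random(N)
f = rng.random(N)

# Reference: the double loop from the question.
v_loop = np.zeros(N)
for i in range(N):
    for j in range(N):
        v_loop[i] += A[i, j] * e[j] * np.dot(A[:, j], f)

# Direct translation: A . diag(e) . A^T . f, evaluated left to right.
v_diag = A.dot(np.diag(e)).dot(A.T).dot(f)

# Simplified form: two matrix-vector products and one element-wise scaling.
v_fast = A.dot(e * f.dot(A))

print(np.allclose(v_loop, v_diag), np.allclose(v_loop, v_fast))
```

Both comparisons should agree up to floating-point rounding. The simplified form never builds an N x N intermediate, which is where the speedup in the timings comes from.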

The suggestion made by @rubenvb is probably the simplest way to do it. Another way is to use einsum.
Here's an example. I'll use the following a, e and f:
In [95]: a
Out[95]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [96]: e
Out[96]: array([-1, 2, 3])
In [97]: f
Out[97]: array([5, 4, 1])
This is the direct translation of your formula into numpy code. It is basically the same as @rubenvb's suggestion:
In [98]: a.dot(np.diag(e)).dot(a.T).dot(f)
Out[98]: array([ 556, 1132, 1708])
Here's the einsum version:
In [99]: np.einsum('ij,j,jk,k', a, e, a.T, f)
Out[99]: array([ 556, 1132, 1708])
You can eliminate the need to transpose a by swapping the index labels associated with that argument:
In [100]: np.einsum('ij,j,kj,k', a, e, a, f)
Out[100]: array([ 556, 1132, 1708])

Related

SymPy: Extract the lower triangular part of a matrix

I am trying to extract the lower triangular part of a SymPy matrix. Since I could not find a tril method in SymPy, I defined:
def tril (M):
    m = M.copy()
    for row_index in range (m.rows):
        for col_index in range (row_index + 1, m.cols):
            m[row_index, col_index] = 0
    return (m)
It seems to work:
Is there a more elegant way to extract the lower triangular part of a SymPy matrix?
Is .copy() the recommended way to ensure the integrity of the original matrix?

In SymPy, M.lower_triangular(k) will give the lower triangular elements on and below the kth diagonal. The default is k=0.

In [99]: M
Out[99]:
⎡a b c⎤
⎢ ⎥
⎢d e f⎥
⎢ ⎥
⎣g h i⎦
The other answer suggests using the np.tril function:
In [100]: np.tril(M)
Out[100]:
array([[a, 0, 0],
[d, e, 0],
[g, h, i]], dtype=object)
That converts M into a numpy array - object dtype because of the symbols. And the result is also a numpy array.
Your function returns a sympy.Matrix.
In [101]: def tril (M):
     ...:     m = M.copy()
     ...:     for row_index in range (m.rows):
     ...:         for col_index in range (row_index + 1, m.cols):
     ...:             m[row_index, col_index] = 0
     ...:     return (m)
     ...:
In [102]: tril(M)
Out[102]:
⎡a 0 0⎤
⎢ ⎥
⎢d e 0⎥
⎢ ⎥
⎣g h i⎦
As a general rule mixing sympy and numpy leads to confusion, if not errors. numpy is best for numeric work. It can handle non-numeric objects like symbols, but the math is hit-or-miss.
The np.tri... functions are built on the np.tri function:
In [114]: np.tri(3).astype(int)
Out[114]:
array([[1, 0, 0],
[1, 1, 0],
[1, 1, 1]])
We can make a symbolic Matrix from this:
In [115]: m1 = Matrix(np.tri(3).astype(int))
In [116]: m1
Out[116]:
⎡1 0 0⎤
⎢ ⎥
⎢1 1 0⎥
⎢ ⎥
⎣1 1 1⎦
and do element-wise multiplication:
In [117]: M.multiply_elementwise(m1)
Out[117]:
⎡a 0 0⎤
⎢ ⎥
⎢d e 0⎥
⎢ ⎥
⎣g h i⎦
np.tri works by comparing a column array with a row:
In [123]: np.arange(3)[:,None]>=np.arange(3)
Out[123]:
array([[ True, False, False],
[ True, True, False],
[ True, True, True]])
In [124]: _.astype(int)
Out[124]:
array([[1, 0, 0],
[1, 1, 0],
[1, 1, 1]])
Another answer suggests lower_triangular. It's interesting to look at its code:
def entry(i, j):
    return self[i, j] if i + k >= j else self.zero
return self._new(self.rows, self.cols, entry)
It applies an i + k >= j test to each element (i >= j for the default k = 0). _new must be iterating on the rows and columns.
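The same element-wise test can be written without touching private methods, using the Matrix(rows, cols, function) constructor. This is a minimal sketch; the function name tril and the symbols are illustrative, mirroring the question.

```python
from sympy import Matrix, symbols

def tril(M, k=0):
    """Lower-triangular part of M: keep entries on and below the k-th diagonal."""
    # Matrix(rows, cols, function) builds each entry from its (row, col) index,
    # so the original matrix M is never modified.
    return Matrix(M.rows, M.cols,
                  lambda row, col: M[row, col] if row + k >= col else 0)

a, b, c, d, e, f, g, h, i = symbols('a b c d e f g h i')
M = Matrix([[a, b, c], [d, e, f], [g, h, i]])

print(tril(M))       # entries above the main diagonal are zeroed
print(tril(M, -1))   # strictly lower-triangular part
```

Because a new matrix is built entry by entry, no explicit .copy() is needed to protect the original.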

You can simply use the NumPy function:
import numpy as np
np.tril(M)
Of course, as noted in the other answer, this returns a NumPy array, so you should convert back with sympy.Matrix(np.tril(M)). But it depends on what you're going to do next.

Minimizing overhead due to the large number of Numpy dot calls

My problem is the following: I have an iterative algorithm that, at each iteration, needs to perform several matrix-matrix multiplications dot(A_i, B_i), for i = 1 ... k. Since these multiplications are performed with Numpy's dot, I know they call a BLAS-3 implementation, which is quite fast. The problem is that the number of calls is huge, and it turned out to be a bottleneck in my program. I would like to minimize the overhead due to all these calls by making fewer products with bigger matrices.
For simplicity, consider that all matrices are n x n (usually n is not big; it ranges between 1 and 1000). One way around my problem would be to consider the block diagonal matrix diag(A_i) and perform the product below.
This is just one call to the function dot, but now the program wastes a lot of time performing multiplications with zeros. This idea doesn't seem to work well, but it does give the result [A_1 B_1, ..., A_k B_k], that is, all products stacked in a single big matrix.
My question is this, is there a way to compute [A_1 B_1, ..., A_k B_k] with a single function call? Or even more to the point, how can I compute these products faster than making a loop of Numpy dots?

It depends on the size of the matrices
Edit
For larger n x n matrices (from approx. size 20) a BLAS call from compiled code is faster; for smaller matrices, custom Numba or Cython kernels are usually faster.
The following method generates custom dot functions for given input shapes. With this method it is also possible to benefit from compiler-related optimizations like loop unrolling, which are especially important for small matrices.
Note that generating and compiling one kernel takes approx. 1 s, so make sure to call the generator only if you really have to.
Generator function
import numpy as np
import numba as nb

def gen_dot_nm(x,y,z):
    #small kernels
    @nb.njit(fastmath=True,parallel=True)
    def dot_numba(A,B):
        """
        calculate dot product for (x,y)x(y,z)
        """
        assert A.shape[0]==B.shape[0]
        assert A.shape[2]==B.shape[1]
        assert A.shape[1]==x
        assert B.shape[1]==y
        assert B.shape[2]==z
        res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            for i in range(x):
                for j in range(z):
                    acc=0.
                    for k in range(y):
                        acc+=A[ii,i,k]*B[ii,k,j]
                    res[ii,i,j]=acc
        return res
    #large kernels
    @nb.njit(fastmath=True,parallel=True)
    def dot_BLAS(A,B):
        assert A.shape[0]==B.shape[0]
        assert A.shape[2]==B.shape[1]
        res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            res[ii]=np.dot(A[ii],B[ii])
        return res
    #At square matrices above size 20
    #calling BLAS is faster
    if x>=20 or y>=20 or z>=20:
        return dot_BLAS
    else:
        return dot_numba
Usage example
A=np.random.rand(1000,2,2)
B=np.random.rand(1000,2,2)
dot22=gen_dot_nm(2,2,2)
X=dot22(A,B)
%timeit X3=dot22(A,B)
#5.94 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Old answer
Another alternative, but with more work to do, would be to use a special BLAS implementation that creates custom kernels for very small matrices just in time, and then to call these kernels from C.
Example
import numpy as np
import numba as nb
#Don't use this for larger submatrices
@nb.njit(fastmath=True,parallel=True)
def dot(A,B):
    assert A.shape[0]==B.shape[0]
    assert A.shape[2]==B.shape[1]
    res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        for i in range(A.shape[1]):
            for j in range(B.shape[2]):
                acc=0.
                for k in range(B.shape[1]):
                    acc+=A[ii,i,k]*B[ii,k,j]
                res[ii,i,j]=acc
    return res
@nb.njit(fastmath=True,parallel=True)
def dot_22(A,B):
    assert A.shape[0]==B.shape[0]
    assert A.shape[1]==2
    assert A.shape[2]==2
    assert B.shape[1]==2
    assert B.shape[2]==2
    res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        res[ii,0,0]=A[ii,0,0]*B[ii,0,0]+A[ii,0,1]*B[ii,1,0]
        res[ii,0,1]=A[ii,0,0]*B[ii,0,1]+A[ii,0,1]*B[ii,1,1]
        res[ii,1,0]=A[ii,1,0]*B[ii,0,0]+A[ii,1,1]*B[ii,1,0]
        res[ii,1,1]=A[ii,1,0]*B[ii,0,1]+A[ii,1,1]*B[ii,1,1]
    return res
Timings
A=np.random.rand(1000,2,2)
B=np.random.rand(1000,2,2)
X=A@B
X2=np.einsum("xik,xkj->xij",A,B)
X3=dot_22(A,B) #avoid measuring compilation overhead
X4=dot(A,B) #avoid measuring compilation overhead
%timeit X=A@B
#262 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum("xik,xkj->xij",A,B,optimize=True)
#264 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit X3=dot_22(A,B)
#5.68 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit X4=dot(A,B)
#9.79 µs ± 61.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You can stack the arrays to have shape (k, n, n), and call numpy.matmul or use the @ operator.
For example,
In [18]: A0 = np.array([[1, 2], [3, 4]])
In [19]: A1 = np.array([[1, 2], [-3, 5]])
In [20]: A2 = np.array([[4, 0], [1, 1]])
In [21]: B0 = np.array([[1, 4], [-3, 4]])
In [22]: B1 = np.array([[2, 1], [1, 1]])
In [23]: B2 = np.array([[-2, 9], [0, 1]])
In [24]: np.matmul([A0, A1, A2], [B0, B1, B2])
Out[24]:
array([[[-5, 12],
[-9, 28]],
[[ 4, 3],
[-1, 2]],
[[-8, 36],
[-2, 10]]])
Or, using @:
In [32]: A = np.array([A0, A1, A2])
In [33]: A
Out[33]:
array([[[ 1, 2],
[ 3, 4]],
[[ 1, 2],
[-3, 5]],
[[ 4, 0],
[ 1, 1]]])
In [34]: B = np.array([B0, B1, B2])
In [35]: A @ B
Out[35]:
array([[[-5, 12],
[-9, 28]],
[[ 4, 3],
[-1, 2]],
[[-8, 36],
[-2, 10]]])
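To confirm that the stacked call really computes the same k independent products as a loop of dot calls, here is a small sketch. The sizes k = 4 and n = 3 and the random inputs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, purely illustrative
k, n = 4, 3
As = [rng.random((n, n)) for _ in range(k)]
Bs = [rng.random((n, n)) for _ in range(k)]

# The original approach: one np.dot call per pair.
looped = np.array([np.dot(a, b) for a, b in zip(As, Bs)])

# Stack once into shape (k, n, n), then multiply with a single call.
A = np.stack(As)
B = np.stack(Bs)
batched = A @ B   # matmul treats the leading axis as a batch dimension

print(batched.shape, np.allclose(looped, batched))
```

If the matrices are produced inside the iteration anyway, writing them directly into preallocated (k, n, n) arrays avoids the cost of stacking on every pass.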

If you don't want to waste time multiplying zeros, then what you really want are sparse matrices. Using the A and B matrices from @WarrenWeckesser:
from scipy import sparse
sparse.block_diag((A0, A1, A2), format = "csr") @ np.concatenate((B0, B1, B2), axis = 0)
Out[]:
array([[-5, 12],
[-9, 28],
[ 4, 3],
[-1, 2],
[-8, 36],
[-2, 10]], dtype=int32)
This is likely a speedup for large matrices. For smaller ones @max9111 probably has the right idea using numba.

Julia sparse matrix with random 1's

So I have a size N in julia and I need an NxN sparse matrix with N ones in it, in random places. What would be the best way to go about this?
At first I thought about randomly generating indexes and then setting those entries to 1 in a sparse matrix. I recently found the sprand functions, but I don't understand how to use them correctly or how to apply them to my problem. I tried using them with my limited understanding and they keep generating error messages. Help is of course always greatly appreciated :)

Inspired by @DanGetz's comment above, the following solution is a one-line function using randperm. I deleted the original answer as it was not very helpful.
sparseN(N) = sparse(randperm(N), randperm(N), ones(N), N, N)
Because each randperm is a permutation, this places exactly one 1 in every row and every column. It is also incredibly fast:
@time sparseN(10_000);
0.000558 seconds (30 allocations: 782.563 KiB)

A sparse matrix of dimension (N rows)x(M columns) has at most NxM components, which can be indexed by the integers k in K=[0,N*M). For any k in K you can recover the element indices (i,j) by Euclidean division, k = i + j*N (column-major layout here).
To randomly sample n elements of K (without repetition), you can use Knuth's "Algorithm S (Selection sampling technique)", Section 3.4.2 of The Art of Computer Programming, Vol. 2: Seminumerical Algorithms.
In Julia:
function random_select(n::Int64,K::Int64)
    @assert 0<=n<=K
    sample=Vector{Int64}(n)
    t=Int64(0)
    m=Int64(0)
    while m<n
        if (K-t)*rand()>=n-m
            t+=1
        else
            m+=1
            sample[m]=t
            t+=1
        end
    end
    sample
end
The next part simply retrieves the I,J indices to create the sparse matrix from its coordinate form:
function create_sparseMatrix(n::Int64,N::Int64,M::Int64)
    @assert (0<=N)&&(0<=M)
    @assert 0<=n<=N*M
    nonZero = random_select(n,N*M)
    # column major: k=i+j*N
    I = map(k->mod(k,N),nonZero)
    J = map(k->div(k,N),nonZero)
    sparse(I+1,J+1,ones(n),N,M)
end
Usage example: a 4x5 sparse matrix with 3 nonzero (=1.0) at random positions:
julia> create_sparseMatrix(3,4,5)
4×5 SparseMatrixCSC{Float64,Int64} with 3 stored entries:
[4, 1] = 1.0
[3, 2] = 1.0
[3, 3] = 1.0
Border case tests:
julia> create_sparseMatrix(0,4,5)
4×5 SparseMatrixCSC{Float64,Int64} with 0 stored entries
julia> create_sparseMatrix(4*5,4,5)
4×5 SparseMatrixCSC{Float64,Int64} with 20 stored entries:
[1, 1] = 1.0
[2, 1] = 1.0
[3, 1] = 1.0
[4, 1] = 1.0
⋮
[4, 4] = 1.0
[1, 5] = 1.0
[2, 5] = 1.0
[3, 5] = 1.0
[4, 5] = 1.0
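For readers who want to experiment with the same two steps outside Julia, here is a minimal Python sketch of selection sampling followed by the column-major index decoding. The function names mirror the Julia code; random_positions is a hypothetical helper that returns 0-based (row, column) pairs instead of building a sparse matrix.

```python
import random

def random_select(n, K):
    """Selection sampling (Knuth's Algorithm S): pick n distinct integers
    from range(K), returned in increasing order."""
    assert 0 <= n <= K
    sample = []
    t = 0   # number of candidates examined so far
    while len(sample) < n:
        # Accept candidate t with probability (n - m) / (K - t),
        # where m = len(sample) is the number already selected.
        if (K - t) * random.random() < n - len(sample):
            sample.append(t)
        t += 1
    return sample

def random_positions(n, N, M):
    """Return n distinct 0-based (row, col) positions in an N x M matrix."""
    # Column-major decoding: k = i + j*N, so i = k % N and j = k // N.
    return [(k % N, k // N) for k in random_select(n, N * M)]

print(random_positions(3, 4, 5))
```

The loop always terminates: once the number of remaining candidates equals the number still needed, every remaining candidate is accepted.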

Insisting on a one-line-ish solution:
using StatsBase
sparseones(N,M,K) = sparse(
(x->(first.(x).+1,last.(x).+1))(divrem.(sample(0:N*M-1,K,replace=false),M))...,
ones(K),N,M
)
Giving:
julia> sparseones(3,4,5)
3×4 SparseMatrixCSC{Float64,Int64} with 5 stored entries:
[1, 1] = 1.0
[2, 1] = 1.0
[3, 3] = 1.0
[2, 4] = 1.0
[3, 4] = 1.0
This method is essentially the same as the earlier answer with the advantage of re-using existing sample and being much shorter. It is even faster on larger matrices.

List of all possible permutations of factors of a number

I am trying to find all the possible factorizations of a given number in Python.
For example:
1) given n=12, the output will be f(n)=[[2,2,3],[4,3],[6,2],[12]]
2) given n=24, the output will be f(n)=[[2,2,2,3],[2,2,6],[2,12],[4,6],[8,3],[24]]
Here is my code:
def p(a):
    k=1
    m=1
    n=[]
    for i in range (len(a)):
        for j in range(0,i+1):
            k*=a[j]
        for l in range(i+1,len(a)):
            m*=a[l]
        n+=[[k,m],]
        k=1
        m=1
    return n
def f(n):
    primfac = []
    d = 2
    while d*d <= n:
        while (n % d) == 0:
            primfac.append(d)
            n //= d
        d += 1
    if n > 1:
        primfac.append(n)
    return p(primfac)
But my code returns the following values:
1) For n=12, the output is
[[2, 10], [4, 5], [20, 1]]
2) For n=24, the output is
[[2, 12], [4, 6], [8, 3], [24, 1]]
What can I do to get the expected results?

I don't know Python, so I can't help you with the code, but here is an explanation I provided for a related question (with a bit of Java code as well, if you can read Java).
Get your number factored with multiplicity. This is with high probability the most expensive step, O(sqrt(N)). You can stop here if this is all that you want.
Build your sets {1, p_i, p_i^2, ..., p_i^(m_i)}, p_i being a prime factor with multiplicity m_i.
Perform a Cartesian product between these sets and you'll get all the divisors of your number. You'll spend longer here only for numbers with many distinct factors (and multiplicities), e.g. 2^10 x 3^8 x 5^4 x 7^3 has 11 x 9 x 5 x 4 = 1980 divisors.
Now, each divisor d resulting from the above comes with its pair (N/d), so if you want distinct factorisations irrespective of the order, you'll need to sort them and eliminate the duplicates.
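Since the question asks for Python, here is a minimal sketch that lists every unordered factorization directly. It is a plain recursion over non-decreasing factors rather than the Cartesian-product approach described above, and the name factorizations is illustrative.

```python
def factorizations(n, smallest=2):
    """Return every way to write n as a product of integers >= smallest.

    Factors within each result are in non-decreasing order, so each
    factorization appears exactly once regardless of ordering.
    """
    result = []
    d = smallest
    while d * d <= n:
        if n % d == 0:
            # Use d as the next factor, then factor the cofactor n // d
            # using only factors that are at least d.
            for rest in factorizations(n // d, d):
                result.append([d] + rest)
        d += 1
    result.append([n])   # n itself as a single factor
    return result

print(factorizations(12))
```

For n = 12 this returns [[2, 2, 3], [2, 6], [3, 4], [12]]. For n = 24 it returns seven factorizations, including [2, 3, 4], which is absent from the list in the question.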

N-fold partition of an array with equal sum in each partition

Given an array of integers a and two numbers N and M, return N groups of integers from a such that each group sums to M.
For example, say:
a = [1,2,3,4,5]
N = 2
M = 5
Then the algorithm could return [2, 3], [1, 4] or [5], [2, 3] or possibly others.
What algorithms could I use here?
Edit:
I wasn't aware that this problem is NP complete. So maybe it would help if I provided more details on my specific scenario:
So I'm trying to create a "match-up" application. Given the number of teams N and the number of players per team M, the application listens for client requests. Each client request will give a number of players that the client represents. So if I need 2 teams of 5 players, then if 5 clients send requests, each representing 1, 2, 3, 4, 5 players respectively, then my application should generate a match-up between clients [1, 4] and clients [2, 3]. It could also generate a match-up between [1, 4] and [5]; I don't really care.
One implication is that any client representing more than M or less than 0 players is invalid. Hope this could simplify the problem.

This appears to be a variation of the subset sum problem. As this problem is NP-complete, there will be no efficient algorithm without further constraints.
Note that it is already hard to find a single subset of the original set whose elements sum to M.

People give up too easily on NP-complete problems. Just because a problem is NP-complete doesn't mean that there aren't more and less efficient algorithms in the general case. That is, you can't guarantee that for all inputs there is an answer that can be computed faster than a brute-force search, but for many problems you can certainly have methods that are faster than the full search for most inputs.
For this problem there are certainly 'perverse' sets of numbers that will result in worst-case search times, because there may be, say, a large vector of integers but only one solution, and you end up trying a very large number of combinations.
But for non-perverse sets there are probably many solutions, and an efficient way of 'tripping over' a good partitioning will run much faster than the worst case.
How you solve this will depend a lot on what you expect to be the more common parameters. It also makes a difference if the integers are all positive, or if negatives are allowed.
In this case I'll assume that:
N is small relative to the length of the vector
All integers are positive.
Integers cannot be re-used.
Algorithm:
Sort the vector, v, in descending order.
Eliminate elements bigger than M. They can't be part of any solution.
Add up all remaining numbers in v, divide by N. If the result is smaller than M, there is no solution.
Create a new array w, same size as v. For each w[i], sum all the numbers in v[i+1 .. end].
So if v was 5 4 3 2 1, w would be 10, 6, 3, 1, 0.
While you have not found enough sets:
Choose the largest number, x. If it is equal to M, emit a solution set with just x, remove it from the vector, and remove the first element from w.
Still not enough sets? (Likely.) Then again, while you have not found enough sets:
A solution theory is ([a,b,c], R), where [a,b,c] is a partial set of elements of v and R is a remainder, R = M - sum[a,b,c]. Extending a theory means adding a number to the partial set and subtracting that number from R. As you extend the theories, if R == 0, that is a possible solution.
Recursively create theories like so: loop over the elements of v, with each v[i] creating a theory ([v[i]], R), and then recursively extend each theory from just part of v. Binary search into v to find the first element equal to or smaller than R, v[j]. Start with v[j] and extend each theory with the elements of v from j until R > w[k].
The numbers from v[j] to v[k] are the only numbers that can be used to extend a theory and still get R to 0. Numbers larger than v[j] will make R negative. With numbers smaller than v[k], there aren't enough numbers left in the array to get R to 0, even if you used them all.
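The pruning idea above can be sketched as a small backtracking search for a single group. This is an illustrative simplification, not the full procedure: it finds one subset summing to the target, scans linearly instead of using binary search, and the name find_subset is hypothetical. It assumes all integers are positive.

```python
def find_subset(values, target):
    """Backtracking search for one subset of positive ints summing to target.

    Returns the subset as a list, or None if no subset exists.
    """
    # Drop elements larger than the target, then sort in descending order.
    v = sorted((x for x in values if x <= target), reverse=True)

    # w[i] = sum of v[i+1:], the most that elements after i can contribute.
    w = [0] * len(v)
    for i in range(len(v) - 2, -1, -1):
        w[i] = w[i + 1] + v[i + 1]

    def extend(start, remainder, chosen):
        if remainder == 0:
            return chosen
        for i in range(start, len(v)):
            if v[i] > remainder:
                continue   # would make the remainder negative
            if v[i] + w[i] < remainder:
                break      # even taking everything from here on is not enough
            found = extend(i + 1, remainder - v[i], chosen + [v[i]])
            if found is not None:
                return found
        return None

    return extend(0, target, [])

print(find_subset([1, 2, 3, 4, 5], 5))
```

To build N groups, one could call this repeatedly and remove the elements of each returned subset. That greedy repetition can miss a valid partition that a full search would find, so treat it as a heuristic.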

Here is my own Python solution that uses dynamic programming. The algorithm is given here.
def get_subset(lst, s):
    '''Given a list of integer `lst` and an integer s, returns
    a subset of lst that sums to s, as well as lst minus that subset
    '''
    q = {}
    for i in range(len(lst)):
        for j in range(1, s+1):
            if lst[i] == j:
                q[(i, j)] = (True, [j])
            elif i >= 1 and q[(i-1, j)][0]:
                q[(i, j)] = (True, q[(i-1, j)][1])
            elif i >= 1 and j >= lst[i] and q[(i-1, j-lst[i])][0]:
                q[(i, j)] = (True, q[(i-1, j-lst[i])][1] + [lst[i]])
            else:
                q[(i, j)] = (False, [])
        if q[(i, s)][0]:
            for k in q[(i, s)][1]:
                lst.remove(k)
            return q[(i, s)][1], lst
    return None, lst
def get_n_subset(n, lst, s):
    ''' Returns n subsets of lst, each of which sums to s'''
    solutions = []
    for i in range(n):
        sol, lst = get_subset(lst, s)
        solutions.append(sol)
    return solutions, lst
# print(get_n_subset(7, [1, 2, 3, 4, 5, 7, 8, 4, 1, 2, 3, 1, 1, 1, 2], 5))
# [stdout]: ([[2, 3], [1, 4], [5], [4, 1], [2, 3], [1, 1, 1, 2], None], [7, 8])
