numpy lstsq -- memory and run-time performance - performance

I need to solve (in the least-squares sense) a large set (50,000) of linear systems. Each such "system" is Ax=B, with A being an N-by-K matrix, x being an k-by-1 vector, and B (obviously) being an N-by-1 vector. (N in my case is 50,000, and K is ~10).
numpy.linalg.lstsq seems like the obvious choice, but since the documentation contains no implementation details, I am wondering about the memory and run-time performance:
What are the run-time performance and memory requirements of lstsq?
Will it compute the A, A^T, multiply them, and take the inverse, or will it compute A's pseudo-inverse directly?
Is there a way to directly compute each X[i] of the result, thus saving on memory? Will it use it?

The documentation describes the result as including the singular values and the rank; a strong hint that it is using SVD.
A quick test on my laptop shows the memory not going up at all (as reported by System Monitor) after the allocation of the arrays A and B.
In [7]: A = np.random.randn(100000, 10)
In [8]: B = np.random.randn(100000)
In [9]: np.linalg.lstsq(A, B)
Out[9]:
(array([ 0.00240061, 0.0017896 , 0.00619928, 0.00010278, -0.00411501,
0.00028532, 0.0003893 , -0.00042893, 0.00178326, -0.00444068]),
array([ 99695.18278372]),
10,
array([ 318.37776275, 318.16578799, 317.82872616, 317.21981114,
316.80987468, 315.63798002, 315.46574698, 314.73120345,
313.99948001, 313.61503118]))

Related

Multiply Eigen::Matrices without transposing first

Say you have Matrix<int, 4, 2> and Matrix<int, 3, 2> which you want to multiply in the natural way that consumes the -1 dimension without first transposing.
Is this possible? Or do we have to transpose first. Which would be silly (unperformative) from a cache perspective, because now the elements we are multiplying and summing aren't contiguous.
Here's a playground. https://godbolt.org/z/Gdj3sfzcb
Pytorch provides torch.inner and torch.tensordot which do this.
Just like in Numpy, transpose() just creates a "view". It doesn't do any expensive memory operations (unless you assign it to a new matrix). Just call a * b.transpose() and let Eigen handle the details of the memory access. A properly optimized BLAS library like Eigen handles the transposition on smaller tiles in temporary memory for optimal performance.
Memory order still matters for fine tuning though. If you can, write your matrix multiplications in the form a.transpose() * b for column-major matrices (like Eigen, Matlab), or a * b.transpose() for row-major matrices like those in Numpy. That saves the BLAS library the trouble of doing that transposition.
Side note: You used auto for your result. Please read the Common Pitfalls chapter in the documentation. Your code didn't compute a matrix multiplication, it stored an expression of one.

Parallellize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
#show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32. Apparantly that's an upper limit. Second,
using 32 BLAS threads is actually slower than using 8.
(e.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either of the functions do exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \ since they are related by transposing the matrices). I'd rather not delve too deeply because my knowledge in numerical linear algebra is quite limited.
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, Matrices of size around 10k work fine and take a minute to pinv:
using CUDA
#time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
#time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try to do matrices around size 20k, I get right away the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using int32 for indexing, even though I don't understand why it leads to an error in this case. (edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes this is really my problem, please refrain from commenting "nobody should ever need such large least squares"
Naive answer
Using pytorch, this will require at least 30GB GPU memory
import torch
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
If the system can sustain the same operation throughput as my laptop you should have an answer in about 15 minutes.
I would suggest you to try a generalized version scaling up the dimensions to get a better feeling of how your system will handle it
def try_it(a,b,c):
A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
You can't take much advantage of the entries being integer. This type of problem is easier to solve on the reals than on the integers, because finding integer solutions would require you to search the solutions, while the real solution you can find by doing algebraic manipulations.

Non-intuitive perf diff between `matrix * vector`, `matrix’ * vector` and `copy(matrix’) * vector` Usage Performance blas

Question from Julia Discourse
I’m using Julia 1.2. This is my test:
a = rand(1000, 1000)
b = adjoint(a)
c = copy(b)
#btime a * x setup=(x=rand(1000)) # 114.757 μs
#btime b * x setup=(x=rand(1000)) # 94.179 μs
#btime c * x setup=(x=rand(1000)) # 110.325 μs
I was expecting a and c to be at very least not slower.
After inspecting stdlib/LinearAlgebra/src/matmul.jl , it turns out that Julia passes b.parent (i.e. a ) to BLAS.gemv , not b , and instead switches LAPACK’s dgemv_ into a different and apparently faster mode.
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode? If so, then I’m guessing this isn’t actionable, besides possibly mentioning the gotcha in the docs somehow. If my assumption is wrong though, is there something to be done about this?
Answer from #stevengj in the same Discourse thread:
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode?
Close. It does have to do with memory, but it’s about locality, not alignment. The basic thing to understand is that it is more efficient to access consecutive (or at least nearby) data from memory than data that is separated, due to the existence of cache lines. (Consecutive access also has some advantages in utilizing SIMD instructions.)
Julia stores matrices in column-major order, so that the columns are contiguous in memory. When you multiply a transposed matrix (that has not been copied) by a vector, therefore, it can compute it as the dot product of the contiguous column (= transposed row) with the contiguous vector, which has good spatial locality and therefore utilizes cache lines efficiently.
For multiplying a non -transposed matrix by a vector, in contrast, you are taking the dot products of non-contiguous rows of the matrix with the vector, and it is harder to efficiently utilize cache lines. To improve spatial locality in this case, an optimized BLAS like OpenBLAS actually computes the dot products of several rows at a time (a “block”) with the vector, I believe — that’s why it’s only 10% slower and not much worse. (In fact, even the transposed case may do some blocking to keep the vector in cache.)

Solving large system of coupled differential equations

I have a system of coupled ordinary differential equations
dx/dt = (A + C_d(t) * B) * x,
where A and B are constant matrices and C_d is a diagonal coefficient matrix which smoothly varies depending on the current value of the integration variable.
The square matrices A and B are built up from smaller 60*60 upper triangular or zero matrices. The dimension of the full system is around 2500*2500. A and B are sparse with ~10% non-zero elements. The diagonal elements are negative or zero. The main (physical) constraint is that elements of x(t) are not allowed to become negative during integration.
Currently, I employ a ‘naïve’ step solver
x_(i+1) = A * x_i * dt_i + B * (C_d(t_i) * x_i) * dt_i + x_i
or in the CPU/GPU versions
def solve_CPU(nsteps, dt, c_d, x):
for step in xrange(nsteps):
x += (A.dot(x) + B.dot(x * c_d[step])) * dt[step]
def solve_GPU(m, n, nsteps, dt, c_d, cu_curr_x, cu_delta_x, cu_A, cu_B):
for step in xrange(nsteps):
cubl.gemv(trans='T', m=m, n=n, alpha=1.0, A=cu_A,
x=cu_curr_x, beta=0.0, y=cu_delta_x)
cubl.gemv(trans='T', m=m, n=n, alpha=c_d[step], A=cu_B,
x=cu_curr_x, beta=1.0, y=cu_delta_x)
cubl.axpy(alpha=dt[step], x=cu_delta_x, y=cu_curr_x)
and make use of a feature, that the step sizes dt_ithis can be computed a priory in a way that the elements of x are always >=0 during integration. Depending on the amount of approximations and the settings the number of integration steps varies between 25k and 10M.
I have tried several methods to optimize performance on general purpose hardware:
(unknown) When using ODEPACK’s VODE solver, I do not know how to express the x>=0 constraint
(slowest) Dense BLAS 2 dot-product using Intel MKL
(medium) Dense BLAS using single precision cuBLAS on NVIDIA GPU
(fastest) SCIPY sparse module using CSR/CSC formats
The code is written in Python and has access to the above listed libraries via Anaconda, Numba, Accelerate, Numpy etc. SCIPY's sparse BLAS routines are not properly linked to MKL in Anaconda and Python wrappers around cuSPARSE are to my knowledge not available, yet. I would know how to squeeze out a little bit more performance by directly interfacing to cuSPARSE/C-MKL sparse dot product, but that’s it. This exercise has to be solved dozens of times, again and again if models change, so performance is always an issue. I’m not an expert in this matter, so I don’t know much about preconditioners, factorization theorems etc. what brings me to my question:
Is there a more elegant or better way to solve such a linear-algebra task?

Hiding communication in Matrix Vector Product with MPI

I have to solve a huge linear equation for multiple right sides (Let's say 20 to 200). The Matrix is stored in a sparse format and distributed over multiple MPI nodes (Let's say 16 to 64). I run a CG solver on the rank 0 node. It's not possible to solve the linear equation directly, because the system matrix would be dense (Sys = A^T * S * A).
The basic Matrix-Vector multiplication is implemented as:
broadcast x
y = A_part * x
reduce y
While the collective operations are reasonably fast (OpenMPI seems to use a binary tree like communication pattern + Infiniband), it still accounts for a quite large part of the runtime. For performance reasons we already calculate 8 right sides per iteration (Basicly SpM * DenseMatrix, just to be complete).
I'm trying to come up with a good scheme to hide the communication latency, but I did not have a good idea yet. I also try to refrain from doing 1:n communication, although I did not yet measure if scaling would be a problem.
Any suggestions are welcome!
If your matrix is already distributed, would it be possible to use a distributed sparse linear solver instead of running it only on rank 0 and then broadcasting the result (if I'm reading your description correctly..). There's plenty of libraries for that, e.g. SuperLU_DIST, MUMPS, PARDISO, Aztec(OO), etc.
The "multiple rhs" optimization is supported by at least SuperLU and MUMPS (haven't checked the others, but I'd be VERY surprised if they didn't support it!), since they solve AX=B where X and B are matrices with potentially > 1 column. That is, each "rhs" is stored as a column vector in B.
If you don't need to have the results of an old right-hand-side before starting the next run you could try to use non-blocking communication (ISend, IRecv) and communicate the result while calculating the next right-hand-side already.
But make sure you call MPI_Wait before reading the content of the communicated array, in order to be sure you're not reading "old" data.
If the matrices are big enough (i.e. it takes long enough to calculate the matrix-product) you don't have any communication delay at all with this approach.

Resources