Solving large system of coupled differential equations - matrix

I have a system of coupled ordinary differential equations
dx/dt = (A + C_d(t) * B) * x,
where A and B are constant matrices and C_d is a diagonal coefficient matrix which smoothly varies depending on the current value of the integration variable.
The square matrices A and B are built up from smaller 60×60 blocks that are either upper triangular or zero. The full system has dimension around 2500×2500. A and B are sparse, with ~10% non-zero elements. The diagonal elements are negative or zero. The main (physical) constraint is that the elements of x(t) are not allowed to become negative during integration.
Currently, I employ a ‘naïve’ step solver
x_(i+1) = A * x_i * dt_i + B * (C_d(t_i) * x_i) * dt_i + x_i
or in the CPU/GPU versions
def solve_CPU(nsteps, dt, c_d, x):
    # A and B are module-level matrices (dense NumPy or scipy.sparse)
    for step in xrange(nsteps):
        x += (A.dot(x) + B.dot(x * c_d[step])) * dt[step]
def solve_GPU(m, n, nsteps, dt, c_d, cu_curr_x, cu_delta_x, cu_A, cu_B):
    # cubl is the cuBLAS interface; the cu_* buffers live on the GPU
    for step in xrange(nsteps):
        # delta_x = A' * x
        cubl.gemv(trans='T', m=m, n=n, alpha=1.0, A=cu_A,
                  x=cu_curr_x, beta=0.0, y=cu_delta_x)
        # delta_x += c_d(t) * B' * x
        cubl.gemv(trans='T', m=m, n=n, alpha=c_d[step], A=cu_B,
                  x=cu_curr_x, beta=1.0, y=cu_delta_x)
        # x += dt * delta_x
        cubl.axpy(alpha=dt[step], x=cu_delta_x, y=cu_curr_x)
and exploit the fact that the step sizes dt_i can be computed a priori so that the elements of x stay >= 0 throughout the integration. Depending on the level of approximation and the settings, the number of integration steps varies between 25k and 10M.
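For illustration, the bound such a rule has to respect looks roughly like this (a sketch only, with an illustrative function name; in practice I precompute the whole dt sequence up front):

import numpy as np

def max_nonnegative_dt(x, dxdt, safety=0.9):
    # Largest explicit-Euler step for which x + dt * dxdt stays >= 0,
    # scaled by a safety factor (sketch only).
    shrinking = dxdt < 0.0
    if not np.any(shrinking):
        return np.inf
    return safety * np.min(-x[shrinking] / dxdt[shrinking])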
I have tried several methods to optimize performance on general purpose hardware:
(unknown) When using ODEPACK’s VODE solver, I do not know how to express the x>=0 constraint
(slowest) Dense BLAS 2 dot-product using Intel MKL
(medium) Dense BLAS using single precision cuBLAS on NVIDIA GPU
(fastest) SCIPY sparse module using CSR/CSC formats
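For reference, the fastest variant boils down to converting A and B once and reusing the same step loop (a sketch, assuming A and B start out as dense NumPy arrays; the names are illustrative):

import scipy.sparse as sp

A_csr = sp.csr_matrix(A)   # convert once, outside the time loop
B_csr = sp.csr_matrix(B)

def solve_sparse(nsteps, dt, c_d, x):
    for step in xrange(nsteps):
        x += (A_csr.dot(x) + B_csr.dot(x * c_d[step])) * dt[step]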
The code is written in Python and has access to the above listed libraries via Anaconda, Numba, Accelerate, NumPy, etc. SciPy's sparse BLAS routines are not properly linked to MKL in Anaconda, and Python wrappers around cuSPARSE are, to my knowledge, not available yet. I would know how to squeeze out a little more performance by directly interfacing to the cuSPARSE/C-MKL sparse dot products, but that's it. This exercise has to be solved dozens of times, again and again whenever the models change, so performance is always an issue. I'm not an expert in this matter, so I don't know much about preconditioners, factorization theorems, etc., which brings me to my question:
Is there a more elegant or better way to solve such a linear-algebra task?

Related

Parallelize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32. Apparently that's an upper limit. Second, using 32 BLAS threads is actually slower than using 8.
(For example, performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either of the functions do exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \ since they are related by transposing the matrices). I'd rather not delve too deeply because my knowledge in numerical linear algebra is quite limited.
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, matrices of size around 10k work fine and take about a minute to pinv:
using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500)  # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix)  # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try to do matrices around size 20k, I get right away the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using int32 for indexing, even though I don't understand why it leads to an error in this case. (edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes, this is really my problem; please refrain from commenting "nobody should ever need such large least squares".
Naive answer
Using PyTorch, this will require at least 30 GB of GPU memory:
import torch
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
If the system can sustain the same operation throughput as my laptop, you should have an answer in about 15 minutes.
I would suggest trying a generalized version, scaling up the dimensions, to get a better feel for how your system will handle it:
def try_it(a, b, c):
    A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
    G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
    R = torch.lstsq(G.T, A.T)
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
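For example, a run at roughly a tenth of each dimension gives a quick estimate before committing to the full problem:

try_it(5000, 4000, 3000)  # ~10x smaller than the real (50000, 40000, 30000) case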
You can't take much advantage of the entries being integers. This type of problem is easier to solve over the reals than over the integers, because finding integer solutions would require searching the solution space, whereas the real solution can be found by algebraic manipulations alone.

Parallel method to get all the eigenvalues of a large sparse matrix

Is it possible to compute all the eigenvalues of a large sparse matrix using multiple CPUs?
If yes, is it possible to do it without storing the full dense matrix in memory (using only the stored sparse matrix)?
If yes, what's a good (fast and low-memory) method to do it?
Can numpy or scipy do it?
My matrix is complex, non-Hermitian, as sparse as the identity matrix, and of dimension N x N with N = BinomialCoefficient(L, Floor(L/2)), where we need to take L as large as possible.
For example, with L = 20, N = 184,756 and the matrix is 99.9995% sparse, having just N non-zero elements. So the memory usage of the sparse matrix is ~0.1 GB, but it would be ~10 TB for the dense matrix.
With L = 30, N = 155,117,520 and we use ~60 GB (sparse) and ~10 EB (dense). So it's impracticable to store the full dense matrix in memory.
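For reference, the dimensions quoted above follow directly from the binomial coefficient (a quick check in Python):

from scipy.special import comb

N_20 = comb(20, 10, exact=True)  # 184756
N_30 = comb(30, 15, exact=True)  # 155117520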
I have access to Intel® Xeon Gold 6148 (Skylake) CPUs @ 2.4 GHz with up to 752 GB of RAM each. I could use Python, C (ScaLAPACK, OpenBLAS, MAGMA, ELPA, MUMPS, SuperLU, SuiteSparse, PETSc, Lis, ...), C++ (Armadillo, Eigen, Blitz++, Trilinos, ...), Matlab, R, Perl, Fortran, mpi4py, CUDA, Intel® Math Kernel Library, and a few other software packages.
I build my matrix using Python (scipy.sparse, numpy and multiprocessing). I've tried using numpy.linalg.eigvals() and scipy.linalg.eigvals(), but it seems that they only use the cores of one CPU. I could look further into those, but I won't if there's a better way to solve my matrix.
For the curious, my matrix comes from a representation of a non-Hermitian operator on a subset of states of a length-L quantum spin-1/2 chain with strong interactions. I need the full spectrum because it allows me to study the level spacing distribution of the energy spectrum for a fixed set of quantum numbers.
I'm far from being a professional in computer science, so if I missed some basic concept, please be lenient.

Speeding up evaluation of many scipy splines over the same set of knots

I have a few quick questions regarding speeding up spline function evaluation in scipy (version 0.12.0), and I apologize in advance for my novice understanding of splines. I am building an object for scipy.integrate.odeint integration of a chemical kinetics problem, using spline lookups for the reaction rates (1e2-1e3 functions of the ODE system variables) and generated C code for all of the algebra in the ODE system of equations. Compared to a previous implementation that was purely in Python, evaluating the C code is so much faster that the spline interpolations have become the bottleneck in the ODE function. To remove the bottleneck, I have reformed all of the reaction rates into splines that live on the same knot values with the same order while having different smoothing coefficients. (In reality I will have multiple sets of functions, where each function set was found on the same knots, has the same argument variable, and is at the same derivative level, but for simplicity I will assume one function set for this question.)
In principle this is just a collection of curves on the same x-values and could be treated with interp1d (equivalently rewrapping splmake and spleval from scipy.interpolate) or a list of splev calls on tck data from splrep.
In [1]: %paste
import numpy
import scipy
from scipy.interpolate import *
#Length of Data
num_pts = 3000
#Number of functions
num_func = 100
#Shared x list by all functions
x = numpy.linspace(0.0,100.0,num_pts)
#Separate y(x) list for each function
ylist = numpy.zeros((num_pts,num_func))
for ind in range(0,num_func):
    #Dummy test for different data
    ylist[:,ind] = (x**ind + x - 3.0)
testval = 55.0
print 'Method 1'
fs1 = [scipy.interpolate.splrep(x,ylist[:,ind],k=3) for ind in range(0,num_func)]
out1 = [scipy.interpolate.splev(testval,fs1[ind]) for ind in range(0,num_func)]
%timeit [scipy.interpolate.splev(testval,fs1[ind]) for ind in range(0,num_func)]
print 'Method 2 '
fs2 = scipy.interpolate.splmake(x,ylist,order=3)
out2 = scipy.interpolate.spleval(fs2,testval)
%timeit scipy.interpolate.spleval(fs2,testval)
## -- End pasted text --
Method 1
1000 loops, best of 3: 1.51 ms per loop
Method 2
1000 loops, best of 3: 1.32 ms per loop
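A third variant along the same lines would be interp1d (mentioned above) on the full 2-D ylist, which shares the knot lookup across all columns; a sketch only, assuming this SciPy version accepts a 2-D y with axis=0 (I have not timed it here):

fs3 = scipy.interpolate.interp1d(x, ylist, kind='cubic', axis=0)
out3 = fs3(testval)  # one value per function at testval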
As far as I understand spline evaluation, once the tck arrays have been created (either with splrep or splmake), the evaluation functions (splev and spleval) perform two operations when given some new value xnew:
1) Determine the relevant indices of the knots and smoothing coefficients
2) Evaluate the polynomial expression with the smoothing coefficients at the new xnew
Questions
Since all of the splines (in a function set) are created on the same knot values, is it possible to avoid step (1, relevant indices) in the spline evaluation once it has been performed on the first function of a function set? From my looking at the Fortran fitpack files (directly from DIERCKX, I could not find the .c files used by scipy on my machine) I do not think this is supported, but I would love to be shown wrong.
The compilation of the system C code, as well as the creation of all of the spline tck arrays, is a preprocessing step as far as I am concerned; if I am worried about the speed of evaluating these lists of many functions, should I be looking at a compiled variant since my tck lists will be unchanging?
One of my function sets will likely have an x-array of geometrically spaced values as opposed to linearly spaced; will this drastically reduce the evaluation time of the splines?
Thank you in advance for your time and answers.
Cheers,
Guy

numpy lstsq -- memory and run-time performance

I need to solve (in the least-squares sense) a large set (50,000) of linear systems. Each such "system" is Ax=B, with A being an N-by-K matrix, x being a K-by-1 vector, and B (obviously) being an N-by-1 vector. (N in my case is 50,000, and K is ~10.)
numpy.linalg.lstsq seems like the obvious choice, but since the documentation contains no implementation details, I am wondering about the memory and run-time performance:
What are the run-time performance and memory requirements of lstsq?
Will it form A^T·A and take its inverse (the normal-equations route), or will it compute A's pseudo-inverse directly?
Is there a way to directly compute each X[i] of the result, thus saving on memory, and does it do so?
The documentation describes the result as including the singular values and the rank; a strong hint that it is using SVD.
A quick test on my laptop shows the memory not going up at all (as reported by System Monitor) after the allocation of the arrays A and B.
In [7]: A = np.random.randn(100000, 10)
In [8]: B = np.random.randn(100000)
In [9]: np.linalg.lstsq(A, B)
Out[9]:
(array([ 0.00240061, 0.0017896 , 0.00619928, 0.00010278, -0.00411501,
0.00028532, 0.0003893 , -0.00042893, 0.00178326, -0.00444068]),
array([ 99695.18278372]),
10,
array([ 318.37776275, 318.16578799, 317.82872616, 317.21981114,
316.80987468, 315.63798002, 315.46574698, 314.73120345,
313.99948001, 313.61503118]))
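If it is indeed SVD-based, the equivalent explicit computation looks roughly like this (a sketch only, with an illustrative function name, meant to make the memory footprint visible; lstsq itself calls into LAPACK):

import numpy as np

def lstsq_via_svd(A, B, rcond=1e-15):
    # Thin SVD of an (N, K) matrix stores U (N, K), s (K,) and Vt (K, K),
    # i.e. roughly one extra copy of A plus O(K^2) workspace.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)
    return Vt.T.dot(s_inv * U.T.dot(B))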

Hiding communication in Matrix Vector Product with MPI

I have to solve a huge linear system for multiple right-hand sides (let's say 20 to 200). The matrix is stored in a sparse format and distributed over multiple MPI nodes (let's say 16 to 64). I run a CG solver on the rank-0 node. It's not possible to solve the linear system directly, because the system matrix would be dense (Sys = A^T * S * A).
The basic Matrix-Vector multiplication is implemented as:
broadcast x
y = A_part * x
reduce y
While the collective operations are reasonably fast (Open MPI seems to use a binary-tree-like communication pattern, plus InfiniBand), they still account for quite a large part of the runtime. For performance reasons we already calculate 8 right-hand sides per iteration (basically SpM * DenseMatrix, just to be complete).
I'm trying to come up with a good scheme to hide the communication latency, but I have not had a good idea yet. I also try to refrain from doing 1:n communication, although I have not yet measured whether scaling would be a problem.
Any suggestions are welcome!
If your matrix is already distributed, would it be possible to use a distributed sparse linear solver instead of running it only on rank 0 and then broadcasting the result (if I'm reading your description correctly)? There are plenty of libraries for that, e.g. SuperLU_DIST, MUMPS, PARDISO, Aztec(OO), etc.
The "multiple rhs" optimization is supported by at least SuperLU and MUMPS (haven't checked the others, but I'd be VERY surprised if they didn't support it!), since they solve AX=B where X and B are matrices with potentially > 1 column. That is, each "rhs" is stored as a column vector in B.
If you don't need to have the results of an old right-hand side before starting the next run, you could try to use non-blocking communication (Isend, Irecv) and communicate the result while already calculating the next right-hand side.
But make sure you call MPI_Wait before reading the contents of the communicated array, in order to be sure you're not reading "old" data.
If the matrices are big enough (i.e. it takes long enough to calculate the matrix-product) you don't have any communication delay at all with this approach.
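A minimal mpi4py sketch of that idea, assuming an MPI library with non-blocking collectives (MPI-3) and that A_part is the local sparse block; the names are illustrative, not from the original code. The reduction of block k is started non-blocking and only waited for after the local product of block k+1 has been computed:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def matvec_overlapped(A_part, x_blocks, y_blocks):
    # x_blocks: right-hand-side vectors (valid on rank 0, broadcast here)
    # y_blocks: preallocated result arrays on all ranks (contents used on rank 0)
    in_flight = None                       # (request, send buffer) of previous block
    for k, x in enumerate(x_blocks):
        comm.Bcast(x, root=0)              # every rank needs the current vector
        y_local = A_part.dot(x)            # local part of the product
        if in_flight is not None:
            in_flight[0].Wait()            # previous reduction must be finished now
        in_flight = (comm.Ireduce(y_local, y_blocks[k], op=MPI.SUM, root=0),
                     y_local)              # keep the send buffer alive until Wait
    if in_flight is not None:
        in_flight[0].Wait()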

Resources