Speed up numpy matrix inverse - performance

I am using Numpy/Scipy to invert a 20k matrix, it's slow.
I tried:
(1) M_inv = M.I
(2) Ident = np.identity(len(M))
    M_inv = scipy.linalg.solve(M, Ident)
(3) M_inv = scipy.linalg.inv(M)
but didn't see any speedup.
Is there any other way to speed this up?

This is a big matrix, and inverting it is going to be slow. Some options:
Use a numpy linked against Intel MKL (e.g. the Enthought distribution, or you can compile it yourself), which should be faster than one linked against standard BLAS/ATLAS.
If your matrix is sufficiently sparse, use scipy.sparse.linalg. (This will probably be slower if there are only a few zeros, though.)
Figure out if you really need an explicit representation of the inverted matrix to do whatever it is you're trying to do with it – often you can get away without explicitly inverting it, but it's hard to tell without knowing what it is you're doing with this matrix.
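To illustrate the last point: if the inverse is only ever applied to vectors, you can factor the matrix once and reuse the factors. This is a minimal sketch, assuming the goal is to solve M x = b for one or more right-hand sides; the matrix size, the test matrix and the variable names are made up for the example.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500  # small stand-in for the 20k case
M = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

# Factor once (O(n^3)), then reuse the factors for each right-hand side (O(n^2)).
lu, piv = lu_factor(M)
x = lu_solve((lu, piv), b)

# Same result as going through the explicit inverse, without ever forming it.
x_via_inverse = np.linalg.inv(M) @ b
```

The factorization costs about as much as one inversion, so the gain comes from skipping the extra work of forming the inverse and from the better numerical behaviour of a solve.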


Multiply Eigen::Matrices without transposing first

Say you have Matrix<int, 4, 2> and Matrix<int, 3, 2> which you want to multiply in the natural way that consumes the -1 dimension without first transposing.
Is this possible, or do we have to transpose first? That would be silly (bad for performance) from a cache perspective, because the elements being multiplied and summed would no longer be contiguous.
Here's a playground. https://godbolt.org/z/Gdj3sfzcb
Pytorch provides torch.inner and torch.tensordot which do this.
Just like in Numpy, transpose() just creates a "view". It doesn't do any expensive memory operations (unless you assign it to a new matrix). Just call a * b.transpose() and let Eigen handle the details of the memory access. A properly optimized BLAS library like Eigen handles the transposition on smaller tiles in temporary memory for optimal performance.
Memory order still matters for fine tuning though. If you can, write your matrix multiplications in the form a.transpose() * b for column-major matrices (like Eigen, Matlab), or a * b.transpose() for row-major matrices like those in Numpy. That saves the BLAS library the trouble of doing that transposition.
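The NumPy side of that comparison is easy to check. The sketch below uses the shapes from the question (4 x 2 and 3 x 2) and shows that the transpose shares memory with the original array and that a @ b.T matches np.inner; it says nothing about Eigen itself.

```python
import numpy as np

a = np.arange(8, dtype=np.int64).reshape(4, 2)
b = np.arange(6, dtype=np.int64).reshape(3, 2)

bt = b.T                           # a view: no data is copied
shares = np.shares_memory(b, bt)   # True when both arrays use the same buffer

product = a @ bt                   # shape (4, 3), contracts the last axis of both
inner = np.inner(a, b)             # same contraction, spelled without a transpose
```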
Side note: You used auto for your result. Please read the Common Pitfalls chapter in the documentation. Your code didn't compute a matrix multiplication, it stored an expression of one.

Parallelize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32. Apparently that's an upper limit. Second, using 32 BLAS threads is actually slower than using 8.
(e.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either of the functions do exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \ since they are related by transposing the matrices). I'd rather not delve too deeply because my knowledge in numerical linear algebra is quite limited.
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, Matrices of size around 10k work fine and take a minute to pinv:
using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try to do matrices around size 20k, I get right away the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using int32 for indexing, even though I don't understand why it leads to an error in this case. (edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes this is really my problem, please refrain from commenting "nobody should ever need such large least squares"
Naive answer
Using pytorch, this will require at least 30GB GPU memory
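import torch
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
# Originally torch.lstsq(G.T, A.T); that function has since been removed in favour of
# torch.linalg.lstsq, which takes its arguments in the opposite order.
R = torch.linalg.lstsq(A.T, G.T).solution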
If the system can sustain the same operation throughput as my laptop you should have an answer in about 15 minutes.
I would suggest trying a generalized version and scaling up the dimensions gradually, to get a better feeling for how your system will handle it:
def try_it(a, b, c):
    A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
    G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
    # torch.linalg.lstsq(A.T, G.T) replaces the removed torch.lstsq(G.T, A.T)
    R = torch.linalg.lstsq(A.T, G.T).solution
    return R
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
You can't take much advantage of the entries being integer. This type of problem is easier to solve on the reals than on the integers, because finding integer solutions would require you to search the solutions, while the real solution you can find by doing algebraic manipulations.
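The reason transposes appear everywhere is that R G = A is the same problem as G^T R^T = A^T, which is the usual "tall" least-squares form that library solvers expect. Here is a CPU-only sketch at toy scale, assuming NumPy instead of a GPU and sizes 1000 times smaller than in the question; the variable names follow the question, not the snippet above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes standing in for 30k x 40k and 40k x 50k.
R_true = rng.random((30, 40)).astype(np.float32)
G = rng.integers(0, 2, size=(40, 50)).astype(np.float32)
A = R_true @ G

# R G = A  <=>  G^T R^T = A^T, so solve for R^T and transpose back.
Rt, residuals, rank, _ = np.linalg.lstsq(G.T, A.T, rcond=None)
R_fitted = Rt.T
```

Timing this for a few increasing sizes gives a rough idea of how the cost grows before committing to the full problem.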

Sparse matrix to speed up octave

I have a loop where "i" depends on "i-1" value, so I cannot vectorize it.
I've read that I can use a sparse matrix in order to vectorize it and so to speed up my code, but I don't understand how this work.
Any help?
You are referring to this technique, as referenced from this (rather old) how to speed up octave article.
I'll rephrase the gist here in case the link dies in the future.
Suppose you have the following loop:
p1(1) = 0;
for i = 2 : N
  t = t + dt;
  p1(i) = p1(i - 1) + dt * 2 * t;
end
You note here that, purely from a mathematical point of view, the last step in the loop could be rephrased as:
-1 * p1(i - 1) + 1 * p1(i) = dt * 2 * t
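This makes it possible to recast the problem as a sparse matrix solve, by thinking of p1 as the vector of unknowns, and each iteration of the loop as a row in a (sparse) system of equations. E.g., writing t(i) for the value of t at iteration i:

    [  1   0   0  ...  0 ] [ p1(1) ]   [ 0             ]
    [ -1   1   0  ...  0 ] [ p1(2) ]   [ dt * 2 * t(2) ]
    [  0  -1   1  ...  0 ] [ p1(3) ] = [ dt * 2 * t(3) ]
    [        ...         ] [  ...  ]   [      ...      ]
    [  0  ...  0  -1   1 ] [ p1(N) ]   [ dt * 2 * t(N) ]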
Given that t is a known vector, this makes the above a straightforward problem that can be solved via a simple matrix division operation, which is guaranteed to be fast.
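For readers who want to try the idea outside Octave, here is a sketch of the same recast in Python with SciPy. The values of N, dt and the starting time t0 are illustrative assumptions, not taken from the article; the loop is kept as a reference so the two results can be compared.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

N, dt, t0 = 1000, 0.01, 0.0   # illustrative values, not from the article

# Reference: the original loop (0-based indexing).
p_loop = np.zeros(N)
t = t0
for i in range(1, N):
    t += dt
    p_loop[i] = p_loop[i - 1] + dt * 2 * t

# Sparse form: row 0 pins p[0] = 0, row i encodes -p[i-1] + p[i] = dt * 2 * t_i.
t_vec = t0 + dt * np.arange(N)          # t_vec[i] is t at iteration i
rhs = dt * 2 * t_vec
rhs[0] = 0.0
L = diags([np.ones(N), -np.ones(N - 1)], offsets=[0, -1], format="csc")
p_sparse = spsolve(L, rhs)

# For this particular recurrence a cumulative sum gives the same answer.
p_cumsum = np.cumsum(rhs)
```

The matrix has only two nonzero diagonals, so the solve touches about 2N entries instead of N^2.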
Having said that, presumably this 'trick' is only useful if you are able to recast the problem in this manner in the first place. Presumably this will only be the case for linear problems of your unknown. I don't think this can necessarily be used for more complicated loops.
Also, as Cris has mentioned in the comments, if this method does not work for you, there's a chance you can optimize your loop in other ways (or even that the loop solution may not necessarily be slow in the first place).
By the way, in theory, Octave provides jit-speedup like matlab does, though unlike matlab you need to enable it explicitly (in the sense that you need to compile your octave with jit options, which tends not to be the default), and my personal experience is that this is mostly experimental and may not do much except in the simplest of loops (see this post).

CUDA implementation for arbitrary precision arithmetics

I have to multiply two very large (~ 2000 X 2000) dense matrices whose entries are floats with arbitrary precision (I am using GMP and the precision is currently set to 600). I was wondering if there is any CUDA library that supports arbitrary precision arithmetics? The only library that I have found is called CAMPARY however it seems to be missing some references to some of the used functions.
The other solution that I was thinking about was implementing a version of the Karatsuba algorithm for multiplying matrices with arbitrary precision entries. The end step of the algorithm would just be multiplying matrices of doubles, which could be done very efficiently using cuBLAS. Is there any similar implementation already out there?
Since nobody has suggested such a library so far, let's assume that one doesn't exist.
You could always implement the naive implementation:
One grid thread for each pair of coordinates in the output matrix.
Each thread performs an inner product of a row and a column in the input matrices.
Individual element operations will use the code taken from the GMP (hopefully not much more than copy-and-paste).
But you can also do better than this - just like you can do better for regular-float matrix multiplication. Here's my idea (likely not the best of course):
Consider the worked example of matrix multiplication using shared memory in the CUDA C Programming Guide. It suggests putting small submatrices in shared memory. You can still do this - but you need to be careful with shared memory sizes (they're small...):
A typical GPU today has 64 KB shared memory usable per grid block (or more)
The worked example uses 16 x 16 submatrices.
Times 2 (for the two multiplicands)
Times ceil(801/8) = 101 bytes per element (assuming the GMP representation uses 600 bits for the mantissa, one bit for the sign and 200 bits for the exponent)
So 16 * 16 * 2 * 101 = 512 * 101 = 51,712 bytes < 64 KB!
That means you can probably just use the code in their worked example as-is, again replacing the float multiplication and addition with code from GMP.
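The budget above is quick to re-check if your precision or tile size changes. This sketch just redoes the arithmetic; the 600 + 1 + 200 bit layout is the assumption stated in the list, not something taken from GMP's actual memory representation.

```python
import math

tile = 16                           # 16 x 16 submatrix, as in the worked example
operands = 2                        # one tile from each multiplicand
bits_per_element = 600 + 1 + 200    # mantissa + sign + exponent, as assumed above
bytes_per_element = math.ceil(bits_per_element / 8)

shared_bytes = tile * tile * operands * bytes_per_element
budget_bytes = 64 * 1024

print(bytes_per_element, shared_bytes, shared_bytes < budget_bytes)
```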
You may then want to consider something like parallelizing the GMP code itself, i.e. using multiple threads to work together on single pairs of 600-bit-precision numbers. That would likely help your shared memory reading pattern. Alternatively, you could interleave the placement of 4-byte sequences from the representation of your elements, in shared memory, for the same effect.
I realize this is a bit hand-wavy, but I'm pretty certain I've waved my hands correctly and it would be a "simple matter of coding".

Ising 2D Optimization

I have implemented a MC-Simulation of the 2D Ising model in C99.
Compiling with gcc 4.8.2 on Scientific Linux 6.5.
When I scale up the grid the simulation time increases, as expected.
The implementation simply uses the Metropolis–Hastings algorithm.
I tried to find a way to speed up the algorithm, but I don't have any good ideas.
Are there any tricks for doing so?
As jimifiki wrote, try to do a profiling session.
In order to improve on the algorithmic side only, you could try the following:
Lookup Table:
When calculating the energy difference for the Metropolis criterion you need to evaluate the exponential exp[-K / T * dE], where K is your scaling constant (in units of Boltzmann's constant) and dE is the energy difference between the original state and the one after a spin flip.
Calculating exponentials is expensive.
So you simply build a table beforehand in which to look up the possible values of dE. For a nearest-neighbour interaction the four neighbours have 2^4 = 16 possible configurations; exploit the problem's symmetry and you get five values for dE: 8, 4, 0, -4, -8. Instead of calling the exp function, use the precalculated table.
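The lookup table can be sketched as follows. This is Python rather than C99, purely to show the structure. It assumes a coupling constant of 1, periodic boundaries, and spins stored as +1/-1; make_table and metropolis_sweep are hypothetical names for this example.

```python
import math
import random

def make_table(T, K=1.0):
    """Precompute exp(-K * dE / T) for the five possible energy differences."""
    return {dE: math.exp(-K * dE / T) for dE in (-8, -4, 0, 4, 8)}

def metropolis_sweep(spins, table, rng):
    """One sweep over an L x L lattice of +1/-1 spins with periodic boundaries."""
    L = len(spins)
    for _ in range(L * L):
        i, j = rng.randrange(L), rng.randrange(L)
        s = spins[i][j]
        nb = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j]
              + spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
        dE = 2 * s * nb                      # always one of -8, -4, 0, 4, 8
        # Flips that lower the energy are always accepted; others use the table.
        if dE <= 0 or rng.random() < table[dE]:
            spins[i][j] = -s

rng = random.Random(0)
L = 8
spins = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
table = make_table(T=2.0)
metropolis_sweep(spins, table, rng)
```

In C the table would typically be a five-element array indexed by (dE + 8) / 4, rebuilt only when the temperature changes.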
As mentioned before, it is possible to parallelize the algorithm. To preserve physical correctness, you have to use a so-called checkerboard scheme. Consider the two-dimensional grid as a checkerboard and update all the white cells in parallel, then all the black ones. This works because with a nearest-neighbour interaction each cell depends only on cells of the other colour.
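A vectorised NumPy sketch of the checkerboard update is shown here as a stand-in for a real parallel implementation. It assumes an even lattice size, periodic boundaries, a coupling constant of 1 and temperature in units where Boltzmann's constant is 1; checkerboard_sweep is a hypothetical name.

```python
import numpy as np

def checkerboard_sweep(spins, T, rng):
    """Update all sites of one colour at once, then all sites of the other."""
    L = spins.shape[0]
    ii, jj = np.indices((L, L))
    for colour in (0, 1):
        mask = (ii + jj) % 2 == colour
        # Neighbours of a masked site all have the other colour, which stays
        # frozen during this half-step, so the updates do not interfere.
        nb = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0)
              + np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
        dE = 2 * spins * nb
        accept = rng.random((L, L)) < np.exp(-dE / T)
        spins[mask & accept] *= -1
    return spins

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(8, 8))
spins = checkerboard_sweep(spins, T=2.0, rng=rng)
```

On a GPU or with threads, each half-step maps naturally to one kernel launch or one parallel loop, with a synchronisation between the two colours.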
You can also implement the simulation on a GPGPU, e.g. using CUDA, since you're already working in C99.
Some tips:
- Don't forget to align C99-structs properly.
- Use linear (flat) arrays, not nested ones. Aligned memory is normally faster to access, if done properly.
- Try to let the compiler do loop unrolling, etc. (special gcc options such as -funroll-loops, which are not enabled by default at -O2)
Some more information:
If you are looking for an efficient method to calculate the critical point of the system, the method of choice would be finite-size scaling: you simulate at different system sizes and different temperatures, then calculate a quantity that is independent of system size at the critical point, which therefore shows up as an intersection point of the corresponding curves (please see the theory for a detailed explanation).
I hope I was helpful.
It's normal that your simulation times scale at least with the square of the size. Isn't it?
Here are some suggestions:
If you are concerned with thermalization issues, try to use parallel tempering. It can be of help.
The Metropolis-Hastings algorithm can be made parallel. You could try to do it.
Check you are not pessimizing the code.
Are your spin arrays of ints? You could put many spins on the same int. It's a lot of work.
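To give an idea of what "many spins on the same int" looks like, here is a small sketch for a single row of spins with periodic boundaries. It uses Python's arbitrary-size ints for brevity; in C you would use fixed-width words such as uint64_t. pack_row and count_antialigned_pairs are hypothetical helper names.

```python
def pack_row(spins):
    """Pack a sequence of +1/-1 spins into one int (bit i set means spin i is +1)."""
    word = 0
    for i, s in enumerate(spins):
        if s == 1:
            word |= 1 << i
    return word

def count_antialigned_pairs(word, n):
    """Count neighbouring pairs (i, i+1 mod n) with opposite spins."""
    mask = (1 << n) - 1
    # Rotate right by one so that bit i of `rotated` holds spin i+1.
    rotated = ((word >> 1) | (word << (n - 1))) & mask
    # XOR marks the positions where a spin differs from its right neighbour.
    return bin(word ^ rotated).count("1")

row = [1, 1, -1, 1]
word = pack_row(row)
anti = count_antialigned_pairs(word, len(row))
energy = -(len(row) - 2 * anti)   # -sum of s_i * s_(i+1) for coupling 1
```

One XOR plus a popcount handles a whole word of bonds at once, which is where the speed-up comes from; the bookkeeping for a full 2D update is the "lot of work" part.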
Moreover, remember what Donald taught us:
premature optimisation is the root of all evil
Before optimising you should first understand where your program is slow. This is called profiling.
