Julia: How to efficiently compose `StaticArrays.SMatrix` from multiple sub-matrices

I really like StaticArrays.jl because, in many cases (e.g. small to medium-sized matrices), matrix algebra with it can be so much faster.
However, I frequently struggle to efficiently compose multiple small SMatrix objects into a larger matrix without allocations. While my example below yields a block-diagonal matrix, I am ideally looking for a generic way of composing matrices where the index partitioning does not necessarily have this particular structure (even though a well-performing solution for block-diagonal composition would probably already cover 90% of my use cases).
Here is a minimum example of what I am currently doing when composing matrices:
using BenchmarkTools, StaticArrays
# this is just a placeholder for an operation with `m`
@inline something_something(m) = m*m
function compose_matrix(mats, index_partitions)
    m_total = sum(mats) do m; first(size(m)) end
    n_total = sum(mats) do m; last(size(m)) end
    M = @MMatrix(zeros(m_total, n_total))
    for (m, (xids, yids)) in zip(mats, index_partitions)
        M[xids, yids] = something_something(m)
    end
    return SMatrix(M)
end
# a tuple (xindex, yindex) of indices, indicating the position of each matrix in the
# composed matrix
index_partitions = ((SVector(1,2), SVector(1,2)),
                    (SVector(3,4), SVector(3,4)))
mats = (@SMatrix(rand(2,2)), @SMatrix(rand(2,2)))
@benchmark(compose_matrix($mats, $index_partitions)) |> display
This is already quite good:
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  1
  --------------
  minimum time:     19.386 ns (0.00% GC)
  median time:      22.257 ns (0.00% GC)
  mean time:        35.945 ns (21.02% GC)
  maximum time:     34.551 μs (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     997
However, the MMatrix allocation is not optimized out (on both Julia 1.2 and 1.3-rc4). I think I have seen cases in which these memory allocations are eliminated by the compiler, but I fail to see what I am allowed to do with the MMatrix between instantiation and conversion to the SMatrix so that it is lowered to code that does not allocate. Notice that, while this is a "friendly" case (only one allocation), allocations are even worse if the two sub-matrices (and therefore the index partitions) don't have the same size.
Is there a reliable way of doing these compositions of SMatrix objects without allocations? Do I have to mess around with @generated functions to compose the SMatrix in one shot (if so, what could this look like)? Or is there another way of doing this in an efficient manner?
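One route that avoids the MMatrix entirely, at least for the block-diagonal case, is to build the full element tuple yourself and pass it to the SMatrix constructor in column-major order, so nothing ever touches the heap. Below is a minimal sketch for two 2x2 blocks; blockdiag_2x2 is a hypothetical helper written for this example, not part of StaticArrays, and a @generated function could emit the same tuple expression for arbitrary static block sizes.
using StaticArrays

# Hypothetical helper (not part of StaticArrays): compose two 2x2 static
# matrices into a 4x4 block-diagonal SMatrix by listing all 16 elements in
# column-major order; everything stays in tuples, so nothing allocates.
function blockdiag_2x2(A::SMatrix{2,2}, B::SMatrix{2,2})
    z = zero(promote_type(eltype(A), eltype(B)))
    return SMatrix{4,4}(A[1,1], A[2,1], z, z,    # column 1
                        A[1,2], A[2,2], z, z,    # column 2
                        z, z, B[1,1], B[2,1],    # column 3
                        z, z, B[1,2], B[2,2])    # column 4
end

A = @SMatrix rand(2,2)
B = @SMatrix rand(2,2)
M = blockdiag_2x2(something_something(A), something_something(B))
A fully generic version would compute the destination index of every source element from the index partitions at compile time and splat the resulting tuple into SMatrix{m_total,n_total}(...).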

Related

Parallelize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32; apparently that's an upper limit. Second, using 32 BLAS threads is actually slower than using 8.
(e.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either of the functions do exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \ since they are related by transposing the matrices). I'd rather not delve too deeply because my knowledge in numerical linear algebra is quite limited.
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, Matrices of size around 10k work fine and take a minute to pinv:
using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try to do matrices around size 20k, I get right away the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using int32 for indexing, even though I don't understand why it leads to an error in this case. (edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes this is really my problem, please refrain from commenting "nobody should ever need such large least squares"
Naive answer
Using pytorch, this will require at least 30GB GPU memory
import torch
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
If the system can sustain the same operation throughput as my laptop you should have an answer in about 15 minutes.
I would suggest you try a generalized version, scaling up the dimensions, to get a better feeling for how your system will handle it:
def try_it(a, b, c):
    A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
    G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
    R = torch.lstsq(G.T, A.T)
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
You can't take much advantage of the entries being integers. This type of problem is easier to solve over the reals than over the integers: finding integer solutions would require searching the solution space, while the real solution can be found by algebraic manipulation.
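One further idea, not from the original answers, sketched in Julia (the language of the CPU attempts above): since the unknown is R in R*G = A, the least-squares solution also satisfies the normal equations R*(G*G') = A*G'. This replaces one huge factorization by two large matrix products plus a solve against the 40k x 40k Gram matrix, all of which parallelize well on many cores or a GPU; the usual caveat is that it squares the condition number of G, so it is only advisable if G is reasonably well conditioned. The helper name below is made up for illustration.
using LinearAlgebra

# Minimal sketch of the normal-equations route for min ||R*G - A||.
function lstsq_normal_eqs(A, G)
    GGt = Symmetric(G * G')    # Gram matrix, e.g. 40k x 40k
    AGt = A * G'               # e.g. 30k x 40k
    return AGt / GGt           # solves R * (G*G') = A*G'
end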

Non-intuitive perf diff between `matrix * vector`, `matrix' * vector` and `copy(matrix') * vector`

Question from Julia Discourse
I’m using Julia 1.2. This is my test:
a = rand(1000, 1000)
b = adjoint(a)
c = copy(b)
@btime a * x setup=(x=rand(1000)) # 114.757 μs
@btime b * x setup=(x=rand(1000)) # 94.179 μs
@btime c * x setup=(x=rand(1000)) # 110.325 μs
I was expecting a and c to be at very least not slower.
After inspecting stdlib/LinearAlgebra/src/matmul.jl, it turns out that Julia passes b.parent (i.e. a) to BLAS.gemv, not b, and instead switches LAPACK’s dgemv_ into a different and apparently faster mode.
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode? If so, then I’m guessing this isn’t actionable, besides possibly mentioning the gotcha in the docs somehow. If my assumption is wrong though, is there something to be done about this?
Answer from @stevengj in the same Discourse thread:
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode?
Close. It does have to do with memory, but it’s about locality, not alignment. The basic thing to understand is that it is more efficient to access consecutive (or at least nearby) data from memory than data that is separated, due to the existence of cache lines. (Consecutive access also has some advantages in utilizing SIMD instructions.)
Julia stores matrices in column-major order, so that the columns are contiguous in memory. When you multiply a transposed matrix (that has not been copied) by a vector, therefore, it can compute it as the dot product of the contiguous column (= transposed row) with the contiguous vector, which has good spatial locality and therefore utilizes cache lines efficiently.
For multiplying a non-transposed matrix by a vector, in contrast, you are taking the dot products of non-contiguous rows of the matrix with the vector, and it is harder to efficiently utilize cache lines. To improve spatial locality in this case, an optimized BLAS like OpenBLAS actually computes the dot products of several rows at a time (a “block”) with the vector, I believe — that’s why it’s only 10% slower and not much worse. (In fact, even the transposed case may do some blocking to keep the vector in cache.)
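To make the locality argument concrete, here is a minimal Julia sketch (a naive loop, not the BLAS code path; the helper name is made up) of how the transposed product b * x can be computed one contiguous column of a at a time:
# (a' * x)[j] is the dot product of column j of a with x, and columns are
# contiguous in memory, so the inner loop walks straight through the array.
function adjoint_matvec(A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    y = zeros(n)
    @inbounds for j in 1:n
        s = 0.0
        for i in 1:m          # contiguous access down column j
            s += A[i, j] * x[i]
        end
        y[j] = s
    end
    return y
end
For the non-transposed product each output element instead needs a whole row of a, which is strided in memory; that is the case the blocking described above is meant to help.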

Why is reshape so fast? (Spoiler: Copy-on-Write)

I have a big matrix A holding 1 GB of double values; when I reshape it to different dimensions, it's incredibly fast.
A=rand(128,1024,1024);
tic;B=reshape(A,1024,128,1024);toc
Elapsed time is 0.000011 seconds.
How can it be that fast? Another observation: MATLAB uses less memory than it should after running that code and storing two matrices of 1 GB each: Memory used by MATLAB: 1878 MB (1.969e+09 bytes)
Explanation of the good performance
MATLAB uses copy-on-write whenever possible. If you write an expression like B=A, MATLAB does not copy A; instead, both variables A and B are references to the same data structure. Only when one of the two variables is modified does MATLAB create a copy.
Now to the special case of reshape. Here it looks like A and B are not the same, but in memory they are: the underlying array which holds the data is unaffected by the reshape operation, and nothing has to be moved (all(A(:)==B(:))). All MATLAB has to do when calling reshape is create a new reference and annotate it with the new dimensions of the matrix. The runtime of reshape is less than 1 µs, or roughly the time two simple assignments like B=A require; for all practical purposes it is a zero-time operation.
>> tic;for i=1:1000;B=reshape(A,1024,128,1024);end;toc
Elapsed time is 0.000724 seconds.
>> tic;for i=1:1000;B=A;end;toc
Elapsed time is 0.000307 seconds.
It is unknown how large such a reference really is, but we can assume it to be within a few bytes.
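For comparison with the Julia questions in this collection: Julia's reshape is similarly cheap, but instead of copy-on-write the reshaped array permanently shares memory with the original, as this small sketch (same shapes as above) shows.
A = rand(128, 1024, 1024);        # about 1 GB of Float64 values
B = reshape(A, 1024, 128, 1024);  # no data is moved or copied

B[1] = -1.0                       # writes through to the same buffer...
A[1] == -1.0                      # ...so this is true: A and B alias each other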
Other zero cost operations
Functions known to have practically zero cost (both runtime and memory):
B=reshape(A,sz)
B=A(:)
B=A.' - only for Vectors
B=A' - only for Vectors of real numbers, without the attribute complex. Use .' instead.
B=permute(A,p) - only for the cases where all(A(:)==B(:)) [1]
B=ipermute(A,p) - only for the cases where all(A(:)==B(:)) [1]
B=squeeze(A) [1]
shiftdim - only for the cases where all(A(:)==B(:)) [1], which are:
used to remove leading singleton dimensions
used with a negative second input
used without a second input argument
Functions which are "expensive", even though they don't touch the representation in memory (all(A(:)==B(:)) is true):
Left-sided indexing: B(1:numel(A))=A; [2]
Right-sided indexing other than (:), including B=A(1:end); and B=A(:,:,:); [2]
[1] Significantly slower runtime than reshape, between 1µs and 1ms, probably because of some constant computational overhead. Memory consumption is practically zero and the runtime is independent of the input size. Operations without this annotation have a runtime below 1µs, roughly equivalent to reshape.
[2] Zero cost in Octave.
Originally used MATLAB 2013b when writing this post. Confirmed the numbers with MATLAB 2019b.

cuFFT runs slowly - any way to accelerate?

I am using cuFFT to calculate a 1D FFT along each row of a matrix and of an array. The matrix size is 512 (x) X 720 (y), and the size of the array is 512 X 1. This means the FFT is applied to each row of 512 elements, 720 times for the matrix and once for the array.
However, this operation turns out to be really slow, taking about one second. Is this normal, or is there any chance I can accelerate the code?
Here is my code (from NVIDIA sample code):
void FFTSinoKernel(cufftComplex* boneSinoF,
                   cufftComplex* kernelF,
                   int nChanDetX,  // 512
                   int nView)      // 720
{
    cufftHandle plan;

    // fft sino
    cufftPlan1d(&plan, nChanDetX, CUFFT_C2C, nView);
    cufftExecC2C(plan, boneSinoF, boneSinoF, CUFFT_FORWARD);

    // fft kernel
    cufftPlan1d(&plan, nChanDetX, CUFFT_C2C, 1);
    cufftExecC2C(plan, kernelF, kernelF, CUFFT_FORWARD);

    cufftDestroy(plan);
}
I was trying to use cufftExecR2C(), but I think that function has a bug, because my DC component shifts 1 or 2 units with each row, so I have filed a bug report. For now cufftExecC2C() gives me the right results, so I decided to stick with it.
UPDATE:
Interestingly, I found that if I call this function again, it runs significantly faster, in less than 10 ms. So whenever cuFFT gets called for the first time, it is slow; afterwards, it becomes much faster. I don't understand why the first call is slow or how to avoid it. Has anyone had a similar experience? Thanks.
Move the FFT initialization (plan creation) outside of the performance-critical loop. The setup code has to allocate memory and calculate O(N) transcendental functions, which can be much slower than the O(N log N) simple arithmetic ops inside the FFT computation itself.
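As an illustration of that advice, here is a minimal sketch using CUDA.jl's CUFFT wrapper (rather than the CUDA C API of the question), which exposes cuFFT through the AbstractFFTs plan interface; the sizes follow the post, and the loop stands in for the per-call hot path.
using CUDA, CUDA.CUFFT

sino = CUDA.rand(ComplexF32, 720, 512)   # 720 rows of 512 complex samples

plan = plan_fft(sino, 2)                 # expensive setup: create the plan once

for frame in 1:100                       # reuse the plan in the hot path;
    sino_f = plan * sino                 # each call only pays for the transform
end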

How to Optimally Add, Multiply and Average Very Large Data Sets in MATLAB Using parfor

I would like to introduce an interesting MATLAB programming problem I’ve encountered in my research. The solution may be of use to people doing computations on very large data sets. It involves striking a balance between RAM and CPU usage using parfor. Because my data is so large, files must be read in over and over again to be processed. The other issue it introduces is finding an optimal algorithm for multiplication, summation and averaging of very large vectors and matrices.
I have found a solution, but it's time intensive and I would like to see if the community sees any room for improvement. Here's the general form of the problem.
Suppose we have about 30,000 functions that we’ve taken the Fourier transforms of. Each transform has the form e(k)=a(k)+b(k)*i where k is the magnitude of a wavevector, a is the real component and b is the imaginary component. Each of these transforms is saved to file as a 2-column table with the structure below. The number of elements in each vector is about N=1e6. This means that each of these files is 1/64 GB in size. Note that the values of k_i are not necessarily in order.
k | Re(e) Im(e)
k_1 | a(1) b(1)
k_2 | a(2) b(2)
... |
k_N | a(N) b(N)
The goal is to cross-multiply each pair of modes and average the results over a set of about 50 fixed k-bands. So for example, if we let the elements of vectors 5 and 7 be represented respectively as e5=a5+b5*i and e7=a7+b7*i we need
a5(1)*a7(1) + b5(1)*b7(1)
a5(2)*a7(2) + b5(2)*b7(2)
...
a5(N)*a7(N) + b5(N)*b7(N)
Each element of the above N-dimensional vector belongs within a single k-bin. All the elements in each bin must be averaged and returned. So at the end of one mode-mode comparison we end up with just 50 numbers.
I have at my disposal a computer with 512GB of RAM and 48 processors. My version of MATLAB 2013a limits me to only opening 12 processors with parfor simultaneously on each instance of MATLAB that I run. So what I’ve been doing is opening 4 versions of MATLAB, allocating 10 processors each, and sending the maximum amount of data to each processor without spilling over my self-imposed limit of 450 GB.
Naturally this involves my breaking the problem into blocks. If I have 30000 vectors e, then I will have 30000^2 sets of these 50 cross-coefficients. (It’s actually about half this since the problem is symmetric). I decide to break my problem into blocks of size m=300. This means I’d have 100 rows and columns of these blocks. The code I’m running looks like this (I’m simplifying a bit to just include the relevant bits):
for ii=1:100                        % for each of the block-rows
    [a,b] = f_ReadModes(ii);        % this function reads modes 1 through 300 if ii=1,
                                    % modes 301 through 600 if ii=2 and so on
                                    % "a" and "b" are matrices of size Nx300
                                    % "a" contains the real vectors stored in columns,
                                    % "b" contains the imaginary vectors stored in columns
    for jj=ii:100                   % for each of the block-columns in the upper triangle
        [c,d] = f_ReadModes(jj);    % same as above except this reads in a different
                                    % set of 300 modes
        block = zeros(300,300,50);  % allocates space for the results. The first
                                    % dimension corresponds to the "ii" modes.
                                    % The 2nd dimension corresponds to the "jj"
                                    % modes. The 3rd dimension is for each k-bin.
        parfor rr=1:300
            A = zeros(1,300,50);    % temporary storage to keep parfor happy
            ModeMult = bsxfun(@times,a(:,rr),c) + bsxfun(@times,b(:,rr),d);
            % My strategy is to cross multiply one mode by all the others before
            % moving on. So when rr=6, I'm multiplying the 6th mode in a (and b)
            % by all the modes in c and d. This would help fill in the 6th row
            % of block, i.e. block(6,:,:).
            for kk=1:50             % Now I average the results in each of the k-bins
                ind_dk = f_kbins(kk);  % this returns the rows of a,b,c,d and
                                       % ModeMult that lie within the kk-th bin
                A(1,:,kk) = mean(ModeMult(ind_dk,:));  % average results in each bin
            end
            block(rr,:,:) = A;      % place the results in more permanent storage.
        end
        f_WriteBlock(block);        % writes the resulting cross-coefficient block to disk
    end
end
There are three bottlenecks in this problem:
1) Read-in time
2) Computing the products ac and bd then summing them (this is ModeMult)
3) Averaging the results of step 2 in each k-bin
Bigger blocks are preferable since they necessitate fewer read-ins. However, the computations in steps 2 and 3 don't automatically parallelize, so they have to be sent to individual processors using parfor. Because the computational costs are high, utilizing all the processors seems necessary.
The way my code is written, each processor needs enough memory to hold 4*N*m elements. When m=300, this means each processor is using about 10 GB of RAM. It would be nice if the memory requirement for each processor could be lowered somehow. It would also be great if the computations in steps 2 and 3 could be rewritten to run more efficiently. Any thoughts?
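One candidate improvement for bottleneck 3 is to build a sparse 50 x N "bin-averaging" operator once and turn the whole per-bin loop into a single matrix product. The sketch below shows the idea in Julia with made-up stand-in data (bin_of_row and ModeMult are hypothetical names); the same construction carries over directly to MATLAB's sparse, and since the operator does not depend on rr it can be hoisted out of all the loops.
using SparseArrays

# Stand-in data with the shapes from the post (values are hypothetical).
N, ncols, nbins = 1_000_000, 300, 50
bin_of_row = rand(1:nbins, N)        # which k-bin each of the N rows falls into
ModeMult   = rand(N, ncols)          # cross products of one mode against a block

# Row kk of W holds 1/|bin kk| at the columns belonging to bin kk, so
# W*ModeMult produces every bin mean for every column in one product.
counts = [count(==(kk), bin_of_row) for kk in 1:nbins]
W = sparse(bin_of_row, collect(1:N), 1.0 ./ counts[bin_of_row], nbins, N)

binmeans = W * ModeMult              # 50 x 300 matrix of bin averages
In the loop above this would replace the inner for kk=1:50 averaging block, leaving one sparse-times-dense product per mode.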
