GPU: the fastest way to transpose, reshape and multiply two matrices

What is the fastest way to transpose, reshape and multiply two matrices in MATLAB? I want to do the following:
B = B';
B = reshape(B, 20, 5000000);
A = A * B
where A is a 20 x 20 real matrix and B is a 25 million x 4 real matrix. With the implementation above, the transpose is ~4 times slower than the matrix multiplication (on the GPU).
I have heard about dgemm, which seems relevant but is not exactly what I'm looking for (it lets one multiply matrices, transpose them and apply a stride (see the LDA argument) in a single fast operation).
I'm mostly interested in the case where both A and B are already gpuArrays and we are using modern NVIDIA GPU hardware.

Related

Is row-major ordering more efficient for matrix-vector multiplication?

If M is an n x m matrix and v and u are vectors, then in terms of indices, matrix-vector multiplication looks like u[i] = sum(M[i,j] v_j, 1 <= j <= m). Since v is a vector, its elements are presumably stored in consecutive memory locations for numerical-computation-oriented languages. If M is stored in row-major order (as in C, Mathematica, and Pascal), then the subsequent M[i,j] in the sum are also stored in consecutive memory locations as j is incremented, making the iteration very efficient. If it's stored in column-major order (as in Fortran, Matlab, R, and Julia), then incrementing j requires moving over by a number of memory locations equal to the outer matrix stride, which in this case equals n. This naively seems less efficient for matrices with many rows. (For matrix-matrix multiplication the problem doesn't come up, because under either ordering convention, incrementing the summed index requires moving by the major stride in one matrix's memory or the other.)
Is the difference between moving over in memory by one unit and by many units appreciable or negligible in most computer architectures, compared to the multiplication and addition operations? (I'm guessing "negligible", since in practice Fortran is typically at least as fast as C, but can anyone elaborate why?)

The difference is expected to be high in most computer architectures, at least in principle.
Matrix-vector multiplication is a memory-bound computation because memory reuse is low. All N components of v are reused to compute each element of u, but each of the N^2 matrix elements is used only once. If we take the latency of a typical main memory access (see e.g. https://gist.github.com/hellerbarde/2843375) as (less than) 100 ns and compare it with the time required for a floating point operation (less than 1 ns), we see that most of the time is spent loading and storing values from/to arrays.
We can still make the implementation cache-friendly, i.e. keep as much data locality as possible. Since memory is loaded into the cache in lines, we should use each loaded cache line as fully as possible. That is why accessing contiguous memory regions reduces the time spent loading data from memory.
To support this, let us try a very simple code:
program mv
  integer, parameter :: n=10000
  real, allocatable :: M(:,:), v(:), u(:)
  real :: start, finish
  integer :: i, j
  allocate(M(n,n),v(n),u(n))
  call random_number(M)
  call random_number(v)
  u(:)=0.
  call cpu_time(start)
  do i=1,n
    do j=1,n
      ! non-contiguous order
      u(i)=u(i)+M(i,j)*v(j)
      ! contiguous order
      ! u(i)=u(i)+M(j,i)*v(j)
    enddo
  enddo
  call cpu_time(finish)
  print*,'elapsed time: ',finish-start
end program mv
Some results (elapsed time as printed by the program, in seconds):
                non-contiguous order   contiguous order
gfortran -O0    1.                     0.5
gfortran -O3    0.3                    0.1
ifort -O0       1.5                    0.85
ifort -O3       0.037                  0.035
As you can see, the difference is significant when compiling without optimizations. With optimization enabled, gfortran still shows a significant difference, whereas with ifort the difference is small. Looking at the compiler report, it seems that the compiler interchanged the loops, thus giving contiguous access in the inner loop.
However, can we say that a language with row-major ordering is more efficient for matrix-vector computation? No, I cannot say that, and not only because the compiler can compensate for the difference. The code itself knows nothing about the rows and columns of M: it only knows that M has two indices, one of which (depending on the language) is contiguous in memory. For matrix-vector multiplication, data locality is best when the "fast" index runs along a row of the matrix. You can achieve this in both "row-major" and "column-major" languages. You just have to store the values of M accordingly. As an example, if you have the "algebraic" matrix
    [ M11 M12 ]
M = [         ]
    [ M21 M22 ]
you store it as "computational matrix"
C ==> M[1,1] = M11 ; M[1,2] = M12 ; M[2,1] = M21 ; M[2,2] = M22
Fortran ==> M[1,1] = M11 ; M[2,1] = M12 ; M[1,2] = M21 ; M[2,2] = M22
so that you are always contiguous along an "algebraic matrix" row. The computer does not know anything about the initial matrix, but we know that in Fortran the computational matrix is the transposed version of the algebraic matrix. In both cases, the inner loop iterates over a contiguous index and the final result is the same vector.
In a complex code where the matrix has already been allocated and filled, and I cannot choose to store its transpose, the "row-major" language may well give the best performance. But interchanging the loops (see https://en.wikipedia.org/wiki/Loop_interchange), as done automatically by Intel compilers and as done by BLAS implementations (see http://www.netlib.org/lapack/explore-html/db/d58/sgemv_8f_source.html), reduces the difference to a very small value. Therefore, in Fortran you may prefer:
do j=1,n
  do i=1,n
    u(i)=u(i)+M(i,j)*v(j)
  enddo
enddo
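To make the effect of the loop interchange concrete, here is a minimal sketch in Python. It is only an illustration, not a benchmark: the matrix is kept in a flat list with the column-major layout Fortran uses, and the function names `matvec_i_outer` and `matvec_j_outer` are made up for this example. Both loop orders produce the same vector; only the distance between consecutive memory accesses in the inner loop changes.

```python
# Sketch only: a column-major n x n matrix kept in a flat list, so that
# element (i, j) sits at flat index i + j*n (0-based), as Fortran lays it out.

def matvec_i_outer(M, v, n):
    """Outer loop over i, inner over j: consecutive inner accesses are n apart."""
    u = [0.0] * n
    for i in range(n):
        for j in range(n):
            u[i] += M[i + j * n] * v[j]
    return u

def matvec_j_outer(M, v, n):
    """Outer loop over j, inner over i: consecutive inner accesses are adjacent."""
    u = [0.0] * n
    for j in range(n):
        for i in range(n):
            u[i] += M[i + j * n] * v[j]
    return u

n = 4
# Fill the flat storage so that element (i, j) holds the value i + 10*j.
M = [float(i + 10 * j) for j in range(n) for i in range(n)]
v = [1.0, 2.0, 3.0, 4.0]

u_i_outer = matvec_i_outer(M, v, n)
u_j_outer = matvec_j_outer(M, v, n)
print(u_i_outer == u_j_outer)
```

In interpreted Python the cache effect itself is hidden by interpreter overhead, so this shows only the equivalence of the two orderings, not the timing difference.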

Lua/torch multiplication of 1D and 2D tensors

I am trying to multiply two matrices in Lua whose dimensions are a = 40,000 x 1 and b = 1 x 40,000. In Lua, the 40,000 x 1 matrix shows up as a 1D tensor and the 1 x 40,000 matrix shows up as a 2D tensor. Whenever I try to multiply them using a*b, I get the error: multiplication between 1D and 2D tensors not yet supported. I cannot iterate over each index because this function is called regularly in my program and that would considerably increase the execution time. How can I multiply a and b?

Use view:
c = a:view(40000, 1) * b

Parallel Matrix Multiplication using multi GPU

I have installed two GPUs (2x Nvidia Quadro 410) in my system in different PCI slots. To perform matrix multiplication on both of these GPUs, how can I split the input matrices so that each GPU computes a part of the output matrix and then returns it?
For example, for two matrices A and B, each of order 10 x 10, I want to compute the output matrix C = A x B so that, out of its 100 elements (10 x 10), 50 are calculated on the first GPU and the other 50 on the second GPU.
I am trying to implement this in OpenCL, but any algorithm that helps me come up with a solution is welcome.

In general, if you have matrices X (of size axb, rows first) and Y (of size bxc),
X * Y = vcat(X[0:a/2,0:b] * Y, X[a/2:a,0:b] * Y)
In this pseudocode, vcat is vertical concatenation (putting one matrix on top of another, e.g. a 4x3 matrix concatenated with a 2x3 matrix produces a 6x3 matrix), : denotes ranges and [] is indexing.
Both arguments to vcat can be computed on different GPUs, and the concatenation can be achieved just by pointing the output to different sub-regions of the output buffer (assuming we have C-ordered arrays). The initial splitting of X can be similarly achieved just by using different sub-regions (since it is split along a row).
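The row split can be checked on the CPU with a small sketch in plain Python. Nothing here touches a GPU or OpenCL: the helper names `matmul` and `split_matmul` are hypothetical, and the two halves are simply computed one after the other where a real implementation would dispatch each to its own device.

```python
def matmul(X, Y):
    """Plain triple-loop product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def split_matmul(X, Y):
    """Compute X * Y as vcat(top_half(X) * Y, bottom_half(X) * Y)."""
    half = len(X) // 2
    top = matmul(X[:half], Y)      # this part would go to the first device
    bottom = matmul(X[half:], Y)   # this part would go to the second device
    return top + bottom            # list concatenation plays the role of vcat

X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     [10, 11, 12]]
Y = [[1, 0],
     [0, 1],
     [2, 3]]

full = matmul(X, Y)
halves = split_matmul(X, Y)
print(full == halves)
```

Note that each half needs all of Y, so in this scheme Y would have to be copied to both devices while X is split between them.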

Averaging fft2s of resulting matrix from blockproc fft2 on 20x20 blocks of 40x100 image?

I am relatively new to matlab and image processing, so please bear with me.
What I am trying to do is characterize noise within an image, specifically by averaging the fft of an area where this noise occurs with high probability.
img = img(1:40,1:100);
imshow(img);
ffts = blockproc(img, [20 20], @(block_struct) fftshift(fft2(block_struct.data)));
% fft = imresize(ffts, [40 100], 'nearest');
Essentially, this code takes the upper left hand 40 x 100 portion of the image and then performs a block-process on each 20 x 20 subsection of that area calculating fft2. Hopefully, my logic sounds alright so far.
What I am wondering, though, is whether there is any way to average the fft2 results of the 20 x 20 subsections of the 40 x 100 fft matrix with built-in MATLAB functionality. I know that this could be done relatively easily with loops, but I'd like to keep the solution in my code as compact as possible.
I've read through the manual a little and it is apparent that there are a couple of MATLAB functions that may do this; however, I am not entirely confident in how to apply them. Any directions are welcome!

This can easily be done in 3 lines of code.
Line #1
First use im2col to reshape each distinct 20 x 20 block neighbourhood into a single column. The output is a 400 x N matrix, where each column is one block neighbourhood reshaped into a column. Each column has 400 rows, as each neighbourhood has 400 elements (20 x 20). N is the total number of distinct blocks in your 40 x 100 image. This amounts to 10, as we can fit 2 blocks vertically and 5 blocks horizontally given the 20 x 20 requirement.
Line #2
What is great about the output of im2col is that the ith row of im2col tells you the ith element for every block in your image. As such, all you have to do next is take each row and average over all of the columns. The output will be a 400 x 1 vector that denotes your average FFT for all of the blocks. This can be achieved using mean and specifying that we want to average over the second dimension (second parameter is 2), which is the columns.
Line #3
We then need to reshape this back into a 20 x 20 matrix, so use reshape to do this. We specify that the output matrix is 20 x 20, given the 400 x 1 element vector.
One question you may ask is whether this reordering is guaranteed to place the elements of each FFT block correctly. It is, because im2col proceeds in column-major order at both levels. The distinct 20 x 20 blocks are visited column by column: within one column of blocks, the blocks are taken from top to bottom before moving on to the next column of blocks. Within each block, the elements are also sampled in column-major order, so a single 20 x 20 block becomes a 400 x 1 column vector in which the columns of the block are stacked on top of each other from left to right. Therefore, after mean and reshape, the spatial locations within each block correspond to each other and the result is correct.
Without further ado, here's the code:
colBlocks = im2col(ffts, [20 20], 'distinct'); %// Line 1
meanCol = mean(colBlocks, 2); %// Line 2
fftBlockAverage = reshape(meanCol, [20 20]); %// Line 3
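As a cross-check of what these three lines compute, here is a loop-based sketch in plain Python. It does not reproduce im2col itself; the function name `block_average` and the tiny 4 x 6 test matrix with 2 x 2 blocks are assumptions made for the illustration. It averages position (i, j) of every distinct block, which is the same quantity the im2col / mean / reshape sequence produces.

```python
def block_average(img, bh, bw):
    """Average all distinct bh x bw blocks of img element by element.

    img is a list of rows whose dimensions are assumed to be exact
    multiples of the block size.
    """
    rows, cols = len(img), len(img[0])
    # Top-left corner of every distinct block.
    corners = [(r, c) for c in range(0, cols, bw) for r in range(0, rows, bh)]
    return [[sum(img[r + i][c + j] for r, c in corners) / len(corners)
             for j in range(bw)]
            for i in range(bh)]

# 4 x 6 test matrix whose entry at (r, c) is r*6 + c, split into 2 x 2 blocks.
img = [[r * 6 + c for c in range(6)] for r in range(4)]
avg = block_average(img, 2, 2)
print(avg)
```

The order in which the blocks are visited does not matter for the mean, which is why the simple loop agrees with the column-major ordering used by im2col.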
Minor Side Effect
Because the FFT is complex valued in nature, by doing the average, you would average the real and imaginary components separately. This is how MATLAB handles the average of complex valued data. I'm not sure what analysis you'll be performing after you calculate the average 2D FFT block, but bear this in mind before you proceed any further with your analysis.
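A quick sketch of this point, using Python's built-in complex type rather than MATLAB and three made-up sample values: the mean of complex numbers equals the mean of the real parts plus i times the mean of the imaginary parts.

```python
values = [1 + 2j, 3 - 4j, 5 + 6j]   # made-up sample data

# Mean taken directly on the complex values.
direct_mean = sum(values) / len(values)

# Mean of the real and imaginary components taken separately.
separate_mean = complex(sum(z.real for z in values) / len(values),
                        sum(z.imag for z in values) / len(values))

print(abs(direct_mean - separate_mean) < 1e-12)
```

In particular, the magnitude of the averaged value is generally not the average of the magnitudes, which matters if the later analysis is based on the magnitude spectrum.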
Sidenote
Divakar, in an earlier answer, created a more efficient implementation of im2col (called im2col_distinct in the benchmark below). This is especially useful if you don't have the Image Processing Toolbox installed. In the timing below, his function is several times faster than MATLAB's im2col.
Benchmarking
As a bonus, here is a benchmark using his code. Timing results were taken using a 40 x 100 matrix where the im2col built-in function was timed, and Divakar's custom function after. Results show that his method is faster. This may be very useful when considering larger size images. However, if you are looking for succinctness, use what I have written. If you want something fast, use his method.
Benchmarking Code
%// Input Parameters
nrows = 20;
ncols = 20;
A = rand(40,100);
disp('------------------------- With IM2COL');
tic
B1 = im2col(A,[nrows ncols],'distinct');
toc,clear B1
disp('----------------- With CUSTOM-BUILT IM2COL');
tic
B2 = im2col_distinct(A,[nrows ncols]);
toc,clear B2
Results
------------------------- With IM2COL
Elapsed time is 0.026914 seconds.
----------------- With CUSTOM-BUILT IM2COL
Elapsed time is 0.004186 seconds.

Is there a fast way to invert a matrix in Matlab?

I have lots of large (around 5000 x 5000) matrices that I need to invert in Matlab. I actually need the inverse, so I can't use mldivide instead, which is a lot faster for solving Ax=b for just one b.
My matrices come from a problem that gives them some nice properties. First off, their determinant is 1, so they're definitely invertible. They aren't diagonalizable, though, or I would try to diagonalize them, invert them, and then put them back. Their entries are all real numbers (actually rational).
I'm using Matlab to get these matrices and for the things I need to do with their inverses, so I would prefer a way to speed Matlab up. But if there is another language I can use that would be faster, then please let me know. I don't know a lot of other languages (a little bit of C and a little bit of Java), so if it's really complicated in some other language, then I might not be able to use it. Please go ahead and suggest it anyway, just in case.

I actually need the inverse, so I can't use mldivide instead,...
That's not true, because you can still use mldivide to get the inverse. Note that A^{-1} = A^{-1} * I. In MATLAB, this is equivalent to
invA = A\speye(size(A));
On my machine, this takes about 10.5 seconds for a 5000x5000 matrix. Note that MATLAB does have an inv function to compute the inverse of a matrix. Although this will take about the same amount of time, it is less efficient in terms of numerical accuracy (more info in the link).
First off, their determinant is 1 so they're definitely invertible
Rather than det(A) = 1, it is the condition number of your matrix that dictates how accurate or stable the inverse will be. Note that det(A) = prod_{i=1}^{n} λ_i. So just setting λ_1 = M, λ_n = 1/M and λ_i = 1 for i ≠ 1, n will give you det(A) = 1. However, as M → ∞, cond(A) = M^2 → ∞ and λ_n → 0, meaning your matrix is approaching singularity and there will be large numerical errors in computing the inverse.
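The following sketch illustrates the argument for the simplest possible case, under the assumption that the matrix is diagonal with M > 1: its eigenvalues are then just the diagonal entries and its 2-norm condition number is the ratio of the largest to the smallest absolute eigenvalue. The function name `det_and_cond` is made up for the example.

```python
from math import prod

def det_and_cond(M, n):
    """Determinant and 2-norm condition number of diag(M, 1, ..., 1, 1/M).

    Assumes M > 1 and n >= 2; for a diagonal matrix the eigenvalues are the
    diagonal entries, so both quantities follow directly from them.
    """
    eigenvalues = [M] + [1.0] * (n - 2) + [1.0 / M]
    determinant = prod(eigenvalues)
    condition = max(eigenvalues) / min(eigenvalues)
    return determinant, condition

# Powers of two keep the floating point arithmetic exact in this example.
det_small, cond_small = det_and_cond(2.0 ** 10, 5)
det_large, cond_large = det_and_cond(2.0 ** 30, 5)
print(det_small, cond_small)
print(det_large, cond_large)
```

The determinant stays at 1 in both cases while the condition number grows as M^2, so the determinant alone says nothing about how well conditioned the inversion is.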
My matrices are coming from a problem that means they have some nice properties.
Of course, there are other more efficient algorithms that can be employed if your matrix is sparse or has other favorable properties. But without any additional info on your specific problem, there is nothing more that can be said.
I would prefer a way to speed Matlab up
MATLAB uses Gaussian elimination to compute the inverse of a general matrix (full rank, non-sparse, without any special properties) using mldivide, and this is Θ(n^3), where n is the size of the matrix. So, in your case, n = 5000 and there are 1.25 x 10^11 floating point operations. So on a reasonable machine with about 10 Gflops of computational power, you're going to require at least 12.5 seconds to compute the inverse and there is no way out of this, unless you exploit the "special properties" (if they're exploitable).
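The estimate above is plain arithmetic, reproduced here as a short sketch. It takes the operation count as exactly n^3 and the machine speed as exactly 10 Gflops, both of which are the rough figures used in the answer rather than measured values.

```python
n = 5000                        # matrix size from the question
flop_count = n ** 3             # operation count taken as n^3
flops_per_second = 10e9         # assumed sustained rate of about 10 Gflops

estimated_seconds = flop_count / flops_per_second
print(flop_count, estimated_seconds)
```

The result is of the same order as the roughly 10.5 seconds quoted earlier for A\speye(size(A)), which is what one would expect from an order-of-magnitude estimate.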

Inverting an arbitrary 5000 x 5000 matrix is not computationally easy no matter what language you are using. I would recommend looking into approximations. If your matrices are low rank, you might want to try a low-rank approximation M = USV'
Here are some more ideas from MathOverflow:
https://mathoverflow.net/search?q=matrix+inversion+approximation

First suppose the eigenvalues are all 1. Let A be the Jordan canonical form of your matrix. Then you can compute A^{-1} using only matrix multiplication and addition by
A^{-1} = I + (I-A) + (I-A)^2 + ... + (I-A)^k
where k < dim(A). Why does this work? Because generating functions are awesome. Recall the expansion
(1-x)^{-1} = 1/(1-x) = 1 + x + x^2 + ...
This means that we can invert (1-x) using an infinite sum. You want to invert a matrix A, so you want to take
A = I - X
Solving for X gives X = I-A. Therefore by substitution, we have
A^{-1} = (I - (I-A))^{-1} = I + (I-A) + (I-A)^2 + ...
Here I've just used the identity matrix I in place of the number 1. Now we have the problem of convergence to deal with, but this isn't actually a problem. By the assumption that A is in Jordan form and has all eigenvalues equal to 1, we know that A is upper triangular with all 1s on the diagonal. Therefore I-A is upper triangular with all 0s on the diagonal. Therefore all eigenvalues of I-A are 0, so its characteristic polynomial is x^dim(A) and its minimal polynomial is x^{k+1} for some k < dim(A). Since a matrix satisfies its minimal (and characteristic) polynomial, this means that (I-A)^{k+1} = 0. Therefore the above series is finite, with the last nonzero term being (I-A)^k. So it converges.
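Here is a small sketch of the finite series in plain Python, for an upper triangular matrix with all 1s on the diagonal. The helper names (`matmul`, `identity`, `unipotent_inverse`) and the 3 x 3 test matrix are made up for the illustration, and the loop simply sums all powers up to dim(A) - 1 rather than stopping at the minimal k.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def unipotent_inverse(A):
    """Invert A = I - N, with N strictly upper triangular, as I + N + N^2 + ...

    Assumes A is upper triangular with 1s on the diagonal, so N^n = 0 and
    the series has at most n terms.
    """
    n = len(A)
    I = identity(n)
    N = [[I[i][j] - A[i][j] for j in range(n)] for i in range(n)]  # N = I - A
    inverse = identity(n)       # running sum, starts at the I term
    power = identity(n)         # current power of N, starts at N^0
    for _ in range(n - 1):
        power = matmul(power, N)
        inverse = [[inverse[i][j] + power[i][j] for j in range(n)]
                   for i in range(n)]
    return inverse

A = [[1, 2, 3],
     [0, 1, 4],
     [0, 0, 1]]
A_inv = unipotent_inverse(A)
print(A_inv)
```

With integer entries the arithmetic is exact, so multiplying A by the result gives the identity matrix with no rounding error.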
Now, for the general case, put your matrix into Jordan form, so that you have a block diagonal matrix, e.g.:
A 0 0
0 B 0
0 0 C
where each block has a single value along the diagonal. If that value is a for A, then use the above trick to invert 1/a * A, and then divide the result by a, since A^{-1} = 1/a * (1/a * A)^{-1}. Since the full matrix is block diagonal, the inverse will be
A^{-1} 0 0
0 B^{-1} 0
0 0 C^{-1}
There is nothing special about having three blocks, so this works no matter how many you have.
Note that this trick works whenever you have a matrix in Jordan form. The computation of the inverse in this case will be very fast in Matlab because it only involves matrix multiplication, and you can even use tricks to speed that up since you only need powers of a single matrix. This may not help you, though, if it's really costly to get the matrix into Jordan form.
