Suppose I have the following two matrices:
x = torch.randint(0, 256, (100000, 48), dtype=torch.uint8)
x_new = torch.randint(0, 256, (1000, 48), dtype=torch.uint8)
I wish to do a matrix-multiplication-like operation where I compare along the 48 dimensions and sum up all the elements that are equal. The following operation takes 7.81 seconds; batching does not seem to help:
matrix = (x_new.unsqueeze(1) == x).sum(dim=-1)
However, doing a simple matrix multiplication (matrix = x_new @ x.T) takes 3.54 seconds. I understand this is most likely calling a deeper library that isn't slowed down by Python. However, the question is: is there a way to speed up the multiplication-like operation, by scripting or any other means?
What is even stranger though is that if I do matrix = x_new.float() @ x.float().T, this operation takes 214 ms. This is more than 10x faster than the uint8 multiplication.
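One way to lean on that fast float matmul, sketched below (not benchmarked; the helper name and chunk size are illustrative): encode every byte as a one-hot vector, so the equality count becomes an ordinary matrix product.

import torch
import torch.nn.functional as F

def equality_counts(x_new, x, chunk=4096):
    # One-hot over the 256 possible byte values; the dot product of two one-hot
    # encodings counts the positions where the bytes are equal.
    q = F.one_hot(x_new.long(), 256).reshape(x_new.shape[0], -1).float()   # (1000, 48*256)
    out = []
    for i in range(0, x.shape[0], chunk):                                  # chunk x to limit memory
        r = F.one_hot(x[i:i + chunk].long(), 256).reshape(-1, x.shape[1] * 256).float()
        out.append(q @ r.T)                                                # the float GEMM does the work
    # Same result as (x_new.unsqueeze(1) == x).sum(dim=-1)
    return torch.cat(out, dim=1).round().to(torch.int64)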
For context, I am trying to quantize vectors so that I can find the closest vector by comparing integers rather than directly doing dot products.
Related
I have a really big (1.5M x 16M) sparse CSR scipy matrix A. What I need to compute is the similarity of each pair of rows. I have defined the similarity as follows:
Assume a and b are two rows of matrix A
a = (0, 1, 0, 4)
b = (1, 0, 2, 3)
Similarity (a, b) = 0*1 + 1*0 + 0*2 + 4*3 = 12
To compute all pairwise row similarities I use this (or Cosine similarity):
AT = np.transpose(A)
pairs = A.dot(AT)
Now pairs[i, j] is the similarity of row i and row j for all such i and j.
This is quite similar to pairwise cosine similarity of rows. So if there is an efficient parallel algorithm that computes pairwise cosine similarity, it would work for me as well.
The problem: this dot product is very slow because it uses just one CPU (I have access to 64 CPUs on my server).
I can also export A and AT to a file and run any other external program that does the multiplication in parallel and get the results back to the Python program.
Is there any more efficient way of doing this dot product, or of computing the pairwise similarity in parallel?
I finally used the 'cosine' distance metric of scikit-learn and its pairwise_distances function, which supports sparse matrices and is highly parallelised.
sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)
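For example (illustrative shapes; the real 1.5M x 1.5M result would itself be huge, so in practice you would work in blocks):

import scipy.sparse as sp
from sklearn.metrics.pairwise import pairwise_distances

A = sp.random(1000, 5000, density=0.001, format='csr')   # small stand-in for the real A
D = pairwise_distances(A, metric='cosine', n_jobs=-1)     # cosine distance, all CPUs
S = 1.0 - D                                               # cosine similarity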
I could also divide A into n horizontal parts and use the parallel python package to run multiple multiplications and horizontally stack the results later.
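A sketch of that splitting idea, using joblib in place of the Parallel Python package (block count and helper names are illustrative; note that each worker gets its own copy of A, which costs memory):

import numpy as np
import scipy.sparse as sp
from joblib import Parallel, delayed

def _block_dot(A, lo, hi):
    return A[lo:hi].dot(A.T)                 # similarities of rows lo..hi-1 against all rows

def pairwise_dot_parallel(A, n_blocks=64):
    A = A.tocsr()
    bounds = np.linspace(0, A.shape[0], n_blocks + 1, dtype=int)
    blocks = Parallel(n_jobs=-1)(
        delayed(_block_dot)(A, lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])
    )
    return sp.vstack(blocks)                 # stack the horizontal strips back together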
I wrote my own implementation using sklearn. It is not parallel, but it is quite fast for large matrices.
from scipy.sparse import spdiags
from sklearn.preprocessing import normalize

def get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix):
    sp_matrix = sp_matrix.tocsr()
    matrix = sp_matrix.dot(sp_matrix.T)
    # zero diagonal
    diag = spdiags(-matrix.diagonal(), [0], *matrix.shape, format='csr')
    matrix = matrix + diag
    return matrix

def get_similarity_by_cosine(sp_matrix):
    sp_matrix = normalize(sp_matrix.tocsr())
    return get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix)
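Example usage (small random matrix; the shapes are illustrative):

import scipy.sparse as sp

A = sp.random(1000, 5000, density=0.001, format='csr')
sim = get_similarity_by_cosine(A)    # sparse pairwise cosine similarities, zeroed diagonal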
I used the last two dimensions of this 3D matrix as a 2D matrix. So I just want to multiply the 2D matrix given by Matrix1(i,:,:) with the column vector Matrix2(i,:).'.
The only way I could do that was by using an auxiliary matrix that picks up all the numbers from those two dimensions of the 3D matrix:
matrixAux(:,:) = Matrix1(1,:,:)
and then I did the multiplication:
matrixAux * (Matrix2(i,:).')
and it worked. However, this is slow because I need to copy the whole 3D matrix into a lot of auxiliary matrices, and I need to speed up my code because I'm doing the same operation many times.
How can I do that more efficiently, without having to copy the matrix?
Approach I: bsxfun Multiplication
One approach would be to use bsxfun with @times, whose output you can use directly instead of calculating the matrix multiplication results in a loop -
sum(bsxfun(@times,Matrix1,permute(Matrix2,[1 3 2])),3).'
Example
As an example, let's suppose Matrix1 and Matrix2 are defined like this -
nrows = 3;
p = 6;
ncols = 2;
Matrix1 = rand(nrows,ncols,p)
Matrix2 = rand(nrows,p)
Then, you have your loop like this -
for i = 1:size(Matrix1,1)
matrixAux(:,:) = Matrix1(i,:,:);
matrix_mult1 = matrixAux * (Matrix2(i,:).')
end
So, instead of the loops, you can directly calculate the matrix multiplication results -
matrix_mult2 = sum(bsxfun(@times,Matrix1,permute(Matrix2,[1 3 2])),3).'
Thus, each column of matrix_mult2 represents matrix_mult1 at the corresponding iteration of the loop, as the output of the code makes clear -
matrix_mult1 =
0.7693
0.8690
matrix_mult1 =
1.0649
1.2574
matrix_mult1 =
1.2949
0.6222
matrix_mult2 =
0.7693 1.0649 1.2949
0.8690 1.2574 0.6222
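(As an aside, not part of the original MATLAB answer: the same batched product can be written in NumPy with einsum, for anyone porting this.)

import numpy as np

nrows, ncols, p = 3, 2, 6
Matrix1 = np.random.rand(nrows, ncols, p)
Matrix2 = np.random.rand(nrows, p)
# result[:, i] equals Matrix1[i] @ Matrix2[i], matching matrix_mult2 above (ncols x nrows)
matrix_mult2 = np.einsum('ijk,ik->ji', Matrix1, Matrix2)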
Approach II: "Full" Matrix Multiplication
Now, this must be exciting! Well, you can also leverage MATLAB's fast matrix multiplication to get the intermediate matrix multiplication results, again without loops. If Matrix1 is nrows x ncols x p, you can reshape it to (nrows*ncols) x p and then multiply it by Matrix2'. Then, to get an equivalent of matrix_mult2, you need to select the right indices from the multiplication result. This is precisely what is done here -
%// Get size of Matrix1 to be used regularly inside the codes later on
[m1,n1,p1] = size(Matrix1);
%// Convert 3D Matrix1 to 2D and thus perform "full" matrix multiplication
fmult = reshape(Matrix1,m1*n1,p1)*Matrix2';
%// Get valid indices
ind = bsxfun(@plus,[1:m1:size(fmult,1)]',[0:nrows-1]*(size(fmult,1)+1));
%// Get values from the full matrix multiplication result
matrix_mult3 = fmult(ind);
Here, matrix_mult3 must be the same as matrix_mult2.
Observations: Since we are not using all the values calculated from the full matrix multiplication, but rather indexing into it and selecting only some of its elements, this approach performs better than the others only under certain circumstances. It seems to be the best one when nrows is small, as in that case we use a larger fraction of the elements from the full matrix multiplication output.
Benchmark Results
Two cases were tested against the three approaches and the test results seem to support our hypotheses discussed earlier.
Case 1
Matrix 1 as 400 x 400 x 400, the runtimes are -
--------------- With Loops
Elapsed time is 2.253536 seconds.
--------------- With BSXFUN
Elapsed time is 0.910104 seconds.
--------------- With Full Matrix Multiplication
Elapsed time is 4.361342 seconds.
Case 2
Matrix 1 as 40 x 2000 x 2000, the runtimes are -
--------------- With Loops
Elapsed time is 5.402487 seconds.
--------------- With BSXFUN
Elapsed time is 2.585860 seconds.
--------------- With Full Matrix Multiplication
Elapsed time is 1.516682 seconds.
So here is what I'm trying to do in MATLAB:
I have an array of n 2D images. I need to go through pixel by pixel and find which picture has the brightest pixel at each point, then store the index of that image in another array at that point.
As in, if I have three pictures (n=1,2,3) and picture 2 has the brightest pixel at [1,1], then the value of max_pixels[1,1] would be 2, the index of the picture with that brightest pixel.
I know how to do this with for loops,
%not my actual code:
max_pixels = zeros(x_max, y_max);
for i = 1:x_max
    for j = 1:y_max
        [~, max_pixels(i, j)] = max(pic_arr(i, j, :));
    end
end
But my question is, can it be done faster with some of the special functionality in MATLAB? I have heard that MATLAB isn't too friendly when it comes to nested loops, and the functionality of : should be used wherever possible. Is there any way to get this more efficient?
-PK
You can use max(...) with a dimension specified to get the maximum along the 3rd dimension.
[max_picture, indexOfMax] = max(pic_arr,[],3)
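For reference, the NumPy analog (an aside; pic_arr is assumed to be an (x_max, y_max, n) array, and the indices come out 0-based rather than 1-based):

import numpy as np

pic_arr = np.random.rand(480, 640, 3)        # illustrative stack of 3 images
max_picture = pic_arr.max(axis=2)            # brightest value at each pixel
indexOfMax = pic_arr.argmax(axis=2)          # 0-based index of the image that is brightest there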
You can also get the matrix of maximum values this way, trading memory for processor work:
a = [1 2 3];
b = [3 4 2];
c = [0 4 1];
[max_matrix, index_max] = arrayfun(@(x,y,z) max([x y z]), a,b,c);
a, b and c can also be matrices.
It returns the matrix of max values and the matrix of indices (i.e., which input each max value came from).
After multiplying a lot of rotation matrices, the end result might not be a valid rotation matrix any more, due to rounding issues (de-orthogonalized)
One way to re-orthogonalize is to follow these steps:
Convert the rotation matrix to an axis-angle representation (link)
Convert back the axis-angle to a rotation matrix (link)
Is there something in Eigen library that does the same thing by hiding all the details? Or is there any better recipe?
This procedure has to be handled with care due to special singularity cases, so if Eigen provides a better tool for this it would be great.
I don't use Eigen and didn't bother to look up the API, but here is a simple, computationally cheap and stable procedure to re-orthogonalize the rotation matrix. This orthogonalization procedure is taken from Direction Cosine Matrix IMU: Theory by William Premerlani and Paul Bizard; equations 19-21.
Let x, y and z be the row vectors of the (slightly messed-up) rotation matrix. Let error=dot(x,y) where dot() is the dot product. If the matrix was orthogonal, the dot product of x and y, that is, the error would be zero.
The error is spread across x and y equally: x_ort=x-(error/2)*y and y_ort=y-(error/2)*x. The third row z_ort=cross(x_ort, y_ort), which is, by definition orthogonal to x_ort and y_ort.
Now, you still need to normalize x_ort, y_ort and z_ort as these vectors are supposed to be unit vectors.
x_new = 0.5*(3-dot(x_ort,x_ort))*x_ort
y_new = 0.5*(3-dot(y_ort,y_ort))*y_ort
z_new = 0.5*(3-dot(z_ort,z_ort))*z_ort
That's all, we are done.
It should be pretty easy to implement this with the API provided by Eigen. You can easily come up with other orthogonalization procedures, but I don't think it will make a noticeable difference in practice. I used the above procedure in my motion tracking application and it worked beautifully; it's both stable and fast.
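A NumPy transcription of those steps (a sketch, not the Eigen code the question asks about; rot is a 3x3 rotation matrix with small errors):

import numpy as np

def reorthogonalize(rot):
    x, y = rot[0], rot[1]
    error = np.dot(x, y)                 # zero for a perfectly orthogonal matrix
    x_ort = x - (error / 2) * y          # spread the error equally over x and y
    y_ort = y - (error / 2) * x
    z_ort = np.cross(x_ort, y_ort)       # orthogonal to x_ort and y_ort by construction
    # Taylor-series renormalization, as in equations 19-21 of the reference.
    return np.array([0.5 * (3.0 - np.dot(v, v)) * v for v in (x_ort, y_ort, z_ort)])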
You can use a QR decomposition to systematically re-orthogonalize, replacing the original matrix with the Q factor. With library routines you have to check that the diagonal entries of R are positive (close to 1 if the original matrix was close to orthogonal) and, if necessary, correct by negating the corresponding column of Q.
The closest rotation matrix Q to a given matrix is obtained from the polar or QP decomposition, where P is a positive semi-definite symmetric matrix. The QP decomposition can be computed iteratively or via an SVD. If the latter has the factorization USV', then Q=UV'.
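A NumPy sketch of that QR recipe with the sign fix-up (an aside; Eigen's QR classes would play the same role):

import numpy as np

def qr_orthogonalize(M):
    Q, R = np.linalg.qr(M)
    signs = np.sign(np.diag(R))          # diagonal of R should be positive (close to 1)
    signs[signs == 0] = 1.0
    return Q * signs                     # negate the columns of Q where it is not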
Singular Value Decomposition should be very robust. To quote from the reference:
Let M = UΣV^T be the singular value decomposition of M; then R = UV^T.
For your matrix, the singular values in Σ should be very close to one. The matrix R is guaranteed to be orthogonal, which is the defining property of a rotation matrix. If there weren't any rounding errors in calculating your original rotation matrix, then R will be exactly the same as your M to within numerical precision.
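In NumPy terms, that recipe looks like this (a sketch; the determinant check is an extra safeguard to stay in SO(3), not part of the quoted reference):

import numpy as np

def nearest_rotation(M):
    U, s, Vt = np.linalg.svd(M)          # M = U diag(s) Vt; s should be close to (1, 1, 1)
    R = U @ Vt
    if np.linalg.det(R) < 0:             # reflection case: flip the last column of U
        U[:, -1] *= -1
        R = U @ Vt
    return R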
In the meantime:
#include <Eigen/Geometry>
Eigen::Matrix3d mmm;
Eigen::Matrix3d rrr;
rrr << 0.882966, -0.321461, 0.342102,
0.431433, 0.842929, -0.321461,
-0.185031, 0.431433, 0.882966;
// replace this with any rotation matrix
mmm = rrr;
Eigen::AngleAxisd aa(rrr); // RotationMatrix to AxisAngle
rrr = aa.toRotationMatrix(); // AxisAngle to RotationMatrix
std::cout << mmm << std::endl << std::endl;
std::cout << rrr << std::endl << std::endl;
std::cout << rrr-mmm << std::endl << std::endl;
Which is nice news, because I can get rid of my custom method and have one headache fewer (how can one be sure that all the singularities are taken care of?), but I really want your opinion on better/alternative ways :)
An alternative is to use Eigen::Quaternion to represent your rotation. This is much easier to normalize, and rotation*rotation products are generally faster. If you have a lot of rotation*vector products (with the same matrix), you should locally convert the quaternion to a 3x3 matrix.
M is the matrix we want to orthonormalize, and R is the rotation matrix closest to M.
Analytic Solution
Matrix R = M*inverse(sqrt(transpose(M)*M));
Iterative Solution
// To re-orthogonalize matrix M, repeat:
M = 0.5f*(inverse(transpose(M)) + M);
// until M converges
M converges to R, the nearest rotation matrix. The number of digits of accuracy will approximately double with each iteration.
Check whether the sum of the squares of the elements of (M - M^-T)/2 is less than the square of your error goal to know when (M + M^-T)/2 meets your accuracy threshold. M^-T is the inverse transpose of M.
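A NumPy sketch of that iteration and stopping test (tolerance and iteration cap are illustrative):

import numpy as np

def iterative_orthogonalize(M, tol=1e-12, max_iter=20):
    M = np.array(M, dtype=float)
    for _ in range(max_iter):
        corr = 0.5 * (M - np.linalg.inv(M).T)      # (M - M^-T)/2
        if np.sum(corr ** 2) < tol ** 2:           # sum of squares below the squared error goal
            break
        M = M - corr                               # equivalent to M = 0.5*(inverse(transpose(M)) + M)
    return M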
Why It Works
We want to find the rotation matrix R which is closest to M. We will define the error as the sum of squared differences of the elements. That is, minimize trace((M - R)^T (M - R)).
The analytic solution is R = M (M^T M)^-(1/2), outlined here.
The problem is that this requires finding the square root of M^T M. However, notice that there are many matrices whose closest rotation matrix is R. One of these is M (M^T M)^-1, which simplifies to M^-T, the inverse transpose. The nice thing is that M and M^-T are on opposite sides of R (intuitively, like a and 1/a are on opposite sides of 1).
We recognize that the average, (M + M^-T)/2 will be closer to R than M, and because it is a linear combination, will also maintain R as the closest rotation matrix. Doing this recursively, we converge to R.
Worst Case Convergence (Speculative)
Because it is related to the Babylonian square root algorithm, it converges quadratically.
The exact worst case error after one iteration of a matrix M and error e is nextE:
nextE = e^2/(2 e + 2)
e = sqrt(trace((M - R)^T (M - R)))
R = M (M^T M)^-(1/2)
I want to know the frequency of my data. I had some idea that it can be done using an FFT, but I am not sure how. Once I pass the entire data set to the FFT, it gives me 2 peaks, but how can I get the frequency?
Thanks a lot in advance.
Here's what you're probably looking for:
When you talk about computing the frequency of a signal, you probably aren't so interested in the component sine waves. This is what the FFT gives you. For example, if you sum sin(2*pi*10x)+sin(2*pi*15x)+sin(2*pi*20x)+sin(2*pi*25x), you probably want to detect the "frequency" as 5 (take a look at the graph of this function). However, the FFT of this signal will report a magnitude of 0 at the frequency 5.
What you are probably more interested in is the periodicity of the signal. That is, the interval at which the signal becomes most like itself. So most likely what you want is the autocorrelation. Look it up. This will essentially give you a measure of how self-similar the signal is to itself after being shifted over by a certain amount. So if you find a peak in the autocorrelation, that would indicate that the signal matches up well with itself when shifted over that amount. There's a lot of cool math behind it, look it up if you are interested, but if you just want it to work, just do this:
Window the signal using a smooth window (a cosine will do; the window should be at least twice as large as the largest period you want to detect, and 3 times as large will give better results). (See http://zone.ni.com/devzone/cda/tut/p/id/4844 if you are confused.)
Take the FFT (however, make sure the FFT size is twice as big as the window, with the second half being padded with zeroes. If the FFT size is only the size of the window, you will effectively be taking the circular autocorrelation, which is not what you want. see https://en.wikipedia.org/wiki/Discrete_Fourier_transform#Circular_convolution_theorem_and_cross-correlation_theorem )
Replace all coefficients of the FFT with their square value (real^2+imag^2). This is effectively taking the autocorrelation.
Take the iFFT
Find the largest peak in the iFFT. This is the strongest periodicity of the waveform. You can actually be a little more clever in which peak you pick, but for most purposes this should be enough. To find the frequency, you just take f=1/T.
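Putting those steps together, a rough NumPy sketch (the window choice, zero-padding and the crude peak-pick are illustrative, not tuned):

import numpy as np

def dominant_period(signal, fs):
    """Estimate the strongest periodicity of `signal` (sampled at `fs` Hz)."""
    n = len(signal)
    windowed = signal * np.hanning(n)              # smooth (cosine-shaped) window
    spectrum = np.fft.rfft(windowed, 2 * n)        # zero-pad to 2n to avoid circular wrap-around
    power = spectrum.real ** 2 + spectrum.imag ** 2
    acf = np.fft.irfft(power)[:n]                  # autocorrelation via the Wiener-Khinchin theorem
    # Crude peak-pick: strongest local maximum after lag 0.
    peaks = np.where((acf[1:-1] > acf[:-2]) & (acf[1:-1] > acf[2:]))[0] + 1
    if len(peaks) == 0:
        return None
    lag = peaks[np.argmax(acf[peaks])]
    return lag / fs                                # period in seconds; frequency is 1/period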
Suppose x[n] = cos(2*pi*f0*n/fs) where f0 is the frequency of your sinusoid in Hertz, n=0:N-1, and fs is the sampling rate of x in samples per second.
Let X = fft(x). Both x and X have length N. Suppose X has two peaks at n0 and N-n0.
Then the sinusoid frequency is f0 = fs*n0/N Hertz.
Example: fs = 8000 samples per second, N = 16000 samples. Therefore, x lasts two seconds.
Suppose X = fft(x) has peaks at 2000 and 14000 (=16000-2000). Therefore, f0 = 8000*2000/16000 = 1000 Hz.
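A quick NumPy check of that example (an aside; the numbers match the ones above):

import numpy as np

fs, f0, N = 8000, 1000, 16000                      # sample rate, sinusoid frequency, length
x = np.cos(2 * np.pi * f0 * np.arange(N) / fs)
X = np.fft.fft(x)
n0 = np.argmax(np.abs(X[:N // 2]))                 # peak bin in the bottom half
print(n0, fs * n0 / N)                             # prints: 2000 1000.0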
If you have a signal with one frequency, for instance
y = sin(2 pi f t)
with y the time signal, f the central frequency, and t time, then you'll get two peaks: one at a frequency corresponding to f, and one at a frequency corresponding to -f.
So, to get to a frequency, you can discard the negative-frequency part. It is located after the positive-frequency part. Furthermore, the first element in the array is the DC offset, i.e. frequency 0. (Beware that this component is usually much larger than 0, so the other frequency components might get dwarfed by it.)
In code (I've written it in Python, but it should be equally simple in C#):
import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(100)   # create 100 random numbers of which we want the Fourier transform
x = x - np.mean(x)        # make sure the average is zero, so we don't get a huge DC offset
dt = 0.1                  # [s] 1/the sampling rate
fftx = np.fft.fft(x)      # the frequency-transformed part
# now discard anything that we do not need: keep only the positive-frequency half
fftx = fftx[:len(fftx) // 2]
# now create the frequency axis: it runs from 0 up to the sampling rate / 2
freq_fftx = np.linspace(0, 1 / (2 * dt), len(fftx), endpoint=False)
# and plot a power spectrum
plt.plot(freq_fftx, np.abs(fftx) ** 2)
plt.show()
Now the frequency is located at the largest peak.
If you are looking at the magnitude results from an FFT of the kind most commonly used, then a strong sinusoidal frequency component of real data will show up in two places: once in the bottom half, plus its complex-conjugate mirror image in the top half. Those two peaks both represent the same spectral peak and the same frequency (for strictly real data). If the FFT result bin numbers start at 0 (zero), then the frequency of the sinusoidal component represented by the bin in the bottom half of the FFT result is most likely:
Frequency_of_Peak = Data_Sample_Rate * Bin_number_of_Peak / Length_of_FFT
Make sure to work out your proper units within the above equation (to get units of cycles per second, per fortnight, per kiloparsec, etc.)
Note that unless the wavelength of the data is an exact integer submultiple of the FFT length, the actual peak will be between bins, thus distributing energy among multiple nearby FFT result bins. So you may have to interpolate to better estimate the frequency peak. Common interpolation methods to find a more precise frequency estimate are 3-point parabolic and Sinc convolution (which is nearly the same as using a zero-padded longer FFT).
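For the 3-point parabolic option, a small sketch (mag is an FFT magnitude array, k the index of the peak bin; names are illustrative):

def parabolic_offset(mag, k):
    # Fit a parabola through the magnitudes at bins k-1, k, k+1 and return the
    # fractional offset of its vertex, in the range (-0.5, 0.5).
    a, b, c = mag[k - 1], mag[k], mag[k + 1]
    return 0.5 * (a - c) / (a - 2 * b + c)

# refined estimate: Frequency_of_Peak = (k + parabolic_offset(mag, k)) * Data_Sample_Rate / Length_of_FFT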
Assuming you use a discrete Fourier transform to look at frequencies, then you have to be careful about how to interpret the normalized frequencies back into physical ones (i.e. Hz).
According to the FFTW tutorial on how to calculate the power spectrum of a signal:
#include <rfftw.h>
...
{
     fftw_real in[N], out[N], power_spectrum[N/2+1];
     rfftw_plan p;
     int k;
     ...
     p = rfftw_create_plan(N, FFTW_REAL_TO_COMPLEX, FFTW_ESTIMATE);
     ...
     rfftw_one(p, in, out);
     power_spectrum[0] = out[0]*out[0];  /* DC component */
     for (k = 1; k < (N+1)/2; ++k)  /* (k < N/2 rounded up) */
          power_spectrum[k] = out[k]*out[k] + out[N-k]*out[N-k];
     if (N % 2 == 0) /* N is even */
          power_spectrum[N/2] = out[N/2]*out[N/2];  /* Nyquist freq. */
     ...
     rfftw_destroy_plan(p);
}
Note that it handles data lengths that are not even. Note particularly that if the data length is even, FFTW will give you a "bin" corresponding to the Nyquist frequency (sample rate divided by 2). Otherwise, you don't get it (i.e. the last bin is just below Nyquist).
A MATLAB example is similar, but they are choosing the length of 1000 (an even number) for the example:
N = length(x);
xdft = fft(x);
xdft = xdft(1:N/2+1);
psdx = (1/(Fs*N)).*abs(xdft).^2;
psdx(2:end-1) = 2*psdx(2:end-1);
freq = 0:Fs/length(x):Fs/2;
In general, it can depend on the implementation of the DFT. You should create a test signal, a pure sine wave at a known frequency, and then make sure the calculation gives back the same number.
Frequency = speed/wavelength.
Wavelength is the distance between the two peaks.