Parallelizing a sequence of matrix multiplications for speed-up - performance

In my function, there are a lot of element-wise matrix multiplications that are independent of each other. Is there a way to calculate them in parallel?
All of them are very simple operations, but 70% of my run time is spent in these parts of the code because this function is invoked millions of times.
function [r1,r2,r3]=backward(A,B,C,D,E,F,r1,r2,r3)
r1=A.*B;
r2=C.*D;
r3=E*F;
end
for i=1:300
[r1,r2,r3]=backward(A,B,C,D,E,F,r1,r2,r3)
end

EDIT: After writing the answer, I observed that you are not multiplying all the input matrices by means of matrix multiplication. Some of them are elementwise multiplications. If this is what you intended, the following answer won't apply.
You are looking for an optimal algorithm for computing the product of multiple matrices. This problem was studied long ago, and there is a dynamic programming algorithm that decides the optimal order of multiplication.
For example, if A is of size 10000 x 1, B is of size 1 x 10000 and C is of size 10000 x 1, it would be a lot more efficient if we computed A*B*C as A*(B*C), instead of (A*B)*C. The proof of correctness of this technique lies in the fact that matrix multiplication is associative. You can read more about this on Wikipedia.
If you want a good quality MATLAB implementation of this, you can find it here. It takes the matrices as input and returns the product. It seems that this implementation does a decent job of finding the optimal way of computing the product of up to 10 matrices.
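To make the ordering argument concrete, here is a minimal Python sketch of the classic matrix-chain dynamic program. It is not the MATLAB implementation mentioned above; the function names `matrix_chain_order` and `parenthesize` are made up for this illustration, and the cost model simply counts scalar multiplications.

```python
def matrix_chain_order(dims):
    """Matrix i has shape dims[i] x dims[i + 1].

    Returns (cost, split): cost[i][j] is the minimal number of scalar
    multiplications needed for the product of matrices i..j, and
    split[i][j] is the position of the outermost split achieving it.
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):           # try every outermost split
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k
    return cost, split


def parenthesize(split, i, j, names):
    """Render the optimal order for matrices i..j as a string."""
    if i == j:
        return names[i]
    k = split[i][j]
    left = parenthesize(split, i, k, names)
    right = parenthesize(split, k + 1, j, names)
    return "(" + left + "*" + right + ")"


# The example from the answer: A is 10000 x 1, B is 1 x 10000, C is 10000 x 1.
dims = [10000, 1, 10000, 1]
cost, split = matrix_chain_order(dims)
best_cost = cost[0][2]                                   # 20000
best_order = parenthesize(split, 0, 2, ["A", "B", "C"])  # "(A*(B*C))"
```

For these sizes A*(B*C) needs 20,000 scalar multiplications, while (A*B)*C needs 200,000,000.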

First thing to note: the last 3 variables that you provide as input are not being used. I don't think this will matter much, but it would be better to clean it up.
Now the actual answer:
MATLAB is all about matrix operations, and this has been highly optimized. Even using C++ you will not expect a significant speedup (and be wary of a slowdown). As such, with the information that is provided in the question, the conclusion would be that you cannot do anything to speed up independent matrix calculations.
That being said: If you could reduce the number of sequential function calls, there may be something to gain.
It is hard to say how to do this in general, but here are two ideas:
If you call the function in a for loop, use a parfor loop instead (assuming you have the Parallel Computing Toolbox; otherwise manually break up the loop and open 4 MATLAB instances to parallelize it, which can be automated if needed).
See whether you really need this many function calls for small matrix operations. If you could improve your algorithm, that could offer a huge improvement, but otherwise you may still be able to combine multiple matrices (multiple versions of A with multiple versions of B, for instance) and do 1 big multiplication rather than 100 tiny ones.

Related

Using GCD with a large set of numbers

I am using PARI/GP which is a mathematics program with some helpful functionality for number theory, especially because it supports very large integers out of the box. For a previous C++ project I had to use a library called BigInt.
At the moment, using PARI/GP I am utilising the gcd() function to calculate the greatest common divisor (GCD) for numbers ranging from 0 to 255 digits in length, so as you can imagine the numbers do get very large! I set a=0 then my loop iterates upwards, each time calculating gcd(a,b) where the b is a long fixed number that never changes.
I was wondering if perhaps I should use Euclid's approach to calculating the GCD, which I believe is the following simple formula: gcd(b, a % b), where the % symbol means modulo. Hopefully I got the variables in the correct order!
Is there a rough and quick way to approximate which approach shown above for calculating GCD is quickest? I would, of course, be open minded to other approaches which are quicker.
I do not expect my algorithm to ever finish, this is just an experiment to see how far it can reach based on which approach I use to calculating GCD.
Binary GCD should generally be better than naive Euclid, but a being very small compared to b is a special circumstance that may trigger poor performance from Binary GCD. I’d try one round of Euclid, i.e., gcd(b, a%b) where gcd is Binary GCD.
(But without knowing the underlying problem here, I’m not sure that this is the best advice.)
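As a rough illustration of "one round of Euclid, then binary GCD", here is a Python sketch. The names `binary_gcd` and `gcd_with_euclid_step` are hypothetical, and the sketch assumes non-negative integers. It reduces the larger operand modulo the smaller one, since that is the direction in which a Euclid step actually shrinks anything when one input is much smaller than the other.

```python
def binary_gcd(a, b):
    """Binary (Stein) GCD for non-negative integers: shifts and subtractions only."""
    if a == 0:
        return b
    if b == 0:
        return a
    shift = 0
    while (a | b) & 1 == 0:      # factor out common powers of 2
        a >>= 1
        b >>= 1
        shift += 1
    while a & 1 == 0:            # make a odd
        a >>= 1
    while b:
        while b & 1 == 0:        # make b odd
            b >>= 1
        if a > b:
            a, b = b, a
        b -= a                   # both odd, so the difference is even
    return a << shift


def gcd_with_euclid_step(a, b):
    """One Euclidean reduction first, then binary GCD on the reduced pair."""
    small, large = sorted((a, b))
    if small == 0:
        return large
    return binary_gcd(small, large % small)
```

After the single reduction both operands are no larger than the smaller input, so the subtraction loop no longer has to grind down the size gap one step at a time.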
The best approach is to let PARI do the work for you.
First, you can compute the gcd of a large number of inputs stored in a vector v as gcd(v).
? B=10^255; v = vector(10^6,i,random(B));
? gcd(v);
time = 22 ms.
? a = 0; for(i = 1, #v, a = gcd(a,v[i]))
time = 232 ms. \\ much worse
There are 2 reasons for this to be much faster on such small inputs: loop overhead and variable assignments on the one hand and early abort on the other hand (as soon as the intermediate answer is 1, we can stop). You can multiply v by 2, say, to prevent the second optimization; the simple gcd(v) will remain faster [because loop and assignments overhead still occurs, but in C rather than in interpreted GP; for small inputs this overhead is very noticeable, it will become negligible as the sizes increase]
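The early-abort idea can be sketched outside GP as well. The following Python snippet is only an illustration, with `math.gcd` standing in for PARI's `gcd` and a hypothetical helper name `gcd_with_early_abort`; it also reports how many inputs were consumed before the running gcd reached 1.

```python
import math


def gcd_with_early_abort(values):
    """Fold gcd over the inputs, stopping as soon as the result is 1.

    Returns (g, used): the gcd and the number of inputs actually read.
    """
    g = 0
    used = 0
    for v in values:
        g = math.gcd(g, v)
        used += 1
        if g == 1:       # no later input can change the answer
            break
    return g, used


result = gcd_with_early_abort([6, 10, 15, 7, 9])   # (1, 3)
```

With random inputs the running gcd usually drops to 1 after a handful of elements, which is why the early abort matters so much in the timing above.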
Similarly, it should always be faster on average to let the gcd function work out by itself how best to compute gcd(a,b) than to try and "improve" things by using tricks such as gcd(b, a % b) [Note: the order doesn't matter, and this will error out if b = 0, which gcd is clever enough to check]. gcd(a, b-a) will not error out but will slow things down on average. For instance, gcd(a,b) will try an initial Euclidean step in case a and b have vastly differing sizes, so it shouldn't help to try and add it yourself.
Finally, the exact algorithms used depend on the underlying multiprecision library: either native PARI or GNU's GMP, the latter being faster due to a highly optimized implementation. In both cases, as operand sizes increase, this includes Euclid's algorithm, binary plus/minus [dividing out powers of 2, we can assume a, b odd, then use gcd(b,(a-b)/4) if a = b mod 4 and gcd(b, (a+b)/4) otherwise; divisions are just binary shifts], and asymptotically fast half-gcd (almost linear in the bit size). The latter is almost surely not being used in your computations since the threshold should be over 10,000 decimal digits. On the other hand, Euclid's algorithm will only be used for tiny (word-size) operands, but since all algorithms are recursive it will eventually be used, once the size has become tiny enough.
If you want to investigate the speed of the gcd function, try it with integers of around 100,000 decimal digits (then double that size, say); you should observe the almost linear complexity.

How are sparse Ax = b systems solved in practice?

Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i,j,a), with indices i,j (between 0 and n-1) and a being a value in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions for finite fields, and in exact and bounded-error solutions for reals or complex numbers using floating-point representation. I suppose exact or bounded-solutions for rational numbers are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The two main algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured Gaussian elimination.
The LaMacchia-Odlyzko paper (the one for the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (linked list) to make the matrix-vector multiply time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Parallelisation of these algorithms is trivial, but optimisation will depend upon the architecture of your system. The parallelisation of the matrix-vector multiply is done by splitting the matrix into blocks of rows (each processor gets one block); each block of rows is multiplied by the vector separately. Then you combine the results to get the new vector.
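The row-block scheme described above can be sketched in a few lines of Python. This is a structural illustration only, under several assumptions: the matrix is given as (i, j, a) tuples as in the question, arithmetic is modulo a prime p, per-row Python lists stand in for the linked lists mentioned above, and the names `build_rows`, `multiply_block` and `sparse_matvec` are invented for the example. A thread pool is used only to show the split/combine structure; in CPython it will not give a real speed-up for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def build_rows(n, triples):
    """Group the (i, j, a) triples by row so each row lists only its non-zeros."""
    rows = [[] for _ in range(n)]
    for i, j, a in triples:
        rows[i].append((j, a))
    return rows


def multiply_block(block_rows, x, p):
    """Multiply one block of rows by x; cost is proportional to its non-zeros."""
    return [sum(a * x[j] for j, a in row) % p for row in block_rows]


def sparse_matvec(rows, x, p, workers=2):
    """Split the rows into contiguous blocks, multiply each block, then concatenate."""
    n = len(rows)
    size = max(1, -(-n // workers))          # ceiling division, at least 1
    blocks = [rows[s:s + size] for s in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(multiply_block, blocks,
                         [x] * len(blocks), [p] * len(blocks))
        return [y for part in parts for y in part]


triples = [(0, 0, 2), (0, 2, 1), (1, 1, 3), (2, 0, 4), (2, 2, 5)]
rows = build_rows(3, triples)
y = sparse_matvec(rows, [1, 2, 3], p=7)      # [5, 6, 5]
```

Each block needs the whole input vector but produces a disjoint slice of the output, so the only communication is broadcasting x and gathering the slices.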
I've done these types of computations extensively. The original authors that broke the RSA-129 factorisation took 6 weeks using structured gaussian elimination on a 16,384 processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!

Fast algorithm for computing 1/d? (SRT, Goldschmidt, Newton-Raphson, ...)

I want to find a fast algorithm for computing 1/d, where d is a double (although it can be converted to an integer). Which is the best of the many algorithms (SRT, Goldschmidt, Newton-Raphson, ...)? I'm writing my program in C.
Thanks in advance.
The fastest program is: double result = 1 / d;
CPUs already use an iterative root-finding algorithm like the ones you describe to find the reciprocal 1/d, so you should find it difficult to beat it using a software implementation of the same algorithm.
If you have few/known denominators then try a lookup table. This is the usual approach for even slower functions such as trig functions.
Otherwise: just compute 1/d. It will be the fastest you can do. And there is an endless list of things you can do to speed up arithmetic if you have to:
Use 32-bit (single) instead of 64-bit (double) precision. FP division takes a number of cycles roughly proportional to the number of bits.
Vectorize the operations. For example, I believe you can compute four 32-bit float divisions in parallel with SSE2, or even more in parallel by doing it on the GPU.
I asked someone about this and got the following answer:
So, you can't add a hardware divider to the FPGA then? Or fast reciprocal support?
Anyway it depends. Does it have fast multiplication? If not, well, that's a problem, you could only implement the slow methods then.
If you have fast multiplication and IEEE floats, you can use the weird trick I linked to in my previous post with a couple of refinement steps. That's really just Newton–Raphson division with a simpler calculation for the initial approximation (but afaik it still only takes 3 refinements for single-precision floats, just like the regular initial approximation). Fast reciprocal support works that way too: it gives a fast initial approximation (handling the exponent right and getting the significant bits from a lookup table; if you get 12 significant bits that way you only need one refinement step for single-precision, and 13 bits are enough to need only 2 steps for double-precision) and optionally has instructions that help implement the refinement step (like AMD's PFRCPIT1 and PFRCPIT2), for example to calculate Y = (1 - D*X) and to calculate X + X * Y.
Even without those tricks Newton–Raphson division is still not bad, with the linear approximation it takes only 4 refinements for double-precision floats, but it also takes some annoying exponent adjustments to get in the right range first (in hardware that wouldn't be half as annoying).
Goldschmidt division is, afaik, roughly equivalent in performance and might have a slightly less complex implementation. It's really the same sort of deal - trickery with the exponent to get in the right range, the "2 - something" estimation trick (which is rearranged in Newton-Raphson division, but it's really the same thing), and doing the refinement step until all the bits are right. It just looks a little different.
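To show what the refinement step looks like in software, here is a minimal Python sketch of Newton–Raphson reciprocal with the linear initial approximation. It is an illustration rather than a tuned implementation: `reciprocal_newton` is a made-up name, `math.frexp`/`math.ldexp` stand in for the exponent adjustments mentioned above, the constants 48/17 and 32/17 are the usual choice for a mantissa in [0.5, 1), and special values (infinities, NaN, overflow for very small d) are ignored.

```python
import math


def reciprocal_newton(d, steps=4):
    """Approximate 1/d using only multiplication, addition and subtraction."""
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(d)            # d = m * 2**e with 0.5 <= |m| < 1
    sign = -1.0 if m < 0 else 1.0
    m = abs(m)
    # Linear initial approximation of 1/m on [0.5, 1).
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(steps):
        y = 1.0 - m * x             # Y = (1 - D*X)
        x = x + x * y               # X + X*Y: the error is roughly squared
    return sign * math.ldexp(x, -e) # undo the exponent scaling


approx = reciprocal_newton(3.0)
```

The initial guess has a relative error of at most 1/17, and each step roughly squares it, which is consistent with 4 refinements being enough for double precision.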

parallelize bisection

Consider the bisection algorithm for finding a square root. Every step depends on the previous one, so in my opinion it's not possible to parallelize it. Am I wrong?
Consider also similar algorithm like binary search.
edit
My problem is not bisection itself, but it is very similar. I have a monotonic function f(mu) and I need to find the mu where f(mu)<alpha. One core needs 2 minutes to compute f(mu), and I need very high precision. We have a farm of ~100 cores. My first attempt was to use only 1 core and scan the values of f with a dynamic step, depending on how close I am to alpha. Now I want to use the whole farm, but my only idea is to compute 100 values of f at equally spaced points.
It depends on what you mean by parallelize, and at what granularity. For example you could use instruction level parallelism (e.g. SIMD) to find square roots for a set of input values.
Binary search is trickier, because the control flow is data-dependent, as is the number of iterations, but you could still conceivably perform a number of binary searches in parallel so long as you allow for the maximum number of iterations (log2 N).
Even if these algorithms could be parallelized (and I'm not sure they can), there is very little point in doing so.
Generally speaking, there is very little point in attempting to parallelize algorithms that already have sub-linear time bounds (that is, T < O(n)). These algorithms are already so fast that extra hardware will have very little impact.
Furthermore, it is not true (in general) that all algorithms with data dependencies cannot be parallelized. In some cases, for example, it is possible to set up a pipeline where different functional units operate in parallel and feed data sequentially between them. Image processing algorithms, in particular, are frequently amenable to such arrangements.
Problems with no such data dependencies (and thus no need to communicate between processors) are referred to as "embarrassingly parallel". Those problems represent a small subset of the space of all problems that can be parallelized.
Many algorithms consist of several steps in which each step depends on the previous one. Some of those algorithms can be restructured so that steps run in parallel, and some cannot be parallelized at all. I think binary search is of the second type, so you are not wrong. However, you can parallelize binary search by running multiple searches at once.
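The idea from the question of evaluating f at equally spaced points can be repeated round after round, which gives a k-section search instead of a bisection. The Python sketch below is one possible way to write that down, not something proposed in the answers above. It assumes f is increasing with f(lo) < alpha <= f(hi), the name `parallel_section_search` is invented, and a local thread pool stands in for whatever job dispatcher the farm actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_section_search(f, alpha, lo, hi, k=100, tol=1e-9):
    """Locate the crossing f(mu) = alpha for an increasing f.

    Each round evaluates f at k equally spaced interior points in
    parallel and keeps the one sub-interval that brackets the crossing,
    so the bracket shrinks by a factor of k + 1 per round.
    """
    with ThreadPoolExecutor(max_workers=k) as pool:
        while hi - lo > tol:
            step = (hi - lo) / (k + 1)
            points = [lo + step * (i + 1) for i in range(k)]
            values = list(pool.map(f, points))   # k evaluations at once
            new_lo, new_hi = lo, hi
            for p, v in zip(points, values):
                if v < alpha:
                    new_lo = p                   # crossing is to the right
                else:
                    new_hi = p                   # first point at or above alpha
                    break
            lo, hi = new_lo, new_hi
    return 0.5 * (lo + hi)


# Toy usage: solve mu**3 = 2 on [0, 2] with 8 evaluations per round.
root = parallel_section_search(lambda mu: mu ** 3, 2.0, 0.0, 2.0, k=8)
```

With 100 cores each round takes one evaluation time (about 2 minutes here) and shrinks the bracket by a factor of 101, compared with a factor of 2 per evaluation for plain bisection on one core.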

Accurate least-squares fit algorithm needed

I've experimented with the two ways of implementing a least-squares fit (LSF) algorithm shown here.
The first code is simply the textbook approach, as described by Wolfram's page on LSF. The second code re-arranges the equation to minimize machine errors. Both codes produce similar results for my data. I compared these results with Matlab's p=polyfit(x,y,1) function, using correlation coefficients to measure the "goodness" of fit and compare each of the 3 routines. I observed that while all 3 methods produced good results, at least for my data, Matlab's routine had the best fit (the other 2 routines had similar results to each other).
Matlab's p=polyfit(x,y,1) function uses a Vandermonde matrix, V (n x 2 matrix) and QR factorization to solve the least-squares problem. In Matlab code, it looks like:
V = [x1,1; x2,1; x3,1; ... xn,1] % this line is pseudo-code
[Q,R] = qr(V,0);
p = R\(Q'*y); % performs same as p = V\y
I'm not a mathematician, so I don't understand why it would be more accurate. Although the difference is slight, in my case I need to obtain the slope from the LSF and multiply it by a large number, so any improvement in accuracy shows up in my results.
For reasons I can't get into, I cannot use Matlab's routine in my work. So, I'm wondering if anyone has a more accurate equation-based approach recommendation I could use that is an improvement over the above two approaches, in terms of rounding errors/machine accuracy/etc.
Any comments appreciated! thanks in advance.
For polynomial fitting, you can create a Vandermonde matrix and solve the linear system, as you have already done.
Another solution is to use a method like Gauss-Newton to fit the data (since the system is linear, one iteration should do fine). There are differences between the methods. One possible reason is Runge's phenomenon.
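For a straight-line fit, one equation-based option that is usually better behaved than the textbook sums is to centre the data first. The Python sketch below is an illustration under that assumption and is not a reproduction of either of the two codes referred to in the question; `fit_line` is a hypothetical name, and `math.fsum` is used to keep the summation error small.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept computed from centred data.

    Working with (x - mean_x) and (y - mean_y) avoids subtracting two
    large, nearly equal sums, which is where the textbook formula
    n*sum(x*y) - sum(x)*sum(y) loses digits.
    """
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    sxx = math.fsum((x - mean_x) ** 2 for x in xs)
    sxy = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# x values with a large offset, where the uncentred formula is most fragile.
xs = [1e8 + i for i in range(5)]
ys = [3.0 * i + 2.0 for i in range(5)]
slope, intercept = fit_line(xs, ys)
```

If the slope is later multiplied by a large number, as described in the question, it is the cancellation in the numerator and denominator that this rearrangement is meant to reduce.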
