How to perform parallel reduction to sum an array

I'm new to parallel programming and can't quite grasp the concept of how to implement parallel reduction.
Say you were to sum a vector of 128 floats in shared memory with 8 threads, how would you achieve this with only 2 for loops?
The only way I can think of is to sequentially add the vector until you're left with 16 elements that you can compute in parallel, but that wouldn't be much faster than just sequentially adding all the elements together, would it?

You can assume that floating-point addition is associative.
Based on this, you can virtually divide the vector into 8 equal parts and have each thread process one part (in parallel). Each thread computes a local sum of its part and stores it in a cell of a predefined array local_sums.
Finally, one thread sums the elements of local_sums, giving the global sum.
Note that the addition of floating-point numbers is actually not associative: the results of (a+b)+c and a+(b+c) can be marginally different.
However, you cannot parallelize the code if you do not make this assumption. Furthermore, the precision of the two methods is generally comparable (i.e. not great in either case).
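Here is a minimal sketch of that two-phase reduction in C; the use of OpenMP, the example data, and the names are my own choices for illustration, not something the answer prescribes:

#include <stdio.h>
#include <omp.h>

#define N        128
#define NTHREADS 8

int main(void)
{
    float v[N];
    for (int i = 0; i < N; ++i)
        v[i] = 1.0f;                      /* example data */

    float local_sums[NTHREADS] = {0};

    /* Phase 1: each thread sums its own chunk of the vector. */
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t     = omp_get_thread_num();
        int chunk = N / NTHREADS;
        float s   = 0.0f;
        for (int i = t * chunk; i < (t + 1) * chunk; ++i)
            s += v[i];
        local_sums[t] = s;
    }

    /* Phase 2: one thread sums the partial results. */
    float total = 0.0f;
    for (int t = 0; t < NTHREADS; ++t)
        total += local_sums[t];

    printf("sum = %f\n", total);
    return 0;
}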

Related

Using GCD with a large set of numbers

I am using PARI/GP which is a mathematics program with some helpful functionality for number theory, especially because it supports very large integers out of the box. For a previous C++ project I had to use a library called BigInt.
At the moment, using PARI/GP, I am utilising the gcd() function to calculate the greatest common divisor (GCD) of numbers ranging from 0 to 255 digits in length, so as you can imagine the numbers do get very large! I set a=0, then my loop iterates upwards, each time calculating gcd(a,b), where b is a large fixed number that never changes.
I was wondering whether I should instead use Euclid's approach to calculating the GCD, which I believe is the following simple recurrence: gcd(b, a % b), where the % symbol means modulo. Hopefully I got the variables in the correct order!
Is there a rough and quick way to estimate which of the approaches shown above for calculating the GCD is quickest? I would, of course, be open-minded about other approaches that are quicker.
I do not expect my algorithm to ever finish; this is just an experiment to see how far it can reach depending on which approach I use to calculate the GCD.
Binary GCD should generally be better than naive Euclid, but a being very small compared to b is a special circumstance that may trigger poor performance from Binary GCD. I’d try one round of Euclid, i.e., gcd(b, a%b) where gcd is Binary GCD.
(But without knowing the underlying problem here, I’m not sure that this is the best advice.)
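As a concrete illustration, here is a sketch in C of binary GCD (Stein's algorithm) with one initial Euclidean reduction, written on 64-bit machine words; the question's 255-digit numbers would of course need a big-integer type, and __builtin_ctzll is a GCC/Clang builtin assumed here for counting trailing zeros:

#include <stdint.h>

/* Binary GCD (Stein's algorithm) for unsigned 64-bit integers. */
static uint64_t binary_gcd(uint64_t a, uint64_t b)
{
    if (a == 0) return b;
    if (b == 0) return a;

    /* Factor out the common power of two. */
    int shift = __builtin_ctzll(a | b);
    a >>= __builtin_ctzll(a);

    do {
        b >>= __builtin_ctzll(b);      /* b is now odd */
        if (a > b) { uint64_t t = a; a = b; b = t; }
        b -= a;                        /* b is now even (or zero) */
    } while (b != 0);

    return a << shift;
}

/* One Euclidean reduction first, as suggested above, so that the large
 * fixed operand b is cut down modulo the small running value a before
 * switching to binary GCD. */
uint64_t gcd_one_euclid_step(uint64_t a, uint64_t b)
{
    if (a == 0) return b;
    return binary_gcd(a, b % a);
}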
The best approach is to let PARI do the work for you.
First, you can compute the GCD of a large number of inputs stored in a vector v as gcd(v).
? B=10^255; v = vector(10^6,i,random(B));
? gcd(v);
time = 22 ms.
? a = 0; for(i = 1, #v, a = gcd(a,v[i]))
time = 232 ms. \\ much worse
There are two reasons for gcd(v) to be much faster on such small inputs: loop overhead and variable assignments on the one hand, and early abort on the other hand (as soon as the intermediate answer is 1, we can stop). You can multiply v by 2, say, to prevent the second optimization; the simple gcd(v) will remain faster [because the loop and assignment overhead still occurs, but in C rather than in interpreted GP; for small inputs this overhead is very noticeable, and it becomes negligible as the sizes increase].
Similarly, it should always be faster on average to let the gcd function work out by itself how best to compute gcd(a,b) than to try and "improve" things by using tricks such as gcd(b, a % b) [note: the order doesn't matter, and this will error out if b = 0, which gcd is clever enough to check]. gcd(a, b-a) will not error out but will slow things down on average. For instance, gcd(a,b) will try an initial Euclidean step in case a and b have vastly differing sizes, so it shouldn't help to try and add it yourself.
Finally, the exact algorithms used depend on the underlying multiprecision library, either native PARI or GNU's GMP, the latter being faster due to a highly optimized implementation. In both cases, as operand sizes increase, the algorithms used include Euclid's algorithm, binary plus/minus [dividing out powers of 2, we can assume a and b odd, then use gcd(b,(a-b)/4) if a = b mod 4 and gcd(b,(a+b)/4) otherwise; the divisions are just binary shifts], and the asymptotically fast half-gcd (almost linear in the bit size). The latter is almost surely not being used in your computations, since the threshold should be over 10,000 decimal digits. On the other hand, Euclid's algorithm will only be used for tiny (word-size) operands, but since all the algorithms are recursive it will eventually be used, once the size has become tiny enough.
If you want to investigate the speed of the gcd function, try it with integers of around 100,000 decimal digits (then double that size, say); you should observe the almost-linear complexity.

How are sparse Ax = b systems solved in practice?

Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i,j,a), with indices i,j (between 0 and n-1) and a a value in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions over finite fields, and in exact and bounded-error solutions for real or complex numbers using floating-point representation. I suppose exact or bounded-error solutions for rational numbers are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The main two algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured Gaussian elimination.
The LaMacchia-Odlyzko paper (the one for the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (a linked list) to make the matrix-vector multiply time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Parallelisation of these algorithms is trivial, but optimisation will depend upon the architecture of your system. The parallelisation of the matrix-vector multiply is done by splitting the matrix into blocks of rows (each processor gets one block); each block of rows is multiplied by the vector separately, and then you combine the results to get the new vector.
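A rough sketch in C of that row-blocked matrix-vector multiply, using a compressed sparse row (CSR) layout rather than the linked list mentioned above (either gives a multiply proportional to the number of non-zeros); the struct layout and the use of OpenMP are my assumptions for illustration:

#include <stddef.h>

/* Hypothetical CSR container: row_ptr has n+1 entries, col/val have nnz. */
typedef struct {
    size_t  n;
    size_t *row_ptr;
    size_t *col;
    double *val;
} csr_matrix;

/* y = A * x.  Rows are split into blocks across threads; each thread
 * handles its own rows independently, so no synchronisation is needed
 * until all rows are done. */
void spmv(const csr_matrix *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < A->n; ++i) {
        double s = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}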
I've done these types of computations extensively. The original authors of the RSA-129 factorisation took 6 weeks using structured Gaussian elimination on a 16,384-processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!

Parallelising a sequence of matrix multiplications for speedup

In my function there are a lot of element-wise matrix multiplications which are independent. Is there a way to calculate them in parallel?
All of them are very simple operations, but 70% of my run time is spent in these parts of the code because this function is invoked millions of times.
function [r1,r2,r3] = backward(A,B,C,D,E,F,r1,r2,r3)
    r1 = A.*B;
    r2 = C.*D;
    r3 = E*F;
end

for i = 1:300
    [r1,r2,r3] = backward(A,B,C,D,E,F,r1,r2,r3)
end
EDIT: After writing the answer, I observed that you are not multiplying all the input matrices by means of matrix multiplication. Some of them are elementwise multiplications. If this is what you intended, the following answer won't apply.
You are looking for an optimal algorithm for computing product of multiple matrices. People have studied this problem long ago and they have come up with a dynamic programming algorithm to decide the optimal order.
For example, if A is of size 10000 x 1, B is of size 1 x 10000 and C is of size 10000 x 1, it would be a lot more efficient if we computed A*B*C as A*(B*C), instead of (A*B)*C. The proof of correctness of this technique lies in the fact that matrix multiplication is associative. You can read more about this on Wikipedia.
If you want a good quality MATLAB implementation of this, you can find it here. It takes the matrices as input and gives out the product. It seems like this implementation does a decent job of finding the optimal way of computing products of up to 10 matrices.
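To make the dynamic programming idea concrete, here is a minimal sketch in C of the classic matrix-chain-ordering cost computation (not the MATLAB implementation linked above); the dims convention, where matrix i has size dims[i] x dims[i+1], is the usual one for this algorithm:

#include <stdio.h>
#include <limits.h>

#define NMAX 16   /* enough for small chains like the example below */

/* dims has n+1 entries: matrix i is dims[i] x dims[i+1].
 * Returns the minimum number of scalar multiplications needed to
 * compute the chain product, using the standard O(n^3) DP. */
long matrix_chain_cost(const long *dims, int n)
{
    long m[NMAX][NMAX] = {0};

    for (int len = 2; len <= n; ++len) {          /* chain length */
        for (int i = 0; i + len - 1 < n; ++i) {
            int j = i + len - 1;
            m[i][j] = LONG_MAX;
            for (int k = i; k < j; ++k) {         /* split point */
                long cost = m[i][k] + m[k + 1][j]
                          + dims[i] * dims[k + 1] * dims[j + 1];
                if (cost < m[i][j])
                    m[i][j] = cost;
            }
        }
    }
    return m[0][n - 1];
}

int main(void)
{
    /* The 10000x1, 1x10000, 10000x1 example from the answer:
     * A*(B*C) costs 20000 multiplications, (A*B)*C costs 2*10^8. */
    long dims[] = {10000, 1, 10000, 1};
    printf("min scalar multiplications: %ld\n", matrix_chain_cost(dims, 3));
    return 0;
}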
First thing to note: the last 3 variables that you provide as input are not being used. I don't think this will matter much, but it would be better to clean it up.
Now the actual answer:
MATLAB is all about matrix operations, and these have been highly optimized. Even using C++ you should not expect a significant speedup (and be wary of a slowdown). As such, with the information provided in the question, the conclusion would be that you cannot do anything to speed up independent matrix calculations.
That being said: If you could reduce the number of sequential function calls, there may be something to gain.
It is hard to say how to do this in general, but two ideas:
If you call the function in a for loop, use a parfor loop instead (assuming you have the parallel processing toolbox; otherwise, manually break up the loop and open 4 MATLAB instances to parallelise the loop, which can be automated if needed).
See whether you really need this many function calls for small matrix operations. If you can improve your algorithm, that could offer a huge improvement; otherwise you may still be able to combine multiple matrices (multiple versions of A with multiple versions of B, for instance) and do one big multiplication rather than a hundred tiny ones.

Pseudorandom hash of two integers

I need an NxN matrix with 16-bit or 32-bit pseudorandom, uniformly distributed numbers over the whole range of values. N is unfortunately very large (at least 1e6), so the matrix cannot be pregenerated (that would take about a TB of memory). The only viable option I can think of is using a hash of my indices i and j as the matrix elements.
There are plenty of integer hash functions available, but I am not sure which ones are suitable for two reasons.
- Only 32-bit unsigned integer operations are available. Since N is at least 2^20, I cannot naively concatenate the two indices into one 32-bit key without creating unnecessary collisions.
- Pseudorandomness is important here; I am not building a hashtable. Most integer hashes I found are designed for hashtables and don't have very strong requirements.
A possible solution would be taking a cryptographic hash like SHA-2, but performance is important and that is just too expensive.
Any suggestion on how to combine two 32-bit uints into one while avoiding collision patterns would already help a great deal, since I could then pick from the whole range of 32-bit to 32-bit hashes.
Some insight on which 32-bit to 32-bit hashes have good randomness would also be much appreciated.
Pregenerating one or two arrays of N random numbers is no problem if it helps.
In case it matters, the targets are GPUs; I am writing in recent versions of GLSL.
What about using an LCG? It is a well-known fact that in the form
x_{n+1} = (a*x_n + c) mod 2^32, where a mod 8 is 3 or 5 and c is odd, the resulting congruential sequence will have period 2^32.
Numerical Recipes uses a = 1664525, c = 1013904223, but there are tons of suitable constants.
Form a unique x from i and j, and compute the next term of the sequence from it.
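A tiny sketch in C of that idea; the constants are the Numerical Recipes ones quoted above, while the particular way i and j are folded into the state is an illustrative assumption on my part (the answer only says to form x from i and j):

#include <stdint.h>

/* One LCG step: x_{n+1} = (a*x_n + c) mod 2^32.  The mod 2^32 is the
 * natural wrap-around of 32-bit unsigned arithmetic. */
static uint32_t lcg_step(uint32_t x)
{
    return 1664525u * x + 1013904223u;
}

/* Hash a matrix element (i, j): seed with i, step, mix in j, step again.
 * The mixing scheme here is hypothetical, not part of the original answer. */
uint32_t element_hash(uint32_t i, uint32_t j)
{
    uint32_t x = lcg_step(i);
    x ^= j;
    return lcg_step(x);
}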
I found a suitable algorithm. Block ciphers in counter mode are an obvious fit. I initially rejected the idea because of the performance implications of most block ciphers. However, I found a paper that introduces a related algorithm (basically a block cipher with fewer rounds) called Philox (Parallel Random Numbers: As Easy as 1, 2, 3 by Salmon et al.).
Link: http://www.thesalmons.org/john/random123/papers/random123sc11.pdf
The only problem left is how to combine the two indices into one 32-bit number, but I guess XOR should be good enough if combined with a rotation to avoid commutativity.
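For illustration, a minimal C sketch of that combining step; the rotation amount of 16 is an arbitrary choice of mine, and the combined value would then be fed into a Philox-style counter-mode generator:

#include <stdint.h>

/* Rotate left by r bits (0 < r < 32). */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32u - r));
}

/* Combine the two indices so that, unlike a plain XOR, the result is not
 * symmetric in i and j. */
uint32_t combine_indices(uint32_t i, uint32_t j)
{
    return i ^ rotl32(j, 16);
}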

Why are permutation matrices used to swap rows of an array?

What are the advantages of using a permutation matrix to swap rows? Why would one create a permutation matrix and then apply a matrix multiplication? Is it easier and more efficient than just swapping rows with a for loop?
Permutation matrices are a useful mathematical abstraction, because they allow analysis using the normal rules of matrix algebra, without having to introduce another type of operation.
In software, good implementations do not store a permutation matrix as a full matrix, they store a permutation array and they apply it directly (without a full matrix multiplication).
Depending on the sizes of the matrices and the operations and access patterns involved, it may be cheaper not to apply the permutation to the data in memory at all, but just to use it as an extra indirection. So, when you request (P * M)(i,j), where P is a permutation matrix and M is some other matrix that you are permuting, the data need not be re-arranged at all, but rather the element access operation will look up the permuted row when you access the element.
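As a sketch in C of the two options described above (the names are mine, for illustration only): row i of P*M is row perm[i] of M, so you can either read through the indirection without moving any data, or apply the permutation once in linear time instead of doing a full matrix multiplication:

#include <stddef.h>

/* Element access with the permutation used as an extra indirection:
 * no data in M is moved at all.  M is row-major with ncols columns. */
double permuted_get(const double *M, size_t ncols,
                    const size_t *perm, size_t i, size_t j)
{
    return M[perm[i] * ncols + j];
}

/* Or, if the rows really must be reordered in memory, copy them once in
 * O(n * ncols) time instead of performing a dense matrix multiply. */
void apply_permutation(const double *M, double *out, size_t n,
                       size_t ncols, const size_t *perm)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < ncols; ++j)
            out[i * ncols + j] = M[perm[i] * ncols + j];
}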
The first thing that comes to my mind is the issue called "spatial locality". Caches assume that if a memory location is accessed, nearby locations are likely to be accessed soon. In some programming languages the elements of a row are neighbours in memory, whereas in others the elements of a column are; it depends on the implementation. I guess permutation matrices are designed with this problem in mind, since optimizing matrix multiplication is one of the problems the algorithms community keeps working on, and a simple loop structure will not be able to make use of the cache to improve performance.
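As a small aside illustrating spatial locality (my own example, not from the answer): for a row-major array in C, the row-wise loop below touches consecutive addresses and is cache-friendly, while the column-wise loop strides through memory and is typically much slower for large matrices:

#include <stddef.h>

#define NROWS 1024
#define NCOLS 1024

/* Row-major layout: a[i][j] and a[i][j+1] are adjacent in memory. */
double sum_row_order(const double a[NROWS][NCOLS])
{
    double s = 0.0;
    for (size_t i = 0; i < NROWS; ++i)
        for (size_t j = 0; j < NCOLS; ++j)
            s += a[i][j];          /* consecutive addresses: cache-friendly */
    return s;
}

double sum_column_order(const double a[NROWS][NCOLS])
{
    double s = 0.0;
    for (size_t j = 0; j < NCOLS; ++j)
        for (size_t i = 0; i < NROWS; ++i)
            s += a[i][j];          /* stride of NCOLS doubles: poor locality */
    return s;
}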

Resources