How to parallelise arrays inside a structure in OpenMP - openmp

I have a structure (pointer) G, which contains multiple arrays, say A, B, C, etc.
I have parallelizable loops for these arrays. How do I implement OpenMP for these loops?
e.g. Code:
for (int i = 0; i < size; i++)
    G->A[i] = G->B[i] + G->C[i];
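
One way to parallelize this, sketched below under the assumption of a struct layout like the hypothetical Grid shown (the field names are illustrative): each iteration writes a distinct element of A and only reads B and C, so the iterations are independent and a plain worksharing loop is enough.

#include <omp.h>

/* Hypothetical struct layout, just for illustration. */
struct Grid {
    double *A, *B, *C;
};

void add_arrays(Grid *G, int size)
{
    /* Each iteration touches a distinct index i, so there are no data races.
       i is private to each thread; G and size are shared. */
    #pragma omp parallel for
    for (int i = 0; i < size; i++)
        G->A[i] = G->B[i] + G->C[i];
}

The same #pragma omp parallel for can be placed in front of each of the other loops over these arrays, as long as their iterations are independent as well.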

Related

How to perform parallel reduction to sum an array

I'm new to parallel programming and can't quite grasp the concept of how to implement parallel reduction.
Say you were to sum a vector of 128 floats in shared memory with 8 threads how would you achieve this with only 2 for loops?
The only way I can think of is to sequentially add the vector until you're left with 16 elements that you can compute in parallel but that wouldn't be much faster than just sequentially adding all the elements together?
You can assume the sum of two floats is associative.
Based on this, you can virtually divide the vector into 8 equal parts and compute each part in its own thread (in parallel). Each thread keeps a local sum and then writes it into a cell of a predefined array local_sums.
Finally, one thread can sum the local_sums array, which gives the global sum.
Note that the sum of floating-point numbers is actually not associative (the result of (a+b)+c and a+(b+c) can differ slightly).
However, you cannot parallelize the code if you do not make this assumption. Furthermore, the precision of the two methods is generally similar (i.e. not great in either case).
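
A minimal OpenMP sketch of this scheme, assuming 8 threads and a vector of floats (the function and variable names are illustrative, not from the question):

#include <omp.h>
#include <vector>

// Sum a vector with a fixed number of threads using per-thread partial sums.
float parallel_sum(const std::vector<float>& data)
{
    const int NUM_THREADS = 8;
    float local_sums[NUM_THREADS] = {0.0f};   // one cell per thread

    // Loop 1: each thread accumulates its own chunk into local_sums[tid].
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < (int)data.size(); ++i)
            local_sums[tid] += data[i];
    }

    // Loop 2: one thread combines the partial sums into the global sum.
    float total = 0.0f;
    for (int t = 0; t < NUM_THREADS; ++t)
        total += local_sums[t];
    return total;
}

Adjacent cells of local_sums can share a cache line (false sharing), which costs some performance but not correctness. In practice, OpenMP's reduction(+:total) clause expresses the same pattern in a single parallel loop; the two-loop version above just makes the local_sums step explicit.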

Complex matrix multiplies real matrix by BLAS

Related question Multiplying real matrix with a complex vector using BLAS
Suppose I aim at C = A*B, where A, B, C are real, complex, and complex matrices, respectively. Element-wise, A[i,j] * B[j,k] = A[i,j] Re(B[j,k]) + i A[i,j] Im(B[j,k]), i.e. the real and imaginary parts of B are each scaled by the real A. Is there any available subroutine in BLAS?
I can think of splitting B into two real matrices for the real and imaginary parts, doing dgemm on each, then combining (the combine step should be much cheaper than the matrix multiplication, even with plain nested loops), as suggested by Multiplying real matrix with a complex vector using BLAS.
I don't know if there is a direct option in BLAS.
I don't know if there is a direct option in BLAS.
No, there is no routine in standard BLAS that multiplies real and complex matrices together to produce a complex result.
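
A sketch of the split-and-combine approach under these assumptions: column-major storage, a CBLAS interface providing cblas_dgemm (e.g. from OpenBLAS), and illustrative names and sizes:

#include <cblas.h>
#include <complex>
#include <vector>

// C = A * B, with A real (m x k) and B, C complex (k x n and m x n),
// all stored column-major. B is split into real and imaginary parts,
// dgemm is called twice, and the results are recombined.
void real_times_complex(int m, int n, int k,
                        const std::vector<double>& A,                // m*k
                        const std::vector<std::complex<double>>& B,  // k*n
                        std::vector<std::complex<double>>& C)        // m*n
{
    std::vector<double> Bre(k * n), Bim(k * n), Cre(m * n), Cim(m * n);
    for (int i = 0; i < k * n; ++i) { Bre[i] = B[i].real(); Bim[i] = B[i].imag(); }

    // Cre = A * Bre and Cim = A * Bim: two real-real multiplications.
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                1.0, A.data(), m, Bre.data(), k, 0.0, Cre.data(), m);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                1.0, A.data(), m, Bim.data(), k, 0.0, Cim.data(), m);

    // The O(m*n) split and combine loops are cheap next to the O(m*n*k) dgemm calls.
    C.resize(m * n);
    for (int i = 0; i < m * n; ++i) C[i] = std::complex<double>(Cre[i], Cim[i]);
}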

Do iterative solvers in Eigen allocate memory every iteration?

I am interested in using Eigen to solve sparse matrix equations. Iterative solvers require a number of "scratch" vectors that are updated with intermediate values during each iteration. As I understand it, when using an iterative solver such as the conjugate gradient method these vectors are usually allocated once before beginning iteration and then reused at every iteration to avoid frequent reallocations. As far as I can tell from looking at the ConjugateGradient class, Eigen re-allocates memory at every iteration. Could someone who is familiar with Eigen tell me whether my understanding is correct? It seemed possible that there was some sort of clever memory-saving scheme going on in the allocation procedure, with the result that the memory is not actually reallocated each time through, but I dug down and could not find such a thing. Alternatively, if Eigen is indeed re-allocating memory at each pass through the loop, is it an insubstantial burden compared to time required to do the actual computations?
Where do you see reallocation? As you can see in the source code, the four helper vectors residual, p, z, and tmp are declared and allocated outside the while loop, that is, before the iterations take place. Moreover, recall that Eigen is an expression template library, so a line of code such as:
x += alpha * p;
does not create any temporaries. In conclusion, no, Eigen's CG implementation does not perform any (re-)allocation within the iterations.
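
For reference, a minimal usage sketch (A and b here are placeholders for the user's sparse system): the solver is set up once with compute(), and as described above the helper vectors are allocated once per solve() call, before the iteration loop, and then reused across iterations.

#include <Eigen/Sparse>
#include <Eigen/IterativeLinearSolvers>

// Solve A x = b with Eigen's conjugate gradient solver.
Eigen::VectorXd solve_cg(const Eigen::SparseMatrix<double>& A,
                         const Eigen::VectorXd& b)
{
    Eigen::ConjugateGradient<Eigen::SparseMatrix<double>,
                             Eigen::Lower | Eigen::Upper> cg;
    cg.compute(A);                    // set up the solver / preconditioner once
    Eigen::VectorXd x = cg.solve(b);  // iterates without per-iteration allocation
    return x;
}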

Why permutation matrices are used to swap rows of an array?

What are the advantages of using a permutation matrix to swap rows? Why would one create a permutation matrix and then apply a matrix multiplication? Is it easier and more efficient than just swapping rows with a for loop?
Permutation matrices are a useful mathematical abstraction, because they allow analysis using the normal rules of matrix algebra, without having to introduce another type of operation.
In software, good implementations do not store a permutation matrix as a full matrix, they store a permutation array and they apply it directly (without a full matrix multiplication).
Depending on the sizes of the matrices and the operations and access patterns involved, it may be cheaper not to apply the permutation to the data in memory at all, but just to use it as an extra indirection. So, when you request (P * M)(i,j), where P is a permutation matrix and M is some other matrix that you are permuting, the data need not be re-arranged at all, but rather the element access operation will look up the permuted row when you access the element.
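
A sketch of that indirection, with hypothetical names: the "permutation matrix" is just an index array, and a row swap touches only the index array, never the matrix data.

#include <utility>
#include <vector>

// Row-permuted view of a dense matrix: view(i, j) returns M[perm[i]][j],
// i.e. the (i, j) entry of P * M, without ever forming P or moving data.
struct PermutedView {
    const std::vector<std::vector<double>>& M;  // underlying matrix
    std::vector<int> perm;                      // perm[i] = source row of row i

    double operator()(int i, int j) const { return M[perm[i]][j]; }

    // Swapping two rows is O(1): only the index array changes.
    void swap_rows(int a, int b) { std::swap(perm[a], perm[b]); }
};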
The first thing that comes to mind is the issue of "spatial locality". Caches assume that if a memory location is accessed, nearby locations are likely to be accessed next. In some programming languages the elements of a row are neighbors in memory, whereas in others the elements of a column are; it depends on the implementation. I guess permutation matrices are also meant to help with this, since optimizing matrix multiplication is one of the problems the algorithms community works hardest on improving. A simple loop structure will not be able to exploit the cache to improve performance.

Sparse matrix creation in parallel

Are there any algorithms that allow efficient creation (element filling) of sparse (e.g. CSR or coordinate) matrix in parallel?
If you store your matrix as a coordinate map, any language which has a concurrent dictionary implementation available should do the job for you.
Java's got the ConcurrentHashMap, and .NET 4 has ConcurrentDictionary, both of which allow multi-threaded non-blocking (afaik) element insertion in parallel.
There are no particularly efficient algorithms for creating sparse matrices in a data-parallel way. A plausible choice is the coordinate (COO) format, which requires sorting after the entries are filled in, but that format is slow for matrix products and similar operations.
The alternative solution is to not build the sparse matrix at all: you don't keep it in memory, and instead you perform the operations implicitly, computing the elements of the sparse matrix on the fly.
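
For the case where you do assemble the matrix explicitly, here is one common pattern sketched with Eigen (mentioned earlier on this page) and OpenMP; compute_entry is a placeholder for whatever produces the nonzero values, and the dense double loop over (i, j) is purely illustrative.

#include <Eigen/Sparse>
#include <omp.h>
#include <vector>

// Parallel assembly: each thread fills a private list of (row, col, value)
// triplets, the lists are concatenated, and the compressed structure is
// built once at the end with setFromTriplets.
Eigen::SparseMatrix<double> assemble(int rows, int cols,
                                     double (*compute_entry)(int, int))
{
    std::vector<std::vector<Eigen::Triplet<double>>> local(omp_get_max_threads());

    #pragma omp parallel
    {
        auto& mine = local[omp_get_thread_num()];
        #pragma omp for
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j) {
                double v = compute_entry(i, j);
                if (v != 0.0)
                    mine.emplace_back(i, j, v);   // no locking: list is thread-private
            }
    }

    // Merge the per-thread lists, then sort and compress in one pass.
    std::vector<Eigen::Triplet<double>> all;
    for (auto& l : local) all.insert(all.end(), l.begin(), l.end());

    Eigen::SparseMatrix<double> A(rows, cols);
    A.setFromTriplets(all.begin(), all.end());
    return A;
}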
