I want run the following loop for counting elements in parallel.
Since count[j] is only updated by iterations where (X[i] / exp) % 10 evaluates to j I want to define a different critical section for each of these. I first thought of using reduction clause on each of the array elements but that gave a compilation error. I know this code is wrong but how should I implement this sort of thing?
#pragma omp parallel for
for (i = 0; i < n; i++)
#pragma omp critical((X[i] / exp) % 10)
count[(X[i] / exp) % 10]++;
For something like this where there is a single update, the normal solution is to use an atomic.
#pragma omp atomic
count[(X[i] / exp) % 10]++;
Alternatively you can use OpenMP reduction on the whole array, see Example reduction.7.c on page 238 of OpenMP 4.5 Examples
Related
Consider the following chunk of code:
int array[30000]
#pragma omp parallel
{
for( int a = 0; a < 1000; a++ )
{
#pragma omp for nowait
for( int i = 0; i < 30000; i++ )
{
/*calculations with array[i] and also other array entries happen here*/
}
}
}
Race conditions are not a concern in my application but I would like to enforce that each thread in the parallel regions takes care of exactly the same chunk of array at each run through the inner for loop.
It is my understanding that schedule(static) distributes the for-loop items based on the number of threads and the array length. However, it is not clear whether the distribution changes for different loops or different repetitions of the same loop (even when number of threads and length are the same).
What does the standard say about this? Is schedule(static) sufficient to enforce this?
I believe this quote from OpenMP Specification provides such a guarantee:
A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two worksharing-loop regions if the following conditions are satisfied: 1) both worksharing-loop regions have the same number of loop iterations, 2) both worksharing-loop regions have the same value of chunk_size specified, or both worksharing-loop regions have no chunk_size specified, 3) both worksharing-loop regions bind to the same parallel region, and 4) neither loop is associated with a SIMD construct.
I am trying to parallelize the below code using OpenMp. This code belongs to Encryption algorithm where output of each iteration is input to next iteration. Therefore i think that i can not parallelize this for loop due to dependency of rounds. But i want to parallelize each part of the instruction
y += ((z<<4)+k[0]) ^ (z+sum) ^ ((z>>5)+k[1]) using separate threads and get XOR finally . I tried to insert openMp directive inside loop but it takes longer time. Please correct me where i am doing wrong.
for(n=0;n<32;n++)
{
sum += delta;
#pragma omp parallel
y += ((z<<4)+k[0]) ^ (z+sum) ^ ((z>>5)+k[1]);
z += ((y<<4)+k[2]) ^ (y+sum) ^ ((y>>5)+k[3]);
}
I am trying to use openmp tasks to schedule a tiled execution of basic jacobi2d computation. In jacobi2d there is a dependence on A(i,j) from
A(i, j)
A(i-1, j)
A(i+1, j)
A(i, j-1)
A(i, j+1).
To my understanding of the depend clause I am declaring the dependences correctly, but they are not being respected while executing the code. I have copied the simplified code piece below. Initially my guess was that the out of bounds ranges for some tiles might be causing this issue, so I corrected that but the issue persists.(I have not copied the longer code with corrected tile ranges as that part is just a bunch of ifs + max)
int n=8,tsteps=2,b=4; //n - size of matrix, tsteps - time iterations, b - tile size or block size
#pragma omp parallel
{
#pragma omp master
for (t=0; t<tsteps; ++t)
{
for (i=0; i<n; i+=b)
for (j=0; j<n; j+=b)
{
#pragma omp task firstprivate(t,i,j) depend(in:A[i-1:b+2][j-1:b+2]) depend(out:B[i:b][j:b])
{
#pragma omp critical
printf("t-%d i-%d j-%d --A",t,i,j); //Prints out time loop, i,j
}
}
for (i=0; i<n; i+=b)
for (j=0; j<n; j+=b)
{
#pragma omp task firstprivate(t,i,j) depend(in:B[i-1:b+2][j-1:b+2]) depend(out:A[i:b][j:b])
{
#pragma omp critical
printf("t-%d i-%d j-%d --B",t,i,j); //Prints out time loop, i,j
}
}
}
}
}
So the idea with declaring dependence starting from i-1 and j-1 and the range being (b+2) is that the neighbouring tiles also affect your current tiles calculations. And similarly for the second set of loop where values in A should only be overwritten once the neighbouring tiles have used the values.
Code is being compiled using gcc 5.3 which supports openmp 4.0.
ps: the way array range is declared above denotes the starting position and the number of indices to be considered while creating the dependence graph.
edit (based on Zulan's comment) - changed the inner code to a simple print statement as this will suffice to check order of task execution. Ideally for the above values(since there are only 4 tiles) all tiles should complete the first printf and then only execute the second. But if you execute the code it will mix the order.
So I finally figured out the issue, even though OpenMP specs say that depend clause is supposed to be implemented with a starting point and range, it has not been implemented yet in gcc. So currently it only compares the starting point from the depend clause (depend(in:A[i-1:b+2][j-1:b+2])) A[i-1][j-1] in this case.
Initially I was comparing elements in different relative tile positions. Eg comparing (0,0) element with the last element of the tile, which was giving a no conflicts with dependence and hence the random order of execution of various tasks.
Current gcc implementation does not care about the range provided in the clause at all.
Which one can gain a better performance?
Example 1
#pragma omp parallel for private (i,j)
for(i = 0; i < 100; i++) {
for (j=0; j< 100; j++){
....do sth...
}
}
Example 2
for(i = 0; i < 100; i++) {
#pragma omp parallel for private (i,j)
for (j=0; j< 100; j++){
....do sth...
}
}
Follow up question Is it valid to use Example 3?
#pragma omp parallel for private (i)
for(i = 0; i < 100; i++) {
#pragma omp parallel for private (j)
for (j=0; j< 100; j++){
....do sth...
}
}
In general, Example 1 is the best as it parallelizes the outer most loop, which minimizes thread fork/join overhead. Although many OpenMP implementations pre-allocate the thread pool, there are still overhead to dispatch logical tasks to worker threads (a.k.a. a team of thread) and join them. Also note that when you use a dynamic scheduling (e.g., schedule(dynamic, 1)), then this task dispatch overhead would be problematic.
So, Example 2 may incur significant parallel overhead, especially when the trip count of for-i is large (100 is okay, though), and the amount of workload of for-j is small. Small may be an ambiguous term and depends on many variables. But, less than 1 millisecond would be definitely wasteful to use OpenMP.
However, in case where the for-i is not parallelizable and only for-j is parallelizable, then Example2 is the only option. In this case, you must consider carefully whether the amount of parallel workload can offset the parallel overhead.
Example3 is perfectly valid once for-i and for-j are safely parallelizable (i.e., no loop-carried flow dependences in each two loops, respectively). Example3 is called nested parallelism. You may take a look this article. Nested parallelism should be used with care. In many OpenMP implementations, you need to manually turn on nested parallelism by calling omp_set_nested. However, as nested parallelism may spawn huge number of threads, its benefit may be significantly reduced.
It depends on the amount your doing in the inner loop. If it's small, lauching too many threads will represent a overhead. If the work is big, I would probabaly go with option 2, depending on the number of cores your machines has.
BTW, the only place where you need to flag a variable as private is "j" in example 1. In all the other cases it's implicit.
I'm writing a program for matrix multiplication with OpenMP, that, for cache convenience, implements the multiplication A x B(transpose) rows X rows instead of the classic A x B rows x columns, for better cache efficiency. Doing this I faced an interesting fact that for me is illogic: if in this code i parallelize the extern loop the program is slower than if I put the OpenMP directives in the most inner loop, in my computer the times are 10.9 vs 8.1 seconds.
//A and B are double* allocated with malloc, Nu is the lenght of the matrixes
//which are square
//#pragma omp parallel for
for (i=0; i<Nu; i++){
for (j=0; j<Nu; j++){
*(C+(i*Nu+j)) = 0.;
#pragma omp parallel for
for(k=0;k<Nu ;k++){
*(C+(i*Nu+j))+=*(A+(i*Nu+k)) * *(B+(j*Nu+k));//C(i,j)=sum(over k) A(i,k)*B(k,j)
}
}
}
Try hitting the result less often. This induces cacheline sharing and prevents the operation from running in parallel. Using a local variable instead will allow most of the writes to take place in each core's L1 cache.
Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.
Try:
for (i=0; i<Nu; i++){
const double* const Arow = A + i*Nu;
double* const Crow = C + i*Nu;
#pragma omp parallel for
for (j=0; j<Nu; j++){
const double* const Bcol = B + j*Nu;
double sum = 0.0;
for(k=0;k<Nu ;k++){
sum += Arow[k] * Bcol[k]; //C(i,j)=sum(over k) A(i,k)*B(k,j)
}
Crow[j] = sum;
}
}
Also, I think Elalfer is right about needing reduction if you parallelize the innermost loop.
You could probably have some dependencies in the data when you parallelize the outer loop and compiler is not able to figure it out and adds additional locks.
Most probably it decides that different outer loop iterations could write into the same (C+(i*Nu+j)) and it adds access locks to protect it.
Compiler could probably figure out that there are no dependencies if you'll parallelize the 2nd loop. But figuring out that there are no dependencies parallelizing the outer loop is not so trivial for a compiler.
UPDATE
Some performance measurements.
Hi again. It looks like 1000 double * and + is not enough to cover the cost of threads synchronization.
I've done few small tests and simple vector scalar multiplication is not effective with openmp unless the number of elements is less than ~10'000. Basically, larger your array is, more performance will you get from using openmp.
So parallelizing the most inner loop you'll have to separate task between different threads and gather data back 1'000'000 times.
PS. Try Intel ICC, it is kinda free to use for students and open source projects. I remember being using openmp for smaller that 10'000 elements arrays.
UPDATE 2: Reduction example
double sum = 0.0;
int k=0;
double *al = A+i*Nu;
double *bl = A+j*Nu;
#pragma omp parallel for shared(al, bl) reduction(+:sum)
for(k=0;k<Nu ;k++){
sum +=al[k] * bl[k]; //C(i,j)=sum(over k) A(i,k)*B(k,j)
}
C[i*Nu+j] = sum;