Parallelizing a data dependence loop with OpenMP

I have to parallelize the following code; the data dependence is i -> i-3:
for (i = 3; i < N2; i++)
    for (j = 0; j < N3; j++)
    {
        D[i][j] = D[i-3][j] / 3.0 + x + E[i];
        if (D[i][j] < 6.5) bat = bat + D[i][j] / 100.0;
    }
I tried #pragma omp parallel for reduction(+:bat) private(i,j) shared(D,x,E) and similar variations, but the result wasn't correct.

Let's consider two threads to see why parallelizing the outer loop fails.
Thread 1: i=3, j=0. This reads D[0][0] and writes D[3][0]
Thread 2: i=6, j=0. This reads D[3][0] and writes D[6][0]
So thread 2 reads D[3][0], the very element that thread 1 is writing, with no guarantee about which access happens first. That's the race condition. If you parallelize the inner loop instead, you won't have that problem.
for (i = 3; i < N2; i++) {
    #pragma omp parallel for reduction(+:bat) private(j)
    for (j = 0; j < N3; j++) {
        D[i][j] = D[i-3][j] / 3.0 + x + E[i];
        if (D[i][j] < 6.5) bat = bat + D[i][j] / 100.0;
    }
}
Edit: I forgot to add the reduction and make j private. I fixed that now.
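A side note, not part of the original answer: the version above opens and closes a parallel region once per outer iteration. As a hedged sketch, the region can be hoisted outside the i loop so the thread team is created only once; the implicit barrier at the end of each omp for still guarantees that row i-3 is fully written before row i reads it (j is declared inside the loop, so it is automatically private).
#pragma omp parallel private(i)          /* one thread team for the whole i loop */
for (i = 3; i < N2; i++) {
    #pragma omp for reduction(+:bat)     /* j iterations split among the threads */
    for (int j = 0; j < N3; j++) {
        D[i][j] = D[i-3][j] / 3.0 + x + E[i];
        if (D[i][j] < 6.5) bat = bat + D[i][j] / 100.0;
    }                                    /* implicit barrier: row i fully written */
}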

Related

while loop getting stuck - OpenMP

I was trying to implement a piece of parallel code and tried to synchronize the threads using an array of flags, as shown below:
// flags array set to zero initially
#pragma omp parallel for num_threads(n_threads) schedule(static, 1)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
        while (!flag[j]);
        y[i] -= L[i][j] * y[j];
    }
    y[i] /= L[i][i];
    flag[i] = 1;
}
However, the code always gets stuck after a few iterations when I compile it with gcc -O3 -fopenmp <file_name>. I have tried different numbers of threads (2, 4, 8); all of them lead to the loop getting stuck. By putting print statements inside critical sections, I found that even though the value of flag[i] gets updated to 1, the while loop stays stuck, or maybe there is some other problem with the code that I am not aware of.
I also found that if I do something inside the while loop body, such as printf("Hello\n"), the problem goes away. I think there is some problem with memory consistency across threads, but I do not know how to resolve it. Any help would be appreciated.
Edit: The single threaded code I am trying to parallelise is
for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
        y[i] -= L[i][j] * y[j];
    }
    y[i] /= L[i][i];
}
You have a data race in your code, which is easy to fix, but the bigger problem is that you also have a loop-carried dependency: the result depends on the order of execution. Try reversing the i loop without OpenMP and you will get a different result, so your code cannot be parallelized efficiently.
One possibility is to parallelize the j loop, but the workload inside that loop is very small, so the OpenMP overhead will be significantly bigger than the speedup gained from parallelization.
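For completeness, here is a hedged sketch (my own, not part of the original answer) of how the flag race itself could be fixed: making the flag accesses seq_cst atomics forces every spin to actually re-read shared memory and orders the updates to y, which is why the bare spin loop hangs under -O3 while this version should not. It does nothing about the dependency or the overhead, so it will still be slow.
// Sketch only: same wavefront idea, but with atomic flag accesses (OpenMP 4.0 seq_cst)
#pragma omp parallel for num_threads(n_threads) schedule(static, 1)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
        int ready = 0;
        while (!ready) {
            #pragma omp atomic read seq_cst
            ready = flag[j];             // re-reads the shared flag on every spin
        }
        y[i] -= L[i][j] * y[j];
    }
    y[i] /= L[i][i];
    #pragma omp atomic write seq_cst
    flag[i] = 1;                         // publish that y[i] is ready
}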
EDIT: For your updated code I suggest forgetting about parallelization (because of the loop-carried dependency) and making sure the inner loop is properly vectorized, so I suggest the following:
for (int i = 0; i < n; i++) {
    double sum_yi = y[i];
    #pragma GCC ivdep
    for (int j = 0; j < i; j++) {
        sum_yi -= L[i][j] * y[j];
    }
    y[i] = sum_yi / L[i][i];
}
#pragma GCC ivdep tells the compiler that there is no loop-carried dependency in the loop, so it can vectorize it safely. Do not forget to inform the compiler about the vectorization capabilities of your processor (e.g. use the -mavx2 flag if your processor is AVX2 capable).
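For example, assuming GCC on an AVX2-capable machine, a compile line along these lines should do (the -fopt-info-vec flag simply asks GCC to report which loops it managed to vectorize):
gcc -O3 -mavx2 -fopt-info-vec <file_name>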

OpenMP double for loop

I'd like to use OpenMP to apply multithreading.
Here is the simple code that I wrote:
vector<Vector3f> a;
int i, j;
for (i = 0; i < 10; i++)
{
    Vector3f b;
    #pragma omp parallel for private(j)
    for (j = 0; j < 3; j++)
    {
        b[j] = j;
    }
    a.push_back(b);
}
for (i = 0; i < 10; i++)
{
    cout << a[i] << endl;
}
I want to change it so that it works like:
parallel for1
{
    for2
}
or
for1
{
    parallel for2
}
The code works when the #pragma line is deleted, but it does not work when I use it. What's the problem?
///////// Added
Actually, I am using OpenMP in a more complicated example (the double for loop question). Here, too, it works well when I do not apply OpenMP, but when I apply it, the error occurs at the vector push_back line.
vector<Class> B;
for (i = 0; ...; i++)
{
    #pragma omp parallel for private(j)
    for (j = 0; ...; j++)
    {
        Class A;
        B.push_back(A); // error!!!!!!!
    }
}
If I erase the B.push_back(A) line, it also works when I apply OpenMP.
I could not find the exact error message, but it looks like an exception related to the vector. The debugger stops at:
void _Reallocate(size_type _Count)
{ // move to array of exactly _Count elements
pointer _Ptr = this->_Getal().allocate(_Count);
_TRY_BEGIN
_Umove(this->_Myfirst, this->_Mylast, _Ptr);
std::vector::push_back is not thread safe; you cannot call it from multiple threads without any protection against race conditions.
Instead, prepare the vector so that its size is already correct and then insert the elements via operator[].
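A minimal sketch of that approach, based on the first example in the question (10 elements of 3 components each) and assuming Vector3f is Eigen's Eigen::Vector3f; pre-sizing the vector means every (i, j) slot is written by exactly one thread and push_back is never needed:
#include <vector>
#include <Eigen/Dense>                   // assumption: Vector3f is Eigen::Vector3f

void fill(std::vector<Eigen::Vector3f> &a)
{
    a.resize(10);                        // size fixed up front, so no push_back is needed
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)         // parallelize the outer loop instead of the 3-iteration one
    {
        for (int j = 0; j < 3; j++)
        {
            a[i][j] = (float)j;          // each slot is written by exactly one thread
        }
    }
}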
Alternatively you can protect the insertion with a critical region:
#pragma omp critical
B.push_back(A);
This way only one thread at a time will do the insertion, which fixes the error but slows down the code.
In general I think you are not approaching parallelization the right way, but there is no way to give better advice without a clearer and more representative problem description.

OpenACC bitonic sort is much slower on GPU than on CPU

I have the following bit of code to sort double values on my GPU:
void bitonic_sort(double *data, int length) {
    #pragma acc data copy(data[0:length], length)
    {
        int i, j, k;
        for (k = 2; k <= length; k *= 2) {
            for (j = k >> 1; j > 0; j = j >> 1) {
                #pragma acc parallel loop gang worker vector independent
                for (i = 0; i < length; i++) {
                    int ixj = i ^ j;
                    if (ixj > i) {
                        if ((i & k) == 0 && data[i] > data[ixj]) {
                            double buffer = data[i];
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                        if ((i & k) != 0 && data[i] < data[ixj]) {
                            double buffer = data[i];
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                    }
                }
            }
        }
    }
}
This is a bit slower on my GPU than on my CPU. I'm using GCC 6.1. I can't figure out how to run the whole code on my GPU. So far, only the parallel loop is executed on the GPU, and execution switches between CPU and GPU for each iteration of the outer loops.
I'd like to run the whole content of the function on the GPU, but I can't figure out how. One major problem for me now is that the GCC implementation currently doesn't allow nested parallelism, so I can't use a parallel construct inside another parallel construct. Is there any way to get around that?
I've tried putting a kernels construct on top of the first loop but that slows it down by a factor of about 10. If I use a parallel construct above the first loop instead, the result isn't sorted any more, which makes sense. The two outer loops need to be executed sequentially for the algorithm to work.
If you have any other suggestions on how I could improve performance, I would be grateful as well.

OpenMP reduction and thread number control

I use OpenMP as:
#pragma omp parallel for reduction(+:average_stroke_width)
for(int i = 0; i < GB_name.size(); ++i) {...}
I know I can use:
#pragma omp parallel for num_threads(thread)
for(int index = 0; index < GB_name.size(); ++index){...}
How can I control the thread number when I use reduction?
How can I control the thread number when I use reduction?
Both clauses can be used together:
#pragma omp parallel for reduction(+:average_stroke_width) num_threads(thread)
for(int i = 0; i < GB_name.size(); ++i) {...}
Note that reduction involves all threads, so you cannot have a parallel loop with 8 threads and then perform a reduction with 4 threads only. Reduction combines the local values in all threads and therefore all of them need to participate.
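As a hedged aside not in the original answer: the thread count can also be set before the loop with omp_set_num_threads(), or globally via the OMP_NUM_THREADS environment variable; the reduction clause works the same either way.
#include <omp.h>

omp_set_num_threads(thread);   // applies to subsequent parallel regions
#pragma omp parallel for reduction(+:average_stroke_width)
for (int i = 0; i < GB_name.size(); ++i) {...}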

OpenMP: difference between combining 2 for loops and not combining

What is the difference between combining two for loops in one parallel region and parallelizing them separately?
Example
1. not parallelized together
#pragma omp parallel for
for (i = 0; i < 100; i++) {
    //.... some code
}
#pragma omp parallel for
for (i = 0; i < 1000; i++) {
    //.... some code
}
2. parallelized together
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 100; i++) {
        //.... some code
    }
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        //.... some code
    }
}
Which code is better, and why?
One might expect a small win from the second, because the first version forks/joins (or the functional equivalent) the OpenMP threads twice rather than once. Whether it makes any actual difference for your code is an empirical question best answered by measurement.
The second can also have a more significant advantage if the work in the two loops is independent, you can start the second at any time, and there is reason to expect some load imbalance in the first loop. In that case, you can add a nowait clause to the first omp for; then, rather than all threads waiting until the first loop ends, whoever finishes first can immediately go on to start working on the second loop. Alternatively, one could put the two chunks of code each in a section, or a task. In general, you have a lot of control over what threads do and how they do it within a parallel region; once you end the parallel region, you lose that flexibility - everything has to join together and you're done.
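A minimal sketch of the nowait variant described above, assuming the two loop bodies really are independent of each other:
#pragma omp parallel
{
    #pragma omp for nowait              // no barrier: threads that finish early start the next loop
    for (int i = 0; i < 100; i++) {
        //.... some code
    }
    #pragma omp for                     // implicit barrier at the end of this worksharing loop
    for (int i = 0; i < 1000; i++) {
        //.... some code
    }
}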
