Sorting an array in OpenMP - critical section

This is quite similar to the question "Sorting an array in openmp", which has several hundred views but no correct answer, so I am trying again here.
I am aware of the overhead and that this is useless in terms of speedup or performance. It is simply a small example to get into OpenMP. The use of insertion sort was prescribed by my course instructor.
Here is my code:
std::vector<int> insertionSort(std::vector<int> a) {
int i, j, k;
#pragma omp parallel for private(i,j,k)
for(i = 0; i < a.size(); i++) {
#pragma omp critical
k = a[i];
for (j = i; j > 0 && a[j-1] > k; j--)
#pragma omp critical
{
a[j] = a[j-1];
a[j] = k;
}
}
return a;
}
I understand that the critical aspect is the race condition between threads accessing (reading and writing) elements of a, which is why I put a critical section around all of them. That does not seem to be sufficient. What am I missing here? Without the pragmas, the sorting is correct.

Related

OpenMP: do I have a race condition or false sharing?

I'm trying to write code for matrix multiplication. As far as I understand OpenMP and parallel programming, this code may suffer from a race condition.
#pragma omp parallel
#pragma omp for
for (int k = 0; k < size; k++){
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
c[i][j] += a[i][k] * b[k][j];
}}}
Do I get rid of it if I put #pragma omp atomic before the write to the c matrix, or by adding private(i) to the second #pragma? Also, is it possible to make this code free of false sharing? If yes, how?
A race condition occurs when two or more threads access the same memory location and at least one of them writes to it. In your code the k loop is the one distributed across threads, so different threads update the same c[i][j] concurrently, and the line c[i][j] += ... is a data race. The solution is to reorder your nested loops (use the order i, j, k), and you may introduce a temporary variable to accumulate the dot product:
#pragma omp parallel for
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
double tmp=0; // change its type as needed
for (int k = 0; k < size; k++){
tmp += a[i][k] * b[k][j];
}
c[i][j] = tmp; //note that += was used in your original code
}
}
Note that your code will be faster if you calculate the transpose of matrix b. For more details read this.
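As a rough illustration of the transpose suggestion, here is a minimal sketch. It assumes square n x n matrices stored row-major in flat std::vector<double> buffers, and the function name matmul_transposed is hypothetical (none of this is from the original post). Each thread owns whole rows of c, so no two threads write the same element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a * b for n x n row-major matrices.
// b is transposed once so the inner loop reads both operands contiguously.
std::vector<double> matmul_transposed(const std::vector<double>& a,
                                      const std::vector<double>& b,
                                      int n) {
    const std::size_t total = static_cast<std::size_t>(n) * n;
    assert(a.size() == total && b.size() == total);

    std::vector<double> bt(total);  // bt[j][k] == b[k][j]
    std::vector<double> c(total);

    // Each iteration writes a distinct column of bt, so there is no race.
    #pragma omp parallel for
    for (int k = 0; k < n; k++) {
        for (int j = 0; j < n; j++) {
            bt[j * n + k] = b[k * n + j];
        }
    }

    // Loop order i, j, k: each thread writes only its own rows of c.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double tmp = 0.0;  // private accumulator for one dot product
            for (int k = 0; k < n; k++) {
                tmp += a[i * n + k] * bt[j * n + k];
            }
            c[i * n + j] = tmp;
        }
    }
    return c;
}
```

Whether the extra transpose pays off depends on the matrix size and the cache hierarchy, so it is worth timing both variants (compiled with -fopenmp and optimizations enabled).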
UPDATE:
If you need to maintain the order of loops, there are 2 possibilities (but these solutions may be slower than the serial code):
Use an atomic operation (i.e. #pragma omp atomic). In this case false sharing can also be a problem.
If your stack is large enough to store the matrix for all threads, a better alternative is to use reduction: #pragma omp parallel for reduction(+:c[:size][:size]) (Another alternative is to do the reduction manually. In this case you can allocate the matrices used for reduction on the heap.)
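The atomic option from the list above could look roughly like the following sketch. It assumes flat row-major n x n buffers, and the function name matmul_accumulate_atomic is hypothetical. The k loop stays outermost and parallel, so every update of c has to be protected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: c += a * b with the original k, i, j loop order.
// Threads split the k range, so several threads may update the same
// element of c; the atomic update makes each "+=" indivisible.
void matmul_accumulate_atomic(const std::vector<double>& a,
                              const std::vector<double>& b,
                              std::vector<double>& c,
                              int n) {
    const std::size_t total = static_cast<std::size_t>(n) * n;
    assert(a.size() == total && b.size() == total && c.size() == total);

    double* cp = c.data();  // plain pointer keeps the atomic target simple

    #pragma omp parallel for
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                const double prod = a[i * n + k] * b[k * n + j];
                #pragma omp atomic
                cp[i * n + j] += prod;
            }
        }
    }
}
```

This is correct but, as noted above, it may well be slower than the serial code, because every single update pays for the atomic operation.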

Nested loop in OpenMP performance issue

I have the following trivial nested loops (just as a performance test):
const int N = 300;
for (int num = 0; num < 10000; num++) {
for (int i=0; i<N; i++) {
for (int j=0; j<N; j++) {
arr[i][j] = brr[i][j];
crr[i][j] = arr[i][j] - brr[i][j];
sum1 += crr[i][j];
sum2 += arr[i][j];
}
}
}
The elapsed time was
about 6 s
I tried to parallelize different loops with OpenMP. But I am very confused with the results I got.
In the first step I used "parallel for" pragma only for the first (outermost) loop:
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
for (int i=0; i<N; i++) {
for (int j=0; j<N; j++) {
arr[i][j] = brr[i][j];
crr[i][j] = arr[i][j] - brr[i][j];
sum1 += crr[i][j];
sum2 += arr[i][j];
}
}
}
The elapsed time was (2 cores)
3.81
Then I tried to parallelize two inner loops with "collapse" clause (2 cores):
for (int num = 0; num < 10000; num++) {
#pragma omp parallel for collapse(2) schedule(static) reduction(+:sum1, sum2)
for (int i=0; i<N; i++) {
for (int j=0; j<N; j++) {
arr[i][j] = brr[i][j];
crr[i][j] = arr[i][j] - brr[i][j];
sum1 += crr[i][j];
sum2 += arr[i][j];
}
}
}
The elapsed time was
3.76
This is faster than in the previous case, and I do not understand why.
If I use fusing of these inner loops (which is meant to be better in the sense of performance) like this
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int n = 0; n < N * N; n++) {
int i = n / N; int j = n % N;
the elapsed time is
5.53
This confuses me so much. The performance is worse in this case, though usually people advise to fuse loops for better performance.
Okay, now let's try to parallelize only middle loop like this (2 cores):
for (int num = 0; num < 10000; num++) {
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int i=0; i<N; i++) {
for (int j=0; j<N; j++) {
arr[i][j] = brr[i][j];
crr[i][j] = arr[i][j] - brr[i][j];
sum1 += crr[i][j];
sum2 += arr[i][j];
}
}
}
Again, the performance becomes better:
3.703
And the final step - parallelization of the innermost loop only (assuming that this will be the fastest case according to the previous results) (2 cores):
for (int num = 0; num < 10000; num++) {
for (int i=0; i<N; i++) {
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int j=0; j<N; j++) {
arr[i][j] = brr[i][j];
crr[i][j] = arr[i][j] - brr[i][j];
sum1 += crr[i][j];
sum2 += arr[i][j];
}
}
}
But (surprise!) the elapsed time is
about 11 s
This is much slower than in the previous cases. I cannot figure out the reason for any of this.
By the way, I was looking for similar questions, and I found advice of adding
#pragma omp parallel
before the first loop (for example, in this and that questions). But why is that the right procedure? If we place
#pragma omp parallel
before a for loop, it means that each thread executes the whole for loop, which is incorrect (excess work). Indeed, I tried to insert
#pragma omp parallel
before the outermost loop with different locations of
#pragma omp parallel for
as I am describing here, and the performance was worse in all cases. Moreover, in the last case, when parallelizing the innermost loop only, the answer was also incorrect: sum2 was different, as there was a race condition.
I would like to know the reasons for this performance behavior (probably the time spent on data exchange is greater than the time of actual computation on each thread, but that applies only to the last case) and which solution is the most correct one.
EDIT: I've disabled the compiler's optimization (with the -O0 option) and the results are still the same (except that the elapsed time in the last example, when parallelizing the innermost loop, dropped from 11 s to 8 s).
Compiler options:
g++ -std=gnu++0x -fopenmp -O0 test.cpp
Definition of variables:
unsigned int seed;
const int N = 300;
int main()
{
double arr[N][N];
double brr[N][N];
for (int i=0; i < N; i++) {
for (int j = 0; j < N; j++) {
arr[i][j] = i * j;
brr[i][j] = i + j;
}
}
double start = omp_get_wtime();
double crr[N][N];
double sum1 = 0;
double sum2 = 0;
"And the final step - parallelization of the innermost loop only (assuming that this will be the fastest case according to the previous results) (2 cores). But (surprise!) the elapsed time is about 11 s."
It is not a surprise at all. Parallel blocks perform implicit barriers and can even join and create threads (some libraries may use thread pools to reduce the cost of thread creation).
In the end, opening parallel regions is expensive. You should do it as few times as possible. The threads will run the outer loops in parallel, at the same time, but will divide the iteration space once they reach the omp for block, so the result should still be correct (you should make your program check this if you are unsure).
For testing performance, you should always run your experiments with compiler optimizations turned on, as they have a heavy impact on the behavior of the application (you should not make assumptions about performance based on unoptimized programs, because their problems may already be addressed during optimization).
When making a single parallel block that contains all the loops, the execution time is halved in my setup (started with 9.536s using 2 threads, and reduced to 4.757s).
The omp for block still applies an implicit barrier at its end, which is not needed in your example. Adding the nowait clause to the example reduces the execution time by another half: 2.120s.
From this point, you can now try to explore the other options.
Parallelizing middle loop reduces execution time to only 0.732s due to much better usage of the memory hierarchy and vectorization. L1 miss ratio reduced from ~29% to ~0.3%.
Using collapse with the two innermost loops made no big difference using two threads (strong scaling should be checked).
Using other directives such as omp simd does not improve performance in this case, as the compiler is sure enough that it can vectorize the innermost loop safely.
#pragma omp parallel reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
#pragma omp for schedule(static) nowait
for (int i=0; i<N; i++) {
for (int j=0; j<N; j++) {
arr[i][j] = brr[i][j];
crr[i][j] = arr[i][j] - brr[i][j];
sum1 += crr[i][j];
sum2 += arr[i][j];
}
}
}
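To follow the advice about checking the result, one possible approach is to run the same loops once with a single thread and once with a team, then compare the sums. The sketch below is an assumption-laden test harness, not the original program: the names Sums and run_loops are hypothetical, the arrays are flat std::vector buffers instead of stack arrays, and n and reps are parameters so the check runs quickly. Because all values are small integers stored in doubles, the sums are exact and can be compared with ==.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sums {
    double sum1;
    double sum2;
};

// Hypothetical harness: one parallel region around all loops, with the
// work-sharing "omp for" inside. use_threads == false forces a team of one
// thread through the if clause, which gives a serial reference result.
Sums run_loops(int n, int reps, bool use_threads) {
    assert(n > 0 && reps > 0);
    const std::size_t total = static_cast<std::size_t>(n) * n;
    std::vector<double> arr(total), brr(total), crr(total);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            arr[i * n + j] = i * j;
            brr[i * n + j] = i + j;
        }
    }

    double sum1 = 0.0;
    double sum2 = 0.0;
    #pragma omp parallel if(use_threads) reduction(+:sum1,sum2)
    for (int num = 0; num < reps; num++) {  // every thread runs this loop
        // The i range is split among the threads; static scheduling gives
        // each thread the same rows in every repetition, so nowait is safe.
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                arr[i * n + j] = brr[i * n + j];
                crr[i * n + j] = arr[i * n + j] - brr[i * n + j];
                sum1 += crr[i * n + j];
                sum2 += arr[i * n + j];
            }
        }
    }
    return Sums{sum1, sum2};
}
```

Calling run_loops(300, 100, false) and run_loops(300, 100, true) and comparing both fields is a quick sanity check before measuring any timings.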
Note: L1 miss ratio computed using perf:
$ perf stat -e cache-references,cache-misses -r 3 ./test
Since variables in parallel programming are shared among threads (cores), you should consider how the processor's cache memory comes into play. Your code might be running with false sharing, which can hurt performance.
In your first parallel version, you put #pragma omp for on the first for loop, which means each thread has its own i and j. Compare this with the second and third versions (which differ only by collapse), which parallelize the second for loop, so each i has its own j. These two versions perform better because each thread/core hits the cache line of j more often. The fourth version is a complete disaster for the processor caches because nothing is shared there.
I recommend measuring your code with Intel's PCM or PAPI in order to get a proper analysis.
Regards.

Loop sequence in OpenMP Collapse performance advise

I found Intel's performance suggestion on Xeon Phi on Collapse clause in OpenMP.
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
for (j = 0; j < jmax; j++) a[ j + jmax*i] = 1.;
}
Modified example for better performance:
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
for (j = 0; j < jmax; j++) a[ k++] = 1.;
}
I tested both cases in Fortran with similar code on a regular CPU using GFortran 4.8, and both give the correct result. However, a test using similar Fortran code for the latter version does not pass with GFortran 5.2.0 and Intel 14.0.
But as far as I understand, the loop body in OpenMP should avoid variables that depend on the loop sequence, which in this case is k. So why does the latter version give the correct result and even better performance?
Here is the equivalent code for the two approaches when using the collapse clause. You can see that the second one is better.
for(int k=0; k<imax*jmax; k++) {
int i = k / jmax;
int j = k % jmax;
a[j + jmax*i]=1.;
}
for(int k=0; k<imax*jmax; k++) {
a[k]=1.;
}
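Note that a counter such as k that is shared and incremented inside a parallel loop is a data race, which would be consistent with the failing tests mentioned in the question. A minimal sketch of two race-free ways to write the fill is shown below; the function names fill_collapsed and fill_flat are hypothetical, and the array is assumed to be a flat buffer of imax * jmax elements. In both versions the index is derived from the loop variables only, so the result does not depend on the order in which iterations run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: collapsed 2D loop, index computed from i and j.
void fill_collapsed(std::vector<double>& a, int imax, int jmax) {
    assert(a.size() == static_cast<std::size_t>(imax) * jmax);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            a[j + jmax * i] = 1.0;
        }
    }
}

// Hypothetical helper: the same fill written as one flat loop, which
// avoids the division and modulo implied by collapsing the two loops.
void fill_flat(std::vector<double>& a, int imax, int jmax) {
    assert(a.size() == static_cast<std::size_t>(imax) * jmax);
    const int total = imax * jmax;
    #pragma omp parallel for
    for (int k = 0; k < total; k++) {
        a[k] = 1.0;  // k is the loop variable, private to each thread
    }
}
```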

Sorting an array in openmp

I have an array of 100 elements that needs to be sorted with insertion sort using OpenMP. When I parallelize my sort, it does not give correct values. Can someone help me?
void insertionSort(int a[])
{
int i, j, k;
#pragma omp parallel for private(i)
for(i = 0; i < 100; i++)
{
k = a[i];
for (j = i; j > 0 && a[j-1] > k; j--)
#pragma omp critical
a[j] = a[j-1];
a[j] = k;
}
}
Variables "j" and "k" need to be private on the parallel region. Otherwise you have a data race condition.
Unless it's homework, sorting as few as 100 elements in parallel makes no sense: the overhead introduced by parallelism will far outweigh any performance benefit.
Also, the insertion sort algorithm is inherently serial. When a[i] is processed, all previous elements in the array are assumed to be already sorted. But if two elements are processed in parallel, there is obviously no such guarantee.
A more detailed explanation of why insertion sort cannot be parallelized in the suggested way is given by @dreamcrash in his answer to a similar question.
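For contrast, here is a sketch of a different algorithm, odd-even transposition sort, which is not insertion sort but is a simple quadratic sort whose steps are independent. Within one phase every compare-exchange touches a disjoint pair of neighbours, so the inner loop can be shared among threads without a critical section. The function name odd_even_sort is hypothetical, and this is meant as an illustration of the idea rather than a fast sort.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Swapped-in technique: odd-even transposition sort (NOT insertion sort).
// Phase p compares pairs (i, i+1) with i starting at p % 2 and stepping
// by 2, so no element belongs to two pairs in the same phase.
void odd_even_sort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int phase = 0; phase < n; phase++) {  // n phases are sufficient
        const int start = phase % 2;
        #pragma omp parallel for
        for (int i = start; i < n - 1; i += 2) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
            }
        }
    }
    assert(std::is_sorted(a.begin(), a.end()));
}
```

The phases themselves still run one after another; only the work inside each phase is parallel, which is exactly the kind of independence that insertion sort lacks.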

shell sort in openmp

Is anyone familiar with OpenMP? I don't get a sorted list. What am I doing wrong? I am using critical at the end so only one thread can access that section while it's being sorted. I guess my private values are not correct. Should they even be there, or am I better off with just #pragma omp for?
void shellsort(int a[])
{
int i, j, k, m, temp;
omp_set_num_threads(10);
for(m = 2; m > 0; m = m/2)
{
#pragma omp parallel for private (j, m)
for(j = m; j < 100; j++)
{
#pragma omp critical
for(i = j-m; i >= 0; i = i-m)
{
if(a[i+m] >= a[i])
break;
else
{
temp = a[i];
a[i] = a[i+m];
a[i+m] = temp;
}
}
}
}
}
So there's a number of issues here.
So first, as has been pointed out, i and j (and temp) need to be private; m and a need to be shared. A useful thing to do with openmp is to use default(none), that way you are forced to think through what each variable you use in the parallel section does, and what it needs to be. So this
#pragma omp parallel for private (i,j,temp) shared(a,m) default(none)
is a good start. Making m private in particular is a bit of a disaster, because it means that m is undefined inside the parallel region. The loop, by the way, should start with m = n/2, not m=2.
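To illustrate what default(none) buys you, here is a small sketch on a related but simpler task: counting how many pairs at distance m are out of order. The function name count_out_of_order is hypothetical and not part of the original code. With default(none), the compiler rejects the loop unless every variable used inside it is given an explicit data-sharing attribute.

```cpp
#include <cassert>

// Hypothetical helper: counts positions j where a[j - m] > a[j].
// default(none) forces an explicit decision for a, n, m and count;
// the loop variable j is declared in the loop and is private by rule.
int count_out_of_order(const int* a, int n, int m) {
    assert(a != nullptr && m > 0);
    int count = 0;
    #pragma omp parallel for default(none) shared(a, n, m) reduction(+:count)
    for (int j = m; j < n; j++) {
        if (a[j - m] > a[j]) {
            count++;  // each thread adds to its own copy, combined at the end
        }
    }
    return count;
}
```

Removing m from the shared clause, or marking it private as in the question, makes the problem visible immediately: the first is a compile error under default(none), and the second leaves m uninitialized inside the region.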
In addition, you don't need the critical region -- or you shouldn't, for a shell sort. The issue, we'll see in a second, is not so much multiple threads working on the same elements. So if you get rid of those things, you end up with something that almost works, but not always. And that brings us to the more fundamental problem.
The way a shell sort works is, basically, you break the array up into many (here, m) subarrays, and insertion-sort them (very fast for small arrays), and then reassemble; then continue by breaking them up into fewer and fewer subarrays and insertion sort (very fast, because they're partly sorted). Sorting those many subarrays is something that can be done in parallel. (In practice, memory contention will be a problem with this simple approach, but still).
Now, the code you've got does that in serial, but it can't be counted on to work if you just wrap the j loop in an omp parallel for. The reason is that each iteration through the j loop does one step of one of the insertion sorts. The j+m'th loop iteration does the next step. But there's no guarantee that they're done by the same thread, or in order! If another thread has already done the j+m'th iteration before the first does the j'th, then the insertion sort is messed up and the sort fails.
So the way to make this work is to rewrite the shell sort to make the parallelism more explicit - to not break up the insertion sort into a bunch of serial steps.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
void insertionsort(int a[], int n, int stride) {
for (int j=stride; j<n; j+=stride) {
int key = a[j];
int i = j - stride;
while (i >= 0 && a[i] > key) {
a[i+stride] = a[i];
i-=stride;
}
a[i+stride] = key;
}
}
void shellsort(int a[], int n)
{
int i, m;
for(m = n/2; m > 0; m /= 2)
{
#pragma omp parallel for shared(a,m,n) private (i) default(none)
for(i = 0; i < m; i++)
insertionsort(&(a[i]), n-i, m);
}
}
void printlist(char *s, int a[], int n) {
printf("%s\n",s);
for (int i=0; i<n; i++) {
printf("%d ", a[i]);
}
printf("\n");
}
int checklist(int a[], int n) {
int result = 0;
for (int i=0; i<n; i++) {
if (a[i] != i) {
result++;
}
}
return result;
}
void seedprng() {
struct timeval t;
/* seed prng */
gettimeofday(&t, NULL);
srand((unsigned int)(1000000*(t.tv_sec)+t.tv_usec));
}
int main(int argc, char **argv) {
const int n=100;
int *data;
int missorted;
data = (int *)malloc(n*sizeof(int));
for (int i=0; i<n; i++)
data[i] = i;
seedprng();
/* shuffle */
for (int i=0; i<n; i++) {
int i1 = rand() % n;
int i2 = rand() % n;
int tmp = data[i1];
data[i1] = data[i2];
data[i2] = tmp;
}
printlist("Unsorted List:",data,n);
shellsort(data,n);
printlist("Sorted List:",data,n);
missorted = checklist(data,n);
if (missorted != 0) printf("%d missorted numbers\n",missorted);
return 0;
}
Variables "j" and "i" need to be declared private on the parallel region. As it is now, I am surprised anything is happening, because "m" can not be private. The critical region is allowing it to work for the "i" loop, but the critical region should be able to be reduced - though I haven't done a shell sort in a while.

Resources