My benchmark results are very strange. On the one hand I have a serial function that computes the quadratic form. On the other hand I wrote two parallel versions. With one thread, all functions should take more or less the same running time, yet one parallel function takes only about half the time. Is there a "hidden" optimization?
Serial version:
double quadratic_form_serial(const std::vector<double> & A, const std::vector<double> & v, const std::vector<double> & w){
    int N = v.size();
    volatile double q = 0.0;
    for(int i=0; i<N; ++i)
        for(int j=0; j<N; ++j)
            q += v[i]*A[i*N+j]*w[j];
    return q;
}
Parallel version 1:
double quadratic_form_parallel(const std::vector<double> & A, const std::vector<double> & v, const std::vector<double> & w, const int threadnum){
    int N = v.size();
    omp_set_num_threads(threadnum);
    volatile double q[threadnum];
    volatile double val = 0.0;
    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        q[me] = 0.0;
        #pragma omp for collapse(2)
        for(int i=0; i<N; ++i)
            for(int j=0; j<N; ++j)
                q[me] += v[i]*A[i*N+j]*w[j];
        #pragma omp atomic
        val += q[me];
    }
    return val;
}
Parallel version 2:
double quadratic_form_parallel2(const std::vector<double> & A, const std::vector<double> & v, const std::vector<double> & w, const int threadnum){
    int N = v.size();
    volatile double result = 0.0;
    omp_set_num_threads(threadnum);
    #pragma omp parallel for reduction(+: result)
    for (int i=0; i<N; ++i)
        for (int j=0; j<N; ++j)
            result += v[i] * A[i*N + j] * w[j];
    return result;
}
I ran the code for N = 10000 and flushed the cache before calling each function. With one thread, quadratic_form_parallel2 needs less than half the time the other two functions need:
threads    serial       Parallel1    Parallel2
1          0.0882503    0.0875649    0.0313441
Most likely this is the result of result being a reduction variable in the second OpenMP version. This means that each thread gets a private copy of result which is merged after the parallel region. This private copy probably does not carry the volatile qualification and can therefore be optimized more aggressively. I assume the detailed interaction between volatile and private is unspecified.
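Conceptually, the reduction clause behaves roughly like the following hand-written expansion (a sketch only, not the exact code any particular compiler emits; A, v, w and N are the variables from the question):

double result = 0.0;                 // shared
#pragma omp parallel
{
    double local = 0.0;              // plain double, no volatile semantics
    #pragma omp for nowait
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            local += v[i] * A[i*N + j] * w[j];
    #pragma omp atomic
    result += local;                 // partial sums merged at the end
}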
This shows that marking a variable as volatile, presumably to keep the compiler from optimizing away the entire computation, is a bad idea. Instead, just use the result, for example by printing it.
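For example, here is a sketch of the serial version without volatile; the caller keeps the computation alive simply by printing (or otherwise consuming) the returned value, while the compiler stays free to vectorize and reorder the loop:

#include <vector>
#include <iostream>

double quadratic_form(const std::vector<double>& A,
                      const std::vector<double>& v,
                      const std::vector<double>& w) {
    const int N = v.size();
    double q = 0.0;                       // plain double, no volatile
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            q += v[i] * A[i*N + j] * w[j];
    return q;
}

// In the benchmark driver, consume the result so the computation
// cannot be discarded as dead code:
//   double q = quadratic_form(A, v, w);
//   std::cout << q << '\n';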
Related
I'm trying to utilize my Nvidia GeForce GT 740M for parallel programming using OpenMP and the clang-3.8 compiler.
When processed in parallel on the CPU, I manage to get the desired result. However, when processed on the GPU, my results are almost random numbers.
Therefore, I figured that I'm not distributing my thread teams correctly and that there might be some data races. I guess I have to structure my for-loops differently, but I have no idea where the mistake could be.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    const int n = 100;
    float a = 3.0f;
    float b = 2.0f;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    int i;
    int j;
    int k;
    double start;
    double end;

    start = omp_get_wtime();
    for (k=0; k<n; k++){
        x[k] = 2.0f;
        y[k] = 3.0f;
    }

    #pragma omp target data map(to:x[0:n]) map(tofrom:y[0:n]) map(to:i) map(to:j)
    {
        #pragma omp target teams
        #pragma omp distribute
        for(i = 0; i < n; i++) {
            #pragma omp parallel for
            for (j = 0; j < n; j++){
                y[j] = a*x[j] + y[j];
            }
        }
    }
    end = omp_get_wtime();

    printf("Work took %f seconds.\n", end - start);
    free(x); free(y);
    return 0;
}
I guess it might have something to do with the architecture of my GPU, so I'm adding this:
I'm fairly new to the topic, so thanks for your help :)
Yes, there is a race here: different teams are reading and writing the same elements of the array y. Perhaps you want something like this?
for(i = 0; i < n; i++) {
    #pragma omp target teams distribute parallel for
    for (j = 0; j < n; j++){
        y[j] = a*x[j] + y[j];
    }
}
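For reference, one way the fix could be slotted back into the question's program is sketched below; it reuses the variables x, y, a, n, i and j from the code above, keeps the outer i loop on the host, and launches one combined target construct per iteration:

#pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
{
    for (i = 0; i < n; i++) {
        /* each j is handled by exactly one thread, so no two threads
           read and write the same y[j] within a launch */
        #pragma omp target teams distribute parallel for
        for (j = 0; j < n; j++) {
            y[j] = a * x[j] + y[j];
        }
    }
}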
I have these uninformative nested loops (just as a performance test):
const int N = 300;
for (int num = 0; num < 10000; num++) {
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was
about 6 s
I tried to parallelize different loops with OpenMP, but I am very confused by the results I got.
In the first step I used the "parallel for" pragma only for the first (outermost) loop:
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was (2 cores)
3.81
Then I tried to parallelize the two inner loops with the "collapse" clause (2 cores):
for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for collapse(2) schedule(static) reduction(+:sum1, sum2)
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was
3.76
This is faster than in the previous case, and I do not understand why.
If I fuse these inner loops (which is supposed to be better for performance) like this
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int n = 0; n < N * N; n++) {
    int i = n / N; int j = n % N;
the elapsed time is
5.53
This confuses me a lot. The performance is worse in this case, even though people usually advise fusing loops for better performance.
Okay, now let's try to parallelize only the middle loop like this (2 cores):
for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
Again, the performance becomes better:
3.703
And the final step: parallelizing only the innermost loop (assuming this will be the fastest case according to the previous results) (2 cores):
for (int num = 0; num < 10000; num++) {
    for (int i=0; i<N; i++) {
        #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
        for (int j=0; j<N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
But (surprise!) the elapsed time is
about 11 s
This is much slower than in the previous cases. I cannot figure out the reason for all of this.
By the way, I was looking for similar questions, and I found the advice of adding
#pragma omp parallel
before the first loop (for example, in this and that question). But why is that the right procedure? If we place
#pragma omp parallel
before a for-loop, it means that each thread executes the for-loop completely, which is incorrect (excess work). Indeed, I tried to insert
#pragma omp parallel
before the outermost loop with different placements of
#pragma omp parallel for
as I am describing here, and the performance was worse in all cases (moreover, in the last case, when parallelizing only the innermost loop, the answer was also incorrect: "sum2" was different because of a race condition).
I would like to know the reasons for this performance behavior (probably the time spent on data exchange is greater than the actual computation time on each thread, but that only applies to the last case) and which solution is the most correct one.
EDIT: I disabled compiler optimizations (with the -O0 option) and the results are still the same (except that the elapsed time of the last example, parallelizing the innermost loop, dropped from 11 s to 8 s).
Compiler options:
g++ -std=gnu++0x -fopenmp -O0 test.cpp
Definition of variables:
unsigned int seed;
const int N = 300;

int main()
{
    double arr[N][N];
    double brr[N][N];
    for (int i=0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = i * j;
            brr[i][j] = i + j;
        }
    }

    double start = omp_get_wtime();
    double crr[N][N];
    double sum1 = 0;
    double sum2 = 0;
And the final step: parallelizing only the innermost loop (assuming this will be the fastest case according to the previous results) (2 cores)
But (surprise!) the elapsed time is:
about 11 s
It is not a surprise at all. Parallel regions incur implicit barriers and may even join and re-create threads (some implementations use thread pools to reduce the cost of thread creation).
In the end, opening a parallel region is expensive, so you should do it as few times as possible. The threads will all run the outer loop at the same time, but they divide the iteration space among themselves once they reach the omp for block, so the result should still be correct (make your program check this if you are unsure).
For performance testing, you should always run your experiments with compiler optimizations turned on, as they have a heavy impact on the behavior of the application (you should not draw performance conclusions from unoptimized programs, because their problems may already be addressed by the optimizer).
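For instance, something like this (a guess based on the compile command shown in the question, only raising the optimization level):
g++ -std=gnu++0x -fopenmp -O2 test.cpp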
Making a single parallel region that contains all the loops halves the execution time in my setup (from 9.536 s with 2 threads down to 4.757 s).
The omp for block still implies a barrier at its end, which is not needed in your example. Adding the nowait clause reduces the execution time by another half: 2.120 s.
From this point, you can now try to explore the other options.
Parallelizing the middle loop (as in the code below) reduces the execution time to only 0.732 s thanks to much better usage of the memory hierarchy and vectorization; the L1 miss ratio drops from ~29% to ~0.3%.
Using collapse on the two innermost loops made no noticeable difference with two threads (strong scaling should be checked).
Using other directives such as omp simd does not improve performance in this case, as the compiler is already confident that it can vectorize the innermost loop safely.
#pragma omp parallel reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    #pragma omp for schedule(static) nowait
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
Note: L1 miss ratio computed using perf:
$ perf stat -e cache-references,cache-misses -r 3 ./test
Since variables in parallel programs are shared among threads (cores), you should consider how the processor's cache memory comes into play; your code might be running with false sharing, which can hurt performance.
In your first parallel version you put #pragma omp parallel for on the outermost loop, which means each thread has its own i and j. Compare this with the second and third versions (which differ only by collapse), where the second for loop is parallelized, so each i has its own j. Those two perform better because each thread/core hits the cache lines for j more often. The fourth version is a complete disaster for the processor caches because nothing is shared there.
I recommend measuring your code with Intel's PCM or PAPI in order to get a proper analysis.
Regards.
I am working on parallelizing a vector and matrix class and have run into an issue. Any time I have a loop of the form
for (int i = 0; i < n; i++)
    b[i] += a[i];
the compiler reports a data dependency and will not parallelize it. The Intel compiler is smart enough to handle this without any pragmas. (I would like to avoid the pragma that disables the dependency check, both because of the vast number of similar loops and because the real cases are more complicated than this one, so I would like the compiler to keep checking in case a dependency really does exist.)
Does anyone know of a compiler flag for the PGI compiler that would allow this?
Thank you,
Justin
edit: Error in the for loop. I wasn't copy-pasting an actual loop.
I think the problem is that you're not using the restrict keyword in these routines, so the C compiler has to worry about pointer aliasing.
Compiling this program:
#include <stdlib.h>
#include <stdio.h>

void dbpa(double *b, double *a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i];
    return;
}

void dbpa_restrict(double *restrict b, double *restrict a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i];
    return;
}

int main(int argc, char **argv) {
    const int n = 10000;
    double *a = malloc(n*sizeof(double));
    double *b = malloc(n*sizeof(double));
    for (int i=0; i<n; i++) {
        a[i] = 1;
        b[i] = 2;
    }

    dbpa(b, a, n);
    double error = 0.;
    for (int i=0; i<n; i++)
        error += (3 - b[i]);
    if (error < 0.1)
        printf("Success\n");

    dbpa_restrict(b, a, n);
    error = 0.;
    for (int i=0; i<n; i++)
        error += (4 - b[i]);
    if (error < 0.1)
        printf("Success\n");

    free(b);
    free(a);
    return 0;
}
with the PGI compiler:
$ pgcc -o tryautop tryautop.c -Mconcur -Mvect -Minfo
dbpa:
5, Loop not vectorized: data dependency
dbpa_restrict:
11, Parallel code generated with block distribution for inner loop if trip count is greater than or equal to 100
main:
21, Loop not vectorized: data dependency
28, Loop not parallelized: may not be beneficial
36, Loop not parallelized: may not be beneficial
tells us that the dbpa() routine without the restrict keyword wasn't parallelized, but the dbpa_restrict() routine was.
Really, though, for this sort of thing you're better off just using OpenMP (or TBB or ABB or ...) rather than trying to convince the compiler to autoparallelize for you; probably better still is just to use an existing linear algebra package, dense or sparse depending on what you're doing.
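As a hedged sketch of that suggestion, here is the same loop with an explicit OpenMP directive (the function name dbpa_omp is made up for illustration; with the PGI compiler the OpenMP flag is -mp rather than -Mconcur, if I recall correctly):

void dbpa_omp(double *b, const double *a, const int n) {
    /* the directive asserts that the iterations are independent, so the
       compiler does not need to prove anything about aliasing */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] += a[i];
}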
I would like to fill histograms in parallel using OpenMP. I have come up with two different methods of doing this with OpenMP in C/C++.
The first method, proccess_data_v1, makes a private histogram hist_private for each thread, fills them in parallel, and then sums the private histograms into the shared histogram hist in a critical section.
The second method, proccess_data_v2, makes a shared array of histograms with size equal to the number of threads, fills this array in parallel, and then sums it into the shared histogram hist in parallel.
The second method seems superior to me since it avoids a critical section and sums the histograms in parallel. However, it requires knowing the number of threads and calling omp_get_thread_num(), which I generally try to avoid. Is there a better way to do the second method without referencing the thread numbers and without using a shared array whose size equals the number of threads?
void proccess_data_v1(float *data, int *hist, const int n, const int nbins, float max) {
    #pragma omp parallel
    {
        int *hist_private = new int[nbins];
        for(int i=0; i<nbins; i++) hist_private[i] = 0;

        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(hist_private, nbins, max, x);
        }

        #pragma omp critical
        {
            for(int i=0; i<nbins; i++) {
                hist[i] += hist_private[i];
            }
        }
        delete[] hist_private;
    }
}
void proccess_data_v2(float *data, int *hist, const int n, const int nbins, float max) {
    const int nthreads = 8;
    omp_set_num_threads(nthreads);
    int *hista = new int[nbins*nthreads];

    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        for(int i=0; i<nbins; i++) hista[nbins*ithread+i] = 0;

        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[nbins*ithread], nbins, max, x);
        }

        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[nbins*t + i];
            }
        }
    }
    delete[] hista;
}
Based on a suggestion by @HristoIliev I have created an improved method called proccess_data_v3:
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
void proccess_data_v3(float *data, int *hist, const int n, const int nbins, float max) {
    int* hista;
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();

        int lda = ROUND_DOWN(nbins+1023, 1024);  //1024 ints = 4096 bytes -> round to a multiple of page size
        #pragma omp single
        hista = (int*)_mm_malloc(lda*sizeof(int)*nthreads, 4096);  //align memory to page size

        for(int i=0; i<nbins; i++) hista[lda*ithread+i] = 0;

        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[lda*ithread], nbins, max, x);
        }

        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[lda*t + i];
            }
        }
    }
    _mm_free(hista);
}
You could allocate the big array inside the parallel region, where you can query the actual number of threads being used:
int *hista;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();

    #pragma omp single
    hista = new int[nbins*nthreads];

    ...
}
delete[] hista;
For better performance I would advise that you round the size of each thread's chunk in hista to a multiple of the system's memory page size, even if this could potentially leave holes between the different partial histograms. This way you will prevent both false sharing and remote memory access on NUMA systems (but not in the final reduction phase).
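As a side note (a sketch, assuming a compiler that supports OpenMP 4.5 or newer): an array-section reduction removes the need for thread IDs and for the shared scratch array entirely. reconstruct_data() and fill_hist() are the question's own helpers; the function name proccess_data_v4 is made up here.

void proccess_data_v4(float *data, int *hist, const int n, const int nbins, float max) {
    // each thread gets a private, zero-initialized copy of hist[0..nbins-1];
    // the copies are summed into the shared hist when the loop finishes
    #pragma omp parallel for reduction(+: hist[:nbins])
    for (int i = 0; i < n; i++) {
        float x = reconstruct_data(data[i]);
        fill_hist(hist, nbins, max, x);
    }
}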
Is anyone familiar with OpenMP? I don't get a sorted list; what am I doing wrong? I am using critical at the end so that only one thread can access that section while it is being sorted. I guess my private values are not correct. Should they even be there, or am I better off with just #pragma omp for?
void shellsort(int a[])
{
    int i, j, k, m, temp;
    omp_set_num_threads(10);
    for(m = 2; m > 0; m = m/2)
    {
        #pragma omp parallel for private (j, m)
        for(j = m; j < 100; j++)
        {
            #pragma omp critical
            for(i = j-m; i >= 0; i = i-m)
            {
                if(a[i+m] >= a[i])
                    break;
                else
                {
                    temp = a[i];
                    a[i] = a[i+m];
                    a[i+m] = temp;
                }
            }
        }
    }
}
So there's a number of issues here.
So first, as has been pointed out, i and j (and temp) need to be private; m and a need to be shared. A useful thing to do with OpenMP is to use default(none); that way you are forced to think through what each variable you use in the parallel section does and what it needs to be. So this
#pragma omp parallel for private (i,j,temp) shared(a,m) default(none)
is a good start. Making m private in particular is a bit of a disaster, because it means that m is undefined inside the parallel region. The loop, by the way, should start with m = n/2, not m=2.
In addition, you don't need the critical region -- or you shouldn't, for a shell sort. The issue, as we'll see in a second, is not so much multiple threads working on the same elements. So if you get rid of those things, you end up with something that almost works, but not always. And that brings us to the more fundamental problem.
The way a shell sort works is, basically, that you break the array up into many (here, m) subarrays and insertion-sort them (very fast for small arrays), then reassemble; you then continue by breaking the array into fewer and fewer subarrays and insertion-sorting again (very fast, because they're partly sorted). Sorting those many subarrays is something that can be done in parallel. (In practice, memory contention will be a problem with this simple approach, but still.)
Now, the code you've got does that in serial, but it can't be counted on to work if you just wrap the j loop in an omp parallel for. The reason is that each iteration of the j loop does one step of one of the insertion sorts. The (j+m)'th iteration does the next step. But there's no guarantee that they're done by the same thread, or in order! If another thread has already done the (j+m)'th iteration before the first one does the j'th, the insertion sort is messed up and the sort fails.
So the way to make this work is to rewrite the shell sort to make the parallelism more explicit, not breaking the insertion sort up into a bunch of serial steps.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

void insertionsort(int a[], int n, int stride) {
    for (int j=stride; j<n; j+=stride) {
        int key = a[j];
        int i = j - stride;
        while (i >= 0 && a[i] > key) {
            a[i+stride] = a[i];
            i -= stride;
        }
        a[i+stride] = key;
    }
}

void shellsort(int a[], int n)
{
    int i, m;
    for(m = n/2; m > 0; m /= 2)
    {
        #pragma omp parallel for shared(a,m,n) private(i) default(none)
        for(i = 0; i < m; i++)
            insertionsort(&(a[i]), n-i, m);
    }
}
void printlist(char *s, int a[], int n) {
    printf("%s\n", s);
    for (int i=0; i<n; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
}

int checklist(int a[], int n) {
    int result = 0;
    for (int i=0; i<n; i++) {
        if (a[i] != i) {
            result++;
        }
    }
    return result;
}

void seedprng() {
    struct timeval t;
    /* seed prng */
    gettimeofday(&t, NULL);
    srand((unsigned int)(1000000*(t.tv_sec)+t.tv_usec));
}
int main(int argc, char **argv) {
    const int n=100;
    int *data;
    int missorted;

    data = (int *)malloc(n*sizeof(int));
    for (int i=0; i<n; i++)
        data[i] = i;

    seedprng();

    /* shuffle */
    for (int i=0; i<n; i++) {
        int i1 = rand() % n;
        int i2 = rand() % n;
        int tmp = data[i1];
        data[i1] = data[i2];
        data[i2] = tmp;
    }

    printlist("Unsorted List:", data, n);
    shellsort(data, n);
    printlist("Sorted List:", data, n);

    missorted = checklist(data, n);
    if (missorted != 0) printf("%d missorted numbers\n", missorted);

    return 0;
}
Variables "j" and "i" need to be declared private on the parallel region. As it is now, I am surprised anything is happening, because "m" can not be private. The critical region is allowing it to work for the "i" loop, but the critical region should be able to be reduced - though I haven't done a shell sort in a while.