OpenMP parallelized code in C++ is slower than the serial one

I would like to parallelise the following code
for (i = 0; i < M; i++) {
    temp = 0.;
    for (j = 0; j < N; j++) {
        temp += A[i][j] * v[j];
    }
    u[i] = temp;
}
where the different u[i] can be computed independently. So I tried:
#pragma omp parallel for private(j, temp)
for (i = 0; i < M; i++) {
    temp = 0.;
    for (j = 0; j < N; j++) {
        temp += A[i][j] * v[j];
    }
    u[i] = temp;
}
and I find that the second version is slower than the first one. Any idea why that is? Here M ~ 100 and N ~ 2.

There is no benefit to parallelizing this because the data are so small: the overhead of setting up and coordinating a parallel computation outweighs any benefit of parallel execution.
Further, the calculation is easily optimized.
Your compiler will most certainly unroll the inner loop (where N=2), so that it becomes:
u[i] = A[i][0]*v[0] + A[i][1]*v[1];
Even if we assume each element of A is a 64-bit type (like double), the outer loop over i can be unrolled 4 times, so that each access to A uses a single 64-byte cache line (for the 8 entries A[i][0] through A[i+3][1]).
The result is a single loop with only 25 iterations (for M=100):
for (i=0; i<M; i+= 4)
{
u[i] = A[i][0]*v[0] + A[i][1] * v[1];
u[i+1] = A[i+1][0]*v[0] + A[i+1][1] * v[1];
u[i+2] = A[i+2][0]*v[0] + A[i+2][1] * v[1];
u[i+3] = A[i+3][0]*v[0] + A[i+3][1] * v[1];
}
With prefetching, all accesses are basically already in cache.
You would need much bigger data to see any benefit from parallelizing this :)
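If the same routine is sometimes called with much larger M and N, one option is OpenMP's if clause, which keeps the loop serial below a size threshold. A minimal sketch, where THRESHOLD is a made-up placeholder to be tuned by measurement on the target machine:

#define THRESHOLD 10000   /* made-up cutoff: roughly the M*N below which serial wins */

#pragma omp parallel for private(j, temp) if (M * N > THRESHOLD)
for (i = 0; i < M; i++) {
    temp = 0.;
    for (j = 0; j < N; j++)
        temp += A[i][j] * v[j];
    u[i] = temp;
}

With M ~ 100 and N ~ 2 the condition is false, so the loop runs serially and most of the threading overhead is avoided.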

Related

OpenMP: Do I have a race condition or false sharing?

I'm trying to write code for matrix multiplication. As far as I understand OpenMP and parallel programming, this code may suffer from a race condition.
#pragma omp parallel
#pragma omp for
for (int k = 0; k < size; k++) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
Do I get rid of it if I put #pragma omp atomic before the write to the c matrix, or by adding private(i) to the second #pragma? Also, is it possible to make this code free of false sharing? If yes, how?
A race condition occurs when two or more threads access the same memory location and at least one of them writes to it. The line c[i][j] += ... can cause a data race in your code. The solution is to reorder your nested loops (use the i, j, k order), and you may introduce a temporary variable to calculate the dot product:
#pragma omp parallel for
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        double tmp = 0; // change its type as needed
        for (int k = 0; k < size; k++) {
            tmp += a[i][k] * b[k][j];
        }
        c[i][j] = tmp; // note that += was used in your original code
    }
}
Note that your code will be faster if you calculate the transpose of matrix b first, so that the innermost loop reads both operands contiguously.
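For illustration, a minimal sketch of that idea; bt is a scratch matrix introduced here (not part of the original code), assumed to be declared like b:

/* one-off transpose of b so the dot-product loop reads both operands row-wise */
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        bt[j][i] = b[i][j];

#pragma omp parallel for
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        double tmp = 0;
        for (int k = 0; k < size; k++)
            tmp += a[i][k] * bt[j][k];   /* both accesses are contiguous in k */
        c[i][j] = tmp;
    }
}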
UPDATE:
If you need to maintain the loop order, there are two possibilities (but these solutions may be slower than the serial code):
Use an atomic operation (i.e. #pragma omp atomic). In this case false sharing can also be a problem.
If your stack is large enough to store the matrix for all threads, a better alternative is to use reduction: #pragma omp parallel for reduction(+:c[:size][:size]) (Another alternative is to do the reduction manually; in this case you can allocate the matrices used for the reduction on the heap, as sketched below.)
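A minimal sketch of that manual reduction, keeping the original k-i-j loop order; local is a per-thread, heap-allocated copy of c introduced here for illustration (needs <stdlib.h>, error handling omitted):

#pragma omp parallel
{
    /* per-thread accumulator, zero-initialized; freed at the end of the region */
    double *local = (double *) calloc((size_t) size * size, sizeof *local);

    #pragma omp for
    for (int k = 0; k < size; k++)
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
                local[i * size + j] += a[i][k] * b[k][j];

    #pragma omp critical        /* serialize only the final merge into c */
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            c[i][j] += local[i * size + j];

    free(local);
}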

How to openmp parallelize for loop that increments two variables

As a first step into OpenMP I set myself a challenge to parallelize some matrix decomposition algorithm. I picked Crout with pivoting, source can be found here:
http://www.mymathlib.com/c_source/matrices/linearsystems/crout_pivot.c
At the bottom of that decomposition function there's an outer for loop that walks over i and p_row at the same time. Of course OpenMP is as confused as I am when looking at this and refuses to do anything with it.
After wrapping my mind around it I think I got it untangled into readable form:
p_row = p_k + n;
for (i = k+1; i < n; i++) {
for (j = k+1; j < n; j++) *(p_row + j) -= *(p_row + k) * *(p_k + j);
p_row += n;
}
At this point a serial run still comes up with the same result as the original code.
Then I add some pragmas, like this:
p_row = p_k + n;
#pragma omp parallel for private (i,j) shared (n,k,p_row,p_k)
for (i = k+1; i < n; i++) {
for (j = k+1; j < n; j++) *(p_row + j) -= *(p_row + k) * *(p_k + j);
#pragma omp critical
p_row += n;
#pragma omp flush(p_row)
}
Yet the results are essentially random.
What am I missing?
I haven't tested your adaptation of the original code, but your program has several problems.
#pragma omp parallel for private (i,j) shared (n,k,p_row,p_k)
The default behavior is for variables declared outside the parallel region to be shared, so the shared clause is redundant.
But these variables should not be shared; they should be made private:
n is unchanged during the iterations, so it is better to have a local copy;
ditto for k and p_k;
p_row is modified, but you really want several copies of p_row. This is what ensures proper parallel processing, so that each thread works on different rows. The problem is how to compute the value of p_row in the different threads.
In the outer loop, iteration 0 uses p_row, the second iteration uses p_row+n, and iteration l uses p_row+l*n. Your iterations will be spread over several threads. Assume each thread processes m iterations: thread 0 will process i=k+1 to i=m+(k+1) and p_row to p_row+m*n, thread 1 will process i=m+1+(k+1) to i=2m+(k+1) and p_row+n*(m+1) to p_row+2*m*n, and so on. Hence the value that p_row should have at the start of each iteration can be computed from the value of i.
Here is a possible implementation
p_row = p_k + n;
// firstprivate ensures the initial values are kept
#pragma omp parallel for private(i, j) firstprivate(n, k, p_row, p_k)
for (i = k+1, p_row = p_row + (i-(k+1))*n; i < n; i++, p_row += n) {
    for (j = k+1; j < n; j++)
        *(p_row + j) -= *(p_row + k) * *(p_k + j);
}
The p_row incrementation is now in the for statement itself, so this should continue to work in a sequential environment.
The critical is useless (and was buggy in your previous code). A flush is implicit at the end of a parallel region (and the pragma is just "omp flush", by the way).
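One caveat about the loop above: OpenMP requires the loop following a parallel for to be in canonical form, where the init and increment expressions may only set and step the loop variable itself, so compilers will typically reject the comma expressions. A sketch of a variant that stays in canonical form and derives a private p_row from i inside the body (same arithmetic as described above: row i lives at p_k + (i - k)*n):

#pragma omp parallel for private(i, j, p_row) firstprivate(n, k, p_k)
for (i = k+1; i < n; i++) {
    p_row = p_k + (i - k) * n;          /* address the serial code reaches at iteration i */
    for (j = k+1; j < n; j++)
        *(p_row + j) -= *(p_row + k) * *(p_k + j);
}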

Nested loop in OpenMP performance issue

I have these uninformative nested loops (just as a performance test):
const int N = 300;
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was
about 6 s
I tried to parallelize different loops with OpenMP, but I am very confused by the results I got.
In the first step I used the "parallel for" pragma only on the first (outermost) loop:
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was (2 cores)
3.81
Then I tried to parallelize two inner loops with "collapse" clause (2 cores):
for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for collapse(2) schedule(static) reduction(+:sum1, sum2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was
3.76
This is faster than the previous case, and I do not understand the reason for this.
If I fuse these inner loops (which is supposed to be better for performance), like this:
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int n = 0; n < N * N; n++) {
    int i = n / N; int j = n % N;
    // ... same loop body as above, indexed with i and j
}
the elapsed time is
5.53
This confuses me a lot: the performance is worse in this case, even though people usually advise fusing loops for better performance.
Okay, now let's try to parallelize only the middle loop, like this (2 cores):
for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
Again, the performance becomes better:
3.703
And the final step - parallelization of the innermost loop only (assuming that this will be the fastest case according to the previous results) (2 cores):
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
But (surprise!) the elapsed time is
about 11 s
This is much slower than in the previous cases. I cannot figure out the reason for all of this.
By the way, I was looking at similar questions, and I found the advice of adding
#pragma omp parallel
before the first loop (for example, in this and that question). But why is that the right procedure? If we place
#pragma omp parallel
before a for loop, it means that each thread executes the for loop completely, which is incorrect (excess work). Indeed, I tried to insert
#pragma omp parallel
before the outermost loop with different placements of
#pragma omp parallel for
as I am describing here, and the performance was worse in all cases (moreover, in the last case, when parallelizing the innermost loop only, the answer was also incorrect: "sum2" was different, as there was a race condition).
I would like to know the reasons for this performance behavior (probably the time spent on data exchange is greater than the time of actual computation on each thread, but that would only explain the last case) and which solution is the most correct one.
EDIT: I've disabled compiler optimization (with the -O0 option) and the results are still the same (except that the elapsed time in the last example (parallelizing the innermost loop) dropped from 11 s to 8 s).
Compiler options:
g++ -std=gnu++0x -fopenmp -O0 test.cpp
Definition of variables:
unsigned int seed;
const int N = 300;
int main()
{
double arr[N][N];
double brr[N][N];
for (int i=0; i < N; i++) {
for (int j = 0; j < N; j++) {
arr[i][j] = i * j;
brr[i][j] = i + j;
}
}
double start = omp_get_wtime();
double crr[N][N];
double sum1 = 0;
double sum2 = 0;
And the final step - parallelization of the innermost loop only (assuming that this will be the fastest case according to the previous results) (2 cores)
But (surprise!) the elapsed time is:
about 11 s
It is not a surprise at all. Parallel blocks have implicit barriers and may even create and join threads (some libraries use thread pools to reduce the cost of thread creation).
In the end, opening parallel regions is expensive. You should do it as few times as possible. The threads will all run the outer loop at the same time, but will divide the iteration space once they reach the omp for block, so the result should still be correct (you should make your program check this if you are unsure).
For testing performance, you should always run your experiments with compiler optimizations turned on, as they have a heavy impact on the behavior of the application (you should not draw conclusions about performance from unoptimized programs, because their problems may already be addressed during optimization).
When making a single parallel block that contains all the loops, the execution time is halved in my setup (started with 9.536s using 2 threads, and reduced to 4.757s).
The omp for block still applies an implicit barrier, which is not needed in your example. Adding the nowait clause to the example reduces the execution time by another half: 2.120s.
From this point, you can now try to explore the other options.
Parallelizing the middle loop reduces the execution time to only 0.732s due to much better usage of the memory hierarchy and vectorization; the L1 miss ratio drops from ~29% to ~0.3%.
Using collapse with the two innermost loops made no significant difference with two threads (strong scaling should be checked).
Using other directives such as omp simd does not improve performance in this case, as the compiler is sure enough that it can vectorize the innermost loop safely.
#pragma omp parallel reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
Note: L1 miss ratio computed using perf:
$ perf stat -e cache-references,cache-misses -r 3 ./test
Since variables in parallel programming are shared among threads (cores), you should consider how the processor's cache memory comes into play. At this point your code might be executing with false sharing, which could hurt performance.
In your first parallel version, you put the pragma right at the first for, which means each thread has its own i and j. Compare this with the second and third versions (differing only in the collapse clause), which parallelize the second for, so that each i has its own set of j. These two versions do better because each thread/core more often hits the cache lines belonging to its own range of j. The fourth version is a complete disaster for the processor caches, because there is almost nothing left for each parallel region to reuse.
I recommend measuring your code with Intel's PCM or PAPI in order to get a proper analysis.
Regards.
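To make the false-sharing point concrete (the code above already protects sum1 and sum2 with a reduction clause, so this is only an illustration of the general effect): if per-thread partial sums were kept in a plain array, neighbouring slots would share a cache line and that line would bounce between cores; padding each slot to a full cache line avoids it. The 64-byte line size, MAX_THREADS and sum_rows are assumptions made for this sketch:

#include <omp.h>

#define CACHE_LINE  64
#define MAX_THREADS 64

struct padded_sum {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];  /* keeps each slot on its own cache line */
};

double sum_rows(double crr[][300], int n)     /* N = 300 as in the question */
{
    struct padded_sum partial[MAX_THREADS] = {{0}};

    #pragma omp parallel
    {
        int t = omp_get_thread_num();         /* each thread writes only its own slot */
        #pragma omp for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                partial[t].value += crr[i][j];
    }

    double sum = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        sum += partial[t].value;
    return sum;
}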

cpumemory.pdf - cache optimized matrix multiplication

I'm reading cpumemory.pdf by Ulrich Drepper and I'm unable to understand the following part about optimizing cache access in matrix multiplication from chapter 6.2.1 (pages 49-50):
First, the naive method for matrix multiplication is shown:
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
for (k = 0; k < N; ++k)
res[i][j] += mul1[i][k] * mul2[k][j];
mul2 is accessed by columns so for each column one cache line is wasted. Ulrich says:
With sizeof(double) being 8 this means that, to fully utilize the cache line, we should unroll the middle loop 8 times.
For brevity I unrolled the middle loop only 2 times.
for (i = 0; i < N; ++i)
for (j = 0; j < N; j += 2)
for (k = 0; k < N; ++k) {
res[i][j+0] += mul1[i][k] * mul2[k][j+0];
res[i][j+1] += mul1[i][k] * mul2[k][j+1];
}
Now it's obvious that if the cache line were 2 double values wide it would be fully utilized. But then Ulrich continues:
Continuing this thought, to effectively use the res matrix as well, i.e., to write 8 results at the same time, we should unroll the outer loop 8 times as well.
For brevity I unrolled the outer loop only 2 times again.
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            res[i+0][j+0] += mul1[i+0][k] * mul2[k][j+0];
            res[i+0][j+1] += mul1[i+0][k] * mul2[k][j+1];
            res[i+1][j+0] += mul1[i+1][k] * mul2[k][j+0];
            res[i+1][j+1] += mul1[i+1][k] * mul2[k][j+1];
        }
To me it seems even worse than the previous version, because now mul1 is accessed by columns. Please explain what Ulrich meant.
There are three matrices inside the cache: the left input, the right input, and the result.
The left input is accessed just fine by the original code: it is row-major and the innermost loop increments k, so it marches along a cache line. The second matrix is accessed well after the single unrolling, because now all the columns in a cache line are used before the line is evicted.
The question is the result matrix. It is also row-major, but its cache line is indexed by j, not by k, and j has already been unrolled, so all the elements on a cache line within the result matrix are used. So there doesn't appear to be anything gained by the second unrolling: all it does is touch two extra cache lines, an extra one for the left matrix and an extra one for the result matrix. It doesn't improve the coverage of elements of any cache line!
However, it does happen to reuse the right matrix's cache line twice. That reduces the total number of times the lines of the right matrix have to be brought in, and it does not increase the number of times the left and right matrix cache lines will be brought in. So perhaps that reuse of the entire line is where the advantage comes from. I guess the question is whether this is properly blocked for the cache size, and what the set associativity of the cache is: if all lines of all three matrices stay in the cache anyway, then this has no advantage (but it doesn't make anything worse!).
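To make the cache-line counting concrete, here is one compact way to write the 8-by-8 unrolling the quoted text describes, using small fixed-trip-count loops that a compiler will normally unroll. It assumes 64-byte cache lines (8 doubles per line) and that N is a multiple of 8:

for (i = 0; i < N; i += 8)
    for (j = 0; j < N; j += 8)
        for (k = 0; k < N; ++k)
            for (int i2 = 0; i2 < 8; ++i2)       /* 8 rows of res and mul1 */
                for (int j2 = 0; j2 < 8; ++j2)   /* 8 columns: one cache line of res and mul2 */
                    res[i + i2][j + j2] += mul1[i + i2][k] * mul2[k][j + j2];

For each k, the single cache line holding mul2[k][j..j+7] is now reused by all 8 rows of res before anything can evict it (twice in the 2x-unrolled version above, eight times here), while each line of mul1 is still reused across 8 consecutive values of k. That is the reuse of the right matrix's line the answer points to.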

OpenMP C parallelisation algorithm

In the book "Using OpenMP" there is an example of bad memory access in C, and I think this is the main problem in my attempt to parallelize the Gaussian algorithm.
The example looks something like this:
k= 0 ;
for( int j=0; j<n ; j++)
for(int i = 0; i<n; i++)
a[i][j] = a[i][j] - a[i][k]*a[k][j] ;
So, I do understand why this causes bad memory access: in C a 2D array is stored by rows, and here in every i step a new row has to be brought from memory into the cache.
I am trying to find a solution for this, but I'm not getting a good speedup; the effects of my attempts are minor.
Can someone give me a hint about what I can do?
The easiest way would be to swap the for loops, but I want to do it column-wise.
The second attempt:
for( int j=0; j<n-1 ; j+=2)
for(int i = 0; i<n; i++)
{
a[i][j] = a[i][j] - a[i][k]*a[k][j] ;
a[i][j+1] = a[i][j+1] - a[i][k]*a[k][j+1] ;
}
didn't make a difference at all.
The third attempt:
for( int j=0; j<n ; j++)
{
d= a[k][j] ;
for(int i = 0; i<n; i++)
{
e = a[i][k] ;
a[i][j] = a[i][j] - e*d ;
}
}
Thx alot
Greets Stepp
Use a flat array instead, e.g.:
#define A(i,j) A[i+j*ldA]
for( int j=0; j<n ; j++)
{
d= A(k,j) ;
...
}
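To show the idea end to end, here is a sketch of the third attempt rewritten with that flat-array macro (arguments parenthesized for safety) plus a parallel loop over columns. It assumes a column-major layout with leading dimension ldA, and that, as in the usual Gaussian elimination update step, i and j start at k+1, so the pivot row and pivot column are only read and the j iterations are independent of each other:

#define A(i,j) A[(i) + (j)*ldA]     /* column-major: walking i stays contiguous in memory */

#pragma omp parallel for
for (int j = k+1; j < n; j++) {
    double d = A(k, j);             /* pivot-row element, read only */
    for (int i = k+1; i < n; i++)
        A(i, j) -= A(i, k) * d;     /* marches contiguously down column j */
}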
Your loop order will cause a cache miss on every iteration, as you point out. So just swap the order of the loop statements:
for (int i = 0; i < n; i++) // now "i" is first
for (int j = 0; j < n; j++)
a[i][j] = a[i][j] - a[i][k]*a[k][j];
This will fix the row in a and vary just the columns, which means your memory accesses will be contiguous.
This memory access problem is just related to cache usage, not to OpenMP.
To make good use of the cache, in general you should access contiguous memory locations. Remember also that if two or more threads access the same memory area, you can have a "false sharing" problem that forces cache lines to be reloaded unnecessarily.
See for example:
http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/
