Low performance in an OpenMP program

I am trying to understand an OpenMP code example from here. You can see the code below.
In order to measure the speedup, the difference between the serial and OMP versions, I use gettimeofday from sys/time.h. Is this approach right?
The program runs on a 4-core machine. I specify export OMP_NUM_THREADS="4" but cannot see a substantial speedup; I usually get 1.2 - 1.7. What problems am I facing in this parallelization?
Which debugging/performance tool could I use to see where the performance is lost?
Code (for compilation I use xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe):
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define CHUNKSIZE 1000000
#define N 100000000

int main (int argc, char *argv[])
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];
    unsigned long elapsed;
    unsigned long elapsed_serial;
    unsigned long elapsed_omp;
    struct timeval start;
    struct timeval stop;

    chunk = CHUNKSIZE;

    // ================= SERIAL start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;

    gettimeofday(&start,NULL);
    for (i=0; i<N; i++)
    {
        c[i] = a[i] + b[i];
        //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
    }
    gettimeofday(&stop,NULL);

    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_serial = elapsed;
    printf (" \n Time SEQ= %lu microsecs\n", elapsed_serial);
    // ================= SERIAL end =======================

    // ================= OMP start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;

    gettimeofday(&start,NULL);
    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        //printf("Thread %d starting...\n",tid);

        #pragma omp for schedule(static,chunk)
        for (i=0; i<N; i++)
        {
            c[i] = a[i] + b[i];
            //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
        }
    } /* end of parallel section */
    gettimeofday(&stop,NULL);

    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_omp = elapsed;
    printf (" \n Time OMP= %lu microsecs\n", elapsed_omp);
    // ================= OMP end =======================

    printf (" \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp));
}

There's nothing really wrong with the code as above, but your speedup is going to be limited by the fact that the main loop, c = a + b, has very little work: the time required to do the computation (a single addition) is dominated by the memory access time (two loads and one store), and with more threads acting on the array there's more contention for memory bandwidth.
We can test this by making the work inside the loop more compute-intensive:
c[i] = exp(sin(a[i])) + exp(cos(b[i]));
And then we get
$ ./apb
Time SEQ= 17678571 microsecs
Number of threads = 4
Time OMP= 4703485 microsecs
speedup= 3.758611
which is obviously a lot closer to the 4x speedup one would expect.
Update: Oh, and to the other questions -- gettimeofday() is probably fine for timing, and since you're using xlc -- is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here), Scalasca for general OpenMP issues, and OpenSpeedShop is pretty useful, too.
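If you would rather avoid the gettimeofday() arithmetic entirely, here is a minimal sketch of timing the same vector-add loop with OpenMP's own wall-clock timer, omp_get_wtime(). The array size matches the code above; using static storage here is just one way to keep the ~1.2 GB of arrays off the stack and is an assumption of this sketch, not part of the original code:

#include <omp.h>
#include <stdio.h>

#define N 100000000

int main(void)
{
    /* static storage avoids putting ~1.2 GB of arrays on the stack */
    static float a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0f;

    double t0 = omp_get_wtime();          /* wall-clock time in seconds */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    double t1 = omp_get_wtime();

    printf("Time OMP = %f s (c[10] = %f)\n", t1 - t0, c[10]);
    return 0;
}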

Related

omp parallel for loop (reduction to find max) ran slower than serial code

I am new to using OpenMP.
I thought that using the max reduction clause to find the max element of an array was not such a bad idea, but in fact the parallel for loop ran much slower than the serial one.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main() {
    double sta, end, elapse_t;
    int bsize = 46000;
    int q = bsize;
    int max_val = 0;
    double *buffer;
    buffer = (double*)malloc(bsize*sizeof(double));
    srand(time(NULL));
    for(int i=0;i<q;i++)
        buffer[i] = rand()%10000;

    sta = omp_get_wtime();
    //int i;
    #pragma omp parallel for reduction(max : max_val)
    for(int i=0;i<q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("parallel maximum time %f\n", end-sta);

    sta = omp_get_wtime();
    for(int i=0;i<q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("serial maximum time %f\n", end-sta);

    free(buffer);
    return 0;
}
Compile command
gcc-7 kp_omp.cpp -o kp_omp -fopenmp
Execution results
./kp_omp
parallel maximum time 0.000505
serial maximum time 0.000266
As for the CPU, it is an Intel Core i7-6700 with 8 cores.
Whenever you parallelise a loop, OpenMP needs to perform some operations, for example creating the threads. Those operations result in some overhead, and this in turn implies that, for each loop, there is a minimum number of iterations below which it is not worth parallelising.
If I execute your code I obtain the same results you have:
./kp_omp
parallel maximum time 0.000570
serial maximum time 0.000253
However, if I modify bsize such that
int bsize = 100000;
I obtain
./kp_omp
parallel maximum time 0.000323
serial maximum time 0.000552
So the parallelised version got faster than the sequential one. Part of the challenge in speeding up code is understanding when it is worth parallelising and when the parallelisation overhead would kill your expected gain in performance.
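One way to encode that decision directly is OpenMP's if clause, which only activates the parallel region when the trip count is large enough. This is only a sketch: the min_parallel threshold below is a made-up placeholder that you would have to tune by measuring on your own machine:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int bsize = 46000;                 /* same size as in the question */
    int min_parallel = 100000;         /* hypothetical cut-off; tune by measuring */
    double *buffer = malloc(bsize * sizeof(double));
    int max_val = 0;

    srand(time(NULL));
    for (int i = 0; i < bsize; i++)
        buffer[i] = rand() % 10000;

    /* The parallel region is only spawned if the loop is "big enough";
       otherwise the same construct runs the loop serially. */
    #pragma omp parallel for reduction(max : max_val) if(bsize >= min_parallel)
    for (int i = 0; i < bsize; i++)
        max_val = max_val > buffer[i] ? max_val : (int)buffer[i];

    printf("max = %d\n", max_val);
    free(buffer);
    return 0;
}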

Parallelizing in OpenMP

I have the following code that I want to parallelize using OpenMP:
for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        double value = 0.0;
        for(j=0; j<r_b; j++)
            for(k=0; k<c_b; k++)
            {
                double a;
                if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                //printf("%lf\t", a);
                value += h_b[(j*c_b) + k] * a;
            }
        h_c[m*c_c + n] = value;
        //printf("%lf\t", h_c[m*c_c + n]);
    }
    //cout<<"row "<<m<<" completed"<<endl;
}
In this I want every thread to perform the "for j" and "for k" loops simultaneously.
I tried placing #pragma omp parallel for before the "for m" loop, but I am not getting the correct result.
How can I do this in an optimized manner? Thanks in advance.
Depending on exactly which loop you want to parallelize, you have three options:
#pragma omp parallel
{
    #pragma omp for // Option #1
    for(m=0; m<r_c; m++)
    {
        for(n=0; n<c_c; n++)
        {
            double value = 0.0;
            #pragma omp for // Option #2
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
            h_c[m*c_c + n] = value;
            //printf("%lf\t", h_c[m*c_c + n]);
        }
        //cout<<"row "<<m<<" completed"<<endl;
    }
}

//////////////////////////////////////////////////////////////////////////
// Option #3
for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        #pragma omp parallel
        {
            double value = 0.0;
            #pragma omp for
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
            h_c[m*c_c + n] = value;
            //printf("%lf\t", h_c[m*c_c + n]);
        }
    }
    //cout<<"row "<<m<<" completed"<<endl;
}
Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on, there is no difference (or even a slowdown) when enabling OMP.
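A fourth variant worth trying, if your compiler supports OpenMP 3.0 or later (the collapse clause mentioned further down), is to collapse the two outer loops so the whole m*n iteration space is divided among the threads. This is only a sketch using the same variable names and bounds as above, not something I have benchmarked:

// Option #4 (requires OpenMP >= 3.0 for collapse)
#pragma omp parallel for collapse(2) private(j, k)
for (m = 0; m < r_c; m++)
{
    for (n = 0; n < c_c; n++)
    {
        double value = 0.0;
        for (j = 0; j < r_b; j++)
            for (k = 0; k < c_b; k++)
            {
                double a;
                if ((m - j) < 0 || (n - k) < 0 || (m - j) > r_a || (n - k) > c_a)
                    a = 0.0;
                else
                    a = h_a[((m - j) * c_a) + (n - k)];
                value += h_b[(j * c_b) + k] * a;
            }
        h_c[m * c_c + n] = value;
    }
}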
Edit
I've adapted the MCVE supplied in the comments as follows:
#include <iostream>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <vector>

#define W_OMP

int main(int argc, char *argv[])
{
    std::vector<double> h_a(9);
    std::generate(h_a.begin(), h_a.end(), std::rand);

    int r_b = 500;
    int c_b = r_b;
    std::vector<double> h_b(r_b * c_b);
    std::generate(h_b.begin(), h_b.end(), std::rand);

    int r_c = 500;
    int c_c = r_c;
    int r_a = 3, c_a = 3;
    std::vector<double> h_c(r_c * c_c);

    auto start = std::chrono::system_clock::now();
#ifdef W_OMP
#pragma omp parallel
    {
#endif
        int m,n,j,k;
#ifdef W_OMP
#pragma omp for
#endif
        for(m=0; m<r_c; m++)
        {
            for(n=0; n<c_c; n++)
            {
                double value = 0.0,a;
                for(j=0; j<r_b; j++)
                {
                    for(k=0; k<c_b; k++)
                    {
                        if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                            a = 0.0;
                        else a = h_a[((m-j)*c_a) + (n-k)];

                        value += h_b[(j*c_b) + k] * a;
                    }
                }
                h_c[m*c_c + n] = value;
            }
        }
#ifdef W_OMP
    }
#endif
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << elapsed.count() << "ms"
#ifdef W_OMP
        "\t with OMP"
#else
        "\t without OMP"
#endif
        "\n";

    return 0;
}
As a reference, I'm using VS2012 (OMP 2.0, grrr). I'm not sure when collapse was introduced, but apparently after 2.0 (it arrived in OpenMP 3.0). Optimizations were /O2 and compiled in Release x64.
Benchmarks
Using the original sizes of the loops (7,7,5,5) and therefore arrays, the results were 0ms without OMP and 1ms with. Verdict: optimizations were better, and the added overhead wasn't worth it. Also, the measurements are not reliable (too short).
Using slightly larger loop sizes (100, 100, 100, 100), and therefore arrays, the results were about equal at about 108 ms. Verdict: still not worth the naive effort; tweaking OMP parameters might tip the scale. Definitely not the 4x speedup I would hope for.
Using even larger loop sizes (500, 500, 500, 500), and therefore arrays, OMP started to pull ahead: without OMP 74.3 s, with OMP 15 s. Verdict: worth it. Weird. I got a 5x speedup with four threads and four cores on an i5. I'm not going to try and figure out how that happened.
Summary
As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:
Not enough work per thread to justify the overhead of creating the additional threads
The work itself is memory bound. This means that the CPU can be running at 1petaHz and you still won't see a speedup.
Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
OMP parameters. The best choice of parameters will often be a result of this entire list (not including this item, to avoid recursion issues).
SIMD operations. Depending on what and how you're doing it, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check your assembly (a foreign language to me) to confirm; a small sketch of combining the two follows below.
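For what it's worth, on a compiler newer than the OpenMP 2.0 toolchain used above (OpenMP 4.0 or later), you can ask for both threading and vectorization explicitly with the simd construct. This is only a sketch over the same loop nest and variable names as the MCVE, not something I have benchmarked:

// Threads across the outer loop, SIMD across the innermost loop
// (requires OpenMP >= 4.0; not available in the OpenMP 2.0 toolchain above).
#pragma omp parallel for
for (int m = 0; m < r_c; m++)
{
    for (int n = 0; n < c_c; n++)
    {
        double value = 0.0;
        for (int j = 0; j < r_b; j++)
        {
            #pragma omp simd reduction(+:value)
            for (int k = 0; k < c_b; k++)
            {
                double a = ((m - j) < 0 || (n - k) < 0 || (m - j) > r_a || (n - k) > c_a)
                               ? 0.0
                               : h_a[((m - j) * c_a) + (n - k)];
                value += h_b[(j * c_b) + k] * a;
            }
        }
        h_c[m * c_c + n] = value;
    }
}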

Why does this OpenMP code work on dynamic scheduling but not on static?

I'm learning OpenMP by building a simple program to calculate pi using the following algorithm:
pi = 4/1 - 4/3 + 4/5 - 4/7 + 4/9...
The problem is that it does not work correctly when I change the scheduling to static. It works perfectly when the thread count is one. It also runs correctly under dynamic scheduling despite the result differing slightly every time it's run. Any idea what could be the problem?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 100
#define CSIZE 1
#define nthread 2

int pi()
{
    int i, chunk;
    float pi = 0, x = 1;
    chunk = CSIZE;
    omp_set_num_threads(nthread);

    #pragma omp parallel shared(i, x, chunk)
    {
        if (omp_get_num_threads() == 0)
        {
            printf("Number of threads = %d\n", omp_get_num_threads());
        }
        printf("Thread %d starting...\n", omp_get_thread_num());

        #pragma omp for schedule(dynamic, chunk)
        for (i = 1; i <= N; i++)
        {
            if (i % 2 == 0)
                pi = pi - 4/x;
            else
                pi = pi + 4/x;
            x = x + 2;
            printf("Pi is currently %f at iteration %d with x = %0.0f on thread %d\n",
                   pi, i, x, omp_get_thread_num());
        }
    }
    return EXIT_SUCCESS;
}
When I test your code with printf in the loop, dynamic scheduling puts all the work on the first thread and none on the second (making the program effectively serial). If you remove the printf statement, you will find that the value of pi is random. This is because you have race conditions on x and pi.
Instead of using x you can divide by 2*i+1 (with i starting at zero). Instead of using a branch to get the sign you can use sign = -2*(i%2)+1. To get pi you need to do a reduction using #pragma omp for schedule(static) reduction(+:pi).
#include <stdio.h>

#define N 10000

int main() {
    float pi;
    int i;

    pi = 0;
    #pragma omp parallel for schedule(static) reduction(+:pi)
    for (i = 0; i < N; i++) {
        pi += (-2.0f*(i&1)+1)/(2*i+1);
    }
    pi *= 4.0f;
    printf("%f\n", pi);
}
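For completeness, a typical compile-and-run for this snippet might look like the following, assuming it is saved as pi.c (the file name is just an example); without -fopenmp the pragma is silently ignored and the loop simply runs serially:
gcc -O2 -fopenmp pi.c -o pi
./pi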

How to read performance counters on i5, i7 CPUs

Modern CPUs have quite a lot of performance counters - http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html - how do I read them?
I'm interested in cache misses and branch mispredictions.
It looks like PAPI has a very clean API and works just fine on Ubuntu 11.04.
Once it's installed, the following app will do what I wanted:
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define NUM_EVENTS 4

void matmul(const double *A, const double *B,
            double *C, int m, int n, int p)
{
    int i, j, k;
    for (i = 0; i < m; ++i)
        for (j = 0; j < p; ++j) {
            double sum = 0;
            for (k = 0; k < n; ++k)
                sum += A[i*n + k] * B[k*p + j];
            C[i*p + j] = sum;
        }
}

int main(void)
{
    const int size = 300;
    double a[size][size];
    double b[size][size];
    double c[size][size];

    int event[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_BR_MSP, PAPI_L1_DCM};
    long long values[NUM_EVENTS];

    /* Start counting events */
    if (PAPI_start_counters(event, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters - FAILED\n");
        exit(1);
    }

    matmul((double *)a, (double *)b, (double *)c, size, size, size);

    /* Read the counters */
    if (PAPI_read_counters(values, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_read_counters - FAILED\n");
        exit(1);
    }

    printf("Total instructions: %lld\n", values[0]);
    printf("Total cycles: %lld\n", values[1]);
    printf("Instr per cycle: %2.3f\n", (double)values[0] / (double) values[1]);
    printf("Branches mispredicted: %lld\n", values[2]);
    printf("L1 Cache misses: %lld\n", values[3]);

    /* Stop counting events */
    if (PAPI_stop_counters(values, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters - FAILED\n");
        exit(1);
    }

    return 0;
}
Tested this on an Intel Q6600; it supports up to 4 performance events. Your processor may support more or fewer.
What about perf? perf list hw cache shows 33 different events, and the man page shows how to use raw performance counter descriptors.
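For example, a quick way to get the two numbers asked about (cache misses and branch mispredictions) without writing any code is perf stat; the event names below are perf's generic ones, and ./your_app is a placeholder for your own binary:
perf stat -e cycles,instructions,cache-references,cache-misses,branches,branch-misses ./your_app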
Performance counters are read with the RDPMC instruction.
EDIT: To add a bit more info: reading performance counters directly is not very easy and would take pages upon pages to describe here; besides, it involves writes to Model Specific Registers, which require privileged instructions. I would instead advise using ready-made profilers - OProfile or Intel VTune - which are built upon performance counters.
There is a library that can be used for this, called perfmon2, http://perfmon2.sourceforge.net/, and documentation is available at http://www.hpl.hp.com/research/linux/perfmon/perfmon.php4 and http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html. I have recently been digging into this library and will post example code as soon as I figure it out.

OpenMP in Ubuntu: parallel program runs two times slower than single-threaded on a dual-core processor. Why?

I got the code from Wikipedia:
#include <stdio.h>
#include <omp.h>

#define N 100

int main(int argc, char *argv[])
{
    float a[N], b[N], c[N];
    int i;

    omp_set_dynamic(0);
    omp_set_num_threads(10);

    for (i = 0; i < N; i++)
    {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    #pragma omp parallel shared(a, b, c) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf ("%f\n", c[10]);
    return 0;
}
I tried to compile and run it on Ubuntu 11.04 with gcc 4.5 (my configuration: Intel C2D T7500M 2.2 GHz, 2048 MB RAM), and the program ran two times slower than the single-threaded version. Why?
Very simple answer: increase N, and set the number of threads equal to the number of processors you have.
For your machine, 100 is a very low number. Try some orders of magnitude higher.
Another question is: how are you measuring the computation time? Usually one times the whole program run to get comparable results.
I suppose the compiler optimized the for loop in the non-SMP case (using SSE instructions, e.g.) and it can't in the OMP variant.
Use gcc -S (or objdump -S) to view the assembly for the different variants.
You might want to watch out with the shared variables anyway, because they need to be synchronized, which makes things very slow. If you can use 'smart' chunks (look at the schedule clause) you might reduce the contention, but again:
verify the emitted code
profile
don't underestimate the efficiency of single-threaded code (because of cache locality and lack of context switches)
set the number of threads to the number of CPUs (let OpenMP decide it for you!); unless your thread team has a master thread with dedicated tasks, in which case there might be value in allocating ONE extra thread
In all the cases where I tried to apply OMP for parallelization, roughly 70% of the cases were slower. The cases where there is a definite speedup are those with
coarse-grained parallelism (your sample is on the fine-grained end of the spectrum)
no shared data
The issue you are facing is false memory sharing. Each thread should have its own private c[i].
Try this: #pragma omp parallel shared(a, b) private(i, c)
Run the code below and see the difference.
1.) OpenMP has an overhead so the runtime has to be more than the overhead to see a benefit.
2.) Don't set the number of threads yourself. In general I use the default number of threads. However, if your processor has hyper-threading you might get a bit better performance by setting the number of threads equal to the number of cores. With hyper-threading the default number of threads will be twice the number of cores. For example, on my machine I have four cores and the default number of threads is eight. By setting it to four I get better results in some situations and worse results in others.
3.) There is some false sharing in c but as long as N is large enough (which it needs to be to overcome the overhead) the false sharing will not cause much of a problem. You can play with the chunk size but I don't think it will be helpful.
4.) Cache issues. You have at least four levels of memory (the values are for my system): L1 (32 KB), L2 (256 KB), L3 (12 MB), and main memory (>>12 MB). The benefits of parallelism are going to diminish as you move into the higher levels. However, in the example below I set N to 100 million floats, which is 400 million bytes or about 381 MB, and it is still significantly faster using multiple threads. Try adjusting N and see what happens. For example, try setting N to one of your cache sizes divided by 4 (one float is 4 bytes); since arrays a and b also need to be in the cache, you might need to set N to the cache size divided by 12. However, if N is too small you fight with the OpenMP overhead (which is what the code in your question does).
#include <stdio.h>
#include <omp.h>

#define N 100000000

int main(int argc, char *argv[]) {
    float *a = new float[N];
    float *b = new float[N];
    float *c = new float[N];
    int i;

    for (i = 0; i < N; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    double dtime;
    dtime = omp_get_wtime();
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    dtime = omp_get_wtime() - dtime;
    printf ("time %f, %f\n", dtime, c[10]);

    dtime = omp_get_wtime();
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    dtime = omp_get_wtime() - dtime;
    printf ("time %f, %f\n", dtime, c[10]);

    return 0;
}
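A possible compile line for this example, assuming it is saved as add.cpp (the file name is just an example; the code uses new, so it has to be built as C++) and that you are using g++:
g++ -O2 -fopenmp add.cpp -o add
./add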
