omp parallel for loop (reduction to find max) ran slower than serial codes - openmp

I am new in using OpenMP.
I think that use max reduction clause to find the max element of an array is not such a bad idea, but in fact the parallel for loop ran much slower than serial one.
int main() {
double sta, end, elapse_t;
int bsize = 46000;
int q = bsize;
int max_val = 0;
double *buffer;
buffer = (double*)malloc(bsize*sizeof(double));
srand(time(NULL));
for(int i=0;i<q;i++)
buffer[i] = rand()%10000;
sta = omp_get_wtime();
//int i;
#pragma omp parallel for reduction(max : max_val)
for(int i=0;i<q; i++)
{
max_val = max_val > buffer[i] ? max_val : buffer[i];
}
end = omp_get_wtime();
printf("parallel maximum time %f\n", end-sta);
sta = omp_get_wtime();
for(int i=0;i<q; i++)
{
max_val = max_val > buffer[i] ? max_val : buffer[i];
}
end = omp_get_wtime();
printf("serial maximum time %f\n", end-sta);
free(buffer);
return 0;}
Compile command
gcc-7 kp_omp.cpp -o kp_omp -fopenmp
Execution results
./kp_omp
parallel maximum time 0.000505
serial maximum time 0.000266
As for the CPU, it is an Intel Core i7-6700 with 8 cores.

Whenever you parallelise a loop openMP needs to perform some operations, for example creating the threads. Those operations result in some overhead and this in turns implies that, for each loop, there is a minimum number of iterations under which it is not convenient to parallelise.
If I execute your code I obtain the same results you have:
./kp_omp
parallel maximum time 0.000570
serial maximum time 0.000253
However if I modify bsize in line 8 such that
int bsize = 100000;
I obtain
./kp_omp
parallel maximum time 0.000323
serial maximum time 0.000552
So the parallelised version got faster than the sequential. Part of the challenges you encounter when you try to speedup the execution of a code is to understand when it is convenient to parallelise and when the overhead of the parallelisation would kill your expected gain in performance.

Related

MPI Latency measuring

I am trying to understand some aspects of the MPI.
During the creation of the program, which is to measure latency between send/recv of two processes, I was faced with strange effects.
I tried to measure the result of many iterations, and received a response that matches the other benchmarks. Then I decided to display values ​​after each iteration and was surprised: they ranged between four values ​​that have not changed. I also drew attention to some very high values.
The code that calculates the value of latency and sample values is below:
int main()
{
MPI::Init();
Proc_Rank = MPI::COMM_WORLD.Get_rank();
for(int i = 0; i < 100; ++i)
latency_test(Proc_Rank, 1, 0);
MPI::Finalize();
return 0;
}
void latency_test(int Proc_Rank, int Iterations_Num, int Size)
{
double Total_Time, Latency;
double t1, t2;
char *Send_Buffer = new char[Size];
char *Recv_Buffer = new char[Size];
for(int i = 0; i < Size; i++){
Send_Buffer[i] = 'a';
}
for(int i = 0; i < Size; i++){
Recv_Buffer[i] = 'b';
}
MPI::COMM_WORLD.Barrier();
t1 = MPI::Wtime();
for(int i = 0; i < Iterations_Num; i++){
if (Proc_Rank == 0){
MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 1, 0);
MPI::COMM_WORLD.Recv(Recv_Buffer,Size,MPI::CHAR,1,
MPI::ANY_TAG);
}
else if (Proc_Rank==1) { MPI::COMM_WORLD.Recv(Recv_Buffer,Size,MPI::CHAR,0,MPI::ANY_TAG);
MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 0, 0);
}
}
t2 = MPI::Wtime();
delete []Send_Buffer;
delete []Recv_Buffer;
Total_Time = t2-t1;
if(Proc_Rank == 0){
Latency = (Total_Time / (Iterations_Num * 2.0)) * 1000000.0;
printf("%10.10f\n", Latency);
}
}
Part of the result:
5.4836273193
1.0728836060
0.9536743164
1.0728836060
0.4768371582
0.9536743164
0.5960464478
6.5565109253
0.9536743164
0.9536743164
1.0728836060
0.5960464478
0.4768371582
0.4768371582
Why are 4 fixed values randomly repeat? And why there are rare very large values?
As pointed out by Zulan, the resolution of the timer used by MPI_Wtime is not infinite. You can query the timer resolution by calling MPI_Wtick (MPI::Wtick in the C++ bindings). Measuring a single ping-pong round that lasts less than a microsecond is prone to very high statistical uncertainty, especially since the OS jitter, which is the random delay of the process execution due to other OS activities or processes being scheduled on the same CPU, could be several microseconds. No respectable MPI benchmark would do a single ping-pong round with empty messages.
As a side note, you are using a wildcard receive (MPI_ANY_TAG) in one of the processes. Those tend to be slower than fully-specified receives, especially when it comes to network equipment.

OpenMP Code Not Scaling due to overheads and cache issues

struct xnode
{
float *mat;
};
void testScaling( )
{
int N = 1000000; ///total num matrices
int dim = 10;
//memory for matrices
std::vector<xnode> nodeArray(N);
for( int k = 0; k < N; ++k )
nodeArray[k].mat = new float [dim*dim];
//memory for Y
std::vector<float*> Y(N,0);
for( int k = 0; k < N; ++k )
Y[k] = new float [dim];
//shared X
float* X = new float [dim];
for(int i = 0; i < dim; ++i ) X[i] = 1.0;
//init mats
for( int k = 0; k < N; ++k )
{
for( int i=0; i<dim*dim; ++i )
nodeArray[k].mat[i] = 0.25+((float)i)/3;
}
int NTIMES = 500;
//gemv args
char trans = 'N';
int lda = dim;
int incx = 1;
float alpha =1 , beta = 0;
//threads
int thr[4];
thr[0] =1 ; thr[1] = 2; thr[2] = 4; thr[3] = 8;
for( int t = 0; t<4; ++t )//test for nthreads
{
int nthreads = thr[t];
double t_1 = omp_get_wtime();
for( int ii = 0; ii < NTIMES; ++ii )//do matvec NTIMES
{
#pragma omp parallel for num_threads(nthreads)
for( int k=0; k<N; ++k )
{
//compute Y[k] = mat[k] * X;
GEMV(&trans, &dim, &dim, &alpha, nodeArray[k].mat, &lda, X, &incx, &beta, Y[k], &incx);
//GEMV(&trans, &dim, &dim, &alpha, nodeArray[0].mat, &lda, X, &incx, &beta, Y[k], &incx);
}
}
double t_2 = omp_get_wtime();
std::cout << "Threads " << nthreads << " time " << (t_2-t_1)/NTIMES << std::endl;
}
//clear memory
for( int k = 0; k < N; ++k )
{
delete [] nodeArray[k].mat;
delete [] Y[k];
}
delete [] X;
}
The above code parallelizes the matrix-vector product of N matrices of size dim, and stores results in N output vectors. The average of 500 products is taken as the time per matrix-vector product. The matrix-vector products in the above example are all of equal size and thus the threads should be perfectly balanced - we should achieve a performance scaling close to ideal 8x. The following are the observations (Machine – Intel Xeon 3.1Ghz.2 processors,8cores each, HyperThreading enabled, Windows, VS2012, Intel MKL, Intel OMP library).
OBSERVATION 1:
dim=10 N=1000000
Threads 1 - time 0.138068s
Threads 2 - time 0.0729147s
Threads 4 - time 0.0360527s
Threads 8 - time 0.0224268s (6.1x on 8threads)
OBSERVATION 2 :
dim=20 N=1000000
Threads 1 time 0.326617
Threads 2 time 0.185706
Threads 4 time 0.0886508
Threads 8 time 0.0733666 (4.5x on 8 threads).
Note – I ran VTune on this case. It showed CPUTime 267.8sec, Overhead time 43 sec, Spin time – 8 sec. The overhead time is all spent in a libiomp function (intel library). 8Threads/1Thread scaling is poor for such cases.
Next - in the gemv for loop, we change nodeArray[k].mat to nodeArray[0].mat (see commented statement), so that only the first matrix is used for all the matrix-vector products.
OBSERVATION 3
dim=20 N=1000000
Threads 1 time 0.152298 (The serial time is halved)
Threads 2 time 0.0769173
Threads 4 time 0.0384086
Threads 8 time 0.019336 (7.87x on 8 threads)
Thus I get almost ideal scaling - why is this behavior? VTune says that a significant portion of CPU time is spent in synchronization and thread overhead. Here it seems there is no relation between the load balancing and thread synchronization. As matrix size is increased the granularity should increase and thread overhead should be proportionately small. But as we increase from size 10 to 20 the scaling is weakening. When we use nodeArray[0].mat (only the first matrix) for doing all the matrix-vector products the cache is updated only once (since the compiler knows this during optimization) and we get near ideal scaling. Thus the synchronization overhead seems to be related to some cache related issue. I have tried a number of other things like setting KMP_AFFINITY and varying load distribution but that did not buy me anything.
My questions are:
1. I dont have a clear idea about how does the cache performance affect openMP thread synchronization. Can someone explain this?
2. Can anything be done about improving the scaling and reducing the overhead?
Thanks

OpenMP program is slower than sequential one

When I try the following code
double start = omp_get_wtime();
long i;
#pragma omp parallel for
for (i = 0; i <= 1000000000; i++) {
double x = rand();
}
double end = omp_get_wtime();
printf("%f\n", end - start);
Execution time is about 168 seconds, while the sequential version only spends 20 seconds.
I'm still a newbie in parallel programming. How could I get a parallel version that's faster that the sequential one?
The random number generator rand(3) uses global state variables (hidden in the (g)libc implementation). Access to them from multiple threads leads to cache issues and also is not thread safe. You should use the rand_r(3) call with seed parameter private to the thread:
long i;
unsigned seed;
#pragma omp parallel private(seed)
{
// Initialise the random number generator with different seed in each thread
// The following constants are chosen arbitrarily... use something more sensible
seed = 25234 + 17*omp_get_thread_num();
#pragma omp for
for (i = 0; i <= 1000000000; i++) {
double x = rand_r(&seed);
}
}
Note that this will produce different stream of random numbers when executed in parallel than when executed in serial. I would also recommend erand48(3) as a better (pseudo-)random number source.

OpenMP in Ubuntu: parallel program works on double core processor in two times slower than single-threaded. Why?

I get the code from wikipedia:
#include <stdio.h>
#include <omp.h>
#define N 100
int main(int argc, char *argv[])
{
float a[N], b[N], c[N];
int i;
omp_set_dynamic(0);
omp_set_num_threads(10);
for (i = 0; i < N; i++)
{
a[i] = i * 1.0;
b[i] = i * 2.0;
}
#pragma omp parallel shared(a, b, c) private(i)
{
#pragma omp for
for (i = 0; i < N; i++)
c[i] = a[i] + b[i];
}
printf ("%f\n", c[10]);
return 0;
}
I tryed to compile and run it in my Ubuntu 11.04 with gcc4.5 (my configuration: Intel C2D T7500M 2.2GHz, 2048Mb RAM) and this program worked in two times slower than single-threaded. Why?
Very simple answer: Increase N. And set the number of threads equal to the number processors you have.
For your machine, 100 is a very low number. Try some orders of magnitudes higher.
Another question is: How are you measuring the computation time? Usually one takes the program time to get comparable results.
I suppose the compiler optimized the for loop in the non-smp case (using SSE instructions, e.g.) and it can't in the OMP variant.
Use gcc -S (or objdump -S) to view the assembly for the different variants.
You might want to watch out with the shared variables anyway, because they need to be synchronized, making things very slow. If you can 'smart' chunks (look at the schedule pragma) you might reduce the contention, but again:
verify the emitted code
profile
don't underestimate the efficiency of singlethreaded code (because of cache locality and lack of context switches)
set the number of threads to the number of CPUs (let openMP decide it for you!); unless your thread-team has a master thread with dedicated tasks, in which case there might be value in allocating ONE extra thread
In all the cases where I tried to apply OMP for parallelization, roughly 70% of the cases are slower. The cases where it is a definite speedup is with
coarse-grained parallellism (your sample is on the fine-grained end of the spectrum)
no shared data
The issue you are facing is false memory sharing. Each thread should have its own private c[i].
Try this: #pragma omp parallel shared(a, b) private(i, c)
Run the code below and see the difference.
1.) OpenMP has an overhead so the runtime has to be more than the overhead to see a benefit.
2.) Don't set the number of threads yourself. In general I use the default threads. However, if your processor has hyper-threading you might get a bit better performance by setting the number of threads equal to the number of cores. With hyper threading the default number of threads will be twice the number of cores. For example on my machine I have four cores and the default number of threads is eight. By setting it to four in some situations I get better results and in other cases I get worse results.
3.) There is some false sharing in c but as long as N is large enough (which it needs to be to overcome the overhead) the false sharing will not cause much of a problem. You can play with the chunk size but I don't think it will be helpful.
4.) Cache issues. You have at least four levels of memory (the values are for my system): L1 (32Kb), L2(256Kb), L3(12Mb), and main memory (>>12Mb). The benefits of parallelism are going to diminish as you move into higher level. However, in the example below I set N to 100 million floats which is 400 million bytes or about 381Mb and it is still significantly faster using multiple threads. Try adjusting N and see what happens. For example try setting N to your cache levels/4 (one float is 4 bytes) (arrays a and b also need to be in the cache so you might need to set N to the cache level/12). However, if N is too small you fight with the OpenMP overhead (which is what the code in your question does).
#include <stdio.h>
#include <omp.h>
#define N 100000000
int main(int argc, char *argv[]) {
float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
int i;
for (i = 0; i < N; i++) {
a[i] = i * 1.0;
b[i] = i * 2.0;
}
double dtime;
dtime = omp_get_wtime();
for (i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
dtime = omp_get_wtime() - dtime;
printf ("time %f, %f\n", dtime, c[10]);
dtime = omp_get_wtime();
#pragma omp parallel for private(i)
for (i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
dtime = omp_get_wtime() - dtime;
printf ("time %f, %f\n", dtime, c[10]);
return 0;
}

Low performance in a OpenMP program

I am trying to understand an openmp code from here. You can see the code below.
In order to measure the speedup, difference between the serial and omp version, I use time.h, do you find right this approach?
The program runs on a 4 core machine. I specify export OMP_NUM_THREADS="4" but can not see substantially speedup, usually I get 1.2 - 1.7. Which problems am I facing in this parallelization?
Which debug/performace tool could I use to see the loss of performace?
code (for compilation I use xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define CHUNKSIZE 1000000
#define N 100000000
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
unsigned long elapsed;
unsigned long elapsed_serial;
unsigned long elapsed_omp;
struct timeval start;
struct timeval stop;
chunk = CHUNKSIZE;
// ================= SERIAL start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_serial = elapsed ;
printf (" \n Time SEQ= %lu microsecs\n", elapsed_serial);
// ================= SERIAL end =======================
// ================= OMP start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
//printf("Thread %d starting...\n",tid);
#pragma omp for schedule(static,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_omp = elapsed ;
printf (" \n Time OMP= %lu microsecs\n", elapsed_omp);
// ================= OMP end =======================
printf (" \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp)) ;
}
There's nothing really wrong with the code as above, but your speedup is going to be limited by the fact that the main loop, c=a+b, has very little work -- the time required to do the computation (a single addition) is going to be dominated by memory access time (2 loads and one store), and there's more contention for memory bandwidth with more threads acting on the array.
We can test this by making the work inside the loop more compute-intensive:
c[i] = exp(sin(a[i])) + exp(cos(b[i]));
And then we get
$ ./apb
Time SEQ= 17678571 microsecs
Number of threads = 4
Time OMP= 4703485 microsecs
speedup= 3.758611
which is obviously a lot closer to the 4x speedup one would expect.
Update: Oh, and to the other questions -- gettimeofday() is probably fine for timing, and on a system where you're using xlc - is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here), scalasca for general OpenMP issues, and OpenSpeedShop is pretty useful, too.

Resources