Atomic operators, SSE/AVX, and OpenMP - openmp

I'm wondering if SSE/AVX operations such as addition and multiplication can be an atomic operation? The reason I ask this is that in OpenMP the atomic construct only works on a limited set of operators. It does not work on for example SSE/AVX additions.
Let's assume I had a datatype float4 that corresponds to a SSE register and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code:
float4 sum4 = 0.0f; //sets all four values to zero
#pragma omp parallel
{
float4 sum_private = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i+=4) {
float4 val = float4().load(&array[i]) //load four floats into a SSE register
sum_private4 += val; //sum_private4 = _mm_addps(val,sum_private4)
}
#pragma omp critical
sum4 += sum_private;
}
float sum = horizontal_sum(sum4); //sum4[0] + sum4[1] + sum4[2] + sum4[3]
But atomic is faster than critical in general and my instinct tells me SSE/AVX operations should be atomic (even if OpenMP does not support it). Is this a limitation of OpenMP? Could I use for example e.g. Intel Threading Building Blocks or pthreads to do this as an atomic operation?
Edit: Based on Jim Cownie's comment I created a new function which is the best solution. I verified that it gives the correct result.
float sum = 0.0f;
#pragma omp parallel reduction(+:sum)
{
Vec4f sum4 = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i+=4) {
Vec4f val = Vec4f().load(&A[i]); //load four floats into a SSE register
sum4 += val; //sum4 = _mm_addps(val,sum4)
}
sum += horizontal_add(sum4);
}
Edit: based on comments Jim Cownie and comments by Mystical at this thread
OpenMP atomic _mm_add_pd I realize now that the reduction implementation in OpenMP does not necessarily use atomic operators and it's best to rely on OpenMP's reduction implementation rather than try to do it with atomic.

SSE & AVX in general are not atomic operations (but multiword CAS would sure be sweet).
You can use the combinable class template in tbb or ppl for more general purpose reductions and thread local initializations, think of it as a synchronized hash table indexed by thread id; it works just fine with OpenMP and doesn't spin up any extra threads on its own.
You can find examples on the tbb site and on msdn.
Regarding the comment, consider this code:
x = x + 5
You should really think of it as the following particularly when multiple threads are involved:
while( true ){
oldValue = x
desiredValue = oldValue + 5
//this conditional is the atomic compare and swap
if( x == oldValue )
x = desiredValue
break;
}
make sense?

Related

OpenMP reduction from within parallel region

How to do OpenMP reduction (sum) inside parallel region? (Result is needed on master thread only).
Algorithm prototype:
#pragma omp parallel
{
t = omp_get_thread_num();
while iterate
{
float f = get_local_result(t);
// fsum is required on master only
float fsum = // ? - SUM of f
if (t == 0):
MPI_Bcast(&fsum, ...);
}
If I have OpenMP region inside while iterate loop, parallel region overhead at each iteration kills the performance...
Here is the simplest way to do this:
float sharedFsum = 0.f;
float masterFsum;
#pragma omp parallel
{
const int t = omp_get_thread_num();
while(iteration_condition)
{
float f = get_local_result(t);
// Manual reduction
#pragma omp update
sharedFsum += f;
// Ensure the reduction is completed
#pragma omp barrier
#pragma omp master
MPI_Bcast(&sharedFsum, ...);
// Ensure no other threads update sharedFsum during the MPI_Bcast
#pragma omp barrier
}
}
The atomic operations can be costly if you have a lot of threads (eg. hundreds). A better approach is to let the runtime perform the reduction for you.
Here is a better version:
float sharedFsum = 0;
#pragma omp parallel
{
const int threadCount = omp_get_num_threads();
float masterFsum;
while(iteration_condition)
{
// Execute get_local_result on each thread and
// perform the reduction into sharedFsum
#pragma omp for reduction(+:sharedFsum) schedule(static,1)
for(int i=0 ; i<threadCount ; ++i)
sharedFsum += get_local_result(i);
#pragma omp master
{
MPI_Bcast(&sharedFsum, ...);
// sharedFsum must be reinitialized for the next iteration
sharedFsum = 0.f;
}
// Ensure no other threads update sharedFsum during the MPI_Bcast
#pragma omp barrier
}
}
Side notes:
t is not protected in your code, use private(t) in the #pragma omp parallel section to avoid an undefined behavior due to a race condition. Alternatively, you can use scoped variables.
#pragma omp master should be preferred to a conditional on the thread ID.
parallel region overhead at each iteration kills the performance...
Most of the time this is due to either (implicit) synchronizations/communications or a work imbalance.
The code above may have the same problem since it is quite synchronous.
If it makes sense in your application, you can make it a bit less synchronous (and thus possibly faster) by removing or moving barriers regarding the speed of the MPI_Bcast and get_local_result. However, this is far from being easy to do it correctly. One way to do that it to use OpenMP tasks and multi-buffering.

Is efficiency greater than unity valid?

I'm running a C code with OpenMP directive, at Intel i5-2410M (dual core with Hyper-threading).
Since by using #pragma omp parallel for simd,
the speedup achieved could be about x1000, as such:
#pragma omp parallel shared(a,b,c) private(i,j,k) {
#pragma omp for simd collapse (3)
for (i=0; i<5000; i++){
for (j=0; j<5000; j++){
for (k=0; k<5000; k=k+1){
a[i][j]=(a[i][j])+((b[i][k])*(c[k][j]));
} } }
}
And, the formula for calculating efficiency = Speedup / no. of processors
Thus, efficiency = 1000 / 1 = 1000. Is it valid?
I've never seen efficiency which is greater than 1. It really makes me confused as I don't know whether efficiency > 1 is a valid figure or not.
Hope you can clear up my doubt.
Thank you.

OpenMP: cannot change the value of reduction variable

After I had read that the initial value of reduction variable is set according to the operator used for reduction, I decided that instead of remembering these default values it is better to initialize it explicitly. So I modified the code in question by Totonga as follows
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
#pragma omp parallel private(x) reduction(+:sum)
{
sum = 0.;
#pragma omp for schedule(static)
for (int i=0; i<num_steps; ++i)
{
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
}
But it turns out that no matter whether I write sum = 0. or sum = 123.456 the code produces the same result (used gcc-4.5.2 compiler). Can somebody, please, explain me why? (with a reference to openmp standard, if possible) Thanks in advance to everybody.
P.S. since some people object initializing reduction variable, I think it makes sense to expand a question a little. The code below works as expected: I initialize reduction variable and obtain result, which DOES depend on MY initial value
int sum;
#pragma omp parallel reduction(+:sum)
{
sum = 1;
}
printf("Reduction sum = %d\n",sum);
The printed result will be the number of cores, and not 0.
P.P.S I have to update my question again. User Gilles gave an insightful comment: And upon exit of the parallel region, these local values will be reduced using the + operator, and with the initial value of the variable, prior to entering the section.
Well, the following code gives me the result 3.142592653598146, which is badly calculated pi instead of expected 103.141592653598146 (the initial code was giving me excellent value of pi=3.141592653598146)
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 100.;
#pragma omp parallel private(x) reduction(+:sum)
{
#pragma omp for schedule(static)
for (int i=0; i<num_steps; ++i)
{
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
}
Why would you want to do that? This is just begging with all your soul for troubles. The reduction clause and the way the local variables used are initialised are defined for a reason, and the idea is that you don't need to remember these initialisation value just because they are already right.
However, in your code, the behaviour is undefined. Let's see why...
Let's assume your initial code is this:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 0.;
for (int i=0; i<num_steps; ++i) {
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
Well, the "normal" way of parallelising it with OpenMP would be:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 0.;
#pragma omp parallel for reduction(+:sum) private(x)
for (int i=0; i<num_steps; ++i) {
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
Pretty straightforward, isn't it?
Now, when instead of that, you do:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
#pragma omp parallel private(x) reduction(+:sum)
{
sum = 0.;
#pragma omp for schedule(static)
for (int i=0; i<num_steps; ++i)
{
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
}
You have a problem... The reason is that upon entry into the parallel region, sum hadn't been initialised. So when you declare omp parallel reduction(+:sum), you create a per-thread private version of sum, initialised to the "logical" initial value corresponding to the operator of you reduction clause, namely 0 here because you asked for a + reduction. And upon exit of the parallel region, these local values will be reduced using the + operator, and with the initial value of the variable, prior to entering the section. See this for reference:
The reduction clause specifies a reduction-identifier and one or more
list items. For each list item, a private copy is created in each
implicit task or SIMD lane, and is initialized with the initializer
value of the reduction-identifier. After the end of the region, the
original list item is updated with the values of the private copies
using the combiner associated with the reduction-identifier
So in summary, upon exit you have the equivalent of sum += sum_local_0 + sum_local_1 + ... sum_local_nbthreadsMinusOne
Therefore, since in your code, sum doesn't have any initial value, its value upon exit of the parallel region isn't defined as well, and can be whatever...
Now let's imagine you did indeed initialise it... Then, if instead of using the right initialiser inside the parallel region (like your sum=0.; in the hereinabove code), you used for whatever reason sum=1.; instead, then the final sum won't be just incremented by 1, but by 1 times the number of threads used inside the parallel region, since the extra value will be counted as many times as there are of threads.
So in conclusion, just use reduction clauses and variables the "expected"/"naïve" way, that will spare you and the people coming after for maintaining your code a lot of troubles.
Edit: It looks like my point was not clear enough, so I'll try to explain it better:
this code:
int sum;
#pragma omp parallel reduction(+:sum)
{
sum = 1;
}
printf("Reduction sum = %d\n",sum);
Has an undefined behaviour because it is equivalent to:
int sum, numthreads;
#pragma omp parallel
#pragma omp single
numthreads = omp_get_num_threads();
sum += numthreads; // value of sum is undefined since it never was initialised
printf("Reduction sum = %d\n",sum);
Now, this code is valid:
int sum = 0; //here, sum has been initialised
#pragma omp parallel reduction(+:sum)
{
sum = 1;
}
printf("Reduction sum = %d\n",sum);
To convince yourself, just read the snippet of the standard I gave:
After the end of the region, the
original list item is updated with the values of the private copies
using the combiner associated with the reduction-identifier
So the reduction uses the combination of the private reduction variables and the original value to perform the final reduction upon exit. So if the original value wasn't set, the final value is undefined as well. And that's not because for some reason your compiler gives you a value that seems right, that the code is right.
Is that clearer now?

OpenMP parallelizing loop

I want to parallelize that kind of loop. Note that each "calc_block" uses the data that obtained on previous iteration.
for (i=0 ; i<MAX_ITER; i++){
norma1 = calc_block1();
norma2 = calc_block2();
norma3 = calc_block3();
norma4 = calc_block4();
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I tryed this, but speedup is quite small ~1.2
for (i=0 ; i<MAX_ITER; i++){
#pragma omp parallel sections{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I think it happened because of the overhead of using sections inside of loop. But i dont know how to fix it up...
Thanks in advance!
You could reduce the overhead by moving the entire loop inside the parallel region. Thus the threads in the pool used to implement the team would only get "awaken" once. It is a bit tricky and involves careful consideration of variable sharing classes:
#pragma omp parallel private(i,...) num_threads(4)
{
for (i = 0; i < MAX_ITER; i++)
{
#pragma omp sections
{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
#pragma omp single
{
norma = norm1 + norm2 + norm3 + norm4;
// ... some calc ..
}
if (norma < eps) break;
}
}
Both sections and single constructs have implicit barriers at their ends, hence the threads would synchronise before going into the next loop iteration. The single construct reproduces the previously serial part of your program. The ... part in the private clause should list as many as possible variables that are only relevant to ... some calc .... The idea is to run the serial part with thread-local variables since access to shared variables is slower with most OpenMP implementations.
Note that often time the speed-up might not be linear for completely different reason. For example calc_blockX() (with X being 1, 2, 3 or 4) might have too low compute intensity and therefore require very high memory bandwidth. If the memory subsystem is not able to feed all 4 threads at the same time, the speed-up would be less than 4. An example of such case - this question.

What is the usage of reduction in openmp?

I have this piece of code that is parallelized.
int i,n; double pi,x;
double area=0.0;
#pragma omp parallel for private(x) reduction (+:area)
for(i=0; i<n; i++){
x= (i+0.5)/n;
area+= 4.0/(1.0+x*x);
}
pi = area/n;
It is said that the reduction will remove the race condition that could happen if we didn't use a reduction. Still I'm wondering do we need to add lastprivate for area since its used outside the parallel loop and will not be visible outside of it. Else does the reduction cover this as well?
Reduction takes care of making a private copy of area for each thread. Once the parallel region ends area is reduced in one atomic operation. In other words the area that is exposed is an aggregate of all private areas of each thread.
thread 1 - private area = compute(x)
thread 2 - private area = compute(y)
thread 3 - private area = compute(z)
reduction step - public area = area<thread1> + area<thread2> + area<thread3> ...
You do not need lastprivate. To help you understand how reductions are done I think it's useful to see how this can be done with atomic. The following code
float sum = 0.0f;
pragma omp parallel for reduction (+:sum)
for(int i=0; i<N; i++) {
sum += //
}
is equivalent to
float sum = 0.0f;
#pragma omp parallel
{
float sum_private = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i++) {
sum_private += //
}
#pragma omp atomic
sum += sum_private;
}
Although this alternative has more code it is helpful to show how to use more complicated operators. One limitation when suing reduction is that atomic only supports a few basic operators. If you want to use a more complicated operator (such as a SSE/AVX addition) then you can replace atomic with critical reduction with OpenMP with SSE/AVX

Resources