Is there a way to efficiently synchronize a subset of theads using OpenMP? - openmp

Because OpenMP nested parallelism has often performance problems, I was wondering if there is a way to implement a partial barrier that would synchronize only a subset of all threads.
Here is an example code structure:
#pragma omp parallel
{
int nth = omp_get_num_threads();
int tid = omp_get_thread_num();
if (tid < nth/2) {
// do some work
...
// I need some synchronization here, but only for nth/2 threads
#pragma omp partial_barrier(nth/2)
// do some more work
...
} else {
// do some other independent work
...
}
}
I did not find anything like that in the openmp standard, but maybe there is a way to efficiently program a similar behaviour with locks or something?
EDIT:
So my actual problem is that I have a computation kernel (a Legendre transform -- part A) that is efficiently parallelized using OpenMP with 4 to 12 threads depending on the problem size.
This computation is followed by a Fast Fourier transform (part B).
I have several independent datasets (typically about 6 to 10) that have to be processed by A followed by B.
I would like to use more parallelism with more threads (48 to 128, depending on the machines).
Since A is not efficiently parallelized with more than 4 to 12 threads, the idea is to split the threads into several groups, each group working on an independent dataset. Because the datasets are independent, I don't need to synchronize all the threads (which is quite expensive when many threads are used) before doing B, only the subset working on a given dataset.
OpenMP tasks with depenencies would do what I need, but my experience is that on some platforms (xeon servers) the performance is significantly lower to what you can get with simple threads.
Is there a way to synchronize a subset of threads efficiently?

Related

Does the using of SIMD load main CPU registers?

Let's imagine we have software developer that's goal is achieve absolute maximum of CPU's performance.
In today's CPUs we have many cores, we can load data in cache for faster processing and we also have SIMD instructions (AVX for example) that allow us to sum\multiply\do other ops with array of items (multiply 8 integers per one CPU clock). The disadvantage of this instruction is the cost of sending data & instructions to SIMD module + overhead of converting vector type to primitive types (sorry I familiar only with C#'s Vector) (We not looling on code complexety for now).
As far as I understand, while we using SIMD, main registers of CPU used only for sending and recieving data to this registers and main ALU blocks used for general purpose calculations are idle at this time.
And here is my question - will using of SIMD instructions load main CPU blocks? For example if we have huge amount of different calculations (let's imagine 40% of them are best to run on SIMD and 60% of them are better to run as a usual), will SIMD allow us to gain performance boost in this way: 100% of all cores performace + n% of SIMD's boost performance?
I'm asking this question because of for example with GPGPU we can use GPU for parallel calculations and CPU used in this case only for sending and recieving data, so it's idle all the time and we can utilize it's performance for sensitive for latency tasks.
Looks like this is a question about Out-Of-Order-Execution? Modern x64 have a number of execution ports on the CPU, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel SkyLake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to displatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps
void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
for(int i = 0; i < count; ++i)
{
a[i] = mul8f(a[i], b[i]);
b[i] = mul8f(b[i], c[i]);
}
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i], d[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The same CPU hardware used to compute either float or float8 types is exactly the same. There are simply a couple of bits in the machine code encoding that specifies the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf

Turn the code into a code using SIMD instructions

I am preparing for an exam and are doing some exercises without facit. So I am been giving this code and are wondering if I have turned the code into SIMD instructions.
The code
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000; i++)
C += A[i] * B[i];
Since there is no remainder, we don't need to take care of it. We also assume that it is a 128 bit register, and therefore can calculate 4 single precision floating point values.
My result - using SIMD
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000/4; i += 4)
C += A[i] * B[i];
C += A[i+1] * B[i+1];
C += A[i+2] * B[i+2];
C += A[i+3] * B[i+3];
What advantages can you see for using SIMD instructions instead of writing programs with multiple threads?
Assuming the omitted curly braces on your second loop is simply a typo, and typo in the for loop, and the fact that you ask about multiplying floats but your code shows arrays of ints, this won't get great vectorisation even if the compiler sees it. While the compiler might do the loads of 4 values from A and B as a single instruction each, and do the 4 multiplies in one instruction, your code forces the compiler to then extract each of the 4 products and sum them sequentially, and getting individual values out of a SIMD register is typically quite slow.
If on the other hand you did this
float A[100000];
float B[100000];
float C0=0, C1=0, C2=0, C3=0;
for (size_t i=0; i < 100000/4; i += 4)
{
C0 += A[i+0] * B[i+0];
C1 += A[i+1] * B[i+1];
C2 += A[i+2] * B[i+2];
C3 += A[i+3] * B[i+3];
}
float C = (C0 + C1) + (C2 + C3);
Then a good compiler could vectorise this as now it sees that within each loop it loads two SIMD registers, multiplies them, then it can add the result to a SIMD register of the sums, and only extracts those 4 sums and sums them all at the end.
A vectorising compile can do this with SIMD and it will not change the order of evaluation of individual sums (FP maths is NOT associative). The compiler is typically not allowed to change the order of FP maths for this reason (not without some extra flags that allow it to technically breach the language standards), so the code above can be precisely represented by SIMD instructions, and will run much faster (in fact I'd unwind the loop a further stage as the multiplication will be a bottleneck as it stands).
This is sort of the trick with SIMD, you have to understand and then think how the operation would be best implemented with vector instructions, and then write your code to execute the same sequence of operations, and hope the compiler spots what you've done.
Or you can write the vector instructions yourself with intrinsics, or use OpenMP or similar to tell the compiler more explicitly what to do.
Amongst the advantages of SIMD over threads for such an operation is the fact that you're making use of more of the silicon within a single core... so you're not preventing another thread from getting cycles. On our compute grid we typically run many single threaded processes on any one machine to keep all the cores busy at all times... in such a case doing this sum using more cores is a false economy, you'd simply be stealing cycles that another thread could usefully be running another job.
Yes, the provided code should compile into SIMD instructions with capable CPUs and compilers.
On vector-capable processors, SIMD exposes hardware features that greatly accelerate identical, parallel computations. For instance, SIMD typically makes better use of the cache on a single core due to streaming RAM access, assuming the data being processed is localized in contiguous areas of memory. Using multiprocessing, cache competition and other synchronization overhead could actually reduce performance as the various cores attempt to write data simultaneously. This is in addition to the intrinsic boost on von-Neumann machines from only having to read one, not four, separate instructions from the shared system memory.
The logic to do these arithmetic operations in parallel is always present, but requires specific SIMD instructions to utilize. As a result, SIMD tends to be used in hot loops where hand tuning makes overall optimization sense.

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that requires lots of operations over a large data sets and the operations on each of the data elements are independent, OpenCL can be one of the good choice to make it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
[X,Y] = function1(BigData);
M = functionA(X);
b = function2(M);
N = functionB(Y);
c = function3(N);
}
Here the function1 is applied on each of the elements on the BigData and produce another two big data sets (X,Y). function2 and function3 are then applied operation individually on each of the elements on these X,Y data, respectively.
Since the operations of all the functions are applied on each of the elements of the data sets independently, using GPU might make it faster. So I come up with the following:
while( function(b,c)!=TRUE)
{
//[X,Y] = function1(BigData);
1. load kernel1 and BigData on the GPU. each of the thread will work on one of the data
element and save the result on X and Y on GPU.
//M = functionA(X);
2a. load kernel2 on GPU. Each of the threads will work on one of the
data elements of X and save the result on M on GPU.
(workItems=n1, workgroup size=y1)
//b = function2(M);
2b. load kernel2 (Same kernel) on GPU. Each of the threads will work on
one of the data elements of M and save the result on B on GPU
(workItems=n2, workgroup size=y2)
3. read the data B on host variable b
//N = functionB(Y);
4a. load kernel3 on GPU. Each of the threads will work on one of the
data element of Y and save the result on N on GPU.
(workItems=n1, workgroup size=y1)
//c = function2(M);
4b. load kernel3 (Same kernel) on GPU. Each of the threads will work
on one of the data element of M and save the result on C on GPU
(workItems=n2, workgroup size=y2)
5. read the data C on host variable c
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run on a GPU). And if the kernels have some sort of synchronizations it might be ended up with more slowdown.
I also believe the workflow is kind of common. So what is the best practice to using OpenCL for speedup for a program like this.
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind Of job size, (Tens? Thousands? Millions of arithmetic ops? How big are your data sets?) or what hardware. (Compute card? Laptop IGPU?) "Significant overhead" can mean a lot of different things. 5ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?

OpenMP slower reduction

There are two versions of openmp codes with reduction and without.
// with reduction
#pragma omp parallel for reduction(+:sum)
for (i=1;i<= num_steps; i++){
x = (i-0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
// without reduction
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x;
double partial_sum = 0;
for (i=id;i< num_steps; i+=numthreads){
x = (i+0.5)*step;
partial_sum += + 4.0/(1.0+x*x);
}
#pragma omp critical
sum += partial_sum;
}
I run the codes using 8 cores, the total time double for the reduction version. What's the reason? Thanks.
Scalar reduction in OpenMP is usually quite fast. The observed behaviour in your case is due to two things made wrong in two different ways.
In your first code you did not make x private. Therefore it is shared among the threads and besides getting incorrect results, the execution suffers from the data sharing. Whenever one thread writes to x, the core that it executes on sends a message to all other cores and makes them invalidate their copies of that cache line. When any of them writes to x later, the whole cache line has to be reloaded and then the cache lines in all other cores get invalidated. And so forth. This slows things down significantly.
In your second code you have used the OpenMP critical construct. This is a relatively heavy-weight in comparison with the atomic adds, usually used to implement the reduction at the end. Atomic adds on x86 are performed using the LOCK instruction prefix and everything gets implemented in the hardware. On the other side, critical sections are implemented using mutexes and require several instructions and often busy waiting loops. This is far less efficient than the atomic adds.
In the end, your first code is slowed down due to bad data sharing condition. Your second code is slowed down due to the use of incorrect synchronisation primitive. It just happens that on your particular system the latter effect is less severe than the former and hence your second example runs faster.
If you want to manually parallelize the loop as well as the reduction you can do it like this:
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
int start = id*num_steps/numthreads;
int finish = (id+1)*num_steps/numthreads;
double x;
double partial_sum = 0;
for (i=start; i<finish ; i++){
x = (i+0.5)*step;
partial_sum += + 4.0/(1.0+x*x);
}
#pragma omp atomic
sum += partial_sum;
}
However, I don't recommend this. Reductions don't have to be done with atomic and you should just let OpenMP parallelize the loop. The first case is the best solution (but make sure you declare x private).
Edit: According to Hristo once you make x private these two methods are nearlly the same in speed. I want to explain why using critical in your second method instead of atomic or allowing OpenMP to do the reduction has hardly any effect on the performance in this case.
There are two ways I can think of doing a reduction:
Sum the partial sums linearly using atomic or critical
Sum the partial sums using a tree. I.e. if you have 8 cores this gives you eight partial sums you reduce this to 4 partial sums then 2 partial sums then 1.
The first cast has linear convergence in the number of cores. The second case goes as the log of the number of cores. So one my be temped to think the second case is always better. However, for only eight cores the reduction is entirely dominated by taking the partial sums. Adding eight numbers with atomic/critical vs. reducing the tree in 3 steps will be negligable.
What if you have e.g. 1024 cores? Then the tree can be reduced in only 10 steps and the linear sum takes 1024 steps. But the constant term can be much larger for the second case and doing the partial sum of a large array e.g. with 1 million elements probably still dominates the reduction.
So I suspect that using atomic or even critical for a reduction has a negligable effect on the reduction time in general.

OpenMP: Huge slowdown in what should be ideal scenario

In the code below I'm trying to compare all elements of an array to all other elements in a nested for loop. (It's to run a simple n-body simulation. I'm testing with only 4 bodies for 4 threads on 4 cores). An identical sequential version of the code without OpenMP modifications runs in around 15 seconds for 25M iterations. Last night this code ran in around 30 seconds. Now it runs in around 1 minute! I think the problem may lie in that the threads must write to the array which is passed to the function via a pointer.
The array is dynamically allocated elsewhere and is composed of structs I defined. This is just a hunch. I have verified that the 4 threads are running on 4 separate cores at 100% and that they are accessing the elements of the array properly. Any ideas?
void runSimulation (particle* particles, int numSteps){
//particles is a pointer to an array of structs I've defined and allocated dynamically before calling the function
//Variable Initializations
#pragma omp parallel num_threads(4) private(//The variables inside the loop) shared(k,particles) // 4 Threads for four cores
{
while (k<numSteps){ //Main loop.
#pragma omp master //Check whether it is time to report progress.
{
//Some simple if statements
k=k+1; //Increment step counter for some reason omp doesn't like k++
}
//Calculate new velocities
#pragma omp for
for (i=0; i<numParticles; i++){ //Calculate forces by comparing each particle to all others
Fx = 0;
Fy = 0;
for (j=0; j<numParticles; j++){
//Calcululate the cumulative force by comparing each particle to all others
}
//Calculate accelerations and set new velocities
ax = Fx / particles[i].mass;
ay = Fy / particles[i].mass;
//ARE THESE TWO LINES THE PROBLEM?!
particles[i].xVelocity += deltaT*ax;
particles[i].yVelocity += deltaT*ay;
}
#pragma omp master
//Apply new velocities to create new positions after all forces have been calculated.
for (i=0; i<numParticles; i++){
particles[i].x += deltaT*particles[i].xVelocity;
particles[i].y += deltaT*particles[i].yVelocity;
}
#pragma omp barrier
}
}
}
You are thrashing the cache. All the cores are writing to the same shared structure, which will be continually bouncing around between the cores via the L2 (best case), L3 or main memory/memory bus (worst case). Depending on how stuff is shared this is taking anywhere from 20 to 300 cycles, while writes to private memory in L1 takes 1 cycle or less in ideal conditions.
That explains your slowdown.
If you increase your number of particles the situation may become less severe because you'll often be writing to distinct cache lines, so there will be less thrashing. btown above as the right idea in suggesting a private array.
Not sure if this will fix the problem, but you might try giving each thread its own copy of the full array; the problem might be that the threads are fighting over accessing the shared memory, and you're seeing a lot of cache misses.
I'm not sure of the exact openmp syntax you'd use to do this, but try doing this:
Allocate memory to hold the entire particles array in each thread; do this once, and save all four new pointers.
At the beginning of each main loop iteration, in the master thread, deep-copy the main array four times to each of those new arrays. You can do this quickly with a memcpy().
Do the calculation such that the first thread writes to indices 0 < i < numParticles/4, and so on.
In the master thread, before you apply the new velocities, merge the four arrays into the main array by copying over only the relevant indices. You can do this quickly with a memcpy().
Note that you can parallelize your "apply new velocities" loop without any problems because each iteration only operates on a single index; this is probably the easiest part to parallelize.
The new operations will only be O(N) compared to your calculations which are O(N^2), so they shouldn't take too much time in the long run. There are definitely ways to optimize the steps that I laid out for you, Gabe, but I'll leave those to you.
I don't agree that the problem is cache thrashing since the size of the struct particles must exceed the size of a cache line just from the number of members.
I think the more likely culprit is that the overhead for initializing an omp for is 1000's of cycles http://www.ualberta.ca/CNS/RESEARCH/Courses/2001/PPandV/OpenMP.Eric.pdf and the loop has only a few calculations in it. I'm not remotely surprised the loop is slower with only 4 bodies. If you had a few 100's of bodies the situation would be different. I once worked on a loop a bit like this, and ended up using pthreads directly.

Resources