What is the difference between a serial code and using the keyword critical while parallelizing a code in openmp? - parallel-processing

If I have just one for loop to parallelize and if I use #pragma omp critical while parallelizing, will that make it equivalent to a serial code?

No.
The critical directive specifies that the code it covers is executed by one thread at a time, but it will (eventually) be executed by all threads that encounter it.
The single directive specifies that the code it covers will only be executed by one thread but even that isn't exactly the same as compiling the code without OpenMP. OpenMP imposes some restrictions on what programming constructs can be used inside parallel regions (eg no jumping out of them). Furthermore, at run-time you are likely to incur an overhead for firing up OpenMP even if you don't actually run any code in parallel.

Related

CUDA critical sections, thread/warp execution model and NVCC compiler decisions

Recently I posted this question, about a critical section. Here is a similar question. In those questions the given answer says, that is up to the compiler if the code "works" or not, because the order of the various paths of execution is up to the compiler.
To elaborate the rest of the question I need the following excerpts from The CUDA programming guide:
... Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently....
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path....
The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
What I understand from this three excerpts is that, threads can diverge freely from the rest, all the branch possibilities will be serialized if there is divergence between threads, and if a branch is taken it will execute till completion. And that is why the questions mentioned above ends on deadlock, because the ordering of the execution paths imposed by the compiler, results in the taking of a branch that doesn't get the lock.
Now the question is: the compiler shouldn't always put the branches in the order written by the user?, is there a high level way to enforce the order? I know, the compiler can optimize, do a reordering of the instructions, etc, but it should not fundamentally change the logic of the code (yes there are exceptions like some memory access without the volatile keyword, but that is why the keyword exists, to give control to the user).
Edit
The main point of this question is not about critical sections, is about the compiler, for example in the first link, a compilation flag change drastically the logic of the code. One "working", and the other doesn't. What bothers me, is that in all the reference, it only says be careful, nothing about undefined behaviour from the nvcc compiler.
I believe the order of execution is not set, nor guaranteed, by the CUDA compiler. It's the hardware that sets it - as far as I can recall.
Thus,
the compiler shouldn't always put the branches in the order written by the user?
It doesn't control execution order anyway
is there a high level way to enforce the order?
Just the synchronization instructions like __syncthreads().
The compiler... should not fundamentally change the logic of the code
The semantics of CUDA code is not the same as for C++ code... sequential execution of if branches is not part of the semantics.
I realize this answer may not be satisfying to you, but that's how things stand, for better or for worse.

OpenMP using lock

Why the program did not speed up and become slower than the sequential version?
Will it be faster if I change the lock to omp reduction?
The omp code for the calculation avgvalue
You have multiple threads running a single critical command. That is basically just as efficient as a serial code since only one thread will execute at a time. And you are also adding overhead by creating multiple threads and having them wait on one another to finish their execution before they can execute.
I think that making a reduction would be faster, since there are optimizations for that command in OpenMP.

Is there a way to end idle threads in GNU OpenMP?

I use OpenMP for parallel sorting at start of my program. Once data is loaded and sorted, the program runs as a daemon and OpenMP is not used any more. Is there a way to turn off the idle threads created by OpenMP? omp_set_num_threads() doesn't affect the idle threads which have already been created for a task.
Please look up OMP_WAIT_POLICY, which is new in OpenMP 4 [https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html].
There are non-portable alternatives like GOMP_SPINCOUNT if your OpenMP implementation isn't recent enough. I recall from OpenMP specification discussions that at least Intel, IBM, Cray, and Oracle support their own implementation of this feature already.
I don't believe there is a way to trigger the threads' destruction. Modern OpenMP implementations tend to keep threads around in a pool to speed up starting future parallel sections.
In your case I would recommend a two program solution (one parallel to sort and one serial for the daemon). How you communicate the data between them is up to you. You could do something simple like writing it to a file and then reading it again. This may not be as slow as it sounds since a modern linux distribution might keep that file in memory in the file cache.
If you really want to be sure it stays in memory, you could launch the two processes simultaneously and allow them to share memory and allow the first parallel sort process to exit when it is done.
In theory, OpenMP has a implicit synchronization at the end of the "pragma" clauses. So, when the OpenMP parallel work ends, all the threads are deleted. You dont need to kill them or free them: OpenMP does that automatically.
Maybe "omp_get_num_threads()" is telling to you the actual configuration of the program, not the number of active threads. I mean: if you set the number of threads to 4, omp will tell you that the configuration is "4 threads", but this does not mean that there are actually 4 threads in process.

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
Do 10, i=-(b-1),b-1
Do 20, j=-(b-1),b-1
if (abs(i).le.l.and.abs(j).eq.d) then
cycle
endif
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
if (k.eq.n-1) then
vtest(i,j,1)=v(i,j)
endif
if (k.eq.n) then
vtest(i,j,2)=v(i,j)
endif
20 continue
10 continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this ?
Well the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i which means that they write to different elements of v, removing (almost) the data race. As you've written the program each thread gets a different set of values of k but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program, why not just
do k=1,n-2
and deal with the updates to vtest at the end of the loop.
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (eg private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these. Or on using a schedule clause.
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

CUDA __syncthreads() usage within a warp

If it was absolutely required for all the threads in a block to be at the same point in the code, do we require the __syncthreads function if the number of threads being launched is equal to the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
shared _voltatile_ sdata[16];
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
Updated with more information about using volatile
Presumably you want all threads to be at the same point since they are reading data written by other threads into shared memory, if you are launching a single warp (in each block) then you know that all threads are executing together. On the face of it this means you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.
Remember that a compiler will assume that it can optimise providing the intra-thread semantics remain correct, including delaying stores to memory where the data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read the data. Using volatile causes the compiler to perform the memory write rather than keep in registers, however this has some risks and is more of a hack (meaning I don't know how this will be affected in the future)
Technically, you should always use __syncthreads() to conform with the CUDA Programming Model
The warp size is and always has been 32, but you can:
At compile time use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version)
At run time use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual)
Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
You still need __syncthreads() even if warps are being executed in parallel. The actual execution in hardware may not be parallel because the number of cores within a SM (Stream Multiprocessor) can be less than 32. For example, GT200 architecture has 8 cores in each SM, so you can never be sure all threads are in the same point in the code.

Resources