If each individual statement in a concurrent program is atomic then is there a critical section? - critical-section

If each individual statement in a concurrent program is atomic, then is there a critical section?
This should have a simple true-or-false answer, but answering it requires an understanding of both concurrency and atomicity.
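For concreteness, here is a minimal C++ sketch of the kind of situation the question is probing (the bank-balance scenario and the names are illustrative, not from the question): every individual statement is atomic, yet the two-statement check-then-act sequence may still need to be treated as a critical section.

#include <atomic>

std::atomic<int> balance{100};

// Each statement below is atomic on its own, but the pair is not:
// another thread can withdraw between the load and the fetch_sub,
// so the balance can still go negative. Making the sequence safe
// requires a critical section (a lock or a compare-exchange loop).
bool withdraw(int amount) {
    if (balance.load() >= amount) {   // atomic read
        balance.fetch_sub(amount);    // atomic read-modify-write
        return true;
    }
    return false;
}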

Related

CUDA critical sections, thread/warp execution model and NVCC compiler decisions

Recently I posted this question about a critical section. Here is a similar question. In both questions the given answer says that whether the code "works" or not is up to the compiler, because the order of the various paths of execution is up to the compiler.
To elaborate on the rest of the question I need the following excerpts from the CUDA Programming Guide:
... Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently....
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path....
The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
What I understand from these three excerpts is that threads can diverge freely from the rest, that all branch possibilities are serialized when threads diverge, and that once a branch is taken it executes to completion. That is why the questions mentioned above end in deadlock: the ordering of the execution paths imposed by the compiler results in taking a branch that never acquires the lock.
Now the questions are: shouldn't the compiler always emit the branches in the order written by the user? Is there a high-level way to enforce that order? I know the compiler can optimize and reorder instructions, but it should not fundamentally change the logic of the code (yes, there are exceptions, such as some memory accesses without the volatile keyword, but that is why the keyword exists: to give control to the user).
Edit
The main point of this question is not critical sections; it is the compiler. For example, in the first link a compilation flag drastically changes the behaviour of the code: one version "works" and the other doesn't. What bothers me is that all the references only say "be careful"; none of them says anything about undefined behaviour from the nvcc compiler.
I believe the order of execution is not set, nor guaranteed, by the CUDA compiler. It's the hardware that sets it - as far as I can recall.
Thus,
shouldn't the compiler always emit the branches in the order written by the user?
It doesn't control execution order anyway.
is there a high-level way to enforce that order?
Just the synchronization instructions like __syncthreads().
The compiler... should not fundamentally change the logic of the code
The semantics of CUDA code is not the same as for C++ code... sequential execution of if branches is not part of the semantics.
I realize this answer may not be satisfying to you, but that's how things stand, for better or for worse.

On the Linux kernel, can atomic operations (e.g. atomic_inc, atomic_dec) protect a variable in a multi-core environment?

Atomic operations protect a variable in a multi-threading environment, but are they suitable for a multi-core environment?
Yes, they do. Atomic operations are typically implemented via atomic memory-bus operations and so work just the same in a multi-core scenario.
In fact, if you know that the data you are protecting is only accessed by different threads (tasks) on the same core, it is probably cheaper to implement the protection via other means, such as disabling preemption and/or interrupts. Atomic operations are specifically meant for situations where that is not enough, such as multi-core systems.
Multi-threaded essentially means that there are multiple tasks running (processes). According to Wikipedia:
Atomicity is a guarantee of isolation from interrupts, signals, concurrent processes and threads.
This is because each such operation is treated as a single, uninterruptible unit. Multiple threads can of course perform these operations, but only one at a time on a given variable, because one atomic operation must complete before another can take effect on the same data.
The same logic applies in a multi-core environment, where multiple processors may try to access the same data. There the guarantee is provided through mutual exclusion, which ensures that a critical code block is never executed by more than one processor at the same time. In software terms, this means using locks to ensure that other processors cannot access the data while it is in use.
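As a rough userspace C++ analogue of the mutual-exclusion idea described above (illustrative only, not kernel code; inside the kernel the corresponding tools are spinlocks, mutexes and atomic_t), a lock serializes a multi-statement critical section even when the threads run on different cores, while a single increment could instead use one atomic operation, the analogue of atomic_inc, and skip the lock entirely:

#include <mutex>
#include <thread>

int shared_data = 0;   // plain variable: needs mutual exclusion
std::mutex m;

void update() {
    std::lock_guard<std::mutex> guard(m);   // only one thread/core at a time
    int tmp = shared_data;                  // read
    shared_data = tmp + 1;                  // modify and write, uninterrupted
}

int main() {
    std::thread a(update), b(update);
    a.join();
    b.join();
    // shared_data is 2 here, no matter how many cores ran the threads.
}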

What will happen if two atomic fetch_add operations execute simultaneously?

As far as I know, operations on atomic types in C++11 are guaranteed to be atomic. However, suppose that in a multi-core system two threads perform the following operation simultaneously (with atomic<int> val = 0; initially). Will the result be 1? It seems that the result is guaranteed to be 2, but why?
val.fetch_add(1,std::memory_order_relaxed);
As a supplement, consider another situation: if thread 1 does val.store(2) and thread 2 does val.store(3), it seems that the result is either 2 or 3, but it is not certain which one.
Even if 1000 threads execute fetch_add at the "same time", the result will still be 1000. This is the whole point of atomic operations: they are synchronized.
If we had to worry about any atomic operations not being synchronized/visible to other threads, then we wouldn't have atomic operations to begin with.
When executing an atomic operation (like fetch_add) on a given object, you are guaranteed that it happens as one indivisible step: it cannot be overlapped or interrupted by other atomic operations on that object started in other threads.
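A minimal sketch of the scenario in the question (illustrative code, not from the original post): two threads perform the relaxed fetch_add concurrently, and because each read-modify-write is indivisible the final value is always 2.

#include <atomic>
#include <cassert>
#include <thread>

int main() {
    std::atomic<int> val{0};

    std::thread t1([&] { val.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { val.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();

    // Neither increment can be lost: the result is 2, never 1.
    assert(val.load() == 2);
}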

Atomicity in a parallel bus-ed CPU

As we know, atomic actions cannot be interleaved, so they can be used without fear of thread interference. For example, on a 32-bit OS "x = 3" is "generally" considered an atomic operation, but a memory access usually takes more than one clock cycle, say 3 cycles. So here is the case:
Assuming we have multiple parallel data and address buses, and thread A tries to set "x = 3", isn't there a chance for another thread, say thread B, to access the same memory location in the second cycle (while thread A is in the middle of the write operation)? How is atomicity preserved?
Hope I was able to be clear.
Thanks
There is no problem with simple assignments at all, provided the write is performed in a single bus transaction. Even when a memory write transaction takes 3 cycles, there are specific arrangements in place that prevent simultaneous bus access from different cores.
The problems arise when you do read-modify-write operations, as these involve (at least) two bus transactions and thus can lead to race conditions between cores (threads). These cases are solved by specific opcodes (prefixes) that assert the bus-lock signal for the whole duration of the following instruction, or by special instructions that do the whole job in one go.
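A small C++ sketch of the distinction drawn above; the comments describe the x86 code a compiler typically generates, which varies with compiler and memory order, so treat them as illustrative assumptions:

#include <atomic>

std::atomic<int> x{0};

void writer() {
    // An aligned store is a single bus transaction; no other core can
    // observe a half-written value. On x86 this typically compiles to an
    // ordinary mov (plus extra ordering work for stronger memory orders).
    x.store(3, std::memory_order_relaxed);
}

void incrementer() {
    // A read-modify-write is a load followed by a store, so the hardware
    // must make it indivisible. On x86, fetch_add typically compiles to a
    // lock-prefixed instruction (e.g. lock add or lock xadd), which is the
    // "bus lock" described in the answer above.
    x.fetch_add(1, std::memory_order_relaxed);
}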

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition occurs here because all the threads are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as there are threads and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and it runs slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
   Do 10, i=-(b-1),b-1
      Do 20, j=-(b-1),b-1
         if (abs(i).le.l.and.abs(j).eq.d) then
            cycle
         endif
         v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
         if (k.eq.n-1) then
            vtest(i,j,1)=v(i,j)
         endif
         if (k.eq.n) then
            vtest(i,j,2)=v(i,j)
         endif
20       continue
10    continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values of i and j, is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v, it is, in practice, not deterministic, as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values of i, which means that the threads write to different elements of v, removing (almost) the data race. As you've written the program, each thread gets a different set of values of k, but k is not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program. Why not just
do k=1,n-2
and deal with the updates to vtest at the end of the loop.
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region, but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (e.g. private or shared) of each variable, but this answer is getting a bit too long and I won't go into more detail on these, or on using a schedule clause.
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.
