As far as I know, atomic operations on atomic types in C++11 are guaranteed to be atomic. However, in a multi-core system, if two threads perform the following operation simultaneously, could the result be 1? (Suppose initially std::atomic<int> val = 0;.) It seems that the result is guaranteed to be 2, but why?
val.fetch_add(1,std::memory_order_relaxed);
As a supplement, consider another situation: if thread 1 does val.store(2) and thread 2 does val.store(3), it seems that the result is either 2 or 3, but it is not certain which one.
Even if 1000 threads execute fetch_add at the "same time", the result will still be 1000. This is the whole point of atomic operations: they are synchronized.
If we had to worry about any atomic operations not being synchronized/visible to other threads, then we wouldn't have atomic operations to begin with.
When executing an atomic operation (like fetch_add) you are guaranteed that only one atomic operation on that object is in progress at any given time; it cannot be overlapped or interrupted by atomic operations on the same object started in other threads.
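As a minimal sketch of that guarantee (the thread and iteration counts here are arbitrary choices of mine), concurrent relaxed fetch_add calls never lose an increment:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> val{0};
    const int kThreads = 4;          // arbitrary
    const int kIncrements = 100000;  // arbitrary

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < kIncrements; ++i)
                val.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write
        });
    }
    for (auto& w : workers) w.join();

    // Always prints 400000: no increment is lost, even though the threads race.
    std::cout << val.load() << '\n';
}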
As we know, atomic actions cannot be interleaved, so they can be used without fear of thread interference. For example, on a 32-bit OS "x = 3" is "generally" considered an atomic operation, but a memory access usually takes more than one clock cycle, let's say 3 cycles. So here is the case:
Assuming we have multiple parallel data and address buses and thread A tries to set "x = 3", isn't there any chance for another thread, let's say thread B, to access the same memory location in the second cycle (while thread A is in the middle of the write operation)? How is atomicity going to be preserved?
Hope I was able to be clear.
Thanks
There is no problem with simple assignments at all, provided the write is performed in a single bus transaction. Even when a memory write transaction takes 3 cycles, there are specific arrangements in place that prevent simultaneous bus access from different cores.
The problems arise when you do read-modify-write operations, as these involve (at least) two bus transactions and thus can lead to race conditions between cores (threads). These cases are solved by specific opcode prefixes that assert a bus-lock signal for the whole duration of the next instruction, or by special instructions that do the whole job in one go.
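To illustrate the read-modify-write case in C++ terms rather than bus signals (the function name atomic_max below is my own, not a standard one): operations that need to read a location, compute, and write it back are typically built on a hardware compare-and-swap, which performs the whole read-compare-write as one indivisible transaction:

#include <atomic>

// Sketch: atomically raise m to at least "candidate".
// compare_exchange_weak only writes if m still holds the value we read;
// if another core changed it in between, it reloads "observed" and we retry.
void atomic_max(std::atomic<int>& m, int candidate) {
    int observed = m.load(std::memory_order_relaxed);
    while (observed < candidate &&
           !m.compare_exchange_weak(observed, candidate,
                                    std::memory_order_relaxed)) {
        // "observed" now holds the value another core wrote; loop and re-check.
    }
}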
Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
      Do 500, k=1,n
        Do 10, i=-(b-1),b-1
          Do 20, j=-(b-1),b-1
            if (abs(i).le.l.and.abs(j).eq.d) then
              cycle
            endif
            v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
            if (k.eq.n-1) then
              vtest(i,j,1)=v(i,j)
            endif
            if (k.eq.n) then
              vtest(i,j,2)=v(i,j)
            endif
20        continue
10      continue
500   continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values of i and j, is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v, your program is, in practice, not deterministic, as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well, the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i, which means that they write to different elements of v, removing (almost all of) the data race. As you've written the program, each thread gets a different set of values of k, but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program; why not just
do k=1,n-2
and deal with the updates to vtest at the end of the loop?
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region, but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (e.g. private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these, or on using a schedule clause.
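For what it's worth, here is a minimal C++/OpenMP sketch of that structure (your code is Fortran, but the directives behave the same way; the function name, bounds, and loop body below are placeholders of mine):

#include <omp.h>

void sweep(int n, int b) {
    #pragma omp parallel             // one team of threads for the whole sweep
    {
        for (int k = 1; k <= n; ++k) {   // every thread runs the k loop
            #pragma omp for              // the i iterations are shared out among the threads
            for (int i = -(b - 1); i <= b - 1; ++i) {
                for (int j = -(b - 1); j <= b - 1; ++j) {
                    // ... update v at (i, j), e.g. the averaging stencil from the question ...
                }
            }   // implicit barrier here: the threads re-synchronise before the next k
        }
    }
}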
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.
I am a bit confused about how it is possible that warps diverge and need to be synchronized via the __syncthreads() function. All threads in a block handle the same code in a SIMT fashion. How could it be that they are not in sync? Is it related to the scheduler? Do the different warps get different computing times? And why is there an overhead when using __syncthreads()?
Let's say we have 12 different warps in a block and 3 of them have finished their work. So now they are idling while the other warps get their computation time. Or do they still get computation time to do the __syncthreads() function?
First let's be careful with terminology. Warp divergence refers to threads within a single warp that take different execution paths, due to control structures in the code (if, while, etc.) Your question really has to do with warps and warp scheduling.
Although the SIMT model might suggest that all threads execute in lockstep, this is not the case. First of all, threads within different blocks are completely independent. They may execute in any order with respect to each other. For your question about threads within the same block, let's first observe that a block can have up to 1024 (or perhaps more) threads, but today's SMs (the SM or SMX is the "engine" inside the GPU that processes a threadblock) don't have 1024 CUDA cores, so it's not even theoretically possible for an SM to execute all threads of a threadblock in lockstep. Note that a single threadblock executes on a single SM, not across all (or more than one) SMs simultaneously. So even if a machine has 512 or more total CUDA cores, they cannot all be used to handle the threads of a single threadblock, because a single threadblock executes on a single SM. (One reason for this is so that SM-specific resources, like shared memory, can be accessible to all threads within a threadblock.)
So what happens? It turns out each SM has a warp scheduler. A warp is nothing more than a collection of 32 threads that gets grouped together, scheduled together, and executed together. If a threadblock has 1024 threads then it has 32 warps of 32 threads per warp. Now, for example, on Fermi, an SM has 32 CUDA cores, so it is reasonable to think about an SM executing a warp in lockstep (and that is what happens, on Fermi). By lockstep, I mean that (ignoring the case of warp divergence, and also certain aspects of instruction-level-parallelism, I'm trying to keep the explanation simple here...) no instruction in the warp is executed until the previous instruction has been executed by all threads in the warp. So a Fermi SM can only actually be executing one of the warps in a threadblock at any given instant. All other warps in that threadblock are queued up, ready to go, waiting.
Now, when the execution of a warp hits a stall for any reason, the warp scheduler is free to move that warp out and bring another ready-to-go warp in (this new warp might not even be from the same threadblock, but I digress.) Hopefully by now you can see that if a threadblock has more than 32 threads in it, not all the threads are actually getting executed in lockstep. Some warps are proceeding ahead of other warps.
This behavior is normally desirable, except when it isn't. There are times when you do not want any thread in the threadblock to proceed beyond a certain point, until a condition is met. This is what __syncthreads() is for. For example, you might be copying data from global to shared memory, and you don't want any of the threadblock data processing to commence until shared memory has been properly populated. __syncthreads() ensures that all threads have had a chance to copy their data element(s) before any thread can proceed beyond the barrier and presumably begin computations on the data that is now resident in shared memory.
The overhead with __syncthreads() comes in two flavors. First of all there's a very small cost just to process the machine-level instructions associated with this built-in function. Second, __syncthreads() will normally have the effect of forcing the warp scheduler and SM to shuffle through all the warps in the threadblock, until each warp has met the barrier. If this is useful, great. But if it's not needed, then you're spending time doing something that isn't needed. Hence the advice to not just liberally sprinkle __syncthreads() through your code. Use it sparingly and where needed. If you can craft an algorithm that doesn't use it as much as another, that algorithm may be better (faster).
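To connect this to code, here is a minimal CUDA C++ sketch of the global-to-shared staging pattern described above (the kernel name and the fixed block size of 256 are illustrative, and the launch is assumed to cover the array exactly):

__global__ void reverse_within_block(const float* in, float* out)
{
    __shared__ float tile[256];                       // block size assumed to be 256 threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];                      // each thread stages one element

    __syncthreads();   // barrier: no thread goes on until every warp in the
                       // block has finished its copy and reached this point

    // Safe only because of the barrier: each thread reads an element written
    // by a *different* thread, possibly in a warp scheduled much later.
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];
}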
From the OpenMP summary pdf: "operation ensures that a specific storage location is updated atomically". This brought up the question for me of what "atomic" is and whether it is just a lock mechanism. If I remember correctly, "atomic" means that some hardware support is built in to prevent anything else from changing the value. So is making something "atomic" essentially just implementing a lock mechanism, or is it something more?
I think there might be some confusion around "atomicity" vs "isolation". They're similar concepts, but there is a subtle difference between them. Atomicity means that an operation either completes entirely, or not at all. Isolation guarantees that operations that happen concurrently result in a state that could have been caused by them executing serially.
For example, if the operation is "add 1 to x, then multiply x by 2", and x starts as 3, the result will be 3 if there is any sort of failure, or 8 if there is not. Even if the power cuts out right after 1 is added, the result when you reboot is guaranteed to be 3.
Now consider what happens if this operation is performed twice, concurrently. Both could fail, resulting in x=3. One could fail, x=8. Both could succeed, x=18. If we are guaranteed isolation, these are the only outcomes. But, if we are only given atomicity, not isolation, a fourth outcome could happen, wherein the individual parts are interleaved as "add 1, add 1, multiply by 2, multiply by 2", resulting in x=20!
If you're guaranteed only isolation, but not atomicity, you could end up with x=3, 4, 5, 8, 9, 10, or 18.
With all that said, this is a common misconception, and often when people say "atomicity" they mean both. That's my suspicion of what they mean in the OpenMP documentation.
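To make the distinction concrete, here is a minimal C++ sketch of the "add 1 to x, then multiply x by 2" example (the names and the mutex are my own): each step is atomic on its own, but only the locked version also gives isolation.

#include <atomic>
#include <mutex>

std::atomic<int> x{3};
std::mutex x_guard;

// Each step is an atomic operation, but another thread can slip in between
// them, so two concurrent calls can interleave as
// "add 1, add 1, multiply, multiply" and leave x at 20.
void add_then_double_atomic_only() {
    x.fetch_add(1);                                     // atomic step 1
    int cur = x.load();
    while (!x.compare_exchange_weak(cur, cur * 2)) {}   // atomic step 2
}

// Holding a lock around both steps adds isolation: two concurrent calls
// behave as if they ran one after the other (3 -> 8 -> 18).
void add_then_double_isolated() {
    std::lock_guard<std::mutex> lock(x_guard);
    x.fetch_add(1);
    int cur = x.load();
    while (!x.compare_exchange_weak(cur, cur * 2)) {}
}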
Updating a value stored in memory is a three-step process. First the value is fetched from memory and brought to one of the CPU's registers. Then, the value in the register is changed in some way (say incremented). Finally, the new value is written back to memory so that it can be used again.
Doing this (or any other) operation atomically simply means that all three of those steps happen, or none of them does.
It only becomes interesting or important when you have another thread or process that also needs to use that same memory value. Suppose both want to increment the value, which is initially zero. Without atomic operations, the second thread might read the original value (0) from memory even as the first thread is incrementing it in a register. Effectively both threads may see the value 0, increment it to 1, and return it to memory. At the end of this sequence, the value in memory would be 1 despite having been incremented twice.
With an atomic increment operation, there is no way that that sequence can occur. Once the first thread enters the atomic sequence, there is no way the second thread can read the value in memory before the first thread has incremented it and written it back to memory. You'll always get the correct value (2).
So, to answer your question, it's almost like a lock mechanism. In particular, it's like a lock mechanism exists around whatever the original operation was. Atomic operations themselves are frequently used in the implementation of other locking mechanisms, such as mutexes and semaphores.
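As a small illustration of that last point (a toy sketch, not a production-quality lock), a spinlock can be built out of nothing more than an atomic test-and-set:

#include <atomic>

// Toy spinlock: the entire lock is one atomic read-modify-write operation.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous value;
        // spin until we are the thread that flipped it from clear to set.
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy-wait (a real mutex would yield or sleep here)
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);  // atomically release the lock
    }
};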
Consider a VLIW processor with an issue width equal to N: this means that it is able to start N operations simultaneously, so each very long instruction can consist of a maximum of N operations.
Suppose that the VLIW processor loads a very long instruction which consists of operations with different latencies: operations belonging to the same very long instruction could end at different times. What happens if an operation finishes its execution before other operations belonging to the same very long instruction? Could a subsequent operation (that is, an operation belonging to the next very long instruction) start execution before the remaining operations of the current very long instruction have finished executing? Or does a very long instruction wait for the completion of all operations belonging to the current very long instruction?
Most VLIW processors I've seen do support operations with different latencies. It's up to the compiler to schedule these instructions, and to ensure that the operands are available before the operation executes. A VLIW processor is dumb, and doesn't check any dependencies between operations. When a long instruction word executes, each operation in the word simply reads its input data from a register file, and writes its result back at the end of the same cycle, or later if an operation takes two or three cycles.
This only works when instructions are deterministic, and always take the same number of cycles. All VLIW architectures I've seen have operations that take a fixed number of cycles, no less, no more. In case they do take longer, like for instance an external memory fetch, the whole machine is simply stalled.
Now there is one key thing that limits the scheduling of instructions that have different latencies: the number of ports to the register file. The ports are the connections between the register file and the execution units of the operations. In a VLIW processor, each operation executes in an issue slot, and each issue slot has its own ports to the register file. Ports are expensive in terms of hardware. The more ports, the more silicon is required to implement the register file.
Now consider the following situation, where a two-cycle operation wants to write its result to the register file at the same time as a single-cycle operation that was scheduled right after it. There's now a conflict, as both operations want to write to the same register file over the same port. Again, it's the compiler's task to ensure this doesn't happen. In many VLIW architectures, the operations that execute in the same issue slot all have the same latency, which avoids this conflict.
Now to answer your questions:
You said: "What happens if an operation finishes its execution before other operations belonging to the same very long instruction?"
Nothing special happens. The processor just continues to execute the next very long instruction word.
You said: "Could a subsequent operation (that is, an operation belonging to the next very long instruction) start execution before the remaining operations of the current very long instruction have finished executing?"
Yes, but this could present a register port conflict later on. It's up to the compiler to prevent this situation.
You said: "Or does a very long instruction wait for the completion of all operations belonging to the current very long instruction?"
No. At every cycle the processor simply goes to the next very long instruction word. The exception is when an operation takes longer than normal, for instance because there's a cache miss; then the pipeline is stalled, and the machine does not progress to the next long instruction word.
The idea behind VLIW is that the compiler figures out lots of things for the processor to do in parallel and packages them up in bundles called "very long instruction words".
Amdahl's law tells us that the speedup of a parallel program (e.g., the parallel parts of a VLIW instruction) is constrained by the slowest part (e.g., the longest-duration sub-instruction).
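For reference, the usual statement of the law: if a fraction p of the work can be sped up by a factor s, the overall speedup is bounded by 1 / ((1 - p) + p/s), so one slow part (a small s, or a large serial fraction 1 - p) caps the gain from the whole bundle.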
The simple answer with VLIW and "long latencies" is "don't mix sub-instructions with different latencies". The practical answer is that VLIW machines try not to have sub-instructions with different latencies at all; ideally you want "one clock" sub-instructions. Typically even memory fetches take only one clock, by virtue of being split into a "memory fetch start (here's an address to fetch)" sub-instruction and a "wait for the previous fetch to arrive" sub-instruction, the latter being the only one with variable latency. The idea is that the compiler generates as much other computation as it can in between, so that the memory fetch latency is covered by the other instructions.