Is making something "atomic" essentially just a lock mechanism? - parallel-processing

From the OpenMP summary pdf: "operation ensures that a specific storage location is updated atomically". This brought up the question for me of what "atomic" is and whether it is just a lock mechanism. So if I remember correctly, "atomic" means that some hardware support is built in to prevent anything else from changing the value. So is making something "atomic" essentially just implementing a lock mechanism, or is it something more?

I think there might be some confusion around "atomicity" vs "isolation". They're similar concepts, but there is a subtle difference between them. Atomicity means that an operation either completes entirely, or not at all. Isolation guarantees that operations that happen concurrently result in a state that could have been caused by them executing serially.
For example, if the operation is "add 1 to x, then multiply x by 2", and x starts as 3, the result will be 3 if there is any sort of failure, or 8 if there is not. Even if the power cuts out right after 1 is added, the result when you reboot is guaranteed to be 3.
Now consider what happens if this operation is performed twice, concurrently. Both could fail, resulting in x=3. One could fail, x=8. Both could succeed, x=18. If we are guaranteed isolation, these are the only outcomes. But, if we are only given atomicity, not isolation, a fourth outcome could happen, wherein the individual parts are interleaved as "add 1, add 1, multiply by 2, multiply by 2", resulting in x=20!
If you're guaranteed only isolation, but not atomicity, you could end up with x=3, 4, 5, 8, 9, 10, or 18.
With all that said, this is a common misconception, and often when people say "atomicity" they mean both. That's my suspicion of what they mean in the OpenMP documentation.
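A minimal Rust sketch of that interleaving (the static X and the use of fetch_update are my own illustration, not from the OpenMP docs): each step is atomic on its own, but the two-step operation is not isolated, so 20 is a possible final value.
use std::sync::atomic::{AtomicI32, Ordering};
use std::thread;
static X: AtomicI32 = AtomicI32::new(3);
fn main() {
    let op = || {
        // Each step is individually atomic...
        X.fetch_add(1, Ordering::SeqCst); // add 1 to x
        X.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |v| Some(v * 2)).unwrap(); // multiply x by 2
        // ...but the two steps together are not isolated.
    };
    let t1 = thread::spawn(op);
    let t2 = thread::spawn(op);
    t1.join().unwrap();
    t2.join().unwrap();
    // Serial execution gives 18; the interleaving "add, add, multiply, multiply" gives 20.
    let x = X.load(Ordering::SeqCst);
    assert!(x == 18 || x == 20);
}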

Updating a value stored in memory is a three-step process. First the value is fetched from memory and brought to one of the CPU's registers. Then, the value in the register is changed in some way (say incremented). Finally, the new value is written back to memory so that it can be used again.
Doing this (or any other) operation atomically simply means that all three of those steps happen, or none of them does.
It only becomes interesting or important when you have another thread or process that also needs to use that same memory value. Suppose both want to increment the value, which is initially zero. Without atomic operations, the second thread might read the original value (0) from memory even as the first thread is incrementing it in a register. Effectively both threads may see the value 0, increment it to 1, and return it to memory. At the end of this sequence, the value in memory would be 1 despite having been incremented twice.
With an atomic increment operation, there is no way that that sequence can occur. Once the first thread enters the atomic sequence, there is no way the second thread can read the value in memory before the first thread has incremented it and written it back to memory. You'll always get the correct value (2).
So, to answer your question, it's almost like a lock mechanism. In particular, it's like a lock mechanism exists around whatever the original operation was. Atomic operations themselves are frequently used in the implementation of other locking mechanisms, such as mutexes and semaphores.
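To make that concrete, here is a minimal Rust sketch (the counter name and thread count are illustrative): the fetch_add is the "all three steps or none" read-modify-write described above.
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
static COUNTER: AtomicU64 = AtomicU64::new(0);
fn main() {
    let handles: Vec<_> = (0..2)
        .map(|_| thread::spawn(|| {
            // Fetch, increment, and write back as one indivisible operation:
            // no other thread can slip in between the read and the write.
            COUNTER.fetch_add(1, Ordering::SeqCst);
        }))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Always 2; a plain non-atomic read/increment/write could lose an update and end at 1.
    assert_eq!(COUNTER.load(Ordering::SeqCst), 2);
}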

Related

Which std::sync::atomic::Ordering to use?

All the methods of std::sync::atomic::AtomicBool take a memory ordering (Relaxed, Release, Acquire, AcqRel, and SeqCst), which I have not used before. Under what circumstances should these values be used? The documentation uses confusing “load” and “store” terms which I don’t really understand. For example:
A producer thread mutates some state held by a Mutex, then calls AtomicBool::compare_and_swap(false, true, ordering) (to coalesce invalidations), and if it swapped, posts an “invalidate” message to a concurrent queue (e.g. mpsc or a winapi PostMessage). A consumer thread resets the AtomicBool, reads from the queue, and reads the state held by the Mutex. Can the producer use Relaxed ordering because it is preceded by a mutex, or must it use Release? Can the consumer use store(false, Relaxed), or must it use compare_and_swap(true, false, Acquire) to receive the changes from the mutex?
What if the producer and consumer share a RefCell instead of a Mutex?
I'm not an expert on this, and it's really complicated, so please feel free to critique my post. As pointed out by mdh.heydari, cppreference.com has much better documentation of orderings than Rust (C++ has an almost identical API).
For your question
You'd need to use "release" ordering in your producer and "acquire" ordering in your consumer. This ensures that the data mutation occurs before the AtomicBool is set to true.
If your queue is asynchronous, then the consumer will need to keep trying to read from it in a loop, since the producer could get interrupted between setting the AtomicBool and putting something in the queue.
If the producer code might run multiple times before the client runs, then you can't use RefCell, because the producer could mutate the data while the client is reading it. Otherwise it's fine.
There are other better and simpler ways to implement this pattern, but I assume you were just giving it as an example.
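For concreteness, here is a hedged sketch of that pattern in current Rust, assuming an mpsc channel as the queue; compare_exchange stands in for the now-deprecated compare_and_swap, and names like dirty are mine. (The Mutex itself already synchronizes the state; the flag orderings are shown to match the question.)
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
fn main() {
    let state = Arc::new(Mutex::new(0u32));
    let dirty = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel::<()>();
    let producer = {
        let (state, dirty) = (Arc::clone(&state), Arc::clone(&dirty));
        thread::spawn(move || {
            *state.lock().unwrap() += 1; // mutate the shared state
            // Release: the mutation above cannot be reordered after this flag set.
            if dirty.compare_exchange(false, true, Ordering::Release, Ordering::Relaxed).is_ok() {
                tx.send(()).unwrap(); // post one coalesced "invalidate" message
            }
        })
    };
    let consumer = thread::spawn(move || {
        rx.recv().unwrap();
        // Acquire: everything written before the Release above is visible after this.
        dirty.swap(false, Ordering::Acquire);
        println!("consumer saw {}", *state.lock().unwrap());
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}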
What are orderings?
The different orderings have to do with what another thread sees happen when an atomic operation occurs. Compilers and CPUs are normally both allowed to reorder instructions in order to optimize code, and the orderings affect how much reordering they're allowed to do.
You could just always use SeqCst, which basically guarantees everyone will see that instruction as having occurred wherever you put it relative to other instructions, but in some cases if you specify a less restrictive ordering then LLVM and the CPU can better optimize your code.
You should think of these orderings as applying to a memory location (instead of applying to an instruction).
Ordering Types
Relaxed Ordering
There are no constraints besides any modification to the memory location being atomic (so it either happens completely or not at all). This is fine for something like a counter if the values retrieved by/set by individual threads don't matter as long as they're atomic.
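A minimal sketch of such a counter (names are illustrative): every increment is still atomic, but no ordering is imposed relative to other memory.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
static EVENTS: AtomicUsize = AtomicUsize::new(0);
fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| {
            for _ in 0..1_000 {
                EVENTS.fetch_add(1, Ordering::Relaxed); // atomic, but unordered w.r.t. other memory
            }
        }))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(EVENTS.load(Ordering::Relaxed), 4_000); // no increments are lost
}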
Acquire Ordering
This constraint says that any variable reads that occur in your code after "acquire" is applied can't be reordered to occur before it. So, say in your code you read some shared memory location and get value X, which was stored in that memory location at time T, and then you apply the "acquire" constraint. Any memory locations that you read from after applying the constraint will have the value they had at time T or later.
This is probably what most people would expect to happen intuitively, but because a CPU and optimizer are allowed to reorder instructions as long as they don't change the result, it isn't guaranteed.
In order for "acquire" to be useful, it has to be paired with "release", because otherwise there's no guarantee that the other thread didn't reorder its write instructions that were supposed to occur at time T to an earlier time.
Acquire-reading the flag value you're looking for means you won't see a stale value somewhere else that was actually changed by a write before the release-store to the flag.
Release Ordering
This constraint says that any variable writes that occur in your code before "release" is applied can't be reordered to occur after it. So, say in your code you write to a few shared memory locations and then store some value X to a shared memory location at time T, applying the "release" constraint to that store. Any writes that appear in your code before the "release" is applied are guaranteed to have occurred before it.
Again, this is what most people would expect to happen intuitively, but it isn't guaranteed without constraints.
If the other thread trying to read value X doesn't use "acquire", then it isn't guaranteed to see the new value with respect to changes in other variable values. So it could get the new value, but it might not see new values for any other shared variables. Also keep in mind that testing is hard. Some hardware won't in practice show re-ordering with some unsafe code, so problems can go undetected.
Jeff Preshing wrote a nice explanation of acquire and release semantics, so read that if this isn't clear.
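In the meantime, here is a small Rust sketch of the pairing (DATA and READY are names I made up): the Release store publishes the earlier write, and the Acquire load guarantees the reader then sees it.
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;
static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);
fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);    // the write "at time T"
        READY.store(true, Ordering::Release); // cannot be reordered before the line above
    });
    let reader = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) { // pairs with the Release store
            std::hint::spin_loop();
        }
        assert_eq!(DATA.load(Ordering::Relaxed), 42); // guaranteed: no stale 0
    });
    writer.join().unwrap();
    reader.join().unwrap();
}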
AcqRel Ordering
This does both Acquire and Release ordering (ie. both restrictions apply). I'm not sure when this is necessary - it might be helpful in situations with 3 or more threads if some Release, some Acquire, and some do both, but I'm not really sure.
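For what it's worth, one case where a single read-modify-write plausibly wants both halves is a pointer handoff in which the same swap both publishes a new value and takes back an old one; a hedged sketch (the mailbox idea, SLOT, and the message text are invented for illustration):
use std::sync::atomic::{AtomicPtr, Ordering};
use std::{ptr, thread};
static SLOT: AtomicPtr<String> = AtomicPtr::new(ptr::null_mut());
fn main() {
    let producer = thread::spawn(|| {
        let boxed = Box::new(String::from("hello"));
        // Release half publishes the String's contents; the Acquire half would
        // matter if a non-null pointer came back and we had to read or free it.
        let prev = SLOT.swap(Box::into_raw(boxed), Ordering::AcqRel);
        assert!(prev.is_null());
    });
    let consumer = thread::spawn(|| loop {
        // Acquire half synchronizes with the producer's Release half, so the
        // String is fully visible; Release half would matter if we left a value behind.
        let p = SLOT.swap(ptr::null_mut(), Ordering::AcqRel);
        if !p.is_null() {
            let msg = unsafe { Box::from_raw(p) };
            assert_eq!(*msg, "hello");
            break;
        }
        thread::yield_now();
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}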
SeqCst Ordering
This is the most restrictive and, therefore, the slowest option. It forces memory accesses to appear to occur in one identical order to every thread. On x86 this requires an MFENCE instruction on all writes to atomic variables (a full memory barrier, including StoreLoad), while the weaker orderings don't. (SeqCst loads don't require a barrier on x86, as you can see in this C++ compiler output.)
Read-Modify-Write accesses, like atomic increment, or compare-and-swap, are done on x86 with locked instructions, which are already full memory barriers. If you care at all about compiling to efficient code on non-x86 targets, it makes sense to avoid SeqCst when you can, even for atomic read-modify-write ops. There are cases where it's needed, though.
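One classic case where SeqCst really is needed, as a hedged sketch: a store-then-load handshake where each thread raises its own flag and then checks the other's. With only Release stores and Acquire loads both threads could read false; SeqCst rules that out.
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
static A: AtomicBool = AtomicBool::new(false);
static B: AtomicBool = AtomicBool::new(false);
fn main() {
    let t1 = thread::spawn(|| {
        A.store(true, Ordering::SeqCst);
        !B.load(Ordering::SeqCst) // true means "I didn't see the other thread's flag"
    });
    let t2 = thread::spawn(|| {
        B.store(true, Ordering::SeqCst);
        !A.load(Ordering::SeqCst)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    // Under the single total order SeqCst guarantees, at least one load sees the
    // other flag set, so both results can never be true at once.
    assert!(!(r1 && r2));
}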
For more examples of how atomic semantics turn into ASM, see this larger set of simple functions on C++ atomic variables. I know this is a Rust question, but it's supposed to have basically the same API as C++. godbolt can target x86, ARM, ARM64, and PowerPC. Interestingly, ARM64 has load-acquire (ldar) and store-release (stlr) instructions, so it doesn't always have to use separate barrier instructions.
By the way, x86 CPUs are "strongly ordered" by default, which means they always act as if at least AcqRel mode were set. So on x86, the "ordering" mainly affects how LLVM's optimizer behaves. ARM, on the other hand, is weakly ordered: the hardware only guarantees relaxed ordering by default, which gives the compiler full freedom to reorder things and means that stronger orderings need extra barrier (or special load/store) instructions on such CPUs.

What does the processor do while waiting for a main memory fetch

Assuming the L1 and L2 cache requests result in a miss, does the processor stall until main memory has been accessed?
I have heard about the idea of switching to another thread; if so, what is used to wake up the stalled thread?
There are many, many things going on in a modern CPU at the same time. Of course anything needing the result of the memory access cannot proceed, but there may be plenty more things to do. Assume the following C code:
double sum = 0.0;
for (int i = 0; i < 4; ++i) sum += a [i];
if (sum > 10.0) call_some_function ();
and assume that reading the array a stalls. Since reading a [0] stalls, the addition sum += a [0] will stall. However, the processor goes on performing other instructions. Like increasing i, checking that i < 4, looping, and reading a [1]. This stalls as well, the second addition sum += a [1] stalls - this time because neither the correct value of sum nor the value a [1] are known, but things go on and eventually the code reaches the statement "if (sum > 10.0)".
The processor at this point has no idea what sum is. However it can guess the outcome, based on what happened in previous branches, and start executing the function call_some_function () speculatively. So it continues running, but carefully: When call_some_function () stores things to memory, it doesn't happen yet.
Eventually reading a [0] succeeds, many cycles later. When that happens, it will be added to sum, then a [1] will be added to sum, then a [2], then a [3], and then the comparison sum > 10.0 will be performed properly. Then the decision to branch will turn out to be correct or incorrect. If incorrect, all the results of call_some_function () are thrown away. If correct, all the results of call_some_function () are turned from speculative results into real results.
If the stall takes too long, the processor will eventually run out of things to do. It can easily handle the four additions and one compare that couldn't be executed, but eventually it's too much and the processor must stop. However, on a hyper threaded system, you have another thread that can continue running happily, and at a higher speed because nobody else uses the core, so the whole core still can go on doing useful work.
A modern out-of-order processor has a Reorder Buffer (ROB) which tracks all inflight instructions and keeps them in program order. Once the instruction at the head of the ROB is finished, it is cleared from the ROB. Modern ROBs are ~100-200 entries in size.
Likewise, a modern OoO processor has a Load/Store Queue which tracks the state of all memory instructions.
And finally, instructions that have been fetched and decoded, but not yet executed, sit in something called the Issue Queue/Window (or "reservation station", depending on the terminology of the designers and modulo some differences in micro-architecture that are largely irrelevant to this question). Instructions that are sitting in the Issue Queue have a list of register operands they depend on and whether or not their operands are "busy". Once all of their register operands are no longer busy, the instruction is ready to be executed and it requests to be "issued".
The Issue Scheduler picks from among the ready instructions and issues them to the Execution Units (this is the part that is out-of-order).
Let's look at the following sequence:
addi x1 <- x2 + x3
ld x2 0(x1)
sub x3 <- x2 - x4
As we can see, the "sub" instruction depends on the previous load instruction (by way of the register "x2"). The load instruction will be sent to memory and miss in the caches. It may take 100+ cycles for it to return and writeback the result to x2. Meanwhile, the sub instruction will be placed in the Issue Queue, with its operand "x2" being marked as busy. It will sit there waiting for a very, very long time. The ROB will quickly fill up with predicted instructions and then stall. The whole core will grind to a halt and twiddle its thumbs.
Once the load returns, it writes back to "x2", broadcasts this fact to the Issue Queue, the sub hears "x2 is now ready!" and the sub can finally proceed, the ld instruction can finally commit, and the ROB will start emptying so new instructions can be fetched and inserted into the ROB.
Obviously, this leads to an idle pipeline as lots of instructions will get backed up waiting for the load to return. There are a couple of solutions to this.
One idea is to simply switch the entire thread out for a new thread. In a simplified explanation, this basically means flushing out the entire pipeline, storing out to memory the PC of the thread (which is pointing to the load instruction) and the state of the committed register file (at the conclusion of the add instruction before the load). That's a lot of work to schedule a new thread over a cache miss. Yuck.
Another solution is simultaneous multi-threading. For a 2-way SMT machine, you have two PCs and two architectural register files (i.e., you have to duplicate the architectural state for each thread, but you can then share the micro-architectural resources). In this manner, once you've fetched and decoded the instructions for a given thread, they appear the same to the backend. Thus, while the "sub" instruction will sit waiting forever in the Issue Queue for the load to come back, the other thread can proceed ahead. As the first thread comes to a grinding halt, more resources can be allocated to the 2nd thread (fetch bandwidth, decode bandwidth, issue bandwidth, etc.). In this manner, the pipeline stays busy by effortlessly filling it with the 2nd thread.

How to understand acquire and release semantics?

I found three functions on MSDN, listed below:
1.InterlockedDecrement().
2.InterlockedDecrementAcquire().
3.InterlockedDecrementRelease().
I know those functions are used to decrement a value as an atomic operation, but I don't know the distinction between the three functions.
(um... but don't ask me what it means exactly)
I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear to not deal with each other.
This is useful, for instance, if you have some code that, maybe is updating a structure:
if ( playerMoved ) {
    playerPos.X += dx;
    playerPos.Y += dy;
    // Keep the player above the world's surface.
    if ( playerPos.Z + dz > 0 ) {
        playerPos.Z += dz;
    }
    else {
        playerPos.Z = 0;
    }
}
Most of the above statements may be reordered because there's no data dependency between them. In fact, a superscalar CPU may execute several of those statements simultaneously, or start working on the Z section sooner, since it doesn't affect X or Y but might take longer.
Here's the problem with that - let's say that you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, maybe, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, and now you may set your 'done' flag before you've actually committed the rest of the structure to memory, and now your "lock-free" queue doesn't work.
This is where Acquire and Release ordering semantics come into play. I signal that I'm starting work by reading a flag (or similar) with Acquire semantics, and the CPU guarantees that any memory games I play after that instruction actually stay below that instruction. I signal that I'm done by writing a flag with Release semantics, and the CPU guarantees that any memory games I did just before the release actually stay before the release.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier - everything that came before this instruction has to actually be committed before this instruction, and everything that came after this instruction has to actually be committed after this instruction. But that might mean I wait for a bunch of irrelevant memory writes from earlier in the instruction stream (perhaps the function call prologue) that have nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining those both is a full memory barrier.
http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.
Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.
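As a rough Rust analogue of why a decrement might come in plain, Acquire, and Release flavours, here is the classic reference-count drop pattern (the type and names are mine, not from MSDN): the Release decrement publishes this thread's last writes, and the Acquire fence before freeing pulls in every other thread's.
use std::sync::atomic::{fence, AtomicUsize, Ordering};
struct RefCounted {
    count: AtomicUsize,
    data: String,
}
fn release_ref(obj: &RefCounted) -> bool {
    // Release: our earlier writes to `data` become visible to whichever
    // thread performs the final decrement.
    if obj.count.fetch_sub(1, Ordering::Release) == 1 {
        // Acquire: see every other thread's writes before destroying the object.
        fence(Ordering::Acquire);
        return true; // caller may now free `obj`
    }
    false
}
fn main() {
    let obj = RefCounted { count: AtomicUsize::new(2), data: String::from("payload") };
    println!("{}", obj.data);
    assert!(!release_ref(&obj)); // first drop: not the last reference
    assert!(release_ref(&obj));  // last drop: safe to destroy
}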

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
  Do 10, i=-(b-1),b-1
    Do 20, j=-(b-1),b-1
      if (abs(i).le.l.and.abs(j).eq.d) then
        cycle
      endif
      v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
      if (k.eq.n-1) then
        vtest(i,j,1)=v(i,j)
      endif
      if (k.eq.n) then
        vtest(i,j,2)=v(i,j)
      endif
20  continue
10  continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i, which means that the threads write to different elements of v, (almost) removing the data race. As you've written the program, each thread gets a different set of values of k, but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like tying an anchor to your program; why not just
do k=1,n-2
and deal with the updates to vtest at the end of the loop.
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (eg private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these. Or on using a schedule clause.
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

Programmatically determine amount of time remaining before preemption

I am trying to implement a custom lock-free structure. It operates similarly to a stack, so it has a take() and a free() method and operates on a pointer and an underlying array. It uses optimistic concurrency: free() writes a dummy value to pointer+1, increments the pointer, and writes the real value to the new address; take() reads the value at pointer in a spin/sleep style until it no longer reads the dummy value, and then decrements the pointer. In both operations, changes to the pointer are done with compare-and-swap, and if that fails the whole operation starts again. The purpose of the dummy value is to ensure consistency, since the write operation can be preempted after the pointer is incremented.
This situation leads me to wonder whether it is possible to prevent preemption at that critical point by somehow determining how much time is left before the thread will be preempted by the scheduler in favor of another thread. I am not worried about hardware interrupts. I am trying to eliminate the possible sleep from my reading function so that I can rely on a pure spin.
Is this at all possible?
Are there other means of handling this situation?
EDIT: To clarify how this may be helpful: if the critical operation is interrupted, it is effectively like taking out an exclusive lock, and all other threads will have to sleep before they can continue with their operations.
EDIT: I am not hell-bent on having it solved like this; I am merely trying to see if it is possible. The probability of that operation being interrupted at that location for a very long time is extremely low, and if it does happen it will be OK for all the other operations to sleep so that it can complete.
Some regard this as premature optimization, but this is just my pet project. Regardless, that does not exclude research and science from attempting to improve techniques. Even though computer science has matured considerably and every new technology we use today is an implementation of what was already known 40 years ago, we should not stop being creative in addressing even the smallest of concerns, like trying to make a reasonable set of operations atomic without too many performance implications.
Such information surely exists somewhere, but it is of no use to you.
Under "normal conditions", you can expect upwards of a dozen DPCs and upwards of 1,000 interrupts per second. These do not respect your time slices, they occur when they occur. Which means, on the average, you can expect 15-16 interrupts within a time slice.
Also, scheduling does not strictly go quantum by quantum. The scheduler under present Windows versions will normally let a thread run for 2 quantums, but may change its opinion in the middle if some external condition changes (for example, if an event object is signalled).
Insofar, even if you know that you still have so and so many nanoseconds left, whatever you think you know might not be true at all.
Cannot be done without time travel. You're stuffed.
