OpenMP atomic compare and swap

I have a shared variable s and a private variable p inside a parallel region.
How can I do the following atomically (or at least better than with a critical section)?
if ( p > s )
    s = p;
else
    p = s;
I.e., I need to update the global maximum (if the local maximum is better) or read it, if it was updated by another thread.

OpenMP 5.1 introduced the compare clause which allows compare-and-swap (CAS) operations such as
#pragma omp atomic compare
if (s < p) s = p;
In combination with a capture clause, you should be able to achieve what you want:
int s_cap;
// here we capture the shared variable and also update it if p is larger
#pragma omp atomic compare capture
{
    s_cap = s;
    if (s < p) s = p;
}
// update p if the captured shared value is larger
if (s_cap > p) p = s_cap;
The only problem? The 5.1 spec is very new and, as of today (2020-11-27), none of the widespread compilers, i.e., those available on Godbolt, supports OpenMP 5.1. See here for a more or less up-to-date list. Adding compare is still listed as an unclaimed task on Clang's OpenMP page. GCC is still working on full OpenMP 5.0 support and the trunk build on Godbolt doesn't recognise compare. Intel's oneAPI compiler may or may not support it - it's not available on Godbolt and I can't get it to compile OpenMP code.
Your best bet for now is to use atomic capture combined with a compiler-specific CAS atomic, possibly in a loop.
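For illustration, a minimal sketch of that fallback using GCC/Clang's __atomic builtins (an assumption about your toolchain; MSVC would need _InterlockedCompareExchange instead) could look like this: it installs the local maximum with a CAS retry loop, or else reads the current shared maximum back into p.

// Sketch only: update the shared maximum *s with *p, or read it back into *p.
// Assumes GCC/Clang __atomic builtins; each thread calls this with its private p.
void atomic_max_or_read(int *s, int *p) {
    int expected = __atomic_load_n(s, __ATOMIC_RELAXED);   // current view of the shared max
    while (*p > expected) {
        // Try to install *p. On failure, expected is refreshed with the value
        // another thread just stored, and we retry only if *p is still larger.
        if (__atomic_compare_exchange_n(s, &expected, *p, /* weak = */ 1,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED))
            return;                                         // we published the new maximum
    }
    *p = expected;                                          // shared maximum is >= *p; read it
}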

Related

In-place RGB->BGR color conversion is slower in OpenCV

Is it the case that the in-place RGB->BGR color conversion routine in OpenCV saves some memory, but takes longer? If yes, can anyone explain why?
My application calls the cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR) routine in OpenCV (version 4.2.0). In an effort to make the application faster, I tried the in-place version of this routine (by invoking it with the same Mat object for source and destination). I expected the speed to slightly improve, since the in-place version does not allocate new memory.
To test my expectation, I ran my application in a loop over 10,000 250x250 RGB images. To my surprise, my application became slower when the in-place version was used. In fact, I saw that the larger the image (500x500 vs 250x250), the greater the difference between the in-place and regular version.
Is this expected? If so, is it because the in-place version does a swap operation (more statements) and the regular version is only a copy operation?
Would anyone be willing to try to reproduce this behavior? It can be done easily by timing the following snippet in two different ways: 1) use the snippet below as-is, and 2) follow the brief instructions in the comments for the in-place version.
// Read image
Mat srcMat = imread(filename);
// Comment out this line for the in-place version
Mat dstMat;
for (int i = 0; i < 10000; i++)
{
    // Use srcMat instead of dstMat in the in-place version
    cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR);
}
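For reference, a complete timing harness might look like this (a sketch of my own; the filename and the way of switching to the in-place version are placeholders):

#include <cstdint>
#include <cstdio>
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat srcMat = cv::imread("input.png");   // placeholder filename
    cv::Mat dstMat;                              // for in-place: pass srcMat as both arguments instead
    int64_t t0 = cv::getTickCount();
    for (int i = 0; i < 10000; i++) {
        cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR);
    }
    double seconds = (cv::getTickCount() - t0) / cv::getTickFrequency();
    std::printf("elapsed: %.3f s\n", seconds);
    return 0;
}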
Thanks.
You may dig into the sources to find the reason.
There are a few possible code paths (using OpenCL or not, using IPP or not).
On my machine, the execution of cv::cvtColor reaches the function CvtColorIPPLoopCopy in color.hpp:
template <typename Cvt>
bool CvtColorIPPLoopCopy(const uchar * src_data, size_t src_step, int src_type, uchar * dst_data, size_t dst_step, int width, int height, const Cvt& cvt)
{
    Mat temp;
    Mat src(Size(width, height), src_type, const_cast<uchar*>(src_data), src_step);
    Mat source = src;
    if( src_data == dst_data )
    {
        src.copyTo(temp);
        source = temp;
    }
    bool ok;
    parallel_for_(Range(0, source.rows),
                  CvtColorIPPLoop_Invoker<Cvt>(source.data, source.step, dst_data, dst_step,
                                               source.cols, cvt, &ok),
                  source.total()/(double)(1<<16) );
    return ok;
}
The code checks whether src_data == dst_data, and if they are equal it copies the source image into a temporary image:
if( src_data == dst_data )
{
    src.copyTo(temp);
    source = temp;
}
The extra data copy may be the reason the in-place processing takes longer.
Note:
I can't say this is the reason for sure, because there are other possible code paths.
There are many highly performance optimized functions that do not support "in-place" processing.
When OpenCV needs to execute a function that doesn't support "in-place" processing, the solution may be copying the source image to a temporary location.
The same practice may be used for other execution code paths.
As I commented:
In-place processing prevents some compilation (and execution) optimizations due to loop-carried dependencies.
In some cases there are also parallelization issues regarding "in-place" processing.
That's the reason many optimized "primitive" functions do not support "in-place" processing.
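As a rough illustration of that point, here is a simplified scalar sketch (my own, not OpenCV's actual kernel): the out-of-place version reads one buffer and writes a different one, while the in-place version reads and writes the same buffer, so the compiler and vectorizer have to be more conservative about aliasing.

#include <cstddef>

void rgb2bgr_copy(const unsigned char* src, unsigned char* dst, std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        dst[3*i + 0] = src[3*i + 2];   // B <- R
        dst[3*i + 1] = src[3*i + 1];   // G unchanged
        dst[3*i + 2] = src[3*i + 0];   // R <- B
    }
}

void rgb2bgr_inplace(unsigned char* buf, std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        unsigned char tmp = buf[3*i + 0];
        buf[3*i + 0] = buf[3*i + 2];   // swap R and B within the same buffer
        buf[3*i + 2] = tmp;
    }
}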

Sharing memory with the kernel and compiler optimizations

A frame is shared with the kernel.
User-space code:
read frame // read frame content
_mm_mfence // prevent "releasing" the frame before we have read everything
frame.status = 0 // "release" a frame
Kernel code:
poll for frame.status // reads a frame's status
_mm_lfence
The kernel can poll it asynchronously, in another thread, so there is no syscall between the user-space code and the kernel-space code.
Is it correctly synchronized?
I doubt it, because of the following situation:
A compiler has a weak memory model and we have to assume that it can make wild changes, as long as the optimized/changed program is consistent within a single thread.
So, in my view, we need a second barrier, because it is possible that the compiler optimizes out the store frame.status = 0.
Yes, it would be a very wild optimization, but if the compiler could prove that no one in the context (within the thread) reads that field, it could optimize it out.
I believe that is theoretically possible, isn't it?
So, to prevent that, we can add a second barrier:
User-space code:
read frame // read frame content
_mm_mfence // prevent "releasing" the frame before we have read everything
frame.status = 0 // "release" the frame
_mm_mfence
OK, now the compiler restrains itself from that optimization.
What do you think?
EDIT
The question is prompted by the issue that _mm_mfence does not prevent stores from being optimized out.
@PeterCordes, to make sure I understand: _mm_mfence does not prevent stores from being optimized out (it is just an x86 memory barrier, not a compiler barrier). However, atomic_thread_fence(any_order) prevents reorderings (depending on any_order, obviously), but does it also prevent stores from being optimized out?
For example:
// x is an int pointer
*x = 5;
*(x + 4) = 6;
std::atomic_thread_fence(std::memory_order_release);
Does this prevent the stores to x from being optimized out? It seems that it must - otherwise every store to x would have to be volatile.
However, I have seen a lot of lock-free code, and it does not mark fields as volatile.
_mm_mfence is also a compiler barrier. (See When should I use _mm_sfence _mm_lfence and _mm_mfence, and also BeeOnRope's answer there).
atomic_thread_fence with release, acq_rel, or seq_cst stops earlier stores from merging with later stores. But mo_acquire doesn't have to.
Writes to non-atomic global variables can only be optimized out by merging with other writes to the same non-atomic variables, not by optimizing them away entirely. So the real question is what reorderings can happen that can let two non-atomic assignments come together.
There has to be an assignment to an atomic variable in there somewhere for there to be anything that another thread could synchronize with. Some compilers might give atomic_thread_fence stronger behaviour wrt. non-atomic variables, but in C++11 there's no way for another thread to legally observe anything about the ordering of *x and x[4] in
#include <atomic>
// mo_release / mo_relaxed are shorthand for std::memory_order_release / _relaxed
constexpr auto mo_release = std::memory_order_release;
constexpr auto mo_relaxed = std::memory_order_relaxed;

std::atomic<int> shared_flag {0};
int x[8];

void writer() {
    *x = 0;
    x[4] = 0;
    std::atomic_thread_fence(mo_release);
    x[4] = 1;
    std::atomic_thread_fence(mo_release);
    shared_flag.store(1, mo_relaxed);
}
The store to shared_flag has to appear after the stores to x[0] and x[4], but it's only an implementation detail what order the stores to x[0] and x[4] happen in, and whether there are 2 stores to x[4].
For example, on the Godbolt compiler explorer gcc7 and earlier merge the stores to x[4], but gcc8 doesn't, and neither do clang or ICC. The old gcc behaviour does not violate the ISO C++ standard, but I think they strengthened gcc's thread_fence because it wasn't strong enough to prevent bugs in other cases.
For example,
void writer_gcc_bug() {
    *x = 0;
    std::atomic_thread_fence(std::memory_order_release);
    shared_flag.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    *x = 2;  // gcc7 and earlier merge this, which is arguably a bug
}
gcc only does shared_flag = 1; *x = 2; in that order. You could argue that there's no way for another thread to safely observe *x after seeing shared_flag == 1, because this thread writes it again right away with no synchronization. (i.e. data race UB in any potential observer makes this reordering arguably legal).
But gcc developers don't think that's enough reason (it may violate the guarantees of the builtin __atomic functions that the <atomic> header uses to implement the API). And there may be other cases where a standards-conforming program really could observe such aggressive reordering, i.e. a real bug that violates the standard.
Apparently this changed on 2017-09 with the fix for gcc bug 80640.
Alexander Monakov wrote:
I think the bug is that on x86 __atomic_thread_fence(x) is expanded into nothing for x!=__ATOMIC_SEQ_CST, it should place a compiler barrier similar to expansion of __atomic_signal_fence.
(__atomic_signal_fence includes something as strong as asm("" ::: "memory" ).)
Yup that would definitely be a bug. So it's not that gcc was being really clever and doing allowed reorderings, it was just mostly failing at thread_fence, and any correctness that did happen was due to other factors, like non-inline function boundaries! (And that it doesn't optimize atomics, only non-atomics.)

Purpose of _Compiler_barrier() on 32bit read

I have been stepping through the function calls that are involved when I assign to an atomic_long type on VS2017 with a 64-bit project. I specifically wanted to see what happens when I copy an atomic_long into a non-atomic variable, and whether there is any locking around it.
atomic_long ll = 10;
long t2 = ll;
Ultimately it ends up with this call (I've removed some code that was ifdefed out)
inline _Uint4_t _Load_seq_cst_4(volatile _Uint4_t *_Tgt)
{   /* load from *_Tgt atomically with
       sequentially consistent memory order */
    _Uint4_t _Value;
    _Value = *_Tgt;
    _Compiler_barrier();
    return (_Value);
}
Now, I've read from MSDN that a plain read of a 32-bit value will be atomic:
Simple reads and writes to properly-aligned 32-bit variables are
atomic operations.
...which explains why there is no Interlocked function for just reading; only those for changing/comparing. What I'd like to know is what the _Compiler_barrier() bit is doing. This is #defined as
__MACHINE(void _ReadWriteBarrier(void))
...and I've found on MSDN again that this
Limits the compiler optimizations that can reorder memory accesses
across the point of the call.
But I don't get this, as there are no other memory accesses apart from the return; surely the compiler wouldn't move the assignment below that, would it?
Can someone please clarify the purpose of this barrier?
_Load_seq_cst_4 is an inline function. The compiler barrier is there to block compile-time reordering with later code in whatever calling function this inlines into.
For example, consider reading a SeqLock. (Over-simplified from this actual implementation).
#include <atomic>

std::atomic<unsigned> sequence;
std::atomic_long value;

long seqlock_try_read() {
    // this would normally be the body of a retry-loop
    unsigned seq1 = sequence;
    long tmpval = value;
    unsigned seq2 = sequence;
    if (seq1 == seq2 && (seq1 & 1) == 0)
        return tmpval;
    else
        return -1;  // writer was modifying it; the caller should retry the loop
}
If we didn't block compile-time reordering, the compiler could merge both reads of sequence into a single access, perhaps like this:
long tmpval = value;
unsigned seq1 = sequence;
unsigned seq2 = sequence;
This would defeat the locking mechanism (where the writer increments sequence once before modifying the data, then again when it's done). Readers are entirely lockless, but it's not a "lock-free" algo because if the writer gets stuck mid-update, the readers can't read anything.
The barrier within each load function blocks reordering with other things after inlining.
(The C++11 memory model is very weak, but the x86 memory model is strong, only allowing StoreLoad reordering. Blocking compile-time reordering with later loads/stores is sufficient to give you an acquire / sequential-consistency load at runtime. x86: Are memory barriers needed here?)
BTW, a better example might be something where some non-atomic variables are read/written after seeing a certain value in an atomic flag. MSVC probably already avoids reordering or merging of atomic accesses, and in the seqlock the data being protected also has to be atomic.
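For example, a pattern like the following hypothetical sketch (the names are mine) is where the ordering really matters, because the payload itself is non-atomic:

#include <atomic>

int payload;                        // non-atomic data guarded by the flag
std::atomic<int> ready{0};

void producer() {
    payload = 42;                                // plain store
    ready.store(1, std::memory_order_release);   // publish: payload must be visible first
}

int consumer() {
    if (ready.load(std::memory_order_acquire)) { // pairs with the release store
        // The acquire load (and, at compile time, a barrier like the one in
        // _Load_seq_cst_4) keeps this non-atomic read after the flag check.
        return payload;
    }
    return -1;                                   // not published yet
}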
Why don't compilers merge redundant std::atomic writes?

explicit flush directive with OpenMP: when is it necessary and when is it helpful

One OpenMP directive I have never used and don't know when to use is flush (with and without a list).
I have two questions:
1.) When is an explicit `omp flush` or `omp flush(var1, ...)` necessary?
2.) Is it sometimes not necessary but helpful (i.e. can it make the code faster)?
The main reason I can't understand when to use an explicit flush is that flushes are done implicitly after many directives (e.g. barrier, single, ...) which synchronize the threads. I can't, for example, see why using flush and not synchronizing (e.g. with nowait) would be helpful.
I understand that different compilers may implement omp flush in different ways. Some may interpret a flush with a list as one without (i.e. flush all shared objects): OpenMP flush vs flush(list). But I only care about what the specification requires. In other words, I want to know where an explicit flush may in principle be necessary or helpful.
Edit: I think I need to clarify my second question. Let me give an example. I would like to know if there are cases where removing an implicit flush (e.g. with nowait) and using an explicit flush instead, but only on certain shared variables, would be faster (and still give the correct result). Something like the following:
float a, b;
#pragma omp parallel
{
    #pragma omp for nowait // No barrier. Do not flush on exit.
    // code which uses only shared variable a

    #pragma omp flush(a) // Flush only variable a rather than all shared variables.

    #pragma omp for
    // Code which uses both shared variables a and b.
}
I think that code still needs a barrier after the first for loop, but all barriers have an implicit flush, so that defeats the purpose. Is it possible to have a barrier which does not do a flush?
The flush directive tells the OpenMP compiler to generate code to make the thread's private view of the shared memory consistent again. OpenMP usually handles this pretty well and does the right thing for typical programs. Hence, there's usually no need for flush.
However, there are cases where the OpenMP compiler needs some help. One of these cases is when you try to implement your own spin lock. In these cases, you would need a combination of flushes to make things work, since otherwise the spin variables will not be updated. Getting the sequence of flushes correct will be tough and will be very, very error prone.
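For illustration only, a hand-rolled spin lock might look roughly like this (my own sketch; in real code you should use omp_set_lock/omp_unset_lock or higher-level constructs instead):

int lock_var = 0;                               /* shared: 0 = free, 1 = taken */

void spin_acquire(void) {
    int old;
    do {
        #pragma omp atomic capture
        { old = lock_var; lock_var = 1; }       /* atomic test-and-set */
    } while (old == 1);                         /* spin until we grabbed the lock */
    #pragma omp flush                           /* see the protected data written by the previous owner */
}

void spin_release(void) {
    #pragma omp flush                           /* publish our writes to the protected data first */
    #pragma omp atomic write
    lock_var = 0;                               /* hand the lock back */
}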
The general recommendation is that flushes should not be used. Above all, programmers should avoid flush with a list (flush(var,...)) by all means. Some folks are actually talking about deprecating it in a future OpenMP version.
Performance-wise, the impact of flush should be more negative than positive. Since it causes the compiler to generate memory fences and additional load/store operations, I would expect it to slow things down.
EDIT: For your second question, the answer is no. OpenMP makes sure that each thread has a consistent view of the shared memory when it needs to. If threads do not synchronize, they do not need to update their view of the shared memory, because they do not see any "interesting" change there. That means any read a thread makes does not read data that has been changed by some other thread. If that were the case, you'd have a race condition and a potential bug in your program. To avoid the race, you need to synchronize (which then implies a flush to make each participating thread's view consistent again). A similar argument applies to barriers. You use barriers to start a new epoch in the computation of a parallel region. Since you're keeping the threads in lock-step, you will very likely also have some shared state between the threads that was computed in the previous epoch.
BTW, OpenMP may keep private data for a thread, but it does not have to. So it is likely that the OpenMP compiler will keep variables in registers for a while, which causes them to be out of sync with the shared memory. However, updates to array elements are typically reflected pretty soon in the shared memory, since the amount of private storage for a thread is usually small (register sets, caches, scratch memory, etc.). OpenMP only gives you some weak restrictions on what you can expect. An actual OpenMP implementation (or the hardware) may be as strict as it wishes to be (e.g., write back any change immediately and flush all the time).
Not exactly an answer, but Michael Klemm's answer is closed for comments. I think an excellent example of why flushes are so hard to understand and use properly is the following one, copied (and shortened a bit) from the OpenMP Examples document:
// http://www.openmp.org/wp-content/uploads/openmp-examples-4.0.2.pdf
// Example mem_model.2c, from Chapter 2 (The OpenMP Memory Model)
#include <stdio.h>
#include <omp.h>

int main() {
    int data, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            /* Write to the data buffer that will be read by thread 1 */
            data = 42;
            /* Flush data to thread 1 and strictly order the write to data
               relative to the write to the flag */
            #pragma omp flush(flag, data)
            /* Set flag to release thread 1 */
            flag = 1;
            /* Flush flag to ensure that thread 1 sees the change */
            #pragma omp flush(flag)
        }
        else if (omp_get_thread_num() == 1) {
            /* Loop until we see the update to the flag */
            #pragma omp flush(flag, data)
            while (flag < 1) {
                #pragma omp flush(flag, data)
            }
            /* Values of flag and data are undefined */
            printf("flag=%d data=%d\n", flag, data);
            #pragma omp flush(flag, data)
            /* Value of data will be 42, value of flag still undefined */
            printf("flag=%d data=%d\n", flag, data);
        }
    }
    return 0;
}
Read the comments and try to understand.

How to align stack at 32 byte boundary in GCC?

I'm using a MinGW64 build based on GCC 4.6.1 for a Windows 64-bit target. I'm playing around with Intel's new AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.
But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for Windows 64-bit has only 16-byte alignment.
How can I change the GCC's stack alignment to 32 bytes?
I have tried using -mstackrealign but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.
EDIT:
I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.
I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.
I investigated a couple of ways to deal with this issue. The simplest and bluntest solution is to replace the aligned memory accesses VMOVAPS/PD/DQA with the unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script, which does the job on an input assembler file produced by GCC:
import re
import fileinput
import sys
# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps" : "ups", "apd" : "upd", "dqa" : "dqu"};
for line in fileinput.FileInput(sys.argv[1:], inplace=1):
    m = vmova.match(line)
    if m and m.group(1) in aligndict:
        s = m.group(1)
        print line.replace("vmov"+s, "vmov"+aligndict[s]),
    else:
        print line,
This approach is pretty safe and foolproof, though I observed a performance penalty on rare occasions: when the stack is unaligned, the memory access crosses a cache-line boundary. Fortunately, the code performs as fast as with aligned accesses most of the time. My recommendation: inline functions in critical loops!
I also attempted to fix the stack allocation in every function prolog using another Python script, trying to always align it at a 32-byte boundary. This seems to work for some code, but not for other code. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is serious register spilling due to the necessity to save all ymm registers before a function call. (All ymm registers are callee-save.) I can post the script if there's interest.
The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.
You can get the effect you want by
Declaring your variables not as variables, but as fields in a struct
Declaring an array that is larger than the structure by an appropriate amount of padding
Doing pointer/address arithmetic to find a 32-byte aligned address inside the array
Casting that address to a pointer to your struct
Finally using the data members of your struct
You can use the same technique when malloc() does not align stuff on the heap appropriately.
E.g.
void foo() {
    struct I_wish_these_were_32B_aligned {
        vec32B foo;
        char bar[32];
    };  // note - no variable definition, just the struct declaration

    unsigned char a[sizeof(I_wish_these_were_32B_aligned) + 32];
    unsigned char* a_aligned_to_32B = align_to_32B(a);
    I_wish_these_were_32B_aligned* s = (I_wish_these_were_32B_aligned*)a_aligned_to_32B;
    s->foo = ...   // use the aligned struct members
}
where
unsigned char* align_to_32B(unsigned char* a) {
    uint64_t u = (uint64_t)a;
    uint64_t mask_aligned32B = (1 << 5) - 1;
    if ((u & mask_aligned32B) == 0) return (unsigned char*)u;
    return (unsigned char*)((u | mask_aligned32B) + 1);
}
I just ran into the same issue of getting segmentation faults when using AVX inside my functions, and it was also due to stack misalignment. Given that this is a compiler issue (and the options that could help are not available on Windows), I worked around the stack usage by:
Using static variables (see this issue). Given that they are not stored on the stack, you can force their alignment by using __attribute__((aligned(32))) in your declaration. For example: static __m256i r __attribute__((aligned(32))).
Inlining the functions/methods receiving/returning AVX data. You can force GCC to inline your function/method by adding inline and __attribute__((always_inline)) to your function prototype/declaration. Inlining your functions increases the size of your program, but it also prevents the function from using the stack (and hence avoids the stack-alignment issue). Example: inline __m256i myAvxFunction(void) __attribute__((always_inline));.
Be aware that the usage of static variables is not thread-safe, as mentioned in the reference. If you are writing a multi-threaded application, you may have to add some protection for your critical paths.
