Difference between mfence and asm volatile ("" : : : "memory") - gcc

As far as I understand, mfence is a hardware memory barrier while asm volatile ("" : : : "memory") is a compiler barrier. But can asm volatile ("" : : : "memory") be used in place of mfence?
The reason I got confused is this link:

Well, a memory barrier is only needed on architectures that have weak memory ordering. x86 and x64 don't have weak memory ordering. On x86/x64 all stores have a release fence and all loads have an acquire fence, so you should only really need asm volatile ("" : : : "memory").
For a good overview of both Intel and AMD as well as references to the relevant manufacturer specs, see http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
Generally things like "volatile" are used on a per-field basis where loads and stores to that field are natively atomic. Where loads and stores to a field are already atomic (i.e. the "operation" in question is a load or a store to a single field and thus the entire operation is atomic) the volatile field modifier or memory barriers are not needed on x86/x64. Portable code notwithstanding.
When it comes to "operations" that are not atomic--e.g. loads or stores to a field that is larger than a native word, or loads or stores to multiple fields within an "operation"--a means by which the operation can be viewed as atomic is required regardless of CPU architecture. Generally this is done with a synchronization primitive like a mutex. Mutexes (the ones I've used) include memory barriers to avoid issues like processor reordering, so you don't have to add extra memory barrier instructions. I generally consider not using synchronization primitives a premature optimization; but, as is the nature of premature optimization, that holds only about 97% of the time :)
Where you don't use a synchronization primitive and you're dealing with a multi-field invariant, memory barriers that ensure the processor does not reorder stores and loads to different memory locations are important.
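As a hedged illustration of the mutex approach mentioned above (all names here are invented for the example): a two-field invariant protected by a pthread mutex, where the lock and unlock already provide the necessary memory barriers, so no explicit fence is needed.
#include <pthread.h>

/* Two-field invariant (range_lo <= range_hi) protected by a mutex;
   lock/unlock already include the needed memory barriers. */
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;
static int range_lo, range_hi;

void range_set(int lo, int hi)
{
    pthread_mutex_lock(&range_lock);   /* acquire: barrier included */
    range_lo = lo;
    range_hi = hi;
    pthread_mutex_unlock(&range_lock); /* release: barrier included */
}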
Now, regarding not issuing an "mfence" instruction in the asm volatile but instead using "memory" in the clobber list, from what I've been able to read:
If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.
When they say "GCC" and don't mention anything about the CPU, this means it applies to only the compiler. The lack of "mfence" means there is no CPU memory barrier. You can verify this by disassembling the resulting binary. If no "mfence" instruction is issued (depending on the target platform) then it's clear the CPU is not being told to issue a memory fence.
Depending on the platform you're on and what you're trying to do, there may be something "better" or more clear... portability notwithstanding.

asm volatile ("" ::: "memory") is just a compiler barrier.
asm volatile ("mfence" ::: "memory") is both a compiler barrier and MFENCE
__sync_synchronize() is also a compiler barrier and a full memory barrier.
So asm volatile ("" ::: "memory") will not prevent the CPU from reordering instructions on independent data. As pointed out, x86-64 has a strong memory model, but StoreLoad reordering is still possible. If a full memory barrier is needed for your algorithm to work, then you need __sync_synchronize.
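A minimal sketch (assuming GCC on x86-64; compile with gcc -O2 -S and compare the generated assembly) showing the three forms side by side. Only the last two produce an actual fence instruction:
int shared;

void compiler_barrier_only(void)
{
    shared = 1;
    asm volatile("" ::: "memory");        /* stops compiler reordering only */
}

void compiler_and_cpu_barrier(void)
{
    shared = 1;
    asm volatile("mfence" ::: "memory");  /* compiler barrier + MFENCE */
}

void builtin_full_barrier(void)
{
    shared = 1;
    __sync_synchronize();                 /* full barrier; GCC typically emits
                                             mfence (or a locked instruction) */
}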

There are two kinds of reordering: compiler reordering and CPU reordering.
x86/x64 has a relatively strong memory model, but on x86/x64 StoreLoad reordering (later loads passing earlier stores) CAN happen.
see http://en.wikipedia.org/wiki/Memory_ordering
asm volatile ("" ::: "memory") is just a compiler barrier.
asm volatile ("mfence" ::: "memory") is both a compiler barrier and CPU barrier.
That means that with only a compiler barrier you can prevent compiler reordering, but you cannot prevent CPU reordering. In other words, there is no reordering when the source code is compiled, but reordering can still happen at run time.
So which one to use depends on your needs.
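To make the StoreLoad point concrete, here is a hedged Dekker-style sketch (the thread bodies are hypothetical): with only compiler barriers, both threads can return 0 on x86, because each store may still be sitting in its CPU's store buffer when the other CPU loads. Replacing the barrier with "mfence" (or __sync_synchronize()) in both threads rules that outcome out.
int x = 0, y = 0;

int thread_a(void)                 /* runs concurrently with thread_b */
{
    x = 1;
    asm volatile("" ::: "memory"); /* compile-time ordering only */
    return y;                      /* may still observe 0 */
}

int thread_b(void)
{
    y = 1;
    asm volatile("" ::: "memory");
    return x;                      /* may still observe 0 */
}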

Related

Why are arm atomic_[read/write] operations implemented with volatile pointers?

Here is an example of an atomic_read implementation:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
Also, should we explicitly use memory barriers for atomic operations on arm?
Here is an example of an atomic_read implementation:
A problematic one, actually: it assumes that the cast to volatile is not a no-op (i.e. that it forces a real memory access), which isn't guaranteed.
Also, should we explicitly use memory barriers for atomic operations
on arm?
Probably. It depends on what you are doing and what you are expecting.
Yes, the casting to volatile is to prevent the compiler from assuming the value of v cannot change. As for using memory barriers, the GCC builtins already allow you to specify the memory ordering you desire, no need to do it manually: https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
The default behavior on GCC is to use __ATOMIC_SEQ_CST which will emit the barriers necessary on Arm to make sure your atomics execute in the order you place them in the code. To optimize performance on Arm, you will want to consider using weaker semantics to allow the compiler to elide barriers and let the hardware execute faster. For more information on the types of memory barriers the Arm architecture has, see https://developer.arm.com/docs/den0024/latest/memory-ordering/barriers.
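For illustration, a sketch (the struct and function names are made up here) of doing the same read with the __atomic builtins instead of the volatile cast, so the ordering is stated explicitly and the compiler emits whatever barriers the target needs:
struct my_atomic { int counter; };

static inline int my_atomic_read(struct my_atomic *v)
{
    /* sequentially consistent load: on Arm the compiler adds the needed
       barriers; on x86 a plain load already satisfies this ordering */
    return __atomic_load_n(&v->counter, __ATOMIC_SEQ_CST);
}

static inline int my_atomic_read_relaxed(struct my_atomic *v)
{
    /* relaxed load: atomicity only, no ordering; lets the compiler elide
       barriers on Arm when the algorithm does not need them */
    return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
}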

How does GCC know that a register needs to be flushed to memory when a "memory" clobber is declared?

The Extended Asm manual https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html says the following about the "memory" clobber:
The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters). To ensure memory contains correct values, GCC may need to flush specific register values to memory before executing the asm. Further, the compiler does not assume that any values read from memory before an asm remain unchanged after that asm; it reloads them as needed. Using the "memory" clobber effectively forms a read/write memory barrier for the compiler.
I am confused about the decision to flush to memory. Before the asm code, how would GCC know if a register serves as a cache for a memory location, and thus needs to be flushed to memory? And is this part of cache coherency (I thought cache coherency was a hardware behavior)? After the asm code, how does GCC distinguish a register as a cache and, next time the register is read, decide instead to read from memory as the cache may be old?
Before the asm code, how would GCC know if a register serves as a
cache for a memory location, and thus needs to be flushed to memory?
Because GCC is the one generating this code.
Generally, from GCC's perspective:
[C code to compile]
[your inline asm with clobber]
[C code to compile]
GCC generates the assembly instructions prior to and after your inline asm, hence it knows everything before and after it. Now, since the "memory" clobber acts as a software (compiler-level) memory barrier, the following applies:
[GCC generated asm]
[compiler memory barrier]
[GCC generated asm]
So GCC generates the assembly before and after the barrier, and it knows that it cannot let memory accesses cross it. Basically, from GCC's point of view, there is code to compile, then a memory barrier, then more code to compile, and that's it; the only restriction the memory barrier imposes here is that GCC-generated code must not have memory accesses crossing this barrier.
So if, for example, GCC loads a register with a value from memory, changes it, and stores it back to memory, the load and store cannot cross the barrier. Depending on the code, they must reside before or after the barrier (or happen twice, once on each side).
I would recommend reading this related SO thread.
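A small sketch of what that means in practice (assuming -O2; the variable names are made up): because total is reachable from other code, the dirty register copy has to be stored before the asm and reloaded after it.
int total;

int accumulate(int a, int b)
{
    total = a;                       /* store must reach memory before the asm */
    asm volatile("" ::: "memory");   /* compiler barrier: no access may cross */
    total += b;                      /* value is reloaded from memory, not
                                        taken from a possibly stale register */
    return total;
}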

gcc and cpu_relax, smp_mb, etc.?

I've been reading on compiler optimizations vs CPU optimizations, and volatile vs memory barriers.
One thing which isn't clear to me: my current understanding is that CPU optimizations and compiler optimizations are orthogonal, i.e. they can occur independently of each other.
However, the article volatile considered harmful makes the point that volatile should not be used. Linus's post makes similar claims. The main reasoning, IIUC, is that marking a variable as volatile disables all compiler optimizations when accessing that variable (i.e. even if they are not harmful), while still not providing protection against memory reorderings. Essentially, the main point is that it's not the data that should be handled with care, but rather a particular access pattern needs to be handled with care.
Now, the volatile considered harmful article gives the following example of a busy loop waiting for a flag:
while (my_variable != what_i_want) {}
and makes the point that the compiler can optimize the access to my_variable so that it only occurs once and not in a loop. The solution, so the article claims, is the following:
while (my_variable != what_i_want)
cpu_relax();
It is said that cpu_relax acts as a compiler barrier (earlier versions of the article said that it's a memory barrier).
I have several gaps here:
1) Is the implication that gcc has special knowledge of the cpu_relax call, and that it translates to a hint to both the compiler and the CPU?
2) Is the same true for other instructions such as smp_mb() and the like?
3) How does that work, given that cpu_relax is essentially defined as a C macro? If I manually expand cpu_relax will gcc still respect it as a compiler barrier? How can I know which calls are respected by gcc?
4) What is the scope of cpu_relax as far as gcc is concerned? In other words, what's the scope of reads that cannot be optimized by gcc when it sees the cpu_relax instruction? From the CPU's perspective, the scope is wide (memory barriers place a mark in the read or write buffer). I would guess gcc uses a smaller scope - perhaps the C scope?
Yes, gcc has special knowledge of the semantics of cpu_relax or whatever it expands to, and must translate it to something for which the hardware will respect the semantics too.
Yes, any kind of memory fencing primitive needs special respect by the compiler and hardware.
Look at what the macro expands to, e.g. compile with "gcc -E" and examine the output. You'll have to read the compiler documentation to find out the semantics of the primitives.
The scope of a memory fence is as wide as the scope the compiler might move a load or store across. A non-optimizing compiler that never moves loads or stores across a subroutine call might not need to pay much attention to a memory fence that is represented as a subroutine call. An optimizing compiler that does interprocedural optimization across translation units would need to track a memory fence across a much bigger scope.
There are a number of subtle questions related to CPU and SMP concurrency in your questions, which will require you to look at the kernel code. Here are some quick ideas to get you started on the research, specifically for the x86 architecture.
The idea is that you are trying to perform a concurrency operation where your kernel task (see kernel source sched.h for struct task_struct) is in a tight loop comparing my_variable with a local variable until it is changed by another kernel task (or changed asynchronously by a hardware device!). This is a common pattern in the kernel.
The kernel has been ported to a number of architectures and each has a specific set of machine instructions to handle concurrency. For x86, cpu_relax maps to the PAUSE machine instruction. It allows an x86 CPU to run a spinlock more efficiently so that the lock variable update is more readily visible to the spinning CPU. GCC compiles the function/macro just like any other code. If cpu_relax is removed from the loop then gcc CAN consider the loop as having no effect and remove it. Look at the Intel x86 Software Manuals for the PAUSE instruction.
smp_mb is a kernel macro that expands to a full memory fence; on x86 it is implemented with MFENCE (or a locked instruction). One CPU can change my_variable, but the new value may still be sitting in that CPU's store buffer and not yet be visible to other CPUs; the fence drains the store buffer so the update becomes globally visible before later memory operations (cache coherency itself is maintained by the hardware). Look at the Intel x86 Software Manuals for the MFENCE/LFENCE/SFENCE instructions.
Note that smp_mb() CAN be a relatively expensive operation, since the CPU has to stall until its store buffer has drained; it does not, however, flush the caches.
If you expand cpu_relax on x86, it will show asm volatile("rep; nop" ::: "memory"). Because of the "memory" clobber this is also a compiler barrier, in addition to the PAUSE hint, so GCC cannot cache the flag in a register across it. Compare the barrier() macro, which is just asm volatile("" ::: "memory").
I'm not clear what you mean by "scope of cpu_relax". Some possible ideas: It's the PAUSE machine instruction, similar to ADD or MOV. PAUSE will affect only the current CPU. PAUSE allows for more efficient cache coherency between CPUs.
I just looked at the PAUSE instruction a little more: an additional property is that it prevents the CPU from doing out-of-order memory speculation when leaving a tight loop/spinlock. I'm not clear what THAT means, but I suppose it could briefly indicate a false value in a variable? Still a lot of questions...
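Putting the pieces above together, a hedged user-space sketch of the kernel's pattern (my_cpu_relax here is a local copy of what the x86 macro expands to, not the kernel's own definition):
#define my_cpu_relax() asm volatile("rep; nop" ::: "memory")  /* PAUSE + clobber */

int my_variable;                 /* set by another thread (or a device) */

void wait_for(int what_i_want)
{
    while (my_variable != what_i_want)
        my_cpu_relax();          /* "memory" clobber forces a reload of
                                    my_variable on every iteration; PAUSE
                                    tells the CPU this is a spin-wait */
}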

Run time overhead of compiler barrier in gcc for x86 processors

I was looking into the side effects/run-time overhead of using a compiler barrier (in gcc) in an x86 environment.
Compiler barrier: asm volatile("" ::: "memory")
The GCC documentation says something interesting (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html):
Excerpt:
The "memory" clobber tells the compiler that the assembly code
performs memory reads or writes to items other than those listed in
the input and output operands (for example, accessing the memory
pointed to by one of the input parameters). To ensure memory contains
correct values, GCC may need to flush specific register values to
memory before executing the asm. Further, the compiler does not assume
that any values read from memory before an asm remain unchanged after
that asm; it reloads them as needed. Using the "memory" clobber
effectively forms a read/write memory barrier for the compiler.
Question:
1) What register values are flushed ?
2) Why it needs to be flushed ?
3) Example ?
4) Is there any other overhead apart from register flushing ?
Every memory location which another thread might have a pointer to needs to be up to date before the barrier, and reloaded after. So any such values that are live in registers need to be stored (if dirty), or just "forgotten about" if the value in a register is just a copy of what's still in memory.
See this gcc non-bug report for this quote from a gcc dev: a "memory" clobber only includes memory that can be indirectly accessed (thus may be address-taken in this or another compilation unit)
Is there any other overhead apart from register flushing ?
A barrier can prevent optimizations like sinking a store out of a loop, but that's usually why you used barriers. Make sure your loop counters and loop variables are locals that haven't had their address passed to functions the compiler can't see, or else they'll have to be spilled/reloaded inside the loop. Letting references escape your function is always a potential problem for optimization, but it's a near-guarantee of worse code with barriers.
Why?
This is the whole point of a barrier: so values are synced to memory, preventing compile-time reordering.
asm volatile( ::: "memory" ) is (exactly?) equivalent to atomic_signal_fence(memory_order_seq_cst) (not atomic_thread_fence, which would take an mfence instruction to implement on x86).
Examples:
See Jeff Preshing's Memory Ordering at Compile Time article for more about why, and examples with actual x86 asm.
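As a sketch of the cost described above (variable names invented for the example): without the barrier, GCC at -O2 can keep counter in a register and store it once after the loop; with the barrier it must store and reload it on every iteration.
int counter;

void count_to(int n)
{
    for (int i = 0; i < n; i++) {
        counter++;                       /* would normally stay in a register */
        asm volatile("" ::: "memory");   /* forces the store (and a later
                                            reload) on every iteration */
    }
}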

How do I force the CPU to perform in order execution of a program without any loops or branches?

Is it possible? For a small piece of code without any branches/loops.
Are there any gcc flags or intrinsic instructions (like SSE's for x86) for x86 and other processor families? I am just curious, since all the processors available these days follow an out-of-order execution model.
Thanks in advance
Most modern high-performance CPUs are inherently out-of-order, with no way to switch between in-order and out-of-order modes.
You can try to find some in-order CPU, and there are some:
x86: Intel Atom (only the 45 nm and older versions; they have two parallel pipelines but execute all instructions in order)
arm: Cortex-A8, and many older cores;
While it is not possible to directly turn off instruction reordering in a typical out-of-order CPU, you can inject something serializing (like cpuid in the x86 world) after every one of your instructions to simulate in-order execution.
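A hedged sketch of that idea (purely an experiment, and very slow; assumes x86-64, as older 32-bit PIC code needs care with the ebx clobber): CPUID issued through inline asm after each step, with the registers it modifies listed as clobbers.
static inline void serialize(void)
{
    unsigned int eax = 0;                /* CPUID leaf 0 */
    asm volatile("cpuid"
                 : "+a"(eax)
                 :
                 : "ebx", "ecx", "edx", "memory");
}

int experiment(int a, int b)
{
    int x = a + b;
    serialize();                         /* all prior work must complete */
    int y = x * 2;
    serialize();
    return y;
}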
There is a part of Intel manuals (vol 3a) about serializing instructions (copied from http://objectmix.com/asm-x86-asm-370/69413-serializing-instructions.html):
Volume 3A: System Programming Guide states
7.4 SERIALIZING INSTRUCTIONS
The Intel 64 and IA-32 architectures define several serializing
instructions. These instructions force the processor to complete all
modifications to flags, registers, and memory by previous instructions
and to drain all buffered writes to memory before the next instruction
is fetched and executed. For example, when a MOV to control register
instruction is used to load a new value into control register CR0 to
enable protected mode, the processor must perform a serializing
operation before it enters protected mode. This serializing operation
insures that all operations that were started while the processor was
in real-address mode are completed before the switch to protected
mode is made.
The concept of serializing instructions was introduced into the IA-32
architecture with the Pentium processor to support parallel
instruction execution. Serializing instructions have no meaning for
the Intel486 and earlier processors that do not implement parallel
instruction execution.
It is important to note that executing of serializing instructions on
P6 and more recent processor families constrain speculative execution
because the results of speculatively executed instructions are
discarded. The following instructions are serializing instructions:
o Privileged serializing instructions - MOV (to control register,
with the exception of MOV CR8), MOV (to debug register), WRMSR, INVD,
INVLPG, WBINVD, LGDT, LLDT, LIDT, and LTR.
o Non-privileged serializing instructions - CPUID, IRET, and RSM.
When the processor serializes instruction execution, it ensures that
all pending memory transactions are completed (including writes
stored in its store buffer) before it executes the next instruction.
Nothing can pass a serializing instruction and a serializing
instruction cannot pass any other instruction (read, write,
instruction fetch, or I/O). For example, CPUID can be executed at any
privilege level to serialize instruction execution with no effect on
program flow, except that the EAX, EBX, ECX, and EDX registers are
modified.
It is possible, but it depends on the CPU. Then again, the instructions themselves don't matter, the memory accesses matter.
AFAIK all CPUs guarantee that registers (and thus the internal state) appear updated in order, regardless of how execution happens. In some CPUs temporary registers are allocated with a value and the result is "written back" (register renaming, so there's no copy per se) at the appropriate time.
For memory accesses, most CPUs have memory barriers of some kind, which limit the reordering of memory accesses. There are several different kinds of memory barriers, and they differ from CPU to CPU. You can conceivably place a full memory barrier between each instruction and you'll get halfway there. If you have a multi-processor machine you might need to do some extra work to make sure the caches are also flushed. Without explicit instructions the other core may not see the results in order.
It very much depends on what you're trying to achieve and on which specific CPU. Every CPU out there is different in some way. And there won't be any magical gcc flags. In gcc the best you'll have are the atomic builtins (link below). The topic is huge. No simple answers.
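In that spirit, a sketch using the atomic builtins from the links below: a full barrier placed between successive memory accesses (this orders the memory operations; register-only arithmetic is unaffected).
int a, b, c;

void ordered_updates(void)
{
    a = 1;
    __sync_synchronize();   /* full barrier: mfence or a locked op on x86 */
    b = 2;
    __sync_synchronize();
    c = a + b;              /* both earlier stores are globally visible here */
}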
Recommended reading list:
http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt
https://en.wikipedia.org/wiki/Memory_ordering
http://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html
http://www.akkadia.org/drepper/cpumemory.pdf
