x86 spinlock using cmpxchg - gcc

I'm new to using gcc inline assembly, and was wondering if, on an x86 multi-core machine, a spinlock (without race conditions) could be implemented as (using AT&T syntax):
spin_lock:
mov 0 eax
lock cmpxchg 1 [lock_addr]
jnz spin_lock
ret
spin_unlock:
lock mov 0 [lock_addr]
ret

You have the right idea, but your asm is broken:
cmpxchg can't work with an immediate operand, only registers.
lock is not a valid prefix for mov. mov to an aligned address is atomic on x86, so you don't need lock anyway.
It has been some time since I've used AT&T syntax; I hope I remembered everything:
spin_lock:
xorl %ecx, %ecx
incl %ecx # newVal = 1
spin_lock_retry:
xorl %eax, %eax # expected = 0
lock; cmpxchgl %ecx, (lock_addr)
jnz spin_lock_retry
ret
spin_unlock:
movl $0, (lock_addr) # atomic release-store
ret
Note that GCC has atomic builtins, so you don't actually need to use inline asm to accomplish this:
void spin_lock(int *p)
{
while(!__sync_bool_compare_and_swap(p, 0, 1));
}
void spin_unlock(int volatile *p)
{
asm volatile ("":::"memory"); // compiler barrier: keeps the store from reordering earlier. A plain store is already a release operation on x86.
*p = 0;
}
As Bo points out below, locked instructions incur a cost: each one must acquire exclusive access to the cache line and hold onto it while lock cmpxchg runs, just like an ordinary store to that cache line, but held for the duration of the lock cmpxchg execution. This can delay the unlocking thread, especially when multiple threads are waiting to take the lock. Even without many CPUs, it's easy and worth it to optimize around:
void spin_lock(int volatile *p)
{
while(!__sync_bool_compare_and_swap(p, 0, 1))
{
// spin read-only until a cmpxchg might succeed
while(*p) _mm_pause(); // or maybe do{}while(*p) to pause first
}
}
The pause instruction is vital for performance on HyperThreading CPUs when you've got code that spins like this -- it lets the second thread execute while the first thread is spinning. On CPUs which don't support pause, it is treated as a nop.
pause also prevents memory-order mis-speculation when leaving the spin-loop, when it's finally time to do real work again; see What is the purpose of the "PAUSE" instruction in x86?
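The intrinsic for it is _mm_pause() from <immintrin.h>. If you'd rather not pull in the header, a wrapper along these lines is a reasonable sketch (the name cpu_relax is borrowed from Linux-kernel style, not from this answer):
static inline void cpu_relax(void)
{
    // pause encodes as rep; nop, so it's a plain nop on CPUs that don't support it.
    // The "memory" clobber makes the compiler re-read the lock word each iteration.
    asm volatile("pause" ::: "memory");
}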
Note that spin locks are actually rarely used: typically, one uses something like a critical section or futex. These integrate a spin lock for performance under low contention, but then fall back to an OS-assisted sleep and notify mechanism. They may also take measures to improve fairness, and lots of other things the cmpxchg / pause loop doesn't do.
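For a concrete picture of the spin-then-sleep idea, here is a minimal Linux sketch (my illustration, not glibc's implementation: the names, the spin count, and the single-word protocol are assumptions, and a real lock tracks a contended state so unlock can skip the wake syscall when nobody is waiting):
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <immintrin.h>

void hybrid_lock(int volatile *p)
{
    for (int i = 0; i < 1000; ++i) {               // bounded spin phase
        if (*p == 0 && __sync_bool_compare_and_swap(p, 0, 1))
            return;
        _mm_pause();
    }
    while (!__sync_bool_compare_and_swap(p, 0, 1)) // sleep phase
        syscall(SYS_futex, p, FUTEX_WAIT, 1, NULL, NULL, 0); // block while *p == 1
}

void hybrid_unlock(int volatile *p)
{
    __sync_lock_release(p);                              // release-store of 0
    syscall(SYS_futex, p, FUTEX_WAKE, 1, NULL, NULL, 0); // wake one waiter
}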
Also note that cmpxchg is unnecessary for a simple spinlock: you can use xchg and then check whether the old value was 0 or not. Doing less work inside the locked instruction may keep the cache line pinned for less time. See Locks around memory manipulation via inline assembly for a complete asm implementation using xchg and pause (but still with no fallback to OS-assisted sleep, just spinning indefinitely.)
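For instance, a sketch using GCC builtins rather than the full asm from that link: __sync_lock_test_and_set compiles to xchg on x86, and its acquire semantics are sufficient for taking a lock.
#include <immintrin.h>

void spin_lock_xchg(int volatile *p)
{
    while (__sync_lock_test_and_set(p, 1)) // xchg: store 1, return the old value
        while (*p) _mm_pause();            // contended: wait read-only for 0
}

void spin_unlock_xchg(int volatile *p)
{
    __sync_lock_release(p);                // release-store of 0
}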

This will cause less contention on the memory bus:
void spin_lock(int *p)
{
while(!__sync_bool_compare_and_swap(p, 0, 1)) while(*p);
}

The syntax is wrong. It works after a little modification.
spin_lock:
movl $0, %eax
movl $1, %ecx
lock cmpxchg %ecx, (lock_addr)
jnz spin_lock
ret
spin_unlock:
movl $0, (lock_addr)
ret
To make the code faster, assume lock_addr is stored in the %rdi register.
Use movl and test to spin instead of lock cmpxchgl %ecx, (%rdi).
Use lock cmpxchgl %ecx, (%rdi) to try to enter the critical section only when there's a chance it will succeed.
That way you avoid unneeded bus locking.
spin_lock:
movl $1, %ecx # newVal = 1
loop:
movl (%rdi), %eax # plain load: spin read-only, no bus lock
test %eax, %eax
jnz loop # lock still held? keep spinning
lock cmpxchgl %ecx, (%rdi) # %eax is 0 here: try 0 -> 1
jnz loop # lost the race: back to read-only spinning
ret
spin_unlock:
movl $0, (%rdi) # plain store is a release on x86
ret
I tested it with pthreads and a simple loop like this:
for(i = 0; i < 10000000; ++i){
spin_lock(&mutex);
++count;
spin_unlock(&mutex);
}
In my test, the first version takes 2.5~3 seconds and the second takes 1.3~1.8 seconds.
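For reference, a harness along those lines might look like this (a sketch; the thread count and the expected total are my assumptions, and spin_lock/spin_unlock are the asm routines above, taking the lock address in %rdi per the SysV calling convention):
#include <pthread.h>
#include <stdio.h>

int mutex = 0;           // the lock word passed as lock_addr
long count = 0;

void spin_lock(int *p);  // the asm routines above
void spin_unlock(int *p);

void *worker(void *arg)
{
    for (int i = 0; i < 10000000; ++i) {
        spin_lock(&mutex);
        ++count;
        spin_unlock(&mutex);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    printf("count = %ld\n", count); // expect 2 * 10000000
    return 0;
}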

Related

How get EIP from x86 inline assembly by gcc

I want to get the value of EIP with the following code, but it does not compile.
Command:
gcc -o xxx x86_inline_asm.c -m32 && ./xxx
Content of x86_inline_asm.c:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
unsigned int eip_val;
__asm__("mov %0,%%eip":"=r"(eip_val));
return 0;
}
How can I use inline assembly to read the value of EIP in a way that compiles successfully for x86? How should the code be modified?
This sounds unlikely to be useful (vs. just taking the address of the whole function like void *tmp = main), but it is possible.
Just get a label address, or use . (the address of the current line), and let the linker worry about getting the right immediate into the machine code. So you're not architecturally reading EIP, just reading the value it currently has from an immediate.
asm volatile("mov $., %0" : "=r"(address_of_mov_instruction) );
AT&T syntax is mov src, dst, so what you wrote would be a jump if it assembled.
Architecturally, EIP = the address of the end of an instruction while it's executing, so arguably you should do:
asm volatile(
"mov $1f, %0 \n\t" // reference label 1 forward
"1:" // GAS local label
: "=r"(address_after_mov)
);
I'm using asm volatile in case this asm statement gets duplicated multiple times inside the same function by inlining or something. If you want each case to get a different address, it has to be volatile. Otherwise the compiler can assume that all instances of this asm statement produce the same output. Normally that will be fine.
Architecturally, 32-bit mode has no RIP-relative addressing, so you can't LEA from EIP; the only good way to actually read EIP is call / pop (see Reading program counter directly). It's not a general-purpose register, so you can't just use it as the source or destination of a mov or any other instruction.
But really you don't need inline asm for this at all.
Is it possible to store the address of a label in a variable and use goto to jump to it? shows how to use the GNU C extension where &&label takes its address.
int foo;
void *addr_inside_function() {
foo++;
lab1: ; // labels only go on statements, not declarations
void *tmp = &&lab1;
foo++;
return tmp;
}
There's nothing you can safely do with this address outside the function; I returned it just as an example to make the compiler put a label in the asm and see what happens. Without a goto to that label, it can still optimize the function pretty aggressively, but you might find it useful as an input for an asm goto(...) somewhere else in the function.
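For example, a hypothetical asm goto (a GNU C extension; this sketch is mine, not part of the question); note that asm goto takes the label itself as a goto operand rather than a &&label value:
void maybe_jump(int x)
{
    asm goto("test %0, %0 \n\t"
             "jnz %l1"        // %l1 = first goto label (operands number: inputs, then labels)
             : /* no outputs */
             : "r"(x)
             : "cc"
             : out);
    return;
out:
    __builtin_printf("took the asm branch\n");
}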
But anyway, it compiles on Godbolt to this asm
# gcc -O3 -m32
addr_inside_function:
.L2:
addl $2, foo
movl $.L2, %eax
ret
#clang -O3 -m32
addr_inside_function:
movl foo, %eax
leal 1(%eax), %ecx
movl %ecx, foo
.Ltmp0: # Block address taken
addl $2, %eax
movl %eax, foo
movl $.Ltmp0, %eax # retval = label address
retl
So clang loads the global, computes foo+1 and stores it, then after the label computes foo+2 and stores that (instead of loading twice). So you still can't usefully jump to the label from anywhere, because the code after it depends on having foo's old value in eax, and on the desired behaviour being to store foo+2.
I don't know gcc inline assembly syntax for this, but for masm:
call next0
next0: pop eax ;eax = eip for this line
In the case of Masm, $ represents the current location, and since call is a 5 byte instruction, an alternative syntax without a label would be:
call $+5
pop eax
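In GNU C inline asm (32-bit, compiled with -m32), the equivalent of that call/pop trick might be sketched as:
unsigned int eip;
asm volatile("call 1f \n\t" // pushes the address of the next instruction (label 1)
             "1: pop %0"    // pop it: %0 = EIP at label 1
             : "=r"(eip));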

Cannot modify data segment register. When tried General Protection Error is thrown

I have been trying to create an ISR handler following this tutorial by James Molloy, but I got stuck. Whenever I trigger a software interrupt, the general-purpose registers and the data segment register are pushed onto the stack along with the values automatically pushed by the CPU. Then the data segment is changed to 0x10 (the kernel data segment descriptor) so the privilege levels are changed. After the handler returns, those values are popped. But whenever the value in ds is changed, a general protection fault is thrown with the error code 0x2544, and after a few seconds the VM restarts. (Linker and compiler: i386-elf-gcc; assembler: nasm.)
I tried placing hlt instructions between instructions to locate which one was throwing the fault, and found that it was the `mov ds, ax' instruction. I tried various things, from removing the stack initialized by the bootstrap code to deleting the privilege-changing parts of the code. The only way I can return from the common stub is to remove the parts of my code that change the privilege levels, but since I want to move towards user mode, I want them to stay.
Here is my common stub:
isr_common_stub:
pusha ; Pushes edi,esi,ebp,esp,ebx,edx,ecx,eax
xor eax,eax
mov ax, ds ; Lower 16-bits of eax = ds.
push eax ; save the data segment descriptor
mov ax, 0x10 ; load the kernel data segment descriptor
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
call isr_handler
xor eax,eax
pop eax
mov ds, ax ; This is the instruction everything fails;
mov es, ax
mov fs, ax
mov gs, ax
popa
iret
My ISR handler macros:
extern isr_handler
%macro ISR_NOERRCODE 1
global isr%1 ; %1 accesses the first parameter.
isr%1:
cli
push byte 0
push %1
jmp isr_common_stub
%endmacro
%macro ISR_ERRCODE 1
global isr%1
isr%1:
cli
push byte %1
jmp isr_common_stub
%endmacro
ISR_NOERRCODE 0
ISR_NOERRCODE 1
ISR_NOERRCODE 2
ISR_NOERRCODE 3
...
My C handler, which results in "Received interrupt: 0xD err. code 0x2544":
#include <stdio.h>
#include <isr.h>
#include <tty.h>
void isr_handler(registers_t regs) {
printf("ds: %x \n" ,regs.ds);
printf("Received interrupt: %x with err. code: %x \n", regs.int_no, regs.err_code);
}
And my main function:
void kmain(struct multiboot *mboot_ptr) {
descinit(); // Sets up IDT and GDT
ttyinit(TTY0); // Sets up the VGA Framebuffer
asm volatile ("int $0x1"); // Triggers a software interrupt
printf("Wow"); // After that its supposed to print this
}
As you can see the code was supposed to output,
ds: 0x10
Received interrupt: 0x1 with err. code: 0
but results in,
...
ds: 0x10
Received interrupt: 0xD with err. code: 0x2544
ds: 0x10
Received interrupt: 0xD with err. code: 0x2544
...
Which goes on until the VM restarts itself.
What am I doing wrong?
The code isn't complete, but I'm going to guess that what you are seeing is the result of a well-known bug in James Molloy's OSDev tutorial. The OSDev community has compiled a list of known bugs in an errata list. I recommend reviewing and fixing all the bugs mentioned there. Specifically, I believe the bug causing problems here is this one:
Problem: Interrupt handlers corrupt interrupted state
This article previously told you to know the ABI. If you do you will
see a huge problem in the interrupt.s suggested by the tutorial: It
breaks the ABI for structure passing! It creates an instance of the
struct registers on the stack and then passes it by value to the
isr_handler function and then assumes the structure is intact
afterwards. However, the function parameters on the stack belongs to
the function and it is allowed to trash these values as it sees fit
(if you need to know whether the compiler actually does this, you are
thinking the wrong way, but it actually does). There are two ways
around this. The most practical method is to pass the structure as a
pointer instead, which allows you to explicitly edit the register
state when needed - very useful for system calls, without having the
compiler randomly doing it for you. The compiler can still edit the
pointer on the stack when it's not specifically needed. The second
option is to make another copy of the structure and pass that.
The problem is that the 32-bit System V ABI doesn't guarantee that data passed by value will be unmodified on the stack! The compiler is free to reuse that memory for whatever purposes it chooses. The compiler probably generated code that trashed the area on the stack where DS is stored. When DS was set with the bogus value it crashed. What you should be doing is passing by reference rather than value. I'd recommend these code changes in the assembly code:
irq_common_stub:
pusha
mov ax, ds
push eax
mov ax, 0x10 ;0x10
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
push esp ; At this point ESP is a pointer to where GS (and the rest
; of the interrupt handler state resides)
; Push ESP as 1st parameter as it's a
; pointer to a registers_t
call irq_handler
pop ebx ; Remove the saved ESP on the stack. Efficient to just pop it
; into any register. You could have done: add esp, 4 as well
pop ebx
mov ds, bx
mov es, bx
mov fs, bx
mov gs, bx
popa
add esp, 8
sti
iret
And then modify irq_handler to use registers_t *regs instead of registers_t regs:
void irq_handler(registers_t *regs) {
if (regs->int_no >= 40) port_byte_out(0xA0, 0x20);
port_byte_out(0x20, 0x20);
if (interrupt_handlers[regs->int_no] != 0) {
interrupt_handlers[regs->int_no](*regs);
}
else
{
klog("ISR: Unhandled IRQ%u!\n", regs->int_no);
}
}
I'd actually recommend each interrupt handler take a pointer to registers_t to avoid unnecessary copying. If your interrupt handlers and the interrupt_handlers array used functions taking registers_t * as the parameter (instead of registers_t), then you'd modify the code:
interrupt_handlers[regs->int_no](*regs);
to be:
interrupt_handlers[regs->int_no](regs);
Important: You have to make these same type of changes for your ISR handlers as well. Both the IRQ and ISR handlers and associated code have this same problem.
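For reference, a registers_t laid out to match the push order in the stub above might look like this (a sketch; the field names are assumptions based on the tutorial's layout):
#include <stdint.h>

typedef struct registers {
    uint32_t ds;                                      // pushed by the stub after pusha
    uint32_t edi, esi, ebp, esp, ebx, edx, ecx, eax;  // pusha (EDI ends up at the lowest address)
    uint32_t int_no, err_code;                        // pushed by the ISR/IRQ macros
    uint32_t eip, cs, eflags, useresp, ss;            // pushed by the CPU (useresp/ss only on a privilege change)
} registers_t;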

In Clang/LLVM x86-64 inline assembly, how do I say I clobbered the x87/media state?

I'm writing some x86-64 inline assembly that might affect the floating point and media (SSE, MMX, etc.) state, but I don't feel like saving and restoring the state myself. Does Clang/LLVM have a clobber constraint for that?
(I'm not too familiar with the x86-64 architecture or inline assembly, so it was hard to know what to search for. More details in case this is an XY problem: I'm working on a simple coroutine library in Rust. When we switch tasks, we need to store the old CPU state and load the new state, and I'd like to write as little assembly as possible. My guess is that letting the compiler take care of saving and restoring state is the simplest way to do that.)
If your coroutine looks like an opaque (non-inline) function call, the compiler will already assume the FP state is clobbered (except for control regs like MXCSR and the x87 control word (rounding mode)), because all the FP regs are call-clobbered in the normal function calling convention.
Except for Windows, where xmm6..15 are call-preserved.
Also beware that if you're putting a call inside inline asm, there's no way to tell the compiler that your asm clobbers the red zone (128 bytes below RSP in the x86-64 System V ABI). You could compile that file with -mno-redzone or use add rsp, -128 before call to skip over the red-zone that belongs to the compiler-generated code.
To declare clobbers on the FP state, you have to name all the registers separately.
"xmm0", "xmm1", ..., "xmm15" (clobbering xmm0 counts as clobbering ymm0/zmm0).
For good measure you should also name "mm0", ..., "mm7" as well (MMX), in case your code inlines into some legacy code using MMX intrinsics.
To clobber the x87 stack as well, "st" is how you refer to st(0) in the clobber list. The rest of the registers have their normal names for GAS syntax, "st(1)", ..., "st(7)".
https://stackoverflow.com/questions/39728398/how-to-specify-clobbered-bottom-of-the-x87-fpu-stack-with-extended-gcc-assembly
You never know: it is possible to compile with clang -mfpmath=387, or to use the 387 via long double.
(Hopefully no code uses -mfpmath=387 in 64-bit mode and MMX intrinsics at the same time; the following test-case looks slightly broken with gcc in that case.)
#include <immintrin.h>
float gvar;
int testclobber(float f, char *p)
{
int arg1 = 1, arg2 = 2;
f += gvar; // with -mno-sse, this will be in an x87 register
__m64 mmx_var = *(const __m64*)p; // MMX
mmx_var = _mm_unpacklo_pi8(mmx_var, mmx_var);
// x86-64 System V calling convention
unsigned long long retval;
asm volatile ("add $-128, %%rsp \n\t" // skip red zone. -128 fits in an imm8
"call whatever \n\t"
"sub $-128, %%rsp \n\t"
// FIXME should probably align the stack in here somewhere
: "=a"(retval) // returns in RAX
: "D" (arg1), "S" (arg2) // input args in registers
: "rcx", "rdx", "r8", "r9", "r10", "r11" // call-clobbered integer regs
// call clobbered FP regs, *NOT* including MXCSR
, "mm0", "mm1", "mm2", "mm3", "mm4", "mm5", "mm6", "mm7" // MMX
, "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)" // x87
// SSE/AVX: clobbering any results in a redundant vzeroupper with gcc?
, "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
, "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15"
#ifdef __AVX512F__
, "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23"
, "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31"
, "k0", "k1", "k2", "k3", "k4", "k5", "k6", "k7"
#endif
#ifdef __MPX__
, "bnd0", "bnd1", "bnd2", "bnd3"
#endif
, "memory" // reads/writes of globals and pointed-to data can't reorder across the asm (at compile time; runtime StoreLoad reordering is still a thing)
);
// Use the MMX var after the asm: compiler has to spill/reload the reg it was in
*(__m64*)p = mmx_var;
_mm_empty(); // emms
gvar = f; // memory clobber prevents hoisting this ahead of the asm.
return retval;
}
source + asm on the Godbolt compiler explorer
By commenting one of the lines of clobbers, we can see that the spill-reload go away in the asm. e.g. commenting the x87 st .. st(7) clobbers makes code that leaves f + gvar in st0, for just a fst dword [gvar] after the call.
Similarly, commenting the mm0 line lets gcc and clang keep mmx_var in mm0 across the call. Since the ABI requires that the FPU be in x87 mode, not MMX, on call / ret, this isn't really sufficient. The compiler will spill/reload around the asm, but it won't insert an emms for us. But by the same token, it would be an error for a function using MMX to call your co-routine without doing _mm_empty() first, so maybe this isn't a real problem.
I haven't experimented with __m256 variables to see if it inserts a vzeroupper before the asm, to avoid possible SSE/AVX slowdowns.
If we comment the xmm8..15 line, we see the version that isn't using x87 for float keeps it in xmm8, because now it thinks it has some non-clobbered xmm regs. If we comment both sets of lines, it assumes xmm0 lives across the asm, so this works as a test of the clobbers.
asm output with all clobbers in place
It saves/restores RBX (to hold the pointer arg across the asm statement), which happens to re-align the stack by 16. That's another problem with using call from inline asm: I don't think alignment of RSP is guaranteed.
# from clang7.0 -march=skylake-avx512 -mmpx
testclobber: # #testclobber
push rbx
vaddss xmm0, xmm0, dword ptr [rip + gvar]
vmovss dword ptr [rsp - 12], xmm0 # 4-byte Spill (because of xmm0..15 clobber)
mov rbx, rdi # save pointer for after asm
movq mm0, qword ptr [rdi]
punpcklbw mm0, mm0 # mm0 = mm0[0,0,1,1,2,2,3,3]
movq qword ptr [rsp - 8], mm0 # 8-byte Spill (because of mm0..7 clobber)
mov edi, 1
mov esi, 2
add rsp, -128
call whatever
sub rsp, -128
movq mm0, qword ptr [rsp - 8] # 8-byte Reload
movq qword ptr [rbx], mm0
emms # note this didn't happen before call
vmovss xmm0, dword ptr [rsp - 12] # 4-byte Reload
vmovss dword ptr [rip + gvar], xmm0
pop rbx
ret
Notice that because of the "memory" clobber in the asm statement, *p and gvar are read before the asm, but written after. Without that, the optimizer could sink the load or hoist the store so no local variable was live across the asm statement. But now the optimizer needs to assume that the asm statement itself might read the old value of gvar and/or modify it. (And assume that p points to memory that's also globally accessible somehow, because we didn't use __restrict.)

Why does %rdi still have the right value, after I clobber it and call printf?

I'm having trouble understanding the (erroneous) output of the following assembly code I've generated through a compiler I'm writing.
This is the pseudo-code of what I'm compiling:
int sidefx ( ) {
a = a + 1;
printf("side effect: a is %d\n", a);
return a;
}
void threeargs ( int one, int two, int three ) {
printf("three arguments. one: %d, two: %d, three: %d\n", one, two, three);
}
void main ( ) {
a = 0;
threeargs(sidefx(), sidefx(), sidefx());
}
Here's the assembly code I've generated:
.section .rodata
.comm global_a, 8, 8
.string0:
.string "a is %d\n"
.string1:
.string "one: %d, two: %d, three: %d\n"
.globl main
sidefx: /* Function sidefx() */
enter $(8*0),$0 /* Enter a new stack frame */
movq global_a, %r10 /* Store the value in .global_a in %r10 */
movq $1, %r11 /* Store immediate 1 into %r11 */
addq %r10,%r11 /* Add %r10 and %r11 */
movq %r11, global_a /* Store the result in .global_a */
movq global_a, %rsi /* Put the value of .global_a into second paramater register */
movq $.string0, %rdi /* Move .string0 to first parameter register */
movq $0, %rax
call printf /* Call printf */
movq global_a, %rax /* Return the new value of .global_a */
leave /* Restore old %rsp, %rbp values */
ret /* Pop the return address */
threeargs: /* Function threeargs() */
enter $(8*0),$0 /* Enter a new stack frame */
movq %rdx, %rcx /* Move 3rd parameter register value into 4th parameter register */
movq %rsi, %rdx /* Move 2nd parameter register value into 3rd parameter register */
movq %rdi, %rsi /* Move 1st parameter register value into 2nd parameter register */
movq $.string1, %rdi /* Move .string1 to 1st parameter register */
movq $0, %rax
call printf /* call printf */
leave /* Restore old %rsp, %rbp values */
ret /* Pop the return address */
main:
enter $(8*0),$0 /* Enter a new stack frame */
movq $0, global_a /* Set .global_a to 0 */
movq $0, %rax
call sidefx /* Call sidefx() */
movq %rax,%rdi /* Store value in %rdi, our first parameter register */
movq $0, %rax
call sidefx /* Call sidefx() */
movq %rax,%rsi /* Store value in %rsi, our second parameter register */
movq $0, %rax
call sidefx /* Call sidefx() */
movq %rax,%rdx /* Store value in %rdx, our third parameter register */
movq $0, %rax
call threeargs /* Call threeargs() */
main_return:
leave
ret
Now here's what I don't understand. The output to the program when compiled (gcc file.s -o code && ./code) is the following :
dmlittle$ gcc file.s -o code && ./code
a is 1
a is 2
a is 3
one: 1, two: 2147483641, three: 3
The problem with the assembly code is that I'm storing the values of the sidefx() calls that will eventually be parameters to threeargs() into the parameter registers, but the 2 succeeding calls to sidefx() will overwrite the values of %rdi and %rsi in order to call printf. In order to fix this problem I need to store the return values either somewhere on the stack or maybe in callee-saved registers.
Why does the final printf print one: 1, two: 2147483641, three: 3? Shouldn't the first number also be mangled, like the second one was, by the succeeding sidefx calls?
You didn't specify which x86-64 ABI you're using, but from your use of %rdi / %rsi for arg passing, I'll assume you're targeting the SysV ABI (everything except windows). See the x86 wiki for links to docs and stuff.
... clobbering of return values from first two sidefx() calls... In order to fix this problem I need to store the return values either somewhere in the stack or maybe in callee-saved registers.
That's correct. gcc prefers using call-preserved regs, because then you don't have to fiddle with the stack alignment when pushing or popping between calls.
Why is the final printf returning one: 1, two: 2147483641, three: 3? Shouldn't the first number printed also be mangled like what happened to the second number due to the succeeding sidefx calls?
It's just a coincidence that %rdi=1 when you call threeargs(). If you single-step your code, you'd probably find it happens to have that value when printf returns. It's not from saving/restoring, since the original value is destroyed by movq $.string1, %rdi before the call to printf. It just happens that 1 is a common thing to find in a register.
Best guess: 1 is the file-descriptor arg to the write(2) system call, which is the last thing printf needed to do before returning. (Because stdout is line-buffered).
Your C doesn't match your implementation. In the asm, global_a is 8 bytes, but in C you're treating it as a 4-byte integer (printing with %d, not %ld). Your C doesn't declare it at all. I was going to edit a declaration into the question, but you should resolve the ambiguity yourself (between long global_a = 0; and int global_a = 0;). The AMD64 SysV ABI specifies that long is 8 bytes. Use int64_t whenever you're writing portable C, though. There's no harm in writing int64_t when interoperating with asm, even when you do happen to know the sizes of short, int and long in the ABI you're using.
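For instance, to match the 8-byte object the asm reserves, the long option would look like this (a sketch):
#include <stdio.h>

long global_a = 0; // 8 bytes in the AMD64 SysV ABI, matching the asm's 8-byte global_a

void print_a(void) {
    printf("a is %ld\n", global_a); // %ld for long; %d only looks at the low 32 bits
}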
Avoid the enter instruction, unless you only care about code size, and not speed. It's horribly slow. leave is ok, maybe slower than mov %rbp, %rsp / pop %rbp, but usually you only need pop %rbp because you either didn't modify %rsp, or you needed to restore rsp anyway with add $something, %rsp before popping some other registers that you saved after %rbp.
Zeroing 64bit registers with xor %eax,%eax (2 bytes) has many advantages beyond code-size over mov $0, %rax (7 bytes: mov $sign-extended-imm32, r64).
Compare your code with compiler output: gcc -fverbose-asm -O3 -fno-inline will actually generate code from your C; all you need is a declaration of a, and to make main return an int, and it compiles just fine as C11. Of course, it mostly uses 32bit operand size because you used int, but the data movement (which thing goes in which register) is the same.
Also, the order of evaluation of argument lists is unspecified, so threeargs(sidefx(), sidefx(), sidefx()) can evaluate its arguments in any order: the three calls are indeterminately sequenced with respect to each other. I guess this is why you called it pseudo-code, not C, but it's a poor way to express what you mean.
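If the order matters, make the sequencing explicit; for example:
int one = sidefx();   // each initializer is a separate full expression,
int two = sidefx();   // so the calls happen in this order
int three = sidefx();
threeargs(one, two, three);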
Anyway, here's your code on the Godbolt Compiler Explorer from gcc 5.3 -O3.
threeargs uses a jmp to tail-call printf, instead of call/ret.
The significant differences in main are all about correctly saving the return values from sidefx. Note that a=0 in main is not needed, because it's already initialized to zero by being in the BSS, but with -fwhole-program, gcc can't optimize it away. (A constructor could modify a before main runs, or maybe after linking a different definition of a could be used, that has a different initializer.)
The implementation of sidefx is noticeably tighter than yours:
sidefx:
subq $8, %rsp # aligns the stack for another function call
movl a(%rip), %eax # a, tmp94 # load `a`
movl $.LC0, %edi #, # the format string
leal 1(%rax), %esi #, D.2311 # esi = a+1
xorl %eax, %eax # # only needed because printf is a varargs function. Your `main` is doing this unnecessarily.
movl %esi, a(%rip) # D.2311, a # store back to the global
call printf #
movl a(%rip), %eax # a, # reload a
addq $8, %rsp #,
ret
IDK why gcc didn't load into %esi in the first place and use incl %esi, instead of using lea to add one and store into a different destination. Your version moves an immediate 1 into a register, which is silly. Use immediate operands, and lea. The CPU designers already paid the x86 tax (extra design complexity to support the CISC instruction set); make sure you get your money's worth by taking full advantage of lea and immediate operands.
Note that it doesn't store/reload a before the call to printf. Your version doesn't need to do that.
Also note that none of the functions waste instructions making stack frames.

Why is GCC std::atomic increment generating inefficient non-atomic assembly?

I've been using gcc's Intel-compatible builtins (like __sync_fetch_and_add) for quite some time, using my own atomic template. The "__sync" functions are now officially considered "legacy".
C++11 supports std::atomic<> and its descendants, so it seems reasonable to use that instead: it makes my code standard-compliant, and the compiler will produce the best code either way, in a platform-independent manner. That is almost too good to be true.
Incidentally, I'd only have to text-replace atomic with std::atomic, too. There's a lot in std::atomic (re: memory models) that I don't really need, but default parameters take care of that.
Now for the bad news. As it turns out, the generated code is, from what I can tell, ... utter crap, and not even atomic at all. Even a minimal example that increments a single atomic variable and outputs it makes no fewer than 5 non-inlined function calls to ___atomic_flag_for_address, ___atomic_flag_wait_explicit, and __atomic_flag_clear_explicit (fully optimized), and on the other hand there is not a single atomic instruction in the generated executable.
What gives? There is of course always the possibility of a compiler bug, but with the huge number of reviewers and users, such drastic things are generally unlikely to go unnoticed. Which means this is probably not a bug, but intended behaviour.
What is the "rationale" behind so many function calls, and how is atomicity implemented without atomicity?
As-simple-as-it-can-get example:
#include <atomic>
int main()
{
std::atomic_int a(5);
++a;
__builtin_printf("%d", (int)a);
return 0;
}
produces the following .s:
movl $5, 28(%esp) #, a._M_i
movl %eax, (%esp) # tmp64,
call ___atomic_flag_for_address #
movl $5, 4(%esp) #,
movl %eax, %ebx #, __g
movl %eax, (%esp) # __g,
call ___atomic_flag_wait_explicit #
movl %ebx, (%esp) # __g,
addl $1, 28(%esp) #, MEM[(__i_type *)&a]
movl $5, 4(%esp) #,
call _atomic_flag_clear_explicit #
movl %ebx, (%esp) # __g,
movl $5, 4(%esp) #,
call ___atomic_flag_wait_explicit #
movl 28(%esp), %esi # MEM[(const __i_type *)&a], __r
movl %ebx, (%esp) # __g,
movl $5, 4(%esp) #,
call _atomic_flag_clear_explicit #
movl $LC0, (%esp) #,
movl %esi, 4(%esp) # __r,
call _printf #
(...)
.def ___atomic_flag_for_address; .scl 2; .type 32; .endef
.def ___atomic_flag_wait_explicit; .scl 2; .type 32; .endef
.def _atomic_flag_clear_explicit; .scl 2; .type 32; .endef
... and the mentioned functions look e.g. like this in objdump:
004013c4 <__atomic_flag_for_address>:
mov 0x4(%esp),%edx
mov %edx,%ecx
shr $0x2,%ecx
mov %edx,%eax
shl $0x4,%eax
add %ecx,%eax
add %edx,%eax
mov %eax,%ecx
shr $0x7,%ecx
mov %eax,%edx
shl $0x5,%edx
add %ecx,%edx
add %edx,%eax
mov %eax,%edx
shr $0x11,%edx
add %edx,%eax
and $0xf,%eax
add $0x405020,%eax
ret
The others are somewhat simpler, but I don't find a single instruction that would really be atomic (other than some spurious xchg which are atomic on X86, but these seem to be rather NOP/padding, since it's xchg %ax,%ax following ret).
I'm absolutely not sure what such a rather complicated function is needed for, and how it's meant to make anything atomic.
This is an inadequate compiler build.
Check your c++config.h; it should look like this, but on your build it doesn't:
/* Define if builtin atomic operations for bool are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_1 1
/* Define if builtin atomic operations for short are supported on this host.
*/
#define _GLIBCXX_ATOMIC_BUILTINS_2 1
/* Define if builtin atomic operations for int are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_4 1
/* Define if builtin atomic operations for long long are supported on this
host. */
#define _GLIBCXX_ATOMIC_BUILTINS_8 1
These macros are defined or not depending on configure tests, which check host machine support for __sync_XXX functions. These tests are in libstdc++v3/acinclude.m4, AC_DEFUN([GLIBCXX_ENABLE_ATOMIC_BUILTINS] ....
On your installation, it's evident from the MEM[(__i_type *)&a] put in the assembly file by -fverbose-asm that the compiler uses macros from atomic_0.h, for example:
#define _ATOMIC_LOAD_(__a, __x) \
({typedef __typeof__(_ATOMIC_MEMBER_) __i_type; \
__i_type* __p = &_ATOMIC_MEMBER_; \
__atomic_flag_base* __g = __atomic_flag_for_address(__p); \
__atomic_flag_wait_explicit(__g, __x); \
__i_type __r = *__p; \
atomic_flag_clear_explicit(__g, __x); \
__r; })
With a properly built compiler, with your example program, c++ -m32 -std=c++0x -S -O2 -march=core2 -fverbose-asm should produce something like this:
movl $5, 28(%esp) #, a.D.5442._M_i
lock addl $1, 28(%esp) #,
mfence
movl 28(%esp), %eax # MEM[(const struct __atomic_base *)&a].D.5442._M_i, __ret
mfence
movl $.LC0, (%esp) #,
movl %eax, 4(%esp) # __ret,
call printf #
There are two implementations. One that uses the __sync primitives and one that does not. Plus a mixture of the two that only uses some of those primitives. Which is selected depends on macros _GLIBCXX_ATOMIC_BUILTINS_1, _GLIBCXX_ATOMIC_BUILTINS_2, _GLIBCXX_ATOMIC_BUILTINS_4 and _GLIBCXX_ATOMIC_BUILTINS_8.
At least the first one is needed for the mixed implementation; all are needed for the fully atomic one. It seems that whether they are defined depends on the target machine (they may not be defined for -march=i386 but should be defined for -march=i686).
