Why is GCC std::atomic increment generating inefficient non-atomic assembly? - gcc

I've been using gcc's Intel-compatible builtins (like __sync_fetch_and_add) for quite some time, using my own atomic template. The "__sync" functions are now officially considered "legacy".
C++11 supports std::atomic<> and its descendants, so it seems reasonable to use that instead: it makes my code standard compliant, and the compiler should produce the best code either way, in a platform-independent manner. That is almost too good to be true.
Incidentally, I'd only have to text-replace atomic with std::atomic, too. There's a lot in std::atomic (re: memory models) that I don't really need, but default parameters take care of that.
Now for the bad news. As it turns out, the generated code is, from what I can tell, ... utter crap, and not even atomic at all. Even a minimal example that increments a single atomic variable and outputs it has no fewer than 5 non-inlined function calls to ___atomic_flag_for_address, ___atomic_flag_wait_explicit, and _atomic_flag_clear_explicit (fully optimized), and on the other hand, there is not a single atomic instruction in the generated executable.
What gives? There is of course always the possibility of a compiler bug, but with the huge number of reviewers and users, such rather drastic things are generally unlikely to go unnoticed. Which means, this is probably not a bug, but intended behaviour.
What is the "rationale" behind so many function calls, and how is atomicity implemented without atomicity?
As-simple-as-it-can-get example:
#include <atomic>
int main()
{
std::atomic_int a(5);
++a;
__builtin_printf("%d", (int)a);
return 0;
}
produces the following .s:
movl $5, 28(%esp) #, a._M_i
movl %eax, (%esp) # tmp64,
call ___atomic_flag_for_address #
movl $5, 4(%esp) #,
movl %eax, %ebx #, __g
movl %eax, (%esp) # __g,
call ___atomic_flag_wait_explicit #
movl %ebx, (%esp) # __g,
addl $1, 28(%esp) #, MEM[(__i_type *)&a]
movl $5, 4(%esp) #,
call _atomic_flag_clear_explicit #
movl %ebx, (%esp) # __g,
movl $5, 4(%esp) #,
call ___atomic_flag_wait_explicit #
movl 28(%esp), %esi # MEM[(const __i_type *)&a], __r
movl %ebx, (%esp) # __g,
movl $5, 4(%esp) #,
call _atomic_flag_clear_explicit #
movl $LC0, (%esp) #,
movl %esi, 4(%esp) # __r,
call _printf #
(...)
.def ___atomic_flag_for_address; .scl 2; .type 32; .endef
.def ___atomic_flag_wait_explicit; .scl 2; .type 32; .endef
.def _atomic_flag_clear_explicit; .scl 2; .type 32; .endef
... and the mentioned functions look e.g. like this in objdump:
004013c4 <__atomic_flag_for_address>:
mov 0x4(%esp),%edx
mov %edx,%ecx
shr $0x2,%ecx
mov %edx,%eax
shl $0x4,%eax
add %ecx,%eax
add %edx,%eax
mov %eax,%ecx
shr $0x7,%ecx
mov %eax,%edx
shl $0x5,%edx
add %ecx,%edx
add %edx,%eax
mov %eax,%edx
shr $0x11,%edx
add %edx,%eax
and $0xf,%eax
add $0x405020,%eax
ret
The others are somewhat simpler, but I don't find a single instruction that would really be atomic (other than some spurious xchg which are atomic on X86, but these seem to be rather NOP/padding, since it's xchg %ax,%ax following ret).
I'm absolutely not sure what such a rather complicated function is needed for, and how it's meant to make anything atomic.

It is an inadequate compiler build.
Check your c++config.h. It should look like this, but it doesn't:
/* Define if builtin atomic operations for bool are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_1 1
/* Define if builtin atomic operations for short are supported on this host.
*/
#define _GLIBCXX_ATOMIC_BUILTINS_2 1
/* Define if builtin atomic operations for int are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_4 1
/* Define if builtin atomic operations for long long are supported on this
host. */
#define _GLIBCXX_ATOMIC_BUILTINS_8 1
These macros are defined or not depending on configure tests, which check host machine support for __sync_XXX functions. These tests are in libstdc++-v3/acinclude.m4, AC_DEFUN([GLIBCXX_ENABLE_ATOMIC_BUILTINS] ....
On your installation, it's evident from the MEM[(__i_type *)&a] put in the assembly file by -fverbose-asm that the compiler uses macros from atomic_0.h, for example:
#define _ATOMIC_LOAD_(__a, __x) \
({typedef __typeof__(_ATOMIC_MEMBER_) __i_type; \
__i_type* __p = &_ATOMIC_MEMBER_; \
__atomic_flag_base* __g = __atomic_flag_for_address(__p); \
__atomic_flag_wait_explicit(__g, __x); \
__i_type __r = *__p; \
atomic_flag_clear_explicit(__g, __x); \
__r; })
With a properly built compiler, with your example program, c++ -m32 -std=c++0x -S -O2 -march=core2 -fverbose-asm should produce something like this:
movl $5, 28(%esp) #, a.D.5442._M_i
lock addl $1, 28(%esp) #,
mfence
movl 28(%esp), %eax # MEM[(const struct __atomic_base *)&a].D.5442._M_i, __ret
mfence
movl $.LC0, (%esp) #,
movl %eax, 4(%esp) # __ret,
call printf #

There are two implementations. One that uses the __sync primitives and one that does not. Plus a mixture of the two that only uses some of those primitives. Which is selected depends on macros _GLIBCXX_ATOMIC_BUILTINS_1, _GLIBCXX_ATOMIC_BUILTINS_2, _GLIBCXX_ATOMIC_BUILTINS_4 and _GLIBCXX_ATOMIC_BUILTINS_8.
At least the first one is needed for the mixed implementation; all are needed for the fully atomic one. Whether they are defined seems to depend on the target machine (they may not be defined for -march=i386 and should be defined for -march=i686).
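To check which implementation a given toolchain ended up with, something like the following may help. It is only a sketch: increment_once and int_atomics_are_lock_free are hypothetical helper names, not part of libstdc++. If is_lock_free() reports false for a plain int on x86, that would suggest the library fell back to the flag-based path described above.

```cpp
#include <atomic>

// Hypothetical helper: atomically increments the counter and returns the new value.
int increment_once(std::atomic<int>& counter) {
    return ++counter;  // sequentially consistent read-modify-write
}

// Hypothetical helper: reports whether std::atomic<int> works without a lock
// on this particular compiler build and target.
bool int_atomics_are_lock_free() {
    std::atomic<int> probe(0);
    return probe.is_lock_free();
}
```

On a properly configured x86 build, compiling increment_once with -O2 -S should show a lock-prefixed instruction rather than calls to helper functions.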

Related

Why does %rdi still have the right value, after I clobber it and call printf?

I'm having trouble understanding the (erroneous) output of the following assembly code I've generated through a compiler I'm writing.
This is the pseudo-code of what I'm compiling:
int sidefx ( ) {
a = a + 1;
printf("side effect: a is %d\n", a);
return a;
}
void threeargs ( int one, int two, int three ) {
printf("three arguments. one: %d, two: %d, three: %d\n", one, two, three);
}
void main ( ) {
a = 0;
threeargs(sidefx(), sidefx(), sidefx());
}
Here's the assembly code I've generated:
.section .rodata
.comm global_a, 8, 8
.string0:
.string "a is %d\n"
.string1:
.string "one: %d, two: %d, three: %d\n"
.globl main
sidefx: /* Function sidefx() */
enter $(8*0),$0 /* Enter a new stack frame */
movq global_a, %r10 /* Store the value in .global_a in %r10 */
movq $1, %r11 /* Store immediate 1 into %r11 */
addq %r10,%r11 /* Add %r10 and %r11 */
movq %r11, global_a /* Store the result in .global_a */
movq global_a, %rsi /* Put the value of .global_a into second parameter register */
movq $.string0, %rdi /* Move .string0 to first parameter register */
movq $0, %rax
call printf /* Call printf */
movq global_a, %rax /* Return the new value of .global_a */
leave /* Restore old %rsp, %rbp values */
ret /* Pop the return address */
threeargs: /* Function threeargs() */
enter $(8*0),$0 /* Enter a new stack frame */
movq %rdx, %rcx /* Move 3rd parameter register value into 4th parameter register */
movq %rsi, %rdx /* Move 2nd parameter register value into 3rd parameter register */
movq %rdi, %rsi /* Move 1st parameter register value into 2nd parameter register */
movq $.string1, %rdi /* Move .string1 to 1st parameter register */
movq $0, %rax
call printf /* call printf */
leave /* Restore old %rsp, %rbp values */
ret /* Pop the return address */
main:
enter $(8*0),$0 /* Enter a new stack frame */
movq $0, global_a /* Set .global_a to 0 */
movq $0, %rax
call sidefx /* Call sidefx() */
movq %rax,%rdi /* Store value in %rdi, our first parameter register */
movq $0, %rax
call sidefx /* Call sidefx() */
movq %rax,%rsi /* Store value in %rsi, our second parameter register */
movq $0, %rax
call sidefx /* Call sidefx() */
movq %rax,%rdx /* Store value in %rdx, our third parameter register */
movq $0, %rax
call threeargs /* Call threeargs() */
main_return:
leave
ret
Now here's what I don't understand. The output to the program when compiled (gcc file.s -o code && ./code) is the following :
dmlittle$ gcc file.s -o code && ./code
a is 1
a is 2
a is 3
one: 1, two: 2147483641, three: 3
The problem with the assembly code is that I'm storing the values of the sidefx() call that will eventually be parameters to threeargs() into the function registers, but the 2 succeeding calls to sidefx() will overwrite the values of %rdi and %rsi in order to call printf. In order to fix this problem I need to store the return values either somewhere in the stack or maybe in callee-saved registers.
Why is the final printf returning one: 1, two: 2147483641, three: 3? Shouldn't the first number printed also be mangled like what happened to the second number due to the succeeding sidefx calls?
You didn't specify which x86-64 ABI you're using, but from your use of %rdi / %rsi for arg passing, I'll assume you're targeting the SysV ABI (everything except Windows). See the x86 wiki for links to docs and stuff.
... clobbering of return values from first two sidefx() calls... In order to fix this problem I need to store the return values either somewhere in the stack or maybe in callee-saved registers.
That's correct. gcc prefers using call-preserved regs, because then you don't have to fiddle with the stack alignment when pushing or popping between calls.
Why is the final printf returning one: 1, two: 2147483641, three: 3? Shouldn't the first number printed also be mangled like what happened to the second number due to the succeeding sidefx calls?
It's just a coincidence that %rdi=1 when you call threeargs(). If you single-step your code, you'd probably find it happens to have that value when printf returns. It's not from saving/restoring, since the original value is destroyed by movq $.string1, %rdi before the call to printf. It just happens that 1 is a common thing to find in a register.
Best guess: 1 is the file-descriptor arg to the write(2) system call, which is the last thing printf needed to do before returning. (Because stdout is line-buffered).
Your C doesn't match your implementation. In the asm, global_a is 8 bytes, but in C you're treating it as a 4 byte integer (printing with %d, not %ld). Your C doesn't declare it at all. I was going to edit in a declaration into the question, but you should resolve the ambiguity yourself (between long global_a = 0; or int global_a = 0;). The AMD64 SysV ABI specifies that long is 8 bytes. Use int64_t whenever you're writing portable C, though. There's no harm in writing int64_t when interoperating with asm, even when you do happen to know the sizes of short, int and long in the ABI you're using.
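One way to resolve the size ambiguity on the C/C++ side is sketched below. It assumes the 8-byte choice (matching .comm global_a, 8, 8); format_a is a hypothetical helper that returns the text instead of printing it, so the result is easy to inspect.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// 8 bytes wide, matching the `.comm global_a, 8, 8` in the asm.
std::int64_t global_a = 0;

// Hypothetical helper: formats the global with the conversion that matches
// its width (PRId64) instead of %d.
std::string format_a() {
    char buf[32];
    std::snprintf(buf, sizeof buf, "a is %" PRId64, global_a);
    return buf;
}
```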
Avoid the enter instruction, unless you only care about code size, and not speed. It's horribly slow. leave is ok, maybe slower than mov %rbp, %rsp / pop %rbp, but usually you only need pop %rbp because you either didn't modify %rsp, or you needed to restore rsp anyway with add $something, %rsp before popping some other registers that you saved after %rbp.
Zeroing 64bit registers with xor %eax,%eax (2 bytes) has many advantages beyond code-size over mov $0, %rax (7 bytes: mov $sign-extended-imm32, r64).
Compare your code with compiler output: gcc -fverbose-asm -O3 -fno-inline will actually generate code from your C; all you need is a declaration of a, and to make main return an int, and it compiles just fine as C11. Of course, it mostly uses 32bit operand size because you used int, but the data movement (which thing goes in which register) is the same.
Also, the order of evaluation of function arguments is unspecified, so threeargs(sidefx(), sidefx(), sidefx()) can legally pass the three values in any order. (The calls themselves can't overlap, so this is unspecified rather than undefined behaviour, but the result still isn't portable.) I guess this is why you called it pseudo-code, not C, but it's a poor way to express what you mean.
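If a fixed order is what you mean, one way to say so is to evaluate each call in its own statement. A small sketch (ordered_args is a hypothetical name, and a is declared here as a plain int, which is only one of the two possible choices):

```cpp
#include <array>

static int a = 0;  // stands in for the global from the question

int sidefx() {
    a = a + 1;
    return a;
}

// Each call is a separate full expression, so the order is fixed:
// first, then second, then third.
std::array<int, 3> ordered_args() {
    const int one = sidefx();
    const int two = sidefx();
    const int three = sidefx();
    return {{one, two, three}};
}
```

A compiler generating code for this has to keep one and two alive across the later calls, which is exactly the save-to-stack or call-preserved-register problem discussed above.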
Anyway, here's your code on the Godbolt Compiler Explorer from gcc 5.3 -O3.
threeargs uses a jmp to tail-call printf, instead of call/ret.
The significant differences in main are all about correctly saving the return values from sidefx. Note that a=0 in main is not needed, because it's already initialized to zero by being in the BSS, but without -fwhole-program, gcc can't optimize it away. (A constructor could modify a before main runs, or maybe after linking a different definition of a could be used, that has a different initializer.)
The implementation of sidefx is noticeably tighter than yours:
sidefx:
subq $8, %rsp # aligns the stack for another function call
movl a(%rip), %eax # a, tmp94 # load `a`
movl $.LC0, %edi #, # the format string
leal 1(%rax), %esi #, D.2311 # esi = a+1
xorl %eax, %eax # # only needed because printf is a varargs function. Your `main` is doing this unnecessarily.
movl %esi, a(%rip) # D.2311, a # store back to the global
call printf #
movl a(%rip), %eax # a, # reload a
addq $8, %rsp #,
ret
IDK why gcc didn't load into %esi in the first place, and inc %esi instead of using lea to add one and store in a different dest. Your version moves an immediate 1 into a register, which is silly. Use immediate operands, and lea. The CPU designers already paid the x86 tax (extra design complexity to support the CISC instruction set), so make sure you get your money's worth by taking full advantage of lea and immediate operands.
Note that it doesn't store a and then immediately reload it before the call to printf. Your version doesn't need to do that either.
Also note that none of the functions waste instructions making stack frames.

Using sigaction and setitimer system calls to implement assembly language timer on BSD/OS X

I'm trying to implement a timer routine in 32-bit assembler on OS X Lion using sigaction() & setitimer() system calls. The idea is to set a timer using setitimer() & then have the generated alarm signal invoke a handler function previously setup through sigaction(). I have such a mechanism working on Linux, but cannot seem to get it working on OS X. I know the system call convention is different between OS X & Linux, and that OS X has a 16 byte alignment requirement. Despite compensating for these, I'm still not able to get it working (usually a "Bus error: 10" error). Thinking I did something wrong with the alignment, I wrote a simple C program that does what I want & then used clang 3.2 to generate the assembly code. Then I modified the machine-generated assembly by replacing the calls to sigaction() & setitimer() with the appropriate system & int $0x80 calls, as well as stack alignment instructions. The resulting program still doesn't work.
Here is the C program, sigaction.c, that I used to generate the assembly. Note that I commented out the printf & sleep stuff so the resulting assembly code would be easier to read:
//#include <stdio.h>
#include <signal.h>
#include <sys/time.h>
struct sigaction action;
void handler(int arg) {
// printf("HERE!\n");
}
int main() {
action.__sigaction_u.__sa_handler = handler;
action.sa_mask = 0;
action.sa_flags = 0;
// printf("sigaction size: %d\n", sizeof(action));
int fd = sigaction(14, &action, 0);
struct itimerval timer;
timer.it_interval.tv_sec = 1;
timer.it_interval.tv_usec = 0;
timer.it_value.tv_sec = 1;
timer.it_value.tv_usec = 0;
// printf("itimerval size: %d\n", sizeof(timer));
fd = setitimer(0, &timer, 0);
while (1) {
// sleep(60);
}
return 0;
}
Here is the assembly code generated using "clang -arch i386 -S sigaction.c" on the above file:
.section __TEXT,__text,regular,pure_instructions
.globl _handler
.align 4, 0x90
_handler: ## @handler
## BB#0:
pushl %ebp
movl %esp, %ebp
pushl %eax
movl 8(%ebp), %eax
movl %eax, -4(%ebp)
addl $4, %esp
popl %ebp
ret
.globl _main
.align 4, 0x90
_main: ## @main
## BB#0:
pushl %ebp
movl %esp, %ebp
pushl %esi
subl $52, %esp
calll L1$pb
L1$pb:
popl %eax
movl $14, %ecx
movl L_action$non_lazy_ptr-L1$pb(%eax), %edx
movl $0, %esi
leal _handler-L1$pb(%eax), %eax
movl $0, -8(%ebp)
movl %eax, (%edx)
movl $0, 4(%edx)
movl $0, 8(%edx)
movl $14, (%esp)
movl %edx, 4(%esp)
movl $0, 8(%esp)
movl %esi, -36(%ebp) ## 4-byte Spill
movl %ecx, -40(%ebp) ## 4-byte Spill
calll _sigaction
movl $0, %ecx
leal -32(%ebp), %edx
movl %eax, -12(%ebp)
movl $1, -32(%ebp)
movl $0, -28(%ebp)
movl $1, -24(%ebp)
movl $0, -20(%ebp)
movl $0, (%esp)
movl %edx, 4(%esp)
movl $0, 8(%esp)
movl %ecx, -44(%ebp) ## 4-byte Spill
calll _setitimer
movl %eax, -12(%ebp)
LBB1_1: ## =>This Inner Loop Header: Depth=1
jmp LBB1_1
.comm _action,12,2 ## @action
.section __IMPORT,__pointers,non_lazy_symbol_pointers
L_action$non_lazy_ptr:
.indirect_symbol _action
.long 0
.subsections_via_symbols
If I compile the assembly code using "clang -arch i386 sigaction.s -o sigaction", debug it using lldb & place a breakpoint in the handler function, the handler function is indeed called every second, so I know the assembly code is correct (ditto for the C code).
Now if I replace the call to sigaction() with:
# calll _sigaction
movl $0x2e, %eax
subl $0x04, %esp
int $0x80
addl $0x04, %esp
and the call to setitimer() with:
# calll _setitimer
movl $0x53, %eax
subl $0x04, %esp
int $0x80
addl $0x04, %esp
the assembly code no longer works, and generates the same "Bus error: 10" that my hand-coded assembly code does.
I've tried removing the subl/addl instructions that I'm using to align the stack as well as changing the values to make sure the stack is aligned on 16-byte boundaries, but nothing seems to work. I either get the bus error, a segmentation fault, or the code just hangs without calling the handler function.
One thing I did notice during debugging is that the sigaction call appears to have a lengthy wrapper around the underlying system call. If you disassemble both functions from within lldb, you will see sigaction() has a lengthy wrapper but setitimer does not. Not sure this means anything, but perhaps the sigaction() wrapper is massaging the data before passing it along. I tried debugging that code, but haven't found anything definitive yet.
If anyone knows how to get the above assembly code working by replacing the sigaction() & setitimer() functions with the appropriate system calls, it would be greatly appreciated. I can then take those changes & apply them to my hand-coded routines.
Thanks.
Update: I stripped down my hand-written assembly code to a manageable size & was able to get it working using the sigaction() & setitimer() library calls, but still haven't figured out why the syscalls don't work. Here's the code (timer.s):
.globl _main
.data
.set ITIMER_REAL, 0x00
.set SIGALRM, 0x0e
.set SYS_SIGACTION, 0x2e
.set SYS_SETITIMER, 0x53
.set TRAP, 0x80
itimerval:
interval_tv_sec:
.long 0
interval_tv_usec:
.long 0
value_tv_sec:
.long 0
value_tv_usec:
.long 0
sigaction:
sa_handler:
.long handler
sa_mask:
.long 0
sa_flags:
.long 0
.text
handler:
pushl %ebp
movl %esp, %ebp
movl %ebp, %esp
popl %ebp
ret
_main:
pushl %ebp
movl %esp, %ebp
subl $0x0c, %esp
movl $SIGALRM, %ebx
movl $sigaction, %ecx
movl $0x00, %edx
pushl %edx
pushl %ecx
pushl %ebx
# subl $0x04, %esp
call _sigaction
# movl $SYS_SIGACTION, %eax
# int $0x80
addl $0x0c, %esp
# addl $0x10, %esp
movl $ITIMER_REAL, %ebx
movl $0x01, interval_tv_sec # Successive calls every 1 second
movl $0x00, interval_tv_usec
movl $0x01, value_tv_sec # Initial call in 1 second
movl $0x00, value_tv_usec
movl $itimerval, %ecx
movl $0x00, %edx
pushl %edx
pushl %ecx
pushl %ebx
# subl $0x04, %esp
call _setitimer
# movl $SYS_SETITIMER, %eax
# int $0x80
addl $0x0c, %esp
# addl $0x10, %esp
loop:
jmp loop
When compiled with "clang -arch i386 timer.s -o timer" & debugged with lldb, the handler routine is called every second. I left my efforts at making the code work with syscalls in the code - they are commented out around the sigaction() & setitimer() calls. If for no other reason than to educate myself (and others), I would still like to get the sys call version working if possible, and if not, understand the reason why it doesn't work.
Thanks again.
Update 2: I got the setitimer syscall working. Here's the modified code:
pushl %edx
pushl %ecx
pushl %ebx
subl $0x04, %esp
movl $SYS_SETITIMER, %eax
int $0x80
addl $0x10, %esp
But the same edits do not work for the sigaction sys call, which leads me back to my original conclusion - the sigaction() library function is doing something extra before making the actual syscall. This snippet from dtruss seems to suggest the same:
With sigaction() syscall (not working):
sigaction(0xE, 0x2030, 0x0) = 0 0
setitimer(0x0, 0x2020, 0x0) = 0 0
With sigaction() library call (working):
sigaction(0xE, 0xBFFFFC40, 0x0) = 0 0
setitimer(0x0, 0x2028, 0x0) = 0 0
As you can see, the 2nd argument is different between the two versions. It seems the address of the sigaction structure (0x2030) is passed directly when using the syscall, but something else is passed when using the library call. I'm guessing that the "something else" is generated in the sigaction() library function.
Update 3: I discovered that the same exact problem exists on FreeBSD 9.1. The setitimer syscall works, but the sigaction syscall does not. Like OS X, the sigaction() library call does work.
BSD has a few sigaction syscalls - so far, I've only tried the same one I was using in OS X - 0x2e. Perhaps one of the other sigaction syscalls will work. Knowing that BSD has the same behavior will make this easier to track down, as I can pull the C source code. Plus this opens the problem up to a much wider group of people who may already know what the problem is.
Based on my understanding of how syscalls work coupled with the fact that sigaction does work on Linux, I can't help but to think I am doing something wrong in my code. However, the fact that replacing the int $0x80 call with the sigaction() library function causes my code to work seems to contradict this. There is an entire chapter on assembly language programming in the FreeBSD developer manual, as well as a section on making system calls, so what I'm doing should be possible:
https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/x86-system-calls.html
Unless someone can point out what, if anything, I am doing wrong, I think the next step is for me to look at the BSD sources for sigaction(). As I mentioned previously, I looked at the disassembled version of sigaction on OS X & found it to be quite lengthy compared to other syscalls & filled with magic numbers. Hopefully looking at the C code will make clear what it's doing that causes it to work. In the end, it could be something as simple as passing in the wrong sigaction struct (there are several of them) or failing to set some bit somewhere.
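For reference, the library-call version that does work can be written compactly as below. This is only a sketch of the working path, not a fix for the raw syscall: on_alarm, start_timer and wait_for_ticks are hypothetical names, and the interval is passed in microseconds (it must be non-zero, since a zero value disarms the timer).

```cpp
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

// Counts delivered SIGALRM signals; sig_atomic_t is safe to touch in a handler.
static volatile sig_atomic_t ticks = 0;

extern "C" void on_alarm(int) { ticks = ticks + 1; }

// Installs the handler through the sigaction() library wrapper and arms a
// repeating ITIMER_REAL timer. Returns true if both calls succeed.
bool start_timer(long interval_usec) {
    struct sigaction action {};
    action.sa_handler = on_alarm;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    if (sigaction(SIGALRM, &action, nullptr) != 0) return false;

    struct itimerval timer {};
    timer.it_interval.tv_sec = interval_usec / 1000000;
    timer.it_interval.tv_usec = interval_usec % 1000000;
    timer.it_value = timer.it_interval;  // first expiry after one interval
    return setitimer(ITIMER_REAL, &timer, nullptr) == 0;
}

// Sleeps until at least `wanted` timer signals have been handled.
void wait_for_ticks(int wanted) {
    while (ticks < wanted) pause();  // pause() returns after a handled signal
}
```

Using sigemptyset() rather than assigning 0 to sa_mask keeps the code portable to systems where sigset_t is not an integer.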

Understanding gcc -S output

I did gcc -S on the very complex program below on x86_64:
int main() {
int x = 3;
x = 5;
return 0;
}
And what I got was:
.file "main.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $3, -4(%rbp)
movl $5, -4(%rbp)
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-3)"
.section .note.GNU-stack,"",@progbits
I was wondering if someone could help me understand the output or refer me to some link explaining it. Specifically, what do cfi, LFB0, LFE0, and leave mean? All I could find regarding these is this post, but I couldn't fully understand what it was for. Also, what does ret do in this case? I'm guessing it's returning to __libc_start_main(), which in turn would call do_exit(); is that correct?
Those .cfisomething directives result in generation of additional data by the compiler. This data helps traverse the call stack when an instruction causes an exception, so the exception handler (if any) can be found and correctly executed. The call stack information is useful for debugging. This data most probably goes into a separate section of the executable. It's not inserted between the instructions of your code.
.LFsomething: are just regular labels that are probably referenced by that extra exception-related data.
leave and ret are CPU instructions.
leave is equivalent to:
movq %rbp, %rsp
popq %rbp
and it undoes the effect of these two instructions
pushq %rbp
movq %rsp, %rbp
and instructions that allocate space on the stack by subtracting something from rsp.
ret returns from the function. It pops the return address from the stack and jumps to that address. If it was __libc_start_main() that called main(), then it returns there.
.LFB0, .LFE0 are nothing but local labels.
.cfi_startproc is used at the beginning of each function and end of the function happens by .cfi_endproc.
These assembler directives help the assembler to put debugging and stack unwinding information into the executable.
The leave instruction is an x86 instruction that restores the calling function's stack frame.
And lastly after the ret instruction, the following things happen:
%rip contains return address
%rsp points at arguments pushed by caller that didn't fit in the six registers used to pass arguments on amd64 (%rdi, %rsi, %rdx, %rcx, %r8, %r9)
called function may have trashed arguments
%rax contains return value (or trash if function is void) (or %rax and %rdx contain the return value if its size is >8 bytes but <=16 bytes)
%r10, %r11 may be trashed
%rbp, %rbx, %r12, %r13, %r14, %r15 must contain contents from time of call
Additional information can be found here (SO question) and here (standards PDFs).
Or, on 32-bit:
%eip contains return address
%esp points at arguments pushed by caller
called function may have trashed arguments
%eax contains return value (or trash if function is void)
%ecx, %edx may be trashed
%ebp, %ebx, %esi, %edi must contain contents from time of call

gcc compilation and assembling

I am trying to create an executable with gcc. I have two files: virtualstack.c (which consists of the C code below) and stack.s, which consists of Intel x86 assembly code written in AT&T syntax (seen below the C code). My command line is gcc -c virtualstack.c -s stack.s, but I get two errors (line 3 in stack.s): missing symbol name in directive, and no such instruction _stack_create. I thought I had correctly declared functions from C in assembly by prefixing them with an underscore (_). I would be very grateful for any comments.
C code:
#include <stdio.h>
#include <stdlib.h>
extern void stack_create(void);
int main(void)
{
stack_create();
return 0;
}
Assembly code:
.global _stack_create
.type, #function
_stack_create
pushl %ebp
movl $5, %esp
movl %esp, ebp
movl $21, %edx
pushl %edx
I will try to explain a method for investigating such cases.
1) It is always a good idea to make the compiler work for you. So let's start with this code (let's call it assemble.c):
#include <stdio.h>
#include <stdlib.h>
/* stub stuff */
void __attribute__ ((noinline))
stack_create(void) { }
int
main(void)
{
stack_create();
return 0;
}
Now compile it to assembly with gcc -S -g0 assemble.c. The stack_create function was assembled to this (your results may differ, so please follow these steps yourself):
.text
.globl stack_create
.type stack_create, @function
stack_create:
pushq %rbp
movq %rsp, %rbp
popq %rbp
ret
.size stack_create, .-stack_create
2) Now all you need is to take this template and fill it with your stuff:
.text
.globl stack_create
.type stack_create, @function
stack_create:
pushq %rbp
movq %rsp, %rbp
# Go and put your favorite stuff here!
pushl %ebp
movl $5, %esp
movl %esp, ebp
movl $21, %edx
pushl %edx
... etc ...
popq %rbp
ret
.size stack_create, .-stack_create
And of course make it a separate .s file, say stack.s.
3) Now let's compile it all together. Remove the stub stuff from assemble.c and compile everything as:
gcc assemble.c stack.s
I got no errors. I believe you will get no errors too.
The main lesson: don't try to hand-write assembler details like sections, function labels, etc. The compiler knows better how to do it. Use its knowledge instead.

x86 spinlock using cmpxchg

I'm new to using gcc inline assembly, and was wondering if, on an x86 multi-core machine, a spinlock (without race conditions) could be implemented as (using AT&T syntax):
spin_lock:
mov 0 eax
lock cmpxchg 1 [lock_addr]
jnz spin_lock
ret
spin_unlock:
lock mov 0 [lock_addr]
ret
You have the right idea, but your asm is broken:
cmpxchg can't work with an immediate operand, only registers.
lock is not a valid prefix for mov. mov to an aligned address is atomic on x86, so you don't need lock anyway.
It has been some time since I've used AT&T syntax, hope I remembered everything:
spin_lock:
xorl %ecx, %ecx
incl %ecx # newVal = 1
spin_lock_retry:
xorl %eax, %eax # expected = 0
lock; cmpxchgl %ecx, (lock_addr)
jnz spin_lock_retry
ret
spin_unlock:
movl $0, (lock_addr) # atomic release-store
ret
Note that GCC has atomic builtins, so you don't actually need to use inline asm to accomplish this:
void spin_lock(int *p)
{
while(!__sync_bool_compare_and_swap(p, 0, 1));
}
void spin_unlock(int volatile *p)
{
asm volatile ("":::"memory"); // compiler barrier: keeps earlier accesses from being reordered after the store
*p = 0;
}
As Bo says below, locked instructions incur a cost: every one you use must acquire exclusive access to the cache line and lock it down while lock cmpxchg runs, like for a normal store to that cache line but held for the duration of lock cmpxchg execution. This can delay the unlocking thread especially if multiple threads are waiting to take the lock. Even without many CPUs, it's still easy and worth it to optimize around:
void spin_lock(int volatile *p)
{
while(!__sync_bool_compare_and_swap(p, 0, 1))
{
// spin read-only until a cmpxchg might succeed
while(*p) _mm_pause(); // or maybe do{}while(*p) to pause first
}
}
The pause instruction is vital for performance on HyperThreading CPUs when you've got code that spins like this -- it lets the second thread execute while the first thread is spinning. On CPUs which don't support pause, it is treated as a nop.
pause also prevents memory-order mis-speculation when leaving the spin-loop, when it's finally time to do real work again. See: What is the purpose of the "PAUSE" instruction in x86?
Note that spin locks are actually rarely used: typically, one uses something like a critical section or futex. These integrate a spin lock for performance under low contention, but then fall back to an OS-assisted sleep and notify mechanism. They may also take measures to improve fairness, and lots of other things the cmpxchg / pause loop doesn't do.
Also note that cmpxchg is unnecessary for a simple spinlock: you can use xchg and then check whether the old value was 0 or not. Doing less work inside the locked instruction may keep the cache line pinned for less time. See Locks around memory manipulation via inline assembly for a complete asm implementation using xchg and pause (but still with no fallback to OS-assisted sleep, just spinning indefinitely.)
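In portable C++11 the xchg-based idea could look roughly like this. It is a sketch under a few assumptions: SpinLock and cpu_relax are hypothetical names, the pause hint is only emitted on x86-64 builds, and there is still no fallback to an OS-assisted sleep.

```cpp
#include <atomic>

#if defined(__x86_64__)
#include <immintrin.h>
static inline void cpu_relax() { _mm_pause(); }  // spin-wait hint
#else
static inline void cpu_relax() {}                // no hint on other targets
#endif

class SpinLock {
public:
    void lock() {
        // exchange() returns the old value: false means we took the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Spin read-only until the lock looks free, then retry the exchange.
            while (locked_.load(std::memory_order_relaxed)) cpu_relax();
        }
    }

    // Single attempt; returns true if the lock was acquired.
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};
```

Because it provides lock(), try_lock() and unlock(), the class can be used with std::lock_guard<SpinLock>.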
This will put less contention on the memory bus:
void spin_lock(int *p)
{
while(!__sync_bool_compare_and_swap(p, 0, 1)) while(*p);
}
The syntax is wrong. It works after a little modification.
spin_lock:
movl $0, %eax
movl $1, %ecx
lock cmpxchg %ecx, (lock_addr)
jnz spin_lock
ret
spin_unlock:
movl $0, (lock_addr)
ret
Here is a version that runs faster. Assume lock_addr is stored in the %rdi register.
Spin with movl and test instead of lock cmpxchgl %ecx, (%rdi).
Use lock cmpxchgl %ecx, (%rdi) to try to enter the critical section only when the lock looks free.
This avoids unneeded bus locking.
spin_lock:
movl $1, %ecx
loop:
movl (%rdi), %eax
test %eax, %eax
jnz loop
lock cmpxchgl %ecx, (%rdi)
jnz loop
ret
spin_unlock:
movl $0, (%rdi)
ret
I have tested it using pthread and an easy loop like this.
for(i = 0; i < 10000000; ++i){
spin_lock(&mutex);
++count;
spin_unlock(&mutex);
}
In my test, the first one takes 2.5~3 secs and the second one takes 1.3~1.8 secs.
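A test along those lines might be sketched as follows. The original used pthread directly; this version assumes std::thread and GCC's __sync/__atomic builtins instead, and run_counter, mutex_word and shared_count are hypothetical names. It checks correctness only and does not reproduce the timings.

```cpp
#include <thread>
#include <vector>

static int mutex_word = 0;      // 0 = free, 1 = held
static long shared_count = 0;   // protected by mutex_word

static void spin_lock(int* p) {
    while (!__sync_bool_compare_and_swap(p, 0, 1)) {
        // Spin read-only until the lock looks free, then retry the CAS.
        while (__atomic_load_n(p, __ATOMIC_RELAXED) != 0) {}
    }
}

static void spin_unlock(int* p) {
    __sync_lock_release(p);  // writes 0 with release semantics
}

// Runs `threads` workers that each increment the shared counter
// `iterations` times under the spinlock; returns the final count.
long run_counter(int threads, int iterations) {
    shared_count = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([iterations] {
            for (int i = 0; i < iterations; ++i) {
                spin_lock(&mutex_word);
                ++shared_count;
                spin_unlock(&mutex_word);
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared_count;
}
```

If the lock is correct, the returned value equals threads * iterations. On older toolchains this needs to be built with -pthread.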
