How could I force gcc/g++ to pass arguments to functions only on the stack, not in registers, in x86_64,
like it was in the 32-bit version
(and possibly return the function result the same way)?
I know this breaks the official ABI, and both the caller side and the callee side
must be compiled this way for it to work. I don't care whether the push/pop or mov/sub
approach is used. I expect there should be a compiler flag that could enforce it,
but I couldn't find one.
It seems you can't without hacking GCC's source code.
There is no standard x86-64 calling convention that uses inefficient stack args.
GCC only knows how to use standard calling conventions, in this case x86-64 System V and MS Windows fastcall and vectorcall (selectable per function with __attribute__((ms_abi)) or __attribute__((vectorcall))). You can use those for some functions (controlled by __attribute__) even when compiling for Linux, MacOS, *BSD, etc., if that helps; MS's calling convention is already friendly enough for wrappers or variadic functions. But normally nobody wants this, and it's hard to imagine a use-case for pure stack args.
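For example, a minimal sketch (function names invented for illustration) of opting one function into the MS x64 convention while targeting x86-64 System V:

__attribute__((ms_abi)) int add3(int a, int b, int c);  /* args in RCX, RDX, R8 */

int call_it(void)
{
    return add3(1, 2, 3);  /* the caller uses MS x64 rules just for this call */
}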
GCC lets you specify a register as fixed (never touched by GCC, like -ffixed-rdi), call-clobbered, or call-preserved. But using those with arg-passing registers just creates wrong code, not what you want.
e.g.
int foo(int a, int b, int c);

void caller(int x) {
    foo(1, 2, 3);
    //foo(4, x, 6);
}
compiled by gcc 9.2 with -O3 -fcall-saved-rdi:
caller:
    push rdi
    mov  edx, 3
    mov  esi, 2
    pop  rdi
    jmp  foo
It saves/restores RDI but doesn't put a 1 in it before calling foo.
And it doesn't leave RDI out of the arg-passing sequence and shift the other args along, either. (I was thinking you might be able to invent a calling convention where all the arg-passing registers were fixed or call-saved, maybe getting GCC to fall back to stack args. But nope.)
Related
My code is written in C++, and compiled with gcc version 4.7.2.
It's linked with a 3rd-party library, which is written in C and compiled with gcc 4.5.2.
My code calls a function initStuff(). During debugging I found out that the value of the R15 register before the call to initStuff() is not the same as the value upon return from that function.
As a quick hack I did:
asm(" mov %%r15, %0" : "=r" ( saveR15 ) );
initStuff();
asm(" mov %0, %%r15;" : : "r" (saveR15) );
which seems to work for now.
Who is to blame here? How can I find out whether it's a compiler issue or maybe a compatibility issue?
gcc on x86-64 follows the System V ABI, which defines r15 as a callee-saved register; any function which uses this register is supposed to save and restore it.
So if this third-party function is not doing so, it is failing to conform to the ABI, and unless this is documented, it is to blame. AFAIK this part of the ABI has been stable forever, so if compiler-generated code (with default options) is failing to save and restore r15, that would be a compiler bug. More likely some part of the third-party code uses assembly language and is buggy, or conceivably it was built with non-standard compiler options.
You can either dig into it, or as a workaround, write a wrapper around it that saves and restores r15. Your current workaround is not really safe, since the compiler might reorder your asm statements with respect to surrounding code. You should instead put the call to initStuff inside a single asm block with the save-and-restore (declaring it as clobbering all caller-saved registers), or write a "naked" assembly wrapper which does the save/restore and call, and call it instead. (Make sure to preserve stack alignment.)
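A minimal sketch of that safer wrapper, assuming the x86-64 System V ABI (initStuff is the third-party function from the question; everything else here is illustrative):

void initStuff(void);

void initStuff_wrapper(void)
{
    /* On entry RSP is 16-byte aligned minus 8; the push realigns it so
       the call below meets the ABI's alignment rule.  The call clobbers
       the red zone, so don't keep data there (or build with -mno-red-zone). */
    asm volatile(
        "push %%r15\n\t"
        "call initStuff\n\t"
        "pop  %%r15"
        : /* no outputs */
        : /* no inputs */
        : "rax", "rcx", "rdx", "rsi", "rdi",  /* call-clobbered GPRs */
          "r8", "r9", "r10", "r11",
          "memory", "cc");                    /* vector registers are also
                                                 call-clobbered; list them too
                                                 if the caller uses them */
}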
This question already has an answer here:
Why noreturn/__builtin_unreachable prevents tail call optimization
I looked at
//#define _Noreturn //<= uncomment to see jmp's change to call's
_Noreturn void ext(int, char const *);

__attribute__((noinline))
_Noreturn void intern(int X)
{
    ext(X, "X");  //jmp unless ext is _Noreturn
}

void do_intern(void)
{
    intern(0);    //jmp unless intern is _Noreturn
}

int do_int_intern(void)
{
    intern(0);    //call in either case,
                  //would've expected a jmp if intern is _Noreturn
    return 42;    //erased if intern is _Noreturn
}
in
https://gcc.godbolt.org/z/MoT-LK and I noticed gcc, clang, and icc all generate real calls (call in x86_64) for calls to a _Noreturn function, even though without the _Noreturn the call would have been a direct jump (jmp in x86_64).
Why is this?
For your examples, when compiled for x86_64, a JMP instruction couldn't simply be substituted for a CALL instruction because the stack wouldn't be properly aligned. The callee expects the stack to be 16-byte aligned plus 8 bytes for the pushed return address.
Your example code, when compiled with -Os, generates this assembly on all the compilers in your godbolt link:
do_intern:
    push rax        ; ICC uses RSI here instead
    xor  edi, edi
    call intern
To change the CALL into a JMP, the compiler would have to either add an additional instruction to push a fake return address onto the stack or undo the stack alignment done at the top of the function:
do_intern:
    push rax
    xor  edi, edi
    push rax        ; or pop rax
    jmp  intern
Now if the compiler were really smart, it could realize that there's really no need to align the stack at the start of the function, so there's nothing to undo and no fake return address to push:
do_intern:
    xor edi, edi
    jmp intern
But the compilers aren't smart enough to do this. Probably because it won't work in the general case where the function allocates variables on the stack and because there's rarely anything to be gained by improving the performance of calling functions that can never return.
In the tail-call case without _Noreturn, the CALL instruction could potentially be executed multiple times during the execution of the program, so it can be worthwhile to treat it as a special case. With _Noreturn, the CALL instruction can only be executed once during the execution of the program (unless intern ends up making a recursive call to do_intern). Despite the similarity of the two cases, it would require new code to recognize the _Noreturn special case.
Note that for GCC at least, it turns out recognizing the _Noreturn special case is a lot harder than I initially thought, as explained by Richard Henderson on the GCC bugs mailing list:
This is the fault of my noreturn patch -- sibcalls to noreturn
functions no longer got an edge to EXIT, which meant that the code
intended to insert sibcall_epilogue patterns didn't.
It's a quandary: we need the more accurate CFG for the "control reaches
end of non-void function" test. But then if we brute force search for
sibcalls to insert the sibcall_epilogue patterns, we'll wind up with
flow2 warning about wanting to delete dead epilogue code (the loads
for the call-saved registers are dead because we don't return).
I think the best way to solve this is to not create sibcalls to
noreturn functions. [...]
The term "sibcalls" refer to "sibling" calls, where a tail-call can use a jump instruction instead of a call instruction.
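A minimal illustration of a sibling call (hypothetical functions; compile with optimization enabled):

int foo(int x);

int bar(int x)
{
    return foo(x + 1);  /* nothing left to do after foo returns, so the
                           compiler can emit jmp foo instead of call foo + ret */
}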
This is the reason why, at least initially, the optimization you're expecting was never (correctly) implemented in GCC. The argument that it's usually better to have accurate backtraces when non-returning functions like abort are called seems to be a post hoc justification, though one that has probably contributed significantly to the optimization never being implemented in GCC, as well as in clang and ICC.
I saw the code of switch_to in the article "Evolution of the x86 context switch in Linux" at https://www.maizure.org/projects/evolution_x86_context_switch_linux/
Most versions of switch_to only save/restore ESP/RSP and/or EBP/RBP, not other call-preserved registers in the inline asm. But the Linux 2.2.0 version does save them in this function, because it uses software context switching instead of relying on hardware TSS stuff. Later Linux versions still do software context switching, but don't have these push / pop instructions.
Are the registers saved in another function (maybe in the schedule() function)? Or is there no need to save these registers in the kernel context?
(I know that the user context's registers are saved on the kernel stack when the system enters kernel mode.)
Linux versions before 2.2.0 use hardware task switching, where the TSS saves/restores registers for you. That's what the "ljmp %0\n\t" is doing. (ljmp is AT&T syntax for a far jmp, presumably to a task gate). I'm not really familiar with hardware TSS stuff because it's not very relevant; it's still used in modern kernels for getting RSP pointing to the kernel stack for interrupt handlers, but not for context switching between tasks.
Hardware task switching is slow, so later kernels avoid it. Linux 2.2 does save/restore the call-preserved registers manually, with push/pop before/after swapping stacks. EAX, EDX, and ECX are declared as dummy outputs ("=a" (eax), "=d" (edx), "=c" (ecx)) so the compiler knows that the old values of those registers are no longer available.
This is a sensible choice because switch_to is probably used inside a non-inline function. The caller will make a function call that eventually returns (after running another task for a while) with the call-preserved registers restored, and the call-clobbered registers clobbered, just like a regular function call. (So compiler code-gen for the function that uses the switch_to macro doesn't need to emit save/restore code outside of the inline asm). If you think about writing a whole context switch function in asm (not inline asm), you'd get this clobbering of volatile registers for free because callers expect that.
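As a toy illustration of that dummy-output idiom (using cpuid as a stand-in; this is not the switch_to body):

unsigned int query_max_cpuid_leaf(void)
{
    unsigned int eax = 0, ebx, ecx, edx;
    /* The "+a"/"=b"/"=c"/"=d" operands tell the compiler those registers
       were overwritten by the asm, with no push/pop needed inside it. */
    asm volatile("cpuid"
                 : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));
    return eax;
}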
So how do later kernels avoid saving/restoring those registers in inline asm?
Linux 2.4 uses "=b" (last) as an output operand, so the compiler has to save/restore EBX in a function that uses this asm. The asm still saves/restores ESI, EDI, and EBP (as well as ESP). The text of the article notes this:
The 2.4 kernel context switch brings a few minor changes: EBX is no longer pushed/popped, but it is now included in the output of the inline assembly. We have a new input argument.
I don't see where they tell the compiler about EAX, ECX, and EDX not surviving, so that's odd. It might be a bug that they get away with by making the function noinline or something?
Linux 2.6 on i386 uses more output operands that get the compiler to handle the save/restore.
But Linux 2.6 for x86-64 introduces the trick that hands off the save/restore to the compiler easily:
#define __EXTRA_CLOBBER ,"rcx","rbx","rdx","r8","r9","r10","r11","r12","r13","r14","r15"
Notice the clobbers declaration: : "memory", "cc" __EXTRA_CLOBBER
This tells the compiler that the inline asm destroys all those registers, so the compiler will emit instructions to save/restore these registers at the start/end of whatever function switch_to ultimately inlines into.
Telling the compiler that all the registers are destroyed after a context switch solves the same problem as manually saving/restoring them with inline asm. The compiler will still make a function that obeys the calling convention.
The context switch swaps to the new task's stack, so the compiler-generated save/restore code is always running with the appropriate stack pointer. Notice that the explicit push/pop instructions inside the inline asm in Linux 2.2 and 2.4 come before / after everything else.
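As a rough sketch of the clobber technique (made-up names; the asm body is schematic and not a working context switch by itself):

static inline void switch_stacks_sketch(unsigned long *old_sp, unsigned long new_sp)
{
    asm volatile(
        "mov %%rsp, (%0)\n\t"   /* save the old task's stack pointer */
        "mov %1, %%rsp"         /* switch to the new task's stack    */
        : /* no outputs */
        : "r" (old_sp), "r" (new_sp)
        /* same idea as __EXTRA_CLOBBER: declare the registers destroyed
           so the compiler emits any needed save/restore around the asm */
        : "rcx", "rbx", "rdx", "r8", "r9", "r10", "r11",
          "r12", "r13", "r14", "r15", "memory", "cc");
}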
Why is it that 32-bit C pushes all function arguments straight onto the stack while 64-bit C puts the first 6 arguments into registers and the rest on the stack?
So the 32-bit stack would look like:
...
arg2
arg1
return address
old %ebp
While the 64-bit stack would look like:
...
arg8
arg7
return address
old %rbp
arg6
arg5
arg4
arg3
arg2
arg1
So why does 64-bit C do this? Isn't it much easier to just push everything onto the stack instead of putting the first 6 arguments in registers just to move them onto the stack in the function prologue?
instead of putting the first 6 arguments in registers just to move them onto the stack in the function prologue?
I was looking at some code that gcc generated and that's what it always did.
Then you forgot to enable optimization. gcc -O0 spills everything to memory so you can modify them with a debugger while single-stepping. That's obviously horrible for performance, so compilers don't do that unless you force them to by compiling with -O0.
x86-64 System V allows
int add(int x, int y) { return x+y; }
to compile to
add:
    lea eax, [rdi + rsi]
    ret
which is what compilers actually do, as you can see on the Godbolt compiler explorer.
Stack-args calling conventions are slow and obsolete. RISC machines have been using register-args calling conventions since before x86-64 existed, and on OSes that still care about 32-bit x86 (i.e. Windows), there are better calling conventions like __vectorcall that pass the first 2 integer args in registers.
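For instance, a related register-args convention that GCC spells as a function attribute on 32-bit x86 (an illustrative declaration using fastcall, not __vectorcall itself):

__attribute__((fastcall)) int add2(int x, int y);  /* x in ECX, y in EDX */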
i386 System V hasn't been replaced because people mostly don't care as much about 32-bit performance on other OSes; we just use 64-bit code with the nicely-designed x86-64 System V calling convention.
For more about the tradeoff between register args and call-preserved vs. call-clobbered registers in calling convention design, see Why not store function parameters in XMM vector registers?, and also Why does Windows64 use a different calling convention from all other OSes on x86-64?.
If I compile an empty C function
void nothing(void)
{
}
using gcc -O2 -S (and clang) on MacOS, it generates:
_nothing:
    pushq %rbp
    movq  %rsp, %rbp
    popq  %rbp
    ret
Why does gcc not remove everything but the ret? It seems like an easy optimisation to make unless it really does something (seems not to, to me). This pattern (push/move at the beginning, pop at the end) is also visible in other non-empty functions where rbp is otherwise unused.
On Linux using a more recent gcc (4.4.5) I see just:
nothing:
    rep
    ret
Why the rep? The rep is absent in non-empty functions.
Why the rep?
The reasons are explained in this blog post. In short, jumping directly to a single-byte ret instruction would mess up the branch prediction on some AMD processors. And rather than adding a nop before the ret, a meaningless prefix byte was added to save instruction decoding bandwidth.
The rep is absent in non-empty functions.
To quote from the blog post I linked to: "[rep ret] is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...)".
In the case of an empty function, the ret would have been the direct target of a call. In a non-empty function, it wouldn't be.
Why does gcc not remove everything but the ret?
It's possible that some compilers won't omit frame pointer code even if you've specified -O2. At least with gcc, you can explicitly tell the compiler to omit them by using the -fomit-frame-pointer option.
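For example (a hypothetical invocation; many newer gcc versions already enable this at -O2):

gcc -O2 -fomit-frame-pointer -S nothing.c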
As explained here: http://support.amd.com/us/Processor_TechDocs/25112.PDF, a two-byte near-return instruction (i.e. rep ret) is used because a single-byte return can be mispredicted on some amd64 processors in some situations, such as this one.
If you fiddle around with the processor targeted by gcc you may find that you can get it to generate a single-byte ret. -mtune=nocona worked for me.
I suspect, as johnfound says, that your last code (the lone rep) is a bug. The first code is because a C compiler must always follow the _cdecl calling convention, which for a function means (in Intel syntax; sorry, I don't know the AT&T syntax):
Function definition:
_functionA:
    push rbp
    mov  rbp, rsp
    ;Some function
    pop  rbp
    ret
In the caller:
    call _functionA
    add  esp, 0     ; cdecl caller cleanup; if the size is zero, some compilers can strip it
Why does GCC always follow the _cdecl calling convention even when following it is pointless? Because the compiler isn't smarter than an advanced assembly programmer, so it always follows _cdecl, at all costs.
That is because even so-called "optimizing compilers" are too dumb to always generate good machine code.
They can't generate better code than their creators made them generate.
Since an empty function is nonsense, they probably simply didn't bother to optimize it, or even to detect this very special case.
The single "rep" prefix, though, is probably a bug. It does nothing when used without a string instruction, but on some newer CPUs it could theoretically cause an exception (and IMHO it should).