Why is an empty function not just a return - gcc

If I compile an empty C function
void nothing(void)
{
}
using gcc -O2 -S (and clang) on macOS, it generates:
_nothing:
pushq %rbp
movq %rsp, %rbp
popq %rbp
ret
Why does gcc not remove everything but the ret? It seems like an easy optimisation to make unless it really does something (seems not to, to me). This pattern (push/move at the beginning, pop at the end) is also visible in other non-empty functions where rbp is otherwise unused.
On Linux using a more recent gcc (4.4.5) I see just
nothing:
rep
ret
Why the rep ? The rep is absent in non-empty functions.

Why the rep ?
The reasons are explained in this blog post. In short, jumping directly to a single-byte ret instruction would mess up the branch prediction on some AMD processors. And rather than adding a nop before the ret, a meaningless prefix byte was added to save instruction decoding bandwidth.
The rep is absent in non-empty functions.
To quote from the blog post I linked to: "[rep ret] is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...)".
In the case of an empty function, the ret would have been the direct target of a call. In a non-empty function, it wouldn't be.
Why does gcc not remove everything but the ret?
It's possible that some compilers won't omit frame pointer code even if you've specified -O2. At least with gcc, you can explicitly tell the compiler to omit them by using the -fomit-frame-pointer option.
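To check this on your own toolchain, a minimal sketch like the one below can be compiled twice and the two listings compared. The file name frame.c and the function add_one are made up for illustration, and what you actually see depends on the compiler version and the platform's defaults.

```c
/* frame.c: hypothetical file name, used only for this experiment.
 *
 * Compare the generated assembly with and without frame pointers:
 *   gcc -O2 -S frame.c -o default.s
 *   gcc -O2 -fomit-frame-pointer -S frame.c -o omit_fp.s
 */

/* The empty function from the question. */
void nothing(void)
{
}

/* A small leaf function that never needs rbp for its own work. */
int add_one(int x)
{
    return x + 1;
}
```

If the push/mov/pop sequence disappears in the second listing, the prologue was frame-pointer setup that the platform keeps by default rather than a missed optimisation.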

As explained here: http://support.amd.com/us/Processor_TechDocs/25112.PDF, a two-byte near-return instruction (i.e. rep ret) is used because a single-byte return can be mispredicted on some amd64 processors in some situations such as this one.
If you fiddle around with the processor targeted by gcc you may find that you can get it to generate a single-byte ret. -mtune=nocona worked for me.

I suspect, as johnfound says, that your last code is a bug. The first code is generated because every C compiler must always follow the _cdecl calling convention, which for a function means the following (in Intel syntax; sorry, I don't know the AT&T syntax):
Function Definition
_functionA:
push rbp
mov rbp, rsp
;Some function
pop rbp
ret
In caller :
call _functionA
sub esp, 0 ; if this is zero, some compilers may strip it
Why does GCC always follow the _cdecl calling convention, even when doing so is pointless? Because the compiler isn't smarter than an advanced assembly programmer, so it follows _cdecl at all costs.

That is because even so-called "optimizing compilers" are too dumb to always generate good machine code.
They can't generate better code than their creators made them generate.
Since an empty function is nonsense, they probably simply didn't bother to optimize it, or even to detect this very special case.
A lone "rep" prefix, though, is probably a bug. It does nothing when used without a string instruction, but on some newer CPUs it could theoretically cause an exception (and IMHO should).

Related

What is the difference between __i686.get_pc_thunk and __x86.get_pc_thunk?

These helper functions are used by GCC and Clang in 32-bit x86 position-independent code to get the current execution address into a register, for example:
call __i686.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
movl $2, 4(%esp)
leal .LC0@GOTOFF(%ebx), %eax
movl %eax, (%esp)
call dlopen@PLT
It seems the implementations are equivalent:
__x86.get_pc_thunk.bx:
movl (%esp), %ebx
ret
__i686.get_pc_thunk.bx:
movl (%esp), %ebx
ret
Is there any difference besides the name change (seems i686 is older)? And is there a reason for the i686 prefix instead of i386?
So, after some digging in commit history and bug trackers I think I mostly figured it out.
A long time ago, glibc had its own handling of PIC code, which used a call/pop pattern to get the GOT address.
Around 2002, __i686.get_pc_thunk.*, which did a similar task, was added to GCC, initially as an internal symbol.
Shortly afterwards it ended up in glibc too, probably to avoid code duplication when being compiled with GCC.
However, when built for Pentium 2 or later (-march=i686), GCC defined the preprocessor macro __i686=1, which broke glibc's compilation of the stub code. The problem was discovered quite early, but for several years glibc used various workarounds to handle it.
In 2011 (GCC 4.7?) the name was changed to __x86.get_pc_thunk.* and glibc added some checks to use a matching name. Eventually support for old GCC versions was dropped together with the old name. Both GCC and glibc only use __x86.get_pc_thunk.* now (although GCC can also generate the inline call/pop version).
So, in summary:
There is no actual difference between the two, the name change is simply historical due to a predefined macro collision.
References:
https://gcc.gnu.org/git/?p=gcc.git&a=search&h=HEAD&st=commit&s=get_pc_thunk
https://sourceware.org/git/?p=glibc.git&a=search&st=commit&s=get_pc_thunk
https://sourceware.org/bugzilla/show_bug.cgi?id=411
https://sourceware.org/bugzilla/show_bug.cgi?id=4507
Just a different name choice, not significant AFAIK.
i686 is the standard name for 32-bit code using PPro new instructions like CMOV and FCOMI, and 586 CMPXCHG8B and CPUID. Modern GNU/Linux distros typically configure gcc to use that as the default target for -m32 32-bit code, instead of truly baseline i386. e.g. gcc -v will show i686-linux-gnu for a 32-bit build of GCC.
Usually clang uses call next_insn / pop reg to read EIP into a register. (Fun fact: that actually doesn't break return-address prediction on CPUs other than original Pentium-Pro or Via Nano3000: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/#call0 - CPUs special case call rel32=0 as not being a real call and don't put the return address into the predictor stack.)
The get_pc_thunk.bx includes the name of the register to return in. 32-bit PIC code used to only ever use EBX as the GOT pointer register, but GCC can now pick any convenient register and emit a thunk function for it, like ....get_pc_thunk.ax so leaf functions don't have to save/restore EBX.
PIE does make executables slower, by maybe 15% for 32-bit code vs. a couple percent for 64-bit code. x86-64 has RIP-relative addressing which avoids the need for these thunks. IMO a 32-bit PIE isn't worth the price, unless you really need more hardening against ROP and Spectre attacks by having ASLR of the main executable.
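To see one of these thunks in practice, a small sketch like the following can be built as 32-bit position-independent code. The file name pic.c and the names counter and bump are made up for illustration. Which thunk name and register show up (or whether an inline call/pop is used instead) depends on the compiler and its version.

```c
/* pic.c: hypothetical file name, used only for this experiment.
 *
 * Build as 32-bit PIC and search the listing for get_pc_thunk:
 *   gcc -m32 -O2 -fPIC -S pic.c
 * (needs a GCC installation with 32-bit support)
 */

/* In PIC code a global is reached through the GOT, so the function
 * first has to find out where it is running. */
int counter;

int bump(void)
{
    counter += 1;
    return counter;
}
```

On x86-64 the same source should need no thunk at all, since the global can be addressed RIP-relative.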

Why do calls to _Noreturn generate real calls instead of jumps? [duplicate]

This question already has an answer here:
Why noreturn/__builtin_unreachable prevents tail call optimization
I looked at
//#define _Noreturn //<= uncomment to see jmp's change to call's
_Noreturn void ext(int,char const*);
__attribute((noinline))
_Noreturn void intern(int X)
{
ext(X,"X"); //jmp unless ext is _Noreturn
}
void do_intern(void)
{
intern(0); //jmp unless intern is _Noreturn
}
int do_int_intern(void)
{
intern(0); //call in either case,
//would've expected a jmp if intern is _Noreturn
return 42; //erased if intern is _Noreturn
}
in
https://gcc.godbolt.org/z/MoT-LK and I noticed all gcc, clang, and icc generate real calls (call in x86_64) for calls to a _Noreturn function even though without the _Noreturn the call would have been a direct jump (jmp in x86_64).
Why is this?
For your examples, when compiled for x86_64, a JMP instruction couldn't be substituted for a CALL instruction, as the stack wouldn't be properly aligned. The callee expects the stack to be 16-byte aligned plus 8 for the return address.
Your example code, when compiled with -Os, generates this assembly on all the compilers in your godbolt link:
do_intern:
push rax ; ICC uses RSI here instead
xor edi, edi
call intern
To change the CALL into a JMP, it would have to either add an additional instruction to push a fake return address on the stack or undo the stack alignment made at the top of the function:
do_intern:
push rax
xor edi, edi
push rax ; or pop rax
jmp intern
Now if the compiler were really smart, it could realize that there really isn't any need to align the stack at the start of the function, so there is no need to undo it or push a fake return address:
do_intern:
xor edi, edi
jmp intern
But the compilers aren't smart enough to do this. Probably because it won't work in the general case where the function allocates variables on the stack and because there's rarely anything to be gained by improving the performance of calling functions that can never return.
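The alignment argument above is just arithmetic on rsp modulo 16, and it can be sketched as a small model. This is only an illustration, not anything a compiler runs. The helper names sim_push, sim_pop and entry_aligned are invented here, and the rule modelled is the x86-64 System V one: at function entry rsp % 16 == 8.

```c
#include <stdint.h>

/* Each push, pop, or return address moves rsp by one 8-byte slot. */
enum { SLOT = 8, ALIGNMENT = 16 };

/* Effect of "push reg", or of CALL pushing its return address. */
uint64_t sim_push(uint64_t rsp) { return rsp - SLOT; }

/* Effect of "pop reg". */
uint64_t sim_pop(uint64_t rsp) { return rsp + SLOT; }

/* Nonzero if rsp is what a callee expects to see on entry:
 * 16-byte aligned before the CALL, so 8 past a boundary after it. */
int entry_aligned(uint64_t rsp) { return rsp % ALIGNMENT == SLOT; }
```

Starting from an rsp that satisfies entry_aligned, one sim_push models the `push rax` at the top of do_intern. A second sim_push (the CALL, or the fake return address) or a sim_pop gets back to an acceptable value, while jumping straight after the single push does not.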
In the tail-call case without _Noreturn, the CALL instruction could potentially be executed many times during the execution of the program, so it can be worthwhile treating it as a special case. With _Noreturn the CALL instruction can only be executed once during the execution of the program (unless intern ends up making a recursive call to do_intern). Despite the similarity of the two cases, it would require new code to recognize the _Noreturn special case.
Note that, for GCC at least, it turns out that recognizing the _Noreturn special case is a lot harder than I initially thought, as explained by Richard Henderson on the GCC bugs mailing list:
This is the fault of my noreturn patch -- sibcalls to noreturn
functions no longer got an edge to EXIT, which meant that the code
intended to insert sibcall_epilogue patterns didn't.
It's a quandry: we need the more accurate CFG for the "control reaches
end of non-void function" test. But then if we brute force search for
sibcalls to insert the sibcall_epilogue patterns, we'll wind up with
flow2 warning about wanting to delete dead epilogue code (the loads
for the call-saved registers are dead becase we don't return).
I think the best way to solve this is to not create sibcalls to
noreturn functions. [...]
The term "sibcalls" refers to "sibling" calls, where a tail-call can use a jump instruction instead of a call instruction.
This is the reason why, at least initially, the optimization you're expecting was never (correctly) implemented in GCC. The fact that in most cases it would be better to have more accurate backtraces when non-returning functions like abort are called, seems to be a post hoc justification, though one that's probably significantly contributed to the optimization never being implemented in GCC, as well as clang and ICC.

Change GCC's output for 0-set (clear) operation

GCC often produces the following x86 assembly to set the value in eax to 0 before returning to the caller:
xor eax, eax
For the purposes of can I do it?, as opposed to should I do it?, is it possible to change GCC's behaviour to produce "equivalent" assembly? Also, where in libgcc does this generation occur?
To clarify, I'm not looking for guidance on which assembly instructions would be appropriate to use, I'm wondering how it is possible to change GCC's output behaviour.
You mean like sub eax,eax which is specially recognized as a zeroing idiom on only some CPUs, not all?
The optimization is done by -fpeephole2, which is enabled as part of -O2 or -Os; using -fno-peephole2 would give you mov eax,0 for materializing a 0 in a register. (As well as creating other missed optimizations, I assume! xor-zeroing probably isn't the only peephole gcc looks for.)
I don't know where to look in the gcc source code but knowing the option might help track it down.
It's not in "libgcc" though, that's helper functions like 64-bit multiply on a 32-bit machine. (When gcc emits calls to funny-named helper functions like __udivdi3, it's expecting the asm output to be linked against libgcc).
More like you'd find it in the x86 machine-definition files, one of the .md files in the gcc source tree. Otherwise hard-coded into a C optimization function. Like "xor %1, %0" might be something to search on, or more likely it'll have {... | ...} dialect-alternatives. But searching on the xor mnemonic might still help.
This is a half-assed partial answer. Please post a specific answer or at least leave a comment if you know where to look.
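A quick way to try the option mentioned above is a one-function file compiled both ways. The file name zero.c is made up, and the expected instructions in the comments are what this answer claims, not something guaranteed for every GCC version.

```c
/* zero.c: hypothetical file name, used only for this experiment.
 *
 *   gcc -O2 -S -masm=intel zero.c                  (xor eax, eax expected)
 *   gcc -O2 -fno-peephole2 -S -masm=intel zero.c   (mov eax, 0 per this answer)
 */
int zero(void)
{
    return 0;
}
```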

what is GCC compiler option to get Segment Override prefix in x86

I have a memory layout (in increasing address order) like this:
Code Section (0-4k), Data Section (4k-8k), Stack Section (8k-12k), CustomData Section (12k-16k).
I have put some special arrays and structs in the CustomData Section.
As far as I know, the Data Segment (#DS) selector is used for any data-related code the compiler generates.
So the Data Section (4k-8k) will use #DS by default for all operations, except some string operations where ES may be used. For example:
mov $0xc00,%eax
addl $0xd, (%eax)
But I want to use the Extra Segment (#ES) selector for CustomData access. I would define a new GDT entry for ES with a different base and limit, like:
mov $0x3400,%eax
addl $0xd, %es:(%eax)
So my question is:
Does GCC have any x86 compiler flag that tells the compiler to use #ES for CustomData Section accesses?
That is, a compiler flag that makes it generate code using #ES for the CustomData Section?
Thanks in advance !!
While the question is asking for an option to make gcc's code-gen use an es prefix for accessing a custom section, IF you wanted to do it in hand-written code, the AT&T syntax already allows e.g. %es:(%eax).
Note, this may break rep-string instructions which gcc sometimes inlines; fs or gs would be the only sane choices, and are still usable even in x86-64.
(Making this comment community wiki based on useful critique by Peter Cordes.)
Quoting the example from the clang language-extension docs
#define GS_RELATIVE __attribute__((address_space(256)))
int foo(int GS_RELATIVE *P) {
return *P;
}
Which compiles to (on X86-32):
_foo:
movl 4(%esp), %eax # load the arg
movl %gs:(%eax), %eax # use it with its prefix
ret
address-space 256 is gs, 257 is fs, and 258 is ss.
The docs don't mention es; compilers normally assume that es=ds so they can freely inline rep movs or rep stos for memcpy / memset if they choose to do so based on tuning options. Library implementations of memcpy or memset may also use rep stos/movs on some CPUs. Related: Why is std::fill(0) slower than std::fill(1)?
Obviously this is really low-level stuff, and only makes sense if you've already set the GS or FS base address. (wrfsbase). Beware that i386 Linux uses gs for thread-local storage in user-space, while x86-64 Linux uses fs.
I'm not aware of extensions like this for gcc, ICC, or MSVC.
Well, there is __thread in GNU C, which will use %gs: or %fs: prefixes as appropriate for the target platform. How does the gcc `__thread` work?.
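As a rough illustration of the __thread route, the sketch below declares a thread-local counter. The file name tls.c and the names tls_counter and bump_tls are made up, and which segment prefix appears in the listing depends on the target, as described above.

```c
/* tls.c: hypothetical file name, used only for this experiment.
 *
 *   gcc -O2 -S tls.c         (look for %fs: on x86-64 Linux)
 *   gcc -m32 -O2 -S tls.c    (look for %gs: on i386 Linux)
 */

/* GNU C thread-local storage: every thread gets its own copy. */
__thread int tls_counter;

int bump_tls(void)
{
    tls_counter += 1;
    return tls_counter;
}
```

This only borrows the segment register the threading library has already set up; it does not let you pick es or define your own segment base.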

GCC with the -fomit-frame-pointer option

I'm using GCC with the -fomit-frame-pointer and -O2 options. When I looked through the assembly code it generated,
push %ebp
movl %esp, %ebp
at the start and pop %ebp at the end are removed. But some redundant subl/addl instructions on esp are left in: subl $12, %esp at the start and addl $12, %esp at the end.
How can I remove them? Some inline assembly will jmp to another function before the addl is executed.
You probably don't want to remove those -- that's usually the code that allocates and deallocates your local variables. If you remove those, your code will trample all over the return addresses and such.
The only safe way to get rid of them is not to use any local variables. Even in macros. And be really careful about inline functions, as they often have their own locals that'll get put in with yours. You may want to consider explicitly disabling function inlining for that section of code, if you can.
If you're absolutely sure that the adds and subs aren't needed (and I mean really, really sure): on my machine GCC apparently does some stack manipulation to keep the stack aligned at 16-byte boundaries. You may be able to say "-mpreferred-stack-boundary=2", which will align to 4-byte boundaries -- which x86 processors like to do anyway, so no code is generated to realign it. Works on my box with my GCC; int main() { return 0; } turned into
main:
xorl %eax, %eax
ret
but the alignment code looked different to start with...so that may not be the problem for you.
Just so you're warned: optimization causes a lot of weird stuff like that to happen. Be careful with hand-coded assembler language and optimized <insert-almost-any-language-here> code, especially when you're doing something as unstructured as a jump from the middle of one function into another.
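To illustrate the point about locals, here is a minimal sketch of a function whose local array has to live on the stack because its address is handed to another function. The names sum3 and uses_stack are invented for this example, and whether you see a subl/addl pair, and of what size, depends on the compiler, target and options.

```c
/* Compile with, for example:
 *   gcc -m32 -O2 -fomit-frame-pointer -S locals.c
 * (locals.c is a hypothetical file name)
 */

/* Kept out of line so the caller really has to pass a pointer. */
__attribute__((noinline))
int sum3(const int *v)
{
    return v[0] + v[1] + v[2];
}

int uses_stack(int a)
{
    /* This array needs real memory, which comes from adjusting esp. */
    int local[3] = { a, a + 1, a + 2 };
    return sum3(local);
}
```

Jumping out of uses_stack before the matching add would leave esp pointing into that array's space, which is why removing the pair by hand is unsafe.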
I solved the problem by giving a function prototype, then defining it manually like this:
void my_function();
asm (
".globl _my_function\n"
"_my_function:\n\t"
/* Assembler instructions go here */
);
Later I also wanted the function to be exported, so I added this at the end of the source file:
asm (
".section .drectve\n\t"
".ascii \" -export:my_function\"\n"
);
How will I be able to remove them as some inline assembly will jmp to another function before addl is executed.
This will corrupt your stack; the caller expects the stack pointer
to be corrected on function return. Does the other function return
with a ret instruction? What exactly are you trying to achieve? Maybe there's another solution possible.
Please show us the lines around the function call (in the caller) and the
entry/exit part of the function in question.

Resources