Force GCC to push arguments on the stack before calling function (using PUSH instruction) - gcc

I have started developing a small 16-bit OS under GCC/G++.
I am using a GCC cross-compiler that I compiled under Cygwin. I put asm(".code16gcc\n") as the first line of each .CPP file and use Intel ASM syntax. The command lines for compiling and linking a .CPP file look like this:
G++: i586-elf-g++ -c $(CPP_FILE) -o $(OBJECT_OUTPUT) -nostdinc -ffreestanding -nostdlib -fno-builtin -fno-rtti -fno-exceptions -fpermissive -masm=intel
LD: i586-elf-ld -T $(LD_SCRIPT) $(OBJECT_OUTPUT) -o $(BINARY_OUTPUT)
The problem I am currently facing is the way GCC translates function-calling code into assembly.
To be more specific, instead of using the PUSH instruction to pass the arguments, GCC "calculates" the offsets relative to ESP at which the arguments should be located, and then uses MOV instructions to write the stack manually.
This does not work for me, since I rely on the PUSH instruction in my assembly code. To illustrate the problem more clearly, take these 2 functions:
void f2(int x);

void f1() {
    int arg = 8;
    asm("mov eax, 5");  // note: super hacky unsafe use of GNU C inline asm
    asm("push eax");    // Writing registers without declaring a clobber is UB
    f2(arg);
    asm("pop eax");
}

void f2(int x) {
}
In function f1, I am saving EAX using the PUSH instruction, and I would expect it to be restored to 5 after calling f2 and executing the "POP EAX" instruction. It turns out, however, that EAX becomes 8, not 5. That's because the ASSEMBLY CODE GCC generates looks like this (I've included the source as well for clarity):
void f1()
C++: {
push ebp
mov ebp,esp
sub esp,byte +0x14
C++: int arg = 8;
mov dword [ebp-0x4],0x8
C++: asm("mov eax, 5");
mov eax,0x5
C++: asm("push eax");
push eax
C++: f2(arg);
mov eax,[ebp-0x4]
mov [dword esp],eax =======>>>>>> HERE'S THE PROBLEM, WHY NOT 'PUSH EAX' ?!!
call f2
C++: asm("pop eax");
pop eax
C++: }
o32 leave
o32 ret
void f2(int x)
C++: {
push ebp
mov ebp,esp
C++: }
pop ebp
o32 ret
I have tried some G++ compilation flags such as -mpush-args and -mno-push-args (and another one which I can't remember), but GCC still doesn't want to use PUSH. The version I'm using is i586-elf-g++ (GCC) 4.7.2 (cross-compiler recompiled in Cygwin).
Thank you in advance!
UPDATE: Here's a webpage I've found: http://fixunix.com/linux/6799-gcc-function-call-pass-arguments-via-push.html
That just seems really stupid for GCC to do, considering that it limits the usability of inline assembly for complex stuff. :( Please leave an answer if you have a suggestion.

Finding a solution to this problem took a lot of luck, but it finally does what I want it to do.
Here's what the GCC manual for version 4.7.2 states:
-mpush-args
-mno-push-args
Use PUSH operations to store outgoing parameters. This method is shorter
and usually equally fast as method using SUB/MOV operations and is enabled
by default. In some cases disabling it may improve performance because of
improved scheduling and reduced dependencies.
-maccumulate-outgoing-args
If enabled, the maximum amount of space required for outgoing arguments will
be computed in the function prologue. This is faster on most modern CPUs
because of reduced dependencies, improved scheduling and reduced stack usage
when preferred stack boundary is not equal to 2. The drawback is a notable
increase in code size. This switch implies ‘-mno-push-args’.
I say I was lucky because -mpush-args does not work; what works instead is "-mno-accumulate-outgoing-args", which is not even documented!

I had a similar question lately and I guess people didn't find it important. I found an undocumented option, at least for GCC 4.8.1; I don't know about the latest 4.9 version.
Someone said he gets the "warning: stack probing requires -maccumulate-outgoing-args for correctness [enabled by default]" message.
To disable stack probing, use -mno-stack-arg-probe. So, to be sure, pass these options:
-mpush-args -mno-accumulate-outgoing-args -mno-stack-arg-probe
For me this works now: it uses PUSH, produces much smaller and better code, and is much easier to debug with OllyDbg.

Related

Changing calling convention in gcc/g++ abi

How could I force gcc/g++ to pass arguments to functions only on the stack, not in registers, on x86_64, like it was in the 32-bit version (and possibly return the function result the same way)? I know this breaks the official ABI, and both the caller side and the called side must be compiled this way for it to work. I don't care whether the push/pop or mov/sub approach is used. I expect there should be a compiler flag that could enforce this, but I couldn't find one.
It seems you can't without hacking GCC's source code.
There is no standard x86-64 calling convention that uses inefficient stack args.
GCC only knows how to use standard calling conventions, in this case the x86-64 SysV ABI and the MS Windows x64 conventions, fastcall and vectorcall (selected with e.g. __attribute__((ms_abi)) or vectorcall). Normally nobody wants this; the MS calling convention is already friendly enough for wrappers or variadic functions. You can use it for some functions (controlled by __attribute__) even when compiling for Linux, macOS, *BSD, etc., if that helps. It's hard to imagine a use-case for pure stack args.
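For illustration, here is a minimal sketch of the per-function attribute mentioned above (the function name is made up; this only changes the calling convention of that one function, nothing else):

/* Sketch: opting a single function into the Microsoft x64 convention while
 * the rest of the program uses the SysV ABI. With ms_abi, the first two
 * integer args arrive in rcx and rdx instead of rdi and rsi, and callers
 * reserve 32 bytes of shadow space. */
long __attribute__((ms_abi)) add_ms(long a, long b)
{
    return a + b;
}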
GCC lets you specify a register as fixed (never touched by GCC, like -ffixed-rdi), call-clobbered, or call-preserved. But using those with arg-passing registers just creates wrong code, not what you want.
e.g.
int foo(int a, int b, int c);
void caller(int x) {
    foo(1,2,3);
    //foo(4,x,6);
}
compiled by gcc9.2 -O3 -fcall-saved-rdi
caller:
push rdi
mov edx, 3
mov esi, 2
pop rdi
jmp foo
It saves/restores RDI but doesn't put a 1 in it before calling foo.
And it doesn't leave RDI out of the arg-passing sequence and bump other args later. (I was thinking you might be able to invent a calling convention where all the arg-passing registers were fixed or call-saved, maybe getting GCC to fall back to stack args. But nope.)

gcc, __atomic_exchange seems to produce non-atomic asm, why?

I am working on a nice tool, which requires the atomic swap of two different 64-bit values. On the amd64 architecture it is possible with the XCHGQ instruction (see here in doc, warning: it is a long pdf).
Correspondingly, gcc has some atomic builtins which would ideally do the same, as it is visible for example here.
Using these 2 docs I produced the following simple C function for the atomic swapping of two 64-bit values:
void theExchange(u64* a, u64* b) {
    __atomic_exchange(a, b, b, __ATOMIC_SEQ_CST);
}
(Btw, it wasn't really clear to me why an "atomic exchange" needs 3 operands.)
It seemed a little fishy to me that the gcc __atomic_exchange built-in takes 3 operands, so I tested its asm output. I compiled this with gcc -O6 -masm=intel -S and got the following output:
.LHOTB0:
.p2align 4,,15
.globl theExchange
.type theExchange, #function
theExchange:
.LFB16:
.cfi_startproc
mov rax, QWORD PTR [rsi]
xchg rax, QWORD PTR [rdi] /* WTF? */
mov QWORD PTR [rsi], rax
ret
.cfi_endproc
.LFE16:
.size theExchange, .-theExchange
.section .text.unlikely
As we can see, the resulting function contains not a single data movement but three. Thus, as I understand this asm code, this function won't really be atomic.
How is that possible? Maybe I misunderstood some of the docs? I admit the gcc built-in docs weren't really clear to me.
This is the generic version of __atomic_exchange_n (type *ptr, type val, int memorder), where only the exchange operation on ptr is atomic; the reading of val is not. In the generic version, val is accessed via a pointer, but the atomicity still does not apply to it. The pointer is there so that it works with multiple sizes, including when the compiler has to call an external helper:
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in
function if the specific data type size makes that possible;
otherwise, an external call is left to be resolved at run time. This
external call is the same format with the addition of a ‘size_t’
parameter inserted as the first parameter indicating the size of the
object being pointed to. All objects must be the same size.
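As a rough sketch (an assumption about what the generic form boils down to here, not the exact library code; theExchange_sketch is just an illustrative name), the call in the question behaves like this:

#include <stdint.h>

typedef uint64_t u64;   /* matching the question's type */

/* Roughly what __atomic_exchange(a, b, b, __ATOMIC_SEQ_CST) does in this
 * case: only the exchange on *a is atomic; *b is read and written with
 * plain, non-atomic accesses, which is why three data movements appear. */
void theExchange_sketch(u64 *a, u64 *b)
{
    u64 val = *b;                                              /* plain load  */
    u64 old = __atomic_exchange_n(a, val, __ATOMIC_SEQ_CST);   /* atomic xchg */
    *b = old;                                                  /* plain store */
}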

Why is an empty function not just a return

If I compile an empty C function
void nothing(void)
{
}
using gcc -O2 -S (and clang) on MacOS, it generates:
_nothing:
pushq %rbp
movq %rsp, %rbp
popq %rbp
ret
Why does gcc not remove everything but the ret? It seems like an easy optimisation to make unless it really does something (seems not to, to me). This pattern (push/move at the beginning, pop at the end) is also visible in other non-empty functions where rbp is otherwise unused.
On Linux using a more recent gcc (4.4.5) I see just
nothing:
rep
ret
Why the rep ? The rep is absent in non-empty functions.
Why the rep ?
The reasons are explained in this blog post. In short, jumping directly to a single-byte ret instruction would mess up the branch prediction on some AMD processors. And rather than adding a nop before the ret, a meaningless prefix byte was added to save instruction decoding bandwidth.
The rep is absent in non-empty functions.
To quote from the blog post I linked to: "[rep ret] is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...)".
In the case of an empty function, the ret would have been the direct target of a call. In a non-empty function, it wouldn't be.
Why does gcc not remove everything but the ret?
It's possible that some compilers won't omit frame-pointer code even if you've specified -O2. At least with gcc, you can explicitly tell the compiler to omit it by using the -fomit-frame-pointer option.
As explained here: http://support.amd.com/us/Processor_TechDocs/25112.PDF, a two-byte near-return instruction (i.e. rep ret) is used because a single-byte return can be mispredicted on some amd64 processors in some situations such as this one.
If you fiddle around with the processor targeted by gcc you may find that you can get it to generate a single-byte ret. -mtune=nocona worked for me.
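If you want to check this yourself, here is a minimal experiment (a sketch; the exact output depends on your GCC version, target OS and -mtune setting):

/* Compile with, for example:
 *     gcc -O2 -fomit-frame-pointer -S nothing.c
 * The push/mov/pop frame setup should disappear, leaving just ret
 * (or rep ret, depending on the -mtune target). */
void nothing(void)
{
}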
I suspect, as johnfound says, that your last code is a bug. The first code is there because every C compiler must always follow the _cdecl calling convention, which for a function means (in Intel syntax; sorry, I don't know AT&T syntax):
Function definition:
_functionA:
    push rbp
    mov rbp, rsp
    ; some function body
    pop rbp
    ret
In the caller:
    call _functionA
    sub esp, 0 ; maybe if it is zero, some compilers can strip it
Why does GCC always follow the _cdecl calling convention, even when following it is pointless? Because the compiler isn't smarter than an advanced assembly programmer, so it always follows _cdecl at all costs.
That is because even so-called "optimizing compilers" are too dumb to always generate good machine code.
They can't generate better code than their creators made them generate.
Since an empty function is nonsense anyway, they probably simply didn't bother to optimize it or even to detect this very special case.
The lone "rep" prefix, though, is probably a bug. It does nothing when used without a string instruction, but in some newer CPUs it could theoretically cause an exception (and IMHO it should).

what is GCC compiler option to get Segment Override prefix in x86

I have a memory layout (in increasing memory address order) like this:
Code Section (0-4k), Data Section (4k-8k), Stack Section (8k-12k), CustomData Section (12k-16k).
I have put some special arrays and structs in the CustomData Section.
As I understand it, the Data Segment (DS) selector will be used for any data-related code the compiler generates.
So the Data Section (4k-8k) will be accessed through DS by default for all operations (except some string ops, where ES may be used), like:
mov $0xc00,%eax
addl $0xd, (%eax)
But I want to use the Extra Segment (ES) selector for CustomData access. I would define a new GDT entry for ES with a different base and limit, like:
mov $0x3400,%eax
addl $0xd, %es:(%eax)
So my question is: does GCC have any x86 compiler flag that can be used to tell the compiler to use ES for CustomData Section accesses?
That is, a compiler flag which will generate code using ES for the CustomData Section?
Thanks in advance!
While the question is asking for an option to make gcc's code-gen use an es prefix for accessing a custom section, IF you wanted to do it in hand-written code, the AT&T syntax already allows e.g. %es:(%eax).
Note, this may break rep-string instructions which gcc sometimes inlines; fs or gs would be the only sane choices, and are still usable even in x86-64.
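For the hand-written case, here is a minimal sketch in GNU C inline asm (assumptions: 32-bit code, %es already loaded with a selector for the custom-data segment, and the helper name load_from_es is made up):

/* Illustrative only: read a 32-bit value through %es. The volatile and the
 * "memory" clobber are there because the access is not expressed as a normal
 * memory operand, so the compiler cannot track which memory is touched. */
static inline unsigned int load_from_es(unsigned int offset)
{
    unsigned int value;
    asm volatile("movl %%es:(%1), %0"
                 : "=r"(value)
                 : "r"(offset)
                 : "memory");
    return value;
}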
Quoting the example from the clang language-extension docs
#define GS_RELATIVE __attribute__((address_space(256)))
int foo(int GS_RELATIVE *P) {
    return *P;
}
Which compiles to (on X86-32):
_foo:
movl 4(%esp), %eax # load the arg
movl %gs:(%eax), %eax # use it with its prefix
ret
address-space 256 is gs, 257 is fs, and 258 is ss.
The docs don't mention es; compilers normally assume that es=ds so they can freely inline rep movs or rep stos for memcpy / memset if they choose to do so based on tuning options. Library implementations of memcpy or memset may also use rep stos/movs on some CPUs. Related: Why is std::fill(0) slower than std::fill(1)?
Obviously this is really low-level stuff, and it only makes sense if you've already set the GS or FS base address (e.g. with wrfsbase). Beware that i386 Linux uses gs for thread-local storage in user-space, while x86-64 Linux uses fs.
I'm not aware of extensions like this for gcc, ICC, or MSVC.
Well, there is __thread in GNU C, which will use %gs: or %fs: prefixes as appropriate for the target platform (see: How does the gcc `__thread` work?).
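A minimal sketch of that (the variable and function names are made up): on x86-64 Linux the access typically compiles to an fs-relative instruction, and on i386 Linux to a gs-relative one.

/* Each thread gets its own copy; GCC addresses it relative to the TLS
 * segment register (%fs on x86-64 Linux, %gs on i386 Linux). */
__thread int per_thread_counter;

int bump(void)
{
    return ++per_thread_counter;
}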

How to debug an assembled program?

I have a program written in assembly that crashes with a segmentation fault. (The code is irrelevant, but is here.)
My question is how to debug an assembly language program with GDB?
When I try running it in GDB and perform a backtrace, I get no meaningful information. (Just hex offsets.)
How can I debug the program?
(I'm using NASM on Ubuntu, by the way if that somehow helps.)
I would just load it directly into gdb and step through it instruction by instruction, monitoring all registers and memory contents as you go.
I'm sure I'm not telling you anything you don't know there, but the program seems simple enough to warrant this sort of approach. I would leave fancy debugging tricks like backtraces (and even breakpoints) for more complex code.
As to the specific problem (code paraphrased below):
extern printf

SECTION .data
format: db "%d", 0

SECTION .bss
v_0: resb 4

SECTION .text
global main
main:
    push 5
    pop eax
    mov [v_0], eax
    mov eax, v_0
    push eax
    call printf
You appear to be just pushing 5 onto the stack, followed by the address of that 5 in memory (v_0). I'm pretty certain you're going to need to push the address of the format string at some point if you want to call printf. It's not going to take too kindly to being given a rogue format string.
It's likely that your:
mov eax, v_0
should be:
mov eax, format
and I'm assuming that there's more code after that call to printf that you just left off as unimportant (otherwise you'll be going off to never-never land when it returns).
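For reference, here is a sketch of what the listing seems to be aiming at, written as C (using the labels from the question; the surrounding main is just for illustration):

#include <stdio.h>

/* Apparent intent: store 5 in v_0, then print it. Under the 32-bit cdecl
 * convention the compiled call pushes the value of v_0 first (the rightmost
 * argument), then the address of the "%d" string, then calls printf, and
 * the caller removes both arguments afterwards. */
static int v_0;

int main(void)
{
    v_0 = 5;
    printf("%d", v_0);
    return 0;
}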
You should still be able to assemble with STABS debug markers when linking the code with gcc.
I recommend using YASM and assembling with the -dstabs option:
$ yasm -felf64 -mamd64 -dstabs file.asm
This is how I assemble my assembly programs.
NASM and YASM code is interchangeable for the most part (YASM has some extensions that aren't available in NASM, but any NASM code assembles fine with YASM).
I use gcc to link my assembled object files together, or when combining them with C or C++ code. When using gcc, I use -gstabs+ to compile with debug markers.
