What is the GCC compiler option to get a segment override prefix in x86?

I have a memory layout (in increasing address order) like this:
Code Section (0-4k), Data Section(4k-8k), Stack Section(8k-12k), CustomData Section(12k-16k).
I have put some special arrays and structs in the CustomData section.
As far as I know, the data segment selector (#DS) will be used for any data-related code the compiler generates.
So the Data section (4k-8k) will use #DS by default for all operations, except some string operations where ES may be used. For example:
mov $0xc00,%eax
addl $0xd, (%eax)
But I want to use the extra segment selector (#ES) for CustomData access. I would define a new GDT entry for ES with a different base and limit, like:
mov $0x3400,%eax
addl $0xd, %es:(%eax)
So my question is:
Does GCC have any x86 compiler flag that tells the compiler to use #ES for accesses to the CustomData section?
In other words, a compiler flag that generates code using #ES for the CustomData section.
Thanks in advance!

While the question is asking for an option to make gcc's code-gen use an es prefix for accessing a custom section, IF you wanted to do it in hand-written code, the AT&T syntax already allows e.g. %es:(%eax).
Note, this may break rep-string instructions which gcc sometimes inlines; fs or gs would be the only sane choices, and are still usable even in x86-64.
(Making this comment community wiki based on useful critique by Peter Cordes.)

Quoting the example from the clang language-extension docs
#define GS_RELATIVE __attribute__((address_space(256)))
int foo(int GS_RELATIVE *P) {
return *P;
}
Which compiles to (on X86-32):
_foo:
movl 4(%esp), %eax # load the arg
movl %gs:(%eax), %eax # use it with its prefix
ret
address-space 256 is gs, 257 is fs, and 258 is ss.
The docs don't mention es; compilers normally assume that es=ds so they can freely inline rep movs or rep stos for memcpy / memset if they choose to do so based on tuning options. Library implementations of memcpy or memset may also use rep stos/movs on some CPUs. Related: Why is std::fill(0) slower than std::fill(1)?
Obviously this is really low-level stuff, and only makes sense if you've already set the GS or FS base address. (wrfsbase). Beware that i386 Linux uses gs for thread-local storage in user-space, while x86-64 Linux uses fs.
I'm not aware of extensions like this for ICC or MSVC. GCC 6 and later have the similar __seg_fs and __seg_gs named address spaces on x86, but nothing for es.
There is also __thread in GNU C, which will use %gs: or %fs: prefixes as appropriate for the target platform. How does the gcc `__thread` work?.
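As a minimal sketch of the __thread route mentioned above (the names call_count and bump_call_count are made up for illustration), the following keeps a per-thread counter. On x86 Linux one would expect the access to be emitted with a %gs: (i386) or %fs: (x86-64) prefix, though the exact code depends on the compiler and TLS model:

```c
/* Each thread gets its own copy of this variable. On x86 Linux the compiler
   addresses it relative to the thread pointer (%gs on i386, %fs on x86-64). */
static __thread int call_count = 0;

/* Increment and return the calling thread's private counter. */
int bump_call_count(void)
{
    return ++call_count;
}
```

Compiling with gcc -O2 -S and looking for the segment prefix in the output is a quick way to check what your toolchain actually does.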

Related

What is the difference between __i686.get_pc_thunk and __x86.get_pc_thunk?

These helper functions are used by GCC and Clang in 32-bit x86 position-independent code to get the current execution address into a register, for example:
call __i686.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
movl $2, 4(%esp)
leal .LC0@GOTOFF(%ebx), %eax
movl %eax, (%esp)
call dlopen@PLT
It seems the implementations are equivalent:
__x86.get_pc_thunk.bx:
movl (%esp), %ebx
ret
__i686.get_pc_thunk.bx:
movl (%esp), %ebx
ret
Is there any difference besides the name change (seems i686 is older)? And is there a reason for the i686 prefix instead of i386?
So, after some digging in commit history and bug trackers I think I mostly figured it out.
A long time ago, glibc had its own handling of PIC code, which involved a call/pop pattern to get the GOT address.
Around 2002, __i686.get_pc_thunk.*, which did a similar task, was added to GCC, initially as an internal symbol.
Shortly afterwards it ended up in glibc too, probably to avoid code duplication when being compiled with GCC.
However, when built for Pentium Pro or later (-march=i686), GCC defined the preprocessor macro __i686=1, breaking glibc's compilation of the stub code. The problem was discovered quite early, but for several years glibc used various workarounds to handle it.
In 2011 (GCC 4.7?) the name was changed to __x86.get_pc_thunk.* and glibc added some checks to use a matching name. Eventually support for old GCC versions was dropped together with the old name. Both GCC and glibc only use __x86.get_pc_thunk.* now (although GCC can also generate the inline call/pop version).
So, in summary:
There is no actual difference between the two, the name change is simply historical due to a predefined macro collision.
References:
https://gcc.gnu.org/git/?p=gcc.git&a=search&h=HEAD&st=commit&s=get_pc_thunk
https://sourceware.org/git/?p=glibc.git&a=search&st=commit&s=get_pc_thunk
https://sourceware.org/bugzilla/show_bug.cgi?id=411
https://sourceware.org/bugzilla/show_bug.cgi?id=4507
Just a different name choice, not significant AFAIK.
i686 is the standard name for 32-bit code using new PPro instructions like CMOV and FCOMI, plus 586 instructions like CMPXCHG8B and CPUID. Modern GNU/Linux distros typically configure gcc to use that as the default target for -m32 32-bit code, instead of truly baseline i386. e.g. gcc -v will show i686-linux-gnu for a 32-bit build of GCC.
Usually clang uses call next_insn / pop reg to read EIP into a register. (Fun fact: that actually doesn't break return-address prediction on CPUs other than original Pentium-Pro or Via Nano3000: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/#call0 - CPUs special case call rel32=0 as not being a real call and don't put the return address into the predictor stack.)
The get_pc_thunk.bx includes the name of the register to return in. 32-bit PIC code used to only ever use EBX as the GOT pointer register, but GCC can now pick any convenient register and emit a thunk function for it, like ....get_pc_thunk.ax so leaf functions don't have to save/restore EBX.
PIE does make executables slower, by maybe 15% for 32-bit code vs. a couple percent for 64-bit code. x86-64 has RIP-relative addressing which avoids the need for these thunks. IMO a 32-bit PIE isn't worth the price, unless you really need more hardening against ROP and Spectre attacks by having ASLR of the main executable.
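If you want to see the thunk for yourself, a tiny source file like the sketch below should be enough (shared_counter and read_counter are hypothetical names). Compiling it with something like gcc -m32 -fPIC -O2 -S typically shows a call to a get_pc_thunk helper followed by a GOT-relative load, although which register is chosen and whether the thunk is used at all depends on the compiler and its version:

```c
/* A global with default visibility: in 32-bit PIC code its address has to be
   looked up through the GOT, which first needs the current EIP in a register. */
int shared_counter = 41;

/* Reading the global is what forces the GOT setup in 32-bit PIC builds. */
int read_counter(void)
{
    return shared_counter;
}
```

On x86-64 the same function normally compiles to a RIP-relative access with no thunk.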

Changing calling convention in gcc/g++ abi

How could I enforce gcc/g++ to not use registers but only stack in x86_64
to pass arguments to functions,
like it was in 32-bit version
(and possibly take the function result this way).
I know it breaks official ABI and both the caller side and the called side
must be compiled this way so it works. I don't care if push/pop or mov/sub way
is used. I expect there should be a flag to compiler that could enforce it
but I couldn't find it.
It seems you can't without hacking GCC's source code.
There is no standard x86-64 calling convention that uses inefficient stack args.
GCC only knows how to use standard calling conventions, in this case x86-64 SysV and MS Windows fastcall and vectorcall. (e.g. __attribute__((ms_abi)) or vectorcall). Normally nobody wants this; MS's calling convention is already friendly enough for wrappers or variadic functions. You can use that for some functions (controlled by __attribute__) even when compiling for Linux, MacOS, *BSD, etc., if that helps. Hard to imagine a use-case for pure stack args.
GCC lets you specify a register as fixed (never touched by GCC, like -ffixed-rdi), call-clobbered, or call-preserved. But using those with arg-passing registers just creates wrong code, not what you want.
e.g.
int foo(int a, int b, int c);
void caller(int x) {
foo(1,2,3);
//foo(4,x,6);
}
compiled by gcc9.2 -O3 -fcall-saved-rdi
caller:
push rdi
mov edx, 3
mov esi, 2
pop rdi
jmp foo
It saves/restores RDI but doesn't put a 1 in it before calling foo.
And it doesn't leave RDI out of the arg-passing sequence and bump other args later. (I was thinking you might be able to invent a calling convention where all the arg-passing registers were fixed or call-saved, maybe getting GCC to fall back to stack args. But nope.)
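For completeness, here is a minimal sketch of selecting the MS calling convention per function with __attribute__((ms_abi)), as mentioned above. The MS_ABI macro and the function names are made up for illustration. Note that this still passes the first four integer args in registers (RCX, RDX, R8, R9), so it is not the pure-stack convention the question asks for:

```c
/* ms_abi is an x86-specific attribute; define it away elsewhere so the
   file still builds on other targets. */
#if defined(__x86_64__)
#define MS_ABI __attribute__((ms_abi))
#else
#define MS_ABI
#endif

/* Callee uses the Windows x64 convention even when compiled for Linux. */
MS_ABI int sum3(int a, int b, int c)
{
    return a + b + c;
}

/* The caller sees the attribute on the declaration and passes args accordingly. */
int call_sum3(void)
{
    return sum3(1, 2, 3);
}
```

Both sides must see the same attribute on the declaration, otherwise caller and callee disagree about where the args live.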

How does gcc know the register size to use in inline assembly?

I have the inline assembly code:
#define read_msr(index, buf) asm volatile ("rdmsr" : "=d"(buf[1]), "=a"(buf[0]) : "c"(index))
The code using this macro:
u32 buf[2];
read_msr(0x173, buf);
I found the disassembly is (using gnu toolchain):
mov eax,0x173
mov ecx,eax
rdmsr
mov DWORD PTR [rbp-0xc],edx
mov DWORD PTR [rbp-0x10],eax
My question: since 0x173 is less than 0xffff, why doesn't gcc use mov cx, 0x173? Does gcc analyze the following rdmsr instruction? Will gcc always know the correct register size?
It depends on the size of the value or variable passed.
If you pass a "short int" it will set "cx" and read the data from "ax" and "dx" (if buf is a short int, too).
For char it would access "cl" and so on.
So "c" refers to the "ecx" register, but this is accessed with "ecx", "cx", or "cl" depending on the size of the access, which I think makes sense.
To test you can try passing (unsigned short)0x173, it should change the code.
There is no analysis of the inline assembly (in fact, after text substitution it is copied directly to the output assembly, syntax errors included). Nor is there a default register size that depends on whether you have a 32-bit or 64-bit target; that would be far too limiting.
I think the answer is because the current default data size is 32-bit. In 64-bit long mode, the default data size is also 32-bit, unless you use "rex.w" prefix.
Intel specifies the RDMSR instruction as using (all of) ECX to determine the model specific register. That being the case, and apparently as specified by your macro, GCC has every reason to load your constant into the full ECX.
So the question about why it doesn't load CX seems completely inappropriate. It looks like GCC is generating the right code.
(You didn't ask why it stages the load of ECX inefficiently by using EAX; I don't know the answer to that).
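To illustrate the point that the register name follows the size of the C operand, here is a small sketch (copy16 and copy32 are made-up helpers, and the plain-C fallback exists only so the file builds on non-x86 targets). The asm template is identical in both functions; only the operand types differ, so one would expect %cx in the first and %ecx in the second:

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
/* 16-bit operands: "c" is substituted as %cx, "r" as some 16-bit register. */
uint16_t copy16(uint16_t in)
{
    uint16_t out;
    __asm__ ("mov %1, %0" : "=r"(out) : "c"(in));
    return out;
}

/* 32-bit operands: the same template now gets %ecx and a 32-bit register. */
uint32_t copy32(uint32_t in)
{
    uint32_t out;
    __asm__ ("mov %1, %0" : "=r"(out) : "c"(in));
    return out;
}
#else
/* Fallback for other architectures (no inline asm involved). */
uint16_t copy16(uint16_t in) { return in; }
uint32_t copy32(uint32_t in) { return in; }
#endif
```

Looking at the gcc -S output for both functions shows the substituted register names.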

Why is an empty function not just a return

If I compile an empty C function
void nothing(void)
{
}
using gcc -O2 -S (and clang) on MacOS, it generates:
_nothing:
pushq %rbp
movq %rsp, %rbp
popq %rbp
ret
Why does gcc not remove everything but the ret? It seems like an easy optimisation to make unless it really does something (seems not to, to me). This pattern (push/move at the beginning, pop at the end) is also visible in other non-empty functions where rbp is otherwise unused.
On Linux using a more recent gcc (4.4.5) I see just
nothing:
rep
ret
Why the rep ? The rep is absent in non-empty functions.
Why the rep ?
The reasons are explained in this blog post. In short, jumping directly to a single-byte ret instruction would mess up the branch prediction on some AMD processors. And rather than adding a nop before the ret, a meaningless prefix byte was added to save instruction decoding bandwidth.
The rep is absent in non-empty functions.
To quote from the blog post I linked to: "[rep ret] is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...)".
In the case of an empty function, the ret would have been the direct target of a call. In a non-empty function, it wouldn't be.
Why does gcc not remove everything but the ret?
It's possible that some compilers won't omit frame pointer code even if you've specified -O2. At least with gcc, you can explicitly tell the compiler to omit them by using the -fomit-frame-pointer option.
As explained here: http://support.amd.com/us/Processor_TechDocs/25112.PDF, a two-byte near-return instruction (i.e. rep ret) is used because a single-byte return can be mispredicted on some amd64 processors in some situations such as this one.
If you fiddle around with the processor targeted by gcc you may find that you can get it to generate a single-byte ret. -mtune=nocona worked for me.
I suspect, as johnfound says, that your last code is a bug. The first code appears because every C compiler must always follow the _cdecl calling convention, which for a function means the following (in Intel syntax; sorry, I don't know the AT&T syntax):
Function Definition
_functionA:
push rbp
mov rbp, rsp
;Some function
pop rbp
ret
In caller :
call _functionA
sub esp, 0 ; Maybe if it zero, some compiler can strip it
As for why GCC always follows the _cdecl calling convention even when doing so is pointless: the compiler isn't smarter than an advanced assembly programmer, so it follows _cdecl at all costs.
That is, because even so called "optimization compilers" are too dumb to generate always good machine code.
They can't generate better code than their creators made them to generate.
Since an empty function is pointless, they probably didn't bother to optimize it or even to detect this very special case.
Although, single "rep" prefix is probably a bug. It does nothing when used without string instruction, but anyway, in some newer CPU it theoretically can cause an exception. (and IMHO should)

GCC's extended version of asm

I never thought I'd be posting an assembly question. :-)
In GCC, there is an extended version of the asm statement. It can take four parameters: assembly-code, output-list, input-list and overwrite-list.
My question is, are the registers in the overwrite-list zeroed out? What happens to the values that were previously in there (from other code executing).
Update: In considering my answers thus far (thank you!), I want to add that though a register is listed in the clobber-list, it (in my instance) is being used in a pop (popl) command. There is no other reference.
No, they are not zeroed out. The purpose of the overwrite list (more commonly called the clobber list) is to inform GCC that, as a result of the asm instructions the register(s) listed in the clobber list will be modified, and so the compiler should preserve any which are currently live.
For example, on x86 the cpuid instruction returns information in four parts using four fixed registers: %eax, %ebx, %ecx and %edx, based on the input value of %eax. If we were only interested in the result in %eax and %ebx, then we might (naively) write:
int input_res1 = 0; // also used for first part of result
int res2;
__asm__("cpuid" : "+a"(input_res1), "=b"(res2) );
This would get the first and second parts of the result in the C variables input_res1 and res2. However, if GCC was using %ecx and %edx to hold other data, they would be overwritten by the cpuid instruction without GCC knowing. To prevent this, we use the clobber list:
int input_res1 = 0; // also used for first part of result
int res2;
__asm__("cpuid" : "+a"(input_res1), "=b"(res2)
: : "%ecx", "%edx" );
As we have told GCC that %ecx and %edx will be overwritten by this asm call, it can handle the situation correctly - either by not using %ecx or %edx, or by saving their values to the stack before the asm function and restoring after.
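Put together as a complete function, the clobber version might look like the sketch below. It assumes an x86 or x86-64 target, and cpuid_max_leaf is a made-up name. Leaf 0 returns the highest supported basic leaf in %eax and the first four bytes of the vendor string in %ebx:

```c
#include <stdint.h>

/* Assumes an x86 / x86-64 target; the function name is illustrative only. */
uint32_t cpuid_max_leaf(uint32_t *ebx_out)
{
    uint32_t eax = 0;   /* input: leaf 0; output: highest basic leaf */
    uint32_t ebx;

    __asm__ volatile ("cpuid"
                      : "+a"(eax), "=b"(ebx)   /* the results we keep */
                      :                        /* no further inputs */
                      : "ecx", "edx");         /* overwritten, values unused */
    *ebx_out = ebx;
    return eax;
}
```

The volatile is there only so the compiler does not drop or merge the asm if it decides the outputs are unused.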
Update:
With regards to your second question (why you are seeing a register listed in the clobber list for a popl instruction) - assuming your asm looks something like:
__asm__("popl %%eax" : : : "%eax" );
Then the code here is popping an item off the stack, however it doesn't care about the actual value - it's probably just keeping the stack balanced, or the value isn't needed in this code path. By writing this way, as opposed to:
int trash; // don't ever use this.
__asm__("popl %0" : "=r"(trash));
You don't have to explicitly create a temporary variable to hold the unwanted value. Admittedly in this case there isn't a huge difference between the two, but the version with the clobber makes it clear that you don't care about the value from the stack.
If by "zeroed out" you mean "the values in the registers are replaced with 0's to prevent me from knowing what some other function was doing" then no, the registers are not zeroed out before use. But it shouldn't matter because you're telling GCC you plan to store information there, not that you want to read information that's currently there.
You give this information to GCC so that (reading the documentation) "you need not guess which registers or memory locations will contain the data you want to use" when you're finished with the assembly code (eg., you don't have to remember if the data will be in the stack register, or some other register).
GCC needs a lot of help for assembly code because "The compiler ... does not parse the assembler instruction template and does not know what it means or even whether it is valid assembler input. The extended asm feature is most often used for machine instructions the compiler itself does not know exist."
Update
GCC is designed as a multi-pass compiler. Many of the passes are in fact entirely different programs. A set of programs forming "the compiler" translate your source from C, C++, Ada, Java, etc. into assembly code. Then a separate program (gas, for GNU Assembler) takes that assembly code and turns it into a binary (and then ld and collect2 do more things to the binary). Assembly blocks exist to pass text directly to gas, and the clobber-list (and input list) exist so that the compiler can do whatever set up is needed to pass information between the C, C++, Ada, Java, etc. side of things and the gas side of things, and to guarantee that any important information currently in registers can be protected from the assembly block by copying it to memory before the assembly block runs (and copying back from memory afterward).
The alternative would be to save and restore every register for every assembly code block. On a RISC machine with a large number of registers that could get expensive (the Itanium has 128 general registers, another 128 floating point registers and 64 1-bit registers, for instance).
It's been a while since I've written any assembly code. And I have much more experience using GCC's named registers feature than doing things with specific registers. So, looking at an example:
#include <stdio.h>
long foo(long l)
{
long result;
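asm (
/* no size suffix: the assembler infers it from the register width,
   so this works whether long is 32 or 64 bits */
"mov %[l], %[reg];"
"inc %[reg];"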
: [reg] "=r" (result)
: [l] "r" (l)
);
return result;
}
int main(int argc, char** argv)
{
printf("%ld\n", foo(5L));
}
I have asked for an output register, which I will call reg inside the assembly code, and that GCC will automatically copy to the result variable on completion. There is no need to give this variable different names in C code vs assembly code; I only did it to show that it is possible. Whichever physical register GCC decides to use -- whether it's %%eax, %%ebx, %%ecx, etc. -- GCC will take care of copying any important data from that register into memory when I enter the assembly block so that I have full use of that register until the end of the assembly block.
I have also asked for an input register, which I will call l both in C and in assembly. GCC promises that whatever physical register it decides to give me will have the value currently in the C variable l when I enter the assembly block. GCC will also do any needed recordkeeping to protect any data that happens to be in that register before I enter the assembly block.
What if I add a line to the assembly code? Say:
"addl %[reg], %%ecx;"
Since the compiler part of GCC doesn't check the assembly code it won't have protected the data in %%ecx. If I'm lucky, %%ecx may happen to be one of the registers GCC decided to use for %[reg] or %[l]. If I'm not lucky, I will have "mysteriously" changed a value in some other part of my program.
I suspect the overwrite list is just to give GCC a hint not to store anything of value in these registers across the ASM call; since GCC doesn't analyze what ASM you're giving it, and certain instructions have side-effects that touch other registers not explicitly named in the code, this is the way to tell GCC about it.
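To make the %%ecx scenario above concrete, here is a hedged sketch of how one could use a hard-coded scratch register safely by naming it in the clobber list. The function add_via_ecx is invented for illustration, and the plain-C fallback is only there so the file builds on non-x86 targets:

```c
#if defined(__i386__) || defined(__x86_64__)
/* Adds a and b using %ecx as an explicit scratch register. Listing "ecx" as
   a clobber keeps GCC from placing operands or live values in it. */
int add_via_ecx(int a, int b)
{
    int result;
    __asm__ ("movl %[b], %%ecx\n\t"
             "addl %[a], %%ecx\n\t"
             "movl %%ecx, %[res]"
             : [res] "=r"(result)
             : [a] "r"(a), [b] "r"(b)
             : "ecx", "cc");   /* scratch register and flags are modified */
    return result;
}
#else
/* Fallback for other architectures (no inline asm involved). */
int add_via_ecx(int a, int b) { return a + b; }
#endif
```

Without the "ecx" clobber the same code might still appear to work, which is exactly the "mysterious" failure mode described above.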

Resources