Change GCC's output for 0-set (clear) operation - gcc

GCC often produces the following x86 assembly to set the value in eax to 0 before returning to the caller:
xor eax, eax
For the purposes of can I do it?, as opposed to should I do it?, is it possible to change GCC's behaviour to produce "equivalent" assembly? Also, where in libgcc does this generation occur?
To clarify, I'm not looking for guidance on which assembly instructions would be appropriate to use, I'm wondering how it is possible to change GCC's output behaviour.

You mean like sub eax,eax which is specially recognized as a zeroing idiom on only some CPUs, not all?
The optimization is done by the -fpeephole2 pass, which is enabled at -O2 and -Os; using -fno-peephole2 would give you mov eax,0 for materializing a 0 in a register. (As well as creating other missed optimizations, I assume! xor-zeroing probably isn't the only peephole gcc looks for.)
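One way to observe the difference yourself is a minimal sketch like the following. It assumes an x86 target; the file name zero.c and both function names are made up for illustration, and the exact assembly will vary with GCC version and target.

```c
/* zero.c (hypothetical file name, used only for this illustration).
 *
 * Compile to assembly twice and compare the body of return_zero:
 *   gcc -O2 -S -masm=intel zero.c -o zero_default.s
 *   gcc -O2 -fno-peephole2 -S -masm=intel zero.c -o zero_nopeephole.s
 * Per the answer above, the first is expected to use "xor eax, eax" and
 * the second "mov eax, 0"; this is an expectation, not a guarantee.
 */
#include <assert.h>

int return_zero(void)
{
    return 0;
}

/* The runtime result is the same either way; only the emitted
 * instruction differs. */
static void check_return_zero(void)
{
    assert(return_zero() == 0);
}
```

Diffing the two .s files is usually the quickest way to see which other peepholes were also disabled.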
I don't know where to look in the gcc source code but knowing the option might help track it down.
It's not in "libgcc" though, that's helper functions like 64-bit multiply on a 32-bit machine. (When gcc emits calls to funny-named helper functions like __udivdi3, it's expecting the asm output to be linked against libgcc).
More like you'd find it in the x86 machine-definition files, one of the .md files in the gcc source tree. Otherwise hard-coded into a C optimization function. Like "xor %1, %0" might be something to search on, or more likely it'll have {... | ...} dialect-alternatives. But searching on the xor mnemonic might still help.
This is a half-assed partial answer. Please post a specific answer or at least leave a comment if you know where to look.

Related

MOVZX into register - "invalid operand for instruction"

I am trying to compile an assembler-based implementation of AES, viewable here. My assembler is giving me the following error, repeated several different times over what appear to be instances of the same error. The exact source location is here, but due to the large amount of preprocessor indirection used in this file, I have copied the exact error from my build output, which gives the exact code as seen by the compiler:
/Volumes/Sources/Andromeda/Kernel/libkern/crypto/aes/EncryptDecrypt.s:297:19: error: invalid operand for instruction
movzx 240(%r10), %rax
^~~~
I do not quite understand what may be causing this problem. If I understand it properly, this instruction moves a byte (or more, this is unclear, and may in fact be the source of the problem) into the RAX register, zero-extending it if the source is less than 64 bits in size. Do I need to explicitly specify a size by adding a suffix to the movzx instruction (e.g. movzxb)? What else might be the cause of this problem? Thanks!
AT&T syntax does not normally use movzx, but some assembler versions accept it. My copy of GNU assembler 2.22 does, but maybe the OS X version doesn't. In any case, the assembler generates code for a byte source. If you do in fact have that, the proper AT&T syntax would be movzbq 240(%r10), %rax, or, taking advantage of automatic zero extension, movzbl 240(%r10), %eax.
If you have a 4 byte source, then you can't use movzx at all, since it does not exist for that operand type. All you need in this case is the automatic zero extension, so you can simply do movl 240(%r10), %eax.
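To see the movzbl form in action from C, here is a minimal sketch using GCC-style inline assembly. The helper name load_byte_zx is hypothetical and not taken from the AES source, and the assembly path is only compiled on x86-64; other targets fall back to plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Loads one byte from memory and zero-extends it to 64 bits.
 * load_byte_zx is a hypothetical helper name. */
static uint64_t load_byte_zx(const uint8_t *p)
{
#if defined(__x86_64__)
    uint64_t out;
    /* %k0 prints the 32-bit name of operand 0 (e.g. %eax). Writing a
     * 32-bit register on x86-64 also clears the upper 32 bits. */
    __asm__("movzbl %1, %k0" : "=r"(out) : "m"(*p));
    return out;
#else
    return *p; /* portable fallback for other targets */
#endif
}

/* Usage sketch: the high bit of the byte must not be sign-extended. */
static void check_load_byte_zx(void)
{
    const uint8_t b = 0xF0;
    assert(load_byte_zx(&b) == 0xF0u);
}
```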

change instruction set in GCC

I want to test some architecture changes on an already existing architecture (x86) using simulators. However to properly test them and run benchmarks, I might have to make some changes to the instruction set, Is there a way to add these changes to GCC or any other compiler?
Simple solution:
One common approach is to add inline assembly, and encode the instruction bytes directly.
For example:
int main()
{
asm __volatile__ (".byte 0x90\n");
return 0;
}
compiles (gcc -O3) into:
00000000004005a0 <main>:
4005a0: 90 nop
4005a1: 31 c0 xor %eax,%eax
4005a3: c3 retq
So just replace 0x90 with your instruction bytes. Of course you won't see the actual instruction in a regular objdump, and the program would likely not run on your system (unless you use one of the nop combinations), but the simulator should recognize it if it's properly implemented there.
Note that you can't expect the compiler to optimize well for you when it doesn't know this instruction, and you should take care and work with inline assembly clobber/input/output options if it changes state (registers, memory), to ensure correctness. Use optimizations only if you must.
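As a sketch of what "work with the clobber/input/output options" can look like, the following emits raw bytes that do change a register. It reuses the bytes 31 c0 from the objdump listing above (xor %eax,%eax) as a stand-in for an unknown instruction; raw_zero_eax is a hypothetical name, and the non-x86 branch exists only so the example builds elsewhere.

```c
#include <assert.h>

/* Emits 0x31 0xC0 as raw data in the instruction stream and tells the
 * compiler what those bytes do, since it cannot know by itself. */
static unsigned raw_zero_eax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned result;
    __asm__ volatile(".byte 0x31, 0xc0"
                     : "=a"(result) /* the raw bytes write eax */
                     :              /* no inputs */
                     : "cc");       /* xor also modifies the flags */
    return result;
#else
    return 0; /* fallback so the sketch still builds on other targets */
#endif
}

/* Usage sketch */
static void check_raw_zero_eax(void)
{
    assert(raw_zero_eax() == 0);
}
```

For a real new instruction you would describe its actual inputs, outputs and side effects in the same three lists.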
Complicated solution
The alternative approach is to implement this in your compiler. It can be done in gcc, but as stated in the comments, LLVM is probably one of the best ones to play with, as it's designed as a compiler development platform. It's still very complicated, though: LLVM is best suited for the IR optimization stages and is somewhat less friendly when you try to modify the target-specific backends.
Still, it's doable, and you have to do that if you also plan to have your compiler decide when to issue this instruction. I'd suggest to start from the first option though, to see if your simulator even works with this addition, and only then spending time on the compiler side.
If and when you do decide to implement this in LLVM, your best bet is to define it as an intrinsic function, there's relatively more documentation about this in here - http://llvm.org/docs/ExtendingLLVM.html
You can add new instructions, or change existing ones, by modifying a group of files in GCC called the "machine description": instruction patterns in the <target>.md file, some code in the <target>.c file, predicates, constraints and so on. All of these live in the $GCCHOME/gcc/config/<target>/ folder, and all of them are used in the step that generates assembly code from RTL. You can also change when instructions are emitted by changing other, more general GCC source files (SSA tree generation, RTL generation), but all of that is a bit more complicated.
A simple explanation of what happens:
https://www.cse.iitb.ac.in/grc/slides/cgotut-gcc/topic5-md-intro.pdf
It's doable, and I've done it, but it's tedious. It is basically the process of porting the compiler to a new platform, using an existing platform as a model. Somewhere in GCC there is a file that defines the instruction set, and it goes through various processes during compilation that generate further code and data. It's 20+ years since I did it so I have forgotten all the details, sorry.

Why is an empty function not just a return

If I compile an empty C function
void nothing(void)
{
}
using gcc -O2 -S (and clang) on MacOS, it generates:
_nothing:
pushq %rbp
movq %rsp, %rbp
popq %rbp
ret
Why does gcc not remove everything but the ret? It seems like an easy optimisation to make unless it really does something (seems not to, to me). This pattern (push/move at the beginning, pop at the end) is also visible in other non-empty functions where rbp is otherwise unused.
On Linux using a more recent gcc (4.4.5) I see just
nothing:
rep
ret
Why the rep ? The rep is absent in non-empty functions.
Why the rep ?
The reasons are explained in this blog post. In short, jumping directly to a single-byte ret instruction would mess up the branch prediction on some AMD processors. And rather than adding a nop before the ret, a meaningless prefix byte was added to save instruction decoding bandwidth.
The rep is absent in non-empty functions.
To quote from the blog post I linked to: "[rep ret] is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...)".
In the case of an empty function, the ret would have been the direct target of a call. In a non-empty function, it wouldn't be.
Why does gcc not remove everything but the ret?
It's possible that some compilers won't omit frame pointer code even if you've specified -O2. At least with gcc, you can explicitly tell the compiler to omit them by using the -fomit-frame-pointer option.
As explained here: http://support.amd.com/us/Processor_TechDocs/25112.PDF, a two-byte near-return instruction (i.e. rep ret) is used because a single-byte return can be mispredicted on some amd64 processors in some situations, such as this one.
If you fiddle around with the processor targeted by gcc you may find that you can get it to generate a single-byte ret. -mtune=nocona worked for me.
I suspect, as johnfound says, that your last code is a bug. The first code is generated because every C compiler must always follow the _cdecl calling convention, which for a function means the following (in Intel syntax, sorry, I don't know the AT&T syntax):
Function Definition
_functionA:
push rbp
mov rbp, rsp
;Some function
pop rbp
ret
In caller :
call _functionA
sub esp, 0 ; Maybe if it zero, some compiler can strip it
Why does GCC always follow the _cdecl calling convention even when doing so is pointless? Because the compiler isn't smarter than an advanced assembly programmer, so it follows _cdecl at all costs.
That is, because even so called "optimization compilers" are too dumb to generate always good machine code.
They can't generate better code than their creators made them to generate.
Since an empty function is nonsense, they probably simply didn't bother to optimize it or even to detect this very special case.
Although, single "rep" prefix is probably a bug. It does nothing when used without string instruction, but anyway, in some newer CPU it theoretically can cause an exception. (and IMHO should)

inline assembly error: can't find a register in class 'GENERAL_REGS' while reloading 'asm'

I have an inline AT&T style assembly block, which works with XMM registers and there are no problems in Release configuration of my XCode project, however I've stumbled upon this strange error (which is supposedly a GCC bug) in Debug configuration... Can I fix it somehow? There is nothing special in assembly code, but I am using a lot of memory constraints (12 constraints), can this cause this problem?
Not a complete answer, sorry, but the comments section is too short for this ...
Can you post a sample asm("..." :::) line that demonstrates the problem ?
The use of XMM registers is not the issue, the error message indicates that GCC wanted to create code like, say:
movdqa (%rax),%xmm0
i.e. memory loads/stores through pointers held in general registers, and you specified more memory locations than there are available general-purpose regs (it's probably 12 in debug mode because RBP and RSP are used for the frame and stack pointers, and likely RBX for the global offset table and RAX reserved for returns), while the compiler doesn't realize the potential for register re-use.
You might be able to eke things out by doing something like:
void *all_mem_args_tbl[16] = { memarg1, memarg2, ... };
void *trashme;
asm ("movq (%1), %0\n\t"
"movdqa (%0), %%xmm0\n\t"
"movq 8(%1), %0\n\t"
"movdqa (%0), %%xmm1\n\t"
...
: "=&r"(trashme) : "r"(all_mem_args_tbl) : ...);
i.e. put all the mem locations into a table that you pass as operand, and then manage the actual general-purpose register use on your own. It might be two pointer accesses through the indirection table, but whether that makes a difference is hard to say without knowing your complete assembler code piece.
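A complete, compilable version of the table idea might look like the sketch below. It is an illustration under assumptions, not the asker's code: add4_via_table and its three-entry table layout are made up, movdqu is used so the buffers need no 16-byte alignment, and the assembly path is only built on x86-64.

```c
#include <assert.h>
#include <stdint.h>

/* tbl[0] and tbl[1] point to four uint32_t inputs each; tbl[2] points
 * to the four-element output. Only ONE general register operand (the
 * table pointer) plus one scratch register is requested from GCC. */
static void add4_via_table(void *const *tbl)
{
#if defined(__x86_64__)
    void *scratch; /* reused for every pointer loaded from the table */
    __asm__ volatile(
        "movq   (%1), %0\n\t"
        "movdqu (%0), %%xmm0\n\t"
        "movq   8(%1), %0\n\t"
        "movdqu (%0), %%xmm1\n\t"
        "paddd  %%xmm1, %%xmm0\n\t"
        "movq   16(%1), %0\n\t"
        "movdqu %%xmm0, (%0)"
        : "=&r"(scratch) /* early clobber: written before %1 is last read */
        : "r"(tbl)
        : "xmm0", "xmm1", "memory");
#else
    const uint32_t *a = tbl[0], *b = tbl[1];
    uint32_t *out = tbl[2];
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i]; /* portable fallback */
#endif
}

/* Usage sketch */
static void check_add4_via_table(void)
{
    uint32_t a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4] = {0};
    void *tbl[3] = {a, b, out};
    add4_via_table(tbl);
    assert(out[0] == 11 && out[3] == 44);
}
```

The "memory" clobber is there because the compiler cannot see which locations the assembly reaches through the table.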
The Debug configuration uses -O0 by default. Since this flag disables optimisations, the compiler is probably unable to allocate registers given the constraints specified by your inline assembly code, resulting in register starvation.
One solution is to specify a different optimisation level, e.g. -Os, which is the one used by default in the Release configuration.

GCC's extended version of asm

I never thought I'd be posting an assembly question. :-)
In GCC, there is an extended version of the asm function. This function can take four parameters: assembly-code, output-list, input-list and overwrite-list.
My question is, are the registers in the overwrite-list zeroed out? What happens to the values that were previously in there (from other code executing).
Update: In considering my answers thus far (thank you!), I want to add that though a register is listed in the clobber-list, it (in my instance) is being used in a pop (popl) command. There is no other reference.
No, they are not zeroed out. The purpose of the overwrite list (more commonly called the clobber list) is to inform GCC that, as a result of the asm instructions the register(s) listed in the clobber list will be modified, and so the compiler should preserve any which are currently live.
For example, on x86 the cpuid instruction returns information in four parts using four fixed registers: %eax, %ebx, %ecx and %edx, based on the input value of %eax. If we were only interested in the result in %eax and %ebx, then we might (naively) write:
int input_res1 = 0; // also used for first part of result
int res2;
__asm__("cpuid" : "+a"(input_res1), "=b"(res2) );
This would get the first and second parts of the result in the C variables input_res1 and res2; however, if GCC was using %ecx and %edx to hold other data, they would be overwritten by the cpuid instruction without gcc knowing. To prevent this, we use the clobber list:
int input_res1 = 0; // also used for first part of result
int res2;
__asm__("cpuid" : "+a"(input_res1), "=b"(res2)
: : "%ecx", "%edx" );
As we have told GCC that %ecx and %edx will be overwritten by this asm call, it can handle the situation correctly - either by not using %ecx or %edx, or by saving their values to the stack before the asm function and restoring after.
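Wrapped into a function, the clobbered version might look like this sketch. The name query_cpuid_leaf0 and its return convention are assumptions made for the example; the assembly path is limited to x86-64, where the clobbers are spelled rcx and rdx.

```c
#include <assert.h>

/* Runs cpuid with eax = 0. Returns 1 and fills the outputs on x86-64,
 * returns 0 on other targets. query_cpuid_leaf0 is a hypothetical name. */
static int query_cpuid_leaf0(unsigned *max_leaf, unsigned *vendor_ebx)
{
#if defined(__x86_64__)
    unsigned a = 0; /* input leaf number, then first part of the result */
    unsigned b;     /* second part of the result */
    __asm__ volatile("cpuid"
                     : "+a"(a), "=b"(b)
                     :
                     : "rcx", "rdx"); /* written by cpuid, unused here */
    *max_leaf = a;
    *vendor_ebx = b;
    return 1;
#else
    (void)max_leaf;
    (void)vendor_ebx;
    return 0;
#endif
}

/* Usage sketch: every x86-64 CPU reports a maximum leaf of at least 1. */
static void check_query_cpuid_leaf0(void)
{
    unsigned max_leaf = 0, ebx = 0;
    if (query_cpuid_leaf0(&max_leaf, &ebx))
        assert(max_leaf >= 1);
}
```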
Update:
With regards to your second question (why you are seeing a register listed in the clobber list for a popl instruction) - assuming your asm looks something like:
__asm__("popl %%eax" : : : "%eax" );
Then the code here is popping an item off the stack, however it doesn't care about the actual value - it's probably just keeping the stack balanced, or the value isn't needed in this code path. By writing this way, as opposed to:
int trash; // don't ever use this.
__asm__("popl %0" : "=r"(trash));
You don't have to explicitly create a temporary variable to hold the unwanted value. Admittedly in this case there isn't a huge difference between the two, but the version with the clobber makes it clear that you don't care about the value from the stack.
If by "zeroed out" you mean "the values in the registers are replaced with 0's to prevent me from knowing what some other function was doing" then no, the registers are not zeroed out before use. But it shouldn't matter because you're telling GCC you plan to store information there, not that you want to read information that's currently there.
You give this information to GCC so that (reading the documentation) "you need not guess which registers or memory locations will contain the data you want to use" when you're finished with the assembly code (eg., you don't have to remember if the data will be in the stack register, or some other register).
GCC needs a lot of help for assembly code because "The compiler ... does not parse the assembler instruction template and does not know what it means or even whether it is valid assembler input. The extended asm feature is most often used for machine instructions the compiler itself does not know exist."
Update
GCC is designed as a multi-pass compiler. Many of the passes are in fact entirely different programs. A set of programs forming "the compiler" translate your source from C, C++, Ada, Java, etc. into assembly code. Then a separate program (gas, for GNU Assembler) takes that assembly code and turns it into a binary (and then ld and collect2 do more things to the binary). Assembly blocks exist to pass text directly to gas, and the clobber-list (and input list) exist so that the compiler can do whatever set up is needed to pass information between the C, C++, Ada, Java, etc. side of things and the gas side of things, and to guarantee that any important information currently in registers can be protected from the assembly block by copying it to memory before the assembly block runs (and copying back from memory afterward).
The alternative would be to save and restore every register for every assembly code block. On a RISC machine with a large number of registers that could get expensive (the Itanium has 128 general registers, another 128 floating point registers and 64 1-bit registers, for instance).
It's been a while since I've written any assembly code. And I have much more experience using GCC's named registers feature than doing things with specific registers. So, looking at an example:
#include <stdio.h>
long foo(long l)
{
long result;
asm (
"movl %[l], %[reg];"
"incl %[reg];"
: [reg] "=r" (result)
: [l] "r" (l)
);
return result;
}
int main(int argc, char** argv)
{
printf("%ld\n", foo(5L));
}
I have asked for an output register, which I will call reg inside the assembly code, and that GCC will automatically copy to the result variable on completion. There is no need to give this variable different names in C code vs assembly code; I only did it to show that it is possible. Whichever physical register GCC decides to use -- whether it's %%eax, %%ebx, %%ecx, etc. -- GCC will take care of copying any important data from that register into memory when I enter the assembly block so that I have full use of that register until the end of the assembly block.
I have also asked for an input register, which I will call l both in C and in assembly. GCC promises that whatever physical register it decides to give me will have the value currently in the C variable l when I enter the assembly block. GCC will also do any needed recordkeeping to protect any data that happens to be in that register before I enter the assembly block.
What if I add a line to the assembly code? Say:
"addl %[reg], %%ecx;"
Since the compiler part of GCC doesn't check the assembly code it won't have protected the data in %%ecx. If I'm lucky, %%ecx may happen to be one of the registers GCC decided to use for %[reg] or %[l]. If I'm not lucky, I will have "mysteriously" changed a value in some other part of my program.
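The fix for that situation is to name the extra register in the clobber list. The sketch below is a variation on the example above, not the original code: it assumes x86-64, uses the 64-bit rcx so operand sizes match, and increment_and_touch_rcx is a made-up name.

```c
#include <assert.h>

/* Same idea as foo above, plus an instruction that modifies rcx.
 * Listing "rcx" as a clobber keeps GCC from placing live data or any
 * of the asm operands in that register. */
static long increment_and_touch_rcx(long l)
{
#if defined(__x86_64__)
    long result;
    __asm__("mov %[l], %[reg]\n\t"
            "inc %[reg]\n\t"
            "add %[reg], %%rcx" /* overwrites rcx with a new value */
            : [reg] "=&r"(result)
            : [l] "r"(l)
            : "rcx", "cc");     /* rcx and the flags are modified */
    return result;
#else
    return l + 1; /* portable fallback for other targets */
#endif
}

/* Usage sketch */
static void check_increment_and_touch_rcx(void)
{
    assert(increment_and_touch_rcx(5L) == 6L);
}
```

Without the "rcx" clobber the function might still appear to work, which is what makes this class of bug hard to spot.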
I suspect the overwrite list is just to give GCC a hint not to store anything of value in these registers across the ASM call; since GCC doesn't analyze what ASM you're giving it, and certain instructions have side-effects that touch other registers not explicitly named in the code, this is the way to tell GCC about it.
