Optimizing used registers when using inline ARM assembly in GCC - gcc

I want to write some inline ARM assembly in my C code. For this code, I need to use a register or two more than just the ones declared as inputs and outputs to the function. I know how to use the clobber list to tell GCC that I will be using some extra registers to do my computation.
However, I am sure that GCC enjoys the freedom to shuffle around which registers are used for what when optimizing. That is, I get the feeling it is a bad idea to use a fixed register for my computations.
What is the best way to use some extra register that is neither input nor output of my inline assembly, without using a fixed register?
P.S. I was thinking that using a dummy output variable might do the trick, but I'm not sure what kind of weird other effects that will have...

Ok, I've found a source that backs up the idea of using dummy outputs instead of hard registers:
4.8 Temporary registers:
People also sometimes erroneously use clobbers for temporary registers. The right way is
to make up a dummy output, and use “=r” or “=&r” depending on the permitted overlap
with the inputs. GCC allocates a register for the dummy value. The difference is that
GCC can pick a convenient register, so it has more flexibility.
from page 20 of this pdf.
For anyone who is interested in more info on inline assembly with GCC this website turned out to be very instructive.

Related

Documentation for MIPS predefined macros

When I compile a C code using GCC to MIPS, it contains code like:
daddiu $28,$28,%lo(%neg(%gp_rel(f)))
And I have trouble understanding instructions starting with %.
I found that they are called macros and predefined macros are dependent on the assembler but I couldn't find description of the macros (as %lo, %neg etc.) in the documentation of gas.
So does there exist any official documentation that explains macros used by GCC when generating MIPS code?
EDIT: The snippet of the code comes from this code.
This is a very odd instruction to find in compiled C code, since this instruction is not just using $28/$gp as a source but also updating that register, which the compiler shouldn't be doing, I would think.  That register is the global data pointer, which is setup on program start, and used by all code accessing near global variables, so it shouldn't ever change once established.  (Share a godbolt.org example, if you would.)
The functions you're referring to are for composing the address of labels that are located in global data.  Unlike x86, MIPS cannot load (or otherwise have) a 32-bit immediate in one instruction, and so it uses multiple instructions to do work with 32-bit immediates including address immediates.  A 32-bit immediate is subdivided into 2 parts — the top 16-bits are loaded using an LUI and the bottom 16-bits using an ADDI (or LW/SW instruction), forming a 2 instruction sequence.
MARS does not support these built-in functions.  Instead, it uses the pseudo instruction, la $reg, label, which is expanded by the assembler into such a sequence.  MARS also allows lw $reg, label to directly access the value of a global variable, however, that also expands to multiple instruction sequence (sometimes 3 instructions of which only 2 are really necessary..).
%lo computes the low 16-bits of a 32-bit address for the label of the argument to the "function".  %hi computes the upper 16-bits of same, and would be used with LUI.  Fundamentally, I would look at these "functions" as being a syntax for the assembly author to communicate to the assembler to share certain relocation information/requirements to the linker.  (In reverse, a disassembler may read relocation information and determine usage of %lo or %hi, and reflect that in the disassembly.)
I don't know %neg() or %gp_rel(), though could guess that %neg negates and %gp_rel produces the $28/$gp relative value of the label.
%lo and %hi are a bit odd in that the value of the high immediate sometimes is offset by +1 — this is done when the low 16-bits will appear negative.  ADDI and LW/SW will sign extend, which will add -1 to the upper 16-bits loaded via LUI, so %hi offsets its value by +1 to compensate when that happens.  This is part of the linker's operation since it knows the full 32-bit address of the label.
That generated code is super weird, and completely different from that generated by the same compiler, but 32-bit version.  I added the option -msym32 and then the generated code looks like I would expect.
So, this has something to do with the large(?) memory model on MIPS 64, using a multiple instruction sequence to locate and invoke g, and swapping the $28/$gp register as part of the call.  Register $25/$t9 is somehow also involved as the generated code sources it without defining it; later, prior to where we would expect the call it sets $25.
One thing I particularly don't understand, though, is where is the actual function invocation in that sequence!  I would have expected a jalr instruction, if it's using an indirect branch because it doesn't know where g is (except as data), but there's virtually nothing but loads and stores.
There are two additional oddities in the output: one is the blank line near where the actual invocation should be (maybe those are normal, but usually don't see those inside a function) and the other is a nop that is unnecessary but might have been intended for use in the delay slot following an invocation instruction.

Why does GCC use frame pointer when I call Win32 functions with arguments?

When I compile 32-bit C code with GCC and the -fomit-frame-pointer option, the frame pointer (ebp) is not used unless my function calls Windows API functions with stdcall and atleast one parameter.
For example, if I only use GetCommandLine() from the Windows API, which has no parameters/arguments, GCC will omit the frame pointer and use ebp for other things, speeding up the code and not having that useless prologue.
But the moment I call a stdcall Win32 function that accepts at least one argument, GCC completely ignores the -fomit-frame-pointer and uses the frame pointer anyway, and the code is worse in inspection as it can't use ebp for general purpose things. Not to mention I find the frame pointer quite pointless. I mean, I want to compile for release and distribution, why should I care about debugging? (if I want to debug I'll just use a debug build instead after reproducing the bug)
My stack most certainly does NOT contain dynamic allocation like alloca. So, the stack has a defined structure yet GCC chooses the dumb method despite my options? Is there something I'm missing to force it to not use frame pointer?
My second grip I have with it is that it refuses to use "push" instructions for Win32 functions. Every other compiler I tried, they used push instructions to push on the stack, resulting in much better more compact code, not to mention it is the most natural way to push arguments for stdcall. Yet GCC stubbornly uses "mov" instructions to move in each spot, manually, at offsets relative to esp because it needs to keep the stack pointer completely static. stdcall is made to be easy on the caller, and yet GCC completely misses the point of stdcall since it generates this crappy code when interfacing with it. What's worse, since the stack pointer is static, it still uses a frame pointer? Just why?
I tried -mpush-args, it doesn't do anything.
I also noticed that if I make my stack big enough for it to exceed a page (4096 bytes), GCC will add a prologue with a function that does nothing but "bitwise or" the stack every 4096 bytes with zero (which does nothing). I assume it's for touching the stack and automatically commiting memory with page faults if the stack was reserved? Unfortunately, it does this even if I set the initial commit of the stack (not reserve) to high enough to hold my stack, not to mention this shouldn't even be needed in the first place. Redundant code at its best.
Are these bugs in GCC? Or something I'm missing in options? Should I use something else? Please tell me if I'm missing some options.
I seriously hope I won't have to make an inline asm macro just to call stdcall functions and use push instructions (and this will avoid frame pointer too I guess). That sounds really overkill for something so basic that should be in compilers of today. And yes I use GCC 4.8.1 so not an ancient version.
As extra question, is it possible to force GCC to not save registers on the stack at function prologue? I use my own direct entry point with -nostartfiles argument, because it is a pure Windows application and it works just fine without standard lib startup. If I use attribute((noreturn)), it will discard the epilogue restoring the registers but it will still push them on the stack at prologue, I don't know if there's a way to force it to not save registers for this entry point function. Either way not a big deal in the least, it would just feel more complete I guess. Thanks!
See the answer Force GCC to push arguments on the stack before calling function (using PUSH instruction)
I.e. try -mpush-args -mno-accumulate-outgoing-args. It may also require -mno-stack-arg-probe if gcc complains.
It looks like supplying the -mpush-args -mno-accumulate-outgoing-args -mno-stack-arg-probe works, specifically the last one. Now the code is cleaner and more normal like other compilers, and it uses PUSH for arguments, even makes it easier to track in OllyDbg this way.
Unfortunately, this FORCES the stupid frame pointer to be used, even in small functions that absolutely do not need it at all. Seriously is there a way to absolutely force GCC to disable the frame pointer?!

change instruction set in GCC

I want to test some architecture changes on an already existing architecture (x86) using simulators. However to properly test them and run benchmarks, I might have to make some changes to the instruction set, Is there a way to add these changes to GCC or any other compiler?
Simple solution:
One common approach is to add inline assembly, and encode the instruction bytes directly.
For example:
int main()
{
asm __volatile__ (".byte 0x90\n");
return 0;
}
compiles (gcc -O3) into:
00000000004005a0 <main>:
4005a0: 90 nop
4005a1: 31 c0 xor %eax,%eax
4005a3: c3 retq
So just replace 0x90 with your inst bytes. Of course you wont see the actual instruction on a regular objdump, and the program would likely not run on your system (unless you use one of the nop combinations), but the simulator should recognize it if it's properly implemented there.
Note that you can't expect the compiler to optimize well for you when it doesn't know this instruction, and you should take care and work with inline assembly clobber/input/output options if it changes state (registers, memory), to ensure correctness. Use optimizations only if you must.
Complicated solution
The alternative approach is to implement this in your compiler - it can be done in gcc, but as stated in the comments LLVM is probably one of the best ones to play with, as it's designed as a compiler development platform, but it's still very complicated as LLVM is best suited for IR optimization stages, and is somewhat less friendly when trying to modify the target-specific backends.
Still, it's doable, and you have to do that if you also plan to have your compiler decide when to issue this instruction. I'd suggest to start from the first option though, to see if your simulator even works with this addition, and only then spending time on the compiler side.
If and when you do decide to implement this in LLVM, your best bet is to define it as an intrinsic function, there's relatively more documentation about this in here - http://llvm.org/docs/ExtendingLLVM.html
You can add new instructions, or change existing by modifying group of files in GCC called "machine description". Instruction patterns in <target>.md file, some code in <target>.c file, predicates, constraints and so on. All of these lays in $GCCHOME/gcc/config/<target>/ folder. All of this stuff using on step of generation ASM code from RTL. You can also change cases of emiting instructions by change some other general GCC source files, change SSA tree generation, RTL generation, but all of this a little bit complicated.
A simple explanation what`s happened:
https://www.cse.iitb.ac.in/grc/slides/cgotut-gcc/topic5-md-intro.pdf
It's doable, and I've done it, but it's tedious. It is basically the process of porting the compiler to a new platform, using an existing platform as a model. Somewhere in GCC there is a file that defines the instruction set, and it goes through various processes during compilation that generate further code and data. It's 20+ years since I did it so I have forgotten all the details, sorry.

How can I get a list of legal ARM opcodes from gcc (or elsewhere)?

I'd like to generate pseudo-random ARM instructions. Via assembler directives, I can tell gcc what mode I'm in, and it will complain if I try a set of opcodes and operands that's not legal in that mode, so it must have some internal listing of what can be done in which mode. Where does that live? Would it be easier to extract that info from LLVM?
Is this question "not even wrong"? Should I try a different approach entirely?
To answer my own question, this is actually really easy to do from arm.md and and constraints.md in gcc/config/arm/. I probably spent more time answering asking this question and answering comments for it than I did figuring this out. Turns out I just need to look for 'TARGET_THUMB1', until I get around to implementing thumb2.
For the ARM family the buck stops at the ARM ARM (ARM Architectural Reference Manual). There is an ARM instruction set section and a Thumb instruction set section. Within both each instruction tells you what generation (ARMvX where X is some number like 4 (arm7), or 5 (arm9 time frame) ,etc). Since the opcode and pseudo code is listed for each instruction you should be able to figure out what is a real instruction and, if any, are syntax to save typing on another (push and pop for example).
With the Cortex-m3 and thumb2 in particular you also need to look at the TRM (Technical Reference Manual) as well. ARM has, I forget the name, a universal syntax they are trying to use that should work on both Thumb and ARM. For example on an ARM you have three register instructions:
add r1,r1,r2
In thumb there are only two register operations
add r1,r2
The desire basically is to meet in the middle or I would say more accurately to encourage ARM assemblers to parse Thumb instructions and encode them with the equivalent ARM instruction without complaining. This may have started with thumb and not thumb2, I have always separated the two syntaxes in my code until recently (and I still generally use ARM syntax for ARM and Thumb for Thumb).
And then yes you have to see what the specific implementation of the assembler tool is, in your case binutils. And it sounds like you have found the binutils/gnu secret decoder ring.

How to get GCC to use more than two SIMD registers when using intrinsics?

I am writing some code and trying to speed it up using SIMD intrinsics SSE2/3. My code is of such nature that I need to load some data into an XMM register and act on it many times. When I'm looking at the assembler code generated, it seems that GCC keeps flushing the data back to the memory, in order to reload something else in XMM0 and XMM1. I am compiling for x86-64 so I have 15 registers. Why is GCC using only two and what can I do to ask it to use more? Is there any way that I can "pin" some value in a register? I added the "register" keyword to my variable definition, but the generated assembly code is identical.
Yes, you can. Explicit Reg Vars talks about the syntax you need to pin a variable to a specific register.
If you're getting to the point where you're specifying individual registers for each intrinsic, you might as well just write the assembly directory, especially given gcc's nasty habit of pessimizing intrinsics unnecessarily in many cases.
It sounds like you compiled with optimization disabled, so no variables are kept in registers between C statements, not even int.
Compile with gcc -O3 -march=native to let the compiler make non-terrible asm, optimized for your machine. The default is -O0 with a "generic" target ISA and tuning.
See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why "debug" builds in general are like that, and the fact that register int foo; or register __m128 bar; can stay in a register even in a debug build. But it's much better to actually have the compiler optimize, as well as using registers, if you want your code to run fast overall!

Resources