Why does arm gcc use bx lr to return from functions?

I know bx is used for switching to thumb, and from this answer I came to know that:
BX won't switch to Thumb mode if the least significant bit of the target address is 0. In other words, it can be used as a regular branch as well.
I've noticed that bx lr is also generated when compiling with -marm, so it should never do the switch and should always behave like a normal branch.
So my question is, why does the compiler generate this bx lr, as opposed to mov pc, lr or push {lr} [...] pop {pc}?

This is heavily opinion based, and you would have to dig up the authors of each compiler and ask them directly.
Why not? Before Thumb, compilers like GCC used mov pc,lr because that was a return. Once Thumb came along and got integrated into the compilers, that changed to bx lr. It makes no sense to if-then-else this, since bx lr works for all cases; there is no need to add code for a separate mov pc,lr path.
As for pop {pc}: the compiler will use it on the architectures where it can, and this is the fairer question. For example, ARMv4T cannot switch to Thumb mode with a pop, so you will see the return constructed as a pop into lr followed by a bx lr (or a pop into a low register if in Thumb mode). Whereas if no interworking is needed at all and the code is in ARM mode, a pop {pc} would have worked. I would assume -marm does not disable thumb-interwork; it just means you want it to generate ARM instructions. If you specify the architecture, I think even ARMv5, it does not generate the extra bx lr and will generate a pop {pc}.
So you can try it without interworking on ARMv4T and see if it uses a pop {pc}.
But in general you would need to contact the individual compiler developers directly for these kinds of "why" questions.


How to instrument specific assembly instructions and get their arguments [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 5 years ago.
Given any C/C++ source.c compiled with gcc
int func()
{
// bunch of code
...
}
will result in some assembly, for example:
func():
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
mov r3, #0
str r3, [fp, #-8]
mov r3, #55
sub sp, fp, #0
ldr fp, [sp], #4
bx lr
...which eventually gets turned into a binary, source.obj.
What I want is the ability to specify: before each assembly instruction X, call my custom function and pass as arguments the arguments of instruction X
I'm really only interested in whether a given assembly instruction executes. If I say I care about mul, I'm not necessarily saying I care whether a multiplication occurred in the original source. I understand that a multiply by 2^N will result in a shift instruction. I get it.
Let's say I specify mov as the asm of interest.
The resulting assembly would be changed to the following
func():
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
// custom code inserted here:
// I want to call another function with the arguments of **mov**
mov r3, #0
str r3, [fp, #-8]
// custom code inserted here:
// I want to call another function with the arguments of **mov**
mov r3, #55
sub sp, fp, #0
ldr fp, [sp], #4
bx lr
I understand that the custom code may have to push/pop any registers it uses, depending on how much gcc "knows" about it with respect to the registers it uses. The custom function may be a naked function.
WHY
To toggle a pin to do real-time profiling every time instruction X is executed.
To record every time the arguments of X meet certain criteria.
Your question is unclear (even with the additional edit). The -finstrument-functions option is not transforming assembler code; it changes the way the compiler works during optimizations and code generation. It works on intermediate compiler representations (probably at the GIMPLE level, not at the assembler or RTL level).
Perhaps you could code some GCC plugin which would work at the GIMPLE level (by adding an optimization pass transforming the appropriate GIMPLE; BTW the -finstrument-functions option is adding more passes). This could take months of work (you need to understand the internals of GCC), and you'll add your own instrumentation generating pass in the compiler.
Perhaps you are using some asm in your code. Then you might use some preprocessor macro to insert some code around it.
Perhaps you want to change your ABI or calling conventions (or the way GCC is generating assembler code). Then you need to patch the compiler itself (and implement a new target in it). This might require more than a year of work.
Be aware of various optimizations done by GCC. Sometimes you might want volatile asm instead of just asm.
My documentation page of GCC MELT gives many slides and links which should help you.
Is it possible to do this with any compiler?
Both GCC and Clang are free software, so you can study their source code and improve it for your needs. But both are very complex (many millions of lines of source code), and you'll need several years of work to fork them. By the time you did that, they would evolve significantly.
what I’d like to do is choose a set of assembly instructions - like { add, jump } - and tell the compiler to insert a snippet of my own custom assembly code just before any instruction in that set
You should read some book on compilers (e.g. the Dragon Book) and read another book on Instruction Set Architecture and Computer Architecture. You can't just insert arbitrarily some instructions in the assembler code generated by the compiler (because what you insert requires some processor resources that the compiler did manage, e.g. thru register allocation etc...)
After your edit:
// I want to call another function with the arguments of mov
mov r3, #0
This is not possible (or very difficult) in general. Because calling that other function will use r3 and spoil its content.
gcc -c source.c -o source.obj
is the wrong way to use GCC. You want optimization (especially for production binaries). If you care about the assembler code, use gcc -O -Wall -fverbose-asm -S source.c (perhaps -O2 -march=native instead of -O), then look into source.s.
Let's say I specify mul as the asm of interest.
Again, that is the wrong approach. You care about multiplication in the source code, or in some intermediate representation. Perhaps mul might be emitted for x*3 without -O but probably not with -O2
Think and work at the GIMPLE level, not at the assembler level.
examples
First, look into the source code of GCC. It is free software. If you want to understand how -finstrument-functions really works, take a few months to read about GCC internals (I gave links and references), study the actual source code of GCC, and ask on gcc@gcc.gnu.org after that.
Now, imagine you want to count and instrument how many multiplications are done (which is not the same as how many IMUL instruction, e.g. because 8*x will probably be optimized as a shift machine code instruction). Of course it depends upon the optimizations enabled, and you'll work at the GIMPLE level. You'll probably increment some counter at the end of every GCC basic blocks. So after each BB exit you'll insert an additional GIMPLE statement. Such a simple instrumentation could need months of work.
Or imagine that you want to instrument loads to detect, when possible, undefined behavior or addressing issues. This is what the address sanitizer does. It took several years of work.
Things are much more complex than what you believe.
(It is not for nothing that GCC has about ten million lines of source code; C compilers need to be complex today.)
If you don't care about the C source code, you should not care about GCC. The assembler code could be produced by Bones, by Clang, by a JVM implementation, by ocamlopt etc (and all these don't even use GCC). Or could be produced by some other version of GCC (not the one you are instrumenting).
So spend a few weeks reading more about compilers, then ask another question. That question should mention what kind of binary or assembler you want to instrument. Instrumenting assembler code (or a binary executable) is a lot harder than instrumenting GCC, and it doesn't use textual techniques at all: it first extracts an abstracted form of the control flow graph, then refines and reasons on it.
BTW, you'll find lots of textbooks and conferences on both source instrumentation and binary instrumentation (these topics are different, even if related). Spend a few months reading them. Your naive textual approach has a 1960s smell to it, which won't scale and won't work on today's software.
See also this talk (and video): Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” CppCon 2017

Understanding 8086 assembler debugger

I'm learning assembler and I need some help with understanding codes in the debugger, especially the marked part.
mov ax, a
mov bx, 4
I know how above instructions works, but in the debugger I have "2EA10301" and "BB0400".
What do they mean?
The first instruction moves the variable a from the data segment into the ax register, but in the debugger I see cs:[0103].
What do these brackets and numbers mean?
Thanks for any help.
The 2EA10301 and BB0400 numbers are the opcodes for the two instructions highlighted.
2E is Code Segment (CS) prefix and instructs the CPU to access memory with the CS segment instead of the default DS one.
A1 is the opcode for MOV AX, moffs16 and 0301 is the immediate 0103h in little endian, the address to read from.
So 2EA10301 is mov ax, cs:[103h].
The square brackets are the preferred way to denote a memory access through one of the addressing modes, but some assemblers support the confusing syntax without the brackets.
As that bracket-less syntax is ambiguous and less standardised across assemblers, it is discouraged.
During assembly, the assembler keeps a location counter that is incremented for each byte emitted (each "section"/segment has its own counter, i.e. the counter is reset at the beginning of each "section").
This gives each variable an offset that is used to access it and to craft the instruction; variable names are for humans, CPUs can only read from addresses, i.e. numbers.
This offset will later be an address in memory once the file is loaded.
The assembler, the linker and the loader cooperate, there are various tricks at play, to make sure the final instruction is properly formed in memory and that the offset is transformed into the right address.
In your example their efforts culminate in the value 103h, that is the address of a in memory.
Again, in your example, the offset, if the file is a COM (by the way, don't put variables in the execution flow), was still 103h due to the peculiar structure of the COM files.
But in general, it could have been another number.
BB is MOV r16, imm16 with the register BX. The base form is B8 with the lower 3 bits indicating the register to use, BX is denoted by a value of 3 (011b in binary) and indeed 0B8h + 3 = 0BBh.
After the opcode comes, again, a WORD immediate: 0400, which encodes 4 in little endian.
You now are in the position to realise that the assembly source is not always fully informative, as the assemblers implement some form of syntactic sugar.
The instruction mov ax, a is syntactically identical to mov bx, 4, and taken literally would mean "move the immediate value, constant and known at assembly time, given by the address of a into ax". It is instead interpreted as "move the content of a, a value present in memory and readable only with a memory access, into ax", because a is known to be a variable.
This phenomenon is limited in the x86, being CISC, and more widespread in the RISC world, where the lack of commonly needed instructions is compensated with pseudo-instructions.
Well, first, "assembler" here means x86 assembly; the assembler is the tool that turns the instructions into machine code.
When you disassemble programs, the debugger will show the hex opcodes (like 90 for NOP, or B8 for a move into AX).
Square brackets denote a memory access at the address written inside them.
The hex on the side is the address.
Everything is very simple. The instruction mov ax, cs:[0103] loads the value 000Ah into the register ax. This value is taken from the code segment at 0103h. Slightly higher in the picture you can see this value: cs:0101 0B900A00. Accordingly, address 0101h holds the value 0Bh, 0102h holds 90h, 0103h holds 0Ah, and 0104h holds 00h. So the AL register is loaded from address 0103h with 0Ah, the AH register is loaded from address 0104h with 00h, and therefore ax = 000Ah. If instead of mov ax, cs:[0103] there had been mov ax, cs:[0101], then ax = 900Bh; with mov ax, cs:[0102], ax = 0A90h.

"PUSH" "POP" Or "MOVE"?

When it comes to temporarily storing an existing value from a register, all modern compilers (at least the ones I have experience with) use PUSH and POP instructions. But why not store the data in another register, if one is available?
So, where should the temporary storage for an existing value go: stack or register?
Consider the following 1st Code:
MOV ECX,16
LOOP:
PUSH ECX ;Value saved to stack
... ;Assume that here's some code that must uses ECX register
POP ECX ;Value released from stack
SUB ECX,1
JNZ LOOP
Now consider the 2nd code:
MOV ECX,16
LOOP:
MOV ESI,ECX ;Value saved to ESI register
... ;Assume that here's some code that must uses ECX register
MOV ECX,ESI ;Value returned to ECX register
SUB ECX,1
JNZ LOOP
After all, which one of the above code is better and why?
Personally I think the first code is better on size, since PUSH and POP take only 1 byte each while MOV takes 2; and the second code is better on speed, because moving data between registers is faster than a memory access.
It does make a lot of sense to do that. But I think the simplest answer is that all the other registers are already being used; in order to use some other register, you would first need to push its value on the stack.
Compilers are smart enough; keeping track of what is in a register is somewhat trivial for a compiler, that is not a problem. Speaking generically, not necessarily x86-specific (especially when you have more registers than an x86): some registers are used for input in your calling convention, some you can trash (these may or may not be the same as the input ones), and some you can't trash, you have to preserve them first. Some instruction sets also have special registers: you must use this one for auto-increment, that one for register-indirect, etc.
It is easy to get the compiler to produce code, for an ARM for example, where the input registers and the trashable registers are the same set. But that means that if the function calls another function and needs a value after the return, it has to save something:
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
return(more_fun(x)+x);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: ebfffffe bl 0 <more_fun>
c: e0840000 add r0, r4, r0
10: e8bd4010 pop {r4, lr}
14: e12fff1e bx lr
I told you it was trivial. Now, to use your argument backward: why didn't they just push r0 on the stack and pop it off later, why push r4? r0-r3 are used for input and are volatile, r0 is the return register when the result fits, and r4 almost all the way up you have to preserve (with one exception, I think).
So r4 is assumed to be in use by the caller or some caller up the line; the calling convention dictates you cannot trash it, you must preserve it, so you have to assume it is in use. You can trash r0-r3, but you can't use one of those for storage across the call, since the callee can trash them too. In this case we need to take the incoming value x and both use it (pass it on) and preserve it for after the return, so they did both: they "used another register with a move", but in order to do that they preserved that other register.
Why save r4 to the stack in this case is very obvious: you can save it up front along with the return address. In particular, ARM wants you to always use the stack in 64-bit chunks, so ideally two registers at a time, or at least keeping it aligned on a 64-bit boundary. You have to save lr anyway, so they push something else too even when they don't have to; in this case the saving of r4 is a freebie. And since they need to save r0 and use it at the same time, r4 or r5 or something above is a good choice.
BTW, it looks like an x86 compiler did the same with the code above:
0000000000000000 <fun>:
0: 53 push %rbx
1: 89 fb mov %edi,%ebx
3: e8 00 00 00 00 callq 8 <fun+0x8>
8: 01 d8 add %ebx,%eax
a: 5b pop %rbx
b: c3 retq
A demonstration of them pushing something that they don't need to preserve:
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
return(more_fun(x)+1);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <more_fun>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
No reason to save r4; they just needed some register to keep the stack aligned, so in this case r4 was chosen. With some versions of this compiler you will see r3 or some other register used.
Remember, humans (still) write the compilers and the optimizers, so the why-this and why-that is really a question for those humans, and we can't really tell you what they were thinking. It is not a simple task for sure, but it is not hard to take a reasonably sized function and/or project and find opportunities to hand-tune the compiler output, to improve it. Of course beauty is in the eye of the beholder: one definition of "improve" is another's definition of "make worse". One instruction mix might use fewer total instruction bytes, so that is "better" by program-size standards; another may or may not use more instructions or bytes but execute faster; one might have fewer memory accesses at the cost of more instructions, to ideally execute faster; and so on.
There are architectures with hundreds of general-purpose registers, but most of the ones in the products we touch daily don't have that many, so you can generally write a function, or some code, with so many variables in flight that you have to start saving off to the stack mid-function. So you can't always just save a few registers at the beginning and end of the function to give yourself more working registers mid-function, if the number of working registers you need mid-function is more than you have. It actually takes some practice to write code that doesn't optimize down to needing only a few registers, but once you start to see how the compilers work by examining their output, you can write trivial functions like the ones above to defeat optimizations or force preservation of registers mid-function, etc.
At the end of the day, for the compiler to be somewhat sane it needs a calling convention; it keeps the authors from going crazy and the compiler from being a nightmare to code and manage. And the calling convention very clearly defines the input and output register(s), any volatile registers, and the ones that have to be preserved.
unsigned int fun ( unsigned int x, unsigned int y, unsigned int z )
{
unsigned int a;
a=x<<y;
a+=(y<<z);
a+=x+y+z;
return(a);
}
00000000 <fun>:
0: e0813002 add r3, r1, r2
4: e0833000 add r3, r3, r0
8: e0832211 add r2, r3, r1, lsl r2
c: e0820110 add r0, r2, r0, lsl r1
10: e12fff1e bx lr
I only spent a few seconds on that but could have worked harder on it. I didn't push past four registers total; granted, I had four variables. And I didn't call any functions, so the compiler was free to trash r0-r3 as needed as the dependencies worked out. So it didn't have to save r4 to create temporary storage, and it didn't have to use the stack; it just optimized the order of execution to, for example, free up r2 (the z variable) so that later it could use r2 as an intermediate variable, holding one of the instances of "a equals something". That keeps it down to four registers instead of burning a fifth.
If I were more creative with my code and added calls to functions, I could get it to burn a lot more registers. As even this last case shows, the compiler has no problem whatsoever keeping track of what is where, and when you play with compilers you will see there is no reason they have to keep your high-level-language variables intact in the same register throughout, much less execute in the same order you wrote your code (so long as it is legal). But they are still at the mercy of the calling convention: only some of the registers are considered volatile, and if you call a function from your function at a certain point, then you have to preserve live content, so you can't use the volatile registers as long-term storage; and the non-volatile ones are already considered consumed, so they have to be preserved before you can use them. Then it becomes, in part, a performance question: does it cost more (size, speed, etc.) to save to the stack on the fly, or can I preserve up front in a way that possibly reduces instructions, or that is invisible and/or consumes fewer clocks with one larger transfer rather than separate, less efficient transfers mid-function?
I have said this seven times now, but the bottom line is the calling convention for that compiler (version) and target (and command-line options/defaults). If you have volatile registers (an arbitrary calling-convention choice for general-purpose registers, not a hardware/ISA thing) and you are not calling any other functions, then they are easy to use and save you expensive stack (memory) transactions. If you are calling someone, then the callee can trash them, so they may no longer be free; it depends on your code. The non-volatile registers are considered consumed by callers, so you have to burn stack operations in order to use them; they are not free either. And then it becomes a performance question as to when and where to use the stack: pushes and pops and movs. No two compilers are expected to generate the same code even if they use the same convention, but as you can see above it is somewhat trivial to make test functions, compile them, and examine the output, tweaking here and there to navigate through and around that (compiler, version, target, convention, command-line options) optimizer.
Using a register is a bit faster, but it requires you to keep track of which registers are available, and you can run out of registers. Also, this method cannot be used recursively. In addition, some registers will get trashed if you use INT or CALL to invoke a subroutine.
Use of the stack (POP and PUSH) can be used as many times as needed (so long as you don't run out of stack space), and in addition it supports recursive logic. You can use the stack safely with INT or CALL because by convention any subroutine should reserve its own portion of the stack, and must restore it to its previous state (or else the RET instruction would fail).
Do trust the work of the optimizing compiler, based on decades of work by code-generation specialists.
They fill as many registers as are available and spill to the stack when needed, comparing different options. They also weigh the tradeoff between storing a value for later reuse and recomputing it.
There is no single rule "register vs. stack"; it's a matter of global optimization, taking the processor's peculiarities into account. And in general there is no single "best solution", as it will depend on your criteria for "best".
Except when very creative workarounds can be found (or when exploiting data properties known only to you), you can't beat a compiler.
When thinking about speed, you always have to keep a sense of proportion. If the function being compiled calls other functions, those push and pop instructions may be insignificant compared to the number of instructions executed between them. Compiler writers know that in that kind of case, which is very common, one shouldn't be penny-wise and pound-foolish.
By using PUSH and POP, you can save at least one register. This is significant if you are working with a limited set of registers. On the other hand, yes, sometimes using MOV is better for speed, but you also have to keep track of which register is used as the temporary storage, which gets hard if you want to store several values that need to be processed later.

Assembly registers [closed]

Closed 7 years ago.
My question WAS about getting as much info as I could about registers... no luck :/
Everyone got everything so wrong (probably because English is not my native language).
So, the question will be more general... ;(
I need a tutorial with the BASICS!
Ah... could I be more non-specific?
Also, thanks for the help in advance!
In general you can use any of eax, ebx, ecx, edx, esi and edi pretty much as you want. They can each hold any 32-bit value.
Keep in mind that if you call any Win32 API functions that they are free to modify eax, ecx and edx. So if you need to preserve the values of those registers across a function call you'll have to save them somewhere temporarily (e.g. on the stack).
Similarly, if you write a function that is to be called by another function (e.g. a Windows callback), you should preserve ebx, esi, edi and ebp within that function.
Some instructions are hardcoded to use certain registers. For example, the loop instruction uses (e)cx, the string instructions use esi/edi, the div instruction uses eax/edx, etc. You can find all such cases by going through the descriptions for all the instructions in Intel's manual.
The "fixed uses" of the registers derive from the ancient roots back in the 8086 days (and in some ways, even from before that).
The 8086 was an accumulator machine: you were supposed to do math mostly with ax (there was no eax yet) and a bit with dx. You can see this in many instructions; for example, most ALU ops have a smaller encoding for op ax, imm (also op al, imm) than for op other, imm, and the ancient decimal math instructions (daa and friends) operate only on al. There are instructions that always reference (e)ax and maybe (e)dx as the "high half": see the "old multiplication" (with a single explicit operand). imul with an immediate was added in the 80186, and imul reg, r/m was added in the 80386, which added a whole lot of stuff including 32-bit mode. With 32-bit mode also came the modern ModRM/SIB structure, alongside the old 16-bit version. In the old version there are only 4 registers that could ever be used in a memory operand, so there's a bit of the "fixed roles for registers" again. 32-bit mode mostly removed that, except that esp can never be the index register (which wouldn't normally make sense anyway).
More recently, Haswell introduced shlx which removes the restriction that shifting by a variable amount could only be done using cl as the count, and mulx partially removed the fixed roles of registers for "wide multiplication" (80186 and 80386 only added the "general" forms for multiplication without the high half), mulx still gives edx a fixed role though.
More strangely, the relatively recently added pblendvb assigned a fixed role to xmm0, previous to that the vector registers weren't encumbered by such old-fashioned restrictions. That fixed role disappeared with AVX though, which allowed the extra operand to be encoded. pcmpistri and friends still assign a fixed role to ecx though.
With x64 came a change to 8-bit register operands: if a REX prefix is present, it is now possible to use spl, bpl, sil and dil (previously unencodable), but at the cost of no longer being able to address ah, ch, dh or bh. That's probably a symptom of moving away from special roles too: previously it wouldn't have made much sense to be able to use bpl, but now that it's "more general purpose" it might have some uses (it's still often used as part of the base pointer, though).
The general pattern is towards fewer restrictions/fixed roles. But much of the history of x86 is still visible today.
As a general comment, before you go much further, I recommend adopting a programming style, or you'll find it very hard to follow your own code. Below is a formatted example of your code, maybe not everything is correctly formatted but it gives you an idea. Once in the habit, it's easier than making higgledy-piggledy code. One of its main advantages, is with practice you can cast your eye down the code and follow it far quicker than if you have to read every line.
.386
.model flat, stdcall
option casemap :none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\masm32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\masm32.lib
.data
ProgramText db "Hello World!", 0
BadText db "Error: Sum is incorrect value", 0
GoodText db "Excellent! Sum is 6", 0
Sum sdword 0
.code
start:
; eax
mov ecx, 6 ; set the counter to 6 ?
xor eax, eax ; set eax to 0
_label:
add eax, ecx ; add the numbers ?
dec ecx ; from 0 to 6 ?
jnz _label ; 21
mov edx, 7 ; 21
mul edx ; multiply by 7 147
push eax ; pushes eax into the stack
pop Sum ; pops eax and places it in Sum
cmp Sum, 147 ; compares Sum to 147
jz _good ; if they are equal, go to _good
_bad:
invoke StdOut, addr BadText
jmp _quit
_good:
invoke StdOut, addr GoodText
_quit:
invoke ExitProcess, 0
end start
I'll single out one line:
push eax ; pushes eax into the stack
Don't use comments to explain what an instruction does: use them to say what you are trying to achieve, or what the register represents, to give added value to the code.
Good luck to you: plenty of practice and midnight oil!

Subtract and detect underflow, most efficient way? (x86/64 with GCC)

I'm using GCC 4.8.1 to compile C code and I need to detect whether underflow occurs in a subtraction on the x86/64 architecture. Both operands are UNSIGNED. I know it is very easy in assembly, but I'm wondering if I can do it in C code and have GCC optimize it accordingly, because I can't manage it. This function is used a lot (it's low-level, is that the term?), so I need it to be efficient, but GCC seems too dumb to recognize this simple operation. I tried many ways to give it hints in C, but it always uses two registers instead of just a sub and a conditional jump. And to be honest I get annoyed seeing such stupid code emitted so MANY times (the function is called a lot).
My best approach in C seemed to be the following:
if((a-=b)+b < b) {
// underflow here
}
Basically: subtract b from a, and if the result underflows, detect it and do some conditional processing (which is unrelated to a's value; for example, it raises an error).
GCC seems too dumb to reduce the above to just a sub and a conditional jump, and believe me, I tried many ways to express this in C code, and tried a lot of command-line options (-O3 and -Os included, of course). What GCC does is something like this (Intel-syntax assembly):
mov rax, rcx ; 'a' is in rcx
sub rcx, rdx ; 'b' is in rdx
cmp rax, rdx ; useless comparison since sub already sets flags
jc underflow
Needless to say the above is stupid, when all it needs is this:
sub rcx, rdx
jc underflow
This is so annoying because GCC does understand that sub modifies the flags that way: if I typecast to int, it generates exactly the above except with js (jump on sign) instead of jc, which will not work if the difference of the unsigned values is large enough to set the high bit. Nevertheless, it shows GCC is aware of the sub instruction affecting those flags.
Now, maybe I should give up on trying to make GCC optimize this properly and do it with inline assembly, which I have no problem with. Unfortunately, this requires "asm goto" because I need a conditional jump to a label.
I tried something, but I have no idea if it is "safe" to use or not. asm goto can't have outputs for some reason. I do not want it to flush all registers to memory; that would kill the entire point of doing this, which is efficiency. But if I use empty asm statements with outputs set to the a variable before and after it, will that work, and is it safe? Here's my macro:
#define subchk(a,b,g) { typeof(a) _a=a; \
asm("":"+rm"(_a)::"cc"); \
asm goto("sub %1,%0;jc %l2"::"r,m,r"(_a),"r,r,m"(b):"cc":g); \
asm("":"+rm"(_a)::"cc"); }
and using it like this:
subchk(a,b,underflow)
// normal code with no underflow
// ...
underflow:
// underflow occured here
It's a bit ugly but it works just fine. In my test scenario it compiles just FINE, without the volatile overhead (flushing registers to memory) and without generating anything bad, and it seems to work OK. However, this is just a limited test; I can't possibly test this everywhere I use this function/macro, since as I said it is used A LOT. So I'd like to know from someone knowledgeable: is there anything unsafe about the above construct?
In particular, the value of a is NOT NEEDED if underflow occurs, so with that in mind, are there any side effects or unsafe things that can happen with my inline asm macro? If not, I'll use it without problems until they optimize the compiler, and then I can switch back, I guess.
Please don't turn this into a debate about premature optimizations or what not, stay on topic of the question, I'm fully aware of that, so thank you.
I'm probably missing something obvious, but why isn't this good?
extern void underflow(void) __attribute__((noreturn));
unsigned foo(unsigned a, unsigned b)
{
unsigned r = a - b;
if (r > a)
{
underflow();
}
return r;
}
I have checked; gcc optimizes it to what you want:
foo:
movl %edi, %eax
subl %esi, %eax
jb .L6
rep
ret
.L6:
pushq %rax
call underflow
Of course you can handle underflow however you want, I have just done this to keep the asm simple.
How about the following assembly code (you can wrap it into GCC format):
sub rcx, rdx ; assuming operands are in rcx, rdx
setc al ; capture carry bit in AL (see Intel "setcc" instructions)
; return AL as boolean to compiler
Then you invoke/inline the assembly code, and branch on the resulting boolean.
Have you tested whether this is actually faster? Modern x86 microarchitectures decode single assembly instructions into sequences of simpler micro-operations. Some of them also do macro-op fusion, in which a sequence of instructions is turned into a single micro-op. In particular, sequences like test %reg, %reg; jcc target are fused, probably because the global processor flags are a bane of performance.
If cmp %reg, %reg; jcc target is macro-op fused, gcc might use that to get faster code. In my experience, gcc is very good at scheduling and similar low-level optimizations.
