Static branch prediction for ARM with __builtin_expect is not functional?

I'm optimizing C code that runs on a Cortex-R4.
First of all, I haven't seen any change in the assembly output when I add __builtin_expect to the condition check.
It seems like the compiler generates an unnecessary jump.
My C code:
bit ++; (Likely)
if(__builtin_expect(bit >= 32),0)
{
bit -=32; // unlikely code
xxxxxx; // unlikely code
xxxxxx; // unlikely code
xxxxxx; // unlikely code
}
bit = bit*2 // something (Likely)
return bit;
---- Generated ASM code --------
(bit => r0)
ADD r2,r2,#1
CMP r0,#0x20
BCC NoDecrement
SUB r0,r0,#0x20
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
NoDecrement LSL r0,r0,#1
BX lr
---- My expected ASM Code --------
ADD r2,r2,#1
CMP r0,#0x20
BHS Decrement
JumpBack LSL r0,r0,#1
BX lr
Decrement SUB r0,r0,#0x20
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
B JumpBack
Suppose this piece of C code runs in a loop: then it has to jump on every iteration (because the if condition is true only once).
Is there any other compiler setting that actually generates the code as expected?

You wrote:
if(__builtin_expect(bit >= 32),0)
{
...
}
The code inside the curly braces will never be executed, because it's surrounded by if(foo,0) which is equivalent to if(0) for any value of foo, no matter what builtin you're trying to use. If you turn on optimization with -O2, you'll see that the compiler removes the dead code completely, rather than just jumping around it. I think you probably meant to write
if (__builtin_expect(bit >= 32, 0)) {
bit -= 32;
}
If I do this, I get exactly the forward branch I'd expect (with clang -O1 or higher).
extern void something();
int foo(int bit)
{
++bit;
if (__builtin_expect(bit >= 32, 0)) {
bit -= 32; // "Decrement"
something();
}
bit = bit*2;
something();
return bit;
}
Here's the code from clang -arch armv7 -O2 -S:
_foo:
# BB#0:
push {r4, r7, lr}
adds r4, r0, #1
add r7, sp, #4
cmp r4, #32
bge LBB0_2 // a forward branch for the unlikely case
LBB0_1:
lsls r4, r4, #1
blx _something
mov r0, r4
pop {r4, r7, pc}
LBB0_2: // "Decrement"
sub.w r4, r0, #31
blx _something
b LBB0_1
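To make the parenthesis placement harder to get wrong, the hint is often wrapped in macros. The following is a minimal sketch, assuming GCC or clang; the names likely, unlikely and advance_bit are illustrative and do not come from the question:

```c
/* Illustrative wrappers: the whole condition goes inside the builtin,
   and the expected value is its second argument. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the question's snippet. */
int advance_bit(int bit)
{
    ++bit;                      /* likely path */
    if (unlikely(bit >= 32)) {
        bit -= 32;              /* unlikely path */
    }
    return bit * 2;             /* likely path */
}
```

Whether the unlikely block actually ends up out of line still depends on the compiler and the optimization level, so it is worth checking the generated assembly.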

Related

Understand a piece of assembly template code for arm gcc

The code below contains an inline assembly template:
static inline uintptr_t arch_syscall_invoke3(uintptr_t arg1, uintptr_t arg2,
uintptr_t arg3,
uintptr_t call_id)
{
register uint32_t ret __asm__("r0") = arg1;
register uint32_t r1 __asm__("r1") = arg2;
register uint32_t r2 __asm__("r2") = arg3;
register uint32_t r6 __asm__("r6") = call_id;
__asm__ volatile("svc %[svid]\n"
: "=r"(ret), "=r"(r1), "=r"(r2) <===================== HERE 1
: [svid] "i" (_SVC_CALL_SYSTEM_CALL), <===================== HERE 2
"r" (ret), "r" (r1), "r" (r2), "r" (r6)
: "r8", "memory", "r3", "ip");
return ret;
}
And I got the final assembly with https://godbolt.org/z/znMeEMrEz like this:
push {r6, r7, r8} ------------- A -------------
sub sp, sp, #20
add r7, sp, #0
str r0, [r7, #12]
str r1, [r7, #8]
str r2, [r7, #4]
str r3, [r7]
ldr r0, [r7, #12]
ldr r1, [r7, #8]
ldr r2, [r7, #4]
ldr r6, [r7]
svc #3 ------------- B -------------
mov r3, r0 ------------- C1 -------------
mov r0, r3 ------------- C2 -------------
adds r7, r7, #20
mov sp, r7
pop {r6, r7, r8}
bx lr
From A to B, the assembly code just ensures that the input arguments are present in the targeted registers. I guess this is some system call convention.
I don't understand the purpose of HERE 1 and HERE 2.
Question 1:
According to here, HERE 1 should be the OutputOperands part, which means
A comma-separated list of the C variables modified by the instructions in the AssemblerTemplate.
Does this mean the specific requested system call function will modify the ret/r0, r1 and r2 registers?
Question 2:
For HERE 2, it means InputOperands, which means:
A comma-separated list of C expressions read by the instructions in the AssemblerTemplate. An empty list is permitted. See InputOperands.
According to here, the SVC instruction expects only 1 argument imm.
But we specify 4 input operands like ret, r1, r2, r6.
Why do we need to specify so many of them?
I guess these registers are used by svc handler so I need to prepare them before the SVC instruction. But what if I just prepare them like from A to B and do not mention them as the input operands? Will there be some error?
Question 3:
And at last, what's the point of the C1 and C2? They seem totally redundant. The r0 is still there.
I guess this is some system call convention.
This is the result of compiling without optimizations. Looking closely at that code, one can see that after saving r6, r7 and r8, all it really does is move r3 to r6; everything else is redundant.
Question 1: Does this mean the specific requested system call function will modify the ret/r0, r1 and r2 registers?
Yes.
Question 2: According to here, the SVC instruction expects only 1 argument imm. But we specify 4 input operands like ret, r1, r2, r6.
We specify imm to generate a correct SVC instruction and we specify the rest to make sure that the system call we invoke will find its arguments in the registers documented in the system call ABI.
Why do we need to specify so many of them?
According to the function name it's a 3-argument syscall, so we have 3 syscall parameters and apparently the system call identifier.
But what if I just prepare them like from A to B and do not mention them as the input operands? Will there be some error?
One cannot reliably do the "just prepare them like from A to B" part without mentioning them as inputs in that asm statement. Just assigning function arguments to local variables is not enough, because nothing would enforce the ordering of those assignments relative to the asm statement. There will be no compile-time error, unless you compile with warnings-as-errors and have enabled the warning for variables that are set but unused.
Question 3: And at last, what's the point of the C1 and C2? They seem totally redundant. The r0 is still there.
They are. Compiling with -O will eliminate this redundant move as well as most of the prologue.

Cortex-M compiler generates improper FOR loop

Tested and reproduced on Cortex-M 4 and Cortex-M 0.
I have discovered an issue with the GCC compiler. When a function is declared as type int (non-void), and contains a for loop, but does not have a return statement, the for loop will not break; after disassembling the compiled code, there is a difference between functions with a return, and without a return.
When this code is compiled, it does not throw an error message. On the first compile, a warning of missing return statements is thrown, but after that the warning will not reappear until you restart the IDE. An issue of this magnitude should probably fail to compile, or at least crash the Arduino, but it just never breaks out of the for loop.
I am mainly looking to find the proper channels to report this, since I am not sure if GNU ARM Embedded Toolchain launchpad or GNU Bugzilla are maintained anymore. If anyone knows which site (or both) are still maintained, or if there's a direct contact to someone in the project who I can share this with, please share.
Below is a more thorough description of the behavior.
Arduino Code
============
This is an attempt at a minimum reproducible example. I have run into this issue on two separate occasions in larger projects, which cause the program to behave in extremely unexpected and hard to debug ways (but always fixed by adding a return statement in the function definition).
/*
gcc compiler error demonstration for Adafruit GrandCentral
gcc version: gcc version 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599] (GNU Tools for Arm Embedded Processors 9-2019-q4-major)
Arduino IDE: all warnings on
Arduino IDE version 1.8.13
Adafruit SAMD version 1.8.11
based on Blink
modified to call two functions which are identical except one does not have a return
statement even though it is of return type int.
In the list file, myList.GrandCentral.lst ,AFunctionWithReturn shows both the comparison of
i with Count and the conditional comparison i>7 with break assembly instructions
The AFunctionNoReturn does not show any assembly instructions for the end of
loop comparison or the conditional comparison i>7 with break
Found 4/8/21 Robert Calay and Tristan Calay
Turns an LED on for one second, then off for one second, repeatedly.
Most Arduinos have an on-board LED you can control. On the UNO, MEGA and ZERO
it is attached to digital pin 13, on MKR1000 on pin 6. LED_BUILTIN is set to
the correct LED pin independent of which board is used.
If you want to know what pin the on-board LED is connected to on your Arduino
model, check the Technical Specs of your board at:
https://www.arduino.cc/en/Main/Products
modified 8 May 2014
by Scott Fitzgerald
modified 2 Sep 2016
by Arturo Guadalupi
modified 8 Sep 2016
by Colby Newman
This example code is in the public domain.
http://www.arduino.cc/en/Tutorial/Blink
*/
#define MAIN
//#include "Serial3.h" We are re-directing serial port output to SERCOM 5 on the Grand Central M4.
int AFunctionWithReturn(int count)
{
Serial.print("CountWR");
Serial.println(count);
for(int i=0;i<count;i++) {
Serial.println(i);
if (i>7)
break;
}
return(1);
}
int AFunctionNoReturn(int count)
{
Serial.print("CountNR");
Serial.println(count);
for(int i=0;i<count;i++) {
Serial.println(i);
if (i>7)
break;
}
//Note: No return statement here.
}
// the setup function runs once when you press reset or power the board
void setup() {
Serial.begin(115200);
// initialize digital pin LED_BUILTIN as an output.
pinMode(LED_BUILTIN, OUTPUT);
AFunctionWithReturn(10); //This loops 8 times
AFunctionNoReturn(10); //This loops forever, never reaching loop()
}
// the loop function runs over and over again forever
void loop() {
digitalWrite(LED_BUILTIN, HIGH); // turn the LED on (HIGH is the voltage level)
delay(1000); // wait for a second
digitalWrite(LED_BUILTIN, LOW); // turn the LED off by making the voltage LOW
delay(1000); // wait for a second
}
/*
OUTPUT ON ADAFRUIT GRANDCENTRAL SERIAL PORT
CountWR10
0
1
2
3
4
5
6
7
8
CountNR10
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
....
DOES NOT STOP CONTINUES 2000000+
*
*
*/
Disassembled Code
=================
There is a strange behavior in the brackets here. I'm no expert on the low level code, but it seems like AFunctionNoReturn calls itself recursively here. If not, it still has no break condition, and it does not have a compare call like AFunctionWithReturn in cmp r4, r5.
int AFunctionWithReturn(int count)
{
42bc: b570 push {r4, r5, r6, lr}
Serial.print("CountWR");
42be: 490c ldr r1, [pc, #48] ; (42f0 <_Z19AFunctionWithReturni+0x34>)
Serial.println(count);
for(int i=0;i<count;i++) {
Serial.println(i);
42c0: 4e0c ldr r6, [pc, #48] ; (42f4 <_Z19AFunctionWithReturni+0x38>)
{
42c2: 4605 mov r5, r0
Serial.print("CountWR");
42c4: 480b ldr r0, [pc, #44] ; (42f4 <_Z19AFunctionWithReturni+0x38>)
42c6: f000 fafa bl 48be <_ZN5Print5printEPKc>
Serial.println(count);
42ca: 480a ldr r0, [pc, #40] ; (42f4 <_Z19AFunctionWithReturni+0x38>)
42cc: 220a movs r2, #10
42ce: 4629 mov r1, r5
42d0: f000 fb43 bl 495a <_ZN5Print7printlnEii>
for(int i=0;i<count;i++) {
42d4: 2400 movs r4, #0
42d6: 42ac cmp r4, r5
42d8: da08 bge.n 42ec <_Z19AFunctionWithReturni+0x30>
Serial.println(i);
42da: 220a movs r2, #10
42dc: 4621 mov r1, r4
42de: 4630 mov r0, r6
42e0: f000 fb3b bl 495a <_ZN5Print7printlnEii>
if (i>7)
42e4: 2c08 cmp r4, #8
42e6: d001 beq.n 42ec <_Z19AFunctionWithReturni+0x30>
for(int i=0;i<count;i++) {
42e8: 3401 adds r4, #1
42ea: e7f4 b.n 42d6 <_Z19AFunctionWithReturni+0x1a>
break;
}
return(1);
}
int AFunctionNoReturn(int count)
{
42f8: b538 push {r3, r4, r5, lr}
Serial.print("CountNR");
42fa: 4909 ldr r1, [pc, #36] ; (4320 <_Z17AFunctionNoReturni+0x28>)
Serial.println(count);
for(int i=0;i<count;i++) {
Serial.println(i);
42fc: 4d09 ldr r5, [pc, #36] ; (4324 <_Z17AFunctionNoReturni+0x2c>)
{
42fe: 4604 mov r4, r0
Serial.print("CountNR");
4300: 4808 ldr r0, [pc, #32] ; (4324 <_Z17AFunctionNoReturni+0x2c>)
4302: f000 fadc bl 48be <_ZN5Print5printEPKc>
Serial.println(count);
4306: 4621 mov r1, r4
4308: 4806 ldr r0, [pc, #24] ; (4324 <_Z17AFunctionNoReturni+0x2c>)
430a: 220a movs r2, #10
430c: f000 fb25 bl 495a <_ZN5Print7printlnEii>
for(int i=0;i<count;i++) {
4310: 2400 movs r4, #0
Serial.println(i);
4312: 4621 mov r1, r4
4314: 220a movs r2, #10
4316: 4628 mov r0, r5
4318: f000 fb1f bl 495a <_ZN5Print7printlnEii>
for(int i=0;i<count;i++) {
431c: 3401 adds r4, #1
431e: e7f8 b.n 4312 <_Z17AFunctionNoReturni+0x1a>
4320: 00006538 .word 0x00006538
4324: 2000011c .word 0x2000011c
00004328 <loop>:
AFunctionWithReturn(10);
AFunctionNoReturn(10);
}
Perhaps the most helpful thing that can be said is: "Why do you want to miss out the return statement? what are you hoping to achieve?"
The various language standards (Arduino is sort-of-C++ but with some funny pre-processing) tell you what will happen if you write valid code. They do not always tell you what happens if you write invalid code. In this case the compiler has very helpfully pointed out why your code is wrong, but after that it is totally free to do anything. No matter what it does, this is never a bug in the compiler; it is a bug in your code. This is sometimes called "garbage in, garbage out".
To perhaps explain why you got the particular result you did, think about it like this: the compiler knows that in a valid program execution never runs to the end of the function without a return statement, so if there isn't a return statement after the loop, it is safe to assume that it never leaves the loop. Making this assumption helps to optimize valid code to run faster. If this assumption changes what an invalid program does, then the compiler authors usually don't care. They are usually only interested in what valid programs do.
(Regarding the launchpad page, if you click on the big link at the top of the page, you will see a message about where the site has moved to).
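As a minimal sketch of the fix for the missing return: give the function a return statement on every path and, assuming GCC or clang, consider compiling with -Werror=return-type so that the missing return becomes a hard error instead of a warning that is easy to miss. The function name count_until_break is hypothetical, and the Serial calls are replaced by a counter so that the snippet stands alone as plain C:

```c
/* Same loop shape as AFunctionWithReturn, but it counts iterations
   instead of printing, and it always returns a value. */
int count_until_break(int count)
{
    int iterations = 0;
    for (int i = 0; i < count; i++) {
        iterations++;
        if (i > 7)
            break;
    }
    return iterations;   /* the return the original function lacked */
}
```

With the return in place the compiler can no longer assume that control never leaves the loop, so the comparison and the break have to be kept.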

Stop ARM GCC Optimising Out Function Call

volatile static const uint8_t mcau8IsBlank[] = {0xFF}; // Value in MCU FLASH memory
// The above value may actually be modified by a FLASH Write elsewhere in the code
bool halIsBlank() {
return ((*(uint8_t*)mcau8IsBlank));
}
void someFuncInAnotherFile() {
uint8_t data[64];
data[0] = halIsBlank(); // ARM GCC is optimising away this function call
// Replacing it simply with a 0xFF constant
// ... etc
// ... transmit data
}
How do I get ARM GCC to not optimise out the call to halIsBlank()? The compiler is assuming that mcau8IsBlank[] is always == 0xFF and is thus simply replacing the call with a 0xFF constant.
I can disable optimisation of the calling function (someFuncInAnotherFile()) by adding __attribute__((optimize(0))) to it, but it would be better to add some attribute to the called function (halIsBlank()) (and no attributes or keywords that I've tried seem to do the trick)?
If an object is declared as const then any attempt to modify it leads to undefined behaviour. The compiler is allowed to assume that a const object is constant. And you explicitly cast away the volatile qualifier of the array, so the compiler can assume the access is not volatile at this point.
I'd remove that cast to (uint8_t *) which seems to be pointless anyway.
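A minimal sketch of that suggestion, with hypothetical names (flash_marker, hal_read_marker): the object keeps its volatile qualifier and is read without a cast, so the compiler has to emit a real load rather than fold in the initializer. Whether rewriting the flash behind the compiler's back is acceptable is a platform question that this sketch does not settle.

```c
#include <stdint.h>

/* Hypothetical marker byte; on the real target it would live in flash. */
static const volatile uint8_t flash_marker[] = {0xFF};

/* The access stays volatile-qualified, so the load cannot be replaced
   by the constant from the initializer. */
uint8_t hal_read_marker(void)
{
    return flash_marker[0];
}
```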
so.c
const unsigned char one = 0x11;
unsigned char two = 0x22;
volatile unsigned char three = 0x33;
extern unsigned char four;
unsigned int get_one ( void )
{
return(one);
}
unsigned int get_two ( void )
{
return(two);
}
unsigned int get_three ( void )
{
return(three);
}
unsigned int get_four ( void )
{
return(four);
}
four.c
unsigned char four = 0x44;
gnu ld linker script
MEMORY
{
rom : ORIGIN = 0x10000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*)
four.o(.data) } > rom
.rodata : { *(.rodata*) } > rom
.data : { *(.data*) } > ram
}
result
Disassembly of section .text:
10000000 <_start>:
10000000: 20001000 andcs r1, r0, r0
10000004: 10000009 andne r0, r0, r9
10000008 <reset>:
10000008: e7fe b.n 10000008 <reset>
...
1000000c <get_one>:
1000000c: 2011 movs r0, #17
1000000e: 4770 bx lr
10000010 <get_two>:
10000010: 4b01 ldr r3, [pc, #4] ; (10000018 <get_two+0x8>)
10000012: 7818 ldrb r0, [r3, #0]
10000014: 4770 bx lr
10000016: bf00 nop
10000018: 20000000 andcs r0, r0, r0
1000001c <get_three>:
1000001c: 4b01 ldr r3, [pc, #4] ; (10000024 <get_three+0x8>)
1000001e: 7858 ldrb r0, [r3, #1]
10000020: 4770 bx lr
10000022: bf00 nop
10000024: 20000000 andcs r0, r0, r0
10000028 <get_four>:
10000028: 4b01 ldr r3, [pc, #4] ; (10000030 <get_four+0x8>)
1000002a: 7818 ldrb r0, [r3, #0]
1000002c: 4770 bx lr
1000002e: bf00 nop
10000030: 10000034 andne r0, r0, r4, lsr r0
10000034 <four>:
10000034: Address 0x0000000010000034 is out of bounds.
Disassembly of section .rodata:
10000035 <one>:
10000035: Address 0x0000000010000035 is out of bounds.
Disassembly of section .data:
20000000 <two>:
20000000: Address 0x0000000020000000 is out of bounds.
20000001 <three>:
20000001: Address 0x0000000020000001 is out of bounds.
Because one is const, the function in this file is optimized to return the constant directly; but it is also global, so it is still allocated in flash (in case other objects reference it). Make it static and the allocation in flash goes away.
Two is plain old .data. The compiler has to generate a load like this, and the linker fills in the address at link time.
Three is a volatile global, handled the same way as two because it is global and in .data; volatile does not change much here.
Four is a solution if you choose. Define it outside this file/optimization domain, and the compiler cannot know its value and has to generate code to load it from an address resolved at link time. In the linker script, tell the linker to place it in flash. So while it is in flash and technically not read/write, if you have a way to write the flash then this will work.
Well, actually it will not, because when you erase the flash to change four you wipe out some of this .text code along with it. You need to know what the part's erase blocks are and put things like this in a dedicated erase block. If several values share a block and you want to change one of them, you have to save them all to RAM, erase, and write them all back, including any changed values. It is also rare that you can execute from the same flash logic that is being erased, so you may need a trampoline to run this save, erase, restore routine (more linker magic, plus a copy and jump).
One function calling another in the same optimization domain is likely going to inline it, so you would want to find a "please do not inline" command-line option. For this case, though, that does not make sense: you want to optimize, and possibly make the small function static so that it goes away altogether.

Would a compiled program have different machine codes when executed on PC, Mac, Linux etc?

I'm just getting started learning the very fundamentals of computers and programming. I've grasped that, in compiled programs, the machine code generated is specific to the type of processors and their instruction sets. What I'd like to know is, say, I have Windows, OS X and Linux all running on the exact same hardware (processor to be specific), would the machine code generated from this compiled program differ across the OSes? Is machine code OS dependent or will it be an exact same copy of bits and bytes across all the OS?
What happened when you tried it? As answered the file formats supported may vary, but you asked about machine code.
The machine code for the same processor core is the same, of course. But only some percentage of the code is generic:
a=b+c;
printf("%u\n",a);
Assume you are even using the same compiler version targeting the same CPU, just under a different operating system (the same computer running Linux, then later Windows). The addition is ideally the same, assuming the top-level function/source code is the same.
First off, the entry point of the code may vary from one OS to another, so the linker may make the program different. For position-dependent code, fixed addresses end up in the binary; you can call that machine code or not, but the specific addresses may result in different instructions. A branch/jump may have to be encoded differently based on the address, of course, and where one system can use one form of branch, another may require a trampoline to get from one place to another.
Then there are the system calls themselves; there is no reason to assume that system calls are the same between operating systems. This can make the code vary in size, which can again force the compiler or linker to make different machine code choices: for some instruction sets it depends on how near or far a jump target is, or on whether the address can be encoded as an immediate or has to be loaded from a nearby location and then branched to indirectly.
EDIT
Long before you start to ponder or worry about what happens on different operating systems on the same platform or target, understand the basics of putting a program together and what kinds of things can change the machine code.
A very simple program/function
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a+b+3);
return(a+b+7);
}
compile then disassemble
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e0804001 add r4, r0, r1
8: e2840003 add r0, r4, #3
c: ebfffffe bl 0 <dummy>
10: e2840007 add r0, r4, #7
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
There is actually a ton of stuff going on there. This is ARM, full-sized (not Thumb... yet). The a parameter comes in in r0, b in r1, and the result goes out in r0. lr is basically the return address register, so if we are calling another function we need to save it (on the stack). Likewise, we are going to reuse r0 to call dummy, and in fact with this calling convention any function can modify/destroy r0-r3, so the compiler needs to preserve our two parameters across the call. Since I intentionally used them in the same way both times, the compiler can compute a+b once and keep that instead. Rather than saving the sum itself on the stack (no doubt for performance reasons), it saves r4 on the stack and then uses r4 to hold a+b. Under the calling convention you cannot modify r4 at will in a function: any nested function has to preserve it and return with it in the as-found state, so it is safe to just leave a+b there when calling other functions.
They add 3 to our a+b sum in r4 and call dummy. When it returns they add 7 to the a+b sum in r4 and return in r0.
From a machine code perspective this is not yet linked and dummy is an external function
c: ebfffffe bl 0 <dummy>
I call it dummy because when we use it here in a second it does nothing but return: a dummy function. The instruction encoded there is clearly wrong; branching to the beginning of fun would not work, since that is recursion, which is not what we asked for. So let's link it. At a minimum we need to declare a _start label to make the GNU linker happy, but I want to do more than that:
.globl _start
_start:
bl fun
b .
.globl dummy
dummy:
bx lr
and linking for an entry address of 0x1000 produced this
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: e12fff1e bx lr
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: ebfffffa bl 1008 <dummy>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
The linker filled in the address for dummy by modifying the instruction that calls it, so you can see that the machine code has changed.
1018: ebfffffa bl 1008 <dummy>
How far away things are, or other factors, can change this. The bl instruction here has a long range but does not cover the full address space, so if the program is sufficiently large and there is a lot of code between the caller and the callee, the linker may have to do more work. I can cause that for a different reason: ARM has ARM and Thumb modes, and you have to use specific instructions in order to switch, bl not being one of them (or at least not on all ARMs).
If I add these two lines in front of the dummy function
.thumb
.thumb_func
.globl dummy
dummy:
bx lr
Forcing the assembler to generate thumb instructions and mark the dummy label as a thumb label then
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: 4770 bx lr
100a: 46c0 nop ; (mov r8, r8)
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: eb000002 bl 1028 <__dummy_from_arm>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
00001028 <__dummy_from_arm>:
1028: e59fc000 ldr r12, [pc] ; 1030 <__dummy_from_arm+0x8>
102c: e12fff1c bx r12
1030: 00001009 andeq r1, r0, r9
1034: 00000000 andeq r0, r0, r0
Because a bx is required to switch modes in this case, and fun is in ARM mode while dummy is in Thumb mode, the linker has very nicely added a trampoline function (as I call it) to bounce off of to get from fun to dummy. The link register (lr) contains a bit that tells the bx on the return which mode to switch to, so there is no extra work needed to modify the dummy function.
Had there been a great distance between the two functions in memory, I would hope the linker would also have patched that up for us, but you never know until you try.
.globl _start
_start:
bl fun
b .
.globl dummy
dummy:
bx lr
.space 0x10000000
sigh, oh well
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
v.o: In function `_start':
(.text+0x0): relocation truncated to fit: R_ARM_CALL against symbol `fun' defined in .text section in so.o
if we change one plus to a minus:
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
and it gets more complicated
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a04001 mov r4, r1
8: e1a05000 mov r5, r0
c: e0400001 sub r0, r0, r1
10: e2800003 add r0, r0, #3
14: ebfffffe bl 0 <dummy>
18: e2840007 add r0, r4, #7
1c: e0800005 add r0, r0, r5
20: e8bd4070 pop {r4, r5, r6, lr}
24: e12fff1e bx lr
They can no longer reuse a single a+b result, so they need more stack space or, in the case of this optimizer, they save other things on the stack to make room in registers. Now you ask: why is r6 pushed on the stack when it is not being modified? This ABI requires a 64-bit-aligned stack, which means either pushing four registers to save three things, or pushing the three things and then adjusting the stack pointer. For this instruction set, pushing four things is cheaper than fetching and executing another instruction.
if for whatever reason the external function becomes local
void dummy ( unsigned int x )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
that changes things again
00000000 <dummy>:
0: e12fff1e bx lr
00000004 <fun>:
4: e2811007 add r1, r1, #7
8: e0810000 add r0, r1, r0
c: e12fff1e bx lr
Since dummy doesn't use the parameter passed, and the optimizer can now see that, there is no reason to waste instructions subtracting and adding 3; that is all dead code, so it is removed. We are no longer calling dummy, since it is dead code, so there is no need to save the link register and the parameters on the stack: just do the addition and return.
static void dummy ( unsigned int x )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
making dummy local/static and nobody using it
00000000 <fun>:
0: e2811007 add r1, r1, #7
4: e0810000 add r0, r1, r0
8: e12fff1e bx lr
last experiment
static unsigned int dummy ( unsigned int x )
{
return(x+1);
}
unsigned int fun ( unsigned int a, unsigned int b )
{
unsigned int c;
c=dummy(a-b+3);
return(a+b+c);
}
dummy is static and called, but it is optimized here to be inlined, so there is no call to it. External code cannot use it (it is static) and nothing inside this file calls it any more, so there is no reason to generate it.
The compiler examines all of the operations and optimizes them: a-b+3+1+a+b = a+a+4 = (2*a)+4 = (a<<1)+4.
Why did they use a shift left instead of just add r0,r0,r0? I don't know. Maybe the shift is faster in the pipe, or maybe it is irrelevant and either one was just as good and the compiler author chose this method, or perhaps the internal code, which is somewhat generic, figured this out and had already converted it into a shift rather than an add before it reached the backend.
00000000 <fun>:
0: e1a00080 lsl r0, r0, #1
4: e2800004 add r0, r0, #4
8: e12fff1e bx lr
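The folding above can be checked on any host compiler. This is a small sketch; fun_folded is a name introduced here for the simplified form, not something the compiler emits. Because the arithmetic is unsigned, it wraps modulo 2^32 and the identity holds for every a and b:

```c
/* The source from the last experiment, unchanged in behavior. */
static unsigned int dummy(unsigned int x)
{
    return x + 1;
}

unsigned int fun(unsigned int a, unsigned int b)
{
    unsigned int c;
    c = dummy(a - b + 3);
    return a + b + c;
}

/* Hand-written equivalent of the two-instruction result:
   lsl r0, r0, #1 followed by add r0, r0, #4. Note that b drops out. */
unsigned int fun_folded(unsigned int a)
{
    return (a << 1) + 4;
}
```

Comparing fun(a, b) against fun_folded(a) for a few inputs, including ones where a-b wraps around, is a quick way to convince yourself that the optimizer's result is equivalent.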
command lines used for these experiments
arm-none-eabi-gcc -c -O2 so.c -o so.o
arm-none-eabi-as v.s -o v.o
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
arm-none-eabi-objdump -D so.o
arm-none-eabi-objdump -D so.elf
The point is that you can do these kinds of simple experiments yourself and begin to understand when and where the compiler and linker make modifications to the machine code, if that is how you like to think of it. Then realize, as I sort of showed here when I added the non-static dummy function (the fun() function was pushed deeper into memory), that things move as you add more code. For example, a C library may change from one operating system to the next, or may be mostly identical except for the system calls, so it may vary in size and cause other code to be moved around: a larger puts() might cause printf() to live at a different address, all other factors held constant. If you are not linking statically, then no doubt there will be differences: the file format and the mechanism used to find a .so file on Linux or a .dll on Windows, parse it, and connect the dots at run time between the system calls in the application and the shared libraries. The file format and the location of shared libraries in the application's address space will by themselves cause the binary that is linked with the OS-specific stub to be different. And then eventually there is the actual system call itself.
Binaries are generally not portable across systems. Linux (and Unix) use ELF executable format, macOS uses Mach-O and Windows uses PE.

Using GCC's builtin functions in arm

I'm working on a cortex-m3 board with a bare-metal toolchain without libc.
I implemented a memcpy that copies data byte by byte, but it's too slow. The GCC manual says it provides __builtin_memcpy, so I decided to use it. Here is the implementation with __builtin_memcpy.
#include <stddef.h>
void *memcpy(void *dest, const void *src, size_t n)
{
return __builtin_memcpy(dest,src,n);
}
When I compile this code, it becomes a recursive function which never ends.
$ arm-none-eabi-gcc -march=armv7-m -mcpu=cortex-m3 -mtune=cortex-m3 \
-O2 -ffreestanding -c memcpy.c -o memcpy.o
$ arm-none-eabi-objdump -d memcpy.o
memcpy.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <memcpy>:
0: f7ff bffe b.w 0 <memcpy>
Am I doing something wrong? How can I use the compiler-generated version of memcpy?
Builtin functions are not supposed to be used to implement themselves :)
Builtin functions are supposed to be used in application code; the compiler then may or may not generate some special instruction sequence or a call to the underlying real function.
Compare:
int a [10], b [20];
void
foo ()
{
__builtin_memcpy (a, b, 10 * sizeof (int));
}
This results in:
foo:
stmfd sp!, {r4, r5}
ldr r4, .L2
ldr r5, .L2+4
ldmia r4!, {r0, r1, r2, r3}
mov ip, r5
stmia ip!, {r0, r1, r2, r3}
ldmia r4!, {r0, r1, r2, r3}
stmia ip!, {r0, r1, r2, r3}
ldmia r4, {r0, r1}
stmia ip, {r0, r1}
ldmfd sp!, {r4, r5}
bx lr
But:
void
bar (int n)
{
__builtin_memcpy (a, b, n * sizeof (int));
}
results in a call to the memcpy function:
bar:
mov r2, r0, asl #2
stmfd sp!, {r3, lr}
ldr r1, .L5
ldr r0, .L5+4
bl memcpy
ldmfd sp!, {r3, lr}
bx lr
Theoretically, the library is not part of the C compiler and not part of the toolchain.
Thus, if you write memcpy(&a,&b,sizeof(a)), the compiler MUST generate a subroutine call.
The idea of __builtin is to inform the compiler that the function is standard and can be optimized. Thus, if you write __builtin_memcpy(&a,&b,sizeof(a)), the compiler MAY generate a subroutine call, but in most cases it will not. For example, if the size is known to be 4 at compile time, only one mov instruction will be generated. (Another advantage: even when a subroutine call is generated, the compiler knows that the library function has no side effects.)
So it's ALWAYS better to use __builtin_memcpy instead of memcpy. In modern libraries this is done with #define memcpy __builtin_memcpy right in string.h.
But you still need to implement memcpy somewhere, because a call will still be generated in more complicated cases. For string functions on ARM, a 4-byte (word-at-a-time) implementation is strongly recommended.
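A minimal sketch of such a word-at-a-time copy, assuming GCC or clang (the may_alias attribute is a GNU extension). It is named copy_words here, a hypothetical name, so that it can be tried on a host without clashing with libc; on the bare-metal target it would be the body of memcpy. At higher optimization levels GCC may recognize copy loops and turn them back into a call to memcpy, so a real memcpy.c is typically built with -fno-tree-loop-distribute-patterns or similar; treat that as something to verify for your toolchain.

```c
#include <stddef.h>
#include <stdint.h>

/* may_alias lets this word type legally access objects of any type. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

void *copy_words(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Use word copies only when both pointers are word-aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(word_t) - 1)) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;

        while (n >= sizeof(word_t)) {
            *dw++ = *sw++;
            n -= sizeof(word_t);
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Copy the tail, or everything if the pointers were unaligned. */
    while (n--)
        *d++ = *s++;

    return dest;
}
```

This sketch falls back to byte copies whenever either pointer is unaligned; a tuned implementation would first copy a few bytes to reach alignment when source and destination share the same misalignment.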

Resources