Using GCC's builtin functions in arm

Using GCC's builtin functions in arm - gcc

I'm working on a cortex-m3 board with a bare-metal toolchain without libc.
I implemented memcpy which copies data byte-to-byte but it's too slow. In GCC manual, it says it provides __builtin_memcpy and I decided to use it. So here is the implementation with __builtin_memcpy.
#include <stddef.h>
void *memcpy(void *dest, const void *src, size_t n)
{
return __builtin_memcpy(dest,src,n);
}
When I compile this code, it becomes a recursive function which never ends.
$ arm-none-eabi-gcc -march=armv7-m -mcpu=cortex-m3 -mtune=cortex-m3 \
-O2 -ffreestanding -c memcpy.c -o memcpy.o
$ arm-none-eabi-objdump -d memcpy.o
memcpy.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <memcpy>:
0: f7ff bffe b.w 0 <memcpy>
Am I doing wrong? How can I use the compiler-generated memcpy version?

Builtin functions are not supposed to be used to implement itself :)
Builtin functions are supposed to be used in application code - then the compiler may or may not generate some special insn sequence or a call to the underlying real function
Compare:
int a [10], b [20];
void
foo ()
{
__builtin_memcpy (a, b, 10 * sizeof (int));
}
This results in:
foo:
stmfd sp!, {r4, r5}
ldr r4, .L2
ldr r5, .L2+4
ldmia r4!, {r0, r1, r2, r3}
mov ip, r5
stmia ip!, {r0, r1, r2, r3}
ldmia r4!, {r0, r1, r2, r3}
stmia ip!, {r0, r1, r2, r3}
ldmia r4, {r0, r1}
stmia ip, {r0, r1}
ldmfd sp!, {r4, r5}
bx lr
But:
void
bar (int n)
{
__builtin_memcpy (a, b, n * sizeof (int));
}
results in a call to the memcpy function:
bar:
mov r2, r0, asl #2
stmfd sp!, {r3, lr}
ldr r1, .L5
ldr r0, .L5+4
bl memcpy
ldmfd sp!, {r3, lr}
bx lr

Theoretically, library is not part of C compiler and not part of toolchain.
Thus, if you wrotememcpy(&a,&b,sizeof(a)) compiler MUST generate subroutine call.
The idea of __builtin : to inform compiler, that the function is standard and can be optimized. Thus, if you wrote __builtin_memcpy(&a,&b,sizeof(a)) compiler MAY generate subroutine call, but in most cases it will not happens. For example, if size is known as 4 at compile time - only one mov command will be generated. (Another advantage - even in case of subroutine call compiler is informed, that library function has no side effects).
So, it's ALWAYS better to use __builtin_memcpy instead of memcpy. In modern libraries it was done by #define memcpy __builtin_memcpy just in string.h
But you still need implement memcpy somewhere, call will be generated in sophistical places. For string functions on ARM, it's strictly recommended 4-byte implementation.

Related

ARM GCC hardfault when using -O2

When using ARM GCC g++ compiler with optimization level -O2 (and up) this code:
void foo(void)
{
DBB("#0x%08X: 0x%08X", 1, *((uint32_t *)1));
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)0));
}
Compiles to:
0800abb0 <_Z3foov>:
800abb0: b508 push {r3, lr}
800abb2: 2301 movs r3, #1
800abb4: 4619 mov r1, r3
800abb6: 681a ldr r2, [r3, #0]
800abb8: 4802 ldr r0, [pc, #8] ; (800abc4 <_Z3foov+0x14>)
800abba: f007 fa83 bl 80120c4 <debug_print_blocking>
800abbe: 2300 movs r3, #0
800abc0: 681b ldr r3, [r3, #0]
800abc2: deff udf #255 ; 0xff
800abc4: 08022704 stmdaeq r2, {r2, r8, r9, sl, sp}
And this gives me hardfault at undefined instruction #0x0800abc2.
Also, if there is more code after that, it is not compiled into final binary.
The question is why compiler generates it like that, why undefined istruction?
By the way, it works fine for stuff like this:
...
uint32_t num = 2;
num -= 2;
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)num));
...
Compiler version:
arm-none-eabi-g++.exe (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]

You can disable this (and verify this answer) by using -fno-delete-null-pointer-checks
The pointer you are passing has a value which matches the null pointer, and the compiler can see that from static analysis, so it faults (because that is the defined behaviour).
In your second example, the static analysis doesn't identify a NULL.

How to create an array of booleans in arm assembly?

I need to specify each boolean manually like in a fixed table, so using
Array: .skip 400
I will be declaring an array of 400 bytes,so how can i set the boolean values?

ARM registers are 32 bits each. You only need a bit to represent a boolean. So you can use the following 'C' code to access an array,
uint32_t load_bool(uint32_t index)
{
return (bool_array[index>>2] & (1<<(index&3)));
}
void store_bool(uint32_t index, int value)
{
uint32_t target = bool_array[index>>2];
if(value)
target |= (1<<(index&3));
else
target &= ~(1<<(index&3));
bool_array[index>>2] = target;
}
Use a compiler to target your CPU; for instance tuning godbolt output on a Cortex-A5 gives,
load_bool(unsigned int):
ldr r3, =bool_array
mov r2, r0, lsr #2
ldr r3, [r3, r2, asl #2]
and r0, r0, #3
mov r2, #1
and r0, r3, r2, asl r0
bx lr
store_bool(unsigned int, int):
ldr r3, =bool_array
mov r2, r0, lsr #2
cmp r1, #0
ldr r1, [r3, r2, asl #2]
and r0, r0, #3
mov ip, #1
orrne r0, r1, ip, asl r0
biceq r0, r1, ip, asl r0
str r0, [r3, r2, asl #2]
bx lr
The instructions tst, bclr, etc might be useful if you choose a macro instead of a function call (bit index known at compile/assemble time). Also, ldrb or byte access might be better on older platforms/CPUs. Most ARM CPUs have a 32bit bus, so the cycles for ldrb and ldr are equal.

Boolean variables in C and C++ are basically treated as a native integer assigned 1 for true and 0 for false; in ARM's case it would be a 32-bit integer. So if you need to access the structure as an array of Booleans in C/C++ you would need to access them as 32-bit integers aligned on a 4-byte boundary. However if you only need to access it from other assembly code you can use each byte as it's own boolean variable and simply manipulate the array on a byte level.
In ARM assembly, this would be the difference between accessing the array with LDR vs with LDRB.

Would a compiled program have different machine codes when executed on PC, Mac, Linux etc?

I'm just getting started learning the very fundamentals of computers and programming. I've grasped that, in compiled programs, the machine code generated is specific to the type of processors and their instruction sets. What I'd like to know is, say, I have Windows, OS X and Linux all running on the exact same hardware (processor to be specific), would the machine code generated from this compiled program differ across the OSes? Is machine code OS dependent or will it be an exact same copy of bits and bytes across all the OS?

What happened when you tried it? As answered the file formats supported may vary, but you asked about machine code.
The machine code for the same processor core is the same of course. But only some percentage of the code is generic
a=b+c:
printf("%u\n",a);
Assume even you are using the same compiler version targeted at the same cpu but with a different operating system (same computer running linux then later windows) the addition is ideally the same assuming the top level function/source code is the same.
First off the entry point of the code may vary from one OS to another, so the linker may make the program different, for position dependent code, fixed addresses will end up in the binary, you can call that machine code or not, but the specific addresses may result in different instructions. A branch/jump may have to be encoded differently based on the address of course, but in one system you may have one form of branch another may require a trampoline to get from one place to another.
Then there are the system calls themselves, no reason to assume that the system calls between operating systems are the same. This can make the code vary in size, etc which can again cause the compiler or linker to have to make different machine code choices based on how near or far a jmp target is for some instruction sets or can the address be encoded as an immediate or do you have to load it from a nearby location then branch to that indirectly.
EDIT
Long before you start to ponder/worry about what happens on different operating systems on the same platform or target. Understand the basics of putting a program together, and what kinds of things can change the machine code.
A very simple program/function
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a+b+3);
return(a+b+7);
}
compile then disassemble
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e0804001 add r4, r0, r1
8: e2840003 add r0, r4, #3
c: ebfffffe bl 0 <dummy>
10: e2840007 add r0, r4, #7
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
There is actually a ton of stuff going on there. This is arm, full sized (not thumb...yet). The a parameter comes in in r0, b in r1, result out in r0. lr is the return address register basically, so if we are calling another function we need to save that (on the stack) likewise we are going to re-use r0 to call dummy and in fact with this calling convention any function can modify/destroy r0-r3, so the compiler is going to need to deal with our two parameters, since I intentionally used them in the same way the compiler can optimize a+b into a register and save that on the stack, actually for performance reasons no doubt, they save r4 on the stack and then use r4 to save a+b, you cannot modify r4 at will in a function based on the calling convention so any nested function would have to preserve it and return with it in the as found state, so it is safe to just leave a+b there when calling other functions.
They add 3 to our a+b sum in r4 and call dummy. When it returns they add 7 to the a+b sum in r4 and return in r0.
From a machine code perspective this is not yet linked and dummy is an external function
c: ebfffffe bl 0 <dummy>
I call it dummy because when we use it here in a second it does nothing but return, a dummy function. The instruction encoded there is clearly wrong branching to the beginning of fun would not work that is recursion that is not what we asked for. So lets link it, at a minimum we need to declare a _start label to make the gnu linker happy, but I want to do more than that:
.globl _start
_start
bl fun
b .
.globl dummy
dummy:
bx lr
and linking for an entry address of 0x1000 produced this
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: e12fff1e bx lr
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: ebfffffa bl 1008 <dummy>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
The linker filled in the address for dummy by modifying the instruction that calls it, so you can see that the machine code has changed.
1018: ebfffffa bl 1008 <dummy>
Depending on how far away things are or other factors can change this, the bl instruction here has a long range but not the full address space, so if the program is sufficiently large and there is a lot of code between the caller and the callee then the linker may have to do more work. For different reasons I can cause that. Arm has arm and thumb modes and you have to use specific instructions in order to switch, bl not being one of them (or at least not for all of the arms).
If I add these two lines in front of the dummy function
.thumb
.thumb_func
.globl dummy
dummy:
bx lr
Forcing the assembler to generate thumb instructions and mark the dummy label as a thumb label then
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: 4770 bx lr
100a: 46c0 nop ; (mov r8, r8)
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: eb000002 bl 1028 <__dummy_from_arm>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
00001028 <__dummy_from_arm>:
1028: e59fc000 ldr r12, [pc] ; 1030 <__dummy_from_arm+0x8>
102c: e12fff1c bx r12
1030: 00001009 andeq r1, r0, r9
1034: 00000000 andeq r0, r0, r0
Because the BX is required to switch modes in this case and fun is arm mode and dummy is thumb mode the linker has very nicely for us added a trampoline function I call it to bounce off of to get from fun to dummy. The link register (lr) contains a bit that tells the bx on the return which mode to switch to so there is no extra work there to modify the dummy function.
Had there have been a great distance between the two functions in memory I would hope the linker would have also patched that up for us, but you never know until you try.
.globl _start
_start:
bl fun
b .
.globl dummy
dummy:
bx lr
.space 0x10000000
sigh, oh well
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
v.o: In function `_start':
(.text+0x0): relocation truncated to fit: R_ARM_CALL against symbol `fun' defined in .text section in so.o
if we change one plus to a minus:
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
and it gets more complicated
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a04001 mov r4, r1
8: e1a05000 mov r5, r0
c: e0400001 sub r0, r0, r1
10: e2800003 add r0, r0, #3
14: ebfffffe bl 0 <dummy>
18: e2840007 add r0, r4, #7
1c: e0800005 add r0, r0, r5
20: e8bd4070 pop {r4, r5, r6, lr}
24: e12fff1e bx lr
they can no longer optimize the a+b result so more stack space or in the case of this optimizer, save other things on the stack to make room in registers. Now you ask why is r6 pushed on the stack? It is not being modified? This abi requires a 64 bit aligned stack so that means pushing four registers to save three things or push the three things and then modify the stack pointer, for this instruction set pushing the four things is cheaper than fetching another instruction and executing it.
if for whatever reason the external function becomes local
void dummy ( unsigned int )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
that changes things again
00000000 <dummy>:
0: e12fff1e bx lr
00000004 <fun>:
4: e2811007 add r1, r1, #7
8: e0810000 add r0, r1, r0
c: e12fff1e bx lr
Since dummy doesnt use the parameter passed and the optimizer can now see it, then there is no reason to waste instructions subtracting and adding 3, that is all dead code, so remove it. We are no longer calling dummy since it is dead code so no need to save the link register on the stack and save the parameters just do the addition and return.
static void dummy ( unsigned int x )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
making dummy local/static and nobody using it
00000000 <fun>:
0: e2811007 add r1, r1, #7
4: e0810000 add r0, r1, r0
8: e12fff1e bx lr
last experiment
static unsigned int dummy ( unsigned int x )
{
return(x+1);
}
unsigned int fun ( unsigned int a, unsigned int b )
{
unsigned int c;
c=dummy(a-b+3);
return(a+b+c);
}
dummy is static and called, but it is optimized here to be inline, so there is no call to it, so neither external folks can use it (static) nor does anyone inside this file use it, so there is no reason to generate it.
The compiler examines all of the operations and optimizes it. a-b+3+1+a+b = a+a+4 = (2*a)+4 = (a<<1)+4;
Why did they use a shift left instead of just add r0,r0,r0, dont know maybe the shift is faster in the pipe, or maybe it is irrelevant and either one was just as good and the compiler author chose this method, or perhaps the internal code which is somewhat generic figured this out and before it went to the backend it had been converted into a shift rather than an add.
00000000 <fun>:
0: e1a00080 lsl r0, r0, #1
4: e2800004 add r0, r0, #4
8: e12fff1e bx lr
command lines used for these experiments
arm-none-eabi-gcc -c -O2 so.c -o so.o
arm-none-eabi-as v.s -o v.o
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
arm-none-eabi-objdump -D so.o
arm-none-eabi-objdump -D so.elf
The point being you can do these kinds of simple experiments yourself and begin to understand what is going on when and where the compiler and linker makes modifications to the machine code if that is how you like to think of it. And then realize which I sorta showed here when I added the non-static dummy function (the fun() function now was pushed deeper into memory) as you add more code, for example a C library from one operating system to the next may change or may be mostly identical except for the system calls so they may vary in size causing other code to possibly be moved around a larger puts() might cause printf() to live at a different address all other factors held constant. If not liking statically then no doubt there will be differences, just the file format and mechanism used to find a .so file on linux or a .dll on windows parse it, connect the dots runtime between the system calls in the application to the shared libraries. The file format and the location of shared libraries by themselves in application space will cause the binary that is linked with the operating specific stub to be different. And then eventually the actual system call itself.

Binaries are generally not portable across systems. Linux (and Unix) use ELF executable format, macOS uses Mach-O and Windows uses PE.

How to force gcc generate thumb 32 bit instructions?

Is it possible to force generating thumb 32 bit instructions when possible?
For example I have:
int main(void) {
8000280: b480 push {r7}
8000282: b085 sub sp, #20
8000284: af00 add r7, sp, #0
uint32_t a, b, c;
a = 1;
8000286: 2301 movs r3, #1
8000288: 60fb str r3, [r7, #12]
b = 1;
800028a: 2301 movs r3, #1
800028c: 60bb str r3, [r7, #8]
c = a+b;
800028e: 68fa ldr r2, [r7, #12]
8000290: 68bb ldr r3, [r7, #8]
8000292: 4413 add r3, r2
8000294: 607b str r3, [r7, #4]
while (1) ;
8000296: e7fe b.n 8000296 <main+0x16>
But they are all thumb 16 bit. For testing reason I want thumb 32 bit instructions.

Well, I don't think you can do that directly, but there is a very strange way of achieving that in a very indirect way. Just a few days ago I've seen a project where the compilation process was not just a simple arm-none-eabi-gcc ... -c file.c -o file.o. The project was calling arm-none-eabi-gcc to also generate extended assembly listing, which was later assembled manually with arm-none-eabi-as into an object file. If you would do it like that, then between these two steps you could modify the assembly listing to have wide instructions only - in most (all?) cases you could just use sed to add .w suffix to the instructions (change add r3, r2 into add.w r3, r2 and so on). Whether or not such level of build complication is worth it, is up to you...

Inline NOPs not optimized out in LLVM

I'm working through an example in this overview of compiling inline ARM assembly using GCC. Rather than GCC, I'm using llvm-gcc 4.2.1, and I'm compiling the following C code:
#include <stdio.h>
int main(void) {
printf("Volatile NOP\n");
asm volatile("mov r0, r0");
printf("Non-volatile NOP\n");
asm("mov r0, r0");
return 0;
}
Using the following commands:
llvm-gcc -emit-llvm -c -o compiled.bc input.c
llc -O3 -march=arm -o output.s compiled.bc
My output.s ARM ASM file looks like this:
.syntax unified
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.file "compiled.bc"
.text
.globl main
.align 2
.type main,%function
main: # #main
# BB#0: # %entry
str lr, [sp, #-4]!
sub sp, sp, #16
str r0, [sp, #12]
ldr r0, .LCPI0_0
str r1, [sp, #8]
bl puts
#APP
mov r0, r0
#NO_APP
ldr r0, .LCPI0_1
bl puts
#APP
mov r0, r0
#NO_APP
mov r0, #0
str r0, [sp, #4]
str r0, [sp]
ldr r0, [sp, #4]
add sp, sp, #16
ldr lr, [sp], #4
bx lr
# BB#1:
.align 2
.LCPI0_0:
.long .L.str
.align 2
.LCPI0_1:
.long .L.str1
.Ltmp0:
.size main, .Ltmp0-main
.type .L.str,%object # #.str
.section .rodata.str1.1,"aMS",%progbits,1
.L.str:
.asciz "Volatile NOP"
.size .L.str, 13
.type .L.str1,%object # #.str1
.section .rodata.str1.16,"aMS",%progbits,1
.align 4
.L.str1:
.asciz "Non-volatile NOP"
.size .L.str1, 17
The two NOPs are between their respective #APP/#NO_APP pairs. My expectation is that the asm() statement without the volatile keyword will be optimized out of existence due to the -O3 flag, but clearly both inline assembly statements survive.
Why does the asm("mov r0, r0") line not get recognized and removed as a NOP?

As Mystical and Mārtiņš Možeiko have describe the compiler does not optimize the code; ie, change the instructions. What the compiler does optimize is when the instruction is scheduled. When you use volatile, then the compiler will not re-schedule. In your example, re-scheduling would be moving before or after the printf.
The other optimization the compiler might make is to get C values to register for you. Register allocation is very important to optimization. This doesn't optimize the assembler, but allow the compiler to do sensible things with other code with-in the function.
To see the effect of volatile, here is some sample code,
int example(int test, int add)
{
int v1=5, v2=0;
int i=0;
if(test) {
asm volatile("add %0, %1, #7" : "=r" (v2) : "r" (v2));
i+= add * v1;
i+= v2;
} else {
asm ("add %0, %1, #7" : "=r" (v2) : "r" (v2));
i+= add * v1;
i+= v2;
}
return i;
}
The two branches have identical code except for the volatile. gcc 4.7.2 generates the following code for an ARM926,
example:
cmp r0, #0
bne 1f /* branch if test set? */
add r1, r1, r1, lsl #2
add r0, r0, #7 /* add seven delayed */
add r0, r0, r1
bx lr
1: mov r0, #0 /* test set */
add r0, r0, #7 /* add seven immediate */
add r1, r1, r1, lsl #2
add r0, r0, r1
bx lr
Note: The assembler branches are reversed to the 'C' code. The 2nd branch is slower on some processors due to pipe lining. The compiler prefers that
add r1, r1, r1, lsl #2
add r0, r0, r1
do not execute sequentially.
The Ethernut ARM Tutorial is an excellent resource. However, optimize is a bit of an overloaded word. The compiler doesn't analyze the assembler, only the arguments and where the code will be emitted.

volatile is implied if the asm statement has no outputs declared.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Using GCC's builtin functions in arm - gcc

Related

ARM GCC hardfault when using -O2

How to create an array of booleans in arm assembly?

Would a compiled program have different machine codes when executed on PC, Mac, Linux etc?

How to force gcc generate thumb 32 bit instructions?

Inline NOPs not optimized out in LLVM

Categories

Resources