Would a compiled program have different machine codes when executed on PC, Mac, Linux etc? - compilation

I'm just getting started learning the very fundamentals of computers and programming. I've grasped that, in compiled programs, the machine code generated is specific to each type of processor and its instruction set. What I'd like to know is: say I have Windows, OS X and Linux all running on the exact same hardware (the same processor, to be specific), would the machine code generated from a compiled program differ across the OSes? Is machine code OS dependent, or will it be the exact same bits and bytes across all the OSes?

What happened when you tried it? As others have answered, the supported file formats may vary, but you asked about machine code.
The machine code for the same processor core is of course the same. But only some percentage of the code is generic, for example:
a=b+c;
printf("%u\n",a);
Even assuming you are using the same compiler version targeted at the same cpu, but with a different operating system (the same computer running linux, then later windows), the addition is ideally the same, as long as the top level function/source code is the same.
First off, the entry point of the code may vary from one OS to another, so the linker may lay the program out differently. For position dependent code, fixed addresses end up in the binary; you can call that machine code or not, but the specific addresses may result in different instructions. A branch/jump may have to be encoded differently based on the address, of course, and where one system can use a single form of branch, another may require a trampoline to get from one place to another.
Then there are the system calls themselves; there is no reason to assume the system calls of different operating systems are the same. This can make the code vary in size and so on, which can again force the compiler or linker to make different machine code choices: on some instruction sets, how near or far a jump target is changes the encoding, as does whether an address can be encoded as an immediate or has to be loaded from a nearby location and branched to indirectly.
EDIT
Long before you start to ponder what happens across different operating systems on the same platform or target, understand the basics of putting a program together and what kinds of things can change the machine code.
A very simple program/function
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
    dummy(a+b+3);
    return(a+b+7);
}
compile then disassemble
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e0804001 add r4, r0, r1
8: e2840003 add r0, r4, #3
c: ebfffffe bl 0 <dummy>
10: e2840007 add r0, r4, #7
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
There is actually a ton of stuff going on there. This is arm, full sized (not thumb...yet). The a parameter comes in in r0, b in r1, and the result goes out in r0. lr is basically the return address register, so if we are calling another function we need to save it (on the stack). Likewise we are going to re-use r0 to call dummy, and in fact with this calling convention any function can modify/destroy r0-r3, so the compiler needs to deal with our two parameters. Since I intentionally used a+b in both places, the compiler can optimize a+b into a register and preserve it. For performance reasons, no doubt, they save r4 on the stack and then use r4 to hold a+b; by the calling convention you cannot modify r4 at will in a function, so any nested function would have to preserve it and return with it in the as-found state, which makes it safe to just leave a+b there while calling other functions.
They add 3 to our a+b sum in r4 and call dummy. When it returns they add 7 to the a+b sum in r4 and return in r0.
From a machine code perspective this is not yet linked, and dummy is an external function:
c: ebfffffe bl 0 <dummy>
I call it dummy because when we use it here in a second it does nothing but return, a dummy function. The instruction encoded there is clearly wrong: branching to the beginning of fun would not work, that is recursion, and that is not what we asked for. So let's link it. At a minimum we need to declare a _start label to make the gnu linker happy, but I want to do more than that:
.globl _start
_start:
    bl fun
    b .

.globl dummy
dummy:
    bx lr
and linking for an entry address of 0x1000 produced this
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: e12fff1e bx lr
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: ebfffffa bl 1008 <dummy>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
The linker filled in the address for dummy by modifying the instruction that calls it, so you can see that the machine code has changed.
1018: ebfffffa bl 1008 <dummy>
How far away things are, among other factors, can change this. The bl instruction here has a long range, but not the full address space, so if the program is sufficiently large and there is a lot of code between the caller and the callee, the linker may have to do more work. I can force that for a different reason: arm has arm and thumb modes, and you have to use specific instructions to switch between them, bl not being one of them (or at least not on all arms).
If I add these two lines in front of the dummy function
.thumb
.thumb_func
.globl dummy
dummy:
    bx lr
forcing the assembler to generate thumb instructions and to mark the dummy label as a thumb label, then:
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: 4770 bx lr
100a: 46c0 nop ; (mov r8, r8)
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: eb000002 bl 1028 <__dummy_from_arm>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
00001028 <__dummy_from_arm>:
1028: e59fc000 ldr r12, [pc] ; 1030 <__dummy_from_arm+0x8>
102c: e12fff1c bx r12
1030: 00001009 andeq r1, r0, r9
1034: 00000000 andeq r0, r0, r0
Because a bx is required to switch modes in this case, and fun is arm mode while dummy is thumb mode, the linker has very nicely added what I call a trampoline function to bounce off of to get from fun to dummy. The link register (lr) contains a bit that tells the bx on the return which mode to switch to, so there is no extra work needed in the dummy function itself.
Had there been a great distance between the two functions in memory, I would hope the linker would have patched that up for us too, but you never know until you try.
.globl _start
_start:
    bl fun
    b .

.globl dummy
dummy:
    bx lr
    .space 0x10000000
sigh, oh well
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
v.o: In function `_start':
(.text+0x0): relocation truncated to fit: R_ARM_CALL against symbol `fun' defined in .text section in so.o
If we change one plus to a minus:
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
    dummy(a-b+3);
    return(a+b+7);
}
and it gets more complicated
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a04001 mov r4, r1
8: e1a05000 mov r5, r0
c: e0400001 sub r0, r0, r1
10: e2800003 add r0, r0, #3
14: ebfffffe bl 0 <dummy>
18: e2840007 add r0, r4, #7
1c: e0800005 add r0, r0, r5
20: e8bd4070 pop {r4, r5, r6, lr}
24: e12fff1e bx lr
They can no longer optimize a+b into a single result, so they need more stack space, or in the case of this optimizer, they save other things on the stack to make room in registers. Now you ask: why is r6 pushed on the stack when it is not being modified? This abi requires a 64 bit aligned stack, so that means either pushing four registers to save three things, or pushing the three things and then adjusting the stack pointer separately; for this instruction set, pushing the fourth register is cheaper than fetching and executing another instruction.
If for whatever reason the external function becomes local:
void dummy ( unsigned int x )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
    dummy(a-b+3);
    return(a+b+7);
}
that changes things again
00000000 <dummy>:
0: e12fff1e bx lr
00000004 <fun>:
4: e2811007 add r1, r1, #7
8: e0810000 add r0, r1, r0
c: e12fff1e bx lr
Since dummy doesn't use the parameter passed in, and the optimizer can now see that, there is no reason to waste instructions subtracting and adding 3; that is all dead code, so it is removed. We are no longer calling dummy, so there is no need to save the link register or the parameters on the stack: just do the addition and return.
static void dummy ( unsigned int x )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
    dummy(a-b+3);
    return(a+b+7);
}
Making dummy local/static, with nobody using it, removes it entirely:
00000000 <fun>:
0: e2811007 add r1, r1, #7
4: e0810000 add r0, r1, r0
8: e12fff1e bx lr
last experiment
static unsigned int dummy ( unsigned int x )
{
    return(x+1);
}
unsigned int fun ( unsigned int a, unsigned int b )
{
    unsigned int c;
    c=dummy(a-b+3);
    return(a+b+c);
}
dummy is static and is called, but it is inlined here, so there is no call to it; external code cannot use it (static) and nobody inside this file calls the out-of-line copy, so there is no reason to generate it at all.
The compiler examines all of the operations and optimizes them: (a-b+3+1)+a+b = a+a+4 = (2*a)+4 = (a<<1)+4.
Why did they use a shift left instead of just add r0,r0,r0? I don't know. Maybe the shift is faster in the pipe, or maybe it is irrelevant and either one was just as good and the compiler author chose this method, or perhaps the somewhat generic internal code had already converted it into a shift rather than an add before it reached the backend.
00000000 <fun>:
0: e1a00080 lsl r0, r0, #1
4: e2800004 add r0, r0, #4
8: e12fff1e bx lr
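To convince yourself of that algebra, a quick check harness (hypothetical, mine, not from the original answer) can compare the original source against the reduced form; at any optimization level it should print pass:
#include <stdio.h>
static unsigned int dummy ( unsigned int x )
{
    return(x+1);
}
unsigned int fun ( unsigned int a, unsigned int b )
{
    unsigned int c;
    c=dummy(a-b+3);
    return(a+b+c);
}
int main ( void )
{
    unsigned int a;
    unsigned int b;
    for(a=0;a<1000;a++)
    {
        for(b=0;b<1000;b++)
        {
            /* the identity holds for all unsigned inputs, b cancels out */
            if(fun(a,b)!=((a<<1)+4))
            {
                printf("fail 0x%X 0x%X\n",a,b);
                return(1);
            }
        }
    }
    printf("pass\n");
    return(0);
}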
command lines used for these experiments
arm-none-eabi-gcc -c -O2 so.c -o so.o
arm-none-eabi-as v.s -o v.o
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
arm-none-eabi-objdump -D so.o
arm-none-eabi-objdump -D so.elf
The point being, you can do these kinds of simple experiments yourself and begin to understand when and where the compiler and linker make modifications to the machine code, if that is how you like to think of it. And then realize, which I sort of showed here when I added the non-static dummy function (the fun() function was pushed deeper into memory), that this compounds as you add more code. The C library from one operating system to the next may change, or may be mostly identical except for the system calls, so functions may vary in size, causing other code to move around: a larger puts() might cause printf() to live at a different address, all other factors held constant. If not linking statically, there will no doubt be differences, starting with the file format and the mechanism used to find a .so file on linux or a .dll on windows, parse it, and connect the dots at runtime between the calls in the application and the shared libraries. The file format and the placement of the shared libraries in application space will by themselves cause the binary linked against the OS-specific stub to be different. And then eventually there is the actual system call itself.

Binaries are generally not portable across systems. Linux (and Unix) use the ELF executable format, macOS uses Mach-O, and Windows uses PE.

Related

Why is ARM gcc calling __udivsi3 when dividing by a constant?

I'm using the latest available version of ARM-packaged GCC:
arm-none-eabi-gcc (GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
When I compile this code using "-mcpu=cortex-m0 -mthumb -Ofast":
int main(void) {
    uint16_t num = (uint16_t) ADC1->DR;
    ADC1->DR = num / 7;
}
I would expect that the division would be accomplished by a multiplication and a shift, but instead this code is being generated:
08000b5c <main>:
8000b5c: b510 push {r4, lr}
8000b5e: 4c05 ldr r4, [pc, #20] ; (8000b74 <main+0x18>)
8000b60: 2107 movs r1, #7
8000b62: 6c20 ldr r0, [r4, #64] ; 0x40
8000b64: b280 uxth r0, r0
8000b66: f7ff facf bl 8000108 <__udivsi3>
8000b6a: b280 uxth r0, r0
8000b6c: 6420 str r0, [r4, #64] ; 0x40
8000b6e: 2000 movs r0, #0
8000b70: bd10 pop {r4, pc}
8000b72: 46c0 nop ; (mov r8, r8)
8000b74: 40012400 .word 0x40012400
Using __udivsi3 instead of multiply and shift is terribly inefficient. Am I using the wrong flags, or missing something else, or is this a GCC bug?
The Cortex-M0 lacks instructions to perform a 32x32->64-bit multiply. Because num is an unsigned 16-bit quantity, multiplying it by 9363 and shifting right 16 would yield a correct result in all cases, but, likely because a uint16_t will be promoted to int before the multiply, gcc does not include such optimizations.
From what I've observed, gcc does a generally poor job of optimizing for the Cortex-M0, failing to employ some straightforward optimizations which would be appropriate for that platform, but sometimes employing "optimizations" which aren't. Given something like
void test1(uint8_t *p)
{
    for (int i=0; i<32; i++)
        p[i] = (p[i]*9363) >> 16; // Divide by 7
}
gcc happens to generate okay code for the Cortex-M0 at -O2, but if the multiplication were replaced with an addition the compiler would generate code which reloads the constant 9363 on every iteration of the loop. When using addition, even if the code were changed to:
void test2(uint16_t *p)
{
    register unsigned u9363 = 9363;
    for (int i=0; i<32; i++)
        p[i] = (p[i]+u9363) >> 16;
}
gcc would still bring the load of the constant into the loop. Sometimes gcc's optimizations may also have unexpected behavioral consequences. For example, one might expect that on a platform like a Cortex-M0, invoking something like:
unsigned short test(register unsigned short *p)
{
    register unsigned short temp = *p;
    return temp - (temp >> 15);
}
while an interrupt changes the contents of *p might yield behavior consistent with the old value or the new value. The Standard wouldn't require such treatment, but most implementations intended to be suitable for embedded programming tasks will offer stronger guarantees than what the Standard requires. If either the old or new value would be equally acceptable, letting the compiler use whichever is more convenient may allow more efficient code than using volatile. As it happens, however, the "optimized" code from gcc will replace the two uses of temp with separate loads of *p.
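For comparison, a hypothetical volatile variant (naming mine, not from the original answer) pins the access down to exactly one load, at whatever cost in generated code:
unsigned short test_volatile(volatile unsigned short *p)
{
    unsigned short temp = *p; /* volatile: the compiler must emit exactly one load */
    return temp - (temp >> 15);
}
This nails down the behavior under interrupts, but as the answer notes, it may produce less efficient code than letting the compiler pick either value.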
If you're using gcc with the Cortex-M0 and are at all concerned about performance or the possibility of "astonishing" behaviors, get in the habit of inspecting the compiler's output. For some kinds of loop, it might even be worth considering testing out -O0. If code makes suitable use of the register keyword, its performance can sometimes beat that of identical code processed with -O2.
Expanding on supercat's answer.
Feed this:
unsigned short fun ( unsigned short x )
{
return(x/7);
}
to something with a larger multiply:
00000000 <fun>:
0: e59f1010 ldr r1, [pc, #16] ; 18 <fun+0x18>
4: e0832190 umull r2, r3, r0, r1
8: e0400003 sub r0, r0, r3
c: e08300a0 add r0, r3, r0, lsr #1
10: e1a00120 lsr r0, r0, #2
14: e12fff1e bx lr
18: 24924925 .word 0x24924925
1/7 in binary (long division):

      0.001001001001001...
 111 ) 1.000000
         111
         ===
          1000
           111
           ===
             1   (and the pattern repeats)

0.001001001001001001001001001001...
= 0.0010 0100 1001 0010 0100 1001 0010 01...
= 0x0.2492492492...

Taking 32 fraction bits and rounding up gives 0x24924925.
For this to work you need a 64 bit result; you take the top half and do some adjustments. For example:
7 * 0x24924925 = 0x100000003
and you take the top 32 bits, 0x00000001 (it is not quite this simple in general, but for this value you can see it working).
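In C, the same sequence the arm code above performs looks roughly like this (a sketch with my own naming, following the disassembly, plus a quick check loop):
#include <stdio.h>
#include <stdint.h>
uint32_t div7 ( uint32_t x )
{
    /* top 32 bits of the 64 bit product, as umull produces */
    uint32_t t = (uint32_t)(((uint64_t)x * 0x24924925u) >> 32);
    /* the adjustment steps from the disassembly: sub, add with lsr #1, lsr #2 */
    return((t + ((x - t) >> 1)) >> 2);
}
int main ( void )
{
    uint32_t ra;
    for(ra=0;ra<0x10000;ra++)
    {
        if(div7(ra)!=(ra/7))
        {
            printf("fail 0x%X\n",ra);
            return(1);
        }
    }
    printf("pass\n");
    return(0);
}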
The all-thumb-variants multiply (what the Cortex-M0 has) is 32 bits = 32 bits * 32 bits, so the result would be 0x00000003, and that does not work.
So instead take 16 bits of the fraction, 0x2492, which we can round up to 0x2493 as supercat did, or leave as 0x2492.
Now we can use the 32x32 = 32 bit multiply:
0x2492 * 7 = 0x0FFFE
0x2493 * 7 = 0x10005
Let's run with the larger one:
0x100000000/0x2493 is a number greater than 65536, so a 16 bit operand times 0x2493 cannot overflow 32 bits; that is fine.
but:
0x3335 * 0x2493 = 0x0750DB6F
0x3336 * 0x2493 = 0x07510002
0x3335 / 7 = 0x750
0x3336 / 7 = 0x750
So 0x3336*0x2493 >> 16 gives 0x751 where the correct answer is 0x750; you can only get so far with that simple multiply and shift.
If we follow the model of the arm code:
for(ra=0;ra<0x10000;ra++)
{
    rb=0x2493*ra;
    rd=rb>>16;
    rb=ra-rd;
    rb=rd+(rb>>1);
    rb>>=2;
    rc=ra/7;
    printf("0x%X 0x%X 0x%X \n",ra,rb,rc);
    if(rb!=rc) break;
}
Then it works from 0x0000 to 0xFFFF, so you could write the asm to do that (note it needs to be 0x2493 not 0x2492).
If you know the operand is not going above a certain value then you can use more bits of 1/7th to multiply against.
In any case, when the compiler does not do this optimization for you, you still might have a chance to do it yourself.
Now that I think about it, I ran into this before, and now it makes sense. I was on a full sized arm and called a routine compiled in arm mode (the rest of the code was in thumb mode) containing a switch statement: basically, if denominator == 1 then result = x/1; if denominator == 2 then result = x/2; and so on. That avoided the gcclib function and generated the 1/x multiplies (I had maybe 3 or 4 different constants to divide by):
unsigned short udiv7 ( unsigned short x )
{
    unsigned int r0;
    unsigned int r3;
    r0=x;
    r3=0x2493*r0;
    r3>>=16;
    r0=r0-r3;
    r0=r3+(r0>>1);
    r0>>=2;
    return(r0);
}
Assuming I made no mistakes:
00000000 <udiv7>:
0: 4b04 ldr r3, [pc, #16] ; (14 <udiv7+0x14>)
2: 4343 muls r3, r0
4: 0c1b lsrs r3, r3, #16
6: 1ac0 subs r0, r0, r3
8: 0840 lsrs r0, r0, #1
a: 18c0 adds r0, r0, r3
c: 0883 lsrs r3, r0, #2
e: b298 uxth r0, r3
10: 4770 bx lr
12: 46c0 nop ; (mov r8, r8)
14: 00002493 .word 0x00002493
That should be faster than a generic division library routine.
Edit
I think I see what supercat has done with the solution that works:
((i*37449 + 16384u) >> 18)
We have this as the 1/7th fraction:
0.001001001001001001001001001001
but we can only do a 32 = 32x32 bit multiply. The leading zeros give us some breathing room we might be able to take advantage of. So instead of 0x2492/0x2493 we can try:
1001001001001001
0x9249
0x9249*0xFFFF = 0x92486db7
And so far it won't overflow:
rb=((ra*0x9249) >> 18);
By itself it fails at ra = 7: 7 * 0x9249 = 0x3FFFF, and 0x3FFFF>>18 is zero, not 1.
So maybe
rb=((ra*0x924A) >> 18);
that fails at:
0xAAAD 0x1862 0x1861
So what about:
rb=((ra*0x9249 + 0x8000) >> 18);
and that works.
What about supercat's?
rb=((ra*0x9249 + 0x4000) >> 18);
and that runs clean for all values 0x0000 to 0xFFFF:
rb=((ra*0x9249 + 0x2000) >> 18);
and that fails here:
0xE007 0x2000 0x2001
So there are a couple of solutions that work.
unsigned short udiv7 ( unsigned short x )
{
    unsigned int ret;
    ret=x;
    ret=((ret*0x9249 + 0x4000) >> 18);
    return(ret);
}
00000000 <udiv7>:
0: 4b03 ldr r3, [pc, #12] ; (10 <udiv7+0x10>)
2: 4358 muls r0, r3
4: 2380 movs r3, #128 ; 0x80
6: 01db lsls r3, r3, #7
8: 469c mov ip, r3
a: 4460 add r0, ip
c: 0c80 lsrs r0, r0, #18
e: 4770 bx lr
10: 00009249 .word 0x00009249
Edit
As far as the "why" question goes, that is not a Stack Overflow question; if you want to know why gcc doesn't do this, ask the authors of that code. All we can do is speculate here, and the speculation is that they either chose not to because of the number of instructions, or they have an algorithm that says: if this is not a 64 = 32x32 bit multiply, do not bother.
Again the why question is not a Stack Overflow question, so perhaps we should just close this question and delete all of the answers.
I found this to be incredibly educational (once you know/understand what was being said).
Another "why?" question is: why did gcc do it the way it did when it could have done it the way supercat or I did?
The compiler can only rearrange integer expressions if it knows that the result will be correct for any input allowed by the language.
Because 7 is co-prime to 2, it is impossible to exactly divide every possible input by seven using only a multiply and a shift.
If you know that it is possible for the inputs you intend to provide, then you have to do it yourself using the multiply and shift operators.
Depending on the size of the input, you will have to choose how much to shift so that the output is correct (or at least good enough for your application) and so that the intermediate doesn't overflow. The compiler has no way of knowing what is accurate enough for your application, or what your maximum input will be. If it allows any input up to the maximum of the type, then every multiplication will overflow.
In general GCC will only turn a division into a plain shift when the divisor is a power of two.
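For example (a trivial illustration; the exact output depends on compiler version and flags), a power of two divisor needs no multiply at all:
unsigned int div8 ( unsigned int x )
{
    return(x/8); /* typically compiles to a single logical shift right by 3 */
}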

Stop ARM GCC Optimising Out Function Call

volatile static const uint8_t mcau8IsBlank[] = {0xFF}; // Value in MCU FLASH memory
// The above value may actually be modified by a FLASH Write elsewhere in the code

bool halIsBlank() {
    return ((*(uint8_t*)mcau8IsBlank));
}

void someFuncInAnotherFile() {
    uint8_t data[64];
    data[0] = halIsBlank(); // ARM GCC is optimising away this function call,
                            // replacing it simply with a 0xFF constant
    // ... etc
    // ... transmit data
}
How do I get ARM GCC to not optimise out the call to halIsBlank()? The compiler is assuming that mcau8IsBlank[] is always == 0xFF and is thus simply replacing the call with a 0xFF constant.
I can disable optimisation of the calling function (someFuncInAnotherFile()) by adding __attribute__((optimize(0))) to it, but it would be better to add some attribute to the called function (halIsBlank()); no attributes or keywords that I've tried seem to do the trick.
If an object is declared as const then any attempt to modify it leads to undefined behaviour, so the compiler is allowed to assume that a const object never changes. And you explicitly cast away the volatile qualifier with the (uint8_t*) cast, so the compiler can assume the access is not volatile at that point.
I'd remove that cast to (uint8_t *), which seems to be pointless anyway.
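A minimal sketch of the result (assuming the goal is simply to keep the volatile qualifier visible through the access):
#include <stdint.h>
#include <stdbool.h>

volatile static const uint8_t mcau8IsBlank[] = {0xFF};

bool halIsBlank(void) {
    return mcau8IsBlank[0]; /* volatile access: the compiler must reload it every call */
}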
so.c
const unsigned char one = 0x11;
unsigned char two = 0x22;
volatile unsigned char three = 0x33;
extern unsigned char four;

unsigned int get_one ( void )
{
    return(one);
}
unsigned int get_two ( void )
{
    return(two);
}
unsigned int get_three ( void )
{
    return(three);
}
unsigned int get_four ( void )
{
    return(four);
}
four.c
unsigned char four = 0x44;
gnu ld linker script
MEMORY
{
    rom : ORIGIN = 0x10000000, LENGTH = 0x1000
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)
                four.o(.data) } > rom
    .rodata : { *(.rodata*) } > rom
    .data   : { *(.data*) } > ram
}
result
Disassembly of section .text:
10000000 <_start>:
10000000: 20001000 andcs r1, r0, r0
10000004: 10000009 andne r0, r0, r9
10000008 <reset>:
10000008: e7fe b.n 10000008 <reset>
...
1000000c <get_one>:
1000000c: 2011 movs r0, #17
1000000e: 4770 bx lr
10000010 <get_two>:
10000010: 4b01 ldr r3, [pc, #4] ; (10000018 <get_two+0x8>)
10000012: 7818 ldrb r0, [r3, #0]
10000014: 4770 bx lr
10000016: bf00 nop
10000018: 20000000 andcs r0, r0, r0
1000001c <get_three>:
1000001c: 4b01 ldr r3, [pc, #4] ; (10000024 <get_three+0x8>)
1000001e: 7858 ldrb r0, [r3, #1]
10000020: 4770 bx lr
10000022: bf00 nop
10000024: 20000000 andcs r0, r0, r0
10000028 <get_four>:
10000028: 4b01 ldr r3, [pc, #4] ; (10000030 <get_four+0x8>)
1000002a: 7818 ldrb r0, [r3, #0]
1000002c: 4770 bx lr
1000002e: bf00 nop
10000030: 10000034 andne r0, r0, r4, lsr r0
10000034 <four>:
10000034: Address 0x0000000010000034 is out of bounds.
Disassembly of section .rodata:
10000035 <one>:
10000035: Address 0x0000000010000035 is out of bounds.
Disassembly of section .data:
20000000 <two>:
20000000: Address 0x0000000020000000 is out of bounds.
20000001 <three>:
20000001: Address 0x0000000020000001 is out of bounds.
Because one is const, the local function optimizes down to an immediate, but one is also global, so a copy is still placed in flash (in case other objects reference it). Make it static and the allocation in flash goes away.
two is plain old .data; the compiler has to build the code this way, and the linker fills in the address at link time.
three is a volatile global, handled the same way as two because it is global and in .data; volatile does not do much here.
four is a solution, if you choose it: define it outside this file/optimization domain, and the compiler has to generate code that reaches out to an unknown location to get it. In the linker script, tell the linker to place it in flash. So while it is in flash and technically not read/write, if you have a way to write the flash then this will work.
Well, actually it will not, because when you erase the flash to change four you wipe out some percentage of this .text code along with it. You need to know what the part's erase blocks are and put things like this in one of those blocks, and to change one value you have to save the whole block to ram, erase, and write back everything including any changed values. And it is rare that you can execute from the same flash logic as the flash being erased, so you may need a trampoline in ram to do this save, erase, restore routine (more linker magic plus a copy and jump).
One function calling another in the same optimization domain will likely be inlined, so you may want to find a please-do-not-inline command line option, although for this case that does not make much sense; you want to optimize, and possibly make the small function static so it goes away altogether.

LPC4088 checksum value for Thumb?

In the LPC4088 user manual (p. 876) we can read that the LPC4088 microcontroller has a really extraordinary startup procedure:
This looks like total nonsense and I need someone to help me clear things up... In the world of ARM I've heard countless times to put a vector table looking like this:
reset: b _start
undefined: b undefined
software_interrupt: b software_interrupt
prefetch_abort: b prefetch_abort
data_abort: b data_abort
nop
interrupt_request: b interrupt_request
fast_interrupt_request: b fast_interrupt_request
exactly at location 0x00000000 in my binary file. But why would we do that if this location is shadowed at boot by a boot ROM vector table that can't even be changed, as it is read-only?! So where can we put our own vector table? I thought about putting it at 0x1FFF0000 so it would be transferred to location 0x00000000 at reset, but I can't do that because of the read-only area...
Now to the second part. ARM expects to find exactly 8 vectors at 0x00000000, and at reset the boot ROM checks whether the sum of the 8 vectors is zero; only if this is true does user code execute. To pass this check we need to sum up the first 7 vectors and save the 2's complement of that sum in the last vector, which is the vector for fast interrupt requests residing at 0x0000001C. Well, that layout only holds if your code is 4-byte aligned (ARM encoding), but is it still true if your code is 2-byte aligned (Thumb encoding), which is the case with all Cortex-M4 cores, which can only execute Thumb encoded opcodes? So why did they explicitly mention that the 2's complement of the sum has to be at 0x0000001C when this will never come into play with a Cortex-M4? Is 0x0000000E the proper address to save the 2's complement to?
And the third part. Why would the boot ROM even check whether the sum of the first 8 vectors is zero when they are already in the boot ROM?! And are read-only!
Can you see that something is weird here? I need someone to explain the unclarities in the above three paragraphs...
You need to read the arm documentation as well as the nxp documentation. The non-cortex-m cores boot differently than the cortex-m cores; you keep getting stuck on that.
The cortex-m is documented in the armv7-m ARM ARM (Architectural Reference Manual). It is based on VECTORS, not INSTRUCTIONS: an address of the handler, not an instruction like in the full sized arm cores. Exception 7 is documented as reserved (on their ARM7TDMI based mcus it was likewise the reserved vector they used for this checksum). Depending on the arm core you are using, it expects as many as 144 or 272 entries (the exceptions plus up to 128 or 256 interrupts, depending on what the core supports).
(Note the aarch64 processor, armv8 in 64 bit mode, also boots differently than the traditional full sized 32 bit arm processor, with an even bigger table.)
This checksum thing is classic NXP and makes sense: there is no reason to launch into an erased or improperly prepared flash and brick or hang.
.cpu cortex-m0
.thumb
.thumb_func
.globl _start
_start:
    .word 0x20001000 @ 0 SP load
    .word reset      @ 1 Reset
    .word hang       @ 2 NMI
    .word hang       @ 3 HardFault
    .word hang       @ 4 MemManage
    .word hang       @ 5 BusFault
    .word hang       @ 6 UsageFault
    .word 0x00000000 @ 7 Reserved

.thumb_func
hang: b hang
.thumb_func
reset:
    b hang
which gives:
Disassembly of section .text:
00000000 <_start>:
0: 20001000 andcs r1, r0, r0
4: 00000023 andeq r0, r0, r3, lsr #32
8: 00000021 andeq r0, r0, r1, lsr #32
c: 00000021 andeq r0, r0, r1, lsr #32
10: 00000021 andeq r0, r0, r1, lsr #32
14: 00000021 andeq r0, r0, r1, lsr #32
18: 00000021 andeq r0, r0, r1, lsr #32
1c: 00000000 andeq r0, r0, r0
00000020 <hang>:
20: e7fe b.n 20 <hang>
00000022 <reset>:
22: e7fd b.n 20 <hang>
Now make an ad-hoc tool that computes the checksum and adds it to the binary.
Looking at the above program as words, this is the program:
0x20001000
0x00000023
0x00000021
0x00000021
0x00000021
0x00000021
0x00000021
0xDFFFEF38
0xE7FDE7FE
and if you flash it the bootloader should be happy with it and let it run.
Now that is assuming the checksum is word based; if it is byte based then you would want a different number.
99% of baremetal programming is reading and research. If you had a binary from them already built, or used a sandbox that supports this processor or family, you could examine that binary and see how all of this works. Or look at someone's github examples or blog. They did document this, and they have used this scheme for many years, since before they were NXP, so nothing really new. Now, is it a word based or byte based checksum? The documentation implies word based, and that makes more sense, but a simple experiment and/or looking at sandbox produced binaries would resolve that.
How I did it for this answer.
#include <stdio.h>

unsigned int data[8]=
{
    0x20001000,
    0x00000023,
    0x00000021,
    0x00000021,
    0x00000021,
    0x00000021,
    0x00000021,
    0x00000000,
};

int main ( void )
{
    unsigned int ra;
    unsigned int rb;
    rb=0;
    for(ra=0;ra<7;ra++)
    {
        rb+=data[ra];
    }
    data[7]=(-rb);
    rb=0;
    for(ra=0;ra<8;ra++)
    {
        rb+=data[ra];
        printf("0x%08X 0x%08X\n",data[ra],rb);
    }
    return(0);
}
output:
0x20001000 0x20001000
0x00000023 0x20001023
0x00000021 0x20001044
0x00000021 0x20001065
0x00000021 0x20001086
0x00000021 0x200010A7
0x00000021 0x200010C8
0xDFFFEF38 0x00000000
then cut and pasted stuff into the answer.
How I have done it in the past is to make an ad-hoc utility, called from my makefile, that operates on the objcopy'd .bin file and either modifies it in place or creates a new .bin with the checksum applied. You should be able to write that in 20-50 lines of code in your favorite language.
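A minimal sketch of such a tool (hypothetical, mine: a word based checksum is assumed, along with a little-endian host matching the little-endian image):
#include <stdio.h>
#include <stdint.h>

int main ( int argc, char *argv[] )
{
    FILE *fp;
    uint32_t v[8];
    if(argc<2) { fprintf(stderr,"usage: %s file.bin\n",argv[0]); return(1); }
    fp=fopen(argv[1],"r+b");
    if(fp==NULL) return(1);
    if(fread(v,4,8,fp)!=8) { fclose(fp); return(1); }
    /* vector 7 = 2's complement of the sum of vectors 0..6 */
    v[7]=-(v[0]+v[1]+v[2]+v[3]+v[4]+v[5]+v[6]);
    fseek(fp,7*4,SEEK_SET);
    fwrite(&v[7],4,1,fp);
    fclose(fp);
    return(0);
}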
another comment question:
.cpu cortex-m0
.thumb

.word one
.word two
.word three

.thumb_func
one:
    nop
two:
.thumb_func
three:
    nop
Disassembly of section .text:
00000000 <one-0xc>:
0: 0000000d andeq r0, r0, sp
4: 0000000e andeq r0, r0, lr
8: 0000000f andeq r0, r0, pc
0000000c <one>:
c: 46c0 nop ; (mov r8, r8)
0000000e <three>:
e: 46c0 nop ; (mov r8, r8)
the .thumb_func affects the label AFTER it: one and three get the thumb bit set in their addresses (0x0D and 0x0F), while two does not (0x0E).

Using GCC's builtin functions in arm

I'm working on a cortex-m3 board with a bare-metal toolchain without libc.
I implemented a memcpy which copies data byte-by-byte, but it's too slow. The GCC manual says it provides __builtin_memcpy, so I decided to use it. Here is the implementation with __builtin_memcpy:
#include <stddef.h>

void *memcpy(void *dest, const void *src, size_t n)
{
    return __builtin_memcpy(dest,src,n);
}
When I compile this code, it becomes a recursive function which never ends.
$ arm-none-eabi-gcc -march=armv7-m -mcpu=cortex-m3 -mtune=cortex-m3 \
-O2 -ffreestanding -c memcpy.c -o memcpy.o
$ arm-none-eabi-objdump -d memcpy.o
memcpy.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <memcpy>:
0: f7ff bffe b.w 0 <memcpy>
Am I doing something wrong? How can I use the compiler-generated memcpy version?
Builtin functions are not supposed to be used to implement themselves :)
Builtin functions are supposed to be used in application code; the compiler may then generate a special instruction sequence, or a call to the underlying real function.
Compare:
int a [10], b [20];

void
foo ()
{
    __builtin_memcpy (a, b, 10 * sizeof (int));
}
This results in:
foo:
    stmfd sp!, {r4, r5}
    ldr r4, .L2
    ldr r5, .L2+4
    ldmia r4!, {r0, r1, r2, r3}
    mov ip, r5
    stmia ip!, {r0, r1, r2, r3}
    ldmia r4!, {r0, r1, r2, r3}
    stmia ip!, {r0, r1, r2, r3}
    ldmia r4, {r0, r1}
    stmia ip, {r0, r1}
    ldmfd sp!, {r4, r5}
    bx lr
But:
void
bar (int n)
{
    __builtin_memcpy (a, b, n * sizeof (int));
}
results in a call to the memcpy function:
bar:
    mov r2, r0, asl #2
    stmfd sp!, {r3, lr}
    ldr r1, .L5
    ldr r0, .L5+4
    bl memcpy
    ldmfd sp!, {r3, lr}
    bx lr
Theoretically, the library is not part of the C compiler and not part of the toolchain.
Thus, if you wrote memcpy(&a,&b,sizeof(a)), the compiler MUST generate a subroutine call.
The idea of __builtin is to inform the compiler that the function is standard and can be optimized. Thus, if you write __builtin_memcpy(&a,&b,sizeof(a)), the compiler MAY generate a subroutine call, but in most cases it will not. For example, if the size is known to be 4 at compile time, only a single mov instruction will be generated. (Another advantage: even when a subroutine call is generated, the compiler knows the library function has no side effects.)
So it's ALWAYS better to use __builtin_memcpy instead of memcpy; in modern libraries this is done with #define memcpy __builtin_memcpy right in string.h.
But you still need to implement memcpy somewhere; calls will still be generated in some places. For string functions on ARM, a word at a time (4-byte) implementation is strongly recommended.
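A minimal sketch of such an implementation (my code, assuming the usual bare-metal conventions; build with -ffreestanding or -fno-builtin so gcc does not recognize the loop and turn it back into a memcpy call):
#include <stddef.h>
#include <stdint.h>

void *memcpy(void *dest, const void *src, size_t n)
{
    uint8_t *d = dest;
    const uint8_t *s = src;
    /* copy a word at a time when both pointers are 4-byte aligned */
    if((((uintptr_t)d | (uintptr_t)s) & 3) == 0)
    {
        while(n >= 4)
        {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4; s += 4; n -= 4;
        }
    }
    /* byte copy for the remainder (or for unaligned buffers) */
    while(n--) *d++ = *s++;
    return dest;
}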

Fixed point math with ARM Cortex-M4 and gcc compiler

I'm using a Freescale Kinetis K60 with the CodeWarrior IDE (which I believe uses GCC for the compiler).
I want to multiply two 32 bit numbers (which results in a 64 bit number) and only retain the upper 32 bits.
I think the correct assembly instruction for the ARM Cortex-M4 is the SMMUL instruction. I would prefer to access this instruction from C code rather than assembly. How do I do this?
I imagine the code would ideally be something like this:
int a,b,c;

a = 1073741824; // 0x40000000 = 0.5 as a D0 fixed point number
b = 1073741824; // 0x40000000 = 0.5 as a D0 fixed point number
c = ((long long)a*b) >> 31; // 31 because there are two sign bits after the multiplication,
                            // so I can throw away the most significant bit
When I try this in CodeWarrior, I get the correct result for c (536870912 = 0.25 as a D0 FP number). But I don't see the SMMUL instruction anywhere, and the multiply takes 3 instructions (UMULL, MLA, and MLA; I don't understand why it is using an unsigned multiply, but that is another question). I have also tried a right shift of 32, since that might make more sense for the SMMUL instruction, but that doesn't do anything different.
The problem you get with optimizing that code is:
08000328 <mul_test01>:
8000328: f04f 5000 mov.w r0, #536870912 ; 0x20000000
800032c: 4770 bx lr
800032e: bf00 nop
Your code doesn't do anything at runtime, so the optimizer can just compute the final answer.
this:
.thumb_func
.globl mul_test02
mul_test02:
    smull r2,r3,r0,r1
    mov r0,r3
    bx lr
called with this:
c = mul_test02(0x40000000,0x40000000);
gives 0x10000000
UMULL gives the same result because you are using positive numbers; the operands and results are all positive, so it doesn't get into the signed/unsigned differences.
Hmm, well, you got me on this one. I would read your code as telling the compiler to promote the multiply to 64 bits. smull takes two 32 bit operands and gives a 64 bit result, which is not quite what your code is asking for... but both gcc and clang used smull anyway. Even when I left it as an uncalled function, so the compiler could not know at compile time that the operands had no significant digits above 32 bits, they still used smull.
Perhaps the shift was the reason.
Yup, that was it..
int mul_test04 ( int a, int b )
{
    int c;
    c = ((long long)a*b) >> 31;
    return(c);
}
gives this for both gcc and clang (well, clang recycles r0 and r1 instead of using r2 and r3):
08000340 <mul_test04>:
8000340: fb81 2300 smull r2, r3, r1, r0
8000344: 0fd0 lsrs r0, r2, #31
8000346: ea40 0043 orr.w r0, r0, r3, lsl #1
800034a: 4770 bx lr
but this
int mul_test04 ( int a, int b )
{
    int c;
    c = ((long long)a*b);
    return(c);
}
gives this
gcc:
08000340 <mul_test04>:
8000340: fb00 f001 mul.w r0, r0, r1
8000344: 4770 bx lr
8000346: bf00 nop
clang:
0800048c <mul_test04>:
800048c: 4348 muls r0, r1
800048e: 4770 bx lr
So with the bit shift the compilers realize that you are only interested in the upper portion of the result, so they can discard the upper portion of the operands, which means smull can be used.
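So, for the Q31 fixed point multiply the question is after, the portable C idiom is just the shifted 64 bit product; a minimal sketch (the helper name is mine):
#include <stdint.h>

/* Q31 * Q31 -> Q31: the 64 bit product carries two sign bits,
   hence the shift by 31 rather than 32 */
static inline int32_t q31_mul ( int32_t a, int32_t b )
{
    return (int32_t)(((int64_t)a * b) >> 31);
}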
Now if you do this:
int mul_test04 ( int a, int b )
{
    int c;
    c = ((long long)a*b) >> 32;
    return(c);
}
both compilers get even smarter, in particular clang:
0800048c <mul_test04>:
800048c: fb81 1000 smull r1, r0, r1, r0
8000490: 4770 bx lr
gcc:
08000340 <mul_test04>:
8000340: fb81 0100 smull r0, r1, r1, r0
8000344: 4608 mov r0, r1
8000346: 4770 bx lr
I can see 0x40000000 as a fixed point number where you keep track of the decimal place yourself, with that place at a fixed location; 0x20000000 then makes sense as the answer. I can't yet decide if that 31 bit shift works universally or just for this one case.
A complete example used for the above is here
https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/sample01
and I did run it on an stm32f4 to verify it works and the results.
EDIT:
If you pass the parameters into the function instead of hardcoding them within the function:
int myfun ( int a, int b )
{
    return(a+b);
}
The compiler is forced to generate runtime code instead of optimizing the answer away at compile time.
Now if you call that function from another function with hardcoded numbers:
...
c=myfun(0x1234,0x5678);
...
In the calling function, the compiler may choose to compute the answer and just place it there at compile time. If the myfun() function is global (not declared static), the compiler doesn't know whether other code, linked later, will use it, so even though it optimizes the answer near the call point in this file, it still has to produce the actual function and leave it in the object for other files to call; so you can still examine what the compiler/optimizer did with that C code. Unless you use llvm, for example, where you can optimize the whole project (across files), external code calling this function will use the real function rather than a compile time computed answer.
Both gcc and clang did what I am describing: they left runtime code for the function as a global function, but within the file they computed the answer at compile time and placed the hardcoded answer in the code instead of calling the function:
int mul_test04 ( int a, int b )
{
    int c;
    c = ((long long)a*b) >> 31;
    return(c);
}
in another function in the same file:
hexstring(mul_test04(0x40000000,0x40000000),1);
The function itself is implemented in the code:
0800048c <mul_test04>:
800048c: fb81 1000 smull r1, r0, r1, r0
8000490: 0fc9 lsrs r1, r1, #31
8000492: ea41 0040 orr.w r0, r1, r0, lsl #1
8000496: 4770 bx lr
but where it is called they have hardcoded the answer because they had all the information needed to do so:
8000520: f04f 5000 mov.w r0, #536870912 ; 0x20000000
8000524: 2101 movs r1, #1
8000526: f7ff fe73 bl 8000210 <hexstring>
If you don't want the hardcoded answer, you need to use a function that is not in the same optimization pass.
Manipulating the compiler and optimizer comes down to a lot of practice, and it is not an exact science as the compilers and optimizers are constantly evolving (for better or worse).
By isolating a small bit of code in a function you are causing problems in another way: larger functions are more likely to need a stack frame and to evict variables from registers to the stack as they go, while smaller functions might not need to, and the optimizer may change how the code is implemented as a result. You might test a code fragment one way to see what the compiler is doing, then use it in a larger function and not get the result you want. If there is an exact instruction or sequence of instructions you want, implement it in assembler. If you are targeting a specific instruction sequence on a specific instruction set/processor, avoid the game, avoid your code changing when you change computers/compilers/etc, and just use assembler for that target; if needed, ifdef or otherwise use conditional compile options to build for different targets without the assembler.
GCC supports actual fixed-point types: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html
I'm not sure what instruction it will use, but it might make your life easier.
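Per the linked docs, usage looks roughly like this (hedged: these types are only implemented on some GCC targets, so this sketch may well not compile for a Cortex-M):
/* illustrative only: the r suffix marks a _Fract literal */
_Fract half = 0.5r;
_Fract result;

void quarter ( void )
{
    result = half * half; /* 0.25 in fixed point, no manual shifting */
}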
