In the LPC4088 user manual (p. 876) we can read that LPC4088 microcontroler has a really extraordinary startup procedure:
This looks like a total nonsense and I need someone to help me clear things out... In the world of ARM I've heard countless times to put vector table looking like this:
reset: b _start
undefined: b undefined
software_interrupt: b software_interrupt
prefetch_abort: b prefetch_abort
data_abort: b data_abort
nop
interrupt_request: b interrupt_request
fast_interrupt_request: b fast_interrupt_request
exactly at location 0x00000000 in my binary file, but why would we do that if this location is shadowed at boot with a boot ROM vector table which can't even be changed as it is read-only?! So where can we put our own vector table? I thought about putting it at 0x1FFF0000 so it would be transferred to location 0x00000000 at reset but can't do that because of read-only area...
Now to the second part. ARM expects to find exactly 8 vectors at 0x00000000 and at reset boot ROM checks if sum of 8 vectors is zero and only if this is true user code executes. To pass this check we need to sum up first 7 vectors and save it's 2's complement to the last vector which is a vector for fast interrupt requests residing at 0x0000001C. Well this is only true if your code is 4-bytes aligned (ARM encoding) but is it still true if your code is 2-bytes aligned (Thumb encoding) which is the case with all Cortex-M4 cores that can only execute Thumb encoded opcodes... So why did they explicitly mention that 2's complement of the sum has to be at 0x0000001C when this will never come in to play with Cortex-M4. Is 0x0000000E the proper address to save the 2's complement to?
And third part. Why would boot ROM even check if sum of first 8 vectors is zero when they are already in boot ROM?! And are read-only!
Can you see something is weird here? I need someone to explain to me the unclarities in the above three paragraphs...
you need to read the arm documentation as well as the nxp documentation. The non-cortex-m cores boot differently than the cortex-m cores you keep getting stuck there.
The cortex m is documented in the armv7m ARM ARM (architectural reference manual). It is based on VECTORS not INSTRUCTIONS. An address to the handler not an instruction like in full sized arm cores. Exception 7 is documented as reserved (for the ARM7TDMI based mcus from them it was the reserved vector they used for this checksum as well). Depending on the arm core you are using they expect as many as 144 or 272 (exceptions plus up to 128 or 256 interrupts depending on what the core supports).
(note the aarch64 processor, armv8 in 64 bit mode also boots differently than the traditional full sized 32 bit arm processor, even bigger table).
This checksum thing is classic NXP and makes sense, no reason to launch into an erased or not properly prepared flash and brick or hang.
.cpu cortex-m0
.thumb
.thumb_func
.globl _start
_start:
.word 0x20001000 # 0 SP load
.word reset # 1 Reset
.word hang # 2 NMI
.word hang # 3 HardFault
.word hang # 4 MemManage
.word hang # 5 BusFault
.word hang # 6 UsageFault
.word 0x00000000 # 7 Reserved
.thumb_func
hang: b hang
.thumb_func
reset:
b hang
which gives:
Disassembly of section .text:
00000000 <_start>:
0: 20001000 andcs r1, r0, r0
4: 00000023 andeq r0, r0, r3, lsr #32
8: 00000021 andeq r0, r0, r1, lsr #32
c: 00000021 andeq r0, r0, r1, lsr #32
10: 00000021 andeq r0, r0, r1, lsr #32
14: 00000021 andeq r0, r0, r1, lsr #32
18: 00000021 andeq r0, r0, r1, lsr #32
1c: 00000000 andeq r0, r0, r0
00000020 <hang>:
20: e7fe b.n 20 <hang>
00000022 <reset>:
22: e7fd b.n 20 <hang>
now make an ad-hoc tool that does the checksum and adds it to the binary
Looking at the above program as words this is the program:
0x20001000
0x00000023
0x00000021
0x00000021
0x00000021
0x00000021
0x00000021
0xDFFFEF38
0xE7FDE7FE
and if you flash it the bootloader should be happy with it and let it run.
Now that is assuming the checksum is word based if it is byte based then you would want a different number.
99% of baremetal programming is reading and research. If you had a binary from them already built or used a sandbox that supports this processor or family you could examine the binary built and see how all of this works. Or look at someones github examples or blog to see how this works. They did document this, and they have used this scheme for many years now before they were NXP, so nothing really new...Now is it a word based or byte based checksum, the documentation implies word based and that makes more sense. but a simple experiment and/or looking at sandbox produced binaries would have resolved that.
How I did it for this answer.
#include <stdio.h>
unsigned int data[8]=
{
0x20001000,
0x00000023,
0x00000021,
0x00000021,
0x00000021,
0x00000021,
0x00000021,
0x00000000,
};
int main ( void )
{
unsigned int ra;
unsigned int rb;
rb=0;
for(ra=0;ra<7;ra++)
{
rb+=data[ra];
}
data[7]=(-rb);
rb=0;
for(ra=0;ra<8;ra++)
{
rb+=data[ra];
printf("0x%08X 0x%08X\n",data[ra],rb);
}
return(0);
}
output:
0x20001000 0x20001000
0x00000023 0x20001023
0x00000021 0x20001044
0x00000021 0x20001065
0x00000021 0x20001086
0x00000021 0x200010A7
0x00000021 0x200010C8
0xDFFFEF38 0x00000000
then cut and pasted stuff into the answer.
How I have done it in the past is make an adhoc util that I call from my makefile that operates on the objcopied .bin file and either modifies that one or creates a new .bin file that has the checksum applied. You should be able to write that in 20-50 lines of code, choose your favorite language.
another comment question:
.cpu cortex-m0
.thumb
.word one
.word two
.word three
.thumb_func
one:
nop
two:
.thumb_func
three:
nop
Disassembly of section .text:
00000000 <one-0xc>:
0: 0000000d andeq r0, r0, sp
4: 0000000e andeq r0, r0, lr
8: 0000000f andeq r0, r0, pc
0000000c <one>:
c: 46c0 nop ; (mov r8, r8)
0000000e <three>:
e: 46c0 nop ; (mov r8, r8)
the .thumb_func affects the label AFTER...
Related
I'm using a Cortex-M0 MCU from NXP (LPC845) and I'm trying to figure out what GCC is trying to do :)
Basically, the C code (pseudo) is as follows:
volatile uint8_t readb1 = 0x1a; // dummy
readb1 = GpioPadB(GPIO_PIN);
and the macro I wrote is
(*((volatile uint8_t*)(SOME_GPIO_ADDRESS)))
Now the code is working, but it produced some extra UXTB instruction I don't understand
00000378: ldrb r3, [r3, #0]
0000037a: ldr r2, [pc, #200] ; (0x444 <AppInit+272>)
0000037c: uxtb r3, r3
0000037e: strb r3, [r2, #0]
105 asm("nop");
My explanation is as follows:
load BYTE from address specified in R3, put result in R3 <-- this is load from GPIO register as BYTE
load in R2 address of readb1 variable
UXTB extends the uint8 value ??? But rotate argument is 0, so basically does nothing for uint8 !
store as BYTE to R2's address (my variable) data from R3
Why does that?
First of all, it should know that data in R3 has just a BYTE meaning (it already generates LDRB correctly). Second, the STRB will already trim 7..0 LSB so why using UXTB ?
Thanks for clarifications,
EDITED:
Compiler version:
gcc version 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599] (GNU Tools for Arm Embedded Processors 9-2019-q4-major)
I use -O3
Looks like an extra instruction left in by the compiler and/or there is some nuance to the cortex-m or newer cores (would love to know what that nuance is).
#define GpioPadB(x) (*((volatile unsigned char *)(x)))
volatile unsigned char readb1;
void fun ( void )
{
readb1 = 0x1A;
readb1 = GpioPadB(0x1234000);
}
an apt gotten gcc
arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:4.9.3+svn231177-1) 4.9.3 20150529 (prerelease)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-objdump -d so.o
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
as one would expect.
arm-none-eabi-gcc -O2 -c -mthumb -march=armv7-m so.c -o so.o
arm-none-eabi-objdump -d so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <fun>:
0: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
2: 211a movs r1, #26
4: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
6: 7011 strb r1, [r2, #0]
8: 781b ldrb r3, [r3, #0]
a: b2db uxtb r3, r3
c: 7013 strb r3, [r2, #0]
e: 4770 bx lr
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
with the extra utxb instruction in there
Something a bit newer
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
for armv6m and armv7m
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
for armv4t
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
and the utxb is gone.
I think it is just a missed optimization, peephole or otherwise.
As answered already though, when you use non-gpr-sized variables you can expect and/or tolerate the compiler converting up to the register size. Varies by compiler and target as to whether they do it on the way in or the way out (when a variable is read or just before it is written or used down the road).
For x86 where you can access various portions of the register separately (or use memory based operands) you will see they do not do this (in gcc) even for cases when it clearly needs a sign extension or padding. And sort it out down the road when the value is used.
You can search the gcc sources for utxb and perhaps see the issue or a comment.
EDIT
Note that clang takes a different path, it burns clocks generating the address but does not do the extension
00000000 <fun>:
0: f240 0000 movw r0, #0
4: f2c0 0000 movt r0, #0
8: 211a movs r1, #26
a: 7001 strb r1, [r0, #0]
c: f244 0100 movw r1, #16384 ; 0x4000
10: f2c0 1123 movt r1, #291 ; 0x123
14: 7809 ldrb r1, [r1, #0]
16: 7001 strb r1, [r0, #0]
18: 4770 bx lr
clang --version
clang version 11.1.0 (https://github.com/llvm/llvm-project.git 1fdec59bffc11ae37eb51a1b9869f0696bfd5312)
Target: armv7m-none-unknown-eabi
Thread model: posix
InstalledDir: /opt/llvm11armv7m/bin
I think it is simply an optimization problem with gcc/gnu.
The "volatile" modifier is to blame. It does not call type extensions when written, because it doesn't make sense. But when reading, it always calls the extension. Because now the data is stored in a register, and must be ready for any operations, over the entire range of the visibility limit.
Abandoning "volatile" removes any additional operations on the data, but it can also remove the very fact of using the variable.
https://godbolt.org/z/cGvc8r6se
First of all, it should know that data in R3 has just a BYTE meaning
Registers are only 32 bits. They do not have any other "meaning". The register must contain the same value as the loaded byte - thus UXTB. Any other operation later (for example adding something requires the whole register to contain the correct value.
Generally speaking, using shorter types than 32 bit usually adds some overhead as Cortex-Mx processors do not do operations on the "portions" of the registers.
To fix this problem, you need to file a bug at https://gcc.gnu.org/bugzilla/. But there are two difficult situations.
There are a lot of bugs related to "volatile", and all of them are not closed, and most of them are not even confirmed. As far as I understand, the developers are already tired of fighting windmills, and do not even react to it.
To successfully fix the problem - you need to find the extreme, the very one that wrote the root of evil. Authorship and all. You will not be allowed into someone else's branch, and only the most advanced are allowed into the master.
But even before this moment, you need to find the reason for this behavior, and here again there are problems.
The GCC code is huge, you can search endlessly.
My personal opinion: GCC treats ARM kernel registers as part of fast memory. This memory can be accessed via a physical address, which only adds to the problems. Well, if this is memory, and the dimension does not match, then, according to GCC, you need to add expansion commands.
Why does GCC use the correct commands when simply accessed? - well, he reads from memory to memory. Emphasis - "from memory". No matter what happens next, you need to read it right now.
I am relatively inexperienced with ARM assembly, and need help understanding a few lines. I have used Godbolt to compile C++ 11 code with the ARM gcc 8.2 compiler and got these lines of assembly:
.L10:
.word .LANCHOR0
I read that .LANCHOR0 are section anchors, but what does that mean?
I understand that .word and .data can be used together to declare variables and assign values to memory spaces like this:
.data ! start a group of variable declarations
x: .word 23 ! int x = 23;
But, what does
.L10:
.word .LANCHOR0
do? There is no label preceding .word here.
Secondly, what does it mean when a block of .word lines are proceeded by another block of assembly instructions like this?
.L7:
.word 131586
.word .LANCHOR0
_GLOBAL__sub_I_unsigned int atlantic_line_ns::getSimInstAddr<atlantic_line_ns::Cr>():
mov r2, #0
ldr r3, .L10
str r2, [r3]
str r2, [r3, #4]
bx lr
Thanks in advance.
UPDATE
After reading the ARM documentation a bit more, I understand that
.L7:
.word 131586
.word .LANCHOR0
Allocates 2 memory locations, storing values, 131586 and the value at .LANCHOR0, there. Are these memory locations next to each other? Also, what does .LANCHR0 mean?
The following line
.L10:
.word .LANCHOR0
instructs assembler to allocate 4 bytes to hold the address of code/data at .LANCHOR0 label. The address itself is located at label .L10. Note that both .LANCHOR0 and .L10 are ordinary local labels (suffix simply denotes that compiler used them for different purposes).
As you noted, it's possible to intermix code and readonly data in assembly as e.g. in
.L7:
.word 131586
.word .LANCHOR0
mov r2, #0
ldr r3, .L10
Compiler frequently does this to declare local function data (literal pool, static variables, etc.). Data from .word lines next to each other will also be consecutive in object code.
I'm just getting started learning the very fundamentals of computers and programming. I've grasped that, in compiled programs, the machine code generated is specific to the type of processors and their instruction sets. What I'd like to know is, say, I have Windows, OS X and Linux all running on the exact same hardware (processor to be specific), would the machine code generated from this compiled program differ across the OSes? Is machine code OS dependent or will it be an exact same copy of bits and bytes across all the OS?
What happened when you tried it? As answered the file formats supported may vary, but you asked about machine code.
The machine code for the same processor core is the same of course. But only some percentage of the code is generic
a=b+c:
printf("%u\n",a);
Assume even you are using the same compiler version targeted at the same cpu but with a different operating system (same computer running linux then later windows) the addition is ideally the same assuming the top level function/source code is the same.
First off the entry point of the code may vary from one OS to another, so the linker may make the program different, for position dependent code, fixed addresses will end up in the binary, you can call that machine code or not, but the specific addresses may result in different instructions. A branch/jump may have to be encoded differently based on the address of course, but in one system you may have one form of branch another may require a trampoline to get from one place to another.
Then there are the system calls themselves, no reason to assume that the system calls between operating systems are the same. This can make the code vary in size, etc which can again cause the compiler or linker to have to make different machine code choices based on how near or far a jmp target is for some instruction sets or can the address be encoded as an immediate or do you have to load it from a nearby location then branch to that indirectly.
EDIT
Long before you start to ponder/worry about what happens on different operating systems on the same platform or target. Understand the basics of putting a program together, and what kinds of things can change the machine code.
A very simple program/function
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a+b+3);
return(a+b+7);
}
compile then disassemble
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e0804001 add r4, r0, r1
8: e2840003 add r0, r4, #3
c: ebfffffe bl 0 <dummy>
10: e2840007 add r0, r4, #7
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
There is actually a ton of stuff going on there. This is arm, full sized (not thumb...yet). The a parameter comes in in r0, b in r1, result out in r0. lr is the return address register basically, so if we are calling another function we need to save that (on the stack) likewise we are going to re-use r0 to call dummy and in fact with this calling convention any function can modify/destroy r0-r3, so the compiler is going to need to deal with our two parameters, since I intentionally used them in the same way the compiler can optimize a+b into a register and save that on the stack, actually for performance reasons no doubt, they save r4 on the stack and then use r4 to save a+b, you cannot modify r4 at will in a function based on the calling convention so any nested function would have to preserve it and return with it in the as found state, so it is safe to just leave a+b there when calling other functions.
They add 3 to our a+b sum in r4 and call dummy. When it returns they add 7 to the a+b sum in r4 and return in r0.
From a machine code perspective this is not yet linked and dummy is an external function
c: ebfffffe bl 0 <dummy>
I call it dummy because when we use it here in a second it does nothing but return, a dummy function. The instruction encoded there is clearly wrong branching to the beginning of fun would not work that is recursion that is not what we asked for. So lets link it, at a minimum we need to declare a _start label to make the gnu linker happy, but I want to do more than that:
.globl _start
_start
bl fun
b .
.globl dummy
dummy:
bx lr
and linking for an entry address of 0x1000 produced this
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: e12fff1e bx lr
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: ebfffffa bl 1008 <dummy>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
The linker filled in the address for dummy by modifying the instruction that calls it, so you can see that the machine code has changed.
1018: ebfffffa bl 1008 <dummy>
Depending on how far away things are or other factors can change this, the bl instruction here has a long range but not the full address space, so if the program is sufficiently large and there is a lot of code between the caller and the callee then the linker may have to do more work. For different reasons I can cause that. Arm has arm and thumb modes and you have to use specific instructions in order to switch, bl not being one of them (or at least not for all of the arms).
If I add these two lines in front of the dummy function
.thumb
.thumb_func
.globl dummy
dummy:
bx lr
Forcing the assembler to generate thumb instructions and mark the dummy label as a thumb label then
00001000 <_start>:
1000: eb000001 bl 100c <fun>
1004: eafffffe b 1004 <_start+0x4>
00001008 <dummy>:
1008: 4770 bx lr
100a: 46c0 nop ; (mov r8, r8)
0000100c <fun>:
100c: e92d4010 push {r4, lr}
1010: e0804001 add r4, r0, r1
1014: e2840003 add r0, r4, #3
1018: eb000002 bl 1028 <__dummy_from_arm>
101c: e2840007 add r0, r4, #7
1020: e8bd4010 pop {r4, lr}
1024: e12fff1e bx lr
00001028 <__dummy_from_arm>:
1028: e59fc000 ldr r12, [pc] ; 1030 <__dummy_from_arm+0x8>
102c: e12fff1c bx r12
1030: 00001009 andeq r1, r0, r9
1034: 00000000 andeq r0, r0, r0
Because the BX is required to switch modes in this case and fun is arm mode and dummy is thumb mode the linker has very nicely for us added a trampoline function I call it to bounce off of to get from fun to dummy. The link register (lr) contains a bit that tells the bx on the return which mode to switch to so there is no extra work there to modify the dummy function.
Had there have been a great distance between the two functions in memory I would hope the linker would have also patched that up for us, but you never know until you try.
.globl _start
_start:
bl fun
b .
.globl dummy
dummy:
bx lr
.space 0x10000000
sigh, oh well
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
v.o: In function `_start':
(.text+0x0): relocation truncated to fit: R_ARM_CALL against symbol `fun' defined in .text section in so.o
if we change one plus to a minus:
extern unsigned int dummy ( unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
and it gets more complicated
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a04001 mov r4, r1
8: e1a05000 mov r5, r0
c: e0400001 sub r0, r0, r1
10: e2800003 add r0, r0, #3
14: ebfffffe bl 0 <dummy>
18: e2840007 add r0, r4, #7
1c: e0800005 add r0, r0, r5
20: e8bd4070 pop {r4, r5, r6, lr}
24: e12fff1e bx lr
they can no longer optimize the a+b result so more stack space or in the case of this optimizer, save other things on the stack to make room in registers. Now you ask why is r6 pushed on the stack? It is not being modified? This abi requires a 64 bit aligned stack so that means pushing four registers to save three things or push the three things and then modify the stack pointer, for this instruction set pushing the four things is cheaper than fetching another instruction and executing it.
if for whatever reason the external function becomes local
void dummy ( unsigned int )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
that changes things again
00000000 <dummy>:
0: e12fff1e bx lr
00000004 <fun>:
4: e2811007 add r1, r1, #7
8: e0810000 add r0, r1, r0
c: e12fff1e bx lr
Since dummy doesnt use the parameter passed and the optimizer can now see it, then there is no reason to waste instructions subtracting and adding 3, that is all dead code, so remove it. We are no longer calling dummy since it is dead code so no need to save the link register on the stack and save the parameters just do the addition and return.
static void dummy ( unsigned int x )
{
}
unsigned int fun ( unsigned int a, unsigned int b )
{
dummy(a-b+3);
return(a+b+7);
}
making dummy local/static and nobody using it
00000000 <fun>:
0: e2811007 add r1, r1, #7
4: e0810000 add r0, r1, r0
8: e12fff1e bx lr
last experiment
static unsigned int dummy ( unsigned int x )
{
return(x+1);
}
unsigned int fun ( unsigned int a, unsigned int b )
{
unsigned int c;
c=dummy(a-b+3);
return(a+b+c);
}
dummy is static and called, but it is optimized here to be inline, so there is no call to it, so neither external folks can use it (static) nor does anyone inside this file use it, so there is no reason to generate it.
The compiler examines all of the operations and optimizes it. a-b+3+1+a+b = a+a+4 = (2*a)+4 = (a<<1)+4;
Why did they use a shift left instead of just add r0,r0,r0, dont know maybe the shift is faster in the pipe, or maybe it is irrelevant and either one was just as good and the compiler author chose this method, or perhaps the internal code which is somewhat generic figured this out and before it went to the backend it had been converted into a shift rather than an add.
00000000 <fun>:
0: e1a00080 lsl r0, r0, #1
4: e2800004 add r0, r0, #4
8: e12fff1e bx lr
command lines used for these experiments
arm-none-eabi-gcc -c -O2 so.c -o so.o
arm-none-eabi-as v.s -o v.o
arm-none-eabi-ld -Ttext=0x1000 v.o so.o -o so.elf
arm-none-eabi-objdump -D so.o
arm-none-eabi-objdump -D so.elf
The point being you can do these kinds of simple experiments yourself and begin to understand what is going on when and where the compiler and linker makes modifications to the machine code if that is how you like to think of it. And then realize which I sorta showed here when I added the non-static dummy function (the fun() function now was pushed deeper into memory) as you add more code, for example a C library from one operating system to the next may change or may be mostly identical except for the system calls so they may vary in size causing other code to possibly be moved around a larger puts() might cause printf() to live at a different address all other factors held constant. If not liking statically then no doubt there will be differences, just the file format and mechanism used to find a .so file on linux or a .dll on windows parse it, connect the dots runtime between the system calls in the application to the shared libraries. The file format and the location of shared libraries by themselves in application space will cause the binary that is linked with the operating specific stub to be different. And then eventually the actual system call itself.
Binaries are generally not portable across systems. Linux (and Unix) use ELF executable format, macOS uses Mach-O and Windows uses PE.
I have built a simple application using GNU gcc (4.8.1429) for ARM (ATsam4LC4B: cortexM4 core, 256K flash) that:
loads at 0x3F000 (linker option -Ttext=0x3F000) and initializes,
copies itself (mirrors) at 0
jumps to the mirror Reset_Handler() (near 0) using assembly
the mirror application begins to run as expected
The problem raises when the flash memory location of the mirror main() call (flash location 0x532) is reached:
The instruction ldr [pc, #32] that is located there, loads r3 with 0x3F565 (thus calling the original main()), instead of the expected 0x565 (so that to call the mirror main() as expected).
This happens even if all the registers containing 0x3F... are zeroed prior to the ldr instruction.
The debugger reports the PC register to be 0x534 as expected and in compliance to the flash region within the stepping is performed.
Can anybody please explain to me what is going on?
Thank you in advance.
The program is compiled with [-mthumb -nostdlib -nostartfiles -ffreestanding -msingle-pic-base -mno-pic-data-is-text-relative -mlong-calls -mpoke-function-name] flags and linked with the [-Wl,--gc-sections -mcpu=cortex-m4 -Ttext=0x3f000 -nostdlib -nostartfiles -ffreestanding -Wl,-dead-strip -Wl,-static -Wl,--entry=Reset_Handler -Wl,--cref -mthumb] flags. The linker script defines rom start=0.
The region of interest within the produced .lss file contains (omitted lines are marked as ...):
...
void Reset_Handler(void)
{
...
/* Initialize the relocate segment */
...
/* Clear the zero segment */
...
/* Set the vector table base address */
/* Initialize the C library */
// __libc_init_array();
/* Branch to main function */
main();
3f532: 4b08 ldr r3, [pc, #32] ; (3f554 <Reset_Handler+0x5c>)
3f534: 4798 blx r3
3f536: e7fe b.n 3f536 <Reset_Handler+0x3e>
3f538: 20000000 .word 0x20000000
3f53c: 0003f59c .word 0x0003f59c
3f540: 20000004 .word 0x20000004
3f544: 20000004 .word 0x20000004
3f548: 20000008 .word 0x20000008
3f54c: e000ed00 .word 0xe000ed00
3f550: 0003f000 .word 0x0003f000
3f554: 0003f565 .word 0x0003f565
3f558: 6e69616d .word 0x6e69616d
3f55c: 00 .byte 0x00
3f55d: 00 .byte 0x00
3f55e: bf00 nop
3f560: ff000008 .word 0xff000008
0003f564 <main>:
int main (void)
{
3f564: b510 push {r4, lr}
...
asm("bx %0"::"r"(*(unsigned*)0x3F004 - 0x3F000));
3f580: 4b05 ldr r3, [pc, #20] ; (3f598 <main+0x34>)
3f582: 681b ldr r3, [r3, #0]
3f584: f5a3 337c sub.w r3, r3, #258048 ; 0x3f000
3f588: 4718 bx r3
}
return 0;
}
3f58a: 2000 movs r0, #0
3f58c: bd10 pop {r4, pc}
3f58e: bf00 nop
3f590: e0001000 .word 0xe0001000
3f594: 0003f329 .word 0x0003f329
3f598: 0003f004 .word 0x0003f004
...
#Notlikethat answered my question.
I was confused assuming that pc+32 address should be loaded to r3 directly, instead of the contents of this address.
In the CMSIS definitions for gcc you can find something like this:
static __INLINE void __DMB(void) { __ASM volatile ("dmb"); }
My question is: what use does a memory barrier have if it does not declare "memory" in the clobber list?
Is it an error in the core_cm3.h or is there a reason why gcc should behave correctly without any additional help?
I did some testing with gcc 4.5.2 (built with LTO). If I compile this code:
static inline void __DMB(void) { asm volatile ("dmb"); }
static inline void __DMB2(void) { asm volatile ("dmb" ::: "memory"); }
char x;
char test1 (void)
{
x = 15;
return x;
}
char test2 (void)
{
x = 15;
__DMB();
return x;
}
char test3 (void)
{
x = 15;
__DMB2();
return x;
}
using arm-none-eabi-gcc -Os -mcpu=cortex-m3 -mthumb -c dmb.c, then from arm-none-eabi-objdump -d dmb.o I get this:
00000000 <test1>:
0: 4b01 ldr r3, [pc, #4] ; (8 <test1+0x8>)
2: 200f movs r0, #15
4: 7018 strb r0, [r3, #0]
6: 4770 bx lr
8: 00000000 .word 0x00000000
0000000c <test2>:
c: 4b02 ldr r3, [pc, #8] ; (18 <test2+0xc>)
e: 200f movs r0, #15
10: 7018 strb r0, [r3, #0]
12: f3bf 8f5f dmb sy
16: 4770 bx lr
18: 00000000 .word 0x00000000
0000001c <test3>:
1c: 4b03 ldr r3, [pc, #12] ; (2c <test3+0x10>)
1e: 220f movs r2, #15
20: 701a strb r2, [r3, #0]
22: f3bf 8f5f dmb sy
26: 7818 ldrb r0, [r3, #0]
28: 4770 bx lr
2a: bf00 nop
2c: 00000000 .word 0x00000000
It is obvious that __DBM() only inserts the dmb instruction and it takes DMB2() to actually force the compiler to flush the values cached in the registers.
I guess I found a CMSIS bug.
IMHO the CMSIS version is right.
Injecting the barrier instruction without the memory in clobber list achieves exactly what it is supposed to do:
If the previous write on "x" variable was buffered then it is committed. This is useful, for example, if you are going to pass "x" address to as a DMA address, or if you are going to setup MPU.
It has no effect on returning "x" (your program is guaranteed to be correct even if you omit memory barrier).
On the other hand by inserting memory in clobber list, you have no kind of effect in situations like the example before (DMA, MPU..).
The only difference in the latter case is that if you have for example an ISR modifying the value of "x" right after "strb" , then the value that will be returned is the value modified by the ISR, because the clobber caused the compiler to read from memory to register again.
But if you want to obtain this thing then you should use "volatile" variables.
In other words: the barrier forces cache vs memory commit in order to guarantee consistency with other HW resources that might access RAM memory, while clobbering memory causes compiler to stop assuming the memory has not changed and to read again in local registers, that is another thing with different purposes (it does not matter if a memory change is still in cache or already committed on RAM, because an eventual asm load operation is guaranteed to work in both cases without barriers).