I have this code that works fine:
void function( void )
{
__asm volatile
(
" ldr r3, .ADDRESS \n"
" mov r2, %0 \n"
" str r2, [r3, %1] \n"
".ADDRESS: .word 0x401C4000 \n"
:: "i" (1<<17), "i" (16)
);
}
But to declare .ADDRESS I used the magic number 0x401C4000, and I actually have a macro for that value.
For example:
#define ADDR_BASE 0x401C4000
#define ADDR ((void *)ADDR_BASE)
void function( void )
{
__asm volatile
(
" ldr r3, .ADDRESS \n"
" mov r2, %0 \n"
" str r2, [r3, %1] \n"
".ADDRESS: .word %2 \n"
:: "i" (1<<17), "i" (16), "i" (ADDR)
);
}
That doesn't build.
How can I use a macro in this case?
See the edit below. I'm only leaving this first solution as an example of what you shouldn't do.
I found this post, where a solution is given to the same problem, but for x86.
I wanted to try it anyway, and it works.
The fix is to use %c2 instead of %2. The c modifier prints a constant operand without the leading # that ARM immediates normally get, which is what .word needs:
#define ADDR_BASE 0x401C4000
#define ADDR ((void *)0x401C4000)
void function( void )
{
__asm volatile
(
" ldr r3, .ADDRESS \n"
" mov r2, %0 \n"
" str r2, [r3, %1] \n"
".ADDRESS: .word %c2 \n"
:: "i" (1<<17), "i" (16), "i" (ADDR): "r2", "r3", "memory"
);
}
EDIT
Judging by the generated code, the solution I suggested above may not work:
20 function:
21 # Function supports interworking.
22 # args = 0, pretend = 0, frame = 0
23 # frame_needed = 1, uses_anonymous_args = 0
24 # link register save eliminated.
25 0000 04B02DE5 str fp, [sp, #-4]!
26 0004 00B08DE2 add fp, sp, #0
27 .syntax divided
28 # 6 "test.c" 1
29 0008 04309FE5 ldr r3, .ADDRESS
30 000c 0228A0E3 mov r2, #131072
31 0010 102083E5 str r2, [r3, #16]
32 0014 00401C40 .ADDRESS: .word 1075593216
33
34 # 0 "" 2
35 .arm
36 .syntax unified
37 0018 0000A0E1 nop
38 001c 00D08BE2 add sp, fp, #0
39 # sp needed
40 0020 04B09DE4 ldr fp, [sp], #4
41 0024 1EFF2FE1 bx lr
You can see at line 32 that the .ADDRESS word has been emitted, but it sits in the middle of the code and there is no branch to skip over it. After the str, execution falls through and the CPU will try to execute the data word as if it were an instruction.
A better solution, suggested by Peter, is to let the compiler put the value and the address into registers:
#define ADDR_BASE 0x401C4000
#define ADDR ((void *)0x401C4000)
void function( void )
{
__asm volatile
(
"str %0, [%2, %1]"
:: "r" (1<<17), "i" (16), "r" (ADDR) : "memory"
);
}
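For comparison, the same store can also be written without any inline assembly, as a volatile access through a pointer. This is only a sketch: reg_write32 is a made-up helper name, not something from the question, and it takes the base address as a parameter so it can be tried on ordinary memory as well as on the real peripheral.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the question): store a 32-bit value at
 * base + byte_offset. The volatile qualifier tells the compiler that the
 * store has side effects, so it may not be dropped or merged. */
void reg_write32(volatile void *base, size_t byte_offset, uint32_t value)
{
    volatile uint32_t *reg =
        (volatile uint32_t *)((volatile uint8_t *)base + byte_offset);
    *reg = value;
}
```

On the target, the call corresponding to the asm above would be reg_write32(ADDR, 16, 1u << 17), and the compiler is then free to pick the registers itself.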
So I'm having trouble with my program. It's supposed to read in a text file that has a number on each line. It then stores the numbers in an array, sorts the array using selection sort, and writes the result to a new file. Reading and writing the files work perfectly fine, but my code for the sort isn't working properly. When I run the program, it only seems to store some of the numbers in the array, followed by a bunch of zeroes.
For example, if my input is 112323, 32, 12, 19, 2, 1, 23, the output is 0, 0, 0, 0, 2, 1, 23. I'm pretty sure the problem is with how I'm storing to and loading from the array, because assuming that part works, I can't find any reason why the selection sort algorithm shouldn't work.
Ok, thanks to your help, I figured out that I needed to change the load and store instructions so that they match the specifier used (ldr -> ldrb and str -> strb). But I need a sorting algorithm that works for 32-bit numbers, so which combination of specifiers and load/store instructions would allow me to do that? Or would I have to load/store 8 bits at a time? And if so, how would I do that?
.data
.balign 4
readfile: .asciz "myfile.txt"
.balign 4
readmode: .asciz "r"
.balign 4
writefile: .asciz "output.txt"
.balign 4
writemode: .asciz "w"
.balign 4
return: .word 0
.balign 4
scanformat: .asciz "%d"
.balign 4
printformat: .asciz "%d\n"
.balign 4
a: .space 32
.text
.global main
.global fopen
.global fprintf
.global fclose
.global fscanf
.global printf
main:
ldr r1, =return
str lr, [r1]
ldr r0, =readfile
ldr r1, =readmode
bl fopen
mov r4, r0
mov r5, #0
ldr r6, =a
loop:
cmp r5, #7
beq sort
mov r0, r4
ldr r1, =scanformat
mov r2, r6
bl fscanf
add r5, r5, #1
add r6, r6, #1
b loop
sort:
mov r5,#0 /*array parser for first loop*/
mov r6, #0 /* #stores index of minimum*/
mov r7, #0 /* #temp*/
mov r8, #0 /*# array parser for second loop*/
mov r9, #7 /*# stores length of array*/
ldr r10, =a /*# the array*/
mov r11, #0 /*#used to obtain offset for min*/
mov r12, #0 /*# used to obtain offset for second parser access*/
loop3:
cmp r5, r9 /*# check if first parser reached end of array*/
beq write /* #if it did array is sorted write it to file*/
mov r6, r5 /*#set the min index to the current position*/
mov r8, r6 /*#set the second parser to where first parser is at*/
b loop4 /*#start looking for min in this subarray*/
loop4:
cmp r8, r9 /* #if reached end of list min is found*/
beq increment /* #get out of this loop and increment 1st parser**/
lsl r7, r6, #3 /*multiplies min index by 8 */
ADD r7, r10, r7 /* adds offset to r10 address storing it in r7 */
ldr r11, [r7] /* loads value of min in r11 */
lsl r7, r8, #3 /* multiplies second parse index by 8 */
ADD r7, r10, r7 /* adds offset to r10 address storing in r7 */
ldr r12, [r7] /* loads value of second parse into r12 */
cmp r11, r12 /* #compare current min to the current position of 2nd parser !!!!!*/
movgt r6, r8 /*# set new min to current position of second parser */
add r8, r8, #1 /*increment second parser*/
b loop4 /*repeat */
increment:
lsl r11, r5, #3 /* multiplies first parse index by 8 */
ADD r11, r10, r11 /* adds offset to r10 address stored in r11*/
ldr r8, [r11] /* loads value in memory address in r11 to r8*/
lsl r12, r6, #3 /*multiplies min index by 8 */
ADD r12, r10, r12 /*ads offset to r10 address stored in r12 */
ldr r7, [r12] /* loads value in memory address in r12 to r7 */
str r8, [r12] /* # stores value of first parser where min was !!!!!*/
str r7, [r11] /*# store value of min where first parser was !!!!!*/
add r5, r5, #1 /*#increment the first parser*/
ldr r0,=printformat
mov r1, r7
bl printf
b loop3 /*#go to loop1*/
write:
mov r0, r4
bl fclose
ldr r0, =writefile
ldr r1, =writemode
bl fopen
mov r4, r0
mov r5, #0
ldr r6, =a
loop2:
cmp r5, #7
beq end
mov r0, r4
ldr r1, =printformat
ldrb r2, [r6]
bl fprintf
add r5, r5, #1
add r6, r6, #1
b loop2
end:
mov r0, r4
bl fclose
ldr r0, =a
ldr r0, [r0]
ldr lr, =return
ldr lr, [lr]
bx lr
I figured out that I needed to change the load and store instruction
so that it matches the specifier used (ldr -> ldrb and str -> strb).
But I need to make a sorting algorithm that works for 32 bit numbers
so which combination of specifiers and load/store instructions would
allow me to do that?
If you want to read 32-bit (4-byte) values from memory, you need 4-byte values in memory to begin with. That should not be surprising :)
E.g. if your input is the numbers 1, 2, 3, 4 and each number is stored as a 32-bit value, then memory would look like this:
0x00000000: 01 00 00 00 | 02 00 00 00 <- 32b values of 1 & 2
0x00000008: 03 00 00 00 | 04 00 00 00 <- 32b values of 3 & 4
In that case ldr reads 32 bits each time, and successive reads put 1, 2, 3, 4 into the register.
Right now, you have byte values in memory (based on your statement that `ldrb` gives the right result), e.g.
0x00000000: 01
0x00000001: 02
0x00000002: 03
0x00000003: 04
or the same in one line:
0x00000000: 01 02 03 04
So reading 8 bits at a time with ldrb gives you the numbers 1, 2, 3, 4.
But ldr reads a 32-bit value from memory (all 4 bytes at once), and you would get the 32-bit value 0x04030201 in the register.
Note: these examples assume a little-endian system.
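The difference between the two loads can be reproduced in C. This is a small sketch with made-up helper names (load_le32 and load_u8), not code from the question: load_le32 assembles four consecutive bytes the way a little-endian ldr would, and load_u8 does what ldrb does.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four consecutive bytes the way a
 * little-endian ldr would: the lowest address holds the least
 * significant byte. */
uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* What ldrb does: read a single byte and zero-extend it. */
uint32_t load_u8(const uint8_t *p)
{
    return (uint32_t)p[0];
}
```

With the byte layout 01 02 03 04, load_le32 yields 0x04030201 while load_u8 on each address yields 1, 2, 3, 4. With 32-bit elements, element i lives at byte offset 4*i, so a word array has to be walked with a stride of 4 bytes rather than 1.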
gcc ARM for STM32F407 micro
The following function is used as a sanity check in FreeRtosTCP
UBaseType_t bIsValidNetworkDescriptor( const NetworkBufferDescriptor_t * pxDesc )
{
uint32_t offset = ( uint32_t ) ( ((const char *)pxDesc) - ((const char *)xNetworkBuffers) );
if( ( offset >= (uint32_t)(sizeof( xNetworkBuffers )) ) || ( ( offset % sizeof( xNetworkBuffers[0] ) ) != 0 ) )
return pdFALSE;
return (UBaseType_t) (pxDesc - xNetworkBuffers) + 1;
}
The line in question is ---> offset >= (uint32_t)(sizeof( xNetworkBuffers ))
gcc produces a bhi instruction after the cmp instead of a bhs.
I tried casting both operands as shown in the code above, but nothing seems to get gcc to use the bhs instruction.
Any help appreciated.
Thanks.
Joe
Knowing the exact size of the xNetworkBuffers array, the compiler can simply optimize the comparison. Being curious, I gave it a try. Below is the code with small modifications, the asm output, and an explanation:
#include <stdint.h>
typedef struct abc {
char data[10];
}NetworkBufferDescriptor_t;
NetworkBufferDescriptor_t xNetworkBuffers[5];
int bIsValidNetworkDescriptor( const NetworkBufferDescriptor_t * pxDesc )
{
uint32_t offset = ( uint32_t ) ( ((const char *)pxDesc) - ((const char *)xNetworkBuffers) );
if( ( offset >= (uint32_t)(sizeof( xNetworkBuffers )) ) || ( ( offset % sizeof( xNetworkBuffers[0] ) ) != 0 ) )
return 0;
return (int) (pxDesc - xNetworkBuffers) + 1;
}
and the asm output is:
bIsValidNetworkDescriptor:
# Function supports interworking.
# args = 0, pretend = 0, frame = 16
# frame_needed = 1, uses_anonymous_args = 0
# link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #20
str r0, [fp, #-16]
ldr r3, [fp, #-16]
ldr r2, .L5
sub r3, r3, r2
str r3, [fp, #-8]
ldr r3, [fp, #-8]
cmp r3, #49
bhi .L2
ldr r1, [fp, #-8]
ldr r3, .L5+4
umull r2, r3, r1, r3
lsr r2, r3, #3
mov r3, r2
lsl r3, r3, #2
add r3, r3, r2
lsl r3, r3, #1
sub r2, r1, r3
cmp r2, #0
beq .L3
.L2:
mov r3, #0
b .L4
.L3:
ldr r3, [fp, #-16]
ldr r2, .L5
sub r3, r3, r2
asr r2, r3, #1
mov r3, r2
lsl r3, r3, #1
add r3, r3, r2
lsl r1, r3, #4
add r3, r3, r1
lsl r1, r3, #8
add r3, r3, r1
lsl r1, r3, #16
add r3, r3, r1
lsl r3, r3, #2
add r3, r3, r2
add r3, r3, #1
.L4:
mov r0, r3
add sp, fp, #0
# sp needed
ldr fp, [sp], #4
bx lr
.L6:
.align 2
.L5:
In the asm output above you can see that it compares with 49, not 50 (which is the actual size of xNetworkBuffers). So my conclusion is that
offset >= (uint32_t)(sizeof( xNetworkBuffers ))
is equivalent to
offset > (uint32_t)(sizeof( xNetworkBuffers ) - 1)
and in that case the compiler can use BHI and produce the same result.
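The equivalence is easy to check in C. In this sketch NET_BUF_BYTES is a stand-in name for sizeof(xNetworkBuffers), set to 50 to match the test program above; check_ge is the comparison as written in the source, and check_gt mirrors the cmp r3, #49 / bhi pair the compiler emitted.

```c
#include <stdint.h>

/* Stand-in for sizeof(xNetworkBuffers); 50 matches the test program. */
#define NET_BUF_BYTES 50u

/* The comparison as written in the C source (what bhs against 50 would test). */
int check_ge(uint32_t offset)
{
    return offset >= NET_BUF_BYTES;
}

/* The comparison as compiled: unsigned "higher than 49" (cmp #49, bhi). */
int check_gt(uint32_t offset)
{
    return offset > (NET_BUF_BYTES - 1u);
}
```

For unsigned operands and a non-zero constant, both functions return the same result for every input, so neither form is more correct than the other.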
I think the code generated by GCC is correct, technically speaking. offset cannot be larger than INT_MAX, because this is the maximum value representable in ptrdiff_t on this architecture.
You can compute the difference like this:
uintptr_t offset = (uintptr_t)pxDesc - (uintptr_t)xNetworkBuffers;
This is still implementation-defined, but it will avoid the overflow problem.
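A minimal sketch of the whole check rewritten around that subtraction is shown below. The names desc_t, pool and is_valid_desc are hypothetical stand-ins for NetworkBufferDescriptor_t, xNetworkBuffers and bIsValidNetworkDescriptor, and the struct layout is the 10-byte one from the test program in the other answer, not the real FreeRTOS+TCP type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FreeRTOS+TCP descriptor type and array. */
typedef struct { char data[10]; } desc_t;
desc_t pool[5];

/* Returns the 1-based index of d inside pool, or 0 if d does not point
 * at the start of one of its elements. */
unsigned is_valid_desc(const desc_t *d)
{
    /* Unsigned subtraction: a pointer below pool wraps to a huge offset,
     * which the range test then rejects. */
    uintptr_t offset = (uintptr_t)d - (uintptr_t)pool;

    if (offset >= sizeof(pool) || (offset % sizeof(pool[0])) != 0)
        return 0;
    return (unsigned)(offset / sizeof(pool[0])) + 1u;
}
```

Computing the index from offset as well avoids the pointer subtraction pxDesc - xNetworkBuffers, which is only defined when both pointers refer to the same array.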
I have a C program in file delay.c:
void delay(int num)
{
volatile int i;
for(i=0; i<num; i++);
}
Then I compile the program with gcc 4.6.3 on ARM emulator (armel, more specifically) with command gcc -g -O1 -o delay.o delay.c. The assembly in delay.o is:
00000000 <delay>:
0: e24dd008 sub sp, sp, #8
4: e3a03000 mov r3, #0
8: e58d3004 str r3, [sp, #4]
c: e59d3004 ldr r3, [sp, #4]
10: e1500003 cmp r0, r3
14: da000005 ble 30 <delay+0x30>
18: e59d3004 ldr r3, [sp, #4]
1c: e2833001 add r3, r3, #1
20: e58d3004 str r3, [sp, #4]
24: e59d3004 ldr r3, [sp, #4]
28: e1530000 cmp r3, r0
2c: bafffff9 blt 18 <delay+0x18>
30: e28dd008 add sp, sp, #8
34: e12fff1e bx lr
I want to figure out where the variable i is on the stack of function delay from debugging information. Below is the information about delay and i in .debug_info section:
<1><25>: Abbrev Number: 2 (DW_TAG_subprogram)
<26> DW_AT_external : 1
<27> DW_AT_name : (indirect string, offset: 0x19): delay
<2b> DW_AT_decl_file : 1
<2c> DW_AT_decl_line : 1
<2d> DW_AT_prototyped : 1
<2e> DW_AT_low_pc : 0x0
<32> DW_AT_high_pc : 0x38
<36> DW_AT_frame_base : 0x0 (location list)
<3a> DW_AT_sibling : <0x59>
...
<2><4b>: Abbrev Number: 4 (DW_TAG_variable)
<4c> DW_AT_name : i
<4e> DW_AT_decl_file : 1
<4f> DW_AT_decl_line : 3
<50> DW_AT_type : <0x60>
<54> DW_AT_location : 0x20 (location list)
It shows that the location of i is in the location list. So I output the location list:
Offset Begin End Expression
00000000 00000000 00000004 (DW_OP_breg13 (r13): 0)
00000000 00000004 00000038 (DW_OP_breg13 (r13): 8)
00000000 <End of list>
00000020 0000000c 00000020 (DW_OP_fbreg: -12)
00000020 00000024 00000028 (DW_OP_reg3 (r3))
00000020 00000028 00000038 (DW_OP_fbreg: -12)
00000020 <End of list>
From address 4 to 38, the frame base of delay should be r13 + 8. So from address c to 20 and from address 28 to 38, the location of i is r13 + 8 -12 = r13 - 4.
However, the assembly never accesses r13 - 4; i is apparently at r13 + 4.
Am I missing a calculation step? Can anyone explain the difference between the location of i calculated from the debugging information and the one seen in the assembly?
Thanks in advance!
TL;DR The analysis in the question is correct and the discrepancy is a bug in one of the gcc components (GNU Arm Embedded Toolchain is an obvious place to log one).
As it stands, this other answer is incorrect because it erroneously conflates the value of the stack pointer on evaluation of a location expression with the earlier value of the stack pointer on entry to the function.
As far as the DWARF is concerned, the location of i varies with the program counter. Consider, for example, the text address delay+0x18. At this point, the location of i is given by DW_OP_fbreg(-12), i.e. 12 bytes below the frame base. The frame base is given by the parent DW_TAG_subprogram's DW_AT_frame_base attribute which, in this case, is also dependent on the program counter: for delay+0x18 its expression is DW_OP_breg13(8), i.e. r13 + 8. Importantly, this calculation uses the current value of r13, i.e. the value of r13 when the program counter is equal to delay+0x18.
Thus the DWARF asserts that, at delay+0x18, i is located at r13 + 8 - 12, i.e. 4 bytes below the bottom of the existing stack. Inspection of the assembly shows that, at delay+0x18, i should be found 4 bytes above the bottom of the stack. Therefore the DWARF is in error and whatever generated it is defective.
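The arithmetic can be written out as a tiny C function. This is an illustration only: fbreg_address is a made-up name, and it models just the two operations involved here, DW_OP_breg13 for the frame base and DW_OP_fbreg for the variable, both evaluated against the current value of r13.

```c
#include <stdint.h>

/* Address of a variable described by DW_OP_fbreg(fbreg_off) when the frame
 * base is DW_OP_breg13(frame_off). r13_now is the value of r13 at the
 * program counter being inspected, not its value on function entry. */
uint32_t fbreg_address(uint32_t r13_now, int32_t frame_off, int32_t fbreg_off)
{
    uint32_t frame_base = r13_now + (uint32_t)frame_off;
    return frame_base + (uint32_t)fbreg_off;
}
```

With r13 = 0x1000 at delay+0x18, the emitted DWARF gives fbreg_address(0x1000, 8, -12) = 0x0FFC, i.e. r13 - 4, whereas the ldr r3, [sp, #4] in the disassembly reads from 0x1004. An offset of -4 instead of -12 would give the address the code actually uses.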
One can demonstrate the bug using gdb with a simple wrapper around the test case provided in the question:
$ cat delay.c
void delay(int num)
{
volatile int i;
for(i=0; i<num; i++);
}
$ gcc-4.6 -g -O1 -c delay.c
$ cat main.c
void delay(int);
int main(int argc, char **argv) {
delay(3);
}
$ gcc-4.6 -o test main.c delay.o
$ gdb ./test
.
.
.
(gdb)
Set a breakpoint at delay+0x18 and run to the second occurrence (where we expect i to be 1):
(gdb) break *delay+0x18
Breakpoint 1 at 0x103cc: file delay.c, line 4.
(gdb) run
Starting program: /home/pi/test
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) cont
Continuing.
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb)
We know from the disassembly that i is four bytes above the stack pointer. Indeed, there it is:
(gdb) print *((int *)($r13 + 4))
$1 = 1
(gdb)
However, the bogus DWARF means that gdb looks in the wrong place:
(gdb) print i
$2 = 0
(gdb)
As explained above, the DWARF is incorrectly giving the location of i at four bytes below the stack pointer. There's a zero there, hence the reported value of i:
(gdb) print *((int *)($r13 - 4))
$3 = 0
(gdb)
This isn't a coincidence. A magic number written into this bogus location below the stack pointer reappears when gdb is asked to print i:
(gdb) set *((int *)($r13 - 4)) = 42
(gdb) print i
$6 = 42
(gdb)
Thus, at delay+0x18, the DWARF incorrectly encodes the location of i as r13 - 4 even though its true location is r13 + 4.
One can go a step further by editing the compilation unit by hand and replacing DW_OP_fbreg(-12) (bytes 0x91 0x74) with DW_OP_fbreg(-4) (bytes 0x91 0x7c). This gives
$ readelf --debug-dump=loc delay.modified.o
Contents of the .debug_loc section:
Offset Begin End Expression
00000000 00000000 00000004 (DW_OP_breg13 (r13): 0)
0000000c 00000004 00000038 (DW_OP_breg13 (r13): 8)
00000018 <End of list>
00000020 0000000c 00000020 (DW_OP_fbreg: -4)
0000002c 00000024 00000028 (DW_OP_reg3 (r3))
00000037 00000028 00000038 (DW_OP_fbreg: -4)
00000043 <End of list>
$
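The two byte pairs mentioned above are the DW_OP_fbreg opcode (0x91) followed by its operand, which is encoded as a signed LEB128 number. The decoder below is a minimal sketch for checking such bytes by hand; sleb128_decode is a made-up name and the function assumes well-formed input whose value fits in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one signed LEB128 value starting at p. If len is not NULL it
 * receives the number of bytes consumed. Assumes well-formed input. */
int64_t sleb128_decode(const uint8_t *p, size_t *len)
{
    uint64_t result = 0;
    unsigned shift = 0;
    size_t i = 0;
    uint8_t byte;

    do {
        byte = p[i++];
        if (shift < 64)
            result |= (uint64_t)(byte & 0x7f) << shift; /* low 7 bits are payload */
        shift += 7;
    } while (byte & 0x80);                              /* high bit: more bytes follow */

    /* Bit 6 of the last byte is the sign bit: extend it to 64 bits. */
    if (shift < 64 && (byte & 0x40))
        result |= ~(uint64_t)0 << shift;

    if (len != NULL)
        *len = i;
    return (int64_t)result;
}
```

Decoding the single byte 0x74 gives -12 and decoding 0x7c gives -4, which matches the edit made to the location list.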
In other words, the DWARF has been corrected so that at, e.g., delay+0x18 the location of i is given as frame base - 4 = r13 + 8 - 4 = r13 + 4, matching the assembly. Repeating the gdb experiment with the corrected DWARF shows the expected value of i each time around the loop:
$ gcc-4.6 -o test.modified main.c delay.modified.o
$ gdb ./test.modified
.
.
.
(gdb) break *delay+0x18
Breakpoint 1 at 0x103cc: file delay.c, line 4.
(gdb) run
Starting program: /home/pi/test.modified
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) print i
$1 = 0
(gdb) cont
Continuing.
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) print i
$2 = 1
(gdb) cont
Continuing.
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) print i
$3 = 2
(gdb) cont
Continuing.
[Inferior 1 (process 30954) exited with code 03]
(gdb)
I do not agree with the OP's asm analysis:
00000000 <delay>: ; so far, let's suppose sp = sp(0)
0: e24dd008 sub sp, sp, #8 ; sp = sp(0) - 8
4: e3a03000 mov r3, #0 ; r3 = 0
8: e58d3004 str r3, [sp, #4] ; store the value of r3 in (sp + 4)
c: e59d3004 ldr r3, [sp, #4] ; load (sp + 4) in r3
10: e1500003 cmp r0, r3 ; compare r3 and r0
14: da000005 ble 30 <delay+0x30> ; go to end of loop
18: e59d3004 ldr r3, [sp, #4] ; i is in r3, and it is being loaded from
; (sp + 4), that is,
; sp(i) = sp(0) - 8 + 4 = sp(0) - 4
1c: e2833001 add r3, r3, #1 ; r3 = r3 + 1, that is, increment i
20: e58d3004 str r3, [sp, #4] ; store i (which is in r3) in (sp + 4),
; being again sp(i) = sp(0) - 8 + 4 = \
; sp(0) - 4
24: e59d3004 ldr r3, [sp, #4] ; load sp + 4 in r3
28: e1530000 cmp r3, r0 ; compare r3 and r0
2c: bafffff9 blt 18 <delay+0x18> ; go to init of loop
30: e28dd008 add sp, sp, #8 ; sp = sp + 8
34: e12fff1e bx lr ;
So i is located at sp(0) - 4, which matches the DWARF analysis (which says that i is located at 0 + 8 - 12).
Edit in order to add information regarding my DWARF analysis:
Consider this line: 00000020 0000000c 00000020 (DW_OP_fbreg: -12). DW_OP_fbreg is described as:
The DW_OP_fbreg operation provides a signed LEB128 offset from the address specified by the location description in the DW_AT_frame_base attribute of the current function. (This is typically a “stack pointer” register plus or minus some offset. On more sophisticated systems it might be a location list that adjusts the offset according to changes in the stack pointer as the PC changes.)
So the address is frame_base + offset, where:
frame_base: the stack pointer +/- some offset. According to the previous line (00000000 00000004 00000038 (DW_OP_breg13 (r13): 8)), from 00000004 to 00000038 it has an offset of +8 (r13 is SP).
offset: -12.
Given that, DWARF indicates that it is pointing to sp(0) + 8 - 12 = sp(0) - 4
Consider the following code:
extern unsigned int foo(char c, char **p, unsigned int *n);
unsigned int test(const char *s, char **p, unsigned int *n)
{
unsigned int done = 0;
while (*s)
done += foo(*s++, p, n);
return done;
}
Output in Assembly:
00000000 <test>:
0: b5f8 push {r3, r4, r5, r6, r7, lr}
2: 0005 movs r5, r0
4: 000e movs r6, r1
6: 0017 movs r7, r2
8: 2400 movs r4, #0
a: 7828 ldrb r0, [r5, #0]
c: 2800 cmp r0, #0
e: d101 bne.n 14 <test+0x14>
10: 0020 movs r0, r4
12: bdf8 pop {r3, r4, r5, r6, r7, pc}
14: 003a movs r2, r7
16: 0031 movs r1, r6
18: f7ff fffe bl 0 <foo>
1c: 3501 adds r5, #1
1e: 1824 adds r4, r4, r0
20: e7f3 b.n a <test+0xa>
C code compiled using arm-none-eabi-gcc versions: 4.9.1, 5.4.0, 6.3.0 and 7.1.0 on
Linux host. Assembly output is the same for all GCC versions.
CFLAGS := -Os -march=armv6-m -mcpu=cortex-m0plus -mthumb
My understanding of the execution flow is as follows:
Push R3-R7 + LR onto the stack (totally unclear)
Move R0 to R5 (this is clear)
Move R1 to R6 and R2 to R7 (totally unclear)
Dereference R5 into R0 (This is clear)
Compare R0 with 0 (This is clear)
If R0 != 0 go to line 14: - Restore R1 from R6 and R2 from R7 and call foo(),
If R0 == 0 stay at line 10, restore R3 - R7 + PC from stack (totally unclear)
Increment R5 (clear)
accumulate result from foo() (clear)
Branch back to line a: (clear)
My own Assembly. Not extensively tested, but definitely I would not need more than R4 + LR to be pushed onto the stack:
EDIT: According to the answers provided, my example below will fail because R1 and R2 are not preserved across the call to foo()
unsigned int __attribute__((naked)) test_asm(const char *s, char **p, unsigned int *n)
{
// r0 - *s (move ptr to r3 and dereference it to r0)
// r1 - **p
// r2 - *n
asm volatile(
" push {r4, lr} \n\t"
" movs r4, #0 \n\t"
" movs r3, r0 \n\t"
"1: \n\t"
" ldrb r0, [r3, #0] \n\t"
" cmp r0, #0 \n\t"
" beq 2f \n\t"
" bl foo \n\t"
" add r4, r4, r0 \n\t"
" add r3, #1 \n\t"
" b 1b \n\t"
"2: \n\t"
" movs r0, r4 \n\t"
" pop {r4, pc} \n\t"
);
}
Questions:
Why does GCC save so many registers for such a trivial function?
Why does it push R3, when the ABI says that R0-R3 are argument registers, are caller-saved, and can be used freely inside the called function (test() in this case)?
Why does it copy R1 to R6 and R2 to R7, when the prototype of the extern function almost exactly matches that of test()? R1 and R2 are already set up to be passed to foo(). My understanding is that only R0 needs to be dereferenced before the call to foo().
LR must be saved since test is not a leaf function. r4-r7 are used by the function to hold values that are live across function calls, and since they are not scratch registers they must be saved. r3 is pushed to align the stack.
Adding an extra register to the push is a fast and compact way to align the stack.
r1 and r2 may be trashed by the call to foo, and since the values initially stored in these registers are needed after the call, they must be kept in a location that survives calls.
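The alignment argument is simple arithmetic, sketched below with a made-up helper name. It assumes 4-byte registers, a stack pointer that is 8-byte aligned on entry, and the AAPCS rule that the stack must be 8-byte aligned again when foo is called.

```c
/* How many extra 4-byte registers must be pushed so that the total push
 * size is a multiple of 8 bytes. needed_regs is the number of registers
 * the function really has to save. */
unsigned pad_regs_for_alignment(unsigned needed_regs)
{
    unsigned bytes = needed_regs * 4u;
    return (bytes % 8u) / 4u;   /* 0 if already aligned, otherwise 1 */
}
```

Here test has to save r4, r5, r6, r7 and lr, which is 5 registers or 20 bytes, so one padding register is needed, and r3 is simply a convenient choice because the push encoding takes a register list. Its saved value is never used for anything.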
I'm working through an example in this overview of compiling inline ARM assembly using GCC. Rather than GCC, I'm using llvm-gcc 4.2.1, and I'm compiling the following C code:
#include <stdio.h>
int main(void) {
printf("Volatile NOP\n");
asm volatile("mov r0, r0");
printf("Non-volatile NOP\n");
asm("mov r0, r0");
return 0;
}
Using the following commands:
llvm-gcc -emit-llvm -c -o compiled.bc input.c
llc -O3 -march=arm -o output.s compiled.bc
My output.s ARM ASM file looks like this:
.syntax unified
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.file "compiled.bc"
.text
.globl main
.align 2
.type main,%function
main: # #main
# BB#0: # %entry
str lr, [sp, #-4]!
sub sp, sp, #16
str r0, [sp, #12]
ldr r0, .LCPI0_0
str r1, [sp, #8]
bl puts
#APP
mov r0, r0
#NO_APP
ldr r0, .LCPI0_1
bl puts
#APP
mov r0, r0
#NO_APP
mov r0, #0
str r0, [sp, #4]
str r0, [sp]
ldr r0, [sp, #4]
add sp, sp, #16
ldr lr, [sp], #4
bx lr
# BB#1:
.align 2
.LCPI0_0:
.long .L.str
.align 2
.LCPI0_1:
.long .L.str1
.Ltmp0:
.size main, .Ltmp0-main
.type .L.str,%object # #.str
.section .rodata.str1.1,"aMS",%progbits,1
.L.str:
.asciz "Volatile NOP"
.size .L.str, 13
.type .L.str1,%object # #.str1
.section .rodata.str1.16,"aMS",%progbits,1
.align 4
.L.str1:
.asciz "Non-volatile NOP"
.size .L.str1, 17
The two NOPs are between their respective #APP/#NO_APP pairs. My expectation is that the asm() statement without the volatile keyword will be optimized out of existence due to the -O3 flag, but clearly both inline assembly statements survive.
Why does the asm("mov r0, r0") line not get recognized and removed as a NOP?
As Mystical and Mārtiņš Možeiko have described, the compiler does not optimize the code inside the asm statement, i.e. it does not change the instructions. What the compiler does optimize is where the instructions are scheduled. When you use volatile, the compiler will not re-schedule them. In your example, re-scheduling would mean moving the instruction before or after the printf.
The other optimization the compiler might make is to get C values into registers for you. Register allocation is very important to optimization. This doesn't optimize the assembler, but it allows the compiler to do sensible things with the other code within the function.
To see the effect of volatile, here is some sample code,
int example(int test, int add)
{
int v1=5, v2=0;
int i=0;
if(test) {
asm volatile("add %0, %1, #7" : "=r" (v2) : "r" (v2));
i+= add * v1;
i+= v2;
} else {
asm ("add %0, %1, #7" : "=r" (v2) : "r" (v2));
i+= add * v1;
i+= v2;
}
return i;
}
The two branches have identical code except for the volatile. gcc 4.7.2 generates the following code for an ARM926,
example:
cmp r0, #0
bne 1f /* branch if test set? */
add r1, r1, r1, lsl #2
add r0, r0, #7 /* add seven delayed */
add r0, r0, r1
bx lr
1: mov r0, #0 /* test set */
add r0, r0, #7 /* add seven immediate */
add r1, r1, r1, lsl #2
add r0, r0, r1
bx lr
Note: The assembler branches are reversed relative to the 'C' code. The 2nd branch is slower on some processors due to pipelining. The compiler prefers that
add r1, r1, r1, lsl #2
add r0, r0, r1
do not execute sequentially.
The Ethernut ARM Tutorial is an excellent resource. However, optimize is a bit of an overloaded word. The compiler doesn't analyze the assembler, only the arguments and where the code will be emitted.
volatile is implied if the asm statement has no outputs declared.
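That last rule explains the question's result: asm("mov r0, r0") declares no outputs, so it is treated like asm volatile and is kept. The sketch below contrasts the two cases using GCC/Clang extended asm with empty templates, so it builds on any target; the function names are made up for the illustration.

```c
/* Has an output operand ("+r" is both input and output), so this asm is
 * NOT implicitly volatile. If the caller ignored the result, the compiler
 * would be free to delete the statement. */
int pass_through(int x)
{
    __asm__ ("" : "+r"(x));
    return x;
}

/* No output operands: treated as if it were written asm volatile, so the
 * statement is kept even though it emits no instructions. The "memory"
 * clobber makes it a compiler-level barrier. */
void compiler_barrier(void)
{
    __asm__ ("" ::: "memory");
}
```

In neither case does the compiler look inside the template string. It only reasons about the operands and clobbers, which is why a hand-written mov r0, r0 is never recognised as a NOP.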