Inline NOPs not optimized out in LLVM - gcc

I'm working through an example in this overview of compiling inline ARM assembly using GCC. Rather than GCC, I'm using llvm-gcc 4.2.1, and I'm compiling the following C code:
#include <stdio.h>
int main(void) {
printf("Volatile NOP\n");
asm volatile("mov r0, r0");
printf("Non-volatile NOP\n");
asm("mov r0, r0");
return 0;
}
Using the following commands:
llvm-gcc -emit-llvm -c -o compiled.bc input.c
llc -O3 -march=arm -o output.s compiled.bc
My output.s ARM ASM file looks like this:
.syntax unified
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.file "compiled.bc"
.text
.globl main
.align 2
.type main,%function
main: # #main
# BB#0: # %entry
str lr, [sp, #-4]!
sub sp, sp, #16
str r0, [sp, #12]
ldr r0, .LCPI0_0
str r1, [sp, #8]
bl puts
#APP
mov r0, r0
#NO_APP
ldr r0, .LCPI0_1
bl puts
#APP
mov r0, r0
#NO_APP
mov r0, #0
str r0, [sp, #4]
str r0, [sp]
ldr r0, [sp, #4]
add sp, sp, #16
ldr lr, [sp], #4
bx lr
# BB#1:
.align 2
.LCPI0_0:
.long .L.str
.align 2
.LCPI0_1:
.long .L.str1
.Ltmp0:
.size main, .Ltmp0-main
.type .L.str,%object # #.str
.section .rodata.str1.1,"aMS",%progbits,1
.L.str:
.asciz "Volatile NOP"
.size .L.str, 13
.type .L.str1,%object # #.str1
.section .rodata.str1.16,"aMS",%progbits,1
.align 4
.L.str1:
.asciz "Non-volatile NOP"
.size .L.str1, 17
The two NOPs are between their respective #APP/#NO_APP pairs. My expectation is that the asm() statement without the volatile keyword will be optimized out of existence due to the -O3 flag, but clearly both inline assembly statements survive.
Why does the asm("mov r0, r0") line not get recognized and removed as a NOP?

As Mystical and Mārtiņš Možeiko have describe the compiler does not optimize the code; ie, change the instructions. What the compiler does optimize is when the instruction is scheduled. When you use volatile, then the compiler will not re-schedule. In your example, re-scheduling would be moving before or after the printf.
The other optimization the compiler might make is to get C values to register for you. Register allocation is very important to optimization. This doesn't optimize the assembler, but allow the compiler to do sensible things with other code with-in the function.
To see the effect of volatile, here is some sample code,
int example(int test, int add)
{
int v1=5, v2=0;
int i=0;
if(test) {
asm volatile("add %0, %1, #7" : "=r" (v2) : "r" (v2));
i+= add * v1;
i+= v2;
} else {
asm ("add %0, %1, #7" : "=r" (v2) : "r" (v2));
i+= add * v1;
i+= v2;
}
return i;
}
The two branches have identical code except for the volatile. gcc 4.7.2 generates the following code for an ARM926,
example:
cmp r0, #0
bne 1f /* branch if test set? */
add r1, r1, r1, lsl #2
add r0, r0, #7 /* add seven delayed */
add r0, r0, r1
bx lr
1: mov r0, #0 /* test set */
add r0, r0, #7 /* add seven immediate */
add r1, r1, r1, lsl #2
add r0, r0, r1
bx lr
Note: The assembler branches are reversed to the 'C' code. The 2nd branch is slower on some processors due to pipe lining. The compiler prefers that
add r1, r1, r1, lsl #2
add r0, r0, r1
do not execute sequentially.
The Ethernut ARM Tutorial is an excellent resource. However, optimize is a bit of an overloaded word. The compiler doesn't analyze the assembler, only the arguments and where the code will be emitted.

volatile is implied if the asm statement has no outputs declared.

Related

Why adding delay when gcc -o2 optimization is used?

I read an example code of STM32 with LCD and found below code, and its purpose is to write the LCD controller register index as output data of LCD controller.
void LCD_WR_REG(uint16_t regval)
{
regval = regval; // Necessary delay when using -o2 optimization
LCD->LCD_REG = regval;
}
I searched for a while for -o2, but didn't get much useful info about the what the comment here means, or why a self assignment is necessary here.
The comment is simply wrong. This operation will be optimized out. I believe that this comment was written where the original author of the code is struggling to make it work and something else was in this line.
LCD_WR_REG:
ldr r3, .L3
strh r0, [r3] # movhi
bx lr
.L3:
.word 1207993344
It could have some effect if regval was declared as volatile
void LCD_WR_REG1(volatile uint16_t regval)
{
regval = regval; // Necessary delay when using -o2 optimization
LCD->LCD_REG = regval;
}
LCD_WR_REG1:
sub sp, sp, #8
strh r0, [sp, #6] # movhi
ldrh r3, [sp, #6]
strh r3, [sp, #6] # movhi
ldr r2, .L7
ldrh r3, [sp, #6]
strh r3, [r2] # movhi
add sp, sp, #8
bx lr
.L7:
.word 1207993344
https://godbolt.org/z/Th7naabf7

No FPU support with gcc for ARM Cortex M?

I have the following function from a well known benchmark that I am compiling with gcc-arm-none-eabi-10-2020-q4-major:
#include <unistd.h>
double b[1000], c[1000];
void tuned_STREAM_Scale(double scalar)
{
ssize_t j;
for (j = 0; j < 1000; j++)
b[j] = scalar* c[j];
}
I am using the following compiler options:
arm-none-eabi-gcc -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=fpv5-sp-d16 -c test.c
However, if I check the compiled code, the compiler seems unable to use a basic FPU multiply instruction, and just uses the __aeabi_dmul function from libgcc (we can however see that a FPU vmov is used):
00000000 <tuned_STREAM_Scale>:
0: e92d 41f0 stmdb sp!, {r4, r5, r6, r7, r8, lr}
4: 4c08 ldr r4, [pc, #32] ; (28 <tuned_STREAM_Scale+0x28>)
6: 4d09 ldr r5, [pc, #36] ; (2c <tuned_STREAM_Scale+0x2c>)
8: f504 58fa add.w r8, r4, #8000 ; 0x1f40
c: ec57 6b10 vmov r6, r7, d0
10: e8f4 0102 ldrd r0, r1, [r4], #8
14: 4632 mov r2, r6
16: 463b mov r3, r7
18: f7ff fffe bl 0 <__aeabi_dmul>
1c: 4544 cmp r4, r8
1e: e8e5 0102 strd r0, r1, [r5], #8
22: d1f5 bne.n 10 <tuned_STREAM_Scale+0x10>
24: e8bd 81f0 ldmia.w sp!, {r4, r5, r6, r7, r8, pc}
If I compare with another compiler, the code is incomparably more efficient:
00000000 <tuned_STREAM_Scale>:
0: 4808 ldr r0, [pc, #32] ; (24 <tuned_STREAM_Scale+0x24>)
2: b580 push {r7, lr}
4: 4b06 ldr r3, [pc, #24] ; (20 <tuned_STREAM_Scale+0x20>)
6: 27c8 movs r7, #200 ; 0xc8
8: c806 ldmia r0!, {r1, r2}
a: ec42 1b11 vmov d1, r1, r2
e: ee20 1b01 vmul.f64 d1, d0, d1
12: 1e7f subs r7, r7, #1
14: ec52 1b11 vmov r1, r2, d1
18: c306 stmia r3!, {r1, r2}
1a: d1f5 bne.n 8 <tuned_STREAM_Scale+0x8>
1c: bd80 pop {r7, pc}
If I check inside gcc package the various libgcc object files depending on CPU or FPU options, I cannot find any FPU instructions in __aeabi_dmul or any other function.
I find very strange that gcc is not able to use a basic FPU multiplication, and I could not find in any documentation or README this limitation, so I am wondering if I am not doing anything wrong. I have checked older gcc versions and I still have this problem. Would it be due to gcc or to the compiled binaries from ARM?
The clue is in the compiler options you already posted:
-mfpu=fpv5-sp-d16 "sp" means single precision.
You told it not to generate hardware double instructions, which is correct for most Cortex-M7 processors because they can't execute them. If you have an M7 which can then you need to set the correct fpu argument.

"Access to unaligned memory location, bad address=ffffff"

I'm trying to read integers from an input.txt file, below is my read loop where I'm attempting to read and store the integers into an array. I keep getting "Access to unaligned memory location, bad address=ffffff" on any line after the line with "LDR R2, [R2,R5,LSL #2]...im using ARM SIM. Does anyone know what I'm doing wrong?
start:
MOV R5, #0 #int i
MOV R1, #0
swi SWI_Open
LDR R1,=InFileH
STR R0,[R1]
MOV R3, #0
readloop:
LDR R0, =InFileH
LDR R0, [R0]
swi SWI_RdInt
CMP R0, #0
BEQ readdone
#the int is now in R0
MOV R1, R0
LDR R3,=a
STR R2,[R3,R5,LSR#2]
MOV R2, R1
ADD R5, R5, #1 #i++
bal readloop
readdone:
MOV R0, #0
swi SWI_Close
swi SWI_Exit
.data
.align 4
InFileH: .skip 4
InFile: .asciz "numbers.txt"
OutFile: .asciz "numsort.txt"
OutFileH: .skip 4
NewLine: .asciz "\n"
a: .skip 400
i had faced similar issue while programming arm assembly
this was because it was expecting offset in multiples of 4
STR R2, [R1, #2]
the above instruction throws the similar error. so it was resolved by using
STR R2, [R1, #4]
for better understanding clickhere

Is this a bug in GCC or is my code wrong?

I have this C code:
int test(signed char anim_col)
{
if (anim_col >= 31) {
return 1;
} else if (anim_col <= -15) {
return -2;
}
return 0;
}
That compiles to the following thumb code with -Os -mthumb using Android NDK r4b:
test:
mov r3, #1
cmp r0, #30
bgt .L3
mov r3, #0
add r0, r0, #14
bge .L3
mov r3, #2
neg r3, r3
.L3:
mov r0, r3
bx lr
But with the latest Android NDK r5 it compiles to this broken code:
test:
mov r3, #1
cmp r0, #30
bgt .L3
lsl r0, r0, #24
lsr r0, r0, #24
mov r3, #0
cmp r0, #127 ## WTF?! should be <= -15 ##
bls .L3
mov r3, #2
neg r3, r3
.L3:
mov r0, r3
bx lr
This seems... strange. If anim_col is less than 0 it will return -2 instead of only returning -2 when less than or equal to -15. The complete command line to reproduce this is as follows:
android-ndk-r4b/build/prebuilt/linux-x86/arm-eabi-4.4.0/bin/arm-eabi-gcc -c -o test.o -Os test.c --save-temps -mthumb
and
android-ndk-r5/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-gcc -c -o test.o -Os test.c --save-temps -mthumb
Is this a known GCC bug? I find it hard to believe, that doesn't happen in real life! Surely my code is wrong?!
It's a GCC bug!
As of NDK r5b, this bug has been fixed.
This release of the NDK does not
include any new features compared to
r5. The r5b release addresses the
following problems in the r5 release:
Fixes a compiler bug in the
arm-linux-androideabi-4.4.3 toolchain.
The previous binary generated invalid
thumb instruction sequences when
dealing with signed chars.

Generating %pc relative address of constant data

Is there a way to have gcc generate %pc relative addresses of constants? Even when the string appears in the text segment, arm-elf-gcc will generate a constant pointer to the data, load the address of the pointer via a %pc relative address and then dereference it. For a variety of reasons, I need to skip the middle step. As an example, this simple function:
const char * filename(void)
{
static const char _filename[]
__attribute__((section(".text")))
= "logfile";
return _filename;
}
generates (when compiled with arm-elf-gcc-4.3.2 -nostdlib -c
-O3 -W -Wall logfile.c):
00000000 <filename>:
0: e59f0000 ldr r0, [pc, #0] ; 8 <filename+0x8>
4: e12fff1e bx lr
8: 0000000c .word 0x0000000c
0000000c <_filename.1175>:
c: 66676f6c .word 0x66676f6c
10: 00656c69 .word 0x00656c69
I would have expected it to generate something more like:
filename:
add r0, pc, #0
bx lr
_filename.1175:
.ascii "logfile\000"
The code in question needs to be partially position independent since it will be relocated in memory at load time, but also integrate with code that was not compiled -fPIC, so there is no global offset table.
My current work around is to call a non-inline function (which will be done via a %pc relative address) to find the offset from the compiled location in a technique similar to how -fPIC code works:
static intptr_t
__attribute__((noinline))
find_offset( void )
{
uintptr_t pc;
asm __volatile__ (
"mov %0, %%pc" : "=&r"(pc)
);
return pc - 8 - (uintptr_t) find_offset;
}
But this technique requires that all data references be fixed up manually, so the filename() function in the above example would become:
const char * filename(void)
{
static const char _filename[]
__attribute__((section(".text")))
= "logfile";
return _filename + find_offset();
}
Hmmm, maybe you have to compile it as -fPIC to get PIC. Or simply write it in assembler, assembler is a lot easier than the C you are writing.
00000000 :
0: e59f300c ldr r3, [pc, #12] ; 14
4: e59f000c ldr r0, [pc, #12] ; 18
8: e08f3003 add r3, pc, r3
c: e0830000 add r0, r3, r0
10: e12fff1e bx lr
14: 00000004 andeq r0, r0, r4
18: 00000000 andeq r0, r0, r0
0000001c :
1c: 66676f6c strbtvs r6, [r7], -ip, ror #30
20: 00656c69 rsbeq r6, r5, r9, ror #24
Are you getting the same warning I am getting?
/tmp/ccySyaUE.s: Assembler messages:
/tmp/ccySyaUE.s:35: Warning: ignoring changed section attributes for .text

Resources