GCC for AArch64: what are the generated NOPs used for? - gcc

I built CoreMark for AArch64 using aarch64-none-elf-gcc with the following options:
-mcpu=cortex-a57 -Wall -Wextra -g -O2
In the disassembled code I see many NOPs.
A few examples:
0000000040001540 <matrix_mul_const>:
40001540: 13003c63 sxth w3, w3
40001544: 34000240 cbz w0, 4000158c <matrix_mul_const+0x4c>
40001548: 2a0003e6 mov w6, w0
4000154c: 52800007 mov w7, #0x0 // #0
40001550: 52800008 mov w8, #0x0 // #0
40001554: d503201f nop
40001558: 2a0703e4 mov w4, w7
4000155c: d503201f nop
40001560: 78e45845 ldrsh w5, [x2, w4, uxtw #1]
...
00000000400013a0 <core_init_matrix>:
400013a0: 7100005f cmp w2, #0x0
400013a4: 2a0003e6 mov w6, w0
400013a8: 1a9f1442 csinc w2, w2, wzr, ne // ne = any
400013ac: 52800004 mov w4, #0x0 // #0
400013b0: 34000620 cbz w0, 40001474 <core_init_matrix+0xd4>
400013b4: d503201f nop
400013b8: 2a0403e0 mov w0, w4
400013bc: 11000484 add w4, w4, #0x1
A simple question: what are these NOPs used for?
UPD. Yes, it is related to alignment. Here is the corresponding generated assembly code:
matrix_mul_const:
.LVL41:
.LFB4:
.loc 1 270 1 is_stmt 1 view -0
.cfi_startproc
.loc 1 271 5 view .LVU127
.loc 1 272 5 view .LVU128
.loc 1 272 19 view .LVU129
.loc 1 270 1 is_stmt 0 view .LVU130
sxth w3, w3
.loc 1 272 19 view .LVU131
cbz w0, .L25
.loc 1 276 51 view .LVU132
mov w6, w0
mov w7, 0
.loc 1 272 12 view .LVU133
mov w8, 0
.LVL42:
.p2align 3,,7
.L27:
.loc 1 274 23 is_stmt 1 view .LVU134
.loc 1 270 1 is_stmt 0 view .LVU135
mov w4, w7
.LVL43:
.p2align 3,,7
.L28:
.loc 1 276 13 is_stmt 1 discriminator 3 view .LVU136
.loc 1 276 28 is_stmt 0 discriminator 3 view .LVU137
ldrsh w5, [x2, w4, uxtw 1]
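Here we see .p2align 3,,7, which aligns the next instruction to a 2^3 = 8-byte boundary as long as no more than 7 bytes of padding are needed; in a code section the assembler fills the gap with NOPs. These .p2align directives are a result of -O2: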
$ aarch64-none-elf-gcc -Wall -Wextra -g -O1 -ffreestanding -c core_matrix.c -S ;\
grep '.p2align' core_matrix.s | sort | uniq
<nothing>
$ aarch64-none-elf-gcc -Wall -Wextra -g -O2 -ffreestanding -c core_matrix.c -S ;\
grep '.p2align' core_matrix.s | sort | uniq
.p2align 2,,3
.p2align 3,,7
.p2align 4,,11
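As a rough sketch of the rule behind these directives: .p2align n,,max pads up to the next multiple of 2^n bytes, but only if that takes at most max bytes. The helper below (the name p2align_padding is made up for this illustration) computes the padding for a given address, so it can be checked against the disassembly of matrix_mul_const above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes that ".p2align log2_align,,max_skip" would insert
   when the current address is addr. Returns 0 when the address is already
   aligned, or when the required padding exceeds max_skip (the directive is
   then ignored). */
uint64_t p2align_padding(uint64_t addr, unsigned log2_align, uint64_t max_skip)
{
    assert(log2_align < 64);
    uint64_t mask = ((uint64_t)1 << log2_align) - 1;
    uint64_t pad = (0 - addr) & mask;  /* distance to the next aligned address */
    return pad > max_skip ? 0 : pad;
}
```

For address 0x40001554 with .p2align 3,,7 this gives 4 bytes, which is the single 4-byte NOP seen before 40001558. With .p2align 4,,11 at the same address the required 12 bytes would exceed the limit of 11, so nothing would be inserted.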

Related

How can I use GCC to compile a binary file that can be used on my FPGA, where I have used Verilog to synthesize a CPU

First I synthesized a CPU that supports RISC-V RV32IM using Verilog, but I can't test whether the CPU is working properly. I hope a compiler (such as GCC) can generate instructions to help me test, but normal compilers only generate executables that require an operating system. Obviously, my FPGA can't run those.
I only need a series of RV32IM instructions that can run on the FPGA and implement the corresponding functions. If possible, I want the first instruction to be the program entry, which will save me effort.
Of course you can; it is somewhat trivial. You (or someone) selected the bare-metal tag, and I assume it is a bare-metal program you want to run.
so.s
lui x2,0x22222
lui x3,0x33333
lui x4,0x44444
lui x5,0x55555
lui x6,0x66666
j .
riscv32-none-elf-as so.s -o so.o
riscv32-none-elf-objdump -d -Mnumeric so.o
so.o: file format elf32-littleriscv
Disassembly of section .text:
00000000 <.text>:
0: 22222137 lui x2,0x22222
4: 333331b7 lui x3,0x33333
8: 44444237 lui x4,0x44444
c: 555552b7 lui x5,0x55555
10: 66666337 lui x6,0x66666
14: 0000006f j 14 <.text+0x14>
now you can just
riscv32-none-elf-ld -Ttext=0 so.o -o so.elf
riscv32-none-elf-ld: warning: cannot find entry symbol _start; defaulting to 0000000000000000
riscv32-none-elf-objdump -d -Mnumeric so.elf
so.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <__BSS_END__-0x1018>:
0: 22222137 lui x2,0x22222
4: 333331b7 lui x3,0x33333
8: 44444237 lui x4,0x44444
c: 555552b7 lui x5,0x55555
10: 66666337 lui x6,0x66666
14: 0000006f j 14 <__BSS_END__-0x1004>
but at least with ARM (I'm not sure about other binutils targets) there are very old, longstanding bugs in the tools when used like that (you get gaps in the binary, etc.). So use a linker script instead:
memmap
MEMORY
{
hello : ORIGIN = 0x00000000, LENGTH = 0x3000
}
SECTIONS
{
.text : { *(.text*) } > hello
.rodata : { *(.rodata*) } > hello
.bss : { *(.bss*) } > hello
.data : { *(.data*) } > hello
}
and
riscv32-none-elf-ld -T memmap so.o -o so.elf
riscv32-none-elf-objdump -d -Mnumeric so.elf
so.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <.text>:
0: 22222137 lui x2,0x22222
4: 333331b7 lui x3,0x33333
8: 44444237 lui x4,0x44444
c: 555552b7 lui x5,0x55555
10: 66666337 lui x6,0x66666
14: 0000006f j 14 <.text+0x14>
so now you have an elf (or exe or whatever) you can
riscv32-none-elf-objcopy so.elf -O binary so.bin
hexdump -C so.bin
00000000 37 21 22 22 b7 31 33 33 37 42 44 44 b7 52 55 55 |7!"".1337BDD.RUU|
00000010 37 63 66 66 6f 00 00 00 |7cffo...|
00000018
or
riscv32-none-elf-objcopy --srec-forceS3 so.elf -O srec so.srec
cat so.srec
S00A0000736F2E7372656338
S3150000000037212222B731333337424444B75255554C
S30D00000010376366666F0000000D
S70500000000FA
or
riscv32-none-elf-objcopy so.elf -O ihex so.ihex
cat so.ihex
:1000000037212222B731333337424444B752555552
:08001000376366666F00000013
and so on.
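If you write your own loader for the FPGA side, it can be useful to sanity-check the records first. A minimal sketch for the Intel HEX output above, assuming the usual record layout (length, 16-bit address, type, data, checksum) in which all bytes of a record sum to 0 modulo 256; the function names are made up for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Returns 1 if record (one line such as ":08001000...13") has the leading
   colon, only complete hex byte pairs, a byte count that matches its length
   field, and bytes (including the trailing checksum) that sum to 0 mod 256.
   Returns 0 otherwise. */
int ihex_record_ok(const char *record)
{
    assert(record != NULL);
    if (record[0] != ':') return 0;

    unsigned sum = 0, first = 0;
    size_t nbytes = 0;
    for (const char *p = record + 1; *p != '\0' && *p != '\r' && *p != '\n'; p += 2) {
        int hi = hex_digit(p[0]);
        if (hi < 0) return 0;
        int lo = hex_digit(p[1]);      /* a terminator here means an odd digit count */
        if (lo < 0) return 0;
        unsigned byte = (unsigned)(hi * 16 + lo);
        if (nbytes == 0) first = byte; /* length field: number of data bytes */
        sum += byte;
        nbytes++;
    }
    /* length + 2 address bytes + type + data + checksum */
    return nbytes == (size_t)first + 5 && (sum & 0xff) == 0;
}
```

Both data records printed by cat so.ihex above pass this check; changing any single digit makes it fail.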
Then you can also...
so.s
lui x2,0x00002
jal notmain
j .
.globl hello
hello:
ret
notmain.c
void hello ( unsigned int );
void notmain ( void )
{
unsigned int r;
for (r=0;r<32;r++)
{
hello(r);
}
}
build with commands like these
riscv32-none-elf-as -march=rv32im so.s -o so.o
riscv32-none-elf-gcc -O2 -c -fomit-frame-pointer -march=rv32im -mabi=ilp32 notmain.c -o notmain.o
riscv32-none-elf-ld -T memmap so.o notmain.o -o so.elf
riscv32-none-elf-objdump -D -Mnumeric so.elf
riscv32-none-elf-objcopy -O binary so.elf so.bin
giving
Disassembly of section .text:
00000000 <hello-0xc>:
0: 00002137 lui x2,0x2
4: 00c000ef jal x1,10 <notmain>
8: 0000006f j 8 <hello-0x4>
0000000c <hello>:
c: 00008067 ret
00000010 <notmain>:
10: ff010113 addi x2,x2,-16 # 1ff0 <notmain+0x1fe0>
14: 00812423 sw x8,8(x2)
18: 00912223 sw x9,4(x2)
1c: 00112623 sw x1,12(x2)
20: 00000413 li x8,0
24: 02000493 li x9,32
28: 00040513 mv x10,x8
2c: 00140413 addi x8,x8,1
30: fddff0ef jal x1,c <hello>
34: fe941ae3 bne x8,x9,28 <notmain+0x18>
38: 00c12083 lw x1,12(x2)
3c: 00812403 lw x8,8(x2)
40: 00412483 lw x9,4(x2)
44: 01010113 addi x2,x2,16
48: 00008067 ret
and you can use the .bin file or .srec or whatever you prefer.
basic bare metal stuff...gnu works great for this, very easy to use. llvm/clang is more complicated to figure out but technically will work as well.
I changed the line to
for (r=0;r<3200;r++)
because it was unrolling the loop. I gave up trying to keep track of the ever changing generic llvm tool command line options, so now I build specific for riscv32 and can then use the generic program names as cross tools...
clang -c -march=rv32im so.s -o so.o
clang -c -O2 -march=rv32im notmain.c -o notmain.o
ld.lld -T memmap so.o notmain.o -o so.elf
llvm-objcopy -O binary so.elf so.bin
llvm-objdump -D -Mnumeric so.elf
so.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <.text>:
0: 37 21 00 00 lui x2, 2
4: ef 00 c0 00 jal 0x10 <notmain>
8: 6f 00 00 00 j 0x8 <.text+0x8>
0000000c <hello>:
c: 67 80 00 00 ret
00000010 <notmain>:
10: 13 01 01 ff addi x2, x2, -16
14: 23 26 11 00 sw x1, 12(x2)
18: 23 24 81 00 sw x8, 8(x2)
1c: 23 22 91 00 sw x9, 4(x2)
20: 13 04 00 00 li x8, 0
24: 37 15 00 00 lui x10, 1
28: 93 04 05 c8 addi x9, x10, -896
0000002c <.LBB0_1>:
2c: 13 05 04 00 mv x10, x8
30: ef f0 df fd jal 0xc <hello>
34: 13 04 14 00 addi x8, x8, 1
38: e3 1a 94 fe bne x8, x9, 0x2c <.LBB0_1>
3c: 83 20 c1 00 lw x1, 12(x2)
40: 03 24 81 00 lw x8, 8(x2)
44: 83 24 41 00 lw x9, 4(x2)
48: 13 01 01 01 addi x2, x2, 16
4c: 67 80 00 00 ret
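One thing to watch when comparing these listings with what your CPU fetches: GNU objdump prints each instruction as a 32-bit word (00002137), while llvm-objdump and hexdump print the bytes in memory order (37 21 00 00). A small sketch, with made-up helper names, that rebuilds the word from little-endian bytes and splits a U-type instruction such as lui into its fields:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reassemble the 32-bit little-endian word stored at p[0..3]. */
uint32_t load_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Fields of a RISC-V U-type instruction (lui, auipc). */
struct utype {
    uint32_t opcode; /* bits 6:0, 0x37 for lui */
    uint32_t rd;     /* bits 11:7, destination register number */
    uint32_t imm20;  /* bits 31:12, the 20-bit immediate */
};

struct utype decode_utype(uint32_t insn)
{
    struct utype u;
    u.opcode = insn & 0x7f;
    u.rd     = (insn >> 7) & 0x1f;
    u.imm20  = insn >> 12;
    return u;
}
```

Feeding it the first four bytes of the earlier hexdump (37 21 22 22) yields the word 0x22222137, which decodes to opcode 0x37, rd 2 and immediate 0x22222, i.e. lui x2,0x22222.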

gcc -g: what happens?

The assembly file is obtained by using gcc -g -S, and part of the .s file is as follows:
.L3:
.loc 1 22 11
mov eax, DWORD PTR -12[rbp]
mov edx, eax
mov rcx, QWORD PTR .refptr._ZSt4cout[rip]
call _ZNSolsEi
.loc 1 22 18
mov rdx, QWORD PTR .refptr._ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_[rip]
mov rcx, rax
call _ZNSolsEPFRSoS_E
.loc 1 23 7
mov DWORD PTR -12[rbp], 0
.loc 1 12 2
add DWORD PTR -4[rbp], 1
jmp .L6
What does .loc 1 22 11 stand for?
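When the -g flag is passed to gcc, it directs the compiler to emit debugging information, and .loc directives appear only in that case. The general form is .loc fileno lineno [column], so .loc 1 22 11 marks the instructions that follow as coming from file number 1, line 22, column 11: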
https://sourceware.org/binutils/docs-2.38/as/Loc.html#Loc
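To make the field order concrete, here is a minimal sketch that pulls the three numbers out of such a line. The struct and function names are invented for this example, and any trailing options (is_stmt, view, discriminator) are simply ignored:

```c
#include <assert.h>
#include <stdio.h>

struct loc {
    unsigned file;   /* index into the .file table */
    unsigned line;   /* source line number */
    unsigned column; /* source column, 0 if the directive omits it */
};

/* Parse ".loc fileno lineno [column] ...". Returns 1 on success, 0 if the
   text is not a .loc directive with at least a file and a line number. */
int parse_loc(const char *text, struct loc *out)
{
    assert(text != NULL && out != NULL);
    out->column = 0;
    int n = sscanf(text, " .loc %u %u %u", &out->file, &out->line, &out->column);
    return n >= 2;
}
```

For ".loc 1 22 11" this fills in file 1, line 22, column 11.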

Ada listing files: what are the right compiler switches in GNAT to get them to come out?

I am used to getting nice listing files from C code where I can see lovely source code intertwined with opcodes and hex offsets for debugging, as seen here: List File In C (.LST)
The -S switch gets me only the assembler code from g++ for Ada, but I can't seem to get it to give up the good stuff so I can debug a nasty elaboration crash.
Any thoughts on the GNAT compiler switches to send in?
Maybe this helps. The next command generates something similar to what you refer to:
$ gnatmake -g main.adb -cargs -Wa,-adhln > main.lst
The -cargs (a so-called mode switch) causes gnatmake to pass the subsequent arguments to the compiler. The compiler subsequently passes the -adhln switches to the assembler (see here). But you might as well use objdump -d -S main.o to see the assembly/source code after the build.
main.adb
with Ada.Text_IO; use Ada.Text_IO;
procedure Main is
begin
Put_Line ("Hello, world!");
end Main;
output (main.lst)
1 .file "main.adb"
2 .text
3 .Ltext0:
4 .section .rodata
5 .LC1:
6 0000 48656C6C .ascii "Hello, world!"
6 6F2C2077
6 6F726C64
6 21
7 000d 000000 .align 8
8 .LC0:
9 0010 01000000 .long 1
10 0014 0D000000 .long 13
11 .text
12 .align 2
13 .globl _ada_main
15 _ada_main:
16 .LFB1:
17 .file 1 "main.adb"
1:main.adb **** with Ada.Text_IO; use Ada.Text_IO;
2:main.adb ****
3:main.adb **** procedure Main is
18 .loc 1 3 1
19 .cfi_startproc
20 0000 55 pushq %rbp
21 .cfi_def_cfa_offset 16
22 .cfi_offset 6, -16
23 0001 4889E5 movq %rsp, %rbp
24 .cfi_def_cfa_register 6
25 0004 53 pushq %rbx
26 0005 4883EC08 subq $8, %rsp
27 .cfi_offset 3, -24
28 .LBB2:
4:main.adb **** begin
5:main.adb **** Put_Line ("Hello, world!");
29 .loc 1 5 4
30 0009 B8000000 movl $.LC1, %eax
30 00
31 000e BA000000 movl $.LC0, %edx
31 00
32 0013 4889C1 movq %rax, %rcx
33 0016 4889D3 movq %rdx, %rbx
34 0019 4889D0 movq %rdx, %rax
35 001c 4889CF movq %rcx, %rdi
36 001f 4889C6 movq %rax, %rsi
37 0022 E8000000 call ada__text_io__put_line__2
37 00
38 .LBE2:
6:main.adb **** end Main;
39 .loc 1 6 5
40 0027 4883C408 addq $8, %rsp
41 002b 5B popq %rbx
42 002c 5D popq %rbp
43 .cfi_def_cfa 7, 8
44 002d C3 ret
45 .cfi_endproc
46 .LFE1:
48 .Letext0:
You might want to look at the section on debugging control in the top-secret GNAT documentation, especially the -gnatG switch.

How to make GCC generate vector instructions as ICC does?

I've been using ICC on my project, and ICC utilizes vector instructions very well. Recently I tried to use GCC (version 5.5) to compile the same code; however, on some modules GCC's version is 10 times slower than ICC's. This happens when I do complex multiplication etc.
A sample code will be like:
definitions:
float *ptr1 = _mm_malloc(1280 , 64);
float *ptr2 = _mm_malloc(1280 , 64);
float complex *realptr1 = (float complex *)&ptr1[storageOffset];
float complex *realptr2 = (float complex *)&ptr2[storageOffset];
Pragma and compiler options:
__assume_aligned(realptr1, 64);
__assume_aligned(realptr2, 64);
#pragma ivdep
#pragma vector aligned
for (j = 0; j < 512; j++) {
float complex derSlot0 = realptr1[j] * realptr2[j];
float complex derSlot1 = realptr1[j] + realptr2[j];
realptr1[j] = derSlot0;
realptr2[j] = derSlot1;
}
ICC compiled result of the major loop will be like:
..B1.6: # Preds ..B1.6 ..B1.5
# Execution count [5.12e+02]
vmovups 32(%r15,%rdx,8), %ymm9 #35.29
lea (%r15,%rdx,8), %rax #37.5
vmovups (%rax), %ymm3 #35.29
vaddps 32(%rbx,%rdx,8), %ymm9, %ymm11 #36.43
vaddps (%rbx,%rdx,8), %ymm3, %ymm5 #36.43
vmovshdup 32(%rbx,%rdx,8), %ymm6 #35.43
vshufps $177, %ymm9, %ymm9, %ymm7 #35.43
vmulps %ymm7, %ymm6, %ymm8 #35.43
vmovshdup (%rbx,%rdx,8), %ymm0 #35.43
vshufps $177, %ymm3, %ymm3, %ymm1 #35.43
vmulps %ymm1, %ymm0, %ymm2 #35.43
vmovsldup 32(%rbx,%rdx,8), %ymm10 #35.43
vfmaddsub213ps %ymm8, %ymm9, %ymm10 #35.43
vmovups %ymm11, 32(%rbx,%rdx,8) #38.5
vmovups %ymm10, 32(%rax) #37.5
vmovsldup (%rbx,%rdx,8), %ymm4 #35.43
vfmaddsub213ps %ymm2, %ymm3, %ymm4 #35.43
vmovups %ymm5, (%rbx,%rdx,8) #38.5
vmovups %ymm4, (%rax) #37.5
addq $8, %rdx #32.3
cmpq $512, %rdx #32.3
jb ..B1.6 # Prob 99% #32.3
The command line used for icc is:
icc -march=core-avx2 -S -fsource-asm -c test.c
For GCC, what I've already done include: replace "#pragma ivdep" with "#pragma GCC ivdep", replace "__assume_aligned(realptr1, 64);" with "realptr1 = __builtin_assume_aligned(realptr1, 64);"
The command for GCC is:
gcc -c -O2 -ftree-vectorize -mavx2 -g -Wa,-a,-ad gcctest.c
and the result for the same loop is something like this:
109 .L7:
110 00d8 C5FA103B vmovss (%rbx), %xmm7
111 00dc 4883C308 addq $8, %rbx
112 00e0 C5FA1073 vmovss -4(%rbx), %xmm6
112 FC
113 00e5 4983C408 addq $8, %r12
114 00e9 C4C17A10 vmovss -8(%r12), %xmm5
114 6C24F8
115 00f0 C4C17A10 vmovss -4(%r12), %xmm4
115 6424FC
116 .LBB2:
117 .loc 1 35 0 discriminator 3
118 00f7 C5F828C7 vmovaps %xmm7, %xmm0
119 00fb C5F828CE vmovaps %xmm6, %xmm1
120 00ff C5FA1165 vmovss %xmm4, -80(%rbp)
120 B0
121 0104 C5F828DC vmovaps %xmm4, %xmm3
122 0108 C5FA116D vmovss %xmm5, -76(%rbp)
122 B4
123 010d C5F828D5 vmovaps %xmm5, %xmm2
124 0111 C5FA1175 vmovss %xmm6, -72(%rbp)
124 B8
125 0116 C5FA117D vmovss %xmm7, -68(%rbp)
125 BC
126 011b E8000000 call __mulsc3
126 00
127 .LVL7:
128 .loc 1 38 0 discriminator 3
129 0120 C5FA107D vmovss -68(%rbp), %xmm7
129 BC
130 0125 C5FA106D vmovss -76(%rbp), %xmm5
130 B4
131 012a C5FA1075 vmovss -72(%rbp), %xmm6
131 B8
132 012f C5D258EF vaddss %xmm7, %xmm5, %xmm5
133 0133 C5FA1065 vmovss -80(%rbp), %xmm4
133 B0
134 .loc 1 35 0 discriminator 3
135 0138 C5F9D645 vmovq %xmm0, -56(%rbp)
135 C8
136 .loc 1 38 0 discriminator 3
137 013d C5DA58E6 vaddss %xmm6, %xmm4, %xmm4
138 .loc 1 35 0 discriminator 3
139 0141 C5FA1045 vmovss -52(%rbp), %xmm0
139 CC
140 .LVL8:
141 .loc 1 37 0 discriminator 3
142 0146 C5FA104D vmovss -56(%rbp), %xmm1
142 C8
143 014b C5FA114B vmovss %xmm1, -8(%rbx)
143 F8
144 .LVL9:
145 0150 C5FA1143 vmovss %xmm0, -4(%rbx)
145 FC
146 .loc 1 38 0 discriminator 3
147 0155 C4C17A11 vmovss %xmm5, -8(%r12)
147 6C24F8
148 015c C4C17A11 vmovss %xmm4, -4(%r12)
148 6424FC
149 .LBE2:
150 .loc 1 32 0 discriminator 3
151 0163 4C39EB cmpq %r13, %rbx
152 0166 0F856CFF jne .L7
152 FFFF
So, I can see that GCC uses some kind of vector instructions, but it is still not as good as ICC.
My question is: are there any more options I can use to make GCC perform better?
Thanks a lot.
You didn't post full code to test but you may start with adding
-ffast-math
and optionally
-mfma
so more or less you will end up with
vmovaps ymm0, YMMWORD PTR [rbx+rax]
vmovaps ymm3, YMMWORD PTR [r12+rax]
vpermilps ymm2, ymm0, 177
vpermilps ymm4, ymm3, 245
vpermilps ymm1, ymm3, 160
vmulps ymm2, ymm2, ymm4
vmovaps ymm4, ymm0
vfmsub132ps ymm4, ymm2, ymm1
vfmadd132ps ymm1, ymm2, ymm0
vaddps ymm0, ymm0, ymm3
vmovaps YMMWORD PTR [rbx+rax], ymm0
vblendps ymm1, ymm4, ymm1, 170
vmovaps YMMWORD PTR [r12+rax], ymm1
add rax, 32
cmp rax, 4096
jne .L6
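Some background on why this helps: the call to __mulsc3 in the question's listing is the libgcc helper for float complex multiplication, which handles the infinity/NaN corner cases the C standard asks for. As far as I know, -ffast-math includes -fcx-limited-range, which lets GCC expand the multiply inline so the loop can be vectorized. If changing the floating-point flags is not an option, another approach is to write out the real and imaginary parts by hand. The sketch below does that for the loop in the question; the function name is made up, it assumes the buffers hold interleaved (re, im) float pairs and do not overlap, and I have not checked how well GCC 5.5 vectorizes it.

```c
#include <assert.h>
#include <stddef.h>

/* a and b each hold n complex numbers as interleaved (re, im) float pairs.
   After the call: a[j] = a[j] * b[j] and b[j] = a_old[j] + b[j], which is
   what the loop in the question computes. No infinity/NaN recovery is done,
   so the result can differ from the standard complex multiply in those
   corner cases. */
void mul_add_interleaved(float *restrict a, float *restrict b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t j = 0; j < n; j++) {
        float ar = a[2 * j], ai = a[2 * j + 1];
        float br = b[2 * j], bi = b[2 * j + 1];
        a[2 * j]     = ar * br - ai * bi; /* real part of the product */
        a[2 * j + 1] = ar * bi + ai * br; /* imaginary part of the product */
        b[2 * j]     = ar + br;           /* real part of the sum */
        b[2 * j + 1] = ai + bi;           /* imaginary part of the sum */
    }
}
```

With the question's buffers this would be called as mul_add_interleaved((float *)realptr1, (float *)realptr2, 512).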

Where does the C function tan return its value in 64-bit GCC?

I am linking my assembly function with GCC on 64-bit Linux. The library function I use is tan from math.h. I link it with:
gcc -s prog.o -o prog -lm
The program works but the return value is 0.0000000 (for 3.4 radians). I use extrn in my assembly code:
extrn tan
extrn printf
I use xmm0 to pass the argument (in radians) to the tan function. Now I am not sure which register is used to return the value from tan. Is it xmm0, st0 or RAX? I can't find a decent reference on this.
For my gcc, it's xmm0.
Here's a C program:
#include <stdio.h>
#include <math.h>
int main () {
double x = tan(M_PI/4.0);
// RESULT: x=1.000000
printf ("x=%f\n", x);
return 0;
}
And here's the corresponding "gcc -S":
.Ltext0:
.section .rodata
.LC1:
.string "x=%f\n"
.text
.globl main
.type main, #function
main:
.LFB0:
.file 1 "x.cpp"
.loc 1 4 0
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
.LBB2:
.loc 1 6 0
movabsq $4607182418800017407, %rax
movq %rax, -8(%rbp)
.loc 1 8 0
movq -8(%rbp), %rax
movq %rax, -24(%rbp)
movsd -24(%rbp), %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
.loc 1 9 0
movl $0, %eax
.LBE2:
.loc 1 10 0
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
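Note that the listing contains no call to tan at all: the compiler evaluated tan(M_PI/4.0) at compile time and loads the result with movabsq. The value still reaches printf in xmm0 (the movsd), and the movl $1, %eax before call printf is there because, for variadic functions on x86-64 Linux, AL has to hold the number of vector registers used for arguments; forgetting that in hand-written assembly is one possible cause of a wrong printed value. The sketch below (helper name made up, and assuming double is the 64-bit IEEE-754 format) shows what the integer constant in the listing means.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double, byte for byte. */
double double_from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

The constant 4607182418800017407 is one less than 0x3FF0000000000000, the bit pattern of 1.0, so double_from_bits(4607182418800017407ULL) is the largest double below 1.0, which %f prints as 1.000000.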
