arm-none-eabi-gcc not inferring floating point multiply-accumulate from code

arm-none-eabi-gcc not inferring floating point multiply-accumulate from code - gcc

The ARM fpv5 instruction set supports double precision floating point operations, including single cycle multiply accumulate instructions (VMLA/VMLS) as detailed in their ISA documentation.
Unfortunately, I can't get my code to use this instruction from within any C application.
Here is a simple example:
float64_t a=0, b=0, c=0;
while(1)
{
b += 1.643;
c += 3.901;
a += b * c; // multiply accumulate???
do_stuff(a) // use the MAC result
}
The code above generates the following assembly for (what I believe should be) a MAC operation
170 a += b * c;
00000efe: vldr d6, [r7, #64] ; 0x40
00000f02: vldr d7, [r7, #56] ; 0x38
00000f06: vmul.f64 d7, d6, d7
00000f0a: vldr d6, [r7, #72] ; 0x48
00000f0e: vadd.f64 d7, d6, d7
00000f12: vstr d7, [r7, #72] ; 0x48
As you can see, it does the multiply and the addition step separately. Is there a good reason why the compiler cant use the VMLA.f64 instruction here?
Target: ARM Cortex M7 (NXP iMXRT1051)
Toolchain: arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 8-2018-q4-major) 8.2.1 20181213 (release) [gcc-8-branch revision 267074]

Solved. It was the optimization level. When set to -O3 the instructions changed to properly use the MAC.
I thought that taking advantage of hardware acceleration (e.g. FPU) wouldn't be dependent on the optimization level since its essentially "free", but I guess I was wrong.

Related

Why is ARM gcc calling __udivsi3 when dividing by a constant?

I'm using the latest available version of ARM-packaged GCC:
arm-none-eabi-gcc (GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
When I compile this code using "-mcpu=cortex-m0 -mthumb -Ofast":
int main(void) {
uint16_t num = (uint16_t) ADC1->DR;
ADC1->DR = num / 7;
}
I would expect that the division would be accomplished by a multiplication and a shift, but instead this code is being generated:
08000b5c <main>:
8000b5c: b510 push {r4, lr}
8000b5e: 4c05 ldr r4, [pc, #20] ; (8000b74 <main+0x18>)
8000b60: 2107 movs r1, #7
8000b62: 6c20 ldr r0, [r4, #64] ; 0x40
8000b64: b280 uxth r0, r0
8000b66: f7ff facf bl 8000108 <__udivsi3>
8000b6a: b280 uxth r0, r0
8000b6c: 6420 str r0, [r4, #64] ; 0x40
8000b6e: 2000 movs r0, #0
8000b70: bd10 pop {r4, pc}
8000b72: 46c0 nop ; (mov r8, r8)
8000b74: 40012400 .word 0x40012400
Using __udivsi3 instead of multiply and shift is terribly inefficient. Am I using the wrong flags, or missing something else, or is this a GCC bug?

The Cortex-M0 lacks instructions to perform a 32x32->64-bit multiply. Because num is an unsigned 16-bit quantity, multiplying it by 9363 and shifting right 16 would yield a correct result in all cases, but--likely because a uint16_t will be promoted to int before the multiply, gcc does not include such optimizations.
From what I've observed, gcc does a generally poor job of optimizing for the Cortex-M0, failing to employ some straightforward optimizations which would be appropriate for that platform, but sometimes employing "optimizations" which aren't. Given something like
void test1(uint8_t *p)
{
for (int i=0; i<32; i++)
p[i] = (p[i]*9363) >> 16; // Divide by 7
}
gcc happens to generate okay code for the Cortex-M0 at -O2, but if the multiplication were replaced with an addition the compiler would generate code which reloads the constant 9363 on every iteration of the loop. When using addition, even if the code were changed to:
void test2(uint16_t *p)
{
register unsigned u9363 = 9363;
for (int i=0; i<32; i++)
p[i] = (p[i]+u9363) >> 16;
}
gcc would still bring the load of the constant into the loop. Sometimes gcc's optimizations may also have unexpected behavioral consequences. For example, one might expect that on a platform like a Cortex-M0, invoking something like:
unsigned short test(register unsigned short *p)
{
register unsigned short temp = *p;
return temp - (temp >> 15);
}
while an interrupt changes the contents of *p might yield behavior consistent with the old value or the new value. The Standard wouldn't require such treatment, but most implementations intended to be suitable for embedded programming tasks will offer stronger guarantees than what the Standard requires. If either the old or new value would be equally acceptable, letting the compiler use whichever is more convenient may allow more efficient code than using volatile. As it happens, however, the "optimized" code from gcc will replace the two uses of temp with separate loads of *p.
If you're using gcc with the Cortex-M0 and are at all concerned about performance or the possibility of "astonishing" behaviors, get in the habit of inspecting the compiler's output. For some kinds of loop, it might even be worth considering testing out -O0. If code makes suitable use of the register keyword, its performance can sometimes beat that of identical code processed with -O2.

Expanding on supercat's answer.
Feed this:
unsigned short fun ( unsigned short x )
{
return(x/7);
}
to something with a larger multiply:
00000000 <fun>:
0: e59f1010 ldr r1, [pc, #16] ; 18 <fun+0x18>
4: e0832190 umull r2, r3, r0, r1
8: e0400003 sub r0, r0, r3
c: e08300a0 add r0, r3, r0, lsr #1
10: e1a00120 lsr r0, r0, #2
14: e12fff1e bx lr
18: 24924925 .word 0x24924925
1/7 in binary (long division):
0.001001001001001
111)1.000000
111
====
1000
111
===
1
0.001001001001001001001001001001
0.0010 0100 1001 0010 0100 1001 001001
0x2492492492...
0x24924925>>32 (rounded up)
For this to work you need a 64 bit result, you take the top half and do some adjustments, so for example:
7 * 0x24924925 = 0x100000003
and you take the top 32 bits (not completely this simple but for this value you can see it working).
The all thumbs variant multiply is 32 bits = 32 bits * 32 bits, so the result would be 0x00000003 and that does not work.
So 0x24924 which we can make 0x2493 as supercat did or 0x2492.
Now we can use the 32x32 = 32 bit multiply:
0x2492 * 7 = 0x0FFFE
0x2493 * 7 = 0x10005
Let's run with the one larger:
0x100000000/0x2493 = a number greater than 65536. so that is fine.
but:
0x3335 * 0x2493 = 0x0750DB6F
0x3336 * 0x2493 = 0x07510002
0x3335 / 7 = 0x750
0x3336 / 7 = 0x750
So you can only get so far with that approach.
If we follow the model of the arm code:
for(ra=0;ra<0x10000;ra++)
{
rb=0x2493*ra;
rd=rb>>16;
rb=ra-rd;
rb=rd+(rb>>1);
rb>>=2;
rc=ra/7;
printf("0x%X 0x%X 0x%X \n",ra,rb,rc);
if(rb!=rc) break;
}
Then it works from 0x0000 to 0xFFFF, so you could write the asm to do that (note it needs to be 0x2493 not 0x2492).
If you know the operand is not going above a certain value then you can use more bits of 1/7th to multiply against.
In any case when the compiler does not do this optimization for you then you might still have a chance yourself.
Now that I think about it I ran into this before, and now it makes sense. But I was on a full sized arm and I called a routine I compiled in arm mode (the other code was in thumb mode), and had a switch statement basically if denominator = 1 then result = x/1; if denominator = 2 then result = x/2 and so on. And then it avoided the gcclib function and generated the 1/x multiplies. (I had like 3 or 4 different constants to divide by):
unsigned short udiv7 ( unsigned short x )
{
unsigned int r0;
unsigned int r3;
r0=x;
r3=0x2493*r0;
r3>>=16;
r0=r0-r3;
r0=r3+(r0>>1);
r0>>=2;
return(r0);
}
Assuming I made no mistakes:
00000000 <udiv7>:
0: 4b04 ldr r3, [pc, #16] ; (14 <udiv7+0x14>)
2: 4343 muls r3, r0
4: 0c1b lsrs r3, r3, #16
6: 1ac0 subs r0, r0, r3
8: 0840 lsrs r0, r0, #1
a: 18c0 adds r0, r0, r3
c: 0883 lsrs r3, r0, #2
e: b298 uxth r0, r3
10: 4770 bx lr
12: 46c0 nop ; (mov r8, r8)
14: 00002493 .word 0x00002493
That should be faster than a generic division library routine.
Edit
I think I see what supercat has done with the solution that works:
((i*37449 + 16384u) >> 18)
We have this as the 1/7th fraction:
0.001001001001001001001001001001
but we can only do a 32 = 32x32 bit multiply. The leading zeros give us some breathing room we might be able to take advantage of. So instead of 0x2492/0x2493 we can try:
1001001001001001
0x9249
0x9249*0xFFFF = 0x92486db7
And so far it won't overflow:
rb=((ra*0x9249) >> 18);
by itself it fails at 7 * 0x9249 = 0x3FFFF, 0x3FFFF>>18 is zero not 1.
So maybe
rb=((ra*0x924A) >> 18);
that fails at:
0xAAAD 0x1862 0x1861
So what about:
rb=((ra*0x9249 + 0x8000) >> 18);
and that works.
What about supercat's?
rb=((ra*0x9249 + 0x4000) >> 18);
and that runs clean for all values 0x0000 to 0xFFFF:
rb=((ra*0x9249 + 0x2000) >> 18);
and that fails here:
0xE007 0x2000 0x2001
So there are a couple of solutions that work.
unsigned short udiv7 ( unsigned short x )
{
unsigned int ret;
ret=x;
ret=((ret*0x9249 + 0x4000) >> 18);
return(ret);
}
00000000 <udiv7>:
0: 4b03 ldr r3, [pc, #12] ; (10 <udiv7+0x10>)
2: 4358 muls r0, r3
4: 2380 movs r3, #128 ; 0x80
6: 01db lsls r3, r3, #7
8: 469c mov ip, r3
a: 4460 add r0, ip
c: 0c80 lsrs r0, r0, #18
e: 4770 bx lr
10: 00009249 .word 0x00009249
Edit
As far as the "why" question goes, that is not a Stack Overflow question; if you want to know why gcc doesn't do this, ask the authors of that code. All we can do is speculate here and the speculation is they may either have chosen not to because of the number of instructions or they may have chosen not to because they have an algorithm that states because this is not a 64 = 32x32 bit multiply then do not bother.
Again the why question is not a Stack Overflow question, so perhaps we should just close this question and delete all of the answers.
I found this to be incredibly educational (once you know/understand what was being said).
Another WHY? question is why did gcc do it the way they did it when they could have done it the way supercat or I did it?

The compiler can only rearrange integer expressions if it knows that the result will be correct for any input allowed by the language.
Because 7 is co-prime to 2, it is impossible to carry out dividing any input by seven with multiplying and shifting.
If you know that it is possible for the input that you intend to provide, then you have to do it yourself using the multiply and shift operators.
Depending on the size of the input, you will have to choose how much to shift so that the output is correct (or at least good enough for your application) and so that the intermediate doesn't overflow. The compiler has no way of knowing what is accurate enough for your application, or what your maximum input will be. If it allows any input up to the maximum of the type, then every multiplication will overflow.
In general GCC will only carry out division using shifting if the divisor is not co-prime to 2, that is if it is a power of two.

What do the constraints "Rah" and "Ral" mean in extended inline assembly?

This question is inspired by a question asked by someone on another forum. In the following code what does the extended inline assembly constraint Rah and Ral mean. I haven't seen these before:
#include<stdint.h>
void tty_write_char(uint8_t inchar, uint8_t page_num, uint8_t fg_color)
{
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
}
void tty_write_string(const char *string, uint8_t page_num, uint8_t fg_color)
{
while (*string)
tty_write_char(*string++, page_num, fg_color);
}
/* Use the BIOS to print the first command line argument to the console */
int main(int argc, char *argv[])
{
if (argc > 1)
tty_write_string(argv[1], 0, 0);
return 0;
}
In particular are the use of Rah and Ral as constraints in this code:
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
The GCC Documentation doesn't have an l or h constraint for either simple constraints or x86/x86 machine constraints. R is any legacy register and a is the AX/EAX/RAX register.
What am I not understanding?

What you are looking at is code that is intended to be run in real mode on an x86 based PC with a BIOS. Int 0x10 is a BIOS service that has the ability to write to the console. In particular Int 0x10/AH=0x0e is to write a single character to the TTY (terminal).
That in itself doesn't explain what the constraints mean. To understand the constraints Rah and Ral you have to understand that this code isn't being compiled by a standard version of GCC/CLANG. It is being compiled by a GCC port called ia16-gcc. It is a special port that targets 8086/80186 and 80286 and compatible processors. It doesn't generate 386 instructions or use 32-bit registers in code generation. This experimental version of GCC is to target 16-bit environments like DOS (FreeDOS, MSDOS), and ELKS.
The documentation for ia16-gcc is hard to find online in HTML format but I have produced a copy for the recent GCC 6.3.0 versions of the documentation on GitHub. The documentation was produced by building ia16-gcc from source and using make to generate the HTML. If you review the machine constraints for Intel IA-16—config/ia16 you should now be able to see what is going on:
Ral The al register.
Rah The ah register.
This version of GCC doesn't understand the R constraint by itself anymore. The inline assembly you are looking at matches that of the parameters for Int 0x10/Ah=0xe:
VIDEO - TELETYPE OUTPUT
AH = 0Eh
AL = character to write
BH = page number
BL = foreground color (graphics modes only)
Return:
Nothing
Desc: Display a character on the screen, advancing the cursor
and scrolling the screen as necessary
Other Information
The documentation does list all the constraints that are available for the IA16 target:
Intel IA-16—config/ia16/constraints.md
a
The ax register. Note that for a byte operand,
this constraint means that the operand can go into either al or ah.
b
The bx register.
c
The cx register.
d
The dx register.
S
The si register.
D
The di register.
Ral
The al register.
Rah
The ah register.
Rcl
The cl register.
Rbp
The bp register.
Rds
The ds register.
q
Any 8-bit register.
T
Any general or segment register.
A
The dx:ax register pair.
j
The bx:dx register pair.
l
The lower half of pairs of 8-bit registers.
u
The upper half of pairs of 8-bit registers.
k
Any 32-bit register group with access to the two lower bytes.
x
The si and di registers.
w
The bx and bp registers.
B
The bx, si, di and bp registers.
e
The es register.
Q
Any available segment register—either ds or es (unless one or both have been fixed).
Z
The constant 0.
P1
The constant 1.
M1
The constant -1.
Um
The constant -256.
Lbm
The constant 255.
Lor
Constants 128 … 254.
Lom
Constants 1 … 254.
Lar
Constants -255 … -129.
Lam
Constants -255 … -2.
Uo
Constants 0xXX00 except -256.
Ua
Constants 0xXXFF.
Ish
A constant usable as a shift count.
Iaa
A constant multiplier for the aad instruction.
Ipu
A constant usable with the push instruction.
Imu
A constant usable with the imul instruction except 257.
I11
The constant 257.
N
Unsigned 8-bit integer constant (for in and out instructions).
There are many new constraints and some repurposed ones.
In particular the a constraint for the AX register doesn't work like other versions of GCC that target 32-bit and 64-bit code. The compiler is free to choose either AH or AL with the a constraint if the values being passed are 8 bit values. This means it is possible for the a constraint to appear twice in an extended inline assembly statement.
You could have compiled your code to a DOS EXE with this command:
ia16-elf-gcc -mcmodel=small -mregparmcall -march=i186 \
-Wall -Wextra -std=gnu99 -O3 int10h.c -o int10h.exe
This targets the 80186. You can generate 8086 compatible code by omitting the -march=i186 The generated code for main would look something like:
00000000 <main>:
0: 83 f8 01 cmp ax,0x1
3: 7e 1d jle 22 <tty_write_string+0xa>
5: 56 push si
6: 89 d3 mov bx,dx
8: 8b 77 02 mov si,WORD PTR [bx+0x2]
b: 8a 04 mov al,BYTE PTR [si]
d: 20 c0 and al,al
f: 74 0d je 1e <tty_write_string+0x6>
11: 31 db xor bx,bx
13: b4 0e mov ah,0xe
15: 46 inc si
16: cd 10 int 0x10
18: 8a 04 mov al,BYTE PTR [si]
1a: 20 c0 and al,al
1c: 75 f7 jne 15 <main+0x15>
1e: 31 c0 xor ax,ax
20: 5e pop si
21: c3 ret
22: 31 c0 xor ax,ax
24: c3 ret
When run with the command line int10h.exe "Hello, world!" should print:
Hello, world!
Special Note: The IA16 port of GCC is very experimental and does have some code generation bugs especially when higher optimization levels are used. I wouldn't use it for mission critical applications at this point in time.

Hard float abi selected but soft float library used by GCC on STM32

I'm developping on a STM32L4 that embeds a FPv4-SP FPU.
I'm testing the FPU usage. I am compiling using hard float abi:
arm-atollic-eabi-gcc -c (...) __VFP_FP__ -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard xxx.o -o xxx.o xxx.c
(I've added the same option -mfloat-abi to the link command, even though I don't think it is useful)
However, looking to assembly code, I noticed that software floating point library functions are called:
35 volatile float f = 0.125;
0800a2b4: mov.w r3, #1040187392 ; 0x3e000000
0800a2b8: str r3, [r7, #4]
37 f = f/0.4;
0800a2ba: ldr r3, [r7, #4]
0800a2bc: mov r0, r3
0800a2be: bl 0x8000348 <__extendsfdf2>
0800a2c2: add r3, pc, #100 ; (adr r3, 0x800a328 <csem_tests+136>)
0800a2c4: ldrd r2, r3, [r3]
0800a2c8: bl 0x8000644 <__divdf3>
What am I missing ?

I don't know if answering my own question is the right way to go, sorry for inconvenience if it isn't, but I guess it is better than deleting the post.
I've found the issue: the float variable I used for testing was in fact casted to double, and since the FPU is single precision only, the operation was handled in software. Forcing the variable to float like this :
float f = (float)0.125;
f = f/(float)0.68768;
solved the issue, even if I don't really understand why the compiler casted this variable to double.

because string constants are always double (and all operations are done on double if one of the operands is double) unless you use the 'f' suffix - 0.125for command line option -fsingle-precision-constant.
If you want a "pure" FPU code you need to use -ffast-math & -fno-math-errno as well

OpenMP SIMD on Power8

I'm wondering whether there is any compiler (gcc, xlc, etc.) on Power8 that supports OpenMP SIMD constructs on Power8? I tried with XL (13.1) but I couldn't compile successfully. Probably it doesn't support simd construct yet.
I could compile with gcc 4.9.1 (with these flags -fopenmp -fopenmp-simd and -O1). I put differences between 2 asm files.
Can I say that gcc 4.9 is able to generate altivec code? In order to optimize more, what am I supposed to do? (I tried with -O3, restrict treatment)
My code is very simple:
int *x, *y, *z;
x = (int*) malloc(n * sizeof(int));
y = (int*) malloc(n * sizeof(int));
z = (int*) malloc(n * sizeof(int));
#pragma omp simd
for(i = 0; i < N; ++i)
z[i] = a * x[i] + y[i];
And generated assembly is here
.L7:
lwz 9,124(31)
extsw 9,9
std 9,104(31)
lfd 0,104(31)
stfd 0,104(31)
ld 8,104(31)
sldi 9,8,2
ld 10,152(31)
add 9,10,9
lwz 10,124(31)
extsw 10,10
std 10,104(31)
lfd 0,104(31)
stfd 0,104(31)
ld 7,104(31)
sldi 10,7,2
ld 8,136(31)
add 10,8,10
lwz 10,0(10)
extsw 10,10
lwz 8,132(31)
mullw 10,8,10
extsw 8,10
lwz 10,124(31)
extsw 10,10
std 10,104(31)
lfd 0,104(31)
stfd 0,104(31)
ld 7,104(31)
sldi 10,7,2
ld 7,144(31)
add 10,7,10
lwz 10,0(10)
extsw 10,10
add 10,8,10
extsw 10,10
stw 10,0(9)
lwz 9,124(31)
addi 9,9,1
stw 9,124(31)
GCC with -O1 -fopenmp-simd
.L7:
lwz 9,108(31)
mtvsrwa 0,9
mfvsrd 8,0
sldi 9,8,2
ld 10,136(31)
add 9,10,9
lwz 10,108(31)
mtvsrwa 0,10
mfvsrd 7,0
sldi 10,7,2
ld 8,120(31)
add 10,8,10
lwz 10,0(10)
extsw 10,10
lwz 8,116(31)
mullw 10,8,10
extsw 8,10
lwz 10,108(31)
mtvsrwa 0,10
mfvsrd 7,0
sldi 10,7,2
ld 7,128(31)
add 10,7,10
lwz 10,0(10)
extsw 10,10
add 10,8,10
extsw 10,10
stw 10,0(9)
lwz 9,108(31)
addi 9,9,1
stw 9,108(31)
In order to clarify and understand details, I have one more application which is n^2 nbody application. This time my question is related with these compilers (gcc 4.9 and XL 13.1 ) and architectures (Intel and Power).
I put all the codes into gist https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-input-c
( full version of input code input.c )
Power8 & XLC - It says "was not SIMD vectorized because it contains function calls. (there is sqrtf)". It's reasonable. But in the asm code I can see xsnmsubmdp is it normal? (the assembly: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-xlc-noinnersimd-asm)
Power8 & gcc I tried to compile it in 2 ways (with omp simd construct and without). It changed my asm code, is it normal? (According to OpenMP, the code should not contain function call) (Assembilies: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-gcc-noinnersimd-asm & https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-gcc-innersimd-asm)
i74820K & gcc I did a same test with omp simd and without it. The output codes are different as well. Does FMA effect this code block ? (Assembilies: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-i74820k-gcc-noinnersimd-asm & https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-i74820k-gcc-innersimd-asm)
Thanks in advance

The XL compiler on POWER Linux currently only supports a subset of the OpenMP 4.0 features. The SIMD construct feature is not supported at the moment, so the compiler will not recognize the construct in your source code.
However, if vectorization is what you're looking for then the good news is that the XL compiler should already automatically vectorize your code as long as you use at least the following optimization options
-O3 -qhot -qarch=pwr8 -qtune=pwr8
These options will enable high-order loop transformations along with POWER8 specific optimizations, including loop auto-vectorization for your loop.
Afterwards, you should see some VMX & VSX instructions in the generated assembly code similar to the following:
188: 19 2e 80 7c lxvw4x vs36,0,r5
18c: 84 09 a6 10 vslw v5,v6,v1
190: 10 00 e7 38 addi r7,r7,16
194: 10 00 a5 38 addi r5,r5,16
198: 40 28 63 10 vadduhm v3,v3,v5
19c: 80 20 63 10 vadduwm v3,v3,v4
1a0: 19 4f 66 7c stxvw4x vs35,r6,r9
1a4: 14 02 86 41 beq cr1,3b8 <foo+0x3b8>
1a8: 10 00 20 39 li r9,16
1ac: 19 4e 27 7d lxvw4x vs41,r7,r9
1b0: 19 3e a0 7c lxvw4x vs37,0,r7
By the way, you can also get an optimization report from the XL compilers by using the -qreport option. This will explain which loops were vectorized and which loops were not and for what reason. e.g.
1586-542 (I) Loop (loop index 1 with nest-level 0 and iteration count
100) at test.c was SIMD vectorized.
or
1586-549 (I) Loop (loop index 2) at test.c was not SIMD
vectorized because a data dependence prevents SIMD vectorization.
Hope this helps!

I don't have access to a Power-based machine right now, but some experimentation with the AST dumper on x86 shows that GCC 4.9.2 starts producing SIMD code only once the level of optimisation reaches O1, i.e. the following options should do the trick:
-fopenmp-simd -O1
The same is true for GCC 5.1.0.
Also note that the vectoriser applies a cost model that might prevent it from actually producing vectorised code in some cases. See -fsimd-cost-model and similar options here on how to override that behaviour.

Fixed point math with ARM Cortex-M4 and gcc compiler

I'm using Freescale Kinetis K60 and using the CodeWarrior IDE (which I believe uses GCC for the complier).
I want to multiply two 32 bit numbers (which results in a 64 bit number) and only retain the upper 32 bits.
I think the correct assembly instruction for the ARM Cortex-M4 is the SMMUL instruction. I would prefer to access this instruction from C code rather than assembly. How do I do this?
I imagine the code would ideally be something like this:
int a,b,c;
a = 1073741824; // 0x40000000 = 0.5 as a D0 fixed point number
b = 1073741824; // 0x40000000 = 0.5 as a D0 fixed point number
c = ((long long)a*b) >> 31; // 31 because there are two sign bits after the multiplication
// so I can throw away the most significant bit
When I try this in CodeWarrior, I get the correct result for c (536870912 = 0.25 as a D0 FP number). I don't see the SMMUL instruction anywhere and the multiply is 3 instructions (UMULL, MLA, and MLA -- I don't understand why it is using a unsigned multiply, but that is another question). I have also tried a right shift of 32 since that might make more sense for the SMMUL instruction, but that doesn't do anything different.

The problem you get with optimizing that code is:
08000328 <mul_test01>:
8000328: f04f 5000 mov.w r0, #536870912 ; 0x20000000
800032c: 4770 bx lr
800032e: bf00 nop
your code doesnt do anything runtime so the optimizer can just compute the final answer.
this:
.thumb_func
.globl mul_test02
mul_test02:
smull r2,r3,r0,r1
mov r0,r3
bx lr
called with this:
c = mul_test02(0x40000000,0x40000000);
gives 0x10000000
UMULL gives the same result because you are using positive numbers, the operands and results are all positive so it doesnt get into the signed/unsigned differences.
Hmm, well you got me on this one. I would read your code as telling the compiler to promote the multiply to a 64 bit. smull is two 32 bit operands giving a 64 bit result, which is not what your code is asking for....but both gcc and clang used the smull anyway, even if I left it as an uncalled function, so it didnt know at compile time that the operands had no significant digits above 32, they still used smull.
Perhaps the shift was the reason.
Yup, that was it..
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b) >> 31;
return(c);
}
gives
both gcc and clang (well clang recycles r0 and r1 instead of using r2 and r3)
08000340 <mul_test04>:
8000340: fb81 2300 smull r2, r3, r1, r0
8000344: 0fd0 lsrs r0, r2, #31
8000346: ea40 0043 orr.w r0, r0, r3, lsl #1
800034a: 4770 bx lr
but this
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b);
return(c);
}
gives this
gcc:
08000340 <mul_test04>:
8000340: fb00 f001 mul.w r0, r0, r1
8000344: 4770 bx lr
8000346: bf00 nop
clang:
0800048c <mul_test04>:
800048c: 4348 muls r0, r1
800048e: 4770 bx lr
So with the bit shift the compilers realize that you are only interested in the upper portion of the result so they can discard the upper portion of the operands which means smull can be used.
Now if you do this:
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b) >> 32;
return(c);
}
both compilers get even smarter, in particular clang:
0800048c <mul_test04>:
800048c: fb81 1000 smull r1, r0, r1, r0
8000490: 4770 bx lr
gcc:
08000340 <mul_test04>:
8000340: fb81 0100 smull r0, r1, r1, r0
8000344: 4608 mov r0, r1
8000346: 4770 bx lr
I can see that 0x40000000 considered as a float where you are keeping track of the decimal place, and that place is a fixed location. 0x20000000 would make sense as the answer. I cant yet decide if that 31 bit shift works universally or just for this one case.
A complete example used for the above is here
https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/sample01
and I did run it on an stm32f4 to verify it works and the results.
EDIT:
If you pass the parameters into the function instead of hardcoding them within the function:
int myfun ( int a, int b )
{
return(a+b);
}
The compiler is forced to make runtime code instead of optimize the answer at compile time.
Now if you call that function from another function with hardcoded numbers:
...
c=myfun(0x1234,0x5678);
...
In this calling function the compiler may choose to compute the answer and just place it there at compile time. If the myfun() function is global (not declared as static) the compiler doesnt know if some other code to be linked later will use it so even near the call point in this file it optimizes an answer it still has to produce the actual function and leave it in the object for other code in other files to call, so you can still examine what the compiler/optimizer does with that C code. Unless you use llvm for example where you can optimize the whole project (across files) external code calling this function will use the real function and not a compile time computed answer.
both gcc and clang did what I am describing, left runtime code for the function as a global function, but within the file it computed the answer at compile time and placed the hardcoded answer in the code instead of calling the function:
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b) >> 31;
return(c);
}
in another function in the same file:
hexstring(mul_test04(0x40000000,0x40000000),1);
The function itself is implemented in the code:
0800048c <mul_test04>:
800048c: fb81 1000 smull r1, r0, r1, r0
8000490: 0fc9 lsrs r1, r1, #31
8000492: ea41 0040 orr.w r0, r1, r0, lsl #1
8000496: 4770 bx lr
but where it is called they have hardcoded the answer because they had all the information needed to do so:
8000520: f04f 5000 mov.w r0, #536870912 ; 0x20000000
8000524: 2101 movs r1, #1
8000526: f7ff fe73 bl 8000210 <hexstring>
If you dont want the hardcoded answer you need to use a function that is not in the same optimization pass.
Manipulating the compiler and optimizer comes down to a lot of practice and it is not an exact science as the compilers and optimizers are constantly evolving (for better or worse).
By isolating a small bit of code in a function you are causing problems in another way, larger functions are more likely to need a stack frame and evict variables from registers to the stack as it goes, smaller functions might not need to do that and the optimizers may change how the code is implemented as a result. You test the code fragment one way to see what the compiler is doing then use it in a larger function and dont get the result you want. If there is an exact instruction or sequence of instructions you want implemented....Implement them in assembler. If you were targeting a specific set of instructions in a specific instruction set/processor then avoid the game, avoid your code changing when you change computers/compilers/etc, and just use assembler for that target. if needed ifdef or otherwise use conditional compile options to build for different targets without the assembler.

GCC supports actual fixed-point types: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html
I'm not sure what instruction it will use, but it might make you life easier.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio