PowerPC GCC floating point instructions

I am currently developing for an MPC5777 board with e200z7 cores. Most things are going well, but I am stuck on a problem that has been annoying me for a while.
I am trying to use floating point operations in portions of my code, using the embedded hardware support. My toolchain is GCC 6.3 (powerpc-gcc), for which I am using the following flags:
ASFLAGS_BASE = -g -a32 -mbooke -me500 --fatal-warnings
ARCH_FLAGS = -mpowerpc-gpopt -mfprnd -misel -m32 -mhard-float -mabi=spe -mmfpgpr -mfloat-gprs=single
Please notice the -mfloat-gprs=single flag; that is the one causing problems.
When I use -mfloat-gprs=single, the build fails at link time because some soft-float support routines are missing:
undefined reference to `__extendsfdf2`,
undefined reference to `__adddf3`,
undefined reference to `__divdf3`,
among others.
Now, if I compile using -mfloat-gprs=double, the build runs to completion and generates all my executables. BUT, this flag also makes GCC emit instructions that are not implemented by the e200z7. I can't tell for sure which ones, as the code is getting big and it is practically impossible to track all of the generated assembly. For instance, at the moment my execution gets stuck when it reaches an efscfd instruction, which is implemented on the e500 core (which has double-precision floating point support) but not on the e200z7 (which has single-precision support only).
So, any piece of advice here would be amazingly welcome!
Thanks in advance,

In case someone ends up here with a similar problem, I have fixed it by using three flags:
-mfloat-gprs=single -Wdouble-promotion -fsingle-precision-constant
What they do is:
-mfloat-gprs=single tells the compiler to use general purpose registers for floating point operations, instead of floating point registers; =single means only single-precision operations are performed in them
-Wdouble-promotion enables a compiler warning whenever GCC implicitly promotes a single-precision float to double
-fsingle-precision-constant makes GCC treat floating point constants as single precision, instead of letting them force the surrounding arithmetic to double
I am using -fsingle-precision-constant and -Wdouble-promotion at the same time to be 100% sure that doubles will not be used.
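To make the promotion concrete, here is a minimal sketch (function names are made up) of the kind of code that silently drags in the double-precision helper routines:

/* 2.5 has type double, so x is promoted, the multiply happens in
   double, and the result is truncated back to float. On a core with
   single-precision hardware only, this is what pulls in helpers such
   as __extendsfdf2. -Wdouble-promotion warns here. */
float scale_promoted(float x)
{
    return x * 2.5;
}

/* Writing 2.5f, or compiling with -fsingle-precision-constant, keeps
   the whole computation in single precision. */
float scale_single(float x)
{
    return x * 2.5f;
}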

Related

Can I make my compiler use fast-math on a per-function basis?

Suppose I have
template <bool UsesFastMath> void foo(float* data, size_t length);
and I want to compile one instantiation with -ffast-math (--use_fast_math for nvcc), and the other instantiation without it.
This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
My question is whether it's possible to indicate to popular compilers (*) to apply or not apply -ffast-math for individual functions - so that I'll be able to have my instantiations in the same translation unit.
Notes:
If the answer is "no", bonus points for explaining why not.
This is not the same question as this one, which is about turning fast-math on and off at runtime. I'm much more modest...
(*) by popular compilers I mean any of gcc, clang, msvc, icc, or nvcc (for GPU kernel code) about which you have that information.
In GCC you can declare functions like the following:
__attribute__((optimize("-ffast-math")))
double myfunc(double val)
{
    return val / 2;
}
This is a GCC-only feature.
See a working example here: https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html
It seems that GCC does not verify the optimize() arguments, so typos like "-ffast-match" will be silently ignored.
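If you would rather scope the option over a region of the file than attach an attribute to each function, GCC also accepts the same flag through its optimize pragmas; a minimal sketch (function name made up):

#pragma GCC push_options
#pragma GCC optimize ("-ffast-math")

/* Everything between push_options and pop_options is compiled
   with -ffast-math applied. */
double myfunc_fast(double val)
{
    return val / 2;
}

#pragma GCC pop_options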
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:
Flush-to-zero mode is enabled (that is, denormal support is disabled)
Single-precision reciprocal, division, and square root are switched to approximate versions
Certain standard math functions are replaced by equivalent, lower-precision, intrinsics
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.

GCC: avoiding long time linking while using static arrays

My question practically repeats this one, which asks why this issue occurs. I would like to know if it is possible to avoid it.
The issue is: if I allocate a huge amount of memory statically:
unsigned char static_data[ 8 * BYTES_IN_GYGABYTE ];
then the linker (ld) takes a very long time to make an executable. There is a good explanation from @davidg about this behaviour in the question I linked above:
This leaves us with the following series of steps:
The assembler tells the linker that it needs to create a section of memory that is 1GB long.
The linker goes ahead and allocates this memory, in preparation for placing it in the final executable.
The linker realizes that this memory is in the .bss section and is marked NOBITS, meaning that the data is just 0, and doesn't need to be physically placed into the final executable. It avoids writing out the 1GB of data, instead just throwing the allocated memory away.
The linker writes out to the final ELF file just the compiled code, producing a small executable.
A smarter linker might be able to avoid steps 2 and 3 above, making your compile time much faster
OK, @davidg explained why the linker takes so long, but I want to know how I can avoid it. Maybe GCC has some option that tells the linker to be a little smarter and avoid steps 2 and 3 above?
Thank you.
P.S. I use GCC 4.5.2 on Ubuntu.
You can allocate the static memory in the release version only:
#ifndef _DEBUG
unsigned char static_data[ 8 * BYTES_IN_GYGABYTE ];
#else
unsigned char *static_data;
#endif
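If the debug build still needs the buffer, one hedged way to complete that #else branch is to allocate it at startup; init_static_data and the size constant below are made up for illustration:

#include <stdlib.h>

#define BYTES_IN_GYGABYTE ((size_t)1 << 30)   /* assumed definition */

unsigned char *static_data;

/* Hypothetical debug-only initializer: take the buffer from the heap
   at runtime, so no huge .bss section ever reaches the linker. */
void init_static_data(void)
{
    static_data = malloc(8 * BYTES_IN_GYGABYTE);
}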
I have two ideas in mind that could help:
As already mentioned in a comment: place it in a separate compilation unit. That by itself will not reduce linking time, but maybe together with incremental linking (ld option -r) it helps.
The other is similar: place it in a separate compilation unit, generate a shared library from it, and just link against the shared library later.
Sadly I cannot promise that either of these helps, as I have no way to test: my gcc (4.7.2) and binutils don't show this time-consuming behaviour; 8, 16 or 32 gigabyte test programs compile and link in under a second.

Mysterious stack issue with GCC 4.6.2

I'm working on the Pintos toy operating system at university, but there's a strange bug when using GCC 4.6.2. When I push my system call arguments (just three pushl instructions in inline assembly), some mysterious data also appears on the stack, and the arguments are in the wrong order. Setting -fno-omit-frame-pointer gets rid of the strange data, but the arguments are still in the wrong order. GCC 4.5 works fine. Any idea what specific option could fix this?
NOTE: the problem still occurs with -O0.
Without a code example and a listing of the result from your different compilations, it's difficult to help you. But here are three possible causes for your problems:
Make sure you understand how arguments are pushed onto the stack. Arguments are pushed from right to left. This makes it possible for printf(char *, ...) to examine the first argument to find out how many more there are. If you want to call the function int foo(int a, int b, int c), you'll need to push c, then b, and finally a.
Could the strange data on the stack be a return address or EFLAGS? I don't know Pintos and how system calls are made, but make sure that you understand the difference between CALL/RET and INT/IRET. INT pushes the flags onto the stack.
If your inline assembly has side effects, you might want to write volatile/__volatile__ in front of it. Otherwise GCC is allowed to move it when optimizing.
I need to see your code to better understand what's going on.
The culprit was -fomit-frame-pointer, which has been enabled by default since 4.6.2. -fno-omit-frame-pointer fixed the issue.
Did you clean up the parameters on the stack after the syscall? GCC may not be aware that you touched the stack, and may generate code that depends on the stack pointer it expected.
-fno-omit-frame-pointer forces GCC to use e/rbp for accessing local data, but that just hides the actual problem.
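To make those last points concrete, here is a minimal sketch of a three-argument syscall wrapper (the name syscall3 is made up), assuming the int $0x30 vector that Pintos uses: arguments are pushed right to left, the asm is volatile with a memory clobber so GCC cannot move or delete it, and the pushed words are popped back off before GCC regains control of the stack pointer:

/* i386 (32-bit) inline assembly sketch */
static inline int syscall3(int number, int a0, int a1, int a2)
{
    int ret;
    __asm__ __volatile__ (
        "pushl %[c]\n\t"          /* arguments pushed right to left */
        "pushl %[b]\n\t"
        "pushl %[a]\n\t"
        "pushl %[num]\n\t"
        "int $0x30\n\t"           /* trap into the kernel */
        "addl $16, %%esp"         /* pop the four words we pushed */
        : "=a" (ret)              /* result comes back in eax */
        : [num] "r" (number), [a] "r" (a0), [b] "r" (a1), [c] "r" (a2)
        : "memory");
    return ret;
}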

Optimizing used registers when using inline ARM assembly in GCC

I want to write some inline ARM assembly in my C code. For this code, I need to use a register or two more than just the ones declared as inputs and outputs to the function. I know how to use the clobber list to tell GCC that I will be using some extra registers to do my computation.
However, I am sure that GCC enjoys the freedom to shuffle around which registers are used for what when optimizing. That is, I get the feeling it is a bad idea to use a fixed register for my computations.
What is the best way to use some extra register that is neither input nor output of my inline assembly, without using a fixed register?
P.S. I was thinking that using a dummy output variable might do the trick, but I'm not sure what kind of weird other effects that will have...
Ok, I've found a source that backs up the idea of using dummy outputs instead of hard registers:
4.8 Temporary registers:
People also sometimes erroneously use clobbers for temporary registers. The right way is to make up a dummy output, and use "=r" or "=&r" depending on the permitted overlap with the inputs. GCC allocates a register for the dummy value. The difference is that GCC can pick a convenient register, so it has more flexibility.
from page 20 of this pdf.
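As a hedged illustration of that advice (function and operand names made up), a scratch register can be requested through a dummy output with "=&r"; the early-clobber is needed here because the temporary is written before all inputs have been read:

static inline unsigned int shift_and_add(unsigned int a, unsigned int b)
{
    unsigned int result, tmp;
    __asm__ ("lsl %[t], %[b], #2\n\t"   /* tmp = b << 2, in a register GCC picks */
             "add %[r], %[a], %[t]"     /* result = a + tmp */
             : [r] "=r" (result), [t] "=&r" (tmp)
             : [a] "r" (a), [b] "r" (b));
    return result;
}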
For anyone who is interested in more info on inline assembly with GCC, this website turned out to be very instructive.

How to get GCC to use more than two SIMD registers when using intrinsics?

I am writing some code and trying to speed it up using SSE2/3 SIMD intrinsics. My code is of such a nature that I need to load some data into an XMM register and act on it many times. When I look at the generated assembly, it seems that GCC keeps flushing the data back to memory in order to reload something else into XMM0 and XMM1. I am compiling for x86-64, so I have 16 XMM registers. Why is GCC using only two, and what can I do to ask it to use more? Is there any way that I can "pin" some value in a register? I added the register keyword to my variable definition, but the generated assembly code is identical.
Yes, you can. Explicit Reg Vars talks about the syntax you need to pin a variable to a specific register.
If you're getting to the point where you're specifying individual registers for each intrinsic, you might as well just write the assembly directly, especially given GCC's nasty habit of pessimizing intrinsics unnecessarily in many cases.
It sounds like you compiled with optimization disabled, so no variables are kept in registers between C statements, not even int.
Compile with gcc -O3 -march=native to let the compiler make non-terrible asm, optimized for your machine. The default is -O0 with a "generic" target ISA and tuning.
See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why "debug" builds in general are like that, and the fact that register int foo; or register __m128 bar; can stay in a register even in a debug build. But it's much better to actually have the compiler optimize, as well as using registers, if you want your code to run fast overall!
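As a small illustration (function and variable names made up): with gcc -O3, both accumulators in the loop below stay in XMM registers for the whole loop, while at -O0 every statement spills to memory and reloads, which matches the behaviour described in the question:

#include <stddef.h>
#include <emmintrin.h>

/* Sum n floats, four lanes at a time, using two independent
   accumulators so the adds can overlap. */
void sum_chunks(const float *in, float *out, size_t n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (size_t i = 0; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(in + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(in + i + 4));
    }
    acc0 = _mm_add_ps(acc0, acc1);   /* combine the two accumulators */
    _mm_storeu_ps(out, acc0);
}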
