gcc options for a Freescale i.MX6Q ARM processor

I am trying to figure out the gcc options for a toolchain I am setting up for a development board:
the Sabre Lite, which is based around Freescale's i.MX6Q quad-core processor.
I know that the i.MX6 is basically a Cortex-A9 core with VFPv3 and NEON co-processors, plus vector graphics, 2D and even 3D engines.
However, the release notes and user guide docs haven't been very clear on how to enable the relevant options in gcc.
In fact, the options that I can 'play' with are the following:
-march=armv7-a - OK, this one is pretty obvious.
-mfpu=vfpv3/neon - I can use either just the VFPv3 co-processor or both, depending on which value I pick.
-mfloat-abi=softfp/soft/hard - I guess I can choose hard here, as there is hardware for FP operations.
-mcpu=cortex-a9 - is this option even necessary? It is not clear whether it is just an alias for -march or something else.
Are there other options I should enable?
Why does the toolchain use the following as the default options to build the Linux kernel/U-Boot/packages:
-march=armv7-a -mfpu=vfpv3 -mfloat-abi=softfp
Thank you for your help

Use -mthumb -O3 -march=armv7-a -mcpu=cortex-a9 -mtune=cortex-a9 -mfpu=neon -mvectorize-with-neon-quad -mfloat-abi=softfp. Note that by default the compiler will not vectorize floating-point operations using NEON, because NEON does not support denormal numbers. If you are fine with some loss of precision you can make gcc use NEON for floating point by adding the -ffast-math switch.
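As a rough sketch of what those flags enable (the file name, function and exact instruction selection here are assumptions, not taken from the question):

    /* saxpy.c - hypothetical example; compile with something like:
         arm-linux-gnueabi-gcc -std=c99 -O3 -mcpu=cortex-a9 -mfpu=neon \
             -mfloat-abi=softfp -ffast-math -S saxpy.c
       then look in saxpy.s for NEON instructions such as vmla.f32 or
       vmul.f32/vadd.f32 operating on q (or d) registers. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];   /* without -ffast-math this FP loop stays scalar */
    }

Dropping -ffast-math from that command and diffing the two .s files is a quick way to see how much that switch changes the generated code.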

I can't answer everything, but 'softfp' (i.e. -mfloat-abi=softfp) means the FPU is used for the arithmetic while the soft-float calling convention is kept, so the code stays compatible with code built without FPU support.
Slightly outdated ARM FP document
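A tiny illustration (hypothetical function; the exact code generation depends on the compiler version):

    /* Under -mfloat-abi=softfp the argument arrives in a core register (r0)
       and is moved into a VFP register before the vadd.f32; under
       -mfloat-abi=hard it arrives directly in s0. Either way the FPU does
       the arithmetic, which is why softfp-built objects can still link
       against soft-float-ABI libraries. */
    float add_half(float x)
    {
        return x + 0.5f;
    }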

Related

gcc -Og flag is optimizing out variables set by inline calls [duplicate]

When I compile my C++ program with g++ using the -Og option I see variables that are <optimized out>, and also the current line sometimes skips around. Is this behaviour expected at this optimization level, or do I have some problem? The man page of gcc says:
-Og Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.
hence I did not expect this behaviour. On my system I have g++ version 4.9.2 and gdb version 7.7.1.
This is normal behaviour when compiling with the -Og option. At this optimisation level the compiler is allowed to optimize as long as it adheres to the as-if rule. This could include the removal of variables (or conversion to constants), as well as the dropping of unused functions.
The recommendation is either to get used to the skipping or to compile with the -O0 option.
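A minimal made-up reproduction of the effect (names are placeholders):

    /* og_demo.cpp - build with: g++ -Og -g og_demo.cpp
       Inside use_value(), 'print doubled' in gdb may report <optimized out>,
       because the inline call is folded and the local can be merged into
       the return expression. */
    static inline int twice(int x) { return 2 * x; }

    int use_value(int n)
    {
        int doubled = twice(n);   /* candidate for removal under -Og */
        return doubled + 1;
    }

    int main(void) { return use_value(20); }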

what is compiler feedback based optimization? is it available with arm gcc compiler?

What is compiler feedback (not linker feedback) based optimization? How do I get this feedback file for the ARM gcc compiler?
Read the chapter of the GCC documentation dedicated to optimizations (and also the section about ARM in GCC: ARM options)
You can use:
link-time optimization (LTO) by compiling and linking with -flto in addition to other optimization flags (so make CC='gcc -flto -O2'): the linking phase then also does optimizations (the linker works on files containing not only object code but also the compiler's intermediate GIMPLE representation)
profile-guided optimization (PGO, with -fprofile-generate, -fprofile-use, -fauto-profile etc.): you first generate code with profiling instructions, run some representative benchmarks to collect profiling information, and then compile a second time using this profiling information (see the sketch below)
You could mix both approaches and give a lot of other optimization flags. Be sure to be consistent with them.
On x86 & x86-64 (and ARM natively) you might also use -mtune=native and there are lots of other -mtune possibilities.
Some people call profile-based optimization compiler feedback optimization (because dynamic runtime profile information is given back into the compiler). I prefer the "profile-guided optimization" term. See also this old question.
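A sketch of the two-pass PGO build, which can be combined with -flto (the file name and compiler prefix are placeholders):

    /* hotloop.c - hypothetical example of profile-guided optimization.
       Pass 1: arm-linux-gnueabi-gcc -O2 -flto -fprofile-generate hotloop.c -o hotloop
       Run:    ./hotloop            (writes .gcda profile data in the build directory)
       Pass 2: arm-linux-gnueabi-gcc -O2 -flto -fprofile-use hotloop.c -o hotloop
       The second build uses the recorded branch and loop counts to guide
       inlining, branch layout and unrolling decisions. */
    int mostly_small(int x)
    {
        if (x > 1000)              /* the profile tells gcc this branch is rarely taken */
            return x * 3;
        return x + 1;
    }

    int main(void)
    {
        int i, s = 0;
        for (i = 0; i < 100000; i++)
            s += mostly_small(i % 10);
        return s & 0xff;
    }

Note that when cross-compiling, the profiled run has to happen on the target (or an emulator), and the resulting .gcda files must be copied back before the second pass.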

arm-none-eabi-gcc: -march option v/s -mcpu option

I have been following the J. Lynch tutorial from Atmel for developing small programs for the AT91SAM7S256 (microcontroller). I have done a bit of tinkering and used arm-none-eabi instead of arm-elf (the old one). I found that by default gcc compiles assuming -march=armv4t even if one does not say anything about the chip. How much difference would it make if I used -mcpu=arm7tdmi?
Even after searching a lot on Google I could not find a detailed tutorial explaining all the possible command-line options, including the separate linker, assembler and objcopy options such as -Map etc.
Can you provide any such material where all possibilities are explained?
Providing information about the specific processor gives the compiler additional information for selecting the most efficient mix of instructions, and the most efficient way of scheduling those instructions. It depends very much on the specific processor how much performance difference explicitly specifying -mcpu makes. There could be no difference whatsoever - the only way to know is to measure.
But in general - if you are building a specific image for a specific device, then you should provide the compiler with as much information as possible.
Note: your current instance of gcc compiles assuming -march=armv4t - this is certainly not a universal guarantee for all arm gcc toolchains.
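A way to see for yourself how much it changes (the source file is an arbitrary example):

    /* tune_test.c - compile twice and diff the generated assembly:
         arm-none-eabi-gcc -O2 -march=armv4t  -S tune_test.c -o generic.s
         arm-none-eabi-gcc -O2 -mcpu=arm7tdmi -S tune_test.c -o arm7tdmi.s
       The ARM7TDMI implements ARMv4T, so the available instructions are the
       same in both builds; -mcpu additionally gives the scheduler the
       specific pipeline model, which may reorder instructions. Whether that
       translates into measurable speed is something only a benchmark on the
       device can tell you. */
    unsigned int checksum(const unsigned char *p, unsigned int len)
    {
        unsigned int sum = 0;
        while (len--)
            sum += *p++;
        return sum;
    }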

Testing FPU on arm processor

I am using a Wandboard-Quad that contains an i.MX6 ARM processor. This processor has an FPU that I would like to utilize. Before I do, I want to test how much improvement I will get. I have a benchmark algorithm and have tried it with no optimization and with -mfpu=vfp, and there appears to be no improvement -- I do get improvement with optimization level 3 (-O3).
I am using arm-linux-gnueabi libraries -- Any thoughts on what is incorrect and how I can tell if I am using the FPU?
Thanks,
Adam
Look at the assembler output with the -S flag and see if any FPU instructions are being generated. That's probably the easiest thing.
Beyond that, there is a chance that your algorithm uses floating point so rarely that any gain is masked by moving values in and out of the FPU registers. In that case, the -O3 optimizations in the other parts of the code would show you gains separate from the FPU usage.
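A quick way to do that check (the file and function names are made up):

    /* fputest.c - compile with something like:
         arm-linux-gnueabi-gcc -O2 -mfpu=vfpv3 -mfloat-abi=softfp -S fputest.c
       Hardware FP shows up in fputest.s as VFP instructions such as
       vmul.f32 / vadd.f32 (or fmuls / fadds in older pre-UAL output),
       while a soft-float build calls library helpers like __aeabi_fmul
       instead. */
    float scale(float x, float k)
    {
        return x * k + 1.0f;
    }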
The -mfpu option only matters when GCC actually generates FPU or vector instructions, and for vector code that means vectorization must kick in. Vectorization itself requires a reasonable optimization level (the minimum is -O2 with the -ftree-vectorize option on). So try -O3 -ftree-vectorize -mfpu=vfp to utilize the FPU and measure the difference against plain -O3.
Also see ARM GCC docs for cases where -funsafe-math-optimizations may be required.
Without any optimisation the output from GCC is so inefficient that you might actually not be able to measure the difference between software and hardware floating point.
To see the benefits that the FPU adds, you need to test with a consistent optimisation level, and then use either -msoft-float or -mhard-float.
This will force the compiler to link against different libraries and make function calls for the floating-point operations rather than using native instructions. It is still possible that the underlying library uses hardware floating point, but I wouldn't worry about that too much.
You can select different sets of FP instructions using -mfpu=. For i.MX6 I think you want -mfpu=neon, as that should enable all applicable floating-point instructions (not just the NEON ones).
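A rough way to benchmark the two configurations (the flag sets are illustrative; -msoft-float and -mhard-float mentioned above are older spellings of -mfloat-abi=soft and hard, and whether both variants build depends on how your libc was configured):

    /* fpbench.c - build the same source two ways and time each binary:
         arm-linux-gnueabi-gcc -O2 -mfloat-abi=soft                fpbench.c -o fp_soft
         arm-linux-gnueabi-gcc -O2 -mfloat-abi=softfp -mfpu=neon   fpbench.c -o fp_hw
       soft and softfp share the same calling convention, which is why both
       binaries can run against the same root filesystem for a fair comparison. */
    #include <stdio.h>

    int main(void)
    {
        volatile float acc = 0.0f;     /* volatile so the loop is not optimized away */
        long i;
        for (i = 0; i < 50000000; i++)
            acc += 0.000001f * (float)(i & 0xff);
        printf("%f\n", acc);
        return 0;
    }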

How can I determine whether my program is using SSE2 (via gcc optimization)?

I have a C++ program which is compiled under gcc (gcc version 4.5.1) with the -O3 flag. I'm thinking about whether or not it would be worthwhile making an SSE2 version of this program (or at least of the busiest parts of it). However, I'm worried that the compiler has already done this through automatic vectorization.
Question: How do I determine (a) whether or not my program is using SSE/SSE2 and (b) how much time is spent using SSE/SSE2 (i.e. profiling)?
The easiest way to tell if you are gaining any benefit from compiler vectorization is to run the code with and without the -ftree-vectorize flag and compare the results.
-O3 will automatically enable that option. So you might want to try it under -O2 instead.
To see which loops were vectorized, which were not, and why, you can add the -ftree-vectorizer-verbose option.
The last option, of course, is to look at the assembly. It's very easy to identify vectorized code in assembly.
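For part (a), a hedged way to check from the build itself (the file name is arbitrary; -ftree-vectorizer-verbose applies to the GCC 4.x series mentioned in the question):

    /* sse_check.cpp - compile with the project's real flags plus -S, e.g.:
         g++ -O3 -ftree-vectorizer-verbose=2 -S sse_check.cpp -o sse_check.s
       Packed SSE/SSE2 use shows up as instructions like addps / mulpd on xmm
       registers; scalar-only code uses addss / mulsd (still xmm registers on
       x86-64, but one element at a time). */
    void scale(float *dst, const float *src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

For part (b), run the binary under a sampling profiler (for example perf on Linux) and check whether the hot loops it reports are the ones containing the packed instructions.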
