I am working with the GCC compiler. I previously worked with the TASKING compiler, where most of the flags are different.
Can someone help me map the TASKING flags listed below to their GCC equivalents?
-t (section information is listed on list file)
--emit-locals = -equ (emit equ symbol to object file)
-wa-ogs (short and long option names to assembler)
--tasking-sfr (compiler includes regcpu.sfr and assembler includes regcpu.def)
--language = -gcc,-volatile,+strings (enable a number of GCC extensions, don't optimize across volatile accesses, disable the const check for string literals)
--switch = auto (which code must be generated for switch statement)
--align = 0 (compiler aligns variables and functions to the minimum alignment)
--default-near-size = 8 (compiler allocates objects smaller than or equal to threshold memory in __near memory)
--default-a0-size = 0 (compiler will never use __a0 memory)
--default-a1-size = 0 (compiler will never use __a1 memory)
--tradeoff =4 (whether used optimization should optimize for more speed/smaller code size)
--source (merge c source code with assembly code in output file)
The GCC manual lists all optimization flags being applied for the different levels of optimizations (-O1, -O2, etc.). However when compiling and measuring a benchmark program (e.g. cBench's automotive_bitcount) there is a significant difference when applying an optimization level instead of turning on all the listed optimizations manually. For -O1 with the automotive_bitcount program, I measured a speedup of roughly 100% when compiling with -O1 instead of manually applying all the listed flags. Those "hidden" optimizations seem in fact to be the main part of the optimization work GCC does for -O1. When applying the flags manually, I only get a speedup of about 10% compared to no optimizations.
The same can be observed when applying all enabled flags from gcc -c -Q -O3 --help=optimizers.
In the GCC manual I found this section which would explain this behavior:
Not all optimizations are controlled directly by a flag. Only optimizations that have a flag are listed in this section.
Most optimizations are completely disabled at -O0 or if an -O level is not set on the command line, even if individual optimization flags are specified.
Since I couldn't find any further documentation on those optimizations, I wonder if there is a way of controlling them and what the optimizations are in detail?
Some optimizations are directly gated by -O flags e.g. complete unroller:
class pass_complete_unrolli : public gimple_opt_pass
{
public:
  pass_complete_unrolli (gcc::context *ctxt)
    : gimple_opt_pass (pass_data_complete_unrolli, ctxt)
  {}
  /* opt_pass methods: */
  virtual bool gate (function *) { return optimize >= 2; }
  virtual unsigned int execute (function *);
}; // class pass_complete_unrolli
and for others -O influences their internal algorithms e.g. in optimization of expressions:
/* If FROM is a SUBREG, put it into a register. Do this
so that we always generate the same set of insns for
better cse'ing; if an intermediate assignment occurred,
we won't be doing the operation directly on the SUBREG. */
if (optimize > 0 && GET_CODE (from) == SUBREG)
from = force_reg (from_mode, from);
There is no way to work around this; you have to use -O.
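Since individual -f flags are largely ignored without an -O level, it can help to have a benchmark refuse to report numbers from an unoptimized build. A minimal sketch, assuming GCC's predefined __OPTIMIZE__ macro (defined in optimizing compilations); the function name built_with_optimization is hypothetical:

```c
/* Hypothetical helper: reports whether this translation unit was
   compiled with an optimization level enabled.
   Assumption: the compiler predefines __OPTIMIZE__ when optimizing,
   as GCC documents for its predefined macros. */
int built_with_optimization(void)
{
#ifdef __OPTIMIZE__
    return 1;   /* built with -O1 or higher (or -Os etc.) */
#else
    return 0;   /* built without an -O level, e.g. plain -O0 */
#endif
}
```

A benchmark driver could check this at startup and print a warning when it returns 0, so that a run with only individual flags and no -O level is not mistaken for an optimized build.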
Related to "ld: 32-bit RIP relative reference out of range" on Mac OSX, but not solved and in a more complex context. The relevant computer(s) have >32GB of RAM.
static const int K=1024;
static const int M=K*K;
static const int G=K*M;
const int MC = G;
void donada(float *rvec, const int MC) { rvec[MC-1]= 1.0; return; }
float notused[1][MC]; // 4GB, ramp up with first index
float used[MC]; // 4GB
int main() {
donada( used, MC );
donada( notused[1], MC );
}
and gcc -Wall -o test test.cc. Compiling this program on OS X yields
ld: 32-bit RIP relative reference out of range (4294967395 max is
+/-2GB): from _main (0x100000F92) to _used (0x200001000) in '_main' from /var/folders/yl/8gp3pgbn1l562ywg_q86rk6800\ 00z9/T/test-b3bebf.o
for architecture x86_64
On Linux, there is a similar error:
test.cc:(.text+0x18): relocation truncated to fit: R_X86_64_32 against symbol `used' defined in .bss section in /tmp/ccqcNh2C.o
I first thought compiler flag -Os would fix this, but it does not. It would be appropriate for gcc or clang to provide a more suggestive error message.
the relevant computer(s) have >32GB of RAM.
That's actually not very relevant. The issue is that 64-bit GCC defaults to -mcmodel=small, and you are trying to access data that is 4GiB away from its base symbol, which is not compatible with the small model.
From documentation:
-mcmodel=small
Generate code for the small code model: the program and its symbols
must be linked in the lower 2 GB of the address space. Pointers are 64 bits.
Programs can be statically or dynamically linked. This is the default code model.
-mcmodel=medium
Generate code for the medium model: The program is linked in the lower 2 GB
of the address space. Small symbols are also placed there.
Symbols with sizes larger than -mlarge-data-threshold are put into large data
or bss sections and can be located above 2GB.
Programs can be statically or dynamically linked.
-mcmodel=large
Generate code for the large model: This model makes no assumptions about addresses
and sizes of sections.
To correctly link your program, you need to use -mcmodel=large.
However note that this is not well tested (almost nobody does that), and that all code you (statically) link into your program will need to be built that way.
It is probably much better to dynamically allocate your arrays instead.
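A minimal sketch of the dynamic-allocation route, assuming a heap allocation is acceptable for these arrays. The helper name alloc_floats is hypothetical, and donada is adapted from the question to take a size_t length:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `count` zero-initialized floats on the
   heap. Returns NULL for a zero count or if the allocation fails. */
float *alloc_floats(size_t count)
{
    if (count == 0)
        return NULL;
    return calloc(count, sizeof(float));
}

/* Same idea as donada() in the question, but with a size_t length so
   that element counts above INT_MAX remain representable. */
void donada(float *rvec, size_t mc)
{
    rvec[mc - 1] = 1.0f;
}
```

With this approach the arrays are reached through pointers rather than through fixed offsets from the code, so the default small code model is sufficient; the caller has to check for NULL and free the buffer when done.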
I first thought compiler flag -Os would fix this
It can't: -Os minimizes code size. Your problem is that you are forcing the compiler to allocate very large contiguous data arrays. There is nothing the compiler could optimize for size there.
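The arithmetic behind the error can be sketched as follows. This is only an illustration: the names REACH_BYTES, float_array_bytes and within_reach are hypothetical, and a 4-byte float is assumed.

```c
#include <stdint.h>

/* Reach of a signed 32-bit displacement: +/-2 GiB. */
#define REACH_BYTES (UINT64_C(1) << 31)

/* Size in bytes of an array of `count` floats. */
uint64_t float_array_bytes(uint64_t count)
{
    return count * (uint64_t)sizeof(float);
}

/* Nonzero if a distance of `distance` bytes still fits in the reach
   of a signed 32-bit displacement. */
int within_reach(uint64_t distance)
{
    return distance < REACH_BYTES;
}
```

Each array in the question holds 2^30 floats, i.e. 4 GiB, so anything placed after the first array is already farther away than a 32-bit relative reference can express, whatever the optimization level.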
I have a few questions about Stack Guard and SSP protections. The first question is about Stack Guard and its three types of canaries, if I am correct: terminator, random and random XOR.
I'd like to know how to disable Stack Guard on an x86 Linux system. I read somewhere that it's possible by compiling with gcc '-disable-stackguard-randomization', and likewise that '-enable-stackguard-randomization' enables it, but neither works. In case it matters, my gcc version is 4.8.2.
The next question is also about Stack Guard: once I am able to enable/disable it, how can I set which type of canary I want to use? From what I've read, terminator canaries are used by default and for random canaries I have to compile with '-enable-stackguard-randomization', but what about random XOR? (Or with null 0x00000000?)
Now about SSP (ProPolice): I know that for a random canary I have to compile with '-fstack-protector-all', but what about the terminator canary? Is it the default, as in Stack Guard?
Last one: can anyone tell me where I can find the random canary in memory? For example, I have this scenario: a C program compiled with 'gcc -g example.c -o example -fstack-protector-all', so with random canaries. Let's say I'm able to get the canary after every execution, so suppose I have: Canary = 0x1ae3f900. From various papers I gathered that the canary is located in the .bss segment, so I get the address of the .bss segment using readelf: 'readelf -a ./example | grep bss'. It's 080456c9. In gdb I set some breakpoints to get the address of the canary, but when I check the .bss address with x/20x 0x080456c9, all I see are 0x00000000 values; the canary is nowhere. I also checked at __stack_chk_fail, whose address I got from the PLT/GOT, but with the same result: I can't see it there.
Thanks in advance for your answers and time.
Stack Smashing Protection (SSP) is an improvement over StackGuard. SSP was first implemented in gcc 4.1.
I'd like to know how to disable Stack Guard on an x86 Linux system.
Use -fno-stack-protector to disable the userland SSP.
The --disable-stackguard-randomization and --enable-stackguard-randomization options are build options for the glibc source code, not gcc flags.
once I am able to enable/disable it, how can I set which type of
canary I want to use?
This is not configurable in gcc as far as I know. Since glibc 2.10, the stack canary is generated in a function called _dl_setup_stack_chk_guard. Here is some part of its code:
if (dl_random == NULL)
  {
    ret.bytes[sizeof (ret) - 1] = 255;
    ret.bytes[sizeof (ret) - 2] = '\n';
  }
else
  {
    memcpy (ret.bytes, dl_random, sizeof (ret));
    ret.num &= ~(uintptr_t) 0xff;
  }
dl_random holds the address of the auxiliary vector entry for AT_RANDOM, which is a 16-byte random value initialized by the kernel while creating the process. If you are running on a kernel or an emulator that doesn't initialize AT_RANDOM, the check dl_random == NULL is true and the canary used is the terminator value, with the first and second most significant bytes (on a little-endian machine) initialized to 255 and \n, respectively. All other bytes are zero. Usually AT_RANDOM is initialized by the kernel, in which case the first sizeof (ret) bytes of AT_RANDOM are copied into the canary and its least significant byte is then cleared to zero, leaving 7 random bytes on a 64-bit system.
So if you want to use a particular method to generate the canary, you can change this code and build your own glibc.
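To experiment with the two cases without rebuilding glibc, the logic quoted above can be mimicked in a stand-alone function. This is only a sketch modelled on the excerpt, not glibc's actual source; the name make_canary and its parameter are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-alone model of the excerpt above.
   `random_bytes` plays the role of dl_random: either NULL or a pointer
   to at least sizeof(uintptr_t) random bytes. */
uintptr_t make_canary(const unsigned char *random_bytes)
{
    union {
        uintptr_t num;
        unsigned char bytes[sizeof(uintptr_t)];
    } ret;

    ret.num = 0;
    if (random_bytes == NULL) {
        /* Terminator canary: 255 and '\n' in the last two bytes. */
        ret.bytes[sizeof(ret) - 1] = 255;
        ret.bytes[sizeof(ret) - 2] = '\n';
    } else {
        /* Random canary: copy random bytes, then clear the least
           significant byte of the value. */
        memcpy(ret.bytes, random_bytes, sizeof(ret));
        ret.num &= ~(uintptr_t)0xff;
    }
    return ret.num;
}
```

Calling make_canary(NULL) gives the terminator value, while passing a buffer of random bytes gives a value whose lowest byte is always zero.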
As an alternative method, @PeterCordes has suggested in the comments writing your own canary value to the memory location %fs:0x28 (see the code below) at the top of the main function and restoring the runtime-generated canary just before returning from main.
Now about SSP (ProPolice): I know that for a random canary I have to compile
with '-fstack-protector-all', but what about the terminator canary? Is it
the default, as in Stack Guard?
All variants of the -fstack-protector option use SSP. These don't affect how the canary is generated.
Last one: can anyone tell me where I can find the random
canary in memory?
The canary is generated dynamically early at process startup; you can't use readelf to get the canary. According to this article, you can use the following code to get the canary when compiling for i386:
int read_canary()
{
int val = 0;
__asm__("movl %%gs:0x14, %0;"
: "=r"(val)
:
:);
return val;
}
and for x86_64:
long read_canary()
{
long val = 0;
__asm__("movq %%fs:0x28, %0;"
: "=r"(val)
:
:);
return val;
}
When I run my program, which just calculates a sine wave:
for(i = 0; i < ADS1299_SIGNAL_WINDOW; i++){
TEST[i] = (float32_t)(10.0f * (float32_t)(arm_sin_f32((float32_t)(3.14f * i/ADS1299_SIGNAL_WINDOW))));
}
The compiler generates the following line, which results in a hard fault:
800702a: ed2d 8b04 vpush {d8-d9}
What is happening? For reference, here are my flags for the compiler:
SETTINGS="-g -nostartfiles -mthumb -mthumb-interwork -march=armv7e-m -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -fsingle-precision-constant -fdata-sections -ffunction-sections -O3 -Wl,-T,../STM32F407VG_FLASH.ld"
DECLARE="-DARM_MATH_CM4 -D__FPU_PRESENT=1 -D__FPU_USED"
.... -larm_cortexM4lf_math
The problem is that you're doing both the CPACR enable and some floating-point operations in the same scope. Because code in main uses floating-point registers, the compiler (being well-behaved and respecting the ABI) will emit code to preserve those registers on entry to main. Before any other code in main executes. Including the write to CPACR which makes them accessible. Oops.
To avoid that, either enable FP in the CPACR before entry to main in a reset handler (if your toolchain allows), or simply do all FP operations in another function, and ensure main itself doesn't touch any FP registers.
It would also be wise (if you haven't already) to ensure you have a DSB; ISB synchronisation sequence after the CPACR write. Otherwise, you could potentially still get a fault from any stale FP instructions already in the pipeline.
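The second option (keeping all FP work out of main) can be sketched roughly as below. This is a host-compilable illustration only: cpacr_enable_fpu and fill_ramp are hypothetical names, the mask is the usual CP10/CP11 full-access value for CPACR, and on the real target the computed value would be written to SCB->CPACR and followed by DSB; ISB.

```c
#include <stdint.h>

/* CP10 and CP11 full-access bits of CPACR (bits 20-23). */
#define CPACR_FPU_FULL_ACCESS ((3UL << 20) | (3UL << 22))

/* Returns the CPACR value with the FPU access bits set. On the target
   this result would be stored to SCB->CPACR before any FP code runs. */
uint32_t cpacr_enable_fpu(uint32_t cpacr)
{
    return cpacr | (uint32_t)CPACR_FPU_FULL_ACCESS;
}

/* All floating-point work lives in its own function. noinline keeps
   the compiler from merging it into the caller, which would bring the
   FP register saves back into the caller's prologue. */
__attribute__((noinline))
void fill_ramp(float *out, unsigned n, float scale)
{
    for (unsigned i = 0; i < n; i++)
        out[i] = scale * (float)i / (float)n;
}
```

In this arrangement main would first perform the CPACR write and the barriers, and only then call fill_ramp, so that no FP register is touched before the FPU is accessible.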
I think the problem is that the FPU is not enabled. I had the same problem with the Nordic Semiconductor SDK examples on Keil 4. In the Keil IDE the check box for enabling the FPU is ticked, but in the SystemInit code the enable is conditionally compiled, like this:
void SystemInit(void) {
#if (__FPU_USED == 1)
SCB->CPACR |= (3UL << 20) | (3UL << 22);
__DSB();
__ISB();
#endif
}
But I think the Keil 4 IDE does not set __FPU_USED to 1, so on the VPUSH instruction I got a HardFault because the FPU was not enabled.
I think you need to enable the FPU in SystemInit and the problem then will be solved.
If you use the FPU, the stack should be aligned to 8-byte boundaries. If you are using an RTOS, check the thread stack initialization code. If you are running on pure bare metal, check the startup code for the stack setup.
I'm using a MinGW64 build based on GCC 4.6.1 for a Windows 64-bit target. I'm playing around with Intel's new AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.
But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for Windows 64bit has only 16 byte alignment.
How can I change the GCC's stack alignment to 32 bytes?
I have tried using -mstackrealign but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.
EDIT:
I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.
I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.
I investigated a couple of ways to deal with this issue. The simplest and bluntest solution is to replace the aligned memory accesses VMOVAPS/PD/DQA with their unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script that does the job with an input assembler file produced by GCC:
import re
import fileinput
import sys

# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps": "ups", "apd": "upd", "dqa": "dqu"}

# inplace=True redirects stdout into each input file, rewriting it in place
for line in fileinput.input(files=sys.argv[1:], inplace=True):
    m = vmova.match(line)
    if m and m.group(1) in aligndict:
        s = m.group(1)
        sys.stdout.write(line.replace("vmov" + s, "vmov" + aligndict[s]))
    else:
        sys.stdout.write(line)
This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions. When the stack is unaligned, the memory access crosses the cache line boundary. Fortunately, the code performs as fast as aligned accesses most of the time. My recommendation: inline functions in critical loops!
I also attempted to fix the stack allocation in every function prolog using another Python script, trying to always align it at a 32-byte boundary. This seems to work for some code, but not for all of it. I have to rely on the goodwill of GCC to allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is serious register spilling due to the necessity to save all ymm registers before a function call. (All ymm registers are caller-saved.) I can post the script if there's interest.
The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.
You can get the effect you want by
1. Declaring your variables not as variables, but as fields in a struct
2. Declaring an array that is larger than the structure by an appropriate amount of padding
3. Doing pointer/address arithmetic to find a 32-byte aligned address inside the array
4. Casting that address to a pointer to your struct
5. Finally using the data members of your struct
You can use the same technique when malloc() does not align stuff on the heap appropriately.
E.g.
void foo() {
    struct I_wish_these_were_32B_aligned {
        vec32B foo;
        char bar[32];
    }; // note: no variable definition, just the struct declaration.
    // raw storage: the struct plus 32 bytes of padding to align within
    unsigned char a[sizeof(struct I_wish_these_were_32B_aligned) + 32];
    unsigned char* a_aligned_to_32B = align_to_32B(a);
    struct I_wish_these_were_32B_aligned* s = (struct I_wish_these_were_32B_aligned*)a_aligned_to_32B;
    s->foo = ...
}
where
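unsigned char* align_to_32B(unsigned char* a) {
    uint64_t u = (uint64_t)a;
    const uint64_t mask_aligned32B = (1 << 5) - 1;
    // already aligned: return the address unchanged
    if ((u & mask_aligned32B) == 0) return (unsigned char*)u;
    // otherwise round up to the next multiple of 32
    return (unsigned char*)((u | mask_aligned32B) + 1);
}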
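The same over-allocate-and-round-up idea applied to malloc() might look like the sketch below. It is an illustration under assumptions: the name malloc_aligned32 is hypothetical, and the caller is expected to keep the raw pointer, because that (not the aligned one) is what must be passed to free().

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a 32-byte aligned pointer to at least
   `size` usable bytes, or NULL on failure. The pointer that must later
   be passed to free() is stored in *raw_out. */
void *malloc_aligned32(size_t size, void **raw_out)
{
    /* 31 extra bytes are enough to reach the next 32-byte boundary. */
    unsigned char *raw = malloc(size + 31);
    *raw_out = raw;
    if (raw == NULL)
        return NULL;
    /* Round the address up to a multiple of 32. */
    uintptr_t u = (uintptr_t)raw;
    return (void *)((u + 31) & ~(uintptr_t)31);
}
```

A caller would use the returned pointer for its vector data and call free() on the value stored through raw_out once the buffer is no longer needed.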
I just ran into the same issue of having segmentation faults when using AVX inside my functions, and it was also due to stack misalignment. Given that this is a compiler issue (and the options that could help are not available on Windows), I worked around the stack usage by:
Using static variables (see this issue). Since they are not stored on the stack, you can force their alignment by using __attribute__((aligned(32))) in your declaration. For example: static __m256i r __attribute__((aligned(32))).
Inlining the functions/methods receiving/returning AVX data. You can force GCC to inline your function/method by adding inline and __attribute__((always_inline)) to your function prototype/declaration. Inlining your functions increases the size of your program, but it also keeps the function from using the stack (and hence avoids the stack-alignment issue). Example: inline __m256i myAvxFunction(void) __attribute__((always_inline));.
Be aware that the use of static variables is not thread-safe, as mentioned in the reference. If you are writing a multi-threaded application you may have to add some protection for your critical paths.
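A small sketch of the static-variable workaround, using a plain float array instead of __m256i so that it compiles without -mavx. The names scratch, get_scratch and is_aligned32 are hypothetical, and the aligned attribute is the GCC extension mentioned above:

```c
#include <stdint.h>

/* Static storage lives outside the stack, so the requested alignment
   does not depend on how the stack pointer is aligned. */
static float scratch[8] __attribute__((aligned(32)));

/* Returns the shared, 32-byte aligned scratch buffer.
   Not thread-safe: every caller gets the same storage. */
float *get_scratch(void)
{
    return scratch;
}

/* Nonzero if `p` is aligned to a 32-byte boundary. */
int is_aligned32(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}
```

Checking is_aligned32(get_scratch()) in a debug build is a cheap way to confirm that the buffer is safe to use with aligned AVX loads and stores.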