Why does GCC on OS X 10.5 have the -fPIC option turned on by default? After all, doesn't it generate larger and slower code?
Unless your program has a lot of very small functions, all of which use global or static variables or Objective-C, any performance decrease or size difference will be unnoticeable. PIC isn't used for automatic local variables, since they are already accessed relative to the stack. In functions which need it, the setup only requires four instructions, which isn't much compared to the rest of the function. Each access using PIC is only one byte longer than an access without it, so again there isn't much difference.
If you are building for 64-bit, PIC will probably be smaller, and there will likely be no performance difference. The x86-64 architecture added RIP-relative (instruction-pointer-relative) addressing, which means there is no setup required for PIC. This new addressing mode is actually one byte shorter than encoding an absolute address in the instruction, since the SIB byte isn't needed.
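If you want to see the effect yourself, a small snippet like the one below (file and variable names are just placeholders) can be compiled with gcc -S, with and without -fPIC, and the two assembly listings compared:

/* pic_demo.c: compile twice and diff the output
 *   gcc -S -O2       pic_demo.c -o nopic.s
 *   gcc -S -O2 -fPIC pic_demo.c -o pic.s
 * On i386 the PIC build typically adds a short prologue that loads the GOT
 * address; on x86-64 both builds use RIP-relative addressing (the PIC build
 * may go through the GOT for the non-static global).
 */
int counter;            /* global: likely accessed via the GOT under -fPIC */
static int local_state; /* static: still needs PC-relative addressing      */

int bump(void)
{
    local_state += 2;
    return ++counter;   /* compare how these accesses are encoded          */
}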
Finally, using PIC makes your code more secure. If your code has to be loaded at the same place every time, then someone could find the location of important functions and data and cause problems at runtime. However, if the OS can choose to load your code at a different address, anyone trying to cause problems has to find out where the functions and data structures are every time the program is run.
Related
I want to reserve/allocate a range of memory in RAM, and the same application should not overwrite or use that range for heap/stack storage. How do I allocate a range of memory in RAM that is protected from stack/heap overwrites?
I thought about adding (or allocating) an array in the application itself to reserve the memory, but it gets optimized out by the compiler because it is not referenced anywhere in the application.
I am using ARM GNU toolchain for compiling.
There are several solutions to this problem. Listed from best to worst:
Use the linker
Annotate the variable
Global scope
Volatile (maybe)
Linker script
You can obviously use a linker script to do this; it is the proper tool for the job. Pass the linker the --verbose parameter to see what the default script is. You may then modify it to reserve the memory at a precise location.
Variable Attributes
With more recent versions of gcc, the used variable attribute will also do what you want; most modern gcc versions support it. It is also significantly easier than a linker script, but only the linker script gives precise, reliable control over the position of the hole.
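A minimal sketch of the attribute approach (the section name, size and alignment here are only placeholders; if you need a guaranteed address you still have to add a matching entry, and possibly a KEEP(), in the linker script):

#include <stdint.h>

#define RESERVED_SIZE (4 * 1024)   /* placeholder size */

/* 'used' keeps the array even though nothing references it;
 * 'section' places it in its own output section so the linker
 * (script) can position it.                                   */
static uint8_t reserved_block[RESERVED_SIZE]
    __attribute__((used, section(".reserved"), aligned(8)));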
Global scope
You may also give your array global scope, and the compiler should not eliminate it. This may not hold if you use link-time optimization.
Volatile
Theoretically, a compiler may eliminate a static volatile array. The volatile qualifier really comes into play when you have code accessing the array: it modifies the access behaviour so the compiler never caches accesses to that range (see Dr. Dobb's on volatile). At the least, the behaviour is unclear to me and I would not recommend this method; it may work with some versions (and optimization levels) of the compiler and not others.
Limitations
Also, the linker option --gc-sections can eliminate space reserved with either the global-scope or the volatile method, as the symbol may not be annotated in any way in the object format; see KEEP in the linker script documentation.
Only the linker script can definitively prevent overwrites by the stack. You need to position the top of the stack before your reserved area. Typically the heap grows up and the stack grows down, so the two grow toward each other and can collide. The details are particular to your environment/C library (for instance, newlib is the typical ARM bare-metal library); looking at the linker script will give the best clue to this.
My guess is you want a fallow area to reserve for some sort of debugging information in the event of a system crash? A more explicit explanation of your problem would be helpful. You don't seem to be concerned with the position of the memory, so I guess this is not hardware related.
I'm working on the development of a boot loader that will be embedded in a PROM chip for a project.
I was tasked with making an estimate of the final memory size that the software will probably take, but I've never done this before.
I searched a bit around and I'm thinking about doing the following:
Counting all the variables; this size goes directly into the total
Estimating the number of lines of code each function will take (the code hasn't been written yet)
Finding an approximate number of assembly instructions per C line
Total size = total number of lines of code * average assembly instructions per C line * 32 bits
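(As a purely illustrative worked example of that formula: roughly 2,000 lines of C at about 3 instructions per line and 4 bytes per 32-bit instruction would come to around 24 KB.)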
My solution could very well be bogus; I hope someone will be able to help.
In principle, you are on the right track:
You need to distinguish between several types of memory footprint:
Stack
Dynamic memory (malloc, new, etc.)
Initialised variables
Un-initialised variables
Code
Stack is mostly impacted by recursion, local variables and function parameters.
Dynamic memory (heap) is obvious and also probably not relevant to you - so I'll ignore it for now.
Initialised variables are interesting since you need to count them twice - once for the program footprint on the PROM (similar to code and constants) and once for the RAM footprint.
Un-initialised variables obviously go into RAM, and counting their size is almost good enough (you also need to consider alignment and padding).
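A small sketch of how that is usually counted; the section names follow common GNU toolchain conventions and your linker script may differ:

#include <stdint.h>

const uint32_t crc_seed[4] = {1, 2, 3, 4};  /* .rodata: PROM only              */
uint32_t boot_flags = 0xA5A5A5A5u;          /* .data: PROM (initialiser image) */
                                            /*        plus RAM (runtime copy)  */
uint32_t rx_buffer[256];                    /* .bss: RAM only, zero-filled     */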
The hardest to estimate is the code, i.e. what goes into PROM. You need to count constants and local variables as well as the code itself. The code is more or less what you suspect (after adding padding, alignment, function-call overhead, interrupt-vector initialisation, etc.), but many things can make it larger than expected, such as inline functions, library functions (many seemingly trivial operations involve such functions), casting, etc.
One way of answering the question would be from experience or an assessment of existing code with similar functionality. However, there are a number of factors that affect code size:
Target architecture and instruction set.
Compiler and compiler options used.
Library code usage.
Capability of development staff.
Required functionality.
The "development of Boot" tells us nothing about the requirements or functionality of your boot process. This will have the greatest affect on code size. As an example of how target can make a difference, 8-bit targets typically have greater code density, but generate more code for arithmetic on larger data types, while on say an ARM target where you can select between Thumb and ARM instruction sets, the code density will change significantly.
If you have no prior experience or representative code base to work from, then I suggest you perform a few experiments to get some metrics you can work with:
Build an empty application - just an empty main() function in C or C++; that will give you the basic fixed overhead of the runtime start-up.
If you are using library code, that will probably take a significant amount of space. Add dummy calls to all the library interfaces you will use in the final application; that will tell you how much code will be taken up by library code, assuming it is not in-lined (see the sketch after this list).
Thereafter it will depend on functionality; you might implement a subset of the required functionality, and then estimate what proportion of the final build that might constitute.
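A hedged sketch of the first two experiments; the library functions named here are only examples, so substitute whatever your application will actually call:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A volatile flag the optimiser cannot evaluate at compile time, so the
 * calls below are kept and the corresponding library code gets linked in. */
static volatile int pull_in_libs = 0;

int main(void)
{
    if (pull_in_libs) {              /* never true at run time     */
        char buf[16];
        printf("%d\n", 42);          /* stdio formatting machinery */
        memcpy(buf, "x", 2);         /* string/memory routines     */
        (void)strtol("1", NULL, 10); /* conversion routines        */
    }
    return 0;                        /* without the if-block this is the
                                        "empty main" baseline build      */
}

Comparing the size of the two builds (with and without the if-block) gives a rough figure for the runtime start-up and the library code respectively.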
Regarding your suggestions, remember that variables do not occupy space in ROM, though any constant initialisers will. Typically a boot-loader can use all available RAM, because the application start-up will re-establish a new runtime environment for itself, discarding the boot-loader's environment and variables.
If you were to provide details of functionality and target, you might be able to leverage the experience of the community in estimating the required resources. For example, I might be able to tell you (from experience) that a boot-loader with support for Flash programming that loads via a UART using the XMODEM protocol on an ARM7 using the ARM instruction set will fit in 4 KB, or that adding support for loading via SD card may add a further 6 KB, and say USB Virtual COM Port a further 4 KB. However, your requirements are possibly unique and you will have to determine the resource load for yourself somehow.
Is it a correct assessment that compiling with -m32 and -mtune=native on an AMD64 machine will use 32 bits for pointers, and hence give slightly smaller data structures and consequently denser cache packing? My application (still in its design stage) will not need more than a GB of memory, so a 32-bit address space should be fine. I am hoping to use the shrunken data structures to fit more data in cache, and also to use more registers for integer math. What I am trying to achieve is the best of both worlds: more registers by virtue of -mtune=native, but smaller pointers via -m32. Will the availability of extra registers be taken care of largely automatically, or do I have to deal explicitly with SSE? I would rather not.
At least if I understand what you want, I don't think the compiler is going to do the job for you. It's going to produce either 32-bit or 64-bit code. With 64-bit code the pointers will be 64 bits; if you want 32-bit pointers, you're going to have to generate 32-bit code (with its commensurate limitations, such as which registers it can use).
To get what you want, I'm pretty sure you'll have to split the pointer up yourself: store a (probably 64-bit) base address, and inside your data structures store offsets rather than complete addresses. You'll need to add the base and offset together yourself before you attempt to dereference, though.
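A rough sketch of that base-plus-offset idea (all names are made up):

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t node_ref;             /* "pointer" = 32-bit offset from base */

typedef struct {
    int      value;
    node_ref next;                     /* 4 bytes instead of 8                */
} node;

static char *arena_base;               /* one 64-bit base for the whole arena */

static inline node *deref(node_ref r)  /* offset -> real pointer              */
{
    return (node *)(arena_base + r);
}

int main(void)
{
    arena_base = malloc(1u << 20);     /* 1 MiB arena; error check omitted    */
    node *first = deref(0);
    first->value = 42;
    first->next  = sizeof(node);       /* "points" at the next slot           */
    deref(first->next)->value = 43;
    return 0;
}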
Yes, -m32 should use 32-bit pointers. You could confirm that by printing sizeof(int *).
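For example (assuming a typical x86-64 Linux toolchain, where -m32 gives the ILP32 model and the default 64-bit build gives LP64):

#include <stdio.h>

int main(void)
{
    /* -m32: both print 4; default 64-bit build: both print 8 */
    printf("sizeof(int *) = %zu, sizeof(long) = %zu\n",
           sizeof(int *), sizeof(long));
    return 0;
}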
However, 64-bit applications have more registers accessible in the instruction set (in addition to the registers being larger), so you might get the opposite effect. It depends on the application which effect (larger pointers vs. more registers) has the bigger impact; generally I would suggest trying both and comparing the performance.
We need to link one of our executables with this flag as it uses lots of memory.
But why give one EXE file special treatment? Why not standardize on /LARGEADDRESSAWARE?
So the question is: is there anything wrong with using /LARGEADDRESSAWARE even if you don't need it? Why not use it as standard for all EXE files?
Blindly applying the LargeAddressAware flag to your 32-bit executable deploys a ticking time bomb!
By setting this flag you are testifying to the OS: "Yes, my application (and all DLLs loaded during runtime) can cope with memory addresses up to 4 GB, so don't restrict the VAS for the process to 2 GB, but unlock the full range (of 4 GB)."
But can you really guarantee that? Do you take responsibility for all the system DLLs, Microsoft redistributables, and 3rd-party modules your process may use?
Usually, memory allocation returns virtual addresses in low-to-high order, so unless your process consumes a lot of memory (or has a very fragmented virtual address space), it will never use addresses beyond the 2 GB boundary. This hides bugs related to high addresses.
If such bugs exist, they are hard to identify; they will show up sporadically, "sooner or later". It's just a matter of time.
Luckily there is an extremely handy system-wide switch built into the Windows OS: for testing purposes, use the MEM_TOP_DOWN registry setting.
This forces all memory allocations to go from the top down, instead of the normal bottom up.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"AllocationPreference"=dword:00100000
(This is hex 0x100000; a Windows reboot is required, of course.)
With this switch enabled you will identify issues "sooner" rather than "later"; ideally you'll see them right from the beginning.
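For a quick spot check from code, independent of the registry switch, VirtualAlloc also accepts a per-allocation MEM_TOP_DOWN flag; a rough sketch (error handling kept minimal):

#include <windows.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Ask for the highest available address; in a /LARGEADDRESSAWARE
     * 32-bit process on a 64-bit OS this should land above 2 GB.      */
    void *p = VirtualAlloc(NULL, 64 * 1024,
                           MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN,
                           PAGE_READWRITE);
    printf("allocated at %p (above 2 GB: %s)\n",
           p, p && (uintptr_t)p >= 0x80000000u ? "yes" : "no");
    if (p)
        VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}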
Side note: for a first analysis I strongly recommend the tool VMMap (Sysinternals).
Conclusions:
When applying the LAA flag to your 32-bit executable, it is mandatory to fully test it on an x64 OS with the top-down AllocationPreference switch set.
For issues in your own code, you may be able to fix them yourself. To name one very obvious example: use unsigned instead of signed integers when storing memory addresses.
When encountering issues with 3rd-party modules, you need to ask the author to fix their bugs. Unless this is done, you had better remove the LargeAddressAware flag from your executable.
A note on testing: the MemTopDown registry switch does not achieve the desired results for unit tests that are executed by a "test runner" that itself is not LAA-enabled.
See: Unit Testing for x86 LargeAddressAware compatibility
PS: also very "related" and quite interesting is the migration from 32-bit code to 64-bit. For examples, see:
As a programmer, what do I need to worry about when moving to 64-bit Windows?
https://www.sec.cs.tu-bs.de/pubs/2016-ccs.pdf (Twice the Bits, Twice the Trouble)
Because lots of legacy code is written with the expectation that "negative" pointers are invalid: anything in the top 2 GB of a 32-bit process has the MSB set.
As such, it's far easier for Microsoft to play it safe and require that applications which (a) need the full 4 GB and (b) have been developed and tested in a large-memory scenario simply set the flag.
It's not - as you have noticed - that hard.
Raymond Chen - in his blog The Old New Thing - covers the issues with turning it on for all (32-bit) applications.
No, "legacy code" in this context (C/C++) is not exclusively code that plays ugly tricks with the MSB of pointers.
It also includes all the code that uses 'int' to store the difference between two pointers, or the length of a memory area, instead of using the correct type 'size_t': 'int', being signed, has only 31 value bits and cannot handle values of more than 2 GB.
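A contrived illustration of that kind of code (the bug only bites once a buffer, or the distance between two addresses, exceeds 2 GB):

#include <stddef.h>

void wipe(char *buf, size_t size)
{
    char *end = buf + size;

    /* Bug: a pointer difference / length squeezed into a signed int.
     * Beyond 2 GB this becomes negative and the loop below is silently
     * skipped (or worse). It should be ptrdiff_t or size_t.            */
    int len = (int)(end - buf);

    for (int i = 0; i < len; ++i)
        buf[i] = 0;
}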
A way to cure a good part of your code is to go over it and fix all of those innocuous "mixing signed and unsigned" warnings. That should do a good part of the job, at least if you haven't defined functions where an argument of type int is actually a memory length.
However, that "legacy code" will probably appear to work right for quite a while, even if you correct nothing.
You'll only break when you allocate more than 2 GB in one block, or when you compare two unrelated pointers that are more than 2 GB apart.
As comparing unrelated pointers is technically undefined behaviour anyway, you won't encounter much code that does it (but you can never be sure).
And very frequently, even if in total you need more than 2 GB, your program never actually makes a single allocation that large. In fact, on Windows, even with LARGEADDRESSAWARE you won't be able to allocate that much by default, given the way the memory is organized; you'd need to shuffle the system DLLs around to get a contiguous block of more than 2 GB.
But Murphy's law says that kind of code will break one day; it's just that it will happen long after you've enabled LARGEADDRESSAWARE without checking, when nobody remembers this has been done.
I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? In this case, the stack only needs to be aligned on 4-byte boundaries. Yes, some of the MMX/SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then it should ensure the alignment is correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance, because every call site must manage it. Am I missing something?
Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:
Consistency between the PPC, x86, and x64 versions of the OS
It seems that the GCC codegen now consistently does a sub esp, xxx and then "mov"s the data onto the stack rather than simply using "push" instructions. This can actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.
The issue I have with the last item is that, for calling conventions that rely on the callee cleaning the stack, the above requirement really "uglifies" the codegen. For instance, what if some compiler decided to implement a faster register-based calling convention for its own internal use (i.e. any code that isn't intended to be called from other languages or sources)? This stack-alignment thing could negate some of the performance gains achieved by passing some parameters in registers.
Update: So far the only real answer has been consistency, but to me that's a bit too easy an answer. I have well over 20 years of experience with the x86 architecture, and if consistency (not performance, or something else concrete) is really the reason, then I respectfully suggest that it is a bit naive of the developers to require it. They're ignoring nearly three decades of tools and support, especially if they're expecting tools vendors to quickly and easily adapt their tools to the platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.
I'll give this topic another day or so then close it...
Related
It’s my stack frame, I don’t care about your stack frame!
From "Intel®64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:
"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."
From Appendix D:
"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
http://www.intel.com/Assets/PDF/manual/248966.pdf
I am not sure, as I don't have first-hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16-byte boundary (movaps vs. movups), and any x86 Mac has at least SSE2. Alignment can be taken care of by the application developer, but the cost is pretty significant. If the overall cost of making it mandatory in the ABI is not too significant, it may be worth it. SSE is used quite pervasively in Mac OS X: the Accelerate framework, etc.
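The difference the answer alludes to, in intrinsic form; this is only an illustration of movaps vs. movups, not Apple's actual library code:

#include <xmmintrin.h>

float sum8(const float *p_aligned, const float *p_unaligned)
{
    __m128 a = _mm_load_ps(p_aligned);    /* movaps: needs 16-byte alignment,
                                             faults if that is violated       */
    __m128 b = _mm_loadu_ps(p_unaligned); /* movups: no alignment requirement,
                                             historically slower              */
    __m128 s = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, s);
    return out[0] + out[1] + out[2] + out[3];
}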
I believe it's to keep it in line with the x86-64 ABI.
First, note that the 16-byte alignment is an exception introduced by Apple to the System V IA-32 ABI.
The stack alignment is only needed when calling system functions, because many system libraries use SSE or AltiVec extensions, which require 16-byte alignment. I found an explicit reference in the libgmalloc man page.
You can handle your stack frame however you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.
Edit:
For the record, you can get rid of alignment problems when compiling with GCC by using the -mstackrealign option.
This is an efficiency issue.
Making sure the stack is 16-byte aligned in every function that uses the new SSE instructions adds a lot of overhead for using those instructions, effectively reducing performance.
On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (measured in instructions, at least); it only involves changing a constant in the function prologue.
Wasting stack space is cheap; it is probably the hottest part of the cache.
My guess is that Apple believes everyone just uses Xcode (gcc), which aligns the stack for you. So requiring the stack to be aligned, so the kernel doesn't have to do it, is just a micro-optimization.
While I cannot really answer your question of WHY, you may find the manuals at the following site useful:
http://www.agner.org/optimize/
Regarding the ABI, have a look especially at:
http://www.agner.org/optimize/calling_conventions.pdf
Hope that's useful.
Hmm, didn't the OS X ABI also do funny RISC-like things, like passing small structs in registers?
So that points to the consistency with other platforms theory.
Come to think of it, the FreeBSD syscall API also aligns 64-bit values (e.g. lseek and mmap).
In order to maintain consistency in the kernel. This allows the same kernel to be booted on multiple architectures without modification.
Not sure why no one has considered the possibility of easy portability from the legacy PowerPC-based platform.
Read this:
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20
Then zoom into "32-bit PowerPC Function Calling Conventions" and finally this:
"These are the embedding alignment modes available in the 32-bit
PowerPC environment:
Power alignment mode is derived from the alignment rules used by the
IBM XLC compiler for the AIX operating system. It is the default
alignment mode for the PowerPC-architecture version of GCC used on AIX
and Mac OS X. Because this mode is most likely to be compatible
between PowerPC-architecture compilers from different vendors, it’s
typically used with data structures that are shared between different
programs."
In view of the legacy PowerPC-based background of OSX, portability is a major consideration - it dictates following the convention all the way back to AIX's XLC compiler. When you think in terms of the need to make sure all the tools and applications will work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.
That gives the philosophy; reading further, the rule is explicitly stated ("Prolog and Epilog"):
The called function is responsible for allocating
its own stack frame, making sure to preserve 16-byte alignment in the
stack. This operation is accomplished by a section of code called the
prolog, which the compiler places before the body of the subroutine.
After the body of the subroutine, the compiler places an epilog to
restore the processor to the state it was prior to the subroutine
call.