Compiler assumptions about relative locations of memory objects - gcc

I wonder what assumptions compilers make about the relative locations of memory objects.
For example, if we allocate two stack variables of one byte each, right after one another, and initialize both to zero, can a compiler optimize this case by emitting a single instruction that overwrites both bytes in memory with zeros, because it knows the relative position of the two variables?
I am specifically interested in the better-known compilers such as gcc, g++, clang, and the Windows C/C++ compiler (MSVC).

A compiler can optimize multiple assignments into one.
a = 0;
b = 0;
might become something like
*(short*)&a = 0;
The subtle part is "if we allocate two stack variables of size 1 byte each, right after one another", since you cannot really do that. A compiler can shuffle stack positions around at will. Also, simply declaring variables will not necessarily mean any stack allocation; the variables might just live in registers. In C you would have to use alloca, and even that does not give you "right after one another".
More generally, the C standard does not allow you to compare the relative positions (with < or >) of distinct objects; doing so is undefined behavior.
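To make the adjacency point concrete, here is a small sketch (not from the original answer; the struct and function names are just for illustration). Inside an aggregate the layout is under the compiler's control, so it may legally coalesce the two byte stores into one wider store:
#include <string.h>

/* Adjacency is only guaranteed inside an aggregate: two separate locals
   may be placed anywhere (or live in registers), but the members of this
   struct are laid out in declaration order with no padding between two
   chars. */
struct two_bytes {
    unsigned char a;
    unsigned char b;
};

void reset(struct two_bytes *p)
{
    /* The compiler controls this layout, so it may merge these two
       byte stores into a single 16-bit store. */
    p->a = 0;
    p->b = 0;

    /* Equivalent, and explicitly one operation over both bytes: */
    /* memset(p, 0, sizeof *p); */
}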

Related

Mutable data types that use stack allocation

Based on my earlier question, I understand the benefit of using stack allocation. Suppose I have an array of arrays. For example, A is a list of matrices and each element A[i] is a 1x3 matrix. The length of A and the dimension of A[i] are known at run time (given by the user). Each A[i] is a matrix of Float64, and this is also known at run time. However, throughout the program, I will be modifying the values of A[i] element by element. What data structure would also allow me to use stack allocation? I tried StaticArrays, but it doesn't allow me to modify a static array.
StaticArrays defines MArray (MVector, MMatrix) types that are fixed-size and mutable. If you use these there's a higher chance of the compiler determining that they can be stack-allocated, but it's not guaranteed. Moreover, since the pattern you're using is that you're passing the mutable state vector into a function which presumably modifies it, it's not going to be valid or helpful to stack allocate that anyway. If you're going to allocate state once and modify it throughout the program, it doesn't really matter if it is heap or stack allocated—stack allocation is only a big win for objects that are allocated, used locally and then don't escape the local scope, so they can be “freed” simply by popping the stack.
From the code snippet you showed in the linked question, the state vector is allocated in the outer function, test_for_loop, which shouldn't be a big deal since it's done once at the beginning of execution. Using a variably sized state vector to index into an array with a splat (...) might be an issue, however, and that's done in test_function. Using something with fixed size like MVector might be better for that. It might, however, be better still to use a state tuple and return a new, rather than mutated, state tuple at the end. The compiler is very good at turning that kind of thing into very efficient code because of immutability.
Note that by convention test_function should be called test_function! since it modifies its M argument and even more so if it modifies the state vector.
I would also note that this isn't a great question/answer pair since it's not standalone at all and really just a continuation of your other question. StackOverflow isn't very good for this kind of iterative question/discussion interaction, I'm afraid.

What may be the problem with Clang's -fmerge-all-constants?

The compilation flag -fmerge-all-constants merges identical constants into a single variable. I keep reading that this results in non-conforming code, and Linus Torvalds wrote that it's inexcusable, but why?
What can possibly happen when you merge two or more identical constant variables?
There are times when programs declare a constant object because they need something with a unique address, and there are times when any address that points to storage holding the proper sequence of byte values would be equally usable. In C, if one writes:
char const * const HelloShared1 = "Hello";
char const * const HelloShared2 = "Hello";
char const HelloUnique1[] = "Hello";
char const HelloUnique2[] = "Hello";
a compiler would have to reserve space for at least three copies of the word Hello, followed by a zero byte. The names HelloUnique1 and HelloUnique2 would refer to two of those copies, and the names HelloShared1 and HelloShared2 would need to identify storage that was distinct from that used by HelloUnique1 and HelloUnique2, but HelloShared1 and HelloShared2 could at the compiler's convenience identify the same storage.
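As a concrete illustration (my own sketch, not part of the original answer) of what can go wrong when that distinctness guarantee is taken away:
#include <stdio.h>

const char HelloUnique1[] = "Hello";
const char HelloUnique2[] = "Hello";

int main(void)
{
    /* The Standard requires two distinct objects to have distinct
       addresses, so a conforming implementation must print "distinct".
       With -fmerge-all-constants the two arrays may be folded into one,
       and code that relies on the addresses being different (e.g. using
       them as unique sentinel values or map keys) silently breaks. */
    if (HelloUnique1 == HelloUnique2)
        puts("merged");
    else
        puts("distinct");
    return 0;
}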
Unfortunately, while the C and C++ Standards usefully provide two ways of specifying objects that hold string-literal data, so as to let programmers indicate when multiple copies of the same information may be placed in the same storage, they fail to specify any means of requesting the same semantics for any other kind of constant data. For most kinds of applications, situations where a program would care about whether two objects share the same address are far less common than those where using the same storage for constant objects holding the same data would be advantageous.
Being able to invite an implementation to make optimizations which the Standard would not otherwise allow is useful, provided one recognizes that programs should not be expected to be compatible with all optimizations (nor vice versa), that compiler writers document which kinds of programs each optimization is compatible with, and that programmers enable only the optimizations known to be compatible with their code.
Fundamentally, optimizations that assume programs won't do X will be useful for applications that don't involve doing X, but at best counter-productive for those that do. The described optimizations fall into this category. I see no basis for complaining about a compiler that makes such optimizations available but doesn't enable them by default. On the other hand, some people regard any program that isn't compatible with every imaginable optimization as "broken".

How can I further understand the compilation process used by gcc?

I was trying to reverse engineer some psp programs developed using the free
pspsdk
https://sourceforge.net/projects/minpspw/
I created a function to see how MIPS handles more than 4 arguments (a0-a3 being the argument registers).
Everyone I know told me that the extra arguments get passed on the stack.
To my surprise, the 5th argument was actually passed in register t0 and the compiler didn't even use the stack!
It also inlined a function without even using a jal or a jump to it (an obvious optimization).
Although the function did exist in memory (you can double-check by printing a function pointer to it), the code that was actually executed was inlined without any function call instruction.
But that doesn't really help a reverse engineering attempt...
There is a man page for this version of gcc, and it takes seconds to install, if anyone can point me to the part that documents this compilation behaviour, if there is one.
It's so long I don't even know how to look up information in it reliably.
How arguments are passed is specified by the ABI (application binary interface), so you have to find the respective documents.
Moreover, there is more than one such ABI for MIPS (o32, n32, n64 and the EABI, among others). In the case of mips-gcc, some of the decisions are commented in the GCC sources, for example in ./gcc/config/mips/mips.h:
/* This structure has to cope with two different argument allocation
schemes. Most MIPS ABIs view the arguments as a structure, of which
the first N words go in registers and the rest go on the stack. If I
< N, the Ith word might go in Ith integer argument register or in a
floating-point register. For these ABIs, we only need to remember
the offset of the current argument into the structure.
The EABI instead allocates the integer and floating-point arguments
separately. The first N words of FP arguments go in FP registers,
the rest go on the stack. Likewise, the first N words of the other
arguments go in integer registers, and the rest go on the stack. We
need to maintain three counts: the number of integer registers used,
the number of floating-point registers used, and the number of words
passed on the stack.
We could keep separate information for the two ABIs (a word count for
the standard ABIs, and three separate counts for the EABI). But it
seems simpler to view the standard ABIs as forms of EABI that do not
allocate floating-point registers.
So for the standard ABIs, the first N words are allocated to integer
registers, and mips_function_arg decides on an argument-by-argument
basis whether that argument should really go in an integer register,
or in a floating-point one. */
There are more such comments in the mips backend. Search for "cumulative" or "CUMULATIVE" in mips.c and mips.h.
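If you just want to see what your particular toolchain does, a quick experiment (my suggestion, not part of the original answer; the file and function names are made up) is to compile a small test function to assembly and look at which registers the arguments land in:
/* args.c -- compile to assembly with your toolchain's gcc, e.g.
       psp-gcc -O2 -S args.c     (or mips-elf-gcc, whatever prefix you use)
   and inspect args.s.  Under the o32 ABI only a0-a3 carry integer
   arguments; the MIPS EABI used by some embedded toolchains also passes
   arguments in t0-t3, which would explain the t0 behaviour described
   above. */
int sum6(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}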

Allocatable arrays performance

There is an MPI version of a program which uses COMMON blocks to store arrays that are used everywhere throughout the code. Unfortunately, there is no way to declare arrays in a COMMON block whose size would be known only at run time. So, as a workaround, I decided to move those arrays into modules, which do accept ALLOCATABLE arrays. That is, all the arrays in COMMON blocks were removed and ALLOCATE was used instead. This was the only thing I changed in my program. Unfortunately, the performance of the program became awful compared to the COMMON block version. As for the MPI settings, there is a single MPI process on each computational node and each MPI process has a single thread.
I found a similar question asked here but I don't see (don't understand :) ) how it could apply to my case (where each process has a single thread). I appreciate any help.
Here is a simple example which illustrates what I was talking about (below is pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
INCLUDE 'FILE_1.INC'
INCLUDE 'FILE_2.INC'
INCLUDE 'FILE_3.INC'
....
INCLUDE 'FILE_N.INC'
ARRAY_1 = 0.0
ARRAY_2 = 0.0
ARRAY_3 = 0.0
ARRAY_4 = 0.0
...
ARRAY_N = 0.0
END SUBROUTINE
As you can see, ZEROSET() has no parallel or MPI stuff. FILE_1.INC, FILE_2.INC, ..., FILE_N.INC are files where ARRAY_1, ARRAY_2, ..., ARRAY_N are defined in COMMON blocks. Something like this:
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
where NX, NY, NZ are parameters defined with the PARAMETER statement.
When I switched to modules, I just removed all the COMMON blocks, so FILE_I (now a module) looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
and then just changed each INCLUDE 'FILE_I.INC' statement above to USE FILE_I. Actually, when the parallel program runs, a particular process does not need the whole (NX, NY, NZ) domain, so I compute the local dimensions and then allocate ARRAY_I (only ONCE!).
Subroutine ZEROSET() takes 0.18 seconds with COMMON blocks and 0.36 seconds with modules (when the array dimensions are computed at run time). So performance worsened by a factor of two.
I hope that everything is clear now. I appreciate your help very much.
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about their sizes at compile time. With many compilers you will get much better performance with this code:
subroutine X
   use Y             ! module Y holds the allocatable array A(N,N) and the size N
   call Z(A, N)
end subroutine

subroutine Z(A, N)
   integer N
   real A(N,N)       ! explicit-shape dummy: the bounds are known inside Z
   ! ... do stuff on A here ...
end subroutine
than with this code:
subroutine X
   use Y             ! has the allocatable array A(N,N) in it
   ! ... do stuff on A here ...
end subroutine
In the first version the compiler knows inside Z that the array is NxN and that the do loops run over N, and it can take advantage of that fact (most codes work that way on arrays). In the second version, after any subroutine call inside "do stuff here", the compiler has to assume that the allocatable array "A" might have changed size or moved in memory, and recheck. That kills optimization.
This should get you most of your performance back.
Common blocks also live at a fixed place in memory, which allows further optimizations.
Actually, I guess your problem here is a combination of stack vs. heap memory and compiler optimization. Depending on the compiler you're using, it might do more efficient memory blanking, and for a fixed chunk of memory it does not even need to check its extent and location within the subroutine. Thus, with fixed-size arrays there is hardly any overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the zero-setting altogether and, for example, split off the first iteration of the loop and use it for the initialization; that way you do not introduce additional memory accesses just for initializing with 0. However, it would duplicate some code...
I can think of just these reasons when it comes to Fortran performance with arrays:
arrays on the stack vs. the heap, but I doubt this could have a huge performance impact.
how arrays are passed to a subroutine, because the best way to do that depends on the array; see this page on using arrays efficiently.

How to determine maximum stack usage in embedded system with gcc?

I'm writing the startup code for an embedded system -- the code that loads the initial stack pointer before jumping to the main() function -- and I need to tell it how many bytes of stack my application will use (or some larger, conservative estimate).
I've been told the gcc compiler now has a -fstack-usage option and -fcallgraph-info option that can somehow be used to statically calculate the exact "Maximum Stack Usage" for me.
( "Compile-time stack requirements analysis with GCC" by Botcazou, Comar, and Hainque ).
Nigel Jones says that recursion is a really bad idea in embedded systems ("Computing your stack size" 2009), so I've been careful not to make any mutually recursive functions in this code.
Also, I make sure that none of my interrupt handlers ever re-enable interrupts until their final return-from-interrupt instruction, so I don't need to worry about re-entrant interrupt handlers.
Without recursion or re-entrant interrupt handlers, it should be possible to statically determine the maximum stack usage. (And so most of the answers to How to determine maximum stack usage? do not apply.)
My understanding is that I (or preferably some bit of code on my PC that runs automatically every time I rebuild the executable) first find the maximum stack depth of each interrupt handler when it's not interrupted by a higher-priority interrupt, and the maximum stack depth of the main() function when it is not interrupted.
Then I add them all up to find the total (worst-case) maximum stack depth. That occurs (in my embedded system) when the main() background task is at its maximum depth when it is interrupted by the lowest-priority interrupt, and that interrupt is at its maximum depth when it is interrupted by the next-lowest-priority interrupt, and so on.
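For concreteness, here is a toy sketch of that summation (the numbers are invented):
/* Each priority level can preempt the level below it at most once,
   so the maximum depths simply add up. */
#include <stdio.h>

int main(void)
{
    unsigned depth_main  = 320;                 /* max depth of main()'s call tree */
    unsigned depth_irq[] = { 96, 64, 48 };      /* lowest to highest priority ISR  */
    unsigned total = depth_main;
    for (unsigned i = 0; i < sizeof depth_irq / sizeof depth_irq[0]; i++)
        total += depth_irq[i];
    printf("worst-case stack: %u bytes\n", total);  /* 320+96+64+48 = 528 */
    return 0;
}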
I'm using YAGARTO with gcc 4.6.0 to compile code for the LM3S1968 ARM Cortex-M3.
So how do I use the -fstack-usage option and -fcallgraph-info option with gcc to calculate the maximum stack depth? Or is there some better approach to determine maximum stack usage?
(See How to determine maximum stack usage in embedded system? for almost the same question targeted to the Keil compiler .)
GCC docs:
-fstack-usage
Makes the compiler output stack usage information for the program, on a per-function basis. The filename for the dump is made by appending .su to the auxname. auxname is generated from the name of the output file, if explicitly specified and it is not an executable, otherwise it is the basename of the source file. An entry is made up of three fields:
The name of the function.
A number of bytes.
One or more qualifiers: static, dynamic, bounded.
The qualifier static means that the function manipulates the stack statically: a fixed number of bytes are allocated for the frame on function entry and released on function exit; no stack adjustments are otherwise made in the function. The second field is this fixed number of bytes.
The qualifier dynamic means that the function manipulates the stack dynamically: in addition to the static allocation described above, stack adjustments are made in the body of the function, for example to push/pop arguments around function calls. If the qualifier bounded is also present, the amount of these adjustments is bounded at compile-time and the second field is an upper bound of the total amount of stack used by the function. If it is not present, the amount of these adjustments is not bounded at compile-time and the second field only represents the bounded part.
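For concreteness, a sketch (not from the GCC docs) of what this output looks like for a trivial translation unit; the exact byte count, line and column depend on the target and options:
/* foo.c -- compile with:  gcc -c -fstack-usage foo.c
   gcc then writes foo.su next to the object file; for this function a
   typical entry (file:line:col:function, bytes, qualifier) looks like:
       foo.c:8:6:fill   272   static
*/
char fill(void)
{
    char buf[256];          /* fixed-size frame, hence the "static" qualifier */
    buf[0] = 'x';
    return buf[0];
}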
I can't find any references to -fcallgraph-info
You could potentially create the information you need from -fstack-usage and -fdump-tree-optimized.
For each leaf in the -fdump-tree-optimized output, walk up through its parents and sum their stack-size numbers from -fstack-usage (keeping in mind that this number is unreliable for any function marked "dynamic" but not "bounded"). The maximum of these sums should be your maximum stack usage.
Just in case no one comes up with a better answer, I'll post what I had in the comment to your other question, even though I have no experience using these options and tools:
GCC 4.6 adds the -fstack-usage option which gives the stack usage statistics on a function-by-function basis.
If you combine this information with a call graph produced by cflow or a similar tool you can get the kind of stack depth analysis you're looking for (a script could probably be written pretty easily to do this). Have the script read the stack-usage info and load up a map of function names with the stack used by the function. Then have the script walk the cflow graph (which can be an easy-to-parse text tree), adding up the stack usage associated with each line for each branch in the call graph.
So, it looks like this can be done with GCC, but you might have to cobble together the right set of tools.
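A rough sketch in C of the accumulation step such a script would perform (the structure and names are hypothetical; it assumes the .su files and the cflow output have already been parsed, and that there is no recursion):
#include <stddef.h>

struct func {
    const char   *name;
    size_t        own_stack;   /* bytes from this function's .su entry     */
    struct func **callees;     /* functions it calls, from the call graph  */
    size_t        n_callees;
};

/* Worst-case depth starting at f: its own frame plus the deepest callee. */
static size_t worst_case_stack(const struct func *f)
{
    size_t deepest = 0;
    for (size_t i = 0; i < f->n_callees; i++) {
        size_t d = worst_case_stack(f->callees[i]);
        if (d > deepest)
            deepest = d;
    }
    return f->own_stack + deepest;
}
Calling worst_case_stack() on main() (and on each interrupt handler's entry point) gives the per-context figures to add up.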
Quite late, but for anyone looking at this: the answers that involve combining the outputs of -fstack-usage and call-graph tools like cflow can end up wildly incorrect for any dynamic allocation, even bounded, because there's no information about when that dynamic stack allocation occurs. It's therefore not possible to know which functions the value should be applied to. As a contrived example, if the (simplified) -fstack-usage output is:
main 1024 dynamic,bounded
functionA 512 static
functionB 16 static
and a very simple call tree is:
main
    functionA
    functionB
The naive approach to combine these may result in main -> functionA being chosen as the path of maximum stack usage, at 1536 bytes. But, if the largest dynamic stack allocation in main() is to push a large argument like a record to functionB() directly on the stack in a conditional block that calls functionB (I already said this was contrived), then really main -> functionB is the path of maximum stack usage, at 1040 bytes. Depending on existing software design, and also for other more restricted targets that pass everything on the stack, cumulative errors may quickly lead you toward looking at entirely wrong paths claiming significantly overstated maximum stack sizes.
Also, depending on your classification of "reentrant" when talking about interrupts, it's possible to miss some stack allocations entirely. For instance, many Coldfire processors' level 7 interrupt is edge-sensitive and therefore ignores the interrupt disable mask, so if a semaphore is used to leave the handler early, you may not consider it reentrant, but the initial stack allocation will still happen before the semaphore is checked.
In short, you have to be extremely careful about using this approach.
I ended up writing a python script to implement τεκ's answer. It's too much code to post here, but can be found on github
I am not familiar with the -fstack-usage and -fcallgraph-info options. However, it is always possible to figure out actual stack usage by:
Allocate adequate stack space (for this experiment), and initialize it to something easily identifiable. I like 0xee.
Run the application and test all its internal paths (by all combinations of input and parameters). Let it run for more than "long enough".
Examine the stack area and see how much of the stack was used.
Make that the stack size, plus 10% or 20% to tolerate software updates and rare conditions.
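A minimal sketch of steps 1 and 3, assuming a downward-growing stack and hypothetical linker symbols __stack_start__ / __stack_end__ delimiting the stack region (substitute whatever your linker script provides):
#include <stdint.h>
#include <stddef.h>

extern uint8_t __stack_start__[];   /* lowest address of the stack region */
extern uint8_t __stack_end__[];     /* one past the highest address       */

#define STACK_FILL 0xEEu

/* Step 1: call this very early, while little of the stack is in use.
   Stop a safe margin below the current frame so we don't clobber it. */
void stack_paint(void)
{
    uint8_t *top = (uint8_t *)&top - 64;
    for (uint8_t *p = __stack_start__; p < top; p++)
        *p = STACK_FILL;
}

/* Step 3: after running "long enough", bytes still holding the fill
   pattern were never touched; everything above the first overwritten
   byte is the high-water mark. */
size_t stack_high_water_mark(void)
{
    const uint8_t *p = __stack_start__;
    while (p < __stack_end__ && *p == STACK_FILL)
        p++;
    return (size_t)(__stack_end__ - p);
}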
There are generally two approaches - static and runtime.
Static: compile your project with -fdump-rtl-expand -fstack-usage. From the *.expand dumps, build the call tree, and take the per-function stack usage from the .su files. Then iterate over all leaves in the call tree, compute the accumulated stack usage along the path to each leaf, and take the highest value. Compare that value with the memory available on the target. This works statically and doesn't require running the program, but it does not handle recursive functions or VLAs, and if sbrk() operates on a linker section rather than on a statically preallocated buffer, it does not take dynamic (heap) allocation into account, which may grow toward the stack from the other side. I have a script in my tree, stacklyze.sh, that I explored this option with.
Runtime: check the current stack usage before and after each function call. Compile the code with -finstrument-functions, then define the two instrumentation hooks in your code so that they obtain the current stack usage (roughly) and act on it:
static unsigned long long max_stack_usage = 0;

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    (void)this_fn; (void)call_site;
    // Get the current stack usage using some hardware facility or intrinsic
    // function, e.g. __get_MSP() on ARM with CMSIS.  The __GET_... macros are
    // placeholders for whatever your platform provides; swap the operands if
    // your stack grows the other way.
    unsigned long long cur_stack_usage =
        __GET_CURRENT_STACK_POINTER() - __GET_BEGINNING_OF_STACK();

    // Option 1: stream the value out through the debugger,
    // for example via semihosting on ARM.
    __DEBUGGER_TRANSFER_STACK_POINTER(cur_stack_usage);

    // Option 2: keep a running maximum, then just run the program
    // and inspect the maximum afterwards (e.g. with a debugger).
    if (max_stack_usage < cur_stack_usage) {
        max_stack_usage = cur_stack_usage;
    }

    // Option 3: break into the debugger as soon as a limit is exceeded.
    unsigned long long somelimit = 0x2000;
    if (cur_stack_usage > somelimit) {
        __BREAKPOINT();
    }
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    (void)this_fn; (void)call_site;
    // well, nothing to do here
}
This way the current stack usage is checked before and after each function call. Because the hook runs before the function's own stack is used, this method misses the stack usage of the current function itself; that is only a single frame and usually doesn't matter much, and it can be mitigated by identifying which function is being entered (via this_fn), looking up its frame size from -fstack-usage, and adding it to the result.
In general you need to combine call-graph information with the .su files generated by -fstack-usage to find the deepest stack usage starting from a specific function. Starting at main() or a thread entry-point will then give you the worst-case usage for that thread.
Helpfully the work to create such a tool has been done for you as discussed here, using a Perl script from here.

Resources