Using the "naked" attribute for functions in GCC

GCC documentation states in 6.30 Declaring Attributes of Functions:
naked
Use this attribute on the ARM, AVR, IP2K, RX and SPU ports to indicate that the specified function does not need prologue/epilogue sequences generated by the compiler. It is up to the programmer to provide these sequences. The only statements that can be safely included in naked functions are asm statements that do not have operands. All other statements, including declarations of local variables, if statements, and so forth, should be avoided. Naked functions should be used to implement the body of an assembly function, while allowing the compiler to construct the requisite function declaration for the assembler.
Can I safely call functions using C syntax from naked functions, or only by using asm?

You can safely call functions from a naked function, provided that the called functions have a full prologue and epilogue.
Note that it is a bit of a nonsense to assert that you can 'safely' use assembly language in a naked function. You are entirely responsible for anything you do using assembly language, as you are for any calls you make to 'safe' functions.
To ensure that your generic called function is not static or inlined, it should be in a separate compilation unit.
"naked" functions do not include any prologue or epilogue -- they are naked. In particular, they do not include operations on the stack for local variables, to save or restore registers, or to return to a calling function.
That does not mean that no stack exists -- the stack is set up during program initialisation, not in any function initialization. Since a stack exists, called function prologues and epilogues work correctly. A function call can safely push its return address, save any registers used, and reserve space for any local variables. On return (using the return address), the registers are restored and the stack space is released.
Static or inlined functions may not have a full prologue and epilogue. They can and may depend on the calling function to manage the stack and to restore corrupted registers.
This leads to the next point: you need the prologue and epilogue only to encapsulate the operations of the called function. If the called function is also safe (no explicit or implicit local variables, no changes to status registers), it can be safely static and/or inlined. As with asm, it would be your responsibility to make sure this is true.

If the only thing you do in the naked function is call another function you can just use a single JMP machine code instruction.
The function you jump to will have a valid prologue and it should return directly to the caller of the naked function since JMP doesn't push a return address on the stack.

The only statements that can be safely included in naked functions are asm statements that do not have operands. All other statements, including declarations of local variables, if statements, and so forth, should be avoided.
Based on the description you already gave, I would assume that even function calls are not suitable inside "naked" functions.

Related

Make-array in SBCL

How does make-array work in SBCL? Are there some equivalents of new and delete operators in C++, or is it something else, perhaps assembler level?
I peeked into the source, but didn't understand anything.
When using SBCL compiled from source and an environment like Emacs/Slime, it is possible to navigate the code quite easily using M-. (meta-point). Basically, the make-array symbol is bound to multiple things: deftransform definitions, and a defun. The deftransform definitions are used mostly for optimization, so it is better to just follow the function first.
The make-array function delegates to an internal make-array% one, which is quite complex: it checks the parameters and dispatches to different specialized implementations of arrays based on those parameters: a bit-vector is implemented differently than a string, for example.
If you follow the case for simple-array, you find a function which calls allocate-vector-with-widetag, which in turn calls allocate-vector.
Now, allocate-vector is bound to several objects: multiple defoptimizer forms, a function, and a define-vop form.
The function is only:
(defun allocate-vector (type length words)
  (allocate-vector type length words))
Even if it looks like a recursive call, it isn't.
The define-vop form is a way to define how to compile a call to allocate-vector. In the function, and anywhere there is a call to allocate-vector, the compiler knows how to emit the assembly that implements the built-in operation. But the function itself is defined so that there is an entry point with the same name, and a function object that wraps over that code.
define-vop relies on a Domain Specific Language in SBCL that abstracts over assembly. If you follow the definition, you can find different vops (virtual operations) for allocate-vector, like allocate-vector-on-heap and allocate-vector-on-stack.
Allocation on the heap translates into a call to calc-size-in-bytes, a call to allocation, and a call to put-header, which most likely allocate memory and tag it (I followed the definition to src/compiler/x86-64/alloc.lisp).
How memory is allocated (and garbage collected) is another problem.
allocation emits assembly code using %alloc-tramp, which in turn executes the following:
(invoke-asm-routine 'call (if to-r11 'alloc-tramp-r11 'alloc-tramp) node)
There are apparently assembly routines called alloc-tramp-r11 and alloc-tramp, which are predefined sequences of assembly instructions. A comment says:
;;; Most allocation is done by inline code with sometimes help
;;; from the C alloc() function by way of the alloc-tramp
;;; assembly routine.
There is a base of C code for the runtime, see for example /src/runtime/alloc.c.
The -tramp suffix stands for trampoline.
Also have a look at src/runtime/x86-assem.S.

I don't understand the meaning of this: "a function to be evaluated during reloc processing"

I don't understand the meaning of this:
"a function to be evaluated during reloc processing" - it is from the flags shown by objdump.
How can a function be evaluated during reloc processing?
Is it a sequence of CPU opcodes (a subroutine) that must be called?
Or what?
https://sourceware.org/glibc/wiki/GNU_IFUNC
An ifunc symbol points to a resolver, which is itself in the object file. The linker/loader sees it and calls it with some arguments (which it knows how to supply), and gets back the address of the best implementation of the function.
That is what is called "evaluation" here.
It is all done for the sake of performance: an attempt to choose the best code for the specific CPU.

Matlab: Are local functions (subfunctions) compiled together with main function or separately?

I have heard that MATLAB has an automatic in-need compilation of functions which could create a lot of function-call overhead if you call a function many times like in the following code:
function output = BigFunction( args )
    for i = 1:10000000
        SmallFunction( args );
    end
end
Is it faster to call the function SmallFunction() if I put it in the same file as BigFunction() as a local function? Or is there any good solution other than pasting the code from SmallFunction() into the BigFunction() to optimize the performance?
Edit: It may be false assumption that the function-call overhead is because of the in-need compilation. The question is how to cut down on the overhead without making the code look awful.
Matlab hashes the functions it reads into memory. The functions are only compiled once if they exist as an independent function in its own file. If you put BigFunction in BigFunction.m and SmallFunction in SmallFunction.m then you should receive the optimization benefit of having the m-script compiled once.
The answer to my first question is that a local function performs the same as a function in another file.
An idea for the second question is to, if possible, make SmallFunction() an inline-function, which has less function-call overhead. I found more about function-call performances in the MathWorks forum, and I paste the question and answer below:
Question:
I have 7 different types of function call:
An Inline function. The body of the function is directly written down (inline).
A function is defined in a separate MATLAB file. The arguments are passed by the calling function (file-pass).
A function is defined in a separate MATLAB file. The arguments are provided by referencing global variables; only indices are provided by the calling function (file-global).
A nested function. The arguments are passed by the enclosing function (nest-pass).
A nested function. The arguments are those shared with the enclosing function; only indices are provided by the enclosing function (nest-share).
A sub function. The arguments are passed by the calling function (sub-pass).
A sub function. The arguments are provided by referencing global variables; only indices are provided by the calling function (sub-global).
I would like to know which function call provides better performance than the others in general.
The answer from MathWorks Support Team pasted here:
The ordering of performance of each function call from the fastest to the slowest tends to be as follows:
inline > file-pass = nest-pass = sub-pass > nest-share > sub-global > file-global
(A>B means A is faster than B and A=B means A is as fast as B)
First, inline is the fastest as it does not incur overhead associated with function call.
Second, when the arguments are passed to the callee function, the calling function sets up the arguments in such a way that the callee function knows where to retrieve them. This setup associated with function call in general incurs performance overhead, and therefore file-pass, nest-pass, and sub-pass are slower than inline.
Third, if the workspace is shared with nested functions and the arguments to a nested function are those shared within the workspace, rather than pass-by-value, then performance of that function call is inhibited. If MATLAB sees a shared variable within the shared workspace, it searches the workspace for the variable. On the other hand, if the arguments are passed by the calling function, then MATLAB does not have to search for them. The time taken for this search explains that type nest-share is slower than file-pass, nest-pass, and sub-pass.
Finally, when a function call involves global variables, performance is even more inhibited. This is because to look for global variables, MATLAB has to expand its search space to the outside of the current workspace. Furthermore, the reason a function call involving global variables appears a lot slower than the others is that MATLAB Accelerator does not optimize such a function call. When MATLAB Accelerator is turned off with the following command,
feature accel off
the difference in performance between inline and file-global becomes less significant.
Please note that the behaviors depend largely on various factors such as operating systems, CPU architectures, MATLAB Interpreter, and what the MATLAB code is doing.

Compiling Fortran external symbols

When compiling fortran code into object files: how does the compiler determine the symbol names?
when I use the intrinsic function "getarg" the compiler converts it into a symbol called "_getarg@12"
I looked in the external libraries and found that the symbol name inside is called "_getarg@16". What is the significance of the "@[number]" at the end of "getarg"?
_name@length is a highly Windows-specific name mangling applied to the names of routines that obey the stdcall (or __stdcall, by the name of the keyword used in C) calling convention, a variant of the Pascal calling convention. This is the calling convention used by all Win32 API functions, and if you look at the export tables of DLLs like KERNEL32.DLL and USER32.DLL you'd see that all symbols are named like this.
The _...@length decoration gives the number of bytes occupied by the routine's arguments. This is necessary since in the stdcall calling convention it is the callee who cleans up the arguments from the stack, and not the caller, as is the case with the C calling convention. When the compiler generates a call to func with two 4-byte arguments, it puts a reference to _func@8 in the object code. If the real func happens to have a different number or size of arguments, its decorated name would be something different, e.g. _func@12, and hence a link error would occur. This is very useful with dynamic libraries (DLLs). Imagine that a DLL was replaced with another version where func takes one additional argument. If it wasn't for the name mangling (the technical term for prepending _ and appending @length to the symbol name), the program would still call into func with the wrong arguments, and then func would increment the stack pointer by more bytes than the size of the passed argument list, thus breaking the caller. With name mangling in place, the loader would not launch the executable at all, since it would not be able to resolve the reference to _func@8.
In your case it looks like the external library is not really intended to be used with this compiler, or you are missing some pragma or compiler option. The getarg intrinsic takes two arguments - one integer and one assumed-size character array (string). Some compilers pass the character array size as an additional argument. With 32-bit code this would result in 2 pointers and 1 integer being passed, totalling 12 bytes of arguments, hence the _getarg@12. The _getarg@16 could be, for example, a 64-bit routine with strings being passed by some kind of descriptor.
As IanH reminded me in his comment, another reason for this naming discrepancy could be that you are calling getarg with fewer arguments than expected. Fortran has this peculiar feature of "prototypeless" routine calls - Fortran compilers can generate calls to routines without actually knowing their signature, unlike in C/C++ where an explicit signature has to be supplied in the form of a function prototype. This is possible since in Fortran all arguments are passed by reference, and pointers are always the same size no matter the actual type they point to. In this particular case the stdcall name mangling plays the role of a very crude argument-checking mechanism. If it wasn't for the mangling (e.g. on Linux with GNU Fortran, where such decorations are not employed, or if the default calling convention was cdecl), one could call a routine with a different number of arguments than expected, and the linker would happily link the object code into an executable that would then most likely crash at run time.
This is totally implementation dependent. You did not say which compiler you use. The (nonstandard) intrinsic can exist in several versions for different integer or character kinds. There can also be several versions of the runtime libraries for different computer architectures (e.g. 32-bit and 64-bit).

Why isn't pass struct by reference a common optimization?

Up until today, I had always thought that decent compilers automatically convert struct pass-by-value to pass-by-reference if the struct is large enough that the latter would be faster. To the best of my knowledge, this seems like a no-brainer optimization. However, to satisfy my curiosity as to whether this actually happens, I created a simple test case in both C++ and D and looked at the output of both GCC and Digital Mars D. Both insisted on passing 32-byte structs by value when all the function in question did was add up the members and return the values, with no modification of the struct passed in. The C++ version is below.
#include <iostream>

struct S {
    int i, j, k, l, m, n, o, p;
};

int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

int main() {
    S s;
    int bar = foo(s);
    std::cout << bar;
}
My question is, why the heck wouldn't something like this be optimized by the compiler to pass-by-reference instead of actually pushing all those ints onto the stack?
Note: Compiler switches used: GCC -O2 (-O3 inlined foo().), DMD -O -inline -release.
Edit: Obviously, in the general case the semantics of pass-by-value vs. pass-by-reference won't be the same, such as if copy constructors are involved or the original struct is modified in the callee. However, in a lot of real-world scenarios, the semantics will be identical in terms of observable behavior. These are the cases I'm asking about.
Don't forget that in C/C++ the compiler needs to be able to compile a call to a function based only on the function declaration.
Given that callers might be using only that information, there's no way for a compiler to compile the function to take advantage of the optimization you're talking about. The caller can't know the function won't modify anything and so it can't pass by ref. Since some callers might pass by value due to lack of detailed information, the function has to be compiled assuming pass-by-value and everybody needs to pass by value.
Note that even if you marked the parameter as 'const', the compiler still can't perform the optimization, because the function could be lying and cast away the constness (this is permitted and well-defined as long as the object being passed in is actually not const).
I think that for static functions (or those in an anonymous namespace), the compiler could possibly make the optimization you're talking about, since the function does not have external linkage. As long as the address of the function isn't passed to some other routine or stored in a pointer, it should not be callable from other code. In this case the compiler could have full knowledge of all callers, so I suppose it could make the optimization.
I'm not sure if any do (actually, I'd be surprised if any do, since it probably couldn't be applied very often).
Of course, as the programmer (when using C++) you can force the compiler to perform this optimization by using const& parameters whenever possible. I know you're asking why the compiler can't do it automatically, but I suppose this is the next best thing.
The problem is you're asking the compiler to make a decision about the intention of user code. Maybe I want my super large struct to be passed by value so that I can do something in the copy constructor. Believe me, someone out there has something they validly need to be called in a copy constructor for just such a scenario. Switching to a by ref will bypass the copy constructor.
Having this be a compiler-generated decision would be a bad idea. The reason is that it makes it impossible to reason about the flow of your code. You can't look at a call and know exactly what it will do. You have to a) know the code and b) guess the compiler optimization.
One answer is that the compiler would need to detect that the called method does not modify the contents of the struct in any way. If it did, then the effect of passing by reference would differ from that of passing by value.
It is true that compilers in some languages could do this if they have access to the function being called and if they can assume that the called function will not be changing. This is sometimes referred to as global optimization and it seems likely that some C or C++ compilers would in fact optimize cases such as this - more likely by inlining the code for such a trivial function.
I think this is definitely an optimization you could implement (under some assumptions, see last paragraph), but it's not clear to me that it would be profitable. Instead of pushing arguments onto the stack (or passing them through registers, depending on the calling convention), you would push a pointer through which you would read values. This extra indirection would cost cycles. It would also require the passed argument to be in memory (so you could point to it) instead of in registers. It would only be beneficial if the records being passed had many fields and the function receiving the record only read a few of them. The extra cycles wasted by indirection would have to make up for the cycles not wasted by pushing unneeded fields.
You may be surprised that the reverse optimization, argument promotion, is actually implemented in LLVM. This converts a reference argument into a value argument (or an aggregate into scalars) for internal functions with small numbers of fields that are only read from. This is particularly useful for languages which pass nearly everything by reference. If you follow this with dead argument elimination, you also don't have to pass fields that aren't touched.
It bears mentioning that optimizations that change the way a function is called can only work when the function being optimized is internal to the module being compiled (you get this by declaring a function static in C and with templates in C++). The optimizer has to fix not only the function but also all the call points. This makes such optimizations fairly limited in scope unless you do them at link time. In addition, the optimization would never be called when a copy constructor is involved (as other posters have mentioned) because it could potentially change the semantics of the program, which a good optimizer should never do.
There are many reasons to pass by value, and having the compiler optimise out your intention may break your code.
Example, if the called function modifies the structure in any way. If you intended the results to be passed back to the caller then you'd either pass a pointer/reference or return it yourself.
What you're asking the compiler to do is change the behaviour of your code, which would be considered a compiler bug.
If you want to make the optimization and pass by reference then by all means modify someone's existing function/method definitions to accept references; it's not all that hard to do. You might be surprised at the breakage you cause without realising it.
Changing from by value to by reference will change the signature of the function. If the function is not static this would cause linking errors for other compilation units which are not aware of the optimization you did.
Indeed the only way to do such an optimization is by some sort of post-link global optimization phase. These are notoriously hard to do yet some compilers do them to some extent.
Pass-by-reference is just syntactic sugar for pass-by-address/pointer. So the function must implicitly dereference a pointer to read the parameter's value. Dereferencing the pointer might be more expensive (if in a loop) than the struct copy for pass-by-value.
More importantly, as others have mentioned, pass-by-reference has different semantics than pass-by-value. A const reference does not mean the referenced value cannot change: other function calls might change the referenced value.
Effectively passing a struct by reference even when the function declaration indicates pass-by-value is a common optimization: it's just that it usually happens indirectly via inlining, so it's not obvious from the generated code.
However, for this to happen, the compiler needs to know that the callee doesn't modify the passed object while it is compiling the caller. Otherwise, it will be restricted by the platform/language ABI which dictates exactly how values are passed to functions.
It can happen even without inlining!
Still, some compilers do implement this optimization even in the absence of inlining, although the circumstances are relatively limited, at least on platforms using the SysV ABI (Linux, OSX, etc) due to the constraints of stack layout. Consider the following simple example, based directly on your code:
__attribute__((noinline))
int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

int bar(S s) {
    return foo(s);
}
Here, at the language level bar calls foo with pass-by-value semantics as required by C++. If we examine the assembly generated by gcc, however, it looks like this:
foo(S):
        mov     eax, DWORD PTR [rsp+12]
        add     eax, DWORD PTR [rsp+8]
        add     eax, DWORD PTR [rsp+16]
        add     eax, DWORD PTR [rsp+20]
        add     eax, DWORD PTR [rsp+24]
        add     eax, DWORD PTR [rsp+28]
        add     eax, DWORD PTR [rsp+32]
        add     eax, DWORD PTR [rsp+36]
        ret
bar(S):
        jmp     foo(S)
Note that bar just directly calls foo, without making a copy: bar will use the same copy of s that was passed to bar (on the stack). In particular it doesn't make any copy as is implied by the language semantics (ignoring as if). So gcc has performed exactly the optimization you requested. Clang doesn't do it though: it makes a copy on the stack which it passes to foo().
Unfortunately, the cases where this can work are fairly limited: SysV requires that these large structures are passed on the stack in a specific position, so such re-use is only possible if the callee expects the object in exactly the same place.
That's possible in the foo/bar example since bar takes its S as the first parameter in the same way as foo, and bar does a tail call to foo, which avoids the implicit return-address push that would otherwise ruin the ability to re-use the stack argument.
For example, if we simply add a + 1 to the call to foo:
int bar(S s) {
    return foo(s) + 1;
}
The trick is ruined, since now the position of bar::s is different than the location foo will expect its s argument, and we need a copy:
bar(S):
        push    QWORD PTR [rsp+32]
        push    QWORD PTR [rsp+32]
        push    QWORD PTR [rsp+32]
        push    QWORD PTR [rsp+32]
        call    foo(S)
        add     rsp, 32
        add     eax, 1
        ret
This doesn't mean that the caller bar() has to be totally trivial though. For example, it could modify its copy of s, prior to passing it along:
int bar(S s) {
    s.i += 1;
    return foo(s);
}
... and the optimization would be preserved:
bar(S):
        add     DWORD PTR [rsp+8], 1
        jmp     foo(S)
In principle, the possibility for this kind of optimization is much greater in the Win64 calling convention, which uses a hidden pointer to pass large structures. This gives a lot more flexibility in reusing existing structures on the stack or elsewhere in order to implement pass-by-reference under the covers.
Inlining
All that aside, however, the main way this optimization happens is via inlining.
For example, at -O2, none of clang, gcc or MSVC makes any copy of the S object1. Both clang and gcc don't really create the object at all, but just calculate the result more or less directly, without even referring to unused fields. MSVC does allocate stack space for a copy, but never uses it: it fills out only one copy of S and reads from that, just like pass-by-reference (MSVC generates much worse code than the other two compilers for this case).
Note that even though foo is inlined into main, the compilers also generate a separate standalone copy of the foo() function, since it has external linkage and so could be called from outside this object file. In this, the compiler is restricted by the application binary interface: the SysV ABI (for Linux) or the Win64 ABI (for Windows) defines exactly how values must be passed, depending on the type and size of the value. Large structures are passed by hidden pointer, and the compiler has to respect that when compiling foo. It also has to respect the ABI when compiling some caller of foo when foo cannot be seen, since it has no idea what foo will do.
So there is very little window for the compiler to make an effective optimization that transforms pass-by-value into pass-by-reference, because:
1) If it can see both the caller and callee (main and foo in your example), it is likely that the callee will be inlined into the caller if it is small enough, and as the function becomes large and non-inlinable, the effect of fixed costs like calling-convention overhead becomes relatively smaller.
2) If the compiler cannot see both the caller and callee at the same time2, it generally has to compile each according to the platform ABI. There is no scope for optimization of the call at the call site since the compiler doesn't know what the callee will do, and there is no scope for optimization within the callee because the compiler has to make conservative assumptions about what the caller did.
1 My example is slightly more complicated than your original one, to avoid the compiler just optimizing everything away entirely (in particular, you access uninitialized memory, so your program doesn't even have defined behavior): I populate a few of the fields of s with argc, which is a value the compiler can't predict.
2 A compiler can see both "at the same time" generally means they are either in the same translation unit or that link-time-optimization is being used.
Well, the trivial answer is that the location of the struct in memory is different, and thus the data you're passing is different. The more complex answer, I think, is threading.
Your compiler would need to detect a) that foo does not modify the struct; b) that foo does not do any calculation on the physical location of the struct elements; AND c) that the caller, or another thread spawned by the caller, doesn't modify the struct before foo is finished running.
In your example, it's conceivable that the compiler could do these things - but the memory saved is inconsequential and probably not worth taking the guess. What happens if you run the same program with a struct that has two million elements?
the compiler would need to be sure that the struct that is passed in (as named in the calling code) is not modified
double x; // using a plain double rather than a struct, oh well

void Foo(double d)
{
    x += d; // ok: with pass-by-value, d is a private copy
    x += d; // Oops: with pass-by-reference, d would alias x and read the updated value
}

int main()
{
    x = 1;
    Foo(x);
    return 0;
}
On many platforms, large structures are in fact passed by reference, but either the caller will be expected to pass a reference to a copy that the function may manipulate as it likes1, or the called function will be expected to make a copy of the structure to which it receives a reference and then perform any manipulations on the copy.
While there are many circumstances in which the copy operations could in fact be omitted, it will often be difficult for a compiler to prove that such operations may be eliminated.
For example, given:
struct FOO { ... };

void func1(struct FOO *foo1);
void func2(struct FOO foo2);

void test(void)
{
    struct FOO foo;
    func1(&foo);
    func2(foo);
}
there is no way a compiler could know whether foo might get modified during the execution of func2 (func1 could have stored a copy of foo1 or a pointer derived from it in a file-scope object which is then used by func2). Such modifications, however, should not affect the copy of foo (i.e. foo2) received by func2. If foo were passed by reference and func2 didn't make a copy, actions that affect foo would improperly affect foo2.
Note that even void func3(const struct FOO); is not meaningful: the callee is allowed to cast away const, and the normal asm calling convention still allows the callee to modify the memory holding the by-value copy.
Unfortunately, there are relatively few cases where examining the caller or called function in isolation would be sufficient to prove that a copy operation may be safely omitted, and there are many cases where even examining both would be insufficient. Thus, replacing pass-by-value with pass-by-reference is a difficult optimization whose payoff is often insufficient to justify the difficulty.
Footnote 1:
For example, Windows x64 passes objects larger than 8 bytes by non-const reference (callee "owns" the pointed-to memory). This doesn't help avoid copying at all; the motivation is to make all function args fit in 8 bytes each so they form an array on the stack (after spilling register args to shadow space), making variadic functions easy to implement.
By contrast, x86-64 System V does what the question describes for objects larger than 16 bytes: copying them to the stack. (Smaller objects are packed into up to two registers.)
