Why isn't pass struct by reference a common optimization? - performance

Up until today, I had always thought that decent compilers automatically convert struct pass-by-value to pass-by-reference if the struct is large enough that the latter would be faster. To the best of my knowledge, this seems like a no-brainer optimization. However, to satisfy my curiosity as to whether this actually happens, I created a simple test case in both C++ and D and looked at the output of both GCC and Digital Mars D. Both insisted on passing 32-byte structs by value when all the function in question did was add up the members and return the values, with no modification of the struct passed in. The C++ version is below.
#include <iostream>
struct S {
    int i, j, k, l, m, n, o, p;
};
int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}
int main() {
    S s; // left uninitialized, as in the original test case
    int bar = foo(s);
    std::cout << bar;
}
My question is, why the heck wouldn't something like this be optimized by the compiler to pass-by-reference instead of actually pushing all those ints onto the stack?
Note: Compiler switches used: GCC -O2 (-O3 inlined foo().), DMD -O -inline -release.
Edit: Obviously, in the general case the semantics of pass-by-value vs. pass-by-reference won't be the same, such as if copy constructors are involved or the original struct is modified in the callee. However, in a lot of real-world scenarios, the semantics will be identical in terms of observable behavior. These are the cases I'm asking about.

Don't forget that in C/C++ the compiler needs to be able to compile a call to a function based only on the function declaration.
Given that callers might be using only that information, there's no way for a compiler to compile the function to take advantage of the optimization you're talking about. The caller can't know the function won't modify anything and so it can't pass by ref. Since some callers might pass by value due to lack of detailed information, the function has to be compiled assuming pass-by-value and everybody needs to pass by value.
Note that even if you marked the parameter as 'const', the compiler still can't perform the optimization, because the function could be lying and cast away the constness (this is permitted and well-defined as long as the object being passed in is actually not const).
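To make the `const` point concrete, here is a minimal sketch (the names `sneaky_sum` and `caller_field_after_call` are hypothetical, not from the question). It shows what would go wrong if a compiler silently turned a `const S` by-value parameter into a `const S&`: a callee that casts away constness would then be writing to the caller's object.

```cpp
#include <cassert>

struct S {
    int i, j, k, l, m, n, o, p;
};

// Hypothetical callee: the signature promises const, the body casts it away.
// This is well-defined only because the caller's object is not itself const.
int sneaky_sum(const S& s) {
    const_cast<S&>(s).i = 100;  // writes to the caller's object
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

// Returns the caller's s.i after the call, showing that the write leaked out.
int caller_field_after_call() {
    S s = {1, 1, 1, 1, 1, 1, 1, 1};
    int sum = sneaky_sum(s);
    assert(sum == 107);  // 100 plus seven 1s
    return s.i;
}
```

With true pass-by-value the caller's `s.i` would still be 1 after the call; here it is 100, which is exactly the observable difference the compiler is not allowed to introduce.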
I think that for static functions (or those in an anonymous namespace), the compiler could possibly make the optimization you're talking about, since the function does not have external linkage. As long as the address of the function isn't passed to some other routine or stored in a pointer, it should not be callable from other code. In this case the compiler could have full knowledge of all callers, so I suppose it could make the optimization.
I'm not sure if any do (actually, I'd be surprised if any do, since it probably couldn't be applied very often).
Of course, as the programmer (when using C++) you can force the compiler to perform this optimization by using const& parameters whenever possible. I know you're asking why the compiler can't do it automatically, but I suppose this is the next best thing.
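As a rough sketch of doing it by hand, the question's `foo` can simply take a const reference. The helper `sum_of_ones` is a hypothetical caller added only for illustration.

```cpp
#include <cassert>

struct S {
    int i, j, k, l, m, n, o, p;
};

// Same body as the question's foo, but the parameter is a const reference,
// so the caller passes an address instead of copying the struct.
int foo(const S& s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

// Hypothetical caller: the call syntax is unchanged from the by-value version.
int sum_of_ones() {
    S s = {1, 1, 1, 1, 1, 1, 1, 1};
    int total = foo(s);
    assert(total == 8);
    return total;
}
```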

The problem is that you're asking the compiler to make a decision about the intention of user code. Maybe I want my super large struct to be passed by value so that I can do something in the copy constructor. Believe me, someone out there has something they validly need to be called in a copy constructor for just such a scenario. Switching to pass-by-reference will bypass the copy constructor.
Having this be a compiler-generated decision would be a bad idea, because it makes it impossible to reason about the flow of your code. You can't look at a call and know exactly what it will do. You have to a) know the code and b) guess the compiler optimization.
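A small sketch of the copy-constructor concern, using a hypothetical `Tracked` type whose copy constructor has an observable side effect (it bumps a counter). None of these names come from the question.

```cpp
#include <cassert>

struct Tracked {
    static int copies;  // counts copy-constructor calls
    int payload[8];
    Tracked() : payload{} {}
    Tracked(const Tracked& other) {
        for (int n = 0; n < 8; ++n) payload[n] = other.payload[n];
        ++copies;  // the observable side effect
    }
};
int Tracked::copies = 0;

int by_value(Tracked t) { return t.payload[0]; }
int by_reference(const Tracked& t) { return t.payload[0]; }

// Number of copy-constructor calls caused by one pass-by-value call.
int copies_made_by_value_call() {
    Tracked t;
    int before = Tracked::copies;
    int first = by_value(t);
    assert(first == 0);
    return Tracked::copies - before;
}

// Number of copy-constructor calls caused by one pass-by-reference call.
int copies_made_by_reference_call() {
    Tracked t;
    int before = Tracked::copies;
    int first = by_reference(t);
    assert(first == 0);
    return Tracked::copies - before;
}
```

The by-value call runs the copy constructor once and the by-reference call never runs it, so swapping one for the other changes what the program does.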

One answer is that the compiler would need to detect that the called method does not modify the contents of the struct in any way. If it did, then the effect of passing by reference would differ from that of passing by value.

It is true that compilers in some languages could do this if they have access to the function being called and if they can assume that the called function will not be changing. This is sometimes referred to as global optimization and it seems likely that some C or C++ compilers would in fact optimize cases such as this - more likely by inlining the code for such a trivial function.

I think this is definitely an optimization you could implement (under some assumptions, see last paragraph), but it's not clear to me that it would be profitable. Instead of pushing arguments onto the stack (or passing them through registers, depending on the calling convention), you would push a pointer through which you would read values. This extra indirection would cost cycles. It would also require the passed argument to be in memory (so you could point to it) instead of in registers. It would only be beneficial if the records being passed had many fields and the function receiving the record only read a few of them. The extra cycles wasted by indirection would have to make up for the cycles not wasted by pushing unneeded fields.
You may be surprised that the reverse optimization, argument promotion, is actually implemented in LLVM. This converts a reference argument into a value argument (or an aggregate into scalars) for internal functions with small numbers of fields that are only read from. This is particularly useful for languages which pass nearly everything by reference. If you follow this with dead argument elimination, you also don't have to pass fields that aren't touched.
It bears mentioning that optimizations that change the way a function is called can only work when the function being optimized is internal to the module being compiled (you get this by declaring a function static in C and with templates in C++). The optimizer has to fix not only the function but also all the call points. This makes such optimizations fairly limited in scope unless you do them at link time. In addition, the optimization would never be called when a copy constructor is involved (as other posters have mentioned) because it could potentially change the semantics of the program, which a good optimizer should never do.
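To illustrate the shape of argument promotion, here is a hand-written before/after pair. This is not LLVM output, only a sketch of the idea; `Pair`, `sum_by_ref` and `sum_promoted` are hypothetical names. Both helpers live in an anonymous namespace, so they are internal to the translation unit and every call site is visible.

```cpp
#include <cassert>

struct Pair {
    int a;
    int b;
    int unused;  // never read by the helper
};

namespace {
// Before: internal helper takes a reference but only reads two fields.
int sum_by_ref(const Pair& p) { return p.a + p.b; }

// After (conceptually): the read fields are passed as scalars and the
// untouched field is no longer passed at all.
int sum_promoted(int a, int b) { return a + b; }
}  // namespace

int call_original(const Pair& p) { return sum_by_ref(p); }

// The call site has to be rewritten together with the callee.
int call_promoted(const Pair& p) {
    int r = sum_promoted(p.a, p.b);
    assert(r == sum_by_ref(p));  // the transformation must not change the result
    return r;
}
```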

There are many reasons to pass by value, and having the compiler optimise out your intention may break your code.
For example, the called function may modify the structure in some way. If you intended the results to be passed back to the caller, you would either pass a pointer/reference or return the structure yourself.
What you're asking the compiler to do is change the behaviour of your code, which would be considered a compiler bug.
If you want to make the optimization and pass by reference then by all means modify someone's existing function/method definitions to accept references; it's not all that hard to do. You might be surprised at the breakage you cause without realising it.

Changing from by value to by reference will change the signature of the function. If the function is not static this would cause linking errors for other compilation units which are not aware of the optimization you did.
Indeed the only way to do such an optimization is by some sort of post-link global optimization phase. These are notoriously hard to do yet some compilers do them to some extent.

Pass-by-reference is just syntactic sugar for pass-by-address/pointer, so the function must implicitly dereference a pointer to read the parameter's value. Dereferencing the pointer might be more expensive (if in a loop) than the struct copy for pass-by-value.
More importantly, as others have mentioned, pass-by-reference has different semantics than pass-by-value. A const reference does not mean the referenced value cannot change: other function calls might change the referenced value.

Effectively passing a struct by reference even when the function declaration indicates pass-by-value is a common optimization: it's just that it usually happens indirectly via inlining, so it's not obvious from the generated code.
However, for this to happen, the compiler needs to know that the callee doesn't modify the passed object while it is compiling the caller. Otherwise, it will be restricted by the platform/language ABI, which dictates exactly how values are passed to functions.
It can happen even without inlining!
Still, some compilers do implement this optimization even in the absence of inlining, although the circumstances are relatively limited, at least on platforms using the SysV ABI (Linux, OSX, etc) due to the constraints of stack layout. Consider the following simple example, based directly on your code:
__attribute__((noinline))
int foo(S s) {
return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}
int bar(S s) {
return foo(s);
}
Here, at the language level bar calls foo with pass-by-value semantics as required by C++. If we examine the assembly generated by gcc, however, it looks like this:
foo(S):
mov eax, DWORD PTR [rsp+12]
add eax, DWORD PTR [rsp+8]
add eax, DWORD PTR [rsp+16]
add eax, DWORD PTR [rsp+20]
add eax, DWORD PTR [rsp+24]
add eax, DWORD PTR [rsp+28]
add eax, DWORD PTR [rsp+32]
add eax, DWORD PTR [rsp+36]
ret
bar(S):
jmp foo(S)
Note that bar just directly calls foo, without making a copy: bar will use the same copy of s that was passed to bar (on the stack). In particular it doesn't make any copy as is implied by the language semantics (ignoring as if). So gcc has performed exactly the optimization you requested. Clang doesn't do it though: it makes a copy on the stack which it passes to foo().
Unfortunately, the cases where this can work are fairly limited: SysV requires that these large structures are passed on the stack in a specific position, so such re-use is only possible if callee expects the object in the exact same place.
That's possible in the foo/bar example since bar takes its S as the first parameter in the same way as foo, and bar does a tail call to foo, which avoids the need for the implicit return-address push that would otherwise ruin the ability to re-use the stack argument.
For example, if we simply add a + 1 to the call to foo:
int bar(S s) {
return foo(s) + 1;
}
The trick is ruined, since now the position of bar::s is different than the location foo will expect its s argument, and we need a copy:
bar(S):
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
call foo(S)
add rsp, 32
add eax, 1
ret
This doesn't mean that the caller bar() has to be totally trivial though. For example, it could modify its copy of s, prior to passing it along:
int bar(S s) {
s.i += 1;
return foo(s);
}
... and the optimization would be preserved:
bar(S):
add DWORD PTR [rsp+8], 1
jmp foo(S)
In principle, the possibility for this kind of optimization is much greater in the Win64 calling convention, which uses a hidden pointer to pass large structures. This gives a lot more flexibility in reusing existing structures on the stack or elsewhere in order to implement pass-by-reference under the covers.
Inlining
All that aside, however, the main way this optimization happens is via inlining.
For example, at -O2 none of clang, gcc and MSVC makes a copy of the S object [1]. Both clang and gcc don't really create the object at all, but just calculate the result more or less directly without even referring to unused fields. MSVC does allocate stack space for a copy, but never uses it: it fills out only one copy of S and reads from that, just like pass-by-reference (MSVC generates much worse code than the other two compilers for this case).
Note that even though foo is inlined into main, the compilers also generate a separate standalone copy of the foo() function, since it has external linkage and so could be called from outside this object file. In this, the compiler is restricted by the application binary interface: the SysV ABI (for Linux) or Win64 ABI (for Windows) defines exactly how values must be passed, depending on the type and size of the value. Large structures are passed on the stack (SysV) or by hidden pointer (Win64), and the compiler has to respect that when compiling foo. It also has to respect the ABI when compiling a caller of foo that cannot see foo, since it has no idea what foo will do.
So there is very little window for the compiler to make an effective optimization that transforms pass-by-value to pass-by-reference, because:
1) If it can see both the caller and callee (main and foo in your example), it is likely that the callee will be inlined into the caller if it is small enough, and as the function becomes large and not inlinable, the effect of fixed costs like calling convention overhead becomes relatively smaller.
2) If the compiler cannot see both the caller and callee at the same time [2], it generally has to compile each according to the platform ABI. There is no scope for optimization of the call at the call site since the compiler doesn't know what the callee will do, and there is no scope for optimization within the callee because the compiler has to make conservative assumptions about what the caller did.
[1] My example is slightly more complicated than your original one to avoid the compiler just optimizing everything away entirely (in particular, you access uninitialized memory, so your program doesn't even have defined behavior): I populate a few of the fields of s with argc, which is a value the compiler can't predict.
[2] Saying a compiler can see both "at the same time" generally means they are either in the same translation unit or that link-time optimization is being used.
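The answer does not show its modified test program, so the following is only a guess at its shape. The helper name `run` and the choice of fields `i` and `k` are assumptions. In a real program, `main` would pass `argc` as the unpredictable value.

```cpp
#include <cassert>

struct S {
    int i, j, k, l, m, n, o, p;
};

int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

// Hypothetical stand-in for main: `unpredictable` plays the role of argc,
// a value the compiler cannot know at compile time.
int run(int unpredictable) {
    S s = {};                  // zero-initialized, so behavior is defined
    s.i = unpredictable;       // populate a few fields from the runtime value
    s.k = unpredictable * 2;
    int result = foo(s);
    assert(result == 3 * unpredictable);  // i + k, all other fields are zero
    return result;
}
```

Compiling something like this at -O2 and reading the assembly is one way to check whether a given compiler materializes a copy of `s` or folds the sum directly.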

Well, the trivial answer is that the location of the struct in memory is different, and thus the data you're passing is different. The more complex answer, I think, is threading.
Your compiler would need to detect a) that foo does not modify the struct; b) that foo does not do any calculation on the physical location of the struct elements; AND c) that the caller, or another thread spawned by the caller, doesn't modify the struct before foo is finished running.
In your example, it's conceivable that the compiler could do these things - but the memory saved is inconsequential and probably not worth taking the guess. What happens if you run the same program with a struct that has two million elements?

The compiler would need to be sure that the struct that is passed in (as named in the calling code) is not modified during the call:
double x; // using non-structs, oh well
void Foo(double d)
{
    x += d; // ok
    x += d; // Oops: if d were silently a reference to x, d would already have changed here
}
int main()
{
    x = 1;
    Foo(x);
}
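The difference can be made visible by writing both versions out. This is a sketch with hypothetical function names; the global `x` mirrors the example above.

```cpp
#include <cassert>

double x;  // global that the callee modifies

// True pass-by-value: d is a private copy and stays 1 throughout.
void foo_by_value(double d) {
    x += d;
    x += d;
}

// What a silent conversion to pass-by-reference would produce:
// d aliases x, so the first += also changes d.
void foo_by_reference(const double& d) {
    x += d;
    x += d;
}

double result_by_value() {
    x = 1;
    foo_by_value(x);
    assert(x == 3);  // 1 + 1 + 1
    return x;
}

double result_by_reference() {
    x = 1;
    foo_by_reference(x);
    assert(x == 4);  // 1 + 1 = 2, then 2 + 2
    return x;
}
```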

On many platforms, large structures are in fact passed by reference, but either the caller will be expected to pass a reference to a copy that the function may manipulate as it likes (see footnote 1), or the called function will be expected to make a copy of the structure to which it receives a reference and then perform any manipulations on the copy.
While there are many circumstances in which the copy operations could in fact be omitted, it will often be difficult for a compiler to prove that such operations may be eliminated.
For example, given:
struct FOO { ... };
void func1(struct FOO *foo1);
void func2(struct FOO foo2);
void test(void)
{
struct FOO foo;
func1(&foo);
func2(foo);
}
there is no way a compiler could know whether foo might get modified during the execution of func2 (func1 could have stored a copy of foo1 or a pointer derived from it in a file-scope object which is then used by func2). Such modifications, however, should not affect the copy of foo (i.e. foo2) received by func2. If foo were passed by reference and func2 didn't make a copy, actions that affect foo would improperly affect foo2.
Note that even void func3(const struct FOO); is not meaningful: the callee is allowed to cast away const, and the normal asm calling convention still allows the callee to modify the memory holding the by-value copy.
Unfortunately, there are relatively few cases where examining the caller or called function in isolation would be sufficient to prove that a copy operation may be safely omitted, and there are many cases where even examining both would be insufficient. Thus, replacing pass-by-value with pass-by-reference is a difficult optimization whose payoff is often insufficient to justify the difficulty.
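The func1/func2 scenario can be spelled out as a runnable sketch. The body of `FOO` is elided in the example above, so the single `value` field, the file-scope pointer `saved` and the wrapper `run_test` are assumptions made for illustration.

```cpp
#include <cassert>

struct FOO {
    int value;  // assumed field; the original leaves the body unspecified
};

FOO* saved = nullptr;  // file-scope pointer shared by func1 and func2

// func1 remembers where the caller's object lives.
void func1(FOO* foo1) { saved = foo1; }

// func2 changes the caller's object through the saved pointer, then reads
// its own by-value parameter, which must be unaffected.
int func2(FOO foo2) {
    saved->value = 99;
    return foo2.value;
}

int run_test() {
    FOO foo = {1};
    func1(&foo);
    int seen = func2(foo);
    assert(foo.value == 99);  // the caller's object did change
    return seen;              // but func2 saw the value at call time
}
```

If `foo` had been passed by reference without a copy, `func2` would have returned 99 instead of 1.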
Footnote 1:
For example, Windows x64 passes objects larger than 8 bytes by non-const reference (callee "owns" the pointed-to memory). This doesn't help avoid copying at all; the motivation is to make all function args fit in 8 bytes each so they form an array on the stack (after spilling register args to shadow space), making variadic functions easy to implement.
By contrast, x86-64 System V does what the question describes for objects larger than 16 bytes: copying them to the stack. (Smaller objects are packed into up to two registers.)

Related

Can register name be passed into assembly template in GCC inline assembly [duplicate]

I have recently started learning how to use the inline assembly in C Code and came across an interesting feature where you can specify registers for local variables (https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables).
The usage of this feature is as follows:
register int *foo asm ("r12");
Then I started to wonder whether it was possible to insert a char pointer such as
const char d[4] = "r12";
register int *foo asm (d);
but got the error: expected string literal before ‘d’ (as expected)
I can understand why this would be a bad practice, but is there any possible way to achieve a similar effect where I can use a char pointer to access the register? If not, is there any particular reason why this is not allowed besides the potential security issues?
Additionally, I read this StackOverflow question: String literals: pointer vs. char array
Thank you.
The syntax to initialize the variable would be register char *foo asm ("r12") = d; to point an asm-register variable at a string. You can't use a runtime-variable string as the register name; register choices have to get assembled into machine code at compile time.
If that's what you're trying to do, you're misunderstanding something fundamental about assembly language and/or how ahead-of-time compiled languages compile into machine code. GCC won't make self-modifying code (and even if it wanted to, doing that safely would require redoing register allocation done by the ahead-of-time optimizer), or code that re-JITs itself based on a string.
(The first time I looked at your question, I didn't understand what you were even trying to do, because I was only considering things that are possible. @FelixG's comment was the clue I needed to make sense of the question.)
(Also note that registers aren't indexable; even in asm you can't use a single instruction to read a register number selected by an integer in another register. You could branch on it, or store all the registers in memory and index that like variadic functions do for their incoming register args.)
And if you do want a compile-time constant string literal, just use it with the normal syntax. Use a CPP macro if you want the same string to initialize a char array.
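A minimal sketch of the macro approach, assuming GCC on x86-64 for the register part (other compilers and targets fall back to the plain array). `REG_NAME`, `reg_name_text` and `register_name` are hypothetical names. The macro expands to a string literal, so the same text can serve as the register name and as the array initializer.

```cpp
#include <cassert>
#include <cstring>

// One compile-time string literal, reused in both places.
#define REG_NAME "r12"

const char reg_name_text[] = REG_NAME;  // char array holding the same text

const char* register_name() {
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
    // GNU explicit register variable: the name must be a string literal,
    // which the macro provides at compile time.
    register const char* foo asm(REG_NAME) = reg_name_text;
    assert(std::strcmp(foo, REG_NAME) == 0);
    return foo;
#else
    // Assumption: elsewhere, skip the register binding and return the text.
    return reg_name_text;
#endif
}
```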

Make-array in SBCL

How does make-array work in SBCL? Are there some equivalents of new and delete operators in C++, or is it something else, perhaps assembler level?
I peeked into the source, but didn't understand anything.
When using SBCL compiled from source and an environment like Emacs/Slime, it is possible to navigate the code quite easily using M-. (meta-point). Basically, the make-array symbol is bound to multiple things: deftransform definitions, and a defun. The deftransform are used mostly for optimization, so better just follow the function, first.
The make-array function delegates to an internal make-array% one, which is quite complex: it checks the parameters, and dispatches to different specialized implementation of arrays, based on those parameters: a bit-vector is implemented differently than a string, for example.
If you follow the case for simple-array, you find a function which calls allocate-vector-with-widetag, which in turn calls allocate-vector.
Now, allocate-vector is bound to several objects, multiple defoptimizers forms, a function and a define-vop form.
The function is only:
(defun allocate-vector (type length words)
(allocate-vector type length words))
Even if it looks like a recursive call, it isn't.
The define-vop form is a way to define how to compile a call to allocate-vector. In the function, and anywhere where there is a call to allocate-vector, the compiler knows how to write the assembly that implements the built-in operation. But the function itself is defined so that there is an entry point with the same name, and a function object that wraps over that code.
define-vop relies on a Domain Specific Language in SBCL that abstracts over assembly. If you follow the definition, you can find different vops (virtual operations) for allocate-vector, like allocate-vector-on-heap and allocate-vector-on-stack.
Allocation on heap translates into a call to calc-size-in-bytes, a call to allocation and put-header, which most likely allocate memory and tag it (I followed the definition to src/compiler/x86-64/alloc.lisp).
How memory is allocated (and garbage collected) is another problem.
allocation emits assembly code using %alloc-tramp, which in turn executes the following:
(invoke-asm-routine 'call (if to-r11 'alloc-tramp-r11 'alloc-tramp) node)
There are apparently assembly routines called alloc-tramp-r11 and alloc-tramp, which are predefined assembly instructions. A comment says:
;;; Most allocation is done by inline code with sometimes help
;;; from the C alloc() function by way of the alloc-tramp
;;; assembly routine.
There is a base of C code for the runtime, see for example /src/runtime/alloc.c.
The -tramp suffix stands for trampoline.
Also have a look at src/runtime/x86-assem.S.

gcc/clang: How to force ordering of items on the stack?

Consider the following code:
int a;
int b;
Is there a way to force that a precedes b on the stack?
One way to do the ordering would be to put b in a function:
void foo() {
int b;
}
...
int a;
foo();
However, that would generally work only if b isn't inlined.
Maybe there's a different way to do that? Putting an inline assembler between the two declarations may do a trick, but I am not sure.
Your initial question was about forcing a function call to not be inlined.
To improve on Jordy Baylac's answer, you might try to declare the function within the block calling it, and perhaps use a statement expr:
#define FOO_WITHOUT_INLINING(c,i) ({ \
extern int foo (char, int) __attribute__((noinline)); \
int r = foo(c,i); \
r; })
(If the type of foo is unknown, you could use typeof)
However, I still think that your question is badly formulated (and is meaningless if one avoids reading your comments, which should really go inside the question, which should have mentioned your libmill). By definition of inlining, a compiler can inline any function as it wants without changing the semantics of the program.
For example, a user of your library might legitimately compile it with -flto -O2 (both at compiling and at linking stage). I don't know what would happen then.
I believe you might redesign your code, perhaps using -fsplit-stack; are you implementing some call/cc in C? Then look inside the numerous existing implementations of it, and inside Gabriel Kerneis CPC.... See also setcontext(3) & longjmp(3)
Perhaps you might need to use somewhere the return_twice (and/or nothrow) function attribute of GCC, or some _Pragma like GCC optimize
Then you edited your question to change it completely (asking about order of variables on the call stack), still without mentioning in the question your libmill and its go macro (as you should; comments are volatile so should not contain most of the question).
But the C compiler is not even supposed to have a call stack in the compiled program (a hypothetical C99-conforming compiler could do whole-program optimization to avoid any call stack). And GCC is certainly allowed to put some variables outside of the call stack (e.g. only in registers), and it does so. And some implementations (IA64 probably) have two call stacks.
So your changed question is completely meaningless: a variable might not sit on the stack (e.g. it may only be in a register, or even disappear completely if the compiler can prove it is useless after some other optimizations), and the compiler is allowed to optimize and use the same call stack slot for two variables (and GCC does such an optimization quite often). So you cannot force any order on the call stack layout.
If you need to be sure that two local variables a & b have some well defined order on the call stack, make them into a struct e.g.
struct { int _a, _b; } _locals;
#define a _locals._a
#define b _locals._b
Then be sure to put &_locals somewhere (e.g. in a volatile global or thread-local variable), since some versions of GCC (IIRC 4.8 or 4.7) had optimization passes that reorder the fields of non-escaping structs.
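A sketch of the struct approach, written so the ordering can be checked. `Locals`, `escaped_locals` and `a_precedes_b` are hypothetical names; the global pointer stands in for the "volatile global" the answer mentions.

```cpp
#include <cassert>
#include <cstddef>

struct Locals {
    int a;
    int b;
};

// Members of a struct are laid out in declaration order.
static_assert(offsetof(Locals, a) < offsetof(Locals, b),
              "a must be laid out before b");

// Hypothetical sink: storing the address here makes the struct escape.
void* volatile escaped_locals = nullptr;

bool a_precedes_b() {
    Locals locals = {1, 2};
    escaped_locals = &locals;              // let the address escape
    bool ordered = &locals.a < &locals.b;  // compare addresses in the object
    escaped_locals = nullptr;              // avoid leaving a dangling pointer
    return ordered;
}
```

This fixes the relative order of `a` and `b` within the struct. It still says nothing about where the struct itself sits relative to other locals.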
BTW, you might customize GCC with your MELT extension to help about that (e.g. introduce your own builtin or pragma doing part of the work).
Apparently, you are inventing some new dialect of C (à la CPC); then you should say that!
Below is a way, using gcc attributes:
char foo (char, int) __attribute__ ((noinline));
And, as I said, you can try the -fno-inline-functions option, but it applies to all functions in the compilation.
It is still unclear to me why you want the function not to be inlined, but here is the non-pro solution I am proposing:
You can make this function in separate object something.o file.
Since you will include header only, there will be no way for the compiler to inline the function.
However linker might decide to inline it later at linking time.

htonl and ntohl have same address in windows?

I rely on GetProcAddress() to do some hooking of some functions. I get a terrible result though, and to be honest, I don't really know what's happening.
It seems that this code will output "WHAT THE HELL?":
#include <windows.h>
#include <stdio.h>
int main(void)
{
    HMODULE ws32 = LoadLibrary("WS2_32.DLL");
    if (GetProcAddress(ws32, "ntohl") == GetProcAddress(ws32, "htonl"))
        printf("WHAT THE HELL\n");
    return 0;
}
Could someone explain to me why ntohl and htonl have the same absolute addresses?
The problem is that, when I hook into a DLL and do some processing inside the DLL, it's clear that inside the PE import table (I parse the PE IAT), ntohl() and htonl() have different addresses (inside the IAT, as said).
So this renders my program useless. ntohl() is confused with htonl(), and the program cannot tell the difference and goes crazy.
Any thoughts? Would there be a way to circumvent this? An explanation?
Sure, why not. All the ntohl and htonl functions do, on a little-endian platform, is reverse the individual bytes in an integer. There's no need for those two functions to be implemented differently, so you do not need to worry that GetProcAddress() returns the same function pointer.
Of course, you'd want to verify that GetProcAddress doesn't return a NULL pointer.
Unless you're on a mixed-endian platform like PDP-11, converting from host to network endianness or vice versa is simply a byte swap or a NOP, so applying ntohl or htonl to an integer results in the same output. Therefore the linker may choose to use the same function for both names.
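To see why the two names can share one body, here is a portable sketch with hand-written stand-ins. `my_htonl`, `my_ntohl` and the helpers are hypothetical; they are not the Winsock implementations. Both conversions come out as the same code: a byte swap on a little-endian host and a no-op on a big-endian one.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Detect host byte order by looking at the first byte of the value 1.
bool host_is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the four bytes of a 32-bit value.
std::uint32_t swap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Host to network order (network order is big-endian).
std::uint32_t my_htonl(std::uint32_t host) {
    return host_is_little_endian() ? swap32(host) : host;
}

// Network to host order: textually the same operation as my_htonl.
std::uint32_t my_ntohl(std::uint32_t net) {
    return host_is_little_endian() ? swap32(net) : net;
}

// Converting there and back must give the original value.
std::uint32_t round_trip(std::uint32_t host) {
    std::uint32_t back = my_ntohl(my_htonl(host));
    assert(back == host);
    return back;
}
```

Since the two function bodies are identical, a toolchain that merges identical functions can legitimately give both names one address.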
It's unclear why you want to differentiate those functions, but it's completely unreliable. Smart compilers know to convert htonl and ntohl to a byte swap if necessary, resulting in zero function call. The user can also call compiler intrinsics such as _byteswap_ulong() or __builtin_bswap32() directly. In those cases how can you hook the function?
That being said, it's likely because of COMDAT folding optimization which merges identical functions. To disable it use the /OPT:NOICF flag
See also
Do distinct functions have distinct addresses?
Why do two functions have the same address?

Using the "naked" attribute for functions in GCC

GCC documentation states in 6.30 Declaring Attributes of Functions:
naked
Use this attribute on the ARM, AVR, IP2K, RX and SPU ports to indicate that the specified function does not need prologue/epilogue sequences generated by the compiler. It is up to the programmer to provide these sequences. The only statements that can be safely included in naked functions are asm statements that do not have operands. All other statements, including declarations of local variables, if statements, and so forth, should be avoided. Naked functions should be used to implement the body of an assembly function, while allowing the compiler to construct the requisite function declaration for the assembler.
Can I safely call functions using C syntax from naked functions, or only by using asm?
You can safely call functions from a naked function, provided that the called functions have a full prologue and epilogue.
Note that it is a bit of a nonsense to assert that you can 'safely' use assembly language in a naked function. You are entirely responsible for anything you do using assembly language, as you are for any calls you make to 'safe' functions.
To ensure that your generic called function is not static or inlined, it should be in a separate compilation unit.
"naked" functions do not include any prologue or epilogue -- they are naked. In particular, they do not include operations on the stack for local variables, to save or restore registers, or to return to a calling function.
That does not mean that no stack exists: the stack is initialized during program initialization, not in any function initialization. Since a stack exists, called function prologues and epilogues work correctly. A function call can safely push its return address, any registers used, and space for any local variables. On return (using the return address), the registers are restored and the stack space is released.
Static or inlined-functions may not have a full prologue and epilogue. They can and may depend on the calling function to manage the stack and to restore corrupted registers.
This leads to the next point: you need the prologue and epilogue only to encapsulate the operations of the called function. If the called function is also safe (no explicit or implicit local variables, no changes to status registers), it can be safely static and/or inlined. As with asm, it would be your responsibility to make sure this is true.
If the only thing you do in the naked function is call another function you can just use a single JMP machine code instruction.
The function you jump to will have a valid prologue and it should return directly to the caller of the naked function since JMP doesn't push a return address on the stack.
The only statements that can be safely included in naked functions are asm statements that do not have operands. All other statements, including declarations of local variables, if statements, and so forth, should be avoided.
Based on the description you already gave, I would assume that even function calls are not suitable for the "naked" keyword.

Resources