gcc/clang: How to force ordering of items on the stack?

Consider the following code:
int a;
int b;
Is there a way to force that a precedes b on the stack?
One way to do the ordering would be to put b in a function:
void foo() {
int b;
}
...
int a;
foo();
However, that would generally work only if b isn't inlined.
Maybe there's a different way to do that? Putting inline assembly between the two declarations might do the trick, but I am not sure.

Your initial question was about forcing a function call to not be inlined.
To improve on Jordy Baylac's answer, you might try to declare the function within the block calling it, and perhaps use a statement expr:
#define FOO_WITHOUT_INLINING(c,i) ({ \
extern int foo (char, int) __attribute__((noinline)); \
int r = foo(c,i); \
r; })
(If the type of foo is unknown, you could use typeof)
However, I still think that your question is badly formulated (and is meaningless if one does not read your comments, which should really go inside the question; the question should have mentioned your libmill). By definition of inlining, a compiler can inline any function it wants, as long as that does not change the semantics of the program.
For example, a user of your library might legitimately compile it with -flto -O2 (both at compiling and at linking stage). I don't know what would happen then.
I believe you might redesign your code, perhaps using -fsplit-stack; are you implementing some call/cc in C? Then look inside the numerous existing implementations of it, and inside Gabriel Kerneis' CPC. See also setcontext(3) & longjmp(3).
Perhaps you might need to use the returns_twice (and/or nothrow) function attribute of GCC somewhere, or some _Pragma such as GCC optimize.
Then you edited your question to change it completely (asking about order of variables on the call stack), still without mentioning in the question your libmill and its go macro (as you should; comments are volatile so should not contain most of the question).
But a C implementation is not even required to have a call stack in the compiled program (a hypothetical C99-conforming compiler could do whole-program optimization to avoid any call stack). And GCC is certainly allowed to put some variables outside of the call stack (e.g. only in registers), and it does that. And some implementations (IA64 probably) have two call stacks.
So your changed question is completely meaningless: a variable might not sit on the stack (e.g. it may live only in a register, or even disappear completely if the compiler can prove it is useless after some other optimizations), and the compiler is allowed to optimize and use the same call stack slot for two variables (GCC does this optimization quite often). So you cannot force any order on the call stack layout.
If you need to be sure that two local variables a & b have some well defined order on the call stack, make them into a struct e.g.
struct { int _a, _b; } _locals;
#define a _locals._a
#define b _locals._b
Then be sure to put &_locals somewhere (e.g. in a volatile global or thread-local variable), so that the struct escapes. This matters because some versions of GCC (IIRC 4.8 or 4.7) had optimization passes that reorder the fields of non-escaping structs.
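A minimal sketch of that idea, written as C++ so it can be checked quickly. The names Locals, g_locals_escape and a_precedes_b are made up for illustration and are not from libmill. It only shows that the two members keep their declared relative order in memory once the struct's address has escaped. How that maps to "earlier on the stack" still depends on the direction in which the stack grows on your target.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names: Locals, g_locals_escape and a_precedes_b are illustrative only.
struct Locals {
    int a;
    int b;
};

// Publishing the address through a volatile global makes the struct "escape",
// so the compiler has to keep it in memory with its declared layout.
Locals* volatile g_locals_escape = nullptr;

bool a_precedes_b() {
    Locals locals = {1, 2};
    g_locals_escape = &locals;

    // Compare the member addresses as integers.
    std::uintptr_t addr_a = reinterpret_cast<std::uintptr_t>(&locals.a);
    std::uintptr_t addr_b = reinterpret_cast<std::uintptr_t>(&locals.b);

    g_locals_escape = nullptr;  // do not leave a dangling pointer behind
    return addr_a < addr_b;
}
```

Members of a struct are laid out in declaration order, so a sits at a lower address than b regardless of what the optimizer does with other locals.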
BTW, you might customize GCC with your MELT extension to help with that (e.g. introduce your own builtin or pragma doing part of the work).
Apparently, you are inventing some new dialect of C (à la CPC); then you should say that!

Below is a way, using GCC attributes:
char foo (char, int) __attribute__ ((noinline));
And, as I said, you can try the -fno-inline-functions option, but it applies to all functions in the compilation.
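For completeness, here is a small sketch of how the attribute might be wrapped and used. The NOINLINE macro, the body of foo and the helper call_foo are assumptions added for illustration. Only the signature char foo(char, int) and the attribute itself come from the line above.

```cpp
#include <cassert>

// NOINLINE is a hypothetical convenience macro; it expands to nothing
// on compilers that do not understand GNU attributes.
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Declaration carrying the attribute, as in the answer above.
NOINLINE char foo(char c, int i);

// Illustrative body (an assumption): shift the character by i.
char foo(char c, int i) {
    return static_cast<char>(c + i);
}

// The call below stays a real call on GCC and Clang, even at -O2.
char call_foo() {
    return foo('a', 2);
}
```

The attribute only stops the inliner. It does not stop other interprocedural optimizations, so it is not a guarantee about stack layout.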

It is still unclear to me why you want the function not to be inlined, but here is the non-pro solution I am proposing:
You can put this function in a separate object file, something.o.
Since you will include only the header, there will be no way for the compiler to inline the function.
However, the linker might decide to inline it later, at link time.

Related

Can register name be passed into assembly template in GCC inline assembly [duplicate]

I have recently started learning how to use the inline assembly in C Code and came across an interesting feature where you can specify registers for local variables (https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables).
The usage of this feature is as follows:
register int *foo asm ("r12");
Then I started to wonder whether it was possible to insert a char pointer such as
const char d[4] = "r12";
register int *foo asm (d);
but got the error: expected string literal before ‘d’ (as expected)
I can understand why this would be a bad practice, but is there any possible way to achieve a similar effect where I can use a char pointer to access the register? If not, is there any particular reason why this is not allowed besides the potential security issues?
Additionally, I read this StackOverflow question: String literals: pointer vs. char array
Thank you.
The syntax to initialize the variable would be register char *foo asm ("r12") = d; to point an asm-register variable at a string. You can't use a runtime-variable string as the register name; register choices have to get assembled into machine code at compile time.
If that's what you're trying to do, you're misunderstanding something fundamental about assembly language and/or how ahead-of-time compiled languages compile into machine code. GCC won't make self-modifying code (and even if it wanted to, doing that safely would require redoing register allocation done by the ahead-of-time optimizer), or code that re-JITs itself based on a string.
(The first time I looked at your question, I didn't understand what you were even trying to do, because I was only considering things that are possible. @FelixG's comment was the clue I needed to make sense of the question.)
(Also note that registers aren't indexable; even in asm you can't use a single instruction to read a register number selected by an integer in another register. You could branch on it, or store all the registers in memory and index that like variadic functions do for their incoming register args.)
And if you do want a compile-time constant string literal, just use it with the normal syntax. Use a CPP macro if you want the same string to initialize a char array.
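A rough sketch of the macro approach, under some assumptions: REG_NAME, reg_name_text and read_through_named_register are made-up names, and the register variable is only compiled with GCC on x86-64, because the register name and the extension are target-specific. On other setups the function simply returns the same constant.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical macro: one string literal, reused in two places.
#define REG_NAME "r12"

// The same literal initializes an ordinary char array...
const char reg_name_text[] = REG_NAME;

long read_through_named_register() {
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
    // ...and, after preprocessing, is still a string literal here,
    // which is what the asm register syntax requires.
    register long value asm(REG_NAME) = 5;
    return value;
#else
    // Fallback for other compilers/targets: no register variable at all.
    return 5;
#endif
}
```

The preprocessor does the substitution before the compiler proper sees the code, so the register name is fixed at compile time, as it has to be.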

Why casting interface{} to cgo types is not allowed?

It's more of a technical question, not really an issue. Since we don't have variadic functions in cgo and there's currently no valid solution, I wonder if it'd be possible to cast interface{} to cgo types. This would allow us to have more dynamic functions. I'm pretty sure we're not even allowed to assign types in a dynamic way to arguments in exported (//export) functions, nor is the use of ellipsis allowed. So what's the reason behind all those limits?
Thanks for answering.
import "C"
//export Foo
func Foo(arg1, arg2, arg3) {
}
C compilers are allowed, but not required, to return different types using different return mechanisms. For instance, some C compilers might return float results in the %f0 register, double results in the %f0:f1 register pair, integer results in the %d0 register, and pointer results in the %a0 register.
What this means for the person writing the Cgo interface for Go is that they must, in general, handle each kind of these C functions differently. In other words, it's not possible to write:
generic_type Cfunc(ctype1 arg1, ctype2 arg2) { ... }
We must know, at compile time, that Cfunc returns float/double/<some-integer-type>/<some-pointer-type> so that we can grab the correct register(s) and stuff its (or their) value(s) into the Cgo return-value slot, where the Cgo interface can get it and wrap it up for use in Go.
What this means for you, as a user of a Go compiler that implements Cgo wrappers to call C functions, is that you have to know the right type. There is no generalized answer; there is no way to use interface{} here. You must communicate the exact, correct type to the Cgo layer, so that the Cgo layer can use that exact, correct type information to generate the correct machine code at compile time.
If the C compiler writers had some way of flagging their code so that, e.g., at link time, the linker could pull in the right "save correct register to memory" location, that would enable the Cgo wrapper author to use the linker to automagically find the C function's type at link time. But these C compilers don't offer this ability to the linkers.
Is your particular compiler one of these? We don't know: you didn't say. But:
I'm pretty sure we're not even allowed to assign types in a dynamic way to arguments in exported (//export) functions, nor is the use of ellipsis allowed. So what's the reason behind all those limits?
That's correct, and the (slightly theoretical) example above is a reason. (I constructed this example by mixing actual techniques from 68k C compilers and SPARC C compilers, so I don't think there's any single C compiler like this. But examples like this did exist in the past, and SPARC systems still return integers in %o0, or %o0+%o1 on V8 SPARC, vs floating point in %f0 or %f0+%f1.)

What's the under-the-hood mechanism g++ uses to identify modification of 'const' variables?

When we declare a variable to be const:
const int cv = 3;
I guess g++ reserves 4 bytes somewhere (say, at address 0xFF77) in the data area. In the future, when people access cv, the compiler goes to 0xFF77 to get the value 3.
However, how does the compiler store the information 'const'? g++ must somehow store this information, so when another line tries to modify cv, the compiler knows 'oh, this is not correct, because I know 0xFF77 is const'.
Is anybody here familiar with the gcc compiler? Could you give me some insight?
Thanks
Once the program is executing, the compiler is no longer present. Its work is done; the program has been compiled into an executable, and it can then be executed without the compiler even being installed. (Consequently, it is possible to distribute executables to machines which have no compiler.)
But even if your question were rewritten to fix that issue, there is an unwarranted and incorrect assumption:
g++ must somehow store this information, so when another line tries to modify cv, the run-time knows 'oh, this is not correct, because I know 0xFF77 is const'.
In fact, the run-time is under no obligation to stop the variable from being modified. That is the programmer's responsibility. When you declare a variable to be const, you are informing the compiler that you will not modify its value, which may allow the compiler to do a better job of optimising. Such optimisations may fail if it turns out that you do, in fact, modify the variable; that is covered by the fact that modifying a value declared const is "undefined behaviour". (Undefined behaviour is really undefined. Throwing an error would be defined behaviour.)
Under certain circumstances, the compiler can in fact detect during compilation that a variable declared const is being modified.
const int cv = 3;
cv = 42;
Compilers reject a blatant violation like this one: assigning to a const-qualified lvalue is ill-formed, so the type checker reports an error. But there are times when the compiler has been misled, typically because const was cast away. For example, the following code cannot produce an error, assuming that the function always_false lives up to its name:
const int cv = 3;
if (always_false(cv)) const_cast<int&>(cv) = 42;
In short, C++ does not undertake to save you from your errors; if you choose to write programs in C++, you must be prepared to ensure that you play by the rules.
However, how does the compiler store the information 'const'?
It doesn't. The const keyword is a type qualifier. Such knowledge about constness matters during type checking, a task performed by a compiler's frontend.
If no (invalid) attempt to modify a const value is found (beware of the difference between copy and reference/pointer semantics), the compiler's backend will emit code. The data in question is then placed in an object file in a format such as ELF, not necessarily in a read-only section.
Eventually, your operating system will load that binary file. What exactly happens upon modification of that once "const object", either from a violating expression not caught by the compiler's type-checker or from any intrusive mechanism, can vary.
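To make the "const lives in the type system" point concrete, here is a small sketch using standard type traits. The helper names cv_accepts_assignment and plain_int_accepts_assignment are invented for this example. Everything here is decided at compile time. No address or flag is consulted at run time.

```cpp
#include <cassert>
#include <type_traits>

const int cv = 3;

// decltype(cv) is the declared type of the variable: const int.
static_assert(std::is_const<decltype(cv)>::value, "cv is const-qualified");

// decltype((cv)) is the type of the expression cv: const int&.
// The trait asks the type checker whether "cv = <int>" would be well-formed.
bool cv_accepts_assignment() {
    return std::is_assignable<decltype((cv)), int>::value;
}

// For comparison: a plain int lvalue can be assigned to.
bool plain_int_accepts_assignment() {
    return std::is_assignable<int&, int>::value;
}
```

The answer to "can this be assigned to?" comes from the type of the expression alone, which is why the frontend can reject cv = 42 without knowing where cv will eventually be stored.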

gcc - gdb - pretty print stl

I'm currently doing some research on the STL, especially for printing the STL content during debug. I know there are many different approaches.
Like:
http://sourceware.org/gdb/wiki/STLSupport
or using a shared library to print the content of a container
What I'm currently looking for is why g++ deletes functions that are not used. For example, I have the following code and compile it with g++ -g main.cpp -o main.o.
#include <vector>
#include <iostream>
using namespace std;
int main() {
    std::vector<int> vec;
    vec.push_back(10);
    vec.push_back(20);
    vec.push_back(30);
    return 0;
}
So when I debug this code I will see that I can't use print vec.front(). The message I receive is:
Cannot evaluate function -- may be inlined
Therefore I tried to use the setting -fkeep-inline-functions, but nothing changed.
When I use nm main.o | grep front, I see that there is no entry for the method .front(). Doing the same again, but with an extra vec.front() call in my code, I can use print vec.front(), and nm main.o | grep front shows the entry
0000000000401834 W _ZNSt6vectorIiSaIiEE5frontEv
Can someone explain how I can keep all functions in my code without losing them? I thought that dead functions do not get deleted as long as I don't enable optimization settings or do the following:
How to tell compiler to NOT optimize certain code away?
Why I need it: Current Python implementations use the internal STL implementation to print the content of a container, but it would be much more interesting to use functions which are defined by ISO/IEC 14882. I know it's possible to write a shared library, which can be compiled into your actual code before you debug it, to make sure that you have all STL functions, but who wants to compile an extra lib into their code before debugging? It would also be interesting to know the advantages and disadvantages of these two approaches (shared lib and Python).
What's exactly a dead function, isn't it a function which is available in my source code but isn't used?
There are two cases to consider:
int unused_function() { return 42; }
int main() { return 0; }
If you compile the above program, unused_function is dead: it is never called. However, it would still be present in the final executable (even with optimization [1]).
Now consider this:
template <typename T> int unused_function(T*) { return 42; }
int main() { return 0; }
In this case, unused_function will not be present, even when you turn off all optimizations.
Why? Because the template is not a "real" function. It's a prototype, from which the compiler can create "real" functions (called "template instantiation") -- one for each type T. Since you've never used unused_function, the compiler didn't create any "real" instances of it.
You can request that the compiler explicitly instantiate all functions in a given class, with an explicit instantiation request, like so:
#include <vector>
template class std::vector<int>;
int main() { return 0; }
Now, even though none of the vector functions are used, they are all instantiated into the final binary.
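The same mechanism can be tried on a small template of your own. In this sketch Box, front, twice and use_box are hypothetical names chosen for illustration. The explicit instantiation line is the only part taken from the technique above.

```cpp
#include <cassert>

// Hypothetical class template used only for this illustration.
template <typename T>
class Box {
public:
    explicit Box(T value) : value_(value) {}

    // Never called below, yet emitted for Box<int> thanks to the
    // explicit instantiation further down.
    T front() const { return value_; }

    T twice() const { return value_ + value_; }

private:
    T value_;
};

// Explicit instantiation definition: all members of Box<int> are generated.
template class Box<int>;

int use_box() {
    Box<int> b(21);
    return b.twice();
}
```

If you compile this into an object file and run nm on it, you would expect to see a symbol for Box<int>::front even though nothing calls it. Without the explicit instantiation line that symbol should be absent.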
[1] If you are using the GNU ld (or gold), you could still get rid of unused_function in this case, by compiling with -ffunction-sections and linking with -Wl,--gc-sections.
Thanks for your answer. Just to repeat: template functions don't get instantiated by gcc, because they are prototypes. Only when the function is used, or when it gets explicitly instantiated, will it be available in my executable.
So what we have mentioned so far is:
function definition int unusedFunc() { return 10; }
function prototype int protypeFunc(); (just to break it down)
What happens when you inline functions? I always thought that the function would be inserted into my source code, but now I read that compilers often decide on their own what to do. (Sounds strange, because there must be a rule.) It doesn't matter whether you use the keyword inline, for example:
inline int inlineFunc() { return 10; }
A friend of mine also told me that he hasn't had access to the addresses of functions, although he hasn't used inline. Are there any function types I forgot? He also told me that there should be differences within the object data format.
#edit - forgot:
nested functions
function pointers
overloaded functions

Why isn't pass struct by reference a common optimization?

Up until today, I had always thought that decent compilers automatically convert struct pass-by-value to pass-by-reference if the struct is large enough that the latter would be faster. To the best of my knowledge, this seems like a no-brainer optimization. However, to satisfy my curiosity as to whether this actually happens, I created a simple test case in both C++ and D and looked at the output of both GCC and Digital Mars D. Both insisted on passing 32-byte structs by value when all the function in question did was add up the members and return the values, with no modification of the struct passed in. The C++ version is below.
#include <iostream>
struct S {
    int i, j, k, l, m, n, o, p;
};
int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}
int main() {
    S s;
    int bar = foo(s);
    std::cout << bar;
}
My question is, why the heck wouldn't something like this be optimized by the compiler to pass-by-reference instead of actually pushing all those ints onto the stack?
Note: Compiler switches used: GCC -O2 (-O3 inlined foo().), DMD -O -inline -release.
Edit: Obviously, in the general case the semantics of pass-by-value vs. pass-by-reference won't be the same, such as if copy constructors are involved or the original struct is modified in the callee. However, in a lot of real-world scenarios, the semantics will be identical in terms of observable behavior. These are the cases I'm asking about.
Don't forget that in C/C++ the compiler needs to be able to compile a call to a function based only on the function declaration.
Given that callers might be using only that information, there's no way for a compiler to compile the function to take advantage of the optimization you're talking about. The caller can't know the function won't modify anything and so it can't pass by ref. Since some callers might pass by value due to lack of detailed information, the function has to be compiled assuming pass-by-value and everybody needs to pass by value.
Note that even if you marked the parameter as 'const', the compiler still can't perform the optimization, because the function could be lying and cast away the constness (this is permitted and well-defined as long as the object being passed in is actually not const).
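A tiny sketch of why const on the parameter promises so little. The functions bump and bump_non_const_object are invented names for this example. The cast is well-defined here only because the object being referred to was not declared const.

```cpp
#include <cassert>

// The signature promises not to modify r, but the body casts the promise away.
void bump(const int& r) {
    const_cast<int&>(r) += 1;
}

int bump_non_const_object() {
    int x = 1;   // x itself is NOT const, so modifying it is well-defined
    bump(x);     // the caller cannot tell from the declaration that x changes
    return x;
}
```

Since a callee is allowed to do this, a compiler that only sees the declaration cannot assume the argument is left untouched.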
I think that for static functions (or those in an anonymous namespace), the compiler could possibly make the optimization you're talking about, since the function does not have external linkage. As long as the address of the function isn't passed to some other routine or stored in a pointer, it should not be callable from other code. In this case the compiler could have full knowledge of all callers, so I suppose it could make the optimization.
I'm not sure if any do (actually, I'd be surprised if any do, since it probably couldn't be applied very often).
Of course, as the programmer (when using C++) you can force the compiler to perform this optimization by using const& parameters whenever possible. I know you're asking why the compiler can't do it automatically, but I suppose this is the next best thing.
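As a sketch of that manual version, here is the struct from the question with both signatures side by side. The names foo_by_value and foo_by_const_ref are assumptions made for this example.

```cpp
#include <cassert>

struct S {
    int i, j, k, l, m, n, o, p;
};

// Original form: the callee receives its own copy of all eight ints.
int foo_by_value(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

// Hand-written "optimization": the callee reads through a reference
// and no copy of the struct is made for the call.
int foo_by_const_ref(const S& s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}
```

For a read-only function like this one, both versions compute the same result. Which one is faster depends on the ABI and on whether the call gets inlined.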
The problem is you're asking the compiler to make a decision about the intention of user code. Maybe I want my super large struct to be passed by value so that I can do something in the copy constructor. Believe me, someone out there has something they validly need to be called in a copy constructor for just such a scenario. Switching to by-ref will bypass the copy constructor.
Having this be a compiler-generated decision would be a bad idea. The reason is that it makes it impossible to reason about the flow of your code. You can't look at a call and know exactly what it will do. You have to a) know the code and b) guess the compiler optimization.
One answer is that the compiler would need to detect that the called method does not modify the contents of the struct in any way. If it did, then the effect of passing by reference would differ from that of passing by value.
It is true that compilers in some languages could do this if they have access to the function being called and if they can assume that the called function will not be changing. This is sometimes referred to as global optimization and it seems likely that some C or C++ compilers would in fact optimize cases such as this - more likely by inlining the code for such a trivial function.
I think this is definitely an optimization you could implement (under some assumptions, see last paragraph), but it's not clear to me that it would be profitable. Instead of pushing arguments onto the stack (or passing them through registers, depending on the calling convention), you would push a pointer through which you would read values. This extra indirection would cost cycles. It would also require the passed argument to be in memory (so you could point to it) instead of in registers. It would only be beneficial if the records being passed had many fields and the function receiving the record only read a few of them. The extra cycles wasted by indirection would have to make up for the cycles not wasted by pushing unneeded fields.
You may be surprised that the reverse optimization, argument promotion, is actually implemented in LLVM. This converts a reference argument into a value argument (or an aggregate into scalars) for internal functions with small numbers of fields that are only read from. This is particularly useful for languages which pass nearly everything by reference. If you follow this with dead argument elimination, you also don't have to pass fields that aren't touched.
It bears mentioning that optimizations that change the way a function is called can only work when the function being optimized is internal to the module being compiled (you get this by declaring a function static in C and with templates in C++). The optimizer has to fix not only the function but also all the call points. This makes such optimizations fairly limited in scope unless you do them at link time. In addition, the optimization would never be called when a copy constructor is involved (as other posters have mentioned) because it could potentially change the semantics of the program, which a good optimizer should never do.
There are many reasons to pass by value, and having the compiler optimise out your intention may break your code.
For example, the called function might modify the structure in some way. If you intended the results to be passed back to the caller, you'd either pass a pointer/reference or return the structure yourself.
What you're asking the compiler to do is change the behaviour of your code, which would be considered a compiler bug.
If you want to make the optimization and pass by reference then by all means modify someone's existing function/method definitions to accept references; it's not all that hard to do. You might be surprised at the breakage you cause without realising it.
Changing from by value to by reference will change the signature of the function. If the function is not static this would cause linking errors for other compilation units which are not aware of the optimization you did.
Indeed the only way to do such an optimization is by some sort of post-link global optimization phase. These are notoriously hard to do yet some compilers do them to some extent.
Pass-by-reference is just syntactic sugar for pass-by-address/pointer, so the function must implicitly dereference a pointer to read the parameter's value. Dereferencing the pointer (especially in a loop) might be more expensive than the struct copy for pass-by-value.
More importantly, as others have mentioned, pass-by-reference has different semantics than pass-by-value. A const reference does not mean the referenced value does not change: other function calls might change the referenced value.
Effectively passing a struct by reference even when the function declaration indicates pass-by-value is a common optimization: it's just that it usually happens indirectly via inlining, so it's not obvious from the generated code.
However, for this to happen, the compiler needs to know, while it is compiling the caller, that the callee doesn't modify the passed object. Otherwise, it will be restricted by the platform/language ABI, which dictates exactly how values are passed to functions.
It can happen even without inlining!
Still, some compilers do implement this optimization even in the absence of inlining, although the circumstances are relatively limited, at least on platforms using the SysV ABI (Linux, OSX, etc) due to the constraints of stack layout. Consider the following simple example, based directly on your code:
__attribute__((noinline))
int foo(S s) {
return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}
int bar(S s) {
return foo(s);
}
Here, at the language level bar calls foo with pass-by-value semantics as required by C++. If we examine the assembly generated by gcc, however, it looks like this:
foo(S):
mov eax, DWORD PTR [rsp+12]
add eax, DWORD PTR [rsp+8]
add eax, DWORD PTR [rsp+16]
add eax, DWORD PTR [rsp+20]
add eax, DWORD PTR [rsp+24]
add eax, DWORD PTR [rsp+28]
add eax, DWORD PTR [rsp+32]
add eax, DWORD PTR [rsp+36]
ret
bar(S):
jmp foo(S)
Note that bar just jumps directly to foo, without making a copy: foo will use the same copy of s that was passed to bar (on the stack). In particular, bar doesn't make any copy as is implied by the language semantics (ignoring the as-if rule). So gcc has performed exactly the optimization you requested. Clang doesn't do it though: it makes a copy on the stack which it passes to foo().
Unfortunately, the cases where this can work are fairly limited: SysV requires that these large structures are passed on the stack in a specific position, so such re-use is only possible if callee expects the object in the exact same place.
That's possible in the foo/bar example since bar takes its S as the first parameter in the same way as foo, and bar does a tail call to foo, which avoids the need for the implicit return-address push that would otherwise ruin the ability to re-use the stack argument.
For example, if we simply add a + 1 to the call to foo:
int bar(S s) {
return foo(s) + 1;
}
The trick is ruined, since now the position of bar::s is different than the location foo will expect its s argument, and we need a copy:
bar(S):
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
call foo(S)
add rsp, 32
add eax, 1
ret
This doesn't mean that the caller bar() has to be totally trivial though. For example, it could modify its copy of s, prior to passing it along:
int bar(S s) {
s.i += 1;
return foo(s);
}
... and the optimization would be preserved:
bar(S):
add DWORD PTR [rsp+8], 1
jmp foo(S)
In principle, the possibility for this kind of optimization is much greater in the Win64 calling convention, which uses a hidden pointer to pass large structures. This gives a lot more flexibility in reusing existing structures on the stack or elsewhere in order to implement pass-by-reference under the covers.
Inlining
All that aside, however, the main way this optimization happens is via inlining.
For example, at -O2 none of clang, gcc and MSVC makes any copy of the S object [1]. Both clang and gcc don't really create the object at all, but just calculate the result more or less directly without even referring to unused fields. MSVC does allocate stack space for a copy, but never uses it: it fills out only one copy of S and reads from that, just like pass-by-reference (MSVC generates much worse code than the other two compilers for this case).
Note that even though foo is inlined into main, the compilers also generate a separate standalone copy of the foo() function, since it has external linkage and so could be used from outside this object file. In this, the compiler is restricted by the application binary interface: the SysV ABI (for Linux) or Win64 ABI (for Windows) defines exactly how values must be passed, depending on the type and size of the value. Large structures are passed on the stack (SysV) or by hidden pointer (Win64), and the compiler has to respect that when compiling foo. It also has to respect that when compiling some caller of foo when foo cannot be seen, since it has no idea what foo will do.
So there is very little window for the compiler to make an effective optimization that transforms pass-by-value into pass-by-reference, because:
1) If it can see both the caller and callee (main and foo in your example), it is likely that the callee will be inlined into the caller if it is small enough, and as the function becomes large and not-inlinable, the effect of fixed cost things like calling convention overhead become relatively smaller.
2) If the compiler cannot see both the caller and callee at the same time [2], it generally has to compile each according to the platform ABI. There is no scope for optimization of the call at the call site since the compiler doesn't know what the callee will do, and there is no scope for optimization within the callee because the compiler has to make conservative assumptions about what the caller did.
[1] My example is slightly more complicated than your original one, to avoid the compiler just optimizing everything away entirely (in particular, you access uninitialized memory, so your program doesn't even have defined behavior): I populate a few of the fields of s with argc, which is a value the compiler can't predict.
[2] That a compiler can see both "at the same time" generally means they are either in the same translation unit or that link-time optimization is being used.
Well, the trivial answer is that the location of the struct in memory is different, and thus the data you're passing is different. The more complex answer, I think, is threading.
Your compiler would need to detect a) that foo does not modify the struct; b) that foo does not do any calculation on the physical location of the struct elements; AND c) that the caller, or another thread spawned by the caller, doesn't modify the struct before foo is finished running.
In your example, it's conceivable that the compiler could do these things - but the memory saved is inconsequential and probably not worth taking the guess. What happens if you run the same program with a struct that has two million elements?
The compiler would need to be sure that the struct that is passed in (as named in the calling code) is not modified:
double x; // using non structs, oh-well
void Foo(double d)
{
x += d; // ok
x += d; // Oops
}
int main()
{
x = 1;
Foo(x);
}
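To see the difference the "Oops" comment is pointing at, here is a sketch that runs the same body once with a by-value parameter and once with a reference parameter. The names g_x, add_twice_by_value, add_twice_by_ref, run_by_value and run_by_ref are invented for this illustration.

```cpp
#include <cassert>

double g_x = 0.0;  // plays the role of the global x above

// d is a private copy, so changing g_x does not change d.
void add_twice_by_value(double d) {
    g_x += d;  // 1 + 1 = 2
    g_x += d;  // 2 + 1 = 3
}

// d aliases g_x when the caller passes g_x, so the first += changes d too.
void add_twice_by_ref(const double& d) {
    g_x += d;  // 1 + 1 = 2, and d now reads as 2
    g_x += d;  // 2 + 2 = 4
}

double run_by_value() {
    g_x = 1;
    add_twice_by_value(g_x);
    return g_x;
}

double run_by_ref() {
    g_x = 1;
    add_twice_by_ref(g_x);
    return g_x;
}
```

A compiler that silently turned the first function into the second would change the result from 3 to 4, so it has to prove that no such aliasing exists before doing so.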
On many platforms, large structures are in fact passed by reference, but either the caller will be expected to pass a reference to a copy that the function may manipulate as it likes [1], or the called function will be expected to make a copy of the structure to which it receives a reference and then perform any manipulations on the copy.
While there are many circumstances in which the copy operations could in fact be omitted, it will often be difficult for a compiler to prove that such operations may be eliminated.
For example, given:
struct FOO { ... };
void func1(struct FOO *foo1);
void func2(struct FOO foo2);
void test(void)
{
struct FOO foo;
func1(&foo);
func2(foo);
}
there is no way a compiler could know whether foo might get modified during the execution of func2 (func1 could have stored a copy of foo1 or a pointer derived from it in a file-scope object which is then used by func2). Such modifications, however, should not affect the copy of foo (i.e. foo2) received by func2. If foo were passed by reference and func2 didn't make a copy, actions that affect foo would improperly affect foo2.
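One concrete way this scenario could play out, as a sketch: the bodies of func1 and func2, the member value, the file-scope pointer stashed and the int return types are all assumptions added here, since the original only gives the declarations.

```cpp
#include <cassert>

struct FOO {
    int value;  // hypothetical member; the original elides the contents
};

// File-scope object in which func1 remembers the pointer it was given.
static FOO* stashed = nullptr;

void func1(FOO* foo1) {
    stashed = foo1;
}

// func2 modifies the caller's object through the stashed pointer,
// then reads its own by-value parameter.
int func2(FOO foo2) {
    stashed->value = 99;
    return foo2.value;  // must still be the value at the time of the call
}

int test_by_value() {
    FOO foo = {1};
    func1(&foo);
    int seen = func2(foo);  // foo2 is a copy, so it is unaffected by the write
    stashed = nullptr;      // foo is about to go out of scope
    return seen;
}
```

With true pass-by-value, func2 sees 1. If the compiler passed foo by reference without a copy, func2 would see 99, which is exactly the improper effect described above.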
Note that even void func3(const struct FOO); is not meaningful: the callee is allowed to cast away const, and the normal asm calling convention still allows the callee to modify the memory holding the by-value copy.
Unfortunately, there are relatively few cases where examining the caller or called function in isolation would be sufficient to prove that a copy operation may be safely omitted, and there are many cases where even examining both would be insufficient. Thus, replacing pass-by-value with pass-by-reference is a difficult optimization whose payoff is often insufficient to justify the difficulty.
Footnote 1:
For example, Windows x64 passes objects larger than 8 bytes by non-const reference (callee "owns" the pointed-to memory). This doesn't help avoid copying at all; the motivation is to make all function args fit in 8 bytes each so they form an array on the stack (after spilling register args to shadow space), making variadic functions easy to implement.
By contrast, x86-64 System V does what the question describes for objects larger than 16 bytes: copying them to the stack. (Smaller objects are packed into up to two registers.)
