htonl and ntohl have the same address in Windows?

I rely on GetProcAddress() to do some hooking of some functions. I get a terrible result though, and to be honest, I don't really know what's happening.
It seems that this code will output "WHAT THE HELL?":
#include <stdio.h>
#include <windows.h>

int main(void)
{
    HMODULE ws32 = LoadLibrary("WS2_32.DLL");
    if (GetProcAddress(ws32, "ntohl") == GetProcAddress(ws32, "htonl"))
        printf("WHAT THE HELL\n");
    return 0;
}
Could someone explain to me why ntohl and htonl have the same absolute address?
The problem is that, when I hook into a DLL and do some processing inside it, the PE import table tells a different story: when I parse the PE IAT, ntohl() and htonl() have different addresses there (inside the IAT, as said).
So this renders my program useless: ntohl() is conflated with htonl(), and the program cannot tell the two apart and goes crazy.
Any thoughts? Is there a way to circumvent this? An explanation?

Sure, why not. All that ntohl and htonl do, on a little-endian platform, is reverse the individual bytes of an integer. There's no need for those two functions to be implemented differently - you do not need to worry that GetProcAddress() returns the same function pointer for both.
Of course, you'd still want to verify that GetProcAddress() doesn't return a NULL pointer.

Unless you're on a mixed-endian platform like the PDP-11, converting between host and network byte order is either a byte swap or a no-op, so applying ntohl or htonl to an integer produces the same output. The linker may therefore choose to use the same function for both names.
It's unclear why you want to differentiate those functions, but doing so is completely unreliable. Smart compilers convert htonl and ntohl into a byte-swap instruction where necessary, resulting in no function call at all. The user can also call compiler intrinsics such as _byteswap_ulong() or __builtin_bswap32() directly. In those cases, how would you hook the function?
That being said, what you are seeing is likely caused by COMDAT folding, an optimization that merges identical functions. To disable it, use the /OPT:NOICF linker flag.
See also
Do distinct functions have distinct addresses?
Why do two functions have the same address?

Can register name be passed into assembly template in GCC inline assembly [duplicate]

I have recently started learning how to use the inline assembly in C Code and came across an interesting feature where you can specify registers for local variables (https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables).
The usage of this feature is as follows:
register int *foo asm ("r12");
Then I started to wonder whether it was possible to insert a char pointer such as
const char d[4] = "r12";
register int *foo asm (d);
but got the error: expected string literal before 'd' (as expected)
I can understand why this would be a bad practice, but is there any possible way to achieve a similar effect where I can use a char pointer to access the register? If not, is there any particular reason why this is not allowed besides the potential security issues?
Additionally, I read this StackOverflow question: String literals: pointer vs. char array
Thank you.
The syntax to initialize the variable would be register char *foo asm ("r12") = d; to point an asm-register variable at a string. You can't use a runtime-variable string as the register name; register choices have to get assembled into machine code at compile time.
If that's what you're trying to do, you're misunderstanding something fundamental about assembly language and/or how ahead-of-time compiled languages compile into machine code. GCC won't make self-modifying code (and even if it wanted to, doing that safely would require redoing register allocation done by the ahead-of-time optimizer), or code that re-JITs itself based on a string.
(The first time I looked at your question, I didn't understand what you were even trying to do, because I was only considering things that are possible. #FelixG's comment was the clue I needed to make sense of the question.)
(Also note that registers aren't indexable; even in asm you can't use a single instruction to read a register number selected by an integer in another register. You could branch on it, or store all the registers in memory and index that like variadic functions do for their incoming register args.)
And if you do want a compile-time constant string literal, just use it with the normal syntax. Use a CPP macro if you want the same string to initialize a char array.

Why casting interface{} to cgo types is not allowed?

It's more of a technical question, not really an issue. Since we don't have variadic functions in cgo and there's currently no valid solution, I wonder if it'd be possible to cast interface{} to cgo types. This would allow us to have more dynamic functions. I'm pretty sure we're not even allowed to assign types dynamically to arguments in exported (//export) functions, nor is the use of ellipsis allowed. So what's the reason behind all those limits?
Thanks for answering.
import "C"

//export Foo
func Foo(arg1, arg2, arg3 C.int) {
}
C compilers are allowed, but not required, to return different types using different return mechanisms. For instance, some C compilers might return float results in the %f0 register, double results in the %f0:f1 register pair, integer results in the %d0 register, and pointer results in the %a0 register.
What this means for the person writing the Cgo interface for Go is that they must, in general, handle each kind of these C functions differently. In other words, it's not possible to write:
generic_type Cfunc(ctype1 arg1, ctype2 arg2) { ... }
We must know, at compile time, that Cfunc returns float/double/<some-integer-type>/<some-pointer-type> so that we can grab the correct register(s) and stuff its (or their) value(s) into the Cgo return-value slot, where the Cgo interface can get it and wrap it up for use in Go.
What this means for you, as a user of a Go compiler that implements Cgo wrappers to call C functions, is that you have to know the right type. There is no generalized answer; there is no way to use interface{} here. You must communicate the exact, correct type to the Cgo layer, so that the Cgo layer can use that exact, correct type information to generate the correct machine code at compile time.
If the C compiler writers had some way of flagging their code so that, e.g., at link time, the linker could pull in the right "save the correct register to memory" operation, that would enable the Cgo wrapper author to use the linker to automagically find the C function's type at link time. But these C compilers don't offer this ability to the linkers.
Is your particular compiler one of these? We don't know: you didn't say. But:
I'm pretty sure we're not even allowed to assign types dynamically to arguments in exported (//export) functions, nor is the use of ellipsis allowed. So what's the reason behind all those limits?
That's correct, and the (slightly theoretical) example above is a reason. (I constructed this example by mixing actual techniques from 68k C compilers and SPARC C compilers, so I don't think there's any single C compiler like this. But examples like this did exist in the past, and SPARC systems still return integers in %o0, or %o0+%o1 on V8 SPARC, vs floating point in %f0 or %f0+%f1.)

Purpose of using Windows Data Types in a program

I am trying to understand the purpose of using Windows data types when defining parameters of a function or fields of a structure. I've read explanations detailing how this prevents code from "breaking" if "underlying types" are changed. Can someone present a concise explanation and example to clarify? Thanks.
Found answer in a similar post (Why are the standard datatypes not used in Win32 API?):
And the reason that these types are defined the way they are, rather than using int, char and so on is that it removes the "whatever the compiler thinks an int should be sized as" from the interface of the OS. Which is a very good thing, because if you use compiler A, or compiler B, or compiler C, they will all use the same types - only the library interface header file needs to do the right thing defining the types.
By defining types that are not standard types, it's easy to change int from 16 to 32 bits, for example. The first C/C++ compilers for Windows used 16-bit integers. It was only in the mid-to-late 1990s that Windows got a 32-bit API, and up until that point you were using an int that was 16 bits. Imagine that you have a well-working program that uses several hundred int variables, and all of a sudden you have to change ALL of those variables to something else... Wouldn't be very nice, right - especially as SOME of those variables DON'T need changing, because moving to a 32-bit int for some of your code won't make any difference, so there's no point in changing those bits.
It should be noted that WCHAR is NOT the same as const char - WCHAR is a "wide char", so wchar_t is the comparable type.
So, basically, "define our own types" is a way to guarantee that it's possible to change the underlying compiler architecture without having to change (much of) the source code. All larger projects that do machine-dependent coding do this sort of thing.

How to read a call stack?

We have a native C++ application running via COM+ on a Windows 2003 server. I've recently noticed from the Event Viewer that it's throwing exceptions, specifically the C0000005 exception, which, according to http://blogs.msdn.com/calvin_hsia/archive/2004/06/30/170344.aspx, means that the process is trying to write to memory not within its address space, aka an access violation.
The entry in event viewer provides a call stack:
LibFmwk!UTIL_GetDateFromLogByDayDirectory(char const *,class utilCDate &) + 0xa26c
LibFmwk!UTIL_GetDateFromLogByDayDirectory(char const *,class utilCDate &) + 0x8af4
LibFmwk!UTIL_GetDateFromLogByDayDirectory(char const *,class utilCDate &) + 0x13a1
LibFmwk!utilCLogController::GetFLFInfoLevel(void)const + 0x1070
LibFmwk!utilCLogController::GetFLFInfoLevel(void)const + 0x186
Now, I understand that it's giving me method names to go look at, but I get the feeling that the address at the end of each line (e.g. + 0xa26c) is trying to point me to a specific line or instruction within that method.
So my questions are:
Does anyone know how I might use this address, or any other information in the call stack, to determine which line of code it's falling over on?
Are there any resources out there that I could read to better understand call stacks?
Are there any freeware/opensource tools that could help in analysing a call stack, perhaps by attaching to a debug symbol file and/or binaries?
Edit:
As requested, here is the method that appears to be causing the problem:
BOOL UTIL_GetDateFromLogByDayDirectory(LPCSTR pszDir, utilCDate& oDate)
{
    BOOL bRet = FALSE;

    if ((pszDir[0] == '%') &&
        ::isdigit(pszDir[1]) && ::isdigit(pszDir[2]) &&
        ::isdigit(pszDir[3]) && ::isdigit(pszDir[4]) &&
        ::isdigit(pszDir[5]) && ::isdigit(pszDir[6]) &&
        ::isdigit(pszDir[7]) && ::isdigit(pszDir[8]) &&
        !pszDir[9])
    {
        char acCopy[9];
        ::memcpy(acCopy, pszDir + 1, 8);
        acCopy[8] = '\0';

        int iDay = ::atoi(&acCopy[6]);
        acCopy[6] = '\0';
        int iMonth = ::atoi(&acCopy[4]);
        acCopy[4] = '\0';
        int iYear = ::atoi(&acCopy[0]);

        oDate.Set(iDay, iMonth, iYear);
        bRet = TRUE;
    }

    return (bRet);
}
This is code written over 10 years ago by a member of our company who has long since gone, so I don't presume to know exactly what it's doing, but I do know it's involved in the process of renaming a log directory from 'Today' to the specific date, e.g. %20090329. The array indexing, memcpy and address-of operators do make it look rather suspicious.
Another problem we seem to have is that this only happens on the production system, we've never been able to reproduce it on our test systems or development systems here, which would allow us to attach a debugger.
Much appreciated!
Andy
Others have said this between the lines, but not explicitly. Look at:
LibFmwk!UTIL_GetDateFromLogByDayDirectory(char const *,class utilCDate &) + 0xa26c
The 0xa26c offset is huge, way past the end of the function. The debugger obviously doesn't have proper symbols for LibFmwk, so instead it relies on the DLL exports and shows the offset relative to the closest one it can find.
So, yeah, get proper symbols and then it should be a breeze. UTIL_GetDateFromLogByDayDirectory is not at fault here.
If you really need to map those addresses to your functions, you'll need to work with the .MAP file and see where those addresses really point.
But being in your situation, I would rather investigate this problem under a debugger (e.g. the MSVS debugger or WinDbg); as an alternative (if the crash happens at a customer's site), you can generate a crash dump and study it locally - this can be done via the Windows MiniDumpWriteDump API or the SysInternals ProcDump utility (http://download.sysinternals.com/Files/procdump.zip).
Make sure that all required symbol files are generated and available (also set up the Microsoft symbol server path so that Windows DLLs' entry points get resolved as well).
IMHO this is just the web site you need: http://www.dumpanalysis.org - which is the best resource to cover all your questions.
Consider also taking a look at this PDF - http://windbg.info/download/doc/pdf/WinDbg_A_to_Z_color.pdf
Points 2 and 3 are easily answered:
3rd point. Any debugger - that's what they are made for. Set your debugger to break on this specific exception. You should be able to click your way through the call stack and find the different calls on the stack (at least Delphi can do this, so Visual Studio should be able to as well). Compile without optimisations if possible. OllyDbg might work as well - perhaps in combination with its trace functionality.
2nd point. Any information about x86 assembler and reverse engineering. Try: OpenRCE, the NASM documentation, the ASM Community.
1st point. The call stack tells you the functions. I don't know whether it is written in call order or reverse order, so the first line might be the last called function or the first. Follow the calls with the help of the debugger. Sometimes you can switch between asm and source code (depending on the debugger, map files, ...). If you don't have the source, learn assembler and read about reverse engineering. Read the documentation of the functions you call in third-party components; perhaps you do not satisfy a precondition.
It would also help if you could tell us a bit more about the program (which parts of the source code you have, whether a library call is involved, ...).
Now some code-reading:
The function accepts a pointer to a zero terminated string and a reference to a date object. The pointer is assumed to be valid!
The function checks whether the string is in a specific format (a '%' followed by 8 digits followed by a '\0'). If this is not the case, it returns false. This check (the big if) accesses the pointer without any validity checks. The length is not checked, and if the pointer points somewhere into the wild, that memory is accessed. I don't know if a shorter string will cause problems; it shouldn't, because of the way && is evaluated.
Then some memory is allocated on the stack. The number part of the string is copied into it (which is OK) and the buffer gets its '\0' termination. The atoi calls extract the numbers. This works because of the different start locations used and the '\0' termination after each part. Somewhat tricky, but nice. Some comments would have made everything clear.
These numbers are then inserted into the date object, which should be valid since it is passed into the function by reference. I don't know whether you can pass a reference to a deleted object, but if you can, that might be your problem as well.
Anyway - except for the missing check of the string pointer, this function is sound and not the cause of your problem. It's only the place that throws the exception. Look at the arguments that are passed into this function: are they always valid? Do some logging.
I hope I didn't do any major mistakes as I am a Delphi programmer. If I did - feel free to comment.

Why isn't pass struct by reference a common optimization?

Up until today, I had always thought that decent compilers automatically convert struct pass-by-value to pass-by-reference if the struct is large enough that the latter would be faster. To the best of my knowledge, this seems like a no-brainer optimization. However, to satisfy my curiosity as to whether this actually happens, I created a simple test case in both C++ and D and looked at the output of both GCC and Digital Mars D. Both insisted on passing 32-byte structs by value when all the function in question did was add up the members and return the values, with no modification of the struct passed in. The C++ version is below.
#include <iostream>

struct S {
    int i, j, k, l, m, n, o, p;
};

int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

int main() {
    S s;
    int bar = foo(s);
    std::cout << bar;
}
My question is, why the heck wouldn't something like this be optimized by the compiler to pass-by-reference instead of actually pushing all those ints onto the stack?
Note: Compiler switches used: GCC -O2 (-O3 inlined foo().), DMD -O -inline -release.
Edit: Obviously, in the general case the semantics of pass-by-value vs. pass-by-reference won't be the same, such as if copy constructors are involved or the original struct is modified in the callee. However, in a lot of real-world scenarios, the semantics will be identical in terms of observable behavior. These are the cases I'm asking about.
Don't forget that in C/C++ the compiler needs to be able to compile a call to a function based only on the function declaration.
Given that callers might be using only that information, there's no way for a compiler to compile the function to take advantage of the optimization you're talking about. The caller can't know the function won't modify anything and so it can't pass by ref. Since some callers might pass by value due to lack of detailed information, the function has to be compiled assuming pass-by-value and everybody needs to pass by value.
Note that even if you marked the parameter as 'const', the compiler still can't perform the optimization, because the function could be lying and cast away the constness (this is permitted and well-defined as long as the object being passed in is actually not const).
I think that for static functions (or those in an anonymous namespace), the compiler could possibly make the optimization you're talking about, since the function does not have external linkage. As long as the address of the function isn't passed to some other routine or stored in a pointer, it should not be callable from other code. In this case the compiler could have full knowledge of all callers, so I suppose it could make the optimization.
I'm not sure if any do (actually, I'd be surprised if any do, since it probably couldn't be applied very often).
Of course, as the programmer (when using C++) you can force the compiler to perform this optimization by using const& parameters whenever possible. I know you're asking why the compiler can't do it automatically, but I suppose this is the next best thing.
The problem is that you're asking the compiler to make a decision about the intention of user code. Maybe I want my super-large struct to be passed by value so that I can do something in the copy constructor. Believe me, someone out there validly needs something to be called in a copy constructor for just such a scenario. Switching to pass-by-reference would bypass the copy constructor.
Having this be a compiler-generated decision would be a bad idea. The reason is that it makes it impossible to reason about the flow of your code. You can't look at a call and know exactly what it will do. You have to a) know the code, and b) guess the compiler optimization.
One answer is that the compiler would need to detect that the called method does not modify the contents of the struct in any way. If it did, then the effect of passing by reference would differ from that of passing by value.
It is true that compilers in some languages could do this if they have access to the function being called and if they can assume that the called function will not change. This is sometimes referred to as global optimization, and it seems likely that some C or C++ compilers would in fact optimize cases such as this - more likely by inlining the code of such a trivial function.
I think this is definitely an optimization you could implement (under some assumptions, see last paragraph), but it's not clear to me that it would be profitable. Instead of pushing arguments onto the stack (or passing them through registers, depending on the calling convention), you would push a pointer through which you would read values. This extra indirection would cost cycles. It would also require the passed argument to be in memory (so you could point to it) instead of in registers. It would only be beneficial if the records being passed had many fields and the function receiving the record only read a few of them. The extra cycles wasted by indirection would have to make up for the cycles not wasted by pushing unneeded fields.
You may be surprised that the reverse optimization, argument promotion, is actually implemented in LLVM. This converts a reference argument into a value argument (or an aggregate into scalars) for internal functions with small numbers of fields that are only read from. This is particularly useful for languages which pass nearly everything by reference. If you follow this with dead argument elimination, you also don't have to pass fields that aren't touched.
It bears mentioning that optimizations that change the way a function is called can only work when the function being optimized is internal to the module being compiled (you get this by declaring a function static in C and with templates in C++). The optimizer has to fix not only the function but also all the call points. This makes such optimizations fairly limited in scope unless you do them at link time. In addition, the optimization would never be called when a copy constructor is involved (as other posters have mentioned) because it could potentially change the semantics of the program, which a good optimizer should never do.
There are many reasons to pass by value, and having the compiler optimise out your intention may break your code.
Example, if the called function modifies the structure in any way. If you intended the results to be passed back to the caller then you'd either pass a pointer/reference or return it yourself.
What you're asking the compiler to do is change the behaviour of your code, which would be considered a compiler bug.
If you want to make the optimization and pass by reference then by all means modify someone's existing function/method definitions to accept references; it's not all that hard to do. You might be surprised at the breakage you cause without realising it.
Changing from by value to by reference will change the signature of the function. If the function is not static this would cause linking errors for other compilation units which are not aware of the optimization you did.
Indeed the only way to do such an optimization is by some sort of post-link global optimization phase. These are notoriously hard to do yet some compilers do them to some extent.
Pass-by-reference is just syntactic sugar for pass-by-address/pointer, so the function must implicitly dereference a pointer to read the parameter's value. Dereferencing that pointer might be more expensive (if done in a loop) than the struct copy made for pass-by-value.
More importantly, as others have mentioned, pass-by-reference has different semantics than pass-by-value. A const reference does not mean the referenced value cannot change: other function calls might change the referenced value.
Effectively passing a struct by reference even when the function declaration indicates pass-by-value is a common optimization: it's just that it usually happens indirectly via inlining, so it's not obvious from the generated code.
However, for this to happen, the compiler needs to know, while it is compiling the caller, that the callee doesn't modify the passed object. Otherwise, it will be restricted by the platform/language ABI, which dictates exactly how values are passed to functions.
It can happen even without inlining!
Still, some compilers do implement this optimization even in the absence of inlining, although the circumstances are relatively limited, at least on platforms using the SysV ABI (Linux, OSX, etc) due to the constraints of stack layout. Consider the following simple example, based directly on your code:
__attribute__((noinline))
int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

int bar(S s) {
    return foo(s);
}
Here, at the language level bar calls foo with pass-by-value semantics as required by C++. If we examine the assembly generated by gcc, however, it looks like this:
foo(S):
        mov eax, DWORD PTR [rsp+12]
        add eax, DWORD PTR [rsp+8]
        add eax, DWORD PTR [rsp+16]
        add eax, DWORD PTR [rsp+20]
        add eax, DWORD PTR [rsp+24]
        add eax, DWORD PTR [rsp+28]
        add eax, DWORD PTR [rsp+32]
        add eax, DWORD PTR [rsp+36]
        ret
bar(S):
        jmp foo(S)
Note that bar just directly calls foo, without making a copy: bar will use the same copy of s that was passed to bar (on the stack). In particular, it doesn't make any copy as implied by the language semantics (ignoring as-if). So gcc has performed exactly the optimization you requested. Clang doesn't do it, though: it makes a copy on the stack which it passes to foo().
Unfortunately, the cases where this can work are fairly limited: SysV requires that these large structures be passed on the stack in a specific position, so such re-use is only possible if the callee expects the object in exactly the same place.
That's possible in the foo/bar example since bar takes its S as the first parameter in the same way foo does, and bar does a tail call to foo, which avoids the implicit return-address push that would otherwise ruin the ability to re-use the stack argument.
For example, if we simply add a + 1 to the call to foo:
int bar(S s) {
    return foo(s) + 1;
}
The trick is ruined, since now the position of bar::s is different from the location where foo will expect its s argument, and we need a copy:
bar(S):
        push QWORD PTR [rsp+32]
        push QWORD PTR [rsp+32]
        push QWORD PTR [rsp+32]
        push QWORD PTR [rsp+32]
        call foo(S)
        add rsp, 32
        add eax, 1
        ret
This doesn't mean that the caller bar() has to be totally trivial though. For example, it could modify its copy of s, prior to passing it along:
int bar(S s) {
    s.i += 1;
    return foo(s);
}
... and the optimization would be preserved:
bar(S):
        add DWORD PTR [rsp+8], 1
        jmp foo(S)
In principle, the possibility for this kind of optimization is much greater in the Win64 calling convention, which uses a hidden pointer to pass large structures. This gives a lot more flexibility to reuse existing structures, on the stack or elsewhere, in order to implement pass-by-reference under the covers.
Inlining
All that aside, however, the main way this optimization happens is via inlining.
For example, at -O2, all of clang, gcc and MSVC avoid making any copy of the S object1. Clang and gcc don't really create the object at all, but just calculate the result more or less directly, without even referring to unused fields. MSVC does allocate stack space for a copy, but never uses it: it fills out only one copy of S and reads from that, just like pass-by-reference (MSVC generates much worse code than the other two compilers for this case).
Note that even though foo is inlined into main, the compilers also generate a separate standalone copy of the foo() function, since it has external linkage and so could be called from outside this object file. Here, the compiler is restricted by the application binary interface: the SysV ABI (for Linux) or the Win64 ABI (for Windows) defines exactly how values must be passed, depending on the type and size of the value. Large structures are passed by hidden pointer, and the compiler has to respect that when compiling foo. It also has to respect the ABI when compiling a caller of foo whose body it cannot see, since it has no idea what foo will do.
So there is very little window for the compiler to make an effective optimization which transforms pass-by-value to pass-by-reference, because:
1) If it can see both the caller and callee (main and foo in your example), it is likely that the callee will be inlined into the caller if it is small enough; and as the function becomes large and non-inlinable, the effect of fixed costs like calling-convention overhead becomes relatively smaller.
2) If the compiler cannot see both the caller and callee at the same time2, it generally has to compile each according to the platform ABI. There is no scope for optimization at the call site, since the compiler doesn't know what the callee will do, and there is no scope for optimization within the callee, because the compiler has to make conservative assumptions about what the caller did.
1 My example is slightly more complicated than your original one, to avoid the compiler just optimizing everything away entirely (in particular, your code accesses uninitialized memory, so your program doesn't even have defined behavior): I populate a few of the fields of s with argc, which is a value the compiler can't predict.
2 "The compiler can see both at the same time" generally means that they are either in the same translation unit or that link-time optimization is being used.
Well, the trivial answer is that the location of the struct in memory is different, and thus the data you're passing is different. The more complex answer, I think, is threading.
Your compiler would need to detect a) that foo does not modify the struct; b) that foo does not do any calculation on the physical location of the struct elements; AND c) that the caller, or another thread spawned by the caller, doesn't modify the struct before foo is finished running.
In your example, it's conceivable that the compiler could do these things - but the memory saved is inconsequential and probably not worth taking the guess. What happens if you run the same program with a struct that has two million elements?
The compiler would need to be sure that the struct that is passed in (as named in the calling code) is not modified:

double x; // using a non-struct, oh well

void Foo(double d)
{
    x += d; // ok: d still holds the original value of x
    x += d; // Oops: if d aliased x, d would already have changed here
}

int main()
{
    x = 1;
    Foo(x);
    return 0;
}
On many platforms, large structures are in fact passed by reference, but either the caller will be expected to pass a reference to a copy that the function may manipulate as it likes1, or the called function will be expected to make a copy of the structure to which it receives a reference and then perform any manipulations on the copy.
While there are many circumstances in which the copy operations could in fact be omitted, it will often be difficult for a compiler to prove that such operations may be eliminated.
For example, given:
struct FOO { ... };

void func1(struct FOO *foo1);
void func2(struct FOO foo2);

void test(void)
{
    struct FOO foo;
    func1(&foo);
    func2(foo);
}
there is no way a compiler could know whether foo might get modified during the execution of func2 (func1 could have stored a copy of foo1, or a pointer derived from it, in a file-scope object which is then used by func2). Such modifications, however, must not affect the copy of foo (i.e. foo2) received by func2. If foo were passed by reference and func2 didn't make a copy, actions that affect foo would improperly affect foo2.
Note that even void func3(const struct FOO); is not meaningful: the callee is allowed to cast away the const, and the normal asm calling conventions still allow the callee to modify the memory holding the by-value copy.
Unfortunately, there are relatively few cases where examining the caller or called function in isolation would be sufficient to prove that a copy operation may be safely omitted, and there are many cases where even examining both would be insufficient. Thus, replacing pass-by-value with pass-by-reference is a difficult optimization whose payoff is often insufficient to justify the difficulty.
Footnote 1:
For example, Windows x64 passes objects larger than 8 bytes by non-const reference (callee "owns" the pointed-to memory). This doesn't help avoid copying at all; the motivation is to make all function args fit in 8 bytes each so they form an array on the stack (after spilling register args to shadow space), making variadic functions easy to implement.
By contrast, x86-64 System V does what the question describes for objects larger than 16 bytes: copying them to the stack. (Smaller objects are packed into up to two registers.)
