VC++ Inline assembly errors - visual-studio

I have been searching around for a while, and couldn't seem to find the answer to my issue. I'm trying to code some functions to detect whether or not the executable is being debugged, and I'm using some inline assembly for it (with the __asm tag). It keeps throwing two errors, and the rest of the code seems to compile fine. Here's the function:
int peb_detect() {
    __asm {
        ASSUME FS : NOTHING
        MOV EAX, DWORD PTR FS : [18]
        MOV EAX, DBYTE PTR DS : [EAX + 30]
        MOVZX EAX, BYTE PTR DS : [EAX + 2]
        RET
    }
}
and I keep getting the errors
warning C4405: 'FS': identifier is reserved word
warning C2400: inline assembler syntax error in 'opcode'; found 'FS'
warning C2408: illegal type on PTR operator in 'second operand'
I can't seem to figure it out. If anyone can help, I would really appreciate it. Thanks!

First, it is not 18 but 0x18, and not 30 but 0x30:
C_ASSERT(FIELD_OFFSET(NT_TIB, Self) == 0x18);
C_ASSERT(FIELD_OFFSET(TEB, ProcessEnvironmentBlock) == 0x30);
You should not use hard-coded constants, especially wrong ones.
Second, int peb_detect() must be __declspec(naked) if you use the RET instruction, so the code can look like this:
#include <winternl.h>
#include <intrin.h>
__declspec(naked) BOOLEAN peb_detect() {
    __asm {
        MOV EAX, FS:[NT_TIB.Self]
        MOV EAX, [EAX + TEB.ProcessEnvironmentBlock]
        MOV AL, [EAX + PEB.BeingDebugged]
        RET
    }
}
But we can also use a shorter variant:
__declspec(naked) BOOLEAN peb_detect2() {
    __asm {
        MOV EAX, FS:[TEB.ProcessEnvironmentBlock]
        MOV AL, [EAX]PEB.BeingDebugged
        RET
    }
}
And to implement IsDebuggerPresent we need not use the inline assembler at all; this will also work for x64:
__forceinline BOOLEAN peb_detect3()
{
    return ((PEB*)
#ifdef _WIN64
        __readgsqword
#else
        __readfsdword
#endif
        (FIELD_OFFSET(_TEB, ProcessEnvironmentBlock)))->BeingDebugged;
}

Efficient type punning without undefined behavior

Say I'm working on a library called libModern. This library uses a legacy C library, called libLegacy, as an implementation strategy. libLegacy's interface looks like this:
typedef uint32_t LegacyFlags;
struct LegacyFoo {
    uint32_t x;
    uint32_t y;
    LegacyFlags flags;
    // more data
};
struct LegacyBar {
    LegacyFoo foo;
    float a;
    // more data
};
void legacy_input(LegacyBar const* s); // Does something with s
void legacy_output(LegacyBar* s); // Stores data in s
libModern shouldn't expose libLegacy's types to its users for various reasons, among them:
libLegacy is an implementation detail that shouldn't be leaked. Future versions of libModern might choose to use another library instead of libLegacy.
libLegacy uses hard-to-use, easy-to-misuse types that shouldn't be part of any user-facing API.
The textbook way to deal with this situation is the pimpl idiom: libModern would provide a wrapper type that internally has a pointer to the legacy data. However, this is not possible here, since libModern cannot allocate dynamic memory. Generally, its goal is not to add a lot of overhead.
Therefore, libModern defines its own types that are layout-compatible with the legacy types, yet have a better interface. In this example it is using a strong enum instead of a plain uint32_t for flags:
enum class ModernFlags : std::uint32_t
{
    first_flag = 0,
    second_flag = 1,
};
struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
    // More data
};
struct ModernBar {
    ModernFoo foo;
    float a;
    // more data
};
Now the question is: How can libModern convert between the legacy and the modern types without much overhead? I know of 3 options:
reinterpret_cast. This is undefined behavior, but in practice produces perfect assembly. I want to avoid this, since I cannot rely on this still working tomorrow or on another compiler.
std::memcpy. In simple cases this generates the same optimal assembly, but in any non-trivial case this adds significant overhead.
C++20's std::bit_cast. In my tests, at best it produces exactly the same code as memcpy. In some cases it's worse.
This is a comparison of the 3 ways to interface with libLegacy:
Interfacing with legacy_input()
Using reinterpret_cast:
void input_ub(ModernBar const& s) noexcept {
    legacy_input(reinterpret_cast<LegacyBar const*>(&s));
}
Assembly:
input_ub(ModernBar const&):
    jmp legacy_input
This is perfect codegen, but it invokes UB.
Using memcpy:
void input_memcpy(ModernBar const& s) noexcept {
    LegacyBar ls;
    std::memcpy(&ls, &s, sizeof(ls));
    legacy_input(&ls);
}
Assembly:
input_memcpy(ModernBar const&):
    sub rsp, 24
    movdqu xmm0, XMMWORD PTR [rdi]
    mov rdi, rsp
    movaps XMMWORD PTR [rsp], xmm0
    call legacy_input
    add rsp, 24
    ret
Significantly worse.
Using bit_cast:
void input_bit_cast(ModernBar const& s) noexcept {
    LegacyBar ls = std::bit_cast<LegacyBar>(s);
    legacy_input(&ls);
}
Assembly:
input_bit_cast(ModernBar const&):
    sub rsp, 40
    movdqu xmm0, XMMWORD PTR [rdi]
    mov rdi, rsp
    movaps XMMWORD PTR [rsp+16], xmm0
    mov rax, QWORD PTR [rsp+16]
    mov QWORD PTR [rsp], rax
    mov rax, QWORD PTR [rsp+24]
    mov QWORD PTR [rsp+8], rax
    call legacy_input
    add rsp, 40
    ret
And I have no idea what's going on here.
Interfacing with legacy_output()
Using reinterpret_cast:
auto output_ub() noexcept -> ModernBar {
    ModernBar s;
    legacy_output(reinterpret_cast<LegacyBar*>(&s));
    return s;
}
Assembly:
output_ub():
    sub rsp, 56
    lea rdi, [rsp+16]
    call legacy_output
    mov rax, QWORD PTR [rsp+16]
    mov rdx, QWORD PTR [rsp+24]
    add rsp, 56
    ret
Using memcpy:
auto output_memcpy() noexcept -> ModernBar {
    LegacyBar ls;
    legacy_output(&ls);
    ModernBar s;
    std::memcpy(&s, &ls, sizeof(ls));
    return s;
}
Assembly:
output_memcpy():
    sub rsp, 56
    lea rdi, [rsp+16]
    call legacy_output
    mov rax, QWORD PTR [rsp+16]
    mov rdx, QWORD PTR [rsp+24]
    add rsp, 56
    ret
Using bit_cast:
auto output_bit_cast() noexcept -> ModernBar {
    LegacyBar ls;
    legacy_output(&ls);
    return std::bit_cast<ModernBar>(ls);
}
Assembly:
output_bit_cast():
    sub rsp, 72
    lea rdi, [rsp+16]
    call legacy_output
    movdqa xmm0, XMMWORD PTR [rsp+16]
    movaps XMMWORD PTR [rsp+48], xmm0
    mov rax, QWORD PTR [rsp+48]
    mov QWORD PTR [rsp+32], rax
    mov rax, QWORD PTR [rsp+56]
    mov QWORD PTR [rsp+40], rax
    mov rax, QWORD PTR [rsp+32]
    mov rdx, QWORD PTR [rsp+40]
    add rsp, 72
    ret
Here you can find the entire example on Compiler Explorer.
I also noted that the codegen varies significantly depending on the exact definition of the structs (i.e. order, amount & type of members). But the UB version of the code is consistently better or at least as good as the other two versions.
Now my questions are:
How come the codegen varies so dramatically? It makes me wonder if I'm missing something important.
Is there something I can do to guide the compiler to generate better code without invoking UB?
Are there other standard-conformant ways that generate better code?
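As an added example (not code from the question or any answer): the question's memcpy conversion preceded by compile-time layout checks, since memcpy and bit_cast only give the intended result when the two types are trivially copyable and agree on size, alignment and every member offset. The struct definitions mirror the question's, with the "more data" members omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Mirrors of the question's types (constructed here; "more data" members omitted).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Compile-time gate: copying bytes between the two types is only meaningful when
// both are trivially copyable and agree on size, alignment and member offsets.
// A later edit that reorders or pads one struct fails here instead of handing
// libLegacy a misread object at run time.
static_assert(std::is_trivially_copyable<ModernBar>::value &&
              std::is_trivially_copyable<LegacyBar>::value, "types must be trivially copyable");
static_assert(sizeof(ModernBar) == sizeof(LegacyBar), "size mismatch");
static_assert(alignof(ModernBar) == alignof(LegacyBar), "alignment mismatch");
static_assert(offsetof(ModernBar, foo) == offsetof(LegacyBar, foo), "foo offset mismatch");
static_assert(offsetof(ModernFoo, x) == offsetof(LegacyFoo, x), "x offset mismatch");
static_assert(offsetof(ModernFoo, y) == offsetof(LegacyFoo, y), "y offset mismatch");
static_assert(offsetof(ModernFoo, flags) == offsetof(LegacyFoo, flags), "flags offset mismatch");
static_assert(offsetof(ModernBar, a) == offsetof(LegacyBar, a), "a offset mismatch");
static_assert(sizeof(ModernFlags) == sizeof(LegacyFlags), "flags width mismatch");

// The well-defined conversions from the question (memcpy variant), now guarded above.
LegacyBar to_legacy(const ModernBar& m) noexcept {
    LegacyBar l;
    std::memcpy(&l, &m, sizeof l);
    return l;
}
ModernBar to_modern(const LegacyBar& l) noexcept {
    ModernBar m;
    std::memcpy(&m, &l, sizeof m);
    return m;
}

// Round trip on a constructed value: every field must survive both directions and
// the strong enum must land in the legacy uint32_t with the same numeric value.
bool round_trip_preserves(std::uint32_t x, std::uint32_t y, ModernFlags f, float a) {
    ModernBar m{{x, y, f}, a};
    LegacyBar l = to_legacy(m);
    ModernBar back = to_modern(l);
    return l.foo.x == x && l.foo.y == y && l.foo.flags == static_cast<LegacyFlags>(f) && l.a == a
        && back.foo.x == x && back.foo.y == y && back.foo.flags == f && back.a == a;
}
```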
In your compiler explorer link, Clang produces the same code for all output cases. I don't know what problem GCC has with std::bit_cast in that situation.
For the input case, the three functions cannot produce the same code, because they have different semantics.
With input_ub, the call to legacy_input may be modifying the caller's object. This cannot be the case in the other two versions. Therefore the compiler cannot optimize away the copies, not knowing how legacy_input behaves.
If you pass by-value to the input functions, then all three versions produce the same code at least with Clang in your compiler explorer link.
To reproduce the code generated by the original input_ub you need to keep passing the address of the caller's object to legacy_input.
If legacy_input is an extern C function, then I don't think the standards specify how the object models of the two languages are supposed to interact in this call. So, for the purpose of the language-lawyer tag, I will assume that legacy_input is an ordinary C++ function.
The problem in passing the address of &s directly is that there is generally no LegacyBar object at the same address that is pointer-interconvertible with the ModernBar object. So if legacy_input tries to access LegacyBar members through the pointer, that would be UB.
Theoretically you could create a LegacyBar object at the required address, reusing the object representation of the ModernBar object. However, since the caller presumably will expect there to still be a ModernBar object after the call, you then need to recreate a ModernBar object in the storage by the same procedure.
Unfortunately though, you are not always allowed to reuse storage in this way. For example if the passed reference refers to a const complete object, that would be UB, and there are other requirements. The problem is also whether the caller's references to the old object will refer to the new object, meaning whether the two ModernBar objects are transparently replaceable. This would also not always be the case.
So in general I don't think you can achieve the same code generation without undefined behavior if you don't put additional constraints on the references passed to the function.
Most non-MSVC compilers support an attribute called __may_alias__ that you can use:
struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
    // More data
} __attribute__((__may_alias__));
struct ModernBar {
    ModernFoo foo;
    float a;
    // more data
} __attribute__((__may_alias__));
Of course some optimizations can't be done when aliasing is allowed, so use it only if performance is acceptable.
Godbolt link
Programs which would ever have any reason to access storage as multiple types should be processed using -fno-strict-aliasing or equivalent on any compiler that doesn't limit type-based aliasing assumptions around places where a pointer or lvalue of one type is converted to another, even if the program uses only corner-case behaviors mandated by the Standard. Using such a compiler flag will guarantee that one won't have type-based-aliasing problems, while jumping through hoops to use only standard-mandated corner cases won't. Both clang and gcc are sometimes prone to both:
1. have one phase of optimization change code whose behavior would be mandated by the Standard into code whose behavior isn't mandated by the Standard but would be equivalent in the absence of further optimization, but then
2. have a later phase of optimization further transform the code in a manner that would have been allowable for the version of the code produced by #1 but not for the code as it was originally written.
If using -fno-strict-aliasing on straightforwardly-written source code yields machine code whose performance is acceptable, that's a safer approach than trying to jump through hoops to satisfy constraints that the Standard allows compilers to impose in cases where doing so would allow them to be more useful [or--for poor quality compilers--in cases where doing so would make them less useful].
You could create a union with a private member to restrict access to the legacy representation:
union UnionBar {
    struct {
        ModernFoo foo;
        float a;
    };
private:
    LegacyBar legacy;
    friend LegacyBar const* to_legacy_const(UnionBar const& s) noexcept;
    friend LegacyBar* to_legacy(UnionBar& s) noexcept;
};
LegacyBar const* to_legacy_const(UnionBar const& s) noexcept {
    return &s.legacy;
}
LegacyBar* to_legacy(UnionBar& s) noexcept {
    return &s.legacy;
}
void input_union(UnionBar const& s) noexcept {
    legacy_input(to_legacy_const(s));
}
auto output_union() noexcept -> UnionBar {
    UnionBar s;
    legacy_output(to_legacy(s));
    return s;
}
The input/output functions are compiled to the same code as the reinterpret_cast-versions (using gcc/clang):
input_union(UnionBar const&):
    jmp legacy_input
and
output_union():
    sub rsp, 56
    lea rdi, [rsp+16]
    call legacy_output
    mov rax, QWORD PTR [rsp+16]
    mov rdx, QWORD PTR [rsp+24]
    add rsp, 56
    ret
Note that this uses anonymous structs and requires you to include the legacy implementation, which you mentioned you do not want. Also, I'm missing the experience to be fully confident that there's no hidden UB, so it would be great if someone else would comment on that :)

Inline __asm code for getting the address of TIB (fs:[0x18])

I'd like to get the TIB of a process and afterwards get its PEB and so forth. I'm failing to do so because I'm having some issues with the __readfsdword(0x18) function, so I'd like to do it with __asm inline code, if possible.
The program is compiled for x86, so I think it means that the TIB will be located at offset 0x18 from the FS register. On x64 it should be on gs:[0x30].
How would I implement this inline assembly idea?
Edit
NtCurrentTeb() and __readfsdword gave different return addresses so I wanted to get as low-level as possible to figure out which one was malfunctioning.
The reason why __readfsdword wasn't working is because I think the libraries weren't compatible with each other, so I replaced them with the updated versions and now it's working properly.
__readfsdword/__readgsqword are compiler intrinsic functions that will generate more optimized code, there is no reason to use inline assembly. Inline assembly is not even supported by Microsoft's compilers for 64-bit targets.
#include <stdio.h>
#include <intrin.h>
__declspec(naked) void* __stdcall GetTEB()
{
    __asm mov eax, dword ptr fs:[0x18] ;
    __asm ret ;
}
...
    void *teb;
    __asm push eax ;
    __asm mov eax, dword ptr fs:[0x18] ;
    __asm mov teb, eax ;
    __asm pop eax ;
    printf("%p == %p == %p\n", GetTEB(), teb, __readfsdword(0x18));
And as suggested in the comments, NtCurrentTeb() is provided by the Windows SDK. It most likely just uses __readfsdword.

VS debugger skips ctors in base class

It seems that Visual Studio debugger (I've checked VS 2015 and VS 2017) skips constructors and assignment operators in the base class. If I create a new C++ Win32 console application project with the following code
#include <iostream>
struct B
{
    B() { std::cout << "ctor"; }
};
struct S : B { };
int main()
{
    S s1;
    return 0;
}
I cannot step into B::B(), "ctor" is printed and the debugger goes to the "return 0;" line. In the disassembly the "call S::S (01713D4h)" is followed by a piece of code that is not attributed to any source ("Source not available"):
00E51DF0 push ebp
00E51DF1 mov ebp,esp
00E51DF3 sub esp,0CCh
00E51DF9 push ebx
00E51DFA push esi
00E51DFB push edi
00E51DFC push ecx
00E51DFD lea edi,[ebp-0CCh]
00E51E03 mov ecx,33h
00E51E08 mov eax,0CCCCCCCCh
00E51E0D rep stos dword ptr es:[edi]
00E51E0F pop ecx
00E51E10 mov dword ptr [this],ecx
00E51E13 mov ecx,dword ptr [this]
00E51E16 call B::B (0E51389h)
How can I step into B::B() (without using a breakpoint)?
I got the same issue as yours and could not find the VS settings which could impact the debugger tool, so I helped you report this issue to the product team here:
https://developercommunity.visualstudio.com/content/problem/77978/vs-debugger-skips-ctors-in-base-class.html
If possible, you could also add your comment and vote on that report directly. If I get any update from the product team, I will share it here.

C++Builder - implement entire function in assembly

I am trying to implement this inline assembly trick to obtain the value of EIP in C++Builder. The following code works in Release mode:
unsigned long get_eip()
{
    asm { mov eax, [esp] }
}
however it doesn't work in Debug mode. In Debug mode the code has to be changed to this:
unsigned long get_eip()
{
    asm { mov eax, [esp+4] }
}
By inspecting the generated assembly, the difference is that in Debug mode the code generated for the get_eip() function (first version) is:
push ebp
mov ebp,esp
mov eax,[esp]
pop ebp
ret
however in Release mode the code is:
mov eax,[esp]
ret
Of course I could use #ifdef NDEBUG to work around the problem; however, is there any syntax I can use to specify that the whole function is in assembly and the compiler should not insert the push ebp stuff (or otherwise solve this problem)?
Have you tried __declspec(naked)?
__declspec(naked) unsigned long get_eip()
{
    asm { mov eax, [esp] }
}

save inline asm register value to C pointer, can get it on GCC but not VC

For the sake of simplicity I'll just paste an example instead of my entire code, which is a bit huge. While porting my code to VC++ instead of using GCC, I need to rewrite a few inline assembly functions that receive pointers and save values to those pointers.
Imagine cpuid, for example:
void cpuid( int* peax, int* pebx, int* pecx, int* pedx, int what ){
    __asm__ __volatile__( "cpuid" : "=a" (*peax), "=b" (*pebx), "=c" (*pecx), "=d" (*pedx) : "a" (what) );
}
That will just work: it saves the values in the registers "returned" by cpuid to the pointers that I passed to the function.
Can the same be done with the inline assembler for VC?
So far, the exact same function signature but with:
    mov eax, what;
    cpuid;
    mov dword ptr [peax], eax;
    etc
won't work; peax will have the same value it had before calling the function.
Thanks in advance.
Tough to see because it is just a snippet, plus it could be called from C++ code / thiscall.
It might have to be 'naked' ( __declspec(naked) ) in some cases.
It won't port as VC is dropping x64 inline asm support iirc.
Use the __cpuid or __cpuidex intrinsic and enjoy.
mov eax, what;
cpuid;
mov ecx, dword ptr peax;
mov [ecx], eax;
will work.
Good luck!
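As an added example (not from the question or the answer): a cross-compiler version of the question's cpuid() wrapper that keeps the pointer-output signature, writes through the pointers in C++ rather than in inline asm, and uses the intrinsics the answer recommends. It assumes an x86 or x86-64 host; the helper names cpu_vendor and max_basic_leaf are mine.

```cpp
#include <cstring>
#include <string>
#if defined(_MSC_VER)
#include <intrin.h>   // MSVC: __cpuid(int regs[4], int leaf)
#else
#include <cpuid.h>    // GCC/Clang: __cpuid(leaf, a, b, c, d) macro
#endif

// Fill the caller's four pointers with EAX/EBX/ECX/EDX as returned by cpuid for `what`.
// Storing through the pointers in C++ avoids the MSVC inline-asm pitfall from the
// question, where `mov dword ptr [peax], eax` writes the pointer variable itself.
void cpuid(int* peax, int* pebx, int* pecx, int* pedx, int what) {
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, what);
    *peax = regs[0]; *pebx = regs[1]; *pecx = regs[2]; *pedx = regs[3];
#else
    unsigned int a, b, c, d;
    __cpuid(what, a, b, c, d);
    *peax = static_cast<int>(a); *pebx = static_cast<int>(b);
    *pecx = static_cast<int>(c); *pedx = static_cast<int>(d);
#endif
}

// Leaf 0: EAX holds the highest basic leaf, EBX:EDX:ECX the 12-byte vendor string.
std::string cpu_vendor() {
    int a, b, c, d;
    cpuid(&a, &b, &c, &d, 0);
    char vendor[13] = {};                 // 12 bytes plus terminator
    std::memcpy(vendor + 0, &b, 4);
    std::memcpy(vendor + 4, &d, 4);
    std::memcpy(vendor + 8, &c, 4);
    return vendor;
}

int max_basic_leaf() {
    int a = -1, b = -1, c = -1, d = -1;    // sentinels: cpuid must overwrite them
    cpuid(&a, &b, &c, &d, 0);
    return a;
}
```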
