Are unwrap()s under patterns optimized away? - performance

I have some code where I often use unwrap() under patterns, where I can be sure it won't panic. Some of those pieces are in performance-critical functions, so I was wondering if it would be a good idea to get rid of these unwrap()s in favor of unchecked variants of applicable functions. However, I didn't see any difference with #[bench] tests and the ASM for both variants looks pretty similar to me (though I'm not an expert).
It appears that Rust is able to optimize such cases away; am I right or should I use unchecked functions instead of unwrap()?
MCVE:
use self::Foo::*;
use self::Error::*;
#[derive(Debug)]
enum Foo {
    Bar(Box<Foo>),
    Baz
}
#[derive(Debug)]
enum Error {
    NotBar
}
impl Foo {
    fn bar_mut_ref(&mut self) -> Result<&mut Foo, Error> {
        match *self {
            Bar(ref mut foo) => Ok(foo),
            _ => Err(NotBar)
        }
    }
    fn bar_mut_ref_unchecked(&mut self) -> &mut Foo {
        match *self {
            Bar(ref mut foo) => foo,
            _ => panic!("bar_mut_ref_unchecked() called on a non-Bar!")
        }
    }
    fn bazify(&mut self) {
        match *self {
            Bar(_) => { *self = Baz },
            _ => ()
        }
    }
}
fn do_stuff_with_foo(foo: &mut Foo) {
    match *foo {
        Bar(_) => {
            foo.bar_mut_ref().unwrap().bazify(); // is _unchecked() better here?
            // underscore was used because foo is assigned to a new value here
        },
        _ => {}
    }
}
fn main() {
    let mut foo = Bar(Box::new(Bar(Box::new(Baz))));
    do_stuff_with_foo(&mut foo);
    println!("{:?}", foo);
}

A more direct comparison of the two approaches yields identical ASM, so at least for this simple example the answer appears to be: yes, such cases can be optimized away.
example::do_stuff_with_foo:
push rbp
mov rbp, rsp
push rbx
push rax
mov rbx, qword ptr [rdi]
test rbx, rbx
je .LBB1_3
cmp qword ptr [rbx], 0
je .LBB1_3
mov rdi, rbx
call core::ptr::drop_in_place
mov qword ptr [rbx], 0
.LBB1_3:
add rsp, 8
pop rbx
pop rbp
ret

Related

Some confusion about golang assembly

My Golang source code is as follows.
package main
func add(x, y int) int {
return x + y
}
func main() {
_ = add(1, 2)
}
The assembly code I obtained using go tool compile -N -l -S main.go > file1.s is as follows (part of it).
;file1.s
"".main STEXT size=54 args=0x0 locals=0x18 funcid=0x0
0x0000 00000 (main.go:7) TEXT "".main(SB), ABIInternal, $24-0
0x0000 00000 (main.go:7) CMPQ SP, 16(R14)
0x0004 00004 (main.go:7) PCDATA $0, $-2
0x0004 00004 (main.go:7) JLS 47
……
0x002f 00047 (main.go:7) CALL runtime.morestack_noctxt(SB)
0x0034 00052 (main.go:7) PCDATA $0, $-1
0x0034 00052 (main.go:7) JMP 0
And the assembly code I obtained using go tool compile -N -l main.go and go tool objdump -S -gnu main.o > file2.s is as follows (part of it).
;file2.s
TEXT "".main(SB) gofile..D:/code/Test/025_go/007_ass/main.go
func main() {
0x5b6 493b6610 CMPQ 0x10(R14), SP // cmp 0x10(%r14),%rsp
0x5ba 7629 JBE 0x5e5 // jbe 0x5e5
……
func main() {
0x5e5 e800000000 CALL 0x5ea // callq 0x5ea [1:5]R_CALL:runtime.morestack_noctxt
0x5ea ebca JMP "".main(SB) // jmp 0x5b6
My questions are:
Why are the source and destination of the CMPQ instructions in file1.s and file2.s opposite, as in CMPQ SP, 16(R14) vs CMPQ 0x10(R14), SP?
For the two listings above, my understanding is: when SP <= R14 + 16, runtime.morestack_noctxt is called to extend the stack. But what I don't understand is: why SP <= R14 + 16? What is the logic behind it? Is R14 the link register?
Is the code in file2.s a dead loop? Why is it so? Why is the code in file1.s not a dead loop?
What is the meaning of [1:5] in [1:5]R_CALL:runtime.morestack_noctxt in file2.s?
I have a basic knowledge of c++/golang as well as assembly, and I understand the memory layout of programs, but I am really confused about the above questions. Can anyone help me, or what material should I read?
Thank you to everyone who helps me.
Why are the source and destination of the CMPQ instructions in file1.s and file2.s opposite, as in CMPQ SP, 16(R14) vs CMPQ 0x10(R14), SP?
This is likely a bug in the disassembler which I encourage you to file with the Go project. The Go assembler has AT&T operand order for almost all instructions. The CMP family of instructions is a major exception which for easier use has the Intel operand order (i.e. CMPQ foo, bar; JGT baz jumps to baz if foo > bar).
For the two listings above, my understanding is: when SP <= R14 + 16, runtime.morestack_noctxt is called to extend the stack. But what I don't understand is: why SP <= R14 + 16? What is the logic behind it? Is R14 the link register?
R14 holds a pointer to the g structure corresponding to the currently active goroutine, and 0x10(R14) holds the stack limit stackguard0. See user1856856's answer for details. This is a new development following the register ABI proposal. stackguard0 is the lowest stack address the goroutine can use before it has to ask the runtime for more stack.
Is the code in file2.s a dead loop? Why is it so? Why is the code in file1.s not a dead loop?
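No. When runtime.morestack_noctxt returns, the runtime has moved the goroutine to a larger stack and updated the stack limit stackguard0 stored at 0x10(R14), so this time the check passes and the function body runs. If the stack is still too small, more stack is allocated again until the check passes. This means it is normally not an endless loop. Note that file1.s contains the same loop: JMP 0 jumps back to offset 0 of main.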
What is the meaning of [1:5] in [1:5]R_CALL:runtime.morestack_noctxt in file2.s?
This comment hints at the presence of a relocation, indicating that the linker will have to patch in the address of runtime.morestack_noctxt at link time. You can see that the function address in the instruction e800000000 is all zeroes, so, as is, the call doesn't go anywhere useful. This only changes when the linker resolves the relocation.
For question 2: on amd64, R14 is the register that holds the current g. You can check the function stacksplit:
// Append code to p to check for stack split.
// Appends to (does not overwrite) p.
// Assumes g is in rg.
// Returns last new instruction and G register.
func stacksplit(ctxt *obj.Link, cursym *obj.LSym, p *obj.Prog, newprog obj.ProgAlloc, framesize int32, textarg int32) (*obj.Prog, int16) {
	// emit...
	// Load G register
	var rg int16
	p, rg = loadG(ctxt, cursym, p, newprog)
	var q1 *obj.Prog
	if framesize <= objabi.StackSmall {
		// small stack: SP <= stackguard
		//	CMPQ SP, stackguard
		p = obj.Appendp(p, newprog)
		p.As = cmp
		p.From.Type = obj.TYPE_REG
		p.From.Reg = REG_SP
		p.To.Type = obj.TYPE_MEM
		p.To.Reg = rg
		p.To.Offset = 2 * int64(ctxt.Arch.PtrSize) // G.stackguard0
		if cursym.CFunc() {
			p.To.Offset = 3 * int64(ctxt.Arch.PtrSize) // G.stackguard1
		}
and the loadG
func loadG(ctxt *obj.Link, cursym *obj.LSym, p *obj.Prog, newprog obj.ProgAlloc) (*obj.Prog, int16) {
	if ctxt.Arch.Family == sys.AMD64 && cursym.ABI() == obj.ABIInternal {
		// Use the G register directly in ABIInternal
		return p, REGG
	}
	var regg int16 = REG_CX
	if ctxt.Arch.Family == sys.AMD64 {
		regg = REGG // == REG_R14
	}
and the file in src/cmd/internal/obj/x86/a.out.go
REGG = REG_R14 // g register in ABIInternal
the g structure is
type g struct {
	// Stack parameters.
	// stack describes the actual stack memory: [stack.lo, stack.hi).
	// stackguard0 is the stack pointer compared in the Go stack growth prologue.
	// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
	// stackguard1 is the stack pointer compared in the C stack growth prologue.
	// It is stack.lo+StackGuard on g0 and gsignal stacks.
	// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
	stack       stack   // offset known to runtime/cgo
	stackguard0 uintptr // offset known to liblink
	stackguard1 uintptr // offset known to liblink
}
and the stack structure is
// Stack describes a Go execution stack.
// The bounds of the stack are exactly [lo, hi),
// with no implicit data structures on either side.
type stack struct {
	lo uintptr
	hi uintptr
}
The stack field occupies two pointer-sized words (16 bytes on amd64), so I think the value at offset 16 from R14 is the current g's stackguard0.

Efficient type punning without undefined behavior

Say I'm working on a library called libModern. This library uses a legacy C library, called libLegacy, as an implementation strategy. libLegacy's interface looks like this:
typedef uint32_t LegacyFlags;
struct LegacyFoo {
    uint32_t x;
    uint32_t y;
    LegacyFlags flags;
    // more data
};
struct LegacyBar {
    LegacyFoo foo;
    float a;
    // more data
};
void legacy_input(LegacyBar const* s); // Does something with s
void legacy_output(LegacyBar* s); // Stores data in s
libModern shouldn't expose libLegacy's types to its users for various reasons, among them:
libLegacy is an implementation detail that shouldn't be leaked. Future versions of libModern might choose to use another library instead of libLegacy.
libLegacy uses hard-to-use, easy-to-misuse types that shouldn't be part of any user-facing API.
The textbook way to deal with this situation is the pimpl idiom: libModern would provide a wrapper type that internally has a pointer to the legacy data. However, this is not possible here, since libModern cannot allocate dynamic memory. Generally, its goal is not to add a lot of overhead.
Therefore, libModern defines its own types that are layout-compatible with the legacy types, yet have a better interface. In this example it is using a strong enum instead of a plain uint32_t for flags:
enum class ModernFlags : std::uint32_t
{
    first_flag = 0,
    second_flag = 1,
};
struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
    // More data
};
struct ModernBar {
    ModernFoo foo;
    float a;
    // more data
};
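The claim that the modern and legacy types share a layout can be checked at compile time. The following is only a minimal sketch: it restates the types from the question without the elided "more data" members (an assumption), and `layouts_match` is a hypothetical helper name. Passing these checks shows that sizes, alignment, and member offsets agree; it does not by itself make a reinterpret_cast between the types well-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Types restated from the question, without the elided members (assumption).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Hypothetical helper: true when both pairs of types agree in size,
// alignment, and the offset of every member.
constexpr bool layouts_match() {
    return sizeof(ModernFoo) == sizeof(LegacyFoo)
        && sizeof(ModernBar) == sizeof(LegacyBar)
        && alignof(ModernBar) == alignof(LegacyBar)
        && offsetof(ModernFoo, x) == offsetof(LegacyFoo, x)
        && offsetof(ModernFoo, y) == offsetof(LegacyFoo, y)
        && offsetof(ModernFoo, flags) == offsetof(LegacyFoo, flags)
        && offsetof(ModernBar, foo) == offsetof(LegacyBar, foo)
        && offsetof(ModernBar, a) == offsetof(LegacyBar, a);
}

// offsetof is only guaranteed to work on standard-layout types.
static_assert(std::is_standard_layout<ModernBar>::value &&
              std::is_standard_layout<LegacyBar>::value,
              "both types must be standard-layout");
// Copying the bytes between the types requires trivially copyable types.
static_assert(std::is_trivially_copyable<ModernBar>::value &&
              std::is_trivially_copyable<LegacyBar>::value,
              "both types must be trivially copyable");
static_assert(layouts_match(), "ModernBar and LegacyBar layouts diverged");
```

If a member is later added to only one of the two types, the last static_assert stops the build instead of letting the mismatch surface at run time.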
Now the question is: How can libModern convert between the legacy and the modern types without much overhead? I know of 3 options:
reinterpret_cast. This is undefined behavior, but in practice produces perfect assembly. I want to avoid this, since I cannot rely on this still working tomorrow or on another compiler.
std::memcpy. In simple cases this generates the same optimal assembly, but in any non-trivial case this adds significant overhead.
C++20's std::bit_cast. In my tests, at best it produces exactly the same code as memcpy. In some cases it's worse.
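For reference, the std::memcpy option from the list above can be wrapped in a small generic helper. This is a sketch under assumptions: `copy_as` is a hypothetical name, the types are restated without the elided members, and the helper always copies, so it inherits whatever overhead the compiler fails to remove.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Types restated from the question, without the elided members (assumption).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Hypothetical helper: builds a To whose object representation is a
// byte-for-byte copy of `from`. Well-defined for trivially copyable types.
template <class To, class From>
To copy_as(From const& from) noexcept {
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
    static_assert(std::is_trivially_default_constructible<To>::value,
                  "To must be trivially default constructible");
    To to;
    std::memcpy(&to, &from, sizeof(To));
    return to;
}
```

A call such as `copy_as<LegacyBar>(modern_bar)` yields an independent LegacyBar, so changes the legacy library makes to it are not visible in the original ModernBar unless the result is copied back.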
This is a comparison of the 3 ways to interface with libLegacy:
Interfacing with legacy_input()
Using reinterpret_cast:
void input_ub(ModernBar const& s) noexcept {
    legacy_input(reinterpret_cast<LegacyBar const*>(&s));
}
Assembly:
input_ub(ModernBar const&):
jmp legacy_input
This is perfect codegen, but it invokes UB.
Using memcpy:
void input_memcpy(ModernBar const& s) noexcept {
    LegacyBar ls;
    std::memcpy(&ls, &s, sizeof(ls));
    legacy_input(&ls);
}
Assembly:
input_memcpy(ModernBar const&):
sub rsp, 24
movdqu xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
movaps XMMWORD PTR [rsp], xmm0
call legacy_input
add rsp, 24
ret
Significantly worse.
Using bit_cast:
void input_bit_cast(ModernBar const& s) noexcept {
    LegacyBar ls = std::bit_cast<LegacyBar>(s);
    legacy_input(&ls);
}
Assembly:
input_bit_cast(ModernBar const&):
sub rsp, 40
movdqu xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
movaps XMMWORD PTR [rsp+16], xmm0
mov rax, QWORD PTR [rsp+16]
mov QWORD PTR [rsp], rax
mov rax, QWORD PTR [rsp+24]
mov QWORD PTR [rsp+8], rax
call legacy_input
add rsp, 40
ret
And I have no idea what's going on here.
Interfacing with legacy_output()
Using reinterpret_cast:
auto output_ub() noexcept -> ModernBar {
    ModernBar s;
    legacy_output(reinterpret_cast<LegacyBar*>(&s));
    return s;
}
Assembly:
output_ub():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
Using memcpy:
auto output_memcpy() noexcept -> ModernBar {
    LegacyBar ls;
    legacy_output(&ls);
    ModernBar s;
    std::memcpy(&s, &ls, sizeof(ls));
    return s;
}
Assembly:
output_memcpy():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
Using bit_cast:
auto output_bit_cast() noexcept -> ModernBar {
    LegacyBar ls;
    legacy_output(&ls);
    return std::bit_cast<ModernBar>(ls);
}
Assembly:
output_bit_cast():
sub rsp, 72
lea rdi, [rsp+16]
call legacy_output
movdqa xmm0, XMMWORD PTR [rsp+16]
movaps XMMWORD PTR [rsp+48], xmm0
mov rax, QWORD PTR [rsp+48]
mov QWORD PTR [rsp+32], rax
mov rax, QWORD PTR [rsp+56]
mov QWORD PTR [rsp+40], rax
mov rax, QWORD PTR [rsp+32]
mov rdx, QWORD PTR [rsp+40]
add rsp, 72
ret
Here you can find the entire example on Compiler Explorer.
I also noted that the codegen varies significantly depending on the exact definition of the structs (i.e. order, amount & type of members). But the UB version of the code is consistently better or at least as good as the other two versions.
Now my questions are:
How come the codegen varies so dramatically? It makes me wonder if I'm missing something important.
Is there something I can do to guide the compiler to generate better code without invoking UB?
Are there other standard-conformant ways that generate better code?
In your compiler explorer link, Clang produces the same code for all output cases. I don't know what problem GCC has with std::bit_cast in that situation.
For the input case, the three functions cannot produce the same code, because they have different semantics.
With input_ub, the call to legacy_input may be modifying the caller's object. This cannot be the case in the other two versions. Therefore the compiler cannot optimize away the copies, not knowing how legacy_input behaves.
If you pass by-value to the input functions, then all three versions produce the same code at least with Clang in your compiler explorer link.
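A by-value variant of the memcpy version might look like the sketch below. The types are restated without the elided members, and `legacy_input` is replaced by a stand-in that merely records its argument; both are assumptions made so the example compiles on its own. Because the parameter is already a private copy, the function cannot affect the caller's object, which matches the semantics of the memcpy and bit_cast versions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Types restated from the question, without the elided members (assumption).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Stand-in for the real library function (assumption): remembers its argument.
static LegacyBar last_seen{};
void legacy_input(LegacyBar const* s) { last_seen = *s; }

// By-value variant: `s` is a copy, so nothing here can alias the caller's object.
void input_by_value(ModernBar s) noexcept {
    static_assert(sizeof(LegacyBar) == sizeof(ModernBar), "sizes must match");
    LegacyBar ls;
    std::memcpy(&ls, &s, sizeof(ls));  // copy the object representation
    legacy_input(&ls);
}
```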
To reproduce the code generated by the original input_ub you need to keep passing the address of the caller's object to legacy_input.
If legacy_input is an extern C function, then I don't think the standards specify how the object models of the two languages are supposed to interact in this call. So, for the purpose of the language-lawyer tag, I will assume that legacy_input is an ordinary C++ function.
The problem with passing &s directly is that there is generally no LegacyBar object at the same address that is pointer-interconvertible with the ModernBar object. So if legacy_input tries to access LegacyBar members through the pointer, that would be UB.
Theoretically you could create a LegacyBar object at the required address, reusing the object representation of the ModernBar object. However, since the caller presumably will expect there to still be a ModernBar object after the call, you then need to recreate a ModernBar object in the storage by the same procedure.
Unfortunately though, you are not always allowed to reuse storage in this way. For example if the passed reference refers to a const complete object, that would be UB, and there are other requirements. The problem is also whether the caller's references to the old object will refer to the new object, meaning whether the two ModernBar objects are transparently replaceable. This would also not always be the case.
So in general I don't think you can achieve the same code generation without undefined behavior if you don't put additional constraints on the references passed to the function.
Most non-MSVC compilers support an attribute called __may_alias__ that you can use
struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
    // More data
} __attribute__((__may_alias__));
struct ModernBar {
    ModernFoo foo;
    float a;
    // more data
} __attribute__((__may_alias__));
Of course some optimizations can't be done when aliasing is allowed, so use it only if performance is acceptable
Godbolt link
Programs which would ever have any reason to access storage as multiple types should be processed using -fno-strict-aliasing or equivalent on any compiler that doesn't limit type-based aliasing assumptions around places where a pointer or lvalue of one type is converted to another, even if the program uses only corner-case behaviors mandated by the Standard. Using such a compiler flag will guarantee that one won't have type-based-aliasing problems, while jumping through hoops to use only standard-mandated corner cases won't. Both clang and gcc are sometimes prone to both:
1. have one phase of optimization change code whose behavior would be mandated by the Standard into code whose behavior isn't mandated by the Standard but would be equivalent in the absence of further optimization, and then
2. have a later phase of optimization further transform the code in a manner that would have been allowable for the version of the code produced by #1 but not for the code as it was originally written.
If using -fno-strict-aliasing on straightforwardly-written source code yields machine code whose performance is acceptable, that's a safer approach than trying to jump through hoops to satisfy constraints that the Standard allows compilers to impose in cases where doing so would allow them to be more useful [or--for poor quality compilers--in cases where doing so would make them less useful].
You could create a union with a private member to restrict access to the legacy representation:
union UnionBar {
    struct {
        ModernFoo foo;
        float a;
    };
private:
    LegacyBar legacy;
    friend LegacyBar const* to_legacy_const(UnionBar const& s) noexcept;
    friend LegacyBar* to_legacy(UnionBar& s) noexcept;
};
LegacyBar const* to_legacy_const(UnionBar const& s) noexcept {
    return &s.legacy;
}
LegacyBar* to_legacy(UnionBar& s) noexcept {
    return &s.legacy;
}
void input_union(UnionBar const& s) noexcept {
    legacy_input(to_legacy_const(s));
}
auto output_union() noexcept -> UnionBar {
    UnionBar s;
    legacy_output(to_legacy(s));
    return s;
}
The input/output functions are compiled to the same code as the reinterpret_cast-versions (using gcc/clang):
input_union(UnionBar const&):
jmp legacy_input
and
output_union():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
Note that this uses anonymous structs and requires you to include the legacy implementation, which you mentioned you do not want. Also, I'm missing the experience to be fully confident that there's no hidden UB, so it would be great if someone else would comment on that :)

go binaries parameter passing

I'm trying to understand parameter passing in go ELF binaries
mov %rdi,0x8(%rsp) <----- first parameter (?)
mov %r8,0x10(%rsp) <----- second
mov %r9,0x18(%rsp) <----- third
mov %r10,0x20(%rsp) <----- fourth
mov %rcx,0x28(%rsp) <----- fifth
sub %edx,%ebx
movslq %ebx,%rcx
mov %rcx,0x30(%rsp) <----- sixth
callq 5aaba0 <math/big.nat.shl> <----- call to native golang func
It seems that parameters are passed on stack, but when I look at the function here (line 681):
// z = x << s
func (z nat) shl(x nat, s uint) nat { ... }
The number of parameters is just 2, but in the ELF it looks like 6 (?)

VC++ Inline assembly errors

I have been searching around for a while, and couldn't seem to find the answer to my issue. I'm trying to code some functions to detect whether or not the executable is being debugged, and I'm using some inline assembly for it (with the __asm tag). It keeps throwing two errors, and the rest of the code seems to compile fine. Here's the function
int peb_detect() {
    __asm {
        ASSUME FS : NOTHING
        MOV EAX, DWORD PTR FS : [18]
        MOV EAX, DBYTE PTR DS : [EAX + 30]
        MOVZX EAX, BYTE PTR DS : [EAX + 2]
        RET
    }
}
and I keep getting the errors
warning C4405: 'FS': identifier is reserved word
warning C2400: inline assembler syntax error in 'opcode'; found 'FS'
warning C2408: illegal type on PTR operator in 'second operand'
I can't seem to figure it out. If anyone can help, I would really appreciate it. Thanks!
First, the offsets are 0x18 and 0x30 (hexadecimal), not 18 and 30:
C_ASSERT(FIELD_OFFSET(NT_TIB, Self) == 0x18);
C_ASSERT(FIELD_OFFSET(TEB, ProcessEnvironmentBlock) == 0x30);
Avoid hard-coded constants, especially wrong ones; use the symbolic field offsets instead.
Second, peb_detect() must be declared __declspec(naked) if it executes the RET instruction itself. The code can then look like this:
#include <winternl.h>
#include <intrin.h>
__declspec(naked) BOOLEAN peb_detect() {
    __asm {
        MOV EAX, FS:[NT_TIB.Self]
        MOV EAX, [EAX + TEB.ProcessEnvironmentBlock]
        MOV AL, [EAX + PEB.BeingDebugged]
        RET
    }
}
A shorter variant is also possible:
__declspec(naked) BOOLEAN peb_detect2() {
    __asm {
        MOV EAX, FS:[TEB.ProcessEnvironmentBlock]
        MOV AL, [EAX]PEB.BeingDebugged
        RET
    }
}
To implement IsDebuggerPresent we do not need inline assembler at all, and this version also works on x64:
__forceinline BOOLEAN peb_detect3()
{
    return ((PEB*)
#ifdef _WIN64
        __readgsqword
#else
        __readfsdword
#endif
        (FIELD_OFFSET(_TEB, ProcessEnvironmentBlock)))->BeingDebugged;
}

C++Builder - implement entire function in assembly

I am trying to implement this inline assembly trick to obtain the value of EIP in C++Builder. The following code works in Release mode:
unsigned long get_eip()
{
    asm { mov eax, [esp] }
}
however it doesn't work in Debug mode. In Debug mode the code has to be changed to this:
unsigned long get_eip()
{
    asm { mov eax, [esp+4] }
}
Inspecting the generated assembly shows the difference: in Debug mode the code generated for the get_eip() function (first version) is:
push ebp
mov ebp,esp
mov eax,[esp]
pop ebp
ret
however in Release mode the code is:
mov eax,[esp]
ret
Of course I could use #ifdef NDEBUG to work around the problem; however, is there any syntax I can use to specify that the whole function is in assembly and the compiler should not insert the push ebp stuff? (Or is there another way to solve this problem?)
Have you tried __declspec(naked)?
__declspec(naked) unsigned long get_eip()
{
    asm { mov eax, [esp] }
}

Resources