Performance of single-constructor disjoint union in F#

I'm trying to wrap a primitive type in a single-constructor disjoint union with members:

type T = | T of int
with
    member inline this.add k =
        let (T i) = this
        T (i + k)
    member inline this.ret =
        let (T i) = this
        i
I do this to get type safety (not all primitive values are meaningful to me), and because in some settings it is nice to get to pretend the value is an object (e.g., you get to override ToString()).
I was expecting the compiler to remove the overhead of the unnecessary tag T, but it doesn't seem to be doing that. I tried the following two functions, one using the members, one deconstructing and working on the int value 'by hand'. When I disassemble (using fasmi), I get the accompanying assembly:
let byMembers (c : T) =
    let d = c.add 0xcafebabe
    let e = d.add 0xcafebab0
    e.ret

L0000: push rbx
L0001: mov ebx, [rdi+8]
L0004: mov rdi, 0x1147adb08
L000e: call 0x00000001047b6c10
L0013: add ebx, 0xcafebabe
L0019: mov [rax+8], ebx
L001c: mov ebx, [rax+8]
L001f: mov rdi, 0x1147adb08
L0029: call 0x00000001047b6c10
L002e: lea edi, [rbx-0x35014550]
L0034: mov [rax+8], edi
L0037: mov eax, [rax+8]
L003a: pop rbx
L003b: ret
let byHand (T i) =
    let d = i + 0xcafebabe
    let e = d + 0xcafebab0
    e

L0000: mov eax, [rdi+8]
L0003: add eax, 0x95fd756e
L0008: ret
When I benchmark these, I find, unsurprisingly, that byHand runs in roughly 0.66 of the time of byMembers, which is undesirable for my actual application.
Am I doing something wrong with the type T? Is there a way to type-safely abstract a primitive type in F# such that the compiled output has no overhead compared to the 'raw' implementation?

Add [<Struct>]:
[<Struct>] // <--- here
type T = | T of int
with
    member inline this.add k =
        let (T i) = this
        T (i + k)
    member inline this.ret =
        let (T i) = this
        i
Disassembling, we find byMembers has been optimized down to essentials:
let byMembers (c : T) =
    let d = c.add 0xcafebabe
    let e = d.add 0xcafebab0
    e.ret

L0000: add edi, 0xcafebabe
L0006: lea eax, [rdi-0x35014550]
L000c: ret
NB! The [<Struct>] attribute causes T to be stack-allocated and get pass-by-value semantics; presumably this is why the optimizer can remove the unnecessary tag.
The [<Struct>] attribute also applies to records, with the same meaning, and yields the desired performance characteristics in that setting too. For tuples and anonymous records, there is neater syntax available via the struct keyword.
Documentation here.

Related

All the calculations take place in registers. Why is the stack not storing the result of the register computation here

I am debugging a simple piece of code in C++ and looking at the disassembly.
In the disassembly, all the calculations are done in registers, and afterwards the result of the operation is returned. I only see the a and b variables being pushed onto the stack (the code is below); I don't see the resultant c variable pushed onto the stack. Am I missing something?
From what I found on the internet, it looks like all of the variables a, b and c should be pushed onto the stack. But in my disassembly, I don't see the resultant variable c being pushed onto the stack.
C++ code:
#include <iostream>
using namespace std;

int AddMe(int a, int b)
{
    int c;
    c = a + b;
    return c;
}

int main()
{
    AddMe(10, 20);
    return 0;
}
Relevant assembly code:
int main()
{
00832020 push ebp
00832021 mov ebp,esp
00832023 sub esp,0C0h
00832029 push ebx
0083202A push esi
0083202B push edi
0083202C lea edi,[ebp-0C0h]
00832032 mov ecx,30h
00832037 mov eax,0CCCCCCCCh
0083203C rep stos dword ptr es:[edi]
0083203E mov ecx,offset _E7BF1688_Function#cpp (0849025h)
00832043 call #__CheckForDebuggerJustMyCode#4 (083145Bh)
AddMe(10, 20);
00832048 push 14h
0083204A push 0Ah
0083204C call std::operator<<<std::char_traits<char> > (08319FBh)
00832051 add esp,8
return 0;
00832054 xor eax,eax
}
As seen above, 14h and 0Ah are pushed onto the stack - corresponding to AddMe(10, 20);
But, when we look at the disassembly for the AddMe function, we see that the variable c (c = a + b), is not pushed onto the stack.
snippet of AddMe in Disassembly:
…
int c;
c = a + b;
00836028 mov eax,dword ptr [a]
0083602B add eax,dword ptr [b]
0083602E mov dword ptr [c],eax
return c;
00836031 mov eax,dword ptr [c]
}
Shouldn't c be pushed onto the stack in this program? Am I missing something?
All the calculations take place in registers.
Well yes, but they're stored afterwards.
Using a memory-destination add instead of just using the accumulator register (EAX) would be an optimization, and one that's impossible when the result needs to be in a different location than any of the inputs to an expression.
Why is the stack not storing the result of the register computation here
It is, just not with push
You compiled with optimization disabled (debug mode) so every C object really does have its own address in the asm, and is kept in sync between C statements. i.e. no keeping C variables in registers. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?). This is one reason why debug mode is extra slow: it's not just avoiding optimizations, it's forcing store/reload.
But the compiler uses mov, not push, because c is not a function arg. That's a missed optimization that all compilers share, but in this case it's not even trying to optimize. (What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?). It would certainly be possible for the compiler to reserve space for c in the same instruction that stores it, using push. But compilers instead do stack allocation for all locals on entry to a function, with one sub esp, constant.
Somewhere before the mov dword ptr [c],eax that spills c to its stack slot, there's a sub esp, 12 or something that reserves stack space for c. In this exact case, MSVC uses a dummy push to reserve 4 bytes of space, as an optimization over sub esp, 4.
In the MSVC asm output, the compiler will emit a _c$ = -4 line or something that defines c as a text substitution for an offset from ebp. If you looked at the actual disassembly you'd just see [ebp-4] or whatever addressing mode, instead of the symbolic name.
In MSVC asm output, don't assume that [c] refers to static storage. It's actually still stack space as expected, but using a symbolic name for the offset.
Putting your code on the Godbolt compiler explorer with 32-bit MSVC 19.22, we get the following asm, which only uses symbolic asm constants for the offset, not the whole addressing mode. So [c] in your listing might just be that form, over-simplified even further.
_c$ = -4                            ; size = 4
_a$ = 8                             ; size = 4
_b$ = 12                            ; size = 4
int AddMe(int,int) PROC             ; AddMe
    push ebp
    mov ebp, esp                    ; set up a legacy frame pointer
    push ecx                        ; reserve 4 bytes of stack space for c
    mov eax, DWORD PTR _a$[ebp]
    add eax, DWORD PTR _b$[ebp]     ; c = a + b
    mov DWORD PTR _c$[ebp], eax     ; spill c to the stack
    mov eax, DWORD PTR _c$[ebp]     ; reload it as the return value
    mov esp, ebp                    ; restore ESP
    pop ebp                         ; tear down the stack frame
    ret 0
int AddMe(int,int) ENDP             ; AddMe
The __cdecl calling convention, which AddMe() uses by default (depending on the compiler's configuration), requires parameters to be passed on the stack. But there is nothing requiring local variables to be stored on the stack. The compiler is allowed to use registers as an optimization, as long as the intent of the code is preserved.

Disassembly code from visual studio

Using WinDBG to debug the assembly code of an executable, it seems that the compiler inserts some extra code between two sequential statements. The statements are pretty simple; e.g., they don't involve complex objects or function calls:
int a, b;
char c;
long l;
a = 0; // ##
b = a + 1; // %%
c = 1; // ##
l = 1000000;
l = l + 1;
And the disassembly is
## 008a1725 c745f800000000 mov dword ptr [ebp-8],0
008a172c 80bd0bffffff00 cmp byte ptr [ebp-0F5h],0 ss:002b:0135f71f=00
008a1733 750d jne test!main+0x42 (008a1742)
008a1735 687c178a00 push offset test!main+0x7c (008a177c)
008a173a e893f9ffff call test!ILT+205(__RTC_UninitUse) (008a10d2)
008a173f 83c404 add esp,4
008a1742 8b45ec mov eax,dword ptr [ebp-14h]
%% 008a1745 83c001 add eax,1
008a1748 c6850bffffff01 mov byte ptr [ebp-0F5h],1
008a174f 8945ec mov dword ptr [ebp-14h],eax
## 008a1752 c645e301 mov byte ptr [ebp-1Dh],1
Please note that ##, %% and ## in the disassembly list show the corresponding C++ lines.
So what are those call, cmp, jne and push instructions?
It is the compiler's run-time error checking (RTC); the /RTCu switch checks for uninitialized variables. I think you can manage it from Visual Studio (compiler options).
For more information, take a look at this, section "/RTCu switch".

Recursive methods

I'm having a hard time grasping recursion. For example, I have the following method. When the if statement returns true, I expect to return from this method. However, looking at the method's execution in WinDbg and Visual Studio shows that the method continues to execute. I apologize for the generic question; however, your feedback would really be appreciated.
How is N decremented in-order to satisfy the if condition?
long factorial(int N)
{
    if (N == 1)
        return 1;
    return N * factorial(N - 1);
}
Compiling and disassembling the function, you should get a disassembly similar to this:
0:000> cdb: Reading initial command 'uf fact!fact;q'
fact!fact:
00401000 55 push ebp
00401001 8bec mov ebp,esp
00401003 837d0801 cmp dword ptr [ebp+8],1
00401007 7507 jne fact!fact+0x10 (00401010)
fact!fact+0x9:
00401009 b801000000 mov eax,1
0040100e eb13 jmp fact!fact+0x23 (00401023)
fact!fact+0x10:
00401010 8b4508 mov eax,dword ptr [ebp+8]
00401013 83e801 sub eax,1
00401016 50 push eax
00401017 e8e4ffffff call fact!fact (00401000)
0040101c 83c404 add esp,4
0040101f 0faf4508 imul eax,dword ptr [ebp+8]
fact!fact+0x23:
00401023 5d pop ebp
00401024 c3 ret
quit:
Let's assume N == 5 when the function is entered, i.e. [ebp+8] will hold 5.
As long as [ebp+8] > 1, the jne will be taken.
Here you can see N being decremented (sub eax, 1).
The decremented N is then passed to fact again (recursion, without yet returning to the caller). This keeps happening until the jne is not taken, that is, until N (i.e. [ebp+8]) == 1.
When N becomes 1, the jne is not taken and the jmp to 401023 is taken instead,
where it returns to the caller, the caller being fact itself;
that is, it returns to 40101c, where the multiplication into eax takes place and the result is stored back in eax.
This keeps happening until the ret returns to the original call in main(). See the stack below, captured prior to executing pop ebp for the first time:
0:000> kPL
ChildEBP RetAddr
0013ff38 0040101c fact!fact(
int N = 0n1)+0x23
0013ff44 0040101c fact!fact(
int N = 0n2)+0x1c
0013ff50 0040101c fact!fact(
int N = 0n3)+0x1c
0013ff5c 0040101c fact!fact(
int N = 0n4)+0x1c
0013ff68 0040109f fact!fact(
int N = 0n5)+0x1c
0013ff78 0040140b fact!main(
int argc = 0n2,
char ** argv = 0x00033ac0)+0x6f
I think the best way to grasp it is to work through your code manually. Say you call factorial(4); what happens? 4 is not equal to 1, so return 4 * factorial(4-1).
What is the return value of factorial(3)? 3 is not equal to 1, so return 3 * factorial(3-1).
What is the return value of factorial(2)? 2 is not equal to 1, so return 2 * factorial(2-1).
What is the return value of factorial(1)? 1 equals 1, so return 1. This is the base case. Now we move back up the recursion:
Return 1. This is factorial(2-1).
Return 2*1. This is factorial(3-1).
Return 3*2. This is factorial(4-1).
Return 4*6. This is factorial(4), the original call you made.
The idea is that you have a function with a base case (when n == 1, return 1), and the function calls itself in a way that moves toward the base case (factorial(n - 1)).

Visual Studio performance optimization in branching

Consider the following
while (true)
{
    if (x > 5)
        // Run function A
    else
        // Run function B
}
If x is always less than 5, does the Visual Studio compiler do any optimization, i.e. never check whether x is larger than 5 and always run function B?
It depends on whether or not the compiler "knows" that x will always be less than 5.
Yes, nearly all modern compilers are capable of removing the branch. But the compiler needs to be able to prove that the branch will always go one direction.
Here's an example that can be optimized:
int x = 1;
if (x > 5)
    printf("Hello\n");
else
    printf("World\n");
The disassembly is:
sub rsp, 40 ; 00000028H
lea rcx, OFFSET FLAT:??_C#_06DKJADKFF#World?6?$AA#
call QWORD PTR __imp_printf
x = 1 is provably less than 5. So the compiler is able to remove the branch.
But in this example, even if you always input less than 5, the compiler doesn't know that. It must assume any input.
int x;
cin >> x;
if (x > 5)
    printf("Hello\n");
else
    printf("World\n");
The disassembly is:
cmp DWORD PTR x$[rsp], 5
lea rcx, OFFSET FLAT:??_C#_06NJBIDDBG#Hello?6?$AA#
jg SHORT $LN5#main
lea rcx, OFFSET FLAT:??_C#_06DKJADKFF#World?6?$AA#
$LN5#main:
call QWORD PTR __imp_printf
The branch stays. But note that it actually hoisted the function call out of the branch. So it really optimized the code down to something like this:
const char *str = "Hello\n";
if (!(x > 5))
    str = "World\n";
printf(str);

In a loop, is it more efficient to set a state value every time or to check to see if it has changed and then set it?

I'm in a situation where I'm iterating through a number of records and setting state information based on the data in those records. Something like this (not real code, just a simplification):
StateObject state;
ConcurrentQueue<Record> records;
while (!records.IsEmpty)
{
    //set state here based on the next record
}
So, would it be more efficient/better practice to
{
    //set state here based on the next record
    Record r = records.next();
    state = r.state;
}
or
{
    //set state here based on the next record
    Record r = records.next();
    if (state != r.state)
        state = r.state;
}
It totally depends on the type of your records. In some cases the first is better, and in some cases the second one is.
As you said yourself, this is a simplification. The actual answer depends on the specifics of your situation: you could be writing to a database on the other side of the world, in which case an update might be prohibitively expensive. Alternatively, your state variable could be a massively complex type that is expensive to compare.
As @harold said in his comment: "try it." Profile your code and gain some understanding of what's expensive and what's not. Chances are the results will not be what you expect!
The test itself is expensive. I simplified your code to this:
int x = 5;
if (x == 5)
    x = 4;
x = 4;
Here is the disassembled code:
int x = 5;
00000036 mov dword ptr [rsp+20h],5
if (x == 5)
0000003e xor eax,eax
00000040 cmp dword ptr [rsp+20h],5
00000045 setne al
00000048 mov dword ptr [rsp+28h],eax
0000004c movzx eax,byte ptr [rsp+28h]
00000051 mov byte ptr [rsp+24h],al
00000055 movzx eax,byte ptr [rsp+24h]
0000005a test eax,eax
0000005c jne 0000000000000066
x = 4;
0000005e mov dword ptr [rsp+20h],4
x = 4;
00000066 mov dword ptr [rsp+20h],4
That being said, premature optimization is a waste of time. A database call may take one second, the above call may take .00000001 second.
Write the code that is simplest, optimize it later.

Resources