Visual Studio performance optimization in branching

Visual Studio performance optimization in branching - visual-studio-2010

Consider the following
while(true)
{
if(x>5)
// Run function A
else
// Run function B
}
if x is always less than 5, does visual studio compiler do any optimization? i.e. like never checks if x is larger than 5 and always run function B

It depends on whether or not the compiler "knows" that x will always be less than 5.
Yes, nearly all modern compilers are capable of removing the branch. But the compiler needs to be able to prove that the branch will always go one direction.
Here's an example that can be optimized:
int x = 1;
if (x > 5)
printf("Hello\n");
else
printf("World\n");
The disassembly is:
sub rsp, 40 ; 00000028H
lea rcx, OFFSET FLAT:??_C#_06DKJADKFF#World?6?$AA#
call QWORD PTR __imp_printf
x = 1 is provably less than 5. So the compiler is able to remove the branch.
But in this example, even if you always input less than 5, the compiler doesn't know that. It must assume any input.
int x;
cin >> x;
if (x > 5)
printf("Hello\n");
else
printf("World\n");
The disassembly is:
cmp DWORD PTR x$[rsp], 5
lea rcx, OFFSET FLAT:??_C#_06NJBIDDBG#Hello?6?$AA#
jg SHORT $LN5#main
lea rcx, OFFSET FLAT:??_C#_06DKJADKFF#World?6?$AA#
$LN5#main:
call QWORD PTR __imp_printf
The branch stays. But note that it actually hoisted the function call out of the branch. So it really optimized the code down to something like this:
const char *str = "Hello\n";
if (!(x > 5))
str = "World\n";
printf(str);

Related

All the calculations take place in registers. Why is the stack not storing the result of the register computation here

I am debugging a simple code in c++ and, looking at the disassembly.
In the disassembly, all the calculations are done in the registers. And later, the result of the operation is returned. I only see the a and b variables being pushed onto the stack (the code is below). I don't see the resultant c variable pushed onto the stack. Am I missing something?
I researched on the internet. But on the internet it looks like all variables a,b and c should be pushed onto the stack. But in my Disassembly, I don't see the resultant variable c being pushed onto the stack.
C++ code:
#include<iostream>
using namespace std;
int AddMe(int a, int b)
{
int c;
c = a + b;
return c;
}
int main()
{
AddMe(10, 20);
return 0;
}
Relevant assembly code:
int main()
{
00832020 push ebp
00832021 mov ebp,esp
00832023 sub esp,0C0h
00832029 push ebx
0083202A push esi
0083202B push edi
0083202C lea edi,[ebp-0C0h]
00832032 mov ecx,30h
00832037 mov eax,0CCCCCCCCh
0083203C rep stos dword ptr es:[edi]
0083203E mov ecx,offset _E7BF1688_Function#cpp (0849025h)
00832043 call #__CheckForDebuggerJustMyCode#4 (083145Bh)
AddMe(10, 20);
00832048 push 14h
0083204A push 0Ah
0083204C call std::operator<<<std::char_traits<char> > (08319FBh)
00832051 add esp,8
return 0;
00832054 xor eax,eax
}
As seen above, 14h and 0Ah are pushed onto the stack - corresponding to AddMe(10, 20);
But, when we look at the disassembly for the AddMe function, we see that the variable c (c = a + b), is not pushed onto the stack.
snippet of AddMe in Disassembly:
…
int c;
c = a + b;
00836028 mov eax,dword ptr [a]
0083602B add eax,dword ptr [b]
0083602E mov dword ptr [c],eax
return c;
00836031 mov eax,dword ptr [c]
}
shouldn't c be pushed to the stack in this program? Am I missing something?

All the calculations take place in registers.
Well yes, but they're stored afterwards.
Using memory-destination add instead of just using the accumulator register (EAX) would be an optimization. And one that's impossible when when the result needs to be in a different location than any of the inputs to an expression.
Why is the stack not storing the result of the register computation here
It is, just not with push
You compiled with optimization disabled (debug mode) so every C object really does have its own address in the asm, and is kept in sync between C statements. i.e. no keeping C variables in registers. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?). This is one reason why debug mode is extra slow: it's not just avoiding optimizations, it's forcing store/reload.
But the compiler uses mov not push because it's not a function arg. That's a missed optimization that all compilers share, but in this case it's not even trying to optimize. (What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?). It would certainly be possible for the compiler to reserve space for c in the same instruction as storing it, using push. But compilers instead to stack-allocation for all locals on entry to a function with one sub esp, constant.
Somewhere before the mov dword ptr [c],eax that spills c to its stack slot, there's a sub esp, 12 or something that reserves stack space for c. In this exact case, MSVC uses a dummy push to reserve 4 bytes space, as an optimization over sub esp, 4.
In the MSVC asm output, the compiler will emit a c = ebp-4 line or something that defines c as a text substitution for ebp-4. If you looked at disassembly you'd just see [ebp-4] or whatever addressing mode instead of.
In MSVC asm output, don't assume that [c] refers to static storage. It's actually still stack space as expected, but using a symbolic name for the offset.
Putting your code on the Godbolt compiler explorer with 32-bit MSVC 19.22, we get the following asm which only uses symbolic asm constants for the offset, not the whole addressing mode. So [c] might just be that form of listing over-simplifying even further.
_c$ = -4 ; size = 4
_a$ = 8 ; size = 4
_b$ = 12 ; size = 4
int AddMe(int,int) PROC ; AddMe
push ebp
mov ebp, esp ## setup a legacy frame pointer
push ecx # RESERVE 4B OF STACK SPACE FOR c
mov eax, DWORD PTR _a$[ebp]
add eax, DWORD PTR _b$[ebp] # c = a+b
mov DWORD PTR _c$[ebp], eax # spill c to the stack
mov eax, DWORD PTR _c$[ebp] # reload it as the return value
mov esp, ebp # restore ESP
pop ebp # tear down the stack frame
ret 0
int AddMe(int,int) ENDP ; AddMe

The __cdecl calling convention, which AddMe() uses by default (depending on the compiler's configuration), requires parameters to be passed on the stack. But there is nothing requiring local variables to be stored on the stack. The compiler is allowed to use registers as an optimization, as long as the intent of the code is preserved.

Recursive methods

I'm having a hard time grasping recursion. For example I have the following method. When the if statement returns true, I expect to return from this method. However looking at the method execution in Windbg and Visual Studio shows that the method continues to execute. I apologize for the generic question however your feedback would really be appreciated.
How is N decremented in-order to satisfy the if condition?
long factorial(int N)
{
if(N == 1)
return 1;
return N * factorial(N - 1);
}

compiling and disassembling the function you should get a disassembly similar to this
0:000> cdb: Reading initial command 'uf fact!fact;q'
fact!fact:
00401000 55 push ebp
00401001 8bec mov ebp,esp
00401003 837d0801 cmp dword ptr [ebp+8],1
00401007 7507 jne fact!fact+0x10 (00401010)
fact!fact+0x9:
00401009 b801000000 mov eax,1
0040100e eb13 jmp fact!fact+0x23 (00401023)
fact!fact+0x10:
00401010 8b4508 mov eax,dword ptr [ebp+8]
00401013 83e801 sub eax,1
00401016 50 push eax
00401017 e8e4ffffff call fact!fact (00401000)
0040101c 83c404 add esp,4
0040101f 0faf4508 imul eax,dword ptr [ebp+8]
fact!fact+0x23:
00401023 5d pop ebp
00401024 c3 ret
quit:
lets assume N == 5 when the function is entered ie [ebp+8] will hold 5
as long as [ebp+8] > 1 the jne will be taken
here you can see N being decremented (sub eax ,1)
the decremented N is again passed to the function fact (recursed without a return back to caller) the loop happens again and the decremented N is resent to fact this keeps on happening until the jne is not taken that is until N or [ebp+8] == 1
when N becomes 1 jne is not taken but jmp 401023 is taken
where it returns to the caller the caller being the function fact(int N)
that is it will return 40101c where the multiplication of eax of takes place and result is stored back in eax;
this will keep on happening until the ret points to the first call in main() see the stack below prior to executing pop ebp for the first time
0:000> kPL
ChildEBP RetAddr
0013ff38 0040101c fact!fact(
int N = 0n1)+0x23
0013ff44 0040101c fact!fact(
int N = 0n2)+0x1c
0013ff50 0040101c fact!fact(
int N = 0n3)+0x1c
0013ff5c 0040101c fact!fact(
int N = 0n4)+0x1c
0013ff68 0040109f fact!fact(
int N = 0n5)+0x1c
0013ff78 0040140b fact!main(
int argc = 0n2,
char ** argv = 0x00033ac0)+0x6f

I think the best way to grasp is to work through your code manually. Say you call factorial(4), what happens?4 is not equal to 1. Return 4 * factorial(4-1).
What is the return value of factorial 3? 3 is not equal to 1 return 3* factorial(3-1).
What is the return value of factorial 2? 2 is not equal to 1 return 2* factorial(2-1).
What is the return value of factorial 1? 1 equals 1 is true. Return 1. This is the base case. Now we move back up the recursion.
Return 1. This is factorial (2-1)
Return 2*1. This is factorial (3-1)
Return 3*2 this is factorial(4-1)
Return 4*6 this is factorial(4), the original call you made.
The idea is you have a function that has a base case (when n=1 return 1) and the function calls itself in way that moves the function towards the base case (factorial(n**-**1)).

How do you store data in the memory without using a variable within Assembly (NASM)?

I have search every webpage for an answer but I cant seem to find it. I have been learning the net-wide assembly syntax for around 2 months, and I'm trying to find a way to store data in the memory.
I know that sys_break:
mov,eax 45
reserves memory, and I have a functional macro which reserves 16kb of memory:
%macro reserve_mem
mov eax,45
xor ebx,ebx
int 80h
add eax,16384
mov ebx,eax
mov eax,45
int 80h
%endmacro
I also know that when you reserve bytes (resb), words ect. in the .bss section, small parts of memory are allocated to that uninitialised data.
In addition, there is virtual memory which can be accessed with an address like 0x0000, and this is then mapped into its actual memory location.
However, my problem is that I am trying to store data in the memory, but everything I try ends in a segmentation fault(core dumped) which is the programm trying to access memory it doesn't have access to. I have tried code such as below.
mov [0x0000],eax
Thankyou for the help.

You seem to misunderstand the concept of virtual memory. It's similar to telephone number in that you cannot call someone at every combination of some 10 digits. You are supopsed to call at only those numbers listed in the telephone book. Otherwise you'll hear "Sorry, this number is currently out of service". Likewise, only those virtual addresses listed in the page table of each process (always automatically and transparently maintained by OS) are valid for the process to access. SEGV is OS's way of saying "Sorry, this virtual address is currently out of service."
In your code, you dereferenced 0x0000 but it is one of the least possible values for a vaild virtual address. You ended up in doing so because you had thrown away the valid virtual address returned by the brk(2) syscall (read man 2 brk carefully because the raw syscall behaves diffrently from both of glibc brk and sbrk.) Your code would translate to C in this way (though nowadays glibc malloc(3) often relies on mmap(2) rather than brk(2)):
void *p = malloc(16384);
int eax = ...;
(void *)0 = eax;
This is obviously wrong and you must do something like this:
void *p = malloc(16384);
int *p0 = (int *)p + 0;
int *p1 = (int *)p + 1;
int eax = ...;
int ebx = ...; /* it's all up to you which register to use */
*p0 = eax;
*p1 = ebx;
which should translates to NASM like this:
reserve_mem ; IIRC eax now points to the last
mov ecx, eax ; byte of the newly allocated chunk
sub ecx, 16383 ; set p0 (== p)
mov edx, ecx
add edx, 4 ; set p1; 4 is for sizeof(int)
; ... set whatever value to eax ...
; ... set whatever value to ebx ...
mov [ecx], eax ; *p0 = eax;
mov [edx], ebx ; *p1 = ebx;
My knowledge on assembly programming is rusting and the above codes may contain many errors... but the concept part should not be so wrong.

In a loop, is it more efficient to set a state value every time or to check to see if it has changed and then set it?

I'm in a situation where I'm iterating through a number of records and setting state information based on the data in those record. Something like this (not real code, just a simplification):
StateObject state;
ConcurrentQueue<Record> records;
while(!records.IsEmpty())
{
//set state here based on the next record
}
So, would it be more efficient/better practice to
{
//set state here based on the next record
Record r = records.next();
state = r.state;
}
or
{
//set state here based on the next record
Record r = records.next();
if(state != r.state)
state = r.state;
}

it's totally depends on your type of records. in some case 1st is better and in some case 2nd one is better.

As you said yourself, this is a simplification. The actual answer depends on the specifics of your situation: you could be running to a database on the other side of the world, in which case an update might be prohibitively expensive. Alternatively, your state variable could be a massively complex type that is expensive to compare.
As #harold said in his comment: "try it." Profile your code and gain some understanding of what's expensive and what's not. Chances are the results will not be what you expect!

Testing is expensive, I simplified your code to this:
int x = 5;
if (x == 5)
x = 4;
x = 4;
Here is the disassembled code:
int x = 5;
00000036 mov dword ptr [rsp+20h],5
if (x == 5)
0000003e xor eax,eax
00000040 cmp dword ptr [rsp+20h],5
00000045 setne al
00000048 mov dword ptr [rsp+28h],eax
0000004c movzx eax,byte ptr [rsp+28h]
00000051 mov byte ptr [rsp+24h],al
00000055 movzx eax,byte ptr [rsp+24h]
0000005a test eax,eax
0000005c jne 0000000000000066
x = 4;
0000005e mov dword ptr [rsp+20h],4
x = 4;
00000066 mov dword ptr [rsp+20h],4
That being said, premature optimization is a waste of time. A database call may take one second, the above call may take .00000001 second.
Write the code that is simplest, optimize it later.

Can someone explain these couple assembly lines?

C++
int main(void)
{
int a = 3;
int b = 10;
int c;
c = a + b;
return 0;
}
008C1353 sub esp,0E4h
......
008C135C lea edi,[ebp+FFFFFF1Ch]
008C1362 mov ecx,39h
008C1367 mov eax,0CCCCCCCCh
008C136C rep stos dword ptr es:[edi]
3: int a = 3;
008C136E mov dword ptr [ebp-8],3
4: int b = 10;
008C1375 mov dword ptr [ebp-14h],0Ah
5: int c;
6: c = a + b;
A couple things that I don't understand.
(1) G++ will have stack alignment 16 bytes, and doing this in Visual Studio is 228 bytes??
(2) Doing this on Windows, does the stack grows upward or downward? I am confused. I know how the stack should look like
[Parameter n ]
...
[Parameter 2 ]
[Parameter 1 ]
[Return Address ] 0x002CF744
[Previous EBP ] 0x002CF740 (current ebp)
[Local Variables ]
So would the lowest address be the downward?
(3) When we push the variable a to the stack, it is ebp - 8.How come it's eight bytes?
(4) Similarly, why is int b ebp - 14 ?
Can someone please explain this to me? (-4, -8, respectively)
Using GDB, the offset makes more sense to me.
Thanks.

When compiling in debug mode, the Microsoft compiler adds quite a lot of padding and other safety-checking code to your generated code. Filling the stack with 0xCC bytes is one of those checks. That may be confusing your interpretation compared to the generated gcc code.
In release mode, these safety checks are generally turned off, but optimisation is turned on. Optimisation may make your assembly code even harder to follow.
For best results, you might try creating a new configuration starting with release mode, and specifically turning optimisations off.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Visual Studio performance optimization in branching - visual-studio-2010

Consider the following while(true) { if(x>5) // Run function A else // Run function B } if x is always less than 5, does visual studio compiler do any optimization? i.e. like never checks if x is larger than 5 and always run function B

Related

All the calculations take place in registers. Why is the stack not storing the result of the register computation here

Recursive methods

How do you store data in the memory without using a variable within Assembly (NASM)?

In a loop, is it more efficient to set a state value every time or to check to see if it has changed and then set it?

Can someone explain these couple assembly lines?

Categories

Resources