gcc compiled binary in x86 assembly - gcc

A gcc-compiled binary has the following assembly:
8049264: 8d 44 24 3e lea 0x3e(%esp),%eax
8049268: 89 c2 mov %eax,%edx
804926a: bb ff 00 00 00 mov $0xff,%ebx
804926f: b8 00 00 00 00 mov $0x0,%eax
8049274: 89 d1 mov %edx,%ecx
8049276: 83 e1 02 and $0x2,%ecx
8049279: 85 c9 test %ecx,%ecx
804927b: 74 09 je 0x8049286
At first glance, I had no idea what it was doing at all. My best guess is that it's some sort of memory alignment plus clearing of a local variable (because the rep stos fills the local variable's location with zeros). Looking at the first few lines, it loads an address into eax, moves it over to ecx, and tests whether it is an even address or not, but I'm lost as to why this is happening. I want to know exactly what is happening here.

It looks like it's initialising a local variable located at [ESP + 0x3E] to zeroes. First, EDX is initialised to hold the address and EBX is initialised to hold the size in bytes. Then it checks whether EDX & 2 is nonzero; in other words, whether EDX as a pointer is wyde-aligned (2-byte) but not tetra-aligned (4-byte). (Assuming ESP is tetrabyte-aligned, as it generally should be, EDX, which was initialised to 0x3E bytes above ESP, would not be tetrabyte-aligned. But this is slightly beside the point.)

If that is the case, the wyde from AX, which is zero, is stored at [EDX], EDX is incremented by two, and the counter EBX is decremented by two. Now, assuming ESP was at least wyde-aligned, EDX is guaranteed to be tetra-aligned. ECX is set to the number of tetrabytes remaining by shifting EBX right by two bits, EDI is loaded from EDX, and the REP STOS stores that many zero tetrabytes at [EDI], incrementing EDI in the process. Then EDX is loaded from EDI to get the pointer past the space initialised so far.

Finally, if there were at least two bytes remaining uninitialised, a zero wyde is stored at [EDX] and EDX is incremented by two, and if there was at least one byte remaining uninitialised, a zero byte is stored at [EDX] and EDX is incremented by one.

The point of all this extra complexity is to store most of the zeroes as four-byte values rather than single-byte values, which may, under certain circumstances and on certain CPUs, be slightly faster.
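Roughly in C, the whole sequence does something like the following (my own sketch, not gcc's source; the names are invented, and dst is assumed to be at least 2-byte aligned, as in the original code):
#include <stddef.h>
#include <stdint.h>

/* Rough C rendering of the zeroing sequence described above (illustrative
   only).  dst is the address of the local array, n is its size in bytes. */
static void zero_local(unsigned char *dst, size_t n)
{
    uintptr_t p = (uintptr_t)dst;

    /* and $0x2 / test / je: if the pointer is 2-aligned but not 4-aligned,
       store one zero wyde (16 bits) to reach 4-byte alignment. */
    if (p & 2) {
        *(uint16_t *)p = 0;
        p += 2;
        n -= 2;
    }

    /* rep stos with ecx = n >> 2: store n/4 zero tetrabytes (32 bits each). */
    uint32_t *w = (uint32_t *)p;
    for (size_t i = 0; i < n / 4; i++)
        *w++ = 0;
    p = (uintptr_t)w;

    /* Mop up the last 0-3 bytes: a zero wyde if at least 2 remain,
       then a zero byte if 1 remains. */
    if (n & 2) {
        *(uint16_t *)p = 0;
        p += 2;
    }
    if (n & 1)
        *(unsigned char *)p = 0;
}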

Related

CISC short instructions vs long instructions

I'm currently programming a compiler and am about to implement code generation. The target instruction set for now is x64.
Now x64 is CISC, so there are many complex instructions. But I know these are internally converted to RISC by the CPU, and there's also out-of-order execution after that.
My question is therefore: Does using more short instructions (RISC-like) have a performance impact over using fewer complex instructions? The test programs for my language aren't that big, so I think fitting instructions into the cache should currently not be a problem.
No, using mostly simple x86 instructions (e.g. avoiding push and using sub rsp, whatever and storing args with mov) was a useful optimization for P5-pentium, because it didn't know how to split compact but complex instructions internally. Its 2-wide superscalar pipeline could only pair simple instructions.
Modern x86 CPUs (since Intel P6 (Pentium Pro / PIII), and including all x86-64 CPUs) do decode complex instructions to multiple uops that can be scheduled independently. (And for common complex instructions like push / pop, they have tricks to handle them as a single uop: a stack engine renames the stack pointer outside of the out-of-order part of the core, so no uop is needed for the rsp -= 8 part of push.)
Memory-source instructions like add eax, [rdi] can even decode to a single uop on Intel CPUs by micro-fusing the load with the ALU uop, only separating them in the out-of-order scheduler for dispatching to execution units. In the rest of the pipeline, it only uses 1 entry (in the front-end and the ROB). (But see Micro fusion and addressing modes for limitations on Sandybridge with indexed addressing modes, relaxed somewhat on Haswell and later.) AMD CPUs just naturally keep memory operands fused with ALU instructions; they never decoded them to extra m-ops / uops in the first place, so the technique doesn't have a fancy name there.
Instruction length is not perfectly correlated with simplicity. e.g. idiv rcx is only 3 bytes, but decodes to 57 uops on Skylake. (Avoid 64-bit division; it's slower than 32-bit.)
Smaller code is better, all else equal. Prefer 32-bit operand-size when it's sufficient to avoid REX prefixes, and choose registers that don't need REX prefixes (like ecx instead of r8d). But normally don't spend extra instructions to make that happen (e.g. use r8d instead of saving/restoring rbx so you can use ebx as another scratch register).
But when all else is not equal, size is usually the last priority for high performance, behind minimizing uops and keeping latency dependency chains short (especially loop-carried dependency chains).
Modern x86 cost model
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
Agner Fog's optimization guides and instruction tables: https://agner.org/optimize/
Intel's Sandy Bridge Microarchitecture, a deep dive by David Kanter: https://www.realworldtech.com/sandy-bridge/
https://stackoverflow.com/tags/x86/info
Most programs spend most of their time in loops small enough to fit in L1d cache, and lots of that time in a few even smaller loops within that.
Unless you can correctly identify "cold" code (executed rarely), optimizing for size over speed with something like 3-byte push 1 / pop rax instead of 5-byte mov eax, 1 is definitely not a good default. clang/LLVM will push/pop for constants with -Oz (optimize only for size), but not -Os (optimize for a balance of size and speed).
Using inc instead of add reg,1 saves a byte (only 1 in x86-64, vs. 2 in 32-bit code). With a register destination, it's just as fast in most cases on most CPUs. See INC instruction vs ADD 1: Does it matter?
Modern mainstream x86 CPUs have decoded-uop caches (AMD since Ryzen, Intel since Sandybridge) that mostly avoid the front-end bottlenecks on older CPUs with average instruction length > 4.
Before that (Core2 / Nehalem), tuning to avoid front-end bottlenecks was much more complicated than just using short instructions on average. See Agner Fog's microarch guide for details about the uop patterns the decoders can handle in those older Intel CPUs, the effects of code alignment relative to 16-byte boundaries for fetch after a jump, and lots more.
AMD Bulldozer-family marks instruction boundaries in L1i cache, and can decode up to 2x 16 bytes per cycle if both cores of a cluster are active, otherwise Agner Fog's microarch PDF (https://agner.org/optimize/) reports ~21 bytes per cycle (vs. Intel's up to 16 bytes per cycle for the decoders when not running from the uop cache). Bulldozer's lower back-end throughput probably means that front-end bottlenecks happen less often. But I don't really know, I haven't tuned anything for Bulldozer-family with access to hardware to test anything.
An example: this function compiled with clang with -O3, -Os, and -Oz
int sum(int *arr) {
    int sum = 0;
    for (int i = 0; i < 10240; i++) {
        sum += arr[i];
    }
    return sum;
}
Source + asm output on the Godbolt compiler explorer, where you can play with this code and compiler options.
I also used -fno-vectorize because I assume you won't be trying to auto-vectorize with SSE2, even though that's baseline for x86-64. (Although that would speed up this loop by a factor of 4.)
# clang -O3 -fno-vectorize
sum: # #sum
xor eax, eax
mov ecx, 7
.LBB2_1: # =>This Inner Loop Header: Depth=1
add eax, dword ptr [rdi + 4*rcx - 28]
add eax, dword ptr [rdi + 4*rcx - 24]
add eax, dword ptr [rdi + 4*rcx - 20]
add eax, dword ptr [rdi + 4*rcx - 16]
add eax, dword ptr [rdi + 4*rcx - 12]
add eax, dword ptr [rdi + 4*rcx - 8]
add eax, dword ptr [rdi + 4*rcx - 4]
add eax, dword ptr [rdi + 4*rcx]
add rcx, 8
cmp rcx, 10247
jne .LBB2_1
ret
This is pretty silly; it unrolled by 8 but still with only 1 accumulator. So it bottlenecks on the 1-cycle latency of add instead of on the 2-loads-per-clock throughput of Intel since SnB and AMD since K8. (And at only 4 bytes read per clock cycle, it probably doesn't bottleneck on memory bandwidth very much.)
It does better with normal -O3, not disabling vectorization, using 2 vector accumulators:
sum: # #sum
pxor xmm0, xmm0 # zero first vector register
mov eax, 36
pxor xmm1, xmm1 # 2nd vector
.LBB2_1: # =>This Inner Loop Header: Depth=1
movdqu xmm2, xmmword ptr [rdi + 4*rax - 144]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [rdi + 4*rax - 128]
paddd xmm0, xmm1
movdqu xmm1, xmmword ptr [rdi + 4*rax - 112]
movdqu xmm3, xmmword ptr [rdi + 4*rax - 96]
movdqu xmm4, xmmword ptr [rdi + 4*rax - 80]
paddd xmm4, xmm1
paddd xmm4, xmm2
movdqu xmm2, xmmword ptr [rdi + 4*rax - 64]
paddd xmm2, xmm3
paddd xmm2, xmm0
movdqu xmm1, xmmword ptr [rdi + 4*rax - 48]
movdqu xmm3, xmmword ptr [rdi + 4*rax - 32]
movdqu xmm0, xmmword ptr [rdi + 4*rax - 16]
paddd xmm0, xmm1
paddd xmm0, xmm4
movdqu xmm1, xmmword ptr [rdi + 4*rax]
paddd xmm1, xmm3
paddd xmm1, xmm2
add rax, 40
cmp rax, 10276
jne .LBB2_1
paddd xmm1, xmm0 # add the two accumulators
# and horizontal sum the result
pshufd xmm0, xmm1, 78 # xmm0 = xmm1[2,3,0,1]
paddd xmm0, xmm1
pshufd xmm1, xmm0, 229 # xmm1 = xmm0[1,1,2,3]
paddd xmm1, xmm0
movd eax, xmm1 # extract the result into a scalar integer reg
ret
This version unrolls probably more than it needs to; the loop overhead is tiny and movdqu + paddd is only 2 uops, so we're far from bottlenecking on the front-end. With 2-per-clock movdqu loads, this loop can process 32 bytes of input per clock cycle assuming the data is hot in L1d cache or maybe L2, otherwise it will run slower. This more-than-minimum unroll will let out-of-order execution run ahead and see the loop exit condition before the paddd work has caught up, and maybe mostly hide the branch mispredict on the last iteration.
Using more than 2 accumulators to hide latency is very important in FP code, where most instructions don't have single-cycle latency. (It would also be useful for this function on AMD Bulldozer-family, where paddd has 2 cycle latency.)
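To show the multiple-accumulator idea at the source level, here is a minimal C sketch (my own example, not from the compiler output above); note that reassociating FP additions like this changes rounding slightly, which is why compilers only do it under -ffast-math or similar:
#include <stddef.h>

/* One accumulator: every add waits for the previous add's result. */
float sum_one_acc(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: four independent dependency chains can overlap,
   hiding the latency of each FP add.  (Assumes n is a multiple of 4
   to keep the sketch short.) */
float sum_four_acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}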
With big unrolls and large displacements, compilers sometimes generate a lot of instructions that need a disp32 displacement instead of disp8 in the addressing mode. Choosing the point where you increment the loop counter or pointer to keep as many addressing modes as possible using a displacement of -128 .. +127 would probably be a good thing.
Unless you're tuning for Nehalem / Core2 or other CPUs without a uop cache, you probably don't want to add extra loop overhead (of add rdi, 256 twice instead of add rdi, 512 or something) just to shrink the code size.
By comparison, clang -Os still auto-vectorizes (unless you disable it), with an inner loop that's exactly 4 uops long on Intel CPUs.
# clang -Os
.LBB2_1: # =>This Inner Loop Header: Depth=1
movdqu xmm1, xmmword ptr [rdi + 4*rax]
paddd xmm0, xmm1
add rax, 4
cmp rax, 10240
jne .LBB2_1
But with clang -Os -fno-vectorize, we get the simple and obvious minimal scalar implementation:
# clang -Os -fno-vectorize
sum: # #sum
xor ecx, ecx
xor eax, eax
.LBB2_1: # =>This Inner Loop Header: Depth=1
add eax, dword ptr [rdi + 4*rcx]
inc rcx
cmp rcx, 10240
jne .LBB2_1
ret
Missed optimization: using ecx would avoid a REX prefix on inc and cmp; the range is known to fit in 32 bits. It's probably using RCX because it promoted int to 64-bit to avoid a movsxd rcx, ecx sign-extension to 64-bit before use in an addressing mode (possible because signed overflow is UB in C). But after doing that, it could have optimized it back down again after noticing the range.
The loop is 3 uops (assuming macro-fused cmp/jne on Intel since Nehalem and AMD since Bulldozer), or 4 uops on Sandybridge (unlamination of add with an indexed addressing mode.) A pointer-increment loop could be slightly more efficient on some CPUs, only requiring 3 uops inside the loop even on SnB/IvB.
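In C source terms, the pointer-increment version of the same loop would look something like this (a sketch only; whether a given compiler actually keeps the non-indexed addressing mode is not guaranteed):
#include <stddef.h>

/* Same sum as above, written with a pointer increment instead of an index,
   hoping for add eax, [rdi] / add rdi, 4 / cmp / jne inside the loop. */
int sum_ptr(const int *arr)
{
    int sum = 0;
    for (const int *p = arr, *end = arr + 10240; p != end; ++p)
        sum += *p;
    return sum;
}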
Clang's -Oz output is actually larger, showing signs of its code-gen strategy. Many loops can't be proven to run at least 1 time, and thus need a conditional branch to skip the loop instead of falling into it in the run-zero-times case. Or they need a jump to an entry point near the bottom. (Why are loops always compiled into "do...while" style (tail jump)?).
Looks like LLVM's -Oz code-gen unconditionally uses the jump-to-bottom strategy without checking if the condition is provably always true on the first iteration.
sum: # #sum
xor ecx, ecx
xor eax, eax
jmp .LBB2_1
.LBB2_3: # in Loop: Header=BB2_1 Depth=1
add eax, dword ptr [rdi + 4*rcx]
inc rcx
.LBB2_1: # =>This Inner Loop Header: Depth=1
cmp rcx, 10240
jne .LBB2_3
ret
Everything's the same except the extra jmp to enter the loop.
In a function that did more, you'd see more differences in code-gen. Like maybe using a slow div even for compile-time-constant divisors, instead of a multiplicative inverse (Why does GCC use multiplication by a strange number in implementing integer division?).

Write assembly language program to sort student names according to their grades

I want to write a simple assembly language program to sort student names according to their grades.
I am just using:
.data
.code
I tried this bubble sort, but it only works for numbers. How can I add names for the students?
.data
array db 9,6,5,4,3,2,1
count dw 7
.code
mov cx,count
dec cx
nextscan:
mov bx,cx
mov si,0
nextcomp:
mov al,array[si]
mov dl,array[si+1]
cmp al,dl
jnc noswap
mov array[si],dl
mov array[si+1],al
noswap:
inc si
dec bx
jnz nextcomp
loop nextscan
Long ago, one of the most common ways to represent data was with what were called fixed-length fields. It wasn't uncommon to find all related data in one place, like this:
Student: db 72, 'Marie '
db 91, 'Barry '
db 83, 'Constantine '
db 59, 'Wil-Alexander '
db 97, 'Jake '
db 89, 'Ceciel '
This is doable, as each of the fields is 16 bytes long, and that is the way data used to be laid out: in multiples of 2, so the record length was 2, 4, 8, 16, 32, 64 and so on. It didn't have to be this way, and a lot of the time it wasn't, but multiples like that made the code simpler.
The problem is that each time we want to sort, all the data has to be moved, which is how the relational database was born. Here we separate the variable data from the static data.
Student: db 'Marie '
db 'Barry '
db 'Constantine '
db 'Wil-Alexander '
db 'Jake '
db 'Ceciel '
Grades: db 72, 0
db 91, 1
db 83, 2
db 59, 3
db 97, 4
db 89, 5
dw -1 ; Marks end of list
Not only will this be easier to manage in the program, but adding more grades, and even several grades for the same person, becomes easier. Here is an example of how the comparison code would work.
mov si, Grades
mov bl, 0
push si
L0: lodsw
cmp ax, -1
jz .done
cmp [si-4], al
jae L0
.... Exchange Data Here ....
bts bx, 0
jmp L0
.done:
pop si
btc bx, 0
jc L0 - 1
ret
After the routine has executed, the contents of Grades is as follows:
61 04 5B 01 59 05 53 02 48 00 3B 03
I do have a working copy of this program, tested in DOSBox, and because this is a homework assignment I'm not going to hand it to you on a silver platter, but 95% of the work is done. All you need to do before handing it in is make sure you can explain why BTS & BTC make the bubble sort work, and implement something that will exchange the data.
If you needed to display this data, you'd need to devise a binary-to-decimal conversion routine, but simply multiplying the index number associated with each grade by 16 and adding the address of Student to it gives you a pointer to the appropriate name.
Sort pointers to name, grade structs, or indices into separate name and grade arrays.
That's one extra level of indirection in the compare, but not in the swap.
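Here is that idea sketched in C rather than assembly (my own example, not part of the original answer), reusing the names and grades from above; the name and grade tables stay put and only the small index array gets permuted:
#include <stdio.h>

static const char *names[] = { "Marie", "Barry", "Constantine",
                               "Wil-Alexander", "Jake", "Ceciel" };
static const int grades[]  = { 72, 91, 83, 59, 97, 89 };
enum { N = 6 };

int main(void)
{
    int idx[N];
    for (int i = 0; i < N; i++)
        idx[i] = i;

    /* Bubble sort the indices by descending grade.  The compare has one extra
       level of indirection, but the swap only moves a small integer. */
    for (int pass = 0; pass < N - 1; pass++)
        for (int i = 0; i < N - 1 - pass; i++)
            if (grades[idx[i]] < grades[idx[i + 1]]) {
                int t = idx[i];
                idx[i] = idx[i + 1];
                idx[i + 1] = t;
            }

    for (int i = 0; i < N; i++)
        printf("%3d  %s\n", grades[idx[i]], names[idx[i]]);
    return 0;
}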

Creating array and adding values

So I am working on an assignment and I am having some issues understanding arrays in this type of code (keep in mind that my knowledge of this stuff is limited). My code is supposed to ask the user to enter the number of values that will be put into an array of SDWORDs, and then I need to create a procedure that has the user input the numbers. I have the part done below that asks the user for the amount (saved in "count"), but I am struggling with the other procedure. For example, with my code below, if they enter 5 then the procedure I have to write would require them to input 5 numbers that would go into an array.
The problem I am facing is that I'm not sure how to actually set up the array. It can contain anywhere between 2 and 12 numbers, which is why I have the compares set up in the code below. Let's say, for example, that the user says they will enter 5 numbers and I set it up like this...
.data
array SDWORD 5
The problem I am having is that I'm not sure whether that says the array will hold 5 values or that just one value in the array is 5. I need the number of values in the array to be equal to "count"; "count", as I have set it up below, is the amount that the user is going to enter.
Also, I obviously know how to set up the procedure like this...
EnterValues PROC
return
EnterValues ENDP
I just don't know how to implement something like this. All of the research I have done online is only confusing me more, and none of the examples I have found ask the user to enter how many numbers will be in the array before they physically enter any numbers into it. I hope what I described makes sense. Any input on what I could possibly do would be great!
INCLUDE Irvine32.inc
.data
count SDWORD ?
prompt1 BYTE "Enter the number of values to sort",0
prompt2 BYTE "Error. The number must be between 2 and 12",0
.code
Error PROC
mov edx, OFFSET prompt2
call WriteString
exit ; exit ends the program after an error occurs
Error ENDP
main PROC
mov edx, OFFSET prompt1
call WriteString ; prints out prompt1
call ReadInt
mov count, eax ; save returned value from eax to count
cmp count, 12
jle Loop1 ; If count is less than or equal to 12 jump to Loop1, otherwise continue with Error procedure
call Error ; performs Error procedure which will end the program
Loop1: cmp count, 2
jge Loop2 ; If count is greater than or equal to 2 jump to Loop2, otherwise continue with Error procedure
call Error ; performs Error procedure which will end the program
Loop2: exit
main ENDP
END main
============EDIT==============
I came up with this...
EnterValues PROC
mov ecx, count
mov edx, 0
Loop3:
mov eax, ArrayOfInputs[edx * 4]
call WriteInt
call CrLf
inc edx
dec ecx
jnz Loop3
ret
EnterValues ENDP
.data
array SDWORD 5
defines one SDWORD with the initial value 5 in the DATA section and gives it the name "array".
You might want to use the DUP operator
.data
array SDWORD 12 DUP (5)
This defines twelve SDWORDs and initializes each of them with the value 5. If the initial value doesn't matter, i.e. you want an uninitialized array, change the initial value to '?':
array SDWORD 12 DUP (?)
MASM may now create a _BSS segment. To force the decision:
.data?
array SDWORD 12 DUP (?)
In a MASM program the symbol array is effectively a constant: the address of the first entry. Add a displacement or an index to address subsequent entries, for example:
mov eax, [array + 4] ; second SDWORD
mov eax, [array + esi]
Pointer arithmetic:
lea esi, array ; copy address into register
add esi, 8 ; move pointer to the third entry
mov eax, [esi] ; load eax with the third entry
lea esi, array + 12 ; copy the address of the fourth entry
mov eax, [esi] ; load eax with the fourth entry
In every case you've got an array with a fixed size. It's up to you to fill just count of its entries.
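For comparison, here is the same overall flow written as a plain C sketch (my own illustration of the logic; it is not Irvine32/MASM code):
#include <stdio.h>

/* Fixed-capacity array (12 entries, like array SDWORD 12 DUP (?)),
   of which only 'count' entries are actually used. */
#define MAX_COUNT 12

int main(void)
{
    int array[MAX_COUNT];
    int count;

    printf("Enter the number of values to sort: ");
    if (scanf("%d", &count) != 1 || count < 2 || count > MAX_COUNT)
        return 1;                       /* like the Error procedure */

    for (int i = 0; i < count; i++)     /* loop count times, */
        scanf("%d", &array[i]);         /* storing at array[i] */

    for (int i = 0; i < count; i++)     /* echo them back, like WriteInt/CrLf */
        printf("%d\n", array[i]);
    return 0;
}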

Offset in function in disassembler output

I cannot find (or formulate a Google query to find) an answer to this simple (or noob) question.
I'm inspecting an application with objdump -d tool:
. . .
5212c0: 73 2e jae 5212f0 <rfb::SMsgReaderV3::readSetDesktopSize()+0x130>
5213e8: 73 2e jae 521418 <rfb::SMsgReaderV3::readSetDesktopSize()+0x258>
521462: 73 2c jae 521490 <rfb::SMsgReaderV3::readSetDesktopSize()+0x2d0>
. . .
What does the +XXXX offset in the output mean? How can I relate it to the source code, if possible? (The output was post-processed with c++filt.)
It's the offset in bytes from the beginning of the function.
Here's an example from WinDbg, but it's the same everywhere:
This is the current call stack:
0:000> k L1
# Child-SP RetAddr Call Site
00 00000000`001afcb8 00000000`77b39ece USER32!NtUserGetMessage+0xa
This is what the function looks like:
0:000> uf USER32!NtUserGetMessage
USER32!NtUserGetMessage:
00000000`77b39e90 4c8bd1 mov r10,rcx
00000000`77b39e93 b806100000 mov eax,1006h
00000000`77b39e98 0f05 syscall
00000000`77b39e9a c3 ret
And this is what the current instruction is:
0:000> u USER32!NtUserGetMessage+a L1
USER32!NtUserGetMessage+0xa:
00000000`77b39e9a c3 ret
So, the offset 0x0A is 10 bytes from the function start. 3 bytes for the first mov, 5 bytes for the second mov and 2 bytes for the syscall.
If you want to relate it to your code, it heavily depends on whether or not it was optimized.
If the offset is very high, you might not have enough symbols. E.g. with export symbols only you may see offsets like +0x2AF4 and you can't tell anything about the real function any more.

Why does the 0x55555556 divide by 3 hack work?

There is a (relatively) well known hack for dividing a 32-bit number by three. Instead of using actual expensive division, the number can be multiplied by the magic number 0x55555556, and the upper 32 bits of the result are what we're looking for. For example, the following C code:
int32_t div3(int32_t x)
{
return x / 3;
}
compiled with GCC and -O2, results in this:
08048460 <div3>:
8048460: 8b 4c 24 04 mov ecx,DWORD PTR [esp+0x4]
8048464: ba 56 55 55 55 mov edx,0x55555556
8048469: 89 c8 mov eax,ecx
804846b: c1 f9 1f sar ecx,0x1f
804846e: f7 ea imul edx
8048470: 89 d0 mov eax,edx
8048472: 29 c8 sub eax,ecx
8048474: c3 ret
I'm guessing the sub instruction is responsible for fixing negative numbers, because what it does is essentially add 1 if the argument is negative, and it's a NOP otherwise.
But why does this work? I've been trying to manually multiply smaller numbers by a 1-byte version of this mask, but I fail to see a pattern, and I can't really find any explanations anywhere. It seems to be a mystery magic number whose origin isn't clear to anyone, just like 0x5f3759df.
Can someone provide an explanation of the arithmetic behind this?
It's because 0x55555556 is really 0x100000000 / 3, rounded up.
The rounding is important. Since 0x100000000 doesn't divide evenly by 3, there will be an error in the full 64-bit result. If that error were negative, the result after truncation of the lower 32 bits would be too low. By rounding up, the error is positive, and it's all in the lower 32 bits so the truncation wipes it out.
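If you want to check this numerically, here is a small C harness (my own; it mirrors the imul/sar/sub sequence from the compiled code, and assumes the usual arithmetic right shift for negative values, as gcc and clang provide):
#include <stdint.h>
#include <stdio.h>

/* imul edx: full 64-bit product; take the high 32 bits (edx); then
   sub eax, ecx with ecx = x >> 31, which adds 1 when x is negative. */
static int32_t div3_magic(int32_t x)
{
    int64_t prod = (int64_t)x * 0x55555556LL;
    int32_t hi   = (int32_t)(prod >> 32);
    return hi - (x >> 31);
}

int main(void)
{
    const int32_t tests[] = { 0, 1, 2, 3, 4, 5, 6, -1, -2, -3, -4,
                              1000000, -1000001, INT32_MAX, INT32_MIN };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        int32_t x = tests[i];
        printf("%11d / 3 = %11d, magic gives %11d\n", x, x / 3, div3_magic(x));
    }
    return 0;
}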

Resources