Basic Blocks and Control Flow in x86 - compilation

The Red Dragon book contains, on page 529, Algorithm 9.1, an algorithm for partitioning assembly code into basic blocks. The algorithm is essentially:
Input: Assembly, as a list of instructions.
Output: A list of basic blocks of instructions
Phase 1: Mark the leaders. The very first statement (of a procedure) is a leader. Any statement that is the target of a branch is a leader. Any statement that immediately follows a branch is a leader.
Phase 2: Assemble the blocks. To make a basic block, take a leader and add it, together with all following statements up to (but not including) the next leader, to the block. Do this until no statements are left.
End and return the blocks.
Then, later, they use this block structure to develop control flow analyses. The basic structure developed is a control flow graph; it is made by treating basic blocks as nodes in a graph and creating a directed edge from Block Bi to Bj if Bi ends in a jump to Bj.
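For concreteness, here is a minimal C sketch of the two phases plus edge creation, assuming a simplified instruction representation in which branch targets have already been resolved to instruction indices (the Instr type and its fields are hypothetical):

#include <stdbool.h>

/* Hypothetical simplified representation: each instruction knows
   whether it branches and, if so, the index of its target. */
typedef struct {
    bool is_branch;   /* conditional or unconditional branch */
    int  target;      /* index of the branch target, or -1 if none */
} Instr;

/* Phase 1: mark the leaders. */
static void mark_leaders(const Instr *code, int n, bool *leader) {
    for (int i = 0; i < n; i++)
        leader[i] = false;
    leader[0] = true;                          /* first statement */
    for (int i = 0; i < n; i++) {
        if (code[i].is_branch) {
            if (code[i].target >= 0)
                leader[code[i].target] = true; /* branch target */
            if (i + 1 < n)
                leader[i + 1] = true;          /* statement after a branch */
        }
    }
}

/* Phase 2: each block runs from its leader up to (but not including)
   the next leader, so it suffices to record each block's start index.
   An edge Bi -> Bj is then created when the last instruction of Bi
   branches to Bj's leader, or when Bi simply falls through into Bj. */
static int find_block_starts(const bool *leader, int n, int *starts) {
    int nblocks = 0;
    for (int i = 0; i < n; i++)
        if (leader[i])
            starts[nblocks++] = i;
    return nblocks;
}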
Cooper, Harvey and Waterman (in Building a Control Flow Graph from Scheduled Code) point out that this algorithm for creating a control flow graph is insufficient if there is a jump to a location held in a memory address or register.
There are several questions this brings up. When can assembly contain a branch to a location in memory? What is scheduled code? Are there any other issues one should be aware of when building a control flow graph from x86? What are the best known algorithms/implementations for building control flow graphs directly from x86?

There are several cases where a branch is performed on a register's content (or a variable's content, for that matter):
Jump tables for switches, where the address to jump to is calculated from some table of addresses/offsets and the value of a register (a sketch follows the example below).
A function pointer, such as a callback function, where the function to be called is passed as a parameter to the caller or stored in some variable.
Virtual function tables in C++, where the function to be called is not known at compile time and may change at runtime.
Here is an example of a function-pointer call (64-bit Mac OS):
C code
int fptr_call( int (*ptr)(int) ) {
return (*ptr)( 3 );
}
Assembly
_fptr_call:
0000000100000e70 pushq %rbp
0000000100000e71 movq %rsp, %rbp
0000000100000e74 subq $0x10, %rsp
0000000100000e78 movl $0x3, %eax
0000000100000e7d movq %rdi, 0xfffffffffffffff8(%rbp)
0000000100000e81 movl %eax, %edi
; Indirect call through -8(%rbp), which holds ptr (copied from %rdi above)
0000000100000e83 callq *0xfffffffffffffff8(%rbp)
0000000100000e86 addq $0x10, %rsp
0000000100000e8a popq %rbp
0000000100000e8b ret
0000000100000e8c nopl (%rax)
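For the first case in the list above (a jump table), a dense switch statement in C is the classic source. The following is only a sketch: the exact lowering depends on the compiler and optimization level, but for a dense set of cases like this one, compilers typically emit a bounds check followed by an indirect jump through a table of code addresses (something like jmpq *L4(,%rax,8) on x86-64), which is precisely the kind of branch the leader-marking algorithm cannot resolve on its own.

int classify(int x) {
    switch (x) {              /* dense cases invite a jump table */
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    case 3: return 40;
    case 4: return 50;
    default: return -1;       /* the bounds check branches here */
    }
}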


x86 - instruction interleaving to avoid cpu stall

GCC 6, Intel Core 2 Duo.
Compilation flags: "-march=native -O3" (plus -S for the assembly listing)
I was compiling a simple program and asked for the assembly output:
Code
movq 8(%rsi), %rdi
call _atoi
movq 16(%rbp), %rdi
movl %eax, %ebx
call _atof
pxor %xmm1, %xmm1
movl $1, %eax <- this instruction is my problem
cvtsi2sd %ebx, %xmm1
leaq LC0(%rip), %rdi
addsd %xmm1, %xmm0
call _printf
addq $8, %rsp
Execution
read/convert an integer variable, then read/convert a double value and add them.
The problem
I perfectly understand that one (and the compiler even more so) has to avoid CPU stalls as much as possible.
I've shown the offending instruction in the code section above.
To me, with CPU reordering and different execution contexts, this interleaved instruction is useless.
My rationale is: the chances that we stall are very high anyway, and the CPU will have to wait for pxor %xmm1 to complete before being able to reuse that register in the next instruction. Adding an instruction just fills the CPU decoder for nothing. The CPU HAS to wait anyway. So why not leave it alone for one instruction?
Moving the pxor before the call to atof seems impossible, as atof may use that register.
Question
Is this a bug, legacy junk (from when CPUs were not able to reorder), or something else?
Thanks
EDIT:
I admit my question was not clear: can this instruction be safely removed without performance consequences?
The x86-64 ABI requires that calls to varargs functions (like printf) set %al = the count of floating-point args passed in xmm registers. In this case, you're passing one double, so the ABI requires %al = 1. (Fun fact: C's promotion rules make it impossible to pass a float to a vararg function. This is why there are no printf conversion specifiers for float, only double.)
mov $1, %eax avoids false dependencies on the rest of eax, (compared to mov $1, %al), so gcc prefers spending extra instruction bytes on that, even though it's tuning for Core2 (which renames partial registers).
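Both points are visible from plain C. In this minimal example, the float argument is widened to double by the default argument promotions, one floating-point argument is passed in an XMM register, and the ABI therefore requires %al = 1 before the call, which is exactly what the movl $1, %eax in the question provides:

#include <stdio.h>

int main(void) {
    float f = 1.5f;
    /* f is promoted to double here, so %f only ever sees doubles;
       one FP arg in an XMM register means the ABI wants %al = 1. */
    printf("%f\n", f);
    return 0;
}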
Previous answer, before it was clarified that the question was why the mov is done at all, not about its ordering.
IIRC, gcc doesn't do much instruction scheduling for x86, because it's assuming out-of-order execution. I tried to google that, but didn't find the quote from a gcc developer that I seem to remember reading (maybe in a gcc bug report comment).
Anyway, it looks ok to me, unless you're tuning for in-order Atom or P5. If you are, use gcc -O3 -march=atom (which implies -mtune=atom). But anyway, you're clearly not doing that, because you used -march=native on a C2Duo, which is a 4-wide out-of-order design with a fairly large scheduler.
To me, with CPU reordering and different execution contexts, this interleaved instruction is useless.
I have no idea what you think the problem is, or what ordering you think would be better, so I'll just explain why it looks good.
I didn't take the time to edit this down to a short answer, so you might prefer to just read Agner Fog's microarch pdf for details of the Core2 pipeline, and skim this answer. See also other links from the x86 tag wiki.
...
call _atof
# xmm0 is probably still not ready when the following instructions issue
pxor %xmm1, %xmm1 # no inputs, so can run any time after being issued.
gcc uses pxor because cvtsi2sd is badly designed, giving it a false dependency on the previous value of the vector register. Note how the upper half of the vector register keeps its old value. Intel probably designed it this way because the original SSE cvtsi2ss was first implemented on Pentium III, where 128b vectors were handled as two halves. Zeroing the rest of the register (including the upper half) instead of merging probably would have taken an extra uop on PIII.
This short-sighted design choice saddled the architecture with the choice between an extra dependency-breaking instruction, or a false dependency. A false dep might not matter at all, or might be a big slowdown if the register used by one function happened to be used for a very long FP dependency chain in another function (maybe including a cache miss).
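The merge semantics are visible at the intrinsics level, which makes the dependency explicit: _mm_cvtsi32_sd takes the old vector value as an input. Here is a sketch of both the merging behavior and the dependency-breaking idiom gcc is using:

#include <emmintrin.h>   /* SSE2 */

/* The converted integer replaces only the low 64 bits, so the result
   depends on the old vector value: hence the (false) dependency. */
__m128d convert_merging(__m128d old, int x) {
    return _mm_cvtsi32_sd(old, x);    /* = { (double)x, old[1] } */
}

/* What pxor + cvtsi2sd amounts to: feed in a known zero so the
   conversion depends on nothing stale. */
__m128d convert_dep_breaking(int x) {
    return _mm_cvtsi32_sd(_mm_setzero_pd(), x);
}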
On Intel SnB-family CPUs, xor-zeroing is handled at register-rename time, so the uop never needs to execute on an execution port; it's already completed as soon as it issues into the ROB. This is true for integer and vector registers.
On other CPUs, the pxor will need an execution port, but has no input dependencies so it can execute any time there's a free ALU port, after it issues.
movl $1, %eax # no input dependencies, can execute any time.
This instruction could be placed anywhere after call atof and before call printf.
cvtsi2sd %ebx, %xmm1 # no false dependency thanks to pxor.
This is a 2 uop instruction on Core2 (Merom and Penryn), according to Agner Fog's tables. That's weird because cvtsi2ss is 1 uop. (They're both 2 uops in SnB; presumably one uop to move data between integer and vector, and another for the conversion).
Putting this insn earlier would be good, potentially issuing it a cycle earlier, since it's part of the longest dependency chain here. (The integer stuff is all simple and trivial.) However, printf has to parse the format string before it will decide to look at xmm0, so the FP instructions aren't actually on the critical path.
It can't go ahead of pxor, and call / pxor / cvtsi2sd would mean pxor would decode by itself that cycle. Decoding will start with the instruction after the call, after the ret in the called function has been decoded (and the return-address predictor predicts the jump back to the insn after the call). Multi-uop instructions have to be the first instruction in a block, so having pxor and mov imm32 decode that cycle means less of a decode bottleneck.
leaq LC0(%rip), %rdi # 1 uop
addsd %xmm1, %xmm0 # 1 uop
call _printf # 3 uop insn
cvtsi2sd/lea/addsd can all decode in the same cycle, which is optimal. If the mov imm32 was after the cvt, it could decode in the same cycle as well (since pre-SnB decoders can handle up to 4-1-1-1), but it couldn't have issued as soon.
If decoding was only barely keeping up with issue, that would mean pxor would issue by itself (because no other instructions were decoded yet). Then cvtsi2sd/mov imm/lea (4 uops), then addsd / call (4 uops). (addsd decoded with the previous issue group; core2 has a short queue between decode and issue to help absorb decode bubbles like this, and make it useful to be able to decode up to 7 uops in a cycle.)
That's not appreciably different from the current issue pattern in a decode-bottleneck situation: (pxor / mov imm) / (cvtsi2sd/lea/addsd) / (call printf)
If decode isn't the bottleneck, I'm not sure if Core2 can issue a ret or jmp in the same cycle as uops that follow the jump. In SnB-family CPUs, an unconditional jump always ends an issue group. e.g. a 3-uop loop issues ABC, ABC, ABC, not ABCA, BCAB, CABC.
Assuming the instructions after the ret issue with a group not including the ret, we'd have
(pxor/mov imm/cvtsi2sd), (lea / addsd / 2 of call's 3 uops) / (last call uop)
So the cvtsi2sd still issues in the first cycle after returning from atof, which means it can get started executing right away. Even on Core2, where pxor takes an execution unit, the first of the 2 uops from cvtsi2sd can probably execute in the same cycle as pxor. It's probably only the 2nd uop that has an input dependency on the dst register.
(mov imm / pxor / cvtsi2sd) would be equivalent, and so would the slower-to-decode (pxor / cvtsi2sd / mov imm), or getting the lea executed before mov imm.

In x86 assembly, is it better to use two separate registers for imul?

I am wondering, mostly out of curiosity, if using the same register for an operation is better than using two. What would be better, considering performance and/or other concerns?
mov %rbx, %rcx
imul %rcx, %rcx
or
mov %rbx, %rcx
imul %rbx, %rcx
Any tips for how to benchmark this, or resources where I could read about this type of thing would be appreciated, as I am new to assembly.
resources where I could read about this type of thing
See Agner Fog's microarch pdf, and his optimizing assembly guide. Also other links in the x86 tag wiki (e.g. Intel's optimization manual).
The interesting option you didn't mention is:
mov %rbx, %rcx
imul %rbx, %rbx # doesn't have to wait for the mov to execute
# old value of %rbx is still available in %rcx
If the imul is on the critical path, and mov has non-zero latency (like on AMD CPUs, and Intel before IvyBridge), this is potentially better. The result of imul will be ready one cycle earlier, because it has no dependency on the result of the mov.
If, however, the old value is on the critical path and the squared value isn't, then this is worse because it adds a mov to the critical path.
Of course, it also means you have to keep track of the fact that your old variable is now live in a different register, and the old register has the squared value. If this is a problem in a loop, unroll it so you can end up with things where the top of the loop is expecting them. If you wanted this to be easy, you'd use a compiler instead of optimizing asm by hand.
However, Intel P6-family CPUs (PPro/PII to Nehalem) have limited register-read ports, so it can be better to favour reading registers that you just wrote. If the %rbx wasn't written in the last couple cycles, it will have to be read from the permanent register file when the mov and imul uops go through the rename&issue stage (the RAT).
If they don't issue as part of the same group of 4, then they would each need to read %rbx separately. Since the register file in Core2/Nehalem only has 3 read ports, issue groups (quartets, as Agner Fog calls them) stall until all their not-recently-written input register values are read from the register file (at 3 per cycle, or 2 on Core2 if none of the 3 regs are index regs in an addressing mode).
For the full details, see Agner Fog's microarch pdf section 8.8. The Core2 section refers back to the PPro section. PPro has a 3-wide pipeline, so in that section Agner talks about triplets, not quartets.
If mov and imul issue together, they both share the same read of %rbx. There's a 3 in 4 chance of this happening on Core2/Nehalem.
Choosing just between the sequences you mention, the first one has a clear (but usually small) advantage over the second for Intel P6-family CPUs. There's no difference for other CPUs, AFAIK, so the choice is obvious.
mov %rbx, %rcx
imul %rcx, %rcx # uses only the recently-written rcx; can't contribute to register-read stalls
worst of both worlds:
mov %rbx, %rcx
imul %rbx, %rcx # can't execute until after the mov, but still reads a potentially-old register
If you're going to depend on a recently-written register, you might as well use only recently-written registers.
Intel Sandybridge-family uses a physical register file (like AMD Bulldozer-family), and doesn't have register-read stalls.
Ivybridge (2nd gen Sandybridge) and later also handle mov reg,reg at register rename time, with zero latency and no execution unit. This means it doesn't matter whether you imul rbx or rcx as far as critical path length.
However, AMD Bulldozer-family can only handle xmm register moves in its rename stage; integer register moves still have 1c latency.
It's potentially still worth caring about which dependency chain the mov is part of, if latency is a limiting factor in the cycles per iteration of a loop.
how to benchmark this
I think you could put together a microbenchmark that has a register-read stall on Core2 with imul %rbx, %rcx, but not with imul %rcx, %rcx. However, that would require some trial and error to get the mov and imul to issue in different groups, and, unless you're feeling really creative, probably some artificial-looking surrounding code that exists only to read lots of registers: e.g. lea (%rsi, %rdi, 1), %eax, or even add (%rsi, %rdi, 1), %eax (which has to read all three registers, and does micro-fuse on Core2/Nehalem so it only takes 1 uop slot in an issue group; it doesn't micro-fuse on SnB-family).
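As a starting point, here is a minimal rdtsc-based skeleton. This is a sketch only: it measures the latency of a loop-carried mov+imul chain and assumes a fixed-frequency TSC; provoking the Core2 register-read stall described above would still require the surrounding register-reading filler and some trial and error.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc */

int main(void) {
    const long iters = 100000000;
    uint64_t a = 3, b = 0;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++) {
        /* Loop-carried chain: each imul input depends on the previous
           result, so this measures latency, not throughput. */
        __asm__("mov %1, %0\n\t"
                "imul %0, %0"
                : "=r"(b) : "r"(a));
        a = b;
    }
    uint64_t t1 = __rdtsc();
    printf("~%.2f reference cycles per iteration\n",
           (double)(t1 - t0) / (double)iters);
    return 0;
}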
On a modern processor, using one register for both source and destination versus using two different registers will never make any difference to performance. This is partly due to register renaming: if there were a difference in performance, the processor could resolve it by mapping one of the registers onto a different physical register and modifying your subsequent instructions to use the new one (your processor actually has more registers than the instruction set has a way of referring to, precisely so that it can do things like this). It is also because of the nature of a pipelined processor's implementation: the contents of source registers are read at one pipeline stage and results are written at a later stage, which makes it difficult or impossible for the register usage of a single instruction to cause any kind of interaction like the one you're worrying about.
More problematic is when an instruction refers to a value produced by the previous instruction, but even that is (usually) solved by out-of-order execution.

How does address operand affect performance and size of machine code?

Starting with 32-bit CPU mode, there are extended address operands available on the x86 architecture. One can specify the base address, a displacement, an index register and a scaling factor.
For example, suppose we stride through a list of 32-bit integers, loading the first two values of each 32-byte-long data structure (%rdi as the data index, %rbx as the base pointer).
addq $8, %rdi # skip eight values: advance the index by 8
movl (%rbx, %rdi, 4), %eax # load data: pointer + scaled index
movl 4(%rbx, %rdi, 4), %edx # load data: pointer + scaled index + displacement
As far as I know, such complex addressing fits into a single machine-code instruction. But what is the cost of such an operation, and how does it compare to simple addressing with independent pointer calculation:
addq $32, %rbx # skip eight values: move the pointer forward by 32 bytes
movl (%rbx), %eax # load data: pointer
addq $4, %rbx # point to the next value: move the pointer forward by 4 bytes
movl (%rbx), %edx # load data: pointer
In the latter example, I have introduced one extra instruction and a dependency. But integer addition is very fast, the address operands are simpler, and there are no multiplications any more. On the other hand, since the allowed scaling factors are powers of 2, the multiplication comes down to a bit shift, which is also a very fast operation. Still, two additions and a bit shift can be replaced with one addition.
What are the performance and code size differences between these two approaches? Are there any best practices for using the extended addressing operands?
Or, asking it from a C programmer's point of view, what is faster: array indexing or pointer arithmetic?
Is there any assembly editor meant for size/performance tuning? I wish I could see the machine-code size of each assembly instruction, its execution time in clock cycles, or a dependency graph. There are thousands of assembly freaks who would benefit from such an application, so I bet that something like this already exists!
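For the C-level form of the question, here is a minimal pair to experiment with. With optimization enabled, modern compilers typically generate identical machine code for both functions, choosing the addressing modes on their own:

#include <stddef.h>

long sum_index(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];     /* indexing: likely a (%base,%index,4) operand */
    return s;
}

long sum_pointer(const int *a, size_t n) {
    long s = 0;
    for (const int *p = a; p != a + n; p++)
        s += *p;       /* pointer arithmetic: likely (%ptr) plus an add */
    return s;
}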
Address arithmetic is very fast and should always be used when possible.
But here is something that the question misses.
First of all, you can't multiply by 32 using address arithmetic: 8 is the largest scale factor available.
The first version of the code is not complete, because it needs a second instruction to increment rbx. So we have the following two variants:
inc rbx
mov eax, [8*rbx+rdi]
vs
add rbx, 8
mov eax, [rbx]
This way, the speed of the two variants will be the same, and the size is the same as well: 6 bytes.
So which code is better depends only on the program context. If we have a register that already contains the address of the needed array cell, use mov eax, [rbx].
If we have one register containing the index of the cell and another containing the start address, then use the first variant. That way, after the algorithm ends, we will still have the start address of the array in rdi.
The answer to your question depends on the given, local program-flow circumstances, and they, in turn, may vary somewhat between processor manufacturers and architectures. Microanalyzing a single instruction or two is usually pointless. You have a multi-stage pipeline, more than one integer unit, caches and lots more that come into play and that you need to factor into your analysis.
You could try reverse-engineering by looking at generated assembly code and analyzing why the sequence looks the way it does with respect to the different hardware units that will be working on it.
Another way is to use a profiler and experiment with different constructs to see what works well and what doesn't.
You could also download the source code for gcc and see how the really cool programmers evaluate a sequence to produce the fastest possible code. One day you might become one of them :-)
In any case I expect that you will come to the conclusion that the best sequence varies greatly depending on processor, compiler, optimization level and surrounding instructions. If you are using C the quality of the source code is extremely important: garbage in = garbage out.

What is this assembly function prologue / epilogue code doing with rbp / rsp / leave?

I am just starting to learn assembly for the Mac, using the GCC compiler to assemble my code. Unfortunately, there are VERY limited resources for learning how to do this if you are a beginner. I finally managed to find some simple sample code that I could start to wrap my head around, and I got it to assemble and run correctly. Here is the code:
.text # start of code indicator.
.globl _main # make the main function visible to the outside.
_main: # actually label this spot as the start of our main function.
push %rbp # save the base pointer to the stack.
mov %rsp, %rbp # put the previous stack pointer into the base pointer.
subl $8, %esp # Balance the stack onto a 16-byte boundary.
movl $0, %eax # Stuff 0 into EAX, which is where result values go.
leave # leave cleans up base and stack pointers again.
ret
The comments explain some things in the code (I kind of understand what lines 2 - 5 do), but I don't understand what most of this means. I do understand the basics of what registers are and what each register here (rbp, rsp, esp and eax) is used for and how big they are, and I also understand (generally) what the stack is, but this is still going over my head. Can anyone tell me exactly what this is doing? Also, could anyone point me in the direction of a good tutorial for beginners?
Stack is a data structure that follows LIFO principle. Whereas stacks in everyday life (outside computers, I mean) grow upward, stacks in x86 and x86-64 processors grow downward. See Wikibooks article on x86 stack (but please take into account that the code examples are 32-bit x86 code in Intel syntax, and your code is 64-bit x86-64 code in AT&T syntax).
So, what your code does (my explanations here are with Intel syntax):
push %rbp
Pushes rbp to the stack, practically subtracting 8 from rsp (because the size of rbp is 8 bytes) and then storing rbp to [ss:rsp].
So, in Intel syntax push rbp practically does this:
sub rsp, 8
mov [ss:rsp], rbp
Then:
mov %rsp, %rbp
This is obvious. Just store the value of rsp into rbp.
subl $8, %esp
Subtract 8 from esp and store it into esp. Actually this is a bug in your code, even if it causes no problems here. Any instruction with a 32-bit register (eax, ebx, ecx, edx, ebp, esp, esi or edi) as destination in x86-64 sets the topmost 32 bits of the corresponding 64-bit register (rax, rbx, rcx, rdx, rbp, rsp, rsi or rdi) to zero, causing the stack pointer to point somewhere below the 4 GiB limit, effectively doing this (in Intel syntax):
sub rsp,8
and rsp,0x00000000ffffffff
Edit: added consequences of sub esp,8 below.
However, this causes no problems on a computer with less than 4 GiB of memory. On computers with more than 4 GiB of memory, it may result in a segmentation fault. The leave further below in your code restores a sane value to rsp. Generally, in x86-64 code you never need esp (excluding possibly some optimizations or tweaks). To fix this bug:
subq $8, %rsp
The instructions so far are the standard entry sequence (replace $8 according to the stack usage). Wikibooks has a useful article on x86 functions and stack frames (but note again that it uses 32-bit x86 assembly with Intel syntax, not 64-bit x86-64 assembly with AT&T syntax).
Then:
movl $0, %eax
This is obvious. Store 0 into eax. This has nothing to do with the stack.
leave
This is equivalent to mov rsp, rbp followed by pop rbp.
ret
And this, finally, sets rip to the value stored at [ss:rsp] and adds 8 to rsp, effectively returning control to the point from which this procedure was called.
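Putting it all together: the listing in the question is roughly what a compiler emits, without optimization, for this trivial C program, with the sub/leave pair being the stack-frame bookkeeping discussed above:

int main(void) {
    return 0;   /* the movl $0, %eax; leave/ret undo the prologue */
}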

Performance of modern processor

When executed on a modern processor (AMD Phenom II 1090T), how many clock ticks per iteration is the following code more likely to consume: 3 or 11?
label: mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The problem is, when I execute many iterations of this code, the results vary between roughly 3 and 11 ticks per iteration from time to time, and I can't decide "who is who".
UPD
According to the Table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on the AMD K10 microarchitecture. Therefore, the seemingly impossible 3 ticks per iteration must be caused by bugs in the measurement.
SOLVED
@Atom noticed that clock frequency isn't constant in modern processors. When I disabled three BIOS options - Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control - the consumption of my "six instructions" stabilized at 3 clock ticks :-)
I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.
(Note that this is for processors in general and I make no references specific to AMD processors.)
Key Concepts:
Out of Order Execution
Register Renaming
Most modern (non-embedded) processors today are both superscalar and out-of-order. Not only can they execute multiple (independent) instructions in parallel, but they can also re-order instructions to break dependencies and such.
Let's break down your example:
label:
mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The first thing to notice is that the last 3 instructions before the branch are all independent:
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
So it's possible for a processor to execute all 3 of these in parallel.
Another thing is this:
adc %rax, (%rdx)
lea 8(%rdx), %rdx
There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is a false dependency, because the second instruction doesn't actually depend on the output of the first instruction. Modern processors are able to rename the rdx register to allow these two instructions to be re-ordered or done in parallel.
Same applies to the rsi register between:
mov (%rsi), %rax
lea 8(%rsi), %rsi
So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):
Cycle 1: mov (%rsi), %rax | lea 8(%rdx), %rdx | lea 8(%rsi), %rsi
Cycle 2: adc %rax, (%rdx) | dec %ecx
Cycle 3: jnz label
Of course, I'm over-simplifying things here. In reality the latencies are probably longer and there's overlap between different iterations of the loop.
In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...
Dr. David Levinthal's "Performance Analysis Guide", written at Intel, investigates the answers to such questions in great detail.
