Unexpected slowdown from inserting a nop in a loop, and from reading near a movnti store

Unexpected slowdown from inserting a nop in a loop, and from reading near a movnti store - performance

I cannot understand why the first code has ~1 cycle per iteration and second has 2 cycle per iteration. I measured with Agner's tool and perf. According to IACA it should take 1 cycle, from my theoretical computations too.
This takes 1 cycle per iteration.
; array is array defined in section data
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
add rcx, 1
cmp rcx, n
jle .begin
And this takes 2 cycles per iteration. but why?
; array is array defined in section data
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
nop
add rcx, 1
cmp rcx, n
jle .begin
This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.
.begin:
movnti [array], eax
mov rbx, [array+16]
add rcx, 1
cmp rcx, n
jle .begin
My CPU is IvyBridge.

movnti is 2 uops, and can't micro-fuse, according to Agner Fog's tables for IvyBridge.
So your first loop is 4 fused-domain uops, and can issue at one iteration per clock.
The nop is a 5th fused-domain uop (even though it doesn't take any execution ports, so it's 0 unfused-domain uops). This means the frontend can only issue the loop at one per 2 clocks.
See also the x86 tag wiki for more links to how CPUs work.
The 3rd loop is probably slow because mov rbx, [array+16] is probably loading from the same cache line that movnti evicts. This happens every time the fill-buffer it's storing into is flushed. (Not every movnti, apparently it can rewrite some bytes in the same fill-buffer.)

Related

How can memory destination BTS be significantly slower than load / BTS reg,reg / store?

In the general case, how can an instruction that can take memory or register operands ever be slower with memory operands then mov + mov -> instruction -> mov + mov
Based on the throughput and latency found in Agner Fog's instruction tables (looking at Skylake in my case, p238)
I see that the following numbers for the btr/bts instructions:
instruction, operands, uops fused domain, uops unfused domain, latency, throughput
mov r,r 1 1 0-1 .25
mov m,r 1 2 2 1
mov r,m 1 1 2 .5
...
bts/btr r,r 1 1 N/A .5
bts/btr m,r 10 10 N/A 5
I dont see how these numbers could possibly be correct. Even in the worst case where there are no registers to spare and you have store one in a temporary memory location it would be faster to:
## hypothetical worst-case microcode that saves/restores a scratch register
mov m,r // + 1 throughput , save a register
mov r,m // + .5 throughput , load BTS destination operand
bts r,r // + 1 throughput , do bts (or btr)
mov m,r // + 1 throughput , store result
mov r,m // + .5 throughput , restore register
As the worst case this has a better throughput than just bts m,r (4 < 5). (Editor's note: adding up throughputs doesn't work when they have different bottlenecks. You need to consider uops and ports; this sequence should be 2c throughput, bottlenecked on 1/clock store throughput.)
And microcode instructions have there own set of registers so it seems aggressively unlikely this would actually be needed. Can anyone explain why bts (or in general any instruction) could have higher throughput with memory, register operands than using the worst case moving policy.
(Editor's note: yes, there are a few hidden temp register that microcode can use. Something like add [mem], reg does at least logically just load into one of those and then store the result.)

What you're missing is that BT, BTC, BTS and BTR don't work like you described when a memory operand is used. You're assuming the memory versions work the same as the register versions, but that's not quite the case. With the register version, the value of the second operand is used is taken modulo 64 (or 16 or 32). With the memory version, the value of the second operand is used as is. This means that the actual memory location accessed by the instruction may not be the address given by the memory operand, but one somewhere past it.
For example, ignoring the need to save registers and atomicity, to get the same operation of BTS [rsi + rdi], rax using the register version of BTS you'd need to do something like this:
LEA rbx, [rsi + rdi]
MOV rcx, rax
SHR rcx, 8
MOV rdx, [rbx + rcx]
BTS rdx, rax
MOV [rbx + rcx], rdx
You can simplify this if you know the value of RAX is less than 64, or if it's a simpler memory operand. Indeed as you've noticed, it may be an advantage in cases like these to use the faster register version over the slower memory version even if it means a few more instructions.

Why jnz counts no cycle?

I found in online resource that IvyBridge has 3 ALU. So I write a small program to test:
global _start
_start:
mov rcx, 10000000
.for_loop: ; do {
inc rax
inc rbx
dec rcx
jnz .for_loop ; } while (--rcx)
xor rdi, rdi
mov rax, 60 ; _exit(0)
syscall
I compile and run it with perf:
$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out
The output shows:
10,491,664 cycles
which seems to make sense at the first glance, because there are 3 independent instructions (2 inc and 1 dec) that uses ALU in the loop, so they count 1 cycle together.
But what I don't understand is why the whole loop only has 1 cycle? jnz depends on the result of dec rcx, it should counts 1 cycle, so that the whole loop is 2 cycle. I would expect the output to be close to 20,000,000 cycles.
I also tried to change the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does becomes close to 20,000,000 cycles, which shows that dependency will delay an instruction so that they can't run at the same time. So why jnz is special?
What I'm missing here?

First of all, dec/jnz will macro-fuse into a single uop on Intel Sandybridge-family. You could defeat that by putting a non-flag-setting instruction between the dec and jnz.
.for_loop: ; do {
inc rax
dec rcx
lea rbx, [rbx+1] ; doesn't touch flags, defeats macro-fusion
jnz .for_loop ; } while (--rcx)
This will still run at 1 iter per cycle on Haswell and later and Ryzen because they have 4 integer execution ports to keep up with 4 uops per iteration. (Your loop with macro-fusion is only 3 fused-domain uops on Intel CPUs, so SnB/IvB can run it at 1 per clock, too.)
See Agner Fog's optimization guide and especially his microarch guide. Also other links in https://stackoverflow.com/tags/x86/info.
Control dependencies are hidden by branch prediction + speculative execution, unlike data dependencies.
Out-of-order execution and branch prediction + speculative execution hide the "latency" of the control dependency. i.e. the next iteration can start running before the CPU verifies that jnz should really be taken.
So each jnz has an input dependency on the previous dec rcx before it can verify the prediction, but later instructions don't have to wait for it to be checked before they can execute. In-order retirement makes sure that mis-speculation is caught before anything can "see" it happen (except for microarchitectural effects leading to the Spectre attack...)
10M iterations is not a lot. I'd normally use at least 100M for something that runs at only 1c per iter. Having a simple microbenchmark run for 0.1 to 1 second is normally good to get very high precision and hide startup overhead.
And BTW, you don't need sudo perf if you set kernel.perf_event_paranoid = 0 with sysctl. It's almost certainly better to do that than to use sudo all the time.

Packing two DWORDs into a QWORD to save store bandwidth

Imagine a load-store loop like the following which loads DWORDs from non-contiguous locations and stores them contiguously:
top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rdi and rsi somehow
cmp ...
jne top
On modern Intel and AMD hardware, when running in-cache such a loop will usually bottleneck ones stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one load).
One idea that naturally arises is to combine two DWORD loads into a single QWORD store which is possible since the stores are contiguous. Something like this could work:
top:
mov eax, DWORD [rsi]
mov ebx, DWORD [rdx]
shl rbx, 32
or rax, rbx
mov QWORD [rdi]
Basically do the two loads and use two ALU ops to combine them into a single QWORD which we can store with a single store. Now we're bottlenecked on uops: 5 uops per 2 DWORDs - so 1.25 cycles per QWORD or 0.625 cycles per DWORD.
Already much better than the first option, but I can't help but think there is a better option for this shuffling - for example, we are wasting uop throughput by using plain loads - It feels like we should be able to combine at least some of the ALU ops with the loads with memory source operands, but I was mostly stymied on Intel: shl on memory only has a RMW form, and shlx and rolx don't micro-fuse.
It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the load DWORD.
I'm interested in scalar code, and code for both the base x86-64 instruction set and better versions if possible with useful extensions like BMI.

It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the load DWORD.
If wider loads are ok for correctness and performance (cache-line splits...), we can use shld
top:
mov eax, DWORD [rsi]
mov rbx, QWORD [rdx-4] ; unaligned(?) 64-bit load
shld rax, rbx, 32 ; 1 uop on Intel SnB-family, 0.5c recip throughput
mov QWORD [rdi], rax
MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).
top:
movd mm0, DWORD [rsi]
punpckldq mm0, QWORD [rdx] ; 1 micro-fused uop on Intel SnB-family
movq QWORD [rdi], mm0
; required after the loop, making it only worth-while for long-running loops
emms
punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.
You could also use the 128-bit AVX version, but that's even more likely to cross a cache line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/
Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.
To avoid CL-splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason for MMX.
top:
movd xmm0, DWORD [rsi]
movd xmm1, DWORD [rdx] ; SSE2
punpckldq xmm0, xmm1
; or pinsrd xmm0, DWORD [rdx], 1 ; 2 uops not micro-fused
movq QWORD [rdi], xmm0
Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.

Do complex addressing modes have extra overhead for loads from memory?

Is there a difference in performance between these mov load instructions? Do the more complex addressing modes have extra overhead (latency or throughput) compared to the simple ones?
# AT&T syntax # Intel syntax:
movq (%rsi), %rax mov rax, [rsi]
movq (%rdi, %rsi), %rax mov rax, [rdi + rsi]
movq (%rdi, %rsi, 4), %rax mov rax, [rdi + rsi*4]

Yes, there is an overhead for "complex addressing" on recent Intel CPUs. The cost is one additional cycle of latency (e.g., 5 cycles for a normal GP load using complex addressing versus 4 cycles with simple addressing).
Simple addressing is anything of the form [reg + offset] where the immediate offset between 0 and 2047 inclusive.
Complex addressing is anything other than simple addressing.
In particular any addressing mode with two registers like your examples [rdi + rsi] or [rdi + rsi*4] are complex addressing and cost an extra cycle.
There is an exceptional case: if the index register1 is zeroed via a zeroing idiom (like xor edi, edi, but not like mov edi, 0) you don't pay the complex addressing penalty.
1 The index register is the one multiplied by 1, 2, 4 or 8, i.e., rsi in [rdi + rsi*4]. In the case neither register shows a multiplier, like [rdi + rsi] the multiplier is 1 and you'll have to check your assembler to see how to specify which is the index and which is the displacement. nasm seems to use the second register as the index.

Depending on which specific CPU; mostly "no, there's no extra overhead". However...
Most CPUs have out-of-order cores, which means they perform instruction in whatever order is fastest and not in the order the instructions are given. For this to work, one instruction (e.g. movq (%rdi, %rsi, 4), %rax) can't happen until things it depended on are finished (e.g. the values in rdi and rsi are known).
For example, these 2 instructions can occur in parallel (because the second instruction doesn't depend on the first):
movq (%rdi), %edi
movq (%rsi), %rax
And these 2 instructions can't occur in parallel (the second instruction has to wait until the first instruction completes):
movq (%rdi), %rdi
movq (%rdi, %rsi), %rax
Also note that the bottleneck for a piece of code may not be execution. If the bottleneck is instruction fetch then larger instructions will be worse; if the bottleneck is instruction decode then more complex instructions can be worse; if the bottleneck is data cache bandwidth then anything that reads/writes to memory can be worse, etc.
Basically; you can't look at individual instructions in isolation and decide if they're better/worse. You have to look at entire sequences of multiple instructions so that you can know about any dependencies on previous instructions (and their latencies); and you have to know what the bottleneck is (e.g. from performance monitoring tools); and if you know all this then you can make an "educated guess" that's only really useful for a small number of CPUs (because different CPUs have different characteristics).

Performance of modern processor

Being executed on modern processor (AMD Phenom II 1090T), how many clock ticks does the following code consume more likely : 3 or 11?
label: mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The problem is, when I execute many iterations of such code, results vary near 3 OR 11 ticks per iteration from time to time. And I can't decide "who is who".
UPD
According to Table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on AMD K10 microarchitecture. Therefore, impossible 3 ticks per iteration are caused by bugs in measurement.
SOLVED
#Atom noticed, that cycle frequency isn't constant in modern processors. When I disabled in BIOS three options - Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control, consumption of my "six instructions" stabilized on 3 clock ticks :-)

I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.
(Note that this is for processors in general and I make no references specific to AMD processors.)
Key Concepts:
Out of Order Execution
Register Renaming
Most modern (non-embedded) processors today are both super-scalar and out-of-order. Not only can execute multiple (independent) instructions in parallel, but they can re-order instructions to break dependencies and such.
Let's break down your example:
label:
mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The first thing to notice is that the last 3 instructions before the branch are all independent:
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
So it's possible for a processor to execute all 3 of these in parallel.
Another thing is this:
adc %rax, (%rdx)
lea 8(%rdx), %rdx
There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is false dependency because the second instruction doesn't actually
depend on the output of the first instruction. Modern processors are able to rename the rdx register to allow these two instructions to be re-ordered or done in parallel.
Same applies to the rsi register between:
mov (%rsi), %rax
lea 8(%rsi), %rsi
So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):
1: mov (%rsi), %rax lea 8(%rdx), %rdx lea 8(%rsi), %rsi
2: adc %rax, (%rdx) dec %ecx
3: jnz label
*Of course, I'm over-simplifying things for simplicity. In reality the latencies are probably longer and there's overlap between different iterations of the loop.
In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...

At Intel, Dr. David Levinthal's "Performance Analysis Guide" investigates the answers to such questions in great detail.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio