Cache hits and misses on AVX-512 multicore but not single core - performance

Following is the loop body of a NASM program (loop body means I am not showing the parts that instantiate cores and shared memory, read the input data, write the final results to file). This program is a shared object called from a C wrapper. Line numbers are shown for nine of the lines; they correspond to the line numbers referenced in the notes below.
mov rax,255
kmovq k7,rax
label_401:
cmp r11,r10
jge label_899
vmovupd zmm14,[r12+r11] ;[185]
add r11,r9 ; stride ;[186]
vmulpd zmm13,zmm14,zmm31 ; [196]
vmulpd zmm9,zmm14,zmm29 ; [207]
vmulpd zmm8,zmm13,zmm30
mov r8,1
Exponent_Label_0:
vmulpd zmm7,zmm29,zmm29
add r8,1
cmp r8,2 ;rdx
jl Exponent_Label_0
vmulpd zmm3,zmm7,zmm8
vsubpd zmm0,zmm9,zmm3
vmulpd zmm1,zmm0,zmm28
VCVTTPD2QQ zmm0{k7},zmm1 ; [240]
VCVTUQQ2PD zmm2{k7},zmm0 ; [241]
vsubpd zmm3,zmm1,zmm2
vmulpd zmm4,zmm3,zmm27 ; [243]
VCVTTPD2QQ zmm5{k7}{z},zmm4
VPCMPGTQ k2,zmm5,zmm26
VPCMPEQQ k3 {k7},zmm5,zmm26
KADDQ k1,k2,k3
VCVTQQ2PD zmm2{k7},zmm0 ; [252]
vmulpd zmm1{k7},zmm2,zmm25
vmovupd zmm2,zmm1
VADDPD zmm2{k1},zmm1,zmm25
vmovapd [r15+r14],zmm2 ; [266]
add r14,r9 ; stride
jmp label_401
The program uses AVX-512 register-to-register instructions exclusively from the data read at line 185 to where the final results are written to a shared memory buffer at line 266. I ran this with 1 core and with 4 cores, but the 4-core version is 2-3 times slower than the single core. I profiled it with Linux perf to understand why.
The perf reports shown below were produced by running all 65 PEBS counters with perf record / perf annotate (to see results by source code line) and with perf stat (to get the full count). Each perf record and each perf stat counter was a separate run, and the results are aggregated by source code line.
Each entry below lists the source code line number and the instruction. For each perf record counter, the percentage is the share of that counter attributed to the source line, and the total count of that event (from perf stat) appears in parentheses at the end of the line.
My main question is why we see cache hits and misses with multicore on AVX-512 instructions that are all register-to-register instructions, but not with the same instructions on single core. There should not be any cache hits or misses for an instruction that is entirely within registers. Each core has its own set of registers so I would not expect any cache activity where the instructions are all register-to-register. We see virtually no cache activity in all-register instructions when run with only a single core.
1. Line 186 - add r11,r9
mem_inst_retired.all_loads 75.00% (447119383)
mem_inst_retired.all_stores 86.36% (269650353)
mem_inst_retired.split_loads 71.43% (6588771)
mem_load_retired.l1_hit 57.14% (443561879)
Single core (line 177) - add r11,r9
mem_inst_retired.all_stores 24.00% (267231461)
This instruction (add r11,r9) adds two registers. When run with a single-core we don't see any cache hits/misses or memory loads, but with multicore we do. Why are there cache hits and memory load instructions here with multicore but not with a single core?
2. Line 196 - vmulpd zmm13,zmm14,zmm31
mem_inst_retired.split_loads 28.57% (6588771)
mem_load_retired.fb_hit 100.00% (8327967)
mem_load_retired.l1_hit 14.29% (443561879)
mem_load_retired.l1_miss 66.67% (11033416)
Single core (line 187) - vmulpd zmm13,zmm14,zmm31
mem_load_retired.fb_hit 100.00% (8889146)
This instruction (vmulpd zmm13,zmm14,zmm31) is all registers, but again it shows L1 hits and misses and split loads with multicore but not with a single core.
3. Line 207 - vmulpd zmm9,zmm14,zmm29
mem_load_retired.l1_hit 14.29% (443561879)
mem_load_retired.l1_miss 33.33% (11033416)
rs_events.empty_end 25.00% (37013411)
Single core (line 198):
mem_inst_retired.all_stores 24.00% (267231461)
mem_inst_retired.stlb_miss_stores 22.22%
This instruction (vmulpd zmm9,zmm14,zmm29) is the same all-register vmulpd as the one above, and again it shows L1 hits and misses with multicore but not with a single core. The single core does show second-level TLB store misses and retired stores, but no cache activity.
4. Line 240 - VCVTTPD2QQ zmm0{k7},zmm1
mem_inst_retired.all_loads 23.61% (447119383)
mem_inst_retired.split_loads 26.67% (6588771)
mem_load_l3_hit_retired.xsnp_hitm 28.07% (1089506)
mem_load_l3_hit_retired.xsnp_none 12.90% (1008914)
mem_load_l3_miss_retired.local_dram 40.00% (459610)
mem_load_retired.fb_hit 29.21% (8327967)
mem_load_retired.l1_miss 19.82% (11033416)
mem_load_retired.l2_hit 10.22% (12323435)
mem_load_retired.l2_miss 24.84% (2606069)
mem_load_retired.l3_hit 19.70% (700800)
mem_load_retired.l3_miss 21.05% (553670)
Single core line 231:
mem_load_retired.l1_hit 25.00% (429499496)
mem_load_retired.l3_hit 50.00% (306278)
This line (VCVTTPD2QQ zmm0{k7},zmm1) is register-to-register. The single core shows L1 and L3 activity, but the multicore has much more cache activity.
5. Line 241 - VCVTUQQ2PD zmm2{k7},zmm0
mem_load_l3_hit_retired.xsnp_hitm 21.05% (1089506)
mem_load_l3_miss_retired.local_dram 10.00% (459610)
mem_load_retired.fb_hit 10.89% (8327967)
mem_load_retired.l2_miss 13.07% (2606069)
mem_load_retired.l3_miss 10.53%
Single core line 232:
Single core reports almost no cache activity, only some L1 hits:
mem_load_retired.l1_hit 12.50% (429499496)
All-register instruction (VCVTUQQ2PD zmm2{k7},zmm0) that shows a lot of cache activity with multicore but only a small number of L1 hits with single core (12.5%). I would not expect to see any cache hits/misses or load/store instructions with an all-register instruction.
6. Line 243 - vmulpd zmm4,zmm3,zmm27
br_inst_retired.all_branches_pebs 12.13% (311104072)
Single core line 234:
mem_load_l3_hit_retired.xsnp_none 100.00% (283620)
Why do we see branch instructions for an all-register mul instruction?
7. Line 252 - VCVTQQ2PD zmm2{k7},zmm0
br_inst_retired.all_branches_pebs 16.62% (311104072)
mem_inst_retired.all_stores 21.22% (269650353)
Single core line 243:
Single core also has branch instructions
br_inst_retired.all_branches_pebs 22.16% (290445009)
For a register-to-register instruction (VCVTQQ2PD zmm2{k7},zmm0), why do we see branch instructions? This instruction does not branch, nor is it preceded or followed by a branch.
8. Line 266 - vmovapd [r15+r14],zmm2
br_inst_retired.all_branches_pebs 43.56% (311104072)
mem_inst_retired.all_loads 48.67% (447119383)
mem_inst_retired.all_stores 43.09% (269650353)
mem_inst_retired.split_loads 41.30% (6588771)
mem_inst_retired.stlb_miss_loads 11.36% (487591)
mem_inst_retired.stlb_miss_stores 12.50% (440729)
mem_load_l3_hit_retired.xsnp_hitm 33.33% (1089506)
mem_load_l3_hit_retired.xsnp_none 56.45% (1008914)
mem_load_l3_miss_retired.local_dram 35.00% (459610)
mem_load_retired.fb_hit 39.60% (8327967)
mem_load_retired.l1_hit 48.75% (443561879)
mem_load_retired.l1_miss 51.65% (11033416)
mem_load_retired.l2_hit 71.51% (12323435)
mem_load_retired.l2_miss 45.10% (2606069)
mem_load_retired.l3_hit 59.09% (700800)
mem_load_retired.l3_miss 47.37% (553670)
Single core line 257:
mem_inst_retired.all_loads 84.86% (426023012)
mem_inst_retired.all_stores 59.28% (267231461)
mem_inst_retired.split_loads 89.92% (6477955)
mem_load_l3_miss_retired.local_dram 100.00% (372586)
mem_load_retired.fb_hit 92.80% (8889146)
mem_load_retired.l1_hit 54.17% (429499496)
mem_load_retired.l1_miss 91.30% (4170386)
mem_load_retired.l2_hit 100.00% (4564407)
mem_load_retired.l2_miss 100.00% (476024)
mem_load_retired.l3_hit 33.33% (306278)
This line (vmovapd [r15+r14],zmm2) is probably the line most responsible for the difference between single core and multicore. Here we transfer the final results to a memory buffer that is shared by all cores. Because there is memory movement, we expect to see cache activity with both multicore and single core. The single core uses a single buffer created with malloc. For multicore it's POSIX shared memory, because that ran significantly faster than an array created with malloc.
Both single core and multicore were run on an Intel Xeon Gold 6140 CPU @ 2.30GHz, which has two FMA units for AVX-512.
To summarize, my questions are: (1) why do we see cache activity on register-to-register instructions with AVX-512 multicore but not single core (except in rare cases); and (2) is there any way to bypass the cache entirely at vmovapd [r15+r14],zmm2 and go straight to memory to avoid cache misses? POSIX shared memory was an improvement, but it doesn't do that completely. Finally, are there any other reasons why AVX-512 would be so much slower with multicore than with a single core?
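For question (2), one experiment worth trying (hedged: it only helps if the results are not re-read soon by the same core) is a non-temporal store, which writes through write-combining buffers instead of allocating the line in this core's cache; the MOVNTI discussion further down this page covers the same mechanism for scalar stores. A minimal sketch, reusing the registers and labels of the loop above and replacing the vmovapd at line 266 (vmovntpd, like vmovapd, requires an aligned destination):
vmovntpd [r15+r14],zmm2 ; NT store: line is not allocated in the cache hierarchy
add r14,r9 ; stride, as before
jmp label_401
label_899: ; loop exit, as in the original
sfence ; drain the write-combining buffers before another core reads the results
Whether this is a win depends on whether another core (or a later pass on this core) would otherwise have found the results still in cache.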
UPDATE: the access pattern for this code is dictated by AVX - the stride is (64 x number of cores) bytes. With 4 cores, core 0 begins at 0, reads and processes 64 bytes, then jumps by 256 (64x4); core 1 begins at 64, reads and processes 64 bytes, then jumps by 256, etc.
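For reference, a minimal sketch of how that interleaving might be set up before the loop; passing the core ID in rdi is an assumption (the C wrapper could hand it over however it likes), and 256 is the 4-core case:
mov rax,rdi ; rdi = this core's ID, 0..3 (assumption)
shl rax,6 ; 64 * core_id = this core's starting byte offset
mov r11,rax ; read index into the input buffer [r12+r11]
mov r14,rax ; write index into the shared output buffer [r15+r14]
mov r9,256 ; stride = 64 bytes * 4 cores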

Related

Is the L1-Dcache the ultimate data cache and is DSB also a cache that can be simulated by gem5?

I wonder if the L1-Dcache is the ultimate cache that data comes from. I know that for the i-cache there is a DSB, which is even closer to the CPU and could be seen as an L0 i-cache.
Also, I am interested in what hardware changes could influence the DSB's performance. For a cache there are factors such as cache size and associativity; is the DSB also just a cache that is influenced by those factors?
If so, can I simulate the results using gem5? I know that with gem5 I can configure the L1 instruction cache and observe its performance. How could the same be done for the DSB in gem5?
I wonder if the L1-Dcache is the ultimate cache that data comes from
Yes, or the store buffer. Globally Invisible load instructions explains how partial store-forwarding can let a core load a dword value that was never globally visible, so no other core could have loaded it.
The DSB (uop cache) is a cache, but it doesn't cache machine code. It caches the result of decoding x86 machine code into uops.
It has various limitations, like not using more than 3 "lines" for uops from the same 32-byte block of x86 machine code, so modeling it is not as simple as just size / associativity. e.g. each way (aka line) can hold up to 6 uops, but ends with an unconditional (or predicted-taken) branch uop. And all the uops from a multi-uop instruction have to go in the same line.
The number of fused-domain uops from each x86 instruction depends on exactly what instruction it is; see https://uops.info/, but note that un-lamination will mean some instructions take more uops in the issue/rename stage and ROB than they do in the decoders and uop cache. (Micro fusion and addressing modes)
Agner Fog's microarch guide has some detailed testing results (https://agner.org/optimize/), and see also https://www.realworldtech.com/sandy-bridge/4/
The basic parameters of Intel's uop cache, as described in the Sandybridge section of Agner's microarch guide, are:
The µop cache is organized as 32 sets x 8 ways x 6 µops, totaling a maximum capacity of
1536 µops. It can allocate a maximum of 3 lines of 6 µops each for each aligned and
contiguous 32-bytes block of code.
AFAIK, this geometry has remained unchanged from SnB through Skylake and Ice Lake.
The L1i cache is inclusive of the uop cache. The uop cache is virtually-addressed, so TLB lookups aren't needed. But it has to be evicted on TLB invalidation as well, I guess. (That's not a huge problem because the legacy decoders are quite good; Sandybridge-family avoided problems of P4's slow decoding, and trying to use its trace cache instead of a normal L1i.)
Note that AMD's Zen microarchitecture family also uses a uop cache. They don't call it a DSB, and it presumably has some differences from Intel's.
Also, I am interested in what hardware changes could influence DSB's performance?
Skylake increased the bandwidth of uop-cache -> IDQ from 4 to 6 uops per cycle. So even in high-throughput code, the uop-cache can "catch up" after bubbles partially drain the IDQ.
It can still only read 1 uop cache line per cycle, though, so for example on a Skylake where microcode updates disabled the loop buffer (LSD), a tiny loop that would normally run at 1 cycle per iteration can slow down to 2 cycles if the loop is split across a 32-byte boundary, because that means its uops will be in 2 separate uop-cache lines. (Like 1 or 2 from each line.)
But Haswell can sustain 4 uops per clock from the uop cache under ideal conditions, even with instructions that fully pack uop cache lines with 6 uops per line. So there's apparently some buffering between uop cache-line fetch and adding to the IDQ, otherwise it would be a 4 : 2 pattern if all the uops added to the IDQ had to come from the same line.
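If a specific tiny loop hits that split case, the straightforward countermeasure is to align the loop top so its handful of uops come from a single 32-byte block. A minimal sketch (the loop body here is just a placeholder):
align 32 ; pad so the loop top starts a fresh 32-byte block
.tiny_loop:
add rax,rdx
dec ecx
jnz .tiny_loop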

What type of addresses can the port 7 store AGU handle on recent Intel x86?

Starting with Haswell, Intel CPU micro-architectures have had a dedicated store-address unit on port 7 which can handle the address-generation uop for some store operations (the other uop, store data always goes to port 4).
Originally it was believed that this could handle any type of addresses, but this seems not to be the case. What types of addresses can this port handle?
This answer applies to Haswell and Skylake (/Kaby Lake / Coffee Lake). Future microarchitectures (Cannon Lake / Ice Lake) will have to be checked when they're available. The port 7 AGU was new in Haswell.
For instructions that can use port7 at all (e.g. not vextracti128), any non-indexed addressing mode can use port 7.
This includes RIP-relative, and 64-bit absolute (mov [qword abs buf], eax, even in a PIE executable loaded above 2^32, so the address really doesn't fit in 32 bits), as well as normal [reg + disp0/8/32] or absolute [disp32].
An index register always prevents use of port7, e.g. [rdi + rax], or [disp32 + rax*2]. Even [NOSPLIT disp32 + rax*1] can't use port 7 (so HSW/SKL doesn't internally convert an indexed with scale=1 and no base register into a base+disp32 addressing mode.)
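To make the split concrete, a small self-contained listing of the cases described above (buf and the register choices are just placeholders):
default rel
section .bss
buf: resb 64
section .text
global store_addr_examples
store_addr_examples:
mov [rdi + 8], eax ; base + disp: store-address uop can use p7
mov [buf], eax ; RIP-relative (default rel): can use p7
mov [rdi + rax], eax ; indexed: store-address uop only p2/p3
mov [rdi + rax*4 + 16], eax ; scaled index: only p2/p3
ret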
I tested myself with ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_dispatched_port.port_2,uops_dispatched_port.port_3,uops_dispatched_port.port_7 ./testloop on a Skylake i7-6700k.
The [+0, +2047] range of displacements makes no difference for stores: mov [rsi - 4000], rax can use port 7.
Non-indexed loads with small positive displacements have 1c lower latency. No special case for stores is mentioned in Intel's optimization manual. Skylake's variable-latency store-forwarding (with worse latency when the load tries to execute right away after the store) makes it hard to construct a microbenchmark that includes store latency but isn't affected by having store-address uops compete with loads for fewer ports. I haven't come up with a microbenchmark with a loop-carried dependency chain through a store-address uop but not through the store-data uop. Presumably it's possible, but maybe needs an array instead of a single location.
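A minimal sketch of the kind of test loop involved (not the actual testloop.asm; the buffer, iteration count, and register choices are made up): run it under the uops_dispatched_port events above, once with the non-indexed store and once with the indexed variant, and compare the port 7 counts.
global _start
_start:
lea rsi, [rel buf]
xor ecx, ecx ; index for the indexed variant
mov ebp, 100000000 ; iteration count (arbitrary)
.loop:
mov [rsi + 64], rax ; non-indexed store: address uop is p7-eligible
;mov [rsi + rcx], rax ; indexed store: address uop only p2/p3 (swap in to compare)
dec ebp
jnz .loop
mov eax, 60 ; __NR_exit
xor edi, edi
syscall
section .bss
buf: resb 128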
Some instructions can't use port7 at all:
vextracti128 [rdi], ymm0, 0 includes a store-address uop (of course), but it can only run on port 2 or port 3.
Agner Fog's instruction tables have at least one error here, though: he lists pextrb/w/d/q as only running the store-address uop on p23, but in fact it can use any of p237 on HSW/SKL.
I haven't tested this exhaustively, but one difference between HSW and SKL I found1 was VCVTPS2PH [mem], xmm/ymm, imm8. (The instruction changed to use fewer ALU uops, so that doesn't indicate a change in p7 between HSW and SKL).
On Haswell: VCVTPS2PH is 4 uops (fused and unfused domain): p1 p4 p5 p23 (Agner Fog is right).
On Skylake: VCVTPS2PH xmm is 2 fused / 3 unfused uops: p01 p4 p237
On Skylake: VCVTPS2PH ymm is 3 fused / 3 unfused uops: p01 p4 p237
(Agner Fog lists VCVTPS2PH v as 3F/3U (one entry for both vector widths), missing the micro-fusion with the xmm version, and incorrectly lists the port breakdown as p01 p4 p23).
In general, beware that Agner's recent updates seem a little sloppy, like copy/paste or typo errors (e.g. 5 instead of 0.5 for Ryzen vbroadcastf128 y,m128 throughput).
1: HSW testing was on an old laptop that's no longer usable (I used its RAM to upgrade another machine that still gets regular use). I don't have a Broadwell to test on. Everything in this answer is definitely true on Skylake: I double checked it just now. I tested some of this a while ago on Haswell, and still had my notes from that.

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?

section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times 81920 db "A"
According to perf it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM) so it should be slow.
P.S. Is there any loop-carried dependency?
EDIT
section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array+rcx], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times n dq 0
Now it takes 5 cycles per iteration. Why? After all, there is still no loop-carried dependency.
movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.
I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (That link is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)
So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.
DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.
The reason your experiment doesn't show your loop running at 4 instructions per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k iterations barely cover the startup overhead (which perf's timing includes).
For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.
A more reasonable microbenchmark shows one iteration (one movnti) per cycle:
global _start
_start:
xor ecx,ecx
.begin:
movnti [array], eax
dec ecx
jnz .begin ; 2^32 iterations
mov eax, 60 ; __NR_exit
xor edi,edi
syscall ; exit(0)
section .bss
array resb 81920
(asm-link is a script I wrote)
$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address
Performance counter stats for './movnti-same-address':
1835.056710 task-clock (msec) # 0.995 CPUs utilized
4,398,731,563 cycles # 2.397 GHz
12,891,491,495 instructions # 2.93 insns per cycle
1.843642514 seconds time elapsed
Running in parallel:
$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
real 0m1.844s / user 0m1.828s # running alone
[1] 12523
[2] 12524
peter@tesla:~/src/SO$
real 0m1.855s / user 0m1.824s # running together
real 0m1.984s / user 0m1.808s
# output compacted by hand to save space
I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.
However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.
P.S. Is there any loop-carried dependency?
Yes, the loop counter. (1 cycle). BTW, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and having to use a cmp.
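For example, a sketch of that transformation applied to your loop (same n / array / eax as in your source):
mov rcx, n
.begin:
movnti [array], eax
dec rcx ; dec sets ZF/SF/OF, so no separate cmp is needed
jg .begin ; loop while rcx > 0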
The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".
This is a write-after-write (WAW) hazard; I think the term still applies when talking about memory, rather than registers.
The result is plausible. Your loop code consists of the following instructions. According to Agner Fog's instruction tables, these have the following timings:
Instruction      operands  fused  unfused  ports    Latency  Reciprocal Throughput
-----------------------------------------------------------------------------------
MOVNTI           m,r       2      2        p23 p4   ~400     1
ADD              r,r/i     1      1        p0156    1        0.25
CMP              r,r/i     1      1        p0156    1        0.25
Jcc              short     1      1        p6       1        1-2 if predicted taken
Fused CMP+Jcc    short     1      1        p6       1        1-2 if predicted taken
So
MOVNTI consumes 2 uOps, 1 in port 2 or 3 and one in port 4
ADD consumes 1 uOps in port 0 or 1 or 5 or 6
CMP and Jcc macro-fuse to the last line in the table resulting in a consumption of 1 uOp
Because neither ADD nor CMP+Jcc depends on the result of MOVNTI, they can be executed (nearly) in parallel on recent architectures, for example using ports 1, 2, 4 and 6. The worst case would be a latency of 1 between ADD and CMP+Jcc.
This is most likely a design error in your code: you're essentially writing to the same address [array] 100000 times, because you do not adjust the address.
The repeated writes can even go to the L1-cache under the condition that
The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region.
but that does not appear to be the case here, and it would not make much difference anyway, because even when writing to memory, the memory speed will be the limiting factor.
For example, if you have a 3GHz CPU and 1600MHz DDR3-RAM this will result in 3/1.6 = 1.875 CPU cycles per memory cycle. This seems plausible.

Loop optimization. How does register renaming break dependencies? What is execution port capacity?

I am analyzing an example loop from Agner Fog's Optimizing Assembly guide, chapter 12.9.
The code is (simplified a bit):
L1:
vmulpd ymm1, ymm2, [rsi+rax]
vaddpd ymm1, ymm1, [rdi+rax]
vmovupd [rdi+rax], ymm1
add rax, 32
jl L1
And I have some questions:
The author says that there is no loop-carried dependency. I don't understand why. (I am ignoring add rax, 32; it is indeed loop-carried, but only one cycle.) But, after all, the next iteration cannot modify the ymm1 register before the previous iteration has finished. Maybe register renaming plays a role here?
Let's assume that there is a loop-carried dependency.
vaddpd ymm1, ymm1, [rdi+rax] -> vmovupd [rdi+rax], ymm1
Suppose the latency of the first is 3 and the latency of the second is 7.
(In fact there is no such dependency, but I would like to ask a hypothetical question.)
Now, how do I determine the total latency? Should I simply add the latencies, giving 10? I have no idea.
It is written:
There are two 256-bit read operations, each using a read port for two
consecutive clock cycles, which is indicated as 1+ in the table. Using
both read ports (port 2 and 3), we will have a throughput of two
256-bit reads in two clock cycles. One of the read ports will make an
address calculation for the write in the second clock cycle. The write
port (port 4) is occupied for two clock cycles by the 256-bit write.
The limiting factor will be the read and write operations, using the
two read ports and the write port at their maximum capacity.
What exactly is the capacity of a port? How can I determine it, for example for IvyBridge (my CPU)?
Yes, the whole point of register renaming is to break dependency chains when an instruction writes a register without depending on the old value. The destination of a mov, or the write-only destination operand of AVX instructions, is like this. Also zeroing idioms like xor eax,eax are recognized as independent of the old value, even though they appear to have the old value as an input.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for a more detailed description of register-renaming, and some performance experiments with multiple loop-carried dependency chains in flight at once.
Without renaming, vmulpd couldn't write ymm1 until vmovupd had read its operand (Write-After-Read hazard), but it wouldn't have to wait for vmovupd to complete. See a computer architecture textbook to learn about in-order pipelines and stuff. I'm not sure if any out-of-order CPUs without register renaming exist.
update: early OoO CPUs used scoreboarding to do some limited out-of-order execution without register renaming, but were much more limited in their capacity to find and exploit instruction-level parallelism.
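To make the renaming argument concrete, here is a hedged sketch of a manual 2x unroll that does in software what renaming does in hardware: each in-flight iteration writes a different architectural register, so the multiply of one body never has to wait for the store of the other to read its register. The choice of ymm0 as the second register is arbitrary, and it assumes the byte count is a multiple of 64:
L1:
vmulpd ymm1, ymm2, [rsi+rax]
vaddpd ymm1, ymm1, [rdi+rax]
vmovupd [rdi+rax], ymm1
vmulpd ymm0, ymm2, [rsi+rax+32] ; independent copy of the body using ymm0
vaddpd ymm0, ymm0, [rdi+rax+32]
vmovupd [rdi+rax+32], ymm0
add rax, 64
jl L1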
Each of the two load ports on IvB has a capacity of one 128b load per clock, and also of one address generation per clock.
In theory, SnB/IvB can sustain a throughput of 2x 128b load and 1x 128b store per clock, but only by using 256b instructions. They can only generate two addresses per clock, but a 256b load or store only needs one address calculation per 2 cycles of data transfer. See Agner Fog's microarch guide
Haswell added a dedicated store AGU on port 7 that handles simple addressing modes only, and widened the data paths to 256b. A single cycle can do a peak of 96 bytes total loaded + stored. (But some unknown bottleneck limits sustained throughput to less than that. On Skylake-client, about 84 bytes / cycle reported by Intel, and matches my testing.)
(IceLake client reportedly can sustain 2x64B loaded + 1x64B stored per cycle, or 2x32B stored, according to a recent update to Intel's optimization guide.)
Also note that your indexed addressing modes won't micro-fuse, so fused-domain uop throughput is also a concern.
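A hedged sketch of one way around that: advance the pointers themselves, so every memory operand uses a one-register addressing mode and stays micro-fused, at the cost of an extra pointer increment and a cmp (rdx as the end pointer is an assumption):
L1:
vmulpd ymm1, ymm2, [rsi]
vaddpd ymm1, ymm1, [rdi]
vmovupd [rdi], ymm1
add rsi, 32
add rdi, 32
cmp rdi, rdx ; rdx = one-past-the-end pointer (assumed)
jb L1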

Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?
Here's a great article on the subject:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
PS:
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
PPS:
These figures represent much older, slower CPUs, but the ratios basically hold:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2
Level Access Time Typical Size Technology Managed By
----- ----------- ------------ --------- -----------
Registers 1-3 ns < 1 KB Custom CMOS Compiler
Level 1 Cache (on-chip) 2-8 ns 8 KB-128 KB SRAM Hardware
Level 2 Cache (off-chip) 5-12 ns 0.5 MB - 8 MB SRAM Hardware
Main Memory 10-60 ns 64 MB - 1 GB DRAM Operating System
Hard Disk 3M - 10M ns 20 - 100 GB Magnetic Operating System/User
Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)
Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.
Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax has 1 cycle latency (just the ALU operation)
add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
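A minimal sketch of the pointer-chasing measurement mentioned above (a pointer that points to itself); the symbol names and the iteration count are made up. Dividing the measured core cycles by the iteration count should give roughly the 4-cycle L1d load-use latency on Haswell-like CPUs:
global _start
_start:
lea rax, [rel selfptr]
mov [rel selfptr], rax ; make the pointer point to itself
mov ebp, 100000000 ; iteration count (arbitrary)
.chase:
mov rax, [rax] ; loop-carried chain of L1d load-use latency
dec ebp
jnz .chase
mov eax, 60 ; __NR_exit
xor edi, edi
syscall
section .bss
selfptr: resq 1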
http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
L1 Data cache = 32 KB, 64 B/line, 8-WAY.
L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])
L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).
L2 Cache Latency = 12 cycles
L3 Cache Latency = 36 cycles
RAM Latency = 36 cycles + 57 ns
The top-level benchmark page is http://www.7-cpu.com/utils.html, but it still doesn't really explain what the different test sizes mean; the code is available, though. The test results include Skylake, which is nearly the same as Haswell in this test.
@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.
If I remember correctly it's about 1-2 clock cycles, but this is an estimate and newer caches may be faster. This is from a computer architecture book I have, and the information is for AMD, so Intel may be slightly different, but I would bound it between 5 and 15 clock cycles, which seems like a good estimate to me.
EDIT: Whoops, L2 is 10 cycles with TAG access; L1 takes 1 to 2 cycles, my mistake :\
Actually the cost of an L1 cache hit is almost the same as the cost of a register access. It was surprising to me, but it is true, at least for my processor (Athlon 64). Some time ago I wrote a simple test application to benchmark the efficiency of access to shared data in a multiprocessor system. The application body is a simple loop incrementing a memory variable for a predefined period of time. To make a comparison, I first benchmarked a non-shared variable. I captured the result, but when I disassembled the application I found that the compiler had defeated my expectations and applied an unwanted optimisation to my code: it simply put the variable in a CPU register and incremented it iteratively there, with no memory access at all. The real surprise came after I forced the compiler to use an in-memory variable instead of a register variable: with the updated application I got almost the same benchmarking results. The performance degradation was negligible (~1-2%) and looks related to some side effect.
As a result:
1) I think you can consider the L1 cache as an unmanaged pool of processor registers.
2) There is no sense in applying brutal assembly optimisation to force the compiler to keep frequently accessed data in processor registers. If the data really is frequently accessed, it will live in the L1 cache, and because of that it will have nearly the same access cost as a processor register.
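For completeness, a hedged sketch of the two variants written directly in asm, so no compiler can optimise the variable into a register behind your back; label names and iteration counts are made up. Note that, per the store-forwarding figures earlier on this page, a tight increment of an in-memory variable is typically bound by roughly 6 cycles of store-forwarding latency per iteration on modern Intel CPUs, so how close the two versions come depends heavily on the CPU and on how the timed loop is structured:
global _start
_start:
mov ecx, 100000000 ; iteration count (arbitrary)
.reg_loop: ; register version: ~1 cycle per iteration
inc rax
dec ecx
jnz .reg_loop
mov ecx, 100000000
.mem_loop: ; in-memory version: bound by store-forwarding latency
inc qword [rel counter]
dec ecx
jnz .mem_loop
mov eax, 60 ; __NR_exit
xor edi, edi
syscall
section .bss
counter: resq 1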
