Can Apache web server make use of CPU AVX instructions? - performance

I have replaced my old home server (i3-6100, 51 W TDP, 3.7 GHz, SSE4.1, SSE4.2, AVX2) with a thin client (Celeron J4105, 10 W TDP, 1.5 GHz base / 2.5 GHz turbo, SSE4.2).
Can Apache make use of CPU AVX instructions?

Glibc automatically uses AVX/AVX2 versions of memcpy, memcmp, strlen, and similar functions if available, which is nice for small to medium-length strings that are hot in L1d or L2 cache (e.g. maybe twice as fast for strings of 100 B to 128 KiB). For shorter strings, startup and cleanup overhead are a significant fraction of the total time. Hopefully Apache doesn't spend a lot of time looping over strings.
There might be some auto-vectorized loops inside Apache itself if you compile with -O3 -march=native, but that's unlikely to amount to much.
I doubt there's anything in Apache that would be worth manually dispatching based on CPUID (beyond what libc already does for you), so you probably won't find any AVX instructions in the apache binary on your i3 server if you check with a disassembler, unless it was compiled specifically for that machine or for AVX-capable machines in general. If the whole binary was compiled with AVX enabled, even scalar FP math would use instructions like vmovsd / vucomisd instead of movsd / ucomisd, so if you see any movsd, it wasn't compiled that way.
See How to check if compiled code uses SSE and AVX instructions? and note the difference between SIMD (packed) and scalar instructions.
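If you did want to hand-dispatch some hot helper of your own based on CPUID, GCC and clang make that easy. A minimal sketch (the hot_loop* functions are made up for illustration) using the __builtin_cpu_supports builtin:
#include <stddef.h>
#include <stdio.h>

/* Hypothetical AVX2 and baseline versions of some hot helper. */
static void hot_loop_avx2(char *dst, const char *src, size_t n) { (void)dst; (void)src; (void)n; }
static void hot_loop_baseline(char *dst, const char *src, size_t n) { (void)dst; (void)src; (void)n; }

static void hot_loop(char *dst, const char *src, size_t n)
{
    /* GCC/clang builtin: checks CPUID results cached at startup by libgcc. */
    if (__builtin_cpu_supports("avx2"))
        hot_loop_avx2(dst, src, n);
    else
        hot_loop_baseline(dst, src, n);
}

int main(void)
{
    char buf[16] = {0};
    hot_loop(buf, buf, sizeof buf);
    printf("avx2: %d  sse4.2: %d\n",
           !!__builtin_cpu_supports("avx2"), !!__builtin_cpu_supports("sse4.2"));
    return 0;
}
But as above, for a typical Apache build the only places this kind of dispatch pays off are libc and libraries like OpenSSL, which already do it for you.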
One interesting feature of AVX that's relevant for multithreaded programs: Intel recently documented that the AVX feature flag implies that 16-byte aligned loads/stores are guaranteed atomic. (And I think AMD is planning to do the same if they haven't already, since it's also true in practice on their CPUs.) Previously the only way to get 16-byte lock-free atomics was lock cmpxchg16b, meaning that even a pure load cost as much as an atomic RMW. GCC-compiled code can take advantage of this via libatomic, including via updates to a shared libatomic that dispatches to more efficient load/store functions on CPUs with AVX.
So anyway, cheaper lock-free atomics for objects the size of two pointers in 64-bit mode. Not a game-changer for code that doesn't spend a ton of time communicating between threads. And it doesn't help the kernel, because you can't take advantage of it with -mgeneral-regs-only; a 16-byte load/store requires an XMM register, unless cmpxchg16b without a lock prefix counts. But that could do a non-atomic RMW if the compare succeeds, so it's unusable.
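Here's a minimal C sketch of what that looks like in practice (my example; build with gcc -O2 pair.c -latomic). A 16-byte _Atomic object isn't inlined as lock-free, so the compiler emits calls into libatomic, which can resolve them to plain 16-byte vector loads/stores on CPUs where AVX guarantees atomicity, instead of lock cmpxchg16b:
#include <stdatomic.h>
#include <stdio.h>

typedef struct { void *ptr; unsigned long tag; } tagged_ptr;   /* 16 bytes in 64-bit mode */

static _Atomic tagged_ptr shared;

tagged_ptr load_pair(void)
{
    /* Compiles to a libatomic call (e.g. __atomic_load_16); a recent libatomic
       can service it with a single 16-byte load on AVX-capable CPUs. */
    return atomic_load_explicit(&shared, memory_order_acquire);
}

int main(void)
{
    tagged_ptr t = { .ptr = (void *)0, .tag = 42 };
    atomic_store_explicit(&shared, t, memory_order_release);
    printf("tag=%lu lock_free=%d\n", load_pair().tag, (int)atomic_is_lock_free(&shared));
    return 0;
}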
Probably more relevant is that AVX2 support comes with faster memcpy inside the kernel, for copy_to_user (from the pagecache) for read system calls. rep movsb can work in 32-byte chunks internally in microcode, vs. 16-byte chunks on CPUs whose load/store data paths are only 16 bytes wide.
(AVX can be implemented on CPUs with 16-byte load/store paths, like Zen 1 and Ivy Bridge, but your i3 with AVX2 has 32-byte datapaths between execution units and L1d cache. https://www.realworldtech.com/haswell-cpu/5/)
AVX2 can help with some OpenSSL stuff, but probably nothing important for web serving.
Usually you'll be using AES for bulk encryption, and both CPUs have AES-NI. AVX does give the AES instructions non-destructive 3-operand VEX encodings, but working on 32 bytes per instruction instead of 16 requires the separate VAES feature, which neither of these CPUs has; and IIRC that has to be 2 separate blocks in parallel, not one single AES stream going twice as fast. Still, Apache + OpenSSL gets full AES-NI speed on both machines.
There's also a possible speedup for MD5 or SHA512 using AVX2, if I recall correctly.
For SHA-1 and SHA-256, the new CPU has SHA-NI (new in Goldmont, and in Ice Lake for Intel's big cores; the J4105 is Goldmont Plus, but the old CPU is Skylake, which doesn't have SHA-NI and has to do those hashes manually with SIMD). There is no VEX encoding of SHA1RNDS4 or of the SHA-256 acceleration instructions, let alone one that uses 256-bit vectors to go faster. If you use SHA-512 for anything, that will go somewhat faster with AVX2 than with SSE4.2, all else being equal.
(And of course a Skylake would run the same asm faster clock-for-clock, with a wider front-end that's more robust against bottlenecks, and more throughput in the back-end. https://agner.org/optimize/ and https://uops.info/ - compare your old Skylake against your "new" Goldmont+. I put "new" in quotes because it launched at the end of 2017, only a couple years after your Skylake.)
Intel didn't have AVX support in its low-power cores until Gracemont, the E-cores in Alder Lake. IDK if/when they're planning a stand-alone low-power chip with only Gracemont cores to replace Tremont, or whether it would include AVX.

Out of curiosity, and since I have a Gentoo Linux system where I can simply compile Apache with -O3 -march=native, I tried looking at the disassembly to see whether AVX vector instructions are generated at all.
objdump -d --no-show-raw-insn --no-addresses \
/usr/sbin/apache2 /usr/lib64/apache2/modules/*.so |
grep -oE '^\s+([[:alpha:]][[:alnum:]]*)+' |
LC_ALL=C sort | uniq -c
This gives the following stats:
3 vaddsd
1 vcomisd
23 vcomiss
3 vcvtsd2ss
9 vcvtsi2sd
1 vcvtsi2sdq
25 vcvtsi2ss
2 vcvtsi2ssl
11 vcvtsi2ssq
51 vcvtss2sd
5 vcvttsd2si
2 vcvttss2si
2 vcvttss2usi
4 vcvtusi2sd
1 vcvtusi2sdl
1 vcvtusi2ss
4 vcvtusi2ssl
3 vcvtusi2ssq
8 vdivsd
28 vdivss
19 vextracti128
2 vextracti64x2
15 vinserti128
185 vmovaps
74 vmovd
585 vmovdqa
28 vmovdqa64
1510 vmovdqu
55 vmovdqu8
323 vmovq
15 vmovsd
113 vmovss
8 vmulsd
30 vmulss
22 vpackuswb
27 vpaddd
16 vpaddq
3 vpalignr
29 vpand
17 vpblendmq
2 vpblendvb
1 vpbroadcastd
14 vpbroadcastq
2 vpbroadcastw
8 vpcmpeqb
3 vpcmpeqd
16 vpcmpeqq
16 vpcmpneqq
1 vpermi2w
20 vpermq
1 vpermt2d
1 vpermt2q
7 vpermt2w
1 vpextrb
5 vpextrq
32 vpgatherdd
8 vpinsrb
44 vpinsrd
249 vpinsrq
3 vpmaxsd
3 vpmaxsq
3 vpminsd
1 vpmovqd
8 vpmovsxdq
18 vpmovzxbw
36 vpmovzxwd
2 vpmuldq
17 vpor
28 vpshufb
8 vpshufd
24 vpslld
8 vpsrld
13 vpsrldq
1 vpsrlq
20 vpsrlw
4 vpsubq
1 vpternlogd
1 vpunpcklbw
2 vpunpckldq
4 vpunpcklqdq
4 vpunpcklwd
317 vpxor
1 vshufpd
1 vshufps
12 vucomiss
12 vxorpd
41 vxorps
126 vzeroupper
So there is definitely some use. However, this doesn't prove that these instructions are actually executed, or that they improve performance compared to the same code limited to SSE2 or compiled without auto-vectorization.
I find it somewhat curious to see instructions such as vpgatherdd in use; that's not something I would expect a compiler to use on its own. I should also note that this is GCC 11.3.1 on an i7-11800H (Tiger Lake), so the output includes AVX-512, not just AVX or AVX2.
As noted by Peter, the more likely candidate for effective use is the libc. I might add that OpenSSL will also make use of AVX if available.
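Application code doesn't have to do anything special to get that: OpenSSL selects its SHA-NI / AVX2 / SSSE3 / AES-NI implementations at runtime based on CPUID. A minimal sketch using the EVP digest API (assuming OpenSSL 1.1 or later; build with -lcrypto):
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char msg[] = "hello";
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int len = 0;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);   /* OpenSSL picks SHA-NI / AVX2 / SSSE3 code at runtime */
    EVP_DigestUpdate(ctx, msg, strlen(msg));
    EVP_DigestFinal_ex(ctx, md, &len);
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < len; i++)
        printf("%02x", md[i]);
    putchar('\n');
    return 0;
}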

Related

Why is LOOP faster than DEC,JNZ on 8086?

My professor claimed that LOOP is faster on 8086 because only one instruction is fetched instead of two, like in dec cx, jnz. So I think we are saving time by avoiding the extra fetch and decode per iteration.
But earlier in the lecture, he also mentioned that LOOP does the same stuff as DEC, JNZ under the hood, and I presume that its decoding should also be more complex, so the speed difference should kind of balance out. Then, why is the LOOP instruction faster? I went through this post, and the answers there pertain to processors more modern than 8086, although one of the answers (and the page it links) does point out that on 8088 (closely related to 8086), LOOP is faster.
Later, the professor used the same reasoning to explain why rep string operations might be faster than LOOP + individual movement instructions, but since I was not entirely convinced with the previous approach, I asked this question here.
It's not decode that's the problem; on 8086 it's usually code fetch.
Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.
Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch. (8088 almost always is, breathing through a straw with its 8-bit bus, unlike 8086's 16-bit bus.)
dec cx is 1 byte, jnz rel8 is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8.
8086 performance can be approximated by counting memory accesses and multiplying by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions. (Except for very slow instructions like mul that would let the buffer fill up after at most three 2-byte fetches.) By that rule of thumb, the 3 bytes of dec cx + jnz cost about a word and a half of code fetch, roughly 6 bus cycles per iteration, vs. about 4 for the 2 bytes of loop.
See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.
https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck. (i.e. just decoding from the prefetch buffer.)
Decode / execute timings, not including fetch:
DEC Decrement
operand bytes 8088 186 286 386 486 Pentium
r8 2 3 3 2 2 1 1 UV
r16 1 3 3 2 2 1 1 UV
r32 1 3 3 2 2 1 1 UV
mem 2+d(0,2) 23+EA 15 7 6 3 3 UV
Jcc Jump on condition code
operand bytes 8088 186 286 386 486 Pentium
near8 2 4/16 4/13 3/7+m 3/7+m 1/3 1 PV
near16 3 - - - 3/7+m 1/3 1 PV
LOOP Loop control with CX counter
operand bytes 8088 186 286 386 486 Pentium
short 2 5/17 5/15 4/8+m 11+m 6/7 5/6 NP
So even ignoring code-fetch differences:
dec + taken jnz takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
taken loop takes 17 cycles to decode / exec on 8086 / 8088.
(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)
8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next takes some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc/std / etc.) take 2 cycles (bizarrely, less than nop at 3 cycles).
I presume that its decoding should also be more complex
There's no reason that decoding is more complex for the loop instruction. The instruction has to do multiple things, but decoding is not at issue: it should decode as easily as JMP, since there's just the opcode and one operand, the branch target.
Saving one instruction's fetch & decode probably accounts for the speed improvement, since in execution they are effectively equivalent.
Looking at the "8086/8088 User's Manual: Programmer's and Hardware Reference" (Intel 1989) confirms that LOOP is marginally faster than the combination DEC CX; JNZ. DEC takes 3 clock cycles, JNZ takes 4 (not taken) or 16 (taken) cycles. So the combination requires 7 or 19 cycles. LOOP on the other hand requires 5 cycles (not taken) or 17 cycles (taken), for a saving of 2 cycles.
I do not see anything in the manual that describes why LOOP is faster. The faster instruction fetch due to the reduced number of opcode bytes seems like a reasonable hypothesis.
According to the "80286 and 80287 Programmer's Reference Manual" (Intel 1987), LOOP still has a slight advantage over the discrete replacement, in that it requires 8 cycles when taken and 4 cycles when not taken, while the combo requires 1 cycle more in both cases (DEC 2 cycles; JNZ 7 or 3 cycles).
The 8086 microcode has been disassembled, so one could theoretically take a look at the internal sequence of operations for both of these cases to establish exactly why LOOP is faster, if one is so inclined.

Timing of executing 8 bit and 64 bit instructions on 64 bit x64/Amd64 processors

Is there any execution-timing difference between 8-bit and 64-bit instructions on a 64-bit x86-64/AMD64 processor, when those instructions are the same except for operand width?
Is there a way to find real processor timing of executing these 2 tiny assembly functions?
-Thanks.
; 64 bit instructions
add64:
mov $0x1, %rax
add $0x2, %rax
ret
; 8 bit instructions
add8:
mov $0x1, %al
add $0x2, %al
ret
Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.
This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
Related reading:
Why doesn't GCC use partial registers?
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - important background for understanding performance on modern OoO exec CPUs.
Their code size also differs: both of the 8-bit instructions are 2 bytes long (thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes), while the RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and, on a large scale, for how many bytes have to get paged in from disk). On a small scale, it affects how many instructions can fit into a 16- or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Code alignment of later instructions is also affected by the varying lengths of earlier instructions, sometimes affecting which branches alias each other in the predictors.
https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction-length matter beyond just code density in the I-cache / uop-cache.
Generally 32-bit operand-size is the most efficient (best for performance, and pretty good for code size). 32-bit and 8-bit are the operand sizes x86-64 can use without extra prefixes, but in practice with 8-bit you need extra or longer instructions to avoid stalls and other badness, because 8-bit writes don't zero-extend into the full register. See The advantages of using 32bit registers/instructions in x86-64.
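To see how compilers deal with this, here's a toy example (mine, not from the question) you can feed to gcc -O2 -S: even for 8-bit data, the compiler typically writes a full 32-bit register, so it never creates a partial-register false dependency itself.
#include <stdint.h>

uint8_t  add8_c (uint8_t x)  { return (uint8_t)(x + 2); }
uint64_t add64_c(uint64_t x) { return x + 2; }

/* Typical gcc -O2 output (x86-64 SysV):
     add8_c:   lea 0x2(%rdi), %eax     # writes all of EAX, no partial-register write
               ret
     add64_c:  lea 0x2(%rdi), %rax
               ret
   The garbage in the high bits of EAX is fine: a conforming caller of add8_c only reads AL. */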
A few instructions are actually slower in the ALUs for 64-bit operand-size, not just front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same thing as gcc -O2; see this for examples) to get it to optimize to mov $1, %eax, which has exactly the same architectural effects but is shorter (no REX prefix or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source, especially if your assembler doesn't do it for you.
But other than the false dependency and code size, they're the same for the back-end of the CPU: all of those instructions are single-uop and can run on any scalar-integer ALU execution port (see footnote 1). (https://uops.info/ has automated test results for every form of every unprivileged instruction.)
Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)
Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU port's capabilities to also handle mov-immediate was a new feature in Excavator.
Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)
Is there a way to measure it?
Yes, but generally not in a function you call with call. Instead, put the instructions in an unrolled loop so you can run them lots of times with minimal other instructions (see the sketch after the links below). It's especially useful to look at CPU performance-counter results to find front-end / back-end uop counts, as well as just the overall time for your loop.
You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction). Also:
Assembly - How to score a CPU instruction by latency and throughput
Idiomatic way of performance evaluation?
Can x86's MOV really be "free"? Why can't I reproduce this at all? is a good specific example of constructing a microbenchmark to measure / prove something specific.
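Here's the sketch mentioned above: a minimal GNU C latency test for add (my example). Caveat: __rdtsc counts reference cycles, not core clock cycles, so either pin the clock frequency or use perf's core-cycle counter for exact per-instruction numbers.
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc */

int main(void)
{
    const long iters = 200000000;
    uint64_t x = 1;

    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++) {
        /* 4 adds chained through x per iteration: a latency bottleneck,
           while out-of-order exec hides the loop overhead. */
        __asm__ volatile("add $2, %0\n\t"
                         "add $2, %0\n\t"
                         "add $2, %0\n\t"
                         "add $2, %0" : "+r"(x));
    }
    uint64_t t1 = __rdtsc();

    printf("~%.2f reference cycles per add (x=%llu)\n",
           (t1 - t0) / (iters * 4.0), (unsigned long long)x);
    return 0;
}
To measure throughput instead of latency, use several independent variables (separate asm statements on separate registers) so the adds don't form one serial chain.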
Generally you don't need to measure yourself (although it's good to understand how; that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) by analyzing the instructions. Often that predicts performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide make it too hard to accurately account for every cycle.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? has links to lots of good details, and stuff about CPU internals.
How many CPU cycles are needed for each assembly instruction? (you can't add up a cycle count for each instruction; front-end and back-end throughput, and latency, could each be the bottleneck for a loop.)

What type of addresses can the port 7 store AGU handle on recent Intel x86?

Starting with Haswell, Intel CPU micro-architectures have had a dedicated store-address unit on port 7 which can handle the address-generation uop for some store operations (the other uop, store data always goes to port 4).
Originally it was believed that this could handle any type of addresses, but this seems not to be the case. What types of addresses can this port handle?
This answer applies to Haswell and Skylake (/Kaby Lake / Coffee Lake). Future microarchitectures (Cannon Lake / Ice Lake) will have to be checked when they're available. The port-7 AGU was new in Haswell.
For instructions that can use port7 at all (e.g. not vextracti128), any non-indexed addressing mode can use port 7.
This includes RIP-relative, and 64-bit absolute (mov [qword abs buf], eax, even in a PIE executable loaded above 2^32, so the address really doesn't fit in 32 bits), as well as normal [reg + disp0/8/32] or absolute [disp32].
An index register always prevents use of port7, e.g. [rdi + rax], or [disp32 + rax*2]. Even [NOSPLIT disp32 + rax*1] can't use port 7 (so HSW/SKL doesn't internally convert an indexed with scale=1 and no base register into a base+disp32 addressing mode.)
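As a quick illustration of the distinction (my sketch, GNU C inline asm): two stores that differ only in addressing mode. In a loop measured with perf's uops_dispatched_port.port_2/3/7 events, only the first kind should ever show up on port 7.
#include <stdint.h>

void store_nonindexed(uint64_t *base, uint64_t v)
{
    /* [reg + disp8] addressing: the store-address uop can go to port 2, 3 or 7. */
    __asm__ volatile("mov %1, 8(%0)" : : "r"(base), "r"(v) : "memory");
}

void store_indexed(uint64_t *base, uint64_t idx, uint64_t v)
{
    /* [reg + reg*8] addressing: the store-address uop can only use port 2 or 3. */
    __asm__ volatile("mov %2, (%0,%1,8)" : : "r"(base), "r"(idx), "r"(v) : "memory");
}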
I tested myself with ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_dispatched_port.port_2,uops_dispatched_port.port_3,uops_dispatched_port.port_7 ./testloop on a Skylake i7-6700k.
The [+0, +2047] displacement range makes no difference for stores: mov [rsi - 4000], rax can use port 7.
Non-indexed loads with small positive displacements have 1c lower latency. No special case for stores is mentioned in Intel's optimization manual. Skylake's variable-latency store-forwarding (with worse latency when the load tries to execute right away after the store) makes it hard to construct a microbenchmark that includes store latency but isn't affected by having store-address uops compete with loads for fewer ports. I haven't come up with a microbenchmark with a loop-carried dependency chain through a store-address uop but not through the store-data uop. Presumably it's possible, but maybe needs an array instead of a single location.
Some instructions can't use port7 at all:
vextracti128 [rdi], ymm0, 0 includes a store-address uop (of course), but it can only run on port 2 or port 3.
Agner Fog's instruction tables have at least one error here, though: he lists pextrb/w/d/q as only running the store-address uop on p23, but in fact it can use any of p237 on HSW/SKL.
I haven't tested this exhaustively, but one difference between HSW and SKL that I found (see footnote 1) was VCVTPS2PH [mem], xmm/ymm, imm8. (The instruction changed to use fewer ALU uops, so that doesn't indicate a change in port 7 between HSW and SKL.)
On Haswell: VCVTPS2PH is 4 uops (fused and unfused domain): p1 p4 p5 p23 (Agner Fog is right).
On Skylake: VCVTPS2PH xmm is 2 fused / 3 unfused uops: p01 p4 p237
On Skylake: VCVTPS2PH ymm is 3 fused / 3 unfused uops: p01 p4 p237
(Agner Fog lists VCVTPS2PH v as 3F/3U (one entry for both vector widths), missing the micro-fusion with the xmm version, and incorrectly lists the port breakdown as p01 p4 p23).
In general, beware that Agner's recent updates seem a little sloppy, like copy/paste or typo errors (e.g. 5 instead of 0.5 for Ryzen vbroadcastf128 y,m128 throughput).
Footnote 1: HSW testing was on an old laptop that's no longer usable (I used its RAM to upgrade another machine that still gets regular use). I don't have a Broadwell to test on. Everything in this answer is definitely true on Skylake: I double checked it just now. I tested some of this a while ago on Haswell, and still had my notes from that.

Intel Intrinsics guide - Latency and Throughput

Can somebody explain the Latency and the Throughput values given in the Intel Intrinsic Guide?
Have I understood it correctly that the latency is the amount of time units an instruction takes to run, and the throughput is the number of instructions that can be started per time unit?
If my definition is correct, why is the latency for some instructions higher on newer CPU versions (e.g. mulps)?
Missing from that table: MULPS latency on Broadwell: 3. On Skylake: 4.
The intrinsic finder's latency is accurate in this case, although it occasionally doesn't match Agner Fog's experimental testing. (That VEXTRACTF128 latency may be a case of Intel not including a bypass delay in their table). See my answer on that linked question for more details about what to do with throughput and latency numbers, and what they mean for a modern out-of-order CPU.
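To make the two numbers concrete, here's a small intrinsics sketch (my example, not from the guide): a serial dependency chain of _mm_mul_ps is limited by latency (one multiply per 3 to 5 cycles depending on the uarch), while several independent chains can overlap and approach the throughput limit instead.
#include <immintrin.h>
#include <stdio.h>

/* Latency-bound: each multiply needs the previous result. */
__m128 mul_latency(__m128 x, __m128 y, int n)
{
    for (int i = 0; i < n; i++)
        x = _mm_mul_ps(x, y);
    return x;
}

/* Throughput-bound(ish): four independent chains can be in flight at once,
   so this does 4 multiplies in roughly the time the loop above does 1.
   (Fully saturating two multiply ports at 0.5c throughput with 4c latency
   would need 8 independent chains.) */
__m128 mul_throughput(__m128 x0, __m128 x1, __m128 x2, __m128 x3, __m128 y, int n)
{
    for (int i = 0; i < n; i++) {
        x0 = _mm_mul_ps(x0, y);
        x1 = _mm_mul_ps(x1, y);
        x2 = _mm_mul_ps(x2, y);
        x3 = _mm_mul_ps(x3, y);
    }
    return _mm_mul_ps(_mm_mul_ps(x0, x1), _mm_mul_ps(x2, x3));
}

int main(void)
{
    __m128 r = mul_latency(_mm_set1_ps(1.0f), _mm_set1_ps(1.0000001f), 1000);
    float out;
    _mm_store_ss(&out, r);
    printf("%f\n", out);
    return 0;
}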
MULPS latency did increase from 4 (Nehalem) to 5 (Sandybridge). This may have been to save power or transistors, but more likely because SandyBridge standardized uop latencies to only a few different values, to avoid writeback conflict: i.e. when the same execution unit would produce two results in the same cycle, e.g. from starting a 2c uop one cycle, then a 1c uop the next cycle.
This simplifies the uop scheduler, which dispatches uops from the Reservation Station to the execution units. More or less in oldest-first order, but it has to filter by which ones have their inputs ready. The scheduler is power-hungry, and this is a significant part of the power cost of out-of-order execution. (It's unfortunately not practical to make a scheduler that picks uops in critical-path-first order, to avoid having independent uops steal cycles from the critical path with resource conflicts.)
Agner Fog explains the same thing (in the SnB section of his microarch pdf):
Mixing μops with different latencies
Previous processors have a write-back conflict when μops with
different latencies are issued to the same execution port, as
described on page 114. This problem is largely solved on the Sandy
Bridge. Execution latencies are standardized so that all μops with a
latency of 3 are issued to port 1 and all μops with a latency of 5 go
to port 0. μops with a latency of 1 can go to port 0, 1 or 5. No other
latencies are allowed, except for division and square root.
The standardization of latencies has the advantage that write-back
conflicts are avoided. The disadvantage is that some μops have higher
latencies than necessary.
Hmm, I just realized that Agner's numbers for VEXTRACTF128 xmm, ymm, imm8 are weird. Agner lists it as 1 uop 2c latency on SnB, but Intel lists it as 1c latency (as discussed here). Maybe the execution unit is 1c latency, but there's a built-in 1c bypass delay (for lane-crossing?) before you can use the result. That would explain the discrepancy between Intel's numbers and Agner's experimental test.
Some instructions are still 2c latency, because they decode to 2 dependent uops that are each 1c latency. MULPS is a single uop, even the AVX 256b version, because even Intel's first-gen AVX CPUs have full-width 256b execution units (except the divide/sqrt unit). Needing twice as many copies of the FP multiplier circuitry is a good reason for optimizing it to save transistors at the cost of latency.
This pattern holds up to and including Broadwell, AFAICT from searching Agner's tables. (Using LibreOffice, I selected the whole table, did Data -> Filter -> Standard Filter, and looked for rows with column C = 1 and column F = 4, then repeated for 2, looking for any uops that aren't loads or stores.)
Haswell sticks to the pattern of only 1, 3 and 5 cycle ALU uop latencies (except for AESENC/AESDEC, which is 1 uop for port 5 with 7c latency, and of course DIVPS and SQRTPS). There's also CVTPI2PS xmm, mm at 1 uop, 4c latency, but maybe that's 3c for the p1 uop plus 1c of bypass delay, either in the way Agner Fog measured it or unavoidably. VMOVMSKPS r32, ymm is also 2c (vs. 3c for the r32, xmm version).
Broadwell dropped MULPS latency to 3, same as ADDPS, but kept FMA at 5c. Presumably they figured out how to shortcut the FMA unit to produce just a multiply when no add was needed.
Skylake is able to handle uops with latency=4. Latency for FMA, ADDPS/D, and MULPS/D = 4 cycles. (SKL drops the dedicated vector-FP add unit, and does everything with the FMA unit. So ADDPS/D throughput is doubled to match MULPS/D and FMA...PS/D. I'm not sure which change motivated what, and whether they would have introduced 4c latency instructions at all if they hadn't wanted to drop the vec-FP adder without hurting ADDPS latency too badly.)
Other SKL instructions with 4c latency: PHMINPOSUW (down from 5c), AESDEC/AESENC, CVTDQ2PS (up from 3c, but this might be 3c + bypass), RCPPS (down from 5c), RSQRTPS, CMPPS/D (up from 3c). Hmm, I guess FP compares were done in the adder, and now have to use FMA.
MOVD r32, xmm and MOVD xmm, r32 are listed as 2c, perhaps a bypass delay from int-vec to int? Or a glitch in Agner's testing? Testing the latency would require other instructions to create a round-trip back to xmm. It's 1c on HSW. Agner lists SKL MOVQ r64, xmm as 2 cycles (port0), but MOVQ xmm, r64 as 1c (port5), and it seems extremely weird that reading a 64-bit register is faster than reading a 32-bit register. Agner has had mistakes in his table in the past; this may be another.

Why is SSE aligned read + shuffle slower than unaligned read on some CPUs but not on others?

While trying to optimize misaligned reads necessary for my finite differences code, I changed unaligned loads like this:
__m128 pm1 =_mm_loadu_ps(&H[k-1]);
into this aligned read + shuffle code:
__m128 p0 =_mm_load_ps(&H[k]);
__m128 pm4 =_mm_load_ps(&H[k-4]);
__m128 pm1 =_mm_shuffle_ps(p0,p0,0x90); // move 3 floats to higher positions
__m128 tpm1 =_mm_shuffle_ps(pm4,pm4,0x03); // get missing lowest float
pm1 =_mm_move_ss(pm1,tpm1); // pack lowest float with 3 others
where H is 16-byte aligned; there were also similar changes for H[k+1] and H[k±3], and a movlhps & movhlps optimization for H[k±2] (here's the full code of the loop).
I found that on my Core i7-930, the optimization for reading H[k±3] appeared to be fruitful, while adding the next optimization, for ±1, slowed down my loop (by a few percent). Switching between the ±1 and ±3 optimizations didn't change the results.
At the same time, on a Core 2 Duo 6300 and a Core 2 Quad, enabling both optimizations (for ±1 and ±3) boosted performance (by tens of percent), while on a Core i7-4765T both of them slowed it down (by a few percent).
On a Pentium 4, all attempts to optimize the misaligned reads, including those with movlhps/movhlps, led to a slowdown.
Why is it so different for different CPUs? Is it because of the increase in code size, so that the loop might not fit in some instruction cache? Or is it because some CPUs are insensitive to misaligned reads, while others are much more sensitive? Or maybe operations such as shuffles are slow on some CPUs?
Every two years Intel comes out with a new microarchitecture. The number of execution units may change, and instructions that previously could execute in only one execution unit may have 2 or 3 units available in newer processors. The latency of an instruction might change too, for example when a dedicated shuffle execution unit is added.
Intel goes into some detail in their Optimization Reference Manual; here's the link, and below I've copied the relevant sections.
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
section 3.5.2.7 Floating-Point/SIMD Operands
The MOVUPD from memory instruction performs two 64-bit loads, but requires additional μops to adjust the address and combine the loads into a single register. This same functionality can be obtained using MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2, which uses fewer μops and can be packed into the trace cache more effectively. The latter alternative has been found to provide a several percent performance improvement in some cases. Its encoding requires more instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of MOVUPD is complex and slow, so much so that the sequence with two MOVSD and a UNPCKHPD should always be used.
Assembly/Compiler Coding Rule 44. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for a unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.
Assembly/Compiler Coding Rule 45. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1 instead.
section 6.5.1.2 Data Swizzling
Swizzling data from SoA to AoS format can apply to a number of application domains, including 3D geometry, video and imaging. Two different swizzling techniques can be adapted to handle floating-point and integer data. Example 6-3 illustrates a swizzle function that uses SHUFPS, MOVLHPS, MOVHLPS instructions.
The technique in Example 6-3 (loading 16 bytes, using SHUFPS and copying halves of XMM registers) is preferable over an alternate approach of loading halves of each vector using MOVLPS/MOVHPS on newer microarchitectures. This is because loading 8 bytes using MOVLPS/MOVHPS can create code dependency and reduce the throughput of the execution engine. The performance considerations of Example 6-3 and Example 6-4 often depends on the characteristics of each microarchitecture. For example, in Intel Core microarchitecture, executing a SHUFPS tend to be slower than a PUNPCKxxx instruction. In Enhanced Intel Core microarchitecture, SHUFPS and PUNPCKxxx instruction all executes with 1 cycle throughput due to the 128-bit shuffle execution unit. Then the next important consideration is that there is only one port that can execute PUNPCKxxx vs. MOVLHPS/MOVHLPS can execute on multiple ports. The performance of both techniques improves on Intel Core microarchitecture over previous microarchitectures due to 3 ports for executing SIMD instructions. Both techniques improves further on Enhanced Intel Core microarchitecture due to the 128-bit shuffle unit.
On older CPUs misaligned loads have a large performance penalty: they generate two bus read cycles, and then there is some additional fix-up after the two read cycles. This means that misaligned loads are typically 2x or more slower than aligned loads. However, with more recent CPUs (e.g. Core i7) the penalty for misaligned loads is almost negligible. So if you need to support both old CPUs and new CPUs, you'll probably want to handle misaligned loads differently for each.
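If one binary has to run well on both kinds of CPU, one option is to pick the strategy at runtime. A rough sketch (mine, not from the answers), using __builtin_cpu_supports("sse4.2") as a crude stand-in for "Nehalem or newer, where unaligned loads are cheap"; in real code you'd hoist the check out of the inner loop or build two versions of the whole loop.
#include <immintrin.h>

/* Unaligned load: fine on Nehalem and later. */
static inline __m128 load_pm1_unaligned(const float *H, int k)
{
    return _mm_loadu_ps(&H[k - 1]);
}

/* Aligned loads + shuffles, as in the question (H 16-byte aligned, k % 4 == 0). */
static inline __m128 load_pm1_shuffle(const float *H, int k)
{
    __m128 p0  = _mm_load_ps(&H[k]);
    __m128 pm4 = _mm_load_ps(&H[k - 4]);
    __m128 pm1 = _mm_shuffle_ps(p0, p0, 0x90);    /* move 3 floats to higher positions */
    __m128 t   = _mm_shuffle_ps(pm4, pm4, 0x03);  /* get the missing lowest float */
    return _mm_move_ss(pm1, t);                   /* pack lowest float with 3 others */
}

__m128 load_pm1(const float *H, int k)
{
    if (__builtin_cpu_supports("sse4.2"))         /* crude Nehalem-or-newer check */
        return load_pm1_unaligned(H, k);
    return load_pm1_shuffle(H, k);
}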

Resources