vgetmantps vs andpd instructions for getting the mantissa of float - performance

For Skylake-X (from Agner Fog's instruction tables):
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
| Instruction           | Operands    | µops fused domain | µops unfused domain | µops each port | Latency | Reciprocal throughput |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
| VGETMANTPS/PD         | v,v,v       | 1                 | 1                   | p01/05         | 4       | 0.5-1                 |
| AND/ANDN/OR/XORPS/PD  | x,x / y,y,y | 1                 | 1                   | p015           | 1       | 0.33                  |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
Does that mean using a bitmask and logical and to get the mantissa of a float is faster than using the vgetmantps instruction?
How much is the latency for transferring the number from float to int and back to float?

For implementing log(x), you want the mantissa and exponent as float, and vgetmantps / vgetexpps are perfect for it. See Efficient implementation of log2(__m256d) in AVX2. This is what those instructions are for, and they do speed up a fast approximation to log2. (Plus vgetmantps can normalize the significand to -0.5 .. +0.5 instead of 1 .. 2.0, or other neat ranges, to create input for a polynomial approximation to log(x+1) or whatever. See its docs.)
If you only want the mantissa as an integer, then sure AND away the other bits and you're done in one instruction.
(But remember that for a NaN, the mantissa is the NaN payload, so if you need to do something different for NaN then you need to check the exponent.)
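For illustration, here's a hedged sketch of both approaches with intrinsics (my own code, not from the question; note that the 256-bit _mm256_getmant_ps form needs AVX-512F+VL, while the AND version only needs AVX2):

#include <immintrin.h>
/* Mantissa as a normalized float in [1.0, 2.0), exponent discarded (vgetmantps). */
static inline __m256 mant_as_float(__m256 x) {
    return _mm256_getmant_ps(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_zero);
}
/* Raw 23 mantissa bits as an integer: one AND with a mask. */
static inline __m256i mant_bits(__m256 x) {
    return _mm256_and_si256(_mm256_castps_si256(x), _mm256_set1_epi32(0x007FFFFF));
}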
How much is the latency for transferring the number from float to int and back to float?
You already have Agner Fog's instruction tables (https://agner.org/optimize/). On Skylake (SKL and SKX), VCVT(T)PS2DQ is 4c latency, running on the FMA ports, and so is the other direction (VCVTDQ2PS).
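For illustration, a hedged sketch of that round trip with intrinsics (my own code): two conversion instructions, each about 4c latency on Skylake per those tables, so roughly 8 cycles for the round trip plus any bypass delay:

#include <immintrin.h>
static inline __m256 float_int_float(__m256 x) {
    __m256i i = _mm256_cvtps_epi32(x);   /* vcvtps2dq: float -> int32, ~4c */
    return _mm256_cvtepi32_ps(i);        /* vcvtdq2ps: int32 -> float, ~4c */
}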
Or are you asking about bypass latency for using the output of an FP instruction like andps as the input to an integer instruction?
Agner Fog's microarch PDF has some info about bypass latency for sending data between vec-int and fp domains, but not many specifics.
Skylake's bypass latency is weird: unlike on previous uarches, it depends what port the instruction actually picked. andps has no bypass latency between FP instructions if it runs on port 5, but if it runs on p0 or p1, it has an extra 1c of latency.
See Intel's optimization manual for a table of domain-crossing latencies broken down by domain+execution-port.
(And just to be extra weird, this bypass-delay latency affects that register forever, even after it has definitely written back to a physical register and isn't being forwarded over the bypass network. vpaddd xmm0, xmm1, xmm2 has 2c latency for both inputs if either input came from vmulps. But some shuffles and other instructions work in either domain. It's been a while since I experimented with this, and I didn't check my notes, so this example might not be exactly right, but it's something like this.)
(Intel's optimization manual doesn't mention this permanent effect which lasts until you overwrite the architectural register with a new value. So be careful about creating FP constants ahead of a loop with integer instructions.)

Related

Why is LOOP faster than DEC,JNZ on 8086?

My professor claimed that LOOP is faster on 8086 because only one instruction is fetched instead of two, like in dec cx, jnz. So I think we are saving time by avoiding the extra fetch and decode per iteration.
But earlier in the lecture, he also mentioned that LOOP does the same stuff as DEC, JNZ under the hood, and I presume that its decoding should also be more complex, so the speed difference should kind of balance out. Then, why is the LOOP instruction faster? I went through this post, and the answers there pertain to processors more modern than 8086, although one of the answers (and the page it links) does point out that on 8088 (closely related to 8086), LOOP is faster.
Later, the professor used the same reasoning to explain why rep string operations might be faster than LOOP + individual movement instructions, but since I was not entirely convinced with the previous approach, I asked this question here.
It's not decode that's the problem, it's usually fetch on 8086.
Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.
Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch. (8088 almost always is, breathing through a straw with its 8-bit bus, unlike 8086's 16-bit bus.)
dec cx is 1 byte, jnz rel8 is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8.
8086 performance can be approximated by counting memory accesses and multiplying by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions. (Except for very slow instructions like mul that would let the buffer fill up after at most three 2-byte fetches.)
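As a rough, hedged estimate of what that means here: dec cx + jnz rel8 is 3 code bytes vs. 2 for loop rel8, and a taken branch discards the prefetch queue, so the loop's code has to be re-fetched every iteration. At roughly 4 cycles per bus access, the extra byte costs on the order of 4 extra cycles of bus time per iteration on 8088, on top of the decode/execute timings below.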
See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.
https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck. (i.e. just decoding from the prefetch buffer.)
Decode / execute timings, not including fetch:
DEC    Decrement
  operand   bytes      8088   186    286     386     486   Pentium
  r8        2          3      3      2       2       1     1 UV
  r16       1          3      3      2       2       1     1 UV
  r32       1          3      3      2       2       1     1 UV
  mem       2+d(0,2)   23+EA  15     7       6       3     3 UV

Jcc    Jump on condition code
  operand   bytes      8088   186    286     386     486   Pentium
  near8     2          4/16   4/13   3/7+m   3/7+m   1/3   1 PV
  near16    3          -      -      -       3/7+m   1/3   1 PV

LOOP   Loop control with CX counter
  operand   bytes      8088   186    286     386     486   Pentium
  short     2          5/17   5/15   4/8+m   11+m    6/7   5/6 NP
So even ignoring code-fetch differences:
dec + taken jnz takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
taken loop takes 17 cycles to decode / exec on 8086 / 8088.
(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)
8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next takes it some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc/std / etc.) take 2 cycles (and nop bizarrely takes 3).
I presume that its decoding should also be more complex
There's no reason that decoding is more complex for the loop instruction. It has to do multiple things, but decoding is not at issue; it should decode as easily as JMP, since there's just the opcode and one operand, the branch target.
Saving one instruction's fetch & decode probably accounts for the speed improvement, since in execution they are effectively equivalent.
Looking at the "8086/8088 User's Manual: Programmer's and Hardware Reference" (Intel 1989) confirms that LOOP is marginally faster than the combination DEC CX; JNZ. DEC takes 3 clock cycles, JNZ takes 4 (not taken) or 16 (taken) cycles. So the combination requires 7 or 19 cycles. LOOP on the other hand requires 5 cycles (not taken) or 17 cycles (taken), for a saving of 2 cycles.
I do not see anything in the manual that describes why LOOP is faster. The faster instruction fetch due to the reduced number of opcode bytes seems like a reasonable hypothesis.
According to the "80286 and 80287 Programmer's Reference Manual" (Intel 1987), LOOP still has a slight advantage over the discrete replacement, in that it requires 8 cycles when taken and 4 cycles when not taken, while the combo requires 1 cycle more in both cases (DEC 2 cycles; JNZ 7 or 3 cycles).
The 8086 microcode has been disassembled, so one could theoretically take a look at the internal sequence of operations for both of these cases to establish exactly why LOOP is faster, if one is so inclined.

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)
TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assume best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen sometimes. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
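As a hedged worked example, plugging the cross-product block from the question into those three counts (Haswell numbers from Agner Fog's tables):
latency: the longest chain is shufps (1c) -> mulps (5c) -> subps (3c), so about 9 cycles from the last input being ready to the subps result.
front-end: 7 fused-domain uops, so at 4 per clock the front-end needs only about 2 cycles per block.
ports: the 4 shufps all need port 5, the 2 mulps go to p0/p1, and subps goes to p1, so with many independent blocks in flight, port 5 limits throughput to about one cross product per 4 cycles.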
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
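A minimal sketch of that multiple-accumulator idea (my own illustration with hypothetical names, assuming AVX and a length that's a multiple of 32 floats): four independent vector sums hide FP-add latency, so the loop can approach the add/FMA throughput limit instead of being latency-bound.

#include <immintrin.h>
#include <stddef.h>
static float sum_floats(const float *a, size_t n) {
    __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
    __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {      /* 4 independent dependency chains */
        s0 = _mm256_add_ps(s0, _mm256_loadu_ps(a + i));
        s1 = _mm256_add_ps(s1, _mm256_loadu_ps(a + i + 8));
        s2 = _mm256_add_ps(s2, _mm256_loadu_ps(a + i + 16));
        s3 = _mm256_add_ps(s3, _mm256_loadu_ps(a + i + 24));
    }
    __m256 s = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));
    __m128 r = _mm_add_ps(_mm256_castps256_ps128(s), _mm256_extractf128_ps(s, 1));
    r = _mm_add_ps(r, _mm_movehl_ps(r, r));   /* horizontal sum of the final vector */
    r = _mm_add_ss(r, _mm_shuffle_ps(r, r, 1));
    return _mm_cvtss_f32(r);
}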
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.)
Everything else is fully pipelined on mainstream Intel CPUs; multi-cycle (reciprocal) throughput for a single uop. (variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the Silvermont-based Knights Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015), which cover the array-of-structs vs. struct-of-arrays issue for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.
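A hedged sketch of the struct-of-arrays version (my own code with hypothetical names): with separate X/Y/Z arrays, four cross products take only loads, multiplies, and subtracts, with no shuffles.

#include <immintrin.h>
/* c = a x b for 4 vectors at once, SoA layout */
static void cross4(const float *ax, const float *ay, const float *az,
                   const float *bx, const float *by, const float *bz,
                   float *cx, float *cy, float *cz) {
    __m128 Ax = _mm_loadu_ps(ax), Ay = _mm_loadu_ps(ay), Az = _mm_loadu_ps(az);
    __m128 Bx = _mm_loadu_ps(bx), By = _mm_loadu_ps(by), Bz = _mm_loadu_ps(bz);
    _mm_storeu_ps(cx, _mm_sub_ps(_mm_mul_ps(Ay, Bz), _mm_mul_ps(Az, By)));
    _mm_storeu_ps(cy, _mm_sub_ps(_mm_mul_ps(Az, Bx), _mm_mul_ps(Ax, Bz)));
    _mm_storeu_ps(cz, _mm_sub_ps(_mm_mul_ps(Ax, By), _mm_mul_ps(Ay, Bx)));
}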

SHA1 hashing FPGA performance

I'm trying to understand how well FPGAs can do SHA1 hashing.
For reference, SHA1 involves doing a series of 32-bit integer computations, arranged in 80 "steps"; here are 4 representative steps from the middle of the algorithm, in C:
x0 = rol(x13 ^ x8 ^ x2 ^ x0, 1);
e += rol(a,5) + (b^c^d) + x0 + 0x6ED9EBA1L;
b = rol(b,30);
x1 = rol(x14 ^ x9 ^ x3 ^ x1, 1);
c += rol(d,5) + (e^a^b) + x1 + 0x6ED9EBA1L;
e = rol(e,30);
x2 = rol(x13 ^ x10 ^ x4 ^ x2, 1);
b += rol(c,5) + (d^e^a) + x2 + 0x6ED9EBA1L;
d = rol(d,30);
x3 = rol(x13 ^ x11 ^ x5 ^ x3, 1);
a += rol(b,5) + (c^d^e) + x3 + 0x6ED9EBA1L;
c = rol(c,30);
There are a total of 21 internal 32-bit variables, and the algorithm keeps feeding them into each other. 'rol' is rotate left (bits shifted out of one end come back in at the other).
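For reference, a minimal sketch of the rol() helper assumed above (my own definition, not from the question; valid for rotate counts 1 through 31):

#include <stdint.h>
static inline uint32_t rol(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}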
Now, it would seem to me that computing x13 ^ x11 ^ x5 ^ x3 takes 32 LUTs, c^d^e takes another 32 LUTs, and I'm not clear on how to calculate the resources needed by the additions, but I'm guessing either 96 or 128 LUTs. Rotations and assignments are done through interconnects. So, let's say 192 LUTs total, times 80 steps, plus some overhead. Fully unrolled, I'd expect ~16,000 LUTs, with throughput of 1 input block per clock cycle and latency of 80-ish clock cycles.
A Xilinx Artix-7 XC7A50T contains 8,150 slices with 4 LUTs each (about 32,600 LUTs, enough for two unrolled cores), so I'd have throughput of 2 blocks per clock cycle, or 600 Mhash/s at 300 MHz (300 Gbps, since each block is 512 bits). Is that a reasonable estimate or am I way off?
I've not been able to find any references to fully unrolled SHA1 implementations, but these guys https://www.heliontech.com/fast_hash.htm claim a "very high performance" implementation with 828 LUTs and throughput of 1 block per 82 clock cycles, so closer to 70 Gbps on an XC7A50T. Is this figure so much lower simply because they are not unrolled?
Now, it would seem to me that computing x13 ^ x11 ^ x5 ^ x3 takes 32 LUTs, c^d^e takes another 32 LUTs, and I'm not clear on how to calculate the resources needed by the additions, but I'm guessing either 96 or 128 LUTs.
That would all be true if the XORs and addition were all independent -- but that isn't the case. Each LUT on a 7-series FPGA can take up to 6 inputs, so the synthesizer may be able to absorb some of the XORs into the addition chain.
That all being said, routing and layout will be your largest obstacle. To make use of the carry chain, all of the bits in a wide adder have to be laid out "vertically". This causes the pipeline to naturally flow from left to right, but I doubt the XC7A50T has enough columns to fit the entire pipeline in a single row. Routing resources will be the limiting factor, not LUTs.
Okay, I can answer my own question now. I've managed to put together a working SHA1 implementation in Verilog.
https://github.com/ekuznetsov139/fpga
This is actually a WPA2 PMK generator rather than just SHA1 (SHA1 executed in a loop 8192 times on the same data.)
I would not claim it to be perfectly optimized or even particularly well coded - I've learned all I know about Verilog in the last two weeks, in between other projects, and half of that time was spent on getting the data marshalled to and from multiple instances of the core over PCI-Express. But I got it working correctly in a simulator and had successful runs on an actual FPGA, and the performance figures are close to my original projections. With a Cyclone V target, I consistently see about 7,000 ALMs per core, with each core capable of doing one hash per clock tick. One ALM is essentially 2 LUTs (either 1 large or 2 small) plus some carry-adder hardware. So, 14,000 LUTs. Fmax seems to be around 300 MHz for fast silicon and closer to 150 MHz for slow silicon.
One thing I did not account for in my initial estimates is the need for lots of memory for the internal state. 21 32-bit variables times 80 steps is 53,760 bits, and, with 4 registers per ALM, that alone would require more resources than all the computations. But the compiler is able to pack most of that into memory cells, even if I don't instruct it to do so explicitly.
Routing/layout is a fairly big problem, though. I have a chip with 113K ALM (301K LE). The most I've been able to fit into it is 5 copies. That's less than 40% utilization. And that took ~8 hours of fitting. Going to try to mess with LogicLock to see if I can do better.
With 5 copies running at once at 300 MHz, the throughput would be 1.5 Ghash/s SHA1 or 90 Khash/s WPA2. Which is somewhat less than I hoped for (about 1/3rd of the throughput of a GeForce 980 Ti). But at least the energy efficiency is a lot better.
EDIT: One look at the Design Partition Planner in the standard edition of Quartus revealed the problem. The compiler, too smart for its own good, was merging internal storage arrays of each core, thus creating tons of unnecessary interconnects between cores.
Even without full LogicLock, just with "Allow shift register merging across hierarchies" set to "off", I have a successful fit with 10 copies. Let's see if I can do 12...

Intel Intrinsics guide - Latency and Throughput

Can somebody explain the Latency and the Throughput values given in the Intel Intrinsic Guide?
Have I understood it correctly that the latency is the number of time units an instruction takes to run, and the throughput is the number of instructions that can be started per time unit?
If my definition is correct, why is the latency for some instructions higher on newer CPU versions (e.g. mulps)?
Missing from that table: MULPS latency on Broadwell: 3. On Skylake: 4.
The intrinsic finder's latency is accurate in this case, although it occasionally doesn't match Agner Fog's experimental testing. (That VEXTRACTF128 latency may be a case of Intel not including a bypass delay in their table). See my answer on that linked question for more details about what to do with throughput and latency numbers, and what they mean for a modern out-of-order CPU.
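As a rough illustration of what those numbers mean (my own example, not from the linked answer): with MULPS at 4c latency and 0.5c reciprocal throughput on Skylake, a chain of N dependent multiplies takes about 4*N cycles, while N independent multiplies can issue at 2 per cycle and finish in about N/2 + 4 cycles.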
MULPS latency did increase from 4 (Nehalem) to 5 (Sandybridge). This may have been to save power or transistors, but more likely because SandyBridge standardized uop latencies to only a few different values, to avoid writeback conflicts: i.e. when the same execution unit would produce two results in the same cycle, e.g. from starting a 2c uop one cycle, then a 1c uop the next cycle.
This simplifies the uop scheduler, which dispatches uops from the Reservation Station to the execution units. More or less in oldest-first order, but it has to filter by which ones have their inputs ready. The scheduler is power-hungry, and this is a significant part of the power cost of out-of-order execution. (It's unfortunately not practical to make a scheduler that picks uops in critical-path-first order, to avoid having independent uops steal cycles from the critical path with resource conflicts.)
Agner Fog explains the same thing (in the SnB section of his microarch pdf):
Mixing μops with different latencies
Previous processors have a write-back conflict when μops with
different latencies are issued to the same execution port, as
described on page 114. This problem is largely solved on the Sandy
Bridge. Execution latencies are standardized so that all μops with a
latency of 3 are issued to port 1 and all μops with a latency of 5 go
to port 0. μops with a latency of 1 can go to port 0, 1 or 5. No other
latencies are allowed, except for division and square root.
The standardization of latencies has the advantage that write-back
conflicts are avoided. The disadvantage is that some μops have higher
latencies than necessary.
Hmm, I just realized that Agner's numbers for VEXTRACTF128 xmm, ymm, imm8 are weird. Agner lists it as 1 uop 2c latency on SnB, but Intel lists it as 1c latency (as discussed here). Maybe the execution unit is 1c latency, but there's a built-in 1c bypass delay (for lane-crossing?) before you can use the result. That would explain the discrepancy between Intel's numbers and Agner's experimental test.
Some instructions are still 2c latency, because they decode to 2 dependent uops that are each 1c latency. MULPS is a single uop, even the AVX 256b version, because even Intel's first-gen AVX CPUs have full-width 256b execution units (except the divide/sqrt unit). Needing twice as many copies of the FP multiplier circuitry is a good reason for optimizing it to save transistors at the cost of latency.
This pattern holds up to and including Broadwell, AFAICT from searching Agner's tables. (Using LibreOffice, I selected the whole table, did Data -> Filter -> Standard Filter, and looked for rows with column C = 1 and column F = 4, then repeated for 2, looking for any uops that aren't loads or stores.)
Haswell sticks to the pattern of only 1, 3 and 5 cycle ALU uop latencies (except for AESENC/AESDEC, which is 1 uop for port 5 with 7c latency. And of course DIVPS and SQRTPS). There's also CVTPI2PS xmm, mm, at 1 uop 4c latency, but maybe that's 3c for the p1 uop plus 1c of bypass delay, either in the way Agner Fog measured it or unavoidably. VMOVMSKPS r32, ymm is also 2c (vs. 3c for the r32,xmm version).
Broadwell dropped MULPS latency to 3, same as ADDPS, but kept FMA at 5c. Presumably they figured out how to shortcut the FMA unit to produce just a multiply when no add was needed.
Skylake is able to handle uops with latency=4. Latency for FMA, ADDPS/D, and MULPS/D = 4 cycles. (SKL drops the dedicated vector-FP add unit, and does everything with the FMA unit. So ADDPS/D throughput is doubled to match MULPS/D and FMA...PS/D. I'm not sure which change motivated what, and whether they would have introduced 4c latency instructions at all if they hadn't wanted to drop the vec-FP adder without hurting ADDPS latency too badly.)
Other SKL instructions with 4c latency: PHMINPOSUW (down from 5c), AESDEC/AESENC, CVTDQ2PS (up from 3c, but this might be 3c + bypass), RCPPS (down from 5c), RSQRTPS, CMPPS/D (up from 3c). Hmm, I guess FP compares were done in the adder, and now have to use FMA.
MOVD r32, xmm and MOVD xmm, r32 are listed as 2c, perhaps a bypass delay from int-vec to int? Or a glitch in Agner's testing? Testing the latency would require other instructions to create a round-trip back to xmm. It's 1c on HSW. Agner lists SKL MOVQ r64, xmm as 2 cycles (port0), but MOVQ xmm, r64 as 1c (port5), and it seems extremely weird that reading a 64-bit register is faster than reading a 32-bit register. Agner has had mistakes in his table in the past; this may be another.

Little Endian Cabling?

I have a doubt about the concept of endianness. Please don't refer me to Wikipedia, I've already read it.
Isn't endianness just the two ways that the hardware cabling (between memory and registers, through the data bus) can be implemented in a system?
In my understanding, the picture below is a little-endian implementation (follow the horizontal line from a memory address (e.g. 4000) and then the vertical line to reach the low/high part of the register).
As you can see, low memory addresses have been physically connected to the low part of the 4-byte register.
I think it is not related at all to READ and WRITE instructions in any language (e.g. LDR in ARM).
1-byte memory address:
- 4000 value:XX ------------------|
- 4001 value:XX ---------------| |
- 4002 value:XX ------------| | |
- 4003 value:XX ---------| | | |
| | | |
general-purpose register:XX XX XX XX
Yes and no. (I can't see your diagram, but I think I understand what you're asking). The way data lines are physically connected in the hardware can determine/control whether the representation in memory is treated as big or little endian. However, there is more to it than this; little endian is a means of representation, so for instance data stored on magnetic storage (in a file) might be coded using little endian representation or big endian representation and obviously at this level the hardware is not important.
Furthermore, some 8 bit microcontrollers can perform 16 bit operations, which are performed at the hardware level using two separate memory accesses. They can therefore use either little or big endian representation independent of bus design and ALU connection.
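As a minimal illustration of that representation point (my own example, not from the answer): the same 32-bit value inspected byte by byte.

#include <stdio.h>
#include <stdint.h>
int main(void) {
    uint32_t v = 0x11223344;
    const unsigned char *p = (const unsigned char *)&v;
    /* little endian prints 44 33 22 11, big endian prints 11 22 33 44 */
    printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
    return 0;
}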

Resources