SHA1 hashing FPGA performance - fpga

I'm trying to understand how well FPGAs can do SHA1 hashing.
For reference, SHA1 involves doing a series of 32-bit integer computations, arranged in 80 "steps"; here are 4 representative steps from the middle of the algorithm, in C:
x0 = rol(x13 ^ x8 ^ x2 ^ x0, 1);
e += rol(a,5) + (b^c^d) + x0 + 0x6ED9EBA1L;
b = rol(b,30);
x1 = rol(x14 ^ x9 ^ x3 ^ x1, 1);
c += rol(d,5) + (e^a^b) + x1 + 0x6ED9EBA1L;
e = rol(e,30);
x2 = rol(x13 ^ x10 ^ x4 ^ x2, 1);
b += rol(c,5) + (d^e^a) + x2 + 0x6ED9EBA1L;
d = rol(d,30);
x3 = rol(x13 ^ x11 ^ x5 ^ x3, 1);
a += rol(b,5) + (c^d^e) + x3 + 0x6ED9EBA1L;
c = rol(c,30);
There are 21 internal 32-bit variables in total, and the algorithm keeps feeding them into each other. 'rol' is a left rotation (bits shifted out of one end come back in at the other).
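In C, rol is just this (a minimal sketch, assuming 32-bit unsigned words and rotate counts strictly between 0 and 32, which covers the 1, 5 and 30 used here):

#include <stdint.h>

/* rotate-left: bits shifted out of the top re-enter at the bottom */
static inline uint32_t rol(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}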
Now, it would seem to me that computing x13 ^ x11 ^ x5 ^ x3 takes 32 LUTs, c^d^e takes another 32 LUTs, and I'm not clear on how to calculate the resources needed by the additions, but I'm guessing either 96 or 128 LUTs. Rotations and assignments are done through interconnects. So, let's say 192 LUTs total, times 80 steps, plus some overhead. Fully unrolled, I'd expect ~16,000 LUTs, with throughput of 1 input block per clock cycle and latency of 80-ish clock cycles.
A Xilinx Artix-7 XC7A50T contains 8,150 slices with 4 LUTs each, so I'd expect a throughput of 2 blocks per clock cycle, or 600 Mhash/s at 300 MHz (roughly 300 Gbps, since each block is 512 bits). Is that a reasonable estimate, or am I way off?
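Spelling out that arithmetic: 8,150 slices × 4 LUTs = 32,600 LUTs, enough for roughly two ~16,000-LUT unrolled cores; 2 blocks/clock × 300 MHz = 600 Mhash/s; and 600 × 10^6 blocks/s × 512 bits/block ≈ 307 Gbps.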
I've not been able to find any references to fully unrolled SHA1 implementations, but Helion (https://www.heliontech.com/fast_hash.htm) claims a "very high performance" implementation with 828 LUTs and throughput of 1 block per 82 clock cycles, so closer to 70 Gbps on an XC7A50T. Is their figure so much lower simply because the implementation is not unrolled?

Now, it would seem to me that computing x13 ^ x11 ^ x5 ^ x3 takes 32 LUTs, c^d^e takes another 32 LUTs, and I'm not clear on how to calculate the resources needed by the additions, but I'm guessing either 96 or 128 LUTs.
That would all be true if the XORs and addition were all independent -- but that isn't the case. Each LUT on a 7-series FPGA can take up to 6 inputs, so the synthesizer may be able to absorb some of the XORs into the addition chain.
That all being said, routing and layout will be your largest obstacle. To make use of the carry chain, all of the bits in a wide adder have to be laid out "vertically". This causes the pipeline to naturally flow from left to right, but I doubt the XC7A50T has enough columns to fit the entire pipeline in a single row. Routing resources will be the limiting factor, not LUTs.

Okay, I can answer my own question now. I've managed to put together a working SHA1 implementation in Verilog.
https://github.com/ekuznetsov139/fpga
This is actually a WPA2 PMK generator rather than just SHA1 (SHA1 executed in a loop 8192 times on the same data.)
I would not claim it to be perfectly optimized or even particularly well coded - I've learned all I know about Verilog in the last two weeks, in between other projects, and half of that time was spent on getting the data marshalled to and from multiple instances of the core over PCI-Express. But I got it working correctly in a simulator and had successful runs on an actual FPGA, and the performance figures are close to my original projections. With a Cyclone V target, I consistently see about 7,000 ALMs per core, with each core capable of doing one hash per clock tick. One ALM is essentially 2 LUTs (either 1 large or 2 small) plus some carry-adder hardware. So, 14,000 LUTs. Fmax seems to be around 300 MHz for fast silicon and closer to 150 MHz for slow silicon.
One thing I did not account for in my initial estimates is the need for a lot of memory for the internal state. 21 32-bit variables times 80 steps is 53,760 bits, and, with 4 registers per ALM, that alone would require more resources than all the computation. But the compiler is able to pack most of that into memory cells, even if I don't instruct it to do so explicitly.
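(For scale: 53,760 bits / 4 registers per ALM = 13,440 ALMs if the state were kept entirely in ALM registers, roughly double the ~7,000 ALMs a core needs for its logic, which is why packing it into memory blocks matters.)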
Routing/layout is a fairly big problem, though. I have a chip with 113K ALM (301K LE). The most I've been able to fit into it is 5 copies. That's less than 40% utilization. And that took ~8 hours of fitting. Going to try to mess with LogicLock to see if I can do better.
With 5 copies running at once at 300 MHz, the throughput would be 1.5 Ghash/s SHA1 or 90 Khash/s WPA2. Which is somewhat less than I hoped for (about 1/3rd of the throughput of a GeForce 980 Ti). But at least the energy efficiency is a lot better.
EDIT: One look at the Design Partition Planner in the standard edition of Quartus revealed the problem. The compiler, too smart for its own good, was merging internal storage arrays of each core, thus creating tons of unnecessary interconnects between cores.
Even without full LogicLock, just with "Allow shift register merging across hierarchies" set to "off", I have a successful fit with 10 copies. Let's see if I can do 12...

Related

Latency bounds and throughput bounds for processors for operations that must occur in sequence

My textbook (Computer Systems: A programmer's perspective) states that a latency bound is encountered when a series of operations must be performed in strict sequence, while a throughput bound characterizes the raw computing capacity of the processor's functional units.
Questions 5.5 and 5.6 of the textbook introduce these two possible loop structures for polynomial computation
double result = a[0];
double xpwr = x;
for (int i = 1; i <= degree; i++) {
    result += a[i] * xpwr;
    xpwr = x * xpwr;
}
and
double result = a[degree];
double xpwr = x;
for (int i = degree - 1; i >= 0; i--) {
    result = a[i] + x * result;
}
The loops are assumed to be executed on a microarchitecture with the following execution units:
One floating-point adder. It has a latency of 3 cycles and is fully pipelined.
Two floating-point multipliers. The latency of each is 5 cycles and both are fully pipelined.
Four integer ALUs, each with a latency of one cycle.
The latency bounds for floating point multiplication and addition given for this problem are 5.0 and 3.0 respectively. According to the answer key, the overall loop latency for the first loop is 5.0 cycles per element and the second is 8.0 cycles per element. I don't understand why the first loop isn't also 8.0.
It seems as though a[i] must be multiplied by xpwr before adding a[i] to this product to produce the next value of result. Could somebody please explain this to me?
Terminology: you can say a loop is "bound on latency", but when analyzing that bottleneck I wouldn't say "the latency bound" or "bounds". That sounds wrong to me. The thing you're measuring (or calculating via static performance analysis) is the latency or length of the critical path, or the length of the loop-carried dependency chain. (The critical path is the latency chain that's longest, and is the one responsible for the CPU stalling if it's longer than out-of-order exec can hide.)
The key point is that out-of-order execution only cares about true dependencies, and allows operations to execute in parallel otherwise. The CPU can start a new multiply and a new add every cycle. (Assuming from the latency numbers that it's Intel Sandybridge or Haswell or similar, as in footnote 1 below; i.e. assume the FPU is fully pipelined.)
The two loop-carried dependency chains in the first loop are xpwr *= x and result += stuff, each reading from their own previous iteration, but not coupled to each other in a way that would add their latencies. They can overlap.
result += a[i] * xpwr has 3 inputs:
result from the previous iteration.
a[i] is assumed to be ready as early as you want it.
xpwr is from the previous iteration. And more importantly, that previous iteration could start computing xpwr right away, not waiting for the previous result.
So you have 2 dependency chains, one reading from the other. The addition dep chain has lower latency per step so it just ends up waiting for the multiplication dep chain.
The a[i] * xpwr "forks off" from the xpwr dep chain, independent of it after reading its input. Each computation of that expression is independent of the previous computation of it. It depends on a later xpwr, so the start of each a[i] * xpwr is limited only by the xpwr dependency chain having that value ready.
The loads and integer loop overhead (getting load addresses ready) can be executed far ahead by out-of-order execution.
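To make the two chains concrete, here is the first loop rewritten purely for illustration (the temporary prod and the function wrapper are mine), with each chain on its own line:

/* The first loop, with the two loop-carried chains made visible:
     xpwr   -> xpwr    one multiply per iteration  (~5-cycle latency chain)
     result -> result  one add per iteration       (~3-cycle latency chain)
   a[i]*xpwr forks off the xpwr chain and is independent work each iteration. */
double poly_naive(const double *a, int degree, double x)
{
    double result = a[0];
    double xpwr = x;
    for (int i = 1; i <= degree; i++) {
        double prod = a[i] * xpwr;  /* not loop-carried */
        result += prod;             /* add chain */
        xpwr = x * xpwr;            /* multiply chain */
    }
    return result;
}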
Graph of the dependency pattern across iterations
mulsd (x86-64 multiply Scalar Double) is for the xpwr updates, addsd for the result updates. The a[i] * xpwr; multiplication is not shown because it's independent work each iteration. It skews the adds later by a fixed amount, but we're assuming there's enough FP throughput to get that done without resource conflicts for the critical path.
mulsd addsd # first iteration result += stuff
| \ | # first iteration xpwr *= x can start at the same time
v \ v
mulsd addsd
| \ |
v \ v
mulsd addsd
| \ |
v \ v
mulsd addsd
(Last xpwr mulsd result is unused, compiler could peel the final iteration and optimize it away.)
Second loop, Horner's method - fast with FMA, else 8 cycles
result = a[i] + x * result; is fewer math operations, but there we do have a loop-carried critical path of 8 cycles. The next mulsd can't start until the addsd is also done. This is bad for long chains (high-degree polynomials), although out-of-order exec can often hide the latency for small degrees, like 5 or 6 coefficients.
This really shines when you have FMA available: each iteration becomes one Fused Multiply-Add. On real Haswell CPUs, FMA has the same 5-cycle latency as an FP multiply, so the loop runs at one iteration per 5 cycles, with less extra latency at the tail to get the final result.
Real world high-performance code often uses this strategy for short polynomials when optimizing for machines with FMA, for high throughput evaluating the same polynomial for many different inputs. (So the instruction-level parallelism is across separate evaluations of the polynomial, rather than trying to create any within one by spending more operations.)
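As a rough sketch (the function name and the use of C's fma() are mine; whether this becomes a hardware vfmadd depends on the target, e.g. compiling with -mfma), Horner's rule with FMA looks like:

#include <math.h>

/* Each step is one fused multiply-add, so the loop-carried chain costs one
   FMA latency (5 cycles on Haswell) per coefficient. */
double poly_horner_fma(const double *c, int degree, double x)
{
    double r = c[degree];
    for (int i = degree - 1; i >= 0; i--)
        r = fma(r, x, c[i]);  /* r = r*x + c[i] */
    return r;
}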
Footnote 1: similarity to real hardware
Having two FP multipliers with those latencies matches Haswell with SSE2/AVX math, although in actual Haswell the FP adder is on the same port as a multiplier, so you can't start all 3 operations in one cycle. FP execution units share ports with the 4 integer ALUs, too, but Sandybridge/Haswell's front-end is only 4 uops wide so it's typically not much of a bottleneck.
See David Kanter's deep dive on Haswell (with nice diagrams), https://agner.org/optimize/, and other performance resources in the x86 tag wiki.
On Broadwell, the next generation after Haswell, FP mul latency improved to 3 cycles. Still 2/clock throughput, with FP add/sub still being 3c and FMA 5c. So the loop with more operations would be faster there, even compared to an FMA implementation of Horner's method.
On Skylake, all FP operations are the same 4-cycle latency, all running on the two FMA units with 2/clock throughput for FP add/sub as well. More recently, Alder Lake re-introduced lower latency FP addition (3 cycles vs. 4 for mul/fma but still keeping the 2/clock throughput), since real-world code often does stuff like naively summing an array, and strict FP semantics don't let compilers optimize it to multiple accumulators. So on recent x86, there's nothing to gain by avoiding FMA if you would still have a multiply dep chain, not just add.
Also related:
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - more general analysis needs to consider other possible bottlenecks: front-end uop throughput, and back-end contention for execution units. Dep chains, especially loop-carried ones, are the 3rd major possible bottleneck (other than stalls like cache misses and branch misses.)
How many CPU cycles are needed for each assembly instruction? - another basic intro to these concepts
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths - ability of out-of-order exec to overlap dep chains is limited when they're too long.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) - FP dep chains in parallel with unrolling when possible, e.g. a dot product.
In this case, for large-degree polynomials you could be doing stuff like x2 = x*x and x4 = x2*x2, and maybe generate x^(2n) and x^(2n+1) in parallel. As in Estrin's scheme, used in Agner Fog's vectorclass library for short polynomials. I found that when short polynomials are part of independent work across loop iterations (e.g. as part of log(arr[i])), straight Horner's rule was better for throughput, as out-of-order exec was able to hide the latency of a chain of 5 or 6 FMAs interleaved with other work.
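For illustration (names are mine), an Estrin-style evaluation of a degree-3 polynomial splits the work into independent FMAs:

#include <math.h>

/* p(x) = c0 + c1*x + c2*x^2 + c3*x^3 = (c0 + c1*x) + x^2*(c2 + c3*x).
   The two inner FMAs are independent of each other, so they can run in
   parallel; only the final FMA depends on both. */
double poly3_estrin(const double c[4], double x)
{
    double x2 = x * x;              /* independent of the two FMAs below */
    double lo = fma(c[1], x, c[0]); /* c0 + c1*x */
    double hi = fma(c[3], x, c[2]); /* c2 + c3*x */
    return fma(hi, x2, lo);         /* lo + hi*x^2 */
}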
For 5.5, there are 3 parallel chains:
xpwr = x * xpwr; which has 5-cycle latency. Occurs in iteration #i.
a[i] * xpwr; which has 5-cycle latency, but isn't on the critical path of a loop-carried dependency. Occurs in iteration #i.
result + (the product from 2); which has 3-cycle latency. Occurs in iteration #i+1 but computes the result for iteration #i.
Update
Based on clarifications by @Peter:
A 'loop-carried' dependency means that iteration i depends on an earlier iteration (say, i-1). So we can read xpwr = x * xpwr; as xpwr(i) = x * xpwr(i-1); it therefore forms a chain (though we don't yet know whether it is the critical path).
a[i] * xpwr can be seen as a byproduct of step 1 ("forked off" from the xpwr chain), and it also takes 5 cycles.
Once step 2 has finished, result += ... starts for iteration i and takes 3 cycles. It also depends on its own previous value, so step 3 is a loop-carried dependency as well, and therefore another candidate for the critical path.
Since step 3's 3-cycle latency is less than step 1's 5 cycles, step 1 is the critical path.
If step 3 (hypothetically) took 10 cycles, then to my understanding step 3 would become the critical path.

Desired Compute-To-Memory-Ratio (OP/B) on GPU

I am trying to understand the architecture of GPUs and how we assess the performance of our programs on the GPU. I know that an application can be:
Compute-bound: performance limited by the FLOPS rate. The processor's cores are fully utilized (always have work to do).
Memory-bound: performance limited by the memory bandwidth. The processor's cores are frequently idle because memory cannot supply data fast enough.
The figure I'm referring to lists the FLOPS rate, peak memory bandwidth, and the desired compute-to-memory ratio, labeled (OP/B), for each microarchitecture.
I also have an example of how to compute this OP/B metric. Example: below is part of a CUDA kernel for matrix-matrix multiplication:
for(unsigned int i = 0; i < N; ++i) {
    sum += A[row*N + i]*B[i*N + col];
}
and the way to calculate OP/B for this matrix-matrix multiplication is as follows:
Matrix multiplication performs 0.25 OP/B
1 FP add and 1 FP mul for every 2 FP values (8B) loaded
Ignoring stores
and if we want to utilize this:
But matrix multiplication has high potential for reuse. For NxN matrices:
Data loaded: (2 input matrices)×(N^2 values)×(4 B) = 8N^2 B
Operations: (N^2 dot products)(N adds + N muls each) = 2N^3 OP
Potential compute-to-memory ratio: 0.25N OP/B
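To put numbers on that, take N = 1024 as an example: data loaded = 8 × 1024^2 B ≈ 8.4 MB, operations = 2 × 1024^3 ≈ 2.15 × 10^9 OP, so the potential ratio is 0.25 × 1024 = 256 OP/B.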
So, if I understand this correctly, I have the following questions:
Is it always the case that the greater the OP/B, the better?
How do we know how many FP operations we have? Is it just the adds and the multiplications?
How do we know how many bytes are loaded per FP operation?
Is it always the case that the greater the OP/B, the better?
Not always. The target value balances the load on compute pipe throughput and memory pipe throughput (i.e. that level of op/byte means that both pipes will be fully loaded). As you increase op/byte beyond that level, your code will switch from balanced to compute-bound. Once your code is compute bound, the performance will be dictated by the compute pipe that is the limiting factor. Additional op/byte increase beyond this point may have no effect on code performance.
How do we know how many FP operations we have? Is it just the adds and the multiplications?
Yes, for the simple code you have shown, it is the adds and multiplies. Other more complicated codes may have other factors (e.g. sin, cos, etc.) which may also contribute.
As an alternative to "manually counting" the FP operations, the GPU profilers can indicate the number of FP ops that a code has executed.
How do we know how many bytes are loaded per FP operation?
Similar to the previous question, for simple codes you can "manually count". For complex codes you may wish to try to use profiler capabilities to estimate. For the code you have shown:
sum += A[row*N + i]*B[i*N + col];
The values from A and B have to be loaded. If they are float quantities then they are 4 bytes each. That is a total of 8 bytes. That line of code will require 1 floating point multiplication (A * B) and one floating point add operation (sum +=). The compiler will fuse these into a single instruction (fused multiply-add) but the net effect is you are performing two floating point operations per 8 bytes. op/byte is 2/8 = 1/4. The loop does not change the ratio in this case. To increase this number, you would want to explore various optimization methods, such as a tiled shared-memory matrix multiply, or just use CUBLAS.
(Operations like row*N + i are integer arithmetic and don't contribute to the floating-point load, although it's possible they may be significant, performance-wise.)

What are the relative cycle times for the 6 basic arithmetic operations?

When I try to optimize my code, for a very long time I've just been using a rule of thumb that addition and subtraction are worth 1, multiplication and division are worth 3, squaring is worth 3 (I rarely use the more general pow function so I have no rule of thumb for it), and square roots are worth 10. (And I assume squaring a number is just a multiplication, so worth 3.)
Here's an example from a 2D orbital simulation. To calculate and apply acceleration from gravity, first I get distance from the ship to the center of earth, then calculate the acceleration.
D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D); // this is worth 9, total is 28
However, notice that in calculating D, you take a square root, but when using it in the next calculation, you square it. Therefore you can just do this:
A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15
So if my rule of thumb is right, I've almost cut the cycle time in half.
However, I can't even remember where I heard that rule. I'd like to ask: what are the actual cycle times for these basic arithmetic operations?
Assumptions:
everything is a 64-bit floating-point number on the x64 architecture.
everything is already loaded into registers, so no worrying about hits and misses from caches or memory.
no interrupts to the CPU
no if/branching logic such as look ahead prediction
Edit: I suppose what I'm really trying to do is look inside the ALU and only count the cycle time of its logic for the 6 operations. If there is still variance within that, please explain what and why.
Note: I did not see any tags for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine code operations in x64 architecture. Thus it doesn't matter whether those lines of code I wrote are in C#, C, Javascript, whatever. I'm sure each high-level language will have its own varying times so I don't wanna get into an argument over that. I think it's a shame that there's no machine code tag because when talking about performance and/or operation, you really need to get down into it.
At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.
Latency
The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
Throughput
The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, 1 integer multiplication can execute, on average per cycle, even though any particular multiplication takes 3 cycles to complete (meaning that you must have multiple independent multiplications in progress at once to achieve this).
Inverse Throughput
When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to compare directly with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that if you have sufficient independent additions, they cost only something like 0.25 cycles each.
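For example, with those illustrative numbers: a chain of 100 dependent additions takes about 100 × 1 = 100 cycles, because each one must wait for the previous result, while 100 independent additions take only about 100 × 0.25 = 25 cycles plus a short pipeline tail.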
Below I'll use inverse throughput.
Variable Timings
Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.
The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.
Passing denormal inputs, or results that themselves are denormal may incur significant additional cost depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you might be able to find a discussion of denormal costs per operation elsewhere.
More Details
The above is necessary but often not sufficient information to fully qualify performance, since you have other factors to consider such as execution port contention, front-end bottlenecks, and so on. It's enough to start though and you are only asking for "rule of thumb" numbers if I understand it correctly.
Agner Fog
My recommended source for measured latency and inverse throughput numbers is Agner Fog's set of guides. You want the files under 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, which lists fairly exhaustive timings on a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.
Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.
Intel Skylake
                         Lat      Inv Tpt
add/sub (addsd, subsd)    4        0.5
multiply (mulsd)          4        0.5
divide (divsd)          13-14       4
sqrt (sqrtpd)           15-16      4-6
So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.
AMD Ryzen
                         Lat      Inv Tpt
add/sub (addsd, subsd)    3        0.5
multiply (mulsd)          4        0.5
divide (divsd)           8-13      4-5
sqrt (sqrtpd)           14-15      4-8
The Ryzen numbers are broadly similar to recent Intel: addition and subtraction are slightly lower latency, multiplication is the same. Latency-wise, the rule of thumb can still be summarized roughly as 1 : 3 : 4 for add/sub/mul : div : sqrt, with some loss of precision.
Here, the latency range for divide is fairly large, so I expect it is data dependent.
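To illustrate the "8 parallel chains" point above, here is a sketch (the function name and unroll factor are mine) of summing an array with eight independent accumulators, enough to cover a 4-cycle add latency at 0.5 cycles inverse throughput:

#include <stddef.h>

/* Eight independent add chains (4 / 0.5 = 8). A single accumulator would be
   latency-bound at one add per 4 cycles instead. */
double sum8(const double *a, size_t n)
{
    double s[8] = {0};
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; ++k)
            s[k] += a[i + k];       /* eight independent add chains */
    for (; i < n; ++i)
        s[0] += a[i];               /* leftover elements */
    return ((s[0] + s[1]) + (s[2] + s[3])) + ((s[4] + s[5]) + (s[6] + s[7]));
}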

floating point operations per cycle - intel

I have been looking for quite a while and cannot seem to find an official/conclusive figure quoting the number of single precision floating point operations/clock cycle that an Intel Xeon quadcore can complete. I have an Intel Xeon quadcore E5530 CPU.
I'm hoping to use it to calculate the maximum theoretical FLOP/s my CPU can achieve.
MAX FLOPS = (# Number of cores) * (Clock Frequency (cycles/sec) ) * (# FLOPS / cycle)
Anything pointing me in the right direction would be useful. I have found this
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
Intel Core 2 and Nehalem:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
But I'm not sure where these figures were found. Are they assuming a fused multiply add (FMAD) operation?
EDIT: Using this, in DP I calculate the correct DP arithmetic throughput cited by Intel as 38.4 GFLOP/s (cited here). For SP, I get double that, 76.8 GFLOP/s. I'm pretty sure 4 DP FLOP/cycle and 8 SP FLOP/cycle is correct, I just want confirmation of how they got the FLOPs/cycle value of 4 and 8.
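For reference, the E5530 runs at 2.40 GHz, so the arithmetic is: 4 cores × 2.4 GHz × 4 DP FLOP/cycle = 38.4 GFLOP/s, and 4 cores × 2.4 GHz × 8 SP FLOP/cycle = 76.8 GFLOP/s.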
Nehalem is capable of executing 4 DP or 8 SP FLOP/cycle. This is accomplished using SSE, which operates on packed floating-point values: 2 per register in DP and 4 per register in SP. In order to achieve 4 DP FLOP/cycle or 8 SP FLOP/cycle, the core has to execute 2 SSE instructions per cycle, by executing a MULPD and an ADDPD (or a MULPS and an ADDPS) each cycle. The reason this is possible is that Nehalem has separate execution units for SSE multiply and SSE add, and these units are pipelined so that the throughput is one multiply and one add per cycle.
Multiplies are in the multiplier pipeline for 4 cycles in SP and 5 cycles in DP. Adds are in the pipeline for 3 cycles independent of SP/DP. The number of cycles in the pipeline is known as the latency. To compute peak FLOP/cycle all you need to know is the throughput: with a throughput of 1 SSE vector instruction/cycle for both the multiplier and the adder (2 execution units), you have 2 x 2 = 4 FLOP/cycle in DP and 2 x 4 = 8 FLOP/cycle in SP.
To actually sustain this peak throughput you need to consider latency (you need at least as many independent operations in flight as the depth of the pipeline) and you need to be able to feed the data fast enough. Nehalem has an integrated memory controller capable of very high bandwidth from memory, which it can achieve if the data prefetcher correctly anticipates the access pattern (sequential loads are a trivial pattern that it can anticipate). Typically there isn't enough memory bandwidth to sustain feeding all cores with data at peak FLOP/cycle, so some reuse of data from the cache is necessary in order to sustain the peak.
Details on where you can find information on the number of independent execution units and their throughput and latency in cycles follows.
See section 8.9, "Execution units" (page 105 in the version cited), of Agner Fog's microarchitecture guide:
http://www.agner.org/optimize/microarchitecture.pdf
It says that for Nehalem
The floating point multiplier on port 0 has a latency of 4 for single precision and 5 for double and long double precision. The throughput of the floating point multiplier is 1 operation per clock cycle, except for long double precision on Core2. The floating point adder is connected to port 1. It has a latency of 3 and is fully pipelined.
In order to get 8 SP FLOP/cycle you need 4 SP ADD/cycle and 4 SP MUL/cycle. The adder and the multiplier are on separate execution units and dispatch out of separate ports; each can operate on 4 packed SP operands simultaneously using SSE packed (vector) instructions (4 × 32 bits = 128 bits). Both have a throughput of 1 operation per clock cycle. To sustain that throughput, you need to consider the latency: how many cycles pass after the instruction issues before you can use the result. You have to issue several independent instructions to cover the latency. The multiplier in single precision has a latency of 4 and the adder of 3.
You can find these same throughput and latency numbers for Nehalem in the Intel Optimization guide, table C-15a
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
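As a small sketch of the instruction mix being described (the function and data layout are mine, not from the guides), independent packed multiplies and adds with SSE intrinsics:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Each iteration issues one packed SP multiply and one packed SP add on
   independent data, which is what lets Nehalem's separate multiply and add
   units each retire one 4-wide operation per cycle (8 SP FLOP/cycle total). */
void mul_and_add(const float *a, const float *b, float *prod, float *sum, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(prod + i, _mm_mul_ps(va, vb)); /* 4 SP multiplies */
        _mm_storeu_ps(sum + i, _mm_add_ps(va, vb));  /* 4 SP adds */
    }
}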

GT200 Single Precision Peak Performance

I was trying to verify the single precision peak performance of a reference GT200 card.
From http://www.realworldtech.com/gt200/9/, we have two facts about GT200 –
The latency of the fastest operation for an SP core is 4 cycles.
The SFU takes 4 cycles too to finish an operation.
Now, each SM has a total of 8 SPs and 2 SFUs, with each SFU having 4 FP multiply units and these SPs and SFUs can work at the same time as they are on two different ports as explained in their SM level diagrams. Each SP can perform MAD operation.
So, we are looking at 8 MAD operations and 8 MUL operations per 4 SP cycles. This gives us 16 + 8 = 24 operations per 4 SP clock cycles as MAD counts as 2 operations. Since 2 SP clock cycle counts as one shader clock, we have 24/2 = 12 operations per shader clock.
For a reference GT200 card, the shader clock is 1296 MHz.
Thus, the single precision peak performance must be = 1296 MHz * 30 SM * 12 operations per shader clock = 466.56 GFLOPS
This is exactly half of the GFLOPS as reported in the specs. So where am I going wrong?
Edit: After Robert’s pointer to the CUDA Programming Guide that says 8 MADs/shader clock can be performed in a GT200 SM, I would have to question how latency and throughput relate to each other in this particular SM.
There is a latency of one OP / 4 SP cycles (as pointed out earlier), thus one MAD every 4 SP cycles, right? We have 8 SPs, so it becomes 8 MADs for every 4 SP cycles in an SM.
Since 2 SP cycles form one shader cycle, so we are left with => 8MADs per 2 shader clock cycles
=> 4 MADs per shader clock.
This doesn’t match the 8 MADs/shader clock from the Programming Guide.
So, what am I doing wrong again?
Latency and throughput are not the same thing.
A cc 1.x SM can retire 8 single precision floating point MAD operations on every clock cycle.
This is the correct formula:
1296 MHz(cycle/s) * 30 SM * (8 SP/SM * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)
= 622080 Mflop/s + 311040 Mflop/s = 933 GFlop/s single precision
From here
EDIT: The 4-cycle latency you're referring to is the latency of a warp (i.e. 32 threads) MAD instruction, as issued to the SM, not the latency of a single MAD operation on a single SP. The FPU in each SP can generate one MAD result per clock, and there are 8 SP's in one SM, so each SM can generate 8 MAD results per clock. Since a warp (32 threads) MAD instruction requires 32 MAD results, it requires 4 total clocks to complete the warp instruction, as issued to the SPs in the SM.
The FPU in the SP can generate one new MAD result per clock. From the standpoint of instruction issue, the fundamental unit is the warp. Therefore a warp MAD instruction requires 4 clocks to complete.
EDIT2: Responding to question below.
Preface: The FPUs in the SFU are not independently schedulable. They only come into play when an instruction is scheduled to the SFUs. There are 4 FPU per SFU, and an SFU instruction requires 16 cycles (since there are 2 SFU/SM) to complete for a warp. If all 4 FPU in both SFUs were fully utilized, that would be 128 (16x4x2) flops produced during the computation of the SFU instruction, in 16 cycles. This is added to the 256 (16x2x8) total flops that could be generated by the "regular" MAD FPUs in the SM during the same time (16 cycles).
Your question seems to be interpreting the observed benchmark result and this statement in the text:
Table III also shows that the throughput for single-precision floating point multiplication is 11.2 ops/clock, which means that multiplication can be issued to both the SP and SFU units. This suggests that each SFU unit is capable of doing 2 multiplications per cycle, twice the throughput of other (more complex) instructions that map to this unit.
as an indication of either the throughput of the FPUs in the SFU or else the number of FPUs in the SFU. However you are conflating benchmark data with a theoretical number. The SFU has 4 FPU, but this does not mean that all 4 are independently schedulable for arbitrary arithmetic or instruction streams. Seeing all 4 FPU take on a new floating point instruction in a given cycle may require a specific instruction sequence that the authors haven't used.
