Calculate Speed up - performance

This question is related Computer Organization and Architecture. i would truly appreciate some assistance
Processor A has a clock speed of 1 GHz, and takes 1 cycle for integer operations, 2 cycles for memory operations, and 4 cycles for floating point operations. Empirical data shows that programs run on Processor A typically are composed of 35% floating point operations, 30% memory operations; and 35% integer operations.
However you wish to design a new processor, Processor B, which is an improvement on Processor A. Processor B will run the same programs as Processor A. To complete your design, you are faced with two options for improving performance:
Increase the clock speed to 1.2 GHz, but memory operations take 3 cycles
Decrease the clock speed to 900 MHz, but floating point operations only take 3 cycles
Compute the speedup for both options and decide the option Processor B should take.
i have tried finding the Execution time for Processors A and B by using ICxCPIxCT. I used MIPS (Clock rate/CPIx10^6) as the IC in all cases. When i was finished with everything, the Processor B(1) was deemed the most efficient since it had the lowest CPU time(940.3x10^-9) but i'm not sure if my method was correct.

Related

Calculating Cycles Per Instruction

From what I understand, to calculate CPI, it's the percentage of the type of instruction multiplied by the number of cycles right? Does the type of machine have any part of this calculation whatsoever?
I have a problem that asks me if a change should be recommended.
Machine 1: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq 3 - Cycles, on a 2.5 GHz machine
Machine 2: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq 4 - Cycles, on a 2.7 GHz machine
By my calculations, machine 1 has 5.15 CPI while machine 2 has 5.3 CPI. Is it okay to ignore the GHz of the machine and say that the change would not be a good idea or do I have to factor the machine in?
I think the point is to evaluate a design change that makes an instruction take more clocks, but allows you to raise the clock frequency. (i.e. leaning towards a speed-demon design like Pentium 4, instead of brainiac like Apple's A7/A8 ARM cores. http://www.lighterra.com/papers/modernmicroprocessors/)
So you need to calculate instructions per second to see which one will get more work done in the same amount of real time. i.e. (clock/sec) / (clocks/insn) = insn/sec, cancelling out the clocks from the units.
Your CPI calculation looks ok; I didn't check it, but yes a weighted average of the cycles according to the instruction mix.
These numbers are obviously super simplified; any CPU worth building at 2.5GHz would have some kind of branch prediction so the cost of a branch isn't just a 3 or 4 instruction bubble. And taking ~5 cycles per instruction on average is pathetic. (Most pipelined designs aim for at least 1 instruction per clock.)
Caches and superscalar CPUs also lead to complex interactions between instructions depending on whether they depend on earlier results or not.
But this is sort of like what you might do if considering increasing the L1d cache load-use latency by 1 cycle (for example), if that took it off the critical path and let you raise the clock frequency. Or vice versa, tightening up the latency or reducing the number of pipeline stages on something at the cost of reducing frequency.
Cycles per instruction a count of cycles. ghz doesnt matter as far as that average goes. But saying that we can see from your numbers that one instruction is more clocks but the processors are a different speed.
So while it takes more cycles to do the same job on the faster processor the speed of the processor DOES compensate for that so it seems clear this is a question about does the processor speed account for the extra clock?
5.15 cycles/instruction / 2.5 (giga) cycles/second, cycles cancels out you get
2.06 seconds/(giga) instruction or (nano) seconds/ instruction
5.30 / 2.7 = 1.96296 (nano) seconds / instruction
The faster one takes a slightly less amount of time so it will run the program faster.
Another way to see this to check the math.
For 100 clock cycles on the slower machine 15% of those are beq. So 15 of the 100 clocks, which is 5 beq instructions. The same 5 beq instructions take 20 clocks on the faster machine so 105 clocks total for the same instructions on the faster machine.
100 cycles at 2.5ghz vs 105 at 2.7ghz
we want the amount of time
hz is cycles / second we want seconds on the top
so we want
cycles / (cycles/second) to have cycles cancel out and have seconds on the top
1/2.5 = 0.400 (400 picoseconds)
1/2.7 = 0.370
0.400 * 100 = 40.00 units of time
0.370 * 105 = 38.85 units of time
So despite taking 5 more cycles the processor speed differences is fast enough to compensate.
2.7/2.5 = 1.08
105/100 = 1.05
so 2.5 * 1.05 = 2.625 so a processor 2.625ghz or faster would run that program faster.
Now what were the rules for changing computers, is less time defined as a reason to change computers? What is the definition of better? How much more power does the faster one consume it might take less time but the power consumption might not be linear so it may take more watts despite taking less time. I assume the question is not that detailed, meaning it is vague meaning it is a poorly written question on its own, so it goes to what the textbook or lecture defined as the threshold for change to the other processor.
Disclaimer, dont blame me if you miss this question on your homework/test.
Outside an academic exercise like this, the real world is full of pipelined processors (not all but most of the folks writing programs are writing programs for) and basically you cant put a number on clock cycles per instruction type in a way that you can do this calculation because of a laundry list of factors. Make sore you understand that, nice exercise, but that specific exercise is difficult and dangerous to attempt on real world processors. Dangerous in that as hard as you work you may be incorrectly measuring something and jumping to the wrong conclusions and as a result making bad recommendations. At the same time there is very much the reality that faster ghz does improve some percentage of the execution, but another percentage suffers, and is there a net gain or loss. Or a new processor design faster or slower may have features that perform better than an older processor, but not all feature will be better, there is a tradeoff and then we get into what "better" means.

What is the definition of Floating Point Operations ( FLOPs )

I'm trying to optimize my code with SIMD ( on ARM CPUs ), and want to know its arithmetic intensity (flops/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating point operations(FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
What all operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "not mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
For a whole CPU, not just a single core: multiply by number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.)
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfma132ps instructions per clock, because FP abs is not a difficult operation.
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
Although people would think it's silly if theoretical peak flops is much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes. e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs. e.g. if Haswell had four FMA execution units, so it could only sustain max FLOPs if literally every instruction was an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practise to only measure FLOPs on the hotspots of your code, for example, the number of Floating Point Multiply & Accumulate operations in Convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in A4.13 in ARMv7 Reference Manual fall under a Floating Point Operation as a FLOPs/Cycle for an FPU instruction is typically constant in a processor.
Not just ARM, but many micro-processors have a dedicated Floating Point Unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous piece of code but doesn't say much about your code per se.

How can CPU's have FLOPS much higher than their clock speeds?

For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD, Intel 8700k and similar processors support AVX and AVX2, which includes many instructions that operate on registers that can hold 8 floats at the same time.
multiple cores, 8700k has 6 cores.
fused multiply-add, part of AVX2, has both a multiplication and addition in the same instruction.
high throughput execution. The latency (time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions so that represents a lot of floating point operations) even through the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean that it can actually be attained. For example the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed and it applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close though, and for example according to these stats the 8700k reached 496.7 Gflops in their SGEMM benchmark. Possibly the 8700k's max AVX2 turbo speed on 6 cores is 2.6GHz but as far as I can find it does not have an AVX offset by default (only needed when overclocked), or that GEMM is just not that close to hitting peak FLOPS.

Faster cpu wastes more time as compared to slower cpu

Suppose I have a program that has an instruction to add two numbers and that operation takes 10 nanoseconds(constant, as enforced by the gate manufactures).
Now I have 3 different processors A, B and C(where A< B < C in terms of clock cycles). A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-14=7 ns.
If the above assumptions are correct then isn't this a contradiction to the regular assumption that processors with high clock cycles are faster. Here processor C which is the fastest actually takes 2 cycles and wastes 7ns whereas, the slower processor A takes just 1 cycle.
Processor C is fastest, no matter what. It takes 7 ns per cycle and therefore performs more cycles than A and B. It's not C's fault that the circuit is not fast enough. If you would implement the addition circuit in a way that it gives result in 1 ns, all processors will give the answer in 1 clock cycle (i.e. C will give you the answer in 7ns, B in 10ns and A in 15ns).
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-7=13 ns.
No. It is because you are using incomplete data to express the time for an operation. Measure the time taken to finish an operation on a particular processor in clock cycles instead of nanoseconds as you are doing here. When you say ADD op takes 10 ns and you do not mention the processor on which you measured the time for the ADD op, the time measurement in ns is meaningless.
So when you say that ADD op takes 2 clock cycles on all three processors, then you have standardized the measurement. A standardized measurement can then be translated as:
Time taken by A for addition = 2 clock cycles * 15 ns per cycle = 30 ns
Time taken by B for addition = 2 clock cycles * 10 ns per cycle = 20 ns
Time taken by C for addition = 2 clock cycles * 07 ns per cycle = 14 ns
In case you haven't noticed, when you say:
A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
which of the three processors is fastest?
Answer: C is fastest. It's one cycle is finished in 7ns. It implies that it finishes 109/7 (~= 1.4 * 108) cycles in one second, compared to B which finishes 109/10 (= 108) cycles in one second, compared to A which finishes only 109/15 (~= 0.6 * 108) cycles in one second.
What does a ADD instruction mean, does it purely mean only and only ADD(with operands available at the registers) or does it mean getting
the operands, decoding the instruction and then actually adding the
numbers.
Getting the operands is done by MOV op. If you are trying to compare how fast ADD op is being done, it should be compared by time to perform ADD op only. If you, on the other hand want to find out how fast addition of two numbers is being done, then it will involve more operations than simple ADD. However, if it's helpful, the list of all Original 8086/8088 instructions is available on Wikipedia too.
Based on the above context to what add actually means, how many cycles does add take, one or more than one.
It will depend on the processor because each processor may have the adder differently implemented. There are many ways to generate addition of two numbers. Quoting Wikipedia again - A full adder can be implemented in many different ways such as with a custom transistor-level circuit or composed of other gates.
Also, there may be pipelining in the instructions which can result in parallelizing of the addition of the numbers resulting in huge time savings.
Why is clock cycle a standard since it can vary with processor to processor. Shouldn't nanosec be the standard. Atleast its fixed.
Clock cycle along with the processor speed can be the standard if you want to tell the time taken by a processor to execute an instruction. Pick any two from:
Time to execute an instruction,
Processor Speed, and
Clock cycles needed for an instruction.
The third can be derived from it.
When you say the clock cycles taken by ADD is x and you know the processor speed is y MHz, you can calculate that the time to ADD is x / y. Also, you can mention the time to perform ADD as z ns and you know the processor speed is same y MHz as earlier, you can calculate the cycles needed to execute ADD as y * z.
I'm no expert BUT I'd say ...
the regular assumption that processors with high clock cycles are faster FOR THE VAST MAJORITY OF OPERATIONS
For example, a more intelligent processor might perform an "overhead task" that takes X ns. The "overhead task" might make it faster for repetitive operations but might actually cause it to take longer for a one-off operation such as adding 2 numbers.
Now, if the same processor performed that same operation 1 million times, it should be massively faster than the slower less intelligent processor.
Hope my thinking helps. Your feedback on my thoughts welcome.
Why would a faster processor take more cycles to do the same operation than a slower one?
Even more important: modern processors use Instruction pipelining, thus executing multiple operations in one clock cycle.
Also, I don't understand what you mean by 'wasting 5ns', the frequency determines the clock speed, thus the time it takes to execute 1 clock. Of course, cpu's can have to wait on I/O for example, but that holds for all cpu's.
Another important aspect of modern cpu's are the L1, L2 and L3 caches and the architecture of those caches in multicore systems. For example: if a register access takes 1 time unit, a L1 cache access will take around 2 while a normal memory access will take between 50 and 100 (and a harddisk access would take thousands..).
This is actually almost correct, except that on processor B taking 2 cycles means 14ns, so with 10ns being enough the next cycle starts 4ns after the result was already "stable" (though it is likely that you need some extra time if you chop it up, to latch the partial result). It's not that much of a contradiction, setting your frequency "too high" can require trade-offs like that. An other thing you might do it use more a different circuit or domino logic to get the actual latency of addition down to one cycle again. More likely, you wouldn't set addition at 2 cycles to begin with. It doesn't work out so well in this case, at least not for addition. You could do it, and yes, basically you will have to "round up" the time a circuit takes to an integer number of cycles. You can also see this in bitwise operations, which take less time than addition but nevertheless take a whole cycle. On machine C you could probably still fit bitwise operations in a single cycle, for some workloads it might even be worth splitting addition like that.
FWIW, Netburst (Pentium 4) had staggered adders, which computed the lower half in one "half-cycle" and the upper half in the next (and the flags in the third half cycle, in some sense giving the whole addition a latency of 1.5). It's not completely out of this world, though Netburst was over all, fairly mad - it had to do a lot of weird things to get the frequency up that high. But those half-cycles aren't very half (it wasn't, AFAIK, logic that advanced on every flank, it just used a clock multiplier), you could also see them as the real cycles that are just very fast, with most of the rest of the logic (except that crazy ALU) running at half speed.
Your broad point that 'a CPU will occasionally waste clock cycles' is valid. But overall in the real world, part of what makes a good CPU a good CPU is how it alleviates this problem.
Modern CPUs consist of a number of different components, none of whose operations will end up taking a constant time in practice. For example, an ADD instruction might 'burst' at 1 instruction per clock cycle if the data is immediately available to it... which in turn means something like 'if the CPU subcomponents required to fetch that data were immediately available prior to the instruction'. So depending on if e.g. another subcomponent had to wait for a cache fetch, the ADD may in practice take 2 or 3 cycles, say. A good CPU will attempt to re-order the incoming stream of instructions to maximise the availability of subcomponents at the right time.
So you could well have the situation where a particular series of instructions is 'suboptimal' on one processor compared to another. And the overall performance of a processor is certainly not just about raw clock speed: it is as much about the clever logic that goes around taking a stream of incoming instructions and working out which parts of which instructions to fire off to which subcomponents of the chip when.
But... I would posit that any modern chip contains such logic. Both a 2GHz and a 3GHz processor will regularly "waste" clock cycles because (to put it simply) a "fast" instruction executed on one subcomponent of the CPU has to wait for the result of the output from another "slower" subcomponent. But overall, you will still expect the 3GHz processor to "execute real code faster".
First, if the 10ns time to perform the addition does not include the pipeline overhead (clock skew and latch delay), then Processor B cannot complete an addition (with these overheads) in one 10ns clock cycle, but Processor A can and Processor C can still probably do it in two cycles.
Second, if the addition itself is pipelined (or other functional units are available), then a subsequent non-dependent operation can begin executing in the next cycle. (If the addition was width-pipelined/staggered (as mentioned in harold's answer) then even dependent additions, logical operations and left shifts could be started after only one cycle. However, if the exercise is constraining addition timing, it presumably also prohibits other optimizations to simplify the exercise.) If dependent operations are not especially common, then the faster clock of Processor C will result in higher performance. (E.g., if a dependence stall occurred every fourth cycle, then, ignoring other effects, Processor C can complete four instructions every five 7ns cycles (35 ns; the first three instruction overlap in execution) compared to 40ns for Processor B (assuming the add timing included pipelining overhead).) (Note: Your assumption 3 is incorrect, two cycles for Processor C would be 14ns.)
Third, the extra time in a clock cycle can be used to support more complex operations (e.g., preshifting one operand by a small immediate value and even adding three numbers — a carry-save adder has relatively little delay), to steal work from other pipeline stages (potentially reducing the number of pipeline stages, which generally reduces branch misprediction penalties), or to reduce area or power by using simpler logic. In addition, the extra time might be used to support a larger (or more associative) cache with fixed latency in cycles, reducing miss rates. Such factors can compensate for the "waste" of 5ns in Processor A.
Even for scalar (single issue per cycle) pipelines clock speed is not the single determinant of performance. Design choices become even more complex when power, manufacturing cost (related to yield, adjusted according to sellable bins, and area), time-to-market (and its variability/predictability), workload diversity, and more advanced architectural and microarchitectural techniques are considered.
The incorrect assumption that clock frequency determines performance even has a name: the Megahertz myth.

floating point operations per cycle - intel

I have been looking for quite a while and cannot seem to find an official/conclusive figure quoting the number of single precision floating point operations/clock cycle that an Intel Xeon quadcore can complete. I have an Intel Xeon quadcore E5530 CPU.
I'm hoping to use it to calculate the maximum theoretical FLOP/s my CPU can achieve.
MAX FLOPS = (# Number of cores) * (Clock Frequency (cycles/sec) ) * (# FLOPS / cycle)
Anything pointing me in the right direction would be useful. I have found this
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
Intel Core 2 and Nehalem:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
But I'm not sure where these figures were found. Are they assuming a fused multiply add (FMAD) operation?
EDIT: Using this, in DP I calculate the correct DP arithmetic throughput cited by Intel as 38.4 GFLOP/s (cited here). For SP, I get double that, 76.8 GFLOP/s. I'm pretty sure 4 DP FLOP/cycle and 8 SP FLOP/cycle is correct, I just want confirmation of how they got the FLOPs/cycle value of 4 and 8.
Nehalem is capable of executing 4 DP or 8 SP FLOP/cycle. This is accomplished using SSE, which operates on packed floating point values, 2/register in DP and 4/register in SP. In order to achieve 4 DP FLOP/cycle or 8 SP FLOP/cycle the core has to execute 2 SSE instructions per cycle. This is accomplished by executing a MULDP and an ADDDP (or a MULSP and an ADDSP) per cycle. The reason this is possible is because Nehalem has separate execution units for SSE multiply and SSE add, and these units are pipelined so that the throughput is one multiply and one add per cycle. Multiplies are in the multiplier pipeline for 4 cycles in SP and 5 cycles in DP. Adds are in the pipeline for 3 cycles independent of SP/DP. The number of cycles in the pipeline is known as the latency. To compute peak FLOP/cycle all you need to know is the throughput. So with a throughput of 1 SSE vector instruction/cycle for both the multiplier and the adder (2 execution units) you have 2 x 2 = 4 FLOP/cycle in DP and 2 x 4 = 8 FLOP/cycle in SP. To actually sustain this peak throughput you need to consider latency (so you have at least as many independent operations in the pipeline as the depth of the pipeline) and you need to consider being able to feed the data fast enough. Nehalem has an integrated memory controller capable of very high bandwidth from memory which it can achieve if the data prefetcher correctly anticipates the access pattern of the data (sequentially loading from memory is a trivial pattern that it can anticipate). Typically there isn't enough memory bandwidth to sustain feeding all cores with data at peak FLOP/cycle, so some amount of reuse of the data from the cache is necessary in order to sustain peak FLOP/cycle.
Details on where you can find information on the number of independent execution units and their throughput and latency in cycles follows.
See page 105 8.9 Execution units of this document
http://www.agner.org/optimize/microarchitecture.pdf
It says that for Nehalem
The floating point multiplier on port 0 has a latency of 4 for single precision and 5 for double and long double precision. The throughput of the floating point multiplier is 1 operation per clock cycle, except for long double precision on Core2. The floating point adder is connected to port 1. It has a latency of 3 and is fully pipelined.
In order to get 8 SP FLOP/cycle you need 4 SP ADD/cycle and 4 SP MUL/cycle. The adder and the multiplier are on separate execution units, and dispatch out of separate ports, each can execute on 4 SP packed operands simultaneously using SSE packed (vector) instructions (4x32bit = 128bits). Both have throughput of 1 operation per clock cycle. In order to get that throughput, you need to consider the latency... how many cycles after the instruction issues before you can use the result.. so you have to issue several independent instructions to cover the latency. The multiplier in single precision has a latency of 4 and the adder of 3.
You can find these same throughput and latency numbers for Nehalem in the Intel Optimization guide, table C-15a
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

Resources