How to derive the Peak performance in GFlop/s of Intel Xeon E5-2690? - performance

I was able to find the theoretical DP peak performance 371 GFlop/s for the Xeon E5-2690 in this Processor Comparison (interesting that it is easier to find this information in Intel's competitor than Intel support pages itself). However, when I try to derive that peak performance my derivation doesn't match:
The frequency (in Turbo mode) for each core of the Xeon E5-2690 = 3.8Ghz
The processor can do an add and mul operation per cycle so we get: 3.8 x 2 = 7.6
Given it has AVX support it can do 4 double operations per cycle: 7.6 x 4 = 30.4
Finally, it has 8 cores, therefore we get: 8 x 30.4 = 243.2
Thus, the peak performance in Gflop/s would be 243.2 GFlop/s and not 371 GFlop/s?

Turbo Mode is not used to calculate Theoretical Peak Performance, you have to consider something like:
CPU speed = 2.9 GHz
CPU Cores = 8
CPU instruction per cycle = 8 (considering AVX-256 -> 256 bits unit, can hold 8 single precision values) x 2 (add and mul operations like you said) = 16
Putting all together:
2.9x8x16 = 371 GFlops/s

Related

Theoretical maximum performance (FLOPS ) of Intel Xeon E5-2640 v4 CPU, using only addition?

I am confused about the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU (Boardwell-based). In this post, >800GFLOPS; in this post, about 200GFLOPS; in this post, 3.69GFLOPS per core, 147.70GFLOPS per computer. So what is the theoretical maximum performance of Intel Xeon E5-2640 v4 CPU?
Some specifications:
Processor Base Frequency = 2.4GHz;
Max turbo frequency = 3.4GHz;
IPC (instruction per cycle) = 2;
Instruction Set Extensions: AVX2, so #SIMD = 256/32 = 8;
I tried to compute the theoretical maximum FLOPS. Based on my understanding, it should be (Max turbo frequency) * (IPC) * (#SIMD), which is 3.4 * 2 * 8 = 54.4GFLOPS, is it right?
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)? What if additions and multiplications do not appear at the same time? (eg. if the workload only contains additions, is *2 appropriate?)
Besides, the above computation should be the maximum FLOPS per core, right?
3.4 GHz is the max single-core turbo (and also 2-core), so note that this isn't the per-core GFLOPS, it's the single-core GFLOPS.
The max all-cores turbo is 2.6 GHz on that CPU, and probably won't sustain that for long with all cores maxing out their SIMD FP execution units. That's the most power-intensive thing x86 CPUs can do. So it will likely drop back to 2.4 GHz if you actually keep all cores busy.
And yes you're missing a factor of two because FMA counts as two FP operations, and that's what you need to do to achieve the hardware's theoretical max FLOPS. FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2 . (Your Broadwell is the same as Haswell for max-throughput purposes.)
If you're only using addition then only have one FLOP per SIMD element per instruction, and also only 1/clock FP instruction throughput on a v4 (Broadwell) or earlier.
Haswell / Broadwell have two fully-pipelined SIMD FMA units (on ports 0 and 1), and one fully-pipelined SIMD FP-add unit (on port 1) with lower latency than FMA.
The FP-add unit is on the same execution port as one of the FMA units, so it can start 2 FP uops per clock, up to one of which can be pure addition. (Unless you do addition x+y as fma(x, 1.0, y), trading higher latency for more throughput.)
IPC (instruction per cycle) = 2;
Nope, that's the number of FP math instructions per cycle, max, not total instructions per clock. The pipeline's narrowest point is 4 uops wide, so there's room for a bit of loop overhead and a store instruction every cycle as well as two SIMD FP operations.
But yes, 2 FP operations started per clock, if they're not both addition.
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)?
You're already multiplying by IPC=2 for parallel additions and multiplications.
If you mean FMA (Fused Multiply-Add), then no, that's literally doing them both as part of a single operation, not in parallel as a "pipeline technique". That's why it's called "fused".
FMA has the same latency as multiply in many CPUs, not multiply and then addition. (Although on Broadwell, FMA latency = 5 cycles, vmulpd latency = 3 cycles, vaddpd latency = 3 cycles. All are fully pipelined, with a throughput discussed in the rest of this answer, since theoretical max throughput requires arranging your calculations to not bottleneck on the latency of addition or multiplication. e.g. using multiple accumulators for a dot product or other reduction.) Anyway, point being, a hardware FMA execution unit is not terribly more complex than an FP multiplier or adder, and you shouldn't think of it as two separate operations.
If you write a*b + c in the source, a compiler can contract that into an FMA, instead of rounding the a*b to a temporary result before addition, depending on compiler options (and defaults) to allow that or not.
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX
FMA3 in GCC: how to enable
Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;
256/64 = 4, not 8. In a 32-byte (256-bit) SIMD vector, you can fit 4 double-precision elements.
Per core per clock, Haswell/Broadwell can begin up to:
two FP math instructions (FMA/MUL/ADD), up to one of which can be addition.
FMA counts as 2 FLOPs per element, MUL/ADD only count as 1 each.
on up to 32 byte wide inputs (e.g. 4 doubles or 8 floats)

How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have executed in parallel using only CPU and I have achieved a speedup of 30x. That is, an efficiency equal to 0.93 (efficiency = speedup/cores, i.e. 0.93 = 30/32).
Later I added 2 GPUs (Tesla C2075 of 448 cores each) together to the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the amount of GPU cores to the CPU cores? That is, I would calculate the efficiency using 928 cores (32 + 448 + 448 = 928). Or should it be calculated differently?
Speedup and efficiency has been calculated based on what has been said here:
https://software.intel.com/en-us/articles/predicting-and-measuring-parallel-performance
GPUs have bigger "core complex" architectures called "SM" or "CU" with tens of pipelines each. Not "very" similar to "SIMD" of a CPU, they can issue commands in parallel to these pipelines in a "single-threaded" kernel code.
You have counted "cores" in CPU and not SIMD pipelines (which is 4 to 16 times of number of cores) so, it wouldn't be wrong to count SM units of Nvidia or CU of Amd or Slice subset of Intel etc.
Tesla C2075 has 14 SM units so you could add 14 for each GPU (32+14+14).
If you have also used SIMDified code for CPU, then it wouldn't be wrong to count each pipeline of a GPU which is 32 to 192 times the number of SM/CU(like 448 per GPU of yours) (32*SIMD_WIDTH + 448 + 448).
At least this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from GPU is not a bottleneck, efficiency should not drop much after GPUs are added.

Building a Roofline Model

I'm trying to build a roofline model for a node in a supercomputer that I'm running simulations on. The node has 2x Intel Xeon E5-2650 v2 (Ivy Bridge) 8 core 2.6 GHz processors (16 cores per
node), with 64GB RAM total (4GB each). The maximum memory bandwidth for the Intel Xeon E5-2650 is shown here as 59.7 GB/s.
Achieved GFLOPS = max mem bandwidth x arithmetic intensity.
Max GFLOPS = num cores x clock frequency in GHz x ops/cycle.
My code has arithmetic intensity of 1/3 and uses double precision floating point.
Here are my calculations for calculating the peak GFLOPs for the different types of program:
Sequential program (single core) no vectorisation:
1x2.6x1 (I assume without vectorisation, we can only achieve 1 op/cycle?) = 2.6 GFLOPs
Sequential program (single core) with vectorisation (SSE):
1x2.6x8 = 20.8 GFLOPs
All cores on one Xeon with vectorisation (SSE):
8x2.6x8 = 166.4 GFLOPs
All cores one both Xeons with vectorisation (SSE):
2x 8x2.6x8 = 332.8 GFLOPs
How does the memory bandwidth available to the program change between the different types of program shown above? I know that the max memory bandwidth for 1 Xeon E5-2650 is 59.7 GB/s, however is this achieveable on a single core? Does this become 119.4 GB/s with 2 Xeon E2650s?
So would the achieved GFLOPs (using peak bandwidth x arithmetic intensity) be:
Sequential program w/o vectorisation:
59.7 * 1/3 = 19.9 GFLOPs, however because our roofline is 2.6 GFLOPs, we are limited to 2.6 GFLOPs?
Sequential program with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. This is achieveable because our roofline is 20.8 GFLOPs.
One Xeon (using all 8 cores) with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. I am suspicious of this, because surely our parallel program is capable of producing more mem reqs than the sequential program, and surely the sequential program doesn't saturate the memory system?
Two Xeons (total of 16 cores) with vectorisation:
119.4 * 1/3 = 39.8 GFLOPs.
I feel like something is wrong with the achieved GFLOPs, have I made a mistake somewhere?

floating point operations per cycle - intel

I have been looking for quite a while and cannot seem to find an official/conclusive figure quoting the number of single precision floating point operations/clock cycle that an Intel Xeon quadcore can complete. I have an Intel Xeon quadcore E5530 CPU.
I'm hoping to use it to calculate the maximum theoretical FLOP/s my CPU can achieve.
MAX FLOPS = (# Number of cores) * (Clock Frequency (cycles/sec) ) * (# FLOPS / cycle)
Anything pointing me in the right direction would be useful. I have found this
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
Intel Core 2 and Nehalem:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
But I'm not sure where these figures were found. Are they assuming a fused multiply add (FMAD) operation?
EDIT: Using this, in DP I calculate the correct DP arithmetic throughput cited by Intel as 38.4 GFLOP/s (cited here). For SP, I get double that, 76.8 GFLOP/s. I'm pretty sure 4 DP FLOP/cycle and 8 SP FLOP/cycle is correct, I just want confirmation of how they got the FLOPs/cycle value of 4 and 8.
Nehalem is capable of executing 4 DP or 8 SP FLOP/cycle. This is accomplished using SSE, which operates on packed floating point values, 2/register in DP and 4/register in SP. In order to achieve 4 DP FLOP/cycle or 8 SP FLOP/cycle the core has to execute 2 SSE instructions per cycle. This is accomplished by executing a MULDP and an ADDDP (or a MULSP and an ADDSP) per cycle. The reason this is possible is because Nehalem has separate execution units for SSE multiply and SSE add, and these units are pipelined so that the throughput is one multiply and one add per cycle. Multiplies are in the multiplier pipeline for 4 cycles in SP and 5 cycles in DP. Adds are in the pipeline for 3 cycles independent of SP/DP. The number of cycles in the pipeline is known as the latency. To compute peak FLOP/cycle all you need to know is the throughput. So with a throughput of 1 SSE vector instruction/cycle for both the multiplier and the adder (2 execution units) you have 2 x 2 = 4 FLOP/cycle in DP and 2 x 4 = 8 FLOP/cycle in SP. To actually sustain this peak throughput you need to consider latency (so you have at least as many independent operations in the pipeline as the depth of the pipeline) and you need to consider being able to feed the data fast enough. Nehalem has an integrated memory controller capable of very high bandwidth from memory which it can achieve if the data prefetcher correctly anticipates the access pattern of the data (sequentially loading from memory is a trivial pattern that it can anticipate). Typically there isn't enough memory bandwidth to sustain feeding all cores with data at peak FLOP/cycle, so some amount of reuse of the data from the cache is necessary in order to sustain peak FLOP/cycle.
Details on where you can find information on the number of independent execution units and their throughput and latency in cycles follows.
See page 105 8.9 Execution units of this document
http://www.agner.org/optimize/microarchitecture.pdf
It says that for Nehalem
The floating point multiplier on port 0 has a latency of 4 for single precision and 5 for double and long double precision. The throughput of the floating point multiplier is 1 operation per clock cycle, except for long double precision on Core2. The floating point adder is connected to port 1. It has a latency of 3 and is fully pipelined.
In order to get 8 SP FLOP/cycle you need 4 SP ADD/cycle and 4 SP MUL/cycle. The adder and the multiplier are on separate execution units, and dispatch out of separate ports, each can execute on 4 SP packed operands simultaneously using SSE packed (vector) instructions (4x32bit = 128bits). Both have throughput of 1 operation per clock cycle. In order to get that throughput, you need to consider the latency... how many cycles after the instruction issues before you can use the result.. so you have to issue several independent instructions to cover the latency. The multiplier in single precision has a latency of 4 and the adder of 3.
You can find these same throughput and latency numbers for Nehalem in the Intel Optimization guide, table C-15a
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

Cache bandwidth per tick for modern CPUs

What is a speed of cache accessing for modern CPUs? How many bytes can be read or written from memory every processor clock tick by Intel P4, Core2, Corei7, AMD?
Please, answer with both theoretical (width of ld/sd unit with its throughput in uOPs/tick) and practical numbers (even memcpy speed tests, or STREAM benchmark), if any.
PS it is question, related to maximal rate of load/store instructions in assembler. There can be theoretical rate of loading (all Instructions Per Tick are widest loads), but processor can give only part of such, a practical limit of loading.
For nehalem: rolfed.com/nehalem/nehalemPaper.pdf
Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.
128 bit = 16 bytes / clock read
AND
128 bit = 16 bytes / clock write
(can I combine read and write in single cycle?)
The L2 and L3 caches each have a 256-bit port for reading or writing,
but the L3 cache must share its port with three other cores on the chip.
Can L2 and L3 read and write ports be used in single clock?
Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
Latency (clock ticks), some measured by CPU-Z's latencytool or by lmbench's lat_mem_rd - both uses long linked list walk to correctly measure modern out-of-order cores like Intel Core i7
L1 L2 L3, cycles; mem link
Core 2 3 15 -- 66 ns http://www.anandtech.com/show/2542/5
Core i7-xxx 4 11 39 40c+67ns http://www.anandtech.com/show/2542/5
Itanium 1 5-6 12-17 130-1000 (cycles)
Itanium2 2 6-10 20 35c+160ns http://www.7-cpu.com/cpu/Itanium2.html
AMD K8 12 40-70c +64ns http://www.anandtech.com/show/2139/3
Intel P4 2 19 43 200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3 20 180 (cycles) --//--
AthlonFX-51 3 13 125 (cycles) --//--
POWER4 4 12-20 ?? hundreds cycles --//--
Haswell 4 11-12 36 36c+57ns http://www.realworldtech.com/haswell-cpu/5/
And good source on latency data is 7cpu web-site, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html
More about lat_mem_rd program is in its man page or here on SO.
Widest read/writes are 128 bit (16 byte) SSE load/store. L1/L2/L3 caches have different bandwidths and latencies and these are of course CPU-specific. Typical L1 latency is 2 - 4 clocks on modern CPUs but you can usually issue 1 or 2 load instructions per clock.
I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve ? Do you just want to write the fastest possible memcpy ?

Resources