How to compute the theoretical peak performance of a CPU

Here is my cat /proc/cpuinfo output:
...
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
stepping : 5
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
apicid : 23
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic ...
bogomips : 4533.56
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management :
This machine has two CPUs, each with 4 hyperthreaded cores, so the total number of logical processors is 16 (2 CPUs * 4 cores * 2 hyperthreads). All processors show the same output; to keep things clean I show only the last one and omit part of the flags line.
So how do I calculate the peak performance of this machine in terms of GFlops?
Let me know if more info should be supplied.
Thanks.

You can check Intel's export compliance specification. The GFLOPS number in that chart usually refers to the peak of a single chip; it lists 36.256 GFLOP/s for the E5520.
This single chip has 4 physical cores with SSE, so the figure can also be calculated as:
2.26 GHz * 2 (add + mul) * 2 (SSE double precision, 2 lanes) * 4 (physical cores) = 36.2 GFLOP/s.
Your system has two CPUs, so your peak is 36.2 * 2 = 72.4 GFLOP/s.
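If you want to play with the numbers, the same arithmetic can be written down directly. A minimal C sketch, assuming the E5520's actual clock is 2.266 GHz (which is what reproduces Intel's 36.256 figure); nothing here is queried from the machine:

#include <stdio.h>

int main(void)
{
    double ghz        = 2.266; /* E5520 base clock; shown rounded as 2.27 GHz in the model name */
    double simd_width = 2.0;   /* an SSE register holds 2 doubles */
    double issue      = 2.0;   /* one add + one mul per cycle */
    double cores      = 4.0;   /* physical cores per socket; hyperthreading adds no FLOPs */
    double sockets    = 2.0;

    double per_socket = ghz * simd_width * issue * cores;   /* ~36.26 GFLOP/s */
    printf("per socket: %.2f GFLOP/s, whole system: %.2f GFLOP/s\n",
           per_socket, per_socket * sockets);                /* ~72.51 GFLOP/s */
    return 0;
}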

You can find a formula on this website:
http://www.novatte.com/our-blog/197-how-to-calculate-peak-theoretical-performance-of-a-cpu-based-hpc-system
Here is the formula:
performance in GFLOP/s = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle, i.e. FLOPs per cycle) x (number of CPUs per node)
So in your case: 2.27 x 4 x 4 x 2 = 72.64 GFLOP/s.
See here for the configuration of your CPU: http://ark.intel.com/products/40200

Related

Performance of my MPI code does not improve when I use two NUMA nodes (dual Xeon chips)

I have a computer, a Precision Tower 7810 with dual Xeon E5-2680 v3 @ 2.50GHz (48 threads in total).
Here is the result of $ lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 3300,0000
CPU min MHz: 1200,0000
BogoMIPS: 4988.40
Virtualization: VT-x
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 6 MiB
L3 cache: 60 MiB
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
My MPI code is based on basic MPI (Isend, Irecv, Wait, Bcast). Fundamentally, the data is distributed and sent to all processes; on each process the data is used to compute something and its values change; afterwards, the data held by each process is exchanged among all processes. This is repeated up to a limit (a sketch of the pattern is below).
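For reference, here is a minimal sketch of the communication pattern just described. The names and sizes (N, STEPS) are made up for illustration, and it uses MPI_Allgather for the "exchange between all processes" step instead of the Isend/Irecv/Wait pairs in the real code:

#include <mpi.h>
#include <stdlib.h>

#define N     1024      /* elements per rank, arbitrary */
#define STEPS 100       /* "repeated up to a limit"     */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *mine   = malloc(N * sizeof *mine);
    double *others = malloc((size_t)N * size * sizeof *others);
    for (int i = 0; i < N; i++) mine[i] = rank + i;

    for (int step = 0; step < STEPS; step++) {
        /* Local work: every rank updates its own chunk. */
        for (int i = 0; i < N; i++) mine[i] = mine[i] * 0.5 + 1.0;

        /* Exchange: every rank ends up with every other rank's chunk. */
        MPI_Allgather(mine, N, MPI_DOUBLE,
                      others, N, MPI_DOUBLE, MPI_COMM_WORLD);
    }

    free(mine); free(others);
    MPI_Finalize();
    return 0;
}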
Now, the main issue is that performance increases as I add processes up to the size of one chip (24 hardware threads), but it stops improving once the number of processes exceeds 24.
An example:
$mpiexec -n 6 ./mywork : 72s
$mpiexec -n 12 ./mywork : 46s
$mpiexec -n 24 ./mywork : 36s
$mpiexec -n 32 ./mywork : 36s
$mpiexec -n 48 ./mywork : 35s
I have tried both OpenMPI and MPICH and obtained the same result, so I suspect the issue is the physical interconnect between the two chips (the two NUMA nodes). This is only an assumption of mine; I have never used a real supercomputer. I hope someone knows this issue and can help me. Thank you for reading.

strange CPU binding/pining result within OpenMPI

I have tried to evaluate an OpenMPI program with a matrix-multiplication algorithm. The code scales very well on the single-thread-per-core machine in our laboratory (close to ideal speedup on 48 and 64 cores). However, on some other machines that are hyperthreaded there is strange behavior: as the htop screenshot shows, the CPU utilization when I run the same experiment with the same command is different and odd. I executed the program with
mpirun --bind-to hwthread --use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to hardware threads, and with -n 2 I restrict execution to two processors (here, hwthreads). However, it seems another hwthread is also used, at roughly 50% utilization! I find this strange because there is no such extra CPU utilization on the other machines. I have tried this experiment many times and I'm sure it is not a temporary OS effect but something due to the execution model of OpenMPI.
I would appreciate it if someone could explain this behavior and the extra CPU utilization when I run on the hyperthreaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The OpenMPI version is the same (2.1.1) on all machines.
Maybe hyperthreading is not the cause and I was misled by it, but the only big differences between these environments are 1) hyperthreading and 2) the clock frequency of the processors, which ranges from 2200 MHz to 4.8 GHz depending on the CPU.
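One thing that may help diagnose this: OpenMPI's mpirun accepts --report-bindings, and each rank can also report the logical CPU it is currently running on. A minimal sketch, assuming Linux (sched_getcpu() is glibc-specific); run it with the same mpirun flags you use for the real program:

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Prints which hardware thread this rank is actually executing on,
     * so you can compare it against what --bind-to hwthread requested. */
    printf("rank %d on logical CPU %d\n", rank, sched_getcpu());
    MPI_Finalize();
    return 0;
}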

How to derive the Peak performance in GFlop/s of Intel Xeon E5-2690?

I was able to find the theoretical DP peak performance of 371 GFlop/s for the Xeon E5-2690 in this Processor Comparison (interestingly, it is easier to find this information from Intel's competitor than from Intel's own support pages). However, when I try to derive that peak performance, my derivation doesn't match:
The frequency (in Turbo mode) of each Xeon E5-2690 core is 3.8 GHz.
The processor can do one add and one mul operation per cycle, so we get: 3.8 x 2 = 7.6.
Given it has AVX support, it can do 4 double operations per cycle: 7.6 x 4 = 30.4.
Finally, it has 8 cores, therefore we get: 8 x 30.4 = 243.2.
Thus, the peak performance would be 243.2 GFlop/s and not 371 GFlop/s?
Turbo mode is not used to calculate theoretical peak performance; you have to consider something like:
CPU speed = 2.9 GHz
CPU cores = 8
FLOPs per cycle = 8 (considering AVX-256: a 256-bit unit can hold 8 single-precision values) x 2 (add and mul operations, like you said) = 16
Putting it all together:
2.9 x 8 x 16 = 371.2 GFLOP/s
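As a side note, the 8 values per cycle above is the single-precision lane count of a 256-bit AVX register; with doubles a register holds 4, so the double-precision peak works out to half. A small C sketch of both, reusing the 2.9 GHz and 8 cores from the answer:

#include <stdio.h>

int main(void)
{
    double ghz = 2.9, cores = 8.0;
    /* FLOPs per cycle = SIMD lanes * 2 (one add + one mul per cycle) */
    double sp = ghz * cores * 8.0 * 2.0;   /* 8 floats per AVX register  */
    double dp = ghz * cores * 4.0 * 2.0;   /* 4 doubles per AVX register */
    printf("SP peak: %.1f GFLOP/s, DP peak: %.1f GFLOP/s\n", sp, dp);
    return 0;
}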

How system is calculating to get the NumberOfLogicalProcessors in VB

For an Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz I am getting both NumberOfCores and NumberOfLogicalProcessors as 4.
I want to know how the system calculates NumberOfLogicalProcessors.
What should I use to get the actual number of CPUs?
OS: Win2k8 R2
That depends on what you mean by the actual CPU.
Win32_Processor\NumberOfCores specifies the total number of physical CPU cores. A chip can contain one or more CPU cores.
Win32_Processor\NumberOfLogicalProcessors specifies the total number of virtual CPU cores. There can be two or more virtual CPU cores in one physical CPU core. On x86 compatible computers, this is only available in Intel's Hyper-Threading capable CPUs.
On the other hand, the Win32_ComputerSystem\NumberOfProcessors specifies the total number of physical processor chips installed on a multi processor motherboard.
Win32_ComputerSystem\NumberOfLogicalProcessors is the same as Win32_Processor\NumberOfLogicalProcessors.
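If you would rather not go through WMI, the same two numbers can be obtained with the Win32 API. A hedged C sketch using GetLogicalProcessorInformation (this is an alternative route, not what the WMI classes do internally):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    /* First call only reports the required buffer size. */
    GetLogicalProcessorInformation(NULL, &len);

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len))
        return 1;

    DWORD cores = 0, logical = 0;
    DWORD count = len / sizeof(*info);
    for (DWORD i = 0; i < count; ++i) {
        if (info[i].Relationship == RelationProcessorCore) {
            cores++;
            /* Each set bit in the mask is one logical processor (hardware thread). */
            ULONG_PTR mask = info[i].ProcessorMask;
            while (mask) { logical += mask & 1; mask >>= 1; }
        }
    }
    printf("physical cores: %lu, logical processors: %lu\n", cores, logical);
    free(info);
    return 0;
}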

Cache bandwidth per tick for modern CPUs

What is the speed of cache access for modern CPUs? How many bytes can be read or written from memory every processor clock tick by an Intel P4, Core2, Core i7, or AMD CPU?
Please answer with both theoretical numbers (width of the load/store units with their throughput in uops/tick) and practical numbers (even memcpy speed tests, or STREAM benchmark results), if any.
PS: this question is about the maximal rate of load/store instructions in assembler. There is a theoretical loading rate (if every instruction issued per tick were the widest load), but the processor can sustain only part of that, which is the practical limit.
For nehalem: rolfed.com/nehalem/nehalemPaper.pdf
Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.
128 bit = 16 bytes / clock read
AND
128 bit = 16 bytes / clock write
(can I combine a read and a write in a single cycle?)
The L2 and L3 caches each have a 256-bit port for reading or writing,
but the L3 cache must share its port with three other cores on the chip.
Can the L2 and L3 read and write ports be used in a single clock?
Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
Latency (clock ticks), some measured by CPU-Z's latency tool or by lmbench's lat_mem_rd; both use a long linked-list walk to correctly measure modern out-of-order cores like the Intel Core i7:
              L1    L2     L3      mem                link
Core 2        3     15     --      66 ns              http://www.anandtech.com/show/2542/5
Core i7-xxx   4     11     39      40c + 67 ns        http://www.anandtech.com/show/2542/5
Itanium       1     5-6    12-17   130-1000 cycles
Itanium2      2     6-10   20      35c + 160 ns       http://www.7-cpu.com/cpu/Itanium2.html
AMD K8        --    12     --      40-70c + 64 ns     http://www.anandtech.com/show/2139/3
Intel P4      2     19     43      200-210 cycles     http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k   3     20     --      180 cycles         --//--
AthlonFX-51   3     13     --      125 cycles         --//--
POWER4        4     12-20  ??      hundreds cycles    --//--
Haswell       4     11-12  36      36c + 57 ns        http://www.realworldtech.com/haswell-cpu/5/
(L1/L2/L3 latencies are in cycles; --//-- means the same source as the row above)
A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html
More about the lat_mem_rd program is in its man page or here on SO.
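If you want to reproduce such latency numbers yourself, the linked-list walk that lat_mem_rd uses is easy to sketch. A minimal C version; the buffer size, stride, and iteration count are arbitrary assumptions, and a real tool randomizes the permutation to defeat the hardware prefetchers and sweeps many buffer sizes:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = 1 << 20;                 /* 1 Mi pointers = 8 MiB, past L2 on most CPUs */
    void **buf = malloc(n * sizeof *buf);

    /* Build a simple stride-based cycle: one hop per 64-byte cache line. */
    size_t stride = 64 / sizeof *buf;
    for (size_t i = 0; i < n; i++)
        buf[i] = &buf[(i + stride) % n];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = buf;
    const long iters = 100 * 1000 * 1000;
    for (long i = 0; i < iters; i++)
        p = *p;                         /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the loop is not optimized away. */
    printf("%p  ~%.2f ns per load\n", (void *)p, ns / iters);
    free(buf);
    return 0;
}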
The widest reads/writes are 128-bit (16-byte) SSE loads/stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.
I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?
