How to calculate HPC performance (Rpeak) [closed]

My workplace operates a cluster system with 20 compute nodes, and I'm having difficulty calculating the theoretical peak performance (Rpeak) of this HPC system.
I know the HPC world uses the following formula for theoretical peak node performance:
Node performance in GFLOPS = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node)
but I don't understand how to find the (CPU instructions per cycle) figure for these CPUs.
Here are the model names of the 20 nodes:
Xeon5460 3.16GHz 4Core *2
Xeon5450 3.00GHz 4Core *2
Xeon5450 3.00GHz 4Core *2
Xeon5460 3.16GHz 4Core *2
Xeon5460 3.16GHz 4Core *2
Xeon5460 3.16GHz 4Core *2
Xeon5460 3.16GHz 4Core *2
Xeon5460 3.16GHz 4Core *2
Xeon5460 3.16GHz 4Core *2
Xeon5460 3.16GHz 4Core *2
Xeon2690 2.90GHz 8Core *2
Xeon2690 2.90GHz 8Core *2
Xeon2690 2.90GHz 8Core *2
Xeon5680 3.33GHz 6Core *2
Xeon5660 2.80GHz 6Core *2
Xeon5660 2.80GHz 6Core *2
Xeon5660 2.80GHz 6Core *2
Xeon5660 2.80GHz 6Core *2
Xeon2680 2.80GHz 10Core *2
Xeon2680 2.80GHz 10Core *2
I looked on the Intel homepage but can't find the information I need.
Can anyone help me find the (CPU instructions per cycle) values and the Rpeak of the system?

"Instructions per cycle" isn't that relevant to calculate flops, it should be specifically floating point instructions per cycle. The number of floating point instructions per cycle is typically lower than the total number of instructions per cycle. Also don't forget about vector size.
For example for Xeon5460 (Penryn-based Xeon) can execute up to 5 instructions per cycle under the right circumstances, but only two of them can be floating point instructions, and they have to be able to go to different ports (for example addps and mulps, both of which are "worth" 4 operations because they operate on vectors of 4 floats).
Anyway, you can use these numbers, derived from this table:
Penryn/Nehalem/Westmere-like: 2 floating-point instructions per cycle, vector width 4 floats (2 doubles), so 8 flop/c or 4 dflop/c.
Sandy and Ivy: 2 floating-point instructions per cycle, vector width 8 floats (4 doubles), so 16 flop/c or 8 dflop/c.
Haswell/Broadwell/Skylake: still 2 floating-point instructions per cycle, but they can be FMAs, so 32 flop/c or 16 dflop/c, since an FMA counts as two operations.
There are more differences between these architectures that don't show up in such calculations (nor in total FLOPS, so as usual I question how useful that number is). For example, on Skylake there are more types of floating-point instruction that you can execute 2 of per cycle, such as addition, min/max, comparisons, and some conversions. Haswell and Broadwell can only do two additions per cycle by making them part of FMAs, and min/max etc. are out of luck there. Division throughput more than doubles from Haswell to Broadwell; hopefully division is rare, but this probably matters at least sometimes.
You can look up on Wikipedia which architecture a processor is based on.
Xeon2690 refers to several very different processors. They can be told apart by core count and frequency, but you should always include the version number: the E5-2690 (Sandy) is something completely different from the E5-2690 v4 (Broadwell). Based on core counts, the ones you listed are Sandy (the 8-core 2690) and Ivy (the 10-core 2680, i.e. E5-2680 v2).
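To make this concrete, here is a minimal sketch in C that applies the question's formula to its node list, using the dflop/c figures above. The model-to-architecture mapping in the table (X5460/X5450 = Penryn, X5680/X5660 = Westmere, E5-2690 = Sandy, the 10-core 2680 = E5-2680 v2, Ivy) is my reading of the list, so verify it against the actual hardware before trusting the total.

    #include <stdio.h>

    struct node_type {
        const char *model;
        double ghz;
        int cores;             /* per CPU */
        int cpus;              /* CPUs per node */
        int dflops_per_cycle;  /* double-precision FLOPs/cycle per core */
        int count;             /* how many such nodes in the cluster */
    };

    int main(void) {
        /* dflop/c values as given in this answer: 4 for Penryn/Westmere,
           8 for Sandy/Ivy. Node counts taken from the question's list. */
        const struct node_type nodes[] = {
            { "Xeon X5460 (Penryn)",   3.16,  4, 2, 4, 8 },
            { "Xeon X5450 (Penryn)",   3.00,  4, 2, 4, 2 },
            { "Xeon E5-2690 (Sandy)",  2.90,  8, 2, 8, 3 },
            { "Xeon X5680 (Westmere)", 3.33,  6, 2, 4, 1 },
            { "Xeon X5660 (Westmere)", 2.80,  6, 2, 4, 4 },
            { "Xeon E5-2680 v2 (Ivy)", 2.80, 10, 2, 8, 2 },
        };
        double rpeak = 0.0;
        for (size_t i = 0; i < sizeof nodes / sizeof nodes[0]; i++) {
            /* GHz x cores x dflop/c x sockets = GFLOPS per node */
            double node_gflops = nodes[i].ghz * nodes[i].cores
                               * nodes[i].dflops_per_cycle * nodes[i].cpus;
            printf("%-24s %8.2f GFLOPS/node x %d node(s)\n",
                   nodes[i].model, node_gflops, nodes[i].count);
            rpeak += node_gflops * nodes[i].count;
        }
        printf("Cluster Rpeak (double precision): %.2f GFLOPS\n", rpeak);
        return 0;
    }

For instance, one X5460 node comes out at 3.16 x 4 x 4 x 2 = 101.12 GFLOPS, and the cluster Rpeak is simply the sum over all 20 nodes.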

Related

How to interpret the effect of stride size on Intel's hardware prefetching?

In Section 9.5.3 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual, the effects of hardware prefetching are described as follows:
The effective latency reduction for several microarchitecture implementations is shown in Figure 9-2. For a constant-stride access pattern, the benefit of the automatic hardware prefetcher begins at half the trigger threshold distance and reaches maximum benefit when the cache-miss stride is 64 bytes.
Family 6, models 13 and 14 are Pentium M (Dothan and Yonah respectively), from 2004 and 2006. (https://en.wikichip.org/wiki/intel/cpuid)
Family 15 is NetBurst (Pentium 4); models 0, 1, and 2 are the early generations, Willamette and Northwood.
Family 15, models 3 and 4 are Prescott, and model 6 is a successor to that.
Pentium 4 used 128-byte lines in its L2 cache (or pairs of 64-byte lines kept together), vs. 64-byte lines in its L1d cache.
Pentium M used 64-byte cache lines at all levels, up from 32-byte in Pentium III.
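(Aside: these family/model numbers are the "display" values decoded from the CPUID leaf-1 EAX register; per the Intel SDM the extended fields only participate for families 6 and 15. A minimal sketch in C of that decoding rule; the sample input is one Sandy Bridge-EP signature, used purely as an illustration:)

    #include <stdio.h>
    #include <stdint.h>

    /* Decode display family/model from CPUID leaf-1 EAX, per the Intel SDM:
       extended family is added only when base family == 0xF, and extended
       model is prepended only for base families 0x6 and 0xF. */
    static void decode_fms(uint32_t eax, unsigned *family, unsigned *model) {
        unsigned base_family = (eax >> 8)  & 0xF;
        unsigned base_model  = (eax >> 4)  & 0xF;
        unsigned ext_family  = (eax >> 20) & 0xFF;
        unsigned ext_model   = (eax >> 16) & 0xF;
        *family = base_family + (base_family == 0xF ? ext_family : 0);
        *model  = base_model;
        if (base_family == 0x6 || base_family == 0xF)
            *model += ext_model << 4;
    }

    int main(void) {
        unsigned family, model;
        decode_fms(0x000206D7u, &family, &model);   /* Sandy Bridge-EP sample */
        printf("family %u, model 0x%X\n", family, model); /* family 6, model 0x2D */
        return 0;
    }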
I have two questions:
How should the "Effective Latency Reduction" in the figure be interpreted? Literally it should be (latency without prefetch - latency with prefetch) / (latency without prefetch); however, the figure seems to treat lower values as better, contrary to that reading.
How long is the trigger threshold distance? As defined in Section 9.5.2, "It will attempt to prefetch two cache lines ahead of the prefetch stream", which is 64B * 2 = 128B for the LLC. However, the significant inflection point in the figure occurs around 132B; if that is "half the trigger threshold distance", the latter should be 132B * 2 = 264B.
I read the surrounding context but did not find an explanation of these two terms.

Theoretical maximum performance (FLOPS) of Intel Xeon E5-2640 v4 CPU, using only addition?

I am confused about the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU (Broadwell-based). In this post, >800 GFLOPS; in this post, about 200 GFLOPS; in this post, 3.69 GFLOPS per core, 147.70 GFLOPS per computer. So what is the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU?
Some specifications:
Processor Base Frequency = 2.4GHz;
Max turbo frequency = 3.4GHz;
IPC (instruction per cycle) = 2;
Instruction Set Extensions: AVX2, so #SIMD = 256/32 = 8;
I tried to compute the theoretical maximum FLOPS. Based on my understanding, it should be (max turbo frequency) * (IPC) * (#SIMD), which is 3.4 * 2 * 8 = 54.4 GFLOPS. Is that right?
Should it be multiplied by 2 (because the pipeline allows an addition and a multiplication to be done in parallel)? What if additions and multiplications do not appear at the same time? (E.g. if the workload only contains additions, is the *2 appropriate?)
Besides, the above computation should be the maximum FLOPS per core, right?
3.4 GHz is the max single-core turbo (and also 2-core), so note that this isn't the per-core GFLOPS, it's the single-core GFLOPS.
The max all-cores turbo is 2.6 GHz on that CPU, and probably won't sustain that for long with all cores maxing out their SIMD FP execution units. That's the most power-intensive thing x86 CPUs can do. So it will likely drop back to 2.4 GHz if you actually keep all cores busy.
And yes, you're missing a factor of two: FMA counts as two FP operations, and using FMAs is what it takes to reach the hardware's theoretical max FLOPS. See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. (Your Broadwell is the same as Haswell for max-throughput purposes.)
If you're only using addition, then you only have one FLOP per SIMD element per instruction, and also only one FP-add instruction per clock on a v4 (Broadwell) or earlier.
Haswell / Broadwell have two fully-pipelined SIMD FMA units (on ports 0 and 1), and one fully-pipelined SIMD FP-add unit (on port 1) with lower latency than FMA.
The FP-add unit is on the same execution port as one of the FMA units, so it can start 2 FP uops per clock, up to one of which can be pure addition. (Unless you do addition x+y as fma(x, 1.0, y), trading higher latency for more throughput.)
IPC (instruction per cycle) = 2;
Nope, that's the number of FP math instructions per cycle, max, not total instructions per clock. The pipeline's narrowest point is 4 uops wide, so there's room for a bit of loop overhead and a store instruction every cycle as well as two SIMD FP operations.
But yes, 2 FP operations started per clock, if they're not both addition.
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)?
You're already multiplying by IPC=2 for parallel additions and multiplications.
If you mean FMA (Fused Multiply-Add), then no, that's literally doing them both as part of a single operation, not in parallel as a "pipeline technique". That's why it's called "fused".
FMA has the same latency as multiply in many CPUs, not multiply and then addition. (Although on Broadwell, FMA latency = 5 cycles, vmulpd latency = 3 cycles, vaddpd latency = 3 cycles. All are fully pipelined, with a throughput discussed in the rest of this answer, since theoretical max throughput requires arranging your calculations to not bottleneck on the latency of addition or multiplication. e.g. using multiple accumulators for a dot product or other reduction.) Anyway, point being, a hardware FMA execution unit is not terribly more complex than an FP multiplier or adder, and you shouldn't think of it as two separate operations.
If you write a*b + c in the source, a compiler can contract that into an FMA, instead of rounding the a*b to a temporary result before addition, depending on compiler options (and defaults) to allow that or not.
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX
FMA3 in GCC: how to enable
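To illustrate the multiple-accumulator point above, here is a minimal sketch (my example, not from the linked questions): with a single accumulator a dot product serializes on the add/FMA latency; with several independent accumulators the loop can approach the two-FMAs-per-clock throughput ceiling. Compile with something like gcc -O3 -march=haswell -ffast-math so the compiler may vectorize and contract the multiply-adds into FMAs.

    #include <stddef.h>

    /* Dot product with four independent accumulators, so the loop is
       bound by FMA throughput rather than FMA/add latency.
       For brevity this sketch assumes n is a multiple of 4. */
    double dot(const double *a, const double *b, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += a[i]     * b[i];      /* four independent dependency chains */
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

With one accumulator instead, each iteration would have to wait the 3-5 cycles of add/FMA latency before the next could start, far below that ceiling.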
Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;
256/64 = 4, not 8. A 32-byte (256-bit) SIMD vector holds 4 double-precision elements; 256/32 = 8 is the count for single precision.
Per core per clock, Haswell/Broadwell can begin up to:
two FP math instructions (FMA/MUL/ADD), up to one of which can be pure addition,
on inputs up to 32 bytes wide (e.g. 4 doubles or 8 floats).
FMA counts as 2 FLOPs per element; MUL/ADD count as only 1 each.
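Putting those numbers together for the E5-2640 v4 (a 10-core part with a 2.4 GHz base clock): 2 FP instructions/cycle x 4 doubles x 2 FLOPs (FMA) = 16 DP FLOP/cycle per core, so 2.4 GHz x 16 = 38.4 GFLOPS per core, or about 384 GFLOPS double precision for the whole chip at base clock, FMA-only. With pure addition it is 1 instruction x 4 doubles x 1 FLOP = 4 DP FLOP/cycle, i.e. 9.6 GFLOPS per core at 2.4 GHz.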

How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have executed in parallel using only CPUs, achieving a speedup of 30x on 32 cores, i.e. an efficiency of about 0.94 (efficiency = speedup/cores = 30/32 ≈ 0.94).
Later I added 2 GPUs (Tesla C2075, 448 cores each) alongside the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the number of GPU cores to the CPU cores? That is, should I calculate the efficiency using 928 cores (32 + 448 + 448 = 928)? Or should it be calculated differently?
Speedup and efficiency has been calculated based on what has been said here:
https://software.intel.com/en-us/articles/predicting-and-measuring-parallel-performance
GPUs are built from bigger "core complex" units, called SMs (Nvidia) or CUs (AMD), with tens of pipelines each. They are not quite the same as a CPU's SIMD units: they can issue instructions to those pipelines in parallel from "single-threaded" kernel code.
On the CPU side you counted "cores", not SIMD pipelines (which would be 4 to 16 times the number of cores), so it wouldn't be wrong to likewise count the SM units of Nvidia, the CUs of AMD, the slice subsets of Intel, etc.
A Tesla C2075 has 14 SM units, so you could add 14 for each GPU (32 + 14 + 14).
If you also used SIMD-optimized code on the CPU, then it wouldn't be wrong to count each pipeline of the GPU instead, which is 32 to 192 times the number of SMs/CUs (448 per GPU in your case): (32*SIMD_WIDTH + 448 + 448).
At least this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from the GPUs is not a bottleneck, efficiency should not drop much after the GPUs are added.
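As a worked example of the two conventions (the 48x hybrid speedup here is hypothetical): counting SM units, the denominator is 32 + 14 + 14 = 60, so 48/60 = 0.80; counting pipelines, with e.g. 4-wide CPU SIMD it is 32*4 + 448 + 448 = 1024, so 48/1024 ≈ 0.05. The two numbers answer different questions, so state which convention you are reporting.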

MPICH2 on a machine with two NUMA nodes

I am new to MPI. I am using MPICH2 on a Linux machine with the following information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping: 4
CPU MHz: 799.844
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
My understanding is that I've got 2 NUMA nodes, 20 physical cores, and 40 hardware threads (logical processors) on this machine. Is this correct? If so, I think I should set MPICH to spawn 20 processes (one process on each physical core), right? However, when I run the command mpiexec -n 20 MyProgram, the average CPU usage is only 50%. If I change to mpiexec -n 40 MyProgram, the CPU usage is 100%, but the overall performance actually becomes worse, so I suspect I am oversubscribing the cores.
CPU usage is a misleading metric. It reflects the portion of time some task was scheduled on a logical CPU, and the average is just that: the average over all logical CPUs. So a 50% average can simply mean that every other logical CPU is at 100% usage and the rest at 0%, which is exactly what you see when each physical core is fully utilized by one process.
Nor does CPU usage mean useful resource utilization. There are workloads that benefit from hyperthreading and workloads that don't. There are workloads that run faster with fewer threads than physical cores (e.g. memory-bandwidth-limited ones) and workloads that run faster with more threads than logical CPUs (e.g. I/O-latency-limited ones).
Always use your actual performance metric (e.g. time) to figure out the best configuration. If you want to understand resource utilization, you must look at many different performance metrics: cycles, instructions, memory bandwidth, cache behaviour, and so on.
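If you do settle on one rank per physical core, it usually also helps to make the placement explicit instead of leaving it to the OS scheduler. With MPICH's Hydra launcher that would look something like

    mpiexec -n 20 -bind-to core MyProgram

(the exact binding options vary by MPICH version; check mpiexec --help or the MPICH documentation). Whether 20 bound ranks beat 40 remains workload-dependent, as described above.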

Cache bandwidth per tick for modern CPUs

What is the speed of cache access for modern CPUs? How many bytes can be read or written from memory every processor clock tick by an Intel P4, Core2, Core i7, or AMD CPU?
Please answer with both theoretical numbers (width of the ld/st unit and its throughput in uops/tick) and practical ones (even memcpy speed tests or STREAM benchmark results), if any.
PS: this question relates to the maximal rate of load/store instructions in assembler. There is a theoretical rate of loading (if every instruction per tick were a widest load), but the processor can sustain only part of that: the practical limit on loading.
For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf
Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.
128 bit = 16 bytes / clock read
AND
128 bit = 16 bytes / clock write
(Can read and write be combined in a single cycle?)
The L2 and L3 caches each have a 256-bit port for reading or writing,
but the L3 cache must share its port with three other cores on the chip.
Can the L2 and L3 read and write ports be used in a single clock?
Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
Latency (in clock ticks), some measured by CPU-Z's latency tool or by lmbench's lat_mem_rd; both use a long linked-list walk to correctly measure modern out-of-order cores like the Intel Core i7:
(L1/L2/L3 latencies in cycles; memory column in cycles and/or ns; "--//--" = same source as the row above.)

                L1     L2      L3      memory              source
Core 2          3      15      --      66 ns               http://www.anandtech.com/show/2542/5
Core i7-xxx     4      11      39      40c + 67 ns         http://www.anandtech.com/show/2542/5
Itanium         1      5-6     12-17   130-1000c
Itanium2        2      6-10    20      35c + 160 ns        http://www.7-cpu.com/cpu/Itanium2.html
AMD K8          --     12      --      40-70c + 64 ns      http://www.anandtech.com/show/2139/3
Intel P4        2      19      43      200-210c            http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k     3      20      --      180c                --//--
AthlonFX-51     3      13      --      125c                --//--
POWER4          4      12-20   ??      hundreds of cycles  --//--
Haswell         4      11-12   36      36c + 57 ns         http://www.realworldtech.com/haswell-cpu/5/
A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html
More about lat_mem_rd program is in its man page or here on SO.
The widest reads/writes are 128-bit (16-byte) SSE loads/stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.
I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve ? Do you just want to write the fastest possible memcpy ?
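If measuring is the goal, a minimal read-bandwidth sketch in C might look like the following. It is a crude stand-in for STREAM, not a replacement: no NUMA pinning, no non-temporal stores, and the buffer is sized (512 MiB) to exceed typical L3 so that DRAM is what gets measured. Note the four accumulators, so the loop is bound by load throughput rather than FP-add latency. Build with something like gcc -O2 bench.c (older glibc may need -lrt for clock_gettime).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64UL * 1024 * 1024)   /* 64 Mi doubles = 512 MiB */

    int main(void) {
        double *buf = malloc(N * sizeof *buf);
        if (!buf) return 1;
        for (size_t i = 0; i < N; i++) buf[i] = 1.0;   /* touch every page first */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* Four independent accumulators: sequential reads at load throughput. */
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i < N; i += 4) {
            s0 += buf[i];
            s1 += buf[i + 1];
            s2 += buf[i + 2];
            s3 += buf[i + 3];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double bytes = (double)N * sizeof *buf;
        printf("read %.0f MB in %.3f s: %.2f GB/s (sum=%g)\n",
               bytes / 1e6, secs, bytes / secs / 1e9, s0 + s1 + s2 + s3);
        free(buf);
        return 0;
    }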
