How does one compute FLOPS from time elapsed for a computation? - performance

One can see from this tutorial on the usage of Intel MKL DFTs that Dr. Andrey E. Vladimirov uses the time elapsed during a task, namely t1-t0, to compute the number of GigaFLOPS using GF/s = HztoPerf/(t1-t0) where HztoPerf = 5.0 * 1e-9 * double(fft_size) * log2(double(fft_size)) * double(num_fft).
Is this a general formula? If not, how do I deduce the average GF/s for my CPU (Intel Xeon E5-1660 at 3 GHz with 8 cores) if I know the time elapsed to run a computation (e.g. involving various FFTs)?

You have to know how many FP operations your problem requires. Then you divide that by time.
1e-9 accounts for the Giga = 10^9 metric prefix. Without that, you'd have FLOP/s not GFLOP/s if you divide FLoating point OPeration count by seconds.
5.0 * fft_size * log2(fft_size) appears to be the number of FP ops per FFT.
An efficient FFT is O(n log2(n)), and apparently this implementation has a constant factor of 5. (Or possibly that's including some work done using the result?)
num_fft is presumably the total number of FFTs of that size done, i.e. the repeat count. So the product of all those things is the number of FP ops actually done during computation of the FFT.
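As a concrete sketch of that formula (the fft_size, num_fft and timestamp values below are made-up examples; substitute your own measurements):
import math

# Hypothetical example values; substitute your own measurements.
fft_size = 2**20          # points per FFT
num_fft  = 100            # how many FFTs of that size were computed
t0, t1   = 0.0, 0.35      # timestamps (seconds) taken around the FFT work

# Estimated FP operation count: ~5*N*log2(N) per size-N FFT, times the repeat count.
flop_count = 5.0 * fft_size * math.log2(fft_size) * num_fft

gflops = flop_count * 1e-9 / (t1 - t0)   # 1e-9 converts FLOP/s to GFLOP/s
print(f"{gflops:.2f} GFLOP/s")
For a workload made of several different FFT sizes, you would sum the estimated operation counts of all the pieces and divide the total by the elapsed time.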
Hardware performance counters on Intel CPUs can record number of FLOPs (even counting FMAs as 2): there are events like fp_arith_inst_retired.256b_packed_double for various SIMD widths.
perf has a GFLOPs "metric group" you can use that enables the relevant events and calculates it for you:
perf stat --all-user -M GFLOPs ./my_program my args
Restricting the count to user-space probably makes little difference here: kernel code might use SIMD for software RAID5/6, but not in interrupt handlers and probably not in system calls, and not for FP math anyway.
Example on my i7-6700k Skylake
$ perf stat --all-user -M GFLOPs awk 'BEGIN{for(i=0;i<100000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<100000000;i++){}}':
0 fp_arith_inst_retired.256b_packed_single # 0.03 GFLOPs (66.58%)
99,934,901 fp_arith_inst_retired.scalar_double (66.68%)
0 fp_arith_inst_retired.128b_packed_single (66.71%)
0 fp_arith_inst_retired.scalar_single (66.71%)
0 fp_arith_inst_retired.256b_packed_double (66.71%)
0 fp_arith_inst_retired.128b_packed_double (66.62%)
3,352,766,500 ns duration_time
3.352766500 seconds time elapsed
3.347268000 seconds user
0.000000000 seconds sys
Unfortunately it had to multiplex between those events, since there were more than 4, and hyperthreading is enabled, so the total number of scalar double-precision FP operations (99,934,901) was measured a bit lower than the awk loop iteration count. With just -e task_clock,cycles,instructions,fp_arith_inst_retired.scalar_double, it came out at exactly 100,000,000 counts, since apparently gawk did no other FP operations.
Of course awk is not a high-FP-throughput program, and it only used scalar FP math. Numeric variables in awk are double-precision, like JavaScript, but unlike JS it doesn't JIT, let alone take advantage of cases where the values are really integers.
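If you want to turn raw counts from those events into GFLOP/s yourself rather than relying on the metric group, the weights are just the number of elements per vector (these events already count each FMA as 2, as noted above). A sketch that post-processes the numbers from the run above:
# Hypothetical post-processing of a perf stat run: event counts weighted by
# elements per vector, divided by the measured duration.
counts = {
    "fp_arith_inst_retired.scalar_single":      0,
    "fp_arith_inst_retired.scalar_double":      99_934_901,
    "fp_arith_inst_retired.128b_packed_single": 0,
    "fp_arith_inst_retired.128b_packed_double": 0,
    "fp_arith_inst_retired.256b_packed_single": 0,
    "fp_arith_inst_retired.256b_packed_double": 0,
}
elements = {
    "fp_arith_inst_retired.scalar_single":      1,
    "fp_arith_inst_retired.scalar_double":      1,
    "fp_arith_inst_retired.128b_packed_single": 4,
    "fp_arith_inst_retired.128b_packed_double": 2,
    "fp_arith_inst_retired.256b_packed_single": 8,
    "fp_arith_inst_retired.256b_packed_double": 4,
}
duration_seconds = 3.352766500    # the duration_time event, in seconds

total_flop = sum(counts[e] * elements[e] for e in counts)
print(f"{total_flop / duration_seconds * 1e-9:.3f} GFLOP/s")   # ~0.030, matching the 0.03 GFLOPs perf reported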

Related

Theoretical maximum performance (FLOPS) of Intel Xeon E5-2640 v4 CPU, using only addition?

I am confused about the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU (Broadwell-based). In this post, >800 GFLOPS; in this post, about 200 GFLOPS; in this post, 3.69 GFLOPS per core, 147.70 GFLOPS per computer. So what is the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU?
Some specifications:
Processor Base Frequency = 2.4GHz;
Max turbo frequency = 3.4GHz;
IPC (instruction per cycle) = 2;
Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;
I tried to compute the theoretical maximum FLOPS. Based on my understanding, it should be (Max turbo frequency) * (IPC) * (#SIMD), which is 3.4 * 2 * 8 = 54.4GFLOPS, is it right?
Should it be multiplied by 2 (due to the pipeline technique that allows addition and multiplication to be done in parallel)? What if additions and multiplications do not appear at the same time? (e.g. if the workload only contains additions, is *2 appropriate?)
Besides, the above computation should be the maximum FLOPS per core, right?
3.4 GHz is the max single-core turbo (and also 2-core), so note that this isn't the per-core GFLOPS, it's the single-core GFLOPS.
The max all-cores turbo is 2.6 GHz on that CPU, and probably won't sustain that for long with all cores maxing out their SIMD FP execution units. That's the most power-intensive thing x86 CPUs can do. So it will likely drop back to 2.4 GHz if you actually keep all cores busy.
And yes, you're missing a factor of two because an FMA counts as two FP operations, and using FMAs is what you need to do to achieve the hardware's theoretical max FLOPS. See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. (Your Broadwell is the same as Haswell for max-throughput purposes.)
If you're only using addition then you only have one FLOP per SIMD element per instruction, and also only 1/clock FP-add instruction throughput on a v4 (Broadwell) or earlier.
Haswell / Broadwell have two fully-pipelined SIMD FMA units (on ports 0 and 1), and one fully-pipelined SIMD FP-add unit (on port 1) with lower latency than FMA.
The FP-add unit is on the same execution port as one of the FMA units, so it can start 2 FP uops per clock, up to one of which can be pure addition. (Unless you do addition x+y as fma(x, 1.0, y), trading higher latency for more throughput.)
IPC (instruction per cycle) = 2;
Nope, that's the number of FP math instructions per cycle, max, not total instructions per clock. The pipeline's narrowest point is 4 uops wide, so there's room for a bit of loop overhead and a store instruction every cycle as well as two SIMD FP operations.
But yes, 2 FP operations started per clock, if they're not both addition.
Should it be multiplied by 2 (due to the pipeline technique that allows addition and multiplication to be done in parallel)?
You're already multiplying by IPC=2 for parallel additions and multiplications.
If you mean FMA (Fused Multiply-Add), then no, that's literally doing them both as part of a single operation, not in parallel as a "pipeline technique". That's why it's called "fused".
FMA has the same latency as multiply in many CPUs, not multiply and then addition. (Although on Broadwell, FMA latency = 5 cycles, vmulpd latency = 3 cycles, vaddpd latency = 3 cycles. All are fully pipelined, with a throughput discussed in the rest of this answer, since theoretical max throughput requires arranging your calculations to not bottleneck on the latency of addition or multiplication. e.g. using multiple accumulators for a dot product or other reduction.) Anyway, point being, a hardware FMA execution unit is not terribly more complex than an FP multiplier or adder, and you shouldn't think of it as two separate operations.
If you write a*b + c in the source, a compiler can contract that into an FMA, instead of rounding the a*b to a temporary result before addition, depending on compiler options (and defaults) to allow that or not.
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX
FMA3 in GCC: how to enable
Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;
256/64 = 4, not 8. In a 32-byte (256-bit) SIMD vector, you can fit 4 double-precision elements.
Per core per clock, Haswell/Broadwell can begin up to:
two FP math instructions (FMA/MUL/ADD), up to one of which can be addition.
FMA counts as 2 FLOPs per element, MUL/ADD only count as 1 each.
on up to 32 byte wide inputs (e.g. 4 doubles or 8 floats)
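Putting those per-clock numbers together for double precision gives a rough sketch like this (the 10-core count comes from the E5-2640 v4 spec sheet and is an assumption not stated in the question):
# Haswell/Broadwell per-core, per-clock FP capability (double precision, AVX2 + FMA):
fma_per_clock      = 2    # two FMA pipes (ports 0 and 1)
doubles_per_vector = 4    # 256-bit vector / 64-bit double
flop_per_fma       = 2    # a fused multiply-add counts as 2 FLOPs

flop_per_core_clock = fma_per_clock * doubles_per_vector * flop_per_fma   # = 16

single_core_turbo_ghz = 3.4
base_all_core_ghz     = 2.4
num_cores             = 10   # from the spec sheet (assumption, not given in the question)

print(single_core_turbo_ghz * flop_per_core_clock)            # 54.4 GFLOP/s, one core at max turbo
print(base_all_core_ghz * flop_per_core_clock * num_cores)    # 384 GFLOP/s, all cores at base clock
# Addition-only code on Broadwell: one 4-wide vaddpd per clock = 4 FLOPs/core/clock,
# i.e. 3.4 * 4 = 13.6 GFLOP/s for a single core at max turbo.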

Calculate Speed up

This question is related to Computer Organization and Architecture. I would truly appreciate some assistance.
Processor A has a clock speed of 1 GHz, and takes 1 cycle for integer operations, 2 cycles for memory operations, and 4 cycles for floating point operations. Empirical data shows that programs run on Processor A are typically composed of 35% floating point operations, 30% memory operations, and 35% integer operations.
However you wish to design a new processor, Processor B, which is an improvement on Processor A. Processor B will run the same programs as Processor A. To complete your design, you are faced with two options for improving performance:
Option 1: Increase the clock speed to 1.2 GHz, but memory operations take 3 cycles.
Option 2: Decrease the clock speed to 900 MHz, but floating point operations only take 3 cycles.
Compute the speedup for both options and decide the option Processor B should take.
I have tried finding the execution time for Processors A and B by using IC x CPI x CT. I used MIPS (clock rate / (CPI x 10^6)) as the IC in all cases. When I was finished with everything, Processor B option 1 was deemed the most efficient since it had the lowest CPU time (940.3x10^-9), but I'm not sure if my method was correct.
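For reference, a minimal sketch of that comparison (weighted CPI divided by clock rate gives the average time per instruction; the instruction count is the same for every design, so it cancels out of the speedup; the helper name is just for illustration):
def ns_per_instruction(ghz, fp_cycles, mem_cycles, int_cycles):
    # Weighted CPI for the given instruction mix, divided by clock rate in GHz.
    cpi = 0.35 * fp_cycles + 0.30 * mem_cycles + 0.35 * int_cycles
    return cpi / ghz

a  = ns_per_instruction(1.0, 4, 2, 1)   # Processor A: CPI 2.35 -> 2.35 ns/instruction
b1 = ns_per_instruction(1.2, 4, 3, 1)   # Option 1:    CPI 2.65 -> ~2.21 ns/instruction
b2 = ns_per_instruction(0.9, 3, 2, 1)   # Option 2:    CPI 2.00 -> ~2.22 ns/instruction

print("speedup of option 1:", a / b1)   # ~1.064
print("speedup of option 2:", a / b2)   # ~1.058
Both options come out roughly 6% faster than Processor A, with option 1 slightly ahead, which is consistent with the conclusion above.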

Calculating Cycles Per Instruction

From what I understand, to calculate CPI you take the percentage of each instruction type multiplied by its number of cycles and sum them, right? Does the type of machine have any part in this calculation whatsoever?
I have a problem that asks me if a change should be recommended.
Machine 1: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq - 3 Cycles, on a 2.5 GHz machine
Machine 2: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq - 4 Cycles, on a 2.7 GHz machine
By my calculations, machine 1 has 5.15 CPI while machine 2 has 5.3 CPI. Is it okay to ignore the GHz of the machine and say that the change would not be a good idea or do I have to factor the machine in?
I think the point is to evaluate a design change that makes an instruction take more clocks, but allows you to raise the clock frequency. (i.e. leaning towards a speed-demon design like Pentium 4, instead of brainiac like Apple's A7/A8 ARM cores. http://www.lighterra.com/papers/modernmicroprocessors/)
So you need to calculate instructions per second to see which one will get more work done in the same amount of real time. i.e. (clock/sec) / (clocks/insn) = insn/sec, cancelling out the clocks from the units.
Your CPI calculation looks ok; I didn't check it, but yes, it's a weighted average of the cycle counts according to the instruction mix.
These numbers are obviously super simplified; any CPU worth building at 2.5GHz would have some kind of branch prediction so the cost of a branch isn't just a 3 or 4 instruction bubble. And taking ~5 cycles per instruction on average is pathetic. (Most pipelined designs aim for at least 1 instruction per clock.)
Caches and superscalar CPUs also lead to complex interactions between instructions depending on whether they depend on earlier results or not.
But this is sort of like what you might do if considering increasing the L1d cache load-use latency by 1 cycle (for example), if that took it off the critical path and let you raise the clock frequency. Or vice versa, tightening up the latency or reducing the number of pipeline stages on something at the cost of reducing frequency.
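A quick numeric check of that, as a sketch using the CPI values from the question:
# (clocks/second) / (clocks/instruction) = instructions/second
machine1 = 2.5e9 / 5.15     # ~4.85e8 instructions/s
machine2 = 2.7e9 / 5.30     # ~5.09e8 instructions/s
print(machine2 / machine1)  # ~1.05: machine 2 gets ~5% more work done per unit time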
Cycles per instruction is a count of cycles; GHz doesn't matter as far as that average goes. But having said that, we can see from your numbers that one instruction takes more clocks while the processors run at different speeds.
So while it takes more cycles to do the same job on the faster processor, the higher clock speed DOES compensate for that, so it seems clear this is really a question about whether the processor speed makes up for the extra clocks.
5.15 cycles/instruction divided by 2.5 gigacycles/second (the cycles cancel out) gives
2.06 seconds per giga-instruction, i.e. nanoseconds per instruction.
5.30 / 2.7 = 1.96296 nanoseconds per instruction.
The faster one takes a slightly less amount of time so it will run the program faster.
Another way to see this is to check the math.
For every 100 instructions on the slower machine, 15 are beq, and at 3 cycles each that is 45 of the 515 total clocks. The same 15 beq instructions take 60 clocks on the faster machine, so 530 clocks total for the same 100 instructions.
515 cycles at 2.5 GHz vs 530 at 2.7 GHz
we want the amount of time
Hz is cycles / second; we want seconds on the top
so we want
cycles / (cycles/second) to have cycles cancel out and have seconds on the top
1/2.5 = 0.400 (400 picoseconds)
1/2.7 = 0.3704 (370 picoseconds)
0.4000 * 515 = 206.0 units of time
0.3704 * 530 = 196.3 units of time
So despite taking 15 more cycles per 100 instructions, the faster processor's clock speed is enough to compensate.
2.7/2.5 = 1.08
530/515 = 1.029
so 2.5 * 1.029 = 2.57, so a processor at 2.57 GHz or faster would run that program faster.
Now, what were the rules for changing computers? Is less time on its own defined as a reason to change computers? What is the definition of better? How much more power does the faster one consume? It might take less time, but the power consumption might not scale linearly, so it may draw more watts despite taking less time. I assume the question is not that detailed, which means it is vague, which makes it a poorly written question on its own; so it comes down to what the textbook or lecture defined as the threshold for changing to the other processor.
Disclaimer: don't blame me if you miss this question on your homework/test.
Outside an academic exercise like this, the real world is full of pipelined processors (not all of them, but most of the ones folks write programs for), and you basically can't put a number on clock cycles per instruction type in a way that lets you do this calculation, because of a laundry list of factors. Make sure you understand that: it's a nice exercise, but that specific exercise is difficult and dangerous to attempt on real-world processors. Dangerous in that, as hard as you work, you may be incorrectly measuring something, jumping to the wrong conclusions, and as a result making bad recommendations. At the same time there is very much the reality that a faster clock does improve some percentage of the execution while another percentage suffers, and the question is whether there is a net gain or loss. Likewise a new processor design, faster or slower, may have features that perform better than an older processor, but not all features will be better; there is a tradeoff, and then we get into what "better" means.

How can CPUs have FLOPS much higher than their clock speeds?

For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD: the Intel 8700k and similar processors support AVX and AVX2, which include many instructions that operate on registers holding 8 floats at the same time.
Multiple cores: the 8700k has 6 cores.
Fused multiply-add (FMA): introduced alongside AVX2, it does both a multiplication and an addition in a single instruction.
High-throughput execution: the latency (the time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as the 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions, so that represents a lot of floating point operations) even though the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean it can actually be attained. For example, the processor may downclock significantly under such a workload in order to stay within its power budget, which is something Intel has been doing at least since Haswell (though the specifics have changed, and it mainly applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close, though, and for example according to these stats the 8700k reached 496.7 GFLOPS in their SGEMM benchmark. Possibly the 8700k's max AVX2 turbo speed on all 6 cores is only about 2.6 GHz, which would account for that number, but as far as I can find it does not have an AVX offset by default (one is only needed when overclocking); so more likely GEMM is just not that close to hitting peak FLOPS.
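The same multiplication written out as a sketch, along with the efficiency implied by that SGEMM result:
floats_per_vector = 8     # 256-bit AVX vector of single-precision floats
cores             = 6
fma_per_clock     = 2     # two FMA pipes per core
flop_per_fma      = 2     # a fused multiply-add counts as 2 FLOPs
all_core_turbo    = 4.3   # GHz

peak_gflops = floats_per_vector * cores * fma_per_clock * flop_per_fma * all_core_turbo
print(peak_gflops)           # ~825.6 GFLOP/s theoretical single-precision peak
print(496.7 / peak_gflops)   # ~0.60: the SGEMM result is about 60% of that peak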

Matlab matrix multiplication speed

I was wondering how Matlab can multiply two matrices so fast. When multiplying two NxN matrices, N^3 multiplications are performed. Even with the Strassen algorithm it takes ~N^2.8 multiplications, which is still a large number. I was running the following test program:
a = rand(2160);
b = rand(2160);
tic;a*b;toc
2160 was used because 2160^3 is about 10^10 (a*b should take about 10^10 multiplications)
I got:
Elapsed time is 1.164289 seconds.
(I'm running on a 2.4 GHz notebook and no threading occurs)
which means my computer did ~10^10 operations in a little more than 1 second.
How could this be?
It's a combination of several things:
Matlab does indeed multi-thread.
The core is heavily optimized with vector instructions.
Here are the numbers on my machine: Core i7 920 @ 3.5 GHz (4 cores)
>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.
Task Manager shows 4 cores of CPU usage.
Now for some math:
Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12
Max multiplies in 53 secs =
(3.5 GHz) * (4 cores) * (2 mul/cycle via SSE) * (52.6 secs) = 1.47 * 10^12
So Matlab is achieving about 1 / 1.47 = 68% efficiency of the maximum possible CPU throughput.
I see nothing out of the ordinary.
To check whether or not you are using multi-threading in MATLAB, use this command:
maxNumCompThreads(n)
This sets the number of computational threads to n. Now, I have a Core i7-2620M, which has a base frequency of 2.7 GHz, but it also has a turbo mode at 3.4 GHz. The CPU has two cores. Let's see:
A = rand(5000);
B = rand(5000);
maxNumCompThreads(1);
tic; C=A*B; toc
Elapsed time is 10.167093 seconds.
maxNumCompThreads(2);
tic; C=A*B; toc
Elapsed time is 5.864663 seconds.
So there is multi-threading.
Let's look at the single CPU results. A*B executes approximately 5000^3 multiplications and additions. So the performance of single-threaded code is
5000^3*2/10.8 = 23 GFLOP/s
Now the CPU: 3.4 GHz, and Sandy Bridge can do a maximum of 8 double-precision FLOPs per cycle with AVX (one 4-wide add and one 4-wide multiply per cycle):
3.4 [Gcycles/second] * 8 [FLOPs/cycle] = 27.2 GFLOP/s peak performance
So single core performance is around 85% peak, which is to be expected for this problem.
You really need to look deeply into the capabilities of your CPU to get accurate performance estimates.
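As a rough sketch, the same GFLOP/s estimate for any size-N matrix product (about 2*N^3 FLOPs: one multiply and one add per inner-loop step), applied to the timings above; the helper name is just for illustration:
def gemm_gflops(n, seconds):
    # ~n^3 multiplies + ~n^3 additions for an n x n matrix product
    return 2.0 * n**3 / seconds * 1e-9

print(gemm_gflops(5000, 10.8))       # ~23 GFLOP/s, the single-threaded figure used above
print(gemm_gflops(5000, 5.864663))   # ~42.6 GFLOP/s with two threads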

Resources