Matlab matrix multiplication speed - performance

I was wondering how MATLAB can multiply two matrices so fast. Multiplying two NxN matrices takes N^3 multiplications. Even the Strassen algorithm needs about N^2.8 multiplications, which is still a large number. I ran the following test program:
a = rand(2160);
b = rand(2160);
tic;a*b;toc
2160 was used because 2160^3=~10^10 ( a*b should be about 10^10 multiplications)
I got:
Elapsed time is 1.164289 seconds.
(I'm running on a 2.4 GHz notebook, and no threading occurs.)
This means my computer performed ~10^10 operations in a little more than 1 second.
How can this be?

It's a combination of several things:
Matlab does indeed multi-thread.
The core is heavily optimized with vector instructions.
Here are the numbers on my machine: Core i7 920 @ 3.5 GHz (4 cores)
>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.
Task Manager shows 4 cores of CPU usage.
Now for some math:
Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12
Max multiplies in 53 secs =
(3.5 GHz) * (4 cores) * (2 mul/cycle via SSE) * (52.6 secs) = 1.47 * 10^12
So Matlab is achieving about 1 / 1.47 = 68% efficiency of the maximum possible CPU throughput.
I see nothing out of the ordinary.
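The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate in this answer: the function names are made up for illustration, and the 2 multiplies per cycle figure is the answer's own assumption about SSE throughput.

```python
def max_multiplies(clock_hz, cores, mul_per_cycle, seconds):
    """Upper bound on the multiplies the CPU could issue in the given time."""
    return clock_hz * cores * mul_per_cycle * seconds


def efficiency(n, clock_hz, cores, mul_per_cycle, seconds):
    """Fraction of peak multiply throughput used by a naive n x n product."""
    needed = n ** 3  # the naive algorithm performs n^3 scalar multiplies
    return needed / max_multiplies(clock_hz, cores, mul_per_cycle, seconds)


# Numbers from the answer: 3.5 GHz, 4 cores, 2 mul/cycle, 52.6 s, N = 10000
peak = max_multiplies(3.5e9, 4, 2, 52.6)
eff = efficiency(10000, 3.5e9, 4, 2, 52.6)
print(f"peak multiplies in 52.6 s: {peak:.3g}")
print(f"efficiency: {eff:.0%}")
```

Plugging in the question's own numbers (2.4 GHz, 1 core, 2160, 1.164 s) the same way is a quick test of whether the run really was single-threaded.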

To check whether or not MATLAB is multi-threading, use this command:
maxNumCompThreads(n)
This sets the maximum number of computational threads to n. I have a Core i7-2620M, which has a base frequency of 2.7 GHz and a turbo mode of 3.4 GHz. The CPU has two cores. Let's see:
A = rand(5000);
B = rand(5000);
maxNumCompThreads(1);
tic; C=A*B; toc
Elapsed time is 10.167093 seconds.
maxNumCompThreads(2);
tic; C=A*B; toc
Elapsed time is 5.864663 seconds.
So there is multi-threading.
Let's look at the single-thread results. A*B executes approximately 5000^3 multiplications and 5000^3 additions. So the performance of the single-threaded code is
5000^3*2/10.17 = 24.6 GFLOP/s
Now the CPU. It runs at 3.4 GHz in turbo mode, and Sandy Bridge can do at most 8 double-precision FLOPs per cycle with AVX:
3.4 [Gcycles/second] * 8 [FLOPs/cycle] = 27.2 GFLOP/s peak performance
So single-core performance is around 90% of peak, which is to be expected for this problem.
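Here is the same estimate as a small Python sketch. The elapsed times are copied from the listing above. The helper names are hypothetical, and the 8 FLOPs per cycle figure is the Sandy Bridge AVX assumption made in this answer.

```python
def matmul_gflops(n, seconds):
    """Achieved GFLOP/s for an n x n product: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3 / seconds / 1e9


def peak_gflops(clock_ghz, flops_per_cycle, cores=1):
    """Theoretical peak: clock rate times FLOPs per cycle times core count."""
    return clock_ghz * flops_per_cycle * cores


t_one_thread = 10.167093   # seconds, maxNumCompThreads(1)
t_two_threads = 5.864663   # seconds, maxNumCompThreads(2)

achieved = matmul_gflops(5000, t_one_thread)
peak = peak_gflops(3.4, 8)                # single core at turbo frequency
fraction = achieved / peak
speedup = t_one_thread / t_two_threads    # scaling from 1 to 2 threads

print(f"achieved: {achieved:.1f} GFLOP/s, peak: {peak:.1f} GFLOP/s")
print(f"fraction of peak: {fraction:.0%}, 2-thread speedup: {speedup:.2f}x")
```

The speedup stays below 2x, which is plausible because the turbo frequency usually drops when both cores are busy.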
You really need to look deeply into the capabilities of your CPU to get accurate performance estimates.

Related

Theoretical maximum performance (FLOPS ) of Intel Xeon E5-2640 v4 CPU, using only addition?

I am confused about the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU (Broadwell-based). In this post, >800GFLOPS; in this post, about 200GFLOPS; in this post, 3.69GFLOPS per core, 147.70GFLOPS per computer. So what is the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU?
Some specifications:
Processor Base Frequency = 2.4GHz;
Max turbo frequency = 3.4GHz;
IPC (instruction per cycle) = 2;
Instruction Set Extensions: AVX2, so #SIMD = 256/32 = 8;
I tried to compute the theoretical maximum FLOPS. Based on my understanding, it should be (Max turbo frequency) * (IPC) * (#SIMD), which is 3.4 * 2 * 8 = 54.4GFLOPS, is it right?
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)? What if additions and multiplications do not appear at the same time? (eg. if the workload only contains additions, is *2 appropriate?)
Besides, the above computation should be the maximum FLOPS per core, right?
3.4 GHz is the max single-core turbo (and also 2-core), so note that this isn't the per-core GFLOPS, it's the single-core GFLOPS.
The max all-cores turbo is 2.6 GHz on that CPU, and probably won't sustain that for long with all cores maxing out their SIMD FP execution units. That's the most power-intensive thing x86 CPUs can do. So it will likely drop back to 2.4 GHz if you actually keep all cores busy.
And yes you're missing a factor of two because FMA counts as two FP operations, and that's what you need to do to achieve the hardware's theoretical max FLOPS. FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2 . (Your Broadwell is the same as Haswell for max-throughput purposes.)
If you're only using addition, then you only have one FLOP per SIMD element per instruction, and also only 1/clock FP instruction throughput on a v4 (Broadwell) or earlier.
Haswell / Broadwell have two fully-pipelined SIMD FMA units (on ports 0 and 1), and one fully-pipelined SIMD FP-add unit (on port 1) with lower latency than FMA.
The FP-add unit is on the same execution port as one of the FMA units, so it can start 2 FP uops per clock, up to one of which can be pure addition. (Unless you do addition x+y as fma(x, 1.0, y), trading higher latency for more throughput.)
IPC (instruction per cycle) = 2;
Nope, that's the number of FP math instructions per cycle, max, not total instructions per clock. The pipeline's narrowest point is 4 uops wide, so there's room for a bit of loop overhead and a store instruction every cycle as well as two SIMD FP operations.
But yes, 2 FP operations started per clock, if they're not both addition.
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)?
You're already multiplying by IPC=2 for parallel additions and multiplications.
If you mean FMA (Fused Multiply-Add), then no, that's literally doing them both as part of a single operation, not in parallel as a "pipeline technique". That's why it's called "fused".
FMA has the same latency as multiply in many CPUs, not multiply and then addition. (Although on Broadwell, FMA latency = 5 cycles, vmulpd latency = 3 cycles, vaddpd latency = 3 cycles. All are fully pipelined, with a throughput discussed in the rest of this answer, since theoretical max throughput requires arranging your calculations to not bottleneck on the latency of addition or multiplication. e.g. using multiple accumulators for a dot product or other reduction.) Anyway, point being, a hardware FMA execution unit is not terribly more complex than an FP multiplier or adder, and you shouldn't think of it as two separate operations.
If you write a*b + c in the source, a compiler can contract that into an FMA, instead of rounding the a*b to a temporary result before addition, depending on compiler options (and defaults) to allow that or not.
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX
FMA3 in GCC: how to enable
Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;
256/64 = 4, not 8. In a 32-byte (256-bit) SIMD vector, you can fit 4 double-precision elements.
Per core per clock, Haswell/Broadwell can begin up to:
two FP math instructions (FMA/MUL/ADD), up to one of which can be addition.
FMA counts as 2 FLOPs per element, MUL/ADD only count as 1 each.
on up to 32 byte wide inputs (e.g. 4 doubles or 8 floats)

How does one compute FLOPS from time elapsed for a computation?

One can see from this tutorial on the usage of Intel MKL DFTs that Dr. Andrey E. Vladimirov uses the time elapsed during a task, namely t1-t0, to compute the number of GigaFLOPS using GF/s = HztoPerf/(t1-t0) where HztoPerf = 5.0 * 1e-9 * double(fft_size) * log2(double(fft_size)) * double(num_fft).
Is this a general formula? If not, how do I deduce the average GF/s for my CPU (Intel Xeon E5-1660 at 3 GHz with 8 cores) if I know the time elapsed to run a computation (e.g. involving various FFTs)?
You have to know how many FP operations your problem requires. Then you divide that by time.
1e-9 accounts for the Giga = 10^9 metric prefix. Without that, you'd have FLOP/s not GFLOP/s if you divide FLoating point OPeration count by seconds.
5.0 * fft_size * log2(fft_size) appears to be the number of FP ops per FFT.
An efficient FFT is O(n log2(n)), and apparently this implementation has a constant factor of 5. (Or possibly that's including some work done using the result?)
num_fft is presumably the total number of FFTs of that size done, i.e. the repeat count. So the product of all those things is the number of FP ops actually done during computation of the FFT.
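As a rough sketch, the same calculation in Python looks like this. The FFT size, repeat count, and elapsed time are made-up example values, and the constant 5 is taken from the tutorial's formula, not derived here.

```python
import math


def fft_flop_count(fft_size, num_fft, constant=5.0):
    """Estimated FP operations: constant * n * log2(n) per FFT, times the repeat count."""
    return constant * fft_size * math.log2(fft_size) * num_fft


def gflops(flop_count, seconds):
    """Divide the operation count by elapsed time; 1e-9 converts FLOP/s to GFLOP/s."""
    return flop_count * 1e-9 / seconds


# Hypothetical run: 1000 FFTs of size 1024 finishing in 10 ms (t1 - t0 = 0.01)
flops = fft_flop_count(1024, 1000)
rate = gflops(flops, 0.01)
print(f"{flops:.0f} FP ops, {rate:.2f} GFLOP/s")
```

For a computation that is not a plain FFT, only the operation count changes; the division by elapsed time stays the same.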
Hardware performance counters on Intel CPUs can record number of FLOPs (even counting FMAs as 2): there are events like fp_arith_inst_retired.256b_packed_double for various SIMD widths.
perf has a GFLOPs "metric group" you can use that enables the relevant events and calculates it for you:
perf stat --all-user -M GFLOPs ./my_program my args
Counting only in user-space is probably redundant; kernel code might use SIMD for software RAID5/6, but not in interrupt handlers and probably not system calls. And not FP math.
Example on my i7-6700k Skylake
$ perf stat --all-user -M GFLOPs awk 'BEGIN{for(i=0;i<100000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<100000000;i++){}}':
0 fp_arith_inst_retired.256b_packed_single # 0.03 GFLOPs (66.58%)
99,934,901 fp_arith_inst_retired.scalar_double (66.68%)
0 fp_arith_inst_retired.128b_packed_single (66.71%)
0 fp_arith_inst_retired.scalar_single (66.71%)
0 fp_arith_inst_retired.256b_packed_double (66.71%)
0 fp_arith_inst_retired.128b_packed_double (66.62%)
3,352,766,500 ns duration_time
3.352766500 seconds time elapsed
3.347268000 seconds user
0.000000000 seconds sys
Unfortunately it had to multiplex between those events, since there were more than 4, and hyperthreading is enabled, so the total number of scalar double-precision FP operations (99,934,901) was measured a bit lower than the awk loop iteration count. With just -e task_clock,cycles,instructions,fp_arith_inst_retired.scalar_double, it came out at exactly 100,000,000 counts, since apparently gawk did no other FP operations.
Of course awk is not a high-FP-throughput program, and only used scalar FP math. Numeric variables in awk are double-precision, like JavaScript, but unlike JS it doesn't JIT, let alone take advantage of the ability to do them as integer.

Building a Roofline Model

I'm trying to build a roofline model for a node in a supercomputer that I'm running simulations on. The node has 2x Intel Xeon E5-2650 v2 (Ivy Bridge) 8 core 2.6 GHz processors (16 cores per node), with 64GB RAM total (4GB each). The maximum memory bandwidth for the Intel Xeon E5-2650 is shown here as 59.7 GB/s.
Achieved GFLOPS = max mem bandwidth x arithmetic intensity.
Max GFLOPS = num cores x clock frequency in GHz x ops/cycle.
My code has arithmetic intensity of 1/3 and uses double precision floating point.
Here are my calculations for calculating the peak GFLOPs for the different types of program:
Sequential program (single core) no vectorisation:
1x2.6x1 (I assume without vectorisation, we can only achieve 1 op/cycle?) = 2.6 GFLOPs
Sequential program (single core) with vectorisation (SSE):
1x2.6x8 = 20.8 GFLOPs
All cores on one Xeon with vectorisation (SSE):
8x2.6x8 = 166.4 GFLOPs
All cores on both Xeons with vectorisation (SSE):
2x 8x2.6x8 = 332.8 GFLOPs
How does the memory bandwidth available to the program change between the different types of program shown above? I know that the max memory bandwidth for one Xeon E5-2650 is 59.7 GB/s, but is this achievable on a single core? Does this become 119.4 GB/s with two Xeon E5-2650s?
So would the achieved GFLOPs (using peak bandwidth x arithmetic intensity) be:
Sequential program w/o vectorisation:
59.7 * 1/3 = 19.9 GFLOPs, however because our roofline is 2.6 GFLOPs, we are limited to 2.6 GFLOPs?
Sequential program with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. This is achievable because our roofline is 20.8 GFLOPs.
One Xeon (using all 8 cores) with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. I am suspicious of this, because surely our parallel program is capable of producing more mem reqs than the sequential program, and surely the sequential program doesn't saturate the memory system?
Two Xeons (total of 16 cores) with vectorisation:
119.4 * 1/3 = 39.8 GFLOPs.
I feel like something is wrong with the achieved GFLOPs, have I made a mistake somewhere?
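For reference, the roofline bound is usually written as the minimum of the compute roof and the memory roof. The sketch below applies that to the four cases in the question. It assumes the arithmetic intensity is in FLOPs per byte and that every configuration can reach the full quoted bandwidth, which is exactly the open point here, so the bandwidth inputs are placeholders and not measurements.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


INTENSITY = 1 / 3  # FLOPs per byte (units assumed)

# (peak GFLOPS, memory bandwidth in GB/s) as computed in the question
configs = {
    "1 core, scalar": (2.6, 59.7),
    "1 core, vectorised": (20.8, 59.7),
    "8 cores, vectorised": (166.4, 59.7),
    "16 cores, vectorised": (332.8, 119.4),
}

bounds = {name: attainable_gflops(peak, bw, INTENSITY)
          for name, (peak, bw) in configs.items()}

for name, value in bounds.items():
    print(f"{name}: {value:.1f} GFLOP/s")
```

If a single core can only sustain part of the socket bandwidth, replacing the bandwidth entry for that row with a measured value (e.g. from a STREAM-style benchmark) gives a more realistic bound.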

How to derive the Peak performance in GFlop/s of Intel Xeon E5-2690?

I was able to find the theoretical DP peak performance 371 GFlop/s for the Xeon E5-2690 in this Processor Comparison (interesting that it is easier to find this information in Intel's competitor than Intel support pages itself). However, when I try to derive that peak performance my derivation doesn't match:
The frequency (in Turbo mode) for each core of the Xeon E5-2690 = 3.8Ghz
The processor can do an add and mul operation per cycle so we get: 3.8 x 2 = 7.6
Given it has AVX support it can do 4 double operations per cycle: 7.6 x 4 = 30.4
Finally, it has 8 cores, therefore we get: 8 x 30.4 = 243.2
Thus, the peak performance in Gflop/s would be 243.2 GFlop/s and not 371 GFlop/s?
Turbo mode is not used to calculate theoretical peak performance. You have to use the base clock, something like:
CPU speed = 2.9 GHz
CPU Cores = 8
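CPU instructions per cycle = 8 (considering AVX-256 -> 256-bit unit, can hold 8 single-precision values) x 2 (add and mul operations, like you said) = 16
Putting it all together:
2.9 x 8 x 16 = 371.2 GFlop/s
Note that this is a single-precision figure. In double precision a 256-bit register holds 4 values, so the same formula gives 2.9 x 8 x 8 = 185.6 GFlop/s.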

Measuring effective bandwidth on CUDA

So I want to know how to calculate the total memory effective bandwidth for:
cublasSdot(handle, M, devPtrA, 1, devPtrB, 1, &curesult);
where that function belongs to cublas_v2.h
That function runs in 0.46 ms, and the vectors are 10000 * sizeof(float)
Do I get ((10000 * 4) / 10^9 )/0.00046 = 0.087 GB/s?
I'm asking because I don't know what is inside the cublasSdot function, and I don't know if it is necessary to know.
In your case, the size of the input data is 10000 * 4 * 2 since you have 2 input vectors, and the size of the output data is 4. The effective bandwidth should be about 0.174 GB/s.
Basically cublasSdot() does nothing much more than computing.
Profile result shows cublasSdot() invokes 2 kernels to compute the result. An extra 4-bytes device-to-host mem transfer is also invoked if the pointer mode is CUBLAS_POINTER_MODE_HOST, which is the default mode for cublas lib.
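If the kernel time is in ms, then a factor of 1000 is needed to convert it to seconds.
The calculation in the question already does this (0.46 ms = 0.00046 s), so the result is about 0.087 GB/s (87 MB/s) per vector, not 86 GB/s.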
As an example, refer to the Matrix Transpose sample provided by NVIDIA
at http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
The entire code is on the last page. The effective bandwidth is computed there as 2.*1000*mem_size/(1024*1024*1024)/(Time in ms)
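The bandwidth arithmetic discussed in this thread can be sketched in a few lines of Python. The helper name is made up, and it uses 10^9 bytes per GB as in the question, whereas the NVIDIA sample divides by 1024^3.

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, time_ms):
    """Effective bandwidth in GB/s: bytes moved divided by elapsed seconds."""
    seconds = time_ms / 1000.0  # kernel timers usually report milliseconds
    return (bytes_read + bytes_written) / 1e9 / seconds


n = 10000            # elements per vector
float_size = 4       # sizeof(float) in bytes

# cublasSdot reads two input vectors and writes one float result
bw = effective_bandwidth_gbs(2 * n * float_size, float_size, 0.46)
print(f"effective bandwidth: {bw:.3f} GB/s")
```

With only 80 KB of data, the 0.46 ms is probably dominated by launch and transfer overhead, so a number this low says little about the GPU's memory system.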

Resources