Calculation of GFLOPS for double precision - performance

I have a device whose peak GFLOPS is specified, and I want to measure how far my program is from that peak. Since all the data I used was double precision, should I multiply the number of ops by 2 to get the GFLOPS value and do the comparison?

No. One double-precision floating-point operation is still one floating-point operation.
Most GPUs process double-precision data slower than single-precision, so there should be two specifications of peak GFLOPS: one peak single-precision GFLOPS spec, and one peak double-precision GFLOPS spec. Sometimes it is broken down further, so that (for example) peak division performance is listed separately from peak addition performance.

" ... , should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?"
No, not for any (but one) of these Cards: http://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing/ .
Note that the ratio varies from 1/24th to as good as 1/3 in most cases, also note that the 'Workstation Graphics Card' has a ratio 1/2 - it is specifically designed that way to improve DP performance.
You need to read the Specs for the Hardware in your Card and determine what performance hit you should expect from switching to DP from SP. There will be a small additional amount of overhead to load the additional precision into the Registers (Memory where the Hardware will perform the Operation on) and to retrieve the additional precision after each Operation.
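To make the comparison concrete, here is a minimal sketch: the peak value and the loop are placeholders, you would count the FP ops of your own kernel analytically or with a profiler, take the DP peak from your card's spec sheet, and never double the count just because the data is double precision.

// Sketch: compare measured DP throughput against the advertised DP peak.
// PEAK_DP_GFLOPS is a placeholder - take it from your hardware's spec sheet.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const double PEAK_DP_GFLOPS = 1000.0;       // hypothetical spec-sheet value
    const std::size_t n = 1 << 22;
    std::vector<double> a(n, 1.5), b(n, 2.5), c(n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        c[i] += a[i] * b[i];                    // 1 mul + 1 add = 2 FLOPs, not 4
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gflops  = 2.0 * n / seconds / 1e9;   // ops counted once, even in DP
    std::printf("achieved %.2f DP GFLOPS = %.1f%% of peak\n",
                gflops, 100.0 * gflops / PEAK_DP_GFLOPS);
}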

Related

Desired Compute-To-Memory-Ratio (OP/B) on GPU

I am trying to understand the architecture of GPUs and how we assess the performance of our programs on the GPU. I know that an application can be:
Compute-bound: performance limited by the FLOPS rate. The processor's cores are fully utilized (always have work to do).
Memory-bound: performance limited by the memory bandwidth. The processor's cores are frequently idle because memory cannot supply data fast enough.
The image below shows the FLOPS rate, peak memory bandwidth, and the desired compute-to-memory ratio, labeled OP/B, for each microarchitecture.
I also have an example of how to compute this OP/B metric. Below is part of a CUDA kernel for matrix-matrix multiplication:
for(unsigned int i = 0; i < N; ++i) {
sum += A[row*N + i]*B[i*N + col];
}
and the way to calculate OP/B for this matrix-matrix multiplication is as follows:
Matrix multiplication performs 0.25 OP/B
1 FP add and 1 FP mul for every 2 FP values (8B) loaded
Ignoring stores
and if we want to utilize this:
But matrix multiplication has high potential for reuse. For NxN matrices:
Data loaded: (2 input matrices)×(N^2 values)×(4 B) = 8N^2 B
Operations: (N^2 dot products) × (N adds + N muls each) = 2N^3 OP
Potential compute-to-memory ratio: 0.25N OP/B
So if I understand this correctly, I have the following questions:
Is it always the case that the greater the OP/B, the better?
How do we know how many FP operations we have? Is it the adds and the multiplications?
How do we know how many bytes are loaded per FP operation?
Is it always the case that the greater the OP/B, the better?
Not always. The target value balances the load on compute pipe throughput and memory pipe throughput (i.e. that level of op/byte means that both pipes will be fully loaded). As you increase op/byte beyond that level, your code will switch from balanced to compute-bound. Once your code is compute-bound, performance will be dictated by the compute pipe that is the limiting factor. Additional op/byte increases beyond this point may have no effect on performance.
How do we know how many FP operations we have? Is it the adds and the multiplications?
Yes, for the simple code you have shown, it is the adds and multiplies. Other more complicated codes may have other factors (e.g. sin, cos, etc.) which may also contribute.
As an alternative to "manually counting" the FP operations, the GPU profilers can indicate the number of FP ops that a code has executed.
How do we know how many bytes are loaded per FP operation?
Similar to the previous question, for simple codes you can "manually count". For complex codes you may wish to try to use profiler capabilities to estimate. For the code you have shown:
sum += A[row*N + i]*B[i*N + col];
The values from A and B have to be loaded. If they are float quantities then they are 4 bytes each. That is a total of 8 bytes. That line of code will require 1 floating point multiplication (A * B) and one floating point add operation (sum +=). The compiler will fuse these into a single instruction (fused multiply-add) but the net effect is you are performing two floating point operations per 8 bytes. op/byte is 2/8 = 1/4. The loop does not change the ratio in this case. To increase this number, you would want to explore various optimization methods, such as a tiled shared-memory matrix multiply, or just use CUBLAS.
(Operations like row*N + i are integer arithmetic and don't contribute to the floating-point load, although it's possible they may be significant, performance-wise.)
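For reference, the blocking idea behind a "tiled shared-memory matrix multiply" can be sketched as follows. This is a CPU-side illustration of the reuse principle, not the CUDA kernel itself, and the tile size is an arbitrary example value rather than a tuned one.

// Sketch of the reuse behind a tiled matrix multiply: once a TILE x TILE block
// is resident (in shared memory on a GPU, in cache here), each loaded value is
// reused ~TILE times, pushing op/byte from 0.25 toward 0.25*TILE.
#include <vector>

constexpr int TILE = 32;   // illustrative tile size, not a tuned value

void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int N) {   // C must be zero-initialized
    for (int ii = 0; ii < N; ii += TILE)
      for (int kk = 0; kk < N; kk += TILE)
        for (int jj = 0; jj < N; jj += TILE)
          for (int i = ii; i < ii + TILE && i < N; ++i)
            for (int k = kk; k < kk + TILE && k < N; ++k) {
              float a = A[i * N + k];               // loaded once, reused TILE times
              for (int j = jj; j < jj + TILE && j < N; ++j)
                C[i * N + j] += a * B[k * N + j];   // 1 mul + 1 add per element
            }
}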

What is the definition of Floating Point Operations (FLOPs)?

I'm trying to optimize my code with SIMD (on ARM CPUs), and want to know its arithmetic intensity (FLOPs/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating point operations (FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
The question "What all operations does FLOPS include?" is mainly about mathematically complex operations.
I also want to know the standard way to deal with "non-mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
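To illustrate why it is trivial, FP absolute value can be written as a single bitwise AND-NOT that clears the sign bit; a minimal SSE sketch:

// FP absolute value is a bitwise AND-NOT that clears the sign bit: no
// exponent/significand handling, which is why it isn't counted as a FLOP.
#include <xmmintrin.h>

__m128 abs_ps(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);   // only the sign bit set
    return _mm_andnot_ps(sign_mask, x);            // x & ~sign_bit
}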
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
(For a whole CPU, not just a single core: multiply by the number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.)
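Plugging in the per-core Haswell numbers above, the peak is just the product of those factors. In this sketch the 3.0 GHz sustained all-core clock and the 4-core count are assumed example values, not measured specs.

// Peak FLOP/s = vector width * FMAs per clock * FLOPs per FMA * clock * cores.
// The 3.0 GHz sustained all-core clock and 4 cores are assumed example values.
#include <cstdio>

int main() {
    double elems_per_vec = 8;    // 256-bit vector of float
    double fma_per_clock = 2;    // two FMA units per core
    double flops_per_fma = 2;    // mul + add
    double ghz           = 3.0;  // assumed sustained all-core clock
    double cores         = 4;

    double gflops = elems_per_vec * fma_per_clock * flops_per_fma * ghz * cores;
    std::printf("theoretical peak: %.0f single-precision GFLOP/s\n", gflops);  // 384
}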
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfmadd132ps instructions per clock, because FP abs is not a difficult operation.
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
People would think it's silly, though, if theoretical peak FLOPS were much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes - e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs, or if Haswell had four FMA execution units, so it could only sustain max FLOPs if literally every instruction was an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and a 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practice to only measure FLOPs on the hotspots of your code, for example the number of floating point multiply-accumulate operations in a convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in section A4.13 of the ARMv7 Reference Manual count as floating point operations, since FLOPs/cycle for an FPU instruction is typically constant in a processor.
Not just ARM, but many micro-processors have a dedicated Floating Point Unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous pieces of code, but they don't say much about your code per se.

What are the relative cycle times for the 6 basic arithmetic operations?

When I try to optimize my code, for a very long time I've just been using a rule of thumb that addition and subtraction are worth 1, multiplication and division are worth 3, squaring is worth 3 (I rarely use the more general pow function so I have no rule of thumb for it), and square roots are worth 10. (And I assume squaring a number is just a multiplication, so worth 3.)
Here's an example from a 2D orbital simulation. To calculate and apply acceleration from gravity, first I get distance from the ship to the center of earth, then calculate the acceleration.
D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D); // this is worth 9, total is 28
However, notice that in calculating D, you take a square root, but when using it in the next calculation, you square it. Therefore you can just do this:
A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15
So if my rule of thumb is true, I almost cut in half the cycle time.
However, I cannot even remember where I heard that rule before. I'd like to ask: what are the actual cycle times for those basic arithmetic operations?
Assumptions:
everything is a 64-bit floating-point number on an x64 architecture.
everything is already loaded into registers, so no worrying about hits and misses from caches or memory.
no interrupts to the CPU
no if/branching logic such as look ahead prediction
Edit: I suppose what I'm really trying to do is look inside the ALU and only count the cycle time of its logic for the 6 operations. If there is still variance within that, please explain what and why.
Note: I did not see any tags for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine code operations in x64 architecture. Thus it doesn't matter whether those lines of code I wrote are in C#, C, Javascript, whatever. I'm sure each high-level language will have its own varying times so I don't wanna get into an argument over that. I think it's a shame that there's no machine code tag because when talking about performance and/or operation, you really need to get down into it.
At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.
Latency
The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
Throughput
The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, 1 integer multiplication can execute per cycle on average, even though any particular multiplication takes 3 cycles to complete (meaning that you must have multiple independent multiplications in flight at once to achieve this).
Inverse Throughput
When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to compare directly with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that if you have sufficient independent additions, they cost only something like 0.25 cycles each.
Below I'll use inverse throughput.
Variable Timings
Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.
The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.
Passing denormal inputs, or producing results that are themselves denormal, may incur significant additional cost depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you might be able to find a discussion of denormal costs per operation elsewhere.
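If denormals do turn out to be a cost, x86 lets you flush them to zero. This is a sketch of setting those modes; whether that is acceptable depends on your numerical requirements.

// Flush-to-zero and denormals-are-zero avoid the slow path for denormal
// results/inputs.  This changes numerical behaviour, so it is an option to
// evaluate, not a default.
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}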
More Details
The above is necessary but often not sufficient information to fully qualify performance, since you have other factors to consider such as execution port contention, front-end bottlenecks, and so on. It's enough to start though and you are only asking for "rule of thumb" numbers if I understand it correctly.
Agner Fog
My recommended source for measured latency and inverse throughput numbers is Agner Fog's set of guides. You want the files under 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, which lists fairly exhaustive timings on a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.
Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.
Intel Skylake
                        Lat    Inv Tpt
add/sub (addsd, subsd)   4      0.5
multiply (mulsd)         4      0.5
divide (divsd)         13-14    4
sqrt (sqrtpd)          15-16   4-6
So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.
AMD Ryzen
                        Lat    Inv Tpt
add/sub (addsd, subsd)   3      0.5
multiply (mulsd)         4      0.5
divide (divsd)          8-13   4-5
sqrt (sqrtpd)          14-15   4-8
The Ryzen numbers are broadly similar to recent Intel. Addition and subtraction are slightly lower latency, multiplication is the same. Latency-wise, the rule of thumb could still generally be summarized as 1/3/4 for add,sub,mul/div/sqrt, with some loss of precision.
Here, the latency range for divide is fairly large, so I expect it is data dependent.

Floating-point number vs fixed-point number: speed on Intel i5 CPU

I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc.
Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers ? How much speed gain can I get ?
Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations?
I know that SSE (Streaming SIMD Extensions) operates on 4x32 = 8x16 = 128 bits of data at a time, i.e. four 32-bit floating-point values or eight 16-bit fixed-point values. I guess that after converting from 32-bit floating point to 16-bit fixed point, my program would be twice as fast.
Summary: Modern FPU hardware is hard to beat with fixed-point, even if you have twice as many elements per vector.
Modern BLAS libraries are typically very well tuned for cache performance (with cache blocking / loop tiling) as well as for instruction throughput. That makes them very hard to beat. DGEMM in particular has lots of room for this kind of optimization, because it does O(N^3) work on O(N^2) data, so it's worth transposing just a cache-sized chunk of one input, and stuff like that.
What might help is reducing memory bottlenecks by storing your floats in 16-bit half-float format. There is no hardware support for doing math on them in that format, just a couple of instructions to convert between that format and normal 32-bit-element float vectors while loading/storing: VCVTPH2PS (__m256 _mm256_cvtph_ps(__m128i)) and VCVTPS2PH (__m128i _mm256_cvtps_ph(__m256 m1, const int imm8_rounding_control)). These two instructions comprise the F16C extension, first supported by AMD Bulldozer and Intel IvyBridge.
IDK if any BLAS libraries support that format.
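A minimal sketch of that storage trick, assuming a CPU with F16C (compile with something like -mf16c); only the conversions get hardware support, the arithmetic still happens on 32-bit floats.

// Store 8 values as 16-bit halves; convert to 32-bit floats, do the math,
// convert back.  Only the conversions (F16C) have hardware support here.
#include <immintrin.h>
#include <cstdint>

void scale_halves(std::uint16_t* halves, float factor) {   // exactly 8 elements
    __m128i packed = _mm_loadu_si128(reinterpret_cast<const __m128i*>(halves));
    __m256  f      = _mm256_cvtph_ps(packed);                       // VCVTPH2PS
    f              = _mm256_mul_ps(f, _mm256_set1_ps(factor));
    __m128i back   = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT |
                                        _MM_FROUND_NO_EXC);         // VCVTPS2PH
    _mm_storeu_si128(reinterpret_cast<__m128i*>(halves), back);
}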
Fixed point:
SSE/AVX doesn't have any integer division instructions. If you're only dividing by constants, you might not need a real div instruction, though. So that's one major stumbling block for fixed point.
Another big downside of fixed point is the extra cost of shifting to correct the position of the decimal (binary?) point after multiplies. That will eat into any gain you could get from having twice as many elements per vector with 16-bit fixed point.
SSE/AVX actually has quite a good selection of packed 16-bit multiplies (better than for any other element size). There's packed multiply producing the low half, the high half (signed or unsigned), and even one that takes 16 bits from 2 bits below the top, with rounding (PMULHRSW). Skylake runs those at two per clock, with 5 cycle latency. There are also integer multiply-add instructions, but they do a horizontal add between pairs of multiply results. (See Agner Fog's insn tables, and also the x86 tag wiki for performance links.) Haswell and earlier don't have as many integer-vector add and multiply execution units. Often code bottlenecks on total uop throughput, not on a specific execution port anyway. (But a good BLAS library might even have hand-tuned asm.)
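As a sketch of how that instruction is used for Q15 fixed-point (via the _mm_mulhrs_epi16 intrinsic, SSSE3), the rescaling shift is built into the operation:

// Q15 fixed-point multiply of 8 lanes: pmulhrsw computes (a*b + 0x4000) >> 15,
// i.e. the product rescaled back into Q15 with rounding, so no separate shift
// instruction is needed afterwards.
#include <tmmintrin.h>   // SSSE3

__m128i q15_mul(__m128i a, __m128i b) {
    return _mm_mulhrs_epi16(a, b);
}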
If your inputs and outputs are integer, it's often faster to work with integer vectors, instead of converting to floats. (e.g. see my answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?, where I used 16-bit fixed-point to deal with 8-bit integers).
But if you're really working with floats, and have a lot of multiplying and dividing to do, just use the hardware FPUs. They're amazingly powerful in modern CPUs, and have made fixed-point mostly obsolete for many tasks. As @Iwill points out, FMA instructions are another big boost for FP throughput (and sometimes latency).
Integer add/subtract/compare instructions (but not multiply) are also lower latency than their FP counterparts.

256 bit fixed point arithmetic, the future?

Just some silly musings, but if computers were able to efficiently calculate 256 bit arithmetic, say if they had a 256 bit architecture, I reckon we'd be able to do away with floating point. I also wonder, if there'd be any reason to progress past 256 bit architecture? My basis for this is rather flimsy, but I'm confident that you'll put me straight if I'm wrong ;) Here's my thinking:
You could have a 256 bit type that used the 127 or 128 bits for integers, 127 or 128 bits for fractional values, and then of course a sign bit. If you had hardware that was capable of calculating, storing and moving such big numbers with no problems, I reckon you'd be set to handle any calculation you'd come across.
One example: If you were working with lengths, and you represented all values in meters, then the minimum value (2^-128 m) would be smaller than the Planck length, and the biggest value (2^127 m) would be bigger than the diameter of the observable universe. Imagine calculating light-years of distance with precision smaller than a Planck length?
Ok, that's only one example, but I'm struggling to think of any situations that could possibly warrant bigger and smaller numbers than that. Any thoughts? Are there possible problems with fixed point arithmetic that I haven't considered? Are there issues with creating a 256 bit architecture?
SIMD will make narrow types valuable forever. If you can do a 256bit add, you can do eight 32bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8bit adds.
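A one-line AVX2 sketch of that point: the same 256-bit datapath performs eight independent 32-bit adds because carries stop at element boundaries.

// One 256-bit instruction = eight independent 32-bit integer adds; carries do
// not propagate across the 32-bit element boundaries.
#include <immintrin.h>   // AVX2

__m256i add8x32(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);
}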
Hardware multiplier circuits are a lot more expensive to make wider, so it's not a good assumption to assume that a 256b X 256b multiplier will be practical to build.
Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.
So 4B float will continue to be excellent for being precise enough to be useful, but small enough to pack many elements into a big vector, or in cache.
Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2^-1022 to 2^1023 for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2^-127 (with 1 bit of precision) to 2^127 (with 256b of precision).
Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)
Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.
AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).
AVX512IFMA introduces a 52b * 52b + 64b => 64bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52 bits input precision is clearly so they can use the FP mantissa multiplier hardware, instead of requiring separate 64bit integer multipliers. (A full vector width of 64bit full-multipliers would be even more expensive than vpmullq. A compromise design like this even for 64bit integers should be a big hint that wide multipliers are expensive). Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.
Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.
For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.
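For a feel of what 256-bit math looks like on today's 64-bit hardware, an add is just a short add-with-carry chain; a sketch using the _addcarry_u64 intrinsic available in GCC/Clang/MSVC:

// A 256-bit add out of four 64-bit limbs using add-with-carry.  This explicit
// carry chain is exactly what SIMD element boundaries deliberately omit.
#include <immintrin.h>   // _addcarry_u64

void add256(const unsigned long long a[4], const unsigned long long b[4],
            unsigned long long out[4]) {
    unsigned char c = 0;
    c = _addcarry_u64(c, a[0], b[0], &out[0]);
    c = _addcarry_u64(c, a[1], b[1], &out[1]);
    c = _addcarry_u64(c, a[2], b[2], &out[2]);
    (void)_addcarry_u64(c, a[3], b[3], &out[3]);
}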
Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where they are used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.
The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.
Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)
128-bit computers: this is also about addressing memory, and about when we run out of 64 bits for addressing it. Currently there are servers with 4 TB of memory; that requires about 42 address bits (2^42 > 4 x 10^12). If we assume that memory prices halve every second year, then we need one more bit every second year. We still have 22 bits left, so at least 2 * 22 = 44 years, and memory prices are likely not dropping that fast - so more than 50 years until we run out of 64-bit addressing capability.
