Related
I'm trying to optimize my code with SIMD ( on ARM CPUs ), and want to know its arithmetic intensity (flops/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating point operations(FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
What all operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "not mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
For a whole CPU, not just a single core: multiply by number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.)
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfma132ps instructions per clock, because FP abs is not a difficult operation.
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
Although people would think it's silly if theoretical peak flops is much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes. e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs. e.g. if Haswell had four FMA execution units, so it could only sustain max FLOPs if literally every instruction was an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practise to only measure FLOPs on the hotspots of your code, for example, the number of Floating Point Multiply & Accumulate operations in Convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in A4.13 in ARMv7 Reference Manual fall under a Floating Point Operation as a FLOPs/Cycle for an FPU instruction is typically constant in a processor.
Not just ARM, but many micro-processors have a dedicated Floating Point Unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous piece of code but doesn't say much about your code per se.
When I try to optimize my code, for a very long time I've just been using a rule of thumb that addition and subtraction are worth 1, multiplication and division are worth 3, squaring is worth 3 (I rarely use the more general pow function so I have no rule of thumb for it), and square roots are worth 10. (And I assume squaring a number is just a multiplication, so worth 3.)
Here's an example from a 2D orbital simulation. To calculate and apply acceleration from gravity, first I get distance from the ship to the center of earth, then calculate the acceleration.
D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D); // this is worth 9, total is 28
However, notice that in calculating D, you take a square root, but when using it in the next calculation, you square it. Therefore you can just do this:
A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15
So if my rule of thumb is true, I almost cut in half the cycle time.
However, I cannot even remember where I heard that rule before. I'd like to ask what is the actual cycle times for those basic arithmetic operations?
Assumptions:
everything is a 64-bit floating number in x64 architecture.
everything is already loaded into registers, so no worrying about hits and misses from caches or memory.
no interrupts to the CPU
no if/branching logic such as look ahead prediction
Edit: I suppose what I'm really trying to do is look inside the ALU and only count the cycle time of its logic for the 6 operations. If there is still variance within that, please explain what and why.
Note: I did not see any tags for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine code operations in x64 architecture. Thus it doesn't matter whether those lines of code I wrote are in C#, C, Javascript, whatever. I'm sure each high-level language will have its own varying times so I don't wanna get into an argument over that. I think it's a shame that there's no machine code tag because when talking about performance and/or operation, you really need to get down into it.
At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.
Latency
The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
Throughput
The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, 1 integer multiplication can execute, on average per cycle, even though any particular multiplication takes 3 cycles to complete (meaning that you must have multiple independent multiplications in progress at once to achieve this).
Inverse Throughput
When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to directly compare with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that you if you have sufficient independent additions, they use only something like 0.25 cycles each.
Below I'll use inverse throughput.
Variable Timings
Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.
The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.
Passing denormal inputs, or results that themselves are denormal may incur significant additional cost depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you might be able to find a discussion of denormal costs per operation elsewhere.
More Details
The above is necessary but often not sufficient information to fully qualify performance, since you have other factors to consider such as execution port contention, front-end bottlenecks, and so on. It's enough to start though and you are only asking for "rule of thumb" numbers if I understand it correctly.
Agner Fog
My recommended source for measured latency and inverse throughput numbers are Agner's Fogs guides. You want the files under 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, which lists fairly exhaustive timings on a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.
Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.
Intel Skylake
Lat Inv Tpt
add/sub (addsd, subsd) 4 0.5
multiply (mulsd) 4 0.5
divide (divsd) 13-14 4
sqrt (sqrtpd) 15-16 4-6
So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.
AMD Ryzen
Lat Inv Tpt
add/sub (addsd, subsd) 3 0.5
multiply (mulsd) 4 0.5
divide (divsd) 8-13 4-5
sqrt (sqrtpd) 14-15 4-8
The Ryzen numbers are broadly similar to recent Intel. Addition and subtraction are slightly lower latency, multiplication is the same. Latency-wise, the rule of thumb could still generally be summarized as 1/3/4 for add,sub,mul/div/sqrt, with some loss of precision.
Here, the latency range for divide is fairly large, so I expect it is data dependent.
Just some silly musings, but if computers were able to efficiently calculate 256 bit arithmetic, say if they had a 256 bit architecture, I reckon we'd be able to do away with floating point. I also wonder, if there'd be any reason to progress past 256 bit architecture? My basis for this is rather flimsy, but I'm confident that you'll put me straight if I'm wrong ;) Here's my thinking:
You could have a 256 bit type that used the 127 or 128 bits for integers, 127 or 128 bits for fractional values, and then of course a sign bit. If you had hardware that was capable of calculating, storing and moving such big numbers with no problems, I reckon you'd be set to handle any calculation you'd come across.
One example: If you were working with lengths, and you represented all values in meters, then the minimum value (2^-128 m) would be smaller than the planck length, and the biggest value (2^127 m) would be bigger than the diameter of the observable universe. Imagine calculating light-years of distances with a precision smaller than a planck length?
Ok, that's only one example, but I'm struggling to think of any situations that could possibly warrant bigger and smaller numbers than that. Any thoughts? Are there possible problems with fixed point arithmetic that I haven't considered? Are there issues with creating a 256 bit architecture?
SIMD will make narrow types valuable forever. If you can do a 256bit add, you can do eight 32bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8bit adds.
Hardware multiplier circuits are a lot more expensive to make wider, so it's not a good assumption to assume that a 256b X 256b multiplier will be practical to build.
Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.
So 4B float will continue to be excellent for being precise enough to be useful, but small enough to pack many elements into a big vector, or in cache.
Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2-1022 to 21023, for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2-127 (with 1 bit of precision) to 2127 (with 256b of precision).
Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)
Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.
AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).
AVX512IFMA introduces a 52b * 52b + 64b => 64bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52 bits input precision is clearly so they can use the FP mantissa multiplier hardware, instead of requiring separate 64bit integer multipliers. (A full vector width of 64bit full-multipliers would be even more expensive than vpmullq. A compromise design like this even for 64bit integers should be a big hint that wide multipliers are expensive). Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.
Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.
For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.
Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where it is used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.
The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.
Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)
128 bit computers. It is also about addressing memory and when we run out 64-bits when addressing memory. Currently there are servers with 4TB memory. That requires about 42 bits (2^42 > 4 x 10^12). If we assume that memory prices halves every second year then we need one bit more every second year. We still have 22 bits left so at least 2 * 22 years and it is likely that memory prices are not dropping that fast -> more than 50 years when we run out of 64-bits addressing capabilities.
I have a device providing the peak GFLOPS specs and I want to measure how far my program is away from it. Since all the data I used was double precision, should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?
No. 1 double-precision floating-point operation is still one floating-point operation.
Most GPUs process double-precision data slower than single-precision, so there should be two specifications of peak GFLOPS. One peak single-precision GFLOPS spec, and one peak double-precision GFLOPS spec. Sometimes it is broken done further, so that (for example) peak division performance is listed separately from peak addition performance.
" ... , should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?"
No, not for any (but one) of these Cards: http://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing/ .
Note that the ratio varies from 1/24th to as good as 1/3 in most cases, also note that the 'Workstation Graphics Card' has a ratio 1/2 - it is specifically designed that way to improve DP performance.
You need to read the Specs for the Hardware in your Card and determine what performance hit you should expect from switching to DP from SP. There will be a small additional amount of overhead to load the additional precision into the Registers (Memory where the Hardware will perform the Operation on) and to retrieve the additional precision after each Operation.
On most moder 64-bit processors (such as Intel Core 2 Duo or the Intel i7 series), does the speed of the x86_64 command mulq and its variants depend on the operands? For example, will multiplying 11 * 13 be faster than 11111111 * 13131313? Or does it always take the time of the worst case?
TL;DR: No. Constant-length integer math operations (barring division, which is non-linear) consume a constant number of cycles, regardless of the numerical value of the operands.
mulq takes two QWORD arguments.
The values are represented in little-endian binary format (used by x86 architecture) as follows:
1011000000000000000000000000000000000000000000000000000000000000 = 13
1000110001111010000100110000000000000000000000000000000000000000 = 13131313
The processor sees both of these as the same "size", as both are 64-bit values.
Therefore, the cycle count should always be the same, regardless of the actual numerical value of the operands.
More info:
There are the concepts of Leading Zero Anticipation and Leading Zero Detection[1][2] (LZA/LZD) that can be employed to speed up floating-point operations.
To the best of my knowledge however, there are no mainstream processors that employ either of these methods towards integer arithmetic. This is most likely due to the simplistic nature of most integer arithmetic (multiplication in this case). The overhead of LZA/LZD may simply not be worth it, for simple integer math circuits that can complete the full multiplication in less time anyhow.
I don't have any reference to hand, but I would place money on the latency/throughput being invariant of the values of the operands. Otherwise, it would be a nightmare to schedule.
For decades, Agner Fog has been publishing tables of instruction timings. His August 2019 tables confirm what I had expected: that the CPU chips in modern laptops and desktop computers have invariant timing for their integer-multiply units. These are extremely fast and rather power-hungry.
The CPU design space is quite different for battery-limited devices such as smartphones. On such devices, the integer multiply may be implemented in a microcoded loop with variable timing.
In (approx) 2016, Thomas Pornin had this to say about "the problem" posed by variable-latency multiplication instructions to the design of his SSL/TLS library:
"... integer multiplication opcodes in CPU may or may not execute in constant time; when they do not, implementations that use such operations may exhibit execution time variations that depend on the involved data, thereby potentially leaking secret information... When a CPU has non-constant-time multiplication opcodes, the execution time depends on the size (as integers) of one or both of the operands. For instance, on the ARM Cortex-M3, the umull opcode takes 3 to 5 cycles to complete, the `short' counts (3 or 4) being taken only if both operands are numerically less than 65536 ... In general, Intel x86 CPU all provide constant-time multiplications since the days of the first Pentium. The same does not necessarily hold for other vendors, in particular the early VIA Nano." 2