256 bit fixed point arithmetic, the future? - performance

Just some silly musings, but if computers were able to efficiently calculate 256 bit arithmetic, say if they had a 256 bit architecture, I reckon we'd be able to do away with floating point. I also wonder, if there'd be any reason to progress past 256 bit architecture? My basis for this is rather flimsy, but I'm confident that you'll put me straight if I'm wrong ;) Here's my thinking:
You could have a 256 bit type that used 127 or 128 bits for integers, 127 or 128 bits for fractional values, and then of course a sign bit. If you had hardware that was capable of calculating, storing and moving such big numbers with no problems, I reckon you'd be set to handle any calculation you'd come across.
One example: If you were working with lengths, and you represented all values in meters, then the minimum value (2^-128 m) would be smaller than the Planck length, and the biggest value (2^127 m) would be bigger than the diameter of the observable universe. Imagine calculating light-years of distances with a precision smaller than a Planck length?
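A quick sanity check of that claim, as a rough sketch. The two physical constants are approximate values I'm assuming here (they are not from the post), and the split of 127 integer bits / 128 fractional bits is just one reading of the proposal.

```python
from fractions import Fraction

# Assumed approximate constants (not from the original post).
PLANCK_LENGTH_M = Fraction(1616, 10**38)   # about 1.616e-35 m
OBSERVABLE_UNIVERSE_M = 88 * 10**25        # about 8.8e26 m (diameter)

# One reading of the proposal: sign bit, 127 integer bits, 128 fractional bits.
INTEGER_BITS = 127
FRACTION_BITS = 128

smallest_step = Fraction(1, 2**FRACTION_BITS)  # 2^-128 m, the resolution
upper_bound = 2**INTEGER_BITS                  # 2^127 m, just above the largest value

print("resolution  :", float(smallest_step), "m")
print("upper bound :", float(upper_bound), "m")
print("finer than Planck length :", smallest_step < PLANCK_LENGTH_M)
print("wider than the universe  :", upper_bound > OBSERVABLE_UNIVERSE_M)
```

Exact rationals are used so the comparison itself doesn't depend on floating point.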
Ok, that's only one example, but I'm struggling to think of any situations that could possibly warrant bigger and smaller numbers than that. Any thoughts? Are there possible problems with fixed point arithmetic that I haven't considered? Are there issues with creating a 256 bit architecture?

SIMD will make narrow types valuable forever. If you can do a 256bit add, you can do eight 32bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8bit adds.
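To illustrate "not propagating carry across element boundaries", here is a minimal software sketch that treats one 256-bit integer as eight 32-bit lanes. Real SIMD hardware simply cuts the carry chain; this emulation (the helper names are made up for illustration) masks off each lane's top bit so a plain wide add can't carry into the next lane, then patches the top bits back in.

```python
LANE_BITS = 32
LANES = 8
LANE_MASK = (1 << LANE_BITS) - 1
# Top bit of every 32-bit lane inside the 256-bit word.
HIGH_BITS = sum(1 << (i * LANE_BITS + LANE_BITS - 1) for i in range(LANES))
LOW_BITS = ((1 << (LANES * LANE_BITS)) - 1) ^ HIGH_BITS

def pack(lanes):
    """Pack eight 32-bit values into one 256-bit integer, lane 0 lowest."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 256-bit integer back into eight 32-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def packed_add32(a, b):
    """Eight independent wrapping 32-bit adds using a single wide add."""
    # With the top bit of each lane cleared, a lane sum cannot overflow its lane.
    partial = (a & LOW_BITS) + (b & LOW_BITS)
    # Top bit of each lane = a_top XOR b_top XOR carry-in (already in partial).
    return partial ^ ((a ^ b) & HIGH_BITS)

x = pack([0xFFFFFFFF, 1, 2, 3, 4, 5, 6, 0x80000000])
y = pack([1, 10, 20, 30, 40, 50, 60, 0x80000000])
print(unpack(packed_add32(x, y)))
```

Lanes 0 and 7 wrap around to 0 without disturbing their neighbours, which is exactly the behaviour a packed 32-bit add gives you.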
Hardware multiplier circuits are a lot more expensive to make wider, so it's not safe to assume that a 256b X 256b multiplier will be practical to build.
Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.
So 4B float will continue to be excellent for being precise enough to be useful, but small enough to pack many elements into a big vector, or in cache.
Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2^-1022 to 2^1023, for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2^-127 (with 1 bit of precision) to 2^127 (with 256b of precision).
Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)
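A small sketch of the difference, assuming the proposed fixed-point format has an absolute step of 2^-128: the relative spacing of doubles stays within a factor of two of 2^-53 at every magnitude, while the relative spacing of fixed point depends entirely on how big the value is. The function names are mine, just for illustration.

```python
import math
import sys

FIXED_STEP = math.ldexp(1.0, -128)   # assumed absolute step of the fixed-point format

def double_relative_spacing(x):
    """Gap between adjacent doubles near x, divided by x (x a positive normal double)."""
    _, exponent = math.frexp(x)            # x = m * 2**exponent with 0.5 <= m < 1
    ulp = math.ldexp(1.0, exponent - 53)   # 53-bit significand
    return ulp / x

def fixed_relative_spacing(x):
    """Gap between adjacent fixed-point values, divided by x."""
    return FIXED_STEP / x

samples = (1e-30, 1.0, 1e30)
for x in samples:
    print(x, double_relative_spacing(x), fixed_relative_spacing(x))

# Smallest positive normal double is 2^-1022.
print(sys.float_info.min == math.ldexp(1.0, -1022))
```

Near 1e-30 the fixed-point format is left with only about 9 significant decimal digits, while a double still has about 16.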
Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.
AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).
AVX512IFMA introduces a 52b * 52b + 64b => 64bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52 bits input precision is clearly so they can use the FP mantissa multiplier hardware, instead of requiring separate 64bit integer multipliers. (A full vector width of 64bit full-multipliers would be even more expensive than vpmullq. A compromise design like this even for 64bit integers should be a big hint that wide multipliers are expensive). Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.
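For reference, a scalar model of what one 64-bit lane of those two instructions computes, as I understand the documented semantics (treat it as a sketch, not a spec; the Python function names are just labels): the low 52 bits of each input are multiplied, and either the low or the high 52 bits of the 104-bit product are added to the 64-bit accumulator.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def vpmadd52luq_lane(acc, a, b):
    """acc + low 52 bits of (a[51:0] * b[51:0]), wrapping at 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product & MASK52)) & MASK64

def vpmadd52huq_lane(acc, a, b):
    """acc + high 52 bits of (a[51:0] * b[51:0]), wrapping at 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64

a = 0x000F_FFFF_FFFF_FFFF   # largest 52-bit value
b = 0x000F_FFFF_FFFF_FFFF
lo = vpmadd52luq_lane(0, a, b)
hi = vpmadd52huq_lane(0, a, b)
print(hex(lo), hex(hi))
print(((hi << 52) | lo) == a * b)   # the two halves recombine into the full product
```

The pair gives you a full 52x52 => 104-bit product in two instructions, which is the building block big-integer code wants.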
Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.
For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.
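The basic idea of doing a 256-bit add on 64-bit hardware is a chain of four add-with-carry steps, one per 64-bit limb. A minimal sketch of that chain (in Python for readability, with made-up helper names; on x86-64 this would be one add followed by three adc instructions):

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(value, count=4):
    """Split a non-negative integer into `count` 64-bit limbs, least significant first."""
    return [(value >> (i * LIMB_BITS)) & LIMB_MASK for i in range(count)]

def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (i * LIMB_BITS) for i, limb in enumerate(limbs))

def add256(a_limbs, b_limbs):
    """Add two 4-limb numbers; returns (result limbs, carry out of the top limb)."""
    result = []
    carry = 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry            # what add / adc computes for one limb
        result.append(total & LIMB_MASK)
        carry = total >> LIMB_BITS       # 0 or 1, fed into the next limb
    return result, carry

x = (1 << 200) + (1 << 64) - 1
y = 1
limbs, carry_out = add256(to_limbs(x), to_limbs(y))
print(from_limbs(limbs) == x + y, carry_out)
```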
Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where it is used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.
The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.
Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)

On 128-bit computers: width is also about addressing memory, and about when 64 bits will no longer be enough to address it. Currently there are servers with 4TB of memory. That requires about 42 bits (2^42 > 4 x 10^12). If we assume that memory prices halve every second year, then we need one more bit every second year. We still have 22 bits left, so that is at least 2 * 22 = 44 years, and memory prices are likely not dropping that fast, so it will probably be more than 50 years before we run out of 64-bit addressing capability.
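The arithmetic behind that estimate, spelled out as a tiny sketch. The doubling period is the assumption from the paragraph above, not a measured figure.

```python
ADDRESS_BITS = 64
current_bytes = 4 * 10**12         # a 4 TB server
years_per_doubling = 2             # assumption: memory size doubles every second year

# Number of bits needed to address every byte (addresses 0 .. current_bytes - 1).
bits_needed = (current_bytes - 1).bit_length()
spare_bits = ADDRESS_BITS - bits_needed
years_left = spare_bits * years_per_doubling

print(bits_needed, spare_bits, years_left)
```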

Related

What is the definition of Floating Point Operations ( FLOPs )

I'm trying to optimize my code with SIMD ( on ARM CPUs ), and want to know its arithmetic intensity (flops/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating point operations(FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
What all operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "not mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
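To show how trivial these are, here is a sketch of the three sign-bit tricks on IEEE binary32 bit patterns. It uses Python's struct module to get at the bits; the helper names are mine, and real code would use andps / xorps / orps with a constant mask.

```python
import struct

SIGN_BIT = 0x80000000
ABS_MASK = 0x7FFFFFFF

def float_to_bits(x):
    """Bit pattern of x as an IEEE binary32 value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fabs_and(x):
    return bits_to_float(float_to_bits(x) & ABS_MASK)    # clear the sign bit

def negate_xor(x):
    return bits_to_float(float_to_bits(x) ^ SIGN_BIT)    # flip the sign bit

def make_negative_or(x):
    return bits_to_float(float_to_bits(x) | SIGN_BIT)    # set the sign bit

print(fabs_and(-1.5), negate_xor(2.0), make_negative_or(3.25))
```

None of these ever touch the exponent or significand, which is why they aren't counted as FLOPs.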
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
(For a whole CPU, not just a single core: multiply by number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.)
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfmadd132ps instructions per clock, because FP abs is not a difficult operation.
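The product above as a small sketch. The 3.0 GHz clock and the 4-core count are placeholders I picked for illustration, not the specs of any particular Haswell part.

```python
def peak_flops(simd_lanes, fma_per_clock, flops_per_fma, clock_hz, cores=1):
    """Theoretical peak FLOP/s: lanes x FMA units x 2 x clock x cores."""
    return simd_lanes * fma_per_clock * flops_per_fma * clock_hz * cores

ASSUMED_CLOCK_HZ = 3.0e9   # hypothetical sustained clock
ASSUMED_CORES = 4          # hypothetical core count

flops_per_cycle = peak_flops(8, 2, 2, clock_hz=1)            # single precision, per core
single_core = peak_flops(8, 2, 2, ASSUMED_CLOCK_HZ)
whole_chip = peak_flops(8, 2, 2, ASSUMED_CLOCK_HZ, ASSUMED_CORES)

print(flops_per_cycle, "FLOPs/cycle/core")
print(single_core / 1e9, "GFLOP/s for one core")
print(whole_chip / 1e9, "GFLOP/s for the whole chip")
```

For double precision the vector holds 4 elements instead of 8, so the per-cycle figure halves.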
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
Although people would think it's silly if theoretical peak flops is much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes. e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs. e.g. if Haswell had four FMA execution units, so it could only sustain max FLOPs if literally every instruction was an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and a 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practice to only measure FLOPs on the hotspots of your code, for example, the number of Floating Point Multiply & Accumulate operations in Convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in A4.13 of the ARMv7 Reference Manual count as floating point operations, since the FLOPs/cycle of an FPU instruction is typically constant on a given processor.
Not just ARM, but many micro-processors have a dedicated Floating Point Unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous pieces of code, but they don't say much about your code per se.

Why is the register length static in any CPU

Why is the register length (in bits) that a CPU operates on not dynamically/manually/arbitrarily adjustable? Would it make the computer slower if it was adjustable this way?
Imagine you had an 8-bit integer. If you could adjust the CPU register length to 8 bits, the CPU would only have to go through the first 8 bits instead of extending the 8-bit integer to 64 bits and then going through all 64 bits.
At first I thought you were asking if it was possible to have a CPU with no definitive register size. That makes no sense since the number and size of the registers is a physical property of the hardware and cannot be changed.
However some architectures let the programmer work on a smaller part of a register or pair registers.
The x86 does both for example, with add al, 9 (uses only 8 bits of the 64-bit rax) and div rbx (pairs rdx:rax to form a 128-bit register).
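A rough model of those two instructions on plain integers, to make the "partial" and "paired" behaviour concrete. Flags are not modelled, and the function names are just labels for this sketch.

```python
MASK64 = (1 << 64) - 1

def add_al(rax, imm8):
    """Model of `add al, imm8`: only the low 8 bits of rax change."""
    al = (rax + imm8) & 0xFF                  # 8-bit wrapping add
    return (rax & (MASK64 ^ 0xFF)) | al       # upper 56 bits are preserved

def div_r64(rdx, rax, divisor):
    """Model of `div r64`: divide the 128-bit value rdx:rax by a 64-bit divisor."""
    dividend = (rdx << 64) | rax
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK64:
        # Real hardware raises a divide error when the quotient doesn't fit.
        raise OverflowError("quotient does not fit in 64 bits")
    return quotient, remainder                # new rax, new rdx

print(hex(add_al(0x11223344556677FF, 9)))     # carry out of al does not reach bit 8
print(div_r64(1, 0, 2))                       # 2^64 / 2
```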
The reason this scheme is not more widespread is that it comes with a lot of trade-offs.
More registers means more bits needed to address them, simply put: longer instructions.
Longer instructions mean less code density, more complex decoders and less performance.
Furthermore most elementary operations, like the logic ones, addition and subtraction are already implemented as operating on a full register in a single cycle.
Finally, one execution unit can handle only one instruction at a time, we cannot issue eight 8-bit additions in a 64-bit ALU at the same time.
So there wouldn't be any improvement, neither in latency nor in throughput.
Accessing partial registers is useful for the programmer to fan-out the number of available registers, so for example if an algorithm works with 16-bit data, the programmer could use a single physical 64-bit register to store four items and operate on them independently (but not in parallel).
The ISAs that have variable-length instructions can also benefit from using partial registers because that usually means smaller immediate values. For example, an instruction that sets a register to a specific value usually has an immediate operand that matches the size of the register being loaded (though RISC usually sign-extends or zero-extends it).
Architectures like ARM (presumably others as well) support half-precision floats. The idea is to do what you were speculating and @Margaret explained. With half-precision floats, you can pack two float values in a single register, thereby using less bandwidth at the cost of reduced accuracy.

Floating-point number vs fixed-point number: speed on Intel I5 CPU

I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc.
Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers ? How much speed gain can I get ?
Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations ?
I know that SSE (Streaming SIMD Extensions) operates on 4x32=8x16=128 bits of data at one time, i.e., four 32-bit floating-point values or eight 16-bit fixed-point values. I guess that after conversion from 32-bit floating-point to 16-bit fixed-point, my program would be twice as fast.
Summary: Modern FPU hardware is hard to beat with fixed-point, even if you have twice as many elements per vector.
Modern BLAS libraries are typically very well tuned for cache performance (with cache blocking / loop tiling) as well as for instruction throughput. That makes them very hard to beat. Especially DGEMM has lots of room for this kind of optimization, because it does O(N^3) work on O(N^2) data, so it's worth transposing just a cache-sized chunk of one input, and stuff like that.
What might help is reducing memory bottlenecks by storing your floats in 16-bit half-float format. There is no hardware support for doing math on them in that format, just a couple instructions to convert between that format and normal 32-bit element float vectors while loading/storing: VCVTPH2PS (__m256 _mm256_cvtph_ps(__m128i)) and VCVTPS2PH (__m128i _mm256_cvtps_ph(__m256 m1, const int imm8_rounding_control)). These two instructions comprise the F16C extension, first supported by AMD Bulldozer and Intel IvyBridge.
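To get a feel for the storage / precision trade-off without touching intrinsics, here is a sketch that round-trips a few values through IEEE half precision using Python's struct module (format code "e"). It only models the storage format; it is not the F16C instructions themselves, and the helper names are made up.

```python
import struct

def to_half_bytes(values):
    """Store values as packed IEEE binary16, 2 bytes each."""
    return struct.pack("<%de" % len(values), *values)

def from_half_bytes(data):
    """Load packed binary16 values back as Python floats."""
    return list(struct.unpack("<%de" % (len(data) // 2), data))

values = [0.1, 1.0, 3.14159, 1000.5]
as_float32 = struct.pack("<%df" % len(values), *values)
as_half = to_half_bytes(values)
restored = from_half_bytes(as_half)

print(len(as_float32), "bytes as float32,", len(as_half), "bytes as half")
for original, back in zip(values, restored):
    print(original, "->", back)
```

Half the memory traffic, but only about 3 significant decimal digits survive, so whether this is usable depends a lot on the data.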
IDK if any BLAS libraries support that format.
Fixed point:
SSE/AVX doesn't have any integer division instructions. If you're only dividing by constants, you might not need a real div instruction, though. So that's one major stumbling block for fixed point.
Another big downside of fixed point is the extra cost of shifting to correct the position of the decimal (binary?) point after multiplies. That will eat into any gain you could get from having twice as many elements per vector with 16-bit fixed point.
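A minimal sketch of why that shift is needed, using a made-up Q8.8 format (8 integer bits, 8 fractional bits). It doesn't model 16-bit overflow or rounding, just where the binary point ends up.

```python
FRAC_BITS = 8            # assumed Q8.8 format
ONE = 1 << FRAC_BITS     # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to Q8.8 (round to nearest)."""
    return int(round(x * ONE))

def to_float(q):
    """Convert Q8.8 back to float."""
    return q / ONE

def fixed_add(a, b):
    """Addition needs no correction: both operands share the same scale."""
    return a + b

def fixed_mul(a, b):
    """The raw product has 16 fractional bits, so shift 8 of them back out."""
    return (a * b) >> FRAC_BITS

a = to_fixed(1.5)     # 384
b = to_fixed(2.25)    # 576
print(to_float(fixed_add(a, b)))   # plain integer add
print(to_float(fixed_mul(a, b)))   # multiply, then shift
print(to_float(a * b))             # forgetting the shift gives a value 256x too large
```

In SIMD code that extra shift (or a high-half multiply that bakes it in) is the cost being described.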
SSE/AVX actually has quite a good selection of packed 16-bit multiplies (better than for any other element size). There's packed multiply producing the low half, high half (signed or unsigned), and even one that takes 16 bits from 2 bits below the top, with rounding (PMULHRSW). Skylake runs those at two per clock, with 5 cycle latency. There are also integer multiply-add instructions, but they do a horizontal add between pairs of multiply results. (See Agner Fog's insn tables, and also the x86 tag wiki for performance links.) Haswell and previous don't have as many integer-vector add and multiply execution units. Often code bottlenecks on total uop throughput, not on a specific execution port anyway. (But a good BLAS library might even have hand-tuned asm.)
If your inputs and outputs are integer, it's often faster to work with integer vectors, instead of converting to floats. (e.g. see my answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?, where I used 16-bit fixed-point to deal with 8-bit integers).
But if you're really working with floats, and have a lot of multiplying and dividing to do, just use the hardware FPUs. They're amazingly powerful in modern CPUs, and have made fixed-point mostly obsolete for many tasks. As @Iwill points out, FMA instructions are another big boost for FP throughput (and sometimes latency).
Integer add/subtract/compare instructions (but not multiply) are also lower latency than their FP counterparts.

On most modern 64-bit processors, does the speed of `mulq` depend on the operands?

On most modern 64-bit processors (such as Intel Core 2 Duo or the Intel i7 series), does the speed of the x86_64 instruction mulq and its variants depend on the operands? For example, will multiplying 11 * 13 be faster than 11111111 * 13131313? Or does it always take the time of the worst case?
TL;DR: No. Constant-length integer math operations (barring division, which is non-linear) consume a constant number of cycles, regardless of the numerical value of the operands.
mulq takes two QWORD arguments.
The values are represented as 64-bit binary strings as follows (written here least-significant bit first; x86's little-endian byte order likewise stores the least-significant byte first):
1011000000000000000000000000000000000000000000000000000000000000 = 13
1000110001111010000100110000000000000000000000000000000000000000 = 13131313
The processor sees both of these as the same "size", as both are 64-bit values.
Therefore, the cycle count should always be the same, regardless of the actual numerical value of the operands.
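The bit strings above can be reproduced with a short sketch (the helper name is made up). The point is only that both operands occupy the full 64-bit width no matter how small the number is.

```python
def lsb_first_bits(value, width=64):
    """Render an unsigned integer as a fixed-width bit string, least-significant bit first."""
    if value < 0 or value >= (1 << width):
        raise ValueError("value does not fit in %d bits" % width)
    return format(value, "0%db" % width)[::-1]

small = lsb_first_bits(13)
large = lsb_first_bits(13131313)
print(small)
print(large)
print(len(small) == len(large) == 64)   # same operand size as far as the multiplier cares
```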
More info:
There are the concepts of Leading Zero Anticipation and Leading Zero Detection[1][2] (LZA/LZD) that can be employed to speed up floating-point operations.
To the best of my knowledge however, there are no mainstream processors that employ either of these methods towards integer arithmetic. This is most likely due to the simplistic nature of most integer arithmetic (multiplication in this case). The overhead of LZA/LZD may simply not be worth it, for simple integer math circuits that can complete the full multiplication in less time anyhow.
I don't have any reference to hand, but I would place money on the latency/throughput being invariant of the values of the operands. Otherwise, it would be a nightmare to schedule.
For decades, Agner Fog has been publishing tables of instruction timings. His August 2019 tables confirm what I had expected: that the CPU chips in modern laptops and desktop computers have invariant timing for their integer-multiply units. These are extremely fast and rather power-hungry.
The CPU design space is quite different for battery-limited devices such as smartphones. On such devices, the integer multiply may be implemented in a microcoded loop with variable timing.
In (approx) 2016, Thomas Pornin had this to say about "the problem" posed by variable-latency multiplication instructions to the design of his SSL/TLS library:
"... integer multiplication opcodes in CPU may or may not execute in constant time; when they do not, implementations that use such operations may exhibit execution time variations that depend on the involved data, thereby potentially leaking secret information... When a CPU has non-constant-time multiplication opcodes, the execution time depends on the size (as integers) of one or both of the operands. For instance, on the ARM Cortex-M3, the umull opcode takes 3 to 5 cycles to complete, the `short' counts (3 or 4) being taken only if both operands are numerically less than 65536 ... In general, Intel x86 CPU all provide constant-time multiplications since the days of the first Pentium. The same does not necessarily hold for other vendors, in particular the early VIA Nano." 2

overlapping variables and performance

Sorry in advance if I have some of this wrong. I may edit to correct later if it's not too disruptive.
When multiple variables are declared in adjacent memory, as I understand it, on a very low level, registers are created that encapsulate a number of bytes, commonly 1, 2, 4 or 8. This allows those bit ranges to be binary rotated, as well as thought of by the processor as numbers and so mutated with simple mathematics such as add, subtract, multiply and divide.
There may be abstraction reasons for not overlapping these ranges, but as many languages consider instructions to occur in a well defined sequential order that the coder will be aware of, are there any performance reasons to not overlap one or more in adjacent bytes of allocated memory?
For example in a block of allocated memory where every bit starts as 0. Bytes 0 to 3 could be being used as an int, as well as bytes 1 to 4. The first could be set to a value before the second range was multiplied by 3.
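To make that example concrete, here is a sketch of it on a 5-byte buffer, assuming little-endian 32-bit unsigned ints (the byte order and the starting value are my choices for illustration).

```python
import struct

def overlap_demo():
    buf = bytearray(5)                            # every bit starts as 0
    struct.pack_into("<I", buf, 0, 0x01020304)    # first int: bytes 0..3
    (second,) = struct.unpack_from("<I", buf, 1)  # second int: bytes 1..4, overlaps 3 bytes
    struct.pack_into("<I", buf, 1, (second * 3) & 0xFFFFFFFF)
    (first,) = struct.unpack_from("<I", buf, 0)   # first int now sees the modified bytes
    return second, first, bytes(buf)

second_before, first_after, raw = overlap_demo()
print(hex(second_before))   # what the second int read through the overlap
print(hex(first_after))     # the first int after the second one was multiplied by 3
print(raw.hex())
```

Only byte 0 of the first int survives; its other three bytes are now whatever the multiply on the overlapping int produced.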
If there are performance reasons not to, then are they overcome by otherwise having to copy values in and out of completely new variables and perform more complicated processes to achieve certain algorithms that could otherwise be done on a very low level?
There is nothing wrong with this trick when it is done in assembly: optimizers have been routinely making use of knowing where parts of an integer are to save CPU cycles and reduce the size of the code. For example, when a 32-bit integer variable is initialized to a value that fits in only 16 bits, optimizing compilers would replace the instruction that stores a 32-bit value in memory with a faster instruction that stores a 16-bit value to the lower bits of the variable, and clear the upper 16 bits. Moreover, many optimizers would go even further: if a constant is divisible by 2^16, they would store the value divided by 2^16 to the upper 16 bits, and clear lower 16 bits.
Some architectures restrict such manipulations to addresses of certain properties, for example, by requiring all 4-byte memory load/store instructions to be done at addresses divisible by four. These restrictions may reduce applicability of partial-value writing tricks.
