Intrinsic _mm512_round_ps is missing for AVX512 - avx512

I'm missing the intrinsic _mm512_round_ps for AVX512 (it is only available for KNC). Any idea why this is not available?
What would be a good workaround?
apply _mm256_round_ps to upper and lower half and fuse the results?
use _mm512_add_round_ps with one argument being zero?
Thanks!

TL;DR: AVX512F
__m512 nearest_integer = _mm512_roundscale_ps(input_vec, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC);
related: AVX512DQ _mm512_reduce_pd or _ps will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd have the most detail.
The EVEX prefix allows overriding the default rounding direction {er} and setting suppress-all-exceptions {sae}, for FP instructions. (This is what the ..._round_ps() versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.
vroundps xy, xy/mem, imm8 didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.
vrndscaleps xyz, xyz/mem/m32broadcast, imm8 is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5 would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).
I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.
The low 4 bits of the field specify the rounding mode like with roundps. As usual, the PD version of the instruction has the diagram (because it's alphabetically first).
With the upper 4 bits = 0, it behaves the same as roundps: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.
(I'm curious if SSE or VEX roundpd on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)
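To make the imm8 layout concrete, here is a hedged one-liner (the wrapper name is mine, not from any header): setting M=1 in the high 4 bits rounds to the nearest multiple of 0.5.
#include <immintrin.h>

// Hedged sketch: M (imm8 bits 7:4) = 1 keeps one fraction bit, i.e. rounds to
// the nearest 0.5; the low bits select round-to-nearest and suppress the
// inexact exception.
__m512 round_to_half(__m512 v)
{
    return _mm512_roundscale_ps(v, (1 << 4) | _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}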
__m512 _mm512_roundscale_ps( __m512 a, int imm); is the no-frills intrinsic. See Intel's intrinsic finder
The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);. There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.
You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC and so on constants documented for _mm_round_pd / _mm256_round_pd, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION to use whatever the current mode is. _MM_FROUND_NO_EXC suppresses setting the inexact exception bit in the MXCSR.
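For illustration, a few hedged wrappers (the function names are mine) that pair _mm512_roundscale_ps with those constants, all rounding to 0 fraction bits:
#include <immintrin.h>

__m512 roundscale_nearest(__m512 v) { return _mm512_roundscale_ps(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC); }
__m512 roundscale_floor(__m512 v)   { return _mm512_roundscale_ps(v, _MM_FROUND_TO_NEG_INF     | _MM_FROUND_NO_EXC); }
__m512 roundscale_ceil(__m512 v)    { return _mm512_roundscale_ps(v, _MM_FROUND_TO_POS_INF     | _MM_FROUND_NO_EXC); }
__m512 roundscale_trunc(__m512 v)   { return _mm512_roundscale_ps(v, _MM_FROUND_TO_ZERO        | _MM_FROUND_NO_EXC); }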
You might be wondering why vrndscaleps needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae} (Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)
The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)
So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8 to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps. (http://agner.org/optimize/).
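At the intrinsics level, a hedged sketch of that broadcast + merge pattern (variable and function names are mine): _mm512_mask_roundscale_ps is the merge-masking form, and _mm512_set1_ps(*p) expresses the broadcast load, which compilers can usually (but are not obliged to) fold into the instruction's {m32bcst} memory operand.
#include <immintrin.h>

__m512 round_bcast_merge(__m512 old, __mmask16 k, const float* p)
{
    // Broadcast one float, round it to the nearest integer, and merge the
    // result into 'old' under mask k (unset lanes keep their old value).
    return _mm512_mask_roundscale_ps(old, k, _mm512_set1_ps(*p),
                                     _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}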

Related

Addition efficiency proportional to size of operands?

Question that just popped into my head, and I don't think I've seen an answer on here. Is the time taken by a binary addition algorithm proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations, should we start looking into more efficient methods of calculation? E.g. exponentiation by squaring with large exponents for calculating huge powers.
How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32bit system, all the above numbers will take the same time (no. of CPU instructions) to be added. As all of them fit within the basic computational block i.e. the 32bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16bit system, as the larger numbers will NOT fit in a 16bit register, it will need an additional pass (to add the significant bits that remain after the first 16 LSBs are added).
Step1: ADD Least significant bits
0101010101101010
0100101010010101
Step2: ADD the rest (remember carry bit from previous operation)
000000000000000C
0000000110101101
0000000000001011
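The same two-step addition written as a small C sketch (the limb values are taken from the example above; 16-bit "registers" are simulated with uint16_t, so this is an illustration rather than real 16-bit code):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t a_lo = 0x556A, a_hi = 0x01AD;   // 0x01AD556A, the "big" number
    uint16_t b_lo = 0x4A95, b_hi = 0x000B;   // 0x000B4A95, the "medium" number

    uint32_t lo_sum = (uint32_t)a_lo + b_lo;          // Step 1: add the 16 LSBs
    uint16_t r_lo   = (uint16_t)lo_sum;
    uint16_t carry  = (uint16_t)(lo_sum >> 16);       // carry bit for Step 2

    uint32_t hi_sum = (uint32_t)a_hi + b_hi + carry;  // Step 2: add the rest plus the carry
    uint16_t r_hi   = (uint16_t)hi_sum;

    printf("%04X%04X\n", (unsigned)r_hi, (unsigned)r_lo);  // prints 01B89FFF
    return 0;
}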
We can start thinking of optimising the mathematical operations on
numbers once the numbers no longer fit in the basic computation unit
of the system i.e. the CPU-register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they detect such a case, e.g. 128bit decryption logic being run on a 32bit system.
Also instead of checking ONLY the size of the operands, the size of the result also determines whether the system can accomplish the mathematical operation within one step. Not only the operands involved, but the operation being performed needs to be taken into consideration as well.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure validity of the result for all possible values, the architecture(and the compiler) must :
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)
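A tiny sketch of the 30-bit example in C: the sum always fits in 32 bits, while the product needs a widened result, which on x86 is exactly what the EDX:EAX pair (or a 64-bit register) provides.
#include <stdint.h>

uint32_t add30(uint32_t a, uint32_t b) { return a + b;           }  // at most 31 bits: always fits in 32
uint64_t mul30(uint32_t a, uint32_t b) { return (uint64_t)a * b; }  // up to 60 bits: needs a wide result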
In practice, there's no (or completely negligible) difference between adding different integers that fit in the processor words as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.
I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256bits at a time, which means you can process 4 integer additions if you're using 64bit, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision). They all take the same number of clock cycles to compute - 1clk.
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
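For illustration, a minimal AVX2 loop of the kind described above (the function name and the assumption that n is a multiple of 32 are mine):
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Adds two byte arrays, 32 elements per vector instruction.
void add_bytes(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));
        _mm256_storeu_si256((__m256i*)(dst + i), _mm256_add_epi8(va, vb));
    }
}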
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, this might have seemed a little silly. However, the new instruction set contains a ternary logic instruction. With this instruction, it is possible to compute 512 additions on numbers of any bit length fairly easily.
#include <immintrin.h>

const int NUM_BITS = 512;   // bit length of the bignums (example value, choose as needed)

// Sum bit of a full adder: x XOR y XOR carry_in (truth table 0x96).
inline __m512i add(__m512i x, __m512i y, __m512i carry_in)
{
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);
}

// Carry-out of a full adder: majority(x, y, carry_in) (truth table 0xE8).
inline __m512i adc(__m512i x, __m512i y, __m512i carry_in)
{
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);
}

__m512i A[NUM_BITS];        // bit-sliced operands: one __m512i per bit position
__m512i B[NUM_BITS];
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_si512();   // integer zero vector (not _mm512_setzero_ps())

for(int i = 0; i < NUM_BITS; ++i)
{
    RESULT[i] = add(A[i], B[i], CARRY);   // uses the incoming carry...
    CARRY     = adc(A[i], B[i], CARRY);   // ...before it is updated for the next bit
}
In this particular example (which, to be honest, probably has very limited real-world usage!), the time it takes to perform the 512 additions is indeed directly proportional to NUM_BITS.

Am I using _mm_srl_epi32 wrong?

In this small code example:
__m128i twos = _mm_set_epi32(2,3,1,2);
__m128i foo = _mm_set_epi32(128,128,128,128);
__m128i shifted = _mm_srl_epi32(foo,twos);
"shifted" is full of zeroes, while I expect it two be full of four 32-bit integers with the values 32,16,64, and 32, respectively. Am I using the intrinsic wrong?
Yes, you are using it incorrectly. The second argument to _mm_srl_epi32() specifies the amount of bits to shift the first argument by, but it isn't a vectored argument as you might expect, allowing you to shift each 32-bit integer by a different number of bits. Instead, the 128-bit argument is truncated to 64 bits, and the resulting count is used to determine the number of bits to shift; the same shift amount is used for all 4 integers in the first argument. In your case, the lower 64 bits are 0x0000000100000002, which evaluates to a very large positive number. This results in all of the elements of foo getting flushed to zero as all of the bits are shifted out.
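For reference, a hedged sketch of the two usual alternatives (function names are mine): _mm_srli_epi32 shifts every element by the same immediate count, and the per-element variable shift the question expected only exists as of AVX2's _mm_srlv_epi32.
#include <immintrin.h>

__m128i shift_all_by_2(__m128i v)            { return _mm_srli_epi32(v, 2); }  // SSE2: same count for all lanes
__m128i shift_per_lane(__m128i v, __m128i c) { return _mm_srlv_epi32(v, c); }  // AVX2: per-element counts
// shift_per_lane(foo, twos) yields the 32, 16, 64, 32 the question expected.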
A good place to find all of the little details on every instruction out there is Intel's AVX Programmer's Reference. While the title may be somewhat of a misnomer, the document contains descriptions of all SSE/SSE2/.../AVX/AVX2 instructions and descriptions of their intrinsics available in Intel's C++ compiler (which are typically also available in gcc and others). Searching for _mm_srl_epi32 in the document yields a clear explanation on exactly what the instruction does.

long double (GCC specific) and __float128

I'm looking for detailed information on long double and __float128 in GCC/x86 (more out of curiosity than because of an actual problem).
Few people will probably ever need these (I've just, for the first time ever, truly needed a double), but I guess it is still worthwhile (and interesting) to know what you have in your toolbox and what it's about.
In that light, please excuse my somewhat open questions:
1. Could someone explain the implementation rationale and intended usage of these types, also in comparison to each other? For example, are they "embarrassment implementations" because the standard allows for the type, and someone might complain if they're only just the same precision as double, or are they intended as first-class types?
2. Alternatively, does someone have a good, usable web reference to share? A Google search on "long double" site:gcc.gnu.org/onlinedocs didn't give me much that's truly useful.
3. Assuming that the common mantra "if you believe that you need double, you probably don't understand floating point" does not apply, i.e. you really need more precision than just float, and one doesn't care whether 8 or 16 bytes of memory are burnt... is it reasonable to expect that one can as well just jump to long double or __float128 instead of double without a significant performance impact?
4. The "extended precision" feature of Intel CPUs has historically been a source of nasty surprises when values were moved between memory and registers. If actually 96 bits are stored, the long double type should eliminate this issue. On the other hand, I understand that the long double type is mutually exclusive with -mfpmath=sse, as there is no such thing as "extended precision" in SSE. __float128, on the other hand, should work just perfectly fine with SSE math (though in the absence of quad precision instructions certainly not on a 1:1 instruction base). Am I right in these assumptions?
(3. and 4. can probably be figured out with some work spent on profiling and disassembling, but maybe someone else had the same thought previously and has already done that work.)
Background (this is the TL;DR part):
I initially stumbled over long double because I was looking up DBL_MAX in <float.h>, and incidentally LDBL_MAX is on the next line. "Oh look, GCC actually has 128 bit doubles, not that I need them, but... cool" was my first thought. Surprise, surprise: sizeof(long double) returns 12... wait, you mean 16?
The C and C++ standards unsurprisingly do not give a very concrete definition of the type. C99 (6.2.5 10) says that the numbers of double are a subset of long double whereas C++03 states (3.9.1 8) that long double has at least as much precision as double (which is the same thing, only worded differently). Basically, the standards leave everything to the implementation, in the same manner as with long, int, and short.
Wikipedia says that GCC uses "80-bit extended precision on x86 processors regardless of the physical storage used".
The GCC documentation states, all on the same page, that the size of the type is 96 bits because of the i386 ABI, but no more than 80 bits of precision are enabled by any option (huh? what?), and also that Pentium and newer processors want them aligned as 128 bit numbers. This is the default under 64 bits and can be manually enabled under 32 bits, resulting in 32 bits of zero padding.
Time to run a test:
#include <stdio.h>
#include <cfloat>

int main()
{
#ifdef USE_FLOAT128
    typedef __float128 long_double_t;
#else
    typedef long double long_double_t;
#endif

    long_double_t ld;

    int* i = (int*) &ld;
    i[0] = i[1] = i[2] = i[3] = 0xdeadbeef;

    for (ld = 0.0000000000000001; ld < LDBL_MAX; ld *= 1.0000001)
        printf("%08x-%08x-%08x-%08x\r", i[0], i[1], i[2], i[3]);

    return 0;
}
The output, when using long double, looks somewhat like this, with the marked digits being constant, and all others eventually changing as the numbers get bigger and bigger:
5636666b-c03ef3e0-00223fd8-deadbeef
^^ ^^^^^^^^
This suggests that it is not an 80 bit number. An 80-bit number has 20 hex digits. I see 22 hex digits changing, which looks much more like a 96-bit number (24 hex digits). It also isn't a 128 bit number since 0xdeadbeef isn't touched, which is consistent with sizeof returning 12.
The output for __float128 looks like it's really just a 128 bit number. All bits eventually flip.
Compiling with -m128bit-long-double does not align long double to 128 bits with a 32-bit zero padding, as indicated by the documentation. It doesn't use __float128 either, but indeed seems to align to 128 bits, padding with the value 0x7ffdd000(?!).
Further, LDBL_MAX seems to work as +inf for both long double and __float128. Adding or subtracting a number like 1.0E100 or 1.0E2000 to/from LDBL_MAX results in the same bit pattern.
Up to now, it was my belief that the foo_MAX constants were to hold the largest representable number that is not +inf (apparently that isn't the case?). I'm also not quite sure how an 80-bit number could conceivably act as +inf for a 128 bit value... maybe I'm just too tired at the end of the day and have done something wrong.
Ad 1.
Those types are designed to work with numbers with a huge dynamic range. The long double is implemented in a native way in the x87 FPU. The 128-bit double, I suspect, would be implemented in software on modern x86s, as there's no hardware that can do the computations natively.
The funny thing is that it's quite common to do many floating point operations in a row and the intermediate results are not actually stored in declared variables but rather stored in FPU registers taking advantage of full precision. That's why comparison:
double x = sin(0); if (x == sin(0)) printf("Equal!");
Is not safe and cannot be guaranteed to work (without additional switches).
Ad. 3.
There's an impact on speed depending on what precision you use. You can change the precision used by the FPU with:
void
set_fpu (unsigned int mode)
{
    // Load a new x87 control word; the precision-control bits select
    // 24-bit, 53-bit or 64-bit significands.
    asm ("fldcw %0" : : "m" (*&mode));
}
It will be faster for shorter variables, slower for longer ones. 128-bit doubles will probably be done in software, so they will be much slower.
It's not only about RAM memory wasted, it's about cache being wasted. Going to 80 bit double from 64b double will waste from 33% (32b) to almost 50% (64b) of the memory (including cache).
Ad 4.
On the other hand, I understand that the long double type is mutually
exclusive with -mfpmath=sse, as there is no such thing as "extended
precision" in SSE. __float128, on the other hand, should work just
perfectly fine with SSE math (though in absence of quad precision
instructions certainly not on a 1:1 instruction base). Am I right under
these assumptions?
The FPU and SSE units are totally separate. You can write code using the FPU at the same time as SSE. The question is: what will the compiler generate if you constrain it to use only SSE? Will it try to use the FPU anyway? I've been doing some programming with SSE, and GCC will generate only scalar SISD code on its own; you have to help it to use the SIMD versions. __float128 will probably work on every machine, even the 8-bit AVR uC. It's just fiddling with bits after all.
An 80-bit value in hex representation is actually 20 hex digits. Maybe the bits which are not used are left over from some old operation? On my machine, I compiled your code and only 20 hex digits change in long double mode: 66b4e0d2-ec09c1d5-00007ffe-deadbeef
The 128-bit version has all the bits changing. Looking at the objdump, it looks as if it is using software emulation; there are almost no FPU instructions.
Further, LDBL_MAX, seems to work as +inf for both long double and
__float128. Adding or subtracting a number like 1.0E100 or 1.0E2000 to/from LDBL_MAX results in the same bit pattern. Up to now, it was my
belief that the foo_MAX constants were to hold the largest
representable number that is not +inf (apparently that isn't the
case?).
This seems to be strange...
I'm also not quite sure how an 80-bit number could conceivably
act as +inf for a 128-bit value... maybe I'm just too tired at the end
of the day and have done something wrong.
It's probably being extended. The pattern which is recognized to be +inf in 80-bit is translated to +inf in 128-bit float too.
IEEE-754 defined 32-bit and 64-bit floating-point representations for the purpose of efficient data storage, and an 80-bit representation for the purpose of efficient computation. The intention was that given float f1,f2; double d1,d2; a statement like d1=f1+f2+d2; would be executed by converting the arguments to 80-bit floating-point values, adding them, and converting the result back to a 64-bit floating-point type. This would offer three advantages compared with performing operations on other floating-point types directly:
While separate code or circuitry would be required for conversions to/from 32-bit types and 64-bit types, it would only be necessary to have one "add" implementation, one "multiply" implementation, one "square root" implementation, etc.
Although in rare cases using an 80-bit computational type could yield results that were very slightly less accurate than using other types directly (worst-case rounding error is 513/1024ulp in cases where computations on other types would yield an error of 511/1024ulp), chained computations using 80-bit types would frequently be more accurate--sometimes much more accurate--than computations using other types.
On a system without an FPU, separating a double into a separate exponent and mantissa before performing computations, normalizing a mantissa, and converting a separate mantissa and exponent into a double, are somewhat time consuming. If the result of one computation will be used as input to another and discarded, using an unpacked 80-bit type will allow these steps to be omitted.
In order for this approach to floating-point math to be useful, however, it is imperative that it be possible for code to store intermediate results with the same precision as would be used in computation, such that temp = d1+d2; d4=temp+d3; will yield the same result as d4=d1+d2+d3;. From what I can tell, the purpose of long double was to be that type. Unfortunately, even though K&R designed C so that all floating-point values would be passed to variadic methods the same way, ANSI C broke that. In C as originally designed, given the code float v1,v2; ... printf("%12.6f", v1+v2);, the printf method wouldn't have to worry about whether v1+v2 would yield a float or a double, since the result would get coerced to a known type regardless. Further, even if the type of v1 or v2 changed to double, the printf statement wouldn't have to change.
ANSI C, however, requires that code which calls printf must know which arguments are double and which are long double; a lot of code--if not a majority--which uses long double but was written on platforms where it's synonymous with double fails to use the correct format specifiers for long double values. Rather than having long double be an 80-bit type except when passed as a variadic method argument, in which case it would be coerced to 64 bits, many compilers decided to make long double be synonymous with double and not offer any means of storing the results of intermediate computations. Since using an extended precision type for computation is only good if that type is made available to the programmer, many people came to regard extended precision as evil even though it was only ANSI C's failure to handle variadic arguments sensibly that made it problematic.
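A tiny illustration of that variadic-argument point (nothing here is from the answer above, just standard C): the caller has to match the conversion specifier to the promoted type, %f for double and %Lf for long double.
#include <stdio.h>

int main(void)
{
    double      d  = 1.0  / 3.0;
    long double ld = 1.0L / 3.0L;
    printf("%f\n",  d);    // float arguments are promoted to double, so %f covers both
    printf("%Lf\n", ld);   // long double needs %Lf; plain %f here would be undefined behaviour
    return 0;
}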
PS--The intended purpose of long double would have benefited if there had also been a long float which was defined as the type to which float arguments could be most efficiently promoted; on many machines without floating-point units that would probably be a 48-bit type, but the optimal size could range anywhere from 32 bits (on machines with an FPU that does 32-bit math directly) up to 80 (on machines which use the design envisioned by IEEE-754). Too late now, though.
It boils down to the difference between 4.9999999999999999999 and 5.0.
Although the range is the main difference, it is precision that is important.
These types of data are needed in great-circle calculations or coordinate mathematics that is likely to be used with GPS systems.
As the precision is much better than normal double, it means you can typically retain 18 significant digits without losing accuracy in calculations.
Extended precision I believe uses 80 bits (used mostly in maths processors), so 128 bits will be much more accurate.
C99 and C++11 added types float_t and double_t which are aliases for built-in floating-point types. Roughly, float_t is the type of the result of doing arithmetic among values of type float, and double_t is the type of the result of doing arithmetic among values of type double.
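A small check of what that means on a given target (a hedged sketch; for example, 32-bit x86 GCC with x87 math typically reports FLT_EVAL_METHOD == 2 and maps both aliases to long double, while SSE math reports 0 and maps them to float and double):
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_EVAL_METHOD  = %d\n",  FLT_EVAL_METHOD);
    printf("sizeof(float_t)  = %zu\n", sizeof(float_t));
    printf("sizeof(double_t) = %zu\n", sizeof(double_t));
    return 0;
}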

Performance implications of long double. Why does C choose 64-bits instead of the hardware's 80-bit for its default?

For specifics I am talking about x87 PC architecture and the C compiler.
I am writing my own interpreter and the reasoning behind the double datatype confuses me. Especially where efficiency is concerned. Could someone explain WHY C has decided on a 64-bit double and not the hardware native 80-bit double? And why has the hardware settled on an 80-bit double, since that is not aligned? What are the performance implications of each? I would like to use an 80-bit double for my default numeric type. But the choices of the compiler developers make me concerned that this is not the best choice.
double on x86 is only 2 bytes shorter, why doesn't the compiler use the 10 byte long double by default?
Can I get an example of the extra precision gotten by 80-bit long double vs double?
Why does Microsoft disable long double by default?
In terms of magnitude, how much worse / slower is long double on typical x86/x64 PC hardware?
The answer, according to Mysticial, is that Microsoft uses SSE2 for its double data-type. The Floating point unit (FPU) x87 is seen as outdated and slow in comparison to modern CPU extensions. SSE2 does not support 80-bit, hence the compiler's choice of 64-bit precision.
On 32-bit x86 architecture, since not all CPUs have SSE2 yet, Microsoft still uses the x87 floating point unit (FPU) unless the compiler switch /arch:SSE2 is given, which makes the code incompatible with those older CPUs.
Wrong question.
It has nothing to do with C; AFAIK all languages use, as standard, 32-bit single precision and 64-bit double precision floating point. C, as a language supporting different hardware, defines only
sizeof(float) <= sizeof(double) <= sizeof(long double)
so it is perfectly acceptable that a specific C compiler uses 32-bit floats for all datatypes.
Intel decided, on Kahan's advice, to support as much precision as possible and to perform calculations in the less precise formats (32 & 64 bit) internally with 80-bit precision.
The difference in precision and exponent range: 64bit has approx. 16 decimal digits and a max exponent of 308, 80bit has 19 digits and a max exponent of 4932.
Being much more precise and having a far greater exponent range, you can calculate intermediate results without overflow or underflow, and your result has less rounding error.
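Those figures can be read straight out of <float.h>; a quick check (the values in the comments assume the x87 80-bit long double):
#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("double:      %2d decimal digits, max exponent 10^%d\n",
           DBL_DIG,  DBL_MAX_10_EXP);    // typically 15 digits (~16 significant), 10^308
    printf("long double: %2d decimal digits, max exponent 10^%d\n",
           LDBL_DIG, LDBL_MAX_10_EXP);   // 18 digits (~19 significant), 10^4932 for 80-bit x87
    return 0;
}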
So the question is why long double does not support 80bit. In fact many compilers did support it, but a lack of use and the run for benchmark performance killed it effectively.
This is actually so many questions in one, some of which are even too broad
Could someone explain WHY C has decided on a 64-bit double and not the hardware native 80-bit double?
It's irrelevant to C, because the C standard only mandates the minimum requirements for the built-in types and it's entirely up to the compiler implementation to choose whatever format they want to use for a type. Nothing prevents a C compiler from using some custom-made 77-bit floating-point type.
And why has the hardware settled on an 80-bit double, since that is not aligned? What are the performance implications of each?
It's aligned to a multiple of 2 bytes. Remember that x87 dates back to 8086 + 8087.
It's a good trade-off for modern hardware implementers and software writers who need more precision for exact rounding in double operations. Too big a type and you'll need significantly more transistors: double the number of bits in the significand and the multiplier will need to be 4 times as big.
William Kahan, a primary designer of the x87 arithmetic and initial IEEE 754 standard proposal notes on the development of the x87 floating point: "An Extended format as wide as we dared (80 bits) was included to serve the same support role as the 13-decimal internal format serves in Hewlett-Packard’s 10-decimal calculators." Moreover, Kahan notes that 64 bits was the widest significand across which carry propagation could be done without increasing the cycle time on the 8087, and that the x87 extended precision was designed to be extensible to higher precision in future processors: "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."
https://en.wikipedia.org/wiki/Extended_precision#IEEE_754_extended_precision_formats
As you can see, with the 64-bit significand you can share the components (adder, multiplier...) with the integer ALU.
I would like to use an 80-bit double for my default numeric type. But the choices of the compiler developers make me concerned that this is not the best choice. double on x86 is only 2 bytes shorter, why doesn't the compiler use the 10 byte long double by default?
It's actually intended to be used as a temporary variable (like tmp = (b*c + d)/e) to avoid intermediate overflow or underflow issues without special techniques like the Kahan summation. It's not your default floating-point type. In fact, many people use floating-point literals incorrectly when they use long double or float: they forget to add the correct suffix, which results in a loss of precision, and then they ask why long double is exactly the same as double. In summary, double should be used for almost every case, unless you're limited by bandwidth or precision and you really know what you're doing.
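The literal-suffix pitfall mentioned above, as a minimal sketch (variable names are mine): without the L suffix the constant is rounded to double before it ever reaches the long double variable.
#include <stdio.h>

int main(void)
{
    long double no_suffix = 0.1;    // parsed as a double literal, then widened: precision already lost
    long double with_L    = 0.1L;   // parsed directly as long double
    printf("%.21Lg\n%.21Lg\n", no_suffix, with_L);  // the two differ wherever long double is wider than double
    return 0;
}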
Can I get an example of the extra precision gotten by 80-bit long double vs double?
You can print the full value and see it for yourself. There are also a lot of questions that are worth reading:
What are the applications/benefits of an 80-bit extended precision data type?
Difference between long double and double in C and C++
I want to know the difference between a long double and a double
Why does Microsoft disable long double by default?
Microsoft doesn't disable long double by default. They just chose to map long double to IEEE-754 double precision, which is incidentally the same format as double. The type long double can still be used normally. They did that because math on SSE is faster and more consistent. That way you'll avoid "bugs" like the ones below:
Why casting double to int might give different results?
Apparently identical math expressions with different output
Why would the same code yield different numeric results on 32 vs 64-bit machines?
std::pow produce different result in 32 bit and 64 bit application
...
Besides, a 64-bit long double doesn't have the odd size which requires the compiler to pad 6 more zero bytes (or deal with a non-power-of-2 type width), which is a waste of resources.
That said, it's not as if 80-bit long double is unavailable on x86. Currently only MSVC has abandoned the extended precision type; other compilers for x86 (like GCC, Clang, ICC...) still support it and make 80-bit IEEE-754 the default format for long double. For example, GCC has -mlong-double-64/80/128 and -m96/128bit-long-double to control the exact format of long double.
Or without potentially breaking ABI compatibility by changing long double, you can use GNU C floating point type names like __float80 on targets that support it. This example on Godbolt compiles to 80-bit FP math whether it targets Windows or Linux.
In terms of magnitude, how much worse / slower is long double on typical x86/x64 PC hardware?
This cannot be answered because latency and throughput depend on each specific microarchitecture. However, if you do a lot of floating-point operations then double will be significantly faster, because it has fewer bits in the significand and it can be parallelized with SIMD. For example, you can work on a vector of 8 doubles at a time with AVX-512. That can't be done with the extended precision type.
Also, 80-bit x87 fp load and store instructions are significantly slower than the "normal" versions that convert to/from 32 or 64-bit, and only fstp is available, not fst. See Peter Cordes's answer on retrocomputing about x87 performance on modern CPUs. (In fact that's a cross-site duplicate of this, asking why MSVC doesn't expose an 80-bit x87 type as long double.)

Can long integer routines benefit from SSE?

I'm still working on routines for arbitrary long integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs.
Everything works fine, but I wondered if I could speed it up a bit by using SSE. I browsed through the SSE docs and processor instruction lists, but I could not find anything I think I can use, and here is why:
SSE has some integer instructions, but most instructions handle floating point. It doesn't look like it was designed for use with integers (e.g. is there an integer compare for less?)
The SSE idea is SIMD (same instruction, multiple data), so it provides instructions for 2 or 4 independent operations. I, on the other hand, would like to have something like a 128 bit integer add (128 bit input and output). This doesn't seem to exist. (Yet? In AVX2 maybe?)
The integer additions and subtractions handle neither input nor output carries. So it's very cumbersome (and thus, slow) to do it by hand.
My question is: is my assessment correct or is there anything I have overlooked? Can long integer routines benefit from SSE? In particular, can they help me to write a quicker add, sub or mul routine?
In the past, the answer to this question was a solid, "no". But as of 2017, the situation is changing.
But before I continue, time for some background terminology:
Full Word Arithmetic
Partial Word Arithmetic
Full-Word Arithmetic:
This is the standard representation where the number is stored in base 2^32 or 2^64 using an array of 32-bit or 64-bit integers.
Many bignum libraries and applications (including GMP) use this representation.
In full-word representation, every integer has a unique representation. Operations like comparisons are easy. But stuff like addition is more difficult because of the need for carry-propagation.
It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.
Partial-Word Arithmetic
This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.
The authors of GMP call this, "nails" where the "nail" is the unused portion of the word.
In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.
Problems with Full-Word Arithmetic:
Vectorizing full-word arithmetic has historically been a lost cause:
SSE/AVX2 has no support for carry-propagation.
SSE/AVX2 has no 128-bit add/sub.
SSE/AVX2 has no 64 x 64-bit integer multiply.*
*AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.
Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums:
Add-with-Carry: adc, adcx, adox.
Double-word Multiply: Single-operand mul and mulx.
In light of this, it is difficult for SIMD to beat scalar x64 at both bignum-add and bignum-multiply. Definitely not with SSE or AVX.
With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same lengths in each of the 4 SIMD lanes.
AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.
But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".
Vectorization of Partial-Word Arithmetic
With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.
So as long as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, partial-word representation uses signed integers to allow words to go negative.
Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically and horizontally-vectorized bignums.
Carryout on horizontally-vectorized bignums is still cheap as you merely shift the nails over the next lane. A full carryout to completely clear the nail bits and get to a unique representation usually isn't necessary unless you need to do a comparison of two numbers that are almost the same.
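A hedged AVX2 sketch of that idea (none of this is from the answer above; the names and the 60-bit limb choice are mine): with 60-bit limbs in 64-bit lanes, up to 16 canonical limbs can be accumulated with plain SIMD adds before any carry handling, since 16 * (2^60 - 1) still fits in 64 bits; a partial carry pass then just splits each lane into its low 60 bits and the overflow above them.
#include <immintrin.h>
#include <stdint.h>

static const uint64_t LIMB_MASK = (1ULL << 60) - 1;   // low 60 bits; the top 4 bits are the "nail"

// Deferred-carry addition: no carry propagation at all.
static inline __m256i add_limbs(__m256i a, __m256i b)
{
    return _mm256_add_epi64(a, b);
}

// One partial carry pass: keep the low 60 bits, extract whatever spilled into
// the nail. The extracted carries must then be added into the next more
// significant limb (a lane shuffle for horizontal layouts, or simply index+1
// for vertical layouts) -- left out here for brevity.
static inline void split_carries(__m256i v, __m256i* low, __m256i* carry)
{
    *low   = _mm256_and_si256(v, _mm256_set1_epi64x((long long)LIMB_MASK));
    *carry = _mm256_srli_epi64(v, 60);
}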
Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.
AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.
Large Multiplication using FFTs
For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).
FFTs are completely vectorizable since they work on independent doubles. This is possible because, fundamentally, the representation that FFTs use is a partial-word representation.
To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.
If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.
But nevertheless, bignum arithmetic is possible to vectorize.
Disclosure:
I'm the author of y-cruncher which does plenty of large number arithmetic.

Resources