Why is comparing double faster than uint64? - performance

I analysed the following program using Matlab's profile. Both double and uint64 are 64-bit types. Why is comparing two double values so much faster than comparing two uint64 values? Aren't they both compared bitwise?
big = 1000000;
a = uint64(randi(100,big,1));
b = uint64(randi(100,big,1));
c = uint64(zeros(big,1));
tic;
for i=1:big
if a(i) == b(i)
c(i) = c(i) + 1;
end
end
toc;
a = randi(100,big,1);
b = randi(100,big,1);
c = zeros(big,1);
tic;
for i=1:big
if a(i) == b(i)
c(i) = c(i) + 1;
end
end
toc;
This is the measurement from profile (screenshot not reproduced here).
This is what tic/toc measures:
Elapsed time is 6.259040 seconds.
Elapsed time is 0.015387 seconds.
The effect disappears when uint8..uint32 or int8..int32 are used instead of 64-bit data types.

It's probably a combination of the Matlab interpreter and the underlying CPU not supporting 64-bit int types as well as the others.
Matlab favors double over int operations. Most values are stored in double types, even if they represent integer values. The double and int == operations take different code paths, and MathWorks will have devoted a lot more attention to tuning and optimizing the code for double than for ints, especially int64. In fact, older versions of Matlab did not support arithmetic operations on int64 at all. (And IIRC, it still doesn't support mixed-integer math.) When you do int64 math, you're using less mature, less tuned code, and the same may apply to ==. Int types are not a priority in Matlab. The presence of int64 may even interfere with the JIT optimizing that code, but that's just a guess.
But there might be an underlying hardware reason for this too. Here's a hypothesis: if you're on 32-bit x86, you're working with 32-bit general purpose (integer) registers. That means the smaller int types can fit in a register and be compared using fast instructions, but the 64-bit int values won't fit in a register, and may take more expensive instruction sequences to compare. The doubles, even though they are 64 bits wide, will fit in the wide floating-point registers of the x87 floating point unit, and can be compared in hardware using fast floating-point comparison instructions. This means the [u]int64s are the only ones that can't be compared using fast single-register, single-instruction operations.
If that's the case, if you run this same code on 64-bit x86-64 (in 64-bit Matlab), the difference may disappear because you then have 64-bit wide general purpose registers. But then again, it may not, if the Matlab interpreter's code isn't compiled to take advantage of it.
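For reference, here is a minimal C++ micro-benchmark (my own sketch, not part of the question or answer; the helper name time_compare is made up) that runs the same million-element equality test on uint64_t and on double. Compiled natively on x86-64, both loops should run at essentially the same speed, which supports the idea that the gap above comes from Matlab's integer code path rather than from the hardware.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Times one pass of element-wise equality tests over two vectors.
template <typename T>
double time_compare(const std::vector<T>& a, const std::vector<T>& b) {
    auto t0 = std::chrono::steady_clock::now();
    std::size_t hits = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] == b[i]) ++hits;
    auto t1 = std::chrono::steady_clock::now();
    std::printf("(%zu hits) ", hits);   // keeps the loop from being optimized away
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const std::size_t big = 1000000;
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(1, 100);

    std::vector<std::uint64_t> ua(big), ub(big);
    std::vector<double> da(big), db(big);
    for (std::size_t i = 0; i < big; ++i) {
        ua[i] = dist(rng); ub[i] = dist(rng);
        da[i] = dist(rng); db[i] = dist(rng);
    }

    std::printf("uint64: %f s\n", time_compare(ua, ub));
    std::printf("double: %f s\n", time_compare(da, db));
}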

Related

Fastest way to implement floating-point multiplication by a small integer constant

Suppose you are trying to multiply a floating-point number k by a small integer constant n (by small I mean -20 <= n <= 20). The naive way of doing this is converting n to a floating point number (which for the purposes of this question does not count towards the runtime) and executing a floating-point multiply. However, for n = 2, it seems likely that k + k is a faster way of computing it. At what n does the multiply instruction become faster than repeated additions (plus an inversion at the end if n < 0)?
Note that I am not particularly concerned about accuracy here; I am willing to allow unsound optimizations as long as they get roughly the right answer (i.e.: up to 1024 ULP error is probably fine).
I am writing OpenCL code, so I'm interested in the answer to this question in many computational contexts (x86-64, x86-64 + AVX256, GPUs).
I could benchmark this, but since I don't have a particular architecture in mind, I'd prefer a theoretical justification of the choice.
According to AMD's OpenCL optimisation guide for GPUs, section 3.8.1 "Instruction Bandwidths", for single-precision floating point operands, addition, multiplication and 'MAD' (multiply-add) all have a throughput of 5 per cycle on GCN based GPUs. The same is true for 24-bit integers. Only once you move to 32-bit integers are multiplications much more expensive (1/cycle). Int-to-float conversions and vice versa are also comparatively slow (1/cycle), and unless you have a double-precision float capable model (mostly FirePro/Radeon Pro series or Quadro/Tesla from nvidia) operations on doubles are super slow (<1/cycle). Negation is typically "free" on GPUs - for example GCN has sign flags on instruction operands, so -(a + b) compiles to one instruction after transforming to (-a) + (-b).
Nvidia GPUs tend to be a bit slower at integer operations, for floats it's a similar story to AMD's though: multiplications are just as fast as addition, and if you can combine them into MAD operations, you can double throughput. Intel's GPUs are quite different in other regards, but again they're very fast at FP multiplication and addition.
Basically, it's really hard to beat a GPU at floating-point multiplication, as that's essentially the one thing they're optimised for.
On the CPU it's typically more complicated - Agner Fog's optimisation resources and instruction tables are the place to go for the details. Note though that on many CPUs you'll pay a penalty for interpreting float data as integer and back because ALU and FPU are typically separate. (For example if you wanted to optimise multiplying floats by a power of 2 by performing an integer addition on their exponents. On x86, you can easily do this by operating on SSE or AVX registers using first float instructions, then integer ones, but it's generally not good for performance.)
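As an illustration of the exponent trick mentioned above, here is a hypothetical C++ sketch (mul_pow2 is my own name, not from the answer; requires C++20 for std::bit_cast). It multiplies a float by 2^k by adding k to the biased exponent field, and it assumes a normal, finite input with no exponent overflow or underflow; the bit_cast round-trips are exactly where the ALU/FPU domain-crossing penalty would show up on real hardware.
#include <bit>
#include <cstdint>
#include <cstdio>

// Multiply a normal, finite float by 2^k by adding k to the biased exponent.
// No handling of zero, denormals, infinities, NaN, or exponent overflow.
float mul_pow2(float x, int k) {
    std::uint32_t bits = std::bit_cast<std::uint32_t>(x);
    bits += static_cast<std::uint32_t>(k) << 23;   // exponent field occupies bits 23..30
    return std::bit_cast<float>(bits);
}

int main() {
    std::printf("%f\n", mul_pow2(3.5f, 4));    // 3.5 * 16 = 56
    std::printf("%f\n", mul_pow2(10.0f, -2));  // 10 / 4  = 2.5
}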

Addition efficiency proportional to size of operands?

A question that just popped into my head, and I don't think I've seen an answer on here: is the time taken by a binary addition algorithm proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations, should we start looking into more efficient methods of calculation? E.g. exponentiation by squaring with large exponents for calculating huge powers.
How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32-bit system, all the above numbers will take the same time (number of CPU instructions) to be added, as all of them fit within the basic computational block, i.e. the 32-bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16-bit system, as the larger numbers will NOT fit in a 16-bit register, an additional pass is needed (to add the significant bits that remain after the first 16 LSBs are added).
Step1: ADD Least significant bits
0101010101101010
0100101010010101
Step2: ADD the rest (remember carry bit from previous operation)
000000000000000C
0000000110101101
0000000000001011
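Here is a small C++ sketch of those same two passes (my own illustration; add_via_limbs is a made-up name): a 64-bit sum built from 32-bit limbs with an explicit carry, which is essentially what a 32-bit machine has to do for 64-bit operands.
#include <cstdint>
#include <cstdio>

// Add two 64-bit values using only 32-bit limb additions, propagating the
// carry from the low limb into the high limb -- the "second pass" above.
std::uint64_t add_via_limbs(std::uint64_t a, std::uint64_t b) {
    std::uint32_t a_lo = (std::uint32_t)a, a_hi = (std::uint32_t)(a >> 32);
    std::uint32_t b_lo = (std::uint32_t)b, b_hi = (std::uint32_t)(b >> 32);

    std::uint32_t lo = a_lo + b_lo;            // step 1: add the least significant limbs
    std::uint32_t carry = (lo < a_lo) ? 1 : 0; // unsigned wrap-around means a carry occurred
    std::uint32_t hi = a_hi + b_hi + carry;    // step 2: add the rest plus the carry

    return ((std::uint64_t)hi << 32) | lo;
}

int main() {
    std::printf("%llu\n", (unsigned long long)add_via_limbs(0xFFFFFFFFULL, 1)); // 4294967296
}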
We can start thinking of optimising the mathematical operations on
numbers once the numbers no longer fit in the basic computation unit
of the system i.e. the CPU-register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they see such a case, e.g. 128-bit decryption logic being run on a 32-bit system.
Also, it is not only the size of the operands that matters: the size of the result also determines whether the system can accomplish the operation within one step, so the operation being performed needs to be taken into consideration as well as the operands involved.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure validity of the result for all possible values, the architecture (and the compiler) must:
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)
In practice, there's no (or a completely negligible) difference between adding different integers that fit in a processor word, as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.
I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256bits at a time, which means you can process 4 integer additions if you're using 64bit, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision). They all take the same number of clock cycles to compute - 1clk.
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
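As a concrete usage sketch of the 32-bit variant (my own illustration, not from the original answer; add_i32 is a made-up name, and it assumes n is a multiple of 8): with _mm256_add_epi8 the same loop would process four times as many elements per instruction.
#include <immintrin.h>   // AVX2 intrinsics; compile with -mavx2
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Element-wise 32-bit addition, 8 lanes per instruction.
// Unaligned loads/stores keep the example simple.
void add_i32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), _mm256_add_epi32(va, vb));
    }
}

int main() {
    std::int32_t a[8] = {1,2,3,4,5,6,7,8}, b[8] = {10,20,30,40,50,60,70,80}, c[8];
    add_i32(a, b, c, 8);
    for (std::int32_t x : c) std::printf("%d ", (int)x);   // 11 22 33 44 55 66 77 88
    std::printf("\n");
}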
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, this might have seemed a little silly. However, the new instruction set contains a ternary logic instruction. With this instruction, it is possible to compute 512 additions on numbers of any bit length fairly easily.
#include <immintrin.h>   // AVX-512F intrinsics

inline __m512i add(__m512i x, __m512i y, __m512i carry_in)
{
    // bitwise sum of a full adder: x ^ y ^ carry_in (truth table 0x96)
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);
}
inline __m512i adc(__m512i x, __m512i y, __m512i carry_in)
{
    // carry out of a full adder: majority(x, y, carry_in) (truth table 0xE8)
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);
}
__m512i A[NUM_BITS];
__m512i B[NUM_BITS];
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_si512();
for(int i = 0; i < NUM_BITS; ++i)
{
    RESULT[i] = add(A[i], B[i], CARRY);
    CARRY = adc(A[i], B[i], CARRY);
}
In this particular example (which, to be honest, probably has very limited real-world usage!), the time it takes to perform the 512 additions is indeed directly proportional to NUM_BITS.

Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060, although I'm testing it now on a GTX 260 which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N, which are the dimensions of a rectangular image received as a parameter; these can vary greatly, from 50x50 to 10000x10000 pixels or more, although I'm mostly interested in working with the bigger images in CUDA.
Now each image has to be traced in all directions and angles, and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing the kernel below, which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays, all of length SUM, into the global memory of the device one after another, since I need to access them for the calculations. Like this:
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All of this is done before the kernel calls. According to the CUDA profiler in Nsight, the highest memcpy duration is 246.016 us for a 500x500 image, so that is not taking very long.
But kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the CUDA profiler for the kernel below on a 500x500 image, and 5.052 seconds for the kernel with the highest duration), so I need to see if I can optimize them.
I arrange the data this way. First the block dimension:
dim3 dimBlock(256,1,1);
then the number of blocks per grid:
dim3 dimGrid((SUM+255)/256);
which gives 895 blocks for a 500x500 image.
I'm not sure how to use coalescing and shared memory in my case, or even whether it's a good idea to call the kernel several times with different portions of the data. The data is independent, so I could in theory call the kernel several times, and not with all 229080 threads at once, if need be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the CUDA profiler, the Global Store Efficiency is 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values.
Is that good? Bad? How can I interpret those values? Sadly, devices of compute capability 1.3 don't provide other useful info for assessing the code, like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis are "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency", but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM, int cT, double Bn, size_t M, size_t N,
                          float* imagen_cuda, double* vector_trazo_cuda,
                          long* xb_cuda, long* yb_cuda,
                          long* xinc_cuda, long* yinc_cuda,
                          long* tbegin_cuda, long* tendbegin_cuda){

    long xi;
    long yi;
    int t;
    int k;
    int a;
    int ji;
    long idHilo=blockIdx.x*blockDim.x+threadIdx.x;
    int neighborhood[31];
    int v=0;

    if(idHilo<SUM){
        for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){

            xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
            yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);

            neighborhood[v]=floor(xi/Bn);
            ji=floor(yi/Bn);

            if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
            {
                if(tendbegin_cuda[idHilo]>30 && v==30){

                    if(t==0)
                        vector_trazo_cuda[20+idHilo*31]=0;

                    for(k=1;k<=15;k++)
                        vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
                            fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
                                 imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);

                    for(a=0;a<30;a++)
                        neighborhood[a]=neighborhood[a+1];

                    v=v-1;
                }
                v=v+1;
            }
        }
    }
}
EDIT:
Changing the DP flops for SP flops only slightly improved the duration. Loop unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math/iterations/memory etc.
If precision is not a big concern, use single-precision floating point (or half-precision floating point with newer architectures). Part of the reason it didn't affect your performance much when you briefly tried it is that you're still doing double-precision calculations on your floating-point data (fabs takes a double, so if you use it with a float it converts your float to a double, does double math, returns a double and converts back to float; use fabsf).
If you don't need to use the absolute full precision of float use fast math (compiler option).
Multiply is much faster than divide (especially for full precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside kernel.
Don't know if it gets optimized out, but you should use increment and decrement operators. v=v-1; could be v--; etc.
Casting to an int will truncate toward zero; floor() will truncate toward negative infinity. You probably don't need an explicit floor() (and it would be floorf() for float, as above). When you use it on intermediate computations with integer types, they're already truncated, so you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.).
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this:
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (it's also better because, if you exceeded the precision of double in the previous case, you would be rounding to a nearest power of 2). Also, those four lines should be outside the for loop; you don't need to recompute them if they don't depend on t. Together, I wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads; try to read once from global, handle the calculations on local copies, and write once to global (if at all possible).
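For instance, here is a hypothetical sketch of that advice, reusing the variable names from the kernel above (the hoisted locals xb, yb, xinc, yinc and tend are my own names): pull the per-thread values into registers once, before the loop.
const long xb   = xb_cuda[idHilo];        // one global load each, hoisted out of the loop
const long yb   = yb_cuda[idHilo];
const long xinc = xinc_cuda[idHilo];
const long yinc = yinc_cuda[idHilo];
const long tend = tendbegin_cuda[idHilo];
for(t=15; t<=tend-15; t++){
    xi = xb + t*xinc;                     // integer math only, no double round-trip
    yi = yb + t*yinc;
    // ... rest of the loop body unchanged, working on registers ...
}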
Compile with -lineinfo always. It makes profiling easier, and I haven't been able to measure any overhead whatsoever (using kernels in the 0.1 to 10 ms execution-time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to allow the compiler to use registers when possible; this is a big topic.
As always, don't change everything at once. I typed all this out without compiling or testing, so I may have made an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up from a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code is available at CUDA Tuning with CUDAFY.

How do I translate this range coding C++ snippet to performant Haskell?

I know enough Haskell to translate the code below, but I don't know much about making it perform well:
typedef unsigned long precision;
typedef unsigned char uc;
const int kSpaceForByte = sizeof(precision) * 8 - 8;
const int kHalfPrec = sizeof(precision) * 8 / 2;
const precision kTop = ((precision)1) << kSpaceForByte;
const precision kBot = ((precision)1) << kHalfPrec;
//This must be called before encoding starts
void RangeCoder::StartEncode(){
    _low = 0;
    _range = (precision) -1;
}
/*
RangeCoder does not concern itself with models of the data.
To encode each symbol, you pass the parameters *cumFreq*, which gives
the cumulative frequency of the possible symbols ordered before this symbol,
*freq*, which gives the frequency of this symbol. And *totFreq*, which gives
the total frequency of all symbols.
This means that you can have different frequency distributions / models for
each encoded symbol, as long as you can restore the same distribution at
this point, when restoring.
*/
void RangeCoder::Encode(precision cumFreq, precision freq, precision totFreq){
    assert(cumFreq + freq <= totFreq && freq && totFreq <= kBot);
    _low += cumFreq * (_range /= totFreq);
    _range *= freq;
    while ((_low ^ _low + _range) < kTop or
           _range < kBot and ((_range= -_low & kBot - 1), 1)){
        //the "a or b and (r=..,1)" idiom is a way to assign r only if a is false.
        OutByte(_low >> kSpaceForByte); //output one byte.
        _range <<= sizeof(uc) * 8;
        _low <<= sizeof(uc) * 8;
    }
}
I know, I know "Write several versions and use criterion to see what works". I don't know enough to know what my options are though, or to avoid silly mistakes.
Here are my thoughts so far. One way would be to use the State monad and/or lenses. Another would be to translate the loop and state to explicit recursion. I read somewhere that explicit recursion tends to perform badly on GHC, though. I think using ByteString Builder would be a good way to output each byte. Assuming I run on a 64-bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Since this is not a 1-1 mapping, pipes with StateP would lead to very neat code, where I would request arguments one at a time and then let the while-loop respond byte for byte. Unfortunately, when I benchmarked it, it seems the pipe overhead is (unsurprisingly) quite large. Since each symbol can lead to many byte outputs, it feels a bit like a concatMap with State. Perhaps this would be the idiomatic solution? Concatenating lists of bytes does not sound very fast to me, though. ByteString has a concatMap. Perhaps this is the correct way? EDIT: no it is not. It takes a ByteString as input.
I intend to release the package on Hackage when I'm done, so any advice (or actual code!) you can give will benefit the community :). I plan to use this compression as a base for writing a very memory efficient compressed map.
I read somewhere that explicit recursion tends to perform badly on GHC though.
No. GHC produces slow machine code for recursion that cannot be reduced (or that GHC "doesn't want" to reduce). If the recursion can be unrolled (I don't see any fundamental problem with it in your snippet), it is translated into almost the same machine code as a while-loop in C or C++.
Assuming I run on a 64 bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Do you mean Word#? Let GHC deal with it and use boxed types. I've never met a situation where some profit could be achieved only by using unboxed types. Using 32-bit types wouldn't help on a 64-bit platform.
One general rule for optimizing performance with GHC is to avoid data structures where possible. If you can pass pieces of data through function arguments or closures, take the opportunity.

long double (GCC specific) and __float128

I'm looking for detailed information on long double and __float128 in GCC/x86 (more out of curiosity than because of an actual problem).
Few people will probably ever need these (I've just, for the first time ever, truly needed a double), but I guess it is still worthwhile (and interesting) to know what you have in your toolbox and what it's about.
In that light, please excuse my somewhat open questions:
Could someone explain the implementation rationale and intended usage of these types, also in comparison with each other? For example, are they "embarrassment implementations" because the standard allows for the type, and someone might complain if they're only just the same precision as double, or are they intended as first-class types?
Alternatively, does someone have a good, usable web reference to share? A Google search on "long double" site:gcc.gnu.org/onlinedocs didn't give me much that's truly useful.
Assuming that the common mantra "if you believe that you need double, you probably don't understand floating point" does not apply, i.e. you really need more precision than just float, and one doesn't care whether 8 or 16 bytes of memory are burnt... is it reasonable to expect that one can as well just jump to long double or __float128 instead of double without a significant performance impact?
The "extended precision" feature of Intel CPUs has historically been source of nasty surprises when values were moved between memory and registers. If actually 96 bits are stored, the long double type should eliminate this issue. On the other hand, I understand that the long double type is mutually exclusive with -mfpmath=sse, as there is no such thing as "extended precision" in SSE. __float128, on the other hand, should work just perfectly fine with SSE math (though in absence of quad precision instructions certainly not on a 1:1 instruction base). Am I right in these assumptions?
(3. and 4. can probably be figured out with some work spent on profiling and disassembling, but maybe someone else had the same thought previously and has already done that work.)
Background (this is the TL;DR part):
I initially stumbled over long double because I was looking up DBL_MAX in <float.h>, and incidentally LDBL_MAX is on the next line. "Oh look, GCC actually has 128 bit doubles, not that I need them, but... cool" was my first thought. Surprise, surprise: sizeof(long double) returns 12... wait, you mean 16?
The C and C++ standards unsurprisingly do not give a very concrete definition of the type. C99 (6.2.5 10) says that the numbers of double are a subset of long double whereas C++03 states (3.9.1 8) that long double has at least as much precision as double (which is the same thing, only worded differently). Basically, the standards leave everything to the implementation, in the same manner as with long, int, and short.
Wikipedia says that GCC uses "80-bit extended precision on x86 processors regardless of the physical storage used".
The GCC documentation states, all on the same page, that the size of the type is 96 bits because of the i386 ABI, but that no more than 80 bits of precision are enabled by any option (huh? what?), and also that Pentium and newer processors want them aligned as 128-bit numbers. This is the default under 64 bits and can be manually enabled under 32 bits, resulting in 32 bits of zero padding.
Time to run a test:
#include <stdio.h>
#include <cfloat>

int main()
{
#ifdef USE_FLOAT128
    typedef __float128 long_double_t;
#else
    typedef long double long_double_t;
#endif

    long_double_t ld;

    int* i = (int*) &ld;
    i[0] = i[1] = i[2] = i[3] = 0xdeadbeef;

    for(ld = 0.0000000000000001; ld < LDBL_MAX; ld *= 1.0000001)
        printf("%08x-%08x-%08x-%08x\r", i[0], i[1], i[2], i[3]);

    return 0;
}
The output, when using long double, looks somewhat like this, with the marked digits being constant, and all others eventually changing as the numbers get bigger and bigger:
5636666b-c03ef3e0-00223fd8-deadbeef
^^ ^^^^^^^^
This suggests that it is not an 80 bit number. An 80-bit number has 18 hex digits. I see 22 hex digits changing, which looks much more like a 96 bits number (24 hex digits). It also isn't a 128 bit number since 0xdeadbeef isn't touched, which is consistent with sizeof returning 12.
The output for __float128 looks like it's really just a 128-bit number. All bits eventually flip.
Compiling with -m128bit-long-double does not align long double to 128 bits with a 32-bit zero padding, as indicated by the documentation. It doesn't use __int128 either, but indeed seems to align to 128 bits, padding with the value 0x7ffdd000(?!).
Further, LDBL_MAX, seems to work as +inf for both long double and __float128. Adding or subtracting a number like 1.0E100 or 1.0E2000 to/from LDBL_MAX results in the same bit pattern.
Up to now, it was my belief that the foo_MAX constants were to hold the largest representable number that is not +inf (apparently that isn't the case?). I'm also not quite sure how an 80-bit number could conceivably act as +inf for a 128 bit value... maybe I'm just too tired at the end of the day and have done something wrong.
Ad 1.
Those types are designed to work with numbers with a huge dynamic range. The long double is implemented natively in the x87 FPU. The 128-bit double, I suspect, would be implemented in software mode on modern x86s, as there's no hardware to do quad-precision computations.
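To make the "software mode" point concrete, here is a small GCC-specific sketch of my own (not from the answer): __float128 arithmetic on x86 is lowered to software routines in libgcc/libquadmath, which you can see by disassembling or timing something like this (link with -lquadmath).
#include <quadmath.h>
#include <cstdio>

int main() {
    __float128 x = (__float128)1 / 3;                 // division is done by a software routine, no quad-precision FPU
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.33Qg", x);  // print ~33 significant digits
    std::printf("%s\n", buf);
}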
The funny thing is that it's quite common to do many floating point operations in a row and the intermediate results are not actually stored in declared variables but rather stored in FPU registers taking advantage of full precision. That's why comparison:
double x = sin(0); if (x == sin(0)) printf("Equal!");
Is not safe and cannot be guaranteed to work (without additional switches).
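As an aside (my own sketch, not part of the answer), the usual workarounds are compiling with -ffloat-store, lowering the x87 precision as shown below, or forcing the intermediates through 64-bit memory slots, e.g. via volatile, so that both sides of the comparison have been rounded identically:
#include <cmath>
#include <cstdio>

int main() {
    // Force both values through a 64-bit memory slot so they are rounded
    // identically before the comparison (instead of one living in an 80-bit register).
    volatile double x = std::sin(0.0);
    volatile double y = std::sin(0.0);
    if (x == y) std::printf("Equal!\n");
}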
Ad. 3.
There's an impact on speed depending on what precision you use. You can change the precision used by the FPU with:
void
set_fpu (unsigned int mode)
{
asm ("fldcw %0" : : "m" (*&mode));
}
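For reference (my addition, based on the x87 control-word layout rather than anything stated in the answer): bits 8-9 of the control word select the precision, so typical values passed to set_fpu would be:
set_fpu(0x07F);  // precision control = 00b: round results to 24-bit (single) precision
set_fpu(0x27F);  // precision control = 10b: round results to 53-bit (double) precision
set_fpu(0x37F);  // precision control = 11b: 64-bit (extended) precision, the x87 default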
It will be faster for shorter variables and slower for longer ones. 128-bit doubles will probably be done in software, so they will be much slower.
It's not only about RAM being wasted, it's about cache being wasted. Going from a 64-bit double to an 80-bit double will waste from 33% (32 extra bits) to almost 50% (64 extra bits) of the memory (including cache).
Ad 4.
On the other hand, I understand that the long double type is mutually
exclusive with -mfpmath=sse, as there is no such thing as "extended
precision" in SSE. __float128, on the other hand, should work just
perfectly fine with SSE math (though in absence of quad precision
instructions certainly not on a 1:1 instruction base). Am I right under
these assumptions?
The FPU and SSE units are totally separate. You can write code using the FPU at the same time as SSE. The question is what the compiler will generate if you constrain it to use only SSE: will it try to use the FPU anyway? I've been doing some programming with SSE, and GCC will generate only scalar (SISD) code on its own; you have to help it to use the SIMD versions. __float128 will probably work on every machine, even an 8-bit AVR uC. It's just fiddling with bits, after all.
80 bits in hex representation is actually 20 hex digits. Maybe the bits which are not used are left over from some old operation? On my machine, I compiled your code and only 20 hex digits change in long double mode: 66b4e0d2-ec09c1d5-00007ffe-deadbeef
The 128-bit version has all the bits changing. Looking at the objdump it looks as if it was using software emulation, there are almost no FPU instructions.
Further, LDBL_MAX, seems to work as +inf for both long double and
__float128. Adding or subtracting a number like 1.0E100 or 1.0E2000 to/from LDBL_MAX results in the same bit pattern. Up to now, it was my
belief that the foo_MAX constants were to hold the largest
representable number that is not +inf (apparently that isn't the
case?).
This seems to be strange...
I'm also not quite sure how an 80-bit number could conceivably
act as +inf for a 128-bit value... maybe I'm just too tired at the end
of the day and have done something wrong.
It's probably being extended. The pattern which is recognized to be +inf in 80-bit is translated to +inf in 128-bit float too.
IEEE-754 defined 32- and 64-bit floating-point representations for the purpose of efficient data storage, and an 80-bit representation for the purpose of efficient computation. The intention was that given float f1,f2; double d1,d2; a statement like d1=f1+f2+d2; would be executed by converting the arguments to 80-bit floating-point values, adding them, and converting the result back to a 64-bit floating-point type. This would offer three advantages compared with performing operations on other floating-point types directly:
While separate code or circuitry would be required for conversions to/from 32-bit types and 64-bit types, it would only be necessary to have one "add" implementation, one "multiply" implementation, one "square root" implementation, etc.
Although in rare cases using an 80-bit computational type could yield results that were very slightly less accurate than using other types directly (worst-case rounding error is 513/1024ulp in cases where computations on other types would yield an error of 511/1024ulp), chained computations using 80-bit types would frequently be more accurate--sometimes much more accurate--than computations using other types.
On a system without an FPU, separating a double into a separate exponent and mantissa before performing computations, normalizing the mantissa, and converting a separate mantissa and exponent back into a double are somewhat time consuming. If the result of one computation will be used as input to another and then discarded, using an unpacked 80-bit type will allow these steps to be omitted.
In order for this approach to floating-point math to be useful, however, it is imperative that it be possible for code to store intermediate results with the same precision as would be used in computation, such that temp = d1+d2; d4=temp+d3; will yield the same result as d4=d1+d2+d3;. From what I can tell, the purpose of long double was to be that type. Unfortunately, even though K&R designed C so that all floating-point values would be passed to variadic methods the same way, ANSI C broke that. In C as originally designed, given the code float v1,v2; ... printf("%12.6f", v1+v2);, the printf method wouldn't have to worry about whether v1+v2 would yield a float or a double, since the result would get coerced to a known type regardless. Further, even if the type of v1 or v2 changed to double, the printf statement wouldn't have to change.
ANSI C, however, requires that code which calls printf must know which arguments are double and which are long double; a lot of code--if not a majority--that uses long double but was written on platforms where it's synonymous with double fails to use the correct format specifiers for long double values. Rather than having long double be an 80-bit type except when passed as a variadic method argument, in which case it would be coerced to 64 bits, many compilers decided to make long double synonymous with double and not offer any means of storing the results of intermediate computations. Since using an extended-precision type for computation is only good if that type is made available to the programmer, many people came to regard extended precision as evil, even though it was only ANSI C's failure to handle variadic arguments sensibly that made it problematic.
PS--The intended purpose of long double would have benefited if there had also been a long float which was defined as the type to which float arguments could be most efficiently promoted; on many machines without floating-point units that would probably be a 48-bit type, but the optimal size could range anywhere from 32 bits (on machines with an FPU that does 32-bit math directly) up to 80 (on machines which use the design envisioned by IEEE-754). Too late now, though.
It boils down to the difference between 4.9999999999999999999 and 5.0.
Although the range is the main difference, it is precision that is important.
These types of data are needed in great-circle calculations or coordinate mathematics that are likely to be used with GPS systems.
As the precision is much better than normal double, it means you can typically retain 18 significant digits without losing accuracy in calculations.
Extended precision I believe uses 80 bits (used mostly in maths processors), so 128 bits will be much more accurate.
C99 and C++11 added types float_t and double_t which are aliases for built-in floating-point types. Roughly, float_t is the type of the result of doing arithmetic among values of type float, and double_t is the type of the result of doing arithmetic among values of type double.
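A tiny C++ sketch (my addition, not part of the answer) that reports what a given compiler and target actually do; FLT_EVAL_METHOD comes from <cfloat> and the aliases from <cmath>:
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    // FLT_EVAL_METHOD: 0 = evaluate in the operand's type, 1 = float promoted to double,
    // 2 = everything evaluated in long double (classic x87 behaviour), -1 = indeterminable.
    std::printf("FLT_EVAL_METHOD  = %d\n", (int)FLT_EVAL_METHOD);
    std::printf("sizeof(float_t)  = %zu\n", sizeof(std::float_t));
    std::printf("sizeof(double_t) = %zu\n", sizeof(std::double_t));
}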

Resources