gcc precision bug? - gcc

I can only assume this is a bug. The first assert passes while the second fails:
double sum_1 = 4.0 + 6.3;
assert(sum_1 == 4.0 + 6.3);
double t1 = 4.0, t2 = 6.3;
double sum_2 = t1 + t2;
assert(sum_2 == t1 + t2);
If not a bug, why?

This is something that has bitten me, too.
Yes, floating point numbers should never be compared for equality because of rounding error, and you probably knew that.
But in this case, you're computing t1+t2, then computing it again. Surely that has to produce an identical result?
Here's what's probably going on. I'll bet you're running this on an x86 CPU, correct? The x86 FPU uses 80 bits for its internal registers, but values in memory are stored as 64-bit doubles.
So t1+t2 is first computed with 80 bits of precision, then -- I presume -- stored out to memory in sum_2 with 64 bits of precision -- and some rounding occurs. For the assert, it's loaded back into a floating point register, and t1+t2 is computed again, again with 80 bits of precision. So now you're comparing sum_2, which was previously rounded to a 64-bit floating point value, with t1+t2, which was computed with higher precision (80 bits) -- and that's why the values aren't exactly identical.
Edit So why does the first test pass? In this case, the compiler probably evaluates 4.0+6.3 at compile time and stores it as a 64-bit quantity -- both for the assignment and for the assert. So identical values are being compared, and the assert passes.
Second Edit Here's the assembly code generated for the second part of the code (gcc, x86), with comments -- pretty much follows the scenario outlined above:
// t1 = 4.0
fldl LC3
fstpl -16(%ebp)
// t2 = 6.3
fldl LC4
fstpl -24(%ebp)
// sum_2 = t1+t2
fldl -16(%ebp)
faddl -24(%ebp)
fstpl -32(%ebp)
// Compute t1+t2 again
fldl -16(%ebp)
faddl -24(%ebp)
// Load sum_2 from memory and compare
fldl -32(%ebp)
fxch %st(1)
fucompp
Interesting side note: This was compiled without optimization. When it's compiled with -O3, the compiler optimizes all of the code away.
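As a hedged illustration of that scenario (x87, no optimization): forcing the recomputed sum through a 64-bit memory store, here via a volatile temporary, makes both sides of the comparison go through the same rounding, so the assert passes. This is just a sketch of the mechanism, not an endorsement of exact floating point comparison:
#include <assert.h>

int main(void)
{
    double t1 = 4.0, t2 = 6.3;
    double sum_2 = t1 + t2;
    volatile double recomputed = t1 + t2;  /* stored to memory: rounded to 64 bits */
    assert(sum_2 == recomputed);           /* both operands now carry 64-bit precision */
    return 0;
}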

You are comparing floating point numbers. Don't do that: floating point numbers have inherent precision error in some circumstances. Instead, take the absolute value of the difference of the two values and assert that it is less than some small number (epsilon).
#include <assert.h>
#include <math.h>

/* Treat d1 and d2 as equal if they differ by less than epsilon. */
void CompareFloats( double d1, double d2, double epsilon )
{
    assert( fabs( d1 - d2 ) < epsilon );   /* fabs, not abs: abs() truncates to int */
}
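For the example in the question, a call along these lines would pass (the 1e-9 tolerance here is an arbitrary illustration, not a recommended value):
CompareFloats( sum_2, t1 + t2, 1e-9 );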
This has nothing to do with the compiler and everything to do with the way floating point numbers are implemented. Here is the IEEE spec:
http://www.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF

I've duplicated your problem on my Intel Core 2 Duo, and I looked at the assembly code. Here's what's happening: when your compiler evaluates t1 + t2, it does
load t1 into an 80-bit register
load t2 into an 80-bit register
compute the 80-bit sum
When it stores into sum_2 it does
round the 80-bit sum to a 64-bit number and store it
Then the == comparison compares the 80-bit sum to a 64-bit sum, and they're different, primarily because the fractional part 0.3 cannot be represented exactly using a binary floating-point number, so you are comparing a 'repeating decimal' (actually repeating binary) that has been truncated to two different lengths.
What's really irritating is that if you compile with gcc -O1 or gcc -O2, gcc does the arithmetic at compile time, and the problem goes away. Maybe this is OK according to the standard, but it's just one more reason that gcc is not my favorite compiler.
P.S. When I say that == compares an 80-bit sum with a 64-bit sum, of course I really mean it compares the extended version of the 64-bit sum. You might do well to think
sum_2 == t1 + t2
resolves to
extend(sum_2) == extend(t1) + extend(t2)
and
sum_2 = t1 + t2
resolves to
sum_2 = round(extend(t1) + extend(t2))
Welcome to the wonderful world of floating point!

When comparing floating point numbers for closeness you usually want to measure their relative difference, which is defined as
if (abs(x) != 0 || abs(y) != 0)
    rel_diff(x, y) = abs((x - y) / max(abs(x), abs(y)))
else
    rel_diff(x, y) = max(abs(x), abs(y))
For example,
rel_diff(1.12345, 1.12367) = 0.000195787019
rel_diff(112345.0, 112367.0) = 0.000195787019
rel_diff(112345E100, 112367E100) = 0.000195787019
The idea is to measure the number of leading significant digits the numbers have in common; if you take the -log10 of 0.000195787019 you get 3.70821611, which is about the number of leading base 10 digits all the examples have in common.
If you need to determine if two floating point numbers are equal you should do something like
if (rel_diff(x,y) < error_factor * machine_epsilon()) then
print "equal\n";
where machine epsilon is the gap between 1.0 and the next representable floating point value on the hardware being used (roughly, the relative precision of the mantissa). Most computer languages have a function call to get this value. error_factor should be based on the number of significant digits you think will be consumed by rounding errors (and others) in the calculations of the numbers x and y. For example, if I knew that x and y were the result of about 1000 summations and did not know any bounds on the numbers being summed, I would set error_factor to about 100.
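A minimal C sketch of this test, assuming the rel_diff definition above; DBL_EPSILON from <float.h> plays the role of machine_epsilon(), and the helper names are made up for illustration:
#include <float.h>
#include <math.h>

/* Relative difference as defined above. */
double rel_diff(double x, double y)
{
    double m = fmax(fabs(x), fabs(y));
    return (m != 0.0) ? fabs(x - y) / m : m;
}

/* Approximate equality; error_factor accounts for digits lost to rounding. */
int nearly_equal(double x, double y, double error_factor)
{
    return rel_diff(x, y) < error_factor * DBL_EPSILON;
}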
Tried to add these as links but couldn't since this is my first post:
en.wikipedia.org/wiki/Relative_difference
en.wikipedia.org/wiki/Machine_epsilon
en.wikipedia.org/wiki/Significand (mantissa)
en.wikipedia.org/wiki/Rounding_error

It may be that in one of the cases, you end up comparing a 64-bit double to an 80-bit internal register. It may be enlightening to look at the assembly instructions GCC emits for the two cases...

Comparisons of double precision numbers are problematic because the values themselves are inexact. For instance, you can often find 0.1 + 0.2 == 0.3 returning false. This is due to the way the FPU stores and rounds numbers.
Wikipedia says:
Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.
You will need to use a delta to give a tolerance for your comparisons, rather than an exact value.

This "problem" can be "fixed" by using these options:
-msse2 -mfpmath=sse
as explained on this page:
http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html
Once I used these options, both asserts passed.

Related

intrinsic _mm512_round_ps is missing for AVX512

I'm missing the intrinsic _mm512_round_ps for AVX512 (it is only available for KNC). Any idea why this is not available?
What would be a good workaround?
apply _mm256_round_ps to upper and lower half and fuse the results?
use _mm512_add_round_ps with one argument being zero?
Thanks!
TL:DR: AVX512F
__m512 nearest_integer = _mm512_roundscale_ps(input_vec, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC);
related: AVX512DQ _mm512_reduce_pd or _ps will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd have the most detail.
The EVEX prefix allows overriding the default rounding direction {er} and setting suppress-all-exceptions {sae}, for FP instructions. (This is what the ..._round_ps() versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.
vroundps xy, xy/mem, imm8 didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.
vrndscaleps xyz, xyz/mem/m32broadcast, imm8 is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5 would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).
I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.
The low 4 bits of the field specify the rounding mode like with roundps. As usual, the PD version of the instruction has the diagram (because it's alphabetically first).
With the upper 4 bits = 0, it behaves the same as roundps: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.
(I'm curious if SSE or VEX roundpd on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)
__m512 _mm512_roundscale_ps( __m512 a, int imm); is the no-frills intrinsic. See Intel's intrinsic finder
The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);. There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.
You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC and so on constants documented for _mm_round_pd / _mm256_round_pd, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION to use whatever the current mode is. _MM_FROUND_NO_EXC suppresses setting the inexact exception bit in the MXCSR.
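For instance, a minimal sketch (assuming AVX512F; the helper names are mine) of the two cases discussed above: round to the nearest integer, and round to the nearest 0.5 by asking for 1 fraction bit in the upper nibble of the imm8:
#include <immintrin.h>

/* Round each float to the nearest integer (M = 0 fraction bits). */
static inline __m512 round_to_int(__m512 v) {
    return _mm512_roundscale_ps(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}

/* Round each float to the nearest multiple of 0.5 (M = 1 in the upper 4 bits). */
static inline __m512 round_to_half(__m512 v) {
    return _mm512_roundscale_ps(v, (1 << 4) | _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}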
You might be wondering why vrndscaleps needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae} (Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)
The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)
So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8 to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps. (http://agner.org/optimize/).

Addition efficiency proportional to size of operands?

Question that just popped into my head, and I don't think I've seen an answer on here. Is the time taken by a binary addition algorithm, proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations, should we start looking into more efficient methods of calculation? E.g. exponentiation by squaring with large exponents for calculating huge powers.
How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32bit system, all the above numbers will take the same time (number of CPU instructions) to be added, as all of them fit within the basic computational block, i.e. the 32bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16bit system, as the larger numbers will NOT fit in a 16bit register, it will need an additional pass (to add the significant bits that remain after the first 16 LSBs are added).
Step1: ADD Least significant bits
0101010101101010
0100101010010101
Step2: ADD the rest (remember carry bit from previous operation)
000000000000000C
0000000110101101
0000000000001011
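For illustration, here is a hedged C sketch of the same two-step addition, using 16-bit limbs to mirror the 16-bit machine above (the struct and names are made up for this example):
#include <stdint.h>

/* Two 16-bit halves of a 32-bit value. */
typedef struct { uint16_t lo, hi; } u32_pair;

u32_pair add32(u32_pair a, u32_pair b) {
    u32_pair r;
    r.lo = (uint16_t)(a.lo + b.lo);          /* Step 1: add the 16 LSBs          */
    uint16_t carry = (r.lo < a.lo);          /* carry out if the low sum wrapped */
    r.hi = (uint16_t)(a.hi + b.hi + carry);  /* Step 2: add the rest plus carry  */
    return r;
}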
We can start thinking of optimising the mathematical operations on numbers once the numbers no longer fit in the basic computation unit of the system, i.e. the CPU register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they see such a case, e.g. 128-bit decryption logic being run on a 32-bit system.
Also, it is not only the size of the operands that matters: the size of the result also determines whether the system can accomplish the mathematical operation within one step. Not just the operands involved, but the operation being performed needs to be taken into consideration as well.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure validity of the result for all possible values, the architecture(and the compiler) must :
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)
In practice, there's no (or completely negligible) difference between adding different integers that fit in a processor word, as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.
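As a hedged illustration of where such libraries take over, here is a minimal GMP example adding the two binary numbers from the question; the cost of mpz_add grows with the number of limbs in the operands (link with -lgmp):
#include <gmp.h>
#include <stdio.h>

int main(void) {
    mpz_t a, b, sum;
    mpz_init_set_str(a, "1101011010101010101101010", 2);  /* the "big" number    */
    mpz_init_set_str(b, "10110100101010010101", 2);       /* the "medium" number */
    mpz_init(sum);
    mpz_add(sum, a, b);                 /* time proportional to the number of limbs */
    gmp_printf("%Zd\n", sum);
    mpz_clears(a, b, sum, NULL);
    return 0;
}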
I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256bits at a time, which means you can process 4 integer additions if you're using 64bit, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision). They all take the same number of clock cycles to compute - 1clk.
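As an illustration (assuming AVX2 and, for simplicity, a length that is a multiple of 32; function and variable names are mine), one such loop processing 32 byte-additions per instruction might look like:
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Element-wise addition of two byte arrays, 32 lanes per instruction. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi8(va, vb));
    }
}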
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, this might have seemed a little silly. The new instruction set, however, contains a ternary logic instruction. With this instruction, it is possible to compute 512 additions on numbers of any bit length fairly easily.
inline __m512i add(__m512i x, __m512i y, __m512i carry_in)
{
    /* 0x96 = three-input XOR: the sum bit of a full adder */
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);
}
inline __m512i adc(__m512i x, __m512i y, __m512i carry_in)
{
    /* 0xE8 = three-input majority: the carry-out of a full adder */
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);
}
__m512i A[NUM_BITS];       /* bit-sliced: A[i] holds bit i of 512 numbers */
__m512i B[NUM_BITS];
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_si512();
for(int i = 0; i < NUM_BITS; ++i)
{
    RESULT[i] = add(A[i], B[i], CARRY);
    CARRY     = adc(A[i], B[i], CARRY);
}
In this particular example (which to be honest, probably has very limited real world usage!), The time it takes to perform the 512 additions, is indeed directly proportional to NUM_BITS.

Why is comparing double faster than uint64?

I analysed the following program using Matlab's profile. Both double and uint64 are 64-bit variables. Why is comparing two double much faster than comparing two uint64? Aren't they both compared bitwise?
big = 1000000;
a = uint64(randi(100,big,1));
b = uint64(randi(100,big,1));
c = uint64(zeros(big,1));
tic;
for i=1:big
if a(i) == b(i)
c(i) = c(i) + 1;
end
end
toc;
a = randi(100,big,1);
b = randi(100,big,1);
c = zeros(big,1);
tic;
for i=1:big
if a(i) == b(i)
c(i) = c(i) + 1;
end
end
toc;
This is what tictoc measures:
Elapsed time is 6.259040 seconds.
Elapsed time is 0.015387 seconds.
The effect disappears when uint8..uint32 or int8..int32 are used instead of 64-bit data types.
It's probably a combination of the Matlab interpreter and the underlying CPU not supporting 64-bit int types as well as the others.
Matlab favors double over int operations. Most values are stored in double types, even if they represent integer values. The double and int == operations will take different code paths, and MathWorks will have spent a lot more attention on tuning and optimizing the code for double than for ints, especially int64. In fact, older versions of Matlab did not support arithmetic operations on int64 at all. (And IIRC, it still doesn't support mixed-integer math.) When you do int64 math, you're using less mature, less tuned code, and the same may apply to ==. Int types are not a priority in Matlab. The presence of the int64 may even interfere with the JIT optimizing that code, but that's just a guess.
But there might be an underlying hardware reason for this too. Here's a hypothesis: if you're on 32-bit x86, you're working with 32-bit general purpose (integer) registers. That means the smaller int types can fit in a register and be compared using fast instructions, but the 64-bit int values won't fit in a register, and may take more expensive instruction sequences to compare. The doubles, even though they are 64 bits wide, will fit in the wide floating-point registers of the x87 floating point unit, and can be compared in hardware using fast floating-point comparison instructions. This means the [u]int64s are the only ones that can't be compared using fast single-register, single-instruction operations.
If that's the case, if you run this same code on 64-bit x86-64 (in 64-bit Matlab), the difference may disappear because you then have 64-bit wide general purpose registers. But then again, it may not, if the Matlab interpreter's code isn't compiled to take advantage of it.
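If the 32-bit hypothesis holds, a 64-bit integer equality test has to be built from two word-sized compares rather than one. A hedged illustration in C (not what Matlab actually does internally; the function name is mine):
#include <stdint.h>

/* Equality of a 64-bit value held as two 32-bit halves: two compares instead of one. */
int eq64(uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi) {
    return (a_hi == b_hi) && (a_lo == b_lo);
}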

What has a better performance: multiplication or division?

Which version is faster:
x * 0.5
or
x / 2 ?
I've had a course at the university called computer systems some time ago. From back then I remember that multiplying two values can be achieved with comparably "simple" logical gates, but division is not a "native" operation and requires a sum register that is repeatedly increased by the divisor in a loop and compared to the dividend.
Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications?
Update:
I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great I've upvoted them all. I chose rahul's answer because of the great link.
Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:
float x = y / 2.5f;
to:
const float k = 1.0f / 2.5f;
...
float x = y * k;
Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as being a performance bottleneck.
Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats (it's basically convertible into a bit shift).
For floats even dynamic division by powers of two is much faster than regular (dynamic or static division) as it basically turns into a subtraction on its exponent.
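The power-of-two case can be made concrete: dividing by 2^k is just an exponent adjustment, which <math.h> exposes directly. A sketch (the function name is mine; exactness assumes no underflow or overflow):
#include <math.h>

/* x / 2^k computed by adjusting the exponent only. */
double div_by_pow2(double x, int k) {
    return ldexp(x, -k);
}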
In all other cases, division appears to be several times slower than multiplication.
For a dynamic divisor the slowdown factor on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz appears to be about 8, for static ones about 2.
The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two):
ulong -- 64-bit unsigned
1 in the label means dynamic argument
0 in the label means statically known argument
The results were generated from the following bash template:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
int main(int argc, char** argv){
    $TYPE arg = atoi(argv[1]);
    $TYPE i = 0, res = 0;
    for (i = 0; i < $IT; i++)
        res += i $OP $ARG;
    printf($FMT, res);
    return 0;
}
with the $-variables assigned and the resulting program compiled with -O3 and run (dynamic values came from the command line as it's obvious from the C code).
Well, if it is a single calculation you will hardly notice any difference, but if you are talking about millions of operations then division is definitely costlier than multiplication. You can always use whatever is clearest and most readable.
Please refer this link:- Should I use multiplication or division?
That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.
If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will change your operations in this way to the most appropriate instructions.
Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.
One note to make, if you are looking for numerical stability:
Don't recycle the divisions for solutions that require multiple components/coordinates, e.g. like implementing an n-D vector normalize() function, i.e. the following will NOT give you a unit-length vector:
V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!
.. but this code will..
V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!
My guess is that the problem in the first code excerpt is the additional rounding: the pre-computed reciprocal imposes a different scale normalization on the floating point number, which is then forced upon the actual result, introducing additional error.
I didn't look into this for too long, so please share your explanation of why this happens. I tested it with x, y and z being .1f (and with doubles instead of floats).

long long division with 32-bits memory

I am currently working on a framework which transforms C to VHDL, and I am getting stuck on the implementation of the long long division. Indeed, my framework is only able to work on 32-bit variables, so parsing a C long long variable will result in 2 VHDL variables, one containing the most significant part, one containing the least significant part. So to sum up, from this:
long long a = 1LL;
The VHDL which will be generated will be something like :
var30 <= 00000000000000000000000000000000;
var31 <= 00000000000000000000000000000001;
Now my problem is: how can I divide 2 long long parameters (in VHDL), since they are split in 2 variables? I had no problem with the addition/subtraction, since I can work on the most (resp. least) significant part independently (just a carry to propagate), but I really don't see how I could perform a division, since with this kind of operation the least and the most significant parts are really bound together... If someone has an idea, it would be much appreciated.
PS : I have the same problem for the multiplication
EDIT: I work on both signed and unsigned variables, and the result should be a 64-bit variable.
For both the multiplication and the division problem you can break the problem down like this: consider that each 64-bit value x can be expressed as k*x.hi + x.lo, where x.hi is the upper 32 bits, x.lo is the lower 32 bits, and k = 2^32. So for multiplication:
a*b = (a.hi*k+a.lo)*(b.hi*k+b.lo)
= a.hi*b.hi*k*k + (a.hi*b.lo + a.lo*b.hi)*k + a.lo*b.lo
If you just want a 64 bit result then the first term disappears and you get:
a*b = (a.hi*b.lo + a.lo*b.hi)*k + a.lo*b.lo
Remember that in general multiplication doubles the number of bits, so each 32 bit x 32 bit multiply in the above expressions will generate a 64 bit term. In some cases you only want the low 32 bits (first two terms in above expression) but for the last term you need both the low and high 32 bits.
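To make the 64-bit-result formula concrete, here is a hedged C sketch with 32-bit limbs (the struct, names, and the 64-bit temporary for the 32x32 product are for illustration only; the real target here is VHDL):
#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64_pair;     /* x = hi*2^32 + lo */

u64_pair mul64(u64_pair a, u64_pair b) {
    uint64_t lo_lo = (uint64_t)a.lo * b.lo;       /* full 64-bit a.lo*b.lo           */
    uint32_t cross = a.hi * b.lo + a.lo * b.hi;   /* only the low 32 bits are needed */
    u64_pair r;
    r.lo = (uint32_t)lo_lo;
    r.hi = (uint32_t)(lo_lo >> 32) + cross;       /* cross terms land at the k = 2^32 position */
    return r;
}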
