When is (x==(x+y)-y) or (x==(x-y)+y) guaranteed for IEEE floats? - precision

In C or another language which uses IEEE floats, I have two variables x and y which are both guaranteed to be finite, non-NaN, basically normal numbers.
I have some code which assumes, in essence, that the following code has no effect:
float x = get_x ();
float y = get_y ();
float old_x = x;
x += y;
x -= y;
assert (old_x == x);
x -= y;
x += y;
assert (old_x == x);
I know that this will be true for certain classes of values, i.e. those which do not have "many" significant figures in the mantissa, but I would like to be clear about the edge cases.
For example, the binary expression of 1.3 will have significant figures all the way down the mantissa, and so will 1.7, and I should not assume that 1.3+1.7==3 exactly, but can I assume that if I add such numbers together and then subtract them, or vice versa, I will get the first value back again?
What are the formal edge conditions for this?

The number of bits in the floating point pipeline is not part of the standard.
From Wikipedia:
The standard also recommends extended format(s) to be used to perform
internal computations at a higher precision than that required for the
final result, to minimise round-off errors: the standard only
specifies minimum precision and exponent requirements for such
formats. The x87 80-bit extended format is the most commonly
implemented extended format that meets these requirements.
Internal formats can be extended, and you do not know when intermediate values get rounded back to the standard formats or which rounding mode is in use. So the standard does not guarantee that adding a value and then subtracting it again yields the original value.
For the trivial case you posted it probably would work most of the time.
Then there is the case of handling NAN.
You may be able to determine the edge cases for the architecture you are currently using, but it's probably easier to just check whether the current value is within a margin of error of the original value.
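To make the failure mode concrete, here is a minimal sketch in C. The helper names are made up for illustration, and the volatile stores are an assumption used to force every intermediate to be rounded to float, so the outcome does not depend on whether the compiler keeps extended precision in registers. The round trip is exact when the intermediate sum or difference is itself exactly representable, and it can lose x entirely when y is much larger in magnitude.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns (x + y) - y with each intermediate rounded to float. */
static float add_then_sub(float x, float y) {
    volatile float t = x + y;   /* store forces rounding to float */
    volatile float r = t - y;
    return r;
}

/* Returns (x - y) + y with each intermediate rounded to float. */
static float sub_then_add(float x, float y) {
    volatile float t = x - y;
    volatile float r = t + y;
    return r;
}

/* True when both round trips give back x exactly. */
static bool round_trips(float x, float y) {
    return add_then_sub(x, y) == x && sub_then_add(x, y) == x;
}
```

With these definitions, round_trips(1.5f, 2.25f) holds because 3.75 and -0.75 are both exactly representable, while add_then_sub(1.0f, 1.0e8f) yields 0.0f because 1.0 is smaller than half the spacing between floats near 1e8.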

Related

Why does D3DCOLORtoUBYTE4 multiply components by 255.001953f?

I’ve compiled a pixel shader that uses D3DCOLORtoUBYTE4 intrinsic, then decompiled.
Here’s what I found:
r0.xyzw = float4(255.001953,255.001953,255.001953,255.001953) * r0.zyxw;
o0.xyzw = (int4)r0.xyzw;
The rgba->bgra swizzle is expected, but why does it use 255.001953 instead of 255.0? The Data Conversion Rules documentation is quite specific about what should happen. It says the following:
Convert from float scale to integer scale: c = c * (2^n-1).
The short answer is: This is just part of the definition of the intrinsic. It's implemented that way in all the modern versions of the HLSL compiler.
255.0 as a 32-bit float is represented in binary as
0100`0011`0111`1111`0000`0000`0000`0000
255.001953 as a 32-bit float is actually represented as 255.001953125 which in binary is:
0100`0011`0111`1111`0000`0000`1000`0000
This slight bias helps in specific cases, such as the input value being 0.999999. If we used 255.0, you'd get 254. With 255.001953 you get 255. Otherwise in most other cases the answer after converting to integer (using truncation) results in the same answer either way.
Some useful and interesting musings on floating-point numbers here
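The effect of the bias can be reproduced on the CPU. The following is only a sketch in C of the scale-and-truncate step, not the shader intrinsic itself; the function names are hypothetical, and it assumes the usual 32-bit IEEE float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float, copied out without aliasing problems. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Scale by exactly 255.0 and truncate toward zero. */
static int scale_plain(float c) {
    return (int)(c * 255.0f);
}

/* Scale by the biased constant seen in the decompiled shader, then truncate. */
static int scale_biased(float c) {
    return (int)(c * 255.001953f);
}
```

float_bits(255.0f) is 0x437F0000 and float_bits(255.001953f) is 0x437F0080, matching the two bit patterns above. For the input 0.999999f the plain version truncates to 254 while the biased version gives 255; for inputs such as 0.0f, 0.5f and 1.0f both versions agree.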

When is it useful to compare floating-point values for equality?

I seem to see people asking all the time around here questions about comparing floating-point numbers. The canonical answer is always: just see if the numbers are within some small number of each other…
So the question is this: Why would you ever need to know if two floating point numbers are equal to each other?
In all my years of coding I have never needed to do this (although I will admit that even I am not omniscient). From my point of view, if you are using floating point numbers and for some reason want to know if the numbers are equal, you should probably be using an integer type (or a decimal type in languages that support it). Am I just missing something?
A few reasons to compare floating-point numbers for equality are:
Testing software. Given software that should conform to a precise specification, exact results might be known or feasibly computable, so a test program would compare the subject software’s results to the expected results.
Performing exact arithmetic. Carefully designed software can perform exact arithmetic with floating-point. At its simplest, this may simply be integer arithmetic. (On platforms which provide IEEE-754 64-bit double-precision floating-point but only 32-bit integer arithmetic, floating-point arithmetic can be used to perform 53-bit integer arithmetic.) Comparing for equality when performing exact arithmetic is the same as comparing for equality with integer operations.
Searching sorted or structured data. Floating-point values can be used as keys for searching, in which case testing for equality is necessary to determine that the sought item has been found. (There are issues if NaNs may be present, since they report false for any order test.)
Avoiding poles and discontinuities. Functions may have special behaviors at certain points, the most obvious of which is division. Software may need to test for these points and divert execution to alternate methods.
Note that only the last of these tests for equality when using floating-point arithmetic to approximate real arithmetic. (This list of examples is not complete, so I do not expect this is the only such use.) The first three are special situations. Usually when using floating-point arithmetic, one is approximating real arithmetic and working with mostly continuous functions. Continuous functions are “okay” for working with floating-point arithmetic because they transmit errors in “normal” ways. For example, if your calculations so far have produced some a' that approximates an ideal mathematical result a, and you have a b' that approximates an ideal mathematical result b, then the computed sum a'+b' will approximate a+b.
Discontinuous functions, on the other hand, can disrupt this behavior. For example, if we attempt to round a number to the nearest integer, what happens when a is 3.49? Our approximation a' might be 3.48 or 3.51. When the rounding is computed, the approximation may produce 3 or 4, turning a very small error into a very large error. When working with discontinuous functions in floating-point arithmetic, one has to be careful. For example, consider evaluating the quadratic formula, (−b±sqrt(b²−4ac))/(2a). If there is a slight error during the calculations for b²−4ac, the result might be negative, and then sqrt will return NaN. So software cannot simply use floating-point arithmetic as if it easily approximated real arithmetic. The programmer must understand floating-point arithmetic and be wary of the pitfalls, and these issues and their solutions can be specific to the particular software and application.
Testing for equality is a discontinuous function. It is a function f(a, b) that is 0 everywhere except along the line a=b. Since it is a discontinuous function, it can turn small errors into large errors—it can report as equal numbers that are unequal if computed with ideal mathematics, and it can report as unequal numbers that are equal if computed with ideal mathematics.
With this view, we can see testing for equality is a member of a general class of functions. It is not any more special than square root or division—it is continuous in most places but discontinuous in some, and so its use must be treated with care. That care is customized to each application.
I will relate one place where testing for equality was very useful. We implement some math library routines that are specified to be faithfully rounded. The best quality for a routine is that it is correctly rounded. Consider a function whose exact mathematical result (for a particular input x) is y. In some cases, y is exactly representable in the floating-point format, in which case a good routine will return y. Often, y is not exactly representable. In this case, it is between two numbers representable in the floating-point format, some numbers y0 and y1. If a routine is correctly rounded, it returns whichever of y0 and y1 is closer to y. (In case of a tie, it returns the one with an even low digit. Also, I am discussing only the round-to-nearest ties-to-even mode.)
If a routine is faithfully rounded, it is allowed to return either y0 or y1.
Now, here is the problem we wanted to solve: We have some version of a single-precision routine, say sin0, that we know is faithfully rounded. We have a new version, sin1, and we want to test whether it is faithfully rounded. We have multiple-precision software that can evaluate the mathematical sin function to great precision, so we can use that to check whether the results of sin1 are faithfully rounded. However, the multiple-precision software is slow, and we want to test all four billion inputs. sin0 and sin1 are both fast, but sin1 is allowed to have outputs different from sin0, because sin1 is only required to be faithfully rounded, not to be the same as sin0.
However, it happens that most of the sin1 results are the same as sin0. (This is partly a result of how math library routines are designed, using some extra precision to get a very close result before using a few final arithmetic operations to deliver the final result. That tends to get the correctly rounded result most of the time but sometimes slips to the next nearest value.) So what we can do is this:
For each input, calculate both sin0 and sin1.
Compare the results for equality.
If the results are equal, we are done. If they are not, use the extended precision software to test whether the sin1 result is faithfully rounded.
Again, this is a special case for using floating-point arithmetic. But it is one where testing for equality serves very well; the final test program runs in a few minutes instead of many hours.
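The filtering step of that test can be sketched in C as follows. This is an illustration under stated assumptions, not the actual test program: demo_old and demo_new are hypothetical stand-ins for sin0 and sin1, and the place where the slow multiple-precision check would run is only marked by a comment.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

typedef float (*unary_fn)(float);

/* Build a float from a 32-bit pattern so every input can be enumerated. */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Counts the inputs in [first, last] (as bit patterns) on which the two
   fast routines disagree. Only those inputs need the slow check. */
static uint64_t count_disagreements(unary_fn old_fn, unary_fn new_fn,
                                    uint32_t first, uint32_t last) {
    uint64_t n = 0;
    for (uint64_t b = first; b <= last; ++b) {
        float x = float_from_bits((uint32_t)b);
        float r0 = old_fn(x);
        float r1 = new_fn(x);
        /* Exact equality is the fast path; two NaNs also count as agreement. */
        if (!(r0 == r1 || (isnan(r0) && isnan(r1))))
            ++n;  /* a real harness would run the slow reference check here */
    }
    return n;
}

/* Hypothetical stand-ins: identical except for the single input 3.0f. */
static float demo_old(float x) { return x * 2.0f; }
static float demo_new(float x) { return x == 3.0f ? 7.0f : x + x; }
```

Calling count_disagreements(demo_old, demo_new, 0x40000000u, 0x40800000u) walks every float from 2.0f to 4.0f and reports a single disagreement, so only that one input would be handed to the slow checker.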
The only time I needed, it was to check if the GPU was IEEE 754 compliant.
It was not.
Anyway, I did not do the comparison inside a program. I just ran the program on the CPU and on the GPU, had each produce binary output (no literals), and compared the outputs with a simple diff.
There are plenty of possible reasons.
Since I know Squeak/Pharo Smalltalk better, here are a few trivial examples taken out of it (it relies on strict IEEE 754 model):
Float>>isFinite
"simple, byte-order independent test for rejecting Not-a-Number and (Negative)Infinity"
^(self - self) = 0.0
Float>>isInfinite
"Return true if the receiver is positive or negative infinity."
^ self = Infinity or: [self = NegativeInfinity]
Float>>predecessor
| ulp |
self isFinite ifFalse: [
(self isNaN or: [self negative]) ifTrue: [^self].
^Float fmax].
ulp := self ulp.
^self - (0.5 * ulp) = self
ifTrue: [self - ulp]
ifFalse: [self - (0.5 * ulp)]
I'm sure that you would find some more involved == if you open some libm implementation and check... Unfortunately, I don't know how to search == thru github web interface, but manually I found this example in julia libm (a variant of fdlibm)
https://github.com/JuliaLang/openlibm/blob/master/src/s_remquo.c
remquo(double x, double y, int *quo)
{
...
fixup:
INSERT_WORDS(x,hx,lx);
y = fabs(y);
if (y < 0x1p-1021) {
if (x+x>y || (x+x==y && (q & 1))) {
q++;
x-=y;
}
} else if (x>0.5*y || (x==0.5*y && (q & 1))) {
q++;
x-=y;
}
GET_HIGH_WORD(hx,x);
SET_HIGH_WORD(x,hx^sx);
q &= 0x7fffffff;
*quo = (sxy ? -q : q);
return x;
Here, the remainder function returns a result x between -y/2 and y/2. If it is exactly y/2, then there are 2 choices (a tie)... The == test in fixup is there to detect an exact tie (resolved so as to always have an even quotient).
There are also a few ==zero tests, for example in __ieee754_logf (test for trivial case log(1)) or __ieee754_rem_pio2 (modulo pi/2 used for trigonometric functions).

GNU Simulated Annealing

I'm working from the template program given here:
https://www.gnu.org/software/gsl/manual/html_node/Trivial-example.html
The program as they give it compiles and runs perfectly, which is nice. What I would like to do is generalise this method to find the minimum of a function with an arbitrary number of parameters.
Some cursory reading suggests that the metric function (M1) is only used in certain diagnostic and printing situations and so can more or less be ignored. All that remains is then to define E1 and S1 appropriately. Unfortunately my knowledge of using pointers and void is incomplete, so I'm stuck trying to upgrade the configuration 'xp' to be an array of parameters, rather than a single double.
In my naivete I tried moving from
double x = *((double *) xp);
to
double x = (*((double *) xp))[0];
where appropriate, but obviously that didn't work. I'm sure I'm missing something stupid, so any hints would be nice! I will obviously be defining my own E1 output function which will take these N parameters and return a number.
The underlying algorithm, gsl_siman_solve() from the link provided, is generalized to work with any data type. This is why the ubiquitous xp parameter is always being cast to a double pointer before use. It should be straightforward to use any struct or array or array of arrays instead of simply doubles provided all the callbacks are coded properly.
The problem is that gsl_siman_solve() only seems to support a scalar double step size, initial guess, and 'uniform' value (from gsl_rng_uniform()), so you would need to map scalar double values into what are naturally multidimensional quantities. This can be done, but it is messy and not very flexible. In your case, the mapping would be done in S1().
This is akin to mapping the digits of a decimal number into a multidimensional space: the ones digit represents the X axis, the tens digit represents the Y axis, and the hundreds digit represents the Z axis, for example. By incrementing an integer, one can walk the entire 3D space from (0, 0, 0) to (9, 9, 9). You don't have to use integers and powers of 10, and the components don't even have to have the same range, but there is an inherent limit in the range of each component of the packed value. You would actually do this in reverse: taking a scalar double and unpacking it into multiple quantities.
Lastly, your code double x = (*((double *) xp))[0]; won't work because the leading * has already dereferenced the pointer, so you are attempting to index a double as if it were an array. Indexing the pointer to double itself would be OK. In other words, it's that first * that is the problem.
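As a minimal sketch of the cast that does work, the energy callback below has the same shape as E1 in the linked example (it takes a void * and returns a double) but treats the configuration as an array. The name E1_array, the constant N_PARAMS and the sum-of-squares energy are all assumptions made for illustration, and no GSL calls are used.

```c
#include <assert.h>
#include <stddef.h>

#define N_PARAMS 3  /* hypothetical number of parameters */

/* Energy callback: xp points at N_PARAMS contiguous doubles. */
static double E1_array(void *xp) {
    const double *x = (const double *) xp;  /* cast once, no dereference */
    double energy = 0.0;
    for (size_t i = 0; i < N_PARAMS; ++i)
        energy += x[i] * x[i];              /* index the pointer directly */
    return energy;
}
```

Given double p[N_PARAMS] = {1.0, 2.0, 3.0}, the call E1_array(p) returns 14.0. The same cast-then-index pattern would apply in the step and copy callbacks.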

What has a better performance: multiplication or division?

Which version is faster:
x * 0.5
or
x / 2 ?
I've had a course at the university called computer systems some time ago. From back then I remember that multiplying two values can be achieved with comparably "simple" logical gates but division is not a "native" operation and requires a sum register that is in a loop increased by the divisor and compared to the dividend.
Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications ?
Update:
I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great, I've upvoted them all. I chose rahul's answer because of the great link.
Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:
float x = y / 2.5f;
to:
const float k = 1.0f / 2.5f;
...
float x = y * k;
Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as being a performance bottleneck.
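The floating point accuracy issue mentioned above can be shown with a small C sketch. The function names and the test inputs are chosen for illustration only, and the volatile stores are an assumption used to force rounding to float at each step. Because 1.0f / 2.5f is itself rounded, multiplying by it is not always bit-identical to dividing by 2.5f, whereas multiplying by 0.5f always matches dividing by 2.0f because 0.5 is exactly representable.

```c
#include <assert.h>

/* y / 2.5f, rounded once. */
static float div_direct(float y) {
    volatile float r = y / 2.5f;
    return r;
}

/* y * (1.0f / 2.5f): the reciprocal is rounded, then the product is rounded. */
static float mul_recip(float y) {
    volatile float k = 1.0f / 2.5f;
    volatile float r = y * k;
    return r;
}

static float half_div(float x) { volatile float r = x / 2.0f; return r; }
static float half_mul(float x) { volatile float r = x * 0.5f; return r; }
```

For y = 3.0f the two routes agree, but for y = 9.0f they differ in the last bit, which is why a compiler will not make this substitution for you unless it is told that it may relax floating point accuracy.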
Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats (it's basically convertible into a bit shift).
For floats, even dynamic division by powers of two is much faster than regular division (dynamic or static), as it basically turns into a subtraction on the exponent.
In all other cases, division appears to be several times slower than multiplication.
For a dynamic divisor the slowdown factor on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz appears to be about 8, for static ones about 2.
The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two):
ulong -- 64 bit unsigned
1 in the label means dynamic argument
0 in the label means statically known argument
The results were generated from the following bash template:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
int main(int argc, char** argv){
$TYPE arg = atoi(argv[1]);
$TYPE i = 0, res = 0;
for (i=0;i< $IT;i++)
res+=i $OP $ARG;
printf($FMT, res);
return 0;
}
with the $-variables assigned and the resulting program compiled with -O3 and run (dynamic values came from the command line as it's obvious from the C code).
Well, if it is a single calculation you will hardly notice any difference, but if you are talking about millions of operations then division is definitely costlier than multiplication. Otherwise you can always use whichever is clearest and most readable.
Please refer to this link: Should I use multiplication or division?
That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.
If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will change your operations in this way to the most appropriate instructions.
Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.
One note to make, if you are looking for numerical stability:
Don't recycle the divisions for solutions that require multiple components/coordinates, e.g. like implementing an n-D vector normalize() function, i.e. the following will NOT give you a unit-length vector:
V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!
.. but this code will..
V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!
My guess is that the problem in the first excerpt is the extra rounding step: 1.f / l is itself rounded to a float, and that rounding error is then carried into every multiplication, whereas each direct division is rounded only once.
Didn't look into this for too long, so please share your explanation why this happens. Tested it with x,y and z being .1f (and with doubles instead of floats)

gcc precision bug?

I can only assume this is a bug. The first assert passes while the second fails:
double sum_1 = 4.0 + 6.3;
assert(sum_1 == 4.0 + 6.3);
double t1 = 4.0, t2 = 6.3;
double sum_2 = t1 + t2;
assert(sum_2 == t1 + t2);
If not a bug, why?
This is something that has bitten me, too.
Yes, floating point numbers should never be compared for equality because of rounding error, and you probably knew that.
But in this case, you're computing t1+t2, then computing it again. Surely that has to produce an identical result?
Here's what's probably going on. I'll bet you're running this on an x86 CPU, correct? The x86 FPU uses 80 bits for its internal registers, but values in memory are stored as 64-bit doubles.
So t1+t2 is first computed with 80 bits of precision, then -- I presume -- stored out to memory in sum_2 with 64 bits of precision -- and some rounding occurs. For the assert, it's loaded back into a floating point register, and t1+t2 is computed again, again with 80 bits of precision. So now you're comparing sum_2, which was previously rounded to a 64-bit floating point value, with t1+t2, which was computed with higher precision (80 bits) -- and that's why the values aren't exactly identical.
Edit So why does the first test pass? In this case, the compiler probably evaluates 4.0+6.3 at compile time and stores it as a 64-bit quantity -- both for the assignment and for the assert. So identical values are being compared, and the assert passes.
Second Edit Here's the assembly code generated for the second part of the code (gcc, x86), with comments -- pretty much follows the scenario outlined above:
// t1 = 4.0
fldl LC3
fstpl -16(%ebp)
// t2 = 6.3
fldl LC4
fstpl -24(%ebp)
// sum_2 = t1+t2
fldl -16(%ebp)
faddl -24(%ebp)
fstpl -32(%ebp)
// Compute t1+t2 again
fldl -16(%ebp)
faddl -24(%ebp)
// Load sum_2 from memory and compare
fldl -32(%ebp)
fxch %st(1)
fucompp
Interesting side note: This was compiled without optimization. When it's compiled with -O3, the compiler optimizes all of the code away.
You are comparing floating point numbers. Don't do that, floating point numbers have inherent precision error in some circumstances. Instead, take the absolute value of the difference of the two values and assert that the value is less than some small number (epsilon).
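#include <assert.h>
#include <math.h>
void CompareFloats( double d1, double d2, double epsilon )
{
/* fabs, not abs: in C, abs() takes an int and would truncate the difference */
assert( fabs( d1 - d2 ) < epsilon );
}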
This has nothing to do with the compiler and everything to do with the way floating point numbers are implemented. Here is the IEEE spec:
http://www.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
I've duplicated your problem on my Intel Core 2 Duo, and I looked at the assembly code. Here's what's happening: when your compiler evaluates t1 + t2, it does
load t1 into an 80-bit register
load t2 into an 80-bit register
compute the 80-bit sum
When it stores into sum_2 it does
round the 80-bit sum to a 64-bit number and store it
Then the == comparison compares the 80-bit sum to a 64-bit sum, and they're different, primarily because the fractional part 0.3 cannot be represented exactly using a binary floating-point number, so you are comparing a 'repeating decimal' (actually repeating binary) that has been truncated to two different lengths.
What's really irritating is that if you compile with gcc -O1 or gcc -O2, gcc does the wrong arithmetic at compile time, and the problem goes away. Maybe this is OK according to the standard, but it's just one more reason that gcc is not my favorite compiler.
P.S. When I say that == compares an 80-bit sum with a 64-bit sum, of course I really mean it compares the extended version of the 64-bit sum. You might do well to think
sum_2 == t1 + t2
resolves to
extend(sum_2) == extend(t1) + extend(t2)
and
sum_2 = t1 + t2
resolves to
sum_2 = round(extend(t1) + extend(t2))
Welcome to the wonderful world of floating point!
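The extend/round picture above suggests a way to make the comparison behave: force both sides through a 64-bit store before comparing. The C sketch below illustrates that idea under the assumption that a volatile double store rounds the value to 64 bits; the function names are hypothetical. FLT_EVAL_METHOD is the C99 macro from <float.h> that reports how intermediates are evaluated (0 means in the operand type, 2 means in long double, as on x87).

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Reports the compiler's evaluation method for floating point expressions. */
static int eval_method(void) {
    return (int)FLT_EVAL_METHOD;
}

/* Computes t1 + t2 twice and rounds both results to double before comparing,
   so an 80-bit intermediate is never compared against a 64-bit value. */
static bool sum_equal_after_store(double t1, double t2) {
    volatile double a = t1 + t2;  /* round(extend(t1) + extend(t2)) */
    volatile double b = t1 + t2;  /* same rounding applied again */
    return a == b;
}
```

With both sides rounded the same way, sum_equal_after_store(4.0, 6.3) holds whether the sums are evaluated at 64 or 80 bits.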
When comparing floating point numbers for closeness you usually want to measure their relative difference, which is defined as
if (abs(x) != 0 || abs(y) != 0)
rel_diff(x, y) = abs(x - y) / max(abs(x), abs(y))
else
rel_diff(x, y) = max(abs(x), abs(y))
For example,
rel_diff(1.12345, 1.12367) = 0.000195787019
rel_diff(112345.0, 112367.0) = 0.000195787019
rel_diff(112345E100, 112367E100) = 0.000195787019
The idea is to measure the number of leading significant digits the numbers have in common; if you take the -log10 of 0.000195787019 you get 3.70821611, which is about the number of leading base 10 digits all the examples have in common.
If you need to determine if two floating point numbers are equal you should do something like
if (rel_diff(x,y) < error_factor * machine_epsilon()) then
print "equal\n";
where machine epsilon is the gap between 1.0 and the next larger representable number, which is determined by the width of the mantissa of the floating point format being used. Most computer languages have a function call or constant to get this value. error_factor should be based on the number of significant digits you think will be consumed by rounding errors (and others) in the calculations of the numbers x and y. For example, if I knew that x and y were the result of about 1000 summations and did not know any bounds on the numbers being summed, I would set error_factor to about 100.
Tried to add these as links but couldn't since this is my first post:
en.wikipedia.org/wiki/Relative_difference
en.wikipedia.org/wiki/Machine_epsilon
en.wikipedia.org/wiki/Significand (mantissa)
en.wikipedia.org/wiki/Rounding_error
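The rel_diff definition and the comparison against error_factor times machine epsilon can be written in C roughly as follows. This is a sketch, not a library routine: nearly_equal is a hypothetical name, and DBL_EPSILON from <float.h> is used as the machine epsilon for double.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Relative difference: |x - y| scaled by the larger magnitude, 0 if both are 0. */
static double rel_diff(double x, double y) {
    double ax = fabs(x);
    double ay = fabs(y);
    double largest = ax > ay ? ax : ay;
    if (largest == 0.0)
        return 0.0;
    return fabs(x - y) / largest;
}

/* True when x and y agree to within error_factor units of machine epsilon. */
static bool nearly_equal(double x, double y, double error_factor) {
    return rel_diff(x, y) < error_factor * DBL_EPSILON;
}
```

For instance, nearly_equal(0.1 + 0.2, 0.3, 4.0) is true even though the two values differ in the last bit, while nearly_equal(1.0, 1.001, 4.0) is false. If either argument is NaN the comparison is false.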
It may be that in one of the cases, you end up comparing a 64-bit double to an 80-bit internal register. It may be enlightening to look at the assembly instructions GCC emits for the two cases...
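Comparing the results of double-precision computations for equality is fragile. Two identical values such as 0.0 == 0.0 always compare equal, but two results that are mathematically equal can differ after computation, for example when one has been rounded to 64 bits in memory and the other is still held at 80 bits in an FPU register.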
Wikipedia says:
Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.
You will need to use a delta to give a tolerance for your comparisons, rather than an exact value.
This "problem" can be "fixed" by using these options:
-msse2 -mfpmath=sse
as explained on this page:
http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html
Once I used these options, both asserts passed.
