When is it useful to compare floating-point values for equality?

I see people around here asking questions about comparing floating-point numbers all the time. The canonical answer is always: just check whether the numbers are within some small tolerance of each other…
So the question is this: Why would you ever need to know if two floating point numbers are equal to each other?
In all my years of coding I have never needed to do this (although I will admit that even I am not omniscient). From my point of view, if you are using floating-point numbers and for some reason want to know whether they are equal, you should probably be using an integer type (or a decimal type in languages that support it). Am I just missing something?

A few reasons to compare floating-point numbers for equality are:
Testing software. Given software that should conform to a precise specification, exact results might be known or feasibly computable, so a test program would compare the subject software’s results to the expected results.
Performing exact arithmetic. Carefully designed software can perform exact arithmetic with floating-point. At its simplest, this may simply be integer arithmetic. (On platforms which provide IEEE-754 64-bit double-precision floating-point but only 32-bit integer arithmetic, floating-point arithmetic can be used to perform 53-bit integer arithmetic.) Comparing for equality when performing exact arithmetic is the same as comparing for equality with integer operations.
Searching sorted or structured data. Floating-point values can be used as keys for searching, in which case testing for equality is necessary to determine that the sought item has been found. (There are issues if NaNs may be present, since they report false for any order test.)
Avoiding poles and discontinuities. Functions may have special behaviors at certain points, the most obvious of which is division. Software may need to test for these points and divert execution to alternate methods.
Note that only the last of these tests for equality when using floating-point arithmetic to approximate real arithmetic. (This list of examples is not complete, so I do not expect this is the only such use.) The first three are special situations. Usually when using floating-point arithmetic, one is approximating real arithmetic and working with mostly continuous functions. Continuous functions are “okay” for working with floating-point arithmetic because they transmit errors in “normal” ways. For example, if your calculations so far have produced some a' that approximates an ideal mathematical result a, and you have a b' that approximates an ideal mathematical result b, then the computed sum a'+b' will approximate a+b.
Discontinuous functions, on the other hand, can disrupt this behavior. For example, if we attempt to round a number to the nearest integer, what happens when a is 3.49? Our approximation a' might be 3.48 or 3.51. When the rounding is computed, the approximation may produce 3 or 4, turning a very small error into a very large error. When working with discontinuous functions in floating-point arithmetic, one has to be careful. For example, consider evaluating the quadratic formula, (−b±sqrt(b^2−4ac))/(2a). If there is a slight error during the calculations for b^2−4ac, the result might be negative, and then sqrt will return NaN. So software cannot simply use floating-point arithmetic as if it easily approximated real arithmetic. The programmer must understand floating-point arithmetic and be wary of the pitfalls, and these issues and their solutions can be specific to the particular software and application.
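A minimal Python sketch of the square-root pitfall, under stated assumptions: the expression 0.3 - 3 * 0.1 is only a stand-in for a quantity such as a discriminant that is exactly zero in real arithmetic, and sqrt_clamped is a hypothetical helper, not a recommendation for every application. Note that Python's math.sqrt raises ValueError on a negative argument, where C's sqrt would return NaN.

```python
import math

# Stand-in for a quantity like b^2 - 4ac that is exactly zero in real arithmetic.
ideal_zero = 0.3 - 3 * 0.1   # rounding makes this slightly negative

def sqrt_clamped(value):
    """Square root that treats negative inputs as 0.0.

    Hypothetical helper: clamping is only appropriate when the caller knows
    that a negative value can arise solely from rounding error.
    """
    return math.sqrt(max(value, 0.0))

try:
    math.sqrt(ideal_zero)
    raised = False
except ValueError:
    raised = True

print(ideal_zero < 0.0)          # True
print(raised)                    # True
print(sqrt_clamped(ideal_zero))  # 0.0
```

Whether clamping, switching to a different formula, or reporting an error is the right response depends on the application, as the answer says.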
Testing for equality is a discontinuous function. It is a function f(a, b) that is 0 everywhere except along the line a=b. Since it is a discontinuous function, it can turn small errors into large errors—it can report as equal numbers that are unequal if computed with ideal mathematics, and it can report as unequal numbers that are equal if computed with ideal mathematics.
With this view, we can see testing for equality is a member of a general class of functions. It is not any more special than square root or division—it is continuous in most places but discontinuous in some, and so its use must be treated with care. That care is customized to each application.
I will relate one place where testing for equality was very useful. We implement some math library routines that are specified to be faithfully rounded. The best quality for a routine is that it is correctly rounded. Consider a function whose exact mathematical result (for a particular input x) is y. In some cases, y is exactly representable in the floating-point format, in which case a good routine will return y. Often, y is not exactly representable. In this case, it is between two numbers representable in the floating-point format, some numbers y0 and y1. If a routine is correctly rounded, it returns whichever of y0 and y1 is closer to y. (In case of a tie, it returns the one with an even low digit. Also, I am discussing only the round-to-nearest ties-to-even mode.)
If a routine is faithfully rounded, it is allowed to return either y0 or y1.
Now, here is the problem we wanted to solve: We have some version of a single-precision routine, say sin0, that we know is faithfully rounded. We have a new version, sin1, and we want to test whether it is faithfully rounded. We have multiple-precision software that can evaluate the mathematical sin function to great precision, so we can use that to check whether the results of sin1 are faithfully rounded. However, the multiple-precision software is slow, and we want to test all four billion inputs. sin0 and sin1 are both fast, but sin1 is allowed to have outputs different from sin0, because sin1 is only required to be faithfully rounded, not to be the same as sin0.
However, it happens that most of the sin1 results are the same as sin0. (This is partly a result of how math library routines are designed, using some extra precision to get a very close result before using a few final arithmetic operations to deliver the final result. That tends to get the correctly rounded result most of the time but sometimes slips to the next nearest value.) So what we can do is this:
For each input, calculate both sin0 and sin1.
Compare the results for equality.
If the results are equal, we are done. If they are not, use the extended precision software to test whether the sin1 result is faithfully rounded.
Again, this is a special case for using floating-point arithmetic. But it is one where testing for equality serves very well; the final test program runs in a few minutes instead of many hours.
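The screening procedure above might be sketched in Python roughly as follows. All names here (screen_against_reference, old_fn, new_fn, slow_check) are hypothetical, and the toy functions at the bottom merely stand in for sin0, sin1 and the multiple-precision check.

```python
def screen_against_reference(inputs, old_fn, new_fn, slow_check):
    """Compare new_fn with a trusted old_fn; run slow_check only on mismatches.

    Returns (number_of_slow_checks, inputs_that_failed_the_slow_check).
    """
    slow_calls = 0
    failures = []
    for x in inputs:
        new_result = new_fn(x)
        if new_result == old_fn(x):
            continue  # identical to a result already known to be acceptable
        slow_calls += 1
        if not slow_check(x, new_result):
            failures.append(x)
    return slow_calls, failures

# Toy stand-ins (assumptions for illustration only).
def old_double(x):
    return x * 2.0

def new_double(x):
    return 7.0 if x == 3.0 else x + x

def within_one(x, result):
    return abs(result - 2.0 * x) <= 1.0

slow_calls, failures = screen_against_reference(
    [1.0, 2.0, 3.0, 4.0], old_double, new_double, within_one)
print(slow_calls, failures)  # 1 []
```

The equality test is exact on purpose: it never accepts a wrong result, it only decides which inputs need the expensive check.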

The only time I needed it was to check whether the GPU was IEEE 754 compliant.
It was not.
Even then I did not use a comparison in a programming language. I just ran the program on the CPU and on the GPU, producing binary output (no literals), and compared the outputs with a simple diff.

There are plenty of possible reasons.
Since I know Squeak/Pharo Smalltalk better, here are a few trivial examples taken from it (it relies on a strict IEEE 754 model):

    "simple, byte-order independent test for rejecting Not-a-Number and (Negative)Infinity"
    ^(self - self) = 0.0

    "Return true if the receiver is positive or negative infinity."
    ^ self = Infinity or: [self = NegativeInfinity]

    | ulp |
    self isFinite ifFalse: [
        (self isNaN or: [self negative]) ifTrue: [^self].
        ^Float fmax].
    ulp := self ulp.
    ^self - (0.5 * ulp) = self
        ifTrue: [self - ulp]
        ifFalse: [self - (0.5 * ulp)]

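For readers who do not know Smalltalk, the first two tests above could be written in Python roughly like this. The function names are made up for illustration; in practice math.isfinite and math.isinf already exist.

```python
import math

def is_finite_by_subtraction(x):
    """True for finite floats only.

    inf - inf and nan - nan both give NaN, and NaN == 0.0 is False,
    so only finite values pass the equality test.
    """
    return (x - x) == 0.0

def is_infinite(x):
    """True if x is positive or negative infinity (exact equality is intended)."""
    return x == math.inf or x == -math.inf

print(is_finite_by_subtraction(1.5))       # True
print(is_finite_by_subtraction(math.inf))  # False
print(is_finite_by_subtraction(math.nan))  # False
print(is_infinite(-math.inf))              # True
```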
I'm sure that you would find some more involved uses of == if you opened a libm implementation and checked... Unfortunately, I don't know how to search for == through the GitHub web interface, but manually I found this example in Julia's libm (a variant of fdlibm)
remquo(double x, double y, int *quo)
y = fabs(y);
if (y < 0x1p-1021) {
if (x+x>y || (x+x==y && (q & 1))) {
} else if (x>0.5*y || (x==0.5*y && (q & 1))) {
q &= 0x7fffffff;
*quo = (sxy ? -q : q);
return x;
Here, the remainder function returns a result x between -y/2 and y/2. If it is exactly y/2, then there are 2 choices (a tie)... The == test in the fixup is there to detect the case of an exact tie (resolved so as to always have an even quotient).
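The tie rule can be observed from Python through math.remainder, which implements the IEEE 754 remainder operation (available since Python 3.7). The helper is_exact_tie is a hypothetical name used only for this illustration.

```python
import math

def is_exact_tie(x, y):
    """True when the IEEE remainder of x by y lands exactly on half of |y|."""
    return abs(math.remainder(x, y)) == abs(y) / 2

# 5/2 = 2.5 is a tie between quotients 2 and 3; the even quotient 2 wins.
print(math.remainder(5.0, 2.0))   # 1.0
# 7/2 = 3.5 is a tie between quotients 3 and 4; the even quotient 4 wins.
print(math.remainder(7.0, 2.0))   # -1.0
print(is_exact_tie(5.0, 2.0))     # True
print(is_exact_tie(5.5, 2.0))     # False
```

An approximate comparison would be wrong here: the tie rule applies only when the remainder is exactly half of y.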
There are also a few ==zero tests, for example in __ieee754_logf (test for trivial case log(1)) or __ieee754_rem_pio2 (modulo pi/2 used for trigonometric functions).


Mathematica precision differs from other calculators

If I evaluate the following input in Mathematica 12:
SetPrecision[DecimalForm[123.432654/54.1122356, 130], 130]
The result is:
When I run the same calculation in other calculators, the results agree up to the 15th digit of the Mathematica result: 2,281048872429140. From the 16th digit on, however, the other calculators agree with each other, whereas Mathematica shows a different result:
Windows Calculator:
Why is (only) Mathematica ending up with a different result?
Can Mathematica somehow end up with the same result as the other calculators (supposing that these unanimous results are the correct ones)?
Mathematica's model of approximate decimal numbers is different from almost everyone else's model of approximate decimal numbers.
Because of the number of digits you supplied for each of 123.432654 and 54.1122356, these are assumed to be and treated as MachinePrecision numbers. That means they have the usual "about 16 digits" of precision supplied by the CPU floating-point hardware in your computer, but it is a little more complicated than that.
Because of precedence rules Mathematica first evaluated each of those numbers and converted them to the internal floating point form, with the limited accuracy and all the problems that brings and all the speed of being able to perform calculations in hardware instead of software.
Then it did the division using the internal floating point hardware which resulted in another MachinePrecision number with only about 16 digits of precision.
Then with DecimalForm you asked Mathematica to extrapolate that result with only about 16 good digits into a 130 digit display.
All of the *Form functions (or almost all, with some very subtle exceptions in dark corners) are intended only to produce something for display, not something to be used in further calculations. For example, new users routinely do m=MatrixForm[mymatrix] to see a pretty formatting of the matrix and then proceed to try to do calculations with m, which fails.
Then you asked Mathematica to perform the SetPrecision function on that display to try to turn it into a number with 130 digits of precision. I can't even guess what that really did internally.
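The same effect can be reproduced outside Mathematica. The following Python sketch is an analogy only, not a description of Mathematica's internals: it divides the two numbers once as binary doubles and once as exact decimal inputs, then writes out every digit of the double.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # 50 significant decimal digits for Decimal arithmetic

# Division in hardware doubles: only about 16 digits are meaningful.
machine = 123.432654 / 54.1122356

# Division with the inputs taken as exact decimal numbers.
exact_inputs = Decimal("123.432654") / Decimal("54.1122356")

print(machine)            # shortest decimal string that identifies the double
print(Decimal(machine))   # the double's exact binary value, written in decimal
print(exact_inputs)       # what a calculator assuming exact inputs would show
```

The digits of Decimal(machine) beyond roughly the 16th are real digits of the stored binary value, but they say nothing about the quotient of the decimal inputs.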
It seems those other calculators assume that the precision of the entered numbers is infinite. WL does not. You can specify what precision the entered numbers have e.g.

Should I use double data structure to store very large Integer values?

int types support a very small range of numbers compared to double. For example, I want to use an integer number with a large range. Should I use double for this purpose, or is there an alternative?
Is arithmetic slow with doubles?
Whether double arithmetic is slow as compared to integer arithmetic depends on the CPU and the bit size of the integer/double.
On modern hardware floating point arithmetic is generally not slow. Even though the general rule may be that integer arithmetic is typically a bit faster than floating point arithmetic, this is not always true. For instance multiplication & division can even be significantly faster for floating point than the integer counterpart (see this answer)
This may be different for embedded systems with no hardware support for floating point. Then double arithmetic will be extremely slow.
Regarding your original problem: You should note that a 64-bit long long int can store every integer up to 2^63 − 1 exactly, while a double can store every integer only up to 2^53 exactly. A double can hold larger numbers, but not all integers beyond that point: they get rounded.
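A short Python check of the 2^53 limit (Python's float is an IEEE 754 double on all common platforms, which is assumed here):

```python
LIMIT = 2 ** 53

as_double = float(LIMIT)

# Just below the limit, adding 1 still gives the exact next integer.
print(float(LIMIT - 1) + 1.0 == as_double)   # True
# At the limit, 2^53 + 1 is not representable and rounds back to 2^53.
print(as_double + 1.0 == as_double)          # True
# Converting the integer 2^53 + 1 to a double loses the low bit.
print(int(float(LIMIT + 1)) == LIMIT)        # True
```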
The nice thing about floating point is that it is much more convenient to work with. You have special values for infinity (Inf) and for undefined results (NaN). This makes division by zero, for instance, possible rather than an exception. One can also use NaN as a return value in case of error or abnormal conditions. With integers one often uses -1 or something similar to indicate an error. That value can propagate through calculations undetected, while NaN remains detectable as it propagates.
Practical example: The programming language MATLAB has double as the default data type. It is always used, even in cases where integers are typically used, e.g. array indexing. Even though MATLAB is an interpreted language and not as fast as a compiled language such as C or C++, it is quite fast and a powerful tool.
Bottom line: Using double instead of integers will not be slow. It is perhaps not the most efficient choice, but the performance hit is not severe (at least not on modern desktop computer hardware).

error bound in function approximation algorithm

Suppose we have the set of floating-point numbers with an "m"-bit mantissa and "e" bits for the exponent. Suppose, moreover, that we want to approximate a function "f".
From the theory we know that usually a "range reduced function" is used and then from such function we derive the global function value.
For example let x = (sx,ex,mx) (sign exp and mantissa) then...
log2(x) = ex + log2(1.mx) so basically the range reduced function is "log2(1.mx)".
So far I have implemented reciprocal, square root, log2 and exp2, and recently I've started to work on the trigonometric functions. But I was wondering whether, given a global error bound (ulp error especially), it is possible to derive an error bound for the range-reduced function. Is there some study of this kind of problem? Taking log2(x) as an example, I would like to be able to say...
"OK, I want log2(x) with k ulp error; to achieve this, given our floating-point system, we need to approximate log2(1.mx) with p ulp error"
Remember that, as I said, we know we are working with floating-point numbers, but the format is generic, so it could be the classic F32, but also, for example, e = 10, m = 8 and so on.
I can't actually find any reference that presents this kind of study. The reference I have (i.e. Muller's book) doesn't treat the topic in this way, so I am looking for some kind of paper or similar. Do you know of any reference?
I'm also trying to derive such a bound myself, but it is not easy...
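To make the decomposition in the question concrete, here is a minimal Python sketch. It assumes IEEE 754 doubles, and math.log2 merely stands in for whatever approximation of the range-reduced function log2(1.mx) is being designed. The function name is hypothetical.

```python
import math

def log2_via_range_reduction(x):
    """log2(x) for finite x > 0, computed as exponent + log2(significand)."""
    m, e = math.frexp(x)       # x == m * 2**e with 0.5 <= m < 1
    significand = 2.0 * m      # rescaled to [1, 2), the "1.mx" of the question
    exponent = e - 1           # the "ex" of the question
    # math.log2 is a placeholder for the reduced-range approximation.
    return exponent + math.log2(significand)

for x in (0.75, 1.0, 10.0, 1e-5):
    print(x, log2_via_range_reduction(x), math.log2(x))
```

The exponent is added as an exact integer, so the absolute error of the reduced function carries into the result, plus one rounding in the final addition. Its size in ulps therefore depends on the magnitude of the final sum, which is part of what makes a general bound awkward to state.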
There is a description of current practice, along with a proposed improvement and an error analysis, at https://hal.inria.fr/ensl-00086904/document. The description of current practice appears consistent with the overview at https://docs.oracle.com/cd/E37069_01/html/E39019/z4000ac119729.html, which is consistent with my memory of the most talked about problem being the mod pi range reduction of trigonometric functions.
I think IEEE floating point was a big step forwards just because it standardized things at a time when there were a variety of computer architectures, so lowering the risks of porting code between them, but the accuracy requirements implied by this may have been overkill: for many problems the constraint on the accuracy of the output is the accuracy of the input data, not the accuracy of the calculation of intermediate values.

JDBC / Oracle Double value insertion fails [duplicate]

double r = 11.631;
double theta = 21.4;
In the debugger, these are shown as 11.631000000000000 and 21.399999618530273.
How can I avoid this?
These accuracy problems are due to the internal representation of floating point numbers and there's not much you can do to avoid it.
By the way, printing these values at run-time often still leads to the correct results, at least using modern C++ compilers. For most operations, this isn't much of an issue.
I liked Joel's explanation, which deals with a similar binary floating point precision issue in Excel 2007:
See how there's a lot of 0110 0110 0110 there at the end? That's because 0.1 has no exact representation in binary... it's a repeating binary number. It's sort of like how 1/3 has no representation in decimal. 1/3 is 0.33333333 and you have to keep writing 3's forever. If you lose patience, you get something inexact.
So you can imagine how, in decimal, if you tried to do 3*1/3, and you didn't have time to write 3's forever, the result you would get would be 0.99999999, not 1, and people would get angry with you for being wrong.
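The repeating pattern described in the quote can be inspected directly. This Python sketch assumes IEEE 754 doubles, which is what Python's float uses on common platforms.

```python
from decimal import Decimal

# Hexadecimal form of the stored value: the repeating digits are cut off and rounded.
print((0.1).hex())        # 0x1.999999999999ap-4

# The exact value that is actually stored, written out in decimal.
print(Decimal(0.1))

# Each operand is slightly off, so the sum does not hit the double nearest to 0.3.
print(0.1 + 0.2 == 0.3)   # False
```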
If you have a value like:
double theta = 21.4;
And you want to do:
if (theta == 21.4)
You have to be a bit clever: you need to check whether the value of theta is really close to 21.4, not necessarily equal to it.
if (fabs(theta - 21.4) <= 1e-6)
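The same tolerance check, as a small runnable Python sketch. The name nearly_equal and the default tolerance of 1e-6 are illustrative assumptions; the right tolerance depends on the application.

```python
import math

def nearly_equal(a, b, tolerance=1e-6):
    """Absolute-tolerance comparison, mirroring fabs(a - b) <= tolerance."""
    return math.fabs(a - b) <= tolerance

theta = 21.4
total = 0.1 + 0.2

print(total == 0.3)                # False
print(nearly_equal(total, 0.3))    # True
print(nearly_equal(theta, 21.4))   # True
print(nearly_equal(21.4, 21.41))   # False
```

Python's standard library also offers math.isclose, which adds a relative tolerance for values whose magnitude varies widely.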
This is partly platform-specific - and we don't know what platform you're using.
It's also partly a case of knowing what you actually want to see. The debugger is showing you - to some extent, anyway - the precise value stored in your variable. In my article on binary floating point numbers in .NET, there's a C# class which lets you see the absolutely exact number stored in a double. The online version isn't working at the moment - I'll try to put one up on another site.
Given that the debugger sees the "actual" value, it's got to make a judgement call about what to display - it could show you the value rounded to a few decimal places, or a more precise value. Some debuggers do a better job than others at reading developers' minds, but it's a fundamental problem with binary floating point numbers.
Use the fixed-point decimal type if you want stability at the limits of precision. There are overheads, and you must explicitly cast if you wish to convert to floating point. If you do convert to floating point you will reintroduce the instabilities that seem to bother you.
Alternately you can get over it and learn to work with the limited precision of floating point arithmetic. For example you can use rounding to make values converge, or you can use epsilon comparisons to describe a tolerance. "Epsilon" is a constant you set up that defines a tolerance. For example, you may choose to regard two values as being equal if they are within 0.0001 of each other.
It occurs to me that you could use operator overloading to make epsilon comparisons transparent. That would be very cool.
For mantissa-exponent representations EPSILON must be computed to remain within the representable precision. For a number N, Epsilon = N / 10E+14
System.Double.Epsilon is the smallest representable positive value for the Double type. It is too small for our purpose. Read Microsoft's advice on equality testing
I've come across this before (on my blog) - I think the surprise tends to be that the 'irrational' numbers are different.
By 'irrational' here I'm just referring to the fact that they can't be accurately represented in this format. Real irrational numbers (like π - pi) can't be accurately represented at all.
Most people are familiar with 1/3 not working in decimal: 0.3333333333333...
The odd thing is that 1.1 doesn't work in floats. People expect decimal values to work in floating point numbers because of how they think of them:
1.1 is 11 x 10^-1
When actually they're in base-2
1.1 is approximately 154811237190861 x 2^-47
You can't avoid it, you just have to get used to the fact that some floats are 'irrational', in the same way that 1/3 is.
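In Python the exact binary fraction behind a double can be read off with float.as_integer_ratio. This sketch assumes IEEE 754 doubles, where the stored value of 1.1 is an integer divided by 2^51.

```python
x = 1.1
numerator, denominator = x.as_integer_ratio()

print(numerator)               # 2476979795053773
print(denominator == 2 ** 51)  # True
print(x.hex())                 # 0x1.199999999999ap+0
```

So the value stored for 1.1 is exactly 2476979795053773 x 2^-51, which is close to 1.1 but not equal to it.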
One way you can avoid this is to use a library that uses an alternative method of representing decimal numbers, such as BCD
If you are using Java and you need accuracy, use the BigDecimal class for floating point calculations. It is slower but safer.
Seems to me that 21.399999618530273 is the single precision (float) representation of 21.4. Looks like the debugger is casting down from double to float somewhere.
You can't avoid this, since you're using floating-point numbers with a fixed number of bytes. There's simply no isomorphism possible between the real numbers and their limited notation.
But most of the time you can simply ignore it. 21.4==21.4 would still be true because both sides are the same number with the same error. But 21.4f==21.4 may not be true because the errors for float and double are different.
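Python has no single-precision literal, but the float/double difference can be imitated by rounding through the struct module. The helper to_float32 is a hypothetical name for this sketch.

```python
import struct

def to_float32(value):
    """Round a Python float (a double) to the nearest IEEE 754 single, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

single = to_float32(21.4)

print(single)           # 21.399999618530273
print(21.4 == 21.4)     # True: same double, same error
print(single == 21.4)   # False: the float error differs from the double error
```

The printed value matches what the debugger in the question displayed, which supports the guess that a float was involved somewhere.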
If you need fixed precision, perhaps you should try fixed point numbers. Or even integers. I for example often use int(1000*x) for passing to debug pager.
Dangers of computer arithmetic
If it bothers you, you can customize the way some values are displayed during debug. Use it with care :-)
Enhancing Debugging with the Debugger Display Attributes
Refer to General Decimal Arithmetic
Also take note when comparing floats, see this answer for more information.
According to the javadoc
"If at least one of the operands to a numerical operator is of type double, then the
operation is carried out using 64-bit floating-point arithmetic, and the result of the
numerical operator is a value of type double. If the other operand is not a double, it is
first widened (§5.1.5) to type double by numeric promotion (§5.6)."
Here is the Source

Algorithm to find a common multiplier to convert decimal numbers to whole numbers

I have an array of numbers that potentially have up to 8 decimal places and I need to find the smallest common number I can multiply them by so that they are all whole numbers. I need this so all the original numbers can all be multiplied out to the same scale and be processed by a sealed system that will only deal with whole numbers, then I can retrieve the results and divide them by the common multiplier to get my relative results.
Currently we do a few checks on the numbers and multiply by 100 or 1,000,000, but the processing done by the *sealed system can get quite expensive when dealing with large numbers, so multiplying everything by a million just for the sake of it isn’t really a great option. As an approximation, let’s say that the sealed algorithm gets 10 times more expensive every time you multiply by a factor of 10.
What is the most efficient algorithm that will also give the best possible result to accomplish what I need, and is there a mathematical name and/or formula for what I need?
*The sealed system isn’t really sealed. I own/maintain the source code for it, but it’s 100,000-odd lines of proprietary magic and it has been thoroughly bug and performance tested; altering it to deal with floats is not an option for many reasons. It is a system that creates a grid of X by Y cells, then rects that are X by Y are dropped into the grid, “proprietary magic” occurs and results are spat out – obviously this is an extremely simplified version of reality, but it’s a good enough approximation.
So far there are quite a few good answers and I wondered how I should go about choosing the ‘correct’ one. To begin with I figured the only fair way was to create each solution and performance test it, but I later realised that pure speed wasn’t the only relevant factor – a more accurate solution is also very relevant. I wrote the performance tests anyway, but currently I’m choosing the correct answer based on speed as well as accuracy using a ‘gut feel’ formula.
My performance tests process 1000 different sets of 100 randomly generated numbers.
Each algorithm is tested using the same set of random numbers.
Algorithms are written in .Net 3.5 (although thus far would be 2.0 compatible)
I tried pretty hard to make the tests as fair as possible.
Greg – Multiply by large number and then divide by GCD – 63 milliseconds
Andy – String Parsing – 199 milliseconds
Eric – Decimal.GetBits – 160 milliseconds
Eric – Binary search – 32 milliseconds
Ima – sorry, I couldn’t figure out how to implement your solution easily in .Net (I didn’t want to spend too long on it)
Bill – I figure your answer was pretty close to Greg’s so didn’t implement it. I’m sure it’d be a smidge faster but potentially less accurate.
So Greg’s “Multiply by large number and then divide by GCD” solution was the second fastest algorithm and it gave the most accurate results, so for now I’m calling it correct.
I really wanted the Decimal.GetBits solution to be the fastest, but it was very slow. I’m unsure if this is due to the conversion of a Double to a Decimal or the bit masking and shifting. There should be a similar usable solution for a straight Double using BitConverter.GetBytes and some knowledge contained here: http://blogs.msdn.com/bclteam/archive/2007/05/29/bcl-refresher-floating-point-types-the-good-the-bad-and-the-ugly-inbar-gazit-matthew-greig.aspx but my eyes just kept glazing over every time I read that article and I eventually ran out of time to try to implement a solution.
I’m always open to other solutions if anyone can think of something better.
I'd multiply by something sufficiently large (100,000,000 for 8 decimal places), then divide by the GCD of the resulting numbers. You'll end up with a pile of smallest integers that you can feed to the other algorithm. After getting the result, reverse the process to recover your original range.
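A minimal Python sketch of this approach, with assumptions stated up front: the inputs have at most 8 decimal places, round() is used to absorb binary representation error after scaling, and the function name is made up for illustration. The GCD of a whole list is obtained by folding the two-argument gcd over it.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def to_smallest_integers(values, decimals=8):
    """Scale values by 10**decimals, then divide out the GCD of the results.

    Returns (integers, multiplier), where integers[i] == values[i] * multiplier.
    The multiplier is a Fraction because it need not be a whole number.
    """
    scale = 10 ** decimals
    scaled = [round(v * scale) for v in values]  # round() absorbs representation error
    common = reduce(gcd, scaled)                 # GCD of the list, two numbers at a time
    if common == 0:                              # every input was zero
        return scaled, Fraction(1)
    return [s // common for s in scaled], Fraction(scale, common)

integers, multiplier = to_smallest_integers([0.25, 1.5, 2.75])
print(integers, multiplier)   # [1, 6, 11] 4
```

To recover the original range afterwards, divide the sealed system's results by the returned multiplier.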
Multiply all the numbers by 10 until you have integers.
Divide by 2, 3, 5, 7 while you still have all integers.
I think that covers all cases.
2.1 * 10/7 -> 3
0.008 * 10^3/2^3 -> 1
That's assuming your multiplier can be a rational fraction.
If you want to find some integer N so that N*x is an exact integer for every float x in a given set, then you have a basically unsolvable problem. Suppose x = the smallest positive float your type can represent, say it's 10^-30. If you multiply all your numbers by 10^30, and then try to represent them in binary (otherwise, why are you even trying so hard to make them ints?), then you'll lose basically all the information of the other numbers due to overflow.
So here are two suggestions:
If you have control over all the related code, find another approach. For example, if you have some function that takes only ints, but you have floats, and you want to stuff your floats into the function, just rewrite or overload this function to accept floats as well.
If you don't have control over the part of your system that requires ints, then choose a precision that you care about, accept that you will simply have to lose some information sometimes (but it will always be "small" in some sense), and then just multiply all your floats by that constant, and round to the nearest integer.
By the way, if you're dealing with fractions, rather than floats, then it's a different game. If you have a bunch of fractions a/b, c/d, e/f in lowest terms, and you want the least common multiplier N such that N*(each fraction) is an integer, then N = lcm(b, d, f), where lcm(b, d, f) = lcm(b, lcm(d, f)) and lcm(x, y) = x*y / gcd(x, y). You can use Euclid's algorithm to find the gcd of any two numbers.
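For exact fractions, the smallest integer multiplier is the least common multiple of the reduced denominators. A minimal Python sketch follows, assuming the values arrive as decimal strings so that no binary rounding occurs. The function names are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def common_multiplier(decimal_strings):
    """Smallest positive integer N such that N * value is an integer for every value."""
    # Fraction reduces to lowest terms, so .denominator is the reduced denominator.
    denominators = [Fraction(s).denominator for s in decimal_strings]
    return reduce(lcm, denominators, 1)

print(common_multiplier(["0.008", "2.1"]))         # 250
print(common_multiplier(["0.25", "1.5", "2.75"]))  # 4
```

Here 0.008 is 1/125 and 2.1 is 21/10, so the multiplier is lcm(125, 10) = 250, giving the integers 2 and 525.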
Greg: Nice solution, but won't calculating a GCD that's common to an array of 100+ numbers get a bit expensive? And how would you go about that? It's easy to do GCD for two numbers, but for 100 it becomes more complex (I think).
Evil Andy: I'm programming in .Net and the solution you pose is pretty much a match for what we do now. I didn't want to include it in my original question because I was hoping for some outside-the-box (or my box anyway) thinking and I didn't want to taint people's answers with a potential solution. While I don't have any solid performance statistics (because I haven't had any other method to compare it against), I know the string parsing would be relatively expensive and I figured a purely mathematical solution could potentially be more efficient.
To be fair, the current string parsing solution is in production and there have been no complaints about its performance yet (it's even in production in a separate system in a VB6 format and no complaints there either). It's just that it doesn't feel right; I guess it offends my programming sensibilities, but it may well be the best solution.
That said I'm still open to any other solutions, purely mathematical or otherwise.
What language are you programming in? Something like
would give you the number of decimal places for a double in C#. You could run each number through that and find the largest number of decimal places(x), then multiply each number by 10 to the power of x.
Edit: Out of curiosity, what is this sealed system which you can pass only integers to?
In a loop get mantissa and exponent of each number as integers. You can use frexp for exponent, but I think bit mask will be required for mantissa. Find minimal exponent. Find most significant digits in mantissa (loop through bits looking for last "1") - or simply use predefined number of significant digits.
Your multiple is then something like 2^(numberOfDigits-minMantissa). "Something like" because I don't remember biases/offsets/ranges, but I think idea is clear enough.
So basically you want to determine the number of digits after the decimal point for each number.
This would be rather easier if you had the binary representation of the number. Are the numbers being converted from rationals or scientific notation earlier in your program? If so, you could skip the earlier conversion and have a much easier time. Otherwise you might want to pass each number to a function in an external DLL written in C, where you could work with the floating point representation directly. Or you could cast the numbers to decimal and do some work with Decimal.GetBits.
The fastest approach I can think of in-place and following your conditions would be to find the smallest necessary power-of-ten (or 2, or whatever) as suggested before. But instead of doing it in a loop, save some computation by doing binary search on the possible powers. Assuming a maximum of 8, something like:
// Returns the power of ten needed to make d a whole number (1e5 for 5 decimals, etc.).
double NumDecimals( double d )
{
    // make d positive for clarity; it won't change the result
    if( d<0 ) d=-d;
    // now do binary search on the possible numbers of post-decimal digits to
    // determine the actual number as quickly as possible:
    if( NeedsMore( d, 1e4 ) )
    {
        // more than 4 decimals
        if( NeedsMore( d, 1e6 ) )
        {
            // > 6 decimal places
            if( NeedsMore( d, 1e7 ) ) return 1e8;
            return 1e7;
        }
        // <= 6 decimal places
        if( NeedsMore( d, 1e5 ) ) return 1e6;
        return 1e5;
    }
    // <= 4 decimal places
    // etc...
}

bool NeedsMore( double d, double e )
{
    // check whether the representation of d has more decimal places than the
    // power of 10 represented in e.
    return (d*e - Math.Floor( d*e )) > 0;
}
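A looped variant of the same binary search, sketched in Python. It is not the answer's code: it returns the number of decimal places rather than a power of ten, and it adds a small relative tolerance (an assumption of this sketch) because products such as 1.1 * 10 are not exact in binary floating point.

```python
def needs_more(d, scale, tol=1e-9):
    """True if d * scale still has a fractional part beyond rounding noise."""
    scaled = d * scale
    return abs(scaled - round(scaled)) > tol * max(1.0, abs(scaled))

def num_decimals(d, max_places=8):
    """Smallest k in 0..max_places such that d * 10**k is (nearly) an integer."""
    d = abs(d)
    lo, hi = 0, max_places
    while lo < hi:
        mid = (lo + hi) // 2
        if needs_more(d, 10.0 ** mid):
            lo = mid + 1   # more than mid decimal places are needed
        else:
            hi = mid       # mid places are enough; try fewer
    return lo

print(num_decimals(0.25))        # 2
print(num_decimals(1.1))         # 1
print(num_decimals(3.0))         # 0
print(num_decimals(0.12345678))  # 8
```

The common multiplier for an array is then 10 raised to the largest num_decimals value found. The tolerance is a judgment call: too tight and representation error inflates the count, too loose and real trailing digits are dropped.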
PS: you wouldn't be passing security prices to an option pricing engine would you? It has exactly the flavor...
