Integers and float precision - precision

This is more of a numerical analysis rather than programming question, but I suppose some of you will be able to answer it.
In the sum two floats, is there any precision lost? Why?
In the sum of a float and a integer, is there any precision lost? Why?
Thanks.

In the sum two floats, is there any precision lost?
If both floats have differing magnitude and both are using the complete precision range (of about 7 decimal digits) then yes, you will see some loss in the last places.
Why?
This is because floats are stored in the form of (sign) (mantissa) × 2(exponent). If two values have differing exponents and you add them, then the smaller value will get reduced to less digits in the mantissa (because it has to adapt to the larger exponent):
PS> [float]([float]0.0000001 + [float]1)
1
In the sum of a float and a integer, is there any precision lost?
Yes, a normal 32-bit integer is capable of representing values exactly which do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this only applies to numbers that are large enough, i. e. longer than 24 bits.
Why?
Because float has 24 bits of precision and (32-bit) integers have 32. float will still be able to retain the magnitude and most of the significant digits, but the last places may likely differ:
PS> [float]2100000050 + [float]100
2100000100

The precision depends on the magnitude of the original numbers. In floating point, the computer represents the number 312 internally as scientific notation:
3.12000000000 * 10 ^ 2
The decimal places in the left hand side (mantissa) are fixed. The exponent also has an upper and lower bound. This allows it to represent very large or very small numbers.
If you try to add two numbers which are the same in magnitude, the result should remain the same in precision, because the decimal point doesn't have to move:
312.0 + 643.0 <==>
3.12000000000 * 10 ^ 2 +
6.43000000000 * 10 ^ 2
-----------------------
9.55000000000 * 10 ^ 2
If you tried to add a very big and a very small number, you would lose precision because they must be squeezed into the above format. Consider 312 + 12300000000000000000000. First you have to scale the smaller number to line up with the bigger one, then add:
1.23000000000 * 10 ^ 15 +
0.00000000003 * 10 ^ 15
-----------------------
1.23000000003 <-- precision lost here!
Floating point can handle very large, or very small numbers. But it can't represent both at the same time.
As for ints and doubles being added, the int gets turned into a double immediately, then the above applies.

When adding two floating point numbers, there is generally some error. D. Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" describes the effect and the reasons in detail, and also how to calculate an upper bound on the error, and how to reason about the precision of more complex calculations.
When adding a float to an integer, the integer is first converted to a float by C++, so two floats are being added and error is introduced for the same reasons as above.

The precision available for a float is limited, so of course there is always the risk that any given operation drops precision.
The answer for both your questions is "yes".
If you try adding a very large float to a very small one, you will for instance have problems.
Or if you try to add an integer to a float, where the integer uses more bits than the float has available for its mantissa.

The short answer: a computer represents a float with a limited number of bits, which is often done with mantissa and exponent, so only a few bytes are used for the significant digits, and the others are used to represent the position of the decimal point.
If you were to try to add (say) 10^23 and 7, then it won't be able to accurately represent that result. A similar argument applies when adding a float and integer -- the integer will be promoted to a float.

In the sum two floats, is there any precision lost?
In the sum of a float and a integer, is there any precision lost? Why?
Not always. If the sum is representable with the precision you ask, and you won't get any precision loss.
Example: 0.5 + 0.75 => no precision loss
x * 0.5 => no precision loss (except if x is too much small)
In the general case, one add floats in slightly different ranges so there is a precision loss which actually depends on the rounding mode.
ie: if you're adding numbers with totally different ranges, expect precision problems.
Denormals are here to give extra-precision in extreme cases, at the expense of CPU.
Depending on how your compiler handle floating-point computation, results can vary.
With strict IEEE semantics, adding two 32 bits floats should not give better accuracy than 32 bits.
In practice it may requires more instruction to ensure that, so you shouldn't rely on accurate and repeatable results with floating-point.

In both cases yes:
assert( 1E+36f + 1.0f == 1E+36f );
assert( 1E+36f + 1 == 1E+36f );

The case float + int is the same as float + float, because a standard conversion is applied to the int. In the case of float + float, this is implementation dependent, because an implementation may choose to do the addition at double precision. There may be some loss when you store the result, of course.

In both cases, the answer is "yes". When adding an int to a float, the integer is converted to floating point representation before the addition takes place anyway.
To understand why, I suggest you read this gem: What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Related

why is the result of boost::math::float_distance(1.0, 0.0) so big? [duplicate]

This is something that's been on my mind for years, but I never took the time to ask before.
Many (pseudo) random number generators generate a random number between 0.0 and 1.0. Mathematically there are infinite numbers in this range, but double is a floating point number, and therefore has a finite precision.
So the questions are:
Just how many double numbers are there between 0.0 and 1.0?
Are there just as many numbers between 1 and 2? Between 100 and 101? Between 10^100 and 10^100+1?
Note: if it makes a difference, I'm interested in Java's definition of double in particular.
Java doubles are in IEEE-754 format, therefore they have a 52-bit fraction; between any two adjacent powers of two (inclusive of one and exclusive of the next one), there will therefore be 2 to the 52th power different doubles (i.e., 4503599627370496 of them). For example, that's the number of distinct doubles between 0.5 included and 1.0 excluded, and exactly that many also lie between 1.0 included and 2.0 excluded, and so forth.
Counting the doubles between 0.0 and 1.0 is harder than doing so between powers of two, because there are many powers of two included in that range, and, also, one gets into the thorny issues of denormalized numbers. 10 of the 11 bits of the exponents cover the range in question, so, including denormalized numbers (and I think a few kinds of NaN) you'd have 1024 times the doubles as lay between powers of two -- no more than 2**62 in total anyway. Excluding denormalized &c, I believe the count would be 1023 times 2**52.
For an arbitrary range like "100 to 100.1" it's even harder because the upper bound cannot be exactly represented as a double (not being an exact multiple of any power of two). As a handy approximation, since the progression between powers of two is linear, you could say that said range is 0.1 / 64th of the span between the surrounding powers of two (64 and 128), so you'd expect about
(0.1 / 64) * 2**52
distinct doubles -- which comes to 7036874417766.4004... give or take one or two;-).
Every double value whose representation is between 0x0000000000000000 and 0x3ff0000000000000 lies in the interval [0.0, 1.0]. That's (2^62 - 2^52) distinct values (plus or minus a couple depending on whether you count the endpoints).
The interval [1.0, 2.0] corresponds to representations between 0x3ff0000000000000 and 0x400000000000000; that's 2^52 distinct values.
The interval [100.0, 101.0] corresponds to representations between 0x4059000000000000 and 0x4059400000000000; that's 2^46 distinct values.
There are no doubles between 10^100 and 10^100 + 1. Neither one of those numbers is representable in double precision, and there are no doubles that fall between them. The closest two double precision numbers are:
99999999999999982163600188718701095...
and
10000000000000000159028911097599180...
Others have already explained that there are around 2^62 doubles in the range [0.0, 1.0].
(Not really surprising: there are almost 2^64 distinct finite doubles; of those, half are positive, and roughly half of those are < 1.0.)
But you mention random number generators: note that a random number generator generating numbers between 0.0 and 1.0 cannot in general produce all these numbers; typically it'll only produce numbers of the form n/2^53 with n an integer (see e.g. the Java documentation for nextDouble). So there are usually only around 2^53 (+/-1, depending on which endpoints are included) possible values for the random() output. This means that most doubles in [0.0, 1.0] will never be generated.
The article Java's new math, Part 2: Floating-point numbers from IBM offers the following code snippet to solve this (in floats, but I suspect it works for doubles as well):
public class FloatCounter {
public static void main(String[] args) {
float x = 1.0F;
int numFloats = 0;
while (x <= 2.0) {
numFloats++;
System.out.println(x);
x = Math.nextUp(x);
}
System.out.println(numFloats);
}
}
They have this comment about it:
It turns out there are exactly 8,388,609 floats between 1.0 and 2.0 inclusive; large but hardly the uncountable infinity of real numbers that exist in this range. Successive numbers are about 0.0000001 apart. This distance is called an ULP for unit of least precision or unit in the last place.
2^53 - the size of the significand/mantissa of a 64bit floating point number including the hidden bit.
Roughly yes, as the significand is fixed but the exponent changes.
See the wikipedia article for more information.
The Java double is a IEEE 754 binary64 number.
This means that we need to consider:
Mantissa is 52 bit
Exponent is 11 bit number with 1023 bias (ie with 1023 added to it)
If the exponent is all 0 and the mantissa is non zero then the number is said to be non-normalized
This basically means there is a total of 2^62-2^52+1 of possible double representations that according to the standard are between 0 and 1. Note that 2^52+1 is to the remove the cases of the non-normalized numbers.
Remember that if mantissa is positive but exponent is negative number is positive but less than 1 :-)
For other numbers it is a bit harder because the edge integer numbers may not representable in a precise manner in the IEEE 754 representation, and because there are other bits used in the exponent to be able represent the numbers, so the larger the number the lower the different values.

Ruby float precision

As I understand it, Ruby (1.9.2) floats have a precision of 15 decimal digits. Therefore, I would expect rounding float x to 15 decimal places would equal x. For this calculation this isn't the case.
x = (0.33 * 10)
x == x.round(15) # => false
Incidentally, rounding to 16 places returns true.
Can you please explain this to me?
Part of the problem is that 0.33 does not have an exact representation in the underlying format, because it cannot be expressed by a series of 1 / 2n terms. So, when it is multiplied by 10 a number slightly different than 0.33 is being multiplied.
For that matter, 3.3 does not have an exact representation either.
Part One
When numbers don't have an exact base-10 representation, there will be a remainder when converting the least significant digit for which there was information in the mantissa. This remainder will propagate to the right, possibly forever, but it's largely meaningless. The apparent randomness of this error is due to the same reason that explains the apparently-inconsistent rounding you and Matchu noticed. That's in part two.
Part Two
And this information (the right-most bits) is not aligned neatly with the information conveyed by a single decimal digit, so the decimal digit will typically be somewhat smaller than its value would have been if the original precision had been greater.
This is why a conversion might round to 1 at 15 digits and 0.x at 16 digits: because a longer conversion has no value for the bits to the right of the end of the mantissa.
Well, though I'm not certain on the details of how Ruby handles floats internally, I do know why this particular bit of code is failing on my box:
> x = (0.33 * 10)
=> 3.3000000000000003
> x.round(15)
=> 3.300000000000001
The first float keeps 16 decimal places for a total of 17 digits, for whatever reason. So, rounding to 15 discards those digits.

what is the best numerical precision in ruby

I was wondering, how I can get the best precision on ruby. Someone told me that the best precision is probably between 0 and 1, because as you go into larger numbers the step increases as well.
I suppose a way to find out would be to know what the minimum float number is and what the next float number, then the precision would be the difference, right? If I'm correct how could I do this on ruby?
I am not sure how to use this http://ruby.wikia.com/wiki/Float to find that information.
Any help appreciated.
In terms of significant digits, the precision is the same, regardless of scale. That is, if you scale your range from [0.0, 1000.0] down to [0.0, 1.0] just by dividing numbers in the natural range by 1000.0, this will have no discernible effect on the precision of your range. In fact, a larger range will have marginally greater precision since it fully contains the smaller range.
As for discovering the absolute precision, you have two problems:
The absolute precision depends on the magnitude, which varies "infinitely" within the range [0, 1] (limx→0 log(x) = –∞). So there is no one precision for numbers in that range. You can only derive absolute precision at a given point in the range.
The common technique for discovering the minimum step — known as the ulp — is to interpret the bit-representation of the float as an integer, increment it by one, and reinterpret the result as a float. Ruby doesn't, AFAIK, let you do this.
There is, however, an iterative solution. Simply add 1.0 to the number and subtract ((x + 1.0) - x). If the difference is zero, double the addend ((x + 2.0) - x) and repeat until the difference is non-zero. Otherwise, halve the addend (to 0.5) and repeat until the difference is zero. Whenever you stop, the lowest addend that produces a non-zero difference is the ulp. (I described this from vague memory, so it might be NQR.)
You can use class Rational - it stores non-integer numbers as fraction of two Integers, which (as far as I know) will be automatically converted to Bignum, when need.
The Flt ruby library provides arbitrary floating point precision.

How many double numbers are there between 0.0 and 1.0?

This is something that's been on my mind for years, but I never took the time to ask before.
Many (pseudo) random number generators generate a random number between 0.0 and 1.0. Mathematically there are infinite numbers in this range, but double is a floating point number, and therefore has a finite precision.
So the questions are:
Just how many double numbers are there between 0.0 and 1.0?
Are there just as many numbers between 1 and 2? Between 100 and 101? Between 10^100 and 10^100+1?
Note: if it makes a difference, I'm interested in Java's definition of double in particular.
Java doubles are in IEEE-754 format, therefore they have a 52-bit fraction; between any two adjacent powers of two (inclusive of one and exclusive of the next one), there will therefore be 2 to the 52th power different doubles (i.e., 4503599627370496 of them). For example, that's the number of distinct doubles between 0.5 included and 1.0 excluded, and exactly that many also lie between 1.0 included and 2.0 excluded, and so forth.
Counting the doubles between 0.0 and 1.0 is harder than doing so between powers of two, because there are many powers of two included in that range, and, also, one gets into the thorny issues of denormalized numbers. 10 of the 11 bits of the exponents cover the range in question, so, including denormalized numbers (and I think a few kinds of NaN) you'd have 1024 times the doubles as lay between powers of two -- no more than 2**62 in total anyway. Excluding denormalized &c, I believe the count would be 1023 times 2**52.
For an arbitrary range like "100 to 100.1" it's even harder because the upper bound cannot be exactly represented as a double (not being an exact multiple of any power of two). As a handy approximation, since the progression between powers of two is linear, you could say that said range is 0.1 / 64th of the span between the surrounding powers of two (64 and 128), so you'd expect about
(0.1 / 64) * 2**52
distinct doubles -- which comes to 7036874417766.4004... give or take one or two;-).
Every double value whose representation is between 0x0000000000000000 and 0x3ff0000000000000 lies in the interval [0.0, 1.0]. That's (2^62 - 2^52) distinct values (plus or minus a couple depending on whether you count the endpoints).
The interval [1.0, 2.0] corresponds to representations between 0x3ff0000000000000 and 0x400000000000000; that's 2^52 distinct values.
The interval [100.0, 101.0] corresponds to representations between 0x4059000000000000 and 0x4059400000000000; that's 2^46 distinct values.
There are no doubles between 10^100 and 10^100 + 1. Neither one of those numbers is representable in double precision, and there are no doubles that fall between them. The closest two double precision numbers are:
99999999999999982163600188718701095...
and
10000000000000000159028911097599180...
Others have already explained that there are around 2^62 doubles in the range [0.0, 1.0].
(Not really surprising: there are almost 2^64 distinct finite doubles; of those, half are positive, and roughly half of those are < 1.0.)
But you mention random number generators: note that a random number generator generating numbers between 0.0 and 1.0 cannot in general produce all these numbers; typically it'll only produce numbers of the form n/2^53 with n an integer (see e.g. the Java documentation for nextDouble). So there are usually only around 2^53 (+/-1, depending on which endpoints are included) possible values for the random() output. This means that most doubles in [0.0, 1.0] will never be generated.
The article Java's new math, Part 2: Floating-point numbers from IBM offers the following code snippet to solve this (in floats, but I suspect it works for doubles as well):
public class FloatCounter {
public static void main(String[] args) {
float x = 1.0F;
int numFloats = 0;
while (x <= 2.0) {
numFloats++;
System.out.println(x);
x = Math.nextUp(x);
}
System.out.println(numFloats);
}
}
They have this comment about it:
It turns out there are exactly 8,388,609 floats between 1.0 and 2.0 inclusive; large but hardly the uncountable infinity of real numbers that exist in this range. Successive numbers are about 0.0000001 apart. This distance is called an ULP for unit of least precision or unit in the last place.
2^53 - the size of the significand/mantissa of a 64bit floating point number including the hidden bit.
Roughly yes, as the significand is fixed but the exponent changes.
See the wikipedia article for more information.
The Java double is a IEEE 754 binary64 number.
This means that we need to consider:
Mantissa is 52 bit
Exponent is 11 bit number with 1023 bias (ie with 1023 added to it)
If the exponent is all 0 and the mantissa is non zero then the number is said to be non-normalized
This basically means there is a total of 2^62-2^52+1 of possible double representations that according to the standard are between 0 and 1. Note that 2^52+1 is to the remove the cases of the non-normalized numbers.
Remember that if mantissa is positive but exponent is negative number is positive but less than 1 :-)
For other numbers it is a bit harder because the edge integer numbers may not representable in a precise manner in the IEEE 754 representation, and because there are other bits used in the exponent to be able represent the numbers, so the larger the number the lower the different values.

Why is Math.sqrt(i*i).floor == i?

I am wondering if this is true: When I take the square root of a squared integer, like in
f = Math.sqrt(123*123)
I will get a floating point number very close to 123. Due to floating point representation precision, this could be something like 122.99999999999999999999 or 123.000000000000000000001.
Since floor(122.999999999999999999) is 122, I should get 122 instead of 123. So I expect that floor(sqrt(i*i)) == i-1 in about 50% of the cases. Strangely, for all the numbers I have tested, floor(sqrt(i*i) == i. Here is a small ruby script to test the first 100 million numbers:
100_000_000.times do |i|
puts i if Math.sqrt(i*i).floor != i
end
The above script never prints anything. Why is that so?
UPDATE: Thanks for the quick reply, this seems to be the solution: According to wikipedia
Any integer with absolute value less
than or equal to 2^24 can be exactly
represented in the single precision
format, and any integer with absolute
value less than or equal to 2^53 can
be exactly represented in the double
precision format.
Math.sqrt(i*i) starts to behave as I've expected it starting from i=9007199254740993, which is 2^53 + 1.
Here's the essence of your confusion:
Due to floating point representation
precision, this could be something
like 122.99999999999999999999 or
123.000000000000000000001.
This is false. It will always be exactly 123 on a IEEE-754 compliant system, which is almost all systems in these modern times. Floating-point arithmetic does not have "random error" or "noise". It has precise, deterministic rounding, and many simple computations (like this one) do not incur any rounding at all.
123 is exactly representable in floating-point, and so is 123*123 (so are all modest-sized integers). So no rounding error occurs when you convert 123*123 to a floating-point type. The result is exactly 15129.
Square root is a correctly rounded operation, per the IEEE-754 standard. This means that if there is an exact answer, the square root function is required to produce it. Since you are taking the square root of exactly 15129, which is exactly 123, that's exactly the result you get from the square root function. No rounding or approximation occurs.
Now, for how large of an integer will this be true?
Double precision can exactly represent all integers up to 2^53. So as long as i*i is less than 2^53, no rounding will occur in your computation, and the result will be exact for that reason. This means that for all i smaller than 94906265, we know the computation will be exact.
But you tried i larger than that! What's happening?
For the largest i that you tried, i*i is just barely larger than 2^53 (1.1102... * 2^53, actually). Because conversions from integer to double (or multiplication in double) are also correctly rounded operations, i*i will be the representable value closest to the exact square of i. In this case, since i*i is 54 bits wide, the rounding will happen in the very lowest bit. Thus we know that:
i*i as a double = the exact value of i*i + rounding
where rounding is either -1,0, or 1. If rounding is zero, then the square is exact, so the square root is exact, so we already know you get the right answer. Let's ignore that case.
So now we're looking at the square root of i*i +/- 1. Using a Taylor series expansion, the infinitely precise (unrounded) value of this square root is:
i * (1 +/- 1/(2i^2) + O(1/i^4))
Now this is a bit fiddly to see if you haven't done any floating point error analysis before, but if you use the fact that i^2 > 2^53, you can see that the:
1/(2i^2) + O(1/i^4)
term is smaller than 2^-54, which means that (since square root is correctly rounded, and hence its rounding error must be smaller than 2^54), the rounded result of the sqrt function is exactly i.
It turns out that (with a similar analysis), for any exactly representable floating point number x, sqrt(x*x) is exactly x (assuming that the intermediate computation of x*x doesn't over- or underflow), so the only way you can encounter rounding for this type of computation is in the representation of x itself, which is why you see it starting at 2^53 + 1 (the smallest unrepresentable integer).
For "small" integers, there is usually an exact floating-point representation.
It's not too hard to find cases where this breaks down as you'd expect:
Math.sqrt(94949493295293425**2).floor
# => 94949493295293424
Math.sqrt(94949493295293426**2).floor
# => 94949493295293424
Math.sqrt(94949493295293427**2).floor
# => 94949493295293424
Ruby's Float is a double-precision floating point number, which means that it can accurately represent numbers with (rule of thumb) about 16 significant decimal digits. For regular single-precision floating point numbers it's about significant 7 digits.
You can find more information here:
What Every Computer Scientist Should Know About Floating-Point Arithmetic:
http://docs.sun.com/source/819-3693/ncg_goldberg.html

Resources