Impossible to achieve certain float value - Xcode

I am testing my code. I have boolean conditions with float numbers.
I assign a float value to test my code:
float testFloatValue = 0.9;
And the boolean condition is not met:
if (testFloatValue == 0.9) {
}
Because when I debug, I see my float number has changed from 0.9 to 0.899999976.
I do not understand anything!

Due to the nature of floating point numbers, certain values cannot be represented exactly. So you NEVER want to do a direct check for equality. There are numerous articles about this on the web; if you search you will find many. Here is a quick routine you can use in Objective-C to check for ALMOST equal.
// Returns true if the two values differ by less than a small fixed tolerance.
bool areAlmostEqual(double result, double expectedResult)
{
    return fabs(result - expectedResult) < .0000000001;
}
You would use it like so as per the values in your original question:
if (areAlmostEqual(testFloatValue, 0.9)) {
    // Do something
}

This is a very common misconception. A floating point number is an approximate representation of a real number. The most common standard for floating point (IEEE 754) uses base 2, and base 2 cannot represent all base-10 fractions exactly.
This has nothing to do with Xcode.
When you wrote 0.9, which is 9 * 10^-1, the computer stored the closest binary (base 2) equivalent. When this binary approximation is converted back to decimal (base 10) for display, you get 0.899999976, which is the closest that single-precision floating point can get to your number.
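You can verify this yourself. Here is a minimal sketch (in Java for illustration, though any language shows the same behaviour) that prints the value the float actually stores; BigDecimal's double constructor displays it exactly:
import java.math.BigDecimal;

public class FloatDemo {
    public static void main(String[] args) {
        System.out.printf("%.9f%n", 0.9f);        // prints 0.899999976
        System.out.println(new BigDecimal(0.9f)); // 0.89999997615814208984375, the exact stored value
    }
}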
The standard way to compare floating point numbers is to choose a precision or tolerance, often called epsilon, which is how close two numbers must be to be considered equal (i.e. "close enough"). And because the closest approximation might be slightly lower or slightly higher than your number, you take the absolute difference and compare it to the tolerance. Thus:
const float eps = 0.00001f;
if (fabs(a - b) < eps)
{
    // a and b are approximately equal
}
Floating point is a large and complicated topic, and is definitely worth researching to get a good grasp. Start here:
Floating Point on Wikipedia
You should definitely read this fantastic introduction to floating point:
What Every Computer Scientist Should Know About Floating-Point Arithmetic

Related

Heuristics to sort array of 2D/3D points according to their mutual distance

Consider an array of points in 2D, 3D, (4D, ...) space (e.g. the nodes of an unstructured mesh). Initially the index of a point in the array is not related to its position in space. In the simple case, assume I already know some nearest-neighbor connectivity graph.
I would like some heuristic which increases the probability that two points which are close to each other in space also have similar indexes (i.e. are close in the array).
I understand that an exact solution is very hard (perhaps similar to the Travelling salesman problem), but I don't need an exact solution, just something which increases the probability.
My ideas on a solution:
A naive solution would be something like this (a sketch of the fitness function follows below):
1. For each point "i" compute a fitness E_i given by the sum of distances in the array (i.e. index-wise) from its spatial neighbors (i.e. space-wise):
E_i = -Sum_k ( abs( index(i) - index(k) ) )
where "k" ranges over the spatial nearest neighbors of "i".
2. For pairs of points (i,j) which both have low fitness (E_i, E_j), try to swap them; if the fitness improves, accept the swap.
But the detailed implementation and its performance optimization are not so clear.
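For concreteness, a minimal sketch of the fitness step in Java (the neighbors array is a hypothetical precomputed nearest-neighbor list, as assumed above):
// neighbors[i] holds the array indices of point i's spatial nearest neighbors.
// E_i = -Sum_k |index(i) - index(k)|; values closer to zero are better.
static long fitness(int i, int[][] neighbors) {
    long e = 0;
    for (int k : neighbors[i]) {
        e -= Math.abs(i - k);
    }
    return e;
}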
Another solution, which does not need a precomputed nearest-neighbor graph, would be based on some form of locality-sensitive hashing.
I think this could be quite a common problem, and good solutions may already exist; I do not want to reinvent the wheel.
Applications:
improve cache locality, considering that memory access is often the bottleneck of graph traversal
it could accelerate interpolation on an unstructured grid, more specifically the search for nodes which are near the sample (e.g. the centers of radial basis functions)
I'd say space filling curves (SFC) are the standard solution to map proximity in space to a linear ordering. The most common ones are Hilbert curves and Z-curves (Morton order).
Hilbert curves have the best proximity mapping, but they are somewhat expensive to calculate. Z-ordering still has a good proximity mapping but is very easy to calculate; it is sufficient to interleave the bits of each dimension. Assuming integer values, if you have a 3D point (x,y,z) with 64-bit coordinates, the z-value is x_0,y_0,z_0, x_1,y_1,z_1, ..., x_63,y_63,z_63, i.e. a 192-bit value consisting of the first bit of every dimension, followed by the second bit of every dimension, and so on. If your array is ordered according to that z-value, points that are close in space are usually also close in the array.
Here is an example function that interleaves (merges) values into a z-value (nBitsPerValue is usually 32 or 64):
public static long[] mergeLong(final int nBitsPerValue, long[] src) {
    final int DIM = src.length;
    int intArrayLen = (src.length * nBitsPerValue + 63) >>> 6;
    long[] trg = new long[intArrayLen];

    long maskSrc = 1L << (nBitsPerValue - 1); // current bit in each source value
    long maskTrg = 0x8000000000000000L;       // current bit in the target array
    int srcPos = 0;
    int trgPos = 0;
    for (int j = 0; j < nBitsPerValue * DIM; j++) {
        // Copy one bit of the current dimension into the target.
        if ((src[srcPos] & maskSrc) != 0) {
            trg[trgPos] |= maskTrg;
        } else {
            trg[trgPos] &= ~maskTrg;
        }
        maskTrg >>>= 1;
        if (maskTrg == 0) {            // current target word is full
            maskTrg = 0x8000000000000000L;
            trgPos++;
        }
        if (++srcPos == DIM) {         // cycled through all dimensions:
            srcPos = 0;                // move on to the next bit position
            maskSrc >>>= 1;
        }
    }
    return trg;
}
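For illustration, here is a hedged usage sketch that sorts an array of 3D integer points by their z-value (this assumes Java 9+ for Arrays.compareUnsigned, which compares the z-value words lexicographically):
import java.util.Arrays;

long[][] points = { {5, 9, 1}, {6, 9, 1}, {100, 2, 7} };
Arrays.sort(points, (a, b) ->
        Arrays.compareUnsigned(mergeLong(64, a), mergeLong(64, b)));
// Points that are close in space now tend to be close in the array.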
You can also interleave the bits of floating point values (if encoded with IEEE 754, as they usually are on standard computers), but this results in non-Euclidean distance properties. You may have to transform negative values first; see here, section 2.3.
EDIT
To answer the questions from the comments:
1) I understand how to make space filling curve for regular
rectangular grid. However, if I have randomly positioned floating
points, several points can map into one box. Would that algorithm work
in that case?
There are several ways to use floating point (FP) values. The simplest is to convert them to integer values by multiplying them by a large constant, for example by 10^6 to preserve 6 digits of precision.
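A minimal sketch of that approach (the 1e6 scale factor is just the example constant from above):
// Quantize a point's floating point coordinates to integers by scaling,
// so they can be fed to mergeLong() above.
static long[] quantize(double[] point, double scale) {
    long[] out = new long[point.length];
    for (int d = 0; d < point.length; d++) {
        out[d] = Math.round(point[d] * scale); // e.g. scale = 1e6
    }
    return out;
}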
Another way is to use the bit-level representation of the FP value to turn it into an integer. This has the advantage that no precision is lost and you don't have to determine a multiplication constant. The disadvantage is that Euclidean distance metrics no longer work.
It works as follows: the trick is that floating point values do not have infinite precision, but are limited to 64 bits. Hence they automatically form a grid. The difference from integer values is that floating point values do not form a quadratic grid but a rectangular one, where the rectangles get bigger with growing distance from (0,0). The grid size is determined by how much precision is available at a given point. Close to (0,0), the precision (= grid size) is 10^-28; close to (1,1), it is 10^-16 (see here). This distorted grid still has the proximity mapping, but distances are not Euclidean anymore.
Here is the code to do the transformation (Java, taken from here; in C++ you would reinterpret the bits of the double as an integer, e.g. via memcpy, rather than performing a value cast):
public static long toSortableLong(double value) {
    long r = Double.doubleToRawLongBits(value);
    return (r >= 0) ? r : r ^ 0x7FFFFFFFFFFFFFFFL;
}

public static double toDouble(long value) {
    return Double.longBitsToDouble(value >= 0 ? value : value ^ 0x7FFFFFFFFFFFFFFFL);
}
These conversions preserve the ordering of the converted values, i.e. for any two FP values the resulting integers have the same ordering with respect to <, >, =. The non-Euclidean behaviour is caused by the exponent, which is encoded in the bit string. As mentioned above, this is also discussed here, section 2.3; however, the code there is slightly less optimized.
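A quick sanity check of the order-preserving property, using the function above (a sketch; run with assertions enabled, java -ea):
double[] xs = { -2.5, -1.0, 0.0, 0.5, 3.25 }; // strictly increasing inputs
for (int i = 1; i < xs.length; i++) {
    assert toSortableLong(xs[i - 1]) < toSortableLong(xs[i]);
}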
2) Is there some algorithm how to do iterative update of such space
filling curve if my points moves in space? ( i.e. without reordering
the whole array each time )
The space filling curve imposes a specific ordering, so for every set of points there is only one valid ordering. If a point is moved, it has to be reinserted at the new position determined by its z-value.
The good news is that a small movement usually means that a point stays in the same 'area' of your array, so if you really use a fixed array, you only have to shift small parts of it.
If you have a lot of moving objects and the array is too cumbersome, you may want to look into 'moving object indexes' (MX-CIF quadtree, etc.). I personally can recommend my very own PH-Tree. It is a kind of bitwise radix-quadtree that uses a z-curve for internal ordering. It is quite efficient for updates (and other operations). However, I usually recommend it only for larger datasets; for small datasets a simple quadtree is usually good enough.
The problem you are trying to solve is meaningful only if, given a point p whose nearest neighbor is q, the nearest neighbor of q is also p.
That is not trivial, since for example the two points can represent positions in a landscape, where one point can be high on a mountain, so going from the bottom up the mountain costs more than going the other way around (from the mountain to the bottom). So, make sure you check that's not your case.
Since TilmannZ already proposed a solution, I would like to comment on the LSH you mentioned. I would not choose that, since your points lie in a really low-dimensional space (not even 100 dimensions), so why use LSH?
I would go for CGAL's algorithms in that case, such as 2D NNS, or even a simple kd-tree. And if speed is critical, but space is not, then why not go for a quadtree (octree in 3D)? I have built one; it won't go beyond 10 dimensions in 8 GB of RAM.
If however, you feel that your data may belong in a higher dimensional space in the future, then I would suggest using:
LSH from Andoni, really cool guy.
FLANN, which offers another approach.
kd-GeRaF, which is developed by me.

How to represent -infinity in programming

How can I represent -infinity in C++, Java, etc.?
In my exercise, I need to initialize a variable with -infinity to show that it's a very small number.
When computing -infinity - 3 or -infinity + 5, the result should also be -infinity.
I tried initializing it with INT_MIN, but when I compute INT_MIN - 1 I get the upper limit, so I can't make a test like: if(value < INT_MIN) var = INT_MIN;
So how can I do that?
You cannot represent infinity with integers[1]. However, you can do so with floating point numbers, i.e., float and double.
You list several languages in the tags, and they all have different ways of obtaining the infinity value (e.g., C99 defines INFINITY in math.h, if infinity is available with that implementation, while Java has POSITIVE_INFINITY and NEGATIVE_INFINITY in Float and Double classes). It is also often (but not always) possible to obtain infinity values by dividing floating point numbers by zero.
[1] Excepting the possibility that you could wrap every arithmetic operation on your integers with code that checks for a special value that you treat as infinity. I wouldn't recommend this.
You can get -Infinity as a floating point constant (at least in Java):
double negInf = Double.NEGATIVE_INFINITY;
It behaves according to the IEEE 754 floating point spec.
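A quick check that this gives the behaviour the question asks for (these results are all defined by IEEE 754):
double negInf = Double.NEGATIVE_INFINITY;
System.out.println(negInf - 3);                 // -Infinity
System.out.println(negInf + 5);                 // -Infinity
System.out.println(negInf < Integer.MIN_VALUE); // true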
If there is the possibility that a number is not there, then instead of picking a number from its domain to represent 'not there', I would pick a type that has both every integer I care about and a 'not there' state.
A (deferred) C++1y proposal for optional is an example of that: an optional<int> is either absent, or an integer. To access the integer, you first ask if it is there, and if it is you 'dereference' the optional to get it.
Making optionals 'infectious', so that on almost any binary operation an absent operand makes the result absent, should be an easy extension of this idea.
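As a rough illustration of the 'infectious' idea, here is a sketch using Java's Optional rather than the C++ proposal the answer refers to:
import java.util.Optional;

// "Infectious" addition: if either operand is absent, the result is absent.
static Optional<Long> add(Optional<Long> a, Optional<Long> b) {
    return a.flatMap(x -> b.map(y -> x + y));
}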
You could define a number as your pseudo-negative-infinity and, when adding or subtracting something from a number, first check whether the variable equals that pseudo-number; if so, just leave it as it is.
But you may find library functions giving you that functionality already implemented, e.g. Double in Java or std::numeric_limits<double>::infinity() in C++.
If you want to represent the minimum value, you can use, for example, for integers:
int a = Integer.MIN_VALUE;
For floats, doubles, etc., take the corresponding wrapper class's constant (but note that in Java, Double.MIN_VALUE is the smallest positive double; the most negative finite value is -Double.MAX_VALUE).
You can't truly represent an infinite value because you've got to store it in a finite number of bits. There are symbolic versions of infinity in certain types (e.g. in the typical floating point specification), but it won't behave exactly like infinity in the strict sense. You'll need to include some additional logic in your program to emulate the behaviour you need.

what is the best numerical precision in ruby

I was wondering how I can get the best precision in Ruby. Someone told me that the best precision is probably between 0 and 1, because as you go to larger numbers the step size increases as well.
I suppose a way to find out would be to know what the minimum float number is and what the next float number is; then the precision would be their difference, right? If I'm correct, how could I do this in Ruby?
I am not sure how to use this http://ruby.wikia.com/wiki/Float to find that information.
Any help appreciated.
In terms of significant digits, the precision is the same, regardless of scale. That is, if you scale your range from [0.0, 1000.0] down to [0.0, 1.0] just by dividing numbers in the natural range by 1000.0, this will have no discernible effect on the precision of your range. In fact, a larger range will have marginally greater precision since it fully contains the smaller range.
As for discovering the absolute precision, you have two problems:
The absolute precision depends on the magnitude, which varies "infinitely" within the range [0, 1] (lim_{x→0} log(x) = -∞). So there is no single precision for numbers in that range; you can only derive the absolute precision at a given point in the range.
The common technique for discovering the minimum step, known as the ulp (unit in the last place), is to interpret the bit representation of the float as an integer, increment it by one, and reinterpret the result as a float. Ruby doesn't, AFAIK, let you do this directly.
There is, however, an iterative solution: add 1.0 to the number and compute the difference ((x + 1.0) - x). If the difference is zero, double the addend ((x + 2.0) - x) and repeat until the difference is non-zero; otherwise, halve the addend (to 0.5) and repeat until the difference is zero. The lowest addend that produces a non-zero difference is (approximately) the ulp. (I described this from vague memory, so it might be NQR.)
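For comparison, the bit-level technique described above is straightforward in a language that exposes the float's bits; a sketch in Java:
// Next representable double above a positive finite x: reinterpret the
// bits as a long, add one, and reinterpret back.
static double nextAbove(double x) {
    long bits = Double.doubleToLongBits(x);
    return Double.longBitsToDouble(bits + 1);
}
// The ulp at x is then nextAbove(x) - x (Java also provides Math.ulp directly).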
You can use the class Rational: it stores non-integer numbers as a fraction of two Integers, which (as far as I know) are automatically converted to Bignum when needed.
The Flt Ruby library provides arbitrary-precision floating point.

Why is Math.sqrt(i*i).floor == i?

I am wondering if this is true: When I take the square root of a squared integer, like in
f = Math.sqrt(123*123)
I will get a floating point number very close to 123. Due to floating point representation precision, this could be something like 122.99999999999999999999 or 123.000000000000000000001.
Since floor(122.999999999999999999) is 122, I should get 122 instead of 123. So I expect that floor(sqrt(i*i)) == i-1 in about 50% of the cases. Strangely, for all the numbers I have tested, floor(sqrt(i*i)) == i. Here is a small Ruby script to test the first 100 million numbers:
100_000_000.times do |i|
  puts i if Math.sqrt(i*i).floor != i
end
The above script never prints anything. Why is that so?
UPDATE: Thanks for the quick reply; this seems to be the solution. According to Wikipedia:
Any integer with absolute value less than or equal to 2^24 can be exactly represented in the single precision format, and any integer with absolute value less than or equal to 2^53 can be exactly represented in the double precision format.
Math.sqrt(i*i) starts to behave as I expected starting from i = 9007199254740993, which is 2^53 + 1.
Here's the essence of your confusion:
Due to floating point representation precision, this could be something like 122.99999999999999999999 or 123.000000000000000000001.
This is false. It will always be exactly 123 on an IEEE-754 compliant system, which is almost every system these days. Floating-point arithmetic does not have "random error" or "noise". It has precise, deterministic rounding, and many simple computations (like this one) do not incur any rounding at all.
123 is exactly representable in floating-point, and so is 123*123 (so are all modest-sized integers). So no rounding error occurs when you convert 123*123 to a floating-point type. The result is exactly 15129.
Square root is a correctly rounded operation, per the IEEE-754 standard. This means that if there is an exact answer, the square root function is required to produce it. Since you are taking the square root of exactly 15129, which is exactly 123, that's exactly the result you get from the square root function. No rounding or approximation occurs.
Now, for how large of an integer will this be true?
Double precision can exactly represent all integers up to 2^53. So as long as i*i is less than 2^53, no rounding will occur in your computation, and the result will be exact for that reason. This means that for all i smaller than 94906265, we know the computation will be exact.
But you tried i larger than that! What's happening?
For the largest i that you tried, i*i is just barely larger than 2^53 (1.1102... * 2^53, actually). Because conversions from integer to double (or multiplication in double) are also correctly rounded operations, i*i will be the representable value closest to the exact square of i. In this case, since i*i is 54 bits wide, the rounding will happen in the very lowest bit. Thus we know that:
i*i as a double = the exact value of i*i + rounding
where rounding is either -1,0, or 1. If rounding is zero, then the square is exact, so the square root is exact, so we already know you get the right answer. Let's ignore that case.
So now we're looking at the square root of i*i +/- 1. Using a Taylor series expansion, the infinitely precise (unrounded) value of this square root is:
i * (1 +/- 1/(2i^2) + O(1/i^4))
Now this is a bit fiddly to see if you haven't done any floating point error analysis before, but if you use the fact that i^2 > 2^53, you can see that the
1/(2i^2) + O(1/i^4)
term is smaller than 2^-54, which means the infinitely precise square root lies within half an ulp of i. Since the square root is correctly rounded, the rounded result of the sqrt function is exactly i.
It turns out that (with a similar analysis), for any exactly representable floating point number x, sqrt(x*x) is exactly x (assuming that the intermediate computation of x*x doesn't over- or underflow), so the only way you can encounter rounding for this type of computation is in the representation of x itself, which is why you see it starting at 2^53 + 1 (the smallest unrepresentable integer).
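A quick Java check of that boundary (the squaring must happen in floating point here, since i*i would overflow a 64-bit long for such i):
long i = (1L << 53) + 1;        // smallest integer a double cannot represent
double x = (double) i;          // rounds down to 2^53
long r = (long) Math.floor(Math.sqrt(x * x));
System.out.println(r == i);     // false
System.out.println(r);          // 9007199254740992, i.e. i - 1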
For "small" integers, there is usually an exact floating-point representation.
It's not too hard to find cases where this breaks down as you'd expect:
Math.sqrt(94949493295293425**2).floor
# => 94949493295293424
Math.sqrt(94949493295293426**2).floor
# => 94949493295293424
Math.sqrt(94949493295293427**2).floor
# => 94949493295293424
Ruby's Float is a double-precision floating point number, which means that it can accurately represent numbers with (as a rule of thumb) about 16 significant decimal digits. For regular single-precision floating point numbers it's about 7 significant digits.
You can find more information here:
What Every Computer Scientist Should Know About Floating-Point Arithmetic:
http://docs.sun.com/source/819-3693/ncg_goldberg.html

Integers and float precision

This is more of a numerical analysis question than a programming one, but I suppose some of you will be able to answer it.
In the sum of two floats, is there any precision lost? Why?
In the sum of a float and an integer, is there any precision lost? Why?
Thanks.
In the sum of two floats, is there any precision lost?
If both floats have differing magnitude and both are using the complete precision range (of about 7 decimal digits) then yes, you will see some loss in the last places.
Why?
This is because floats are stored in the form (sign) (mantissa) × 2^(exponent). If two values have differing exponents and you add them, then the smaller value gets reduced to fewer digits in the mantissa (because it has to adapt to the larger exponent):
PS> [float]([float]0.0000001 + [float]1)
1
In the sum of a float and an integer, is there any precision lost?
Yes, a normal 32-bit integer can exactly represent values which do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this only applies to numbers that are large enough, i.e. longer than 24 bits.
Why?
Because float has 24 bits of precision and (32-bit) integers have 32. float will still be able to retain the magnitude and most of the significant digits, but the last places may likely differ:
PS> [float]2100000050 + [float]100
2100000100
The precision depends on the magnitude of the original numbers. In floating point, the computer represents the number 312 internally in scientific notation (shown here in base 10 for simplicity; hardware floats actually use base 2):
3.12000000000 * 10 ^ 2
The decimal places in the left hand side (mantissa) are fixed. The exponent also has an upper and lower bound. This allows it to represent very large or very small numbers.
If you try to add two numbers which are the same in magnitude, the result should remain the same in precision, because the decimal point doesn't have to move:
312.0 + 643.0 <==>
3.12000000000 * 10 ^ 2 +
6.43000000000 * 10 ^ 2
-----------------------
9.55000000000 * 10 ^ 2
If you tried to add a very big and a very small number, you would lose precision, because they must be squeezed into the above format. Consider 312 + 12300000000000 (i.e. 1.23 * 10^13). First you have to scale the smaller number to line up with the bigger one, then add:
1.23000000000 * 10 ^ 13 +
0.00000000003 * 10 ^ 13
-----------------------
1.23000000003 * 10 ^ 13 <-- precision lost here! (312 became 300)
Floating point can handle very large, or very small numbers. But it can't represent both at the same time.
As for ints and doubles being added, the int gets turned into a double immediately, then the above applies.
When adding two floating point numbers, there is generally some error. D. Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" describes the effect and the reasons in detail, and also how to calculate an upper bound on the error, and how to reason about the precision of more complex calculations.
When adding a float to an integer, the integer is first converted to a float by C++, so two floats are being added and error is introduced for the same reasons as above.
The precision available for a float is limited, so of course there is always the risk that any given operation drops precision.
The answer for both your questions is "yes".
If you try adding a very large float to a very small one, you will for instance have problems.
Or if you try to add an integer to a float, where the integer uses more bits than the float has available for its mantissa.
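A small Java illustration of that last case (a float's significand holds 24 bits, so the first integer it cannot represent exactly is 2^24 + 1):
int big = (1 << 24) + 1;              // 16777217 needs 25 bits
float f = big;                        // rounds to 16777216.0f
System.out.println(f == 16777216.0f); // true: the trailing 1 was lost
System.out.println(f + 1f == f);      // also true: the sum rounds back to 16777216.0f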
The short answer: a computer represents a float with a limited number of bits, which is often done with mantissa and exponent, so only a few bytes are used for the significant digits, and the others are used to represent the position of the decimal point.
If you were to try to add (say) 10^23 and 7, then it won't be able to accurately represent that result. A similar argument applies when adding a float and integer -- the integer will be promoted to a float.
In the sum two floats, is there any precision lost?
In the sum of a float and a integer, is there any precision lost? Why?
Not always. If the sum is representable within the available precision, you won't get any precision loss.
Example: 0.5 + 0.75 => no precision loss
x * 0.5 => no precision loss (unless x is so small that the result becomes denormal)
In the general case, one adds floats of slightly different magnitudes, so there is a precision loss which actually depends on the rounding mode. I.e., if you're adding numbers with totally different ranges, expect precision problems.
Denormals are there to give extra precision in extreme cases, at the expense of CPU time.
Depending on how your compiler handles floating-point computation, results can vary.
With strict IEEE semantics, adding two 32-bit floats should not give better accuracy than 32 bits.
In practice it may require more instructions to ensure that, so you shouldn't rely on accurate and repeatable results with floating point.
In both cases yes:
assert( 1E+36f + 1.0f == 1E+36f );
assert( 1E+36f + 1 == 1E+36f );
The case float + int is the same as float + float, because a standard conversion is applied to the int. In the case of float + float, this is implementation dependent, because an implementation may choose to do the addition at double precision. There may be some loss when you store the result, of course.
In both cases, the answer is "yes". When adding an int to a float, the integer is converted to floating point representation before the addition takes place anyway.
To understand why, I suggest you read this gem: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
