Pascal double value is not exact, how to fix this? - pascal

I have been solving a programming challenge in UVA and got this problem, which is REALLY strange. Here is the flawed code:
program WTF;
begin
WriteLn(Trunc(2.01 * 100));
ReadLn();
end.
Obviously, I need to get 201 as Integer, but I get 200, this happens because Double somehow doesn't store the exact value... It's 2.01 = 2.00(9) for reasons unbeknownst to me, can someone explain this and provide a solution?
Edit: Yet, I figgured that using Round() instead of Trunc() fixes this... But still, why wouldn't Trunc() work?

Double stores numbers of the form s*2p where s and p are integers. The number 2.01 is not of the form s*2p for any integers s, p so it cannot be stored exactly in a Double.
The solution here is to round 2.01 * 100 to the nearest integer instead of truncating it. Although 2.01 is not exactly 2.01, it is only a little bit below. Rounding to the nearest integer would result in 201.
Note that if by 2.00(9) you mean 2.0099999999… repeating indefinitely, then 2.00(9)is not the Double you get when you write 2.01. The nearest Double to the real 2.01, and the number you got, is 2.0099999999999997868371792719699442386627197265625. It is of the form s * 2p: 9052235251014696 * 2-52

Related

Get 1, 0, -1 as positive, zero, or negative for an integer with math only

I have a situation where I'm performing a calculate over a huge number of rows, and I can really increase the performance if I can eschew a conditional statement.
What I need is for a given positive, zero, or negative integer I want the result 1, 0, -1 respectively.
So if I do col/ABS(col), I will get 1 for a positive number, and -1 for a negative number, but of course if col equals 0 then I'll get an error. I can't get an error.
This seems simple enough, but I can't wrap my ahead around it.
Assuming either two's complement 32-bit integers, or one's complement with no negative-zero to worry about, then the following works well:
(x>>31) - (-x>>31);
Replace 31 with 63 for 64-bit integers, and so on.
col/max(1, abs(col))
Ugly but works. For integers, that is. For floating point values where there's no well-defined smallest positive value, you're stuck unless the language allows you to look into it as a bit sequence, then you can just do the same with the sign flag and the significand.
Whether this helps optimising anything is highly debatable though. It certainly makes things harder to read.
I am doing this in an SSAS tabular model, which doesn't have a MAX function like in #bizclop's answer (which helps in many other applications such as EXCEL).
I ended up doing the following which was inspired by the accepted answer:
ROUND([col] / (ABS([col]) + 1), 0)
This ended up reducing my query time quite significantly (40%) versus IF([col] <> 0, [col]/ABS([col], 0).

JDBC / Oracle Double value insertion fails [duplicate]

double r = 11.631;
double theta = 21.4;
In the debugger, these are shown as 11.631000000000000 and 21.399999618530273.
How can I avoid this?
These accuracy problems are due to the internal representation of floating point numbers and there's not much you can do to avoid it.
By the way, printing these values at run-time often still leads to the correct results, at least using modern C++ compilers. For most operations, this isn't much of an issue.
I liked Joel's explanation, which deals with a similar binary floating point precision issue in Excel 2007:
See how there's a lot of 0110 0110 0110 there at the end? That's because 0.1 has no exact representation in binary... it's a repeating binary number. It's sort of like how 1/3 has no representation in decimal. 1/3 is 0.33333333 and you have to keep writing 3's forever. If you lose patience, you get something inexact.
So you can imagine how, in decimal, if you tried to do 3*1/3, and you didn't have time to write 3's forever, the result you would get would be 0.99999999, not 1, and people would get angry with you for being wrong.
If you have a value like:
double theta = 21.4;
And you want to do:
if (theta == 21.4)
{
}
You have to be a bit clever, you will need to check if the value of theta is really close to 21.4, but not necessarily that value.
if (fabs(theta - 21.4) <= 1e-6)
{
}
This is partly platform-specific - and we don't know what platform you're using.
It's also partly a case of knowing what you actually want to see. The debugger is showing you - to some extent, anyway - the precise value stored in your variable. In my article on binary floating point numbers in .NET, there's a C# class which lets you see the absolutely exact number stored in a double. The online version isn't working at the moment - I'll try to put one up on another site.
Given that the debugger sees the "actual" value, it's got to make a judgement call about what to display - it could show you the value rounded to a few decimal places, or a more precise value. Some debuggers do a better job than others at reading developers' minds, but it's a fundamental problem with binary floating point numbers.
Use the fixed-point decimal type if you want stability at the limits of precision. There are overheads, and you must explicitly cast if you wish to convert to floating point. If you do convert to floating point you will reintroduce the instabilities that seem to bother you.
Alternately you can get over it and learn to work with the limited precision of floating point arithmetic. For example you can use rounding to make values converge, or you can use epsilon comparisons to describe a tolerance. "Epsilon" is a constant you set up that defines a tolerance. For example, you may choose to regard two values as being equal if they are within 0.0001 of each other.
It occurs to me that you could use operator overloading to make epsilon comparisons transparent. That would be very cool.
For mantissa-exponent representations EPSILON must be computed to remain within the representable precision. For a number N, Epsilon = N / 10E+14
System.Double.Epsilon is the smallest representable positive value for the Double type. It is too small for our purpose. Read Microsoft's advice on equality testing
I've come across this before (on my blog) - I think the surprise tends to be that the 'irrational' numbers are different.
By 'irrational' here I'm just referring to the fact that they can't be accurately represented in this format. Real irrational numbers (like π - pi) can't be accurately represented at all.
Most people are familiar with 1/3 not working in decimal: 0.3333333333333...
The odd thing is that 1.1 doesn't work in floats. People expect decimal values to work in floating point numbers because of how they think of them:
1.1 is 11 x 10^-1
When actually they're in base-2
1.1 is 154811237190861 x 2^-47
You can't avoid it, you just have to get used to the fact that some floats are 'irrational', in the same way that 1/3 is.
One way you can avoid this is to use a library that uses an alternative method of representing decimal numbers, such as BCD
If you are using Java and you need accuracy, use the BigDecimal class for floating point calculations. It is slower but safer.
Seems to me that 21.399999618530273 is the single precision (float) representation of 21.4. Looks like the debugger is casting down from double to float somewhere.
You cant avoid this as you're using floating point numbers with fixed quantity of bytes. There's simply no isomorphism possible between real numbers and its limited notation.
But most of the time you can simply ignore it. 21.4==21.4 would still be true because it is still the same numbers with the same error. But 21.4f==21.4 may not be true because the error for float and double are different.
If you need fixed precision, perhaps you should try fixed point numbers. Or even integers. I for example often use int(1000*x) for passing to debug pager.
Dangers of computer arithmetic
If it bothers you, you can customize the way some values are displayed during debug. Use it with care :-)
Enhancing Debugging with the Debugger Display Attributes
Refer to General Decimal Arithmetic
Also take note when comparing floats, see this answer for more information.
According to the javadoc
"If at least one of the operands to a numerical operator is of type double, then the
operation is carried out using 64-bit floating-point arithmetic, and the result of the
numerical operator is a value of type double. If the other operand is not a double, it is
first widened (§5.1.5) to type double by numeric promotion (§5.6)."
Here is the Source

Double to Int Rounding, Logic Issue

This is more of a puzzle/logic type question, and less of a programming question -- but I was hoping someone could help, because I'm stumped.
I start off with an integer number and need to take X percent of it. Once I get that double number I need to round it back to an integer (Call this Y). It doesn't actually matter if I round it, and I don't really care which way it goes (up or down), so using floor/ceiling would be an acceptable solution -- I just need an int.
Later in the system I will be presented with the int Y, and will need to know what the original number was. At this point if i know what the percentage was, and I know what method i used to get the double to an int -- How can i determine the original number.
This needs to work for every original-number/percentage combination.
Example:
Original number: 997
Percentage: 90
Int-Conversion Method: Floor
997 * .90 = 897.3
floor(897.3) = 897
Y = 897
..
..
Given 897, and knowing the percentage and int-conversion method, how can I determine the original number was 997?
997 * .90 = 897.3
floor(897.3) = 897
Y = 897
Given the method you have suggested, I will explain why returning to the exact number is not possible, given the method and the data you've provided.
Computing the percentage is fine, and you're left with the double value. You then, effectively, truncate some of the data, using the floor function. Truncation is not a bijection, so you can't go back from the 897 -> 897.3, so reproducing the original number is not possible with any degree of certainty. You won't be far wrong, given the scale of the data you've provided, but for larger values, or for situations that require more precision you might find this method to much like guess work.
You would use the formula
original number = calculated number / percentage
Using your numbers
997 = 897 / 0.90
In the original calculation and the reverse calculation, you need to round the answer.
If you don't, you'll be off by one more often when you do the reverse calculation.

Arithmetic in ruby

Why this code 7.30 - 7.20 in ruby returns 0.0999999999999996, not 0.10?
But if i'll write 7.30 - 7.16, for example, everything will be ok, i'll get 0.14.
What the problem, and how can i solve it?
What Every Computer Scientist Should Know About Floating-Point Arithmetic
The problem is that some numbers we can easily write in decimal don't have an exact representation in the particular floating point format implemented by current hardware. A casual way of stating this is that all the integers do, but not all of the fractions, because we normally store the fraction with a 2**e exponent. So, you have 3 choices:
Round off appropriately. The unrounded result is always really really close, so a rounded result is invariably "perfect". This is what Javascript does and lots of people don't even realize that JS does everything in floating point.
Use fixed point arithmetic. Ruby actually makes this really easy; it's one of the only languages that seamlessly shifts to Class Bignum from Fixnum as numbers get bigger.
Use a class that is designed to solve this problem, like BigDecimal
To look at the problem in more detail, we can try to represent your "7.3" in binary. The 7 part is easy, 111, but how do we do .3? 111.1 is 7.5, too big, 111.01 is 7.25, getting closer. Turns out, 111.010011 is the "next closest smaller number", 7.296875, and when we try to fill in the missing .003125 eventually we find out that it's just 111.010011001100110011... forever, not representable in our chosen encoding in a finite bit string.
The problem is that floating point is inaccurate. You can solve it by using Rational, BigDecimal or just plain integers (for example if you want to store money you can store the number of cents as an int instead of the number of dollars as a float).
BigDecimal can accurately store any number that has a finite number of digits in base 10 and rounds numbers that don't (so three thirds aren't one whole).
Rational can accurately store any rational number and can't store irrational numbers at all.
That is a common error from how float point numbers are represented in memory.
Use BigDecimal if you need exact results.
result=BigDecimal.new("7.3")-BigDecimal("7.2")
puts "%2.2f" % result
It is interesting to note that a number that has few decimals in one base may typically have a very large number of decimals in another. For instance, it takes an infinite number of decimals to express 1/3 (=0.3333...) in the base 10, but only one decimal in the base 3. Similarly, it takes many decimals to express the number 1/10 (=0.1) in the base 2.
Since you are doing floating point math then the number returned is what your computer uses for precision.
If you want a closer answer, to a set precision, just multiple the float by that (such as by 100), convert it to an int, do the math, then divide.
There are other solutions, but I find this to be the simplest since rounding always seems a bit iffy to me.
This has been asked before here, you may want to look for some of the answers given before, such as this one:
Dealing with accuracy problems in floating-point numbers

Algorithm to find a common multiplier to convert decimal numbers to whole numbers

I have an array of numbers that potentially have up to 8 decimal places and I need to find the smallest common number I can multiply them by so that they are all whole numbers. I need this so all the original numbers can all be multiplied out to the same scale and be processed by a sealed system that will only deal with whole numbers, then I can retrieve the results and divide them by the common multiplier to get my relative results.
Currently we do a few checks on the numbers and multiply by 100 or 1,000,000, but the processing done by the *sealed system can get quite expensive when dealing with large numbers so multiplying everything by a million just for the sake of it isn’t really a great option. As an approximation lets say that the sealed algorithm gets 10 times more expensive every time you multiply by a factor of 10.
What is the most efficient algorithm, that will also give the best possible result, to accomplish what I need and is there a mathematical name and/or formula for what I’m need?
*The sealed system isn’t really sealed. I own/maintain the source code for it but its 100,000 odd lines of proprietary magic and it has been thoroughly bug and performance tested, altering it to deal with floats is not an option for many reasons. It is a system that creates a grid of X by Y cells, then rects that are X by Y are dropped into the grid, “proprietary magic” occurs and results are spat out – obviously this is an extremely simplified version of reality, but it’s a good enough approximation.
So far there are quiet a few good answers and I wondered how I should go about choosing the ‘correct’ one. To begin with I figured the only fair way was to create each solution and performance test it, but I later realised that pure speed wasn’t the only relevant factor – an more accurate solution is also very relevant. I wrote the performance tests anyway, but currently the I’m choosing the correct answer based on speed as well accuracy using a ‘gut feel’ formula.
My performance tests process 1000 different sets of 100 randomly generated numbers.
Each algorithm is tested using the same set of random numbers.
Algorithms are written in .Net 3.5 (although thus far would be 2.0 compatible)
I tried pretty hard to make the tests as fair as possible.
Greg – Multiply by large number
and then divide by GCD – 63
milliseconds
Andy – String Parsing
– 199 milliseconds
Eric – Decimal.GetBits – 160 milliseconds
Eric – Binary search – 32
milliseconds
Ima – sorry I couldn’t
figure out a how to implement your
solution easily in .Net (I didn’t
want to spend too long on it)
Bill – I figure your answer was pretty
close to Greg’s so didn’t implement
it. I’m sure it’d be a smidge faster
but potentially less accurate.
So Greg’s Multiply by large number and then divide by GCD” solution was the second fastest algorithm and it gave the most accurate results so for now I’m calling it correct.
I really wanted the Decimal.GetBits solution to be the fastest, but it was very slow, I’m unsure if this is due to the conversion of a Double to a Decimal or the Bit masking and shifting. There should be a
similar usable solution for a straight Double using the BitConverter.GetBytes and some knowledge contained here: http://blogs.msdn.com/bclteam/archive/2007/05/29/bcl-refresher-floating-point-types-the-good-the-bad-and-the-ugly-inbar-gazit-matthew-greig.aspx but my eyes just kept glazing over every time I read that article and I eventually ran out of time to try to implement a solution.
I’m always open to other solutions if anyone can think of something better.
I'd multiply by something sufficiently large (100,000,000 for 8 decimal places), then divide by the GCD of the resulting numbers. You'll end up with a pile of smallest integers that you can feed to the other algorithm. After getting the result, reverse the process to recover your original range.
Multiple all the numbers by 10
until you have integers.
Divide
by 2,3,5,7 while you still have all
integers.
I think that covers all cases.
2.1 * 10/7 -> 3
0.008 * 10^3/2^3 -> 1
That's assuming your multiplier can be a rational fraction.
If you want to find some integer N so that N*x is also an exact integer for a set of floats x in a given set are all integers, then you have a basically unsolvable problem. Suppose x = the smallest positive float your type can represent, say it's 10^-30. If you multiply all your numbers by 10^30, and then try to represent them in binary (otherwise, why are you even trying so hard to make them ints?), then you'll lose basically all the information of the other numbers due to overflow.
So here are two suggestions:
If you have control over all the related code, find another
approach. For example, if you have some function that takes only
int's, but you have floats, and you want to stuff your floats into
the function, just re-write or overload this function to accept
floats as well.
If you don't have control over the part of your system that requires
int's, then choose a precision to which you care about, accept that
you will simply have to lose some information sometimes (but it will
always be "small" in some sense), and then just multiply all your
float's by that constant, and round to the nearest integer.
By the way, if you're dealing with fractions, rather than float's, then it's a different game. If you have a bunch of fractions a/b, c/d, e/f; and you want a least common multiplier N such that N*(each fraction) = an integer, then N = abc / gcd(a,b,c); and gcd(a,b,c) = gcd(a, gcd(b, c)). You can use Euclid's algorithm to find the gcd of any two numbers.
Greg: Nice solution but won't calculating a GCD that's common in an array of 100+ numbers get a bit expensive? And how would you go about that? Its easy to do GCD for two numbers but for 100 it becomes more complex (I think).
Evil Andy: I'm programing in .Net and the solution you pose is pretty much a match for what we do now. I didn't want to include it in my original question cause I was hoping for some outside the box (or my box anyway) thinking and I didn't want to taint peoples answers with a potential solution. While I don't have any solid performance statistics (because I haven't had any other method to compare it against) I know the string parsing would be relatively expensive and I figured a purely mathematical solution could potentially be more efficient.
To be fair the current string parsing solution is in production and there have been no complaints about its performance yet (its even in production in a separate system in a VB6 format and no complaints there either). It's just that it doesn't feel right, I guess it offends my programing sensibilities - but it may well be the best solution.
That said I'm still open to any other solutions, purely mathematical or otherwise.
What language are you programming in? Something like
myNumber.ToString().Substring(myNumber.ToString().IndexOf(".")+1).Length
would give you the number of decimal places for a double in C#. You could run each number through that and find the largest number of decimal places(x), then multiply each number by 10 to the power of x.
Edit: Out of curiosity, what is this sealed system which you can pass only integers to?
In a loop get mantissa and exponent of each number as integers. You can use frexp for exponent, but I think bit mask will be required for mantissa. Find minimal exponent. Find most significant digits in mantissa (loop through bits looking for last "1") - or simply use predefined number of significant digits.
Your multiple is then something like 2^(numberOfDigits-minMantissa). "Something like" because I don't remember biases/offsets/ranges, but I think idea is clear enough.
So basically you want to determine the number of digits after the decimal point for each number.
This would be rather easier if you had the binary representation of the number. Are the numbers being converted from rationals or scientific notation earlier in your program? If so, you could skip the earlier conversion and have a much easier time. Otherwise you might want to pass each number to a function in an external DLL written in C, where you could work with the floating point representation directly. Or you could cast the numbers to decimal and do some work with Decimal.GetBits.
The fastest approach I can think of in-place and following your conditions would be to find the smallest necessary power-of-ten (or 2, or whatever) as suggested before. But instead of doing it in a loop, save some computation by doing binary search on the possible powers. Assuming a maximum of 8, something like:
int NumDecimals( double d )
{
// make d positive for clarity; it won't change the result
if( d<0 ) d=-d;
// now do binary search on the possible numbers of post-decimal digits to
// determine the actual number as quickly as possible:
if( NeedsMore( d, 10e4 ) )
{
// more than 4 decimals
if( NeedsMore( d, 10e6 ) )
{
// > 6 decimal places
if( NeedsMore( d, 10e7 ) ) return 10e8;
return 10e7;
}
else
{
// <= 6 decimal places
if( NeedsMore( d, 10e5 ) ) return 10e6;
return 10e5;
}
}
else
{
// <= 4 decimal places
// etc...
}
}
bool NeedsMore( double d, double e )
{
// check whether the representation of D has more decimal points than the
// power of 10 represented in e.
return (d*e - Math.Floor( d*e )) > 0;
}
PS: you wouldn't be passing security prices to an option pricing engine would you? It has exactly the flavor...

Resources