Why are negative numbers rounded down after division in Ruby? - ruby

I am looking through the documentation on divmod. Part of a table showing the difference between the methods div, divmod, modulo, and remainder is displayed below:
Why is 13.div(-4) rounded to -4 and not to -3? Is there any rule or convention in Ruby to round down negative numbers? If so, why is the following code not rounding down?
-3.25.round() #3

13.div(-4) == -4 and 13.modulo(-4) == -3 so that
(-4 * -4) + -3 == 13
and you get the consistent relationship
(b * (a/b)) + a.modulo(b) == a
Why is 13.div(-4) rounded to -4 and not to -3?
This is a misconception. 13.div(-4) is not really rounded at all. It is integer division, and follows self-consistent rules for working with integers and modular arithmetic. The rounding logic described in your link fits with it, and is then applied consistently to the same divmod operation when one or both of the parameters are Floats. Mathematical operations on negative or fractional numbers are often extended from simpler, more intuitive results on positive integers in this kind of way. E.g. this follows similar logic to how fractional and negative powers, or non-integer factorials, are created from their positive integer variants.
In this case, it's all about self-consistency of divmod, but not about rounding in general.
Ruby's designers had a choice to make when dealing with negative numbers; not all languages give the same result. However, once it was decided that Ruby would return a modulo result whose sign matches the divisor (as opposed to matching the division as a whole), that set how the rest of the numbers work.
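The consistency argument above can be checked directly. This is a minimal sketch using only the values from the question; the variable names are illustrative.

```ruby
a, b = 13, -4

# divmod pairs the floored quotient with a modulo that takes the divisor's sign.
q, r = a.divmod(b)
puts "divmod:    q=#{q}, r=#{r}, b*q + r = #{b * q + r}"

# remainder pairs with truncating division instead (sign follows the receiver).
tq = (a.to_f / b).truncate
tr = a.remainder(b)
puts "remainder: q=#{tq}, r=#{tr}, b*q + r = #{b * tq + tr}"
```

Both pairs satisfy b*q + r == a; they just split the value differently.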
Is there any rule or convention in Ruby to round down negative numbers?
Yes. Rounding a float number means to return the numerically closest integer. When there are two equally close integers, Ruby rounds to the integer furthest from 0. This is an entirely separate design decision from how integer division and modulo arithmetic methods work.
If so, why is the following code not rounding down? -3.25.round() #3
I assume you mean the result to read -3. The round method does not "round down". It does "round closest". -3 is the closest integer to -3.25. Ruby's designers did have to make a choice though, what to do with -3.5.round() # -4. Some languages would instead return a -3 when rounding that number.
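A short sketch of the difference between "round closest" and "round down". The `half:` keyword is an assumption about your Ruby version (it needs Ruby 2.4 or later).

```ruby
results = {
  "-3.25.round"              => -3.25.round,              # closest integer
  "-3.25.floor"              => -3.25.floor,              # toward negative infinity
  "-3.25.truncate"           => -3.25.truncate,           # toward zero
  "-3.5.round"               => -3.5.round,               # tie: away from zero
  "3.5.round"                => 3.5.round,                # tie: away from zero
  "2.5.round(half: :even)"   => 2.5.round(half: :even),   # tie: banker's rounding
}

results.each { |expr, value| puts "#{expr} => #{value}" }
```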

Related

What are all the situations in which Fortran outputs NaN?

I know that division by zero and square root of negative real number outputs NaN. Are there any other similar problems?
I will refer to the Wikipedia entry on NaN and to the Fortran Standard to try to enumerate them.
There are three kinds of operations that can return NaN:
Operations with a NaN as at least one operand.
In Fortran that would include the application of arithmetic and comparison operators, plus math intrinsic functions.
Indeterminate forms:
The divisions (±0) / (±0) and (±∞) / (±∞).
The multiplications (±0) × (±∞) and (±∞) × (±0).
The additions (+∞) + (−∞), (−∞) + (+∞) and equivalent subtractions (+∞) − (+∞) and (−∞) − (−∞).
The standard has alternative functions for powers:
The standard pow function and the integer exponent pown function define 0⁰, 1^∞, and ∞⁰ as 1.
The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
So, all arithmetic operators are included (and also atomic operation functions). All this was pretty obvious; the fun is next:
Real operations with complex results, for example:
The square root of a negative number.
The logarithm of a negative number.
The inverse sine or cosine of a number that is less than −1 or greater than 1.
That would mean (as said by #kvantour in the comments) any intrinsic function called out of its domain: SQRT, LOG, ATAN, ATAN2, ACOS, ACOSH, ASIN, ASINH, FRACTION, RRSPACING, SET_EXPONENT, SPACING
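The indeterminate forms listed above come from IEEE 754 rather than from Fortran itself, so they can be reproduced in any language with IEEE doubles. The sketch below uses Ruby, not Fortran, purely as an illustration; it assumes the platform's pow follows the "standard pow" behaviour described above.

```ruby
inf = Float::INFINITY

nan_sources = {
  "0.0 / 0.0" => 0.0 / 0.0,
  "inf / inf" => inf / inf,
  "0.0 * inf" => 0.0 * inf,
  "inf - inf" => inf - inf,
  "NaN + 1.0" => Float::NAN + 1.0,  # any operation with a NaN operand
}
nan_sources.each { |expr, value| puts "#{expr} -> nan? #{value.nan?}" }

# The pow-style definitions treat these three forms as 1 rather than NaN.
pow_results = [0.0**0.0, 1.0**inf, inf**0.0]
puts "pow forms: #{pow_results.inspect}"
```

One difference worth noting: Ruby's Math.sqrt raises Math::DomainError for a negative argument instead of returning NaN, so the out-of-domain intrinsic cases do not carry over one-to-one.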

Why is Float::INFINITY == Float::INFINITY in Ruby?

In mathematics, two infinities are not equal, nor is one greater than or less than the other. So what gives?
In irb, Float::INFINITY == Float::INFINITY is true (tested in Ruby 1.9.3)
In more technical terms, it all comes down to the IEEE 754 standard for floating-point arithmetic.
The IEEE 754 standard does implicitly define Infinity == Infinity to
be true. The relevant part of the standard is section 5.7: "Four
mutually exclusive relations are possible [between two IEEE 754
values]: less than, equal, greater than, and unordered. The last case
arises when at least one operand is NaN."
Between any pair of floating point values exactly one of these four
relations is true. Therefore, since Infinity is not NaN, Infinity is
not unordered with respect to itself. Having one of (Infinity <
Infinity) and (Infinity > Infinity) be true wouldn't be consistent, so
(Infinity == Infinity).
This was taken from http://compilers.iecc.com/comparch/article/98-07-134
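The four mutually exclusive relations quoted above can be observed directly in Ruby. This is only a sketch; the hash labels are illustrative.

```ruby
inf = Float::INFINITY
nan = Float::NAN

checks = {
  "inf == inf"         => inf == inf,          # equal
  "inf <  inf"         => inf < inf,           # not less than itself
  "inf >  inf"         => inf > inf,           # not greater than itself
  "-inf < inf"         => -inf < inf,          # ordered like the extended reals
  "1.0/0.0 == inf"     => 1.0 / 0.0 == inf,    # every +infinity is the same value
  "nan == nan"         => nan == nan,          # NaN is unordered, even with itself
  "(inf <=> nan).nil?" => (inf <=> nan).nil?,  # <=> signals "unordered" with nil
}

checks.each { |expr, value| puts "#{expr} => #{value}" }
```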
While there are multiple different infinities in most set theories, the infinities represented by real numbers typically represent the infinities of the extended real number line, where +∞ and -∞ are values specifically chosen to be bigger than and smaller than all real numbers. In this setup, ∞ = ∞ and -∞ = -∞.
The set theoretic infinities that are not equal are cardinal or ordinal numbers, which typically wouldn't be represented by a floating-point value. They measure sizes and positions within sets, respectively, so would be better off as generalizations of another type, such as an integer type. If you wanted to store those sorts of values, you would probably have a custom type representing an infinite ordinal or infinite cardinal number.
Also, it is definitely possible for set-theoretic infinities to equal one another. ℵ0 = ℵ0, for example (though ℵ0 ≠ ℵ1).
Hope this helps!
Most modern computers use IEEE Floating Point to represent real numbers. These provide an approximation to real numbers, not the real thing. In particular there are two values that represent all infinite values, +infinity and -infinity. Just like you can't represent .1 or 1/3 totally accurately in binary, the infinities are approximations.
As such, all +infinities are equal to each other and all -infinities are equal to each other.

BigDecimal in 1.8 vs. 1.9

When upgrading to Ruby 1.9, I have a failing test when comparing expected vs. actual values for a BigDecimal that is the result of dividing a Float.
expected: '0.495E0',9(18)
got: '0.4950000000 0000005E0',18(27)
googling for things like "bigdecimal ruby precision" and "bigdecimal changes ruby 1.9" isn't getting me anywhere.
How did BigDecimal's behavior change in ruby 1.9?
update 1
> RUBY_VERSION
=> "1.8.7"
> 1.23.to_d
=> #<BigDecimal:1034630a8,'0.123E1',18(18)>
> RUBY_VERSION
=> "1.9.3"
> 1.23.to_d
=> #<BigDecimal:1029f3988,'0.123E1',18(45)>
What do 18(18) and 18(45) mean? Precision I imagine, but what is the notation/unit?
update 2
the code is running:
((10 - 0.1) * (5.0/100)).to_d
My test is expecting this to be equal (==) to:
0.495.to_f
This passed under 1.8, fails under 1.9.2 and 1.9.3
Equality comparisons rarely succeed on FP values
The short answer is that the Float#to_d is more accurate in 1.9 and is correctly failing the equality test that should not have succeeded in 1.8.7.
The long answer involves a basic rule of floating point programming: never do equality comparisons. Instead, fuzzy comparisons like if (abs(x-y) < epsilon) are recommended, or code is written to avoid the need for equality comparison altogether.
Although there are in theory about 2^32 single-precision numbers and 2^64 double-precision numbers that could be exactly compared, there are an infinite number that cannot be so compared. (Note: it is safe to do equality comparisons on FP values that happen to be integral. So, contrary to much advice, they are actually perfectly safe for loop indices and subscripts.)
Worse, the way we write fractional numbers makes it unlikely that a comparison with any specific constant will be successful.
That's because the fractions are binary, that is 1/2 + 1/4 + 1/8 ... but our constants are decimal. So, for example, consider monetary amounts in the range $1.00, $1.01, $1.02 .. $1.99. There are 100 values in this range and yet only 4 of them have exact FP representations: 1.00, 1.25, 1.50, and 1.75.
So, back to your problem. Your result of 0.495 has no exact representation and neither does the input constant of 0.1. You begin the calculation with a subtraction of two FP numbers with different magnitudes. The smaller number will be denormalized in order to accomplish the subtraction and so it will lose two or three low-order bits. As a result, the calculation will lead to a slightly larger number than 0.495, because the entire 0.1 was not subtracted from 10. Your constant is actually slightly smaller (internally) than 0.495. And that's why the comparison fails.
Ruby 1.8 must have been accidentally or deliberately losing some low order bits and effectively introducing a rounding step that ended up helping your test.
Remember: the rule of thumb is that you must explicitly program in such rounding for floating point comparisons.
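A minimal sketch of the fuzzy comparison recommended above, applied to the expression from the question. The helper name `nearly_equal?` and the tolerance 1e-9 are assumptions chosen for illustration; pick an epsilon that suits the scale of your own data.

```ruby
# Hypothetical helper: absolute-tolerance comparison, as in abs(x - y) < epsilon.
def nearly_equal?(x, y, epsilon = 1e-9)
  (x - y).abs < epsilon
end

actual   = (10 - 0.1) * (5.0 / 100)
expected = 0.495

# The question reports that the exact comparison fails on 1.9.
puts "exact:  #{actual == expected}"
puts "fuzzy:  #{nearly_equal?(actual, expected)}"

# A well-known case where exact equality fails but the fuzzy test passes.
puts "0.1 + 0.2 == 0.3:               #{0.1 + 0.2 == 0.3}"
puts "nearly_equal?(0.1 + 0.2, 0.3):  #{nearly_equal?(0.1 + 0.2, 0.3)}"
```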
Notes. To answer the question from the comments about simple decimal fraction constants not having exact representations: They don't have exact finite forms because they repeat in binary. Every machine fraction is a rational number of the form x/2^n. Now, the constants are decimal and every decimal constant is a rational number of the form x/(2^n * 5^m). The 5^m numbers are odd, so there isn't a 2^n factor for any of them. Only when m == 0 is there a finite representation in both the binary and decimal expansion of the fraction. So, 1.25 is exact because it's 5/(2^2 * 5^0) but 0.1 is not because it's 1/(2^1 * 5^1). There is simply no way to express 0.1 as a finite sum of x/2^n components.
See the Wikipedia article on floating point accuracy problems. It does a very good job of explaining why numbers like 0.1 and 0.01 cannot be represented exactly using floating point numbers.
The simple explanation is that these numbers, when represented in binary floating-point format, are recurring, just like one third is 0.3333333333... recurring in decimal.
Just as you can never represent one third exactly using a finite set of decimal digits, you cannot represent these numbers exactly using a finite set of binary digits.
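Ruby can show the exact rational value a Float actually stores, which makes the x/2^n argument concrete. A small sketch; `exact_cents` is an illustrative name.

```ruby
# Float#to_r returns the exact value held by the double, as a Rational.
puts "1.25 is stored as #{1.25.to_r}"
puts "0.1  is stored as #{0.1.to_r}"

# Which of the amounts 1.00, 1.01, ..., 1.99 are stored exactly?
exact_cents = (0..99).select do |c|
  (1 + c / 100.0).to_r == Rational(100 + c, 100)
end
puts "exactly representable cents: #{exact_cents.inspect}"
```

The denominator of 0.1.to_r is a power of two, so it cannot equal 1/10.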

Why is Math.sqrt(i*i).floor == i?

I am wondering if this is true: When I take the square root of a squared integer, like in
f = Math.sqrt(123*123)
I will get a floating point number very close to 123. Due to floating point representation precision, this could be something like 122.99999999999999999999 or 123.000000000000000000001.
Since floor(122.999999999999999999) is 122, I should get 122 instead of 123. So I expect that floor(sqrt(i*i)) == i-1 in about 50% of the cases. Strangely, for all the numbers I have tested, floor(sqrt(i*i)) == i. Here is a small ruby script to test the first 100 million numbers:
100_000_000.times do |i|
puts i if Math.sqrt(i*i).floor != i
end
The above script never prints anything. Why is that so?
UPDATE: Thanks for the quick reply, this seems to be the solution: According to Wikipedia
Any integer with absolute value less
than or equal to 2^24 can be exactly
represented in the single precision
format, and any integer with absolute
value less than or equal to 2^53 can
be exactly represented in the double
precision format.
Math.sqrt(i*i) starts to behave as I've expected it starting from i=9007199254740993, which is 2^53 + 1.
Here's the essence of your confusion:
Due to floating point representation
precision, this could be something
like 122.99999999999999999999 or
123.000000000000000000001.
This is false. It will always be exactly 123 on an IEEE-754 compliant system, which is almost all systems in these modern times. Floating-point arithmetic does not have "random error" or "noise". It has precise, deterministic rounding, and many simple computations (like this one) do not incur any rounding at all.
123 is exactly representable in floating-point, and so is 123*123 (so are all modest-sized integers). So no rounding error occurs when you convert 123*123 to a floating-point type. The result is exactly 15129.
Square root is a correctly rounded operation, per the IEEE-754 standard. This means that if there is an exact answer, the square root function is required to produce it. Since you are taking the square root of exactly 15129, which is exactly 123, that's exactly the result you get from the square root function. No rounding or approximation occurs.
Now, for how large of an integer will this be true?
Double precision can exactly represent all integers up to 2^53. So as long as i*i is less than 2^53, no rounding will occur in your computation, and the result will be exact for that reason. This means that for all i smaller than 94906265, we know the computation will be exact.
But you tried i larger than that! What's happening?
For the largest i that you tried, i*i is just barely larger than 2^53 (1.1102... * 2^53, actually). Because conversions from integer to double (or multiplication in double) are also correctly rounded operations, i*i will be the representable value closest to the exact square of i. In this case, since i*i is 54 bits wide, the rounding will happen in the very lowest bit. Thus we know that:
i*i as a double = the exact value of i*i + rounding
where rounding is either -1,0, or 1. If rounding is zero, then the square is exact, so the square root is exact, so we already know you get the right answer. Let's ignore that case.
So now we're looking at the square root of i*i +/- 1. Using a Taylor series expansion, the infinitely precise (unrounded) value of this square root is:
i * (1 +/- 1/(2i^2) + O(1/i^4))
Now this is a bit fiddly to see if you haven't done any floating point error analysis before, but if you use the fact that i^2 > 2^53, you can see that the:
1/(2i^2) + O(1/i^4)
term is smaller than 2^-54, which means that (since square root is correctly rounded, and hence its rounding error must be smaller than 2^-54), the rounded result of the sqrt function is exactly i.
It turns out that (with a similar analysis), for any exactly representable floating point number x, sqrt(x*x) is exactly x (assuming that the intermediate computation of x*x doesn't over- or underflow), so the only way you can encounter rounding for this type of computation is in the representation of x itself, which is why you see it starting at 2^53 + 1 (the smallest unrepresentable integer).
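The thresholds in this argument can be checked with a few lines of Ruby. This sketch only tests a small sample range and the boundary values mentioned above; the variable names are illustrative.

```ruby
limit = 2**53

# For a sample of small integers, sqrt(i*i) is exact, so floor gives i back.
mismatches = (0...100_000).reject { |i| Math.sqrt(i * i).floor == i }
puts "mismatches below 100_000: #{mismatches.size}"

# i*i stays below 2^53 (no rounding at all) up to i = 94906265.
below = 94_906_265**2 < limit
above = 94_906_266**2 > limit
puts "94906265^2 < 2^53: #{below}, 94906266^2 > 2^53: #{above}"

# 2^53 + 1 is the smallest positive integer a double cannot represent:
# converting it to Float lands on 2^53.
collapsed = (limit + 1).to_f == limit.to_f
puts "(2^53 + 1).to_f == (2^53).to_f: #{collapsed}"
```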
For "small" integers, there is usually an exact floating-point representation.
It's not too hard to find cases where this breaks down as you'd expect:
Math.sqrt(94949493295293425**2).floor
# => 94949493295293424
Math.sqrt(94949493295293426**2).floor
# => 94949493295293424
Math.sqrt(94949493295293427**2).floor
# => 94949493295293424
Ruby's Float is a double-precision floating point number, which means that it can accurately represent numbers with (rule of thumb) about 16 significant decimal digits. For regular single-precision floating point numbers it's about 7 significant digits.
You can find more information here:
What Every Computer Scientist Should Know About Floating-Point Arithmetic:
http://docs.sun.com/source/819-3693/ncg_goldberg.html

Best algorithm for avoiding loss of precision?

A recent homework assignment I have received asks us to take expressions which could create a loss of precision when performed in the computer, and alter them so that this loss is avoided.
Unfortunately, the directions for doing this haven't been made very clear. From watching various examples being performed, I know that there are certain methods of doing this: using Taylor series, using conjugates if square roots are involved, or finding a common denominator when two fractions are being subtracted.
However, I'm having some trouble noticing exactly when loss of precision is going to occur. So far the only thing I know for certain is that when you subtract two numbers that are close to being the same, loss of precision occurs since high order digits are significant, and you lose those from round off.
My question is what are some other common situations I should be looking for, and what are considered 'good' methods of approaching them?
For example, here is one problem:
f(x) = tan(x) − sin(x) when x ~ 0
What is the best and worst algorithm for evaluating this out of these three choices:
(a) (1/ cos(x) − 1) sin(x),
(b) (x^3)/2
(c) tan(x)*(sin(x)^2)/(cos(x) + 1).
I understand that when x is close to zero, tan(x) and sin(x) are nearly the same. I don't understand how or why any of these algorithms are better or worse for solving the problem.
Another rule of thumb usually used is this: When adding a long series of numbers, start adding from numbers closest to zero and end with the biggest numbers.
Explaining why this is good is a bit tricky. When you're adding small numbers to a large number, there is a chance they will be completely discarded because they are smaller than the lowest digit in the current mantissa of the large number. Take for instance this situation:
a = 1,000,000;
do 100,000,000 times:
a += 0.01;
If 0.01 is smaller than the lowest mantissa digit, then the loop does nothing and the end result is a == 1,000,000
But if you do it like this:
a = 0;
do 100,000,000 times:
a += 0.01;
a += 1,000,000;
Then the small numbers slowly grow and you're more likely to end up with something close to a == 2,000,000, which is the right answer.
This is of course an extreme example but I hope you get the idea.
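The same effect can be reproduced in Ruby, with one change: Ruby's Float is a double, so 0.01 is not lost against 1,000,000. The sketch below therefore uses different magnitudes (1.0e16 and 1.0), which are my own choice and not from the answer. It also uses reduce rather than Array#sum, since sum may apply compensated summation to floats and hide the effect.

```ruby
big    = 1.0e16
smalls = Array.new(1000, 1.0)

# Large number first: every + 1.0 falls below the spacing of doubles near 1e16.
large_first = smalls.reduce(big) { |acc, s| acc + s }

# Small numbers first: they accumulate to 1000.0 before meeting the large one.
small_first = smalls.reduce(0.0) { |acc, s| acc + s } + big

puts "large first: gained #{large_first - big}"
puts "small first: gained #{small_first - big}"
```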
I had to take a numerics class back when I was an undergrad, and it was thoroughly painful. Anyhow, IEEE 754 is the floating point standard typically implemented by modern CPUs. It's useful to understand the basics of it, as this gives you a lot of intuition about what not to do. The simplified explanation of it is that computers store floating point numbers in something like base-2 scientific notation with a fixed number of digits (bits) for the exponent and for the mantissa. This means that the larger the absolute value of a number, the less precisely it can be represented. For 32-bit floats in IEEE 754, half of the possible bit patterns represent between -1 and 1, even though numbers up to about 10^38 are representable with a 32-bit float. For values larger than 2^24 (approximately 16.7 million) a 32-bit float cannot represent all integers exactly.
What this means for you is that you generally want to avoid the following:
Having intermediate values be large when the final answer is expected to be small.
Adding/subtracting small numbers to/from large numbers. For example, if you wrote something like:
for(float index = 17000000; index < 17000002; index++) {}
This loop would never terminate because 17,000,000 + 1 is rounded back down to 17,000,000, so index never reaches the bound.
If you had something like:
float foo = 10000000.0f - 10000000.0001f;
The value for foo would be 0, not -0.0001, due to rounding error.
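Ruby has no 32-bit float type, but the rounding described above can be imitated by packing a value into single precision and unpacking it again. `to_f32` is a hypothetical helper written for this sketch, and unpack1 assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: round a Ruby Float (double) to the nearest 32-bit float.
def to_f32(x)
  [x].pack("f").unpack1("f")
end

# Above 2^24, single precision can no longer hold every integer.
puts "2^24 + 1      -> #{to_f32(16_777_217.0)}"
puts "17_000_000 + 1 -> #{to_f32(17_000_000.0 + 1.0)}"

# The tiny fraction is lost before the subtraction even happens.
foo = to_f32(10_000_000.0) - to_f32(10_000_000.0001)
puts "foo = #{foo}"
```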
My question is what are some other
common situations I should be looking
for, and what are considered 'good'
methods of approaching them?
There are several ways you can have severe or even catastrophic loss of precision.
The most important reason is that floating-point numbers have a limited number of digits, e.g. doubles have 53 bits. That means if you have "useless" digits which are not part of the solution but must be stored, you lose precision.
For example (We are using decimal types for demonstration):
2.598765000000000000000000000100 -
2.598765000000000000000000000099
The interesting part is the 100-99 = 1 answer. As 2.598765 is equal in both cases, it
does not change the result, but wastes 8 digits. Much worse, because the computer doesn't
know that the digits are useless, it is forced to store them and crams 21 zeroes after them,
wasting 29 digits in all. Unfortunately there is no way to circumvent this for differences,
but there are other cases, e.g. exp(x)-1, which is a function occurring very often in physics.
The exp function near 0 is almost linear, but it enforces a 1 as leading digit. So with 12
significant digits
exp(0.001)-1 = 1.00100050017 - 1 = 1.00050017e-3
If we instead use a function expm1() based on the Taylor series:
1 + x + x^2/2 + x^3/6 + ... - 1 =
x + x^2/2 + x^3/6 + ... =: expm1(x)
expm1(0.001) = 1.00050016667e-3
Much better.
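A rough sketch of the series approach in Ruby. `expm1_taylor` is a hypothetical helper written for this illustration, not a library function, and the fixed term count of 10 is an assumption that is only adequate for small |x|.

```ruby
# Hypothetical helper: x + x^2/2! + x^3/3! + ... truncated after `terms` terms.
def expm1_taylor(x, terms = 10)
  term = x.to_f   # current term, x^n / n!
  sum  = 0.0
  (1..terms).each do |n|
    sum  += term
    term *= x / (n + 1)
  end
  sum
end

x         = 1.0e-10
naive     = Math.exp(x) - 1      # the leading 1 eats most of the digits
series    = expm1_taylor(x)
reference = x + x * x / 2        # higher-order terms are negligible here

puts "naive  error: #{(naive - reference).abs}"
puts "series error: #{(series - reference).abs}"
puts "expm1_taylor(0.001) = #{expm1_taylor(0.001)}"
```

The smaller x gets, the more digits the naive form throws away, while the series keeps full precision.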
The second problem is functions with a very steep slope, like the tangent of x near pi/2.
tan(11) has a slope of about 50000, which means that any small deviation caused by earlier rounding errors
will be amplified by a factor of 50000! Or you have singularities, e.g. if the result approaches 0/0, which means it can have any value.
In both cases you create a substitute function, simplifying the original function. It is of no use to highlight the different solution approaches because without training you will simply not "see" the problem in the first place.
A very good book to learn and train: Forman S. Acton: Real Computing made real
Another thing to avoid is subtracting numbers that are nearly equal, as this can also lead to increased sensitivity to roundoff error. For values near 0, cos(x) will be close to 1, so 1/cos(x) - 1 is one of those subtractions that you'd like to avoid if possible, so I would say that (a) should be avoided.
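One way to build intuition for the three choices in the question is simply to evaluate them for a small x and compare. This is only a sketch; x = 1.0e-8 is an arbitrary value chosen for illustration, and the exact digits printed for the cancelling forms may vary with the platform's math library.

```ruby
x = 1.0e-8

naive  = Math.tan(x) - Math.sin(x)                         # direct subtraction
form_a = (1 / Math.cos(x) - 1) * Math.sin(x)               # still subtracts near-equal values
form_b = x**3 / 2                                          # leading Taylor term
form_c = Math.tan(x) * Math.sin(x)**2 / (Math.cos(x) + 1)  # no cancelling subtraction

puts "naive:  #{naive}"
puts "(a):    #{form_a}"
puts "(b):    #{form_b}"
puts "(c):    #{form_c}"
```

Forms (b) and (c) should agree closely, since neither subtracts nearly equal quantities; the direct subtraction and (a) are expected to lose most or all of their significant digits at this x.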
