I know that division by zero and the square root of a negative real number produce NaN. Are there any other similar problems?
I will refer to the wikipedia entry on NaN and to Fortran Standard to try to enumerate them.
There are three kinds of operations that can return NaN:[5]
Operations with a NaN as at least one operand.
In Fortran that would include the application of arithmetic and comparison operators, plus math intrinsic functions.
Indeterminate forms:
The divisions (±0) / (±0) and (±∞) / (±∞).
The multiplications (±0) × (±∞) and (±∞) × (±0).
The additions (+∞) + (−∞), (−∞) + (+∞) and equivalent subtractions (+∞) − (+∞) and (−∞) − (−∞).
The standard has alternative functions for powers:
The standard pow function and the integer exponent pown function define 0^0, 1^∞, and ∞^0 as 1.
The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
So all arithmetic operators are included (and also the atomic operation functions). All of this was pretty obvious; the fun is next:
Real operations with complex results, for example:
The square root of a negative number.
The logarithm of a negative number.
The inverse sine or cosine of a number that is less than −1 or greater than 1.
That would mean (as said by @kvantour in the comments) any intrinsic function called outside of its domain: SQRT, LOG, ATAN, ATAN2, ACOS, ACOSH, ASIN, ASINH, FRACTION, RRSPACING, SET_EXPONENT, SPACING.
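The indeterminate forms listed above are IEEE 754 behaviour rather than anything specific to Fortran, so they can be checked in any language that uses IEEE doubles. Here is a minimal sketch in Ruby, chosen only for brevity; the helper name indeterminate_forms is made up for this illustration.

```ruby
# Each entry pairs a label with an IEEE 754 double-precision expression
# from the list of indeterminate forms.
def indeterminate_forms
  inf = Float::INFINITY
  {
    "0/0"     => 0.0 / 0.0,
    "inf/inf" => inf / inf,
    "0*inf"   => 0.0 * inf,
    "inf-inf" => inf - inf,
    "nan+1"   => Float::NAN + 1.0 # a NaN operand propagates
  }
end

indeterminate_forms.each do |label, value|
  puts "#{label}: nan? #{value.nan?}"
end

# A nonzero number divided by zero gives an infinity, not NaN.
puts "1/0 infinite? #{!(1.0 / 0.0).infinite?.nil?}"

# Float#** follows the pow-style convention for these forms.
puts "0**0 = #{0.0**0.0}"
```

Note that Ruby's Math.sqrt and Math.log raise Math::DomainError for negative arguments instead of returning NaN, so the third category (real operations with complex results) cannot be shown this way.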
I am trying to compute the floating-point square root of x in assembly code, using the Newton-Raphson method to first find the inverse square root (1/sqrt(x)) and then multiplying by x to get sqrt(x).
However, I was reading the wikipedia page regarding newton-raphson division and it appears that depending on how you compute X_{i+1}, you will need a different amount of precision in intermediate steps.
From Wikipedia:
From a computation point of view the expressions X_{i+1} = X_i +
X_i(1-DX_i) and X_{i+1} = X_i(2-DX_i) are not equivalent. To obtain a
result with a precision of n bits while making use of the second
expression one must compute the product between X_i and (2-DX_i) with
double the required precision (2n bits). In contrast the product
between X_i and (1-DX_i) need only be computed with a precision of n
bits.
So, I have two questions:
I don't understand why one must compute the product between X_i and (2-DX_i) with double the required precision (2n bits) to obtain a result with a precision of n bits. Can someone please explain why?
Is there something similar that applies with Newton-Raphson Square Root? For instance, I am computing X_{i+1} = X_{i}(3/2 - 1/2 N X_{i}^2) but this can also be computed as X_{i} + X_{i}(1/2 - 1/2 N X_{i}^2). Does one expression require more intermediate precision, just like Newton-Raphson division does? Is there a different format I should be using to require only n bits of precision to obtain a result with n bits of precision?
I am looking for a result with an error <= 1 ulp
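For reference, the two algebraically equivalent update rules from the question can be written as the sketch below. It runs in ordinary double precision, so it only shows that both forms converge to 1/sqrt(N); it does not reproduce the fixed-width intermediate precision issue being asked about. The function name rsqrt_newton, the starting guess, and the iteration count are assumptions for illustration.

```ruby
# Newton-Raphson iteration for 1/sqrt(n), in the two forms from the question.
#   :product    -> x * (3/2 - 1/2 * n * x^2)
#   :correction -> x + x * (1/2 - 1/2 * n * x^2)
def rsqrt_newton(n, x0, iterations: 6, form: :product)
  x = x0
  iterations.times do
    x = if form == :product
          x * (1.5 - 0.5 * n * x * x)
        else
          x + x * (0.5 - 0.5 * n * x * x)
        end
  end
  x
end

n = 2.0
[:product, :correction].each do |form|
  inv_root = rsqrt_newton(n, 0.7, form: form)
  puts "#{form}: sqrt(#{n}) ~ #{n * inv_root}"
end
```

In the correction form the term (1/2 - 1/2 N X_i^2) is small near convergence, which is the property the Wikipedia passage relies on for the division case.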
I am looking through the documentation on divmod. Part of a table showing the difference between the methods div, divmod, modulo, and remainder is displayed below:
Why is 13.div(-4) rounded to -4 and not to -3? Is there any rule or convention in Ruby to round down negative numbers? If so, why is the following code not rounding down?
-3.25.round() #3
13.div(-4) == -4 and 13.modulo(-4) == -3 so that
(-4 * -4) + -3 == 13
and you get the consistent relationship
(b * (a/b)) + a.modulo(b) == a
Why is 13.div(-4) rounded to -4 and not to -3?
This is a misconception. 13.div(-4) is not really rounded at all. It is integer division, and follows self-consistent rules for working with integers and modular arithmetic. The rounding logic described in your link fits with it, and is then applied consistently to the same divmod operation when one or both of the parameters are Floats. Mathematical operations on negative or fractional numbers are often extended from simpler, more intuitive results on positive integers in this kind of way. E.g. this follows similar logic to how fractional and negative powers, or non-integer factorials, are created from their positive integer variants.
In this case, it's all about self-consistency of divmod, but not about rounding in general.
Ruby's designers had a choice to make when dealing with negative numbers; not all languages give the same result. However, once it was decided that Ruby would return a modulo result whose sign matches the divisor (as opposed to matching the dividend), that determined how the rest of the numbers work.
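The consistency rule can be checked directly. This is a small sketch using only Integer#div, #modulo, #divmod and #remainder; the helper name division_identity_holds? is made up for the example.

```ruby
# Integer division in Ruby rounds the quotient toward negative infinity,
# and modulo takes the sign of the divisor.
a = 13
b = -4
quotient, modulus = a.divmod(b)
puts "div: #{a.div(b)}, modulo: #{a.modulo(b)}, divmod: #{[quotient, modulus].inspect}"

# remainder takes the sign of the receiver instead, which pairs with a
# quotient truncated toward zero.
puts "remainder: #{a.remainder(b)}, truncated quotient: #{(a.to_f / b).truncate}"

# The identity (b * (a / b)) + a.modulo(b) == a from the answer above.
def division_identity_holds?(a, b)
  b * a.div(b) + a.modulo(b) == a
end

[[13, 4], [13, -4], [-13, 4], [-13, -4]].each do |x, y|
  puts "#{x}, #{y}: identity holds? #{division_identity_holds?(x, y)}"
end
```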
Is there any rule or convention in Ruby to round down negative numbers?
Yes. Rounding a float number means returning the numerically closest integer. When there are two equally close integers, Ruby rounds to the integer furthest from 0. This is an entirely separate design decision from how the integer division and modulo arithmetic methods work.
If so, why is the following code not rounding down? -3.25.round() #3
I assume you mean the result to read -3. The round method does not "round down". It does "round closest". -3 is the closest integer to -3.25. Ruby's designers did have to make a choice though, what to do with -3.5.round() # -4. Some languages would instead return a -3 when rounding that number.
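To make the difference between "round closest" and "round down" concrete, here is a short sketch comparing Float#round with #floor, #ceil and #truncate. The helper name rounding_modes is made up for the example, and the half: option assumes Ruby 2.4 or later.

```ruby
# Collects the results of Ruby's four float-to-integer conversions.
def rounding_modes(x)
  { round: x.round, floor: x.floor, ceil: x.ceil, truncate: x.truncate }
end

puts rounding_modes(-3.25).inspect
puts rounding_modes(-3.5).inspect

# Ties go away from zero by default; half: :even selects the other common
# convention that some languages use.
puts 2.5.round
puts 2.5.round(half: :even)
```

Only floor matches what 13.div(-4) does with the exact quotient -3.25; round picks the nearest integer.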
I want to implement a power function in VHDL where the exponent is a floating-point number and the base is an integer (it will always be 2):
2^(some floating-point number).
I use the ieee library (fixed_float_types.all, fixed_pkg.all, and float_pkg.all).
I thought of calculating all the possible outputs and saving them in a ROM, but I don't know the range of the exponent.
How can I implement this function, and if an implementation already exists, where can I find it?
thanks
For simulation, you will find suitable power functions in the IEEE.math_real library
library IEEE;
use IEEE.math_real.all;
...
X <= 2 ** Y;
or
X <= 2.0 ** Y;
This is probably not synthesisable. If I needed a similar operation for synthesis, I would use a lookup table of values, slopes and second derivatives, and a quadratic interpolator. I have used this approach for reciprocal and square root functions to single precision accuracy; 2**n over a reasonable range of n is smooth enough that the same approach should work.
If an approximation would do, I think I would use the integer part of the exponent to determine the integer power of 2. For example, if the exponent is 111.011010111 in binary, the integer part is 7, so the integer power of 2 is 0b10000000. Then I would do a left-to-right conditional add based on the fractional bits: for 111.011010111 you add 0b10000000 times (0*(1/2) + 1*(1/4) + 1*(1/8) + 0*(1/16) + ... and so on) to that value. The terms 1/2, 1/4, 1/8, et cetera are right shifts of 0b10000000. This computes the integer part of the exponentiation exactly, and then approximates the contribution of the fractional part as a multiple of the integer part.
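As a software model of that shift-and-add idea (not VHDL, and only one reading of the description above), the sketch below approximates 2**x as 2**floor(x) * (1 + frac(x)). The name approx_pow2 and the default of 9 fractional bits are assumptions for illustration.

```ruby
# Approximates 2**x by starting from the integer power of two and
# conditionally adding right-shifted copies of it, one per fractional bit.
def approx_pow2(x, frac_bits: 9)
  int_part = x.floor
  base = 2.0**int_part   # exact integer power of two
  frac = x - int_part    # fractional part of the exponent, in [0, 1)
  result = base
  term = base
  frac_bits.times do
    term /= 2.0          # models a one-bit right shift of the base
    frac *= 2.0          # moves the next fractional bit above the point
    if frac >= 1.0
      result += term     # conditional add for a set bit
      frac -= 1.0
    end
  end
  result
end

x = 7.419921875          # 111.011010111 in binary
puts "approx: #{approx_pow2(x)}, exact: #{2.0**x}"
```

Since 1 + f overestimates 2**f by at most about 6 percent on [0, 1), this is only suitable when a coarse result is acceptable.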
Put as simply as possible: 0.1 in binary is 0.5 in decimal, and raising a number to the power 0.5 is the same as taking its square root.
I've been working on floating point numbers, and it took about 4-5 hours to figure out how to implement the power function in the simplest synthesizable way. Just use repeated square roots: for b"0.01" you take a double square root, sqrt(sqrt(x)), and for b"0.11" you multiply the square root by the double square root, sqrt(x)*sqrt(sqrt(x)), and so on.
This gives a synthesizable implementation of the pow function.
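The repeated-square-root scheme can be modelled in software as follows. This is a sketch of the arithmetic only, not synthesizable code; the name pow_by_sqrt and the 40-bit limit on the fractional part are assumptions, and x is assumed to be positive.

```ruby
# Computes x**exponent by multiplying together x**(1/2), x**(1/4), ...
# for every set bit in the binary fraction of the exponent.
def pow_by_sqrt(x, exponent, frac_bits: 40)
  int_part = exponent.floor
  frac = exponent - int_part
  result = x.to_f**int_part    # integer part of the exponent
  root = x.to_f
  frac_bits.times do
    root = Math.sqrt(root)     # next root: x**(1/2), x**(1/4), ...
    frac *= 2.0                # bring the next fractional bit up
    if frac >= 1.0
      result *= root           # this bit is set, so include the root
      frac -= 1.0
    end
  end
  result
end

puts pow_by_sqrt(2.0, 0.75)    # b"0.11": sqrt(x) * sqrt(sqrt(x))
puts 2.0**0.75
```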
Suppose we have some arbitrary positive number x.
Its inverse is 1/x. Is there a method to express that in binary?
e.g. x=5 //101
x's inverse is 1/x; its binary form is ...?
You'd find it the same way you would in decimal form: long division.
There is no shortcut just because you are in another base, although long division is significantly simpler.
Here is a very nice explanation of long division applied to binary numbers.
Although, just to let you know, most floating-point systems on today's machines do very fast division for you.
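A minimal sketch of binary long division for a fraction, using only integer arithmetic. The function name binary_fraction_bits is made up for this example.

```ruby
# Returns the first bit_count binary digits after the point of
# numerator / denominator, produced by long division in base 2.
def binary_fraction_bits(numerator, denominator, bit_count)
  remainder = numerator % denominator
  bits = +""
  bit_count.times do
    remainder *= 2                             # shift in the next binary place
    bits << (remainder / denominator).to_s     # quotient digit is 0 or 1
    remainder %= denominator
  end
  bits
end

puts "1/5 = 0.#{binary_fraction_bits(1, 5, 12)}b"
puts "1/8 = 0.#{binary_fraction_bits(1, 8, 4)}b"
```

For x = 5 the digits repeat with period 4 (0011), so 1/5 has no finite binary representation.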
In general, the only practical way to "express in binary" an arbitrary fraction is as a pair of integers, numerator and denominator -- "floating point", the most commonly used (and hardware supported) binary representation of non-integer numbers, can represent exactly only those fractions whose denominator (when the fraction is reduced to the minimum terms) is a power of two (and, of course, only when the fixed number of bits allotted to the representation is sufficient for the number we'd like to represent -- but, the latter limitation will also hold for any fixed-size binary representation, including the simplest ones such as integers).
0.125 = 0.001b
0.0625 = 0.0001b
0.0078125 = 0.0000001b
0.00390625 = 0.00000001b
0.00048828125 = 0.00000000001b
0.000244140625 = 0.000000000001b
----------------------------------
0.199951171875 = 0.001100110011b
Knock yourself out if you want higher accuracy/precision.
Another form of multiplicative inverse takes advantage of the modulo nature of integer arithmetic as implemented on most computers; in your case the 32 bit value
11001100110011001100110011001101 (-858993459 signed int32 or 3435973837 unsigned int32) when multiplied by 5 equals 1 (mod 4294967296). Only values which are coprime with the power of two the modulo operates on have such multiplicative inverses.
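The modular inverse quoted above can be verified, and recomputed, with a few lines of integer arithmetic. The sketch below uses a Newton-style iteration that doubles the number of correct low bits per step; that is one common way to find such inverses, not necessarily how the value above was obtained, and the name inverse_mod_pow2 is made up for the example.

```ruby
# Multiplicative inverse of an odd integer x modulo 2**bits.
def inverse_mod_pow2(x, bits)
  raise ArgumentError, "x must be odd" if x.even?

  modulus = 1 << bits
  inv = x % modulus      # any odd x satisfies x*x == 1 (mod 8): 3 correct bits
  correct_bits = 3
  while correct_bits < bits
    inv = (inv * (2 - x * inv)) % modulus   # doubles the correct low bits
    correct_bits *= 2
  end
  inv
end

inv = inverse_mod_pow2(5, 32)
puts inv.to_s(2)
puts "unsigned: #{inv}, signed: #{inv - (1 << 32)}"
puts "5 * inv mod 2**32 = #{(5 * inv) % (1 << 32)}"
```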
If you just need the first few bits of the binary fraction, this trick will give you those bits: (2 << 31) / x. But don't use this trick in any real software project, because it is a rough, inaccurate, and plainly wrong way to represent the value.
I am wondering if this is true: When I take the square root of a squared integer, like in
f = Math.sqrt(123*123)
I will get a floating point number very close to 123. Due to floating point representation precision, this could be something like 122.99999999999999999999 or 123.000000000000000000001.
Since floor(122.999999999999999999) is 122, I should get 122 instead of 123. So I expect that floor(sqrt(i*i)) == i-1 in about 50% of the cases. Strangely, for all the numbers I have tested, floor(sqrt(i*i)) == i. Here is a small ruby script to test the first 100 million numbers:
100_000_000.times do |i|
puts i if Math.sqrt(i*i).floor != i
end
The above script never prints anything. Why is that so?
UPDATE: Thanks for the quick reply, this seems to be the solution: According to wikipedia
Any integer with absolute value less
than or equal to 2^24 can be exactly
represented in the single precision
format, and any integer with absolute
value less than or equal to 2^53 can
be exactly represented in the double
precision format.
Math.sqrt(i*i) starts to behave as I've expected it starting from i=9007199254740993, which is 2^53 + 1.
Here's the essence of your confusion:
Due to floating point representation
precision, this could be something
like 122.99999999999999999999 or
123.000000000000000000001.
This is false. It will always be exactly 123 on an IEEE-754 compliant system, which is almost all systems in these modern times. Floating-point arithmetic does not have "random error" or "noise". It has precise, deterministic rounding, and many simple computations (like this one) do not incur any rounding at all.
123 is exactly representable in floating-point, and so is 123*123 (so are all modest-sized integers). So no rounding error occurs when you convert 123*123 to a floating-point type. The result is exactly 15129.
Square root is a correctly rounded operation, per the IEEE-754 standard. This means that if there is an exact answer, the square root function is required to produce it. Since you are taking the square root of exactly 15129, which is exactly 123, that's exactly the result you get from the square root function. No rounding or approximation occurs.
Now, for how large of an integer will this be true?
Double precision can exactly represent all integers up to 2^53. So as long as i*i is less than 2^53, no rounding will occur in your computation, and the result will be exact for that reason. This means that for all i smaller than 94906265, we know the computation will be exact.
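The exactness claim for i*i below 2^53 can be spot-checked without looping over 100 million values. This is a small sketch; the helper name sqrt_of_square_exact? and the sample values are chosen for illustration, and it assumes a 64-bit Ruby whose Math.sqrt wraps a correctly rounded hardware or libm square root.

```ruby
# True when the square root of i*i, computed in double precision,
# compares equal to the integer i.
def sqrt_of_square_exact?(i)
  Math.sqrt(i * i) == i
end

samples = [0, 1, 123, 65_536, 94_906_265]
samples.each do |i|
  puts "#{i}: i*i < 2**53? #{i * i < 2**53}, exact? #{sqrt_of_square_exact?(i)}"
end

# 2**53 + 1 is the first integer that a double cannot hold; converting it
# rounds to the neighbouring representable value.
puts (2**53 + 1).to_f == 2**53
```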
But you tried i larger than that! What's happening?
For the largest i that you tried, i*i is just barely larger than 2^53 (1.1102... * 2^53, actually). Because conversions from integer to double (or multiplication in double) are also correctly rounded operations, i*i will be the representable value closest to the exact square of i. In this case, since i*i is 54 bits wide, the rounding will happen in the very lowest bit. Thus we know that:
i*i as a double = the exact value of i*i + rounding
where rounding is either -1, 0, or 1. If rounding is zero, then the square is exact, so the square root is exact, so we already know you get the right answer. Let's ignore that case.
So now we're looking at the square root of i*i +/- 1. Using a Taylor series expansion, the infinitely precise (unrounded) value of this square root is:
i * (1 +/- 1/(2i^2) + O(1/i^4))
Now this is a bit fiddly to see if you haven't done any floating point error analysis before, but if you use the fact that i^2 > 2^53, you can see that the:
1/(2i^2) + O(1/i^4)
term is smaller than 2^-54, which means that (since square root is correctly rounded, and hence its rounding error must be smaller than 2^-54), the rounded result of the sqrt function is exactly i.
It turns out that (with a similar analysis), for any exactly representable floating point number x, sqrt(x*x) is exactly x (assuming that the intermediate computation of x*x doesn't over- or underflow), so the only way you can encounter rounding for this type of computation is in the representation of x itself, which is why you see it starting at 2^53 + 1 (the smallest unrepresentable integer).
For "small" integers, there is usually an exact floating-point representation.
It's not too hard to find cases where this breaks down as you'd expect:
Math.sqrt(94949493295293425**2).floor
# => 94949493295293424
Math.sqrt(94949493295293426**2).floor
# => 94949493295293424
Math.sqrt(94949493295293427**2).floor
# => 94949493295293424
Ruby's Float is a double-precision floating point number, which means that it can accurately represent numbers with (rule of thumb) about 16 significant decimal digits. For regular single-precision floating point numbers it's about 7 significant digits.
You can find more information here:
What Every Computer Scientist Should Know About Floating-Point Arithmetic:
http://docs.sun.com/source/819-3693/ncg_goldberg.html