IEEE-754 single precision floating point (MIPS)

How do I write -2^-132 in hexadecimal format in IEEE-754 single precision?

See the Wikipedia article on the single-precision floating-point format for how values are represented in IEEE-754.
Note: the result is subnormal (denormal), so the exponent field is zero!
Since -132 is below the minimum normal exponent of -126, the number is encoded as a subnormal: sign bit 1, exponent field all zeros, and a fraction field representing 2^-132 = 2^-6 x 2^-126, i.e. only fraction bit 17 (the 2^-6 place) set. Putting the fields together, the answer is:
0x80020000 == 10000000000000100000000000000000b
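If you want to double-check a pattern like this, most languages let you reinterpret the bits directly. Here is a minimal sketch in Java (the class name is just for illustration):

public class DenormalCheck {
    public static void main(String[] args) {
        // Reinterpret the 32-bit pattern as an IEEE-754 single.
        float f = Float.intBitsToFloat(0x80020000);
        // 0x1p-132f is a hexadecimal floating-point literal for 2^-132.
        System.out.println(f == -0x1p-132f); // true
    }
}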


Mathematica precision differs from other calculators

If I evaluate the following input in Mathematica 12:
SetPrecision[DecimalForm[123.432654/54.1122356, 130], 130]
The result is:
2.2810488724291406725797060062177479267120361328125000000000000000000000000000000000000000000000000000000000000000000000000000000000
When I run the same calculation in other calculators, the results agree with the Mathematica result up to the 15th digit: 2.281048872429140. From the 16th digit on, however, the other calculators agree with one another whereas Mathematica shows a different result:
Windows Calculator:
2,281048872429140591633586101550
https://keisan.casio.com/calculator:
2.281048872429140591633586101550[.....]
https://www.mathsisfun.com/calculator-precision.html:
2.281048872429140591633586101550[.....]
Mathematica:
2.281048872429140672579706006217[.....].
Why is (only) Mathematica ending up with a different result?
Can Mathematica somehow end up with the same result as the other calculators (supposing that these unanimous results are the correct ones)?
Mathematica's model of approximate decimal numbers is different from almost everyone else's model of approximate decimal numbers.
Because of the number of digits you supplied for each of 123.432654 and 54.1122356, these are assumed to be, and treated as, MachinePrecision numbers. That means they have the usual "about 16 digits" of precision as supplied by the CPU floating-point hardware in your computer, though it is a little more complicated than that.
Because of precedence rules, Mathematica first evaluated each of those numbers and converted them to the internal floating-point form, with the limited accuracy and all the problems that brings, and all the speed of being able to perform calculations in hardware instead of software.
Then it did the division using the internal floating point hardware which resulted in another MachinePrecision number with only about 16 digits of precision.
Then with DecimalForm you asked Mathematica to extrapolate that result, which has only about 16 good digits, into a 130-digit display.
All of the *Form functions (or almost all; there are some very subtle things in dark corners) are intended only to produce something that can be displayed, not something to be used in further calculations. For example, new users routinely do m = MatrixForm[mymatrix] to see a pretty formatting of the matrix and then proceed to try to do calculations with m, which fails.
Then you asked Mathematica to perform the SetPrecision function on that display to try to turn it into a 130-digit-precision number. I can't even guess what that really did internally.
It seems those other calculators assume that the precision of the entered numbers is infinite. WL does not. You can specify what precision the entered numbers have, e.g.
123.432654`30/54.1122356`30
2.28104887242914059163358610155
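You can observe the same effect outside Mathematica. As a sketch in Java (class name made up): the constructor new BigDecimal(double) expands the stored binary64 value exactly, so its digits should match Mathematica's 130-digit display rather than the mathematically exact quotient.

import java.math.BigDecimal;

public class MachineDivision {
    public static void main(String[] args) {
        // The division happens in hardware binary64 arithmetic, which is
        // what MachinePrecision input gives you in Mathematica as well.
        double q = 123.432654 / 54.1122356;
        // BigDecimal(double) converts the stored bits exactly, with no
        // further rounding, so this should print
        // 2.2810488724291406725797060062177479267120361328125
        System.out.println(new BigDecimal(q));
    }
}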

IEEE-754 Standard

I have a very basic question about the IEEE-754 standard, in which numbers are encoded and saved on the computer.
At university (in exams) I have come across the following definition of the 16-bit IEEE-754 format (half precision): 1 sign bit, 6 exponent bits & 9 mantissa bits.
An internet search (or books) reveals another definition:
1 sign bit, 5 exponent bits & 10 mantissa bits
The reason why I’m asking is that I cannot believe the uni might have made such a simple mistake, so are there multiple definitions for numbers given in 16-bit IEEE-754 format?
Conforming to an IEEE standard is voluntary. People are free to use other formats. The IEEE-754 standard specifies a binary16 format that uses 1 bit for the sign, 5 bits for the exponent, and 10 bits for the primary significand encoding.
People may use other formats because they want more or less precision in the significand or range in the exponent.
Textbooks and academic exercises often use non-standard formats for the purpose of inducing students to reason about them on their own rather than looking up answers or learning existing formats by rote.
If the hardware you are using supports a 16-bit floating-point format, the binding specification for that format is in the hardware documentation, not in the IEEE-754 standard.
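To see concretely how the two layouts disagree, you can decode the same 16-bit pattern under each. Below is a sketch in Java; the helper is made up for illustration, and it assumes the usual IEEE-754 conventions (bias 2^(expBits-1) - 1, implied leading bit, subnormals, Inf/NaN), which an exam format may or may not share.

public class TinyFloat {
    // Decode a 16-bit pattern laid out as: 1 sign bit, expBits exponent
    // bits, fracBits fraction bits (expBits + fracBits == 15).
    static double decode(int bits, int expBits, int fracBits) {
        int sign = (bits >> (expBits + fracBits)) & 1;
        int exp  = (bits >> fracBits) & ((1 << expBits) - 1);
        int frac = bits & ((1 << fracBits) - 1);
        int bias = (1 << (expBits - 1)) - 1;
        double mag;
        if (exp == (1 << expBits) - 1) {         // all-ones exponent: Inf or NaN
            mag = (frac == 0) ? Double.POSITIVE_INFINITY : Double.NaN;
        } else if (exp == 0) {                   // subnormal: no implied bit
            mag = Math.scalb((double) frac, 1 - bias - fracBits);
        } else {                                 // normal: implied leading 1
            mag = Math.scalb(1.0 + (double) frac / (1 << fracBits), exp - bias);
        }
        return sign == 0 ? mag : -mag;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x3C00, 5, 10)); // standard binary16: 1.0
        System.out.println(decode(0x3C00, 6, 9));  // 1/6/9 layout: 0.5
    }
}

The very same bits 0x3C00 decode to 1.0 under the standard 1/5/10 split but to 0.5 under the exam's 1/6/9 split, so the two definitions are genuinely incompatible.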

JDBC / Oracle Double value insertion fails [duplicate]

double r = 11.631;
double theta = 21.4;
In the debugger, these are shown as 11.631000000000000 and 21.399999618530273.
How can I avoid this?
These accuracy problems are due to the internal representation of floating point numbers and there's not much you can do to avoid it.
By the way, printing these values at run-time often still leads to the correct results, at least using modern C++ compilers. For most operations, this isn't much of an issue.
I liked Joel's explanation, which deals with a similar binary floating point precision issue in Excel 2007:
See how there's a lot of 0110 0110 0110 there at the end? That's because 0.1 has no exact representation in binary... it's a repeating binary number. It's sort of like how 1/3 has no representation in decimal. 1/3 is 0.33333333 and you have to keep writing 3's forever. If you lose patience, you get something inexact.
So you can imagine how, in decimal, if you tried to do 3*1/3, and you didn't have time to write 3's forever, the result you would get would be 0.99999999, not 1, and people would get angry with you for being wrong.
If you have a value like:
double theta = 21.4;
And you want to do:
if (theta == 21.4)
{
}
You have to be a bit clever: you need to check whether the value of theta is really close to 21.4, not whether it is exactly that value.
if (fabs(theta - 21.4) <= 1e-6)
{
}
This is partly platform-specific - and we don't know what platform you're using.
It's also partly a case of knowing what you actually want to see. The debugger is showing you - to some extent, anyway - the precise value stored in your variable. In my article on binary floating point numbers in .NET, there's a C# class which lets you see the absolutely exact number stored in a double. The online version isn't working at the moment - I'll try to put one up on another site.
Given that the debugger sees the "actual" value, it's got to make a judgement call about what to display - it could show you the value rounded to a few decimal places, or a more precise value. Some debuggers do a better job than others at reading developers' minds, but it's a fundamental problem with binary floating point numbers.
Use the fixed-point decimal type if you want stability at the limits of precision. There are overheads, and you must explicitly cast if you wish to convert to floating point. If you do convert to floating point you will reintroduce the instabilities that seem to bother you.
Alternately you can get over it and learn to work with the limited precision of floating point arithmetic. For example you can use rounding to make values converge, or you can use epsilon comparisons to describe a tolerance. "Epsilon" is a constant you set up that defines a tolerance. For example, you may choose to regard two values as being equal if they are within 0.0001 of each other.
It occurs to me that you could use operator overloading to make epsilon comparisons transparent. That would be very cool.
For mantissa-exponent representations, epsilon must be scaled to the number being compared so that it stays within the representable precision. For a number N: Epsilon = N / 10E+14
System.Double.Epsilon is the smallest representable positive value for the Double type. It is too small for our purpose. Read Microsoft's advice on equality testing
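A sketch of that scaled-epsilon idea in Java (the method name and the 1e-14 factor are illustrative choices, not a standard API):

public class ApproxEquals {
    // Relative tolerance: the allowed error scales with the magnitude
    // of the operands instead of being a fixed absolute constant.
    static boolean nearlyEqual(double a, double b) {
        double tolerance = Math.max(Math.abs(a), Math.abs(b)) * 1e-14;
        return Math.abs(a - b) <= tolerance;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);            // false: exact compare fails
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3)); // true: within tolerance
    }
}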
I've come across this before (on my blog) - I think the surprise tends to be that the 'irrational' numbers are different.
By 'irrational' here I'm just referring to the fact that they can't be accurately represented in this format. Real irrational numbers (like π - pi) can't be accurately represented at all.
Most people are familiar with 1/3 not working in decimal: 0.3333333333333...
The odd thing is that 1.1 doesn't work in floats. People expect decimal values to work in floating point numbers because of how they think of them:
1.1 is 11 x 10^-1
When actually they're stored in base 2:
1.1 is 2476979795053773 x 2^-51
(the nearest double to 1.1; there is no exact binary representation)
You can't avoid it, you just have to get used to the fact that some floats are 'irrational', in the same way that 1/3 is.
One way you can avoid this is to use a library that uses an alternative method of representing decimal numbers, such as BCD
If you are using Java and you need accuracy, use the BigDecimal class for floating point calculations. It is slower but safer.
Seems to me that 21.399999618530273 is the single precision (float) representation of 21.4. Looks like the debugger is casting down from double to float somewhere.
You can't avoid this, as you're using floating point numbers with a fixed quantity of bytes. There's simply no isomorphism possible between the real numbers and their limited notation.
But most of the time you can simply ignore it. 21.4 == 21.4 would still be true, because it is the same number with the same error. But 21.4f == 21.4 may not be true, because the errors for float and double are different, as the sketch below shows.
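For instance, in Java (the same applies to the C family):

public class FloatVsDouble {
    public static void main(String[] args) {
        float  f = 21.4f; // nearest binary32 to 21.4
        double d = 21.4;  // nearest binary64 to 21.4
        System.out.println(f == 21.4f); // true: same literal, same float
        System.out.println(f == d);     // false: the two roundings differ
        System.out.println((double) f); // 21.399999618530273, the debugger's value
    }
}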
If you need fixed precision, perhaps you should try fixed-point numbers, or even integers. I, for example, often use int(1000*x) for passing values to a debug pager.
Dangers of computer arithmetic
If it bothers you, you can customize the way some values are displayed during debug. Use it with care :-)
Enhancing Debugging with the Debugger Display Attributes
Refer to General Decimal Arithmetic
Also take note when comparing floats; see this answer for more information.
According to the javadoc:
"If at least one of the operands to a numerical operator is of type double, then the operation is carried out using 64-bit floating-point arithmetic, and the result of the numerical operator is a value of type double. If the other operand is not a double, it is first widened (§5.1.5) to type double by numeric promotion (§5.6)."
Here is the source.
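That widening is easy to observe. A quick illustration in Java:

public class Promotion {
    public static void main(String[] args) {
        int i = 3;
        double d = 2.0;
        // i is widened to double before the division, so the operation
        // uses 64-bit floating-point arithmetic:
        System.out.println(i / d); // 1.5
        // With two int operands there is no promotion, just integer division:
        System.out.println(i / 2); // 1
    }
}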

Ruby float with lots of decimals, why?

Why does the following operation lead to this value:
14.99 + 1.5 = 16.490000000000002
I would expect it to be 16.49. How can I avoid those extra decimals?
That's how floating point arithmetic works. If you want a rounded number that's still a Float object, you can do
result.round(2) #=> 16.49
or if you just need a string:
"%0.2f" % result
This is not due to Ruby, but because of the way floating point numbers are represented in a computer (according to the IEEE 754 standard).
In short, some floating point numbers just can't be represented exactly in a computer. If you need better precision, you can try the BigDecimal class.
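Ruby's BigDecimal computes with exact decimal digits instead of binary fractions. The same idea, sketched in Java's java.math.BigDecimal for comparison (the string constructor keeps the decimal values exact):

import java.math.BigDecimal;

public class DecimalSum {
    public static void main(String[] args) {
        // Binary floating point: both operands are rounded to the nearest
        // binary64 value, and the sum exposes the accumulated error.
        System.out.println(14.99 + 1.5); // 16.490000000000002
        // Decimal arithmetic: the digits are preserved exactly.
        BigDecimal sum = new BigDecimal("14.99").add(new BigDecimal("1.5"));
        System.out.println(sum); // 16.49
    }
}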

Implied bit in IEEE floating point format

Why is there an implied (or hidden) bit in IEEE floating point format? What is the purpose of it? It is mentioned in passing on Wikipedia.
From (Complete) Tutorial to Understand IEEE Floating-Point Errors:
"[Fraction] is the normalized fractional part of the number, normalized because the exponent is adjusted so that the leading bit is always a 1. This way, it does not have to be stored, and you get one more bit of precision. This is why there is an implied bit."
It basically allows for higher precision.
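You can make the implied bit visible by pulling a float apart. A sketch in Java (class name made up):

public class ImpliedBit {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(1.5f);
        int storedFraction = bits & 0x7FFFFF;     // the 23 stored fraction bits
        // For normal numbers the full significand is the stored fraction
        // plus an implied leading 1, giving 24 bits of precision:
        int fullSignificand = (1 << 23) | storedFraction;
        System.out.println(Integer.toBinaryString(storedFraction));  // 23 bits: 1000...0
        System.out.println(Integer.toBinaryString(fullSignificand)); // 24 bits: 11000...0
    }
}

The format stores 23 fraction bits, but every normal number carries 24 significant bits because the leading 1 is implied rather than stored.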
