Are float types "nested" for different precisions?

I am currently working on a scientific code in Python. This code uses several different float precisions, namely half, single and double precision (via NumPy). However, my question is more general, as it is not specific to Python.
Question: are these precisions "nested", in the sense that any number exactly representable (i.e. with no approximation) in a lower precision is also exactly representable in a higher precision?
Other phrasing: do I change the value of a float when casting it to a higher precision?

I'm quite sure that the answer is yes - at least for IEEE 754 standard floating-point types. If you cast a variable precisely representing a number to a higher-precision type, the least significant bits of the new significand will be zero, so the answer to your second question is: no, the numeric value won't be changed.
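A quick sanity check of this in NumPy (a minimal sketch; 0.15625 is just an arbitrary value that is exactly representable in half precision):

```python
import numpy as np

x16 = np.float16(0.15625)   # 5/32, exactly representable in half precision
x32 = np.float32(x16)       # widen to single precision
x64 = np.float64(x32)       # widen to double precision

# Widening only appends zero bits to the significand, so the value is unchanged.
print(x16 == x32 == x64)    # True
```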

Related

Precision of digital computers

I read that multiplying many values between 0 and 1 will significantly reduce the precision of digital computers; I want to know what basis this claim rests on. And does it still hold for modern-day computers?
The typical IEEE-conformant representation of fractional numbers only supports a limited number of (binary) digits. So, very often, the result of some computation isn't an exact representation of the expected mathematical value, but something close to it (rounded to the next number representable within the digits limit), meaning that there is some amount of error in most calculations.
If you do multi-step calculations, you might be lucky that the error introduced by one step is compensated by some complementary error at a later step. But that's pure luck, and statistics teaches us that the expected error will indeed increase with every step.
If you e.g. do 1000 multiplications using the float datatype (which typically gives 6-7 significant decimal digits of accuracy), I'd expect the result to be correct only up to about 5 digits, and in the worst case only 3-4 digits.
There are ways to do precise calculations (at least for addition, subtraction, multiplication and division), e.g. using the ratio type in the LISP programming language, but in practice they are rarely used.
So yes, doing multi-step calculations in datatypes supporting fractional numbers quickly degrades precision, and it happens with all number ranges, not only with numbers between 0 and 1.
If this is a problem for some application, it's a special skill to transform mathematical formulas into equivalent ones that can be computed with better precision (e.g. formulas with fewer intermediate steps).
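A small NumPy sketch of this effect (a rough illustration, not a precise error analysis): chain 1000 single-precision multiplications and compare against a double-precision reference.

```python
import numpy as np

rng = np.random.default_rng(0)
factors = rng.uniform(0.9, 1.1, size=1000)   # arbitrary factors near 1

prod32 = np.float32(1.0)
for f in factors:
    prod32 *= np.float32(f)                  # 1000 single-precision roundings

prod64 = np.prod(factors)                    # double-precision reference

rel_err = abs(float(prod32) - prod64) / abs(prod64)
print(rel_err)   # typically on the order of 1e-6, i.e. only ~5-6 correct digits
```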

Algorithm and datastructure for calculating trig functions to arbitrary precisions

I need to do a lot of calculations to arbitrarily high precision - in JavaScript, which only has a 64-bit float representation of numbers.
I can see how I could combine multiple variables to represent large numbers: for example, to represent a large decimal of m digits, where the 64-bit floating point can represent n digits, I need roughly m / n variables.
But how can I implement an algorithm that calculates tan() to an arbitrary precision, using only 64-bit floating point arithmetic?
Why do you want to (re)do it yourself? I'd use a library for that, for instance http://mathjs.org/docs/datatypes/bignumbers.html.
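For comparison, the Python analogue of that advice would be an arbitrary-precision library such as mpmath (assumed installed; a minimal sketch):

```python
import mpmath

mpmath.mp.dps = 50          # work with 50 significant decimal digits
print(mpmath.tan(1))        # tan(1 radian) evaluated to that precision
```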

Denormalized Numbers - IEEE 754 Floating Point

So I'm trying to learn more about denormalized numbers as defined in the IEEE 754 standard for floating-point numbers. I've already read several articles thanks to Google search results, and I've gone through several Stack Overflow posts. However, I still have some questions unanswered.
First off, just to review my understanding of what a Denormalized float is:
Numbers which have fewer bits of precision, and are smaller (in
magnitude) than normalized numbers
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
Does that sound correct? Anything more to it than that?
I've read that:
using denormalized numbers comes with a performance cost on many
platforms
Any comments on this?
I've also read in one of the articles that
one should "avoid overlap between normalized and denormalized numbers"
Any comments on this?
In some presentations of the IEEE standard, when floating-point ranges are presented, the denormalized values are excluded and the tables are labeled as an "effective range". It is almost as if the presenter is thinking: "We know that denormalized numbers CAN represent the smallest possible floating-point values, but because of certain disadvantages of denormalized numbers, we choose to exclude them from ranges that will better fit common use scenarios" -- as if denormalized numbers are not commonly used.
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
If I had to answer that question on my own I would want to think that:
Using denormalized numbers is good because you can represent the smallest (in magnitude) numbers possible -- as long as precision is not important, you do not mix them up with normalized numbers, AND the resulting performance of the application fits within requirements.
Using denormalized numbers is a bad thing because most applications do not require representations that small -- the precision loss is detrimental, you can shoot yourself in the foot too easily by mixing them up with normalized numbers, AND the performance is not worth the cost in most cases.
Any comments on these two answers? What else might I be missing or not understand about denormalized numbers?
Essentially, a denormalized float has the ability to represent the
SMALLEST (in magnitude) number that is possible to be represented with
any floating point value.
That is correct.
using denormalized numbers comes with a performance cost on many platforms
The penalty is different on different processors, but it can be up to 2 orders of magnitude. The reason? The same as for this advice:
one should "avoid overlap between normalized and denormalized numbers"
Here's the key: denormals are a fixed-point "micro-format" within the IEEE-754 floating-point format. In normal numbers, the exponent indicates the position of the binary point. Denormal numbers contain the last 52 bits in fixed-point notation with a scale factor of 2^-1074 for doubles.
So, denormals are slow because they require special handling. In practice, they occur very rarely, and chip makers don't like to spend too many valuable resources on rare cases.
Mixing denormals with normals is slow because then you're mixing formats and you have the additional step of converting between the two.
I guess I just keep getting the impression that using denormalized
numbers turns out to not be a good thing in most cases?
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the relative difference between tiny numbers small. If you go straight from the smallest normal number to zero (abrupt underflow), the relative change is infinite. If you go to denormals on underflow, the relative change is still not fully accurate, but at least more reasonable. And that difference shows up in calculations.
To put it a different way: floating-point numbers are not distributed uniformly. There are always the same number of representable values between successive powers of two: 2^52 (for double precision). So without denormals, you always end up with a gap between 0 and the smallest floating-point number that is 2^52 times the size of the difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example about the effects of abrupt vs. gradual underflow, look at the mathematically equivalent x == y and x - y == 0. If x and y are tiny but different and you use abrupt underflow, then if their difference is less than the minimum cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal numbers gets to be a denormal, which is still not zero. The equivalence is preserved.
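A quick NumPy/Python sketch of gradual underflow in action (assuming default IEEE-754 doubles, i.e. no flush-to-zero):

```python
import numpy as np

tiny = np.finfo(np.float64).tiny   # smallest positive normal double, 2**-1022
x = 1.5 * tiny
y = 1.0 * tiny                     # two tiny, distinct normal numbers

diff = x - y                       # 2**-1023, representable only as a denormal
print(x == y)                      # False
print(diff == 0.0)                 # False -- gradual underflow keeps x - y != 0
print(diff < tiny)                 # True  -- the difference lies below the normal range
```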
So, using denormals on purpose is not advised, because they were designed only as a backup mechanism in exceptional cases.

LUT versus Newton-Raphson Division For IEEE-754 32-bit Floating Point

I was wondering what are the tradeoffs when implementing 32-bit IEEE-754 floating point division: using LUTs versus through the Newton-Raphson method?
When I say tradeoffs I mean in terms of memory size, instruction count, etc.
I have a small memory of 130 words (each 16 bits wide). I am storing the upper 12 bits of the mantissa (including the hidden bit) in one memory location and the lower 12 bits of the mantissa in another location.
Currently I am using Newton-Raphson division, but am considering what the tradeoffs would be if I changed my method. Here is a link to my algorithm: Newton's Method for finding the reciprocal of a floating point number for division
Thank you and please explain your reasoning.
The trade-off is fairly simple. A LUT uses extra memory in the hope of reducing the instruction count enough to save some time. Whether it's effective will depend a lot on the details of the processor -- caching in particular.
For Newton-Raphson, you change X/Y to X * (1/Y) and use your iteration to find 1/Y. At least in my experience, if you need full precision, it's rarely useful -- its primary strength is in allowing you to find something to (say) 16-bit precision more quickly.
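For reference, the standard Newton-Raphson reciprocal iteration is x_{k+1} = x_k * (2 - y * x_k); a minimal Python sketch (in practice the starting guess would come from a small lookup table):

```python
def nr_reciprocal(y, x0, steps=4):
    """Approximate 1/y via Newton-Raphson; each step roughly doubles
    the number of correct bits in the estimate."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - y * x)
    return x

# X / Y is then computed as X * (1/Y):
print(8.0 * nr_reciprocal(3.0, 0.3))   # ~2.666666666666...
```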
The usual method for division is a bit-by-bit method. Although that particular answer deals with integers, for floating point you do essentially the same, except that along with it you subtract the exponents. A floating-point number is basically A * 2^N, where A is the significand and N is the exponent part of the number. So, you take two numbers (A * 2^N) / (B * 2^M), and carry out the division as (A / B) * 2^(N-M), with A and B being treated as (essentially) integers in this case. The only real difference is that with floating point you normally want to round rather than truncate the result. That basically just means carrying out the division to (at least) one extra bit of precision, then rounding up if that extra bit is a one.
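A minimal Python sketch of that decomposition (using math.frexp/ldexp to split off significand and exponent; the integer bit-by-bit division and the rounding detail are left out):

```python
import math

def split_divide(x, y):
    """Divide by separating significand and exponent: x = a * 2**n, y = b * 2**m,
    so x / y = (a / b) * 2**(n - m)."""
    a, n = math.frexp(x)   # 0.5 <= |a| < 1
    b, m = math.frexp(y)
    return math.ldexp(a / b, n - m)

print(split_divide(10.0, 4.0))   # 2.5
```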
The most common method using lookup tables is SRT division. This is most often done in hardware, so I'd probably Google for something like "Verilog SRT" or "VHDL SRT". Rendering it in C++ shouldn't be terribly difficult, though. Where the method I outlined in the linked answer produces one bit per iteration, this can be written to do 2, 4, etc. If memory serves, the size of the table grows quadratically with the number of bits produced per iteration, though, so you rarely see much more than 4 in practice.
Each Newton-Raphson step roughly doubles the number of digits of precision, so if you can work out the number of bits of precision you expect from a LUT of a particular size, you should be able to work out how many NR steps you need to attain your desired precision. The Cray-1 used NR as the final stage of its reciprocal calculation. Looking for this I found a fairly detailed article on this sort of thing: An Accurate, High Speed Implementation of Division by Reciprocal Approximation from the 9th IEEE Symposium on Computer Arithmetic (September 6-8, 1989).

TSQL: Which real number type results in faster comparisons

Among the T-SQL types (MSSQL) real, float and decimal, which type would result in faster comparisons?
Would decimal use hardware FPU computation, or is it purely in software?
Decimals cannot use the FPU. You get the best performance with float or real, which map to standard IEEE floating-point types that are supported by the FPU.
float = double
real = single
Of course, single is faster.
Are these relative comparisons (>, <) or equality comparisons (=, !=)?
Floats and reals are approximate data types, whereas a decimal is an exact representation. If you're doing equality comparisons, you'll want to stay away from floats and reals. So that leaves decimals.
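The underlying issue, illustrated in Python rather than T-SQL (a minimal sketch; Python's float is the same IEEE-754 double that SQL Server's float maps to):

```python
from decimal import Decimal

# Binary floating point only approximates most decimal fractions...
print(0.1 + 0.2 == 0.3)                                     # False

# ...while an exact decimal type represents them exactly.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))    # True
```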
More than likely, SQL Server won't go to the FPU for relative comparisons. FPU and other coprocessors are used for arithmetic.
