IEEE 754: Why can't we exactly represent the number n = 1 + 2^-30?

I've read that if we try to represent this number in single precision according to the Standard, we obtain just the number 1, and the absolute error would then be 2^-30. I know how to represent a number in IEEE 754, but in this case it is a sum and I don't know how to proceed. Can anyone help me? Thanks
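To see it concretely, here is a quick Python check (numpy assumed for the 32-bit type): in single precision the sum rounds back to 1, while a double, with its 53-bit significand, holds it exactly.

    import numpy as np

    # Single precision: 24 significand bits, so the 2^-30 term falls below
    # the last bit when added to 1 and is rounded away.
    x = np.float32(1.0) + np.float32(2.0 ** -30)
    print(x == np.float32(1.0))      # True: we get exactly 1.0, absolute error 2^-30

    # Double precision: 53 significand bits, so 1 + 2^-30 is exact.
    print(1.0 + 2.0 ** -30 == 1.0)   # False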

Related

Are float types "nested" for different precisions

I am currently working on a scientific code in Python. This code uses several different float precisions, namely half, single and double precision (via numpy). However, my question is more general, as it is not specific to Python.
Question: are these precisions "nested", in the sense that any number exactly representable (i.e. with no approximation) in a lower precision is also exactly representable in a higher precision?
Other phrasing: do I change the value of a float when casting it to a higher precision?
I'm quite sure that the answer is yes, at least for IEEE 754 floating-point types. If you cast a value that a lower-precision type represents exactly to a higher-precision type, the exponent range only widens and the least significant bits of the new, longer significand are simply zero, so the answer to your second question is: no, the numeric value won't be changed.
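A quick exhaustive check with numpy (float16 has only 65536 bit patterns): upcast every half-precision value to double and back; the round trip can only be bit-exact if the upcast preserved every value.

    import numpy as np

    bits = np.arange(2 ** 16, dtype=np.uint16)
    h = bits.view(np.float16)
    h = h[~np.isnan(h)]              # NaN payloads are a separate story; exclude them
    back = h.astype(np.float64).astype(np.float16)   # upcast, then cast back down
    print(np.array_equal(h.view(np.uint16), back.view(np.uint16)))  # True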

Floating point random number generation in Verilog

Is there a way to generate random floating point numbers in either Verilog or SystemVerilog? More specifically, through some hardware implementation?
A floating point number is just a collection of bits. As such, generating a random floating point number can be done by generating random bits and then interpreting them as a float (or real, as the type is called in both VHDL and Verilog).
A standard way of generating a series of random bits in hardware is a PRBS generator (Pseudo-Random Bit Sequence generator): a linear feedback shift register whose feedback taps are chosen to give a maximum-length sequence. There are various polynomials depending on how long a PRBS you want.
For an exact implementation, I suggest you search for PRBS.
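Since the other snippets in this document are Python, here is a behavioural sketch of the idea rather than synthesizable hardware: a 32-bit Fibonacci LFSR (taps 32, 22, 2, 1 are one maximal-length choice from the usual tap tables) whose state is reinterpreted as an IEEE 754 single. Note that raw random bits can decode to infinities, NaNs and denormals.

    import struct

    def lfsr32(state):
        # Fibonacci LFSR with taps at bits 32, 22, 2, 1 (maximal-length).
        while True:
            bit = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
            state = ((state << 1) | bit) & 0xFFFFFFFF
            yield state

    gen = lfsr32(0xACE1)             # any nonzero seed works
    for _ in range(3):
        word = next(gen)
        # Reinterpret the 32 pseudo-random bits as a single-precision float.
        print(struct.unpack('<f', struct.pack('<I', word))[0])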

Denormalized Numbers - IEEE 754 Floating Point

So I'm trying to learn more about denormalized numbers as defined in the IEEE 754 standard for floating point numbers. I've already read several articles found via Google, and I've gone through several Stack Overflow posts. However, I still have some unanswered questions.
First off, just to review my understanding of what a Denormalized float is:
Numbers which have fewer bits of precision, and are smaller (in magnitude) than normalized numbers
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
Does that sound correct? Anything more to it than that?
I've read that:
using denormalized numbers comes with a performance cost on many platforms
Any comments on this?
I've also read in one of the articles that
one should "avoid overlap between normalized and denormalized numbers"
Any comments on this?
In some presentations of the IEEE standard, when floating point ranges are given, the denormalized values are excluded and the tables are labeled as an "effective range", almost as if the presenter were thinking: "We know that denormalized numbers CAN represent the smallest possible floating point values, but because of certain disadvantages of denormalized numbers, we choose to exclude them from ranges that will better fit common use scenarios" -- as if denormalized numbers are not commonly used.
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
If I had to answer that question on my own I would want to think that:
Using denormalized numbers is good because you can represent the smallest (in magnitude) numbers possible -- As long as precision is not important, and you do not mix them up with normalized numbers, AND the resulting performance of the application fits within requirements.
Using denormalized numbers is a bad thing because most applications do not require representations so small -- The precision loss is detrimental, and you can shoot yourself in the foot too easily by mixing them up with normalized numbers, AND the performance is not worth the cost in most cases.
Any comments on these two answers? What else might I be missing or not understand about denormalized numbers?
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
That is correct.
using denormalized numbers comes with a performance cost on many platforms
The penalty is different on different processors, but it can be up to 2 orders of magnitude. The reason? The same as for this advice:
one should "avoid overlap between normalized and denormalized numbers"
Here's the key: denormals are a fixed-point "micro-format" within the IEEE-754 floating-point format. In normal numbers, the exponent indicates the position of the binary point. Denormal numbers contain the last 52 bits in the fixed-point notation with an exponent of 2^-1074 for doubles.
So, denormals are slow because they require special handling. In practice, they occur very rarely, and chip makers don't like to spend too many valuable resources on rare cases.
Mixing denormals with normals is slow because then you're mixing formats and you have the additional step of converting between the two.
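You can see the fixed-point layout directly in Python, which uses IEEE 754 doubles: setting only the lowest fraction bit, with a zero exponent field, gives 2^-1074, the smallest positive denormal.

    import struct, sys

    # Bit pattern 0x0000000000000001: sign 0, exponent field 0, fraction 1.
    tiny = struct.unpack('<d', (1).to_bytes(8, 'little'))[0]
    print(tiny)                      # 5e-324
    print(tiny == 2.0 ** -1074)      # True: the smallest positive denormal
    print(sys.float_info.min)        # 2.2250738585072014e-308, the smallest normal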
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the relative difference between tiny numbers small. If you go straight from the smallest normal number to zero (abrupt underflow), the relative change is infinite. If you go to denormals on underflow, the relative change is still not fully accurate, but at least more reasonable. And that difference shows up in calculations.
To put it a different way: floating-point numbers are not distributed uniformly. There is always the same number of representable values between successive powers of two: 2^52 (for double precision). So without denormals, you always end up with a gap between 0 and the smallest floating-point number that is 2^52 times the size of the difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example of the effects of abrupt vs. gradual underflow, look at the mathematically equivalent x == y and x - y == 0. If x and y are tiny but different and you use abrupt underflow, then if their difference is less than the minimum cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal numbers gets to be a denormal, which is still not zero. The equivalence is preserved.
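In Python (IEEE 754 doubles with gradual underflow on mainstream hardware), the preserved equivalence looks like this:

    import sys

    x = sys.float_info.min           # smallest positive normal double, 2^-1022
    y = sys.float_info.min * 1.5     # a nearby, but different, tiny normal
    diff = y - x                     # 2^-1023: below the normal range, so a denormal
    print(x == y)                    # False
    print(diff == 0.0)               # False: gradual underflow keeps it nonzero
    print(diff)                      # 1.1125369292536007e-308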
So, using denormals on purpose is not advised, because they were designed only as a backup mechanism in exceptional cases.

Alternating series: Problems with floating point precision

I have a list of numbers x obtained from measurements. I need to calculate the quantity
where a is a positive number. Unfortunately, the individual terms of the sum can be very large and the sign of each term is determined by k and the associated value of x. This leads to cancellation and loss of precision.
I have found a couple of approaches such as compensated summation but wanted to check whether I am on the right path and whether there are better alternatives.
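Compensated (Kahan) summation, which the question mentions, is a reasonable starting point; here is a minimal sketch in Python, with a contrived input where naive summation drops the tiny terms entirely. (Python's math.fsum implements a stronger, exactly rounded variant of the same idea.)

    def kahan_sum(values):
        total = 0.0
        c = 0.0                      # running compensation: low-order bits lost so far
        for v in values:
            y = v - c                # apply the compensation to the next term
            t = total + y            # big + small: low bits of y may be lost here...
            c = (t - total) - y      # ...and are recovered algebraically into c
            total = t
        return total

    vals = [1.0] + [1e-16] * 10 ** 6
    print(sum(vals))                 # 1.0: naive summation loses every tiny term
    print(kahan_sum(vals))           # 1.0000000001

Note that compensation addresses accumulated rounding; if the trouble is genuine cancellation between huge terms of opposite sign, reformulating the expression itself may also be necessary.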

Convert rounded decimal to (approximate) radical value?

I've made a lot of random math programs to help me with my homework (synthetic division being the most fun) and now I want to reverse a radical expression.
For instance, in my handy TI calculator I get
.2360679775
Well, I want to convert that number to its equivalent irrational expression, which is
sqrt(5)-2
I realize I could brute force it... but that takes out the fun, and isn't nearly so easy when you consider the significant round-off error of floating point.
So how would you do it? Is there is a trivial algorithm?
Inverse Symbolic Calculator
(I originally linked to this which seems to be gone.)
Well, your example hasn't actually transformed the input to the equivalent irrational expression, but to an equivalent irrational expression. As the Inverse Symbolic Calculator indicates, there are many candidate irrational expressions within a tolerance of the decimal number in your example, and there will be just as many irrationals within any degree of tolerance of any decimal number you specify. It's all to do with the density of irrationals along the number line.
So to answer your questions:
I would limit myself to a small set of terms, such as sqrt(2), sqrt(3), square roots of other small primes, e, pi, and integers, plus rationals with small prime denominators, and approximate the decimal with a few terms built from those and the four basic arithmetic operators;
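As a toy illustration of that approach (identify is a hypothetical helper; the candidate set, bounds and tolerance are arbitrary choices), one might search expressions of the form p + q*sqrt(n) for small rationals p, q and small n. The mpmath library's mpmath.identify offers a far more capable version of the same idea.

    from fractions import Fraction
    from math import sqrt

    def identify(x, tol=1e-9, max_int=10, max_den=6):
        # Try candidates of the form p + q*sqrt(n), with p and q small rationals.
        for n in (2, 3, 5, 6, 7, 10):
            r = sqrt(n)
            for pd in range(1, max_den + 1):
                for pn in range(-max_int * pd, max_int * pd + 1):
                    p = Fraction(pn, pd)
                    q = (x - float(p)) / r       # ideal coefficient of sqrt(n)
                    for qd in range(1, max_den + 1):
                        qn = round(q * qd)       # snap q to a nearby small rational
                        if qn == 0:
                            continue
                        if abs(float(p) + qn / qd * r - x) < tol:
                            return f"{p} + {Fraction(qn, qd)}*sqrt({n})"
        return None

    print(identify(0.2360679775))    # -2 + 1*sqrt(5)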
Is this algorithm trivial? You decide. In general, though, I think it will be impossible to find an algorithm for determining a canonical representation of any decimal fraction as a series of irrational terms and integers, for the simple reason that no such canonical representation exists.
But then, my real and irrational maths is very rusty; I look forward to proofs that I am wrong, and to counter-examples.
