Precision of digital computers - performance

I read that multiplying many values between 0 and 1 will significantly reduce the precision of digital computers. I want to know what this claim is based on, and whether it still holds for modern-day computers.

The typical IEEE-conformant representation of fractional numbers only supports a limited number of (binary) digits. So, very often, the result of some computation isn't an exact representation of the expected mathematical value, but something close to it (rounded to the next number representable within the digits limit), meaning that there is some amount of error in most calculations.
If you do multi-step calculations, you might be lucky that the error introduced by one step is compensated by some complementary error at a later step. But that's pure luck, and statistics teaches us that the expected error will indeed increase with every step.
If you do e.g. 1000 multiplications using the float datatype (which typically gives 6-7 significant decimal digits of accuracy), I'd expect the result to be correct only up to about 5 digits, and in the worst case to only 3-4 digits.
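As a rough illustration (the exact loss depends on the factors involved), here is a small hedged C++ sketch that accumulates the same product in float and in double and compares the results; the factor values are arbitrary choices for demonstration, and the double accumulator is fed the same float-rounded factors so that only the accumulation precision differs.
#include <cstdio>
#include <cmath>

int main()
{
    float  pf = 1.0f;   // float accumulator
    double pd = 1.0;    // double accumulator, fed the same float-rounded factors
    for (int i = 0; i < 1000; ++i)
    {
        float f = 0.999f + 0.0001f * (i % 7);  // factors between 0 and 1, not exactly representable
        pf *= f;
        pd *= (double)f;
    }
    double rel = std::fabs((double)pf - pd) / std::fabs(pd);
    std::printf("float  product : %.9g\n", (double)pf);
    std::printf("double product : %.17g\n", pd);
    std::printf("relative drift of the float result: %.3g\n", rel);
    return 0;
}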
There are ways to do precise calculations (at least for addition, subtraction, multiplication and division), e.g. using the ratio type in the LISP programming language, but in practice they are rarely used.
So yes, doing multi-step calculations in datatypes supporting fractional numbers quickly degrades precision, and it happens with all number ranges, not only with numbers between 0 and 1.
If this is a problem for some application, it's a special skill to transform mathematical formulas into equivalent ones that can be computed with better precision (e.g. formulas with fewer intermediate steps).

Related

What is the fastest way to get the value of e?

What is the most optimised algorithm which finds the value of e with moderate accuracy?
I am looking for a comparison between optimised approaches giving more importance to speed than high precision.
Edit: By moderate accuracy I mean up to 6-7 decimal places.
But if there is a HUGE difference in speed, then I can settle for 4-5 places.
basic datatype
As mentioned in the comments, 6-7 decimal places is too little accuracy to justify an algorithm. Just use a constant instead, which is the fastest way anyway:
const double e=2.7182818284590452353602874713527;
If an FPU is involved, the constant is usually stored there too... Also, a single constant occupies much less space than a function that computes it ...
finite accuracy
Only once bignums are involved does it have any merit to use an algorithm to compute e. The algorithm depends on the target accuracy. Again, for smaller accuracies, predefined constants are used:
e=2.71828182845904523536028747135266249775724709369995957496696762772407663035354759457138217852516642742746639193200305992181741359662904357290033429526059563073813232862794349076323382988075319525101901157383418793070215408914993488416750924476146066808226480016847741185374234544243710753907774499206955170189
but usually in hex format for faster and more precise manipulation:
e=2.B7E151628AED2A6ABF7158809CF4F3C762E7160F38B4DA56A784D9045190CFEF324E7738926CFBE5F4BF8D8D8C31D763DA06C80ABB1185EB4F7C7B5757F5958490CFD47D7C19BB42158D9554F7B46BCED55C4D79FD5F24D6613C31C3839A2DDF8A9A276BCFBFA1C877C56284DAB79CD4C2B3293D20E9E5EAF02AC60ACC93ECEBh
For limited/finite accuracy and best speed, the PSLQ algorithm is best. My understanding is that it is an algorithm for finding integer relations between real numbers.
Here is my favourite PSLQ example, computing up to 800 digits of Pi.
arbitrary accuracy
For arbitrary or "fixed" precision you need an algorithm with variable precision. This is what I use in my arbnum class:
e=(1+1/x)^x where x -> +infinity
If you choose x as a power of 2, then x is just a single set bit of the number and 1/x has a predictable bit width. So e is obtained with a single division and a pow. Here is an example:
arbnum arithmetics_e() // e computation min(_arbnum_max_a,arbnum_max_b)*5 decimals
{                                                // e=(1+1/x)^x ... x -> +inf
    int i; arbnum c,x;
    i=_arbnum_bits_a; if (i>_arbnum_bits_b) i=_arbnum_bits_b; i>>=1; // i = half of the smaller bit count
    c.zero(); c.bitset(_arbnum_bits_b-i);        // c = 2^-i (single fractional bit; bit _arbnum_bits_b is the 1.0 bit)
    x.one(); x/=c;                               // x = 1/c = 2^i
    c++;                                         // c = 1 + 1/x
    for (;!x.bitget(_arbnum_bits_b);x>>=1) c*=c; // square c while halving x until x reaches 1 -> c = (1+1/x)^x
    return c;
}
Where _arbnum_bits_a and _arbnum_bits_b are the numbers of bits before and after the binary point. So it breaks down to a few bit operations, one bignum division, and a single power by squaring. Beware that multiplication and division are not that simple with bignums and usually involve Karatsuba or worse ...
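To illustrate the idea without the arbnum class, here is a hedged toy version in plain double: x is chosen as a power of two, 1 + 1/x is formed exactly (it is a single fractional bit), and the power is done by repeated squaring. The constant k = 26 is an arbitrary choice; pushing it much higher mostly adds double round-off rather than accuracy.
#include <cstdio>

int main()
{
    const int k = 26;                        // x = 2^26
    double c = 1.0 + 1.0 / (double)(1 << k); // c = 1 + 2^-k, exactly representable in double
    for (int i = 0; i < k; ++i) c *= c;      // k squarings: c = (1 + 2^-k)^(2^k)
    std::printf("approx e = %.15f\n", c);    // about 2.718281808..., off by roughly e/2^27
    std::printf("true   e = 2.718281828459045\n");
    return 0;
}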
There are also polynomial approaches out there that do not require bignum arithmetic, similar to those used to compute Pi. The idea is to compute a chunk of binary bits per iteration without affecting the previously computed bits (too much). They should be faster, but as usual with optimizations, that depends on the implementation and the HW it runs on.
For reference, see Brothers' formula here: https://www.intmath.com/exponential-logarithmic-functions/calculating-e.php
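As a hedged sketch of the series idea for the modest 6-7 decimal target (related in spirit to the Brothers-type formulas above, and not a replacement for the constant, which is still the fastest option), a plain Taylor sum of 1/k! already gets there in about a dozen terms:
#include <cstdio>

int main()
{
    double e = 1.0, term = 1.0;
    for (int k = 1; k <= 11; ++k)
    {
        term /= k;      // term = 1/k!
        e += term;
    }
    std::printf("series e = %.10f\n", e);  // about 2.7182818260, i.e. 7+ correct decimals
    return 0;
}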

Denormalized Numbers - IEEE 754 Floating Point

So I'm trying to learn more about denormalized numbers as defined in the IEEE 754 standard for floating point numbers. I've already read several articles thanks to Google search results, and I've gone through several Stack Overflow posts. However, I still have some questions unanswered.
First off, just to review my understanding of what a Denormalized float is:
Numbers which have fewer bits of precision, and are smaller (in magnitude) than normalized numbers
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
Does that sound correct? Anything more to it than that?
I've read that:
using denormalized numbers comes with a performance cost on many platforms
Any comments on this?
I've also read in one of the articles that
one should "avoid overlap between normalized and denormalized numbers"
Any comments on this?
In some presentations of the IEEE standard, when floating point ranges are presented the denormalized values are excluded and the tables are labeled as an "effective range", almost as if the presenter is thinking "We know that denormalized numbers CAN represent the smallest possible floating point values, but because of certain disadvantages of denormalized numbers, we choose to exclude them from ranges that will better fit common use scenarios" -- As if denormalized numbers are not commonly used.
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
If I had to answer that question on my own I would want to think that:
Using denormalized numbers is good because you can represent the smallest (in magnitude) numbers possible -- As long as precision is not important, and you do not mix them up with normalized numbers, AND the resulting performance of the application fits within requirements.
Using denormalized numbers is a bad thing because most applications do not require representations so small -- The precision loss is detrimental, and you can shoot yourself in the foot too easily by mixing them up with normalized numbers, AND the performance is not worth the cost in most cases.
Any comments on these two answers? What else might I be missing or not understand about denormalized numbers?
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
That is correct.
using denormalized numbers comes with a performance cost on many platforms
The penalty is different on different processors, but it can be up to 2 orders of magnitude. The reason? The same as for this advice:
one should "avoid overlap between normalized and denormalized numbers"
Here's the key: denormals are a fixed-point "micro-format" within the IEEE-754 floating-point format. In normal numbers, the exponent indicates the position of the binary point. Denormal numbers contain the last 52 bits in the fixed-point notation with an exponent of 2^-1074 for doubles.
So, denormals are slow because they require special handling. In practice, they occur very rarely, and chip makers don't like to spend too many valuable resources on rare cases.
Mixing denormals with normals is slow because then you're mixing formats and you have the additional step of converting between the two.
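If you want to see the cost on your own hardware, something like the following hedged micro-benchmark sketch works; the loop bound, the multiplier, and the measured ratio are machine- and build-dependent (an aggressively optimizing or flush-to-zero build may hide the effect entirely).
#include <chrono>
#include <cstdio>
#include <limits>

static double time_loop(double start)
{
    auto t0 = std::chrono::steady_clock::now();
    volatile double acc = start;
    for (int i = 0; i < 10000000; ++i)
        acc = acc * 1.0000001;                 // grows only by a factor of ~e, so it stays in range
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    double normal = 1.0;                                               // normal operands throughout
    double denorm = std::numeric_limits<double>::denorm_min() * 100;   // denormal operands throughout
    std::printf("normal   operands: %.3f s\n", time_loop(normal));
    std::printf("denormal operands: %.3f s\n", time_loop(denorm));
    return 0;
}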
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the relative difference between tiny numbers small. If you go straight from the smallest normal number to zero (abrupt underflow), the relative change is infinite. If you go to denormals on underflow, the relative change is still not fully accurate, but at least more reasonable. And that difference shows up in calculations.
To put it a different way. Floating-point numbers are not distributed uniformly. There are always the same amount of numbers between successive powers of two: 2^52 (for double precision). So without denormals, you always end up with a gap between 0 and the smallest floating-point number that is 2^52 times the size of the difference between the smallest two numbers. Denormals fill this gap uniformly.
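A small sketch that prints these boundary values (using the standard numeric_limits and nextafter facilities) makes the gap concrete; the specific quantities printed are just a convenient choice for illustration.
#include <cstdio>
#include <cmath>
#include <limits>

int main()
{
    double min_normal = std::numeric_limits<double>::min();           // 2^-1022
    double min_denorm = std::numeric_limits<double>::denorm_min();    // 2^-1074
    double spacing    = std::nextafter(min_normal, 1.0) - min_normal; // step size just above min_normal
    std::printf("smallest normal double   : %g\n", min_normal);
    std::printf("smallest denormal double : %g\n", min_denorm);
    std::printf("spacing above the smallest normal: %g\n", spacing);
    // Without denormals, the next value below min_normal would be 0:
    // a jump 2^52 times larger than the spacing between neighbouring normals there.
    return 0;
}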
As an example about the effects of abrupt vs. gradual underflow, look at the mathematically equivalent x == y and x - y == 0. If x and y are tiny but different and you use abrupt underflow, then if their difference is less than the minimum cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal numbers gets to be a denormal, which is still not zero. The equivalence is preserved.
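A minimal sketch of that argument, using the smallest normal double and its immediate neighbour (the exact values are just a convenient choice): with gradual underflow the difference survives as a denormal, so the two tests agree.
#include <cstdio>
#include <cmath>
#include <limits>

int main()
{
    double x = std::numeric_limits<double>::min();   // smallest normal double, 2^-1022
    double y = std::nextafter(x, 1.0);               // the very next representable double above x
    double d = y - x;                                // a denormal: 2^-1074
    std::printf("x == y            : %d\n", x == y);                 // 0: they differ
    std::printf("y - x == 0        : %d\n", d == 0.0);               // 0: the difference is not lost
    std::printf("y - x is subnormal: %d\n", std::fpclassify(d) == FP_SUBNORMAL);  // 1
    // With abrupt underflow (flush-to-zero), d would be forced to 0
    // and the two tests above would disagree.
    return 0;
}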
So, using denormals on purpose is not advised, because they were designed only as a backup mechanism in exceptional cases.

LUT versus Newton-Raphson Division For IEEE-754 32-bit Floating Point

I was wondering what are the tradeoffs when implementing 32-bit IEEE-754 floating point division: using LUTs versus through the Newton-Raphson method?
When I say tradeoffs I mean in terms of memory size, instruction count, etc.
I have a small memory (130 words (each 16-bits)). I am storing upper 12-bits of mantissa (including hidden bit) in one memory location and lower 12-bits of mantissa in another location.
Currently I am using Newton-Raphson division, but am considering what the tradeoffs would be if I changed my method. Here is a link to my algorithm: Newton's Method for finding the reciprocal of a floating point number for division
Thank you and please explain your reasoning.
The trade-off is fairly simple. A LUT uses extra memory in the hope of reducing the instruction count enough to save some time. Whether it's effective will depend a lot on the details of the processor -- caching in particular.
For Newton-Raphson, you change X/Y to X * (1/Y) and use your iteration to find 1/Y. At least in my experience, if you need full precision, it's rarely useful -- its primary strength is in allowing you to find something to (say) 16-bit precision more quickly.
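For reference, here is a hedged sketch of the classic reciprocal iteration r_{n+1} = r_n·(2 - y·r_n); the linear seed 48/17 - 32/17·y for a significand scaled into [0.5, 1) and the iteration count are common textbook choices, not necessarily what the asker's implementation uses.
#include <cstdio>
#include <cmath>

double nr_reciprocal(double y, int steps)            // y assumed scaled into [0.5, 1)
{
    double r = 48.0 / 17.0 - 32.0 / 17.0 * y;        // crude first guess, a few correct bits
    for (int i = 0; i < steps; ++i)
        r = r * (2.0 - y * r);                       // one NR refinement; error roughly squares
    return r;
}

int main()
{
    double y = 0.685;                                // arbitrary test significand
    for (int steps = 0; steps <= 4; ++steps)
        std::printf("steps=%d  r=%.17g  |r - 1/y|=%.3g\n",
                    steps, nr_reciprocal(y, steps),
                    std::fabs(nr_reciprocal(y, steps) - 1.0 / y));
    return 0;
}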
The usual method for division is a bit-by-bit method. Although that particular answer deals with integers, for floating point you do essentially the same, except that along with it you subtract the exponents. A floating point number is basically A·2^N, where A is the significand and N is the exponent part of the number. So you take two numbers A·2^N / B·2^M, and carry out the division as (A/B)·2^(N-M), with A and B being treated as (essentially) integers in this case. The only real difference is that with floating point you normally want to round rather than truncate the result. That basically just means carrying out the division to (at least) one extra bit of precision, then rounding up if that extra bit is a one.
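A hedged sketch of just the exponent bookkeeping, using the standard frexp/ldexp functions for the decomposition; the significand quotient a/b is delegated to the hardware divide here, whereas a real implementation would compute it bit by bit (or with SRT) on the integer significands and then round.
#include <cmath>
#include <cstdio>

// x = a*2^N, y = b*2^M, so x/y = (a/b) * 2^(N-M); scaling by a power of two is exact.
double divide_via_parts(double x, double y)
{
    int nx, ny;
    double a = std::frexp(x, &nx);     // x = a * 2^nx, with a in [0.5, 1)
    double b = std::frexp(y, &ny);     // y = b * 2^ny
    return std::ldexp(a / b, nx - ny); // (a/b) * 2^(nx-ny)
}

int main()
{
    std::printf("%.17g vs %.17g\n", divide_via_parts(355.0, 113.0), 355.0 / 113.0);
    return 0;
}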
The most common method using lookup tables is SRT division. This is most often done in hardware, so I'd probably Google for something like "Verilog SRT" or "VHDL SRT". Rendering it in C++ shouldn't be terribly difficult though. Where the method I outlined in the linked answer produces one bit per iteration, this can be written to do 2, 4, etc. If memory serves, the size of the table grows quadratically with the number of bits produced per iteration though, so you rarely see much more than 4 in practice.
Each Newton-Raphson step roughly doubles the number of digits of precision, so if you can work out the number of bits of precision you expect from a LUT of a particular size, you should be able to work out how many NR steps you need to attain your desired precision. The Cray-1 used NR as the final stage of its reciprocal calculation. Looking for this I found a fairly detailed article on this sort of thing: An Accurate, High Speed Implementation of Division by Reciprocal Approximation from the 9th IEEE Symposium on Computer Arithmetic (September 6-8, 1989).

Kahan summation

Has anyone used Kahan summation in an application? When would the extra precision be useful?
I hear that on some platforms double operations are quicker than float operations. How can I test this on my machine?
Kahan summation works well when you are summing numbers and you need to minimize the worst-case floating point error. Without this technique, you may have significant loss of precision in add operations if you have two numbers that differ in magnitude by more than the significant digits available (e.g. 1 + 1e-12). Kahan summation compensates for this.
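For concreteness, here is a minimal sketch of the algorithm itself; the test data (1.0 followed by a million terms each smaller than half an ulp of 1.0) is an artificial choice to make the effect obvious, since the naive sum never moves at all.
#include <cstdio>
#include <vector>

double kahan_sum(const std::vector<double>& data)
{
    double sum = 0.0, c = 0.0;      // c = running compensation (lost low-order part)
    for (double x : data)
    {
        double y = x - c;           // re-inject what was lost on the previous step
        double t = sum + y;         // big + small: the low bits of y get dropped here
        c = (t - sum) - y;          // algebraically zero; numerically, the dropped part
        sum = t;
    }
    return sum;
}

int main()
{
    std::vector<double> data(1000000, 1e-16);   // each term is below ulp(1.0)/2
    data.insert(data.begin(), 1.0);
    double naive = 0.0;
    for (double x : data) naive += x;
    std::printf("exact : 1.0000000001\n");
    std::printf("naive : %.17g\n", naive);            // stays at exactly 1
    std::printf("kahan : %.17g\n", kahan_sum(data));  // close to 1.0000000001
    return 0;
}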
And an excellent resource for floating point issues is here, "What every computer scientist should know about floating-point arithmetic": http://www.validlab.com/goldberg/paper.pdf
On single vs double precision performance: yes, single precision can be significantly faster, but it depends on the particular machine. See: https://www.hpcwire.com/2006/06/16/less_is_more_exploiting_single_precision_math_in_hpc-1/
The best way to test is to write a short example that tests the operations you care about, using both single (float) and double precision, and measure the runtimes.
I've used Kahan summation for Monte-Carlo integration. You have a scalar-valued function f which you believe is rather expensive to evaluate; a reasonable estimate is 65 ns/dimension. Then you accumulate those values into an average - updating an average takes about 4 ns. So if you update the average using Kahan summation (4x as many flops, ~16 ns), you're really not adding that much compute to the total. Now, it is often said that the error of Monte-Carlo integration is σ/√N, but this is incorrect. The real error bound (in finite precision arithmetic) is
σ/√N + cond(Iₙ)·ε·N
where cond(Iₙ) is the condition number of the summation and ε is twice the unit roundoff. So the algorithm diverges faster than it converges. For 32-bit arithmetic, getting ε·N ~ 1 is simple: 10^7 evaluations can be done exceedingly quickly, and after this your Monte-Carlo integration goes on a random walk. The situation is even worse when the condition number is large.
If you use Kahan summation, the expression for the error changes to
σ/√N + cond(Iₙ)·ε²·N,
which, admittedly, still diverges faster than it converges, but ε²·N cannot be made large on a reasonable timescale on modern hardware.
I've used Kahan summation to compensate for an accumulated error when computing running averages. It does make quite a difference and it's easy to test. I eliminated rather large errors after only 100 summations.
I would definitely use the Kahan summation algorithm to compensate for the error in any running totals.
However, I've noticed quite large (1e-3) errors when doing inverse matrix multiplication. Basically, with A*x = y, inv(A)*y ~= x; I'm not getting the original values back exactly. Which is fine, but I thought maybe Kahan summation would help (there's a lot of addition), especially with larger matrices > 3-by-3. I tried with a 4-by-4 matrix and it did not improve the situation at all.
When would the extra precision be useful?
Very roughly:
Case 1
When you are
Summing up a lot of data
in a non-sequential fashion, i.e. computing sums, then summing up the sums (as opposed to iterating all data with a running sum),
then Kahan summation makes a lot of sense in the second phase - when you sum-up-the-sums, because the errors you're avoiding are by now more significant, while the overhead is paid only for a small fraction of the overall sum operations.
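A hedged sketch of that structure: plain sums inside each chunk, compensation only when the chunk sums are combined, so the extra operations are paid once per chunk rather than once per element. The chunk count, chunk size, and fill value are arbitrary demo choices.
#include <cstdio>
#include <vector>

int main()
{
    std::vector<std::vector<double>> chunks(1000, std::vector<double>(1000, 0.1));
    double total = 0.0, c = 0.0;                  // compensated accumulator for the chunk sums
    for (const std::vector<double>& chunk : chunks)
    {
        double partial = 0.0;
        for (double x : chunk) partial += x;      // ordinary running sum inside the chunk
        double y = partial - c;                   // Kahan step, applied to chunk sums only
        double t = total + y;
        c = (t - total) - y;
        total = t;
    }
    std::printf("total = %.17g (exact: 100000)\n", total);
    return 0;
}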
Case 2
When you're working with a lower-precision floating-point type, without being sure you're meeting the accuracy requirement, and you're not allowed to switch to a larger, higher-precision type.

Recombine Number to Equal Math Formula

I've been thinking about a math/algorithm problem and would appreciate your input on how to solve it!
If I have a number (e.g. 479), I would like to recombine its digits, or combinations of them, into a math formula that matches the original number. All digits should be used in their original order, but they may be combined into multi-digit numbers (hence, 479 allows for 4, 7, 9, 47, 79), and each digit may only be used once, so you cannot have something like 4x47x9, as the digit 4 would then be used twice.
Now an example just to demonstrate on how I think of it. The example is mathematically incorrect because I couldn't come up with a good example that actually works, but it demonstrates input and expected output.
Example Input: 29485235
Example Output: 2x9+48/523^5
As I said, my example does not add up (2x9+48/523^5 doesn't result in 29485235) but I wondered if there is an algorithm that would actually allow me to find such a formula consisting of the source number's digits in their original order which would upon calculation yield the original number.
On the type of math used, I'd say parentheses () and Add/Sub/Mul/Div/Pow/Sqrt.
Any ideas on how to do this? My thought was to simply brute-force it by chopping the number apart at random and doing calculations, hoping for a matching result. There's gotta be a better way though?
Edit: If it's any easier in non-original order, or you have an idea to solve this while ignoring some of the 'conditions' described above, it would still help tremendously to understand how to go about solving such a problem.
For numbers up to about 6 digits or so, I'd say brute-force it according to the following scheme:
1) Split your initial value into a list (array, whatever, according to language) of numbers. Initially, these are the digits.
2) For each pair of numbers, combine them together using one of the operators. If the result is the target number, then return success (and print out all the operations performed on your way out). Otherwise if it's an integer, recurse on the new, smaller list consisting of the number you just calculated, and the numbers you didn't use. Or you might want to allow non-integer intermediate results, which will make the search space somewhat bigger. The binary operations are:
add
subtract
multiply
divide
power
concatenate (which may only be used on numbers which are either original digits, or have been produced by concatenation).
3) Allowing square root bloats the search space to infinity, since it's a unary operator. So you will need a way to limit the number of times it can be applied, and I'm not sure what that will be (loss of precision as the answer approaches 1, maybe?). This is another reason to allow only integer intermediate values.
4) Exponentiation will rapidly cause overflows. 2^(9^(4^8)) is far too large to store all the digits directly [although in base 2 it's pretty obvious what they are ;-)]. So you'll either have to accept that you might miss solutions with large intermediate values, or else you'll have to write a bunch of code to do your arithmetic in terms of factors. These obviously don't interact very well with addition, so you might have to do some estimation. For example, just by looking at the magnitude of the number of factors we see that 2^(9^(4^8)) is nowhere near (2^35), so there's no need to calculate (2^(9^(4^8)) + 5) / (2^35). It can't possibly be 29485235, even if it were an integer (which it certainly isn't - another way to rule out this particular example). I think handling these numbers is harder than the rest of the problem put together, so perhaps you should limit yourself to single-digit powers to begin with, and perhaps to results which fit in a 64bit integer, depending what language you are using.
5) I forgot to exclude the trivial solution for any input, of just concatenating all the digits. That's pretty easy to handle, though, just maintain a parameter through the recursion which tells you whether you have performed any non-concatenation operations on the route to your current sub-problem. If you haven't, then ignore the false match.
My estimate of 6 digits is based on the fact that it's fairly easy to write a Countdown solver that runs in a fraction of a second even when there's no solution. This problem is different in that the digits have to be used in order, but there are more operations (Countdown does not permit exponentiation, square root, or concatenation, or non-integer intermediate results). Overall I think this problem is comparable, provided you resolve the square root and overflow issues. If you can solve one case in a fraction of a second, then you can brute force your way through a million candidates in reasonable time (assuming you don't mind leaving your PC on).
By 10 digits, brute force appears impossible, because you have to consider 10 billion cases, each with a significant amount of recursion required. So I guess you'll hit the limit of brute force somewhere between the two.
Note also that my simple algorithm at the top still has a lot of redundancy - it doesn't stop you doing (4,7,9,1) -> (47,9,1) -> (47,91), and then later also doing (4,7,9,1) -> (4,7,91) -> (47,91). So unless you work out where those duplicates are going to occur and avoid them, you'll attempt (47,91) twice. Obviously that's not much work when there's only 2 numbers in the list, but when there are 7 numbers in the list, you probably do not want to e.g. add 4 of them together in 6 different ways and then solve the resulting 4-number problem 6 times. Cleverness here is not required for the Countdown game, but for all I know in this problem it might make the difference between brute-forcing 8 digits, and brute-forcing 9 digits, which is quite significant.
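Here is a hedged sketch of this recursion, deliberately simplified: only adjacent numbers are combined (which preserves the original digit order), intermediates are kept integer, and pow/sqrt are left out to sidestep the issues in steps 3) and 4). The struct, function names, and example input are made up for illustration, and no duplicate-state pruning is attempted.
#include <cstdio>
#include <string>
#include <vector>

struct Item { long long val; std::string expr; bool concatOnly; };

// Returns true and fills 'out' if the items can be reduced to 'target' using at
// least one arithmetic operation (pure concatenation is rejected, as in step 5).
static bool solve(std::vector<Item> xs, long long target, bool usedOp, std::string& out)
{
    if (xs.size() == 1)
    {
        if (usedOp && xs[0].val == target) { out = xs[0].expr; return true; }
        return false;
    }
    for (size_t i = 0; i + 1 < xs.size(); ++i)      // try combining each adjacent pair
    {
        const Item a = xs[i], b = xs[i + 1];
        std::vector<Item> cands;                    // candidate replacements for (a,b)
        std::vector<bool> isOp;                     // true if an arithmetic op was used
        cands.push_back(Item{a.val + b.val, "(" + a.expr + "+" + b.expr + ")", false}); isOp.push_back(true);
        cands.push_back(Item{a.val - b.val, "(" + a.expr + "-" + b.expr + ")", false}); isOp.push_back(true);
        cands.push_back(Item{a.val * b.val, "(" + a.expr + "*" + b.expr + ")", false}); isOp.push_back(true);
        if (b.val != 0 && a.val % b.val == 0)       // keep division exact / integer (step 2)
        {
            cands.push_back(Item{a.val / b.val, "(" + a.expr + "/" + b.expr + ")", false}); isOp.push_back(true);
        }
        if (a.concatOnly && b.concatOnly)           // concatenation only on runs of original digits
        {
            long long scale = 1;
            for (size_t k = 0; k < b.expr.size(); ++k) scale *= 10;
            cands.push_back(Item{a.val * scale + b.val, a.expr + b.expr, true}); isOp.push_back(false);
        }
        for (size_t c = 0; c < cands.size(); ++c)
        {
            std::vector<Item> next(xs);
            next[i] = cands[c];
            next.erase(next.begin() + i + 1);
            if (solve(next, target, usedOp || isOp[c], out)) return true;
        }
    }
    return false;
}

int main()
{
    std::string digits = "479";                     // demo input from the question
    long long target = 479;                         // look for a non-trivial formula equal to it
    std::vector<Item> items;
    for (char d : digits) items.push_back(Item{d - '0', std::string(1, d), true});
    std::string expr;
    if (solve(items, target, false, expr)) std::printf("%lld = %s\n", target, expr.c_str());
    else std::printf("no non-trivial formula found\n");   // the expected outcome for most inputs
    return 0;
}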
Numbers like that, as I recall, are exceedingly rare, if extant. Some numbers can be expressed by their component digits in a different order, such as, say, 25 (5²).
Also, trying to brute-force solutions is hopeless, at best, given that the number of permutations increases extremely rapidly as the numbers grow in digits.
EDIT: Partial solution.
A partial solution solving some cases would be to factorize the number into its prime factors. If its prime factors are all the same, and the exponent and factor are both present in the digits of the number (such as is the case with 25) you have a specific solution.
Most numbers that do fall into these kinds of patterns will do so either with multiplication or pow() as their major driving force; addition simply doesn't increase it enough.
Short of building a neural network that replicates Carol Vorderman, I can't see anything short of brute force working - humans are quite smart at seeing patterns in problems such as this, but encoding such insight is really tough.

Resources