Force Mathematica to do numerical computations with finite precision

I would like to analyse the numerical stability of analytic expressions in Mathematica. To this end I want to force Mathematica to evaluate the expression numerically at finite precision and compare the result to one obtained at much higher precision. The problem is that I cannot really get it to forget about the extra digits it keeps in the background, even when I tell it to do so explicitly. Where is the bug in the following?
In[466]:= Sin[2.0]
Out[466]= 0.9092974268256817
In[467]:= Block[{$MaxExtraPrecision = 0}, N[Sin[2.0], 2]]
Out[467]= 0.9092974268256817
In[468]:= Block[{$MaxExtraPrecision = 0}, N[Sin[2.0`2], 2]]
Out[468]= 0.91
In[469]:= SetPrecision[%, 16]
Out[469]= 0.9092974268256817
Even in the third version it keeps many more digits in the background.

Maybe NumberForm is what you need.
NumberForm[expr, n] prints with approximate real numbers in expr
given to n-digit precision.
http://reference.wolfram.com/mathematica/ref/NumberForm.html

Related

How to find the best combination of parameters from a very large set?

I have a processing logic which has 11 parameters (let's say parameters A through K), and different combinations of these parameters can result in different outcomes.
Processing Logic Example:
if x > A:
    x = B
else:
    x = C
y = math.sin(2*x*x + 1.1416) - D
# other logic involving parameters E, F, G, H, I, J, K
return outcome
Here are some examples of the possible values of the parameters (the others are similar and discrete):
A ∈ [0.01, 0.02, 0.03, ..., 0.2]
E ∈ [1, 2, 3, 4, ..., 200]
I would like to find the combination of these parameters that results in the best outcome.
However, the problem I am facing is that there are in total 10^19 possible combinations, and each combination takes 700 ms of processing time per CPU core. Obviously, the time to process all of the combinations is unacceptable even if I have a large computing cluster.
Could anyone give some advice on what is the correct methodology to handle this problem?
Here is some of my thoughts:
Step 1. Increase the step interval of each parameter so that the total processing time shrinks to an acceptable scope, for example:
A ∈ [0.01, 0.05, 0.09, ..., 0.2]
E ∈ [1, 5, 10, 15, ..., 200]
Step 2. Starting from the best combination found in step 1, do a finer search around that combination to find the best one.
But I am afraid that the best combination might hide somewhere that step 1 is not able to perceive, so step 2 would be in vain.
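For concreteness, here is a minimal Python sketch of this coarse-then-fine idea; evaluate() and the value grids below are hypothetical stand-ins for the real processing logic, not the actual parameters:

import itertools

def evaluate(params):
    # Hypothetical stand-in for the real 700 ms processing logic.
    A, B, C, D, E = params
    return -(A - 0.12) ** 2 - (E - 137) ** 2    # pretend outcome, bigger is better

# Step 1: coarse grids with far fewer points per parameter.
coarse_grids = [
    [round(0.01 + 0.04 * i, 2) for i in range(5)],   # A: 0.01, 0.05, 0.09, 0.13, 0.17
    [0.0, 1.0], [0.0, 1.0], [0.0, 1.0],              # B, C, D: placeholder grids
    list(range(1, 201, 10)),                         # E: 1, 11, 21, ..., 191
]
best = max(itertools.product(*coarse_grids), key=evaluate)

# Step 2: refine around the coarse winner (sketched for E only).
A, B, C, D, E = best
fine_E = range(max(1, E - 10), min(200, E + 10) + 1)
best = max(((A, B, C, D, e) for e in fine_E), key=evaluate)
print(best)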
This is an optimization problem. However, you have two distinct problems in what you posed:
There are no restrictions or properties on the evaluation function;
You accept only the best solution of 10^19 possibilities.
The field of optimization serves up many possibilities, most of which are one variation or another of hill-climbing search plus some random perturbation (to help break out of a local maximum that is not the global solution). All of these depend on some manner of continuity or predictability in the evaluation function's dependence on its inputs.
Without that continuity, there is no shorter path to the sole optimal solution.
If you do have some predictability, then you have some reading to do on various solution methods. Start with Newton-Raphson, move on to Gradient Descent, and continue to other topics, depending on the fabric of your function.
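If there is such predictability, a minimal sketch of discrete hill climbing with random restarts might look like the following in Python; evaluate() and the grids are hypothetical stand-ins, not your actual logic:

import random

grids = {                       # hypothetical value grids (only 2 of the 11 parameters)
    "A": [round(0.01 * i, 2) for i in range(1, 21)],
    "E": list(range(1, 201)),
}

def evaluate(point):            # hypothetical stand-in for the 700 ms outcome
    return -(point["A"] - 0.12) ** 2 - (point["E"] - 137) ** 2

def neighbours(point):
    # All points that differ from `point` by one grid step in one parameter.
    for name, grid in grids.items():
        i = grid.index(point[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(grid):
                yield {**point, name: grid[j]}

best = None
for _ in range(20):             # random restarts help escape local maxima
    current = {name: random.choice(grid) for name, grid in grids.items()}
    while True:
        candidate = max(neighbours(current), key=evaluate)
        if evaluate(candidate) <= evaluate(current):
            break
        current = candidate
    if best is None or evaluate(current) > evaluate(best):
        best = current
print(best)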
Have you thought about a purely mathematical approach, i.e. trying to find local/global extrema, or exploiting whether the function is monotonic per operation?
There are quite decent numerical methods for derivatives/integrals, even ones that can be used in a relatively generic manner.
So, in other words, limit the scope instead of computing every single option; how far you can go depends on the general character of the operations you have in mind.

How do binary connections eliminate multiplication?

I was reading Neural Network with Few Multiplications and I'm having trouble understanding how Binary or Ternary Connect eliminate the need for multiplication.
They explain that by stochastically sampling the weights from {-1, 0, 1}, we eliminate the need to multiply, and Wx can be calculated using only sign changes. However, even with weights restricted to -1, 0, and 1, how can I change the signs of x without multiplication?
E.g. W = [0, 1, -1] and x = [0.3, 0.2, 0.4]. Wouldn't I still need to multiply W and x to get [0, 0.2, -0.4]? Or is there some other way to change the sign more efficiently than multiplication?
Yes. All the general-purpose processors I know of since the "early days" (say, 1970) have a machine operation to take the magnitude of one number, the sign of another, and return the result. The data transfer happens in parallel: the arithmetic part of the operation is a single machine cycle.
Many high-level languages have this capability as a built-in function. It often comes under a name such as "copy_sign".
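As a small Python illustration of that idea, using math.copysign (which returns the magnitude of its first argument with the sign of its second):

import math

W = [0, 1, -1]                  # ternary weights
x = [0.3, 0.2, 0.4]             # activations

def apply_weight(w, v):
    # No multiplication: a zero test plus a sign copy.
    if w == 0:
        return 0.0
    return math.copysign(v, w)

products = [apply_weight(w, v) for w, v in zip(W, x)]
print(products)                 # [0.0, 0.2, -0.4]
print(sum(products))            # the dot product W.x, built from additions only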

In what situations does the difference between random numbers generated on [0,1) and those generated on [0,1] make a difference?

I'm used to pseudo random number generators that return floating point values in the half open interval [0,1).
I've seen some reference to RNGs that can return values on the closed interval [0,1], e.g. this implementation of the Mersenne Twister.
I can see reasons why you'd want to exclude one, or both, of the endpoints for mathematical reasons, e.g.
exponentially_distributed=-logf( 1.0-rng() )
always yields a valid number if 0.0<=rng()<1.0.
But I can't think of a case where replacing an rng yielding [0,1] with one that yields [0,1) would produce any practical difference.
In what situations is having a floating point pseudo random number generator that returns values on the closed interval [0,1] absolutely necessary?
Maybe if you're randomly generating the probability of an event occurring? If you allow 0, you have to allow 1.
I can't figure out when the closed interval would be useful, but the half-open interval seems the only reasonable way to go.
Let's take coin tossing:
If you say rnd() < 0.5 is heads and the rest is tails, you will get slightly more tails than heads if you use the closed interval. How many more tails depends on how likely it is to actually get 1.
A compelling reason to use a half-open interval is the use case where you are picking a random array index for some array. When you scale a value from [0, 1) up by arrayLength and truncate, you get integers in [0, arrayLength), and it's helpful never to get the value arrayLength, since that is not a valid index in many language implementations (e.g., Java throws an ArrayIndexOutOfBoundsException). The half-open interval is a great convenience here.
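For illustration, a short Python sketch of that point; random.random() is documented to return values in the half-open interval [0.0, 1.0):

import random

items = ["a", "b", "c", "d"]

# Because random.random() never returns exactly 1.0, int(r * len(items)) is
# always a valid index.  A generator that could return 1.0 would occasionally
# produce index == len(items) and fail here with an IndexError.
r = random.random()
index = int(r * len(items))
print(items[index])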
A reason for having a closed interval [0, 1] is Albin's probability argument. But it's worth noting that, mathematically speaking, the probability of picking any particular number, including 1, from [0, 1] is zero. With pseudo random number generators, though, it will pop up occasionally.

What is the best numerical precision in Ruby?

I was wondering how I can get the best precision in Ruby. Someone told me that the best precision is probably between 0 and 1, because as you go to larger numbers the step size increases as well.
I suppose a way to find out would be to know what the minimum float number is and what the next float number is; the precision would then be their difference, right? If that's correct, how could I do this in Ruby?
I am not sure how to use this http://ruby.wikia.com/wiki/Float to find that information.
Any help appreciated.
In terms of significant digits, the precision is the same, regardless of scale. That is, if you scale your range from [0.0, 1000.0] down to [0.0, 1.0] just by dividing numbers in the natural range by 1000.0, this will have no discernible effect on the precision of your range. In fact, a larger range will have marginally greater precision since it fully contains the smaller range.
As for discovering the absolute precision, you have two problems:
The absolute precision depends on the magnitude, which varies "infinitely" within the range [0, 1] (as x → 0, log(x) → -∞). So there is no one precision for numbers in that range. You can only derive the absolute precision at a given point in the range.
The common technique for discovering the minimum step — known as the ulp (unit in the last place) — is to interpret the bit representation of the float as an integer, increment it by one, reinterpret the result as a float, and take the difference. Ruby doesn't, AFAIK, let you do this.
There is, however, an iterative solution: start with an addend of 1.0 and compute (x + 1.0) - x. If the difference is zero, double the addend and repeat; if it is non-zero, halve the addend and repeat until the difference becomes zero. The smallest addend that still produces a non-zero difference is approximately the ulp. (I described this from vague memory, so it might be NQR.)
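The iterative search is easy to sketch; here it is in Python for concreteness (the same loop ports directly to Ruby), checked against math.ulp, which exists in Python 3.9 and later:

import math

def approx_ulp(x):
    # Find the smallest power-of-two addend that still changes x.
    d = 1.0
    while (x + d) - x == 0.0:
        d *= 2.0
    while (x + d / 2.0) - x != 0.0:
        d /= 2.0
    return d

print(approx_ulp(1.0), math.ulp(1.0))        # both about 2.22e-16
print(approx_ulp(1.0e6), math.ulp(1.0e6))    # both about 1.16e-10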
You can use the Rational class: it stores non-integer numbers as a fraction of two Integers, which (as far as I know) will be automatically converted to Bignum when needed.
The Flt Ruby library provides arbitrary floating-point precision.

Best algorithm for avoiding loss of precision?

A recent homework assignment asks us to take expressions which could create a loss of precision when evaluated on a computer, and alter them so that this loss is avoided.
Unfortunately, the directions for doing this haven't been made very clear. From watching various examples being performed, I know that there are certain methods of doing this: using Taylor series, using conjugates if square roots are involved, or finding a common denominator when two fractions are being subtracted.
However, I'm having some trouble noticing exactly when loss of precision is going to occur. So far the only thing I know for certain is that when you subtract two numbers that are close to being the same, loss of precision occurs: the high-order digits cancel, and what remains is dominated by round-off error in the low-order digits.
My question is what are some other common situations I should be looking for, and what are considered 'good' methods of approaching them?
For example, here is one problem:
f(x) = tan(x) − sin(x) when x ~ 0
What is the best and worst algorithm for evaluating this out of these three choices:
(a) (1/ cos(x) − 1) sin(x),
(b) (x^3)/2
(c) tan(x)*(sin(x)^2)/(cos(x) + 1).
I understand that when x is close to zero, tan(x) and sin(x) are nearly the same. I don't understand how or why any of these algorithms are better or worse for solving the problem.
Another rule of thumb usually used is this: when adding a long series of numbers, start adding from the numbers closest to zero and end with the biggest ones.
Explaining why this is good is a bit tricky. When you're adding small numbers to a large number, there is a chance they will be completely discarded because they are smaller than the lowest digit in the current mantissa of the large number. Take for instance this situation:
a = 1_000_000.0
for _ in range(100_000_000):
    a += 0.01
If 0.01 is smaller than the lowest mantissa digit of a, then the loop does nothing and the end result is a == 1,000,000.
But if you do it like this:
a = 0.0
for _ in range(100_000_000):
    a += 0.01
a += 1_000_000
Then the small values slowly accumulate, and you're more likely to end up with something close to a == 2,000,000, which is the right answer.
This is of course an extreme example, but I hope you get the idea.
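For a concrete, if artificial, demonstration, here is the effect in 32-bit floats using numpy (assuming numpy is available); doubles behave the same way, just at larger magnitudes:

import numpy as np

a = np.float32(1_000_000.0)
for _ in range(1000):
    a += np.float32(0.01)       # 0.01 is below half an ulp of 1e6 in float32, so it is discarded
print(a)                        # still 1000000.0

b = np.float32(0.0)
for _ in range(1000):
    b += np.float32(0.01)       # the small values accumulate first
b += np.float32(1_000_000.0)
print(b)                        # roughly 1000010.0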
I had to take a numerics class back when I was an undergrad, and it was thoroughly painful. Anyhow, IEEE 754 is the floating point standard typically implemented by modern CPUs. It's useful to understand the basics of it, as this gives you a lot of intuition about what not to do. The simplified explanation of it is that computers store floating point numbers in something like base-2 scientific notation with a fixed number of digits (bits) for the exponent and for the mantissa. This means that the larger the absolute value of a number, the less precisely it can be represented. For 32-bit floats in IEEE 754, half of the possible bit patterns represent between -1 and 1, even though numbers up to about 10^38 are representable with a 32-bit float. For values larger than 2^24 (approximately 16.7 million) a 32-bit float cannot represent all integers exactly.
What this means for you is that you generally want to avoid the following:
Having intermediate values be large when the final answer is expected to be small.
Adding/subtracting small numbers to/from large numbers. For example, if you wrote something like:
for(float index = 17000000; index < 17000001; index++) {}
This loop would never terminate because 17,000,000 + 1 is rounded back down to 17,000,000.
If you had something like:
float foo = 10000000.0f - 10000000.0001f;
the value of foo would be 0, not -0.0001, because 10000000.0001 cannot be represented in single precision; it rounds to 10000000.
My question is what are some other common situations I should be looking for, and what are considered 'good' methods of approaching them?
There are several ways you can have severe or even catastrophic loss of precision.
The most important reason is that floating-point numbers have a limited number of digits; e.g. doubles have a 53-bit mantissa. That means if you have "useless" digits which are not part of the solution but must be stored, you lose precision.
For example (we are using decimal types for demonstration):
2.598765000000000000000000000100 -
2.598765000000000000000000000099
The interesting part is the 100 - 99 = 1 answer. As 2.598765 is equal in both cases, it does not change the result, but it wastes 7 digits. Much worse, because the computer doesn't know that these digits are useless, it is forced to store them together with the 21 zeroes that follow, wasting 28 digits in all. Unfortunately there is no way to circumvent this for general differences, but there are other cases, e.g. exp(x) - 1, which is a function occurring very often in physics.
The exp function near 0 is almost linear, but it forces a leading digit of 1. So with 12 significant digits:
exp(0.001) - 1 = 1.00100050017 - 1 = 1.00050017e-3
If we instead use a function expm1(), based on the Taylor series
1 + x + x^2/2 + x^3/6 + ... - 1 = x + x^2/2 + x^3/6 + ... =: expm1(x)
we get
expm1(0.001) = 1.00050016667e-3
Much better.
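For reference, the same comparison in Python, where math.expm1 provides exactly this function:

import math

x = 1e-12
print(math.exp(x) - 1.0)    # about 1.00009e-12, most of the digits are rounding noise
print(math.expm1(x))        # 1.0000000000005e-12, accurate to full precision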
The second problem is functions with a very steep slope, like the tangent of x near pi/2. tan(11) has a slope of about 50,000, which means that any small deviation caused by earlier rounding errors will be amplified by a factor of 50,000! Or you have singularities, e.g. if the result approaches 0/0, which means it can have any value.
In both cases you create a substitute function, simplifying the original function. It is of no use to just list the different solution approaches, because without training you will simply not "see" the problem in the first place.
A very good book to learn from and train with: Forman S. Acton, Real Computing Made Real.
Another thing to avoid is subtracting numbers that are nearly equal, as this can also lead to increased sensitivity to roundoff error. For values near 0, cos(x) will be close to 1, so 1/cos(x) - 1 is one of those subtractions that you'd like to avoid if possible, so I would say that (a) should be avoided.
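A quick Python check of the three candidate formulas against the naive difference makes this visible for small x:

import math

x = 1e-5
naive = math.tan(x) - math.sin(x)                         # catastrophic cancellation
a = (1.0 / math.cos(x) - 1.0) * math.sin(x)               # still cancels in 1/cos(x) - 1
b = x ** 3 / 2.0                                          # leading Taylor term, no cancellation
c = math.tan(x) * math.sin(x) ** 2 / (math.cos(x) + 1.0)  # algebraically equal, no cancellation
print(naive, a, b, c)
# b and c agree to nearly full precision (about 5.0e-16), while naive and (a)
# lose roughly ten of their sixteen digits to the cancellation.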

Resources