Alternating series: Problems with floating point precision

I have a list of numbers x obtained from measurements. I need to calculate the quantity
where a is a positive number. Unfortunately, the individual terms of the sum can be very large and the sign of each term is determined by k and the associated value of x. This leads to cancellation and loss of precision.
I have found a couple of approaches such as compensated summation but wanted to check whether I am on the right path and whether there are better alternatives.
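As a minimal sketch of the compensated summation mentioned above, here is Neumaier's variant of Kahan summation in Python, next to a plain running sum and the standard library's `math.fsum`. The `terms` list is made-up illustrative data with heavy cancellation, not the quantity from the question, and the function names are hypothetical.

```python
import math

def naive_sum(terms):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for t in terms:
        total += t
    return total

def neumaier_sum(terms):
    """Compensated (Neumaier) summation of an iterable of floats."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for t in terms:
        s = total + t
        if abs(total) >= abs(t):
            compensation += (total - s) + t   # low-order bits of t were lost
        else:
            compensation += (t - s) + total   # low-order bits of total were lost
        total = s
    return total + compensation

# Hypothetical data: the mathematically exact sum is 2.0.
terms = [1.0, 1e100, 1.0, -1e100]
naive = naive_sum(terms)            # the two 1.0 terms are absorbed and lost
compensated = neumaier_sum(terms)   # recovers the lost low-order part
exact = math.fsum(terms)            # correctly rounded sum from the stdlib
```

If the inputs are already floats, `math.fsum` is probably the simplest option to try first; a hand-written compensated sum is mainly useful when terms arrive one at a time.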

Related

How to find accuracy of matrix multiplication with floating-point numbers?

I am trying to analyze how floating-point computation becomes more inaccurate when the data size decreases. In order to do that, I wanted to perform simple matrix operations on different variations of floating point representation, such as float64, float32, and float16. Since float64 computation will give the most precise and accurate result out of the three, I assume all float64 computation to give the expected result (i.e., error = 0).
The issue is that when I compare the calculated result with the expected result, I don't have an exact idea of how to quantify all the individual errors that I get into a single metric. I know about certain ways to go about it, such as finding the error mean, or the sum of square of errors (SSE), but I just wanted to know if there was a standard way of calculating the overall error of a given matrix computation.
Perhaps a variant of the condition number can be helpful? See here: https://en.wikipedia.org/wiki/Condition_number#Matrices
if there was a standard way of calculating the overall error of a given matrix computation.
Consider the case when a matrix is size 1. Then we are in a familiar 1 dimension domain.
How to compare y_computed_as_float vs y_expected? Even in this case, there is not a standard of how these should compare as floating point numbers. Subtract? Divide? It is often context sensitive. So "no" to OP's question.
Yet there are common practices. So a potential "yes" to OP's question for select cases.
Floating point computations are often assessed by the difference between computed and math expected values scaled by the Unit in the last place*.
error = (y_computed_as_float - y_expected)/ulpf((float) y_expected);
For an N×N matrix, the matrix error could use the root mean square of the N^2 element errors.
* Scaling by ULP has some issues near each power of 2 and more near 0.0. There are ways to mitigate that, but we are getting into the weeds.
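The ULP-scaled error and its root-mean-square aggregate could be sketched in Python as follows. This is an illustration under assumptions: `math.ulp` (Python 3.9+) works on Python floats, which are doubles, so the errors here are in double-precision ULPs rather than the float ULPs of `ulpf`; the function names are hypothetical.

```python
import math

def ulp_error(computed, expected):
    """Signed error of `computed`, in units in the last place of `expected`."""
    return (computed - expected) / math.ulp(expected)

def rms_ulp_error(computed_matrix, expected_matrix):
    """Root mean square of the element-wise ULP errors of two equal-shape matrices."""
    errors = [
        ulp_error(c, e)
        for c_row, e_row in zip(computed_matrix, expected_matrix)
        for c, e in zip(c_row, e_row)
    ]
    return math.sqrt(sum(err * err for err in errors) / len(errors))

# 0.1 + 0.2 lands one double-precision ULP above 0.3.
single = ulp_error(0.1 + 0.2, 0.3)
# One element is off by 1 ULP, the other is exact.
overall = rms_ulp_error([[0.1 + 0.2, 0.5]], [[0.3, 0.5]])
```

For float32 or float16 results, the same idea would need the ULP of the narrower type (for example via `numpy.spacing` on an array of that dtype) in place of `math.ulp`.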

How to efficiently sample the continuous negative binomial distribution?

First, for context, I am working on a game where when you do something good you earn positive credits and when you do something bad you earn negative credits, and each credit corresponds to flipping a biased coin where if you get heads then something happens (good if it's a positive credit, bad if it's a negative credit) and otherwise nothing happens.
The deal is that I want to handle the case of multiple credits and fractional credits, and I would like to have flips use up credits so that if something good/bad happens then the leftover credits carry over. A straightforward way of doing this is to just perform a bunch of trials, and in particular for the case of fractional credits we can multiply the number of credits by X and the likelihood of something happening by 1/X (the distribution has the same expectation but slightly different weights); unfortunately, this places a practical limit on how many credits the user can get and also how many decimal places can be in the number of credits since this results in an unbounded amount of work.
What I would like to do is take advantage of the fact that I am sampling the continuous negative binomial distribution, which is the distribution of how many trials it takes to get heads; that is, if f(X) is the distribution, then f(X) gives the probability that there will be X tails before we run into a heads, where X need not be an integer. If I can sample this distribution, then, taking X to be the number of tails, I can check whether X is greater or less than the number of credits: if it is greater, we use up all of the credits but nothing happens, and if it is less than or equal, something good happens and we subtract X from the number of credits. Furthermore, because the distribution is continuous, I can easily handle fractional credits.
Does anyone know of a way for me to be able to efficiently sample the continuous negative binomial distribution (that is, a function that generates random numbers from this distribution)?
This question may be better answered on StatsExchange, but here I will take a stab at it.
You are correct that trying to compute this directly will be computationally expensive as you cannot avoid the beta and/or gamma function dependencies. The only statistically valid approximation I'm aware of is if the number of successes s required is large, and p is neither very small nor very large, then you can approximate it with a normal distribution with special values for the mean and variance. You can read more here but I'm guessing this approximation will not be generally applicable for you.
The negative binomial distribution can also be approximated as a mixture of Poisson distributions, but this doesn't save you from the gamma function dependency.
The only efficient class of negative binomial samplers that I'm aware of uses optimized accept-reject techniques. Pages 10-11 of this PDF here describe the concept behind the method. Page 6 (page 295 internally) of this PDF here contains source code for sampling binomial deviates using related techniques. Note that even these methods still require random uniform deviates as well as sqrt(), log(), and gammln() calls. For small numbers of trials (less than 100 maybe?) I wouldn't be surprised at all if just simulating the trials with a fast random number generator is faster than even the accept-reject techniques. Definitely start by getting a fast PRNG; they are not all created equal.
Edit:
The following pseudo-code would probably be fairly efficient to draw a random discrete negative binomial-distributed value as long as p is not very large (too close to 1.0). It will return the number of trials required before reaching your first "desired" outcome (which is actually the first "failure" in terms of the distribution):
// assume p and r are the parameters to the neg. binomial dist.
// r = number of failures (you'll set to one for your purpose)
// p = probability of a "success"
double rnd = _rnd.nextDouble(); // [0.0, 1.0)
int k = 0; // represents the # of successes that occur before 1st failure
double lastPmf = Math.pow(1 - p, r); // (1 - p)^r, the probability that k = 0
double cdf = lastPmf;
while (cdf < rnd)
{
lastPmf *= (p * (k+r) / (k+1));
cdf += lastPmf;
k++;
}
return k;
// or return (k+1) to also count the trial on which the failure occurred
Using the recurrence relationship saves over repeating the factorial independently at each step. I think using this, combined with limiting your fractional precision to 1 or 2 decimal places (so you only need to multiply by 10 or 100 respectively) might work for your purposes. You are drawing only one random number and the rest is just multiplications--it should be quite fast.
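A runnable Python rendering of the pseudo-code above might look like the following. It is a sketch, not a tuned sampler: the function names are hypothetical, the uniform deviate is passed in explicitly so the inversion step can be checked by hand, and the extra `pmf > 0.0` guard is an added assumption to stop the loop if rounding keeps the accumulated CDF just below `u`.

```python
import random

def neg_binomial_from_uniform(u, p, r=1.0):
    """Invert the negative binomial CDF at u in [0, 1).

    p is the probability of a "success" and r the number of failures to
    wait for. Returns the number of successes seen before the r-th failure.
    """
    k = 0
    pmf = (1.0 - p) ** r                  # P(k = 0)
    cdf = pmf
    while cdf < u and pmf > 0.0:          # guard against rounding stalls
        pmf *= p * (k + r) / (k + 1)      # P(k + 1) from P(k) via the recurrence
        cdf += pmf
        k += 1
    return k

def sample_neg_binomial(p, r=1.0, rng=random):
    """Draw one value using a single uniform random number."""
    return neg_binomial_from_uniform(rng.random(), p, r)

# With p = 0.5 and r = 1 the CDF is 0.5, 0.75, 0.875, ...
k_low = neg_binomial_from_uniform(0.4, 0.5)
k_mid = neg_binomial_from_uniform(0.6, 0.5)
k_high = neg_binomial_from_uniform(0.8, 0.5)

rng = random.Random(0)
draws = [sample_neg_binomial(0.5, 1.0, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)            # theoretical mean is p * r / (1 - p) = 1
```

The expected number of loop iterations grows like p * r / (1 - p), which is why this approach slows down as p approaches 1.0.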

probabilities with small numbers

I am working with a large number of probabilities that I multiply together, so I quickly obtain very small numbers, and Python ends up storing the final result as zero.
To overcome this difficulty, I decided to sum the logs of these probabilities (instead of multiplying the probabilities directly). This strategy returns a negative number (call it c), as expected.
But then, if I apply the exponential to c (to get back the real value of my product of probabilities), I obtain zero because c is too large and negative (something like -123445.4).
How can I overcome this problem?
If you are going to use numbers of that magnitude you should use a specialized library which can handle arbitrary floating point precision. Check out mpmath or bigfloat package for example.
Native double-precision floats only represent positive numbers down to roughly 10^-308 (about exp(-708)), or about 5e-324 (about exp(-744)) if you count subnormals. Alternatively, you could restrict your code to store only the exponent and never convert it to a decimal representation.
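To illustrate the "store only the exponent" suggestion, here is a small Python sketch that keeps everything in log space and only exponentiates differences of logs. The probability lists are made-up data and the helper names are hypothetical.

```python
import math

def log_product(probabilities):
    """Log of the product of the probabilities, computed as a sum of logs."""
    return sum(math.log(p) for p in probabilities)

def log_sum_exp(log_values):
    """log(sum(exp(v))) computed without underflow by factoring out the max."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values))

# Hypothetical data: each product is far below the smallest double.
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100
log_a = log_product(probs_a)          # about -1151.3
log_b = log_product(probs_b)

underflowed = math.exp(log_a)         # too small for a double

# Ratios and normalised weights are still computable, because only the
# difference of the logs is ever exponentiated.
log_total = log_sum_exp([log_a, log_b])
weight_a = math.exp(log_a - log_total)   # a / (a + b)
```

Whether this is enough depends on what the product is used for: comparisons, ratios and normalisation work directly on the logs, while printing the full value would still need an arbitrary-precision library such as the ones mentioned above.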

Denormalized Numbers - IEEE 754 Floating Point

So I'm trying to learn more about denormalized numbers as defined in the IEEE 754 standard for floating point numbers. I've already read several articles thanks to Google search results, and I've gone through several Stack Overflow posts. However, I still have some questions unanswered.
First off, just to review my understanding of what a Denormalized float is:
Numbers which have fewer bits of precision, and are smaller (in
magnitude) than normalized numbers
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
Does that sound correct? Anything more to it than that?
I've read that:
using denormalized numbers comes with a performance cost on many
platforms
Any comments on this?
I've also read in one of the articles that
one should "avoid overlap between normalized and denormalized numbers"
Any comments on this?
In some presentations of the IEEE standard, when floating point ranges are presented the denormalized values are excluded and the tables are labeled as an "effective range", almost as if the presenter is thinking "We know that denormalized numbers CAN represent the smallest possible floating point values, but because of certain disadvantages of denormalized numbers, we choose to exclude them from ranges that will better fit common use scenarios" -- As if denormalized numbers are not commonly used.
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
If I had to answer that question on my own I would want to think that:
Using denormalized numbers is good because you can represent the smallest (in magnitude) numbers possible -- As long as precision is not important, and you do not mix them up with normalized numbers, AND the resulting performance of the application fits within requirements.
Using denormalized numbers is a bad thing because most applications do not require representations so small -- The precision loss is detrimental, and you can shoot yourself in the foot too easily by mixing them up with normalized numbers, AND the performance is not worth the cost in most cases.
Any comments on these two answers? What else might I be missing or not understand about denormalized numbers?
Essentially, a denormalized float has the ability to represent the
SMALLEST (in magnitude) number that is possible to be represented with
any floating point value.
That is correct.
using denormalized numbers comes with a performance cost on many platforms
The penalty is different on different processors, but it can be up to 2 orders of magnitude. The reason? The same as for this advice:
one should "avoid overlap between normalized and denormalized numbers"
Here's the key: denormals are a fixed-point "micro-format" within the IEEE-754 floating-point format. In normal numbers, the exponent indicates the position of the binary point. Denormal numbers contain the last 52 bits in fixed-point notation with an exponent of 2^-1074 for doubles.
So, denormals are slow because they require special handling. In practice, they occur very rarely, and chip makers don't like to spend too many valuable resources on rare cases.
Mixing denormals with normals is slow because then you're mixing formats and you have the additional step of converting between the two.
I guess I just keep getting the impression that using denormalized
numbers turns out to not be a good thing in most cases?
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the relative difference between tiny numbers small. If you go straight from the smallest normal number to zero (abrupt underflow), the relative change is infinite. If you go to denormals on underflow, the relative change is still not fully accurate, but at least more reasonable. And that difference shows up in calculations.
To put it a different way: floating-point numbers are not distributed uniformly. There are always the same amount of numbers between successive powers of two: 2^52 (for double precision). So without denormals, you always end up with a gap between 0 and the smallest floating-point number that is 2^52 times the size of the difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example about the effects of abrupt vs. gradual underflow, look at the mathematically equivalent x == y and x - y == 0. If x and y are tiny but different and you use abrupt underflow, then if their difference is less than the minimum cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal numbers gets to be a denormal, which is still not zero. The equivalence is preserved.
So, using denormals on purpose is not advised, because they were designed only as a backup mechanism in exceptional cases.
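The gradual-underflow argument can be checked with a few lines of Python. This sketch assumes IEEE 754 doubles with subnormals enabled (no flush-to-zero mode), which is the usual situation for CPython on desktop hardware; the variable names are illustrative.

```python
import sys

smallest_normal = sys.float_info.min     # 2**-1022
smallest_subnormal = 5e-324              # 2**-1074, the smallest positive double

# Two tiny, different normal numbers whose difference is below the normal range.
x = 1.5 * smallest_normal
y = smallest_normal
difference = x - y                       # 2**-1023, representable only as a subnormal

# Without subnormals, the gap between 0 and the smallest normal number would be
# this many times wider than the spacing just above the smallest normal number.
gap_ratio = smallest_normal / smallest_subnormal

# Below the smallest subnormal, the result finally rounds to zero.
below_everything = smallest_subnormal / 2
```

Here `x != y` and `x - y != 0` agree, which is the equivalence that abrupt underflow would break.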

math: scale coordinate system so that certain points get integer coordinates

this is more a mathematical problem. nonetheless i am looking for the algorithm in pseudocode to solve it.
given is a one dimensional coordinate system, with a number of points. the coordinates of the points may be in floating point.
now i am looking for a factor that scales this coordinate system, so that all points land on whole numbers (i.e. integer coordinates)
if i am not mistaken, there should be a solution for this problem as long as the number of points is not infinite.
if i am wrong and there is no analytical solution for this problem, i am interested in an algorithm that approximates the solution as close as possible. (i.e. the coordinates will look like 15.0001)
if you are interested for the concrete problem:
i would like to overcome the well known pixelsnapping problem in adobe flash, which cuts off half-pixels at the border of bitmaps if the whole stage is scaled. i would like to find an ideal scaling factor for the stage which places my bitmaps on whole (screen-)pixel coordinates.
since i am placing two bitmaps on the stage, the number of points will be 4 in each direction (x,y).
thanks!
As suggested, you have to convert your floating point numbers to rational ones. Fix a tolerance epsilon, and for each coordinate, find its best rational approximation within epsilon.
An algorithm and the relevant definitions are outlined in this section.
Once you have converted all the coordinates into rational numbers, the scaling is given by the least common multiple of the denominators.
Note that this latter number can become quite large, so you may want to experiment with epsilon to keep the denominators under control.
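A rough Python sketch of this rational-approximation approach follows. One substitution to be aware of: instead of a tolerance epsilon, it bounds the denominator with `Fraction.limit_denominator`, which plays a similar role but is not the same parameter. The function name and sample coordinates are hypothetical; `math.lcm` needs Python 3.9+.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+

def integer_scale(coords, max_denominator=1000):
    """Return (scale, scaled) where scaled[i] is the integer nearest coords[i] * scale.

    Each coordinate is replaced by its closest rational with a denominator
    of at most max_denominator; the scale is the least common multiple of
    those denominators.
    """
    rationals = [Fraction(c).limit_denominator(max_denominator) for c in coords]
    scale = lcm(*(q.denominator for q in rationals))
    scaled = [int(q * scale) for q in rationals]   # exact: scale clears every denominator
    return scale, scaled

# Hypothetical coordinates: approximated as 1/2, 1/4, 1/3 and 6/5.
scale, scaled = integer_scale([0.5, 0.25, 1 / 3, 1.2])
```

Lowering `max_denominator` keeps the scale small at the cost of a looser fit, which mirrors the trade-off with epsilon described above.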
My own inclination, if I were in your situation, would be to use rational numbers rather than floating point.
And the algorithm you are looking for is finding the lowest common denominator.
A floating point number is an integer, multiplied by a power of two (the power might be negative).
So, find the largest necessary power of two among your inputs, and that gives you a scale factor that will work. The power of two isn't just -1 times the exponent of the float, it's a few more than that (according to where the least significant 1 bit is in the significand).
It's also optimal, because if x times a power of 2 is an odd integer then x in its float representation was already in simplest rational form, there's no smaller integer that you can multiply x by to get an integer.
Obviously if you have a mixture of large and small values among your input, then the resulting integers will tend to be bigger than 64 bit. So there is an analytical solution, but perhaps not a very good one given what you want to do with the results.
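In Python, the exact power-of-two scale factor can be read off with `float.as_integer_ratio()`, which returns the float's exact value as a fraction in lowest terms (so the denominator is always a power of two). The sketch below assumes finite inputs; the function name is hypothetical.

```python
def exact_power_of_two_scale(values):
    """Smallest power of two that turns every float in `values` into an integer."""
    # The largest denominator is a multiple of all the others,
    # because every denominator is a power of two.
    return max(v.as_integer_ratio()[1] for v in values)

def scale_exactly(values):
    """Return (scale, integers) using exact integer arithmetic throughout."""
    scale = exact_power_of_two_scale(values)
    integers = []
    for v in values:
        numerator, denominator = v.as_integer_ratio()
        integers.append(numerator * (scale // denominator))
    return scale, integers

scale, integers = scale_exactly([0.5, 0.75, 2.0])

# A decimal-looking input such as 0.1 needs a much larger factor,
# because the stored double is not exactly one tenth.
scale_for_tenth = exact_power_of_two_scale([0.1])
```

The second result shows why the exact answer is often impractical: one "ordinary" decimal input already forces a scale of 2^55.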
Note that this approach treats floats as being precise representations, which they are not. You may get more sensible results by representing each float as a rational number with smaller denominator (within some defined tolerance), then taking the lowest common multiple of all the denominators.
The problem there though is the approximation process - if the input float is 0.334[*] then I can't in general be sure whether the person who gave it to me really meant 0.334, or whether it's 1/3 with some inaccuracy. I therefore don't know whether to use a scale factor of 3 and say the scaled result is 1, or use a scale factor of 500 and say the scaled result is 167. And that's just with 1 input, never mind a bunch of them.
With 4 inputs and allowed final tolerance of 0.0001, you could perhaps find the 10 closest rationals to each input with a certain maximum denominator, then try 10^4 different possibilities and see whether the resulting scale factor gives you any values that are too far from an integer. Brute force seems nasty, but you might at least be able to bound the search a bit as you go. Also "maximum denominator" might be expressed in terms of the primes present in the factorization, rather than just the number, since if you can find a lot of common factors among them then they'll have a smaller lcm and hence smaller deviation from integers after scaling.
[*] Not that 0.334 is an exact float value, but that sort of thing. Decimal examples are easier.
If you are talking about single precision floating point numbers, then the number can be expressed like this according to wikipedia:
value = (-1)^sign * 2^(e - 127) * (1 + fraction / 2^23)
where sign is the sign bit, e is the 8-bit biased exponent, and fraction is the 23-bit fraction field read as an integer.
From this formula you can deduce that you always get an integer if you multiply by 2^(127+23). (Actually, when e is 0 you have to use another formula for the special range of "subnormal" numbers, so 2^(126+23) is sufficient. See the linked Wikipedia article for details.)
To do this in code you will probably need to do some bit twiddling to extract the factors in the above formula from the bits in the floating point value. And then you will need some kind of support for unlimited size numbers to express the integer result of the scaling (e.g. BigInteger in .NET). Normal primitive types in most languages/platforms are typically limited to much smaller sizes.
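As a sketch of that bit twiddling in Python (whose built-in `int` is already unbounded, so no BigInteger-style type is needed), the fields of a single-precision value can be extracted with `struct` and the scaling by 2^(126+23) done in exact integer arithmetic. The function names are hypothetical, and inputs are first rounded to single precision by `struct.pack`.

```python
import struct

def decode_float32(x):
    """Return (sign, e, m): sign bit, 8-bit biased exponent, 23-bit fraction field."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def scaled_to_integer(x):
    """Return x * 2**(126 + 23) as an exact Python int (x must be finite)."""
    sign, e, m = decode_float32(x)
    if e == 0xFF:
        raise ValueError("infinities and NaNs cannot be scaled to an integer")
    if e == 0:
        magnitude = m                            # subnormal: value is m * 2**-149
    else:
        magnitude = (m | 0x800000) << (e - 1)    # normal: (2**23 + m) * 2**(e - 150) * 2**149
    return -magnitude if sign else magnitude

fields = decode_float32(0.15625)     # 0.15625 = 1.25 * 2**-3
one_scaled = scaled_to_integer(1.0)
half_scaled = scaled_to_integer(-0.5)
```

The resulting integers are up to roughly 277 bits wide, which is why fixed-width primitive types are not enough for this approach.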
It's really a problem in statistical inference combined with noise reduction. This is the method I'm going to try out soon. I'm assuming you're trying to get a regularly spaced 2-D grid but a similar method could work on a regularly spaced grid of 3 or more dimensions.
First tabulate all the differences and note that (dx,dy) and (-dx,-dy) denote the same displacement, so there's an equivalence relation. Group those differences that are within a pre-assigned threshold (epsilon) of one another. Epsilon should be large enough to capture measurement errors due to random noise or lack of image resolution, but small enough not to accidentally combine clusters.
Sort the clusters by their average size (dr = root(dx^2 + dy^2)).
If the original grid was, indeed, regularly spaced and generated by two independent basis vectors, then the two smallest linearly independent clusters will indicate so. The smallest cluster is the one centered on (0, 0). The next smallest cluster (dx0, dy0) gives the first basis vector, up to a +/- sign (recall that (-dx0, -dy0) denotes the same displacement).
The next smallest clusters may be linearly dependent on this (up to the threshold epsilon) by virtue of being multiples of (dx0, dy0). Find the smallest cluster which is NOT a multiple of (dx0, dy0). Call this (dx1, dy1).
Now you have enough to tag the original vectors. Sort the vectors in increasing lexicographic order, where (x,y) > (x',y') if x > x', or x = x' and y > y'. Take the smallest (x0,y0) and assign the integer pair (0, 0) to it. For each of the others (x,y), find the decomposition (x,y) = (x0,y0) + M0(x,y) (dx0, dy0) + M1(x,y) (dx1,dy1) and assign it the integers (m0(x,y),m1(x,y)) = (round(M0), round(M1)).
Now do a least-squares fit of the integers to the vectors, using the equations (x,y) = (ux,uy) + m0(x,y) (u0x,u0y) + m1(x,y) (u1x,u1y)
to find (ux,uy), (u0x,u0y) and (u1x,u1y). This identifies the grid.
Test this match to determine whether or not all the points are within a given threshold of this fit (maybe using the same threshold epsilon for this purpose).
The 1-D version of this same routine should also work in 1 dimension on a spectrograph to identify the fundamental frequency in a voice print. Only in this case, the assumed value for ux (which replaces (ux,uy)) is just 0 and one is only looking for a fit to the homogeneous equation x = m0(x) u0x.
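A minimal 1-D sketch of the routine, in Python, might look like this. It is an illustration under assumptions rather than the author's implementation: it keeps an offset term u (fitting x = u + m * u0, so it is not the homogeneous spectrograph case), it assumes the smallest spacing between neighbouring points is the grid step, and the function name and sample points are made up.

```python
def fit_1d_grid(points, eps):
    """Fit x ~ u + m * u0 with integer m; return (u, u0, ms, worst_residual)."""
    xs = sorted(points)
    # Tabulate positive pairwise differences, dropping the cluster around 0.
    diffs = sorted(b - a for i, a in enumerate(xs) for b in xs[i + 1:] if b - a > eps)
    if not diffs:
        raise ValueError("need at least two points farther apart than eps")
    # Smallest remaining cluster: differences within eps of the smallest one.
    cluster = [d for d in diffs if d - diffs[0] <= eps]
    d0 = sum(cluster) / len(cluster)            # first estimate of the grid step
    # Tag every point with an integer index relative to the smallest point.
    ms = [round((x - xs[0]) / d0) for x in xs]
    # Ordinary least-squares line through the (m, x) pairs.
    n = len(xs)
    mean_m = sum(ms) / n
    mean_x = sum(xs) / n
    var_m = sum((m - mean_m) ** 2 for m in ms)
    u0 = sum((m - mean_m) * (x - mean_x) for m, x in zip(ms, xs)) / var_m
    u = mean_x - u0 * mean_m
    worst = max(abs(x - (u + m * u0)) for m, x in zip(ms, xs))
    return u, u0, ms, worst

# Hypothetical noisy samples of a grid with step 2.5.
u, u0, ms, worst = fit_1d_grid([0.02, 2.49, 5.01, 10.0, 7.52], eps=0.1)
```

The final test from the text corresponds to checking `worst` against the threshold; if points are missing from the grid so that no two neighbours are one step apart, the initial estimate `d0` will be a multiple of the true step and the fit will be correspondingly coarse.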

Resources