How to find the accuracy of matrix multiplication with floating-point numbers?

I am trying to analyze how floating-point computation becomes more inaccurate as the precision of the data type decreases. In order to do that, I want to perform simple matrix operations on different floating-point representations, such as float64, float32, and float16. Since float64 computation will give the most precise and accurate result of the three, I treat every float64 computation as giving the expected result (i.e., error = 0).
The issue is that when I compare the calculated result with the expected result, I don't have a clear idea of how to quantify all the individual errors into a single metric. I know about certain ways to go about it, such as finding the error mean, or the sum of squared errors (SSE), but I just wanted to know if there is a standard way of calculating the overall error of a given matrix computation.

Perhaps a variant of the condition number can be helpful? See here: https://en.wikipedia.org/wiki/Condition_number#Matrices

if there was a standard way of calculating the overall error of a given matrix computation.
Consider the case where the matrix has size 1. Then we are in the familiar one-dimensional domain.
How should y_computed_as_float be compared to y_expected? Even in this case, there is no single standard for comparing floating-point numbers. Subtract? Divide? It is often context sensitive. So "no" to OP's question.
Yet there are common practices. So a potential "yes" to OP's question for select cases.
Floating point computations are often assessed by the difference between computed and math expected values scaled by the Unit in the last place*.
error = (y_computed_as_float - y_expected)/ulpf((float) y_expected);
For an N×N matrix, the overall error could be the root mean square of the N² element errors.
* Scaling by ULP has some issues near each power of 2 and more near 0.0. There are ways to mitigate that, but we are getting into the weeds.
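As a concrete illustration of this metric, here is a minimal sketch in Python/NumPy (my own, not from the answer above), using np.spacing() in place of the hypothetical ulpf() to get the ULP of each expected value in the lower precision, and then taking the root mean square over all elements:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
B = rng.standard_normal((100, 100))

reference = A @ B                                  # float64 product, treated as the expected result
computed = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

# np.spacing(x) is the ULP of x; evaluate it in float32 to match the computed precision
ulp = np.spacing(np.abs(reference).astype(np.float32)).astype(np.float64)
elementwise = (computed - reference) / ulp         # per-element error in float32 ULPs
rms = np.sqrt(np.mean(elementwise ** 2))
print(f"RMS error: {rms:.2f} ULPs")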

Related

Is it beneficial for precision to calculate the incremental mean/average?

In the question "What's the numerically best way to calculate the average" it was suggested that calculating a rolling mean, i.e.
mean = a[n]/n + (n-1)/n * mean
might be numerically more stable than calculating the sum and then dividing by the total number of elements. This was questioned by a commenter. I cannot tell which one is true; can someone else? The advantage of the rolling mean is that the running value stays small (i.e. roughly the same magnitude as the individual entries). Intuitively this should keep the error small. But the commenter claims:
Part of the issue is that 1/n introduces errors in the least significant bits, so n/n != 1, at least when it is performed as a three step operation (divide-store-multiply). This is minimized if the division is only performed once, but you'd be doing it over GB of data.
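The divide-then-multiply effect the commenter describes is easy to check directly; this small sketch (mine, not part of the original post) searches for values of n where (1.0/n)*n fails to round-trip in double precision:

# Print every n below 100 for which dividing and then multiplying back misses 1.0 exactly
bad = [n for n in range(1, 100) if (1.0 / n) * n != 1.0]
print(bad)   # non-empty on IEEE-754 doubles; 49 is the classic example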
So I have multiple questions:
Is the rolling mean more precise than summing and then dividing?
Does that depend on whether 1/n is calculated first and then multiplied?
If so, do computers implement a one step division? (I thought so, but I am unsure now)
If yes, is it more precise than Kahan summation and then dividing?
If comparable: which one is faster? In both cases we have additional calculations.
If more precise, could you use this for precise summation?
In many circumstances, yes. Consider a sequence of all-positive terms, all on the same order of magnitude. Adding them all generates a large intermediate sum, and the small terms added to it may be partially or entirely rounded away. Using the rolling mean, the quantities involved stay on the same order of magnitude, and in addition, the accumulator is much harder to overflow. However, this is not open and shut: adding the terms and then dividing allows us to use AVX instructions, which are significantly faster than the subtract/divide/add instructions of the rolling loop. In addition, there are distributions which cause one or the other to be more accurate. This has been examined in:
Robert F Ling. Comparison of several algorithms for computing sample means and variances. Journal of the American Statistical Association, 69(348): 859–866, 1974
Kahan summation is an orthogonal issue. You can apply Kahan summation to the sequence of rolling-mean updates (x[n] - mean)/n; this is very accurate.
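For a rough side-by-side comparison, here is a sketch (my own; data and sizes are arbitrary) that computes the mean of the same float32 data three ways, naive sum-then-divide, rolling mean, and Kahan-compensated sum-then-divide, against a float64 reference:

import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(0.0, 1.0, 100_000).astype(np.float32)
exact = np.mean(data.astype(np.float64))          # float64 reference

# 1) naive: running float32 sum, divide once at the end
total = np.float32(0.0)
for x in data:
    total += x
naive = total / np.float32(data.size)

# 2) rolling mean: mean += (x - mean) / n
mean = np.float32(0.0)
for n, x in enumerate(data, start=1):
    mean += (x - mean) / np.float32(n)

# 3) Kahan-compensated sum, then divide
s = np.float32(0.0)
c = np.float32(0.0)
for x in data:
    y = x - c
    t = s + y
    c = (t - s) - y
    s = t
kahan = s / np.float32(data.size)

for name, value in (("naive", naive), ("rolling", mean), ("kahan", kahan)):
    print(f"{name:8s} {float(value):.9f}   abs error {abs(float(value) - exact):.2e}")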

Alternating series: Problems with floating point precision

I have a list of numbers x obtained from measurements. I need to calculate the quantity
where a is a positive number. Unfortunately, the individual terms of the sum can be very large and the sign of each term is determined by k and the associated value of x. This leads to cancellation and loss of precision.
I have found a couple of approaches such as compensated summation but wanted to check whether I am on the right path and whether there are better alternatives.

Frequency determination from sparsely sampled data

I'm observing a sinusoidally-varying source, i.e. f(x) = a sin (bx + d) + c, and want to determine the amplitude a, offset c and period/frequency b - the shift d is unimportant. Measurements are sparse, with each source measured typically between 6 and 12 times, and observations are at (effectively) random times, with intervals between observations roughly between a quarter and ten times the period (just to stress, the spacing of observations is not constant for each source). In each source the offset c is typically quite large compared to the measurement error, while amplitudes vary - at one extreme they are only on the order of the measurement error, while at the other extreme they are about twenty times the error. Hopefully that fully outlines the problem; if not, please ask and I'll clarify.
Thinking naively about the problem, the average of the measurements will be a good estimate of the offset c, while half the range between the minimum and maximum value of the measured f(x) will be a reasonable estimate of the amplitude, especially as the number of measurements increases, so that the prospects of having observed the maximum offset from the mean improve. However, if the amplitude is small then it seems to me that there is little chance of accurately determining b, while the prospects should be better for large-amplitude sources even if they are only observed the minimum number of times.
Anyway, I wrote some code to do a least-squares fit to the data for the range of periods, and it identifies best-fit values of a, b and d quite effectively for the larger-amplitude sources. However, I see it finding a number of possible periods, and while one is the 'best' (in as much as it gives the minimum error-weighted residual) in the majority of cases the difference in the residuals for different candidate periods is not large. So what I would like to do now is quantify the possibility that the derived period is a 'false positive' (or, to put it slightly differently, what confidence I can have that the derived period is correct).
Does anybody have any suggestions on how best to proceed? One thought I had was to use a Monte-Carlo algorithm to construct a large number of sources with known values for a, b and c, construct samples that correspond to my measurement times, fit the resultant samples with my fitting code, and see what percentage of the time I recover the correct period. But that seems quite heavyweight, and I'm not sure that it's particularly useful other than giving a general feel for the false-positive rate.
And any advice on frameworks that might help? I have a feeling this is something that can likely be done in a line or two in Mathematica, but (a) I don't know it, and (b) I don't have access to it. I'm fluent in Java, competent in IDL and can probably figure out other things...
This looks tailor-made for working in the frequency domain. Apply a Fourier transform and identify the frequency based on where the power is located, which should be clear for a sinusoidal source.
ADDENDUM: To get an idea of how accurate your estimate is, I'd try a resampling approach such as cross-validation. I think this is the direction you're heading with the Monte Carlo idea; lots of work is out there, so hopefully that's a wheel you won't need to re-invent.
The trick here is to do something that might at first seem to make the problem more difficult. Rewrite f in the equivalent form:
f(x) = a1*sin(b*x) + a2*cos(b*x) + c
This is based on the angle-addition identity for sin(u + v).
Recognize that if b is known, then the problem of estimating {a1, a2, c} is a simple LINEAR regression problem. So all you need to do is use a 1-variable minimization tool, working on the value of b, to minimize the sum of squares of the residuals from that linear regression model. There are many such univariate optimizers to be found.
Once you have those parameters, it is easy to find the parameter a in your original model, since that is all you care about.
a = sqrt(a1^2 + a2^2)
The scheme I have described is called partitioned least squares.
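Here is a short sketch of that partitioned least-squares scheme (my own code and synthetic data, with names of my choosing): for each trial b the linear parameters come from an ordinary least-squares solve, and a 1-D optimizer searches over b:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 20.0, 10))            # sparse, irregular observation times
y = 2.0 * np.sin(1.3 * t + 0.7) + 10.0 + rng.normal(0.0, 0.1, t.size)

def residual_ss(b):
    # For a fixed frequency b, fit a1, a2, c by linear least squares
    design = np.column_stack([np.sin(b * t), np.cos(b * t), np.ones_like(t)])
    coeffs = np.linalg.lstsq(design, y, rcond=None)[0]
    return np.sum((design @ coeffs - y) ** 2)

# A real application should scan a grid of b values first: residual_ss has many local minima
best_b = minimize_scalar(residual_ss, bounds=(0.1, 5.0), method="bounded").x
design = np.column_stack([np.sin(best_b * t), np.cos(best_b * t), np.ones_like(t)])
a1, a2, c = np.linalg.lstsq(design, y, rcond=None)[0]
print("b =", best_b, " a =", np.hypot(a1, a2), " c =", c)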
If you have a reasonable estimate of the size and nature of your noise (e.g. white Gaussian with standard deviation sigma), you can
(a) invert the Hessian matrix to get an estimate of the error in your fitted parameters, and
(b) derive a significance statistic for your fit residuals fairly easily.
For (a), compare http://www.physics.utah.edu/~detar/phys6720/handouts/curve_fit/curve_fit/node6.html
For (b), assume that your measurement errors are independent and thus the variance of their sum is the sum of their variances.
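One concrete way to do (b), sketched by me under the assumption of independent Gaussian errors with known sigma, is a chi-square test: the sum of squared normalized residuals follows a chi-square distribution, which gives a p-value for the fit (the function name here is made up for illustration):

import numpy as np
from scipy.stats import chi2

def fit_p_value(residuals, sigma, n_fit_params):
    # Chi-square statistic of the fit and its tail probability (p-value)
    chi_sq = np.sum((np.asarray(residuals) / sigma) ** 2)
    dof = len(residuals) - n_fit_params
    return chi2.sf(chi_sq, dof)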

Kahan summation

Has anyone used Kahan summation in an application? When would the extra precision be useful?
I hear that on some platforms double operations are quicker than float operations. How can I test this on my machine?
Kahan summation works well when you are summing numbers and you need to minimize the worst-case floating-point error. Without this technique, you may have significant loss of precision in add operations if the two operands differ in magnitude by more than the available significant digits (e.g. 1 + 1e-12). Kahan summation compensates for this.
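A minimal Kahan summation sketch (mine, in Python with float32 via NumPy so the loss is easy to trigger): many terms that are individually too small to register against the running total are recovered by the compensation variable:

import numpy as np

def kahan_sum_f32(values):
    total = np.float32(0.0)
    c = np.float32(0.0)                 # running compensation for lost low-order bits
    for v in values:
        y = np.float32(v) - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

values = [1.0] + [1e-8] * 100_000       # exact sum is 1.001

naive = np.float32(0.0)
for v in values:
    naive += np.float32(v)              # each 1e-8 is rounded away against 1.0

print(naive)                            # stays at 1.0
print(kahan_sum_f32(values))            # close to 1.001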
And an excellent resource for floating point issues is here, "What every computer scientist should know about floating-point arithmetic": http://www.validlab.com/goldberg/paper.pdf
On single vs double precision performance: yes, single precision can be significantly faster, but it depends on the particular machine. See: https://www.hpcwire.com/2006/06/16/less_is_more_exploiting_single_precision_math_in_hpc-1/
The best way to test is to write a short example that tests the operations you care about, using both single (float) and double precision, and measure the runtimes.
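For example, a rough timing sketch along those lines (my own; the workload and sizes are arbitrary) comparing the same elementwise arithmetic in float32 and float64 with NumPy:

import numpy as np
import timeit

a32 = np.random.rand(10_000_000).astype(np.float32)
a64 = a32.astype(np.float64)

# Note: with arrays this large, memory bandwidth dominates, which is itself
# part of why single precision is often faster in practice.
t32 = timeit.timeit(lambda: np.sqrt(a32 * a32 + 1.0), number=20)
t64 = timeit.timeit(lambda: np.sqrt(a64 * a64 + 1.0), number=20)
print(f"float32: {t32:.3f} s   float64: {t64:.3f} s")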
I've used Kahan summation for Monte-Carlo integration. You have a scalar-valued function f which you believe is rather expensive to evaluate; a reasonable estimate is 65 ns/dimension. Then you accumulate those values into an average; updating an average takes about 4 ns. So if you update the average using Kahan summation (4x as many flops, ~16 ns), then you're really not adding that much compute to the total. Now, it is often said that the error of Monte-Carlo integration is σ/√N, but this is incorrect. The real error bound (in finite-precision arithmetic) is
σ/√N + cond(I_n)·ε·N
where cond(I_n) is the condition number of the summation and ε is twice the unit roundoff. So the algorithm diverges faster than it converges. For 32-bit arithmetic, getting εN ~ 1 is easy: 10^7 evaluations can be done exceedingly quickly, and after that your Monte-Carlo integration goes on a random walk. The situation is even worse when the condition number is large.
If you use Kahan summation, the expression for the error changes to
σ/√N + cond(I_n)·ε²·N,
which admittedly still diverges faster than it converges, but ε²·N cannot be made large on a reasonable timescale on modern hardware.
I've used Kahan summation to compensate for accumulated error when computing running averages. It does make quite a difference, and it's easy to test. I eliminated rather large errors after only 100 summations.
I would definitely use the Kahan summation algorithm to compensate for the error in any running totals.
However, I've noticed quite large (1e-3) errors when doing inverse matrix multiplication. Basically, with A*x = y, computing inv(A)*y ~= x does not give me the original values back exactly. That is fine, but I thought maybe Kahan summation would help (there is a lot of addition), especially with larger matrices (>3-by-3). I tried it with a 4-by-4 matrix and it did not improve the situation at all.
When would the extra precision be useful?
Very roughly:
Case 1
When you are summing up a lot of data in a non-sequential fashion, i.e. computing partial sums and then summing up those sums (as opposed to iterating over all the data with a single running sum), Kahan summation makes a lot of sense in the second phase, when you sum up the sums: the errors you're avoiding are by then more significant, while the overhead is paid only for a small fraction of the overall additions. (A short sketch of this appears after Case 2 below.)
Case 2
When you're working with a lower-precision floating-point type, without being sure you're meeting the accuracy requirement, and you're not allowed to switch to a larger, higher-precision type.
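Here is the sketch of Case 1 referred to above (my own illustration): sum each block quickly (here with NumPy), then combine the relatively few partial sums with Kahan compensation, where the magnitudes are larger and the compensation pays off most:

import numpy as np

def blocked_kahan_sum(data, block=4096):
    # Fast per-block sums, then Kahan compensation over the partial sums only
    partials = [float(np.sum(data[i:i + block])) for i in range(0, len(data), block)]
    total, c = 0.0, 0.0
    for p in partials:
        y = p - c
        t = total + y
        c = (t - total) - y
        total = t
    return total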

math: scale coordinate system so that certain points get integer coordinates

This is more of a mathematical problem; nonetheless I am looking for an algorithm in pseudocode to solve it.
Given is a one-dimensional coordinate system with a number of points. The coordinates of the points may be floating-point values.
Now I am looking for a factor that scales this coordinate system so that all points land on fixed numbers (i.e. integer coordinates).
If I am not mistaken, there should be a solution to this problem as long as the number of points is not infinite.
If I am wrong and there is no analytical solution to this problem, I am interested in an algorithm that approximates the solution as closely as possible (i.e. the coordinates will look like 15.0001).
If you are interested in the concrete problem:
I would like to overcome the well-known pixel-snapping problem in Adobe Flash, which cuts off half-pixels at the border of bitmaps if the whole stage is scaled. I would like to find an ideal scaling factor for the stage which places my bitmaps on whole (screen-)pixel coordinates.
Since I am placing two bitmaps on the stage, the number of points will be 4 in each direction (x, y).
Thanks!
As suggested, you have to convert your floating point numbers to rational ones. Fix a tolerance epsilon, and for each coordinate, find its best rational approximation within epsilon.
An algorithm and the relevant definitions are outlined in this section.
Once you have converted all the coordinates into rational numbers, the scaling is given by the least common multiple of the denominators.
Note that this latter number can become quite huge, so you may want to experiment with epsilon in order to control the size of the denominators.
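A sketch of this approach with Python's Fraction (my own code, Python 3.9+ for math.lcm; limit_denominator bounds the denominator rather than the error directly, which plays the role of the epsilon above):

from fractions import Fraction
from math import lcm

def integer_scale(coords, max_denominator=1000):
    fracs = [Fraction(c).limit_denominator(max_denominator) for c in coords]
    scale = lcm(*(f.denominator for f in fracs))
    return scale, [int(f * scale) for f in fracs]

scale, scaled = integer_scale([0.5, 1.25, 0.334, 2.0])
print(scale, scaled)   # 500 [250, 625, 167, 1000]; 0.334 is approximated as 167/500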
My own inclination, if I were in your situation, would be to work with rational numbers rather than floating point.
And the algorithm you are looking for is finding the lowest common denominator.
A floating point number is an integer, multiplied by a power of two (the power might be negative).
So, find the largest necessary power of two among your inputs, and that gives you a scale factor that will work. The power of two isn't just -1 times the exponent of the float, it's a few more than that (according to where the least significant 1 bit is in the significand).
It's also optimal: if x times a power of 2 is an odd integer, then x in its float representation was already in simplest rational form, and there is no smaller integer you can multiply x by to get an integer.
Obviously if you have a mixture of large and small values among your input, then the resulting integers will tend to be bigger than 64 bit. So there is an analytical solution, but perhaps not a very good one given what you want to do with the results.
Note that this approach treats floats as being precise representations, which they are not. You may get more sensible results by representing each float as a rational number with smaller denominator (within some defined tolerance), then taking the lowest common multiple of all the denominators.
The problem there, though, is the approximation process: if the input float is 0.334[*] then I can't in general be sure whether the person who gave it to me really meant 0.334, or whether it's 1/3 with some inaccuracy. I therefore don't know whether to use a scale factor of 3 and say the scaled result is 1, or use a scale factor of 500 and say the scaled result is 167. And that's just with one input, never mind a bunch of them.
With 4 inputs and an allowed final tolerance of 0.0001, you could perhaps find the 10 closest rationals to each input with a certain maximum denominator, then try 10^4 different combinations and see whether the resulting scale factor gives you any values that are too far from an integer. Brute force seems nasty, but you might at least be able to bound the search a bit as you go. Also, "maximum denominator" might be expressed in terms of the primes present in the factorization, rather than just its size, since if you can find a lot of common factors among the denominators then they'll have a smaller lcm, and hence a smaller deviation from integers after scaling.
[*] Not that 0.334 is an exact float value, but that sort of thing. Decimal examples are easier.
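And a sketch of the exact power-of-two version described above (again my own code, using Fraction, which converts a Python float exactly; the denominator of the reduced fraction is always a power of two):

from fractions import Fraction
from math import lcm

def exact_power_of_two_scale(values):
    # Smallest integer (a power of two) that maps every input float to an integer exactly
    return lcm(*(Fraction(v).denominator for v in values))

print(exact_power_of_two_scale([0.5, 0.75, 3.0]))   # 4
print(exact_power_of_two_scale([0.1]))              # huge, because 0.1 is not exactly representable in binary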
If you are talking about single-precision floating-point numbers, then (for normal numbers) the value can be expressed like this according to wikipedia:
value = (-1)^sign × 2^(e - 127) × (1 + fraction/2^23)
From this formula you can deduce that you always get an integer if you multiply by 2^(127+23). (Actually, when e is 0 you have to use another formula for the special range of "subnormal" numbers, so 2^(126+23) is sufficient. See the linked wikipedia article for details.)
To do this in code you will probably need to do some bit twiddling to extract the factors in the above formula from the bits in the floating point value. And then you will need some kind of support for unlimited size numbers to express the integer result of the scaling (e.g. BigInteger in .NET). Normal primitive types in most languages/platforms are typically limited to much smaller sizes.
It's really a problem in statistical inference combined with noise reduction. This is the method I'm going to try out soon. I'm assuming you're trying to get a regularly spaced 2-D grid but a similar method could work on a regularly spaced grid of 3 or more dimensions.
First tabulate all the differences and note that (dx,dy) and (-dx,-dy) denote the same displacement, so there is an equivalence relation. Group those differences that are within a pre-assigned threshold (epsilon) of one another. Epsilon should be large enough to capture measurement errors due to random noise or lack of image resolution, but small enough not to accidentally combine clusters.
Sort the clusters by their average size (dr = root(dx^2 + dy^2)).
If the original grid was, indeed, regularly spaced and generated by two independent basis vectors, then the two smallest linearly independent clusters will indicate so. The smallest cluster is the one centered on (0, 0). The next smallest cluster (dx0, dy0) gives the first basis vector, up to a +/- sign (recall that (-dx0, -dy0) denotes the same displacement).
The next smallest clusters may be linearly dependent on this (up to the threshold epsilon) by virtue of being multiples of (dx0, dy0). Find the smallest cluster which is NOT a multiple of (dx0, dy0). Call this (dx1, dy1).
Now you have enough to tag the original vectors. Sort the vectors in increasing lexicographic order: (x,y) > (x',y') if x > x', or if x = x' and y > y'. Take the smallest (x0,y0) and assign the integer pair (0, 0) to it. Take each of the others (x,y), find the decomposition (x,y) = (x0,y0) + M0(x,y)·(dx0,dy0) + M1(x,y)·(dx1,dy1), and assign it the integers (m0(x,y), m1(x,y)) = (round(M0), round(M1)).
Now do a least-squares fit of the vectors against their integer labels using the equation
(x,y) = (ux,uy) + m0(x,y)·(u0x,u0y) + m1(x,y)·(u1x,u1y)
to find (ux,uy), (u0x,u0y) and (u1x,u1y). This identifies the grid.
Test this match to determine whether or not all the points are within a given threshold of this fit (maybe using the same threshold epsilon for this purpose).
The 1-D version of this same routine should also work on a spectrograph to identify the fundamental frequency in a voice print. In that case, the assumed value for ux (which replaces (ux,uy)) is just 0, and one is only looking for a fit to the homogeneous equation x = m0(x)·u0x.
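For the least-squares step above, a minimal sketch (my own names) once each observed point has been assigned its integer labels (m0, m1):

import numpy as np

def fit_grid(points, m0, m1):
    # points: (N, 2) array of observed (x, y); m0, m1: integer labels per point.
    # Solves (x, y) = (ux, uy) + m0*(u0x, u0y) + m1*(u1x, u1y) in the least-squares sense.
    design = np.column_stack([np.ones(len(points)), m0, m1])
    coeffs = np.linalg.lstsq(design, np.asarray(points), rcond=None)[0]
    origin, basis0, basis1 = coeffs     # each is a 2-vector
    return origin, basis0, basis1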
