Setting decomposition threshold (tolerance) Eigen::JacobiSVD - eigen

I am trying to experiment with JacobiSVD of Eigen. In particular I am trying to reconstruct the input matrix from its singular value decomposition. http://eigen.tuxfamily.org/dox/classEigen_1_1JacobiSVD.html.
Eigen::MatrixXf m = Eigen::MatrixXf::Random(3,3);
Eigen::JacobiSVD<Eigen::MatrixXf, Eigen::NoQRPreconditioner> svd(m, Eigen::ComputeFullU | Eigen:: ComputeFullV);
Eigen::VectorXf SVec = svd.singularValues();
Eigen::MatrixXf S = Eigen::MatrixXf::Identity(3,3);
S(0,0) = SVec(0);
S(1,1) = SVec(1);
S(2,2) = SVec(2);
Eigen::MatrixXf recon = svd.matrixU() * S * svd.matrixV().transpose();
cout<< "diff : \n"<< m - recon << endl;
I know that internally the SVD is computed by an iterative method and can never get a perfect reconstruction. The errors are in order of 10^-7. With the above code the output is --
diff :
9.53674e-07 1.2517e-06 -2.98023e-07
-4.47035e-08 1.3113e-06 8.9407e-07
5.96046e-07 -9.53674e-07 -7.7486e-07
For my application this error is too high, I am aiming for an error in the range 10^-10 - 10^-12. My question is how to set the threshold for the decomposition.
NOTE : In the docs I have noted that there is a method setThreshold() but it clearly states that this does not set a threshold for the decomposition but for singular values comparison with zero.
NOTE : As far as possible I do not wish to go for double. Is it even possible with float?

A single precision floating point (a 32 bit float) has between six to nine significant decimal figures, so your requirement of 10^{-10} is impossible (assuming the values are around 0.5f). A double precision floating point (a 64 bit double) has 15-17 significant decimal figures, so should work as long as the values aren't 10^6.

Related

How to compare matrices in GNU Octave [duplicate]

I am writing a program where I need to delete duplicate points stored in a matrix. The problem is that when it comes to check whether those points are in the matrix, MATLAB can't recognize them in the matrix although they exist.
In the following code, intersections function gets the intersection points:
[points(:,1), points(:,2)] = intersections(...
obj.modifiedVGVertices(1,:), obj.modifiedVGVertices(2,:), ...
[vertex1(1) vertex2(1)], [vertex1(2) vertex2(2)]);
The result:
>> points
points =
12.0000 15.0000
33.0000 24.0000
33.0000 24.0000
>> vertex1
vertex1 =
12
15
>> vertex2
vertex2 =
33
24
Two points (vertex1 and vertex2) should be eliminated from the result. It should be done by the below commands:
points = points((points(:,1) ~= vertex1(1)) | (points(:,2) ~= vertex1(2)), :);
points = points((points(:,1) ~= vertex2(1)) | (points(:,2) ~= vertex2(2)), :);
After doing that, we have this unexpected outcome:
>> points
points =
33.0000 24.0000
The outcome should be an empty matrix. As you can see, the first (or second?) pair of [33.0000 24.0000] has been eliminated, but not the second one.
Then I checked these two expressions:
>> points(1) ~= vertex2(1)
ans =
0
>> points(2) ~= vertex2(2)
ans =
1 % <-- It means 24.0000 is not equal to 24.0000?
What is the problem?
More surprisingly, I made a new script that has only these commands:
points = [12.0000 15.0000
33.0000 24.0000
33.0000 24.0000];
vertex1 = [12 ; 15];
vertex2 = [33 ; 24];
points = points((points(:,1) ~= vertex1(1)) | (points(:,2) ~= vertex1(2)), :);
points = points((points(:,1) ~= vertex2(1)) | (points(:,2) ~= vertex2(2)), :);
The result as expected:
>> points
points =
Empty matrix: 0-by-2
The problem you're having relates to how floating-point numbers are represented on a computer. A more detailed discussion of floating-point representations appears towards the end of my answer (The "Floating-point representation" section). The TL;DR version: because computers have finite amounts of memory, numbers can only be represented with finite precision. Thus, the accuracy of floating-point numbers is limited to a certain number of decimal places (about 16 significant digits for double-precision values, the default used in MATLAB).
Actual vs. displayed precision
Now to address the specific example in the question... while 24.0000 and 24.0000 are displayed in the same manner, it turns out that they actually differ by very small decimal amounts in this case. You don't see it because MATLAB only displays 4 significant digits by default, keeping the overall display neat and tidy. If you want to see the full precision, you should either issue the format long command or view a hexadecimal representation of the number:
>> pi
ans =
3.1416
>> format long
>> pi
ans =
3.141592653589793
>> num2hex(pi)
ans =
400921fb54442d18
Initialized values vs. computed values
Since there are only a finite number of values that can be represented for a floating-point number, it's possible for a computation to result in a value that falls between two of these representations. In such a case, the result has to be rounded off to one of them. This introduces a small machine-precision error. This also means that initializing a value directly or by some computation can give slightly different results. For example, the value 0.1 doesn't have an exact floating-point representation (i.e. it gets slightly rounded off), and so you end up with counter-intuitive results like this due to the way round-off errors accumulate:
>> a=sum([0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]); % Sum 10 0.1s
>> b=1; % Initialize to 1
>> a == b
ans =
logical
0 % They are unequal!
>> num2hex(a) % Let's check their hex representation to confirm
ans =
3fefffffffffffff
>> num2hex(b)
ans =
3ff0000000000000
How to correctly handle floating-point comparisons
Since floating-point values can differ by very small amounts, any comparisons should be done by checking that the values are within some range (i.e. tolerance) of one another, as opposed to exactly equal to each other. For example:
a = 24;
b = 24.000001;
tolerance = 0.001;
if abs(a-b) < tolerance, disp('Equal!'); end
will display "Equal!".
You could then change your code to something like:
points = points((abs(points(:,1)-vertex1(1)) > tolerance) | ...
(abs(points(:,2)-vertex1(2)) > tolerance),:)
Floating-point representation
A good overview of floating-point numbers (and specifically the IEEE 754 standard for floating-point arithmetic) is What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg.
A binary floating-point number is actually represented by three integers: a sign bit s, a significand (or coefficient/fraction) b, and an exponent e. For double-precision floating-point format, each number is represented by 64 bits laid out in memory as follows:
The real value can then be found with the following formula:
This format allows for number representations in the range 10^-308 to 10^308. For MATLAB you can get these limits from realmin and realmax:
>> realmin
ans =
2.225073858507201e-308
>> realmax
ans =
1.797693134862316e+308
Since there are a finite number of bits used to represent a floating-point number, there are only so many finite numbers that can be represented within the above given range. Computations will often result in a value that doesn't exactly match one of these finite representations, so the values must be rounded off. These machine-precision errors make themselves evident in different ways, as discussed in the above examples.
In order to better understand these round-off errors it's useful to look at the relative floating-point accuracy provided by the function eps, which quantifies the distance from a given number to the next largest floating-point representation:
>> eps(1)
ans =
2.220446049250313e-16
>> eps(1000)
ans =
1.136868377216160e-13
Notice that the precision is relative to the size of a given number being represented; larger numbers will have larger distances between floating-point representations, and will thus have fewer digits of precision following the decimal point. This can be an important consideration with some calculations. Consider the following example:
>> format long % Display full precision
>> x = rand(1, 10); % Get 10 random values between 0 and 1
>> a = mean(x) % Take the mean
a =
0.587307428244141
>> b = mean(x+10000)-10000 % Take the mean at a different scale, then shift back
b =
0.587307428244458
Note that when we shift the values of x from the range [0 1] to the range [10000 10001], compute a mean, then subtract the mean offset for comparison, we get a value that differs for the last 3 significant digits. This illustrates how an offset or scaling of data can change the accuracy of calculations performed on it, which is something that has to be accounted for with certain problems.
Look at this article: The Perils of Floating Point. Though its examples are in FORTRAN it has sense for virtually any modern programming language, including MATLAB. Your problem (and solution for it) is described in "Safe Comparisons" section.
type
format long g
This command will show the FULL value of the number. It's likely to be something like 24.00000021321 != 24.00000123124
Try writing
0.1 + 0.1 + 0.1 == 0.3.
Warning: You might be surprised about the result!
Maybe the two numbers are really 24.0 and 24.000000001 but you're not seeing all the decimal places.
Check out the Matlab EPS function.
Matlab uses floating point math up to 16 digits of precision (only 5 are displayed).

Sampling from all possible floats in D

In the D programming language, the standard random (std.random) module provides a simple mechanism for generating a random number in some specified range.
auto a = uniform(0, 1024, gen);
What is the best way in D to sample from all possible floating point values?
For clarification, sampling from all possible 32-bit integers can be done as follows:
auto l = uniform!int(); // randomly selected int from all possible integers
Depends on the kind of distribution you want.
A uniform distribution over all possible values could be done by generating a random ulong and then casting the bits into floating point. For T being float or double:
union both { ulong input; T output; }
both val;
val.input = uniform!"[]"(ulong.min, ulong.max);
return val.output;
Since roughly half of the positive floating point numbers are between 0 and 1, this method will often give you numbers near zero.`It will also give you infinity and NaN values.
Aside: This code should be fine with D, but would be undefined behavior in C/C++. Use memcpy there.
If you prefer a uniform distribution over all possible numbers in floating point (equal probability for 0..1 and 1..2 etc), you need something like the normal uniform!double, which unfortunately does not work very well for large numbers. It also will not generate infinity or NaN. You could generate double numbers and convert them to float, but I have no answer for generating random large double numbers.

Should a Float or Int be used in this RNG?

I am using a simple Linear Congruential Generator to generate random numbers. The problem is, the result is behaving inconsistently depending on if I use Floats (known as Numbers in some languages) or Ints
// Variable definitions
var _seed:int = 1;
const MULTIPLIER:int = 48271;
const MODULUS:int = 2147483647; // 0x7FFFFFFF (31 bit integer)
// Inside the function
return _seed = ((_seed * MULTIPLIER) % MODULUS) & MODULUS;
The part I'm having difficulties with is the (_seed * MULTIPLIER) part. If _seed and MULTIPLIER are Ints, the int*int multiplication ensues, and most languages give an int as a result. The problem is, if that int is too large, the resulting value is truncated down.
Is this integer overflow behavior "supposed to be done" in RNGs, or should I cast _seed and MULTIPLIER to Floats before the multiplication in order to allow for larger variables?
LCG's are implemented with integer arithmetic because floating point arithmetic is only approximate - a floating point implementation will diverge from the integer implementation and won't yield full cycle for the generator. Even a double only has 52 mantissa bits, which is fewer than required to store the product of two 32 bit ints with guaranteed precision. With modulo arithmetic it's the low bits that are significant, and they're the ones at risk of getting lopped off.
Solutions:
You should be doing the intermediate arithmetic using 64 bit integers, then
cast/convert the result back to 32 bit ints after the modulo operation.
Explicitly break up the multiplication into low bits/high bits
components, and then recombine them after the modulo operation.
This is what Schrage did to achieve this portable FORTRAN
implementation of a relatively popular (at the time) LCG.

Are floats a secure alternative to generating an un-biased random number

I have always generated un-biased random numbers by throwing away any numbers in the biased range. Similar to this
int biasCount = MAX_INT % max
int maxSafeNumber = MAX_INT - biasCount;
int generatedNumber = 0;
do
{
generatedNumber = GenerateNumber();
} while (generatedNumber > maxSafeNumber)
return generatedNumber % max;
Today a friend showed me how he generated random numbers by converting the generated number into a float, then multiplying that against the max.
float percent = generatedNumber / (float)MAX_INT;
return (int)(percent * max);
This seems to solve the bias issue by not having to use a modulus in the first place. It also looks simple and fast. Is there any reason why the float approach would not be as secure (unbiased) as the first one?
The float method with a floor (i.e. your cast) introduces a bias
against the largest value in your range.
In order to return max, generatedNumber == MAX_INT must be true.
So max has probability 1/MAX_INT, while every other number in the
range has probability max/MAX_INT
As Henry points out, there's also the issue of aliasing if MAX_INT
is not a multiple of max. This makes some values in the range more
likely than others. The larger the difference between max and MAX_INT the smaller this bias is.
(Assuming you get, and want, a uniform distribution.)
This presentation by Stephan T. Lavavej from GoingNative 2013 goes over a lot of common fallacies with random numbers, including these range schemes. It's C++ centric in the implementations, but all the concepts carry over to any language:
http://channel9.msdn.com/Events/GoingNative/2013/rand-Considered-Harmful
The float method may not generate uniformly distributed output numbers even when the input numbers are uniformly distributed. To see where it breaks down do some examples with small numbers e.g. max = 6, MAX_INT = 8
it gets better when MAX_INT is large, but it is almost never perfect.

linear interpolation on 8bit microcontroller

I need to do a linear interpolation over time between two values on an 8 bit PIC microcontroller (Specifically 16F627A but that shouldn't matter) using PIC assembly language. Although I'm looking for an algorithm here as much as actual code.
I need to take an 8 bit starting value, an 8 bit ending value and a position between the two (Currently represented as an 8 bit number 0-255 where 0 means the output should be the starting value and 255 means it should be the final value but that can change if there is a better way to represent this) and calculate the interpolated value.
Now PIC doesn't have a divide instruction so I could code up a general purpose divide routine and effectivly calculate (B-A)/(x/255)+A at each step but I feel there is probably a much better way to do this on a microcontroller than the way I'd do it on a PC in c++
Has anyone got any suggestions for implementing this efficiently on this hardware?
The value you are looking for is (A*(255-x)+B*x)/255. It requires only 8x8 multiplication, and a final division by 255, which can be approximated by simply taking the high byte of the sum.
Choosing x in range 0..128, no approximation is needed: take the high byte of (A*(128-x)+B*x)<<1.
Assuming you interpolate a sequence of values where the previous endpoint is the new start point:
(B-A)/(x/255)+A
sounds like a bad idea. If you use base 255 as a fixedpoint representation, you get the same interpolant twice. You get B when x=255 and B as the new A when x=0.
Use 256 as the fixedpoint system. Divides become shifts, but you need 16-bit arithmetic and 8x8 multiplication with a 16-bit result. The previous issue can be fixed by simply ignoring any bits in the higher-bytes as x mod 256 becomes 0. This suggestion uses 16-bit multiplication, but can't overflow. and you don't interpolate over the same x twice.
interp = (a*(256 - x) + b*x) >> 8
256 - x becomes just a subtract-with-borrow, as you get 0 - x.
The PIC lacks these operations in its instruction set:
Right and left shift. (both logical and arithmetic)
Any form of multiplication.
You can get right-shifting by using rotate-right instead, followed by masking out the extra bits on the left with bitwise-and. A straight-forward way to do 8x8 multiplication with 16-bit result:
void mul16(
unsigned char* hi, /* in: operand1, out: the most significant byte */
unsigned char* lo /* in: operand2, out: the least significant byte */
)
{
unsigned char a,b;
/* loop over the smallest value */
a = (*hi <= *lo) ? *hi : *lo;
b = (*hi <= *lo) ? *lo : *hi;
*hi = *lo = 0;
while(a){
*lo+=b;
if(*lo < b) /* unsigned overflow. Use the carry flag instead.*/
*hi++;
--a;
}
}
The techniques described by Eric Bainville and Mads Elvheim will work fine; each one uses two multiplies per interpolation.
Scott Dattalo and Tony Kubek have put together a super-optimized PIC-specific interpolation technique called "twist" that is slightly faster than two multiplies per interpolation.
Is using this difficult-to-understand technique worth running a little faster?
You could do it using 8.8 fixed-point arithmetic. Then a number from range 0..255 would be interpreted as 0.0 ... 0.996 and you would be able to multiply and normalize it.
Tell me if you need any more details or if it's enough for you to start.
You could characterize this instead as:
(B-A)*(256/(x+1))+A
using a value range of x=0..255, precompute the values of 256/(x+1) as a fixed-point number in a table, and then code a general purpose multiply, adjust for the position of the binary point. This might not be small spacewise; I'd expect you to need a 256 entry table of 16 bit values and the multiply code. (If you don't need speed, this would suggest your divison method is fine.). But it only takes one multiply and an add.
My guess is that you don't need every possible value of X. If there are only a few values of X, you can compute them offline, do a case-select on the specific value of X and then implement the multiply in terms of a fixed sequence of shifts and adds for the specific value of X. That's likely to be pretty efficient in code and very fast for a PIC.
Interpolation
Given two values X & Y , its basically:
(X+Y)/2
or
X/2 + Y/2 (to prevent the odd-case that A+B might overflow the size of the register)
Hence try the following:
(Pseudo-code)
Initially A=MAX, B=MIN
Loop {
Right-Shift A by 1-bit.
Right-Shift B by 1-bit.
C = ADD the two results.
Check MSB of 8-bit interpolation value
if MSB=0, then B=C
if MSB=1, then A=C
Left-Shift 8-bit interpolation value
}Repeat until 8-bit interpolation value becomes zero.
The actual code is just as easy. Only i do not remember the registers and instructions off-hand.

Resources