Algorithm for superimposition of 3D points

I need to superimpose two groups of 3D points on top of each other; i.e. find rotation and translation matrices to minimize the RMSD (root mean square deviation) between their coordinates.
I currently use the Kabsch algorithm, which is not very useful for many of the cases I need to deal with. Kabsch requires an equal number of points in both data sets, plus it needs to know beforehand which point is going to be aligned with which. In my case, the number of points will be different, and I don't care which point corresponds to which in the final alignment, as long as the RMSD is minimized.
So the algorithm will (presumably) find a 1-1 mapping between subsets of the two point sets such that AFTER rotation & translation, the RMSD is minimized.
I know of some algorithms that deal with differing numbers of points; however, they are all protein-based, that is, they try to align the backbones together (some continuous segment is aligned with another continuous segment, etc.), which is not useful for points floating in space without any connections. (OK, to be clear, some points are connected; but there are points without any connections which I don't want to ignore during superimposition.)
The only algorithm I found is DIP-OVL, in the (open-source) STRAP software module. I tried the code, but the behaviour seems erratic; sometimes it finds good alignments, sometimes it can't align a set of a few points with itself after a simple X translation.
Does anyone know of an algorithm that deals with such limitations? I'll have at most ~10^2 to ~10^3 points, in case performance is an issue.
To be honest, the objective function to use is not very clear. RMSD is defined as the RMS of the distance between the corresponding points. If I have two sets with 50 and 100 points, and the algorithm matches only one or a few points within the sets, the resulting RMSD between those few points will be zero, while the overall superposition may not be so great. RMSD between all pairs of points is not a better solution (I think).
The only thing I can think of is to find the closest point in set X for each point in set Y (so there will be exactly min(|X|,|Y|) matches, e.g. 50 in that case) and calculate the RMSD from those matches. But the distance calculation and bipartite matching portion seem too computationally complex to call in a batch fashion. Any help in that area will help as well.
Thanks!

What you said looks like a "cloud to cloud registration" task. Take a look into http://en.wikipedia.org/wiki/Iterative_closest_point and http://www.willowgarage.com/blog/2011/04/10/modular-components-point-cloud-registration for example. You can play with your data in open source Point Cloud Library to see if it works for you.
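To make the ICP idea concrete, here is a rough Python/NumPy sketch (not PCL; the function names and the use of scipy.spatial.cKDTree are my own choices). Each iteration matches every point of the moving set to its nearest neighbour in the fixed set and then solves the rigid Kabsch/SVD problem on those pairs. Note that plain ICP assumes a reasonable starting pose and can get stuck in a local minimum.

# Minimal iterative-closest-point (ICP) sketch with NumPy/SciPy.
# Assumptions: X is (n, 3), Y is (m, 3); we rigidly move Y onto X.
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(P, Q):
    """Rotation R and translation t minimizing ||R*p_i + t - q_i|| (Kabsch/SVD)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

def icp(Y, X, iters=50):
    """Align point set Y to X; returns the moved copy of Y."""
    tree = cKDTree(X)
    Y = Y.copy()
    for _ in range(iters):
        _, idx = tree.query(Y)                   # closest point in X for each point of Y
        R, t = best_rigid_transform(Y, X[idx])
        Y = Y @ R.T + t
    return Y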

If you know which pairs of points correspond to each other, you can recover the transformation matrix with Linear Least Squares (LLS).
When considering LLS, you normally would want to find an approximation of x in A*x = b. With a transpose, you can solve for A instead of x.
Extend each source and target vector with "1", so they look like <x, y, z, 1>
Equation: A · x_i = b_i
Extend to multiple vectors: A · X = B
Transpose: (A · X)^T = B^T
Simplify: X^T · A^T = B^T
Substitute P = X^T, Q = A^T and R = B^T. The result is: P · Q = R
Apply the formula for LLS: Q ≈ (P^T · P)^-1 · P^T · R.
Substitute back: A^T ≈ (X · X^T)^-1 · X · B^T
Solve for A, and simplify: A ≈ B · X^T · (X · X^T)^-1
(B · X^T) and (X · X^T) can be computed iteratively by summing up the outer products of the individual vector pairs.
B · X^T = ∑ b_i · x_i^T
X · X^T = ∑ x_i · x_i^T
A ≈ (∑ b_i · x_i^T) · (∑ x_i · x_i^T)^-1
No matrix will be bigger than 4×4, so the algorithm does not use any excessive memory.
The result is not necessarily affine, but probably close. With some further processing, you can make it affine.
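As an illustration of the derivation above, here is a minimal NumPy sketch (the function name is mine) that recovers A from known correspondences using homogeneous coordinates; it needs at least 4 points in general position so that X · X^T is invertible:

# Least-squares recovery of a transform A with A @ x_i ≈ b_i, following the
# derivation above: A ≈ (Σ b_i x_i^T) (Σ x_i x_i^T)^-1.  Homogeneous coordinates
# <x, y, z, 1> let A carry the translation as well.  Sketch only.
import numpy as np

def fit_transform(src, dst):
    """src, dst: (n, 3) arrays of corresponding points. Returns a 4x4 matrix A."""
    n = len(src)
    X = np.hstack([src, np.ones((n, 1))]).T   # 4 x n, columns are x_i
    B = np.hstack([dst, np.ones((n, 1))]).T   # 4 x n, columns are b_i
    BXt = B @ X.T                             # Σ b_i x_i^T
    XXt = X @ X.T                             # Σ x_i x_i^T
    return BXt @ np.linalg.inv(XXt)           # A ≈ (Σ b_i x_i^T)(Σ x_i x_i^T)^-1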

The best algorithm for discovering alignments through superimposition is Procrustes Analysis or Horn's method. Please follow this Stackoverflow link.

How many paths of length n with the same start and end point can be found on a hexagonal grid?

Given this question, what about the special case when the start point and end point are the same?
Another change in my case is that we must move at every step. How many such paths can be found and what would be the most efficient approach? I guess this would be a random walk of some sort?
My thinking so far is, since we must always return to our starting point, thinking about n/2 might be easier. At every step, except at step n/2, we have 6 choices. At n/2 we have a different number of choices depending on whether n is even or odd. We also have a different number of choices depending on where we are (what previous choices we made). For example, if n is even and we went straight out, we only have one choice at n/2: going back. But if n is even and we didn't go straight out, we have more choices.
It is all the cases at this turning point that I have trouble getting straight.
Am I on the right track?
To be clear, I just want to count the paths. So I guess we are looking for some conditioned permutation?
This version of the combinatorial problem looks like it actually has a short formula as an answer.
Nevertheless, the general version, both this and the original question's, can be solved by dynamic programming in O (n^3) time and O (n^2) memory.
Consider a hexagonal grid which spans at least n steps in all directions from the target cell.
Introduce a coordinate system, so that every cell has coordinates of the form (x, y).
Let f (k, x, y) be the number of ways to arrive at cell (x, y) from the starting cell after making exactly k steps.
These can be computed either recursively or iteratively:
f (k, x, y) is just the sum of f (k-1, x', y') for the six neighboring cells (x', y').
The base case is f (0, xs, ys) = 1 for the starting cell (xs, ys), and f (0, x, y) = 0 for every other cell (x, y).
The answer for your particular problem is the value f (n, xs, ys).
The general structure of an iterative solution is as follows:
let f be an array [0..n] [-n-1..n+1] [-n-1..n+1] (all inclusive) of integers
f[0][*][*] = 0
f[0][xs][ys] = 1
for k = 1, 2, ..., n:
    for x = -n, ..., n:
        for y = -n, ..., n:
            f[k][x][y] =
                f[k-1][x-1][y] +
                f[k-1][x][y-1] +
                f[k-1][x+1][y] +
                f[k-1][x][y+1]
answer = f[n][xs][ys]
OK, I cheated here: the solution above is for a rectangular grid, where the cell (x, y) has four neighbors.
The six neighbors of a hexagon depend on how exactly we introduce a coordinate system.
I'd prefer other coordinate systems than the one in the original question.
This link gives an overview of the possibilities, and here is a short summary of that page on StackExchange, to protect against link rot.
My personal preference would be axial coordinates.
Note that, if we allow standing still instead of moving to one of the neighbors, that just adds one more term, f[k-1][x][y], to the formula.
The same goes for using triangular, rectangular, or hexagonal grid, for using 4 or 8 or some other subset of neighbors in a grid, and so on.
If you want to arrive to some other target cell (xt, yt), that is also covered: the answer is the value f[n][xt][yt].
Similarly, if you have multiple start or target cells, and you can start and finish at any of them, just alter the base case or sum the answers in the cells.
The general layout of the solution remains the same.
This obviously works in n * (2n+1) * (2n+1) * number-of-neighbors, which is O(n^3) for any constant number of neighbors (4 or 6 or 8...) a cell may have in our particular problem.
Finally, note that, at step k of the main loop, we need only two layers of the array f: f[k-1] is the source layer, and f[k] is the target layer.
So, instead of storing all layers for the whole time, we can store just two layers, as we don't need more: one for odd k and one for even k.
Using only two layers is as simple as changing all f[k] and f[k-1] to f[k%2] and f[(k-1)%2], respectively.
This lowers the memory requirement from O(n^3) down to O(n^2), as advertised in the beginning.
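For illustration, here is a short Python version of the DP above (names are mine), using axial coordinates for the six hexagonal neighbours and keeping only two layers of the table:

# Dynamic-programming count of closed n-step walks on a hexagonal grid,
# using axial coordinates and only two layers of the table.  Sketch only.
def count_closed_hex_walks(n):
    # the six axial-coordinate neighbour offsets of a hex cell
    dirs = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]
    size = 2 * n + 3                       # x, y range over -n-1 .. n+1
    off = n + 1                            # shift coordinates to array indices
    prev = [[0] * size for _ in range(size)]
    prev[off][off] = 1                     # f(0, start) = 1
    for _ in range(n):
        cur = [[0] * size for _ in range(size)]
        for x in range(-n, n + 1):
            for y in range(-n, n + 1):
                cur[x + off][y + off] = sum(
                    prev[x + dx + off][y + dy + off] for dx, dy in dirs)
        prev = cur
    return prev[off][off]                  # f(n, start)

# count_closed_hex_walks(2) == 6: step to any of the 6 neighbours and come back.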
For a more mathematical solution, here are some steps that would perhaps lead to one.
First, consider the following problem: what is the number of ways to go from (xs, ys) to (xt, yt) in n steps, each step moving one square north, west, south, or east?
To arrive from x = xs to x = xt, we need H = |xt - xs| steps in the right direction (without loss of generality, let it be east).
Similarly, we need V = |yt - ys| steps in another right direction to get to the desired y coordinate (let it be south).
We are left with k = n - H - V "free" steps, which can be split arbitrarily into pairs of north-south steps and pairs of east-west steps.
Obviously, if k is odd or negative, the answer is zero.
So, for each possible split k = 2h + 2v of "free" steps into horizontal and vertical steps, what we have to do is construct a path of H+h steps east, h steps west, V+v steps south, and v steps north. These steps can be done in any order.
The number of such sequences is a multinomial coefficient, and is equal to n! / (H+h)! / h! / (V+v)! / v!.
To finally get the answer, just sum these over all possible h and v such that k = 2h + 2v.
This solution calculates the answer in O(n) if we precalculate the factorials, also in O(n), and consider all arithmetic operations to take O(1) time.
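A minimal Python sketch of this square-grid formula (purely illustrative; here the factorials are recomputed rather than precalculated):

# Closed-form count for the square grid: sum of multinomials over all splits
# of the "free" steps into horizontal/vertical back-and-forth pairs.  Sketch.
from math import factorial

def count_square_walks(n, xs, ys, xt, yt):
    H, V = abs(xt - xs), abs(yt - ys)
    k = n - H - V                          # "free" steps
    if k < 0 or k % 2:
        return 0
    total = 0
    for h in range(k // 2 + 1):            # k = 2h + 2v
        v = k // 2 - h
        total += factorial(n) // (factorial(H + h) * factorial(h)
                                  * factorial(V + v) * factorial(v))
    return total

# count_square_walks(2, 0, 0, 0, 0) == 4: one out-and-back step in each direction.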
For a hexagonal grid, a complicating feature is that there is no such clear separation into horizontal and vertical steps.
Still, given the starting cell and the number of steps in each of the six directions, we can find the final cell, regardless of the order of these steps.
So, a solution can go as follows:
Enumerate all possible partitions of n into six summands a1, ..., a6.
For each such partition, find the final cell.
For each partition where the final cell is the cell we want, add multinomial coefficient n! / a1! / ... / a6! to the answer.
Just so, this takes O(n^6) time and O(1) memory.
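A brute-force sketch of that enumeration in Python, assuming the same axial direction vectors as before (illustrative only):

# Brute-force enumeration of n steps split among the six hex directions,
# keeping only the splits whose net displacement is zero (closed walks).
from math import factorial

def count_closed_hex_walks_by_partition(n):
    dirs = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]
    total = 0
    for a1 in range(n + 1):
        for a2 in range(n + 1 - a1):
            for a3 in range(n + 1 - a1 - a2):
                for a4 in range(n + 1 - a1 - a2 - a3):
                    for a5 in range(n + 1 - a1 - a2 - a3 - a4):
                        a6 = n - a1 - a2 - a3 - a4 - a5
                        counts = (a1, a2, a3, a4, a5, a6)
                        dx = sum(c * d[0] for c, d in zip(counts, dirs))
                        dy = sum(c * d[1] for c, d in zip(counts, dirs))
                        if dx == 0 and dy == 0:
                            m = factorial(n)
                            for c in counts:
                                m //= factorial(c)
                            total += m
    return total

# count_closed_hex_walks_by_partition(2) == 6, matching the DP above.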
By carefully studying the relations between different directions on a hexagonal grid, perhaps we can actually consider only the partitions which arrive at the target cell, and completely ignore all other partitions.
If so, this solution can be optimized into at least some O(n^3) or O(n^2) time, maybe further with decent algebraic skills.

Querying large amount of multidimensional points in R^N

I'm looking at listing/counting the number of integer points in R^N (in the sense of Euclidean space), within certain geometric shapes, such as circles and ellipses, subject to various conditions, for small N. By this I mean that N < 5, and the conditions are polynomial inequalities.
As a concrete example, take R^2. One of the queries I might like to run is "How many integer points are there in an ellipse (parameterised by x = 4 cos(theta), y = 3 sin(theta) ), such that y * x^2 - x * y = 4?"
I could implement this in Haskell like this:
ghci> let latticePoints = [(x,y) | x <- [-4..4], y <-[-3..3], 9*x^2 + 16*y^2 <= 144, y*x^2 - x*y == 4]
and then I would have:
ghci> latticePoints
[(-1,2),(2,2)]
Which indeed answers my question.
Of course, this is a very naive implementation, but it demonstrates what I'm trying to achieve. (I'm also only using Haskell here as I feel it most directly expresses the underlying mathematical ideas.)
Now, if I had something like "In R^5, how many integer points are there in a 4-sphere of radius 1,000,000, satisfying x^3 - y + z = 20?", I might try something like this:
ghci> :{
Prelude| let latticePoints2 = [(x,y,z,w,v) | x <-[-1000..1000], y <- [-1000..1000],
Prelude| z <- [-1000..1000], w <- [-1000..1000], v <- [-1000..1000],
Prelude| x^2 + y^2 + z^2 + w^2 + v^2 <= 1000000, x^3 - y + z == 20]
Prelude| :}
so if I now type:
ghci> latticePoints2
Not much will happen...
I imagine the issue is that it's effectively looping through 2000^5 (32 quadrillion!) points, and it's clearly unreasonable of me to expect my computer to deal with that. I can't imagine a similar implementation in Python or C would help matters much either.
So if I want to tackle a large number of points in such a way, what would be my best bet in terms of general algorithms or data structures? I saw in another thread (Count number of points inside a circle fast) someone mention quadtrees as well as k-d trees, but I wouldn't know how to implement those, nor how to appropriately query one once it was implemented.
I'm aware some of these numbers are quite large, but the biggest circles, ellipses, etc I'd be dealing with are of radius 10^12 (one trillion), and I certainly wouldn't need to deal with R^N with N > 5. If the above is NOT possible, I'd be interested to know what sort of numbers WOULD be feasible?
There is no general way to solve this problem. The problem of finding integer solutions to algebraic equations (equations of this sort are called Diophantine equations) is known to be undecidable. Apparently, you can write equations of this sort such that solving the equations ends up being equivalent to deciding whether a given Turing machine will halt on a given input.
In the examples you've listed, you've always constrained the points to be on some well-behaved shape, like an ellipse or a sphere. While this particular class of problem is definitely decidable, I'm skeptical that you can efficiently solve these problems for more complex curves. I suspect that it would be possible to construct short formulas that describe curves that are mostly empty but have a huge bounding box.
If you happen to know more about the structure of the problems you're trying to solve - for example, if you're always dealing with spheres or ellipses - then you may be able to find fast algorithms for this problem. In general, though, I don't think you'll be able to do much better than brute force. I'm willing to admit that (and in fact, hopeful that) someone will prove me wrong about this, though.
The idea behind the k-d tree method is that you recursively subdivide the search box and try to rule out whole boxes at a time. Given the current box, use some method that either (a) declares that all points in the box match the predicate, (b) declares that no points in the box match the predicate, or (c) makes no declaration (one possibility, which may be particularly convenient in Haskell: interval arithmetic). On (c), cut the box in half (say along the longest dimension) and recursively count in the halves. Obviously the method can choose (c) all the time, which devolves to brute force; the goal here is to do (a) or (b) as much as possible.
The performance of this method is very dependent on how it's instantiated. Try it -- it shouldn't be more than a couple dozen lines of code.
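As a sketch of the idea, here is a Python version for one concrete, simplified case: counting lattice points in a disc x^2 + y^2 <= r2. The "interval test" is just the min/max of the quadratic over the box, and the (a)/(b)/(c) structure is exactly the one described above. The name and the disc example are my own:

# Recursive box subdivision to count lattice points with x^2 + y^2 <= r2.
# For each box we bound the predicate: if even the farthest corner is inside,
# count the whole box; if even the nearest point is outside, discard it;
# otherwise split the box along its longest side.  Simplified sketch.
def count_in_disc(r2, box=None):
    if box is None:
        r = int(r2 ** 0.5) + 1
        box = (-r, r, -r, r)
    xlo, xhi, ylo, yhi = box
    if xlo > xhi or ylo > yhi:
        return 0
    def near(lo, hi):                               # point of the interval closest to 0
        return 0 if lo <= 0 <= hi else min(abs(lo), abs(hi))
    far_x, far_y = max(abs(xlo), abs(xhi)), max(abs(ylo), abs(yhi))
    if far_x * far_x + far_y * far_y <= r2:         # (a) whole box inside
        return (xhi - xlo + 1) * (yhi - ylo + 1)
    nx, ny = near(xlo, xhi), near(ylo, yhi)
    if nx * nx + ny * ny > r2:                      # (b) whole box outside
        return 0
    if xhi - xlo >= yhi - ylo:                      # (c) split the longest side
        mid = (xlo + xhi) // 2
        return (count_in_disc(r2, (xlo, mid, ylo, yhi))
                + count_in_disc(r2, (mid + 1, xhi, ylo, yhi)))
    mid = (ylo + yhi) // 2
    return (count_in_disc(r2, (xlo, xhi, ylo, mid))
            + count_in_disc(r2, (xlo, xhi, mid + 1, yhi)))

print(count_in_disc(25))   # 81 lattice points in the disc of radius 5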
For a nicely connected region, assuming your shape is significantly smaller than your containing search space, and given a seed point, you could do a growth/building algorithm:
Given a seed point:
    Push seed point into test-queue
    while test-queue has items:
        Pop item from test-queue
        If item tests to be within region (eg using a callback function):
            Add item to inside-set
            for each neighbour point (generated on the fly):
                if neighbour not in outside-set and neighbour not in inside-set:
                    Add neighbour to test-queue
        else:
            Add item to outside-set
    return inside-set
The trick is to find an initial seed point that is inside the function.
Make sure your set implementation gives O(1) duplicate checking. This method will eventually break down with large numbers of dimensions as the surface area exceeds the volume, but for 5 dimensions should be fine.
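Here is the growth idea as runnable Python for the 2-D case (a slight variant that also marks queued points so nothing is enqueued twice; the names and the example predicate are mine):

# Breadth-first growth from a seed point, counting lattice points that satisfy
# a predicate; only the region's boundary neighbourhood is ever visited.
from collections import deque

def grow_region(seed, inside):
    """inside: callback taking a lattice point and returning True/False."""
    todo = deque([seed])
    in_set, out_set = set(), set()
    seen = {seed}
    while todo:
        p = todo.popleft()
        if inside(p):
            in_set.add(p)
            x, y = p
            for q in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if q not in seen:
                    seen.add(q)
                    todo.append(q)
        else:
            out_set.add(p)
    return in_set

# Example: the ellipse from the question, seeded at a point known to be inside.
pts = grow_region((0, 3), lambda p: 9 * p[0] ** 2 + 16 * p[1] ** 2 <= 144)
print(len(pts))   # -> 35 lattice points inside the ellipse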

Find points given distances between them

Here is an example:
Suppose there are 4 points: A, B, C, and D
Given that Point A is at (0,0):
and the distances:
A to B: 7
A to C: 5
A to D: 9
B to C: 6
B to D: 5
C to D: 7
The goal would be to find a solution to points B(x,y), C(x,y) and D(x,y)
What is an algorithm to find the points ( up to 50 of them ) given the distances between them?
OK, you have 4 points A, B, C, and D which are separated from one another such that the pairwise distances are AB=7, AC=5, BC=6, AD=9, BD=5, and CD=7. The answer: Axyz=(0,0,0), Bxyz=(7,0,0), Cxyz=(2.7,4.2,0), Dxyz=(7.5,1.9,4.6) (rounding to the first decimal).
We set point A at the origin Axyz= (0,0,0).
We set point B at x=7,y=0,z=0 Bxyz= (7,0,0).
We find the x coordinate for point C by using the law of cosines:
((AB^2+AC^2-BC^2)/2)/Bx = Cx
((7^2+5^2-6^2)/2)/7=
((49+25-36)/2)/7= 38/14 = 2.714286
We then use the pythagorean theorem to find Cy:
sqrt(AC^2-Cx^2)=Cy
sqrt(25-7.367347)=4.199
So Cxyz=(2.714,4.199,0)
We find Dx in much the same way we found Cx:
((AB^2+AD^2-BD^2)/2)/Bx =Dx
((49+81-25)/2)/7= 7.5 = Dx
We find Dy by a slightly different formula:
Dy = (((AC^2+AD^2-CD^2)/2)-(Cx*Dx))/Cy
(((25+81-49)/2)-(2.714*7.5))/4.199= 1.94 (approx)
Having found Dx and Dy, we can find Dz by using Pythagorean theorem:
sqrt(AD^2-Dx^2-Dy^2)=
sqrt(9^2-7.5^2-1.94^2) = 4.58
So Dxyz=(7.5, 1.94, 4.58)
If you have pairwise distances between each of a set of 50 points, then you might need as many as 49 dimensions in order to obtain coordinates for all the points. If A, B, C, D, and E are all separated by 10 length units from one another, then you would need 4 spatial dimensions; if you introduce another point (F) which is also equidistant from all the other points, then you will need 5 dimensions. The algorithm works the same no matter how many dimensions are necessary (and in fact it works best when the maximum number of dimensions IS required). The algorithm also works when the distances violate the triangle inequality - such as if AB=3, AC=4, and BC=13 - the coordinates are A=(0,0), B=(3,0), and C=(-24, 23.66i). If the triangle inequality is violated, then some of the coordinates will simply be imaginary valued. No big deal.
In general for point G, the coordinates (x1st, x2nd, x3rd, x4th, x5th, and x6th) can be found thusly:
G1st=((AB^2+AG^2-BG^2)/2)/(B1st)
G2nd=(((AC^2+AG^2-CG^2)/2)-(C1st*G1st))/(C2nd)
G3rd=(((AD^2+AG^2-DG^2)/2)-(D1st*G1st)-(D2nd*G2nd))/(D3rd)
G4th=(((AE^2+AG^2-EG^2)/2)-(E1st*G1st)-(E2nd*G2nd)-(E3rd*G3rd))/(E4th)
G5th=(((AF^2+AG^2-FG^2)/2)-(F1st*G1st)-(F2nd*G2nd)-(F3rd*G3rd)-(F4th*G4th))/(F5th)
G6th=sqrt(AG^2-G1st^2-G2nd^2-G3rd^2-G4th^2-G5th^2)
For the 5th point you find the first three coordinates with law-of-cosines calculations and the 4th coordinate with a Pythagorean-theorem calculation. For the 6th point you find the first 4 coordinates with 4 law-of-cosines calculations and then obtain the final coordinate with the Pythagorean-theorem calculation. For the 50th point, you find the first 48 coordinates with 48 law-of-cosines calculations and the 49th coordinate is found with a Pythagorean-theorem calculation. So for 50 points, there will be 48 Pythagorean-theorem calculations altogether plus 1176 law-of-cosines calculations.
The algorithm is fairly straightforward:
A is always set at the origin and B is set at x=AB (or rather B1st=AB)
C1st is found by using the law of cosines ((AB^2+AC^2-BC^2)/2)/(B1st)
C2nd is then found with pythagorean theorem (sqrt(AC^2-C1st^2))
BUT WHAT IF C2nd = 0? This is not necessarily a problem, but it can become a problem for finding D2nd, D3rd, E2nd, E3rd, E4th, etc.
If AB=4, AC=8, BC=4, then we will obtain A (0,0), B (4,0), and C (8,0). If AD=4, BD=8, and CD=12, then there will be no problem for finding coordinates for D which would be D (-4,0).
However, if CD is not equal to 12, then we WILL have a problem. For instance, if CD=5, then we might find that we should go back and calculate coordinates for the points in a different order such as ACDB, that way we can get A=(0,0,0);C=(8,0,0); D=(3.44,2.04,0); and B=(4,-14.55,14.55i). This is a fairly intuitive solution, but it interrupts the flow of the algorithm because we have to go backwards and start over in a different order.
Another solution to the problem, which does not necessitate interrupting the flow of computations, is to deliberately introduce an error whenever a Pythagorean-theorem calculation gives us a zero: instead of a zero, put a 0.1 or 0.01 as the C2nd coordinate. This will allow one to proceed with calculating coordinates for the remaining points without interruption, and the accuracy of the final results will suffer only a little (truth be told, the algorithm is subject to cumulative rounding errors anyhow, so it's no big deal). Also, the deliberate introduction of error is the only way to obtain a solution at all in some cases:
Consider once again 4 points A, B, C, and D with distances such the AB=4, AC=8, BC=4, AD=4, BD=8, and CD=4 (we previously have had CD at 12, and CD at 5). When CD=4, there IS NO exact solution no matter what order you calculate the points. Go ahead and try.
A=(0,0,0), B=(4,0,0), C=(8,0,0)... If you introduce an error at C2nd so that instead of zero you put 0.1, such that C=(8,0.1,0), then you can obtain a solution for point D's coordinates: D=(-4,640,640i). If you introduce a smaller error for C2nd such that C=(8,0.01,0), then you get D=(-4,6400,6400i). As C2nd gets closer and closer to zero, D2nd and D3rd just get farther and farther away along the same direction. A similar result occurs sometimes when the distance between two points is close to zero. The algorithm of course will not work with a distance that is actually equal to zero, such as with AB=5, AC=8, and BC=0. But it will work with BC=0.000001.
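For what it's worth, here is the whole scheme as a short Python sketch (my own naming; cmath.sqrt stands in for the imaginary coordinates mentioned above). With the distances from the question it reproduces the worked example, placing C at (2.714, 4.199, 0) and D at (7.5, 1.94, 4.58):

# Coordinates from a full distance matrix, following the law-of-cosines /
# Pythagoras scheme described above.  Complex arithmetic (cmath) stands in for
# the "imaginary coordinates" that appear when the triangle inequality fails.
import cmath

def coords_from_distances(D):
    """D[i][j] = distance between points i and j.  Returns a list of coordinate
    vectors; point k gets k meaningful coordinates (padded with zeros)."""
    n = len(D)
    dim = max(n - 1, 1)
    P = [[0.0] * dim for _ in range(n)]
    if n > 1:
        P[1][0] = D[0][1]                       # B sits on the first axis
    for g in range(2, n):                       # place points C, D, E, ...
        for j in range(1, g):                   # law-of-cosines coordinates
            ref = P[j]                          # previously placed point j
            s = (D[0][j] ** 2 + D[0][g] ** 2 - D[j][g] ** 2) / 2
            s -= sum(ref[m] * P[g][m] for m in range(j - 1))
            P[g][j - 1] = s / ref[j - 1]        # zero divisor = the degenerate C2nd = 0 case above
        # final coordinate by the Pythagorean theorem (may come out imaginary)
        P[g][g - 1] = cmath.sqrt(D[0][g] ** 2 - sum(c ** 2 for c in P[g][:g - 1]))
    return P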
Anyway, I think this answers the question you asked a year ago.

Is there a fast way to invert a matrix in Matlab?

I have lots of large (around 5000 x 5000) matrices that I need to invert in Matlab. I actually need the inverse, so I can't use mldivide instead, which is a lot faster for solving Ax=b for just one b.
My matrices are coming from a problem that means they have some nice properties. First off, their determinant is 1, so they're definitely invertible. They aren't diagonalizable, though, or I would try to diagonalize them, invert them, and then put them back. Their entries are all real numbers (actually rational).
I'm using Matlab for getting these matrices and for this stuff I need to do with their inverses, so I would prefer a way to speed Matlab up. But if there is another language I can use that'll be faster, then please let me know. I don't know a lot of other languages (a little bit of C and a little bit of Java), so if it's really complicated in some other language, then I might not be able to use it. Please go ahead and suggest it anyway, though, just in case.
I actually need the inverse, so I can't use mldivide instead,...
That's not true, because you can still use mldivide to get the inverse. Note that A^-1 = A^-1 * I. In MATLAB, this is equivalent to
invA = A\speye(size(A));
On my machine, this takes about 10.5 seconds for a 5000x5000 matrix. Note that MATLAB does have an inv function to compute the inverse of a matrix. Although this will take about the same amount of time, it is less efficient in terms of numerical accuracy (more info in the link).
First off, their determinant is 1 so they're definitely invertible
Rather than det(A)=1, it is the condition number of your matrix that dictates how accurate or stable the inverse will be. Note that det(A) = ∏_{i=1..n} λ_i. So just setting λ_1 = M, λ_n = 1/M, and λ_i = 1 for all other i will give you det(A)=1. However, as M → ∞, cond(A) = M^2 → ∞ and λ_n → 0, meaning your matrix is approaching singularity and there will be large numerical errors in computing the inverse.
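A quick numerical illustration of that point (NumPy here rather than MATLAB, and the construction is my own): the determinant stays 1 while the residual of the computed inverse grows with the condition number.

# det(A) = 1 does not make inversion well behaved: push one eigenvalue to M and
# another to 1/M and the residual of the computed inverse grows with cond ~ M^2.
import numpy as np

rng = np.random.default_rng(0)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
for M in (1e2, 1e4, 1e6):
    d = np.ones(n)
    d[0], d[-1] = M, 1.0 / M                       # det = 1, cond ~ M^2
    A = Q @ np.diag(d) @ Q.T
    R = np.linalg.inv(A) @ A - np.eye(n)
    print(f"M={M:.0e}  cond={np.linalg.cond(A):.1e}  residual={np.abs(R).max():.1e}")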
My matrices are coming from a problem that means they have some nice properties.
Of course, there are other more efficient algorithms that can be employed if your matrix is sparse or has other favorable properties. But without any additional info on your specific problem, there is nothing more that can be said.
I would prefer a way to speed Matlab up
MATLAB uses Gauss elimination to compute the inverse of a general matrix (full rank, non-sparse, without any special properties) using mldivide, and this is Θ(n^3), where n is the size of the matrix. So, in your case, n=5000 and there are 1.25 x 10^11 floating point operations. So on a reasonable machine with about 10 Gflops of computational power, you're going to require at least 12.5 seconds to compute the inverse, and there is no way out of this unless you exploit the "special properties" (if they're exploitable).
Inverting an arbitrary 5000 x 5000 matrix is not computationally easy no matter what language you are using. I would recommend looking into approximations. If your matrices are low rank, you might want to try a low-rank approximation M = USV'
Here are some more ideas from math-overflow:
https://mathoverflow.net/search?q=matrix+inversion+approximation
First suppose the eigenvalues are all 1. Let A be the Jordan canonical form of your matrix. Then you can compute A^{-1} using only matrix multiplication and addition by
A^{-1} = I + (I-A) + (I-A)^2 + ... + (I-A)^k
where k < dim(A). Why does this work? Because generating functions are awesome. Recall the expansion
(1-x)^{-1} = 1/(1-x) = 1 + x + x^2 + ...
This means that we can invert (1-x) using an infinite sum. You want to invert a matrix A, so you want to take
A = I - X
Solving for X gives X = I-A. Therefore by substitution, we have
A^{-1} = (I - (I-A))^{-1} = 1 + (I-A) + (I-A)^2 + ...
Here I've just used the identity matrix I in place of the number 1. Now we have the problem of convergence to deal with, but this isn't actually a problem. By the assumption that A is in Jordan form and has all eigenvalues equal to 1, we know that A is upper triangular with all 1s on the diagonal. Therefore I-A is upper triangular with all 0s on the diagonal. Therefore all eigenvalues of I-A are 0, so its characteristic polynomial is x^dim(A) and its minimal polynomial is x^{k+1} for some k < dim(A). Since a matrix satisfies its minimal (and characteristic) polynomial, this means that (I-A)^{k+1} = 0. Therefore the above series is finite, with the largest nonzero term being (I-A)^k. So it converges.
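A quick NumPy check of this single-eigenvalue case (my own toy example): for an upper-triangular matrix with ones on the diagonal, the truncated series really does give the inverse.

# A^{-1} = I + (I-A) + (I-A)^2 + ... terminates when A is upper triangular with
# ones on the diagonal, because (I - A) is then nilpotent.  Quick check:
import numpy as np

n = 6
A = np.eye(n) + np.triu(np.random.default_rng(1).standard_normal((n, n)), k=1)
N = np.eye(n) - A                        # strictly upper triangular => N^n = 0
inv = np.zeros_like(A)
term = np.eye(n)
for _ in range(n):                       # at most n terms are nonzero
    inv += term
    term = term @ N
print(np.allclose(inv @ A, np.eye(n)))   # True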
Now, for the general case, put your matrix into Jordan form, so that you have a block triangular matrix, e.g.:
A 0 0
0 B 0
0 0 C
Where each block has a single value along the diagonal. If that value is a for block A, then use the above trick to invert (1/a) * A, and then multiply the result by 1/a to recover A^{-1}. Since the full matrix is block diagonal, the inverse will be
A^{-1} 0 0
0 B^{-1} 0
0 0 C^{-1}
There is nothing special about having three blocks, so this works no matter how many you have.
Note that this trick works whenever you have a matrix in Jordan form. The computation of the inverse in this case will be very fast in Matlab because it only involves matrix multiplication, and you can even use tricks to speed that up since you only need powers of a single matrix. This may not help you, though, if it's really costly to get the matrix into Jordan form.

Why don't genetic algorithms work on problems like factoring RSA?

Some time ago I was pretty interested in GAs and I studied them quite a bit. I used the C++ GAlib to write some programs and I was quite amazed by their ability to solve otherwise difficult-to-compute problems in a matter of seconds. They seemed like a great brute-forcing technique that works really, really smart and adapts.
I was reading a book by Michalewicz, if I remember the name correctly, and it all seemed to be based on the Schema Theorem, proved by MIT.
I've also heard that it cannot really be used to approach problems like factoring RSA private keys.
Could anybody explain why this is the case?
Genetic Algorithms are not smart at all; they are very greedy optimization algorithms. They all work around the same idea. You have a group of points ('a population of individuals'), and you transform that group into another one with stochastic operators, with a bias in the direction of best improvement ('mutation + crossover + selection'). Repeat until it converges or you are tired of it; nothing smart there.
For a Genetic Algorithm to work, a new population of points should perform close to the previous population of points. A little perturbation should create a little change. If, after a small perturbation of a point, you obtain a point that represents a solution with completely different performance, then the algorithm is nothing better than random search, which is usually not a good optimization algorithm. In the RSA case, if your points are directly the numbers, it's either YES or NO, just by flipping a bit... Thus using a Genetic Algorithm is no better than random search if you represent the RSA problem without much thinking, e.g. "let's encode the search points as the bits of the numbers".
I would say it is because factorisation of keys is not an optimisation problem, but an exact problem. This distinction is not very accurate, so here are some details.
Genetic algorithms are great for solving problems where there are minima (local/global), but there aren't any in the factoring problem. Genetic algorithms, like DCA or simulated annealing, need a measure of "how close am I to the solution", but you can't say this for our problem.
For an example of a problem where genetic algorithms do well, there is the hill-climbing problem.
GAs are based on fitness evaluation of candidate solutions.
You basically have a fitness function that takes in a candidate solution as input and gives you back a scalar telling you how good that candidate is. You then go on and allow the best individuals of a given generation to mate with higher probability than the rest, so that the offspring will be (hopefully) more 'fit' overall, and so on.
There is no way to evaluate fitness (how good is a candidate solution compared to the rest) in the RSA factorization scenario, so that's why you can't use them.
GAs are not brute-forcing, they’re just a search algorithm. Each GA essentially looks like this:
candidates = seed_value;
while (!good_enough(best_of(candidates))) {
    candidates = compute_next_generation(candidates);
}
Where good_enough and best_of are defined in terms of a fitness function. A fitness function says how well a given candidate solves the problem. That seems to be the core issue here: how would you write a fitness function for factorization? For example 20 = 2*10 or 4*5. The tuples (2,10) and (4,5) are clearly winners, but what about the others? How “fit” is (1,9) or (3,4)?
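To make the loop concrete, here is a toy Python GA (my own sketch) for OneMax, i.e. maximising the number of 1 bits, a problem where a graded fitness function does exist. The contrast with factoring is exactly that no comparable gradation is available there.

# A toy GA for OneMax (maximise the number of 1 bits) -- a problem where a
# graded fitness function exists, unlike factoring.  Sketch only.
import random

def fitness(bits):                         # graded: more 1s means closer to optimal
    return sum(bits)

def evolve(n_bits=40, pop_size=60, generations=200, p_mut=0.02):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == n_bits:          # good_enough(best_of(candidates))
            break
        parents = pop[:pop_size // 2]          # keep the fitter half
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)   # mate two parents
            cut = random.randrange(1, n_bits)  # one-point crossover
            child = [bit ^ (random.random() < p_mut)   # mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

print(fitness(evolve()))                       # usually reaches 40 quickly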
Indirectly, you can use a genetic algorithm to factor an integer N. Dixon's integer factorization method uses equations involving powers of the first k primes, modulo N. These products of powers of small primes are called "smooth". If we are using the first k=4 primes - {2,3,5,7} - 42=2x3x7 is smooth and 11 is not (for lack of a better term, 11 is "rough"). Dixon's method requires an invertible k x k matrix consisting of the exponents that define these smooth numbers. For more on Dixon's method see https://en.wikipedia.org/wiki/Dixon%27s_factorization_method.
Now, back to the original question: There is a genetic algorithm for finding equations for Dixon's method.
Let r be the inverse of a smooth number mod N - so r is a rough number
Let s be smooth
Generate random solutions of rx = sy mod N. These solutions [x,y] are the population for the genetic algorithm. Each x, y has a smooth component and a rough component. For example suppose x = 369 = 9 x 41. Then (assuming 41 is not small enough to count as smooth), the rough part of x is 41 and the smooth part is 9.
Choose pairs of solutions - "parents" - to combine into linear combinations with ever smaller rough parts.
The algorithm terminates when a pair [x,y] is found with rough parts [1,1], [1,-1],[-1,1] or [-1,-1]. This yields an equation for Dixon's method, because rx=sy mod N and r is the only rough number left: x and y are smooth, and s started off smooth. But even 1/r mod N is smooth, so it's all smooth!
Every time you combine two pairs - say [v,w] and [x,y] - the smooth parts of the four numbers are obliterated, except for the factors the smooth parts of v and x share, and the factors the smooth parts of w and y share. So we choose parents that share smooth parts to the greatest possible extent. To make this precise, write
g = gcd(smooth part of v, smooth part of x)
h = gcd(smooth part of w, smooth part of y)
[v,w], [x,y] = [g v/g, h w/h], [g x/g, h y/h].
The hard-won smooth factors g and h will be preserved into the next generation, but the smooth parts of v/g, w/h, x/g and y/h will be sacrificed in order to combine [v,w] and [x,y]. So we choose parents for which v/g, w/h, x/g and y/h have the smallest smooth parts. In this way we really do drive down the rough parts of our solutions to rx = sy mod N from one generation to the next.
On further thought the best way to make your way towards smooth coefficients x, y in the lattice ax = by mod N is with regression, not a genetic algorithm.
Two regressions are performed, one with response vector R0 consisting of x-values from randomly chosen solutions of ax = by mod N; and the other with response vector R1 consisting of y-values from the same solutions. Both regressions use the same explanatory matrix X. In X are columns consisting of the remainders of the x-values modulo smooth divisors, and other columns consisting of the remainders of the y-values modulo other smooth divisors.
The best choice of smooth divisors is the one that minimizes the errors from each regression:
E0 = R0 - X (X^T X)^-1 X^T R0
E1 = R1 - X (X^T X)^-1 X^T R1
What follows is row operations to annihilate X. Then apply a result z of these row operations to the x- and y-values from the original solutions from which X was formed.
z R0 = z R0 - 0
     = z R0 - z X (X^T X)^-1 X^T R0
     = z E0
Similarly, z R1 = z E1
Three properties are now combined in z R0 and z R1:
They are multiples of large smooth numbers, because z annihilates remainders modulo smooth numbers.
They are relatively small, since E0 and E1 are small.
Like any linear combination of solutions to ax = by mod N, z R0 and z R1 are themselves solutions to that equation.
A relatively small multiple of a large smooth number might just be the smooth number itself. Having a smooth solution of ax = by mod N yields an input to Dixon's method.
Two optimizations make this particularly fast:
There is no need to guess all the smooth numbers and columns of X at once. You can run regressions continuously, adding one column to X at a time, choosing columns that reduce E0 and E1 the most. At no time will any two smooth numbers with a common factor be selected.
You can also start with a lot of random solutions of ax = by mod N, and remove the ones with the largest errors between selections of new columns for X.
