Traceback in DP Needleman-Wunsch/Smith-Waterman - bioinformatics

In Needleman-Wunsch and Smith-Waterman, what is the best way to implement traceback? Do we usually keep two matrices, one with each entry's predecessor? That is, each entry would be UP, DIAG, or LEFT. Or is there a simpler, more space-efficient way to do traceback? I understand the algorithms and how to get the maximum score, but not the traceback. Thanks!

Using 2 matrices will work, but it is the naive approach, especially if size or memory is an issue: two separate full matrices are space-inefficient.
Since there are only 3 possible directions for the traceback in N-W and 4 in S-W (you need to add STOP), you can store each direction in 2 bits. If your scores are small enough, you can pack the score and the traceback direction of corresponding cells into a single cell of one matrix and then use bit masking to extract each.
Or, if you still want 2 matrices, there is no reason to take up so much space for the traceback matrix: you can pack it so that 4 traceback directions fit in a single cell of a byte matrix (you would have to do similar bit masking).
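A sketch of that packing idea (the direction codes and helper names are my own choices, not from any library): four 2-bit direction codes per byte, read and written with bit masking.

```python
# Direction codes; any consistent 2-bit assignment works.
STOP, UP, LEFT, DIAG = 0, 1, 2, 3

def make_packed(n_cells):
    """Allocate a byte array holding n_cells 2-bit direction codes."""
    return bytearray((n_cells + 3) // 4)

def set_dir(packed, idx, code):
    """Store a 2-bit code at logical index idx (4 codes per byte)."""
    byte, slot = divmod(idx, 4)
    shift = slot * 2
    packed[byte] = (packed[byte] & ~(0b11 << shift)) | (code << shift)

def get_dir(packed, idx):
    """Read the 2-bit code at logical index idx."""
    byte, slot = divmod(idx, 4)
    return (packed[byte] >> (slot * 2)) & 0b11
```

For an (n+1) x (m+1) DP matrix, cell (i, j) maps to `idx = i * (m + 1) + j`; the traceback storage shrinks to a quarter of a byte matrix.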

My understanding is yes, you do need 2 matrices.
Then you trace back from the bottom-right cell in Needleman-Wunsch (global alignment); in Smith-Waterman you instead start from the highest-scoring cell and stop when you reach a zero. See http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
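For reference, a minimal Needleman-Wunsch sketch with a single direction matrix and traceback from the bottom-right cell (the scoring values here are arbitrary illustrative choices):

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    trace = [[None] * (m + 1) for _ in range(n + 1)]  # 'U', 'L', or 'D'
    for i in range(1, n + 1):
        score[i][0], trace[i][0] = i * gap, 'U'
    for j in range(1, m + 1):
        score[0][j], trace[0][j] = j * gap, 'L'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = score[i-1][j-1] + (match if s[i-1] == t[j-1] else mismatch)
            u = score[i-1][j] + gap
            l = score[i][j-1] + gap
            score[i][j], trace[i][j] = max((d, 'D'), (u, 'U'), (l, 'L'))
    # Traceback from the bottom-right cell to (0, 0).
    i, j, a, b = n, m, [], []
    while i > 0 or j > 0:
        if trace[i][j] == 'D':
            a.append(s[i-1]); b.append(t[j-1]); i -= 1; j -= 1
        elif trace[i][j] == 'U':
            a.append(s[i-1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j-1]); j -= 1
    return score[n][m], ''.join(reversed(a)), ''.join(reversed(b))
```

Smith-Waterman differs only in clamping cell scores at zero, starting the traceback at the maximum cell, and stopping at the first zero.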

Related

How to count black cells without iteration (bitmap?)

Been stuck on a problem for a while, hope some of you have ideas.
Given a matrix of size N*M of binary values (0 / 1), come up with an approach to return the number of 1's that is more efficient than simply iterating over the matrix.
The key, in my opinion, is a bitmap. I thought about allocating a new N*M matrix and manipulating the two... I haven't got a solution yet.
Any ideas?
From a theoretical point of view, unless the matrix has special properties, you must examine all N.M elements, which a simple loop already does. So that construction is optimal and unbeatable.
In practice, maybe you are looking for a way to get some speedup from a naïve implementation that handles a single element at a time. The answer will be highly dependent on the storage format of the elements and the processor architecture.
If the bits are packed 8 per byte, you can set up a lookup table of bit counts for every possible byte value. This yields a potential speedup of x8.
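A quick sketch of that lookup table in Python (a C implementation would use a static 256-entry array, or a hardware popcount instruction where available):

```python
# Precompute the number of set bits for every possible byte value.
POPCOUNT = [bin(v).count("1") for v in range(256)]

def count_ones(packed_rows):
    """Count 1-bits in a matrix stored as rows of packed bytes,
    one table lookup per byte instead of one test per bit."""
    return sum(POPCOUNT[b] for row in packed_rows for b in row)
```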
If you know that the black zones are simply connected (no holes), then it is not necessary to visit their interiors; a contouring algorithm will suffice. But you still have to scan the white areas. This lets you break the N.M limit and reduce the cost to Nw + Lb, where Nw is the number of white pixels and Lb is the total length of the black outlines.
If in addition you know that there is a single, simply connected black zone, and you know a black outline pixel, the complexity drops to Lb, which can be significantly smaller than N.M.

Minutiae-based fingerprint matching algorithm

The problem
I need to match two fingerprints and give a score of resemblance.
I have posted a similar question before, but I think I've made enough progress to warrant a new question.
The input
For each image, I have a list of minutiae (important points). I want to match the fingerprints by matching these two lists.
When represented graphically, they look like this:
A minutia consists of a triplet (i, j, theta) where:
i is the row in a matrix
j is the column in a matrix
theta is a direction. I don't use that parameter yet in my matching algorithm.
What I have done so far
For each list, find the "dense regions" or "clusters". Some areas have more points than others, and I have written an algorithm to find them. I can explain further if you want.
Shift the second list to account for the difference in finger position between the two images (I neglect differences in finger rotation). The shift is done by aligning the barycenters of the cluster centers, which is more reliable than the barycenter of all minutiae.
I tried building a matrix for each list (post-shift) so that every minutia increments the corresponding element and its close neighbours, like below.
1 1 1 1 1 1 1
1 2 2 2 2 2 1
1 2 3 3 3 2 1
1 2 3 4 3 2 1
1 2 3 3 3 2 1
1 2 2 2 2 2 1
1 1 1 1 1 1 1
By subtracting the two matrices and adding up the absolute values of all elements in the resulting matrix, I hoped to get low numbers for close fingerprints.
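The neighbourhood-increment and subtraction steps above can be sketched as follows (the weight rule and default radius reproduce the 7x7 example; function names are my own):

```python
def density_matrix(minutiae, rows, cols, radius=4):
    """Each minutia adds (radius - Chebyshev distance) to nearby cells;
    with radius=4 a lone minutia produces the 7x7 pattern shown above."""
    m = [[0] * cols for _ in range(rows)]
    for (i, j) in minutiae:
        for di in range(-radius + 1, radius):
            for dj in range(-radius + 1, radius):
                r, c = i + di, j + dj
                if 0 <= r < rows and 0 <= c < cols:
                    m[r][c] += radius - max(abs(di), abs(dj))
    return m

def l1_difference(a, b):
    """Sum of absolute element-wise differences (the proposed score)."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```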
Results
I tested a few fingerprints and found that the number of clusters is very stable. Matching fingerprints very often have the same number of clusters, and different fingers give different numbers. So that will definitely be a factor in the overall resemblance score.
The sum of the differences didn't work at all however. There was no correlation between resemblance and the sum.
Thoughts
I may need to use the directions of the points but I don't know how yet
I could use the standard deviation of the points, or of the clusters.
I could repeat the process for different types of minutiae. Right now my algorithm detects ridge endings and ridge bifurcations but maybe I should process these separately.
Question: How can I improve my algorithm?
Edit
I've come a long way since posting this question, so here's my update.
I dropped the bifurcations altogether, because my thinning algorithm messes those up too often. I did however end up using the angles quite a lot.
My initial cluster-counting idea does hold up pretty well on the small scale tests I ran (different combinations of my fingers and those of a handful of volunteers).
I give a score based on the following tests (10 tests, so 10% per success. It's a bit naïve but I'll find a better way to turn these 10 results into a score, as each test has its specificities):
Cluster-thingy (all the following don't use clusters, but minutiae. This is the only cluster-related approach I took)
Mean i position
Mean angle
i variance
j variance
Angle variance
i kurtosis
j kurtosis
Angle kurtosis
j skewness
A statistical approach indeed.
Same-finger comparisons almost always score between 80 and 100%; different-finger comparisons score between 0 and 60% (rarely as high as 60%). I don't have exact numbers here, so I won't pretend this is a statistically significant success, but it seems like a good first shot.
Your clustering approach is interesting, but one thing I'm curious about is how well you've tested it. For a new matching algorithm to be useful relative to the research and methods that already exist, it needs a reasonably low EER (equal error rate). Have you tested your method against any of the standard databases? I have doubts about the ability of cluster counts and locations alone to identify individuals at larger scales.
1) Fingerprint matching is a well studied problem and there are many good papers that can help you implement this. For a nice place to start, check out this paper, "Fingerprint Minutiae Matching Based on the Local and Global Structures" by Jiang & Yau. It's a classic paper, a short read (only 4 pages), and can be implemented fairly reasonably. They also define a scoring metric that can be used to quantify the degree to which two fingerprint images match. Again, this should only be a starting point because these days there are many algorithms that perform better.
2) If you want your algorithm to be robust, it should consider transformations of the fingerprint between images. Scanned fingerprints and certainly latent prints may not be consistent from image to image.
Also, calculating the direction of each minutia point provides a way to handle fingerprint rotation. The angles between minutiae directions remain the same, or nearly the same, across images regardless of global rotation (small inconsistencies occur because skin is not rigid and may stretch slightly), so you can find the best set of corresponding minutia pairs or triplets and use them as the basis for rotational alignment.
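As a toy illustration of that invariance (coordinates and angles made up; this is not a full matcher, just a demonstration that pairwise distances and direction differences survive a global rotation):

```python
import math

def relative_features(minutiae):
    """Pairwise (distance, direction difference) features for a list of
    (x, y, theta) minutiae; both values are unchanged by a global
    rotation of the whole print."""
    feats = []
    for a in range(len(minutiae)):
        for b in range(a + 1, len(minutiae)):
            (x1, y1, t1), (x2, y2, t2) = minutiae[a], minutiae[b]
            d = math.hypot(x2 - x1, y2 - y1)
            dtheta = (t2 - t1) % (2 * math.pi)
            feats.append((d, dtheta))
    return feats

def rotate(minutiae, phi):
    """Rotate every minutia (position and direction) by angle phi."""
    c, s = math.cos(phi), math.sin(phi)
    return [(x * c - y * s, x * s + y * c, (t + phi) % (2 * math.pi))
            for (x, y, t) in minutiae]
```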
3) I recommend that you distinguish between ridge line endings and bifurcations. The more features you can isolate, the more accurately you can determine whether or not the fingerprints match. You might also consider the number of ridge lines that occur between each minutiae point.
The image below illustrates the features used by Jiang and Yau.
d: Euclidean distance between minutiae
θ: Angle measure between minutiae directions
φ: Global minutiae angle
n: Number of ridge lines between minutiae i and j
If you haven't read the Handbook of Fingerprint Recognition, I recommend it.

incremental least squares differing with only one row

I have to solve multiple least squares problems sequentially, one by one. Each problem differs from the previous one by only one row; the right-hand side is the same for all. For example, Problem 1: ||Ax-b|| and Problem 2: ||Cy-b||, where A and C differ by only one row. That is, it is equivalent to deleting a row from A and adding a new row. When solving Problem 2, I already have x. Is there a fast way to solve Problem 2 for y?
You can use the Sherman-Morrison formula.
The key piece of the linear regression solution is computing the inverse of A'A.
If b is the row removed from A and a is the new row in C, then
C'C = A'A - bb' + aa'
This is a rank-two update of A'A, so you can apply the Sherman-Morrison formula twice, once to subtract bb' and once to add aa' (or use the Woodbury identity once), to compute (C'C)^{-1} given (A'A)^{-1}.
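A small pure-Python sketch of subtracting bb' and then adding aa' as two rank-one updates (the analytic 2x2 inverse is only for the demo; in practice you would maintain a factorization rather than an explicit inverse):

```python
def sherman_morrison(Minv, u, v):
    """Given Minv = M^-1, return (M + u v^T)^-1 via Sherman-Morrison."""
    n = len(Minv)
    Mu = [sum(Minv[i][k] * u[k] for k in range(n)) for i in range(n)]
    vM = [sum(v[k] * Minv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[k] * Mu[k] for k in range(n))
    return [[Minv[i][j] - Mu[i] * vM[j] / denom for j in range(n)]
            for i in range(n)]

def gram(M):
    """Return M^T M for a list-of-rows matrix."""
    n = len(M[0])
    return [[sum(r[i] * r[j] for r in M) for j in range(n)] for i in range(n)]

# Demo: A is 3x2; replace its last row b with a new row a.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
AtA = gram(A)
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
AtA_inv = [[AtA[1][1] / det, -AtA[0][1] / det],
           [-AtA[1][0] / det, AtA[0][0] / det]]

b, a = A[2], [7.0, 1.0]
# C'C = A'A - b b' + a a': two rank-one updates.
tmp = sherman_morrison(AtA_inv, [-x for x in b], b)  # subtract b b'
CtC_inv = sherman_morrison(tmp, a, a)                # add a a'
```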
Unfortunately the answer may be NO...
Changing one row of a matrix can lead to a completely different spectrum: all the eigenvalues and eigenvectors may change in both magnitude and orientation. As a result, the gradient from problem 1 won't carry over to problem 2. You can try using your x from problem 1 as an initial guess for y in problem 2, but it is not guaranteed to reduce your search time in optimization.
Still, solving a linear system is not that hard with modern packages. You can use an LU or QR decomposition to improve computing efficiency considerably.

Is there an efficient algorithm to generate random points in general position in the plane?

I need to generate n random points in general position in the plane, i.e. no three points may lie on the same line. The points should have integer coordinates and lie inside a fixed m x m square. What would be the best algorithm to solve such a problem?
Update: square is aligned with the axes.
Since they're integers within a square, treat them as points in a bitmap. When you add a point after the first, use Bresenham's algorithm to paint all pixels on each of the lines going through the new point and one of the old ones. When you need to add a new point, get a random location and check if it's clear; otherwise, try again. Since each pair of pixels gives a new line, and thus excludes up to m-2 other pixels, as the number of points grows you will have several random choices rejected before you find a good one. The advantage of the approach I'm suggesting is that you only pay the cost of going through all lines when you have a good choice, while rejecting a bad one is a very quick test.
(if you want to use a different definition of line, just replace Bresenham's with the appropriate algorithm)
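Here is a sketch of this bitmap approach in Python. Instead of Bresenham it steps along each line by the primitive direction vector (gcd stepping), which marks exactly the integer lattice points on the line; names are my own, and as written it loops forever if n is too large for the grid:

```python
import math
import random

def general_position_points(n, m, seed=None):
    """Pick n integer points in an m x m grid with no three collinear.

    `blocked` holds every lattice cell lying on a line through two
    already-chosen points, so rejecting a bad candidate is a cheap
    set lookup, as suggested above."""
    rng = random.Random(seed)
    blocked = set()
    points = []
    while len(points) < n:
        p = (rng.randrange(m), rng.randrange(m))
        if p in blocked or p in points:
            continue  # rejected; the test is very quick
        # Accept p, then mark every lattice point on each line through
        # p and an older point (the older pair lines are already marked).
        for q in points:
            dx, dy = q[0] - p[0], q[1] - p[1]
            g = math.gcd(abs(dx), abs(dy))
            sx, sy = dx // g, dy // g
            for sgn in (1, -1):
                t = sgn
                while True:
                    x, y = p[0] + t * sx, p[1] + t * sy
                    if not (0 <= x < m and 0 <= y < m):
                        break
                    blocked.add((x, y))
                    t += sgn
        points.append(p)
    return points
```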
Can't see any way around checking each point as you add it, either by (a) running through all of the possible lines it could be on, or (b) eliminating conflicting points as you go along to reduce the possible locations for the next point. Of the two, (b) seems like it could give you better performance.
Similar to @LaC's answer. If memory is not a problem, you could do it like this:
Add all points on the plane to a list (L).
Shuffle the list.
For each point (P) in the list,
For each point (Q) previously picked,
Remove every point from L which is collinear with P and Q.
Add P to the picked list.
You could continue the outer loop until you have enough points, or run out of them.
This might just work (though it might be a little constrained in randomness). Find the largest circle you can draw within the square (this seems very doable). Pick any n points on the circle; no three will ever be collinear :-).
This should be an easy enough task in code. Say the circle is centered at the origin (so something of the form x^2 + y^2 = r^2). With r fixed and x randomly generated, you can solve for the y coordinates. This gives you two points on the circle for every x, mirrored across the x-axis. Hope this helps.
Edit: Oh, integer points, I just noticed that. That's a pity. I'm going to keep this solution up though, since I like the idea.
Both @LaC's and @MizardX's solutions are very interesting, and you can combine them to get an even better one.
The problem with @LaC's solution is that random choices get rejected. The more points you have already generated, the harder it gets to generate new ones. If there is only one available position left, you have only a slight chance of randomly choosing it (1/(n*m)).
In @MizardX's solution you never get rejected choices, but if you directly implement the removal step you get worse complexity (O(n^5)).
Instead it would be better to use a bitmap to find which points from L are to be removed. The bitmap would contain a value indicating whether a point is free to use and what is its location on the L list or a value indicating that this point is already crossed out. This way you get worst-case complexity of O(n^4) which is probably optimal.
EDIT:
I've just found this question: Generate Non-Degenerate Point Set in 2D - C++
It's very similar to this one, and it would be good to use the solution from that answer. Modifying it a bit to use radix or bucket sort, and adding all n^2 possible points to the set P initially and shuffling it, one can also get a worst-case complexity of O(n^4) with much simpler code. Moreover, if space is a problem and @LaC's solution is not feasible due to its space requirements, this algorithm fits without modification and offers decent complexity.
Here is a paper that can maybe solve your problem: "Point-Sets in General Position with Many Similar Copies of a Pattern" by Bernardo M. Abrego and Silvia Fernandez-Merchant.
Um, you don't specify which plane, but you could just generate three random numbers and assign them to x, y, and z; if 'the plane' is arbitrary, set z = 0 every time.
Check x and y to see whether they are within your m boundary.
Compare each new (x, y) pair against every pair of previously chosen points to see if it lies on the same line; if it does, regenerate the random values.

CUDA efficient polygons fill algorithm

I need an efficient fill algorithm to fill closed polygons (e.g. scanline fill) that I can run on CUDA. Have you got any suggestions?
Thanks in advance for any replies!
Thrust has a really good scanning algorithm, but only along a single line, so you may need to extend it a bit to work with images. Assuming edge pixels are 1 and everything else is 0, all you need to do is perform a prefix sum along each row of the image. Once the prefix sum is complete, fill the areas where the sum is odd.
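The parity idea as a CPU sketch (a CUDA version would do the per-row prefix sums with Thrust's scan; this assumes each polygon boundary crosses a scanline at isolated single-pixel marks, so horizontal edge runs would need special handling):

```python
def parity_fill(edges):
    """edges: 2D list of 0/1 boundary marks. A pixel is filled if the
    running count of edge pixels up to and including it is odd, or if
    it is itself an edge pixel (even-odd rule along each scanline)."""
    filled = []
    for row in edges:
        out, running = [], 0
        for v in row:
            running += v  # prefix sum along the scanline
            out.append(1 if (running % 2 == 1 or v == 1) else 0)
        filled.append(out)
    return filled
```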
