FASTA Algorithm Explanation - bioinformatics

I'm trying to understand the basic steps of the FASTA algorithm for searching a database for sequences similar to a query sequence. These are the steps of the algorithm:
1) Identify common k-words between I and J
2) Score diagonals with k-word matches, identify 10 best diagonals
3) Rescore initial regions with a substitution score matrix
4) Join initial regions using gaps, penalise for gaps
5) Perform dynamic programming to find final alignments
I'm confused about the 3rd and 4th steps: how the PAM250 score matrix is used, and how to "join using gaps".
Can somebody explain these two steps for me as specifically as possible?
Thanks

This is how FASTA works:
1) Find all k-length identities, then find locally similar regions by selecting those dense with k-word identities (i.e. many k-words, without too many gaps between them). The best ten initial regions are used. (A sketch of this step follows the list.)
2) The initial regions are re-scored along their lengths by applying a substitution matrix in the usual way. Optimally scoring subregions are identified.
3) Create an alignment of the trimmed initial regions using dynamic programming, with a gap penalty of 20. Regions whose scores are too low are not included.
4) Optimize the alignment from 3) using "banded" dynamic programming (Smith-Waterman). This is dynamic programming restricted to a 32-residue-wide band around the original alignment, which saves space and time over full dynamic programming.
If there are too few initial regions to form an alignment in 3), the best score from 2) can be used to rank sequences by similarity. Scores from 3) and 4) can also be used for that purpose.
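As a concrete illustration of step 1, here is a minimal Python sketch (my own, not code from the paper; the k value and the sequences are placeholders): hash every k-word of the query, then score each diagonal of the query-versus-subject comparison by counting k-word hits and keep the best ten.

```python
from collections import defaultdict

def score_diagonals(query, subject, k=2, n_best=10):
    """Count k-word identities on each diagonal (i - j) and return the
    n_best highest-scoring diagonals as (diagonal, hit_count) pairs."""
    # Index every k-word of the query by its starting positions.
    kword_positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        kword_positions[query[i:i + k]].append(i)

    # A matching k-word at query position i / subject position j
    # contributes one hit to diagonal i - j.
    diagonal_hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in kword_positions.get(subject[j:j + k], ()):
            diagonal_hits[i - j] += 1

    return sorted(diagonal_hits.items(), key=lambda d: d[1], reverse=True)[:n_best]

print(score_diagonals("HARFYAAQIVL", "VDMAAQIA", k=2))
```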
Unfortunately my institution doesn't have access to the original FASTA paper so I can't supply the original values of the various parameters mentioned above.

The explanation is essentially correct, but the final band optimization is centered on the one best ungapped alignment found in step 2. Step 3 is used simply to improve the sensitivity in the choice of sequences that get step 4.
The original paper can be seen here: http://faculty.virginia.edu/wrpearson/papers/pearson_lipman_pnas88.pdf
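For the banded optimization in step 4, here is a minimal sketch of the idea (my own simplification, not the paper's code): ordinary local-alignment (Smith-Waterman) dynamic programming, but only cells close to the chosen centre diagonal are filled in, and everything outside the band is treated as score 0. The scoring function and linear gap penalty are placeholders; band=16 corresponds roughly to the 32-residue-wide band mentioned above.

```python
def banded_smith_waterman(query, subject, score, gap_penalty, centre_diag, band=16):
    """Best local alignment score, restricted to cells with |(i - j) - centre_diag| <= band."""
    n, m = len(query), len(subject)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        lo = max(1, i - centre_diag - band)   # first column inside the band
        hi = min(m, i - centre_diag + band)   # last column inside the band
        for j in range(lo, hi + 1):
            diag = prev[j - 1] + score(query[i - 1], subject[j - 1])
            cur[j] = max(0, diag, prev[j] - gap_penalty, cur[j - 1] - gap_penalty)
            best = max(best, cur[j])
        prev = cur
    return best

# Toy match/mismatch scores instead of PAM250, band centred on diagonal 0.
print(banded_smith_waterman("HEAGAWGHEE", "PAWHEAE",
                            lambda a, b: 2 if a == b else -1,
                            gap_penalty=1, centre_diag=0))
```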

Related

Select relevant features with PCA and K-MEANS

I am trying to understand PCA and K-Means algorithms in order to extract some relevant features from a set of features.
I don't know what branch of computer science studies these topics; it seems there aren't good resources on the internet, just some papers that I don't understand well. An example paper: http://www.ifp.illinois.edu/~qitian/e_paper/icip02/icip02.pdf
I have CSV files of people's walks, composed as follows:
TIME, X, Y, Z; these values are recorded by the accelerometer.
What I did
I transformed the dataset into a table in Python.
I used tsfresh, a Python library, to extract a vector of features from each walk; there are a lot of them, 2k+ features per walk.
I have to use PFA (Principal Feature Analysis) to select the relevant features from the set of feature vectors.
In order to do the last point, I have to reduce the dimension of the walks' feature set with PCA (PCA will make the data different from the original, because it transforms the data using the eigenvectors and eigenvalues of the covariance matrix of the original data). Here I have the first question:
How should the input of PCA look? Are the rows the walks and the columns the features, or vice versa (the rows the features and the columns the people's walks)?
After I have reduced this data, I should use the K-Means algorithm on the reduced 'features' data. How should the input to K-Means look? And what's the purpose of using this algorithm? All I know is that this algorithm is used to 'cluster' data, so that each cluster contains 'points' grouped by some rule. What I did and think is:
If the input I give to PCA has the walks as rows and the features as columns, then for K-Means I should swap rows and columns, because that way each point is a feature (but this is not the original data with the features, it's just the reduced one, so I don't know). Then, for each cluster, I check with Euclidean distance which point is closest to the centroid and select that feature. So how many clusters should I declare? If I declare as many clusters as there are features, I will always extract the same number of features. And how can I tell which feature in the original set a point in the reduced data corresponds to?
I know that what I am saying may not be correct, but I am trying to understand it. Can some of you help me? Am I on the right track? Thanks!
For PCA, make sure you separate the method the algorithm uses (eigenvectors and such) from the result. The result is a linear mapping from the original space A to a space A', where the dimension (the number of features, in your case) is possibly lower than in the original space A.
So the first feature/element in space A' is a linear combination of the features of A.
The row/column convention depends on the implementation, but if you use scikit-learn's PCA, the columns are the features and the rows are the samples (your walks).
You can feed the PCA output, the A' space, to K-Means, and it will cluster the points in a space of (usually) reduced dimension.
Each point will be part of a cluster, and the idea is that if you calculated K-Means on A, you would probably end up with the same or similar clusters as with A'; computationally, A' is a lot cheaper. You then have a clustering on both A' and A, as we agree that points that are similar in A' are also similar in A.
The number of clusters is difficult to answer; if you don't know anything, look up the elbow method. But say you want to get a sense of the different types of things you have: I'd argue for roughly 3-8 clusters and not too many, compare the 2-3 points closest to each center, and you have something consumable. The number of clusters doesn't have to match the number of features; either can be larger. E.g. if we want to know the densest spots in some 2D area, you can easily have 50 clusters, to get a sense of where 50 cities could be. Here the number of clusters is way higher than the space dimension, and it makes sense.
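To make the row/column convention concrete, here is a minimal scikit-learn sketch (my own, with placeholder array shapes and a made-up number of components and clusters): rows are the samples (walks), columns are the tsfresh features; PCA maps A to A', and K-Means then clusters the walks in A'.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2000))    # placeholder: 50 walks x 2000 tsfresh features

# Standardize the features, then project onto the principal components (A').
A_scaled = StandardScaler().fit_transform(A)
A_prime = PCA(n_components=10).fit_transform(A_scaled)   # shape (50, 10): one row per walk

# Cluster the walks (the rows), not the features, in the reduced space A'.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(A_prime)
print(kmeans.labels_)              # cluster id for each of the 50 walks
```

(If the end goal is PFA specifically, the points you would feed to K-Means are the features' principal-component loading vectors rather than the walks themselves; the sketch above follows this answer's description of clustering the points of A'.)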

Create matching of row and column entries such that values are maximized

Currently I am facing the following optimization problem, and I can't seem to find the right applicable algorithm for it. It is related to combinatorial optimization problems such as the knapsack problem, but my mathematical knowledge is limited in that respect.
Assume we have the following list of words: ["apple", "banana", "cookie", "donut", "ear", "force"]. Further, assume we have a dataset of texts which, among others, include these words. At some point I compute a cofrequency matrix, that is, a matrix giving, for each pair of words, the frequency with which the two words occur together across all of the files, e.g. cofreq("apple", "banana") = (number of files which contain both apple and banana) / (total files). Therefore cofreq(apple, banana) = cofreq(banana, apple), and we ignore cofreq(apple, apple).
Assume we have the following computed matrix (as an image, adding tables seems to be impossible): Table
The goal now is to create unique word pairs such that the sum of cofrequencies is maximized and every word has a "partner" (we assume an even number of words). In this example it would be:
(apple, force) 0.4
(cookie, donut) 0.5
(banana, ear) 0.05
------------------+--
.95
In this case I did it by hand, but I know that there is a good algorithm for it; I just can't seem to find it. I was hoping someone could point me in the right direction, e.g. to a research paper.
You need to use a maximum weight matching algorithm to compute this maximal sum pairing.
The table you have as input can be seen as the adjacency matrix of a graph, where the values in the table correspond to the graph's edge weights. You can do this because the cofreq value is symmetric (meaning cofreq(apple, banana) == cofreq(banana, apple)).
The matching algorithm you can use here is called the blossom algorithm. It is not trivial, but very elegant. If you have some experience implementing complex algorithms, you can implement it yourself. Otherwise, there are implementations of it in graph libraries for most common languages.
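For example, here is a minimal sketch using networkx, whose max_weight_matching is blossom-based. The dictionary below only contains the three pairs quoted in the example above; the remaining entries of the cofrequency table would need to be added.

```python
import networkx as nx

# Cofrequency values; only the pairs quoted in the example are filled in here.
cofreq = {
    ("apple", "force"): 0.4,
    ("cookie", "donut"): 0.5,
    ("banana", "ear"): 0.05,
    # ... add the rest of the cofrequency matrix entries
}

G = nx.Graph()
for (w1, w2), weight in cofreq.items():
    G.add_edge(w1, w2, weight=weight)

# maxcardinality=True makes sure every word gets a partner when possible.
matching = nx.max_weight_matching(G, maxcardinality=True)
total = sum(G[u][v]["weight"] for u, v in matching)
print(matching, total)
```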

Pointwise vs. pairwise Learning-to-rank on DATA WITH BINARY RELEVANCE VALUES

I have two questions about the differences between pointwise and pairwise learning-to-rank algorithms on DATA WITH BINARY RELEVANCE VALUES (0s and 1s). Suppose the loss function for a pairwise algorithm counts the number of times an entry with label 0 gets ranked before an entry with label 1, and that for a pointwise algorithm measures the overall differences between the estimated relevance values and the actual relevance values.
So my questions are: 1) theoretically, will the two groups of algorithms perform significantly differently? 2) will a pairwise algorithm degrade to a pointwise algorithm in such settings?
thanks!
In pointwise estimation, the errors across rows in your data (rows with items and users; you want to rank items within each user/query) are assumed to be independent, somewhat like normally distributed errors. In pairwise evaluation, on the other hand, the loss function often used is cross-entropy: a relative measure of accurately classifying 1s as 1s and 0s as 0s within each informative pair (i.e. a pair where one item is better than the other).
So chances are that the pairwise approach will learn better than the pointwise one.
The only exception I could see is a business scenario in which users click items without evaluating/comparing them against one another, per se. This is highly unlikely, though.
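To make the contrast in the question concrete, here is a toy illustration (my own, with made-up scores): a pointwise squared-error loss over individual items versus a pairwise count of (1, 0) pairs ranked in the wrong order.

```python
labels = [1, 0, 1, 0, 0]              # binary relevance for one query
scores = [0.9, 0.8, 0.3, 0.2, 0.1]    # model's estimated relevance values

# Pointwise: overall difference between estimated and actual relevance values.
pointwise_loss = sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

# Pairwise: how often an entry with label 0 is ranked at or above an entry with label 1.
pairwise_loss = sum(
    1
    for i, yi in enumerate(labels)
    for j, yj in enumerate(labels)
    if yi == 1 and yj == 0 and scores[i] <= scores[j]
)

print(pointwise_loss, pairwise_loss)  # 0.238 and 1 mis-ordered pair
```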

Fast method of tile fitting/arranging

For example, let's say we have a bounded 2D grid which we want to cover with square tiles of equal size. We have an unlimited number of tiles that fall into a defined number of types. Each type of tile specifies the letters printed on that tile. Letters are printed next to each edge, and only tiles with matching letters on their adjacent edges can be placed next to one another on the grid. Tiles may be rotated.
Given the size of the grid and tile type definitions, what is the fastest method of arranging the tiles such that the above constraint is met and the entire/majority of the grid is covered? Note that my use case is for large grids (~20 in each dimension) and medium-large number of solutions (unlike Eternity II).
So far, I've tried DFS: starting in the center, picking the locations around the filled area that allow the fewest possibilities, and backtracking when no progress can be made. This only works for simple scenarios with one or two types. Any more, and too much backtracking ensues.
Here's a trivial example, showing input and the final output:
This is a hard puzzle.
Eternity II was a puzzle of this form on a 16-by-16 square grid.
Despite a 2 million dollar prize, no one found the solution in several years.
The paper "Jigsaw Puzzles, Edge Matching, and Polyomino Packing: Connections and Complexity" by Erik D. Demaine and Martin L. Demaine shows that this type of puzzle is NP-complete.
Given a problem of this sort with a square grid I would try a brute force list of all possible columns, and then a dynamic programming solution across all of the rows. This will find the number of solutions, and with a little extra work can be used to generate all of the solutions.
However, if your rows are n long and there are m letters with k tiles, then the brute-force list of all possible columns has up to m^n possible right edges, with up to (4k)^n combinations/rotations of tiles needed to generate it. Then the transition from the right edge of one column to the right edge of the next potentially has up to m^(2n) possibilities in it. Those numbers are usually not worst-case scenarios, but the size of those data structures will be the upper bound on the feasibility of this technique.
Of course, if those data structures get too large to be feasible, then the recursive algorithm that you are describing will be too slow to enumerate all solutions. But if there are enough solutions, it still might run acceptably fast even when this approach is infeasible.
Of course if you have more rows than columns, you'll want to reverse the role of rows and columns in this technique.
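Here is a minimal sketch of that idea (my own reading of it, under simplifying assumptions: tiles are (top, right, bottom, left) letter tuples, "matching" means equal letters, and the whole grid must be covered). It enumerates every column whose internal vertical edges match, then runs dynamic programming across columns, so it is only feasible while the set of valid columns stays small, as discussed above.

```python
from itertools import product

def rotations(tile):
    """All four rotations of a (top, right, bottom, left) tile as cyclic shifts."""
    t = tuple(tile)
    return {t[i:] + t[:i] for i in range(4)}

def count_tilings(tile_types, width, height):
    oriented = {r for t in tile_types for r in rotations(t)}

    # Every column whose vertically adjacent edges carry the same letter.
    columns = [
        col for col in product(oriented, repeat=height)
        if all(col[r][2] == col[r + 1][0] for r in range(height - 1))
    ]

    def compatible(left_col, right_col):
        # Horizontally adjacent tiles must agree on their shared edge.
        return all(l[1] == r[3] for l, r in zip(left_col, right_col))

    # DP across columns: counts[c] = number of partial grids ending with column c.
    counts = {c: 1 for c in columns}
    for _ in range(width - 1):
        counts = {
            right: sum(n for left, n in counts.items() if compatible(left, right))
            for right in columns
        }
    return sum(counts.values())

# Two made-up tile types on a 3x3 grid.
print(count_tilings([("a", "a", "b", "b"), ("a", "b", "a", "b")], width=3, height=3))
```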

Graph Simplification Algorithm Advice Needed

I have a need to take a 2D graph of n points and reduce it to r points (where r is a specific number less than n). For example, I may have two datasets with slightly different numbers of total points, say 1021 and 1001, and I'd like to force both datasets to have 1000 points. I am aware of a couple of simplification algorithms: Lang simplification and Douglas-Peucker. I have used Lang in a previous project with slightly different requirements.
The specific properties of the algorithm I am looking for is:
1) must preserve the shape of the line
2) must allow me to reduce the dataset to a specific number of points
3) is relatively fast
This post is a discussion of the merits of the different algorithms. I will post a second message for advice on implementations in Java or Groovy (why reinvent the wheel).
I am concerned about requirement 2 above. I am not expert enough in these algorithms to know whether I can dictate the exact number of output points. The implementation of Lang that I've used took lookAhead, tolerance and the array of Points as input, so I don't see how to dictate the number of points in the output. This is a critical requirement of my current needs. Perhaps this is due to the specific implementation of Lang we had used, but I have not seen a lot of information on Lang on the web. Alternatively, we could use Douglas-Peucker, but again I am not sure whether the number of points in the output can be specified.
I should add that I am not an expert on these types of algorithms or any kind of math whiz, so I am looking for mere-mortal-type advice :) How do I satisfy requirements 1 and 2 above? I would sacrifice performance for the right solution.
I think you can adapt Douglas-Peucker quite straightforwardly. Adapt the recursive algorithm so that rather than producing a list it produces a tree mirroring the structure of the recursive calls. The root of the tree will be the single-line approximation P0-Pn; the next level will represent the two-line approximation P0-Pm-Pn where Pm is the point between P0 and Pn which is furthest from P0-Pn; the next level (if full) will represent a four-line approximation, etc. You can then trim the tree either on the basis of depth or on the basis of distance of the inserted point from the parent line.
Edit: in fact, if you take the latter approach you don't need to build a tree. Instead you populate a priority queue where the priority is given by the distance of the inserted point from the parent line. Then when you've finished the queue tells you which points to remove (or keep, according to the order of the priorities).
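Here is a minimal Python sketch of that priority-queue variant (my own, purely illustrative; the same structure ports to Java or Groovy): run the usual Douglas-Peucker recursion over the whole line, record each inserted point with its distance from the parent segment, then keep the two endpoints plus the r - 2 points with the largest recorded distances.

```python
import heapq

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * (px - ax) - dx * (py - ay)) / (dx * dx + dy * dy) ** 0.5

def simplify_to_r_points(points, r):
    if r >= len(points):
        return list(points)
    heap = []                          # (-distance, index): largest distance pops first
    stack = [(0, len(points) - 1)]     # segments still to be subdivided
    while stack:
        first, last = stack.pop()
        if last <= first + 1:
            continue
        # Furthest interior point from the segment points[first]-points[last].
        dist, idx = max(
            (perpendicular_distance(points[i], points[first], points[last]), i)
            for i in range(first + 1, last)
        )
        heapq.heappush(heap, (-dist, idx))
        stack += [(first, idx), (idx, last)]
    keep = {0, len(points) - 1}        # always keep the endpoints
    while heap and len(keep) < r:
        keep.add(heapq.heappop(heap)[1])
    return [points[i] for i in sorted(keep)]
```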
You can find my C++ implementation and article on Douglas-Peucker simplification here and here. I also provide a modified version of the Douglas-Peucker simplification that allows you to specify the number of points of the resulting simplified line. It uses a priority queue as mentioned by 'Peter Taylor'. It's a lot slower though, so I don't know if it would satisfy the 'is relatively fast' requirement.
I'm planning on providing an implementation of Lang simplification (and several others). Currently I don't see any easy way to adjust Lang to reduce to a fixed point count. If you could live with a less strict requirement, 'must allow me to reduce the dataset to an approximate number of points', then you could use an iterative approach: guess an initial value for lookahead (point count / desired point count), then slowly increase the lookahead until you approximately hit the desired point count.
I hope this helps.
P.S.: I just remembered something: you could also try the Visvalingam-Whyatt algorithm. In short (a sketch follows these steps):
-compute the triangle area for each point with its direct neighbors
-sort these areas
-remove the point with the smallest area
-update the areas of its neighbors
-re-sort
-continue until n points remain
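Here is a minimal Python sketch of those steps (my own, unoptimized: it recomputes all triangle areas on every pass instead of re-sorting and updating only the removed point's neighbors, which keeps the code short but makes it O(n^2)):

```python
def triangle_area(a, b, c):
    """Area of the triangle formed by points a, b and c."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def visvalingam_whyatt(points, n):
    """Repeatedly drop the interior point with the smallest triangle area
    until only n points remain."""
    pts = list(points)
    while len(pts) > max(n, 2):
        _, idx = min(
            (triangle_area(pts[i - 1], pts[i], pts[i + 1]), i)
            for i in range(1, len(pts) - 1)
        )
        del pts[idx]
    return pts

print(visvalingam_whyatt([(0, 0), (1, 0.1), (2, 0.0), (3, 5), (4, 6), (5, 7.1), (6, 8), (7, 9)], 4))
```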

Resources