Prove that L >= G for Local and Global alignments of a specific function - bioinformatics

I'm taking a bioinformatics class this semester and I'm having trouble with a specific question from the book.
Given two DNA sequences S and T of the same length n, let the scoring function be defined as follows: match = 1, mismatch = -1, indel (gap) = -2. Suppose that G and L are the scores of an optimal global alignment and an optimal local alignment between S and T, respectively.
Prove that L >= G.
I understand how to compute both alignments for two given sequences, but I'm having trouble with the proof. As far as I can tell the claim is true: it seems G can never exceed L, because the indel penalty is so costly that matches can't make up for it. I also constructed an example where the two are equal, so I know equality is possible.
So yeah, any hints on how to go about this would be great.

Well, this site isn't supposed to be about doing your homework, but it's a simple question, so let's have a crack at it:
We'll take the observations you make about the scoring as given.
Suppose the contrary, that L < G. But the global alignment of S and T is itself a valid local alignment (in the worst case, your best local alignment IS the global alignment), so the optimal local alignment must score at least as well as the optimal global one, i.e. L >= G. That contradicts the assumption.
Therefore there are no counterexamples, so the statement must hold.
Hope that makes sense!
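To see this concretely, here is a small sketch (mine, not from the answer above) that computes both scores with standard dynamic programming: Needleman-Wunsch for the global score and Smith-Waterman for the local score, using the question's scoring values. The sequences are arbitrary examples.
MATCH, MISMATCH, GAP = 1, -1, -2

def score(a, b):
    return MATCH if a == b else MISMATCH

def global_score(s, t):
    # Needleman-Wunsch: dp[i][j] = best score aligning s[:i] with t[:j]
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i-1][j-1] + score(s[i-1], t[j-1]),
                           dp[i-1][j] + GAP,
                           dp[i][j-1] + GAP)
    return dp[n][m]

def local_score(s, t):
    # Smith-Waterman: the 0 option lets an alignment start anywhere
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(0,
                           dp[i-1][j-1] + score(s[i-1], t[j-1]),
                           dp[i-1][j] + GAP,
                           dp[i][j-1] + GAP)
            best = max(best, dp[i][j])
    return best

S, T = "ACGTAC", "AGGTAC"
G, L = global_score(S, T), local_score(S, T)
print(G, L, L >= G)   # L should never be smaller than G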


Bloom filters for determining which sets in a family are subsets of a given set

I am trying to use a Bloom filter to determine which sets from a family of sets A1, A2,...,Am are subsets of another fixed set Q. I am hoping that someone can verify the correctness of the stated approach or offer any improvements.
Let Q be a given set of integers, containing anywhere from 1-10000 elements from the universe set U = {1,2,...,10000}.
Also, let there be a family of sets A1, A2,...,Am each containing anywhere from 1-3 elements from the same universe set U. The size m is on the order of 5000.
Outline of the algorithm:
Let there be a collection of k hash functions. For each element of Q, apply the hash functions and set the resulting bit positions in a bitset of size n, denoted Q_b.
Also, for each of the Ai, i = 1,...,m sets, apply the hash functions to each element of Ai, generating the bitset (also of size n), denoted Ai_b.
To check if Ai is a subset of Q, perform a logical AND on the two bitsets, Q_b & Ai_b, and check if it is equal to the bitset Ai_b. That is, if Q_b & Ai_b == Ai_b is false, then we know that Ai is not a subset of Q; if it is true, then we do not know for sure (possibility of a false positive) and we need to check the given Ai using a deterministic approach.
The hope is that the filter tells us the majority of the Ai's that are not in Q and we can check the ones that return true more carefully.
Is this a good approach for my problem?
(Side questions: How big should n be? What are some good hash functions to use?)
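Not part of the original question, but here is a minimal sketch of the filter just described; the bitset size, number of hashes, and the salted SHA-256 construction are placeholder choices, not recommendations.
import hashlib

N_BITS = 1 << 16     # placeholder bitset size n
K_HASHES = 4         # placeholder number of hash functions k

def positions(x, k=K_HASHES, n=N_BITS):
    # k salted hashes of x, each mapped to a bit position in [0, n)
    return [int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % n
            for i in range(k)]

def make_bitset(elements):
    bits = 0
    for x in elements:
        for p in positions(x):
            bits |= 1 << p
    return bits

Q = {3, 17, 42, 9999}
A = [{3, 17}, {42, 5}, {9999}]

Q_b = make_bitset(Q)
for Ai in A:
    Ai_b = make_bitset(Ai)
    if Q_b & Ai_b == Ai_b:
        # possibly a subset (may be a false positive), so verify exactly
        print(Ai, "maybe subset ->", Ai <= Q)
    else:
        print(Ai, "definitely not a subset")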
If the range of values is rather small (as in your example), you can use a simple deterministic solution with linear time complexity.
Let's create an array was (with indices from 1 to 10000, that is, one cell for each element of the universal set), initially filled with false values.
For each element q of Q, we set was[q] = true.
Now we iterate over all sets of the family. For each set A_i, we iterate over all elements x of the set and check if was[x] is true. If it's not for at least one x, then A_i is not a subset of Q. Otherwise, it is.
This solution is clearly correct as it checks if one set is a subset of the other by definition. It's also rather simple and deterministic. The only potential downside it has is that it requires an auxiliary array of 10000 elements, but it looks admissible for most practical purposes (a bloom filter would require some extra space too, anyway).
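A minimal sketch of that deterministic check (my code):
def subsets_of_Q(Q, family, universe_size=10000):
    # was[x] is True iff x is in Q (1-based universe {1, ..., universe_size})
    was = [False] * (universe_size + 1)
    for q in Q:
        was[q] = True
    # A_i is a subset of Q iff every one of its elements is marked
    return [all(was[x] for x in A_i) for A_i in family]

print(subsets_of_Q({1, 5, 7}, [{1, 5}, {5, 8}, {7}]))   # [True, False, True]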
Please try to ask only one question in your question.
I will address the first one: "Is this a good approach for my problem?", but not the last two, "How big should n be? What are some good hash functions to use?"
This is probably not a good approach.
First, Q is tiny; 10,000 elements from {1,...,10k} means Q can be stored with a bitset in 10k bits or about 1.2 kibibytes. That is very, very small. For instance, it is smaller than your question, which uses almost 1.5 kibibytes.
Second, Ai contains one to three elements, so Ai_b will likely be larger than Ai itself unless you choose the filters to be so small that the false positive rate is very high.
Finally, hash function computation is not free.
You can do this much more simply if you check each element of each Ai against a bitset representing Q.
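For instance (my sketch), using Python integers as bitsets:
def to_bitset(elements):
    bits = 0
    for x in elements:
        bits |= 1 << x          # one bit per universe element
    return bits

Q_bits = to_bitset({1, 5, 7})
for Ai in [{1, 5}, {5, 8}]:
    Ai_bits = to_bitset(Ai)
    print(Ai, Q_bits & Ai_bits == Ai_bits)   # exact subset test, no false positives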

I have a function f(w,x,y,z) and a target value A, how can I discover values for w,x,y,z that produce A?

So I have a function that takes four numerical arguments and produces a numerical result.
f(w,x,y,z) --> A
If I have the function f and a target result A, is there an iterative method for discovering parameters w,x,y,z that produce a given number A?
If it helps, my function f is a quintic Bézier where most of the parameters are already determined. I have isolated just the four that are required to fit the value A.
Q(t) = R(1−t)^5 + 5S(1−t)^4 t + 10T(1−t)^3 t^2 + 10U(1−t)^2 t^3 + 5V(1−t) t^4 + W t^5
R, S, T, U, V, W are vectors; R and W are known, and I have isolated a single element in each of S, T, U, V to vary as the parameters.
The set of solutions of the equation f(w,x,y,z)=A (where all of w, x, y, z and A are scalars) is, in general, a 3 dimensional manifold (surface) in the 4-dimensional space R^4 of (w,x,y,z). I.e., the solution is massively non-unique.
Now, if f is simple enough for you to compute its derivative, you can use Newton's method to find a root: the gradient is the direction of the fastest change of the function, so you step in that direction.
Specifically, let X_0=(w_0,x_0,y_0,z_0) be your initial approximation of a solution and let G=f'(X_0) be the gradient at X_0.
Then f(X_0+h)=f(X_0)+(G,h)+O(|h|^2) (where (a,b) is the dot product).
Let h=a*G, and solve A=f(X_0)+a*|G|^2 to get a=(A-f(X_0))/|G|^2 (if G=0, change X_0) and X_1=X_0+a*G. If f(X_1) is close enough to A, you are done, otherwise proceed to compute f'(X_1) &c.
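Here is a rough sketch of that iteration (my code; the gradient is estimated numerically in case f' is not available in closed form, and the example function is arbitrary):
import numpy as np

def grad(f, X, eps=1e-6):
    # forward-difference estimate of the gradient
    g = np.zeros_like(X)
    fx = f(X)
    for i in range(len(X)):
        Xp = X.copy()
        Xp[i] += eps
        g[i] = (f(Xp) - fx) / eps
    return g

def solve_for_target(f, A, X0, tol=1e-8, max_iter=100):
    X = np.asarray(X0, dtype=float)
    for _ in range(max_iter):
        r = A - f(X)
        if abs(r) < tol:
            break
        G = grad(f, X)
        norm2 = G.dot(G)
        if norm2 == 0.0:          # gradient vanished: nudge the point instead
            X = X + 1e-3
            continue
        X = X + (r / norm2) * G   # the step a*G with a = (A - f(X)) / |G|^2
    return X

# arbitrary example: f(w, x, y, z) = w*x + y**2 - z
f = lambda X: X[0] * X[1] + X[2] ** 2 - X[3]
X = solve_for_target(f, A=5.0, X0=[1.0, 1.0, 1.0, 1.0])
print(X, f(X))   # f(X) should be close to 5.0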
If you cannot compute f', you can play with many other methods.
If you can impose 3 (or more) additional equations that you know (or suspect) must hold for your 4-variable solution with target value A, then you can try applying Newton's method for solving a system of k equations in k unknowns. Otherwise, without a deeper understanding of the structure of the function you are trying to make equal to A, the only general technique I'm aware of that's easy to implement is to define the error function g(w,x,y,z) = |f(w,x,y,z) - A| and search for a minimum of g. Typically the "minimum" found will be a local minimum, so it may take many restarts of the minimization with different starting values for your parameters to actually find the minimum you want, g = 0. This is very easy to implement and try in a few lines, e.g. in MATLAB using fminsearch.
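The fminsearch suggestion translates almost directly to Python's scipy (a sketch with the same kind of arbitrary example function; as noted above, restarts from different starting points may be needed):
from scipy.optimize import minimize

A = 5.0
f = lambda X: X[0] * X[1] + X[2] ** 2 - X[3]   # arbitrary example
g = lambda X: abs(f(X) - A)                    # error function to minimize

res = minimize(g, x0=[1.0, 1.0, 1.0, 1.0], method="Nelder-Mead")
print(res.x, f(res.x))   # a local minimum of g; hopefully g == 0 there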

Minimize a function

Suppose you are given a function of a single variable and arguments a and b and are asked to find the minimum value that the function takes on the interval [a, b]. (You can assume that the argument is a double, though in my application I may need to use an arbitrary-precision library.)
In general this is a hard problem because functions can be weird. A simple version of this problem would be to minimize the function assuming that it is continuous (no gaps or jumps) and single-peaked (there is a unique minimum; to the left of the minimum the function is decreasing and to the right it is increasing). Is there a good way to solve this easier (but perhaps not easy!) problem?
Assume that the function may be expensive to evaluate, but that storing an answer you've already computed is cheap. (Obviously, it's better if you don't have to build giant arrays of key/value pairs.)
Bonus points for good ideas on improving the algorithm in the fortunate case in which it's nice (e.g.: derivative exists, function is smooth/analytic, derivative can be computed in closed form, derivative can be computed at no cost when the function is evaluated).
The version you describe, with a single minimum, is easy to solve.
The idea is this. Suppose that I have 3 points with a < b < c, and f(b) < f(a) and f(b) < f(c). Then the true minimum is between a and c. Furthermore, if I pick another point d somewhere in the interval and compare f(d) with f(b), I can throw away one of the end points (a or c) and still have a bracket with the true minimum in the middle. My approximations improve exponentially quickly as I do more iterations.
We don't quite start with this. We start with 2 points, a and b, and know that the answer is somewhere in the middle. Take the mid-point. If f there is below the end points, we're into the case I discussed above. Otherwise it must be below one of the end points, and above the other. We can throw away the higher end point and repeat.
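A rough sketch in the same spirit (my code, plain ternary search rather than the exact bookkeeping above), assuming f is continuous and single-peaked on [a, b]:
def minimize_unimodal(f, a, b, tol=1e-9):
    # Shrink the bracket: compare f at the two one-third points
    while b - a > tol:
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if f(m1) < f(m2):
            b = m2          # the minimum cannot lie in (m2, b]
        else:
            a = m1          # the minimum cannot lie in [a, m1)
    return (a + b) / 2.0

print(minimize_unimodal(lambda x: (x - 1.3) ** 2 + 4, 0.0, 10.0))   # ~1.3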
If the function is nice, i.e., single-peaked and strictly monotonic (i.e., strictly decreasing to the left of the minimum and strictly increasing to the right), then you can find the minimum with binary search:
Set x = (a+b)/2
test whether x is to the left or to the right of the minimum
if x is left of the minimum: a = x
if x is right of the minimum: b = x
repeat from the start until you get bored
the minimum is at x
To test whether x is left/right of the minimum, invent a small value epsilon and check whether f(x - epsilon) < f(x + epsilon). If it is, the minimum is to the left, otherwise it's to the right. By "until you get bored", I mean: invent another small value delta and stop if fabs(f(x - epsilon) - f(x + epsilon)) < delta.
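A direct transcription of that pseudocode (my code; epsilon, delta, and the iteration cap are arbitrary):
def minimize_bisect(f, a, b, eps=1e-7, delta=1e-12, max_iter=200):
    # Assumes f is strictly decreasing left of the minimum and strictly increasing right of it.
    for _ in range(max_iter):
        x = (a + b) / 2.0
        left, right = f(x - eps), f(x + eps)
        if abs(left - right) < delta:   # "until you get bored"
            return x
        if left < right:    # f is increasing at x, so the minimum is to the left
            b = x
        else:               # minimum is to the right
            a = x
    return (a + b) / 2.0

print(minimize_bisect(lambda x: (x - 2.5) ** 2, 0.0, 10.0))   # ~2.5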
Note that in the general case where you don't know anything about the behavior of a function f, it's not possible to decide a non-trivial property of f. Well, unless you're willing to try all possible inputs. See Rice's Theorem for details.
The Boost project has an implementation of Brent's algorithm that may be useful.
It seems to assume that the function is continuous, and has no maxima (only a minimum) in the input interval.
Not a direct answer but a pointer to more reading:
scipy.optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
section e04 of naglib: http://www.nag.co.uk/numeric/cl/nagdoc_cl09/html/genint/libconts.html
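For example (a quick sketch of the scipy route, with an arbitrary test function):
from scipy.optimize import minimize_scalar

res = minimize_scalar(lambda x: (x - 1.7) ** 2 + 0.5, bounds=(0.0, 10.0), method="bounded")
print(res.x, res.fun)   # ~1.7, ~0.5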
For the special case where the function is differentiable twice (and the two derivatives can be calculated easily), one can use Newton's method for optimization, i.e. essentially finding the roots of the first derivative (which is a necessary condition for the minimum).
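A sketch of that Newton iteration on the derivative (my code; the example function and its derivatives are hard-coded and arbitrary):
def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    # Newton's method applied to the first derivative: seek df(x) = 0
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# arbitrary example: f(x) = x**4 - 3*x + 1 on a region where it is convex
df  = lambda x: 4 * x ** 3 - 3
d2f = lambda x: 12 * x ** 2
print(newton_minimize(df, d2f, x0=1.0))   # ~ (3/4)**(1/3) ≈ 0.908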
Concerning the general case, note that the extreme case of 'weird' is a function which is continuous nowhere and for which it is very hard if not impossible to find the minimum (in finite time). So I guess you should try to make at least some assumptions about the function you are trying to minimize.
What you want is to minimize a unimodal function. The correct algorithm is similar to btilly's, but you need extra points.
Take 4 points a < b < c < d.
We want to minimize f in [a,d].
If f(b) < f(c) we know the minimum is in [a, c]
If f(b) > f(c) we know the minimum is in [b, d]
This can give an algorithm by itself, but there is a nice trick involving the golden ratio that lets you reuse the intermediate values, so you only need to compute f once per iteration instead of twice.
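A sketch of golden-section search (my code), which is exactly that trick: one interior point and its f value are reused each iteration, so only one new evaluation of f is needed:
import math

def golden_section(f, a, d, tol=1e-9):
    # Maintains a < b < c < d with the interior points placed at the golden ratio,
    # so after each shrink one interior point (and its f value) can be reused.
    invphi = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    b, c = d - invphi * (d - a), a + invphi * (d - a)
    fb, fc = f(b), f(c)
    while d - a > tol:
        if fb < fc:                          # minimum is in [a, c]
            d, c, fc = c, b, fb
            b = d - invphi * (d - a)
            fb = f(b)
        else:                                # minimum is in [b, d]
            a, b, fb = b, c, fc
            c = a + invphi * (d - a)
            fc = f(c)
    return (a + d) / 2.0

print(golden_section(lambda x: (x - 0.7) ** 2, -5.0, 5.0))   # ~0.7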
If you have an expression for the function, there are global optimization algorithms based on interval analysis.

How to compute palindrome from a stream of characters in sub-linear space/time?

I don't even know if a solution exists or not. Here is the problem in detail. You are a program that is accepting an infinitely long stream of characters (for simplicity you can assume characters are either 1 or 0). At any point, I can stop the stream (let's say after N characters have passed through) and ask you whether the string received so far is a palindrome. How can you do this using sub-linear space and/or time?
Yes. The answer is about two-thirds of the way down http://rjlipton.wordpress.com/2011/01/12/stringology-the-real-string-theory/
EDIT: Some people have asked me to summarize the result, in case the link dies. The link gives some details about a proof of the following theorem: There is a multi-tape Turing machine that can recognize initial non-trivial palindromes in real-time. (A summary, also provided by the article linked: Suppose the machine has read x1, x2, ..., xk of the input. Then it has only constant time to decide if x1, x2, ..., xk is a palindrome.)
A multitape Turing machine is just one with several side-by-side tapes that it can read and write to; in a very specific sense it is exactly equivalent to a standard Turing machine.
A real-time computation is one in which the Turing machine must read a character of input at least once every M steps (for some bounded constant M). It follows that any real-time algorithm is also a linear-time algorithm.
There is a roughly 10-page paper on the proof, available behind an institutional paywall, which I will not repost elsewhere. You can contact the author for a more detailed explanation if you'd like; I had just read it recently and realized it was more or less what you were looking for.
You could use a rolling hash, or more rolling hashes for accuracy. Incrementally compute the hash of the characters read so far, in the order they were read, and in reverse order of reading.
If your hash function is, for example, x_1*3^(k-1) + x_2*3^(k-2) + ... + x_k*3^0, where x_i is the i-th character read, this is how you'd do it:
hLeftRight = 0
hRightLeft = 0
k = 0
repeat while there are characters in the stream
    x = stream.Get()
    hLeftRight = 3*hLeftRight + x.Value
    hRightLeft = hRightLeft + 3^k * x.Value
    if x.QueryPalindrome == true
        yield hLeftRight == hRightLeft
    k = k + 1
Obviously you'd have to calculate the hashes modulo something, probably a prime or a power of two. And of course, this could lead to false positives.
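In Python, the same idea looks roughly like this (my code; a single modular hash, so false positives remain possible):
MOD = (1 << 61) - 1   # a large prime modulus (arbitrary choice)
BASE = 3

def palindrome_checker():
    # Feed characters one at a time with push(); check() compares the
    # forward and reverse rolling hashes in O(1) per query.
    h_lr = 0          # hash of the stream read left-to-right
    h_rl = 0          # hash of the stream read right-to-left
    power = 1         # BASE ** k mod MOD, where k = characters read so far
    def push(ch):
        nonlocal h_lr, h_rl, power
        v = ord(ch)
        h_lr = (h_lr * BASE + v) % MOD
        h_rl = (h_rl + v * power) % MOD
        power = (power * BASE) % MOD
    def check():
        return h_lr == h_rl   # equal hashes: probably a palindrome
    return push, check

push, check = palindrome_checker()
for ch in "abcba":
    push(ch)
print(check())   # True (up to hash collisions)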
Round 2
As I see it, with each new character, there are three cases:
Character breaks potential symmetry, for example, aab -> aabc
Character extends the middle, for example aab -> aabb
Character continues symmetry, for example aab->aaba
Assume you have a pointer that tracks down the string and points to the last character that continued a potential palindrome.
(I am going to use parenthesis to indicate a pointed at character)
Let's say you are starting with aa(b) and get:
an 'a' (case 3): you move the pointer to the left and check if it's an 'a' (it is). You now have a(a)b.
a 'c' (case 1): you are not expecting a 'c', so you start back at the beginning and you now have aab(c).
The really tricky case is 2, because somehow you have to know that the character you just got isn't affecting symmetry, it is just extending the middle. For this, you have to hold an additional pointer that tracks where the plateau's (middle's) edge lies. For example, you have (b)baabb and you just got another 'b', in this case you have to know to reset the pointer to the base of the middle plateau here: bbaa(b)bb. Since we are going for constant time, you have to hold a pointer here to begin with (you can't afford the time to search for the plateau's edge). Now if you get another 'b', you know that you are still on the edge of that plateau and you keep the pointer where it is, so bbaa(b)bb -> bbaa(b)bbb. Now, if you get an 'a', you know that the 'b's are not part of the extended middle and you reset both pointers (The tracking pointer and the edge pointer) so you now have bbaabbbb((a)).
With these three cases, I think all bases are covered. If you ever want to check if the current string is a palindrome, check if the first pointer (not the plateau's edge pointer) is at index 0.
This might help you:
http://arxiv.org/pdf/1308.3466v1.pdf
If you store the last k input symbols you can easily find palindromes up to length k.
If you use the algorithms of the paper you can find the midpoints of palindromes and an estimate of their lengths.

Dynamic Programming: Sum-of-products

Let's say you have two lists, L1 and L2, of the same length, N. We define prodSum as:
def prodSum(L1, L2):
    ans = 0
    for elem1, elem2 in zip(L1, L2):
        ans += elem1 * elem2
    return ans
Is there an efficient algorithm to find, assuming L1 is sorted, the number of permutations of L2 such that prodSum(L1, L2) < some pre-specified value?
If it would simplify the problem, you may assume that L1 and L2 are both lists of integers from [1, 2, ..., N].
Edit: Managu's answer has convinced me that this is impossible without assuming that L1 and L2 are lists of integers from [1, 2, ..., N]. I'd still be interested in solutions that assume this constraint.
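For reference, a brute-force counter (not from the original post) that is only feasible for small N but is handy for sanity-checking faster approaches:
from itertools import permutations

def count_perms_brute(L1, L2, limit):
    # Counts permutations of L2 whose product-sum with L1 is below limit.
    # O(N! * N), so only usable as a correctness check for small N.
    return sum(1 for p in permutations(L2)
               if sum(a * b for a, b in zip(L1, p)) < limit)

print(count_perms_brute([1, 2, 3], [1, 2, 3], 14))   # 5 of the 6 permutations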
I want to first dispel a certain amount of confusion about the math, then discuss two solutions and give code for one of them.
There is a counting class called #P which is a lot like the yes-no class NP. In a qualitative sense, it is even harder than NP. There is no particular reason to believe that this counting problem is any better than #P-hard, although it could be hard or easy to prove that.
However, many #P-hard problems and NP-hard problems vary tremendously in how long they take to solve in practice, and even one particular hard problem can be harder or easier depending on the properties of the input. What NP-hard or #P-hard mean is that there are hard cases. Some NP-hard and #P-hard problems also have less hard cases or even outright easy cases. (Others have very few cases that seem much easier than the hardest cases.)
So the practical question could depend a lot on the input of interest. Suppose that the threshold is on the high side or on the low side, or you have enough memory for a decent number of cached results. Then there is a useful recursive algorithm that makes use of two ideas, one of them already mentioned: (1) After partially assigning some of the values, the remaining threshold for list fragments may rule out all of the permutations, or it may allow all of them. (2) Memory permitting, you should cache the subtotals for some remaining threshold and some list fragments. To improve the caching, you might as well pick the elements from one of the lists in order.
Here is Python code that implements this algorithm:
list1 = [1,2,3,4,5,6,7,8,9,10,11]
list2 = [1,2,3,4,5,6,7,8,9,10,11]
size = len(list1)
threshold = 396 # This is smack in the middle, a hard value
cachecutoff = 6 # Cache results when up to this many are assigned

def dotproduct(v,w):
    return sum([a*b for a,b in zip(v,w)])

factorial = [1]
for n in xrange(1,len(list1)+1):
    factorial.append(factorial[-1]*n)

cache = {}

# Assumes two sorted lists of the same length
def countprods(list1,list2,threshold):
    if dotproduct(list1,list2) <= threshold: # They all work
        return factorial[len(list1)]
    if dotproduct(list1,reversed(list2)) > threshold: # None work
        return 0
    if (tuple(list2),threshold) in cache: # Already been here
        return cache[(tuple(list2),threshold)]
    total = 0
    # Match the first element of list1 to each item in list2
    for n in xrange(len(list2)):
        total += countprods(list1[1:],list2[:n] + list2[n+1:],
                            threshold-list1[0]*list2[n])
    if len(list1) >= size-cachecutoff:
        cache[(tuple(list2),threshold)] = total
    return total

print 'Total permutations below threshold:',
print countprods(list1,list2,threshold)
print 'Cache size:',len(cache)
As the comment line says, I tested this code with a hard value of the threshold. It is quite a bit faster than a naive search over all permutations.
There is another algorithm that is better than this one if three conditions are met: (1) You don't have enough memory for a good cache, (2) the list entries are small non-negative integers, and (3) you're interested in the hardest thresholds. A second situation to use this second algorithm is if you want counts for all thresholds flat-out, whether or not the other conditions are met. To use this algorithm for two lists of length n, first pick a base x which is a power of 10 or 2 that is bigger than n factorial. Now make the matrix
M[i][j] = x**(list1[i]*list2[j])
If you compute the permanent of this matrix M using the Ryser formula, then the kth digit of the permanent in base x tells you the number of permutations for which the dot product is exactly k. Moreover, the Ryser formula is quite a bit faster than the summing over all permutations directly. (But it is still exponential, so it does not contradict the fact that computing the permanent is #P-hard.)
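To make that concrete, here is a small sketch of the digit-extraction trick (my code; it assumes small non-negative integer list entries and, being Ryser's formula, is still exponential, so only sensible for very small n):
from itertools import combinations

def count_below_via_permanent(list1, list2, threshold):
    n = len(list1)
    base = 1
    for i in range(2, n + 1):
        base *= i
    base += 1                       # base > n!, so digit counts cannot carry
    # M[i][j] = base ** (list1[i] * list2[j]); the permanent collects base**(dot product)
    M = [[base ** (a * b) for b in list2] for a in list1]
    # Ryser's formula: perm(M) = (-1)^n * sum_S (-1)^|S| * prod_i sum_{j in S} M[i][j]
    perm = 0
    for r in range(1, n + 1):
        for S in combinations(range(n), r):
            prod = 1
            for i in range(n):
                prod *= sum(M[i][j] for j in S)
            perm += (-1) ** r * prod
    perm *= (-1) ** n
    # digit k of perm (in the chosen base) = number of permutations with dot product exactly k
    return sum((perm // base ** k) % base for k in range(threshold))

print(count_below_via_permanent([1, 2, 3], [1, 2, 3], 14))   # 5, matching brute force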
Also, yes it is true that the set of permutations is the symmetric group. It would be great if you could use group theory in some way to accelerate this counting problem. But as far as I know, nothing all that deep comes from that description of the question.
Finally, if instead of exactly counting the number of permutations below a threshold, you only wanted to approximate that number, then probably the game changes completely. (You can approximate the permanent in polynomial time, but that doesn't help here.) I'd have to think about what to do; in any case it isn't the question posed.
I realized that there is another kind of caching/dynamic programming that is missing from the above discussion and the above code. The caching implemented in the code is early-stage caching: If just the first few values of list1 are assigned to list2, and if a remaining threshold occurs more than once, then the cache allows the code to reuse the result. This works great if the entries of list1 and list2 are integers that are not too large. But it will be a failed cache if the entries are typical floating point numbers.
However, you can also precompute at the other end, when most of the values of list1 have been assigned. In this case, you can make a sorted list of the subtotals for all of the remaining values. And remember, you can use up list1 in order, and do all of the permutations on the list2 side. For example, suppose that the last three entries of list1 are [4,5,6], and suppose that three of the values in list2 (somewhere in the middle) are [2.1,3.5,3.7]. Then you would cache a sorted list of the six dot products:
endcache[ [2.1, 3.5, 3.7] ] = [44.9, 45.1, 46.3, 46.7, 47.9, 48.1]
What does this do for you? If you look in the code that I did post, the function countprods(list1,list2,threshold) recursively does its work with a sub-threshold. The first argument, list1, might have been better as a global variable than as an argument. If list2 is short enough, countprods can do its work much faster by doing a binary search in the list endcache[list2]. (I just learned from stackoverflow that this is implemented in the bisect module in Python, although a performance code wouldn't be written in Python anyway.) Unlike the head cache, the end cache can speed up the code a lot even if there are no numerical coincidences among the entries of list1 and list2. Ryser's algorithm also stinks for this problem without numerical coincidences, so for this type of input I only see two accelerations: Sawing off a branch of the search tree using the "all" test and the "none" test, and the end cache.
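The end cache in miniature (my code), reproducing the example above and using bisect to count how many tail permutations stay strictly below a remaining threshold:
import bisect
from itertools import permutations

def build_endcache(tail1, tail2):
    # Sorted dot products of tail1 against every permutation of tail2.
    return sorted(sum(a * b for a, b in zip(tail1, p)) for p in permutations(tail2))

cache = build_endcache([4, 5, 6], [2.1, 3.5, 3.7])
print(cache)   # [44.9, 45.1, 46.3, 46.7, 47.9, 48.1] (up to float rounding), as above

# How many tail permutations keep the running total strictly below the remaining threshold?
print(bisect.bisect_left(cache, 46.5))   # 3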
Probably not (without the simplifying assumption): your problem is NP-hard. Here's a simple reduction from SUBSET-SUM. Let count_perms(L1, L2, x) represent the function "count the number of permutations of L2 such that prodSum(L1, L2) < x"
SUBSET_SUM(L2, n):  # determine if any subset of L2 adds up to n
    for i in [1, ..., len(L2)]:
        set L1 = [0]*(len(L2)-i) + [1]*i
        calculate count_perms(L1, L2, n+1) - count_perms(L1, L2, n)
        if the result is positive, return true
    return false
Thus, if there were a way to calculate your function count_perms(L1, L2, x) efficiently, then we would have an efficient algorithm to calculate SUBSET_SUM(L2,n).
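Spelled out in code (mine), with a slow brute-force counter standing in for count_perms:
from itertools import permutations

def count_perms(L1, L2, limit):
    # stand-in for an efficient counter; brute force over all permutations
    return sum(1 for p in permutations(L2)
               if sum(a * b for a, b in zip(L1, p)) < limit)

def subset_sum(L2, n):
    # True iff some non-empty subset of L2 sums to n, using only the permutation counter
    for i in range(1, len(L2) + 1):
        L1 = [0] * (len(L2) - i) + [1] * i
        if count_perms(L1, L2, n + 1) - count_perms(L1, L2, n) > 0:
            return True
    return False

print(subset_sum([3, 7, 1], 8))   # True  (7 + 1)
print(subset_sum([3, 7, 1], 6))   # False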
This also turns out to be an abstract algebra problem. It's been a while for me, but here are a few things to get started. There's nothing terribly significant about the following (it's all very basic; an expansion on the fact that every group is isomorphic to a permutation group), but it provides a different way of looking at the problem.
I'll try to stick to fairly standard notation: "x" is a vector, and "x_i" is the i-th component of x. If "L" is a list, "L" also denotes the equivalent vector. "1_n" is the vector with all components equal to 1. The set of natural numbers ℕ is taken to be the positive integers. "[a,b]" is the set of integers from a through b, inclusive. "θ(x, y)" is the angle formed by x and y.
Note that prodSum is the dot product. The question is equivalent to finding all vectors L generated by an operation (permuting elements) on L2 such that θ(L1, L) is less than a given angle α. The operation is equivalent to reflecting a point in ℕ^n through a subspace with presentation:
< ℕ^n | (x_i x_j^-1), (i,j) ∈ A >
where i and j are in [1,n], A has at least one element, and no (i,i) is in A (i.e. A is a non-reflexive subset of [1,n]^2 with |A| > 0). Stated more plainly (and more ambiguously), the subspaces are the points where one or more components are equal to one or more other components. The reflections correspond to matrices whose columns are all the standard basis vectors.
Let's name the reflection group "RP_n" (it should have another name, but memory fails). RP_n is isomorphic to the symmetric group S_n. Thus
|RP_n| = |S_n| = n!
In 3 dimensions, this gives a group of order 6. The reflection group is D_3, the triangle symmetry group, as a subgroup of the cube symmetry group. It turns out you can also generate the points by rotating L2 in increments of π/3 around the line along 1_n. This is the cyclic group ℤ_6, and it points to a possible solution: find a group of order n! with a minimal number of generators and use it to generate the permutations of L2 as sequences with increasing, then decreasing, angle with L2. From there, we can try to generate the elements L with θ(L1, L) < α directly (for example, we can binary-search the first half of each sequence to find the transition point; with that, we can specify the rest of the sequence that fulfills the condition and count it in O(1) time). Let's call this group RP'_n.
RP'_4 is constructed of 4 subgroups isomorphic to ℤ_6. More generally, RP'_n is constructed of n subgroups isomorphic to RP'_{n-1}.
This is where my abstract algebra muscles really begin to fail. I'll try to keep working on the construction, but Managu's answer doesn't leave much hope. I fear that reducing RP_3 to ℤ_6 is the only useful reduction we can make.
It looks like if L1 and L2 are both ordered high-to-low (or both low-to-high), the result is maximized, and if they are ordered oppositely, the result is minimized; other changes of order appear to follow some rules. Swapping two numbers in a contiguous list of integers always reduces the sum by a fixed amount that seems to be related to their distance apart (i.e. swapping 1 and 3 or swapping 2 and 4 has the same effect). This was just from a little messing around, but the idea is that there is a maximum and a minimum, and if the pre-specified value is between them, there may be ways to count the permutations that reach it (although if the list isn't evenly spaced, then there aren't; well, not that I know of. If L2 is (1 2 4 5), swapping 1 and 2 versus 2 and 4 would have different effects).
