Pairing the weight of a protein sequence with the correct sequence - for-loop

This piece of code is part of a larger function. I already created a list of molecular weights and I also defined a list of all the fragments in my data.
I'm trying to figure out how I can go through the list of fragments, calculate their molecular weight and check if it matches the number in the other list. If it matches, the sequence is appended into an empty list.
combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY', 'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
for c in combs:
for f in fragments:
if c == SeqUtils.molecular_weight(f, 'protein', circular = True):
frags.append(f)
print(frags)
I'm guessing I don't fully know how the SeqUtils.molecular_weight command works in Python, but if there is another way that would also be great.

You are comparing floating point values for equality. That is bound to fail. You always have to account for some degree of error when dealing with floating point values. In this particular case you also have to take into account the error margin of the input values.
So do not compare floats like this
x == y
but instead like this
abs(x - y) < epsilon
where epsilon is some carefully selected arbitrary number.
I did two slight modifications to your code: I swapped the order of the f and the c loop to be able to store the calculated value of w. And I append the value of w to the list frags as well in order to better understand what is happening.
Your modified code now looks like this:
from Bio import SeqUtils
combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV',
'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY',
'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
threshold = 0.5
for f in fragments:
w = SeqUtils.molecular_weight(f, 'protein', circular=True)
for c in combs:
if abs(c - w) < threshold:
frags.append((f, w))
print(frags)
This prints the result
[('AINV', 397.46909999999997), ('IEEATHMTPCYELHGLRWV', 2267.5843), ('MQCL', 475.6257), ('QIQDY', 647.6766)]
As you can see, the first value for the weight differs from the reference value by about 0.0009. That's why you did not catch it with your approach.

Related

Matlab: replace values in one matrix with another matrix according to their referenced locations

I have two geotiff images (saying "A" and "B") imported in Matlab as matrices with Geotiffread. One has different values, while the second has only 0 and 255s.
What I'd like to do is replacing all the 255s with the values inside the other image (or matrix), according to their positions.
A and B differs in size, but they have the same projections.
I tried this:
A (A== 255)= B;
the output is the error:
??? In an assignment A(:) = B, the number of elements in A and B must be the same.
Else, I also tried with the logical approach:
if A== 255
A= B;
end
and nothing happens.
Is there a way to replace the values of A with values of B according to a specific value and the position in the referenced space?
As darthbith put in his comment, you need to make sure that the number of entries you want to replace is the same as the number values you are putting in.
By doing A(A==255)=B you are trying to put the entire matrix B into the subset of A that equals 255.
However, if, as you said, the projections are the same, you can simply do A(A==255) = B(A==255), under the assumption that B is larger or the same size as A.
Some sample code to provide a proof of concept.
A = randi([0,10],10,10);
B = randi([0,4],15,15);
C = A % copy original A matrix for comparison later
A(A==5) = B(A==5); % replace values
C==A % compare original and new
This example code creates two matrices, A is a 10x10 and B is a 15x15 and replaces all values that equal 5 in A with the corresponding values in B. This is shown to be true by doing C==A which shows where the new matrix and the old matrix vary, proving replacement did happen.
It seems to me that you are trying to mask an image with a binary mask. You can do this:
BW = im2bw(B,0.5);
A=A.*BW;
hope it helps
Try A(A==255) = B(A==255). The error is telling you that when you try to assign values to the elements of an array, you cannot give it any more or fewer values than you are trying to assign.
Also, regarding the if statement: if A==255 means the same as if all(A==255), as in, if any elements of A are not 255, false is returned. You can check this at the command line.
If you're really desperate, you can use a pair of nested for loops to achieve this (assuming A and B are the same size and shape):
[a,b] = size(A);
for ii = 1:a
for jj = 1:b
if A(ii,jj) == 255
A(ii,jj) = B(ii,jj);
end
end
end

Random Forest Query

I am working on a project based on random forest. I saw one ppt (Rec08_Oct21.ppt)(www.cs.cmu.edu/~ggordon/10601/.../rec08/Rec08_Oct21.ppt)
regarding random forest creation. I wanted to ask a question.
After scanning through the randomly selected features and their Information gain value, we select the feature with the max value of IG for feature j. Then, how do we split using this information? How do we proceed after this?
LearnTree(X, Y)
Let X be an R x M matrix, R-datapoints and M-attributes and Y with R elements which contains the output class of each data point.
j* = *argmaxj* **IG** j // (This is the splitting attribute we'll use)
The maximum value of IG can come from either a categorical (text-based) or real (number-based) attribute.
---> If it's coming from a categorical attribute(j): for each value v in jth attribute, we'll define a new matrix, now taking X v and Y v as the input derive a childtree.
Xv = subset of all the rows of X in which Xij = v;
Yv = corresponding subset of Y values;
Child v = LearnTree(Xv, Yv);
PS: The number of child trees will be same as the number of unique value v's in the jth attribute
---> If it's coming from the real valued attribute (j): we need to find out the best split threshold
PS: The threshold value t is the same value that provides the maximum IG value for that attribute
define IG(Y|X:t) as H(Y) - H(Y|X:t)
define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
define IG*(Y|X) = maxt IG(Y|X:t)
We'll be splitting over this t value, we then define two ChildTrees by defining two new pairs of X t and Y t.
X_lo = subset of all the rows whose Xij < t
Y_lo = corresponding subset Y values
Child_lo = LearnTree(X_lo, Y_lo)
X_hi = subset of all the rows whose Xij >t
Y_hi = corresponding subset Y values
Child_hi = LearnTree(X_hi, Y_hi)
After splitting is done, the data is then classified.
For more information, go here!
I hope I answered your question.

Calculate image of a set for a function represented as an array of ROBDD's

I have a set of integers, represented as a Reduced Ordered Binary Decision Diagram (ROBDD) (interpreted as a function which evaluates to true iff the input is in the set) which I shall call Domain, and an integer function (which I shall call F) represented as an array of ROBDD's (one entry per bit of the result).
Now I want to calculate the image of the domain for F. It's definitely possible, because it could trivially be done by enumerating all items from the domain, apply F, and insert the result in the image. But that's a horrible algorithm with exponential complexity (linear in the size of the domain), and my gut tells me it can be faster. I've been looking into the direction of:
apply Restrict(Domain) to all bits of F
do magic
But the second step proved difficult. The result of the first step contains the information I need (at least, I'm 90% sure of it), but not in the right form. Is there an efficient algorithm to turn it into a "set encoded as ROBDD"? Do I need an other approach?
Define two set-valued functions:
N(d1...dn): The subset of the image where members start with a particular sequence of digits d0...dn.
D(d1...dn): The subset of the inputs that produce N(d1...dn).
Then when the sequences are empty, we have our full problem:
D(): The entire domain.
N(): The entire image.
From the full domain we can define two subsets:
D(0) = The subset of D() such that F(x)[1]==0 for any x in D().
D(1) = The subset of D() such that F(x)[1]==1 for any x in D().
This process can be applied recursively to generate D for every sequence.
D(d1...d[m+1]) = D(d1...dm) & {x | F(x)[m+1]==d[m+1]}
We can then determine N(x) for the full sequences:
N(d1...dn) = 0 if D(d1...dn) = {}
N(d1...dn) = 1 if D(d1...dn) != {}
The parent nodes can be produced from the two children, until we've produced N().
If at any point we determine that D(d1...dm) is empty, then we know
that N(d1...dm) is also empty, and we can avoid processing that branch.
This is the main optimization.
The following code (in Python) outlines the process:
def createImage(input_set_diagram,function_diagrams,index=0):
if input_set_diagram=='0':
# If the input set is empty, the output set is also empty
return '0'
if index==len(function_diagrams):
# The set of inputs that produce this result is non-empty
return '1'
function_diagram=function_diagrams[index]
# Determine the branch for zero
set0=intersect(input_set_diagram,complement(function_diagram))
result0=createImage(set0,function_diagrams,index+1)
# Determine the branch for one
set1=intersect(input_set_diagram,function_diagram)
result1=createImage(set1,function_diagrams,index+1)
# Merge if the same
if result0==result1:
return result0
# Otherwise create a new node
return {'index':index,'0':result0,'1':result1}
Let S(x1, x2, x3...xn) be the indicator function for the set S, so that S(x1, x2...xn) = true if (x1, x2,...xn) is an element of S. Let F1(x1, x2, x3... xn), F2(),... Fn() be the individual functions that define F. Then I could ask if a particular bit pattern, with wild cards, is in the image of F by forming the equation e.g. S() & F1() & ~F2() for bit-pattern 10 and then solving this equation, which I presume that I can do since it is an ROBDD.
Of course you want a general indicator function, which tells me if abc is in the image. Extending the above, I think you get S() & (a&F1() | ~a&~F1()) & (b&F2() | ~b&~F2()) &... If you then re-order the variables so that the original x1, x2, ... xn occur last in the ROBDD order, then you should be able to prune the tree to return true for the case where any setting of the x1, x2, ... xn leads to the value true, and to return false otherwise.
(of course you could run of space, or patience, waiting for the re-ordering to work).

Using min/maximum values of a type when finding the min/maximum values in a set

Originally, I had basically written an essay with a question at the end, so I'm going to cram it down to this: which is better (being really nit-picky here)?
A)
int min = someArray[0][0];
for (int i = 0; i < someArray.length; i++)
for (int j = 0; j < someArray[i].length; j++)
min = Math.min(min, someArray[i][j]);
-or-
B)
int min = int.MAX_VALUE;
for (int i = 0; i < someArray.length; i++)
for (int j = 0; j < someArray[i].length; j++)
min = Math.min(min, someArray[i][j]);
I reckon b is faster, saving an instruction or two by initializing min to a constant value instead of using the indexer. It also feels less redundant - no comparing someArray[0][0] to itself...
As an algorithm, which is better/valid-er.
EDIT: Assume that the array is not null and not empty.
EDIT2: Fixed a couple of careless errors.
Both of these algorithms are correct (assuming, of course, the array is nonempty). I think that version A works more generally, since for some types (strings, in particular) there may not be a well-defined maximum value.
The reason that these algorithms are equivalent has to do with a cool mathematical object called a semilattice. To motivate semilattices, there are few cool properties of max that happen to hold true:
max is idempotent, so applying it to the same value twice gives back that original value: max(x, x) = x
max is commutative, so it doesn't matter what order you apply it to its arguments: max(x, y) = max(y, x)
max is associative, so when taking the maximum of three or more values it doesn't matter how you group the elements: max(max(x, y), z) = max(x, max(y, z))
These laws also hold for the minimum as well, as well as many other structures. For example, if you have a tree structure, the "least upper bound" operator also satisfies these constraints. Similarly, if you have a collection of sets and set union or intersection, you'd find that these constraints hold as well.
If you have a set of elements (for example, integers, strings, etc.) and some binary operator defined over them with the above three properties (idempotency, commutativity, and associativity), then you have found a structure called a semilattice. The binary operator is then called a meet operator (or sometimes a join operator depending on the context).
The reason that semilattices are useful is that if you have a (finite) collection of elements drawn from a semilattice and want to compute their meet, you can do so by using a loop like this:
Element e = data[0];
for (i in data[1 .. n])
e = meet(e, data[i])
The reason that this works is that because the meet operator is commutative and associative, we can apply the meet across the elements in any order that we want. Applying it one element at a time as we walk across the elements of the array in order thus produces the same value than if we had shuffled the array elements first, or iterated in reverse order, etc. In your case, the meet operator was "max" or "min," and since they satisfy the laws for meet operators described above the above code will correctly compute the max or min.
To address your initial question, we need a bit more terminology. You were curious about whether or not it was better or safer to initialize your initial guess of the minimum value to be the maximum possible integer. The reason this works is that we have the cool property that
min(int.MAX_VALUE, x) = min(x, int.MAX_VALUE) = x
In other words, if you compute the meet of int.MAX_VALUE and any other value, you get the second value back. In mathematical terms, this is because int.MAX_VALUE is the top element of the meet semilattice. More formally, a top element for a meet semilattice is an element (denoted &top;) satisfying
meet(&top;, x) = meet(x, &top;) = x
If you use max instead of min, then the top element would be int.MIN_VALUE, since
max(int.MIN_VALUE, x) = max(x, int.MIN_VALUE) = x
Because applying the meet operator to &top; and any other element produces that other element, if you have a meet semilattice with a well-defined top element, you can rewrite the above code to compute the meet of all the elements as
Element e = Element.TOP;
for (i in data[0 .. n])
e = meet(e, data[i])
This works because after the first iteration, e is set to meet(e, data[0]) = meet(Element.TOP, data[0]) = data[0] and the iteration proceeds as usual. Consequently, in your original question, it doesn't matter which of the two loops you use; as long as there is at least one element defined, they produce the same value.
That said, not all semilattices have a top element. Consider, for example, the set of all strings where the meet operator is defined as
meet(x, y) = x if x lexicographically precedes y
= y otherwise
For example, meet("a", "ab") = "a", meet("dog, "cat") = "cat", etc. In this case, there is no string s that satisfies the property meet(s, x) = meet(x, s) = x, and so the semilattice has no top element. In that case, you cannot possibly use the second version of the code, because there is no top element that you can initialize the initial value to.
However, there is a very cute technique you can use to fake this, which actually does end up getting used a bit in practice. Given a semilattice with no top element, you can create a new semilattice that does have a top element by introducing a new element &top; and arbitrarily defining that meet(&top;, x) = meet(x, &top;) = x. In other words, this element is specially crafted to be a top element and has no significance otherwise.
In code, you can introduce an element like this implicitly by writing
bool found = false;
Element e;
for (i in data[0 .. n]) {
if (!found) {
found = true;
e = i;
} else {
e = meet(e, i);
}
}
This code works by having an external boolean found keep track of whether or not we have seen the first element yet. If we haven't, then we pretend that the element e is this new top element. Computing the meet of this top element and the array element produces the array element, and so we can just set the element e to be equal to that array element.
Hope this helps! Sorry if this is too theoretical... I just happen to like math. :-)
B is better; if someArray happened to be empty, you'd get a runtime error; But A and B both could have an issue, because if someArray is null (and this wasn't checked in previous lines of code), both A and B will throw exceptions.
From a practical standpoint, I like option A marginally better because if the data type being dealt with changes in the future, changing the initial value is one less thing that needs to be updated (and therefore, one less thing that can go wrong).
From an algorithmic purity standpoint, I have no idea if one is better than the other.
By the way, option A should have its initialization like so:
int min = someArray[0][0];

Algorithm to find matching pairs in a list

I will phrase the problem in the precise form that I want below:
Given:
Two floating point lists N and D of the same length k (k is multiple of 2).
It is known that for all i=0,...,k-1, there exists j != i such that D[j]*D[i] == N[i]*N[j]. (I'm using zero-based indexing)
Return:
A (length k/2) list of pairs (i,j) such that D[j]*D[i] == N[i]*N[j].
The pairs returned may not be unique (any valid list of pairs is okay)
The application for this algorithm is to find reciprocal pairs of eigenvalues of a generalized palindromic eigenvalue problem.
The equality condition is equivalent to N[i]/D[i] == D[j]/N[j], but also works when denominators are zero (which is a definite possibility). Degeneracies in the eigenvalue problem cause the pairs to be non-unique.
More generally, the algorithm is equivalent to:
Given:
A list X of length k (k is multiple of 2).
It is known that for all i=0,...,k-1, there exists j != i such that IsMatch(X[i],X[j]) returns true, where IsMatch is a boolean matching function which is guaranteed to return true for at least one j != i for all i.
Return:
A (length k/2) list of pairs (i,j) such that IsMatch(i,j) == true for all pairs in the list.
The pairs returned may not be unique (any valid list of pairs is okay)
Obviously, my first problem can be formulated in terms of the second with IsMatch(u,v) := { (u - 1/v) == 0 }. Now, due to limitations of floating point precision, there will never be exact equality, so I want the solution which minimizes the match error. In other words, assume that IsMatch(u,v) returns the value u - 1/v and I want the algorithm to return a list for which IsMatch returns the minimal set of errors. This is a combinatorial optimization problem. I was thinking I can first naively compute the match error between all possible pairs of indexes i and j, but then I would need to select the set of minimum errors, and I don't know how I would do that.
Clarification
The IsMatch function is reflexive (IsMatch(a,b) implies IsMatch(b,a)), but not transitive. It is, however, 3-transitive: IsMatch(a,b) && IsMatch(b,c) && IsMatch(c,d) implies IsMatch(a,d).
Addendum
This problem is apparently identically the minimum weight perfect matching problem in graph theory. However, in my case I know that there should be a "good" perfect matching, so the distribution of edge weights is not totally random. I feel that this information should be used somehow. The question now is if there is a good implementation to the min-weight-perfect-matching problem that uses my prior knowledge to arrive at a solution early in the search. I'm also open to pointers towards a simple implementation of any such algorithm.
I hope I got your problem.
Well, if IsMatch(i, j) and IsMatch(j, l) then IsMatch(i, l). More generally, the IsMatch relation is transitive, commutative and reflexive, ie. its an equivalence relation. The algorithm translates to which element appears the most times in the list (use IsMatch instead of =).
(If I understand the problem...)
Here is one way to match each pair of products in the two lists.
Multiply each pair N and save it to a structure with the product, and the subscripts of the elements making up the product.
Multiply each pair D and save it to a second instance of the structure with the product, and the subscripts of the elements making up the product.
Sort both structions on the product.
Make a merge-type pass through both sorted structure arrays. Each time you find a product from one array that is close enough to the other, you can record the two subscripts from each sorted list for a match.
You can also use one sorted list for an ismatch function, doing a binary search on the product.
well。。Multiply each pair D and save it to a second instance of the structure with the product, and the subscripts of the elements making up the product.
I just asked my CS friend, and he came up with the algorithm below. He doesn't have an account here (and apparently unwilling to create one), but I think his answer is worth sharing.
// We will find the best match in the minimax sense; we will minimize
// the maximum matching error among all pairs. Alpha maintains a
// lower bound on the maximum matching error. We will raise Alpha until
// we find a solution. We assume MatchError returns an L_1 error.
// This first part finds the set of all possible alphas (which are
// the pairwise errors between all elements larger than maxi-min
// error.
Alpha = 0
For all i:
min = Infinity
For all j > i:
AlphaSet.Insert(MatchError(i,j))
if MatchError(i,j) < min
min = MatchError(i,j)
If min > Alpha
Alpha = min
Remove all elements of AlphaSet smaller than Alpha
// This next part increases Alpha until we find a solution
While !AlphaSet.Empty()
Alpha = AlphaSet.RemoveSmallest()
sol = GetBoundedErrorSolution(Alpha)
If sol != nil
Return sol
// This is the definition of the helper function. It returns
// a solution with maximum matching error <= Alpha or nil if
// no such solution exists.
GetBoundedErrorSolution(Alpha) :=
MaxAssignments = 0
For all i:
ValidAssignments[i] = empty set;
For all j > i:
if MatchError <= Alpha
ValidAssignments[i].Insert(j)
ValidAssignments[j].Insert(i)
// ValidAssignments[i].Size() > 0 due to our choice of Alpha
// in the outer loop
If ValidAssignments[i].Size() > MaxAssignments
MaxAssignments = ValidAssignments[i].Size()
If MaxAssignments = 1
return ValidAssignments
Else
G = graph(ValidAssignments)
// G is an undirected graph whose vertices are all values of i
// and edges between vertices if they have match error less
// than or equal to Alpha
If G has a perfect matching
// Note that this part is NP-complete.
Return the matching
Else
Return nil
It relies on being able to compute a perfect matching of a graph, which is NP-complete, but at least it is reduced to a known problem. It is expected that the solution be NP-complete, but this is OK since in practice the size of the given lists are quite small. I'll wait around for a better answer for a few days, or for someone to expand on how to find the perfect matching in a reasonable way.
You want to find j such that D(i)*D(j) = N(i)*N(j) {I assumed * is ordinary real multiplication}
assuming all N(i) are nonzero, let
Z(i) = D(i)/N(i).
Problem: find j, such that Z(i) = 1/Z(j).
Split set into positives and negatives and process separately.
take logs for clarity. z(i) = log Z(i).
Sort indirectly. Then in the sorted view you should have something like -5 -3 -1 +1 +3 +5, for example. Read off +/- pairs and that should give you the original indices.
Am I missing something, or is the problem easy?
Okay, I ended up using this ported Fortran code, where I simply specify the dense upper triangular distance matrix using:
complex_t num = N[i]*N[j] - D[i]*D[j];
complex_t den1 = N[j]*D[i];
complex_t den2 = N[i]*D[j];
if(std::abs(den1) < std::abs(den2)){
costs[j*(j-1)/2+i] = std::abs(-num/den2);
}else if(std::abs(den1) == 0){
costs[j*(j-1)/2+i] = std::sqrt(std::numeric_limits<double>::max());
}else{
costs[j*(j-1)/2+i] = std::abs(num/den1);
}
This works great and is fast enough for my purposes.
You should be able to sort the (D[i],N[i]) pairs. You don't need to divide by zero -- you can just multiply out, as follows:
bool order(i,j) {
float ni= N[i]; float di= D[i];
if(di<0) { di*=-1; ni*=-1; }
float nj= N[j]; float dj= D[j];
if(dj<0) { dj*=-1; nj*=-1; }
return ni*dj < nj*di;
}
Then, scan the sorted list to find two separation points: (N == D) and (N == -D); you can start matching reciprocal pairs from there, using:
abs(D[i]*D[j]-N[i]*N[j])<epsilon
as a validity check. Leave the (N == 0) and (D == 0) points for last; it doesn't matter whether you consider them negative or positive, as they will all match with each other.
edit: alternately, you could just handle (N==0) and (D==0) cases separately, removing them from the list. Then, you can use (N[i]/D[i]) to sort the rest of the indices. You still might want to start at 1.0 and -1.0, to make sure you can match near-zero cases with exactly-zero cases.

Resources