This piece of code is part of a larger function. I already created a list of molecular weights and I also defined a list of all the fragments in my data.
I'm trying to figure out how I can go through the list of fragments, calculate their molecular weight and check if it matches the number in the other list. If it matches, the sequence is appended into an empty list.
combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY', 'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
for c in combs:
    for f in fragments:
        if c == SeqUtils.molecular_weight(f, 'protein', circular = True):
            frags.append(f)
print(frags)
I'm guessing I don't fully know how the SeqUtils.molecular_weight command works in Python, but if there is another way that would also be great.
You are comparing floating point values for equality. That is bound to fail. You always have to account for some degree of error when dealing with floating point values. In this particular case you also have to take into account the error margin of the input values.
So do not compare floats like this
x == y
but instead like this
abs(x - y) < epsilon
where epsilon is a small tolerance chosen to suit the precision of your data.
I made two slight modifications to your code: I swapped the order of the f and the c loops so that the calculated value w can be stored and reused, and I append w to the list frags as well, to make it easier to see what is happening.
Your modified code now looks like this:
from Bio import SeqUtils
combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV',
'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY',
'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
threshold = 0.5
for f in fragments:
    w = SeqUtils.molecular_weight(f, 'protein', circular=True)
    for c in combs:
        if abs(c - w) < threshold:
            frags.append((f, w))
print(frags)
This prints the result
[('AINV', 397.46909999999997), ('IEEATHMTPCYELHGLRWV', 2267.5843), ('MQCL', 475.6257), ('QIQDY', 647.6766)]
As you can see, the first value for the weight differs from the reference value by about 0.0009. That's why you did not catch it with your approach.
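If you would rather not hand-roll the comparison, here is a minimal sketch using math.isclose from the standard library. It reuses the weights printed above instead of calling SeqUtils, and the tolerance of 0.01 is my assumption (the reference values look rounded to two decimals), so adjust it to your data.

```python
import math

# Reference weights from the question and the computed weights printed above.
combs = [397.47, 2267.58, 475.63, 647.68]
computed = {
    'AINV': 397.46909999999997,
    'IEEATHMTPCYELHGLRWV': 2267.5843,
    'MQCL': 475.6257,
    'QIQDY': 647.6766,
}

def matches(weight, targets, abs_tol=0.01):
    """Return True if weight is within abs_tol of any target (abs_tol is an assumed tolerance)."""
    return any(math.isclose(weight, t, abs_tol=abs_tol) for t in targets)

# Exact comparison finds nothing; the tolerant comparison finds all four.
exact = [f for f, w in computed.items() if w in combs]
close = [f for f, w in computed.items() if matches(w, combs)]
print(exact)  # []
print(close)  # ['AINV', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'QIQDY']
```

Note that math.isclose defaults to a relative tolerance only, so for values like these you need to pass abs_tol explicitly.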
I am trying to find different sequences of fixed length which can be generated using the numbers from a given set (distinct elements) such that each element from set should appear in the sequence. Below is my logic:
E.g., let the set consist of S elements, and say we have to generate sequences of length K (K >= S).
1) First we have to choose S places out of K and place each element from the set in random order. So, C(K,S)*S!
2) After that, the remaining places can be filled with any values from the set. So, the factor (K-S)^S should be multiplied.
So, the overall result is
C(K,S) * S! * ((K-S)^S)
But, I am getting wrong answer. Please help.
PS: C(K,S) : No. of ways selecting S elements out of K elements (K>=S) irrespective of order. Also, ^ : power symbol i.e 2^3 = 8.
Here is my code in python:
# m is the no. of element to select from a set of n elements
# fact is a list containing factorial values i.e. fact[0] = 1, fact[3] = 6 & so on.
def ways(m,n):
    res = fact[n]/fact[n-m+1]*((n-m)**m)
    return res
What you are looking for is the number of surjective functions whose domain is a set of K elements (the K positions that we are filling out in the output sequence) and the image is a set of S elements (your input set). I think this should work:
static int Count(int K, int S)
{
    int sum = 0;
    for (int i = 1; i <= S; i++)
    {
        sum += Pow(-1, (S-i)) * Fact(S) / (Fact(i) * Fact(S - i)) * Pow(i, K);
    }
    return sum;
}
...where Pow and Fact are what you would expect.
Check out this math.se question.
Here's why your approach won't work. I didn't check the code, just your explanation of the logic behind it, but I'm pretty sure I understand what you're trying to do. Let's take for example K = 4, S = {7,8,9}. Let's examine the sequence 7,8,9,7. It is a unique sequence, but you can get to it by:
Randomly choosing positions 1,2,3, filling them randomly with 7,8,9 (your step 1), then randomly choosing 7 for the remaining position 4 (your step 2).
Randomly choosing positions 2,3,4, filling them randomly with 8,9,7 (your step 1), then randomly choosing 7 for the remaining position 1 (your step 2).
By your logic, you will count it both ways, even though it should be counted only once as the end result is the same. And so on...
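Since the question's code is in Python, here is a rough Python sketch of the same inclusion-exclusion sum, cross-checked against a brute-force enumeration. The function names are mine, and math.comb needs Python 3.8 or newer.

```python
from itertools import product
from math import comb

def count_surjective(k, s):
    """Number of length-k sequences over s symbols that use every symbol at least once."""
    return sum((-1) ** (s - i) * comb(s, i) * i ** k for i in range(1, s + 1))

def brute_force(k, s):
    """Same count by enumerating all s**k sequences (only feasible for small inputs)."""
    return sum(1 for seq in product(range(s), repeat=k) if len(set(seq)) == s)

# The K = 4, S = {7, 8, 9} example: 3^4 - 3*2^4 + 3*1^4 = 36
print(count_surjective(4, 3))  # 36
print(brute_force(4, 3))       # 36
```

Python integers do not overflow, so this avoids the range problem the int-based version would hit for larger K.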
Let's say I have a dataset with following schema:
ItemName (String) , Length (long)
I need to find items that are duplicates based on their length. That's pretty easy to do in PIG:
raw_data = LOAD...dataset
grouped = GROUP raw_data by length
items = FOREACH grouped GENERATE COUNT(raw_data) as count, raw_data.name;
dups = FILTER items BY count > 1;
STORE dups....
The above finds exact duplicates. Given the set below:
a, 100
b, 105
c, 100
It will output 2, (a,c)
Now I need to find duplicates using a threshold. For example, a threshold of 5 would mean that items match if their lengths are within +/- 5 of each other. So the output should look like:
3, (a,b,c)
Any ideas how I can go about doing this?
It is almost like I want PIG to use a UDF as its comparator when it is comparing records during its join...
I think the only way to do what you want is to load the data into two tables and do a cartesian join of the data set onto itself, so that each value can be compared to each other value.
Pseudo-code:
r1 = load dataset
r2 = load dataset
rcross = cross r1, r2
rcross is a cartesian product that will allow you to check the difference in length between each pair.
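To make the idea concrete outside of Pig, here is a small Python sketch of the same self cross join followed by a threshold filter. The data is the three-row example from the question; the variable names are just for illustration.

```python
from itertools import product

items = [('a', 100), ('b', 105), ('c', 100)]
threshold = 5

# Cartesian product of the data set with itself; n1 < n2 keeps each
# unordered pair once and drops the pairing of a row with itself.
near_pairs = [
    (n1, n2)
    for (n1, l1), (n2, l2) in product(items, items)
    if n1 < n2 and abs(l1 - l2) <= threshold
]
print(near_pairs)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

The cost is quadratic in the number of rows, which is the main drawback of the cross-join approach.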
I was solving a similar problem once and got one crazy and dirty solution.
It is based on the following lemma:
If a and b are integers with |a - b| < r, then there exists an integer x, 0 <= x < r, such that
floor((a+x)/r) = floor((b+x)/r)
(further I will mean only integer division and will omit floor() function, i.e. 5/2=2)
This lemma is obvious, I'm not gonna prove it here
Based on this lemma you can do the following join:
RESULT = JOIN A by A.len / r, B By B.len / r
This yields some of the pairs with |A.len - B.len| < r, but not all of them.
But doing this r times, once for each shift x:
RESULT0 = JOIN A by A.len / r, B By (B.len / r)
RESULT1 = JOIN A by (A.len+1) / r, B By (B.len+1) / r
...
RESULT{R-1} = JOIN A by (A.len+r-1) / r, B By (B.len+r-1) / r
you will get all needed values. Of course you will get more rows than you need, but as I said already it's a dirty solution (i.e. it's not optimal, but works)
The other big disadvantage of this solution is that JOINs should be written dynamically and their number will be big for big r.
Still it works if you know r and it is rather small (like r=6 in your case)
Hope it helps
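Here is a rough Python sketch of the shifted-bucket trick, just to show that the r equi-joins together recover every pair with |a - b| < r. It assumes integer lengths; the function name and the extra item d are mine.

```python
from collections import defaultdict

def near_pairs(items, r):
    """Pairs of names whose integer lengths differ by less than r, via r shifted equi-joins."""
    found = set()
    for x in range(r):                      # one "join" per shift x
        buckets = defaultdict(list)
        for name, length in items:
            buckets[(length + x) // r].append(name)   # the join key
        for names in buckets.values():      # rows sharing a key are joined
            for i, n1 in enumerate(names):
                for n2 in names[i + 1:]:
                    found.add((n1, n2) if n1 < n2 else (n2, n1))
    return sorted(found)                    # the set removes repeated pairs

items = [('a', 100), ('b', 105), ('c', 100), ('d', 106)]
print(near_pairs(items, 6))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd')]
```

The same pair usually shows up under several shifts, which is where the extra rows come from; here the set takes care of them.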
I have an N by 2 matrix A of indices of elements I want to get from a 2D matrix B, each row of A being the row and column index of an element of B that I want to get. I would like to get all of those elements stacked up as an N by 1 vector.
B is a square matrix, so I am currently using
N = size(B,1);
indices = arrayfun(@(i) A(i,1) + N*(A(i,2)-1), 1:size(A,1));
result = B(indices);
but, while it works, this is proving to be a huge bottleneck and I need to speed up the code in order for it to be useful.
What is the fastest way I can achieve the same result?
How about
indices = [1 N] * (A'-1) + 1;
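To see why that one-liner gives the right indices, here is a small plain-Python sketch of 1-based, column-major linear indexing (the storage order MATLAB uses). The helper name and the 3-by-3 example data are made up for illustration.

```python
def linear_indices(pairs, n_rows):
    """1-based column-major linear index for each 1-based (row, col) pair."""
    return [r + n_rows * (c - 1) for r, c in pairs]

# B = [1 2 3; 4 5 6; 7 8 9], flattened column by column.
B_rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
n = len(B_rows)
B_flat = [B_rows[r][c] for c in range(n) for r in range(n)]

A = [(1, 1), (2, 3), (3, 2)]
idx = linear_indices(A, n)
result = [B_flat[i - 1] for i in idx]   # convert 1-based index to 0-based list access
print(idx)     # [1, 8, 6]
print(result)  # [1, 6, 8]
```

The matrix product [1 N] * (A'-1) + 1 computes exactly (row-1) + N*(col-1) + 1 for every row of A at once, without a per-element function call.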
I can never remember if B(A(:,1), A(:,2)) works the way you want it to, but I'd try that to avoid the intermediate variable. If that does not work, try sub2ind, e.g. sub2ind(size(B), A(:,1), A(:,2)).
Also, look at how you generated A in the first place. If A came from the output of find, for example, it is faster to use logical indexing, i.e.
B( B == 2 )
is faster than finding the row and column indexes that satisfy that condition and then indexing into B.
I have got a square matrix consisting of elements either 1
or 0. An ith row toggle toggles all the ith row elements (1
becomes 0 and vice versa) and jth column toggle toggles all
the jth column elements. I have got another square matrix of
similar size. I want to change the initial matrix to the
final matrix using the minimum number of toggles. For example
|0 0 1|
|1 1 1|
|1 0 1|
to
|1 1 1|
|1 1 0|
|1 0 0|
would require a toggle of the first row and of the last
column.
What will be the correct algorithm for this?
In general, the problem will not have a solution. To see this, note that transforming matrix A to matrix B is equivalent to transforming the matrix A - B (computed using binary arithmetic, so that 0 - 1 = 1) to the zero matrix. Look at the matrix A - B, and apply column toggles (if necessary) so that the first row becomes all 0's or all 1's. At this point, you're done with column toggles -- if you toggle one column, you have to toggle them all to get the first row correct. If even one row is a mixture of 0's and 1's at this point, the problem cannot be solved. If each row is now all 0's or all 1's, the problem is solvable by toggling the appropriate rows to reach the zero matrix.
To get the minimum, compare the number of toggles needed when the first row is turned to 0's vs. 1's. In the OP's example, the candidates would be toggling column 3 and row 1 or toggling columns 1 and 2 and rows 2 and 3. In fact, you can simplify this by looking at the first solution and seeing if the number of toggles is smaller or larger than N -- if larger than N, then toggle the opposite rows and columns.
It's not always possible. If you start with a 2x2 matrix with an even number of 1s you can never arrive at a final matrix with an odd number of 1s.
Algorithm
Simplify the problem from "Try to transform A into B" into "Try to transform M into 0", where M = A xor B. Now all the positions which must be toggled have a 1 in them.
Consider an arbitrary position in M. It is affected by exactly one column toggle and exactly one row toggle. If its initial value is V, the presence of the column toggle is C, and the presence of the row toggle is R, then the final value F is V xor C xor R. That's a very simple relationship, and it makes the problem trivial to solve.
Notice that, for each position, R = F xor V xor C = 0 xor V xor C = V xor C. If we set C then we force the value of R, and vice versa. That's awesome, because it means if I set the value of any row toggle then I will force all of the column toggles. Any one of those column toggles will force all of the row toggles. If the result is the 0 matrix, then we have a solution. We only need to try two cases!
Pseudo-code
function solve(Matrix M) as bool possible, bool[] rowToggles, bool[] colToggles:
    For var b in {true, false}
        colToggles = array from c in M.colRange select b xor M(0, c)
        rowToggles = array from r in M.rowRange select colToggles[0] xor M(r, 0)
        if none from c in M.colRange, r in M.rowRange
                where colToggles[c] xor rowToggles[r] xor M(r, c) != 0 then
            return true, rowToggles, colToggles
        end if
    next var
    return false, null, null
end function
Analysis
The analysis is trivial. We try two cases, within which we run along a row, then a column, then all cells. Therefore if there are r rows and c columns, meaning the matrix has size n = c * r, then the time complexity is O(2 * (c + r + c * r)) = O(c * r) = O(n). The only space we use is what is required for storing the outputs = O(c + r).
Therefore the algorithm takes time linear in the size of the matrix, and uses space linear in the size of the output. It is asymptotically optimal for obvious reasons.
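For anyone who wants something runnable, here is a Python sketch of the two-case approach. Unlike the pseudo-code, which returns the first case that works, this version checks both cases and keeps the one with fewer toggles, since the question asks for the minimum. Names and the list-of-lists representation are my own choices.

```python
def solve(a, b):
    """Return (row_toggles, col_toggles) using the fewest toggles, or None if impossible."""
    # M = A xor B: a 1 marks a cell whose value must change.
    m = [[x ^ y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    best = None
    for first_row in (0, 1):                    # toggle row 0 or not
        cols = [first_row ^ v for v in m[0]]    # forced by row 0
        rows = [cols[0] ^ row[0] for row in m]  # forced by column 0
        ok = all(rows[r] ^ cols[c] ^ m[r][c] == 0
                 for r in range(len(m)) for c in range(len(m[0])))
        cost = sum(rows) + sum(cols)
        if ok and (best is None or cost < sum(best[0]) + sum(best[1])):
            best = (rows, cols)
    return best

start = [[0, 0, 1], [1, 1, 1], [1, 0, 1]]
target = [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
print(solve(start, target))  # ([1, 0, 0], [0, 0, 1])
```

For the matrices in the question this reports a toggle of the first row and the last column, matching the expected answer; an unreachable target gives None.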
I came up with a brute force algorithm.
The algorithm is based on 2 conjectures:
(so it may not work for all matrices - I'll verify them later)
The minimum (number of toggles) solution will contain a specific row or column only once.
In whatever order we apply the steps to convert the matrix, we get the same result.
The algorithm:
Let's say we have the matrix m = [ [1,0], [0,1] ].
m: 1 0
0 1
We generate a list of all row and column numbers,
like this: ['r0', 'r1', 'c0', 'c1']
Now we brute force, i.e. examine, every possible combination of steps.
For example, we start with the 1-step solutions,
ksubsets = [['r0'], ['r1'], ['c0'], ['c1']]
if no element is a solution then proceed with 2-step solution,
ksubsets = [['r0', 'r1'], ['r0', 'c0'], ['r0', 'c1'], ['r1', 'c0'], ['r1', 'c1'], ['c0', 'c1']]
etc...
A ksubsets element (combo) is a list of toggle steps to apply in a matrix.
Python implementation (tested on version 2.5)
# Recursive definition (+ is the join of sets)
# S = {a1, a2, a3, ..., aN}
#
# ksubsets(S, k) = {
# {{a1}+ksubsets({a2,...,aN}, k-1)} +
# {{a2}+ksubsets({a3,...,aN}, k-1)} +
# {{a3}+ksubsets({a4,...,aN}, k-1)} +
# ... }
# example: ksubsets([1,2,3], 2) = [[1, 2], [1, 3], [2, 3]]
def ksubsets(s, k):
    if k == 1: return [[e] for e in s]
    ksubs = []
    ss = s[:]
    for e in s:
        if len(ss) < k: break
        ss.remove(e)
        for x in ksubsets(ss,k-1):
            l = [e]
            l.extend(x)
            ksubs.append(l)
    return ksubs

def toggle_row(m, r):
    for i in range(len(m[r])):
        m[r][i] = m[r][i] ^ 1

def toggle_col(m, i):
    for row in m:
        row[i] = row[i] ^ 1

def toggle_matrix(m, combos):
    # example of combos, ['r0', 'r1', 'c3', 'c4']
    # 'r0' toggle row 0, 'c3' toggle column 3, etc.
    import copy
    k = copy.deepcopy(m)
    for combo in combos:
        if combo[0] == 'r':
            toggle_row(k, int(combo[1:]))
        else:
            toggle_col(k, int(combo[1:]))
    return k

def conversion_steps(sM, tM):
    # Brute force algorithm.
    # Returns the minimum list of steps to convert sM into tM.
    rows = len(sM)
    cols = len(sM[0])
    combos = ['r'+str(i) for i in range(rows)] + \
             ['c'+str(i) for i in range(cols)]
    for n in range(0, rows + cols -1):
        for combo in ksubsets(combos, n +1):
            if toggle_matrix(sM, combo) == tM:
                return combo
    return []
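On newer Python versions, the same brute-force search can be sketched more compactly with itertools.combinations in place of a hand-written ksubsets. This is a variant, not the code above: steps are (kind, index) tuples instead of strings, and it returns None when no combination works so that "already equal" (an empty list) can be told apart from "impossible".

```python
from itertools import combinations

def toggled(m, steps):
    """Return a copy of m with the given ('r', i) / ('c', j) toggles applied."""
    rows = {i for kind, i in steps if kind == 'r'}
    cols = {i for kind, i in steps if kind == 'c'}
    return [[v ^ (r in rows) ^ (c in cols) for c, v in enumerate(row)]
            for r, row in enumerate(m)]

def conversion_steps_v2(src, dst):
    """Smallest list of toggles turning src into dst, or None if there is none."""
    steps = [('r', i) for i in range(len(src))] + \
            [('c', j) for j in range(len(src[0]))]
    for k in range(len(steps) + 1):          # try 0 steps, then 1, 2, ...
        for combo in combinations(steps, k):
            if toggled(src, combo) == dst:
                return list(combo)
    return None

m = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
k = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(conversion_steps_v2(m, k))  # [('r', 0), ('r', 1), ('c', 2)]
```

It is still exponential in rows + cols, so it only makes sense for small matrices.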
Example:
m: 0 0 0
0 0 0
0 0 0
k: 1 1 0
1 1 0
0 0 1
>>> m = [[0,0,0],[0,0,0],[0,0,0]]
>>> k = [[1,1,0],[1,1,0],[0,0,1]]
>>> conversion_steps(m, k)
['r0', 'r1', 'c2']
>>>
If you can only toggle the rows, and not the columns, then there will only be a subset of matrices that you can convert into the final result. If this is the case, then it would be very simple:
for every row, i:
    if matrix1[i] == matrix2[i]
        continue;
    else
        toggle matrix1[i];
        if matrix1[i] == matrix2[i]
            continue
        else
            die("cannot make similar");
This is a state space search problem. You are searching for the optimum path from a starting state to a destination state. In this particular case, "optimum" is defined as "minimum number of operations".
The state space is the set of binary matrices generatable from the starting position by row and column toggle operations.
ASSUMING that the destination is in the state space (NOT a valid assumption in some cases: see Henrik's answer), I'd try throwing a classic heuristic search (probably A*, since it is about the best of the breed) algorithm at the problem and see what happened.
The first, most obvious heuristic is "number of correct elements".
Any decent Artificial Intelligence textbook will discuss search and the A* algorithm.
You can represent your matrix as a nonnegative integer, with each cell in the matrix corresponding to exactly one bit in the integer. On a system that supports 64-bit long long unsigned ints, this lets you play with anything up to 8x8. You can then use exclusive-OR operations on the number to implement the row and column toggle operations.
CAUTION: the raw total state space size is 2^(N^2), where N is the number of rows (or columns). For a 4x4 matrix, that's 2^16 = 65536 possible states.
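As an illustration of the integer representation, here is a small Python sketch. The bit layout (row-major, first cell in the highest bit) and the helper names are my own assumptions; any fixed layout works as long as the masks agree with it.

```python
def pack(matrix):
    """Pack a square 0/1 matrix into an int, row-major, first cell in the highest bit."""
    value = 0
    for row in matrix:
        for bit in row:
            value = (value << 1) | bit
    return value

def row_mask(n, r):
    """Mask with a 1 in every cell of row r of an n x n matrix."""
    return ((1 << n) - 1) << (n * (n - 1 - r))

def col_mask(n, c):
    """Mask with a 1 in every cell of column c of an n x n matrix."""
    return sum(1 << (n * (n - 1 - r) + (n - 1 - c)) for r in range(n))

start = pack([[0, 0, 1], [1, 1, 1], [1, 0, 1]])
target = pack([[1, 1, 1], [1, 1, 0], [1, 0, 0]])

# Toggling is a single XOR with the row or column mask.
state = start ^ row_mask(3, 0) ^ col_mask(3, 2)
print(state == target)  # True
```

With this encoding each state is a single integer, which makes it cheap to store visited states in a set during the search.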
Rather than look at this as a matrix problem, take the 9 bits from each array, load each of them into 2-byte size types (16 bits, which is probably the source of the arrays in the first place), then do a single XOR between the two.
(the bit order would be different depending on your type of CPU)
The first array would become: 0000000001111101
The second array would become: 0000000111110100
A single XOR would produce the output. No loops required. All you'd have to do is 'unpack' the result back into an array, if you still wanted to. You can read the bits without resorting to that, though.
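A quick Python sketch of that XOR, using the two matrices from the question packed row by row (this particular bit order is an assumption). Keep in mind that the result only marks which cells differ; it does not by itself say which rows and columns to toggle.

```python
# The two 3x3 matrices from the question, one bit per cell, read row by row.
first = 0b001111101
second = 0b111110100

diff = first ^ second                     # 1 wherever the matrices differ
bits = format(diff, '09b')
rows = [bits[i:i + 3] for i in range(0, 9, 3)]   # "unpack" into three rows
print(bits)  # 110001001
print(rows)  # ['110', '001', '001']
```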
I think brute force is not necessary.
The problem can be rephrased in terms of a group. The matrices over the field with 2 elements constitute a commutative group with respect to addition.
As pointed out before, the question whether A can be toggled into B is equivalent to see if A-B can be toggled into 0. Note that toggling of row i is done by adding a matrix with only ones in the row i and zeros otherwise, while the toggling of column j is done by adding a matrix with only ones in column j and zeros otherwise.
This means that A-B can be toggled to the zero matrix if and only if A-B is contained in the subgroup generated by the toggling matrices.
Since addition is commutative, the order of the toggles does not matter, so we may assume that all column toggles take place first; we can then apply the approach of Marius first to the columns and then to the rows.
In particular, the toggling of the columns must make every row either all ones or all zeros. There are two possibilities:
Toggle columns such that every 1 in the first row becomes zero. If after this there is a row in which both ones and zeros occur, there is no solution. Otherwise apply the same approach for the rows (see below).
Toggle columns such that every 0 in the first row becomes 1. If after this there is a row in which both ones and zeros occur, there is no solution. Otherwise apply the same approach for the rows (see below).
Since the columns have been toggled successfully, in the sense that each row contains only ones or only zeros, there are two possibilities:
Toggle rows such that every 1 in the first column becomes zero.
Toggle rows such that every 0 in the first column becomes one.
Of course in the step for the rows, we take the possibility which results in fewer toggles, i.e. we count the ones in the first column and then decide how to toggle.
In total, only 2 cases have to be considered, namely how the columns are toggled; for the row step, the toggling can be decided by counting, to minimize the number of toggles in the second step.