Random Forest Query - random

I am working on a project based on random forests. I saw a presentation (Rec08_Oct21.ppt) (www.cs.cmu.edu/~ggordon/10601/.../rec08/Rec08_Oct21.ppt)
regarding random forest creation and wanted to ask a question.
After scanning the randomly selected features and their information gain (IG) values, we select the feature j with the maximum IG. How do we then split using this information, and how do we proceed after that?

LearnTree(X, Y)
Let X be an R x M matrix (R data points, M attributes) and Y a vector of R elements holding the output class of each data point.
j* = argmax_j IG_j // (This is the splitting attribute we'll use)
The maximum value of IG can come from either a categorical (text-based) or real (number-based) attribute.
---> If it comes from a categorical attribute j: for each value v of the jth attribute, build the subset (Xv, Yv) and derive a child tree from it.
Xv = subset of all the rows of X in which Xij = v;
Yv = corresponding subset of Y values;
Child v = LearnTree(Xv, Yv);
PS: The number of child trees equals the number of unique values v in the jth attribute.
---> If it comes from a real-valued attribute j: we need to find the best split threshold.
PS: The threshold t is the value that gives the maximum IG for that attribute.
define IG(Y|X:t) as H(Y) - H(Y|X:t)
define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
define IG*(Y|X) = max_t IG(Y|X:t)
We split on this t and define two child trees from two new pairs, (X_lo, Y_lo) and (X_hi, Y_hi).
X_lo = subset of all the rows whose Xij < t
Y_lo = corresponding subset Y values
Child_lo = LearnTree(X_lo, Y_lo)
X_hi = subset of all the rows whose Xij >= t
Y_hi = corresponding subset Y values
Child_hi = LearnTree(X_hi, Y_hi)
After splitting is done, the data is then classified.
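The real-valued branch described above can be sketched in Python as follows. This is only a minimal illustration, not the code from the slides: the names entropy and best_threshold are made up here, and trying the midpoints between consecutive distinct values as candidate thresholds is an assumption (a common choice) that the description above does not specify.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(Y) in bits for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (t, gain) maximising IG(Y|X:t) = H(Y) - H(Y|X:t).

    Candidate thresholds are midpoints between consecutive distinct
    values (an assumption; other candidate sets are possible).
    """
    n = len(labels)
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        y_lo = [y for x, y in zip(values, labels) if x < t]
        y_hi = [y for x, y in zip(values, labels) if x >= t]
        # H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
        cond = entropy(y_lo) * len(y_lo) / n + entropy(y_hi) * len(y_hi) / n
        gain = base - cond
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Calling best_threshold on the column of the chosen attribute gives the t used to build X_lo/Y_lo (rows with value < t) and X_hi/Y_hi (rows with value >= t); if no threshold yields a positive gain, it returns (None, 0.0).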
For more information, go here!
I hope I answered your question.

Related

Pairing the weight of a protein sequence with the correct sequence

This piece of code is part of a larger function. I already created a list of molecular weights and I also defined a list of all the fragments in my data.
I'm trying to figure out how I can go through the list of fragments, calculate their molecular weight and check if it matches the number in the other list. If it matches, the sequence is appended into an empty list.
combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY', 'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
for c in combs:
    for f in fragments:
        if c == SeqUtils.molecular_weight(f, 'protein', circular=True):
            frags.append(f)
print(frags)
I'm guessing I don't fully know how the SeqUtils.molecular_weight command works in Python, but if there is another way that would also be great.
You are comparing floating point values for equality. That is bound to fail. You always have to account for some degree of error when dealing with floating point values. In this particular case you also have to take into account the error margin of the input values.
So do not compare floats like this
x == y
but instead like this
abs(x - y) < epsilon
where epsilon is a small tolerance chosen to suit the precision of your data.
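As a small illustration of the difference, here is a sketch using the reference value 397.47 and the weight computed for 'AINV'. The tolerance of 0.01 is an assumption based on the reference values being given to two decimal places; math.isclose from the standard library is an alternative to writing the comparison by hand.

```python
import math

reference = 397.47             # value from the list of known weights
computed = 397.46909999999997  # weight computed for the fragment 'AINV'
tolerance = 0.01               # assumed: reference values have two decimals

exact_match = reference == computed
manual_match = abs(reference - computed) < tolerance
isclose_match = math.isclose(reference, computed, abs_tol=tolerance)

print(exact_match)    # False
print(manual_match)   # True
print(isclose_match)  # True
```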
I made two slight modifications to your code: I swapped the order of the f and c loops so that the calculated value of w can be stored, and I also append w to the list frags to make it easier to see what is happening.
Your modified code now looks like this:
from Bio import SeqUtils
combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV',
'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY',
'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
threshold = 0.5
for f in fragments:
    w = SeqUtils.molecular_weight(f, 'protein', circular=True)
    for c in combs:
        if abs(c - w) < threshold:
            frags.append((f, w))
print(frags)
This prints the result
[('AINV', 397.46909999999997), ('IEEATHMTPCYELHGLRWV', 2267.5843), ('MQCL', 475.6257), ('QIQDY', 647.6766)]
As you can see, the first value for the weight differs from the reference value by about 0.0009. That's why you did not catch it with your approach.

algorithm to find unique, non equivalent configurations given the height, the width, and the number of states each element can be

So recently, I have been attempting to solve a code challenge and cannot find the answer. The issue is not the implementation, but rather what to implement. The prompt can be found here: http://pastebin.com/DxQssyKd
The main useful information from the prompt is as follows:
"Write a function answer(w, h, s) that takes 3 integers and returns the number of unique, non-equivalent configurations that can be found on a star grid w blocks wide and h blocks tall where each celestial body has s possible states. Equivalency is defined as above: any two star grids with each celestial body in the same state where the actual order of the rows and columns do not matter (and can thus be freely swapped around). Star grid standardization means that the width and height of the grid will always be between 1 and 12, inclusive. And while there are a variety of celestial bodies in each grid, the number of states of those bodies is between 2 and 20, inclusive. The answer can be over 20 digits long, so return it as a decimal string."
Equivalency means that, for example,
00
01
is equivalent to
01
00
and so on.
The problem is, what algorithm(s) should I use? I know this is somewhat related to permutations, combinations, and group theory, but I cannot find anything specific.
The key weapon is Burnside's lemma, which equates the number of orbits of the symmetry group G = S_w × S_h acting on the set of configurations X = ([w] × [h] → [s]) (i.e., the answer) to the sum 1/|G| ∑_{g∈G} |X^g|, where X^g = {x | g.x = x} is the set of elements fixed by g.
Given g, it's straightforward to compute |X^g|: use g to construct a graph on vertices [w] × [h] where there is an edge between (i, j) and g(i, j) for all (i, j). Count c, the number of connected components, and return s^c. The reasoning is that every vertex in a connected component must have the same state, but vertices in different components are unrelated.
Now, for 12 × 12 grids, there are far too many values of g to do this calculation on. Fortunately, when g and g' are conjugate (i.e., there exists some h such that h.g.h^-1 = g') we find that |X^g'| = |{x | g'.x = x}| = |{x | h.g.h^-1.x = x}| = |{x | g.h^-1.x = h^-1.x}| = |{h.y | g.y = y}| = |{y | g.y = y}| = |X^g|. We can thus sum over conjugacy classes and multiply each term by the number of group elements in the class.
The last piece is the conjugacy class structure of G = S_w × S_h. The conjugacy class structure of this direct product is really just the direct product of the conjugacy classes of S_w and S_h. The conjugacy classes of S_n are in one-to-one correspondence with integer partitions of n, enumerable by standard recursive methods. To compute the size of the class, you'll divide n! by the product of the partition terms (because circular permutations of the cycles are equivalent) and also by the product of the number of symmetries between cycles of the same size (product of the factorials of the multiplicities). See https://groupprops.subwiki.org/wiki/Conjugacy_class_size_formula_in_symmetric_group.
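Putting the pieces together, the approach can be sketched in Python roughly as below. The helper names partitions and class_size are made up for this sketch. It also uses one shortcut that is not spelled out above: instead of building the graph explicitly, it relies on the fact that a row cycle of length a combined with a column cycle of length b splits their a × b cells into gcd(a, b) cycles, so the component count c is the sum of gcd(a, b) over all pairs of cycles.

```python
from collections import Counter
from math import factorial, gcd


def partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing lists."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield [k] + rest


def class_size(partition):
    """Number of permutations in S_n whose cycle type is `partition`."""
    denom = 1
    for length, mult in Counter(partition).items():
        denom *= length ** mult * factorial(mult)
    return factorial(sum(partition)) // denom


def answer(w, h, s):
    """Count grids up to row/column swaps via Burnside's lemma."""
    total = 0
    for pw in partitions(w):
        for ph in partitions(h):
            # components of the graph induced by a (row, column) permutation pair
            c = sum(gcd(a, b) for a in pw for b in ph)
            total += class_size(pw) * class_size(ph) * s ** c
    return str(total // (factorial(w) * factorial(h)))
```

For a 2 × 2 grid with 2 states the four class pairs contribute 16 + 4 + 4 + 4 = 28 fixed configurations, and 28 / 4 gives 7 distinct grids.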

Best way to make a random matrix subject to full rank

I want to make a k x n matrix (k rows and n columns) whose rank is k. My idea is to check the current rank of the matrix each time a column is generated. If the current rank is smaller than the current column index j, I regenerate the column until the rank equals the column index. This is my code. However, it runs very slowly (because it checks the rank at every step). Please help me improve it.
function G=fullRank(k,n)
%% make matrix kxn
j=1; %% MATLAB indices start at 1
while(j<=n)
    d=randi(k,1);
    column = [ones(1,d) zeros(1,k-d)];
    column = column(randperm(k));
    G(:,j)=column';
    %% check full rank- Modify here
    if((j>=2)&&(rank(full(G))<j)&&(j<=k))
        %% Set current column of G to zeros
        column =zeros(1,k);
        G(:,j) = column';
    else
        j=j+1;
    end
end
The probability of your matrix not being full-rank depends on how you choose the random values for its entries, but I guess it is low. In that case, you can save time checking only at the end, and generating the full matrix again if needed:
maxrank = min(k,n); %// precompute to save a little time
G = []; %// this is just to enter the while loop at least once
while rank(G)<maxrank
    G = randi(k,k,n); %// replace by your procedure to generate G
end

Which Hashfunction to choose for parallel insertion of a set of values?

I have a hashtable T of size w*h with a bucket per entry for storing values mapped to the same hash.
Now I want to insert a set of values G.
Each value in G consists of a position tuple (x,y) and a certain payload p.
The hash function uses the position tuple of the value as parameter: H(x,y).
G is essentially a grid with each position storing a payload p.
To insert all values from G into T in parallel without synchronization, H should guarantee different hashes for all values in G.
The width of G is smaller than w and its height is smaller than h, so
H(x,y) = (y mod h) * w + (x mod w)
would be suitable.
However, this simple modulo hash function is only sufficient for uniformly distributed data.
In literature, a better suited (at least for my application) hash function is proposed:
H(x,y) = (x*p xor y*q) mod(w*h)
where p and q are large prime numbers.
But I'm unsure how to check whether all values of the grid would be mapped to different hashes.
Does anyone know how to prove that (if this is the case), or does anyone know a suitable hash function?
Thanks so much!
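Short of a proof, one way to test a candidate hash function for a given grid is to enumerate every position and look for repeated slots. The sketch below does that in Python. It is an empirical check for one grid size and offset, not a proof, and it rests on several assumptions: the modulo hash is read as (y mod h) * w + (x mod w), the primes 1000000007 and 998244353 are arbitrary picks rather than the values from the literature, and all function names are made up here.

```python
def modulo_hash(x, y, w, h):
    """Simple modulo hash, read as (y mod h) * w + (x mod w)."""
    return (y % h) * w + (x % w)


def xor_hash(x, y, w, h, p=1_000_000_007, q=998_244_353):
    """XOR hash; p and q are arbitrary primes chosen for illustration."""
    return ((x * p) ^ (y * q)) % (w * h)


def is_collision_free(hash_fn, grid_w, grid_h, w, h, x0=0, y0=0):
    """True if every position of a grid_w x grid_h grid whose corner is
    at (x0, y0) lands in a distinct slot of the w*h table."""
    seen = set()
    for y in range(y0, y0 + grid_h):
        for x in range(x0, x0 + grid_w):
            slot = hash_fn(x, y, w, h)
            if slot in seen:
                return False
            seen.add(slot)
    return True


# Example: a 7 x 5 grid inserted into an 8 x 6 table.
print(is_collision_free(modulo_hash, 7, 5, 8, 6))
print(is_collision_free(xor_hash, 7, 5, 8, 6))
```

The modulo hash passes this check for any grid no larger than w x h, because consecutive x values are distinct mod w and consecutive y values are distinct mod h. For the XOR hash the result depends on the chosen primes and table size, so it has to be checked for each configuration.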

mathematica help ~ assign each element in a matrix to a value from a column vector

I have this 4x4 square matrix A, which has a random value in each element. I now have a column matrix (16x1) B which also has random values. The number of values in B is 16 which corresponds to the total number of elements in A.
I am trying to assign the values in B to elements in matrix A in the following way:
A[[1,1]] = B[[1]],
A[[1,2]] = B[[2]],
A[[1,3]] = B[[3]],
A[[1,4]] = B[[4]],
A[[2,1]] = B[[5]],
A[[2,2]] = B[[6]],
A[[2,3]] = B[[7]],
A[[2,4]] = B[[8]],
etc...
Does anyone know a convenient way of doing this so that I can achieve this for any NxN square matrix, and any length M column matrix(Mx1 matrix)? Assuming of course that the total # of elements are the same in both matrices.
If you have Mathematica 9 or later, the function ArrayReshape can turn your list B into an arbitrary m x n matrix, filling it row by row; for the 4x4 case, A = ArrayReshape[B, {4, 4}].
