I'm doing a course in bioinformatics.
I'm trying to figure out how to model a Hidden Markov Model (HMM) from a Position-Specific Probability Matrix (PSPM).
Is there a clear pattern for how one should model it?
Can someone show me how to model it, with 3 or more states, based on a PSPM?
I will supply an example of a PSPM, but feel free to use your own.
The example is taken from an MIT course.
The question is not very clear, but maybe this will help.
For these models you can have, in principle, as many positions as you want.
Here is an example of what is possible to do in bioinformatics:
Let's imagine you have an alignment of DNA sequences (we can call it the "database"). You can build the model like this with hmmer:
hmmbuild <hmm_profile> <alignment>
Then you can inspect the <hmm_profile>, which will contain the emission probabilities for each position of the alignment as well as the transition probabilities.
After that you can, for example, search new sequences against it. Now imagine you have a set of sequences ("queries") in a file and you want to map them against your "database", so you know which positions each query aligns to best:
nhmmer -o <output_file> --dna <hmm_profile> <sequences_file>
The output file contains the query sequences aligned against the database.
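To tie this back to the question: each PSPM column can be read as the emission distribution of one "match" state in a simple linear HMM (state i always transitions to state i+1 with probability 1, so the HMM has as many states as the PSPM has positions). A minimal Python sketch, using made-up matrix values rather than the MIT example:

    import math

    # Each PSPM column is the emission distribution of one match state;
    # transitions simply chain state i -> state i+1 with probability 1.
    # The values below are made up for illustration.
    PSPM = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
        {"A": 0.2, "C": 0.2, "G": 0.5, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    ]

    def log_odds(seq, background=0.25):
        """Score a sequence against the profile (log-odds vs. a uniform background)."""
        assert len(seq) == len(PSPM)
        return sum(math.log(state[base] / background)
                   for state, base in zip(PSPM, seq))

    print(log_odds("ACGT"))  # high score: follows the consensus
    print(log_odds("TGCA"))  # low score: disagrees with the profile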
I built a chart with 2 layers in Workshop; the only difference between the layers is the date bucket. Since the segmentation is the same, I would like each layer to be displayed with the same colors.
In the chart above, you'll see 2 series: one displays the monthly evolution of a property, the second displays the yearly average. Each series is segmented by the same object (listed in the table above, coming from a hidden filter). This is achieved by defining 2 layers with exactly the same definition, except for the X-axis: the date is bucketed either monthly or yearly. The property is calculated on a monthly basis, so the monthly bucket displays the input value, and the second layer dynamically calculates the yearly average.
The main issue is that the 2 blue lines are not related to the same object.
I would also like, if possible, to have only one legend instead of one per layer. Currently, my workaround is to display the code in one case and the description in the other.
Maybe I missed something, but I did not find any way to define the chart colors precisely: am I wrong?
Thus, I was wondering if there is any way to sort the input data. The filter is based on the object's primary key; is it possible to sort the queryset accordingly? Maybe the segmentation would then be displayed in the same order and the colors would match.
Or is there any other way to proceed?
The answer to the title's question is quite easy: actually, the segmentation is not exactly the same. In the first layer, the segmentation is based on a code, in the second one on the object's name. Thus, the order is slightly different.
The solution is: use exactly the same segmentation.
I still wonder how to manage the display: showing the legend for only one layer, choosing the lines' colors... But that looks like another subject, and I'll probably open a new topic.
I'm currently having a problem with the design of an algorithm.
I want to create a WYSIWYG editor that works alongside the current [bbcode] editor I have.
To do that, I use a div with contenteditable set to true for the WYSIWYG editor and a textarea containing the associated bbcode. So far, no problem. But my concern is that if a user wants to add a tag (for example, the [b] tag), I need to know where they want to insert it.
For that, I need to know exactly where in the bbcode I should insert the tags. I thought of comparing the two texts (one with HTML tags like <span>, the other with bbcode tags like [b]), and that's where I'm struggling.
I did some research but couldn't find anything that would help me, or I did not understand it correctly (maybe I searched for the wrong thing). What I could find is the Jaccard index, but I don't really know how to make it work here.
I also thought of another alternative. I could take the code in the WYSIWYG editor before the cursor location and split it every time I encounter an HTML tag. That way, in the bbcode editor, I can search for the first occurrence, then search for the second occurrence starting at the last index found, and so on until I reach the place the cursor is pointing at.
I'm not sure if it would work, and I find that solution a bit dirty. Am I totally wrong, or should I do it this way?
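For concreteness, here is a rough Python sketch of that occurrence-search idea (function and variable names are just illustrative):

    import re

    def bbcode_insert_index(html_before_cursor, bbcode):
        """Map a cursor position in the HTML editor to an index in the bbcode.

        Split the HTML preceding the cursor on tags, then locate each
        plain-text chunk in the bbcode in order; the end of the last chunk
        found is where the new bbcode tag should go.
        """
        chunks = [c for c in re.split(r"<[^>]+>", html_before_cursor) if c]
        index = 0
        for chunk in chunks:
            found = bbcode.find(chunk, index)
            if found == -1:
                break  # chunk not found; keep the last known position
            index = found + len(chunk)
        return index

    html = "Hello <span>wor"       # content of the div up to the cursor
    bb = "Hello [b]world[/b]!"     # full content of the textarea
    print(bbcode_insert_index(html, bb))  # 12: right after "wor" in the bbcode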
Thanks for the help.
A popular way of determining the level of similarity between two texts is to compute the aforementioned Jaccard similarity. Citing Wikipedia:
The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A, B) = |A ∩ B| / |A ∪ B|
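For concreteness, a minimal Python sketch of that definition, treating each text as a set of words (the whitespace tokenization is just one illustrative choice):

    def jaccard(text_a, text_b):
        """Jaccard similarity of two texts, treated as sets of words."""
        a, b = set(text_a.split()), set(text_b.split())
        if not (a | b):
            return 1.0  # two empty texts: define as identical
        return len(a & b) / len(a | b)

    print(jaccard("the quick brown fox", "the quick red fox"))  # 0.6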
If you have a large number of texts though, computing the full Jaccard index for every possible pair of texts is computationally very expensive. There is a way to approximate this index called minhashing: it uses several (e.g. 100) independent hash functions and, for each one, keeps the minimum hash value over a set's elements, forming a signature. This has a nice property: the probability (over all permutations/hash functions) that the minimum hash values of two sets A and B agree is exactly their Jaccard similarity J(A, B).
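A rough sketch of the minhash idea (the salted built-in hash functions below are an illustrative stand-in for properly independent hash functions):

    import random

    def minhash_signature(tokens, num_hashes=100, seed=42):
        """One minimum hash value per hash function; together, the signature."""
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        return [min(hash((salt, t)) for t in tokens) for salt in salts]

    def estimate_jaccard(sig_a, sig_b):
        """Fraction of signature positions that agree ~ Jaccard similarity."""
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    a = set("the quick brown fox jumps over the lazy dog".split())
    b = set("the quick red fox jumps over the lazy cat".split())
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))  # ~0.6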
Another way to cluster similar texts (or any other data) together is to use Locality-Sensitive Hashing, which by itself is an approximation of what KNN does, usually worse but definitely faster to compute. The basic idea is to project the data into a low-dimensional binary space (that is, each data point is mapped to an N-bit vector, the hash key). Each hash function h must satisfy the sensitive-hashing property prob[h(x)=h(y)] = sim(x,y), where sim(x,y) in [0,1] is the similarity function of interest. For dot products it can be visualized with random hyperplanes:
we can then ask what the hash of a given point is (say, 101), and everything that is close to this point gets the same hash.
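A minimal sketch of random-hyperplane hashing for dot-product similarity (three hyperplanes give 3-bit keys like the 101 mentioned above; all numbers are illustrative):

    import random

    def simhash(vector, hyperplanes):
        """One bit per hyperplane: which side of the plane the point falls on."""
        bits = ""
        for plane in hyperplanes:
            dot = sum(v * p for v, p in zip(vector, plane))
            bits += "1" if dot >= 0 else "0"
        return bits

    rng = random.Random(0)
    planes = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]  # 3 planes in 2-D

    print(simhash([1.0, 0.5], planes))    # a 3-bit key, e.g. "101"
    print(simhash([1.1, 0.6], planes))    # a nearby point usually shares the key
    print(simhash([-1.0, -0.5], planes))  # the opposite point flips the bits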
EDIT to answer the comment
No, you asked about text similarity, so that is what I answered. You are basically asking how to predict the position of a character in text 2. It depends on whether you analyze the writer's style or just pure syntax. In either case, IMHO, you need some sort of statistics that will tell you where this character is likely to occur given all the other data/text. You can go with n-grams, RNNs, LSTMs, Markov chains or any other form of sequential data analysis.
I am training multiple word2vec models with Gensim. Each model has the same parameters and dimensions but is trained on slightly different data. Then I want to compare how the change in data affected the vector representations of some words.
But every time I train a model, the vector representation of the same word is wildly different. Its similarities to other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations in such a way that the same words occupy the same position in vector space, or at least are as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
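As a rough sketch of how that could look (this assumes the gensim 3.x API and toy data; the model names, sentences, and anchor pairs are illustrative, and the notebook above has the exact signatures):

    from gensim.models import Word2Vec
    from gensim.models.translation_matrix import TranslationMatrix

    # Two models trained separately on (here, identical) toy data; in practice
    # these would be your two differently-trained models.
    sentences = [["hot", "cold", "water", "tamale"]] * 50
    model_a = Word2Vec(sentences, size=10, min_count=1, seed=1)
    model_b = Word2Vec(sentences, size=10, min_count=1, seed=2)

    # Anchor words that you specify should have equivalent vectors.
    word_pairs = [("hot", "hot"), ("cold", "cold"), ("water", "water")]

    tm = TranslationMatrix(model_a.wv, model_b.wv)
    tm.train(word_pairs)  # learns a linear projection from space A to space B

    # For a word in model A, find its nearest neighbours in model B's space.
    print(tm.translate(["tamale"], topn=3))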
But, there are some other techniques you could also consider:
transform & concatenate the training corpora instead, to both retain some words that are the same across all corpora (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)", since each was only trained by its respective subset of the data. (It's still important to have many 'anchor' words, shared between the different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying words. A sketch of this transformation is shown after this list.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
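Here is a minimal sketch of the corpus transformation from the first bullet (the anchor words and corpora are toy examples):

    def tag_corpus(sentences, tag, anchors):
        """Suffix every non-anchor token with a subcorpus tag, e.g. 'tamale(A)'."""
        return [[w if w in anchors else f"{w}({tag})" for w in sent]
                for sent in sentences]

    anchors = {"hot", "cold"}  # in practice: many frequent words shared by all sets

    corpus_a = [["hot", "tamale"], ["cold", "skiing"]]
    corpus_b = [["cold", "tamale"], ["hot", "skiing"]]

    combined = tag_corpus(corpus_a, "A", anchors) + tag_corpus(corpus_b, "B", anchors)
    print(combined)
    # [['hot', 'tamale(A)'], ['cold', 'skiing(A)'],
    #  ['cold', 'tamale(B)'], ['hot', 'skiing(B)']]
    # Shuffle and train one Word2Vec on this, then compare 'tamale(A)' vs 'tamale(B)'.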
I'm currently using Wapiti to detect specific product names in web pages.
I've trained a model, and I'd like to list the top 10 most important rules of this model (those rules that have the biggest weights, positive or negative).
Here is an example of a trained model taken from the Wapiti documentation:
[...]
12:*:Pre-3 X='s,
13:*:Pre-3 X=Wel,
13:*:Suf-3 X=rid,
[...]
10=-0x1.32892bf985df3p-1
11=0x1.73883325ee8edp-4
15=0x1.034d12a224d71p-2
16=-0x1.1fa154002a2f9p+0
So, from the above 3 rules, how do I know which one has the biggest weight? The rule *:Pre-3 X='s, is associated with the number "12". Is this number the weight, or is it a reference to the lines below? However, the number "12" does not appear in those lines.
Another question: is it possible to force a "hard constraint"? That is, to write a rule such that whenever a given observation is seen, it always produces a given tag.
For your first question, look at the dump mode of wapiti: it turns the model file into a more readable format, from which it is easy to extract the features with the highest or lowest weights.
wapiti dump model > model.txt
This will give you a text file with one feature per line, described in 4 columns: first the pattern with the substitutions expanded, next the label at the previous position (or # for unigram patterns), next the label at the current position, and finally the feature weight.
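A quick way to list the 10 features with the largest absolute weight from that dump (assuming the file was saved as model.txt as above; the hex-float fallback covers builds that print C99 hex floats):

    def parse_weight(s):
        try:
            return float(s)
        except ValueError:
            return float.fromhex(s)  # C99 hex floats like -0x1.32892bf985df3p-1

    # One feature per line, 4 whitespace-separated columns, weight last.
    features = []
    with open("model.txt") as f:
        for line in f:
            if line.strip():
                feat, weight = line.rsplit(None, 1)
                features.append((feat, parse_weight(weight)))

    features.sort(key=lambda fw: abs(fw[1]), reverse=True)
    for feat, weight in features[:10]:
        print(f"{weight:+.4f}  {feat}")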
For your second question, Wapiti has a forced decoding mode made for this. If your data have N columns of observations, just give wapiti a file with N+1 columns and put the constraints in the last column. With the --force switch of the label mode, if a valid label is present in this last column, wapiti will force the decoder to predict this label at this position and take this into account at the neighboring positions.
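As a sketch of preparing that N+1-column file (the file names, the PRODUCT label, and the "-" placeholder for unconstrained positions are illustrative assumptions; check the Wapiti manual for the exact convention):

    # Append a constraint column: pin the label wherever the token is a known
    # product name; other positions get a non-label placeholder.
    KNOWN_PRODUCTS = {"iPhone", "ThinkPad"}

    with open("test.txt") as src, open("test_constrained.txt", "w") as dst:
        for line in src:
            cols = line.split()
            if not cols:               # keep blank sequence separators as-is
                dst.write("\n")
                continue
            label = "PRODUCT" if cols[0] in KNOWN_PRODUCTS else "-"
            dst.write("\t".join(cols + [label]) + "\n")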
Foreword: I am aware there is another question like this, but mine has very specific restrictions. I have done my best to make this question applicable to many, as it is a generic grid problem, but if it still does not belong here, then I am sorry, and please be nice about it. In the past I have found Stack Overflow to be a very picky and hostile environment for question askers, but I'm hoping that was just a couple of bad people.
Goal (abstract): Check all connected squares in a 3D grid that are of the same type and touch on at least one face.
Goal (specific/implementation): Create a "fill bucket" tool in Minecraft using command blocks.
Knowledge of Minecraft is not really necessary to answer; this is more of an algorithm question, and I will stay away from Minecraft specifics.
Restrictions: I can do this in code with recursive functions, but in Minecraft there are some limitations that I wonder whether it is possible to get around.
1: No arrays (data structures) permitted.
2: I can store an integer variable and do basic calculations with it (+, -, *, /, % (mod), =, ==), but that's it. I cannot dynamically create variables or have the program create anything with a name that I did not set out ahead of time.
3: I can do "IF" and "OR" statements, and everything that derives from them.
4: I CANNOT have multiple program pointers - that is, I can't have things like recursive functions, which require a program to stop executing, execute itself from beginning to end, and then resume executing where it was - I have minimal control over the program flow.
5: I can use loops and conditional exits (so FOR loops).
6: I can have a marker on the grid in 3D space that can move regardless of the presence of blocks (I'm using an armour stand, for those who know), and I can test grid squares relative to that marker.
So say my grid is full of empty spaces, except for separate clusters of filled squares in opposite corners, not touching each other. If I use my fill-bucket tool on one filled grid square, I want it to use a single marker to check and identify all the connected grid squares. Basically, I need to be sure that it traverses the entire shape, all the nooks and crannies, but not the squares that are not connected to that shape. So in the end, one of the two clusters, from me selecting only a single square of it, will be erased/replaced by another kind of block, without affecting the other blocks around it.
Again, apologies if this doesn't belong here. And only answer this if you WANT to tackle the challenge - it's not important or anything, I just want to do this. You don't have to answer it if you don't want to. If you can solve this problem for a 2D grid, that would be helpful as well, as I could probably extend that to work in 3D.
Thank you, and if I get nobody degrading me for how I wrote this post or the fact that I did, then I will consider this a success :)
With help from this and other sources, I figured it out! It turns out that, since all recursive functions (or at least most of them) can be rewritten as loops, I can implement the equivalent of a recursive function in Minecraft. So I did, and the general idea of it is as follows:
To explain the program, you may assume the situation is a largely empty grid with one grouping of filled squares in part of it, and the goal is to replace the kind of block that grouping is made of with a different block. We'll say the grouping currently consists of red blocks, and we want to change them to blue blocks.
Initialization:
IDs - An objective (data structure) for holding each marker's ID (score)
numIDs - An integer variable holding the number of active IDs/markers
Create one marker at selected grid position with ID [1] (aka give it a score of 1 in the "IDs" objective). This grid position will be a filled square from which to start replacing blocks.
Increment numIDs
Main program:
FOR loop that runs while numIDs >= 1 (numIDs changes as markers are created and deleted)
{
at the marker with ID [1], fill the grid square with a blue block
step 1. test the block one to the +x for a red block
step 2. if found, increment numIDs
step 3. create a marker there with ID [numIDs]
[//repeat steps 1, 2 and 3 for the other five adjacent grid squares: +z, -x, -z, +y, and -y]
delete the marker with ID [1]
numIDs -= 1
subtract 1 from every remaining marker's ID, so that the next marker to evaluate, which was [2], now has ID [1]
} (end loop)
So that's what I came up with, and it works like a charm. Sorry if my explanation is hard to understand; I'm trying to explain it in a way that might make sense to both coders and Minecraft players, and maybe achieving neither :P
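For anyone reading along outside Minecraft: the same marker idea, written as ordinary code, is just an iterative flood fill with an explicit work list instead of recursion. A minimal Python sketch (names are illustrative):

    def flood_fill_3d(grid, start, new_value):
        """Replace the connected component containing start with new_value.

        grid maps (x, y, z) -> value; only face-adjacent squares are connected.
        The pending list plays the role of the armour-stand markers.
        """
        target = grid.get(start)
        if target is None or target == new_value:
            return
        pending = [start]                  # one marker at the selected square
        while pending:                     # loop while numIDs >= 1
            pos = pending.pop()            # take a marker and delete it
            if grid.get(pos) != target:
                continue
            grid[pos] = new_value          # fill with the blue block
            x, y, z = pos
            for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                pending.append((x + dx, y + dy, z + dz))  # spawn new markers

    grid = {(x, y, 0): "red" for x in range(2) for y in range(2)}
    grid[(5, 5, 5)] = "red"                # a separate, unconnected cluster
    flood_fill_3d(grid, (0, 0, 0), "blue")
    print(grid)                            # the (5, 5, 5) block stays red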