Wapiti CRF: Understanding the model file and forcing hard constraints

I'm currently using Wapiti to detect specific product names in web pages.
I've trained a model, and I'd like to list the top 10 most important rules of this model (the rules with the biggest weight, positive or negative).
Here is an example of a trained model taken from the Wapiti documentation:
[...]
12:*:Pre-3 X='s,
13:*:Pre-3 X=Wel,
13:*:Suf-3 X=rid,
[...]
10=-0x1.32892bf985df3p-1
11=0x1.73883325ee8edp-4
15=0x1.034d12a224d71p-2
16=-0x1.1fa154002a2f9p+0
So, from the above 3 rules, how do I know which one has the biggest weight? The rule *:Pre-3 X='s, is associated with the number "12". Is this number the weight? Or is it a reference to the lines below? However, the number "12" does not appear in those lines.
Another question: is it possible to force a "hard constraint"? That is, to write a rule such that whenever a given observation is seen, it always produces a given tag.

For your first question, look at the dump mode of Wapiti: it turns the model file into a more readable format, from which it is easy to extract the features with the highest or lowest weights.
wapiti dump model > model.txt
This will give you a text file with one feature per line, described with 4 columns: first the pattern with the substitutions expanded, next the label at the previous position (or # for unigram patterns), next the label at the current position, and finally the feature weight.
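For instance, here is a minimal Python sketch for picking out the 10 features with the largest absolute weight from that dump. It assumes the dump columns are whitespace separated, with the weight as the last column, and that the dump was saved as model.txt as above; Python's float.fromhex can also decode the raw hexadecimal weights such as 0x1.32892bf985df3p-1 if you prefer to work on the model file directly.
import heapq

features = []
with open("model.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 4:
            continue
        # The last column is the feature weight; use float.fromhex(parts[-1])
        # instead if your version prints the weights in hexadecimal notation.
        weight = float(parts[-1])
        features.append((abs(weight), weight, " ".join(parts[:-1])))

# 10 features with the largest absolute weight
for _, weight, feature in heapq.nlargest(10, features):
    print(weight, feature)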
For your second question, Wapiti has a forced decoding mode made for this. If your data have N columns of observations, just give Wapiti a file with N+1 columns and put the constraints in the last column. With the --force switch of the label mode, if a valid label is present in this last column, Wapiti will force the decoder to predict this label at this position and take it into account at the neighbouring positions.
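As an illustration only (the column contents below are made up, and the exact command-line syntax may differ between versions, so check wapiti label --help), a file with two observation columns would gain a third constraint column. Since only a valid label forces the decoder, anything else in that column, such as the underscore used here, leaves the position unconstrained:
Get     VB      _
the     DT      _
iPhone  NNP     PRODUCT
now     RB      _

wapiti label --force -m model constrained.txt output.txt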

Related

Foundry-workshop : sorting capabilities

I built a chart with 2 layers in Workshop; the only difference between them is the date bucket. Since the segmentation is the same, I would like each layer to be displayed with the same color.
In the chart above, you'll see 2 series: one to display the monthly evolution of a property, the second to display the yearly average. Each series is segmented by the same object (listed in the table above, coming from a hidden filter). This is achieved by defining 2 layers with exactly the same definition, except for the X-axis: the date is bucketed either monthly or yearly. The property is calculated on a monthly basis, so the monthly bucket displays the input value, and the second layer dynamically calculates the yearly average.
The main issue is that the 2 blue lines are not related to the same object.
I would also like, as far as possible, to have only one legend instead of one per layer. Currently, my workaround is to display the code in one layer and the description in the other.
Maybe I missed something, but I did not find any way to define precisely the chart colors: am I wrong?
Thus, I was wondering if there was any way to sort the input data. The filter is based on the object's primary key; is it possible to sort the query set accordingly? Maybe the segmentation would then be displayed in the same order and the colors would match that order.
Or is there any other way to proceed?
The answer to the title's question is quite easy: actually, the segmentation is not exactly the same. In the first layer the segmentation is based on a code, in the second on the object's name. Thus, the order is slightly different.
The solution is: use exactly the same segmentation in both layers.
I still wonder how to manage the display: view the legend for only one layer, choose lines' colors... But it looks like it's another subject and I'll probably open a new topic.

How to get immediate next word probability using GPT2 model?

I was trying the Hugging Face GPT-2 model. I have seen the run_generation.py script, which generates a sequence of tokens given a prompt. I am aware that we can use GPT-2 for NLG.
In my use case, I wish to determine the probability distribution for (only) the immediate next word following the given prompt. Ideally this distribution would be over the entire vocab.
For example, given the prompt "How are ", it should give a probability distribution where "you" or "they" have high probabilities and other vocabulary words have very low ones.
How can I do this using Hugging Face Transformers? If it is not possible there, is there any other transformer model that does this?
You can have a look at how the generation script works with the probabilities.
GPT2LMHeadModel (as well as the other "LMHead" models) returns a tensor that contains, for each position of the input, the unnormalized scores (logits) of what the next token might be. I.e., the logits at the last position are the scores for the token that would follow the prompt (assuming input_ids is a tensor with token indices from the tokenizer):
outputs = model(input_ids)
next_token_logits = outputs[0][:, -1, :]
You get the distribution by normalizing the logits with a softmax. The indices in the last dimension of next_token_logits correspond to indices in the vocabulary that you get from the tokenizer object.
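Putting it together, here is a minimal end-to-end sketch. It assumes the standard "gpt2" checkpoint and an older Transformers version where outputs[0] holds the logits, as in the snippet above; in recent versions you would read outputs.logits instead.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("How are", return_tensors="pt")   # shape (1, seq_len)
with torch.no_grad():
    outputs = model(input_ids)

next_token_logits = outputs[0][:, -1, :]             # scores for the next token
probs = torch.softmax(next_token_logits, dim=-1)     # distribution over the whole vocab

top_probs, top_ids = torch.topk(probs[0], 5)         # 5 most likely next tokens
for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(repr(tokenizer.decode([i])), p)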
Selecting the last logits becomes tricky when you use a batch size bigger than 1 and sequences of different lengths. In that case, you need to pass attention_mask to the model call to mask out padding tokens and then select the logits of the last real token, e.g. with torch.index_select. It is much easier to use either a batch size of 1 or a batch of equally long sequences.
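If you do need padded batches, one possible sketch looks like this (it uses advanced indexing rather than torch.index_select, which achieves the same selection, and assumes input_ids and attention_mask come from tokenizer(texts, padding=True, return_tensors="pt"); GPT-2's tokenizer has no padding token by default, so you would first set tokenizer.pad_token = tokenizer.eos_token):
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs[0]                                # (batch, seq_len, vocab)
last_positions = attention_mask.sum(dim=1) - 1     # index of the last real token per sequence
batch_indices = torch.arange(logits.size(0))
next_token_logits = logits[batch_indices, last_positions, :]
probs = torch.softmax(next_token_logits, dim=-1)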
You can use any autoregressive model in Transformers: there is DistilGPT-2 (a distilled version of GPT-2), CTRL (which is basically GPT-2 trained with some additional control codes), the original GPT (under the name openai-gpt), and XLNet (designed for contextual embeddings, but usable for generation in arbitrary order). There are probably more; you can browse the Hugging Face Model Hub.

Google Sheets calculate characters only once

Is there a formula in Google Sheets to count a character only once per row? For example, if a row has 5 columns (Monday-Friday) and 2 or 3 of them are marked with an X, how can I count how many rows have an X? I don't need to know how many Xs there are, just how many rows have at least one.
Reina, I have one answer, though there may be better ones.
This formula, pasted into B34, should do what you want. It merges all the cells in columns B to F, in each row, into one value, substitutes out possible spaces, then checks whether it contains at least one "y" (as used in your example).
=COUNTIF(ARRAYFORMULA(
SUBSTITUTE(B4:B29&C4:C29&D4:D29&E4:E29&F4:F29," ","")),
"*y*")
It is coded to search all student rows, ie. between 4 and 29 - change these row numbers if necessary.
If the attendance might be marked with something other than a "y", you could change the "y" part of the formula to "?*". I just didn't know if other values might be used, e.g. an "S" for a sick day or something, that you wanted to ignore.
Then you can drag the new formula from B34 sideways along row 34, to G34 and beyond, and it will calculate the results for the subsequent weeks. It shifts the columns being checked by the formula automatically.
Let me know if this works for you, or if you need something else.
To possibly ease data entry, here is a sample sheet with the formula, but with check boxes replacing the cells where attendance is marked.
https://docs.google.com/spreadsheets/d/1ON5Rc55aLVq_LHtFOfpgmf876bYg2ITfwpbifklr3lU/edit?usp=sharing
Here the formula is slightly modified to look for "TRUE" values, instead of "y"s.
UPDATE: To look for ANY non-blank cell in that range, and count "1" for every student that week that attended at least one day, the formula is:
=COUNTIF(
ARRAYFORMULA( B4:B29&C4:C29&D4:D29&E4:E29&F4:F29), ">""")
or
=COUNTIF(
ARRAYFORMULA( B4:B29&C4:C29&D4:D29&E4:E29&F4:F29), "?*")
See sample here:
https://docs.google.com/spreadsheets/d/1ON5Rc55aLVq_LHtFOfpgmf876bYg2ITfwpbifklr3lU/edit#gid=461771088&range=B34:F34
Let me know if this answers your question, or whether you need to do something specific with the "y", "x", and "o" values.

Kalman filter, multiple lines tracking

I have a problem with tracking multiple lines using a Kalman filter.
Input data: the number of items and a set of structures with x1, y1, x2, y2 (coordinates). In each iteration the number of items can differ, so some lines can appear or disappear.
For a single line it looks simple: we have the input data, the equations, etc., and we can produce the output. We always know the line may exist; if it is missing and appears later, it will still be the same line.
But for multiple lines I don't know how to start. In one iteration I can get a few objects, so I will use this set of equations for each of them. But in the next iteration I can get fewer or more lines. I'm not sure what the correct approach is: I have data from the previous iteration, but I need to associate it with the same object. So:
1. I need to find it - by checking the distance between the midpoints of each previously estimated line and line N, and choosing the smallest value? Is this the correct approach, or is there a different method?
2. Storing old data - a line was visible for a long time but after some iteration never appears again. Then I get a new line and the same situation repeats. It would be good to store old results, but in that case, after a long time, I will have a lot of zombie data. Are there any standard criteria for cleaning it up, or do I need my own heuristic, such as a maximum number of iterations with no detection?
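To make points 1 and 2 concrete, here is a hedged Python sketch of exactly that idea: greedy nearest-midpoint matching with a distance gate, and a miss counter for pruning stale tracks. All names (Track, GATE_DIST, MAX_MISSES) are hypothetical illustrations, not part of any particular library.
import math

MAX_MISSES = 10      # prune a track after this many iterations without a detection
GATE_DIST = 50.0     # reject associations farther apart than this

def midpoint(line):
    x1, y1, x2, y2 = line
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def associate(tracks, detections):
    """Greedy nearest-neighbour matching on midpoints; returns {track: detection}."""
    pairs = []
    for t in tracks:
        tx, ty = midpoint(t.predicted_line)      # prediction from the track's Kalman filter
        for d in detections:
            dx, dy = midpoint(d)
            pairs.append((math.hypot(tx - dx, ty - dy), t, d))
    pairs.sort(key=lambda p: p[0])
    assigned, used_tracks, used_dets = {}, set(), set()
    for dist, t, d in pairs:
        if dist > GATE_DIST:
            break                                # everything after this is even farther away
        if id(t) in used_tracks or id(d) in used_dets:
            continue
        assigned[t] = d
        used_tracks.add(id(t))
        used_dets.add(id(d))
    return assigned
Unmatched detections then start new tracks, and any track whose miss counter exceeds MAX_MISSES is deleted, which keeps the zombie data bounded.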

Seeking algo for text diff that detects and can group similar lines

I am in the process of writing a diff text tool to compare two similar source code files.
There are many such "diff" tools around, but mine shall be a little improved:
If it finds a set of lines are mismatched on both sides (ie. in both files), it shall not only highlight those lines but also highlight the individual changes in these lines (I call this inter-line comparison here).
An example of my somewhat working solution:
(Screenshot: http://files.tempel.org/tmp/diff_example.png)
What it currently does is take a set of mismatched lines and run their individual characters through the diff algorithm once more, producing the pink highlighting.
However, the second set of mismatches, containing "original 2", requires more work: here, the first two lines on the right ("added line a/b") were added, while the third line is an altered version of the left side. I want my software to detect this difference between a likely alteration and a probable new line.
When looking at this simple example, I can rather easily detect this case:
With an algorithm such as Levenshtein, I could find that, of all the right-hand lines in the set of 3 to 5, line 5 matches left line 3 best; thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
So far, so good. But I am still stuck with how to turn this into a more general algorithm for this purpose.
In a more complex situation, a set of differing lines could have added lines on both sides, with a few closely matching lines in between. This gets quite complicated:
I'd have to match not only the first line on the left to the best candidate on the right, but vice versa as well, and so on with all other lines. Basically, I have to match every line on the left against every one on the right. At worst, this might even create crossings, so that it's no longer clear which lines were newly inserted and which were just altered (note: I do not want to deal with possibly moved lines in such a block, unless that would actually simplify the algorithm).
Sure, this is never going to be perfect, but I'm trying to make it better than it is now. Any suggestions that aren't too theoretical but rather practical (I'm not good at understanding abstract algorithms) are appreciated.
Update
I must admit that I do not even understand how the LCS algo works. I simply feed it two arrays of strings and out comes a list of which sequences do not match. I am basically using the code from here: http://www.incava.org/projects/java/java-diff
Looking at the code I find one function, equal(), that is responsible for telling the algorithm whether two lines match or not. Based on what Pavel suggested, I wonder if that's the place where I'd make the changes. But how? This function only returns a boolean - not a relative value that could identify the quality of the match. And I cannot simply use a fixed Levenshtein ratio to decide whether a similar line is still considered equal or not - I'll need something that is self-adapting to the entire set of lines in question.
So, what I'm basically saying is that I still do not understand where I'd apply the fuzzy value that relates to the relative similarity of lines that do not (exactly) match.
Levenshtein distance is based on the notion of an "edit script" that transforms one string into another. It's very closely related to the Needleman-Wunsch algorithm used for aligning DNA sequences by inserting gap characters, in which we search for the alignment that maximises a score in O(nm) time using dynamic programming. Exact matches between characters increase the score, while mismatches or inserted gap characters reduce the score. An example alignment of AACTTGCCA and AATGCGAT:
AACTTGCCA-
AA-T-GCGAT
(6 matches, 1 mismatch, 3 gap characters, 3 gap regions)
We can think of the top string being the "starting" sequence that we are transforming into the "final" sequence on the bottom. Each column containing a - gap character on the bottom is a deletion, each column with a - on the top is an insertion, and each column with different (non-gap) characters is a substitution. There are 2 deletions, 1 insertion and 1 substitution in the above alignment, so the Levenshtein distance is 4.
Here is another alignment of the same strings, with the same Levenshtein distance:
AACTTGCCA-
AA--TGCGAT
(6 matches, 1 mismatch, 3 gap characters, 2 gap regions)
But notice that although there are the same number of gaps, there is one less gap region. Because biological processes are more likely to create wide gaps than multiple separate gaps, biologists prefer this alignment -- and so will the users of your program. This is accomplished by also penalising the number of gap regions in the scores that we compute. An O(nm) algorithm to accomplish this for strings of lengths n and m was given by Gotoh in 1982 in a paper called "An improved algorithm for matching biological sequences". Unfortunately, I can't find any links to free full text of the paper -- but there are many useful tutorials that you can find by googling "sequence alignment" and "affine gap penalty".
In general, different choices of match, mismatch, gap and gap region weights will give different alignments, but any negative score for gap regions will prefer the bottom alignment above to the top one.
What does all this have to do with your problem? If you use Gotoh's algorithm on individual characters with a suitable gap penalty (arrived at with a few empirical tests), you should find a significant decrease in the number of terrible-looking alignments like the example you gave.
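To make the dynamic programming concrete, here is a minimal Python sketch of Gotoh-style affine-gap scoring, built from the description above. The match, mismatch and gap weights are arbitrary example values, and it only returns the best score, not the alignment itself.
NEG = float("-inf")
MATCH, MISMATCH = 2, -1          # arbitrary example weights
GAP_OPEN, GAP_EXTEND = -4, -1    # opening a new gap region costs more than extending one

def gotoh_score(a, b):
    n, m = len(a), len(b)
    # M: column ends in a match/mismatch; X: ends in a gap in b; Y: ends in a gap in a
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = GAP_OPEN + (i - 1) * GAP_EXTEND
    for j in range(1, m + 1):
        Y[0][j] = GAP_OPEN + (j - 1) * GAP_EXTEND
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + GAP_OPEN,
                          Y[i - 1][j] + GAP_OPEN,
                          X[i - 1][j] + GAP_EXTEND)
            Y[i][j] = max(M[i][j - 1] + GAP_OPEN,
                          X[i][j - 1] + GAP_OPEN,
                          Y[i][j - 1] + GAP_EXTEND)
    return max(M[n][m], X[n][m], Y[n][m])

print(gotoh_score("AACTTGCCA", "AATGCGAT"))   # the example strings from above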
Efficiency Considerations
Ideally, you could just do this on characters and ignore lines altogether, since the affine penalty will work to cluster changes into blocks spanning many lines wherever it can. But because of the higher running time, it may be more realistic to do a first pass on lines and then rerun the algorithm on characters, using as input all lines that are not identical. Under this scheme, any shared block of identical lines can be handled by compressing it into a single "character" with inflated matching weight, which helps to ensure no "crossings" appear.
With an algorithm such as Levenshtein, I could find that, of all the right-hand lines in the set of 3 to 5, line 5 matches left line 3 best; thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
After you have determined this, use the same algorithm to determine which lines in these two chunks match each other. But you need to make a slight modification. When you used the algorithm to match equal lines, the lines could either match or not match, so each comparison added either 0 or 1 to the cell of the table you used.
When comparing strings within one chunk, some of them are "more equal" than others (a nod to Orwell). So each comparison can add a real number between 0 and 1 to the cell when considering which sequence matches best so far.
To compute this metric (from 0 to 1), you can apply to each pair of strings you encounter... right, the same algorithm again (actually, you already did this when you were doing the first pass of the Levenshtein algorithm). This computes the length of the LCS, whose ratio to the average length of the two strings gives the metric value.
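For illustration, a minimal sketch of that 0-to-1 similarity (LCS length divided by the average length of the two strings; the function names are hypothetical):
def lcs_length(a, b):
    # classic O(n*m) dynamic-programming LCS length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def line_similarity(a, b):
    # ratio of the LCS length to the average line length, in [0, 1]
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2.0)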
Or you can borrow the algorithm from one of the existing diff tools. For instance, vimdiff can highlight the matches you require.
Here's one possible solution someone else just made me realize:
My original approach was like this:
Split the text up into separate lines and use LCS algo to determine where there are blocks of nonmatching lines.
Use some smart algo (which this question is about) to figure out which of these lines closely match, i.e. to tell that these lines were modified between revisions.
Compare those closely matching lines line-by-line using LCS again, while marking the non-matching lines as entirely new.
While this would allow for a better visual display of changes when comparing source code revisions, I now found that a much simpler approach is usually sufficient. It works like this:
Same as above.
Take the right and left block of nonmatching lines, concatenate those lines, and tokenize them (either into language-specific tokens/words, or just into single characters)
Apply the LCS algo on the two arrays of tokens.
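For illustration, here is a minimal sketch of steps 2 and 3 using Python's standard difflib (the original tool is in Java, and SequenceMatcher is not literally LCS, but it serves the same purpose of aligning the two token arrays):
import difflib
import re

def inline_diff(left_lines, right_lines):
    # Step 2: concatenate each block and tokenize into words and single non-word characters.
    tokenize = lambda text: re.findall(r"\w+|\W", text)
    left = tokenize("\n".join(left_lines))
    right = tokenize("\n".join(right_lines))
    # Step 3: run the sequence matcher on the two token arrays.
    matcher = difflib.SequenceMatcher(None, left, right)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        print(op, repr("".join(left[i1:i2])), "->", repr("".join(right[j1:j2])))

inline_diff(["original 1", "original 2"],
            ["added line a", "added line b", "original 2 changed"])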
Maybe those who replied to my original question assumed that I had known to do this all along, but I had my focus so strongly on a per-line comparison that it did not occur to me to apply the LCS to the set of lines by concatenating them, instead of processing them line-by-line.
So, while this approach will not provide change information as detailed as my original intent, it still improves the results over what I started with yesterday when I wrote this question.
I'll leave this question open for a while longer - maybe someone else, reading all this, can still provide a complete answer (Pavel and random_hacker offered some suggestions, but it's not a complete solution yet - anyway, thank you for the helpful comments).
