How to automatically compute accuracy (precision, recall, F1) for NER? - metrics

I'm using a NER system that gives as output a text file containing a list of named entities which are instances of the concept Speaker. I'm looking for a tool that can compute the system's precision, recall and F1 by taking as input this list and the gold standard, in which the instances are correctly annotated with <Speaker> tags.
I have two txt files: Instances.txt and GoldStandard.txt. I need to compare the extracted instances with the gold standard in order to calculate these metrics. For example, according to the second file, the first three sentences in the first file are true positives and the last sentence is a false positive.
Instances.txt contains:
is sponsoring a lecture by <speaker> Antal Bejczy from
announces a talk by <speaker> Julia Hirschberg
His name is <speaker> Toshiaki Tsuboi He will
to produce a schedule by <speaker> 50% for problems
GoldStandard.txt contains:
METC is sponsoring a lecture by <speaker> Antal Bejczy from Stanford university
METC announces a talk by <speaker> Julia Hirschberg
The speaker is from USA His name is <speaker> Toshiaki Tsuboi He will
propose a solution to these problems
It led to produce a schedule by 50% for problems

For NER results, people usually measure precision, recall and F1-score instead of accuracy, and conlleval is probably the most common way to calculate these metrics: https://github.com/spyysalo/conlleval.py. It also reports accuracy, though.
The conlleval script takes CoNLL-format files as input. Take your first sentence as an example:
METC O O
is O O
sponsoring O O
a O O
lecture O O
by O O
Antal B-speaker B-speaker
Bejczy I-speaker I-speaker
from O O
Stanford O O
university O O
where the first column is the word; conlleval reads the last two columns as the correct tag and the guessed tag, in that order, so the second column is the gold label and the third column is the system output (they happen to be identical in this example). O indicates that a token belongs to no chunk. The prefixes B- and I- mark the beginning of a chunk and a token inside (or at the end of) a chunk, respectively. Sentences are separated by an empty line.
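For reference, the entity-level precision, recall and F1 that conlleval reports boil down to simple counts. Here is a minimal Python sketch using the numbers from your example (3 speakers extracted correctly, 1 spurious one, none missed):
# Entity-level scores from true-positive, false-positive and false-negative
# counts; the numbers correspond to the question's example (3 TP, 1 FP, 0 FN).
tp, fp, fn = 3, 1, 0

precision = tp / (tp + fp)                          # 3/4 = 0.75
recall = tp / (tp + fn)                             # 3/3 = 1.00
f1 = 2 * precision * recall / (precision + recall)  # ~0.86

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")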

It depends entirely on your use-case, and how much work you do on cleaning up/disambiguating the output from the NER. There is also the weighted F1 score; you presumably care more about missing references (i.e. you want higher recall) than about false positives (higher precision). For other types of use-cases (issuing subpoenas or warrants, banning users for chat abuse) you may not.
sklearn.metrics.f1_score() implements weighted-F1.
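For example (toy token-level labels, made up purely to show the call; average="weighted" is the relevant argument):
from sklearn.metrics import f1_score

# Toy token-level labels ("speaker" vs "O"), made up for illustration only.
y_true = ["O", "speaker", "speaker", "O", "O", "speaker"]
y_pred = ["O", "speaker", "O", "O", "speaker", "speaker"]

print(f1_score(y_true, y_pred, average="weighted"))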
Tell us more about your application: how bad is it if you mistake, misidentify or confuse a speaker name (false positive), versus missing a valid one (false negative)?

Related

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains mixture of words and space-separated characters, which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single-letter words. Mark the words created by this merging somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
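Here is a rough Python sketch of those steps, with a tiny lexicon standing in for a real word list (the '#' marker and the word list are just illustrative choices):
# Toy lexicon standing in for a real English word list, sorted longest-first
# so the greedy step below always tries the longest match.
LEXICON = sorted({"expression", "syntax", "summary", "of", "terminology"},
                 key=len, reverse=True)

def merge_single_letters(line):
    """Step 1: drop the space before single-letter tokens; '#' marks tokens
    that were produced by such a merge."""
    out = []
    for tok in line.split():
        if len(tok) == 1 and out:
            if not out[-1].startswith("#"):
                out[-1] = "#" + out[-1]
            out[-1] += tok
        else:
            out.append(tok)
    return out

def greedy_split(word):
    """Steps 2-3: repeatedly break off the longest dictionary prefix; if a
    leftover has no match, give the whole word back unchanged."""
    rest, parts = word.lower(), []
    while rest:
        match = next((w for w in LEXICON if rest.startswith(w)), None)
        if match is None:
            return [word]
        parts.append(match)
        rest = rest[len(match):]
    return parts

def clean(line):
    out = []
    for tok in merge_single_letters(line):
        out.extend(greedy_split(tok[1:]) if tok.startswith("#") else [tok])
    return " ".join(out)

print(clean("Expr e s s i o n Syntax"))  # -> "expression Syntax"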
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
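To make that concrete, here is a toy Python sketch; it is not a full sequence model, it just scores candidate words with made-up unigram frequencies and picks the cheapest segmentation by dynamic programming (the same flavour of computation as Viterbi decoding):
import math

# Made-up unigram counts; a real system would estimate these from a corpus.
FREQ = {"play": 50, "for": 200, "forth": 5, "the": 300,
        "fun": 40, "of": 250, "it": 180, "e": 1}
TOTAL = sum(FREQ.values())

def cost(word):
    # Negative log probability; unknown words get a small pseudo-count,
    # so frequent words are cheap and junk is expensive.
    return -math.log(FREQ.get(word, 0.01) / TOTAL)

def segment(text):
    text = text.lower()
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)            # back[i]: where the last word starts
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):   # cap candidate words at 12 chars
            c = best[j] + cost(text[j:i])
            if c < best[i]:
                best[i], back[i] = c, j
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("Playforthefunofit"))  # -> ['play', 'for', 'the', 'fun', 'of', 'it']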
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
Hope that helps - good luck!

Game Theory: how to apply it in transcriptomics?

Hi
I have seen this paper (and also "Game Theory applied to gene expression analysis" and "Game Theory and Microarray Data Analysis") in which the authors have used game theory for their microarray DEG analysis (the microarray game).
Is there any simple guide from you (or other online resources) that describes how to use the related formulas for applying the game theory concept in the DEG analysis of RNA-seq experiments? (Basically, is it even practical?)
Maybe there is some software for doing such an investigation painlessly.
NOTE1: For example, please have a look at "Game Theory Method" in the first paper above:
"Let N = {1, ..., n} be a set of genes. A microarray game is a coalitional game (N, w) where the function w assigns to each coalition S ⊆ N a frequency of associations, between a condition and an expression property, of genes realized in the coalition S."
Imagine we have 150 genes up-regulated in females and 80 up-regulated in males (using de novo assembly and the DESeq2 package); now how can I use game theory to mine something new, or some extra connections within this collection of genes?
NOTE2: I have asked this question on BIOSTARS but got no answer after 8 weeks.
Thanks

Why does accessing coefficients following estimation with nl require slightly different syntax than for other estimation commands?

Following most estimation commands in Stata (e.g. reg, logit, probit, etc.) one may access the estimates using the _b[ParameterName] syntax (or the synonymous _coef[ParameterName]). For example:
regress y x
followed by
di _b[x]
will display the estimate of the coefficient of x. di _b[_cons] will display the coefficient of the estimated intercept (assuming the regress command was successful), etc.
But if I use the nonlinear least squares command nl I (seemingly) have to do something slightly different. Now (leaving aside that for this example model there is absolutely no need to use a NLLS regression):
nl (y = {_cons} + {x}*x)
followed by (notice the forward slash)
di _b[/x]
will display the estimate of the coefficient of x.
Why does accessing parameter estimates following nl require a different syntax? Are there subtleties to be aware of?
"leaving aside that for this example model there is absolutely no need to use a NLLS regression": I think that's what you can't do here....
The question is about why the syntax is as it is. That's a matter of logic and a matter of history. Why a particular syntax was chosen is ultimately a question for the programmers at StataCorp who chose it. Here is one limited take on your question.
The main syntax for regression-type models grows out of a syntax designed for linear regression models in which by default the parameters include an intercept, as you know.
The original syntax for nonlinear regression models (in the sense of being estimated by nonlinear least-squares) matches a need to estimate a bundle of parameters specified by the user, which need not include an intercept at all.
Otherwise put, there is no question of an intercept being a natural default; no parameterisation is a natural default and each model estimated by nl is sui generis.
A helpful feature is that users can choose the names they find natural for the parameters, within the constraints of what counts as a legal name in Stata, say alpha, beta, gamma, a, b, c, etc. If you choose _cons for the intercept in nl that is a legal name but otherwise not special and just your choice; nl won't take it as a signal that it should flip into using regress conventions.
The syntax you cite is part of what was made possible by a major redesign of nl but it is consistent with the original philosophy.
That the syntax is different because it needs to be may not be the answer you seek, but I guess you'll get a fuller answer only from StataCorp; developers do hang out on Statalist, but they don't make themselves visible here.

Naive approach to estimate or calculate the visual difference between characters

My starting point was to generate random passwords that are visually easy to recognize. I first decided to omit characters that will probably be visually indistinguishable from others in the set I choose from randomly. Maybe this is a nonsense idea, because random passwords will be copy-pasted most of the time. I'm also aware that the difference between two given chars depends on the font, but since everyday fonts are designed to be read by humans, there will be some font-independent characteristics.
Some examples of character pairs that have a low distance from each other in most fonts:
O 0
1 l
5 S
Is there an easy way to "calculate" this sort of distance?
What is the name of this computational discipline, so that I can google it?
Edit: I now found that the term is Homoglyph
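For what it's worth, one naive way to put a number on it is to render each character to a small bitmap and compare pixels. A sketch with Pillow and NumPy (the font, image size and text offset are arbitrary choices):
from PIL import Image, ImageDraw, ImageFont
import numpy as np

# Any TrueType font would do; the default bitmap font keeps the sketch portable.
FONT = ImageFont.load_default()

def glyph(ch, size=(32, 32)):
    """Render a single character as a grayscale bitmap in [0, 1]."""
    img = Image.new("L", size, color=0)
    ImageDraw.Draw(img).text((8, 8), ch, fill=255, font=FONT)
    return np.asarray(img, dtype=float) / 255.0

def distance(a, b):
    """Mean per-pixel difference: 0 means identical rendering."""
    return float(np.abs(glyph(a) - glyph(b)).mean())

for pair in [("O", "0"), ("1", "l"), ("5", "S"), ("a", "X")]:
    print(pair, round(distance(*pair), 4))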

Deducing string transformation rules

I have a set of pairs of character strings, e.g.:
abba - aba,
haha - aha,
baa - ba,
exb - esp,
xa - za
The second (right) string in the pair is somewhat similar to the first (left) string.
That is, a character from the first string can be represented by nothing, itself or a character from a small set of characters.
There's no simple rule for this character-to-character mapping, although there are some patterns.
Given several thousands of such string pairs, how do I deduce the transformation rules such that if I apply them to the left strings, I get the right strings?
The solution can be approximate, working correctly for, say, 80-95% of the strings.
Would you recommend using some kind of genetic algorithm? If so, how?
If you could align the characters, or rather groups of characters, you could work out tables saying that aa => a, bb => z, and so on. If you had such tables, you could align the characters using http://en.wikipedia.org/wiki/Dynamic_time_warping. One approach is therefore to guess an alignment (e.g. one for one, just as a starting point, or just align the first and last characters of each sequence), work out a translation table from that, use DTW to get a new alignment, work out a revised translation table, and iterate in that way. Perhaps you could wrap this up with enough maths to show that there is some measure of optimality or probability that such passes increase, climbing to a local maximum.
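As a very rough illustration of the align-then-tabulate idea (using difflib's built-in character alignment instead of DTW, just to keep the sketch short), one could start with something like:
from collections import Counter
from difflib import SequenceMatcher

# The example pairs from the question; in practice this would be the full list.
pairs = [("abba", "aba"), ("haha", "aha"), ("baa", "ba"),
         ("exb", "esp"), ("xa", "za")]

rules = Counter()
for left, right in pairs:
    for op, i1, i2, j1, j2 in SequenceMatcher(None, left, right).get_opcodes():
        if op == "equal":
            for ch in left[i1:i2]:
                rules[(ch, ch)] += 1                  # character maps to itself
        elif op == "replace":
            rules[(left[i1:i2], right[j1:j2])] += 1   # group substitution
        elif op == "delete":
            rules[(left[i1:i2], "")] += 1             # maps to nothing
        elif op == "insert":
            rules[("", right[j1:j2])] += 1            # inserted on the right

for (src, dst), count in rules.most_common():
    print(f"{src!r} -> {dst!r}  x{count}")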
There is probably some way of doing this by modelling a Hidden Markov Model that generates both sequences simultaneously and then deriving rules from that model, but I would not choose this approach unless I was already familiar with HMMs and had software to use as a starting point that I was happy to modify.
You can use text-to-speech to create sound waves, then compare the sound waves with each other and match them with percentages.
This is my theory of how Google has such an advanced spell checker.
