Defining TokensRegex for Stanford NLP RegexNER - stanford-nlp

I'm trying to create a TokensRegex pattern to tag universities as schools in input text. For example, University of Wisconsin or Universidad Anahuac should get tagged as SCHOOL.
I have this as my pattern
( /University|Universidad/ /of?/ [ {ner:LOCATION}|{ner:ORGANIZATION} ]+ ) SCHOOL
I can't seem to get the syntax correct. Any help would be appreciated.

Related

How to replace the matched text with a word specified by rules files?

I am currently working on a Stanford CoreNLP program that replaces matched text with a specified word, using a list of given rules. I checked the TokensRegex expression documentation, and I know there are functions that can be used in the Action field to do that:
Replace(List<CoreMap>, tokensregex, replacement)
Match(String,regex,replacement)
However, it is not clear to me how to use these functions in my rules files, and I couldn't find any examples on GitHub or other web pages.
Here is an example of a replacement:
Input text: John Smith is a member of the NLP lab.
Matched pattern: "John Smith" is replaced with "Student A" in the text.
Resulting text: Student A is a member of the NLP lab.
Could anyone help me? I am new to Stanford CoreNLP and have a lot to learn.
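For illustration only, the input/output behaviour described above can be sketched with plain Ruby regular expressions. This is not TokensRegex and does not call CoreNLP; the RULES table and the apply_rules helper are hypothetical names invented for this sketch, and the pattern matches raw text rather than annotated tokens.

```ruby
# Hypothetical rule table: each pattern is a plain Ruby Regexp,
# not a TokensRegex rule from a CoreNLP rules file.
RULES = [
  { pattern: /\bJohn Smith\b/, replacement: "Student A" }
].freeze

# Apply every rule in order and return the rewritten text.
def apply_rules(text, rules = RULES)
  rules.reduce(text) do |current, rule|
    current.gsub(rule[:pattern], rule[:replacement])
  end
end

puts apply_rules("John Smith is a member of the NLP lab.")
# prints "Student A is a member of the NLP lab."
```

A token-level solution inside CoreNLP would still need the rules-file syntax the question asks about; the sketch only pins down the expected result.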

Specific Part of Speech labels for Java Stanford NLP

What is the set of PoS labels produced by Stanford NLP (including the PoS labels for punctuation tokens), and what does each label mean?
I know this question has been asked several times, such as in:
Java Stanford NLP: Part of Speech labels?
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf
but those answers list some typical PoS labels that are not specific to Stanford NLP. For instance, none of those answers lists the -LRB- PoS label used by Stanford NLP for the ( punctuation mark.
Where can I find this list of PoS labels in the source code of the Stanford NLP?
Also, what are some token examples annotated with the SYM PoS label?
Also, how can I tell whether a token is punctuation?
Here they define isPunctation == true if its PoS is :|,|.|“|”|-LRB-|-RRB-|HYPH|NFP|SYM|PUNC. However, Stanford NLP does not use all of these PoS tags.
It is the Penn Treebank POS set, but many descriptions of this tag set seem to omit punctuation marks. Here is a complete list of tags:
https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf
(But parentheses are tagged as -LRB- and -RRB-, not sure why they don't mention this in the documentation.)
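As a rough sketch of the punctuation check asked about in the question, a tag can be tested against a fixed set. The set below is copied from the list quoted in the question, with one assumption: the curly quotes there are taken to stand for the Penn Treebank `` and '' tags. The names PUNCTUATION_TAGS and punctuation_tag? are hypothetical, and whether SYM should count as punctuation is a judgment call inherited from that list.

```ruby
require "set"

# Tags treated as punctuation. Taken from the list quoted in the question;
# "``" and "''" are an assumed reading of the curly quotes in that list.
PUNCTUATION_TAGS = Set[
  ":", ",", ".", "``", "''", "-LRB-", "-RRB-", "HYPH", "NFP", "SYM", "PUNC"
].freeze

# Return true when the given PoS tag is in the punctuation set.
def punctuation_tag?(pos_tag)
  PUNCTUATION_TAGS.include?(pos_tag)
end

puts punctuation_tag?("-LRB-") # prints "true"
puts punctuation_tag?("NN")    # prints "false"
```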

stanfordnlp - Training space separated words as a single token to Stanford NER model generation

I have read the detailed description given here: http://nlp.stanford.edu/software/crf-faq.shtml#a on training the model from a labelled input file according to the .prop file. But the article says:
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus has some space-separated words that together form a single token rather than separate words. For instance, "Wright State University" is a single token, although Wright, State and University are each entities on their own. I would like to train the model with such a phrase treated as a single token. The article says that the input file used to train the model should contain tab-separated columns, with the token in the first column and the label in the second. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He O
attended O
Wright SCHOOL
State SCHOOL
University SCHOOL
. O
If you don't have sentences, and you simply have a list of strings that should be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html
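If a list of strings is all you have, a RegexNER mapping file can be generated from it. The sketch below assumes the basic two-column layout, a space-separated token pattern and a label separated by a tab; regexner_mapping is a hypothetical helper name, and phrases containing regex metacharacters would need escaping, which this sketch does not do.

```ruby
# Build one mapping line per phrase: "<token pattern>\t<label>".
# Runs of whitespace inside a phrase are collapsed to single spaces,
# so each space-separated word becomes one token pattern.
def regexner_mapping(phrases, label)
  lines = phrases.map { |phrase| "#{phrase.split.join(' ')}\t#{label}" }
  lines.join("\n") + "\n"
end

mapping = regexner_mapping(["Wright State University", "University of Wisconsin"], "SCHOOL")
puts mapping
```

The resulting text can be saved to a file and passed to RegexNER as described on the page linked above.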

Stanford NER tool -- spaces in training file

I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file that has spaces only to delimit the items the system expects. For instance,
/a/b/c sanferro 2
/d/e/f ginger 2
However, I run into errors while trying forms such as:
/a/b/c san ferro 2
Here "san ferro" is a single "word" and "2" is the "answer" or desired labeling output.
How can I encode spaces? I've tried enclosing the phrase in double quotes, but that doesn't work.
Typically you use CoNLL style data to train a CRF. Here is an example:
-DOCSTART- O
John PERSON
Smith PERSON
went O
to O
France LOCATION
. O
Jane PERSON
Smith PERSON
went O
to O
Hawaii LOCATION
. O
A tab character ("\t") separates each token from its tag. You put a blank line between sentences. You use the special symbol "-DOCSTART-" to indicate where a new document starts. Typically you provide a large set of sentences when you are training a CRF.
If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/
Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml
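A minimal sketch of producing training data in this layout follows. It assumes one token per line, a tab between token and tag, and an empty line between sentences (the usual CoNLL convention); to_conll is a hypothetical helper name. A multi-word name such as "Wright State University" is simply three consecutive lines carrying the same tag, so no space ever needs to be encoded inside a token.

```ruby
# Each sentence is a list of [token, tag] pairs. A multi-word entity is
# several consecutive tokens that carry the same tag.
def to_conll(sentences)
  blocks = sentences.map do |sentence|
    sentence.map { |token, tag| "#{token}\t#{tag}" }.join("\n")
  end
  # Document marker first, then the sentences, separated by empty lines.
  (["-DOCSTART-\tO"] + blocks).join("\n\n") + "\n"
end

sentences = [
  [["He", "O"], ["attended", "O"], ["Wright", "SCHOOL"],
   ["State", "SCHOOL"], ["University", "SCHOOL"], [".", "O"]],
  [["Jane", "PERSON"], ["Smith", "PERSON"], ["went", "O"],
   ["to", "O"], ["Hawaii", "LOCATION"], [".", "O"]]
]
puts to_conll(sentences)
```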

ruby regex split string by . but not if it's part of a word contained in an exclusion list

I want to split sentences on a specific character, but only if that character isn't part of a word contained in an exclusion list. For example, I want to split the sentence on a full stop ".", but only if it does not come after "Dr" or "Prof". Take this sentence:
"Im a Dr. of Physics and my Name is Sheldon Cooper. Im working at the University of Pasadena."
So the regex should just split by the fullstop after "Cooper" but not after the "Dr".
You can use negative lookbehind:
a = "Im a Dr. of Physics and my Name is Sheldon Cooper. Im working at the University of Pasadena."
a.split(/(?<!Dr|Prof)\./)
#=> ["Im a Dr. of Physics and my Name is Sheldon Cooper", " Im working at the University of Pasadena"]
The titles have to be listed explicitly inside the lookbehind, separated by |, for example: Dr|Prof|Assoc
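One way to keep the exclusion list separate from the pattern is to build the lookbehind from an array. This is a sketch: TITLES and split_sentences are hypothetical names, and the trailing strip is an addition that removes the leading space left on each piece.

```ruby
# Exclusion list kept apart from the regex so it is easy to extend.
TITLES = %w[Dr Prof Assoc].freeze

# Split on "." unless the full stop directly follows one of the titles.
def split_sentences(text, titles = TITLES)
  alternatives = titles.map { |title| Regexp.escape(title) }.join("|")
  pattern = Regexp.new("(?<!#{alternatives})\\.")
  text.split(pattern).map(&:strip)
end

a = "Im a Dr. of Physics and my Name is Sheldon Cooper. Im working at the University of Pasadena."
p split_sentences(a)
#=> ["Im a Dr. of Physics and my Name is Sheldon Cooper", "Im working at the University of Pasadena"]
```

Note that the lookbehind only checks the characters immediately before the full stop, so any word that happens to end in "Dr" or "Prof" is also protected from splitting.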
