Stanford NER tool -- spaces in training file - stanford-nlp

I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file that has spaces only to delimit the items the system expects. For instance,
/a/b/c sanferro 2
/d/e/f ginger 2
However, I run into errors while trying forms such as:
/a/b/c san ferro 2
Here "san ferro" is a single "word" and "2" is the "answer" or desired labeling output.
How can I encode spaces? I've tried enclosing the phrase in double quotes, but that doesn't work.

Typically you use CoNLL style data to train a CRF. Here is an example:
-DOCSTART- O
John PERSON
Smith PERSON
went O
to O
France LOCATION
. O

Jane PERSON
Smith PERSON
went O
to O
Hawaii LOCATION
. O
A tab character ("\t") separates each token from its tag. Put a blank line between sentences, and use the special symbol "-DOCSTART-" to mark where a new document starts. A CRF is typically trained on a large set of such sentences.
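The format described above can be produced with a few lines of Python. This is only a minimal sketch: the function name `to_conll` and the nested-list input shape are assumptions made for illustration, not part of Stanford NER. It also rejects tokens that contain whitespace, since one token per line is what the format expects.

```python
def to_conll(documents):
    """Render documents as tab-separated token/tag lines.

    `documents` is a list of documents, each document is a list of
    sentences, and each sentence is a list of (token, tag) pairs.
    This input shape is an assumption made for this sketch.
    """
    lines = []
    for document in documents:
        lines.append("-DOCSTART-\tO")  # marks the start of a new document
        for sentence in document:
            for token, tag in sentence:
                if any(ch.isspace() for ch in token):
                    # "san ferro" must become two lines: "san" and "ferro"
                    raise ValueError(f"token contains whitespace: {token!r}")
                lines.append(f"{token}\t{tag}")
            lines.append("")  # a blank line ends the sentence
    return "\n".join(lines)


example = [[
    [("John", "PERSON"), ("Smith", "PERSON"), ("went", "O"),
     ("to", "O"), ("France", "LOCATION"), (".", "O")],
    [("Jane", "PERSON"), ("Smith", "PERSON"), ("went", "O"),
     ("to", "O"), ("Hawaii", "LOCATION"), (".", "O")],
]]
print(to_conll(example))
```

A multi-word name such as "san ferro" would then be written as two lines, one per word, each carrying the same tag.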
If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/
Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml

Related

How to replace matched text with a word specified by a rules file?

I am currently working on a Stanford CoreNLP program that replaces matched text with a specified word using a list of given rules. I checked the TokensRegex expression documentation, and I know there are functions that can be used in the Action field to do that:
Replace(List<CoreMap>, tokensregex, replacement)
Match(String,regex,replacement)
However, it is not clear to me how to use these functions in my rules files, and I couldn't find any examples on GitHub or other web pages.
Here is an example of a replacement:
Input text: John Smith is a member of the NLP lab.
Matched pattern: "John Smith" is replaced with "Student A" in the text.
Resulting text: Student A is a member of the NLP lab.
Could anyone help me? I am new to Stanford CoreNLP and have a lot to learn.

Ruby Regex solution for 3 types of strings

I need to grab the first part of a string.
Three types of cases I need to match are:
The Jones Group
Amanda Jones,
William Smith, Director
I only want to grab the name (The Jones Group, Amanda Jones, and William Smith), not the comma or anything after it. Sometimes a name is followed by a comma with nothing after it. Other times just a group name is used (e.g. The Stanley Team).
I've used
/^(\w.+),.+/
but this fails for cases 1) and 2)
I've also tried
/(\w.+)?,/
but this fails for 1)
You can use this to match all 3:
^([\w ]+)?,?.*$
Notice the space inside the [ ]: the character class matches only word characters and spaces rather than any character, so the capture stops at the comma. Otherwise "Director" would be pulled into the capture group as well.
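As a quick check, here is the same pattern tried against the three cases. The sketch uses Python's `re` module rather than Ruby, and the helper name `leading_name` is made up for illustration; the pattern itself is unchanged.

```python
import re

# Same pattern as above: capture a run of word characters and spaces,
# then allow an optional comma and anything after it.
NAME_PATTERN = re.compile(r"^([\w ]+)?,?.*$")


def leading_name(text):
    """Return the part of `text` before the first comma, or None."""
    match = NAME_PATTERN.match(text)
    return match.group(1) if match else None


for sample in ["The Jones Group", "Amanda Jones,", "William Smith, Director"]:
    print(repr(leading_name(sample)))
# 'The Jones Group'
# 'Amanda Jones'
# 'William Smith'
```

Because the character class only allows word characters and spaces, a name containing an apostrophe or a period would be cut short at that character, so the class may need widening for such data.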

How can I train my own Chinese NER model

I'm trying to train my own Chinese NER model by following the instructions at https://nlp.stanford.edu/software/crf-faq.html. I converted the data to one Chinese character per line and put the entity label after each character, like this:
红 ORG
帽 ORG
首 O
席 O
执 O
行 O
官 O
Jim PERSON
Whitehurst PERSON
曾 O
表 O
示 O
, O
亚 ORG
马 ORG
逊 ORG
公 O
共 O
云 O
有 O
许 O
多 O
...
After running the command java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop tech.prop, it finally generated the classifier (chinese.misc.distsim.crf.ser.gz). I then checked how the classifier performs on annotated test data with the command java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier chinese.misc.distsim.crf.ser.gz -testFile test.tsv, and it seems to work.
But when I ran the classifier on a plain text paragraph instead of annotated test data, using the command java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier chinese.misc.distsim.crf.ser.gz -textfile test.txt, the classifier appeared to be useless: it didn't recognize the word-segmented Chinese.
Is there a problem with how I trained the new Chinese NER model?
One probable problem is that I converted the training data to one Chinese character per line. In Chinese, a single character is not necessarily a word. Should I use word-segmented training data instead, convert it to one Chinese word per line, and label words rather than characters?
Flags that may be useful for handling different types of text input are:
-plainTextDocumentReaderAndWriter CLASSNAME Specify a class to read text documents (which extends DocumentReaderAndWriter)
-tokenizerFactory CLASSNAME Specify a class to do tokenization (which extends TokenizerFactory)
-tokenizerOptions "tokenizeNLs=true,asciiQuotes=true" Give options to the tokenizer, such as the two example options here.
This might be useful too:
https://stanfordnlp.github.io/CoreNLP/human-languages.html
Apart from these, you should take a look at Chinese word-segmenter features in SeqClassifierFlags.
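If the training data is re-labeled at word level, as the question suggests, the existing character-level labels can be merged once a segmentation is available. The following is only a rough sketch: the function name `merge_to_words` is hypothetical, and the segmentation in the example is an assumed output of some word segmenter, not something taken from the original data.

```python
def merge_to_words(unit_tags, words):
    """Merge per-character (unit, tag) pairs into per-word (word, tag) pairs.

    `unit_tags` is the labeled data, one entry per training line.
    `words` is a segmentation of the same text (assumed to come from a
    separate word segmenter). Raises ValueError if the two do not align
    or if a word spans units with different tags.
    """
    merged = []
    i = 0
    for word in words:
        text, tags = "", set()
        # consume labeled units until they cover the current word
        while len(text) < len(word) and i < len(unit_tags):
            unit, tag = unit_tags[i]
            text += unit
            tags.add(tag)
            i += 1
        if text != word:
            raise ValueError(f"segmentation does not align at {word!r}")
        if len(tags) != 1:
            raise ValueError(f"mixed tags inside {word!r}: {sorted(tags)}")
        merged.append((word, tags.pop()))
    if i != len(unit_tags):
        raise ValueError("labeled units left over after the last word")
    return merged


labeled = [("红", "ORG"), ("帽", "ORG"), ("首", "O"), ("席", "O"),
           ("执", "O"), ("行", "O"), ("官", "O"), ("Jim", "PERSON")]
segmented = ["红帽", "首席", "执行官", "Jim"]  # assumed segmenter output
for word, tag in merge_to_words(labeled, segmented):
    print(f"{word}\t{tag}")
```

The merged pairs can then be written one word per line, with a tab between word and tag, so that training and plain-text tagging both operate on segmented words.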

stanfordnlp - Training space separated words as a single token to Stanford NER model generation

I have read the detailed description at http://nlp.stanford.edu/software/crf-faq.shtml#a of how to train a model from a labelled input file according to the .prop file. But the article says:
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus contains space-separated words that together form a single token rather than individual words. For instance, "Wright State University" is a single token, even though Wright, State and University could each be entities individually. I would like to train the model with this phrase as a single token. The article says that the input file should be tab-separated, with the token in the first column and the label in the second. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He O
attended O
Wright SCHOOL
State SCHOOL
University SCHOOL
. O
If you don't have sentences and simply have a list of strings that should always be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html
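For the list-of-strings case, a RegexNER mapping file has one entry per line: the phrase, written as space-separated token patterns, then a tab, then the label. The sketch below builds such lines in Python. The function name `regexner_mapping` is made up for illustration, and it assumes the phrases contain no regular-expression metacharacters, since each space-separated piece is treated as a pattern over one token.

```python
def regexner_mapping(entries):
    """Build RegexNER mapping text from (phrase, label) pairs.

    Each output line is "<space-separated tokens>\t<label>". Phrases are
    assumed to be plain words with no regex metacharacters.
    """
    lines = []
    for phrase, label in entries:
        tokens = phrase.split()  # normalizes repeated or stray whitespace
        if not tokens:
            raise ValueError("empty phrase")
        lines.append(" ".join(tokens) + "\t" + label)
    return "\n".join(lines) + "\n"


print(regexner_mapping([("Wright State University", "SCHOOL")]), end="")
```

Here the spaces inside the phrase are allowed because they separate token patterns, while the tab separates the phrase from its label. The resulting text would be saved to a file and supplied to the annotator as its mapping file, as described in the RegexNER documentation linked above.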

Stanford NER Classifier linefeed issue

I'm using the Stanford NER with a 3 class model to identify PERSON, LOCATION, and ORGANIZATION in a file. It works fine except when there are names separated by a newline:
JANE DOE
JOHN DOE
JANE SMITH
The NER tool treats these three names as one big name rather than three separate names. If I put a comma after each name, it picks up the three names. How can I tell the tool to use the newline to separate the three names?
If the names end up as successive tokens in the same "sentence", that is what will happen. The main thing you can do is have the system tokenize/sentence split on newlines; then you will get a separate sentence for each name and things will work fine. In general, this works well if your text is formatted as one paragraph per line (with soft line-wrapping, as is usual in modern text), but badly if your text has hard line breaks that are not at sentence/paragraph boundaries, because then the system will wrongly treat each line as a sentence. The commands below do this, first through CoreNLP and then by calling Stanford NER directly:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators "tokenize,ssplit,pos,lemma,ner" -file taylorswift.txt -outputFormat conll -ssplit.newlineIsSentenceBreak always
java edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz -textFile taylorswift.txt -tokenizerOptions tokenizeNLs=true
