Stanford NER Classifier linefeed issue - stanford-nlp

I'm using the Stanford NER with a 3 class model to identify PERSON, LOCATION, and ORGANIZATION in a file. It works fine except when there are names separated by a newline:
JANE DOE
JOHN DOE
JANE SMITH
The NER tool treats these three names as one big name rather than three separate names. If I put a comma after each name, it picks up all three. How can I tell the tool to use the newline to separate the names?

If the names end up as successive tokens in the same "sentence", that is what will happen. The main thing you can do is have the system tokenize and sentence-split on newlines; you then get a separate sentence for each name and things work fine. In general, this works well if your text is formatted as one paragraph per line (with soft line-wrapping, as is usual in modern text), but badly if your text has hard line breaks that do not fall at sentence or paragraph boundaries, because the system will then wrongly treat each line as a sentence. Commands that do this, first through CoreNLP and then calling Stanford NER directly, are:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators "tokenize,ssplit,pos,lemma,ner" -file taylorswift.txt -outputFormat conll -ssplit.newlineIsSentenceBreak always
java edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz -textFile taylorswift.txt -tokenizerOptions tokenizeNLs=true
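As a rough illustration of what splitting sentences on newlines means for the three names, here is a toy Python sketch. It is not CoreNLP's tokenizer or sentence splitter, and the function name split_on_newlines is made up for this example; it only mimics the "every line is a sentence" behaviour described above.

```python
def split_on_newlines(text):
    """Treat every non-empty line as one sentence and whitespace-tokenize it."""
    return [line.split() for line in text.splitlines() if line.strip()]


# One name per line: each name becomes its own "sentence".
names = "JANE DOE\nJOHN DOE\nJANE SMITH"
name_sentences = split_on_newlines(names)

# Hard-wrapped prose: the line break falls mid-sentence, so the sentence
# is wrongly cut in two and "New York" loses its left context.
wrapped = "The meeting was held in\nNew York last week."
wrapped_sentences = split_on_newlines(wrapped)

print(name_sentences)
print(wrapped_sentences)
```

The second call shows the downside mentioned in the answer: with hard line breaks, one real sentence is split into two pieces.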

Related

ZEBRA ZPL label formats prints quotations along with text

In our space we use a program written in ACUCOBOL-GT which pushes variables such as PHY NAME to a written format. The program then replaces PHY NAME with the actual name of the pharmacy, and CUPS pushes the data onto the label, producing a readable output. We use Zebra printers with label formats written in EPL to print these pharmacy medication labels.
Recently I wrote a format using ZPL to produce the same label, as I'm able to customize the data better than with the old EPL.
The only issue I'm running into is that with ZPL the data now prints onto the label between quotation marks.
So instead of Montagu Pharmacy, it prints "Montagu Pharmacy ".
An example of the old EPL code is as follows: PHY NAME A010,025,0,4,1,1,N,
and the output of that code is: A010,025,0,4,1,1,N,"MONTAGU PHARMACY "
The output shows the quotes before printing, but as soon as it is sent to the printer the quotes are removed and the text prints without them.
Here is an example of the ZPL code: PHY NAME ^FO100,025,0^A0N,18,30^FD
and here is the output of that code: PHY NAME ^FO100,025,0^A0N,18,30^FD"Montagu Pharmacy "
Again the output shows the quotes before printing, but this time the label prints the text along with the quotes.
I understand that the ACUCOBOL-GT program creates the data with the quotation marks; EPL never printed them, but ZPL does.
Is there anything I'm doing wrong, or anything I can do to remove the quotes so that the label prints normally as before?
I would really appreciate any assistance.
Kind regards
Hans Steyn
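This is because in EPL the "DATA" argument of the A command is delimited by double quotes ("), so the quotes are part of the command syntax and are not printed. In ZPL the field data after ^FD is not quoted, so any quotes you send are printed as ordinary characters. You need to remove the quotes as follows: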
PHY NAME ^FO100,025,0^A0N,18,30^FDMontagu Pharmacy
References:
EPL (A command)
https://www.servopack.de/support/zebra/EPL2_Manual.pdf
page 41
ZPL (^FD command)
https://www.zebra.com/content/dam/zebra/manuals/printers/common/programming/zpl-zbi2-pm-en.pdf
page 172
EDIT 1
As per your comment, AFAIK there is no way to remove the quotes after the command reaches the printer. So the only way is to have your supplier's software send the command to some custom tool/software you have that will strip the quotes then relay the message to the printer.
[supplier software] -> [custom software] -> [Printer]
I don't know their software, but they could generate a text file that your custom tool would load and strip the quotes from, for example with a regex that even allows for escaped quotes in the middle of the name, if there are any.
This being said, it is quite surprising that removing quotes requires massive development, especially because they already seem able to generate both EPL and ZPL.
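The quote-stripping relay described in this answer could be sketched roughly as follows in Python. This is only an assumption about what such a tool might look like: the name strip_fd_quotes, the backslash-escaping convention, and the choice to also drop trailing padding inside the quotes are all made up for illustration and are not taken from the supplier's software.

```python
import re

# ^FD followed by a double-quoted payload. Backslash-escaped quotes inside
# the payload are tolerated (an assumed escaping convention).
QUOTED_FD = re.compile(r'\^FD"((?:[^"\\]|\\.)*)"')


def strip_fd_quotes(zpl: str) -> str:
    """Remove the quotes around ^FD field data and trim trailing padding."""
    # A function is used as the replacement so that backslashes in the
    # payload are copied literally rather than read as group references.
    return QUOTED_FD.sub(lambda m: "^FD" + m.group(1).rstrip(), zpl)


line = 'PHY NAME ^FO100,025,0^A0N,18,30^FD"Montagu Pharmacy "'
print(strip_fd_quotes(line))
```

Lines that contain no quoted ^FD data pass through unchanged, so the same filter could sit between the generating program and the printer for every label line.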

How to replace the matched text with a word specified by rules files?

I am currently working on a Stanford CoreNLP program that replaces matched text with a specified word, using a list of given rules. I checked the TokensRegex expression documentation, and I know there are functions that can be used in the Action field:
Replace(List<CoreMap>, tokensregex, replacement)
Match(String,regex,replacement)
to do that. However, it is not clear to me how to use these functions in my rules files, and I couldn't find any example on GitHub or other web pages.
Here is an example of a replacement:
Input text: John Smith is a member of the NLP lab.
Matched pattern: "John Smith" is replaced with "Student A" in the text.
Resulting text: Student A is a member of the NLP lab.
Could anyone help me? I am new to Stanford CoreNLP and have a lot to learn.

How can I expand stanford coreNLP spanish model/dictionary

I just ran a "hello world" using Stanford CoreNLP to get named entities from text. But some places are not recognized properly: "Ixhuatlancillo" and "Veracruz", both cities that should be labeled as LUG (place), are labeled as ORG.
I would like to expand the Spanish model or dictionary to add places (cities) in México and to add person names. How can I do this?
Thanks in advance.
The fastest and easiest way would be to use the regexner annotator. You can use this to manually build a dictionary.
Here is an example rule format (columns are separated by tabs; the first column can be any number of words):
system administrator	TITLE	MISC	2
token sequence	tag	tags-that-can-be-overwritten	priority
The rule above would mark "system administrator" in text as TITLE.
For your case:
Veracruz	LUG	MISC,ORG,PERS	2
This will allow the dictionary to overwrite MISC, ORG, and PERS tags. If you don't list tags in the third column, the rule won't overwrite NER tags that were already assigned.
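If you have a long list of cities, the rules file can be generated rather than typed by hand. Here is a small Python sketch, assuming the four-column tab-separated layout shown above; the helper name regexner_rule and the sample city list are made up for this example.

```python
def regexner_rule(phrase, tag, overwritable=("MISC", "ORG", "PERS"), priority=2):
    """Build one tab-separated RegexNER mapping line.

    Columns: token sequence, tag, tags that may be overwritten, priority.
    """
    return "\t".join([phrase, tag, ",".join(overwritable), str(priority)])


# Hypothetical list of place names to tag as LUG.
cities = ["Veracruz", "Ixhuatlancillo"]
rules = "\n".join(regexner_rule(city, "LUG") for city in cities)
print(rules)

# To save the rules for use with -regexner.mapping (path is a placeholder):
# with open("/path/to/new_spanish.rules", "w", encoding="utf-8") as f:
#     f.write(rules + "\n")
```

Using "\t".join keeps multi-word entries such as "system administrator" in a single first column, which is easy to get wrong when editing the file by hand.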
You might use a command like this to run it:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -props StanfordCoreNLP-spanish.properties -regexner.mapping /path/to/new_spanish.rules -regexner.ignorecase -regexner.validpospattern "^(NN|JJ|NNP).*" -outputFormat text -file sample-text.txt
Note that -regexner.ignorecase makes matching case-insensitive, and -regexner.validpospattern restricts matches to token sequences whose POS tags match the specified pattern.
All of this being said, I just ran on the sentence:
Ella fue a Veracruz.
and it tagged it properly. Could you let me know what sentence you ran on that caused an incorrect tag for Veracruz?

stanfordnlp - Training space separated words as a single token to Stanford NER model generation

I have read the detailed description given here- http://nlp.stanford.edu/software/crf-faq.shtml#a on training the model based on the labelled input file according to the .prop file. But the article says-
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus has some space-separated words which together form a single token rather than separate words. For instance, "Wright State University" is a single token, though Wright, State and University are also entities individually. I would like to generate the model with the above treated as a single token. The article says that the input file used to generate the model should be tab-separated, with the first column being the token and the second column the label. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He	O
attended	O
Wright	SCHOOL
State	SCHOOL
University	SCHOOL
.	O
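If your annotations are stored as phrase spans rather than per-token tags, a conversion like the one above can be scripted. A minimal Python sketch, assuming you already have the sentence tokenized and know the token offsets of each entity (the function name to_token_lines and the span format are made up for this example):

```python
def to_token_lines(tokens, spans):
    """Emit one "token<TAB>label" line per token.

    spans is a list of (start, end, label) token offsets, end exclusive;
    tokens outside every span get the background label "O".
    """
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return "\n".join(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels))


tokens = ["He", "attended", "Wright", "State", "University", "."]
print(to_token_lines(tokens, [(2, 5, "SCHOOL")]))
```

Every token of "Wright State University" gets the same label on its own line, so no spaces ever appear inside a column.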
If you don't have sentences, and you simply have a list of strings that should be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html

Stanford NER tool -- spaces in training file

I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file that has spaces only to delimit the items the system expects. For instance,
/a/b/c sanferro 2
/d/e/f ginger 2
However, I run into errors while trying forms such as:
/a/b/c san ferro 2
Here "san ferro" is a single "word" and "2" is the "answer" or desired labeling output.
How can I encode spaces? I've tried enclosing the phrase in double quotes, but that doesn't work.
Typically you use CoNLL style data to train a CRF. Here is an example:
-DOCSTART-	O

John	PERSON
Smith	PERSON
went	O
to	O
France	LOCATION
.	O

Jane	PERSON
Smith	PERSON
went	O
to	O
Hawaii	LOCATION
.	O
A tab character ("\t") separates each token from its tag. You put a blank line between sentences, and you use the special symbol "-DOCSTART-" to mark where a new document starts. Typically you provide a large set of sentences when training a CRF.
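Before training, it can help to check that a file really follows this layout. Here is a small Python sketch of a reader for the format just described; it is not part of Stanford NER, and the function name read_conll is made up for this example. It assumes exactly one tab per non-blank line and raises an error otherwise, which also catches the "san ferro" style of mistake where a space was used instead of a tab.

```python
def read_conll(lines):
    """Group "token<TAB>tag" lines into sentences.

    Blank lines and -DOCSTART- lines end the current sentence.
    Raises ValueError if a line does not contain exactly one tab.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")  # fails on missing or extra tabs
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sample = [
    "-DOCSTART-\tO", "",
    "John\tPERSON", "Smith\tPERSON", "went\tO", "to\tO", "France\tLOCATION", ".\tO", "",
    "Jane\tPERSON", "Smith\tPERSON", "went\tO", "to\tO", "Hawaii\tLOCATION", ".\tO",
]
parsed = read_conll(sample)
print(len(parsed), "sentences")
```

Each sentence comes back as a list of (token, tag) pairs, so a multi-word entity such as "John Smith" is simply two consecutive tokens with the same tag.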
If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/
Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml
