How to force NER class in Stanford NLP

Historically I have used OpenNLP for natural language processing. I decided to give Stanford NLP a try on my latest project and am running into issues with NER. Specifically, when a particular token is processed (TOKENP in my example), I would like the pipeline to classify it with the type TOKENP.
I have read through the documentation multiple times, read through the response to this related SO post, and I still cannot get the token reliably assigned TOKENP.
Here is the rules file (labels.txt):
TOKENP TOKENP PERSON 5
Here is the input file (tmp.txt):
Michael Scott Dunder Mifflin TOKENP
Here is the command I am using:
java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,ner' -ner.fine.regexner.mapping labels.txt -outputFormat text -file tmp.txt
And here is the output:
Tokens:
[Text=Michael CharacterOffsetBegin=0 CharacterOffsetEnd=7 PartOfSpeech=NNP Lemma=Michael NamedEntityTag=PERSON]
[Text=Scott CharacterOffsetBegin=8 CharacterOffsetEnd=13 PartOfSpeech=NNP Lemma=Scott NamedEntityTag=PERSON]
[Text=Dunder CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NNP Lemma=Dunder NamedEntityTag=PERSON]
[Text=Mifflin CharacterOffsetBegin=21 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Mifflin NamedEntityTag=PERSON]
[Text=TOKENP CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=TOKENP NamedEntityTag=PERSON]
Extracted the following NER entity mentions:
Michael Scott Dunder Mifflin TOKENP PERSON
I expected TOKENP, the last token in the input, to be assigned the class TOKENP based on the rules.

The issue is that the rules system won't break up the entity "Michael Scott Dunder Mifflin TOKENP": as the output shows, the statistical NER has already tagged all five tokens as a single PERSON entity, and the rules will not split it. I could try adding an option that writes a rule-based tag over anything under all circumstances (or check whether such an option already exists!)
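In the meantime, one workaround is to run the ner annotator's additional TokensRegex sub-annotator (described in the answer to the related question below) with a rule that explicitly re-tags the token. A minimal sketch, where tokenp.rules is an assumed file name and the rule syntax mirrors the address example later on this page:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ( /TOKENP/ ), action: Annotate($0, ner, "TOKENP") }
and run:
java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules tokenp.rules -outputFormat text -file tmp.txt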

Related

Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP?

I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a Scala REPL session showing how I'm trying to set up the annotation pipeline.
# import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
# import edu.stanford.nlp.util.PropertiesUtils.asProperties
# val pipe = new StanfordCoreNLP(asProperties(
"customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
"annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
"ner.combinationMode", "HIGH_RECALL",
"tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP@2ce6a051
# val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
# pipe.annotate(doc)
# doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
# doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the document's entityMentions. Is there a way to do this?
Also, is there a way, from the document, to distinguish two adjacent TokensRegex matches from a single match (assuming I have a more complicated set of regexes; in the current example I match exactly 3 tokens, so I could just count tokens)?
I tried approaching it using regexner with a TokensRegex pattern as described at https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't get that working.
Since I'm working in scala I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.
Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. That said, we are aiming to release Stanford CoreNLP 3.9.2 fairly soon, and it will include these changes.
If you read this page you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator.
https://stanfordnlp.github.io/CoreNLP/ner.html
Furthermore, there is an extensive write-up on TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.
You could run a command like this:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
This will run a TokensRegex sub-annotator during the full named entity recognition process. Then, when the final entity-mention step runs, it will operate on the rule-extracted named entities and create entity mentions from them.
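The answer doesn't show the rules file itself; as a minimal sketch, reusing the address pattern from the question above, my_ner.rules might look like:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "ADDRESS") }
With the entity-mention step running after the rules, doc.entityMentions should then include the address span as well.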

How to get CoreAnnotations.CoNLLDepAnnotation and CoreAnnotations.GovernorAnnotation as annotators through StanfordCorenlp pipeline?

There are standard names for annotators (like tokenize, ssplit, pos), but I am not sure what name should be specified for CoNLLDepAnnotation and GovernorAnnotation, or what other annotators these depend on.
Dependency parsing requires the parse annotator. All of the dependency annotations you mentioned are produced by this annotator. The code below will print the semantic graph of each sentence in a list format.
// Print each sentence's collapsed-dependency graph as a list of relations
for (CoreMap sentenceAnnotation : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    SemanticGraph deps = sentenceAnnotation.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
    System.out.println(deps.toList());
}
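For context, a minimal sketch of the pipeline setup that produces the annotated document used above (variable names are assumptions):
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Build a pipeline that includes the parse annotator (and its prerequisites)
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// Annotate the example sentence
Annotation document = new Annotation("Apple even went as far to make an electric guitar version.");
pipeline.annotate(document);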
For example, the output for the sentence "Apple even went as far to make an electric guitar version." will be:
root(ROOT-0, went-3)
nsubj(went-3, Apple-1)
advmod(went-3, even-2)
advmod(far-5, as-4)
advmod(went-3, far-5)
mark(make-7, to-6)
xcomp(went-3, make-7)
det(version-11, an-8)
amod(version-11, electric-9)
compound(version-11, guitar-10)
dobj(make-7, version-11)
punct(went-3, .-12)
where, for the first token, Apple, the relation is nsubj and the governor is the third token, went.
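If you need the governor programmatically (per the GovernorAnnotation in the question), the SemanticGraph API exposes the same information. A minimal sketch, assuming the deps graph from the loop above:
// Look up the node for the first token ("Apple") by its index in the sentence
IndexedWord apple = deps.getNodeByIndex(1);
// Its governor (parent) in the dependency graph
IndexedWord governor = deps.getParent(apple);            // went-3
// The relation on the edge between governor and dependent
SemanticGraphEdge edge = deps.getEdge(governor, apple);
System.out.println(edge.getRelation() + "(" + governor.word() + ", " + apple.word() + ")");  // nsubj(went, Apple)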

Stanford NER: how to add our own tags in existing NER models?

I am trying to make my own NER classifier with my own tags in it. I tried training a model using the instructions at http://nlp.stanford.edu/software/crf-faq.shtml#j. The problem is that I do not have much training data, so I was wondering whether there is a way to add my own tags to existing classifiers like english.all.3class.distsim.crf.ser and english.all.7class.distsim.crf.ser, and then train the classifier on my own tags.
Please help me in this regard. Thank you in advance.
You can use any tags (e.g. PERSON) by replacing the default ones (e.g. PERS) in the .tsv training file. The classifier learns whatever tags you supply via the tsv file, and it will tag text with them once you load the resulting custom model.
Take part of the jane-austen-emma-ch1.tsv file (from http://nlp.stanford.edu/software/ner-example/jane-austen-emma-ch1.tsv) and put in your own custom tags for training, as follows. Here I have two tags: PERSON and ADJECTIVE.
CHAPTER O
I O
Emma PERSON
Woodhouse PERSON
, O
handsome ADJECTIVE
, O
clever ADJECTIVE
, O
and O
rich ADJECTIVE
, O
with O
a O
comfortable ADJECTIVE
Now you can feed this tsv file to the classifier (put the tsv file name in the .prop file) and generate the model as shown below:
java -cp "stanford-ner.jar:slf4j-api.jar" edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.prop
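The ner.prop file referenced above isn't shown in the answer; a minimal version, adapted from the austen.prop example in the CRF FAQ (file names are assumptions), might look like:
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true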
Now, let's test the model on a text file and see how it annotates it. Let's take the following text file (toBeAnnotated.txt):
CHAPTER
I Emma Woodhouse, handsome, clever and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings
Running the following command annotates the text file:
java -mx600m -cp "stanford-ner.jar:slf4j-api.jar" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile toBeAnnotated.txt -outputFormat inlineXML 2> /dev/null
The output I got (I have added newlines for clarity) is:
I <PERSON>Emma Woodhouse</PERSON>,
<ADJECTIVE>handsome</ADJECTIVE>, <ADJECTIVE>clever</ADJECTIVE>
and <ADJECTIVE>rich</ADJECTIVE>, with a <ADJECTIVE>comfortable</ADJECTIVE>
home and happy <ADJECTIVE>disposition</ADJECTIVE>,
seemed to unite some of the best blessings

Stanford Named Entity Recognizer (NER) functionality with NLTK

Is it possible to get functionality similar to the Stanford Named Entity Recognizer using just NLTK?
Is there any example?
In particular, I am interested in extracting the LOCATION part of the text. For example, from the text
The meeting will be held at 22 West Westin st., South Carolina, 12345
on Nov.-18
ideally I would like to get something like
(S
22/LOCATION
(LOCATION West/LOCATION Westin/LOCATION)
st./LOCATION
,/,
(South/LOCATION Carolina/LOCATION)
,/,
12345/LOCATION
.....
or simply
22 West Westin st., South Carolina, 12345
Instead, I am only able to get
(S
The/DT
meeting/NN
will/MD
be/VB
held/VBN
at/IN
22/CD
(LOCATION West/NNP Westin/NNP)
st./NNP
,/,
(GPE South/NNP Carolina/NNP)
,/,
12345/CD
on/IN
Nov.-18/-NONE-)
Note that if I enter my text into http://nlp.stanford.edu:8080/ner/process I get results that are far from perfect (the street number and zip code are still missing), but at least "st." is part of the LOCATION and South Carolina is a LOCATION and not some "GPE / NNP".
What am I doing wrong? How can I fix this so that NLTK extracts the location piece of the text?
Many thanks in advance!
NLTK DOES have an interface for Stanford NER; check nltk.tag.stanford.NERTagger (in newer NLTK versions the class is called StanfordNERTagger).
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
'/usr/share/stanford-ner/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
output:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
However, every time you call tag, NLTK simply writes the target sentence to a file, runs the Stanford NER command-line tool on that file, and parses the output back into Python. Therefore the overhead of loading the classifier (around 1 minute for me, every time) is unavoidable.
If that's a problem, use Pyner.
First run Stanford NER as a server
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9191
then, from the pyner folder:
import ner
tagger = ner.SocketNER(host='localhost', port=9191)
tagger.get_entities("University of California is located in California, United States")
# {'LOCATION': ['California', 'United States'],
# 'ORGANIZATION': ['University of California']}
tagger.json_entities("Alice went to the Museum of Natural History.")
#'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Hope this helps.

Stanford NER: How do I create a new training set that I can use and test out?

From my understanding, to create a training file, you put your words in a text file. Then after each word, add a space or tab along with the tag (such as PERS, LOC, etc.).
I also copied text from a sample properties file into a word pad. How do I get these into a gz file that I can input into the classifier and use?
Please guide me through this. I'm a newbie and am fairly inept with technology.
Your training file (say training-data.tsv) should look like this:
I O
drove O
to O
Vancouver LOCATION
BC LOCATION
yesterday O
where O means "Outside" (i.e. not part of a named entity), and the separator between the columns is a tab.
You don't put them in a ser.gz file. The ser.gz file is the classifier model that is created by the training process.
To train the classifier run:
java -cp ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop my-classifier.properties
where my-classifier.properties would look like this:
trainFile = training-data.tsv
serializeTo = my-classification-model.ser.gz
map = word=0,answer=1
...
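Once training completes, the serialized model can be applied to new text the same way as the bundled classifiers (a sketch; the file names follow the properties above, and sample.txt is a placeholder):
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier my-classification-model.ser.gz -textFile sample.txt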
I'd advise you to take a look at the NLTK documentation to learn more about training a tagger: http://nltk.googlecode.com/svn/trunk/doc/howto/tag.html. That said, it seems that you want to train the CRFClassifier (not the parser!); for that, you may want to check this FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a
