I'm working on an annotation task of named entities in a text corpus. I found guidelines in the document 1999 Named Entity Recognition Task Definition. In that document, there are guidelines that pertain to titles of persons, in particular the following one: Titles such as “Mr.” and role names such as “President” are not considered part of a person name. For example, in “Mr. Harry Schearer” or “President Harry Schearer”, only Harry Schearer should be tagged as person.
In Stanford NER, though, there are many examples where titles are included in the person tag (Captain Weston, Mr. Perry, etc.). See, for example, the gazette they provide. In their view of person tags, it seems that even “Mrs. and Miss Bates” should be tagged as a person.
Question: what is the most generally accepted guideline?
If you download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml
and run this command:
java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -ssplit.eolonly -annotators tokenize,ssplit,pos,lemma,ner -file ner_examples.txt -outputFormat text
(assuming you put some sample sentences, one sentence per line in ner_examples.txt)
the tagged tokens will be shown in: ner_examples.txt.out
You can try out some sentences and see how our current NER system handles different situations. This system is trained on data that does not have titles tagged as PERSON, so our current system in general does not tag the titles as PERSON.
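If your corpus already contains PERSON spans that include titles and you want to bring them in line with the 1999 guideline quoted in the question, a small post-processing step can strip them. This is only a minimal sketch: the TITLES set and the function name are hypothetical, and a real title list would need to be much longer.

```python
# Hypothetical, deliberately short list of titles and role names.
TITLES = {"mr.", "mrs.", "ms.", "miss", "dr.", "president", "captain"}


def strip_leading_titles(tokens):
    """Drop title tokens from the front of a PERSON span.

    At least one token is always kept, so a span that consists
    only of a title is returned unchanged.
    """
    start = 0
    while start < len(tokens) - 1 and tokens[start].lower() in TITLES:
        start += 1
    return tokens[start:]


print(strip_leading_titles(["Mr.", "Harry", "Schearer"]))
print(strip_leading_titles(["President", "Harry", "Schearer"]))
```

Both calls should leave only the tokens Harry and Schearer in the span.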
I have trained Stanford NER to extract organization names from text. I used the IO tagging format. It works fine. However, I wonder if changing the tag format to IOB (or another format) might improve the scores?
Suppose you have a sentence that lacks normal punctuation, like this:
John Sam Ted are all here.
If you don't have a B tag you won't be able to tell if this should be three entities or one entity with three words.
On the other hand, many common entity types can't simply run together in normal English text, since there will be at least a comma between them.
If you can set it up, IOB is the better choice in case entities do run together, but depending on your data set it may not be an issue. You'll have to look at the data to tell.
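To make the difference concrete, here is a small sketch that decodes the example sentence under both schemes. The two decoder functions are written just for this illustration and are not part of Stanford NER; the IOB tags shown are what an annotator would write if the three names are meant as separate entities.

```python
def spans_from_io(tokens, tags):
    """Group consecutive tokens that share the same non-O tag into one entity."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag != "O" and tag == current_type:
            current.append(token)
            continue
        if current:
            spans.append((" ".join(current), current_type))
        current, current_type = ([token], tag) if tag != "O" else ([], None)
    if current:
        spans.append((" ".join(current), current_type))
    return spans


def spans_from_iob(tokens, tags):
    """Start a new entity at every B- tag; I- tags extend the open entity."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and current and etype == current_type:
            current.append(token)
            continue
        if current:
            spans.append((" ".join(current), current_type))
        current, current_type = ([token], etype) if prefix in ("B", "I") else ([], None)
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["John", "Sam", "Ted", "are", "all", "here", "."]
io_tags = ["PERSON", "PERSON", "PERSON", "O", "O", "O", "O"]
iob_tags = ["B-PERSON", "B-PERSON", "B-PERSON", "O", "O", "O", "O"]

print(spans_from_io(tokens, io_tags))    # one entity: 'John Sam Ted'
print(spans_from_iob(tokens, iob_tags))  # three entities: 'John', 'Sam', 'Ted'
```

With IO tags the three names can only be read as a single entity, while the B- prefix lets the decoder keep them apart.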
Given the name "David" presented in three different ways ("DAVID david David"), CoreNLP is only able to mark #1 and #2 as MALE despite the fact that #3 is the only one marked as a PERSON. I'm using the standard model provided originally and I attempted to implement the suggestions listed here but 'gender' is not allowed before NER anymore. My test is below with the same results in both Java and Jython (Word, Gender, NER Tag):
DAVID, MALE, O
david, MALE, O
David, None, PERSON
This is a bug in Stanford CoreNLP 3.8.0.
I have made some modifications to the GenderAnnotator and submitted them. They are available now on GitHub. I am still working on this, so probably over the next day or so there will be further changes, but I think this bug is fixed now. You will also need the latest version of the models jar which was just updated that contains the name lists. I believe shortly I will build another models jar with larger name lists.
The new version of GenderAnnotator requires the entitymentions annotator to be used. Also, the new version logs the gender of both the CoreMap for the entity mention and for each token of the entity mention.
You can learn how to work with the latest version of Stanford CoreNLP off of GitHub here: https://stanfordnlp.github.io/CoreNLP/download.html
I have the phrase:
5th-6th Grade Teacher, Mount Pilot Elementary School
RegExner mapping file contents:
Pilot TITLE
Annotators:
tokenize,ssplit,pos,lemma,depparse,ner,regexner
Everything works fine with such configuration, I get the phrase "Mount Pilot Elementary School" tagged as ORGANIZATION, and in the corenlp log I have a message:
Not annotating 'Pilot': ORGANIZATION with [TITLE], sentence is '5th-6th Grade Teacher , Mount Pilot Elementary School'
So this is OK and expected behaviour.
However, once I add the following line to the mapping file:
Labor ORGANIZATION
CoreNLP returns these tags for the same sentence:
Mount/ORGANIZATION
Pilot/TITLE
Elementary School/ORGANIZATION
The ORGANIZATION tag on "Pilot" gets overwritten by the TITLE tag for "Pilot" from the mapping file.
Is there any way to avoid this behaviour? I just wanted to tag "Labor" as an ORGANIZATION; I didn't want to force CoreNLP to overwrite NER tags with RegexNER. In my opinion it is a bit unexpected, but maybe this is more of a feature than a bug.
Your rule needs to be in this format:
Pilot TITLE MISC 1
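The columns are tab-separated. The third column lists the entity types this rule is allowed to overwrite (here only MISC), and the fourth is the rule's priority. With the rule written this way, it will not overwrite other label types such as ORGANIZATION.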
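As a rough illustration of why the extra columns help, here is a simplified model of the overwrite check. It assumes the documented rule that O can always be overwritten and any other tag only if it appears in the rule's third column. The function names are hypothetical and this is not CoreNLP's actual implementation; in particular it does not model what happens when the third column is left out.

```python
def parse_rule(line):
    """Split one tab-separated mapping line into its (up to) four columns."""
    fields = line.rstrip("\n").split("\t")
    pattern, label = fields[0], fields[1]
    overwritable = set(fields[2].split(",")) if len(fields) > 2 and fields[2] else set()
    priority = float(fields[3]) if len(fields) > 3 else 0.0
    return pattern, label, overwritable, priority


def may_overwrite(existing_tag, overwritable):
    """Simplified rule: O is always replaceable, other tags only if listed."""
    return existing_tag == "O" or existing_tag in overwritable


pattern, label, overwritable, priority = parse_rule("Pilot\tTITLE\tMISC\t1")

print(may_overwrite("ORGANIZATION", overwritable))  # ORGANIZATION is not listed
print(may_overwrite("MISC", overwritable))          # MISC is listed
print(may_overwrite("O", overwritable))             # O is always replaceable
```

Under this model the "Pilot" token inside "Mount Pilot Elementary School" keeps its ORGANIZATION tag, because ORGANIZATION is not among the types the rule may overwrite.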
I was looking at the online demo: http://nlp.stanford.edu:8080/ner/process
Try a simple testcase like: John Chambers studied in London (UK) and Mumbai (India).
The 3-class classifier identifies the Person; the 7-class classifier does not. It seems like I need to run the parser with both models: once to identify Person, Location and Organization, and once just for Currency?
When I run this command it finds all of the appropriate entities on your example:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file sample-sentence.txt -outputFormat text
When you run the NERCombinerAnnotator, which corresponds to the annotator ner, it automatically runs a combination of several models for you.
I have been using the Stanford NER tagger to find the named entities in a document. The problem I am facing is described below.
Let the sentence be: "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about:
1. Take your data and run it through Stanford NER or some other NER.
2. Look at the results and find all the mistakes.
3. Correctly tag the incorrect results and feed them back into your NER.
4. Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data and training.
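The "make up a bunch of hyphenated names" idea could be sketched like this. Everything here is an assumption for illustration: the name lists and the sentence template are made up, the output is one token and one label per line separated by a tab, and the hyphen is treated as its own token tagged O. Check how your tokenizer actually splits "Fleck-Anna" before training on data like this.

```python
from itertools import permutations

# Hypothetical sample names; a real training set would need far more variety.
FIRST_NAMES = ["Ryan", "Anna", "Joel", "Lana"]
LAST_NAMES = ["Fleck", "Boden", "Coen", "Ross"]

# Sentence template modelled on the example in the question.
TEMPLATE_PREFIX = ["The", "film", "is", "directed", "by"]
TEMPLATE_SUFFIX = ["pair", "."]


def make_sentence(person_a, person_b):
    """Return (token, label) rows for 'The film is directed by A - B pair.'"""
    rows = [(tok, "O") for tok in TEMPLATE_PREFIX]
    rows += [(tok, "PERSON") for tok in person_a]
    rows.append(("-", "O"))  # assumption: the hyphen is a separate token
    rows += [(tok, "PERSON") for tok in person_b]
    rows += [(tok, "O") for tok in TEMPLATE_SUFFIX]
    return rows


def to_tsv(rows):
    """Render rows as tab-separated 'token<TAB>label' lines."""
    return "\n".join(f"{tok}\t{label}" for tok, label in rows)


people = list(zip(FIRST_NAMES, LAST_NAMES))
sentences = [make_sentence(a, b) for a, b in permutations(people, 2)]

# Sentences are separated by a blank line in the printed output.
print("\n\n".join(to_tsv(rows) for rows in sentences[:2]))
```

Because the hyphen is tagged O, even plain IO tags keep "Ryan Fleck" and "Anna Boden" apart as two PERSON entities.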
Instead of using Stanford CoreNLP you could try Apache OpenNLP. It has an option to train a model on your own training data. Because the model depends on the names you supply, it is able to detect the names you are interested in.