I have the phrase:
5th-6th Grade Teacher, Mount Pilot Elementary School
RegexNER mapping file contents:
Pilot TITLE
Annotators:
tokenize,ssplit,pos,lemma,depparse,ner,regexner
Everything works fine with this configuration: the phrase "Mount Pilot Elementary School" gets tagged as ORGANIZATION, and the CoreNLP log contains the message:
Not annotating 'Pilot': ORGANIZATION with [TITLE], sentence is '5th-6th Grade Teacher , Mount Pilot Elementary School'
So this is OK and expected behaviour.
However, once I add the following line to the mapping file:
Labor ORGANIZATION
CoreNLP returns these tags for the same sentence:
Mount/ORGANIZATION
Pilot/TITLE
Elementary School/ORGANIZATION
"Pilot" ORGANIZATION get overwrited by "Pilot" TITLE from the mapping file.
Is there any way to avoid this behaviour? I just wanted to tag "Labor" as an ORGANIZATION, I didn't want to force CoreNLP overwrite NER tags by RegexNER. In my opinion it is a bit unexpected, but maybe this is kind of a feature than a bug
Your rule needs to be in this format:
Pilot TITLE MISC 1
Then it will not overwrite other label types.
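Applied to the mapping file from the question, both rules would look like this (a sketch: columns are tab-separated, the third column lists the label types the rule is allowed to overwrite, and the fourth is a priority):
Pilot	TITLE	MISC	1
Labor	ORGANIZATION	MISC	1
With MISC as the only overwritable type, the TITLE rule no longer replaces the statistical ORGANIZATION tag on "Pilot".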
I have trained Stanford NER to extract organization names from text. I used the IO tagging format, and it works fine. However, I wonder if changing the tag format to IOB (or another format) might improve the scores?
Suppose you have a sentence that lacks normal punctuation, like this:
John Sam Ted are all here.
If you don't have a B tag, you won't be able to tell whether this should be three entities or one entity with three words.
On the other hand, many common types of entities can't just run together in normal English text, since you'll at least have a comma between them.
If you can set it up, using IOB is better in case you have entities run together, but depending on your data set it may not be an issue. You'll have to look at the data to tell.
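To make the difference concrete for the sentence above (hypothetical PERSON tags):
With IO tagging:
John/PERSON Sam/PERSON Ted/PERSON are/O all/O here/O
there is no way to tell whether this is one three-word entity or three one-word entities.
With IOB tagging:
John/B-PERSON Sam/B-PERSON Ted/B-PERSON are/O all/O here/O
each B- marks the beginning of a new entity, so these are unambiguously three separate people (a single three-word entity would instead be John/B-PERSON Sam/I-PERSON Ted/I-PERSON).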
I am trying to extract locations from hotel reviews; by locations I mean hotel names, cities, neighbourhoods, POIs, and countries. I am using a gazetteer list with 165,000 entities (the list doesn't have hotel names) marked as location.
I have sloppyGazette turned on, but the gazette isn't helping much. I am confused about what I should include in the gazetteer list.
PS: I am a novice as far as NLP is concerned, so a little help about which features to use is much appreciated.
Hi, there is new, more detailed documentation about the NER functionality here:
https://stanfordnlp.github.io/CoreNLP/ner.html
The rules format is one rule per line:
Los Angeles CITY LOCATION,MISC 1.0
Great Wall Of China LANDMARK LOCATION,MISC 1.0
Some of the functionality is only available if you use the latest code from GitHub, but a lot is available in Stanford CoreNLP 3.9.1
In short, the NER annotator runs these steps:
statistical NER models
rules for numeric sequences and SUTime (for times and dates)
rules for fine grained NER (CITY, STATE_OR_PROVINCE, COUNTRY, etc...)
additional rules specified by user (this is new and not currently available in 3.9.1)
build entity mentions (identify that the tokens "Los" and "Angeles" should be the entity "Los Angeles")
You can either download the code from GitHub and build the latest version, or you can just add your custom rules to the ner.fine.regexner annotator as described in the link above.
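As a minimal sketch of the second option, assuming the ner.fine.regexner.mapping property described in the linked documentation (my_rules.txt is a placeholder for a rules file in the format shown above):

import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreEntityMention;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CustomRulesDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    // ner needs tokenize, ssplit, pos and lemma to run first
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    // point the fine-grained rules step at a custom rules file
    // (property name taken from the linked NER documentation)
    props.setProperty("ner.fine.regexner.mapping", "my_rules.txt");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("We walked along the Great Wall Of China.");
    pipeline.annotate(doc);
    // entity mentions are already merged into multi-token spans here
    for (CoreEntityMention em : doc.entityMentions()) {
      System.out.println(em.text() + "\t" + em.entityType());
    }
  }
}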
I have a custom annotated corpus, in OpenNLP format. Ex:
<START:Person> John <END> went to <START:Location> London <END>. He visited <START:Organisation> ACME Co <END> in the afternoon.
What I need is to segment the sentences in this corpus, but segmentation won't always work as expected because of the annotations.
How can I do it without losing the entity annotations?
I am using OpenNLP.
If you want to create multiple NLP models for OpenNLP, you need multiple training formats:
The tokenizer requires a training format
The sentence detector requires a training format
The name finder requires a training format
Therefore, you need to manage these different annotation layers in some way; the sketches below show the same text in two of these formats.
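For example, the two sentences from the question look different in each layer (sketches of the standard OpenNLP training formats):

Sentence detector training data (one sentence per line, no entity tags):
John went to London.
He visited ACME Co in the afternoon.

Name finder training data (one tokenized sentence per line, with inline tags):
<START:Person> John <END> went to <START:Location> London <END> .
He visited <START:Organisation> ACME Co <END> in the afternoon .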
I created an annotation tool and a Maven plugin which help you do this; have a look here. All the information can be stored in a single file, and the Maven plugin will generate the NLP models for you.
Let me know if you have any further questions.
I used the NER from StanfordCoreNLP to recognize entities, including organizations, locations, and persons. But something weird happens. For example, when I input a sentence like "Cleveland Cavaliers", it recognizes 'Cleveland' as a LOCATION rather than 'Cleveland Cavaliers' as an ORGANIZATION.
I am not very familiar with NER and I don't know how it works. My task is to get all the company names in a text, and the results I have got are not very satisfactory. Two ways of solving the problem occur to me: the first is to modify the dictionary and insert the correct data; the second is to train the model. But I still have some questions.
Will the first way work effectively?
If the answer to question 1 is yes, how do I modify the dictionary?
Furthermore, the FAQ at https://nlp.stanford.edu/software/crf-faq.shtml#a describes how to train an NER model, but what confuses me most is what I will get if I train my own model.
If I create a dataset containing entries like 'Cleveland Cavaliers' labelled as ORGANIZATION and use it to train the model, what will happen inside the model? Will the dictionary inside the CRFClassifier change?
Will the trained CRFClassifier fix the problem, so that when I input 'Cleveland Cavaliers' it recognizes 'Cleveland Cavaliers' as an organization entity?
These are all my puzzles, and I am preparing a dataset to try the second way. Can anybody answer the four questions above?
Thanks
I think the first solution is not very technical: every time you want to tag a new company, you would need to update your dictionary.
I prefer your second solution; I have done this before and trained a new model to tag my sentences.
If you have a good corpus that is big enough and tagged properly, it may take some time to train, but it is worth the effort.
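As a sketch of what this looks like with the CRFClassifier from the FAQ linked in the question (file names here are placeholders): the training data is one token per line with a tab-separated label, for example:

Cleveland	ORGANIZATION
Cavaliers	ORGANIZATION
won	O
again	O
.	O

A minimal properties file (my.prop), using a subset of the feature flags from the FAQ's example:

trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true

Then training is a single command:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop my.prop

What you get back is the serialized model file; a CRF learns feature weights from the labelled tokens rather than storing a dictionary, so seeing 'Cleveland Cavaliers' tagged as ORGANIZATION in enough contexts is what teaches it to prefer that reading.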
I have been using the Stanford NER tagger to find the named entities in a document. The problem I am facing is described below:
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks "Ryan" as one entity, "Fleck-Anna" as another, and "Boden" as a third entity. The correct marking should be "Ryan Fleck" as one and "Anna Boden" as another.
Is this a problem with the NER tagger, and if so, can it be handled?
How about:
take your data and run it through Stanford NER or some other NER.
look at the results and find all the mistakes
correctly tag the incorrect results and feed them back into your NER.
lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data and training.
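As a sketch, such made-up training lines in the Stanford NER one-token-per-line format might look like this (invented names; note that the default tokenizer keeps "Fleck-Anna" as a single token, which is part of what makes this case hard, since a tag cannot split an entity boundary inside a token):

The	O
film	O
is	O
directed	O
by	O
Ryan	PERSON
Fleck-Anna	PERSON
Boden	PERSON
pair	O
.	O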
Instead of using Stanford CoreNLP, you could try Apache OpenNLP. There is an option to train your own model on your training data. As the model depends on the names you supply, it is able to detect the names you are interested in.
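A minimal sketch of training and using a custom name finder with the OpenNLP API (train.txt is a placeholder for a file in the <START:person> ... <END> training format, one tokenized sentence per line):

import java.io.File;
import java.nio.charset.StandardCharsets;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

public class TrainNameFinder {
  public static void main(String[] args) throws Exception {
    // read the annotated training sentences
    ObjectStream<NameSample> samples = new NameSampleDataStream(
        new PlainTextByLineStream(
            new MarkableFileInputStreamFactory(new File("train.txt")),
            StandardCharsets.UTF_8));

    // train a "person" model with the default parameters
    TokenNameFinderModel model = NameFinderME.train(
        "en", "person", samples,
        TrainingParameters.defaultParams(), new TokenNameFinderFactory());

    // tag a tokenized sentence with the freshly trained model
    NameFinderME finder = new NameFinderME(model);
    String[] tokens = {"John", "went", "to", "London", "."};
    for (Span span : finder.find(tokens)) {
      System.out.println(span.getType() + ": "
          + String.join(" ", java.util.Arrays.copyOfRange(
              tokens, span.getStart(), span.getEnd())));
    }
  }
}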