Stanford NLP - NER & Models - stanford-nlp

I was looking at the online demo: http://nlp.stanford.edu:8080/ner/process
Try a simple testcase like: John Chambers studied in London (UK) and Mumbai (India).
The 3-class classifier identifies the Person, but the 7-class classifier does not. It seems I need to run the text through both models: once to identify Person, Location & Organization, and once more just for Currency?

When I run this command it finds all of the appropriate entities on your example:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file sample-sentence.txt -outputFormat text
When you run the NERCombinerAnnotator, which corresponds to the ner annotator, it automatically runs a combination of several models for you.
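To illustrate the idea of combining several models, here is a minimal sketch in which earlier models take precedence and later models only fill in tokens that are still unlabelled. This is a simplification for illustration, not the actual NERCombinerAnnotator logic; the function name combine_labels and the sample labels are hypothetical.

```python
from typing import List


def combine_labels(model_outputs: List[List[str]], background: str = "O") -> List[str]:
    """Merge per-token labels from several models; earlier models win on conflicts."""
    if not model_outputs:
        return []
    length = len(model_outputs[0])
    if any(len(labels) != length for labels in model_outputs):
        raise ValueError("all models must label the same tokens")
    combined = [background] * length
    for labels in model_outputs:
        for i, label in enumerate(labels):
            # Only fill positions that no earlier model has labelled.
            if combined[i] == background and label != background:
                combined[i] = label
    return combined


# Hypothetical per-token output of two models for the same five tokens.
tokens = ["John", "Chambers", "studied", "in", "London"]
three_class = ["PERSON", "PERSON", "O", "O", "LOCATION"]
seven_class = ["O", "O", "O", "O", "LOCATION"]
merged = combine_labels([three_class, seven_class])
```

With this kind of merge, a person found by only one model still appears in the combined result, so there is no need to run the two models by hand.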

Related

Override named entity with RegexNER instead of CRF model

I am trying to detect named entities using Stanford CoreNLP in a task.
I have already added the following rule to my RegexNER mapping file:
Train VEHICLE_TYPE 2.0
But it's identifying Train as a CRIMINAL_CHARGE entity.
I have added the option ner.applyFineGrained and set it to true, so maybe that's why the rule is being overridden by CoreNLP's CRF model.
My question is how to add exceptions like this in the RegexNER mapping file, or whether there is a better approach.
You should use these settings:
# run fine-grained NER with a custom rules file
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.fine.regexner.mapping custom.rules -file example.txt -outputFormat text
Make sure to set ner.fine.regexner.mapping to your custom rules file so that it is used instead of the default fine-grained rules, which assign labels such as CRIMINAL_CHARGE.

How to build entitymentions from tokens tagged by the `regexner` annotator?

This question is similar to Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP?
I have a set of TokensRegex rules that tag tokens with a different tag than the standard "LOCATION", "PERSON" etc.
The entitymentions annotator is very useful for multi-token named entities. How can I also build entitymentions for token sequences that are tagged by the regexner annotator? They don't appear to be built with standard settings.
I'm using CoreNLP 3.9.2 with the http API
Thanks for the help
Here is an example command:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules example.rules -file example.txt -outputFormat text
Some more info: the ner annotator runs a series of steps:
1. statistical NER
2. numeric sequences and SUTime
3. fine-grained NER (example: LOCATION --> STATE_OR_PROVINCE)
4. additional TokensRegexNER rules
5. additional TokensRegex rules
6. entity building
So after steps 1-5 have run, the entities are built, and the entity-building step will see the tags from your TokensRegex rules.
This is in the current GitHub code and version 3.9.2 (won't work with older versions).
More info here: https://stanfordnlp.github.io/CoreNLP/ner.html
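As a rough illustration of why the order matters, entity building can be thought of as grouping consecutive tokens that share the same non-background tag, whatever step assigned that tag. The sketch below is a simplified stand-in written for this explanation, not CoreNLP's implementation; build_mentions and the sample tags are hypothetical.

```python
from typing import List, Tuple


def build_mentions(tokens: List[str], tags: List[str],
                   background: str = "O") -> List[Tuple[str, str]]:
    """Group runs of adjacent tokens with the same non-background tag."""
    mentions = []
    current_words: List[str] = []
    current_tag = background
    for word, tag in zip(tokens, tags):
        if tag == current_tag and tag != background:
            # Same entity continues.
            current_words.append(word)
            continue
        if current_words:
            # The tag changed, so close the mention collected so far.
            mentions.append((" ".join(current_words), current_tag))
        current_words = [word] if tag != background else []
        current_tag = tag
    if current_words:
        mentions.append((" ".join(current_words), current_tag))
    return mentions


# Hypothetical tags: VEHICLE_TYPE stands in for a label set by a custom rule.
tokens = ["She", "drives", "a", "freight", "train", "in", "New", "York"]
tags = ["O", "O", "O", "VEHICLE_TYPE", "VEHICLE_TYPE", "O", "LOCATION", "LOCATION"]
mentions = build_mentions(tokens, tags)
```

Because the grouping only looks at the final tags, a custom tag is treated the same way as a standard one, provided the rules ran before the grouping step.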

Noun-mediated relationships not being found in OpenIE

I'm having difficulty extracting noun-mediated relationships as outlined in Angeli et al.
When I run OpenIE locally with the input "US president Barack Obama traveled to India on Monday" only two relationships are extracted:
(US president Barack Obama, traveled on, Monday)
(US president Barack Obama, traveled to, India)
Not found but expected: (Barack Obama, is president of, US)
However, when I run the same input at http://corenlp.run/, that third relationship looks to be extracted. Even more interestingly though, if I remove "Named Entities" as a possible annotator from corenlp.run, that third relationship is no longer found.
So I guess my question is: what is the proper configuration (versions, models, annotators...) needed to extract noun-mediated relationships? On my local machine I downloaded v3.6.0, compiled the latest source code from the master branch on GitHub, and then replaced stanford-corenlp-3.6.0.jar with the jar file I had compiled. I then ran the following command from within the v3.6.0 folder:
java -mx1g -cp "*" edu.stanford.nlp.naturalli.OpenIE -format ollie
Any help or insight would be a big help. Thanks so much!
So, the current heuristic in the OpenIE system is to extract these relationships only when named entity information is present (NER is disabled by default to improve speed); otherwise we drastically over-produce them. You can force-enable them with the flag -triple.all_nominals, but you've been warned :). The other easy option is to set the -resolve_coref flag, which will (1) run and resolve coreference when producing triples, and (2) implicitly run the NER annotator. The last option is to specify the annotators directly so that they include NER:
java -mx1g -cp "*" edu.stanford.nlp.naturalli.OpenIE -annotators "tokenize,ssplit,pos,lemma,depparse,ner,natlog,openie" -format ollie
Lastly, if you're using the 3.6.0 release, that's now fairly out of date. You're likely to get better results from the HEAD of the GitHub repo -- this is roughly what corenlp.run tracks.

Named entities: guidelines that pertain to titles of persons

I'm working on an annotation task of named entities in a text corpus. I found guidelines in the document 1999 Named Entity Recognition Task Definition. In that document, there are guidelines that pertain to titles of persons, in particular the following one: Titles such as “Mr.” and role names such as “President” are not considered part of a person name. For example, in “Mr. Harry Schearer” or “President Harry Schearer”, only Harry Schearer should be tagged as person.
In Stanford NER, though, there are many examples of titles being included in the person tag (Captain Weston, Mr. Perry, etc.). See the example gazette that they give. In their view of person tags, it seems that even “Mrs. and Miss Bates” should be tagged as a person.
Question: what is the most generally accepted guideline?
If you download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml
and run this command:
java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -ssplit.eolonly -annotators tokenize,ssplit,pos,lemma,ner -file ner_examples.txt -outputFormat text
(assuming you put some sample sentences, one sentence per line in ner_examples.txt)
the tagged tokens will be shown in: ner_examples.txt.out
You can try out some sentences and see how our current NER system handles different situations. This system is trained on data that does not have titles tagged as PERSON, so our current system in general does not tag the titles as PERSON.
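If an annotation source does include titles, the 1999 guideline quoted in the question could be applied as a post-processing step that drops leading titles from a person mention. The following is only a sketch under that assumption: the TITLES set is a hypothetical, incomplete list and strip_leading_titles is an illustrative helper, not part of any Stanford tool.

```python
from typing import List

# Hypothetical, incomplete list of titles and role names (lowercased).
TITLES = {"mr.", "mrs.", "ms.", "miss", "dr.", "president", "captain"}


def strip_leading_titles(mention_tokens: List[str]) -> List[str]:
    """Return the mention without any leading title or role-name tokens."""
    start = 0
    while start < len(mention_tokens) and mention_tokens[start].lower() in TITLES:
        start += 1
    return mention_tokens[start:]


with_title = strip_leading_titles(["Mr.", "Harry", "Schearer"])
with_role = strip_leading_titles(["President", "Harry", "Schearer"])
```

Both calls leave only the tokens Harry and Schearer, which matches the guideline's example.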

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below.
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem with the NER tagger, and if so, can it be handled?
How about:
1. Take your data and run it through Stanford NER or some other NER.
2. Look at the results and find all the mistakes.
3. Correctly tag the incorrect results and feed them back into your NER.
4. Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
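A minimal sketch of generating such made-up training examples follows. It assumes a tab-separated token/label layout with one token per line, and it assumes the hyphen is split off as its own token labelled O; both are assumptions that would have to match the trainer's expected input and the tokenization used at tagging time. The helper names are hypothetical.

```python
from typing import List, Tuple


def tag_hyphenated_pair(name_a: str, name_b: str,
                        before: List[str], after: List[str]) -> List[Tuple[str, str]]:
    """Label two person names joined by a hyphen, with surrounding context."""
    rows = [(word, "O") for word in before]
    rows += [(word, "PERSON") for word in name_a.split()]
    rows.append(("-", "O"))  # assumption: the hyphen is a separate token
    rows += [(word, "PERSON") for word in name_b.split()]
    rows += [(word, "O") for word in after]
    return rows


def to_training_lines(rows: List[Tuple[str, str]]) -> str:
    """Render one token and its label per line, separated by a tab."""
    return "\n".join(f"{word}\t{label}" for word, label in rows)


rows = tag_hyphenated_pair(
    "Ryan Fleck", "Anna Boden",
    before=["The", "film", "is", "directed", "by"],
    after=["pair", "."],
)
training_text = to_training_lines(rows)
```

Calling tag_hyphenated_pair with many invented name pairs and context templates would give a small corpus that covers the hyphenated pattern.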
You should get there by adding more features, more data and training.
Instead of using Stanford CoreNLP you could try Apache OpenNLP. It has an option to train your own model on your training data. Because the model depends on the names you supply, it is able to detect the names you are interested in.
