Noun-mediated relationships not being found in OpenIE - stanford-nlp

I'm having difficulty extracting noun-mediated relationships as outlined in Angeli et al.
When I run OpenIE locally with the input "US president Barack Obama traveled to India on Monday" only two relationships are extracted:
(US president Barack Obama, traveled on, Monday)
(US president Barack Obama, traveled to, India)
Not found but expected: (Barack Obama, is president of, US)
However, when I run the same input at http://corenlp.run/, that third relationship looks to be extracted. Even more interestingly though, if I remove "Named Entities" as a possible annotator from corenlp.run, that third relationship is no longer found.
So my question is: what is the proper configuration (versions, models, annotators...) needed to properly extract noun-mediated relationships? On my local machine I downloaded v3.6.0, compiled the latest source code from the master branch on GitHub, and then replaced stanford-corenlp-3.6.0.jar with the newly compiled jar file. I then ran the following command from within the v3.6.0 folder:
java -mx1g -cp "*" edu.stanford.nlp.naturalli.OpenIE -format ollie
Any help or insight would be a big help. Thanks so much!

So, the current heuristic in the OpenIE system is to extract these relationships only when named entity information is present (and NER is disabled by default to improve speed), because otherwise we drastically over-produce them. You can force-enable them with the flag -triple.all_nominals, but you've been warned :). The other easy option is to set the -resolve_coref flag, which will (1) run coreference resolution when producing triples, but also (2) implicitly run the NER annotator. The last option is to specify the annotators directly and include NER:
java -mx1g -cp "*" edu.stanford.nlp.naturalli.OpenIE -annotators "tokenize,ssplit,pos,lemma,depparse,ner,natlog,openie" -format ollie
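For reference, the same three options can be expressed as pipeline properties. This is only a sketch of how the property map would be assembled; the property names mirror the flags above (as documented for the openie annotator), and the pipeline itself is not run here:

```python
# Sketch: the three ways to enable noun-mediated triples, written as
# CoreNLP pipeline properties. This helper only assembles the property
# map; it does not run the Java pipeline.
def openie_props(option):
    props = {"annotators": "tokenize,ssplit,pos,lemma,depparse,natlog,openie"}
    if option == "all_nominals":
        # Force-enable nominal triples (warning: over-produces).
        props["openie.triple.all_nominals"] = "true"
    elif option == "resolve_coref":
        # Resolves coreference; implicitly runs the NER annotator too.
        props["openie.resolve_coref"] = "true"
    elif option == "ner":
        # Run NER explicitly so the nominal heuristics can fire.
        props["annotators"] = "tokenize,ssplit,pos,lemma,depparse,ner,natlog,openie"
    return props
```

Any of these property maps could then be passed to a StanfordCoreNLP pipeline (or translated back into command-line flags).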
Lastly, if you're using the 3.6.0 release, that's now fairly out of date. You're likely to get better results from the HEAD of the GitHub repo -- this is roughly what corenlp.run tracks.

Related

next release of Stanza

I'm interested in the Stanza constituency parser for Italian.
In https://stanfordnlp.github.io/stanza/constituency.html it says that a new release with updated models (including an Italian model trained on the Turin treebank) should have been available in mid-November.
Any idea about when the next release of Stanza will appear?
Thanks
alberto
Technically you can already get it! If you install the dev branch of stanza, you should be able to download an IT parser.
pip install git+git://github.com/stanfordnlp/stanza.git#704d90df2418ee199d83c92c16de180aacccf5c0
stanza.download("it")
It's trained on the Turin treebank, which has about 4000 trees. If you download the Bert version of the model, it gets over 91 F1 on the Evalita test set (but has a length limit of about 200 words per sentence).
We might splurge on getting the VIT treebank or something. I've been agitating that we use that budget on Danish or PT or some other language where we have very few users, but it's a hard sell...
Edit: there are also some scripts included for converting the publicly available Turin trees into brackets. Their MWT annotation style was to repeat the MWT twice in a row, which doesn't work too well for a task like parsing raw text.
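The bracketed output is plain PTB-style S-expressions, so a tiny checker like the following (not part of stanza, just an illustrative sketch) can sanity-check a converted file before training:

```python
# Sanity check for a string of PTB-style bracketed trees:
# every tree must have balanced parentheses, and each time the
# nesting depth returns to zero we have finished one tree.
def count_balanced_trees(text):
    depth, trees = 0, 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced brackets")
            if depth == 0:
                trees += 1
    if depth != 0:
        raise ValueError("unbalanced brackets")
    return trees
```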
It is still very much a live task ... either December or January, I would say.
p.s. This isn't really a great SO question....

LUIS built-in geography type sometimes recognizes a city, but other times doesn't

I'm a bit confused. I'm using LUIS's built-in geographyV2 type.
My utterances are things like "are there any part time cashier positions near houston?" (recognized) or "do you have any part time cashier jobs within 10 miles of houston?" (not recognized).
If I hover over the unrecognized instance of "houston," I don't have the option to tag it as a geographyV2 instance (if I try "browse pre-built entities", it doesn't show geographyV2, I guess since that is already one of my types).
Is there any way I can train it better to recognize houston in the 2nd case?
Seems like some cities don't get picked up at all, while others are detected without a problem.
If you have any tips, please let me know. This is the first time I've used LUIS. Overall, I'm very impressed!
Thanks
Updates based on Steven's suggestions:
Now I'm able to get Anchorage and Houston recognized. But this introduces a problem with Los Angeles, which is getting extracted as two entities.
Similar issue for St. Louis (it wants to tokenize "St" and "Louis" separately).
Sorry for being such a n00b :-)
I have a somewhat similar issue. I have the utterance:
"what is the price of diesel in Latin America"
Latin America is not tagged as geographyV2!
I tried Asia and same thing!
I tried North America, South America, South Africa, Middle East and those worked and were tagged!
I wondered - why the inconsistency?
I looked over the docs and here is a suggestion:
The behavior of prebuilt entities can't be modified but you can improve resolution by adding the prebuilt entity as a feature to a machine-learning entity or sub-entity.
Here is the link: LUIS DOC
Here is what I have come up with to resolve it.

CoreNLP's GenderAnnotation is unable to label names written in proper format

Given the name "David" presented in three different ways ("DAVID david David"), CoreNLP is only able to mark #1 and #2 as MALE despite the fact that #3 is the only one marked as a PERSON. I'm using the standard model provided originally and I attempted to implement the suggestions listed here but 'gender' is not allowed before NER anymore. My test is below with the same results in both Java and Jython (Word, Gender, NER Tag):
DAVID, MALE, O
david, MALE, O
David, None, PERSON
This is a bug in Stanford CoreNLP 3.8.0.
I have made some modifications to the GenderAnnotator and submitted them; they are available now on GitHub. I am still working on this, so there will probably be further changes over the next day or so, but I think this bug is fixed now. You will also need the latest version of the models jar, which was just updated to contain the name lists. I will shortly build another models jar with larger name lists.
The new version of GenderAnnotator requires the entitymentions annotator to be used. Also, the new version logs the gender of both the CoreMap for the entity mention and for each token of the entity mention.
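If you configure the pipeline through a properties file, the annotator list would look something like this (a sketch only; the key point from above is that entitymentions runs after ner and before gender):

```
annotators = tokenize, ssplit, pos, lemma, ner, entitymentions, gender
```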
You can learn how to work with the latest version of Stanford CoreNLP off of GitHub here: https://stanfordnlp.github.io/CoreNLP/download.html

Named entities: guidelines that pertain to titles of persons

I'm working on an annotation task of named entities in a text corpus. I found guidelines in the document 1999 Named Entity Recognition Task Definition. In that document, there are guidelines that pertain to titles of persons, in particular the following one: Titles such as “Mr.” and role names such as “President” are not considered part of a person name. For example, in “Mr. Harry Schearer” or “President Harry Schearer”, only Harry Schearer should be tagged as person.
In the Stanford NER though, there are many examples of including titles in the person tag (Captain Weston, Mr. Perry, etc). See here an example of gazette that they give. In their view of person tags, it seems that even “Mrs. and Miss Bates” should be tagged as a person.
Question: what is the most generally accepted guideline?
If you download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml
and run this command:
java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -ssplit.eolonly -annotators tokenize,ssplit,pos,lemma,ner -file ner_examples.txt -outputFormat text
(assuming you put some sample sentences, one sentence per line in ner_examples.txt)
the tagged tokens will be shown in: ner_examples.txt.out
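For example, ner_examples.txt might contain sentences like these (made up here to exercise the title cases from the question, one per line):

```
President Harry Schearer gave a speech.
Mr. Perry visited Captain Weston.
```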
You can try out some sentences and see how our current NER system handles different situations. This system is trained on data that does not have titles tagged as PERSON, so our current system in general does not tag the titles as PERSON.

Extending Stanford NER terms with new terms

We need to add terms to the named entity extraction tables/model in Stanford CoreNLP and can't figure out how. Use case: we need to build up a set of IED terms over time and want the Stanford pipeline to extract the terms when found in text files.
Looking to see if this is something someone has done before.
Please take a look at http://nlp.stanford.edu/software/regexner/ to see how to use it. It allows you to specify a file of mappings of phrases to entity types. When you want to update the mappings, you update the file and rerun the Stanford pipeline.
If you are interested in how to actually learn patterns for the terms over time, you can take a look at our pattern learning system: http://nlp.stanford.edu/software/patternslearning.shtml
Could you specify the tags you want to apply?
To use RegexNER, all you have to do is build a file with one entry per line of the form:
TEXT_PATTERN\tTAG
You would put all of the things you want in your custom dictionary into a file, say custom_dictionary.txt
I am assuming by IED you mean https://en.wikipedia.org/wiki/Improvised_explosive_device?
So your file might look like:
VBIED\tIED_TERM
sticky bombs\tIED_TERM
RCIED\tIED_TERM
New Country\tLOCATION
New Person\tPERSON
(Note: Stack Overflow has some strange formatting; there should not be blank lines between entries, it should be one entry per line!)
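To avoid the blank-line and tab pitfalls entirely, you could generate the mapping file programmatically. This is just a sketch; the file name and tags are the ones used in this answer:

```python
# Write a RegexNER mapping file: one "pattern<TAB>tag" entry per line,
# with a literal tab separator and no blank lines.
entries = [
    ("VBIED", "IED_TERM"),
    ("sticky bombs", "IED_TERM"),
    ("RCIED", "IED_TERM"),
    ("New Country", "LOCATION"),
    ("New Person", "PERSON"),
]

def write_mapping(path, entries):
    with open(path, "w", encoding="utf-8") as f:
        for pattern, tag in entries:
            f.write(f"{pattern}\t{tag}\n")

write_mapping("custom_dictionary.txt", entries)
```

Updating the dictionary is then just a matter of editing the entries list and rerunning the script.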
If you then run this command:
java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner,ner' -file sample_input.txt -regexner.mapping custom_dictionary.txt
you will tag sample_input.txt
Updating is merely a matter of updating custom_dictionary.txt
One thing to be on the lookout for: it matters whether you put "ner" first or "regexner" first in your list of annotators.
If your highest priority is tagging with your specialized terms (for instance IED_TERM), I would run regexner first in the pipeline, since there are some tricky issues with how the taggers overwrite each other.
