What should a gazetteer list include? - stanford-nlp

I am trying to extract locations from hotel reviews; by locations I mean hotel names, cities, neighbourhoods, POIs and countries. I am using a gazetteer list with 165,000 entities (this list doesn't have hotel names) marked as location.
I have sloppygazette turned on, but this gazette isn't helping much. I am confused about what I should include in the gazetteer list.
PS: I am a novice as far as NLP is concerned, so a little help about which features to use would be much appreciated.

Hi, there is new, more detailed documentation about the NER functionality here:
https://stanfordnlp.github.io/CoreNLP/ner.html
The rules format is one rule per line, with tab-separated columns (pattern, NER tag, tags that may be overwritten, priority):
Los Angeles CITY LOCATION,MISC 1.0
Great Wall Of China LANDMARK LOCATION,MISC 1.0
Some of the functionality is only available if you use the latest code from GitHub, but a lot is available in Stanford CoreNLP 3.9.1.
In short, the NER annotator runs these steps:
1. statistical NER models
2. rules for numeric sequences and SUTime (for times and dates)
3. rules for fine-grained NER (CITY, STATE_OR_PROVINCE, COUNTRY, etc.)
4. additional rules specified by the user (this is new and not currently available in 3.9.1)
5. build entity mentions (identify that the tokens "Los" and "Angeles" should be the entity "Los Angeles")
You can either download the code from GitHub and build the latest version, or you can just add your custom rules to the ner.fine.regexner annotator as described in the link above.
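For a gazetteer like the one in the question, one option is to convert the entries into the one-rule-per-line format shown above. The following is only a minimal sketch in Python, assuming tab-separated columns as described in the linked documentation; the function name, the sample entries and the HOTEL tag are made up for illustration.

```python
def to_regexner_rules(entries, overwritable=("LOCATION", "MISC"), priority=1.0):
    """Turn (phrase, tag) pairs into rule lines:
    phrase <TAB> tag <TAB> overwritable tags <TAB> priority."""
    lines = []
    for phrase, tag in entries:
        # Collapse stray whitespace so the phrase is a clean,
        # space-separated token sequence.
        pattern = " ".join(phrase.split())
        lines.append("\t".join([pattern, tag, ",".join(overwritable), str(priority)]))
    return lines


# Hypothetical gazetteer entries; the first two mirror the rules shown above.
gazetteer = [
    ("Los Angeles", "CITY"),
    ("Great Wall Of China", "LANDMARK"),
    ("Hotel  Example", "HOTEL"),
]
rules = to_regexner_rules(gazetteer)
print("\n".join(rules))
```

As far as I know, each space-separated token in the first column is treated as a regular expression, so entries containing characters such as parentheses or dots may need escaping before they are written to the rules file.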

Related

CoreNLP training dataset AnCora for Spanish language

I'm looking for the Spanish training dataset AnCora for CoreNLP, specifically IARG-AnCora Spanish (AnCora 3.0.1). The website requires registration. I created an account and tried to register on the website, but the account has never been activated. Any help would be appreciated. Thanks, Victor
There is info about training a dependency parser model, including where to find UD data here:
https://stanfordnlp.github.io/CoreNLP/depparse.html

Training a custom NER Model to identify entities

We are using the NER models to identify entities like org, percent, money, number, etc. We would like to add an entity (I don't think we can extend the models) or build another model to tag these entities (we are looking to classify financial securities).
I have just started looking at this and have only used the available models so far.
I am looking at https://nlp.stanford.edu/software/crf-faq.shtml#a
to get started. For custom models, are there sample data files I need to look at?
Does this still mean that the only entities that can be tagged are the already available ones like organization, date, money, location?
Are there any changes one needs to make to the Java files, i.e. which ones should I start with to understand how the classifier works?
Basically for some text like :
2.200% Notes due October 30, 2020 the principal amount $ 1,500,000,000.00 $ 186,750.00
I'd like to tag:
<security>2.200% Notes due October 30, 2020</security> the principal amount $ 1,500,000,000.00 $ 186,750.00
You can train a new sequence tagger with the following format:
Joe PERSON
Smith PERSON
was O
born O
in O
California LOCATION
. O
He O
works O
for O
Apple ORGANIZATION
. O
Note that a tab character (\t) should separate the token from the tag. You can use any tags you want. The statistical tagger will then be able to apply the tags it saw in the training data.
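If your annotations are stored as token spans, a small script can produce that token/tag layout. This is just a sketch under assumptions: the helper name, the SECURITY tag and the hand-made tokenization are mine, not part of Stanford NER.

```python
def to_training_lines(tokens, spans, default_tag="O"):
    """Label each token with the tag of the span covering it, else default_tag.

    spans is a list of (start, end, tag) token indices, with end exclusive.
    """
    tags = [default_tag] * len(tokens)
    for start, end, tag in spans:
        for i in range(start, end):
            tags[i] = tag
    # One "token<TAB>tag" line per token.
    return [f"{token}\t{tag}" for token, tag in zip(tokens, tags)]


# Hand-tokenized version of the example from the question (an assumption);
# a real pipeline should use the same tokenizer for training and tagging.
tokens = ["2.200", "%", "Notes", "due", "October", "30", ",", "2020",
          "the", "principal", "amount", "$", "1,500,000,000.00"]
lines = to_training_lines(tokens, [(0, 8, "SECURITY")])
print("\n".join(lines))
```

The first eight tokens come out tagged SECURITY and the rest O, which is the shape of data the tagger would need to learn the new entity type.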
You can see the full details of the properties file you should use if you look at this file in the models jar:
edu/stanford/nlp/models/ner/english.all.3class.distsim.prop
I should note that if what you're trying to extract follows a few basic patterns, you're probably going to get better results with a rule-based approach.
Here is some documentation on rule-based approaches in Stanford CoreNLP:
https://nlp.stanford.edu/software/tokensregex.html
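To give a feel for what a rule-based approach could look like on the example from the question, here is a rough sketch. Note that it uses Python's standard re module on raw text, not TokensRegex, and the pattern is only an assumption about what such securities look like, based on the single sample given.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Assumed shape: "<rate>% Notes due <Month> <day>, <year>".
SECURITY_PATTERN = re.compile(
    r"\d+(?:\.\d+)?%\s+Notes\s+due\s+(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}"
)


def tag_securities(text):
    """Wrap every match of the assumed pattern in <security> tags."""
    return SECURITY_PATTERN.sub(
        lambda m: f"<security>{m.group(0)}</security>", text
    )


text = ("2.200% Notes due October 30, 2020 the principal amount "
        "$ 1,500,000,000.00 $ 186,750.00")
tagged = tag_securities(text)
print(tagged)
```

A TokensRegex rule would express the same idea over tokens and their annotations instead of characters, which tends to be more robust once the patterns get more varied.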

CoreNLP's GenderAnnotation is unable to label names written in proper format

Given the name "David" presented in three different ways ("DAVID david David"), CoreNLP is only able to mark #1 and #2 as MALE despite the fact that #3 is the only one marked as a PERSON. I'm using the standard model provided originally and I attempted to implement the suggestions listed here but 'gender' is not allowed before NER anymore. My test is below with the same results in both Java and Jython (Word, Gender, NER Tag):
DAVID, MALE, O
david, MALE, O
David, None, PERSON
This is a bug in Stanford CoreNLP 3.8.0.
I have made some modifications to the GenderAnnotator and submitted them. They are available now on GitHub. I am still working on this, so probably over the next day or so there will be further changes, but I think this bug is fixed now. You will also need the latest version of the models jar which was just updated that contains the name lists. I believe shortly I will build another models jar with larger name lists.
The new version of GenderAnnotator requires the entitymentions annotator to be used. Also, the new version logs the gender of both the CoreMap for the entity mention and for each token of the entity mention.
You can learn how to work with the latest version of Stanford CoreNLP off of GitHub here: https://stanfordnlp.github.io/CoreNLP/download.html

Stanford core NLP models for English language

I am using Stanford CoreNLP for a task. There are two models, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know what the difference between these two models is.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English, which is the stanford-corenlp-3.6.0-models you mentioned.
However, the CoreNLP team also provides a jar that contains all of their English models, which includes various variant models, and in particular has one optimized for working with uncased English (e.g., mostly or all either uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models and the previous one is the stanford-english-corenlp-2016-01-10-models you mentioned.
Reference:
http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems
(the Stanford CoreNLP Overview page)

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below:
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem with the NER tagger, and if so, can it be handled?
How about:
1. Take your data and run it through Stanford NER or some other NER.
2. Look at the results and find all the mistakes.
3. Correctly tag the incorrect results and feed them back into your NER.
4. Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
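Making up such examples could be scripted along these lines. This is only a sketch: the templates and names are invented, the output is one token and tag per line separated by a tab, and it assumes your tokenizer emits the hyphen as its own token (if it keeps "Fleck-Anna" as a single token, this labeling will not line up).

```python
import random

# Made-up sentence templates; <A> and <B> are slots for full names.
TEMPLATES = [
    ["The", "film", "is", "directed", "by", "<A>", "-", "<B>", "pair", "."],
    ["<A>", "-", "<B>", "wrote", "the", "script", "."],
]


def make_example(template, name_a, name_b):
    """Fill a template and return "token<TAB>tag" lines for one sentence."""
    lines = []
    for slot in template:
        if slot in ("<A>", "<B>"):
            name = name_a if slot == "<A>" else name_b
            # Every part of a name is tagged PERSON.
            lines.extend(f"{part}\tPERSON" for part in name.split())
        else:
            # Everything else, including the hyphen, is tagged O.
            lines.append(f"{slot}\tO")
    return lines


rng = random.Random(0)  # fixed seed so runs are repeatable
names = ["Ryan Fleck", "Anna Boden", "Joe Smith", "Maria Lopez"]  # made up
examples = []
for _ in range(4):
    a, b = rng.sample(names, 2)
    examples.append(make_example(rng.choice(TEMPLATES), a, b))

fixed = make_example(TEMPLATES[0], "Ryan Fleck", "Anna Boden")
print("\n".join(fixed))
```

With a larger list of names and more varied templates, the generated sentences can be appended to the existing training data before retraining.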
You should get there by adding more features, more data and training.
Instead of using Stanford CoreNLP you could try Apache OpenNLP. There is an option to train your own model on your training data. As this model depends on the names you supply, it is able to detect the names you are interested in.
