CoreNLP training dataset AnCora for Spanish language - stanford-nlp

I'm looking for the Spanish training dataset AnCora for CoreNLP, specifically IARG-AnCora Spanish (AnCora 3.0.1). The website requires registration. I created an account and tried to register on the website, but the account has never been activated. Any help would be appreciated. Thanks, Victor

There is info about training a dependency parser model, including where to find UD data here:
https://stanfordnlp.github.io/CoreNLP/depparse.html
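If you do get hold of the UD version of AnCora (it is also distributed through the Universal Dependencies site in CoNLL-U format), training your own model goes through edu.stanford.nlp.parser.nndep.DependencyParser. A minimal sketch, following the flags described on the depparse page above; all file names (treebank splits, embeddings, output model) are placeholders:

// Minimal sketch: train the neural dependency parser on a UD-formatted treebank.
// All file names below are placeholders, not official paths.
import edu.stanford.nlp.parser.nndep.DependencyParser;

public class TrainSpanishDepparse {
  public static void main(String[] args) throws Exception {
    DependencyParser.main(new String[] {
        "-trainFile", "es_ancora-ud-train.conllu",  // UD training split
        "-devFile",   "es_ancora-ud-dev.conllu",    // UD dev split, used during training
        "-embedFile", "spanish-embeddings.txt",     // word embeddings in word2vec text format
        "-embeddingSize", "100",
        "-model",     "nndep.es.ud.model.txt.gz"    // where the trained model is written
    });
  }
}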

Related

What should a gazetteer list include?

I am trying to extract locations from hotel reviews; by locations I mean hotel names, cities, neighbourhoods, POIs, and countries. I am using a gazetteer list with 165,000 entities (the list doesn't have hotel names) marked as LOCATION.
I have sloppyGazette turned on, but this gazette isn't helping much. I am confused about what I should include in the gazetteer list.
PS: I am a novice as far as NLP is concerned, so a little help about which features to use would be much appreciated.
Hi, there is new, more detailed documentation about the NER functionality here:
https://stanfordnlp.github.io/CoreNLP/ner.html
The rules format is one rule per line:
Los Angeles CITY LOCATION,MISC 1.0
Great Wall Of China LANDMARK LOCATION,MISC 1.0
Some of the functionality is only available if you use the latest code from GitHub, but a lot is available in Stanford CoreNLP 3.9.1.
In short, the NER annotator runs these steps:
statistical NER models
rules for numeric sequences and SUTime (for times and dates)
rules for fine grained NER (CITY, STATE_OR_PROVINCE, COUNTRY, etc...)
additional rules specified by user (this is new and not currently available in 3.9.1)
build entity mentions (identify that the tokens "Los" and "Angeles" should be the entity "Los Angeles")
You can either download the code from GitHub and build the latest version, or you can just add your custom rules to the ner.fine.regexner annotator as described in the link above.
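For example, assuming CoreNLP 3.9.1 or later, wiring a custom rules file into the fine-grained NER step can be sketched like this. Here custom.rules is a placeholder file in the one-rule-per-line format shown above, and note that setting ner.fine.regexner.mapping this way replaces the default mapping list rather than extending it (see the NER page for how to combine files):

// Minimal sketch: run NER with a custom fine-grained rules file (placeholder name).
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CustomNerRules {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    // point the fine-grained NER step at a custom rules file
    props.setProperty("ner.fine.regexner.mapping", "custom.rules");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = new CoreDocument("We flew from Los Angeles to the Great Wall Of China.");
    pipeline.annotate(doc);
    // print each detected entity mention with the type it was assigned
    doc.entityMentions().forEach(m -> System.out.println(m.text() + "\t" + m.entityType()));
  }
}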

With which treebank are the available StanfordCoreNLP French models trained?

As per the title of this post, I would like as much information as possible about the dataset used to train the StanfordCoreNLP French models that are made available on this page (https://stanfordnlp.github.io/CoreNLP/history.html). My ultimate aim is to know the set of tags I can expect to be output by the Stanford CoreNLP tool when using it to characterize a text written in French. I was told that a model is trained using a treebank. For the French language, there are six of them (http://universaldependencies.org/, section for the French language):
- FTB
- Original
- Sequoia
- ParTUT
- PUD
- Spoken
So I would like to know which of them was used to train which French model.
I first asked this question on the mailing list dedicated to Java NLP users (java-nlp-user@lists.stanford.edu), but to no avail so far.
So, again, assuming it is one of the treebanks described above that was indeed used to train the Stanford CoreNLP French models available at the link posted above, which one is it? Alternatively, who (name and surname) would know the answer to this question, if no one here knows?
For all who are curious about this, here is some info about the datasets used for French in Stanford CoreNLP:
French POS tagger: CC (Crabbe and Candito) modified French Treebank
French POS tagger (UD version): UD 1.3
French Constituency Parser: CC modified French Treebank
French NN Dependency Parser: UD 1.3
Also note that the French constituency parser cannot translate constituency parses into dependency parses the way the English constituency parser can.
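A quick way to see the tag set empirically is to run the shipped French pipeline and print the POS tags it emits. A minimal sketch, assuming a recent CoreNLP (3.9+ for the CoreDocument API) and the French models jar, which contains StanfordCoreNLP-french.properties, on the classpath:

// Minimal sketch: print the POS tags the default French pipeline assigns.
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class FrenchTags {
  public static void main(String[] args) {
    // load the default French configuration from the models jar
    StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-french.properties");
    CoreDocument doc = new CoreDocument("Le chat dort sur le canapé.");
    pipeline.annotate(doc);
    // print each token with the tag the French tagger assigns
    doc.tokens().forEach(t -> System.out.println(t.word() + "\t" + t.tag()));
  }
}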

Stanford CoreNLP models for the English language

I am using Stanford CoreNLP for a task. There are two model packages, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know what the difference between these two is.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English, which is the stanford-corenlp-3.6.0-models you mentioned.
But the CoreNLP team also provides a jar that contains all of their English models, which includes various variant models, and in particular has one optimized for working with uncased English (e.g., text that is mostly or all uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models, and the previous one is the stanford-english-corenlp-2016-01-10-models you mentioned.
Reference:
http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems
(the Stanford CoreNLP Overview page)
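If what you are after in the bigger jar are the caseless models, a minimal sketch of pointing the pipeline at them could look like the following. The exact model resource paths below are assumptions based on the caseless-models documentation and should be verified against the contents of the English models jar you actually download:

// Minimal sketch: use the caseless English models from the big English models jar.
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CaselessPipeline {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    // caseless POS and NER models shipped in the English models jar;
    // verify these resource paths against the jar version you are using
    props.setProperty("pos.model",
        "edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger");
    props.setProperty("ner.model",
        "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // then annotate mostly lower-cased or ALL-CAPS text with pipeline.annotate(...) as usual
  }
}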

Gazettes with Stanford NER

I am building my own Stanford NER model, which is CRF-based, by following the conventions given at this link. I want to add gazettes and am following this from the same link. I am listing all of my gazettes with the property gazette=file1.txt;file2.txt and also setting useGazettes=true in austen.prop. After building the model, when I test data from my gazettes it is not tagged correctly; the tags I specified in those files do not come out. These results are a little surprising to me, as Stanford NER is not giving the same tags as mentioned in those files.
Are there some limitations of Stanford NER with gazettes, or am I still missing something? If somebody can help me, I will be thankful.
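Not a full answer, but one thing worth double-checking is how the gazette flags are combined. A minimal sketch of a programmatic CRF training setup with gazette features enabled (file names are placeholders; per NERFeatureFactory, sloppyGazette fires a feature when a single token matches a gazette entry, while cleanGazette only fires on a full-phrase match):

// Minimal sketch: train a CRF NER model with gazette features (placeholder file names).
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class TrainNerWithGazettes {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("trainFile", "ner-train.tsv");   // placeholder: token and answer columns
    props.setProperty("map", "word=0,answer=1");       // column layout of the training file
    // gazette features (flag names from NERFeatureFactory)
    props.setProperty("useGazettes", "true");
    props.setProperty("gazette", "file1.txt;file2.txt");
    props.setProperty("sloppyGazette", "true");        // token-level matches; cleanGazette is phrase-level
    CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
    crf.train();                                       // trains from the trainFile flag above
    crf.serializeClassifier("my-ner-model.ser.gz");    // placeholder output path
  }
}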

Training Stanford CoreNLP co-reference

I would like to use the Stanford CoreNLP library to do coreference resolution in Dutch.
My question is: how do I train CoreNLP to handle Dutch coreference resolution?
We've already created a Dutch NER model based on the 'conll2002' set (https://github.com/WillemJan/Stanford_ner_bugreport/raw/master/dutch.gz), but we would also like to use the co-referencing module in the same way.
Look at the class edu.stanford.nlp.scoref.StatisticalCorefTrainer.
The appropriate properties file for English is in:
edu/stanford/nlp/scoref/properties/scoref-train-conll.properties
You may have to get the latest code base from GitHub:
https://github.com/stanfordnlp/CoreNLP
While we are not currently supporting training of the statistical coreference models in the toolkit, I do believe the code for training them is included and it is certainly possible it works right now. I have yet to verify if it is functioning properly.
Please let me know if you need any more assistance. If you encounter bugs I can try to fix them...we would definitely like to get the statistical coreference training operational for future releases!

Resources