I would like to use the Stanford CoreNLP library to do coreference resolution in Dutch.
My question is: how do I train CoreNLP to handle Dutch coreference resolution?
We've already created a Dutch NER model based on the CoNLL-2002 dataset (https://github.com/WillemJan/Stanford_ner_bugreport/raw/master/dutch.gz), and we would like to use the coreference module in the same way.
Look at the class edu.stanford.nlp.scoref.StatisticalCorefTrainer.
The appropriate properties file for English is in:
edu/stanford/nlp/scoref/properties/scoref-train-conll.properties
You may have to get the latest code base from GitHub:
https://github.com/stanfordnlp/CoreNLP
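For concreteness, here is a minimal sketch of how that trainer might be invoked from Python by shelling out to Java. The jar names, memory setting, and the Dutch properties file (a hypothetical copy of the English scoref-train-conll.properties, adapted for Dutch data) are assumptions, not tested values:

import subprocess

# Invoke the statistical coreference trainer with a properties file
# adapted from the English scoref-train-conll.properties (hypothetical).
subprocess.run([
    "java", "-Xmx8g",
    "-cp", "stanford-corenlp.jar:stanford-corenlp-models.jar",
    "edu.stanford.nlp.scoref.StatisticalCorefTrainer",
    "-props", "scoref-train-dutch.properties",
], check=True)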
While we are not currently supporting training of the statistical coreference models in the toolkit, the code for training them is included, and it is certainly possible that it works right now; I have yet to verify that it is functioning properly.
Please let me know if you need any more assistance. If you encounter bugs, I can try to fix them; we would definitely like to get statistical coreference training operational for future releases!
When trying to load the fastText model (cc.nl.300.bin) in gensim, I get the following error:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz
!gunzip cc.nl.300.bin.gz

from gensim.models import FastText as FastText_gensim

model = FastText_gensim.load_fasttext_format('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
The code goes wrong when building the vocab with my own dataset. The format of that dataset is all right, as I already used it to build and train other (not pre-trained) Word2Vec and FastText models.
I saw that others had the same error in this GitHub issue, but their solution did not work for me: https://github.com/RaRe-Technologies/gensim/issues/2588
Also, I read somewhere that I should use 'load_facebook_model', but I was not able to import load_facebook_model at all. Is this even a good way to solve the problem?
Any other suggestions?
Are you sure you're using the latest version of gensim, 4.0.1, which includes many improvements to the FastText implementation?
There, you will definitely want to use .load_facebook_model() to load a full .bin Facebook-format model:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
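A minimal sketch of that path, assuming gensim 4.0.1 and that cleaned_text is a list of tokenized sentences (the same data used above):

from gensim.models.fasttext import load_facebook_model

# Load the full Facebook-format model; unlike the deprecated
# load_fasttext_format, this keeps the weights needed to continue training.
model = load_facebook_model('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)
model.train(cleaned_text, total_examples=len(cleaned_text), epochs=model.epochs)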
But also note: the post-training expansion of the vocabulary is best considered an advanced and experimental function. It may not offer any improvement on typical tasks; indeed, without careful consideration of the tradeoffs, and without balancing the influence of later training against earlier training, it can make things worse.
A FastText model trained on a large, diverse corpus may already be able to synthesize better-than-nothing guess vectors for out-of-vocabulary words, via its subword vectors.
If there's some data with very-different words & word-senses you need to integrate, it will often be better to re-train from scratch, using an equal combination of all desired text influences. Then you'll be doing things in a standard and balanced way, without harder-to-tune and harder-to-evaluate improvised changes to usual practice.
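As a sketch of that from-scratch route in gensim 4.x (original_corpus is a placeholder for whatever other text you want to blend in; both corpora are lists of token lists):

from gensim.models import FastText

# Train a single model on the combined corpus rather than patching
# a pre-trained one after the fact.
combined = list(original_corpus) + list(cleaned_text)
model = FastText(vector_size=300, window=5, min_count=5)
model.build_vocab(combined)
model.train(combined, total_examples=len(combined), epochs=5)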
I was using Stanford OpenIE for my professor on a research project.
I can successfully extract the triples using the OpenIE annotator from the Stanford NLP server.
However, the confidence score was not returned with the requested JSON, as shown on the website:
https://nlp.stanford.edu/software/openie.html
Apparently this has not been implemented yet by the Stanford team.
Does anyone have a solution to this problem, or an alternative Python library I can use to extract both the expected output and its confidence level from Stanford OpenIE?
The text output has the confidences. We can add the confidences to the JSON in future versions.
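In the meantime, the plain-text output format already carries the scores. A minimal sketch against a locally running CoreNLP server (the host, port, and example sentence are assumptions):

import requests

# Request OpenIE triples in text format from a running CoreNLP server;
# each extraction line is prefixed with its confidence score.
props = '{"annotators": "openie", "outputFormat": "text"}'
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": props},
    data="Barack Obama was born in Hawaii.".encode("utf-8"),
)
print(resp.text)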
I am using Stanford CoreNLP for a task. There are two model packages, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know the difference between these two models.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English; this is the stanford-corenlp-3.6.0-models package you mentioned.
But the CoreNLP team also provides a jar that contains all of their English models, which includes various variant models, and in particular has one optimized for working with uncased English (e.g., text that is mostly or entirely uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models, and the previous one is the stanford-english-corenlp-2016-01-10-models you mentioned.
Reference: http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems (the Stanford CoreNLP Overview page)
I'm working on entity extraction for one of my projects and came across CoreNLP. The demo works pretty well, but I can't seem to find any documentation on the entitylink/Wikipedia annotator. Does anyone have any sources on what techniques and data were used for these models?
This is based on Angel Chang's Wikidict resource: http://nlp.stanford.edu/pubs/crosswikis.pdf, albeit munged a fair bit to allow it to be loaded into memory.
I am building my own Stanford NER model, which is CRF-based, by following the conventions given at this link. I want to add gazettes and am following the instructions from the same link. I am listing all of my gazettes with the property gazette=file1.txt;file2.txt and also setting useGazettes=true in austen.prop. After building the model, when I test on data from my gazettes, it is not tagged correctly: the tag I specified in those files does not come out. These results are a little surprising to me, as Stanford NER is not assigning the same tag as given in the files.
Are there limitations to Stanford NER's gazette support, or am I still missing something? If somebody can help me, I will be thankful.
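For reference, the gazette-related lines in the training .prop file would look something like the sketch below. The cleanGazette/sloppyGazette flags come from the Stanford NER documentation; the values here are illustrative. Clean matching only fires when an entire gazette entry matches exactly, so sloppyGazette is worth trying if your test phrases only partially overlap the entries:

# Gazette settings in austen.prop (illustrative values)
gazette = file1.txt;file2.txt
useGazettes = true
cleanGazette = false
sloppyGazette = true

Note also that gazette matches are only features fed to the CRF, not hard tagging rules, so an entry appearing in a gazette does not guarantee it will receive the gazette's label.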