I am building my own CRF-based Stanford NER model, following the conventions given at this link. I also want to add gazettes, following the instructions from the same link. I list all of my gazettes with the property gazette=file1.txt;file2.txt and also set useGazettes=true in austen.prop. But after building the model, when I test on data taken from my gazettes, it is not tagged correctly: the tag I specified in the gazette files does not come out. These results are a little surprising to me, since Stanford NER is not giving the entities the same tag as the one listed in those files.
Are there limitations to how Stanford NER handles gazettes, or am I still missing something? Any help would be much appreciated.
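For reference, my setup looks roughly like this (the file names are placeholders, and cleanGazette/sloppyGazette are the flags I understand to control exact vs. partial matching):

```
# austen.prop (gazette-related lines only)
useGazettes = true
# cleanGazette fires only when the full gazette phrase matches exactly;
# sloppyGazette also fires on partial matches
cleanGazette = true
gazette = file1.txt;file2.txt
```

Each line of my gazette files has the class followed by the phrase, e.g. PERSON John Smith. As I understand it, gazette entries act as extra features for the CRF rather than hard tagging rules, so the model can in principle still override them.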
I was using Stanford OpenIE for my professor on a research project.
I can successfully extract triples by using the OpenIE annotator from the Stanford CoreNLP server.
However, the confidence scores were not returned with the requested JSON, even though they are shown on the website
https://nlp.stanford.edu/software/openie.html.
Apparently this has not yet been implemented by the Stanford team.
Does anyone have a solution to this problem, or an alternative Python library I can use to extract both the expected output and its confidence level from Stanford OpenIE?
The text output has the confidences. We can add the confidences to the JSON output in a future version.
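In the meantime, if you can call the Java API directly, each extracted triple carries its confidence. A minimal sketch along the lines of the standard OpenIE demo (verify the class and annotation names against your CoreNLP version):

```java
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Collection;
import java.util.Properties;

public class OpenIEConfidence {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("Obama was born in Hawaii.");
    pipeline.annotate(doc);

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      Collection<RelationTriple> triples =
          sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
      for (RelationTriple triple : triples) {
        // triple.confidence is the extractor's confidence for this triple
        System.out.printf("%.3f\t%s\t%s\t%s%n",
            triple.confidence,
            triple.subjectLemmaGloss(),
            triple.relationLemmaGloss(),
            triple.objectLemmaGloss());
      }
    }
  }
}
```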
I am using Stanford CoreNLP for a task. There are two model jars, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know what the difference between these two models is.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English; that is the stanford-corenlp-3.6.0-models jar you mentioned.
But the CoreNLP team also provides a jar that contains all of their English models, which includes various variant models, and in particular has one optimized for working with uncased English (e.g., text that is mostly or entirely uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models, and the previous one is the stanford-english-corenlp-2016-01-10-models you mentioned.
Reference:
http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems
(the Stanford CoreNLP Overview page)
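If you do use the bigger jar, the variant models are selected with properties pointing at resources inside it. A minimal sketch in Java (the caseless model paths below are assumptions based on the usual layout of the English models jar; list the jar's contents to confirm them for your version):

```java
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class CaselessPipeline {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    // Paths are assumptions; run `jar tf stanford-english-corenlp-...-models.jar`
    // to verify where the caseless models live in your distribution.
    props.setProperty("pos.model",
        "edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger");
    props.setProperty("ner.model",
        "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz");
    // The English models jar must be on the classpath alongside the code jar
    // for these resources to resolve.
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    System.out.println("Pipeline loaded with caseless models.");
  }
}
```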
I'm working on entity extraction for one of my projects and came across CoreNLP. The demo works pretty well, but I can't seem to find any documentation on the entitylink/Wikipedia annotator. Does anyone have any sources on what techniques and data were used for these models?
This is based on Angel Chang's Wikidict resource: http://nlp.stanford.edu/pubs/crosswikis.pdf, albeit munged a fair bit to allow it to be loaded into memory.
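If you want to experiment with it, the annotator is enabled like any other pipeline step. A minimal sketch (annotator and annotation names as I understand them from the CoreNLP distribution; verify against your version, since entitylink's upstream requirements have shifted between releases):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class EntityLinkDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    // entitylink (the Wikidict annotator) needs NER and entity mentions upstream
    props.setProperty("annotators",
        "tokenize,ssplit,pos,lemma,ner,entitymentions,entitylink");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("Michael Jordan played for the Chicago Bulls.");
    pipeline.annotate(doc);

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        // WikipediaEntityAnnotation holds the linked Wikipedia title, if any
        String link = token.get(CoreAnnotations.WikipediaEntityAnnotation.class);
        if (link != null && !"O".equals(link)) {
          System.out.println(token.word() + " -> " + link);
        }
      }
    }
  }
}
```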
I would like to use the Stanford CoreNLP library to do coreference resolution in Dutch.
My question is: how do I train CoreNLP to handle Dutch coreference resolution?
We've already created a Dutch NER model based on the CoNLL-2002 set (https://github.com/WillemJan/Stanford_ner_bugreport/raw/master/dutch.gz), and we would also like to use the coreference module in the same way.
Look at the class edu.stanford.nlp.scoref.StatisticalCorefTrainer.
The appropriate properties file for English is in:
edu/stanford/nlp/scoref/properties/scoref-train-conll.properties
You may have to get the latest code base from GitHub:
https://github.com/stanfordnlp/CoreNLP
While we do not currently support training the statistical coreference models in the toolkit, the code for training them is included, and it is certainly possible that it works right now; I have yet to verify that it is functioning properly.
Please let me know if you need any more assistance. If you encounter bugs I can try to fix them...we would definitely like to get the statistical coreference training operational for future releases!
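For what it's worth, the trainer would be invoked from the command line roughly like this (a hypothetical invocation; the exact arguments accepted by StatisticalCorefTrainer's main method may differ, so check the source on GitHub):

```
java -Xmx16g -cp "stanford-corenlp.jar:stanford-corenlp-models.jar" \
  edu.stanford.nlp.scoref.StatisticalCorefTrainer \
  -props edu/stanford/nlp/scoref/properties/scoref-train-conll.properties
```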
I have been using the Stanford NER tagger to find the named entities in a document. The problem I am facing is described below:
Let the sentence be: "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another, and Boden as a third. The correct marking would be Ryan Fleck as one entity and Anna Boden as another.
Is this a problem with the NER tagger, and if so, can it be handled?
How about:
1. Take your data and run it through Stanford NER or some other NER.
2. Look at the results and find all the mistakes.
3. Correctly tag the incorrect results and feed them back into your NER.
4. Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that? (A sketch of the training-data format follows below.)
You should get there by adding more features, more data and training.
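For reference, Stanford NER trains on tab-separated token/label files. A sketch of what hand-tagged data for the hyphenated-name case could look like, assuming your tokenizer splits on the hyphen (you may need tokenizer options to force that):

```
The	O
film	O
is	O
directed	O
by	O
Ryan	PERSON
Fleck	PERSON
-	O
Anna	PERSON
Boden	PERSON
pair	O
.	O
```

You would point the trainFile property in your .prop file at data like this and retrain with edu.stanford.nlp.ie.crf.CRFClassifier.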
Instead of using Stanford CoreNLP, you could try Apache OpenNLP. It has an option to train your own model on your training data. Since the model depends on the names you supply, it is able to detect the names you are interested in.
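A minimal sketch of training a custom name finder with OpenNLP (API as of OpenNLP 1.6+; the file name, entity type, and model path are placeholders):

```java
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class TrainNameFinder {
  public static void main(String[] args) throws Exception {
    // Training data uses OpenNLP's inline format, one sentence per line, e.g.:
    // The film is directed by the <START:person> Ryan Fleck <END> - <START:person> Anna Boden <END> pair .
    ObjectStream<String> lines = new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("person-train.txt")),
        StandardCharsets.UTF_8);
    try (ObjectStream<NameSample> samples = new NameSampleDataStream(lines)) {
      TokenNameFinderModel model = NameFinderME.train(
          "en", "person", samples,
          TrainingParameters.defaultParams(),
          new TokenNameFinderFactory());
      // Save the trained model for later use with NameFinderME
      try (OutputStream out = new BufferedOutputStream(
          new FileOutputStream("person-model.bin"))) {
        model.serialize(out);
      }
    }
  }
}
```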