The NER annotator requires the Lemma annotator to run (line 576 in NERCombinerAnnotator).
I cannot find any usage of the LemmaAnnotation in NERCombinerAnnotator.
When I comment out this requirement, BasicPipelineExample works just fine.
Does NER actually use LemmaAnnotation?
Good question. I had thought it was used somewhere in SUTime, but there actually don't seem to be any uses. I think you're right that it could be deleted as a requirement.
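For anyone who wants to reproduce this, here is a minimal sketch (not an official recipe) of a pipeline that lists ner without lemma. With the stock jar, the requirement check at pipeline construction rejects it; with the lemma requirement commented out as described above, it runs fine:

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class NerWithoutLemma {
  public static void main(String[] args) {
    Properties props = new Properties();
    // "ner" without "lemma": the stock requirement check rejects this;
    // with the lemma requirement removed, annotation works as expected.
    props.setProperty("annotators", "tokenize,ssplit,pos,ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("Barack Obama visited Paris.");
    pipeline.annotate(doc);
    System.out.println(doc);
  }
}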
Related
I was using Stanford OpenIE for my professor on a research project.
I can successfully extract the triples by using the OpenIE annotator from the Stanford NLP server.
However, the confidence scores were not returned with the requested JSON, even though they are shown on the website
https://nlp.stanford.edu/software/openie.html.
Apparently this has not been implemented yet by the Stanford people.
Does anyone have a solution to this problem, or an alternative Python library I can use to extract both the expected output and its confidence scores from Stanford OpenIE?
The text output has the confidences. We can add the confidences to the JSON in future versions.
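In the meantime, the confidences are reachable from the Java API; a sketch using the standard openie pipeline, where RelationTriple exposes the extraction confidence as a public field:

import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;
import java.util.Properties;

public class OpenIEConfidence {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("Obama was born in Hawaii.");
    pipeline.annotate(doc);
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      Collection<RelationTriple> triples =
          sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
      for (RelationTriple triple : triples) {
        // Print confidence alongside subject / relation / object
        System.out.println(triple.confidence + "\t"
            + triple.subjectGloss() + "\t"
            + triple.relationGloss() + "\t"
            + triple.objectGloss());
      }
    }
  }
}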
I am making my own Stanford NER model, which is CRF-based, by following the conventions given at this link. I want to add gazettes and am following the gazette instructions from the same link. I list all of my gazettes with the property gazette=file1.txt;file2.txt and also set useGazettes=true in austen.prop. After building the model, when I test it on data from my gazettes, the tagging is not correct: the tags I specified in those files do not come out. These results are somewhat surprising to me, as Stanford NER is not giving the same tags as the ones listed in those files.
Are there limitations of Stanford NER with gazettes, or am I still missing something? If somebody can help me, I will be thankful.
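For reference, a minimal austen.prop-style excerpt with the gazette settings from the question might look like the sketch below (file names are the placeholders from the question). Note that gazette matches are used as features for the CRF, not as hard constraints, so the classifier can still assign a different tag than the one listed in the gazette file:

# Hypothetical austen.prop excerpt; adjust file names and paths to your setup
trainFile = austen-train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1
useGazettes = true
# sloppyGazette fires a feature for any token appearing in a gazette entry;
# cleanGazette fires only when the entire entry matches
sloppyGazette = true
gazette = file1.txt;file2.txt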
I am trying to normalize tokens (potentially merging them if needed) before running the RegexNER annotator over them.
Is there something already implemented for this in Stanford CoreNLP or in Stanford NLP in general?
If not, what's the best way to implement it? Writing a custom annotator in CoreNLP?
There are definitely some options for token normalization. You apply the -options flag with a comma-separated list containing the options you want.
This is described in more detail at this link:
http://nlp.stanford.edu/software/tokenizer.shtml
Near the bottom there is a section about Options that shows a list of possibilities.
Are there other normalizations you are interested in that are not on that list?
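For example, within a CoreNLP pipeline the tokenizer options can be passed through the tokenize.options property, so normalization happens before RegexNER sees the tokens. A sketch, with option names taken from that page (the exact list you need may differ):

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class NormalizeThenRegexNER {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,regexner");
    // PTBTokenizer normalization options, applied during tokenization
    props.setProperty("tokenize.options",
        "americanize=true,normalizeParentheses=true,asciiQuotes=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("The colour of the horse (a grey) was nice.");
    pipeline.annotate(doc);
    System.out.println(doc);
  }
}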
The part-of-speech (POS) models that the Stanford Parser and Stanford CoreNLP use are different; that is why there is a difference in the output of POS tagging performed through the Stanford Parser and CoreNLP.
Online Core NLP Output
The/DT man/NN is/VBZ smoking/NN ./.
A/DT woman/NN rides/NNS a/DT horse/NN ./.
Online Stanford Parser Output
The/DT man/NN is/VBZ smoking/VBG ./.
A/DT woman/NN rides/VBZ a/DT horse/NN ./.
And similarly for more sentences.
Is there documentation comparing the two models, or a more detailed explanation of the differences?
It seems the output of CoreNLP is wrong for these cases. Beyond the few sentences I checked during error analysis, I suspect there are quite a lot of similar cases where these kinds of errors occur.
This isn't really about CoreNLP, it's about whether you are using the Stanford POS tagger or the Stanford Parser (the PCFG parser) to do the POS tagging. (The PCFG parser usually does POS tagging as part of its parsing algorithm, although it can also use POS tags given from elsewhere.) Both sometimes make mistakes. On average, the POS tagger is a slightly better POS tagger than the parser. But, sometimes the parser wins, and in particular, it sometimes seems like it is better at tagging decisions that involve integrating clause-level information. At any rate, it gets these two examples right - though I bet you could also find some examples that go the other way.
If you want to use the PCFG parser for POS tagging in CoreNLP, simply omit the POS tagger, and move the parser earlier so that POS tags are available for the lemmatizer and regex-based NER:
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse,lemma,ner,dcoref -file test.txt
However, some of our other parsers (NN dependency parser, SR constituency parser) require POS tagging to have been done first.
I have been using the Stanford NER tagger to find the named entities in a document. The problem I am facing is described below:
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about:
1. Take your data and run it through Stanford NER or some other NER.
2. Look at the results and find all the mistakes.
3. Correctly tag the incorrect results and feed them back into your NER.
4. Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data, and more training.
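As a sketch of that suggestion: training data for CRFClassifier is tab-separated token/label pairs, one token per line, so a made-up hyphenated-name example could look like this (assuming the hyphen is split off as its own token at training time):

The	O
film	O
is	O
directed	O
by	O
Ryan	PERSON
Fleck	PERSON
-	O
Anna	PERSON
Boden	PERSON
pair	O
.	O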
Instead of using Stanford CoreNLP, you could try Apache OpenNLP. There is an option available to train a model on your own training data. As this model depends on the names supplied by you, it is able to detect names of your interest.
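A rough sketch of the OpenNLP route (en-ner-person.bin is one of the pretrained models; you would substitute a model trained on your own hyphenated-name data):

import java.io.FileInputStream;
import java.util.Arrays;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class OpenNLPNameFinder {
  public static void main(String[] args) throws Exception {
    // Load a name-finder model; replace with one trained on your own names
    TokenNameFinderModel model =
        new TokenNameFinderModel(new FileInputStream("en-ner-person.bin"));
    NameFinderME finder = new NameFinderME(model);
    String[] tokens = {"The", "film", "is", "directed", "by",
                       "Ryan", "Fleck", "and", "Anna", "Boden", "."};
    // find() returns spans over the token array, typed with the entity label
    for (Span span : finder.find(tokens)) {
      String name = String.join(" ",
          Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
      System.out.println(span.getType() + ": " + name);
    }
  }
}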