Parsing of Noun Compounds - stanford-nlp

We have been using the CoreNLP package (June 2014 version, with the default annotators) primarily for dependency parsing.
Recently, I have noticed a problem with noun compound bracketing in cases like "The Bank of England announced further interest rate increases today": the noun compound "interest rate increases" is incorrectly bracketed ("interest" is parsed as modifying "increases" rather than "rate"). The same happens when you put this sentence into the online demo of the Stanford parser, and with other similar sentences where a noun compound essentially modifies another noun.
One of my colleagues who does more parsing than I do says that this is likely due to the model having been trained on the unpatched version of the Penn Treebank. Our own in-house parser, which has been trained on the patched version, does get (most of) these noun compounds correct. I was wondering whether there is an alternative pre-trained model for the Stanford CoreNLP parser that I am not aware of, and if there is, how we would go about running the pipeline with this different model.
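On the second part of the question: the parse annotator can be pointed at a different serialized grammar via the parse.model property. The sketch below assumes you have obtained such an alternative model yourself; the file path is a placeholder, not an official Stanford release.

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class CustomParserModel {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
            // Placeholder path: point this at whatever retrained grammar you obtain;
            // by default the parse annotator loads englishPCFG.ser.gz from the models jar.
            props.setProperty("parse.model", "/path/to/alternative-englishPCFG.ser.gz");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            Annotation doc = new Annotation(
                    "The Bank of England announced further interest rate increases today.");
            pipeline.annotate(doc);
        }
    }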

Related

Stanford NLP training DocumentPreprocessor

Does Stanford NLP provide a train method for the DocumentPreprocessor, so that it can be trained on your own corpora to create your own models for sentence splitting?
I am working with German sentences and I need to create my own German model for sentence splitting tasks. Therefore, I need to train the sentence splitter, DocumentPreprocessor.
Is there a way I can do it?
No. At present, tokenization of all European languages is done by a (hand-written) finite automaton. Machine learning-based tokenization is used for Chinese and Arabic. At present, sentence splitting for all languages is done by rule, exploiting the decisions of the tokenizer. (Of course, that's just how things are now, not how they have to be.)
At present we have no separate German tokenizer/sentence splitter. The current properties file just re-uses the English ones. This is clearly sub-optimal. If someone wanted to produce something for German, that would be great to have. (We may do it at some point, but German development is not currently at the top of the list of priorities.)
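For reference, here is a minimal sketch of how DocumentPreprocessor is used as-is: it is driven entirely by the rule-based tokenizer and splitter, so there is nothing to train. The German sample text is just an illustration; with the current release it will simply be handled by the English rules.

    import java.io.StringReader;
    import java.util.List;
    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.process.DocumentPreprocessor;

    public class SentenceSplitDemo {
        public static void main(String[] args) {
            String text = "Das ist der erste Satz. Und hier folgt der zweite Satz.";
            // DocumentPreprocessor applies the rule-based tokenizer and sentence
            // splitter; there is no train() method to call.
            DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
            for (List<HasWord> sentence : dp) {
                System.out.println(sentence);
            }
        }
    }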

Sentence-level to document-level sentiment analysis. Analysing news

I need to perform sentiment analysis on news articles about a specific topic using the Stanford NLP tool.
This tool only allows sentence-based sentiment analysis, while I would like to extract a sentiment evaluation of whole articles with respect to my topic.
For instance, if my topic is Apple, I would like to know the sentiment of a news article with respect to Apple.
Just computing the average of the sentences in my articles won't do. For instance, I might have an article saying something along the lines of "Apple is very good at this, and this and that. While Google products are very bad for these reasons". Such an article would result in a Neutral classification using the average score of sentences, while it is actually a Very positive article about Apple.
On the other hand filtering my sentences to include only the ones containing the word Apple would miss articles along the lines of "Apple's product A is pretty good. However, it lacks the following crucial features: ...". In this case the effect of the second sentence would be lost if I were to use only the sentences containing the word Apple.
Is there a standard way of addressing this kind of problem? Is Stanford NLP the wrong tool to accomplish my goal?
Update: You might want to look into
http://blog.getprismatic.com/deeper-content-analysis-with-aspects/
This is a very active area of research, so it would be hard to find an off-the-shelf tool to do this (at least nothing is built into Stanford CoreNLP). Some pointers: look into aspect-based sentiment analysis. In this case, Apple would be an "aspect" (not really, but it can be modeled that way). Andrew McCallum's group at UMass, Bing Liu's group at UIC, and Cornell's NLP group, among others, have worked on this problem.
If you want a quick fix, I would suggest extracting sentiment from sentences that refer to Apple and its products, and using coreference resolution (check out the dcoref annotator in Stanford CoreNLP), which will increase the recall of relevant sentences and handle sentences like "However, it lacks ...".
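A minimal sketch of that quick fix, assuming a recent CoreNLP release with the standard annotators and a simple string match on "Apple" (a fuller version would also walk the dcoref chains so that pronouns like "it" count as mentions):

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
    import edu.stanford.nlp.util.CoreMap;

    public class AppleSentimentSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // dcoref needs pos, lemma, ner, and parse; sentiment needs parse
            props.setProperty("annotators",
                    "tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            String article = "Apple's product A is pretty good. "
                    + "However, it lacks the following crucial features: ...";
            Annotation doc = new Annotation(article);
            pipeline.annotate(doc);

            for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
                String text = sentence.get(CoreAnnotations.TextAnnotation.class);
                // Naive topic filter; replace with a check over the coref chains
                // to also catch sentences that refer to Apple only via a pronoun.
                if (text.contains("Apple")) {
                    // The annotation key name differs between CoreNLP releases
                    // (SentimentClass vs. ClassName); adjust for your version.
                    String sentiment =
                            sentence.get(SentimentCoreAnnotations.SentimentClass.class);
                    System.out.println(sentiment + "\t" + text);
                }
            }
        }
    }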

Natural Language Parsing using Stanford NLP

How does the Stanford Natural Language Parser use the Penn Treebank for the tagging process? I want to know how it finds the POS tags for a given input.
The Stanford part-of-speech tagger uses a probabilistic sequence model to determine the most likely sequence of part-of-speech tags underlying a sentence. Some of the features provided to this model are:
Surrounding words and n-grams
Part-of-speech tags of surrounding words
"Word shapes" (e.g., "Foo5" is translated to "Xxx#")
Word suffix, prefix
See the ExtractorFrames class for details. The model is trained on a tagged corpus (like the Penn Treebank) which has each token annotated with its correct part of speech.
At run time, features like those mentioned above are calculated for input text and are used to build per-tag probabilities, which are then fed into an implementation of the Viterbi algorithm (ExactBestSequenceFinder), which finds the most likely arrangement of tags for the entire sequence.
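As a concrete illustration, the trained tagger can be invoked directly through the MaxentTagger class. This is a minimal sketch; the model path below is the one conventionally shipped in the CoreNLP models jar and may differ in your distribution.

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class PosTagDemo {
        public static void main(String[] args) {
            // Load a pre-trained model (trained on Penn Treebank-style data).
            MaxentTagger tagger = new MaxentTagger(
                    "edu/stanford/nlp/models/pos-tagger/english-left3words/"
                    + "english-left3words-distsim.tagger");
            // tagString runs the feature extraction and Viterbi decoding
            // described above and returns word_TAG pairs.
            System.out.println(tagger.tagString("The quick brown fox jumps over the lazy dog."));
            // Prints something like: The_DT quick_JJ brown_JJ fox_NN jumps_VBZ ...
        }
    }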
For more information to get started with POS tagging:
Watch the Week 5 lectures of the Coursera NLP class (co-taught by the CoreNLP lead)
Check out the code in the edu.stanford.nlp.tagger.maxent package
Part-of-speech tagging in NLTK

Sentiment analysis

While performing sentiment analysis, how can I make the machine understand that I'm referring to Apple (the iPhone maker) instead of apple (the fruit)?
Thanks for the advice!
Well, there are several methods.
I would start by checking capitalization: usually, when a word refers to a name, its first letter is capitalized.
Before doing sentiment analysis, I would use part-of-speech tagging and named entity recognition (NER) to tag the relevant words.
Stanford CoreNLP is a good text analysis project to start with; it will teach you the basic concepts.
You can see from CoreNLP's tagged output how the tags can help you.
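For instance, here is a minimal sketch of inspecting those tags, assuming the standard English models; with the default models, "Apple" in a product context typically comes out as NNP and ORGANIZATION, while lowercase "apple" comes out as NN and O.

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class TagInspection {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation doc = new Annotation("Apple released a new iPhone. I ate an apple.");
            pipeline.annotate(doc);

            // Print each token with its part-of-speech tag and named entity tag.
            for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    System.out.printf("%s\t%s\t%s%n",
                            token.word(),
                            token.get(CoreAnnotations.PartOfSpeechAnnotation.class),
                            token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
                }
            }
        }
    }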
As described by Ofiris, NER is only one way to solve your problem. I feel it is more effective to use word embeddings to represent your words; that way, the machine automatically recognizes the context of a word. For example, "apple" mostly occurs together with words like "eat", but if the input "Apple" appears with "mobile" or another word from that domain, the machine will understand it is Apple the phone maker rather than the fruit. Two popular ways to generate word embeddings are word2vec and fastText.
Gensim provides reliable implementations of both word2vec and fastText:
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
In the presence of dates, famous brands, VIPs, or historical figures, you can use a NER (named entity recognition) algorithm; in that case, as suggested by Ofiris, Stanford CoreNLP offers a good named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good"), you could use a POS tagger coupled with a word sense disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know of any freely downloadable library for this purpose.
This problem has already been addressed by many open-source pre-trained NER models. Alternatively, you can retrain an existing NER model to fine-tune it for this issue.
You can find a demo of NER results as produced by the spaCy NER here.

CoreNLP basic errors

Take the phrase "A Pedestrian wishes to cross the road".
I learnt English in England and, according to the old rules, the word 'Pedestrian' is a noun. Stanford CoreNLP finds it to be an adjective, regardless of capitalization.
I don't want to contradict the big-brains of Stanford, USA, but that is just wrong. I am new to this semantic stuff but, by finding the word to be an adjective, the sentence lacks a valid noun phrase.
Have I missed the point of CoreNLP, lost the point of the English language, or should I be seeking more effective analysis tools?
I ask as the example sentence is the very first sentence, of my very first processing experiment, and it is most discouraging.
CoreNLP is a statistical analysis tool. It is trained on many texts that have been annotated by pools of human experts, and these experts agree on about 90% of the cases. Thus the CoreNLP system cannot beat that percentage, and your sentence is part of the roughly 10% of parses that come out wrong.
