Sentence-level to document-level sentiment analysis. Analysing news - stanford-nlp

I need to perform sentiment analysis on news articles about a specific topic using the Stanford NLP tool.
This tool only supports sentence-level sentiment analysis, while I would like to extract a sentiment evaluation of the whole article with respect to my topic.
For instance, if my topic is Apple, I would like to know the sentiment of a news article with respect to Apple.
Just computing the average of the sentence scores in my articles won't do. For instance, I might have an article saying something along the lines of "Apple is very good at this, and this and that. While Google products are very bad for these reasons". Such an article would be classified as Neutral using the average sentence score, while it is actually a Very positive article about Apple.
On the other hand filtering my sentences to include only the ones containing the word Apple would miss articles along the lines of "Apple's product A is pretty good. However, it lacks the following crucial features: ...". In this case the effect of the second sentence would be lost if I were to use only the sentences containing the word Apple.
Is there a standard way of addressing this kind of problem? Is Stanford NLP the wrong tool for my goal?

Update: You might want to look into
http://blog.getprismatic.com/deeper-content-analysis-with-aspects/
This is a very active area of research, so it is hard to find an off-the-shelf tool for it (at least nothing is built into Stanford CoreNLP). Some pointers: look into aspect-based sentiment analysis. In this case, Apple would be an "aspect" (not exactly, but it can be modeled that way). Andrew McCallum's group at UMass, Bing Liu's group at UIC, and Cornell's NLP group, among others, have worked on this problem.
If you want a quick fix, I would suggest extracting sentiment from sentences that refer to Apple and its products; use coreference resolution (check out the dcoref annotator in Stanford CoreNLP), which will increase the recall of relevant sentences and handle cases like "However, it lacks ...".
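For a concrete starting point, here is a minimal sketch of that quick fix, assuming CoreNLP is installed locally and driven through stanza's CoreNLPClient wrapper; the topic word, the example text, and the numeric score mapping are illustrative choices, and the protobuf field names (sentence, token, sentiment, corefChain) reflect the CoreNLP annotation output as I understand it:

    # A minimal sketch, not a drop-in solution: assumes a local CoreNLP install
    # reachable through stanza's CoreNLPClient; the topic word, sample text and
    # score mapping are illustrative choices.
    from stanza.server import CoreNLPClient

    TOPIC = "apple"
    SCORE = {"Very negative": -2, "Negative": -1, "Neutral": 0,
             "Positive": 1, "Very positive": 2}

    text = ("Apple's product A is pretty good. "
            "However, it lacks some crucial features.")

    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner",
                                   "parse", "sentiment", "dcoref"],
                       timeout=60000, memory="6G", be_quiet=True) as client:
        ann = client.annotate(text)

        # Sentences that literally mention the topic word.
        relevant = {i for i, sent in enumerate(ann.sentence)
                    if any(tok.word.lower() == TOPIC for tok in sent.token)}

        # Text of a coreference mention, looked up in its sentence's tokens.
        def mention_text(m):
            toks = ann.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
            return " ".join(t.word for t in toks).lower()

        # Pull in sentences linked to the topic through a coreference chain;
        # this is what recovers "However, it lacks ...".
        for chain in ann.corefChain:
            if any(TOPIC in mention_text(m) for m in chain.mention):
                relevant |= {m.sentenceIndex for m in chain.mention}

        scores = [SCORE.get(ann.sentence[i].sentiment, 0) for i in sorted(relevant)]
        print("topic sentences:", sorted(relevant))
        print("average topic sentiment:",
              sum(scores) / len(scores) if scores else None)

The coreference step is what pulls the "However, it lacks ..." sentence into the topic's sentiment, since "it" is resolved back to the product mentioned in the first sentence.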

Related

Is there any sentiment forum dataset for unsupervised training available?

I recently finished a machine learning course and would like to build a forum sentiment analysis tool to apply to stock-related forums.
The idea is to:
Capture (text mining) users with their comments, and evaluate each comment's sentiment (positive, negative, neutral).
Capture what happens (in the stock market) after those comments, and assign a weight to the user accordingly (a bigger weight if the user's sentiment is spot-on and the market moves in the same direction).
Use the comments as a tool to predict market direction.
Actually, I already do this myself (paying attention to forums) alongside my own technical analysis and the obligatory due diligence, and it has been working very well for me. I just want to automate it a little and maybe even let a program trade some of my accounts (paper trading first, and if it performs decently, allocate some money to a real account).
This would be my first machine learning project (just as a proof-of-concept) so any comments would be very kindly appreciated.
The biggest problem I find is that I would like to use unsupervised training, and I need a sample dataset for it.
Question: Is there any known forum-sentiment dataset available to be used for unsupervised training?
I've found several sentiment datasets (Twitter, IMDb, Amazon reviews), but they are very specific to their niche (short messages, movies, products...), and I'm looking for something more general.
Since you are looking for an unsupervised approach, you can use any set of data that matches your real use case. Text mining and sentiment analysis are often tailored to the problem at hand, so it is easiest to start directly with the real data. The best approach is to build a scraper that grabs the forum posts you want to analyze. You can build the scraper easily enough with Python (BeautifulSoup/Selenium). There are plenty of good tutorials online, e.g.: https://www.dataquest.io/blog/web-scraping-tutorial-python/
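As a concrete sketch of such a scraper (the thread URL and the div.post-body selector below are placeholders for whatever the real forum uses):

    # A minimal scraping sketch with requests + BeautifulSoup; the URL and the
    # "div.post-body" selector are hypothetical and must be replaced with the
    # structure of the forum you actually want to mine.
    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/stock-forum/thread/123"  # placeholder

    response = requests.get(URL, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    posts = [div.get_text(strip=True) for div in soup.select("div.post-body")]

    for post in posts:
        print(post[:120])  # preview of each scraped comment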

Sentiment towards a keyword

I have been looking around for sentiment and text analysis services, but most of them seem to analyse the whole text and provide one result for it.
Is there a way of analysing the same piece of text against two different keywords? For example, the same article could be talking about two entities, positively towards one and negatively towards the other.
How could one get these two sentiments within the same text? Is there a service or API already for that?
I have found IBM's AlchemyAPI, but it doesn't seem to return accurate results...
What you want is aspect-based sentiment analysis. There are many algorithms for aspect-based sentiment analysis, with different precision and recall trade-offs.
You can use Aylien's Text Analysis API.
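If you want a rough do-it-yourself baseline before committing to a hosted service, you can approximate per-keyword sentiment by scoring only the sentences that mention each keyword, for example with NLTK's VADER analyzer; this is plain keyword matching rather than true aspect-based sentiment analysis, and the example text is made up:

    # Rough per-keyword baseline: score each keyword by the sentences that
    # mention it, using NLTK's VADER analyzer (misses pronoun references and
    # implicit aspects).
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer
    from nltk.tokenize import sent_tokenize

    nltk.download("vader_lexicon", quiet=True)
    nltk.download("punkt", quiet=True)

    text = ("Apple shipped a great update this week. "
            "Google's rollout, on the other hand, was a disaster.")
    keywords = ["Apple", "Google"]

    sia = SentimentIntensityAnalyzer()
    for kw in keywords:
        sentences = [s for s in sent_tokenize(text) if kw.lower() in s.lower()]
        scores = [sia.polarity_scores(s)["compound"] for s in sentences]
        average = sum(scores) / len(scores) if scores else 0.0
        print(f"{kw}: {average:+.2f}")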

Sentiment analysis

While performing sentiment analysis, how can I make the machine understand that I'm referring to Apple (the iPhone) instead of apple (the fruit)?
Thanks for the advice!
Well, there are several methods.
I would start by checking capitalization: when referring to a name, the first letter is usually capitalized.
Before doing sentiment analysis, I would use part-of-speech tagging and Named Entity Recognition to tag the relevant words.
Stanford CoreNLP is a good text analysis project to start with; it will teach you the basic concepts.
An example from the CoreNLP demo shows how the tags can help you.
As described by Ofiris, NER is only one way to solve your problem. I feel it's more effective to use word embeddings to represent your words; that way the machine automatically picks up the context of the word. For example, "apple" mostly occurs together with words like "eat", but if the input "Apple" appears with "mobile" or other words from that domain, the machine will understand it is the iPhone Apple instead of the fruit. Two popular ways to generate word embeddings are word2vec and fastText.
Gensim provides reliable implementations of both word2vec and fastText.
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
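As a rough illustration of that idea, here is a minimal gensim word2vec sketch on a tiny made-up corpus; with a realistic corpus, the nearest neighbours of "apple" shift depending on whether it co-occurs with fruit words or phone words, which is exactly the contextual signal described above:

    # Minimal gensim word2vec sketch on a toy corpus (results on such a tiny
    # corpus are noisy; use a real corpus for meaningful neighbours).
    from gensim.models import Word2Vec

    sentences = [
        ["i", "eat", "an", "apple", "every", "morning"],
        ["the", "apple", "pie", "tastes", "great"],
        ["apple", "released", "a", "new", "iphone", "today"],
        ["the", "apple", "mobile", "phone", "sold", "well"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)
    print(model.wv.most_similar("apple", topn=5))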
In the presence of dates, famous brands, VIPs, or historical figures, you can use a NER (named entity recognition) algorithm; in that case, as suggested by Ofiris, Stanford CoreNLP offers a good named entity recognizer.
For more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good"), you could use a POS tagger coupled with a Word Sense Disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know of any freely downloadable library for this purpose.
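One freely available, if fairly basic, option for the WSD step (not mentioned in the answer above) is NLTK's implementation of the classic Lesk algorithm; it is a dictionary-overlap heuristic, so treat it as a baseline rather than a state-of-the-art disambiguator:

    # Minimal WSD sketch with NLTK's Lesk implementation; the sentence and the
    # target word are illustrative.
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    nltk.download("wordnet", quiet=True)
    nltk.download("punkt", quiet=True)

    sentence = "This laptop from Apple has a good battery"
    sense = lesk(word_tokenize(sentence), "good", pos="a")  # 'a' = adjective
    print(sense, "-", sense.definition() if sense else "no sense found")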
This problem has largely been solved by open-source pre-trained NER models. You can also try retraining an existing NER model to fine-tune it for this issue.
You can find a demo of NER results produced by spaCy here.
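For a quick look at what a pre-trained model gives you, here is a minimal spaCy NER sketch (assuming the small English model has been installed with "python -m spacy download en_core_web_sm"):

    # Minimal spaCy NER sketch: the capitalized, company sense of "Apple" is
    # typically tagged ORG, while the lowercase fruit sense gets no entity tag.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I ate an apple while setting up my new Apple iPhone.")

    for ent in doc.ents:
        print(ent.text, ent.label_)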

Corenlp basic errors

Take the phrase "A Pedestrian wishes to cross the road".
I learnt English in England and, according to the old rules, the word 'Pedestrian' is a noun. Stanford CoreNLP finds it to be an adjective, regardless of capitalization.
I don't want to contradict the big brains of Stanford, USA, but that is just wrong. I am new to this semantic stuff, but if the word is tagged as an adjective, the sentence lacks a valid noun phrase.
Have I missed the point of CoreNLP, lost the point of the English language, or should I be seeking more effective analysis tools?
I ask as the example sentence is the very first sentence, of my very first processing experiment, and it is most discouraging.
CoreNLP is a statistical analysis tool. It is trained on many texts that have been annotated by pools of human experts, and those experts agree on only about 90% of cases. Thus the CoreNLP system cannot beat that percentage, and your sentence falls into the roughly 10% of incorrect parses.
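If you want to double-check how Stanford's current models tag that sentence yourself, one quick way is their stanza Python package, which runs a neural tagger rather than the older Java pipeline; it is a different model, so its output may well differ from what you saw:

    # Minimal sketch: inspect the POS tags Stanford's stanza models assign.
    # The English models are downloaded on first run.
    import stanza

    stanza.download("en")
    nlp = stanza.Pipeline("en", processors="tokenize,pos")

    doc = nlp("A pedestrian wishes to cross the road.")
    for word in doc.sentences[0].words:
        print(f"{word.text:12s} {word.upos:6s} {word.xpos}")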

How to tackle twitter sentiment analysis?

I'd like you to give me some advice on how to tackle this problem. At college I solved opinion mining tasks, but with Twitter the approach is quite different. For example, I used an ensemble learning approach to classify users' opinions about a certain hotel in Spain. Of course, I was given a training set with positive and negative opinions and then tested on a test set. But now, with Twitter, I'm finding this kind of categorization very difficult.
Do I need a training set? And if the answer is yes, don't you think Twitter is so temporal that, if I rely on that set, my performance on future topics will be very poor?
I was thinking of getting a dictionary (mainly adjectives), crossing my tweets with it, and obtaining a term-document matrix, but I have no class assigned to any tweet. Also, positive and negative adjectives can vary depending on the topic and the time. So, how do I deal with this?
How do I deal with the problem of languages? For instance, I'd like to study tweets written in English and in Spanish, but separately.
Which programming languages do you suggest for something like this? I've been trying R packages like tm and twitteR.
Sure. I think the way sentiment is expressed will stay constant for a few months; worst case, you relabel and retrain. Unsupervised learning has a poor track record for industrial applications in my experience.
You'll need an emotion/adjective dictionary for the sentiment part; there are some datasets out there, but I forget where they are. I may have answered previous questions with better info.
Just do English tweets: it's fairly easy to build a language classifier, but you want to start small, so take it easy on yourself.
Python (NLTK) if you want to do it easily in a small amount of code. Java has good NLP libraries, but Python and its libraries are far more user friendly.
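As a small supervised starting point in NLTK, here is a sketch using the twitter_samples corpus that ships with NLTK and a bag-of-words Naive Bayes classifier; swap in your own labelled tweets once you have them:

    # Minimal supervised baseline: bag-of-words Naive Bayes on NLTK's bundled
    # twitter_samples corpus (5,000 positive and 5,000 negative tweets).
    import nltk
    from nltk.corpus import twitter_samples
    from nltk.tokenize import casual_tokenize

    nltk.download("twitter_samples", quiet=True)

    def features(tweet):
        return {word.lower(): True for word in casual_tokenize(tweet)}

    pos = [(features(t), "positive") for t in twitter_samples.strings("positive_tweets.json")]
    neg = [(features(t), "negative") for t in twitter_samples.strings("negative_tweets.json")]

    train = pos[:4000] + neg[:4000]
    test = pos[4000:] + neg[4000:]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print("accuracy:", nltk.classify.accuracy(classifier, test))
    print(classifier.classify(features("Loving the new update, works great!")))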
This site: https://sites.google.com/site/miningtwitter/questions/sentiment provides 3 ways to do sentiment analysis using R.
The twitteR package is now updated to work with the new Twitter API. I'd suggest downloading the source version of the package to avoid getting duplicated tweets.
I'm working on a Spanish dictionary for opinion mining and will publish it somewhere accessible.
Cheers!
Sentiment analysis will give only 3 results, as said above: positive, negative and neutral. I found a tutorial on Twitter sentiment analysis and it's quite easy.
I found it here - https://www.ai-ml.tech/twitter-sentiment-analysis/
It has only 3 dependencies, which I downloaded, and needs fairly little code. Just go through it and you will get the solution.
