What are the existing sentiment analysis algorithms? - sentiment-analysis

A group of us is developing a sentiment analysis algorithm. I would like to know which ones already exist, because I want to compare them. Is there an article that covers the main algorithms in this area?
Thanks in advance
Thiago

Some of the papers on sentiment analysis may help you -
One of the earlier works by Bo Pang, Lillian Lee http://acl.ldc.upenn.edu/acl2002/EMNLP/pdfs/EMNLP219.pdf
A comprehensive survey of sentiment analysis techniques http://www.cse.iitb.ac.in/~pb/cs626-449-2009/prev-years-other-things-nlp/sentiment-analysis-opinion-mining-pang-lee-omsa-published.pdf
Study by Hang Cui, V Mittal, M Datar using 6-grams http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.5942&rep=rep1&type=pdf
For a quick implementation, Naive Bayes is recommended. You can find an example here http://nlp.stanford.edu/IR-book/
We did a statistical comparison of various classifiers and found SVM to be the most accurate, though for a dataset of long documents (http://ai.stanford.edu/~amaas/data/sentiment/) none of the methods worked well. Our study may not be accurate, though. Also, instead of treating sentiment analysis as a text classification problem, you can look at extracting meaning from text, though I do not know how successful that might be.
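To make the Naive Bayes suggestion concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The class name, the whitespace tokenization and the toy training data are assumptions for illustration; they are not taken from the papers or the IR book linked above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens (illustrative sketch)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # add-one (Laplace) smoothing by default

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = doc.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum of log likelihoods of the known tokens
            score = math.log(n_docs / total_docs)
            denom = (sum(self.word_counts[label].values())
                     + self.smoothing * len(self.vocab))
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are skipped
                    count = self.word_counts[label][tok]
                    score += math.log((count + self.smoothing) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, made up for this example.
train_docs = ["great movie loved it", "wonderful acting great plot",
              "terrible movie hated it", "boring plot awful acting"]
train_labels = ["pos", "pos", "neg", "neg"]
clf = NaiveBayes().fit(train_docs, train_labels)
```

On real data you would swap in a proper tokenizer and a held-out test set before comparing it against an SVM.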

Apparently NLTK, a Python natural language processing library, has one:
http://text-processing.com/demo/sentiment/
Probably worth having a look at it.

Related

Sentiment Analysis using tensorflow

I am exploring tensorflow and would like to do sentiment analysis using the options available. I had a look at the following tutorial http://www.tensorflow.org/tutorials/recurrent/index.html#language_modeling
I have worked with the Naive Bayes classifier, the maximum entropy algorithm and scikit-learn classifiers, and would like to know if there are any better algorithms offered by TensorFlow. Is this the right place to start or are there any other options?
Any help pointing in the right direction would be greatly appreciated.
Thanks in advance.
A commonly used approach would be using a Convolutional Neural Network (CNN) to do sentiment analysis. You can find a great explanation/tutorial in this WildML blogpost. The accompanying TensorFlow code can be found here.
Another approach would be using an LSTM (or a related network). You can find example implementations online; a good starting point is this blogpost.
I would suggest you try a character-level LSTM; it has been shown to achieve state-of-the-art results in many text classification tasks, one of them being sentiment analysis.
I wrote a pretty lengthy article that you can find here where I go through its implementation in TensorFlow line by line. The result is a model that is less than 100 MB in size and that achieves an accuracy of over 80% on a test set of 80,000 tweets.
Another approach that has proven to be very effective is to use a recursive neural network; you can read the paper from the Stanford NLP Group here
For me, the easiest tutorial to follow was: https://pythonprogramming.net/data-size-example-tensorflow-deep-learning-tutorial/?completed=/train-test-tensorflow-deep-learning-tutorial/
It walks you through tf.train.AdamOptimizer().minimize(cost) and uses the Sentiment140 dataset (from Stanford, about 1.6 million tweets labeled positive or negative).

Good algorithm for sentiment analysis

I tried a naive Bayes classifier and it works very badly. SVM works a little better but is still horrible. Most of the papers I have read are about SVM and naive Bayes with some variations (n-grams, POS tags, etc.), but all of them give me results close to 50% (the authors talk about 80% and higher, but I cannot get the same accuracy on real data).
Are there any more powerful methods besides lexical analysis? SVM and Bayes assume that words are independent. This approach is called "bag of words". What if we assume that words are associated?
For example: use the Apriori algorithm to detect that if a sentence contains "bad" and "horrible", then there is a 70% probability that the sentence is negative. We could also use the distance between words, and so on.
Is this a good idea, or am I reinventing the wheel?
You're confusing a couple of concepts here. Neither Naive Bayes nor SVMs are tied to the bag of words approach. Neither SVMs nor the BOW approach have an independence assumption between terms.
Here are some things you can try:
include punctuation marks in your bags of words; esp. ! and ? can be helpful for sentiment analysis, while many feature extractors geared toward document classification throw them away
same for stop words: words like "I" and "my" may be indicative of subjective text
build a two-stage classifier; first determine whether any opinion is expressed, then whether it's positive or negative
try a quadratic kernel SVM instead of a linear one to capture interactions between features.
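The first two suggestions and the last one can be sketched in a few lines. The tokenizer below keeps ! and ? and does not drop stop words; the pair features are a hand-made stand-in for the kind of pairwise interactions a quadratic kernel captures implicitly, and they also cover the "bad and horrible" idea from the question. The function names and the regular expression are assumptions for illustration.

```python
import re
from itertools import combinations

# Words (with apostrophes) plus the two punctuation marks worth keeping.
TOKEN_RE = re.compile(r"[A-Za-z']+|[!?]")


def tokenize(text):
    """Lowercase and split, keeping '!' and '?' and all stop words."""
    return TOKEN_RE.findall(text.lower())


def features(text):
    """Bag of words plus unordered word-pair features for one text."""
    unigrams = set(tokenize(text))
    # Each pair of distinct tokens becomes one feature, e.g. 'bad&horrible'.
    pairs = {"&".join(p) for p in combinations(sorted(unigrams), 2)}
    return unigrams | pairs
```

The number of pair features grows quadratically with sentence length, so in practice you would prune rare pairs or let the kernel do the work instead.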
Algorithms like SVM, Naive Bayes and maximum entropy ones are supervised machine learning algorithms and the output of your program depends on the training set you have provided.
For large-scale sentiment analysis I prefer an unsupervised learning method, in which one can determine the sentiment of adjectives by clustering documents into same-oriented parts and labeling the clusters positive or negative. More information can be found in this paper.
http://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
Hope this helps you in your work :)
You can find some useful material on sentiment analysis using Python.
This presentation summarizes sentiment analysis as 3 simple steps:
Labeling data
Preprocessing
Model learning
Sentiment analysis is an area of ongoing research, and a lot of work is being done right now. For an overview of the most recent and most successful approaches, I would generally advise you to have a look at the shared tasks of SemEval. Usually, every year they run a competition on sentiment analysis in Twitter. You can find the paper describing the task and the results for 2016 here (might be a bit technical though): http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_report.pdf
Starting from there, you can have a look in the papers describing the individual systems (as referenced there).

How to tackle twitter sentiment analysis?

I'd like you to give me some advice in order to tackle this problem. At college I've been solving opinion mining tasks but with Twitter the approach is quite different. For example, I used an ensemble learning approach to classify users opinions about a certain Hotel in Spain. Of course, I was given a training set with positive and negative opinions and then I tested with the test set. But now, with twitter, I've found this kind of categorization very difficult.
Do I need to have a training set? And if the answer is yes, don't you think Twitter is so time-dependent that, even with such a set, my performance on future topics will be very poor?
I was thinking of getting a dictionary (mainly adjectives) and crossing my tweets with it to obtain a term-document matrix, but I have no class assigned to any tweet. Also, positive adjectives and negative adjectives could vary depending on the topic and time. So, how to deal with this?
How to deal with the problem of languages? For instance, I'd like to study tweets written in English and those in Spanish, but separately.
Which programming languages do you suggest to do something like this? I've been trying with R packages like tm, twitteR.
Sure, I think the way sentiment is used will stay constant for a few months. Worst case, you relabel and retrain. Unsupervised learning has a shitty track record for industrial applications in my experience.
You'll need some emotion/adjective dictionary for sentiment stuff. There are some datasets out there but I forget where they are. I may have answered previous questions with better info.
Just do English tweets. It's fairly easy to build a language classifier, but you want to start small, so take it easy on yourself.
Python (NLTK) if you want to do it easily in a small amount of code. Java has good NLP stuff, but Python and its libraries are way more user friendly.
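As a rough illustration of the dictionary idea in Python, here is a minimal sketch of a lexicon-based scorer. The tiny lexicon, the negator list and the function names are all made up for this example; a real system would load a published sentiment lexicon instead.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}
NEGATORS = {"not", "no", "never"}


def lexicon_score(tweet):
    """Sum lexicon weights; a negator flips the next lexicon word it precedes."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(tok, 0)
        if value:
            score += -value if negate else value
            negate = False  # the negation is used up by this word
    return score


def label(tweet):
    """Map the score to one of three coarse classes."""
    s = lexicon_score(tweet)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"
```

This needs no training set, but it also has no notion of topic or time, so the concern in the question about adjectives changing meaning still applies.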
This site: https://sites.google.com/site/miningtwitter/questions/sentiment provides 3 ways to do sentiment analysis using R.
The twitteR package is now updated to work with the new Twitter API. I'd suggest you download the source version of the package to avoid getting duplicated tweets.
I'm working on a Spanish dictionary for opinion mining, and will publish it somewhere accessible.
cheers!
Sentiment analysis will give only 3 results, as said above: positive, negative and neutral. I found a tutorial on Twitter sentiment analysis and it's quite easy.
I found it here - https://www.ai-ml.tech/twitter-sentiment-analysis/
It needs only 3 dependencies, which I downloaded, and very little code. Just go through it and you will get the solution.

Automatic Tagging Algorithm

Does anyone know how to build automatic tagging (blog post/document) algorithm? Any example will be appreciated.
I agree with what Wooble is saying. However the naïve solution is to simply write an algorithm that calculates the lexical similarities and differences of the given blog post compared to a corpus of text. This lexical difference will give you words that are found in the blog post with more frequency than those found in the corpus. And from those words, you can infer a tag.
But I strongly recommend against it. Automatic tagging doesn't seem to work in practice. Just outsource the tagging work to your users or to services like Mechanical Turk
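If you still want to experiment with the naive approach described above, a minimal sketch could look like this. It ranks the words of a post by how much more frequent they are in the post than in a background corpus. The function names, the smoothing and the min_count threshold are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def candidate_tags(post, corpus_docs, top_n=3, min_count=2):
    """Rank post words by relative frequency in the post vs. the corpus."""
    post_counts = Counter(tokenize(post))
    corpus_counts = Counter(t for doc in corpus_docs for t in tokenize(doc))
    post_total = sum(post_counts.values())
    corpus_total = sum(corpus_counts.values())
    vocab_size = len(set(post_counts) | set(corpus_counts))

    scores = {}
    for word, count in post_counts.items():
        if count < min_count:  # ignore words that appear only in passing
            continue
        post_rate = count / post_total
        # Add-one smoothing so words missing from the corpus do not divide by zero.
        corpus_rate = (corpus_counts[word] + 1) / (corpus_total + vocab_size)
        scores[word] = post_rate / corpus_rate
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Common words such as "the" get a low ratio because they are frequent in the corpus too, which is the whole point of comparing against a background.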
Late response but also had this task for a course - so in case someone else is looking to explore this, here is a starting point:
If you are looking for simple solutions or perhaps as a machine learning exercise, you might view automatic tagging as a text categorization/classification task. Naive Bayes classifiers are simple tools to figure out and there is plenty of pseudocode and material to understand these. TFIDF (term frequency-inverse document frequency) metric is something else you can look into - although commonly associated with information retrieval it can be tasked for this problem when combined with other machine learning techniques.
However, instead of assigning the new sample a single label, as the definition of the NB classifier would have it, you will have to determine multiple labels. You can probably use the tag co-occurrence information from the training set to help you with this.
This is a simplistic and naive solution and there are a lot of details on feature selection left out (stemming to reduce independent parameters, information gain, etc). Plenty of easily accessible papers on this research topic to try it out!
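One way to read the co-occurrence suggestion above, as a minimal sketch: take the single tag predicted by the classifier and add every tag that appears alongside it in at least a given fraction of the training posts. The function names and the min_conf threshold are assumptions, and this is only one of several ways to go from one label to many.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_model(tagged_posts):
    """Count how often each tag, and each ordered pair of tags, occurs."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tagged_posts:
        unique = set(tags)
        tag_counts.update(unique)
        pair_counts.update(permutations(unique, 2))
    return tag_counts, pair_counts


def expand_tags(seed_tag, tag_counts, pair_counts, min_conf=0.6):
    """Add tags whose estimated P(other | seed_tag) is at least min_conf."""
    if tag_counts[seed_tag] == 0:
        return [seed_tag]  # unseen tag: nothing to expand with
    extra = [
        other
        for (first, other), n in pair_counts.items()
        if first == seed_tag and n / tag_counts[seed_tag] >= min_conf
    ]
    return [seed_tag] + sorted(extra)


# Toy training set, made up for this example.
training = [["python", "nlp"], ["python", "nlp", "nltk"],
            ["python", "django"], ["nlp", "nltk"]]
tag_counts, pair_counts = cooccurrence_model(training)
```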

Latent Dirichlet Allocation, pitfalls, tips and programs

I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice.
Which program is the "best", where best is some combination of easiest to use, best prior estimation, and fast?
How do I incorporate my intuitions about topicality? Let's say I think I know that some items in the corpus are really in the same category, like all articles by the same author. Can I add that into the analysis?
Any unexpected pitfalls or tips I should know before embarking?
I'd prefer it if there were R or Python front ends for whatever program, but I expect (and accept) that I'll be dealing with C.
http://mallet.cs.umass.edu/ is IMHO the most awesome plug-n-play LDA package out there. It uses Gibbs sampling to estimate topics and has a really straightforward command-line interface with a lot of extra bells-n-whistles (a few more complicated models, hyper-parameter optimization, etc.).
It's best to let the algorithm do its job. There may be variants of LDA (and pLSI, etc.) which let you do some sort of semi-supervised thing. I don't know of any at the moment.
I found that removing stop words and other really high-frequency words seemed to improve the quality of my topics a lot (evaluated by looking at the top words of each topic, not by any rigorous metric). I am guessing stemming/lemmatization would help as well.
You mentioned a preference for R: you can use two packages, topicmodels (slow) or lda (fast). Python has deltaLDA, pyLDA, Gensim, etc.
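To give an idea of what Gibbs sampling for LDA does under the hood, here is a minimal collapsed Gibbs sampler in plain Python. It is a teaching sketch under simple assumptions (symmetric alpha and beta, fixed iteration count, no hyperparameter optimization) and is not how MALLET or any of the packages above implement it; the function name and defaults are made up for this example.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on already-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]       # tokens in doc d with topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # times word v has topic k
    topic_total = [0] * num_topics                     # tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # p(topic t) is proportional to
                # (n_dt + alpha) * (n_tv + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, topic_word, doc_topic
```

The top words of topic k are the entries of vocab with the largest counts in topic_word[k]. For anything beyond a toy corpus, use one of the packages mentioned above.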
Topic modeling with specified topics or words is tricky out of the box. David Andrzejewski has some Python code that seems to do it. There is a C++ implementation of supervised LDA here. There are also plenty of papers on related approaches (DiscLDA, Labeled LDA), but not in an easy-to-use form, for me anyway...
As @adi92 says, removing stop words, white space, numbers and punctuation, as well as stemming, all improve things a lot. One possible pitfall is having the wrong (or an inappropriate) number of topics. Currently there are no straightforward diagnostics for how many topics are optimal for a corpus of a given size, etc. There are some measures of topic quality available in MALLET (fastest), which are very handy.
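The cleanup steps mentioned above can be sketched as follows. The stop-word list is a tiny placeholder, the max_doc_fraction cutoff is an assumed way to drop "really high-frequency" words, and stemming is left out to keep the example dependency-free.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; use a real one in practice.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def preprocess(docs, max_doc_fraction=0.9):
    """Lowercase, drop numbers/punctuation/stop words and near-ubiquitous words."""
    # The regex keeps only runs of letters, which removes digits,
    # punctuation and extra white space in one pass.
    tokenized = [
        [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOP_WORDS]
        for doc in docs
    ]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    limit = max_doc_fraction * len(docs)
    return [[t for t in doc if doc_freq[t] <= limit] for doc in tokenized]
```

The output is a list of token lists, which is the input shape most LDA implementations expect after a dictionary lookup.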
In addition to the usual sources, it seems like the most active area talking about this is on the topics-models listserv. From my initial survey, the easiest package to understand is the LDA Matlab package.
This is not lightweight stuff at all, so I'm not surprised it's hard to find good resources on it.
For this kind of analysis I have used LingPipe: http://alias-i.com/lingpipe/index.html. It is an open source Java library, parts of which I use directly or port. To incorporate your own data, you may use a classifier, such as naive Bayes, in conjunction. My experience with statistical NLP is limited, but it usually follows a cycle of setting up classifiers, training, looking over the results, and tweaking.
I second that. MALLET's LDA uses a SparseLDA data structure and distributed learning, so it's very fast. Switching on hyperparameter optimization will give a better result, IMO.
import matplotlib.pyplot as plt


def plot_top_words(model, feature_names, n_top_words, title):
    # model is assumed to expose components_ (one row of word weights per
    # topic), as scikit-learn's LatentDirichletAllocation does.
    # The 2x5 grid holds at most 10 topics.
    fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, in descending order.
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f'Topic {topic_idx + 1}', fontdict={'fontsize': 30})
        ax.invert_yaxis()  # largest weight at the top
        ax.tick_params(axis='both', which='major', labelsize=20)
        for i in 'top right left'.split():
            ax.spines[i].set_visible(False)
    fig.suptitle(title, fontsize=40)

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

Resources