I am exploring TensorFlow and would like to do sentiment analysis using the options available. I had a look at the following tutorial: http://www.tensorflow.org/tutorials/recurrent/index.html#language_modeling
I have worked with the Naive Bayes classifier, the Maximum Entropy algorithm and scikit-learn classifiers, and would like to know if TensorFlow offers any better algorithms. Is this the right place to start, or are there other options?
Any help pointing in the right direction would be greatly appreciated.
Thanks in advance.
A commonly used approach would be using a Convolutional Neural Network (CNN) to do sentiment analysis. You can find a great explanation/tutorial in this WildML blogpost. The accompanying TensorFlow code can be found here.
Another approach would be using an LSTM (or a related network). You can find example implementations online; a good starting point is this blogpost.
I would suggest you try a character-level LSTM. It has been shown to achieve state-of-the-art results in many text classification tasks, one of them being sentiment analysis.
I wrote a pretty lengthy article that you can find here, where I go through its implementation in TensorFlow line by line. The result is a model that is less than 100 MB in size and that achieves an accuracy of over 80% on a test set of 80,000 tweets.
Another approach that has proven to be very effective is to use a recursive neural network. You can read the paper from the Stanford NLP Group here.
For me, the easiest tutorial to follow was: https://pythonprogramming.net/data-size-example-tensorflow-deep-learning-tutorial/?completed=/train-test-tensorflow-deep-learning-tutorial/
It walks you through tf.train.AdamOptimizer().minimize(cost) and uses the Sentiment140 dataset (from Stanford, ~1 million examples of positive and negative sentiment).
I am currently exploring PU learning, that is, learning from positive and unlabeled data only. One of the publications [Zhang, 2009] asserts that it is possible to learn by modifying the loss function of a binary classifier with probabilistic output (for example, Logistic Regression). The paper states that one should optimize Balanced Accuracy.
Vowpal Wabbit currently supports five loss functions [listed here]. I would like to add a custom loss function where I optimize for AUC (ROC), or equivalently, following the paper: 1 - Balanced_Accuracy.
I am unsure where to start. Looking at the code reveals that I need to provide the 1st and 2nd derivatives and some other information. I could also run the standard algorithm with logistic loss and try to adjust l1 and l2 according to my objective (not sure if this is a good idea). I would be glad to get any pointers or advice on how to proceed.
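For reference, the quantity mentioned above (1 - Balanced_Accuracy) can be computed from hard predictions as in the sketch below. This is only an evaluation helper under my own assumptions, not a Vowpal Wabbit loss function: the function names are made up for illustration, and labels are assumed to be +1 and -1.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate.

    Labels are assumed to be +1 (positive) and -1 (negative or unlabeled).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == -1)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    return 0.5 * (tp / n_pos + tn / n_neg)


def balanced_error(y_true, y_pred):
    """The objective to minimize: 1 - balanced accuracy."""
    return 1.0 - balanced_accuracy(y_true, y_pred)
```

Because it works on thresholded predictions, this quantity is not differentiable, which is one reason it cannot be plugged in directly where derivatives are required.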
UPDATE
Further searching revealed that it is impossible/difficult to optimize for AUC in online learning: answer
I found two software suites that are immediately ready to do PU learning:
(1) SVM perf from Joachims
Use the ``-l 10'' option here!
(2) Sofia-ml
Use ``--loop_type roc'' option here!
In general you assign ``+1'' labels to your positive examples and ``-1'' to all unlabeled ones. Then you launch the training procedure followed by prediction.
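The labeling step could be sketched like this. I am assuming here that the training file uses the SVM-light style sparse format (one example per line, label followed by index:value pairs with increasing indices); please check each tool's documentation for the exact input it expects. The helper names are hypothetical.

```python
def to_svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...' with sorted indices."""
    parts = ["%+d" % label]
    for index in sorted(features):
        parts.append("%d:%g" % (index, features[index]))
    return " ".join(parts)


def pu_training_lines(positives, unlabeled):
    """Label known positives +1 and every unlabeled example -1."""
    lines = [to_svmlight_line(+1, f) for f in positives]
    lines += [to_svmlight_line(-1, f) for f in unlabeled]
    return lines


if __name__ == "__main__":
    # Each example is a dict mapping a 1-based feature index to its value.
    positives = [{1: 1.0, 3: 0.5}]
    unlabeled = [{2: 2.0}, {1: 0.25, 2: 1.0}]
    print("\n".join(pu_training_lines(positives, unlabeled)))
```

The resulting lines can be written to a file and passed to the trainer with the options mentioned above.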
Both packages give you some performance metrics. I would suggest using the standardized and well-established binary from the KDD'04 Cup: ``perf''. Get it here.
Hope it helps for those wondering how this works in practice. Perhaps I prevented the case XKCD
I want to use a Maximum Entropy classifier for sentiment analysis on tweets. My knowledge of statistics is very basic. Can you suggest some good tutorials or books on the Maximum Entropy classifier that explain in detail the steps required to implement one, including feature selection and the mathematical calculations involved? I have gone through various materials on the net, but haven't found anything very helpful in this regard. Thanks in advance.
I tried a naive Bayes classifier and it works very badly. SVM works a little better but is still horrible. Most of the papers I have read use SVM and naive Bayes with some variations (n-grams, POS tags, etc.), but all of them give me results close to 50% (the authors report 80% and higher, but I cannot reach the same accuracy on real data).
Are there any more powerful methods besides lexical analysis? SVM and naive Bayes assume that words are independent; this approach is called "bag of words". What if we assume that words are associated?
For example: use the Apriori algorithm to detect that if a sentence contains "bad and horrible", then there is a 70% probability that the sentence is negative. We could also use the distance between words, and so on.
Is this a good idea, or am I reinventing the wheel?
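To make the idea above concrete, here is a simplified pair-counting sketch. It is not the full Apriori algorithm (it only looks at pairs of words), and the function name, the "neg" label and the `min_support` threshold are my own assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_confidence(sentences, labels, min_support=2):
    """Return {(w1, w2): P(negative | both words occur)} for frequent pairs."""
    pair_total = Counter()
    pair_negative = Counter()
    for text, label in zip(sentences, labels):
        # Use the set of words so each pair counts once per sentence.
        words = sorted(set(text.lower().split()))
        for pair in combinations(words, 2):
            pair_total[pair] += 1
            if label == "neg":
                pair_negative[pair] += 1
    return {
        pair: pair_negative[pair] / total
        for pair, total in pair_total.items()
        if total >= min_support
    }


if __name__ == "__main__":
    sentences = ["bad and horrible", "bad horrible film", "bad horrible but fun"]
    labels = ["neg", "neg", "pos"]
    print(pair_confidence(sentences, labels))
```

Pairs with high confidence and enough support could then be added as extra features next to the single words.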
You're confusing a couple of concepts here. Neither Naive Bayes nor SVMs are tied to the bag of words approach. Neither SVMs nor the BOW approach have an independence assumption between terms.
Here are some things you can try:
include punctuation marks in your bags of words; esp. ! and ? can be helpful for sentiment analysis, while many feature extractors geared toward document classification throw them away
same for stop words: words like "I" and "my" may be indicative of subjective text
build a two-stage classifier; first determine whether any opinion is expressed, then whether it's positive or negative
try a quadratic kernel SVM instead of a linear one to capture interactions between features.
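The first two suggestions can be sketched with a small tokenizer that keeps "!" and "?" as tokens and does not filter stop words. The regular expression and the function name are only an illustration, not what any particular feature extractor does.

```python
import re
from collections import Counter

# Words (letters, digits, apostrophes) or a single '!' / '?' character.
TOKEN_RE = re.compile(r"[a-z0-9']+|[!?]")


def bag_of_words(text):
    """Count lowercase word tokens plus '!' and '?', keeping stop words."""
    return Counter(TOKEN_RE.findall(text.lower()))


if __name__ == "__main__":
    print(bag_of_words("I love my phone!!! Do you?"))
```

The resulting counts can be fed to whichever classifier you use, including the two-stage setup or the quadratic kernel SVM mentioned above.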
Algorithms like SVM, Naive Bayes and maximum entropy are supervised machine learning algorithms, and the output of your program depends on the training set you have provided.
For large-scale sentiment analysis I prefer an unsupervised learning method, in which one determines the sentiment of adjectives by clustering documents into same-oriented parts and labeling the clusters positive or negative. More information can be found in this paper:
http://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
Hope this helps you in your work :)
You can find some useful material on sentiment analysis using Python.
This presentation summarizes Sentiment Analysis as 3 simple steps
Labeling data
Preprocessing &
Model Learning
Sentiment analysis is an area of active, ongoing research. For an overview of the most recent and most successful approaches, I would generally advise you to have a look at the shared tasks of SemEval. They usually run a competition on sentiment analysis in Twitter every year. You can find the paper describing the task and the results for 2016 here (it might be a bit technical, though): http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_report.pdf
Starting from there, you can have a look in the papers describing the individual systems (as referenced there).
A group of people and I are developing a sentiment analysis algorithm. I would like to know which algorithms already exist, because I want to compare ours against them. Is there an article that covers the main algorithms in this area?
Thanks in advance
Thiago
Some of the papers on sentiment analysis may help you -
One of the earlier works by Bo Pang, Lillian Lee http://acl.ldc.upenn.edu/acl2002/EMNLP/pdfs/EMNLP219.pdf
A comprehensive survey of sentiment analysis techniques http://www.cse.iitb.ac.in/~pb/cs626-449-2009/prev-years-other-things-nlp/sentiment-analysis-opinion-mining-pang-lee-omsa-published.pdf
Study by Hang Cui, V Mittal, M Datar using 6-grams http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.5942&rep=rep1&type=pdf
For a quick implementation, naive Bayes is recommended. You can find an example here: http://nlp.stanford.edu/IR-book/
We did a statistical comparison of various classifiers and found SVM to be the most accurate, though for a dataset consisting of long documents ( http://ai.stanford.edu/~amaas/data/sentiment/ ) none of the methods worked well. Our study may not be accurate, though. Also, instead of treating sentiment analysis as a text classification problem, you can look at extracting meaning from the text, though I do not know how successful that might be.
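As a rough idea of what the quick naive Bayes implementation mentioned above might look like, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is my own illustration (whitespace tokenization, a made-up class name), not code from the book linked above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def log_scores(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(count / total_docs)  # log prior
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                score += math.log(
                    (self.word_counts[label][word] + 1) / (total_words + vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

Usage would be something like `NaiveBayes().fit(train_texts, train_labels).predict("great film")`, where `train_texts` and `train_labels` are your own data.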
Apparently NLTK, a Python natural language processing library, has one:
http://text-processing.com/demo/sentiment/
Probably worth having a look at it.
Does anyone know how to build automatic tagging (blog post/document) algorithm? Any example will be appreciated.
I agree with what Wooble is saying. However, the naïve solution is to simply write an algorithm that calculates the lexical similarities and differences between the given blog post and a corpus of text. This lexical difference will give you the words that appear more frequently in the blog post than in the corpus. From those words, you can infer a tag.
But I strongly recommend against it. Automatic tagging doesn't seem to work in practice. Just outsource the tagging work to your users or to services like Mechanical Turk.
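For anyone who still wants to try the naïve lexical comparison described above, a minimal sketch could look like this. The function name, the `min_count` threshold and the add-one smoothing are my own choices, not part of any library.

```python
from collections import Counter


def candidate_tags(post, corpus_texts, top_n=3, min_count=2):
    """Rank words in `post` by how over-represented they are versus the corpus."""
    post_counts = Counter(post.lower().split())
    corpus_counts = Counter()
    for text in corpus_texts:
        corpus_counts.update(text.lower().split())
    post_total = sum(post_counts.values())
    corpus_total = sum(corpus_counts.values())
    vocab = len(set(post_counts) | set(corpus_counts))

    def ratio(word):
        post_rate = post_counts[word] / post_total
        # Add-one smoothing so words absent from the corpus do not divide by zero.
        corpus_rate = (corpus_counts[word] + 1) / (corpus_total + vocab)
        return post_rate / corpus_rate

    # Only consider words that occur at least `min_count` times in the post.
    frequent = [w for w, c in post_counts.items() if c >= min_count]
    return sorted(frequent, key=lambda w: (-ratio(w), w))[:top_n]
```

Common words such as "the" and "and" get a low ratio because they are just as frequent in the corpus, so they drop out of the top candidates.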
Late response but also had this task for a course - so in case someone else is looking to explore this, here is a starting point:
If you are looking for simple solutions, or perhaps a machine learning exercise, you might view automatic tagging as a text categorization/classification task. Naive Bayes classifiers are simple tools to figure out, and there is plenty of pseudocode and material to help you understand them. The TF-IDF (term frequency-inverse document frequency) metric is something else you can look into: although commonly associated with information retrieval, it can be applied to this problem when combined with other machine learning techniques.
However, instead of assigning the new sample a single label as in the standard definition of the NB classifier, you will have to determine multiple labels. You can probably use the tag co-occurrence information from the training set to help you with this.
This is a simplistic and naive solution, and a lot of details on feature selection are left out (stemming to reduce the number of independent parameters, information gain, etc.). There are plenty of easily accessible papers on this research topic to try it out!
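As a starting point for the tag co-occurrence idea, here is a small sketch: it counts which tags appear together in the training set and extends a single predicted tag with its most frequent companions. The function names and the `extra` parameter are hypothetical, and the primary tag is assumed to come from whatever classifier you trained.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(tag_sets):
    """Count how often each ordered pair of tags appears on the same document."""
    counts = defaultdict(Counter)
    for tags in tag_sets:
        for a, b in permutations(set(tags), 2):
            counts[a][b] += 1
    return counts


def expand_tags(primary_tag, cooccurrence, extra=2):
    """Return the primary tag plus the tags that most often accompany it."""
    ranked = sorted(
        cooccurrence[primary_tag].items(), key=lambda kv: (-kv[1], kv[0])
    )
    return [primary_tag] + [tag for tag, _ in ranked[:extra]]


if __name__ == "__main__":
    training_tags = [
        ["python", "django", "web"],
        ["python", "django"],
        ["python", "numpy"],
        ["java", "web"],
    ]
    cooccurrence = build_cooccurrence(training_tags)
    print(expand_tags("python", cooccurrence))
```

Ties are broken alphabetically here only to keep the output deterministic; in practice you would probably want a minimum co-occurrence count as well.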