What is the proper way to deal with (score) dispersion in sentiment analysis on different topics in relation - sentiment-analysis

I'm analyzing sentiment on a social network, using a set of related topics as input. How should we deal with dispersion in the scores of the individual topics?
For example: we are trying to score sentiment on a theme, which is an event covered by several keywords. Let's say the theme is Innovation Week, with the following topics (keywords or synonyms):
Innovation week = {"innovation week", "data solution", "emerging technologies", "august 30"...}.
What if the standard deviation of the scores is very large?
Do we question:
The sentiment analysis algorithm itself?
Our input keywords?
Or do we just take the results as they are, since they represent people's differing views at different levels of granularity within a theme? The ultimate purpose is to get a general insight into the theme.
I think the question is simple, although this is a concern for any sentiment analysis study of social networks.

The short answer is both the algorithm and the input keywords, as they depend on each other.
Given the right input the dispersion would increase with any algorithm, and given the wrong algorithm the same will happen for any input.
Usually in these cases you should revise the algorithm first, as that is the cause in most situations.
You can also read this to understand it better:
http://www.cs.cornell.edu/home/llee/omsa/omsa-published.pdf
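One rough way to see where the dispersion comes from is to compare the spread inside each keyword with the spread over the whole theme. The sketch below assumes you already have (keyword, score) pairs with scores in [-1, 1]; the data and the function name are made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical (keyword, sentiment score) pairs; scores assumed to lie in [-1, 1].
scored_posts = [
    ("innovation week", 0.6), ("innovation week", 0.5), ("innovation week", 0.7),
    ("data solution", -0.4), ("data solution", -0.5), ("data solution", -0.3),
]

def dispersion_by_keyword(pairs):
    """Return {keyword: (mean, population std dev)} for the scores of each keyword."""
    groups = {}
    for keyword, score in pairs:
        groups.setdefault(keyword, []).append(score)
    return {k: (mean(v), pstdev(v)) for k, v in groups.items()}

# Spread over the whole theme versus spread inside each keyword.
overall_std = pstdev([score for _, score in scored_posts])
per_keyword = dispersion_by_keyword(scored_posts)
```

If every keyword is tight on its own but the overall spread is large, as in this toy data, the keywords are probably picking up genuinely different opinions. If the spread is large inside a single keyword too, the algorithm is the more likely suspect.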

If you are not sure about your algorithm, you could use NLTK's VADER sentiment analyzer to cross-check the results. But it could also be that the opinions really are so different that the standard deviation of the scores is large.
Do you have test data to evaluate your algorithm? If not, you should get some anyway, so that you can compute the standard measurements for the algorithm.
Standard Measurements
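Assuming the standard measurements meant here are the usual precision, recall and F1, a minimal sketch of computing them on a small hand-labeled test set could look like this. The labels and predictions are invented for the example.

```python
def precision_recall_f1(gold, predicted, positive="pos"):
    """Precision, recall and F1 for one class, from parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold labels and the labels your algorithm produced.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```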

Related

Good algorithm for sentiment analysis

I tried a naive Bayes classifier and it works very badly. SVM works a little better but is still horrible. Most of the papers I have read use SVM and naive Bayes with some variations (n-grams, POS tags, etc.), but all of them give me results close to 50% (the authors talk about 80% and higher, but I can't get the same accuracy on real data).
Are there any more powerful methods besides lexical analysis? SVM and naive Bayes assume that words are independent. This approach is called "bag of words". What if we assume that words are associated?
For example: use the Apriori algorithm to detect that if a sentence contains "bad" and "horrible", then there is a 70% probability that the sentence is negative. We could also use the distance between words, and so on.
Is this a good idea, or am I reinventing the wheel?
You're confusing a couple of concepts here. Neither Naive Bayes nor SVMs are tied to the bag-of-words approach. Neither SVMs nor the BOW approach has an independence assumption between terms.
Here are some things you can try:
include punctuation marks in your bags of words; esp. ! and ? can be helpful for sentiment analysis, while many feature extractors geared toward document classification throw them away
same for stop words: words like "I" and "my" may be indicative of subjective text
build a two-stage classifier; first determine whether any opinion is expressed, then whether it's positive or negative
try a quadratic kernel SVM instead of a linear one to capture interactions between features.
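The first two suggestions can be sketched with a small feature extractor that keeps "!" and "?" as tokens and does not filter stop words. The regular expression and function name are my own choices, not part of any particular library.

```python
import re
from collections import Counter

# Words (with apostrophes) or a single '!' / '?'; everything else is dropped.
TOKEN_PATTERN = re.compile(r"[a-z']+|[!?]")

def bag_of_words(text):
    """Count word tokens plus '!' and '?', without removing stop words."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

features = bag_of_words("I love my new phone! Why did I wait?!")
```

The resulting counts can be fed to whatever classifier you already use; words like "I" and "my" survive as features instead of being thrown away.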
Algorithms like SVM, Naive Bayes and maximum entropy ones are supervised machine learning algorithms and the output of your program depends on the training set you have provided.
For large-scale sentiment analysis I prefer using an unsupervised learning method, in which one can determine the sentiment of adjectives by clustering documents into same-oriented parts and labeling the clusters positive or negative. More information can be found in this paper:
http://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
Hope this helps you in your work :)
You can find some useful material on sentiment analysis using Python.
This presentation summarizes Sentiment Analysis as 3 simple steps
Labeling data
Preprocessing &
Model Learning
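As a rough illustration of those three steps, here is a tiny multinomial naive Bayes written from scratch. It is a sketch on a made-up four-sentence training set, not what the presentation itself uses.

```python
import math
from collections import Counter, defaultdict

# Step 1: labeled data (a tiny hypothetical training set).
labeled = [
    ("great phone, love it", "pos"),
    ("love the great screen", "pos"),
    ("horrible battery, bad phone", "neg"),
    ("bad screen, horrible support", "neg"),
]

# Step 2: preprocessing (lowercase, strip commas, split on whitespace).
def preprocess(text):
    return text.lower().replace(",", " ").split()

# Step 3: model learning (multinomial naive Bayes with add-one smoothing).
def train(examples):
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(preprocess(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(model, text):
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in preprocess(text):
            if word in vocabulary:  # ignore words never seen in training
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(labeled)
```

With real data, most of the effort goes into steps 1 and 2; the learning step is the easy part to swap out.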
Sentiment analysis is an area of ongoing research, and a lot of work is going on right now. For an overview of the most recent and most successful approaches, I would generally advise you to have a look at the shared tasks of SemEval. Usually, every year they run a competition on sentiment analysis in Twitter. You can find the paper describing the task and the results for 2016 here (it might be a bit technical, though): http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_report.pdf
Starting from there, you can have a look in the papers describing the individual systems (as referenced there).

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records the answer to the question, what do you want to do before you die?
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
Is there another or better way to do something like this?
Is this any different than string similarity?
Is this the right question to be asking?
The idea here is to have people write in a text box over and over again, and for me to provide a number that says, generally speaking, that 802 people wrote approximately the same thing.
It is much more difficult than string similarity. This is what you need to do at a minimum:
Perform some text formatting/cleaning tasks, like removing punctuation characters and common "stop words".
Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
Calculate a weight for every term.
Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space).
Run a clustering algorithm on the document vectors.
Read a good statistical natural language processing book, or search Google for good introductions and tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use the library anyway.
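The five steps above can be sketched in plain Python as follows. The stop-word list, the TF-IDF variant and the greedy grouping rule are all simplifying assumptions on my part; a real system would use a proper stemmer and a real clustering algorithm.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "to", "i", "want", "in", "of", "and"}  # tiny illustrative list

def tokenize(answer):
    # Step 1: lowercase, drop punctuation and stop words.
    return [w for w in re.findall(r"[a-z]+", answer.lower()) if w not in STOP_WORDS]

def tfidf_vectors(answers):
    # Steps 2-4: document frequencies, IDF weights, sparse TF-IDF vectors.
    docs = [Counter(tokenize(a)) for a in answers]
    doc_freq = Counter(term for doc in docs for term in doc)
    n = len(docs)
    return [{t: tf * (math.log(n / doc_freq[t]) + 1.0) for t, tf in doc.items()}
            for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def group(answers, threshold=0.3):
    # Step 5: greedy grouping; join the first group whose first member is similar enough.
    vectors = tfidf_vectors(answers)
    groups = []  # each group is a list of answer indices
    for i, vec in enumerate(vectors):
        for g in groups:
            if cosine(vec, vectors[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

answers = [
    "I want to skydive",
    "Skydive in New Zealand",
    "Write a novel",
    "write and publish a novel",
    "Skydive!",
]
groups = group(answers)
```

The size of each group is then the count you are after. Note that without stemming, "skydive" and "skydiving" would be treated as unrelated terms.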
The Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
[...]
What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "Construct a document vector for every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore the syntax. "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford and Sussex, and at Google, working in this area.
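A quick way to convince yourself of the word-order problem is to add up toy word vectors. The two-dimensional vectors below are invented purely for the demonstration.

```python
# Toy 2-dimensional word vectors; the numbers are made up for illustration.
WORD_VECTORS = {"man": (1.0, 0.0), "bites": (0.0, 1.0), "dog": (0.5, 0.5)}

def additive_composition(sentence):
    """Sum the word vectors of a sentence, component by component."""
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split()]
    return tuple(sum(components) for components in zip(*vectors))

a = additive_composition("Man bites dog")
b = additive_composition("Dog bites man")
```

Both sentences end up with the same vector, because addition does not care about the order of its operands.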

how to categorize but don't use Classification or Clustering algorithms?

I have a crawler program that stores sports data from 7 different news agencies every day. It stores about 1200 sports news items per day.
I want to categorize the news of the last two days into sub-categories. So every two days I have about 2400 news items that belong to exactly those days, and many of them are about exactly the same event.
for example:
70 news items are about Brad Keselowski's 500-mile race.
120 news items are about US swimmer Nyad beginning her swim.
28 news items are about the match between Man United and Man City.
. . .
In other words, I want to make something like Google News.
The problem is that this is not a classification problem, because I don't have fixed classes. For example, my classes are not swimming, golf, football, etc.; my classes are the specific events in every field that happened in those two days. So I cannot use classification algorithms such as Naive Bayes.
On the other hand, my problem cannot be solved with clustering algorithms either, because I don't want to force the news items into n clusters. Maybe one news item has no similar items at all, or maybe in one two-day batch there are 12 different stories while in another there are 30. So I cannot use clustering algorithms such as "Single Link (Maximum Similarity)", "Complete Link (Minimum Similarity)", "Maximum Weighted Matching" or "Group Average (Average Intra Similarity)".
I have some ideas of my own for doing this; for example, any two news items that have 10 words in common should be in the same class. But if we don't consider parameters such as the length of the documents, the influence of common and rare words, and so on, this will not work well.
I have read this paper, but it was not my answer.
Is there any known algorithm to solve this problem?
The problem strikes me as a clustering problem with an unknown quality measure for the clusters. That points to an unsupervised method, which is ultimately based on detecting correlations using redundancy in the data. Perhaps something like principal component analysis or latent semantic analysis could be useful. The different dimensions (principal components or singular vectors) would indicate distinct major themes, with the terms corresponding to the vector components hopefully being the words appearing in the description. One drawback is that there's no guarantee that the strongest correlations would lead easily to a sensible description.
Take a look at "topic models" and "Latent Dirichlet Allocation". These are popular and you'll find code in a variety of languages.
You might use hierarchical clustering algorithms to investigate relationships between your items - the closest items (news with almost the same description) would be in the same clusters, and the closest clusters (groups of similar news) would be in the same super-cluster etc.
Also, there is a pretty nice and fast algorithm called CLOPE - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.7142&rep=rep1&type=pdf
There are many document clustering algorithms out there. Take a look at "Hierarchical document clustering using frequent itemsets", for example, and see if that is similar to what you want. If you're programming in Java, you may get some mileage out of the S-space package, which includes algorithms for latent semantic analysis (LSA) among others.
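On the worry about being forced into n clusters: hierarchical (single-link) clustering can be cut at a similarity threshold instead of at a fixed number of clusters, so the number of stories falls out of the data and an item with no neighbours stays alone. A minimal sketch, assuming Jaccard overlap of raw words as the similarity (a weighted measure such as TF-IDF would handle common and rare words better):

```python
def jaccard(a, b):
    """Share of words two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_stories(headlines, threshold=0.3):
    """Single-link grouping: items whose word sets overlap enough end up together."""
    word_sets = [set(h.lower().split()) for h in headlines]
    parent = list(range(len(headlines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(word_sets)):
        for j in range(i + 1, len(word_sets)):
            if jaccard(word_sets[i], word_sets[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(headlines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical headlines for illustration.
headlines = [
    "Keselowski wins 500 miles race",
    "Brad Keselowski wins the 500 miles race",
    "Nyad begins swim",
    "Swimmer Nyad begins record swim",
    "Man United beat Man City",
]
stories = group_stories(headlines)
```

The threshold of 0.3 is arbitrary and would need tuning on real data.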

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm?

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm? Why or why not?
Basically, I'm trying to figure out why the Wikipedia page for Statistical Classification does not mention LSI. I'm just getting into this stuff and I'm trying to see how all the different approaches for classifying something relate to one another.
No, they're not quite the same. Statistical classification is intended to separate items into categories as cleanly as possible -- to make a clean decision about whether item X is more like the items in group A or group B, for example.
LSI is intended to show the degree to which items are similar or different and, primarily, to find items that show a degree of similarity to a specified item. While this is similar, it's not quite the same.
LSI/LSA is ultimately a technique for dimensionality reduction, and is usually coupled with a nearest-neighbor algorithm to make it into a classification system. In itself, it's only a way of "indexing" the data in a lower dimension using SVD.
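To make the "SVD plus nearest neighbor" description concrete, here is the usual textbook formulation as a sketch. The symbols are my own notation, not the answer's: A is the term-document matrix, k the number of dimensions kept, q the term vector of a new document, and \hat{d}_j the j-th row of V_k (the reduced representation of document j).

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\operatorname{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

The first two expressions are the indexing part. Classification only appears once you assign the new document the label of the indexed document (or documents) with the highest similarity.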
Have you read about LSI on Wikipedia? It says it uses matrix factorization (SVD), which in turn is sometimes used in classification.
The primary distinction in machine learning is between "supervised" and "unsupervised" modeling.
Usually the words "statistical classification" refer to supervised models, but not always.
With supervised methods the training set contains a "ground-truth" label that you build a model to predict. When you evaluate the model, the goal is to predict the best guess at (or probability distribution of) the true label, which you will not have at time of evaluation. Often there's a performance metric and it's quite clear what the right vs wrong answer is.
Unsupervised classification methods attempt to cluster a large number of data points which may appear to vary in complicated ways into a smaller number of "similar" categories. Data in each category ought to be similar in some kind of 'interesting' or 'deep' way. Since there is no "ground truth" you can't evaluate 'right or wrong', but 'more' vs 'less' interesting or useful.
Similarly, at evaluation time you can place each new example into one of the clusters (crisp classification), or give some kind of weighting that quantifies how similar or different it looks compared with the "archetype" of each cluster.
So in some ways supervised and unsupervised models can yield something which is a "prediction", prediction of class/cluster label, but they are intrinsically different.
Often the goal of an unsupervised model is to provide more intelligent and powerfully compact inputs for a subsequent supervised model.

Is there an algorithm that tells the semantic similarity of two phrases

input: phrase 1, phrase 2
output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing
You might want to check out this paper:
Sentence similarity based on semantic nets and corpus statistics (PDF)
I've implemented the algorithm described. Our context was very general (effectively any two English sentences) and we found the approach taken was too slow and the results, while promising, not good enough (or likely to be so without considerable, extra, effort).
You don't give a lot of context so I can't necessarily recommend this but reading the paper could be useful for you in understanding how to tackle the problem.
Regards,
Matt.
There's a short and a long answer to this.
The short answer:
Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.
The long answer:
Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate representation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.
So, if you're interested in exploring this problem a little more in-depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state-of-the-art of distributional methods, which use word co-occurrence statistics to define a measure for word similarity. You are not likely to find libraries implementing these, however.
For anyone just coming to this, I would suggest taking a look at SEMILAR - http://www.semanticsimilarity.org/ . They implement a lot of the modern research methods for calculating word and sentence similarity. It is written in Java.
SEMILAR API comes with various similarity methods based on Wordnet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, Meteor, Pointwise Mutual Information (PMI), Dependency based methods, optimized methods based on Quadratic Assignment, etc. And the similarity methods work in different granularities - word to word, sentence to sentence, or bigger texts.
You might want to check into the WordNet project at Princeton University. One possible approach to this would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.) Then for each of the remaining words in each phrase, you could compute the semantic "similarity" between each of the words in the other phrase using a distance measure based on WordNet. The distance measure could be something like: the number of arcs you have to pass through in WordNet to get from word1 to word2.
Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
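The arc-counting idea can be tried out on a hand-made graph before touching WordNet itself. Everything below is a toy stand-in: the edges are invented, and 1 / (1 + distance) is just one simple way to turn a distance into a similarity between 0 and 1.

```python
from collections import deque

# A hand-made "is-a" graph standing in for WordNet; the edges are illustrative only.
TOY_EDGES = [
    ("chair", "seat"), ("stool", "seat"), ("seat", "furniture"),
    ("table", "furniture"), ("furniture", "artifact"),
    ("car", "vehicle"), ("vehicle", "artifact"),
]

def build_graph(edges):
    """Undirected adjacency sets, so arcs can be walked in both directions."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def arc_distance(graph, start, goal):
    """Breadth-first search: number of arcs on the shortest path, or None."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def path_similarity(graph, a, b):
    dist = arc_distance(graph, a, b)
    return None if dist is None else 1.0 / (1.0 + dist)

GRAPH = build_graph(TOY_EDGES)
```

For whole phrases you would still have to decide how to combine the word-to-word scores, for example by taking the best match for each word and averaging.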
I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index but with semantically related terms being closer together i.e. having a smaller angle between them. If I learn more I will post here.
Sorry to dig up a 6 year old question, but as I just came across this post today, I'll throw in an answer in case anyone else is looking for something similar.
cortical.io has developed a process for calculating the semantic similarity of two expressions and they have a demo of it up on their website. They offer a free API providing access to the functionality, so you can use it in your own application without having to implement the algorithm yourself.
One simple solution is to use the dot product of character n-gram vectors. This is robust to changes in word order (which many edit distance metrics are not) and captures many issues around stemming. It also sidesteps the AI-complete problem of full semantic understanding.
To compute the n-gram vector, just pick a value of n (say, 3), and hash every 3-character sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of different vectors to detect similarity.
This approach has been described in
J. Mitchell and M. Lapata, “Composition in Distributional Models of Semantics,” Cognitive Science, vol. 34, no. 8, pp. 1388–1429, Nov. 2010., DOI 10.1111/j.1551-6709.2010.01106.x
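A minimal sketch of the character n-gram approach, assuming a plain dictionary keyed by n-gram in place of a fixed-size hashed vector (the function names are mine):

```python
import math
from collections import Counter

def ngram_vector(phrase, n=3):
    """Unit-length sparse vector of character n-gram counts."""
    text = " ".join(phrase.lower().split())  # lowercase, collapse whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}

def ngram_similarity(a, b, n=3):
    """Dot product of the two unit vectors, a value between 0 and 1."""
    va, vb = ngram_vector(a, n), ngram_vector(b, n)
    return sum(w * vb.get(gram, 0.0) for gram, w in va.items())
```

Reordering the words of a phrase keeps most trigrams intact, so `ngram_similarity("similar phrases", "phrases similar")` stays high even though an edit distance would call the two strings very different.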
I would have a look at statistical techniques that take into consideration the probability of each word appearing in a sentence. This will allow you to give less importance to common words such as 'and', 'or', 'the' and more importance to words that appear less regularly, and that are therefore a better discriminating factor. For example, if you have two sentences:
1) The smith-waterman algorithm gives you a similarity measure between two strings.
2) We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.
The fact that the two sentences share the words "smith-waterman" and "algorithm" (which are not as common as 'and', 'or', etc.) allows you to say that the two sentences might indeed be talking about the same topic.
Summarizing, I would suggest you have a look at:
1) String similarity measures;
2) Statistical methods;
Hope this helps.
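One common way to put the "rare words count more" idea into numbers is inverse document frequency (IDF). The sketch below is an assumption about how you might do it, with a made-up four-sentence background corpus; the score is simply the summed IDF of the words two sentences share.

```python
import math
import re

def tokens(sentence):
    """Lowercased set of words; hyphenated terms stay in one piece."""
    return set(re.findall(r"[a-z-]+", sentence.lower()))

def idf_weights(corpus):
    """log(N / document frequency): words found in every sentence get weight 0."""
    doc_sets = [tokens(s) for s in corpus]
    n = len(doc_sets)
    vocabulary = set().union(*doc_sets)
    return {w: math.log(n / sum(w in d for d in doc_sets)) for w in vocabulary}

def shared_weight(a, b, idf):
    """Sum of the IDF weights of the words both sentences contain."""
    return sum(idf.get(w, 0.0) for w in tokens(a) & tokens(b))

s1 = "The smith-waterman algorithm gives you a similarity measure between two strings."
s2 = "We have reviewed the smith-waterman algorithm and we found it to be good enough for our project."
s3 = "The weather was good and the food was cheap."  # hypothetical filler sentence
s4 = "We walked to the station and took the train."  # hypothetical filler sentence
idf = idf_weights([s1, s2, s3, s4])
```

Here s1 and s2 score higher than s3 and s4, even though the latter pair also shares two words, because "the" and "and" carry little or no weight.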
Try SimService, which provides a service for computing top-n similar words and phrase similarity.
This requires that your algorithm actually knows what you're talking about. It can be done in some rudimentary form by just comparing words and looking for synonyms etc., but any sort of accurate result would require some form of intelligence.
Take a look at this paper, which describes an algorithm called Word Mover's Distance that tries to uncover semantic similarity: http://mkusner.github.io/publications/WMD.pdf
It relies on the similarity scores given by word2vec. Integrating this with GoogleNews-vectors-negative300 yields desirable results.
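To get a feel for the idea without downloading the embeddings, here is a sketch of the relaxed variant discussed in the same paper, where every word simply travels to its nearest word in the other phrase. It is not the full optimal-transport distance, it weights all words equally, and the two-dimensional embeddings are invented; it also needs Python 3.8+ for `math.dist`.

```python
import math

# Made-up 2-dimensional embeddings standing in for word2vec vectors.
TOY_EMBEDDINGS = {
    "obama": (1.0, 0.9), "president": (0.9, 1.0),
    "speaks": (0.1, 0.8), "greets": (0.2, 0.9),
    "media": (0.8, 0.1), "press": (0.9, 0.2),
    "banana": (-1.0, -1.0), "ripens": (-0.8, -0.9), "slowly": (-0.9, -0.7),
}

def relaxed_wmd(doc_a, doc_b, embeddings=TOY_EMBEDDINGS):
    """Each word of doc_a travels to its nearest word in doc_b; average the cost."""
    costs = [min(math.dist(embeddings[a], embeddings[b]) for b in doc_b) for a in doc_a]
    return sum(costs) / len(costs)

def relaxed_wmd_symmetric(doc_a, doc_b):
    """Take the larger of the two one-directional relaxations."""
    return max(relaxed_wmd(doc_a, doc_b), relaxed_wmd(doc_b, doc_a))

close = relaxed_wmd_symmetric(["obama", "speaks", "media"], ["president", "greets", "press"])
far = relaxed_wmd_symmetric(["obama", "speaks", "media"], ["banana", "ripens", "slowly"])
```

A smaller value means the phrases are closer in meaning, so `close` comes out well below `far` even though the first two phrases share no words.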

Resources