Latent Dirichlet Allocation, pitfalls, tips and programs - algorithm

I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice.
Which program is the "best", where best is some combination of easiest to use, best prior estimation, fast
How do I incorporate my intuitions about topicality. Let's say I think I know that some items in the corpus are really in the same category, like all articles by the same author. Can I add that into the analysis?
Any unexpected pitfalls or tips I should know before embarking?
I'd prefer is there are R or Python front ends for whatever program, but I expect (and accept) that I'll be dealing with C.

http://mallet.cs.umass.edu/ is IMHO the most awesome plug-n-play LDA package out there.. It uses Gibbs sampling to estimate topics and has a really straightforward command-line interface with a lot of extra bells-n-whistles (a few more complicated models, hyper-parameter optimization, etc)
Its best to let the algorithm do its job. There may be variants of LDA (and pLSI,etc) which let you do some sort of semi-supervised thing.. I don't know of any at the moment.
I found removing stop-words and other really high-frequency words seemed to improve the quality of my topics a lot (evaluated by looking at top words of each topic, not any rigorous metric).. I am guessing stemming/lemmatization would help as well.

You mentioned a preference for R, you can use two packages topicmodels (slow) or lda (fast). Python has deltaLDA, pyLDA, Gensim, etc.
Topic modeling with specified topics or words is tricky out-of-the-box, David Andrzejewski has some Python code that seems to do it. There is a C++ implementation of supervised LDA here. And plenty of papers on related approaches (DiscLDA, Labeled LDA but not in an easy-to-use form, for me anyway...
As #adi92 says, removing stopwords, white spaces, numbers, punctuation and stemming all improve things a lot. One possible pitfall is having the wrong (or an inappropriate) number of topics. Currently there are no straightforward diagnostics for how many topics are optimum for a coprus of a give size, etc. There are some measures of topic quality available in MALLET (fastest), which are very handy.

In addition to the usual sources, it seems like the most active area talking about this is on the topics-models listserv. From my initial survey, the easiest package to understand is the LDA Matlab package.
This is not lightweight stuff at all, so I'm not surprised it's hard to find good resources on it.

For this kind of analysis I have used LingPipe: http://alias-i.com/lingpipe/index.html. It is an open source Java library, parts of which I use directly or port. To incorporate your own data, you may use a classifier, such as naive bayes, in conjunction. my experiences with statistical nlp is limited, but it usually follows a cycle of setting up classifiers, training, and looking over results, tweaking.

i second that. Mallet's lda uses a sparselda data structure and distributed learning, so its v fast. switching on hyperparameter optimization will give a better result, imo.

def plot_top_words(model, feature_names, n_top_words, title):
fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
axes = axes.flatten()
for topic_idx, topic in enumerate(model.components_):
top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
top_features = [feature_names[i] for i in top_features_ind]
weights = topic[top_features_ind]
ax = axes[topic_idx]
ax.barh(top_features, weights, height=0.7)
ax.set_title(f'Topic {topic_idx +1}',
fontdict={'fontsize': 30})
ax.invert_yaxis()
ax.tick_params(axis='both', which='major', labelsize=20)
for i in 'top right left'.split():
ax.spines[i].set_visible(False)
fig.suptitle(title, fontsize=40)
plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
plt.show()

Related

Methods to identify duplicate questions on Twitter?

As stated in the title, I'm simply looking for algorithms or solutions one might use to take in the twitter firehose (or a portion of it) and
a) identify questions in general
b) for a question, identify questions that could be the same, with some degree of confidence
Thanks!
(A)
I would try to identify questions using machine learning and the Bag of Words model.
Create a labeled set of twits, and label each of them with a binary
flag: question or not question.
Extract the features from the training set. The features are traditionally words, but at least for any time I tried it - using bi-grams significantly improved the results. (3-grams were not helpful for my cases).
Build a classifier from the data. I usually found out SVM gives better performance then other classifiers, but you can use others as well - such as Naive Bayes or KNN (But you will probably need feature selection algorithm for these).
Now you can use your classifier to classify a tweet.1
(B)
This issue is referred in the world of Information-Retrieval as "duplicate detection" or "near-duplicate detection".
You can at least find questions which are very similar to each other using Semantic Interpretation, as described by Markovitch and Gabrilovich in their wonderful article Wikipedia-based Semantic Interpretation for Natural Language Processing. At the very least, it will help you identify if two questions are discussing the same issues (even though not identical).
The idea goes like this:
Use wikipedia to build a vector that represents its semantics, for a term t, the entry vector_t[i] is the tf-idf score of the term i as it co-appeared with the term t. The idea is described in details in the article. Reading the 3-4 first pages are enough to understand it. No need to read it all.2
For each tweet, construct a vector which is a function of the vectors of its terms. Compare between two vectors - and you can identify if two questions are discussing the same issues.
EDIT:
On 2nd thought, the BoW model is not a good fit here, since it ignores the position of terms. However, I believe if you add NLP processing for extracting feature (for examples, for each term, also denote if it is pre-subject or post-subject, and this was determined using NLP procssing), combining with Machine Learning will yield pretty good results.
(1) For evaluation of your classifier, you can use cross-validation, and check the expected accuracy.
(2) I know Evgeny Gabrilovich published the implemented algorithm they created as an open source project, just need to look for it.

Good algorithm for sentiment analysis

I tried naive bayes classifier and it's working very bad. SVM works a little better but still horrible. Most of the papers which i read about SVM and naive bayes with some variations(n-gram, POS etc) but all of them gives results close to 50% (authors of articles talk about 80% and high but i cannt to get same accurate on real data).
Is there any more powerfull methods except lexixal analys? SVM and Bayes suppose that words independet. These approach called "bag of words". What if we suppose that words are associated?
For example: Use apriory algorithm to detect that if sentences contains "bad and horrible" then 70% probality that sentence is negative. Also we can use distance between words and so on.
Is it good idea or i'm inventing bicycle?
You're confusing a couple of concepts here. Neither Naive Bayes nor SVMs are tied to the bag of words approach. Neither SVMs nor the BOW approach have an independence assumption between terms.
Here's some things you can try:
include punctuation marks in your bags of words; esp. ! and ? can be helpful for sentiment analysis, while many feature extractors geared toward document classification throw them away
same for stop words: words like "I" and "my" may be indicative of subjective text
build a two-stage classifier; first determine whether any opinion is expressed, then whether it's positive or negative
try a quadratic kernel SVM instead of a linear one to capture interactions between features.
Algorithms like SVM, Naive Bayes and maximum entropy ones are supervised machine learning algorithms and the output of your program depends on the training set you have provided.
For large scale sentiment analysis I prefer using unsupervised learning method in which one can determine the sentiments of the adjectives by clustering documents into same-oriented parts, and label the clusters positive or negative. More information can be found out from this paper.
http://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
Hope this helps you in your work :)
You can find some useful material on Sentimnetal analysis using python.
This presentation summarizes Sentiment Analysis as 3 simple steps
Labeling data
Preprocessing &
Model Learning
Sentiment Analysis is an area of ongoing research. And there is a lot of research going on right now. For an overview of the most recent, most successful approaches, I would generally advice you to have a look at the shared tasks of SemEval. Usually, every year they run a competition on Sentiment Analysis in Twitter. You can find the paper describing the task, and the results for 2016 here (might be a bit technical though): http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_report.pdf
Starting from there, you can have a look in the papers describing the individual systems (as referenced there).

Using Sentiment Analysis to Detect Contradictory Arguments?

I don't have much background in sentiment analysis or natural language processing at all, but I have been reading a bit about it in my spare time. I would like to conduct and experiment to analyze forum threads/comments such as reddit, digg, blogs, etc. I'm particularity interested in doing something like counting the number of for, against, and neutral comments for threads of heated religious and political debates. Here's what I am thinking.
1) Find a thread that the original poster has defined a touchy political or religious topic.
2) For each comment categorize it as supporting the original poster or otherwise taking a contradicting or neutral stance.
3) Compare various mediums with the numbers of for or against arguments to determine what platforms are good "debate platforms" (i.e. balanced argument counts).
One big problem that I'm anticipating is that heated topics will invoke strong reactions from both supporting and contradicting parties so a simple happy/sad sentiment analysis won't cut it. I'm just sort of interested in this project for my own curiosities, so if anyone knows of similar research or utilities to conduct this experiment I'd be interested to hear more.
Can someone recommend a good sentiment analysis, word dictionary, training set, etc. for this task?
IMHO this is not possible without running into semantics. Consider the sentence:
Unlike many others, I am not against the abolishment of capital punishment.
Your AI may need to recognise idiomatic subfrases like "not against", or other "not ..." snippets. This is not impossible ;-)
An additional problem is, that "not" is more or less a stopword, its rank will probably be in the top-100, causing a low entropy (though it has a high "semantic" value to every sentence where it is unsed). Also note that omitting "the abolishment of", will cause the "polarity" of the sentence to flip as well.
You can try to use the bag of words [or even better: use n-grams as tokens to the bag]
The approach is basically:
Classify a set of examples, let your algorithm extract the relevant
words from the classified examples.
When a new comment is given, extract the relevant words, and use
k-nearest neighbors to decide if the new comment is a
pro/against/neutral.
Also, you might want to have a look on Apache Mahout.

Separation and pattern matching techniques

I am new to Artificial Neural Networks.
I am interested in an application like this:
I have a significantly large set of objects. Each object has six properties, denoted by P1–P6. Each property has a value which is a symbolic value. In other words, in my example P1–P6 can have a value from the set {A, B, C, D, E, F}. They are not numeric. (Suppose A,B,C,D,E,F are colours; then you will understand my idea.)
Now, there is another property R that I am interested in. Suppose
R = {G1, G2, G3, G4, G5}
I need to train a system for a large set of P1–P6 and the relevant R. Now I want to do the following.
I have an object and I know the values of P1 to P6. I need to find
the R (The Group that the object belongs.)
To get a desired R what is the pattern I need to have in P1–P6.
As an example given that R = G2 I need to figure out any pattern in P1–P6.
My questions are:
What are the theories/technologies/techniques I should read and
learn in order to implement 1 and 2, respectively?
What are the tools/libraries you can recommend to get this
simulated/implemented/tested?
The way you described your problem, you need to look up various machine learning techniques. If it were me, I would try and read about k-NN (k Nearest Neighbours) for the classification. When I say classification, I mean getting the R if you know P1-P6. It is a really simple technique and should be helpful here.
As for the other way around, what you basically need is a representative sample of your population. This is I think not so usual, but you could try something like a k-means Clustering. Clustering methods usually determine the class of an object (property R) by themselves, but k-means Clustering is cool in this situation because you need to give it the number of object classes (e.g. different possible values of R), and in the end you get one representative sample.
You definitely shouldn't go for any really complex techniques (like neural networks) in my opinion since your data doesn't have a precise numerical interpretation and the values can't be interpreted gradually.
The recommended tools really depend on your base programming language. There's a great tool called Orange which is Python-based and it's my tool of choice for these kind of things (especially since it is really easy to connect your Python modules with C/C++). If you prefer Java, there's a quite similar tool called Weka that you could use. I think Weka is a little bit better documented, but I don't like Java so I've never tried it out.
Both of these tools have a graphical clickable interface where you could just load your data and get the classification done, play with the parameters and check what kind of output you get using different techniques and different set-ups. Once you decide that you got the results you need (or if you just don't like graphical interfaces) you can also use both of them as libraries of a kind when programming (Python for Orange and Java for Weka) and make the classification a part of a bigger project.
If you look through the documentation of Orange or Weka, I think it will give you a few ideas about what you could actually do with the data you have and when you know a few techniques that seem interesting to you and applicable to the data, maybe you could get more quality comments and info on a few specific methods here than when just searching for a general advice.
You should check out classification algorithms (a subsection of artificial intelligence), especially the nearest neighbor-algorithms. Your problem may be solved by different techniques, which all have different advantages and disadvantages.
However, I do not know of any method in artificial intelligence, which allows a two-way classification (or in other words, that both implement your prerequisites 1 and 2 simultaneously). As all you want to do so far is having a bidirectional mapping of P1..P6 <=> R, I would suggest to just use a mapping table instead of an artificial intelligence algorithm. An AI would work great if you not exactly know, which of your samples is categorized under A..E in P1..P6.
If you insist on using an AI for it, I'd suggest to first look at a Perceptron. A perceptron consists of input, intermediate and output neurons. For your example, you'd have the input-Neurons P1a..P1e, P2a..P2e, ... and five output neurons R1..R5. After training, you should be able to input P1..P6 and get the appropriate R1..R5 as output.
As for frameworks and technologies, I only know of the Business Intelligence suite for Visual Studio, although there are a lot of other frameworks for AI out there. Since I do not have used any of them (I always coded them myself in C/C++), I can't recommend any.
It seems like a typical classification problem. In case you really have a lot of data have a look at Apache Mahout which provides distributed implementations of machine learning algorithms. If you need something less complex for prototyping TimBL is a nice alternative.

Automatic Tagging Algorithm

Does anyone know how to build automatic tagging (blog post/document) algorithm? Any example will be appreciated.
I agree with what Wooble is saying. However the naïve solution is to simply write an algorithm that calculates the lexical similarities and differences of the given blog post compared to a corpus of text. This lexical difference will give you words that are found in the blog post with more frequency than those found in the corpus. And from those words, you can infer a tag.
But I strongly recommend against it. Automatic tagging doesn't seem to work in practice. Just outsource the tagging work to your users or to services like Mechanical Turk
Late response but also had this task for a course - so in case someone else is looking to explore this, here is a starting point:
If you are looking for simple solutions or perhaps as a machine learning exercise, you might view automatic tagging as a text categorization/classification task. Naive Bayes classifiers are simple tools to figure out and there is plenty of pseudocode and material to understand these. TFIDF (term frequency-inverse document frequency) metric is something else you can look into - although commonly associated with information retrieval it can be tasked for this problem when combined with other machine learning techniques.
However, instead of assigning the new sample a single label based on a the definition of NB classifier, you will have to determine multiple labels. You can probably use the tag co-occurrence information from training set to help you with this.
This is a simplistic and naive solution and there are a lot of details on feature selection left out (stemming to reduce independent parameters, information gain, etc). Plenty of easily accessible papers on this research topic to try it out!

Resources