What is the difference between using gensim.models.LdaMallet and gensim.models.LdaModel? I noticed that the parameters are not all the same and would like to know when one should be used over the other?
TL;DR: They are two completely independent implementations of Latent Dirichlet Allocation.
Use gensim if you simply want to try out LDA and you are not interested in special features of Mallet.
gensim.models.LdaModel is the single-core version of LDA implemented in gensim.
There is also a parallelized LDA version available in gensim (gensim.models.ldamulticore).
Both Gensim implementations use an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation as described in Hoffman et al. [1].
Gensim algorithms (not limited to LDA) are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
Gensim also offers wrappers for the popular tools Mallet (Java) and Vowpal Wabbit (C++).
gensim.models.wrappers.LdaVowpalWabbit uses the same online variational Bayes (VB) algorithm that Gensim’s LdaModel is based on [1].
gensim.models.wrappers.LdaMallet uses an optimized Gibbs sampling algorithm for Latent Dirichlet Allocation [2].
This explains the different parameters.
However, most of the parameters (e.g., the number of topics, alpha, and beta/eta) are shared between the two algorithms because both implement LDA.
Both wrappers (gensim.models.wrappers.LdaVowpalWabbit and gensim.models.wrappers.LdaMallet) need the respective tool installed independently of gensim. In that respect, gensim's own implementation is easier to use.
Besides that, try out the different implementations and see what works for you.
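For orientation, here is a minimal sketch of how the two are typically invoked (assuming gensim 3.x, where the Mallet wrapper still ships with gensim, and a local Mallet installation; the Mallet path below is only a placeholder):

    from gensim import corpora
    from gensim.models import LdaModel
    from gensim.models.wrappers import LdaMallet

    # toy corpus: each document is a list of tokens
    docs = [["human", "interface", "computer"],
            ["survey", "user", "computer", "system"],
            ["graph", "trees", "minors"]]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # pure-Python gensim implementation (online variational Bayes)
    lda_vb = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, alpha="auto")

    # Mallet wrapper (Gibbs sampling); needs Mallet installed separately,
    # the path below is a placeholder
    lda_gibbs = LdaMallet("/path/to/mallet/bin/mallet",
                          corpus=corpus, id2word=dictionary, num_topics=2)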
References
[1] Hoffman, Matthew, Francis R. Bach, and David M. Blei. "Online learning for latent Dirichlet allocation." Advances in Neural Information Processing Systems, 2010.
[2] Yao, Limin, David Mimno, and Andrew McCallum. "Efficient methods for topic model inference on streaming document collections." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
Related
I am looking for the difference between LDA and NTM. What are some use cases where you would use LDA over NTM?
As per the AWS documentation:
LDA: The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus.
Although you can use both the Amazon SageMaker NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data.
LDA and NTM have different scientific logic:
The SageMaker LDA (Latent Dirichlet Allocation, not to be confused with Linear Discriminant Analysis) model works by assuming that documents are formed by sampling words from a finite set of topics. It has two moving parts: (1) the word composition per topic and (2) the topic composition per document (see the sketch below).
SageMaker NTM, on the other hand, doesn't explicitly learn a word distribution per topic. It is a neural network that passes the document through a bottleneck layer and tries to reproduce the input document (presumably a Variational Auto-Encoder (VAE), according to the AWS documentation). That means the bottleneck layer ends up containing all the information necessary to predict document composition, and its coefficients can be considered as topics.
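To make LDA's two moving parts concrete, here is a minimal sketch of the generative story (purely illustrative, not SageMaker code; the sizes and hyperparameters are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    n_topics, vocab_size, doc_len = 3, 1000, 50

    # moving part (1): word composition per topic
    topic_word = rng.dirichlet(np.full(vocab_size, 0.01), size=n_topics)
    # moving part (2): topic composition for one document
    doc_topic = rng.dirichlet(np.full(n_topics, 0.1))

    # generate a document by repeatedly picking a topic, then a word from it
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=doc_topic)        # sample a topic
        w = rng.choice(vocab_size, p=topic_word[z])  # sample a word id from that topic
        words.append(w)

Inference runs this story in reverse: given only the observed words, it recovers plausible topic-word and document-topic distributions.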
Here are considerations for choosing one or the other:
A VAE-based method such as SageMaker NTM may do a better job of discerning relevant topics than LDA, presumably because of its deeper expressive power. A benchmark here (featuring a VAE-NTM that may differ from SageMaker NTM) shows that NTMs can beat LDA on both topic coherence and perplexity.
So far there seems to be more community knowledge about LDA than about VAEs, NTMs, and SageMaker NTM. That means a possibly easier learning and troubleshooting path if you work with LDA. Things change fast, though, so this point may become less and less relevant as deep learning knowledge grows.
SageMaker NTM has more flexible hardware options than SageMaker LDA and may scale better: SageMaker NTM can run on CPU, GPU, and multi-GPU instances, and in a multi-instance context. For example, the official NTM demo uses an ephemeral cluster of 2 ml.c4.xlarge instances. SageMaker LDA currently only supports single-instance CPU training.
I have a very large dataset (500 million documents) and want to cluster all of them according to their content.
What would be the best way to approach this?
I tried using k-means but it does not seem suitable because it needs all documents at once in order to do the calculations.
Are there any clustering algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store my data.
According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are:
Combination of k-means and agglomerative clustering (bottom-up)
Topic modeling
Co-clustering.
But I can't tell you how to apply these to your dataset. It's big; good luck.
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package for R, which supports text mining via document-term matrices.
The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process contains lots of intermediate steps of manual inspection.
There are k-means variants that process documents one by one,
MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1.
and k-means variants that repeatedly draw a random sample.
D. Sculley (2010). Web Scale K-Means clustering. Proceedings of the 19th international conference on World Wide Web
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
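As an illustration of the sampling idea, here is a minimal sketch using scikit-learn's MiniBatchKMeans (which follows Sculley's mini-batch approach) together with a hashing vectorizer, so neither the corpus nor the vocabulary has to fit in memory; the batch-reading function is a placeholder for however you page documents out of Elasticsearch (e.g. via the scroll API):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.cluster import MiniBatchKMeans

    vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
    kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10_000, random_state=0)

    # iterate_document_batches() is hypothetical: yield lists of raw document strings,
    # a few thousand at a time, streamed from your document store
    for batch in iterate_document_batches():
        X = vectorizer.transform(batch)   # sparse term-frequency matrix for this batch
        kmeans.partial_fit(X)             # update the centroids from this batch only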
But in the end, it's still the same old k-means. It's a good quantization approach, but not very robust to noise, and not capable of handling clusters of different sizes, non-convex shapes, hierarchy (e.g. baseball inside sports), etc. It's a signal processing technique, not a data organization technique.
So the practical impact of all of this is close to zero. Yes, they can run k-means on insanely large data, but if you can't make sense of the result, why would you do so?
Do you know where I can find source code (in any language) for an information retrieval system based on the probabilistic model?
I tried searching the web and found an algorithm named bm25 or bmf25, but I don't know if it is useful.
Basically I'm trying to compare the performance of 3 IR algorithms: the vector space model, the boolean model, and the probabilistic model. Right now I have found the vector space and boolean models. Depending on the results, we need to use the best of them to develop a question-answering system.
Thanks in advance
If you are looking for an IR engine that has BM25 implemented, you can try the Terrier IR Platform.
The language is Java. You can either use the engine itself or look into the source code for implementations of BM25 or other term weighting models.
The confusion here is that there are several probabilistic IR models (e.g. 2-Poisson, the Binary Independence Model, language-modeling variants), so the question is ambiguous. But in my experience, when people say "the probabilistic model" they usually mean some variant of the Binary Independence Model due to Robertson and Spärck Jones. BM25 (quite roughly) approximates this model, and that's what I'd use in this case. A canonical implementation of BM25 is included in the Lemur Toolkit. See:
http://www.lemurproject.org/doxygen/lemur/html/OkapiRetMethod_8hpp-source.html
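Not taken from the Lemur source, but for reference, here is a minimal sketch of the standard Okapi BM25 scoring function with the usual k1 and b defaults:

    import math

    def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
                   k1=1.2, b=0.75):
        """Score one document (a list of terms) against a query with Okapi BM25."""
        score = 0.0
        doc_len = len(doc_terms)
        for term in query_terms:
            tf = doc_terms.count(term)      # term frequency in this document
            if tf == 0:
                continue
            df = doc_freqs.get(term, 0)     # number of documents containing the term
            idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return score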
I tried a naive Bayes classifier and it works very badly. SVM works a little better, but is still horrible. I have read many papers on SVM and naive Bayes with various features (n-grams, POS tags, etc.), but all of my results are close to 50% (the authors of those articles report 80% and higher, but I can't get the same accuracy on real data).
Are there any more powerful methods besides lexical analysis? SVM and naive Bayes assume that words are independent; this approach is called "bag of words". What if we assume that words are associated?
For example: use the Apriori algorithm to detect that if a sentence contains "bad" and "horrible", then there is a 70% probability that the sentence is negative. We could also use the distance between words, and so on.
Is this a good idea, or am I reinventing the wheel?
You're confusing a couple of concepts here. Neither naive Bayes nor SVMs are tied to the bag-of-words approach, and neither SVMs nor the BOW approach make an independence assumption between terms.
Here are some things you can try:
include punctuation marks in your bags of words; "!" and "?" in particular can be helpful for sentiment analysis, while many feature extractors geared toward document classification throw them away
the same goes for stop words: words like "I" and "my" may be indicative of subjective text
build a two-stage classifier; first determine whether any opinion is expressed, then whether it's positive or negative
try a quadratic kernel SVM instead of a linear one to capture interactions between features.
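For the last suggestion, a minimal sketch with scikit-learn (a degree-2 polynomial kernel stands in for the quadratic kernel; the texts and labels are made-up placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    # keep punctuation such as "!" and "?" as tokens, and keep stop words
    vectorizer = TfidfVectorizer(token_pattern=r"[\w']+|[!?]", stop_words=None)

    # a degree-2 polynomial kernel captures pairwise interactions between features
    model = make_pipeline(vectorizer, SVC(kernel="poly", degree=2))

    texts = ["I love this!", "This is horrible.",
             "Not bad at all", "Terrible, would not recommend"]
    labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (toy data)
    model.fit(texts, labels)
    print(model.predict(["What a horrible day!"]))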
Algorithms like SVM, naive Bayes, and maximum entropy are supervised machine learning algorithms, and the output of your program depends on the training set you provide.
For large-scale sentiment analysis I prefer an unsupervised learning method, in which one determines the sentiment of adjectives by clustering documents into same-oriented parts and labelling the clusters positive or negative. More information can be found in this paper:
http://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
Hope this helps you in your work :)
You can find some useful material on sentiment analysis using Python.
This presentation summarizes sentiment analysis in 3 simple steps:
Labeling data
Preprocessing
Model learning
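As a toy illustration of those three steps in scikit-learn (the texts and labels are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # 1) Labeling data: pair each text with a sentiment label
    texts = ["great product, loved it", "awful, broke after a day",
             "works fine", "never again"]
    labels = [1, 0, 1, 0]

    # 2) Preprocessing: turn raw text into word-count features
    vectorizer = CountVectorizer(lowercase=True)
    X = vectorizer.fit_transform(texts)

    # 3) Model learning: fit a simple classifier on those features
    clf = MultinomialNB().fit(X, labels)
    print(clf.predict(vectorizer.transform(["loved it, works great"])))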
Sentiment analysis is an area of ongoing research, and there is a lot of work happening right now. For an overview of the most recent and most successful approaches, I would generally advise you to have a look at the SemEval shared tasks. Usually, every year they run a competition on sentiment analysis in Twitter. You can find the paper describing the task and the results for 2016 here (it might be a bit technical, though): http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_report.pdf
Starting from there, you can have a look at the papers describing the individual systems (as referenced there).
Traditional software metrics deal with the quality of software. I'm looking for metrics that can be used to identify developers by their code, in the same vein as plagiarism detection and stylometry can be used to identify authors by their writing style. I can imagine that certain existing metrics can be used here as well, such as comment ratio. I can also imagine metrics that would be irrelevant from a quality point of view, such as the (over)use of certain methods or design patterns, the average length of variable names, etc.
I'm interested either in a pointer to a collection of such metrics or studies, or individual metrics. They may be language-agnostic or related to a language or programming paradigm.
I want to use it to understand and analyze different coding styles, not to detect plagiarism.
I see there are already a couple of studies that looked into this. They might help.
Kothari, J., Shevertalov, M., Stehle, E., Mancoridis, S., "A probabilistic approach to source code authorship identification", In Proceedings of the International Conference on Information Technology, pp.243-248, IEEE, 2007.
Available online here
Quoting from the abstract:
We begin by computing a set of metrics to build profiles for a population of known authors using code samples that are verified to be authentic. We then compute metrics on unidentified source code to determine the closest matching profile. [...] In our case study we are able to determine authorship with greater than 70% accuracy in choosing the single nearest match and greater than 90% accuracy in choosing the top three ordered nearest matches.
Shevertalov, M., Kothari, J., Stehle, E., Mancoridis, S., "On the use of discretized source code metrics for author identification", In Proceedings of the 1st International Symposium on Search Based Software Engineering, pp.69-78, IEEE, 2009.
Available online here, this is a follow-up of the previous study.
Lange, R., Mancoridis, S., "Using code metric histograms and genetic algorithms to perform author identification for software forensics", In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp.2082-2089, ACM, 2007.
Available online here
This is also related to the first reference (common author), and discusses the metrics in more detail. Again quoting from the abstract:
Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics.
You can also use Google Scholar for other references, and for finding other papers based on the ones above (using the "cited by" option).
If you're looking for potential metrics, you might try reviewing some coding standards. Since these dictate a particular style, it follows that the things they talk about (spacing, placement of braces, identifier lengths, mandatory comments, etc.) are things that might be used to identify developers from their code.
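As a toy illustration of turning such conventions into numbers (the metric choices here are hypothetical, not taken from the cited studies):

    import re

    def style_metrics(source: str) -> dict:
        """Compute a few simple style metrics from the text of a source file."""
        lines = source.splitlines() or [""]
        comment_lines = [l for l in lines if l.lstrip().startswith(("//", "#"))]
        identifiers = re.findall(r"\b[A-Za-z_]\w*\b", source)
        return {
            "comment_ratio": len(comment_lines) / len(lines),
            "avg_identifier_length": sum(map(len, identifiers)) / max(len(identifiers), 1),
            "avg_line_length": sum(map(len, lines)) / len(lines),
            "brace_on_own_line_ratio": sum(l.strip() == "{" for l in lines) / len(lines),
        }

Profiles built from such metric vectors (e.g. per-author histograms) can then be compared with any standard classifier or distance measure.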
Also, if you're interested in .NET code, you might find NDepend to be a useful tool. It enables you to run queries against a code base, and supports 82 metrics.