LDA Gensim/Mallet documentation on alpha - gensim

I'm a little bit confused about the comments on alpha in the documentation of LDA (Gensim).
For the "regular" Gensim LdaModel, the documentation says that if one sets alpha = 'asymmetric', Gensim uses a "fixed normalized asymmetric prior of 1.0 / topicno" (topicno is num_topics, right?). But why is it called asymmetric? Isn't that the symmetric case? (see https://radimrehurek.com/gensim/models/ldamodel.html)
And what is the default value of alpha used by Mallet? 50? If so, why? As far as I know, one should choose a value < 1 to get good results.
(see https://radimrehurek.com/gensim/models/wrappers/ldamallet.html)
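For context, a minimal sketch (toy corpus; parameter names from the gensim LdaModel API as I understand it) showing where alpha is set and how to inspect the per-topic prior that is actually used:
from gensim import corpora, models

texts = [["human", "interface", "computer"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# alpha may be 'symmetric', 'asymmetric', 'auto', or an explicit array of floats
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, alpha='asymmetric')
print(lda.alpha)  # per-topic prior actually used by the model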

Which RDKit fingerprint corresponds to the ECFP4 fingerprint

I have two questions about the Morgan fingerprint function of RDKit.
I couldn't figure out whether a Morgan fingerprint with radius 2 or radius 4 corresponds to ECFP4.
Furthermore, I couldn't figure out why the calculated similarity between two molecules is substantially smaller when using GetMorganFingerprintAsBitVect(nBits=2048) instead of GetMorganFingerprint.
Help or explanations would be very much appreciated.
Kind regards
Philipp
In answer to your first question: according to https://www.rdkit.org/docs/GettingStartedInPython.html, a radius of 2 is roughly equivalent to ECFP4.
The default atom invariants use connectivity information similar to those used for the well-known ECFP family of fingerprints. Feature-based invariants, similar to those used for the FCFP fingerprints, can also be used. The feature definitions used are defined in the section Feature Definitions Used in the Morgan Fingerprints. At times this can lead to quite different similarity scores:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

m1 = Chem.MolFromSmiles('c1ccccn1')
m2 = Chem.MolFromSmiles('c1ccco1')
# Connectivity-based (ECFP-like) Morgan fingerprints, radius 2
fp1 = AllChem.GetMorganFingerprint(m1, 2)
fp2 = AllChem.GetMorganFingerprint(m2, 2)
# Feature-based (FCFP-like) Morgan fingerprints, radius 2
ffp1 = AllChem.GetMorganFingerprint(m1, 2, useFeatures=True)
ffp2 = AllChem.GetMorganFingerprint(m2, 2, useFeatures=True)
DataStructs.DiceSimilarity(fp1, fp2)    # 0.36...
DataStructs.DiceSimilarity(ffp1, ffp2)  # 0.90...
When comparing the ECFP/FCFP fingerprints and the Morgan fingerprints generated by the RDKit, remember that the 4 in ECFP4 corresponds to the diameter of the atom environments considered, while the Morgan fingerprints take a radius parameter. So the examples above, with radius=2, are roughly equivalent to ECFP4 and FCFP4.
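Regarding the second part of the question: as I understand the RDKit API, GetMorganFingerprint returns a count-based sparse vector, while GetMorganFingerprintAsBitVect folds the environments into a fixed-length binary vector, so bit collisions and the loss of counts can shift the similarity values. A small sketch of the bit-vector variant for the same molecules:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

m1 = Chem.MolFromSmiles('c1ccccn1')
m2 = Chem.MolFromSmiles('c1ccco1')
# Folded 2048-bit Morgan fingerprints, radius 2 (roughly ECFP4-like)
bv1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, nBits=2048)
bv2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, nBits=2048)
print(DataStructs.DiceSimilarity(bv1, bv2))
print(DataStructs.TanimotoSimilarity(bv1, bv2))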

Which formula of tf-idf does the LSA model of gensim use?

There are many different ways in which tf and idf can be calculated. I want to know which formula is used by gensim in its LSA model. I have been going through its source code lsimodel.py, but it is not obvious to me where the document-term matrix is created (probably because of memory optimizations).
In one LSA paper, I read that each cell of the document-term matrix is the log-frequency of that word in that document, divided by the entropy of that word:
tf(w, d) = log(1 + frequency(w, d))
idf(w, D) = 1 / (-Σ_D p(w) log p(w))
However, this seems to be a very unusual formulation of tf-idf. A more familiar form of tf-idf is:
tf(w, d) = frequency(w, d)
idf(w, D) = log(|D| / |{d ∈ D: w ∈ d}|)
I also notice that there is a question on how the TfIdfModel itself is implemented in gensim. However, I didn't see lsimodel.py importing TfIdfModel, and therefore can only assume that lsimodel.py has its own implementation of tf-idf.
As I understand it, lsimodel.py does not perform the tf-idf encoding step itself. You can find the details in gensim's API documentation: there is a dedicated tf-idf model, which can be used to encode a corpus that is then fed into the LSA model. From the tfidfmodel.py source code, it appears that the latter of the two definitions of tf-idf you listed is the one followed.
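A minimal sketch of that pipeline (toy corpus; standard gensim API), where the tf-idf weighting happens in TfidfModel before LsiModel ever sees the data:
from gensim import corpora, models

texts = [["user", "interface", "system"],
         ["graph", "trees", "minors"],
         ["user", "system", "trees"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(bow_corpus)  # tf-idf weighting (tfidfmodel.py)
corpus_tfidf = tfidf[bow_corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # LSA on the weighted corpus
print(lsi.print_topics(2))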

Adaptive Updates In Vowpal Wabbit Formula

I am looking at the following two presentations about the updates done by VW when the --adaptive flag is used. They seem to be different.
http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
https://github.com/JohnLangford/vowpal_wabbit/wiki/v6.1_tutorial.pdf
With these two descriptions (respectively):
#1
#2
My questions:
Which of these are correct (or are they the same)?
For number 1 it appears that the gradient from the t+1 example is used in the denominator. How is this done? Does this mean that the new weight (labeled w_i) is the weight for example t+1?
As you noticed, the first presentation contains an error/typo in the AdaGrad formula. The formula should be w_{i, t+1} := w_{i, t} - η · g_{i, t} / √(Σ_{t'=1}^{t} g_{i, t'}²).
In Vowpal Wabbit, --adaptive (corresponding to the AdaGrad idea) is on by default. But --normalized and --invariant are also on by default, which means that on top of plain AdaGrad a few more tricks/improvements are applied. The interaction of all these tricks is complex and there is no single slide that describes all the aspects, so the only complete reference is the source code (gd.cc).
Which of these are correct (or are they the same)?
I think they are not the same, but rather different "layers" of the complex code. I think that slide 33 of the second presentation (which you cite as #2) corresponds to slide 31 of the first presentation (which you don't cite), but I am not sure.
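For concreteness, here is a small Python sketch of the per-feature AdaGrad update written out above; it is only an illustration of the formula, not Vowpal Wabbit's actual gd.cc code (which layers the --normalized and --invariant machinery on top):
import math

def adagrad_step(w, grad, sum_sq, eta=0.5, eps=1e-8):
    # w, grad, sum_sq are dicts keyed by feature index i
    for i, g in grad.items():
        sum_sq[i] = sum_sq.get(i, 0.0) + g * g  # accumulate g_{i,t'}^2 up to t
        w[i] = w.get(i, 0.0) - eta * g / (math.sqrt(sum_sq[i]) + eps)
    return w, sum_sq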

MFCC Vector Quantization for Speaker Verification Hidden Markov Models

I am currently doing a project on speaker verification using Hidden Markov Models. I chose MFCCs for my feature extraction and also intend to apply VQ. I have implemented an HMM, tested it on Eisner's data spreadsheet found here: http://www.cs.jhu.edu/~jason/papers/, and got correct results.
With voice signals, though, I seem to have missed something, since I was not getting correct acceptances (I did the probability estimation using the forward algorithm, with no scaling applied). I was wondering what I could have done wrong. I used scikits.talkbox's MFCC function for feature extraction and SciPy's cluster module for vector quantization. Here is what I have written:
import random

from scikits.talkbox.features import mfcc
from scikits.audiolab import wavread
from scipy.cluster.vq import vq, kmeans, whiten

no_clusters = 64  # codebook size used for VQ

(data, fs) = wavread(file_name)[:2]  # file_name: path to the speaker's wav file
mfcc_features = mfcc(data, fs=fs)[0]

# Vector quantization
# collected_feats is a list of spectral vectors taken together from 3 voice samples
random.seed(0)
collected_feats = whiten(collected_feats)
codebook = kmeans(collected_feats, no_clusters)[0]
feature = vq(mfcc_features, codebook)[0]  # vq returns (codes, distortion); keep the codes
# feature is then used as the observation sequence for the hidden Markov model
I assumed that the default parameters of scikits' mfcc function are already suitable for speaker verification. The audio files have sampling rates of 8000 and 22050 Hz. Is there something I am missing here? I chose 64 clusters for the VQ codebook. Each sample is an isolated word, at least 1 second in duration. I haven't yet found a Python function to remove the silences in the voice samples, so I use Audacity to manually truncate the silent parts. Any help would be appreciated. Thanks!
Well, I am not sure about the HMM approach, but I would recommend using GMMs. ALIZE is a great library for doing that. For silence removal, use the LIUM library: the process is called speaker diarization; the program detects where the speaker is speaking and gives the time stamps.
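As an illustration of the GMM route (a sketch only, using scikit-learn as a stand-in rather than ALIZE): fit a mixture on the target speaker's MFCC frames and score a test utterance by its average log-likelihood, accepting if it clears a threshold tuned on held-out data.
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical MFCC frame matrices (n_frames x n_coefficients),
# e.g. produced by the mfcc() call in the question
enroll_feats = np.random.randn(500, 13)  # enrollment frames for the target speaker
test_feats = np.random.randn(200, 13)    # frames from the utterance to verify

gmm = GaussianMixture(n_components=32, covariance_type='diag', random_state=0)
gmm.fit(enroll_feats)

score = gmm.score(test_feats)  # average per-frame log-likelihood under the speaker model
print(score)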

Computationally simple pseudo-Gaussian distribution with varying mean and standard deviation?

This picture from Wikipedia has a nice example of the sort of functions I'd ideally like to generate:
Right now I'm using the Irwin-Hall distribution, which is more or less a polynomial approximation of the Gaussian distribution: basically, you use a uniform random number generator, iterate it x times, and take the average. The more iterations, the closer it gets to a Gaussian distribution.
It's pretty nice; however, I'd like to be able to vary the mean. For example, let's say I wanted a number in the range 0 to 10, but centered around 7. That is, the mean (if I repeated this function multiple times) would turn out to be 7, but the actual range would still be 0-10.
Is there a distribution I should look up, or should I work on doing some fancy maths with standard Gaussian distributions?
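For reference, the averaging trick described above might look like the sketch below (assumed parameter names); its mean is pinned to the midpoint of the range, which is exactly the limitation the question is about.
import random

def irwin_hall_like(lo=0.0, hi=10.0, n_iters=12):
    # Average of n_iters uniform draws, rescaled to [lo, hi];
    # bell-shaped, but always centered at (lo + hi) / 2
    return lo + (hi - lo) * sum(random.random() for _ in range(n_iters)) / n_iters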
I see a contradiction in your question. On the one hand you want a normal distribution, which is symmetric by its nature; on the other hand you want a range that sits asymmetrically around the mean.
I suspect you should look at other distributions whose density functions are bell-shaped but asymmetric, such as the log-normal distribution or the beta distribution.
Look into generating normal random variates. You can generate pairs of standard normal random variates X ~ N(0, 1) and transform each into ANY normal random variate Y ~ N(m, s) via Y = m + s*X.
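A tiny Python version of that transform (m = 7 and s = 1.5 are arbitrary example values, not taken from the question):
import random

m, s = 7.0, 1.5
y = m + s * random.gauss(0.0, 1.0)  # X ~ N(0, 1) mapped to Y ~ N(m, s)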
Sounds like the truncated normal distribution is just what the doctor ordered. It is not "computationally simple" per se, but it is easy to implement if you have an existing implementation of a normal distribution.
You can just generate the distribution with the mean you want, the standard deviation you want, and the two endpoints wherever you want. You'll have to do some work beforehand to compute the mean and standard deviation of the underlying (non-truncated) normal distribution that yields the truncated-normal mean you want, but you can use the formulae in that article. Also note that you can adjust the variance as well using this method :)
I have Java code (based on the Commons Math framework) for both an accurate (slower) and quick (less accurate) implementation of this distribution, with PDF, CDF, and sampling.
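A quick sketch of the truncated-normal suggestion using scipy.stats.truncnorm (loc and scale here are example values and refer to the underlying, non-truncated normal, per the caveat above):
from scipy.stats import truncnorm

lo, hi = 0.0, 10.0
loc, scale = 7.0, 1.5
a, b = (lo - loc) / scale, (hi - loc) / scale  # bounds expressed in standard deviations
samples = truncnorm.rvs(a, b, loc=loc, scale=scale, size=5)
print(samples)  # values in [0, 10], concentrated around 7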
