How do I train GloVe embeddings in gensim from scratch? - stanford-nlp

How do I train GloVe embeddings in gensim from scratch? Can I use gensim for this?

Gensim doesn't implement the GloVe algorithm. But it does offer the very similar word2vec algorithm, which also creates a "dense embedding" (a high-dimensional vector with many varied nonzero values) for individual words. See:
https://radimrehurek.com/gensim/models/word2vec.html
It also offers the FastText algorithm, which, for some languages & purposes, can offer better-than-random guess-vectors for words it's never seen before, based on substrings within those words:
https://radimrehurek.com/gensim/models/fasttext.html
Gensim's KeyedVectors class can also load sets of GloVe vectors that were trained elsewhere, for applying those vectors to other tasks:
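from gensim.models import KeyedVectors
# GLOVE_FILE is the path to a plain-text GloVe vectors file.
# GloVe text files have no "<count> <dimensions>" header line, hence no_header=True (Gensim 4.x).
glove_kv = KeyedVectors.load_word2vec_format(GLOVE_FILE, binary=False, no_header=True)
print(glove_kv['apple'])  # the vector stored for the word 'apple'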

Related

Word2Vec convert a sentence

I have trained a Word2Vec model using gensim, and I have a dataset of tweets that I would like to convert to vectors. What is the best way to convert a sentence to a vector, and how can this be done using a word2vec model?
Formally, the word2vec algorithm only gives you a vector per word, not per longer text (like a sentence or paragraph or tweet or article).
One quick & easy baseline approach for turning longer texts into vectors is to just average together the vectors of each word. Recent versions of Gensim have a helper method get_mean_vector() to do this on KeyedVectors model objects (sets-of-word-vectors):
text_vector = kv_model.get_mean_vector(list_of_words)
Of course, such a simpleminded average has no way to model the effects of word-order/grammar. Words may tend to cancel each other out rather than have the compositional effects of real language, and the space of possible multiword-text meanings is much larger than the space of single-word meanings – so just collapsing the text into the same coordinate system as words may lose a lot.
More sophisticated ways of vectorizing text rely on models far more sophisticated than plain word2vec, such as deep/recurrent neural networks for modelling longer ranges of text.
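The averaging baseline, and its blindness to word order, can be sketched in a few lines of plain Python. The three-dimensional vectors below are hypothetical values chosen for illustration; a trained model would supply real ones. This is plain averaging, not a reimplementation of get_mean_vector(), which has its own normalization options.

```python
import math

# Hypothetical word vectors, made up for illustration.
word_vectors = {
    "dog":   [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.5],
    "man":   [0.0, 0.0, 1.0],
}

def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("none of the tokens has a vector")
    # zip(*known) walks the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]

a = mean_vector(["dog", "bites", "man"], word_vectors)
b = mean_vector(["man", "bites", "dog"], word_vectors)

# The two sentences mean different things, yet their averages coincide.
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(same)
```

Both word orders collapse to the same point, which is the loss of compositional information the answer warns about.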

Using fastText Sentence Vector as an Input Feature

I want to use the fastText Sentence Vector as an input Feature.
vector = model.get_sentence_vector('Original Sentence')
I am attempting to perform Binary Classification of sentences using MLPs and will train the algorithm using the fixed sized feature generated by the above code. Is this a plausible thing to do?
You can take the mean of the word embeddings, i.e., tokenize the sentence, look up the embeddings for all of its words, and compute their average. In this way, you will get a NumPy array that you can use as an input to whatever classifier you want. Depending on the classification task, it might be useful to remove function words first.
Gensim has a richer Python API than FastText itself. If you just want to quickly train a classifier, the best option is using the command line interface of FastText.

Is there a way to load pre-trained word vectors before training the doc2vec model?

I am trying to build a doc2vec model with roughly 10k sentences. After that, I will use the model to find the sentences in the corpus that are most similar to some new sentences.
I have trained a gensim doc2vec model using the corpus (10k sentences) I have. This model can to some extent tell me if a new sentence is similar to some of the sentences in the corpus.
But there is a problem: new sentences may contain words that don't exist in the corpus, which means that they don't have a word embedding. If this happens, the prediction result will not be good.
As far as I know, the trained doc2vec model does have a matrix of doc vectors as well as a matrix of word vectors. So what I was thinking is to load a set of pre-trained word vectors, which contains a large number of words, and then train the model to get the doc vectors. Does it make sense? Is it possible with gensim? Or is there another way to do it?
Unlike what you might guess, typical Doc2Vec training does not train up word-vectors first, then compose doc-vectors using those word-vectors. Rather, in the modes that use word-vectors, the word-vectors are trained in a simultaneous, interleaved fashion alongside the doc-vectors, both changing together. And in one fast and well-performing mode, PV-DBOW (dm=0 in gensim), word-vectors aren't trained or used at all.
So, gensim Doc2Vec doesn't support pre-loading state from elsewhere, and even if it did, it probably wouldn't provide the benefit you expect. (You could dig through the source code & perhaps force it by doing a bunch of initialization steps yourself. But then, if words were in the pre-loaded set, but not in your training data, training the rest of the active words would adjust the entire model in directions incompatible with the imported-but-untrained 'foreign' words. It's only the interleaved, tug-of-war co-training of the model's state which makes the various vectors meaningful in relation to each other.)
The most straightforward and reliable strategy would be to try to expand your training corpus, by finding more documents from a similar/compatible domain, to include multiple varied examples of any words you might encounter later. (If you thought some other word-vectors were apt enough for your domain, perhaps the texts that were used to train those word-vectors can be mixed-into your training corpus. That's a reasonable way to put the word/document data from that other source on equal footing in your model.)
And, as new documents arrive, you can also occasionally re-train the model from scratch, with the now-expanded corpus, letting newer documents contribute equally to the model's vocabulary and modeling strength.

gensim(1.0.1) Doc2Vec with google pretrained vectors

For gensim(1.0.1) doc2vec, I am trying to load google pre-trained word vectors instead of using Doc2Vec.build_vocab
wordVec_google = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model0 = Doc2Vec(size=300, alpha=0.05, min_alpha=0.05, window=8, min_count=5, workers=4, dm=0, hs=1)
model0.wv = wordVec_google
##some other code
model0.build_vocab(sentences=allEmails, max_vocab_size = 20000)
but this model0 object cannot be trained further with "labeled Docs", and can't infer vectors for documents.
Does anyone know how to use doc2vec with Google's pretrained word vectors?
I tried this post: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
but it does not work to load into gensim.models.Word2Vec object, perhaps it is a different gensim version.
The GoogleNews vectors are just raw vectors - not a full Word2Vec model.
Also, the gensim Doc2Vec class does not have general support for loading pretrained word-vectors. The Doc2Vec algorithm doesn't need pre-trained word-vectors – only some modes even use such vectors, and when they do, they're trained simultaneously as needed alongside the doc-vectors.
Specifically, the mode your code is using, dm=0, is the 'Paragraph Vectors' PV-DBOW mode, and does not use word-vectors at all. So even if there were a function to load them, they'd be loaded, then ignored during training and inference. (You would need to use PV-DM (dm=1), or add skip-gram word-training to PV-DBOW (dm=0, dbow_words=1), in order for such reused vectors to have any relevance to your training.)
Why do you think you want/need to use pre-trained vectors? (Especially, a set of 3 million word-vectors, from another kind of data, when a later step suggests you only care about a vocabulary of 20,000 words?)
If for some reason you feel sure you want to initialize Doc2Vec with word-vectors from elsewhere, and use a training mode where that would have some effect, you can look into the intersect_word2vec_format() method that gensim Doc2Vec inherits from Word2Vec.
That method specifically needs to be called after build_vocab() has already learned the corpus-specific vocabulary, and it only brings in the words from the outside source that are locally relevant. It's at best an advanced, experimental feature – see its source code, doc-comments, and discussion on the gensim list to understand its side-effects and limitations.

Necessary to apply TF-IDF to new documents in gensim LDA model?

I'm following the 'English Wikipedia' gensim tutorial at https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
where it explains that tf-idf is used during training (at least for LSA, not so clear with LDA).
I expected to apply a tf-idf transformer to new documents, but instead, at the end of the tutorial, it suggests simply inputting a bag-of-words:
doc_lda = lda[doc_bow]
Does LDA require bag-of-words vectors only?
TL;DR: Yes, LDA only needs a bag-of-words vector.
Indeed, in the Wikipedia example of the gensim tutorial, Radim Rehurek uses the TF-IDF corpus generated in the preprocessing step.
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
I believe the reason for that is only that this matrix is sparse and easy to handle (and already exists anyways due to the preprocessing step).
LDA does not necessarily need to be trained on a TF-IDF corpus. The model works just fine if you use the corpus shown in the gensim tutorial Corpora and Vector Spaces:
from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                               update_every=1, chunksize=10000, passes=1)
Notice that corpus is a list of bag-of-words vectors, one per document (texts holds the tokenized documents they are built from). As you pointed out correctly, that is the centerpiece of the LDA model. TF-IDF does not play any role in it at all.
In fact, Blei (who co-developed LDA) points out in the introduction of the 2003 paper (entitled "Latent Dirichlet Allocation") that LDA addresses the shortcomings of the TF-IDF model and leaves this approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and words in topics. The weighting of TF-IDF is not necessary for this.
Not to disagree with Jérôme's answer, but tf-idf is used with latent Dirichlet allocation to some extent. As can be read in the paper Topic Models by Blei and Lafferty (e.g. p.6 - Visualizing Topics and p.12), the tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary. "It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary".
This said, LDA does not need tf-idf to infer topics, but it can be useful and it can improve your results.
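Pruning the vocabulary to the top V words by TF-IDF might look like the sketch below. The quoted passage does not spell out a formula, so the scoring used here (total term count times log of N over document frequency) is an assumption, as are the helper name top_terms_by_tfidf and the toy documents.

```python
import math
from collections import Counter

def top_terms_by_tfidf(docs, vocab_size):
    """Rank terms by a corpus-level TF-IDF score and keep the top `vocab_size`.

    Assumed scoring: (total count of the term) * log(N / document frequency).
    """
    n_docs = len(docs)
    term_freq = Counter(term for doc in docs for term in doc)      # total occurrences
    doc_freq = Counter(term for doc in docs for term in set(doc))  # documents containing the term
    scores = {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:vocab_size]

# Toy documents, made up for illustration; 'the' occurs in every one of them.
docs = [
    ["the", "graph", "trees"],
    ["the", "graph", "minors", "graph"],
    ["the", "user", "system"],
    ["the", "system", "response"],
]

kept = top_terms_by_tfidf(docs, vocab_size=3)
kept_set = set(kept)
# Drop every token that did not make it into the pruned vocabulary.
pruned_docs = [[t for t in doc if t in kept_set] for doc in docs]
print(kept)
print(pruned_docs)
```

A word that appears in every document gets a score of zero and is pruned first. The pruned documents would then go through doc2bow() as raw counts: TF-IDF only selects the vocabulary here, it is not the model's input.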
