Gensim word2vec score function when out-of-vocabulary

Word2Vec cannot handle out-of-vocabulary words (it returns an error). However, when I try the score function https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score
with sentences that include OOV words, I surprisingly do not get an error. Why is this the case?
Thank you!

The score() function is a training-like function and, like train() itself, it simply ignores unknown words as if they weren't there. (Whether that is the right decision for the goals of such 'scoring' is pondered in a nearby source-code comment.)
Note that these score() functions are a non-standard extension of Word2Vec, contributed a while ago as part of the research paper mentioned in the related docs. Whether they work for any purpose, or still work as originally intended in the latest versions of Gensim, isn't clear or certain. They might not be maintained in the future (and even now they don't work for the usual default negative-sampling Word2Vec models).
So you may not want to rely on them, and should study their raw source code for details of their functionality.
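For illustration, here's a minimal sketch of that behavior, written against the Gensim 4.x API (the toy corpus and parameter values are arbitrary):

    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["dogs", "chase", "cats"],
    ]

    # score() relies on hierarchical softmax, so the model must be trained
    # with hs=1, negative=0; it raises an error for the usual default
    # negative-sampling configuration.
    model = Word2Vec(sentences, vector_size=10, min_count=1, hs=1, negative=0)

    # "unicorn" is out-of-vocabulary, but score() silently skips it,
    # just as train() would, so no error occurs.
    log_likelihoods = model.score([["the", "cat", "unicorn"]], total_sentences=1)
    print(log_likelihoods)  # one log-probability per input sentence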

Related

Extract Word Saliency from Gensim LDA or pyLDAvis

I see that pyLDAvis visualizes each word's saliency under each topic.
But is there a way to extract each word's saliency under each topic, or to calculate it directly using Gensim LDA?
Ultimately, I want to get a pandas dataframe in which each row represents one word, each column represents one topic, and each value represents the word's saliency under the corresponding topic.
Many thanks in advance.
Gensim's LDA implementation does not have out-of-the-box support for this particular 'saliency' calculation from Chuang et al. (2012).
Still, I suspect the model's .get_term_topics() and/or .get_topic_terms() methods provide the proper supporting data for implementing that calculation. In particular, one or the other of those methods might provide the p(w|t) term, but a deeper read of the paper would be required to know for sure. (I suspect the P(t) term might require a separate survey of the training data.)
From the class docs:
https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_term_topics
Returns: The relevant topics, represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.
https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_topic_terms
Returns: Word ID - probability pairs for the most relevant words generated by the topic.
I hadn't come across this particular 'saliency' calculation before, but if it is popular among LDA users, or of potential general use, and you figure out how to calculate it, it'd likely be a welcome contribution to the Gensim project - especially if it can be a simple extra convenience method on LdaModel.
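If you do attempt it, here's an untested sketch of one plausible reading of the paper's formula, saliency(w) = P(w) * sum_t P(t|w) * log(P(t|w) / P(t)), where lda is assumed to be a trained gensim LdaModel and corpus its bag-of-words training corpus:

    import numpy as np

    phi = lda.get_topics()  # shape (num_topics, num_terms): p(w|t)
    num_topics, num_terms = phi.shape

    # Estimate P(t) and P(w) by surveying the training data: weight each
    # document's topic mixture by its length, and count raw term frequencies.
    p_t = np.zeros(num_topics)
    term_counts = np.zeros(num_terms)
    for bow in corpus:
        doc_len = sum(cnt for _, cnt in bow)
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            p_t[topic_id] += prob * doc_len
        for term_id, cnt in bow:
            term_counts[term_id] += cnt
    p_t /= p_t.sum()
    p_w = term_counts / term_counts.sum()

    # P(t|w) via Bayes' rule, then the KL-style 'distinctiveness' term.
    joint = phi * p_t[:, None]                   # p(w|t) * P(t)
    p_t_given_w = joint / joint.sum(axis=0)      # columns sum to 1
    distinctiveness = np.nansum(
        p_t_given_w * np.log(p_t_given_w / p_t[:, None]), axis=0
    )
    saliency = p_w * distinctiveness             # one score per vocabulary word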
Adding to gojomo's reply: yes, there is no direct way of getting the list of most salient words as proposed by Chuang et al. (2012). But there is a library named TMToolkit that offers a way of extracting this: it provides a method called word_saliency that can give you what you are looking for. The catch is that this method expects you to provide the following items:
topic_word_distribution
doc_topic_distribution
doc_lengths
If you are using Gensim LDA, then providing doc_topic_distribution becomes a significant challenge, as Gensim does not provide it out of the box. In that case, you can utilize the _extract_data method that is part of the pyLDAvis library. As this method is designed specifically for Gensim, you should already have all the parameters it requires. It yields a dictionary containing topic_word_distribution, doc_topic_distribution, and doc_lengths (see the sketch below). However, you might want to sort the output of TMToolkit.
A word of caution about TMToolkit: it is notorious for downgrading helpful packages like numpy and pandas, so it is highly recommended to install it in a virtual environment.
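Putting those pieces together, a rough (untested) sketch might look like the following, where lda, corpus, and dictionary are assumed to be your trained Gensim LdaModel, bag-of-words corpus, and Dictionary, and the dictionary keys follow recent pyLDAvis versions:

    import numpy as np
    from pyLDAvis.gensim_models import _extract_data  # pyLDAvis.gensim in older versions
    from tmtoolkit.topicmod.model_stats import word_saliency

    # Reuse pyLDAvis's Gensim-specific extraction to get the distributions
    # that TMToolkit expects.
    data = _extract_data(lda, corpus, dictionary)

    saliency = word_saliency(
        np.asarray(data['topic_term_dists']),   # topic-word distribution
        np.asarray(data['doc_topic_dists']),    # document-topic distribution
        np.asarray(data['doc_lengths']),
    )

    # TMToolkit returns scores in vocabulary order, so pair and sort them.
    ranked = sorted(zip(data['vocab'], saliency), key=lambda pair: -pair[1])
    print(ranked[:10])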

Gensim Word2vec model parameter tuning

I am working on a Word2Vec model. Is there any way to find the ideal value for one of its parameters, i.e. iter, the way we use the elbow-curve plot in K-Means to pick K? Or is there any other way to tune this model's parameters?
There's no one ideal set of parameters for a word2vec session – it depends on your intended usage of the word-vectors.
For example, some research has suggested that using a larger window tends to position the final vectors in a way that's more sensitive to topical/domain similarity, while a smaller window value shifts the word-neighborhoods to be more syntactic/functional drop-in replacements for each other. So depending on your particular project goals, you'd want a different value here.
(Similarly, because the original word2vec paper evaluated models, & tuned model meta-parameters, based on the usefulness of the word-vectors to solve a set of English-language analogy problems, many have often tuned their models to do well on the same analogy task. But I've seen cases where the model that scores best on those analogies does worse when contributing to downstream classification tasks.)
So what you really want is a project-specific way to score a set of word-vectors, well-matched to your goals. Then, you run many alternate word2vec training sessions, and pick the parameters that do best on your score.
The case of iter/epochs is special, in that by the logic of the underlying stochastic-gradient-descent optimization method, you'd ideally want to use as many training-epochs as necessary for the per-epoch running 'loss' to stop improving. At that point, the model is plausibly as good as it can be – 'converged' – given its inherent number of free-parameters and structure. (Any further internal adjustments that improve it for some examples worsen it for others, and vice-versa.)
So potentially, you'd watch this 'loss', and choose a number of training-iterations that's just enough to show the 'loss' stagnating (jittering up-and-down in a tight window) for a few passes. However, the loss-reporting in gensim isn't yet quite optimal – see project bug #2617 – and many word2vec implementations, including gensim and going back to the original word2vec.c code released by Google researchers, just let you set a fixed count of training iterations, rather than implement any loss-sensitive stopping rules.
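If you want to watch the loss yourself despite those caveats, a rough sketch using Gensim's callback hooks might look like this (sentences is assumed to be your tokenized training corpus):

    from gensim.models import Word2Vec
    from gensim.models.callbacks import CallbackAny2Vec

    class LossLogger(CallbackAny2Vec):
        """Print an approximate per-epoch loss after each training epoch."""
        def __init__(self):
            self.epoch = 0
            self.previous = 0.0

        def on_epoch_end(self, model):
            # get_latest_training_loss() reports a running cumulative tally
            # (one of the quirks tracked in issue #2617), so take a
            # difference to approximate the per-epoch loss.
            cumulative = model.get_latest_training_loss()
            print(f"epoch {self.epoch}: loss {cumulative - self.previous:.1f}")
            self.previous = cumulative
            self.epoch += 1

    # epochs is set deliberately high so you can watch where the reported
    # loss stops improving and stagnates.
    model = Word2Vec(sentences, epochs=40, compute_loss=True,
                     callbacks=[LossLogger()])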

How are keyword clouds constructed?

How are keyword clouds constructed?
I know there are a lot of NLP methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use NLP methods to detect proper nouns, people, places, and possibly subjects. This will be a very large list given a sufficiently sized article, but I will assume that I can winnow the list down using some method of comparing articles. How to do this properly is what I am confused about.)
Each item can have a list of keywords, but how do they pick keywords such that the keywords aren't overly specific or overly general between each item?
For example, trivially, "the" could be a keyword that is in a lot of items,
while "supercalifragilistic" might only be in one.
I suppose I could create a heuristic: if a word exists in n% of the items, where n is sufficiently small but still returns a nice sublist (say 5% of 1000 articles is 50, which seems reasonable), then I could just use that. However, the issue I take with this approach is that given two different sets of entirely different items, there is most likely some difference in interrelatedness between the items, and I'm throwing away that information.
This is very unsatisfying.
I feel that, given the popularity of keyword clouds, a solution must already exist. However, I don't want to use a library, as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is OK, by the way, but the issue is that the weighting needs to be determined a priori. Given that two distinct collections of documents will have a different inherent similarity between their documents, assuming an a priori weighting does not feel correct.
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but I'm still considering it.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has its idf vector, as you say determined a priori, estimated from Wikipedia). A popular alternative is the TextRank algorithm based on PageRank, which has an off-the-shelf implementation in Gensim.
If you decide for your own implementation, note that all algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is remove stopwords that you know can never be keywords (prepositions, articles, pronouns, etc.). If you want something fancier, you can use, for instance, spaCy to keep only the desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (Gensim has a good function for automatic collocation detection) and named entities (spaCy can do this). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvement.
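As a starting point, here's a minimal sketch of tf-idf keyword ranking with Gensim, in which the idf weights are estimated from your own collection rather than fixed a priori (the toy documents stand in for your preprocessed, stopword-filtered articles):

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    docs = [
        ["cats", "chase", "mice", "cats", "purr"],
        ["dogs", "chase", "cats", "dogs", "bark"],
        ["mice", "eat", "cheese"],
    ]

    dictionary = Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
    tfidf = TfidfModel(bow_corpus)  # idf comes from this corpus itself

    # For each document, keep the highest-weighted terms as its keywords;
    # "chase" and "cats" score lower because they appear in several documents.
    for doc_bow in bow_corpus:
        top = sorted(tfidf[doc_bow], key=lambda pair: -pair[1])[:3]
        print([(dictionary[term_id], round(w, 2)) for term_id, w in top])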

Can I preserve the random state of a doc2vec model for each document I want to infer by inferring all documents at the same time?

Is there a way to infer multiple documents at the same time, to preserve the random state of the model, using Gensim Doc2Vec?
The function infer_vector is defined as
infer_vector(doc_words, alpha=None, min_alpha=None, epochs=None, steps=None)
where doc_words (list of str) – a document for which the vector representation will be inferred. I could not find any other option to infer multiple documents at the same time.
There's no current option to infer multiple documents at once. It's one of many wishlist improvements for infer_vector() (collected in an open issue), but there's no work in progress or targeted release for that to arrive.
I'm not sure what you mean by "preserve the random state of the model". The main motivations for batching that I can see would be user convenience, or added performance via multithreading.
If what you really want is deterministic inference, see an answer in the Gensim FAQ which explains why deterministic Doc2Vec inference isn't necessarily a good idea. (It also includes a link to an issue with some ideas for how to force it, if you're determined to do that despite the good reasons not to.)
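In the meantime, inferring multiple documents is just a loop. A small sketch, where model is assumed to be a trained Doc2Vec model and docs a list of pre-tokenized documents (lists of str):

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("my_doc2vec.model")  # hypothetical saved model

    # Each call starts from a fresh randomly-initialized candidate vector;
    # there is no batch API, so a plain loop (or multithreaded map) is it.
    vectors = [model.infer_vector(doc_words, epochs=50) for doc_words in docs]

    # Inference is stochastic by design: rather than forcing a fixed random
    # state, using more inference epochs tends to make repeated runs on the
    # same text land on closer vectors.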

Is there pre-trained doc2vec model?

Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?
I don't know of any good one. There's one linked from this project, but:
it's based on a custom fork from an older gensim, so won't load in recent code
it's not clear what parameters or data it was trained with, and the associated paper may have made uninformed choices about the effects of parameters
it doesn't appear to be the right size to include actual doc-vectors for either Wikipedia articles (4-million-plus) or article paragraphs (tens-of-millions), or a significant number of word-vectors, so it's unclear what's been discarded
While it takes a long time and a significant amount of working RAM, there is a Jupyter notebook included in gensim demonstrating the creation of a Doc2Vec model from Wikipedia:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
So, I would recommend fixing the mistakes in your attempt. (And, if you succeed in creating a model, and want to document it for others, you could upload it somewhere for others to re-use.)
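In outline, the notebook's approach looks roughly like this compressed, unverified sketch (the dump filename and parameter values are placeholders; a real run takes many hours and tens of GB of RAM):

    from gensim.corpora.wikicorpus import WikiCorpus
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Stream Wikipedia articles as TaggedDocuments, one tag per article.
    class TaggedWikiCorpus:
        def __init__(self, wiki_corpus):
            self.wiki_corpus = wiki_corpus
        def __iter__(self):
            for idx, tokens in enumerate(self.wiki_corpus.get_texts()):
                yield TaggedDocument(words=tokens, tags=[idx])

    wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
    model = Doc2Vec(TaggedWikiCorpus(wiki), vector_size=200, window=8,
                    min_count=20, epochs=10, workers=8)
    model.save("doc2vec_wikipedia.model")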
Yes!
I could find two pre-trained doc2vec models at this link,
but I still could not find any pre-trained doc2vec model trained on tweets.
