Is there a way to load pre-trained word vectors before training the doc2vec model? - gensim

I am trying to build a doc2vec model with more or less 10k sentences, after that I will use the model to find the most similar sentence in the model of some new sentences.
I have trained a gensim doc2vec model using the corpus(10k sentences) I have. This model can to some extend tell me if a new sentence is similar to some of the sentences in the corpus.
But, there is a problem: it may happen that there are words in new sentences which don't exist in the corpus, which means that they don't have a word embedding. If this happens, the prediction result will not be good.
As far as I know, the trained doc2vec model does have a matrix of doc vectors as well as a matrix of word vectors. So what I were thinking is to load a set of pre-trained word vectors, which contains a large number of words, and then train the model to get the doc vectors. Does it make sense? Is it possible with gensim? Or is there another way to do it?

Unlike what you might guess, typical Doc2Vec training does not train up word-vectors first, then compose doc-vectors using those word-vectors. Rather, in the modes that use word-vectors, the word-vectors trained in a simultaneous, interleaved fashion alongside the doc-vectors, both changing together. And in one fast and well-performing mode, PV-DBOW (dm=0 in gensim), word-vectors aren't trained or used at all.
So, gensim Doc2Vec doesn't support pre-loading state from elsewhere, and even if it did, it probably wouldn't provide the benefit you expect. (You could dig through the source code & perhaps force it by doing a bunch of initialization steps yourself. But then, if words were in the pre-loaded set, but not in your training data, training the rest of the active words would adjust the entire model in direction incompatible with the imported-but-untrained 'foreign' words. It's only the interleaved, tug-of-war co-training of the model's state which makes the various vectors meaningful in relation to each other.)
The most straightforward and reliable strategy would be to try to expand your training corpus, by finding more documents from a similar/compatible domain, to include multiple varied examples of any words you might encounter later. (If you thought some other word-vectors were apt enough for your domain, perhaps the texts that were used to train those word-vectors can be mixed-into your training corpus. That's a reasonable way to put the word/document data from that other source on equal footing in your model.)
And, as new documents arrive, you can also occasionally re-train the model from scratch, with the now-expanded corpus, letting newer documents contribute equally to the model's vocabulary and modeling strength.

Related

If I train a custom tokenizer on my dataset, I would still be able to leverage a pre-trained model weight

This is a declaration, but I'm not sure it is correct. I can elaborate.
I have a considerably large dataset (23Gb). I'd like to pre-train the Roberta-base or XLM-Roberta-base, so my language model would fit better to be used in further downstream tasks.
I know I can just run it against my dataset for a few epochs and get good results. But, what if I also train the tokenizer to generate a new vocab, and merge files? The weights from the pre-trained model I started from will still be used, or the new set of tokens will demand complete training from scratch?
I'm asking this because maybe some layers can still contribute with knowledge, so the final model will have the better of both worlds: A tokenizer that fits my dataset, and the weights from previous training.
That makes sense?
In short no.
You cannot use your own pretrained tokenizer for a pretrained model. The reason is that the vocabulary for your tokenizer and the vocabulary of the tokenizer that was used to pretrain the model that later you will use it as pretrained model are different. Thus a word-piece token which is present in Tokenizers's vocabulary may not be present in pretrained model's vocabulary.
Detailed answers can be found here,

How to interpret doc2vec classifier in terms of words?

I have trained a doc2vec (PV-DM) model in gensim on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I have good performance of a (cross-validated) classifier trained on the embeddings compared to one compared on the other statistic, but I am still unsure of how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The document classes output on word embeddings are very consistent across cross validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such and such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks for help in advance.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather dimension, always < 100, which are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as vector in the embedding space to which I can compare words, documents, etc.
That's a very small dataset (100 docs) and vocabulary (100 words) compared to much published work of Doc2Vec, which has usually used tens-of-thousands or millions of distinct documents.
That each doc is thousands of words and you're using PV-DM mode that mixes both doc-to-word and word-to-word contexts for training helps a bit. I'd still expect you might need to use a smaller-than-defualt dimensionaity (vector_size<<100), & more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doctags might be prematurely hiding variety on the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper approach, and then you're feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known-classes as extra tags, in addition to the per-doc IDs. (And perhaps if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
I haven't seen specific work on making Doc2Vec models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
(A wishlist feature for Doc2Vec for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
Whn you're not using real natural language, some useful things to keep in mind:
if your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very-large number can make sense (to essentially put all words in each others' windows), but may not be practical/appropriate given your large docs. Or, trying PV-DBOW instead - potentially even mixing known-classes & word-tokens in either tags or words.
the default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help.

Determining the "goodness" of a phrase based on "grammatical" or "contextual" relevancy

Given a random string of words, I would like to assign a "goodness" score to the phrase, where "goodness" is some indication of grammatical and contextual relevancy.
For example:
"the green tree was tall" [Good score]
"delicious tires swim open" [Medium score]
"jump an con porch calmly" [Poor score]
I've been experimenting with the Natural Language Toolkit. I'd considered using a trained tagger to assign parts-of-speech to each word in a phrase, and then parse a corpus for occurrences of that POS pattern. This may give me an indication of grammatical "goodness". However, as the tagger itself is trained on the same corpus that I'm using for validation, I can't imagine the results would be reliable. This approach also does not take into consideration the contextual relevancy of the words.
Is anyone aware of existing projects or research into this sort of thing? How would you approach this?
You could employ two different approaches - supervised and semi-supervised.
Supervised
Assuming you have a labeled dataset of tuples of the form <sentence> <goodness label> (like the one in your examples), you could first split your dataset up in a train:test fold (e.g. 4:1).
Then you could simply use BERT feature vectors (these are pre-trained on large volumes of natural language text). The following piece of code gives you the vector for the sentence the green tree was tall (read more here).
nlp_features = pipeline('feature-extraction')
output = nlp_features('the green tree was tall')
np.array(output).shape # (Samples, Tokens, Vector Size)
Assuming you vectorize every sentence, you could then train a simple logistc regression model (sklearn) that learns a set of parameters to minimize the errors in these predictions on the training set and eventually you throw the test set sentences at this model to see how it behaves.
Instead of BERT, you could also use embedded vectors as inputs to an LSTM network for training the classifier (like the one here).
Semi-supervised
This is applicable when you don't have sufficient labeled data (although you need a few to get you started with).
In this case, I think what you could do is to map the words of a sentence into POS tag sequences, e.g.,
the green tree was tall --> ARTICLE ADJ NOUN VERB ADJ (see here for more details).
This step would make your method depend less on the words themselves. A model trained on these sequences would try to discover some latent distinguishing characteristics of good sentences from the bad ones.
In particular, you could run a standard text classification approach with Bidirectional LSTMs for training your classifier (this time not with words but with a much smaller vocabulary of POS tags).
You can use a transformer model from HuggingFace that is fine tuned for sentence correctness. Specifically, the model has to be fine tuned on the Corpus of Linguistic Acceptability (CoLA). Here's a medium article on HuggingFace, transformers, and the fine tuning process.
You can also get a model that's already fine-tuned and you can put in the text classification pipeline for HuggingFace's transformers library here. That site hosts fine-tuned models and you can search for a few others that are fine tuned for the CoLA task there.

gensim doc2vec train more documents from pre-trained model

I am trying to train with new labelled document(TaggedDocument) with the pre-trained model.
Pretrained model is the trained model with documents which the unique id with label1_index, for instance, Good_0, Good_1 to Good_999
And the total size of trained data is about 7000
Now, I want to train the pre-trained model with new documents which the unique id with label2_index, for instance, Bad_0, Bad_1... to Bad_1211
And the total size of trained data is about 1211
The train itself was successful without any error, but the problem is that whenever I try to use 'most_similar' it only suggests the similar document labelled with Good_... where I expect the labelled with Bad_.
If I train altogether from the beginning, it gives me the answers I expected - it infers a newly given document similar to either labelled with Good or Bad.
However, the practice above will not work as the one trained altogether from the beginning.
Is continuing train not working properly or did I make some mistake?
The gensim Doc2Vec class can always be fed extra examples via train(), but it only discovers the working vocabulary of both word-tokens and document-tags during an initial build_vocab() step. So unless words/tags were available during the build_vocab(), they'll be ignored as unknown later. (The words get silently dropped from the text; the tags aren't trained or remembered inside the model.)
The Word2Vec superclass from which Doc2Vec borrows a lot of functionality has a newer, more-experimental parameter on its build_vocab() called update. If set true, that call to build_vocab() will add to, rather than replace, any prior vocabulary. However, as of February 2018, this option doesn't yet work with Doc2Vec, and indeed often causes memory-fault crashes.
But even if/when that can be made to work, providing incremental training examples isn't necessarily a good idea. By only updating parts of the model – those exercised by the new examples – the overall model can get worse, or its vectors made less self-consistent with each other. (The essence of these dense-embedding models is that the optimization over all varied examples results in generally-useful vectors. Training over just some subset causes the model to drift towards being good on just that subset, at likely cost to earlier examples.)
If you need new examples to also become part of the results for most_similar(), you might want to create your own separate set-of-vectors outside of Doc2Vec. When you infer new vectors for new texts, you could add those to that outside set, and then implement your own most_similar() (using the gensim code as a model) to search over this expanding set of vectors, rather than just the fixed set that is created by initial bulk Doc2Vec training.

gensim(1.0.1) Doc2Vec with google pretrained vectors

For gensim(1.0.1) doc2vec, I am trying to load google pre-trained word vectors instead of using Doc2Vec.build_vocab
wordVec_google = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model0 = Doc2Vec(size=300, alpha=0.05, min_alpha=0.05, window=8, min_count=5, workers=4, dm=0, hs=1)
model0.wv = wordVec_google
##some other code
model0.build_vocab(sentences=allEmails, max_vocab_size = 20000)
but this object model0 can not be further trained with "labeled Docs", and can't infer vectors for documents.
Anyone knows how to use doc2vec with google pretrained word vectors?
I tried this post: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
but it does not work to load into gensim.models.Word2Vec object, perhaps it is a different gensim version.
The GoogleNews vectors are just raw vectors - not a full Word2Vec model.
Also, the gensim Doc2Vec class does not have general support for loading pretrained word-vectors. The Doc2Vec algorithm doesn't need pre-trained word-vectors – only some modes even use such vectors, and when they do, they're trained simultaneously as needed alongside the doc-vectors.
Specifically, the mode your code is using, dm=0, is the 'Paragraph Vectors' PV-DBOW mode, and does not use word-vectors at all. So even if there was a function to load them, they'd be loaded – then ignored during training and inference. (You would need to use PV-DM, 'dm=1', or add skip-gram word-training to PV-DBOW, dm=0, dbow_words=1, in order for such reused vectors to have any relevance to your training.)
Why do you think you want/need to use pre-trained vectors? (Especially, a set of 3 million word-vectors, from another kind of data, when a later step suggests you only care about a vocabulary of 20,000 words?)
If for some reason you feel sure you want to initialize Doc2Vec with wrod-vectors from elsewhere, and use a training mode where that would have some effect, you can look into the intersect_word2vec_format() method that gensim Doc2Vec inherits from Word2Vec.
That method specifically needs to be called after build_vocab() has already learned the corpus-specific vocabulary, and it only brings in the words from the outside source that are locally relevant. It's at best an advanced, experimental feature – see its source code, doc-comments, and discussion on the gensim list to understand its side-effects and limitations.

Resources