I have trained a Word2Vec model using gensim, and I have a dataset of tweets that I would like to convert to vectors. What is the best way to convert a sentence to a vector, and how can this be done using a word2vec model?
Formally, the word2vec algorithm only gives you a vector per word, not per longer text (like a sentence or paragraph or tweet or article).
One quick & easy baseline approach for turning longer texts into vectors is to just average together the vectors of each word. Recent versions of Gensim have a helper method get_mean_vector() to do this on KeyedVectors model objects (sets-of-word-vectors):
text_vector = kv_model.get_mean_vector(list_of_words)
Of course, such a simpleminded average has no way to model the effects of word-order/grammar. Words may tend to cancel each other out rather than have the compositional effects of real language, and the space of possible multiword-text meanings is much larger than the space of single-word meanings – so just collapsing the text into the same coordinate system as words may lose a lot.
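As a rough illustration of the averaging idea and its blindness to word order, here is a minimal pure-Python sketch. The toy 2-d vectors and the mean_vector() helper are made up for this example; this is not Gensim's implementation, which as far as I know also has options such as normalizing each vector before averaging.

```python
from typing import Dict, List, Sequence


def mean_vector(words: Sequence[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy vectors, invented purely for illustration.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

a = mean_vector(["dog", "bites", "man"], toy_vectors)
b = mean_vector(["man", "bites", "dog"], toy_vectors)
# Both word orders collapse to the same averaged vector.
```

Since "dog bites man" and "man bites dog" end up with identical vectors here, anything that depends on word order is invisible to a downstream model.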
More sophisticated ways of vectorizing text rely on models far more sophisticated than plain word2vec, such as deep/recurrent neural networks for modelling longer ranges of text.
I have trained a doc2vec (PV-DM) model in gensim on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I have good performance of a (cross-validated) classifier trained on the embeddings compared to one trained on the other statistic, but I am still unsure of how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The document classes output on word embeddings are very consistent across cross validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such and such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks for help in advance.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather low dimension, always < 100, and are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as a vector in the embedding space to which I can compare words, documents, etc.
That's a very small dataset (100 docs) and vocabulary (100 words) compared to much published work on Doc2Vec, which has usually used tens-of-thousands or millions of distinct documents.
That each doc is thousands of words, and that you're using PV-DM mode that mixes both doc-to-word and word-to-word contexts for training, helps a bit. I'd still expect you might need to use a smaller-than-default dimensionality (vector_size<<100), & more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doctags might be prematurely hiding variety in the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper approach, and then you're feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known-classes as extra tags, in addition to the per-doc IDs. (And perhaps if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
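A small sketch of the two tagging schemes, in case it helps. To keep it runnable without Gensim, it uses a local namedtuple stand-in with the same two fields (words, tags) as gensim's TaggedDocument; the tag_corpus() helper, the "doc_"/"class_" tag names and the toy tokens are all invented for illustration.

```python
from collections import namedtuple

# Local stand-in mirroring the (words, tags) fields of
# gensim.models.doc2vec.TaggedDocument, so this runs without gensim.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tag_corpus(docs, include_class_tags=True):
    """docs: list of (tokens, class_label) pairs.

    Every doc gets a unique ID tag; optionally the known class
    is added as a second tag shared by all docs of that class.
    """
    tagged = []
    for doc_id, (tokens, class_label) in enumerate(docs):
        tags = [f"doc_{doc_id}"]
        if include_class_tags:
            tags.append(f"class_{class_label}")
        tagged.append(TaggedDocument(words=tokens, tags=tags))
    return tagged


# Toy corpus, invented for illustration.
corpus = [(["s1", "s2", "s1"], "A"), (["s3", "s3", "s2"], "B")]

ids_only = tag_corpus(corpus, include_class_tags=False)
ids_and_classes = tag_corpus(corpus, include_class_tags=True)
```

If class tags are used during embedding training, the class labels of any held-out evaluation docs would presumably have to be left out, or the downstream classifier's scores will look better than they should.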
I haven't seen specific work on making Doc2Vec models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
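The targeted-removal idea could be sketched roughly like this. The ablation_effects() helper and the toy scorer are hypothetical; in a real setup the score callable would wrap something like inferring a doc-vector for the altered token list (e.g. via Doc2Vec's infer_vector()) and asking your classifier for the probability of the class of interest. Inference is randomized, so averaging several runs is probably wise.

```python
from typing import Callable, Dict, List, Sequence


def ablation_effects(
    tokens: Sequence[str],
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
) -> Dict[str, float]:
    """For each candidate word, remove all its occurrences and report
    how much the score drops (positive = the word supported the score)."""
    baseline = score(list(tokens))
    effects = {}
    for word in candidates:
        reduced = [t for t in tokens if t != word]
        effects[word] = baseline - score(reduced)
    return effects


def toy_score(tokens: List[str]) -> float:
    # Stand-in for "classifier confidence": the share of tokens equal to "a".
    return tokens.count("a") / len(tokens) if tokens else 0.0


doc = ["a", "a", "b", "c"]
effects = ablation_effects(doc, ["a", "b", "c"], toy_score)
most_influential = max(effects, key=effects.get)
```

Ranking words by the size of their effect gives a crude "document X got label Y because of words A, B, C" statement, though it only measures single-word removals and ignores interactions between words.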
(A wishlist feature for Doc2Vec for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
When you're not using real natural language, some useful things to keep in mind:
if your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very-large number can make sense (to essentially put all words in each others' windows), but may not be practical/appropriate given your large docs. Or, you could try PV-DBOW instead - potentially even mixing known-classes & word-tokens in either tags or words.
the default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help.
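To make the effect of ns_exponent concrete: negative examples are drawn with probability proportional to each word's count raised to that exponent. A small sketch with invented counts (the negative_sampling_probs() helper is illustrative only, not Gensim code):

```python
from typing import Dict


def negative_sampling_probs(counts: Dict[str, int], exponent: float) -> Dict[str, float]:
    """Probability of drawing each word as a negative example,
    proportional to count ** exponent."""
    weights = {word: count ** exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Invented counts: one frequent token, one rare token.
counts = {"common": 900, "rare": 100}

p_raw = negative_sampling_probs(counts, 1.0)        # proportional to raw frequency
p_default = negative_sampling_probs(counts, 0.75)   # the word2vec default
p_uniform = negative_sampling_probs(counts, 0.0)    # every word equally likely
p_negative = negative_sampling_probs(counts, -0.5)  # rare words favored
```

With only ~100 distinct tokens whose frequencies may look nothing like natural-language word frequencies, trying a few values across this range seems like a cheap experiment.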
Given a random string of words, I would like to assign a "goodness" score to the phrase, where "goodness" is some indication of grammatical and contextual relevancy.
For example:
"the green tree was tall" [Good score]
"delicious tires swim open" [Medium score]
"jump an con porch calmly" [Poor score]
I've been experimenting with the Natural Language Toolkit. I'd considered using a trained tagger to assign parts-of-speech to each word in a phrase, and then parse a corpus for occurrences of that POS pattern. This may give me an indication of grammatical "goodness". However, as the tagger itself is trained on the same corpus that I'm using for validation, I can't imagine the results would be reliable. This approach also does not take into consideration the contextual relevancy of the words.
Is anyone aware of existing projects or research into this sort of thing? How would you approach this?
You could employ two different approaches - supervised and semi-supervised.
Supervised
Assuming you have a labeled dataset of tuples of the form <sentence> <goodness label> (like the one in your examples), you could first split your dataset up in a train:test fold (e.g. 4:1).
Then you could simply use BERT feature vectors (these are pre-trained on large volumes of natural language text). The following piece of code gives you the vector for the sentence the green tree was tall (read more here).
import numpy as np
from transformers import pipeline

nlp_features = pipeline('feature-extraction')
output = nlp_features('the green tree was tall')
np.array(output).shape # (Samples, Tokens, Vector Size)
Assuming you vectorize every sentence, you could then train a simple logistic regression model (sklearn) that learns a set of parameters to minimize the errors in its predictions on the training set. Finally, you throw the test set sentences at this model to see how it behaves.
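For the 4:1 train:test split mentioned earlier, a minimal standard-library sketch could look like the following. The train_test_split() helper here is a hand-rolled illustration (sklearn ships its own function of the same name), and the placeholder sentences and labels are made up.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split(
    examples: Sequence[Tuple[str, str]],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Shuffle (sentence, label) pairs reproducibly and hold out a test share."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data, invented for illustration.
labeled = [(f"sentence {i}", "good" if i % 2 == 0 else "poor") for i in range(10)]

train, test = train_test_split(labeled, test_fraction=0.2)
```

The test sentences should only be vectorized and scored after the model has been fit on the training portion.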
Instead of BERT, you could also use embedded vectors as inputs to an LSTM network for training the classifier (like the one here).
Semi-supervised
This is applicable when you don't have sufficient labeled data (although you need a few to get you started with).
In this case, I think what you could do is to map the words of a sentence into POS tag sequences, e.g.,
the green tree was tall --> ARTICLE ADJ NOUN VERB ADJ (see here for more details).
This step would make your method depend less on the words themselves. A model trained on these sequences would try to discover some latent distinguishing characteristics of good sentences from the bad ones.
In particular, you could run a standard text classification approach with Bidirectional LSTMs for training your classifier (this time not with words but with a much smaller vocabulary of POS tags).
You can use a transformer model from HuggingFace that is fine-tuned for sentence correctness. Specifically, the model has to be fine-tuned on the Corpus of Linguistic Acceptability (CoLA). Here's a Medium article on HuggingFace, transformers, and the fine-tuning process.
You can also get a model that's already fine-tuned and you can put in the text classification pipeline for HuggingFace's transformers library here. That site hosts fine-tuned models and you can search for a few others that are fine tuned for the CoLA task there.
I work on the problem of finding the nearest document in a list of documents. Each document is a word or a very short sentence (e.g. "jeans" or "machine tool" or "biological tomatoes"). By closest I mean close in a semantical way.
I have tried to use word2vec embeddings (from the Mikolov article) but the closest words are more contextually linked than semantically linked ("jeans" is linked to "shoes" and not "trousers" as expected).
I have tried to use Bert encoding (https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#32-understanding-the-output) using last layers but it faces the same issues.
I have tried elastic search, but it doesn't find semantical similarities.
(The task needs to be solved in French but maybe solving it in English is a good first step)
Note different sets of word-vectors may vary in how well they capture your desired 'semantic' similarities. (In particular, training with a shorter window may emphasize similarity among words that are drop-in replacements for each other, as opposed to just used-in-similar domains, as larger window values may emphasize. See this answer for more details.)
You may also want to take a look at "Word Mover's Distance" as a way to compare short texts that contain various mixes of somewhat-similar words. (It's fairly expensive, but should be practical on your short texts. It's available in the Python gensim library as wmdistance() on KeyedVectors instances.)
If you have training data where your specific multi-word phrases are used, in many natural-language-like subtly-varied contexts, you could consider combining all such phrases-of-interest into single tokens (like machine_tool or biological_tomatoes), and training your own domain-specific word-vectors.
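A rough sketch of the token-combining step, assuming you already know the phrases of interest. The merge_phrases() helper and the example sentence are invented for illustration; Gensim's Phrases class is an alternative that tries to learn such combinations from co-occurrence statistics instead of a fixed list.

```python
from typing import List, Sequence, Set, Tuple


def merge_phrases(tokens: Sequence[str], phrases: Set[Tuple[str, ...]]) -> List[str]:
    """Replace each known multi-word phrase with one underscore-joined token,
    preferring the longest match at each position."""
    max_len = max((len(p) for p in phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


phrases_of_interest = {("machine", "tool"), ("biological", "tomatoes")}
sentence = "we sell biological tomatoes and a machine tool".split()
merged = merge_phrases(sentence, phrases_of_interest)
```

The merged token lists would then be the training sentences for your own word2vec model, so that machine_tool gets a vector of its own.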
To compute similarity between short texts that contain 2 or 3 words, you can use word2vec and take the average vector of the text.
For example, if you have a text (machine tool) and want to represent it as one vector using word2vec, you get the vector of "machine" and the vector of "tool", then combine them into one vector by averaging: add the two vectors and divide by 2 (the number of words). This gives you a vector representation for a text of more than one word.
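Once every short document has such an averaged vector, finding the nearest one is a matter of ranking by cosine similarity. A minimal sketch with invented 3-d toy vectors (real word2vec vectors have hundreds of dimensions, and the numbers below are chosen only so the example is easy to follow):

```python
import math
from typing import List, Sequence


def average(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean: add the vectors and divide by how many there are."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


# Toy word vectors, invented for illustration.
toy = {
    "machine": [0.9, 0.1, 0.0],
    "tool": [0.7, 0.3, 0.0],
    "jeans": [0.0, 0.2, 0.9],
    "trousers": [0.1, 0.1, 0.8],
}

documents = {
    "machine tool": average([toy["machine"], toy["tool"]]),
    "jeans": average([toy["jeans"]]),
}

query = average([toy["trousers"]])
ranked = sorted(documents, key=lambda name: cosine(query, documents[name]), reverse=True)
```

Whether "trousers" really lands closer to "jeans" than to "shoes" depends entirely on the word vectors used, so this only addresses the mechanics, not the quality of the embeddings.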
You can also use something like doc2vec, which is built on top of word2vec and whose purpose is to produce a vector for a sentence or paragraph.
You might try a document embedding that is built on top of word2vec.
However, note that word and document embeddings do not always capture the "desired similarity"; they just learn a language model on your corpus, and they are heavily influenced by text size and word frequency.
How big is your corpus? If you need it just to perform some classification it might be better to train your vectors on a large dataset such as Google News corpus.
I am trying to build a doc2vec model with roughly 10k sentences; after that I will use the model to find, for some new sentences, the most similar sentence in the model.
I have trained a gensim doc2vec model using the corpus (10k sentences) I have. This model can to some extent tell me if a new sentence is similar to some of the sentences in the corpus.
But, there is a problem: it may happen that there are words in new sentences which don't exist in the corpus, which means that they don't have a word embedding. If this happens, the prediction result will not be good.
As far as I know, the trained doc2vec model does have a matrix of doc vectors as well as a matrix of word vectors. So what I was thinking is to load a set of pre-trained word vectors, which contains a large number of words, and then train the model to get the doc vectors. Does it make sense? Is it possible with gensim? Or is there another way to do it?
Unlike what you might guess, typical Doc2Vec training does not train up word-vectors first, then compose doc-vectors using those word-vectors. Rather, in the modes that use word-vectors, the word-vectors are trained in a simultaneous, interleaved fashion alongside the doc-vectors, both changing together. And in one fast and well-performing mode, PV-DBOW (dm=0 in gensim), word-vectors aren't trained or used at all.
So, gensim Doc2Vec doesn't support pre-loading state from elsewhere, and even if it did, it probably wouldn't provide the benefit you expect. (You could dig through the source code & perhaps force it by doing a bunch of initialization steps yourself. But then, if words were in the pre-loaded set, but not in your training data, training the rest of the active words would adjust the entire model in directions incompatible with the imported-but-untrained 'foreign' words. It's only the interleaved, tug-of-war co-training of the model's state which makes the various vectors meaningful in relation to each other.)
The most straightforward and reliable strategy would be to try to expand your training corpus, by finding more documents from a similar/compatible domain, to include multiple varied examples of any words you might encounter later. (If you thought some other word-vectors were apt enough for your domain, perhaps the texts that were used to train those word-vectors can be mixed-into your training corpus. That's a reasonable way to put the word/document data from that other source on equal footing in your model.)
And, as new documents arrive, you can also occasionally re-train the model from scratch, with the now-expanded corpus, letting newer documents contribute equally to the model's vocabulary and modeling strength.
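One way to decide when such a re-train is due is to track how many tokens in incoming sentences are unknown to the current model. A tiny sketch: the oov_report() helper and the toy vocabulary are made up, and with a trained Gensim 4.x model the vocabulary would presumably come from something like model.wv.key_to_index.

```python
from typing import Collection, List, Sequence, Tuple


def oov_report(tokens: Sequence[str], vocabulary: Collection[str]) -> Tuple[List[str], float]:
    """Return the tokens missing from the vocabulary and their share of the text."""
    unknown = [t for t in tokens if t not in vocabulary]
    rate = len(unknown) / len(tokens) if tokens else 0.0
    return unknown, rate


# Toy vocabulary and sentence, invented for illustration.
known_words = {"the", "cat", "sat", "on", "mat"}
new_sentence = ["the", "dog", "sat", "on", "the", "rug"]

unknown, rate = oov_report(new_sentence, known_words)
```

Unknown words are simply ignored when a vector is inferred for a new text, so a high rate here means the inferred vector is based on only a fraction of the sentence.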
I am trying to find new concepts in a corpus in the Konkani language.
I have trained two models: 1) on a domain-specific corpus and 2) on a newspaper corpus.
I have used Gensim word2vec to train the models; however, I am unable to get terms of similar meaning in close proximity in the vector space.
The closest words show no sign of being synonyms of each other. Their similarity is as good as that of random words.
What am I doing wrong?
How big is your corpus?
For your trained vectors to be meaningful, you would need a corpus of at least 100 million words (assuming about 1-2 million unique words).
You could suspect the sampling method if you had used negative sampling instead of hierarchical softmax, but I still think that the small corpus size is your main problem.