Structure of Gensim Word Embedding corpus - gensim

I want to train a word2vec model using Gensim. My corpus consists of hundreds of thousands of articles from a specific newspaper. I preprocessed them (lowercasing, lemmatizing, removing stop words and punctuation, etc.) and then built a list of lists, in which each element is a list of words.
corpus = [['first', 'sentence', 'second', 'dictum', 'third', 'saying', 'last', 'claim'],
['first', 'adage', 'second', 'sentence', 'third', 'judgment', 'last', 'pronouncement']]
I wanted to know if this is the right way, or if it should be like the following:
corpus = [['first', 'sentence'], ['second', 'dictum'], ['third', 'saying'], ['last', 'claim'], ['first', 'adage'], ['second', 'sentence'], ['third', 'judgment'], ['last', 'pronouncement']]

Both would minimally work.
But in the second, no matter how big your window parameter, the fact that all texts are no more than 2 tokens long means words will only affect their immediate neighbors. That's probably not what you want.
There's no real harm in longer texts, except to note that:
Tokens in the same list will appear in each other's window-sized neighborhood - so don't run together words that would never realistically be used alongside each other. (But, in large-enough corpuses, even the noise of some run-together unrelated texts won't make much difference, swamped by the real relationships in the bulk of the texts.)
Each text shouldn't be more than 10,000 tokens long, as an internal implementation limit will cause any tokens beyond that limit to be ignored.
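For concreteness, here's a minimal sketch of training on the first format (one list of tokens per article). The parameter values are only illustrative, and the vector_size name assumes gensim 4.x (older releases call it size):
from gensim.models import Word2Vec

# Each inner list is one preprocessed article (kept under the ~10,000-token limit noted above).
corpus = [
    ['first', 'sentence', 'second', 'dictum', 'third', 'saying', 'last', 'claim'],
    ['first', 'adage', 'second', 'sentence', 'third', 'judgment', 'last', 'pronouncement'],
]

# Illustrative settings only; min_count=1 just keeps this toy corpus from being filtered away.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar('sentence'))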

Related

Correct way to represent documents containing multiple sentences in gensim file-based training

I am trying to use gensim's file-based training (example from documentation below):
from multiprocessing import cpu_count
import gensim.downloader as api  # provides api.load() used below
from gensim.utils import save_as_line_sentence
from gensim.test.utils import get_tmpfile
from gensim.models import Word2Vec, Doc2Vec, FastText
# Convert any corpus to the needed format: 1 document per line, words delimited by " "
corpus = api.load("text8")
corpus_fname = get_tmpfile("text8-file-sentence.txt")
save_as_line_sentence(corpus, corpus_fname)
# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()
# Train models using all cores
w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)
However, my actual corpus contains many documents, each containing many sentences.
For example, let's assume my corpus is the plays of Shakespeare - each play is a document, each document has many, many sentences, and I would like to learn embeddings for each play, but the word embeddings only from within the same sentence.
Since file-based training is meant to be one document per line, I assume that I should put one play per line. However, the documentation for file-based training doesn't include an example of a document with multiple sentences.
Is there a way to peek inside the model to see the documents and word context pairs that have been found before they are trained?
What is the correct way to build this file, maintaining sentence boundaries?
Thank you
These algorithm implementations don't have any real understanding of, or dependence on, actual sentences. They just take texts – runs of word-tokens.
Often the texts provided to Word2Vec will be multiple sentences. Sometimes punctuation like sentence-ending periods are even retained as pseudo-words. (And when the sentences were really consecutive with each other in the source data, the overlapping word-context windows, between sentences, may even be a benefit.)
So you don't have to worry about "maintaining sentence boundaries". Any texts you provide that are sensible units of words that really co-occur will work about as well. (Especially in Word2Vec and FastText, even changing your breaks between texts to be sentences, or paragraphs, or sections, or documents is unlikely to have very much effect on the final word-vectors – it's just changing a subset of the training contexts, and probably not in any way that significantly changes which words influence which other words.)
There is, however, another implementation limit in gensim that you should watch out for: each training text can only be 10,000 tokens long, and if you supply larger texts, the extra tokens will be silently ignored.
So, be sure to use texts that are 10k tokens or shorter – even if you have to arbitrarily split longer ones. (Per above, any such arbitrary extra break in the token grouping is unlikely to have a noticeable effect on results.)
However, splitting presents a special problem when using Doc2Vec in corpus_file mode, because in that mode you don't get to specify your preferred tags for a text. (A text's tag, in this mode, is essentially just its line number.)
In the original corpus-iterable mode, the workaround for this 10k-token limit was to break larger docs into multiple docs, but use the same repeated tags for all sub-documents from an original document. (This very closely approximates how a doc of any size would affect training.)
If you have documents with more than 10k tokens, I'd recommend either not using corpus_file mode, or figuring some way to use logical sub-documents of less than 10k tokens, then perhaps modeling your larger docs as the set of their sub-documents, or otherwise adjusting your downstream tasks to work on the same sub-document units.
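As a rough sketch of that repeated-tags workaround in the plain iterable-corpus mode (not corpus_file mode), assuming gensim 4.x; the split_into_chunks helper and the plays data are hypothetical stand-ins:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def split_into_chunks(tokens, max_len=10000):
    # Hypothetical helper: yield consecutive slices of at most max_len tokens.
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len]

# Hypothetical corpus: one entry per play, each a full list of tokens.
plays = {'hamlet': ['to', 'be', 'or', 'not', 'to', 'be'] * 3000,
         'macbeth': ['out', 'damned', 'spot', 'out'] * 4000}

tagged_docs = [
    TaggedDocument(words=chunk, tags=[title])   # the same tag is repeated for every sub-document
    for title, tokens in plays.items()
    for chunk in split_into_chunks(tokens)
]

model = Doc2Vec(tagged_docs, vector_size=100, epochs=10, workers=4)
print(model.dv['hamlet'])   # one doc-vector per play, learned from all its sub-documents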

word2vec window size at sentence boundaries

I am using word2vec (and doc2vec) to get embeddings for sentences, but I want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
[
['apple', 'banana','carrot','dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass','copper', ..., 'zinc'],
...
]
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence, creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'dates')), before it eventually gets to the correct ('apple', ('banana', 'carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than to 'zucchini',
since there are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled in inverse proportion to the distance from the target word to the context word. This also causes an issue, making nearby words seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
All these considerations would be the same using SG or CBOW mode.
I believe a million-times-larger window will be adequate for your needs, but if for some reason it isn't, another way to essentially cancel out any nearness effects would be to re-shuffle each corpus item's word order every time it's accessed as training data. That ensures any nearness advantages will be mixed evenly across all words – especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even a one-time shuffle of each sentence would be enough: over all examples of co-occurring words, the word co-occurrences would be sampled in the right proportions even with small windows.)
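As a rough sketch of that combination - a window far larger than any sentence, plus per-pass re-shuffling - assuming gensim 4.x; the ShuffledCorpus wrapper is a hypothetical helper:
import random
from gensim.models import Word2Vec

class ShuffledCorpus:
    # Hypothetical wrapper: yields each sentence with its word order re-shuffled
    # on every iteration, i.e. on every pass gensim makes over the corpus.
    def __init__(self, sentences):
        self.sentences = sentences
    def __iter__(self):
        for sentence in self.sentences:
            shuffled = list(sentence)
            random.shuffle(shuffled)
            yield shuffled

sentences = [
    ['apple', 'banana', 'carrot', 'dates', 'elderberry', 'zucchini'],
    ['aluminium', 'brass', 'copper', 'zinc'],
]

# The window is far larger than any sentence, so nearly every co-occurring word
# falls inside the effective window on every pass. sg=1 is skip-gram; the same
# idea applies with CBOW (sg=0).
model = Word2Vec(ShuffledCorpus(sentences), window=1000000, min_count=1, sg=1, workers=4)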
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth searching further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested that the optimal ns_exponent for training vectors for recommendation systems is far different from the usual 0.75 default for natural-language modeling.)
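Those metaparameters are all plain constructor arguments in gensim, so a search point away from the defaults might look like the following sketch (the values are purely illustrative, not recommendations):
from gensim.models import Word2Vec

sentences = [
    ['apple', 'banana', 'carrot', 'zucchini'],
    ['aluminium', 'brass', 'copper', 'zinc'],
]

model = Word2Vec(
    sentences,
    window=1000000,
    min_count=1,
    sg=1,
    negative=10,      # number of negative samples drawn per positive example (default 5)
    sample=0,         # disable frequent-word downsampling entirely
    ns_exponent=0.0,  # 0.0 draws negative samples uniformly; the default is 0.75
    workers=4,
)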

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each of the word2vec will have the same parameter and dimension, but trained with slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Each word's similarities to other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations in such a way that the same words occupy the same position in vector space, or at least are as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
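If you'd rather not depend on that class, the same learn-a-projection idea can be sketched directly as an orthogonal Procrustes alignment over a set of shared anchor words (note this is a related technique implemented by hand, not gensim's TranslationMatrix API; the anchor words and model names below are hypothetical):
import numpy as np

def procrustes_align(source_vecs, target_vecs):
    # Find the orthogonal matrix R minimizing ||source_vecs @ R - target_vecs||,
    # where both inputs are (n_anchor_words, dim) arrays holding the vectors of
    # the same anchor words from each model.
    u, _, vt = np.linalg.svd(source_vecs.T @ target_vecs)
    return u @ vt

# Hypothetical usage with two trained gensim models, model_a and model_b:
# anchors = ['hot', 'cold', 'first', 'last']
# src = np.stack([model_a.wv[w] for w in anchors])
# tgt = np.stack([model_b.wv[w] for w in anchors])
# R = procrustes_align(src, tgt)
# aligned = model_a.wv['tamale'] @ R   # model_a's vector expressed in model_b's space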
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
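As a rough sketch of that freezing idea, using the gensim 3.x-era attribute names the paragraph above refers to (the location and name of the _lockf array differ across gensim versions, so verify against your own install; the corpus and anchor words here are hypothetical):
from gensim.models import Word2Vec

# Hypothetical data: a full corpus, plus a subset to re-train on afterwards.
full_corpus = [['hot', 'cold', 'tamale', 'skiing']] * 200
subset = [['hot', 'tamale', 'skiing']] * 50
anchor_words = ['hot', 'cold']

model = Word2Vec(full_corpus, min_count=1, workers=1)
model.save('full.model')

model = Word2Vec.load('full.model')
for word in anchor_words:
    idx = model.wv.vocab[word].index            # gensim 3.x vocab lookup
    model.trainables.vectors_lockf[idx] = 0.0   # 0.0 freezes this word's vector against updates

# Only the non-frozen words move during this secondary training pass.
model.train(subset, total_examples=len(subset), epochs=model.epochs)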

How to extract features from plain text?

I am writing a text parser which should extract features from product descriptions.
Eg:
text = "Canon EOS 7D Mark II Digital SLR Camera with 18-135mm IS STM Lens"
features = extract(text)
print features
Brand: Canon
Model: EOS 7D
....
The way I do this is by training the system with structured data and coming up with an inverted index which can map a term to a feature. This works mostly well.
When the text contains measurements like 50ml or 2kg, the inverted index will say, for example, 2kg -> Size and 50ml -> Size.
The problem here is that when I get a value which I haven't seen before, like 13ml, it won't be processed. But since the pattern matches a size, we could tag it as a size.
I was thinking to solve this problem by preprocessing the tokens that I get from the text and look for patterns that I know. So when new patterns are identified, that has to be added to the preprocessing.
I was wondering, is this the best way to go about this? Or is there a better way of doing this?
The age-old problem of unseen cases. You could train your scraper to grab any number-like characters preceding certain suffixes (ml, kg, etc.) and treat those as a size. The problem with this is that typos and other poorly formatted texts could enter into your structured data. There is no right answer for how to handle values you haven't seen before - you'll either have to QC them individually, or have rules around them. This is dependent on your dataset.
As far as identifying patterns, you'll either have to manually enter them, or manually classify a lot of records and let the algorithm learn them. Not sure that's very helpful, but a lot of this is very dependent on your data.
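As a sketch of that suffix-pattern idea (the pattern and unit list below are just illustrative placeholders to extend for your own data):
import re

# A number, optionally with a decimal part, followed directly by a known unit suffix.
MEASUREMENT_RE = re.compile(r'^(\d+(?:\.\d+)?)(ml|l|g|kg|mm|cm)$', re.IGNORECASE)

def tag_token(token):
    # Return 'Size' for unit-like tokens, even unseen values such as '13ml'.
    if MEASUREMENT_RE.match(token):
        return 'Size'
    return None   # fall back to the inverted index for everything else

print(tag_token('13ml'))   # 'Size', even though 13ml never appeared in training data
print(tag_token('Canon'))  # None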
If you have a training data like this:
word    label
10ml    size-volume
20kg    size-weight
etc...
You could train a classifier based on character n-grams, and it would detect that ml is size-volume even if it sees an 11-ml or ml11, etc. You should also convert the numbers into a single placeholder number (e.g. 0), so that 11-ml is seen as 0-ml before feature extraction.
For that you'll need a preprocessing module and also a large training sample. For feature extraction you can use scikit-learn's character n-grams, and an SVM for classification.
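A rough sketch of that approach with scikit-learn, using a tiny illustrative training set (a real one would need far more labeled examples):
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def normalize_numbers(token):
    # Collapse every digit run to '0' so that '11-ml' and '50ml' share the same shape.
    return re.sub(r'\d+', '0', token.lower())

# Tiny illustrative training data in the word/label form shown above.
words = ['10ml', '20kg', '500ml', '2kg', '75cl', '1.5kg']
labels = ['size-volume', 'size-weight', 'size-volume', 'size-weight', 'size-volume', 'size-weight']

clf = make_pipeline(
    CountVectorizer(analyzer='char', ngram_range=(1, 3), preprocessor=normalize_numbers),
    LinearSVC(),
)
clf.fit(words, labels)

print(clf.predict(['11-ml', 'ml11', '13kg']))   # unseen values, classified by their character shape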

How can I get popular tags/keywords from a collection of unstructured text chunks?

I am storing small chunks of texts - say of around 100 - 200 words - in a NoSQL database, and need to display the trending keywords/tags among all of these chunks.
I know of text analysis APIs like alchemy which extract entities from a single chunk of text, but I want top keywords/tags among all the chunks.
Should I store keywords against each text-chunk and then do an exhaustive counting of the top keywords? In which case, each keyword may differ slightly and may lead to fragmentation of similar keywords.
It's not always necessary that extracting entities will give you the result you want (though it serves a basic purpose). To make it more effective, you should remove stopwords, do stemming, convert uppercase to lowercase, apply spelling correction, and then use a HashMap to find frequencies.
Using these frequencies you can filter out the top 100-200 entities/tags.
I hope this helps.
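A minimal sketch of that counting pipeline in Python, with NLTK's stopword list and stemmer standing in for whatever preprocessing you already use, and collections.Counter playing the role of the HashMap (the chunks are made-up examples):
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Hypothetical text chunks as they might come back from the NoSQL store.
chunks = [
    "Python developers love concise code and readable Python scripts.",
    "Readable code is easier to maintain than clever code.",
]

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))           # requires nltk.download('stopwords') once

counts = Counter()
for chunk in chunks:
    for token in chunk.lower().split():
        token = token.strip('.,!?"\'')           # crude punctuation stripping
        if token and token not in stop:
            counts[stemmer.stem(token)] += 1     # stemming merges near-duplicate keywords

print(counts.most_common(10))                    # top trending tags across all chunks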
