FastTextKeyedVectors difference between vectors, vectors_vocab and vectors_ngrams instance variables - gensim

I downloaded wiki-news-300d-1M-subword.bin.zip and loaded it as follows:
import gensim
print(gensim.__version__)
model = gensim.models.fasttext.load_facebook_model('./wiki-news-300d-1M-subword.bin')
print(type(model))
model_keyedvectors = model.wv
print(type(model_keyedvectors))
model_keyedvectors.save('./wiki-news-300d-1M-subword.keyedvectors')
As expected, I see the following output:
3.8.1
<class 'gensim.models.fasttext.FastText'>
<class 'gensim.models.keyedvectors.FastTextKeyedVectors'>
I also see the following three numpy arrays serialized to the disk:
$ du -h wiki-news-300d-1M-subword.keyedvectors*
127M wiki-news-300d-1M-subword.keyedvectors
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors_ngrams.npy
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors.npy
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors_vocab.npy
I understand vectors_vocab.npy and vectors_ngrams.npy, however, what is vectors.npy is used for internally in gensim.models.keyedvectors.FastTextKeyedVectors? If I look at the source code for finding out word vector, I do not see how attribute vectors is being used anywhere. I see the attributes vectors_vocab and vectors_ngrams bing used. However, if I remove vectors.npy file, I am not able to load the model using gensim.models.keyedvectors.FastTextKeyedVectors.load method.
Can someone please explain where this variable is used? Can I remove it if all I am interested is in looking word vectors (to reduce memory footprint)?
Thanks.

vectors_ngrams are the buckets storing the vectors that are learned from word-fragments (character-n-grams). It's a fixed size no matter how many n-grams are encountered - as multiple n-grams can 'collide' into the same slot.
vectors_vocab are the full-word-token vectors as trained by the FastText algorithm, for full-words of interest. However, note that the actual word-vector, as returned by FastText for an in-vocabulary word, is defined as being this vector plus all the subword vectors.
vectors stores the actual, returnable full-word vectors for in-vocabulary words. That is: it's the precalculated combination of the vectors_vocab value plus all the word's n-gram vectors.
So, vectors is never directly trained, and can always be recalculated from the other arrays. It probably should not be stored as part of the saved model (as it's redundant info that could be reconstructed on demand).
(It could possibly even be made an optional optimization, for the specific case of FastText – with users who are willing to save memory, but have slower per-word lookup, discarding it. However, this would complicate the very common and important most_similar()-like operations, which are far more efficient if they have a full, ready array of all potential-answer word-vectors.)
If you don't see vectors being directly accessed, perhaps you're not considering methods inherited from superclasses.
While any model that was saved with vectors present will need that file when later .load()ed, you could conceivably save on disk-storage by discarding the model.wv.vectors property before saving, then forcing its reconstruction after loading. You would still be paying the RAM cost, when the model is loaded.
After vectors is calculated, and if you're completely done training, you could conceivably discard the vectors_vocab property to save RAM. (For any known word, the vectors can be consulted directly for instant look-up, and vectors_vocab is only needed in the case of further training or needing to re-generate vectors.)

Related

What is the fastest way to intersect two large set of ids

The Problem
On a server, I host ids in a json file. From clients, I need to mandate the server to intersect and sometimes negate these ids (the ids never travel to the client even though the client instructs the server its operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you could not limit yourselves to the following list of broad considerations for solutions, here is a list of what I internally debated to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that does such algorithm. I know that assembly is a mess to maintain and code, but sometimes one need the speed of an highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine to be executed directly on the Linux server (and then I'd always pay attention to stick with the same processor to avoid having to re-write this too often)? Or, alternately, maybe C would be close enough to assembly to produce clean and optimized assembly code without the overhead to maintain assembly code.
Images and GPU: GPU and image processing could be used and instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id have unique values between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but the pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but then to translate the resulting image back to ids may be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). An estimate of a 200,000,000 bits sequence is 23MB each, just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured in such a way that each id resides within its corresponding hash key, then, the smaller of the ids set could be sequentially ran and matching the id into the larger ids set first would require hashing the value to match, and then sequentially matching of the corresponding ids into that key match?
This should make the algorithm an O(n) time based one, and since I'd pick the smallest ids set to be the sequentially ran one, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
{
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where bit at index j is set if the list contains value j-100_000_000. It is possible, because the range of id values is fixed and small.
Then you can simply use bitwise and and not operations to intersect and negate lists of ids. Depending on the language and libraries used, it would require operating element-wise: iterating over arrays and applying corresponding operations to each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
First, the bitmap approach will produce the required performance, at a huge overhead in memory. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, with that almost entirely dominated by the cost of loading data from disk, and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
is_negated #0 :Bool;
ids #1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping is_negated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
If both are negated, you just want to find all ids in either list.
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number and picking most of them. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.

Is a gensim vocab index the index in the corresponding 1-hot-vector?

I am doing research that requires direct manipulation & embedding of one-hot vectors and I am trying to use gensim to load a pretrained word2vec model for this.
The problem is they don't seem to have a direct api for working with 1-hot-vectors. And I am looking for work arounds.
So I wanted to know if anyone knows of a way to do this? Or more specifically if these vocab indices (which are defined quite ambiguously). Could be indices into corresponding 1-hot-vectors?
Context I have found:
Seems this question is related but I tried accessing the 'input embeddings' (assuming they were one-hot representations), via model.syn0 (from link in answer), but I got a non-sparse matrix...
Also appears they refer to word indices as 'doctags' (search for Doctag/index).
Here is another question giving some context to the indices (although not quite answering my question).
Here is the official documentation:
################################################
class gensim.models.keyedvectors.Vocab(**kwargs)
Bases: object
A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).
################################################
Yes, you can think of the index (position) of gensim's Word2Vec word-vectors as being the one dimension that would be 1.0 – with all other V dimensions, where V is the count of unique words, being 0.0.
The implementation doesn't actually ever create one-hot vectors, as a sparse or explicit representation. It's just using the word's index as a look-up for its dense vector – following in the path of the word2vec.c code from Google on which the gensim implementation was originally based.
(The term 'doctags' is only relevant in the Doc2Vec – aka 'Paragraph Vector' – implementation. There it is the name for the distinct tokens/ints that are used for looking up document-vectors, using a different namespace from in-document words. That is, in Doc2Vec you could use 'doc_007' as a doc-vector name, aka a 'doctag', and even if the string-token 'doc_007' also appears as a word inside documents, the doc-vector referenced by doctag-key 'doc_007' and the word-vector referenced by word-key 'doc_007' wouldn't be the same internal vector.)

Assessing doc2vec accuracy

I am trying to assess a doc2vec model based on the code from here. Basically, I want to know the percentual of inferred documents are found to be most similar to itself. This is my current code an:
for doc_id, doc in enumerate(cur.execute('SELECT Text FROM Patents')):
docs += 1
doc = clean_text(doc)
inferred_vector = model.infer_vector(doc)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims].index(doc_id)
ranks.append(rank)
counter = collections.Counter(ranks)
accuracy = counter[0] / docs
This code works perfectly with smaller datasets. However, since I have a huge file with millions of documents, this code becomes too slow, it would take months to compute. I profiled my code and most of the time is consumed by the following line: sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs)).
If I am not mistaken, this is having to measure each document to every other document. I think computation time might be massively reduced if I change this to topn=1 instead since the only thing I want to know is if the most similar document is itself or not. Doing this will basically take each doc (i.e., inferred_vector), measure its most similar document (i.e., topn=1), and then I just see if it is itself or not. How could I implement this? Any help or idea is welcome.
To have most_similar() return only the single most-similar document, that is as simple as specifying topn=1.
However, to know which one document of the millions is the most-similar to a single target vector, the similarities to all the candidates must be calculated & sorted. (If even one document was left out, it might have been the top-ranked one!)
Making sure absolutely no virtual-memory swapping is happening will help ensure that brute-force comparison happens as fast as possible, all in RAM – but with millions of docs, it will still be time-consuming.
What you're attempting is a fairly simple "self-check" as to whether training led to self-consistent model: whether the re-inference of a document creates a vector "very close to" the same doc-vector left over from bulk training. Failing that will indicate some big problems in doc-prep or training, but it's not a true measure of the model's "accuracy" for any real task, and the model's value is best evaluated against your intended use.
Also, because this "re-inference self-check" is just a crude sanity check, there's no real need to do it for every document. Picking a thousand (or ten thousand, or whatever) random documents will give you a representative idea of whether most of the re-inferred vectors have this quality, or not.
Similarly, you could simply check the similarity of the re-inferred vector against the single in-model vector for that same document-ID, and check whether they are "similar enough". (This will be much faster, but could also be done on just a random sample of docs.) There's no magic proper threshold for "similar enough"; you'd have to pick one that seems to match your other goals. For example, using scikit-learn's cosine_similarity() to compare the two vectors:
from sklearn.metrics.pairwise import cosine_similarity
# ...
inferred_vector = model.infer_vector(doc_words)
looked_up_vector = model.dv[doc_id]
self_similarity = cosine_similarity([inferred_vector], [looked_up_vector])[0]
# then check that value against some threshold
(You have to wrap the single vectors in lists as arguments to cosine_similarity(), then access the 0th element of the return value, because it is designed to usually work on larger lists of vectors.)
With this calculation, you wouldn't know if, for example, some of the other stored-doc-vectors are a little closer to your inferred target - but that may not be that important, anyway. The docs might be really similar! And while the original "closest to itself" self-check will fail miserably if there were major defects in training, even a well-trained model will likely have some cases where natural model jitter prevents a "closest to itself" for every document. (With more documents inside the same number of dimensions, or certain corpuses with lots of very-similar documents, this would become more common... but not be a concerning indicator of any model problems.)

Does ND4J slicing make a copy of the original array?

ND4J INDArray slicing is achieved through one of the overloaded get() methods as answered in java - Get an arbitrary slice of a Nd4j array - Stack Overflow. As an INDArray takes a continuous block of native memory, does slicing using get() make a copy of the original memory (especially row slicing, in which it is possible create a new INDArray with the same backing memory)?
I have found another INDArray method subArray(). Does this one make any difference?
I am asking this because I am trying to create a DatasetIterator that can directly extract data from INDArrays, and I want to eliminate possible overhead. There is too much abstraction in the source code and I couldn't find the implementation myself.
A similar question about NumPy is asked in python - Numpy: views vs copy by slicing - Stack Overflow, and the answer can be found in Indexing — NumPy v1.16 Manual:
The rule of thumb here can be: in the context of lvalue indexing (i.e. the indices are placed in the left hand side value of an assignment), no view or copy of the array is created (because there is no need to). However, with regular values, the above rules for creating views does apply.
The short answer is: no it is using reference when possible. To make a copy the .dup() function can be called.
To quote https://deeplearning4j.org/docs/latest/nd4j-overview
Views: When Two or More NDArrays Refer to the Same Data
A key concept in ND4J is the fact that two NDArrays can actually point to the same
underlying data in memory. Usually, we have one NDArray referring to
some subset of another array, and this only occurs for certain
operations (such as INDArray.get(), INDArray.transpose(),
INDArray.getRow() etc. This is a powerful concept, and one that is
worth understanding.
There are two primary motivations for this:
There are considerable performance benefits, most notably in avoiding
copying arrays We gain a lot of power in terms of how we can perform
operations on our NDArrays Consider a simple operation like a matrix
transpose on a large (10,000 x 10,000) matrix. Using views, we can
perform this matrix transpose in constant time without performing any
copies (i.e., O(1) in big O notation), avoiding the considerable cost
copying all of the array elements. Of course, sometimes we do want to
make a copy - at which point we can use the INDArray.dup() to get a
copy. For example, to get a copy of a transposed matrix, use INDArray
out = myMatrix.transpose().dup(). After this dup() call, there will be
no link between the original array myMatrix and the array out (thus,
changes to one will not impact the other).

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each of the word2vec will have the same parameter and dimension, but trained with slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Their similarity among other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both of the word2vec representation in such way that same words occupy same position in vector space, or at least they are as close as possible.
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)

Resources