Gensim's FastText KeyedVector out of vocab - gensim

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyVectors version, I get the following Error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out-of-vocabulary e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors
model = KeyedVectors.load("model.kv")
model .wv.__getitem__("lawyerxy")
So, my assumption is that the KeyedVectors do not offer FastText's out of vacabulary function - a key feature for my usecase. This limitation is not given in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone prove that assumption and/or name a fix to allow vectors for "lawyerxy" etc. ?

The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.

Related

Creating balanced bootstrap resamples in caret

I'm using caret to compare models for a classification problem with nested CV. Vfold in the outer loop and bootstrap (500 replicates) in the inner loop. I get this error after training knn:
Warning: There were missing values in resampled performance measures.
Which I believe comes from the fact that some resamples have zero items of the class of interest in the holdout sample, yielding NA for Sensitivity and ROC. My question is: Is there any way to ensure that items from this class are present in every bootstrap resample? Kind of what the CreateDataPartition function does (I believe this is also called stratified bootstrap?).
If not, how should we proceed with this? (In terms of comparing model performance on the same resamples)
Thanks!
So I couldn't find a way to do this within caret but here is a workaround using rsample package. The point is to compute the resamples before and feed this information to trainControl function via index and indexOut arguments, previous conversion to caret format.
indices=bootstraps(train,times=50,strata="class_of_interest")
indices=rsample2caret(indices)
train_control <- trainControl(method="boot",number=50,index=indices$index,indexOut = indices$indexOut)
Hope this helps.

Doc2Vec How to find most similar document

I am using Gensim's Doc2Vec, and was wondering if there is a way to get the most similar document to another document that is outside the list of TaggedDocuments used to train the Doc2Vec model.
Right now I can infer a vector from a document not in the training set:
# 'model' here is a instance of Doc2Vec class that has been trained
# Inferring a vector
doc_not_in_training_set = "Foo Foo Foo Foo Foo Foo Fie"
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))
print("V1_infer", v1)
This prints out a vector representation of the 'doc_not_in_training_set' string. However, is there a way to use this vector to find the n most similar documents to the 'doc_not_in_training_set' string (in the TaggedDocuments training set for this word2vec model)?
Looking under the documentation, the closest I could find was the model.docvec.most_similar() method:
# Finding most similar to first
similar_doc = model.docvecs.most_similar('0')
This returns the document in the training set most similar to the document in the training set with tag '0'.
In the documentation of this method, it looks like there is not yet the functionality I am looking for:
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Is there another method I can use to find documents similar to a document not in the training set?
The .most_similar() method will also take a raw vectors as the target position.
It helps to explicitly name the positive parameter, to prevent other logic of that method, which tries to intuit what other strings/etc supplied as arguments might mean, from misinterpreting a single raw vector.
So try:
similar_docs = model.docvecs.most_similar(positive=[v1])
You should get back a list of nearest-neighbors to the v1 vector that you'd previously inferred.

Change lgbm internal parameter (threshold) by hand

I have trained a model with lgbm. I can dump its interval values with
booster.dump_model()
and see all the internal parameters that has been optimized during the training (leaf values, threshold, index of the variables for each split, ...). For testing purpose I would like to change some. Is there a way? I guess that changing just the output of dump_model will do nothing.
You can save your model to a human-understandable format using
booster.save_model('model.txt'), do your modifications on model.txt, and load back the modified model using modified_booster = lightgbm.Booster(model_file='model.txt').
I hope it helps!

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with Fasttext Word vectors for return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun", in english) return a series of words with a dot in it (like sole., sole.Ma, ecc...). Where is the problem? Why most_similar return this meaningless word?
EDIT
I tried with english word vector and the word "sun" return this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interesting it using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

plone.scale annotation bloated with (usesless?) scales

While investigating a ConflictError (see this previous question) I saw a lot of persistent.mapping.PersistentMapping conflicts.
Looking at a specific one it turned out to be a PersistentMapping for plone.scale.
Turns out that a random object with just one image has 562 keys on it, no wonder why it gets a conflict error...
Some context on the object that holds this plone.scale annotation:
- dexterity content type
- one of its behaviors has an image field (plone.namedfile.field.NamedBlobImage)
The code to see it is as following:
Start a debugging instance: ./bin/instance debug
from ZODB.utils import p64
OID = 0x568428 # got from zeo client logs
mapping = app._p_jar[p64(OID)]
len(mapping) # that returns 562
The mysterious part is that only 4 keys on that persistent mapping are tuples, while the other 558 are just hashes.
A brief look at plone.scale.storage.AnnotationStorage.scale method seems to imply that there should be only one to one relation from tuples and hashes keys on the persistent mapping.
Further investigating the elements reveals that, indeed, if you look at the width and height properties from all elements there are only 4 different combinations (the ones from the tuples itself).
As a new scale is generated whenever the modified time is bigger (see the scale method pointed above) and plone.namedfield.scaling.ImageScaling.modified uses context as the source for modified, that means that at every single update of the object a new scale will be generated?
So two questions arise from the previous:
my assumption of only 4 scales are really used and the other 558 are old and useless is true?
provided a yes on the previous, shouldn't they be cleaned up then?
You may be right, but surely the correct place to report this is https://dev.plone.org/newticket

Resources