How to load pre-trained glove model with gensim load_word2vec_format? - stanford-nlp

I am trying to load a pre-trained glove as a word2vec model in gensim. I have downloaded the glove file from here. I am using the following script:
from gensim import models
model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)
but get the following error
ValueError Traceback (most recent call last)
<ipython-input-38-e0b48b51f433> in <module>()
1 from gensim import models
----> 2 model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)
2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in <genexpr>(.0)
171 with utils.smart_open(fname) as fin:
172 header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 173 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
174 if limit:
175 vocab_size = min(vocab_size, limit)
ValueError: invalid literal for int() with base 10: 'the'
What is the underlying problem? Does gensim need a specific format to be able to load it?

The GLoVe format is slightly different – missing a 1st-line declaration of vector-count & dimensions – than the format that load_word2vec_format() supports.
There's a glove2word2vec utility script included you can run once to convert the file:
https://radimrehurek.com/gensim/scripts/glove2word2vec.html
Also, starting in Gensim 4.0.0 (currentlyu in prerelease testing), the load_word2vec_format() method gets a new optional no_header parameter:
https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=load_word2vec_format#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format
If set as no_header=True, the method will deduce the count/dimensions from a preliminary scan of the file - so it can read a GLoVe file with that option – but at the cost of two full-file reads instead of one. (So, you may still want to re-save the object with .save_word2vec_format(), or use the glove2word2vec script, to make future loads faster.)

Related

'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

I want to use Word2Vec, and i have download a Word2Vec's corpus in indonesian language, but when i call it, it was give me an error, this is what i try :
Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)
and it was give me an error, like this :
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-73-219e152ee7d9> in <module>()
----> 1 Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)
2 frames
/usr/local/lib/python3.7/dist-packages/gensim/utils.py in any2unicode(text, encoding, errors)
353 if isinstance(text, unicode):
354 return text
--> 355 return unicode(text, encoding, errors=errors)
356
357
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
A file named idwiki_word2vec_100_new_lower.model.wv.vectors.npy is unlikely to be in the format needed by load_word2vec_format().
The .npy suggests it is a raw numpy array, which is not the format expected.
Also, the .wv.vectors. section suggests this could be part of a full, multi-file Gensim .save() of a complete Word2Vec model. That's more than just the vectors, & requires all associated files to re-load.
You should double-check the source of the vectors and what their claims are about its format and the proper ways to load. (If you're still having problems & need more guidance, you should specify more details about the origin of the file – for example a link to the website where it was obtained – to support other suggestions.)

T5Tokenizer requires the SentencePiece library but it was not found in your environment

I am trying to explore T5
this is the code
!pip install transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration
qa_input = """question: What is the capital of Syria? context: The name "Syria" historically referred to a wider region,
broadly synonymous with the Levant, and known in Arabic as al-Sham. The modern state encompasses the sites of several ancient
kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
among the oldest continuously inhabited cities in the world."""
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
outputs = model.generate(input_ids)
output_str = tokenizer.decode(outputs.reshape(-1))
I got this error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-8d24c6a196e4> in <module>()
5 kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
6 among the oldest continuously inhabited cities in the world."""
----> 7 tokenizer = T5Tokenizer.from_pretrained('t5-small')
8 model = T5ForConditionalGeneration.from_pretrained('t5-small')
9 input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
1 frames
/usr/local/lib/python3.6/dist-packages/transformers/file_utils.py in requires_sentencepiece(obj)
521 name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
522 if not is_sentencepiece_available():
--> 523 raise ImportError(SENTENCEPIECE_IMPORT_ERROR.format(name))
524
525
ImportError:
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment.
--------------------------------------------------------------------------
after that I install sentencepiece library as was suggested like this:
!pip install transformers
!pip install sentencepiece
from transformers import T5Tokenizer, T5ForConditionalGeneration
qa_input = """question: What is the capital of Syria? context: The name "Syria" historically referred to a wider region,
broadly synonymous with the Levant, and known in Arabic as al-Sham. The modern state encompasses the sites of several ancient
kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
among the oldest continuously inhabited cities in the world."""
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
outputs = model.generate(input_ids)
output_str = tokenizer.decode(outputs.reshape(-1))
but I got another issue:
Some weights of the model checkpoint at t5-small were not used when
initializing T5ForConditionalGeneration:
['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another
architecture (e.g. initializing a BertForSequenceClassification model
from a BertForPreTraining model).
This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you
expect to be exactly identical (initializing a
BertForSequenceClassification model from a
BertForSequenceClassification model).
so I did not understand what is going on, any explanation?
I used these two command and this working fine for me!
!pip install datsets transformers[sentencepiece]
!pip install sentencepiece
This is not an issue. I also observe the second output. It is just a warning that the library shows. You fixed your actual issue. Do not worry about the warning.

Load Doc2Vec without the docs vectors only for infer_vector

I have big gensim Doc2vec model, I only need to infer vectors while i am loading the training documents vectors from other source.
Is it possible to load it as is without the big npy file
I did
Edit:
from gensim.models.doc2vec import Doc2Vec
model_path = r'C:\model/model'
model = Doc2Vec.load(model_path)
model.delete_temporary_training_data(keep_doctags_vectors=False, keep_inference=True)
model.save(model_path)
remove the files (model.trainables.syn1neg.npy,model.wv.vectors.npy) manually
model = Doc2Vec.load(model_path)
but it ask for
Traceback (most recent call last):
File "<ipython-input-5-7f868a7dbe0c>", line 1, in <module>
model = Doc2Vec.load(model_path)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\doc2vec.py", line 1113, in load
return super(Doc2Vec, cls).load(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\base_any2vec.py", line 1244, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\base_any2vec.py", line 603, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 427, in load
obj._load_specials(fname, mmap, compress, subname)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 458, in _load_specials
getattr(self, attrib)._load_specials(cfname, mmap, compress, subname)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 469, in _load_specials
val = np.load(subname(fname, attrib), mmap_mode=mmap)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\numpy\lib\npyio.py", line 428, in load
fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\model/model.trainables.syn1neg.npy'
Note:
Those files not exists in the directory,
The model run on a server and download the model file from the storage
My question is, Do the model must have those files for inference?
I want to run it as low memory consumption as possible.
Thanks.
Edit:
Is the file model.trainables.syn1neg.npy is the model weights?
Is the file model.wv.vectors.npy is necessary for running an inference?
I'm not a fan of the delete_temporary_training_data() method. It implies there's a clearer separation between training-state and that needed for later uses. (Inference is very similar to training, though it doesn't need the cached doc-vectors for training texts.)
That said, if you've used that method, you shouldn't then be deleting any of the side-files that were still part of the save. If they were written by .save(), they'll be expected, by name, by the .load(). They must be kept with the main model file. (There might be fewer such files, or smaller such files, after the delete_temporary_training_data() call - but any written must be kept for reading.)
The syn1neg file is absolutely required for inference: it's the model's hidden-to-output weights, needed to perform new forward-predictions (and thus also backpropagated inference-adjustments). The wv.vectors file is definitely needed in default dm=1 mode, where word-vectors are part of the doc-vector calculation. (It might be optional in dm=0 mode, but I'm not sure the code is armored against them being absent - not via in-memory trimming, and definitely not against the expected file being deleted out-of-band.)
For anyone who wants to load gensim.Doc2Vec without doc vectors for inference, I found this works:
import gensim
fname = "path/to/model.pickle"
d2v = gensim.utils.unpickle(fname)
d2v.__recursive_saveloads = ["wv"] # Would normally be ['wv', 'dv']
d2v._load_specials(fname, None, *gensim.utils.SaveLoad._adapt_by_suffix(fname))
I know it's a pretty ugly hack, but in my case the doc vectors ended up being a 28GB file, and I didn't want to retrain.

gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars

/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
#dynamic topic model
def run_dtm(num_topics=18):
docs, years, titles = preprocessing(datasetType=2)
#resort document by years
Z = zip(years, docs)
Z = sorted(Z, reverse=False)
years_new, docs_new = zip(*Z)
#generate time slice
time_slice = Counter(years_new).values()
for year in Counter(years_new):
print year,' --- ',Counter(years_new)[year]
print '********* data set loaded ********'
dictionary = corpora.Dictionary(docs_new)
corpus = [dictionary.doc2bow(text) for text in docs_new]
print '********* train lda seq model ********'
ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=num_topics)
print '********* lda seq model done ********'
ldaseq.print_topics(time=1)
Hey guys, I'm using the dynamic topic models in gensim package for topic analysis, following this tutorial, https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb, however I always got the same unexpected error. Can anyone give me some guidance? I'm really puzzled even thought I have tried some different dataset for generating corpus and dictionary.
The error is like this:
/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
The np.fabs error means it is encountering an error with NumPy. What NumPy and gensim versions are you using?
NumPy no longer supports Python 2.7, and Ldaseq was added to Gensim in 2016, so you might just not have a compatible version available. If you are recoding a Python 3+ tutorial to a 2.7 variant, you obviously understand a little bit about the version differences - try running it in a, say, 3.6.8 environment (you will have to upgrade sometime anyway, 2020 is the end of 2.7 support from Python itself). That might already help, I've gone through the tutorial and did not encounter this with my own data.
That being said, I have encountered the same error before when running LdaMulticore, and it was caused by an empty corpus.
Instead of running your code fully in a function, can you try to go through it line by line (or look at you DEBUG level log) and check whether your output has the expected properties: that, for example your corpus is not empty (or contains empty documents)?
If that happens, fix the preprocessing steps and try again - that at least helped me and helped with the same ldamodel error in the mailing list.
PS: not commenting because I lack the reputation, feel free to edit this.
This is the issue with the source code of ldaseqmodel.py itself.
For the latest gensim package(version 3.8.3) I am getting the same error at line 293:
ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
Now, if you go through the code you will see this:
enter image description here
You can see that here they divide the difference between bound and old_bound by the old_bound(which is also visible from the warning)
Now if you analyze further you will see that at line 263, the old_bound is initialized with zero and this is the main reason that you are getting this warning of divide by zero encountered.
enter image description here
For further information, I put a print statement at line 294:
print('bound = {}, old_bound = {}'.format(bound, old_bound))
The output I received is: enter image description here
So, in a single line you are getting this warning because of the source code of the package ldaseqmodel.py not because of any empty document. Although if you do not remove the empty documents from your corpus you will receive another warning. So I suggest if there are any empty documents in your corpus remove them and just ignore the above warning of division by zero.

Gensim Predict Output Word Function Syntax

How do you use the Gensim predict output word function?
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.predict_output_word(['Hi', 'how', 'you'], topn=10)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'predict_output_word'
I tried Word2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True), which was deprecated as well.
A file like GoogleNews-vectors-negative300.bin only contains the word vectors, not the complete model used for training. So it is not possible to use predict_output_word in this case. If you would have trained a full model yourself and saved it with model.save(), then the method predict_output_word would be available.

Resources