How to load a pre-trained fastText model with .npy extension in gensim

I am new to deep learning and I am trying to play with a pretrained word embedding model from a paper. I downloaded the following files:
1) sa-d300-m2-fasttext.model
2) sa-d300-m2-fasttext.model.trainables.syn1neg.npy
3) sa-d300-m2-fasttext.model.trainables.vectors_ngrams_lockf.npy
4) sa-d300-m2-fasttext.model.wv.vectors.npy
5) sa-d300-m2-fasttext.model.wv.vectors_ngrams.npy
6) sa-d300-m2-fasttext.model.wv.vectors_vocab.npy
In case these details are needed:
sa - Sanskrit
d300 - embedding dimension of 300
fasttext - trained with fastText
I don't have prior experience with gensim. How can I load the model into gensim or into TensorFlow?
I tried
from gensim.models.wrappers import FastText
FastText.load_fasttext_format('/content/sa/300/fasttext/sa-d300-m2-fasttext.model.wv.vectors_ngrams.npy')
FileNotFoundError: [Errno 2] No such file or directory: '/content/sa/300/fasttext/sa-d300-m2-fasttext.model.wv.vectors_ngrams.npy.bin'

That set of multiple files looks like it was saved from Gensim's FastText implementation, using Gensim's save() method - and thus is not in Facebook's original 'fasttext_format'.
So, try loading them with the following instead:
from gensim.models.fasttext import FastText
model = FastText.load('/content/sa/300/fasttext/sa-d300-m2-fasttext.model')
(Upon loading that main/root file, it will find the subsidiary related files in the same directory, as long as they're all present.)
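Once the model loads, a quick sanity check might look like the following (the probe word is just a placeholder; substitute any term you expect to be in the Sanskrit vocabulary):
vec = model.wv['राम']                          # 300-dimensional vector for the word
print(vec.shape)
print(model.wv.most_similar('राम', topn=5))    # nearest words by cosine similarity
Because this is FastText, the lookup should also work for out-of-vocabulary words via character n-grams.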
The source where you downloaded these files should have included clear instructions for loading them nearby!

Related

Huggingface pre-trained model

I tried to use the code below:
from transformers import AutoTokenizer, AutoModel
t = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(t)
model = AutoModel.from_pretrained(t)
I think this error is due to the old version of transformers not having such a pre-trained model; I checked and it's confirmed. The error:
/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
380 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
381 )
--> 382 raise EnvironmentError(msg)
383
384 except json.JSONDecodeError:
OSError: Can't load config for 'ProsusAI/finbert'. Make sure that:
- 'ProsusAI/finbert' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'ProsusAI/finbert' is the correct path to a directory containing a config.json file
My current versions:
python 3.7
transformers 3.4.0
I understand that my transformers version is old, but that is the only version compatible with Python 3.7. Also, the reason I can't upgrade Python to 3.9 is that I am using multimodal-transformers (below), which only supports up to 3.7.
Reasons:
https://multimodal-toolkit.readthedocs.io/en/latest/ <- this only supports Python up to 3.7
transformers is only supported up to 3.4.0 on Python 3.7.
I need to use multimodal-transformers because it makes text classification with tabular data easy. My dataset has text and category columns, and I wish to use both, so this is the easiest approach I found. (If you have any suggestions, please do share them, thank you.)
My question is: is there a way to use the latest pre-trained model despite having the old transformers?

How to use the binarised data from fairseq-preprocess to fine-tune a transformer model

I am trying to fine-tune a BERT model based on the methods2test data.
The data (corpus/preprocessed folder) are "corpus preprocessed using fairseq as well as the dictionary. Specifically, we use the fairseq-preprocess command that build vocabularies and binarize the data, to be used during training."
From my part, I follow the instructions from here: https://huggingface.co/docs/transformers/training
But, I cannot figure out how to use the corpus/preprocessed data for the training part:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
When I unzip the .tar.bz2 files I get some .bin and .idx items.
How do I use them in the train_dataset/eval_dataset from the above snippet?

gensim w2v - additional file

I trained w2v on a rather big (> 200 million sentences) corpus, and got, in addition to the file w2v_model.model, the files w2v_model.model.trainables.syn1neg.npy and w2v_model.model.wv.vectors.npy. The model file was successfully loaded and read all the npy files without any exceptions. The obtained model performed OK.
Now I have retrained the model on a much bigger corpus (> 1 billion sentences). The same 3 files were automatically saved, as expected.
When I try to load my new retrained model:
w2v_model = Word2Vec.load(path_filename)
I get:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/...../w2v_US.model.trainables.vectors_lockf.npy'
But no .npy file with such an extension was saved by gensim at the end of the training
(I save all output files in the same directory, as required).
What should I do to obtain such a file as part of the output .npy files (maybe there is some option in gensim w2v when training)? Maybe there are other ways to overcome this issue?
If a .save() is creating any files with the word trainables in them, you're using an older version of Gensim. Any new training should definitely prefer using a current version. As of now (January 2022), that's gensim-4.1.2, released 2021-09.
If an attempt at a .load() generated that particular error, then there should've been that file, alongside the others you mention, created when the .save() had been done. (In fact, the only way that the main file you named with path_filename should be able to know that other filename is if that other file was written successfully, allowing the main file to complete writing.)
Are you sure that file wasn't written, but then somehow left behind, perhaps getting deleted or not moving alongside the other few files to some new filesystem path?
In general, I would suggest:
using latest Gensim for any new training
always enable Python logging at the INFO level (a minimal setup is shown after this list), & watch the logging/console output of training/saving processes closely to see confirmation of expected activity/steps
keep all files from a .save() that begin with the same main filename (in your examples above, w2v_US.model) together - & keep in mind that for larger models it may be a larger roster of files than for a small test model
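A minimal version of that logging setup, as commonly shown in Gensim examples, is just:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)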
You will probably have to re-train the model, but you might be able to re-generate a compatible lockf file via steps like the following:
save aside all files of any potential use
from the exact same configuration as your original .save() – including the same outdated Gensim version, exact same model parameters, & exact same training corpus – repeat all the model-building steps you did before up through the .build_vocab() step. (That is: no extra need to .train().) This will create an untrained dummy model that should exactly match the vocabulary 'shape' of your broken model.
use .save() to save that dummy model again - watching the logs/output for errors. There should be, alongside the other files, a file with a name like dummy.model.trainables.vectors_lockf.npy. If so, you might be able to copy that away, rename it to be the file expected by the original model whose load failed, then leave it alongside that original model - and the .load() might then succeed, or fail in a different way.
(If there were other problems/corruption at the time of the original model creation, this might not work. In particular, I wonder if when you talk about retraining the model, you didn't start with a fresh Word2Vec instance, but somehow expanded the older one, which might've added other problems/complications. In that case, a full retraining, ideally in the latest Gensim, would be necessary, and also a better basis for going forward.)
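A rough sketch of those dummy-model steps, assuming the original run used an older Gensim 3.x (hence size= rather than vector_size=) and a placeholder corpus iterable named sentences; the hyperparameters shown are only illustrative and must be replaced by the exact values from your original run:
from gensim.models import Word2Vec

dummy = Word2Vec(size=300, window=5, min_count=5, workers=4)   # same settings as the original training
dummy.build_vocab(sentences)   # same corpus; no .train() call needed
dummy.save('dummy.model')      # should write dummy.model.trainables.vectors_lockf.npy alongside
Then copy dummy.model.trainables.vectors_lockf.npy next to the broken model, rename it to w2v_US.model.trainables.vectors_lockf.npy, and retry Word2Vec.load().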

Can't load the pre-trained word2vec for the Korean language

I would like to download and load the pre-trained word2vec for analyzing Korean text.
I download the pre-trained word2vec here: https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view?resourcekey=0-Dq9yyzwZxAqT3J02qvnFwg
from the Github Pre-trained word vectors of 30+ languages: https://github.com/Kyubyong/wordvectors
My gensim version is 4.1.0, thus I used:
KeyedVectors.load_word2vec_format('./ko.bin', binary=False) to load the model. But there was an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I already tried many options, including ones from Stack Overflow and GitHub, but it still does not work.
Would you mind letting me know the suitable solution?
Thanks,
While the page at https://github.com/Kyubyong/wordvectors isn't clear about the formats this author has chosen, their source code at
https://github.com/Kyubyong/wordvectors/blob/master/make_wordvectors.py#L61
shows it using the Gensim model .save() method.
Such saved models should be reloaded using the .load() class method of the same model class. For example, if a Word2Vec model was saved with...
model.save('language.bin')
...then it could be reloaded with...
loaded_model = Word2Vec.load('language.bin')
Note, though, that:
Models saved this way are often split over multiple files that should be kept together (and all start with the same root name) - but I don't see those here.
This work appears to be ~5 years old, based on a pre-1.0 version of Gensim – so there might be issues loading the models directly into the latest Gensim. If you do run into such issues, & absolutely need to make these vectors work, you might need to temporarily use a prior version of Gensim to .load() the model. Then, you could save the plain vectors out with .save_word2vec_format() for later reloading across any version. (Or, using the latest interim version that can load the model, re-save the model with .save(), then repeat the process with the latest version that can read that model, until you reach the current Gensim.)
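A rough sketch of that conversion, assuming you've found and installed an older Gensim (say, 3.8.x) that can actually load the file, and using the ko.bin name from the question:
from gensim.models import Word2Vec

old_model = Word2Vec.load('./ko.bin')                                  # loads the Gensim-native save
old_model.wv.save_word2vec_format('./ko_vectors.txt', binary=False)   # plain word2vec text format
The resulting ko_vectors.txt can then be loaded in any Gensim version with KeyedVectors.load_word2vec_format('./ko_vectors.txt', binary=False).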
But, you also might want to find a more recent & better-documented set of pretrained word-vectors.
For example, Facebook makes FastText pretrained vectors available in both a 'text' format and a 'bin' format for many languages at https://fasttext.cc/docs/en/pretrained-vectors.html (trained on Wikipedia only) or https://fasttext.cc/docs/en/crawl-vectors.html (trained on Wikipedia plus web crawl data).
The 'text' format should in fact be loadable with KeyedVectors.load_word2vec_format(filename, binary=False), but will only include full-word vectors. (It will also be relatively easy to view as text, or to write simple code to massage into other formats.)
The 'bin' format is Facebook's own native FastText model format, and should be loadable with either the load_facebook_model() or load_facebook_vectors() utility methods. Then, the loaded model (or vectors) will be able to create the FastText algorithm's substring-based guesstimate vectors even for many words that weren't in the model or training data.
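For example, assuming the Korean files have been downloaded and decompressed to local names cc.ko.300.vec and cc.ko.300.bin (placeholders based on the usual naming on those pages), loading would look roughly like:
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

kv_text = KeyedVectors.load_word2vec_format('cc.ko.300.vec', binary=False)   # text format: full words only
kv_bin = load_facebook_vectors('cc.ko.300.bin')                              # native format: handles OOV words too
print(kv_bin.most_similar('서울', topn=5))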

How to use ImageDataGenerator with nii/NIFTI files using ADNI dataset

I'm trying to load my data using Keras' ImageDataGenerator class, but I am having trouble since the image files are not standard jpeg/png image files but rather nii.gz files. I found this GitHub repo https://github.com/sremedios/nifti_image_generator/blob/master/utils/nifti_image.py but the dimensions output were not matching up, and
train_generator.next()
throws an error of
ValueError: could not broadcast input array from shape (233,189) into shape (197,233,189,1)
To use NIfTI files, which are a standard format for medical images such as MR, CT, etc., you have to use the nibabel library to read medical imaging datasets.
The DLTK library (built on TensorFlow) is also available for working with medical imaging.
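For illustration, a minimal sketch of reading one volume with nibabel (the filename is a placeholder, and the shape comment just mirrors the one in the error message):
import nibabel as nib
import numpy as np

img = nib.load('subject_scan.nii.gz')   # open the NIfTI file
volume = img.get_fdata()                # NumPy array, e.g. shape (197, 233, 189)
volume = np.expand_dims(volume, -1)     # add a channel axis -> (197, 233, 189, 1) if the generator expects one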
