Can't load the pre-trained word2vec for the Korean language - gensim

I would like to download and load the pre-trained word2vec for analyzing Korean text.
I downloaded the pre-trained word2vec here: https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view?resourcekey=0-Dq9yyzwZxAqT3J02qvnFwg
from the GitHub repository Pre-trained word vectors of 30+ languages: https://github.com/Kyubyong/wordvectors
My gensim version is 4.1.0, so I used:
KeyedVectors.load_word2vec_format('./ko.bin', binary=False)
to load the model, but I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I have already tried many options suggested on Stack Overflow and GitHub, but it still does not work.
Would you mind letting me know the suitable solution?
Thanks,

While the page at https://github.com/Kyubyong/wordvectors isn't clear about the formats this author has chosen, their source code at
https://github.com/Kyubyong/wordvectors/blob/master/make_wordvectors.py#L61
shows it using the Gensim model .save() method.
Such saved models should be reloaded using the .load() class method of the same model class. For example, if a Word2Vec model was saved with...
model.save('language.bin')
...then it could be reloaded with...
loaded_model = Word2Vec.load('language.bin')
Note, though, that:
Models saved this way are often split over multiple files that should be kept together (and all start with the same root name) - but I don't see those here.
This work appears to be ~5 years old, based on a pre-1.0 version of Gensim – so there might be issues loading the models directly into the latest Gensim. If you do run into such issues, and absolutely need to make these vectors work, you might need to temporarily use a prior version of Gensim to .load() the model. Then, you could save the plain vectors out with .save_word2vec_format() for later reloading across any version. (Or, using the latest interim version that can load the model, re-save it with .save(), then repeat the process with the latest version that can read that re-saved model, until you reach the current Gensim.)
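For example, a rough sketch of that round-trip, assuming a temporary environment with an older Gensim able to read this save format (the exact version needed isn't documented in the repository):

from gensim.models import Word2Vec

# run under the older Gensim
old_model = Word2Vec.load('./ko.bin')
# write the plain word-vectors out in the portable word2vec text format;
# in very old (pre-1.0) versions this may instead be
# old_model.save_word2vec_format('./ko.vec', binary=False)
old_model.wv.save_word2vec_format('./ko.vec', binary=False)

Then, back under the current Gensim, KeyedVectors.load_word2vec_format('./ko.vec', binary=False) should work.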
But, you also might want to find a more recent & better-documented set of pretrained word-vectors.
For example, Facebook makes FastText pretrained vectors available in both a 'text' format and a 'bin' format for many languages at https://fasttext.cc/docs/en/pretrained-vectors.html (trained on Wikipedia only) or https://fasttext.cc/docs/en/crawl-vectors.html (trained on Wikipedia plus web crawl data).
The 'text' format should in fact be loadable with KeyedVectors.load_word2vec_format(filename, binary=False), but will only include full-word vectors. (It will also be relatively easy to view as text, or to write simple code to massage into other formats.)
The 'bin' format is Facebook's own native FastText model format, and should be loadable with either the load_facebook_model() or load_facebook_vectors() utility methods. Then, the loaded model (or vectors) will be able to create the FastText algorithm's substring-based guesstimate vectors even for many words that weren't in the model or training data.
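For instance, a minimal sketch of both loading routes, assuming the Korean files from fasttext.cc have been downloaded and unzipped as cc.ko.300.vec and cc.ko.300.bin:

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# 'text' format: full-word vectors only
kv = KeyedVectors.load_word2vec_format('cc.ko.300.vec', binary=False)

# 'bin' format: native FastText model, including subword information
ftv = load_facebook_vectors('cc.ko.300.bin')
vec = ftv['단어']  # look-ups fall back to subword n-grams for out-of-vocabulary tokens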

Related

fine tuning word2vec on a specific article, using transfer learning

I am trying to fine-tune an existing model on a specific article. I have tried transfer learning using gensim's build_vocab, adding GloVe vectors (converted with glove2word2vec) to a base model I trained on the article. But build_vocab does not change the base model - it stays very small and no words are added to its vocabulary.
This is the code:
# load the GloVe model
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import Word2Vec, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath("/content/glove.6B.200d.txt")
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)
(at this point, len(glove_vectors.vocab) = 40000)
# create the basic model from the article
base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab([tokenizer.tokenize(data.text[0])])
total_examples = base_model.corpus_count
(at this point, len(base_model.wv.vocab) = 24)
# add GloVe's vocabulary & weights
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
(at this point, still, len(base_model.wv.vocab) = 24)
# training
base_model.train([tokenizer.tokenize(good_trump.text[0])], total_examples=total_examples, epochs=base_model.epochs + 5)
base_model_wv = base_model.wv
I think that the
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
call does nothing, so there is no transfer learning.
Any recommendations?
I relied on this article for the guideline...
Many articles at the 'Towards Data Science' site are very confused, to the point of misleading more than helping. Unfortunately, the article you've linked is a good example:
The author first uses an unsupported value (workers=-1) that manages to make his local-corpus training do nothing, and rather than discovering & fixing that error, incorrectly concludes he needs to use 'transfer learning'/'fine-tuning' instead. (He doesn't.)
He then tries to improvise a re-use of the GloVe vectors, but as you've noted, his build_vocab() only manages to add the word-tokens to the model's vocabulary. This operation does not copy over any of the actual vectors!
Then, by doing training in a model where the default workers=3 was still in effect, he finally does real training on just his own texts - no contribution from the GloVe values at all. He attributes the improvement to GloVe, but really multiple mistakes have just cancelled each other out.
I would avoid relying on a 'Towards Data Science' source if any other docs or tutorials are available.
Further, many people who think they want to re-use someone else's pretrained vectors, with a small update from their own texts, should really just improve their own training corpus, so that they have one unified, evenly-trained model that covers all their needed words.
There's no explicit support for 'fine-tuning' in Gensim. Bold advanced users can try to cobble it together from other methods, and tampering with the model between usual steps, but I've never seen a well-characterized & evaluated process for doing so. (Lots of the people fumbling through the process aren't even doing a good check of end-quality versus other approaches, just noting some improvement on a few ad hoc, perhaps unrepresentative tests.)
Are you sure you need to do this? What was wrong with vectors taught on just your corpus? Might extending your corpus with extra texts to expand its vocabulary work as well or better?
Or, you could try translating the new domain words from your limited corpus & model into the same coordinate space as some older larger set of pretrained vectors that you like. There's an example of that process in a Gensim demo notebook using its utility TranslationMatrix class: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
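A hedged sketch of that idea, loosely following the linked notebook - here small_model is your own trained Word2Vec model, big_kv is the larger pretrained KeyedVectors, and shared_words is a list of 'anchor' words present in both vocabularies (all three names are illustrative):

from gensim.models.translation_matrix import TranslationMatrix

anchor_pairs = [(w, w) for w in shared_words]  # same word on both sides of each pair
trans = TranslationMatrix(small_model.wv, big_kv, word_pairs=anchor_pairs)
# suggest nearest neighbours, within the big pretrained space, for words from the small model
print(trans.translate(['some_domain_word'], topn=5))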

How do I access h2o xgb model input features after saving a model to disk and reloading it?

I'm using h2o's XGBoost implementation in Python. I've saved a model to disk and I'm trying to load it later for analysis and prediction. I want to access the list of input features or, even better, the list of features actually used by the model (excluding the features it decided not to use). The usual advice is to use the varimp function to get the variable importances; while this does exclude features that aren't used in the model, it gives the importance of the intermediate features created by one-hot-encoding the categorical features, not the original categorical feature names.
I've searched for how to do this and so far I've found the following but no concrete way to do this:
Someone asking something very similar to this and being told the feature has been requested in Jira
Said Jira ticket, which has been marked resolved; I believe it says this was implemented but is not customer-visible.
A similar ticket requesting this feature (original categorical feature importance) for variable importance heatmaps but it is still open.
Someone else who found an unofficial way to access the columns with model._model_json['output']['names'] but that doesn't give the features that weren't used by the model and they are told to use a different method that doesn't work if you have saved the model to disk and reloaded it (which I am doing).
The only option I see is to use the varimp features, split each name on the period character to break apart the OHE feature names, take the first part of each split, and then apply a set over everything to get the unique column names. But I'm hoping there's a better way to do this.
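For what it's worth, a small sketch of that workaround, assuming a model reloaded with h2o.load_model() (the path is illustrative):

import h2o

h2o.init()
model = h2o.load_model('/path/to/saved_model')
# varimp() rows are (feature, relative_importance, scaled_importance, percentage)
varimp_rows = model.varimp()
# strip the '.level' suffix added by one-hot encoding, then de-duplicate
used_columns = {name.split('.')[0] for name, *_ in varimp_rows}
print(sorted(used_columns))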

gensim w2v - additional file

I trained w2v on a rather big (> 200 million sentences) corpus, and got, in addition to the file w2v_model.model, the files w2v_model.model.trainables.syn1neg.npy and w2v_model.model.wv.vectors.npy. The model file was successfully loaded and read all the .npy files without any exceptions. The obtained model performed OK.
Now I retrained the model on much bigger corpus (> 1 billion sentences). The same 3 files were automatically saved, as expected.
When I try to load my new retrained model:
w2v_model = Word2Vec.load(path_filename)
I get:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/...../w2v_US.model.trainables.vectors_lockf.npy'
But no .npy file with that name was saved by gensim at the end of the training
(I save all output files in the same directory, as required).
What should I do to obtain such a file as part of the output .npy files (maybe some option in gensim w2v when training)? Maybe there are other ways to overcome this issue?
If a .save() is creating any files with the word trainables in them, you're using an older version of Gensim. Any new training should definitely prefer a current version. As of now (January 2022), that's gensim-4.1.2, released 2021-09.
If an attempt at a .load() generated that particular error, then that file should have been created, alongside the others you mention, when the .save() was done. (In fact, the only way the main file you named with path_filename should know that other filename is if that other file was written successfully, allowing the main file to complete writing.)
Are you sure that file wasn't written, but then somehow left behind, perhaps getting deleted or not moving alongside the other few files to some new filesystem path?
In general, I would suggest:
using latest Gensim for any new training
always enable Python logging at the INFO level, & watch the logging/console output of training/saving processes closely to see confirmation of expected activity/steps
keep all files from a .save() that begin with the same main filename (in your examples above, w2v_US.model) together - & keep in mind that for larger models it may be a larger roster of files than for a small test model
You will probably have to re-train the model, but you might be able to re-generate a compatible lockf file via steps like the following:
save aside all files of any potential use
from the exact same configuration as your original .save() – including the same outdated Gensim version, exact same model parameters, & exact same training corpus – repeat all the model-building steps you did before up through the .build_vocab() step. (That is: no extra need to .train().) This will create an untrained dummy model that should exactly match the vocabulary 'shape' of your broken model.
use .save() to save that dummy model, watching the logs/output for errors. There should be, alongside the other files, a file with a name like dummy.model.trainables.vectors_lockf.npy. If so, you might be able to copy that away, rename it to be the file expected by the original model whose load failed, then place it alongside that original model - and the .load() might then succeed, or fail in a different way.
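A rough sketch of those steps, assuming the same outdated Gensim version, parameters, and corpus iterable as the original run (all names below are illustrative):

from gensim.models import Word2Vec

# match the ORIGINAL model's parameters exactly (size, window, min_count, etc.)
dummy = Word2Vec(size=300, window=5, min_count=5, workers=4)
dummy.build_vocab(original_corpus_iterable)  # same corpus; no .train() needed
dummy.save('dummy.model')
# if dummy.model.trainables.vectors_lockf.npy appears, copy it next to the broken
# model as w2v_US.model.trainables.vectors_lockf.npy and retry Word2Vec.load()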
(If there were other problems/corruption at the time of the original model creation, this might not work. In particular, I wonder if when you talk about retraining the model, you didn't start with a fresh Word2Vec instance, but somehow expanded the older one, which might've added other problems/complications. In that case, a full retraining, ideally in the latest Gensim, would be necessary, and also a better basis for going forward.)

Dutch pre-trained model not working in gensim

When trying to load the FastText model (cc.nl.300.bin) in gensim I get the following error:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz
!gunzip cc.nl.300.bin.gz
model = FastText_gensim.load_fasttext_format('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
The code goes wrong when building the vocab with my own dataset. The format of that dataset is fine, as I have already used it to build and train other (non-pre-trained) Word2Vec and FastText models.
I saw others had the same error in this GitHub issue, but their solution did not work for me: https://github.com/RaRe-Technologies/gensim/issues/2588
Also, I read somewhere that I should use 'load_facebook_model', but I was not able to import load_facebook_model at all. Is this even a good way to solve this problem?
Any other suggestions?
Are you sure you're using the latest version of Gensim, 4.0.1, with many improvements to the FastText implementation?
And yes, you will definitely want to use .load_facebook_model() to load a full .bin Facebook-format model:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
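A minimal sketch of that route, assuming Gensim 4.x and the unzipped cc.nl.300.bin in the working directory:

from gensim.models.fasttext import load_facebook_model

model = load_facebook_model('cc.nl.300.bin')
# the experimental vocabulary expansion discussed below would then be:
# model.build_vocab(cleaned_text, update=True)
# model.train(cleaned_text, total_examples=len(cleaned_text), epochs=model.epochs)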
But also note: the post-training expansion of the vocabulary is best considered an advanced & experimental function. It may not offer any improvement on typical tasks - indeed, without careful consideration of the tradeoffs, and of balancing the influence of later training against earlier training, it can make things worse.
A FastText model trained on a large, diverse corpus may already be able to synthesize better-than-nothing guess vectors for out-of-vocabulary words, via its subword vectors.
If there's some data with very-different words & word-senses you need to integrate, it will often be better to re-train from scratch, using an equal combination of all desired text influences. Then you'll be doing things in a standard and balanced way, without harder-to-tune and harder-to-evaluate improvised changes to usual practice.

Nvidia Digits accuracy and loss plots data

I trained my model in Nvidia DIGITS 5 and I would now like to extract the accuracy and loss plots that were generated during training for a report. Is this data saved somewhere so that it would be possible to extract it, plot it in Python, and perhaps ultimately modify the plots to compare different models, etc.?
The best solution I have found is to either look at the HTML file or to scan the text file caffe_output.log that is produced by Caffe. The text file is usually stored in /var/digits/jobs/insert_your_job_id/ but on Linux systems you can also just run:
locate caffe_output.log
Go to your DIGITS job folder and locate your job's subfolder. Inside you'll find a file status.pickle, which is a pickled object containing all your job's information.
You can load it in python like so:
import digits
import pickle
data = pickle.load(open('status.pickle','rb'))
This object is somewhat generic and may contain multiple tasks. For a typical classification task there will likely be just one, but you will still need to access it via data.tasks[0]. From there you can grab the plots:
data.tasks[0].combined_graph_data()
which returns a somewhat convoluted dict (unfortunately, since your network can produce many accuracy/loss outputs, as well as custom ones). It contains everything you need, though - I managed to plot accuracy with:
plt.plot( data.tasks[0].combined_graph_data()['columns'][2][1:] )
but it's likely that you'll have to write a bit of custom code. As always, dir() is your friend.
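For example, a hedged sketch of pulling the curves out of status.pickle and plotting them, assuming the columns layout described above (the first element of each column is its label, the rest are values):

import pickle
import matplotlib.pyplot as plt
import digits  # needed so pickle can resolve the DIGITS job classes

with open('status.pickle', 'rb') as f:
    job = pickle.load(f)

for column in job.tasks[0].combined_graph_data()['columns']:
    label, values = column[0], column[1:]
    plt.plot(values, label=str(label))
plt.legend()
plt.show()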
