Dutch pre-trained model not working in gensim

When trying to load the fastText model (cc.nl.300.bin) in gensim I get the following error:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz
!gunzip cc.nl.300.bin.gz
model = FastText_gensim.load_fasttext_format('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
The code goes wrong when building the vocab with my own dataset. The format of that dataset is all right, as I already used it to build and train other (not pre-trained) Word2Vec and FastText models.
I saw that others had the same error in this GitHub issue, but their solution did not work for me: https://github.com/RaRe-Technologies/gensim/issues/2588
Also, I read somewhere that I should use 'load_facebook_model', but I was not able to import load_facebook_model at all. Is this even a good way to solve this problem?
Any other suggestions?

Are you sure you're using the latest version of Gensim, 4.0.1, with many improvements to the FastText implementation?
And, there you will definitely want to use .load_facebook_model() to load a full .bin Facebook-format model:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
But also note: the post-training expansion of the vocabulary is best considered an advanced & experimental function. It may not offer any improvement on typical tasks - indeed, without careful consideration of tradeoffs & balancing the influence of later training against earlier, it can make things worse.
A FastText model trained on a large, diverse corpus may already be able to synthesize better-than-nothing guess vectors for out-of-vocabulary words, via its subword vectors.
If there's some data with very-different words & word-senses you need to integrate, it will often be better to re-train from scratch, using an equal combination of all desired text influences. Then you'll be doing things in a standard and balanced way, without harder-to-tune and harder-to-evaluate improvised changes to usual practice.
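As a rough sketch of the loading step described above (assuming Gensim 4.x, and that cc.nl.300.bin has been downloaded and unzipped into the working directory), something like the following should work. The helper names load_facebook_fasttext and describe_word are made up for illustration; only load_facebook_model, model.wv and key_to_index are Gensim's own.

```python
from pathlib import Path


def load_facebook_fasttext(bin_path):
    """Load a full Facebook-format .bin FastText model with Gensim."""
    path = Path(bin_path)
    if not path.is_file():
        raise FileNotFoundError(f"No FastText .bin file at {path}")
    # Imported here so the missing-file check above runs before Gensim is needed.
    from gensim.models.fasttext import load_facebook_model
    return load_facebook_model(str(path))


def describe_word(model, word):
    """Return (in_vocabulary, vector) for a word.

    In Gensim 4.x, model.wv[word] also returns a vector for words outside the
    vocabulary, synthesized from the subword (character n-gram) vectors.
    """
    in_vocab = word in model.wv.key_to_index  # full-word vocabulary in Gensim 4.x
    return in_vocab, model.wv[word]


# Example usage (needs the multi-gigabyte model file, so it is left commented out):
# model = load_facebook_fasttext("cc.nl.300.bin")
# known, vec = describe_word(model, "fietsenstalling")
# print(known, vec.shape)
```

If the out-of-vocabulary vectors obtained this way are already good enough for your task, you may not need build_vocab(..., update=True) at all.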

Related

fine tuning word2vec on a specific article, using transfer learning

I am trying to fine-tune an existing model on a specific article. I have tried transfer learning using Gensim's build_vocab, adding GloVe vectors (converted with glove2word2vec) to a base model I trained on the article. But build_vocab does not change the base model: it stays very small and no words are added to its vocabulary.
this is the code:
#load glove model
glove_file = datapath("/content/glove.6B.200d.txt")
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)
(in here - len(glove_vectors.wv.vocab) = 40000)
#create good article basic model
base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab([tokenizer.tokenize(data.text[0])])
total_examples = base_model.corpus_count
(in here - len(base_model.wv.vocab) = 24)
#add GloVe's vocabulary & weights
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
(in here- still - len(base_model_good_wv.vocab) = 24)
#training
base_model.train([tokenizer.tokenize(good_trump.text[0])], total_examples=total_examples, epochs=base_model.epochs+5)
base_model_wv = base_model.wv
I think that the
"base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)"
does nothing, so there is no transfer learning.
Any recommendations?
I relied on this article as a guideline...
Many articles at the 'Towards Data Science' site are very confused, to the point of misleading more than helping. Unfortunately, the article you've linked is a good example:
The author first uses an unsupported value (workers=-1) that manages to make his local-corpus training do nothing, and rather than discovering & fixing that error, incorrectly concludes he needs to use 'transfer learning'/'fine-tuning' instead. (He doesn't.)
He then tries to improvise a re-use of the GloVe vectors, but as you've noted, his build_vocab() only manages to add the word-tokens to the model's vocabulary. This operation does not copy over any of the actual vectors!
Then, by doing training in a model where the default workers=3 was still in effect, he finally does real training on just his own texts, with no contribution from GloVe values at all. He attributes the improvement to GloVe, but really multiple mistakes have just cancelled each other.
I would avoid relying on a 'Towards Data Science' source if any other docs or tutorials are available.
Further, many who think they want to do re-use of someone else's pretrained vectors, with a small update from their own texts, should really just improve their own training corpus, so that they have one unified, evenly-trained model that covers all their needed words.
There's no explicit support for 'fine-tuning' in Gensim. Bold advanced users can try to cobble it together from other methods, and tampering with the model between usual steps, but I've never seen a well-characterized & evaluated process for doing so. (Lots of the people fumbling through the process aren't even doing a good check of end-quality versus other approaches, just noting some improvement on a few ad hoc, perhaps unrepresentative tests.)
Are you sure you need to do this? What was wrong with vectors taught on just your corpus? Might extending your corpus with extra texts to expand its vocabulary work as well or better?
Or, you could try translating the new domain words from your limited corpus & model into the same coordinate space as some older larger set of pretrained vectors that you like. There's an example of that process in a Gensim demo notebook using its utility TranslationMatrix class: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
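To make the 'translate into the same coordinate space' idea a bit more concrete, here is a minimal NumPy sketch of the general technique: learn a linear map from anchor words that exist in both vector sets, then apply it to the new domain words. This is not the TranslationMatrix API itself (see the notebook for that); the function names and the toy random data are illustrative assumptions only.

```python
import numpy as np


def fit_linear_map(source_vecs, target_vecs):
    """Least-squares matrix W such that source_vecs @ W approximates target_vecs.

    Row i of each array must be the vector of the same anchor word in the
    small (source) model and in the large pretrained (target) model.
    """
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W


def translate(vec, W):
    """Project a source-space vector into the target space."""
    return vec @ W


# Toy data standing in for real word vectors: 50 anchor words,
# a 4-dimensional source space and a 6-dimensional target space.
rng = np.random.default_rng(0)
source_anchor = rng.normal(size=(50, 4))
true_map = rng.normal(size=(4, 6))
target_anchor = source_anchor @ true_map

W = fit_linear_map(source_anchor, target_anchor)
new_word_vec = rng.normal(size=4)          # a word only the small model knows
projected = translate(new_word_vec, W)     # now comparable with target vectors
```

With real vectors the fit will not be exact, so it is worth holding out some shared words and checking how close their projected vectors land to their true target vectors before trusting the result.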

Clarify steps to add a language variant to Stanza

I would like to add a non-standard variant of a language already supported by Stanza. It should be named differently from the standard variety included in the common distribution of Stanza. I could use a modification of the corpus for training the AI, since the changes are mostly morphological rather than syntactical, but how many steps would I need to take in order to make a new language variety for Stanza from this background? I don't understand what data are input and what are output in the process of adding a new language in the web documentation.
It sounds like you are trying to add a different set of processors rather than a whole new language. The difference being that other steps of the pipeline will still work the same, right? NER models, for example.
If that's the case, if you can follow the steps to retrain the current models, you should be able to then replace the input data with your morphological updates.
I suggest filing an issue on GitHub if you encounter difficulties in the process. It will be a lot easier to go back and forth there.
Times when we would actually recommend a whole new language are when 1) it's actually a new language or 2) it uses a different character set - think different writing systems for ZH or for Punjabi, if we had any Punjabi models.

How do I access h2o xgb model input features after saving a model to disk and reloading it?

I'm using h2o's xgboost implementation in Python. I've saved a model to disk and I'm trying to load it later on for analysis and predicting. I'm trying to access the input feature list or, even better, the list of features actually used by the model, excluding the ones it decided not to use. The usual advice is to call the varimp function to get the variable importance. That does drop features that aren't used in the model, but it gives you the variable importance of the intermediate features created by one-hot encoding (OHE) the categorical features, not the original categorical feature names.
I've searched for how to do this and so far I've found the following but no concrete way to do this:
Someone asking something very similar to this and being told the feature has been requested in Jira
The Jira ticket itself, which has been marked resolved, although I believe it says this was implemented but is not customer visible.
A similar ticket requesting this feature (original categorical feature importance) for variable importance heatmaps but it is still open.
Someone else who found an unofficial way to access the columns with model._model_json['output']['names'] but that doesn't give the features that weren't used by the model and they are told to use a different method that doesn't work if you have saved the model to disk and reloaded it (which I am doing).
The only option I see is to just use the varimp features, split on period character to break the OHE feature names, select the first part of all the splits, and then run a set over everything to get the unique column names. But I'm hoping there's a better way to do this.
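For reference, the workaround described above could be sketched roughly like this. It assumes the OHE names look like "column.level" and that model.varimp() returns rows whose first element is the variable name; the known_columns argument is a hypothetical extra for original columns whose own names contain a period.

```python
def original_feature_names(varimp_names, known_columns=None):
    """Collapse one-hot-encoded names like 'color.red' back to 'color'.

    varimp_names: variable names as reported by the variable importance table.
    known_columns: optional set of original column names; a name found here
        is kept whole even if it contains a period.
    Returns the unique original names in first-seen order.
    """
    known = set(known_columns or ())
    result = []
    for name in varimp_names:
        base = name if name in known else name.split(".", 1)[0]
        if base not in result:
            result.append(base)
    return result


# Hypothetical names standing in for [row[0] for row in model.varimp()]
example = ["age", "color.red", "color.blue", "city.NYC", "income"]
print(original_feature_names(example))
# prints ['age', 'color', 'city', 'income']
```

The obvious weakness is a categorical column whose name itself contains a period, which this splitting cannot tell apart from an OHE level unless the original column names are supplied.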

Can't load the pre-trained word2vec of korean language

I would like to download and load the pre-trained word2vec for analyzing Korean text.
I download the pre-trained word2vec here: https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view?resourcekey=0-Dq9yyzwZxAqT3J02qvnFwg
from the Github Pre-trained word vectors of 30+ languages: https://github.com/Kyubyong/wordvectors
My gensim version is 4.1.0, thus I used:
KeyedVectors.load_word2vec_format('./ko.bin', binary=False) to load the model. But there was an error that :
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I have already tried many options, including ones from Stack Overflow and GitHub, but it still does not work.
Could you let me know a suitable solution?
Thanks,
While the page at https://github.com/Kyubyong/wordvectors isn't clear about the formats this author has chosen, their source code at...
https://github.com/Kyubyong/wordvectors/blob/master/make_wordvectors.py#L61
...shows it using the Gensim model .save() method.
Such saved models should be reloaded using the .load() class method of the same model class. For example, if a Word2Vec model was saved with...
model.save('language.bin')
...then it could be reloaded with...
loaded_model = Word2Vec.load('language.bin')
Note, though, that:
Models saved this way are often split over multiple files that should be kept together (and all start with the same root name) - but I don't see those here.
This work appears to be ~5 years old, based on a pre-1.0 version of Gensim, so there might be issues loading the models directly into the latest Gensim. If you do run into such issues, & absolutely need to make these vectors work, you might need to temporarily use a prior version of Gensim to .load() the model. Then, you could save the plain vectors out with .save_word2vec_format() for later reloading across any version. (Or, using the latest interim version that can load the model, re-save the model with .save(), then repeat the process with the latest version that can read that model, until you reach the current Gensim.)
But, you also might want to find a more recent & better-documented set of pretrained word-vectors.
For example, Facebook makes FastText pretrained vectors available in both a 'text' format and a 'bin' format for many languages at https://fasttext.cc/docs/en/pretrained-vectors.html (trained on Wikipedia only) or https://fasttext.cc/docs/en/crawl-vectors.html (trained on Wikipedia plus web crawl data).
The 'text' format should in fact be loadable with KeyedVectors.load_word2vec_format(filename, binary=False), but will only include full-word vectors. (It will also be relatively easy to view as text, or to write simple code to massage into other formats.)
The 'bin' format is Facebook's own native FastText model format, and should be loadable with either the load_facebook_model() or load_facebook_vectors() utility methods. Then, the loaded model (or vectors) will be able to create the FastText algorithm's substring-based guesstimate vectors even for many words that weren't in the model or training data.
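Since a '.bin' extension can mean any of the formats mentioned above, it can help to peek at the first bytes of a file before picking a loader. The sketch below is only a heuristic, and guess_vector_format is a made-up helper name. It relies on three things: Python pickles (protocol 2 or later), which Gensim's .save() writes, start with the byte 0x80 (the same byte the UnicodeDecodeError complained about); word2vec-format files start with a "vocab_size vector_size" header line; and Facebook's fastText .bin files start with the 32-bit magic number 793712314.

```python
import re
import struct

FASTTEXT_MAGIC = 793712314  # int32 at the start of Facebook fastText .bin files


def guess_vector_format(path):
    """Heuristically classify a word-vector file by its first bytes."""
    with open(path, "rb") as f:
        head = f.read(64)

    # Facebook fastText native model: little-endian int32 magic number.
    if len(head) >= 4 and struct.unpack("<i", head[:4])[0] == FASTTEXT_MAGIC:
        return "facebook-fasttext-bin"  # try load_facebook_model / load_facebook_vectors

    # Python pickle, protocol 2+: first byte is the PROTO opcode 0x80.
    if head[:1] == b"\x80":
        return "pickle"  # likely a Gensim .save() file; try the model class's .load()

    # word2vec text or binary format: ASCII header "<vocab_size> <vector_size>".
    first_line = head.split(b"\n", 1)[0]
    if re.fullmatch(rb"\d+ \d+\r?", first_line):
        return "word2vec-format"  # try KeyedVectors.load_word2vec_format

    return "unknown"
```

A result of "pickle" for ko.bin would be consistent with the model having been written by .save(), and would explain why load_word2vec_format() fails on it.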

RETURNN Librispeech Task: reused parameters of pretrained model for both LM and encoder-decoder model

I want to train RETURNN on the LibriSpeech dataset, reusing the pretrained LM and encoder-decoder models that have been offered on git, but I don't know how to do it. Is this possible? I don't see any option to enable it in the .config file.
Yes, it is possible. The models can be downloaded here.
Do you just want to use the pretrained model for recognition? You don't need to do anything at all then. Just use it. Use the provided recognition scripts (see the paper, and the same repository).
Or do you want to train it further, using some additional data or so? In that case, this is also very simple. There is e.g. the option import_model_train_epoch1. But there are also related options, which you can use, depending on what you want to do exactly. See e.g. the comments in the code on preload_from_files.
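As a rough illustration of those two options, a RETURNN config (which is itself a Python file) might contain something like the fragment below. All paths and the "lm_" prefix are placeholders rather than real files from the repository, and the exact semantics of each key are best checked against the comments in the code on preload_from_files.

```python
# Fragment of a RETURNN config. Paths and the "lm_" prefix are hypothetical.

# Option 1: initialise the whole network from one checkpoint when training starts.
import_model_train_epoch1 = "/path/to/pretrained/encoder-decoder/checkpoint"

# Option 2: load one or more checkpoints into (parts of) the network.
preload_from_files = {
    "encoder_decoder": {
        "filename": "/path/to/pretrained/encoder-decoder/checkpoint",
        "init_for_train": True,   # use the file to initialise training
        "ignore_missing": True,   # tolerate network params absent from the file
    },
    "lm": {
        "filename": "/path/to/pretrained/lm/checkpoint",
        "prefix": "lm_",          # assumed: restricts loading to layers with this name prefix
        "init_for_train": True,
    },
}
```

You would normally pick whichever of the two options matches what you want to do, rather than setting both.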
