How can you pass a pretrained LDA Model to ldaseq in Gensim for DTM?

I have a tuned and pretrained LDA model that I want to pass on to the ldaseq model in gensim, but I don't understand how to do it. I've tried lda_model and sstats, but it doesn't seem to work; I still get this from the logging:
running online (multi-pass) LDA training, 10 topics, 10 passes over
the supplied corpus of 1699 documents, updating model once every 1699
documents, evaluating perplexity every 1699 documents, iterating 50x
with a convergence threshold of 0.001000

In case anyone ever wonders this:
With initialize='own' you need to supply the sstats of the previously trained model in the shape (vocab_len, num_topics); with initialize='lda_model' you need to supply the previously trained LDA model itself.
I found the answer here
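A minimal sketch of both initialization routes, assuming LdaSeqModel accepts the lda_model and sstats keyword arguments as described above (the toy corpus, dictionary, and time_slice below are placeholders, and the exact spelling of the initialize string may vary by gensim version):
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.ldaseqmodel import LdaSeqModel

# toy stand-ins for the real data
docs = [["topic", "model", "test"], ["another", "small", "document"]] * 10
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
time_slice = [10, 10]  # must sum to len(corpus)

# the tuned, pretrained LDA model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Option 1: pass the model object itself (recent gensim releases spell the
# string value 'ldamodel'; older docs show 'lda_model')
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice,
                     num_topics=2, initialize='ldamodel', lda_model=lda)

# Option 2: pass the sufficient statistics, shaped (vocab_len, num_topics);
# gensim stores state.sstats as (num_topics, vocab_len), hence the transpose
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice,
                     num_topics=2, initialize='own', sstats=lda.state.sstats.T)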

Related

How to properly finetune t5 model

I'm finetuning a t5-base model following this notebook.
However, the loss on both the validation set and the training set decreases very slowly. I changed the learning_rate to a larger value, but it did not help. In the end, the BLEU score on the validation set was low (around 13.7), and the translation quality was low as well.
***** Running Evaluation *****
Num examples = 1000
Batch size = 32
{'eval_loss': 1.06500244140625, 'eval_bleu': 13.7229, 'eval_gen_len': 17.564, 'eval_runtime': 16.7915, 'eval_samples_per_second': 59.554, 'eval_steps_per_second': 1.906, 'epoch': 5.0}
If I use the "Helsinki-NLP/opus-mt-en-ro" model, the loss decreases properly, and at the end, the finetuned model works pretty well.
How to fine-tune t5-base properly? Did I miss something?
I think the metrics shown in the tutorial are for the already-trained EN>RO opus-mt model, which was then fine-tuned. I don't see a before-and-after comparison of its metrics, so it is hard to tell how much of a difference that fine-tuning really made.
You generally shouldn't expect the same results from fine-tuning T5, which is not a (pure) machine-translation model. More important is the difference in metrics before and after the fine-tuning.
Two things I could imagine having gone wrong with your training:
Did you add the proper T5 prefix to the input sequences ("translate English to Romanian: ") for both your training and your evaluation? If you did not, you might have been training a new task from scratch instead of using the bit of pre-training the model did on MT to Romanian (and German, and perhaps some others). You can see how that affects model behavior for example in this inference demo: Language used during pretraining and Language not used during pretraining. (See the preprocessing sketch after this list.)
If you chose a relatively small model like t5-base but stuck with the num_train_epochs=1 from the tutorial, your epoch count is probably far too low to make a noticeable difference. Try increasing the epochs for as long as you keep getting significant performance boosts; in this example that is probably the case for at least the first 5 to 10 epochs.
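A minimal preprocessing sketch for point 1, assuming a Hugging Face dataset with hypothetical "en" and "ro" fields and a recent transformers version that supports the text_target keyword (field names and max_length are placeholders, not taken from the notebook):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
prefix = "translate English to Romanian: "  # T5's task prefix

def preprocess(examples):
    # prepend the prefix so T5 can reuse its pre-trained EN->RO translation task
    inputs = [prefix + en for en in examples["en"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=examples["ro"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
The same prefix has to be applied at evaluation and inference time, not just during training.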
I actually did something very similar to what you are doing, for EN>DE (German). I fine-tuned both opus-mt-en-de and t5-base on a custom dataset of 30,000 samples for 10 epochs. opus-mt-en-de BLEU increased from 0.256 to 0.388, and t5-base from 0.166 to 0.340, just to give you an idea of what to expect. Romanian, or the dataset you use, might be more of a challenge for the model and result in different scores, though.

Gensim Doc2Vec model returns different cosine similarity depending on the dataset

I trained two versions of doc2vec models with two datasets.
The first dataset contained 2400 documents and the second contained 3000 documents, including the documents used in the first dataset.
For example:
dataset 1 = doc1, doc2, ... doc2400
dataset 2 = doc1, doc2, ... doc2400, doc2401, ... doc3000
I expected both doc2vec models to return the same similarity score between doc1 and doc2; however, they returned different scores.
Does a Doc2Vec model's result change with the dataset even when the datasets include the same documents?
Yes, any addition to the training set will change the relative results.
Further, as explained in the Gensim FAQ, even re-training with the exact same data will typically result in different end coordinates for each training doc, though each run should be about equivalently useful:
https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vec--doc2vec--etc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism
What should remain roughly the same between runs is the neighborhoods around each document. That is, adding some extra training docs shouldn't change the general result that some candidate doc is "very close" or "closer than other docs" to some target doc - except to the extent that (1) the new docs might include some even-closer docs; and (2) a small amount of 'jitter' between runs, per the FAQ answer above.
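A quick sketch of that check, using a toy corpus as a hypothetical stand-in for your 2400- and 3000-doc sets and assuming gensim 4.x's model.dv accessor: the raw scores will differ between the two models, but the top-N neighborhoods of a shared document should stay broadly similar:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = (["human machine interface system"] * 5 + ["graph of trees and paths"] * 5) * 8
docs = [TaggedDocument(t.split(), [i]) for i, t in enumerate(texts)]

small = Doc2Vec(docs[:60], vector_size=20, min_count=1, epochs=40)
large = Doc2Vec(docs, vector_size=20, min_count=1, epochs=40)

# raw similarity scores will differ between the two models...
print(small.dv.similarity(0, 1), large.dv.similarity(0, 1))
# ...but the nearest neighbors of the same doc should largely overlap
print([tag for tag, _ in small.dv.most_similar(0, topn=5)])
print([tag for tag, _ in large.dv.most_similar(0, topn=5)])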
If in fact you see lots of change in the relative neighborhoods and top-N neighbors of a document, either in repeated runs or runs with small increments of extra data, there's possibly something else wrong in the training.
In particular, 2400 docs is a pretty small dataset for Doc2Vec. Smaller datasets might need a smaller vector_size and/or more epochs and/or other tweaks to get reliable results, and even then might not show off the strengths this algorithm has on larger (tens-of-thousands to millions of docs) datasets.

Concatenated Doc2Vec - calculate similarities

I have two Doc2Vec models trained on the same corpus but with different parameters. I would like to concatenate the two of them and calculate the similarity of a given input word, choosing the returned vectors from the concatenated model. I have read many comments saying that this method may not be particularly suited to improving performance, and that it might be necessary to change the source code of the KeyedVectors class in gensim to enable it. So far I have attempted to do it using the Translation Matrix, but it returns 5 features from the second model and I am not sure whether it is performing the translations correctly.
Has anybody already encountered this issue? Is there another way to calculate the similarity for an input word in a concatenated doc2vec model?
Up to now I have been able to reproduce this:
import numpy as np

vocab1 = model1.wv
vocab2 = model2.wv
concatenated_vectors = {}
vocab_concatenated = vocab1
for i in range(len(vocab1.vectors)):
    v1 = vocab1.vectors[i]
    v2 = vocab2.vectors[i]
    vocab_concatenated[list(vocab1.vocab.keys())[i]] = np.concatenate((v1, v2))
In order to re-calculate the most_similar() top-n features for a passed argument, how should I re-instantiate the newly created object? It seems that
.add_vectors(list(vocab1.vocab.keys()), vocab_concatenated[list(vocab1.vocab.keys())])
is not working, but I am sure I am missing something.
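One way this can be done without touching gensim's source is to build a fresh KeyedVectors of the combined dimensionality and fill it with add_vectors. A sketch under the assumptions that both models share the same vocabulary and that gensim 4.x is used (the toy training data below is hypothetical):
import numpy as np
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["apple", "banana", "fruit"], ["car", "truck", "vehicle"]] * 20
docs = [TaggedDocument(t, [i]) for i, t in enumerate(texts)]
model1 = Doc2Vec(docs, vector_size=10, min_count=1, epochs=20)  # dm=1 default trains word vectors
model2 = Doc2Vec(docs, vector_size=15, min_count=1, epochs=20)

keys = model1.wv.index_to_key
vectors = np.hstack([model1.wv[keys], model2.wv[keys]])  # shape (vocab, 10 + 15)

combined = KeyedVectors(vector_size=vectors.shape[1])
combined.add_vectors(keys, vectors)

# most_similar now works on the concatenated space
print(combined.most_similar("apple", topn=3))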

Validation Split and Checkpoint Best Model in Keras

Let us use a validation split of 0.3 when fitting a Sequential model. What will be used for validation, the first or the last 30% of the samples?
Secondly, checkpointing the best model saves the best model weights in .hdf5 file format. Does this mean that, for a certain experiment, the saved model is the best tuned model?
For your first question, the last 30% samples will be used for validation.
From Keras documentation:
validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling
For your second question, I assume that you're talking about ModelCheckpoint with save_best_only=True. In this case, this callback saves the weights of a given epoch only if monitor ('val_loss', by default) is better than the best monitored value. Concretely, this happens here. If monitor is 'val_loss', this should be the tuned model for a particular setting of hyperparameters, according to the validation loss.
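A minimal sketch tying both answers together (the model, data, and filename below are placeholders):
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.hdf5",
    monitor="val_loss",    # the quantity compared across epochs
    save_best_only=True,   # overwrite only when val_loss improves
)

x, y = np.random.rand(100, 8), np.random.rand(100, 1)
# validation_split=0.3: the LAST 30 samples (taken before shuffling) are held out
model.fit(x, y, epochs=5, validation_split=0.3, callbacks=[checkpoint])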

How to get word vectors from a gensim Doc2Vec?

I trained a gensim.models.doc2vec.Doc2Vec model
d2v_model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)
and I can get document vectors by
docvec = d2v_model.docvecs[0]
How can I get word vectors from the trained model?
Doc2Vec inherits from Word2Vec, and thus you can access word vectors the same way as in Word2Vec, directly by indexing the model:
wv = d2v_model['apple']
Note, however, that a Doc2Vec training mode like pure DBOW (dm=0) doesn't need or create word vectors. (Pure DBOW still works pretty well and fast for many purposes!) If you do access word vectors from such a model, they'll just be the automatic randomly-initialized vectors, with no meaning.
Only when the Doc2Vec mode itself co-trains word-vectors, as in the DM mode (default dm=1) or when adding optional word-training to DBOW (dm=0, dbow_words=1), are word-vectors and doc-vectors both learned simultaneously.
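A small sketch of the three modes, assuming gensim 4.x naming (vector_size rather than the question's older size parameter); the toy documents are placeholders:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(["apple", "fruit", "tasty"], [0]),
        TaggedDocument(["car", "vehicle", "fast"], [1])] * 20

dm = Doc2Vec(docs, vector_size=50, min_count=1, dm=1)                    # DM: word vectors trained
dbow = Doc2Vec(docs, vector_size=50, min_count=1, dm=0)                  # pure DBOW: word vectors untrained
dbow_w = Doc2Vec(docs, vector_size=50, min_count=1, dm=0, dbow_words=1)  # DBOW plus word training

print(dm.wv["apple"][:5])    # meaningful
print(dbow.wv["apple"][:5])  # randomly initialized, no meaning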
If you want to get all the trained doc vectors, you can easily use model.docvecs.doctag_syn0. If you want the vector for a single indexed doc, you can use model.docvecs[i].
If you are training a Word2Vec model, you can get the word vectors from model.wv.syn0.
If you want more detail, check this GitHub issue: https://github.com/RaRe-Technologies/gensim/issues/1513