Why does the loss of a Word2Vec model trained with gensim first increase for a few epochs and then decrease?

Actually, this is a sequel to an earlier post.
I am training a Word2Vec model using gensim with the parameters hs=1, sg=0 and negative=0. After modifying the code, less training time is required, but something seems to go wrong with the loss: it first increases for a few epochs and then decreases, and I don't know what happened.
The code is as follows:
from gensim.models.keyedvectors import KeyedVectors
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.INFO)
sentences = word2vec.Text8Corpus("text8") # loading the corpus
from gensim.models.callbacks import CallbackAny2Vec
loss_list = []
class Callback(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        loss_list.append(loss)
        print('Loss after epoch {}:{}'.format(self.epoch, loss))
        model.running_training_loss = 0.0
        self.epoch = self.epoch + 1
from gensim.models import KeyedVectors,word2vec,Word2Vec
import time
start_time = time.time()
model = word2vec.Word2Vec(sentences, hs=1, sg=0, negative=0, compute_loss=True, epochs=30, callbacks=[Callback()])
end_time = time.time()
print('Running time: %s seconds' % (end_time - start_time))
The code is actually run in a Jupyter notebook (screenshot omitted). The loss output is as follows:
Loss after epoch 0:39370848.0
Loss after epoch 1:43579636.0
Loss after epoch 2:45213772.0
Loss after epoch 3:46132356.0
Loss after epoch 4:46788412.0
Loss after epoch 5:47218508.0
Loss after epoch 6:47553520.0
Loss after epoch 7:47793332.0
Loss after epoch 8:47995616.0
Loss after epoch 9:48134664.0
Loss after epoch 10:48224960.0
Loss after epoch 11:48326640.0
Loss after epoch 12:48371072.0
Loss after epoch 13:48405980.0
Loss after epoch 14:48437804.0
Loss after epoch 15:48417612.0
Loss after epoch 16:48415112.0
Loss after epoch 17:48396260.0
Loss after epoch 18:48349064.0
Loss after epoch 19:48301088.0
Loss after epoch 20:48247328.0
Loss after epoch 21:48167340.0
Loss after epoch 22:48053500.0
Loss after epoch 23:47937300.0
Loss after epoch 24:47810964.0
Loss after epoch 25:47669088.0
Loss after epoch 26:47500524.0
Loss after epoch 27:47300488.0
Loss after epoch 28:47044920.0
Loss after epoch 29:46747080.0
Running time: 259.9046218395233 seconds

I wouldn't expect that rising-then-falling loss pattern; I would think that usual SGD optimization would achieve falling full-epoch loss from the very beginning.
However, if the end result vectors are still performing well, I wouldn't worry too much about surprises in secondary progress indicators, like that loss tally, for a number of reasons:
as noted in my previous answer (& further discussed in the Gensim project open issue #2617), Gensim's external loss-reporting has a number of known bugs & inconsistencies. Any oddness in observed loss-reporting may be a side-effect of those issues, without necessarily indicating any problem with the actual training updates. (One common reporting workaround is sketched after this list.)
it appears you're finishing 30 training epochs in 260 seconds – each full training pass in under 9 seconds. That suggests your training data is pretty tiny – perhaps too small to be a good example of word2vec capabilities, or too small for the default 100-dimensional vectors. That smallness, or other peculiarities in the training data, might contribute to atypical loss trends, or exercise some of the other weaknesses of the current loss-tallying code. If the same hard-to-explain pattern occurs with a more typical training corpus – such as one 100x larger – then it'd be more interesting to do a deep-dive investigation to understand what's happening. But unexpected results on tiny/unusual/atypical training runs might just be because such runs are far from where the usual intuitions apply, & getting to the bottom of their causes may be less productive than acquiring enough data to run the algorithm in a more typical/reliable way.
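For what it's worth, one way to get per-epoch figures without touching the model's private running_training_loss attribute is to record the delta of the cumulative tally at each epoch end. This is only a sketch under the caveats above – it assumes the cumulative-tally behaviour discussed in issue #2617 and still inherits its precision quirks:
from gensim.models.callbacks import CallbackAny2Vec

class EpochLossDelta(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0
        self.previous = 0.0

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()  # running tally so far
        delta = cumulative - self.previous             # portion attributable to this epoch
        print('Epoch {}: epoch loss {}'.format(self.epoch, delta))
        self.previous = cumulative
        self.epoch += 1

# e.g. word2vec.Word2Vec(sentences, hs=1, sg=0, negative=0, compute_loss=True,
#                        epochs=30, callbacks=[EpochLossDelta()])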

Related

Deep Learning with Pytorch - should validation be inside or outside the epoch loop?

I have seen most tutorials/guides put the validation step outside the epoch loop. A guide I follow, though, has the validation step inside the epoch loop. Which one is right?
I notice that if you have the validation inside the epoch loop you can plot the per-epoch validation loss, but you can't get a proper confusion matrix (because you'd be validating the same image dataset over and over), and vice versa. Or I haven't found a proper way yet. Any suggestions?
Thanks
The general way to write a training/validation loop in PyTorch is:
for epoch in range(num_epochs):
    for phase, dataloader in [('train', train_dataloader), ('val', val_dataloader)]:
        if phase == 'train':
            model.train()
            for batch in dataloader:
                train(model, batch)
        else:
            model.eval()
            with torch.no_grad():  # no gradient tracking needed for validation
                for batch in dataloader:
                    val(model, batch)
    # calculate the remaining statistics appropriately
A point to note here is that train_dataloader and val_dataloader are created from the training dataset according to your cross-validation strategy (random split, stratified split, etc.) – a simple random split is sketched below.
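For the simplest case, a random split might look like this sketch (full_dataset is a placeholder for your own Dataset object; the 80/20 ratio and batch size are arbitrary):
from torch.utils.data import DataLoader, random_split

n_total = len(full_dataset)                 # full_dataset: your own torch Dataset
n_val = int(0.2 * n_total)                  # hold out 20% for validation
train_dataset, val_dataset = random_split(full_dataset, [n_total - n_val, n_val])

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)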

Gensim Doc2Vec - Why does infer_vector() use alpha?

I am trying to map sentences to vectors in order to make sentences comparable to each other. To test gensim's Doc2Vec model, I downloaded sklearn's newsgroup dataset and trained the model on it.
In order to compare two sentences, I use model.infer_vector(), and I am wondering why two calls using the same sentence deliver different vectors:
model = Doc2Vec(vector_size=100, window=8, min_count=5, workers=6)
model.build_vocab(documents)
epochs = 10
for epoch in range(epochs):
    print("Training epoch %d" % (epoch + 1))
    model.train(documents, total_examples=len(documents), epochs=epochs)
    v1 = model.infer_vector("I feel good")
    v2 = model.infer_vector("I feel good")
    print(np.linalg.norm(v1 - v2))
Output:
Training epoch 1
0.41606528
Training epoch 2
0.43440753
Training epoch 3
0.3203116
Training epoch 4
0.3039317
Training epoch 5
0.68224543
Training epoch 6
0.5862567
Training epoch 7
0.5424634
Training epoch 8
0.7618142
Training epoch 9
0.8170159
Training epoch 10
0.6028216
If I set alpha and min_alpha to 0, I get consistent vectors for "I feel fine" and "I feel good", but the model gives me the same vector in every epoch, so it does not seem to learn anything:
Training epoch 1
0.043668125
Training epoch 2
0.043668125
Training epoch 3
0.043668125
Training epoch 4
0.043668125
Training epoch 5
0.043668125
Training epoch 6
0.043668125
Training epoch 7
0.043668125
Training epoch 8
0.043668125
Training epoch 9
0.043668125
Training epoch 10
0.043668125
So my questions are:
Why do I even have the possibility to specify a learning rate for inference? I would expect that the model is only changed during training and not during inference.
If I specify alpha=0 for inference, why does the distance between those two vectors not change during different epochs?
Inference uses an alpha because it is the same iterative adjustment process as training, just limited to updating the one new vector for the one new text example.
So yes, the model's various weights are frozen. But the one new vector's weights (dimensions) start at small random values, just as every other vector also began, and then get incrementally nudged over multiple training cycles to make the vector work better as a doc-vector for predicting the text's words. Then the final new-vector is returned.
Those nudges begin at the larger starting alpha value, and wind up as the negligible min_alpha. With an alpha at 0.0, no training/inference can happen, because every nudge-correction to the updatable weights is multiplied by 0.0 before it's applied, meaning no change happens.
Separate from that, your code has a number of problems that may prevent desirable results:
By calling train() epochs times in a loop, and then also supplying a value larger than 1 for epochs, you're actually performing epochs * epochs total training passes
further, by leaving alpha and min_alpha unspecified, each call to train() will descend the effective alpha from its high value to its low value – a sawtooth pattern that's not proper for this kind of stochastic-gradient-descent optimization. (There should be a warning in your logs about this error.)
It's rare to need to call train() multiple times in a loop. Just call it once, with the right epochs value, and it will do the right thing: that many passes, with a smoothly-decaying alpha learning-rate.
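In other words, something like this sketch (the parameter values just mirror the question's setup and are not prescriptive):
model = Doc2Vec(vector_size=100, window=8, min_count=5, workers=6, epochs=10)
model.build_vocab(documents)
# one call: gensim performs all 10 passes and decays alpha smoothly on its own
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)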
Separately, when calling infer_vector():
it needs a list-of-tokens, just like the words property of the training examples that were items in documents – not a string. (By supplying a string, it looks like a list-of-characters, so it will be inferring a doc-vector for the document ['I', ' ', 'f', 'e', 'e', 'l', ' ', 'g', 'o', 'o', 'd'] not ['I', 'feel', 'good'].)
those tokens should be preprocessed the same as the training documents – for example if they were lowercased there, they should be lowercased before passing to infer_vector()
the default argument steps=5 is very small, especially for short texts – many report better results with a value in the tens or hundreds
the default argument alpha=0.1 is somewhat large compared to the training default of 0.025; using the training value (especially with more steps) often gives better results – see the sketch after this list
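Putting those adjustments together might look like the sketch below (parameter names follow gensim 3.x, where infer_vector() accepts steps; in gensim 4.x the argument is epochs):
tokens = "I feel good".lower().split()                 # tokenize & preprocess like the training docs
v1 = model.infer_vector(tokens, alpha=0.025, steps=50)
v2 = model.infer_vector(tokens, alpha=0.025, steps=50)
print(np.linalg.norm(v1 - v2))                         # small, but usually not exactly 0.0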
Finally, just like the algorithm during training makes use of randomization (to adjust word-prediction context windows, or randomly-sample negative examples, or randomly down-sample highly-frequent words), the inference does as well. So even supplying the exact same tokens won't automatically yield the exact same inferred-vector.
However, if the model has been sufficiently-trained, and the inference is adjusted as above for better results, the vectors for the same text should be very, very close. And because this is a randomized algorithm with some inherent 'jitter' between runs, it's best to make your downstream evaluations and uses tolerant to such small variances. (And, if you're instead seeing large variances, correct other model/inference issues, usually with more data or other parameter adjustments.)
If you want to force determinism, there's some discussion of how to do that in a gensim project issue. But, understanding & tolerating the small variances is often more consistent with the choice of such a randomly-influenced algorithm.

Keras: How to obtain after an epoch the samples of the validation dataset with a wrong prediction?

During training of my CNN with Keras, after each epoch I obtain the validation accuracy (val_acc). For instance, I obtain val_acc: 0.9910, which means that the currently trained model correctly predicts, as expected, 991 out of 1000 samples of my validation dataset. Correct?
Then, how can I know (via a callback, or maybe by raising the verbosity level somehow) which 9 samples of my validation dataset resulted in an incorrect prediction?
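One possible approach (not from the original thread; just a sketch assuming tf.keras and one-hot labels, with x_val and y_val as placeholder names) is a custom callback that re-predicts the validation set at the end of each epoch and records the mismatching indices:
import numpy as np
from tensorflow import keras

class MisclassifiedLogger(keras.callbacks.Callback):
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val = x_val
        self.y_val = y_val

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)
        pred = np.argmax(probs, axis=1)
        true = np.argmax(self.y_val, axis=1)   # assumes one-hot encoded labels
        wrong = np.where(pred != true)[0]
        print('Epoch {}: {} misclassified validation indices: {}'.format(
            epoch, len(wrong), wrong))

# model.fit(..., callbacks=[MisclassifiedLogger(x_val, y_val)])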

How can I plot the accuracy as a function of the processing time in Keras?

I've been training a CNN in Keras and plotting the training and validation accuracy as a function of epochs. I was wondering if there is a way of plotting the accuracy as a function of processing time.
Reason being that I want to demonstrate the speed of transfer learning as opposed to retraining a full network. When transfer learning is used, the network takes a similar number of epochs to train, give or take, but each epoch takes far less time (an order of magnitude faster) and I want to capture this graphically.
Here is the code I've been using so far:
history = model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, validation_data=(X_test, Y_test))
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='lower right')
plt.show()
So it's actually really simple to implement a piece of code to perform your task in Keras. In order to do that, it's good to become familiar with a keras callback. It makes it possible to call custom functions at the following points:
on_epoch_begin: called at the beginning of every epoch,
on_epoch_end: called at the end of every epoch,
on_batch_begin: called at the beginning of every batch,
on_batch_end: called at the end of every batch,
on_train_begin: called at the beginning of model training,
on_train_end: called at the end of model training.
So now you might implement, e.g., a new callback which will:
register the start of training time in on_train_begin,
at the end of each epoch, record the elapsed time so far in on_epoch_end,
register the end of training in on_train_end.
By using the data collected by this callback you could easily present the dependency between time and accuracy in a variety of ways. Of course, it could easily be extended to batch/iteration-level timings. A sketch follows below.
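A minimal sketch of such a callback, and of plotting accuracy against elapsed seconds, might look like this (it assumes tf.keras, reuses the question's variable names, and uses the 'acc'/'val_acc' history keys from the question – newer Keras versions name them 'accuracy'/'val_accuracy'):
import time
import matplotlib.pyplot as plt
from tensorflow import keras

class TimeHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.start = time.time()
        self.times = []                        # cumulative seconds at each epoch end

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.start)

time_cb = TimeHistory()
history = model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch,
                    verbose=1, validation_data=(X_test, Y_test), callbacks=[time_cb])

plt.plot(time_cb.times, history.history['acc'])        # accuracy vs. wall-clock seconds
plt.plot(time_cb.times, history.history['val_acc'])
plt.title('model accuracy over time')
plt.ylabel('accuracy')
plt.xlabel('seconds')
plt.legend(['train', 'test'], loc='lower right')
plt.show()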

Performance comparison of “patternnet” and “newff” for binary classification in MATLAB R2014a

I have a binary classification problem for financial ratios and variables. When I use newff (with trainlm, mse, and a threshold of 0.5 for the output) I get high classification accuracy (5-fold cross-validation – near 89-92%), but when I use patternnet (trainscg with crossentropy) my accuracy is 10% lower than with newff. (I normalized the data before feeding it to the network, using mapminmax or mapstd.)
When I use these models on out-of-sample data (for the current year, with models built from the previous year(s)' data sets) I get better classification accuracies with patternnet, with better sensitivity and specificity. For example, I have these results in my problem:
Newff:
Accuracy: 92.8%, sensitivity: 94.08%, specificity: 91.62%
Out-of-sample results: accuracy: 60%, sensitivity: 48%, specificity: 65.57%
Patternnet:
Accuracy: 73.31%, sensitivity: 69.85%, specificity: 76.77%
Out-of-sample results: accuracy: 70%, sensitivity: 62.79%, specificity: 73.77%
Why do we have these differences between newff and patternnet? Which model should I use?
Thanks.
