My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong? - gensim

I'm training a Doc2Vec model using the below code, where tagged_data is a list of TaggedDocument instances I set up before:
max_epochs = 40
model = Doc2Vec(alpha=0.025,
                min_alpha=0.001)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.001
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
When I later check the model results, they're not good. What might have gone wrong?

Do not call .train() multiple times in your own loop that tries to do alpha arithmetic.
It's unnecessary, and it's error-prone.
Specifically, in the above code, decrementing the original 0.025 alpha by 0.001 forty times results in a final alpha of (0.025 - 40*0.001) = -0.015, and the value would also have been negative for many of the training epochs. But a negative alpha learning-rate is nonsensical: it essentially asks the model to nudge its predictions a little bit in the wrong direction, rather than a little bit in the right direction, on every bulk training update. (Further, since model.iter is by default 5, the above code actually performs 40 * 5 training passes – 200 – which probably isn't the conscious intent. But that will just confuse readers of the code & slow training, not totally sabotage results, like the alpha mishandling.)
There are other common variants of this error, as well. If the alpha were instead decremented by 0.0001, the 40 decrements would only reduce the final alpha to 0.021 – whereas the proper practice for this style of SGD (Stochastic Gradient Descent) with linear learning-rate decay is for the value to end "very close to 0.000". If users start tinkering with max_epochs – it is, after all, a parameter pulled out on top! – but don't also adjust the decrement every time, they are likely to far-undershoot or far-overshoot 0.000.
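To make the arithmetic concrete, here is the same bookkeeping as a few lines of plain Python (all numbers are taken from the discussion above):

start_alpha, epochs = 0.025, 40

# the question's fixed decrement of 0.001 per loop iteration:
print(start_alpha - epochs * 0.001)    # -0.015  -> negative learning rate, nonsensical
# a fixed decrement of 0.0001 instead:
print(start_alpha - epochs * 0.0001)   # 0.021   -> nowhere near the ~0.000 target
# what a true linear decay to ~0.000 over 40 epochs would need per epoch:
print(start_alpha / epochs)            # 0.000625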
So don't use this pattern.
Unfortunately, many bad online examples have copied this anti-pattern from each other, and make serious errors in their own epochs and alpha handling. Please don't copy their error, and please let their authors know they're misleading people wherever this problem appears.
The above code can be improved with the much-simpler replacement:
max_epochs = 40
model = Doc2Vec() # of course, if non-default parameters needed, use them here
# most users won't need to change alpha/min_alpha at all
# but many will want to use more than default `epochs=5`
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=max_epochs)
model.save("d2v.model")
Here, the .train() method will do exactly the requested number of epochs, smoothly reducing the internal effective alpha from its default starting value to near-zero. (It's rare to need to change the starting alpha, but even if you wanted to, just setting a new non-default value at initial model-creation is enough.)
Also: note that later calls to infer_vector() will reuse the epochs specified at the time of model-creation. If nothing is specified, the default epochs=5 will be used – which is often smaller than is best for training or inference. So if you find a larger number of epochs (such as 10, 20 or more) is better for training, remember to also use at least the same number of epochs for inference. (.infer_vector() takes an optional epochs parameter which can override any value set at model-construction.)
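For example, a minimal sketch of keeping training and inference epochs consistent – assuming a recent gensim (4.x, where the inference parameter is named epochs; older releases used steps), tagged_data as in the question, and new_doc_tokens standing in for any tokenized document you want a vector for:

from gensim.models.doc2vec import Doc2Vec

train_epochs = 40

model = Doc2Vec(epochs=train_epochs)   # other non-default parameters would also go here
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")

# inference reuses model.epochs (here 40) unless you override it explicitly
vector_a = model.infer_vector(new_doc_tokens)
vector_b = model.infer_vector(new_doc_tokens, epochs=train_epochs)   # same effect, stated explicitly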

Related

How to set a convergence tolerance for a specific variable using Dymola?

So, I have a model of a tube with pressure loss, where the unknown is the mass flow rate. Normally, and in most models of this problem, the conservation equations are used to calculate the mass flow rate, but such models have lots of convergence issues (because of the blocked flow at the end of the tube, which results in an infinite pressure derivative there). See the figure below for a representation of the problem (left) and a graph showing the infinite pressure derivative (right).
Because of that I'm using a model which is more robust, though it outputs not the mass flow rate but the tube length, which is known. Therefore an iterative loop is needed to determine the mass flow rate. So I coded a function length that, given the tube geometry, mass flow rate and boundary conditions, outputs the calculated tube length, and wrote the equations like so:
parameter Real L;
Real m_flow;
...
equation
  L = length(geometry, boundary, m_flow);
It simulates fine, but it takes ages... And it shouldn't, because the mass flow rate is rather insensitive to the tube length: e.g. if L=3 I could say that m_flow has converged if the output of length is within L ± 0.1. On the other hand, the default convergence tolerance of DASSL in Dymola is 0.0001, which is fine for all other variables but a major setback for my model here...
That being said, I'd like to know if there's a (hacky) way of setting a specific tolerance for L (via annotations or something). I was unable to find any solution online or in Dymola's user manual... So far I've managed a workaround by making a second function which uses a Newton-Raphson method to determine the mass flow rate, something like:
function massflowrate
  input geometry, boundary, m_flow_start, tolerance;
  output m_flow;
protected
  Real error, L, dL, dLdm_flow, Delta_m_flow;
algorithm
  error := geometry.L;
  m_flow := m_flow_start;
  while error > tolerance loop
    L := length(geometry, boundary, m_flow);
    error := abs(geometry.L - L);
    // finite-difference estimate of the derivative of length w.r.t. m_flow
    dL := length(geometry, boundary, m_flow*1.001);
    dLdm_flow := (dL - L)/(0.001*m_flow);
    // Newton-Raphson update
    Delta_m_flow := (geometry.L - L)/dLdm_flow;
    m_flow := m_flow + Delta_m_flow;
  end while;
end massflowrate;
And then I use it in the equations section:
parameter Real L;
Real m_flow;
...
equation
  m_flow = massflowrate(geometry, boundary, delay(m_flow,10), tolerance);
Nevertheless, this solution is not without its problems: the real equations are very non-linear, and depending on the boundary conditions the solver gets stuck in a never-ending loop... =/
PS: I'm sorry for the long post and the lack of an MWE; the real equations are very long and full of thermodynamics, which I believe would not be of any help. Be that as it may, if necessary, I'm able to provide the real model.
Is the length function smooth? To me, it being non-smooth seems like a likely cause of problems, and the suggestions by #Phil might also be good ideas.
However, it should also be possible to do what you want as follows:
Real m_flow(nominal=1e9);
Explanation: The equations are normally solved to a certain tolerance in unknowns - in this case m_flow.
The tolerance for each variable is a relative/absolute tolerance taking into account the nominal value, and Dymola does not allow you to set different tolerances for different variables.
Thus the simple way to compute m_flow less accurately is by setting a high nominal value for it, since the error tolerance will be tol*(abs(m_flow)+abs(nominal(m_flow))) or something like that.
The downside is that it may be too inaccurate, e.g. causing additional events, or that the error is so random that the solver is still slowed down.

Tensorflow: How to check the validation accuracy every 100 iterations?

I'm trying to use the following code to check the validation accuracy every 100 iterations; however, the validation accuracy is not changing (the network is fine):
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
            validation_accuracy = accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 0.0})
            print('step %d, validation accuracy %g' % (i, validation_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
Since you haven't added your network implementation, my answer will be an educated guess.
TL;DR: You should use keep_prob:1.0 instead of keep_prob:0.0 in your validation step.
By the appearance of keep_prob, I deduce that your network is using dropout. By using feed_dict={x:mnist.test.images, y_:mnist.test.labels, keep_prob:0.0}, you are feeding a 0.0 probability of keeping an activation, which is equivalent to a 1.0 probability of dropping it. The result is that when you are performing validation, you are basically ignoring the input to the network and all hidden layers. This has the effect that the last layer gives you the same output values for all classes of MNIST (this may be only approximately true, depending on the specific implementation), therefore the accuracy is constant.
Dropout is a method for regularization, which drops neurons during training steps, thus improving the generalization ability of the network. When you are not training (such as during a validation step), you want to keep all neurons. Thus, what you probably want to do is feed the value 1.0 instead.
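A minimal sketch of the corrected evaluation block, inside the same training loop and reusing the placeholders (x, y_, keep_prob), the accuracy op, and the mnist object from the question; the only change is the keep_prob value fed during evaluation:

if i % 100 == 0:
    train_accuracy = accuracy.eval(
        feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
    print('step %d, training accuracy %g' % (i, train_accuracy))

    # keep_prob is a *keep* probability: feeding 1.0 keeps every activation,
    # i.e. disables dropout entirely while measuring accuracy
    validation_accuracy = accuracy.eval(
        feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})
    print('step %d, validation accuracy %g' % (i, validation_accuracy))

# dropout stays active (keep_prob < 1.0) only for the training step itself
train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})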

How does choosing between pre and post zero padding of sequences impact results

I'm working on an NLP sequence labelling problem. My data consists of variable length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).
I intend to solve the problem using Recurrent Neural Networks. As the sequences are of variable length I need to pad them (I want batch size > 1). I have the option of either pre zero padding them or post zero padding them, i.e. I make every sequence either (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0), such that the length of each sequence is the same.
How does the choice between pre- and post padding impact results?
It seems like pre padding is more common, but I can't find an explanation of why it would be better. Due to the nature of RNNs it feels like an arbitrary choice to me, since they share weights across time steps.
Commonly in RNNs, we take the final output or hidden state and use this to make a prediction (or do whatever task we are trying to do).
If we send a bunch of 0's to the RNN before taking the final output (i.e. 'post' padding as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after this word.
So intuitively, this might be why pre-padding is more popular/effective.
This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding types on LSTM and CNN. They found that post-padding achieved substantially lower accuracy (nearly half) compared to pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).
A simple/intuitive explanation for RNNs is that post-padding seems to add noise to what has been learned from the sequence through time, and there are no further timesteps for the RNN to recover from this noise. With pre-padding, however, the RNN is better able to adjust to the added noise of zeros at the beginning as it learns from the sequence through time.
I think more thorough experiments are needed in the community for more detailed mechanistic explanations on how padding affects performance.
I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.
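For what it's worth, a small illustration of the two layouts. This assumes Keras-style preprocessing (the question doesn't name a framework, so tf.keras's pad_sequences is used here purely as one convenient way to produce the two padding schemes):

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[3, 7, 2],
             [9, 1],
             [4, 8, 6, 5]]

print(pad_sequences(sequences, maxlen=4, padding='pre'))
# [[0 3 7 2]
#  [0 0 9 1]
#  [4 8 6 5]]

print(pad_sequences(sequences, maxlen=4, padding='post'))
# [[3 7 2 0]
#  [9 1 0 0]
#  [4 8 6 5]]

Either layout can then be batched and fed to the RNN; the paper cited above reports better LSTM results with the 'pre' layout.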

Vowpal Wabbit Continuing Training

Continuing with some experimentation here, I was interested in seeing how to continue training a VW model.
I first ran this and saved the model.
vw -d housing.vm --loss_function squared -f housing2.mod --invert_hash readable.housing2.mod
Examining the readable model:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.020412
^B:158346:0.007608
^CHAS:102153:1.014402
^CRIM:141890:0.016158
^DIS:182658:0.278865
^INDUS:125597:0.062041
^LSTAT:170288:0.028373
^NOX:165794:2.872270
^PTRATIO:223085:0.108966
^RAD:232476:0.074916
^RM:2580:0.330865
^TAX:108300:0.002732
^ZN:54950:0.020350
Constant:116060:2.728616
If I then continue to train the model using two more examples (in housing_2.vm), which, note, have zero values for ZN and CHAS:
27.50 | CRIM:0.14866 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7270 AGE:79.90 DIS:2.7778 RAD:5 TAX:384.0 PTRATIO:20.90 B:394.76 LSTAT:9.42
26.50 | CRIM:0.11432 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7810 AGE:71.30 DIS:2.8561 RAD:5 TAX:384.0 PTRATIO:20.90 B:395.58 LSTAT:7.67
If the saved model is loaded and training continues, the coefficients for these zero-valued features appear to be lost. Am I doing something wrong, or is this a bug?
vw -d housing_2.vm --loss_function squared -i housing2.mod --invert_hash readable.housing3.mod
output from readable.housing3.mod:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.023086
^B:158346:0.008148
^CRIM:141890:1.400201
^DIS:182658:0.348675
^INDUS:125597:0.087712
^LSTAT:170288:0.050539
^NOX:165794:3.294814
^PTRATIO:223085:0.119479
^RAD:232476:0.118868
^RM:2580:0.360698
^TAX:108300:0.003304
Constant:116060:2.948345
If you want to continue learning from saved state in a smooth fashion you must use the --save_resume option.
There are 3 fundamentally different types of "state" that can be saved into a vw "model" file:
The weight vector (regressor) obviously. That's the model itself.
invariant parameters like the version of vw (to ensure binary compatibility which is not always preserved between versions), number of bits in the vector (-b), and type of model
state which dynamically changes during learning. This subset includes parameters like learning and decay rates which gradually change during learning with each example, the example numbers themselves, etc.
Only --save_resume saves the last group.
--save_resume is not the default because it has an overhead and in most use-cases it isn't needed. E.g. if you save a model once in order to do many predictions and no learning (-t), there's no need to save the 3rd subset of state.
So, I believe in your particular case, you want to use --save_resume.
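For example, a sketch reusing the file names and options from the question; --save_resume is given on each run whose output model you may want to resume from later, so the dynamic state is both saved and read back:

vw -d housing.vm --loss_function squared --save_resume -f housing2.mod
vw -d housing_2.vm --loss_function squared --save_resume -i housing2.mod -f housing3.mod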
The possibility of a bug always exists, especially since vw supports so many options (about 100 at last count) which are often interdependent. Some option combinations make sense, others don't. Doing a sanity check of roughly 2^100 possible option combinations is a bit unrealistic. If you find a bug, please open an issue on github. In this case, please make sure to use a complete example (full data & command line) so your problem can be reproduced.
Update 2014-09-20 (after an issue was opened on github, thanks!):
The reason for 0-valued features "disappearing" (not really from the model, but only from the --invert_hash output) is that 1) --invert_hash was never designed for multiple passes, because keeping the original feature names in a hash-table incurs a large performance overhead; 2) the missing features are those with a zero value, which are discarded. The model itself should still contain any feature with a non-zero weight from a prior pass. Fixing this inconsistency is too complex and costly for implementation reasons, and would go against the overriding motivation of making vw fast, especially for the most useful/common use-cases. Anyway, thanks for the report, I too learned something new from it.

SVM training performance

I'm using SVMLib to train a simple SVM on the MNIST dataset. It contains 60,000 training examples. However, I have several performance issues: the training seems to be endless (after a few hours, I had to shut it down by hand, because it doesn't respond). My code is very simple, I just call ovrtrain on the dataset without any kernel or any special constants:
function features = readFeatures(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 4, "int32", 0, "ieee-be");
  if header(1) ~= 2051
    fprintf("Wrong magic number!");
  end
  M = header(2);
  rows = header(3);
  columns = header(4);
  features = fread(fid, [M, rows*columns], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction

function labels = readLabels(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 2, "int32", 0, "ieee-be");
  if header(1) ~= 2049
    fprintf("Wrong magic number!");
  end
  M = header(2);
  labels = fread(fid, [M, 1], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction
labels = readLabels("train-labels.idx1-ubyte");
features = readFeatures("train-images.idx3-ubyte");
model = ovrtrain(labels, features, "-t 0"); % doesn't respond...
My question: is this normal? I'm running it on Ubuntu, in a virtual machine. Should I wait longer?
I don't know whether you've already got your answer or not, but let me tell you what I suspect about your situation. 60,000 examples is not a lot for a powerful trainer like LibSVM. Currently, I am working on a training set of 6,000 examples and it takes 3 to 5 seconds to train. However, parameter selection is important, and that is probably what is taking so long. If the number of unique features in your data set is too high, then for any example there will be lots of zero feature values for the non-existing features. If the tool applies data scaling to your training set, then most probably those many zero feature values will be scaled to a certain non-zero value, leaving you with an astronomical number of unique, non-zero-valued features for each and every example. That makes it very hard for an SVM tool to work through the data and extract efficient parameter values.
Long story short, if you have done enough research on SVM tools and understand what I mean, either assign parameter values in the training command before executing it, or find a way to decrease the number of unique features. If you haven't, go ahead and download the latest version of LibSVM, and read the README files as well as the FAQ on the tool's website.
If none of these is the case, then sorry for taking your time :) Good luck.
It might be an issue of convergence given the characteristics of your data.
Check the kernel you have as the default selection and change it. Also, check the stopping criterion of the package. Additionally, if you are looking for a faster implementation, check MSVMpack, which is a parallel implementation of SVM.
Finally, feature selection is desirable in your case. You can end up with a good feature subset of almost half of what you have. In addition, you only need a portion of the data for training, e.g. 60–70% is sufficient.
First of all, 60k is a huge amount of data for training. Training that much data with a linear kernel will take a very long time unless you have a supercomputer. Also, you have selected a linear kernel function of degree 1. It's better to use a Gaussian or higher-degree polynomial kernel (a polynomial of degree 4 used with the same dataset showed good training accuracy). Try adding the LIBSVM options for -c (cost), -m (memory cache size) and -e (epsilon, the tolerance of the termination criterion; default 0.001). First run 1000 samples with a Gaussian or degree-4 polynomial kernel and compare the accuracy.

Resources