Make predictions on HuggingFace's BERT with Dropout on - huggingface-transformers

The default behavior of Trainer(...) in HuggingFace when evaluating a model is to disable Dropout. Concretely, y_pred will be exactly the same across M runs:
for i in range(M):
    logits, labels, metrics = trainer.predict(tokenized_datasets["eval"])
    y_pred = np.argmax(logits, axis=2)
    ...
Now I am trying to apply the Monte Carlo Dropout trick introduced in this answer. This requires turning Dropout on while making predictions on the validation set.
I am wondering how I can achieve this. Any input is appreciated.

You can set only the dropout layers to training:
from torch import nn
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def apply_dropout(m):
    # Switch only the dropout layers back to training mode.
    if type(m) == nn.Dropout:
        m.train()

model.apply(apply_dropout)
as recommended in the pytorch forums (here, here).
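To see the effect end to end, here is a minimal sketch of Monte Carlo Dropout on a small stand-in network rather than BERT, so it runs without downloading weights. The tiny `nn.Sequential` model, the input shape and `M = 20` are assumptions made for illustration only. The passes call the model directly, because a wrapper such as `Trainer.predict` may put the model back into eval mode before predicting.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a BERT classifier (assumption, not the real model).
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 3)
)

def enable_dropout(module):
    # Put only the dropout layers into training mode.
    if isinstance(module, nn.Dropout):
        module.train()

inputs = torch.randn(4, 8)  # 4 dummy examples with 8 features each
M = 20                      # number of stochastic forward passes

# Plain eval mode: dropout is off, so every pass gives the same logits.
model.eval()
with torch.no_grad():
    deterministic = torch.stack([model(inputs) for _ in range(M)])

# Eval mode with dropout re-enabled: the passes now differ.
model.apply(enable_dropout)
with torch.no_grad():
    mc_logits = torch.stack([model(inputs) for _ in range(M)])

mc_probs = mc_logits.softmax(dim=-1)   # shape (M, 4, 3)
mean_probs = mc_probs.mean(dim=0)      # averaged prediction per example
y_pred = mean_probs.argmax(dim=-1)     # final class per example
spread = mc_probs.std(dim=0)           # per-class disagreement across passes
```

The averaged probabilities give the prediction, and the spread across passes serves as a rough uncertainty estimate.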

Related

Why is my Doc2Vec model in gensim not reproducible?

I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import os

os.environ['PYTHONHASHSEED'] = '0'
reps = []
for a in [0,500]:
    documents = [TaggedDocument(doc, [i + a])
                 for i, doc in enumerate(common_texts)]
    model = Doc2Vec(documents, vector_size=100, window=2, min_count=0,
                    workers=1, epochs=10, dm=0, seed=0)
    reps.append(np.array([model.docvecs[k] for k in range(len(common_texts))]))
reps[0].sum() == reps[1].sum()
This last line returns False. I am working with gensim 3.8.3 and Python 3.5.2. More generally, is there any role that the values of the tags play (assuming they are unique)? I ask because I have found that using different tags for documents in a classification task leads to widely varying performance.
Thanks in advance.
First & foremost, your test isn't even comparing vectors corresponding to the same texts!
In run #1, the vector for the 1st text is in model.docvecs[0]. In run #2, the vector for the 1st text is in model.docvecs[1].
And, in run #2, the vector at model.docvecs[0] is just a randomly-initialized, but never-trained, vector - because none of the training texts had a document tag of (int) 0. (If using pure ints as the doc-tags, Doc2Vec uses them as literal indexes - potentially leaving any unused slots less than your highest tag allocated-and-initialized, but never-trained.)
Since common_texts only has 11 entries, by the time you reach run #12, all the vectors in your reps array of the first 11 vectors are garbage, uncorrelated with any of your texts.
However, even after correcting that:
As explained in the Gensim FAQ answer #11, determinism in this algorithm shouldn't generally be expected, given many sources of potential randomness, and the fuzzy/approximate nature of the whole approach. If you're relying on it, or testing for it, you're probably making some unwarranted assumptions.
In general, tests of these algorithms should be evaluating "roughly equivalent usefulness in comparative uses" rather than "identical (or even similar) specific vectors". For example, a test whether apple and orange are roughly at the same positions in each others' nearest-neighbor rankings makes more sense than checking their (somewhat arbitrary) exact vector positions or even cosine-similarity.
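As a rough illustration of such a ranking-based test, the sketch below compares nearest-neighbor orderings between two "runs". The four 2-D vectors are made-up toy data, and the second run is simply the first one rotated, which stands in for two trainings that ended up in different but equally useful coordinate systems.

```python
import math

def cosine(u, v):
    # Cosine similarity between two 2-D vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def neighbor_ranking(vectors, query):
    # Other words sorted from most to least similar to `query`.
    others = [w for w in vectors if w != query]
    return sorted(others, key=lambda w: cosine(vectors[query], vectors[w]),
                  reverse=True)

def rotate(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

# Toy vectors (assumption): "run_b" is "run_a" rotated by one radian.
run_a = {"apple": (1.0, 0.1), "orange": (0.9, 0.3),
         "car": (-0.2, 1.0), "truck": (-0.4, 0.9)}
run_b = {w: rotate(v, 1.0) for w, v in run_a.items()}

coordinates_match = all(run_a[w] == run_b[w] for w in run_a)
rankings_match = all(neighbor_ranking(run_a, w) == neighbor_ranking(run_b, w)
                     for w in run_a)
```

Here the raw coordinates differ between the runs while every word keeps the same neighbor ordering, which is the property worth testing for.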
Additionally:
tiny toy datasets like common_texts won't show the algorithm's usual behavior/benefits
PYTHONHASHSEED is only consulted by the Python interpreter at startup; setting it from Python can't have any effect. But also, the kind of indeterminism it introduces only comes up with separate interpreter launches: a tight loop within a single interpreter run like this wouldn't be affected by that in any case.
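The point about PYTHONHASHSEED can be checked with a small standard-library sketch. The helper name and the probe string `'gensim'` are arbitrary choices for illustration.

```python
import os
import subprocess
import sys

def string_hash_in_new_interpreter(seed):
    # Launch a fresh interpreter with PYTHONHASHSEED=seed and read back a hash.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('gensim'))"],
        env=env, stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return int(result.stdout)

# Changing the variable inside a running interpreter does not rehash anything.
previous = os.environ.get("PYTHONHASHSEED")
before = hash("gensim")
os.environ["PYTHONHASHSEED"] = "123"
after = hash("gensim")
if previous is None:
    del os.environ["PYTHONHASHSEED"]
else:
    os.environ["PYTHONHASHSEED"] = previous

# The variable only matters at interpreter startup.
same_seed = (string_hash_in_new_interpreter("0"),
             string_hash_in_new_interpreter("0"))
different_seeds = (string_hash_in_new_interpreter("1"),
                   string_hash_in_new_interpreter("2"))
```

The in-process hash is unchanged by the assignment, two launches with the same seed agree, and launches with different seeds disagree.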
Have you checked the magnitude of the differences?
Just running:
delta = reps[0].sum() - reps[1].sum()
gives an aggregate difference of -1.2598932e-05 when I run it.
Comparing dimension-wise:
diff = reps[0] - reps[1]
eps = 10**-4
over = (np.abs(diff) <= eps).all()
This returns True on the vast majority of runs, which means that you are getting quite reproducible results given the complexity of the calculations.
I would blame the numerical stability of the calculations or uncontrolled randomness. Even though you try to control the random seed, NumPy and the standard-library random module each keep their own seed, so you are not controlling all of the sources of randomness. This can also influence the results, but I did not check the actual implementation in gensim and its dependencies.
Change
import os
os.environ['PYTHONHASHSEED'] = '0'
to
import os
import sys

hashseed = os.getenv('PYTHONHASHSEED')
if not hashseed:
    # Set the seed, then restart the interpreter so that it takes effect.
    os.environ['PYTHONHASHSEED'] = '0'
    os.execv(sys.executable, [sys.executable] + sys.argv)

Gekko PRED_HOR and CTRL_HOR vs m.time

I'm trying to implement an online MPC controller and I'm a bit confused about what exactly m.time does.
With m.options.IMODE = 6 #MPC and m.options.REQCTRLMODE=3, I try to define the prediction and control horizons:
m.options.CTRL_HOR=10
m.options.CTRL_TIME=0.05
m.options.PRED_HOR=10
m.options.PRED_TIME=0.05
If I understand it right, CTRL_HOR and PRED_HOR set how many future timesteps are calculated, and CTRL_TIME and PRED_TIME define how long one timestep is.
But the controller throws an error if I don't define m.time. What exactly does m.time do, and why isn't it enough to set the control and prediction horizons with their respective timesteps?
Gekko uses m.time by default instead of CTRL_HOR and PRED_HOR. You can define an equivalent control / prediction horizon in Gekko with:
import numpy as np
from gekko import GEKKO
m = GEKKO()
m.time = np.linspace(0,0.5,11)  # 10 steps of 0.05 each
The CTRL_HOR and PRED_HOR properties are optionally used when CSV_READ=0. However, Gekko uses the CSV file to insert information about default values for parameters and variables so I don't recommend that you turn it off. Using m.time is also more flexible because you can have a non-uniform control / prediction horizon such as:
m.time = [0,0.05,0.1,0.2,0.5,1.0]
This helps to have the fine resolution at the beginning and then larger steps to determine steady-state move plans. Here is a practical TCLab MPC application with real-time data.
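The relationship between a horizon length, a step size and the time grid can be sketched in plain Python. This assumes, as the question describes, that the horizon counts steps and the time option is the length of one step. `uniform_horizon` is a hypothetical helper, not part of Gekko.

```python
def uniform_horizon(n_steps, step):
    # Time points for n_steps equal steps starting at t = 0.
    return [round(i * step, 10) for i in range(n_steps + 1)]

# 10 steps of 0.05 give 11 time points from 0 to 0.5.
ctrl_grid = uniform_horizon(10, 0.05)

# A non-uniform grid: fine resolution first, larger steps later.
nonuniform = [0, 0.05, 0.1, 0.2, 0.5, 1.0]
step_sizes = [round(b - a, 10) for a, b in zip(nonuniform, nonuniform[1:])]
```

Either list can then be assigned to m.time. The step sizes of the non-uniform grid make the coarsening toward the end of the horizon explicit.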

Tensorflow: How to check the validation accuracy every 100 iterations?

I am trying to use the following code to check the validation accuracy every 100 iterations. However, the validation accuracy does not change (the network itself is fine).
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_:batch[1], keep_prob:1.0})
            print('step %d, training accuracy %g' %(i, train_accuracy))
            validation_accuracy = accuracy.eval(feed_dict={x:mnist.test.images, y_:mnist.test.labels, keep_prob:0.0})
            print('step %d, validation accuracy %g' %(i, validation_accuracy))
        train_step.run(feed_dict={x:batch[0], y_:batch[1], keep_prob:0.5})
Since you haven't added your network implementation, my answer will be an educated guess.
TL;DR: You should use keep_prob:1.0 instead of keep_prob:0.0 in your validation step.
By the appearance of keep_prob, I deduce that your network is using dropout. By using feed_dict={x:mnist.test.images, y_:mnist.test.labels, keep_prob:0.0}, you are feeding a 0.0 probability to keeping an activation, which is equivalent to a 1.0 probability of dropping it. The result is that when you are performing validation, you are basically ignoring the input to the network and all hidden layers. This has the effect that the last layer gives you the same output values for all classes of MNIST (this may be only approximately true, depending on the specific implementation), therefore the accuracy is constant.
Dropout is a method for regularization, which drops neurons during training steps, thus improving the generalization ability of the network. When you are not training (such as during a validation step), you want to keep all neurons. Thus, what you probably want to do is feed the value 1.0 instead.
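To make the role of keep_prob concrete, here is a minimal pure-Python sketch of inverted dropout. It is not TensorFlow's implementation. The explicit guard for keep_prob == 0.0 is an assumption added to avoid dividing by zero, and what a real framework returns in that case may differ.

```python
import random

def dropout(activations, keep_prob, rng):
    # Keep each activation with probability keep_prob and rescale survivors.
    if keep_prob <= 0.0:
        # Assumed behavior for illustration: everything is dropped.
        return [0.0 for _ in activations]
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
activations = [0.5, -1.0, 2.0, 3.5]

validation_out = dropout(activations, 1.0, rng)  # all activations kept
all_dropped = dropout(activations, 0.0, rng)     # nothing gets through
training_out = dropout(activations, 0.5, rng)    # random subset, scaled by 2
```

With keep_prob 1.0 the layer passes its input through unchanged, which is what validation needs. With 0.0 no information from the input reaches the next layer, so the output cannot depend on the image.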

word vectors using gensim's word2vec implementation and GPU does not show any speed-up

I have a PC with an NVIDIA GPU and have installed OpenBLAS. I am trying to train word vectors using gensim's word2vec implementation with workers=4. But when I run the top command to check CPU usage, it shows only 100%. Does that mean only one core is utilised? My program also does not show any speed-up.
My code snippet is:
import os
import time

import gensim
import numpy

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    # called when Word2Vec is called
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences("/home/lalchand/NewdatasetforAssgn2/tfidf/spam")
start = time.time()
model = gensim.models.Word2Vec(sentences, min_count=1, iter=5, workers=4)
print(model.syn0.shape)
Gensim does not currently support using GPUs: https://github.com/RaRe-Technologies/gensim/issues/449

Speeding up evaluation of many scipy splines over the same set of knots

I have a few quick questions with regard to speeding up spline function evaluation in scipy (version 0.12.0), and I wish to apologize in advance for my novice understanding of splines. I am trying to create an object for scipy.integrate.odeint integration of a chemical kinetics problem using spline lookups for reaction rates (1.e2-1.e3 functions of ODE system variables) and generated C code for all of the algebra in the ODE system of equations. In comparison to a previous implementation that was purely in Python, evaluating the C code is so much faster than the spline interpolations that the evaluation of splines is the bottleneck in the ODE function. In trying to remove the bottleneck, I have reformed all of the reaction rates into splines that exist on the same knot values with the same order while having different smoothing coefficients (in reality I will have multiple sets of functions, where each function set was found on the same knots, has the same argument variable, and is at the same derivative level, but for simplicity I will assume one function set for this question).
In principle this is just a collection of curves on the same x-values and could be treated with interp1d (equivalently rewrapping splmake and spleval from scipy.interpolate) or a list of splev calls on tck data from splrep.
In [1]: %paste
import numpy
import scipy
from scipy.interpolate import *
#Length of Data
num_pts = 3000
#Number of functions
num_func = 100
#Shared x list by all functions
x = numpy.linspace(0.0,100.0,num_pts)
#Separate y(x) list for each function
ylist = numpy.zeros((num_pts,num_func))
for ind in range(0,num_func):
    #Dummy test for different data
    ylist[:,ind] = (x**ind + x - 3.0)
testval = 55.0
print 'Method 1'
fs1 = [scipy.interpolate.splrep(x,ylist[:,ind],k=3) for ind in range(0,num_func)]
out1 = [scipy.interpolate.splev(testval,fs1[ind]) for ind in range(0,num_func)]
%timeit [scipy.interpolate.splev(testval,fs1[ind]) for ind in range(0,num_func)]
print 'Method 2 '
fs2 = scipy.interpolate.splmake(x,ylist,order=3)
out2 = scipy.interpolate.spleval(fs2,testval)
%timeit scipy.interpolate.spleval(fs2,testval)
## -- End pasted text --
Method 1
1000 loops, best of 3: 1.51 ms per loop
Method 2
1000 loops, best of 3: 1.32 ms per loop
As far as I understand spline evaluations, once the tck arrays have been created (either with splrep or splmake) the evaluation functions (splev and spleval) perform two operations when given some new value xnew:
1) Determine the relevant indices of knots and smoothing coefficients
2) Evaluate the polynomial expression with the smoothing coefficients and the new xnew
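The two steps can be sketched in plain Python for many functions sharing the same knots. This is a simplified illustration only: it uses piecewise-linear pieces instead of cubic B-splines, and it is not how scipy or FITPACK evaluate splines. The function names and the toy data are made up for the example.

```python
from bisect import bisect_right

def locate_interval(knots, x_new):
    # Step 1: index i with knots[i] <= x_new <= knots[i + 1], clamped to range.
    i = bisect_right(knots, x_new) - 1
    return min(max(i, 0), len(knots) - 2)

def evaluate_all(knots, y_columns, x_new):
    # The interval search runs once and is shared by every function.
    i = locate_interval(knots, x_new)
    t = (x_new - knots[i]) / (knots[i + 1] - knots[i])
    # Step 2: evaluate each function's local polynomial (linear here).
    return [ys[i] + t * (ys[i + 1] - ys[i]) for ys in y_columns]

knots = [0.0, 1.0, 2.0, 4.0]
y_columns = [
    [0.0, 1.0, 2.0, 4.0],   # y = x
    [0.0, 2.0, 4.0, 8.0],   # y = 2x
    [1.0, 1.0, 1.0, 1.0],   # y = 1
]
values = evaluate_all(knots, y_columns, 3.0)
```

The search cost is paid once per evaluation point rather than once per function, which is the saving the first question below asks about.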
Questions
Since all of the splines (in a function set) are created on the same knot values, is it possible to avoid step (1, relevant indices) in the spline evaluation once it has been performed on the first function of a function set? From my looking at the Fortran fitpack files (directly from DIERCKX, I could not find the .c files used by scipy on my machine) I do not think this is supported, but I would love to be shown wrong.
The compilation of the system C code as well as the creation of all of the spline tck arrays is a preprocessing step as far as I am concerned; if I am worried about the speed of evaluating these lists of many functions, should I be looking at a compiled variant, since my tck lists will be unchanging?
One of my function sets will likely have an x-array of geometrically spaced values as opposed to linearly spaced; will this drastically reduce the evaluation time of the splines?
Thank you in advance for your time and answers.
Cheers,
Guy
