Precision and recall not changing while training BERT - precision

I am training 'bert-base-cased' model for multiclass classification.
Loss is incredibly high, because training data is very noisy and I'm classifying over 700 classes.
I just wanted to see how the training goes.
However precision, recall, f1_score is not changing and I don't know why. (loss is decreasing)
I checked the value in more detail in case the change was too small, but the value was exactly the same.
I'm posting 'compute_metrics' I used.
Let me know if you need more information.
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
acc = accuracy_score(labels, preds)
return {
'accuracy': acc,
'f1': f1,
'precision': precision,
'recall': recall
}
trainer = Trainer(
model=model,
args=training_ars,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
compute_metrics=compute_metrics)

Related

Using RMSE function with 5-fold cross validation to choose the best model out of 3

I have defined three different models obtained from the dataset diabetes from the library lars. The first model (M1) is the one that minimizes the BIC value out of all the possible regression models obtained combining the explanatory variables (which are p=10, so 2^10 possible models). The other two are obtained through glmnet and are a Lasso regression with respectively lambda.min (M2) and lambda.1se (M3), where lambda.min and lambda.1se are obtained through cv.glmnet. Now I should perform 5-fold cross-validation using the RMSE (Root Mean Square Error) function to check which of the tree models Μ1, Μ2 and Μ3, has the best predictive performance. In order to find the errors in the models obtained from Lasso I have to use the ordinal least squares estimates.
This is my code as for now:
library(lars)
library(glmnet)
data(diabetes)
y<-diabetes$y
x<-diabetes$x
x2<-diabetes$x2
X = as.data.frame(cbind(x))
Y = as.data.frame(y)
p=10
n=442
best_score = Inf
M1 = NA
for (i in 1:(2^p-1)){
model = lm(y ~ ., data = subset(X, select = c(which(as.integer(intToBits(i)) == 1))))
if (BIC(model) < best_score){
M1 = model
best_score = BIC ( model )
}
}
W<-as.matrix(X)
Y<-as.matrix(Y)
lasso<-glmnet(W, Y)
x11()
plot(lasso, label=T)
x11()
plot(lasso, xvar = 'lambda', label=T)
lasso$df
lasso$lambda
cvfit<-cv.glmnet(W,Y)
cvfit
coef(cvfit, s="lambda.min")
coef(cvfit, s="lambda.1se")
M2<-glmnet(W,Y,lambda = cvfit$lambda.min)
M3<-glmnet(W,Y,lambda = cvfit$lambda.1se)
I really don't know where to put hands now. Should I first of all split the original dataset in 5 and then compute again the models on the different train and test set? And how do I compute the final RMSE for each model? And what does it mean that I should use ordinal least square estimates for the models obtained through Lasso?

Performance expectations when running caret::train() to develop a kknn model

I'm using the caret::train() function to develop a weighted knn classification model (kknn) with 10-fold cross-validation and a tuneGrid containing 15 values for kmax, one value for distance, and 3 values for kernel.
That’s 450 total iterations if I understand the process correctly (an iteration being the computation of the probability of a given outcome for a given combination of kmax, distance, and kernel). x has about 480,000 data points (6 predictors each having about 80,000 observations), and y has about 80,000 data points.
Understanding that there are innumerable variables affecting performance, how long can I reasonably expect the train function to take if run on a pc with an 8-core 3GHz Intel processor and 32GB of RAM?
It currently takes about 70 minutes per fold, which is about 1.5 minutes per iteration. Is this reasonable, or excessive?
This is a kknn learning exercise. I realize there are other types of algorithms that produce better results more efficiently.
Here is the essential code:
x <- as.matrix(train_set2[, c("n_launch_angle", "n_launch_speed", "n_spray_angle_Kolp", "n_spray_angle_adj", "n_hp_to_1b", "n_if_alignment")])
y <- train_set2$events
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10, p = 0.8, returnData = TRUE,
returnResamp = "all", savePredictions = "all",
summaryFunction = twoClassSummary, classProbs = TRUE,
verboseIter = TRUE)
tuneGrid <- expand.grid(kmax = seq(11, 39, 2),
distance = 2,
kernel = c("triangular", "gaussian", "optimal"))
kknn_train <- train(x, y, method = "kknn",
tuneGrid = tuneGrid, trControl = fitControl)
As we have established in the comments, it is reasonable to expect this type of runtime. There are a few step to reduce this;
Running your code in parallel
Using a more efficient OS; like Linux
Be more efficient in your trainControl(), is it really necessary to have returnResamps=TRUE? There is small gains in controlling these.
Clearly, the first one is a no-brainer. For the second one, I can find as many computer-engineers who swears to linux as those who swears to windows. What convinced me to switch to Linux, was this particular test, which I hope will give you what it gave me.
# Calculate distance matrix
test_data <- function(dim, num, seed = 1903) {
set.seed(seed)
dist(
matrix(
rnorm(dim * num), nrow = num
)
)
}
# Benchmarking
microbenchmark::microbenchmark(test_data(120,4500))
This piece of code simply just runs faster on the exact same machine that runs Linux. At least this was my experience.

Why do higher learning rates in logistic regression produce NaN costs?

Summary
I am building a classifier for spam vs. ham emails using Octave and the Ling-Spam corpus; my method of classification is logistic regression.
Higher learning rates lead to NaN values being calculated for the cost, yet it does not break/decrease the performance of the classifier itself.
My Attempts
NB: My dataset is already normalised using mean normalisation.
When trying to choose my learning rate, I started with it as 0.1 and 400 iterations. This resulted in the following plot:
1 - Graph 1
When he lines completely disappear after a few iterations, it is due to a NaN value being produced; I thought this would result in broken parameter values and thus bad accuracy, but when checking the accuracy, I saw it was 95% on the test set (meaning that gradient descent was apparently still functioning). I checked different values of the learning rate and iterations to see how the graphs changed:
2 - Graph 2
The lines no longer disappeared, meaning no NaN values, BUT the accuracy was 87% which is substantially lower.
I did two more tests with more iterations and a slightly higher learning rate, and in both of them, the graphs both decreased with iterations as expected, but the accuracy was ~86-88%. No NaNs there either.
I realised that my dataset was skewed, with only 481 spam emails and 2412 ham emails. I therefore calculated the FScore for each of these different combinations, hoping to find the later ones had a higher FScore and the accuracy was due to the skew. That was not the case either - I have summed up my results in a table:
3 - Table
So there is no overfitting and the skew does not seem to be the problem; I don't know what to do now!
The only thing I can think of is that my calculations for accuracy and FScore are wrong, or that my initial debugging of the line 'disappearing' was wrong.
EDIT: This question is crucially about why the NaN values occur for those chosen learning rates. So the temporary fix I had of lowering the learning rate did not really answer my question - I always thought that higher learning rates simply diverged instead of converging, not producing NaN values.
My Code
My main.m code (bar getting the dataset from files):
numRecords = length(labels);
trainingSize = ceil(numRecords*0.6);
CVSize = trainingSize + ceil(numRecords*0.2);
featureData = normalise(data);
featureData = [ones(numRecords, 1), featureData];
numFeatures = size(featureData, 2);
featuresTrain = featureData(1:(trainingSize-1),:);
featuresCV = featureData(trainingSize:(CVSize-1),:);
featuresTest = featureData(CVSize:numRecords,:);
labelsTrain = labels(1:(trainingSize-1),:);
labelsCV = labels(trainingSize:(CVSize-1),:);
labelsTest = labels(CVSize:numRecords,:);
paramStart = zeros(numFeatures, 1);
learningRate = 0.0001;
iterations = 400;
[params] = gradDescent(featuresTrain, labelsTrain, learningRate, iterations, paramStart, featuresCV, labelsCV);
threshold = 0.5;
[accuracy, precision, recall] = predict(featuresTest, labelsTest, params, threshold);
fScore = (2*precision*recall)/(precision+recall);
My gradDescent.m code:
function [optimParams] = gradDescent(features, labels, learningRate, iterations, paramStart, featuresCV, labelsCV)
x_axis = [];
J_axis = [];
J_CV = [];
params = paramStart;
for i=1:iterations,
[cost, grad] = costFunction(features, labels, params);
[cost_CV] = costFunction(featuresCV, labelsCV, params);
params = params - (learningRate.*grad);
x_axis = [x_axis;i];
J_axis = [J_axis;cost];
J_CV = [J_CV;cost_CV];
endfor
graphics_toolkit("gnuplot")
plot(x_axis, J_axis, 'r', x_axis, J_CV, 'b');
legend("Training", "Cross-Validation");
xlabel("Iterations");
ylabel("Cost");
title("Cost as a function of iterations");
optimParams = params;
endfunction
My costFunction.m code:
function [cost, grad] = costFunction(features, labels, params)
numRecords = length(labels);
hypothesis = sigmoid(features*params);
cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis));
grad = (1/numRecords)*(features'*(hypothesis-labels));
endfunction
My predict.m code:
function [accuracy, precision, recall] = predict(features, labels, params, threshold)
numRecords=length(labels);
predictions = sigmoid(features*params)>threshold;
correct = predictions == labels;
truePositives = sum(predictions == labels == 1);
falsePositives = sum((predictions == 1) != labels);
falseNegatives = sum((predictions == 0) != labels);
precision = truePositives/(truePositives+falsePositives);
recall = truePositives/(truePositives+falseNegatives);
accuracy = 100*(sum(correct)/numRecords);
endfunction
Credit where it's due:
A big help here was this answer: https://stackoverflow.com/a/51896895/8959704 so this question is kind of a duplicate, but I didn't realise it, and it isn't obvious at first... I will do my best to try to explain why the solution works too, to avoid simply copying the answer.
Solution:
The issue was in fact the 0*log(0) = NaN result that occurred in my data. To fix it, in my calculation of the cost, it became:
cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis+eps(numRecords, 1)));
(see the question for the variables' values etc., it seems redundant to include the rest when just this line changes)
Explanation:
The eps() function is defined as follows:
Return a scalar, matrix or N-dimensional array whose elements are all
eps, the machine precision.
More precisely, eps is the relative spacing between any two adjacent
numbers in the machine’s floating point system. This number is
obviously system dependent. On machines that support IEEE floating
point arithmetic, eps is approximately 2.2204e-16 for double precision
and 1.1921e-07 for single precision.
When called with more than one argument the first two arguments are
taken as the number of rows and columns and any further arguments
specify additional matrix dimensions. The optional argument class
specifies the return type and may be either "double" or "single".
So this means that adding this value onto the value calculated by the Sigmoid function (which was previously so close to 0 it was taken as 0) will mean that it is the closest value to 0 that is not 0, making the log() not return -Inf.
When testing with the learning rate as 0.1 and iterations as 2000/1000/400, the full graph was plotted and no NaN values were produced when checking.
NB: Just in case anyone was wondering, the accuracy and FScores did not change after this, so the accuracy really was that good despite the error in calculating the cost with a higher learning rate.

KALMAN filter doesn't respond to changes

I am implementing a Kalman filter for the first time to get voltage values from a source. It works and it stabilizes at the source voltage value but if then the source changes the voltage the filter doesn't adapt to the new value.
I use 3 steps:
Get the Kalman gain
KG = previous_error_in_estimate / ( previous_error_in_estimate + Error_in_measurement )
Get current estimation
Estimation = previous_estimation + KG*[measurement - previous_estimation]
Calculate the error in estimate
Error_in_estimate = [1-KG]*previous_error_in_estimate
The thing is that, as 0 <= KG <= 1, Error_in_estimate decreases more and more and that makes KG to also decrease more and more ( error_in_measurement is a constant ), so at the end the estimation only depends on the previous estimation and the current measurement is not taken into account.
This prevents the filter from adapt himself to measurement changes.
How can I do to make that happen?
Thanks
EDIT:
Answering to Claes:
I am not sure that the Kalman filter is valid for my problem since I don't have a system model, I just have a bunch of readings from a quite noisy sensor measuring a not very predictable variable.
To keep things simple, imagine reading a potentiometer ( a variable resistor ) changed by the user, you can't predict or model the user's behavior.
I have implemented a very basic SMA ( Simple Moving Average ) algorithm and I was wondering if there is a better way to do it.
Is the Kalman filter valid for a problem like this?
If not, what would you suggest?
2ND EDIT
Thanks to Claes for such an useful information
I have been doing some numerical tests in MathLab (with no real data yet) and doing the convolution with a Gaussian filter seems to give the most accurate result.
With the Kalman filter I don't know how to estimate the process and measurement variances, is there any method for that?. Only when I decrease quite a lot the measurement variance the kalman filter seems to adapt. In the previous image the measurement variance was R=0.1^2 (the one in the original example). This is the same test with R=0.01^2
Of course, these are MathLab tests with no real data. Tomorrow I will try to implement this filters in the real system with real data and see if I can get similar results
A simple MA filter is probably sufficient for your example. If you would like to use the Kalman filter there is a great example at the SciPy cookbook
I have modified the code to include a step change so you can see the convergence.
# Kalman filter example demo in Python
# A Python implementation of the example given in pages 11-15 of "An
# Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
# University of North Carolina at Chapel Hill, Department of Computer
# Science, TR 95-041,
# http://www.cs.unc.edu/~welch/kalman/kalmanIntro.html
# by Andrew D. Straw
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)
# intial parameters
n_iter = 400
sz = (n_iter,) # size of array
x1 = -0.37727*np.ones(n_iter/2) # truth value 1
x2 = -0.57727*np.ones(n_iter/2) # truth value 2
x = np.concatenate((x1,x2),axis=0)
z = x+np.random.normal(0,0.1,size=sz) # observations (normal about x, sigma=0.1)
Q = 1e-5 # process variance
# allocate space for arrays
xhat=np.zeros(sz) # a posteri estimate of x
P=np.zeros(sz) # a posteri error estimate
xhatminus=np.zeros(sz) # a priori estimate of x
Pminus=np.zeros(sz) # a priori error estimate
K=np.zeros(sz) # gain or blending factor
R = 0.1**2 # estimate of measurement variance, change to see effect
# intial guesses
xhat[0] = 0.0
P[0] = 1.0
for k in range(1,n_iter):
# time update
xhatminus[k] = xhat[k-1]
Pminus[k] = P[k-1]+Q
# measurement update
K[k] = Pminus[k]/( Pminus[k]+R )
xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
P[k] = (1-K[k])*Pminus[k]
plt.figure()
plt.plot(z,'k+',label='noisy measurements')
plt.plot(xhat,'b-',label='a posteri estimate')
plt.plot(x,color='g',label='truth value')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('Iteration')
plt.ylabel('Voltage')
And the output is:

Analysing the result of LSTM Theano Sentiment Analysis

I'm trying the code from this link http://deeplearning.net/tutorial/lstm.html but changing the imdb data to my own. This is the screenshot of my result.
I want to determine the overall accuracy of running LSTM for sentiment analysis, but cannot understand the output. The train, valid and test values print multiple times but it's usually the same value.
Any help would be much appreciated.
The value it prints is computed by the following function:
def pred_error(f_pred, prepare_data, data, iterator, verbose=False):
"""
Just compute the error
f_pred: Theano fct computing the prediction
prepare_data: usual prepare_data for that dataset.
"""
valid_err = 0
for _, valid_index in iterator:
x, mask, y = prepare_data([data[0][t] for t in valid_index],
numpy.array(data[1])[valid_index],
maxlen=None)
preds = f_pred(x, mask)
targets = numpy.array(data[1])[valid_index]
valid_err += (preds == targets).sum()
valid_err = 1. - numpy_floatX(valid_err) / len(data[0])
return valid_err
It is easy to follow, and what it computes is 1 - accuracy, where accuracy is percentage of samples labeled correctly. In other words, you get around 72% accuracy on the training set, almost 95% accuracy on the validation set, and 50% accuracy on the test set.
The fact that your validation accuracy is so high compared to the train accuracy is a little bit suspicious, I would trace the predictions and see if may be our validation set is somehow not representative, or too small.

Resources