Analysing the result of LSTM Theano Sentiment Analysis - sentiment-analysis

I'm trying the code from this link http://deeplearning.net/tutorial/lstm.html but changing the imdb data to my own. This is the screenshot of my result.
I want to determine the overall accuracy of running LSTM for sentiment analysis, but cannot understand the output. The train, valid and test values print multiple times but it's usually the same value.
Any help would be much appreciated.

The value it prints is computed by the following function:
def pred_error(f_pred, prepare_data, data, iterator, verbose=False):
    """
    Just compute the error
    f_pred: Theano fct computing the prediction
    prepare_data: usual prepare_data for that dataset.
    """
    valid_err = 0
    for _, valid_index in iterator:
        x, mask, y = prepare_data([data[0][t] for t in valid_index],
                                  numpy.array(data[1])[valid_index],
                                  maxlen=None)
        preds = f_pred(x, mask)
        targets = numpy.array(data[1])[valid_index]
        valid_err += (preds == targets).sum()
    valid_err = 1. - numpy_floatX(valid_err) / len(data[0])
    return valid_err
It is easy to follow: it computes 1 - accuracy, where accuracy is the fraction of samples labeled correctly. In other words, you get around 72% accuracy on the training set, almost 95% accuracy on the validation set, and 50% accuracy on the test set.
The fact that your validation accuracy is so much higher than your training accuracy is a little suspicious. I would trace the predictions and check whether your validation set is somehow not representative, or too small.
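To make the relation between the printed value and the accuracy concrete, here is a minimal NumPy sketch. The prediction and target arrays are made up for illustration and are not data from the question.

```python
import numpy as np

# Hypothetical predictions and true labels for eight samples.
preds = np.array([1, 0, 1, 1, 0, 0, 1, 0])
targets = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of samples labeled correctly.
accuracy = (preds == targets).mean()

# pred_error returns the complement of the accuracy.
error = 1.0 - accuracy

print("error = %.3f, accuracy = %.1f%%" % (error, 100 * accuracy))
```

So a printed value of 0.28 would correspond to 72% accuracy on that split.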

Related

Is there a bug when using keras.metrics.TopKCategoricalAccuracy with RNN? Could you tell me how to find the correct prediction rate per word?

Hello,
I am working on an RNN for natural language processing that predicts the next word in a sentence.
Recently, I have been confused by the prediction accuracy reported by keras.metrics.TopKCategoricalAccuracy with SimpleRNN, used like this:
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[keras.metrics.TopKCategoricalAccuracy(k=5)])
As a result, the following results were obtained.
Epoch 1/12
438/438 [==============================] - 40s 89ms/step - loss: 6.1173 - top_k_categorical_accuracy: 0.4646 - val_loss: 5.4228 - val_top_k_categorical_accuracy: 0.5072 - lr: 0.0040
...
Thus, even in a very short training period, it achieved an accuracy of over 40%. However, when I examined the details of each prediction, the prediction was still poor.
In this sentence, the success rate of the prediction was 0%.
The accuracy seems to be calculated as the probability of successful prediction of at least one word per sentence. But I think the probability of successful prediction per word makes more sense, so I need this one.
Consistent with this idea, when I increased the number of predicted words in a sentence from 8 (the first 8 words of the sentence) to 16, the accuracy improved dramatically, but each prediction was almost the same as the previous one (as above). This can be explained if one considers that the accuracy is calculated with a success rate of at least one success per sentence, and the more iterations of the prediction, the higher the probability of success per sentence.
Here is the code that seems relevant:
##### building the model ######
def seq2vec_model_builder(HIDDEN_DIM):
    encoder_inputs = layers.Input(shape=(time_step, ), dtype='int32',)
    encoder_embedding = layers.Embedding(len(w_idx)+1, len(w_idx)+1, embeddings_initializer='identity', mask_zero=True, trainable=False)(encoder_inputs)
    encoder_RNN = layers.SimpleRNN(HIDDEN_DIM, return_state=True, name='sRNN',kernel_initializer='glorot_normal')
    encoder_outputs, state_h = encoder_RNN(encoder_embedding)
    dense_layer = layers.Dense(len(w_idx)+1,
                               bias_initializer=keras.initializers.RandomNormal(stddev=0.00001), activation='softmax')
    outputs = dense_layer(encoder_outputs)
    model = keras.Model(encoder_inputs, outputs)
    return model

model = seq2vec_model_builder(HIDDEN_DIM=128)
batch_size = 128;

def sample_generator(start, end):
    while True:
        for step in range((end - start) // batch_size):
            x, y = [],[]
            for line in range(batch_size):
                dataset = preprocessing.sequence.TimeseriesGenerator(
                    sents[start+step*batch_size+line],
                    sents[start+step*batch_size+line],
                    length=time_step,
                    batch_size=1)
                for batch in dataset:
                    X, Y = batch
                    x.extend(X[0])
                    y.extend(Y)
            x = np.reshape(x,((seq_length-time_step)*batch_size,time_step))
            y = keras.utils.to_categorical(y, len(w_idx)+1)
            yield x, y

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[keras.metrics.TopKCategoricalAccuracy(k=5)])
history_callback = model.fit(
    sample_generator(train_start, train_end),
    steps_per_epoch=(train_end - train_start) // batch_size,
    validation_data=sample_generator(val_start, val_end),
    validation_steps=(val_end - val_start) // batch_size,
    initial_epoch=0,
    epochs=12,
    verbose=1 )
Could someone please tell me how to find the successful prediction rate per word?
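As a point of comparison, here is a minimal NumPy sketch of a per-sample top-k accuracy. It assumes one-hot targets and one row of predicted probabilities per predicted word, which is how the generator above shapes its batches; the function name and the toy arrays are illustrative, not part of Keras. With k=1 it gives the exact-match rate per word, while k=5 only requires the true word to be among the five most probable ones, which may explain a high score alongside poor single-word predictions.

```python
import numpy as np

def top_k_accuracy(y_true_onehot, y_prob, k=5):
    """Fraction of rows whose true class is among the k most probable classes."""
    true_idx = np.argmax(y_true_onehot, axis=-1)
    # Indices of the k largest probabilities in each row.
    top_k = np.argsort(y_prob, axis=-1)[:, -k:]
    hits = (top_k == true_idx[:, None]).any(axis=-1)
    return hits.mean()

# Toy example: three predicted words over a four-word vocabulary.
y_prob = np.array([[0.10, 0.20, 0.30, 0.40],
                   [0.40, 0.30, 0.20, 0.10],
                   [0.25, 0.35, 0.30, 0.10]])
y_true = np.eye(4)[[3, 1, 0]]

top1 = top_k_accuracy(y_true, y_prob, k=1)  # exact-match rate per word
top2 = top_k_accuracy(y_true, y_prob, k=2)
print(top1, top2)
```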

When passing the init_score parameter to lightgbm for binary targets, do I pass my initial predictions as probabilities or as logit values?

My understanding of GBMs, is that the trees in the GBM model/predict the logit (log odds) that is then wrapped in the logistic function to get the final probability prediction. Given this, it is unclear whether I should pass my initial prediction as probabilities or as logit values.
Also when I make predictions, do I sum my initial probability prediction with the GBM probability prediction, or do I convert both to logits and then use the logistic function on the summed logits predictions?
Maybe this blog will give you some idea.
It says init_score in the binary classification is calculated in this way:
def logloss_init_score(y):
    p = y.mean()
    p = np.clip(p, 1e-15, 1 - 1e-15) # never hurts
    log_odds = np.log(p / (1 - p))
    return log_odds
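The formula above returns log-odds, which suggests that init_score is expressed in raw-score (logit) space rather than as a probability. Under that assumption, an initial probability would be converted to a logit before being passed in, and at prediction time the initial logit and the model's raw output would be summed before the logistic function is applied. The sketch below uses plain NumPy; `raw_margin` is a hypothetical name standing for the model's raw (pre-sigmoid) output.

```python
import numpy as np

def prob_to_logit(p, eps=1e-15):
    """Convert probabilities to log-odds, clipping to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combine(init_prob, raw_margin):
    """Sum in logit space, then map back to a probability."""
    return sigmoid(prob_to_logit(init_prob) + raw_margin)

# A zero margin leaves the initial prediction unchanged.
print(combine(0.8, 0.0))
# A positive margin pushes the probability up.
print(combine(0.5, np.log(3.0)))
```

Summing the two probabilities directly would not stay within [0, 1], which is one reason to do the addition on the logit scale.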

What techniques are effective to find periodicity in arbitrary data points?

By "arbitrary" I mean that I don't have a signal sampled on a grid that is amenable to taking an FFT. I just have points (e.g. in time) where events happened, and I'd like an estimate of the rate, for example:
p = [0, 1.1, 1.9, 3, 3.9, 6.1 ...]
...could be hits from a process with a nominal periodicity (repetition interval) of 1.0, but with noise and some missed detections.
Are there well known methods for processing such data?
A least-squares algorithm may do the trick if it is correctly initialized. A clustering method can be used for that initialization.
When an FFT is performed, the signal is represented as a sum of sine waves. The amplitudes of the frequencies can be seen as the result of a least-squares fit to the signal. Hence, if the signal is unevenly sampled, solving the same least-squares problem makes sense as a way to estimate the Fourier transform. Applied to an evenly sampled signal, it boils down to the same result.
As your signal is discrete, you may want to fit it as a sum of Dirac combs. It seems more sound to minimize the sum of squared distances from each event to the nearest Dirac of the comb. This is a non-linear optimization problem in which each Dirac comb is described by its period and offset. This non-linear least-squares problem can be solved by means of the Levenberg-Marquardt algorithm. Here is a Python example making use of the scipy.optimize.leastsq() function. Moreover, the error on the estimated period and offset can be estimated as described in How to compute standard deviation errors with scipy.optimize.least_squares. It is also covered in the documentation of curve_fit() and in Getting standard errors on fitted parameters using the optimize.leastsq method in python.
Nevertheless, half the period, a third of the period, and so on would also fit, and multiples of the period are local minima. Both are to be avoided by refining the initialization of the Levenberg-Marquardt algorithm. To this end, the differences between event times can be clustered, the cluster with the smallest value being that of the expected period. As proposed in Clustering values by their proximity in python (machine learning?), the clustering function sklearn.cluster.MeanShift() is applied.
Notice that the procedure can be extended to multidimensional data to look for periodic patterns or mixed periodic patterns featuring different fundamental periods.
import numpy as np
import scipy
from scipy.optimize import least_squares
from scipy.optimize import leastsq
from sklearn.cluster import MeanShift, estimate_bandwidth

ticks=[0,1.1,1.9,3,3.9,6.1]
print(scipy.__version__)

def crudeEstimate():
    # looking for the period by looking at differences between values :
    diffs=np.zeros((len(ticks)*(len(ticks)-1))//2)
    k=0
    for i in range(len(ticks)):
        for j in range(i):
            diffs[k]=ticks[i]-ticks[j]
            k=k+1
    #see https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning
    # MeanShift expects 2D input, so pad the differences with a column of zeros
    X = np.array(list(zip(diffs,np.zeros(len(diffs)))), dtype=float)
    bandwidth = estimate_bandwidth(X, quantile=1.0/len(ticks))
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_
    print(cluster_centers)
    labels_unique = np.unique(labels)
    n_clusters_ = len(labels_unique)
    for k in range(n_clusters_):
        my_members = labels == k
        print("cluster {0}: {1}".format(k, X[my_members, 0]))
    # the cluster with the smallest center is taken as the period
    estimated_period=np.min(cluster_centers[:,0])
    return estimated_period

def disttoDiracComb(x):
    # x holds (period, offset) pairs; the residual is the distance from
    # each tick to the nearest Dirac of any of the combs
    residual=np.zeros((len(ticks)))
    for i in range(len(ticks)):
        mindist=np.inf
        for j in range(len(x)//2):
            offset=x[2*j+1]
            period=x[2*j]
            #print(period, offset)
            index=np.floor((ticks[i]-offset)/period)
            #print('index', index)
            currdist=ticks[i]-(index*period+offset)
            if currdist>0.5*period:
                currdist=period-currdist
                index=index+1
            #print('event at ',ticks[i], 'not far from index ',index, '(', currdist, ')')
            #currdist=currdist*currdist
            #print(currdist)
            if currdist<mindist:
                mindist=currdist
        residual[i]=mindist
    #residual=residual-period*period
    #print(x, residual)
    return residual

estimated_period=crudeEstimate()
print('crude estimate by clustering :',estimated_period)
xp=np.array([estimated_period,0.0])
#res_1 = least_squares(disttoDiracComb, xp,method='lm',xtol=1e-15,verbose=1)
p,pcov,infodict,mesg,ier=leastsq(disttoDiracComb, x0=xp,ftol=1e-18, full_output=True)
#print(' p is ',p, 'covariance is ', pcov)
# see https://stackoverflow.com/questions/14581358/getting-standard-errors-on-fitted-parameters-using-the-optimize-leastsq-method-i
s_sq = (disttoDiracComb(p)**2).sum()/(len(ticks)-len(p))
pcov=pcov *s_sq
perr = np.sqrt(np.diag(pcov))
#print('estimated standard deviation on parameter :' , perr)
print('estimated period is ', p[0],' +/- ', 1.96*perr[0])
print('estimated offset is ', p[1],' +/- ', 1.96*perr[1])
Applied to your sample, it prints :
crude estimate by clustering : 0.975
estimated period is 1.0042857141346768 +/- 0.04035792507868619
estimated offset is -0.011428571139828817 +/- 0.13385206912205957
It sounds like you need to decide what exactly you want to determine. If you want to know the average interval in a set of timestamps, then that's easy (just take the mean or median).
If you expect that the interval could be changing, then you need to have some idea about how fast it is changing. Then you can find a windowed moving average. You need to have an idea of how fast it is changing so that you can select your window size appropriately - a larger window will give you a smoother result, but a smaller window will be more responsive to a faster-changing rate.
If you have no idea whether the data is following any sort of pattern, then you are probably in the territory of data exploration. In that case, I would start by plotting the intervals, to see if a pattern appears to the eye. This might also benefit from applying a moving average if the data is quite noisy.
Essentially, whether or not there is something in the data and what it means is up to you and your knowledge of the domain. That is, in any set of timestamps there will be an average (and you can also easily calculate the variance to give an indication of variability in the data), but it is up to you whether that average carries any meaning.
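As a rough sketch of the simple statistics mentioned above, the following uses the sample timestamps from the question. The window size of 3 is an arbitrary choice for illustration and would need to be tuned to how fast the rate is expected to change.

```python
import numpy as np

ticks = np.array([0, 1.1, 1.9, 3, 3.9, 6.1])

# Intervals between consecutive events.
intervals = np.diff(ticks)

mean_interval = intervals.mean()
median_interval = np.median(intervals)
variance = intervals.var()

# Windowed moving average of the intervals (window size is an assumption).
window = 3
moving_avg = np.convolve(intervals, np.ones(window) / window, mode="valid")

print("mean:", mean_interval, "median:", median_interval, "variance:", variance)
print("moving average:", moving_avg)
```

On this sample the median is less affected than the mean by the long gap between 3.9 and 6.1, where a detection appears to have been missed.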

Why do higher learning rates in logistic regression produce NaN costs?

Summary
I am building a classifier for spam vs. ham emails using Octave and the Ling-Spam corpus; my method of classification is logistic regression.
Higher learning rates lead to NaN values being calculated for the cost, yet it does not break/decrease the performance of the classifier itself.
My Attempts
NB: My dataset is already normalised using mean normalisation.
When trying to choose my learning rate, I started with it as 0.1 and 400 iterations. This resulted in the following plot:
1 - Graph 1
When the lines completely disappear after a few iterations, it is due to a NaN value being produced. I thought this would result in broken parameter values and thus bad accuracy, but when checking the accuracy, I saw it was 95% on the test set (meaning that gradient descent was apparently still functioning). I checked different values of the learning rate and iterations to see how the graphs changed:
2 - Graph 2
The lines no longer disappeared, meaning no NaN values, BUT the accuracy was 87% which is substantially lower.
I did two more tests with more iterations and a slightly higher learning rate, and in both of them, the graphs both decreased with iterations as expected, but the accuracy was ~86-88%. No NaNs there either.
I realised that my dataset was skewed, with only 481 spam emails and 2412 ham emails. I therefore calculated the FScore for each of these different combinations, hoping to find the later ones had a higher FScore and the accuracy was due to the skew. That was not the case either - I have summed up my results in a table:
3 - Table
So there is no overfitting and the skew does not seem to be the problem; I don't know what to do now!
The only thing I can think of is that my calculations for accuracy and FScore are wrong, or that my initial debugging of the line 'disappearing' was wrong.
EDIT: This question is crucially about why the NaN values occur for those chosen learning rates. So the temporary fix I had of lowering the learning rate did not really answer my question - I always thought that higher learning rates simply diverged instead of converging, not producing NaN values.
My Code
My main.m code (bar getting the dataset from files):
numRecords = length(labels);
trainingSize = ceil(numRecords*0.6);
CVSize = trainingSize + ceil(numRecords*0.2);
featureData = normalise(data);
featureData = [ones(numRecords, 1), featureData];
numFeatures = size(featureData, 2);
featuresTrain = featureData(1:(trainingSize-1),:);
featuresCV = featureData(trainingSize:(CVSize-1),:);
featuresTest = featureData(CVSize:numRecords,:);
labelsTrain = labels(1:(trainingSize-1),:);
labelsCV = labels(trainingSize:(CVSize-1),:);
labelsTest = labels(CVSize:numRecords,:);
paramStart = zeros(numFeatures, 1);
learningRate = 0.0001;
iterations = 400;
[params] = gradDescent(featuresTrain, labelsTrain, learningRate, iterations, paramStart, featuresCV, labelsCV);
threshold = 0.5;
[accuracy, precision, recall] = predict(featuresTest, labelsTest, params, threshold);
fScore = (2*precision*recall)/(precision+recall);
My gradDescent.m code:
function [optimParams] = gradDescent(features, labels, learningRate, iterations, paramStart, featuresCV, labelsCV)
  x_axis = [];
  J_axis = [];
  J_CV = [];
  params = paramStart;
  for i=1:iterations,
    [cost, grad] = costFunction(features, labels, params);
    [cost_CV] = costFunction(featuresCV, labelsCV, params);
    params = params - (learningRate.*grad);
    x_axis = [x_axis;i];
    J_axis = [J_axis;cost];
    J_CV = [J_CV;cost_CV];
  endfor
  graphics_toolkit("gnuplot")
  plot(x_axis, J_axis, 'r', x_axis, J_CV, 'b');
  legend("Training", "Cross-Validation");
  xlabel("Iterations");
  ylabel("Cost");
  title("Cost as a function of iterations");
  optimParams = params;
endfunction
My costFunction.m code:
function [cost, grad] = costFunction(features, labels, params)
  numRecords = length(labels);
  hypothesis = sigmoid(features*params);
  cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis));
  grad = (1/numRecords)*(features'*(hypothesis-labels));
endfunction
My predict.m code:
function [accuracy, precision, recall] = predict(features, labels, params, threshold)
  numRecords=length(labels);
  predictions = sigmoid(features*params)>threshold;
  correct = predictions == labels;
  truePositives = sum(predictions == labels == 1);
  falsePositives = sum((predictions == 1) != labels);
  falseNegatives = sum((predictions == 0) != labels);
  precision = truePositives/(truePositives+falsePositives);
  recall = truePositives/(truePositives+falseNegatives);
  accuracy = 100*(sum(correct)/numRecords);
endfunction
Credit where it's due:
A big help here was this answer: https://stackoverflow.com/a/51896895/8959704 so this question is kind of a duplicate, but I didn't realise it, and it isn't obvious at first... I will do my best to try to explain why the solution works too, to avoid simply copying the answer.
Solution:
The issue was in fact the 0*log(0) = NaN result that occurred in my data. To fix it, in my calculation of the cost, it became:
cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis+eps(numRecords, 1)));
(see the question for the variables' values etc., it seems redundant to include the rest when just this line changes)
Explanation:
The eps() function is defined as follows:
Return a scalar, matrix or N-dimensional array whose elements are all
eps, the machine precision.
More precisely, eps is the relative spacing between any two adjacent
numbers in the machine’s floating point system. This number is
obviously system dependent. On machines that support IEEE floating
point arithmetic, eps is approximately 2.2204e-16 for double precision
and 1.1921e-07 for single precision.
When called with more than one argument the first two arguments are
taken as the number of rows and columns and any further arguments
specify additional matrix dimensions. The optional argument class
specifies the return type and may be either "double" or "single".
So adding this value to 1 - hypothesis (where the sigmoid output had saturated so that the difference was stored as exactly 0) guarantees that the argument of log() is a tiny positive number rather than 0, so log() no longer returns -Inf and the product 0*(-Inf) = NaN no longer appears.
When testing with the learning rate as 0.1 and iterations as 2000/1000/400, the full graph was plotted and no NaN values were produced when checking.
NB: Just in case anyone was wondering, the accuracy and FScores did not change after this, so the accuracy really was that good despite the error in calculating the cost with a higher learning rate.
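The effect can be reproduced in a few lines of NumPy. This is a sketch in Python rather than Octave, and it clips the hypothesis into [eps, 1 - eps] instead of adding eps to one term as in the fix above; the two inputs of +40 and -40 are made-up values chosen so that the sigmoid saturates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def naive_cost(h, y):
    # Same formula as costFunction.m; 0 * log(0) evaluates to NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def clipped_cost(h, y, eps=np.finfo(float).eps):
    # Keep h strictly inside (0, 1) so that log() never receives 0.
    h = np.clip(h, eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1.0, 0.0])
# Large-magnitude inputs, as produced by large parameter updates.
h = sigmoid(np.array([40.0, -40.0]))

print(h[0] == 1.0)          # the sigmoid has saturated to exactly 1
print(naive_cost(h, y))     # nan
print(clipped_cost(h, y))   # small finite value
```

Both samples are classified correctly here, which is consistent with the observation that the accuracy stayed high while the reported cost was NaN.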

KALMAN filter doesn't respond to changes

I am implementing a Kalman filter for the first time to get voltage values from a source. It works and it stabilizes at the source voltage value but if then the source changes the voltage the filter doesn't adapt to the new value.
I use 3 steps:
Get the Kalman gain
KG = previous_error_in_estimate / ( previous_error_in_estimate + Error_in_measurement )
Get current estimation
Estimation = previous_estimation + KG*[measurement - previous_estimation]
Calculate the error in estimate
Error_in_estimate = [1-KG]*previous_error_in_estimate
The thing is that, as 0 <= KG <= 1, Error_in_estimate decreases more and more, which makes KG decrease as well (Error_in_measurement is a constant). In the end the estimate depends only on the previous estimate, and the current measurement is not taken into account.
This prevents the filter from adapting to changes in the measurement.
How can I make it adapt?
Thanks
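The three steps above can be sketched in Python to show the collapse of the gain. The process variance `q` is an assumption that is not part of the steps as written; with `q=0` the function reproduces them exactly, and a small positive `q` is one common way to keep the gain from shrinking to zero. The numeric values are illustrative only.

```python
def scalar_kalman(measurements, r, q=0.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter; q=0 matches the three steps in the question."""
    x, p = x0, p0
    estimates, gains = [], []
    for z in measurements:
        p = p + q                 # inflate the error by the process variance
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # current estimate
        p = (1.0 - k) * p         # error in the estimate
        estimates.append(x)
        gains.append(k)
    return estimates, gains

# Noise-free step from 1.0 V to 2.0 V, to isolate the tracking behaviour.
measurements = [1.0] * 200 + [2.0] * 200

est_fixed, gains_fixed = scalar_kalman(measurements, r=0.01, q=0.0)
est_adaptive, gains_adaptive = scalar_kalman(measurements, r=0.01, q=1e-3)

print("q=0:    final estimate %.3f, final gain %.4f" % (est_fixed[-1], gains_fixed[-1]))
print("q=1e-3: final estimate %.3f, final gain %.4f" % (est_adaptive[-1], gains_adaptive[-1]))
```

With `q=0` the filter behaves like a running average over all samples, so it ends up roughly halfway between the two levels; with a positive `q` the gain settles at a nonzero value and the estimate follows the step.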
EDIT:
In answer to Claes:
I am not sure that the Kalman filter is valid for my problem since I don't have a system model, I just have a bunch of readings from a quite noisy sensor measuring a not very predictable variable.
To keep things simple, imagine reading a potentiometer ( a variable resistor ) changed by the user, you can't predict or model the user's behavior.
I have implemented a very basic SMA ( Simple Moving Average ) algorithm and I was wondering if there is a better way to do it.
Is the Kalman filter valid for a problem like this?
If not, what would you suggest?
2ND EDIT
Thanks to Claes for such useful information.
I have been doing some numerical tests in MATLAB (with no real data yet), and convolution with a Gaussian filter seems to give the most accurate result.
With the Kalman filter I don't know how to estimate the process and measurement variances. Is there any method for that? The Kalman filter only seems to adapt when I decrease the measurement variance quite a lot. In the previous image the measurement variance was R=0.1^2 (the one in the original example). This is the same test with R=0.01^2
Of course, these are MATLAB tests with no real data. Tomorrow I will try to implement these filters in the real system with real data and see if I can get similar results.
A simple MA filter is probably sufficient for your example. If you would like to use the Kalman filter, there is a great example in the SciPy Cookbook.
I have modified the code to include a step change so you can see the convergence.
# Kalman filter example demo in Python
# A Python implementation of the example given in pages 11-15 of "An
# Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
# University of North Carolina at Chapel Hill, Department of Computer
# Science, TR 95-041,
# http://www.cs.unc.edu/~welch/kalman/kalmanIntro.html
# by Andrew D. Straw
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)

# initial parameters
n_iter = 400
sz = (n_iter,) # size of array
x1 = -0.37727*np.ones(n_iter//2) # truth value 1
x2 = -0.57727*np.ones(n_iter//2) # truth value 2
x = np.concatenate((x1,x2),axis=0)
z = x+np.random.normal(0,0.1,size=sz) # observations (normal about x, sigma=0.1)
Q = 1e-5 # process variance

# allocate space for arrays
xhat=np.zeros(sz) # a posteriori estimate of x
P=np.zeros(sz) # a posteriori error estimate
xhatminus=np.zeros(sz) # a priori estimate of x
Pminus=np.zeros(sz) # a priori error estimate
K=np.zeros(sz) # gain or blending factor
R = 0.1**2 # estimate of measurement variance, change to see effect

# initial guesses
xhat[0] = 0.0
P[0] = 1.0

for k in range(1,n_iter):
    # time update
    xhatminus[k] = xhat[k-1]
    Pminus[k] = P[k-1]+Q
    # measurement update
    K[k] = Pminus[k]/( Pminus[k]+R )
    xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
    P[k] = (1-K[k])*Pminus[k]

plt.figure()
plt.plot(z,'k+',label='noisy measurements')
plt.plot(xhat,'b-',label='a posteriori estimate')
plt.plot(x,color='g',label='truth value')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('Iteration')
plt.ylabel('Voltage')
plt.show()
And the output is:

Resources