The interpretation of cross validation scores

I am trying to understand this part of code that I found on the Internet:
kfold = KFold(n_splits=7, random_state=seed)
results = cross_val_score(estimator, x, y, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
What does cross_val_score do?
I know that it calculates scores. I want to understand the meaning of these scores, and how they are evaluated.

Here's how cross_val_score works:
1. As seen in the source code of cross_val_score, the x you supplied is split into X_train and X_test using cv=kfold; the same is done for y.
2. X_test is held back, while X_train and y_train are passed on to the estimator's fit().
3. After fitting, the estimator is scored using X_test and y_test.
4. Steps 1 to 3 are repeated for each of the folds specified in kfold, and the array of per-fold scores is returned from cross_val_score.
Explanation of the 3rd point: scoring depends on the estimator and on the scoring parameter of cross_val_score. In your code you have not passed any scorer in scoring, so the default estimator.score() is used.
If the estimator is a classifier, estimator.score(X_test, y_test) returns accuracy; if it's a regressor, it returns R-squared. (Note that your print statement labels the result "MSE", but with the default scorer the numbers are whatever estimator.score() returns, which for a typical sklearn regressor is R-squared, not MSE.)
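If you actually want MSE, pass an explicit scorer. A minimal sketch, using a synthetic dataset and a LinearRegression purely for illustration (your estimator and data will differ):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
estimator = LinearRegression()
kfold = KFold(n_splits=7, shuffle=True, random_state=7)

# default scoring: estimator.score(), i.e. R-squared for a regressor
r2_scores = cross_val_score(estimator, X, y, cv=kfold)

# explicit scorer: negative MSE (sklearn negates it so that larger is always better)
mse_scores = -cross_val_score(estimator, X, y, cv=kfold,
                              scoring="neg_mean_squared_error")

print("R^2: %.3f (%.3f)" % (r2_scores.mean(), r2_scores.std()))
print("MSE: %.3f (%.3f)" % (mse_scores.mean(), mse_scores.std()))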

Related

When passing the init_score parameter to lightgbm for binary targets, do I pass my initial predictions as probabilities or as logit values?

My understanding of GBMs is that the trees in the GBM model/predict the logit (log odds), which is then wrapped in the logistic function to get the final probability prediction. Given this, it is unclear whether I should pass my initial prediction as probabilities or as logit values.
Also when I make predictions, do I sum my initial probability prediction with the GBM probability prediction, or do I convert both to logits and then use the logistic function on the summed logits predictions?
Maybe this blog will give you some idea.
It says the init_score for binary classification is calculated in this way:
import numpy as np

def logloss_init_score(y):
    # base rate of the positive class, converted to log-odds
    p = y.mean()
    p = np.clip(p, 1e-15, 1 - 1e-15)  # never hurts
    log_odds = np.log(p / (1 - p))
    return log_odds
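So the initial prediction should be passed as log-odds (logit values), not probabilities, and combining with the boosted trees should also happen in logit space. A minimal sketch of the whole round trip on synthetic data; the key assumption here, consistent with LightGBM's behaviour, is that predictions made with raw_score=True do not include the init_score, so it has to be added back before applying the sigmoid:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

init_logit = logloss_init_score(y)        # helper defined just above
init_score = np.full(len(y), init_logit)  # one log-odds value per training row

train_set = lgb.Dataset(X, label=y, init_score=init_score)
booster = lgb.train({"objective": "binary", "verbosity": -1},
                    train_set, num_boost_round=50)

# raw_score=True returns the summed tree outputs in logit space; the init_score
# is not stored in the model, so add it back before applying the sigmoid
raw = booster.predict(X, raw_score=True)
prob = 1.0 / (1.0 + np.exp(-(raw + init_logit)))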

What techniques are effective to find periodicity in arbitrary data points?

By "arbitrary" I mean that I don't have a signal sampled on a grid that is amenable to taking an FFT. I just have points (e.g. in time) where events happened, and I'd like an estimate of the rate, for example:
p = [0, 1.1, 1.9, 3, 3.9, 6.1 ...]
...could be hits from a process with a nominal periodicity (repetition interval) of 1.0, but with noise and some missed detections.
Are there well known methods for processing such data?
A least-squares algorithm may do the trick, if correctly initialized; a clustering method can be used for that initialization.
An FFT represents the signal as a sum of sine waves, and the amplitude at each frequency can be seen as the result of a least-squares fit to the signal. Hence, if the signal is unevenly sampled, solving the same least-squares problem still makes sense when the Fourier transform is to be estimated; applied to an evenly sampled signal, it boils down to the same result.
Since your signal is discrete (a list of event times), you may want to fit it with a Dirac comb. It seems more sound to minimize the sum of squared distances from each event to the nearest tooth of the Dirac comb. This is a non-linear optimization problem where the Dirac comb is described by its period and offset, and this non-linear least-squares problem can be solved by means of the Levenberg-Marquardt algorithm. Here is a Python example making use of the scipy.optimize.leastsq() function. Moreover, the error on the estimated period and offset can be estimated as described in "How to compute standard deviation errors with scipy.optimize.least_squares"; it is also covered in the documentation of curve_fit() and in "Getting standard errors on fitted parameters using the optimize.leastsq method in Python".
Nevertheless, half the period, a third of the period, and so on would also fit, and multiples of the period are local minima; these are to be avoided by refining the initialization of the Levenberg-Marquardt algorithm. To this end, the differences between event times can be clustered, the cluster featuring the smallest value being that of the expected period. As proposed in "Clustering values by their proximity in python (machine learning?)", the clustering function sklearn.cluster.MeanShift() is applied.
Notice that the procedure can be extended to multidimensional data to look for periodic patterns, or mixed periodic patterns featuring different fundamental periods.
import numpy as np
from scipy.optimize import leastsq
from sklearn.cluster import MeanShift, estimate_bandwidth

ticks = [0, 1.1, 1.9, 3, 3.9, 6.1]

def crudeEstimate():
    # Crude estimate of the period: cluster the pairwise differences between
    # event times and keep the smallest cluster centre.
    diffs = np.zeros(len(ticks) * (len(ticks) - 1) // 2)
    k = 0
    for i in range(len(ticks)):
        for j in range(i):
            diffs[k] = ticks[i] - ticks[j]
            k += 1
    # see https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning
    X = np.column_stack((diffs, np.zeros(len(diffs))))
    bandwidth = estimate_bandwidth(X, quantile=1.0 / len(ticks))
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_
    print(cluster_centers)
    for k in range(len(np.unique(labels))):
        my_members = labels == k
        print("cluster {0}: {1}".format(k, X[my_members, 0]))
    estimated_period = np.min(cluster_centers[:, 0])
    return estimated_period

def disttoDiracComb(x):
    # Residual vector: distance of each event to the nearest tooth of the
    # Dirac comb(s) described by x = [period_0, offset_0, period_1, offset_1, ...]
    residual = np.zeros(len(ticks))
    for i in range(len(ticks)):
        mindist = np.inf
        for j in range(len(x) // 2):
            period = x[2 * j]
            offset = x[2 * j + 1]
            index = np.floor((ticks[i] - offset) / period)
            currdist = ticks[i] - (index * period + offset)
            if currdist > 0.5 * period:
                currdist = period - currdist
            if currdist < mindist:
                mindist = currdist
        residual[i] = mindist
    return residual

estimated_period = crudeEstimate()
print('crude estimate by clustering :', estimated_period)
xp = np.array([estimated_period, 0.0])
p, pcov, infodict, mesg, ier = leastsq(disttoDiracComb, x0=xp, ftol=1e-18, full_output=True)
# see https://stackoverflow.com/questions/14581358/getting-standard-errors-on-fitted-parameters-using-the-optimize-leastsq-method-i
s_sq = (disttoDiracComb(p) ** 2).sum() / (len(ticks) - len(p))
pcov = pcov * s_sq
perr = np.sqrt(np.diag(pcov))
print('estimated period is', p[0], '+/-', 1.96 * perr[0])
print('estimated offset is', p[1], '+/-', 1.96 * perr[1])
Applied to your sample, it prints:
crude estimate by clustering : 0.975
estimated period is 1.0042857141346768 +/- 0.04035792507868619
estimated offset is -0.011428571139828817 +/- 0.13385206912205957
It sounds like you need to decide what exactly you want to determine. If you want to know the average interval in a set of timestamps, then that's easy (just take the mean or median).
If you expect that the interval could be changing, then you can compute a windowed moving average of the intervals (see the sketch below). You need some idea of how fast the rate is changing so that you can select your window size appropriately: a larger window gives a smoother result, while a smaller window is more responsive to a faster-changing rate.
If you have no idea whether the data is following any sort of pattern, then you are probably in the territory of data exploration. In that case, I would start by plotting the intervals, to see if a pattern appears to the eye. This might also benefit from applying a moving average if the data is quite noisy.
Essentially, whether or not there is something in the data and what it means is up to you and your knowledge of the domain. That is, in any set of timestamps there will be an average (and you can also easily calculate the variance to give an indication of variability in the data), but it is up to you whether that average carries any meaning.
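A minimal sketch of those two ideas (mean/median of the intervals, plus a windowed moving average); the event times and the window size here are just illustrative:

import numpy as np

events = np.array([0, 1.1, 1.9, 3, 3.9, 6.1])   # event timestamps
intervals = np.diff(events)                      # time between consecutive events

print("mean interval:", intervals.mean())
print("median interval:", np.median(intervals))

# windowed moving average of the intervals: a larger window smooths more,
# a smaller window reacts faster to a changing rate
window = 3
smoothed = np.convolve(intervals, np.ones(window) / window, mode='valid')
print("moving average of intervals:", smoothed)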

word2vec training procedure clarification

I'm trying to learn the skip-gram model within word2vec, however I'm confused by some of the basic concepts. To start, here is my current understanding of the model motivated with an example. I am using Python gensim as I go.
Here I have a corpus with three sentences.
sentences = [
['i', 'like', 'cats', 'and', 'dogs'],
['i', 'like', 'dogs'],
['dogs', 'like', 'dogs']
]
From this, I can determine my vocabulary, V = ['and', 'cats', 'dogs', 'i', 'like'].
Following this paper by Tomas Mikolov (and others)
The basic Skip-gram formulation defines p(w_{t+j} | w_t) using the softmax function:
p(w_O | w_I) = exp(v′_{w_O} · v_{w_I}) / Σ_{w=1..W} exp(v′_w · v_{w_I})
where v_w and v′_w are the “input” and “output” vector representations of w, and W is the number of words in the vocabulary.
To my understanding, the skip-gram model involves two matrices (I'll call them I and O) which hold the vector representations of the "input/center" words and of the "output/context" words. Assuming d = 2 (the vector dimension, or 'size' as it's called in gensim), I should be a 5x2 matrix and O should be a 2x5 matrix. At the start of the training procedure, these matrices are filled with random values (yes?). So we might have
import numpy as np
np.random.seed(2017)
I = np.random.rand(5,2).round(2) # 5 rows by 2 cols
[[ 0.02 0.77] # and
[ 0.45 0.12] # cats
[ 0.93 0.65] # dogs
[ 0.14 0.23] # i
[ 0.23 0.26]] # like
O = np.random.rand(2,5).round(2) # 2 rows by 5 cols
#and #cats #dogs #i #like
[[ 0.11 0.63 0.39 0.32 0.63]
[ 0.29 0.94 0.15 0.08 0.7 ]]
Now, if I want to calculate the probability that the word "dogs" appears in the context of "cats", I should do
exp([0.39, 0.15] · [0.45, 0.12]) / (...) = exp(0.1935) / (...) ≈ 1.21 / (...)
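A quick numpy check of that calculation (just the skip-gram softmax, using the toy matrices written out above):

import numpy as np

# rows of I / columns of O are ordered: and, cats, dogs, i, like
I = np.array([[0.02, 0.77],
              [0.45, 0.12],
              [0.93, 0.65],
              [0.14, 0.23],
              [0.23, 0.26]])
O = np.array([[0.11, 0.63, 0.39, 0.32, 0.63],
              [0.29, 0.94, 0.15, 0.08, 0.70]])

v_cats = I[1]                                   # input vector of the center word "cats"
scores = v_cats @ O                             # dot product with every output vector
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
print(probs[2])                                 # p("dogs" | center word "cats")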
A few questions on this:
Is my understanding of the algorithm correct thus far?
Using gensim, I can train a model on this data using
import gensim
model = gensim.models.Word2Vec(sentences, sg = 1, size=2, window=1, min_count=1)
model.wv['dogs'] # array([ 0.06249372, 0.22618999], dtype=float32)
For the array given, is that the vector for "dogs" in the Input matrix or the Output matrix? Is there a way to view both matrices in the final model?
Why does model.wv.similarity('cats','cats') = 1? I thought this should be closer to 0, since the data would indicate that the word "cats" is unlikely to occur in the context of the word "cats".
(1) Generally, yes, but:
The O output matrix – more properly understood as the weights from the neural network's hidden layer to a number of output nodes – is interpreted differently depending on whether 'negative sampling' ('NS') or 'hierarchical softmax' ('HS') training is used.
In practice, both I and O have len(vocab) rows and vector-size columns. (I is the Word2Vec model instance's model.wv.syn0 array; O is its model.syn1neg array in NS, or model.syn1 in HS. A sketch of how to inspect both follows below.)
I find NS a bit easier to think about: each predictable word corresponds to a single output node. For training data where (context)-indicates->(word), training tries to drive that word's node value toward 1.0, and the other randomly-chosen word node values toward 0.0.
In HS, each word is represented by a Huffman code over a small subset of the output nodes – those 'points' are driven to 1.0 or 0.0 to make the network more indicative of a single word after a (context)-indicates->(word) example.
Only the I matrix (the initial word vectors) is randomized to low-magnitude vectors at the beginning. (The hidden-to-output weights O are left as zeros.)
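A small sketch of how to look at both matrices on the toy model from the question; the attribute names here are the pre-4.0 gensim ones matching the question's code (in gensim 4.x the input vectors live in model.wv.vectors and the word lookup in model.wv.key_to_index):

import gensim

sentences = [
    ['i', 'like', 'cats', 'and', 'dogs'],
    ['i', 'like', 'dogs'],
    ['dogs', 'like', 'dogs'],
]
model = gensim.models.Word2Vec(sentences, sg=1, size=2, window=1, min_count=1)

print(model.wv.syn0.shape)    # I: one row per vocabulary word, vector-size columns -> (5, 2)
print(model.syn1neg.shape)    # O: hidden-to-output weights for negative sampling -> (5, 2)

idx = model.wv.vocab['dogs'].index             # row index of 'dogs'
print(model.wv.syn0[idx], model.syn1neg[idx])  # input vs. output vector for 'dogs'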
(2) Yes, that'll train things - just note that tiny toy-sized examples won't necessarily produce the useful constellations of vector coordinates that word2vec is valued for.
(3) Note, model.similarity('cats', 'cats') is actually checking the cosine-similarity between the (input) vectors for those two words. Those are the same word, thus they definitionally have the same vector, and the similarity between identical vectors is 1.0.
That is, similarity() is not asking the model for a prediction, it's retrieving learned words by key and comparing those vectors. (Recent versions of gensim do have a predict_output_word() function, but it only works in NS mode, and making predictions isn't really the point of word2vec, and many implementations don't offer any prediction API at all. Rather, the point is using those attempted predictions during training to induce word-vectors that turn out to be useful for various other tasks.)
But even if you were reading predictions, 'cats' might still be a reasonable-although-bad prediction from the model in the context of 'cats'. The essence of forcing large vocabularies into the smaller dimensionality of 'dense' embeddings is compression – the model has no choice but to cluster related words together, because there's not enough internal complexity (learnable parameters) to simply memorize all details of the input. (And for the most part, that's a good thing, because it results in generalizable patterns, rather than just overfit idiosyncrasies of the training corpus.)
The word 'cats' will wind up close to 'dogs' and 'pets' – because they all co-occur with similar words, or each other. And thus the model will be forced to make similar output-predictions for each, because their input-vectors don't vary that much. And a few predictions that are nonsensical in logical language use – like a repeating word - may be made, but only because taking a larger error there still gives less error over the whole training set, compared to other weighting alternatives.
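Finally, a tiny numpy illustration of why similarity('cats', 'cats') is 1.0, using the toy input vectors from the question (as noted under point 3, similarity() is just the cosine between the stored input vectors):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_cats = np.array([0.45, 0.12])
v_dogs = np.array([0.93, 0.65])

print(cosine(v_cats, v_cats))   # 1.0: a word's vector is always identical to itself
print(cosine(v_cats, v_dogs))   # what similarity('cats', 'dogs') would report for these vectors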

KALMAN filter doesn't respond to changes

I am implementing a Kalman filter for the first time to get voltage values from a source. It works and it stabilizes at the source voltage value, but if the source voltage then changes, the filter doesn't adapt to the new value.
I use 3 steps:
Get the Kalman gain
KG = previous_error_in_estimate / ( previous_error_in_estimate + Error_in_measurement )
Get current estimation
Estimation = previous_estimation + KG*[measurement - previous_estimation]
Calculate the error in estimate
Error_in_estimate = [1-KG]*previous_error_in_estimate
The thing is that, since 0 <= KG <= 1, Error_in_estimate decreases more and more, and that makes KG decrease more and more as well (Error_in_measurement is a constant), so in the end the estimation depends only on the previous estimation and the current measurement is barely taken into account.
This prevents the filter from adapting to measurement changes.
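For instance, a direct transcription of the three steps above (a toy sketch with made-up numbers: a 5 V source that steps to 7 V halfway through) shows the gain collapsing:

error_in_measurement = 0.1
error_in_estimate = 1.0
estimate = 0.0

for measurement in [5.0] * 50 + [7.0] * 50:
    kg = error_in_estimate / (error_in_estimate + error_in_measurement)
    estimate = estimate + kg * (measurement - estimate)
    error_in_estimate = (1 - kg) * error_in_estimate

# by the time the source steps to 7 V, kg has become tiny, so the
# estimate is still well short of 7 V at the end of the run
print(estimate, kg)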
How can I make the filter keep adapting?
Thanks
EDIT:
Answering to Claes:
I am not sure that the Kalman filter is valid for my problem, since I don't have a system model; I just have a bunch of readings from a quite noisy sensor measuring a not very predictable variable.
To keep things simple, imagine reading a potentiometer (a variable resistor) changed by the user: you can't predict or model the user's behavior.
I have implemented a very basic SMA (Simple Moving Average) algorithm, and I was wondering if there is a better way to do it.
Is the Kalman filter valid for a problem like this?
If not, what would you suggest?
2ND EDIT
Thanks to Claes for such useful information.
I have been doing some numerical tests in MATLAB (with no real data yet), and convolving with a Gaussian filter seems to give the most accurate result.
With the Kalman filter I don't know how to estimate the process and measurement variances; is there any method for that? Only when I decrease the measurement variance quite a lot does the Kalman filter seem to adapt. In the previous image the measurement variance was R = 0.1^2 (the one in the original example); this is the same test with R = 0.01^2.
Of course, these are MATLAB tests with no real data. Tomorrow I will try to implement these filters in the real system with real data and see if I can get similar results.
A simple MA filter is probably sufficient for your example. If you would like to use the Kalman filter there is a great example at the SciPy cookbook
I have modified the code to include a step change so you can see the convergence.
# Kalman filter example demo in Python
# A Python implementation of the example given in pages 11-15 of "An
# Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
# University of North Carolina at Chapel Hill, Department of Computer
# Science, TR 95-041,
# http://www.cs.unc.edu/~welch/kalman/kalmanIntro.html
# by Andrew D. Straw

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 8)

# initial parameters
n_iter = 400
sz = (n_iter,)  # size of array
x1 = -0.37727*np.ones(n_iter//2)  # truth value 1
x2 = -0.57727*np.ones(n_iter//2)  # truth value 2
x = np.concatenate((x1, x2), axis=0)
z = x + np.random.normal(0, 0.1, size=sz)  # observations (normal about x, sigma=0.1)

Q = 1e-5  # process variance

# allocate space for arrays
xhat = np.zeros(sz)       # a posteriori estimate of x
P = np.zeros(sz)          # a posteriori error estimate
xhatminus = np.zeros(sz)  # a priori estimate of x
Pminus = np.zeros(sz)     # a priori error estimate
K = np.zeros(sz)          # gain or blending factor

R = 0.1**2  # estimate of measurement variance, change to see effect

# initial guesses
xhat[0] = 0.0
P[0] = 1.0

for k in range(1, n_iter):
    # time update
    xhatminus[k] = xhat[k-1]
    Pminus[k] = P[k-1] + Q

    # measurement update
    K[k] = Pminus[k]/(Pminus[k] + R)
    xhat[k] = xhatminus[k] + K[k]*(z[k] - xhatminus[k])
    P[k] = (1 - K[k])*Pminus[k]

plt.figure()
plt.plot(z, 'k+', label='noisy measurements')
plt.plot(xhat, 'b-', label='a posteriori estimate')
plt.plot(x, color='g', label='truth value')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('Iteration')
plt.ylabel('Voltage')
plt.show()
And the output is a plot of the noisy measurements, the a posteriori estimate, and the two-level truth value.
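For comparison, here is what a plain moving average does on the same step signal (a small sketch that reuses the z, x and n_iter arrays from the script above):

# simple moving average of the same noisy measurements; the window size trades
# smoothness against responsiveness to the step change
window = 20
ma = np.convolve(z, np.ones(window) / window, mode='valid')

plt.figure()
plt.plot(z, 'k+', label='noisy measurements')
plt.plot(np.arange(window - 1, n_iter), ma, 'r-', label='moving average')
plt.plot(x, color='g', label='truth value')
plt.legend()
plt.show()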

Analysing the result of LSTM Theano Sentiment Analysis

I'm trying the code from this link http://deeplearning.net/tutorial/lstm.html but changing the imdb data to my own. This is the screenshot of my result.
I want to determine the overall accuracy of running LSTM for sentiment analysis, but I cannot understand the output. The train, valid, and test values are printed multiple times, but they are usually the same values.
Any help would be much appreciated.
The value it prints is computed by the following function:
def pred_error(f_pred, prepare_data, data, iterator, verbose=False):
    """
    Just compute the error
    f_pred: Theano fct computing the prediction
    prepare_data: usual prepare_data for that dataset.
    """
    valid_err = 0
    for _, valid_index in iterator:
        x, mask, y = prepare_data([data[0][t] for t in valid_index],
                                  numpy.array(data[1])[valid_index],
                                  maxlen=None)
        preds = f_pred(x, mask)
        targets = numpy.array(data[1])[valid_index]
        valid_err += (preds == targets).sum()
    # numpy_floatX is a small helper defined elsewhere in the tutorial code
    valid_err = 1. - numpy_floatX(valid_err) / len(data[0])

    return valid_err
It is easy to follow: what it computes is 1 - accuracy, where accuracy is the percentage of samples labeled correctly. In other words, you get around 72% accuracy on the training set, almost 95% accuracy on the validation set, and 50% accuracy on the test set.
The fact that your validation accuracy is so high compared to the training accuracy is a little suspicious; I would trace the predictions and check whether your validation set is somehow unrepresentative, or too small.
