Equal performance across training and test set. Is it normal?

I run a simple logit model, but I always get the same performance no matter the size of the training set. I started with a 90% training split and then kept trying smaller sizes. Even with a training size of 15% nothing changes at all: performance on the test set is the same as on the training set, and this holds for every metric: accuracy, sensitivity, specificity, etc.
The dataset has been preprocessed by removing outliers and missing values (some monetary features, such as income, were log-transformed instead of trimmed).
At first glance I thought this might happen because the two-class proportion is the same (roughly 80-20) in both train and test, no matter the splitting size.
As a matter of fact, even with a 15% training size the proportion is the same across training and test. So I tried running the model on a very small training sample (fewer than 1,000 instances) and I did get different performance across training and test.
However, I'm here to ask whether there is some other explanation for this.
library(caret)  # for confusionMatrix()

# d5 is the dataset
n <- nrow(d5)
index <- sample(1:n, n*0.80, replace = FALSE)
training <- d5[index, ]
test <- d5[-index, ]
wgths <- ifelse(training$status == 0, 0.21, 0.79)  # weights because the classes are imbalanced
logit <- glm(training$status ~ ., data = training[, -c(1,4,6,11,14)],  # some useless columns excluded
             family = binomial('logit'), weights = wgths)
score <- ifelse(logit$fitted.values > 0.5, 1, 0)   # classification on the training set
prb <- predict.glm(logit, newdata = test, type = 'response')
score1 <- ifelse(prb > 0.5, 1, 0)                  # classification on the test set
train_mat <- confusionMatrix(as.factor(score), as.factor(training$status), '1')
test_mat <- confusionMatrix(as.factor(score1), as.factor(test$status), '1')
Every train/test comparison of the confusion matrices looks something like the two below.
Confusion Matrix and Statistics (Training)
Reference
Prediction 0 1
0 482321 66330
1 271584 130485
Accuracy : 0.6446
95% CI : (0.6436, 0.6455)
No Information Rate : 0.793
P-Value [Acc > NIR] : 1
Kappa : 0.2185
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6630
Specificity : 0.6398
Pos Pred Value : 0.3245
Neg Pred Value : 0.8791
Prevalence : 0.2070
Detection Rate : 0.1372
Detection Prevalence : 0.4229
Balanced Accuracy : 0.6514
'Positive' Class : 1
Confusion Matrix and Statistics (Test)
Reference
Prediction 0 1
0 120775 16682
1 67544 32679
Accuracy : 0.6456
95% CI : (0.6437, 0.6476)
No Information Rate : 0.7923
P-Value [Acc > NIR] : 1
Kappa : 0.2198
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6620
Specificity : 0.6413
Pos Pred Value : 0.3261
Neg Pred Value : 0.8786
Prevalence : 0.2077
Detection Rate : 0.1375
Detection Prevalence : 0.4217
Balanced Accuracy : 0.6517
'Positive' Class : 1
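For what it's worth, this is exactly the behaviour you would expect when a model with few parameters is fit on a large sample: train and test come from the same distribution, so once the training set is big enough for the coefficients to stabilise, every metric converges to the same value on both sides of the split. Below is a minimal synthetic sketch of that effect (in Python/scikit-learn rather than the R setup above; the data and the 80-20 class balance are made up here):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic 80/20 two-class problem, large enough for the logit coefficients to stabilise
X, y = make_classification(n_samples=200_000, n_features=10, weights=[0.8, 0.2], random_state=0)

for train_size in (0.15, 0.50, 0.90):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=train_size, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(train_size,
          round(accuracy_score(y_tr, clf.predict(X_tr)), 3),   # train accuracy
          round(accuracy_score(y_te, clf.predict(X_te)), 3))   # test accuracy
For every split size the printed train and test accuracies differ by at most a few tenths of a percentage point, which is the pattern in the confusion matrices above; a visible gap only appears when the training sample becomes small, as observed with fewer than 1,000 instances.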

Related

What is the most efficient way to loop in Haxe?

I couldn't find any information on actual performance differences between loops in Haxe. I've seen it mentioned that Vector has some speed optimizations because it is fixed-length. What is the best way to loop over objects? And does it depend on the iterable object (e.g. Array vs. Vector vs. Map)?
Why does Haxe have so little presence on SO? Every other language has this question answered 5 times over...
Since no one had done performance benchmarks that I have found, I decided to run a test so this info is available to future Haxe programmers.
First note: if you aren't running the loop very often, it is so fast that it has almost no effect on performance, so if it is easier to just use an Array, do that. Performance only matters if you iterate over the structure again and again and/or the structure is really big.
It turns out that the best choice depends mostly on your data structure. I found that Arrays tend to be faster with the for-each style loop than with the standard for loop or a while loop. At small sizes Arrays are essentially as fast as Vectors, so in most cases you don't need to worry about which one to use. However, if you are working with really large Arrays, it is well worth switching to a Vector. With a Vector, the standard for loop and a while loop are essentially equivalent (while is a touch faster). Maps are also pretty fast, especially if you avoid the for-each loop.
To reach these conclusions, I first tested the loops under these conditions:
Tested Array, Vector and Map (Map just for fun).
Filled each one to be structure[i] = i where i in 0...size with sizes in [20, 100, 1000, 10000, 100000] so you can find the right size for you.
Tested each data structure at each size using three loop types
for (i in 0...size)
for (item in array)
while (i < size)
and inside each loop I performed a look up and assignment arr[i] = arr[i] + 1;
Each loop type was inside its own loop for (iter in 0...1000) to get a more accurate read on how loops perform. Note that I just add together the times for each loop, I don't average or anything like that. So if an array took 12 seconds, it was really 12 / 1000 => 0.012 seconds to execute once on average.
Finally, here is my benchmark (run in Debug for neko in HaxeDevelop):
Running test on size 20:
for (i...20) x 1000
Array : 0.0019989013671875
Vector : 0
Map : 0.00300025939941406
for each(i in iterable) x 1000
Array : 0.00100135803222656
Vector : 0.00099945068359375
Map : 0.0209999084472656
while (i < 20) x 1000
Array : 0.00200080871582031
Vector : 0.00099945068359375
Map : 0.0019989013671875
Running test on size 100:
for (i...100) x 1000
Array : 0.0120010375976563
Vector : 0.0019989013671875
Map : 0.0120010375976563
for each(i in iterable) x 1000
Array : 0.00600051879882813
Vector : 0.00299835205078125
Map : 0.0190010070800781
while (i < 100) x 1000
Array : 0.0119991302490234
Vector : 0.00200080871582031
Map : 0.0119991302490234
Running test on size 1000:
for (i...1000) x 1000
Array : 0.11400032043457
Vector : 0.0179996490478516
Map : 0.104999542236328
for each(i in iterable) x 1000
Array : 0.0550003051757813
Vector : 0.0229988098144531
Map : 0.210000991821289
while (i < 1000) x 1000
Array : 0.105998992919922
Vector : 0.0170001983642578
Map : 0.101999282836914
Running test on size 10000:
for (i...10000) x 1000
Array : 1.09500122070313
Vector : 0.180000305175781
Map : 1.09700012207031
for each(i in iterable) x 1000
Array : 0.553998947143555
Vector : 0.222999572753906
Map : 2.17600059509277
while (i < 10000) x 1000
Array : 1.07900047302246
Vector : 0.170999526977539
Map : 1.0620002746582
Running test on size 100000:
for (i...100000) x 1000
Array : 10.9670009613037
Vector : 1.80499839782715
Map : 11.0330009460449
for each(i in iterable) x 1000
Array : 5.54100036621094
Vector : 2.21299934387207
Map : 20.4000015258789
while (i < 100000) x 1000
Array : 10.7889995574951
Vector : 1.71500015258789
Map : 10.8209991455078
total time: 83.8239994049072
Hope that helps anyone who is worried about performance and Haxe and who needs to use a lot of loops.

Generate empirical/user defined distribution with desired mean and std

I have generated a demand distribution based on the actual demand data of one year. This distribution is not normal, nor is it close to any standard theoretical distribution. I use this empirical demand distribution for a simulation study.
In current empirical distribution:
mean = 1000
std = 600
Coefficient of variation (CV) = 0.6
I want to use the current empirical distribution's pattern/shape as the base case and generate four additional distributions:
dist1: Low volume, low variation -> mean:500, std:150, CV:0.3
dist2: Low volume, high variation -> mean:500, std:665, CV:1.33
dist3: High volume, low variation -> mean:2000, std:600, CV:0.3
dist4: High volume, high variation -> mean:2000, std:2660, CV:1.33
The key purpose is to investigate how changes in demand volume and demand variation impact the simulated system. Is it statistically feasible to create such distributions (dist1-4 above), or do I have to switch to a normal distribution?
Your problem is under-specified, but it might suffice to apply an appropriate linear function to your given distribution.
Since E(aX+b) = aE(X) + b and StDev(aX+b) = |a|StDev(X), you can pick a and b so that you get the given target parameters.
Suppose that you have a function f() which generates values with mean 1000 and standard deviation 600. Picking a = s/600 and b = m - a*1000 = m - 5*s/3, the following definition will generate random numbers with mean m and standard deviation s:
g(m,s) = (s/600)*f() + m - 5*s/3
A quick test in R:
> f <- function() rnorm(1,1000,600) #mock empirical f()
> g <- function(m,s) (s/600)*f()+m-5*s/3
> x <- replicate(1000,g(2000,300))
> mean(x)
[1] 1988.719
> sd(x)
[1] 300.7044
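The same linear rescaling can be applied to the actual empirical sample instead of an rnorm mock-up. A sketch in Python/numpy (the demand vector below is a made-up stand-in for the observed year of data, and the resampling step is a plain bootstrap of those values):
import numpy as np

rng = np.random.default_rng(0)
demand = rng.gamma(shape=2.8, scale=357.0, size=365)  # stand-in for one year of observed demand

def g(m, s, size, base=demand):
    # bootstrap-resample the empirical data, then rescale linearly to mean m and std s
    x = rng.choice(base, size=size, replace=True)
    a = s / base.std()
    return a * x + (m - a * base.mean())

for m, s in [(500, 150), (500, 665), (2000, 600), (2000, 2660)]:
    x = g(m, s, size=100_000)
    print(m, s, round(x.mean(), 1), round(x.std(), 1))  # sample mean/std land close to the targets
Because the transformation is linear, the shape of the base distribution is preserved; only its location and scale change.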

KALMAN filter doesn't respond to changes

I am implementing a Kalman filter for the first time to get voltage values from a source. It works and stabilizes at the source voltage value, but if the source voltage then changes, the filter doesn't adapt to the new value.
I use 3 steps:
Get the Kalman gain
KG = previous_error_in_estimate / ( previous_error_in_estimate + Error_in_measurement )
Get current estimation
Estimation = previous_estimation + KG*[measurement - previous_estimation]
Calculate the error in estimate
Error_in_estimate = [1-KG]*previous_error_in_estimate
The thing is that, since 0 <= KG <= 1, Error_in_estimate keeps decreasing, which makes KG decrease as well (Error_in_measurement is a constant), so in the end the estimate depends only on the previous estimate and the current measurement is essentially ignored.
This prevents the filter from adapting to measurement changes.
What can I do to keep the filter adapting?
Thanks
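To see the collapse described above concretely, here is a minimal sketch of those three steps on synthetic data (the constants are made up and are not from the actual sensor):
import numpy as np

rng = np.random.default_rng(1)
truth = np.concatenate([np.full(200, 5.0), np.full(200, 7.0)])  # source steps from 5 V to 7 V
z = truth + rng.normal(0.0, 0.2, size=truth.size)               # noisy readings

est, err_est, err_meas = 0.0, 1.0, 0.2**2
for zk in z:
    KG = err_est / (err_est + err_meas)   # 1) Kalman gain
    est = est + KG * (zk - est)           # 2) current estimate
    err_est = (1 - KG) * err_est          # 3) error in estimate (nothing ever increases it)

print(KG, est)  # KG has collapsed toward 0 and the estimate lags far behind the 7 V level
The answer below avoids exactly this by adding a small process variance Q to the error estimate in the prediction step, which keeps the gain from shrinking to zero.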
EDIT:
Answering to Claes:
I am not sure that the Kalman filter is valid for my problem, since I don't have a system model; I just have a bunch of readings from a quite noisy sensor measuring a not very predictable variable.
To keep things simple, imagine reading a potentiometer ( a variable resistor ) changed by the user, you can't predict or model the user's behavior.
I have implemented a very basic SMA ( Simple Moving Average ) algorithm and I was wondering if there is a better way to do it.
Is the Kalman filter valid for a problem like this?
If not, what would you suggest?
2ND EDIT
Thanks to Claes for the very useful information.
I have been doing some numerical tests in MATLAB (with no real data yet), and convolving with a Gaussian filter seems to give the most accurate result.
With the Kalman filter I don't know how to estimate the process and measurement variances; is there any method for that? Only when I decrease the measurement variance quite a lot does the Kalman filter seem to adapt. My first test used the measurement variance from the original example, R = 0.1^2; repeating the same test with R = 0.01^2 adapts much better.
Of course, these are MATLAB tests with no real data. Tomorrow I will try to implement these filters in the real system with real data and see if I can get similar results.
A simple MA filter is probably sufficient for your example. If you would like to use the Kalman filter there is a great example at the SciPy cookbook
I have modified the code to include a step change so you can see the convergence.
# Kalman filter example demo in Python
# A Python implementation of the example given in pages 11-15 of "An
# Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
# University of North Carolina at Chapel Hill, Department of Computer
# Science, TR 95-041,
# http://www.cs.unc.edu/~welch/kalman/kalmanIntro.html
# by Andrew D. Straw
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)
# initial parameters
n_iter = 400
sz = (n_iter,) # size of array
x1 = -0.37727*np.ones(n_iter//2) # truth value 1
x2 = -0.57727*np.ones(n_iter//2) # truth value 2
x = np.concatenate((x1,x2),axis=0)
z = x+np.random.normal(0,0.1,size=sz) # observations (normal about x, sigma=0.1)
Q = 1e-5 # process variance
# allocate space for arrays
xhat=np.zeros(sz)      # a posteriori estimate of x
P=np.zeros(sz)         # a posteriori error estimate
xhatminus=np.zeros(sz) # a priori estimate of x
Pminus=np.zeros(sz)    # a priori error estimate
K=np.zeros(sz)         # gain or blending factor
R = 0.1**2 # estimate of measurement variance, change to see effect
# initial guesses
xhat[0] = 0.0
P[0] = 1.0
for k in range(1,n_iter):
    # time update
    xhatminus[k] = xhat[k-1]
    Pminus[k] = P[k-1]+Q
    # measurement update
    K[k] = Pminus[k]/( Pminus[k]+R )
    xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
    P[k] = (1-K[k])*Pminus[k]
plt.figure()
plt.plot(z,'k+',label='noisy measurements')
plt.plot(xhat,'b-',label='a posteri estimate')
plt.plot(x,color='g',label='truth value')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('Iteration')
plt.ylabel('Voltage')
The resulting plot shows the noisy measurements, the truth value with its step change, and the a posteriori estimate converging to the first level and then adapting to the new one.

Analysing the result of LSTM Theano Sentiment Analysis

I'm trying the code from http://deeplearning.net/tutorial/lstm.html, but with the IMDB data replaced by my own dataset. My result is a log of Train, Valid and Test values printed during training.
I want to determine the overall accuracy of running the LSTM for sentiment analysis, but I can't understand the output: the train, valid and test values are printed multiple times, usually with the same value.
Any help would be much appreciated.
The value it prints is computed by the following function:
def pred_error(f_pred, prepare_data, data, iterator, verbose=False):
    """
    Just compute the error
    f_pred: Theano fct computing the prediction
    prepare_data: usual prepare_data for that dataset.
    """
    valid_err = 0
    for _, valid_index in iterator:
        x, mask, y = prepare_data([data[0][t] for t in valid_index],
                                  numpy.array(data[1])[valid_index],
                                  maxlen=None)
        preds = f_pred(x, mask)
        targets = numpy.array(data[1])[valid_index]
        valid_err += (preds == targets).sum()
    valid_err = 1. - numpy_floatX(valid_err) / len(data[0])

    return valid_err
It is easy to follow: what it computes is 1 - accuracy, where accuracy is the percentage of samples labeled correctly. In other words, you get around 72% accuracy on the training set, almost 95% accuracy on the validation set, and 50% accuracy on the test set.
The fact that your validation accuracy is so much higher than the training accuracy is a little suspicious; I would trace the predictions and check whether the validation set is somehow not representative, or too small.
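In code, turning the printed error into an accuracy is just one subtraction; with illustrative numbers matching the percentages quoted above:
train_err, valid_err, test_err = 0.28, 0.053, 0.50  # illustrative values in the format the script prints
for name, err in [("train", train_err), ("valid", valid_err), ("test", test_err)]:
    print(name, "accuracy: %.1f%%" % (100 * (1.0 - err)))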

Effect of --oaa 2 and --loss_function=logistic in Vowpal Wabbit

What parameters should I use in VW for a binary classification task? For example, let's use rcv1_small.dat. I thought it would be better to use the logistic loss function (or hinge), and that it makes no sense to use --oaa 2. However, the empirical results (with progressive validation 0/1 loss reported in all 4 experiments) show that the best combination is --oaa 2 without logistic loss (i.e. with the default squared loss):
cd vowpal_wabbit/test/train-sets
cat rcv1_small.dat | vw --binary
# average loss = 0.0861
cat rcv1_small.dat | vw --binary --loss_function=logistic
# average loss = 0.0909
cat rcv1_small.dat | sed 's/^-1/2/' | vw --oaa 2
# average loss = 0.0857
cat rcv1_small.dat | sed 's/^-1/2/' | vw --oaa 2 --loss_function=logistic
# average loss = 0.0934
My primary question is: why does --oaa 2 not give exactly the same results as --binary (in the above setting)?
My secondary questions are: why does optimizing the logistic loss not improve the 0/1 loss (compared to optimizing the default squared loss)? Is this specific to this particular dataset?
I have experienced something similar while using --csoaa. The details can be found here. My guess is that in the case of a multiclass problem with N classes (no matter that you specified 2 as the number of classes), vw effectively works with N copies of the features. The same example gets a different ft_offset value when it is predicted/learned for each possible class, and this offset is used in the hashing algorithm. So every class gets an "independent" set of features from the same dataset row. Of course the feature values are the same, but vw doesn't keep values, only feature weights, and the weights are different for each possible class. Since the amount of RAM used to store these weights is fixed with -b (-b 18 by default), the more classes you have, the higher the chance of a hash collision. You can try increasing the -b value and check whether the difference between the --oaa 2 and --binary results decreases. But I might be wrong, as I didn't go too deep into the vw code.
As for the loss function: you can't compare the average loss values of the squared (default) and logistic loss functions directly. You should take the raw prediction values from the model trained with squared loss and evaluate those predictions with the logistic loss, log(1 + exp(-label * prediction)), where label is the a priori known answer. Such functions (float getLoss(float prediction, float label)) for all loss functions implemented in vw can be found in loss_functions.cc. Or you can first scale the raw prediction to [0..1] with 1.f / (1.f + exp(-prediction)) and then compute the log loss as described on kaggle.com:
double val = 1.f / (1.f + exp(- prediction)); // y = f(x) -> [0, 1]
if (val < 1e-15) val = 1e-15;
if (val > (1.0 - 1e-15)) val = 1.0 - 1e-15;
float xx = (label < 0)?0:1; // label {-1,1} -> {0,1}
double loss = xx*log(val) + (1.0 - xx) * log(1.0 - val);
loss *= -1;
You can also scale raw predictions to [0..1] with the '/vowpal_wabbit/utl/logistic' script or the --link=logistic parameter. Both use 1/(1+exp(-i)).
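For reference, the same conversion sketched in Python, assuming you already have the raw predictions and the ±1 labels in hand; the two functions below are equivalent ways of computing the logistic loss:
import math

def logistic_loss(label, prediction):
    # label in {-1, +1}, prediction is the raw (unsquashed) score
    return math.log1p(math.exp(-label * prediction))

def logistic_loss_from_prob(label, prediction):
    # same quantity via the squashed probability, as in the C-style snippet above
    p = 1.0 / (1.0 + math.exp(-prediction))
    p = min(max(p, 1e-15), 1.0 - 1e-15)
    y = 0.0 if label < 0 else 1.0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))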
