Using the RMSE function with 5-fold cross-validation to choose the best model out of 3

I have defined three different models obtained from the dataset diabetes in the library lars. The first model (M1) is the one that minimizes the BIC value over all possible regression models obtained by combining the explanatory variables (there are p=10 of them, so 2^10 possible models). The other two are obtained through glmnet and are Lasso regressions with lambda.min (M2) and lambda.1se (M3) respectively, where lambda.min and lambda.1se are obtained through cv.glmnet. Now I should perform 5-fold cross-validation using the RMSE (Root Mean Square Error) to check which of the three models M1, M2 and M3 has the best predictive performance. In order to compute the errors of the models obtained from the Lasso, I have to use the ordinary least squares estimates.
This is my code so far:
library(lars)
library(glmnet)
data(diabetes)
y<-diabetes$y
x<-diabetes$x
x2<-diabetes$x2
X = as.data.frame(cbind(x))
Y = as.data.frame(y)
p=10
n=442
best_score = Inf
M1 = NA
# exhaustive search: fit every non-empty subset of the p predictors and keep
# the model with the smallest BIC
for (i in 1:(2^p - 1)) {
  vars = which(as.integer(intToBits(i)) == 1)
  model = lm(y ~ ., data = subset(X, select = vars))
  if (BIC(model) < best_score) {
    M1 = model
    best_score = BIC(model)
  }
}
W<-as.matrix(X)
Y<-as.matrix(Y)
lasso<-glmnet(W, Y)
x11()
plot(lasso, label=T)
x11()
plot(lasso, xvar = 'lambda', label=T)
lasso$df
lasso$lambda
cvfit <- cv.glmnet(W, Y)                        # cross-validation over the lambda path
cvfit
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")
M2 <- glmnet(W, Y, lambda = cvfit$lambda.min)   # Lasso at lambda.min
M3 <- glmnet(W, Y, lambda = cvfit$lambda.1se)   # Lasso at lambda.1se
I really don't know where to go from here. Should I first split the original dataset into 5 folds and then recompute the models on the different training and test sets? How do I compute the final RMSE for each model? And what does it mean that I should use ordinary least squares estimates for the models obtained through Lasso?
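For reference, here is a minimal sketch of one way the 5-fold loop could be organised (purely illustrative; the fold assignment, re-estimating the selected variables inside each fold, and the post-Lasso OLS refit are assumptions about the intended procedure, not part of the code above):
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = n))      # random fold labels for the 442 observations
rmse  <- function(obs, pred) sqrt(mean((obs - pred)^2))
vars1 <- names(coef(M1))[-1]                   # predictors kept by the BIC search

errs <- matrix(NA, k, 3, dimnames = list(NULL, c("M1", "M2", "M3")))
for (f in 1:k) {
  tr <- folds != f
  te <- !tr

  # M1: re-estimate the BIC-selected variables on the training fold only
  m1 <- lm(y ~ ., data = data.frame(y = y[tr], X[tr, vars1, drop = FALSE]))
  errs[f, 1] <- rmse(y[te], predict(m1, newdata = X[te, vars1, drop = FALSE]))

  # M2 / M3: Lasso at lambda.min / lambda.1se, then OLS on the selected variables
  for (j in 2:3) {
    lam <- if (j == 2) cvfit$lambda.min else cvfit$lambda.1se
    sel <- which(as.vector(coef(glmnet(W[tr, ], y[tr], lambda = lam)))[-1] != 0)
    mj  <- lm(y ~ ., data = data.frame(y = y[tr], X[tr, sel, drop = FALSE]))   # post-Lasso OLS refit
    errs[f, j] <- rmse(y[te], predict(mj, newdata = X[te, sel, drop = FALSE]))
  }
}
colMeans(errs)   # average 5-fold RMSE of M1, M2 and M3
Note that in this sketch only the coefficients are re-estimated on each training fold, while the variable selection and the lambda values come from the full-data fits above; whether the selection step itself should also be repeated inside each fold is worth checking against the exercise statement.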

Related

How to speed up the solving of multiple optimization problems?

Currently, I'm writing a simulation that assesses the performance of a positioning algorithm by measuring the mean error of the position estimator at different points around the room. Unfortunately, the running times are pretty slow, so I am looking for ways to speed up my code.
The working principle of the position estimator is based on the MUSIC algorithm. The estimator gets an autocorrelation matrix (sized 12x12, with complex values in general) as an input and follows the next steps:
1. Find the 12 eigenvalues and eigenvectors of the autocorrelation matrix R.
2. Construct a new 12x11 matrix EN whose columns are the 11 eigenvectors corresponding to the 11 smallest eigenvalues.
3. Using the matrix EN, construct the function P = 1/(a' EN EN' a), where a is a 12x1 complex vector and a' is the Hermitian conjugate of a. The components of a are functions of three variables (named x, y and z), so the scalar P is also a function P(x,y,z).
4. Finally, find the values (x0,y0,z0) that maximize P and return them as the position estimate.
In my code, I choose some constant z and create a grid of points in the plane at height z, parallel to the xy plane. For each point I make n4Avg repetitions and calculate the error of the estimated position. At the end of the parfor loop (and some reshaping), I have an error matrix with dimensions (nx) x (ny) x (n4Avg), and the mean error is calculated by taking the mean of this matrix along the 3rd dimension.
nx=30 is the number of points along the x axis.
ny=15 is the number of points along the y axis.
n4Avg=100 is the number of repetitions used for calculating the mean error at each point.
nGen=100 is the number of generations in the GA algorithm (100 was tested to be good enough).
x = linspace(-20,20,nx);
y = linspace(0,20,ny);
z = 5;
[X,Y] = meshgrid(x,y);
parfor ri = 1:nx*ny
rT = [X(ri);Y(ri);z];
[ENs] = getEnNs(rT,stdv,n4R,n4Avg); % create n4Avg EN matrices
for rep = 1:n4Avg
pos_est = estPos_helper(squeeze(ENs(:,:,rep)),nGen);
posEstErr(ri,rep) = vecnorm(pos_est(:)-rT(:));
end
end
The matrices EN are generated by the following code
function [ENs] = getEnNs(rT,stdv,n4R,nEN)
% generate nEN simulated EN matrices, each using n4R simulated phases
f_c = 2402e6; % center frequency [Hz]
c0 = 299702547; % speed of light [m/s]
load antennaeArr1.mat antennaeArr1;
% generate initial phases.
phi0 = 2*pi*rand(n4R*nEN,1);
k0 = 2*pi.*(f_c)./c0;
I = cos(-k0.*vecnorm(antennaeArr1 - rT(:),2,1)-phi0);
Q = -sin(-k0.*vecnorm(antennaeArr1 - rT(:),2,1)-phi0);
phases = I+1i*Q;
phases = phases + stdv/sqrt(2)*(randn(size(phases)) + 1i*randn(size(phases)));
phases = reshape(phases',[12,n4R,nEN]);
Rxx = pagemtimes(phases,pagectranspose(phases));
ENs = zeros(12,11,nEN);
for i=1:nEN
[ENs(:,:,i),~] = eigs(squeeze(Rxx(:,:,i)),11,'smallestabs');
end
end
The position estimator uses a solver utilizing a 'genetic algorithm' (chosen because it performed the best of all the solvers tried).
function pos_est = estPos_helper(EN,nGen)
load antennaeArr1.mat antennaeArr1; % 3x12 constant matrix
antennae_array = antennaeArr1;
x0 = [0;10;5];
lb = [-20;0;0];
ub = [20;20;10];
function y = myfun(x)
k0 = 2*pi*2.402e9/299702547;
a = exp( -1i*k0*sqrt( (x(1)-antennae_array(1,:)').^2 + (x(2) - antennae_array(2,:)').^2 + (x(3)-antennae_array(3,:)').^2 ) );
y = 1/real((a')*(EN)*(EN')*a);
end
% Create optimization variables
x3 = optimvar("x",3,1,"LowerBound",lb,"UpperBound",ub);
% Set initial starting point for the solver
initialPoint2.x = x0;
% Create problem
problem = optimproblem("ObjectiveSense","Maximize");
% Define problem objective
problem.Objective = fcn2optimexpr(@myfun,x3);
% Set nondefault solver options
options2 = optimoptions("ga","Display","off","HybridFcn","fmincon",...
"MaxGenerations",nGen);
% Solve problem
solution = solve(problem,initialPoint2,"Solver","ga","Options",options2);
% Clear variables
clearvars x3 initialPoint2 options2
pos_est = solution.x;
end
The current runtime of the code, when setting the parameters as shown above, is around 700-800 seconds. This is a problem as I would like to increase the number of points in the grid and the number of repetitions to get a more accurate result.
The main ways I've tried to tackle this are using parallel computing (in the form of the parfor loop) and reducing the nested loops I had (one for x and one for y) into a single vectorized loop going over all the points in the grid.
This indeed helped, but not quite enough.
I apologize for the messy code.
Michael.

How to do an inverse orderNorm transformation (bestNormalize package) from a GAMLSS object?

My y variable (n=30,000) is distributed with very heavy tails (both positive and negative), for which the fitDist GAMLSS function selects the ST4 family.
I tried to fit a GAMLSS-based regression with an explanatory variable x (pb smoothing), but the tails of y are so heavy that convergence is not reached after 50 cycles, even after refitting (very time consuming).
Therefore, I normalized y using the orderNorm transformation (bestNormalize package), which allowed convergence to be reached easily and quickly, and then predicted the fitted values from the GAMLSS object.
However, these fitted "orderNormalized" values are a GAMLSS object, and thus cannot be inverted using the predict function from bestNormalize (since the latter does not seem to recognize a GAMLSS object).
My question: is it possible, whatever the means, to apply an inverse orderNorm transformation to fitted values from a GAMLSS object?
It is easy to get confused about what to call the predict function on, so I list the steps here without code (as there is no example in the question):
1) transposeObj = orderNorm(data$outputvariable)
2) fitObj = gamlss(transposeObj$x.t ~ ., data)
3) pred = predict(fitObj, type = 'response')
4) inversedpredictions = predict(transposeObj, newdata = pred, inverse = TRUE)
In plain text, you normalize your data, fit a model, make predictions with the fit, and then predict on the predictions with the normalization object obtained from orderNorm.
The vignette for bestNormalize has a similar example, using lm instead of GAMLSS. See the application section of the vignette. Once you have run the normalization procedure, you should be able to repeat and invert the transformation with the predict function.
The key is storing a transformation object as an R object that can then be fed into the predict (or rather, the predict.bestNormalize) function.
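A minimal sketch of those four steps in code (illustrative only; the column name outputvariable, the data frame data with a predictor x, and the pb() smoother are assumptions taken from the description above):
library(bestNormalize)
library(gamlss)

# 1) normalize the response and keep the transformation object for later inversion
transposeObj <- orderNorm(data$outputvariable)

# 2) fit the GAMLSS model on the transformed response
data$y.t <- transposeObj$x.t
fitObj <- gamlss(y.t ~ pb(x), data = data)

# 3) fitted values on the orderNorm scale
pred <- predict(fitObj, type = "response")

# 4) invert the transformation to get fitted values on the original scale
inversedpredictions <- predict(transposeObj, newdata = pred, inverse = TRUE)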

prediction using mixed logit model in R

I am interested in using a mixed logit model for prediction. Is there a function/package in R that can do it for me? If not, how is it approached mathematically? I understand that the coefficients are random; therefore, a very naive approach is to draw from the distribution of the betas and take the mean for out-of-sample Xs.
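For the naive simulation idea described above, a rough R sketch (illustrative; beta_hat, Sigma_hat and X_new are hypothetical objects standing for the estimated mean and covariance of the random coefficients and a new design matrix with one row per alternative):
library(MASS)                                # for mvrnorm

set.seed(1)
R <- 1000                                    # number of coefficient draws
betas <- mvrnorm(R, mu = beta_hat, Sigma = Sigma_hat)   # draws from the coefficient distribution

eta   <- X_new %*% t(betas)                  # linear predictors, one column per draw
probs <- apply(eta, 2, function(e) exp(e) / sum(exp(e)))   # logit probabilities per draw
p_hat <- rowMeans(probs)                     # simulated (averaged) choice probabilities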
This is probably what you're looking for:
mlogit is a package for R which enables the estimation of random utility models with individual and/or alternative specific variables. The main extensions of the basic multinomial model (heteroscedastic, nested and random parameter models) are implemented.
Some useful references:
Kenneth Train's exercises for mlogit
Estimation of Random Utility Models in R by Yves Croissant
You can use the predict function directly:
library("mlogit")
data("Mode", package="mlogit")
Mo <- mlogit.data(Mode, choice = 'choice', shape = 'wide',
varying = c(2:9))
# starting values of the beta coefficients
strt <- c(1.83086600, -1.28168186, 0.30935104, -0.41344010, -0.04665517,
1,0.25997237,0.73648694, 1.30789474, -0.79818416, 0.43013035)
p1 <- mlogit(choice ~ cost + time, Mo, seed = 20,
R = 100, probit = TRUE, start = strt)
Mo2 <- Mo
Mo2[Mo2$alt == 'car', 'cost'] <- Mo2[Mo2$alt == 'car', 'cost'] * 2   # double the cost of the car alternative
newShares <- apply(predict(p1, newdata = Mo2), 2, mean)              # predicted shares under the new costs
Hope this answers your question.

matlab curve fitting: restrictions on parameters

I have 5 non-parametric models, all with 5 to 8 parameters. These models are used to fit longitudinal data y(t), with t being time. Every data file is fitted by all 5 models for comparison. The models themselves cannot be altered.
For fitting, starting values are used, and these are passed to lsqcurvefit using a Levenberg-Marquardt algorithm. So I've written a script for several models and one function for curve fitting.
If I perform the curve fitting, a lot of the fitted parameters wander off to extreme values. This is what I want to avoid, since these parameters should stay in the proximity of their starting values and should only change within a well-defined range, or so that only curve fits within a standard deviation are included. Important to note here is that these restrictions should be imposed during the curve fitting (the iterative numerical technique) and not afterwards.
The function I've written to fit the models to height:
% Fit a specific model for all valid persons
try
opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
[personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
catch
x=1;
end
The function I've written for one of my models
elseif strcmpi(model,'jpss')
% y = h_1(1-(1/(1+((t+0.75)^c_1/d_1)+((t+0.75)^c_2/d_2)+((t+0.75)^c_3/d_3)))
% heightModel = @(params,ages) params(1).*(1-1./(1+((ages+0.75).^params(2))./params(3) + ((ages+0.75).^params(4))./params(5) + ((ages+0.75).^params(6))./params(7)));
heightModel = @(params,ages) params(1).*(1-1./(1+(((ages+0.75)./params(3)).^params(2)) + (((ages+0.75)./params(5)).^params(4)) + ((ages+0.75)./params(7)).^params(6))); % Adapted 25/07
modelStrings = {'h1','c1','d1','c2','d2','c3','d3'};
% Define initial values
if strcmpi('male',gender)
initialValues = [174.8 0.6109 2.9743 3.614 9.88 22.393 13.59];
else
initialValues = [162.7 0.6546 2.43 4.011 8.579 18.394 11.846];
end
What I would like to do:
Is it possible to place restrictions on every starting value in initialValues? Putting restrictions directly into lsqcurvefit wouldn't be a good idea, I think, since there are different models with different starting values and different allowed ranges.
I had 2 things in mind:
1. Using a range for each parameter, placed around the initial values:
initialValues = [162.7 0.6546 2.43 4.011 8.579 18.394 11.846]
e.g. range a1 = [150,180]; range a2 = [0.3,0.8], and so on
2. Placing lb and ub restrictions separately on all my initial values inside lsqcurvefit:
if Heightmodel = 'name model'
ub = initial value * 1.2 and lb = initial value * 0.8
Can someone give me some hints or pointers, because I can't make it work?
Thanks in advance,
Lucy
Could somebody help me out?
You state: there are different models with different starting values and different ranges that are allowed. This is where you can use ub and lb. How to do this is outlined in the lsqcurvefit documentation:
X=LSQCURVEFIT(FUN,X0,XDATA,YDATA,LB,UB) defines a set of lower and
upper bounds on the design variables, X, so that the solution is in the
range LB <= X <= UB. Use empty matrices for LB and UB if no bounds
exist. Set LB(i) = -Inf if X(i) is unbounded below; set UB(i) = Inf if
X(i) is unbounded above.
For instance in the following example the parameters are constrained within limits during the fit. The lower bound (lb) and upper bound (ub) are set to 20% below and above the starting values, respectively.
heightModel = @(params,ages) abs(params(1).*(1-1./(1+(params(2).* (ages+params(8) )).^params(5) +(params(3).* (ages+params(8) )).^params(6) +(params(4) .*(ages+params(8) )).^params(7) )));
initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
lb = 0.8*initialValues; % <-- lower bound is 20% smaller than initial par values
ub = 1.2*initialValues;
[parsout,resnorm,residual] = lsqcurvefit(heightModel,initialValues,t,ht,lb,ub);

Converting a Uniform Distribution to a Normal Distribution

How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution? What if I want a mean and standard deviation of my choosing?
There are plenty of methods:
Do not use Box-Muller, especially if you draw many Gaussian numbers. Box-Muller yields a result which is clamped between -6 and 6 (assuming double precision; things worsen with floats), and it is really less efficient than other available methods.
Ziggurat is fine, but needs a table lookup (and some platform-specific tweaking due to cache size issues).
Ratio-of-uniforms is my favorite: only a few additions/multiplications, and a log only about 1/50th of the time (e.g. look there).
Inverting the CDF is efficient (and overlooked, why?); you can find fast implementations of it if you search Google. It is mandatory for quasi-random numbers.
The Ziggurat algorithm is pretty efficient for this, although the Box-Muller transform is easier to implement from scratch (and not crazy slow).
Changing the distribution of any function to another involves using the inverse of the function you want.
In other words, if you aim for a specific probability density function p(x), you get the cumulative distribution by integrating over it, d(x) = integral(p(x)), and then use its inverse, Inv(d(x)). Now draw from the uniform random generator and pass the result through the function Inv(d(x)). You should get random values distributed according to the function you chose.
This is the generic math approach: you can use it for any probability or distribution function, as long as it has an inverse or a good inverse approximation.
Hope this helped, and thanks for the small remark about using the distribution and not the probability itself.
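A small R sketch of this inverse-CDF idea (illustrative; qnorm plays the role of Inv(d(x)) for the normal case):
set.seed(42)
u <- runif(10000)               # uniform draws on (0,1)

# normal target: pass the uniforms through the inverse CDF (quantile function)
x_norm <- qnorm(u, mean = 5, sd = 2)

# exponential target (rate 3): the inverse CDF is -log(1 - u)/rate
x_exp <- -log(1 - u) / 3

c(mean(x_norm), sd(x_norm))     # should be close to 5 and 2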
Here is a javascript implementation using the polar form of the Box-Muller transformation.
/*
* Returns member of set with a given mean and standard deviation
* mean: mean
* standard deviation: std_dev
*/
function createMemberInNormalDistribution(mean,std_dev){
return mean + (gaussRandom()*std_dev);
}
/*
* Returns random number in normal distribution centering on 0.
* ~95% of numbers returned should fall between -2 and 2
* ie within two standard deviations
*/
function gaussRandom() {
var u = 2*Math.random()-1;
var v = 2*Math.random()-1;
var r = u*u + v*v;
/*if outside interval [0,1] start over*/
if(r == 0 || r >= 1) return gaussRandom();
var c = Math.sqrt(-2*Math.log(r)/r);
return u*c;
/* todo: optimize this algorithm by caching (v*c)
* and returning next time gaussRandom() is called.
* left out for simplicity */
}
Where R1, R2 are random uniform numbers:
NORMAL DISTRIBUTION, with SD of 1:
sqrt(-2*log(R1))*cos(2*pi*R2)
This is exact... no need to do all those slow loops!
Reference: dspguide.com/ch2/6.htm
Use the central limit theorem (Wikipedia entry, MathWorld entry) to your advantage.
Generate n uniformly distributed numbers, sum them, and subtract n*0.5; you have the output of an approximately normal distribution with mean 0 and variance n/12 (the variance of a single uniform on [0,1] is 1/12; see the Wikipedia page on the uniform distribution).
n=10 gives you something half decent fast. If you want something more than half decent, go for Tyler's solution (as noted in the Wikipedia entry on normal distributions).
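A quick R sketch of this sum-of-uniforms trick (with n = 12 the extra scaling disappears, since the variance of the sum is then exactly 1):
# approximate standard normal: sum n uniforms, centre, and rescale to unit variance
rnorm_clt <- function(m, n = 12) {
  replicate(m, sum(runif(n)) - n / 2) / sqrt(n / 12)
}

z <- rnorm_clt(10000)
c(mean(z), var(z))      # should be close to 0 and 1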
I would use Box-Muller. Two things about this:
You end up with two values per iteration
Typically, you cache one value and return the other. On the next call for a sample, you return the cached value.
Box-Muller gives a Z-score
You have to then scale the Z-score by the standard deviation and add the mean to get the full value in the normal distribution.
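A small R sketch of that caching pattern (illustrative; a closure stands in for the static variable you would use in other languages):
make_gauss <- function() {
  cached <- NULL
  function(mean = 0, sd = 1) {
    if (!is.null(cached)) {              # return the value cached on the previous call
      z <- cached; cached <<- NULL
    } else {                             # Box-Muller: two uniforms give two independent Z-scores
      u1 <- runif(1); u2 <- runif(1)
      r  <- sqrt(-2 * log(u1))
      z  <- r * cos(2 * pi * u2)
      cached <<- r * sin(2 * pi * u2)
    }
    mean + sd * z                        # scale the Z-score to the requested mean and sd
  }
}

gauss <- make_gauss()
x <- replicate(1000, gauss(mean = 10, sd = 3))
c(mean(x), sd(x))                        # roughly 10 and 3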
It seems incredible that I could add something to this after eight years, but for the case of Java I would like to point readers to the Random.nextGaussian() method, which generates a Gaussian distribution with mean 0.0 and standard deviation 1.0 for you.
A simple addition and/or multiplication will change the mean and standard deviation to your needs.
The standard Python library module random has what you want:
normalvariate(mu, sigma)
Normal distribution. mu is the mean, and sigma is the standard deviation.
For the algorithm itself, take a look at the function in random.py in the Python library.
The manual entry is here
This is a Matlab implementation using the polar form of the Box-Muller transformation:
Function randn_box_muller.m:
function [values] = randn_box_muller(n, mean, std_dev)
if nargin == 1
mean = 0;
std_dev = 1;
end
r = gaussRandomN(n);
values = r.*std_dev + mean;   % scale by the requested std. dev. and shift by the mean
end
function [values] = gaussRandomN(n)
[u, v, r] = gaussRandomNValid(n);
c = sqrt(-2*log(r)./r);
values = u.*c;
end
function [u, v, r] = gaussRandomNValid(n)
r = zeros(n, 1);
u = zeros(n, 1);
v = zeros(n, 1);
filter = r==0 | r>=1;
% if outside interval [0,1] start over
while n ~= 0
u(filter) = 2*rand(n, 1)-1;
v(filter) = 2*rand(n, 1)-1;
r(filter) = u(filter).*u(filter) + v(filter).*v(filter);
filter = r==0 | r>=1;
n = size(r(filter),1);
end
end
Invoking histfit(randn_box_muller(10000000),100) produces the expected bell-shaped histogram (result figure omitted).
Obviously it is really inefficient compared with the Matlab built-in randn.
This is my JavaScript implementation of Algorithm P (Polar method for normal deviates) from Section 3.4.1 of Donald Knuth's book The Art of Computer Programming:
function normal_random(mean,stddev)
{
var V1
var V2
var S
do{
var U1 = Math.random() // return uniform distributed in [0,1[
var U2 = Math.random()
V1 = 2*U1-1
V2 = 2*U2-1
S = V1*V1+V2*V2
}while(S >= 1)
if(S===0) return 0
return mean+stddev*(V1*Math.sqrt(-2*Math.log(S)/S))
}
I think you should try this in Excel: =norminv(rand();0;1). This will produce random numbers which are normally distributed with zero mean and unit variance. The "0" can be replaced with any value, so the numbers will have the desired mean, and by changing the "1" you get a variance equal to the square of your input.
For example: =norminv(rand();50;3) will yield normally distributed numbers with MEAN = 50 and VARIANCE = 9.
Q: How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution?
For a software implementation I know a couple of generators which give you a pseudo-uniform random sequence in [0,1] (Mersenne Twister, linear congruential generator). Let's call it U(x).
There is a mathematical area called probability theory.
First thing: if you want to model a random variable with cumulative distribution F, then you can try to just evaluate F^-1(U(x)). In probability theory it is proved that such a random variable will have the cumulative distribution F.
This step is applicable for generating a random variable ~ F without any numerical methods whenever F^-1 can be derived analytically without problems (e.g. the exponential distribution).
To model the normal distribution you can calculate y1*cos(y2), where y1 has the Rayleigh distribution and y2 is uniform in [0, 2*pi].
Q: What if I want a mean and standard deviation of my choosing?
You can calculate sigma*N(0,1) + m.
It can be shown that such shifting and scaling leads to N(m, sigma).
I have the following code, which may help:
set.seed(123)
n <- 1000
u <- runif(n)                                 # creates U
x <- -log(u)                                  # x ~ Exp(1), used as the proposal
y <- runif(n, max = u*sqrt((2*exp(1))/pi))    # create Y
# rejection step: proposals falling under the normal density are kept,
# with the two halves of the acceptance region assigning the sign
z <- ifelse(y < dnorm(x)/2, -x, NA)
z <- ifelse((y > dnorm(x)/2) & (y < dnorm(x)), x, z)
z <- z[!is.na(z)]                             # accepted values are approximately standard normal
It is also easier to use the built-in function rnorm(), since it is faster than writing a random number generator for the normal distribution yourself. See the following timing as proof:
n <- length(z)
t0 <- Sys.time()
z <- rnorm(n)
t1 <- Sys.time()
t1-t0
// generic rejection sampling: propose x uniformly over the domain and keep it
// only if a second uniform draw falls under the distribution function at x
function distRandom(){
  do{
    x = random(DISTRIBUTION_DOMAIN);
  }while(random(DISTRIBUTION_RANGE) >= distributionFunction(x));
  return x;
}
