How to impute per resampling fold rather than embedding an imputation pipeline with a learner, especially for nested cross-validation?

I want to first do imputation within each CV fold, then train the learner with an AutoTuner, and test it on the corresponding test sets.
I can see that once the resampling scheme is fixed, the imputation is fixed, so that only (inner folds) * (outer folds) imputations are needed. However, because in mlr3 the imputation is combined with the learner via a pipeline, the number of imputations will be (inner folds) * (outer folds) * (autotuning evaluations).
Is there any way to tie the imputation to the resampling instead of to the learner?

No, that is not possible. You are right that it is unnecessary to impute the missing values again for each hyperparameter configuration; unfortunately, mlr3 cannot cache the imputed data sets.
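For what it's worth, the fold-wise scheme from the question can be written out by hand outside mlr3. Here is a rough Python/sklearn sketch (the imputer, learner and parameter grid are placeholders, not mlr3 API), which caches one imputation per inner fold so that every tuning evaluation reuses it:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

param_grid = [0.1, 1.0, 10.0]  # hypothetical values of C to tune

def nested_cv(X, y):
    for otr, ote in KFold(5, shuffle=True, random_state=0).split(X):
        # impute each inner fold exactly once and cache the result, giving
        # (inner folds) * (outer folds) imputations in total
        cache = []
        for itr, ival in KFold(3, shuffle=True, random_state=0).split(X[otr]):
            imp = SimpleImputer().fit(X[otr][itr])
            cache.append((imp.transform(X[otr][itr]), y[otr][itr],
                          imp.transform(X[otr][ival]), y[otr][ival]))
        # every hyperparameter evaluation reuses the cached imputed folds
        scores = [np.mean([LogisticRegression(C=c, max_iter=1000)
                           .fit(Xtr, ytr).score(Xva, yva)
                           for Xtr, ytr, Xva, yva in cache])
                  for c in param_grid]
        best_c = param_grid[int(np.argmax(scores))]
        # final fit for this outer fold: impute once on the outer training set
        imp = SimpleImputer().fit(X[otr])
        model = LogisticRegression(C=best_c, max_iter=1000).fit(
            imp.transform(X[otr]), y[otr])
        print(best_c, model.score(imp.transform(X[ote]), y[ote]))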

Related

KFold cross-validation: shuffle=True vs shuffle=False

Should I set shuffle=True in sklearn.model_selection.KFold?
I'm in this situation where I'm trying to evaluate the cross_val_score of my model on a given dataset.
if I write
cross_val_score(estimator=model, X=X, y=y, cv=KFold(shuffle=False), scoring='r2')
I get back:
array([0.39577543, 0.38461982, 0.15859382, 0.3412703 , 0.47607428])
Instead, by setting
cross_val_score(estimator=model, X=X, y=y, cv=KFold(shuffle=True), scoring='r2')
I obtain:
array([0.49701477, 0.53682238, 0.56207702, 0.56805794, 0.61073587])
So, in light of this, I want to understand whether setting shuffle=True in KFold may lead to obtaining over-optimistic cross-validation scores.
Reading the documentation, it says that shuffling happens only once, at the beginning, before the data is split into K folds; training then proceeds on K-1 folds and testing on the one left out, repeating for each fold without re-shuffling. So, according to this, one shouldn't worry too much. Of course, if the shuffle occurred at each iteration of training during cross-validation, one would end up estimating generalization error on points that had previously been used for training, which would be a bad mistake, but is this the case?
How can I interpret the fact that in this case I get slightly better values when shuffle is True?
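For what it's worth, a quick check (not from the question) confirms the documented behaviour: shuffle=True permutes the rows once before splitting, and the folds remain disjoint, so there is no train/test leakage:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    assert len(np.intersect1d(train, test)) == 0  # folds never overlap

Any score difference therefore comes from how the rows are distributed across folds (shuffling breaks up any ordering or grouping in the data), not from evaluating on previously seen points.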

How to minimize a cost function with Matlab when the input variable is a large image: increase speed and prevent memory crashes

I am trying to implement a differential phase integration method described in this paper:
Thüring, Thomas, et al. "Non-linear regularized phase retrieval for unidirectional X-ray differential phase contrast radiography." Optics express 19.25 (2011): 25545-25558.
Basically, it's a way to integrate a differential image across the columns only, while imposing some constraints on continuity across the rows to prevent stripe noise.
From a mathematical point of view, I want to minimize the following cost:
min over A of ||Dx A - phi||^2 + lambda * ||Dy A||^2
where ||.|| is the L2 norm, Dx is the derivative along the columns, Dy is the derivative across the rows, A is the unknown integrated matrix, lambda is a user-defined parameter and phi is the differential profile I measured. Note that for the Dy term the L1 norm can also be used.
I wrote the following code using fminunc as the Matlab solver:
% simulated differential profile (p is the input image)
pdiff = imresize(diff(padarray(p,[0,1],'replicate','post'),1,2),[128,128]);
noise = 0.02 * randn(size(pdiff));
pdiff_noise = pdiff + noise;
% normal integration
integratedProfile = cumsum(pdiff_noise,2);
options = optimoptions(@fminunc,'Display','iter-detailed','UseParallel',true,'MaxIterations',35);
% regularized integration
startingPoint = zeros(size(pdiff_noise));
fun = @(x) costFunction(pdiff_noise,x);
integratedProfile_optmized = fminunc(fun,startingPoint,options);

function difference = costFunction(ep,op)
L = 0.2;
% column derivative of the current estimate (the Dx term)
dep_o = diff(padarray(op,[0,1],'replicate','post'),1,2);
% row derivative of the current estimate (the Dy term)
dep_v = diff(padarray(op,[1,0],'replicate','post'),1,1);
difference = sum(sum((ep-dep_o).^2)) + L*sum(sum(dep_v.^2));
end
It works using a 128x128 differential image.
The problem arises as soon as I try to work with a larger image. In particular, with a 256x256 matrix each iteration takes forever, even using the parallel option, and the run consumes almost the entire RAM.
When I move to a 512x512 matrix I get this error:
Requested 262144x262144 (512.0GB) array exceeds maximum
array size preference.
Error in fminusub (line 165)
H = eye(sizes.nVar);
Error in fminunc (line 446)
[x,FVAL,GRAD,HESSIAN,EXITFLAG,OUTPUT] =
fminusub(funfcn,x, ...
Error in Untitled (line 13)
integratedProfile_optmized=fminunc(fun,startingPoint,options);
Unfortunately, my final goal is to process approximately 3000 images of 500x500 size.
I think I have understood that the crash is related to the size of the matrix and to the fact that each pixel is a variable: a 512x512 image gives 262144 unknowns, and fminunc's dense quasi-Newton Hessian is then a 262144x262144 array of doubles, i.e. 262144^2 * 8 bytes ≈ 512 GB, exactly the size in the error message, which cannot fit into memory.
However, I don't really know how to solve it while also speeding up the processing.
Do you have any suggestions on how to work with large images? Is there another solver that may work in a faster way? Any mathematical approach to making the problem easier?
Thanks!
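One direction worth sketching (not from the post): since the cost is quadratic in A, the minimization is a sparse linear least-squares problem, so it can be solved iteratively without ever forming a dense Hessian. A minimal Python/scipy sketch, with phi standing for pdiff_noise and lam for L:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def diff_op(k):
    # (k-1) x k sparse forward-difference matrix
    return sp.diags([-np.ones(k - 1), np.ones(k - 1)], [0, 1], shape=(k - 1, k))

def integrate(phi, lam=0.2):
    m, n = phi.shape
    Dx = sp.kron(sp.eye(m), diff_op(n))  # column derivative of vec(A)
    Dy = sp.kron(diff_op(m), sp.eye(n))  # row derivative of vec(A)
    # solve min ||Dx a - phi||^2 + lam * ||Dy a||^2 with LSQR
    M = sp.vstack([Dx, np.sqrt(lam) * Dy]).tocsr()
    b = np.concatenate([phi[:, :-1].ravel(), np.zeros(Dy.shape[0])])
    # A is recovered up to a global constant; lsqr picks the min-norm solution
    return lsqr(M, b, atol=1e-10, btol=1e-10)[0].reshape(m, n)

LSQR only needs matrix-vector products with the sparse operator, so memory grows linearly with the number of pixels rather than quadratically.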

Shouldn't H2O standardize categorical predictors for regularized GLM models (lasso, ridge, elastic net)?

"The lasso method requires initial standardization of the regressors,
so that the penalization scheme is fair to all regressors. For
categorical regressors, one codes the regressor with dummy variables
and then standardizes the dummy variables" (p. 394).
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model.
Statistics in medicine, 16(4), 385-395. http://statweb.stanford.edu/~tibs/lasso/fulltext.pdf
H2O:
Similar to package ‘glmnet,’ the h2o.glm function includes a ‘standardize’ parameter that is true by default. However, if predictors are stored as factors within the input H2OFrame, H2O does not appear to standardize the automatically encoded factor variables (i.e., the resultant dummy or one-hot vectors). I've confirmed this experimentally, but references to this decision also show up in the source code:
For instance, method denormalizeBeta (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L359) includes the comment "denormalize only the numeric coefs (categoricals are not normalized)." It also looks like means (variable _normSub) and standard deviations (inverse of variable _normMul) are only calculated for the numerical variables, and not the categorical variables, in the setTransform method (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L599).
GLMnet:
In contrast, package 'glmnet' seems to expect categorical variables to be dummy-coded prior to fitting a model, using a function like model.matrix. The dummy variables are then standardized along with the continuous variables. It seems like the only way to avoid this would be to pre-standardize the continuous predictors only, concatenate them with the dummy variables, and then run glmnet with standardize=FALSE.
Statistical Considerations:
For a dummy variable or one-hot vector with proportion p of TRUE values, the mean is p and the SD is sqrt(p(1-p)). The SD reaches its maximum of 0.5 when the proportions of TRUE and FALSE values are equal (p = 0.5; the sample SD s approaches 0.5 as n → ∞). Thus, if continuous predictors are standardized to have SD = 1, but dummy variables are left unstandardized, the continuous predictors will have at least twice the SD of the dummy predictors, and more than twice the SD for imbalanced dummy variables.
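A quick numeric illustration of that claim (not from the post):

import numpy as np

for p in (0.5, 0.2, 0.05, 0.01):   # proportion of TRUE values
    sd = np.sqrt(p * (1 - p))      # SD of a Bernoulli(p) dummy variable
    print(f"p = {p:<5} SD = {sd:.3f}  (a continuous SD=1 predictor is {1/sd:.1f}x larger)")
# p = 0.5 gives SD = 0.500 (ratio 2.0x); p = 0.01 gives SD = 0.099 (ratio ~10x)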
It seems like this could be a problem for regularization (LASSO, ridge, elastic net), because the scale/variance of predictors is expected to be equal so that the regularization penalty (λ) applies evenly across predictors. If two predictors A and B have the same standardized effect size, but A has a smaller SD than B, A will necessarily have a larger unstandardized coefficient than B. This means that, if left unstandardized, the regularization penalty will erroneously be more severe to A than B. In a regularized regression with a mixture of standardized continuous predictors and unstandardized categorical predictors, it seems like this could lead to systematic over-penalization of categorical predictors.
A commonly expressed concern is that standardizing dummy variables removes their normal interpretation. To avoid this issue, while still placing continuous and categorical predictors on an equal footing, Gelman (2008) suggested standardizing continuous predictors by dividing by 2 SD, rather than 1, resulting in standardized predictors with SD = 0.5. However, it seems like this would still be biased for class-imbalanced dummy variables, for which the SD might be substantially less than 0.5.
Gelman, A. (2008). Scaling regression inputs by dividing by two
standard deviations. Statistics in medicine, 27(15), 2865-2873.
http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf
Question:
Is H2O's approach of not standardizing one-hot vectors for regularized regression correct? Could this lead to a bias toward over-penalizing dummy variables / one-hot vectors? Or has Tibshirani (1997)'s recommendation since been revised for some reason?
Personally, I would rather keep the binary features untouched and apply MinMaxScaler to scale the numeric features to [0, 1] instead of standardizing them. This puts the numeric features on a standard deviation scale similar to that of the binaries.
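A minimal sklearn sketch of that suggestion (the column names are made up):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ["age", "income"]  # hypothetical numeric features
pre = ColumnTransformer(
    [("num", MinMaxScaler(), numeric_cols)],  # squash numerics into [0, 1]
    remainder="passthrough",                  # leave the 0/1 dummies untouched
)
# X_scaled = pre.fit_transform(df)  # df holds both numeric and binary columns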

Collapsed sampling/individual Metropolis-Hastings steps

My model has three parameters, say theta_1, theta_2 and nu.
I want to sample theta_1, theta_2 from the posterior with nu marginalized out (which can be done analytically), i.e. from p(theta_1, theta_2 | D) instead of p(theta_1, theta_2, nu | D) where D is the data. After that, I want to resample nu based on the new values of theta_1 and theta_2. So one sampling scan would consist of the steps
Draw theta_1 and theta_2 from p(theta_1, theta_2 | D) (with nu marginalized out)
Draw nu from p(nu | theta_1, theta_2, D) (conditional on the new theta values)
In other words, a collapsed Gibbs sampler.
How would I go about that with PyMC3? I reckon I should implement an individual step function, but I'm not sure how to construct the likelihood here. How do I get access to the model specification when implementing a step function in PyMC3?
The notions of step methods and likelihoods are somewhat conflated in the question, but I see what you are driving at. Step methods are typically independent of the likelihood, which is passed to the step method as an argument. For example, check out the slice sampler step method in PyMC 3. Likelihoods are stochastic objects that return log-likelihood values conditional on the values of their parents in the directed acyclic graph.
If you are doing Gibbs sampling, you are not typically concerned with evaluating likelihoods because you are iteratively sampling directly from the conditionals of the model parameters. We do not currently have Gibbs in PyMC 3, and there is some rudimentary Gibbs support in PyMC 2. It's a little troublesome to implement generally because it involves recognizing conjugate associations in the model. Moreover, in PyMC 3 you have access to gradient-based samplers (Hamiltonian), which are much more efficient than Gibbs, so there are a few reasons you may not want to implement Gibbs.
That said, PyMC offers a tremendous amount of flexibility for implementing custom step methods and likelihoods. As long as the step (astep) function returns a new point, you can pretty much do what you like otherwise. There's no guarantee that it will be a good sampler, of course.
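To make the two-step scan from the question concrete, here is a bare-bones sketch in plain Python, independent of PyMC3's step-method machinery (log_p_marginal and sample_nu_conditional are placeholders for the model-specific densities):

import numpy as np

rng = np.random.default_rng(0)

def collapsed_scan(theta, log_p_marginal, sample_nu_conditional, step=0.1):
    # 1) Metropolis-Hastings move on (theta_1, theta_2) under the marginal
    #    posterior p(theta_1, theta_2 | D), with nu integrated out
    proposal = theta + step * rng.standard_normal(theta.shape)
    if np.log(rng.random()) < log_p_marginal(proposal) - log_p_marginal(theta):
        theta = proposal
    # 2) refresh nu from its full conditional p(nu | theta_1, theta_2, D)
    return theta, sample_nu_conditional(theta)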

Ising 2D Optimization

I have implemented a MC-Simulation of the 2D Ising model in C99.
Compiling with gcc 4.8.2 on Scientific Linux 6.5.
When I scale up the grid the simulation time increases, as expected.
The implementation simply uses the Metropolis–Hastings algorithm.
I have tried to find a way to speed up the algorithm, but I don't have any good ideas.
Are there any tricks to do so?
As jimifiki wrote, try to do a profiling session.
In order to improve on the algorithmic side only, you could try the following:
Lookup Table:
When calculating the energy difference for the Metropolis criterion you need to evaluate the exponential exp[-K / T * dE], where K is your scaling constant (in units of Boltzmann's constant) and dE is the energy difference between the original state and the one after a spin-flip.
Calculating exponentials is expensive.
So you simply build a table beforehand in which to look up the possible values of dE. For a nearest-neighbour interaction each spin has four neighbours, so the neighbour sum can only take five values; exploiting this symmetry, you get five possible values for dE: 8, 4, 0, -4, -8. Instead of calling the exp function, use the precalculated table, as sketched below.
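A sketch of the lookup-table idea (in Python/numpy for brevity; the C99 version is analogous), assuming the coupling constant is 1 and beta = K/T:

import numpy as np

beta = 1.0 / 2.269  # near the critical temperature, for illustration
# flipping spin s with neighbour sum h costs dE = 2*s*h, one of {-8,-4,0,4,8};
# precompute exp(-beta * dE) once instead of calling exp() at every flip
accept = {dE: np.exp(-beta * dE) for dE in (-8, -4, 0, 4, 8)}

def metropolis_flip(spins, i, j, rng):
    m, n = spins.shape
    h = (spins[(i - 1) % m, j] + spins[(i + 1) % m, j] +
         spins[i, (j - 1) % n] + spins[i, (j + 1) % n])
    dE = 2 * spins[i, j] * h
    if dE <= 0 or rng.random() < accept[dE]:  # table lookup, no exp()
        spins[i, j] = -spins[i, j]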
Parallelization:
As mentioned before, it is possible to parallelize the algorithm. To preserve physical correctness, you have to use a so-called checkerboard scheme: consider the two-dimensional grid as a checkerboard and update all the white cells in parallel at once, then all the black ones. This is valid because the nearest-neighbour interaction only introduces dependencies between cells of different colours; a numpy sketch follows.
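A sketch of the checkerboard decomposition (vectorized with numpy rather than explicitly threaded, but the same splitting carries over to OpenMP or CUDA):

import numpy as np

def checkerboard_sweep(spins, beta, rng):
    m, n = spins.shape
    colour_of = np.indices((m, n)).sum(axis=0) % 2  # (i + j) mod 2
    for colour in (0, 1):  # update all white cells, then all black cells
        h = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0) +
             np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
        dE = 2 * spins * h
        flip = (colour_of == colour) & (
            (dE <= 0) | (rng.random((m, n)) < np.exp(-beta * dE)))
        spins[flip] *= -1  # same-colour cells share no neighbours: safe at once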
Use GPGPU:
You can also implement the simulation on a GPGPU, e.g. using CUDA, if you're already working on C99.
Some tips:
- Don't forget to align your C99 structs properly.
- Use linear arrays, not nested ones; contiguous, properly aligned memory is normally faster to access.
- Try to let the compiler do loop unrolling, etc. (gcc has specific options for this that are not enabled by default at -O2).
Some more information:
If you are looking for an efficient method to locate the critical point of the system, the method of choice is finite-size scaling: simulate at different system sizes and temperatures, then compute a quantity that is system-size independent at the critical point (such as the Binder cumulant), so that the curves for the different sizes intersect there (please see the theory for a detailed explanation).
I hope I was helpful.
Cheers...
It's normal that your simulation time scales at least with the square of the linear grid size, isn't it?
Here are some suggestions:
If you are concerned with thermalization issues, try to use parallel tempering. It can be of help.
The Metropolis-Hastings algorithm can be made parallel. You could try to do it.
Check that you are not pessimizing the code.
Are your spin arrays made of ints? You could pack many spins into the same int (multispin coding), but it's a lot of work.
Moreover, remember what Donald Knuth taught us:
premature optimisation is the root of all evil
Before optimising, you should first understand where your program is slow. This is called profiling.
