knn imputation in preProcess using caret's train function with cross-validation

Does anyone know if caret ignores the outcome in the preProcess function when it runs the knn imputation procedure during cross-validation? I assume the outcome should not be used for imputation if you want to assess internal validity on your hold-out data set?
test = preProcess(data, method = c("knnImpute"), knnSummary = median)

Related

jax minimization with stochastically estimated gradients

I'm trying to use the bfgs optimizer from tensorflow_probability.substrates.jax and from jax.scipy.optimize.minimize to minimize a function f which is estimated from pseudo-random samples and has a jax.random.PRNGKey as argument. To use this function with the jax/tfp bfgs minimizer, I wrap the function inside a lambda function
seed = 100
key = jax.random.PRNGKey(seed)
fun = lambda x: f(x, key)
result = jax.scipy.optimize.minimize(fun = fun, ...)
What is the best way to update the key when the minimization routine calls the function to be minimized so that I use different pseudo-random numbers in a reproducible way? Maybe a global key variable? If yes, is there an example I could follow?
Secondly, is there a way to make the optimization stop after a certain amount of time, as one could do with a callback in scipy? I could directly use the scipy implementation of bfgs/l-bfgs-b/etc. and use jax only for the estimation of the function and of its gradients, which seems to work. Is there a difference between the scipy, jax.scipy and tfp.jax bfgs implementations?
Finally, is there a way to print the values of the arguments of fun during the bfgs optimization in jax.scipy or tfp, given that f is jitted?
Thank you!
There is no way to do what you're asking with jax.scipy.optimize.minimize, because the minimizer does not offer any means to track changing state between function calls, and does not provide for any inbuilt stochasticity in the optimizer.
If you're interested in stochastic optimization in JAX, you might try stochastic optimization in JAXOpt, which provides a much more flexible set of optimization routines.
Regarding your last question, if you'd like to print values during the course of a jit-compiled optimization or other loop, you can use jax.debug.print.
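For illustration only, here is a minimal sketch of how a fresh but reproducible key per step and jax.debug.print can fit together. It assumes a hand-written gradient-descent loop instead of BFGS (it is not what jax.scipy.optimize.minimize or JAXopt do internally), and the objective f, the step size and the number of steps are hypothetical stand-ins.

```python
import jax
import jax.numpy as jnp


def f(x, key):
    # Hypothetical noisy objective: Monte Carlo estimate of E[(x - 3 + noise)^2].
    noise = 0.1 * jax.random.normal(key, (64,))
    return jnp.mean((x - 3.0 + noise) ** 2)


@jax.jit
def step(x, key, i):
    # fold_in derives a new key from the base key and the step index, so each
    # step sees different samples while the whole run stays reproducible.
    step_key = jax.random.fold_in(key, i)
    value, grad = jax.value_and_grad(f)(x, step_key)
    # jax.debug.print works inside jit-compiled code.
    jax.debug.print("step {i}: x = {x}, f = {v}", i=i, x=x, v=value)
    return x - 0.1 * grad  # plain gradient step (assumed step size 0.1)


def minimize(seed, n_steps=50):
    key = jax.random.PRNGKey(seed)
    x = jnp.array(0.0)
    for i in range(n_steps):
        x = step(x, key, i)
    return x


x_min = minimize(seed=100)
```

Because the per-step key depends only on the seed and the step index, calling minimize again with the same seed repeats the same sequence of samples.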

Do we need standardization in K-prototypes algorithm

I want to use the K-prototypes algorithm (a k-means-style clustering algorithm for mixed data: numerical and categorical) for a clustering problem.
The algorithm handles the categorical values without numerical encoding, so I don't need to encode them to numerical values.
My question is : do we need to standardize the numerical columns before applying k-prototypes?
For example, I have the following columns: age(float), salary(float), gender(object), city(object), profession(object).
Do I need to apply standardization like this?
from sklearn.preprocessing import StandardScaler
scaled_X = StandardScaler().fit_transform(X[['salary', 'age']])
X[['salary', 'age']] = scaled_X
But I think that standardization has no value if it is not applied to all columns, because its goal is to put all variables on the same scale, not just some of them!
So in this case, we do not need to apply it!
I hope I explained my question well, Thank you.
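To make the question concrete, here is a small sketch of the mixed dissimilarity as k-prototypes is usually described: squared Euclidean distance on the numerical columns plus gamma times the number of categorical mismatches. The two example rows, the value of gamma and the means and standard deviations used for scaling are all made up for illustration.

```python
NUMERIC = ["age", "salary"]
CATEGORICAL = ["gender", "city", "profession"]


def kprototypes_distance(a, b, gamma=1.0):
    # Numerical part: squared Euclidean distance over the numeric columns.
    numeric_part = sum((a[k] - b[k]) ** 2 for k in NUMERIC)
    # Categorical part: count of mismatching categories (simple matching).
    categorical_part = sum(a[k] != b[k] for k in CATEGORICAL)
    return numeric_part + gamma * categorical_part


def standardize(row, stats):
    # Scale only the numeric columns; categorical values are left untouched.
    scaled = dict(row)
    for k, (mean, std) in stats.items():
        scaled[k] = (row[k] - mean) / std
    return scaled


# Hypothetical rows and hypothetical column statistics (mean, std).
p = {"age": 30.0, "salary": 50000.0, "gender": "F", "city": "Paris", "profession": "nurse"}
q = {"age": 60.0, "salary": 50100.0, "gender": "M", "city": "Lyon", "profession": "pilot"}
stats = {"age": (45.0, 10.0), "salary": (50000.0, 10000.0)}

raw = kprototypes_distance(p, q)
scaled = kprototypes_distance(standardize(p, stats), standardize(q, stats))
print(raw, scaled)
```

On the raw values, the salary difference of 100 contributes 10000 to the distance, far more than the 30-year age gap (900) or the three categorical mismatches (3). After scaling only the numerical columns, the numerical part is about 9 and is comparable to the categorical part, which suggests that scaling the numerical columns alone can still change the clustering.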

Is it possible to export additional variables from within an ODE45 function?

I have an equation of motion function file which I feed into ode45. Necessarily, the output variable of the function file is ydot.
Within my equation of motion function file, I calculate many objects from the state vector, y, to prescribe forces.
After ode45 is finished, I would like access to these objects at every time step so that I can calculate an energy.
Instead of recalculating them over every time step, it would be faster to just pull them from the Runge-Kutta process when they are calculated as intermediate steps anyway.
Is it possible to do this?
There is no guarantee that the ODE function for the right side is even called at the output points as they are usually interpolated from the points computed by the adaptive step size algorithm.
One trick I have often seen but would need to search for references is to have the function return all the values you will need and cut the return list down to the derivative in the ODE45 call. Modulo appropriate syntax
function [ydot, extra] = odefunc(t,y,params)
and then use
sol = ode45(@(t,y) odefunc(t,y,params)(1),...)
and then run odefunc on the points in sol to extract the extra information.
Perhaps that idea of selecting the output only works in python. Then define an explicit wrapper
function ydot = odewrapper(t,y)
[ydot,~] = odefunc(t,y,params);
end
that you then normally call in ode45.
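The same pattern can be sketched in Python, where selecting the first return value inline does work. This is only an illustration under assumptions: the right-hand side is a hypothetical harmonic oscillator, and the integrator is a hand-written fixed-step RK4 standing in for ode45, not an adaptive solver.

```python
def rhs_with_extras(t, y, k):
    # Hypothetical equation of motion: y = [position, velocity].
    pos, vel = y
    force = -k * pos
    ydot = [vel, force]
    # Intermediate quantities we want to have after the integration.
    extras = {"force": force, "energy": 0.5 * vel ** 2 + 0.5 * k * pos ** 2}
    return ydot, extras


def rk4(f, y0, t0, t1, n_steps):
    # Classical fixed-step Runge-Kutta; f must return only the derivative.
    h = (t1 - t0) / n_steps
    t, y = t0, list(y0)
    ts, ys = [t], [y]
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
        ts.append(t)
        ys.append(y)
    return ts, ys


k = 4.0
# The wrapper keeps only the derivative for the solver.
ts, ys = rk4(lambda t, y: rhs_with_extras(t, y, k)[0], [1.0, 0.0], 0.0, 1.0, 100)
# Afterwards, re-run the function on the stored solution points for the extras.
extras = [rhs_with_extras(t, y, k)[1] for t, y in zip(ts, ys)]
energies = [e["energy"] for e in extras]
```

The extras are recomputed once per stored point, which costs one additional function evaluation per output point rather than a second integration.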

Matlab - Genetic algorithm for mixed integer optimization

The problem that I am trying to solve is based on the following code:
https://www.mathworks.com/help/gads/examples/solving-a-mixed-integer-engineering-design-problem-using-the-genetic-algorithm.html
My function has a lot more variables but basically it is the same. I have a set of variables that needs to be optimized under given constraints. Some of the variables have to be discrete. However, since they can only take the values 0 and 1, I don't have to specify the allowed values as is shown in the example. (I have tried both methods, though.)
First I create the lower and upper bounds, each of which is a vector of size 1x193.
[lb,ub] = GWO_LUBGA(n_var,n_comp,C,n_comp);
Afterwards I call up the constraints. As I have discrete values, I cannot use equality constraints. Therefore I am using the workaround that was proposed here:
http://www.mathworks.com/help/gads/mixed-integer-optimization.html
ObjCon = @(x) funconGA(x,C,ub,n_comp);
Same for the objective function:
ObjFcn = @(x) CostFcnGA(x,C);
Afterwards I pass it over to the genetic algorithm:
[Pos,Best,~,GWO_cg_curve] = ga(ObjFcn,n_var,[],[],[],[],lb,ub,ObjCon,C.T*6+2:C.T*8+1,opts);
with n_var = 193 and C.T=24
When I try to compile I receive the following error:
Error using ga (line 366)
Dimensions of matrices being concatenated are not consistent.
Line 366 contains the following code. Unfortunately gaminlp cannot be opened.
% Call appropriate single objective optimization solver
if ~isempty(intcon)
[x,fval,exitFlag,output,population,scores] = gaminlp(FitnessFcn,nvars, ...
Aineq,bineq,Aeq,beq,lb,ub,NonconFcn,intcon,options,output,Iterate);
Both anonymous functions work when random values are entered. What could be the reason for this error?
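One thing that can be checked by hand is the index range passed as the integer-constraint argument. The sketch below redoes that arithmetic in Python purely for illustration; it uses only the values of n_var and C.T given above and does not call ga. It does not explain the concatenation error by itself, but it rules out indices outside 1..n_var.

```python
n_var = 193  # number of variables, from the question
T = 24       # C.T, from the question

first = T * 6 + 2
last = T * 8 + 1
# MATLAB's first:last includes both end points.
intcon = list(range(first, last + 1))

indices_in_range = all(1 <= i <= n_var for i in intcon)
print(first, last, len(intcon), indices_in_range)  # prints: 146 193 48 True
```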

Vectorization of matlab code

I'm kind of new to vectorization. I have tried it myself but couldn't get it to work. Can somebody help me vectorize this code and give a short explanation of how you do it, so that I can adopt the thinking process too? Thanks.
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
%This function calculates whether a point is allowed.
%First a quick test is done by calculating the distance from point to
%each point of the polygon. If that distance is smaller than range "r",
%the point is not allowed. This will slow down the algorithm at some
%points, but will greatly speed it up in others because less calls to the
%circleTest routine are needed.
polySize=size(Polygon,1);
testCounter=0;
for i=1:polySize
d = sqrt(sum((Polygon(i,:)-point).^2));
if d < tol*r
testCounter=1;
break
end
end
if testCounter == 0
circleTestResult = circleTest (point,Polygon,r,tol,stepSize);
testCounter = circleTestResult;
end
result = testCounter;
Given the information that Polygon is 2 dimensional, point is a row vector and the other variables are scalars, here is the first version of your new function (scroll down to see that there are lots of ways to skin this cat):
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 1; % default: a vertex lies within tol*r, so the point is not allowed
linDiff = Polygon-repmat(point,size(Polygon,1),1);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if ~any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
The thought process for vectorization in Matlab involves trying to operate on as much data as possible using a single command. Most of the basic builtin Matlab functions operate very efficiently on multi-dimensional data. Using a for loop is the reverse of this, as you are breaking your data down into smaller segments for processing, each of which must be interpreted individually. By resorting to data decomposition using for loops, you potentially lose some of the massive performance benefits associated with the highly optimised code behind the Matlab builtin functions.
The first thing to think about in your example is the conditional break in your main loop. You cannot break from a vectorized process. Instead, calculate all possibilities, make an array of the outcome for each row of your data, then use the any function to check whether any row met the condition, and decide from that single result whether the circleTest function needs to be called.
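As a language-neutral illustration of that idea, here is a small Python sketch that compares an early-exit loop with the "compute everything, then reduce with any" style. It follows the loop in the question, where a vertex within tol*r gives 1 without calling circleTest. The circle_test argument and the sample polygon are hypothetical stand-ins, since the real circleTest is not shown.

```python
import math


def near_vertex_loop(point, polygon, limit):
    # Early-exit version, like the for loop with break in the question.
    for vx, vy in polygon:
        if math.hypot(vx - point[0], vy - point[1]) < limit:
            return True
    return False


def near_vertex_all(point, polygon, limit):
    # "Vectorized" style: compute every distance first, then reduce once.
    distances = [math.hypot(vx - point[0], vy - point[1]) for vx, vy in polygon]
    return any(d < limit for d in distances)


def new_hit_test(point, polygon, r, tol, circle_test):
    # Same outcome as the question's loop: 1 if a vertex is within tol*r,
    # otherwise whatever the (hypothetical) circle_test callable returns.
    if near_vertex_all(point, polygon, tol * r):
        return 1
    return circle_test(point, polygon)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
close_point = (0.05, 0.0)  # within 0.1 of the first vertex
far_point = (0.5, 0.5)     # about 0.707 from every vertex
```

Both helper functions return the same answer for any input; the second one simply does all the distance calculations up front, which is the trade-off described above.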
NOTE: It is not easy to efficiently conditionally break out of a calculation in Matlab. However, as you are just computing a form of Euclidean distance in the loop, you'll probably see a performance boost by using the vectorized version and calculating all possibilities. If the computation in your loop were more expensive, the input data were large, and you wanted to break out as soon as you hit a certain condition, then a matlab extension made with a compiled language could potentially be much faster than a vectorized version where you might be performing needless calculation. However this is assuming that you know how to program code that matches the performance of the Matlab builtins in a language that compiles to native code.
Back on topic ...
The first thing to do is to take the linear difference (linDiff in the code example) between Polygon and your row vector point. To do this in a vectorized manner, the dimensions of the 2 variables must be identical. One way to achieve this is to use repmat to replicate point once per row of Polygon, so that it has the same size as Polygon. However, bsxfun is usually a superior alternative to repmat (as described in this recent SO question), making the code ...
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 1; % default: a vertex lies within tol*r, so the point is not allowed
linDiff = bsxfun(@minus, Polygon, point);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if ~any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
I rolled your d value into a column of d by summing across the 2nd axis (note the removal of the array index from Polygon and the addition of ,2 in the sum command). I then went further and evaluated the logical array testLogicals inline with the calculation of the distance measure. You will quickly see that a downside of heavy vectorisation is that it can make the code less readable to those not familiar with Matlab, but the performance gains are worth it. Comments are pretty necessary.
Now, if you want to go completely crazy, you could argue that the test function is so simple now that it warrants use of an 'anonymous function' or 'lambda' rather than a complete function definition. The test for whether or not it is worth doing the circleTest does not require the stepSize argument either, which is another reason for perhaps using an anonymous function. You can roll your test into an anonymous function and then just use circleTest in your calling script, making the code self documenting to some extent . . .
doCircleTest = @(point,Polygon,r,tol) ~any(sqrt( sum( bsxfun(@minus, Polygon, point).^2, 2 )) < tol*r);
if doCircleTest(point,Polygon,r,tol)
result = circleTest (point,Polygon,r,tol,stepSize);
else
result = 1;
end
Now everything is vectorised, the use of function handles gives me another idea . . .
If you plan on performing this at multiple points in the code, the repetition of the if statements would get a bit ugly. To stay DRY (don't repeat yourself), it seems sensible to put the test with the conditional function into a single function, just as you did in your original post. However, the utility of that function would be very narrow - it would only test if the circleTest function should be executed, and then execute it if needs be.
Now imagine that after a while, you have some other conditional functions, just like circleTest, with their own equivalent of doCircleTest. It would be nice to reuse the conditional switching code maybe. For this, make a function like your original that takes a default value, the boolean result of the computationally cheap test function, and the function handle of the expensive conditional function with its associated arguments ...
function result = conditionalFun( default, cheapFunResult, expensiveFun, varargin )
if cheapFunResult
result = expensiveFun(varargin{:});
else
result = default;
end
end %//of function
You could call this function from your main script with the following . . .
result = conditionalFun(1, doCircleTest(point,Polygon,r,tol), @circleTest, point,Polygon,r,tol,stepSize);
...and the beauty of it is you can use any test, default value, and expensive function. Perhaps a little overkill for this simple example, but it is where my mind wandered when I brought up the idea of using function handles.
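For readers more used to Python, the same switching helper can be sketched as below. It mirrors conditionalFun with *args in place of varargin; the expensive_square function and the calls list are hypothetical and only there to show when the expensive function actually runs.

```python
def conditional_fun(default, cheap_result, expensive_fun, *args):
    # Run the expensive function only when the cheap test says it is needed.
    if cheap_result:
        return expensive_fun(*args)
    return default


calls = []


def expensive_square(x):
    # Hypothetical stand-in for an expensive computation; records each call.
    calls.append(x)
    return x * x


a = conditional_fun(0, True, expensive_square, 7)
b = conditional_fun(0, False, expensive_square, 7)
```

As in the Matlab version, the cheap test is evaluated by the caller before the helper is entered, so only the expensive function is deferred.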
