MLR: How exactly does the process work when using sequential optimization in nested resampling (cross-validation)?

This is a question of understanding. Suppose I want to do nested cross-validation (e.g. outer: 5, inner: 4) and use sequential optimization to find the best set of parameters. Tuning the parameters happens in the inner loop. With a normal grid search, for each combination of hyperparameters I train on three inner folds and test on the remaining fold, and then choose the best set of parameters. The best hyperparameter combination from the inner loop is then trained and evaluated on the test folds of the outer loop, in a similar way as in the inner loop.
But since it is a grid search, all the parameters are a priori known. How are the new set of parameters determined when using sequential optimization? Do the newly suggested points depend on the previously evaluated points, averaged over all inner folds? But that seems intuitively wrong to me since it is like comparing apples and oranges. I hope my question is not too confusing.

I think you might have a misunderstanding of the term "sequential optimization" here.
It can mean two things, depending on the context:
In a tuning context, this term is sometimes used as a synonym for "forward feature selection" (FFS). In this case, no grid search is done. Variables of the dataset are added sequentially to the model to see if a better performance is achieved.
When you use that term while doing a "grid search", you most likely just mean that the process is running sequentially (i.e. on one core, one setting at a time). The counterpart to this would be "parallel grid search" where you evaluate the predefined grid choices at the same time using multiple cores.
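For what it's worth, in scikit-learn terms (mlr's R interface differs, so this is only an analogy) the sequential/parallel distinction for a predefined grid is just how many workers evaluate the grid points; the grid itself is fixed either way:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = {"n_estimators": [50, 100], "max_depth": [2, 4, 8]}

# same predefined grid; only the number of workers differs
sequential = GridSearchCV(RandomForestClassifier(), grid, cv=4, n_jobs=1)
parallel = GridSearchCV(RandomForestClassifier(), grid, cv=4, n_jobs=-1)
sequential.fit(X, y)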

Related

How to persist the same folds when doing cross-validation across multiple models in scikit-learn?

I'm doing hyperparameter tuning across multiple models and comparing the results. The hyperparameters of each model are chosen by 5-fold cross-validation. I'm using the sklearn.model_selection.KFold(n_splits=5, shuffle=True) function to get a fold generator.
After checking the documentation on KFold and the source code of some models, I suspect a new set of folds is created for each model. I want to make things more fair and use the same (initially random) folds for all the models I'm tuning. Is there a way to do this in scikit-learn?
As a related question, does it make sense to use the same folds to obtain this fair comparison I'm trying to do?
You have two options:
Shuffle your data at the beginning, then use KFold with shuffle=False.
Set the random_state parameter to the same integer each time you create the KFold.
Either option should result in using the same folds when you repeat KFold. See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
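A minimal sketch of the second option on toy data, showing that a fixed random_state reproduces identical splits; the same kf object (or the materialized splits) can then be passed as cv= to each model's search:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy data
kf = KFold(n_splits=5, shuffle=True, random_state=42)

splits_a = list(kf.split(X))  # folds used while tuning model A
splits_b = list(kf.split(X))  # identical folds, reusable for model B
assert all(np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
           for (tr_a, te_a), (tr_b, te_b) in zip(splits_a, splits_b))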
This approach makes logical sense to me, but I wouldn't expect it to make a significant difference. Perhaps someone else can give a more detailed explanation of the advantages / disadvantages.
The goal of cross-validation is to obtain a representative estimate of the accuracy on unseen data. The more folds you have, the more accurate your metric will be.
If you are using 5- or 10-fold cross-validation to compare different sets of hyperparameters, you don't have to use the exact same splits to compare your models. The average accuracy across all folds will give you a good idea of how each model is performing and will allow you to compare them.

Model tuning with Cross validation

I have a model tuning object that fits multiple models and tunes each one of them to find the best hyperparameter combination for each of the models. I want to perform cross-validation on the model tuning part and this is where I am facing a dilemma.
Let's assume that I am fitting just one model, a random forest classifier, and performing 5-fold cross-validation. Currently, for the first fold that I leave out, I fit the random forest model and perform the model tuning. I am performing model tuning using the dlib package. I calculate the evaluation metric (accuracy, precision, etc.) and select the best hyper-parameter combination.
Now when I am leaving out the second fold, should I be tuning the model again? Because if I do, I will get a different combination of hyperparameters than I did in the first case. If I do this across the five folds, what combination do I select?
The cross-validators in Spark and sklearn use grid search, so for each fold they evaluate the same hyper-parameter combinations and don't have to worry about the combinations changing across folds.
Choosing the best hyper-parameter combination that I get when I leave out the first fold and using it for the subsequent folds doesn't sound right because then my entire model tuning is dependent on which fold got left out first. However, if I am getting different hyperparameters each time, which one do I settle on?
TLDR:
If you are performing, say, derivative-based model tuning along with cross-validation, your hyper-parameter combination changes as you iterate over the folds. How do you select the best combination then? Generally speaking, how do you use cross-validation with derivative-based model tuning methods?
PS: Please let me know if you need more details
This is more of a comment, but it is too long for one, so I post it as an answer instead.
Cross-validation and hyperparameter tuning are two separate things. Cross-validation is done to get a sense of the out-of-sample prediction error of the model. You can do this by having a dedicated validation set, but this raises the question of whether you are overfitting to this particular validation data. As a consequence, we often use cross-validation, where the data are split into k folds and each fold is used once for validation while the others are used for fitting. After you have done this for each fold, you combine the prediction errors into a single metric (e.g. by averaging the error from each fold). This then tells you something about the expected performance on unseen data, for a given set of hyperparameters.
Once you have this single metric, you can change your hyperparameters, repeat, and see if you get a lower error with the new values. This is the hyperparameter tuning part. The CV part is just about getting a good estimate of the model performance for a given set of hyperparameters, i.e. you do not change hyperparameters between folds.
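For concreteness, here is a minimal sketch of that separation (using scikit-learn's cross_val_score purely for illustration; the question uses dlib): cross-validation produces one averaged score per hyperparameter setting, and only then are the settings compared.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
candidates = [{"n_estimators": 50, "max_depth": 3},
              {"n_estimators": 200, "max_depth": None}]

mean_scores = {}
for params in candidates:
    # full 5-fold CV for this fixed setting; hyperparameters do not change between folds
    fold_scores = cross_val_score(RandomForestClassifier(**params), X, y, cv=5)
    mean_scores[str(params)] = fold_scores.mean()

best = max(mean_scores, key=mean_scores.get)  # compare settings only after averaging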
I think one source of confusion might be the distinction between hyperparameters and parameters (sometimes also referred to as 'weights', 'feature importances', 'coefficients', etc). If you use a gradient-based optimization approach, these change between iterations until convergence or a stopping rule is reached. This is however different from hyperparameter search (e.g. how many trees to plant in the random forest?).
By the way, I think questions like these are better suited to the Cross Validated or Data Science sites on Stack Exchange.

Intelligent purely functional sets

Set computations composed of unions, intersections and differences can often be expressed in many different ways. Are there any theories or concrete implementations that try to minimize the amount of computation required to reach a given answer?
For example, I first came across a practical application of this when trying to decompose the atoms in a simulation of an amorphous material into neighbor shells, where the first shell consists of the immediate neighbors of some given origin atom, and the n-th shell consists of the atoms that are neighbors of the (n-1)-th shell but do not belong to that shell or the one before it:
nth 0 = singleton i
nth 1 = neighbors i
nth n = reduce union (map neighbors (nth(n-1))) - nth(n-1) - nth(n-2)
There are many different ways to solve this. You can incrementally test for membership in each set while composing the result, or you can compute the union of three neighbor shells and use intersection to remove the previous two shells, leaving the outermost one. In practice, solutions that require the construction of large intermediate sets are slower.
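For illustration, here is a sketch of the incremental approach with ordinary Python sets, where neighbors is a hypothetical function mapping an atom to the set of its immediate neighbors:

def shell(neighbors, origin, n):
    # shells n-2 and n-1; shell 0 is the origin atom itself
    prev2, prev1 = set(), {origin}
    for _ in range(n):
        current = set()
        for atom in prev1:
            current |= neighbors(atom)  # expand the previous shell
        current -= prev1                # drop atoms already in shell n-1
        current -= prev2                # and those in shell n-2
        prev2, prev1 = prev1, current
    return prev1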
Presumably an intelligent set implementation could compose the expression that was to be evaluated and then optimize it (e.g. to reduce the size of intermediate sets) before evaluating it in order to improve performance. Do such set implementations exist?
Your question immediately reminded me of Haskell's stream fusion, described in this paper. The general principle can be summarized quite easily: Instead of storing a list, you store a way to build a list. Then the list transformation functions operate directly on the list generator, meaning that all the operations fuse into a single generation of the data without any intermediate structures. Then when you are done composing operations you run the generator and produce the data.
So I think the answer to your question is that if you wanted some similarly intelligent mechanism that fused computations and eliminated intermediate data structures, you'd need to find a way to transform a set into a "co-structure" (that's what the paper calls it) that generates a set and operate directly on that, then actually generate the set when you are done.
I think there's a very deep theory behind this concept that the paper hints at but never spells out, and if somebody else here knows what it is, please let me know, because this is very relevant to something else I am doing, too!
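In Python terms (only an analogy to the fused co-structures the paper describes), the same idea looks like composing generators: each stage wraps the previous one and nothing is materialized until the end.

def squares(xs):
    return (x * x for x in xs)

def evens(xs):
    return (x for x in xs if x % 2 == 0)

pipeline = evens(squares(range(10)))  # no intermediate collections are built
result = set(pipeline)                # the data is produced in a single pass
print(result)                         # {0, 4, 16, 36, 64}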

Evaluating a function at a particular value in parallel

The question may seem vague, but let me explain it.
Suppose we have a function f(x, y, z, ...) and we need to find its value at the point (x1, y1, z1, ...).
The most trivial approach is to just substitute (x1, y1, z1, ...) for (x, y, z, ...).
Now suppose that the function takes a long time to evaluate and I want to parallelize the algorithm used to evaluate it. Obviously it will depend on the nature of the function, too.
So my question is: what are the constraints that I have to look for while "thinking" to parallelize f(x,y,z...)?
If possible, please share links to study.
Asking the question in such a general way does not permit very specific advice to be given.
I'd begin the analysis by looking for ways to evaluate or rewrite the function using groups of variables that interact closely, creating intermediate expressions that can be used to make the final evaluation. You may find a way to do this involving a hierarchy of subexpressions that leads from the variables themselves to the final function.
In general, the shorter and wider such an evaluation tree is, the greater the degree of parallelism. There are two cautionary notes to keep in mind that detract from "more parallelism is better."
For one thing, a highly parallel approach may actually involve more total computation than your original "serial" approach. In fact, some loss of efficiency in this regard is to be expected, since a serial approach can take advantage of all prior subexpression evaluations and maximize their reuse.
For another thing, the parallel evaluation will often have worse rounding/accuracy behavior than a serial evaluation chosen to give good or optimal error estimates.
A lot of work has been done on evaluations that involve matrices, where there is usually a lot of symmetry to how the function value depends on its arguments. So it helps to be familiar with numerical linear algebra and parallel algorithms that have been developed there.
Another area where a lot is known is for multivariate polynomial and rational functions.
When the function is transcendental, one might hope for some transformations or refactoring that makes the dependence more tractable (algebraic).
Not directly relevant to your question are algorithms that amortize the cost of computing function values across a number of arguments. For example in computing solutions to ordinary differential equations, there may be "multi-step" methods that share the cost of evaluating derivatives at intermediate points by reusing those values several times.
I'd suggest that your concern to speed up the evaluation of the function suggests that you plan to perform more than one evaluation. So you might think about ways to take advantage of prior evaluations or perform evaluations at related arguments in a way that contributes to your search for parallelism.
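As a hedged illustration of the evaluation-tree idea, suppose f happens to factor into two independent subexpressions g and h (both hypothetical); those branches can then be evaluated concurrently and combined at the end:

from concurrent.futures import ProcessPoolExecutor

def g(x, y):
    return x ** 2 + y ** 2  # stand-in for an expensive subexpression

def h(z, w):
    return z * w            # stand-in for another expensive subexpression

def f(x, y, z, w):
    with ProcessPoolExecutor() as pool:
        fg = pool.submit(g, x, y)  # the two branches of the tree run in parallel
        fh = pool.submit(h, z, w)
        return fg.result() + fh.result()  # the final combining step stays serial

if __name__ == "__main__":
    print(f(1.0, 2.0, 3.0, 4.0))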
Added: Some links and discussion of search strategy
Most authors use the phrase "parallel function evaluation" to mean evaluating the same function at multiple argument points. See for example:
[Coarse Grained Parallel Function Evaluation -- Rulon and Youssef]
http://cdsweb.cern.ch/record/401028/files/p837.pdf
A search strategy to find the kind of material Gaurav Kalra asks about should try to avoid those. For example, we might include "fine-grained" in our search terms.
It's also effective to focus on specific kinds of functions, e.g. "polynomial evaluation" rather than "function evaluation". Here, for example, we have a treatment of some well-known techniques for "fast" evaluations applied to design for GPU-based computation:
[How to obtain efficient GPU kernels -- Cruz, Layton, and Barba]
http://arxiv.org/PS_cache/arxiv/pdf/1009/1009.3457v1.pdf
(from their Abstract) "Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied algorithmic redesign for attaining performance on GPUs. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the GPU."
Another search term that might be worth excluding is "pipelined". This term invariably discusses the sort of parallelism that can be used when multiple function evaluations are to be done: early stages of the computation can be done in parallel with later stages, but on different inputs. So that's a search term one might want to exclude. Or not.
Here's a paper that discusses n-fold speedup for n-variate polynomial evaluation over finite fields GF(p). This might be of direct interest for cryptographic applications, but the approach via a modified Horner's method may be interesting for its potential for generalization:
[Comparison of Bit and Word Level Algorithms for Evaluating Unstructured Functions over Finite Rings -- Sunar and Cyganski]
http://www.iacr.org/archive/ches2005/018.pdf
"We present a modification to Horner's algorithm for evaluating arbitrary n-variate functions defined over finite rings and fields. ... If the domain is a finite field GF(p) the complexity of multivariate Horner polynomial evaluation is improved from O(p^n) to O((p^n)/(2n)). We prove the optimality of the presented algorithm."
Multivariate rational functions can be considered simply as the ratio of two such polynomial functions. The special case of univariate rational functions, which can be particularly effective in approximating elementary transcendental functions and others, can be evaluated via finite (resp. truncated) continued fractions, whose convergents (partial numerators and denominators) can be defined recursively.
The topic of continued fraction evaluations allows us to segue to a final link that connects that topic with some familiar parallelism from numerical linear algebra:
[LU Factorization and Parallel Evaluation of Continued Fractions -- Ömer Egecioglu]
http://www.cs.ucsb.edu/~omer/DOWNLOADABLE/lu-cf98.pdf
"The first n convergents of a general continued fraction (CF) can be computed optimally in logarithmic parallel time using O(n/log(n)) processors."
You've asked how to speed up the evaluation of a single call to a single function. Unless that evaluation time is measured in hours, it isn't clear why it is worth the bother to speed it up. If you insist on speeding up the function execution itself, you'll have to inspect its content to see if some aspects of it are parallelizable. You haven't provided any information on what it computes or how it does so, so it is hard to give any further advice on this aspect. hardmath's answer suggests some ideas you can use, depending on the actual internal structure of your function.
However, usually people asking your question actually call the function many times (say, N times) for different values of x, y, z (e.g., x1,y1,..., x2,y2,..., xN,yN,..., using your vocabulary).
Yes, if you speed up the execution of the function, the collective set of calls will speed up, and that's what people tend to want. If this is the case, it is "technically easy" to speed up overall execution: make the N calls to the function in parallel. Then all the pointwise evaluations happen at the same time. To make this work, you pretty much have to make vectors out of the values you want to process (so this kind of trick is called "data parallel" programming). So what you really want is something like:
PARALLEL DO I=1,N
RESULT(I)=F(X[I],Y[I], ...)
END PARALLEL DO
How you implement PARALLEL DO depends on the programming language and libraries you have.
This generally only works if N is a fairly big number, but the more expensive f is to execute, the smaller the effective N.
You can also take advantage of the structure of your function to make this even more efficient. If f computes some internal value the same way for commonly used cases, you might be able
to break out the special cases, pre-compute those, and then use those results to compute "the rest of f" for each individual call.
If you are combining ("reducing") the results of all the functions (e.g., summing all the results), you can do that outside the PARALLEL DO loop. If you try to combine results inside the loop, you'll have "loop-carried dependencies" and you'll either get the wrong answer or it won't go parallel in the way you expect, depending on your compiler or the parallelism libraries. You can combine the answers efficiently if the combination is some associative/commutative operation such as "sum", by building what amounts to a binary tree and running the evaluation of that in parallel. That's a different problem that also occurs frequently in data parallel computation, but we won't go into it further here.
Often the overhead of a parallel for loop is pretty high (forking threads is expensive). So usually people divide the overhead across several iterations:
PARALLEL DO I=1,N,M
DO J=I,I+M-1
RESULT(J)=F(X[J],Y[J], ...)
END DO
END PARALLEL DO
The constant M requires calibration for efficiency; you have to "tune" it. You also have to take care of the fact that N might not be a multiple of M; that requires just an extra clean-up loop to handle the edge condition:
PARALLEL DO I=1,int(N/M)*M,M
DO J=I,I+M-1
RESULT(J)=F(X[J],Y[J], ...)
END DO
END PARALLEL DO
DO J=int(N/M)*M+1,N,1
RESULT(J)=F(X[J],Y[J], ...)
END DO
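For reference, a Python analogue of the chunked loop above, assuming f is a pure, expensive function; chunksize plays the role of M, and the library handles the trailing partial chunk itself:

from concurrent.futures import ProcessPoolExecutor

def f(x, y):
    return x * x + y * y  # placeholder for the expensive function

def evaluate_all(xs, ys, chunksize=64):
    with ProcessPoolExecutor() as pool:
        return list(pool.map(f, xs, ys, chunksize=chunksize))

if __name__ == "__main__":
    results = evaluate_all(range(10_000), range(10_000))
    total = sum(results)  # any reduction happens after the parallel part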

What is the best way to optimize or "tune" LINQ expressions?

When constructing LINQ expressions (for me, linq to objects) there are many ways to accomplish something, some much, much better and more efficient than others.
Is there a good way to "tune" or optimize these expressions?
What fundamental metrics do folks employ and how do you gather them?
Is there a way to get at "total iterations" count or some other metric, where you could "know" that lower means better?
EDIT
Thanks Richard/Jon for your answers.
What it seems that I really want is a way to get a simple operation count ("OCount") for a LINQ expression, though I am not sure that the hooks exist in LINQ to allow it. Suppose that I have a target level of performance for specific machine hardware (an SLA). Ideally, I would add a unit test to confirm that the typical data moved through that query would be processed within the time allotted by the SLA. The problem is that this would be run on the build server, developers' machines, etc., which probably bear little resemblance to the machine hardware of the SLA.
So the idea is that I would determine an acceptable maximum OCount for the expression, knowing that if the OCount is less than X, it will certainly provide acceptable performance under the SLA on the target "typical" hardware. If the OCount exceeds this threshold, the build/unit test would generate a warning. Ideally, I would like to have something like this (pseudocode-ish):
var results = [big linq expression run against test dataset];
Assert.IsLess(MAXALLOWABLE_OCOUNT, results.OCount)
where results.OCount would simply give me the total iterations (n) necessary to produce the result set.
Why would I like this??
Well, with even a moderately sized LINQ expression, a small change/addition can have HUGE effects on the performance as a consequence of increasing the overall operation count. The application code would still pass all unit tests as it would still produce the correct result, but work miserably slowly when deployed.
The other reason is for simple learning. If you do something and the OCount goes up or down by an order of magnitude, then you learn something.
EDIT #2
I'll throw in a potential answer as well. It isn't mine; it comes from Cameron MacFarland, from another question that I asked that spawned this one. I think the answer to that one could work here, in a unit test environment like the one I described in the first edit to this question.
The essence of it would be to create the test datasets in the unit test fixture that you feed into the LINQ expression in the way outlined in this answer and then add up the Iteration counts and compare to the max allowable iteration count.
See Cameron's answer here
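As a rough analogue of that iteration-counting idea (sketched in Python rather than C#, with hypothetical names), one can wrap the source sequence so every element pulled through the query is counted, then assert a ceiling in the unit test:

class CountingSource:
    def __init__(self, items):
        self.items = items
        self.count = 0

    def __iter__(self):
        for item in self.items:
            self.count += 1  # one "operation" per element enumerated
            yield item

source = CountingSource(range(1000))
result = [x * 2 for x in source if x % 3 == 0]  # the "query" under test

MAX_ALLOWABLE_OCOUNT = 1500
assert source.count <= MAX_ALLOWABLE_OCOUNT, source.count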
You basically need to work out the complexity function. This depends on the operator, but doesn't tend to be very well documented, unfortunately.
(For the general principle I agree with Richard's answer - this is just LINQ to Objects stuff.)
If you have specific operators you're interested in, it would be worth asking about them, but off the top of my head:
Select = O(n)
Where = O(n)
Join = O(inner + outer + matches) (i.e. it's no cheaper than inner + outer, but could be as bad as inner * outer depending on the results)
GroupJoin = same as Join, but buffered instead of streaming by outer
OrderBy = O(n log n)
SelectMany = O(n + results)
Count = O(1) or O(n) depending on whether the source implements ICollection
Count(predicate) = O(n)
Max/Min = O(n)
All/Any = O(n) (with possible early out)
Distinct = O(n)
Skip/Take = O(n)
SkipWhile/TakeWhile = O(n)
The exact characteristics depend on whether the operator buffers or streams too.
1. Get an SLA (or other definition) describing the required overall performance.
2. Measure the application's performance, and how far below requirements it is (if it is within requirements, stop and do something useful).
3. Use a profiler to get a detailed performance breakdown, and identify the parts of the system most able to be improved (making a small improvement to hot code is likely to be better than a big improvement to rarely called code).
4. Make the change, and re-run the unit/functional tests (no point doing the wrong thing fast).
5. Go to 1.
If, in step 3, you find that a LINQ expression is a performance problem, then start thinking about needing an answer to this question. The answer will completely depend on which LINQ provider you are using and the details of its use in your case. There is no general answer.
Adding onto Jon, who is adding onto Richard:
Another issue to consider is whether or not you are processing all the results of a LINQ query. In certain circumstances, particularly in UI code, you only end up processing a subset of the results returned from a LINQ query. In those situations it's important to know which LINQ operations support lazy evaluation, that is, the ability to return a subset of the results without processing the entire collection.
For instance, calling MoveNext() on the following LINQ operations will process one result at a time:
Select
Where
But the following must process every element in the collection before returning a single item.
OrderBy
Except (processes the other collection entirely)
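The same streaming-versus-buffering distinction can be sketched with Python generators (an analogy, not LINQ itself): map/filter-style stages yield one item at a time, while sorting must consume everything before producing anything.

from itertools import islice

data = range(1_000_000)

streamed = (x * 2 for x in data if x % 7 == 0)  # lazy: nothing computed yet
first_five = list(islice(streamed, 5))          # touches only a handful of elements

buffered = sorted(data, reverse=True)           # must materialize every element first
top_five = buffered[:5]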