I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially Cross Validation allows you to alternate between training and testing when your dataset is relatively small to maximize your error estimation. A very simple algorithm goes something like this:
Decide on the number of folds you want (k)
Subdivide your dataset into k folds
Use k-1 folds for a training set to build a tree.
Use the testing set to estimate statistics about the error in your tree.
Save your results for later
Repeat steps 3-6 for k times leaving out a different fold for your test set.
Average the errors across your iterations to predict the overall error
The problem I can't figure out is at the end you'll have k Decision trees that could all be slightly different because they might not split the same way, etc. Which tree do you pick? One idea I had was pick the one with minimal errors (although that doesn't make it optimal just that it performed best on the fold it was given - maybe using stratification will help but everything I've read say it only helps a little bit).
As I understand cross validation the point is to compute in node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in node stats, but if your averaging your error. How do you merge these stats within each node across k trees when each tree could vary in what they choose to split on, etc.
What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.
Any help with this little wrinkle would be much appreciated.
The problem I can't figure out is at the end you'll have k Decision trees that could all be slightly different because they might not split the same way, etc. Which tree do you pick?
The purpose of cross validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error ratio, the deviation relative to this average etc. which can be useful in asserting the level of precision one can expect from the application. One of the things cross validation can help assert is whether the training data is big enough.
With regards to selecting a particular tree, you should instead run yet another training on 100% of the training data available, as this typically will produce a better tree. (The downside of the Cross Validation approach is that we need to divide the [typically little] amount of training data into "folds" and as you hint in the question this can lead to trees which are either overfit or underfit for particular data instances).
In the case of decision tree, I'm not sure what your reference to statistics gathered in the node and used to prune the tree pertains to. Maybe a particular use of cross-validation related techniques?...
For the first part, and like the others have pointed out, we usually use the entire dataset for building the final model, but we use cross-validation (CV) to get a better estimate of the generalization error on new unseen data.
For the second part, I think you are confusing CV with the validation set, used to avoid overfitting the tree by pruning a node when some function value computed on the validation set does not increase before/after the split.
Cross validation isn't used for buliding/pruning the decision tree. It's used to estimate how good the tree (built on all of the data) will perform by simulating arrival of new data (by building the tree without some elements just as you wrote). I doesn't really make sense to pick one of the trees generated by it because the model is constrained by the data you have (and not using it all might actually be worse when you use the tree for new data).
The tree is built over the data that you choose (usualy all of it). Pruning is usually done by using some heuristic (i.e. 90% of the elements in the node belongs to class A so we don't go any further or the information gain is too small).
The main point of using cross-validation is that it gives you better estimate of the performance of your trained model when used on different data.
Which tree do you pick? One option would be that you bulid a new tree using all your data for training set.
It has been mentioned already that the purpose of the cross-validation is to qualify the model. In other words cross-validation provide us with an error/accuracy estimation of model generated with the selected "parameters" regardless of the used data.
The corss-validation process can be repeated using deferent parameters untill we are satisfied with the performance. Then we can train the model with the best parameters on the whole data.
I am currently facing the same problem, and I think there is no “correct” answer, since the concepts are contradictory and it’s a trade-off between model robustness and model interpretation.
I basically chose the decision tree algorithm for the sake of easy interpretability, visualization and straight forward hands-on application.
On the other hand, I want to proof the robustness of the model using cross-validation.
I think I will apply a two step approach:
1. Apply k-fold cross-validation to show robustness of the algorithm with this dataset
2. Use the whole dataset for the final decision tree for interpretable results.
You could also randomly choose a tree set of the cross-validation or the best performing tree, but then you would loose information of the hold-out set.
Related
I have a model tuning object that fits multiple models and tunes each one of them to find the best hyperparameter combination for each of the models. I want to perform cross-validation on the model tuning part and this is where I am facing a dilemma.
Let's assume that I am fitting just the one model- a random forest classifier and performing a 5 fold cross-validation. Currently, for the first fold that I leave out, I fit the random forest model and perform the model tuning. I am performing model tuning using the dlib package. I calculate the evaluation metric(accuracy, precision, etc) and select the best hyper-parameter combination.
Now when I am leaving out the second fold, should I be tuning the model again? Because if I do, I will get a different combination of hyperparameters than I did in the first case. If I do this across the five folds, what combination do I select?
The cross validators present in spark and sklearn use grid search so for each fold they have the same hyper-parameter combination and don't have to bother about hyper-parameter combinations changing across folds
Choosing the best hyper-parameter combination that I get when I leave out the first fold and using it for the subsequent folds doesn't sound right because then my entire model tuning is dependent on which fold got left out first. However, if I am getting different hyperparameters each time, which one do I settle on?
TLDR:
If you are performing let's say a derivative based model tuning along with cross-validation, your hyper-parameter combination changes as you iterate over folds. How do you select the best combination then? Generally speaking, how do you use cross-validation with derivative-based model tuning methods.
PS: Please let me know if you need more details
This is more of a comment, but it is too long for this, so I post it as an answer instead.
Cross-validation and hyperparameter tuning are two separate things. Cross Validation is done to get a sense of the out-of-sample prediction error of the model. You can do this by having a dedicated validation set, but this raises the question if you are overfitting to this particular validation data. As a consequence, we often use cross-validation where the data are split in to k folds and each fold is used once for validation while the others are used for fitting. After you have done this for each fold, you combine the prediction errors into a single metric (e.g. by averaging the error from each fold). This then tells you something about the expected performance on unseen data, for a given set of hyperparameters.
Once you have this single metric, you can change your hyperparameter, repeat, and see if you get a lower error with the new hyperparameter. This is the hpyerparameter tuning part. The CV part is just about getting a good estimate of the model performance for the given set of hyperparameters, i.e. you do not change hyperparameters 'between' folds.
I think one source of confusion might be the distinction between hyperparameters and parameters (sometimes also referred to as 'weights', 'feature importances', 'coefficients', etc). If you use a gradient-based optimization approach, these change between iterations until convergence or a stopping rule is reached. This is however different from hyperparameter search (e.g. how many trees to plant in the random forest?).
By the way, I think questions like these should better be posted to the Cross-Validated or Data Science section here on StackOverflow.
I was reading about cross validation and about how it it is used to select the best model and estimate parameters , I did not really understand the meaning of it.
Suppose I build a Linear regression model and go for a 10 fold cross validation, I think each of the 10 will have different coefficiant values , now from 10 different which should I pick as my final model or estimate parameters.
Or do we use Cross Validation only for the purpose of finding an average error(average of 10 models in our case) and comparing against another model ?
If your build a Linear regression model and go for a 10 fold cross validation, indeed each of the 10 will have different coefficient values. The reason why you use cross validation is that you get a robust idea of the error of your linear model - rather than just evaluating it on one train/test split only, which could be unfortunate or too lucky. CV is more robust as no ten splits can be all ten lucky or all ten unfortunate.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
Cross-validation is used to see how good your models prediction is. It's pretty smart making multiple tests on the same data by splitting it as you probably know (i.e. if you don't have enough training data this is good to use).
As an example it might be used to make sure you aren't overfitting the function. So basically you try your function when you've finished it with Cross-validation and if you see that the error grows a lot somewhere you go back to tweaking the parameters.
Edit:
Read the wikipedia for deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing Grid-search with cross-validation. The idea behind cross-validation is basically to check how well a model will perform in say a real world application. So we basically try randomly splitting the data in different proportions and validate it's performance. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In Grid-search we try to find the best possible parameters that would give the best results over a specific split of data (say 70% train and 30% test). So in this case, for different combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
Cross Validation is mainly used for the comparison of different models.
For each model, you may get the average generalization error on the k validation sets. Then you will be able to choose the model with the lowest average generation error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to know which method (SVM, Random Forest, etc) will perform best and we can pick that method to work further.
(From these methods different models will be generated and evaluated for each method and an average metric is calculated for each method and the best average metric will help in selecting the method)
After getting the information about the best method/ or best parameters we can train/retrain our model on the training dataset.
For parameters or coefficients, these can be determined by grid search techniques. See grid search
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on data. Then dividing such a small amount of data into three sets reduce the training samples drastically and the result will depend on the choice of pairs of training and validation sets.
CV will come to the rescue here. In this case, we don't need the validation set but we still need to hold the test data.
A model will be trained on k-1 folds of training data and the remaining 1 fold will be used for validating the data. A mean and standard deviation metric will be generated to see how well the model will perform in practice.
After finding out about many transformations that can be applied on the target values(y column), of a data set, such as box-cox transformations I learned that linear regression models need to be trained with normally distributed target values in order to be efficient.(https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va)
I'd like to know if the same applies for non-linear regression algorithms. For now I've seen people on kaggle use log transformation for mitigation of heteroskedasticity, by using xgboost, but they never mention if it is also being done for getting normally distributed target values.
I've tried to do some research and I found in Andrew Ng's lecture notes(http://cs229.stanford.edu/notes/cs229-notes1.pdf) on page 11 that the least squares cost function, used by many algorithms linear and non-linear, is derived by assuming normal distribution of the error. I believe if the error should be normally distributed then the target values should be as well.
If this is true then all the regression algorithms using least squares cost function should work better with normally distributed target values.
Since xgboost uses least squares cost function for node splitting(http://cilvr.cs.nyu.edu/diglib/lsml/lecture03-trees-boosting.pdf - slide 13) then maybe this algorithm would work better if I transform the target values using box-cox transformations for training the model and then apply inverse box-cox transformations on the output in order to get the predicted values.
Will this theoretically speaking give better results?
Your conjecture "I believe if the error should be normally distributed then the target values should be as well." is totally wrong. So your question does not have any answer at all since it is not a valid question.
There are no assumptions on the target variable to be Normal at all.
Getting the target variable transformed does not mean the errors are normally distributed. In fact, that may ruin normality.
I have no idea what this is supposed to mean: "linear regression models need to be trained with normally distributed target values in order to be efficient." Efficient in what way?
Linear regression models are global models. They simply fit a surface to the overall data. The operations are matrix operations, so the time to "train" the model depends only on the size of data. The distribution of the target has nothing to do with model building performance. And, it has nothing to do with model scoring performance either.
Because targets are generally not normally distributed, I would certainly hope that such a distribution is not required for a machine learning algorithm to work effectively.
I am using K-Means algorithm for Text Clustering with initial seeding with K-Means++.
I try to make the algorithm more efficient with some changes like changing the stop-word dictionary and increasing the max_no_of_random_iterations.
I get different results. How do i compare them ? I could not apply the idea of confusion matrix here. Output is not in the form of some document getting some value or tag. A document goes to a set. It is just relative "good clustering" or the set that matters.
So Is there some standard way for marking the performance for this output set ?
If confusion matrix is the answer, please explain how to do it ?
Thanks.
You could decide in advance how to measure the quality of the clusters, for example count how many empty ones or some stats like Within Sum of Squares
This paper says
"... three distinctive approaches to cluster validity are possible.
The first approach relies on external criteria that investigate the
existence of some predefined structure in clustered data set. The
second approach makes use of internal criteria and the clustering
results are evaluated by quantities describing the data set such as
proximity matrix etc. Approaches based on internal and external
criteria make use of statistical tests and their disadvantage is
high computational cost. The third approach makes use of relative
criteria and relies on finding the best clustering scheme that meets
certain assumptions and requires predefined input parameters values"
Since clustering is unsupervised, you are asking for something difficult. I suggest researching how people cluster using genetic algorithms and see what fitness criteria they use.
I have a graph (and it is a graph because one node might have many parents) that has contains nodes with the following data:
Keyword Id
Keyword Label
Number of pervious searches
Depth of keyword promotion
The relevance is rated with a number starting from 1.
The relevance of a child node is determained by the distance from the parent node the child node minus the depth of the keyword's promotion.
The display order of of child nodes from the same depth is determained by the number of pervious searches.
Is there an algorithm that is able to search such a data structure?
Do I have an efficiency issue if I need to transverse all nodes, cache the generated result and display them by pages considering that this should scale well for a large amount of users? If I do have an issue, how can this be resolved?
What kind of database do I need to use? A NoSQL, a relational one or a graph database?
How the scheme would look like?
Can this be done using django-haystack?
It seems you're trying to compute a top-k query over a graph. There is a variety of algorithms fit to solve this problem, the simplest one I believe will help you to solve your problem is the Threshold Algorithm (TA), when the traversal over the graph is done in a BFS fashion. Some other top-k algorithms are Lawler-Murty Procedure, and other TA variations exist.
Regarding efficiency - the problem of computing the query itself might have an exponential time, simply due to exponential number of results to be returned, but when using a TA the time between outputting results should be relatively short. As far as caching & scale involved, the usual considerations apply - you'll probably want to use a distributed system when the scale gets and the appropriate TA version (such as Threshold Join Algorithm). Of course you'll need to consider the scaling & caching issues when choosing which database solution to use as well.
As far as the database goes you should definitely use one that supports graphs as first class citizens (those tend to be known as Graph Databases), and I believe it doesn't matter if the storage engine behind the graph database is relative or NoSQL. One point to note is that you'd probably will want to make sure the database you choose can scale to the scale you require (so for large scale, perhaps, you'll want to look into more distributed solutions). The schema will depend on the database you'll choose (assuming it won't be a schema-less database).
Last but not least - Haystack. As haystack will work with everything that the search engine you choose to use will work with, there should be at least one possible way to do it (combining Apache Solr for search and Neo4j or GoldenOrb for the database), and maybe more (as I'm not really familiar with Haystack or the search engines it supports other than Solr).