Sklearn GridSearchCV RandomForest, get model complexity - complexity-theory

I have a random forest model, for which I use sklearn GridSearchCV to find the best hyperparameters (of n_estimators, max_depth, max_features, min_samples_leaf). I want to plot the model complexity vs the performance of the grid search. The problem is coming up with a measure for the model complexity. I used the number of trees (n_estimators) plus max_depth parameter, which is saved by the grid for each candidate. But this does not include the other parameters like max_features. Therefore, is there a way to get for each grid candidate just the full size of the random forest? Like number of trees and number of leaves in each tree. Or is there a better measure for the complexity of random forest, like the parameter count for a neural network?
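One possible way to get the full size of every candidate, sketched in Python under a few assumptions. GridSearchCV keeps only the refitted best estimator, so the sketch refits each parameter combination from cv_results_["params"] on the full training data and then counts trees, nodes and leaves. The toy data from make_classification, the small grid and the helper name forest_size are illustrative and not part of the question.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def forest_size(forest):
    """Return (n_trees, total_nodes, total_leaves) for a fitted forest."""
    trees = forest.estimators_
    total_nodes = sum(t.tree_.node_count for t in trees)
    total_leaves = sum(t.get_n_leaves() for t in trees)
    return len(trees), total_nodes, total_leaves


# Toy data and a small grid, both assumptions made for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_grid = {
    "n_estimators": [5, 10],
    "max_depth": [2, 4],
    "max_features": [2, "sqrt"],
}
base = RandomForestClassifier(random_state=0)
grid = GridSearchCV(base, param_grid, cv=3).fit(X, y)

# Refit every candidate once on the full data and record its size.
sizes = []
for params in grid.cv_results_["params"]:
    model = clone(base).set_params(**params).fit(X, y)
    sizes.append(forest_size(model))

for params, size, score in zip(
    grid.cv_results_["params"], sizes, grid.cv_results_["mean_test_score"]
):
    print(params, size, round(score, 3))
```

The total node or leaf count could then serve as the x-axis of the complexity plot, since it reflects the combined effect of all the hyperparameters rather than only n_estimators and max_depth.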

Related

What is the exact difference between a model and an algorithm?

Let us take logistic regression as an example. Is logistic regression a model or an algorithm, and why?
An algorithm is the general approach you will take. The model is what you get when you run the algorithm over your training data and what you use to make predictions on new data.
You can generate a new model with the same algorithm but with different data, or you can get a new model from the same data but with a different algorithm.
Do you like Ferrari? They have a very nice 812 Superfast model, but they also have other models. Every model is different and leads to a different behavior and experience.
Think of a model more like a mathematical description of a system: an equation that gives you a general way to achieve your vision or idea. For example, a linear equation in x is a model function that yields a straight line (see least squares linear regression).
An algorithm, by contrast, is a set of actions (or rules) that you need to perform in order to implement your vision. An example is the famous minimax algorithm, often used by AI game players that have to choose the next move.
To finish my idea above, imagine that a Ferrari model is an already existing idea on paper and an algorithm is a robot in a factory that performs its set of programmed actions. It is a sequence of actions. This is naively speaking of course, but hopefully you get the idea.
An algorithm is a mathematical formula like linear regression for example. Linear regression (with one variable) defines a line in 2-D space. But the slope and position of the line cannot be determined unless some sample values are available to solve the equation.
This regression line can be represented mathematically as y = mx + a.
Once sample values (or training data) are applied to solve this equation, the line can be drawn in 2-D space.
This line now becomes the model with known slope (m) and intercept (a). Using this model, the value of y (label) can be determined for a given value of x (feature).
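To make the distinction concrete, here is a minimal Python sketch. It assumes NumPy is available and uses made-up sample values; the least squares fit plays the role of the algorithm, and the resulting slope m and intercept a form the model.

```python
import numpy as np

# Made-up training data lying exactly on y = 2x + 1 (an assumption for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# The algorithm: least squares fit of a degree-1 polynomial.
m, a = np.polyfit(x, y, deg=1)


# The model: the fitted line y = m*x + a, usable for prediction on new x.
def model(x_new):
    return m * x_new + a


print(m, a, model(4.0))
```

Running the same algorithm on different data would give a different m and a, that is, a different model.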

How to implement decision trees in boosting

I'm implementing AdaBoost (boosting) that will use CART and C4.5. I have read about AdaBoost, but I can't find a good explanation of how to combine AdaBoost with decision trees. Let's say I have a data set D with n examples. I split D into TR training examples and TE testing examples.
Let's say TR.count = m, so I set each weight to 1/m. Then I use TR to build a tree, test it on TR to find the misclassified examples, and test it on TE to calculate the error. Then I change the weights, but how do I get the next training set? What kind of sampling should I use (with or without replacement)? I know that the new training set should focus more on the samples that were misclassified, but how can I achieve this? And how will CART or C4.5 know that they should focus on the examples with greater weight?
As far as I know, the TE data set is not meant to be used to estimate the error rate. The raw data can be split into two parts (one for training, the other for cross-validation). There are two main methods for applying the weights to the training data distribution. Which one to use is determined by the weak learner you choose.
How to apply the weights?
Re-sample the training data set with replacement, drawing each example with probability proportional to its weight. This can be viewed as boosting by weighted resampling. The re-sampled data set contains misclassified instances with higher probability than correctly classified ones, so it forces the weak learning algorithm to concentrate on the misclassified data.
Directly use the weights when learning. Models that support this include Bayesian classification, decision trees (C4.5 and CART) and so on. In C4.5, we calculate the information gain (mutual information) to determine which predictor will be selected as the next node. Hence we can combine the weights with the entropy when estimating that measure. For example, we view the weights as the probability of each sample in the distribution. Given X = [1,2,3,3] and weights [3/8,1/16,3/16,6/16], the unweighted entropy of X is (-0.25log(0.25)-0.25log(0.25)-0.5log(0.5)), but with the weights taken into consideration its weighted entropy is (-(3/8)log(3/8)-(1/16)log(1/16)-(9/16)log(9/16)). In general, C4.5 can be implemented with weighted entropy, and the standard algorithm is the special case with uniform weights [1,1,...,1]/N.
If you want to implement AdaBoost.M1 with the C4.5 algorithm, you should read page 339 of The Elements of Statistical Learning.
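The two ways of applying the weights can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the natural logarithm is an assumed choice of base, and the resampling draws with replacement in proportion to the weights.

```python
import math
import random
from collections import defaultdict


def weighted_entropy(values, weights):
    """Entropy of `values` where each sample counts with its weight."""
    mass = defaultdict(float)
    for v, w in zip(values, weights):
        mass[v] += w
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values() if m > 0)


def resample(examples, weights, rng):
    """Draw a new training set of the same size, with replacement,
    picking each example with probability proportional to its weight."""
    return rng.choices(examples, weights=weights, k=len(examples))


X = [1, 2, 3, 3]
w = [3 / 8, 1 / 16, 3 / 16, 6 / 16]

print(weighted_entropy(X, [0.25] * 4))  # uniform weights: the usual entropy
print(weighted_entropy(X, w))           # weights from the example above
print(resample(X, w, random.Random(0)))
```

With uniform weights the function reduces to the ordinary entropy, which matches the remark that standard C4.5 is the special case [1,1,...,1]/N.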

Data mining algorithm selection for 3 classes with negative and positive values

I am trying to handle a data set in MATLAB with 3 classes and attributes that take both negative and positive values. I tried a naive Bayes classifier, but MATLAB says that naive Bayes can't handle negative values. The SVM algorithm also can't handle this problem because there are 3 classes. So I am asking you: which algorithm should I choose?
Thank you in advance!!
The simplest solution that comes to mind is a k-NN classifier using majority voting. Say you want to classify a point and you use the 10 nearest neighbours. Let's say that six out of the 10 are class 1, two neighbours are class 2 and the two remaining neighbours are class 3; in this case you would classify your point as class 1.
If you want to include nonlinearity (as in the case of SVM) you can use nonlinear kernels in k-NN too which basically means modifying the distance calculation.
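The majority vote described above can be sketched in plain Python (not MATLAB). The helper name knn_predict and the toy points, which deliberately include negative attribute values, are assumptions made for illustration; Euclidean distance is used, and swapping in another distance function is where a kernel-style modification would go.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data with three classes and negative attribute values.
points = [(-1.0, -1.0), (-1.2, -0.8), (-0.9, -1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (-5.0, 5.0), (-5.2, 4.9), (-4.9, 5.1)]
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]

print(knn_predict(points, labels, (-1.1, -0.9), k=3))

# The vote from the text: six neighbours of class 1, two of class 2, two of class 3.
print(Counter([1] * 6 + [2] * 2 + [3] * 2).most_common(1)[0][0])
```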
Citing Wikipedia:
Multiclass SVM
Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.
The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[8] Common methods for such reduction include:[8] [9]
Building binary classifiers which distinguish between (i) one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
Directed Acyclic Graph SVM (DAGSVM)[10]
error-correcting output codes[11]
Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[12] See also Lee, Lin and Wahba.[13][14]
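As an illustration of the one-versus-all and one-versus-one reductions quoted above, here is a small sketch in Python with scikit-learn rather than MATLAB. The synthetic three-class data (with negative attribute values), the linear kernel and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data: three well-separated classes, attributes can be negative.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])
y = np.repeat([0, 1, 2], 20)

# One-versus-all: one binary SVM per class.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
# One-versus-one: one binary SVM per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

ovr_acc = ovr.score(X, y)
ovo_acc = ovo.score(X, y)
print(len(ovr.estimators_), ovr_acc)
print(len(ovo.estimators_), ovo_acc)
```

With three classes both reductions train three binary classifiers, but in general one-versus-one needs one classifier per pair of classes.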

Confidence of categorical predictions in MATLAB to generate lift charts

For a homework assignment I need to fit several classification models to a data set and compare their lift charts to determine the most effective model. The models produce a binary result (or a probability of that binary result); let's call the outcomes YES or NO. Models with continuous output are easy to generate lift charts for, as it's easy to order the data set in descending order of confidence.
I am having trouble doing that with models that generate a binary result (k-NN and ClassificationTree, for example). In my head I know methods to create a confidence value, but I don't know how to do it with these libraries.
For the tree I would set the confidence to the probability of a YES among the training data that falls through a particular path in the tree. However, with this method and the tree model in MATLAB, I don't know which tree path each record falls through.
Similarly with k-NN I would take the probability based upon the k neighbors, and find the probability of a YES from those k neighbors, but the model doesn't tell me the k neighbors and I'd prefer to not do a search for them.
Any help with one or both of these problems (or a better way of producing lift charts in MATLAB) is greatly appreciated.
I was actually able to find the answer to my own question. The predict function in MATLAB also returns scores, which give the probability of each class under the prediction model:
[label, score] = predict(mdl, new_observation);
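Once a score per record is available, the lift chart itself only needs the records sorted in descending order of score. The sketch below is in Python with NumPy rather than MATLAB; the function name cumulative_lift, the binning scheme and the toy labels and scores are assumptions for illustration, and it expects at least one YES in the data.

```python
import numpy as np


def cumulative_lift(y_true, scores, n_bins=10):
    """Cumulative lift per bin: YES rate among the top-scored records
    divided by the overall YES rate."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending confidence
    sorted_y = y_true[order]
    base_rate = y_true.mean()
    cut_points = np.ceil(np.arange(1, n_bins + 1) * len(y_true) / n_bins).astype(int)
    return np.array([sorted_y[:c].mean() / base_rate for c in cut_points])


# Toy example: 1 = YES, 0 = NO, with the score of the YES class per record.
y = [1, 1, 0, 0]
yes_scores = [0.9, 0.8, 0.2, 0.1]
print(cumulative_lift(y, yes_scores, n_bins=2))
```

A model whose scores rank the YES records first gives lift well above 1 in the early bins, and the last bin is always 1 because it covers the whole data set.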

Clustering by date (by distance) in Ruby

I have a huge journal with actions done by users (like, for example, moderating contents).
I would like to find the 'mass' actions, meaning the actions that are too dense (the user probably performed those actions without thinking too much about them :) ).
That would translate to clustering the actions by date (in a linear space), and to marking the clusters that are too dense.
I am no expert in clustering algorithms and methods, but I think the k-means clustering would not do the trick, since I don't know the number of clusters.
Also, ideally, I would also like to 'fine tune' the algorithm.
What would you advise?
P.S. Here are some resources that I found (in Ruby):
hierclust - a simple hierarchical clustering library for spatial data
AI4R - library that implements some clustering algorithms
K-means would probably do a good job as long as you're interested in an a priori known number of clusters. Since you aren't, you might consider reading about the LBG algorithm, which is based on k-means and is used in data compression for vector quantisation. It's basically iterative k-means which splits centroids after they converge and keeps splitting until you achieve an acceptable number of clusters.
On the other hand, since your data is one-dimensional, you could do something completely different.
Assume that you've got actions which took place at 5 points in time: (8, 11, 15, 16, 17). Let's plot a Gaussian for each of these actions with μ equal to the time and σ = 3.
Now let's see what the sum of the values of these Gaussians looks like.
It shows a density of actions with a peak around 16.
Based on this observation I propose the following simple algorithm.
Create a vector of zeroes for the time range of interest.
For each action calculate the Gaussian and add it to the vector.
Scan the vector looking for values which are greater than the maximum value in the vector multiplied by α.
Note that for each action only a small section of the vector needs updates because values of a Gaussian converge to zero very quickly.
You can tune the algorithm by adjusting values of
α ∈ [0,1], which indicates how significant a peak of activity has to be to be noted,
σ, which affects the distance of actions which are considered close to each other, and
time periods per vector's element (minutes, seconds, etc.).
Notice that the algorithm is linear with regard to the number of actions. Moreover, it shouldn't be difficult to parallelise: split your data across multiple processes summing Gaussians and then sum generated vectors.
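The steps above can be sketched in Python with NumPy (not Ruby). The function name dense_periods is hypothetical, one vector element per time unit is an assumed resolution, and for brevity each Gaussian is added over the whole vector instead of only the small section around its mean.

```python
import numpy as np


def dense_periods(action_times, sigma, alpha, t_max):
    """Sum one Gaussian per action and return (density, indices of dense slots)."""
    t = np.arange(t_max + 1, dtype=float)  # one element per time unit
    density = np.zeros_like(t)             # vector of zeroes for the time range
    for mu in action_times:
        density += np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    # Slots whose value exceeds alpha times the maximum value in the vector.
    dense = np.flatnonzero(density > alpha * density.max())
    return density, dense


density, dense = dense_periods([8, 11, 15, 16, 17], sigma=3.0, alpha=0.9, t_max=25)
print(int(np.argmax(density)), dense)
```

Lowering alpha widens the reported dense region, and a smaller sigma makes only actions that are very close in time reinforce each other.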
Have a look at density-based clustering, e.g. DBSCAN and OPTICS.
This sounds like exactly what you want.
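For a feel of how DBSCAN behaves on one-dimensional timestamps, here is a small sketch using scikit-learn in Python rather than a Ruby library. The timestamps are the ones from the Gaussian example in the other answer, and eps=1.5 and min_samples=3 are assumed values chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Action times as a column vector, since DBSCAN expects 2-D input.
times = np.array([8, 11, 15, 16, 17], dtype=float).reshape(-1, 1)

# eps: maximum gap between neighbouring actions; min_samples: how many
# actions (including the point itself) make a neighbourhood dense.
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(times)
print(labels)  # -1 marks isolated actions; equal non-negative labels form a dense cluster
```

Both parameters can be tuned, which covers the wish to fine-tune the algorithm, and the number of clusters does not have to be known in advance.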

Resources