How does a bagging classifier (averaging) work? - algorithm

How does a bagging classifier work (averaging, not voting)? I am working on a bagging classifier and I want to use an average of the models, but when I bag models the result is a continuous value rather than a categorical value. Can I use averaging here? If yes, how?

You have to give more details on what programming language and library you are using.
If you are doing regression, the bagging model can give you the average or a weighted average of the base models.
If you are doing classification, then it can be voting or weighted voting.
However, if you are doing binary classification, the average of the 1s and 0s can be used as a pseudo-probability or confidence for the prediction; thresholding that average gives you back a categorical label (see the sketch below).
You can do this for non-binary classification as well, using the one-vs-all method to get probabilities for all possible classes.
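A minimal sketch of the averaging idea, assuming Python with scikit-learn and a hand-rolled bagging loop (the synthetic dataset, the 25 bootstrap rounds and the 0.5 threshold are illustrative choices, not from the question): each bootstrapped tree predicts a class-1 probability, the probabilities are averaged, and the continuous average is thresholded back into a categorical prediction.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    rng = np.random.RandomState(0)
    models = []
    for _ in range(25):                       # 25 bootstrap rounds
        idx = rng.randint(0, len(X), len(X))  # sample with replacement
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

    # Average the predicted probability of class 1 across all bagged models ...
    avg_proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    # ... then threshold the continuous average to recover a categorical label.
    y_pred = (avg_proba >= 0.5).astype(int)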

Related

Multiclass classification confidence score using predict_proba of SGDClassifier

I am using logistic regression in SGDClassifier to perform multi-class classification of ~10k categories.
To get a confidence score for the predicted result I am using the predict_proba function.
But I am getting prediction probability values like 0.00026091, 0.00049697, 0.00019632 for both correct and wrong predictions.
Please suggest a way to normalize the score so that I can filter results by probability value.
If the probability values of all classes are very low, it might mean that your classifier has a hard time classifying the samples. You might want to do some feature engineering or try another model.
To normalize the values, have a look at scikit-learn's MinMaxScaler. This will scale the data to numbers between 0 and 1 (a sketch follows below). But as I said, if the probability for all values is very low, you won't get a good classification result.
Hope that helps
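A hedged sketch of the suggestion above (synthetic data, far fewer than 10k classes): SGDClassifier with a logistic loss gives per-class probabilities, and MinMaxScaler stretches each class's scores to the [0, 1] range. Note that this only rescales the scores; it does not make a weak classifier any more confident or accurate.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_classification(n_samples=500, n_classes=5, n_informative=8, random_state=0)

    # loss="log_loss" is logistic regression via SGD (named "log" in older scikit-learn versions)
    clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
    proba = clf.predict_proba(X)              # shape (n_samples, n_classes), rows sum to 1

    # Rescale each class's scores across samples to [0, 1] so a filtering threshold is easier to pick.
    scaled = MinMaxScaler().fit_transform(proba)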

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on a knowledge graph, more precisely in the natural language processing field. To evaluate the components of my algorithm, it is necessary to be able to classify the good and the poor candidates. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I'm able to calculate:
Precision = X
Recall = Y
To establish a complete curve I need the rest of the (X, Y) points. What should I do?
build another dataset for testing?
split my dataset?
or any other solution?
Neither of your two proposed methods. In short, a precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), you need a classifier that can provide a probability in the [0, 1] range. In sklearn, precision_recall_curve is the function to do it; note how its 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you can then set a threshold, say at 0.5. Setting such a threshold is problematic, however, since you can trade off precision against recall by varying the threshold, and an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures like the area under the ROC or precision-recall curve are used. They create thresholds at different intervals, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes, and then compute precision and recall for each such threshold.
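A short sketch with scikit-learn (the data, split and LogisticRegression model are illustrative stand-ins for your own scoring system): the full curve is swept out from the probabilistic scores by precision_recall_curve, and the area under it gives a threshold-independent summary.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, auc
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]           # probability of the positive class

    # One (precision, recall) point per threshold, swept over the scores.
    precision, recall, thresholds = precision_recall_curve(y_te, scores)
    pr_auc = auc(recall, precision)                  # threshold-independent summary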

How to implement decision trees in boosting

I'm implementing AdaBoost (boosting) that will use CART and C4.5. I read about AdaBoost, but I can't find a good explanation of how to join AdaBoost with decision trees. Let's say I have a data set D with n examples. I split D into TR training examples and TE testing examples.
Let's say TR.count = m,
so I set the weights to 1/m, then I use TR to build a tree, I test it with TR to get the wrongly classified examples, and test with TE to calculate the error. Then I change the weights, and now how will I get the next training set? What kind of sampling should I use (with or without replacement)? I know that the new training set should focus more on samples that were wrongly classified, but how can I achieve this? That is, how will CART or C4.5 know that they should focus on examples with greater weight?
As far as I know, the TE data set is not meant to be used to estimate the error rate. The raw data can be split into two parts (one for training, the other for cross-validation). Mainly, there are two methods to apply the weights to the training data distribution; which one to use is determined by the weak learner you choose.
How to apply the weights?
Re-sample the training data set, with replacement, according to the weight distribution. This approach can be viewed as boosting by re-sampling. The generated re-sampled data set contains misclassified instances with higher probability than the correctly classified ones, which forces the weak learning algorithm to concentrate on the misclassified data.
Directly use the weights when learning. Such models include Bayesian classification, decision trees (C4.5 and CART) and so on. With respect to C4.5, we calculate the information gain (mutual information) to determine which predictor is selected as the next node, so we can combine the weights with the entropy when estimating the split measure. For example, view the weights as the probability of each sample in the distribution. Given X = [1,2,3,3] with weights [3/8, 1/16, 3/16, 6/16], the unweighted entropy of X is (-0.25log(0.25) - 0.25log(0.25) - 0.5log(0.5)), but with the weights taken into consideration its weighted entropy is (-(3/8)log(3/8) - (1/16)log(1/16) - (9/16)log(9/16)). In general, C4.5 can be implemented with weighted entropy, and standard (unweighted) C4.5 is the special case with weights [1,1,...,1]/N (see the sketch below).
If you want to implement AdaBoost.M1 with C4.5, you should read page 339 of The Elements of Statistical Learning.
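A small sketch reproducing the weighted-entropy arithmetic above, plus an illustration of "directly use the weights when learning": scikit-learn's tree learner is CART rather than C4.5, but its fit() accepts sample_weight, which is exactly the hook AdaBoost needs (the helper function and variable names are my own).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def weighted_entropy(values, weights):
        """Entropy of `values` where each sample counts with its weight."""
        values, weights = np.asarray(values), np.asarray(weights, dtype=float)
        probs = np.array([weights[values == v].sum() for v in np.unique(values)])
        probs = probs / probs.sum()
        return -np.sum(probs * np.log2(probs))

    X_col = [1, 2, 3, 3]
    print(weighted_entropy(X_col, [0.25, 0.25, 0.25, 0.25]))  # uniform weights: 1.5 bits
    print(weighted_entropy(X_col, [3/8, 1/16, 3/16, 6/16]))   # boosting-style weights

    # "Directly use the weights when learning" with a CART tree:
    # X_train, y_train, w = ...   # w is the current AdaBoost weight distribution
    # DecisionTreeClassifier().fit(X_train, y_train, sample_weight=w)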

How to merge up multiple algorithms in WEKA?

I've visited this tutorial and got the idea of merging multiple algorithms using Vote, but I'm not clear about the actual mechanism of how it works. I want to understand whether the first mentioned algorithm is applied to the data set first, and then the second algorithm is applied to the classifier we get from the first algorithm.
Suppose I choose Naive Bayes and Bayes Net; then what is happening? Is Naive Bayes applied to the given data set first so that we get a classifier C1, then Bayes Net applied to C1, finally giving the final classifier C*?
Or is it that at each step both of the algorithms are working and the higher-voted result proceeds further?
Each ensemble member (or algorithm) is trained on its own training data. Once each of these has been trained, they are later evaluated using a specific voting algorithm.
Generally, when test cases are presented for estimation, each of the algorithms generates its estimate, and then the voting algorithm determines how the classifiers' weights are applied and assigns the best output as the ensemble estimate.
That's not to say that it always works this way. There was one proposed model I used in the past that selected a subset of algorithms depending on the locality of the test case in the problem space and weighted each member's vote differently. Each voting algorithm works in a different way, and Weka has a few common models that can be tried out.
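A conceptual sketch in scikit-learn terms rather than WEKA itself (scikit-learn has no Bayes Net, so GaussianNB and LogisticRegression stand in as the two base algorithms): like WEKA's Vote meta-classifier, each base algorithm is trained independently on the same training data, and their individual predictions are combined afterwards rather than being chained one after the other.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=300, random_state=0)

    ensemble = VotingClassifier(
        estimators=[("nb", GaussianNB()), ("lr", LogisticRegression())],
        voting="soft",        # average predicted probabilities; "hard" = majority vote
    )
    ensemble.fit(X, y)        # trains both base classifiers independently, not sequentially
    print(ensemble.predict(X[:5]))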

Data mining algorithm selection for 3 classes with negative and positive values

I am trying to handle a data set in MATLAB with 3 classes and both negative and positive attribute values. I tried a naive Bayes classifier, but MATLAB says that naive Bayes can't handle negative values. The SVM algorithm also can't handle this problem because there are 3 classes. So, I am asking you which algorithm to choose?
Thank you in advance!!
The simplest solution that comes to mind is a k-NN classifier using majority voting. Say you want to classify a point and you use the 10 nearest neighbours. Let's say that six out of 10 are class 1, two neighbours are class 2 and the remaining two neighbours are class 3; in this case you would classify your point as class 1 (a sketch follows below).
If you want to include nonlinearity (as in the case of SVM), you can use nonlinear kernels in k-NN too, which basically means modifying the distance calculation.
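A minimal sketch of that suggestion, written with scikit-learn rather than MATLAB (the random data is only there to show the shape of the problem): k-NN with k = 10 handles three classes and negative attribute values without issue, and predicts by majority vote among the neighbours.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(300, 4) * 5 - 2          # attributes with both negative and positive values
    y = rng.randint(0, 3, size=300)        # three classes: 0, 1, 2

    knn = KNeighborsClassifier(n_neighbors=10)   # e.g. a 6/2/2 neighbour split -> the class with 6 wins
    knn.fit(X, y)
    print(knn.predict(X[:5]))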
Citing Wikipedia:
Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.
The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[8] Common methods for such reduction include:[8] [9]
Building binary classifiers which distinguish between (i) one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
Directed Acyclic Graph SVM (DAGSVM)[10]
error-correcting output codes[11]
Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[12] See also Lee, Lin and Wahba.[13][14]
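A sketch of the one-versus-all reduction described above, assuming scikit-learn: one binary SVM is trained per class, and at prediction time the class whose classifier gives the highest decision score wins (winner-takes-all).

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)

    ovr = OneVsRestClassifier(LinearSVC())   # trains 3 binary SVMs, one per class
    ovr.fit(X, y)
    print(ovr.predict(X[:5]))                # argmax over the 3 decision functions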
