Data mining algorithm selection for 3 classes with negative and positive values

I am trying to handle a data set in MATLAB with 3 classes and both negative and positive attribute values. I tried the naive Bayes classifier, but MATLAB says that naive Bayes can't handle negative values. The SVM algorithm also can't handle this problem, because there are 3 classes. So I am asking you: which algorithm should I choose?
Thank you in advance!!

The simplest solution that comes to mind is a k-NN classifier using majority voting. Say you want to classify a point and you use its 10 nearest neighbours. If six out of the 10 are class 1, two are class 2 and the two remaining are class 3, then you would classify your point as class 1.
If you want to include nonlinearity (as in the case of SVM), you can use nonlinear kernels in k-NN too, which basically means modifying the distance calculation.
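
For a concrete starting point, here is a minimal sketch with Python and scikit-learn (my own example on synthetic data; in MATLAB, fitcknn plays the same role). k-NN has no trouble with negative attribute values or with three classes:

```python
# Minimal k-NN sketch; the data here is synthetic and stands in for any
# 3-class dataset whose attributes take both negative and positive values.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Three classes centred at -3, 0 and +3, so features are negative and positive.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in (-3, 0, 3)])
y = np.repeat([0, 1, 2], 50)

# 10 nearest neighbours with majority voting, as described above.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X, y)
print(knn.predict([[-2.5, -3.1]]))  # the most common class among the 10 neighbours
```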

Citing Wikipedia:
Multiclass SVM
Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.
The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[8] Common methods for such reduction include:[8][9]
Building binary classifiers which distinguish between (i) one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
Directed Acyclic Graph SVM (DAGSVM)[10]
error-correcting output codes[11]
Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[12] See also Lee, Lin and Wahba.[13][14]
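
As a concrete illustration of the two reductions, here is a minimal sketch using scikit-learn (an assumption on my part, since the question is about MATLAB, where fitcecoc provides the same reductions):

```python
# Sketch: the two binary reductions of a 3-class problem, each wrapping an SVM.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # K(K-1)/2 = 3 classifiers
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # K = 3 classifiers
print(ovo.predict(X[:5]))
print(ovr.predict(X[:5]))
```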

Related

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on a knowledge graph, more precisely in the natural language processing field. To evaluate the components of my algorithm, it is necessary to be able to classify the good and the poor candidates. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I'm able to calculate:
Precision = X
Recall = Y
To establish a complete curve I need the rest of the points (X, Y). What should I do?
Build another dataset for testing?
Split my dataset?
Or any other solution?
Neither of your two proposed methods. In short, a precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), you need a classifier that can provide a probability in the [0, 1] range. This is the function to do it in sklearn; note how the 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you can then set a threshold, say at .5. Setting such a threshold is problematic, however, since you can trade off precision against recall by varying the threshold, and an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures like the area under the ROC or precision-recall curve are used. They create thresholds at different intervals, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes and then compute precision and recall for each such threshold.
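
A minimal sketch of that procedure with scikit-learn (my own example on synthetic data; precision_recall_curve is the function referred to above):

```python
# Sketch: building a precision-recall curve from probabilistic scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
probas_pred = clf.predict_proba(X_te)[:, 1]     # probability of the positive class

# One (precision, recall) point per threshold, not just a single X and Y.
precision, recall, thresholds = precision_recall_curve(y_te, probas_pred)
print(auc(recall, precision))                   # area under the PR curve
```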

How is the class center for a decision attribute calculated in class center based fuzzification algorithm?

I came across the class center based fuzzification algorithm on page 16 of this research paper on TRFDT. However, I fail to understand what is happening in step 2 of this algorithm (titled in the paper as Algorithm 2: Fuzzification). If someone could explain it with a small example, it would certainly be helpful.
It is not clear from your question which parts of the article you understand, and IMHO the article is not written in the clearest possible way, so this is going to be a long answer.
Let's start with some intuition behind this article. In short I'd say it is: "let's add fuzziness everywhere to decision trees".
How does a decision tree work? We have a classification problem, and we say that instead of analyzing all attributes of a data point in a holistic way, we'll analyze them one by one in an order defined by the tree and navigate the tree until we reach some leaf node. The label at that leaf node is our prediction. So the trick is how to build a good tree, i.e. a good order of attributes and good splitting points. This is a well-studied problem, and the idea is to build a tree that encodes as much information as possible by some metric. There are several metrics, and this article uses entropy, which is similar to the widely used information gain.
The next idea is that we can treat the classification (i.e. the split of the values into classes) as fuzzy rather than exact (aka "crisp"). The idea here is that in many real-life situations not all members of a class are equally representative: some are more "core" examples and some are more "edge" examples. If we can capture this difference, we can provide a better classification.
And finally there is the question of how similar the data points are (in general or by some subset of attributes), and here we can also have a fuzzy answer (see formulas 6-8).
So the idea of the main algorithm (Algorithm 1) is the same as in the ID3 tree: recursively find the attribute a* that classifies the data in the best way and perform the best split along that attribute. The main difference is in how the information gain for the best-attribute selection is measured (see the heuristic in formulas 20-24), and in the fact that, because of fuzziness, the usual stopping rule of "only one class left" no longer works, so another entropy (Kosko fuzzy entropy in 25) is used to decide whether it is time to stop.
Given this skeleton of the algorithm 1 there are quite a few parts that you can (or should) select:
How do you measure μ(ai)τ(Cj)(x) used in (20)? This is a measure of how well x represents the class Cj with respect to attribute ai (note that here being not in Cj and far from the points in Cj is also good). There are two obvious choices: the lower bound (16 and 18) and the upper bound (17 and 19).
How do you measure μRτ(x, y) used in (16-19)? Given that R is induced by ai, this becomes μ(ai)τ(x, y), which is a measure of similarity between two points with respect to attribute ai. Here you can choose one of the metrics (6-8).
How do you measure μCi(y) used in (16-19)? This is the measure of how well the point y fits in the class Ci. If you already have the data as a fuzzy classification, there is nothing you need to do here. But if your input classification is crisp, then you should somehow produce μCi(y) from it, and this is what Algorithm 2 does.
There is a trivial solution of μCj(xi) = "1 if xi ∈ Cj and 0 otherwise", but it is not fuzzy at all. The process of producing fuzzy data is called "fuzzification". The idea behind Algorithm 2 is that we assume every class Cj is actually some kind of cluster in the space of attributes, and so we can measure the degree of membership μCj(xi) by the distance from xi to the center cj of the cluster (the closer we are, the higher the membership should be, so it is really some inverse of a distance). Note that since the distance is measured over the attributes, you should normalize your attributes somehow, or one of them might dominate the distance. And this is exactly what Algorithm 2 does:
it estimates the center of the cluster for class Cj as the center of mass of all the known points in that class, i.e. just the average of those points by each coordinate (attribute);
it calculates the distance from each point xi to each estimated class center cj;
looking at the formula in step #12, it uses the inverse square of the distance as a measure of proximity and just normalizes the value, because for fuzzy sets the sum over all Cj of μCj(xi) should be 1 (see the sketch after this list).
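
To make Algorithm 2 concrete, here is a small sketch in Python/numpy. This is my own reading of the steps above, not the authors' code, and it assumes the attributes have already been normalized:

```python
# Sketch of class-center based fuzzification: crisp labels -> fuzzy memberships.
import numpy as np

def fuzzify(X, y):
    """X: (n, d) attribute matrix (assumed normalized), y: crisp class labels."""
    classes = np.unique(y)
    # Step 1: class center = center of mass of the known points in each class.
    centers = np.array([X[y == c].mean(axis=0) for c in classes])
    # Step 2: distance from every point to every class center.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Step 3: inverse-square distance as proximity, normalized so each row sums to 1.
    eps = 1e-12                    # guard against a point sitting exactly on a center
    prox = 1.0 / (dist ** 2 + eps)
    return prox / prox.sum(axis=1, keepdims=True)  # mu[i, j] = membership of x_i in C_j

# Example: two 1-D classes around -1 and +1.
X = np.array([[-1.2], [-0.9], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 0, 1, 1])
print(fuzzify(X, y))
```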

How can I do multi class classification using naive bayes classifier?

I am developing a disease classification system based on symptoms. I know training data is needed, but I don't have any. I have only the probabilities of symptoms for each disease. Is it possible to develop such a system?
There are two ways of extending simple classifiers to do multiclass classification (source: Wikipedia):
The first one is called the one-vs.-rest (OvR) strategy. It involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions. During inference, you give a sample to each model, retrieve the probability of belonging to the positive class, and choose the class whose classifier is most confident.
The second way is called the one-vs.-one (OvO) reduction: one trains K(K − 1)/2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set and must learn to distinguish those two classes. At prediction time, a voting scheme is applied: all K(K − 1)/2 classifiers are applied to an unseen sample, and the class that gets the highest number of "+1" predictions is predicted by the combined classifier. This approach can lead to ambiguity in some cases.
I would recommend using one-vs.-rest. It is already implemented in some packages, such as sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
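
A minimal sketch of the one-vs-rest wrapper (my own example on synthetic data; GaussianNB is just one choice of base classifier). Note that scikit-learn's naive Bayes classifiers already handle multiple classes natively, so the wrapper is shown purely to illustrate the reduction:

```python
# Sketch: one-vs-rest reduction around a naive Bayes base classifier.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

# One GaussianNB per class; prediction picks the most confident classifier.
ovr = OneVsRestClassifier(GaussianNB()).fit(X, y)
print(ovr.predict(X[:5]))
print(ovr.predict_proba(X[:5]))   # per-class confidence scores
```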

How to implement decision trees in boosting

I'm implementing AdaBoost (boosting) that will use CART and C4.5. I have read about AdaBoost, but I can't find a good explanation of how to join AdaBoost with decision trees. Let's say I have a data set D that has n examples. I split D into TR training examples and TE testing examples.
Let's say TR.count = m,
so I set the weights to 1/m, then I use TR to build the tree, test it with TR to find the wrongly classified examples, and test with TE to calculate the error. Then I change the weights. Now how do I get the next training set? What kind of sampling should I use (with or without replacement)? I know that the new training set should focus more on the samples that were wrongly classified, but how can I achieve this? How will CART or C4.5 know that they should focus on the examples with greater weight?
As far as I know, the TE data set isn't meant to be used to estimate the error rate. The raw data can be split into two parts (one for training, the other for cross-validation). Mainly, we have two methods of applying weights to the training data distribution. Which method to use is determined by the weak learner you choose.
How to apply the weights?
Re-sample the training data sets without replacement. This method can be viewed as a weighted boosting method. The generated re-sampled data sets contain the misclassified instances with higher probability than the correctly classified ones, therefore forcing the weak learning algorithm to concentrate on the misclassified data.
Directly use the weights when learning. Such models include Bayesian classification, decision trees (C4.5 and CART) and so on. With respect to C4.5, we calculate the information gain (mutual information) to determine which predictor will be selected as the next node. Hence we can combine the weights with the entropy to estimate the measurements. For example, we view the weights as the probability of each sample in the distribution. Given X = [1, 2, 3, 3] with weights [3/8, 1/16, 3/16, 6/16]: normally, the entropy of X is -0.25 log(0.25) - 0.25 log(0.25) - 0.5 log(0.5), but with the weights taken into consideration, its weighted entropy is -(3/8) log(3/8) - (1/16) log(1/16) - (9/16) log(9/16). Generally, standard C4.5 can be viewed as weighted C4.5 whose weights are [1, 1, ..., 1]/N (a small sketch of this arithmetic follows below).
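
The weighted-entropy arithmetic above, as a small sketch (my own illustration in Python):

```python
# Sketch: weighted entropy for X = [1, 2, 3, 3] with per-sample weights.
import numpy as np

def weighted_entropy(values, weights):
    # Aggregate the weight mass per distinct value, then apply the entropy formula.
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    probs = np.array([weights[values == v].sum() for v in np.unique(values)])
    return -(probs * np.log2(probs)).sum()

values = np.array([1, 2, 3, 3])
print(weighted_entropy(values, [1, 1, 1, 1]))              # unweighted: 1.5 bits
print(weighted_entropy(values, [3/8, 1/16, 3/16, 6/16]))   # weighted version
```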
If you want to implement AdaBoost.M1 with C4.5, you should read page 339 of The Elements of Statistical Learning.
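
For the "directly use the weights" route, here is a sketch of a single AdaBoost.M1 round following Algorithm 10.1 of that book (my own illustration; scikit-learn's CART implementation accepts per-sample weights in fit, so no resampling is needed):

```python
# Sketch: one AdaBoost.M1 round (ESL, Algorithm 10.1) where the tree
# consumes the weights directly instead of the data being resampled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
m = len(y)
w = np.full(m, 1.0 / m)             # initial weights: 1/m

stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
pred = stump.predict(X)

err = w[pred != y].sum() / w.sum()  # weighted training error (assume 0 < err < 0.5)
alpha = np.log((1.0 - err) / err)   # this round's classifier weight
w *= np.exp(alpha * (pred != y))    # up-weight only the misclassified samples
w /= w.sum()                        # renormalize; the next round fits with these w
```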

Group detection in data sets

Assume a group of data points, such as the one plotted here (this graph isn't specific to my problem, it is just used as a suitable example):
Inspecting the scatter plot visually, it's fairly obvious that the data points form two 'groups', with some random points that do not obviously belong to either.
I'm looking for an algorithm that would allow me to:
start with a data set of two or more dimensions;
detect such groups from the dataset without prior knowledge of how many (or whether any) might be there;
once the groups have been detected, 'ask' the model whether a new sample point seems to fit any of the groups.
There are many choices, but if you are interested in the probability that a new data point belongs to a particular mixture component, I would use a probabilistic approach such as Gaussian mixture modeling, estimated either by maximum likelihood or by Bayesian methods.
Maximum likelihood estimation of mixture models is implemented in MATLAB.
Your requirement that the number of components is unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet process prior on the mixture distribution and estimate it by some Bayesian method. For instance, see this paper on infinite Gaussian mixture models. The DP mixture model will give you inference over the number of components and the component each element belongs to, which is exactly what you want. Alternatively, you could perform model selection on the number of components, but this is generally less elegant.
There are many implementations of DP mixture models, but they may not be as convenient. For instance, here's a MATLAB implementation.
Your graph suggests you are an R user. In that case, if you are looking for prepackaged solutions, the answer to your question lies in this Task View for cluster analysis.
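
As a rough sketch of the probabilistic route (my own example on synthetic data; scikit-learn's BayesianGaussianMixture approximates the Dirichlet process mixture mentioned above, and in R the Task View lists comparable packages):

```python
# Sketch: mixture modeling with an upper bound on components; the Dirichlet
# process prior shrinks the weights of unneeded components toward zero.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two visible groups plus a few stray points.
X = np.vstack([rng.normal(-3, 0.5, (100, 2)),
               rng.normal(3, 0.5, (100, 2)),
               rng.uniform(-6, 6, (10, 2))])

gmm = BayesianGaussianMixture(
    n_components=10,   # an upper bound, not a fixed choice
    weight_concentration_prior_type="dirichlet_process",
).fit(X)

new_point = [[2.8, 3.1]]
print(gmm.predict_proba(new_point))  # probability of belonging to each component
```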
I think you are looking for something along the lines of a k-means clustering algorithm.
You should be able to find adequate implementations in most general purpose languages.
You need one of the clustering algorithms. All of them can be divided into two groups:
you specify the number of groups (clusters) yourself, e.g. 2 clusters in your example
the algorithm tries to guess the correct number of clusters by itself
If you want an algorithm of the 1st type, then k-means is what you really need.
If you want an algorithm of the 2nd type, then you probably need one of the hierarchical clustering algorithms. I haven't ever implemented any of them, but I can see an easy way to improve k-means in such a way that specifying the number of clusters becomes unnecessary.
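
For the 1st type, a minimal k-means sketch (my own example with scikit-learn; the silhouette score added here is one common heuristic for guessing the number of clusters when it isn't specified, which partly addresses the 2nd type too):

```python
# Sketch: k-means with the cluster count chosen by silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])

# Try k = 2..6 and keep the k with the best silhouette score.
best_k = max(range(2, 7),
             key=lambda k: silhouette_score(
                 X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)))
km = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
print(best_k, km.predict([[2.9, 3.2]]))  # assign a new point to a cluster
```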
