How can I do multi-class classification using a naive Bayes classifier? - data-structures

How can I do multi-class classification using a naive Bayes classifier?
I am developing a disease classification system based on symptoms. I know training data is needed, but I don't have any. I only have the probabilities of symptoms for each disease. Is it possible to develop this?

There are two common ways of extending binary classifiers to do multi-class classification:
(Source: Wikipedia)
The first one is called the one-vs.-rest (OvR) strategy. It involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires each base classifier to produce a real-valued confidence score for its decision. During inference, you give a sample to each model, retrieve the probability of belonging to the positive class, and choose the class whose classifier is most confident.
The second way is called the one-vs.-one (OvO) reduction: one trains K(K − 1)/2 binary classifiers for a K-way multi-class problem; each receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K(K − 1)/2 classifiers are applied to an unseen sample, and the class that receives the highest number of "+1" predictions is predicted by the combined classifier. This approach can lead to ambiguity (ties) in some cases.
I would recommend using one-vs.-rest. It is already implemented in some packages, such as scikit-learn:
http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
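For illustration, here is a minimal scikit-learn sketch (not part of the original answer) applying both strategies with a naive Bayes base estimator on the built-in iris data. Note that scikit-learn's naive Bayes estimators already handle several classes natively, so the wrappers mainly make the reduction explicit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-vs.-rest: one binary GaussianNB per class; the most confident one wins.
ovr = OneVsRestClassifier(GaussianNB()).fit(X_train, y_train)
print("OvR accuracy:", ovr.score(X_test, y_test))

# One-vs.-one: K(K - 1)/2 pairwise classifiers, combined by voting.
ovo = OneVsOneClassifier(GaussianNB()).fit(X_train, y_train)
print("OvO accuracy:", ovo.score(X_test, y_test))
```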

Related

How does a bagging classifier (averaging) work?

How does a bagging classifier work (averaging, not voting)? I am working on a bagging classifier and I want to use an average of the models, but when I bag models the result is a continuous value rather than a categorical value. Can I use averaging here? If yes, how?
You have to give more details on what programming language and library you are using.
If you are doing regression the bagging model can give you the average or a weighted average.
If you are doing classification then it can be voting or weighted voting.
However, if you are doing binary classification, then the average of the 1s and 0s can be used to give you a pseudo-probability or confidence for the prediction.
You can do this for non-binary classification too, by using the one-vs-all method to get probabilities for all possible classes.
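As a rough sketch (assuming scikit-learn rather than a particular language from the question), BaggingClassifier.predict_proba averages the members' predicted probabilities, which is exactly the pseudo-probability described above; taking the argmax turns the averages back into class labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, purely for illustration.
X, y = make_classification(n_samples=500, n_classes=2, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
bag.fit(X, y)

proba = bag.predict_proba(X[:5])   # probabilities averaged over the bagged members
labels = proba.argmax(axis=1)      # convert the averaged probabilities back to labels
print(proba)
print(labels)
```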

How to merge up multiple algorithms in WEKA?

I've visited
this tutorial
and got the idea of combining multiple algorithms using Vote, but I'm not clear about the actual mechanism of how it works. Is the first-mentioned algorithm applied to the data set first, and is the second algorithm then applied to the classifier we get from the first one?
Suppose I choose Naive Bayes and Bayes Net: what happens then? Is Naive Bayes applied to the given data set first, giving a classifier C1, and is Bayes Net then applied to C1, finally giving the final classifier C*?
Or is it that at each step both algorithms run and the result with more votes is carried forward?
Each ensemble member (or algorithm) is trained on its own training data. Once each of these has been trained, they are later combined using a specific voting scheme.
Generally, when test cases are presented for estimation, each of the algorithms generates its own estimate, and then the voting scheme determines how the classifiers' weights are applied and assigns the best output as the ensemble estimate.
That's not to say that it always works this way. One proposed model I used in the past selected a subset of algorithms depending on the locality of the test case in the problem space and weighted each member's vote differently. Each voting scheme works in a different way, and Weka has a few common ones that can be tried out.
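Weka itself is Java, but as a rough scikit-learn analog (with GaussianNB and logistic regression standing in for Naive Bayes and Bayes Net, and the iris data used only for illustration), the mechanism looks like this: every member is fitted independently on the same training data, and only their predictions are combined by the vote; no model is trained on another model's output.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

vote = VotingClassifier(
    estimators=[("nb", GaussianNB()), ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",   # majority vote; "soft" would average predicted probabilities
)
vote.fit(X, y)       # each member is trained independently on the same data
print(vote.predict(X[:3]))
```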

Data mining algorithm selection for 3 classes with negative and positive values

I am trying to handle a data set in MATLAB with 3 classes and negative and positive values on the attributes. I tried a naive Bayes classifier, but MATLAB says that naive Bayes can't handle negative values. The SVM algorithm also can't handle this problem because there are 3 classes. So, I am asking you which algorithm to choose?
Thank you in advance!!
The simplest solution that comes to mind is a k-NN classifier using majority voting. Say you want to classify a point and you use the 10 nearest neighbours. If six out of 10 are class 1, two are class 2 and the remaining two are class 3, you would classify your point as class 1.
If you want to include nonlinearity (as in the case of an SVM), you can use nonlinear kernels in k-NN too, which basically means modifying the distance calculation.
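As a small illustration (using scikit-learn rather than MATLAB, and made-up data), k-NN with majority voting has no trouble with negative attribute values or with three classes:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three classes; negative feature values are perfectly fine for k-NN.
X = np.array([[-1.2, 0.5], [-0.8, 0.3],
              [ 2.0, -1.5], [ 1.7, -1.1],
              [ 0.1, 3.2], [ 0.2, 2.9]])
y = np.array([1, 1, 2, 2, 3, 3])

knn = KNeighborsClassifier(n_neighbors=3)   # class = majority vote of the 3 nearest neighbours
knn.fit(X, y)
print(knn.predict([[-1.0, 0.4]]))           # -> [1]
```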
citing wikipedia:
Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.
The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[8] Common methods for such reduction include:[8] [9]
Building binary classifiers which distinguish between (i) one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
Directed Acyclic Graph SVM (DAGSVM)[10]
error-correcting output codes[11]
Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[12] See also Lee, Lin and Wahba.[13][14]
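As a quick sanity check of the quoted approach (again a scikit-learn sketch with made-up data, not MATLAB), an SVM handles three classes because the library performs the one-vs-one reduction internally:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.2, 0.5], [-0.8, 0.3],
              [ 2.0, -1.5], [ 1.7, -1.1],
              [ 0.1, 3.2], [ 0.2, 2.9]])
y = np.array([1, 1, 2, 2, 3, 3])

# SVC trains one-vs-one binary SVMs under the hood for multi-class problems.
clf = SVC(kernel="rbf", decision_function_shape="ovo")
clf.fit(X, y)
print(clf.predict([[1.5, -1.0]]))   # expected -> [2]
```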

Classification algorithms whose classifications can be evaluated as percentages

I'm implementing different classification algorithms to predict the outcome of soccer matches (home, draw or away). In order to compare the different classifiers, their classifications are evaluated as percentages.
At the moment I'm using k-nearest neighbours (counting neighbours of different classes to convert to percentages) and naive Bayes.
Besides k-NN and naive Bayes, which classifiers can be used for this task?
Support Vector Machines are probably the most common classifiers appearing in the literature right now, and there are several Random Forest classification schemes as well. Look at Weka for a package supporting those methods (and others) in Java. Also, R has a lot of tools for machine learning, so you could quickly test other algorithms without having to implement them yourself.
A logistic model will naturally express itself as probabilities. For soccer, quite a few people have modelled the goals scored by each side as a Poisson process, with rate depending on the relative strengths of the defense and offense concerned.
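For instance, a hedged scikit-learn sketch with invented match features shows how a logistic model yields Home/Draw/Away percentages directly via predict_proba:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: rating difference, home-team form, away-team form.
X = np.array([[ 0.8, 0.6, 0.3],
              [-0.5, 0.2, 0.7],
              [ 0.1, 0.4, 0.4],
              [ 1.2, 0.9, 0.1],
              [-0.9, 0.3, 0.8],
              [ 0.0, 0.5, 0.5]])
y = np.array(["H", "A", "D", "H", "A", "D"])   # Home / Away / Draw outcomes

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba([[0.3, 0.5, 0.4]])        # one probability per outcome
print(dict(zip(model.classes_, proba[0].round(3))))   # e.g. {'A': ..., 'D': ..., 'H': ...}
```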

What is the difference between a generative and a discriminative algorithm? [closed]

What is the difference between a generative and a discriminative algorithm?
Let's say you have input data x and you want to classify the data into labels y. A generative model learns the joint probability distribution p(x,y) and a discriminative model learns the conditional probability distribution p(y|x) - which you should read as "the probability of y given x".
Here's a really simple example. Suppose you have the following data in the form (x,y):
(1,0), (1,0), (2,0), (2, 1)
p(x,y) is
y=0 y=1
-----------
x=1 | 1/2 0
x=2 | 1/4 1/4
p(y|x) is
y=0 y=1
-----------
x=1 | 1 0
x=2 | 1/2 1/2
If you take a few minutes to stare at those two matrices, you will understand the difference between the two probability distributions.
The distribution p(y|x) is the natural distribution for classifying a given example x into a class y, which is why algorithms that model this directly are called discriminative algorithms. Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes rule and then used for classification. However, the distribution p(x,y) can also be used for other purposes. For example, you could use p(x,y) to generate likely (x,y) pairs.
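As a quick check, the two tables can be reproduced directly from the four data points (a small Python sketch, not part of the original answer):

```python
from collections import Counter

data = [(1, 0), (1, 0), (2, 0), (2, 1)]
n = len(data)

joint = {pair: count / n for pair, count in Counter(data).items()}   # p(x,y)
px = Counter(x for x, _ in data)                                     # counts of x
conditional = {(x, y): joint.get((x, y), 0) / (px[x] / n)            # p(y|x)
               for x in (1, 2) for y in (0, 1)}

print(joint)        # {(1, 0): 0.5, (2, 0): 0.25, (2, 1): 0.25}
print(conditional)  # {(1, 0): 1.0, (1, 1): 0.0, (2, 0): 0.5, (2, 1): 0.5}
```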
From the description above, you might be thinking that generative models are more generally useful and therefore better, but it's not as simple as that. This paper is a very popular reference on the subject of discriminative vs. generative classifiers, but it's pretty heavy going. The overall gist is that discriminative models generally outperform generative models in classification tasks.
A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal?
A discriminative algorithm does not care about how the data was generated, it simply categorizes a given signal.
Imagine your task is to classify a speech to a language.
You can do it by either:
learning each language, and then classifying it using the knowledge you just gained
or
determining the difference in the linguistic models without learning the languages, and then classifying the speech.
The first one is the generative approach and the second one is the discriminative approach.
Check this reference for more details: http://www.cedar.buffalo.edu/~srihari/CSE574/Discriminative-Generative.pdf.
In practice, the models are used as follows.
In discriminative models, to predict the label y from a training example x, you must evaluate
y* = argmax_y p(y|x),
which merely chooses the most likely class y considering x. It's like we were trying to model the decision boundary between the classes. This behavior is very clear in neural networks, where the computed weights can be seen as a complexly shaped curve isolating the elements of a class in the space.
Now, using Bayes' rule, let's replace p(y|x) in the equation by p(x|y) p(y) / p(x). Since you are just interested in the arg max, you can wipe out the denominator, which will be the same for every y. So, you are left with
y* = argmax_y p(x|y) p(y),
which is the equation you use in generative models.
While in the first case you had the conditional probability distribution p(y|x), which modeled the boundary between classes, in the second you had the joint probability distribution p(x, y), since p(x | y) p(y) = p(x, y), which explicitly models the actual distribution of each class.
With the joint probability distribution function, given a y, you can calculate ("generate") its respective x. For this reason, they are called "generative" models.
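A tiny sketch of this generative decision rule, y* = argmax_y p(x|y) p(y), with made-up priors and likelihoods for a two-class animal example (the numbers and feature are invented for illustration):

```python
p_y = {"dog": 0.7, "elephant": 0.3}                 # class prior p(y), assumed
p_x_given_y = {                                     # likelihood p(x|y), assumed
    "dog":      {"small": 0.90, "large": 0.10},
    "elephant": {"small": 0.05, "large": 0.95},
}

def classify(x):
    # Pick the class with the largest joint score p(x|y) * p(y).
    return max(p_y, key=lambda y: p_x_given_y[y][x] * p_y[y])

print(classify("large"))   # elephant: 0.95 * 0.3 = 0.285 > 0.10 * 0.7 = 0.07
print(classify("small"))   # dog:      0.90 * 0.7 = 0.63  > 0.05 * 0.3 = 0.015
```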
Here's the most important part from the lecture notes of CS229 (by Andrew Ng) related to the topic, which really helped me understand the difference between discriminative and generative learning algorithms.
Suppose we have two classes of animals, elephant (y = 1) and dog (y = 0). And x is the feature vector of the animals.
Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line — that is, a decision boundary — that separates the elephants and dogs. Then, to classify
a new animal as either an elephant or a dog, it checks on which side of the
decision boundary it falls, and makes its prediction accordingly. We call these discriminative learning algorithms.
Here's a different approach. First, looking at elephants, we can build a
model of what elephants look like. Then, looking at dogs, we can build a
separate model of what dogs look like. Finally, to classify a new animal,
we can match the new animal against the elephant model, and match it against
the dog model, to see whether the new animal looks more like the elephants
or more like the dogs we had seen in the training set. We call these generative learning algorithms.
The different models are summed up in a comparison table in the Supervised Learning cheatsheet of Stanford CS 229 (Machine Learning).
Generally, there is a practice in the machine learning community not to model something you don't need. For example, consider a classification problem where the goal is to assign y labels to a given input x. If we use a generative model
p(x,y) = p(y|x) p(x)
we have to model p(x), which is irrelevant for the task at hand. Practical limitations like data sparseness will force us to model p(x) with some weak independence assumptions. Therefore, we intuitively use discriminative models for classification.
The short answer
Many of the answers here rely on the widely-used mathematical definition [1]:
Discriminative models directly learn the conditional predictive distribution p(y|x).
Generative models learn the joint distribution p(x,y) (or rather, p(x|y) and p(y)).
Predictive distribution p(y|x) can be obtained with Bayes' rule.
Although very useful, this narrow definition assumes the supervised setting, and is less handy when examining unsupervised or semi-supervised methods. It also doesn't apply to many contemporary approaches for deep generative modeling. For example, now we have implicit generative models, e.g. Generative Adversarial Networks (GANs), which are sampling-based and don't even explicitly model the probability density p(x) (instead learning a divergence measure via the discriminator network). But we call them "generative models" since they are used to generate (high-dimensional [10]) samples.
A broader and more fundamental definition [2] seems equally fitting for this general question:
Discriminative models learn the boundary between classes.
So they can discriminate between different kinds of data instances.
Generative models learn the distribution of data.
So they can generate new data instances.
A closer look
Even so, this question implies somewhat of a false dichotomy [3]. The generative-discriminative "dichotomy" is in fact a spectrum which you can even smoothly interpolate between [4].
As a consequence, this distinction gets arbitrary and confusing, especially when many popular models do not neatly fall into one or the other [5,6], or are in fact hybrid models (combinations of classically "discriminative" and "generative" models).
Nevertheless it's still a highly useful and common distinction to make. We can list some clear-cut examples of generative and discriminative models, both canonical and recent:
Generative: Naive Bayes, latent Dirichlet allocation (LDA), Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), normalizing flows.
Discriminative: Support vector machine (SVM), logistic regression, most deep neural networks.
There is also a lot of interesting work deeply examining the generative-discriminative divide [7] and spectrum [4,8], and even transforming discriminative models into generative models [9].
In the end, definitions are constantly evolving, especially in this rapidly growing field :) It's best to take them with a pinch of salt, and maybe even redefine them for yourself and others.
Sources
Possibly originating from "Machine Learning - Discriminative and Generative" (Tony Jebara, 2004).
Crash Course in Machine Learning by Google
The Generative-Discriminative Fallacy
"Principled Hybrids of Generative and Discriminative Models" (Lasserre et al., 2006)
shimao's question
Binu Jasim's answer
Comparing logistic regression and naive Bayes: cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
"On Discriminative vs. Generative classifiers"
Comment on "On Discriminative vs. Generative classifiers"
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/DengJaitly2015-ch1-2.pdf
"Your classifier is secretly an energy-based model" (Grathwohl et al., 2019)
Stanford CS236 notes: Technically, a probabilistic discriminative model is also a generative model of the labels conditioned on the data. However, the term generative models is typically reserved for high dimensional data.
An additional informative point that goes well with the answer by StompChicken above.
The fundamental difference between discriminative models and generative models is:
Discriminative models learn the (hard or soft) boundary between classes
Generative models model the distribution of individual classes
Edit:
A Generative model is the one that can generate data. It models both the features and the class (i.e. the complete data).
If we model P(x,y): I can use this probability distribution to generate data points - and hence all algorithms modeling P(x,y) are generative.
Eg. of generative models
Naive Bayes models P(c) and P(d|c) - where c is the class and d is the feature vector.
Also, P(c,d) = P(c) * P(d|c)
Hence, Naive Bayes in some form models P(c,d); see the sketch after this answer.
Bayes Net
Markov Nets
A discriminative model is one that can only be used to discriminate/classify the data points.
You only need to model P(y|x) in such cases (i.e. the probability of a class given the feature vector).
Eg. of discriminative models:
logistic regression
Neural Networks
Conditional random fields
In general, generative models need to model much more than discriminative models and hence are sometimes not as effective. As a matter of fact, most (though not necessarily all) unsupervised learning algorithms, such as clustering, can be called generative, since they model P(d) (and there are no classes :P).
PS: Part of the answer is taken from source
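As a minimal sketch (with made-up numbers, not taken from the quoted source), modelling P(c) and P(d|c), as Naive Bayes does, is enough to generate new (feature, class) pairs, which is what makes such models "generative":

```python
import random

P_c = {"spam": 0.4, "ham": 0.6}                      # class prior P(c), assumed
P_d_given_c = {                                      # feature model P(d|c), assumed
    "spam": {"offer": 0.7, "meeting": 0.3},
    "ham":  {"offer": 0.2, "meeting": 0.8},
}

def generate():
    c = random.choices(list(P_c), weights=P_c.values())[0]              # c ~ P(c)
    d = random.choices(list(P_d_given_c[c]),
                       weights=P_d_given_c[c].values())[0]              # d ~ P(d|c)
    return d, c

print([generate() for _ in range(5)])   # five generated (feature, class) pairs
```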
A generative model learns the full data distribution from the training data and uses it to predict the response.
A discriminative algorithm's job is just to classify, that is, to differentiate between the possible outcomes.
All previous answers are great, and I'd like to plug in one more point.
From generative models we can derive any distribution, while from discriminative models we can only obtain the conditional distribution P(Y|X) (or, put differently, they are only useful for discriminating Y's label), which is why they are called discriminative models. A discriminative model doesn't assume that the X's are independent given Y ($X_i \perp X_{-i} | Y$) and hence is usually more powerful for calculating that conditional distribution.
My two cents:
Discriminative approaches highlight differences
Generative approaches do not focus on differences; they try to build a model that is representative of the class.
There is an overlap between the two.
Ideally both approaches should be used: one will be useful to find similarities and the other will be useful to find dis-similarities.
This article helped me a lot in understanding the concept.
In summary,
Both are probabilistic models, meaning they both use probability (conditional probability, to be precise) to calculate classes for the unknown data.
Generative classifiers apply the joint PDF and Bayes' theorem to the data set and calculate the conditional probability from those values.
Discriminative classifiers directly find the conditional probability from the data set.
Some good reading material: conditional probability, joint PDF
