Why can't we calculate an ROC curve in cost-sensitive learning?

In the cost-sensitive learning section of the book Applied Predictive Modeling, the authors write:
One consequence of this approach is that class probabilities cannot be
generated for the model, at least in the available implementation.
Therefore we cannot calculate an ROC curve and must use a different
performance metric. Instead we will now use the Kappa statistic,
sensitivity, and specificity to evaluate the impact of weighted
classes.
Can you explain why the Kappa statistic, sensitivity, and specificity are used instead of ROC/AUC? I thought sensitivity and specificity were themselves part of ROC and AUC.
Link for the book: https://cloudflare-ipfs.com/ipfs/bafykbzacedepga3g6t7b6rq6irwhy5gzpc47bamquhygup4eqggidvkjcztqs?filename=Max%20Kuhn%2C%20Kjell%20Johnson%20-%20Applied%20Predictive%20Modeling-Springer%20%282013%29.pdf

Related

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on knowledge graphs, more precisely in the natural language processing field. To evaluate the components of my algorithm, I need to be able to separate the good candidates from the poor ones. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I am able to calculate:
Precision = X
Recall = Y
To draw a complete curve I need the rest of the (X, Y) points. What should I do?
Build another test dataset?
Split my dataset?
Or is there another solution?

Neither of your two proposed methods. In short, precision-recall and ROC curves are designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), the classifier must provide a score, typically a probability in the [0,1] range. This is the function to do it in sklearn; note how the 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you set a threshold, say at 0.5. Choosing a single threshold is problematic, however: varying it trades precision off against recall, so an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures such as the area under the ROC or precision-recall curve are used. They sweep over thresholds, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes at each one, and compute precision and recall for each threshold.
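The threshold sweep described above can be sketched in plain Python. This is only an illustration, not the sklearn implementation: the function name precision_recall_points, the labels, the scores, and the chosen thresholds are all made up for the example.

```python
def precision_recall_points(y_true, scores, thresholds):
    """Return one (threshold, precision, recall) tuple per threshold."""
    points = []
    for t in thresholds:
        # An item is predicted positive when its score reaches the threshold.
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        fn = sum(1 for p, y in zip(predicted, y_true) if not p and y == 1)
        # Assumed convention: precision is 1.0 when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points


# Hypothetical manually labelled pairs (1 = good, 0 = poor) and system scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = precision_recall_points(y_true, scores, thresholds=[0.2, 0.5, 0.85])
for t, precision, recall in points:
    print(f"threshold={t:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Each threshold yields one point of the curve, so a single labelled test set is enough as long as the system exposes a score rather than only a yes/no decision.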

Evaluation in Elki

I know ELKI currently only includes unsupervised outlier detection methods, so ELKI doesn't divide the input data into a training set and a test set.
But I've seen that evaluation is done over the minority class when available. I would like to know:
Does ELKI use all input data for evaluation?
Does the reported runtime include evaluation, or just training time?
Does evaluation take the outlier scores into account to estimate the false positive rate and true positive rate in order to evaluate rankings?
In the LOF algorithm, for example, suppose an instance in the normal class has a high LOF score. Will it be considered a false positive or a true positive in evaluation?
Thanks!

Yes, all input is used for unsupervised methods.
The labels must not have been used for running the algorithm, they are only used at evaluation time.
Runtime is reported separately for every algorithm.
This depends on your evaluation. Most measures (e.g. ROC AUC) only take the ranking into account. To evaluate the actual scores, you first need to normalize them. For a measure that takes (normalized) scores into account, please see
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel. On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA: 1047–1058, 2012.
True positives and false positives require a binary decision. See ROC AUC for an approach that does not require you to specify a threshold to make the decision binary, but instead evaluates all possible thresholds.
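To illustrate why ROC AUC depends only on the ranking, here is a minimal sketch in plain Python. It is not ELKI code: the function name roc_auc_from_scores and the labels and LOF-like scores are hypothetical. It uses the pairwise formulation of AUC, the fraction of (outlier, inlier) pairs in which the outlier receives the higher score, with ties counted as one half.

```python
def roc_auc_from_scores(labels, scores):
    """ROC AUC as the fraction of (outlier, inlier) pairs ranked correctly."""
    outliers = [s for s, y in zip(scores, labels) if y == 1]
    inliers = [s for s, y in zip(scores, labels) if y == 0]
    if not outliers or not inliers:
        raise ValueError("need at least one outlier and one inlier")
    wins = 0.0
    for o in outliers:
        for i in inliers:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(outliers) * len(inliers))


# Hypothetical data: 1 = labelled outlier, 0 = normal instance.
labels = [1, 0, 0, 1, 0]
lof_scores = [3.2, 2.9, 1.0, 1.4, 1.1]  # the second instance is a high-scoring inlier

auc = roc_auc_from_scores(labels, lof_scores)

# A monotone rescaling changes the scores but not the ranking, so AUC is unchanged.
rescaled = [10 * s + 5 for s in lof_scores]
auc_rescaled = roc_auc_from_scores(labels, rescaled)
print(auc, auc_rescaled)
```

In this toy data the normal instance with score 2.9 outranks one labelled outlier, which lowers the AUC without any threshold ever being fixed.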

Reasonable bit string size for genetic algorithm convergence

In a typical genetic algorithm, is there any guideline for estimating the generations required to converge given the amount of entropy in the description of an individual in the population?
Also, I suppose it is reasonable to also require the number of offspring per generation and rate of mutation, but adjustment of those parameters is of less interest to me at the moment.

Well, there are not any concrete guidelines in the form of mathematical models, but there are several concepts that people use to communicate about parameter settings and advice on how to choose them. One of these concepts is diversity, which would be similar to the entropy that you mentioned. The other concept is called selection pressure and determines the chance an individual has to be selected based on its relative fitness.
Diversity and selection pressure can be computed for each generation, but the change between generations is very difficult to estimate. You would also need models that predict the expected quality of your crossover and mutation operator in order to estimate the fitness distribution in the next generation.
There has been work published on these topics very recently:
* Chicano and Alba. 2011. Exact Computation of the Expectation Curves of the Bit-Flip Mutation using Landscapes Theory
* Chicano, Whitley, and Alba. 2012. Exact computation of the expectation curves for uniform crossover
Does your question come from a general research interest, or are you looking for practical guidance?

No. If you define a mathematical model of the algorithm (initial population, combination function, mutation function) you can use normal mathematical methods to calculate what you want to know, but "typical genetic algorithm" is too vague to have any meaningful answer.
If you want to set the hyperparameters of some genetic algorithm (e.g. the number of "DNA" bits), then this is typically done in the usual way for any machine learning algorithm, with a cross-validation set.

Lack of diversification, is it really a drawback of Genetic Algorithms?

We know that Genetic Algorithms (or evolutionary computation) work with an encoding of the points in our solution space Ω rather than with these points directly. In the literature, we often find that GAs have this drawback: (1) since many chromosomes are coded into a similar point of Ω, or similar chromosomes map to very different points, the efficiency is quite low. Do you think that is really a drawback? These kinds of algorithms use the mutation operator in each iteration to diversify the candidate solutions. To add more diversification we simply increase the probability of crossover. And we mustn't forget that our initial population (of chromosomes) is randomly generated (yet more diversification). The question is: if you think that (1) is a drawback of GAs, can you provide more details? Thank you.

Mutation and random initialization are not enough to combat the problem known as genetic drift, which is the major problem of genetic algorithms. Genetic drift means that the GA may quickly lose most of its genetic diversity, so that the search proceeds in a way that is not beneficial for crossover. This happens because the random initial population quickly converges. Mutation is a different matter: if it is high it will diversify, true, but at the same time it will prevent convergence, and the solutions will remain at a certain distance from the optimum with higher probability. You will need to adapt the mutation probability (not the crossover probability) during the search. In a similar manner the Evolution Strategy, which is similar to a GA, adapts the mutation strength during the search.
We have developed a variant of the GA that is called OffspringSelection GA (OSGA) which introduces another selection step after crossover. Only those children will be accepted that surpass their parents' fitness (the better, the worse or any linearly interpolated value). This way you can even use random parent selection and put the bias on the quality of the offspring. It has been shown that this slows the genetic drift. The algorithm is implemented in our framework HeuristicLab. It features a GUI so you can download and try it on some problems.
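The offspring selection idea described above can be sketched roughly as follows. This is a simplified illustration and not the HeuristicLab implementation: the OneMax fitness, the function names, the one-point crossover, the mutation rate, the attempt limit, and the rule for filling up the population when too few children succeed are all assumptions made for the example. Here a child is accepted only if it beats the better of its two parents.

```python
import random


def onemax(bits):
    """Toy fitness (assumption for this sketch): the number of 1-bits."""
    return sum(bits)


def offspring_selection_generation(population, rng, mutation_rate=0.02, max_attempts=5000):
    """Build one new generation from children that surpass their better parent."""
    next_gen = []
    for _ in range(max_attempts):
        if len(next_gen) == len(population):
            break
        # Random parent selection: the bias comes from accepting only good children.
        mother, father = rng.sample(population, 2)
        cut = rng.randrange(1, len(mother))
        child = mother[:cut] + father[cut:]
        child = [1 - b if rng.random() < mutation_rate else b for b in child]
        if onemax(child) > max(onemax(mother), onemax(father)):
            next_gen.append(child)
    # Assumed fallback: fill the remaining slots with copies of old individuals.
    while len(next_gen) < len(population):
        next_gen.append(list(rng.choice(population)))
    return next_gen


rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(30)] for _ in range(20)]
new_population = offspring_selection_generation(population, rng)
print(min(map(onemax, population)), min(map(onemax, new_population)))
```

Using the worse parent, or a value interpolated between the two parents, as the acceptance threshold would only change the comparison in the if statement.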
Other techniques that combat genetic drift are niching and crowding which let the diversity flow into the selection and thus introduce another, but likely different bias.
EDIT: I want to add that having multiple solutions of equal quality might of course pose a problem, as it creates neutral areas in the search space. However, I think you didn't really mean that. The primary problem is genetic drift, i.e. the loss of (important) genetic information.

As a sidenote, you (the OP) said:
We know that Genetic Algorithms (or evolutionary computation) work with an encoding of the points in our solution space Ω rather than these points directly.
This is not always true. An individual is coded as a genotype, which can have any shape, such as a string (genetic algorithms) or a vector of reals (evolution strategies). Each genotype is transformed into a phenotype when assessing the individual, i.e. when its fitness is calculated. In some cases, the phenotype is identical to the genotype: this is called direct coding. Otherwise, the coding is called indirect. (You may find more definitions here (section 2.2.1).)
Example of direct encoding:
http://en.wikipedia.org/wiki/Neuroevolution#Direct_and_Indirect_Encoding_of_Networks
Example of indirect encoding:
Suppose you want to optimize the size of a rectangular parallelepiped defined by its length, height and width. To simplify the example, assume that these three quantities are integers between 0 and 15. We can then describe each of them using a 4-bit binary number. An example of a potential solution may be the genotype 0001 0111 1010. The corresponding phenotype is a parallelepiped of length 1, height 7 and width 10.
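As a small illustration of this indirect encoding, the genotype-to-phenotype mapping for three 4-bit fields could be written as below. The function name decode is hypothetical, and the genotype used is the 12-bit string 0001 0111 1010 that matches the phenotype of length 1, height 7 and width 10.

```python
def decode(genotype):
    """Map a 12-bit genotype string to the phenotype (length, height, width)."""
    bits = genotype.replace(" ", "")
    if len(bits) != 12 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 12 binary digits")
    # Each consecutive group of 4 bits is read as an unsigned integer in 0..15.
    return tuple(int(bits[i:i + 4], 2) for i in range(0, 12, 4))


phenotype = decode("0001 0111 1010")
print(phenotype)  # (1, 7, 10)
```

The genetic operators (crossover, mutation) act on the bit string, while the fitness function only ever sees the decoded triple.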
Now back to the original question on diversity: in addition to what DonAndre said, you could read chapter 9, "Multi-Modal Problems and Spatial Distribution", of the excellent book Introduction to Evolutionary Computing by A. E. Eiben and J. E. Smith, as well as a research paper on the matter such as Encouraging Behavioral Diversity in Evolutionary Robotics: an Empirical Study. In a word, diversity is not a drawback of GAs, it is "just" an issue.

What is the difference between a generative and a discriminative algorithm? [closed]

What is the difference between a generative and a discriminative algorithm?

Let's say you have input data x and you want to classify the data into labels y. A generative model learns the joint probability distribution p(x,y) and a discriminative model learns the conditional probability distribution p(y|x) - which you should read as "the probability of y given x".
Here's a really simple example. Suppose you have the following data in the form (x,y):
(1,0), (1,0), (2,0), (2, 1)
p(x,y) is
      y=0   y=1
     -----------
x=1 | 1/2   0
x=2 | 1/4   1/4
p(y|x) is
      y=0   y=1
     -----------
x=1 | 1     0
x=2 | 1/2   1/2
If you take a few minutes to stare at those two matrices, you will understand the difference between the two probability distributions.
The distribution p(y|x) is the natural distribution for classifying a given example x into a class y, which is why algorithms that model this directly are called discriminative algorithms. Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes rule and then used for classification. However, the distribution p(x,y) can also be used for other purposes. For example, you could use p(x,y) to generate likely (x,y) pairs.
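The two tables can be reproduced from the four data points with a few lines of Python. This is just a sketch using exact fractions; the variable names are chosen for the example. The conditional is obtained from the joint by dividing by the marginal p(x).

```python
from collections import Counter, defaultdict
from fractions import Fraction

data = [(1, 0), (1, 0), (2, 0), (2, 1)]
n = len(data)

# Joint distribution p(x, y): relative frequency of each (x, y) pair.
joint = {pair: Fraction(count, n) for pair, count in Counter(data).items()}

# Marginal p(x): sum the joint over y.
p_x = defaultdict(Fraction)
for (x, _), p in joint.items():
    p_x[x] += p

# Conditional p(y | x) = p(x, y) / p(x); pairs never observed get probability 0.
xs = sorted({x for x, _ in data})
ys = sorted({y for _, y in data})
conditional = {
    (x, y): joint.get((x, y), Fraction(0)) / p_x[x] for x in xs for y in ys
}

for x in xs:
    print(f"x={x}", [str(joint.get((x, y), Fraction(0))) for y in ys],
          [str(conditional[(x, y)]) for y in ys])
```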
From the description above, you might be thinking that generative models are more generally useful and therefore better, but it's not as simple as that. This paper is a very popular reference on the subject of discriminative vs. generative classifiers, but it's pretty heavy going. The overall gist is that discriminative models generally outperform generative models in classification tasks.

A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal?
A discriminative algorithm does not care about how the data was generated, it simply categorizes a given signal.

Imagine your task is to classify a speech to a language.
You can do it by either:
learning each language, and then classifying it using the knowledge you just gained
or
determining the difference in the linguistic models without learning the languages, and then classifying the speech.
The first one is the generative approach and the second one is the discriminative approach.
Check this reference for more details: http://www.cedar.buffalo.edu/~srihari/CSE574/Discriminative-Generative.pdf.

In practice, the models are used as follows.
In discriminative models, to predict the label y from the training example x, you must evaluate:
$$f(x) = \arg\max_y \; p(y \mid x)$$
which merely chooses what is the most likely class y considering x. It's like we were trying to model the decision boundary between the classes. This behavior is very clear in neural networks, where the computed weights can be seen as a complexly shaped curve isolating the elements of a class in the space.
Now, using Bayes' rule, let's replace the $p(y \mid x)$ in the equation by $\frac{p(x \mid y)\, p(y)}{p(x)}$. Since you are just interested in the arg max, you can wipe out the denominator, that will be the same for every y. So, you are left with
$$f(x) = \arg\max_y \; p(x \mid y)\, p(y)$$
which is the equation you use in generative models.
While in the first case you had the conditional probability distribution p(y|x), which modeled the boundary between classes, in the second you had the joint probability distribution p(x, y), since p(x | y) p(y) = p(x, y), which explicitly models the actual distribution of each class.
With the joint probability distribution function, given a y, you can calculate ("generate") its respective x. For this reason, they are called "generative" models.
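A tiny numeric check of the argument above: dropping the denominator p(x) does not change which class wins the arg max. The prior and likelihood tables, the class names, and the function names below are all hypothetical values chosen for this sketch.

```python
# Hypothetical class prior p(y) and class-conditional likelihood p(x | y).
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"offer": 0.8, "meeting": 0.2},
    "ham": {"offer": 0.1, "meeting": 0.9},
}


def predict_from_joint(x):
    """arg max over y of p(x | y) p(y), i.e. of the joint p(x, y)."""
    return max(prior, key=lambda y: likelihood[y][x] * prior[y])


def posterior(x):
    """p(y | x) via Bayes' rule, including the denominator p(x)."""
    evidence = sum(likelihood[y][x] * prior[y] for y in prior)
    return {y: likelihood[y][x] * prior[y] / evidence for y in prior}


def predict_from_posterior(x):
    """arg max over y of p(y | x)."""
    post = posterior(x)
    return max(post, key=post.get)


for x in ("offer", "meeting"):
    print(x, predict_from_joint(x), predict_from_posterior(x), posterior(x))
```

Both prediction functions pick the same label for every x, because p(x) rescales all classes by the same positive factor.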

Here's the most important part from the lecture notes of CS229 (by Andrew Ng) related to the topic, which really helped me understand the difference between discriminative and generative learning algorithms.
Suppose we have two classes of animals, elephant (y = 1) and dog (y = 0). And x is the feature vector of the animals.
Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line — that is, a decision boundary — that separates the elephants and dogs. Then, to classify
a new animal as either an elephant or a dog, it checks on which side of the
decision boundary it falls, and makes its prediction accordingly. We call these discriminative learning algorithms.
Here's a different approach. First, looking at elephants, we can build a
model of what elephants look like. Then, looking at dogs, we can build a
separate model of what dogs look like. Finally, to classify a new animal,
we can match the new animal against the elephant model, and match it against
the dog model, to see whether the new animal looks more like the elephants
or more like the dogs we had seen in the training set. We call these generative learning algorithms.
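The two approaches can be contrasted on a toy version of the elephant/dog example. Everything here is an assumption made for illustration: a single feature (body weight in tonnes), made-up training values, a perceptron as the discriminative learner, and one Gaussian per class as the generative model. It is a sketch, not the method from the lecture notes.

```python
import math

# Hypothetical training set: (weight in tonnes, label), 1 = elephant, 0 = dog.
train = [(4.0, 1), (5.2, 1), (4.7, 1), (0.012, 0), (0.03, 0), (0.045, 0)]


def train_perceptron(data, epochs=100):
    """Discriminative: learn a boundary w*x + b = 0 directly from mistakes."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            pred = 1 if w * x + b > 0 else 0
            if pred != y:
                w += (y - pred) * x
                b += (y - pred)
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def fit_class_models(data):
    """Generative: fit a Gaussian p(x | y) and a prior p(y) for each class."""
    models = {}
    for label in {y for _, y in data}:
        xs = [x for x, y in data if y == label]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        models[label] = (mean, var, len(xs) / len(data))
    return models


def generative_predict(x, models):
    """Pick the class whose model gives the highest log p(x | y) + log p(y)."""
    def log_joint(model):
        mean, var, prior = model
        return (math.log(prior) - 0.5 * math.log(2 * math.pi * var)
                - (x - mean) ** 2 / (2 * var))
    return max(models, key=lambda label: log_joint(models[label]))


w, b = train_perceptron(train)
models = fit_class_models(train)
for x in (3.5, 0.02):
    boundary_pred = 1 if w * x + b > 0 else 0
    print(x, boundary_pred, generative_predict(x, models))
```

The perceptron keeps only the boundary parameters w and b, while the generative side keeps a description of what each class looks like and could also be used to sample plausible weights for either animal.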

The different models are summed up in the table below:
Image source: Supervised Learning cheatsheet - Stanford CS 229 (Machine Learning)

Generally, there is a practice in the machine learning community not to learn something that you don't want to. For example, consider a classification problem where one's goal is to assign labels y to a given input x. If we use a generative model
p(x,y) = p(y|x) p(x)
we have to model p(x), which is irrelevant for the task at hand. Practical limitations like data sparseness will force us to model p(x) with some weak independence assumptions. Therefore, we intuitively use discriminative models for classification.

The short answer
Many of the answers here rely on the widely-used mathematical definition [1]:
Discriminative models directly learn the conditional predictive distribution p(y|x).
Generative models learn the joint distribution p(x,y) (or rather, p(x|y) and p(y)).
Predictive distribution p(y|x) can be obtained with Bayes' rule.
Although very useful, this narrow definition assumes the supervised setting, and is less handy when examining unsupervised or semi-supervised methods. It also doesn't apply to many contemporary approaches for deep generative modeling. For example, we now have implicit generative models, e.g. Generative Adversarial Networks (GANs), which are sampling-based and don't even explicitly model the probability density p(x) (instead learning a divergence measure via the discriminator network). But we call them "generative models" since they are used to generate (high-dimensional [10]) samples.
A broader and more fundamental definition [2] seems equally fitting for this general question:
Discriminative models learn the boundary between classes.
So they can discriminate between different kinds of data instances.
Generative models learn the distribution of data.
So they can generate new data instances.
A closer look
Even so, this question implies somewhat of a false dichotomy [3]. The generative-discriminative "dichotomy" is in fact a spectrum which you can even smoothly interpolate between [4].
As a consequence, this distinction gets arbitrary and confusing, especially when many popular models do not neatly fall into one or the other [5,6], or are in fact hybrid models (combinations of classically "discriminative" and "generative" models).
Nevertheless it's still a highly useful and common distinction to make. We can list some clear-cut examples of generative and discriminative models, both canonical and recent:
Generative: Naive Bayes, latent Dirichlet allocation (LDA), Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), normalizing flows.
Discriminative: Support vector machine (SVM), logistic regression, most deep neural networks.
There is also a lot of interesting work deeply examining the generative-discriminative divide [7] and spectrum [4,8], and even transforming discriminative models into generative models [9].
In the end, definitions are constantly evolving, especially in this rapidly growing field :) It's best to take them with a pinch of salt, and maybe even redefine them for yourself and others.
Sources
[1] Possibly originating from "Machine Learning - Discriminative and Generative" (Tony Jebara, 2004).
[2] Crash Course in Machine Learning by Google
[3] The Generative-Discriminative Fallacy
[4] "Principled Hybrids of Generative and Discriminative Models" (Lasserre et al., 2006)
[5] #shimao's question
[6] Binu Jasim's answer
[7] Comparing logistic regression and naive Bayes:
    cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
    "On Discriminative vs. Generative classifiers"
    Comment on "On Discriminative vs. Generative classifiers"
[8] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/DengJaitly2015-ch1-2.pdf
[9] "Your classifier is secretly an energy-based model" (Grathwohl et al., 2019)
[10] Stanford CS236 notes: Technically, a probabilistic discriminative model is also a generative model of the labels conditioned on the data. However, the term generative models is typically reserved for high dimensional data.

An additional informative point that goes well with the answer by StompChicken above.
The fundamental difference between discriminative models and generative models is:
Discriminative models learn the (hard or soft) boundary between classes
Generative models model the distribution of individual classes
Edit:
A Generative model is the one that can generate data. It models both the features and the class (i.e. the complete data).
If we model P(x,y): I can use this probability distribution to generate data points - and hence all algorithms modeling P(x,y) are generative.
Eg. of generative models
Naive Bayes models P(c) and P(d|c) - where c is the class and d is the feature vector.
Also, P(c,d) = P(c) * P(d|c)
Hence, Naive Bayes in some form models, P(c,d)
Bayes Net
Markov Nets
A discriminative model is the one that can only be used to discriminate/classify the data points.
You only require to model P(y|x) in such cases, (i.e. probability of class given the feature vector).
Eg. of discriminative models:
logistic regression
Neural Networks
Conditional random fields
In general, generative models need to model much more than the discriminative models and hence are sometimes not as effective. As a matter of fact, most (not sure if all) unsupervised learning algorithms like clustering etc can be called generative, since they model P(d) (and there are no classes:P)
PS: Part of the answer is taken from source

A generative model learns how the training data as a whole is distributed and uses that to predict the response.
A discriminative model's job is just to classify, i.e. to differentiate between the two outcomes.

All previous answers are great, and I'd like to plug in one more point.
From generative models we can derive any distribution, while from discriminative models we can only obtain the conditional distribution P(Y|X) (in other words, they are only useful for discriminating Y's label), and that's why they are called discriminative models. A discriminative model doesn't assume that the X's are independent given Y ($X_i \perp X_{-i} \mid Y$) and hence is usually more powerful for calculating that conditional distribution.

My two cents:
Discriminative approaches highlight differences
Generative approaches do not focus on differences; they try to build a model that is representative of the class.
There is an overlap between the two.
Ideally both approaches should be used: one will be useful to find similarities and the other will be useful to find dis-similarities.

This article helped me a lot in understanding the concept.
In summary,
Both are probabilistic models, meaning they both use probability (conditional probability, to be precise) to calculate classes for the unknown data.
Generative classifiers apply the joint PDF and Bayes' theorem to the data set and calculate the conditional probability from those values.
Discriminative classifiers find the conditional probability directly from the data set.
Some good reading material: conditional probability, joint PDF
